A novel machine learning approach for scene text extraction


To appear in: Future Generation Computer Systems
DOI: https://doi.org/10.1016/j.future.2018.04.074
Received 24 September 2017; revised 19 April 2018; accepted 23 April 2018.

Ghulam Jillani Ansari (a), Jamal Hussain Shah (a), Mussarat Yasmin (a), Muhammad Sharif (a), Steven Lawrence Fernandes (b)

(a) Department of Computer Science, COMSATS Institute of Information Technology, Wah Cantt, Pakistan
(b) Department of Electronics and Communication Engineering, Sahyadri College of Engineering & Management, Mangaluru, India

*Corresponding Author: Mussarat Yasmin

Abstract
Image based text extraction is a popular and challenging research field in computer vision. In this paper, a demanding problem, natural scene text identification and extraction, is investigated; it is made difficult by cluttered backgrounds, unstructured scenes, orientations, ambiguities and more. For text identification, contrast enhancement is performed by converting an input image to the LUV colour space to obtain stable regions, and the L-channel is then selected for region segmentation using the standard MSER segmentation technique. In order to differentiate between text and non-text regions, various geometrical properties are also considered in this work. Further, classification of connected components is performed to obtain a segmented image by the fusion of two feature descriptors, LBP and T-HOG. First, both feature descriptors are separately classified using linear SVMs; second, their results are combined by a weighted sum fusion technique to classify regions into text and non-text portions. In text recognition, text regions are recognized and labeled with a novel CNN. The CNN output is stored in a text file to form a text word. Finally, the text file is searched through a lexicon for the proper, optimized scene text word, incorporating a hamming distance (error correction) technique if necessary.

Key words: Text Detection, Text Recognition, CNN, LBP, SVM

1. INTRODUCTION
Identification and extraction of scene text from natural images and videos has become an essential task in recent computer vision research due to the frequent use and advent of smart gadgets. It also has a huge demand in content based image retrieval and understanding. Technically, extraction of text goes through two major steps: (a) text detection, in which text is identified and localized in natural scenes and/or videos; in short, it is a process to determine text/non-text regions; (b) text recognition, which means understanding the semantic meaning of the text. In general, text is of two types [1]: (1) scene text, which is normally captured by a camera and reveals common surroundings; this makes the scene unstructured and ambiguous due to uncertain situations, e.g., advertisement hoardings, sign boards, shops, text on buses, face panels and many more; (2) caption or graphic text, which is added manually to images and/or videos in order to support visual and audio content, so here text extraction is simpler than for natural scene text.

The challenges regarding the diversity and complexity of natural images for text extraction are addressed from three different angles: (1) variation in natural scene text due to unrestrained and uncontrolled surroundings, reflecting widely diverse font sizes, styles, colors, scales and orientations; (2) background uncertainty, with challenges like roads, signs, grass, buildings, bricks and paving; all these factors make it difficult to separate the actual text from the natural image and can thus become a source of confusion and errors; (3) intrusion factors like noise, low quality, distortion and inconsistent lighting, which also generate problems in natural scene text identification and extraction.

The most important text identification classes are texture, connected component and hybrid approaches. In the texture oriented approach proposed by the authors in [2], text is closely related to some class of texture from which certain characteristics such as filter responses, wavelet coefficients and local intensities can be processed. However, these methods thoroughly scan all locations and scales and hence prove computationally costly. The connected component approach published in [3] carries out edge identification first and then applies a bottom-up method to join minor regions into bigger regions until all regions are detected. In addition, geometrical characteristics such as eccentricity, solidity, aspect ratio, Euler number, extent and some heuristics are used to combine text regions to extract and confine the text. Hybrid approaches are a combination of texture oriented and connected component oriented approaches.

The aim and uniqueness of the proposed work is that it not only recognizes the text intelligently but also removes errors in order to preserve the actual meaning of the text. As per the reviewed literature, such work is a maiden effort in this scenario. Recently, MSER [4, 5], SWT [6] and binarization techniques [7] have been popular for extracting text from natural scenes. In the proposed methodology, MSER is used to extract text after enhancing the contrast of a scene image by applying the LUV channel and selecting the L-channel in gray scale for finding stable regions as a preprocessing step. Secondly, a novel approach is used to perform classification of MSER components into text and non-text by applying a feature descriptor which is a combination of LBP and T-HOG, with SVMs as classifiers. Thirdly, a CNN architecture is proposed for character recognition and character labeling. The output of the CNN is stored in a text file to make a string of words. The task is completed if the output of the text file matches the scene text; otherwise error correction is handled using the hamming distance technique. The complete framework is described thoroughly in the subsequent sections of this work. This work has been evaluated on standard datasets in the respective domain, and it is found that it performs well on all of them, primarily with increasing accuracy along with additional calculation of precision, recall and f-score. Moreover, the application is programmed in C and Python on the Ubuntu platform; the combination of these languages as a single entity gave remarkable results. The first portion, text detection, presented in Fig. 1 is programmed in C, while the second portion, text recognition, is carried out using Python.

2. RELATED WORK
Researchers have proposed several methods for identifying and detecting text in scenes and videos in the last two decades. As mentioned earlier, there are mainly three classes: texture oriented, connected component oriented and hybrid. Texture based methods [6, 7, 8, 9], as described earlier, are computationally costly because they rely on textural properties (local intensities, filter responses, wavelet coefficients) to identify text and non-text positions in a natural scene. In addition, these approaches are mostly used for horizontal text and are unable to work on non-horizontal text because they are not scale and rotation invariant. Kim et al. [10] proposed a texture based technique in which each pixel is classified by using raw pixel intensity as a local feature with an SVM classifier. Text regions were discovered with the help of MeanShift [11] in probability maps. The proposed technique generates remarkable identification

and extraction outcomes in natural images and videos with straightforward backgrounds, but it seems hard to apply this technique on unconstrained natural scenes and videos. Zhong et al. [12] presented an attractive methodology that spots text directly in the DCT image domain. It is highly efficient and does not require decoding the image before detection, but has limited detection accuracy. Chen et al. [13] implemented a robust text detector to speed up the text detection process. It uses a cascade Adaboost [14] classifier, where a classifier is trained using a feature set including vertical difference, intensity variance, horizontal difference and mean strength, but the text detection precision is low on natural scene images.

Connected component based methods [3, 6, 9, 15] are used to put together smaller components into larger components to find text regions by adopting various edge detection techniques and using geometric properties. In addition, most of these methods are scale, rotation and font-change invariant. In recent years, these methods have become more popular and act as the main stream in research on text identification and detection in natural images. Using the property of characters having constant stroke width was proposed by Epshtein et al. [6] through a new image operator, the Stroke Width Transform (SWT). The local operator uses edge maps to pick up character strokes and is then capable of taking out text regions of different directions and scales from unconstrained natural scenes. As this technique uses a diverse series of hand crafted parameters and rules, it works well only for straight texts. Unlike these algorithms, the approach proposed by Yi et al. [9] is used to identify skewed texts in scene images. In the first stage, the image is broken down into various regions by keeping in view the distribution of intensities in color space. Secondly, regions are grouped together into connected components using spatial distance, relative size and color similarity of the regions. Lastly, non-text areas are cropped by a set of rules. In addition, this technique is based on manually designed filtering policies and factors, so it is tricky to generalize it to huge datasets. Shivakumara et al. [16] introduced a multi-angular text detection method. This technique employs Fourier-Laplace analysis to recognize candidate regions. These regions communicate only text blocks and not characters or strokes; the method is unable to detect letters or words directly, so it is not comparable with other benchmark methods. Yao et al. [2] extended the SWT and proposed an algorithm which extracts text of random orientations from scene images. The algorithm applies a two-level classification method and two sets of rotation-invariant features particularly modeled for capturing the intrinsic properties of characters in natural scene images.

Huang et al. [15] developed a novel detector on the basis of SWT known as the Stroke Feature Transform (SFT). The problem of mismatched edge points in SWT is resolved in SFT by introducing constraints on the relations of local edge points and color consistency, hence producing improved component extraction results. The identification performance of SFT is very high for horizontal text on standard datasets. Gupta et al. [17] introduced a new methodology for text detection in natural scene images. This method has two parts: (1) a robust and flexible engine to produce cluttered synthetic images containing text; the engine places synthetic text on top of background images in a natural way by taking the 3D scene geometry into account; (2) the synthetic images are used to learn a Fully-Convolutional Regression Network (FCRN) which carries out text identification and localization at multiple scales in an image. He et al. [18] presented a new Text-Attentional Convolutional Neural Network (Text-CNN) that mainly focuses on text features and detects textual regions in the image. They also developed a learning mechanism based on multi-level and rich supervised information, including character labels, text candidate masks and binary text/non-text information, to train the Text-CNN. In addition, a powerful low-level image operator called Contrast Enhancement Maximally Stable Extremal Regions (CE-MSER) is introduced as an extension to the popularly used MSER by improving the intensity contrast between backgrounds and text patterns. Zhang et al. [19] proposed a novel approach which handles multiple orientations, fonts and languages by using both local and global features for bounding text lines in a coarse to fine process. Firstly, a fully convolutional network (FCN) is trained to predict the salient map of text areas in a holistic way. Secondly, text line hypotheses are estimated by combining the salient map and character components. In the end, an FCN classifier is employed to predict the centroids of every character, which eliminates false hypotheses. Liu et al. [20] proposed a new robust and stable CNN based model named the Deep Matching Prior Network (DMPNet) to detect text with tighter quadrangles. They used sliding windows in intermediate convolution layers to recall text with higher overlapping area, proposed accurate and fast polygonal areas with the help of a Monte-Carlo method, and then designed a sequential protocol for relative regression to exactly predict compact quadrangles. Hybrid approaches [8, 21] are a mixture of texture oriented and connected component oriented approaches and are in this way more advantageous for taking out text from scene images. Liu et al. [21] presented a technique to find the edge intensities of all probable text areas by

employing an elaborate detection policy; gradient and geometrical characteristics of components and shapes are then verified to obtain possible text areas. This is afterwards followed by a texture examination process to recognize valid text and non-text regions. Pan et al. [8] proposed a hybrid method to extract candidate components using multi-scale probability maps. These maps are used for classification, trained using Histogram of Oriented Gradients features [22] calculated with a set of already defined patterns. A conditional random field (CRF) model [23] combining binary contextual relations and unary component characteristics is employed to differentiate text components from non-text components. Like the others, both techniques can only detect horizontal texts.

All the above mentioned techniques are considered benchmarks for their own ideas and show promising results in identifying and detecting scene text, but most of them used the ICDAR dataset and its variants for generating results. Moreover, it is also found that less work has been performed on text recognition than on text detection; taking advantage of this aspect, both text extraction and recognition are implemented in a novel way in this article, along with error correction. Only 5% to 7% of the work in the literature handles error correction after text recognition. Results are generated using three benchmark datasets, i.e., ICDAR 2003 [24], SVT [25] and IIIT5k [26]. For training and testing purposes, Char74K [27] is used along with the combination of the above mentioned datasets in order to train the classifier comprehensively. In the end, result accuracies are compared on the basis of intermediate based features, CNN based methods and networks, complete word recognition and character based recognition with other benchmark algorithms and techniques. In addition, precision, recall and f-score are also computed, which the other selected techniques do not report on the said benchmark datasets. Comparison charts are also listed in the end.

3. OBJECTIVES AND CONTRIBUTION
A novel scene text extraction system is presented in this article which is able to identify and recognize scene text intelligently and efficiently. Many issues can arise in scene text detection, including font size, font color, font style, orientation, blur, occlusion, opacity and noise. Because of all these issues, it is sometimes difficult to train a system to give intelligent decisions. In this work, CNN based character recognition and labeling is proposed to recognize and label scene text.

Following are the key points of the proposed work:
1. Contrast enhancement of the scene text image is performed using the LUV channel representation, from which the L-channel is selected for MSER stable regions.
2. In most techniques in the literature, MSER regions are detected directly on gray scale images, which sometimes causes problems in detecting stable regions because of the gradient in the image. To avoid this, MSER regions are detected in the proposed work on the L-channel after converting it to gray scale, which improves region accuracy and stability.
3. Next, connected components are classified using a feature fusion technique. The output of (2) is used to extract LBP and T-HOG features separately, which are then fed into SVMs. The output of each SVM is combined using a weighted sum technique to linearly classify regions into text and non-text parts. The extracted text region is considered as intermediate based features, usually with a subtracted background, also called the segmented image. Moreover, character bounding is performed at this stage.
4. A novel CNN model is introduced for intelligent character recognition and labeling. The extracted text regions from (3) become the input of the CNN model after splitting the image into 26×26 image patches.
5. An error correction technique is proposed using hamming distance. The recognized labels are stored in a text file for proper word recognition. Then the distance is calculated using a lexicon search; if the hamming distance is 0, the string and the scene text are the same; if it is non-zero, optimized word combinations are listed.

4. Natural Scene Text Recognition Model
A complete process diagram of the suggested framework is presented in Fig. 1, while Fig. 2 presents the pseudo code of the methodology. The proposed work has two major portions: (1) text detection, comprising contrast enhancement to detect stable MSER regions, after which connected component classification is performed by applying T-HOG and LBP feature descriptors for character grouping; (2) text recognition, executed after word splitting, in which the major activity is CNN based character recognition and labeling as a pre-step. In the post-step, labels are stored in a text file, after which an error correction process is activated, if required, using the hamming distance technique. At last, the output of the text file is the required correct word.

Fig. 1. Framework representation of the proposed work

4.1. Contrast Enhancement and Detection
MSER [4, 5] is a powerful algorithm reported in recent times to detect challenging text regions and consider them as "stable extremal regions". However, its low-level nature sometimes limits its performance for the following two reasons. Firstly, MSER regions are easily distorted by complex backgrounds; this means separating a character into multiple components, which can create problems for detecting the actual text in further steps. Secondly, low-contrast, vague, ambiguous and low-quality images mislead MSER in detecting "stable extremal regions". It is then difficult to recover such components for further processing, hence leading to a major reduction in recall.

Pseudo Code 1: Text Detection
Input: Scene text image
Output: Segmented image with character bounding
BEGIN
Step 1: Preprocess the selected scene text image
  - Apply the LUV channel
  - Extract the L-channel for further processing in Step 2
Step 2: Find "stable regions" using MSER
  - Sweep the intensity threshold to perform simple luminance thresholding of the image
  - Extract "extremal regions" (connected components)
  - Find the threshold value where an extremal region becomes "maximally stable", i.e., a local minimum of the relative growth of its area; in this case it is 0.4
  - Approximate regions with different shades
  - Keep those region descriptors as features
Step 3: Perform connected component classification
  - Train SVM(s) with positive and negative samples
  - Extract LBP and T-HOG features from the Step 2 output respectively
  - Apply the weighted sum fusion technique to classify regions into text and non-text
Step 4: Perform character bounding
END
(a)

Pseudo Code 2: Text Recognition and Correction
Input: Character image patch of size 26×26
Output: Recognized word with correction (if required)
BEGIN
Step 1: Use the CNN to recognize and label each character for Step 2
Step 2: Store each label into a text file
Step 3: Compare the text file with the lexicon for error correction
  if (hamming distance == 0)
    No error
  otherwise
    Display the list of probable words that can match the scene text
END
(b)

Fig. 2. Pseudo code of the proposed work. (a) Pseudo Code 1 for text detection. (b) Pseudo Code 2 for text recognition.

To improve the detection power of MSER so that as many text components as possible are detected, contrast enhancement is performed in the proposed method using color channels (RGB and HSI, RGB and CMY, RGB and LUV, RGB and LAB, RGB and YUV) as a preprocessing step on natural images, which in turn not only enhances the region level contrast of natural images

but also improves local stability, which is effective in overcoming low level distortion of text regions. After performing a series of experiments on natural images using various color channels, it is found that the LUV color space has the ability to improve the character detection rate of extremal regions. Because of the L and U channels in the LUV representation, the text does not have a gradient and therefore can be segmented properly, whereas the contrast image in gray scale has a luminance gradient which affects the performance of MSER in segmenting the text [28]. In addition, computational complexity is also maintained in the presented methodology. Neumann et al. [29] identified extremal regions using RGB and HSI channels along with an intensity gradient image, which improves the character detection rate up to 94.8%. In this context, the intensity gradient image means the highest difference between a pixel and its neighbors on the I-channel of the HSI color space. However, that methodology is computationally expensive. In the proposed method, the preprocessing of an image after applying the LUV channel is shown below in Fig. 3 and is computationally efficient as compared to [29]. The L-channel is further used for the MSER region detector after converting to gray scale.
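As an illustration only, a minimal Python sketch of one possible way to realize this preprocessing step with OpenCV is given below; the file name, the use of cv2.MSER and the exact parameter handling are assumptions and not the original C implementation:

# Illustrative sketch, not the authors' C implementation: LUV conversion,
# L-channel extraction and MSER detection on the L-channel.
import cv2

def detect_mser_regions(image_path="scene.jpg"):
    bgr = cv2.imread(image_path)                    # OpenCV reads images in BGR order
    luv = cv2.cvtColor(bgr, cv2.COLOR_BGR2Luv)      # convert to the LUV colour space
    l_channel = luv[:, :, 0]                        # keep only the L (lightness) channel

    mser = cv2.MSER_create()
    mser.setDelta(1)                                # Delta = 1, as chosen for the proposed system
    regions, boxes = mser.detectRegions(l_channel)  # extremal regions and their bounding boxes
    return regions, boxes

if __name__ == "__main__":
    regions, boxes = detect_mser_regions()
    print(len(regions), "candidate regions detected")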


Fig. 3. Results of preprocessing using the LUV channel for contrast enhancement in natural images, shown in stack form. Each stack (a), (b), (c) and (d) consists, from top to bottom, of the original image, the L-channel representation, the U-channel representation and the V-channel representation.

MSER detects invariant stable regions not affected by affine transformations and monotonic illumination adjustments. The MSER algorithm applies a threshold many times on the gray scale image with increasing threshold t. In this way each thresholded image consists of a number of connected components (CCs) called extremal regions (ERs). At the same time, a parent-child relationship is formed by using ERs in images of various thresholds, where child regions are

nested in parent regions. Therefore, a component tree is constructed. From [28], let $\Gamma_1, \Gamma_2, \Gamma_3, \ldots, \Gamma_n$ be the series of nested ERs. For each $\Gamma(i)$, the stability $\psi(i)$ is defined in Eq. (1):

$$\psi(i) = \frac{\left|\Gamma(i+\Delta) \setminus \Gamma(i-\Delta)\right|}{\left|\Gamma(i)\right|} \qquad (1)$$

In the above equation, the area of a region is denoted by $|\cdot|$, and $X \setminus Y$ is the set of pixels in $X$ which are not found in $Y$ and vice versa. $\Delta$ is a parameter of the method. A large $\Delta$ results in fewer regions being recovered, because a region $\Gamma_j$ has to be stable over a larger gray scale range; therefore $\Delta$ is set to 1 for the proposed system. $\psi(j)$ measures the change of area between the regions $\Gamma(j+\Delta)$ and $\Gamma(j-\Delta)$, normalized by the region $\Gamma(j)$. ERs at which $\psi(j)$ has a local minimum are defined as MSERs.

The above algorithm detects MSERs on the basis of local minima in the gray scale image. For extracting MSERs on the basis of local maxima, the gray scale image needs to be inverted and the algorithm applied again in the same manner as for local minima. In the proposed scenario, the contrast is enhanced using the LUV channel in order to extract MSERs from the gray scale L-image. A component tree is used for efficient pruning of overlapped MSER regions. The leaf nodes gradually move upwards in the hierarchy, replacing their parents only if the maximum of their confidence is larger than the confidence of their parents. At each pass, one component tree is detected by the MSER algorithm. Therefore components $Z_i$ and $Z_j$ are replaced by the one with the larger confidence if the condition given in Eq. (2) holds:

$$\frac{|Z_i \cap Z_j|}{|Z_i \cup Z_j|} > 0.4 \qquad (2)$$

where $\cap$ is the intersection of components, $\cup$ is the union of components and 0.4 is the threshold, achieved dynamically. MSER components on the gray scale L-image are demonstrated in Fig. 4.
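A small sketch of the pruning rule in Eq. (2) is given below, assuming each component is available as a set of pixel coordinates and that a confidence score per component has already been computed; the names and data layout are illustrative only:

# Sketch of the overlap test from Eq. (2): two components are considered the
# same region when their intersection-over-union exceeds the 0.4 threshold.
def overlaps(pixels_i, pixels_j, threshold=0.4):
    inter = len(pixels_i & pixels_j)     # |Zi intersect Zj|
    union = len(pixels_i | pixels_j)     # |Zi union Zj|
    return union > 0 and inter / union > threshold

def keep_one(comp_i, comp_j, conf_i, conf_j):
    """Keep the component with the larger confidence when Eq. (2) holds."""
    if overlaps(comp_i, comp_j):
        return comp_i if conf_i >= conf_j else comp_j
    return None   # not the same region, keep both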

Fig. 4. Column (a) shows detected MSER regions using the gray scale L-channel. Column (b) represents text regions obtained by discarding non-text regions on the basis of geometric characteristics.

4.2. Connected Component Classification
The proposed classification technique is based on the parallel execution of two linear SVMs. Two feature sets are separately fed into two linear SVMs and their outputs are combined using the weighted sum method to determine text and non-text portions. To do this, Local Binary Patterns (LBP) [28] and the Histogram of Oriented Gradients (HOG) [22] are used as feature descriptors. LBP and HOG are separately applied on the segmented MSER regions from Fig. 4. For the sake of ease, the values of each feature vector are normalized between [-1, 1].

LBP, proposed by [30], is used as the first feature descriptor and is defined at a pixel $p_0 = (X_0, Y_0)$ by Eq. (3):

$$LBP(X_0, Y_0) = \sum_{n=0}^{7} 2^n \, e(i_n - i_c) \qquad (3)$$

where $i_n$ is the intensity level of the $n$-th pixel in the 8-connected neighborhood of pixel $p_0$, $i_c$ is the intensity of pixel $p_0$, and $e(x)$ is:

$$e(x) = \begin{cases} 1, & x > 0 \\ 0, & \text{otherwise} \end{cases}$$

Considering random arrangements of the neighboring pixels, the LBP descriptor encodes each pixel to an 8-bit number [28]. The feature set consists of histogram values which are related to

each image region. A neighboring pixel is assigned a value of 0 if its intensity does not exceed that of the center pixel, and 1 otherwise [31]. This process is shown in Fig. 5.
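For illustration, a minimal Python sketch of the LBP code from Eq. (3) is given below; the neighbour ordering and the sample 3×3 intensities (taken from the example of Fig. 5) are only for demonstration:

# Sketch of the 8-neighbour LBP code of Eq. (3) for an interior pixel
# (no border handling, single-channel image).
import numpy as np

# 8-connected neighbourhood; bit n of the code receives weight 2**n
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, x0, y0):
    centre = int(img[x0, y0])
    code = 0
    for n, (dx, dy) in enumerate(NEIGHBOURS):
        if int(img[x0 + dx, y0 + dy]) > centre:   # e(x) = 1 when the neighbour is brighter
            code |= 1 << n
    return code

patch = np.array([[77, 70, 20],
                  [75, 76, 24],
                  [80, 77, 21]], dtype=np.uint8)
print(lbp_code(patch, 1, 1))   # LBP code of the centre pixel (intensity 76)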


Fig. 5. General representation of the LBP feature descriptor.

In the proposed classification method, the linear SVM is applied to the LBP of connected components in binary image form to make the classification task faster, rather than applying it to the gray scale image. The LBP feature set is fed to linear SVM1, whose output is thresholded to create a feature vector. The resultant feature vector, say 'A', holds the confidence scores on the predicted classes. HOG generally divides an image into a 2-dimensional array of $P_x \times P_y$ cells and calculates a histogram with a fixed number $P_b$ of bins within each cell, as Dalal and Triggs did for human body recognition [22]. The resultant descriptor is complicated and typically comprised of more than 100 features. By definition, Dalal and Triggs take an image $K$ and compute the gradient $\nabla K$ using the difference scheme given in Eq. (4):

$$\nabla K(m, n) = \frac{1}{2}\big(K(m+1, n) - K(m-1, n),\; K(m, n+1) - K(m, n-1)\big) \qquad (4)$$

The above formula is applied on each color channel to calculate the gradient, and the vector with the highest magnitude is collected. The later variant of [22], i.e., the T-HOG descriptor, whose experimental verification is published in [32], is used for texture based text classification. The T-HOG descriptor is designed to capture the gradient distribution properties of character strokes in occidental-like scripts. The input image $K$ is divided into small image patches of a delimited candidate rectangle after conversion to gray scale and a fixed height $H$. These are further normalized with a Gaussian weight window in order to compensate for local variations of contrast and brightness. This normalized image is then divided into $P_y$ horizontal stripes, the gradient direction is quantized into a small number $P_b$ of equal angular ranges, and the matching bins of the histogram are incremented. Opposite directions are identified together, making each bin $\pi / P_b$ radians wide. In this way, the T-HOG descriptor becomes a concatenation of those $P_y$ histograms.

By definition, T-HOG estimates the magnitude of the gradient from the following definition in Eq. (5):

$$\rho(m, n) = \max\{0,\ |\nabla K(m, n)| - \sigma\sqrt{2}\} \qquad (5)$$

This definition becomes 0 if the coarse gradient magnitude is smaller than the supposed sampling noise $\sigma$. The T-HOG feature set is fed into linear SVM2 and its output is also thresholded to create another feature vector. The obtained feature vector, say 'B', also has entries of confidence scores on the predicted classes.

There are two important feature fusion approaches: fusion on the basis of concatenation and fusion on the basis of weights, also known as concatenated and weighted sum fusion approaches respectively. In the first approach, various feature descriptors are integrated and the concatenation is fed into a classifier. On the other hand, the weighted sum fusion approach feeds various features into individual classifiers and then merges the classification scores using a weighted sum. It also helps to improve classification performance by reducing the dimensionality of the feature vector to some extent. This work uses T-HOG and LBP features for detecting text in natural scene images because both can be implemented using histograms. The classification results of the LBP feature vector 'A' and the T-HOG feature vector 'B' produced from SVM1 and SVM2 respectively are then fused using the weighted sum technique to determine text/non-text areas. The framework for connected component classification is presented in Fig. 6. Let the output scores of the LBP and T-HOG features using the individual SVM classifiers (SVM1, SVM2) be $fv_{LBP}$ and $fv_{T\text{-}HOG}$ respectively. According to the weighted sum technique, the final output is defined as in Eq. (6):

$$fv = \alpha \, fv_{LBP} + (1 - \alpha)\, fv_{T\text{-}HOG}, \qquad 0 \le \alpha \le 1 \qquad (6)$$

The values of $\alpha$ are described by $\alpha \in \{\alpha \mid \alpha = 0.1J,\ J = 1, 2, \ldots, 9\}$; over all values of $\alpha$, the fusion process is found to be best for $\alpha = 0.5$. In addition, some heuristics are used iteratively during this stage to generate output with a white background. This novel approach imparts a significant impact on character recognition and labeling, and generates remarkable results on different parameters.

Fig. 6. Connected component classification framework

The proposed classifiers SVM1 and SVM2 are trained using text character samples (positive samples shown in Fig. 8) from the Char74k dataset [27] and 20000 non-text samples (negative samples shown in Fig. 7) from the ICDAR 2003 dataset [24], the SVT dataset [25] and the IIIT 5k dataset [26]. The non-text samples are generated from natural scene images collected from SVT and ICDAR 2003, and each of them is resized into an image patch of size 26×26, as shown in Fig. 7 and Fig. 8 in 8×8 grids respectively.
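A short sketch of the score-level fusion of Eq. (6) is shown below, assuming the per-region confidence scores of the two trained SVMs (for instance scikit-learn decision_function outputs) are already available; the variable names are illustrative:

# Sketch of the weighted-sum fusion of Eq. (6); alpha = 0.5 is the value
# reported as best in the text, and 0 is used as the text/non-text threshold.
import numpy as np

def fuse_scores(fv_lbp, fv_thog, alpha=0.5):
    return alpha * np.asarray(fv_lbp) + (1.0 - alpha) * np.asarray(fv_thog)

def classify_regions(fv_lbp, fv_thog, alpha=0.5, threshold=0.0):
    fused = fuse_scores(fv_lbp, fv_thog, alpha)
    return fused > threshold    # True -> text region, False -> non-text region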

Fig. 7. Negative samples for training SVM1, SVM2 and CNN

Fig. 8. Positive samples for training SVM1, SVM2 and CNN

4.3. Character bounding
In this section, a bounding box is placed around candidate characters. Following [33], the criterion is defined on the basis of geometric properties. The idea is to consider the heights $j_1, j_2$ and widths $k_1, k_2$ of two bounded boxes along with their coordinates $(l_1, m_1)$ and $(l_2, m_2)$, as shown in Fig. 9.

Fig. 9. Geometric parameters used for character grouping

By definition from [33], Eq. (7) and Eq. (8) are given as:

$$h = \min(j_1, j_2), \qquad \Delta x = |l_1 - l_2| - (k_1 + k_2)/2 \qquad (7)$$

and

$$\Delta y = |m_1 - m_2| \qquad (8)$$

Notice that $\Delta x$ is negative if and only if the two boxes overlap in the x direction; in that case it can be confirmed that they are compatible and supposed to belong to the same text. Furthermore, character bounding is accepted only if the following conditions are satisfied:

$$|j_1 - j_2| \le s_1 h, \qquad \Delta x \le s_2 h, \qquad \Delta y \le s_3 h$$

where $s_1$, $s_2$ and $s_3$ are the parameters of the character bounding module. The parameter $s_3$ is specifically important to determine whether the group will be a character or a non-character. The same procedure is applied to all detected characters. The output of character bounding is shown in Fig. 9. After successful character bounding, the word is split into characters to create character image patches of 26×26 each. These image patches become the input for the presented CNN model, from where they are recognized and labeled. The advantage achieved from Section 4.2 and Section 4.3 respectively is that all RGB character patches come with a subtracted background, which absolutely enhances the CNN recognition power.
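The grouping test can be summarized by the short sketch below; it assumes each box is described by its centre coordinates, width and height, and the values of s1, s2 and s3 are placeholders since the text does not report them:

# Sketch of the pairwise character-compatibility test built from Eqs. (7)-(8).
def compatible(box1, box2, s1=0.5, s2=1.0, s3=0.5):
    l1, m1, k1, j1 = box1          # (x-centre, y-centre, width, height)
    l2, m2, k2, j2 = box2
    h = min(j1, j2)                           # Eq. (7): reference height
    dx = abs(l1 - l2) - (k1 + k2) / 2.0       # Eq. (7): negative iff boxes overlap in x
    dy = abs(m1 - m2)                         # Eq. (8)
    return abs(j1 - j2) <= s1 * h and dx <= s2 * h and dy <= s3 * h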

4.4. CNN Architecture and Character Recognition

For simplicity, the main purpose of this section is to recognize each character and label it with character information, e.g., {'a', 'b', 'c', ...}. Keeping in view the ambiguous nature of certain images, a highly supervised multilayer CNN model is proposed for character information, character region segmentation, character labeling and binary text/non-text information. This additional information helps to find more specific features of text, from low-level region segmentation to high-level binary classification. In this way, the presented model can sequentially understand the what, where and whether of a character, which becomes a great advantage in making reliable decisions. Although training a multilayer supervised CNN is non-trivial due to various levels of information having different learning difficulties and convergence rates, it is well suited for feature sharing. Hence the CNN model is formulated as follows.

Given $N$ training examples denoted as $\{(x_k, y_k)\}_{k=1}^{N}$, the goal of the CNN is to minimize

$$\arg\min_{W} \sum_{k=1}^{N} \ell\big(y_k, f(x_k, W)\big) + \varphi(W) \qquad (9)$$

where $f(x_k, W)$ in Eq. (9) is a function parameterized by $W$, $\ell(\cdot)$ denotes the loss function, which is typically the soft-max loss for classification and the least squares loss for regression tasks, and $\varphi(W)$ acts as a regularization term. The training procedure tries to find a mapping function that connects the input image patch and the output labels, i.e., 0/1 binary classification, without any extra information. The CNN model is trained using the stochastic gradient learning algorithm, which is widely used in various CNN models. The significance of choosing this algorithm lies in the unbalanced output layers (2D and 62D) and the different loss functions. An important property of the CNN model is that it optimizes sequentially from low-level region regression to high-level binary classification. Such an approach is more suitable for identifying text and non-text components. For this, the CNN model is trained with the stochastic gradient learning algorithm by invoking positive samples collectively from the Char74k [27], ICDAR 2003 [24], SVT [25] and IIIT 5K [26] datasets and 20000 negative samples from the same datasets, shown in Fig. 8 and Fig. 7 respectively. This provides the model with meaningful low-level text information which is important to identify text and non-text regions at pixel level.

The proposed CNN model is shown in Fig. 10 as per the formulation and training process and is used for both detection and recognition. The input to this model is a series of character image patches, where each character is recognized and labeled sequentially. Keeping this logic in mind, the network has two convolution layers with $f_1$ and $f_2$ filters respectively; both $f_1 = 78$ and $f_2 = 216$ filters are used for detection. The network is trained with a supervised learning algorithm using stochastic gradient descent given a set of 26×26 RGB image patches, since each image has a white background obtained after the classification described in Section 4.2, which benefits recognition and labeling. In this way, the accuracy of the CNN improves remarkably, as shown in Table 4, compared to other techniques. A number $n$ of 11×11 patches are randomly extracted, already contrast normalized as described in Section 4.1, to form input vectors $x^{(k)} \in \mathbb{R}^{64}$ where $k \in \{1, \ldots, n\}$. Stochastic gradient descent (SGD) is then used to train a set of low-level filters $D \in \mathbb{R}^{64 \times f_1}$. For a single 11×11 patch $x$, the first layer response $Q$ is computed by implementing the inner product with the filter pool followed by a scalar activation function: $Q = \max\{0, |D^T x| - \theta\}$, where $\theta = 0.5$ is a hyperparameter. Given a 26×26 image, $Q$ for every 11×11 window is calculated to obtain a 16×16×$f_1$ response map. Next, average pooling is performed to reduce the response map to 4×4×$f_1$. The same functionality is applied on the second convolution layer of the model to obtain a reduced response map of 2×2×$f_2$. Further, the outputs of the layers are fully connected to the classification layer. The classification error is minimized by back propagating it using SGD, but the filter size remains unchanged.
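To make the layer arithmetic concrete, a minimal Keras sketch of a network with the stated dimensions is given below; the framework, the ReLU activation (standing in for max{0, |D^T x| − θ}) and the 3×3 kernel of the second layer are assumptions, not the authors' exact configuration:

# Illustrative sketch of the described dimensions: 26x26 RGB input,
# f1 = 78 and f2 = 216 filters, average pooling, 62-way character output.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_character_cnn(num_classes=62):
    model = models.Sequential([
        layers.Input(shape=(26, 26, 3)),
        layers.Conv2D(78, kernel_size=11, activation="relu"),  # 26x26 -> 16x16x78
        layers.AveragePooling2D(pool_size=4),                   # 16x16 -> 4x4x78
        layers.Conv2D(216, kernel_size=3, activation="relu"),   # assumed kernel: 4x4 -> 2x2x216
        layers.Flatten(),
        layers.Dense(num_classes, activation="softmax"),        # one output per character class
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model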

Fig. 10. CNN network architecture

5. Word Recognition and Error Correction
The output of the proposed CNN is one character label at a time, because the character image patches are given as input serially to the network. Each character label is then collected into a text file to recognize the word. In short, the actual recognized word is in the text file, which is extracted and recognized after applying the proposed technique from Section 4.1 to Section 4.4 on natural scene images. Sometimes the CNN is unable to recognize a character properly, which in

turn is labeled falsely. So there is a chance of inversion of characters for a specific word recognition. Therefore, the hamming distance (an error correction technique) is used to correct the meaning of the scene text. The hamming distance is defined as follows: given two vectors $vec_1$ and $vec_2 \in Z^n$, the hamming distance $d(vec_1, vec_2)$ between $vec_1$ and $vec_2$ is the number of places where $vec_1$ and $vec_2$ differ. Thus the hamming distance is the number of bits that must be changed to change one vector into the other. The same analogy is applied to the text word. The process is explained as follows and also shown in Fig. 11.

Once all character labels for a specific natural scene text image are received into a text file, the content is treated as a string for processing. The string is then searched using a lexicon, where the hamming distance is calculated. If the hamming distance is 0 (zero), it is concluded that the word in the text file and the natural scene text are the same; otherwise, if the hamming distance is non-zero, there is a difference of characters between the recognized word in the text file and the natural scene text. The value of the hamming distance reflects the total number of unrecognized characters. The faulty word string is searched using the lexicon, which lists optimized word combinations. Hence, the above described process is very useful for generating the correct scene text word if some error occurs during detection, recognition and labeling in the CNN.
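The correction step can be sketched as follows; the example lexicon contents and the number of returned candidates are illustrative only:

# Sketch of the lexicon search: an exact match means hamming distance 0,
# otherwise the closest candidate words are listed as optimized combinations.
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def correct_word(recognised, lexicon, max_candidates=5):
    same_length = [w for w in lexicon if len(w) == len(recognised)]
    if recognised in same_length:
        return recognised, []                    # distance 0: CNN output is already correct
    ranked = sorted(same_length, key=lambda w: hamming(recognised, w))
    return None, ranked[:max_candidates]         # faulty word: list probable replacements

word, suggestions = correct_word("SHDP", ["SHOP", "SHIP", "STOP", "SHOW"])
print(word, suggestions)   # prints the ranked candidate words for the faulty string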

Fig. 11. Word correction process

6. Results and Experiments

To obtain the conclusions and analysis, the experiments are performed on an Intel Core i5 6th generation 3.4 GHz CPU and an Nvidia GeForce GTX 1070 GPU with 8 GB memory and compute capability 6.1. The training of positive and negative samples is performed in mini-batches of size 128 each. The total number of epochs used is 30, although the network becomes stable after about 25 epochs in most training runs. The learning rate and momentum are selected as 0.001 and 0.9 respectively. In this work, the following benchmark datasets are used for the assessment, training, testing and implementation of this section. The IIIT5k dataset [26] is the leading and most challenging dataset reported in recent days, because its images contain large variations in color, size, font and layout and the occurrence of blur, noise, varying illumination and distortion. The dataset is a collection of 5000 cropped word images including scene text in born-digital images and natural scene images, split into 2000 and 3000 images for training and testing respectively. The SVT dataset [25] is a collection of 647 outdoor scene text word images with high variability gathered from Google Street View road-side scenes. The Char74K dataset [27] is used to evaluate the recognition of single characters in natural scene images. It is a collection of symbols for both the Kannada and English languages and is divided into GoodImg, BadImg and FontImg. The ICDAR 2003 dataset [24] is a benchmark for natural scene text recognition and detection. It is a collection of 509 completely annotated text images from which 251 and 258 images are used for testing and training respectively.

The following evaluation protocols are used to describe the effect of the proposed technique. The results are presented in terms of precision (P), recall (R, true positives), f-measure/f-score and accuracy, which are given in Eq. (10), Eq. (11), Eq. (12) and Eq. (13) respectively.

$$precision = \frac{TP}{TP + FP} \qquad (10)$$

Precision measures the ratio of true positives among all detections.

$$recall = \frac{TP}{TP + FN} \qquad (11)$$

Recall measures the ratio of true positives among all true text that should be detected.

$$f\text{-}measure = \frac{2 \cdot precision \cdot recall}{precision + recall} \qquad (12)$$

F-score is an overall indicator to judge the performance of an algorithm and is calculated as the harmonic mean of precision and recall. Finally, accuracy is the most important and intuitive metric for classification and recognition performance.

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (13)$$
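For completeness, Eqs. (10)-(13) translate directly into the following small helper, assuming the raw TP/FP/FN/TN counts are available from the evaluation script:

# Sketch of the evaluation metrics of Eqs. (10)-(13).
def detection_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy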

6.1. Evaluation and comparison
Some standard protocols have been followed for evaluation, where each word is linked with a lexicon and the hamming distance is calculated to find the optimized set of words if required. The recognition results of the proposed methodology demonstrate excellent potential in recognizing scene word images from all three benchmark datasets.

6.2. Intermediate based features with background subtraction
Yao et al. [34] (Strokelets) and Lee et al. [35] both achieved leading performance with intermediate based features. Although they demonstrate large improvements over conventional low level features, their performance still does not match the proposed methodology, which shows a major improvement in accuracy (A) along with the additional computation of precision (P), recall (R) and f-score (F) on all three datasets during the experiments. Table 1 reflects the comparison with other algorithms and techniques.

Table 1. Text detection accuracy (%) on three benchmark datasets (intermediate based features)

Method              | ICDAR 2003             | SVT                    | IIIT5k
                    | P    R    F    A       | P    R    F    A       | P    R    F    A
Wang et al. [25]    | -    -    -    76.0    | -    -    -    57.0    | -    -    -    -
Mishra et al. [26]  | -    -    -    81.8    | -    -    -    73.2    | -    -    -    64.1
Shi et al. [38]     | -    -    -    87.4    | -    -    -    73.5    | -    -    -    -
Lee et al. [35]     | -    -    -    88.0    | -    -    -    80.0    | -    -    -    -
Yao et al. [34]     | -    -    -    88.5    | -    -    -    75.9    | -    -    -    80.2
Proposed            | 86.1 77.3 81.5 89.7    | 83.4 74.7 78.8 85.1    | 85.9 75.9 80.6 83.5

Comparison graphs of Table 1 are shown in Fig. 12.


Fig. 12. Comparison results of intermediate level features

6.3. CNN based methods and networks
Table 2 shows that CNN based methods largely outperform mid level feature approaches, with about 5% to 10% improvement in all aspects of detecting and recognizing text. This ability of

significant performance increase is directly related to learning deep high level features from the scene word images. The approach of Su and Lu [36], based on recurrent neural networks (RNN) with HOG features, obtained 83% accuracy on the SVT dataset. In comparison, CNN based features have the ability to achieve remarkable results on all three datasets based on separate character classification. By training a CNN model with a huge and diverse amount of data, the proposed technique attained a significant enhancement over the CNN based features on all datasets. Although the widely used SVT is a complex dataset, the presented model remarkably performs up to 89.5% accuracy. In addition, the proposed methodology is implemented on the IIIT5k dataset and achieves elegant results, i.e., 85.2% accuracy. Table 2 justifies the comparisons with other algorithms.

Table 2. Text detection accuracy (%) on three benchmark datasets (CNN based methods)

Method                     | ICDAR 2003             | SVT                    | IIIT5k
                           | P    R    F    A       | P    R    F    A       | P    R    F    A
Wang et al. [25]           | -    -    -    90.0    | -    -    -    70.0    | -    -    -    -
Alsharif and Pineau [39]   | -    -    -    93.1    | -    -    -    74.3    | -    -    -    -
Su and Lu [36]             | -    -    -    92.0    | -    -    -    83.0    | -    -    -    -
CNN based features         | -    -    -    96.2    | -    -    -    86.1    | -    -    -    -
Proposed                   | 87.5 79.3 83.2 97.1    | 85.6 77.9 81.4 89.5    | 86.7 76.9 81.5 85.2

Comparison graphs of Table 2 are shown in Fig. 13.


Fig. 13. Comparison results of CNN based methods and networks

6.4. Complete image word representation
The technique of Almazán et al. [37] is based on whole word image representation, with an accuracy of 87% on SVT. Using IIIT5k, it produced results of 75.6% and 88.6% accuracy on the small and large lexicons. The proposed approach strives further by attaining accuracies of 98.2%, 94.7% and 95.3% on ICDAR 2003, SVT and IIIT5k respectively. This is all because of the novelty presented in Section 4.2 and Section 4.3 respectively, which further becomes the input to the CNN model. It acts as a key factor to enhance the discriminative power of the model for character and then word recognition. Table 3 shows the comparison with other algorithms.

Table 3. Text detection accuracy (%) on three benchmark datasets (complete word recognition)

Method               | ICDAR 2003             | SVT                    | IIIT5k
                     | P    R    F    A       | P    R    F    A       | P    R    F    A
Almazán et al. [37]  | -    -    -    90.1    | -    -    -    87.0    | -    -    -    88.6
He et al. [40]       | -    -    -    97.0    | -    -    -    93.5    | -    -    -    94.0
Proposed             | 84.3 77.2 80.6 98.2    | 84.5 76.2 80.1 94.7    | 85.6 77.7 81.5 95.3

Comparison graphs of Table 3 are shown in Fig. 14.


Fig. 14. Comparison results of complete word recognition

Tables 1, 2 and 3 also reflect the evaluation metrics of the suggested work as compared to other text descriptors using all three benchmark datasets. These tables clearly show that the proposed methodology is unique in terms of accuracy, primarily with the additional computation of precision, recall and f-score on all three standard datasets.

Table 4 elegantly presents that the proposed method significantly improves over the existing text descriptors in terms of character recognition accuracy. Tesseract OCR is basically designed for scanned document text; therefore its accuracy is only about 37.3%, 34.9% and 32.2% on the three datasets respectively. de Campos et al. [27] applied nearest neighborhood (NN) classification on the Char74K dataset and reported promising results, i.e., 41.0% accuracy on the ICDAR 2003 dataset. VLFeat implemented a combination of HOG and SVM, which helped to achieve higher accuracy compared to the one reported in [41]. The methods in [42] and [43] attain accuracies of 81.7% and 83.9% respectively on the ICDAR 2003 dataset; these two works require a large amount of additional training data which is not publicly available. Here it is surprising to see that the results of the CNN of [44] are much lower as compared to other techniques. This is due to training the CNN with a relatively small number of samples, around 16k to 18k. At the end of this article, comparison charts are also presented to reflect the preeminence of the proposed work.

Table 4. Character recognition accuracy on three benchmark datasets

Method                       | ICDAR 2003 | SVT    | IIIT5k
Tesseract OCR                | 37.3%      | 34.9%  | 32.2%
de Campos et al. [27]        | 41.0%      | -      | -
Wang et al. [45]             | 51.5%      | -      | -
Neumann et al. [5]           | 64.0%      | -      | -
VLFeat (HOG+SVM)             | 74.1%      | 70.9%  | 70.7%
Yi et al. [46]               | 76.0%      | -      | -
Lee et al. [35]              | 79.0%      | -      | -
Wang et al. [25]             | 64.0%      | -      | -
Jaderberg et al. [44]        | 79.5%      | 75.0%  | 74.9%
Tian et al. [47] CoHOG       | 80.5%      | 75.8%  | 77.8%
Tian et al. [47] ConvCoHOG   | 81.7%      | 77.2%  | 78.8%
Proposed                     | 84.1%      | 81.3%  | 82.9%

Comparison graphs of Table 4 are shown in Fig. 15.


Fig. 15. Comparison results of character recognition

7. Conclusion
In this work, a novel way of scene text extraction under machine learning is suggested. By leveraging the primacy of MSER, stable regions are detected on the enhanced image using the gray scale L-channel. To make a robust feature descriptor, LBP and T-HOG feature sets are combined into a single feature vector by a weighted sum to identify text and non-text regions perfectly with linear SVMs. This helps to detect characters using character grouping of the scene text. Further, the output of the text detection module is sent to the text recognition module as an input. A CNN is utilized as a powerful text recognition tool, in today's computer vision problems, for recognizing and labeling the text characters. It leverages supervised text information on multiple levels, i.e., text region detection, character labeling and binary text and non-text regions. The presented CNN is

trained with positive and negative samples collected from the Char74K [27], IIIT 5K [26], ICDAR 2003 [24] and SVT [25] datasets. These multi-selected samples help to make the CNN learning powerful, which is then able to robustly extract actual text from unconstrained scene images. Further, an attempt is made to preserve the text meaning and to correct errors in the text by introducing the hamming distance technique. As evidence of the power of the proposed methodology, state-of-the-art results are demonstrated on three benchmark datasets.

8. Limitations and Future Work
The proposed methodology still has room for improvement in several ways. First, it would be better to perform character level detection and recognition using a single feed-forward integrated pipeline rather than the stepwise approach with two portions, text detection and text recognition. Secondly, texts in scenes are not properly separated from each other, especially those which suffer from different kinds of variations, font size and style, viewpoints and blur. These aspects need to be addressed more efficiently and robustly. In future, this work can be extended to investigate word level detection and recognition. Furthermore, detecting and recognizing multi-line scene text might be the most powerful extension of this work. Finally, reducing computational cost and response time is a recurring and prime objective for the future.

REFERENCES
[1] Shivakumara, P., et al., Gradient vector flow and grouping-based method for arbitrarily oriented scene text detection in video images. IEEE Transactions on Circuits and Systems for Video Technology, 2013. 23(10): p. 1729-1739.
[2] Yao, C., et al., Detecting texts of arbitrary orientations in natural images. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. 2012: IEEE.
[3] Das, M.S., B.H. Bindhu, and A. Govardhan, Evaluation of text detection and localization methods in natural images. International Journal of Emerging Technology and Advanced Engineering, 2012. 2(6): p. 277-282.
[4] Matas, J., et al., Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 2004. 22(10): p. 761-767.
[5] Neumann, L. and J. Matas, A method for text localization and recognition in real-world images. In Asian Conference on Computer Vision. 2010: Springer.
[6] Epshtein, B., E. Ofek, and Y. Wexler, Detecting text in natural scenes with stroke width transform. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. 2010: IEEE.
[7] Kasar, T., J. Kumar, and A. Ramakrishnan, Font and background color independent text binarization. In Second International Workshop on Camera-Based Document Analysis and Recognition. 2007.
[8] Pan, Y.-F., X. Hou, and C.-L. Liu, A hybrid approach to detect and localize texts in natural scene images. IEEE Transactions on Image Processing, 2011. 20(3): p. 800-813.
[9] Yi, C. and Y. Tian, Text string detection from natural scenes by structure-based partition and grouping. IEEE Transactions on Image Processing, 2011. 20(9): p. 2594-2605.
[10] Kim, K.I., K. Jung, and J.H. Kim, Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2003. 25(12): p. 1631-1639.
[11] Leibe, B. and B. Schiele, Scale-invariant object categorization using a scale-adaptive mean-shift search. In DAGM-Symposium. 2004: Springer.
[12] Zhong, Y., H. Zhang, and A.K. Jain, Automatic caption localization in compressed video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000. 22(4): p. 385-392.
[13] Chen, X. and A.L. Yuille, Detecting and reading text in natural scenes. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. 2004: IEEE.
[14] Viola, P. and M. Jones, Fast and robust classification using asymmetric AdaBoost and a detector cascade. In Advances in Neural Information Processing Systems. 2002.
[15] Huang, W., et al., Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE International Conference on Computer Vision. 2013.
[16] Shivakumara, P., T.Q. Phan, and C.L. Tan, A Laplacian approach to multi-oriented text detection in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011. 33(2): p. 412-419.
[17] Gupta, A., A. Vedaldi, and A. Zisserman, Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[18] He, T., et al., Text-attentional convolutional neural network for scene text detection. IEEE Transactions on Image Processing, 2016. 25(6): p. 2529-2541.
[19] Zhang, Z., et al., Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[20] Liu, Y. and L. Jin, Deep matching prior network: Toward tighter multi-oriented text detection. arXiv preprint arXiv:1703.01425, 2017.
[21] Liu, Y., S. Goto, and T. Ikenaga, A contour-based robust algorithm for text detection in color images. IEICE Transactions on Information and Systems, 2006. 89(3): p. 1221-1230.
[22] Dalal, N. and B. Triggs, Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on. 2005: IEEE.
[23] Lafferty, J., A. McCallum, and F.C. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[24] Lucas, S.M., et al., ICDAR 2003 robust reading competitions. In Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on. 2003: IEEE.
[25] Wang, K., B. Babenko, and S. Belongie, End-to-end scene text recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on. 2011: IEEE.
[26] Mishra, A., K. Alahari, and C. Jawahar, Scene text recognition using higher order language priors. In BMVC 2012 - 23rd British Machine Vision Conference. 2012: BMVA.
[27] de Campos, T., B.R. Babu, and M. Varma, Character recognition in natural images. 2009.
[28] Opitz, M., Text Detection and Recognition in Natural Scene Images. 2013.
[29] Neumann, L. and J. Matas, Real-time scene text localization and recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. 2012: IEEE.
[30] Ojala, T., M. Pietikainen, and D. Harwood, Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In Pattern Recognition, 1994. Vol. 1 - Conference A: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference on. 1994: IEEE.
[31] Anthimopoulos, M., Text Detection in Images and Videos. Department of Informatics and Telecommunications, 2005.
[32] Minetto, R., et al., SnooperText: A text detection system for automatic indexing of urban scenes. Computer Vision and Image Understanding, 2014. 122: p. 92-104.
[33] Retornaz, T. and B. Marcotegui, Scene text localization based on the ultimate opening. In International Symposium on Mathematical Morphology. 2007.
[34] Yao, C., et al., Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[35] Lee, C.-Y., et al., Region-based discriminative feature pooling for scene text recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[36] Su, B. and S. Lu, Accurate scene text recognition based on recurrent neural network. In Asian Conference on Computer Vision. 2014: Springer.
[37] Almazán, J., et al., Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014. 36(12): p. 2552-2566.
[38] Shi, C., et al., Scene text recognition using part-based tree-structured character detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013.
[39] Alsharif, O. and J. Pineau, End-to-end text recognition with hybrid HMM maxout models. arXiv preprint arXiv:1310.1811, 2013.
[40] He, P., et al., Reading scene text in deep convolutional sequences. In AAAI. 2016.
[41] Mishra, A., K. Alahari, and C. Jawahar, Top-down and bottom-up cues for scene text recognition. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. 2012: IEEE.
[42] Coates, A., et al., Text detection and character recognition in scene images with unsupervised feature learning. In Document Analysis and Recognition (ICDAR), 2011 International Conference on. 2011: IEEE.
[43] Wang, T., et al., End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on. 2012: IEEE.
[44] Jaderberg, M., A. Vedaldi, and A. Zisserman, Deep features for text spotting. In European Conference on Computer Vision. 2014: Springer.
[45] Wang, K. and S. Belongie, Word spotting in the wild. In European Conference on Computer Vision. 2010: Springer.
[46] Yi, C., X. Yang, and Y. Tian, Feature representations for scene text character recognition: A comparative study. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. 2013: IEEE.

[47]

Tian, S., et al., Multilingual scene character recognition with co-occurrence of histogram of oriented gradients. Pattern Recognition, 2016. 51: p. 125-134.

Ghulam Jillani Ansari received his Bachelor's degree in Computer Science from BZU Multan, Pakistan in 2000 and his MS (CS) from the University of Agriculture Faisalabad, Pakistan in 2004. He is currently a PhD (CS) student at COMSATS Wah Cantt, Pakistan and has been employed as an Assistant Professor at the University of Education Lahore (Multan Campus), Pakistan since 2006. His research interests are Object Oriented Programming, Image Processing, Computer Vision, Neural Networks and Machine Learning. His present PhD (CS) research area is Computer Vision and Graphics.

Jamal Hussain Shah, PhD, is an Assistant Professor at COMSATS, Wah Cantt, Pakistan. He completed his PhD in Pattern Recognition at the University of Science and Technology of China, Hefei, P.R. China, and his Masters in Computer Science at COMSATS Wah, Pakistan. His area of specialization is Automation and Pattern Recognition. He has been in the education field since 2008 and has 21 publications in IF, SCI and ISI journals as well as in national and international conferences. He is currently supervising 4 PhD (CS) and 6 Masters students, and received the COMSATS Research Productivity Award from 2013 to 2016. His research interests include Deep Learning, Algorithm Design and Analysis, Machine Learning, Image Processing and Big Data.

Mussarat Yasmin, PhD, is an Assistant Professor at COMSATS, Wah Cantt, Pakistan. Her area of specialization is Image Processing. She has been in the education field since 1993 and has 45 research publications in IF, SCI and ISI journals as well as in national and international conferences. A number of undergraduate projects have been completed under her supervision, and she is currently supervising five PhD (CS) students. She is a gold medallist in MS (CS) from IQRA University, Pakistan, and has been receiving the COMSATS Research Productivity Award since 2012. Her research interests include Neural Networks, Algorithm Design and Analysis, Machine Learning and Image Processing.

Muhammad Sharif, PhD, is an Associate Professor at COMSATS, Wah Cantt, Pakistan. His area of specialization is Artificial Intelligence and Image Processing. He has been in the teaching field since 1995 and has more than 110 research publications in IF, SCI and ISI journals and national and international conferences. He has so far supervised 25 MS (CS) theses and is currently supervising 5 PhD (CS) students, as well as co-supervising 5 others. More than 200 undergraduate students have successfully completed their project work under his supervision. His research interests are Image Processing, Computer Networks & Security, and Algorithm Design and Analysis.

Steven Lawrence Fernandes, PhD, is a member of the Core Research Group, Karnataka Government Research Centre of Sahyadri College of Engineering and Management, Mangalore, Karnataka. He received the Young Scientist Award from the Vision Group on Science and Technology, Government of Karnataka, and a grant from The Institution of Engineers (India), Kolkata for his research work. His current PhD work, "Match Composite Sketch with Drone Images", has received a patent notification (Patent Application Number: 2983/CHE/2015) from the Government of India.


Research Highlights

• A novel method is proposed for scene text extraction, recognition and correction.
• MSER technique is used for segmenting text/non-text areas after preprocessing.
• A feature fusion approach is used for CC classification using SVM and weighted sum.
• A CNN model is proposed for character labeling and hamming distance for correction.
• Conclusions and analysis are performed on datasets ICDAR2003, SVT and IIIT5k.
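To make the final correction step in the highlights concrete, the following is a minimal Python sketch of hamming-distance lexicon matching; it is an illustrative assumption, not the paper's implementation, and the function names, the sample lexicon and the distance threshold are hypothetical choices.

def hamming_distance(a, b):
    # Count character positions where two equal-length strings differ.
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def correct_word(predicted, lexicon, max_dist=2):
    # Hamming distance is defined for equal-length strings, so compare
    # only against lexicon entries of the same length as the prediction.
    candidates = [w for w in lexicon if len(w) == len(predicted)]
    if not candidates:
        return predicted
    best = min(candidates, key=lambda w: hamming_distance(predicted, w))
    # Accept the correction only if it is close enough to the prediction.
    return best if hamming_distance(predicted, best) <= max_dist else predicted

# Example: a word read by the character classifier with one misrecognized
# character is mapped to the nearest lexicon word.
print(correct_word("H0TEL", ["HOTEL", "MOTEL", "TEXACO"]))  # prints "HOTEL"

In this sketch the threshold max_dist simply caps how many character substitutions the correction may introduce; in practice such a limit would be tuned on a validation set.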