Gabor Wavelets Based Word Retrieval from Kannada Documents

Gabor Wavelets Based Word Retrieval from Kannada Documents

Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 79 (2016) 441 – 448 7th International Conference on Communication,...

363KB Sizes 0 Downloads 25 Views

Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 79 (2016) 441 – 448

7th International Conference on Communication, Computing and Virtualization 2016

Gabor Wavelets Based Word Retrieval from Kannada Documents Mallikarjun Hangarge1, Veershetty C1, Rajmohan Pardeshi1, B. V. Dhandra2 1

Department of P.G. Studies and Research in Computer Science, Karnatak Arts Science and Commerece College, Bidar, India 2 Department of P.G. Studies and Research in Computer Science, Gulbarga University Kalaburagi,India

Abstract In this paper, we propose a technique for word retrieval based on Gabor wavelets. Gabor wavelets are employed to capture directional energy features of the word image. A pre-processing step is performed in order to improve the quality of the document images, at the same time word segmentation is accomplished using 294 document images. As a result, a large dataset of 45000 words is generated for experimentation. Then, Gabor wavelets are used to represent the candidate word as well as a query word. Then cosine distance is used to measure the similarity between two words, based on it, relevance of the word is estimated by generating distance ranks. Then correctly matched words are selected at different distance thresholds such as 96%, 97%, 98% and 99%. The results achieved are encouraging in terms of average precision 81.25%, average recall 82.09% and F measure 84.53% at a threshold 97%. ©©2016 Published by Elsevier B.V.B.V. This is an open access article under the CC BY-NC-ND license 2016The TheAuthors. Authors. Published by Elsevier (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Organizing Committee of ICCCV 2016. Peer-review under responsibility of the Organizing Committee of ICCCV 2016

1. Keywords: Kannada word retrieval, Gabor wavelets, Document image retrieval , Cosine distance

1.

Introduction An efficient and effective information retrieval method is necessary for fetching the relevant document from a huge storage of digital documents in response to user query to rank the related documents in order. In this aspect, an effective method is necessary for matching the documents to retrieve from a large volume of digitized documents. In fact, retrieval of information is much harder for image data than text data. To inspect the document or retrieve from the database, a proper mechanism is needed to characterize the document image in a systematic manner; this can be done through DIP (Digital Image Processing) using two different approaches. One is OCR (Optical Character Recognition), and another is document image retrieval (keyword spotting) technique [12].

Corresponding author. Tel.: 919448890499; E-mail address: [email protected]

1877-0509 © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Organizing Committee of ICCCV 2016 doi:10.1016/j.procs.2016.03.057

442

Mallikarjun Hangarge et al. / Procedia Computer Science 79 (2016) 441 – 448

Optical Character Recognition (OCR) deals with character recognition in document image, however, it yields erroneous results if documents are poor in quality, complex in layouts and due to improper segmentation etc. Applying OCR to retrieve a document is a cumbersome task as it requires complete transcription of the document. This method works well for European languages whose optical character recognizers are robust but did not work reliably for Indian and non-European languages like Hindi, Kannada, Telugu and Malayalam etc. Even for European languages, OCRs are not available for recognition of highly degraded document images. Therefore, many researchers are moved to Digital Image Retrieval (DIR) techniques for information retrieval without converting the whole image document into editable text. In this process, text is identified at word level using image property [12] and it is called keyword spotting. Many interesting works have been done on keyword spotting in document image using word image property instead of text transcription. In word spotting, a given word image is represented in terms of features and matching is done based on feature distance. Main challenges in word spotting are extraction of proper features and comparison of features that is, estimating the similarity between features. Word spotting is an alternative way for retrieving and indexing the document images. Particularly in Indian context, this technique is the best alternative to multilingual OCR. In fact, designing of multi-script OCR for Indian scripts is very complex task. Therefore, word retrieval from all Indic scripts need to be tested and emphasis has to be given to develop a generic word spotting algorithm. Thus, we have proposed to retrieve words of Kannada script based on Gabor features. There are few works reported for retrieval of words from documents of Indian scripts such as Devanagari, Bangla, and Gurumukhi [26, 31], however, the proposed work for retrieval of keywords from Kannada document images is first of its kind. The remainder of the paper is described as follows: the literature is explained in Section 2, in Section 3 the proposed methodology is presented, in Section 4 experimental results are presented and finally in Section 5 conclusions are drawn. 2.

Related work The word spotting was initially projected for speech processing and later it was incorporated for retrieval of historical documents and manuscripts of different scripts and languages. Word spotting is the process of spotting the word or a phrase in the document image which was initially reported in [8] for printed text documents and later in [9] for handwritten text documents. The majority of the work done on word spotting (matching and retrieval) is in two different types, namely with segmentation and without segmentation of the underlying document image. Many algorithms have been proposed for both the methods since last three decades. Without segmentation and with segmentation techniques are proposed in [28, 30] and in [13, 14] respectively. The methods based on segmentation involves three levels decomposition of input document namely at line [13, 14], word [22, 23] and character [20]. Both the methods have their own pros and cons. It is clearly evident that segmentation based word spotting yields erroneous results due to improper segmentations specifically in case of degraded documents, touched lines, broken characters. The process of document segmentation is more complex, particularly, in case of handwritten documents. Meanwhile, the selection of optimal sliding window size for word spotting is the major issue in segmentation free method. The various feature extraction techniques proposed based on segmentation are discussed and clustered in the following paragraphs. The techniques used for content retrieval based on without segmentation is not discussed here. Segmentation based techniques: To describe a query word image, movements of the black pixel are investigated in [10] as holistic features. The ‘Binary Gradient Structural and Convexity’ (GSC) features are discovered in [19]. In [21], the authors discussed discrete cosine transformation of the counter to gain the feature vector. Profile features and ink transition profiles are investigated to represent a word image in [27]. Besides, multiple features were explored and combined in order to improve the word image representation. In [17], compound features scheme was reported consisting of projection profiles, upper/lower word profiles and background to-ink transition. Word profiles, structural features and statistical movements were discussed in [18], and further, authors employed Fourier coefficients for dimensionality reduction. Various distance measures were used in estimating the similarity between query and candidate word image as discussed in [15, 16]. Further, dynamic time wrapping (DTW) was frequently used method to calculate the similarity of the words as used in [24, 25] because it tolerates special difference unlike above methods.

Mallikarjun Hangarge et al. / Procedia Computer Science 79 (2016) 441 – 448

3.

443

Proposed method In this Section, we explain our approach for word retrieval from document images. Fig. 1 shows an overview of our approach. The document images are subjected to a pre-processing step that is noise reduction and segmentation of words. The features are extracted from words and formed feature vectors. Afterwards, the feature vectors corresponding to the query words are matched with candidate word images (knowledge base). The similarity scores are stored in a similarity matrix. Next, similarity matrix is used to count the occurrences of the query word image in the database. The relevant occurrences are detected and irrelevant patterns are filtered out based on thresholding the distance. Input Document Image

Query Image

Feature Extraction

Pre-processing & Segmentation

Database

Feature Extraction

Similarity Estimation

Similar Words Retrieval Fig-1. the flow diagram of the proposed method

3.1. Pre-processing The task of extraction of words from input document image is done in three steps. First, we binarized image using Otsu’s gray level threshold selection method [1]. Further, basic morphological operators are employed to remove the noise and special symbols from documents such as double quotations, commas etc. Then, horizontal and vertical dilation is performed using line structuring element (length of the structuring element is empirically fixed) to make each word as a single connected component. Later, bounding boxes are fixed on each component. Eventually, these connected components are extracted by applying connected component rule. The words segmenting accuracy of this technique is 100%, except in case of touched lines/words and it avoids line segmentation. The complete process of words extraction from Kannada document can be visualized in Fig-2.

(a)

(b)

(c)

Fig-2. (a) document page (b) document with horizontal dilation and connected components (c) segmented words

444

Mallikarjun Hangarge et al. / Procedia Computer Science 79 (2016) 441 – 448

3.2. Feature extraction There are many feature extraction techniques proposed in the literature. The selection of feature extraction technique certainly depends on the interest of the region or properties of the image to be extracted. In this work, visual discriminating properties of the word image are the good clues to find out similarity or dissimilarity between the words and thus, motivated to employ Gabor wavelets. Because, Gabor wavelets are efficient in characterizing visual properties of the underlying image and are similar to perception in the human visual system [11]. In fact, Gabor filters are also efficient in representing energy distribution properties (global) of an image. This potential feature of Gabor wavelets is utilized in this paper. The Gabor filters implementation is explained in the following paragraph. Gabor features: For feature extraction, we use Gabor wavelets, because of its biological relevance and computational properties [2][3]. Gabor wavelets have wide applications for instance face recognition [6], data compression [5] and texture analysis [4], handwritten character recognition [7] and in recent past for script identification in video frames [29]. It characterizes a spatial frequency structure in the image, at the same time it provides spatial relations. This property certainly becomes useful to extract directional energy features from patterns of Kannada words, as they are circular in nature. The implementation detail of Gabor filter is given below

ሺܽǡ ܾǡ ‫ܨ‬ǡ ߪ௔ ǡ ߪ௕ ሻ ൌ ͳȀሺʹߨߪ௔ ǡ ߪ௕ ሻ‡š’ሾെ

ሺ‫ݑ‬ǡ ‫ݒ‬ሻ ൌ ݁‫݌ݔ‬ሼെ where, ߪ௨ ൌ

ଵ ଶగఙೌ

, ߪ௩ ൌ

ଵ ଶగఙ್

భ మሾሺ࢛షࡲሻమ ఙೌమ

ଵ ೌమ ್మ

ଶቆ మ ା మ ቇ ഑ ഑ ೌ



࢜మ ఙೡమ

൅ ʹߨ݆‫ܨ‬௔ ሿǡ

(1)



ሽ,

(2)

and parameter F specifies the central frequency of interest. Gabor wavelets are

derived from equation (1) as defined ݃௠௡ ሺܽǡ ܾሻ ൌ ‫ି ݔ‬௠ ݃ሺܽᇱ ܾ ᇱ ǡ where ܽᇱ ൌ ‫ି ݔ‬௠ ሺܽܿ‫ ׎ݏ݋‬൅ ܾ‫׎݊݅ݏ‬ሻ ,

ி௡గ ௞

ǡ ߪ௔ ߪ௕ ሻ ,

(3)

ܾ ᇱ ൌ ‫ି ݔ‬௠ ሺܽ‫ ׎݊݅ݏ‬൅ ܾܿ‫׎ݏ݋‬ሻǤ

In equation (3), m = 0, 1…; s-1, where S is the total number of scales, and n = 0, 1…; k -1, where k is the total number of orientations. The central frequency of interest is F. Let I be the input word image, each input word image matrix is converted into a square matrix by appending zeros in case of non-square matrix. Then Gabor wavelets are computed on each word image using equation (3) with 4 different scales and 6 orientations and yields 24 features of each word. Angle of orientation i.e., ‫(׎‬theta) is defined below: ௡గ

Orientation (‫׎‬ሻ ൌ ሺ ሻ. ௞ The parameters used in the above equations are defined as follows:

(4)



‫ ݔ‬ൌ ሺܷ௛ Ȁܷ௟ ሻ௦ିଵ ߪ௔ୀ ሺሺ‫ ݔ‬൅ ͳሻξʹ݈݊ʹሻȀʹߨ‫ ݔ‬௠ ሺ‫ ݔ‬െ ͳሻܷ௟ ߨ ʹߨ –ƒ ቀ ቁ ඥܷ௛ଶ ͳ ଶ ʹ݇ െ൬ ൰ Ǥ ߪ௕ୀ ͳȀሺ ʹ݈݊ʹ ʹߨߪ௔ By varying m and n, we have applied filter ‰ ୫୬ ሺƒǡ „ሻ to the input image I (using 2-D convolution) to get features at different scales and orientations. In our experiment, we have extracted 48 features for each word (24 mean square energies & 24 mean amplitudes). The effect of Gabor wavelets applied on a word image shown in figure-

445

Mallikarjun Hangarge et al. / Procedia Computer Science 79 (2016) 441 – 448

3 is visualized in figure-4.

Fig-3. sample word image on which gabor wavelets are applied to visualize its effect shown in figure-4.

5

5

5

10

10

10

15

15

15

20

20

20

20

25

25

25

25

30

30

30

30

35

35

35

20

40

60

80

100

120

140

160

180

(a)

200

20

40

60

80

100

120

140

160

180

200

5 10 15

35

20

(b)

40

60

80

100

120

140

160

180

200

20

40

(c)

60

80

100

120

140

160

180

200

(d) ௡గ

௡గ





Fig-4. visualization of a word image (Figure-3) after applying Gabor wavelets at (a) Scale=1, orientation=( ) (b) Scale=1, orientation=( ) ௡గ

௡గ





(c) Scale=1, orientation=( ) (d) Scale=1, orientation=( )

3.3. Similarity measure The similarity between query and candidate image is computed using cosine distance. It is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. Cosine distance measures orientation between two vectors but not magnitude. Two vectors with the same orientation have a cosine similarity of 1, two vectors of orientation 90° have a similarity of 0, and two vectors of diametrically opposed have a similarity of -1.These are independent of their magnitudes. Cosine similarity is particularly used in positive space, where the outcome is bounded in [0, 1]. The cost function of cosine distance is as follows [32]. ௫Ǥ௬

݈ܵ݅݉݅ܽ‫ݕݐݎ‬ሺ‫ݔ‬ǡ ‫ݕ‬ሻ ൌ ܿ‫ ߠݏ݋‬ൌ ԡ௫ԡǤԡ௬ԡ

(5)

where x and y are the feature vectors. 4.

Experiments

4.1. Dataset There is no publicly available dataset of Kannada document images at present. Therefore, we have collected 294 document pages belongs to various fields namely literature (142 pages), Government (32 pages) and history (120 pages). Then documents are scanned at 300 dpi resolution using HP scanner. A dataset of 45000 words is generated from these document pages. This dataset involves words of different font size, style, length and thickness (pixel density). 4.2. Evaluation protocol To evaluate the performance of the method, manually a ground truth is prepared for 5 different query words which have multiple occurrences in the dataset. The ground truth of the database is shown in Table-1.

446

Mallikarjun Hangarge et al. / Procedia Computer Science 79 (2016) 441 – 448

Table-1 ground truth of the database Sl.No.

Document Categories

1

Literature

No of Pages

Keyword Image

142

Word Frequency in database 13 43

2

Modern Govt.

32

33 32

3

History

120

63

Table-2 presents the word retrieval results for 5 keywords in terms of recall and precision as well as F- measure Image

Threshold μ  െ>

99.0%

98.0%

97.0%

96.0%

Recall (%) Precision (%) F-measure (%)

100.00 33.33 49.99

100.00 72.72 84.20

74.35 87.87 80.54

100.00 72.72 84.20

Recall (%) Precision (%) F-measure (%)

100.00 12.12 21.61

100.00 45.45 62.49

61.76 63.63 79.88

100.00 45.45 62.49

Recall (%) Precision (%) F-measure (%) Recall (%) Precision (%) F-measure (%) Recall (%) Precision (%) F-measure (%) Recall (%) Precision (%) F-measure (%)

100.00 45.45 62.49 100.00 43.75 60.80 100.00 11.11 19.19 100.00 29.15 42.83

100.00 69.69 82.13 100.00 81.25 89.65 100.00 53.96 70.09 100.00 64.61 77.71

100 78.78 88.13 96.62 96.61 96.61 77.75 79.36 77.51 82.09 81.25 84.53

100.00 69.69 82.12 100.00 81.25 89.65 100.00 53.96 70.09 100.00 64.61 77.71

Average

4.3. Result analysis The results of word retrieval from Kannada documents are presented in Table-2. The similarity is measured between the query and candidate word image using cosine distance. The relevance of the word is estimated by generating distance ranks. Then, correctly matched words are selected at different distance thresholds for instance 96, 97, 98 and 99 percent. To measure the performance of word retrieval, Recall (RC), Precision (PR) and FMeasure (FM) are used. These are defined as described in the following paragraph. Let N be the total number of word instances for every keyword, M be the total number of detected keyword instances and CRR is the correctly detected keyword instances [8][9]. ‡…ƒŽŽ ൌ 

஼ோோ

‫ͲͲͳ כ‬



”‡…‹•‹‘ ൌ 

஼ோோ

‡ƒ•—”‡ ൌ 



‫ͲͲͳ כ‬

஼ோଶ‫כ‬ሺோ஼‫כ‬௉ோሻ ேሺோ஼ା௉ோሻ

(6) (7) (8)

Mallikarjun Hangarge et al. / Procedia Computer Science 79 (2016) 441 – 448

We have achieved encouraging results in terms of average recall, precision and F-measure. In Table 2, we can notice that the average F measure is 84.53% which indicates the significant performance of the method. The performance of the proposed method is also evaluated in terms of average precision and recall and they are 82.09% and 81.25% when threshold is set to 97% respectively. The comparative analysis is not reported herewith because, the scripts employed for experimentation in [26, 31] are Hindi, Bangla and Gurumukhi. In other words, for straight forward comparison there is no work reported in the literature on retrieval of Kannada words. 5.

Conclusion In this paper, retrieval of words from Kannada documents based on Gabor features is proposed. The method has showed its outstanding performance in terms of average F measure. Basically, word spotting is essential to avoid complete transcription of the document image to edit/read it electronically. Particularly in Indian context, this technique is the best alternative to multilingual OCR. In fact, designing of multi-script OCR for Indian scripts is very complex task. Therefore, word retrieval from all Indic scripts need to be tested and emphasis has to be given to develop a generic word spotting algorithm. In this paper, as an initial attempt word retrieval from Kannada script is carried out and it will be extended to all Indic scripts in future to develop a generic algorithm.

References 1.

N. Otsu, A threshold selection method from gray-level histograms, IEEE Transaction, on Pattern. Analysis and Machine Intelligence, vol. 9(1), pp. 62–66, 1979.

2.

S. Marcelja, Mathematical description of the responses of simple cortical cells, Journal Opt. Soc. Amer., vol. 70, pp. 1297–1300, 1980.

3.

J.G. Daugman, Complete discrete 2-d Gabor transforms by neural networks for image analysis and compression, IEEE Trans. attern Analysis and Machine Intelligence, vol.36, no. 7, pp. 1169–1179, 1988.

4.

R. Mehrotra, K. R. Namuduri, and N. Ranganathan, Gabor filter-based edge detection,” Pattern Recognition., vol. 25, no. 12, pp. 479– 1494, 1992.

5.

H. Szu, B. Telfer, and J. Garcia, Wavelet transforms and neural networks for compression and recognition, Neural Networks, vol. 9, no. 4, pp. 695–708, 1996.

6.

L. Wiskott, J. M. Fellous, N. Kruger, and C. Malsburg, Face recognition by elastic bunch graph match, IEEE Transaction. Pattern Analysis. Machine Intelligence, vol. 19, pp. 764–768, July 1997.

7.

D. Deng, K. P. Chan, and Y. Yu, Handwritten Chinese character recognition using spatial Gabor filters and self-organizing feature maps, in Proc. ICIP, vol. 3, pp. 940–944, 1994.

8.

Shyh-shiaw Kuo and Oscar E. Agazzi, Keyword Spotting in Poorly Printed Documents Using Pseudo 2-D Hidden Markov Models IEEE, pp.0162- 8828,1994.

9.

R. Manmatha, C. Han, E.M. Riseman, Word Spotting, A New Approach to Indexing Handwritings. Proceedings of Computer Vision and Pattern Recognition (CVPR), pp.631-637, 1996.

10. B.S. Manjunath, and W.Y. Ma, Texture Features for Browsing and Retrieval of Image Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18. (8), pp.837-842, 1996. 11. T. S. Lee. Image representation using 2D Gabor wavelets.IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10):959– 971, Oct. 1996. 12. Doermann. D, “The Indexing and Retrieval of Document Images: A Survey,” Computer Vision and Image Understanding (CVIU) 70, pp.287-298, 1998. 13. Kolcz A, Alspector J, Augusteijn M, Carlson R, Popescu GV, A line-Oriented Approach to Word Spotting in Handwritten Documents. Pattern Analysis and Application vol.3 (2):pp.153–168, 2000.

447

448

Mallikarjun Hangarge et al. / Procedia Computer Science 79 (2016) 441 – 448 14. Marcolino A, Ramos V, Ramalho M, Pinto JC, Line and word Matching in Old Documents. In: Fifth Ibero American Symposium on Pattern Recognition (SIAPR), pp. 123–135, 2000. 15. R. Manmatha and T.M. Rath, Indexing of Handwritten Historical Documents—Recent Progress, Proceeding of Symposium on Document Image Understanding, pp. 77-85, 2003. 16. Y. Lu and C.L. Tan, Word Spotting in Chinese Document Images without Layout Conference on Pattern Recognition, pp. 57-60, 2002.

Analysis. Proceeding of 16th International

17. Rath TM, Manmatha R. Word Image Matching using Dynamic Time Warping. In: Computer Vision and Pattern Recognition, pp. 521– 527, 2003. 18. Jawahar CV, Balasubramanian A, Meshesha M. Word Level Access to Document image Datasets. In: Proceedings of the workshop on Computer Vision Graphics and Image Processing (WCVGIP), pp. 73–76, 2004. 19. B. Zhang, S.N. Srihari, and C. Huang, Word Image Retrieval Using Binary Features. Proceeding of SPIE, vol. 5296, pp. 45-53, 2004. 20. Kim S, Park S, Jeong C, Kim J, Park H, Lee G Keyword Spotting on Korean Document Images by Matching the Keyword Image. In: Digital libraries: vol. 3815, pp. 158–166, 2005. 21. T. Adamek, N.E. Connor, and A.F. Smeaton, Word Matching Using Single Closed Contours for Indexing Historical Documents. Journal of Document Analysis and Recognition, vol.9 (2) pp. 153-165, 2007. 22. Konidaris T, Gatos B, Ntzios K, Pratikakis I, Theodoridis S, Perantonis SJ. Keyword-Guided Word Spotting in Historical Printed Documents using Synthetic Data and user Feedback. International Journal on Document Analysis and Recognition (IJDAR) vol. 9(24), pp. 167–177, 2007. 23. Konidaris T, Gatos B, Perantonis SJ, Kesidis A Keyword Matching in Historical Machine- Printed Documents using Synthetic Data, Word Portions and Dynamic Time Warping. In: The eighth International Association for Pattern Recognition (IAPR) workshop on document analysis systems, pp. 539–545, 2008. 24. Brina C.D, Niels R, Overvelde A, Levi G, Hulstijn W. Dynamic Time Warping: A New Method in the Study of Poor Handwriting. Human Movement Science 27(2):242–255, 2008. 25. Forne´s A, Llado´s J, Sa´nchez G Old Handwritten Musical Symbol Classification by a Dynamic Time Warping Based Method. Graph Recognition vol 5046:51–60, 2008. 26. Tarafdar, ArundhatiRanju Mondal, Srikanta Pal, Umapada Pal, Shape code based word-image matching for retrieval of Indian multilingual documents, Pattern Recognition (ICPR), 2010. 27. Murugappan A,Ramachandran B, Dhavachelvan P. A Survey of Keyword Spotting Techniques for Printed Document Images. Artificial Intelligence vol. 35(2) pp. 119-136 2011 28. Yat. M, Lam L, Suen CY Arabic Handwritten Word Spotting using Language Models pp. 43–48, 2012. 29. Nabin Sharma, Sukalpa Chanda, Umapada Pal, Michael Blumenstein, " Word-wise script identification from video frames" in Proc. of ICDAR, pp867-871,2013. 30. Khayyat M, Lam L, Suen CY, Learning-Based Word Spotting System for Arabic Handwritten Documents. Pattern Recognition vol. 47(3), pp. 1021–1030, 2014. 31. http://maroo.cs.umass.edu/getpdf.php?id=999. 32. https://en.wikipedia.org/wiki/Cosine_similarity