ScienceDirect Procedia Technology 11 (2013) 580 – 584
The 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013)
Arabic Handwriting Data Base for Text Recognition
Jabril Ramdan, Khairuddin Omar, Mohammad Faidzul, Ali Mady*
Center of Artificial Intelligent, School of Computer Science, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43200 Bangi, Selangor, Malaysia.
Abstract
Standard databases greatly facilitate the comparison of results obtained by different research groups. This paper introduces a database of Arabic handwritten text images (AHDB/FTR) that supports research on open-vocabulary Arabic handwritten text recognition, word segmentation and writer identification, and that can be freely accessed by researchers worldwide. The database consists of 497 images of the names of Libyan cities, handwritten by five Arabic scholars.
© 2013 The Authors. Published by Elsevier Ltd. Selection and peer-review under responsibility of the Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia.
Keywords: Arabic handwritten text image; Database AHDB/FTR.
1. Introduction
Owing to the availability of numerous databases of printed and handwritten Latin text, most text recognition research over the years has concentrated on Latin script [1, 2]. Databases for Chinese and Indian languages are also fairly common [3, 4, 5]. By contrast, studies on Arabic handwriting recognition remain scarce, and the absence of freely available Arabic databases is regarded as the primary reason. Consequently, each research group has conducted its research on data it collected individually,
* Corresponding author. Tel.: +603 - 8921 ext 6347. E-mail address:
[email protected]
2212-0173 © 2013 The Authors. Published by Elsevier Ltd. Selection and peer-review under responsibility of the Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia. doi:10.1016/j.protcy.2013.12.231
which has led to widely varying reported recognition rates for Arabic and makes it hard to compare the systems used by different researchers [8]. This situation clearly shows the need for a common database for text image recognition and writer identification. Databases do exist for handwritten texts [6], handwritten words [7] and bank checks [8]. To our knowledge, however, only one database contains Arabic handwritten text-lines with open vocabulary [9]. That database does not include a dataset of words, and it is not easy to obtain access to it. To address this problem, this paper proposes an Arabic Handwritten Text Images Database (AHDB/FTR), which covers all Arabic characters in all their forms (beginning, middle, end, and isolated).
2. Overview of the AHDB/FTR
Data collection began by asking five Arabic scholars of different ages and educational qualifications to write text-lines with any pen of their choice; no restriction was placed on the choice of pen. The texts were then scanned and stored as grayscale BMP images at a resolution of 300 dpi. Fig 1 shows a sample image from the database:
Fig 1. Sample of image words
The AHDB/FTR database comprises 497 such images of the names of Libyan towns and a large number of characters in total, covering all shape types (beginning, middle, end, and isolated). Each binary image bitmap represents a handwritten town name and is accompanied by ground truth (GT) information. Table 1 shows a data set entry of the AHDB/FTR database, and Fig 2 depicts a standard example.
Table 1. A data set entry of the AHDB/FTR database. The symbols M, B, A, E stand for the character shapes used (middle, begin, alone, end position in a word).
Image                     (word image)
Ground truth
  Global word             ﻟﻤﻠﻮﺩﺓ
  Character sequence      B-ﻝ, M-ﻡ, M-ﻝ, E-ﻭ, B-ﺩ, B-ﺓ
  Baseline                114
  Quantity of words       1
  Quantity of PAWs        3
  Quantity of characters  6
Fig 2. Illustration of the baseline
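The ground truth fields listed in Table 1 can be pictured as a small record type. The following is only a minimal sketch, not the file format actually used by AHDB/FTR: the class name, the field names, the file name `example.bmp` and the consistency checks are all assumptions made for illustration, and the baseline is assumed to be a pixel row index. The values are those of the Table 1 entry.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Shape labels from Table 1: Begin, Middle, End, Alone.
SHAPES = {"B", "M", "E", "A"}


@dataclass(frozen=True)
class GroundTruthEntry:
    """Hypothetical in-memory form of one data set entry (names are assumptions)."""
    image_file: str                                   # hypothetical file name
    global_word: str                                  # town name as text
    character_sequence: Tuple[Tuple[str, str], ...]   # (shape label, letter) pairs
    baseline: int                                     # assumed: pixel row of the baseline
    word_count: int
    paw_count: int
    character_count: int

    def validate(self) -> List[str]:
        """Return a list of consistency problems; an empty list means none found."""
        problems = []
        for shape, letter in self.character_sequence:
            if shape not in SHAPES:
                problems.append(f"unknown shape label {shape!r} for {letter!r}")
        if len(self.character_sequence) != self.character_count:
            problems.append("character count does not match character sequence")
        if self.paw_count < self.word_count:
            problems.append("every word contains at least one PAW")
        if self.baseline < 0:
            problems.append("baseline must be non-negative")
        return problems


# The entry from Table 1, with letters written as Unicode escapes.
entry = GroundTruthEntry(
    image_file="example.bmp",  # hypothetical
    global_word="\u0644\u0645\u0644\u0648\u062f\u0629",
    character_sequence=(
        ("B", "\u0644"), ("M", "\u0645"), ("M", "\u0644"),
        ("E", "\u0648"), ("B", "\u062f"), ("B", "\u0629"),
    ),
    baseline=114,
    word_count=1,
    paw_count=3,
    character_count=6,
)
print(entry.validate())
```

Keeping the counts as explicit fields, rather than deriving them, mirrors Table 1 and lets a loader cross-check each entry when it is read.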
In general, a town name comprises at most three words, each of which may contain several sub-words. The writers were initially asked to fill in the forms without any restrictions, except that they were asked to avoid lines and boxes. The majority of the words in the AHDB/FTR database either overlap or touch. AHDB/FTR can be used for Arabic word recognition, word spotting, and writer identification. Table 2 shows an example data set entry from each of the AHDB/FTR, IFN/ENIT and AHTID/MW databases.
Table 2. A data set entry of the AHDB/FTR, IFN/ENIT and AHTID/MW databases.
Columns: AHDB/FTR, IFN/ENIT, AHTID/MW (sample images)
3. Description of Arabic characteristics
Arabic is a widely used language, spoken by more than three hundred million people in more than 20 countries. Arabic script is cursive and is written horizontally from right to left. The Arabic alphabet has 28 letters, and many of them change shape depending on their position in the word. Quite a number of Arabic characters have four shapes: isolated, initial, medial and final (Table 1) [8]. Because not every letter joins to the letter that follows it, an Arabic word can consist of more than one sub-word, called a PAW (Piece of Arabic Word), each made up of one or more connected letters. The majority of Arabic letters carry one to three dots. The number and position of these dots help readers distinguish letters that share the same basic shape. In addition, a few Arabic letters can be written in several styles, so it is essential to gather samples of all those styles [9, 10]. Table 3 illustrates the different shapes of an Arabic letter.
Table 3. Example of different shapes of an Arabic letter
Letter label   Isolated   Begin   Middle   End
Yaa            ﻱ          ﻳـ      ـﻴـ      ـﻲ
Nuun           ﻥ          ﻧـ      ـﻨـ      ـﻦ
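The PAW notion described in this section can be made concrete with a short sketch. It is an illustration only, not the procedure used to build AHDB/FTR: it assumes a single word without diacritics or elongation marks, and it relies on the general property of Arabic script that a fixed set of letters (alef and its hamza/madda variants, dal, thal, ra, zay, waw, waw with hamza, teh marbuta) never join to the following letter, while a standalone hamza joins to neither neighbour. The function name `split_paws` is hypothetical.

```python
import unicodedata
from typing import List

# Letters that never join to the FOLLOWING letter, so a PAW ends after them:
# alef, alef-madda, alef-hamza-above, alef-hamza-below, waw-hamza,
# dal, thal, ra, zay, waw, teh marbuta.
ENDS_PAW = set("\u0627\u0622\u0623\u0625\u0624\u062f\u0630\u0631\u0632\u0648\u0629")
HAMZA = "\u0621"  # standalone hamza joins to neither side


def split_paws(word: str) -> List[str]:
    """Split one Arabic word into its PAWs (simplified sketch)."""
    # NFKC maps positional presentation forms back to the base letters.
    word = unicodedata.normalize("NFKC", word)
    paws: List[str] = []
    current = ""
    for ch in word:
        if ch == HAMZA:
            # Hamza is a PAW on its own; close whatever came before it.
            if current:
                paws.append(current)
            paws.append(ch)
            current = ""
            continue
        current += ch
        if ch in ENDS_PAW:
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


# The town name from Table 1, written with Unicode escapes.
paws = split_paws("\u0644\u0645\u0644\u0648\u062f\u0629")
print(len(paws), sum(len(p) for p in paws))  # 3 PAWs, 6 characters
```

For the Table 1 word this yields 3 PAWs and 6 characters, which agrees with the quantities recorded in that entry.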
4. Purpose of this paper and dataset
Researchers have recently shown growing interest in the problems of Arabic handwritten text recognition. Although some studies use their own data sets, the need for a standard database to support research on Arabic handwriting recognition has been clearly emphasized; furthermore, such a standardized database must be available for all purposes. After images are collected and stored, fundamental pre-processing tasks such as noise filtering, text block segmentation, image binarization, and word segmentation must be carried out. These tasks depend heavily on the quality and nature of the documents used, but they are independent of the subsequent processing phases. Because the basic pre-processing was done during database development, the AHDB/FTR database contains pre-processed binary images of single words, which allows work on recognition methods to be independent of the nature of the individual documents.
5. Future directions
The dataset can be enhanced by adding more images, increasing the number of writers, and including new decoration images.
6. Conclusions
This paper has stressed the need for a standardized database of Arabic handwritten images and has introduced one. The proposed AHDB/FTR database contains 497 word images of the names of Libyan towns, written by five different writers. The database will be useful for a variety of research applications, such as Arabic handwritten text recognition and writer identification systems. The AHDB/FTR database will be made freely available to interested researchers. We believe that the database will benefit the research community and will be regarded as a significant asset.
7. Acknowledgment
The authors would like to thank all writers who contributed to AHDB/FTR database generation.
References
[1] J. J. Hull, “A database for Handwritten Text Recognition Research,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994, vol. 16, pp. 550–554.
[2] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient based learning applied to document recognition,” Proceedings of the IEEE, vol. 86(11), 1998, pp. 2278–2324.
[3] T. Saito, H. Yamada and K. Yamamoto, “On the database ETL9 of handprinted characters in JIS Chinese characters and its analysis (in Japanese),” Transaction of IECEJ, 1985, vol. J.68-D(4), pp. 757–764.
[4] U. Bhattacharya and B. B. Chaudhuri, “Databases for Research on Recognition of Handwritten Characters of Indian Scripts,” International Conference of Document Analysis and Recognition, 2005, pp. 789–793.
[5] S. Mihov, K. U. Schulz, C. Ringlstetter, V. Dojchinova, V. Nakaova, K. Kalpakchieva, O. Gerasimov, A. Gotscharek and C. Gercke, “A Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques,” Proceeding of International Conference of Document Analysis and Recognition, 2005, pp. 162–166.
[6] S. Al-Ma’adeed, D. Elliman and C. A. Higgins, “A data Base for Arabic Handwritten Text Recognition Research,” Proceeding of Eighth International Workshop on Frontiers in Handwriting Recognition, 2002, pp. 485–489.
[7] M. Pechwitz, S. S. Maddouri, V. Maergner, N. Ellouze and H. Amiri, “IFN/ENIT - database of handwritten Arabic words,” Proceeding of Colloque International Francophone sur l’Écrit et le Document, 2002, pp. 129–136.
[8] A. Abdulkader, “Two-Tier Approach for Arabic Offline Handwriting Recognition,” Tenth International Workshop on Frontiers in Handwriting Recognition, 2006.