An Efficient Text Representation for Searching and Retrieving Classical Diacritical Arabic Text


Available online at www.sciencedirect.com Procedia Computer Science 00 (2018) 000–000

ScienceDirect

www.elsevier.com/locate/procedia


Procedia Computer Science 142 (2018) 150–157

The 4th International Conference on Arabic Computational Linguistics (ACLing 2018), November 17-19 2018, Dubai, United Arab Emirates

An Efficient Text Representation for Searching and Retrieving Classical Diacritical Arabic Text

Saqib Hakak (a,*), Amirrudin Kamsin (a,*), Palaiahnakote Shivakumara (a), Omar Tayan (b), Mohd. Yamani Idna Idris (a), Gulshan Amin Gilkar (c)

(a) Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, 50603, Malaysia
(b) Department of Computer Engineering, College of Computer Science and Engineering, Taibah University, Medina, Saudi Arabia
(c) College of Computer Science and Information Technology, Shaqra University, Riyadh, Saudi Arabia

Abstract

Due to the rapid growth of the Internet and advanced technologies, data storage and extraction of Arabic diacritical data in real time from an Arabic corpus have become a vital issue in the field of information retrieval. In this paper, we propose a new idea for representing Arabic diacritic text in the corpus such that search engines can enhance the search time of retrieving the desired text with high precision. To achieve our goal, we segment the Arabic diacritical sentences/verses into individual characters along with diacritics, which are necessary for interpreting the meanings. Then, we propose a new data structure for representing data using segmented alphabets. To verify the corpus representation, the proposed approach uses the Boyer-Moore algorithm for searching given verses of Arabic diacritical data. The proposed representation of data structure reduces the search time from O(m*n) to O(1+m) in the worst case, where m denotes the diacritical verse to be searched, and n denotes the total number of diacritical verses. Experimental results on a popular corpus show that the proposed method outperforms the existing search methods in terms of time complexity.

© 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Peer-review under responsibility of the scientific committee of the 4th International Conference on Arabic Computational Linguistics.

Keywords: Digital Quran; Pattern matching; verification of Quran; Information retrieval; Quran authentication; Arabic/Farsi texts; Urdu texts.

* Corresponding author. Tel.: +6-0183529281. E-mail address: [email protected], [email protected]

1877-0509 © 2018 The Authors. Published by Elsevier B.V. 10.1016/j.procs.2018.10.470

Saqib Hakak et al. / Procedia Computer Science 142 (2018) 150–157

1. Introduction

As data storage and processing increase, data representation and design must also change to cope with the challenges of complex corpora containing heterogeneous data [1, 2]. Numerous methods in the literature focus on improving time complexity and retrieval results, but those approaches hardly address the design of the corpus itself. For instance, non-Latin diacritical texts such as the Digital Quran and Al-Hadith (teachings of Prophet Mohammad (peace be upon Him)) are uploaded to the Internet every second through social media websites, blogs, etc., without the data being organized in any particular way, which causes inefficient retrieval. The work in [1] identified that errors found in the uploaded content were due to missing diacritic symbols. In diacritical texts, the position of diacritic symbols is vital for correctly reading and understanding the meaning of the whole sentence (for example, a Quran verse) [3]. This shows that there is a need for better representation and retrieval methods that represent data without errors and missing symbols. Moreover, popular search engines such as Google, Yahoo, and MSN show promising results for retrieving non-Latin texts from the corpus, but are still at an early stage when it comes to accurate results for diacritical texts such as Arabic, Urdu and Farsi [4, 5]. Thus, there is immense scope for proposing new representations for the above-mentioned digital texts, in order to enable such search engines to retrieve the required data efficiently and precisely in real time [6].

Several methods [7, 8] were found in the literature for representing data in English texts. However, those methods may not be used directly to represent diacritical scripts. The reason is that diacritical scripts are sensitive compared to English data and require four stages: pre-processing, indexing, querying, and finally, retrieving [9]. They are sensitive in nature because, for example, the position of an isolated dot changes the meaning of the whole sentence (verse), while in the case of English, changing one character does not significantly affect the meaning of the sentence. Therefore, diacritical scripts require an accurate and efficient representation [3, 9].

Arabic is one of the most influential and widely spoken languages, with approximately 350 million native speakers [10]. It belongs to the family of Semitic languages and differs syntactically and morphologically from Latin languages. Arabic is written from right to left and has 26 pure consonants and two “semi-vowels” (yaa and waaw), which behave in some contexts as consonants and in other contexts as vowels. The vowels used in Arabic are short and popularly known as diacritics. Those diacritics are written either above or beneath the consonant to give the word a desired sound and meaning. Native speakers usually do not require diacritics for reading or understanding Arabic text in daily activities like reading magazines, textbooks, letters and so on. However, the use of diacritics is heavily prescribed and recommended for religious scriptures. In the Arabic language, the two most sensitive and most important religious scriptures are the Quran (the holy book of Muslims) and Al-Hadith (teachings of Prophet Mohammad (PBUH)). Table I shows details of the diacritics used in the Quran, with the representation given using the Unicode encoding scheme. Table II shows that the positions of the symbols are context sensitive.

Table I: Description of diacritics used in Arabic texts [10]

| Diacritic | Description | Symbol | Sound [11] | Unicode |
| --- | --- | --- | --- | --- |
| Fatha | Small diagonal line above a letter | ◌َ | aa | U+064E |
| Kasra | Small diagonal line below a letter | ◌ِ | ai | U+0650 |
| Damma | Small “comma-like” diacritic placed above a letter | ◌ُ | au | U+064F |
| Tanwin | Double vowel diacritic at the end/beginning & middle of the verses | ◌ً ◌ٌ ◌ٍ | aan, aun, ain | U+064B, U+064C, U+064D |
| Sukun | Small circle shape above the letter indicating that a consonant is not followed by a vowel | دْ | d | U+0652 |
| Shadda | Small half “siin” written above the letter to indicate the reduplication | دّ | dd | U+0651 |
| Madda | Diacritic appears on top of alif indicating a long alif | آ | aa | U+0622 |
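The code points in Table I can be inspected programmatically. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the dictionary name `DIACRITICS` and the helper `describe` are illustrative and not part of the original work. It prints the official Unicode name and combining class of each diacritic, and shows that the madda entry (U+0622) is a precomposed letter that decomposes into alif plus a combining mark.

```python
import unicodedata

# Code points copied from Table I (labels are illustrative).
DIACRITICS = {
    "Fatha": "\u064E",
    "Kasra": "\u0650",
    "Damma": "\u064F",
    "Tanwin (fathatan)": "\u064B",
    "Tanwin (dammatan)": "\u064C",
    "Tanwin (kasratan)": "\u064D",
    "Sukun": "\u0652",
    "Shadda": "\u0651",
}


def describe(mark: str) -> str:
    """Return the code point, Unicode name and combining class of a mark."""
    return (f"U+{ord(mark):04X} {unicodedata.name(mark)} "
            f"(combining class {unicodedata.combining(mark)})")


for label, mark in DIACRITICS.items():
    print(f"{label:20s} {describe(mark)}")

# U+0622 is a single precomposed letter; canonical decomposition (NFD)
# splits it into the base letter alif and a combining madda mark.
MADDA_ALIF = "\u0622"
decomposed = unicodedata.normalize("NFD", MADDA_ALIF)
print([f"U+{ord(c):04X}" for c in decomposed])
```

Every diacritic in the dictionary has a non-zero combining class, which is what lets software tell the marks apart from the base letters they attach to.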


For example, the Arabic word كتب, consisting of three consonants, gives different meanings when different arrangements of diacritics are used [12]. More details are shown in Table II.

Table II: Different interpretations of the word كتب with different diacritics [12]

| Arabic word | Transliteration | Part of speech | Meaning (in English) |
| --- | --- | --- | --- |
| كَتَبَ | kataba | Verb | Wrote |
| كُتُب | kutub | Noun | Books |
| كُتِبَ | kutiba | Verb | Written |
| كَتَّبَ | kattaba | Verb | Make someone to write |
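The effect summarised in Table II can be reproduced in a few lines. This is an illustrative sketch only: the `WORDS` dictionary and the `strip_diacritics` helper are hypothetical names, and the diacritized spellings are written out from the transliterations in the table. It shows that the four words are distinct strings until the diacritics are removed, after which they collapse into the same three consonants.

```python
import unicodedata

# The four readings from Table II, spelled out code point by code point.
WORDS = {
    "kataba": "\u0643\u064E\u062A\u064E\u0628\u064E",
    "kutub": "\u0643\u064F\u062A\u064F\u0628",
    "kutiba": "\u0643\u064F\u062A\u0650\u0628\u064E",
    "kattaba": "\u0643\u064E\u062A\u0651\u064E\u0628\u064E",
}


def strip_diacritics(text: str) -> str:
    """Remove every non-spacing mark (Unicode category Mn) from text."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


stripped = {name: strip_diacritics(word) for name, word in WORDS.items()}

print(len(set(WORDS.values())))      # number of distinct diacritized forms
print(len(set(stripped.values())))   # number of distinct forms after stripping
```

Any retrieval scheme that strips diacritics before matching would therefore be unable to distinguish these four words.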

In summary, the above discussion shows that diacritical scripts are sensitive and hence require accurate representation so that data can be retrieved with its correct meaning. In addition, good representation results in efficient retrieval. The remainder of this paper is organised as follows: Section 2 contains the related work. The proposed methodology is explained in Section 3. Section 4 describes the results and discussions. Finally, the paper is concluded in Section 5.

2. Related work

As discussed previously, less attention has been paid to corpus organisation and representation of diacritical texts than to new pattern matching methods aimed at efficiency. We review the methods related to data representation and pattern matching. Standard and popular pattern matching algorithms include Boyer-Moore, KMP and Rabin-Karp [3, 13, 14]. Those algorithms have been enhanced to improve search time and accuracy [15]. Furthermore, such algorithms are limited to Latin texts only and may not be suitable for non-Latin texts like Arabic without pre-processing [16]. Alsmadi et al. [2] proposed an algorithm for searching and verifying Arabic Quran verses. The algorithm is based on a hashing approach and involves removal of diacritics, which makes the verification process questionable, because diacritics are vital in Arabic religious scripts. Alginahi et al. [17] proposed an algorithm for detecting Quranic Arabic text on websites. This approach also removes diacritics to achieve its objectives. Therefore, the retrieved data cannot be properly authenticated. In the same area, Sabbah et al. [11] and Alshareef et al. [18] also proposed methods for retrieving text based on diacritics removal. The method proposed in [10] does not rely on removal of diacritics for retrieving text from the corpus; however, it considers non-diacritic text for experimentation. Hakak et al. [19, 20] proposed an overall model for authenticating the Quran that also involves a protection phase.

Similarly, there are web engines for searching Quranic verses online [18]. For instance, tanzil.net [21], search-truth [22] and Muslim-web [23] are some examples of such search engines. They operate on full words, as well as stems and word synonyms, for retrieving text from the corpus. Our initial analysis showed that those engines were not efficient enough for retrieving queries for the Arabic Quran that involve diacritical text. When variable-length verses are applied, the performance of those search engines degrades, and in some cases they are unable to retrieve the requested verse. In addition, those three search engines require more time for verses with complex diacritics. Results based on different diacritical verses for some popular Quran search engines are shown in Table III. The main reason why some verses are retrieved while others are not is most probably inefficient text representation, which leads to inefficient retrieval. Pattern matching algorithms will be able to retrieve data more accurately and efficiently if the data organisation is improved.


Table III: Experiments on Popular Quranic Web Engines

| Chapter number (Surah) | Verses | Tanzil.net | Muslim-Web | Search truth |
| --- | --- | --- | --- | --- |
| 2 | فقلنا اضربوه ببعضها ۚ كذٰلك يحيي الله الموتىٰ ويريكم آياته | Retrieved | Retrieved | Retrieved |
| 2 | فقلنا اضربوه ببعضها ۚ كذٰلك يحيي الله الموتىٰ ويريكم آياته لعلَّكم | No Results | No Results | Retrieved |
| 2 | الم ﴿١﴾ ذٰلك الكتاب لا ريب ۛ فيه ۛ هدى للمتَّقين ﴿٢﴾ الَّذين يؤمنون بالغيب ويقيمون الصَّلاة وممَّا رزقناهم ينفقون ﴿٣﴾ والَّذين يؤمنون بما أنزل إليك وما أنزل من قبلك وبالآخرة هم يوقنون ﴿٤﴾ | No Results | No Results | No Results |
| 67 | قل هو الرَّحمٰن آمنَّا به وعليه توكَّلنا ۖ فستعلمون من هو في ضلال مُّبين ﴿٢٩﴾ قل أرأيتم إن أصبح ماؤكم غورا فمن يأتيكم بماء مَّعين ﴿٣٠﴾ | No Results | No Results | No Results |
| 114 | قل أعوذ برب النَّاس ﴿١﴾ ملك النَّاس ﴿٢﴾ إلٰه النَّاس ﴿٣﴾ من شر الوسواس الخنَّاس ﴿٤﴾ الَّذي يوسوس في صدور النَّاس | No Results | No Results | No Results |

In the light of the above discussion, most of the existing methods focus on new approaches for pattern searching to achieve time efficiency. By contrast, little attention has been given to improving text representation and organization, which also contributes to time efficiency [24, 25]. As highlighted previously, there are methods that focus on text representation and organization. However, such methods only work well by either removing diacritics or considering non-diacritical texts, which adds the overhead of authenticating the retrieved data to confirm its meaning. Those factors motivated us to propose a new idea for representing data efficiently, such that optimal time efficiency is achieved.

3. Proposed methodology

To achieve efficient representation, we propose to segment the characters in the verses/sentences, based on the fact that each character is encoded by a unique code, and to use the segmented characters to organize the data. For the purpose of segmentation, we explored the regular expression approach to remove the white spaces and segment the input verse into its individual characters. After segmenting characters, the verse is retrieved based on its first segmented character. This organization helps in retrieving the query word or verse quickly. An example of text retrieval from the corpus in the proposed framework is shown in Fig. 1.
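The organization by first character can be illustrated with a small sketch. It is not the authors' implementation: the names `build_index`, `lookup` and `CORPUS` are hypothetical, a plain dictionary stands in for the leaf nodes, and Python's built-in substring test stands in for the search performed inside a leaf node.

```python
from collections import defaultdict

# Tiny illustrative corpus (transliterations in the comments).
CORPUS = [
    "\u0643\u064E\u062A\u064E\u0628\u064E",  # kataba
    "\u0643\u064F\u062A\u064F\u0628",        # kutub
    "\u0628\u064E\u0627\u0628",              # baab
]


def build_index(verses):
    """Group verses into 'leaf nodes' keyed by their first character."""
    index = defaultdict(list)
    for verse in verses:
        text = verse.strip()
        if text:
            index[text[0]].append(text)
    return index


def lookup(index, query):
    """Search only the leaf node selected by the query's first character."""
    query = query.strip()
    if not query:
        return []
    return [verse for verse in index.get(query[0], []) if query in verse]


index = build_index(CORPUS)
hits = lookup(index, CORPUS[1])                # diacritized "kutub"
misses = lookup(index, "\u0643\u062A\u0628")   # same consonants, no diacritics
print(sorted(index), hits, misses)
```

In this sketch a query only ever touches the verses that share its first character, and the undiacritized query finds nothing because the stored verses keep their diacritics between the consonants.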

[Figure 1. Panel labels: Internet; Verse to be found; Database; Segmentation Process; Searching Phase; Search the leaf node using brute-force approach; Boyer-Moore algorithm for searching within leaf nodes; Proposed index representation, in which verses starting with each of the n characters (for example ا, ب, د) are stored in n leaf nodes.]

Fig. 1. The proposed architecture for searching Diacritical texts
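For the search inside a leaf node, the figure names the Boyer-Moore algorithm. The sketch below uses Horspool's simplification of Boyer-Moore (bad-character rule only), not the full algorithm, and is not taken from the paper; the function name `horspool_search` and the sample `leaf` list are illustrative assumptions. Because Python strings are sequences of code points, each diacritic is compared as a symbol of its own.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    # Bad-character table: for every symbol except the last one in the
    # pattern, the distance from its last occurrence to the pattern's end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        j = m - 1
        # Compare right to left, as in Boyer-Moore.
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos
        # Slide the window according to the symbol under its last position.
        pos += shift.get(text[pos + m - 1], m)
    return -1


# Hypothetical leaf node holding two verses that start with the same letter.
leaf = [
    "\u0643\u064E\u062A\u064E\u0628\u064E",  # kataba
    "\u0643\u064F\u062A\u064F\u0628",        # kutub
]
query = "\u0643\u064F\u062A\u064F\u0628"     # kutub
matches = [verse for verse in leaf if horspool_search(verse, query) != -1]
print(matches == [leaf[1]])
```

The full Boyer-Moore algorithm adds a good-suffix rule on top of this table; the simplified variant is used here only to keep the example short.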

The steps involved in the proposed approach are listed below:

Step 1: The complete Quranic texts, involving Urdu, Uthmani and plain, are sorted alphabetically.
Step 2: The verses are organized within leaf nodes based on their first character, and each leaf node is labelled with the first character of the verses contained within it. For example, the leaf node containing the verses starting with the character “ت” is labelled “ت”. This is repeated for all verses.
Step 3: The input verse to be searched is segmented into individual characters.
Step 4: The first segmented character is taken and mapped to the leaf nodes. For example, if the first segmented character is “ت”, the whole search process is limited to the leaf node labelled “ت”.
Step 5: Finally, to find the correct verse within the respective leaf node, which consists of similar verses, we used the Boyer-Moore string matching algorithm.

The complete details are given in the following sections.

3.1. Sorting Phase

In the sorting phase, the whole corpus containing diacritical text is sorted. Different characters are represented by different leaf nodes (ln). Each leaf node has a specific value based on the character that it represents. For example, the character “ب” represents the leaf node “ب”. After creation of all leaf nodes, all verses are placed within their respective leaf nodes based on their first character. After sorting the corpus, the next step is character segmentation.

3.2. Character Segmentation

The aim of character segmentation in our proposed approach is to segment connected characters into individual characters, as most compilers and programming languages cannot process connected Arabic verses accurately, resulting in poor retrieval performance. Different encoding approaches have been proposed to represent individual characters. American Standard Code for Information Interchange (ASCII) is widely used as an encoding technique for English; each character uses 7 bits, with one extra bit to handle noise. Since ASCII uses 7 bits for character representation, it can handle 2^7, i.e. 128, characters [26]. However, this encoding scheme is inadequate to handle non-Latin texts. Therefore, an 8-bit scheme, the ISO 8859 family, has been proposed by the International Organization for Standardization (ISO) [27]. Further, to handle more complex characters, the Unicode encodings UTF-8, UTF-16 and UTF-32 have been proposed [27]. As a result, we use the Unicode scheme for segmenting characters from Arabic verses, as it has variable length encoding and suits diacritical Arabic text and other complex texts. This leads us to explore an approach using regular expressions to segment characters from Arabic text. A regular expression is a sequence of characters that defines a search pattern. Many regular expression symbols represent a particular operation. For example, a question mark (?) indicates zero or one occurrence of the preceding element. Similarly, an asterisk (*) indicates zero or more occurrences of the preceding element. Regular expressions also provide a special operator written with curly brackets, {L}.
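The segmentation step of this section could be sketched as follows. This is one possible reading, not the authors' code: it assumes Python's standard `re` module, uses the pattern `\S` to drop white space and return one code point per match, and the helper names `segment`, `segment_with_marks` and `code_points` are hypothetical. The second helper re-attaches each diacritic to its base letter, in case "characters along with diacritics" is meant in that sense.

```python
import re
import unicodedata


def segment(verse: str) -> list:
    """Drop white space and return the remaining code points one by one."""
    return re.findall(r"\S", verse)


def segment_with_marks(verse: str) -> list:
    """Like segment(), but keep each combining mark with its base letter."""
    units = []
    for ch in segment(verse):
        if unicodedata.combining(ch) and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def code_points(units: list) -> list:
    """Render each segmented unit as its Unicode code points."""
    return [" ".join(f"U+{ord(c):04X}" for c in unit) for unit in units]


# "kataba kutub", written with diacritics and one space.
verse = "\u0643\u064E\u062A\u064E\u0628\u064E \u0643\u064F\u062A\u064F\u0628"
print(len(segment(verse)))
print(code_points(segment_with_marks(verse)))
```

Either form gives a sequence whose first element is the first letter of the verse, which is all the index needs in order to select a leaf node.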
The expression {L} maps each individual letter (L) to its unique Unicode number [28].

4. Experimental results

To evaluate the performance of the proposed method, we use well-known corpora, including Arabic, Uthmanic and Urdu data sets from tanzil.net (http://tanzil.net/#2:1, 2016) and hamariweb (http://www.hamariweb.com/poetries/default.aspx). The sizes of those data sets are 1.24, 1.30 and 2 MB, respectively. The performance of the proposed method is measured by calculating the search time for the data representation steps. To show the effectiveness of the proposed approach, we compare our algorithm with existing methods, which include the Quran Quote Verification (QVT) algorithm [18] and the traditional MySQL approaches that use the linear search algorithm and the binary search algorithm [29]. In our comparative study, the other algorithms were implemented using NetBeans 8.02 on an Intel i5 processor with 4 MB cache and 4 GB RAM, running Windows 10.

Table IV: Search time for different Diacritical Texts (in milliseconds)

| Different text samples | B+ Tree Index Method (MySQL) | Muslim-Web engine | Search-Truth | QVT Algorithm | Proposed Approach |
| --- | --- | --- | --- | --- | --- |
| اتل ما أوحي إليك من الكتاب وأقم الصَّلاة إنَّ الصَّلاة تنهى عن الفحشاء والمنكر ولذكر الله أكبر والله يعلم ما تصنعون | 964.2 | 2614 | 2741.25 | 980.2 | 380.2 |
| اقبال کےبارے ميں کيا کہيں ہم لوگ | 321.2 | - | - | - | 240.4 |
| من كان يريد حرث ٱلءاخرة نزد لهۥ فى حرثهۦ ومن كان يريد حرث ٱلدُّنيا نؤتهۦ منها وما لهۥ فى ٱلءاخرة من نَّصيب | 912.4 | No result | No result | 928.4 | 370.1 |

Quantitative results of the proposed and existing methods are reported in Table IV, where it is noticed that the proposed method achieved the best time efficiency compared to the existing methods. To calculate the time for retrieval, we choose random verses as queries for searching the respective corpus. The time efficiency is calculated as:

$$\text{Time efficiency} = \frac{\text{P.A retrieval time} - \text{E.A retrieval time}}{\text{E.A retrieval time}} \times 100 \qquad (1)$$
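As a worked instance of Eq. (1), the sketch below plugs in two values, assuming that the first text sample in Table IV took 964.2 ms with the B+ tree index and 380.2 ms with the proposed approach. The function name `time_efficiency` is illustrative. With the formula as written, the result is negative when the proposed approach is faster, so the size of the improvement is the magnitude of the result.

```python
def time_efficiency(proposed_ms: float, existing_ms: float) -> float:
    """Eq. (1): relative change in retrieval time, in percent."""
    return (proposed_ms - existing_ms) / existing_ms * 100


# Assumed readings for the first text sample (milliseconds).
proposed = 380.2
b_plus_tree = 964.2

change = time_efficiency(proposed, b_plus_tree)
print(f"{change:.1f}%")        # signed relative change
print(f"{abs(change):.1f}%")   # magnitude of the improvement
```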

Here, P.A denotes the proposed approach and E.A denotes the existing approaches. It is noted from Table IV, using equation (1), that the proposed approach achieved 61%, 72%, 84%, and 62% improved time efficiency compared with the B+ tree approach, Muslim-Web, Search-Truth and the QVT approach, respectively. The main reason for the poorer performance of the existing approaches is the presence of different diacritics, which increases the time complexity of the algorithm, together with the existing data-structure approach, in which all verses are stored in a serial fashion. This serial organisation of data within the corpus is another factor in the poor performance of the existing approaches. From the results, it can be concluded that the proposed combination of segmentation and representation can achieve efficient time complexity, irrespective of the corpus size, and can be extended to other languages with minimal changes.

5. Conclusion

In this work, we have proposed a new approach that involves a novel step for segmenting characters from verses and representing data using segmented characters, in order to efficiently retrieve text from diacritical texts such as digital Quranic text in plain Arabic and Uthmanic script, as well as Urdu and Farsi documents. This study has explored the regular-expression approach for segmenting characters from verses in a new way, motivated by the Unicode (UTF-16) encoding scheme. The proposed approach involves a novel indexing method using segmented characters to represent the data such that optimal time efficiency can be achieved, irrespective of corpus size and language. Experimental results on validating the indexing and data representation approach show that there is a significant improvement in the searching time compared to existing methods from the literature. To the best of our knowledge, this is the first attempt to develop a segmentation-based data representation approach for retrieving diacritical texts such as Quranic and Urdu/Farsi texts.
In the future, we aim to extend this work by calculating other evaluation parameters, including recall and precision, on a larger corpus.

Acknowledgements

This work was partly supported by the “NOOR Research Center, Taibah University, Al-Madinah Al-Munawwarah, Saudi Arabia, under Grant NRC1-126B” and University of Malaya, Malaysia under UMRG RP043A-17 HNE.

References

[1] A. Mohammed, M. S. Sunar, and M. S. H. Salam, "Quranic verses verification using speech recognition techniques," Jurnal Teknologi, vol. 73, no. 2, pp. 99-106, 2015.
[2] I. Alsmadi and M. Zarour, "Online integrity and authentication checking for Quran electronic versions," Applied Computing and Informatics, vol. 13, no. 1, pp. 38-46, 2017.
[3] S. Hakak, A. Kamsin, S. Palaiahnakote, O. Tayan, M. Y. I. Idris, and K. Z. Abukhir, "Residual-based approach for authenticating pattern of multistyle diacritical Arabic texts," PloS one, vol. 13, no. 6, p. e0198284, 2018.
[4] S. Hakak, A. Kamsin, O. Tayan, M. Y. I. Idris, A. Gani, and S. Zerdoumi, "Preserving Content Integrity of Digital Holy Quran: Survey and Open Challenges," IEEE Access, vol. 5, pp. 7305-7325, 2017.
[5] S. Hakak, A. Kamsin, O. Tayan, M. Y. I. Idris, and G. A. Gilkar, "Approaches for preserving content integrity of sensitive online Arabic content: A survey and research challenges," Information Processing & Management, 2017.
[6] E. F. Khalaf, K. Daqrouq, and A. Morfeq, "Arabic Vowels Recognition by Modular Arithmetic and Wavelets using Neural Network," Life Science Journal, vol. 11, no. 3, pp. 33-41, 2014.
[7] B. H. Hammo, "Towards enhancing retrieval effectiveness of search engines for diacritisized Arabic documents," Information retrieval, vol. 12, no. 3, pp. 300-323, 2009.
[8] A. Al-Badarneh, E. Al-Shawakfa, B. Bani-Ismail, K. Al-Rababah, and S. Shatnawi, "The impact of indexing approaches on Arabic text classification," Journal of Information Science, vol. 43, no. 2, pp. 159-173, 2017.


[9] J. Atwan, M. Mohd, H. Rashaideh, and G. Kanaan, "Semantically enhanced pseudo relevance feedback for arabic information retrieval," Journal of Information Science, vol. 42, no. 2, pp. 246-260, 2016.
[10] M. Al-Sanabani and S. Al-Hagree, "Improved An Algorithm For Arabic Name Matching," Open Transactions On Information Processing ISSN (Print): 2374–3786 ISSN (Online): 2374–3778.
[11] T. Sabbah and A. Selamat, "A framework for Quranic verses authenticity detection in online forum," in Advances in Information Technology for the Holy Quran and Its Sciences (32519), 2013 Taibah University International Conference on, 2013, pp. 6-11: IEEE.
[12] K. Kirchhoff and D. Vergyri, "Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition," (in English), Speech Communication, vol. 46, no. 1, pp. 37-51, May 2005.
[13] S. Hakak, A. Kamsin, P. Shivakumara, and M. Y. I. Idris, "Partition-based pattern matching approach for efficient retrieval of Arabic text," Malaysian Journal of Computer Science, vol. 31, no. 3, pp. 200-209, 2018.
[14] S. Faro and T. Lecroq, "The exact online string matching problem: A review of the most recent results," ACM Computing Surveys (CSUR), vol. 45, no. 2, p. 13, 2013.
[15] S. Hakak, A. Kamsin, P. Shivakumara, M. Y. I. Idris, and G. A. Gilkar, "A new split based searching for exact pattern matching for natural texts," PloS one, vol. 13, no. 7, p. e0200912, 2018.
[16] A. A. Hlayel and A. Hnaif, "An algorithm to improve the performance of string matching," (in English), Journal of Information Science, vol. 40, no. 3, pp. 357-362, Jun 2014.
[17] Y. M. Alginahi, O. Tayan, and M. N. Kabir, "Verification of qur’anic quotations embedded in online arabic and islamic websites," Int. J. Islam. Appl. Comput. Sci. Technol, vol. 1, pp. 41-47, 2013.
[18] A. Alshareef and A. El Saddik, "A Quranic quote verification algorithm for verses authentication," in Innovations in Information Technology (IIT), 2012 International Conference on, 2012, pp. 339-343: IEEE.
[19] S. Hakak, A. Kamsin, J. Veri, R. Ritonga, and T. Herawan, "A Framework for Authentication of Digital Quran," in Information Systems Design and Intelligent Applications: Springer, 2018, pp. 752-764.
[20] S. I. Hakak, A. Kamsin, M. Y. I. Idris, A. Gani, G. Amin, and S. Zerdoumi, "Diacritical Digital Quran Authentication Model."
[21] http://tanzil.net/#2:1. Retrieved on 2nd January 2016.
[22] http://www.searchtruth.com/. Retrieved on 2nd December 2017.
[23] http://quran.muslim-web.com/. Retrieved on 15th November 2017.
[24] B. Hammo, A. Sleit, and M. El-Haj, "Effectiveness of query expansion in searching the Holy Quran," in The Second International Conference on Arabic Language Processing, Morocco, 2007, pp. 1-10.
[25] S. Nisha, N. Ali, and A. Shawkat Ali, "Searching quranic verses: A keyword based query solution using .net platform," in Information and Communication Technology for The Muslim World (ICT4M), 2014 The 5th International Conference on, 2014, pp. 1-5: IEEE.
[26] T. McEnery, R. Xiao, and Y. Tono, "Corpus-based language studies: An advanced resource book," ed: Taylor & Francis, 2000.
[27] A. McEnery and R. Xiao, "Character encoding in corpus construction," in "Developing Linguistic Corpora: A Guide to Good Practice," AHDS, Oxford, 2005.
[28] L. Ilie, "Regular Expression Matching," in Encyclopedia of Algorithms: Springer, 2008, pp. 1-99.
[29] J. Greenspan and B. Bulger, MySQL/PHP database applications. John Wiley & Sons, Inc., 2001.
