Author’s Accepted Manuscript A7'ta: Data on A Monolingual Arabic Parallel Corpus for Grammar Checking Nora Madi, Hend S. Al-Khalifa
www.elsevier.com/locate/dib
PII: DOI: Reference:
S2352-3409(18)31539-7 https://doi.org/10.1016/j.dib.2018.11.146 DIB3561
To appear in: Data in Brief Received date: 28 August 2018 Revised date: 29 November 2018 Accepted date: 30 November 2018 Cite this article as: Nora Madi and Hend S. Al-Khalifa, A7'ta: Data on A Monolingual Arabic Parallel Corpus for Grammar Checking, Data in Brief, https://doi.org/10.1016/j.dib.2018.11.146 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
A7'ta: Data on A Monolingual Arabic Parallel Corpus for Grammar Checking Nora Madi, Hend S. Al-Khalifa Department of Information Technology, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia
Contact email:
[email protected],
[email protected]
Abstract Grammar error correction can be considered as a “translation” problem, such that an erroneous sentence is “translated” into a correct version of the sentence in the same language. This can be accomplished by employing techniques like Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Producing models for SMT or NMT for the goal of grammar correction requires monolingual parallel corpora of a certain language. This data article presents a novel monolingual parallel corpus of Arabic text called A7'ta ()أخطاء. It contains 445 erroneous sentences and their 445 error-free counterparts. This is the first Arabic parallel corpus that can be used as a linguistic resource for Arabic natural language processing (NLP) mainly to train sequence-to-sequence models for grammar checking. Sentences were manually collected from a book that has been prepared as a guide for correctly writing and using Arabic grammar and other linguistic features. Keywords: Error Checking, Arabic language, NLP, parallel corpus Specifications Table Subject area More specific subject area Type of data How data was acquired Data format Experimental factors
Experimental features Data source location Data accessibility
Computer Science Computational linguistics, natural language processing, Linguistic Error Checking corpus. Arabic language. Text files, figures Corpus was manually extracted from the book اللغوية- الصحافة السعودية ً( أنموذجاLinguistic Error Detector – Saudi Press as a Sample) [1]. Raw Texts are manually converted to text format. Texts are cleaned by removing extra spaces and unnecessary punctuation. Text files are divided into folders according to the book’s categorizations. The data contains 300 documents, 445 erroneous sentences and their error-free counterparts, and a total of 3,532 words. Saudi Arabia Data is with this article and can be downloaded from
https://github.com/iwan-rg Value of the Data There are no monolingual Arabic parallel corpora for linguistic error correction [2]. This data could be used for training machine learning models (i.e. classifiers, neural networks, etc.) to recognize linguistic errors in Arabic text [3], [4]. Extracted sentences can be used as samples in learning the Arabic language and some of its linguistic properties [1]. This data can be used either as a parallel corpus or turned into an error-annotated corpus. Data The corpus is a collection of Modern Standard Arabic (MSA) sentences (and words) extracted from the book كشاف األخطاء اللغوية- ً( الصحافة السعودية أنموذجاLinguistic Error Detector – Saudi Press as a Sample) [1]. The corpus contains about 445 erroneous sentences and their 445 error-free counterparts structured in text files. Each pair of sentences differs in one or more words. There are 300 documents and a total of 3,532 words. Data is presented in Figures 1 and 2, and at https://github.com/iwan-rg. Experimental Design, Materials, and Methods The created corpus is based on a book called كشاف األخطاء اللغوية- ً( الصحافة السعودية أنموذجاLinguistic Error Detector – Saudi Press as a Sample) [1]. One of the main goals of this book was to provide a simple guide for those who care about preserving and maintaining the proper form of written Arabic especially those in the fields of journalism and media in general. The book analyzed five Saudi newspapers: Al-Jazirah [5], Al Riyadh [6], Al Watan [7], Asharq Al-Awsat [8], and Al Shorouq [9]. Writing errors were manually mined from the newspapers and categorized into eight main categories: syntax errors, morphological errors, semantic errors, linguistic errors, stylistic errors, spelling errors, punctuation errors, and the use of informal as well as borrowed words. Each category of errors in the book contains sub-categories, which in turn consist of one or more specific types of errors. Further, each type of error is defined by three points (Figure 1): 1. The Error: Defines what the error is, along with one or more examples from the newspaper articles. 2. The Correction: Defines how the correct version should be and corrects the examples listed in “The Error” point. 3. The Rule(s): Explains the linguistic rule(s) that this type of error adheres to.
Figure 1 Sample Page from "Linguistic Error Detector" Book
Sentences were manually extracted from the electronic version of the book and organized into text files (Figure 2). The extraction process: 1. We created a folder for each of the eight main categories. 2. Within each folder, we created a sub-folder for each sub-category within the main category if any. 3. Inside each main folder or sub-folder, folders were created for each type of error. 4. Within each error type folder, two files were created; one for the correctly written sentences ( )الصوابand another for the erroneous sentences ()الخطأ. Since correct sentences were put in a separate file from the erroneous sentences, a model could be trained using only the correct sentences and another model could be trained on annotated erroneous sentences depending on the task. Also, since each pair of sentences differs in one or more words, it would be simple to compare each sentence in a pair and label each erroneous word.
Figure 2 General Corpus Structure
Acknowledgments We would like to thank the authors of the book كشاف األخطاء اللغوية- ًالصحافة السعودية أنموذجا (Linguistic Error Detector – Saudi Press as a Sample) [1] for their efforts in producing it and making it available. We would also like to acknowledge the support provided for the publication of this data article by IWAN Research Group, Saudi Arabia. References [1] A. Aljindi et al., )أنموذجاً السعودية الصحافة( اللغوية األخطاء كشاف. Riyadh: Princess Noura Bint Abdul Rahman University. [2] W. Zaghouani, “Critical Survey of the Freely Available Arabic Corpora,” arXiv Prepr. arXiv1702.07835, 2017. [3] N. Tomeh, N. Habash, R. Eskander, and J. Le Roux, “A Pipeline Approach to Supervised Error Correction for the QALB-2014 Shared Task,” in EMNLP 2014 Workshop on Arabic Natural Langauge Processing (ANLP), 2014, pp. 114–120. [4] A. Almahairi, K. Cho, N. Habash, and A. Courville, “First Result on Arabic Neural Machine Translation,” arXiv Prepr. arXiv1606.02680, 2016. [5] “الجزيرة جريدة.” [Online]. Available: http://www.al-jazirah.com/. [Accessed: 17-Aug-2018]. [6] “الرياض جريدة.” [Online]. Available: http://www.alriyadh.com/. [Accessed: 17-Aug-2018]. [7] “الين أون الوطن.” [Online]. Available: http://www.alwatan.com.sa/Default.aspx?issueno=6532. [Accessed: 17-Aug-2018]. [8] “األوسط الشرق.” [Online]. Available: https://aawsat.com/. [Accessed: 17-Aug-2018]. [9] “الشروق بوابة.” [Online]. Available: http://www.shorouknews.com/. [Accessed: 17-Aug2018].