An innovative approach to autocorrecting grammatical errors in Arabic texts


Accepted Manuscript

An Innovative Approach to Autocorrecting Grammatical Errors in Arabic Texts

Chouaib Moukrim, Abderrahim Tragha, El Habib Benlahmer, Tarik Almalki

PII: S1319-1578(18)31001-2
DOI: https://doi.org/10.1016/j.jksuci.2019.02.005
Reference: JKSUCI 581
To appear in: Journal of King Saud University - Computer and Information Sciences
Received Date: 25 September 2018
Revised Date: 3 February 2019
Accepted Date: 5 February 2019

Please cite this article as: Moukrim, C., Tragha, A., Benlahmer, E.H., Almalki, T., An Innovative Approach to Autocorrecting Grammatical Errors in Arabic Texts, Journal of King Saud University - Computer and Information Sciences (2019), doi: https://doi.org/10.1016/j.jksuci.2019.02.005

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Available online at www.sciencedirect.com

Chouaib MOUKRIM (a), Abderrahim TRAGHA (a), El Habib BENLAHMER (a), Tarik ALMALKI (b)

(a) Faculty of Science Ben M'sik, Hassan II University, Casablanca, Morocco
(b) Faculty of Literature Ben M'sik, Hassan II University, Casablanca

Acknowledgment

I would like to express my sincere thanks to my PhD supervisor, Professor Abderrahim TRAGHA of Hassan II University, who gave me the opportunity to carry out this research, as well as to all the members of the Laboratory of Information Technologies and Modeling (LTIM).

Corresponding author details:
Mr. Chouaib MOUKRIM
Faculty of Science Ben M'sik, Hassan II University
Casablanca, Morocco
Tel: +212604193324
E-mail: [email protected]


Abstract

Natural Language Processing (NLP) has been a growing area of research in computer and cognitive sciences, using experimental approaches. Morphology and syntax in particular play a vital role in the correct interpretation of a sentence. In this paper, we present a syntactic error correction system based on the automatic generation of correct sentences in Arabic. First, we extract the words from the sentence under consideration; we then generate all the possible syntactically correct sentences, based on a logical description of the rules of Arabic grammar in an ontology. Afterwards, we compare the original sentence with the generated sentences to detect any errors, and the correction phase follows. If the system finds no sentence similar to the original one, correct alternative sentences are offered automatically. An Arabic syntactic corrector can increase productivity and improve the quality of text for anyone who writes in Arabic. The system was tested on a set of Arabic sentences and achieved a precision of about 92% and a recall of about 84%. These results indicate that our approach is promising.

Keywords: Arabic, syntactic errors, natural language processing, ontology.


1. Introduction

Research in natural language processing addresses several levels of analysis: morphology, which defines the structure of words [1, 2]; syntax, which determines the composition of sentences [3, 4]; and semantics, which determines meaning [5, 6]. Applications such as machine translation, information extraction and automatic text summarization can exploit these analyses. However, such software is vulnerable to input in which the relations between the words of a sentence are syntactically incorrect, which can lead to incorrect results. An effective automatic correction system is therefore required. Most of the work in this field deals with the spelling level [7]: it simply verifies that words exist in the dictionary and cannot detect syntactic errors. For Arabic, which is among the most widely used languages on the Web, research on this type of error remains limited (e.g., the Arabic GramCheck developed for Modern Standard Arabic [8]). The difficulty of correcting grammatical errors in Arabic arises at several levels: the complexity and richness of the language; the absence of vowels in most texts; the irregularity of word order in the construction of sentences; inflection (word endings depend on case: nominative, accusative, genitive, etc.); agglutination; and other problems of morphological parsing. All these factors hinder the automatic processing of errors. The objective of our work is to design a new approach capable of automatically processing syntactic errors in Arabic. The word 'processing' can be defined as any algorithmic manipulation of an input (in this case, linguistic signals) for various purposes, such as categorization, comprehension, production, translation, etc.
In our case, processing aims to transform existing linguistic data in order to detect and correct ungrammatical elements by generating sentences from words. The word 'automatic' means being capable of detecting or correcting errors independently, without human participation, which imposes serious constraints on the corresponding computations. The linguistic data must be represented in a fully explicit, coherent and operational way, which requires appropriate formalisms and computing techniques. The process can be automated entirely or only partially: the user can choose between a purely automatic correction and a semi-automatic one, referred to in the field as a computer-assisted system. The remainder of this paper is structured as follows. Section 2 presents the origin of the grammatical formalization of languages and previous related work. Section 3 describes the syntactic approach that we have adopted and the dictionary used. Section 4 explains the domain ontology used. Section 5 presents our approach, which Section 6 illustrates with an example. Section 7 is devoted to the performance evaluation of the resulting system. Finally, Section 8 summarizes the work and presents our conclusions.

2. The origin of grammatical formalization of languages and related work

2.1. The grammatical formalization of languages

The convergence of interest of several scientists (linguists, mathematicians, logicians and computer scientists) was at the origin of modern formal grammar in the mid-fifties. Their objective was to describe the functioning of language (conceived as representative of the functioning of the human mind) in the manner of a machine that processes various kinds of information. The initiators of this trend sought to characterize the mathematical structures of language. Research by Z. Harris in 1968 [9] and N. Chomsky in 1956 can be cited on the relationship between grammar theory and automata theory [10]. Chomsky's later studies of 1959 and 1963 (the latter reprinted in [11]) on the mathematical properties of various classes of formal grammars are also referenced [12].

2.2. Related work

Some systems deal with syntactic errors. We can start with the work on grammatical error correction by K. Knight and I. Chander (1994), who developed a statistical method for correcting article and preposition errors [13]. These are among the most difficult errors to recognize and correct automatically; indeed, they represent about 13.5% of the errors in the Cambridge Learner Corpus [14]. Subsequent work has focused on designing better features and testing different classifiers, including decision tree learning [15] and logistic regression, such as Lee's system [16], which derives syntactic features from a statistical Penn Treebank parser and semantic features from a large handcrafted ontology (WordNet). Recent work has shown that training on annotated learner text may yield better results [17]. Nevertheless, many studies report results on fully error-annotated corpora, such as Gamon [18] and Dahlmeier & Ng [19]. In general, these studies report low precision and recall.

To our knowledge, there is very little research on the correction of grammatical errors in Arabic, except for a study by K. Shaalan, who developed, using Prolog, an Arabic GramCheck for certain common grammatical errors [8]. The initial purpose of this tool is to detect the error, show the user the syntactic rule violated by the ungrammatical sentence and possibly offer suggestions for improvement. The present study is the first attempt to use the rules of Arabic grammar encoded in an ontology to correct the majority of syntactic errors and to propose several suggestions. For other languages such as English, such systems (e.g., Grammarly, Ginger) already exist.

3. The syntactic approach and the dictionary adopted

3.1. The syntactic approach adopted

There are several formalisms for representing the parse of a text. Almost all of the literature, however, deals with two syntactic representation formats: constituency and dependency. The syntactic approach adopted in this article is inspired by the dependency grammar (DG) founded by L. Tesnière [20] and is based on predicate logic. We propose a linguistic reading of traditional Arabic grammar, interpreted as a symbolic description that is finally given a computational form in the ontology. To achieve this objective, we express the grammatical data as a quadruplet (GC, R, OP, Ax), where GC denotes the grammatical categories, R a set of grammatical relations, OP an operation and Ax a set of axioms. In this framework, a sentence is defined as a syntactic network that can be expressed by the following formula:

$$(\forall x, y \in GC)\ /\ S = \bigwedge_{i=1}^{n} R_i(x, y) \tag{1}$$

where x and y represent words and R_i a grammatical relation. For example, the relation 'Subject' is established between a verb and a noun:

$$(\exists x \in \mathrm{Verb},\ \forall y \in \mathrm{Noun})\ /\ \mathrm{Subject}(x, y) \tag{2}$$

We have noticed that dependency parsing allows easy machine processing, facilitating supervised learning and the application of classical algorithms [21]. Indeed, a dependency tree structures information hierarchically: each word is linked to the headword on which it depends. Unlike constituency-based parsing, where the number of phrases representing the sentence cannot be predicted in advance, each dependency analysis contains a fixed number of representational elements. Because each word has only one head, a dependency parse contains exactly one element of representation for each word.
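The single-head property described above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: the `Relation` class, the helper names and the transliterated toy sentence are all assumptions introduced for illustration, and the argument order follows formula (2), with the verb as the first argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str       # grammatical relation, e.g. "Subject"
    governor: str   # x in R(x, y): the head word
    dependent: str  # y in R(x, y): the word that depends on x


def head_map(relations):
    """Return {dependent: head}, rejecting any word that has two heads."""
    heads = {}
    for rel in relations:
        if rel.dependent in heads:
            raise ValueError(f"{rel.dependent!r} has more than one head")
        heads[rel.dependent] = rel.governor
    return heads


def roots(relations):
    """Words that govern other words but have no head themselves."""
    heads = head_map(relations)
    return {rel.governor for rel in relations} - set(heads)


# Transliterated toy sentence ("Ahmed ate the apple"); labels are illustrative.
sentence = [
    Relation("Subject", "akala", "Ahmad"),
    Relation("Object", "akala", "al-tuffaha"),
]
heads = head_map(sentence)
```

A sentence in the sense of formula (1) is then simply the list of relations, and the number of entries in `heads` equals the number of non-root words.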

3.2. The dictionary used

Because the organization of the dictionary is an essential step in the whole process of generating sentences, we organized our dictionary as tables in a database containing about 6000 roots. We chose Arramooz Alwaseet [22], an open-source Arabic dictionary. It is generated from Ayaspell (an Arabic spellchecker), and its data is collected manually. This dictionary consists of three parts:

• Stop words
• Nouns (see Table 1)
• Verbs (see Table 2)

The dictionary contains more than 50,000 words, covering more than 10,000 verbs and 40,000 nouns as well as dozens of particles and syntactic tools.

Table 1 - Description of table « Nouns »

| Field | Description |
|---|---|
| vocalized | vocalized word |
| unvocalized | unvocalized word |
| wordtype | word type (noun of subject, noun of object, …) |
| root | word root |
| feminable | the word accepts Teh_marbuta |
| defined | the word is defined or not |
| gender | the word gender |
| number | the word is single, dual or plural |
| single | the single form of the word |
| dualable | accepts the dual suffix |
| feminine | the feminine form of the word |
| masculine | the masculine form of the word |
| masculin_plural | accepts the masculine plural |
| feminin_plural | accepts the feminine plural |
| broken_plural | the irregular plural, if it exists |
| mamnou3_sarf | does not accept tanwin |
| k_suffix | accepts the Kaf suffix |
| … | … |

Table 2 - Description of table « Verb »

| Field | Description |
|---|---|
| vocalized | vocalized word |
| unvocalized | unvocalized word |
| root | root of the verb |
| future type | the future mark, used only for triliteral verbs |
| triliteral | the verb is triliteral (3 letters) or not |
| transitive | transitive or not |
| double_trans | has double transitivity (two objects) |
| think_trans | the verb is transitive to a human |
| unthink_trans | the verb is transitive to a non-human being |
| reflexive_trans | pronominal verb |
| past | can be conjugated in the past tense |
| future | can be conjugated in the present and future tenses |
| passive | can be conjugated in the passive voice |
| … | … |
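To show how such tables can be queried, here is a minimal sketch using Python's built-in `sqlite3` module. It is an assumption-laden illustration, not the dictionary's actual storage: the table name `nouns`, the in-memory database and the single sample row (a broken plural meaning "boys" with its singular "boy") are hypothetical, and only a handful of the fields from Table 1 are included.

```python
import sqlite3

# Hypothetical in-memory table using a few of the "Nouns" fields from Table 1.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nouns ("
    "vocalized TEXT, unvocalized TEXT, root TEXT, number TEXT, single TEXT)"
)
# Illustrative entry: broken plural "boys" and its singular "boy".
ENTRY = ("أَوْلَاد", "أولاد", "ولد", "جمع تكسير", "وَلَد")
conn.execute("INSERT INTO nouns VALUES (?, ?, ?, ?, ?)", ENTRY)
conn.commit()


def singular_of(unvocalized):
    """Return the singular form stored for an unvocalized noun, or None."""
    row = conn.execute(
        "SELECT single FROM nouns WHERE unvocalized = ?", (unvocalized,)
    ).fetchone()
    return row[0] if row else None
```

Calling `singular_of(ENTRY[1])` returns the value stored in the `single` field of the sample row; an unknown word yields `None`.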

4. Domain ontology used

The Semantic Web came with new practices for organizing web content and a new infrastructure that enables software agents to help Internet users effectively in their access to information sources and services. The aim is a smart Web, where information is not just stored but understood by computers in order to bring relevant answers to the user. XML makes it possible to indicate the logical organization of the content of a document but does not give the information any semantics. An ontology annotates this information in order to endow it with a meaning that can be interpreted by the computer. This is precisely the role of the RDF and RDF-S layers in the Semantic Web architecture. OWL is an extension of RDF Schema, itself based on RDF. It defines a rich vocabulary for describing ontologies. The OWL language is divided into three sub-languages, depending on the level of expressiveness required:

• OWL Lite: a subset of OWL that expresses classification and simple relationships between classes. This sub-language cannot express complex constraints on classes or associations.
• OWL DL (Description Logic): this allows a higher level of expressiveness while maintaining completeness and decidability (all calculations must be completed in a limited time). This subset relies on the characteristics of description logic to include useful properties of reasoning systems.
• OWL Full: this sub-language offers maximum expressiveness but without any computational guarantee.

OWL therefore allows an ontology to extend the meaning of the predefined vocabulary. To achieve this goal, we constructed the Arabic grammar on the language of set theory, which can be used to define nearly all mathematical objects, and borrowed some of its theoretical principles. We chose a domain ontology, which describes a particular domain (here, Arabic syntax) by defining classes and relations (properties), because of the remarkable convergence between this type of ontology and mathematical set theory. The Ontology of Arabic Syntax (OAS) [29] is a data model representing a set of concepts within the domain of Arabic syntax, as well as the relationships among these concepts. OAS can be represented by a graph governed by axioms, whose nodes are concepts or classes and whose arcs denote their properties:

$$OAS = \{C, R\} \tag{3}$$

where C represents the concepts and R the relations. The main purpose of OAS is to provide software agents with an artificial linguistic intelligence for reasoning about the objects of Arabic syntactic structure, allowing machines to 'understand' the constituents of the Arabic sentence. The OAS concepts represent grammatical or linguistic categories, while the ontological relationships refer to the various syntactic links between these grammatical categories. Each grammatical class constitutes a set, in other words a well-defined syntactic category, whose elements are related by grammatical functions (Fig. 1). The grammatical concepts, which are organized as a hierarchical tree, represent classes in the ontological sense of the term, while the hierarchical grammatical relations represent properties. We distinguish two types of relationships: the dependency relationship “علاقة عاملية”, which links words and sentences, and the functional relationship “علاقة وظيفية”, which assigns functional features to words and sentences.


Fig. 1. Graph of classes and properties

Fig. 2. Properties and Classes

The ontology is created and implemented using the Protégé tool [23] to edit the Arabic grammar rules in OWL 2 (Web Ontology Language 2), which is recommended by the W3C (World Wide Web Consortium) and based on RDF (Resource Description Framework) [24], to which it adds several aspects such as Boolean connectives, sub-property chains and qualified cardinality restrictions. Our ontology is organized as a set of Arabic grammar rules that provide mechanisms for describing groups of similar resources ('classes') and the relations between these resources ('properties') (Fig. 2). An inheritance system allows each ontological entity to inherit the descriptive properties and axioms of the entity in which it is included; for instance, the class 'Defined_noun' inherits the syntactic and functional characteristics of the class 'Noun'. To define the grammatical classes properly, we rely on the logical description offered by OWL 2, which provides the appropriate means. For example, we defined “nominative noun - اسم_مرفوع” by the expression in Fig. 3.

Fig. 3. Description of “nominative noun - اسم_مرفوع”

Here 'AND' denotes intersection and 'NOT' logical negation. 'SOME' in the example (Fig. 3) means that the case-ending property takes at least one of its values from the jussive marker 'علامة_الجزم' (there exists, ∃), whereas 'ONLY' means that all (∀) of its values are taken from the nominative marker 'علامة_الرفع'. To illustrate this, Tables 3, 4 and 5 show the symbolic and computational descriptions in Protégé for the noun, the verb and the particle, respectively.

Table 3 - Symbolic and computational description of the “Noun”

| Symbolic description | Computational description in Protégé | Interpretation |
|---|---|---|
| ¬(∃x ∈ Noun ∣ Its_case(x) = jussive_marker) ≡ (∀x ∈ Noun ∣ ¬(Its_case(x) = jussive_marker)) | not (Its_case some jussive_marker) | Does not accept the jussive case “الجزم” |
| ¬(∃x ∈ Noun ∣ Its_tense(x) = Tense) ≡ (∀x ∈ Noun ∣ ¬(Its_tense(x) = Tense)) | not (Its_tense some Tense) | Does not accept tense (past, present, future) |
| (∀x ∈ Noun ∣ Its_gender(x) = Gender) | (Its_gender only Gender) | Accepts a gender |
| (∀x ∈ Noun ∣ Its_pattern(x) = Pattern) | (Its_pattern only Pattern) | Accepts a pattern |

Table 4 - Symbolic and computational description of the “Verb”

| Symbolic description | Computational description in Protégé | Interpretation |
|---|---|---|
| ¬(∃x ∈ Verb ∣ Its_case(x) = Genetive_marker) ≡ (∀x ∈ Verb ∣ ¬(Its_case(x) = Genetive_marker)) | not (Its_case some Genetive_marker) | Does not accept the genitive case “Genetive_marker” |
| ¬(∃x ∈ Verb ∣ Its_gender(x) = Gender) ≡ (∀x ∈ Verb ∣ ¬(Its_gender(x) = Gender)) | not (Its_gender some Gender) | Does not accept gender |
| (∀x ∈ Verb ∣ Its_tense(x) = Tense) | (Its_tense some Tense) | Accepts tense |
| (∀x ∈ Verb ∣ Its_pattern(x) = Pattern) | (Its_pattern only Pattern) | Accepts a pattern |

Table 5 - Symbolic and computational description of the “Particle”

| Symbolic description | Computational description in Protégé | Interpretation |
|---|---|---|
| ¬(∃x ∈ Particle ∣ Its_pattern(x) = Pattern) ≡ (∀x ∈ Particle ∣ ¬(Its_pattern(x) = Pattern)) | not (Its_pattern some Pattern) | Does not accept a pattern |
| ¬(∃x ∈ Particle ∣ Its_tense(x) = Tense) ≡ (∀x ∈ Particle ∣ ¬(Its_tense(x) = Tense)) | not (Its_tense some Tense) | Does not accept tense |
| ¬(∃x ∈ Particle ∣ Its_gender(x) = Gender) ≡ (∀x ∈ Particle ∣ ¬(Its_gender(x) = Gender)) | not (Its_gender some Gender) | Does not accept gender |
| (∀x ∈ Particle ∣ Its_indeclinable(x) = Indeclinable) | (Its_indeclinable only Indeclinable) | Accepts a static case ending |
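The difference between the 'some' and 'only' restrictions used in these descriptions can be sketched in a few lines of Python. This is only an illustration of the existential and universal readings, not a reasoner and not the authors' code: the marker names (`sukun`, `damma`, `dammatan`) and the shape of the "nominative noun" test are assumptions, since the exact expression of Fig. 3 is not reproduced in the text.

```python
def some(values, allowed):
    """Existential reading of 'some': at least one value is in the allowed set."""
    return any(v in allowed for v in values)


def only(values, allowed):
    """Universal reading of 'only': every value is in the allowed set.

    As with a universal restriction, this is vacuously true when there are no values.
    """
    return all(v in allowed for v in values)


# Hypothetical marker sets, named for illustration only.
JUSSIVE_MARKERS = {"sukun"}
NOMINATIVE_MARKERS = {"damma", "dammatan"}


def looks_like_nominative_noun(case_markers):
    """Assumed shape: not (Its_case some jussive) and (Its_case only nominative)."""
    return (not some(case_markers, JUSSIVE_MARKERS)
            and only(case_markers, NOMINATIVE_MARKERS))
```

Note that `only` accepts an empty list of markers, which mirrors the fact that a universal restriction alone does not require a value to exist.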

The OAS ontology is exploited through a system of queries written in SPARQL. Much as with SQL queries, the user can access the OAS database via this RDF query language. The following example illustrates how to query OAS. Assume that we want to determine the syntactic relationships whose governed element is “nominative noun - اسم_مرفوع”:

$$R(x, \text{اسم\_مرفوع}) \tag{4}$$

Fig. 4 shows the possible relations satisfying (4) and their governors.

Fig. 4. SPARQL query of R(x, اسم_مرفوع)

If we want to restrict the governor x to be a particle “حرف”:

$$R(x, \text{اسم\_مرفوع}) \land \text{حرف}(x) \tag{5}$$

The result is:

Fig. 5. SPARQL query of R(x, اسم_مرفوع) ∧ حرف(x)

5. The adopted method

The adopted method combines the information from the Alkhalil parser [2] and the aforementioned dictionary, gathering information about roots, nouns, verbs, etc., as well as some morphological rules. The method is divided into three phases (see Fig. 6).

Fig. 6. The three phases of the adopted method:
• Segmentation: Original_Sentence["word1", "word2", "word3", ...]
• Generation of sentences: List_Sentences["Sentence1", "Sentence2", "Sentence3", ...]
• Detection & correction: Comparison(List_sentences, original_Sentence)

5.1. The segmentation phase

Segmenting Arabic sentences is a complicated problem. Arabic uses neither capital letters nor regular punctuation, which makes classical segmentation methods inappropriate for this language. Moreover, the agglutination of words is another peculiarity of Arabic that makes segmentation even more difficult [25]. We perform segmentation in two steps: first, the text is segmented into sentences; second, each sentence is segmented into words. The segmentation of the text into sentences is carried out by the Software Architecture for Arabic language processing (SAFAR) platform [26], which contains an Arabic text segmenter based on the contextual exploration of punctuation marks, of connector words acting as sentence separators and of certain particles, such as coordinating conjunctions. The sentence processor is an application that splits a text into sentences, then normalizes and transliterates them. The segmentation of a sentence into words is based on the detection of spaces, punctuation marks and certain special characters. The SAFAR platform offers several methods for tokenization, defined as the process of splitting a text into elements (words). Furthermore, during the segmentation phase, the system needs to know, in all cases, the category of each word. The segmentation of a sentence can be seen as an operation whose argument is the sentence and whose result is a set of distinct words {w0, w1 ... wn} (see Fig. 7).

Fig. 7. The segmentation phase: sentence → Segmentation → {w0, w1 … wn}
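The two-step segmentation can be approximated with regular expressions. The sketch below is a deliberately naive stand-in, not the SAFAR API: the function names and the separator sets are assumptions, it splits only on punctuation and whitespace, and it does not handle connector words or agglutinated clitics, which the paper identifies as the hard part.

```python
import re

# Assumed sentence terminators: Latin and Arabic punctuation plus line breaks.
SENTENCE_END = re.compile(r"[.!?؟؛\n]+")
# Assumed word separators: whitespace and a few punctuation characters.
TOKEN_SEP = re.compile(r'[\s،,;:()"«»]+')


def split_sentences(text):
    """Step 1: split a text into sentences on terminating punctuation."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]


def split_words(sentence):
    """Step 2: split one sentence into words {w0, w1 ... wn}."""
    return [word for word in TOKEN_SEP.split(sentence) if word]


text = "خرج الولد. أكل أحمد التفاحة؟"
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]
```

A production segmenter would replace both regular expressions with a morphological tokenizer, since prefixes such as the definite article remain attached to their noun here.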


5.2. Sentence generation phase

A sentence is developed in two steps.

Step 1: Categorization

Categorization associates a set of syntactic features (Number, Gender, Person, ...) with each word obtained in the segmentation phase (Fig. 8).

Fig. 8. The categorization step: Wn → Categorization → feature matrix of Wn (Cat: Noun/Verb/...; Gender: female/male; Number: plural/singular; …)

Categorization provides two types of information: the first relates to the Lexical Features (LF) of the category to which the word belongs, and the second specifies the Functional Features (FF) of the word concerned. Each lexical category comes with distinct features: if the word is a verb, it can take only verbal features (tense, transitivity, grammatical form, etc.); if it is a noun, the categorization process associates it with nominal features (Number, Gender, etc.). This information forms two disjoint sets:

$$GC = FF \cup LF \tag{6}$$

We used the morphosyntactic parser of Standard Arabic words AlKhalil Morpho Sys 2 [2], open-source software developed in the object-oriented programming language Java. It performs a morphological analysis that identifies, for each word of the Arabic text taken out of context, its possible morphosyntactic labels; it can therefore treat non-vocalized texts as well as partially or fully vocalized ones. AlKhalil provides morphosyntactic information such as the possible vocalizations of the word, the affixes added to the stem (prefix and suffix), the stem, the nature of the word (noun, verb or particle) and, for nouns and verbs, the pattern (الوزن), the root, the POS tags, etc. With the results of AlKhalil alone, our system still cannot identify the different forms of Arabic words. For this reason, we use the dictionary to help recognize the different forms of a word. For example, to recover the singular “the boy - الولد” from the plural “the boys - الأولاد”, we can query the dictionary, which returns an entry containing the singular form:

    <vocalized>أَوْلَاد</vocalized>
    <unvocalized>أولاد</unvocalized>
    <root>ولد</root>
    <number>جمع تكسير</number>
    <single>وَلَد</single>
    …

The number of results provided is very large; moreover, the AlKhalil morphological parser produces duplicate results for our system, which is why we need to examine them and remove the duplicates in order to keep only the information we need. We first extract the category of the word concerned, using isVerb(), isNoun() and isParticle().

• If it is a verb, we use only the following syntactic features: Type, Transitive, Impartial, Prefix, Suffix and Tense.
• If it is a noun, we use only the following syntactic features: Type, Gender, Number, Prefix, Suffix and Definiteness.
• If it is a particle, we use only the syntactic feature Type.

Table 6 shows an example of the categorization of the word “كتب = write/written/books”.

Table 6 - The categorization of the word “كتب - write”

| Categories | Syntactic features | Output1 | Output2 |
|---|---|---|---|
| Verb | Type | past active verb | past passive verb |
| | Transitive | yes | yes |
| | Prefix | # | # |
| | Suffix | # | # |
| Noun | Type | Non-derivative noun | Verbal noun |
| | Gender | masculine | feminine |
| | Number | plural | singular |
| | Prefix | # | ك |
| | Suffix | # | # |
| Particle | Type | # | # |
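The filtering and deduplication of parser output described in this step can be sketched as follows. This is an illustration under stated assumptions, not the AlKhalil API: the input is taken to be a list of plain dictionaries with a `category` key, the feature names are lower-cased versions of those listed above, and `#` is used for a missing value as in Table 6.

```python
# Features retained per category, following the lists given in the text.
FEATURES = {
    "verb": ("type", "transitive", "impartial", "prefix", "suffix", "tense"),
    "noun": ("type", "gender", "number", "prefix", "suffix", "definiteness"),
    "particle": ("type",),
}


def categorize(analyses):
    """Keep only the relevant features of each analysis and drop duplicates."""
    seen = set()
    result = []
    for analysis in analyses:
        category = analysis["category"]
        keys = FEATURES[category]
        # Reduce the analysis to a hashable tuple so duplicates can be detected.
        reduced = (category,) + tuple(analysis.get(k, "#") for k in keys)
        if reduced not in seen:
            seen.add(reduced)
            result.append(dict(zip(("category",) + keys, reduced)))
    return result


# Hypothetical parser output: two analyses differ only in an ignored field.
raw = [
    {"category": "verb", "type": "past active verb", "transitive": "yes", "vocalized": "a"},
    {"category": "verb", "type": "past active verb", "transitive": "yes", "vocalized": "b"},
    {"category": "noun", "type": "Non-derivative noun", "gender": "masculine", "number": "plural"},
]
unique = categorize(raw)
```

Because the ignored `vocalized` field is dropped before comparison, the two verb analyses collapse into one entry while the noun analysis is kept.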

Step 2: Merge (syntactic logic from the rules described in the ontology)

After determining all the necessary syntactic information, we proceed to the construction of the sentences. Words are combined by the merge operation, forming a set of ordered pairs (x, y) that conform to the axioms controlling the formation of pairs. The set of these pairs then forms a simple sentence of the form:

$$\bigwedge_{i=1}^{n} R_i(x, y) \tag{7}$$

Fig. 9. Merging step

The formation of pairs is licensed by pre-established schemes described in a domain ontology (Fig. 9).
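The merge operation can be pictured as enumerating ordered pairs of words and keeping those that a rule licenses. The sketch below is a simplified assumption, not the ontology-driven implementation: the `RULES` table, the category labels and the transliterated words are hypothetical, and the argument order follows formula (2), with the verb as x.

```python
from itertools import permutations

# Hypothetical licensing rules: (relation, category of x, category of y).
RULES = [
    ("Subject", "verb", "nominative_noun"),
    ("Object", "verb", "accusative_noun"),
]


def licensed_pairs(categories):
    """Return every (relation, x, y) licensed for the given {word: category} map."""
    pairs = []
    # permutations() yields ordered pairs of distinct words, so no word pairs with itself.
    for x, y in permutations(categories, 2):
        for relation, x_category, y_category in RULES:
            if categories[x] == x_category and categories[y] == y_category:
                pairs.append((relation, x, y))
    return pairs


# Transliterated toy input ("Ahmed ate the apple").
categories = {
    "akala": "verb",
    "Ahmadu": "nominative_noun",
    "al-tuffahata": "accusative_noun",
}
sentence_pairs = licensed_pairs(categories)
```

The conjunction of the returned pairs plays the role of formula (7); a real system would draw the rules from the ontology rather than from a hard-coded list.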


Example: We illustrate this with the sentence S: 'أكل أحمدُ التفاحةَ - Ahmed ate the apple'

$$S = \mathrm{Subj}(\text{أكل}, \text{أحمد}) \land \mathrm{Obj}(\text{أكل}, \text{التفاحة}) \land \mathrm{Def}(\text{ال}, \text{تفاحة})$$

Fig. 10. Merging example: the words أكل (ate), أحمدُ (Ahmed), ال (the) and تفاحة (apple) are linked by the relations Sub (فاعل), Obj (مفعول به) and Def (تعريف)

We can distinguish two merge operations: one operates on the linear axis (Fig. 10) to compose grammatical relations, while the other operates on the vertical axis by specifying the functional aspects of the words (type, tense, transitivity for verbs, etc.). The functional axis is represented by a matrix (Fig. 11) containing the syntactic information associated with the words. After the categorical matrix has been updated by the categorization operations, words can combine on the linear axis and take their licensed positions as established in Arabic grammar.

Fig. 11. Matrix of the syntactic information of the word “out - خرج”: Cat = Verb; Trans = No; Tense = Past; …

We implemented our ontology through the definitions of the grammatical relations that connect the grammatical fields. These relations are characterized by a set of formal properties, which we summarize briefly:

a) Grammatical relations are pairs whose elements are subject to a specific order, as in a mathematical ordered pair. If the order of the two ends of the pair changes, the meaning of the relationship also changes; the grammatical pair is therefore an asymmetric relation, and the following two pairs are not equal:

$$\forall x, y \in GC\ /\ R(x, y) \longrightarrow \sim R(y, x) \tag{8}$$

Example: Subject(خرج, الولدُ) ≠ Subject(الولدُ, خرج)

The subject relation thus has a specific direction: each relation has a starting point, called the domain, and an end point, called the range (Fig. 12). This law applies to all grammatical relationships without exception.

Fig. 12. The grammatical relationship law (Domain, Relationship, Range)

b) Grammatical relations are intransitive: if an element of GC is related to a second element that is itself related to a third element, Arabic grammar prohibits the first element from also being related to the third.

$$(\forall x, y, z \in GC)(\forall R)\ /\ \sim(R(x, y) \land R(y, z) \longrightarrow R(x, z))$$
$$\sim(\sim(R(x, y) \land R(y, z)) \lor R(x, z))$$
$$(R(x, y) \land R(y, z)) \land \sim R(x, z) \tag{9}$$

c) In a grammatical sentence, we distinguish two types of relationships: the main relationship, which constitutes the essence of the sentence, and secondary relationships, which can be dropped without compromising the general meaning of the sentence.

d) Grammatical relations are irreflexive (or antireflexive), because no categorical element is related to itself:

$$(\forall x \in GC)(\forall R)\ /\ \sim R(x, x) \tag{10}$$

Example: in خرج الولدُ, the pair Subject(الولدُ, الولدُ) is excluded.

Etc.

Such a formal system gives us some information on how a sentence is built. The construction of a sentence requires two kinds of information. On the one hand, we must specify the categorical elements denoted above by the letters x, y, z; this type of information is provided by the database (...). On the other hand, linking these categorical elements requires enumerating all the possible links recognized by Arabic grammar; this type of information is provided by an ontological database. The Arabic syntax ontology does not only define the possible grammatical links; it also imposes constraints in the form of axioms. For example, the relation Subject (فاعل) must be controlled by the following constraint:


$$(\exists x \in \mathrm{Verb},\ \forall y \in \mathrm{Noun})\ /\ \mathrm{Subject}(x, y) \longrightarrow \mathrm{has\_case}(y, \mathrm{Nominative}) \tag{11}$$

This axiom states that every subject y of x bears the nominative case ending (علامة الرفع). We have adopted about fifty grammatical relations; Table 7 illustrates some of these relations in our Arabic ontology.

Table 7 - The grammatical relations

| Relation | Domain | Range | Address |
|---|---|---|---|
| Subject | Nominative noun | Verb/Operating noun | http://arabicontology.org/arabe.owl#فعل |
| Pro-agent | Nominative noun/Preposition | Passive verb/Passive participle | http://arabicontology.org/arabe.owl#نائب_الفاعل |
| First Object | Accusative noun | Verb/Operating noun | http://arabicontology.org/arabe.owl#مفعول_به |
| Noun of Kana sisters | Nominative noun | Kana sisters | http://arabicontology.org/arabe.owl#اسم_أخوات_كان |
| Predicate of inchoative | Nominative noun | Nominative noun | http://arabicontology.org/arabe.owl#خبر_مبتدأ |
| Adjective | Noun | Noun | http://arabicontology.org/arabe.owl#نعت |
| Vocative | Accusative noun | Vocative particle | http://arabicontology.org/arabe.owl#منادى |
| Its pattern | Noun/Verb | Pattern | http://arabicontology.org/arabe.owl#وزنه |
| Its gender | Noun | Gender | http://arabicontology.org/arabe.owl#جنسه |
| Its number | Noun | Number | http://arabicontology.org/arabe.owl#عدده |
| Its tense | Verb | Tense | http://arabicontology.org/arabe.owl#زمنه |
| Genitive by preposition | Genitive noun | Preposition | http://arabicontology.org/arabe.owl#مجرور_بحرف |
| Possessive construction | Genitive noun | Indefinite noun | http://arabicontology.org/arabe.owl#مضاف_اليه |
| Predicate of Inna sisters | Nominative noun | Inna sisters | http://arabicontology.org/arabe.owl#خبر_أخوات_ان |
| Predicate of Kana sisters | Accusative noun | Kana sisters | http://arabicontology.org/arabe.owl#خبر_أخوات_كان |
| Circumstantial | Indefinite accusative noun | Verb | http://arabicontology.org/arabe.owl#حال |
| … | … | … | … |
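To make the role of such axioms concrete, the following minimal Java sketch checks constraint (11) on a candidate Subject link. It is only an illustration under stated assumptions: the class `SubjectAxiomSketch`, the nested `Word` type, and the category strings "Verb" and "Noun" are hypothetical stand-ins for the ontology's individuals, and the check runs in plain Java rather than through the OWL ontology used by the system.

```java
// Hypothetical, simplified model; the names are illustrative and not the authors' API.
final class SubjectAxiomSketch {

    enum GrammaticalCase { NOMINATIVE, ACCUSATIVE, GENITIVE }

    static final class Word {
        final String category;                 // e.g. "Verb" or "Noun"
        final GrammaticalCase grammaticalCase; // null for words that carry no case

        Word(String category, GrammaticalCase grammaticalCase) {
            this.category = category;
            this.grammaticalCase = grammaticalCase;
        }
    }

    // Axiom (11): Subject(x, y), with x a verb and y a noun, implies has_case(y, Nominative).
    static boolean satisfiesSubjectAxiom(Word x, Word y) {
        boolean premise = x.category.equals("Verb") && y.category.equals("Noun");
        return !premise || y.grammaticalCase == GrammaticalCase.NOMINATIVE;
    }

    public static void main(String[] args) {
        Word verb = new Word("Verb", null);
        Word nominativeNoun = new Word("Noun", GrammaticalCase.NOMINATIVE);
        Word accusativeNoun = new Word("Noun", GrammaticalCase.ACCUSATIVE);
        System.out.println(satisfiesSubjectAxiom(verb, nominativeNoun));
        System.out.println(satisfiesSubjectAxiom(verb, accusativeNoun));
    }
}
```

A candidate link whose noun is not nominative fails the check, which is the kind of violation the generation phase is meant to rule out.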

5.3. The phase of detection and correction of errors

This phase consists of comparing all the syntactically correct sentences generated by the previous phase with the original sentence. There are two possibilities:

- If the system finds the original sentence in the list of generated sentences, it considers the sentence correct and moves on to the next sentence.
- If the system does not find the original sentence in the list of generated sentences, the user can choose how to correct it: either by manually selecting one of the sentences generated by the previous phase, or by opting for a purely automatic correction that proposes the most likely correct sentence.

Fig. 13 shows how errors are detected and corrected based on the list of generated sentences.

[Fig. 13. The phase of detection and correction of errors. Step 1: comparing the original sentence with the generated sentences (List_Sentences: Sentence 1, Sentence 2, Sentence 3, …, Sentence n). Step 2: detecting and correcting syntax errors. If the comparison is True, the syntax of the sentence is correct; if it is False, the system proceeds to error detection and correction suggestion.]

We calculate the “Levenshtein distance” [27] between the original sentence and the generated sentences in order to allow a certain flexibility during the comparison.

Levenshtein distance: the Levenshtein distance is the minimum number of edit operations (insertion, deletion, or substitution of a symbol) necessary to transform A1 into A2. The optimal corrective derivation is the sequence of edits used to calculate the Levenshtein distance:

$$D(A_1; A_2) = e_1, e_2, \dots, e_n \quad \text{with } e_k = (x_i; x_j),\ 1 \le k \le n,\ \forall x_i, x_j \in \Sigma \cup \{\varepsilon\} \tag{12}$$

A dynamic programming algorithm computes D(A1; A2) in time Θ(|A1| · |A2|), where |A1| (resp. |A2|) is the length of A1 (resp. A2). Unit costs can also be assigned to these operations as follows:

$$\omega(x_i, x_j) = \begin{cases} 1 & \text{if } x_i \neq x_j \\ 0 & \text{if } x_i = x_j \end{cases} \qquad \forall x_i, x_j \in \Sigma \cup \{\varepsilon\} \tag{13}$$
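The dynamic programming computation with the unit costs of (13) can be sketched in Java as follows. This is a minimal illustration, not the authors' “LevenshteinDistance” class: the names `LevenshteinSketch`, `distance`, and `rank` are assumptions, and the sample sentences in `main` are ASCII (Buckwalter-style) transliterations chosen only to keep the source file encoding-independent.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch; class and method names are assumptions, not the authors' implementation.
final class LevenshteinSketch {

    // Unit-cost edit distance, filled row by row in O(|a| * |b|) time.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // transforming the empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // transforming a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1; // omega of Eq. (13)
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidates ordered by increasing distance to the original sentence.
    static List<String> rank(String original, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(c -> distance(original, c)));
        return sorted;
    }

    public static void main(String[] args) {
        String original = "rjE AlmsAfryn AlbEydwn";
        List<String> generated = List.of(
                "rjE AlmsAfrAn AlbEydAn",
                "rjE AlmsAfrwn AlbEydwn",
                "rjE AlmsAfryn AlbEydyn");
        for (String s : rank(original, generated)) {
            System.out.println(distance(original, s) + "  " + s);
        }
    }
}
```

The comparison here is per Java `char`, which is sufficient for Arabic letters because they lie in the Basic Multilingual Plane; whether the original system compares characters or words is not stated in the text.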

From this point of view, the Levenshtein distance is also the minimal cost of transforming A1 into A2 using operations with unit costs. The function “LevenshteinDistance” returns an integer; the smaller this value, the higher the proposed sentence is ranked.

6. Example

The purpose of this example is to illustrate the link between the generation of syntactically correct sentences and the correction of detected syntactic errors. Let us suppose the following incorrect sentence:

«رجع المسافرين البعيدون» “The distant travellers returned”

To correct the syntactic errors of this sentence, we apply our approach using the following steps:

6.1. Segmentation

The first step is to segment the sentence into words, so the result is:

Segmentation(رجع المسافرين البعيدون) = Seg1(رجع) + Seg2(المسافرين) + Seg3(البعيدون)

6.2. Categorization

After the segmentation of the sentence into three units, this stage has the objective of associating a set of morphosyntactic features (Number, Gender, Person …) with each word obtained (Fig. 14).

Seg1(رجع) “returned”
- Cat1(Seg1): primitive noun. Gender: masculine. Number: singular. Pattern: فَعْل (three vocalized variants). Case: accusative or nominative or genitive.
- Cat2(Seg1): past active verb. Transitive: yes. Pattern: فَعَلَ.
- Cat3(Seg1): past passive verb. Transitive: yes. Pattern: فُعِلَ.

Seg2(مسافر + ين), from المسافرين “travellers”
- Cat1(Seg2): active participle. Gender: masculine. Number: dual. Prefix: Definition|ال. Suffix: ين or ان. Case: accusative or nominative or genitive.
- Cat2(Seg2): active participle. Gender: masculine. Number: plural. Prefix: Definition|ال. Suffix: ين or ون. Case: accusative or nominative or genitive.
- Cat3(Seg2): passive participle. Gender: masculine. Number: dual. Prefix: Definition|ال. Suffix: ين or ان. Case: accusative or nominative or genitive.
- Cat4(Seg2): passive participle. Gender: masculine. Number: plural. Prefix: Definition|ال. Suffix: ين or ون. Case: accusative or nominative or genitive.

Seg3(بعيد + ون), from البعيدون “distant”
- Cat1(Seg3): adjective. Gender: masculine. Number: dual. Prefix: Definition|ال. Suffix: ين. Case: accusative or nominative or genitive.
- Cat2(Seg3): adjective. Gender: masculine. Number: plural. Prefix: Definition|ال. Suffix: ين or ون. Case: accusative or nominative or genitive.

Fig. 14. Categorization step


6.3. Merger

In order to construct a syntactically correct sentence, we must look for all the possible mergers. We then obtain:

- Mrg1_1 = (Cat1(Seg1) + Cat1(Seg2)) = R1_1(Noun1, Noun2)
- Mrg1_2 = (Cat1(Seg2) + Cat1(Seg3)) = R1_2(Noun2, Adj)
- Mrg2_1 = (Cat2(Seg1) + Cat1(Seg2)) = R2_1(Verb, Noun)
- Mrg2_2 = (Cat1(Seg1) + Cat1(Seg3)) = R2_2(Noun, Adj)
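One plausible way to enumerate merger candidates is to pair every reading of one segment with every reading of the next one, and then keep only the pairs for which the ontology returns a relation. The sketch below illustrates the enumeration step only. The text does not state how candidates are produced, so the Cartesian-product strategy and the names `MergerSketch` and `candidateMergers` are assumptions; the ontology lookup itself is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class and method names are illustrative, not taken from the paper.
final class MergerSketch {

    // Pairs every reading of the left segment with every reading of the right segment.
    static List<String> candidateMergers(List<String> leftReadings, List<String> rightReadings) {
        List<String> pairs = new ArrayList<>();
        for (String left : leftReadings) {
            for (String right : rightReadings) {
                pairs.add("(" + left + " + " + right + ")");
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Reading labels as they appear in the categorization step (Fig. 14).
        List<String> seg1 = List.of("Cat1(Seg1)", "Cat2(Seg1)", "Cat3(Seg1)");
        List<String> seg2 = List.of("Cat1(Seg2)", "Cat2(Seg2)", "Cat3(Seg2)", "Cat4(Seg2)");
        for (String merger : candidateMergers(seg1, seg2)) {
            System.out.println(merger);
        }
    }
}
```

With three readings for Seg1 and four for Seg2, this yields twelve candidate pairs, each of which would still have to be validated against the grammatical relations of the ontology.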

The SPARQL queries that find the four relations are given in the following figures:

Fig. 15. The SPARQL query for R1_1 (1st relation)
Fig. 16. The SPARQL query for R1_2 (2nd relation)
Fig. 17. The SPARQL query for R2_1 (3rd relation)
Fig. 18. The SPARQL query for R2_2 (4th relation)

As a result of this phase, we get eight syntactically correct sentences, namely:

- رَجْعُ المسافرِينَ البعيدِينَ
- رَجْعُ المسافرَيْنِ البعيدَيْنِ
- رَجَعَ المسافرون البعيدون
- رجع المسافران البعيدان
- رجع المسافرِينَ البعيدِينَ
- رجع المسافرَيْنِ البعيدَيْنِ
- رُجِع المسافرون البعيدون
- رُجِع المسافران البعيدان

In order to improve our system, we can give the user the choice of whether or not to take diacritic marks into account, since almost all Arabic texts are non-vowelized, except religious books and some school manuals. The system normalizes these sentences by removing diacritic marks. The result then becomes:

- رجع المسافرين البعيدين
- رجع المسافرون البعيدون
- رجع المسافران البعيدان

6.4. Error detection and correction

We finally compare the three sentences with the original sentence:

    for (int i = 0; i < phrasesList.size(); i++) {
        // A distance of 0 means the generated sentence is identical to the original.
        if (LevenshteinDistance.computeLevenshteinDistance(phrasesList.get(i), phrase_original) == 0) {
            isCorrect = 1;
        } else {
            // Otherwise the generated sentence is kept as a correction candidate.
            phrases_proposedList.add(phrasesList.get(i));
        }
    }

The proposed sentences are:

- رجع المسافرين البعيدين
- رجع المسافرون البعيدون
- رجع المسافران البعيدان

7. Evaluation & Discussions

In this section, we present the results of the evaluations conducted on Arabic sentences. In order to validate our approach, we evaluate it using precision and recall metrics on the combined syntactic information. However, there is no suitable corpus annotated at several levels of Arabic grammar, which led us first to annotate a new reference corpus of 360 sentences. Of these 360 Arabic sentences, 30 are syntactically correct and 330 are ungrammatical; the latter cover several types of grammatical errors, namely:

- 200 errors of agreement (gender, number, singular, dual or plural…).
- 100 errors of grammatical case endings (nominative, accusative or genitive).
- 30 errors in the use of the definite article “ال”.

Corpora of grammatical errors are not easy to find, and they do not exist for the Arabic language for the time being. We therefore collected 330 ungrammatical sentences from learners' writings and annotated them manually with three classes of errors: agreement errors, case ending errors, and errors in the use of definite and indefinite articles. These error classes are presented with examples in Appendix A.

The evaluation of our system uses the two common metrics of Precision and Recall, as well as the F-measure. The formulas are as follows:

$$\text{Recall} = \frac{\text{Number of errors correctly detected}}{\text{Total number of introduced errors}} \tag{14}$$

$$\text{Precision} = \frac{\text{Number of errors correctly detected}}{\text{Total number of detections}} \tag{15}$$

$$\text{F-measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{16}$$

Table 8 summarizes the results obtained.

Table 8 - Results of detection of syntactic errors

| Syntactic error | Precision | Recall | F-measure |
|---|---|---|---|
| Agreement | 96.75% | 89.5% | 92.98% |
| Case endings | 90.42% | 85% | 87.62% |
| Definite article « ال » | 88.88% | 80% | 84.20% |
| Total | 92.01% | 84.83% | 88.27% |
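As a quick arithmetic check of formula (16) against Table 8, the short Java sketch below recomputes the F-measure from the reported precision and recall of each row. The class and method names are illustrative assumptions; only the percentages come from the table.

```java
// Illustrative helper; the class and method names are assumptions, not from the paper.
final class FMeasureCheck {

    // Eq. (16): harmonic mean of precision and recall (both given in percent here).
    static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // {precision, recall} for Agreement, Case endings, Definite article, Total.
        double[][] rows = { {96.75, 89.5}, {90.42, 85.0}, {88.88, 80.0}, {92.01, 84.83} };
        for (double[] row : rows) {
            System.out.printf("P=%.2f R=%.2f F=%.4f%n", row[0], row[1], fMeasure(row[0], row[1]));
        }
    }
}
```

The recomputed values agree with the F-measure column of Table 8 to within 0.01 percentage points. The Total row is also consistent with an unweighted mean of the three per-class rows, for example (89.5 + 85 + 80) / 3 ≈ 84.83 for recall.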

The average sentence length was seven words, and the longest sentence was eleven words long. Our system includes approximately 200 grammar rules. The complexity of the approach is generally proportional to the size of the sentence, the number of outputs of the Alkhalil morphological parser for each extracted word, and the number of grammatical rules used in the sentence generation phase. The system can be integrated into higher-level NLP applications since it is developed in Java, an open-source programming language. Moreover, the resulting correct sentences can be obtained through web services, libraries, and XML outputs.

The results of our approach given in Table 8 show, on average, a precision of 92% and a recall of 84%, which is a good level for the task of error detection. The high level of precision is also noteworthy, as it indicates a significant level of reliability. This property is important here because, if the system finds that a sentence contains an error, it automatically moves to the next phase to generate the correct sentences based on the words extracted in the previous phase. Recall could be improved by being more thorough in the lists of entities constituted. If we now consider the F-measure (16), which is a better synthesis indicator, we find that our methodology performs well (88.27%).

We also evaluated a set of 30 grammatically correct sentences; our system considered 27 of them correct and 3 incorrect, which further supports this approach. It can be seen that some relationships are lost due to the ambiguity of syntactic rules. We could therefore introduce statistical machine translation systems [28], based on the linguistic information from our ontology and the morphological parser, in order to regenerate correct sentences.

It would be interesting to compare our approach with others. However, as we explained in Section 2.2, there are no available systems for the correction of syntactic errors in Arabic, just as there is no corpus containing such information for testing. Moreover, a comparison with related research on other languages would be very difficult because the experimental conditions are not the same.

The results are satisfactory since the syntactic error detection phase achieves high precision and returns more correct syntactic information while keeping a large amount of information. In addition, the results obtained during the evaluation of the other parts, in particular the correction part, which contains “the phase of automatic generation of the correct sentences”, allow us to hope that an evaluation on a larger corpus will further confirm the validity of the proposed approach. In the medium term, it would be interesting to evaluate the correction on a larger corpus and to validate the approach on other syntactic information, namely the syntactic relations that play a paramount role in our system.

8. Conclusion & perspectives

In this paper, we presented a new approach to the detection and automatic correction of syntactic errors in Arabic texts. This approach is based on the generation of sentences using the dependency model, whose rules and constraints are obtained through a logical description of Arabic grammar by the ontology. It rests on two assumptions: first, that it is possible to generate all possible sentences, and second, that it is possible to compare the original sentence with the generated sentences. This work is still in its early stages, and our main objective has been to implement a new approach to the detection and correction of syntactic errors based on the automatic generation of sentences. The first results obtained are encouraging, and we look forward to extending this research, in particular to a larger corpus.


Appendix A. Classes of errors and some examples Table A.1 - Classes of errors and some examples Error class

Error type

Example ‫ نجح [المجتهدين ← المجتهدون] في االمتحان‬ ‫الموحد‬  The diligent succeeded in the standardized exam

The subject is a nominative noun

The predicate of “Inna” and its sisters

‫الطالبين‬ ‫ إن‬ ]‫مجتهدان‬  The two students standing are hardworking

the predicate of “Inna” is nominative (‫)مرفوع‬

The exception

]‫دا‬ ٍ‫ عاد الفائزون إال [سع‬ ً‫د ← سع‬  The winners returned all except Saadah

If the exceptive sentence is affirmative and complete, the excepted object takes the accusative case, ‫النصب‬

subject of the nominal sentence “Inna” and its sisters

‫ إن [المهاجمون ← المهاجمين] فشلوا في‬ ‫خطتهم‬  The attackers failed in their plan

the subject of “Inna” is accusative (‫)منصوب‬

]ٍ ‫ِ ← فاضل‬ ‫ٍ [الفاضل‬ ‫ مررت برجل‬  I passed by a virtuous man

The adjective “‫ ”النعت‬follows in case ending the qualified “‫ ”المنعوت‬to which it refers

]‫ [هذه ← هذا] الماء [الصافية ← الصافي‬  This pure water

The follows the noun (feminine, masculine, singular, plural, etc.)

]‫ له [ثالث ← ثالثة] بنين و [ثالثة ← ثالث‬ ٍ‫بنات‬  He has three sons and three daughters

The number has the opposite gender of the noun

‫ [تنهض ← ينهض] التعليم بالمجتمعات‬  Education is rising up the societies

The verb must agree with its subject in both number and gender

qualified

‫ انتقل [الطالبين ← الطالب] الناجحون إلى‬ ‫الجامعات‬  Successful students moved to universities

The adjective “‫ ”النعت‬follows in number the qualified “‫ ”المنعوت‬to which it refers

case endings of the predicate “Kana” and its sisters

]‫ أصبح المعروف [منكر ← منكرا‬  the right became wrong

The predicate of “Kana” must always be accusative

The case endings of the circumstantial

]‫ جاء الولد الخائف [مسرع ← مسرعا‬  The frightened boy came quickly

The circumstantial must always be accusative

The case endings of the object

]‫ عندما وجدنا الطفالن [جالسان ← جالسين‬ ‫قرب الحديقة‬  When we found the two children sitting near the park

The first object must always be accusative

Case endings of the genitive and its adjective

‫ يرى جوهر الحقائق [بعينان‬ ]‫[ثاقبتان ← ثاقبتين‬  He sees the essence of the facts with piercing eyes

The genitive noun by a preposition must always be genitive and the adjective follow it.

Case endings of the possessor and its adjective

‫ وقفت بجانب [السيارتان‬ ]‫[الجميلتان ← الجميلتين‬  I stood beside the two beautiful cars

The genitive noun by the possession must always be genitive and the adjective follow it.

Deletion of "Nun" in the case of the present nominative verb

‫ن الصوم ليس‬ ّ‫ كانوا [يعلموا ← يعلمون] أ‬ ‫ٍ عن الطعام والشراب‬ ‫ّد امتناع‬ ‫مجر‬  They knew that fasting was not just abstinence from food and drink

The "Nun - ‫ "ن‬cannot be deleted with the nominative verb.

ts qualified

ts

Case ending class

The grammatical rule

‫غير‬

←

‫[مجتهدين‬

]‫بعينين‬

‫الواقفين‬

←

]‫السيارتين‬

←

‫ أضف لمعلوماتك [الغير ← غير] الكافية‬  Add to your insufficient information

‫ غير‬is used without


possessed

‫ يشهد [العصرنا ← عصرنا] الحاضر‬ ‫كبيرا‬  Our present era is witnessing great development ‫تطورا‬

The noun possessed must always be indefinite.


References

[1] Al-Sughaiyer, I. Al-Kharashi, I. (2004). Arabic Morphological Analysis Techniques: A Comprehensive Survey. Journal of the American Society for Information Science and Technology.
[2] Boudchiche, M., Mazroui, A., Ould Abdallahi Ould Bebah, M., Lakhouaja, A., Boudlal, A. (2016). AlKhalil Morpho Sys 2: A robust Arabic morphosyntactic analyzer. Journal of King Saud University – Computer and Information Sciences, doi:10.1016/j.jksuci.2016.05.002.
[3] Socher, R. Christopher, D. (2010). Better Arabic parsing: baselines, evaluations, and analysis. COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics.
[4] Klein, D. Christopher, D. (2003). Fast exact inference with a factored model for natural language parsing. In Suzanna Becker, vol. 15, pp. 3–10. MIT Press.
[5] Elkateb, S. Black, W. Vossen Piek, Farwell, D. Rodríguez, H. Pease, A. Alkhalifa M. (2006). Arabic WordNet and the challenges of Arabic. In Proceedings of the Arabic NLP/MT Conference, London, UK.
[6] Ferré, S. (2017). Sparklis: an expressive query builder for sparql endpoints with guidance in natural language. Semantic Web. 8(3), pp. 405–418.
[7] Attia, M. Pecina, P. Samih, Y. Shaalan, K. Van Genabith, J. (2015). Arabic spelling error detection and correction. Natural Language Engineering, Available on CJO doi:10.1017/S1351324915000030
[8] Shaalan K. (2005). Arabic GramCheck: A Grammar Checker for Arabic. Software Practice and Experience, John Wiley & sons Ltd. UK. 35(7), pp. 643-665.
[9] Harris, Z. (1968). Mathematical structures of language. John Wiley. New York.
[10] Chomsky, N. (1956). Three models for the description of language. IEEE Transactions on Information Theory. (Vol.2, pp.113-114).
[11] Chomsky, N. (1959). On certain formal properties of grammars. Information and Control. (Vol.2, pp.137-167).
[12] Chomsky, N. & Miller, G. A. (1968). L'analyse formelle des langues naturelles. Paris.
[13] Knight, K. & Chander, I. (1994). Automated postediting of documents. In Proceedings of AAAI.
[14] Leacock, C., Chodorow, M., Gamon, M. & Tetreault, J. (2010). Automated Grammatical Error Detection for Language Learners. Morgan & Claypool Publishers.
[15] Gamon, M., Gao, J., Brockett, C., Klementiev, A., Dolan, W. B., Belenko, D. & Vanderwende, L. (2008). Using contextual speller techniques and language modeling for ESL error correction. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP). (pp. 449–456). Hyderabad, India.
[16] John Lee. (2004). Automatic article restoration. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT). (pp 31–36). Boston.
[17] Han, N.R., Tetreault, J., Lee, S.H. & Ha, J.Y. (2010). Using an error-annotated learner corpus to develop an ESL/EFL error correction system. In Proceedings of LREC.
[18] Gamon, M. (2010). Using mostly native data to correct errors in learners’ writing. In Proceedings of the Eleventh Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Los Angeles.
[19] Dahlmeier, D. & Hwee, T. N. (2011). Grammatical error correction with Alternating Structure Optimization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 915–923). Portland. Oregon. USA.
[20] Tesnière, L. (1959). Éléments de syntaxe structurale. Paris.
[21] Kubler, S. Mcdonald, R. Nivre, J. (2009). Dependency parsing. Synthesis Lectures on Human Language Technologies. vol. 1. pp. 1–127.
[22] Mustafa, I., Al-Ziyaat, Abdul Qaadir, A., H. & Al-Najjaar, M. (1960). Al-Waseet Dictionary, the Academy of the Arabic Language in Cairo.
[23] Horridge, M. Knublauch, H. Rector, A. Stevens, R. Wroe, C. (2011). A practical guide to building OWL ontologies using Protégé 4 and CO-ODE tools, Edition 1.3. The University of Manchester. hmowl-power.cs.man.ac.uk/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf (21/11/2018)
[24] Klyne, G. Carroll, J. J. (2004). Resource Description Framework (RDF): Concepts and Abstract Syntax. Rapport technique, W3C: World Wide Web Consortium, https://www.w3.org/TR/2004/REC-rdf-concepts-20040210/ (21/11/2018)
[25] Hadrich Belguith, L., Aloulou, C. & Ben Hamadou, A. (2008). MASPAR: De la segmentation à l'analyse syntaxique de textes arabes. Revue Information Interaction Intelligence I3. (Vol.7, N°2).
[26] Souteh, Y. & Bouzoubaa, K. (2011). SAFAR platform and its morphological layer. Eleventh Conference on Language Engineering ESOLEC’2011. Cairo. Egypt.
[27] Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. (pp.707-710).
[28] Chollampatt, S., Ng, H.T. (2017). Connecting the dots: towards human-level grammatical error correction. In: the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pp 327–333. Copenhagen, Denmark.
[29] Almalki, T. (2015). أنطولوجيا النحو العربي نحو توصيف منطقي لساني للنحو العربي القديم. دار النابغة للنشر والتوزيع. Tanta. Egypt.
