Graph-based Arabic text semantic representation




Information Processing and Management 57 (2020) 102183

Contents lists available at ScienceDirect

Information Processing and Management journal homepage: www.elsevier.com/locate/infoproman

Wael Etaiwi, Arafat Awajan ⁎


Princess Sumaya University for Technology, Amman, Jordan

ARTICLE INFO

ABSTRACT

Keywords: Knowledge representation; Semantic graph; Semantic representation; Textual entailment; Arabic natural language processing; ArbTED

Semantic representation reflects the meaning of a text as it may be understood by humans, and thus facilitates various automated language processing applications. Although semantic representation is very useful for several applications, few models have been proposed for the Arabic language. In this context, this paper proposes a graph-based semantic representation model for Arabic text. The proposed model aims to extract the semantic relations between Arabic words. Several tools and concepts are employed, such as dependency relations, part-of-speech tags, named entities, patterns, and predefined linguistic rules of the Arabic language. The core idea of the proposed model is to represent the meaning of an Arabic sentence as a rooted acyclic graph. The textual entailment recognition task is used to evaluate the ability of the proposed model to enhance other Arabic NLP applications. Experiments were conducted on a benchmark Arabic textual entailment dataset, namely, ArbTED. The results show that the proposed graph-based model outperforms the baseline models on the textual entailment recognition task, achieving, on average, improvements of 8.6% in accuracy, 30.2% in recall, 5.3% in precision, and 16.2% in F-score.

1. Introduction

Semantics refers to the systematic representation of knowledge in a sufficiently precise notation that can be used by computer programs (Hayes, 1974). The semantic relations between text components help in better understanding human language and in building more accurate automated cognitive systems. In linguistics, semantics refers to the study of the relations between text components (words, statements, etc.) and their implicit signification (Abend & Rappoport, 2017), while semantic representation reflects the meaning of the text as it is understood by humans. Several applications in the computational linguistics area (e.g., machine translation and question answering) utilize semantic representation to obtain better results. The core idea of semantic representation is to develop specific and precise notations of the text that reflect its meaning. The common techniques used to represent knowledge and semantics can be classified into four main groups: predicate logic representation, network representation, frame representation, and rule-based representation. In predicate logic representation, sentences are split into words, and the semantic relations between words are defined using predicate logic notation. For instance, the statement "Time is running" is represented as running(time). Predicate logic has been used to represent the semantic level of analysis for many languages, such as English (Ali & Khan, 2009) and Urdu (Ali & Khan, 2010). The representation and retrieval complexities for complex sentences and the exclusion of supporting words (e.g., "is") are the main drawbacks of this notation. In addition, predicate logic-based methods face difficulties when trying to represent ambiguous words that have different meanings



Corresponding author. E-mail address: [email protected] (W. Etaiwi).

https://doi.org/10.1016/j.ipm.2019.102183 Received 26 June 2019; Received in revised form 13 December 2019; Accepted 14 December 2019 0306-4573/ © 2019 Elsevier Ltd. All rights reserved.


(Ali & Khan, 2009). Network representation (i.e., a semantic network or semantic graph) was proposed by Quillian (1968). It describes the text as a directed labeled graph in terms of vertices and edges. There are positive relationships between the amount of original text and the size of the semantic network, and between its complexity and the time needed for the manipulation process. Nevertheless, semantic networks are powerful and flexible knowledge representation techniques that can be used to model the semantic relations between text components. The frame representation is a data structure proposed by Minsky in 1974 that represents sentences as slots of objects that carry information (Mylopoulos, 1980). Splitting the original text into small slots and extracting their values are time-consuming processes that make the frame representation an inefficient knowledge representation. Furthermore, rebuilding the original sentence from its frame representation is very difficult (Ali & Khan, 2009). Finally, in the rule-based representation, a sentence is represented as a set of if-then rules. In rule-based systems, when a set of rules is satisfied, the system provides a solution without applying the remaining rules. Thus, the solution may differ from the one obtained when applying other rules. This allows rule-based representation to provide multiple representations of the same sentence, which makes retrieving the original text from its rule-based representation a difficult task (Tayal, Raghuwanshi, & Malik, 2015). The task of mapping natural language text into its semantic representation is called semantic parsing. The mapping process parses the text into its semantic representation without syntactic classification of the text's components (Wilks & Fass, 1992). Semantic parsers have attracted a huge amount of attention in the field of Natural Language Processing (NLP) over the last few decades (Liang, 2016).
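A Quillian-style semantic network of the kind described above can be sketched as a set of labeled, directed edges. The snippet below is an illustrative sketch only; the relation names (`is-a`, `can`) and the storage scheme are assumptions, not code from any cited work:

```python
# Illustrative sketch: a semantic network stored as a set of
# labeled, directed edges of the form (head, relation, tail).
semantic_network = set()

def add_relation(head, relation, tail):
    semantic_network.add((head, relation, tail))

# "A canary is a bird; a bird can fly" in network form.
add_relation("canary", "is-a", "bird")
add_relation("bird", "can", "fly")

def related(head, relation):
    """All vertices reachable from `head` via edges labeled `relation`."""
    return {t for (h, r, t) in semantic_network if h == head and r == relation}

print(related("canary", "is-a"))  # {'bird'}
```

Note how the size of the stored edge set grows with the amount of text represented, which is the scaling trade-off the paragraph above points out.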
Semantic parsers have been used to perform several NLP tasks such as question answering and machine translation. Semantic parsers fall into two main types: shallow semantic parsers and deep semantic parsers. A shallow semantic parser labels each word in the original sentence according to its semantic role (Jurafsky & Martin, 2009). A deep semantic parser represents each composite component in the text depending on its meaning in the sentence (Liang & Potts, 2015). This paper is organized as follows: In Section 2, the main objectives and goals of this research are described. In Section 3, we briefly review related work on knowledge representation and semantic representation for Arabic text. Section 4 describes the main features of the Arabic language that affect the semantic representation model. Section 5 describes the proposed model. Section 6 presents the process of building the proposed semantic graph. The experimental results are discussed in Section 7. Finally, the conclusion is presented in Section 8.

2. Research objective

In general, semantic analysis relies on well-built resources and machine learning techniques. However, little of this work has been dedicated to the Arabic language, and the proposed semantic methods and applications do not achieve good results. This is due to the structural and morphological complexity of the Arabic language and the lack of Arabic semantic resources. Most of the developed Arabic language parsers focus on the structure of the Arabic language in terms of syntax and morphology rather than semantic computations. Thus, this research aims to propose a new graph-based semantic representation model for Arabic text that enhances different Arabic NLP applications such as textual entailment. This research has the following main goals:

• Provide a new graph-based semantic representation model for the Arabic language that represents the semantic relations between words as a semantic graph.
• Provide a semantic representation parser for the Arabic language that extracts the semantic relations between Arabic words and represents them in a semantic graph. The proposed parser utilizes different Arabic language linguistic tools such as dependency relations and part-of-speech tags.
• Evaluate the proposed semantic representation model according to its ability to enhance Arabic textual entailment recognition, and compare the results with other state-of-the-art approaches.

3. Related work

Several models and projects have been proposed for semantic representation and parsing of natural language text, such as Abstract Meaning Representation (AMR) (Banarescu et al., 2013), Groningen Meaning Bank (GMB) (Bos, Basile, Evang, Venhuizen, & Bjerva, 2017), Universal Conceptual Cognitive Annotation (UCCA) (Abend & Rappoport, 2013), and Universal Networking Language (UNL) (Boguslavsky et al., 2000). These approaches differ in terms of representation type, structure (concept and relation types), granularity, and automaticity. Three main representation types are used in semantic text representation: text representation, graph-based representation, and frame representation. In terms of structure, most semantic representation methods use words as concepts (Banarescu et al., 2013), such as UCCA and (Vidal, Lama, Otero-García, & Bugarn, 2014), but some other methods use other concepts such as PropBank frames (e.g., AMR) and WordNet synsets (e.g., UNL). Furthermore, the relations differ from method to method. For instance, AMR uses PropBank relations while UNL has its own relation set. In terms of granularity, most representation methods annotate sentences, while some others (such as GMB and UCCA) annotate short texts. AMR, UNL, and UCCA are fully manually annotated, while GMB annotates the semantic representation automatically.

3.1. Semantic representation

In semantic text representation, Universal Networking Language (UNL) (Alansary, Nagi, & Adly, 2009; Boguslavsky et al., 2000) transfers the original text into a language-independent representation, which enables the translation of a text written in any language


into any other natural language. The structure of UNL consists of three groups of components: linguistic components, software components, and system interface components. Linguistic components contain a set of dictionaries to transform the original text into UNL expressions. These UNL expressions represent the relations and attributes of the syntactic and grammatical rules responsible for generating the target language sentences. The software components are responsible for converting the natural language into UNL expressions and vice versa. The system interface components enable the flow of UNL documents over the World Wide Web. UNL is commonly used for interlingua-based machine translation systems, such as English-Arabic machine translation (Alansary et al., 2009). Most of the proposed semantic representation models are graph-based models. Semantic features are used to enhance the results of NLP applications such as document classification (Kastrati, Imran, & Yayilgan, 2019) and paraphrase identification (ALSmadi, Jaradat, AL-Ayyoub, & Jararweh, 2017). Banarescu et al. (2013) proposed a sentence-level semantic parser (AMR) to map English text into a rooted directed labeled graph. The proposed approach utilizes PropBank frames (Palmer, Gildea, & Kingsbury, 2005) to represent words and frame-specific PropBank relations to represent the semantic relations between words. AMR is manually annotated. A novel multilayered framework for semantic representation called Universal Conceptual Cognitive Annotation (UCCA) was proposed by Abend and Rappoport (2013). The proposed framework uses basic linguistic theory to build a manual cross-linguistic scheme for semantic representation (Dixon & Dixon, 2010). Directed acyclic graphs (DAGs) are used to represent the semantic structure of sentences. Unlike AMR, which annotates sentences, UCCA annotates short texts (e.g., multiword expressions) in addition to short sentences.
Therefore, the same entity or multiword expression can be annotated in many different sentences. For Arabic text, a graph-based text representation design was proposed by Ismail, Aref, and Moawad (2013b) toward an abstractive Arabic text summarization system. Each word and concept of the input document is represented as a vertex (object) in the graph, while each edge corresponds to the semantic and topological relations between objects. The proposed model consists of five main steps: preprocessing, word sense instantiation, concept validation, sentence ranking, and semantic graph generation. Even though each object has its own feature set that contains additional information about the represented word (such as tense and type), the generated semantic graph does not determine the type of the semantic relation between each pair of connected vertices. A morpho-semantic knowledge graph called CAMS-KG (Classical Arabic Morpho-Semantic Knowledge Graph) was proposed by Bounhas, Soudani, and Slimani (2019). The proposed model represents both morphological and semantic links as a graph-based Arabic text representation. It combines the Ghwanmeh stemmer (Ghwanmeh, Kanaan, Al-Shalabi, & Rabab'ah, 2009) and MADAMIRA (Pasha et al., 2014) for morphological analysis and disambiguation, in addition to exploiting contextual knowledge links using an implemented concordance builder tool. Qu, Fang, Bai, and Jiang (2018) proposed two graph-based semantic representations in order to find the semantic similarity between concepts. The proposed semantic representations, called CORM and CARM, are built using the Wikipedia semantic network. Both the information content and the features of concepts are used to measure the semantic similarity between concepts. Using both types of features avoids two main limitations: the lack of semantic information and the insufficient information of some features. Semantic document representation can also be used to support document classification. Kastrati et al.
(2019) proposed a document semantic representation model for classifying financial documents using deep learning neural networks. The proposed model starts with a document representation phase, followed by a document classification phase. An ontology and its relevant terminology acquisition are used in the first phase in order to enrich documents with semantics. The representation phase aims to represent the document as a feature vector that can be used in the classification phase. Then, a deep learning neural network is used in the classification phase to find the most appropriate class for the document. The experimental results showed that the semantic representation of documents improves the performance of document classification. A paragraph-based graph representation model was proposed by de Arruda, Marinho, Costa, and Amancio (2019). The proposed model is built based on the semantic similarity of paragraphs, which is calculated using TF-IDF (term frequency-inverse document frequency) weighting and cosine similarity. The authors studied the properties of paragraph-based graph representation and concluded that both co-occurrence graphs and paragraph-based graphs can be used in an integrated way in order to capture both syntax and content features. Semantic frame representation was used to build a lexical dataset of the verb valences in the Arabic Quran. Sharaf and Atwell (2009) developed FrameNet frames for Quranic verbs. They studied the verbs and their context in the Quran and compared the semantic frames with verbs in the English FrameNet. Lakhfif and Laskri (2015) used frame semantic representation for building an interlingua-based system for Arabic language machine translation. The proposed approach captures the underlying meaning and semantics of the Arabic text. The authors concluded that the integration of WordNet and FrameNet in a single unified knowledge resource can improve disambiguation accuracy in the machine translation task.
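The paragraph-graph construction attributed to de Arruda et al. above can be sketched as follows: each paragraph becomes a node, and an edge links two paragraphs whose TF-IDF cosine similarity exceeds a threshold. The helper names and the threshold value are illustrative assumptions, not the authors' implementation:

```python
import math
from collections import Counter

def tfidf_vectors(paragraphs):
    """One sparse TF-IDF vector (dict) per paragraph."""
    docs = [p.lower().split() for p in paragraphs]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def paragraph_graph(paragraphs, threshold=0.1):
    """Edges (i, j) between paragraphs whose similarity exceeds threshold."""
    vecs = tfidf_vectors(paragraphs)
    return {(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))
            if cosine(vecs[i], vecs[j]) > threshold}
```

With this scheme, paragraphs sharing weighted vocabulary become connected nodes, while topically unrelated paragraphs stay isolated.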
A rule-based semantic frame annotation of Arabic speech was proposed by Lhioui, Zouaghi, and Zrigui (2017). Semantic frames were used to represent the meaning of Arabic user statements in order to enhance the performance of a human-machine Spoken Dialogue System (SDS). The main difficulty in SDS is the limitation of a fairly constrained semantic space. Moreover, the semantic representation should be compositional within speech turns and during the dialogue. A rule-based system was used to annotate the TIHR_ARC corpus, which was used later in the evaluation process. The experiments showed that the automatically annotated TIHR_ARC corpus is reliable and trustworthy enough to be used by subsequent stochastic systems.

3.2. Distinctions

Over the last few decades, most of the proposed knowledge representation methods for Arabic text have focused on morphological and syntactical aspects rather than semantic perspectives (Aboamer & Kracht, 2018; El-Sayed, 2015; Haddad &


Yaseen, 2003; Ismail, Aref, & Moawad, 2013a). Since graphs are commonly used to simplify Arabic NLP problems due to their ability to represent and formalize huge and complex data structures in a standard and formal way (Etaiwi & Awajan, 2018), the semantic graph is the most common method that has been employed for semantic representation (Abend & Rappoport, 2013; Alansary et al., 2009; Banarescu et al., 2013; Ismail et al., 2013b). However, those methods do not consider the morphological and syntactical features of the Arabic language during semantic graph construction. In addition, most of the contributions made in Arabic text semantic representation utilize translated English resources, which may have a negative impact on the performance of Arabic semantic representation methods. For instance, one of the commonly used English semantic resources for this purpose is WordNet and its Arabic translated version, Arabic WordNet (Lakhfif & Laskri, 2015). In this research work, a semantic representation model is proposed to extract the semantic relations from Arabic text. The proposed method employs several Arabic NLP tools and resources, none of which are translated English resources, in the extraction process. Furthermore, the proposed model considers the semantic features as well as the morphological and syntactical features of the given text. It is important to note that the proposed model adopts several tools, resources, and text features in order to reduce the negative impact of resource quality on the semantic representation.

4. Arabic language features

The Arabic language has a sophisticated structure in terms of grammar, syntax, and morphology. Furthermore, it has many features that make its semantic parsing a challenging task. Arabic language features can be grouped into two main types: morphological level features and sentence level features.
The morphological features of Arabic words have an impact on the analysis and processing of Arabic text. These features include the agreement feature and word formation. The agreement feature refers to the compatibility between two words in a sentence in terms of number, person, gender, case, and definiteness. For instance, a noun and its adjective need to agree with respect to number, gender, and definiteness (e.g., the phrase (ālrǧlyn ālkrymyn - two generous men) consists of a noun (ālrǧlyn - two men) and an adjective (ālkrymyn - generous) that agree on number, gender, case, and definiteness). In addition, Arabic word formation is challenging due to its complicated structure. In terms of how Arabic words are formed, two types of Arabic words exist (Awajan, 2007): derivative words and non-derivative words. A derivative Arabic word is formed based on a root-and-pattern scheme, where a list of standard morphological patterns is used to generate words from their roots. On the other hand, non-derivative Arabic words are produced without following any derivation rules, such as pronouns, prepositions, question words, and foreign words. The word's pattern determines how the word can be used. For instance, the (fā'l) pattern is utilized to generate the machine noun (hāsb - computer) from its root (hsb - compute), and the (mf'l) pattern is used to form the place noun (msn' - manufactory) from its root (sn' - manufacture). Moreover, the Arabic language allows attaching optional affixes and clitics to stems in order to form words (Awajan, 2014; 2015). For example, the single word (sm'tk - I heard you) contains a verb, subject, and object attached together. Extracting the concatenative parts of Arabic words and finding what they refer to are not easy tasks and may affect the semantic representation of the sentence. Thus, the sophisticated structure of Arabic words makes their analysis a challenging task.
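The root-and-pattern derivation described above can be sketched with transliterated forms, treating a pattern as a template whose numbered slots receive the root consonants. The digit-slot notation is an illustrative assumption, not a standard encoding:

```python
# Sketch of root-and-pattern word formation with transliterated forms.
# A pattern is a template in which the digits 1, 2, 3 stand for the
# three root consonants; the other letters belong to the pattern itself.
def apply_pattern(root, pattern):
    """Fill the numbered slots of a pattern with the root consonants."""
    for i, consonant in enumerate(root, start=1):
        pattern = pattern.replace(str(i), consonant)
    return pattern

# (fā'l)-style machine-noun pattern: root h-s-b -> hāsb ("computer")
print(apply_pattern("hsb", "1ā23"))   # hāsb
# (mf'l)-style place-noun pattern: root s-n-' -> msn' ("manufactory")
print(apply_pattern("sn'", "m123"))   # msn'
```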
In addition, various sentence level features of the Arabic language affect the semantic representation of Arabic sentences. Arabic sentences are categorized into two main types: verbal sentences and nominal sentences. A verbal sentence contains a verb and a subject, while a nominal sentence begins with a noun or a pronoun and consists of two main parts: a subject or topic and a predicate (e.g., (ālqmr ǧmyl - The moon is beautiful)); the predicate can be either a noun or a sentence. Thus, determining the sentence type helps to find the core element in the sentence, which affects its semantic representation. In verbal sentences, the verb itself is considered the core element; in nominal sentences, the subject is considered the core element. Another important feature of the Arabic sentence is its flexible order property. The flexible order property of the Arabic language causes syntactic and semantic ambiguities that require deeper analysis of all possible sentence forms as well as the relations between words. Table 1 illustrates an example of the flexible order property. Note that the sentences in the example have the same meaning. Arabic words may have different meanings and different semantic interpretations (Shaalan, Siddiqui, Alkhatib, & Monem, 2009). Therefore, named entity recognition in the Arabic language is a challenging task. For example, the noun (omnyh - wish) can be recognized as a named entity (a telecommunication organization in Jordan) or a noun that means wish. Furthermore, Arabic names

Table 1
Arabic language flexible order example.

Translation in English - Transliteration          Order
The boy ate the apple - kl lwld ltfht.            Verb-Subject-Object
The boy, he ate the apple - lwld kl ltfht.        Subject-Verb-Object
The apple, the boy ate it - ltfht lwld klh.       Object-Subject-Verb
The apple, eaten by the boy - ltfht klh lwld.     Object-Verb-Subject
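The orderings in Table 1 can be reconciled programmatically: once words are tagged with their grammatical roles (here assumed to come from a dependency parser), the surface order no longer matters. A minimal sketch:

```python
# Sketch: role-tagged words (roles assumed to come from a dependency
# parser) reduce any of the four orderings in Table 1 to one canonical
# (verb, subject, object) triple, neutralizing the flexible word order.
def canonical_triple(tagged_words):
    roles = {role: word for word, role in tagged_words}
    return (roles["verb"], roles["subject"], roles["object"])

vso = [("ate", "verb"), ("boy", "subject"), ("apple", "object")]
ovs = [("apple", "object"), ("ate", "verb"), ("boy", "subject")]
print(canonical_triple(vso) == canonical_triple(ovs))  # True
```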



that are derived from adjectives are usually ambiguous and affect the whole sentence representation. For example, the word (krym) can be used as a named entity (a person's name) or an adjective that means generous. Therefore, the named entity recognition task aims to extract and classify named entities from text into predefined groups such as persons, locations, and organizations. The extraction of such entities helps to represent their semantic role in the sentence correctly. In this paper, we propose a novel semantic parser for Arabic language texts. It is an automatic parser that depends on many different Arabic techniques and tools such as the Farasa dependency parser (Abdelali, Darwish, Durrani, & Mubarak, 2016), POS tagging, and a word segmenter. The main task is to extract the semantic relations between the words in a sentence and represent the sentence and its semantic relations as a semantic graph. The main innovation of the proposed model is that it targets the Arabic language, considering Arabic language features and challenges during the parsing process, with the ability to extend it to other languages.

5. The proposed model

A graph is defined as G = (V, E), where V is a set of vertices and E is a set of edges, with E ⊆ V × V. A graph is called a weighted graph if there is a weight function W that assigns a value to each edge in the graph. This value is application/domain dependent and can be a cost, a distance, or any descriptive value. Otherwise, the graph is called an unweighted graph. According to the type of its edges, a graph is classified into two main types: directed and undirected. In a directed graph, E(X, Y) ≠ E(Y, X), while in an undirected graph, E(X, Y) = E(Y, X). A graph can be either homogeneous or heterogeneous based on its vertex types: a homogeneous graph consists of vertices of the same type, while a heterogeneous graph includes vertices of different types. A path in a graph is a sequence P of vertices v1, v2, ..., vk-1, vk, where for each consecutive pair vi, vi+1 there is an edge e(vi, vi+1) ∈ E.
A cycle is a path v1, v2, ..., vk-1, vk with k > 2 whose vertices are distinct, except that v1 = vk. A graph that does not contain a cycle is called acyclic. Rooted directed acyclic graphs (DAGs) are constructed to represent sentences as semantic graphs Gs. The vertex set V contains two main types of vertices: word vertices Vw and concept vertices Vc. Formally, the vertex set is:

V = {Vw, Vc | w ∈ W, c ∈ C}

where C is a set of predefined concepts containing person, location, and date\time, and W is the set of all words in the text. On the other hand, the edge set E consists of weighted edges that connect any two vertices in the graph, where:

E ⊆ {(vx, vy) | vx, vy ∈ V}

In the weighted graph, each edge e(vx, vy), where vx, vy ∈ V, is associated with a weight We that represents the semantic relation between its two connected vertices vx and vy, with We ∈ R, where R is the set of predefined relation types illustrated in Fig. 1. The proposed model includes five main groups of relations:

1. Word Relations. Each word in the original sentence has been represented as a vertex in the semantic graph, and each word has

Fig. 1. The proposed semantic relations.


Fig. 2. Adding verb relations.

several attributes or related words, such as root, synonym, plurality (single, dual, or plural), and type (noun, verb, or article). These related words have been represented in the semantic graph as vertices and linked to the original word via labeled edges according to the word's attribute type.
2. Verb Relations. In the Arabic language, each verb has a subject, one or more objects, a tense (past, present, or future), negation, and an occurrence frequency. These verb attributes have been represented in the semantic graph in two different ways: 1) add a new edge that connects the verb vertex to another noun vertex in case the attribute is a subject, object, or occurrence frequency; 2) add a new vertex and connect it to the verb vertex (e.g., a new vertex is added to the graph that represents the verb tense, and a new edge is added to link the verb with its tense). For example, the sentence (ālwld ākl āltfāht - the boy ate an apple) consists of three main vertices (ālwld - boy, ākl - ate, and āltfāht - apple) (Fig. 2a); the subject and object relations are represented by adding new edges between the existing vertices (Fig. 2b), while the tense relation is represented by adding a new vertex and a new edge to the graph (Fig. 2c).
3. Noun Relations. The relations between nouns have been represented in several ways. Mainly, the relations between nouns are either adjective or quantity relations; in this case, a new edge is added to the graph in order to connect the two noun vertices. Fig. 3 illustrates an example of representing the sentence (ltqs myl - The weather is beautiful). Nouns can be categorized into three main types: Location, Person, and Date\Time. Each type has its own attributes and properties. These categories have been represented as follows: Location. The Location type has six attributes: place, path, direction, source, destination, and modifier.
To represent locations, a new concept vertex called (Location) has been created and added to the semantic graph. After that, the original word in the sentence has been connected to this concept vertex with a new edge labeled with the relation type. For example, as illustrated in Fig. 4, in order to represent the sentence (dhb ālwld āla ālmdrsah - the boy went to the school), the verb is connected to the concept vertex (Location) via a new edge, a new vertex (ālmdrsah - the school) is added, and finally a new edge connects the concept vertex with the original word's vertex and represents the relation type (Destination).
• Person. Person names can be mentioned in the original text in many different forms in the Arabic language. They may consist of one or more phrases (e.g., (Hatem Al Ta'ae)) or noun phrases (e.g., (mohtr' āldarah - Inventor of an atom)). Thus, a new concept vertex called (Person) has been created and added to the semantic graph in order to represent the person's name. Fig. 5 illustrates an example of representing the sentence (ktb nzār qbāny ālqasyda - Nizar Qabani wrote the poem).
• Date\Time. Date\Time has five main attributes: start, finish, duration, age, and date, the last of which has one or more sub-attributes (day, month, year, etc.). Similar to Location and Person, Date\Time has been represented by creating a new concept vertex called (Time) and adding it to the semantic graph. After that, the original word in the sentence has been connected to this concept vertex with a new edge labeled with the relation type. Fig. 6 illustrates an example of representing the sentence (htalt ālamtār lyl ālgom'h - The rain fell Friday night).

4. Conjunction Relations. Two new concept vertices have been used to represent conjunction relations: (w - and) and (āw - or). Furthermore, the option of the conjunction has been represented as a relation edge that connects the concept conjunction vertex with the original word vertex. For example, the representation of the sentence (ālwld ākl āltfāht wālbrtqāla - The boy ate an apple and an orange) is illustrated in Fig. 7.
5. Question (Interrogative) Relations. Another special vertex has been added in order to represent the unknown object and the question

Fig. 3. Adding noun relations.


Fig. 4. Adding location relations.

Fig. 5. Adding person relations.

Fig. 6. Adding date\time relations.

about it. For example, in order to represent the question (āmhamd fāz m khāld? - Is the winner Mohamad or Khaled?), a new vertex called "unknown" is created and attached to the conjunction vertex (the object that we ask about), as illustrated in Fig. 8.
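The five relation groups above can be sketched as a set of labeled edges, using English glosses alongside the transliterations. The relation labels are illustrative assumptions based on the figures, not the model's exact vocabulary:

```python
# Sketch of the semantic graph for "ālwld ākl āltfāht" (the boy ate an
# apple): word vertices are implicit in the edges, the verb is the root,
# and tense gets its own vertex linked to the verb (cf. Fig. 2c).
edges = set()   # (source, relation, target)

def add(source, relation, target):
    edges.add((source, relation, target))

add("ākl (ate)", "subject", "ālwld (boy)")
add("ākl (ate)", "object", "āltfāht (apple)")
add("ākl (ate)", "tense", "past")

# A location hangs off a concept vertex (cf. Fig. 4):
add("dhb (went)", "location", "Location")
add("Location", "destination", "ālmdrsah (school)")

root = "ākl (ate)"
print(sorted(r for s, r, t in edges if s == root))  # ['object', 'subject', 'tense']
```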

The proposed model may represent different sentences by the same graph if they share the same semantic meaning. This follows from the fact that the order of the words within a sentence has no impact on its representation; thus, the impact of the Arabic language's flexible order property is reduced. For instance, Table 3 shows differently ordered sentences and Fig. 9 illustrates their common graph representation. Compared with other graph-based models, the proposed model shares some features with the state-of-the-art models and differs in many others. In terms of the concepts that are represented, AMR
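As one simple illustration of how such graphs might feed an entailment check (this is not the paper's actual entailment procedure), a hypothesis whose labeled edges are all contained in the text's graph can be treated as an entailment candidate:

```python
# Illustrative containment check (an assumption for this sketch, not the
# paper's method): sentences are edge sets, and a hypothesis contained
# in the text graph is a plausible entailment candidate.
def edges_entail(text_edges, hypothesis_edges):
    return hypothesis_edges <= text_edges

text = {("ate", "subject", "boy"), ("ate", "object", "apple"),
        ("apple", "adjective", "red")}
hypothesis = {("ate", "subject", "boy"), ("ate", "object", "apple")}
print(edges_entail(text, hypothesis))  # True
```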


Fig. 7. Adding conjunction relations.

Fig. 8. Adding question relations.

represented PropBank frames, while UNL considered WordNet frames. On the other hand, the proposed model and UCCA adopt words as concepts. In terms of the relations used between concepts, GMB used VerbNet roles, AMR used frame-specific PropBank relations, and UNL adopted a set of 30 frequently used relations, while the proposed model uses a set of predefined semantic relations. Moreover, GMB and UCCA are used to parse short texts, while AMR and the proposed model are used to parse individual sentences. Finally, regarding parsing automaticity, AMR, UNL, and UCCA annotate text manually, while GMB and the proposed model parse text automatically using specific parsers.
Numerous Arabic graph-based models have been proposed for simplifying Arabic NLP problems rather than for Arabic text representation (Etaiwi & Awajan, 2018). Arabic text representation modes are categorized into three main categories (Karima et al., 2012): bag-of-words representation, N-grams representation, and concept representation. 1) Bag of words, in which the text is represented as a vector of words. The main disadvantage of this representation is that it excludes the grammatical and semantic properties of the represented text. 2) N-grams representation, in which the text is represented as a sequence of N components such as characters or words. Like the bag-of-words representation, the N-grams representation does not consider any semantic or grammatical properties of the text. 3) Concept representation, in which the text is represented as concepts rather than text terms, such as graph-based text. The extracted concepts are related to each other using semantic, grammatical or syntactic relations.
Few contributions have focused on using graphs for Arabic text representation, as listed in Table 2. For instance, Hadni and Gouiouez (2017) used semantic graph-based representation for text categorization, while Al-Taani and Al-Omour (2014); Alami, Meknassi, Alaoui Ouatik, and Ennahnahi (2015) and Ismail et al. (2013b) exploited graph theory in the text


Fig. 9. Semantic graph representing the different sentences listed in Table 3.

Table 2
Graph-based text representation methods.

| Method | Usage (Application/Resource) | Language | Consider Semantic Relations |
|---|---|---|---|
| AMR (Banarescu et al., 2013) | Application | English | Yes |
| GMB (Bos et al., 2017) | Application | English | Yes |
| UCCA (Abend & Rappoport, 2013) | Application | English | Yes |
| UNL (Boguslavsky et al., 2000) | Application | English | Yes |
| Vidal et al. (2014) | Application | English | Yes |
| Ismail et al. (2013b) | Application | Arabic | No |
| CAMS-KG (Bounhas et al., 2019) | Resource | English | Yes |
| CORM and CARM (Qu et al., 2018) | Application | English | No |
| Kastrati et al. (2019) | Application | English | No |
| de Arruda et al. (2019) | Application | English | No |
| Arabic WordNet (Lakhfif & Laskri, 2015) | Resource | Arabic | No |
| Hadni and Gouiouez (2017) | Application | Arabic | No |
| Al-Taani and Al-Omour (2014) | Application | Arabic | No |
| Alami et al. (2015) | Application | Arabic | No |
| Halabi and Awajan (2019) | Application | Arabic | No |
| The proposed model | Application | Arabic | Yes |

Table 3
Differently ordered sentences with the same meaning.

| ID | Sentence |
|---|---|
| 1 | (ākl ālwld āltfāha fy ālhadyqa 'nd ālsabāh ālbākr) - The boy ate an apple in the garden in the early morning. |
| 2 | (ākl ālwld āltfāha 'nd ālsabāh ālbākr fy ālhadyqa) - The boy ate an apple in the early morning in the garden. |
| 3 | ('nd ālsabāh ālbākr ākl ālwld āltfāha fy ālhadyqa) - In the early morning, the boy ate an apple in the garden. |
| 4 | (fy ālhadyqa 'nd ālsabāh ālbākr ākl ālwld āltfāha) - In the garden, in the early morning, the boy ate an apple. |



summarization field. In Alami et al. (2015), each sentence in the given text is represented as a vertex, while the edges represent the interconnections between sentences. The similarity relation between sentences is defined as a function of concept overlap. This representation method does not consider the semantic, syntactic, or grammatical relations between words or sentences. The rich semantic graph proposed by Ismail et al. (2013b) is used for abstractive Arabic text summarization. The semantics are represented in the vertices themselves: each vertex has many attributes, including type, descriptor, subject, place, objects, etc. That representation model focuses on the semantic attributes of words rather than the semantic relations between words. A graph can also be used to represent Arabic words and the co-occurrence relations between them, as in Halabi and Awajan (2019) and El Bazzi, Mammass, Zaki, and Ennaji (2016). Halabi and Awajan (2019) used a graph to represent word stems and the co-occurrence relation in order to extract key-phrases from Arabic text, while El Bazzi et al. (2016) used the TextRank algorithm (Mihalcea & Tarau, 2004) to represent words and the co-occurrence relations between them for Arabic document indexing. Furthermore, many researchers have developed Arabic semantic graph resources, such as Arabic WordNet (Black et al., 2006; Lakhfif & Laskri, 2015) and CAMS-KG (Bounhas et al., 2019). In Arabic WordNet, Arabic words are grouped into sets of synonyms, while in Bounhas et al. (2019), a morpho-semantic knowledge graph is developed from a vocalized classical Arabic corpus.
The model presented in this study is an application-independent Arabic graph-based semantic representation model. Furthermore, the semantic relations between text components (for example, words) are considered in the final representation. To the best of our knowledge, the proposed graph-based semantic representation model is the first of its type.

6. Building the semantic graph

Many Arabic text processing toolkits have been proposed for the Arabic language in order to perform specific text processing tasks, such as POS tagging, segmentation, dependency parsing, and named entity recognition. Farasa is one of the latest Arabic text processing toolkits; it has been developed by the Arabic Language Technologies Group at the Qatar Computing Research Institute (QCRI). It is an open-source text processing toolkit that provides many text processing capabilities such as segmentation, lemmatization, POS tagging, Arabic diacritization, dependency parsing, constituency parsing, named-entity recognition, and spell-checking. The Stanford NLP Group has developed various techniques and tools for different Arabic language processing tasks (e.g., word segmentation (Monroe, Green, & Manning, 2014)). Other freely available software for processing the Arabic language has been published online, including stemmers, statistical parsers, POS taggers, and word segmenters (e.g., the Tashaphyne Arabic Light Stemmer (Dahab, Ibrahim, & Al-Mutawa, 2015)).
The process of building the proposed semantic graph consists of three main phases: (1) identify the dependency relations between words in the original text; (2) extract the potential relations between words using Arabic language tools (e.g., POS taggers); (3) apply predefined rules in order to identify the semantic relations between words and to build the final semantic graph. We used the Farasa Arabic language toolkit in addition to other Arabic language tools, including the Tashaphyne Arabic Light Stemmer library. The Farasa dependency parser has been used to find the syntactic dependency relations between Arabic words. Farasa produces the dependency relation of each word in the sentence; for example, the dependency relations between the words in the sentence (ālwld ākl āltfāht - The boy ate an apple) are illustrated in Fig. 10.
After finding the dependency relations, potential relations are figured out using several Arabic language tools (the Farasa segmenter, Farasa POS tagger, Farasa named entity recognizer, and Tashaphyne Arabic Light Stemmer). The Farasa segmenter is used to segment the original sentence into its words. The Farasa POS tagger is used to find subject, object, and adjective relations. The Farasa named entity recognizer is used to extract person and location named entities. Finally, a set of predefined rules is applied to extract the semantic relations between words based on Arabic language features and Arabic NLP techniques and methods. For instance, for triliteral roots (which consist of three letters) with a fat'ha or damma diacritic on the middle letter of the present-tense verb, location words in the Arabic language are nouns and follow a specific set of patterns (such as (mf'el)), as in the word (ml'b - playground) and the word (masjed - mosque) (Sattar, 2012). Thus, the following rule is applied: If (the word is a noun and its pattern is (mf'al)), then (the word is a location). The Tashaphyne Arabic Light Stemmer is used to extract the root of the word. Then, the pattern of the original word is determined by comparing the original word with its root. According to the dependency parser, the relation between the location word and its parent word is a location relation.
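The pattern-matching step of this rule can be sketched as follows. This is a hedged illustration, not the actual pipeline: it operates on unvocalized transliterated strings, and the root and POS tag are supplied directly instead of coming from the Tashaphyne stemmer and the Farasa POS tagger.

```python
# Sketch of the location rule described above (illustrative only). The word's
# pattern is exposed by aligning it with its triliteral root: each radical,
# in order, is replaced by '*', so ml'b (root l'b) yields the template m***,
# i.e. the mf'al pattern with "m" prefixed to the three radicals.
def word_pattern(word, root):
    """Replace the root radicals, in order, with '*' to expose the template."""
    pattern, i = [], 0
    for ch in word:
        if i < len(root) and ch == root[i]:
            pattern.append("*")
            i += 1
        else:
            pattern.append(ch)
    return "".join(pattern)

def is_location(word, root, pos_tag):
    # Rule from the text: a noun whose pattern is mf'al (m + three radicals).
    return pos_tag == "NOUN" and word_pattern(word, root) == "m***"

print(is_location("ml3b", "l3b", "NOUN"))   # ml'b "playground" -> True
print(is_location("ktab", "ktb", "NOUN"))   # ktab "book" -> False
```

In the real pipeline the POS tag would come from Farasa and the root from Tashaphyne; here both are stub inputs.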

Fig. 10. Farasa dependency parser results.


7. Evaluation

The semantic graph can be evaluated according to its ability to enhance other NLP applications, such as Question Answering (QA), keyword extraction, and Textual Entailment (TE) recognition. In this research, TE recognition is used to evaluate the ability of the proposed semantic graph representation to enhance other Arabic NLP applications.

7.1. Textual entailment

The main purpose of TE recognition is to decide whether the meaning of one text is entailed by, or can be inferred from, another text. The directional relation between a pair of texts is denoted by T → H, where T is the entailing text and H is the entailed hypothesis (Dagan, Glickman, & Magnini, 2006). This relation can be expressed as: T entails H if the meaning of H can be inferred from the meaning of T. For example, the following pair of texts is an entailment pair because the texts share the same semantic meaning:
T: (zārt ālmmthla ānǧlynā gowly mohayam āllāǧeyn ālswryyn 'la ālhodwd āltrkya ālswrya - Actress Angelina Jolie visited the Syrian refugee camp on the Turkish-Syrian border as a goodwill ambassador).
H: (āngelynā gowly tzwr mo'skr āllāge'yn ālswryyn fy trkyā - Angelina Jolie visits the Syrian refugee camp in Turkey).
On the other hand, the following pair of statements is not an entailment pair because the statements express different meanings:
T: (ktāb gadyd llkātb ālāmryky dyfyd brwks yqwl ān ālānsān y'tmd āktr 'la 'wātefh w 'qlh ālbātny - A new book by American author David Brooks says that humans rely more on their emotions and inner mind).
H: (msā'rnā thadd mda ngāhna fy ālhyāh - Our feelings determine how successful we are in life).
Automatic recognition of textual entailment can support a wide variety of NLP tasks, including information retrieval, QA, and text summarization (Korman, Mack, Jett, & Renear, 2018). Therefore, several TE recognition systems have been proposed based on Machine Learning (ML), lexical, or semantic approaches (Androutsopoulos & Malakasiotis, 2010).
TE recognition can be considered a classification problem. Typical TE recognizers consist of three main phases (Haggag, ELFattah, & Ahmed, 2016): the text representation phase, the comparison phase, and the entailment decision phase. In the representation phase, the text and the hypothesis are represented in a form that facilitates the comparison between them. In the comparison phase, the text and hypothesis are compared based on their representations. Finally, based on the output of the comparison phase, the TE recognizer determines the entailment relationship between the texts. In this research, the semantic graph is used in the first phase of the TE recognition system, in which both the text and the hypothesis are represented as separate semantic graphs. Then, the two graphs are compared to recognize the entailment relation between the text and the hypothesis.

7.2. Dataset

To evaluate the ability of the semantic graph representation to enhance textual entailment recognition models, the Arabic Textual Entailment Dataset (ArbTED)1 (Alabbas & Ramsay, 2013a) is considered. ArbTED is a well-known TE dataset that has been used for evaluating proposed TE recognition systems. It contains 600 Arabic text-hypothesis pairs. Each pair was manually annotated by three human annotators: a pair is annotated as "Entails" if all three annotators agreed that the text entails the hypothesis, and as "NotEntails" otherwise. The dataset is balanced; half of the pairs are entailment pairs, and the remaining pairs are not.

7.3. Experiments

The main goal of the experiments is to evaluate the ability of the proposed semantic representation graph to enhance the results of the TE application. The evaluation metrics include precision, recall, F-score, and accuracy. The precision represents the fraction of correct decisions to the total number of decisions given in a particular class. It is calculated as:

$$\mathrm{Precision} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalsePositive}} \tag{1}$$

where TruePositive is the number of correct entail decisions, and FalsePositive is the number of incorrect entail decisions. The recall refers to the fraction of correct decisions given by the TE method to the total number of text pairs in a particular class. It is calculated as:

1 http://www.cs.man.ac.uk/~ramsay/arabicTE/


$$\mathrm{Recall} = \frac{\mathrm{TruePositive}}{\mathrm{TruePositive} + \mathrm{FalseNegative}} \tag{2}$$

where FalseNegative is the number of incorrect not-entail decisions. The accuracy measures how close the obtained entailment decisions are to the actual entailment results. The evaluation process is a binary decision process in which the decision of the proposed model is correct if and only if it matches the actual entailment decision in the dataset. It is calculated as:

$$\mathrm{Accuracy} = \frac{\mathrm{TruePositive} + \mathrm{TrueNegative}}{\mathrm{TruePositive} + \mathrm{TrueNegative} + \mathrm{FalsePositive} + \mathrm{FalseNegative}} \tag{3}$$

where TrueNegative is the number of correct not-entail decisions. Finally, the F-score is the harmonic mean of precision and recall. It is calculated as:

$$\mathrm{F\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$
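Eqs. (1)-(4) can be written out directly as functions of the confusion-matrix counts. As a consistency check, plugging the precision and recall reported in Table 4 for λ = 30% into Eq. (4) reproduces that row's F-score up to rounding of the published percentages.

```python
# Eqs. (1)-(4) over confusion-matrix counts (tp/fp/tn/fn).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f_score(p, r):
    return 2 * p * r / (p + r)

# Table 4, row lambda = 30%: P = 66.36%, R = 89.87%.
f = f_score(0.6636, 0.8987)
print(round(100 * f, 2))   # within rounding of the reported 76.34
```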

To determine the entailment between a text and a hypothesis, they are represented as two separate semantic graphs, GT and GH: GT is the semantic graph that represents the text, and GH is the semantic graph that represents the hypothesis. The Entailment Similarity (ES) between the text and hypothesis graphs is defined as the percentage of the hypothesis semantic graph's components (edges and vertices) that exist among the text semantic graph's components. It is computed as:

$$ES(G_T, G_H) = \frac{|\{e \mid e \in E_H \wedge e \in E_T\}| + |\{v \mid v \in V_H \wedge v \in V_T\}|}{|E_H| + |V_H|} \tag{5}$$

where EH is the edge set of GH, ET is the edge set of GT, VH is the vertex set of GH, and VT is the vertex set of GT. Finally, when the ES exceeds a predefined threshold λ, the text entails the hypothesis; otherwise, it does not. The decision is made as follows:

$$\mathrm{Decision} = \begin{cases} \mathrm{Entail}, & \text{if } ES \geq \lambda \\ \mathrm{NotEntail}, & \text{otherwise} \end{cases} \tag{6}$$

The sensitivity of the threshold λ was analyzed in a separate experiment. Table 4 illustrates the results of applying the proposed model to 150 pairs of TE statements (25% of the dataset) that were selected randomly and automatically from ArbTED. The sensitivity analysis assesses the performance of the proposed model with respect to the threshold value. The results show that precision increases as λ increases, while a lower λ yields a higher recall. The best F-score occurs when λ = 30%; thus, λ = 30% is applied in the experiments. Fig. 11 illustrates the relationship between the evaluation measures and the different threshold values.
The effect of the semantic representation of text as a semantic graph is evaluated in the experiments. The performance of the proposed model was assessed against the best reported results of the state-of-the-art TE approaches for Arabic text. The results (shown in Table 5) were obtained by applying the proposed TE model to ArbTED, with the baselines' results as reported in their respective papers. Four evaluation measures are used: accuracy, precision, recall, and F-score. The results show that the proposed model outperforms the other TE methods applied to the same dataset in terms of precision, recall, and F-score. Furthermore, the proposed model performs comparably to the others in terms of accuracy.

7.4. Discussion

The performance of the proposed model depends on many different factors, such as the quality of the dataset used, the quality of the Arabic toolkits used to build the semantic graph, the ES measure of the semantic graphs, and the threshold value. Since both the quality and the size of the training dataset (text) are key factors that affect the performance of NLP machine learning approaches (Kavzoglu, 2009), the quality of the processed text affects the overall performance of the semantic representation process.
Clear text that has fewer syntactic errors, less ambiguity, and less use of informal words enhances the quality of the semantic

Table 4
Sensitivity analysis of threshold λ.

| Threshold | Precision | Recall | F-Score | Accuracy |
|---|---|---|---|---|
| 10% | 57.66% | 100% | 73.15% | 61.33% |
| 15% | 58.65% | 98.73% | 73.58% | 62.67% |
| 20% | 60.48% | 94.94% | 73.89% | 64.67% |
| 25% | 64.04% | 92.41% | 75.65% | 68.67% |
| 30% | 66.36% | 89.87% | 76.34% | 70.67% |
| 35% | 70.65% | 82.28% | 76.02% | 72.67% |
| 40% | 70.67% | 67.09% | 68.83% | 68.00% |
| 45% | 73.77% | 56.96% | 64.29% | 66.67% |
| 50% | 79.05% | 43.04% | 55.74% | 64.00% |
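The λ sweep behind this sensitivity analysis can be sketched as follows: score each held-out pair with ES, evaluate every candidate threshold, and keep the one with the best F-score. The (score, gold-label) pairs below are made-up stand-ins, not ArbTED data.

```python
# Sketch of the lambda sensitivity sweep: for each candidate threshold,
# compute precision/recall/F-score over (ES score, gold label) pairs and
# keep the best-F-score threshold.
def sweep(scored_pairs, thresholds):
    best = None                          # (lambda, f_score)
    for lam in thresholds:
        tp = sum(1 for s, y in scored_pairs if s >= lam and y)
        fp = sum(1 for s, y in scored_pairs if s >= lam and not y)
        fn = sum(1 for s, y in scored_pairs if s < lam and y)
        if tp == 0:
            continue                     # F-score undefined without positives
        p, r = tp / (tp + fp), tp / (tp + fn)
        f = 2 * p * r / (p + r)
        if best is None or f > best[1]:
            best = (lam, f)
    return best

# Illustrative scores only; in the paper the sweep ran over 150 ArbTED pairs.
pairs = [(0.9, True), (0.5, True), (0.35, True), (0.4, False), (0.1, False)]
print(sweep(pairs, [0.1, 0.2, 0.3, 0.4, 0.5]))
```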


Fig. 11. The relationship between the evaluation measures and the different threshold values.

Table 5
Evaluation measures on TE using ArbTED.

| Method | Precision | Recall | F-Score | Accuracy |
|---|---|---|---|---|
| The Proposed Model | 64.2% | 80.7% | 71.5% | 67.8% |
| Khader et al. (2016) | 61% | 61% | – | – |
| The Arabic Text Entailment (ATE) (AL-Khawaldeh, 2015) | – | – | – | 61.7% |
| Sentiment Analysis and Negation Resolving for Arabic Text Entailment (SANATE) (AL-Khawaldeh, 2015) | – | – | – | 69.3% |
| Bag-of-words (BOW) (Alabbas & Ramsay, 2013b) | 63.6% | 43.7% | 51.8% | 59.3% |
| Tree Edit Distance 1 (ZS-TED1) (Alabbas & Ramsay, 2013b) | 57.7% | 64.7% | 61% | 58.7% |
| Tree Edit Distance 2 (ZS-TED2) (Alabbas & Ramsay, 2013b) | 61.6% | 73.7% | 67.1% | 63.8% |
| Extended Tree Edit Distance 1 (ETED1) (Alabbas & Ramsay, 2013b) | 59% | 65.7% | 62.1% | 60% |
| Extended Tree Edit Distance 2 (ETED2) (Alabbas & Ramsay, 2013b) | 63.2% | 75% | 68.6% | 65.7% |

graph representation. Since the process of building the semantic graph is rule-based, correct input to the rules helps in applying the most suitable rules and produces the most appropriate representation of the text. The input to the rules is a collection of semantic relations between words. These relations are extracted using many different Arabic text processing toolkits, and the quality of the toolkits used to extract the semantic relations affects the quality of the text representation. The ES measure between the produced semantic graph representations affects the process of determining the entailment relation between texts. The ES calculation depends on many different factors, such as the total number of matching edges, vertices, and paths between the compared graphs.
Khader, Awajan, and Alkouz (2016) proposed a lexical-based textual entailment method that adopted semantic matching based on word overlap, in addition to synonym and bigram matching. AL-Khawaldeh (2015) proposed the Sentiment Analysis and Negation Resolving for Arabic Text Entailment (SANATE) approach, in which the entailment decision is based on the polarity of the input text: the author assumed that positive text entails positive hypotheses and vice versa. The experimental results showed that resolving negation and extracting text polarity enhance entailment accuracy. The Arabic Textual Entailment (ATE) algorithm (AL-Khawaldeh, 2015) recognized entailment statistically by finding the common words between the text and the hypothesis. Similarly, BOW measures the similarity between the text and the hypothesis using a bag of words, where the total number of common words is divided by the length of the hypothesis. Alabbas and Ramsay (2013b) extended the dynamic-programming-based tree edit distance algorithm proposed by Zhang and Shasha (ZS-TED) (Zhang & Shasha, 1989). ZS-TED is a tree-based matching algorithm between two ordered rooted trees.
The authors tested two different approaches: ZS-TED1 and ZS-TED2. In ZS-TED1, the authors determined the cost of deleting, inserting, or exchanging a node manually, while in ZS-TED2 the costs were determined intuitively based on a set of stopwords, synonyms, and hypernyms. In addition, Alabbas and Ramsay (2013b) improved the tree edit distance (TED) approach for textual entailment (Kouylekov & Magnini, 2005). In the first extended version, ETED1, the authors assumed that the cost of a subtree is half the sum of the costs of its parts, while in ETED2 the cost of the subtrees was determined intuitively.


8. Conclusion

In this article, we proposed a graph-based semantic representation model for Arabic texts. The proposed model represents words and the semantic relations between them as a rooted acyclic graph called a semantic graph. The vertices in the proposed semantic graph consist of the original words in addition to the main concepts, while the edges represent the semantic relations between words. Arabic language features are considered during the semantic graph construction. The proposed representation model was evaluated according to its ability to enhance TE recognition, and ArbTED was used in the experiments. The experimental results show that the proposed representation model enhances the performance of TE recognition applications.

CRediT authorship contribution statement

Wael Etaiwi: Conceptualization, Methodology, Software, Investigation, Writing - original draft, Visualization. Arafat Awajan: Supervision, Validation, Writing - review & editing, Project administration.

Declaration of Competing Interest

None.

References

Abdelali, A., Darwish, K., Durrani, N., & Mubarak, H. (2016). Farasa: A fast and furious segmenter for Arabic. Proceedings of the 2016 conference of the North American chapter of the Association for Computational Linguistics: Demonstrations. Association for Computational Linguistics, 11–16.
Abend, O., & Rappoport, A. (2013). Universal conceptual cognitive annotation (UCCA). Proceedings of the 51st annual meeting of the Association for Computational Linguistics (volume 1: Long papers). Association for Computational Linguistics, 228–238.
Abend, O., & Rappoport, A. (2017). The state of the art in semantic representation. Proceedings of the 55th annual meeting of the Association for Computational Linguistics (volume 1: Long papers), 77–89.
Aboamer, Y., & Kracht, M. (2018). Representing meaning of Arabic sentence dynamically and more smoothly. Procedia Computer Science, 142, 321–327.
AL-Khawaldeh, F. T. (2015). A study of the effect of resolving negation and sentiment analysis in recognizing text entailment for Arabic. World of Computer Science & Information Technology Journal, 5(7), 124–128.
AL-Smadi, M., Jaradat, Z., AL-Ayyoub, M., & Jararweh, Y. (2017). Paraphrase identification and semantic text similarity analysis in Arabic news tweets using lexical, syntactic, and semantic features. Information Processing & Management, 53(3), 640–652.
Al-Taani, A. T., & Al-Omour, M. M. (2014). An extractive graph-based Arabic text summarization approach. The international Arab conference on information technology, Jordan, 158–163.
Alabbas, M., & Ramsay, A. (2013a). Natural language inference for Arabic using extended tree edit distance with subtrees. Journal of Artificial Intelligence Research, 48(1), 1–22.
Alabbas, M., & Ramsay, A. (2013b). Optimising tree edit distance with subtrees for textual entailment. Proceedings of the international conference recent advances in natural language processing RANLP 2013. INCOMA Ltd., Shoumen, Bulgaria, 9–17.
Alami, N., Meknassi, M., Alaoui Ouatik, S., & Ennahnahi, N. (2015). Arabic text summarization based on graph theory. 2015 IEEE/ACS 12th international conference of computer systems and applications (AICCSA), 1–8.
Alansary, S., Nagi, M., & Adly, N. (2009). The universal networking language in action in English-Arabic machine translation. Proceedings of the 9th Egyptian society of language engineering conference on language engineering (ESOLEC 2009), 23–24.
Ali, A., & Khan, M. A. (2009). Selecting predicate logic for knowledge representation by comparative study of knowledge representation schemes. 2009 international conference on emerging technologies, 23–28.
Ali, A., & Khan, M. A. (2010). Knowledge representation of Urdu text using predicate logic. 2010 6th international conference on emerging technologies (ICET), 293–298.
Androutsopoulos, I., & Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38(1), 135–187.
de Arruda, H. F., Marinho, V. Q., Costa, L., & Amancio, D. R. (2019). Paragraph-based representation of texts: A complex networks approach. Information Processing & Management, 56(3), 479–494.
Awajan, A. (2007). Arabic text preprocessing for the natural language processing applications. Arab Gulf Journal of Scientific Research, 25(4), 179–189.
Awajan, A. (2014). Unsupervised approach for automatic keyword extraction from Arabic documents. Proceedings of the 26th conference on computational linguistics and speech processing (ROCLING 2014). The Association for Computational Linguistics and Chinese Language Processing (ACLCLP), 175–184.
Awajan, A. (2015). Keyword extraction from Arabic documents using term equivalence classes. ACM Transactions on Asian and Low-Resource Language Information Processing, 14(2), 7:1–7:18.
Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., ... Schneider, N. (2013). Abstract meaning representation for sembanking. Proceedings of the 7th linguistic annotation workshop and interoperability with discourse. Association for Computational Linguistics, 178–186.
Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., & Fellbaum, C. (2006). Introducing the Arabic WordNet project. Proceedings of the third international WordNet conference. Citeseer, 295–300.
Boguslavsky, I., Frid, N., Iomdin, L., Kreidlin, L., Sagalova, I., & Sizov, V. (2000). Creating a universal networking language module within an advanced NLP system. Proceedings of the 18th conference on computational linguistics, volume 1. Association for Computational Linguistics, 83–89.
Bos, J., Basile, V., Evang, K., Venhuizen, N., & Bjerva, J. (2017). The Groningen meaning bank. In N. Ide, & J. Pustejovsky (Eds.), Handbook of linguistic annotation (pp. 463–496). Springer.
Bounhas, I., Soudani, N., & Slimani, Y. (2019). Building a morpho-semantic knowledge graph for Arabic information retrieval. Information Processing & Management, 102124.
Dagan, I., Glickman, O., & Magnini, B. (2006). The PASCAL recognising textual entailment challenge. Machine learning challenges: Evaluating predictive uncertainty, visual object classification, and recognising textual entailment. Springer Berlin Heidelberg, 177–190.
Dahab, M. Y., Ibrahim, A., & Al-Mutawa, R. (2015). A comparative study on Arabic stemmers. International Journal of Computer Applications, 125(8), 38–47.
Dixon, R. M. W. (2010). Basic linguistic theory volume 1: Methodology. Oxford University Press.
El Bazzi, M. S., Mammass, D., Zaki, T., & Ennaji, A. (2016). A graph based method for Arabic document indexing. 2016 7th international conference on sciences of electronics, technologies of information and telecommunications (SETIT), 308–312.
El-Sayed, H. (2015). Arabic between formalization and computation. International Journal of Languages, Literature and Linguistics, 1(1), 25–29.
Etaiwi, W., & Awajan, A. (2018). Graph-based Arabic NLP techniques: A survey. Procedia Computer Science, 142, 328–333.
Ghwanmeh, S., Kanaan, G., Al-Shalabi, R., & Rabab'ah, S. (2009). Enhanced algorithm for extracting the root of Arabic words. 2009 sixth international conference on computer graphics, imaging and visualization, 388–391.
Haddad, B., & Yaseen, M. (2003). Towards semantic composition of Arabic: A λ-DRT based approach. MT summit IX, workshop on machine translation for Semitic languages:


Issues and approaches, AMTA, New Orleans.
Hadni, M., & Gouiouez, M. (2017). Graph based representation for Arabic text categorization. Proceedings of the 2nd international conference on big data, cloud and applications (BDCA'17). New York, NY, USA: ACM, 75:1–75:7.
Haggag, M. H., ELFattah, M. M., & Ahmed, A. M. (2016). Different models and approaches of textual entailment recognition. International Journal of Computer Applications, 142(1), 32–39.
Halabi, D., & Awajan, A. (2019). Graph-based Arabic key-phrases extraction. 2019 2nd international conference on new trends in computing sciences (ICTCS), 1–7.
Hayes, P. J. (1974). Some problems and non-problems in representation theory. Proceedings of the 1st summer conference on artificial intelligence and simulation of behaviour (AISB'74). IOS Press, 63–79.
Ismail, S., Aref, M., & Moawad, I. (2013a). Rich semantic graph: A new semantic text representation approach for Arabic language. (pp. 97–100).
Ismail, S. S., Aref, M., & Moawad, I. F. (2013b). Rich semantic graph: A new semantic text representation approach for Arabic language. 7th WSEAS European computing conference (ECC 13).
Jurafsky, D., & Martin, J. H. (2009). Speech and language processing (2nd edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
Karima, A., Zakaria, E., Yamina, T. G., Mohammed, A., Selvam, R., Venkatakrishnan, V., et al. (2012). Arabic text categorization: A comparative study of different representation modes. Journal of Theoretical and Applied Information Technology, 38(1), 1–5.
Kastrati, Z., Imran, A. S., & Yayilgan, S. Y. (2019). The impact of deep learning on document classification using semantically rich representations. Information Processing & Management, 56(5), 1618–1632.
Kavzoglu, T. (2009). Increasing the accuracy of neural network classification using refined training data. Environmental Modelling & Software, 24(7), 850–858.
Khader, M., Awajan, A., & Alkouz, A. (2016). Textual entailment for Arabic language based on lexical and semantic matching. International Journal of Computing and Information Sciences, 12(1), 67–74.
Korman, D. Z., Mack, E., Jett, J., & Renear, A. H. (2018). Defining textual entailment. Journal of the Association for Information Science and Technology, 69(6), 763–772.
Kouylekov, M., & Magnini, B. (2005). Recognizing textual entailment with tree edit distance algorithms. Proceedings of the first challenge workshop recognising textual entailment, 17–20.
Lakhfif, A., & Laskri, M. T. (2015). A frame-based approach for capturing semantics from Arabic text for text-to-sign language MT. International Journal of Speech Technology, 19(2), 203–228.
Lhioui, C., Zouaghi, A., & Zrigui, M. (2017). A rule-based semantic frame annotation of Arabic speech turns for automatic dialogue analysis. Procedia Computer Science, 117, 46–54.
Liang, P. (2016). Learning executable semantic parsers for natural language understanding. Communications of the ACM, 59(9), 68–76.
Liang, P., & Potts, C. (2015). Bringing machine learning and compositional semantics together. Annual Review of Linguistics, 1(1), 355–376.
Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into text. Proceedings of the 2004 conference on empirical methods in natural language processing. Barcelona, Spain: Association for Computational Linguistics, 404–411.
Monroe, W., Green, S., & Manning, C. D. (2014). Word segmentation of informal Arabic with domain adaptation. Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (volume 2: Short papers). Association for Computational Linguistics, 206–211.
Mylopoulos, J. (1980). An overview of knowledge representation. Proceedings of the 1980 workshop on data abstraction, databases and conceptual modeling. ACM Press, 5–12.
Palmer, M., Gildea, D., & Kingsbury, P. (2005). The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1), 71–106.
Pasha, A., Al-Badrashiny, M., Diab, M. T., El Kholy, A., Eskander, R., Habash, N., ... Roth, R. (2014). MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic. LREC 2014, 1094–1101.
Qu, R., Fang, Y., Bai, W., & Jiang, Y. (2018). Computing semantic similarity based on novel models of semantic representation using Wikipedia. Information Processing & Management, 54(6), 1002–1021.
Quillian, M. R. (1968). Semantic networks. In M. L. Minsky (Ed.), Semantic information processing. MIT Press.
Sattar, S. H. A. (2012). Fundamentals of classical Arabic. 1. Sacred Learning.
Shaalan, K., Siddiqui, S., Alkhatib, M., & Monem, A. A. (2009). Challenges in Arabic natural language processing. Computational linguistics, speech and image processing for Arabic language, 59–83.
Sharaf, A., & Atwell, E. (2009). Knowledge representation of the Quran through frame semantics: A corpus-based approach. Proceedings of the fifth corpus linguistics conference.
Tayal, M. A., Raghuwanshi, M. M., & Malik, L. G. (2015). Semantic representation for natural languages. International Refereed Journal of Engineering and Science (IRJES), 4(10), 01–07.
Vidal, J. C., Lama, M., Otero-García, E., & Bugarín, A. (2014). Graph-based semantic annotation for enriching educational content with linked data. Knowledge-Based Systems, 55, 29–42.
Wilks, Y., & Fass, D. (1992). The preference semantics family. Computers & Mathematics with Applications, 23(2), 205–221.
Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245–1262.
