A New Approach for Calculating Semantic Similarity between Words Using WordNet and Set Theory


Available online at www.sciencedirect.com

ScienceDirect

www.elsevier.com/locate/procedia

Procedia Computer Science 151 (2019) 1261–1265. doi: 10.1016/j.procs.2019.04.182

International Workshop on Web Search and Data Mining, April 29 - May 2, 2019, Leuven, Belgium

Hanane EZZIKOURI*, Youness MADANI, Mohammed ERRITALI, Mohamed OUKESSOU

Sultan Moulay Slimane University, Faculty of Sciences and Techniques, Beni Mellal, Morocco

* Corresponding author. E-mail address: [email protected]

Abstract

Calculating semantic similarity between words is a challenging task in many domains, such as natural language processing (NLP), information retrieval, and plagiarism detection. WordNet is a conceptually organized lexical dictionary in which each concept has several characteristics: synsets and glosses. A synset represents the set of synonyms of a given word, and a gloss is a short description of it. In this paper, we propose a new approach for calculating the semantic similarity between two concepts. The proposed method is based on set-theoretic operations and WordNet properties, calculating the relatedness between the synsets and the glosses of the two concepts.

© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.

Keywords: Semantic Similarity; Natural Language Processing; WordNet; Set Theory.

1. Introduction

With the advent of Web 3.0, the amount of data generated every day has become enormous. This rapid increase in the volume of information has made it difficult to find interesting information within such a huge amount of data.



To overcome this problem, researchers try to find optimal approaches for retrieving relevant information, drawing on domains such as information retrieval, text classification, document clustering, topic detection, topic tracking, question generation, question answering, essay scoring, short-answer scoring, machine translation, text summarization, and others.

Semantic similarity is an active research area that has grown explosively; it aims to compute the relatedness between words, concepts, sentences, and documents. The similarity between two words is a measure of the likeness of their meanings, computed from the properties of the concepts and their relationships in a taxonomy or ontology. Similarity plays a fundamental role in information management, especially when data is unstructured and originates from different sources in flexible environments. Semantic similarity refers to measuring the closeness of two concepts within a given ontology. Potential applications of these measures include knowledge discovery and decision-support systems that use ontologies; among the fields making widest use of semantic similarity measures are information retrieval, plagiarism detection, and sentiment analysis.

In the literature, we find different approaches for calculating similarity, such as machine learning approaches and dictionary-based approaches using dictionaries like WordNet. WordNet is a large lexical database of English in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Additionally, a synset contains a brief definition ("gloss") and, in most cases, one or more short sentences illustrating the use of the synset members [1].

In this paper, we present a new formula for calculating the semantic similarity between two concepts (C1 and C2) using the WordNet dictionary. The proposed approach uses the synonymy (IS-A/PartOf) relationships through the concepts' synsets (S1 and S2, respectively, for C1 and C2) and their glosses (G1 and G2, respectively, for S1 and S2). The synset-based similarity yields a semantic similarity score, as in many WordNet-based measures [8][9][10][11][12], especially characteristic-based ones; the glosses are used to maximize the similarity score. First, we look up each word's synset and its corresponding gloss; then, to obtain the semantic similarity value, we calculate the intersections between the pairs (S1, G1) and (S2, G2).

The rest of this paper is organized as follows: Section 2 reviews related work on semantic similarity measures, Section 3 presents our research methodology, and Section 4 concludes the paper.

2. Literature Review

The field of semantic similarity has seen great development in recent years, and the number of papers in this area has increased explosively. In the literature, many researchers try to find new, optimal approaches to improve the results of similarity calculation.

Madani et al. [2] proposed a new approach for calculating semantic similarity between documents. The method uses the WordNet dictionary and the measure of Leacock and Chodorow [3]. They applied this approach in a search engine (an information retrieval system) to find the relevant documents for a user's query.

Gupta et al. [4] use different preprocessing methods based on natural language processing (NLP) techniques, show how similarity calculation can be improved using fuzzy-semantic similarity measures, and introduce an improved fuzzy-semantic measure that significantly improves the efficiency and accuracy of the system compared to the baseline method of Alzahrani et al.
The system was evaluated on the PAN 2012 dataset.

The authors in [5] propose a robust semantic similarity measure that uses information available on the Web to measure the similarity between words or entities. The method exploits page counts and text snippets returned by a Web search engine: for two given words P and Q, various similarity scores are defined using the page counts for the queries P, Q, and "P AND Q". Moreover, they propose a novel approach that computes semantic similarity using lexico-syntactic patterns automatically extracted from text snippets. These similarity scores are then integrated using support vector machines to obtain a robust semantic similarity measure.
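As an illustration of the page-count idea, the following minimal Python sketch computes a WebJaccard-style score from raw hit counts, in the spirit of [5]. The noise threshold c and the numeric counts in the example call are hypothetical; real values would come from a search engine API.

def web_jaccard(h_p: int, h_q: int, h_pq: int, c: int = 5) -> float:
    """Jaccard-style similarity from page counts H(P), H(Q) and H(P AND Q).

    Very small co-occurrence counts are treated as noise and mapped to 0,
    mirroring the cutoff used for the page-count scores in [5].
    """
    if h_pq <= c:
        return 0.0
    return h_pq / (h_p + h_q - h_pq)

# Hypothetical page counts for P = "car" and Q = "automobile":
print(web_jaccard(h_p=1_000_000, h_q=400_000, h_pq=250_000))  # ~0.217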




Three factors associated with an ontology hierarchy can affect the measurement of semantic distance: path length, depth, and local density. Similarity measures and the taxonomy are linked through taxonomic relationships; that is, the position of the concepts in the taxonomy, the number of hierarchical links, and the informational content of the concepts are all taken into account. The proposed semantic measures fall into three main categories:

• Structure-based measures

Structure-based (or edge-counting) measures calculate semantic similarity from the structure of the ontology hierarchy (IS-A, PartOf): they use the length of the path connecting the terms and the position of the terms in the taxonomy. Thus, the more links there are between two concepts and the more closely related they are, the more similar the two concepts [6][7].

- The shortest path

This is a simple, powerful measure designed primarily to work with hierarchies. Max(C1, C2) is the maximum path length between C1 and C2 in the taxonomy, and SP is the shortest path connecting C1 to C2:

$\mathrm{Sim}(C_1, C_2) = 2 \times \mathrm{Max}(C_1, C_2) - SP$   (1)

- Hirst and St-Onge (HSO)

The HSO measure [8] calculates the relatedness between concepts using the length of the path between the concept nodes, the number of changes of direction along that path, and the admissibility of the path. With constants C and k, path length SP, and d changes of direction:

$\mathrm{Sim}_{HSO}(C_1, C_2) = C - SP - k \times d$   (2)

- Wu and Palmer

The WuP measure [9] calculates similarity from the depths of the two concepts in the WordNet taxonomy, together with the depth of their least common subsumer (LCS):

$\mathrm{Sim}_{WuP}(C_1, C_2) = \frac{2 \times \mathrm{depth}(\mathrm{LCS}(C_1, C_2))}{\mathrm{depth}(C_1) + \mathrm{depth}(C_2)}$   (3)
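For a quick feel of these edge-counting measures, the snippet below queries NLTK's off-the-shelf WordNet interface (an independent implementation, not code from this paper); it assumes the WordNet data has been fetched with nltk.download('wordnet').

from nltk.corpus import wordnet as wn  # assumes nltk.download('wordnet')

car = wn.synset('car.n.01')
bicycle = wn.synset('bicycle.n.01')

# Inverse shortest-path score between the two synsets: 1 / (SP + 1).
print(car.path_similarity(bicycle))

# Wu-Palmer score of Eq. (3): 2*depth(LCS) / (depth(C1) + depth(C2)).
print(car.wup_similarity(bicycle))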

• Information content measures

Information-content measures use the informational content of concepts to measure the semantic similarity between two concepts. The information content of a concept is computed from the frequency of the term in a given collection of documents.

- Lin

Lin [10][11] proposed a measure based on an ontology restricted to hierarchical links and on a corpus. Like Resnik's measure [12], it takes into account the information shared by the two concepts, but the two measures differ in their definitions:

$\mathrm{Sim}_{Lin}(C_1, C_2) = \frac{2 \times \log P(\mathrm{LCS}(C_1, C_2))}{\log P(C_1) + \log P(C_2)}$   (4)
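Eq. (4) can be tried in the same way: NLTK ships precomputed information-content files, and the sketch below (again an independent implementation, not this paper's code) estimates the probabilities P(c) from Brown corpus frequencies. It assumes nltk.download('wordnet') and nltk.download('wordnet_ic') have been run.

from nltk.corpus import wordnet as wn, wordnet_ic

# Concept probabilities P(c) estimated from Brown corpus frequencies.
brown_ic = wordnet_ic.ic('ic-brown.dat')

car = wn.synset('car.n.01')
bicycle = wn.synset('bicycle.n.01')

# Lin's measure, Eq. (4): 2*log P(LCS) / (log P(C1) + log P(C2)).
print(car.lin_similarity(bicycle, brown_ic))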

• Measures based on characteristics

Characteristic-based measures assume that each term is described by a set of terms indicating its properties or characteristics. The similarity between two terms is then defined according to their properties (their definitions, or "glosses", in WordNet [1]) or according to their relationships with other similar terms in the hierarchical structure. Tversky [13] takes the characteristics of terms into account to calculate the similarity between concepts, ignoring the position and the informational content of the terms in the taxonomy; each term must be described by a set of words indicating its characteristics:

$\mathrm{Sim}_{Tver}(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cap C_2| + \alpha\,|C_1 - C_2| + (1 - \alpha)\,|C_2 - C_1|}$   (5)

where C1 and C2 represent the description sets of the two terms and α ∈ [0,1] sets the relative importance of the uncommon characteristics. The value of α increases with the commonality and decreases with the difference between the two concepts; its determination is based on the observation that similarity is not necessarily a symmetric relation.
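Since Eq. (5) is purely set-theoretic, it translates directly into code. The sketch below is a straightforward transcription over two Python sets; the feature sets in the example are invented for illustration.

def tversky_similarity(c1: set, c2: set, alpha: float = 0.5) -> float:
    """Tversky's ratio model, Eq. (5), over two description sets."""
    common = len(c1 & c2)
    denom = common + alpha * len(c1 - c2) + (1 - alpha) * len(c2 - c1)
    return common / denom if denom else 0.0

# Invented feature sets describing "car" and "bicycle":
car = {"vehicle", "wheel", "engine", "transport"}
bicycle = {"vehicle", "wheel", "pedal", "transport"}
print(tversky_similarity(car, bicycle))  # 3 / (3 + 0.5 + 0.5) = 0.75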


3. Research methodology

In this section, we present the different steps of our work and how we calculate the semantic similarity between two words W1 and W2. The inputs are two English words, and the output is a semantic similarity score. The process is as follows. First, we lemmatize the two words with Stanford CoreNLP [14], in order to reduce inflected and derivationally related forms of each word to a common base; WordNet is organized around lemmas, which makes finding the appropriate synset easier. The resulting lemmas are used to find the equivalent synsets (S1, S2), respectively, for (W1, W2) in WordNet. For each synset we retrieve its gloss and, by applying text preprocessing (splitting, stop-word removal, POS tagging, and lemmatization), we extract the two sets G1 and G2, respectively, for S1 and S2. Our approach is based on the hypothesis that the more common words the sets share, the higher the similarity. The following formula and the algorithm below show how the semantic similarity between two words W1 and W2 is calculated:

$\mathrm{Sim}(W_1, W_2) = \frac{|S_1 \cap S_2| + |G_1 \cap G_2|}{|(S_1 \cup G_1) \cup (S_2 \cup G_2)|}$   (6)
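As a small worked example with invented sets: if S1 = {car, auto}, S2 = {auto, machine}, G1 = {motor, vehicle, wheel} and G2 = {motor, vehicle, road}, then |S1 ∩ S2| = 1, |G1 ∩ G2| = 2, and the union (S1 ∪ G1) ∪ (S2 ∪ G2) contains 7 distinct words, giving Sim = (1 + 2) / 7 ≈ 0.43.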

Algorithm 1: Semantic similarity calculation

Inputs: word1, word2
Output: semantic similarity score

Begin
  Lem_word1 ← lemmatization(word1);
  Lem_word2 ← lemmatization(word2);
  S1 ← Synset(Lem_word1);
  S2 ← Synset(Lem_word2);
  G1 ← NLP_Gloss(S1);
  G2 ← NLP_Gloss(S2);
  If ((S1 ∩ S2) ∪ (G1 ∩ G2)) = Ø then
    Sim ← 0;
  Else
    Sim ← (|S1 ∩ S2| + |G1 ∩ G2|) / |(S1 ∪ G1) ∪ (S2 ∪ G2)|;
  End if
End

Figure 1: Our approach schema
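The following Python sketch is one possible reading of Eq. (6) and Algorithm 1, not the authors' implementation: it substitutes NLTK's lemmatizer for the Stanford CoreNLP lemmatizer used in the paper, pools the lemmas and gloss words of all WordNet synsets of a word (the paper does not specify how a single synset is selected), omits POS tagging for brevity, and reads the set sums in Eq. (6) as cardinalities. The helper names synset_words and gloss_words are our own.

# Assumed setup: nltk.download('wordnet'), nltk.download('stopwords'),
# nltk.download('punkt')
import nltk
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def synset_words(word: str) -> set:
    """S: synonym lemmas pooled over all WordNet synsets of the word."""
    lemma = lemmatizer.lemmatize(word.lower())
    return {name.lower().replace('_', ' ')
            for syn in wn.synsets(lemma)
            for name in syn.lemma_names()}

def gloss_words(word: str) -> set:
    """G: gloss words after splitting, stop-word removal and lemmatization."""
    lemma = lemmatizer.lemmatize(word.lower())
    tokens = [t for syn in wn.synsets(lemma)
              for t in nltk.word_tokenize(syn.definition())]
    return {lemmatizer.lemmatize(t.lower())
            for t in tokens if t.isalpha() and t.lower() not in stop_words}

def similarity(word1: str, word2: str) -> float:
    """Eq. (6): (|S1 ∩ S2| + |G1 ∩ G2|) / |(S1 ∪ G1) ∪ (S2 ∪ G2)|."""
    s1, s2 = synset_words(word1), synset_words(word2)
    g1, g2 = gloss_words(word1), gloss_words(word2)
    if not (s1 & s2) and not (g1 & g2):  # the Ø test of Algorithm 1
        return 0.0
    return (len(s1 & s2) + len(g1 & g2)) / len(s1 | g1 | s2 | g2)

print(similarity('car', 'automobile'))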

4. Conclusion

Semantic similarity plays an important role in many domains. In information retrieval systems, it is used to find the documents relevant to a user's need in a semantic way; it is also used in plagiarism detection and in sentiment analysis. In this paper, we have presented a new approach for calculating the semantic similarity between two words using the WordNet dictionary. The proposed method exploits both synsets and glosses to maximize the similarity score: it uses the synonymy (IS-A/PartOf) relationships through the concepts' synsets (S1 and S2, respectively, for C1 and C2) and their glosses (G1 and G2, respectively, for S1 and S2).

References

[1] G. A. Miller, "WordNet: A Lexical Database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.




[2] M. Youness, E. Mohammed, and B. Jamaa, "Semantic Indexing of a Corpus," International Journal of Grid and Distributed Computing, vol. 11, no. 7, pp. 63–80, 2018.
[3] C. Leacock and M. Chodorow, "Combining Local Context and WordNet Similarity for Word Sense Identification," in WordNet: An Electronic Lexical Database, C. Fellbaum (ed.), MIT Press, 1998.
[4] D. Gupta, K. Vani, and C. K. Singh, "Using Natural Language Processing techniques and fuzzy-semantic similarity for automatic external plagiarism detection," in Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2014, pp. 2694–2699.
[5] D. Bollegala, Y. Matsuo, and M. Ishizuka, "Measuring semantic similarity between words using web search engines," in Proceedings of WWW 2007, pp. 757–766.
[6] R. Rada, H. Mili, E. Bicknell, and M. Blettner, "Development and application of a metric on semantic nets," IEEE Transactions on Systems, Man, and Cybernetics, vol. 19, no. 1, pp. 17–30, 1989.
[7] R. Richardson, A. Smeaton, and J. Murphy, "Using WordNet as a knowledge base for measuring semantic similarity between words," Technical Report Working Paper CA-1294, School of Computer Applications, Dublin City University, 1994.
[8] G. Hirst and D. St-Onge, "Lexical chains as representations of context for the detection and correction of malapropisms," in WordNet: An Electronic Lexical Database, vol. 305, pp. 305–332, 1998.
[9] Z. Wu and M. Palmer, "Verbs semantics and lexical selection," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994, pp. 133–138.
[10] D. Lin, "An information-theoretic definition of similarity," in Proceedings of ICML, 1998, vol. 98, pp. 296–304.
[11] D. Lin, "Principle-based parsing without overgeneration," in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, 1993, pp. 112–120.
[12] P. Resnik, "Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language," Journal of Artificial Intelligence Research (JAIR), vol. 11, pp. 95–130, 1999.
[13] A. Tversky, "Features of similarity," Psychological Review, vol. 84, no. 4, p. 327, 1977.
[14] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky, "The Stanford CoreNLP Natural Language Processing Toolkit," in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014, pp. 55–60.