Combining sentence similarities measures to identify paraphrases


Rafael Ferreira*,a, George D.C. Cavalcantib, Fred Freitasb, Rafael Dueire Linsa, Steven J. Simskec, Marcelo Rissd


a Department of Statistics and Informatics, Federal Rural University of Pernambuco, Brazil
b Centro de Informática, Universidade Federal de Pernambuco, Recife, Pernambuco, Brazil
c Hewlett-Packard, Fort Collins, CO 80528, USA
d Hewlett-Packard Brazil, Porto Alegre, Rio Grande do Sul, Brazil


Received 7 June 2016; received in revised form 13 January 2017; accepted 1 July 2017

Abstract

Paraphrase identification consists in verifying whether two sentences are semantically equivalent or not. It is applied in many natural language tasks, such as text summarization, information retrieval, text categorization, and machine translation. In general, methods for paraphrase identification perform three steps. First, they represent sentences as vectors using bags of words or syntactic information about the words present in the sentence. Next, this representation is used to measure different similarities between two sentences. In the third step, these similarities are given as input to a machine learning algorithm that classifies the two sentences as a paraphrase or not. However, two important problems in the area of paraphrase identification are not handled: (i) the meaning problem: two sentences may share the same meaning while being composed of different words; and (ii) the word order problem: the order of the words in a sentence may change the meaning of the text. This paper proposes a paraphrase identification system that represents each pair of sentences as a combination of different similarity measures. These measures extract lexical, syntactic, and semantic components of the sentences encompassed in a graph. The proposed method was benchmarked using the Microsoft Paraphrase Corpus, which is the publicly available standard dataset for the task. Different machine learning algorithms were applied to classify a sentence pair as a paraphrase or not. The results show that the proposed method outperforms state-of-the-art systems.
© 2017 Elsevier Ltd. All rights reserved.

Keywords: Sentence similarity; Paraphrase identification; Sentence simplification; Graph-based model


1. Introduction

The degree of similarity between phrases is measured by sentence similarity, or short-text similarity, methods. These methods should also address the problem of measuring sentences with partial information, such as when one sentence is split into two or more short texts, or when phrases contain two or more sentences. One specific task derived from sentence similarity is Paraphrase Identification (PI). This task aims to verify whether two sentences

This paper has been recommended for acceptance by Pascale Fung.
* Corresponding author.
E-mail addresses: [email protected], [email protected] (R. Ferreira), [email protected] (G.D.C. Cavalcanti), [email protected] (F. Freitas), [email protected] (R.D. Lins), [email protected] (S.J. Simske), [email protected] (M. Riss).
http://dx.doi.org/10.1016/j.csl.2017.07.002
0885-2308/© 2017 Elsevier Ltd. All rights reserved.


are semantically equivalent or not (Das and Smith, 2009). Automatic text summarization (Ferreira et al., 2013), information retrieval (Yu et al., 2009), image retrieval (Coelho et al., 2004), text categorization (Liu and Guo, 2005), and machine translation (Papineni et al., 2002) are examples of applications that rely on or may benefit from sentence similarity and PI methods.

The literature reports several efforts to address this problem by extracting syntactic information from sentences (Islam and Inkpen, 2008; Oliva et al., 2011) or by representing sentences as bag-of-words vectors (Mihalcea et al., 2006; Qiu et al., 2006). Sentences are modelled in such a way as to allow similarity methods to compute different measures of the degree of similarity between words. In general, a PI method feeds these similarities as input to machine learning algorithms in order to identify paraphrases. However, two important problems are not handled by traditional sentence similarity approaches:

The Meaning Problem (Choudhary and Bhattacharyya, 2002): It is characterized by the lack of semantic analysis in previously proposed sentence similarity measures. Essentially, the problem is to measure the similarity between the meanings of sentences (Choudhary and Bhattacharyya, 2002). The measures that claim to deal with it only apply methods such as latent semantic indexing (Deerwester et al., 1990), corpus-based methods (Li et al., 2003), and WordNet similarity measures (Mihalcea et al., 2006). These techniques, however, find the semantic similarity of the words in a sentence, not the similarity between two complete sentences. Thus, evaluating the degree of meaning similarity between two sentences remains an open problem. For example, the sentences "Peter is a handsome boy" and "Peter is a good-looking lad" share a similar meaning, provided the context they appear in does not change much.

The Word Order Problem (Zhou et al., 2010): In many cases, a different word order implies divergent sentence meanings (Zhou et al., 2010). For example, "A loves B" and "B loves A" are two sentences with completely different meanings. Therefore, dealing with this problem certainly enhances the final measure of sentence similarity.


This paper proposes a paraphrase identification system that combines lexical, syntactic, and semantic similarity measures. Since traditional methods rely only on lexical and syntactic measures, we believe that the addition of semantic role annotation analysis (Marquez et al., 2008) is a promising alternative to address the meaning and word order problems. These three measures were previously tried on the sentence similarity problem with good results (Ferreira et al., 2014b). The same authors improved their sentence similarity measure by using a similarity matrix that penalizes the measure based on sentence size (Ferreira et al., 2014a). This penalization is important because long sentences could be considered similar to short ones even when they contain more information. To the best of our knowledge, these measures had never been used to identify paraphrases. Thus, the main novelty of this paper is the application of different machine learning algorithms to combine sentence similarity measures in order to identify paraphrases. In addition, it introduces the concept of Basic Unit into the sentence similarity measures proposed in previous papers.

The proposed system is composed of three steps:

1. Sentence Representation: This step performs the lexical, syntactic, and semantic analysis and encapsulates the outputs in a text file (for the lexical analysis) and two RDF1 files (for the syntactic and semantic analyses).
2. Similarity Analysis: It measures the similarity of each pair of sentences using the output of the previous step.
3. Paraphrase Classification: The last step applies a machine learning algorithm to the sentence similarity measures from the second step to identify whether the pair of sentences is a paraphrase or not.

In order to evaluate the proposed system, a series of experiments was performed using the Microsoft Research Paraphrase Corpus (MSRP) (Dolan et al., 2004), which is the standard dataset for this problem. The proposed approach was compared using four measures: accuracy, precision, recall, and F-measure (Achananuparp et al., 2008). The experimental study validated the principal hypothesis of this work, showing that the combination of lexical, syntactic, and semantic aspects of a sentence pair achieves better results for the PI task than state-of-the-art methods. It also validated that the sentence representation proposed in (Ferreira et al., 2014b) achieves good performance for the PI task.

1 Resource Description Framework.


The rest of this paper is organized as follows. Section 2 presents the most relevant differences between the proposed method and the state-of-the-art related work. Section 3 explains the proposed sentence representation, the similarity measures, and the paraphrase identification process. The benchmarking of the proposed method against the best similar methods is presented in Section 4. The paper ends by drawing conclusions and discussing lines for further work in Section 5.


2. Related work


This section gives an overview of previous methods for paraphrase identification (Androutsopoulos and Malakasiotis, 2010). The proposed methods can be divided into: (i) threshold-based methods, which empirically identify a threshold on sentence similarity values that divides the sentence pairs into two groups (paraphrase or not); and (ii) machine learning methods, which apply machine learning to different features (usually similarity values) to identify paraphrases.

The threshold-based methods always carry out a sentence similarity step before PI, which returns as output a similarity value between 0 and 1. The second step finds a threshold that classifies the sentences as paraphrases or not. The state-of-the-art threshold-based methods are described below.

Mihalcea et al. (2006) represent the sentences as bag-of-words vectors and apply a similarity measure that works as follows: for each word in the first sentence (the main sentence), it tries to identify the word in the second sentence that has the highest semantic similarity according to one of several word-to-word similarity measures. Then, the process is repeated using the second sentence as the main sentence. Finally, the resulting similarity scores are combined using the arithmetic average. This method uses a threshold equal to 0.5 to identify paraphrases; thus, sentence pairs with a similarity value higher than 0.5 are tagged as paraphrases.

Oliva and collaborators (Oliva et al., 2011) propose the SyMSS method, which assesses the influence of the syntactic structure of the two compared sentences on the similarity calculation. They represent the sentences as syntactic dependency trees, based on the idea that the meaning of a sentence is made up of the meanings of its individual words and the syntactic connections among them. Using WordNet, semantic information is obtained through a process that finds the main phrases composing the sentence. They then applied different thresholds, between 0 and 1 in steps of 0.05, to identify which one yields the best PI accuracy. The best results were obtained using 0.6 as the threshold.

Islam and Inkpen (2008) presented an approach to measure the similarity of two texts that makes use of semantic and syntactic information. They combine three different similarities to perform the PI task. First, they use the entire sentence as a string to calculate a string similarity, obtained by applying the longest common subsequence measure (Kondrak, 2005). Then, they use a bag-of-words representation to compute a semantic word similarity, measured by a corpus-based measure (Islam and Inkpen, 2006). The last similarity uses syntactic information to evaluate a word order similarity. The final similarity is calculated by combining the string similarity, the semantic similarity, and the common-word order similarity. As in (Oliva et al., 2011), they tested different thresholds between 0 and 1, but in steps of 0.1. Like Oliva et al. (2011), they concluded that the best threshold was 0.6.

Das and Smith (2009) proposed a probabilistic model (Smith and Eisner, 2006) that incorporates syntax, semantics (using WordNet), and hidden loose alignments between the two sentences' trees to perform sentence similarity. Using these features, they estimate similarity as a posterior probability in a classifier; if the posterior probability exceeds 0.5, the pair is labeled as a paraphrase.

A different way to approach PI relies on machine learning algorithms.
In this context, the proposed methods use different kinds of features, such as similarities between sentences and sentence dependency relations, to classify the sentence pair as a paraphrase or not. Heilman and Smith (2010), Qiu et al. (2006), and Wan et al. (2006) proposed machine-learning-based methods to deal with the PI process. It is important to notice that all of these papers use a supervised approach.

Heilman and Smith (2010) proposed a tree edit method, which encapsulates the syntactic relations among the tokens of a sentence, to identify paraphrases. The main idea is to transform the tree created for the first sentence into that of the second using nine operations, such as insert and delete. The authors train a logistic regression classification model to seek a short sequence of tree edits that transforms one tree into another. They found 33 edit sequences that classify sentence pairs as paraphrases or not.


A supervised two-phase framework based on sentence dissimilarity to identify paraphrases was introduced by Qiu and colleagues (Qiu et al., 2006). They represent sentences using semantic triples extracted from PropBank (Palmer et al., 2005). In the first phase, the system calculates the semantic similarity among the sentences' tokens to find related words and pair them up in a greedy manner. The second phase is responsible for identifying whether extra information (unpaired tokens) exists in the sentences and whether the effect of its removal is significant. A Support Vector Machine (SVM) classifier is applied to a wide set of features of the unpaired tokens, including internal counts of numeric expressions, named entities, words, and semantic roles, whether they are similar to other tuples in the same sentence, and contextual features such as source/target sentence length and paired token count. The SVM classifies sentence pairs as paraphrases or not.

Wan et al. (2006) created an approach that uses 17 different features to identify paraphrases. The features are divided into: (i) n-gram overlap features, (ii) dependency relation overlap features, (iii) dependency tree-edit distance features, and (iv) surface features. The authors apply four different machine learning algorithms (naive Bayes, C4.5 decision tree, support vector machine, and k-nearest neighbors) using the 17 features to label a pair of sentences as a paraphrase or not.

Recently, studies have proposed the application of Convolutional Neural Network (CNN) techniques to deal with the problem of paraphrase identification (Yin and Schütze, 2015; Yin et al., 2015). The main idea is to use word embeddings to represent the sentences and then apply CNN algorithms to classify sentence pairs as paraphrases or not.

The work by Ferreira and collaborators (Ferreira et al., 2014b; Ferreira et al., 2014a) proposes an approach that applies lexical, syntactic, and semantic analysis to represent sentences. It then proposes a word matching algorithm, using statistical and WordNet measures, to compute the similarity between sentences. These approaches achieve good results in sentence similarity contexts; however, they had not yet been used to identify paraphrases.

Although the methods in this paper take advantage of ideas similar to (Ferreira et al., 2014b) and (Ferreira et al., 2014a) to compare sentences, the approach proposed here adds a new similarity matrix and a new algorithm. Besides that, the similarity measure relies on a size penalization coefficient to reduce the similarity of sentences with different sizes. Thus, the approach proposed in this paper combines: (i) a three-layer sentence representation, which encompasses three different levels of sentence information (previous works do not combine these three levels, usually providing only one or two analyses); and (ii) a similarity measure that encapsulates a matrix of similarities among all words in the sentences and a size penalization coefficient to deal with sentences of different sizes. This paper proposes the application of different machine learning algorithms to identify paraphrases using the proposed similarities as features.
Table 1 presents a summary of the systems discussed and of the proposed system. It lists: (i) whether the system uses a threshold (T), machine learning (ML), or a Convolutional Neural Network (CNN); (ii) the sentence representation used; (iii) the syntactic (Syntactic Relations (SR) and Word Order (WO)) and semantic (corpus-based and WordNet measures and Semantic Role Annotation (SRA)) aspects used; and (iv) the features that the system relies on to perform the PI.

Table 1
Feature comparison among related work and the proposed approach.

System                     Method  Representation                                         Syntactic  Semantic               Features
(Mihalcea et al., 2006)    T       Bag-of-words vector                                    -          Corpus-based, WordNet  Sentence similarity
(Oliva et al., 2011)       T       Syntactic tree                                         SR         WordNet                Sentence similarity
(Islam and Inkpen, 2008)   T       Sentence as string, bag-of-words vector, synt. tree    WO         Corpus-based           Sentence similarity
(Das and Smith, 2009)      T       Probabilistic model                                    SR         WordNet                Sentence similarity
(Heilman and Smith, 2010)  ML      Syntactic tree                                         SR         -                      Tree edit transformations
(Qiu et al., 2006)         ML      Semantic triples                                       -          SRA                    Sentence dissimilarity
(Wan et al., 2006)         ML      Bag-of-words vector, syntactic tree                    SR         -                      17 features
(Yin and Schütze, 2015)    CNN     -                                                      -          -                      Word embeddings
(Yin et al., 2015)         CNN     -                                                      -          -                      Word embeddings
Proposed approach          ML      Bag-of-words vector, syntactic tree, semantic triples  SR         SRA, WordNet           Sentence similarity


It is important to notice that, recently, some papers have addressed the identification of paraphrases in Twitter (Xu et al., 2015). However, the papers described in this section focus on news sentences; therefore, they were not compared with the works on Twitter.


3. Proposed method


Given a dataset G = {a1, a2, ..., an}, where ai = (si1, si2) is a pair of sentences, a paraphrase identification system aims to verify whether the two sentences si1 and si2 are semantically equivalent or not.

Fig. 1 shows the proposed paraphrase identification system, which is composed of three modules: Sentence Representation, Similarity Analysis, and Classification. A minimal sketch of this pipeline is given after the module descriptions.

Sentence Representation. This module represents each sentence (sij) using three different sets of features: lexical, syntactic, and semantic. The output of this module is two triples (r1 and r2), where ri is the representation of each sentence using lexical, syntactic, and semantic analysis. The lexical representation consists of a bag-of-words vector, and the syntactic and semantic representations are Resource Description Framework (RDF) graphs. More details about these representations are presented in Section 3.1.

Similarity Analysis. This module provides different methods to measure similarities between sentences using the triples (r1 and r2) provided by the previous module. These similarities are used as features (F) to identify paraphrases in the classification module. So, F = {sim1, sim2, ..., simn} is the feature vector used to identify paraphrases, where n is the number of different similarity measures used and simi is the i-th similarity value between r1 and r2. Details about the similarity methods are provided in Section 3.2.

Classification. This module receives the output (F) of the previous module as input to machine learning techniques in order to classify the pair of sentences as a paraphrase or not. The classification process is supervised; in other words, the system requires a dataset containing sentence pairs labeled as paraphrases or not.
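To make the dataflow concrete, the sketch below wires the three modules together. It is a minimal illustration, not the authors' implementation: build_representation and the SIMILARITIES registry are hypothetical placeholders for the components detailed in Sections 3.1 and 3.2.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical placeholder types/registry; Sections 3.1 and 3.2 define the
# real components (lexical token list plus syntactic/semantic RDF graphs).
Representation = Tuple[object, object, object]   # (lexical, syntactic, semantic)
SIMILARITIES: Dict[str, Callable[[Representation, Representation], float]] = {}

def build_representation(sentence: str) -> Representation:
    # Sentence Representation module (Section 3.1).
    raise NotImplementedError

def feature_vector(s1: str, s2: str) -> List[float]:
    # Similarity Analysis module (Section 3.2): F = {sim_1, ..., sim_n}.
    r1, r2 = build_representation(s1), build_representation(s2)
    return [sim(r1, r2) for sim in SIMILARITIES.values()]

def train_classifier(pairs, labels, classifier):
    # Classification module: any supervised learner over the feature vectors.
    X = [feature_vector(s1, s2) for s1, s2 in pairs]
    classifier.fit(X, labels)    # e.g. a scikit-learn-style estimator
    return classifier
```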


The following sections describe the Sentence Representation and Similarity Analysis modules.

3.1. The system's sentence representation

This section explains the sentence representation used for calculating the similarity measures, which encompasses three layers: lexical, syntactic, and semantic. A single sentence is taken as input to build this representation. The output is one text file and two RDF files (W3C, 2004) that contain the lexical, syntactic, and semantic layers, respectively. Each layer is detailed as follows.

3.1.1. The lexical layer

The lexical layer takes a sentence as input and yields as output a text file with the list of tokens representing it. The steps performed in this layer are:

1. Lexical analysis: This step splits the sentence into a list of tokens, including punctuation.
2. Stopword removal: It rules out words with little representative value for the document, e.g. articles and pronouns, as well as punctuation. This work benefits from the stopword list proposed by Dolamic and Savoy (2010).

Fig. 1. Architecture of the proposed paraphrase identification system. G is a dataset containing pairs of sentences; sij denotes a sentence of a pair; r1 and r2 are the triples containing the lexical, syntactic, and semantic representation of each sentence; and F is the feature vector used to classify the pair of sentences as paraphrase or not.


3. Lemmatization: This step applies stemming preprocessing, which translates tokens into their basic form. For instance, plural words are made singular, and all verb tenses and persons are exchanged for the verb infinitive. Lemmatization in this system is carried out by the Stanford CoreNLP tool2.

Fig. 2 depicts the operations accomplished in this layer for the sentence "The judge declared the defendant guilty". It also displays the output of each step. The output of this layer is a text file containing the list of tokens.

This layer helps to improve performance in simple text processing tasks. Although it does not convey much information about the sentence, it is widely employed in traditional text mining tasks such as information retrieval and summarization.

3.1.2. The syntactic layer

This layer receives the sequence of tokens generated in the previous, lexical layer and converts it into a graph of RDF triples (W3C, 2004). This transformation is executed in the following steps:

1. Syntactic analysis: This step benefits from the output of a dependency tree, built based on (de Marneffe and Manning, 2008), to extract relations such as subject, direct object, and adverbial modifier, among others. In addition, preposition and conjunction relations are also extracted in this step.
2. Graph creation: Next, a directed graph stores the entities with their relations. The vertices are the elements obtained from the lexical layer, while the edges denote the relations described in the previous step.

Fig. 3 shows the syntactic layer for the sentence "The judge declared the defendant guilty". The edges usually have one direction, following the direction of the syntactic relations. This is not always the case, however: the model also accommodates bi-directed edges, usually corresponding to conjunction relations. One should notice that all vertices in the example are listed in the output of the previous layer.

The syntactic analysis step is important because it represents an order relation among the tokens of a sentence. It describes the possible or acceptable syntactic structures of the language and decomposes the text into syntactic units in order to "understand" the way the syntactic elements are arranged in a sentence. Such relations can be used in applications such as automatic text summarization, text categorization, and information retrieval. The process of creating the dependency tree may extract wrong relations; however, Section 4 shows that this layer is important for identifying paraphrases. A sketch of the lexical and syntactic layers is given below.
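A minimal sketch of these two layers, assuming the spaCy library for tokenization, lemmatization, and dependency parsing (the paper itself uses Stanford CoreNLP, a parser based on de Marneffe and Manning (2008), and the Dolamic and Savoy (2010) stopword list; the small STOPWORDS set here is only a stand-in):

```python
import spacy

nlp = spacy.load("en_core_web_sm")                   # stand-in for Stanford CoreNLP
STOPWORDS = {"the", "a", "an", "he", "she", "it"}    # stand-in stopword list

def lexical_layer(sentence: str) -> list:
    # Lexical analysis -> stopword/punctuation removal -> lemmatization.
    doc = nlp(sentence)
    return [t.lemma_.lower() for t in doc
            if not t.is_punct and t.text.lower() not in STOPWORDS]

def syntactic_layer(sentence: str) -> list:
    # Directed (head, relation, dependent) triples, stored RDF-style.
    doc = nlp(sentence)
    return [(t.head.lemma_.lower(), t.dep_, t.lemma_.lower())
            for t in doc if t.dep_ != "ROOT" and not t.is_punct]

# For "The judge declared the defendant guilty" this yields tokens like
# ["judge", "declare", "defendant", "guilty"] and triples such as
# ("declare", "nsubj", "judge") -- compare Figs. 2 and 3.
```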

Fig. 2. Lexical layer for the sentence "The judge declared the defendant guilty".

Fig. 3. Syntactic layer for "The judge declared the defendant guilty".

2 http://nlp.stanford.edu/software/corenlp.shtml


Fig. 4. Semantic layer for the sentence "The judge declared the defendant guilty".


The RDF format was chosen to store the graph because: (i) it is a standard model for data interchange on the web; (ii) it provides a simple and clean format; (iii) inferences are easily drawn from RDF triples; and (iv) there are several freely available tools to handle RDF.

3.1.3. The semantic layer

This layer decorates the RDF graph with entity roles and sense identification. It takes as input the sequence of groups of tokens extracted in the lexical layer and applies SRA to define the role of each entity and to identify its "meaning" in the sentence.

The semantic layer uses SRA to perform two different operations:

1. Sense identification: Sense identification is of paramount importance to this type of representation, since different words can denote the same meaning, particularly verbs. For instance, "affirm", "argue", "claim", and "declare" are words that can be associated with the sense of "statement".
2. Role annotation: Differently from the syntactic layer, role annotation identifies the semantic function of each entity. For instance, in the same example sentence, judge is the speaker of the action declared. Thus, the interpretation of the action is identified as well, not only its syntactic relation.

This layer deals with the meaning problem, receiving the output of the sense identification step as its input: the general meaning of the main entities of a sentence, not just the written words, is identified. In turn, role annotation extracts discourse information, as it lays out the order of the actions, the actors, and so on, dealing with the word order problem. Such information is relevant for extraction and summarization tasks, for instance. For both sense identification and role annotation, the proposed method extracts relations from FrameNet3 using the Semafor toolkit4. Once again, it is important to notice that these NLP procedures cannot identify all semantic relations; despite that, the relations found are enough to improve the proposed paraphrase identification method.

Fig. 4 presents a semantic layer example. Two different types of relations are identified in the figure: sense relations, e.g. the triple guilty-sense-verdict, and role annotation relations, e.g. judge-speaker-declare. The semantic layer uses an RDF graph representation, likewise the syntactic layer.
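As a concrete illustration of this layer's output, the running example can be encoded as two small sets of RDF-style relations; the values below are read off Fig. 4, while the Python encoding itself is only an assumed illustration:

```python
# Semantic layer for "The judge declared the defendant guilty" (cf. Fig. 4).
role_triples = [
    ("judge", "speaker", "declare"),   # role annotation: who performs the action
]
sense_pairs = [
    ("declare", "statement"),          # sense identification: word -> sense
    ("guilty", "verdict"),
]
```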

3 framenet.icsi.berkeley.edu
4 www.ark.cs.cmu.edu/SEMAFOR


3.2. Similarity analysis

As mentioned before, the process of paraphrase identification aims to determine whether two sentences share essentially the same meaning. This paper proposes three sentence similarity measures, based on the representation detailed in the previous section, in order to identify paraphrases. The similarity measure proposed here assesses the degree of sentence similarity based on the three-layer representation of sentences presented in Section 3.1. Before detailing the proposed measures, the concept of Basic Unit (BU) should be presented.

A BU represents the minimal unit of the proposed sentence similarity algorithm. In the lexical layer, it is a single word; thus, the similarity between two BUs is the similarity between two words. The similarity measure is divided into two steps:

- Similarity Matrix Value (SMV): It measures the similarities among the sentences' BUs using word-to-word similarity measures.
- Size Penalization Coefficient (SPC): It decreases the similarity when the sentences analyzed do not have the same number of BUs.

The first step calculates the similarity matrix values as follows. Let A = {a1, a2, ..., an} and B = {b1, b2, ..., bm} be two sentences, such that ai is a BU of sentence A, bj is a BU of sentence B, n is the number of tokens of sentence A, and m is the number of tokens of sentence B. The calculation of the similarity is presented in Algorithm 1.

The algorithm receives the sets of BUs from sentences A and B as input. Then, it creates a matrix of dimension m × n, the dimensions of the input BU sets. The variables total_similarity and iteration are initialized with value 0; total_similarity adds up the values of the similarities in each step, while iteration is used to transform total_similarity into a value between 0 and 1 (lines 1-3). The second step is the calculation of similarities for each pair (ai, bj), where ai and bj are the tokens of sentences A and B, respectively. The matrix stores the calculated similarities (lines 4-8). The last part of the algorithm is divided into three steps. First, it adds the highest similarity value in the matrix to total_similarity (line 10). Then, it removes the row and column of the matrix that contain that highest similarity (lines 11 and 12). To conclude, it updates the iteration value (line 13). The output is the division of total_similarity by iteration (line 15).

To compute the similarities between tokens, the system uses three different measures:

- The Levenshtein metric (Lev) (Miller et al., 2009) calculates the minimum number of insertions, deletions, or substitutions of a single character needed to transform one string into another.
- The Resnik measure (Res) (Miller, 1995) attempts to quantify how much information content is common to two concepts. The information content is based on the lowest common subsumer (LCS) of the two concepts.
- The Lin measure (Miller, 1995) is the ratio of the information content of the LCS in the Resnik measure to the information content of each of the concepts.

Algorithm 1. Proposed similarity algorithm.
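The pseudocode of Algorithm 1 did not survive into this version of the text, so the sketch below reconstructs it from the line-by-line description above; treat it as a faithful reading rather than the authors' exact listing. The word_sim argument is any token-level measure returning values in [0, 1]; a normalized Levenshtein variant is included as one possible choice.

```python
from typing import Callable, List

def similarity_matrix_value(bus_a: List[str], bus_b: List[str],
                            word_sim: Callable[[str, str], float]) -> float:
    """Greedy BU matching as described for Algorithm 1."""
    # Build the similarity matrix over every BU pair (lines 4-8).
    matrix = [[word_sim(a, b) for b in bus_b] for a in bus_a]
    total_similarity, iteration = 0.0, 0          # initialization (lines 1-3)

    # Repeatedly take the highest remaining similarity, then drop its row
    # and column so each BU is matched at most once (lines 10-13).
    while matrix and matrix[0]:
        best_i, best_j = max(
            ((i, j) for i in range(len(matrix)) for j in range(len(matrix[0]))),
            key=lambda ij: matrix[ij[0]][ij[1]])
        total_similarity += matrix[best_i][best_j]
        matrix.pop(best_i)                        # remove the matched row
        for row in matrix:
            row.pop(best_j)                       # remove the matched column
        iteration += 1

    # Normalize into [0, 1] (line 15).
    return total_similarity / iteration if iteration else 0.0

def levenshtein_sim(a: str, b: str) -> float:
    """One possible word_sim: Levenshtein distance normalized to [0, 1]."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b), 1)
```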


Fig. 5. Example of similarity between triples, where Sim is the similarity between two tokens or two edges, TotalSimilarity is the total similarity of one triple, u and v are edges, and a1, a2, b1, and b2 are the tokens associated with the nodes of the graph.


After the calculation of total_similarity, the system computes a Size Penalization Coefficient (SPC) that lowers the weight of the similarity between sentences with different numbers of tokens. The SPC is proportional to total_similarity. Eq. (1) shows how the SPC is calculated. Notice that, for sentences with the same number of tokens, the SPC is equal to zero.

    SPC = (|n - m| × SMV) / n,  if n > m
    SPC = (|n - m| × SMV) / m,  otherwise                                   (1)

where n and m are the numbers of tokens in sentence 1 and sentence 2, respectively, and SMV is the total similarity found in the SMV step.

As for the syntactic and semantic layers, the process follows the same idea as the lexical one; however, the BU is represented as a triple (vertex, edge, vertex). In the syntactic layer, the similarity is measured as the arithmetic mean of each vertex/edge/vertex match, as presented in Fig. 5.

In the semantic layer, the sense edges, detailed in Section 3.1, connect the words present in the sentence with their senses. Therefore, it is important to measure whether two sentences contain related words and senses. Hence, the measure is calculated using the pair (vertex, edge) as the BU. Fig. 6 shows the similarity calculation; a sketch of both BU comparisons follows the figure.

It is important to notice that the system produces nine different combinations of similarity measures. They are the product of combining the sentence representation layers (lexical, syntactic, and semantic) with the word-to-word similarity measures. The similarities are: Lexical-Levenshtein, Syntactic-Levenshtein, Semantic-Levenshtein, Lexical-Resnik, Syntactic-Resnik, Semantic-Resnik, Lexical-Lin, Syntactic-Lin, and Semantic-Lin. These combinations are used as features to identify paraphrases.

Fig. 6. Example of similarity between pairs (vertex, edge), where Sim is the similarity between two tokens or two edges, TotalSimilarity is the total similarity of one pair, u and v are edges, and a1 and b1 are the tokens associated with the nodes of the graph.
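A minimal sketch of the two BU comparisons illustrated in Figs. 5 and 6, again assuming a token-level word_sim in [0, 1]:

```python
def triple_bu_similarity(t1, t2, word_sim):
    # Syntactic layer BU = (vertex, edge, vertex): arithmetic mean of the
    # vertex/edge/vertex matches, as in Fig. 5.
    (a1, u, a2), (b1, v, b2) = t1, t2
    return (word_sim(a1, b1) + word_sim(u, v) + word_sim(a2, b2)) / 3.0

def pair_bu_similarity(p1, p2, word_sim):
    # Semantic layer BU = (vertex, edge): mean of the token and sense-edge
    # matches, as in Fig. 6.
    (a1, u), (b1, v) = p1, p2
    return (word_sim(a1, b1) + word_sim(u, v)) / 2.0
```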


An example illustrates the process. Let Sentence1 = {A, B, C, D} and Sentence2 = {D, R, S}, where {A, B, C, D, R, S} are BUs. The first step is to create the 4 × 3 matrix containing the similarities among all BUs (Table 2).

Tables 3 and 4 represent two iterations of lines 10-13 of Algorithm 1. In the first iteration, total_similarity receives the value 1. Then, 0.6 is added in the second iteration; at this point, total_similarity = 1.6.

The last iteration adds 0.3 to total_similarity (= 1.9), removes row 1 and column 2, and stops the process. The SMV is 0.64, i.e. total_similarity divided by iteration (1.9/3).

The system calculates the final similarity as presented in Eq. (2). In the example, SPC = 0.17 and final_similarity = 0.47.


    final_similarity = total_similarity - SPC                               (2)
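Eqs. (1) and (2) transcribe directly into code; checking against the worked example above (SMV = 1.9/3, n = 4, m = 3) gives final_similarity ≈ 0.47, matching the reported value up to rounding:

```python
def size_penalization(smv: float, n: int, m: int) -> float:
    # Eq. (1): zero when n == m, growing with the size difference.
    return (abs(n - m) * smv) / (n if n > m else m)

def final_similarity(smv: float, n: int, m: int) -> float:
    # Eq. (2): subtract the penalization from the matrix value.
    return smv - size_penalization(smv, n, m)

print(final_similarity(1.9 / 3, 4, 3))   # ~0.475, reported as 0.47
```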


4. Experiments


The experimental study conducted here aimed at evaluating the proposed method and comparing it with state-of-the-art methods. Moreover, the effectiveness of the proposed representation, which combines lexical, syntactic, and semantic aspects of a sentence pair, is also evaluated. This section is organized as follows. Section 4.1 presents the dataset and the metrics used to evaluate the proposed approach. In Section 4.2 the feature sets are described, and in Sections 4.3 and 4.4 the results and discussion are presented.


4.1. Dataset and evaluation metrics


The Microsoft Research Paraphrase Corpus (MSRP) (Dolan et al., 2004) consists of 5801 pairs of sentences, 4076 training pairs and 1725 test pairs, collected from thousands of news sources on the web over a period of 18 months. The corpus was labeled by two human annotators, who determined whether two sentences were paraphrases or not.

The following evaluation metrics were used: (i) accuracy, the proportion of correctly predicted pairs among all pairs; (ii) precision, the proportion of correctly predicted paraphrase pairs among all predicted paraphrase pairs; (iii) recall, the proportion of correctly predicted paraphrase pairs among all actual paraphrase pairs; and (iv) F-measure, the harmonic mean of precision and recall (Achananuparp et al., 2008). A sketch of these metrics is given below.
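Written out from the confusion counts, where "positive" means the pair is labeled a paraphrase, the four metrics are a direct transcription of the definitions above:

```python
def pi_metrics(tp: int, fp: int, fn: int, tn: int):
    # Accuracy, precision, recall, and F-measure from confusion counts.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # correct among predicted paraphrases
    recall = tp / (tp + fn)      # correct among actual paraphrases
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure
```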

Table 2
Step 1: create the similarity matrix.

      A     B     C     D
D    0.3   0.8   0.1   1.0
R    0.2   0.3   0.3   0.7
S    0.4   0.6   0.5   0.6

Table 3
Step 2: removed row 1 and column 4.

      A     B     C
R    0.2   0.3   0.3
S    0.4   0.6   0.5

Table 4
Step 3: removed row 2 and column 2.

      A     C
R    0.2   0.3


Table 5
System features: the similarity combinations.

Abbreviation    Sentence layer   Similarity between words
Lexical-Lev     Lexical          Levenshtein
Syntactic-Lev   Syntactic        Levenshtein
Semantic-Lev    Semantic         Levenshtein
Lexical-Res     Lexical          Resnik
Syntactic-Res   Syntactic        Resnik
Semantic-Res    Semantic         Resnik
Lexical-Lin     Lexical          Lin
Syntactic-Lin   Syntactic        Lin
Semantic-Lin    Semantic         Lin

4.2. Feature sets

The proposed method was evaluated using nine different combinations of similarities as features (detailed in Section 3.2). These combinations are based on the sentence representation layers (lexical, syntactic, and semantic) and the word-to-word similarity measures, as presented in Table 5.

Initially, four different subsets of these measures were used to perform the classification:

- Nine Features: The feature vector is composed of the whole set of similarities shown in Table 5.
- Levenshtein Features: The similarities that use the Levenshtein measure to identify the similarity between tokens.
- Resnik Features: The similarities that use the Resnik measure to identify the similarity between tokens.
- Lin Features: The similarities that use the Lin measure to identify the similarity between tokens.


A feature selection algorithm proposed by Hall [12] was used to extract a relevant subset of features. Its strategy for selecting the best set of features is based on the correlation between features: it eliminates features with high correlation values by considering the individual predictive ability of each feature along with the degree of redundancy between features. The Nine Features subset was used as input to this algorithm, and the output was the following subset: Lexical-Lev, Syntactic-Lev, Lexical-Res, and Semantic-Res. It is important to highlight that each representation layer was selected at least once, which indicates the importance of all layers in the sentence representation. The subset containing the features Lexical-Lev, Syntactic-Lev, Lexical-Res, and Semantic-Res is abbreviated as Selected Features. A sketch of how such a feature vector feeds a classifier is given below.
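A minimal sketch of the classification module under the Selected Features configuration. The reported experiments use WEKA; the scikit-learn estimator and the feature_fns mapping below are illustrative stand-ins, not the authors' setup:

```python
from sklearn.naive_bayes import GaussianNB  # stand-in for WEKA's Bayesian network

SELECTED = ["Lexical-Lev", "Syntactic-Lev", "Lexical-Res", "Semantic-Res"]

def train_pi_classifier(pairs, labels, feature_fns):
    # pairs: list of (s1, s2); labels: 1 = paraphrase, 0 = not a paraphrase.
    # feature_fns maps a feature name to a sentence-pair similarity function.
    X = [[feature_fns[name](s1, s2) for name in SELECTED] for s1, s2 in pairs]
    model = GaussianNB()
    model.fit(X, labels)
    return model
```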


4.3. Results


Different machine learning algorithms were applied in order to find which one best fits the proposed PI approach. These algorithms were executed using the WEKA data mining software (Witten and Frank, 2000). A selection of machine learning techniques from different families was evaluated (Fernandez-Delgado et al., 2014). The algorithms that reached the best results include: Bayesian network, RBF network, the C4.5 decision tree classifier, and the SMO support vector machine with a polynomial kernel. All algorithms were run with their default configuration.

Table 6 presents the results of the proposed method using the different feature sets and the classifiers cited above, applied to the MSRP test data. The results are presented on the test dataset to make the analysis easier; however, the models were also trained and evaluated on the training dataset using 10-fold cross-validation, and the results followed the same order as those presented in Table 6.

The results of the proposed approach are compared to the best result of the similarity measure proposed in (Ferreira et al., 2014b) and to two baselines proposed by Mihalcea et al. (2006): (i) a random baseline, which makes a random decision between true (paraphrase) and false (not paraphrase) for each candidate pair; and (ii) a vector-based baseline, which uses the cosine similarity measure traditionally used in information retrieval, with TF-IDF weighting, to identify paraphrases.

Using the whole set of features, the results reached 70.89% accuracy and 80.2% F-measure. However, the Levenshtein Features obtained slightly better results, 71.13% accuracy and 80.2% F-measure, using only three features. This indicates that some features are redundant or, in some cases, lead the algorithm to incorrect results.


Table 6
Results of the proposed approach applied to the test data using different features. The top five results for each metric are marked with an asterisk.

Features               Algorithm         Accuracy  Precision  Recall  F-Measure
Nine features          Bayesian network  66.37     79.70*     66.30   72.40
Nine features          RBF network       70.14     73.10      87.30   79.50
Nine features          C4.5              70.89*    75.30      83.60   79.30
Nine features          SMO               70.08     71.70      90.90   80.20*
Levenshtein features   Bayesian network  68.92     78.00*     74.30   76.10
Levenshtein features   RBF network       70.60*    72.90      88.90   80.10*
Levenshtein features   C4.5              71.13*    75.50      83.90   79.40
Levenshtein features   SMO               70.14     71.60      91.20*  80.20*
Resnik features        Bayesian network  66.60     78.90*     68.00   73.00
Resnik features        RBF network       69.62     72.10      88.70   79.50
Resnik features        C4.5              68.23     69.20      94.00*  79.70
Resnik features        SMO               66.49     66.50      100.0*  79.90
Lin features           Bayesian network  67.07     77.50*     71.10   74.20
Lin features           RBF network       69.73     72.10      89.00   79.60
Lin features           C4.5              68.17     73.30      82.10   77.40
Lin features           SMO               66.49     66.50      100.0*  79.90
Selected features      Bayesian network  75.13*    82.00*     80.20   81.10*
Selected features      RBF network       74.08*    73.40      95.60*  83.10*
Selected features      C4.5              70.26     73.50      86.40   79.40
Selected features      SMO               69.73     71.80      89.70   79.80
Ferreira et al. (2014b)                  70.55     74.50      84.70   79.30
Random baseline                          51.30     68.30      50.00   57.80
Vector-based baseline                    65.40     71.60      79.50   75.30


For this reason, the feature selection algorithm was used to indicate the relevant features for this task. As presented in Section 4.2, the Selected Features group contains the features Lexical-Lev, Syntactic-Lev, Lexical-Res, and Semantic-Res. It achieved the best results of the experiments: 75.13% accuracy, 82% precision, 95.6% recall, and 83.1% F-measure.

The Levenshtein Features obtained better results than the Resnik and Lin Features. This happens because the dataset contains many proper nouns, such as people and place names, and the Resnik and Lin Features do not deal with them, since they are based on the WordNet dictionary, which does not contain these kinds of words. In addition, the dataset, in general, relies on similar words in sentence pairs, which explains the good performance of the Levenshtein Features. As presented in Section 3.2, the Levenshtein measure calculates the minimum number of operations needed to transform one string into another; in other words, the presence of similar words improves the accuracy of this measure. If this experiment were performed on a dataset containing more diverse words, the Resnik and Lin Features would probably achieve better results compared to the Levenshtein Features.

In terms of accuracy and F-measure, all combinations achieved better results than the random baseline, and only two combinations achieved worse results than the vector-based baseline. The best result is 9.73 (accuracy) and 7.80 (F-measure) percentage points better than the vector-based baseline. Moreover, the best result using the similarity measure proposed in (Ferreira et al., 2014b) applied to the PI task also beat the baselines. This confirms the hypothesis that the sentence representation used here achieves good results regardless of the similarity algorithm used: by dealing with the meaning and word order problems, this sentence representation increases the performance of similarity measures and, in turn, of the PI methods that use them.

The experiments also show that the proposed method obtained better results than (Ferreira et al., 2014b) in all evaluation metrics, by 4.58 percentage points in accuracy, 7.50 in precision, 10.90 in recall, and 3.80 in F-measure. This confirms the hypothesis that the proposed sentence similarity algorithm improves on the similarity measure of (Ferreira et al., 2014b) for the PI task.

The classifiers RBF network and C4.5 achieved the best accuracy for almost every feature set, except Selected Features, where the Bayesian network achieved a better result. In terms of F-measure, the RBF network and SMO algorithms were better than the others in four sets; once again, the Bayesian network using Selected Features achieved a good result, losing only to the RBF network. In general, the RBF network obtained the best results.


Table 7
Paraphrase systems comparison. The best result for each metric is marked with an asterisk.

System                            Accuracy  Precision  Recall  F-Measure
Selected Features + BayesNet      75.13     82.00*     80.20   81.10
Selected Features + RBF Network   74.08     73.40      95.60   83.10
Das and Smith (2009)              76.06     79.57      86.05   82.68
Ferreira et al. (2014b)           70.55     74.50      84.70   79.30
Heilman and Smith (2010)          73.20     75.70      87.80   81.30
Islam and Inkpen (2008)           72.60     74.70      89.10   81.26
Mihalcea et al. (2006)            70.30     69.60      97.70*  81.29
Oliva et al. (2011)               70.42     74.04      88.63   80.68
Qiu et al. (2006)                 72.00     72.50      93.40   81.63
Wan et al. (2006)                 75.00     77.00      90.00   82.99
Yin and Schütze (2015)            78.10     -          -       84.40
Yin et al. (2015)                 78.90*    -          -       84.80*


The two combinations that achieved the best results, (Selected Features, BayesNet) and (Selected Features, RBF network), were selected and compared with state-of-the-art results (Table 7). The best result for each metric is marked in the table.

The proposed method achieves the best precision and the second-best F-measure and recall when compared with state-of-the-art systems. The methods proposed by Yin (Yin and Schütze, 2015; Yin et al., 2015) did not report results for precision and recall. Once again, it is important to highlight that none of the related systems deals with the meaning and word order problems simultaneously, as our method does.

The fact that the proposed system achieves the best precision is important because, in general, a high true-positive rate is crucial for many applications that use PI results, for example Question Answering (Marsi and Krahmer, 2005) or Machine Translation Evaluation (Bannard and Callison-Burch, 2005).

Furthermore, the selected features used to classify the paraphrases combine statistical (Lexical-Lev and Syntactic-Lev) and dictionary-based (Lexical-Res and Semantic-Res) measures to compare similarities between words. The previous works presented in Section 2 and Table 7 point out that systems based on statistical measures tend to achieve better results in terms of accuracy and precision, while dictionary-based measures improve the recall of the systems.

As the methods of Das and Smith (2009) and Mihalcea et al. (2006) are based on statistical and dictionary-based measures, respectively, this explains why they achieve 79.57% precision and 97.7% recall, respectively. The proposed method, however, achieves a better F-measure (83.1%) by combining these two kinds of measures.


4.4. Discussion

324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341

344 345 346 347 348

TagedPD152X XThe proposed article did not achieve better results in general terms, but the main contribution was to eliminate meaning and word order problems. It follows some example to show the benefits of the proposed method related to these problems. D153X XAs mentioned before, the meaning problem happens when different words are used to describe the same entities. The sentences S1 and S2 presents a pair of paraphrase that were not identified by other methods. They have different unit matchs: (i) approved - > passed; (ii) legislation - > bill; this morning - > today. D154XTagedP XS1: “The House Government Reform Committee rapidly approved the legislation this morning.” TagedPD15X XS2: “The House Government Reform Committee passed the bill today.”

349 350

TagedPD156X XThe word order problem deals when the information in the sentence comes with a different structure. The sentences S3 e S4 uses the same words; however with different order. It could lead to problems for other algorithms. TagedPD157X XS3: “Atlantic Coast will continue its operations as a Delta Connections carrier.” TagedPD158X XS4: “It will continue its regional service for Delta Air Lines DAL.N , Atlantic Coast said.”


To conclude the discussion, we present situations in which the proposed method tends to erroneously predict whether a pair of sentences is a paraphrase or not. The two main situations found were:

Negation: It happens when one of the sentences denies the other. For example, sentence S6 is a negation of sentence S5:

S5: "Another said its members would continue to call the more than 50 million phone numbers on the Federal Trade Commission's list."
S6: "Meantime, the Direct Marketing Association said its members should not call the nearly 51 million numbers on the list."

Statement sentences: It happens when the sentences are statements and the speaker is described differently. For example:

S7: "It still remains to be seen whether the revenue recovery will be short or long lived," he said.
S8: "It remains to be seen whether the revenue recovery will be short- or long-lived," said James Sprayregen, UAL bankruptcy attorney, in court.


This confirms the hypothesis posed in (Mihalcea et al., 2006), (Islam and Inkpen, 2008), and (Oliva et al., 2011) that sentence similarity measures are an important step in paraphrase recognition, but not always a sufficient one: it often happens that portions of both sentences share a high degree of word overlap.


5. Conclusion


This paper proposed three new sentence similarity measures and a new method to identify paraphrases. The sentence similarity measures integrate lexical, syntactic, and semantic analysis, aiming to improve the results by incorporating different levels of information about the sentence. These similarities deal with two major state-of-the-art problems: meaning and word order. Another contribution of this work is the evaluation of different machine learning algorithms that use the proposed similarities as features to classify sentence pairs as paraphrases or not.

The proposed method was evaluated using the Microsoft Research Paraphrase Corpus and widely accepted evaluation metrics: accuracy, precision, recall, and F-measure. The method achieves better precision and F-measure, and the second-best accuracy and recall, when compared to state-of-the-art systems. In addition, a detailed experimental study of different similarity measures applied to paraphrase identification was presented.

New developments of this work are already in progress, including: (i) the improvement of the proposed method to deal with paraphrase identification on Twitter; (ii) the creation of mechanisms to deal with the negation and statement-sentence problems; and (iii) the application of the proposed method to a textual entailment task.


Acknowledgments


The research results reported in this paper have been partly funded by an R&D project between Hewlett-Packard Brazil and UFPE, originated from tax exemption (IPI - Law no. 8.248 of 1991 and later updates).


References


Achananuparp, P., Hu, X., Shen, X., 2008. The evaluation of sentence similarity measures. In: Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery. Springer-Verlag, Berlin, Heidelberg, pp. 305–316.
Androutsopoulos, I., Malakasiotis, P., 2010. A survey of paraphrasing and textual entailment methods. J. Artif. Intell. Res. 38 (1), 135–187.
Bannard, C.J., Callison-Burch, C., 2005. Paraphrasing with bilingual parallel corpora. In: Knight, K., Ng, H.T., Oflazer, K. (Eds.), ACL. The Association for Computational Linguistics, Ann Arbor, Michigan.
Choudhary, B., Bhattacharyya, P., 2002. Text clustering using semantics. In: Proceedings of the World Wide Web Conference 2002.
Coelho, T.A.S., Calado, P., Souza, L.V., Ribeiro-Neto, B.A., Muntz, R.R., 2004. Image retrieval using multiple evidence ranking. IEEE Trans. Knowl. Data Eng. 16 (4), 408–417.
Das, D., Smith, N.A., 2009. Paraphrase identification as probabilistic quasi-synchronous recognition. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 468–476.
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A., 1990. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41 (6), 391–407.





Dolamic, L., Savoy, J., 2010. When stopword lists make the difference. J. Assoc. Inf. Sci. Technol. 61 (1), 200–203. doi:10.1002/asi.v61:1.
Dolan, B., Quirk, C., Brockett, C., 2004. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA. doi:10.3115/1220355.1220406.
Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., 2014. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15 (1), 3133–3181.
Ferreira, R., Lins, R.D., Freitas, F., Avila, B., Simske, S.J., Riss, M., 2014. A new sentence similarity assessment measure based on a three-layer sentence representation. In: Proceedings of the ACM Symposium on Document Engineering.
Ferreira, R., Lins, R.D., Freitas, F., Avila, B., Simske, S.J., Riss, M., 2014. A new sentence similarity method based on a three-layer sentence representation. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pp. 110–117.
Ferreira, R., de Souza Cabral, L., Lins, R.D., de França Silva, G., Freitas, F., Cavalcanti, G.D.C., Lima, R., Simske, S.J., Favaro, L., 2013. Assessing sentence scoring techniques for extractive text summarization. Expert Syst. Appl. 40 (14), 5755–5764.
Heilman, M., Smith, N.A., 2010. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In: Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1011–1019.
Islam, A., Inkpen, D., 2006. Second order co-occurrence PMI for determining the semantic similarity of words. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), pp. 1033–1038.
Islam, A., Inkpen, D., 2008. Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2 (2), 10:1–10:25.
Kondrak, G., 2005. N-gram similarity and distance. In: Consens, M., Navarro, G. (Eds.), String Processing and Information Retrieval. Lecture Notes in Computer Science, vol. 3772. Springer, Berlin, Heidelberg, pp. 115–126.
Li, Y., Bandar, Z.A., McLean, D., 2003. An approach for measuring semantic similarity between words using multiple information sources. IEEE Trans. Knowl. Data Eng. 15 (4), 871–882. doi:10.1109/TKDE.2003.1209005.
Liu, T., Guo, J., 2005. Text similarity computing based on standard deviation. In: Proceedings of the 2005 International Conference on Advances in Intelligent Computing - Volume Part I. Springer-Verlag, Berlin, Heidelberg, pp. 456–464.
de Marneffe, M.-C., Manning, C.D., 2008. The Stanford typed dependencies representation. In: Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation. Association for Computational Linguistics, Manchester, United Kingdom, pp. 1–8.
Màrquez, L., Carreras, X., Litkowski, K.C., Stevenson, S., 2008. Semantic role labeling: an introduction to the special issue. Comput. Linguist. 34 (2), 145–159.
Marsi, E., Krahmer, E., 2005. Explorations in sentence fusion. In: Proceedings of the 10th European Workshop on Natural Language Generation, pp. 109–117.
Mihalcea, R., Corley, C., Strapparava, C., 2006. Corpus-based and knowledge-based measures of text semantic similarity. In: Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1. AAAI Press, Boston, Massachusetts, pp. 775–780.
Miller, F.P., Vandome, A.F., McBrewster, J., 2009. Levenshtein Distance: Information Theory, Computer Science, String (Computer Science), String Metric, Damerau-Levenshtein Distance, Spell Checker, Hamming Distance.
Miller, G.A., 1995. WordNet: a lexical database for English. Commun. ACM 38, 39–41.
Oliva, J., Serrano, J.I., del Castillo, M.D., Iglesias, A., 2011. SyMSS: a syntax-based measure for short-text semantic similarity. Data Knowl. Eng. 70 (4), 390–405. doi:10.1016/j.datak.2011.01.002.
Palmer, M., Gildea, D., Kingsbury, P., 2005. The Proposition Bank: an annotated corpus of semantic roles. Comput. Linguist. 31 (1), 71–106.
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J., 2002. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 311–318.
Qiu, L., Kan, M.-Y., Chua, T.-S., 2006. Paraphrase recognition via dissimilarity significance classification. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 18–26.
Smith, D.A., Eisner, J., 2006. Quasi-synchronous grammars: alignment by soft projection of syntactic dependencies. In: Proceedings of the Workshop on Statistical Machine Translation. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 23–30.
W3C, 2004. Resource Description Framework. http://www.w3.org/RDF/. Last access June 2015.
Wan, S., Dras, M., Dale, R., Paris, C., 2006. Using dependency-based features to take the “para-farce” out of paraphrase. In: Proceedings of the Australasian Language Technology Workshop 2006. Sydney, Australia, pp. 131–138.
Witten, I.H., Frank, E., 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Xu, W., Callison-Burch, C., Dolan, W.B., 2015. SemEval-2015 Task 1: paraphrase and semantic similarity in Twitter (PIT). In: Proceedings of SemEval 2015.
Yin, W., Schütze, H., 2015. Convolutional neural network for paraphrase identification. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 901–911.
Yin, W., Schütze, H., Xiang, B., Zhou, B., 2015. ABCNN: attention-based convolutional neural network for modeling sentence pairs. Trans. Assoc. Comput. Linguist. 259–272.
Yu, L.-C., Wu, C.-H., Jang, F.-L., 2009. Psychiatric document retrieval using a discourse-aware model. Artif. Intell. 173 (7-8), 817–829.
Zhou, F., Zhang, F., Yang, B., 2010. Graph-based text representation model and its realization. In: Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE 2010), pp. 1–8. doi:10.1109/NLPKE.2010.5587861.
