Information Processing and Management 50 (2014) 297–314
Revisiting Cross-document Structure Theory for multi-document discourse parsing

Erick Galani Maziero, Maria Lucía del Rosário Castro Jorge, Thiago Alexandre Salgueiro Pardo

Interinstitutional Center for Computational Linguistics (NILC), Institute of Mathematical and Computer Sciences (ICMC), University of São Paulo (USP), Avenida Trabalhador São-carlense, 400, 13566-590 São Carlos, SP, Brazil
Article history: Received 8 May 2013; Received in revised form 19 December 2013; Accepted 29 December 2013; Available online 23 January 2014

Keywords: Discourse parsing; Multi-document processing; Cross-document Structure Theory; Machine learning
Abstract

Multi-document discourse parsing aims to automatically identify the relations among textual spans from different texts on the same topic. Recently, with the growing amount of information and the emergence of new technologies that deal with many sources of information, more precise and efficient parsing techniques are required. The most relevant theory for multi-document relationships, Cross-document Structure Theory (CST), has been used for parsing purposes before, though the results have not been satisfactory. CST has received much criticism because of its subjectivity, which may lead to low annotation agreement and, consequently, to poor parsing performance. In this work, we propose a refinement of the original CST, which consists in (i) formalizing the relation definitions, (ii) pruning and combining some relations based on their meaning, and (iii) organizing the relations in a hierarchical structure. The hypothesis behind this refinement is that it will lead to better agreement in the annotation and, consequently, to better parsing results. To this end, a corpus was annotated according to this refinement, and an improvement in annotation agreement was observed. Based on this corpus, a parser was developed using machine learning techniques and hand-crafted rules. Specifically, hierarchical techniques were used to capture the hierarchical organization of the relations in the proposed refinement of CST. These two approaches were used to identify the relations among text spans and to generate the multi-document annotation structure. The results outperform other CST parsers, showing the adequacy of the proposed refinement of the theory.

© 2014 Elsevier Ltd. All rights reserved.
1. Introduction

Discourse parsing has a relatively short history in Computational Linguistics, with the comprehensive initial efforts dating back to the 1990s with the work of Marcu (1997). The task of discourse parsing aims to uncover the discourse relations among text spans in a single document, usually following the well-known Rhetorical Structure Theory (RST) (Mann & Thompson, 1987), to support applications such as text planning/generation and summarization. Based upon the success of discourse-based approaches and the explosion of data mainly brought by the web, the research community started to envision the possibility of automatically parsing sets of documents, establishing relationships among passages of different texts, in a task that is known as multi-document discourse parsing. Although the first works are attributed to Trigg (1983), and there is a tradition of investigations on hypertext linking (see, e.g., Allan, 1996; Green, 1999) and, more recently, on text entailment (see, e.g., Dagan, Glickman, & Magnini, 2005; Rios & Gelbukh, 2012),
it was only in 2000 that Cross-document Structure Theory (CST) was proposed by Radev (2000) as a general-purpose model. CST proposes a set of relations to connect passages of different texts (on the same topic) in order to determine similarities and differences among the texts, including relations of content overlap, elaboration, citation, etc. To get a better idea of how these multi-document relations occur, consider a set of documents narrating a car accident. These documents might have repeated information related to the location of the accident, contradictory information related to the number of deaths (since news is usually updated constantly to become more accurate), complementary details of the accident (e.g., some documents might give extra information, such as the drivers' ages), and information written in different styles (e.g., one document narrating in indirect speech something that was said in another text by a witness of the accident).

CST received some criticism due to its supposed generality. Afantenos (2007) argues that it is not possible to have a representative model that does not consider domain-dependent knowledge. To demonstrate this, the author redefines the model with ontological knowledge for the sport domain. However interesting this may be, it is expensive to achieve for other domains and strips the model of its generality, which is its main advantage over previous approaches. Independently of this discussion, CST has proved robust enough to improve results in some applications, mainly in summarization (see, e.g., Jorge & Pardo, 2010; Zhang, Blair-Goldensohn, & Radev, 2002), by allowing systems to determine the main passages of the documents while providing the means to deal with the multi-document phenomena, such as the occurrence of redundancy, complementarity, and contradiction among the documents, as well as writing style matters and decisions.

Efforts on automatic multi-document parsing are first attributed to Zhang, Otterbacher, and Radev (2003) and Zhang and Radev (2004), but they suffer from (i) data sparseness for machine learning – due to the small training corpus available – and (ii) definitional problems in CST (as acknowledged by Afantenos, Doura, Kapellou, & Karkaletsis, 2004), as some relations from the original model are very similar and hard to distinguish in some cases. Only recently have there been more initiatives on multi-document discourse parsing (which are discussed later in this paper), for varied purposes, but they are also limited by the corpora that are used, which causes these works to deal with selected groups of relations and not to tackle the problem as a whole.

In light of previous works and their bottlenecks, this paper addresses two main points: (i) the proposal of a refined – and still general-purpose – CST model and (ii) the investigation of varied strategies for multi-document discourse parsing. We believe that the better a discourse model represents the multi-document discourse phenomena, the more refined the management of such information and the accuracy of multi-document processing applications may be, including the task of discourse parsing itself. We start by revisiting CST and, based on corpus annotation and agreement measurement, we propose a new version of the model accompanied by a typology of relations, aiming not only at turning it into a sounder model, but also at better systematizing it.
The typology comprises a slightly reduced set of relations (with regard to the original CST model) and organizes these relations according to their meaning and the type of multi-document phenomena that they represent. The final relation set was based on both empirical evidence from the corpus and previous works in the area. We then show that the robustness of the model and the better corpus annotation result in better – state-of-the-art – parsing outcomes, which we pursue in two main ways, following the traditional flat and the newer hierarchical machine learning strategies. In particular, the hierarchical approach benefits from the typology of relations proposed for the revisited CST. We tackle all of the relations and, for some of them, we also make use of symbolic rules for relation detection. Rules were used for simple relations that are easily detected (e.g., the Identity and Translation relations) and for relations whose occurrence is sparse in our corpus and might not be appropriately learned by our machine learning strategies (e.g., the Contradiction relation).

In the next section we briefly introduce the related work in the area, both from the theoretical (discourse models) and practical (parsing strategies) perspectives. Section 3 presents our refinement of CST, while Section 4 reports our work on parsing. Conclusions and final remarks are made in Section 5.
2. Related work

2.1. Multi-document models

Among the first investigations that guided multi-document modeling are the works of Trigg (1983) and Trigg and Weiser (1986), which aimed at contributing to the management and storage of multiple scientific papers. The goal was to make explicit the underlying structure of the texts by capturing semantic relations among textual segments and hierarchical levels of information, such as domains and subdomains. For this aim, two types of segments were considered: Chunks and Tocs. Chunks represented textual portions that might be sentences, paragraphs or even documents; Tocs (from "Table of Contents") were indicators that pointed to more than one chunk, which may correspond, for example, to a subdomain or topic. Links represented semantic relations among segments and were divided into two types: commentary and normal links, which represented opinion and content relations, respectively. For instance, a commentary link of type Criticism or a normal link of type Explanation could be established between a segment A and a segment B. Trigg (1983) and Trigg and Weiser (1986) also suggested two types of directionality for the links: physical and semantic. Physical directionality referred to the order in which the link was read, exactly as it was drawn, while semantic directionality depends on the meaning of the link, which
does not necessarily correspond to the physical direction. For instance, if A criticizes B, the physical directionality indicates that A points to (refers to) B, but the semantic directionality indicates that B has to be read first. Links were manually established.

Allan (1996) presented a methodology for the identification of links among documents. These links represented content-based relations among the documents. This study, in the area known as Link Typing, was carried out in the context of information retrieval, where hypertexts were gaining increasing popularity. In particular, Allan proposed a set of links and their classification. The proposed links were classified into three types: Pattern-matching, Manual and Automatic. Pattern-matching links could be identified by using simple or elaborate pattern matching techniques, such as word matching. Examples of these links were Definition, Comment and Content. The identification of manual links required human intervention, because it was not possible to identify them with the computational techniques of that time. Examples of these links were Circumstance, Cause, and Purpose. Automatic links, on the other hand, required more sophisticated techniques than Pattern-matching links, but it was still possible to identify them computationally. Examples of these links were Aggregate, Tangent, Comparison, Contrast, and Equivalence.

McKeown and Radev (1995, 1998) proposed a set of semantic relations and a model for multi-document summarization using these relations. These relations were analogous to the links proposed by Trigg (1983) and Allan (1996), but more focused on the representation of multi-document phenomena, such as similarities, contradictions, differences, complementarities, and the evolution of facts in time. These relations would make it possible to better explore users' interests in a set of related documents and to produce good-quality summaries. Differently from Trigg (1983) and Allan (1996), McKeown and Radev (1995, 1998) proposed that relations should be established among textual units that were not necessarily fixed segments such as sentences or paragraphs. In their model, textual information was organized in templates, which were predefined structures containing attributes and values, corresponding to typical information of a particular topic. For instance, they worked on texts related to terrorism, and templates were filled in with information such as the location of the attack, the number of dead people, and the perpetrators. According to their proposal, templates were automatically filled in, and semantic relations should be manually established among these templates.

Radev (2000) proposed CST (Cross-document Structure Theory) based on RST (Rhetorical Structure Theory) (Mann & Thompson, 1987), the single-document discourse structuring theory, and also on other previous multi-document works (Allan, 1996; McKeown & Radev, 1998; Trigg, 1983; Trigg & Weiser, 1986). This model was originally designed and applied in multi-document summarization, exploring different user preferences. For this aim, the author proposed a set of 24 discourse relations that model multi-document phenomena. In CST, relations may have two different directionalities: symmetrical or asymmetrical.
Directionality depends on the semantic nature of the relation: symmetrical directionality occurs when both textual segments are affected in the same way by the relation (e.g., the Equivalence relation among text units), and asymmetrical directionality occurs when one textual unit affects the other (e.g., a textual unit elaborates the content of another unit). This will be explained in more detail in Section 3. It is important to notice that, for multiple documents, the notion of discourse (a sequence of phrases in a given order, aiming at achieving a communicative goal, according to the author's intentions) may not be the same as for a single document, since multiple texts are produced by multiple authors with diverse purposes. Thus, it might be more convenient to refer to CST as a "semantic-discursive" theory, because it expresses relations among contents from various documents and builds a structure of this "discourse". For simplicity, we refer to these relations in the same way the author does, as discourse relations.

Afantenos et al. (2004) and Afantenos (2007) also explored semantic relations among textual units from different texts on the same topic. Afantenos proposed an information extraction method based on templates, similar to McKeown and Radev (1998), but focused on the football domain. Differently from McKeown and Radev (1998) and Radev (2000), they argued that relations should be more specific to the topic. For instance, for the football match category, relations should describe the evolution of events through time and also point out similarities and differences. For this reason, the author classified relations into two categories: Synchronic and Diachronic. Synchronic relations described an event, at a particular period of time, across many sources of information. These relations are very similar to the ones proposed in CST. Examples of synchronic relations are Identity, Equivalence, Elaboration, etc. Diachronic relations described the evolution or progress of an event in one source of information through a period of time. For example, discussions on the progress of a certain team along a match could be described by diachronic relations such as Stability or Antithesis. For instance, Stability describes the permanence of an opinion or state of a fact through different periods of time (e.g., the performance of a football team narrated some minutes before is equal to the performance narrated some minutes after); on the other hand, Antithesis describes the change of an opinion or state of a fact in different periods of time (e.g., the performance of a football team is good at some point and then changes to bad). Although very interesting, this model never had an automatic application.

Other recent approaches to multi-document modeling, such as the task of Recognizing Textual Entailment (RTE) (Dagan et al., 2005), aim at recognizing whether the meaning of some portion of text may be entailed from another. These portions of text are generally from different documents. Usually, the main goal of RTE is to classify the semantic relation between a text (T) and one or more hypotheses (H) as Entailment, Contradiction or Unknown. Differently from the tradition in discourse modeling, RTE attempts are tailored for determining entailment, not dealing with the richness of a large set of discourse relations.

Although all models treat the multi-document phenomena in some way, they have some differences, because most of them were created for different purposes.
For instance, Trigg (1983) and Trigg and Weiser (1986) aimed at a different organization and management of multiple scientific documents, focusing on critiquing, arguing and making explicit the semantics
behind those texts, and, therefore, covering various factors of the multi-document phenomena, such as complementarities and contradictions, among others. Although the proposed links are very rich, they still do not cover all the phenomena; for instance, there were no links for making explicit redundancies among parts of texts. Allan (1996) has more notorious deficiencies in this sense, since the proposed links were created for hypertext linking. These links did not focus on the semantics among parts of texts, but on the semantics among complete texts, and, therefore, few multi-document factors were considered. RTE models, on the other hand, treat the meanings of segments in order to detect entailment, but other multi-document phenomena, such as redundancies or diverse writing styles and forms, are not considered. The models of McKeown and Radev (1998), Radev (2000), and Afantenos (2007) were more focused on treating the multi-document phenomena, and the relations they propose are very similar because the three of them were created for summarization purposes. For instance, the relations cover redundancies, complementarities, contradictions and writing styles.

The CST model has recently become very popular in the multi-document scenario, especially in multi-document summarization applications. Despite its popularity, the subjectivity underlying CST has led to various attempts at refinement, as we discuss later in this paper.

2.2. Multi-document parsing

The works of Zhang et al. (2003) and Zhang and Radev (2004) were the first attempts to automate the CST analysis for the English language. The authors developed the analysis in two stages: initially, a classifier determines whether two sentences from different texts are related by some CST relation, regardless of which relation may occur; then, another classifier determines the CST relations between the pair of sentences. Three types of features are used: lexical, syntactic and semantic features. In the first step, the features are provided by lexical similarity measures, such as the cosine (Salton & Lesk, 1968) and BLEU (Papineni, Roukos, Ward, & Zhu, 2002) measures. In the second step, the features include the number of words in each sentence, the number of common words in the sentences, the number of words of each morphosyntactic class (part-of-speech tag) in each sentence, the number of common words for each morphosyntactic class in the sentences, and the semantic similarity among the main concepts in the sentences. They used labeled and unlabeled data (in the latter case, using a boosting technique). The corpus used in this process was the CSTBank (Radev, Otterbacher, & Zhang, 2004), with 41 news texts distributed in 6 clusters, each one on the same topic. The obtained results are reproduced in Table 1. Precision is the number of correctly classified examples in relation to the total number of classified examples. Recall is the number of correctly classified examples in relation to the number of correct examples in the test set. F-measure is the harmonic mean of the two cited values. One may see that only some relations were treated and that some of them had low performance (below 0.2 F-measure), such as Subsumption, Elaboration and Description. The authors explained that these results were caused by the sparseness of training data, considering the number of relations. In general, they obtained a 0.25 average F-measure for the 6 relations (notice that the table incorporates the results for "no relation" in the average numbers).
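To make these lexical similarity features concrete, the sketch below shows a simple bag-of-words cosine similarity of the kind used in the first classification stage. This is a generic illustration in Python, not Zhang and Radev's actual implementation, and the whitespace tokenization is our assumption.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(s1: str, s2: str) -> float:
    """Bag-of-words cosine similarity between two sentences."""
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    # Dot product over the shared vocabulary (missing words count as 0)
    dot = sum(count * v2[word] for word, count in v1.items())
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0
```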
Marsi and Krahmer (2005) treated semantic relation classification between pairs of sentences in a corpus in Dutch. The authors used five mutually exclusive relations: Equals, Generalizes, Specifies, Restates, and Intersects. In their work, human performance was evaluated (using F-measure) and achieved 0.98 in the alignment of the sentences (by alignment, the authors mean the determination of which pairs of sentences have some relationship) and 0.95 in the identification of the relations. The alignment of the sentences is made at the level of the dependency structures of the sentences, since the authors hypothesized that string overlap alone is not sufficient for recognizing semantic relations. The proposed automatic methodology achieved 0.85 in the alignment and 0.80 in the identification of the semantic relations. The authors used machine learning techniques to identify relations, with the following features: (i) whether the sentences were identical or not, (ii) binary features for the occurrence of each of the 5 relations in lower (children) nodes in the tree, (iii) whether one of the lower nodes is aligned or not, and (iv) the lexical semantic relations among the nodes as found in Wordnet (Miller, 1995).

Miyabe, Takamura, and Okumura (2008) proposed a methodology to identify the Equivalence and Transition relations among sentences grouped according to their similarity. Firstly, the Equivalence relation was identified (which indicates sentences with the same content, but different wording), followed by the Transition relation (which happens among sentences with the same information, differing in numerical values; this relation is similar to the CST Contradiction relation). The following features were used: (i) cosine similarity, (ii) normalized length (in characters), (iii) difference in the document
Table 1
Results obtained by Zhang and Radev (2004).

CST relation   Precision   Recall   F-measure
No relation    0.8875      0.9605   0.9226
Equivalence    0.5000      0.3200   0.3902
Subsumption    0.1000      0.0417   0.0588
Follow-up      0.4727      0.2889   0.3586
Elaboration    0.3125      0.1282   0.1818
Description    0.3333      0.1071   0.1622
Overlap        0.5263      0.2941   0.3773
Average        0.4474      0.3057   0.3502
publication dates, (iv) sentence position in the text, (v) conjunctions, (vi) semantic similarities, (vii) expressions at the end of the sentence, and (viii) named entities. The corpora used were from the Text Summarization Challenge (Okumura, Fukushima, & Nanba, 2003) and the Workshop on Multimodal Summarization for Trend Information (Kato, Matsushita, & Kando, 2005), both for the Japanese language. The precision for the Equivalence relation (0.95) is double the value for the Transition relation (0.43), since the latter is more complex to identify. The F-measure was also better for Equivalence (0.76) than for Transition (0.46).

Zahri and Fukumoto (2011) presented a multi-document summarization methodology that incorporated link analysis with rhetorical relations for summary generation. Although the goal of the work was multi-document summarization, a methodology for the automatic identification of relations among sentences was briefly described. The authors used a machine learning approach with SVM to identify the following relations: Identity, Paraphrase (equivalent to the Equivalence relation), Subsumption, Overlap, and Elaboration. The following features were used: (i) cosine similarity, (ii) word overlap between sentences, (iii) length of sentences, and (iv) the overlap ratio of the words between sentences. Training was performed using the CSTBank corpus, but the methodology was not evaluated.

Kumar, Salim, and Raza (2012) presented a machine learning approach to automatically detect CST relations, specifically Identity, Overlap, Subsumption, and Description. The authors proposed five features: (i) cosine similarity, (ii) word overlap, (iii) length (whose value was 1 if the first sentence was longer than the second one, −1 if it was shorter, and 0 if they had the same length), (iv) NP (Noun Phrase) similarity between the sentences and (v) VP (Verb Phrase) similarity between the sentences. To perform the classification, three techniques were used: an SVM classifier, using the RBF kernel function; Neural Networks, using a feed-forward network with a tan-sigmoid transfer function in the hidden layer; and Case-Based Reasoning (CBR), using the cosine similarity measure for case comparison. All techniques were implemented using CSTBank data, considering 477 sentence pairs for training and 205 sentence pairs for testing. These included 100 sentence pairs that had no CST relations. A 5-fold cross validation was performed. For No relation and the Identity relation, SVM and Neural Networks performed better than CBR in terms of F-measure; for the other relations, CBR performed better, achieving 0.80, 0.78 and 0.72 F-measure for Subsumption, Description and Overlap, respectively. CBR also achieved an accuracy of 80.5%, while SVM achieved 78.6% and Neural Networks 80.0%. The authors argued that these results could be due to the generalization capacity of CBR.

In a related research line, there are the works on text entailment, which, as commented in the previous subsection, do not aim at identifying a rich set of discourse relations, but only entailment and, sometimes, some related relations. Some of them are interesting for the features and measures they explore for the task.
For instance, the well-known works of Jijkoun and De Rijke (2005) and MacCartney, Grenager, De Marneffe, Cer, and Manning (2006) make use of a dependency-based word similarity measure (following the work of Lin, 1998) and lexical chains (as proposed by Hirst & St-Onge, 1998), and a set of grammatical and shallow semantic features, respectively. Others enrich the set of relations and come closer to the problem of discourse parsing, although still being very limited. In this line, it is worth citing the work of Ohki et al. (2011). These authors extended the relation set (which usually includes the entailment, contradiction, and unknown relations) with the confinement relation, which indicates that two sentences are in an entailment relation, but with some condition or restriction (which, if not satisfied, might result in a contradiction relation). Using templates/rules to identify the confinement relation, the authors achieved a 61% F-measure.

Our method to identify CST relations is different from the previous approaches because (a) it adopts a refined CST theory to generate and organize the classifiers, as in the top-down approach to the hierarchical classifiers, (b) it uses a richer set of features (lexical, morphosyntactic and semantic features), (c) it combines the classifiers with rules in a joint approach, and (d) it tries to recognize all the relations predicted in the refined CST (while most of the previous works attempt to find only some of them, as one may notice). We present our refined model in the next section.
3. CST annotation: original theory, previous refinements and a new proposal

In the last decade, interest in CST applications began to arise, especially in multi-document summarization (Jorge & Pardo, 2010; Radev et al., 2004; Zhang et al., 2002), but also in other areas such as query reformulation, learning support, and opinion mining on the web (see, e.g., Beltrame, Cury, & Menezes, 2012; Inam, Shoaib, Majeed, & Shaerjeel, 2012; Murakami et al., 2010). This interest in CST led to the necessity of a well-defined theory to properly represent the multi-document phenomena.

As mentioned in the previous section, Radev (2000) proposed the original theory with a set of 24 relations, which are listed in Table 2. These relations may be applied between any textual units (words, phrases, sentences, paragraphs or larger portions of text), as illustrated in Fig. 1, which was extracted from Radev (2000, p. 5).

Despite its simplicity, CST analysis may be subjective and ambiguous, because different human annotators may identify different relations among the same textual segments, or may select different textual segments to relate. In fact, some works have criticized CST, such as Zhang et al. (2002) and Afantenos et al. (2004), who argued that CST was ambiguous and generic, respectively. Several works that studied the applicability of CST in multi-document corpora have evidenced these problems as well. This subjectivity may be caused by various factors, such as the similarity among relation definitions or the lack of a proper understanding of the relations and their semantic nature. Differently from other models, such as Trigg (1983),
Table 2
Original set of CST relations.

Identity                Modality           Judgment
Equivalence             Attribution        Fulfillment
Translation             Summary            Description
Subsumption             Follow-up          Reader profile
Contradiction           Elaboration        Contrast
Historical background   Indirect speech    Parallel
Cross-reference         Refinement         Generalization
Citation                Agreement          Change of perspective
Fig. 1. CST relations among textual segments at different levels.
Allan (1996) or Afantenos et al. (2004), Radev (2000) did not propose any classification or division of the relations. Besides the obvious differences in the semantic nature of the relations (e.g., the Indirect speech relation is more related to a writing style decision, while the Elaboration relation is a content relation), no differentiation among relation types was proposed. One of the contributions of this work is a proposal for classifying the CST relations, which will be discussed later.

Radev et al. (2004) had already acknowledged the difficulties in CST analysis when building the CSTBank. This corpus is composed of 6 clusters of documents written in English, where each cluster contains an average of 8 texts on the same topic. In total, the corpus contains 41 texts, with 28 sentences per text on average. This corpus was manually annotated using the original set of relations proposed by Radev (2000) and their definitions. The annotation was performed considering sentences as the textual unit of analysis. The task was carried out by 8 annotators. According to the reports in this work, the kappa agreement measure (Carletta, 1996) obtained from the annotation was 0.53 (in a range of 0–1), which was considered low, since good kappa measures would be above 0.6. The authors argued that it was difficult to reach agreement in the task, since there were multiple relations that could connect the same pair of sentences, and, therefore, it was difficult for the annotators to use the same criteria.

Figs. 2–4 illustrate this difficulty in the annotation of CSTBank, with examples directly extracted from this corpus. The examples show difficulties of different natures. It may be observed in Fig. 2 that, for the same pair of sentences, two different annotators identified different relations, but it is possible to grasp the reasons behind each one (the Follow-up relation might refer to the hole in the floor and the existence of smoke after the hit; the Subsumption relation might refer to the first sentence offering more information than the second). In Fig. 3, both relations look equally good and it is very hard to pick only one as the correct one. The example in Fig. 4
Sentence 1: The crash put a hole in the 25th floor of the Pirelli building, and smoke was seen pouring from the opening.
Sentence 2: The Pirelli Building in Milan, Italy, was hit by a small plane.
Annotator 1 → Follow-up
Annotator 2 → Subsumption

Fig. 2. Example of disagreement in CSTBank.
Sentence 1: Gulf Air is jointly owned by the Gulf states of Bahrain, Oman, Qatar and Abu Dhabi.
Sentence 2: Bahrain television reported 143 people were on board.
Annotator 1 → Elaboration
Annotator 2 → Historical background

Fig. 3. Example of incoherent relations in CSTBank.
Sentence 1: Police cordoned off the area as people gawked at the skyscraper.
Sentence 2: Police cordoned off the area as passersby gawked at the skyscraper.
Annotator 1 → Equivalence
Annotator 2 → Identity

Fig. 4. Another example of incoherent relations in CSTBank.
shows two sentences that should be labeled with the Equivalence relation, but the pair was labeled as Identity by one of the annotators.

Table 3 shows the number of agreements and disagreements that occur in the two publicly available clusters of CSTBank. There are 704 pairs of sentences, and 276 (39.2%) show disagreement among the annotators. It is interesting to notice that such disagreements are systematic and do not originate only from chance or human error, as one might imagine. For instance, we could find 25 other examples similar to the one in Fig. 2, where one annotator indicated a Follow-up relation and the other indicated a different (and sometimes contrasting) one. In the same line, we found 17 other examples where only one annotator identified a Contradiction relation, while the other indicated relations such as Equivalence, Subsumption and Overlap. It is also very common to find confusions of Historical background with the Description and Elaboration relations, as well as between Elaboration and several other relations. Such occurrences are due to different text interpretations, which are valid given that the CST model is not formalized enough to allow a better distinction among such cases, with some relation definitions being very vague or very similar. Human errors also happen; examples such as the one in Fig. 4 – where some annotators did not identify the Identity relation – are rarer. However, there are some very strange cases. We could find, for instance, 3 cases where one annotator indicated an Identity relation (which accounts for sentences with the exact same wording) and the other annotator indicated an Equivalence relation (which accounts for sentences with different wording, but the same content). These are relations that make no sense together.

These annotation problems and ambiguities have led to some attempts at theory refinement. For instance, Zhang et al. (2002) found that some relations were ambiguous because they had similar definitions (for example, Elaboration and Refinement), which reduced the chance of agreement among annotators. In order to improve this scenario, the authors proposed a refined set of CST relations consisting of 18 relations out of the 24 original ones; redundant relations were eliminated. In order to measure the level of agreement, the authors conducted an experiment consisting of the annotation of five pairs of documents out of 11 journalistic articles on the topic of an airplane crash. The authors report that the annotators agreed totally or partially (when the majority of them indicated the same relation) in 58% of the cases for a sample, leaving 42% of the cases in complete disagreement. As one may see, there is still too much disagreement.

For our proposal, the CSTNews corpus (Cardoso et al., 2011) was used, which is a set of 50 clusters containing, on average, 3 news texts on the same topic in each cluster. The texts were extracted from several important Brazilian newspapers, such as Folha de São Paulo, O Estado de São Paulo, O Globo, Jornal do Brasil, and Gazeta do Povo. The corpus contains a total of 140 texts, with 40 sentences per cluster on average. All of the texts are written in Brazilian Portuguese. For the annotation of CSTNews, it was decided to produce a new CST refinement, in order to better understand the nature of the theory and how it represents the multi-document phenomena. This new refinement was based on the 18 relations proposed by Zhang et al. (2002).
Table 3
Disagreement in a sample of the CSTBank.

Cluster name   Agreement   Disagreement
Milan9         385         161
Gulfair11      43          115
Total          428         276

In this refinement, the elimination of some relations was proposed based on definition similarity, which may cause ambiguity, according to previous criticisms of the theory. For instance, Description was eliminated because it was considered very similar to Elaboration, since both definitions cover textual segments that bring more details about an entity or event of another textual unit. Following the same criteria, Fulfillment was considered very similar to Follow-up. Reader profile and Change of perspective were considered unnecessary, since they deal with information that is beyond the text. The final 14 relations used for the annotation may be observed in Table 4.

Table 4
Refined CST relations for CSTNews annotation.

Elaboration             Contradiction
Overlap                 Summary
Subsumption             Identity
Historical background   Modality
Attribution             Indirect speech
Equivalence             Citation
Follow-up               Translation

It was also observed that some of the relations follow similar patterns. For example, Subsumption and Overlap both relate textual segments that have similar content but may also have different extra content, that is, they refer to information units with
partial redundancy. In the case of Subsumption, only one textual unit has extra content, while, for Overlap, both textual segments have extra content. Based on this evidence, and following the work of Maziero, Jorge, and Pardo (2010), we propose a typology of CST relations, in which relations are grouped according to their semantic nature. This typology is shown in Fig. 5. It may be observed that there are two main categories of relations: content and form. The content category refers to relations that indicate similarities and differences among the contents of the texts. This category is divided into three subcategories: Redundancy, Complement, and Contradiction. Redundancy includes relations that express a total or partial similarity among sentences. For example, Identity, Equivalence and Summary express full similarity among textual segments, while Overlap and Subsumption express that only some of the information is similar, as explained before. Complement relations link textual segments that elaborate on, give continuity to, or provide background for some other information. These relations may be temporal or non-temporal; for instance, Historical background and Follow-up are considered temporal, while Elaboration is non-temporal. The last subcategory of the content category includes only Contradiction. In the form category, on the other hand, all the relations that deal with superficial aspects of information are included, for example, writing styles (Indirect speech, Modality), citations (Attribution, Citation) or language (Translation).

The classification of relations according to this typology also imposes some restrictions on their annotation. For example, the same pair of information units may not be related by Identity and Contradiction, or by Subsumption and Overlap, since such relation pairs do not make sense for the same pieces of information. This understanding of the semantic nature of the relations helps annotators to be less subjective when annotating CST, which may help to reduce the annotation ambiguity and, therefore, to reach higher levels of annotation agreement. As Hovy and Lavid (2010) argue, annotation agreement is essential in order to have a trustworthy annotation and, consequently, to build more accurate tools from the data.

Besides the typology, the definition of each relation was also formalized. This formalization was given by two main features: directionality and restrictions. Given two sentences S1 and S2, directionality may be null (S1–S2), to the left (S1 ← S2) or to the right (S1 → S2). Restrictions, in turn, determine the situations in which a relation should be established. Another (non-obligatory) feature contained in the definition is a commentary, which gives extra explanations that may help to understand the definition of the relation. An example of a CST relation definition is shown in Table 5.

For instance, let us consider the following pair of sentences describing the details of an airplane accident in the Democratic Republic of Congo. These sentences were extracted from the CSTNews corpus and translated into English:

(S1) According to a UN spokesman, the Russian-made plane was trying to land at the Bukavu airport in the middle of a storm.

(S2) The airplane exploded and caught fire, said the spokesman.
Fig. 5. Typology of CST relations.
Table 5
Example of CST relation definition.

Name of relation: Follow-up
Directionality: S1 ← S2
Restrictions: S2 presents events that occur after the events presented in S1; the events in S1 and S2 must be related and the time interval between them must be short
Commentary: –
In this example, sentence S2 narrates the explosion and fire of the airplane, which is an event that happened after the event narrated in S1. It is important to notice that the directionality may represent two types of reading, similar to the ones explained by Trigg (1983). In the example above, it may be observed that the semantic directionality points to the left (S1 ← S2), while the physical directionality points to the right (S1 → S2), since S1 has to be read first in order to understand the logical sequence of the events.

The annotation of CSTNews was performed by 4 computational linguists over a period of 4 months. Before the annotation process began, the annotators were trained for about 3 months. The training phase included the study of the refined theory and also learning how to use CSTTool (Aleixo & Pardo, 2008), which is a semiautomatic tool for CST annotation. CSTTool was designed to support the 3 basic tasks of annotation: text segmentation, detection of textual segment pairs that are candidates for CST relations, and identification of the relation between the selected unit pairs. Text segmentation is performed using SENTER (Pardo, 2006), which splits texts into sentences. The detection of candidate textual segment pairs is carried out by using the word overlap measure, which measures the lexical similarity between a pair of segments. This measure is given by Eq. (1).
\[ \mathrm{WordOverlap}(S1, S2) = \frac{\text{number of common words between } S1 \text{ and } S2}{\text{number of words in } S1 + \text{number of words in } S2} \tag{1} \]
The result of this equation is a value between 0 and 1, with values closer to 1 indicating that the segments have more words in common and values closer to 0 indicating that they have fewer words in common. In order to select the candidate pairs, a threshold is established, indicating the minimum word overlap value that two segments should have to be selected as candidates. For the annotation of CSTNews, a threshold of 0.12 was established, the same used by Zhang et al. (2002) and confirmed for Portuguese by Aleixo and Pardo (2008). This procedure is followed by other works in the literature (see, e.g., Zhang et al., 2003) and guarantees that the task is viable, since trying to relate all possible pairs of sentences is intractable. In fact, Zhang et al. (2003) argue that CST relations usually happen among sentences with some lexical similarity. So as not to restrict the annotation too much, annotators were also allowed to consider other pairs of segments to relate (i.e., not only those indicated by the annotation tool). Finally, the core of the annotation itself (the choice of relations) was performed by the annotators, who followed the proposed typology shown in Fig. 5.

Some relation frequencies were extracted after the annotation, and they are shown in Fig. 6. These results were expected, because the content relations referring to redundancy (e.g., Overlap and Subsumption) are more frequent in texts on the same topic than relations of the form type (e.g., Modality and Indirect speech).
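For concreteness, Eq. (1) and the threshold test can be sketched as follows. The tokenization and the counting of "common words" as shared word types are our assumptions, since the paper does not fix these details.

```python
def word_overlap(s1_tokens, s2_tokens):
    """Word overlap between sentences S1 and S2, as in Eq. (1)."""
    common = set(s1_tokens) & set(s2_tokens)  # shared word types (an assumption)
    return len(common) / (len(s1_tokens) + len(s2_tokens))

THRESHOLD = 0.12  # value used for the CSTNews annotation

# Toy Portuguese example: overlap = 5 / (6 + 9) ≈ 0.33, above the threshold
s1 = "o avião explodiu e pegou fogo".split()
s2 = "o avião caiu perto do aeroporto e pegou fogo".split()
if word_overlap(s1, s2) >= THRESHOLD:
    print("candidate pair for CST annotation")
```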
Fig. 6. Frequency of the CST relations.
Table 6
Kappa values obtained in the annotation.

                    Kappa value
Relations           0.50
Directionality      0.44
Grouped relations   0.61
Table 7
Agreement in annotation.

                    Total agreement (%)   Partial agreement (%)   Null agreement (%)
Relations           54                    27                      18
Directionality      58                    27                      14
Grouped relations   70                    21                      9
The kappa agreement measure was applied to measure agreement on relations, directionality and grouped relations. The last one refers to the categories grouping various relations; the kappa measure for grouped relations therefore expresses the level of agreement when relations belong to the same subcategory (Redundancy, Complement, Contradiction, Authorship and Style). It was computed in this way because annotators may disagree on some particular relation, but may agree on its type. Table 6 shows the agreement values. It is interesting to notice that the kappa value for relations is lower than the kappa value for grouped relations. This shows that the hierarchical organization of relations may help to reach a better level of agreement on the relations' categories.

Besides kappa, a simple percentage measure of agreement was also calculated, following the work of Zhang et al. (2002). With this measure, three types of agreement were evaluated: total, partial, and null agreement. Total agreement refers to full agreement among all the annotators; partial agreement is when only some annotators (the majority of them) agree; and null agreement is when none or a minority of the annotators agree. The results of these measures are shown in Table 7, where the numbers refer to the percentage of agreement. These results show that there is 81% of partial or total agreement for relations, 85% of partial or total agreement for directionality, and 91% of partial or total agreement for grouped relations. These results are better than those obtained by Zhang et al. (2002), who obtained 58% of partial or total agreement for relations. This indicates that the proposed refinement helps to reduce ambiguity and to reach a better understanding of the relations. This proposal differs from other works, such as Afantenos et al. (2004), McKeown and Radev (1998), Allan (1996), Trigg (1983) and even Radev (2000), in the way the organization of relations helps the understanding of the multi-document phenomena and clarifies the restrictions that the model imposes, and in the way the definitions express the meaning of the relations, which makes the annotation task more objective and, therefore, clearer.

There is certainly room for improving and formalizing the set of CST relations even more. It is worth noticing, however, that there is a trade-off between formalization and subjectivity/usefulness. Discourse models such as CST try to assign meaning/semantics to the relationships among text passages. As in every work on semantics, there is a certain degree of subjectivity, which is not explicitly codified in the texts and is up to the readers to grasp. As we try to restrict, formalize and make explicit the types of relationship, we run the risk of losing the rich interpretations that may occur and, therefore, of diminishing the usefulness of the discourse model for the envisioned applications. Therefore, some parsimony is necessary in model (re)engineering approaches. In what follows, we present our approach to automatic CST parsing, which follows the refined CST model introduced in this section.

4. Parsing

The refinements in the theory were used to create an automatic discourse parser. This initiative showed the adequacy of the refinements for automating the analysis. The typology, the definitions of the relations and the corpus annotation were the most important items for understanding and automating the analysis. Fig. 7 shows the architecture of the parser. All the documents of the group are initially segmented into sentences.
All pairs of sentences from different documents are compared using word overlap (Eq. (1), where S1 and S2 are the sentences of the pair). The word overlap measure is used to select a small set of sentence pairs which potentially have some CST relation, as verified by Zhang and Radev (2004). Its value varies from 0 to 1 (when the sentences are exactly equal). The selected pairs of sentences are then processed to identify the CST relations that hold between the sentences: lexical, morphosyntactic and semantic features are extracted from the pairs to be used by the classifiers. The identification of CST relations is made in two distinct ways: (i) using automatic classifiers and (ii) applying rules that were manually created through corpus analysis. Firstly, the classifiers search for content relations. Then, the rules are applied for the remaining (form) relations. The result is a graph, in which the nodes are the sentences of the documents and the edges are the relations among the sentences.
Fig. 7. Parser architecture.
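The flow of Fig. 7 can be summarized as in the sketch below. The component functions are injected as parameters because they are hypothetical stand-ins for the modules described in the text, not the parser's actual API; `word_overlap` is the function sketched in Section 3.

```python
from itertools import combinations, product

def parse_cluster(documents, segment, extract_features,
                  classify_content, apply_form_rules, threshold=0.12):
    """Sketch of the parser pipeline of Fig. 7.

    documents: mapping from document id to raw text.
    segment, extract_features, classify_content, apply_form_rules:
    hypothetical callables standing in for the components in the text.
    """
    sentences = {doc_id: segment(text) for doc_id, text in documents.items()}
    graph = []  # edges: (sentence_1, sentence_2, CST relation)
    # Only sentences from *different* documents are paired
    for d1, d2 in combinations(sentences, 2):
        for s1, s2 in product(sentences[d1], sentences[d2]):
            # Keep only pairs likely to hold some CST relation
            if word_overlap(s1.split(), s2.split()) < threshold:
                continue
            features = extract_features(s1, s2)
            # Classifiers first: the six frequent content relations
            relation = classify_content(features)
            if relation is not None:
                graph.append((s1, s2, relation))
            # Then the hand-crafted rules for the remaining (form) relations
            for rule_relation in apply_form_rules(s1, s2):
                graph.append((s1, s2, rule_relation))
    return graph
```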
Several approaches were explored to create classifiers to identify the CST relations. Firstly, all the relations were considered, but the imbalance of the relations led to bad results. Then, only six relations were considered to create the classifiers (Overlap, Subsumption, Equivalence, Elaboration, Historical background and Follow-up, from the Content group of the typology of relations). These six relations are the most frequent in the corpus, and the results improved this way. The other relations were treated by rules. Four types of classifiers were created: multiclass, multilabel, binary, and hierarchical classifiers. Given the low results obtained by some classifiers (as reported by Maziero et al., 2010), the multilabel classifiers were discarded and the other three types were kept. Some relations (Indirect speech, Attribution, Translation, and Contradiction), which do not have the necessary frequency in the corpus to be explored by machine learning classifiers, were detected by manually created rules. These rules were created after a corpus analysis, in which the examples in the corpus were studied. Another relation (Identity) was also detected by a rule, since it is very simple and is established when the sentences of the pair are identical. It is important to notice that another option would be to resample the low-frequency relations, but this would probably lead to overfitting and to a non-natural scenario, since these relations are naturally more difficult to find in texts. In what follows we report the creation of the classifiers, the rules, and the achieved results.

4.1. Classifiers

The classifiers were trained considering only the six relations cited above. Some previous experiments were conducted with all the CST relations, but the results were very low, motivating this approach. The defined typology of relations helped in the task. Some classifiers were created considering the hierarchy of the relations. Using a top-down approach for hierarchical classification (Freitas & Carvalho, 2007), the first classifier uses only two classes: Redundancy and Complement. The examples classified as Redundancy are treated by another classifier that decides between the Equivalence relation and the Partial Redundancy subcategory. Continuing in this branch of the typology, the examples classified as Partial Redundancy are treated by a classifier for the Subsumption and Overlap relations. In this approach, it is important to notice that the imbalance of the corpus is decreased by grouping the examples. This also happens at the lower levels of the hierarchy of relations. Another approach, with binary classifiers, was also explored: one relation against all the others. In this way, six classifiers were obtained. They are used in a pipeline for each pair of sentences, and the result with the highest confidence value is kept for that pair. A multiclass classifier was also constructed considering all six relations; in this case, the class is predicted in a single step.

The techniques used to develop the classifiers were NaiveBayes, Support Vector Machines (SVM), and decision trees (J48). NaiveBayes is a probabilistic technique, SVM is mathematical, and J48 is symbolic. These techniques were chosen due to their popularity and because they belong to different learning paradigms. The features used are listed below (a code sketch combining some of these features with the hierarchical strategy is given after their description):

(1) Difference in the number of words between the sentences (S1–S2).
(2) Percentage of words in common in S1 in relation to S2.
(3) Percentage of words in common in S2 in relation to S1.
(4) Position of S1 in the text (0 – beginning: first three sentences; 1 – middle; 2 – end: last three sentences).
(5) Position of S2 in the text (the same as above).
(6) Number of words in the longest common substring between S1 and S2.
(7) Difference in the number of nouns between S1 and S2 (excluding proper nouns).
(8) Difference in the number of adverbs between S1 and S2.
(9) Difference in the number of adjectives between S1 and S2.
(10) Difference in the number of verbs between S1 and S2.
(11) Difference in the number of proper nouns between S1 and S2.
(12) Difference in the number of numerals between S1 and S2.
(13) Difference in the number of attribution verbs between S1 and S2 (these verbs are also counted in feature 10).
(14) Number of possible synonyms in common between S1 and S2.
The first six features are superficial and need only basic processing of the texts. Feature 1 is the difference in the number of words between the sentences; it may show which sentence has more information, which may indicate the Subsumption relation. Features 2 and 3 are the percentages of words in common; they indicate the overlap of information as a cue to some content relation (e.g., Overlap and Subsumption). Features 4 and 5 are the positions of the sentences in their texts; generally, the first sentences are the most important and are elaborated by the following sentences. Feature 6 is the longest common substring between the sentences, indicating redundancy of information. These features may indicate the Overlap, Subsumption and Elaboration relations. Features 7–12 are computed using a POS tagger (Aires, Aluísio, Kuhn, Andreeta, & Oliveira, 2000) to identify the word classes; they measure the overlap of word classes between the sentences and are useful, for example, to identify the sentence that has more adjectives and possibly elaborates the other sentence. Feature 13 was obtained from a list of attribution verbs (e.g., "tell", "say" and "explain"), which are generally used to attribute some information or speech to a person, document, organization, etc. Feature 14 was obtained by using a thesaurus for Portuguese (Maziero, Pardo, Di Fellipo, & Dias da Silva, 2008); it is the overlap of synonyms and intends to evidence the same information expressed with different words, which might indicate the Equivalence relation.
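To illustrate how these features feed the classifiers, the sketch below computes a few of the surface features and runs the top-down hierarchical cascade of decisions described above. The classifier objects and their scikit-learn-style `predict` interface are our assumptions (the paper reports J48 decision trees), and the Complement branch is shown as a single flat step for brevity.

```python
def surface_features(s1_tokens, s2_tokens):
    """Features (1), (2) and (3) of the list above (our reading of them)."""
    common = set(s1_tokens) & set(s2_tokens)
    return [
        len(s1_tokens) - len(s2_tokens),       # (1) word-count difference
        100.0 * len(common) / len(s1_tokens),  # (2) % of S1 words also in S2
        100.0 * len(common) / len(s2_tokens),  # (3) % of S2 words also in S1
    ]

def classify_top_down(features, clf_root, clf_redundancy,
                      clf_partial, clf_complement):
    """Top-down hierarchical classification over the six content relations.

    Each clf_* is a trained classifier with a predict() method
    (an interface assumption, not the paper's actual API).
    """
    # Level 1: Redundancy vs. Complement
    if clf_root.predict([features])[0] == "Redundancy":
        # Level 2: total redundancy (Equivalence) vs. Partial Redundancy
        if clf_redundancy.predict([features])[0] == "Equivalence":
            return "Equivalence"
        # Level 3: Subsumption vs. Overlap
        return clf_partial.predict([features])[0]
    # Complement branch: Elaboration, Historical background or Follow-up
    return clf_complement.predict([features])[0]
```

Grouping examples at each level, as the cascade does, is what reduces the class imbalance mentioned above: each classifier sees fewer, larger classes.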
4.2. Rules

The rules are applied to every pair of sentences that was also processed by the classifiers, since a pair of sentences may have both content and form relations. For example, a simple rule was created to identify the Identity relation, which occurs when the sentences are absolutely equal: the rule simply verifies the exact matching of the strings of the two sentences. The rule to identify the Translation relation verifies whether there is some piece of text in another language (not Portuguese, which is the language of the corpus), translates it to Portuguese, obtains its synonyms and searches for these synonyms in the other sentence. If they occur in the other sentence, one may assume that there is a Translation relation between the sentences. It is interesting to notice that it is enough for only one word to be in another language for a Translation relation to hold. The Translation relation usually happens for the names of films and countries, for instance.

The rule for the Contradiction relation intends to find numerical divergences among sentences with the same information (differing by numbers only). For example, if one sentence is "Two bombs exploded in the market" and another sentence is "A bomb exploded in the market", the Contradiction relation occurs. Other types of contradiction appear in the sentences, but they are difficult to recognize due to the complexity of the task. For this reason, we focused this work on numerical contradictions only, which are the most frequent cases in our corpus. An example of another type of Contradiction happens between the sentences below, extracted from CSTNews and translated into English:

(S1) In a note sent after viewing the report, TAM says "that it had no record of any mechanical problem on this plane on July 16".

(S2) One day before the accident, on Monday, 16, the plane would have presented problems upon landing at Congonhas, during flight 3215, coming from Belo Horizonte (Confins), only stopping very near the end of the runway.

In S1, it is said that the airplane did not present problems on July 16, but, in S2, it is said that on the same day (the 16th) the airplane presented problems. The contradiction involves non-numerical information, and this inference is difficult to perform by automatic means. Some other contradiction types are more subtle and even harder to detect.

The rule for the Indirect speech/Attribution relations considers the two CST relations together because they are very similar, both attributing some information to an author/source. The difference is that Indirect speech requires a direct speech in some sentence of the pair. As an example of a rule, consider the following schema for the Indirect speech/Attribution rule:

1. Sentence S1 is verified in order to check whether some of the following cases occur: an attribution verb is followed by the word "that"; the words "according", "to" and "as" are followed by a proper noun, pronoun or determiner; a punctuation sign (a comma, for instance) is followed by an attribution verb.
2. If some of the verifications of step 1 is true, the next check is performed on sentence S2: there is a verb in the first person.
3. If the rule succeeds:
   a. in the two sentences: it is assumed that the Indirect speech relation occurs;
   b. only in the first sentence: it is assumed that the Attribution relation occurs.
4. Then, the sentences of the pair are inverted and the steps above are applied again.
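Before the corpus example below, here is a rough sketch of the Contradiction rule and of step 1 of the schema above. The cue list and patterns are illustrative English stand-ins: the actual rules operate on Portuguese text with a POS tagger, and the numeric check is a strict simplification (it would miss the spelled-out numerals of the "Two bombs"/"A bomb" example).

```python
import re

# Hypothetical cue list; the real rule uses Portuguese attribution verbs.
ATTRIBUTION_VERBS = r"(?:said|told|reported|explained|announced)"

def attribution_cue(sentence):
    """Approximation of step 1 of the Indirect speech/Attribution rule."""
    if re.search(ATTRIBUTION_VERBS + r"\s+that\b", sentence, re.IGNORECASE):
        return True  # attribution verb followed by "that"
    if re.search(r"\b[Aa]ccording\s+to\s+[A-Z]", sentence):
        return True  # "according to" followed by a capitalized (proper) noun
    if re.search(r",\s*" + ATTRIBUTION_VERBS + r"\b", sentence, re.IGNORECASE):
        return True  # punctuation sign followed by an attribution verb
    return False

def numeric_contradiction(s1, s2):
    """Contradiction rule: same wording except for diverging digits."""
    template1, nums1 = re.sub(r"\d+", "#", s1), re.findall(r"\d+", s1)
    template2, nums2 = re.sub(r"\d+", "#", s2), re.findall(r"\d+", s2)
    return template1 == template2 and nums1 != nums2

# Example: the comma + "reported" pattern fires on the sentence below.
print(attribution_cue("17 people were killed, reported a UN spokesman."))  # True
```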
As an example, consider sentences S1 and S2 below, which hold an Attribution relation. They share overlapping information (an air crash in the Democratic Republic of Congo) and S1 presents a source for the shared information (a spokesman of the United Nations), which is captured by the last condition for the first sentence (a comma followed by the attribution verb ‘‘reported’’).

(S1) A plane crash in the town of Bukavu, in the east of the Democratic Republic of Congo, killed 17 people on Thursday afternoon, reported today a spokesman of the United Nations.

(S2) At least 17 people died after the crash of a passenger plane in the Democratic Republic of Congo.

The creation of these rules was facilitated by the taxonomy and the hierarchy of the CST relations, which reduced the subjectivity of each relation.

4.3. Results

Initially, as previously discussed, the word overlap measure was used to decide whether a pair of sentences may hold a CST relation: if the overlap between two sentences was higher than the 0.12 threshold, they were considered candidates to receive a relation. CSTNews has 31,106 possible pairs of sentences, of which the human annotators related only 1651, indicating that only 5.3% of the possible pairs must be annotated. Table 8 summarizes the results of this step for different thresholds. The word overlap with a threshold of 0.12 resulted in an F-measure of 0.6268 for correctly detecting which pairs of sentences should not receive a relation. The value 0.12 was chosen because it combined good precision and recall: it has the highest precision among the thresholds whose recall is not excessively low. Since relatively few pairs of sentences are actually related (5.3% of them), we slightly favor precision in order to avoid proposing a large number of pairs that should not be related.

Tables 9–11 show the results for the multiclass, binary and hierarchical classifiers, respectively. All the classifiers are decision trees (induced with the J48 technique), since this technique obtained the best results, and all classification results were obtained with 10-fold cross-validation. For the multiclass classifier, the relations Overlap and Subsumption had F-measures over 0.44 (Table 9). These results (macro average) are better than those obtained by Zhang and Radev (2004): 0.38 and 0.06, respectively, for Overlap and Subsumption; Elaboration reached over 0.39 against 0.18 for the English parser. Zhang and Radev (2004) treated the relation Description, which we did not treat because we merged it with Elaboration in the refinement of the CST model; instead, we treated the relation Historical background. The average result was a 0.40 F-measure against 0.35 for English, i.e., 14% better. In the same table (Table 9), micro average results were calculated given the class imbalance of the corpus. As may be seen, aided by Fig. 6, the values are better or worse depending on the frequency of the relation. For example, the relation Equivalence has fewer instances than the other relations; therefore, its micro average precision is better than its macro average, but its recall is worse. Conversely, the relation Overlap, the most frequent in the set, has lower micro average precision but greater recall.

Table 8
Results for the step of selecting pairs of sentences to receive a CST relation.

Technique of selection     Precision   Recall    F-measure
Word overlap = 0.08        0.4814      0.7698    0.5924
Word overlap = 0.09        0.5416      0.7135    0.6158
Word overlap = 0.10        0.6456      0.6323    0.6389
Word overlap = 0.11        0.7041      0.5881    0.6409
Word overlap = 0.12        0.7733      0.5269    0.6268
Word overlap = 0.20        0.9943      0.2132    0.3511
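As an illustration of this selection step, the sketch below filters candidate pairs with a word-overlap measure; the exact normalization used in the parser may differ, so this is only indicative.

```python
def word_overlap(s1: str, s2: str) -> float:
    # Fraction of shared word types, normalized by the shorter sentence.
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / min(len(w1), len(w2))

def candidate_pairs(sents_a, sents_b, threshold=0.12):
    # Only pairs above the threshold go on to relation classification.
    return [(a, b) for a in sents_a for b in sents_b
            if word_overlap(a, b) > threshold]
```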
Table 9
Results for the multiclass classifier.

                         Macro average                        Micro average
Relation                 Precision   Recall   F-measure       Precision   Recall   F-measure
Elaboration              0.405       0.385    0.395           0.387       0.396    0.392
Overlap                  0.441       0.478    0.458           0.479       0.518    0.498
Follow-up                0.282       0.273    0.277           0.301       0.253    0.275
Historical background    0.299       0.260    0.278           0.221       0.195    0.207
Subsumption              0.449       0.447    0.448           0.474       0.522    0.496
Equivalence              0.378       0.359    0.368           0.393       0.282    0.328
No relation              0.773       0.527    0.627           0.773       0.527    0.627
Average                  0.432       0.390    0.407           0.433       0.385    0.403
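For reference, the evaluation setting can be reproduced roughly as below: J48 is Weka's C4.5 implementation, for which scikit-learn's CART decision tree is an approximate stand-in. X is assumed to hold the feature vectors of the sentence pairs and y their relation labels.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# CART rather than C4.5, but the closest off-the-shelf analogue to Weka's J48.
clf = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="f1_macro")  # 10-fold CV
print(f"Macro F-measure over 10 folds: {scores.mean():.3f}")
```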
Table 10
Results for the binary classifiers.

Relation                   Precision   Recall   F-measure
Elaboration
  Other                    0.799       0.821    0.809
  Elaboration              0.380       0.347    0.363
  Average                  0.589       0.584    0.586
Overlap
  Other                    0.743       0.773    0.758
  Overlap                  0.493       0.452    0.472
  Average                  0.618       0.612    0.615
Follow-up
  Other                    0.818       0.880    0.848
  Follow-up                0.346       0.246    0.287
  Average                  0.582       0.563    0.567
Historical background
  Other                    0.953       0.991    0.972
  Hist. background         0.478       0.143    0.220
  Average                  0.715       0.567    0.596
Equivalence
  Other                    0.977       0.991    0.984
  Equivalence              0.368       0.179    0.241
  Average                  0.672       0.585    0.612
Subsumption
  Other                    0.900       0.930    0.915
  Subsumption              0.485       0.388    0.431
  Average                  0.692       0.659    0.673
Table 11
Results for the hierarchical classifiers.

Relation                                             Precision   Recall   F-measure
Classifier A (Complement vs. Redundancy)
  Complement                                         0.640       0.675    0.657
  Redundancy                                         0.656       0.621    0.638
  Average                                            0.648       0.648    0.647
Classifier B (Partial vs. Equivalence)
  Partial                                            0.958       0.988    0.973
  Equivalence                                        0.556       0.256    0.351
  Average                                            0.757       0.622    0.662
Classifier C (Overlap vs. Subsumption)
  Overlap                                            0.824       0.852    0.838
  Subsumption                                        0.637       0.587    0.611
  Average                                            0.730       0.719    0.724
Classifier D (Elaboration vs. Temporal)
  Elaboration                                        0.548       0.601    0.573
  Temporal                                           0.593       0.541    0.566
  Average                                            0.570       0.571    0.569
Classifier E (Follow-up vs. Historical background)
  Follow-up                                          0.884       0.935    0.909
  Hist. background                                   0.683       0.532    0.599
  Average                                            0.783       0.733    0.754
Even using unlabeled data, the methodology of Zhang and Radev (2004) obtained results below those of our parser. It is important to make clear that the comparison between the parser for Portuguese and the parser for English is not completely fair, given the different languages and corpora used to train and test the classifiers. Still, these are the best results obtained for the task of identifying CST relations and give an idea of the state of the art. A previous experiment with balanced classes yielded better average values (0.716, 0.730, and 0.721 for precision, recall and F-measure, respectively). However, balancing the classes does not reflect the real distribution of CST relations, and all the results shown in the tables were obtained without balancing.

The binary classifiers obtained better results, as shown in Table 10. All of them had average results over 0.56. The six classifiers are applied in sequence, each deciding between one relation and all the others together (the class Other). When more than one classifier finds a CST relation, the result with the highest confidence is chosen; when all the classifiers indicate Other, we consider that the pair under analysis holds no CST relation. Considering, for example, the classifier for the relation Historical background, the average precision was 0.715: the class Historical background had 0.478 precision, while the class Other had 0.953. Taking the average results of all the classifiers together, we obtain 0.644, 0.595, and 0.608 for precision, recall and F-measure, respectively.

The results listed in Table 11 were obtained using part of the defined typology. This reduced typology contains only six relations (Identity, Equivalence, Overlap, Subsumption, Follow-up and Historical background), organized in the same way as the typology presented before. These classifiers are also binary, since each one decides between two branches of the hierarchy, but they differ from the binary classifiers of Table 10, which decide between one relation and all the others. The hierarchical classifiers are applied in a top-down fashion, as sketched below. All of them had average results over 0.56, as the binary classifiers did; however, the results here are better: two classifiers, C and E, had results over 0.72. These results attest the validity of the hierarchy created to group the CST relations.
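For illustration, consider the top-down traversal in the sketch below. It is a minimal rendering under our reading of Table 11 (classifiers A, B, C and E; Identity is handled separately by a rule): the node layout and all names are illustrative, and the classifiers are assumed to expose a scikit-learn-like predict method.

```python
# Each inner node of the relation hierarchy holds one binary classifier;
# a sentence pair descends from the root until a leaf CST relation remains.
# node -> (classifier id, left child, right child); non-node children are leaves.
TREE = {
    "root":       ("A", "complement", "redundancy"),
    "complement": ("E", "Follow-up", "Historical background"),
    "redundancy": ("B", "partial", "Equivalence"),
    "partial":    ("C", "Overlap", "Subsumption"),
}

def classify_top_down(features, classifiers, node="root"):
    clf_id, left, right = TREE[node]
    # Convention: class 0 selects the left branch, class 1 the right one.
    child = left if classifiers[clf_id].predict([features])[0] == 0 else right
    if child in TREE:                      # inner node: keep descending
        return classify_top_down(features, classifiers, node=child)
    return child                           # leaf: the predicted CST relation
```

For example, a pair routed to Redundancy by classifier A and then to Partial by classifier B is finally labeled Overlap or Subsumption by classifier C.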
Table 12
Results of the rules.

Rule                           Precision   Recall   F-measure
Indirect Speech/Attribution    0.528       0.632    0.575
Translation                    0.500       0.500    0.500
Contradiction                  0.272       0.176    0.214
Average                        0.433       0.436    0.430
Classifier A decides between the Complement and Redundancy branches with average precision and recall of 0.648. The classification results for the CST relations at the leaves of the hierarchy (classifiers B, C, D and E) are better than those of the multiclass and binary classifiers, although, in the top-down approach, each leaf decision naturally depends on the previous classifications. For example, to reach the relation Subsumption, a pair of sentences must be processed by classifiers A and C. Overall, the average results for the hierarchical classifiers together are 0.697, 0.658 and 0.671 for precision, recall and F-measure, respectively.

Table 12 presents the results for the rules. The rule that identifies the Translation relation obtained a 0.5 F-measure because this relation appears only twice in the corpus: one of the occurrences involves the translation of the word ‘‘Hezbolá’’ to ‘‘Hezbollah’’, which was not covered by the bilingual dictionary used, so only one instance was correctly identified. The Indirect Speech/Attribution rule obtained a better result than when these relations were identified by a multiclass classifier; in a previous experiment (whose results are not shown here), they obtained a null result due to the class imbalance of the corpus. The task of finding contradictions is hard, mainly because of the inferences needed to identify some of them; since the rule for the Contradiction relation searches only for numerical differences between the sentences, it achieves a low result.

Table 13 presents a general comparison of results for the task of CST parsing. This work treated more relations than all the other works; because of this, some average results are worse than those obtained by works that restricted the set of relations automatically detected. In this table, we included our results for the multiclass classifier for the six content relations (Overlap, Subsumption, Equivalence, Historical background, Follow-up, and Elaboration). Although they are not our best results (they were outperformed by the hierarchical classifiers), they allow us to directly measure the performance for each relation; our results for some relations are therefore expected to be even better than the ones reported in the table. As discussed before, our work obtained better results than the parser for English (Zhang & Radev, 2004) and treated more relations. It is worth noting that the Transition relation in the work of Miyabe et al. (2008) is equivalent to the Contradiction relation and was placed in the same line of the table. P stands for Precision, R for Recall and F for F-measure.

For a fairer comparison between the parser for Portuguese and the best and most complete parser for English (Zhang & Radev, 2004), the method used for English was adapted to Portuguese, using the same features and machine learning techniques. Some features (those involving semantic similarity supported by English repositories, such as WordNet) were computed by translating Portuguese words into English with the API of the WordReference1 dictionary and by using the WordNet::Similarity toolkit (Pedersen, Patwardhan, & Michelizzi, 2004). Training was performed with the AdaBoost algorithm (Freund & Schapire, 1996), as originally done by the authors for English. Table 14 shows the results achieved (Precision (P), Recall (R) and F-measure (F)) in its last columns.
To ease the comparison, the table also shows the results for Portuguese (again for the multiclass classifier, as before). The best results for boosting were achieved with decision trees (Quinlan, 1993) as base learners. One may see that our joint use of classifiers and symbolic rules achieved the highest average results for CST relation classification: a 0.467 F-measure for Portuguese against 0.374 for English. Our method better identified relations such as Contradiction and Equivalence; for others, such as the temporal relations Follow-up and Historical background, it performed worse. The relations Attribution and Indirect Speech, identified by a single rule for Portuguese (and, for this reason, shown in only one line of the table), obtained a 0.576 F-measure, while the corresponding classifiers in the adapted English method obtained 0.213 and 0.176 F-measures, respectively. The results for ‘‘No relation’’ are the same for both techniques because candidate pairs of sentences were selected with the same word overlap measure in both cases. In order to obtain a single overall result for our parser, the full CST parser (with the hierarchical classifiers and the rules) was applied to the corpus with ten-fold cross-validation and achieved a general accuracy of 68%. As a simple baseline, assigning the relation Overlap (the majority class) to every pair of sentences would yield a low accuracy of 16.7%. Most errors occurred with the less frequent relations in the case of the classifiers, and with the relation Contradiction in the case of the rules; these relations need to be better studied in the future.
1 http://www.wordreference.com/.
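In outline, the full parser combines the two strategies as in the sketch below: candidate pairs are selected by word overlap, content relations come from the hierarchical classifiers, and form relations come from the hand-crafted rules. This is a schematic composition of the components sketched earlier; all function names are placeholders, not the authors' code.

```python
def parse_pair(s1, s2, classifiers, threshold=0.12):
    relations = []
    # Content relations: only for pairs above the word-overlap threshold.
    if word_overlap(s1, s2) > threshold:
        features = extract_features(s1, s2)  # the 14 features described above
        relations.append(classify_top_down(features, classifiers))
    # Form relations: every pair also goes through the hand-crafted rules,
    # since a pair may hold both content and form relations.
    for rule in (identity_rule, translation_rule,
                 numerical_contradiction_rule, indirect_speech_or_attribution_rule):
        relation = rule(s1, s2)              # each rule returns a relation or None
        if relation:
            relations.append(relation)
    return relations or ["No relation"]
```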
Table 13
Comparison of results for CST parsing.

                              This work              Zhang and Radev (2004)   Miyabe et al. (2008)   Kumar et al. (2012)
Relation                      P     R     F          P     R     F            P     R     F          P     R     F
Contradiction                 0.273 0.177 0.214      –     –     –            0.431 0.486 0.456      –     –     –
Description                   –     –     –          0.333 0.107 0.162        –     –     –          0.470 0.620 0.540
Elaboration                   0.405 0.385 0.395      0.313 0.128 0.182        –     –     –          –     –     –
Equivalence                   0.378 0.359 0.368      0.500 0.320 0.390        0.950 0.627 0.755      –     –     –
Follow-up                     0.282 0.273 0.277      0.473 0.289 0.359        –     –     –          –     –     –
Historical background         0.299 0.260 0.278      –     –     –            –     –     –          –     –     –
Identity                      1.000 1.000 1.000      –     –     –            –     –     –          0.950 0.870 0.910
Indirect-Speech/Attribution   0.529 0.632 0.576      –     –     –            –     –     –          –     –     –
No relation                   0.773 0.527 0.627      0.888 0.961 0.923        –     –     –          0.760 0.320 0.450
Overlap                       0.441 0.478 0.458      0.526 0.294 0.377        –     –     –          0.630 0.610 0.620
Subsumption                   0.449 0.447 0.448      0.100 0.042 0.059        –     –     –          0.480 0.770 0.590
Translation                   0.500 0.500 0.500      –     –     –            –     –     –          –     –     –
Average                       0.484 0.458 0.467      0.447 0.306 0.350        0.690 0.556 0.606      0.658 0.638 0.622
Table 14
Comparison of results in the CSTNews corpus.

                              This work (multiclass      Zhang and Radev (2004) on CSTNews
                              classifier and rules)      (boosting with decision trees)
Relation                      P     R     F              P     R     F
Contradiction                 0.273 0.177 0.214          0.231 0.143 0.176
Elaboration                   0.405 0.385 0.395          0.460 0.475 0.468
Equivalence                   0.378 0.359 0.368          0.269 0.163 0.203
Follow-up                     0.282 0.273 0.277          0.548 0.540 0.544
Historical background         0.299 0.260 0.278          0.554 0.409 0.471
Identity                      1.000 1.000 1.000          0.737 0.859 0.793
Indirect-Speech               –     –     –              0.231 0.143 0.176
Attribution                   –     –     –              0.219 0.208 0.213
Indirect-Speech/Attribution   0.529 0.632 0.576          –     –     –
Overlap                       0.441 0.478 0.458          0.552 0.589 0.570
Subsumption                   0.449 0.447 0.448          0.494 0.513 0.503
Translation                   0.500 0.500 0.500          0.000 0.000 0.000
No relation                   0.773 0.527 0.627          0.773 0.527 0.627
Average                       0.484 0.458 0.467          0.390 0.367 0.374
During this work, we noticed that some relations require deeper knowledge of the texts. For example, to identify the Summary relation, it is necessary to know that one sentence is a summary of another, that is, that it presents the same information in a condensed form; this condensation may occur in many ways and is not trivial to detect. The Citation relation is similar to Attribution; the only difference is that Citation attributes the information in one sentence to another document in the cluster. As Citation did not occur in our corpus, this relation was not considered. For the Modality relation, there was only one example in the corpus, which made it impracticable to learn the relation automatically or to produce rules for it. The definition of the Modality relation states that the source/authorship of the information in one sentence is modalized (sometimes left undetermined, sometimes softened) in another sentence, and this is hard to identify. We conclude this paper in the next section.

5. Conclusions

The results obtained by the parser evidence the adequacy of the refinements in CST. The definition of a hierarchy, the restrictions for the annotation, and the formalization of the definition of each relation led to high agreement in the annotation and to a parser with good accuracy in the identification of CST relations. To the best of our knowledge, these are the best results obtained so far.

This work has two main contributions, one theoretical and one practical. The theoretical contribution is a refined, better defined and systematized CST, which may be applied to other languages. This theory is useful for understanding multi-document phenomena, and this understanding improves the performance of a discourse parser. On the practical side, one contribution is the creation of a reference corpus annotated with CST, which may be used to test other parsing strategies and even to subsidize new theoretical work. Another practical contribution is the CST parser itself. As presented, the methodology employed may be used for other languages.
It is only necessary to provide the features required by the classifiers and the eventual rules to be applied. With this parser, many applications requiring CST analysis may now be fully automated, for example, CST-based multi-document summarization systems. The treatment of the relations not identified here may be carried out with techniques specific to each relation; for example, to identify the Summary relation, measures such as ROUGE (Lin, 2004), used to evaluate summaries, may be applied. The use of unlabeled examples in some semi-supervised strategy may also be explored in the future to improve the accuracy of the predictions. Both the CST parser and the corpus are available online for free use by the research community.2

2 http://www.icmc.usp.br/pessoas/taspardo/sucinto/.

Acknowledgments

The authors are thankful to FAPESP, CAPES and CNPq for their financial support.

References

Afantenos, S. D. (2007). Reflections on the task of content determination in the context of multi-document summarization of evolving events. Paper presented at the recent advances in natural language processing conference, Borovets, Bulgaria.
Afantenos, S. D., Doura, I., Kapellou, E., & Karkaletsis, V. (2004). Exploiting cross-document relations for multi-document evolving summarization. In Proceedings of the 3rd Hellenic conference on artificial intelligence (pp. 410–419).
Aires, R. V. X., Aluísio, S. M., Kuhn, D. C. S., Andreeta, M. L. B., & Oliveira, O. N., Jr. (2000). Combining multiple classifiers to improve part of speech tagging: A case study for Brazilian Portuguese. In Proceedings of the Brazilian artificial intelligence symposium (pp. 20–22).
Aleixo, P., & Pardo, T. A. S. (2008). Finding related sentences in multiple documents for multidocument discourse parsing of Brazilian Portuguese texts. In Annals of the VI workshop of information technology and human language (pp. 26–28).
Allan, J. (1996). Automatic hypertext link typing. In Proceedings of the 7th conference on hypertext (pp. 42–52).
Beltrame, W. A. R., Cury, D., & Menezes, C. S. (2012). Fique Sabendo: A selective information dissemination system to support learning. In Annals of the 23rd Brazilian symposium on computers in education.
Cardoso, P. C. F., Maziero, E. G., Jorge, M. L. C., Seno, E. M. R., Di Felippo, A., Rino, L. H. M., et al. (2011). CSTNews – A discourse-annotated corpus for single and multi-document summarization of news texts in Brazilian Portuguese. In Proceedings of the 3rd RST Brazilian meeting (pp. 1–18).
Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249–254.
Dagan, I., Glickman, O., & Magnini, B. (2005). The PASCAL recognising textual entailment challenge. In Proceedings of the 1st PASCAL challenges workshop on RTE (pp. 1–8).
Freitas, A. A., & Carvalho, A. C. P. F. (2007). A tutorial on hierarchical classification with applications in bioinformatics. In D. Taniar (Ed.), Research and trends in data mining technologies and applications (pp. 175–208). USA: Idea Group Inc.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the thirteenth international conference on machine learning (pp. 148–156).
Green, S. J. (1999). Automatically generating hypertext. In Proceedings of the joint conferences on new methods in language processing and computational natural language learning (pp. 101–110).
Hirst, G., & St-Onge, D. (1998). Lexical chains as representation of context for the detection and correction of malapropisms. In C. Fellbaum (Ed.), WordNet: An electronic lexical database (pp. 305–332). USA: The MIT Press.
Hovy, E. H., & Lavid, J. M. (2010). Towards a ‘science’ of corpus annotation: A new methodological challenge for corpus linguistics. International Journal of Translation Studies, 22(1), 13–36.
Inam, S., Shoaib, M., Majeed, F., & Shaerjeel, M. I. (2012). Ontology based query reformulation using rhetorical relations. International Journal of Computer Sciences, 9(4), 261–268.
Jijkoun, V., & De Rijke, M. (2005). Recognizing textual entailment using lexical similarity. In Proceedings of the PASCAL first challenges workshop (pp. 73–76).
Jorge, M. L. C., & Pardo, T. A. S. (2010). Experiments with CST-based multidocument summarization. In Proceedings of the ACL workshop TextGraphs-5: Graph-based methods for NLP (pp. 74–82).
Kato, T., Matsushita, M., & Kando, N. (2005). MuST: A workshop on multimodal summarization for trend information. In Proceedings of the NTCIR-5 workshop meeting (pp. 556–563).
Kumar, J. Y., Salim, N., & Raza, B. (2012). Automatic identification of cross-document structural relationships. In Proceedings of the information retrieval & knowledge management international conference (pp. 26–29).
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL workshop on text summarization branches out (pp. 74–81).
Lin, D. (1998). An information-theoretic definition of similarity. In Proceedings of the international conference on machine learning (pp. 296–304).
MacCartney, B., Grenager, T., De Marneffe, M.-C., Cer, D., & Manning, C. D. (2006). Learning to recognize features of valid textual entailments. In Proceedings of the conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 41–48).
Mann, W. C., & Thompson, S. A. (1987). Rhetorical structure theory: A framework for the analysis of texts. Information Sciences Institute (ISI Reprint Series ISI/RS-87-190).
Marcu, D. (1997). The rhetorical parsing of natural language texts. In Proceedings of the 35th annual meeting of ACL-EACL (pp. 96–103).
Marsi, E., & Krahmer, E. (2005). Classification of semantic relations by humans and machines. In Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment (pp. 1–6).
Maziero, E. G., Jorge, M. L. C., & Pardo, T. A. S. (2010). Identifying multidocument relations. In Proceedings of the 7th international workshop NLPCS (pp. 60–69).
Maziero, E. G., Pardo, T. A. S., Di Fellipo, A., & Dias da Silva, B. C. (2008). The lexical database and web interface of TeP 2.0 – An electronic thesaurus for Brazilian Portuguese. In Annals of the VI workshop on information technology and human language (pp. 390–392).
McKeown, K., & Radev, D. R. (1995). Generating summaries of multiple news articles. In Proceedings of the 18th annual international ACM-SIGIR conference on research and development (pp. 74–82).
McKeown, K., & Radev, D. R. (1998). Generating natural language summaries from multiple on-line sources. Computational Linguistics – Special Issue on NLG, 24(3), 469–550.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.
Miyabe, Y., Takamura, H., & Okumura, M. (2008). Identifying a cross-document relation between sentences. In Proceedings of the 3rd IJCNLP (pp. 141–148).
Murakami, K., Nichols, E., Mizuno, J., Watanabe, Y., Goto, H., Ohki, M., et al. (2010). Automatic classification of semantic relations between facts and opinions. In Proceedings of the 2nd workshop on NLP challenges in the information explosion era (NLPIX) (pp. 21–30).
Ohki, M., Nichols, E., Matsuyoshi, S., Murakami, K., Mizuno, J., Masuda, S., et al. (2011). Recognizing confinement in web texts. In Proceedings of the 9th IWCS (pp. 215–224).
Okumura, M., Fukushima, T., & Nanba, H. (2003). Text summarization challenge 2 – Text summarization evaluation at NTCIR workshop 3. In Proceedings of the HLT-NAACL 2003 workshop on text summarization (DUC03) (pp. 49–56).
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the ACL (pp. 311–318).
Pardo, T. A. S. (2006). SENTER: An automatic sentence segmenter for Brazilian Portuguese. Technical report NILC-TR-06-01, NILC, São Carlos, SP, Brazil.
Pedersen, T., Patwardhan, S., & Michelizzi, J. (2004). WordNet::Similarity: Measuring the relatedness of concepts. In Demonstration papers at HLT-NAACL 2004 (pp. 38–41).
Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.
Radev, D. R. (2000). A common theory of information fusion from multiple text sources, step one: Cross-document structure. In Proceedings of the 1st ACL SIGDIAL workshop on discourse and dialogue (pp. 74–83).
Radev, D. R., Otterbacher, J., & Zhang, Z. (2004). CSTBank: A corpus for the study of cross-document structural relationships. In Proceedings of the 4th LREC.
Rios, M., & Gelbukh, A. (2012). Recognizing textual entailment with a semantic edit distance metric. In Proceedings of the 11th MICAI (pp. 15–20).
Salton, G., & Lesk, M. E. (1968). Computer evaluation of indexing and text processing. Journal of the ACM, 15(1), 8–36.
Trigg, R. (1983). A network-based approach to text handling for the online scientific community. Technical report TR-1346, University of Maryland, Maryland.
Trigg, R., & Weiser, M. (1986). TEXTNET: A network-based approach to text handling. ACM Transactions on Office Information Systems, 6(1), 1–23.
Zahri, N., & Fukumoto, F. (2011). Multi-document summarization using link analysis based on rhetorical relations between sentences. CICLing, Lecture Notes in Computer Science, 2, 328–338.
Zhang, Z., Blair-Goldensohn, S., & Radev, D. R. (2002). Towards CST-enhanced summarization. In Proceedings of the 18th national conference on artificial intelligence (AAAI-2002), Edmonton, Canada.
Zhang, Z., Otterbacher, J., & Radev, D. R. (2003). Learning cross-document structural relationships using boosting. In Proceedings of the 12th ICIKM (pp. 124–130).
Zhang, Z., & Radev, D. R. (2004). Combining labeled and unlabeled data for learning cross-document structural relationships. In Proceedings of IJCNLP (pp. 32–41).