Information Systems 36 (2011) 406–430
Knowledge-based sense disambiguation (almost) for all structures
Federica Mandreoli, Riccardo Martoglia
DII - University of Modena e Reggio Emilia, Via Vignolese 905/b, 41125 Modena, Italy
Abstract
Article history: Received 10 August 2009; received in revised form 4 August 2010; accepted 23 August 2010. Recommended by F. Naumann.
Structural disambiguation is acknowledged as a very real and frequent problem for many semantic-aware applications. In this paper, we propose a unified answer to sense disambiguation on a large variety of structures both at data and metadata level such as relational schemas, XML data and schemas, taxonomies, and ontologies. Our knowledge-based approach achieves a general applicability by converting the input structures into a common format and by allowing users to tailor the extraction of the context to the specific application needs and structure characteristics. Flexibility is ensured by supporting the combination of different disambiguation methods together with different information extracted from different sources of knowledge. Further, we support both assisted and completely automatic semantic annotation tasks, while several novel feedback techniques allow us to improve the initial disambiguation results without necessarily requiring user intervention. An extensive evaluation of the obtained results shows the good effectiveness of the proposed solutions on a large variety of structure-based information and disambiguation requirements. © 2010 Elsevier B.V. All rights reserved.
Keywords: Semantic web; Structure-based information; Word sense disambiguation
1. Introduction
The ever-increasing need to publish and exchange data on Web spaces, together with the recent growth of the Semantic Web, has favored the diffusion of a wide variety of online data structures that semantically describe their contents through meaningful markup terms and relationships. We refer to relational database schemas such as those used in biological applications [1], XML1 documents where arbitrary elements and structural relationships describe what data are, RDF2 graphs which represent information about things that can be identified on the Web, OWL3 ontologies which formally describe the meaning of terminology used in Web documents, as well
Corresponding author.
E-mail addresses: [email protected] (F. Mandreoli), [email protected] (R. Martoglia).
1 http://www.w3.org/XML/
2 http://www.w3.org/RDF/
3 http://www.w3.org/TR/owl2-overview/
doi:10.1016/j.is.2010.08.004
as taxonomies and directory trees used by search engines and online market places to classify their data. There is general agreement that annotating such online data structures with machine-interpretable semantics would allow the development of much smarter applications for final users. In line with this view, many of the metadata-intensive applications dealing with this kind of structures actually ground their effectiveness on unambiguous markup lexicons. Moreover, a new generation of Semantic Web applications is emerging [2], focused on exploiting the semantic metadata available on the Web. The exact meaning of the involved terms is used, for instance, for the automatic classification of heterogeneous XML data [3] and ontologies [4], for XML query expansion [5], for Web crawling [6], for ontology matching [7,8] and clustering [9], and to enhance enterprise search and interoperability [10,2] on document collections and business processes. On the other hand, markup lexicons are intrinsically ambiguous, that is, many of the terms used in these structures can be interpreted in multiple ways depending on the context in which they occur. Consider, for instance,
[Fig. 1, redrawn as text: (a) the PurchaseOrder relational schema with tables PurchaseOrder (PK PurchaseOrderId; Date, Customer, TaxRate, Discount, InvoiceAmount), Product (PK ProductId; Name, Producer, Line, UnitCost) and OrderDetail (PK OrderDetailId; FK1 PurchaseOrderId, FK2 ProductId; Quantity); (b) a DublinCore standard portion with the elements identifier, title, creator, subject, series and format.]
Fig. 1. A small example showing the need for disambiguation in a schema matching context.
the small example shown in Fig. 1 involving DigiSell, a fictitious online digital media store selling e-books, songs and videos. Since the beginning, the store has managed basic product data (name, producer, line of product) and economic/accounting information (such as product cost, order and invoice details) through a SQL DDL relational schema called PurchaseOrder for short (Fig. 1a). DigiSell is now interested in enhancing the digital media description available on their website, by including, for instance, additional information such as the subject of a book or video, or the format of the media, while also performing validity checks on the already available data. Useful metadata about media products are already available as XML documents compliant with the Dublin Core4 standard for resource description and cataloging (Fig. 1b shows a portion we call DublinCore). In this case, matching techniques are needed in order to ‘‘link’’ the two schemas and derive the correct correspondences between them; these techniques should be automatic (remember that complete real-life examples often involve significantly larger structures), and should go beyond purely syntactical information in order to be effective. Let us examine the terminology of the two fragments in more detail to see why. PurchaseOrder (Fig. 1a) is made up of three tables, PurchaseOrder, Product, and OrderDetail, each containing a list of table attributes. Prefixes PK and FK are used to distinguish primary and foreign keys, respectively, and each foreign key is associated with an arrow from the referencing table to the referenced one.
With reference to the WordNet [11] lexical knowledge base, one of the most widely used external knowledge sources for term senses, the term line can take on a very large variety of meanings (29 senses), ranging from ‘‘a formation of people or things one beside another’’, as in ‘‘line of soldiers’’, to ‘‘text consisting of a row of words’’, from ‘‘cable, transmission line’’ to ‘‘line of products, line of merchandise’’, which happens to be the right sense in our case. Further, an order could be a ‘‘purchase order’’ but also a ‘‘command given by a superior’’ or a ‘‘logical arrangement of elements’’ (13 different senses), a product could imply ‘‘commodities offered for sale’’ but also a ‘‘result of a chemical reaction’’ or a ‘‘quantity obtained by multiplication’’ (six different senses), and so on. The same applies to the terms of DublinCore shown in Fig. 1b: 10 possible meanings for term title, seven meanings for series, and so on. So, as we can see, the involved terms convey a clear meaning to humans but convey only a small fraction (if any) of their meaning to machines [12]. This is obviously a problem: though different in the adopted terminology, the two structures present many conceptual correspondences (such as name and title, producer and creator, line and series) which would be completely missed by merely exploiting syntactical matching techniques. Neither would it be possible to consider term synonyms, since these vary with the particular meaning each term is used in: for instance, considering the term formation as a synonym of line could have very undesired consequences. Instead, as noted in [7,13], the right meaning of each term w.r.t. a reference thesaurus such as WordNet would allow, for instance, to compute semantic similarities between the terms and to eventually discover the right correspondences. The issue of providing explicit semantics for already existing structure-based information is indeed the common denominator of all those applications relying on machine-interpretable markup lexicons. This problem is mostly independent of the specific application goals, while the effectiveness of the proposed solutions greatly influences the application performance. We therefore argue that a key issue in effectively filling the semantic annotation needs of the current and future generations of applications is to develop robust and generic disambiguation methods.
4 http://dublincore.org/
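To see concretely what sense-aware matching buys over string matching, consider the following toy Python sketch. The mini sense inventory and all identifiers are hypothetical, standing in for a disambiguation step against a thesaurus such as WordNet; this is an illustration, not code from any of the cited systems.

```python
# Toy illustration: why purely syntactic matching fails on the
# PurchaseOrder/DublinCore example. Each term is mapped to the identifier
# of its intended (already disambiguated) sense.
sense_of = {
    "line":   "product-line",   # "line of products, line of merchandise"
    "series": "product-line",   # same concept, different word
    "name":   "label",
    "title":  "label",
}

def syntactic_match(a: str, b: str) -> bool:
    """String equality: the only signal a syntactic matcher has."""
    return a == b

def semantic_match(a: str, b: str) -> bool:
    """Terms correspond when their disambiguated senses coincide."""
    return sense_of.get(a) == sense_of.get(b)

print(syntactic_match("line", "series"))  # False: no lexical overlap
print(semantic_match("line", "series"))   # True: same sense, so a match
```

Once each term carries its sense, correspondences such as line/series and name/title fall out of a simple sense comparison, which is exactly what string matching cannot provide.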
However, very few application-independent solutions exist [12,10], while great efforts have been spent to produce partial solutions in
specific application contexts where the main focus was on how to use semantic annotations rather than on how to effectively disambiguate the involved structures [3–9]. On the other hand, word sense disambiguation (WSD), i.e. the problem of computationally determining which ‘‘sense’’ of a word is activated by the use of the word in a particular context, is a core research topic in the computational linguistics field [14]. Long-standing research interest in WSD has produced, over the last decades, different approaches which show good performance on plain text [14,15]. In this paper, we start from the lessons learnt in the WSD field [15,14] and propose a unified answer to the sense disambiguation problems on a large variety of structures, both at the data and metadata level, such as relational schemas, XML data and schemas, taxonomies, and ontologies. The practical implication of our proposal is that we put STRIDER,5 a structural disambiguation service, at applications' disposal, which can be tailored to the specific requirements and structure characteristics so as to effectively support the disambiguation needs of almost all applications. In the WSD field, the representation of a word in context, together with additional knowledge resources, is the main support for allowing automatic methods to choose the appropriate sense [15]. One of the most critical issues when manipulating structure-based information is that the effectiveness of the context extracted from the structure may depend on the specific application needs and structure characteristics. For instance, structures on specific topics, such as the DBLP bibliographic archive schema, usually benefit from ‘‘broad’’ contexts, containing almost all terms in the schema.
On the other hand, the best strategy to disambiguate a given category in an online marketplace, such as the eBay category ‘‘string’’, meant as musical instruments, is to select only the nearby categories in the taxonomy, whereas including distant ones as well, such as ‘‘women's clothing’’, introduces noise, thus leading to wrong results. The sense disambiguation approach we propose aims at overcoming all the above-mentioned problems in the following way:
- it achieves homogeneity in the disambiguation phase by converting the input structure into a common format that follows the open information model (OIM) [16] and that makes the addition of new structure types easy;
- it introduces the notion of crossing setting as a means to tailor the extraction of the context to the specific application needs and structure characteristics. In particular, a crossing setting allows us to define an abstraction of the neighborhood of each term to be disambiguated, so as to extract context information from correlated terms;
- it achieves flexibility by supporting the combination of different disambiguation methods founded on WSD principles together with different information extracted from the two main sources of knowledge, i.e. a thesaurus and a corpus. For instance, it can exploit the sense definitions and usage examples that are often available in a thesaurus or the term frequencies in the corpus, and it integrates the most frequent sense heuristic for word meaning prediction [14];
- it produces a ranking of the plausible senses of each term in the input structure. In this way, it supports both assisted and completely automatic semantic annotation. In the former case, the disambiguation task is committed to a human expert and the disambiguation service assists him/her by providing useful suggestions; in the latter case, there is no human intervention and the selected sense can be the top one;
- it can improve the disambiguation results by means of several feedback techniques which do not necessarily require user intervention.
5 STRucture-based Information Disambiguation ExpeRt.
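The neighborhood abstraction behind the crossing setting can be sketched as a bounded breadth-first collection over a term graph. The radius-based formulation and all names below are our own simplification for illustration, not STRIDER's actual interface.

```python
# A minimal sketch of context extraction over a term graph: for each term,
# collect the neighboring terms up to a configurable distance (a toy
# stand-in for the crossing-setting abstraction).
from collections import deque

def graph_context(graph, term, radius=1):
    """Breadth-first collection of terms within `radius` hops of `term`."""
    seen, frontier, context = {term}, deque([(term, 0)]), []
    while frontier:
        node, dist = frontier.popleft()
        if dist >= radius:
            continue
        for neigh in graph.get(node, []):
            if neigh not in seen:
                seen.add(neigh)
                context.append(neigh)
                frontier.append((neigh, dist + 1))
    return context

# Toy neighborhood of "line" in the PurchaseOrder schema:
schema = {
    "line": ["product"],
    "product": ["line", "name", "producer", "order detail"],
    "order detail": ["product", "quantity"],
}
print(graph_context(schema, "line", radius=1))  # ['product']
print(graph_context(schema, "line", radius=2))
```

Varying the radius mimics the choice between narrow and broad contexts discussed above: a small radius keeps only the closest schema terms, while a larger one pulls in the rest of the structure.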
The proposed approach has been implemented in a new and completely redesigned 2.0 version of our STRIDER system [17,18]. A web version of STRIDER is also available online.6 Since no official benchmarks are available for evaluating structure-based information disambiguation systems, we also propose an evaluation method to thoroughly assess disambiguation effectiveness at different levels of quality and for structures presenting different specialization levels. An extensive evaluation of STRIDER shows the good effectiveness of the proposed solutions on a large variety of structure-based information and disambiguation requirements. The rest of the paper is organized as follows: after a discussion of related works (Section 2) and an overview of our disambiguation approach (Section 3), Section 4 discusses our context extraction method, while the actual disambiguation algorithms are presented in Section 5. An in-depth experimental evaluation is provided in Section 6, while Section 7 concludes the paper and outlines directions for future work.
2. Literature review
Our structural disambiguation approach draws upon some of the ideas and techniques that have been proposed in the well-established free text disambiguation field. In this section we will first discuss the text disambiguation fundamentals on which we founded a large part of our work (Section 2.1); in this case, we will mainly emphasize the ideas and techniques which we think can be most useful in the new context. Then, we will discuss the systems and techniques actually offering structural disambiguation features (Section 2.2) and briefly compare our approach to them. As we will see, only a few approaches are available, and most of them are indeed quite limited in features and designed for very specific applications, such as schema matching or web documents' annotation.
6 http://www.strider.unimo.it/
2.1. A brief account of free text WSD
The free text disambiguation field is the most ‘‘classic’’ and well-studied field of WSD. Most of the methods used to disambiguate free text can be adapted to the structural WSD problem, and an understanding of the way they work is fundamental in order to successfully devise and evaluate structural disambiguation mechanisms. In this section we provide a brief account of free text WSD. Interested readers can refer to [14,15] for detailed discussions on this topic. WSD is the ability to computationally determine which sense of a word is activated by its use in a particular context. The possible senses of a word are usually drawn from a sense inventory, and the representation of the word in context together with the use of external knowledge resources are the main supports for allowing automatic methods to select the appropriate sense. Approaches to WSD are often classified according to the main source of knowledge used in sense differentiation. Methods that rely primarily on dictionaries, thesauri, and lexical knowledge bases are termed knowledge-based, as opposed to methods that work directly from corpora, which are named corpus-based. In this paper, we will mainly investigate knowledge-based methods, which share most similarities with ours, while corpus-based proposals will be discussed only briefly. Knowledge-based methods are generally applicable in conjunction with any lexical knowledge base that defines word senses (and relations among them). However, WordNet is without doubt one of the most widely used external knowledge sources in this context. A simple and intuitive knowledge-based approach is the gloss overlap approach originally proposed by Lesk in [19]. It essentially assumes that the correct senses of the target words are those whose glosses have the highest overlap (i.e. words in common).
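Lesk's gloss-overlap idea can be sketched in a few lines of Python. The glosses below are paraphrased toy entries of our own, not actual WordNet text, and the sense identifiers are hypothetical.

```python
# A compact sketch of the gloss-overlap heuristic: pick the sense whose
# gloss shares the most words with the surrounding context.
GLOSSES = {
    "line": {
        "formation": "a formation of people or things one beside another",
        "product-line": "a particular kind of product or merchandise offered for sale",
        "cable": "a cable or wire for transmission of electricity or signals",
    }
}

def lesk(term, context_words):
    """Return the sense of `term` whose gloss overlaps `context_words` most."""
    context = set(context_words)
    def overlap(sense):
        return len(set(GLOSSES[term][sense].split()) & context)
    return max(GLOSSES[term], key=overlap)

# Context taken from the surrounding schema terms:
print(lesk("line", ["product", "producer", "cost", "sale"]))  # product-line
```

The example also exposes the weakness discussed next: the result hinges entirely on the exact words chosen for each gloss.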
Unfortunately, while this approach can in some cases achieve satisfactory accuracy, it is very sensitive to the exact wording of definitions. The revision presented in [20] corrects this problem by expanding the glosses with the glosses of related concepts; however, this still does not lead to state-of-the-art performance compared to other knowledge-based proposals [15]. The structure of the concept network made available by computational lexicons like WordNet is instead exploited by structural approaches. Among those, graph-based approaches are very common and they often rely on the notion of lexical chain [21–24]; a lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse [25]. The above-cited papers essentially find in the concept network the available lexical chains among all possible senses, and the semantically strongest chains are selected for disambiguation purposes. Other graph-based approaches, instead, rely on graph theory, link analysis and social network analysis; for instance, [26] proposes a variety of measures analyzing the connectivity of the graph structures and identifying the most relevant word senses, while [27] applies PageRank-style algorithms to the graphs extracted from natural language documents. While graph-based approaches, and in particular those
based on lexical chains, are ‘‘global’’ approaches, i.e. they disambiguate the whole document and all its words at the same time, similarity-based methods are usually applied in local contexts [15] and can thus be adapted more flexibly to a structural disambiguation setting, where each term can have a very specific and selected context on the basis of the structure characteristics. Similarity-based methods rely on the principle that words that share a common context are usually closely related in meaning. Therefore the appropriate senses can be selected by choosing those meanings found within the smallest semantic distance. To this end, semantic similarity measures have been proposed, for instance, in [28–32] and experimentally compared in [33] for different tasks. Among those, as we will see in Section 5.1, STRIDER exploits a thesaurus-based similarity directly derived from one of the best known and most effective, the Leacock–Chodorow measure [29]. Recent research works have proposed term similarities based on term co-occurrence information; in this case the assumption is that the closeness of the words in text is indicative of some kind of relationship between them [34]. While these methods require very large textual corpora in order to show acceptable effectiveness in disambiguation applications, they are becoming more and more studied and widespread since they can easily exploit the enormous amount of text information available from the web [35]. For instance, in [34] the authors test several ‘‘standard’’ term co-occurrence computation formulas for WSD, where the frequency statistics are derived from large corpora extracted with web mining techniques. In particular, the pointwise mutual information (PMI) formula appears to be one of the most effective. This result is confirmed by the studies in [36].
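The two similarity families just mentioned can both be sketched compactly. The formulas follow the cited definitions, but the tiny taxonomy and the co-occurrence counts below are invented toy data of our own, standing in for WordNet and for web-scale frequency statistics.

```python
import math

# 1) A thesaurus-based similarity in the spirit of Leacock-Chodorow [29]:
#      sim(s1, s2) = -log( path_len(s1, s2) / (2 * D) )
#    over a tiny hypothetical hypernym taxonomy of maximum depth D.
PARENT = {                       # child sense -> hypernym sense
    "product-line": "merchandise",
    "series":       "merchandise",
    "merchandise":  "artifact",
    "formation":    "arrangement",
    "arrangement":  "artifact",
}
DEPTH = 3                        # maximum depth of this toy taxonomy

def path_to_root(sense):
    path = [sense]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lch_similarity(s1, s2):
    p1, p2 = path_to_root(s1), path_to_root(s2)
    common = next(a for a in p1 if a in p2)          # lowest common hypernym
    length = p1.index(common) + p2.index(common)     # path length in edges
    return -math.log(max(length, 1) / (2.0 * DEPTH))

# 2) A corpus-based similarity via pointwise mutual information [34]:
#      PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
#    with invented counts standing in for statistics from a large corpus.
N = 1_000_000                    # total number of (toy) corpus windows
COUNT = {"line": 5000, "product": 8000, "soldier": 3000}
COOC = {("line", "product"): 400, ("line", "soldier"): 30}

def pmi(x, y):
    p_xy = (COOC.get((x, y)) or COOC.get((y, x)) or 0) / N
    p_x, p_y = COUNT[x] / N, COUNT[y] / N
    return math.log(p_xy / (p_x * p_y)) if p_xy else float("-inf")

# Senses sharing a close hypernym score higher; terms co-occurring more
# often than chance get a higher PMI:
print(lch_similarity("product-line", "series") >
      lch_similarity("product-line", "formation"))    # True
print(pmi("line", "product") > pmi("line", "soldier"))  # True
```

Both measures reduce disambiguation to a ranking problem: among the candidate senses of a term, prefer those closest (by either metric) to the senses or words of its context.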
In a recent study [37], the authors propose more complex term-similarity techniques based on the computation of the cosine similarity between vectors in a vector space model; they also demonstrate that similarities based on term co-occurrence information can perform as well as knowledge-based methods. Further, an original approach, proposed in [38], defines a new term similarity measure, named ‘‘Google Distance’’, based only on the page-count statistics gathered by the Google search engine; the tests the authors perform show good performance figures w.r.t. standard WordNet approaches. Some approaches even try to exploit information extracted from Wikipedia [39]. Since these newest approaches are significantly more complex than ‘‘standard’’ approaches, even if their effectiveness is not always significantly better, STRIDER's corpus-based similarity exploits the well-known and effective PMI approach (see Section 5.1). Differently from knowledge-driven proposals, corpus-based approaches solve word ambiguity by exploiting training texts. In particular, (semi)supervised methods [40] make use of annotated corpora to train from, or as seed data in a bootstrapping process, whereas unsupervised methods [41] work directly from raw unannotated corpora by clustering word occurrences and then classifying new occurrences into the induced clusters/senses. Comparing and evaluating different WSD systems is extremely difficult because of the different test sets,
sense inventories, and knowledge sources adopted [15]. Senseval (now renamed Semeval) [42,43] is an international word sense disambiguation competition whose objective is to compare WSD systems in vitro, that is, as if they were stand-alone, independent applications, and in vivo, i.e. as modules embedded in applications. The in vitro evaluation usually relies on measures borrowed from the information retrieval field (coverage, precision, recall, F1, etc.). While the performance of supervised methods usually exceeds that of the other approaches, they are applicable only to those words for which the system has been trained; further, relying on the availability of large training corpora for different domains, languages, and tasks is not a realistic assumption (see [44,45] for an estimation of the amount of sense-tagged text and human effort required to construct such a corpus). On the other hand, knowledge-based methods seem to be the most promising in the short-to-medium term, mainly because they have the advantage of a larger coverage. Indeed, they use knowledge resources which are increasingly enriched, and it has been shown that the more knowledge is available, the better their performance [26,46]. Moreover, applications in the Semantic Web need knowledge-rich methods which can exploit the potential of domain ontologies and enable semantic interoperability between users, enterprises, and systems [15].
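The in vitro measures borrowed from information retrieval can be made concrete with a small worked example; all sense labels and predictions below are invented purely for illustration.

```python
# Coverage, precision, recall and F1 for a single disambiguation run over
# a set of terms with gold-standard senses (toy data).
gold      = {"line": "s1", "order": "s2", "product": "s3", "title": "s4"}
predicted = {"line": "s1", "order": "s9", "product": "s3"}  # "title" skipped

answered = len(predicted)
correct  = sum(1 for t, s in predicted.items() if gold.get(t) == s)

coverage  = answered / len(gold)    # fraction of terms the system attempted
precision = correct / answered      # correct answers among those attempted
recall    = correct / len(gold)     # correct answers among all terms
f1 = 2 * precision * recall / (precision + recall)

print(round(coverage, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.75 0.667 0.5 0.571
```

Note how coverage and precision pull against each other: a system that skips hard terms can keep precision high at the cost of coverage and recall, which is why all four figures are reported together.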
2.2. Structural disambiguation approaches and applications
Structural disambiguation is acknowledged as a very real and frequent problem for many semantic-aware applications. However, with a few exceptions which will be discussed later, up to now it has only been partially considered in very specific applications, such as schema matching [47,8], automated construction of domain-specific ontologies [48], XML/ontology clustering [9,3,49], or the semantic annotation of web pages [10]. In all these cases, disambiguation is usually not the main issue discussed by the authors, since their focus is rather on how to use semantic annotations than on how to produce them in a general context. Therefore, the proposed solutions are only partial, designed to be effective in the very specific scenarios considered; further, experimental evaluation is limited to the benefits of disambiguation in those specific scenarios, while no in-depth evaluation of the effectiveness of the disambiguation techniques themselves is performed. On the other hand, our main aim has been specifically to develop and extensively test a stand-alone and robust method for automatically producing such annotations for different kinds of structures, thus generalizing other partial structural disambiguation solutions and satisfying the needs of a very large variety of current and future applications. In the following we will first give an overview of some of the application-specific structural disambiguation approaches, then we will focus on the few general-purpose structural disambiguation methods that have actually been presented. In many schema matching applications, the closeness between nodes relies on semantic information; for instance,
[8] employs context-sensitive strategies and community information, such as user ratings and comments, for ontology matching (RDF-S and OWL only), while a good number of statistical WSD approaches have also been proposed (e.g. [47]). However, they have limited applicability (e.g. [47] only applies to relational data) and rely on data (training data, etc.) which may not always be available. In a different context, [48] presents a system aimed at the construction of domain-specific ontologies; in this specific application, the disambiguation algorithm is tailored to the problem of finding the correct sense for ontology candidate terms, so as to append subtrees of concepts under the appropriate ontology node, and is based on a semantic interconnection measure which is designed ad hoc to work with WordNet. As to data clustering scenarios, in [3] the authors propose a technique for XML data clustering, where disambiguation is performed on the documents' tag names. The local context of a tag is captured as a bag of words containing the tag name itself, the textual content of the element and the text of the subordinate elements, and is then enlarged by including related terms retrieved with WordNet. This context is then compared to those associated with the different WordNet senses of the term to be disambiguated by means of standard vector model techniques. The technique appears convincing, but is only applicable to XML documents; further, the bag-of-words approach provides only a very rough similarity estimation, which would not be applicable to other finer-grained applications, such as schema matching and the semantic web. In a semantic web setting, [9] discusses a method to cluster terms (meanings); the exploited disambiguation technique is derived from a previous approach for disambiguating keywords in user queries [50]. The context of each meaning is extracted from OWL ontologies and is then compared to other contexts by means of the NGD Google co-occurrence similarity [38].
In this case, the meaning extraction appears applicable only to OWL graphs, and no other formats, such as relational or tree schemas, are discussed or supported. Closely related to structural disambiguation, a large number of semantic annotation approaches, providing annotation w.r.t. a reference ontology describing the domain of interest, are available in the literature [10]. Differently from proper general-purpose structural disambiguation approaches, such techniques annotate whole documents and do not allow finer annotation granularities, as required by structural disambiguation tasks. Further, while many frameworks and tools providing annotation facilities are currently available, such as Mangrove [51], the process of annotation is in many cases manual, since most tools limit themselves to assisting the user in selecting and inserting annotations. On the other hand, while fully automated tools for annotating web documents are still missing, some semi-automatic systems exist which assist the user by suggesting both the subject of annotation and its annotation from a certain knowledge source. For instance, [52] and [53] are based on statistical and supervised learning approaches, while Armadillo [54] and KnowItAll [55] have automatic annotation features based on unsupervised techniques. The former systems require substantial work and frequent user intervention in the training phase, while the latter do not need explicit user supervision but suffer from limited accuracy. All annotation systems support HTML
documents, while providing support for the semantic annotation of other formats (such as relational schemas, XML schemas, ontologies) is a much less studied problem [10]. Finally, as we said, very few structural disambiguation approaches have been presented as independent applications and evaluated in vitro in the literature. Even if quite promising, these methods often lack general-purpose applicability and flexibility in some respects. Bouquet et al. [12] present a node disambiguation technique exploiting the hierarchical structure of a schema tree together with WordNet hierarchies. However, in order for this approach to be fully effective, the schema relations have to coincide, at least partially, with those of WordNet. In [17,18] we presented a first release of the STRIDER system. The underlying approach laid the first bases for extracting and exploiting structural context information in the disambiguation process, including relational information between the nodes. However, that approach is still very WordNet-dependent, since it only considers WordNet-related term-similarity computation methods, and is optimized and tested on tree-shaped XML schemas only, thus not providing a uniform way to handle different structural formats. The approach presented in [4] allows the disambiguation of ontologies by exploiting an ad-hoc context extraction which is not generalized and is specifically tailored to RDF/OWL data (XML, relational and other formats are not supported), while the way the context similarities are computed depends heavily on WordNet relations. On the other hand, STRIDER's pre-processing and context extraction phases (see Section 4) and disambiguation algorithms (see Section 5) are completely orthogonal to the specific knowledge sources employed and the type of structures considered (relational, tree models, graph models), thus effectively working in a much wider array of contexts.
3. Our structural disambiguation approach: an overview
In this section we present the functional architecture of the STRIDER disambiguation system and outline the information flow involved in the disambiguation process (see Fig. 2). STRIDER can be used for disambiguating diverse data structures, i.e. structures used to describe data instances, called models [56]. The linguistic and semantic phenomena used to resolve word sense ambiguity are provided by two knowledge sources: a thesaurus, which is also used as sense inventory, and a corpus. The techniques STRIDER is founded on are generally applicable in conjunction with any thesaurus that defines word senses and a hypernymy relation among senses, and with any corpus where the relative term frequencies approximate their actual use in society. In practice, in its implementation STRIDER uses WordNet as its thesaurus because, given its widespread diffusion within the research community, it can be considered a de facto standard for English WSD. Further, since its senses are ordered w.r.t. their frequency of use, WordNet also allows us to easily integrate the most frequent sense heuristic for word meaning prediction. Moreover, the corpus used is the WWW, because it is the largest corpus on earth and the information
entered by millions of independent users averages out to provide automatic semantics of useful quality [38]. The process follows three main steps:
- First, STRIDER translates the input model into a labeled, directed multi-graph complying with the OIM-RDF meta-model [16] (pre-processing phase, Section 4.1);
- then, in the context extraction phase (Section 4.2), the list of polysemous terms to be disambiguated is extracted from the nodes of the graph. Each term is assigned context information, which includes both information contained within the graph (graph context) and information provided by the thesaurus (sense context);
- finally, disambiguation (Section 5) is performed through a similarity-based approach where similarities are computed by exploiting either the hypernymy relation provided by the thesaurus or word frequencies in the considered corpus.
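The three-step flow just described can be rendered as a minimal pipeline skeleton. This is our own sketch of the data flow, not STRIDER's actual API; every function and name below is a hypothetical placeholder.

```python
# A toy end-to-end pipeline: model -> triples -> contexts -> sense ranking.

def pre_process(model_tables):
    """Step 1: translate the input model into labeled (s, p, o) triples."""
    return [("Product", "column", col) for col in model_tables["Product"]]

def extract_context(triples):
    """Step 2: pair each polysemous term with the terms surrounding it."""
    terms = [o for (_, p, o) in triples if p == "column"]
    return {t: [u for u in terms if u != t] for t in terms}

def disambiguate(contexts, sense_scores):
    """Step 3: rank the plausible senses of each term, best first."""
    return {t: sorted(sense_scores.get(t, []), key=lambda s: -s[1])
            for t in contexts}

model  = {"Product": ["Name", "Producer", "Line"]}
scores = {"Line": [("formation", 0.2), ("product-line", 0.8)]}  # invented
ranking = disambiguate(extract_context(pre_process(model)), scores)
print(ranking["Line"][0][0])  # product-line
```

The point of the skeleton is the interface between the phases: each step consumes exactly what the previous one produces, so new structure types only need a new pre-processing front end.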
The outcome of the overall process is a ranking of the plausible senses for each term contained in the model. Moreover, in order to maximize the effectiveness of the disambiguation process, several feedback techniques, both manual and automated, are available to refine the initial results. In the following sections we will elaborate further on our approach, also by means of a reference example. To this end, we will focus on the PurchaseOrder relational schema already shown in the introduction (Fig. 1a).
4. Pre-processing and context extraction
This section describes how the models are prepared for the proper disambiguation process and how context is associated with each term to be disambiguated (pre-processing and context extraction in Fig. 2).
4.1. Pre-processing
Our aim is to support the disambiguation of a wide range of models:
- relational schemas: tables and views;
- tree-structured models: trees that can be employed to describe, for instance, the structure of XML documents, web directories and taxonomies;
- graph-structured models: graphs describing data structures in, for instance, RDF-S files or OWL ontologies.
The pre-processing component is used to unify the input formats into a common one. This allows us to abstract from the specific input format and to achieve homogeneity in the disambiguation phase. More precisely, STRIDER adopts the RDF encoding of the open interface model (OIM) [16] specification. The main aim of OIM is to enable the sharing and reuse of metadata. In OIM-RDF [56], models are represented as directed labeled graphs
Fig. 2. STRIDER disambiguation system’s architecture.
Fig. 3. A portion of the OIM-RDF graph of our reference example.
complying with the RDF vocabulary description language, and each supported model type is associated with a specific OIM meta-model. Some nodes of such graphs denote model elements, such as relations and attributes in relational schemas, categories in web directories, etc. Each element is uniquely identified by an object identifier (OID).

Definition 1 (OIM-RDF graph). An OIM-RDF graph is a set of RDF triples ⟨s,p,o⟩ where s is the source node, p is the edge label, and o is the target node. The node identifiers and edge labels are drawn from the set of OIDs, which can be implemented as integers, pointers, URIs, etc. The type of o is the union of OIDs and literals, which include strings, integers, floats, and other data types.
For instance, Fig. 3 illustrates a small portion of the OIM-RDF graph encoding the relational schema of our reference example. Node &1 represents the Product table and is the source node of four triples: ⟨&1,type,Table⟩, ⟨&1,name,Product⟩, ⟨&1,column,&2⟩, ⟨&1,column,&4⟩. The involved edge labels come from the OIM relational meta-model [56]. For the format unification we enhanced the platform for generic model management presented in [56]: first, it converts the input model into an OIM-RDF graph, then it performs a mapping between the native and the converted models. Note that the followed approach makes the addition of new model types easy, as it is only a matter of defining conversion rules from the new format to the common one.
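To make Definition 1 concrete, the following minimal sketch represents the fragment of Fig. 3 as a set of triples and collects the node strings that will later be disambiguated. OIDs are modeled as plain strings; this is an illustrative simplification, not STRIDER's actual data structure:

```python
# The OIM-RDF fragment of Fig. 3 as a set of RDF triples <s, p, o>.
# OIDs are modeled as strings ("&1", "&2", ...), an illustrative
# simplification of the real OID implementation.
triples = {
    ("&1", "type", "Table"),
    ("&1", "name", "Product"),
    ("&1", "column", "&2"),
    ("&1", "column", "&4"),
    ("&2", "name", "Line"),       # the Line column of table Product
    ("&4", "name", "ProductId"),  # the ProductId column of table Product
}

def node_strings(graph):
    """Collect the strings attached to nodes via the 'name' edge label."""
    return {o for (_s, p, o) in graph if p == "name"}

print(sorted(node_strings(triples)))  # the strings disambiguation will focus on
```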
Fig. 4. Context extraction component.
The disambiguation process will focus on all the nodes' strings, as they capture all the model informative content we are interested in. Going back to our reference example (Fig. 3), for the table Product we will disambiguate the strings Product, Line and ProductId, which are the names of nodes &1, &2 and &4, respectively. Similarly, we consider OrderDetail and ProductId from table OrderDetail.

4.2. Context extraction

Once the input model is converted into the corresponding OIM-RDF graph, the terms and senses extraction component depicted in Fig. 4 extracts the polysemous terms from each node N's string and associates each of these terms, denoted by the tuple (t,N), with a list of plausible senses Senses(t,N) = [s1, s2, ..., sk]. Notice that the same term t could be present more than once and that the disambiguation is strictly dependent on the node each instance belongs to; for this reason, each term is represented by the pair (t,N). In principle, such a list is the complete list of senses provided by the dictionary, but it can also be a shrunk version suggested either by human or machine experts (see, e.g., Section 5.5 on feedback). Notice that more than one term can be extracted from one string; see, for instance, the ProductId node in Fig. 1a, which contains the terms product and id. Moreover, insignificant and frequent terms, including articles and conjunctions, can be a source of noise in disambiguating the other terms. Such words, usually referred to as stopwords, are filtered out. Before going on to define our disambiguation task in detail, we would like to make a short note about the kinds of labels which are dealt with in STRIDER. As we have seen, the employed term identification approach works under the assumption that the nodes are described by meaningful labels. This is what usually happens for data which are published and exchanged on Web spaces, in particular the ones expressed in one of the W3C standards for data representation, XML, RDF and OWL, whose goal is to produce ''human-legible and reasonably clear'' documents. This is the kind of data structures which are used in almost all the applications STRIDER is aimed at and
which are considered in this paper. However, it may be the case that some labels contain abbreviations or acronyms, making them syntactically different and difficult to understand. In this case, it is sufficient to apply a normalization technique to the labels prior to term identification. For instance, a first possibility is the one proposed in the schema matching tools [57,58], where abbreviations and acronyms are expanded (e.g. a label PO would become PurchaseOrder) based on a thesaurus lookup. The thesaurus can include terms used in common language as well as domain-specific references. Another possible method is the one proposed in [59], which expands abbreviations and acronyms in a semi-automatic way by exploiting both external sources (such as user-defined/online abbreviation dictionaries, standard domain abbreviations, and so on) and internal ones (such as context information from the given schema, or complementary schemas about similar subjects).

Definition 2 (Structural disambiguation task). Let us consider the set of terms extracted from an OIM-RDF graph. For each term (t,N), the goal of the structural disambiguation task is to define a confidence function φ which associates each sense si ∈ Senses(t,N) with a score φ(i) ∈ [0,1] expressing the confidence of choosing it as the right one for term (t,N).

It is worth noting that the confidence function φ induces a ranking of the senses in Senses(t,N). In this way STRIDER supports two types of disambiguation services: the assisted and the completely automatic one. In the former case, the disambiguation task is committed to a human expert and the disambiguation service assists him/her by providing useful suggestions. In the latter case, there is no human intervention and the selected sense is the top one, i.e. the one with the highest confidence value.

4.2.1. Graph context extraction

Identifying the correct sense of each node label can be done by analyzing its context of use, given by the labels of the surrounding nodes. Graph context extraction (see Fig. 4) is the phase devoted to the contextualization of each polysemous term (t,N) through the extraction of its context Gcont(t,N) from the OIM-RDF graph. In its broad form, Gcont(t,N) contains the terms of the nodes connected to N. However, depending on the graph characteristics, finer choices could be preferable. For instance, while disambiguating the term line in table Product of our running example, terms that are ''near'' in the graph (such as product and cost) can be useful to understand its correct sense as a ''kind of product or merchandise''. Instead, distant terms such as customer could lead to wrong interpretations of the term line, such as ''a formation of people one beside another''. For the above reasons, we introduce the notion of crossing setting. Broadly speaking, a crossing setting associates each term with an abstraction of (a portion of) the OIM-RDF graph which represents its graph context. Such an abstraction is (a subset of) the OIM-RDF nodes connected through labeled arcs, which can be either native OIM-RDF arcs or arcs derived through the evaluation
Arc type | Direct direction cost | Opposite direction cost
SR       | 1                     | 0.8
FK       | 0.1                   | 0.1

Fig. 5. Arc types for relational schemas.
of SPARQL queries on the OIM-RDF graph. For instance, referring to the relational case, the following SPARQL query, which is depicted on the left side of Fig. 5, selects each column name (identified by the variable ?column) and the name of the table it belongs to (identified by the variable ?table):

select ?table ?column
where { ?tableURI sql:name ?table .
        ?tableURI rdf:type sql:Table .
        ?tableURI sql:column ?columnURI .
        ?columnURI rdf:type sql:Column .
        ?columnURI sql:name ?column }

The results are a set of label pairs such as (line, Product). The relationship between columns and tables is then made explicit by an arc labeled SR (structural relationship), which is added between each of these pairs. A crossing setting σ is made up of a reachability threshold τ and a set of arc types, where each arc type has a name, is defined by a SPARQL query on the OIM-RDF graph, and is associated with two crossing costs in (0,1], one for each crossing direction. The costs will be used to compute the distance between two nodes in the graph; in principle, the lower the cost of an arc crossing direction, the closer the two nodes connected by that arc are. For each supported model type, we defined the default crossing setting shown in Appendix A (where the value of τ has been experimentally set to 3). The arc types of the relational default crossing setting are depicted in Fig. 5. Besides SR, the foreign key arc type (FK) is an example of a referential constraint which can be made explicit. Note that the FK costs are appreciably lower than those of SR because FK represents a very strong integrity constraint between pairs of attributes. Generally speaking, the lowest crossing costs should be assigned to the strongest relationships, whereas the highest values should be reserved for the weakest ones. Indeed, what matters in setting the crossing costs are not the absolute cost values but rather the arc type order induced by those values. The position of each arc type in this order and its costs should reflect its strength w.r.t. the other arc types.
Users are allowed to freely modify both the value of τ and the set of arc types of the default crossing settings, or to introduce their own crossing settings. By applying the arc type definitions of a given crossing setting σ, we derive a view of the OIM-RDF graph on which the graph context Gcont(t,N) of each term (t,N) is computed. In particular, Gcont(t,N) contains all and only those nodes N^g in the OIM-RDF view which are reachable from N. A node N^g is said to be reachable if and only if there exists a path from N to N^g whose total cost, obtained by summing the single costs associated with each crossing setting's relationship w.r.t. the crossing direction, is smaller than or equal to the reachability threshold τ. Since we are dealing with graphs, different paths may be available that connect a pair of nodes; among all, we choose the one having the lowest path cost LPC(N,N^g). Each reachable node N^g is associated with a weight weight(N^g) which essentially represents the weight that the terms of N^g should have in the disambiguation of the terms of N. The way node weights influence the proper disambiguation process will be shown in Section 5. In principle, the closer N^g is to N and the lower the costs of the connecting arcs, the more its terms influence the disambiguation of (t,N). For this reason, we compute weight(N^g) by applying a Gaussian distance decay function on LPC(N,N^g):

weight(N^g) = (2/√(2π)) · e^(−LPC(N,N^g)²/8) + 1/(2·√(2π))    (1)
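The lowest-path-cost computation and the decay of Eq. (1) can be sketched as follows. The decay formula used here, weight = (2/√(2π))·e^(−LPC²/8) + 1/(2√(2π)), is our reading of Eq. (1); it reproduces the weights of the running example to two decimal places. The adjacency-list encoding of the graph view is an illustrative simplification:

```python
import heapq
import math

# A sketch of lowest path cost (LPC) and node weight computation for the
# running example. Crossing costs follow Fig. 5 (SR: 1 direct, 0.8 opposite;
# FK: 0.1 in both directions); node numbering follows Fig. 6.
SR_DIRECT, SR_OPPOSITE, FK_COST = 1.0, 0.8, 0.1

edges = {}  # node -> list of (neighbor, crossing cost)

def arc(a, b, direct_cost, opposite_cost):
    edges.setdefault(a, []).append((b, direct_cost))
    edges.setdefault(b, []).append((a, opposite_cost))

for col in (2, 3, 4, 5, 6):                 # columns of table Product (node 1)
    arc(1, col, SR_DIRECT, SR_OPPOSITE)     # SR: table -> column is direct
arc(7, 9, SR_DIRECT, SR_OPPOSITE)           # OrderDetail (7) and its ProductId (9)
arc(4, 9, FK_COST, FK_COST)                 # FK between the two ProductId columns

def lpc_from(start, tau=3.0):
    """Dijkstra: lowest path cost from start to every node within threshold tau."""
    dist, heap = {start: 0.0}, [(0.0, start)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist[n]:
            continue
        for m, cost in edges.get(n, ()):
            nd = d + cost
            if nd <= tau and nd < dist.get(m, float("inf")):
                dist[m] = nd
                heapq.heappush(heap, (nd, m))
    return dist

def weight(lpc):
    """Gaussian distance decay on LPC (Eq. (1), as reconstructed above)."""
    return (2.0 / math.sqrt(2 * math.pi)) * math.exp(-lpc ** 2 / 8.0) \
        + 1.0 / (2.0 * math.sqrt(2 * math.pi))

lpc = lpc_from(5)   # node 5 is the Line column of table Product
```

With these costs, lpc reproduces the totals of Fig. 6 (e.g. lpc[1] = 0.8, lpc[7] = 2.7), and weight(·) matches the example weights at two decimals.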
Definition 3 (Graph context). The graph context Gcont(t,N) of a term (t,N) is a set of tuples ((t^g,N^g), Senses(t^g,N^g), weight(N^g)), one for each term (t^g,N^g) of each reachable node N^g.

Example 4.1. Fig. 6 shows the graph obtained by applying our default crossing setting to the OIM-RDF graph depicted in Fig. 3. Each node is assigned a unique identifier which is used in the table to show the total costs and the corresponding weights. The graph context of the term line, Gcont(line,5), is made up of the terms (product,1), (producer,2), (name,3), (product,4),
Node | LPC (total cost)       | Weight
1    | 0.8                    | 0.939
2    | 0.8+1.0 = 1.8          | 0.734
3    | 0.8+1.0 = 1.8          | 0.734
4    | 0.8+1.0 = 1.8          | 0.734
6    | 0.8+1.0 = 1.8          | 0.734
9    | 0.8+1.0+0.1 = 1.9      | 0.710
7    | 0.8+1.0+0.1+0.8 = 2.7  | 0.523

Fig. 6. Node selection and weight calculation for term Line (total costs and weights for the nodes reachable from the Line node).
Table 1
Sense explanation, examples and Scont(sk) of term (line,5).

s1:  Explanation: a formation of people or things one beside another. Examples: ''The line of soldiers advanced with their bayonets fixed''; ''they were arrayed in line of battle''; ''the cast stood in line for the curtain call''. Scont(s1): battle, bayonet, cast, curtain, formation, ...
s22: Explanation: a particular kind of product or merchandise. Examples: ''A nice line of shoes''. Scont(s22): kind, line, merchandise, product, shoe, ...
(id,4), (unit cost,6), (product,9), (id,9), (order,7) and (detail,7). Note that all the terms from the same node have the same weight and that terms can be composed of more than one word if the corresponding lemma is contained in the dictionary (i.e. unit cost is a lemma composed of two words). Finally, the Gcont(line,5) tuple for (product,1), using WordNet [11] as the inventory of senses, is: ((product,1), [s1 = commodities offered for sale (''good business depends on having good merchandise''; ''that store offers a variety of products''), ..., s5 = a quantity obtained by multiplication (''the product of 2 and 3 is 6''), ...], 0.939).

4.2.2. Sense context extraction

As most of the semantics is carried by noun words [15], the sense context extraction phase enriches the context of each term (t,N) to be disambiguated with the nouns used to explain its plausible senses, similarly to [19]. The sense context is particularly useful when the graph context provides too little information, for instance because the OIM-RDF graph is too small.

Definition 4 (Sense context). The sense context Scont(s) of each sense s ∈ Senses(t,N) of a given term (t,N) is a set of nouns extracted through a part-of-speech tagger from the definitions, the examples and any other explanation of the sense provided by the thesaurus.

Example 4.2. For each term sense s, WordNet provides a gloss that contains the sense explanation and a list of examples of using that term with that meaning. Table 1 shows some of the 29 senses of the term line with sense explanations, examples and Scont(s), respectively (the extracted nouns are in alphabetical order).
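As a minimal sketch of Definition 4, the snippet below builds a sense context from a gloss and its examples. A tiny hand-written noun lexicon and a naive plural-stripping rule stand in for a real part-of-speech tagger; both are illustrative assumptions, not STRIDER's actual tagging pipeline:

```python
# Building Scont(s) from a WordNet-style gloss. A real system would run a
# part-of-speech tagger; here a small hand-written noun lexicon stands in
# for it (an illustrative assumption).
NOUNS = {"kind", "product", "merchandise", "line", "shoe",
         "formation", "people", "thing", "soldier", "battle"}

def scont(gloss, examples):
    """Collect the nouns appearing in a sense's explanation and examples."""
    words = (gloss + " " + " ".join(examples)).lower().replace(",", "").split()
    # naive singularization of plural forms, for illustration only
    lemmas = {w[:-1] if w.endswith("s") and w[:-1] in NOUNS else w for w in words}
    return sorted(lemmas & NOUNS)

# The gloss/example text is paraphrased from Table 1, sense s22 of "line".
s22 = scont("a particular kind of product or merchandise",
            ["A nice line of shoes"])
```

For sense s22 this yields the same noun set reported in Table 1.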
5. Disambiguation algorithms

STRIDER's disambiguation algorithms follow a similarity-based approach. The key to their working is to semantically correlate each term to be disambiguated with the terms of its contexts, graph context and sense contexts, through the use of knowledge sources. As we said in the previous sections, the outcome of the overall process is a ranking of the plausible senses of each term (see Fig. 7). In particular, the confidence function φ is composed of three different contributions: those given by the graph context, the sense context and the sense frequency. The first two are context-dependent and are computed with two confidence functions, named φG and φS, respectively; the latter uses the frequency of senses and is computed with φF. Each confidence function gives rise to a confidence vector bearing the same name; then, the final confidence vector φ is obtained as their linear combination:

φ = α·φG + β·φS + γ·φF,    α + β + γ = 1

where α, β and γ are parameters that can be freely adjusted in order to conveniently weigh the contributions.

Example 5.1. Starting from the current example, we will explain how STRIDER disambiguates terms by following the disambiguation of the term line of our reference schema. As we anticipated in the introductory example, there are 29 different senses, including (s1) ''a formation of people or things one beside another'', (s5) ''text consisting of a row of words'', (s9) ''cable, transmission line'' and (s22) ''line of products, line of merchandise''. We recall that (s22) is the right one. In particular, in this example, we will just give a glimpse of how STRIDER chooses the sense for the given term by exploiting the relevant computed confidence values; then, further examples will clarify how the confidence values leading to this result are actually computed. By performing automatic disambiguation on the schema, our approach is able to correctly disambiguate this term, together with the large majority of the other schema terms. In particular, using the default setting of α = β = 0.4 and γ = 0.2, the highest value of the output confidence vector is φ(22) = 0.61, much higher than, for instance, φ(1) = 0.24, meaning that STRIDER is confident in s22 being the right sense.

Fig. 7. Overview of the different contributions' computations for the disambiguation of a given term.
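The linear combination can be sketched as follows. The φG and φS values for senses s1 and s22 of line are those reported in Examples 5.3 and 5.4; both decay functions of Section 5.4 assign φF = 1 to the most frequent sense, while φF(s22) = 0.37 is a hypothetical value, chosen here only so that the totals are consistent with Example 5.1 (it is not given in the running example):

```python
# Combining the three confidence vectors (Section 5) with the default
# parameter setting. phi_F for s22 is a back-solved, hypothetical value.
ALPHA, BETA, GAMMA = 0.4, 0.4, 0.2   # default setting, alpha + beta + gamma = 1

def combine(phi_g, phi_s, phi_f):
    """phi = alpha*phi_G + beta*phi_S + gamma*phi_F, component-wise."""
    return [ALPHA * g + BETA * s + GAMMA * f
            for g, s, f in zip(phi_g, phi_s, phi_f)]

# Components for [s1, s22] of the term "line":
phi = combine([0.05, 0.58],   # phi_G, from Example 5.3
              [0.04, 0.76],   # phi_S, from Example 5.4
              [1.00, 0.37])   # phi_F: 1 for the most frequent sense,
                              # 0.37 hypothetical for s22
```

With these inputs the combination reproduces φ(1) ≈ 0.24 and φ(22) ≈ 0.61 of Example 5.1.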
5.1. Computing term similarity

The confidence in choosing one of the senses associated with a given term depends on the similarity between that term and each term in the context. To this extent, this section focuses on how the term similarity component works, whose output is at the basis of the computation of the graph and sense context contributions (see Fig. 7). In order to quantify the similarity between two terms tx and ty, we decided not to restrict our vision to a specific external source but to investigate the two alternative approaches we discussed in Section 2, thesauri and large textual corpora. Specifically, in the thesaurus-based similarity we make use of the hypernymy-hyponymy hierarchy of the reference thesaurus through one of the most promising measures available in this field, the Leacock-Chodorow [29] one, which has been revised in this way:

sim(tx,ty) = −ln( len(tx,ty) / (2·H) )   if there exists a common ancestor;  0 otherwise    (2)

where len(tx,ty) is the minimum among the number of links connecting each sense in Senses(tx) and each sense in Senses(ty), and H is the height of the hypernymy hierarchy (e.g. 16 in WordNet). The set of lowest common ancestors of tx and ty in the hypernym hierarchies will be denoted as mch(tx,ty) (minimum common hypernyms).

Alternatively, when a large document repository or simply a common web search engine is available, information on term co-occurrence can be successfully exploited to compute term similarity (or distance) by means of several functions, such as Jaccard (recently used in [39]), PMI [36] or NGD [38]. In STRIDER we propose a corpus-based similarity based on an exponential version of PMI:

sim(tx,ty) = e^PMI(tx,ty) = M · f(tx,ty) / ( f(tx) · f(ty) )    (3)

where M is a normalizing constant, f(t) is the frequency of term t and f(tx,ty) is the frequency of co-occurrence of terms tx and ty together. We performed several ad hoc tests comparing the results obtained through this formula to those produced by applying different ones, such as Jaccard or NGD, and finally found our exponential PMI to be the most effective for our purposes, especially when, as in our case, f(t) is the aggregate WWW page-count estimate returned by search engines such as Yahoo or Google.
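The two measures can be sketched directly from Eqs. (2) and (3). Here M is set to 1.52·10^10, a hypothetical total page count chosen only to be consistent with the corpus-based similarity value reported in Example 5.2; the frequencies are those of the example:

```python
import math

# Sketches of the two term-similarity measures of Section 5.1.
H = 16  # height of WordNet's hypernymy hierarchy

def sim_thesaurus(path_len):
    """Revised Leacock-Chodorow measure (Eq. (2)); path_len is len(tx,ty)."""
    return -math.log(path_len / (2 * H))

def sim_corpus(f_x, f_y, f_xy, M):
    """Exponential PMI (Eq. (3)); frequencies are web page-count estimates."""
    return M * f_xy / (f_x * f_y)

# Example 5.2, line vs product. M = 1.52e10 is a hypothetical page count,
# back-solved from the similarity value of the example.
lc = sim_thesaurus(1)                                   # minimum path length 1
epmi = sim_corpus(1.08e9, 1.4e9, 0.23e9, M=1.52e10)
```

With these inputs, the thesaurus-based similarity evaluates to −ln(1/32) ≈ 3.47 and the corpus-based one to ≈ 2.31, matching Example 5.2.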
Example 5.2. Let us consider two terms: line (the one to be disambiguated) and one of the terms of its graph context which will contribute to disambiguating it, product. The minimum path length is 1, since the senses of these nouns that join most rapidly are ''line (line of products)'' and ''product (merchandise)'', and the minimum common hypernym is ''product (merchandise)'' itself. Thus, the thesaurus-based similarity is −ln(1/32) = 3.47. On the other hand, by exploiting Google's term frequencies, we have f(line) = 1.08·10^9, f(product) = 1.4·10^9, f(line,product) = 0.23·10^9. Thus, setting M to the total number of Google's indexed pages, the corpus-based similarity is 2.31. In the next section, we will see how such similarity values will be useful for disambiguating our term.

5.2. Graph context contribution

The graph context contribution for a term (t,N) is computed by means of Algorithm 5.1, which relies on the
similarities between t itself and the terms in its graph context. Notice that, for the sake of simplicity of presentation, the algorithm takes one term at a time. This does not correspond to its actual implementation which, for efficiency reasons, works on entire models and performs computations on commutative functions only once.

Algorithm 5.1. GraphContextContr(t,N)
1:  for each sk in Senses(t,N) do  // cycle on the senses of t
2:    φG[k] = 0
3:    norm = 0
4:    for each t_i^g in Gcont(t,N) do  // cycle on the graph context terms
5:      φG[k] = φG[k] + GCC[k][i]
6:      norm = norm + GCN[i]
7:    φG[k] = φG[k] / norm
8:  return φG
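A direct Python transcription of the pseudocode is sketched below. The GCC[k][i] and GCN[i] entries (defined by Eq. (4)) are assumed to be precomputed and passed in, and all input values are hypothetical toys:

```python
# A sketch of Algorithm 5.1. The GCC and GCN arrays (Eq. (4)) are assumed
# to be precomputed; the inputs below are hypothetical toy values.
def graph_context_contr(n_senses, gcc, gcn):
    phi_g = [0.0] * n_senses
    for k in range(n_senses):                 # cycle on the senses of t
        norm = 0.0
        for i in range(len(gcn)):             # cycle on the graph context terms
            phi_g[k] += gcc[k][i]
            norm += gcn[i]
        phi_g[k] = phi_g[k] / norm if norm else 0.0
    return phi_g

# Toy example: 2 senses, 2 graph context terms. Sense 0 is supported by
# both context terms, sense 1 by neither.
gcc = [[1.5, 0.5],    # supports for sense 0
       [0.0, 0.0]]    # supports for sense 1
gcn = [1.5, 1.0]      # per-term normalization totals
phi_g = graph_context_contr(2, gcc, gcn)
```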
The algorithm takes in input a term (t,N) to be disambiguated and produces a vector φG of confidence values in choosing each of the senses in Senses(t,N). The idea behind the algorithm, which derives from [60], is that when two polysemous terms (t, t_i^g) are similar, their most informative subsumer provides information about which sense of each term is the relevant one. Moreover, the contribution of each term t_i^g in Gcont(t,N) is proportional to its weight weight(N_i^g) (see Eq. (1)). As also depicted in Fig. 8a, the algorithm computes the bi-dimensional GCC array (graph context contribution) by means of two nested cycles: one on the senses of t (outer cycle at line 1, dimension ''k'') and the other on the terms in Gcont(t,N) (inner cycle at line 4, dimension ''i''). Specifically, the contribution of each term t_i^g to the confidence of each sense sk is

GCC[k][i] = sim(t, t_i^g) · numHyp(mch(t, t_i^g), sk) · weight(N_i^g)    (4)

where sim(t, t_i^g) is the similarity between t and t_i^g (computed as shown in Section 5.1 using either Eq. (2) or (3)) and numHyp(mch(t, t_i^g), sk) is the number of minimum common hypernyms in mch(t, t_i^g) which are ancestors of sk. Eq. (4) shows that the similarity between each pair (t, t_i^g) is only considered as supporting evidence for those senses sk which are descendants of at least one of the minimum common hypernyms between t and t_i^g. The φG(k) confidence value for each sense sk is thus the normalized sum of these supports (lines 5–7). Normalization brings all scores into a range between 0 and 1 and is performed on the basis of the total number of minimum common hypernyms between t and t_i^g (GCN stands for graph context normalization):

GCN[i] = sim(t, t_i^g) · |mch(t, t_i^g)| · weight(N_i^g)

Example 5.3. Continuing our running example, Fig. 8b graphically depicts some of the entries for the computation of the confidence φG for two senses of the term line, s1 and s22, where the latter is the right one. All values rely on the thesaurus-based similarity presented in Section 5.1. The contribution of the graph topology to the computation is evident from the two instances of product, which only differ in the weights of the nodes they belong to; the first occurrence refers to node 1 (see Fig. 6), which is closer to line's node than node 4. Notice that many significant contributions are available for s22: since it is about merchandise lines, it is very close to most graph context terms such as product. This is also clear from the φG values: the most confident sense (0.58) as to the graph context contribution is s22, while the other values are generally much lower (0.05 for s1).

Fig. 8. Graphical exemplification of the graph context contribution computation algorithm: (a) disambiguation of a generic term t and (b) disambiguation of example term line.

5.3. Sense context contribution

Besides the contribution of the graph context, we also exploit the sense context by quantifying the similarity between the graph context of each polysemous term and the context of each of its senses. To this end, in this subsection we propose two alternative algorithms for sense context contribution computation. It is worth
noting that the algorithms presented in this paragraph can be seen as a variant and/or a generalization of those employed for the graph context; indeed, while they share the same goal, i.e. quantifying the similarity between the senses of the term to be disambiguated and its graph context, in the graph context contribution algorithm such senses are represented by a single node in the thesaurus hierarchy, whereas now all the terms available in their sense contexts contribute to the final result. The first algorithm shares the same basic ideas as the graph context one, with the main difference that, in this case, each sense sk of the term t to be disambiguated is represented by the set of terms in Scont(sk). Thus, it compares the given graph context Gcont(t,N) to the available sense contexts, and the sense context Scont(sk) containing the terms most similar to those of Gcont(t,N) will be the most probable one. The flow of the algorithm is similar to Algorithm 5.1 and for this reason it will not be presented in its entirety. However, the computation is now performed on three dimensions (cycles) in total: on the senses sk of t (dimension ''k''), on the terms t_i^g in Gcont(t,N) (dimension ''i'') and now also on the terms t_j^s in each Scont(sk) (dimension ''j''). Supports are dealt with as in lines 5–6 of Algorithm 5.1, by replacing GCC and GCN with the three-dimensional sense context contribution array (SCC) and the bi-dimensional sense context normalization array (SCN), respectively, where

SCC[k][i][j] = sim(t_i^g, t_j^s) · numHyp(mch(t, t_i^g), sk) · weight(N_i^g)
SCN[i][j] = sim(t_i^g, t_j^s) · |mch(t, t_i^g)| · weight(N_i^g)

Notice that, instead of using the term t to be disambiguated, sim(t_i^g, t_j^s) compares term t_i^g in Gcont(t,N) with term t_j^s describing sense sk. As Fig. 9a graphically exemplifies, this is done for all terms in the graph context against all terms describing all senses: in this bi-dimensional representation, each cell SCC[k][i] contains a vector of similarities, one for each t_j^s; the sum of these values represents t_i^g's contribution to sk.
The algorithm discussed above for sense context contribution computation is based on the same strengths as the graph context one and, from the tests performed, it delivers high effectiveness. However, there are some cases where the hypernym structures may not be completely adequate or sufficient to describe a concept and thus to disambiguate it. While Section 5.1 shows that we can be completely independent of such structures as to the computation of term similarities, we also felt it important to provide STRIDER with an alternative confidence computation algorithm, Algorithm 5.2, which weighs the different similarity contributions directly on the basis of the Gcont(t,N) weights and independently of the thesaurus hierarchies. We did this for the computation of the sense contribution since, in this case, the senses are not involved in similarity computations as members of linguistic hierarchies but as sets of associated descriptive terms. Given a sense sk (line 1) and a term t_i^g in Gcont(t,N) (line 4), the similarities sim(t_i^g, t_j^s) between t_i^g and every term t_j^s in Scont(sk) are computed (lines 6–7), summarized in a single value by means of a parametric function f(·) and then added to the φS[k] confidence of the sense with weight weight(N_i^g) (line 8). As to f(·), various alternatives can be employed, such as max(·) or mean(·); among those, we found mean(·) to be particularly effective. Finally, normalization is performed in two steps: for each sense sk, φS[k] values are divided by the sum of all weights in order to obtain a weighted mean of the term similarities (line 10); then the values in φS are brought into the range [0,1] (line 11) in order to be compatible with the other contributions.

Algorithm 5.2. SenseContextContr(t,N) (independent from hierarchies)
1:  for k = 1 to |Senses(t,N)| do  // cycle on the senses sk of term t
2:    φS[k] = 0
3:    sumWeights = 0
4:    for each t_i^g in Gcont(t,N) do  // cycle on the graph context terms
5:      vsim_ti = [0,...,0]
6:      for each t_j^s in Scont(sk) do  // cycle on the sense context terms
7:        vsim_ti[j] = sim(t_i^g, t_j^s)
8:      φS[k] = φS[k] + f(vsim_ti) · weight(N_i^g)
9:      sumWeights = sumWeights + weight(N_i^g)
10:   φS[k] = φS[k] / sumWeights
11: φS = φS / max(φS)
12: return φS

Example 5.4. Consider Fig. 9. The similarities between Gcont(t,N) terms (above the matrix) and the terms from Scont(s22), such as merchandise and product, give a good clue of this sense being the right one. This is captured by the algorithm and is evident from the high similarity values of the Scont(s22) row. For instance, the sum of the similarities between product (the first term of Gcont(t,N)) and all terms in Scont(s22) is 27.13, while it is 0 for those in Scont(s1), since they involve a very distant military meaning. The φS values confirm this intuition: s22 has the highest confidence (0.76), while the others are generally much lower (0.04 for s1).

Fig. 9. Graphical exemplification of the sense context contribution computation algorithm.
Finally, notice that both algorithms can be combined with the two alternative ways of computing the term similarity sim(tx,ty) shown in Section 5.1. This results in four variants which use the sense context: two ''pure'' solutions, one using the thesaurus both in the similarity computation and in the algorithm and another one using only the corpus-based similarity, and two ''hybrid'' solutions, which use the thesaurus in the similarity computation but not in the algorithm and vice versa. All variants will be evaluated in Section 6.
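Algorithm 5.2 can be sketched in Python as follows, with mean(·) as the aggregation function f(·). The similarity function is a toy lookup table standing in for Eq. (2) or (3), and all data values are hypothetical:

```python
# A sketch of Algorithm 5.2 with mean() as the aggregation function f().
# The similarity function and all data values below are hypothetical toys.
def sense_context_contr(gcont, sense_contexts, sim):
    """gcont: list of (term, weight); sense_contexts: one term list per sense."""
    phi_s = []
    for scont in sense_contexts:              # cycle on the senses sk
        acc, sum_weights = 0.0, 0.0
        for term, w in gcont:                 # cycle on the graph context terms
            sims = [sim(term, ts) for ts in scont]
            mean = sum(sims) / len(sims) if sims else 0.0
            acc += mean * w                   # f(vsim) * weight
            sum_weights += w
        phi_s.append(acc / sum_weights if sum_weights else 0.0)
    top = max(phi_s)
    return [v / top for v in phi_s] if top else phi_s   # bring into [0,1]

SIM = {("product", "merchandise"): 3.5, ("product", "shoe"): 2.0}
sim = lambda a, b: SIM.get((a, b), 0.0)

phi_s = sense_context_contr(
    [("product", 0.94)],                      # toy graph context with weights
    [["battle", "bayonet"],                   # Scont(s1)
     ["merchandise", "shoe"]],                # Scont(s22)
    sim)
```

In this toy run the second sense context dominates, so its confidence is normalized to 1 while the first stays at 0.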
The linear decay function decrements the confidence in choosing each successive sense sk proportionally to its position pos(sk):

  φF[k] = linearDecay(sk) = 1 − r · (pos(sk) − 1) / (|Senses(t)| − 1)   if |Senses(t)| > 1
  φF[k] = linearDecay(sk) = 1                                           otherwise          (5)

where 0 < r < 1 is a parameter we usually set to 0.8 (based on previous experimentation) and |Senses(t)| is the cardinality of Senses(t). In this way, we can provide a rough quantification of the frequency of the senses, where the first sense has full confidence and the last one still has a non-null value (in our case, 1/5), thus allowing us to exploit the benefits of sense frequency for all the senses. Extended versions of the WordNet thesaurus, such as the one which can be queried online at the official site, offer more detailed information about sense frequencies in the form of frequency counts fc(sk), which are integers associated with each sense sk. The larger fc(sk), the more probable sk will be for a given term. The second frequency function uses fc and is generally more precise and effective, since the decay is not linear but is shaped differently for each term:

  φF[k] = freqCountDecay(sk) = 1 − r · (fc(s1) − fc(sk)) / fc(s1)   if fc(s1) > 0
  φF[k] = freqCountDecay(sk) = 1                                    otherwise          (6)
Example 5.5. The frequency counts associated with the first four of the 29 senses of line are 51, 20, 15 and 15, respectively, meaning that the second sense is less than half as frequent as the first, that the third and fourth are equally probable, and so on. By applying Eq. (6) to this small example, we obtain φF = {1, 0.51, 0.43, 0.43, …}.
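The two decay functions of Eqs. (5) and (6) can be sketched as follows; this is an illustration, not the authors' code, with r = 0.8 as in the text.

```python
# Sketch of the two sense-frequency decay functions; r = 0.8 as in
# the text. Positions and sense ranks follow WordNet frequency order.
def linear_decay(pos, n_senses, r=0.8):
    """Eq. (5): confidence of the sense at (1-based) position pos."""
    if n_senses > 1:
        return 1 - r * (pos - 1) / (n_senses - 1)
    return 1.0

def freq_count_decay(fc, k, r=0.8):
    """Eq. (6): fc is the list of WordNet frequency counts, ordered by
    sense rank; k is a 0-based sense index."""
    if fc[0] > 0:
        return 1 - r * (fc[0] - fc[k]) / fc[0]
    return 1.0

# Example 5.5: the first four senses of "line"
fc_line = [51, 20, 15, 15]
phi_F = [round(freq_count_decay(fc_line, k), 2) for k in range(4)]
# phi_F is [1.0, 0.51, 0.44, 0.44] (the paper truncates 0.435 to 0.43)
```

With |Senses(line)| = 29, the last sense under Eq. (5) still receives 1 − r = 0.2, the non-null residual value (1/5) mentioned above.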
5.4. Sense frequency contribution

The last contribution we present is based on the frequencies of the senses in the English language. The underlying idea is to emulate the common sense of a human in choosing the right meaning of a term when the context gives little help. Indeed, among all the possible meanings a word may have, one meaning generally occurs more often than the others; it is worth noting that word meanings exhibit a Zipfian distribution. Differently from the other contributions, this one is independent of the context and relies only on the knowledge about the a priori distribution provided by the thesaurus. In particular, WordNet incorporates the information extracted from the SemCor corpus, consisting of more than 200,000 sense-tagged terms [61]. In STRIDER we propose two ways of computing the sense frequency confidence contribution φF. The first method starts from the consideration that WordNet already orders the senses of each term t on the basis of frequency of use (i.e. the first is the most common sense, and so on). In this case, similarly to our proposal in [17], we can easily compute a value φF(k) for each sense sk by means of the linear decay function of Eq. (5), which, starting from φF(1) = 1, decrements the confidence of each successive sense.

5.5. Feedback

The presented disambiguation algorithms are able to achieve very effective results from a single execution. However, in order to obtain even better results, we propose several feedback techniques, which refine the initial results by performing successive disambiguation runs. At each run i following the first one (i = 2,…,n), the set of senses Senses(t,N)_i for some term (t,N) is a subset of Senses(t,N)_{i−1}. It is worth noting that very few solutions for automatically refining disambiguation results, based on different techniques, have been proposed in the literature, and they were tested only in open text disambiguation settings [24]. First of all, user feedback is available, in which the user plays an active role by deactivating/activating the influence of selected senses on the disambiguation process. For instance, suggesting the correct meaning of a difficult term in the model may help the system in choosing the right meaning of the others. However, since the focus of our work is also on completely automatic techniques, we devised automated feedback techniques which are actually able, in most situations, to significantly improve the results of successive runs without user intervention. We propose three gradually more complex methods, all based on the idea that removing the ‘‘bottom’’ senses in the first
run has a positive effect on the disambiguation, since the ‘‘noise’’ produced by the senses which are most probably wrong is eliminated:
Method 1—Simple auto-feedback: The results of the first run are refined by automatically disabling the contributions of all but the top X senses in the following run. Thus, only a second run is required. A very low X value, such as 2 or 3, can in principle be chosen, since in most cases the STRIDER algorithms ensure that the right sense is in the very top positions of the ranking.

Method 2—‘‘Knockout’’ auto-feedback: At each run, the sense of each term with the lowest confidence, i.e. the one presumably bringing more ‘‘noise’’, is disabled. The process stops when, for all terms, only the top X senses are left. This method requires more runs than Method 1, but it is expected to be more effective because of its greater gradualness. In particular, the number of runs depends on the maximum number of senses among the model's terms and on the value of X.

Method 3—Stabilizing ‘‘knockout’’ auto-feedback: A fixpoint version of Method 2. In this case a residual vector Δ is maintained for each term, where each entry Δ(k) keeps the modulus of the confidence variation of sk between the current step (i) and the previous one (i−1): Δ(k) = |φi(k) − φi−1(k)|. When the confidence vector of a term becomes stable, that is, when the Euclidean length of its residual vector Δ is less than a given ε, only the top X senses are kept. This method is expected to achieve the same effectiveness as Method 2 while requiring fewer runs.
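The knockout loop of Method 2 can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `disambiguate` is a hypothetical callback that performs one disambiguation run and returns a confidence value for each active sense of each term.

```python
# Sketch of the Method 2 "knockout" auto-feedback loop; disambiguate()
# is a hypothetical callback performing one disambiguation run.
def knockout_feedback(active_senses, disambiguate, X=3):
    """active_senses: {term: set of sense ids}; refined in place until
    every term keeps at most its top X senses."""
    while any(len(s) > X for s in active_senses.values()):
        conf = disambiguate(active_senses)          # one run
        for term, senses in active_senses.items():
            if len(senses) > X:                     # knock out worst sense
                worst = min(senses, key=lambda k: conf[term][k])
                senses.discard(worst)
    return disambiguate(active_senses)              # final run
```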
In Section 6 we will show samples of the effectiveness of all methods on different types of schemas.

6. Experimental evaluation

In this section we present the results obtained through an actual implementation of the STRIDER disambiguation system, including all the features presented in the previous sections.

6.1. Experimental setting

Tests were conceived to show the behavior of our disambiguation approach in different scenarios. We performed several tests on a large number of models. Since no reference collections are officially available for evaluating structural disambiguation approaches, we present a selection of the experiments on the most interesting and challenging models we found, for instance those containing terms that are not used with their most common meaning. Table 2 shows the 10 chosen models, their input formats and their features, both from a structural and a semantic point of view. The complete models can be found in Appendix B. The models were chosen so as to be representative of diverse disambiguation challenges and scenarios: in order to evaluate the effectiveness of our approach, the output of our system will be compared to the ‘‘gold standard’’ manually
Table 2
Features of the tested models.

Model name       Spec. level   Terms  Nodes  Edges  #Senses mean  #Senses max  Sense simil.
SQL DDL
PurchaseOrders   2             26     18     68     5.115         29           2.807
Computer         3             31     24     92     3.774         13           3.162
Student          2             19     18     68     4             11           3.057
XMLSchema
Yahoo            1             15     18     56     2.733         6            3.373
eBay             1             17     18     56     2.941         8            3.420
Shakespeare      3             14     17     52     8             29           2.152
DBLP             3             14     17     52     5.429         11           2.601
OWL/RDF
Camera           3             30     125    384    4.533         12           2.785
Travel           2             48     49     156    3.375         12           3.248
Process          1             69     119    320    3.666         13           2.916
disambiguated models. To this end, two human taggers contributed to annotate the corpus w.r.t. the WordNet thesaurus. The taggers performed disambiguation without any external knowledge; in case of disagreement, a third human tagger chose which tag should apply. Notice that, in some particular cases where the target words present very fine-grained senses in WordNet, more than one sense was chosen as equally appropriate. We consider three relational schemas, four simple trees and three ontologies. The relational schemas are examples commonly proposed in various database books: PurchaseOrder (see Fig. 1a for the complete schema description) is a simple schema modelling commercial orders for suppliers, Computer is a schema for the assembly of PC components and Student contains the course-exam-student relational schema. Simple trees and ontologies are all available on the internet: we chose a small portion of Yahoo!'s web directories and eBay's catalog, a model extracted from Shakespeare's Plays in XML8 and the entire DBLP XML schema, a scientific digital library for computer science journals and proceedings. The Camera and Travel models come from the OWL ontologies of the Protege Ontologies Library;9 the Camera ontology is also the Jena Framework10 running example. The Process ontology is extracted from the ‘‘OWL-S Submission’’, an OWL-based Web service ontology which supplies a core set of markup language constructs for describing the properties and capabilities of Web services in unambiguous, computer-interpretable form.11 The models we chose have various specialization levels (see Table 2) indicating how much a model is contextualized in a particular scope:

(1) Low Specialization: Generic models can gather very heterogeneous concepts, such as a web directory. The portion of the eBay catalog (see [17]
8 http://www.xml.com/pub/r/396
9 http://protege.stanford.edu/download/ontologies.html
10 http://jena.sourceforge.net/ontology
11 http://www.w3.org/Submission/2004/07/
for the entire model) contains terms such as batteries as an electronic device and chair as a piece of furniture, which come from very different semantic areas, as well as very generic and abstract terms such as condition and control.

(2) Medium Specialization: Models with a medium specialization level gather heterogeneous concepts that come from a specific semantic area, such as exams, students and courses from the Student model.

(3) High Specialization: Highly specific models gather concepts from a very specific area. For instance, the terms inside the Camera model are typically about photographic techniques and devices, such as lens and shutter.
In the following, we present the obtained results in three groups, corresponding to the different specialization levels. As we will see, the different groups represent different but equally complex challenges: while the low and medium specialization groups (Groups 1 and 2) involve terms with very heterogeneous meanings, from a certain point of view the higher the specialization of a model, the harder its disambiguation, because the involved terms require an accurate collocation in a specialized context (Group 3). Notice that such groups are heterogeneous as to input format. Grouping by input format would not have been a significant choice, since our disambiguation system is completely independent of the input format, thanks to the format unification of the pre-processing phase.

Table 2 also shows the structural features of the chosen models, such as the number of terms, nodes and edges. Relational schemas and trees have a few dozen nodes, from which a similar number of terms is extracted (depending on the number of nodes that contain more than one term and on the terms that are eliminated with the stopwords list). Ontologies are characterized by more complex graphs, having a higher number of nodes and edges (including blank nodes, which are preserved but not disambiguated); note that their number of edges is more than three times higher than their number of nodes. It is also worth noting that the size of the models is not a discriminating feature per se: even smaller models provide complex and interesting disambiguation challenges (as will be shown in the following experiments), and we are not interested in efficiency evaluations. Moreover, Table 2 shows semantic features such as the mean and maximum number of terms' senses and the average similarity among the senses of each given term in the graph (‘‘Sense simil.’’ in the table, computed by using a variant of Eq. (2)).
Such similarity gives a coarse intuition of how hard it is to disambiguate the terms in a given graph: the higher the value, the more similar the senses of a term are to each other and, thus, the more difficult they possibly are to discriminate. As a final implementation note, we developed STRIDER using Java JDK 1.6, the Jena2 framework and the Berkeley DB database for persistent RDF storage.
6.2. Effectiveness evaluation

In our experiments we evaluated the performance of our disambiguation algorithms mainly in terms of effectiveness. Efficiency evaluation is not crucial for a disambiguation approach and is beyond the goal of this article, so it will not be examined in depth (in any case, the disambiguation process for the analyzed models required a few seconds at most). In order to produce a complete analysis, we computed full precision figures, synthesizing the disambiguation effectiveness of the system on the different models.

We first analyze the performance of the system with the default settings, i.e. those derived from a large number of preliminary tests on different models. These generally allow good disambiguation effectiveness and are available to the user without any configuration or intervention; in particular, they include crossing settings for all common file formats (more on this later), and adopt by default the thesaurus-based similarity and the thesaurus-based sense context algorithm. The impact of different similarities and sense context algorithms will be analyzed in successive sections; the effectiveness gains achievable through our automated feedback techniques will also be presented in depth in a dedicated final section.

Before delving into the test results, we shortly describe the rationale behind the STRIDER default crossing settings, which can be found in detail in Appendix A. In these settings, for relational, tree and graph models, we set specific costs for each kind of edge w.r.t. the specific kind of relationship and for both crossing directions. For instance, we set the costs for navigating FK edges very low; this captures the strong relationship linking the terms representing the referencing columns and those referenced, and thus obtains a high mutual influence in their disambiguation.
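A crossing setting of this kind can be pictured as a simple cost table. The sketch below is hypothetical (the edge-type costs follow Table 4 in Appendix A, but the default cost of 0.5 for unlisted edge types is our assumption for illustration):

```python
# Hypothetical encoding of the default crossing settings: a cost per
# edge type and crossing direction. Listed values follow Table 4; the
# fallback DEFAULT_COST is an assumed value for illustration only.
CROSSING_COSTS = {
    # edge type: (direct direction, opposite direction)
    "SR": (1.0, 0.8),                   # structural relationship
    "FK": (0.1, 0.1),                   # foreign key: strong mutual influence
    "owl:equivalentClass": (0.1, 0.1),  # equivalence: low cost
    "owl:disjointWith": (1.0, 0.8),     # inequality: high cost
}
DEFAULT_COST = (0.5, 0.5)               # assumed default for other edges

def cost(edge_type, opposite=False):
    direct, opp = CROSSING_COSTS.get(edge_type, DEFAULT_COST)
    return opp if opposite else direct
```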
Graphs originating from the conversion of SQL DDL models contain a limited number of edge kinds (such as primary keys, foreign keys and unique constraints), but ontologies have numerous kinds of relationships, including user-defined ones. We fixed a low cost for the most common ontology relationship types stating equivalent concepts (i.e. the OWL properties equivalentClass and equivalentProperty), a high cost for the relationship types concerning inequalities (i.e. the OWL properties disjointWith and differentFrom) and a default cost for any other. Finally, we derived two reachability threshold settings, one specifically devised for limiting the context (limited context setting) and thus useful for disambiguating lowly specialized and heterogeneous models, and the other not limiting the context (complete context setting), useful for more specific models. Following this intuition, in the next tests we employ the limited context setting for Group 1 models and the complete context for Groups 2 and 3 (in the next section we will also specifically test the impact of the reachability threshold settings on the obtained precision figures).

In order to evaluate the precision of our disambiguation approach, we compared the fully automated disambiguation results with the manual ‘‘gold standard’’. In general, precision P is the number of terms correctly disambiguated divided by the number of terms in the models. Since our system is able to produce complete ranking results, we can compute the precision P(M) at different levels of quality, by
Fig. 10. Mean precision levels for the three groups.
considering the results up to the first M ranks: for instance, P(1) is the percentage of terms for which the correct senses are at the top position of the ranking. Fig. 10 shows the mean precision P of the models in each group at three levels of quality (P(1), P(2), and P(3)) and details the three contributions described in the previous section (the graph context contribution (Graph), the sense context one (Sense) and the sense frequency one (Freq)), together with their combination (Total). It is worth noting that two baselines are also provided: the random baseline is denoted as (Rnd) in the figure, while the sense frequency contribution coincides with the well-known most frequent sense baseline. Indeed, we decided to show the sense frequency contribution results even if they do not directly give effectiveness indications, as their computation is not context-dependent: they contribute positively to the final results and, as a baseline, they show the disambiguation difficulty of the considered groups: the lower the sense frequency contribution, the more the involved terms are used with unusual meanings.

The combination of the three contributions produces good P(1) precision levels of 88% and 85% for Groups 1 and 2, respectively. Precision results for Group 3 are lower (nearly 75%), but we have to consider the high mean and maximum number of terms' senses; even in this difficult setting, the results are quite encouraging, particularly since P(2) is well above 91% for each group. As to the effectiveness of the sense context, notice that its contribution alone (Sense) is generally very near to the graph context one, even in the complex Group 3 setting, indicating good effectiveness of this approach too; further, in all three cases the combination of the three contributions (Total) produces better results than each contribution alone.
Finally, note that, for all groups, the random baseline is very distant from the results we achieved, confirming once again the quality of the STRIDER results and the complexity of the considered disambiguation tasks.
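The P(M) measure described above can be sketched as a short function; this is our illustrative rendering, not the authors' evaluation code. Note that a term counts as a hit when any of its (possibly multiple) gold-standard senses appears within the top M ranks.

```python
# Sketch of the P(M) precision measure: the fraction of terms whose
# correct (gold standard) sense appears within the top M ranks.
def precision_at(M, rankings, gold):
    """rankings: {term: list of sense ids, best first};
    gold: {term: set of acceptable sense ids}."""
    hits = sum(1 for t, ranked in rankings.items()
               if gold[t] & set(ranked[:M]))
    return hits / len(rankings)
```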
6.2.1. Impact of the crossing setting reachability threshold

We now specifically test whether disambiguating low specialization models with a limited context and those with medium/high specialization with a complete context setting really produces better precision results, as expected. Fig. 11 shows two illustrative comparisons between limited and complete contexts for the eBay model (Group 1, Fig. 11a) and the Computer model (Group 3, Fig. 11b). These models are representative of the two main behaviors: Group 1 models respond better to the limited crossing setting, while Group 2 and Group 3 show the opposite trend. Notice that the sense frequency contribution is not presented, as it is not affected by context variations. In the first case, the precisions P(1) of the graph, sense and total context contributions decrease from 94% to 88% when the complete context is used. This is due to the fact that Group 1 concepts are very heterogeneous, and limiting the context to directly related terms only reduces the disambiguation ‘‘noise’’ produced by the completely uncorrelated ones. For instance, when the complete eBay model is used to disambiguate the term chair in the furniture category, the top sense is ‘‘the position of professor’’ and not ‘‘a seat for one person’’, as the process is wrongly influenced by terms like speaker and fan as ‘‘persons’’ (‘‘someone who speaks’’ and an ‘‘enthusiastic devotee of sports’’, respectively). Instead, when the model terms are specific and more contextualized, such as in the other two groups and especially in Group 3, the result is the opposite: notice that the sense and total contributions increase beyond 90% for P(1) and 96% for P(2) when the complete context is used on the Computer model (Fig. 11b).

6.2.2. Evaluation of the alternative sense context contribution

The sense context contributions involved in the results presented in the previous sections have all been com-
Fig. 11. Mean precision levels comparison between limited and complete context: (a) typical Group 1 behavior (eBay model); (b) typical Group 2 behavior (Computer model).
Fig. 12. Comparisons between sense context contributions (standard vs alternative formula) and similarities (thesaurus-based vs corpus-based): (a) sense context algorithms comparison; (b) similarity comparison (Process model).
puted by means of the standard thesaurus-based sense context algorithm. In this section we report the results obtained on our models by means of the alternative algorithm. In particular, we present a small selection of the performed tests, together with some examples, in order to show its performance. The standard algorithm appears to be the most effective in the majority of situations; however, we found that in some cases the alternative algorithm was able to obtain even higher precision levels in disambiguating some of the models. Specifically, we noticed that in three models of our reference set the alternative method provides significant precision increments, typically in the range of 5–10% (as in the Process and DBLP models) and up to almost 13% in specifically complex situations (as in the Shakespeare model). Fig. 12a shows such a comparison for the Process and Shakespeare models. In these particular situations, the hypernym structures of WordNet are not completely adequate to describe some of the involved concepts and thus to disambiguate them. Being the
standard algorithm founded on the concept of minimum common hypernyms (only the confidences of the senses which are descendants of the mch are incremented), this may result in a non-optimal confidence computation. For instance, in the DBLP model, the sense chosen by the standard algorithm for the term pages is ‘‘US diplomat and writer Thomas Nelson Page’’ and not ‘‘a book or magazine page’’. This is due to the fact that many terms in the associated graph context (such as author) involve living things, and the derived minimum common hypernym may thus skew the successive computations. This does not happen with the alternative computation algorithm, since the similarities between the terms' senses directly contribute to the sense confidences and no minimum common hypernym computation is needed.

6.2.3. Evaluation of the corpus-based similarity

The thesaurus-based similarity function described in Section 5.1, on which the results presented in the previous sections are based, appears very effective in most situations;
however, we found that in some cases the hypernym hierarchies are not as accurate as expected, worsening the overall disambiguation quality. On the other hand, large document repositories are a finer-grained representation of reality which we may exploit in order to compensate for the hierarchies' shortcomings. In particular, we noticed that when the model we want to disambiguate is very generic, such as the low specialization group's models, some of the involved hierarchies can be quite misleading. The following tests show the contribution of the corpus-based similarity (Eq. (3)). Since sim(tx,ty) is used in both the graph and sense context evaluations, both contributions are affected, with a noticeable effect on the overall precision. In particular, we noticed a precision enhancement on some of the models, notably Process, Yahoo and Computer. Fig. 12b shows, on the left, the values of the Graph, Sense and Total contributions using the thesaurus-based similarities and, on the right, the corpus-based ones for the Process model. P(1) is enhanced by 3% up to 9%. For instance, even if the right meaning of the term parameter is in the thesaurus, the thesaurus-based similarities would incorrectly disambiguate it as ‘‘a quantity that characterizes a statistical population’’ and not as a ‘‘constant in a function’’. The corpus-based similarity avoids this because the huge amount of documents indexed by the search engine allows STRIDER to better discriminate the subtle shades of meaning that distinguish one sense from the others and that in some cases are not accurately represented in the thesaurus hypernym hierarchy.

6.2.4. Feedback evaluation

As discussed in Section 5.5, STRIDER provides a user feedback and an automated one. In this section, we return to the default settings and focus on the impact on effectiveness achieved by the latter. Generally speaking, our automated feedback methods improve the overall precision on about 70% of our models, without worsening it in any case.
We start by analyzing each method on the most significant cases, then present a complete report of the achieved precisions on all models. Method 1 performs a successive run after the first one, disabling the contributions
of all but the top X senses in the following run. For instance, by applying Method 1 with X=3 to the Student model, the results of the second run show a precision increment of almost 5%, and similar results are generally obtainable on most of the considered models. We achieved even better results using Method 2, which at each run knocks out from the ranking the sense of the term with the lowest confidence. Thus, successive runs are triggered with increasing precision, since the noise produced by the worst confidences in each ranking is eliminated run after run. Fig. 13a shows the behavior of the Method 2 feedback on the Travel model's P(1). Since the sense frequency contribution is not affected by the modifications to the context terms' senses made by the feedback, we only show the variations of the graph context and sense context contributions. We noticed that in the first runs the confidences of the models' terms vary subtly, without immediately modifying the corresponding rankings. For instance, in the example considered in Fig. 13a, both precisions remain unchanged for the first four runs (and thus are not shown in the figure), then they begin to increase gradually run after run. In particular, the graph context contribution increases from 85% in the first run to more than 91% in the ninth run; the sense context contribution achieves even better results, increasing from 85% in the first run up to more than 93% in the tenth run, which concludes the process. Moreover, the precision P(2) rises to a perfect 100% for both contributions from run 8 on, strengthening the case for the Method 2 feedback. Finally, Method 3 achieved results similar to Method 2 but proved to require a lower number of successive runs, showing that the confidences stabilize on the right values and thus that monitoring the residual vector allows the system to shorten the transient phase without losing quality. For instance, Fig. 13b shows the behavior of Method 3 for the PurchaseOrder model.
Precision P(1) of the graph context contribution increases from 80% in run 1 to almost 89% in run 2 and then remains stable for the following runs. Instead, the sense context contribution's P(1) needs six runs to raise its precision from 73% up to almost 89%, an increase of more than 15%. We also obtained the same precision results of Fig. 13a by
Fig. 13. Precision enhancements run by run: (a) Method 2 feedback (Travel model); (b) Method 3 feedback (PurchaseOrder model).
Table 3
Performance of auto-feedback techniques (Method 3).

                 Auto-feedback OFF           Auto-feedback ON
Models           P(1)   P(2)   P(3)          P(1)   P(2)   P(3)
Group 1
  Graph          0.886  0.976  0.981         0.895  0.976  0.981
  Sense          0.847  0.971  0.981         0.898  0.971  0.981
  Total          0.857  0.971  0.990         0.898  0.971  0.990
Group 2
  Graph          0.852  0.958  0.982         0.873  0.958  0.982
  Sense          0.847  0.958  0.976         0.860  0.958  0.976
  Total          0.902  0.976  0.976         0.902  0.976  0.976
Group 3
  Graph          0.750  0.914  0.956         0.803  0.932  0.956
  Sense          0.696  0.835  0.886         0.749  0.871  0.922
  Total          0.757  0.931  0.931         0.793  0.931  0.931
applying Method 3 to the Travel model, but with half the number of runs. Further, generally speaking, we noticed that the higher the mean and maximum number of terms' senses, the more considerable the contribution of Methods 2 and 3 to the overall precision. As we see in Figs. 13a and b, the precision increment due to the feedback is substantial for both the Travel and the PurchaseOrder models, and it is more evident for the latter than for the former (see Table 2 for the mean and maximum number of terms' senses). Our analysis of the automated feedback performance is concluded by Table 3, which shows a complete report of the achieved precision levels on all the considered models for Method 3. The precision increments are significant for all groups of models and are clearly visible by comparing the ‘‘feedback off’’ figures (left part of the table) to the ‘‘feedback on’’ ones (right part), proving that automated feedback can provide results as promising as those achieved in open text WSD settings [24].
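The stabilization test at the heart of Method 3 can be sketched as follows; this is an illustrative fragment, not the authors' code, with confidence vectors represented as plain lists.

```python
import math

# Sketch of Method 3's stabilization test: a term's confidence vector
# is considered stable when the Euclidean length of the residual
# vector Delta (entry-wise |phi_i(k) - phi_{i-1}(k)|) drops below eps.
def is_stable(phi_curr, phi_prev, eps=1e-3):
    resid = math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_curr, phi_prev)))
    return resid < eps
```

Once `is_stable` returns true for a term, only its top X senses are kept and the term takes no further part in the knockout loop, which is why Method 3 typically needs fewer runs than Method 2.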
7. Conclusions

Structural disambiguation is acknowledged as a very real and frequent problem for many semantic-aware applications. The solution we proposed is STRIDER 2.0, a unified answer to the various sense disambiguation problems on a large variety of structure-based information types, both at the data and metadata level, including relational schemas, XML data and schemas, taxonomies, and ontologies. The approach is knowledge-driven and leverages the combination and adaptation of some of the most effective free text WSD techniques to the domain WSD context, also taking advantage of the best features of two worlds of knowledge sources, i.e. electronic thesauri and text corpora. Since different disambiguation tasks require different context extraction processes, we proposed flexible ways of navigating the structures in order to extract each term's context; the latter can thus be tailored to the specific disambiguation needs and can include a large variety of
possibly available additional information, providing a significant improvement in disambiguation effectiveness. The extensive experimental evaluation we performed showed the good effectiveness of the presented solutions on a large variety of structure-based information and disambiguation requirements, even from a single disambiguation run. Moreover, we showed that such encouraging single-run results can be further improved by performing successive runs through the newly introduced automatic feedback techniques. Such a very satisfying level of effectiveness can be achieved without any user intervention; furthermore, it proves that our combination of WSD techniques can provide very good performance in a domain-constrained word context, even better than what is achievable in standard open text disambiguation settings. Ultimately, the achieved results can give a significant contribution towards the Semantic Web vision [62], making it possible for an ever growing number of applications to automatically select all and only the relevant knowledge in a more effective and accurate way.

Currently, STRIDER works on English language human-readable labels, which characterize a very large portion of the structured information available on the web. In future work, we plan to enhance the system with label normalization techniques (such as the ones discussed in Section 4.2) so as to be able to ‘‘smartly’’ expand abbreviations and acronyms, possibly exploiting context information from the structure. Further, we plan to exploit the algorithms which are independent of the thesaurus hierarchies together with commonly available web dictionaries, so as to test the effectiveness of the proposed approach on different languages. To this end, we have already performed some initial but promising tests on the Italian language.
Finally, we plan to devise a way to ‘‘hide’’ the wealth of settings and disambiguation possibilities from the user, thus achieving the best possible disambiguation results while avoiding the need for manual configuration depending on the model characteristics. The idea is to exploit extraction and analysis techniques so as to automatically detect the discriminating features of the involved models (such as the specialization level) and, consequently, to auto-configure the system to use the most suitable combination of algorithms and settings.
Acknowledgments

The authors would like to thank Enrico Ronchetti for his contribution to the design and implementation of the STRIDER disambiguation service, and the anonymous reviewers for their useful suggestions.
Appendix A. Default crossing setting

Table 4 shows our default crossing setting for each input format; arc types are those we found most
Table 4
Default crossing setting for each input format.

Edge                     Direct direction   Opposite direction   Applicability
SR                       1                  0.8                  Relational, Tree, Graph
FK                       0.1                0.1                  Relational
owl:differentFrom        1                  0.8                  Graph
owl:disjointWith         1                  0.8                  Graph
owl:complementOf         1                  0.8                  Graph
owl:equivalentClass      0.1                0.1                  Graph
owl:equivalentProperty   0.1                0.1                  Graph
owl:someValuesFrom       0.1                0.1                  Graph
rdfs:domain              0.1                0.1                  Graph
rdfs:label               0.1                0.1                  Graph
rdfs:range               0.1                0.1                  Graph
rdfs:subPropertyOf       0.1                0.1                  Graph
Table 5
SPARQL queries for SR and FK edges.

SR (Relational):
select ?table ?column ?tableURI ?columnURI
where { ?tableURI sql:name ?table. ?tableURI rdf:type sql:Table.
        ?tableURI sql:column ?columnURI. ?columnURI rdf:type sql:Column.
        ?columnURI sql:name ?column }

SR (Tree):
select distinct ?parent ?child ?parentURI ?childURI
where { { ?c xsd:schema ?b. ?b xsd:complexType ?parentURI. ?parentURI xsd:name ?parent.
          ?parentURI ?middlePredicate ?x. ?x xsd:element ?childURI. ?childURI xsd:name ?child.
          FILTER ( ?middlePredicate = xsd:sequence || ?middlePredicate = xsd:all ) }
        UNION
        { ?parentURI xsd:complexType ?y. ?parentURI xsd:name ?parent.
          ?y ?middlePredicate ?x. ?x xsd:element ?childURI. ?childURI xsd:name ?child.
          FILTER ( ?middlePredicate = xsd:sequence || ?middlePredicate = xsd:all ) }
        UNION
        { ?parentURI xsd:type ?child. ?parentURI xsd:name ?parent. ?childURI xsd:name ?child.
          ?childURI ?middlePredicate ?c. ?c xsd:element ?d. ?d xsd:name ?e.
          FILTER ( ?middlePredicate = xsd:sequence || ?middlePredicate = xsd:all ) } }

SR (Graph):
select ?parentURI ?childURI
where { ?childURI rdfs:subClassOf ?parentURI.
        ?parentURI rdf:type rdf:Class. ?childURI rdf:type owl:Class. }

FK (Relational):
select ?referencing ?referenced ?referencedURI ?referencingURI
where { ?a sql:foreignKeyRole ?referencedURI. ?a sql:uniqueKeyRole ?referencingURI.
        ?referencingURI sql:name ?referencing. ?referencedURI sql:name ?referenced }
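The SR query for graphs essentially collects (parent, child) pairs along rdfs:subClassOf arcs between declared classes. Its effect can be sketched in plain Python over a toy triple set; this is a didactic stand-in for the SPARQL evaluation the system performs (for simplicity both endpoints are checked against owl:Class here, whereas the printed query types the parent as rdf:Class):

```python
# Toy triples as (subject, predicate, object) tuples; a didactic stand-in
# for the RDF graph queried with SPARQL (see Table 5, SR/Graph).
triples = [
    ("Camera", "rdf:type", "owl:Class"),
    ("PurchaseableItem", "rdf:type", "owl:Class"),
    ("SingleLensReflex", "rdf:type", "owl:Class"),
    ("Camera", "rdfs:subClassOf", "PurchaseableItem"),
    ("SingleLensReflex", "rdfs:subClassOf", "Camera"),
]

def sr_pairs(triples):
    """Mimic the SR/Graph query: return (parent, child) for every
    rdfs:subClassOf arc whose endpoints are declared classes."""
    classes = {s for (s, p, o) in triples
               if p == "rdf:type" and o == "owl:Class"}
    return sorted((o, s) for (s, p, o) in triples
                  if p == "rdfs:subClassOf" and s in classes and o in classes)
```

On the toy data above, `sr_pairs(triples)` yields the two subclass arcs of the Camera fragment, ready to be turned into SR edges of the common graph format.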
significant for our disambiguation purposes. The SPARQL queries we used to specify the SR and FK arc types are presented in Table 5. Notice that SR (structural relationship) is presented three times, once for each input format supported by our disambiguation approach. The namespace prefix list used in the queries is:

PREFIX sql: <http://www.oim-converter.com/vocabulary/sqlddl/1.0>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema>
PREFIX owl: <http://www.w3.org/2002/07/owl>

Appendix B. Schema descriptions

In this section we show the schemas we chose for our experimental evaluation. As to relational schemas, Figs. 14 and 15 depict the Computer and Student schemas, respectively, while the PurchaseOrder schema has already been presented in Section 3 as our running example. Moreover, the trees we used in our tests, namely a small portion of Yahoo!'s web directories and of eBay's catalog, the schema extracted from Shakespeare's plays, and the entire DBLP XML schema, are shown in Figs. 16, 17, 18 and 19, respectively. Finally, we present the graphs we used in our experimental evaluation: the Camera ontology is shown in Fig. 20, while a selection of nodes and edges from the Travel and Process ontologies is shown in Figs. 21 and 22, respectively. Notice that boxes represent ontology classes, while arrows represent relationships between classes (object properties have an asterisk beside their label).

Fig. 14. Computer schema.

Fig. 15. Student schema.

Fig. 16. Yahoo schema.

Fig. 17. eBay schema.

Fig. 18. Shakespeare's plays schema.
Fig. 19. DBLP schema (terms containing acronyms and abbreviations are replaced with the text within square brackets).
Fig. 20. Camera schema.
Fig. 21. A portion of the Travel schema.
Fig. 22. A portion of the Process schema (object properties have an asterisk beside their label).
References

[1] L. Guo, J. Shanmugasundaram, G. Yona, Topology search over biological databases, in: ICDE, 2007.
[2] M. d'Aquin, E. Motta, M. Sabou, S. Angeletou, L. Gridinoc, V. Lopez, D. Guidi, Toward a new generation of semantic web applications, IEEE Intelligent Systems 23 (3) (2008) 20–28.
[3] M. Theobald, R. Schenkel, G. Weikum, Exploiting structure, annotation, and ontological knowledge for automatic classification of XML data, in: Proceedings of the International Workshop on Web and Databases (WebDB'03), 2003, pp. 1–6.
[4] L. Guo, X. Wang, J. Fang, Ontology clarification by using semantic disambiguation, in: Proceedings of the 12th International Conference on CSCW in Design (CSCWD), 2008, pp. 476–481.
[5] M. Theobald, H. Bast, D. Majumdar, R. Schenkel, G. Weikum, TopX: efficient and versatile top-k query processing for semistructured data, VLDB Journal 17 (1) (2008) 81–115.
[6] M. Ehrig, A. Maedche, Ontology-focused crawling of web documents, in: Proceedings of the 2003 ACM Symposium on Applied Computing (ACM SAC'03), 2003, pp. 1174–1178.
[7] J. Gracia, V. Lopez, M. d'Aquin, M. Sabou, E. Motta, E. Mena, Solving semantic ambiguity to improve semantic web based ontology matching, in: Proceedings of the Second International Workshop on Ontology Matching (OM), 2007.
[8] H. Paulheim, M. Rebstock, J. Fengel, Context-sensitive referencing for ontology mapping disambiguation, in: Proceedings of the International Workshop on Contexts and Ontologies: Representation and Reasoning (C&O:RR), 2007.
[9] J. Gracia, M. d'Aquin, E. Mena, Large scale integration of senses for the semantic web, in: Proceedings of the 18th International Conference on World Wide Web (WWW), 2009.
[10] V. Uren, P. Cimiano, J. Iria, S. Handschuh, M. Vargas-Vera, E. Motta, F. Ciravegna, Semantic annotation for knowledge management: requirements and a survey of the state of the art, Journal of Web Semantics 4 (1) (2006) 14–28.
[11] G.A. Miller, WordNet: a lexical database for English, Communications of the ACM 38 (11) (1995) 39–41.
[12] P. Bouquet, L. Serafini, S. Zanobini, Bootstrapping semantics on the web: meaning elicitation from schemas, in: Proceedings of the 15th International Conference on World Wide Web (WWW'06), 2006, pp. 505–512.
[13] F. Mandreoli, R. Martoglia, P. Tiberio, Approximate query answering for a heterogeneous XML document base, in: Proceedings of the Fifth Conference on Web Information Systems Engineering (WISE'04), 2004.
[14] E. Agirre, P. Edmonds (Eds.), Word Sense Disambiguation: Algorithms and Applications, Springer, 2007.
[15] R. Navigli, Word sense disambiguation: a survey, ACM Computing Surveys 41 (2) (2009).
[16] D. Shutt, P. Bernstein, T. Bergstraesser, J. Carlson, S. Pal, P. Sanders, Microsoft repository version 2 and the open information model, Information Systems 24 (2) (1999) 71–98.
[17] F. Mandreoli, R. Martoglia, E. Ronchetti, Versatile structural disambiguation for semantic-aware applications, in: Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM'05), 2005, pp. 209–216.
[18] F. Mandreoli, R. Martoglia, E. Ronchetti, STRIDER: a versatile system for structural disambiguation, in: Proceedings of the 10th International Conference on Extending Database Technology (EDBT'06), 2006, pp. 1194–1197.
[19] M. Lesk, Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, in: Proceedings of the Fifth Annual International Conference on Systems Documentation (SIGDOC'86), 1986, pp. 24–26.
[20] S. Banerjee, T. Pedersen, Extended gloss overlaps as a measure of semantic relatedness, in: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI'03), 2003, pp. 805–810.
[21] R. Mihalcea, Unsupervised large-vocabulary word sense disambiguation with graph-based algorithms for sequence data labeling, in: Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), 2005.
[22] R. Navigli, P. Velardi, Structural semantic interconnections: a knowledge-based approach to word sense disambiguation, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (7) (2005) 1075–1086.
[23] M. Galley, K. McKeown, Improving word sense disambiguation in lexical chaining, in: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI), 2003, pp. 1486–1488.
[24] R. Navigli, A structural approach to the automatic adjudication of word sense disagreements, Natural Language Engineering 14 (4) (2008) 547–573.
[25] M. Halliday, R. Hasan, Cohesion in English, Longman Group Ltd, 1976.
[26] R. Navigli, M. Lapata, Graph connectivity measures for unsupervised word sense disambiguation, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 1683–1688.
[27] R. Mihalcea, P. Tarau, E. Figa, PageRank on semantic networks, with application to word sense disambiguation, in: Proceedings of the 20th International Conference on Computational Linguistics (COLING), 2004.
[28] R. Rada, H. Mili, E. Bicknell, M. Blettner, Development and application of a metric on semantic nets, IEEE Transactions on Systems, Man and Cybernetics 19 (1) (1989) 17–30.
[29] C. Leacock, M. Chodorow, Combining local context and WordNet similarity for word sense identification, in: C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, 1998, pp. 256–283.
[30] J. Jiang, D. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, in: Proceedings of the 10th International Conference on Research in Computational Linguistics, 1997.
[31] P. Resnik, Using information content to evaluate semantic similarity in a taxonomy, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95), 1995, pp. 448–453.
[32] E. Agirre, G. Rigau, Word sense disambiguation using conceptual density, in: Proceedings of the 16th International Conference on Computational Linguistics (COLING), 1996, pp. 16–22.
[33] E. Budanitsky, G. Hirst, Semantic distance in WordNet: an experimental, application-oriented evaluation of five measures, in: Proceedings of the Workshop on WordNet and Other Lexical Resources, North American Chapter of the Association for Computational Linguistics (NAACL'01), 2001.
[34] E. Terra, C.L.A. Clarke, Frequency estimates for statistical word similarity measures, in: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL'03), 2003, pp. 244–251.
[35] P. Cimiano, S. Handschuh, S. Staab, Towards the self-annotating web, in: Proceedings of the 13th WWW Conference (WWW'04), 2004, pp. 462–471.
[36] P.D. Turney, Word sense disambiguation by web mining for word co-occurrence probabilities, in: Proceedings of the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (Senseval-3), 2004, pp. 239–242.
[37] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, A. Soroa, A study on similarity and relatedness using distributional and WordNet-based approaches, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2009, pp. 19–27.
[38] R.L. Cilibrasi, P.M.B. Vitanyi, The Google similarity distance, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE) 19 (3) (2007) 370–383.
[39] M. Strube, S.P. Ponzetto, WikiRelate! Computing semantic relatedness using Wikipedia, in: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI'06), 2006.
[40] L. Màrquez, G. Escudero, D. Martínez, G. Rigau, Supervised corpus-based methods for WSD, in: E. Agirre, P. Edmonds (Eds.), Word Sense Disambiguation: Algorithms and Applications, Springer, 2007.
[41] T. Pedersen, Unsupervised corpus-based methods for WSD, in: E. Agirre, P. Edmonds (Eds.), Word Sense Disambiguation: Algorithms and Applications, Springer, 2007.
[42] A. Kilgarriff, M. Palmer, Introduction to the special issue on SENSEVAL, Computers and the Humanities 34 (1–2) (2000) 1–13.
[43] A. Kilgarriff, P. Edmonds, Introduction to the special issue on evaluating word sense disambiguation systems, Journal of Natural Language Engineering 8 (4) (2002) 279–291.
[44] T.H. Ng, Getting serious about word sense disambiguation, in: Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, 1997, pp. 1–7.
[45] P. Edmonds, Designing a task for Senseval-2, Technical Report, University of Brighton, 2000.
[46] M. Cuadros, G. Rigau, Quality assessment of large scale knowledge resources, in: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006, pp. 534–541.
[47] J. Madhavan, P.A. Bernstein, A. Doan, A.Y. Halevy, Corpus-based schema matching, in: Proceedings of the 21st International Conference on Data Engineering (ICDE'05), 2005, pp. 57–68.
[48] R. Navigli, P. Velardi, A. Gangemi, Ontology learning and its application to automated terminology translation, IEEE Intelligent Systems 18 (1) (2003) 22–31.
[49] A. Tagarelli, S. Greco, Clustering transactional XML data with semantically-enriched content and structural features, in: Proceedings of the Fifth Conference on Web Information Systems Engineering (WISE'04), 2004, pp. 266–278.
[50] R. Trillo, J. Gracia, M. Espinoza, E. Mena, Discovering the semantics of user keywords, Journal of Universal Computer Science 13 (12) (2007) 1908–1935.
[51] L. McDowell, O. Etzioni, S. Gribble, A. Halevy, H. Levy, W. Pentney, D. Verma, S. Vlasseva, Mangrove: enticing ordinary people onto the semantic web via instant gratification, in: Proceedings of the Second International Semantic Web Conference (ISWC'03), 2003, pp. 754–770.
[52] F. Ciravegna, A. Dingli, D. Petrelli, Y. Wilks, User-system cooperation in document annotation based on information extraction, in: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW'02), 2002, pp. 122–137.
[53] V. Svatek, M. Labsky, M. Vacura, Knowledge modelling for deductive web mining, in: Proceedings of the 14th International Conference on Knowledge Engineering and Knowledge Management (EKAW'04), 2004, pp. 337–353.
[54] F. Ciravegna, S. Chapman, A. Dingli, Y. Wilks, Learning to harvest information for the semantic web, in: Proceedings of the First European Semantic Web Symposium (ESWS'04), 2004, pp. 312–326.
[55] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D.S. Weld, A. Yates, Web-scale information extraction in KnowItAll (preliminary results), in: Proceedings of the 13th International Conference on World Wide Web (WWW'04), 2004, pp. 100–110.
[56] S. Melnik, E. Rahm, P.A. Bernstein, Rondo: a programming platform for generic model management, in: Proceedings of the 22nd ACM International Conference on Management of Data (SIGMOD'03), 2003, pp. 193–204.
[57] J. Madhavan, P.A. Bernstein, E. Rahm, Generic schema matching with Cupid, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB'01), 2001, pp. 49–58.
[58] H. Do, E. Rahm, COMA—a system for flexible combination of schema matching approaches, in: Proceedings of the 28th VLDB Conference, 2002.
[59] S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema normalization for improving schema matching, in: ER, 2009, pp. 280–293.
[60] P. Resnik, Disambiguating noun groupings with respect to WordNet senses, in: Proceedings of the Third Workshop on Very Large Corpora, 1995, pp. 54–68.
[61] G.A. Miller, C. Leacock, R. Tengi, R.T. Bunker, A semantic concordance, in: Proceedings of the Workshop on Human Language Technology (HLT), 1993, pp. 303–308.
[62] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Scientific American 284 (5) (2001).
Rahm, COMA—a system for flexible combination of schema matching approaches, in: Proceedings of the 28th VLDB Conference, 2002. S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema normalization for improving schema matching, in: ER, 2009, pp. 280–293. P. Resnik, Disambiguating noun groupings with respect to wordnet senses, in: Proceedings of the Third Workshop on Very Large Corpora, 1995, pp. 54–68. G.A. Miller, C. Leacock, R. Tengi, R.T. Bunker, A semantic concordance, in: Proceedings of the workshop on Human Language Technology (HLT), 1993, pp. 303–308. T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Scientific American 284 (5) (2001).