A methodology to learn ontological attributes from the Web

David Sánchez

Intelligent Technologies for Advanced Knowledge Acquisition (ITAKA), Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Avda. Països Catalans, 26. 43007 Tarragona, Spain
Article history: Received 21 January 2009; received in revised form 25 January 2010; accepted 27 January 2010; available online 4 February 2010.

Keywords: Ontology learning; Meronyms; Attributes; Features; Web mining; Knowledge acquisition
Abstract

Class descriptors such as attributes, features or meronyms are rarely considered when developing ontologies. Even WordNet only includes a reduced amount of part-of relationships. However, these data are crucial for defining concepts such as those considered in classical knowledge representation models. Some attempts have been made to extract those relations from text using general meronymy detection patterns; however, there has been very little work on learning expressive class attributes (including associated domain, range or data values) at an ontological level. In this paper we take this background into consideration when proposing and implementing an automatic, non-supervised and domain-independent methodology to extend ontological classes in terms of learning concept attributes, data-types, value ranges and measurement units. In order to present a general solution and minimize the data sparseness of pattern-based approaches, we use the Web as a massive learning corpus to retrieve data and to infer information distribution using highly contextualized queries aimed at improving the quality of the result. This corpus is also automatically updated in an adaptive manner according to the knowledge already acquired and the learning throughput. Results have been manually checked by means of an expert-based concept-per-concept evaluation for several well distinguished domains, showing reliable results and a reasonable learning performance.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

Ontologies have emerged in recent years as a fundamental tool for formalizing and representing knowledge. They offer a formal and explicit specification of a shared conceptualization. With the massive growth of the information society and the success of the Web 2.0, the need for this kind of knowledge formalization model has become imperative. In fact, ontologies are a fundamental element for the success of the Semantic Web [68]. However, the construction of such structures is typically carried out by knowledge engineers and domain experts, resulting in long and tedious development stages. Given the massive scope of the Semantic Web, the manual approach is not scalable enough. Because of this knowledge representation bottleneck, researchers have put their efforts into aiding the ontology construction process [56].

Ontologies are composed of at least three elements: classes (concepts of the domain), relations (different types of binary associations between concepts or data-values) and instances (real world individuals). Formally [56], an ontology is presented as an object model comprising a set of concepts or classes C which are taxonomically related by the transitive is-a relation $H \subseteq C \times C$ (e.g. dog is a mammal) and non-taxonomically related by named object relations $R \subseteq C \times C \times String$ (e.g. cigarettes cause lung cancer).
From the point of view of automatic ontology learning, many approaches have been developed to acquire mainly domain concepts and organize them into taxonomies (as detailed in [56]). However, the identification of non-taxonomic relations has received very little attention [10,13]. Within the non-taxonomic field, we can identify special binary relations among concepts (object relations), which express part-of, and associations between objects and data-values (data-properties), which are typically referred to as attributes [1], features [8] or parts [39]. The former can be expressed as $P \subseteq C \times C \times \{part\text{-}of\}$, where C are classes of the ontology and part-of expresses a meronym relationship between concepts (e.g. p = (Digital_Camera, CCD_Sensor, part_of)). The latter can be expressed as $D \subseteq C \times T \times String$, where C is a class of the ontology, T are data-types and String is a textual representation of the relation (e.g. d = (Digital_Camera, Float, Price)). These relations (whose acquisition represents the central point of this paper) play a very important role in adding semantic content to ontological classes due to their descriptive nature. In fact, in classical theories of knowledge representation, concepts are defined in terms of their attributes (e.g. color, shape, size, etc.) [64]. Almuhareb and Poesio [1] demonstrated that identifying a concept by its attributes leads to a better lexical description. Therefore, this data represents a valuable aid in knowledge driven tasks such as question answering [67], information retrieval [6] or word-sense disambiguation [2]. However, an investigation of the structure of existing ontologies via the Swoogle ontology search engine [37] has shown that available ontologies rarely model these kinds of relationships.

From the point of view of ontology learning, the following tasks should be performed to acquire expressive part-of object relations and data-properties: (i) discovery and labelling of relevant properties for a domain and, for data-properties, (ii) identification of the appropriate data-type and specification of possible value restrictions. Due to the generality and unbounded nature of literal data-values and the inherent ambiguity of human natural language, these are challenging tasks [39]. In fact, as will be shown in the related work section, even though some approaches in the field of Information Extraction have been developed to extract features, very little work has been done on acquiring expressive class attributes and restrictions in the field of ontology learning.

In this paper we present a new methodology for acquiring class attributes at an ontological level. In addition to object relations, one of the paper's contributions is to address the discovery of data-properties and their associated data-types and value ranges. Moreover, unlike many previous approaches, the method has been designed in an automatic and domain-independent way, exploiting several well-established analytical techniques. In order to minimize the data sparseness which characterizes approaches based on the analysis of concrete and/or domain-dependent repositories, the Web is exploited as a social-scale learning source. Due to the unsupervised nature of the employed analytical techniques, the method relies on the Web information distribution in order to assess the reliability of the extracted knowledge. Specially designed statistical scores, based on collocation measures and highly contextualized assessments, are used to improve the accuracy of the results.
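For illustration only, the elements just introduced (classes, is-a links, part-of object relations and data-properties with their restrictions) can be pictured with a minimal data model. The following Python sketch is ours and does not correspond to any implementation described in the paper; all names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class DataProperty:
    """A data-property d in D ⊆ C x T x String, e.g. (Digital_Camera, Float, price)."""
    concept: str               # ontological class being described
    datatype: str              # T, e.g. "Float" or "String"
    label: str                 # textual name of the attribute
    value_range: tuple = None  # optional (min, max) restriction
    unit: str = None           # optional measurement unit

@dataclass
class Ontology:
    classes: set = field(default_factory=set)              # C
    is_a: set = field(default_factory=set)                  # H ⊆ C x C (taxonomic)
    object_relations: set = field(default_factory=set)      # R ⊆ C x C x String
    part_of: set = field(default_factory=set)               # P ⊆ C x C (meronymy)
    data_properties: list = field(default_factory=list)     # D

# Toy content mirroring the examples used in the text
onto = Ontology()
onto.classes |= {"Dog", "Mammal", "Digital_Camera", "CCD_Sensor"}
onto.is_a.add(("Dog", "Mammal"))
onto.object_relations.add(("Cigarette", "Lung_Cancer", "cause"))
onto.part_of.add(("Digital_Camera", "CCD_Sensor"))
onto.data_properties.append(DataProperty("Digital_Camera", "Float", "price"))
```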
The method has been manually evaluated for several well distinguished domains, showing its feasibility in obtaining relevant and reliable results in a scalable manner.

The rest of the paper is organized as follows. Section 2 presents an overview of previous approaches to learning concept attributes/features/meronyms from textual documents. Section 3 introduces the basis of our methodology, including a description of the main techniques employed to acquire and filter attribute candidates and a study of the Web as learning source. Section 4 describes in detail the proposed methodology, which is divided into a three-staged procedure and covers the acquisition of attributes, data-types and value ranges. Section 5 discusses some relevant aspects regarding the analysis of web resources and presents an adaptive algorithm for incremental corpus analysis. Section 6 describes the evaluation procedure and presents and discusses the results of several tests. Section 7 analyses the computational complexity of the proposed algorithms and shows the throughput and the practical feasibility of the methodology. The final section presents the conclusions and proposes some lines of future work.

2. Related work

The notion of concept attribute is not completely clear and the term has been used in widely different ways in the knowledge representation literature [1]. Guarino [48] classified attributes into relational ones (e.g. color, position) and non-relational ones (like object parts). In the qualia structure of the generative lexicon [30], four types of roles are identified: Constitutive Role (parts), Formal Role (qualities), Agentive Role (relational) and Telic Role (purpose). An analysis of previous research in the NLP literature on information extraction also shows different ways of referring to object–data–value relationships and how these fall into one or several of the previously stated categories. Typically, these relationships are considered as concept attributes [1] or features [8], which involve defining a certain data-type or measure range. Other authors refer to special object–object relations and talk about meronyms [45] or part-of relations [39]. In both these cases, the analysis only covers the discovery of the relation. In the present study, we generally refer to attributes as:

Definition 1 (attribute). Attributes are object–object part-of relationships and object–data–value properties which can help to semantically define and describe an ontological concept.

Therefore, this definition ranges from pure part-of relationships (e.g. the optical lens of a digital camera), which can be represented as special object–object relationships, to specific features (e.g. ISO or resolution of a digital camera) and properties (e.g. size or weight) which can be qualified or numerically quantified. In the last case, attribute data-types, values and measurement units will be specified at the class level to obtain a semantically-rich representation. We refer to this additional knowledge as attribute restrictions.
Definition 2 (attribute restriction). An attribute restriction is an assertion related to an ontological concept and an attribute. This assertion aims to specify an attribute's value ranges or data-types/measurement units.

If we look at previous research in the area, apart from manual approaches [69], we can distinguish different methods according to the kind of information used to discover the relations. A classical technique consists of using semi-structured information contained in the text. Tables or item lists are analysed to discover the data-value tuples associated with a certain class [9]. Tabular information is especially useful as it is one of the most structured constructions one can find in natural language resources. The main problem is to correctly analyse the table structure (fields, columns, separators, titles, etc.) in order to extract correct tuples. Consequently, the performance of the learning methods depends on the correctness of the inferred table model. An overview of table recognition can be found in Zanibbi et al. [63]. The conclusion is that few table models have been described. Some approaches focus on the recovery of tables from encoded documents [12]. Other attempts [7] assume that tables have already been harvested or that semi-structured pages associated to table contents are available [11]. Another problem of table-based methods is the availability of this kind of structure. Even though technological domains (which are the examples used to evaluate the approaches mentioned) are very prone to describing entities in a tabular fashion, for more general domains tables are scarcer, partial and incomplete, thus hampering their applicability as a domain-independent solution.

Other approaches also exploit semi-structured information but use a knowledge base (typically an ontology) to properly interpret textual content. Schekotykhin et al. [35] present an automatic method for instantiating predefined attribute-filled ontologies. Ontological information about typical attribute labels and value ranges is used to find specific instances. However, the problem of the proper ontology definition is left to the knowledge engineer. Yoshinaga and Torisawa [51] also exploit semi-structured information (HTML pages containing tables or lists) to extract attribute–value pairs by studying specific web layouts. An et al. [74] exploit the HTML structure of deep web information repositories (such as book libraries, air companies, etc.) in order to extract relevant attributes. As with classical web wrappers, the main problems of these approaches are a rigid dependency on the web structure (which forces the design of specific extractors for each data repository) and the availability of that repository in a specific format or layout for the analysed domain [52]. Probst et al. [34] also deal with web repositories but do not rely on structured information to extract attributes from product descriptions; instead they require sample domain data which they use to train a classifier that is able to detect attribute labels and limited values. Domain data is also used by Wu and Weld [22], who exploit manually composed Wikipedia infoboxes as training data for classifiers in order to enrich an ontology.

Other works, more closely related to our research, exploit language regularities to discover attributes.
Linguistic patterns are usually employed to express part-of relations. New concepts extracted from those patterns are used as feature labels [39] and values [1]. If patterns are general and text repositories are big and heterogeneous enough, the main advantage of those approaches is their domain independence. This approach was applied in Tokunaga et al. [36] to small collections of Japanese web documents. However, the local statistical analysis and dependence on the document structure hampered the generality of the results. In Fleischman et al. [41], attribute–value pairs are extracted by filtering pattern-based candidates using a model that has been acquired through supervised learning and which is limited to person names. In Girju et al. [61], part-of patterns are learned from manually pre-tagged text, but only object–object relationships are extracted. Pasca et al. [44] also extracted class attributes from the Web using patterns and local statistics, but this method required a list of precompiled class instances to use as seeds. In more recent works [31], a slightly supervised approach is proposed using a set of precompiled attributes as seeds for attribute discovery on web documents and query logs. Finally, in [66] a similar approach is employed but additionally relying on the HTML structure of web documents to identify relevant attributes. Patterns have also been used in the field of information extraction [53] to compile large amounts of product features from reviews [42]. In this case, features are associated with users' opinions instead of appropriate data ranges. A similar approach requires human input for every learning iteration [50]. OPINE [8] uses web queries involving meronym patterns to determine web information distribution in order to assess extracted candidates from review sites. It uses WordNet hierarchies to discriminate candidate types.

From this survey, we can conclude that automatic and non-supervised methodologies focused on learning attributes from an ontological point of view are scarce. Therefore, our approach aims to contribute to this area. As we will describe in the following sections, it relies on well known techniques such as the use of meronym-discovery patterns to extract concept candidate attributes. Patterns will be general enough to be able to retrieve concept meronyms in a domain-independent way (e.g. the lens of a camera) as well as qualities and relational attributes (e.g. weight, colour). However, as the results derived from an ontology learning method need to be more accurate than those from general information extraction systems, the candidates obtained will be iteratively evaluated by means of specially designed statistical scores. These aim at minimizing ambiguity and increasing the quality of the results compared with other approaches (as will be shown in the evaluation section). In order to achieve greater expressiveness, in contrast to previous approaches, attributes will also be refined in an incremental fashion (with a self-adaptive corpus analysis) to discover, when needed, associated data-types, value ranges and measurement units. This additional information can improve the semantic content of ontological classes by defining attribute restrictions.
Table 1
Meronym detection patterns defined by Berland and Charniak [39].

Pattern               | Example                           | Relation
NP's NP               | ... camera's sensor ...           | Meronym("sensor", "camera")
NP of {the|a|an} NP   | ... resolution of the camera ...  | Meronym("resolution", "camera")
NP in {the|a|an} NP   | ... exposure in a camera ...      | Meronym("exposure", "camera")
NP of NPs             | ... speed of processors ...       | Meronym("speed", "processor")
NP in NPs             | ... cache in processors ...       | Meronym("cache", "processor")
3. Background

In this section we present the bases of our methodology. First, we analyze the characteristics of the Web as a learning corpus which can be exploited to aid the learning process. Then, we describe the main techniques employed for knowledge acquisition, that is, linguistic patterns and Web-based statistical analyses.
3.1. The Web as a learning corpus

Many classical knowledge acquisition techniques have performance limitations due to the use of a typically reduced corpus [20]. The use of massive amounts of heterogeneous data can bring benefits to unsupervised learning techniques and can minimize the constraints regarding the availability of a corpus for the domain analyzed. This idea is supported by current social studies in which it is argued that collective knowledge is much more powerful than individual knowledge [32]. The Web is the biggest repository of information available [20], with more than 1000 billion web resources indexed by Google (see http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html) covering almost any possible domain. Web data, understood as individual observations in web resources, may seem unreliable due to its uncontrolled publication and "dirty" form [5]. However, taking the Web as a global source, it has been demonstrated that the amount and heterogeneity of the information available are so high that it can approximate the real distribution of information on a social scale [59].

In addition to the amount of information, another interesting characteristic of web data is its high redundancy. This is especially important because, on the one hand, the degree to which information is repeated can be a measure of its reliability (a factor that has been exploited in [38]). On the other hand, the fact that the same information may appear expressed in many different -more or less complex- forms can support the development of reliable and scalable shallow linguistic analytic techniques [20,52,49], thus avoiding, or at least minimizing, data sparseness problems. Because a direct analysis of such an enormous repository is impracticable, web search engines can be exploited as effective web information retrieval [27] and extraction [29] tools. However, their main drawbacks are the limited access to web resources (Google, for example, only indexes 1000 web sites per query) and the restricted expressiveness of the query matching [73] and query language. Therefore, special emphasis should be placed on constructing appropriate queries so as to retrieve relevant domain-related resources.
3.2. Text analysis using linguistic patterns

Linguistic patterns have been extensively used in the past to develop non-supervised information extraction [52], knowledge acquisition methodologies [14,55] and enhanced information systems [21]. These approaches use regular expressions that indicate a relation of interest within the text. General lexical-syntactic patterns can be designed by hand [47,39] or can be learned by using a set of pre-related concepts and domain texts [45,57]. One of the most important successes resulting from the application of patterns is the discovery of taxonomic relationships. Hearst [47] studied and defined a set of domain-independent patterns for hyponymy discovery which have provided the basis for further refinements and learning approaches [43]. As with hyponymy/hypernymy, one may define linguistic patterns which express roles [54], metaphor and simile [70] or other kinds of relationships such as meronymy, holonymy, telicity, etc. [45]. For example, Berland and Charniak [39] define a set of general patterns to discover meronymy relationships within text (see Table 1). These patterns are general enough to capture data-value relations such as qualities (see Section 2).

In addition to the patterns presented, Almuhareb and Poesio [1] have demonstrated that the extraction precision can be increased by insisting on the presence of the to be verbal form to ensure that the attribute actually stands for a concept (i.e. to avoid modifiers), as shown in Table 2. We have taken this basic set of patterns (Table 2) and extended it by incorporating other verbal forms indicating inclusion (Table 3). These have been exploited in previous research to extract product features [8]. As with the previous refinement, the different variations of the verb are considered.
Table 2
Refined set of meronym detection patterns defined by Almuhareb and Poesio [1].

Pattern                                | Example                              | Relation
NP's NP {is|are|was|were}              | ... camera's sensor is ...           | Meronym("sensor", "camera")
NP of {the|a|an} NP {is|are|was|were}  | ... resolution of the camera is ...  | Meronym("resolution", "camera")
NP in {the|a|an} NP {is|are|was|were}  | ... exposure in the camera is ...    | Meronym("exposure", "camera")
NP of NPs {is|are|was|were}            | ... speed of processors is ...       | Meronym("speed", "processor")
NP in NPs {is|are|was|were}            | ... cache in processors is ...       | Meronym("cache", "processor")
Table 3
Additional set of patterns defined with verbs indicating inclusion.

Pattern                           | Example                             | Relation
NP have|has|had NP                | ... camera has iso ...              | Meronym("iso", "camera")
NP come|comes|came with NP        | ... camera comes with lens cap ...  | Meronym("lens cap", "camera")
NP feature|features|featured NP   | ... camera features zoom ...        | Meronym("zoom", "camera")
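As an illustration, the patterns of Tables 2 and 3 can be instantiated for a given concept as (simplified) regular expressions. The sketch below is ours: it approximates noun phrases with a crude token pattern, whereas the actual method relies on POS-tagging and chunking (Section 4.1).

```python
import re

# Crude noun-phrase approximation: up to four modifiers followed by a head token.
NP = r"(?:[\w-]+ ){0,4}[\w-]+"

def meronym_patterns(concept: str):
    """Instantiate the meronym patterns of Tables 2 and 3 for a concept.
    Each regex captures the candidate attribute noun phrase in group 'attr'."""
    c = re.escape(concept)
    be = r"(?:is|are|was|were)"
    return [
        rf"{c}'s (?P<attr>{NP}) {be}",                           # NP's NP is|are|was|were
        rf"(?P<attr>{NP}) of (?:the|a|an) {c} {be}",              # NP of the|a|an NP ...
        rf"(?P<attr>{NP}) in (?:the|a|an) {c} {be}",              # NP in the|a|an NP ...
        rf"(?P<attr>{NP}) of {c}s {be}",                          # NP of NPs ...
        rf"(?P<attr>{NP}) in {c}s {be}",                          # NP in NPs ...
        rf"{c}s? (?:have|has|had) (?P<attr>{NP})",                # NP have|has|had NP
        rf"{c}s? (?:come|comes|came) with (?P<attr>{NP})",        # NP come|comes|came with NP
        rf"{c}s? (?:feature|features|featured) (?P<attr>{NP})",   # NP feature|features|featured NP
    ]

text = "this digital camera comes with a high quality zoom lens"
for p in meronym_patterns("digital camera"):
    m = re.search(p, text, flags=re.IGNORECASE)
    if m:
        print(m.group("attr"))   # -> "a high quality zoom lens"
```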
All the patterns described have been manually constructed from observations found in natural language texts. They represent domain-independent regular expressions which can potentially be used in any domain of knowledge. However, as stated in [39], the main problem with these "closed" approaches is the sparseness of data. In general, these approaches are unlikely to find a significant amount of matchings for a pair of related concepts if they use those patterns in a limited corpus. More matchings or data would allow recall to be improved and more robust statistical measures to be computed. On considering this problem, some authors [45,57] have tried to extend this basic set of patterns by using pre-related concepts and a domain-related corpus to provide the basis for learning regular expressions. The result is an additional set of patterns which, in many situations, includes domain-dependent concepts within the regular expression (e.g. NP is a city in NP, NP is the capital of NP). Even though those patterns could potentially increase recall, their lack of generality may limit their applicability.

To avoid having to pay those trade-offs, we have opted for a different approach. Instead of applying a carefully selected/learned and numerous set of extraction patterns over a reduced set of domain-related documents, we use the basic set of domain-independent patterns over the biggest repository currently available: the Web. As outlined in Section 3.1, both the amount and redundancy of web information help to minimize the data sparseness problems typically associated with restricted pattern-based approaches.

3.3. Web-based statistical analysis

As described in Section 2, the use of linguistic patterns to extract attributes has been demonstrated to be an effective approach. Several authors [39,1] have been able to compile a significant amount of candidate attributes that can certainly increase the semantic content of concepts. However, the use of such a shallow and general approach limits their performance because of the ambiguity related to natural language texts and the unreliability of individual observations taken from an uncontrolled repository such as the Web [8]. In general, when using non-supervised learning techniques, it is necessary to assess the suitability of the extracted knowledge and the scope of the domain explored [52]. Some authors have addressed these kinds of issues by using knowledge bases (such as WordNet) [19,25], pre-tagged examples [23] or experts' opinions [46]. The need for domain knowledge or for user intervention both introduce trade-offs to the learning methodology, thus hampering its potential applicability. The use of statistical analyses to infer the semantic relatedness between concepts has proven to be an appropriate technique [52] which does not lead to the loss of the non-supervised and domain-independent nature of our approach.

From the point of view of information extraction, the suitability of each candidate may be evaluated using a score computed from the degree of term occurrence [38] or co-occurrence in the text [52]. This problem has been extensively studied in the literature, resulting in many different concept similarity/relatedness scores (most of which are summarized in [65]). From a non-supervised point of view, the statistical assessment of the semantic relatedness between concepts can be computed from their co-occurrence; authors [58,8] typically use a measure derived from the standard collocation function between terms (1):
$$c_k(a,b) = \frac{p(ab)^k}{p(a)\,p(b)} \qquad (1)$$
where $p(a)$ is the probability that the word a occurs within the text and $p(ab)$ is the probability that words a and b co-occur. Here, the collocation of a and b is defined as the comparison between the probability of observing a and b together and observing them independently. If a and b are statistically independent, the probability that they co-occur is given by the product $p(a)p(b)$. If they are not independent, and they have a tendency to co-occur (which is the case of words in a corpus),
$p(ab)$ will be greater than $p(a)p(b)$. Therefore, the ratio between $p(ab)$ and $p(a)p(b)$ is a measure of the degree of their statistical dependence. From this formula, the symmetric conditional probability (SCP) can be defined as $c_2$ [28] and the pointwise mutual information (PMI) as $\log_2 c_1$ [60].

Again, the problem of such unsupervised techniques is data sparseness. As with pattern-based information extraction, Brill [20] has also demonstrated the suitability of using a wide corpus, such as the Web, to improve the quality of classical statistical methods. The problem is that, in most cases, it is not practical to analyze such an enormous repository when measuring co-occurrence. However, the availability of massive web information retrieval (IR) tools can help in this task. It is claimed that the probabilities of web search engine terms, conceived as the frequencies of page counts returned by the search engine divided by the number of indexed pages, approximate the relative frequencies of those search terms as actually used in society [59]. Following this premise, Turney [58] adapted PMI to approximate term probabilities from web search hit counts. His research presents several heuristics for exploiting the statistics provided by web search engines. He defined a score (2) to compute the collocation between an initial word (problem) and a related candidate concept (choice) in the form:
$$Score(choice, problem) = \frac{hits(problem \text{ AND } choice)}{hits(choice)} \qquad (2)$$
The formula is very similar to PMI ($\log_2 c_1$), but since we are looking for the maximum score among a set of choices -or candidates-, the $\log_2$ can be dropped and $p(problem)$ can be removed from the denominator because it has the same value for all choices. Note also that the corpus size (the total number of webs indexed by a search engine), which should divide each hit count in order to obtain term probabilities, is also eliminated because it is common to the numerator and the denominator. This measure has been extensively used for Web-based information retrieval [8] and knowledge acquisition [14,13,55] tasks and is one of the best performing measures for the unsupervised ranking of a list of alternatives (particularly in the classical TOEFL test [72]). As ranking and selecting candidates from a list of extracted ones is precisely what we want to achieve during the attribute discovery, our selection scores will also exploit explicit term co-occurrences computed from web search engine hit counts.

4. Attribute learning methodology

In this section we describe the proposed automatic methodology for learning concept attributes and associated attribute restrictions. As shown in Fig. 1, three stages have been defined: (i) the extraction and selection of attributes (corresponding to boxes 1.1 and 1.2), differentiating object–object relations and object–data–value properties (box 1.2), (ii) the extraction and selection of attribute values (boxes 2.1 and 2.2, respectively), and (iii) the identification of measurement units (box 3). The latter two stages, aimed at learning attribute restrictions, are only considered for object–data–value properties. All the stages are executed in an iterative fashion, starting with an initial ontological concept and exploiting the knowledge already acquired to contextualize the analysis.

4.1. Attribute extraction and selection

As Fig. 2 shows, the first stage of the learning process receives a concept and a patternSet as input.

Definition 3 (concept). A concept corresponds to the word or list of words which represent the domain for which attributes should be retrieved. Those terms will be used as keywords to create web queries and construct pattern-based extraction rules.

Definition 4 (patternSet). PatternSet is the list of meronym patterns written as regular expressions, as presented in Tables 2 and 3.

First, the system uses the concept (e.g. digital camera) and the patternSet to construct a set of queries using the web search engine query language (e.g. digital camera has, digital camera comes with, digital camera's * is).

Definition 5 (query). A query is any string created from the combination of a concept's terms or a pattern's regular expression. It is executed on a web search engine in order to retrieve web resources or compute statistics.

In this case, constructed queries are written between double quotes (" ") in order to force the search engine to provide exact pattern matches. Search wildcards (*) are also used if they are required by the pattern's regular expression (and supported by the web search engine). As stated in Section 3.1, in order to overcome recall limitations that may be introduced by the constrained web query language, it is important to carefully compose each queried expression, maintaining the concordance between the verb and the concept (e.g. digital camera has, digital cameras have) and trying all the possible verbal tenses (e.g. digital camera comes with, digital camera came with).
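A minimal sketch of this query-construction step (our own illustration; pluralisation and verb inflection are naively simplified with respect to the morphological concordance described above):

```python
def build_queries(concept: str) -> list[str]:
    """Build exact-match web queries ("...") for a concept and the pattern set,
    keeping number agreement between the concept and the verb and trying all tenses."""
    singular, plural = concept, concept + "s"          # naive pluralisation
    queries = []
    # Possessive pattern (the wildcard * stands for the attribute noun phrase)
    for be in ("is", "are", "was", "were"):
        queries.append(f'"{singular}\'s * {be}"')
    # Prepositional patterns
    for det in ("the", "a", "an"):
        for be in ("is", "are", "was", "were"):
            queries.append(f'"of {det} {singular} {be}"')
            queries.append(f'"in {det} {singular} {be}"')
    for be in ("is", "are", "was", "were"):
        queries.append(f'"of {plural} {be}"')
        queries.append(f'"in {plural} {be}"')
    # Verbal inclusion patterns, with subject-verb concordance
    queries += [f'"{singular} has"', f'"{plural} have"',
                f'"{singular} had"', f'"{plural} had"',
                f'"{singular} comes with"', f'"{plural} come with"',
                f'"{singular} came with"', f'"{plural} came with"',
                f'"{singular} features"', f'"{plural} feature"',
                f'"{singular} featured"', f'"{plural} featured"']
    return queries

print(build_queries("digital camera")[:3])
# ['"digital camera\'s * is"', '"digital camera\'s * are"', '"digital camera\'s * was"']
```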
Some examples of queries for different pattern combinations are presented in Table 4. Even though some web search engines incorporate stemming capabilities (e.g. digital camera and digital cameras should return the same set of results), they are also characterized by their high variance if there are small changes in the web query [53].
Fig. 1. General architecture of the attribute learning methodology.
Therefore, in practice, given that a web search engine only indexes a limited set of resources per query and that the most relevant ones are presented first, searching for the same pattern in its different textual forms helps to widen the result set without compromising the domain independence of the analysis. Non-intrusive query-expansion algorithms have proven to be effective when retrieving resources from the Web [53]. Each query is executed and, as a result, a webSet to be analysed is retrieved.

Definition 6 (webSet). WebSet is any set of web resources retrieved from a web search engine for a specific query. Webs are considered as text sources which are used for linguistic analyses.

Section 5 will describe the policy used to decide the specific amount of web resources to analyse (a function of the constant NUMBER_WEBS presented in Fig. 2). From the webSet, each web content is parsed (i.e. the clear text is extracted) and linguistically analysed using sentence detection, tokenization, POS-tagging and chunking in order to find pattern matches; these analyses are performed with the OpenNLP Tools (http://opennlp.sourceforge.net/), a Java package that hosts a variety of natural language processing tools based on maximum entropy models for English texts [3]. The pattern's regular expression is used at this point to extract attributeCandidates from the text if a complete matching -including the appropriate morphological form- is found (e.g. digital camera has high resolution).

Definition 7 (attributeCandidate). AttributeCandidate is a word or set of words extracted from text by means of a meronym pattern regular expression. It represents a potential attribute for the domain explored.

AttributeCandidates compiled for each pattern are also processed via an English stemming algorithm in order to detect terms with the same meaning but different lexical forms (e.g. lens = lenses). Because attributes are used to express a concept's characteristics, it is very common to find qualified or quantified attributeCandidates (e.g. very high resolution, 1600 ISO, high quality zoom lens).
Fig. 2. Attribute extraction and selection algorithm.
However, as we are interested at this stage in discovering the attribute label or meronym object, we apply a POS-tagging-based analysis to each attributeCandidate to remove qualifiers and obtain the main noun(s) (e.g. resolution, ISO, zoom lens).

Due to the large amount of dirty data that may be found on the Web, a lot of noise may affect the attributeCandidates because they are extracted from individual observations. Thus, the next step consists of assessing which ones usually express a real attribute-like relationship on a large scale. At this point we are only interested in the suitability of the attributeCandidate as a concept's semantic characteristic and not in the concrete type (e.g. part-of, qualifier, telic, etc.). Cafarella et al. [40] address this stage by counting the number of appearances in the corpus (in our case, a set of web resources). As described in Section 3.3, there are limitations to approaches based on a limited number of observations; therefore, we opted to use the Web as a whole to calculate robust measurements of information distribution. Section 3.3 also described how web-scale statistics can be obtained from the hit count retrieved for specially formulated queries. Therefore, like Turney [58], we can search at web scale for explicit term co-occurrences between the input concept and the attributeCandidate in order to measure their degree of relationship (3):
$$Score_1(attributeCandidate) = \frac{hits(concept \text{ AND } attributeCandidate)}{hits(attributeCandidate)} \qquad (3)$$
This is the strategy implemented by other authors who also employ Web-based statistical approaches dealing with ontology learning [58,52,8]. The problem is that the type of semantic relationship between terms cannot be assessed from their absolute co-occurrence in a document set. Concepts may co-occur due to many semantic relationships such as synonymy, hyponymy and even antonymy. Consequently, relying exclusively on decontextualized term co-occurrence as an indication of meronym-type relatedness results in ambiguous queries and a poor statistical assessment (as will be tested in Section 6). In order to minimize this problem, we contextualize term co-occurrence by forcing the explicit appearance of the meronym pattern in the query string. In this manner, term co-occurrence is estimated from the number of their explicit appearances in an attribute-like linguistic construction (4). For each attributeCandidate, we construct a query involving the attributeCandidate, the concept and each pattern's regular expression (e.g. digital camera has resolution, digital camera comes with zoom lens). Note again the use of double quotes (" ") to ensure the exact occurrence of the queried expression.
$$Score_2(attributeCandidate, pattern) = \frac{hits(pattern(concept, attributeCandidate))}{hits(attributeCandidate)} \qquad (4)$$
Our hypothesis (which will be tested in the evaluation section) is that this kind of contextualized query, even though it underestimates term co-occurrences (as only pattern-matching expressions are considered), results in less ambiguity and higher accuracy in the statistical assessment of meronymy.
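As a side-by-side illustration of the two assessments (our own example, using the query styles described above):

```python
# Decontextualized co-occurrence used by Score_1 (Eq. (3)) vs. the
# pattern-contextualized co-occurrence used by Score_2 (Eq. (4)),
# expressed as the exact-match queries sent to the search engine.
query_score1 = '"digital camera" AND "resolution"'   # any co-occurrence: ambiguous
query_score2 = '"digital camera has resolution"'     # meronym-like construction only
```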
Table 4
Web queries constructed for each pattern's regular expression for the digital camera domain and number of results provided by Google. Note that the plural version of the first pattern (e.g. "digital cameras' * is") is not supported by Google as it omits the ' character.

Pattern                                | Web queries                        | Google Web hit count
NP's NP {is|are|was|were}              | "digital camera's * is"            | 603
                                       | "digital camera's * are"           | 69,000
                                       | "digital camera's * was"           | 42,100
                                       | "digital camera's * were"          | 25,700
NP of {the|a|an} NP {is|are|was|were}  | "of the digital camera is"         | 15,200
                                       | "of a digital camera is"           | 16,000
                                       | "of the digital camera are"        | 299
                                       | "of a digital camera are"          | 91,800
                                       | "of the digital camera was"        | 109,000
                                       | "of a digital camera was"          | 12,700
                                       | "of the digital camera were"       | 34
                                       | "of a digital camera were"         | 7
NP in {the|a|an} NP {is|are|was|were}  | "in the digital camera is"         | 306
                                       | "in a digital camera is"           | 9500
                                       | "in the digital camera are"        | 114
                                       | "in a digital camera are"          | 95,300
                                       | "in the digital camera was"        | 102,000
                                       | "in a digital camera was"          | 14
                                       | "in the digital camera were"       | 7520
                                       | "in a digital camera were"         | 6
NP of NPs {is|are|was|were}            | "of digital cameras is"            | 25,000
                                       | "of digital cameras are"           | 15,100
                                       | "of digital cameras was"           | 5150
                                       | "of digital cameras were"          | 11,200
NP in NPs {is|are|was|were}            | "in digital cameras is"            | 511
                                       | "in digital cameras are"           | 158,000
                                       | "in digital cameras was"           | 2540
                                       | "in digital cameras were"          | 21,800
NP have|has|had NP                     | "digital camera has"               | 186,000
                                       | "digital cameras have"             | 278,000
                                       | "digital camera had"               | 10,800
                                       | "digital cameras had"              | 976
NP come|comes|came with NP             | "digital camera comes with"        | 18,900
                                       | "digital cameras come with"        | 18,800
                                       | "digital camera came with"         | 19,700
                                       | "digital cameras came with"        | 77
NP feature|features|featured NP        | "digital camera features"          | 168,000
                                       | "digital cameras feature"          | 14,000
                                       | "digital camera featured"          | 1670
                                       | "digital cameras featured"         | 433
However, as the hit counts presented in Table 4 show, each pattern presents a different degree of generality. On the one hand, very general -and more ambiguous- patterns (e.g. NP has) may result in a large amount of matchings not necessarily expressing attributes. On the other hand, more specific patterns (e.g. NP comes with) may provide a less ambiguous, but limited, estimation. This is a variable which should be taken into consideration in the statistical assessment in order to fully capture the notion of collocation (i.e. the probability of a candidate + concept co-occurrence in a meronym-like manner with respect to the candidate and the pattern probabilities). This is done by multiplying the denominator by the hit count of the pattern's query considering only the input concept (e.g. digital camera has, digital camera comes with) as a normalization element. In practice, this results in downgrading those attributeCandidates that appear in patterns that are very frequent (and thus likely to be more ambiguous) and promoting those attributeCandidates that occur in highly contextualized (and more reliable) patterns. To maintain mathematical coherency, we square the numerator to compensate for the corpus size which should divide the pattern's hit count in order to express a probability. Note that the resulting formula (5) follows the same principle as the symmetric conditional probability (SCP) measure [28] introduced in Section 3.3, but estimating term probabilities from web hit counts.
$$Score_3(attributeCandidate, pattern) = \frac{\left(hits(pattern(concept, attributeCandidate))\right)^2}{hits(attributeCandidate) \cdot hits(pattern(concept))} \qquad (5)$$
With the proposed formula, each pattern results in a different set of queries and, at the same time, in a different score value. Each one estimates, in a normalized manner, the explicit co-occurrence of the attributeCandidate from the point of view of a
particular pattern. As each one represents an underestimation of the total amount of meronym-like co-occurrences at a Web scale, their sum will summarize all the explicit pattern-based occurrences. Since the wider the sample data, the more reliable the statistical analysis will be [20], we set the final score as the sum of the values obtained by using each pattern in the patternSet (6). With this mechanism we try to avoid any constraint or dependency between the attributeCandidate and the use of a particular pattern; for example, the fact of finding a pattern matching in the text for a candidate does not imply that this is its most common textual form.
$$Score_4(attributeCandidate) = \sum_{pat_i \in patternSet} \frac{\left(hits(pat_i(concept, attributeCandidate))\right)^2}{hits(attributeCandidate) \cdot hits(pat_i(concept))} \qquad (6)$$
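A sketch of how the cumulative score of Eq. (6) can be computed on top of search engine hit counts. The hits() helper is a stand-in for whatever web search API is used (no concrete service or query syntax is prescribed here), the pattern list is abbreviated, and the threshold anticipates the MINIMUM_ATTRIBUTE_SCORE value and the wildcard-based query generalization discussed below; all identifiers are illustrative.

```python
def hits(query: str) -> int:
    """Stand-in for a web search engine call returning the hit count of an
    exact-match query; the real method issues quoted queries with a wildcard
    between the pattern and the candidate (e.g. "digital camera has * resolution")."""
    raise NotImplementedError("plug in a search engine API here")

# Abbreviated pattern templates: (pattern with concept and candidate, pattern with concept only)
PATTERNS = [
    ('"{c} has * {a}"',        '"{c} has"'),
    ('"{c} comes with * {a}"', '"{c} comes with"'),
    ('"{c} features * {a}"',   '"{c} features"'),
    # ... remaining patterns of Tables 2 and 3, in all their verbal forms
]

MINIMUM_ATTRIBUTE_SCORE = 1e-8   # empirical threshold discussed in the text

def score4(concept: str, candidate: str) -> float:
    """Eq. (6): sum over patterns of hits(pat(c,a))^2 / (hits(a) * hits(pat(c)))."""
    hits_candidate = hits(candidate)
    total = 0.0
    for pair_query, concept_query in PATTERNS:
        num = hits(pair_query.format(c=concept, a=candidate))
        den = hits_candidate * hits(concept_query.format(c=concept))
        if den > 0:
            total += num ** 2 / den
    return total

def select_attributes(concept: str, candidates: list[str]) -> list[str]:
    """Keep only candidates whose attributeScore exceeds the threshold."""
    return [a for a in candidates if score4(concept, a) > MINIMUM_ATTRIBUTE_SCORE]
```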
Even though the ambiguity in the statistical assessment is reduced, the drawback of using very concrete queries is that, in some situations, they tend to return very few results, notwithstanding the size of the Web (see Table 5). In fact, in some cases, results are so reduced that any conclusions extracted from the statistical analysis would be unreliable. As has already been stated, attribute labels are typically accompanied by a wide diversity of modifiers (i.e. numbers or adjectives quantifying or qualifying the attributeCandidate). Therefore, it is unlikely that the constructed web queries will find formal ontology-like conceptual assertions (e.g. digital camera has resolution) in the text. On the contrary, most of the text refers to specific instances (i.e. specific digital cameras) for which a specific feature is mentioned (e.g. the digital camera has a high resolution). In order to generalize the query formulation (including possible modifiers) whilst maintaining the desired degree of relation contextualization, we include a search wildcard between the pattern and the attributeCandidate (e.g. digital camera has * resolution). This indicates that the latter can include one or several modifiers in the retrieved resources. Again, this strategy leads to wider sample data and more general and robust statistics (which provide a better approximation of the information distribution that we are trying to assess) [20]. However, the problem is that some search engines do not support search wildcards (this issue will be discussed in greater detail later).

Definition 8 (attributeScore). AttributeScore is the value obtained from computing Score_4 for an attributeCandidate of a particular concept.

AttributeCandidates with an attributeScore that exceeds a threshold are selected (see Table 6). To set the specific threshold value (the MINIMUM_ATTRIBUTE_SCORE constant of Fig. 2), the reduced number of hits potentially obtained by the score's numerator (which involves complex linguistic constructions with double quotes) should be compared with the general nature of the denominator. After an empirical study of pattern-based statistics from several well distinguished domains [14], it was concluded that feasible candidate occurrence values tend to be around 5–6 orders of magnitude higher than the pattern-based occurrences sought by the numerator. Furthermore, the pattern expression added to the denominator as a normalization tends to be around 2–3 orders of magnitude higher. Taking this into consideration, we empirically set a value of 1E-8, indicating that potentially correct candidates result in numerator hit counts around 8 orders of magnitude lower than the denominator. This intuition will be tested in the evaluation section for several domains. The threshold controls the selection procedure's behaviour, which usually results in a reduction of more than 50% of the candidates. Because the rejected set mainly consists of noisy terms, it is clearly important to introduce a statistical assessor to the Web-based learning process to improve precision. The final attributeSet is the result of the execution of the first learning stage.

Definition 9 (attributeSet). AttributeSet is the list of extracted and selected attributes for a particular concept.

4.2. Extraction and selection of attribute values

Some of the attributes contained in the attributeSet obtained in the previous stage may express pure part-of relations (e.g.
the lens of a digital camera), whereas others may correspond to qualified or quantified features (e.g. the digital camera has an 8 MP resolution). In order to obtain a rich ontological structure, we try to detect and deal with both these cases.
Table 5
Comparison of the web hit counts obtained for several queries performed for a candidate attribute (resolution) in the digital camera domain while including or omitting search wildcards.

Pattern example    | Possible web queries                       | Google Web hit count
NP's NP is         | "digital camera's resolution is"           | 9
                   | "digital camera's * resolution is"         | 3590
NP has NP          | "digital camera has resolution"            | 6
                   | "digital camera has * resolution"          | 10,300
NP comes with NP   | "digital camera comes with resolution"     | 0
                   | "digital camera comes with * resolution"   | 72,200
NP features NP     | "digital camera features resolution"       | 7
                   | "digital camera features * resolution"     | 15,600
Table 6
Examples of attributeScores obtained for several attributeCandidates retrieved for the digital camera domain.

Extracted text              | Attribute candidate | Attribute score
2 Mb memory card            | Memory card         | 0.0012
A viewfinder                | Viewfinder          | 5.13E-4
Face detection              | Face detection      | 2.37E-4
Five megapixel ccd sensor   | Ccd sensor          | 1.32E-4
Adequate memory             | Memory              | 6.39E-5
A proprietary battery       | Battery             | 5.02E-5
High resolution             | Resolution          | 2.96E-5
A lens                      | Lens                | 2.26E-5
...                         | ...                 | ...
Unique properties           | Properties          | 2.48E-8
Real photo technology       | Technology          | 1.86E-8
Two things                  | Things              | 1.7E-8
Support                     | Support             | 1.55E-8
Thumbnail index             | Index               | 6.62E-9
Tremendous applications     | Applications        | 6.47E-9
In more detail, during the previous stage and in parallel to the attributeCandidate extraction process, each attribute is tagged according to the following criteria:

(1) If the attributeCandidate has been found in the text without any modifier, it is tagged as Object. This is the case of non-quantified/qualified meronymy relations for which only the presence or absence of a concept component can be measured (e.g. digital camera has lens cap). Consequently, the attribute represents a new ontological class (e.g. lens cap) and a part-of object–object relationship can be defined regarding the original concept. No further analysis is required for this type of attribute.

(2) If the attributeCandidate appears in the text accompanied by one or more adjectives, it is considered as a data-valued attribute and its value range is tagged as String. This situation indicates that a particular feature or part can be qualified in some way using a concatenation of modifiers (e.g. digital camera has high resolution, digital camera has panoramic lenses).

(3) Finally, if the attributeCandidate appears with numerical characters, it is also considered as a data-valued property or meronym and its range is tagged as Number. This is the case for measurable attributes or parts (e.g. digital camera has 1600 ISO, digital camera has 3 lenses).

In the latter two cases, in order to obtain expressive ontological assertions, we will identify the possible values or qualifiers that typically accompany the attributes. Note that a particular attribute may appear several times during the analysis and, in consequence, it can be tagged in different ways (e.g. an attribute may be qualified with adjectives -high resolution- or quantified with numbers -10 megapixel resolution-).

Definition 10 (dataAttribute). DataAttributes are those selected attributes which have been tagged as numeric or string and for which quantifiers and/or qualifiers may be discovered.

In more detail, for every dataAttribute, we will try to discover its values (i.e. quantifiers or qualifiers), value ranges (i.e. minimum and maximum numerical values) and data-types (i.e. measurement units). Some sample data-values are available from the previous stage. Even though this information can be used directly to define attribute restrictions (i.e. values and data-types), it has been retrieved from a very limited set of observations. Thus, in order to improve the reliability of the results, we introduce a new learning step to extract and select a representative set of attribute values from an additional corpus of web resources.

As shown in Fig. 3, the process is performed as follows: for each dataAttribute we construct a query involving it and the initial concept with an AND operator (e.g. digital camera AND resolution). This contextualizes attribute appearances towards the concept's domain, aiming to retrieve resources with domain-related values and qualifiers. We assume that each web resource refers to the attribute unambiguously (i.e. in a unique sense which will commonly refer to the domain indicated by the concept used as context). This premise is based on the observation that words tend to exhibit only one sense in a given discourse or document (context). This fact was tested by Yarowsky [17] on a large corpus (37,232 examples), obtaining a 99% precision. In English, qualifiers and quantifiers are typically associated to attributes by placing them before the noun. The resulting noun phrase (e.g. high resolution) taxonomically specialises the meaning of the attribute. In fact, noun phrase analysis has been used extensively in the past to discover concept specialisations and to build taxonomies [24]. Following this premise, we will seek noun phrases involving the attribute in order to extract potential quantifiers or qualifiers, as shown in the sketch below.
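A simplified sketch of this value-harvesting and filtering step (our own illustration): modifiers preceding the attribute noun are collected with a crude regular expression instead of the chunker used in the actual implementation, and candidates are then filtered with the score of Eq. (7); hits() is the same hypothetical hit-count helper as in the earlier sketch, and the threshold anticipates the MINIMUM_VALUE_SCORE value discussed below.

```python
import re
from collections import Counter

def hits(query: str) -> int:   # hypothetical hit-count helper (see the earlier sketch)
    raise NotImplementedError

def extract_value_candidates(text: str, attribute: str) -> Counter:
    """Collect up to three modifiers (numbers, adjectives, nouns) appearing just
    before the attribute, e.g. 'high resolution', '10 mp resolution'."""
    pattern = rf"((?:[\w.\-\"]+ ){{1,3}}){re.escape(attribute)}\b"
    found = Counter()
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        modifier = match.group(1).strip()
        if modifier.lower() not in {"the", "a", "an", "its", "this"}:
            found[modifier] += 1
    return found

MINIMUM_VALUE_SCORE = 1e-3   # threshold discussed below

def score5(concept: str, attribute: str, value: str) -> float:
    """Eq. (7): hits("value attribute" AND concept) / hits(value AND concept)."""
    den = hits(f'"{value}" AND "{concept}"')
    return 0.0 if den == 0 else hits(f'"{value} {attribute}" AND "{concept}"') / den

def select_values(concept: str, attribute: str, candidates) -> list:
    return [v for v in candidates if score5(concept, attribute, v) > MINIMUM_VALUE_SCORE]
```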
Fig. 3. Data attribute value extraction and selection algorithm.
In more detail, as a result of the query, a webSet with NUMBER_WEBS resources is obtained (more on this in Section 5). For each web, the clear text is extracted, dataAttributes are located and their noun phrases are linguistically analysed using the same tools as in the first stage. Attribute modifiers (nouns, adjectives and values) are detected and extracted, obtaining a list of valueCandidates for each dataAttribute. Note that in some cases numbers will be accompanied by measurement units (e.g. 20x zoom, 10 mp resolution).

Definition 11 (valueCandidate). A valueCandidate is any modifier (numerical -quantitative- or textual -qualitative-) extracted from the analysis of a dataAttribute.

Because most of the candidates are extracted from individual observations, and because of the shallow nature of the extraction pattern, the set of valueCandidates can be quite big and noisy. In general, on the one hand, we can find out-of-range values; on the other hand, we can have problems with misspelled words, ambiguous terms or concepts outside the scope of the domain. In order to minimize those issues and define only feasible attribute value restrictions corresponding to the level of generality of the analysed concept, we again use Web-based statistics to assess the suitability of valueCandidates. Inspired by Turney's basic score [58] and considering, as stated above, the common English linguistic construction for qualifiers, the following formula is defined (7):
$$Score_5(valueCandidate) = \frac{hits(valueCandidate\ dataAttribute \text{ AND } concept)}{hits(valueCandidate \text{ AND } concept)} \qquad (7)$$
A query with the exact noun phrase (between " ") defined by the valueCandidate just before the dataAttribute is constructed. In order to assess their linguistic dependency, following the notion of the PMI collocation measure, their co-occurrence is compared against the number of occurrences of the valueCandidate alone. Again, based on the premise stated above, the concept term is added (with an AND operator) to both queries in order to contextualize the statistical assessment towards the domain.

Definition 12 (valueScore). ValueScore is the value obtained from computing Score_5 for a valueCandidate of a particular dataAttribute.

Again, those candidates with a valueScore above a threshold are selected as valid attribute modifiers (see some examples in Table 7). As the queries performed to compute the valueScore are less restrictive than those of the first stage, we set a less relaxed threshold (the MINIMUM_VALUE_SCORE constant of Fig. 3).
Table 7
Examples of valueScores obtained for valueCandidates for several attributes for the digital camera domain.

Attribute     | Value candidates | Value score
Iso           | 100              | 0.012
              | 1600             | 0.066
              | 800              | 0.015
              | 3200             | 0.051
              | D70              | 5.27E-4
              | High             | 0.019
              | Ultra high       | 0.002
              | Own              | 6.21E-5
              | Common           | 4.24E-5
Lens          | 14–45 mm         | 0.28
              | 24–120           | 0.027
              | 18 55 mm         | 0.406
              | 18–70            | 0.086
              | Dual             | 0.0158
              | d                | 2.61E-4
              | Other            | 7.43E-4
              | Interchangeable  | 0.237
              | Cool             | 5.46E-5
Exposure      | Manual           | 0.037
              | Other            | 2.42E-4
              | Automatic        | 0.034
CCD           | 12 mp            | 0.00357
              | 7.2 mp           | 0.0012
              | 298.00           | 0
              | Megapixel        | 0.046
              | Cheap            | 3.44E-5
              | Cmos             | 0.009
Connectivity  | Wireless         | 0.0145
              | Wifi             | 0.0045
              | Bluetooth        | 0.014
              | Easy             | 9.4E-5
Previous investigations [14] applied to taxonomy learning by means of the same type of noun phrase-based patterns show that correct candidates tend to be around 3 orders of magnitude lower when concatenated to their generalization (as in the numerator) than when queried alone (as in the denominator). Thus, in this case, we set a threshold around 1E-3 to avoid scarce or misspelled candidates.

Numeric values are the most problematic set in this case as, in some situations, their possible values represent a continuous range (e.g. float numbers) and their granularity may be fine. For example, many digital cameras may have a generic resolution of 9 MP, but a specific model (e.g. Panasonic TZ5) may have a resolution of 9.1 MP. In the latter case, the co-occurrence with the attribute will be low because it is typically associated with a camera instance. In consequence, a potentially correct value may be discarded by the selection threshold. However, from the point of view of ontological engineering, it is better to set the value restrictions at the appropriate level of abstraction. Thus, the concept digital camera may have typical resolutions from 4 to 12 MP, and more specialized concepts (i.e. subclasses or instances such as specific camera models) may be further analysed by repeating the whole procedure to refine those attribute restrictions (e.g. Panasonic TZ5 redefines the resolution value restriction to 9.1, as this is the typical value with which the attribute resolution co-occurs).

Definition 13 (valueSet). ValueSet is the list of values extracted and selected for a particular dataAttribute.

With the final valueSet of selected modifiers, the dataAttribute tagging is updated according to the nature of the new values retrieved (i.e. we can find that a Numeric pre-tagged attribute may also have String qualifiers).

4.3. Identifying measurement units and data-types

In order to further refine the ontological attribute definition, ranges and concrete data-types are specified for each dataAttribute. Thus, String tagged attributes are associated with the String data-type and the selected valueSet is set as a range restriction (i.e. the exposure of a digital camera may be manual or automatic). Numeric tagged attributes are associated with Float data-types. Whenever they are available, minimum and maximum values from the retrieved valueSet are used to establish range restrictions (e.g. digital camera ISO goes from 100 to 3200). Moreover, as a final learning stage, numerical data are analysed to assess, if applicable, the measurement unit in which quantifiers are expressed (e.g. resolution is measured in MP).
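As an illustration of how such restrictions can be derived from a selected valueSet, the following sketch (ours; the names and the numeric test are simplifications) tags purely numerical value sets as Float with a [min, max] range and the rest as String with an enumeration:

```python
import re

def derive_restriction(attribute: str, value_set: list[str]) -> dict:
    """Derive a data-type and range restriction from the selected values.
    Numeric-only values -> Float with [min, max]; otherwise String with an enumeration."""
    numbers = [float(v) for v in value_set if re.fullmatch(r"\d+(?:\.\d+)?", v)]
    if numbers and len(numbers) == len(value_set):
        return {"attribute": attribute, "datatype": "Float",
                "range": (min(numbers), max(numbers))}
    return {"attribute": attribute, "datatype": "String", "range": sorted(set(value_set))}

print(derive_restriction("iso", ["100", "800", "1600", "3200"]))
# {'attribute': 'iso', 'datatype': 'Float', 'range': (100.0, 3200.0)}
print(derive_restriction("exposure", ["manual", "automatic"]))
# {'attribute': 'exposure', 'datatype': 'String', 'range': ['automatic', 'manual']}
```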
Fig. 4. Algorithm for identifying the measurement unit of numeric attributes.
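The exact procedure is the one given in Fig. 4; purely as an illustration, the unit-label extraction it performs could be sketched as follows, assuming the selected values arrive as plain strings (the tokenization and regular expression are assumptions, not the paper's implementation).

```python
import re

def measure_candidates(value_set):
    """Collect the non-numeric labels that accompany numeric values,
    e.g. '2.5 inches' -> 'inches', '1-inch' -> 'inch', '3"' -> '"'."""
    candidates = set()
    for value in value_set:
        # Drop digits and numeric punctuation; whatever remains is a unit label.
        label = re.sub(r"[\d.,\s\-]+", " ", value.lower()).strip()
        if label:
            candidates.add(label)
    return candidates

print(measure_candidates(['2.5 inches', '3"', '1-inch', '12 mp']))
# prints (in some order): {'inches', '"', 'inch', 'mp'}
```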
Definition 14 (numericAttribute). NumericAttributes are those selected attributes which have been tagged only as numeric, and for which a measurement unit may be discovered.

As is common to many Latin-based languages, numerically measurable features are typically accompanied by the corresponding measurement unit placed just after the numerical value (e.g. 8 MP resolution). A common feature of measurement units is that they are words composed of at least one non-numeric character. Following this premise, as Fig. 4 shows, the unit discovery analyses the valueSet retrieved and selected during the previous stage for each numericAttribute in order to discover non-numeric characters/words associated with the numeric label (e.g. 3”, 2.5 inches, 1-inch). Those become measureCandidates for the given attribute.

Definition 15 (measureCandidate). A measureCandidate is a word or list of words not completely expressed with numeric characters which accompanies numerical values associated to a numericAttribute. It can potentially indicate the measurement unit in which numerical attribute values are expressed.

Due to the variability with which measureCandidates may be expressed (e.g. cm, centimetres), we again employ web-scale statistics to assess which should be the most appropriate measurement label for that numericAttribute. Similarly to the previous stage, the assessment is performed through the evaluation of the degree of occurrence of the attribute noun phrase at a web scale. This is performed by re-creating the noun phrase linguistic construction by means of a query. However, due to the potentially large number of different values retrieved for each attribute, the number of different queries needed for each value + measureCandidate + numericAttribute noun phrase combination (e.g. 2.5 inches lcd) could be overwhelming. Instead, and considering that the measurement unit typically appears after the value and before the attribute, we can omit the explicit presence of the value string in the query. So, the measureCandidate reliability is estimated by the hit count of the “measureCandidate numericAttribute” query. In this manner, the number of queries is dramatically reduced and scales linearly with the number of discovered measurement labels. Finally, following the PMI principles to assess the statistical dependency of each measureCandidate with respect to the numericAttribute, we define the following score (8):
$$\mathit{Score}_6(\mathit{measureCandidate}) = \frac{\mathit{hits}(\text{measureCandidate numericAttribute AND concept})}{\mathit{hits}(\text{measureCandidate AND concept})} \qquad (8)$$
Again, following the same theoretical principles stated in the previous section, the concept is added with an AND operator to each query in order to contextualize the queries within the scope of the domain.

Definition 16 (measureScore). MeasureScore is the value obtained from computing Score6 for a measureCandidate of a particular numericAttribute.

When all the measureCandidates for a numericAttribute have been queried, the one with the highest measureScore is selected as the specific data-type and the appropriate ontological attribute restriction is specified. Some examples are summarized in Table 8. Once this stage has been executed for all numericAttributes, the learning is complete for the initial concept. All the acquired data (attributes and their attribute restrictions) are stored ontologically, using OWL-DL (http://www.w3.org/TR/owl-features/) as a formal language, following the principles described during the explanation. Object–object properties are set as new ontological classes, measurement units are set as data-types and max–min values are set as range restrictions. Given an input ontology or taxonomy, as stated in Section 4.2, one can recursively execute the analysis for each concept's subclasses (which will inherit attribute-related assertions) in order to refine attribute restrictions. This will result in specific attribute restriction re-definitions at the appropriate ontological level (e.g. a superclass may have a wider value range for an attribute than its subclass).
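The concrete OWL encoding is not shown in the paper; as a loose illustration only, a learned numeric attribute could be serialized along the following lines (the namespace, the naming scheme and the use of an annotation for the unit and typical range are assumptions, not the actual format used by the system).

```python
def attribute_to_turtle(concept, attribute, unit=None, min_val=None, max_val=None):
    """Emit a minimal OWL fragment (Turtle) for one learned numeric dataAttribute.
    Hypothetical encoding: the namespace and the rdfs:comment carrying the unit
    and typical range are illustrative choices only."""
    cls = concept.title().replace(" ", "")
    prop = attribute.lower().replace(" ", "_")
    note = f"unit: {unit}; typical range: {min_val}-{max_val}"
    return (
        "@prefix :     <http://example.org/learned#> .\n"
        "@prefix owl:  <http://www.w3.org/2002/07/owl#> .\n"
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
        "@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .\n\n"
        f":{cls} a owl:Class .\n"
        f":{prop} a owl:DatatypeProperty ;\n"
        f"    rdfs:domain :{cls} ;\n"
        f"    rdfs:range xsd:float ;\n"
        f"    rdfs:comment \"{note}\" .\n"
    )

print(attribute_to_turtle("digital camera", "iso", min_val=100, max_val=3200))
```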
Table 8
Examples of measureScores obtained for measureCandidates for several attributes. Those with the highest score per attribute (marked with *) are selected as the measurement label of the numericAttribute.

Attribute     Measure candidate   Measure score
Memory card   gb *                0.0156
              mb                  0.00969
Zoom lens     x                   9.2E-5
              mm *                0.0018
Resolution    mp *                0.021
              k                   3.3E-5
              bit                 2.43E-3
Lens          mm *                0.039
              extra               1.43E-3
Lcd           mp                  3.05E-4
              inch *              0.205
              x                   8.6E-4
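As a rough sketch of this selection step (hits() again stands for a web search engine call, and the query strings are reconstructed from Eq. (8) rather than taken from the implementation), the unit with the highest measureScore could be picked as follows:

```python
def hits(query):
    """Placeholder for a web search engine hit count."""
    raise NotImplementedError("wire this to a search engine API")

def measure_score(measure_candidate, numeric_attribute, concept):
    # Eq. (8): co-occurrence of the unit label with the attribute, contextualized
    # with the concept, against the occurrences of the label alone in that context.
    num = hits(f'"{measure_candidate} {numeric_attribute}" AND {concept}')
    den = hits(f'{measure_candidate} AND {concept}')
    return num / den if den else 0.0

def select_measurement_unit(measure_candidates, numeric_attribute, concept):
    scored = {m: measure_score(m, numeric_attribute, concept)
              for m in measure_candidates}
    return max(scored, key=scored.get) if scored else None
```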
5. Adaptive web corpus analysis

In the previous section, we mentioned that, for the first two learning stages, a webSet of resources is retrieved from a web search engine for analysis. The compilation of the webSet is important because (1) it is not feasible to analyse all of the web sites resulting from a general query; (2) many search engines only give access to a limited number of sites for a given query; and (3) the recall of our methodology directly depends on the corpus evaluated. However, how big should this set of web resources be in order to obtain a result with good recall (i.e. to retrieve a representative set of concept attributes and restrictions for the domain)? During the research and testing we observed that, for a given pattern-based query, most valid attributes were obtained when analysing a few dozen web sites. In addition, significantly increasing this initial set (to several hundred) does not improve the results in terms of the precision–recall equilibrium. This is in line with the results observed in [18], in which recall follows a logarithmic distribution in relation to the size of the webSet. On the contrary, precision tends to decrease because of the noise introduced by the growth of false candidates. Several factors contribute to this situation, including the high redundancy of web data [71] and the ranking policy implemented by web search engines (i.e. better query-related resources are presented first). However, the concrete amount of resources may vary depending on the domain, linguistic pattern and learning stage. In order to decide the amount of web resources to be evaluated in each situation, we introduce an adaptive approach in which the size of the webSet is dynamically increased as a function of the productivity (learningRate) of each learning step.

Definition 17 (learningRate). LearningRate computes the ratio between the amount of new entities (attributes or values) selected during an iteration of a learning stage and the total number of extracted entities (9).
$$\mathit{LearningRate} = \frac{\#\text{Selected entities}}{\#\text{Extracted entities}} \qquad (9)$$
For a specific resource retrieval query, an initial webSet of resources with a fixed size is analysed. The size is defined by the NUMBER_WEBS constant of Figs. 2 and 3 (for example, the maximum number of resources returned by the search engine per query). As a result, a number of candidates are extracted and finally selected. The system computes a learningRate for that learning iteration. If this exceeds a minimum threshold (the MINIMUM_LEARNING_RATE constant in the algorithms presented in Figs. 2 and 3; for example, 40%), it repeats the same learning stage by retrieving an additional webSet of resources (i.e. an additional NUMBER_WEBS set). This iterative analysis is controlled by the webOffset variable, which is incremented by NUMBER_WEBS after each iteration.

Definition 18 (webOffset). WebOffset indicates the absolute position from which the webSet will be extracted from the ranked list of results provided by the web search engine for a given query.

The results from this new iteration are added to the previous ones and the learningRate is re-computed. The process continues until the amount of new selected entities is so low, and the amount of rejected ones so high, that the learningRate falls below the threshold. This indicates that most of the relevant knowledge has already been acquired for the particular concept, pattern and learning stage. In this manner, highly productive concepts/patterns will result in wider analyses aimed at improving the domain's recall. On the contrary, very specific concepts/patterns will be efficiently analysed by exploiting a narrower web corpus. Fig. 5 presents an example of the learningRate evolution for the first stage of the learning process for a given input concept and several linguistic patterns. One can observe that, as stated in Section 4.1, some patterns are more general (e.g. X has|have Y) than others (e.g. X features Y), leading to a higher number of extractions and learning iterations.
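A minimal sketch of this adaptive loop is given below. It assumes a hypothetical fetch_snippets() helper standing in for the search engine call, and it reads the learningRate cumulatively over all iterations, which is one possible interpretation of the description above; the constants reuse the names from Figs. 2 and 3.

```python
NUMBER_WEBS = 10             # resources (snippets) fetched per iteration
MINIMUM_LEARNING_RATE = 0.4  # stop when productivity falls below this value

def fetch_snippets(query, offset, n):
    """Placeholder: return n search-engine snippets starting at the given offset."""
    raise NotImplementedError("wire this to a search engine API")

def adaptive_analysis(query, extract, select):
    """Widen the webSet while the ratio of selected to extracted entities (the
    learningRate) stays above the learning threshold. extract() and select() are
    assumed to return sets of candidate strings."""
    extracted, selected = set(), set()
    web_offset = 0
    while True:
        web_set = fetch_snippets(query, web_offset, NUMBER_WEBS)
        candidates = extract(web_set)      # pattern-based extraction
        extracted |= candidates
        selected |= select(candidates)     # statistical selection
        web_offset += NUMBER_WEBS
        learning_rate = len(selected) / len(extracted) if extracted else 0.0
        if learning_rate < MINIMUM_LEARNING_RATE:
            return selected
```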
Fig. 5. Evolution of learning rates for different linguistic patterns with a learning threshold of 40%.
Another important aspect of the web analysis is the access to the web content. The system's runtime (see Section 7 for a detailed analysis) is mainly influenced by the number of individual web accesses, including the downloading of each web site needed to perform the analysis. However, there is an alternative to these individual online accesses: the use of web snippets.

Definition 19 (snippet). A snippet, as provided by a search engine, partially represents the textual content of a web site and contains the context of one or several matches of the search query.

Using snippets as representations of web content [26], we are able to efficiently analyse a large set of partial web contents (e.g. up to 10 web snippets for Google) with only one call to the search engine. The trade-off is that additional query matches not contained in the snippet are omitted. This issue has been analysed in previous investigations [13,14], which concluded that, due to the complexity of the pattern-based queries, the probability of retrieving several matches from the same web site is low. In more detail, the average extraction ratio ranged from 0.7 to 1.2 pattern-based candidate extractions per web site for several tests.

6. Evaluation

The fact that ontologies conceptualize knowledge in a non-uniquely expressive way makes evaluating the results a challenging problem [33]. Several attempts have been made to design evaluation procedures [4], but most of these have focused on methodological ontology construction and/or the taxonomical aspect. Some of the approaches that are focused on the experimental side [33] base their evaluations on comparisons with standard ontologies or electronic repositories such as WordNet. However, as stated in the introduction, attributes and concept features are not present in WordNet, these kinds of relationships are very rarely modelled by domain ontologies [37], and there are no standard definitions of attributes [36]. Because we lacked the electronic knowledge needed to design an automatic evaluation procedure, we have carried out a manual evaluation. Two human knowledge engineering experts are requested to perform a concept-per-concept evaluation of the results obtained in the different learning stages. The algorithm's performance is evaluated by means of the typical measures used in Information Retrieval: precision, recall and F-measure.

6.1. Evaluation measures

Definition 20 (precision). Precision specifies to what extent the knowledge is correctly extracted. In this case, precision measures the percentage of correctly selected entities (attributes, values and measurement units) in relation to the total number of selected entities (10):
$$\mathit{Precision} = \frac{\#\text{correctly selected entities}}{\#\text{total selected entities}} \qquad (10)$$
Definition 21 (recall). Recall shows how much of the existing knowledge is extracted. The computation of recall (11) requires a baseline set of correct domain entities (i.e. a Gold Standard) with which to compare the learned entities
$$\mathit{Recall} = \frac{\#\text{correctly selected entities}}{\#\text{domain entities}} \qquad (11)$$
In our case, the baseline set of domain entities is extracted from manually composed web repositories that summarize entity features. They are mainly review web sites containing large sets of specifications for certain types of products. In any case, as a closed, unequivocal and widely agreed set of attributes for any given domain is extremely rare, most Web-based learning methods do not evaluate recall [40]. In fact, because of the dynamicity and ambiguity of human languages, it is unfeasible to manually count the complete set of attributes of each possible entity [44]. In these situations, an alternative way of measuring the potential domain coverage consists of computing the local recall.

Definition 22 (local recall). Local recall computes recall by binding the domain scope to the coverage of the corpus analysed. In our case, this is the attributes, values and measurement units extracted from the set of web resources analysed. It computes the number of correctly selected items against the full set of correct entities extracted from that corpus (according to human judgements) (12):
$$\mathit{Local\ Recall} = \frac{\#\text{correctly selected entities}}{\#\text{correctly retrieved entities}} \qquad (12)$$
Despite its locality, this score (in conjunction with precision) can give a measure of how well the selection procedure performs in evaluating candidates. This metric is consistent with the recall used in TREC conferences [19] and has been used in the past [52,43] to evaluate automatically learned knowledge.

Definition 23 (F-measure). F-measure provides the weighted harmonic mean of precision and recall (13) and summarizes the overall performance of the learning algorithm:
$$F\text{-}\mathit{Measure} = \frac{2 \cdot \mathit{Precision} \cdot \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}} \qquad (13)$$
As with the recall, a local F-measure (14) can be computed using the local recall instead of the global one.
$$\mathit{Local}\ F\text{-}\mathit{Measure} = \frac{2 \cdot \mathit{Precision} \cdot \mathit{Local\ Recall}}{\mathit{Precision} + \mathit{Local\ Recall}} \qquad (14)$$
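These measures are straightforward to compute once the expert judgements are available; the following sketch simply restates Eqs. (10)–(14) in code and, as a sanity check, reproduces one of the local F-measure values reported later in Table 9 (the underlying counts come from the manual evaluation and are not shown here).

```python
def precision(correct_selected, total_selected):
    return correct_selected / total_selected          # Eq. (10)

def recall(correct_selected, domain_entities):
    return correct_selected / domain_entities         # Eq. (11)

def local_recall(correct_selected, correct_retrieved):
    return correct_selected / correct_retrieved       # Eq. (12)

def f_measure(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0    # Eqs. (13) and (14)

# Sanity check against Table 9 (digital camera domain, "our score"):
p, lr = 0.7428, 0.866
print(round(100 * f_measure(p, lr), 1))  # prints 80.0, the reported local F-measure
```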
6.2. Evaluation procedure

The evaluation process has been conducted in three directions. First, some experiments have been performed to compare our proposal with other related works. The main problem of comparing Web-based approaches is the dynamicity of the corpus evaluated. Even using the same domain, the corpus may change over time. Therefore, because accurate conclusions cannot be drawn by directly comparing other authors' results with ours, we reproduced some of their experiments. Specifically, because previous approaches only cover the extraction of attributes and most of the unsupervised attempts use the same set of meronym patterns (introduced in Tables 2 and 3), we focused these first tests on evaluating the proposed statistical score of the first learning stage. Thus, we did several analyses of the same domain using identical execution conditions (same workstation, same day, same parameters, and same search engine), but varying the statistical scores used to select attribute candidates. The goal is to show the benefits of the designed score in relation to previous studies also based on statistical assessments of the Web.

The first stage of the learning process is the most important because (i) it starts the learning from scratch, (ii) its results have a direct influence on the other stages and (iii) the statistical assessment is the most complex. Consequently, it is further evaluated in the second battery of tests. Specifically, because its performance depends on the thresholds used to select or reject candidates and to control the corpus analysis, the influence and generality of these thresholds have been evaluated for several domains.

The third set of evaluations focuses on measuring the performance of our approach as a complete solution. This meant carrying out tests for several – more general and ambiguous – domains of knowledge covering all learning stages. From the results of those tests, the performance of each learning stage is considered and the quality of the final results is estimated. This shows the performance of our contribution in terms of ontology learning (i.e. which ontological statements are correct). In this case, a direct comparison with other approaches is not applicable, as none of them consider the extraction of values and measurement units.

As stated above, the evaluation process is carried out independently by two human experts. The evaluation is blind in the sense that the evaluators are only aware of which domain is used in the test, but not of the parameters used to obtain the results. Afterwards, they must agree on their judgements, which are then compared against the automatically obtained results. The human consensus achieved in the latter two stages of the learning process was above 95% in most cases, because qualifiers and quantifiers for a given attribute are easy to evaluate. For the first stage, the initial consensus was around 82–92%, showing that, even for human subjects, it is not easy to decide whether a term is a valid attribute or meronym for a given entity.
For the first stage, the set of extracted attributeCandidates is compared against the human judgements and each candidate is tagged as correct or incorrect depending on whether it indicates a part-of relation, descriptor, qualifier or feature of the domain. This is compared with the results obtained by the selection procedure of the first stage, computing precision and local recall. Whenever possible, we also try to find a reliable source (mainly review web sites with detailed product specification sheets) to use as a baseline for evaluating the results' coverage. In this case, the experts are requested to evaluate which of the source's attributes are the same as or equivalent to (the same attribute may be expressed with different words) those retrieved by our system. In this manner, we are able to compute the domain recall of the learning procedure. In the next stage, the experts need to declare whether or not the set of extracted valueCandidates for each selected attribute is valid, i.e. whether the values properly indicate a qualification or quantification of the corresponding attribute. This is again compared against the selection procedure, computing precision and local recall of the second stage. Finally, taking into consideration the attribute and its value set, the associated measurement unit and its data-type (boolean, string or numeric), whenever they have been discovered, are evaluated. As they represent a unique result, we only compute the precision.

In the end, a total of 40 individual tests have been carried out for different domains, parameters or statistical scores. Given that an average of 120 candidates had to be manually evaluated for each test, almost 5000 extracted entities have been considered by the evaluators. As stated above, each battery of tests has been performed using the same web search engine. In this regard, the results may be influenced by differences between the IR algorithms of web search engines (which may provide different web-scale statistics and web sets). In the past, we have studied the behaviour of several well-known search engines [14] and concluded that Google is the most suitable one for knowledge acquisition tasks, as it presented the highest IR recall (at the cost of a slower access time compared to, for example, Yahoo!). In addition, given that Google supports search wildcards (unlike Bing), which are needed to create some queries, the choice is even clearer.

6.3. Evaluating the attribute selection procedure

As Section 3.2 shows, the patterns used to acquire attributes are similar to those employed in related studies [1,8,44]. Thus, our main contribution regarding the first learning stage is the use of contextualized pattern-based web-scale statistical scores to select attributeCandidates. In order to evaluate their performance and compare our approach with other studies under the same execution conditions, we made several analyses with the same domain and parameters, but using several typical selection statistics considered in the literature. These are:

(1) No statistical analysis is employed (i.e. all the extracted candidates are selected as valid), as in [1].
(2) Only the individual appearances of each extracted candidate are considered (i.e. local statistics), as in [44]. Those appearing more than once are selected.
(3) Web-scale statistics are employed in the original PMI-score fashion [58] (Score1 presented in Section 4.1) to search for the absolute co-occurrence between the candidate and the domain, as in [8]. A selection threshold of 0.1 is used, as in [13].
(4) Web-scale statistics are employed, using the contextualized pattern-based queries proposed in Score4. These are the selection statistics used in our study. A selection threshold of 1E-8 and a minimum learning rate of 20% have been set (more details on this in Section 6.4).

These statistics have been applied to select/reject the same set of extracted candidates for several domains of knowledge that have also been used by other authors dealing with the extraction of attributes. In general, authors employ technological domains for evaluation purposes because of the proliferation of class descriptors and the availability of resources. Specifically, the evaluation has been performed for the following domains: digital camera (evaluated in [35,42]), laptop (evaluated in [35]), DVD player (evaluated in [42]) and scanner (evaluated in [8]). Another advantage of using these domains is the availability of product review sites offering detailed specification sheets (i.e. the tabular data exploited by approaches such as those discussed in [62]). Even though they cannot be considered as a gold standard (as they only represent a partial view of all the possible attributes of a domain), they can be used as a relative baseline for computing an approximate domain recall and comparing the behaviour of the different tests. Therefore, in addition to the expert-based concept-per-concept evaluation, we also checked the results against product review sites covering digital cameras,4 laptops,5 DVD players6 and scanners.7 The evaluation results are presented in Table 9.

The results shown in the table are consistent with the intuitions presented in Section 4.1. First of all, the results obtained from the first two selection criteria (no statistics and local analysis) are poor (local F-measure around or below 50% in all cases). Obviously, when no statistical assessment is considered, the local recall (i.e. the amount of selected entities from the set of extracted ones) is 100%, because extracted candidates are not filtered. However, this is at the expense of very low precision (below 40%) caused by false positives.
4 http://www.dpreview.com/reviews: 35 common features to most digital cameras are taken as baseline.
5 http://reviews.cnet.com/laptops: 20 common features to most laptops are taken as baseline.
6 http://reviews.cnet.com/dvd-players/?tag=bc: 37 common features to most DVD players are taken as baseline.
7 http://www.imaging-resource.com/SCAN1.HTM: 24 common features to most scanners are taken as baseline.
Table 9
Evaluation results for different attribute selection criteria: no statistical assessment, only local statistics considered, Web-based absolute co-occurrences (PMI) and our score, using patterns to contextualize web queries.

Domain           Selection criteria   Precision (%)   Local recall (%)   Local F-measure (%)   Recall (%)   F-measure (%)
Digital camera   No statistics        25.2            100                40.25                 68.57        36.85
                 Local statistics     53.3            54                 53.64                 31.42        39.53
                 Web-PMI              75              61.53              67.6                  40           52.17
                 Our score            74.28           86.6               80                    57.24        64.65
Laptop           No statistics        36              100                52.94                 75           48.64
                 Local statistics     54              46.5               50                    35           42.47
                 Web-PMI              73.68           66.6               70                    60           66.1
                 Our score            75              89.3               81.52                 70           72.4
DVD player       No statistics        39.6            100                56.7                  64.8         49.1
                 Local statistics     53.5            52.2               52.84                 32.4         40.35
                 Web-PMI              77              67.3               71.8                  48.64        59.6
                 Our score            76.27           90                 82.57                 59.45        66.8
Scanner          No statistics        34.4            100                51.2                  70.8         46.3
                 Local statistics     48              49                 48.5                  33.3         39.3
                 Web-PMI              80              66.6               72.68                 41.6         54.73
                 Our score            80.6            88.2               84.2                  62.5         70.4
When local statistics are introduced to filter the set of extracted candidates, some of these false positives are filtered out, but this causes a low local recall of around 50%. In some situations, this local assessment performs even worse than the naive approach. This shows that statistical evidence extracted from a limited number of observations is not sufficient to provide reliable results. In fact, authors using this local approach typically rely on a much larger number of resources on which to base the statistical assessment [44]. The results' potential recall is around 70% in all tests (i.e. the maximum domain coverage offered by the extracted candidate set when no candidate filtering is applied). Consequently, the online analysis of additional resources for computing more robust statistics would hamper the algorithm's performance without providing a significantly higher coverage (more on this in Section 6.4). This is the reason why web-scale statistics computed efficiently from web search engine queries are a valuable aid. In fact, web-scale statistics based on term co-occurrence (the PMI-like Score1 from Section 4.1) provide a more robust estimation of information distribution based on a more general set of – estimated – observations taken from the whole Web. As a result of the much more reliable statistical assessment, precision and local recall are improved by a considerable margin, resulting in a local F-measure of around 70%. However, that score does not take into account the context of the co-occurrence of the searched terms and, consequently, the nature of the term relation implicit in that co-occurrence is not considered during the assessment. Thus, in order to minimize ambiguity, we designed Score4 (Section 4.1), which considers the patterns in the statistical assessment. This forces the candidate and the domain to co-occur in a meronym-like syntactic construction, leading to more accurate estimations. This approach improves the performance of the selection algorithm by around 10%, providing the best results of the four criteria, mainly due to the improvement of local recall. We can also observe that, while maintaining good precision (between 75% and 80%), this score achieves a general recall which is quite close (around 5–10% lower) to the highest recall that can be achieved for the extracted set (computed for the test in which no candidate selection is performed). These results are consistent with those observed in [14] (for the taxonomical aspect), in which it was concluded that pattern-based statistical scores provide a better assessment of inter-concept relations due to the minimized ambiguity.

6.4. Evaluating the parameters' influence

The next battery of tests is aimed at showing the influence that different thresholds have on extracting and selecting attributes. We conducted this evaluation using the same domains as above, for which we set several learning iterations with different selection (MINIMUM_ATTRIBUTE_SCORE constant in Fig. 2) and learning (MINIMUM_LEARNING_RATE constant in Fig. 2) thresholds. For the first case, based on the empirical evidence introduced in Section 4.1, for which a value of 1E-8 was set, we evaluate threshold values from 1E-7 to 1E-9 (results are summarized in Table 10). This is the most critical value, as the performance of the algorithm directly depends on the behaviour of the selection procedure. In this case, we set a learning threshold of 20% for all tests (more details later).
As expected, precision tends to decrease when the selection threshold is relaxed; this is more evident (an average 10% decrease) for threshold values below a certain point (below 1E-8), which indicates an increasing number of false positives. The local recall behaves in an inverse manner, decreasing very significantly (around 26–32%) for thresholds above a certain point (above 1E-8), which indicates an overly strict selection policy. On the other hand, lower threshold values slightly improve local recall (around 4–5%) at the cost of comparatively worse precision. It is interesting to see how the domain's recall grows significantly (around 15–20%) as the threshold decreases, although it is hardly improved for values below 1E-8 (a maximum of 5%). This indicates that most of the domain knowledge extracted from the evaluated set has been selected. This can also be observed when compared with the maximum possible recall, as stated in the previous section, when the selection process is not applied.
Table 10
Evaluation results for a selection threshold between 1E-7 and 1E-9, with a learning threshold of 20%.

Domain           Selection threshold   Precision (%)   Local recall (%)   Local F-measure (%)   Recall (%)   F-measure (%)
Digital camera   1E-7                  80              60                 68.6                  40           53.3
                 1E-8                  74.28           86.6               80                    57.24        64.65
                 1E-9                  62.7            90                 73.9                  60           61.32
Laptop           1E-7                  84              59.5               69.6                  55           66.4
                 1E-8                  75              89.3               81.52                 70           72.4
                 1E-9                  68.2            93.75              78.9                  75           71.4
DVD player       1E-7                  77.7            58.3               66.6                  45.9         57.7
                 1E-8                  76.27           90                 82.57                 59.45        66.8
                 1E-9                  70.1            94                 80.3                  59.45        64.3
Scanner          1E-7                  88.2            52                 65.4                  41.6         56.5
                 1E-8                  80.6            88.2               84.2                  62.5         70.4
                 1E-9                  69.5            94.1               80                    66.6         68
To conclude, by combining precision and recall, the F-measure achieves its maximum value for the selected 1E-8 threshold in all the tests. This statistical regularity between different domains regarding selection thresholds has also been observed for taxonomy learning in [14].

In the next set of tests, we evaluate the influence of the learning threshold. We set the MINIMUM_ATTRIBUTE_SCORE constant to 1E-8 and vary the MINIMUM_LEARNING_RATE between 10% and 40%, with corpus increments of 10 web sites (this corresponds to the NUMBER_WEBS constant in Fig. 2, and to the number of resources provided by Google per query). The variation of this threshold controls how much emphasis (i.e. the number of analysed resources) will be put on the analysis associated with each pattern and iteration. This is a less critical parameter than the selection threshold, but it can influence the results' recall. Evaluation results are shown in Table 11.

In this case, as the selection threshold is maintained during the different tests, we see lower variance in precision and local recall. As expected, we observe a reduction in precision as the learning threshold decreases, especially below 20%. This indicates that, because the system is forced to analyse noisier resources (i.e. resources that are less related to the domain according to the ranking presented by the web search engine), the number of false positives increases. It is important to note that, on average, a reduction of the learning threshold from 20% to 10% doubled the number of results analysed and increased the candidates retrieved by 50%. Non-selected correct candidates (which affect local recall) are usually maintained during the different tests. Variations in local recall are mainly motivated by the absolute number of extractions and selections for each test. As a result, the selection quality (measured by the local F-measure) is quite similar, because the selection threshold and policy are the same for all tests, and it only varies by an insignificant 1–4%.

Recall values are more interesting to evaluate. In general, we observe that for the most constrained learning threshold (40%), values are significantly lower, indicating that more resources should be analysed in order to improve domain coverage. In fact, doubling the learning threshold from 20% to 40% resulted, on average, in a 50% reduction in the total number of extracted candidates. On the other hand, in all cases except the laptop domain, recall is barely improved by lowering the learning threshold to 10%, showing that, in general, additional analyses do not bring better coverage. This fact, in conjunction with the lower precision due to the growth of false positives, results in a slightly lower F-measure. In conclusion, as shown in Section 5, once a relevant set of web resources has been analysed for a given query and most of the domain knowledge has been retrieved, the noise added by the growing number of false candidates hampers the quality of the results without improving the coverage. On the other hand, too small corpora result in significantly lower recall. This is why it is important to auto-detect the ideal number of web resources (as a function of the learning threshold, in this case) taking into consideration the learning rate obtained for each iteration, as presented in Section 5.
Table 11
Evaluation results for a learning threshold between 10% and 40%, with a selection threshold of 1E-8.

Domain           Learning threshold (%)   Precision (%)   Local recall (%)   Local F-measure (%)   Recall (%)   F-measure (%)
Digital camera   10                       71.4            85                 77.6                  57.24        63.47
                 20                       74.28           86.6               80                    57.24        64.65
                 40                       75              87.5               80.7                  45.7         56.79
Laptop           10                       69              91.8               78.8                  80           74
                 20                       75              89.3               81.52                 70           72.4
                 40                       76.7            88.8               82.3                  60           67.3
DVD player       10                       68.75           91.16              78.38                 59.45        63.7
                 20                       76.27           90                 82.57                 59.45        66.8
                 40                       72              94.7               81.8                  54           61.7
Scanner          10                       75.5            87.7               81.14                 62.5         68.4
                 20                       80.6            88.2               84.2                  62.5         70.4
                 40                       81.5            84.6               83                    50           62
Other authors have also observed that too many resources can result in lower precision, due to an increasing number of spurious candidates, without recall improvements [44].

6.5. Evaluation of several domains

The final battery of tests is aimed at evaluating the algorithm as a whole, including the two latter stages. In order to show the applicability and the domain independence of our approach, in addition to the technological domains considered in previous tests, we also employ more general – and more ambiguous – domains. Thus, in contrast to Web-based or tabular-based works, which centre the evaluation on specific technological entities for which specific review sites and semi-structured data (i.e. tables) are widely available on the Web, we will use more general domains. Non-technological products such as Book, Shoe, Car and Drug, and non-product domains such as City, Restaurant and Hotel, will be analysed. Those domains represent more abstract concepts for which pattern matches and semi-structured data are scarcer, and which are typically referred to by means of specific instances, as stated in [44]. Due to this generality, domain recall is not considered, because of the domains' wide scope and the lack of an agreed standard with which to compare the results. The results of the evaluation are shown in Table 12.

An examination of the table shows that, for the technological domains, results are quite uniform and consistent for the first stage, achieving a precision of around 75% and a local recall of around 85%. As a result, the F-measure is maintained above 80%. As shown in Section 6.3, for those domains, the quality of the selection process is higher (by around 10–12%) than that of other tests based on uncontextualized Web-based statistics or local analyses. Results obtained for non-technological domains are slightly lower, with an F-measure around 73–80%. The generality of those domains results in a higher ambiguity and, in consequence, a lower precision caused by the non-contextual pattern-based extractions. As stated in [34], attribute learning for these kinds of domains represents a more challenging task. In fact, human evaluations presented a lower consensus, which ranged from an average of 90% down to 84% for these domains, indicating that attributes were not as easy or straightforward to detect. Analysing the results, we observed a higher percentage of rejected candidates after the statistical assessment, which goes from an average of 64% for the technological domains to an average of 72% for the other ones. This results in higher noise that affects the extraction precision (which is around 10% lower than for the technological domains). Local recall remains high, indicating that the threshold is adequate for selecting most of the correctly retrieved candidates (as stated in Section 6.4). Analysing the results from a qualitative point of view, we observed that most of the missing attributes refer to general features which are common to many domains (such as price, size, weight, etc.). In contrast, domain-dependent features (e.g. the resolution of a scanner or a digital camera, the number of pages of a book, the population of a city, the star rating of a hotel, etc.) were properly selected. Given that the web scores try to assess the probability that a concept and its attribute co-occur in the scope of the domain, this is to be expected, because general attributes co-occur in many domains.
However, in terms of ontological engineering, specific domain features are better at unambiguously describing the concept's meaning than general ones [1]. Most of the erroneously selected attributes (which affect precision) refer to attributes which commonly appear expressed in a meronym-like manner (e.g. digital camera has advantages) but which do not represent a part-of relationship or feature. There are also erroneously selected attributes which correspond to descriptions of specific instances that cannot be generalized at a concept level but which appear commonly in the domain (e.g. the laptop has a scratch). These cases can hardly be tackled by means of an unsupervised approach and are caused by the inherent ambiguity of non-contextual pattern-based extractions. Finally, a small percentage of errors is caused by partial text extractions (e.g. the hotel has a wide variety), a consequence of the reduced context obtained from web snippets. These cases may be minimized by accessing the full web text, at the cost of a much higher overhead.

The second learning stage is less critical because the unbounded nature of quantified or qualified features means that it is not expected to end with a finite set of possible values (general recall). On the other hand, typical sample values are retrieved and selected to assess the data-type nature of the attribute. In this sense, the more relaxed score performs adequately, with high precision and an almost perfect local recall. Only the drug domain resulted in a significantly lower local recall due to the high amount, fine granularity and low occurrence of correctly extracted data-values for certain attributes (e.g. absorption percentage).

Table 12
Evaluation results for several domains of knowledge (technological, non-technological and non-product domains) using the default learning parameters.

Domain           1st stage        1st stage        1st stage        2nd stage        2nd stage        2nd stage        3rd stage
                 Precision (%)    L. Recall (%)    L. F-meas (%)    Precision (%)    L. Recall (%)    L. F-meas (%)    Precision (%)
Digital camera   74.28            86.6             80               86.23            97.9             91.7             82
Laptop           75               89.3             81.5             85.3             96.5             90.4             76
DVD player       76.27            90               82.5             79.43            96.4             87.1             77.9
Scanner          80.6             88.2             84.2             83.4             94.3             88.5             79.25
Shoe             65               93.18            76.58            81.15            95               87.5             76.19
Book             69.5             87.2             78.8             81.6             97               88.6             78
Drug             64.3             91.8             75.6             85.71            78               81.67            78.5
Car              60.4             92               73               74.5             100              85.4             79.36
Hotel            71.25            91.9             80.2             82.69            90.5             86.4             81.2
Restaurant       64.19            91.2             75.34            89.9             97.8             93.7             82.71
City             70.9             88               78.5             77.6             96.7             86.1             72.6
The quality of the knowledge acquired at this stage contributes to the performance of the final step, in which a measurement unit and data-type are assigned to each attribute, obtaining a precision of more than 70% in all cases. Erroneously tagged attributes are typically cases in which no indication of a qualifier or quantifier has been retrieved during the analysis, resulting in a pure part-of assertion and a boolean type (as stated in Section 4.2). If higher precision is critical, additional analyses focused on finding such evidence may be executed at the cost of a much higher learning effort; i.e. more web resources may be retrieved, or all the selected attributes may be analysed for value discovery regardless of the associated data-type at the end of the first stage.

7. Computational complexity

In this section, we analyse the methodology from the runtime point of view. We formalize the computational complexity of the algorithms in terms of the most time-consuming tasks in order to give an idea of their behaviour and throughput (i.e. the number of entities learned in a limited period of time). Considering that web resources are analysed by means of snippets (instead of individually downloading each one), the main aspect which influences the algorithms' runtime is the number of online accesses performed to retrieve those snippets and, especially, to obtain web-scale statistics for computing the different scores. Due to the inherent network delays and the constraints introduced by most search engines on consecutive queries, in our tests, the typical time for a query response ranges from 300 to 700 ms. In contrast, the amount of time required for off-line web parsing, syntactical analysis of text and pattern extraction from the corpus retrieved at each iteration of the algorithm is about 30 ms. Given the above, in the following we analyse the number of web queries required during the learning, in order to evaluate the main factor influencing the algorithms' runtime and to study the methodology's scalability.

In the first stage, for each of the P patterns presented in Section 3.2, the system queries a web search engine and analyses the returned resources, obtaining a certain amount of attributeCandidates ($A_1$). Each one is evaluated by means of queries for web-scale statistics. Considering the score introduced in Section 4.1, $2 \cdot P \cdot A_1$ queries are performed. As a result of the statistical assessment, $S_1$ attributes are selected and $A_1 - S_1$ are rejected. Then, depending on the learning threshold, the algorithm may decide to evaluate an additional set of resources (querying the pattern again and resulting in a new set of $A_2$ attributeCandidates to evaluate) or to continue with the next pattern. After performing all iterations for all patterns, a total of $\sum_{j=1}^{P} I_j$ queries for resources (where $I_j$ is the number of iterations executed for the j-th pattern) and a total of $2\sum_{j=1}^{P}\sum_{i=1}^{I_j} A_{ji}$ queries for statistics (where $A_{ji}$ is the number of attributeCandidates retrieved in the i-th iteration of the j-th pattern) have been performed. As a result, a total of $\sum_{j,i} S_{ji}$ attributes are selected.

Once the attribute learning is finished, value learning starts by evaluating, from the total amount of $\sum_{j,i} S_{ji}$ selected attributes, those D attributes which are tagged as dataAttributes. Each one is queried in the search engine and the resulting resources are analysed to extract $V_1$ valueCandidates.
Following the same iterative behaviour and considering the score for value selection presented in Section 4.2, a total of $\sum_{j=1}^{D} K_j$ queries for resources (where $K_j$ is the number of iterations executed for the j-th dataAttribute) and a total of $2\sum_{j=1}^{D}\sum_{i=1}^{K_j} V_{ji}$ queries for statistics (where $V_{ji}$ is the number of valueCandidates retrieved in the i-th iteration of the j-th dataAttribute) are performed.

Finally, for a subset of N dataAttributes representing numericAttributes for which a valueSet has been retrieved, M measureCandidates are extracted. In this case, only queries for statistics are required in order to select the candidate with the highest score (introduced in Section 4.3). A total of $2\sum_{j=1}^{N} M_j$ queries are performed (where $M_j$ is the amount of measureCandidates discovered for the j-th numericAttribute).

In summary, the amount of web queries (Q) required to analyse a concept grows linearly in relation to the amount of candidates retrieved (15). So, the runtime required to analyse the concept depends on its learning productiveness: the number of attributeCandidates ($\sum\sum A_{ji}$), valueCandidates ($\sum\sum V_{ji}$) and measureCandidates ($\sum M_j$).
$$Q = \sum_{j=1}^{P} I_j \;+\; 2\sum_{j=1}^{P}\sum_{i=1}^{I_j} A_{ji} \;+\; \sum_{j=1}^{D} K_j \;+\; 2\sum_{j=1}^{D}\sum_{i=1}^{K_j} V_{ji} \;+\; 2\sum_{j=1}^{N} M_j \qquad (15)$$
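As a purely illustrative sketch of Eq. (15), the query budget and the corresponding runtime estimate can be computed as follows; the per-pattern and per-attribute counts used here are invented for the example and are not the figures measured in the experiments reported below.

```python
def total_queries(I, A, K, V, M):
    """Eq. (15). I[j]: iterations for pattern j; A[j][i]: attributeCandidates in
    iteration i of pattern j; K[j]: iterations for dataAttribute j; V[j][i]:
    valueCandidates in iteration i of dataAttribute j; M[j]: measureCandidates
    for numericAttribute j."""
    return (sum(I) + 2 * sum(sum(a) for a in A)
            + sum(K) + 2 * sum(sum(v) for v in V)
            + 2 * sum(M))

# Illustrative (invented) counts for a small run:
I = [3, 2]                     # two patterns, 3 and 2 iterations
A = [[40, 25, 10], [30, 15]]   # attributeCandidates per iteration
K = [2, 1, 1]                  # iterations for three dataAttributes
V = [[50, 20], [40], [70]]     # valueCandidates per iteration
M = [3, 4]                     # measureCandidates for two numericAttributes

q = total_queries(I, A, K, V, M)
print(q, "queries ->", round(q * 0.5 / 60, 1), "min at 500 ms per query")
# prints: 623 queries -> 5.2 min at 500 ms per query
```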
The average amount of attributeCandidates found for the domains tested in the previous section was around 120; the number of valueCandidates found for all numericAttributes was around 180; finally, the amount of measureCandidates was around 25. Given the above formula, around Q = 8000 queries were performed per concept. Estimating that each query takes an average of 500 ms to complete ($t_q$), the total runtime required to evaluate each concept ($t_c = Q \cdot t_q$) will be around 67 min. In practice, the average $t_c$ was 76 min per concept, which is reasonably close to the theoretical estimation. After the analysis, an average of 64 attributes with their corresponding attribute restrictions were learned, giving a uniform learning throughput of 0.84 ontologically rich attribute assertions per minute.

It is important to note that all the tests have been performed using a single computer and IP. Its hardware and software configuration was: Intel Core2Duo 2 GHz, 4 GB RAM, Windows XP SP3, Java 1.6, a 10 MB Internet connection and a pool of Google API accounts, allowing a maximum of 100 thousand queries per day in total. Given that, most of the time, the execution is waiting for the response of a web query, the parallelization of queries may provide a substantial performance improvement. Considering that search engines delay concurrent queries performed from the same IP, ideally, query calls should be distributed through several computers with different IPs and search API accounts. A study of the performance improvements obtained when parallelizing query-dependent ontology learning tasks in a grid-like manner is provided in [16].
That work shows how candidate entity analyses can be considered as independent learning tasks and easily parallelized, achieving an almost linear performance improvement with regard to the level of hardware parallelism.

8. Conclusion and future work

In this paper we have introduced a new methodology for discovering ontological concept attributes and, where applicable, attribute restrictions from the Web. The methodology is automatic and non-supervised, as neither user interaction nor previous knowledge (apart from a set of general patterns) is needed during the learning process. However, as with other automatic approaches [44,8], the results present limitations because of language ambiguity (as stated in the evaluation) and they need to be validated in terms of ontological engineering before they are employed in a critical setting. Even human judgements may be affected by ambiguity and by the fact that knowledge can be conceptualized in many expressive ways [33]. In any case, given the effort required for manual ontology engineering, it is always preferable and faster to present the user with a predefined structure to validate (learned from a representative corpus composed by a wide and heterogeneous community of users) than to ask him or her to formalize domain knowledge from scratch [4]. This is why, as stated in the introduction, automated ontology learning methods are a valuable aid to the knowledge engineer, especially for wide domains [56].

The proposed algorithms are designed to learn attributes and attribute restrictions from scratch, retrieving web resources and analysing them in an incremental fashion. They offer a general-purpose approach that is not dependent on domain specificity, the corpus availability for the domain, or resource structure. At each stage, the knowledge acquired is used to learn more specific data by contextualizing the analysis. The algorithms are also able to adapt their behaviour and the corpus with regard to the learning throughput without relying on domain-dependent parameters. The lack of constraints and the generality of the corpus employed mean that the proposal can be applied in a stand-alone manner, but also as a complement to other ontology learning methodologies [13,14] or to previously defined ontologies.

In addition to linguistic patterns applied to the Web in order to minimize data sparseness (patterns which have also been successfully applied in previous automatic approaches [1,8,44]), the use of Web-based statistical assessors to filter the extracted candidates is especially interesting. Several scores for each learning stage have been designed, based on the notion of term collocation and estimating concept probabilities from web hit counts. We have shown (during the evaluation of several domains) that contextualized web queries involving linguistic patterns result in a better estimation of the information distribution (due to the minimization of ambiguity) in comparison with other approaches using absolute co-occurrence values [8] and local statistics [40]. Finally, in addition to object–object relations defined at an ontological level, we are able to detect data-properties and their associated data-types and value ranges, knowledge which is typically omitted by related approaches (as stated in Section 2). This allows semantically-rich ontological assertions to be formalized.
As studied in Section 7, the learning runtime, which is mainly influenced by the number of web queries, grows linearly in relation to the amount of candidates discovered for a given concept. As a result, the learning throughput (i.e. the number of entities learned in a given period of time) is approximately constant and is maintained (given the response time of web search engines) at a reasonable level. Potentially, more than a thousand ontological facts can be learned per day in an automatic fashion and in a domain-independent way. Moreover, the throughput can potentially be improved by parallelising the analyses through a computer network. In conclusion, methods such as the one proposed help to automatically extend the expressiveness of a given ontology without depending on the corpus structure or its availability, and without the knowledge acquisition bottleneck which characterizes manual approaches. Furthermore, the extracted attributes can help existing methods to harvest pre-specified semantic relations [40,57] during the acquisition of relations that are of interest to a wide set of web users. Attribute-rich ontologies, which are rarely found, can bring benefits in many knowledge-related tasks such as question answering [67], information retrieval [6] or word-sense disambiguation [2].

A future line of research could involve refining the attribute extraction by trying to identify the category/role that each attribute plays in the concept's definition (e.g. part, quality, purpose, etc.). Further analyses or more specific patterns may be needed to achieve this goal. Moreover, when dealing with natural language resources such as web sites, problems related to semantic ambiguity may arise (mainly polysemy and synonymy). We found, for example, that when analysing a concept such as scanner, attributes related to the computer hardware (e.g. optical resolution) and to the radar (e.g. frequency coverage) may appear. Some additional processing may be applied to disambiguate the results, by taking into consideration the context provided by the ontology in which the searched concept is framed. We also detected, especially for technological domains, the presence of acronyms and abbreviations which can be taken into consideration (in conjunction with synonyms) as alternative lexicalizations of the searched concepts in order to widen the corpus analysis for rarer domains. A proposal to automatically discover concept acronyms from the Web has been presented in [15].

Acknowledgement

The work has been supported by the University Rovira i Virgili (2009AIRE-04), the Ministry of Science and Innovation (DAMASK project, Data mining algorithms with semantic knowledge, TIN2009-11005) and the Spanish Government (PlanE, Spanish Economy and Employment Stimulation Plan).
References
[1] A. Almuhareb, M. Poesio, Attribute-based and value-based clustering: an evaluation, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 2004, pp. 158–165.
[2] A. Almuhareb, M. Poesio, MSDA: Wordsense discrimination using context vectors and attributes, in: Proceedings of the European Conference on Artificial Intelligence, 2006, pp. 543–547.
[3] A. Borthwick, A Maximum Entropy Approach to Named Entity Recognition, Ph.D. Thesis, New York, 1999.
[4] A. Gómez-Pérez, M. Fernández-López, O. Corcho, Ontological Engineering, 2nd printing, Springer-Verlag, 2004.
[5] A. Kilgarriff, Googleology is bad science, Computational Linguistics 33 (1) (2007) 147–151.
[6] A. Moreno, D. Riaño, D. Isern, J. Bocio, D. Sánchez, L. Jiménez, Knowledge exploitation from the web, in: Proceedings of the Fifth International Conference on Practical Aspects of Knowledge Management, Vienna, Austria, 2004, pp. 175–185.
[7] A. Pivk, P. Cimiano, Y. Sure, M. Gams, V. Rajkovic, Transforming arbitrary tables into logical form with TARTAR, Data and Knowledge Engineering 60 (3) (2007) 567–595.
[8] A. Popescu, O. Etzioni, Extracting product features and opinions from reviews, in: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Vancouver, Canada, 2005, pp. 339–346.
[9] Y.A. Tijerino, D. Embley, D. Lonsdale, Y. Ding, G. Nagy, Towards ontology generation from tables, World Wide Web: Internet and Information Systems 8 (2005) 261–285.
[10] A. Weichselbraun, G. Wohlgenannt, A. Scharl, M. Granitzer, T. Neidhart, A. Juffinger, Discovery and evaluation of non-taxonomic relations in domain ontologies, International Journal of Metadata, Semantics and Ontologies 4 (3) (2009) 212–222.
[11] C. Tao, D.W. Embley, Automatic hidden-web table interpretation, conceptualization and semantic annotation, Data and Knowledge Engineering 68 (7) (2009) 683–703.
[12] D. Pinto, A. McCallum, X. Wei, W.B. Croft, Table extraction using conditional random fields, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, pp. 235–242.
[13] D. Sánchez, A. Moreno, Learning non-taxonomic relationships from web documents for domain ontology construction, Data and Knowledge Engineering 63 (3) (2008) 600–623.
[14] D. Sánchez, A. Moreno, Pattern-based automatic taxonomy learning from the Web, AI Communications 21 (1) (2008) 27–48.
[15] D. Sánchez, D. Isern, Automatic extraction of acronym definitions from the web, Applied Intelligence, doi:10.1007/s10489-009-0197-4, 2009.
[16] D. Sánchez, D. Isern, A. Rodríguez, A. Moreno, General purpose agent-based parallel computing, in: Proceedings of the 10th International Work-Conference on Artificial Neural Networks, Salamanca, Spain, 2009, pp. 231–238.
[17] D. Yarowsky, Unsupervised word-sense disambiguation rivaling supervised methods, in: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, Cambridge, MA, 1995, pp. 189–196.
[18] D. Sánchez, A. Moreno, A methodology for knowledge acquisition from the web, International Journal of Knowledge-Based and Intelligent Engineering Systems 10 (6) (2006) 453–475.
[19] E. Alfonseca, S. Manandhar, An unsupervised method for general named-entity recognition and automated concept discovery, in: Proceedings of the First International Conference on General WordNet, Mysore, India, 2002.
[20] E. Brill, Processing natural language without natural language processing, in: Proceedings of the Fourth International Conference on Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico, 2003, pp. 360–369.
[21] E. Métais, Enhancing information systems management with natural language processing techniques, Data and Knowledge Engineering 41 (2–3) (2002) 247–272.
[22] F. Wu, D. Weld, Automatically refining the Wikipedia infobox ontology, in: Proceedings of the 17th World Wide Web Conference, Beijing, China, 2008, pp. 635–644.
[23] G. Bisson, C. Nedellec, D. Cañamero, Designing clustering methods for ontology building: the Mo'K workbench, in: Proceedings of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence, Berlin, Germany, 2000, pp. 13–19.
[24] G. Grefenstette, SQLET: short query linguistic expansion techniques: palliating one-word queries by providing intermediate structure to text, in: Proceedings of Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, Italy, 1997, pp. 97–114.
[25] G. Pirró, A semantic similarity metric combining features and intrinsic information content, Data and Knowledge Engineering 68 (11) (2009) 1289–1308.
[26] I. Varlamis, S. Stamou, Semantically driven snippet selection for supporting focused web searches, Data and Knowledge Engineering 68 (2) (2009) 261–277.
[27] J. Dujmovic, H. Bai, Evaluation and comparison of search engines using the LSP method, Computer Science and Information Systems 3 (2) (2006) 711–722.
[28] J. Ferreira da Silva, G.P. Lopes, A local maxima method and a fair dispersion normalization for extracting multi-word units from corpora, in: Proceedings of the Sixth Meeting on Mathematics of Language, 1999, pp. 369–381.
[29] J.L. Hong, E.G. Siew, S. Egerton, Information extraction for search engines using fast heuristic techniques, Data and Knowledge Engineering 69 (2) (2010) 169–196.
[30] J. Pustejovsky, The generative lexicon, Computational Linguistics 17 (4) (1991) 409–441.
[31] J. Reisinger, M. Pasca, Low-cost supervision for multiple-source attribute extraction, in: Proceedings of the 10th International Conference on Intelligent Text Processing and Computational Linguistics, 2009, pp. 382–393.
[32] J. Surowiecki, The wisdom of crowds: why the many are smarter than the few and how collective wisdom shapes business, economies, societies and nations, Doubleday Books, 2004.
[33] K. Dellschaft, S. Staab, On how to perform a gold standard based evaluation of ontology learning, in: Proceedings of the Fifth International Semantic Web Conference, 2006, pp. 228–241.
[34] K. Probst, R. Ghani, M. Krema, A. Fano, Y. Liu, Semi-supervised learning of attribute value pairs from product descriptions, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007, pp. 2838–2843.
[35] K. Schekotykhin, D. Jannach, G. Friedrich, O. Kozeruk, AllRight: automatic ontology instantiation from tabular web documents, in: Proceedings of the Sixth International Semantic Web Conference and Second Asian Semantic Web Conference, Busan, South Korea, 2007, pp. 463–476.
[36] K. Tokunaga, J. Kazama, K. Torisawa, Automatic discovery of attribute words from Web documents, in: Proceedings of the Second International Joint Conference on Natural Language Processing, Korea, 2005, pp. 106–118.
[37] L. Ding, T. Finin, A. Joshi, R. Pan, R.S. Cost, Y. Peng, P. Reddivari, V.C. Doshi, J. Sachs, Swoogle: a search and metadata engine for the semantic web, in: Proceedings of the 13th ACM Conference on Information and Knowledge Management, ACM Press, 2004, pp. 652–659.
[38] M. Banko, M.J. Cafarella, S. Soderland, M. Broadhead, O. Etzioni, Open information extraction from the web, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007, pp. 2670–2676.
[39] M. Berland, E. Charniak, Finding parts in very large corpora, in: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, Maryland, USA, 1999, pp. 57–64.
[40] M. Cafarella, D. Downey, S. Soderland, O. Etzioni, KnowItNow: fast, scalable information extraction from the web, in: Proceedings of the Human Language Technology Conference, Vancouver, Canada, 2005, pp. 563–570.
[41] M. Fleischman, E. Hovy, Fine grained classification of named entities, in: Proceedings of the 19th Conference on Computational Linguistics, 2002, pp. 1–7.
[42] M. Hu, B. Liu, Mining and summarizing customer reviews, in: Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining, Seattle, USA, 2004, pp. 168–177.
[43] M. Pasca, Acquisition of categorized named entities for web search, in: Proceedings of the 13th ACM International Conference on Information and Knowledge Management, USA, 2004, pp. 137–145.
[44] M. Pasca, B. Van Durme, N. Garera, The role of documents vs. queries in extracting class attributes from text, in: Proceedings of the Sixteenth Conference on Information and Knowledge Management, Lisbon, Portugal, 2007, pp. 485–494.
[45] M. Ruiz-Casado, E. Alfonseca, P. Castells, Automatising the learning of lexical patterns: an application to the enrichment of WordNet by extracting semantic relationships from Wikipedia, Data and Knowledge Engineering 61 (3) (2007) 484–499.
[46] M. Sabou, Extracting ontologies from software documentation: a semi-automatic method and its evaluation, in: Proceedings of the ECAI-2004 Workshop on Ontology Learning and Population, Valencia, Spain, 2004.
[47] M.A. Hearst, Automatic acquisition of hyponyms from large text corpora, in: Proceedings of the 14th International Conference on Computational Linguistics, 1992, pp. 539–545.
[48] N. Guarino, Concepts, attributes and arbitrary relations: some linguistic and ontological criteria for structuring knowledge bases, Data and Knowledge Engineering 8 (1992) 249–261.
[49] N. Kiyavitskaya, N. Zeni, J.R. Cordy, L. Mich, J. Mylopoulos, Cerno: light-weight tool support for semantic annotation of textual documents, Data and Knowledge Engineering 68 (12) (2009) 1470–1492.
[50] N. Kobayashi, K. Inui, K. Tateishi, T. Fukushima, Collecting evaluative expressions for opinion extraction, Journal of Natural Language Processing 12 (3) (2005) 203–222.
[51] N. Yoshinaga, K. Torisawa, Open-domain attribute value acquisition from semi-structured texts, in: Proceedings of the Sixth International Semantic Web Conference, Workshop on Text to Knowledge: Lexicon/Ontology Interface, Busan, South Korea, 2007, pp. 55–66.
[52] O. Etzioni, M. Cafarella, D. Downey, A.M. Popescu, T. Shaked, S. Soderland, D.S. Weld, A. Yates, Unsupervised named-entity extraction from the web: an experimental study, Artificial Intelligence 165 (2005) 91–134.
[53] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, A. Yates, Unsupervised named-entity extraction from the web: an experimental study, Artificial Intelligence 165 (1) (2005) 91–134.
[54] P. Cimiano, J. Wenderoth, Automatic acquisition of ranked qualia structures from the web, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics, Prague, 2007, pp. 888–895.
[55] P. Cimiano, A. Pivk, L. Schmidt-Thieme, S. Staab, Learning taxonomic relations from heterogeneous sources of evidence, in: Proceedings of the ECAI Ontology Learning Workshop, 2004, pp. 59–73.
[56] P. Cimiano, Ontology Learning and Population from Text: Algorithms, Evaluation and Applications, Springer-Verlag, 2006.
[57] P. Pantel, M. Pennacchiotti, Espresso: leveraging generic patterns for automatically harvesting semantic relations, in: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 2006, pp. 113–120.
[58] P.D. Turney, Mining the Web for synonyms: PMI–IR versus LSA on TOEFL, in: Proceedings of the 12th European Conference on Machine Learning, Freiburg, Germany, 2001, pp. 491–499.
[59] R. Cilibrasi, P.M.B. Vitanyi, The Google similarity distance, IEEE Transactions on Knowledge and Data Engineering 19 (3) (2006) 370–383.
[60] R. Fano, Transmission of Information, MIT Press, Cambridge, MA, 1961.
[61] R. Girju, A. Badulescu, D. Moldovan, Automatic discovery of part-whole relations, Computational Linguistics 32 (1) (2006) 83–135.
[62] R. Studer, V.R. Benjamins, D. Fensel, Knowledge engineering: principles and methods, Data and Knowledge Engineering 25 (1–2) (1998) 161–197.
[63] R. Zanibbi, D. Blostein, J. Cordy, Decision-based specification and comparison of table recognition algorithms, in: Machine Learning in Document Analysis and Recognition, 2008, pp. 71–103.
[64] R.J. Brachman, H.J. Levesque, Readings in Knowledge Representation, California, USA, 1985, pp. 41–70.
[65] S. Mohammad, Measuring semantic distance using distributional profiles of concepts, Ph.D. Thesis, University of Toronto, Toronto, Canada, 2008.
[66] S. Ravi, M. Pasca, Using structured text for large scale attribute extraction, in: Proceedings of the 17th Conference on Information and Knowledge Management, 2008, pp. 1183–1192.
[67] S. Schlobach, M. Olsthoorn, M. de Rijke, Type checking in open-domain question answering, in: Proceedings of the European Conference on Artificial Intelligence, 2004, pp. 398–402.
[68] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Scientific American, 2001.
[69] T. Chklovski, Y. Gil, An analysis of knowledge collected from volunteer contributions, in: Proceedings of the 20th National Conference on Artificial Intelligence, Pittsburgh, USA, 2005, pp. 564–571.
[70] T. Veale, Y. Hao, Comprehending and generating apt metaphors: a web-driven, case-based approach to figurative language, in: Proceedings of AAAI, 2007, pp. 1471–1476.
[71] B.J. Jansen, The effect of query complexity on web searching results, Information Research 6 (1) (2000).
[72] T.K. Landauer, S.T. Dumais, A solution to Plato's problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge, Psychological Review 104 (1997) 211–240.
[73] V. Di Lecce, M. Calabrese, D. Soldo, Fingerprinting lexical contexts over the web, Journal of Universal Computer Science 15 (4) (2009) 805–825.
[74] Y.J. An, J. Geller, Y. Wu, S.A. Chun, Semantic deep web: automatic attribute extraction from the deep web data sources, in: Proceedings of the ACM Symposium on Applied Computing, 2007, pp. 1667–1672.
David Sánchez is a Lecturer at the Computer Science and Mathematics Department of the Universitat Rovira i Virgili. He is a member of the ITAKA research group (Intelligent Technologies for Advanced Knowledge Acquisition). His research interests are intelligent agents and ontology learning from the Web. He received a Ph.D. in Artificial Intelligence from the Technical University of Catalonia (UPC) in 2008. He has been involved in several national and European research projects and has published several journal papers and conference contributions.