Entity set expansion with semantic features of knowledge graphs


PII: S1570-8268(18)30044-1
DOI: https://doi.org/10.1016/j.websem.2018.09.001
Reference: WEBSEM 470

To appear in: Web Semantics: Science, Services and Agents on the World Wide Web

Received date: 17 December 2017
Revised date: 13 June 2018
Accepted date: 12 September 2018

Please cite this article as: J. Chen, et al., Entity set expansion with semantic features of knowledge graphs, Web Semantics: Science, Services and Agents on the World Wide Web (2018), https://doi.org/10.1016/j.websem.2018.09.001

Entity Set Expansion with Semantic Features of Knowledge Graphs

Jun Chen a,b, Yueguo Chen a,b,∗, Xiangling Zhang a,b, Xiaoyong Du a,b, Ke Wang c, Ji-Rong Wen a,b

a School of Information, Renmin University of China, China
b Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China
c School of Computing Science, Simon Fraser University, Canada

Abstract

A large-scale knowledge graph contains a huge number of path-based semantic features, which provide a flexible mechanism to assign and expand the semantics/attributes of entities. A particular set of these semantic features can be exploited on the fly to support particular entity-oriented semantic search tasks. In this paper, we use entity set expansion as an example to show how these path-based semantic features can be effectively utilized in a semantic search application. The entity set expansion problem is to expand a small set of seed entities to a more complete set of similar entities. Traditionally, this problem is solved by exploiting the statistical co-occurrence of entities in web pages, where the semantic correlation among the seed entities is not well exploited. We propose to address the entity set expansion problem using the path-based semantic features of knowledge graphs. Our method first discovers relevant semantic features of the seed entities, which can be treated as the common aspects of these seed entities, and then retrieves relevant entities based on the discovered semantic features. Probabilistic models are proposed to rank entities, as well as semantic features, by handling the incompleteness of knowledge graphs. Extensive experiments on a public knowledge graph (i.e., DBpedia V3.9) and three public test collections (i.e., CLEF-QALD 2-4, SemSearch-LS 2011, and INEX-XER 2009) show that our method significantly outperforms the state-of-the-art techniques.

Keywords: Knowledge Graph, Semantic Feature, Entity Set Expansion, Semantic Search, Ranking Model

1. Introduction

Imagine you are visiting a European oil painting exhibition, are highly attracted by Vincent_Van_Gogh's and Paul_Gauguin's work, and want to learn more about painters with the same painting style. You can submit a query like "Painters similar to Vincent Van Gogh and Paul Gauguin" to a search engine. Unfortunately, as illustrated in Fig. 1, the web search engine returns documents relevant to Vincent_Van_Gogh and Paul_Gauguin as results (e.g., the relationship between them and their representative paintings), which does not meet such an information need (i.e., finding painters similar to them). To address such cases, the entity set expansion (shorted as ESE) problem was proposed, which aims to expand a small set of seed entities (shorted as seeds, e.g., Vincent_Van_Gogh and Paul_Gauguin) to a more complete set of similar entities, by first discovering the common aspects of the seeds (e.g., both Vincent_Van_Gogh and Paul_Gauguin are post-impressionist painters), and then retrieving similar entities having these common aspects (e.g., Henri_de_Toulouse-Lautrec and Paul_Cezanne). The ESE problem is of practical importance and can be widely used in many applications such as web search

∗ Corresponding author. Email address: [email protected] (Yueguo Chen)

Preprint submitted to Journal of Web Semantics, September 16, 2018

Figure 1: Top-5 relevant results of a search engine for the query “Painters similar to Vincent Van Gogh and Paul Gauguin”.

(search by examples) [1, 2], item recommendation [3], dictionary construction [4], query refinement [5] and expansion [6]. For instance, item recommendation systems could provide suggestions to users based on the items they have browsed. To better illustrate the ESE problem, consider the following example:

Example 1. Given a query composed of the seeds {Forrest_Gump, Apollo_13_(film), Philadelphia_(film)}, return a ranked list of relevant entities with respect to the query, whose implicit query intent is to find "Tom Hanks' movies where he plays a leading role".

Figure 2: A subgraph of DBpedia V3.9, connecting the films Forrest_Gump, Apollo_13_(film), Philadelphia_(film), and Cast_Away to entities such as Tom_Hanks, Robert_Zemeckis, Jonathan_Demme, Ron_Howard, Gary_Sinise, and the categories American_films and Films_directed_by_Robert_Zemeckis, via predicates like starring, director, producer, subject, and type. The dashed line indicates a missing predicate, i.e., starring.
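To make the example concrete, the Fig. 2 subgraph can be sketched as a small set of triples, from which the length-1 path-based semantic features discussed below can be enumerated. This is a minimal illustration, not the paper's implementation: only a few of the figure's edges are reproduced, and the helper names are ours.

```python
# Toy subgraph of DBpedia V3.9 from Fig. 2, as (subject, predicate, object) triples.
TRIPLES = [
    ("Forrest_Gump", "starring", "Tom_Hanks"),
    ("Philadelphia_(film)", "starring", "Tom_Hanks"),
    ("Cast_Away", "starring", "Tom_Hanks"),
    ("Forrest_Gump", "director", "Robert_Zemeckis"),
    ("Apollo_13_(film)", "director", "Ron_Howard"),
    ("Forrest_Gump", "subject", "American_films"),
    ("Apollo_13_(film)", "subject", "American_films"),
    ("Philadelphia_(film)", "subject", "American_films"),
    # The edge Apollo_13_(film) --starring--> Tom_Hanks is missing (dashed in Fig. 2).
]

def sfs_length1(entity):
    """Length-1 path-based semantic features of an entity, as (predicate, anchor) pairs.
    Outgoing edges give p:o; incoming edges give the inverse predicate p^-1:s."""
    feats = set()
    for s, p, o in TRIPLES:
        if s == entity:
            feats.add((p, o))
        if o == entity:
            feats.add((p + "^-1", s))
    return feats
```

Note how the deliberately omitted edge makes Apollo_13_(film) lack the feature starring:Tom_Hanks, mirroring the incompleteness issue discussed in this paper.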

Traditionally, the ESE problem is solved based on unstructured texts such as web corpora [7, 8, 9]. SEAL [9] is one of the typical methods: it first learns contextual patterns of the seeds (i.e., maximally-long prefixes and suffixes bracketing the seeds in a web page) as the common aspects, and then ranks the entities extracted from the retrieved web pages using these learned patterns. As a result, entities sharing more patterns with the seeds are likely to have higher similarities. This method benefits a lot from the power of search engines, because the retrieved web pages are often highly relevant to the seeds. However, it is difficult to learn the common aspects of the seeds from unstructured texts. For instance, given the seeds in Example 1, SEAL generates false positive results such as Tom_Hanks and Radio_Flyer, which are far from the query intent of the seeds (i.e., "Tom Hanks' movies where he plays a leading role"), simply because they share more ad-hoc contextual patterns with the seeds in the retrieved web pages.

Different from unstructured texts, knowledge graphs (shorted as KGs) such as DBpedia are built on RDF1 triples ⟨subject, predicate, object⟩, which describe the information (i.e., properties and values) of entities and the relations among them. A whole RDF dataset can be represented as a directed and labeled graph, where subjects and objects are nodes and predicates are labels of the edges (e.g., Fig. 2 shows a subgraph of DBpedia V3.9). Even though many existing KGs are quite large, they are still incomplete [15]. For instance, 71% of people in Freebase lack the Place_of_Birth information [16].

In this paper, we apply large-scale KGs to address the ESE problem. In particular, we use path-based semantic features to represent the common aspects of the seeds. As illustrated in Fig. 2, one example of a path-based semantic feature is composed of a predicate starring and an entity Tom_Hanks, which can be used to describe the entities having Tom_Hanks as an actor. Note that an entity may have a large number of path-based semantic features, because the paths describing them can be longer than one. For instance, the path-based semantic feature composed of a sequence of predicates starring and director and an entity Robert_Zemeckis can be used to describe the entities playing roles in the films directed by Robert_Zemeckis. These path-based semantic features depict the properties of an entity in many aspects, which allows us to find the common aspects of the seeds, and therefore to build an effective ranking model for the ESE problem. Besides, they also allow us to achieve a fine-grained analysis of the relevance among entities. To our knowledge, this is the first work to focus on the ranking models of path-based semantic features and entities for addressing the ESE problem. There are two major challenges:

• Given a query composed of several seeds, it is likely that every seed has many path-based semantic features in KGs. How to choose a proper set of path-based semantic features from the KGs for ranking entities is critical for the search performance.

• Given the incompleteness of KGs, it is likely that some seeds miss some important path-based semantic features (e.g., the seed Apollo_13_(film) in Example 1 and Fig. 2 misses an important path-based semantic feature composed of a predicate starring and an entity Tom_Hanks). How to design robust ranking models that can address this deficiency of KGs is very challenging.

To address the above challenges, we propose a method called SEED to effectively evaluate the relevance of entities to the seeds. The key idea of the ranking models of SEED is to apply the highly relevant common path-based semantic features shared by the seeds in an error-tolerant manner. In other words, an applied path-based semantic feature has to be common to some seeds, if not all.
For a path-based semantic feature, a seed without it is handled by estimating the probability of the seed having it, so that more path-based semantic features can be applied for estimating the relevance of entities to the seeds. The main contributions of this paper can be summarized as follows:

• We propose flexible definitions of path-based semantic features for describing the common aspects of the seeds, by considering the incompleteness of KGs.

• We design effective ranking models of path-based semantic features and entities with respect to the seeds, as well as relaxation and selection mechanisms for choosing a proper set of path-based semantic features to rank entities.

• We conduct extensive experiments on a public KG and three public test collections. The results show that, in terms of precision, our proposed method significantly outperforms the state-of-the-art techniques.

The rest of the paper is organized as follows. §2 introduces the related work. §3 gives the definitions of the ESE problem and path-based semantic features. §4 introduces our method SEED, including the ranking models of path-based semantic features and entities, as well as the relaxation and selection mechanisms for choosing a proper set of path-based semantic features to rank entities. Experimental settings and results are given in §5 and §6 respectively. §7 concludes this paper.

1 http://www.w3.org/RDF/

2. Related Work

2.1. Methods based on Unstructured Data

The ESE problem has been widely studied based on unstructured texts; existing methods can generally be classified into pattern-based methods and embedding-based methods. The pattern-based methods [7, 9, 17] expand the entity set through a pattern generation step and an entity extraction step. The intuition of such methods is that similar entities should share more of the same patterns. SEAL [9] is a well-known pattern-based method, which first learns contextual patterns (i.e., maximally-long prefixes and suffixes bracketing the seeds in a web page) based on the web pages crawled using the seeds. Then, a graph of entities extracted from the crawled web pages (where edges are created based on the patterns) is built for ranking entities. Unfortunately, the performance of such methods is often affected by the limited supervision provided by the seeds (e.g., 2-5 seeds are given in many cases) [18, 19]. To address this issue, some bootstrapping-based strategies have been proposed to expand the entity set iteratively [8, 20, 21, 22, 23, 24, 25]. However, some seeds will introduce noisy patterns, the entities returned by these noisy patterns may introduce more noisy patterns, and this finally leads to semantic drift. Therefore, some methods [19, 26, 27] provide external constraints to resolve the semantic drift problem. Probabilistic Co-Bootstrapping (shorted as PCB) [26] is a typical method based on such an idea. It automatically generates discriminant negative entities, which highly overlap with the highly ranked positive entities, to refine the expansion boundary. However, the performance of this method is highly affected by the quality and quantity of the initial seeds, as well as by the strategy for selecting the positive and discriminant negative entities as feedback.

Different from the above pattern-based methods, the embedding-based methods [28, 29, 30, 31] represent entities as embeddings for evaluating their similarity. S-EM [28] is a typical embedding-based method, which learns a classifier based on PU learning. It also applies search engines to retrieve relevant web pages, and then extracts entities and represents them as embeddings using the context around them. The classifier returns similar entities by evaluating the similarity of the embeddings. However, these methods often fail to discover the semantic correlations among the seeds, nor do they provide a fine-grained analysis of the relevance among entities.

2.2. Methods based on Structured Data

Compared to unstructured data, structured data such as HINs (Heterogeneous Information Networks) and KGs represent entities in a semantic way, which provides an alternative way of addressing the ESE problem. Some topology-based methods [32, 33, 34, 35, 36, 37, 38, 39] have been studied to evaluate the semantic correlation between two entities in a HIN, and they can be applied to the ESE problem by aggregating the similarities of an entity to all seeds. LDSD [32] and HeteSim [34] are two typical methods, which evaluate the similarity between two entities based on the links or paths in a HIN. However, these pairwise similarity measures cannot well reveal the commonality of the seeds, because an entity may be similar to different seeds in quite different ways. Compared to pairwise similarity measures, some methods [40, 41, 42] address the ESE problem using the common or frequent aspects discovered from the seeds in KGs. Metzger et al. [40] propose a method called QBEES to investigate the commonality of the seeds. It first detects common aspects shared by all seeds, and then generates the candidates based on these common aspects. However, this method does not consider the incompleteness of KGs, which often results in a failure to discover the important commonality of the seeds. Abedjan et al. [41] apply an association rule mining algorithm (shorted as ARM) [43] for discovering significant aspects shared by the seeds, which can be applied for retrieving similar entities. Compared to QBEES, this ARM-based method is more error-tolerant: it can find more useful aspects shared by all or part of the seeds. However, the lack of an effective ranking model for entities leads to many false positive results. Besides, the traditional language model-based method (shorted as BBR) [44] can also be applied to evaluate the relevance of an entity to the query. It combines a term-based language model with a structural model for ranking entities; however, such a method is not effective enough to achieve favorable accuracy for ESE in KGs.

2.3. Entity-Oriented Search Events

There are also some important entity-oriented search events focusing on entity search tasks [13]. The test collections derived from these tasks can be used as benchmarks for the ESE problem. For instance, INEX-XER 2009 [45] launched two tasks, entity ranking and entity list completion, to retrieve entities in Wikipedia data. Each topic of INEX-XER 2009 specifies a small number of seeds, with a natural language question describing the desired entities as an addition. In SemSearch-LS 2011, queries of the ad-hoc list task target a group of entities that match certain criteria over RDF datasets [46, 47]. Furthermore, there are also some question answering and entity search tasks that apply linked data as a basis (e.g., the CLEF-QALD series [14]). For the above tasks, Balog et al. [13] have linked their ground truths to the corresponding entities in the DBpedia KG, which can easily be applied to evaluate the performance of methods addressing the ESE problem. In this paper, we apply the 7 state-of-the-art methods mentioned above as alternatives for comparison with our method: SEAL, LDSD, HeteSim, PCB, BBR, QBEES, and ARM. We apply three public test collections derived from the above entity-oriented search events as ground truths: CLEF-QALD 2-4, SemSearch-LS 2011, and INEX-XER 2009.

3. Preliminary

The ESE problem based on KGs can be defined as follows: given a KG K and a query Q containing m seeds (Q = {e1, e2, . . . , em}, m > 1), return a ranked list of relevant entities with respect to the query. Before presenting our method, we first give some important definitions.

Definition 1 (Knowledge Graph). A knowledge graph (shorted as KG) is denoted as K = {E, U, P, τ}, where 1) E is an entity set, 2) U ⊆ E × E is a set of directed edges, 3) P is a set of edge labels (predicates), and 4) τ : U → P is a mapping function that defines the mappings from the edges to the labels.

A KG K represents the information of millions of entities, as well as their diverse relations, using a labeled and directed graph model. Each labeled edge represents a relation between two entities. For instance, τ(s, o) → p (also representable as a triple ⟨s, p, o⟩), where s ∈ E, o ∈ E, p ∈ P, (s, o) ∈ U, means that the relation between the entities s and o is the predicate p. Note that we use p−1 to represent the inverse relation of the predicate p, so a triple ⟨s, p, o⟩ can also be represented as another triple ⟨o, p−1, s⟩. In our study, we propose the definition of path-based semantic feature to depict the properties of an entity in many aspects.

Definition 2 (Path-Based Semantic Feature). Given a KG K, a path-based semantic feature (shorted as SF) consists of a sequence of predicates P = p1 ◦ · · · ◦ pl with length l and an anchor entity e. We denote it as π = P : e.

The sequence of predicates P in a SF is also called a path. For instance, a SF π1 = starring:Tom_Hanks means that the predicate starring is the path to the anchor entity Tom_Hanks, which can be used to describe the properties of entities that have Tom_Hanks as an actor. Furthermore, we can also use SFs to describe the query intent. For instance, the query intent "Who directed the movies acted by Tom_Hanks" can be well described via the SF π2 = director−1 ◦ starring:Tom_Hanks, where the sequence of predicates director−1 ◦ starring is the path to the anchor entity Tom_Hanks. The length of a SF π = P : e is measured as the number of predicates in its path P. For instance, the length of π1 is 1, and the length of π2 is 2. Given a SF π, if an entity e has the SF π, we say that e is a target entity of π, which is denoted as e |= π. The set of all target entities of a SF π is denoted as E(π) = {e | e |= π}. For instance, as illustrated in Fig. 2, Forrest_Gump, Philadelphia_(film), and Cast_Away are all in the set E(π1).

Definition 3 (Common SF). Given a KG K and a query Q containing m seeds, a SF π is a common path-based semantic feature (shorted as CSF) of Q, if ||E(π) ∩ Q|| = m.

According to this definition, the CSFs are the SFs shared by all seeds in the query Q. They can then be used to find entities having the same CSFs. However, due to the incompleteness of KGs, it is common that some entities miss some important SFs, which may result in a low recall of relevant SFs if we simply apply the CSFs as the relevant SFs. To address this issue, we give the definition of k-relaxed CSF as follows.

Definition 4 (k-relaxed CSF). Given a KG K and a query Q containing m seeds, a SF π is a k-relaxed common path-based semantic feature (shorted as kCSF) of Q, if ||E(π) ∩ Q|| ≥ max(2, m − k).

Compared to a CSF, a kCSF allows at most k seeds in Q not to have it, while at least 2 seeds in Q must have it. Obviously, the introduction of this relaxation mechanism increases the recall of CSFs. When enlarging the parameter k, the probability that a SF is a kCSF of the query Q increases accordingly.

4. Ranking Models

In this section, we present the key techniques of SEED, i.e., the ranking models of SFs and entities, as well as the relaxation and selection mechanisms for selecting the top-n relevant kCSFs to rank entities. The frequently used notations are given in Tab. 1.

Table 1: Frequently used notations

K = {E, U, P, τ} : A KG
Q = {e1, e2, . . . , em} : A query containing m seeds
P = p1 ◦ p2 ◦ · · · ◦ pl : A path composed of a sequence of predicates with length l
π = P : e : A SF composed of a path P and an anchor entity e
E(π) = {e | e |= π} : The set of entities having a SF π
k : A parameter for generating k-relaxed CSFs from a query Q
n : A parameter for selecting the top-n relevant kCSFs from a query Q
Φ(Q) : The set of top-n relevant kCSFs derived from a query Q for ranking entities
C(e) : The set of classes of an entity e
c∗ : A representative class generalized from the set of classes C(e)
r(π, Q) : The relevance of a SF π to a query Q
r(e, Q) : The relevance of an entity e to a query Q
d(π) : The discriminability of a SF π in a KG
c(π, Q) : The commonality of a SF π to a query Q
p(π|e) : The probability of an entity e having a SF π

4.1. The Ranking Model of SFs

To effectively evaluate the relevance of a SF π to a query Q, we first introduce two components: 1) the discriminability of a SF π in a KG (i.e., d(π), a measure to estimate how discriminative a SF is in a KG); 2) the commonality of a SF π to a query Q (i.e., c(π, Q), a measure

to estimate the semantic correlation between a SF and a query. We then formalize the relevance of a SF π to a query Q, i.e., r(π, Q), as the product of the above two components:

r(π, Q) = d(π) × c(π, Q)    (1)

In the following, we introduce and formalize the two components d(π) and c(π, Q) in detail.

4.1.1. The Discriminability of SFs

Given a query Q containing m seeds, it is likely to find many SFs in a KG. Among these discovered SFs, some are fine-grained and shared by only a few entities, while others are coarse-grained and widely shared by many entities. Therefore, we need a measure of the discriminability of a SF to retrieve relevant entities more effectively. For instance, given two SFs derived from the seeds in Example 1, π1 = starring:Tom_Hanks and π3 = subject:American_films, we find that π1 is more specific than π3 in terms of describing the common aspects shared by the entities in a KG. This is because ||E(π1)|| ≪ ||E(π3)||, i.e., the number of films played by Tom_Hanks is much smaller than the total number of American_films in a KG. The entities are semantically closer to each other under the constraint of π1 than under that of π3. Inspired by the idea of IDF (Inverse Document Frequency) [48], the discriminability of a SF π in a KG is defined as:

d(π) = 1 / ||E(π)||    (2)

where a larger ||E(π)|| means that the entities in E(π) are more loosely correlated in terms of the constraint of π; π therefore has less discriminability for the entities having it.

4.1.2. The Commonality of SFs

Given a query Q containing m seeds, a discovered SF π may not be shared by all seeds. The reason may lie in two factors: 1) this π is relevant to some seeds and desired by the query intent, but, due to the deficiency of KGs, some seeds do not have it, although they should have it if KGs were complete; 2) this π is a false positive SF, which is not desired by the query intent, so that only some seeds have it. However, it is hard to judge whether a SF π is a false positive or not. To effectively evaluate the semantic correlation between a SF and a query, we first introduce p(π|e) to estimate the probability of an entity e having a SF π. For e |= π, p(π|e) is naturally estimated as 1.0. Otherwise, we estimate p(π|e) as p(π|c∗) based on a representative class c∗, which is derived from a class generalization step according to e and π. For instance, as illustrated in Example 1 and Fig. 2, the seed Apollo_13_(film) does not have the SF π1 = starring:Tom_Hanks due to the incompleteness of KGs. If we determine that the class Film is the representative class c∗ of Apollo_13_(film) and π1, we can estimate the probability of Apollo_13_(film) having π1 statistically under the constraint of Film. Such a way of using p(π|c∗) to estimate p(π|e) is based on the relaxation of an entity e to a class c∗, so that the likelihood of e |= π is considered with the background of c∗. Accordingly, the probability of an entity e having a SF π is defined as:

p(π|e) = 1, if e |= π; otherwise p(π|e) = p(π|c∗) = ||E(π) ∩ E(c∗)|| / ||E(c∗)||    (3)

where E(c∗) is the set of entities belonging to the representative class c∗. Therefore, the more entities shared by E(π) and E(c∗), the larger the probability of an entity having such a SF. We then formalize the commonality of a SF to a query, i.e., c(π, Q), as the product of the probabilities of each entity in the query having this SF:

c(π, Q) = ∏_{e∈Q} p(π|e)    (4)

In the following, we discuss in detail how to generalize the representative class c∗ for estimating the probability of an entity e having a SF π.
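The relaxation mechanism of Definition 4 and the SF ranking model of Eqs. 1-4 can be sketched together as follows. This is a minimal illustration, not the authors' implementation: sf_index maps each entity to its set of SFs, ent_of_sf maps each SF π to its target set E(π), ent_of_cls maps each class c to its entity set E(c), and rep_cls is assumed to be a precomputed mapping from (entity, SF) to the representative class c∗ of the class generalization step.

```python
def k_relaxed_csfs(query, sf_index, k):
    """Definition 4: SFs shared by at least max(2, m - k) of the m seeds."""
    m = len(query)
    candidates = set().union(*(sf_index[e] for e in query))
    return {pi for pi in candidates
            if sum(1 for e in query if pi in sf_index[e]) >= max(2, m - k)}

def relevance_of_sf(pi, query, ent_of_sf, ent_of_cls, rep_cls):
    """r(pi, Q) = d(pi) * c(pi, Q), i.e., Eq. 1 built from Eqs. 2-4."""
    E_pi = ent_of_sf[pi]
    d = 1.0 / len(E_pi)                      # Eq. 2: discriminability
    c = 1.0                                  # Eq. 4: commonality as a product
    for e in query:
        if e in E_pi:
            p = 1.0                          # Eq. 3: e |= pi
        else:                                # Eq. 3: relax e to its class c*
            E_c = ent_of_cls[rep_cls[(e, pi)]]
            p = len(E_pi & E_c) / len(E_c)
        c *= p
    return d * c                             # Eq. 1
```

For instance, with seeds {a, b, c} where only a and b have a SF y, setting k = 0 discards y, while k = 1 keeps it, and a seed missing y contributes the class-based estimate of Eq. 3 instead of zeroing out the product.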

4.1.3. Class Generalization

Most entities in KGs have multiple classes. Fig. 2 shows an example class Film, which is one class of the entity Apollo_13_(film) (indicated by a specific predicate type). The classes of entities in DBpedia are defined as an ontology. As illustrated in Fig. 3, it contains a hierarchy of classes with the root class Thing. A directed relation between two classes shows the hypernymy relationship between them (e.g., Actor is a sub-class of Artist).

Figure 3: A partial ontology of classes in DBpedia V3.9, with the root class Thing and sub-classes such as Work (WrittenWork, Comic, Book, Film), Agent (Person, Artist, Actor, Dancer, MovieDirector, Writer, Painter), and Place (PopulatedPlace, Country).

Intuitively, it is likely that entities belonging to fine-grained classes (i.e., the low-level classes in the ontology) share more common aspects in KGs, which can help us to estimate p(π|e) more effectively. For instance, in Example 1, the class Film is better than Work for the seed Apollo_13_(film) under the background of the film-related SFs (e.g., π1 = starring:Tom_Hanks). Therefore, we propose to discover a class that can represent an entity under the background of a SF, which we call a class generalization step. A straightforward solution to the class generalization step is to apply the most fine-grained class (i.e., a leaf-level class with the least number of entities in the ontology). However, such a class may not match the SF π well in terms of the entities satisfying it. To address this issue, we need a solution that finds a representative class instead of the most fine-grained one, which should satisfy: 1) it covers as many entities in E(π) as possible; 2) it is as fine-grained as possible. However, these two goals are often self-contradictory. To achieve a tradeoff between them, we borrow the idea of information gain used in constructing decision trees [49]. Inspired by the idea of maximal information gain [50], we try to find the representative class c∗ by maximizing the following aggregated value:

c∗ = argmax_{c∈C(e)} p(c|π) × ∑_{ci∈C(e)} p(ci|π) × (I(ci) − I(ci|c))    (5)

where C(e) is the set of classes of the entity e, p(c|π) = ||E(c) ∩ E(π)|| / ||E(π)||, I(ci) = −log(||E(ci)|| / N), I(ci|c) = −log(||E(ci) ∩ E(c)|| / ||E(c)||), and N is the total number of entities in the KG K. Both p(c|π) and p(ci|π) serve as the weights of the classes c and ci under the background of the SF π. A fine-grained class c is likely to have smaller coverage of E(π) than a coarse-grained one (i.e., p(c|π), a measure to achieve the first goal). However, a fine-grained class c is likely to have a larger information gain over a coarse-grained class ci (i.e., I(ci) − I(ci|c), a measure to achieve the second goal).

4.2. The Ranking Model of Entities

To effectively evaluate the relevance of an entity to the query, we first retrieve the relevant SFs of the query. In our study, we apply a parameter k to generate the kCSFs from the seeds in the query as a relaxation mechanism, and a parameter n to select the top-n relevant kCSFs for ranking entities as a selection mechanism. We then use these selected kCSFs, i.e., Φ(Q), for retrieving relevant entities. Based on the ranking model of SFs discussed above, the relevance of an entity e to a query Q, i.e., r(e, Q), is an aggregation of the product of two components: 1) the probability of an entity e having a SF π (i.e., Eq. 3); 2) the relevance of a SF π to a query Q (i.e., Eq. 1). Accordingly, we formalize the relevance of an entity e to a query Q as:

r(e, Q) = ∑_{π∈Φ(Q)} p(π|e) × r(π, Q)    (6)

Based on the above ranking models of SFs and entities, as well as the relaxation and selection mechanisms of kCSFs, the overall definition of our method SEED for addressing the ESE problem is: given a KG K, a query Q containing m seeds (Q = {e1, e2, . . . , em}, m > 1), a parameter k for generating kCSFs from the query, and a parameter n for selecting the top-n relevant kCSFs for ranking entities, return a ranked list of relevant entities with respect to the query.

5. Experimental Settings

In order to evaluate the performance of our method SEED, we conduct experiments on a public KG (i.e., DBpedia V3.9) and three public test collections (i.e., CLEF-QALD 2-4, SemSearch-LS 2011, and INEX-XER 2009). Besides, 7 state-of-the-art methods are applied for comparison.

5.1. Knowledge Graphs

DBpedia version 3.9 is applied as the KG in our experiments. It is extracted from a Wikipedia dump of April 2013. It describes nearly 4 million entities, including 832K persons, 639K places, 372K creative works (including 116K music albums, 78K films, and 18.5K video games), 209K organizations (including 49K companies and 45K educational institutions), 226K species, 5.6K diseases, etc. Among the sub-datasets of DBpedia V3.9, Redirects, Articles Categories, Persondata, Mapping-based Properties, and Mapping-based Types are applied in our experimental study. Some important statistical information of DBpedia V3.9 is listed in Tab. 2.

5.2. Test Collections

We apply three public test collections in our experiments: CLEF-QALD 2-4 [14], SemSearch-LS 2011 [46, 47], and INEX-XER 2009 [45].

• CLEF-QALD (abbreviated as CLEF): The Question Answering over Linked Data campaign aims to answer natural language questions (e.g., “Give me a list of all bandleaders that play trumpet”) over linked datasets (i.e., DBpedia) [14]. Because some queries of CLEF-QALD aim to retrieve a list of relevant entities, we select distinct queries with such a query intent from QALD-2, QALD-3, and QALD-4. Each topic of CLEF can be described as a SPARQL2 query (e.g., “SELECT DISTINCT ?x WHERE {?x occupation Bandleader. ?x instrument Trumpet.}”). For most of these topics, we can easily find SFs from the seeds that describe the desired query intent (e.g., for “Give me a list of all bandleaders that play trumpet”, instrument:Trumpet and occupation:Bandleader perfectly describe the query intent).

• SemSearch-LS (abbreviated as SemSearch): Topics of the ad-hoc list task at the Semantic Search Challenge 2011 target a group of entities matching keywords (e.g., “Axis powers of World War II”) in RDF datasets (i.e., the BTC 2009 dataset) [46, 47]. For some topics, it is hard to find relevant SFs in KGs.

• INEX-XER (abbreviated as INEX): The INEX 2009 XML Entity Ranking task aims to retrieve entities relevant to the title (e.g., “List of countries in World War Two”) in Wikipedia data [45]. In addition, each topic provides extra information, including categories (e.g., “countries”) and example entities (e.g., Japan, Germany, United_States). Because the example entities have been explicitly provided for each topic, we put them at the head of the ground truth list of each topic. Note that for some of these topics it is also hard to find relevant SFs in KGs.

For these test collections, Balog et al. [13] have already linked the ground truths of SemSearch and INEX to the corresponding entities in DBpedia. For CLEF, we do not need to link the results because they are directly based on DBpedia. Tab. 3 lists some important characteristics of these three test collections.

Table 2: Statistical information of DBpedia V3.9

#entities                        4,260,000
#classes                         529
#categories                      753,524
#properties                      2,333
#facts/triples                   111,951,944
#entities with categories        4,125,821
Avg(#entities per class)         23,121
Avg(#entities per leaf class)    5,927
Avg(#classes per entity)         3.12

Table 3: Characteristics of test collections

Characteristics   CLEF   SemSearch   INEX
#topics           60     43          55
Avg(#results)     42     12.5        29.8
Min(#results)     8      2           7
Max(#results)     1148   41          68

For each topic, we apply 5 query groups for experimental evaluation, whose numbers of seeds m (i.e., #seeds) are 2, 3, 4, and 5 for the first 4 query groups, and a mix of them for the mix query group. To construct queries for the query groups with m seeds, we use the first m entities of the ground truth list as the query of each topic, and the remaining entities of the list are used as ground truths for evaluating the performance on this topic. To generate the mix query group, for INEX, we directly apply the original seeds (i.e., example entities) of INEX as queries. For CLEF and SemSearch, we construct queries for the mix query group by successively picking topics from the first 4 query groups. We first pick the 25% of topics performing best (i.e., in terms of the average MAP of all methods) in the query group with 2 seeds, which means these topics will have 2 seeds in the mix group. Then, among the remaining topics, we pick another 25% of topics performing best (i.e., excluding those already picked) in the query group with 3 seeds, and so on.

5.3. Evaluation Metrics

We adopt the following metrics [48] for experimental evaluation:

• Precision@k (abbreviated as p@k): the mean over all queries of the percentage of relevant entities in the top-k ranked results; p@5 and p@10 are measured in our study.

• R-Precision (abbreviated as p@R): the mean over all queries of the percentage of relevant entities in the top-R ranked results, where R is the number of the given gold standard results for a query.

• Mean Reciprocal Rank (abbreviated as MRR): the mean over all queries of the reciprocal position of the first relevant entity in the ranked results.

• Mean Average Precision (abbreviated as MAP): the mean over all queries of the average precision of the relevant entities in the ranked results.

Note that the results of each query are evaluated against the given ground truths excluding the seeds, and all significance tests are conducted using a t-test at a significance level of p = 0.05 [51].

2 http://www.w3.org/TR/rdf-sparql-query
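The per-query versions of these metrics can be made concrete with a short sketch; the reported numbers are the means over all queries, and the ranked list and ground truths below are hypothetical:

```python
def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k ranked results that are relevant (p@k)
    return sum(1 for e in ranked[:k] if e in relevant) / k

def reciprocal_rank(ranked, relevant):
    # 1 / position of the first relevant entity (0 if none appears)
    for pos, e in enumerate(ranked, start=1):
        if e in relevant:
            return 1.0 / pos
    return 0.0

def average_precision(ranked, relevant):
    # Mean of p@pos taken at the positions of the relevant entities
    hits, total = 0, 0.0
    for pos, e in enumerate(ranked, start=1):
        if e in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant) if relevant else 0.0

ranked = ["a", "x", "b", "y", "c"]   # hypothetical ranked result list
relevant = {"a", "b", "c"}           # ground truths (seeds excluded)
print(precision_at_k(ranked, relevant, len(relevant)))  # R-Precision: 2/3
print(reciprocal_rank(ranked, relevant))                # 1.0 (first hit at rank 1)
print(average_precision(ranked, relevant))              # (1/1 + 2/3 + 3/5) / 3
```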


5.4. Alternative Methods for Comparison

We apply 7 state-of-the-art techniques mentioned in the related work as alternative methods for comparison with our method SEED: SEAL [9], LDSD [32], HeteSim [34], PCB [26], BBR [44], QBEES [40], and ARM [41]. Among these alternatives, SEAL [9] is a method based on unstructured texts; we apply the public project on GitHub3 to evaluate its performance, where Google is configured as its search engine for retrieving relevant documents, and the random walk with restart algorithm [52] is configured for ranking entities. Because the results of SEAL are generated from web pages, to ensure fairness, we first link them to the corresponding entities in DBpedia V3.9 before evaluating its performance. We re-implement the other methods over KGs ourselves, following the descriptions in their papers. In practice (i.e., for all the test topics), we find that SFs of length at most 2 are effective enough to describe the query intents over KGs [53]. Thus, we generate SFs whose lengths do not exceed 2 for all queries.

6. Experimental Results

We first make an overall comparison of all methods on the three test collections. Then we evaluate the impact of the parameter k that generates kCSFs from the seeds and the parameter n that selects the top-n relevant kCSFs for ranking entities, as well as the effectiveness of the ranking models in our method SEED. Finally, we study some use cases to show the effectiveness of our method SEED. The experiments are designed to address the following research questions:

• RQ1: Does our error-tolerant method SEED outperform the state-of-the-art techniques? (§6.1)

• RQ2: Can the proposed relaxation mechanism improve the search performance by addressing the incompleteness of KGs? (§6.2)

• RQ3: How does the proposed selection mechanism affect the search performance? (§6.3)

• RQ4: How robust is our ranking model with different settings? (§6.4)

6.1. An Overall Comparison

In this experiment, we address research question RQ1 by evaluating the performance of all compared methods on the three test collections using 5 query groups. We apply SEAL (i.e., the method achieving the best performance among those based on unstructured texts) and ARM (i.e., the method achieving the best performance among the methods based on KGs in most cases) as the baselines for the significance tests. For our method SEED, we set the parameters k and n as 3 and 100 respectively by default.

As illustrated in Tab. 4, SEAL achieves good performance on the three test collections, because it benefits greatly from the power of search engines, and the retrieved web pages are often highly relevant to the seeds. However, it is difficult to discover and extract the common aspects of the seeds from unstructured texts in some cases. For instance, as illustrated in Tab. 5, SEAL generates false positive results such as Tom_Hanks and Radio_Flyer, which are far from the query intent of the seeds (i.e., “Tom Hanks’ movies where he plays a leading role”), simply because they share more ad-hoc contextual patterns with the seeds in the retrieved web pages (e.g.,

3 https://github.com/TeamCohen/SEAL
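The length-≤2 SF generation mentioned in §5.4 can be sketched over a toy triple store. The triples, the ◦ path rendering, and the traversal below are illustrative assumptions, not the paper's implementation:

```python
from collections import defaultdict

# Toy (subject, predicate, object) triples; hypothetical data for illustration
triples = [
    ("Big", "starring", "Tom_Hanks"),
    ("Big", "director", "Penny_Marshall"),
    ("Penny_Marshall", "birthPlace", "New_York_City"),
]

out_edges = defaultdict(list)
for s, p, o in triples:
    out_edges[s].append((p, o))

def semantic_features(entity, max_len=2):
    """Path-based features of length <= max_len, e.g. 'director◦birthPlace:New_York_City'."""
    feats = set()
    for p1, o1 in out_edges[entity]:
        feats.add(f"{p1}:{o1}")              # length-1 SF (predicate:value)
        if max_len >= 2:
            for p2, o2 in out_edges[o1]:
                feats.add(f"{p1}◦{p2}:{o2}") # length-2 SF (predicate path)
    return feats

print(sorted(semantic_features("Big")))
```

Restricting enumeration to length 2 keeps the feature space tractable while still capturing indirect aspects such as "directed by someone born in New York City".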


Table 4: Overall comparison on three test collections using 5 query groups labelled as m (i.e., #seeds). We apply SEAL and ARM as the baselines; the notations ∗ and • denote statistically significant improvement over SEAL and ARM in each query group respectively.

method  | CLEF: p@5 p@10 p@R MRR MAP | SemSearch: p@5 p@10 p@R MRR MAP | INEX: p@5 p@10 p@R MRR MAP

m = 2
SEAL    | .377 .290 .269 .550 .263 | .315• .241• .258• .415• .247• | .392• .373• .307• .540 .285•
LDSD    | .147 .122 .113 .264 .118 | .133 .103 .093 .231 .102 | .235 .215 .181 .482 .151
HeteSim | .357 .305 .242 .508 .229 | .149 .103 .097 .247 .075 | .185 .150 .129 .293 .107
PCB     | .267 .253 .187 .382 .182 | .108 .095 .078 .206 .069 | .262 .233 .188 .367 .163
BBR     | .340 .305 .263 .446 .248 | .210 .177 .146 .364 .140• | .335 .275 .232 .463 .184
QBEES   | .507∗ .400∗ .369∗ .654∗ .336∗ | .179 .141 .098 .226 .093 | .483•∗ .381• .315• .621•∗ .273•
ARM     | .503∗ .422∗ .377∗ .662∗ .372∗ | .200 .169 .152 .341 .129 | .361 .321 .271 .558 .233
SEED    | .527•∗ .448•∗ .421•∗ .691•∗ .415•∗ | .245• .227• .185• .325 .186• | .435•∗ .396• .341•∗ .605•∗ .328•∗

m = 3
SEAL    | .453 .363 .340 .591 .354 | .305• .235• .246• .404• .236• | .450• .423• .359• .532 .339•
LDSD    | .170 .143 .131 .270 .140 | .169 .137 .114 .297 .094 | .269 .254 .223 .459 .176
HeteSim | .390 .312 .279 .557 .256 | .175 .144 .122 .321 .099 | .219 .181 .154 .359 .120
PCB     | .347 .315 .253 .455 .245 | .115 .095 .088 .226 .078 | .274 .245 .195 .411 .168
BBR     | .393 .335 .298 .505 .284 | .231• .184 .151 .423 .144 | .327 .294 .243 .501• .190
QBEES   | .557∗ .440∗ .423∗ .688∗ .430∗ | .169 .144 .104 .227 .093 | .419• .381• .305• .599•∗ .277•
ARM     | .550∗ .468∗ .446∗ .665∗ .442∗ | .200 .184 .149 .366 .130 | .335 .312 .269 .534 .238
SEED    | .617•∗ .520•∗ .510•∗ .792•∗ .520•∗ | .315• .258• .228• .434•∗ .216• | .527•∗ .481•∗ .401•∗ .496• .374•∗

m = 4
SEAL    | .420 .350 .354 .539 .359 | .312• .235• .251• .419• .226• | .415• .377• .326• .516 .318•
LDSD    | .197 .163 .153 .308 .146 | .200 .169 .146 .339 .118 | .300 .271 .216 .477 .181
HeteSim | .360 .287 .271 .532 .249 | .172 .138 .113 .334 .085 | .196 .167 .137 .325 .110
PCB     | .405 .378 .325 .515 .305 | .142 .127 .108 .246 .098 | .312 .298 .255 .430 .211
BBR     | .363 .312 .302 .526 .280 | .262• .207• .173 .469• .144• | .319 .287 .247 .511 .188
QBEES   | .557•∗ .453∗ .452•∗ .668∗ .454∗ | .179 .152 .128 .264 .119 | .396• .360• .269 .507 .251•
ARM     | .527∗ .430∗ .420∗ .716∗ .427∗ | .200 .183 .154 .397 .117 | .362 .308 .267 .557 .228
SEED    | .660•∗ .530•∗ .551•∗ .791•∗ .567•∗ | .345•∗ .285•∗ .274• .485•∗ .245•∗ | .504•∗ .458•∗ .382•∗ .672•∗ .356•∗

m = 5
SEAL    | .410 .317 .352 .535 .351 | .285• .256• .226• .401• .205• | .365• .337• .298• .396 .295•
LDSD    | .173 .145 .153 .282 .151 | .186 .138 .139 .317 .109 | .327 .319 .232 .538•∗ .194
HeteSim | .350 .283 .273 .507 .254 | .152 .110 .087 .274 .069 | .192 .171 .144 .320 .110
PCB     | .437 .428 .369 .585 .335 | .168 .145 .122 .255 .115 | .335 .314 .275 .478 .235
BBR     | .350 .323 .304 .515 .290 | .214• .172 .149 .440• .127• | .335 .287 .250 .576 .200
QBEES   | .520∗ .428∗ .449∗ .638∗ .451∗ | .145 .128 .100 .228 .098 | .285 .250 .200 .391 .189
ARM     | .503∗ .418∗ .426∗ .665∗ .437∗ | .172 .155 .142 .343 .102 | .331 .294 .263 .481 .226
SEED    | .613•∗ .492•∗ .557•∗ .746•∗ .568•∗ | .405•∗ .325•∗ .275•∗ .512•∗ .242•∗ | .488•∗ .440•∗ .381•∗ .663•∗ .365•∗

m = mix
SEAL    | .417 .325 .315 .565 .325 | .295• .241• .245• .407• .213• | .446• .375• .314• .631• .303•
LDSD    | .173 .140 .139 .275 .145 | .175 .138 .124 .301 .101 | .262 .227 .204 .496 .170
HeteSim | .375 .298 .275 .535 .255 | .162 .120 .089 .285 .087 | .227 .169 .136 .339 .119
PCB     | .402 .365∗ .334 .551 .315 | .178 .158 .135 .305 .112 | .315 .278 .258 .455 .237
BBR     | .370 .329 .295 .501 .280 | .225∗ .185 .175 .452•∗ .135 | .323 .277 .251 .504 .210
QBEES   | .517∗ .423∗ .412∗ .630∗ .407∗ | .171 .148 .125 .256 .108 | .423• .381• .299 .592• .287•
ARM     | .541∗ .457∗ .433∗ .653∗ .428∗ | .195 .185 .158 .389 .135 | .350 .304 .273 .537 .254
SEED    | .610•∗ .497•∗ .530•∗ .763•∗ .540•∗ | .387•∗ .326•∗ .279•∗ .483•∗ .228• | .491•∗ .433•∗ .399•∗ .690•∗ .391•∗

Table 5: Top-10 relevant entities retrieved by all methods for the query in Example 1.
Underlined entities are not ground truths and thus deemed as incorrect results.

SEAL: Saving_Private_Ryan, The_Green_Mile, Cloud_Atlas, Captain_Phillips, Sleepless_in_Seattle, You’ve_Got_Mail, A_League_of_Their_Own, Tom_Hanks, Radio_Flyer, Rita_Wilson
LDSD: The_Insider_(film), Reversal_of_Fortune, Dances_with_Wolves, The_People_vs._Larry_Flynt, The_Bridge_on_the_River_Kwai, The_English_Patient_(film), Rocky, Raging_Bull, High_Noon, Capote_(film)
HeteSim: The_Polar_Express_(film), You’ve_Got_Mail, Contact, Cast_Away, Death_Becomes_Her, Back_to_the_Future_Part_III, Back_to_the_Future_Part_II, Flight_(2012_film), The_Silence_of_the_Lambs_(film), Melvin_and_Howard
PCB: A_Beautiful_Mind_(film), On_the_Waterfront, The_French_Connection_(film), Born_on_the_Fourth_of_July_(film), The_Bridge_on_the_River_Kwai, Gandhi_(film), In_the_Heat_of_the_Night_(film), Cast_Away, Ben-hur_(1959_film), Sleepless_in_Seattle
BBR: A_Beautiful_Mind_(film), Schindler’s_List, Born_on_the_Fourth_of_July_(film), The_Deer_Hunter, Dances_with_Wolves, The_Godfather, The_Lost_Weekend_(film), The_Silence_of_the_Lambs_(film), The_Doors_(film), The_Bridge_on_the_River_Kwai
QBEES: Remember_Me_(2010_film), We_Are_Marshall, Mommie_Dearest_(film), Dances_with_Wolves, Fame_(1980_film), 101_Dalmatians_(1996_film), The_Princess_Bride_(film), Smokey_and_the_Bandit, The_Commitments_(film), The_Good_Shepherd_(film)
ARM: Cast_Away, Saving_Private_Ryan, Cloud_Atlas, Sleepless_in_Seattle, Dragnet_(1987_film), Big_(film), The_Da_Vinci_Code_(film), Next_(2007_film), Oliver_the_Eighth, Rhythm_in_a_Riff
SEED: Splash_(film), Big_(film), Sleepless_in_Seattle, The_Green_Mile_(film), You’ve_Got_Mail, Saving_Private_Ryan, He_Knows_You’re_Alone, The_Polar_Express_(film), The_’Burbs, Bachelor_Party_(1984_film)
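The contrast in Table 5 can be illustrated with a minimal scoring sketch: retrieval in the SEED style scores each candidate by the summed relevance weights of the kCSFs it matches. The weights, features, and candidates below are hypothetical:

```python
def rank_entities(candidate_features, feature_weights):
    # Score each candidate by the total weight of the relevant kCSFs it matches
    scores = {
        e: sum(feature_weights.get(f, 0.0) for f in feats)
        for e, feats in candidate_features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical kCSF weights for the query intent of Example 1
weights = {"starring:Tom_Hanks": 0.9, "subject:English-language_films": 0.2}
candidates = {
    "Big_(film)": {"starring:Tom_Hanks", "subject:English-language_films"},
    "Rocky": {"subject:English-language_films"},
    "Radio_Flyer": set(),
}
print(rank_entities(candidates, weights))  # Big_(film) first, Radio_Flyer last
```

A candidate that only co-occurs with the seeds on web pages but matches no weighted kCSF (here Radio_Flyer) falls to the bottom of the ranking.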

collections in most cases, with the exception of some metrics where it is beaten by some alternative methods (i.e., SEAL and QBEES) on SemSearch and INEX, when the number of seeds m is 2 or 3. We also observe that almost all

Table 6: Impact of the parameter k on three test collections using 4 query groups labelled as m (i.e., #seeds), n = 100. We apply SEED with k = 0 as the baseline in each query group; the notation ∗ denotes statistically significant improvement over the baseline in each query group.

m   k | CLEF: p@5 p@10 p@R MRR MAP | SemSearch: p@5 p@10 p@R MRR MAP | INEX: p@5 p@10 p@R MRR MAP
3   0 | .598 .508 .499 .703 .482 | .286 .240 .188 .376 .175 | .454 .408 .356 .458 .335
3   1 | .617∗ .520∗ .510∗ .792∗ .520∗ | .315∗ .258∗ .228∗ .434∗ .216∗ | .527∗ .481∗ .401∗ .496∗ .374∗
4   0 | .590 .510 .500 .675 .512 | .245 .225 .180 .267 .178 | .365 .323 .307 .455 .278
4   1 | .647∗ .540∗ .548∗ .774∗ .558∗ | .325∗ .275∗ .248∗ .415∗ .218∗ | .427∗ .392∗ .345∗ .584∗ .332∗
4   2 | .660∗ .530∗ .551∗ .791∗ .567∗ | .345∗ .285∗ .274∗ .485∗ .245∗ | .504∗ .458∗ .382∗ .672∗ .356∗
5   0 | .540 .470 .484 .641 .497 | .225 .203 .185 .258 .155 | .250 .229 .220 .324 .209
5   1 | .610∗ .495∗ .551∗ .721∗ .565∗ | .315∗ .265∗ .241∗ .355∗ .225∗ | .392∗ .333∗ .319∗ .544∗ .300∗
5   2 | .620∗ .508∗ .546 .738∗ .574∗ | .387∗ .315∗ .265∗ .495∗ .238∗ | .473∗ .394∗ .362∗ .654∗ .343∗
5   3 | .613∗ .492∗ .557∗ .746∗ .568∗ | .405∗ .325∗ .275∗ .512∗ .242∗ | .488∗ .440∗ .381∗ .663∗ .365∗
mix 0 | .567 .480 .484 .673 .497 | .235 .221 .196 .295 .179 | .427 .373 .346 .520 .336
mix 1 | .607∗ .502∗ .512∗ .742∗ .527∗ | .305∗ .282∗ .227∗ .415∗ .208∗ | .465∗ .412∗ .362∗ .622∗ .357∗
mix 2 | .613∗ .495∗ .525∗ .754∗ .535∗ | .367∗ .305∗ .259∗ .465∗ .215∗ | .481∗ .437∗ .379∗ .661∗ .370∗
mix 3 | .610∗ .497∗ .530∗ .763∗ .540∗ | .387∗ .326∗ .279∗ .483∗ .228∗ | .491∗ .433∗ .399∗ .690∗ .391∗

[Figure 4 plots omitted: p@R and MAP versus n (10 to 200) on CLEF, SemSearch, and INEX.]
Figure 4: Impact of the parameter n in Φ(Q) on three test collections using the mix query group, k = 3. We apply SEED with all retrieved kCSFs as the baseline (i.e., labelled as a red line), and compare SEED with different n (i.e., labelled as a blue broken line).

methods based on KGs perform better on CLEF than on the other collections, because the topics of CLEF are composed of hand-crafted natural language questions that can be transformed into SPARQL queries over DBpedia. It is therefore easy to discover relevant SFs from the seeds that describe the desired query intent. In contrast, the topics of SemSearch and INEX are keyword search queries on structured and unstructured datasets respectively, so for some of them it is hard to find relevant SFs in DBpedia.

When looking into the impact of the number of seeds m, we find that most methods perform best when m = 3 or m = 4, showing that more seeds can improve the performance of methods addressing the ESE problem, but too many seeds may introduce noise. In contrast, the performance of our error-tolerant method SEED generally increases as m grows from 2 to 5. In particular, it benefits more from the enlargement of m on SemSearch and INEX than on CLEF. This is reasonable because a certain percentage of the seeds on SemSearch and INEX cannot find relevant kCSFs describing the desired query intent, and more seeds can assist SEED to discover more somewhat-relevant kCSFs to improve the performance. Compared to SEED, when m is enlarged, the performance of QBEES drops on SemSearch and INEX. This is because it requires SFs shared by all seeds (i.e., CSFs) for ranking entities, and more seeds may reduce the recall of effective SFs. Moreover, we find that SEED performs better in the mix query group than in the other 4 query groups on INEX, which shows that the performance can be improved if the seeds are selected according to their ambiguity.

6.2. Impact of the Parameter k

To observe how the proposed relaxation mechanism affects the search performance, addressing research question RQ2, we conduct an experiment to evaluate the

Table 7: Performance of alternative ranking models on three test collections using the mix query group, k = 3 and n = 100. We apply the ranking model with d(π) = 1 and c(π, Q) = 1 as the baseline; the notation ∗ denotes statistically significant improvement over the baseline.

d(π)   c(π,Q) | CLEF: p@5 p@10 p@R MRR MAP | SemSearch: p@5 p@10 p@R MRR MAP | INEX: p@5 p@10 p@R MRR MAP
1      1      | .493 .403 .399 .695 .367 | .195 .105 .185 .352 .175 | .425 .340 .305 .641 .275
Eq. 2  1      | .560∗ .452∗ .453∗ .739∗ .412∗ | .295∗ .185∗ .225 .412∗ .198∗ | .450∗ .415∗ .352∗ .685 .328∗
1      Eq. 4  | .535∗ .443∗ .435∗ .725∗ .421∗ | .275∗ .198∗ .230∗ .422∗ .195∗ | .465∗ .405∗ .345∗ .672∗ .315∗
Eq. 2  Eq. 4  | .610∗ .497∗ .530∗ .763∗ .540∗ | .387∗ .326∗ .279∗ .493∗ .228∗ | .491∗ .433∗ .399∗ .690∗ .391∗

[Figure 5 bar charts omitted: p@R and MAP on CLEF, SemSearch, and INEX for query groups 2, 3, 4, 5, and mix.]
Figure 5: Impact of the ranking model for QBEES and ARM (i.e., labelled with “+”) on three test collections using 5 query groups. The effectiveness results of QBEES, QBEES+, ARM, ARM+, and SEED in each query group are reported from left to right accordingly.

impact of the parameter k, by using different values of k to generate kCSFs. We apply 4 query groups on the three test collections for this experiment, and set the parameter n as 100. For each query group, we vary k from 0 to m − 2. As illustrated in Tab. 6, the relaxation mechanism (i.e., enlarging the parameter k) does improve the performance on the three test collections. This is because the relaxation mechanism assists SEED to discover more relevant kCSFs, which would otherwise be missing due to the incompleteness of KGs. When the value of k is enlarged, the recall of kCSFs increases, but more false positive kCSFs are also generated. However, with the proposed error-tolerant ranking models, SEED is effective enough to find the relevant kCSFs, and the relevant entities as well. By default, we set the parameter k = 3 in our study.
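A simplified reading of the kCSF relaxation (accepting SFs shared by at least m − k of the m seeds rather than by all of them; this sketch is not the paper's exact implementation):

```python
from collections import Counter

def kcsfs(seed_features, k):
    """SFs shared by at least (m - k) of the m seeds; k = 0 yields the strict CSFs."""
    m = len(seed_features)
    counts = Counter(f for feats in seed_features for f in set(feats))
    return {f for f, c in counts.items() if c >= m - k}

# Hypothetical per-seed SF sets: each seed misses one SF due to KG incompleteness
seeds = [
    {"starring:Tom_Hanks", "subject:English-language_films"},
    {"starring:Tom_Hanks", "subject:1980s_films"},
    {"subject:English-language_films", "subject:1980s_films"},
]
print(sorted(kcsfs(seeds, 0)))  # k = 0: no SF is shared by all 3 seeds
print(sorted(kcsfs(seeds, 1)))  # k = 1: SFs shared by at least 2 seeds survive
```

With k = 0 the incomplete KG yields no common feature at all, while k = 1 recovers all three useful features, which is the behavior the relaxation mechanism is designed to exploit.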

6.3. Impact of the Parameter n

To observe how the proposed selection mechanism affects the search performance, addressing research question RQ3, we conduct an experiment to evaluate the impact of the parameter n, by using different values of n from 10 to 200, at intervals of 10, to select the top-n relevant kCSFs as Φ(Q). We apply the mix query group on the three test collections for this experiment, and set the parameter k as 3. As illustrated in Fig. 4, compared to the baseline (i.e., SEED with all kCSFs applied), the performance of SEED quickly converges to that of the baseline as n is enlarged. These results show that the derived top-n relevant kCSFs are effective for ranking entities. The selection mechanism allows us to filter out useless kCSFs so that SEED can be conducted more efficiently. By default, we set the parameter n = 100 in our study.

6.4. Impact of Ranking Model

Two components affect the performance of our ranking models: d(π) and c(π, Q). We conduct an experiment using different settings of the ranking model, addressing research question RQ4. We apply all query groups on the three test collections for this experiment, and set the parameters k and n as 3 and 100 respectively. Four alternative ranking models are compared, where a component is set to 1 when it is not applied to the ranking model. We apply the ranking model with d(π) = 1 and c(π, Q) = 1 as the baseline. According to the results reported in Tab. 7, we find that the two individual components each significantly improve the search performance over the baseline on the three test collections. The best performance

Table 8: Queries for case study.

Case 1 | Seeds: W._C._Handy, Dizzy_Gillespie | Query intent: “A list of all bandleaders that play trumpet”
Case 2 | Seeds: C++Builder, NeXTSTEP, MySQL | Query intent: “Which software has been developed by organizations founded in California”
Case 3 | Seeds: Kejimkujik_National_Park, La_Mauricie_National_Park, St._Lawrence_Islands_National_Park | Query intent: “National Parks East Coast Canada US”
Case 4 | Seeds: Apollo_13_(film), Philadelphia_(film), Forrest_Gump, You’ve_Got_Mail | Query intent: “Tom Hanks’ movies where he plays a leading role”
Case 5 | Seeds: Kyushu, Minami-Tori-shima | Query intent: “Give me all islands that belong to Japan”
Case 6 | Seeds: West_of_Eden, Georgia_on_My_Mind_(novelette), The_Magic_Labyrinth | Query intent: “Science fiction book written in the 1980”
Case 7 | Seeds: Che_Guevara, Fidel_Castro | Query intent: “Revolutionaries of 1959 in Cuba”
Case 8 | Seeds: Cambridge_University_Library, Connemara_Public_Library, Austrian_National_Library | Query intent: “Give me all libraries established earlier than 1400”

Table 9: Top-5 relevant kCSFs for each case. Note that the kCSFs in bold are the desired kCSFs for each case.

Case 1: occupation:Bandleader; instrument:Trumpet; occupation:Composer; instrument:Piano; subject:African-American_Musicians
Case 2: license:Proprietary_Software; developer◦foundationplace:California; product◦foundationplace:California
Case 3: location:Canada; subject:National_Parks_of_Canada
Case 4: starring:Tom_Hanks; starring:Gary_Sinise; subject:Films_featuring_a_Best_Drama_Actor_Golden_Globe_winning_performance; subject:Films_featuring_a_Best_Actor_Academy_Award_winning_performance; subject:English-language_films
Case 5: country:Japan; location:East_Asia; location:Pacific_Ocean; subject:Islands_of_Japan; subject:Islands_of_Tokyo
Case 6: subject:1980s_Science_Fiction_Novels; language:English_Language; mediatype:Hardcover; literarygenre:Science_Fiction; country:United_States
Case 7: subject:Marxist_Writers; subject:Anti-revisionists; subject:International_Opponents_of_Apartheid_in_South_Africa; subject:Anti-fascists; comander:Bay_of_Pigs_Invasion
Case 8: subject:Deposit_Libraries

is achieved when both components are applied, which is exactly the proposed ranking model. To further verify the effectiveness of our ranking model, we apply it to QBEES and ARM (labelled as QBEES+ and ARM+ respectively). As reported in Fig. 5, our ranking model improves the p@R and MAP of QBEES and ARM on the three test collections.

Compared to QBEES+ and ARM+, SEED still performs best. This is because SEED applies more effective kCSFs (e.g., those whose lengths are greater than 1) than QBEES (i.e., applying CSFs for retrieving entities) and ARM (i.e., employing minsupp = 66.7% as a parameter to generate kCSFs for retrieving entities).

6.5. Case Study

We study some use cases to show how our ranking models work effectively. The detailed information of these use cases and their top-5 relevant kCSFs are listed in Tab. 8 and Tab. 9 respectively.

For case 1, whose query intent is to find “A list of all bandleaders that play trumpet” from the seeds {W._C._Handy, Dizzy_Gillespie}, the top-2 relevant kCSFs are successfully discovered by our method SEED, and they exactly match the query intent. For case 2, the query intent is to find “Which software has been developed by organizations founded in California” from the seeds {C++Builder, NeXTSTEP, MySQL}. Compared to case 1, case 2 is much more complicated because it requires a kCSF with a path of length 2. Even in such a case, SEED successfully discovers the desired kCSF, i.e., developer◦foundationplace:California. For case 3, the second kCSF subject:National_Parks_of_Canada effectively captures the query intent (i.e., “National Parks East Coast Canada US”), which leads to good performance (p@R and MAP of this case are 0.527 and 0.498 respectively). For case 4, although the kCSFs initiated with the predicate subject are not quite relevant to the query intent (i.e., “Tom Hanks’ movies where he plays a leading role”), the desired kCSF starring:Tom_Hanks has been discovered, achieving high performance (p@R and MAP of this case are 0.837 and 0.815 respectively). For case 5, both kCSFs country:Japan and subject:Islands_of_Japan are important for finding relevant entities. Although some false positive kCSFs are introduced, our ranking model is effective enough to reduce their negative impact; when all derived kCSFs are involved, the performance is very good (p@R and MAP of this case are 0.935 and 0.905 respectively).

The worst situation is when none of the retrieved kCSFs is relevant to the query intent, which is mainly caused by two reasons. First, KGs are incomplete. For cases 6 and 7, neither predicates nor categories can describe the “happened in which year” aspect of the queries. Therefore, the performance on these two topics is very poor (both p@R and MAP are close to 0). Second, some queries, such as case 8, require a logical reasoning process, which is not supported by SEED. Although the predicate expressing the established year exists in KGs, such a query requires a reasoning step based on the value of the established year, which is beyond the scope of this paper.

7. Conclusions

In this paper, we address the ESE problem with SFs of KGs. We first propose a flexible definition of SFs, used to describe the common aspects shared by the seeds while considering the incompleteness of KGs, as the basis for discovering and ranking entities. We then retrieve relevant entities based on the retrieved SFs. Probabilistic models are proposed to rank entities, as well as SFs, by handling the incompleteness of KGs. Through extensive experiments on a public KG and three public test collections, we find that our proposed method SEED outperforms the state-of-the-art techniques. It is well suited to the ESE problem, especially for queries whose entities and predicates (relations) are well covered by KGs. Even for queries without good information coverage in KGs, SEED may still work well, thanks to the relaxation and selection mechanisms for choosing a proper set of SFs to rank entities.

8. Acknowledgments

This work is supported by the National Natural Science Foundation of China under grants No. 61472426, U1711261, and 61432006.

References

[1] M. Zhu, Y. B. Wu, Search by multiple examples, in: Seventh ACM International Conference on Web Search and Data Mining, WSDM 2014, New York, NY, USA, February 24-28, 2014, pp. 667–672.
[2] J. Chen, G. Jacucci, Y. Chen, T. Ruotsalo, SEED: entity oriented information search and exploration, in: IUI 2017, Limassol, Cyprus, March 13-16, 2017, pp. 137–140.
[3] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, J. Riedl, Grouplens: Applying collaborative filtering to usenet news, Commun. ACM 40 (3) (1997) 77–87.
[4] W. W. Cohen, S. Sarawagi, Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, pp. 89–98.
[5] J. Hu, G. Wang, F. H. Lochovsky, J. Sun, Z. Chen, Understanding user’s query intent with wikipedia, in: Proceedings of the 18th International Conference on World Wide Web, WWW, 2009, pp. 471–480.
[6] H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, H. Li, Context-aware query suggestion by mining click-through and session data, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 875–883.
[7] O. Etzioni, M. J. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates, Web-scale information extraction in knowitall: (preliminary results), in: WWW, 2004, pp. 100–110.
[8] O. Etzioni, M. J. Cafarella, D. Downey, A. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates, Unsupervised named-entity extraction from the web: An experimental study, Artif. Intell. 165 (1) (2005) 91–134.
[9] R. C. Wang, W. W. Cohen, Automatic set instance extraction using the web, in: Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics, ACL, 2009, pp. 441–449.
[10] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. G. Ives, Dbpedia: A nucleus for a web of open data, in: ISWC, 2007, pp. 722–735.
[11] K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, J. Taylor, Freebase: a collaboratively created graph database for structuring human knowledge, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, pp. 1247–1250.
[12] F. M. Suchanek, G. Kasneci, G. Weikum, Yago: a core of semantic knowledge, in: WWW, 2007, pp. 697–706.

[13] K. Balog, R. Neumayer, A test collection for entity search in dbpedia, in: The 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2013, pp. 737–740.
[14] V. Lopez, C. Unger, P. Cimiano, E. Motta, Evaluating question answering over linked data, J. Web Sem. 21 (2013) 3–13.
[15] J. Chen, Y. Chen, X. Du, X. Zhang, X. Zhou, SEED: A system for entity exploration and debugging in large-scale knowledge graphs, in: ICDE 2016, Helsinki, Finland, May 16-20, 2016, pp. 1350–1353.
[16] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, W. Zhang, Knowledge vault: a web-scale approach to probabilistic knowledge fusion, in: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, 2014, pp. 601–610.
[17] L. Bing, W. Lam, T. Wong, Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning, in: Sixth ACM International Conference on Web Search and Data Mining, WSDM, 2013, pp. 567–576.
[18] Z. Kozareva, E. H. Hovy, Learning arguments and supertypes of semantic relations using recursive patterns, in: ACL 2010, July 11-16, 2010, Uppsala, Sweden, pp. 1482–1491.
[19] T. McIntosh, J. R. Curran, Reducing semantic drift with bagging and distributional similarity, in: ACL 2009, 2-7 August 2009, Singapore, pp. 396–404.
[20] Y. He, D. Xin, SEISA: set expansion by iterative similarity aggregation, in: Proceedings of the 20th International Conference on World Wide Web, WWW, 2011, pp. 427–436.
[21] A. Cucchiarelli, P. Velardi, Unsupervised named entity recognition using syntactic and semantic contextual evidence, Computational Linguistics 27 (1) (2001) 123–131.
[22] M. Pasca, Weakly-supervised discovery of named entities using web search queries, in: CIKM 2007, Lisbon, Portugal, November 6-10, 2007, pp. 683–690.
[23] P. Pantel, M. Pennacchiotti, Espresso: Leveraging generic patterns for automatically harvesting semantic relations, in: ACL 2006, Sydney, Australia, 17-21 July 2006.
[24] P. P. Talukdar, J. Reisinger, M. Pasca, D. Ravichandran, R. Bhagat, F. C. N. Pereira, Weakly-supervised acquisition of labeled class instances using graph random walks, in: EMNLP 2008, 25-27 October 2008, Honolulu, Hawaii, USA, pp. 582–590.
[25] T. McIntosh, J. R. Curran, Weighted mutual exclusion bootstrapping for domain independent lexicon and template acquisition, in: Proceedings of the Australasian Language Technology Association Workshop, ALTA 2008, Hobart, Australia, December 8-10, 2008, pp. 97–105.
[26] B. Shi, Z. Zhang, L. Sun, X. Han, A probabilistic co-bootstrapping method for entity set expansion, in: COLING 2014, August 23-29, 2014, Dublin, Ireland, pp. 2280–2290.
[27] M. Pennacchiotti, P. Pantel, Automatically building training examples for entity extraction, in: Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL 2011, Portland, Oregon, USA, June 23-24, 2011, pp. 163–171.
[28] X. Li, L. Zhang, B. Liu, S. Ng, Distributional similarity vs. PU learning for entity set expansion, in: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL, 2010, pp. 359–364.
[29] L. Lim, H. Wang, M. Wang, Semantic queries by example, in: Joint 2013 EDBT/ICDT Conferences, EDBT ’13 Proceedings, 2013, pp. 347–358.
[30] K. Sadamitsu, K. Saito, K. Imamura, G. Kikui, Entity set expansion using topic information, in: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 726–731.
[31] Z. Zhang, L. Sun, X. Han, A joint model for entity set expansion and attribute extraction from web search queries, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pp. 3101–3107.
[32] A. Passant, dbrec - music recommendations using dbpedia, in: The Semantic Web - ISWC 2010 - 9th International Semantic Web Conference, ISWC, 2010, pp. 209–224.
[33] M. Ji, Q. He, J. Han, W. S. Spangler, Mining strong relevance between heterogeneous entities from unstructured biomedical data, Data Min. Knowl. Discov. 29 (4) (2015) 976–998.
[34] C. Shi, X. Kong, Y. Huang, P. S. Yu, B. Wu, Hetesim: A general framework for relevance measure in heterogeneous networks, IEEE Trans. Knowl. Data Eng. 26 (10) (2014) 2479–2492.
[35] C. Shi, X. Kong, P. S. Yu, S. Xie, B. Wu, Relevance search in heterogeneous networks, in: EDBT ’12, 2012, pp. 180–191.
[36] Y. Sun, J. Han, X. Yan, P. S. Yu, T. Wu, Pathsim: Meta path-based top-k similarity search in heterogeneous information networks, PVLDB 4 (11) (2011) 992–1003.
[37] C. Meng, R. Cheng, S. Maniu, P. Senellart, W. Zhang, Discovering meta-paths in large heterogeneous information networks, in: WWW 2015, 2015, pp. 754–764.
[38] X. Cao, C. Shi, Y. Zheng, J. Ding, X. Li, B. Wu, A heterogeneous information network method for entity set expansion in knowledge graph, in: PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part II, pp. 288–299.
[39] C. Shi, Y. Li, J. Zhang, Y. Sun, P. S. Yu, A survey of heterogeneous information network analysis, IEEE Transactions on Knowledge and Data Engineering 29 (1) (2017) 17–37.
[40] S. Metzger, R. Schenkel, M. Sydow, Aspect-based similar entity search in semantic knowledge graphs with diversity-awareness and relaxation, in: IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014, pp. 60–69.
[41] Z. Abedjan, F. Naumann, Improving RDF data through association rule mining, Datenbank-Spektrum 13 (2) (2013) 111–120.
[42] Y. Zheng, C. Shi, X. Cao, X. Li, B.
Wu, Entity set expansion with meta path in knowledge graph, in: Advances in Knowledge Discovery and Data Mining - 21st Pacific-Asia Conference, PAKDD 2017, Jeju, South Korea, May 23-26, 2017, Proceedings, Part I, 2017, pp. 317–329. S. R. Agrawal Rakesh, Fast algorithms for mining association rules in large databases, in: VLDB, 1994, pp. 487–499. M. Bron, K. Balog, M. de Rijke, Example based entity search in the web of data, in: Advances in Information Retrieval - 35th European Conference on IR Research, ECIR, 2013, pp. 392–403. G. Demartini, T. Iofciu, A. P. de Vries, Overview of the INEX 2009 entity ranking track, in: Focused Retrieval and Evaluation, 8th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX, 2009, pp. 254–264. R. Blanco, H. Halpin, D. M. Herzig, P. Mika, J. Pound, , H. S. Thompson, Entity search evaluation over structured web data, in: In Proc. of the 1st International Workshop on EntityOriented Search (EOS’11), 2011, pp. 65–71. C. Bizer, P. Mika, The semantic web challenge, 2009, J. Web Sem. 8 (4) (2010) 341. C. D. Manning, P. Raghavan, H. Schütze, Introduction to information retrieval, Cambridge University Press, 2008. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993. C. Shannon, A mathematical theory of communication, Bell Syst. Techn. J. 27 (1948) 379–423. M. D. Smucker, J. Allan, B. Carterette, A comparison of statistical significance tests for information retrieval evaluation, in: CIKM, 2007, pp. 623–632. H. Tong, C. Faloutsos, J. Pan, Fast random walk with restart and its applications, in: ICDM, 2006, pp. 613–622. X. Zhang, Y. Chen, J. Chen, X. Du, K. Wang, J. Wen, Entity set expansion via knowledge graphs, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 1101–1104.