Path-based reasoning approach for knowledge graph completion using CNN-BiLSTM with attention mechanism


Batselem Jagvaral, Wan-Kon Lee, Jae-Seung Roh, Min-Sung Kim, Young-Tack Park
School of Computer Science and Engineering, Soongsil University, Seoul, 06978, Republic of Korea

Article history: Received 7 June 2019; Revised 25 August 2019; Accepted 17 September 2019; Available online 26 September 2019

Keywords: Knowledge graph completion; Link prediction; Path-based reasoning; Low-dimensional embedding

Abstract

Knowledge graphs are valuable resources for building intelligent systems such as question answering or recommendation systems. However, most knowledge graphs are impaired by missing relationships between entities. Embedding methods that translate entities and relations into a low-dimensional space achieve great results, but they only focus on the direct relations between entities and neglect the presence of path relations in graphs. In contrast, path-based embedding methods consider only a single path to make inferences and rely on simple recurrent neural networks, although highly efficient neural network models are available for processing sequence data. We propose a new approach for knowledge graph completion that combines bidirectional long short-term memory (BiLSTM) and convolutional neural network modules with an attention mechanism. Given a candidate relation and two entities, we encode the paths that connect the entities into a low-dimensional space using a convolutional operation followed by BiLSTM. Then, an attention layer is applied to capture the semantic correlation between a candidate relation and each path between two entities and to attentively extract reasoning evidence from the representation of multiple paths to predict whether the entities should be connected by the candidate relation. We extend our model to perform multistep reasoning over path representations in an embedding space. A recurrent neural network is designed to repeatedly interact with an attention module to derive logical inference from the representation of multiple paths. We perform link prediction tasks on several knowledge graphs and show that our method achieves better performance compared with recent state-of-the-art path-reasoning methods.

1. Introduction

Knowledge graphs (KGs), such as Freebase, WordNet, or NELL, are valuable resources for building intelligent systems such as question answering or recommendation systems. These KGs contain millions of facts about real-world entities and relations in the form of triples, e.g., (Bill Gates, founded, Microsoft). However, a large number of relations (triples) between the entities in KGs are missing. To effectively use KGs for other applications, one must perform a KG completion (KGC) task and infer missing links or triples. The basic idea of a KGC task is to automatically infer missing triples by utilizing existing triples. In recent years, embedding methods that translate entities and relations into a low-dimensional space have achieved great results on KGC tasks.




However, most embedding methods only consider the direct relations between entities and overlook the presence of paths. Previously, path ranking algorithms (PRAs), such as those proposed by Lao, Mitchell, and Cohen (2011) and Gardner and Mitchell (2015), have shown that the relation paths that consist of the relation types between two entities can be effectively used for KGC. Such methods perform random walks over a graph and construct a feature matrix by enumerating the paths between all entity pairs given a candidate relation. Then, a binary classification method, such as logistic regression or a decision tree, is trained on the feature matrix to infer missing links. In recent years, path-based reasoning methods (Das et al., 2018; Das, Neelakantan, Belanger, & McCallum, 2017; Nickel, Tresp, & Kriegel, 2011; Xiaotian et al., 2017) have successfully applied recurrent neural networks (RNNs) to KGC tasks by embedding reasoning paths into a low-dimensional space and have shown significant improvements over PRA methods. The idea behind these path-reasoning approaches is that the semantics of a relation between entities can be represented by the semantics of the multiple paths that connect the entities.


Therefore, the missing relations between two entities can be inferred by learning the paths that connect the entities. However, these reasoning methods train a simple RNN, whereas highly efficient methods for sequence data processing are available and exhibit better performance than RNNs. Most of these methods use max-pooling or mean operations to combine multiple paths and neglect the fact that each path provides different reasoning evidence. In fact, an individual path, such as (s, spouse, e) ∧ (e, bornIn, t), frequently does not provide any indication of a semantic relationship between entities s and t.

In this paper, we propose a new attention-based approach for KGC that couples a convolutional neural network (CNN) with a bidirectional long short-term memory (BiLSTM) module. First, given a candidate relation and two entities, our method encodes multiple reasoning paths between the entities into low-dimensional embeddings using the CNN followed by the BiLSTM module. Second, we assume that not all paths between two entities contribute equally to inferring the missing relation between the entities. To this end, an attention mechanism (Bahdanau, Cho, & Bengio, 2015; Vaswani et al., 2017) is applied to capture the semantic correlation between a candidate relation and each path between two entities and to generate a vector representation for all paths between the entities. The paths of varying lengths that connect entities are encoded into a fixed-length real-valued vector. Finally, the summation of the relation embedding and the vector representation of the paths is passed through a fully-connected layer to predict whether the two entities should be connected by the candidate relation. The principle behind our method is that the CNN extracts the local features in the path and the BiLSTM network uses the ordering of the local features to learn the entity and relation orderings of each path. Finally, the attention layer extracts reasoning evidence from the paths that are correlated with the candidate relation.

The attention mechanism in our model is similar to that of Xiaotian et al. (2017). The only difference is that instead of computing the dot product of the target relation and weighted path vectors, we apply the additive attention function using a feedforward network, which scales well to smaller values. Dot-product attention is faster and more space-efficient but in a few cases requires an additional scaling factor to compute correct attention weights, which was not implemented in the previous study.

Furthermore, when the search space in path representation is very large, combining all paths does not provide sufficient evidence to make an inference about the relationship between entities. Therefore, to narrow down the search in a continuous space, we suggest using multiple steps of reasoning. To address this issue, we extend our model to perform multistep reasoning over the path distribution. We adopt an RNN-like multihop (Sukhbaatar, Szlam, Weston, & Fergus, 2015) reasoning network that enables the model to read the embeddings of the same paths multiple times and update the encoding vector at each step before producing the final output. Through experiments, we demonstrate that multistep reasoning over the path distribution can significantly improve the reasoning performance of KGC tasks. Moreover, our model is collectively trained end-to-end with gradient descent for all candidate relations.

In experiments, we perform link prediction tasks on four different KGs, i.e., NELL, Freebase, Kinship, and Countries.
We compare our method with the most recent state-of-the-art methods of path-based reasoning using various measures. For the link prediction tasks, given test triples, we replace the source or target entity of each test triple with random entities and measure the rank of the original triple within the corrupted set using each method. We further visualize multiple reasoning paths and observe that the paths that connect similar entity pairs are closely clustered together. Empirically, we show that our approach achieves comparable results with previous methods and exhibits better performance in a few cases.

2. Related work

This section reviews previous studies on KGC tasks. Previous works are broadly divided into two categories, i.e., path-based reasoning and KG embedding.

KG embedding predicts missing links by applying low-dimensional embedding approaches to KGs (Bordes, Usunier, Garcia-Duran, Weston, & Yakhnenko, 2013; Nickel, Murphy, Tresp, & Gabrilovich, 2015; Nickel et al., 2011; Wang, Mao, Wang, & Guo, 2017). The key idea of embedding-based KGC is to represent entities and relations as low-dimensional vectors and estimate candidate facts as resulting from latent factors. RESCAL (Nickel et al., 2011) and TransE (Bordes et al., 2013) are two typical methods of learning latent representations by minimizing reconstruction loss and margin-based ranking loss, respectively. RESCAL is a relational learning algorithm based on tensor factorization using alternating least squares. This algorithm can be scaled to large RDF datasets such as YAGO, and it can achieve good results in the tasks of link prediction, entity resolution, and collective classification. Many translation-based approaches have since been developed. TransE is the first method in which relations are interpreted as the addition of the embeddings of the entities in a low-dimensional space. In the neural tensor network (NTN) model, Socher, Chen, Manning, and Ng (2013) combined linear transformations and multiple bilinear forms of subject and object embeddings to feed them into a nonlinear neural layer. An NTN is a data-driven deep neural network that generalizes several conventional neural network models and provides a more powerful method of modeling relational information than a standard neural network layer. An NTN combines a standard linear neural network layer with a bilinear tensor layer that directly relates two entity vectors across multiple dimensions. Yang, Yih, He, Gao, and Deng (2015) proposed the Bilinear Diagonal model DistMult by combining NTN and TransE for KGC, where relations are represented as diagonal matrices. Wang, Wang, and Guo (2015) proposed a KGC approach that uses embeddings and rules. Their approach views a KGC task as an integer linear programming problem that requires an objective function and rule constraints. Objective functions are generated from an embedding model to predict the plausibility of candidate triples, and constraints are translated from physical and logical rules to impose restraints on candidate triples. Recently, a state-of-the-art method named ComplEx was presented by Trouillon et al. (2017). ComplEx uses the complex counterpart of the standard dot product between real vectors and achieves better performance in embedding entities and relations compared to previous works. ConvE (Dettmers, Minervini, Stenetorp, & Riedel, 2018) improves the performance of ComplEx by applying a multilayer convolutional network model to triples and shows that CNN models can be efficiently used in link prediction problems. The aforementioned studies mainly focused on embedding individual relations and entities and experienced difficulty in modeling the semantic relations between entities. In this study, we focus on performing KGC tasks based on the contextual paths between entities.

Path-based reasoning for KGC uses individual paths in KGs to perform link prediction. In early works, such as the PRA, paths are treated as atomic features. As a result, a single classifier must be trained on a feature matrix with millions of distinct paths, whose size increases considerably with the number of relations in a KG. To solve this problem, Neelakantan, Roth, and McCallum (2015) proposed a neural network model named Path-RNN. This model decomposes each path into a sequence of relations and feeds it into an RNN architecture to construct a vector representation for the path. Then, the relevance of the path to a query relation is calculated by the dot product of their representations. However, as multiple paths connect an entity pair, Path-RNN uses the max operator to select the path with the largest predictability score.


To improve the performance of Path-RNN, Das et al. (2017) proposed several path combination operators, including Mean, Top-K, and LogSumExp. However, each of these path combination operators works at the score level and has disadvantages. In addition, the same RNN was used to model different query relations, although each relation has unique latent features. Note that there are triple-based methods that use the paths between entity pairs, such as that proposed by Lin et al. (2015). These methods focus on learning a better representation of entities and relations using paths and can be treated as path-enhanced versions of the method proposed by Bordes et al. (2013). In contrast, path-based relation inference methods make inferences mainly based on paths.

Conventional models for path-based reasoning mainly include the PRA and its variants. The PRA, which was first proposed by Lao and Cohen (2010), performs random walk inference for predicting new relation instances in KGs. In recent years, various extensions have been explored, ranging from incorporating a text corpus as additional evidence during inference (Gardner, Talukdar, Kisiel, & Mitchell, 2013) to introducing better schemes to generate more predictive paths (Gardner & Mitchell, 2015) or considering the associations among certain relations (Wang, Yuanfei, Luo, Wang, & Lin, 2016). These extensions attempt to reduce sparsity while treating each path as a distinct feature. In contrast to these methods, Neelakantan et al. (2015) adopted a different approach by treating each path as a sequence of relations and used an RNN architecture to generate its vector representation. The resulting path representations were used for inference. Das et al. (2017) improved the performance of Path-RNN by proposing several path combination operators. However, they also used a single RNN model to handle all query relations. Das et al. (2018) performed KGC by finding paths using a reinforcement learning (RL) algorithm and achieved lower results compared to our method. Xiaotian et al. (2017) introduced attention-based reasoning approaches for KGC tasks. An attention mechanism enables models to focus on different paths and attentively combine them. However, they used the dot-product operation for attention, which is less scalable than the additive attention function that we use. In their work, a single reasoning step was used for the link prediction task. Zhou et al. (2016) proposed Att-BLSTM, which combines a bidirectional LSTM and an attention mechanism to classify relations in text. The underlying BiLSTM attention mechanism of our approach is similar to that of Att-BLSTM, except that their module focuses on word-level attention whereas ours focuses on sentence-level attention. Accordingly, their attention module operates on the outputs of a single BiLSTM to attentively aggregate word embeddings from the hidden units of the BiLSTM at the word level. In contrast, we simply concatenate the last hidden state of the forward LSTM and the first hidden state of the backward LSTM in the BiLSTM, without attention, to represent a sentence (path); our attention module then operates over multiple sentences at the sentence level to attentively aggregate sentence embeddings.

3. Method

In this section, we present our approach for KGC via link prediction tasks, which aim to predict missing links in a graph. An overview of the approach is shown in Figs. 1 and 2. First, we briefly review the problem of KGC and the PRA and describe how we obtain paths. Then, we introduce the CNN and BiLSTM modules, which embed relational paths into a low-dimensional space and combine those paths using an attention module according to a query relation. Finally, we describe the RNN, which performs multistep reasoning over embedded paths.

Fig. 1. Architecture of the proposed method. The gray box in the path embedding layer indicates the sequence paddings. The operation ⊕ denotes element-wise summation.

3.1. Link prediction for knowledge graph completion

Given a set of entities, E, and a set of binary relations, R, over the entities in E, a KG G can be specified by a set of triples. Each triple (e_s, r, e_t) is an ordered set of terms such that e_s, e_t ∈ E are the source and target entities and r ∈ R is a relation between them. The goal of the KGC task is to complete the missing piece of information for an incomplete triple. This can be achieved by performing link prediction on the KG. A link prediction task estimates the probability of an entity being connected to another entity via a specific relation in the graph. For instance, given a query (Bill Gates, FounderOf, ?), it could predict that the missing entity is Microsoft. Link prediction attempts to predict either the target entity given a query (e_s, r, ?) or the source entity given a query (?, r, e_t). These tasks can significantly expand existing KGs by inferring new facts from them.

In our study, we treat a link prediction task as a ranking problem over all possible sets of links. For example, given a query (e_s, r, ?), we estimate the probability score of having relation r between e_s and a candidate entity e ∈ E. The estimation is performed for all candidate entities in E, and triples with high plausibility scores are added to the KG. For model evaluation, we produce corrupted triples in which the source or target entity is replaced by a randomly sampled entity. Thus, the training data contain a set of correct triples, (e_s, r, e_t) ∈ G, and a set of corrupted triples, (e_s, r, e_t) ∉ G.

Then, we employ the PRA to obtain the relation paths for each training instance (e_s, r, e_t) that are most relevant to relation r. Given the triples in the graph and the training triples, the PRA first performs random walks to enumerate a set of bounded-length paths for the training triples. Random walks are conducted on the entire graph starting at the source entity and reaching the target entity while recording which relations connect the source entity with its target. We obtain multiple relation paths, and each obtained relation path p contains a sequence of relations, {r_1, r_2, ..., r_l}. Then, we compute the random walk probability of arriving at e_t given a random walk starting from e_s and following exactly all relations in p. Once the random walk probability for each relation path is computed, the paths with high probability are selected as potential path features. Then, we further expand all relation paths, p, into complete paths, π, by including intermediate entities such that π = {e_s, r_1, e_1, ..., r_l, e_t}.
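To make the path-gathering step concrete, the following Python sketch enumerates bounded-length relation paths between a source and a target entity via a depth-limited search over a toy graph. The toy triples, the helper name enumerate_paths, and the maximum length are illustrative assumptions; the actual PRA additionally scores each relation path by its random walk probability, which is omitted here.

```python
from collections import defaultdict

# Toy graph: adjacency list of (relation, target) edges per entity (illustrative data only)
graph = defaultdict(list)
triples = [("BillGates", "founded", "Microsoft"),
           ("Microsoft", "headquarteredIn", "Redmond"),
           ("BillGates", "livesIn", "Redmond")]
for s, r, t in triples:
    graph[s].append((r, t))

def enumerate_paths(src, dst, max_len=3):
    """Depth-limited search for relation paths from src to dst
    (a simple stand-in for the PRA's random-walk path enumeration)."""
    paths, stack = [], [(src, [])]
    while stack:
        node, rels = stack.pop()
        if node == dst and rels:
            paths.append(rels)
        if len(rels) < max_len:
            for rel, nxt in graph[node]:
                stack.append((nxt, rels + [rel]))
    return paths

print(enumerate_paths("BillGates", "Redmond"))
# [['livesIn'], ['founded', 'headquarteredIn']]
```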


Fig. 2. Multistep reasoning over paths.

3.2. Path sequence encoding

Previous path-embedding approaches (Das et al., 2017; Nickel et al., 2015; Xiaotian et al., 2017) mainly used RNNs for reasoning over paths in graphs. However, RNNs suffer from the vanishing gradient problem when processing long sequences. In contrast, LSTM is more suitable for modeling the long-term dependencies needed to create an appropriate composition of the path sequence. In addition, typical path-reasoning methods (Gardner & Mitchell, 2015; Lao et al., 2011; Nickel et al., 2015) treat the paths between entities as distinct features. As a result, a classifier must consider a large number of distinct paths for training. Instead of constructing a large feature matrix, we design a new neural network structure that embeds paths into low-dimensional vectors. To encode path sequence information, we propose the use of BiLSTM in addition to the CNN, both of which have been successfully applied to processing long and short sequences. The overall structure of our path encoding model is shown in Fig. 1.

First, we turn each input entity and relation into a vector using an embedding matrix. Let e ∈ R^d and r ∈ R^d denote the d-dimensional embedding vectors of entity e and relation r in the graph, respectively. Subsequently, the embeddings of the entities and relations in our input sequence flow through each layer of the path encoder. Convolution operators first derive new features for the entity and relation composition that capture the essential local relationships in the path. Then, the BiLSTM encodes the extracted features into a single vector. Note that it is prohibitive to incorporate the entire set of entities in the KG into path encoding. We represent entities by their types to reduce model parameters and prevent computational bottlenecks. This also prevents the problem of encountering unseen entities during testing (Das et al., 2017).

A set of paths, P(e_s, e_t) = {π_1, π_2, ..., π_n}, between two entities (e_s, e_t) connected by relation r is obtained and embedded into a continuous vector space. We perform random walks over the entire graph to extract paths starting from source entity e_s and reaching target entity e_t, as explained in the previous section. Once the paths between two entities are obtained, we perform path encoding to obtain the embeddings of the paths between the source and target entities. A path of length l between two entities (e_s, e_t) is defined as π = {e_s, r_1, e_1, ..., r_l, e_t}, where e_j and r_j represent the jth entity and relation terms in the path, respectively. Here, e_s and e_t are the embeddings of the entities at positions 0 and l, respectively. All path sequences are padded to the same length, l, with zero padding. Length l is defined as the number of terms in the longest path in the path set.

We use 1D convolutional operators that slide multiple filters with the same window size over path sequences to generate features.

Let W_k ∈ R^{3×d} be a one-dimensional filter with a window (kernel) size of 3. The filter W_k is applied to each window of the path at each position (term) to generate a feature map c. Starting from the source entity, we move the filter from left to right by one term at a time until it reaches the last term. For example, a feature c_{j,k} is generated after applying the kth filter to the jth window of the path sequence using a stride of 1, as follows:



$$
c_{j,k} =
\begin{cases}
f\big(W_k [\,e_{j-1},\, r_j,\, e_j\,] + b\big), & \text{if } j \text{ is even} \\
f\big(W_k [\,r_j,\, e_j,\, r_{j+1}\,] + b\big), & \text{otherwise}
\end{cases}
\tag{1}
$$

where b is a bias and f is the ReLU nonlinear activation function. Next, the features obtained from the k filters are concatenated to produce the feature vector $c_j = [c_{j,1}, c_{j,2}, \ldots, c_{j,k}]$, such that $c_j \in \mathbb{R}^k$, where k is the number of filters. After processing the entire path, a sequence of vectors $\{c_1, c_2, \ldots, c_l\}$ is obtained, where l is the number of terms in the path.

Then, the output vectors of the convolutional operators are fed into BiLSTMs. Each vector in the output of the convolution layer represents a time step in the BiLSTM module. The BiLSTM consists of two components, namely a forward LSTM and a backward LSTM. The forward LSTM reads the terms in the path from left to right and the backward LSTM reads the sequence from right to left. We denote the outputs of the forward and backward LSTMs as $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$, respectively. At each time step, we feed a k-dimensional embedding of a term to an LSTM cell. The LSTM memorizes the results of previous computations and uses this information in the current computation. Detailed information about LSTM can be found in Hochreiter and Schmidhuber (1997). We also manually set the number of hidden states in the LSTM cell to d/2 to subsequently match the embeddings of the path and the target relation. Two separate sequences of hidden states are obtained after processing all terms in the path sequence using the BiLSTM. For example, given an input sequence $\{c_1, c_2, \ldots, c_l\}$, the forward LSTM produces hidden states $\{\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_l\}$, such that $\overrightarrow{h}_j \in \mathbb{R}^{d/2}$, and the backward LSTM produces hidden states $\{\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_l\}$, such that $\overleftarrow{h}_j \in \mathbb{R}^{d/2}$:

$$
\overrightarrow{h}_j = \overrightarrow{\mathrm{LSTM}}\big(\overrightarrow{h}_{j-1},\, c_j\big), \qquad
\overleftarrow{h}_j = \overleftarrow{\mathrm{LSTM}}\big(\overleftarrow{h}_{j+1},\, c_j\big)
\tag{2}
$$

Then, the last hidden state of the forward LSTM and the first hidden state of the backward LSTM in the BiLSTM are concatenated to produce the final representation of path π with length l:

$$
m = \big[\,\overrightarrow{h}_l\,;\,\overleftarrow{h}_1\,\big]
\tag{3}
$$


such that $m \in \mathbb{R}^d$. Embedding vector m contains both the preceding and succeeding information of the path, efficiently capturing its orderings. Note that paths are translated into embedding vectors using BiLSTMs with the same LSTM operating on each path sequence. We produce $M = \{m_1, m_2, \ldots, m_n\}$ embedding vectors for n paths, where $M \in \mathbb{R}^{d \times n}$. As shown in Fig. 1, the time-distributed dense layer in Keras is used to process all path sequences simultaneously using the same encoder. Then, the extracted embedding vectors are passed through an attention layer and an output layer for prediction. By combining the CNN and BiLSTM, we are able to utilize the CNN's ability to recognize local patterns and the LSTM's ability to harness the entity and relation ordering.
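As a rough sketch of the encoder described above, the following tf.keras code builds a per-path Conv1D + BiLSTM encoder and applies the same encoder to every path of an entity pair with a TimeDistributed wrapper, mirroring the Keras setup mentioned in the text. The dimensions (d, max_len, n_paths, k), the use of "same" padding, and feeding precomputed term embeddings directly as inputs are assumptions for illustration rather than the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

d = 100        # term embedding size (assumed)
max_len = 7    # maximum number of terms per path (Section 4.2)
n_paths = 20   # paths per entity pair (assumed)
k = 50         # number of convolution filters (assumed)

# Encoder for a single path: 1D convolution over term embeddings, then a BiLSTM
# whose forward/backward states are concatenated into a d-dimensional path vector.
path_in = layers.Input(shape=(max_len, d))
conv = layers.Conv1D(filters=k, kernel_size=3, padding="same", activation="relu")(path_in)
encoded = layers.Bidirectional(layers.LSTM(d // 2))(conv)   # (batch, d)
path_encoder = tf.keras.Model(path_in, encoded)

# Apply the same encoder to all n paths of an entity pair.
paths_in = layers.Input(shape=(n_paths, max_len, d))
path_embeddings = layers.TimeDistributed(path_encoder)(paths_in)  # (batch, n_paths, d)
model = tf.keras.Model(paths_in, path_embeddings)
model.summary()
```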

3.3. Representation of multiple paths and output layer

Existing path-based reasoning methods frequently use max-pooling or mean operations to combine multiple paths and neglect the fact that each path provides different reasoning evidence. However, not all paths between entities represent the relation between them. To measure the importance of paths for path combination, we apply the widely used additive attention function introduced by Bahdanau et al. (2015) and calculate matching scores for all paths. First, we encode the query relation through an input module and convert it into a vector representation such that u = A(r) = r. Next, the embedding of the relation is matched against the encoded paths to measure the matching score for each path in the attention module. Each path embedding, m_i, is assigned a matching score representing the semantic similarity between relation r and path π_i. The encoded path vectors are combined using a weighted sum operation to produce state vector o:

$$
\mathrm{score}(m_i, u) = v_a^{\top} \tanh\big(W_a [m_i ; u]\big), \qquad
\alpha_i = \frac{\exp\big(\mathrm{score}(m_i, u)\big)}{\sum_{j=1}^{n} \exp\big(\mathrm{score}(m_j, u)\big)}, \qquad
o = \sum_{i=1}^{n} \alpha_i m_i
\tag{4}
$$

where $W_a \in \mathbb{R}^{d \times 2d}$ and $v_a \in \mathbb{R}^{d}$ are weight parameters and m_i is the path representation vector of the ith path, π_i. α_i is the matching score that represents the degree to which the model attends to path π_i when responding to the query relation. The weighted sum operation combines the essential information from multiple paths and maintains the values of the paths we want to focus on while discarding irrelevant paths. Finally, the probability score of the entity pair (e_s, e_t) with relation r is computed as follows:

$$
P(r \mid e_s, e_t) = \mathrm{sigmoid}\big(W_p (o + u)\big)
\tag{5}
$$

where $W_p \in \mathbb{R}^{d}$ is a learnable parameter. The output prediction score determines whether the triple is valid or not. The proposed attention model attends to more and less important paths in different manners. For example, the paths that are more relevant to the candidate relation should receive high weights. Conversely, the paths that are not relevant to the relation should receive low weights. Das et al. (2017) previously applied a similar approach using the dot product to measure the match between a path and a relation. However, the additive scoring function that we use exhibits better performance compared to the dot product and efficiently scales to large values (Vaswani et al., 2017). Moreover, it is fully differentiable and trained with standard backpropagation.

3.4. Multistep reasoning

We extend our model to perform multiple steps of reasoning, where the model must reason with the information obtained from the path representations multiple times. To implement multistep reasoning, we adopt the RNN model presented by Gan et al. (2019) and Sukhbaatar et al. (2015) and apply a layerwise mechanism in which the same weight variables are shared between multiple layers. An overview of the RNN module that we use is shown in Fig. 2. The embedded paths are stored in a memory module, and the RNN interacts with the memory module to derive reasoning evidence from the embedded paths. After reading the memory, it generates a state vector u to represent the reasoning evidence. At each step, the state vector is computed as follows:

$$
u_{z+1} = W_o (o_z + u_z)
\tag{6}
$$

where $W_o \in \mathbb{R}^{d \times d}$ denotes the transition weight parameters. The linear mapping W_o is added between layers to update u_z. The initial state u_1 is defined by the relation embedding A(r). The output state u_{z+1} is computed as the weighted sum of the previous state u_z and the path response vector o_z. At the last step, u_Z is generated and passed through a weight matrix W_p and a nonlinear activation function to compute the final prediction score:

$$
P(r \mid e_s, e_t) = \mathrm{sigmoid}\big(W_p\, u_Z\big)
\tag{7}
$$

The hyperparameter Z, which provides the best prediction scores on a development set, is empirically selected from a range of values. The general idea behind our method is to predict a missing relation between a source entity and a target entity by exploring the contextual paths between them. The proposed multistep reasoning module can repeatedly refine the prediction using newly relevant information about the candidate relation and the encoded paths of entities.
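The additive attention of Eq. (4) and the multistep update of Eqs. (6) and (7) can be sketched in a few lines of numpy. The dimensions, the random stand-in values for the path embeddings M, the relation embedding r, and the weight matrices, and the choice Z = 3 are illustrative assumptions, not trained parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 100                      # embedding size (assumed)
n = 5                        # number of paths
M = np.random.randn(n, d)    # path embeddings m_1..m_n (stand-in values)
r = np.random.randn(d)       # query relation embedding, u = A(r)

# Attention and transition parameters (Wa, va, Wo, Wp); random stand-ins here.
Wa = np.random.randn(d, 2 * d)
va = np.random.randn(d)
Wo = np.random.randn(d, d)
Wp = np.random.randn(d)

def attention_readout(M, u):
    # Eq. (4): additive attention scores and weighted sum of path embeddings
    scores = np.array([va @ np.tanh(Wa @ np.concatenate([m, u])) for m in M])
    alpha = softmax(scores)
    return alpha @ M          # state vector o

# Eqs. (6)-(7): Z reasoning steps sharing the same weights, then sigmoid output
u = r.copy()
Z = 3
for _ in range(Z):
    o = attention_readout(M, u)
    u = Wo @ (o + u)
score = 1.0 / (1.0 + np.exp(-(Wp @ u)))   # sigmoid(Wp u_Z)
print(f"P(r | e_s, e_t) = {score:.3f}")
```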

3.5. Objective function

We optimize our model by minimizing the binary cross-entropy loss. The adaptive moment estimation (Adam) optimizer is used for the optimization. The simplified form of the objective function is defined as follows:

$$
L(\Theta) = -\frac{1}{N}\left[\sum_{(e_s, r, e_t)\in T^{+}} \log P(r \mid e_s, e_t) \;+\; \sum_{(e_s^c, r^c, e_t^c)\in T^{-}} \log\big(1 - P(r^c \mid e_s^c, e_t^c)\big)\right]
\tag{8}
$$

where N is the number of triples in the training set and $T^{+}$ and $T^{-}$ denote the correct and incorrect triples, respectively. Θ denotes all learnable parameters in our model. The goal of this objective function is to train the model to produce higher values on correct triples and lower values on corrupted incorrect triples while minimizing the overall error. For link prediction, we tune the model parameters to retrieve the top k predictions from all candidates using the development set. We also use the standard L2 norm of the weights as a constraint function. The model parameters are randomly initialized and updated by taking gradient steps with a constant learning rate on batches of training triples. In our experiment, we apply a range of learning rates to find out how this affects the prediction performance. The training is stopped when the loss function converges to an optimal point.
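A minimal numpy sketch of the objective in Eq. (8), assuming the model's predicted probabilities for the correct (T+) and corrupted (T-) triples are already available; the example scores are made-up numbers.

```python
import numpy as np

def binary_cross_entropy(p_pos, p_neg):
    """Eq. (8): average negative log-likelihood over correct and corrupted triples."""
    eps = 1e-9                      # numerical safety for log(0)
    n = len(p_pos) + len(p_neg)
    return -(np.sum(np.log(np.asarray(p_pos) + eps)) +
             np.sum(np.log(1.0 - np.asarray(p_neg) + eps))) / n

print(binary_cross_entropy([0.9, 0.8], [0.2, 0.1]))   # ~0.16
```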

4. Experiments

We evaluate our model on link prediction tasks and report the results on four different KGs. The statistics of the graph datasets are presented in Table 1.

Table 1
Statistics of the datasets used in this study.

Datasets     #entities   #relations   #train    #dev     #test    #tasks
NELL995      75,492      200          154,213   5000     5000     12
FB15k-237    14,951      1345         483,142   50,000   59,071   10
Countries    272         2            1158      68       72       2
Kinship      104         26           6926      769      1069     26

The hyperparameters of our model that result in the best performance on the development set are selected via a small grid search. Several measures are adopted to quantitatively evaluate our model, including F1, mean average precision (MAP), and mean reciprocal rank (MRR). MAP is the average of the precision values at the ranks where relevant correct entities are ranked. The MAP score is computed using the following equation:

$$
\mathrm{MAP} = \frac{1}{|Q_r|} \sum_{q \in Q_r} \mathrm{AP}(q)
\tag{9}
$$

where AP(q) is the average of the precision scores at the rank locations of each correct triple. MRR is the mean of the reciprocal rank positions of the first correct triple over all queries. The MRR score is computed using the following equation:

$$
\mathrm{MRR} = \frac{1}{|Q_r|} \sum_{q \in Q_r} \frac{1}{\mathrm{rank}_q}
\tag{10}
$$
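For reference, MRR and Hits@k can be computed directly from the rank of the correct entity in each query, as in the short Python sketch below; the rank values in the usage line are arbitrary illustrative numbers.

```python
def mrr_and_hits(ranks, k=3):
    """ranks: 1-based rank of the correct entity for each test query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits_k

print(mrr_and_hits([1, 2, 5, 1, 10], k=3))  # (0.56, 0.6)
```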

In the following sections, we describe the implementation details of our experiments and compare the results on the link prediction tasks obtained by the proposed model and the baseline methods. Then, we present the results of applying multistep reasoning to paths and analyze the embeddings of these paths by reducing them to a two-dimensional space.

4.1. Baseline methods

We compare our model with recent path-based reasoning methods that have achieved state-of-the-art performance on KGC tasks, as detailed below.

• PRA: This was the first method to implement path-based reasoning; it was presented by Lao et al. (2011). It uses distinct features to represent the paths that connect entities, creates a large feature matrix, and then trains a binary classification model on the feature matrix.
• DistMult: This is an approach (Yang et al., 2015) based on the Bilinear Diagonal model for knowledge base completion that represents the target relation as a diagonal matrix.
• Single-Model: This is a path learning method that applies an RNN model (Das et al., 2017) to path sequences and uses pooling methods such as mean, max, or log operations to combine the resulting vectors. In our experiment, it was reimplemented with type embeddings. Based on the experimental results described in their paper, we apply the LogSumExp operator for score aggregation because it provides better performance compared to other pooling methods.
• DeepPath: This is an RL approach (Xiong, Hoang, & Wang, 2017) that implements a path learning process based on pretrained KG embeddings obtained by Bordes et al. (2013). It encodes the continuous state of an RL agent to perform path reasoning in the vector space environment of a KG.
• MINERVA: This is a method (Das et al., 2018) of searching a graph for answer-providing paths using RL conditioned on an input question.
• ConvE: This is a method (Dettmers et al., 2018) that uses 2D convolutions over entity and relation embeddings to predict missing links in knowledge graphs. ConvE is the simplest multi-layer convolutional architecture for link prediction: it is defined by a single convolution layer, a projection layer to the embedding dimension, and an inner product layer.

The original source code provided by the authors of Lao et al. (2011), Xiong et al. (2017), and Das et al. (2018) is used to reproduce the experimental results for these models. We set the parameters of each model using the parameter settings recommended by the authors, and the obtained results are close to the results presented in the papers.

4.2. Parameter settings and datasets

We use four benchmark datasets, i.e., Countries (Bouchard, Singh, & Trouillon, 2015), Kinship (Kok & Domingos, 2007), NELL (Xiong et al., 2017), and FB15k-237 (Toutanova et al., 2015). NELL is a system that crawls the web and learns entities and relationships endlessly. We use the NELL dataset that was generated after the 995th iteration by Xiong et al. (2017). The FB15k-237 dataset was created from the original WN18 and FB15K datasets by removing massively redundant inverse triples. Countries is a small dataset introduced by Bouchard et al. (2015) that consists of countries, regions (e.g., EUROPE), 23 subregions, 1158 facts about the neighborhood of countries, and the locations of countries and subregions. The Kinship dataset contains the family relationships between the members of a small tribe from Central Australia.

We use existing domain and range triples to obtain the types of entities for the large datasets. However, we generate these triples manually for the small datasets. For testing, we remove the overlapping triples that appear in both the training and test datasets. For the large datasets, i.e., NELL995 and FB15k-237, we use the correct and incorrect triples previously created by Das et al. (2018) and Xiong et al. (2017) to best compare our model with prior work. For the remaining small datasets, we produce incorrect triples by randomly switching source or target entities of correct test triples. We add target triples and their inverses to the graph to prepare reasoning paths for model training and evaluation. After generating paths, we remove the paths that contain target triples at the beginning or end to prevent overfitting problems. For example, suppose there is a path from x to y connected through isa, isa⁻¹, and athleteplaysforteam: isa(x, Person) ∧ isa⁻¹(Person, x′) ∧ athleteplaysforteam(x′, y) ⇒ athleteplaysforteam(x, y). When generating such a path, x and x′ can be the same person such that x′ = x. In such cases, the test triple athleteplaysforteam(x, y) is required to already be present in the graph. Additionally, we observe that certain relation paths that connect positive entity pairs can also appear in the negative entity pairs, as illustrated in Fig. 5a. As such paths are not useful for prediction, they are excluded from consideration. Furthermore, longer paths are considered less reliable and effective than short paths (Das et al., 2018) on link prediction tasks. Thus, in our experiment, we restrict the maximum length of relation paths to 3 (relations) in the PRA setting so that, after incorporating the types of entities into relation paths, a reasoning path consists of at most 7 terms excluding paddings. We consider the relation paths with a random walk probability score larger than 0.1 that reach a target entity. To obtain a fair evaluation of our model, the correct or incorrect entity pairs that have no specific paths are eliminated from training and assigned a score of negative infinity for the evaluation.

We randomly initialize all the model parameters and use the Adam optimizer as the training algorithm with a minibatch size of 64. Training is stopped when the accuracy on the development set does not improve by 10⁻² within the last 10 epochs. We apply a grid search approach to tune the hyperparameters in our model.


Table 2
Comparison with state-of-the-art methods on the NELL995 and FB15k-237 datasets.

                                   NELL995                            FB15k-237
Method                             MAP    MRR    Hits@1  Hits@3       MAP    MRR    Hits@1  Hits@3
PRA (Lao et al., 2011)             0.696  0.696  0.637   0.747        0.412  0.412  0.322   0.331
DeepPath (Xiong et al., 2017)      0.809  0.835  0.744   0.890        0.449  0.459  0.343   0.436
Single-Model (Das et al., 2017)    0.855  0.859  0.788   0.914        0.574  0.575  0.512   0.567
MINERVA (Das et al., 2018)         0.876  0.879  0.813   0.931        0.606  0.615  0.490   0.659
DistMult (Yang et al., 2015)       –      0.863  0.801   0.907        –      0.541  0.413   0.554
ConvE (Dettmers et al., 2018)      –      0.909  0.904   0.929        –      0.567  0.444   0.629
Our approach                       0.894  0.898  0.838   0.951        0.652  0.660  0.544   0.708

Table 3
Link prediction results over NELL995.

Task (relation)         #Q     PRA     TransE   DeepPath   MINERVA   Our approach
athletePlaysInLeague    381    0.904   0.773    0.951      0.954     0.978
worksFor                513    0.619   0.676    0.721      0.816     0.838
orgHiredPerson          486    0.512   0.722    0.797      0.866     0.855
athletePlaysSport       603    0.533   0.963    0.894      0.985     0.980
teamPlaysSport          112    0.787   0.760    0.729      0.739     0.864
bornLocation            193    0.734   0.604    0.782      0.757     0.810
athleteHomeStadium      201    0.861   0.717    0.843      0.924     0.914
orgHeadquaterCity       249    0.855   0.619    0.807      0.942     0.937
athletePlaysForTeam     387    0.600   0.626    0.765      0.839     0.860
Overall                 3125   0.711   0.717    0.809      0.876     0.894

We select the learning rate, γ, for the Adam optimizer among {0.001, 0.002, 0.0025, 0.003}, the dimension of vectors, k, among {50, 100}, the number of hidden units in the BiLSTM among {64, 128}, the number of filters among {30, 40, 50, 60}, and the weight of regularization, λ, among {0, 0.005, 0.01, 0.1, 0.5, 1}. We adjust the minibatch size to make each epoch contain 64 minibatches and run the training process for 100 epochs. We select the optimal configuration based on the performance on the development set.

4.3. Link prediction results

Following Bordes et al. (2013), we performed link prediction tasks to evaluate our method. For KGs, link prediction predicts the missing e_s or e_t for a correct triple (e_s, r, e_t). Instead of obtaining the single best entity, this task emphasizes the rank of the original correct entity. The evaluation of this task mainly uses two metrics, i.e., the rank position of the first correct entity (MRR) and the proportion of correct entities in the top k ranks (Hits@k).

The link prediction evaluation contains two parts. The first is to evaluate the source entity prediction. In this evaluation, the real source entity for each triple in the test set is replaced with all entities in the dictionary, which produces sets of corrupted triples that contain the ground truth. Then, we use our model to compute the scores of the corrupted triples and rank each corrupted set together with the original triple in ascending order. Next, we place the entities in the candidate set for relation r ahead of the noncandidate set and keep the order relatively unchanged within the candidate set and the noncandidate set. Finally, we record the rank of the original correct source entity. The second part is to evaluate the target entity prediction in the same manner. MAP is calculated by averaging the source rank and target rank over all test triples obtained above. The Hits@k evaluation also contains two parts, i.e., Hits@k for source entities and Hits@k for target entities. We combine the two parts into a single task in our evaluation. We increase the hit count when the correct source entity is in the top k ranks. The Hits@k for source and target entities is the overall hit rate among all test triples. In a few cases, randomly sampled corrupted triples could be included in the correct triples. To eliminate this issue, we remove the corrupted triples already included in the training, validation, and test sets that rank ahead of the correct entity. Our experimental results on NELL995 and FB15k-237 are shown in Table 2.
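As a simplified illustration of the ranking procedure above, the sketch below scores a set of candidate entities for a query and records the rank of the ground-truth entity; the toy query and scores are made up, and the additional adjustment that places candidate-set entities ahead of noncandidate entities is omitted.

```python
def rank_of_correct(scores_by_entity, correct_entity):
    """Rank (1-based) of the correct entity among all candidates; higher score is better."""
    ordered = sorted(scores_by_entity, key=scores_by_entity.get, reverse=True)
    return ordered.index(correct_entity) + 1

# Toy scores for the query (Bill Gates, FounderOf, ?)
scores = {"Microsoft": 0.93, "Redmond": 0.40, "Harvard": 0.11}
print(rank_of_correct(scores, "Microsoft"))   # 1
```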

For the large datasets, we report hit scores at ranks 1 and 3 because the Hits@10 scores of the baseline models are identical to those of our model owing to the lack of negative samples in the datasets. At these hit rates, our model considerably minimizes undesired or unexpected results. In Table 3, we report the results of link prediction on NELL995 for each task. MINERVA and our approach achieve high performance on the NELL995 dataset, and our approach improves the overall performance of MINERVA by 0.2% in terms of MAP. In particular, our model performs better on challenging tasks such as bornLocation and playsForTeam. Further, we compare the MRR, MAP, and Hits@3 values of the baselines on NELL995 and FB15k-237 in Table 2 and observe that our model can more accurately predict missing links on the large datasets. Note that for the Countries and Kinship datasets, DeepPath cannot find sufficient paths to perform RL for most target relations.

Additionally, we compared our approach with the existing non-path models in Table 2 to verify the competitiveness of the approach on the knowledge graph completion task. In our experiments, we achieved results comparable to the state-of-the-art methods across all evaluation metrics. In particular, the MRR and Hits@k scores of ConvE are similar to those of our approach. Notably, our approach performed better than ConvE on KGs such as FB15k-237 with a large number of diverse relations, whereas ConvE gave slightly better results on the dataset with fewer relations. Also, we conducted experiments with varying lengths of paths and present the results in Table 4.

Table 4
Experimental results on NELL995 given partial entity types and different lengths of paths.

Metric              MAP     MRR     Hits@1   Hits@3
Coverage = 100%     0.894   0.898   0.838    0.951
Coverage = 70%      0.891   0.893   0.830    0.946
Coverage = 50%      0.889   0.892   0.829    0.946
Coverage = 30%      0.888   0.891   0.828    0.945
Path length = 3     0.894   0.898   0.838    0.951
Path length = 4     0.884   0.887   0.821    0.944

Table 5
Experimental results on the Kinship and Countries datasets.

                                    Kinship                                   Countries
Method                              MAP    MRR    @1     @3     @10           MAP    MRR    @1     @3     @10
PRA (Lao et al., 2011)              0.724  0.799  0.699  0.896  0.922         0.687  0.739  0.577  0.900  0.990
MINERVA (Das et al., 2018)          0.816  0.824  0.710  0.937  0.922         0.960  0.960  0.925  0.995  0.995
Single-Model (Das et al., 2017)     0.804  0.804  0.814  0.885  0.916         0.941  0.941  0.918  0.956  0.956
Our approach                        0.946  0.952  0.918  0.984  0.990         0.947  0.947  0.916  0.986  0.985

Fig. 3. Precision, recall, and F1 evaluations on NELL995 and Kinship datasets.

Fig. 4. (a) Visualization of attention over the paths for the 'worksfor' relation of the NELL995 dataset. The horizontal axis indicates the randomly sampled examples and the vertical axis indicates the paths at each step. Each pixel shows the attention weight α_ij of the ith path for the jth sample. (b) MAP scores obtained on the 'worksfor' relation of NELL995 with different steps using our approach.

With a longer path length, the PRA generates a considerable number of paths, which can cause memory overflow. To avoid this problem, we had to reduce the number of paths by increasing the path threshold in the PRA to a higher value such as 0.9. As a result, some entities in the test data could not be connected by paths, leading to a small decrease in the overall performance. We also observed that shorter paths tend to provide more reliable reasoning evidence than longer paths on the NELL dataset.

Additionally, in the above experiments, we use type embeddings to represent entities and expect the entity types to be fully covered by the KG; however, in a few cases, the type information can be inaccessible or only partially available. If the type information is unavailable for an entity, we can use the embedding of that entity instead of representing the entity by its type. Accordingly, we performed additional experiments on the NELL dataset with entity types that are only partially covered by the KG and present the results in Table 4. The length of paths was set to 3. We noticed that without representing an entity by its type, the performance of knowledge graph completion decreases slightly. We suspect that this is because some entities in the test data were not present in the training data.

The experimental results on the small datasets, namely, Kinship and Countries, are shown in Table 5. Our model achieves excellent results on the Kinship dataset because this dataset was created to evaluate the reasoning ability of logic rule learning systems and contains more predictable paths compared to the other datasets. However, on the Countries dataset, our model shows lower results compared to the baseline models. We expect that this is because the number of training triples in the Countries dataset is too small to efficiently train our model. The precision, recall, and F1 scores of the different models on the NELL995 and Kinship test sets are shown in Fig. 3, and the F1 score of our model is the highest.

Table 6
Reasoning paths with the corresponding attention weights.

Relation                Reasoning paths                                                                        α
athleteplaysinleague    athlete homestadium stadium homestadium-1 athlete playsin league                       0.046
                        athlete playsfor team coachesteam-1 coach coachesin league                             0.044
                        athlete playsfor team playsin league competeswith league                               0.039
teamplayssport          team coachesteam-1 coach worksfor org plays sport                                      0.024
                        team coachesteam-1 coach hiredperson-1 organization plays sport                        0.023
                        team playsagainst team belongsto-1 person plays sport                                  0.018
athleteplayssport       athlete playsin sportsleague belongsto-1 human plays sport                             0.068
                        athlete playsfor team belongsto-1 person plays sport                                   0.043
                        athlete playsfor team sportsgame-1 game sportsgame sport                               0.036
worksfor                journalist writesfor publication writesfor-1 journalist collaborateswith-1 company     0.028
                        journalist writesfor publication writesfor-1 journalist worksfor company               0.023
                        journalist collaborateswith-1 agent acquired-1 company acquired company                0.022

Fig. 5. (a) Path distribution over NELL data. Horizontal and vertical axes indicate the number of positive and negative examples, respectively. We display the marginal distribution of paths on the horizontal and vertical axes as histograms. (b) Visualization of path embeddings.

The recall score of our model is approximately 10% higher than that of MINERVA. This shows that our model can more accurately retrieve relevant results. Our method achieves the best F1 scores on the link prediction tasks on the NELL995 and Kinship datasets. This implies that our model can find the right candidate sets with high probability; otherwise, we would obtain decidedly worse results, because we simply treat the triples whose entities are not in the candidate sets as negative. Overall, our system works best on the different tasks of the link prediction problem in terms of all metrics and outperforms the baseline methods on all datasets except for the Countries dataset. The experimental results demonstrate that our model can effectively predict missing links for different kinds of KGs, particularly large-scale KGs.

4.4. Error analysis

We vary the number of reasoning steps from 1 to 7 to tune our model. Fig. 4b shows a comparison of the results obtained with different numbers of reasoning steps on the NELL dataset. We apply the same optimal parameters to our model as previously mentioned. We observed that the MAP score increases gradually with the number of reasoning steps. The MAP score reaches its highest value after the 3rd step and then decreases. In addition to the steps, the path-level attention weights are shown in Fig. 4a. Higher and lower attention weights are denoted by yellow and green colors, respectively.

At first, the model attends to numerous useless paths, which yields a low MAP. However, as the model reaches the 3rd reasoning step, the MAP score increases significantly and the model attends to fewer paths. This produces a high prediction score. We observe that MAP decreases gradually starting from the 4th and 5th steps owing to overfitting. In Table 6, we illustrate reasoning paths with the corresponding weights, α, computed by our model. The paths with the highest attention weights are shown in the table, and the weights are sorted in descending order.

In Fig. 5b, we visualize the embeddings of reasoning paths from the NELL dataset in a two-dimensional vector space. To visualize the embeddings of paths, we randomly sample 20 entity pairs for each relation from the test dataset and compute the embeddings of the paths for these pairs. Then, we apply the t-SNE dimension reduction technique (van der Maaten & Hinton, 2008) to the learned embedding vectors and reduce the number of dimensions to two. Most entity pairs are associated with multiple paths from the KG, and we use different colors to indicate the different relations. We observe that the embeddings of reasoning paths are clustered by relation. In Fig. 5a, we plot the distribution of reasoning paths for all target relations as a scatter plot on the NELL test dataset. The scatter plot shows the correlation between positive and negative paths. A good path should lie along the horizontal or vertical axis. The paths that are not informative for prediction can be assigned low weights and drowned out during training.
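The t-SNE projection used for Fig. 5b can be reproduced in outline with scikit-learn, as in this small sketch; the random matrix stands in for the learned path embeddings, and the settings are illustrative only.

```python
import numpy as np
from sklearn.manifold import TSNE

path_embeddings = np.random.randn(200, 100)   # stand-in for learned path vectors
coords = TSNE(n_components=2, random_state=0).fit_transform(path_embeddings)
print(coords.shape)   # (200, 2) -- each path now has a 2D coordinate for plotting
```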


5. Conclusion

In this paper, we propose a new approach for KGC that combines BiLSTM and CNN modules with an attention mechanism. Given a candidate relation and two entities, we encode the paths that connect the entities into a low-dimensional space using a convolutional operation followed by BiLSTM. Then, an attention layer is applied to combine multiple paths efficiently. We further extend our model to perform multistep reasoning over path representations in an embedding space. Compared to other models, our path encoder is more effective at extracting features from paths in large graphs. In addition, we illustrate that applying multistep reasoning can be useful in path-based reasoning. In our experiments, we show that our method performs better than recent state-of-the-art methods on link prediction tasks over challenging KGs and can efficiently represent the paths between two entities to predict the missing relation between them.

Finally, in our study, we only used a single type for entity representation, but in large KGs, most entities have multiple types. To fully express the semantic features of entities, in a future study, we will incorporate multiple types into path encoding. In addition, the paths between entity pairs may overlap, and the number of paths can greatly increase as the number of pairs grows. To deal with this issue, we will investigate using external memory representations to store the overall paths of entity pairs in the entire graph.

Acknowledgement

This work was supported by the Institute for Information & communications Technology Promotion (IITP) and funded by the Korea government (MSIT) [grant number 2019000067] (Semantic Analysis Reasoning Methods for Automatic Completion of Large Scale Knowledge Graph).

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Batselem Jagvaral: Conceptualization, Data curation, Writing - original draft, Writing - review & editing. Wan-Kon Lee: Writing - review & editing. Jae-Seung Roh: Data curation. Min-Sung Kim: Data curation. Young-Tack Park: Conceptualization, Writing - original draft, Writing - review & editing.

References

Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems 26 (pp. 2787–2795).
Bouchard, G., Singh, S., & Trouillon, T. (2015). On approximate reasoning capabilities of low-rank vector spaces. AAAI spring symposia.
Das, R., Dhuliawala, S., Zaheer, M., Vilnis, L., Durugkar, I., Krishnamurthy, A., et al. (2018). Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In International conference on learning representations (ICLR).
Das, R., Neelakantan, A., Belanger, D., & McCallum, A. (2017). Chains of reasoning over entities, relations, and text using recurrent neural networks. In Proceedings of the 15th conference of the European chapter of the Association for Computational Linguistics: Volume 1, Long papers (pp. 132–141). Association for Computational Linguistics.

Dettmers, T., Minervini, P., Stenetorp, P., & Riedel, S. (2018). Convolutional 2D knowledge graph embeddings. AAAI.
Gan, Z., Cheng, Y., Kholy, A. E., Li, L., Liu, J., & Gao, J. (2019). Multi-step reasoning via recurrent dual attention for visual dialog. arXiv:1902.00579.
Gardner, M., & Mitchell, T. (2015). Efficient and expressive knowledge base completion using subgraph feature extraction. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 1488–1498). Lisbon, Portugal: Association for Computational Linguistics.
Gardner, M., Talukdar, P. P., Kisiel, B., & Mitchell, T. (2013). Improving learning and inference in a large knowledge-base using latent syntactic cues. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 833–838). Seattle, Washington, USA: Association for Computational Linguistics.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.
Kok, S., & Domingos, P. (2007). Statistical predicate invention. In Proceedings of the 24th international conference on machine learning (pp. 433–440). New York, NY, USA: ACM.
Lao, N., & Cohen, W. W. (2010). Relational retrieval using a combination of path-constrained random walks. Machine Learning, 81(1), 53–67.
Lao, N., Mitchell, T., & Cohen, W. W. (2011). Random walk inference and learning in a large scale knowledge base. In Proceedings of the conference on empirical methods in natural language processing (pp. 529–539). Stroudsburg, PA, USA: Association for Computational Linguistics.
Lin, Y., Liu, Z., Luan, H., Sun, M., Rao, S., & Liu, S. (2015). Modeling relation paths for representation learning of knowledge bases. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 705–714). Lisbon, Portugal: Association for Computational Linguistics.
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Neelakantan, A., Roth, B., & McCallum, A. (2015). Compositional vector space models for knowledge base completion. In Proceedings of the 53rd annual meeting of the Association for Computational Linguistics and the 7th international joint conference on natural language processing (Volume 1: Long papers) (pp. 156–166). Association for Computational Linguistics.
Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2015). A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104, 11–33.
Nickel, M., Tresp, V., & Kriegel, H.-P. (2011). A three-way model for collective learning on multi-relational data. In Proceedings of the 28th international conference on machine learning, ICML 2011 (pp. 809–816).
Socher, R., Chen, D., Manning, C. D., & Ng, A. Y. (2013). Reasoning with neural tensor networks for knowledge base completion. In Proceedings of the 26th international conference on neural information processing systems – Volume 1 (pp. 926–934). USA: Curran Associates Inc.
Sukhbaatar, S., Szlam, A., Weston, J., & Fergus, R. (2015). End-to-end memory networks. In Proceedings of advances in neural information processing systems (NIPS) 28 (pp. 2440–2448).
Toutanova, K., Chen, D., Pantel, P., Poon, H., Choudhury, P., & Gamon, M. (2015). Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 1499–1509). Lisbon, Portugal: Association for Computational Linguistics.
Trouillon, T., Dance, C. R., Welbl, J., Riedel, S., Gaussier, É., & Bouchard, G. (2017). Knowledge graph completion via complex tensor factorization. Journal of Machine Learning Research, 18.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). Attention is all you need (pp. 5998–6008).
Wang, Q., Mao, Z., Wang, B., & Guo, L. (2017). Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12), 2724–2743.
Wang, Q., Wang, B., & Guo, L. (2015). Knowledge base completion using embeddings and rules. In Proceedings of the 24th international conference on artificial intelligence (pp. 1859–1865). AAAI Press.
Wang, Q., Yuanfei, Luo, Wang, B., & Lin, C.-Y. (2016). Knowledge base completion via coupled path ranking. In Proceedings of the 54th annual meeting of the Association for Computational Linguistics (ACL) (pp. 1308–1318). Association for Computational Linguistics.
Xiaotian, J., Quan, W., Baoyuan, Q., Yongqin, Q., Peng, L., & Bin, W. (2017). Attentive path combination for knowledge graph completion. In Proceedings of the ninth Asian conference on machine learning: 77 (pp. 590–605). PMLR.
Xiong, W., Hoang, T., & Wang, W. Y. (2017). DeepPath: A reinforcement learning method for knowledge graph reasoning. In Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 564–573). Association for Computational Linguistics.
Yang, B., Yih, W., He, X., Gao, J., & Deng, L. (2015). Embedding entities and relations for learning and inference in knowledge bases. In ICLR (p. 12).
Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., et al. (2016). Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th annual meeting of the Association for Computational Linguistics (ACL) (pp. 207–212). Association for Computational Linguistics. doi:10.18653/v1/P16-2034.