Ensemble method to joint inference for knowledge extraction

Expert Systems With Applications 83 (2017) 114–121

Yongbin Liu a,b, Chunping Ouyang a,∗, Juanzi Li b

a College of Computer Science and Technology, University of South China, Hengyang 421001, China
b Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Article history: Received 25 October 2016; Revised 16 March 2017; Accepted 18 April 2017; Available online 22 April 2017

Keywords: Ensemble learning; Joint inference; Knowledge extraction; Markov logic network

Abstract: Joint inference is a fundamental issue in the field of artificial intelligence. Its greatest advantage is the capability of avoiding errors that cascade and accumulate along a pipeline of multiple chained sub-tasks. The Markov Logic Network (MLN) is the most common joint inference model; it provides a flexible representation, handles uncertainty, and has been applied successfully to joint inference on many natural language processing tasks to avoid error propagation. However, because of the great expressiveness of first-order logic, its representation in an MLN generates rather complicated graph structures, which makes learning and inference on large-scale data intractable. In this paper, we present an ensemble learning approach to deal with these challenges in MLNs. Firstly, we give a proof within the probably approximately correct (PAC) framework that establishes the conditions under which the ensemble learning approach can be applied to MLNs. Secondly, we explain how to combine the learners. Finally, to illustrate the working mechanism of the ensemble joint inference model, we present an Ensemble Markov Logic Networks (EMLNs) method and use it to extract knowledge from a large-scale corpus published by Google (code.google.com/p/relation-extraction-corpus/). Experiments suggest that a significant speedup can be gained by the EMLNs, and that this approach leads to higher precision and recall than pipeline approaches. © 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Recently, joint inference has been successfully applied in many fields. One of the most famous models is the Markov logic network (MLN) (Domingos & Lowd, 2009; Richardson & Domingos, 2006), an uncertainty extension of first-order logic. Markov logic bridges the gap between first-order logic and probabilistic graphical models. MLNs have been successfully employed in a number of NLP tasks, including joint information extraction (Poon & Domingos, 2007; 2009), jointly identifying senses (Meza-Ruiz & Riedel, 2009), and joint WSD and SRL (Che & Liu, 2010).

Despite the obvious benefits of joint inference based on MLNs, a main difficulty encountered by previous work is the scalability of Markov logic networks: learning and inference in the graphical model are slower than in other methods, because first-order logic formulae connect too many nodes in the graph. The size of the largest clique (factor) and the number of nodes (random variables) in the Markov network (factor graph) grow quite rapidly with the


size of the corpus, even though the formulae are fixed. A Markov network is generated by grounding a Markov logic network; grounding is the process of creating a network for finding a probable state of the world. In a factor graph representation of a ground MLN, the number of factors grows exponentially with the number of ground atoms, so the ground network becomes very large, and even approximate inference in a network involving large evidence datasets remains intractable. For example, Fig. 1 shows the learning time of the MLNs (CPU 2.70 GHz, RAM 32.0 GB). No matter what approximate inference is used, learning and inference are not easy, so the scalability of the joint model based on Markov logic is a severe challenge, and much work is needed to speed up learning.

We take a fresh look at this challenge as an ensemble learning problem (Opitz & Maclin, 1999) for the first time. Ensemble learning is a learning paradigm in which a finite number of learners are combined for the same task (Krogh, 1996), whereas the joint model is performed on multiple tasks. Why can we apply an ensemble learning method to a joint model? How do we construct the ensemble joint model? In this paper, we answer these questions and extend ensemble learning to the more general scenario of inference on multiple different tasks. Firstly, we address the conditions under which ensemble learning can be applied to joint inference within the probably approximately correct (PAC)


Fig. 1. Learning time versus the number of variables and factors.

framework. Then we figure out how to combine the learners. Finally, we choose Markov Logic Networks (MLNs) to achieve the joint inference and present a novel Ensemble Markov Logic Networks (EMLNs) model. The key of our method is the EMLNs model, in which we adopt Markov logic networks for joint inference on multiple tasks and employ the divide-and-conquer strategy of ensemble learning to address the intractable inference of Markov logic networks on large-scale data. The experimental results on the large-scale corpus published by Google show that our approach performs better than pipeline systems. The main contributions of this paper can be summarized as follows:

• Traditionally, probably approximately correct (PAC) learning refers to a single concept class. We discuss the PAC framework for the multiple tasks in the joint inference model and extend PAC learning to multi-concept classes.
• We present an ensemble learning approach to joint inference on three NLP sub-tasks. We explain how to combine weak learners into a strong one and present a dynamic weighted combination method for the ensemble joint inference model.
• Our Ensemble Markov Logic Networks (EMLNs) address the intractability of Markov Logic Networks on large-scale data. Experiments show that this approach leads to higher precision and recall than pipeline approaches.

The rest of this paper is organized as follows. Section 2 reviews related work, and Section 3 establishes the PAC learnability of the ensemble joint model. Section 4 describes how to combine the base learners, Section 5 presents the Markov logic networks for knowledge extraction, Section 6 reports experiments on the Google corpus, and Section 7 concludes the paper.

2. Related work

2.1. Joint inference

In recent years, there has been increasing interest in joint inference over multiple natural language processing tasks (McCallum, 2009). Joint inference allows bi-directional information flow and thus avoids error propagation. McCallum and Jensen (2003) proposed

undirected graphical models for joint information extraction. Roth and Yih (2007) employed integer linear programming (ILP) for global inference. Many other papers presented joint inference models based on Markov logic for NLP tasks: for example, Meza-Ruiz and Riedel (2009) jointly identified predicates, arguments and senses, and Che and Liu (2010) explored joint modeling of semantic role labeling and word sense disambiguation.

Through first-order logic, Markov logic networks are expressive enough for general AI, which makes the hypothesis space of learning and inference larger than that of a posterior probability model, so learning and inference in a Markov logic network are not easily handled. Researchers have therefore presented various approaches to improve learning and inference in MLNs, e.g. reducing the size of the network (Mihalkova & Richardson, 2009; Shavlik & Natarajan, 2009) and parallelizing inference on Markov logic networks (Beedkar, Del Corro, & Gemulla, 2013; Niu, Ré, Doan, & Shavlik, 2011). Although the size-reduction approaches can speed up learning and inference to a certain extent, they still cannot cope with large evidence datasets. The parallel methods suffer from the problem of partitioning MLNs, which is too expensive in practice even for a quite simple MLN (Ahmad, Halawani, & Albidewi, 2012; Zhang et al., 2015). We therefore aim to address the problem in a simpler way.

2.2. Ensemble learning

Ensemble methods, which have been a hot topic since the 1990s, train multiple learners and then combine them. They have achieved great success in many real-world tasks (Zhou, 2012); the representative state-of-the-art approaches are Boosting (Schapire, 1990) and Bagging (Breiman, 1996). Dietterich (2000) attributed the benefit of ensemble methods to three fundamental reasons: a statistical issue, a computational issue, and a representational issue. Statistical reason: a learning algorithm can be viewed as searching a large hypothesis space H for the best hypothesis f, but no algorithm can search the whole space, so in practice each algorithm returns an approximate result. From a statistical point of view, "averaging" several different hypotheses can improve the approximation.


Fig. 2. Three fundamental reasons for combination: (a) the statistical issue, (b) the computational issue, and (c) the representational issue (Dietterich, 2000).

As shown in Fig. 2(a), the inner curve indicates the set of hypotheses {h1, h2, h3}; we can find a better approximation to f by combining these hypotheses. Computational reason: many learning algorithms perform local search and only reach local optima; indeed, globally optimal training is NP-hard for them. An ensemble that combines the results of local searches started from different points may therefore provide a good approximation to f, as shown in Fig. 2(b). Representational reason: in some learning tasks, the unknown function f cannot be represented by any hypothesis in H; by combining hypotheses it may be possible to expand the space of representable functions, as shown in Fig. 2(c). The previous section described the computational challenge of MLNs; in the next section, we explore an ensemble learning method for the joint model to address the scalability of Markov logic.
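As a toy illustration of the statistical and computational arguments above (not part of the original paper), the following sketch fits the same hypothesis class on several bootstrap resamples and averages the predictions; the data, hypothesis class and constants are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of an unknown target function f(x) = sin(x).
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

def fit_hypothesis(xs, ys, degree=5):
    """One hypothesis h_i: a polynomial fit on a bootstrap resample."""
    idx = rng.integers(0, xs.size, size=xs.size)
    return np.polyfit(xs[idx], ys[idx], degree)

hypotheses = [fit_hypothesis(x, y) for _ in range(10)]  # h_1 ... h_N

x_test = np.linspace(-3, 3, 50)
single = np.polyval(hypotheses[0], x_test)                               # one hypothesis
combined = np.mean([np.polyval(h, x_test) for h in hypotheses], axis=0)  # "averaged" ensemble

f_true = np.sin(x_test)
print("MSE of a single hypothesis:", np.mean((single - f_true) ** 2))
print("MSE of the combination   :", np.mean((combined - f_true) ** 2))
```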

3. Joint model ensemble learnability

For a single task, the learnability of ensemble learning was proved by Schapire (1990). In our work, the problem is extended to joint inference ensembles dealing with multiple tasks, and it is discussed within the probably approximately correct (PAC) framework. Traditionally, PAC learning states that a concept class C is PAC-learnable by L using H if, for any target concept c in C, L will with probability (1 − δ) output a hypothesis h with error_D(h) < ε, after observing a reasonable number of training examples and performing a reasonable amount of computation (Mitchell, 1997). We extend PAC learning to multi-concept classes; Definition 1 states this extension precisely.

Definition 1. Consider a set of concepts C_∪ = {C_1, C_2, ..., C_k} of length k defined over a set of instances X of length n, and a learner L using a set of hypothesis spaces H_∪ = {H_1, H_2, ..., H_k}, each H_i corresponding only to C_i. C_∪ is PAC-learnable by L using H_∪ if, for all c_i ∈ C_i, i = 1, 2, ..., k, distributions D_i over X, ε_i such that 0 < ε_i < 1/2, and δ_i such that 0 < δ_i < 1/2, learner L will with probability at least (1 − δ_i) output a hypothesis h_i ∈ H_i such that error_{D_i}(h_i) < ε_i, in time that is polynomial in 1/ε_i, 1/δ_i, n_i, and size(c_i).

This definition spells out the condition for L to be PAC-learnable: the learning process of L must run in time that grows at most polynomially, and it must guarantee arbitrarily high confidence (1 − δ_i) and arbitrarily low error ε_i for every c_i. Here size(c_i) refers to the inherent complexity of the concept C_i. It is easy to see that the definition extends PAC learnability (Mitchell, 1997).

It is useful to introduce the notion of a version space before the proof. A version space in concept learning or induction is the subset of all hypotheses that are consistent with the observed training examples (Mitchell, 1982). In this paper, the version space is defined in accordance with Mitchell (1997):

VS_{H_i, D_i} = { h_i ∈ H_i | ∀⟨x, c_i(x)⟩ ∈ D_i : h_i(x) = c_i(x) },  i = 1, 2, ..., k    (1)

We also adopt the applied study of the version space conducted by Haussler (1988), whose key notion is that the version space is ε-exhausted. Definition 2 is the extension of Haussler's definition (Haussler, 1988).

Definition 2. Consider a set of hypothesis spaces H_∪ = {H_1, H_2, ..., H_k}, a target set of concepts c_∪ = {c_1, c_2, ..., c_k}, a set of instance distributions D_∪ = {D_1, D_2, ..., D_k}, and a set of training examples D_∪ = {D_1, D_2, ..., D_k} of c_∪. If every hypothesis h_i ∈ H_i in VS_{H_i, D_i} has error less than ε_i with respect to c_i and D_i, then VS_{H_i, D_i} is said to be ε_i-exhausted with respect to c_i and D_i. A set of version spaces VS_{H,D} = {VS_{H_1,D_1}, VS_{H_2,D_2}, ..., VS_{H_k,D_k}} is said to be ε_∪ = {ε_1, ε_2, ..., ε_k}-exhausted with respect to c_∪ and D_∪ if every VS_{H_i,D_i} (i = 1, 2, ..., k) is ε_i-exhausted with respect to c_i and D_i.

Mitchell (1997) derived a theorem from the definition of the version space. In this paper, we extend that theorem to a set of version spaces. The precise statement is as follows:

Theorem 1 (ε_∪-exhausting the version space). If a set of hypothesis spaces H_∪ = {H_1, H_2, ..., H_k} is finite, and D_∪ = {D_1, D_2, ..., D_k} is a sequence set of m_∪ = {m_i ≥ 1 | m_1, m_2, ..., m_k} independently and randomly drawn examples of a target set of concepts c_∪ = {c_1, c_2, ..., c_k}, then for any set ε_∪ = {0 ≤ ε_i ≤ 1 | ε_1, ε_2, ..., ε_k}, the probability that the set of version spaces VS_{H,D} = {VS_{H_1,D_1}, VS_{H_2,D_2}, ..., VS_{H_k,D_k}} is not ε_∪-exhausted (with respect to c_∪) is less than or equal to

|H_∪| e^{−ε_∪ m_∪} = { |H_1| e^{−ε_1 m_1} (with respect to c_1); |H_2| e^{−ε_2 m_2} (with respect to c_2); ...; |H_k| e^{−ε_k m_k} (with respect to c_k) }    (2)

The theorem can be proved easily; the proof is omitted here. From Theorem 1 we obtain

{ |H_∪| e^{−ε_∪ m_∪} ≤ δ_∪ } = { |H_1| e^{−ε_1 m_1} ≤ δ_∪ (with respect to c_1); |H_2| e^{−ε_2 m_2} ≤ δ_∪ (with respect to c_2); ...; |H_k| e^{−ε_k m_k} ≤ δ_∪ (with respect to c_k) }

and solving for m, we find

m_∪ ≥ (1/ε_∪)(ln |H_∪| + ln δ_∪^{−1})    (3)

= { m_1 ≥ (1/ε_1)(ln |H_1| + ln δ_1^{−1}) (with respect to c_1); m_2 ≥ (1/ε_2)(ln |H_2| + ln δ_2^{−1}) (with respect to c_2); ...; m_k ≥ (1/ε_k)(ln |H_k| + ln δ_k^{−1}) (with respect to c_k) }    (4)

It is not difficult to see that the inequalities in Eq. (4) give a set of bounds on the number of training examples that is sufficient for Definition 1. In the joint model ensemble, if the number of training examples for each base learner is at least the maximum of these bounds, the joint model ensemble is guaranteed to be PAC-learnable. This completes the proof of joint model ensemble learnability; the following sections discuss how many base learners to produce and how to combine them.
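To make Eq. (4) concrete, the bound can be evaluated numerically. The short sketch below only restates the arithmetic that Section 6.3 later carries out in Eq. (21), assuming the hypothesis-space sizes |H_i| and the values ε_i = δ_i = 0.5 used there.

```python
import math

def sample_bound(h_size, epsilon, delta):
    """Lower bound on m_i from Eq. (4): m_i >= (1/eps_i)(ln|H_i| + ln(1/delta_i))."""
    return (math.log(h_size) + math.log(1.0 / delta)) / epsilon

# Values used in Section 6.3 (Eq. (21)): eps_i = delta_i = 0.5 for all three tasks.
tasks = {"entity": 5, "coreference": 4, "relation": 5}   # |H_i|
for name, h in tasks.items():
    print(name, round(sample_bound(h, 0.5, 0.5), 1))
# entity 4.6, coreference 4.2, relation 4.6  ->  at least 5 examples per weak learner
```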



4. How to combine base learners

In this section, we focus on how to combine the base learners in the ensemble joint model. Peronne and Cooper (2012) and Zhou, Wu, and Tang (2002) showed that, in ensemble learning for a single task, more base learners are not necessarily better, and it is not difficult to extend this result to the ensemble joint model over multiple tasks. This paper therefore concentrates on finding a good method to combine base learners that are themselves multi-task joint models. Krogh and Vedelsby (1995) found that the base learners should be as diverse as possible for a good ensemble. Based on this research, we present a dynamic weighted combination method for the ensemble joint model. The following is a detailed description of our combination method, using ensemble Markov logic networks for knowledge extraction as the case study.

4.1. Dynamic weighted combination

In this paper, our task is to extract knowledge. Knowledge extraction involves three NLP tasks: entity tagging, coreference resolution and relation extraction, and our goal is to use a joint model to accomplish inference over the three tasks. The complexity of the joint inference model for the three tasks is much higher than that of a single task. Inspired by the computational benefit of ensemble learning, we adopt the ensemble learning method to address the computational complexity of the joint model. We use Markov logic networks as the joint inference model and present Ensemble Markov Logic Networks for extracting knowledge, so we need a combination method that can boost weak learners, each of which is a joint model based on MLNs, into a strong learner.

Weighted voting is the most popular combination method. In each learner's inference, the inputs are a set of evidence variables and the outputs are the most likely state of a set of query variables. As shown in Fig. 2(b), the individual learners run local search from many different starting points and therefore have unequal performance, which gives us the chance to obtain a stronger learner by weighted voting. In the experiments, we find that the probabilities of the predicted class labels differ across the three tasks even within the same learner; these probabilities signify each learner's confidence in the results of the query variables. For one task, a formulation of ensemble Markov logic learning is as follows. Let x denote an instance (taking classification as an example), h a learner hypothesis, and H a set of hypotheses. The weighted combination gives the combined output H(x) as

H(x) = Σ_{i=1}^{N} w_i h_i(x)    (5)

where w_i is the weight of h_i and N is the number of learners. In practice, the weights are often constrained by w_i ≥ 0 and Σ_{i=1}^{N} w_i = 1.

The following is a detailed description of the combination formulation for the three tasks. We view the three tasks as three classification problems and treat the combination as dynamic Bayesian inference. Let L = (l_1, l_2, ..., l_N) be the outputs of the individual learners, where l_i is the class label output by learner h_i, let t_j^e, t_j^c, t_j^r denote the class labels for the three tasks respectively,


and let p_i denote the probability of the class label t_j predicted by h_i. The Bayesian probability of the combined output is

P(t_j^e, t_j^c, t_j^r | L) = P(t_j^e, t_j^c, t_j^r) P(L | t_j^e, t_j^c, t_j^r)    (6)

In our work the learners are independent, i.e.

P(L | t_j) = Π_{i=1}^{N} P(l_i | t_j)    (7)

Using (7) to replace the second term on the right-hand side of (6), it is easy to see that

P(t_j^e, t_j^c, t_j^r | L) = P(t_j^e, t_j^c, t_j^r) Π_{i=1}^{N} P(l_i | t_j^e, t_j^c, t_j^r) ∝ log P(t_j^e, t_j^c, t_j^r) + log Π_{i=1}^{N} P(l_i | t_j^e, t_j^c, t_j^r)    (8)

The first term on the right-hand side of (8) does not depend on the individual learners, and the left-hand side of (8) is directly proportional to the right-hand side once the log function is taken. For entity tagging, the second term on the right-hand side of (8) can be reduced to

log Π_{i=1}^{N} P(l_i | t_j^e, t_j^c, t_j^r)
= log ( Π_{i=1, l_i=t_j^e}^{N} P(l_i | t_j^e, t_j^c, t_j^r) · Π_{i=1, l_i≠t_j^e}^{N} P(l_i | t_j^e, t_j^c, t_j^r) )
= log ( Π_{i=1, l_i=t_j^e}^{N} P(t_j^e, t_j^c, t_j^r | l_i) P(l_i) · Π_{i=1, l_i≠t_j^e}^{N} P(t_j^e, t_j^c, t_j^r | l_i) P(l_i) )
= log ( Π_{i=1, l_i=t_j^e}^{N} p_i · Π_{i=1, l_i≠t_j^e}^{N} (1 − p_i) ) + Σ_{i=1}^{N} log P(l_i)
= Σ_{i=1, l_i=t_j^e}^{N} log p_i + Σ_{i=1, l_i≠t_j^e}^{N} log(1 − p_i) + Σ_{i=1}^{N} log P(l_i)
= Σ_{i=1, l_i=t_j^e}^{N} log (p_i / (1 − p_i)) + Σ_{i=1}^{N} log(1 − p_i) + Σ_{i=1}^{N} log P(l_i)    (9)

(the second step applies Bayes' rule and drops the constant normalizers; p_i is the probability that learner h_i assigns to its predicted label). The left-hand side of (8) can be regarded as H_j(x), and the first term on the right-hand side of (9) can be expressed as

Σ_{i=1}^{N} h_i^j(x) log (p_i / (1 − p_i))    (10)

Since the second term on the right-hand side of (9) does not depend on the class label t_j, and the value of the third term is fixed, H_j(x) can be reduced to

H_j(x) ∝ Σ_{i=1}^{N} h_i^j(x) log (p_i / (1 − p_i))    (11)

Eq. (11) discloses that the optimal weights for weighted voting of a task satisfy, for entity tagging,

w_i^e ∝ p_i^e / (1 − p_i^e)    (12)

and for coreference resolution and relation extraction,

w_i^c ∝ p_i^c / (1 − p_i^c),  w_i^r ∝ p_i^r / (1 − p_i^r)    (13)
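A minimal sketch (not the authors' code) of the dynamic weighted combination implied by Eqs. (11)–(13): each weak learner reports, for a query atom, its predicted label and the probability it assigns to that label, and the ensemble sums log-odds weights per candidate label. The data structures and function names are illustrative assumptions.

```python
import math
from collections import defaultdict

def combine(predictions):
    """predictions: list of (label, p) pairs, one per weak learner, where p is the
    learner's probability for the label it predicts. Returns the label with the
    largest sum of dynamic weights w_i = log(p_i / (1 - p_i))."""
    scores = defaultdict(float)
    for label, p in predictions:
        p = min(max(p, 1e-6), 1 - 1e-6)           # guard against log(0)
        scores[label] += math.log(p / (1.0 - p))  # Eqs. (12)/(13): w_i ∝ p_i / (1 - p_i)
    return max(scores, key=scores.get)

# Example: three weak MLN learners voting on the entity type of one token.
print(combine([("PER", 0.9), ("ORG", 0.6), ("PER", 0.7)]))  # -> "PER"
```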

Because p_i is generated dynamically, the combination of the weak learners is a dynamic process.

5. Markov logic networks for knowledge extraction

In this paper we focus on the joint model for the knowledge extraction task. In the previous section we decomposed knowledge extraction into three tasks: entity recognition, coreference resolution and relation extraction. We therefore need to define a model that directly represents the dependencies among the three tasks by modeling their joint distribution. We employ Markov logic networks to jointly extract entities and relations, which are linked by cross-sentence coreference. Formally, the probability distribution for the three tasks over a set of output variables y conditioned on input variables x is

p(y^e, y^c, y^r | x) ∝ Π_{f_i ∈ F} exp ( Σ_{k=1}^{K_i} w_{ik} f_{ik}(x_i, y_i^e, y_i^c, y_i^r) )    (14)

where y^e signifies the set of entity recognition outputs, y^c the set of coreference resolution outputs, and y^r the set of relation extraction outputs.

As shown in Fig. 3, information flows bidirectionally among the three tasks of coreference resolution, relation extraction and entity recognition. In other words, the output of relation extraction can be used as input to entity recognition, and vice versa; the same holds between relation extraction and coreference resolution. Traditional methods handle these three tasks in a pipeline, meaning that the three tasks are executed sequentially and information flows in one direction only. Compared with the pipeline mode, our model has an obvious advantage: for example, correct coreference results can support entity recognition.

Fig. 3. Bi-directional information flow.

5.1. Feature selection

As in other models, feature selection is the first step of joint inference using Markov logic networks. Our model employs the features used in state-of-the-art systems; we go over these features and then propose a new feature to compensate for their inadequacies. Almost all systems use the following features:

Lemma. The lemma of the current word. A lemma is the canonical form of a set of words; for example, go, goes, went, going, and gone are forms of the same lexeme, with go as the lemma.

POS. The part-of-speech tag of the current word, i.e. the result of marking up a word in a text as corresponding to a particular part of speech; for example, words can be marked as nouns, verbs, adjectives, adverbs, etc.

Chunk Type. The type of a group of words (NP, PP, etc.), i.e. the result of marking up a group of words according to a set of rules; for example, the chunk types include NP, VP, PP, ADV, etc.

Capitalization Pattern. The capitalization pattern of the constituent being classified.

For relation extraction and coreference resolution, we use a new feature: the path from the constituent being classified to the root of the dependency parse tree. This is a very effective feature which encodes information about relative position, syntactic configuration and domain. In our model, we declare predicates to represent each of the features; Table 1 presents these predicates.

Table 1
Observable predicates.
Word(i, w): Token i has word w
Pos(i, t): Token i has part-of-speech t
Lemma(i, l): Token i has lemma l
ChunkType(m, ct): Chunk m has chunk type ct
Cp(i, ca): The capitalization pattern of i is ca
Dep(g, d, dt): The dependency relation between g and d is dt
DepPath(m, pa, v): The dependency relation path between m and the root is pa; v is the first predicate along the path
subStrMatch(i, j): Sub-word strings match between i and j

Table 2
Local formulae.
Lemma(i, +l) ∧ Pos(i, +t) ⇒ EntityType(i, +et)
Cp(i, +ca) ∧ Pos(i, +t) ⇒ EntityType(i, +et)
subStrMatch(i, j) ∧ DepPath(i, +pa, v) ∧ DepPath(j, +pa, v) ∧ i ≠ j ⇒ Coref(i, j)
Pos(i, +p) ∧ Pos(j, +p) ∧ DepPath(i, +pa, +v) ∧ DepPath(j, +pa, +v) ∧ i ≠ j ⇒ Relation(i, j, +r)
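To show what grounding the observable predicates of Table 1 can look like in practice, here is a small sketch using spaCy. This is an assumption for illustration only (the paper itself mentions the Stanford NLP tools), the capitalization pattern is approximated by the token shape, and the dependency path is simplified to the chain of relation labels up to the root.

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed
doc = nlp("Barack Obama was born in Honolulu.")

def dep_path_to_root(token):
    """Dependency-relation path from a token up to the root (a simplified DepPath)."""
    path = []
    while token.head is not token:
        path.append(token.dep_)
        token = token.head
    path.append(token.dep_)          # "ROOT"
    return "/".join(path)

# Ground (a subset of) the observable predicates of Table 1 for each token.
for i, tok in enumerate(doc):
    print(f"Word({i}, {tok.text!r})  Pos({i}, {tok.tag_!r})  "
          f"Lemma({i}, {tok.lemma_!r})  Cp({i}, {tok.shape_!r})  "
          f"DepPath({i}, {dep_path_to_root(tok)!r})")
```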

5.2. First-order logic formulae

The essence of a Markov logic network is a set of weighted first-order logic formulae. In general, the formulae are created by domain experts through observing data, or are learned from training data; in this paper we adopt manually defined formulae. We use two kinds of formulae for the three tasks. One kind is local formulae, which relate any number of observed predicates to exactly one hidden predicate (Riedel & Meza-Ruiz, 2008). The other kind is global formulae, which relate more than one hidden predicate. An observed predicate is one whose value is known from the observations; a hidden predicate is one whose value for free variables is not known from the observations. Obviously, the set of evidence (input) consists of observed ground atoms and the set of queries (output) consists of hidden ground atoms, so the global formulae are the key to bi-directional information flow in our work.

5.3. Local formulae

A formula that includes only one hidden ground atom is called a local formula. For example, a grounding of the local formula

Cp(i, +ca) ∧ Pos(i, +p) ⇒ EntityType(i, +et)    (15)

relates two observed ground atoms, Cp and Pos, to a hidden EntityType ground atom. Note that the symbol "+" for a variable indicates that there is a separate weight for each possible value pair (here, of capitalization pattern and part-of-speech tag). In our model, we define a list of observed predicates to describe the features, shown in Table 1. For our problem, we have three hidden predicates, i.e. EntityType, Coref and Relation. Most of our local formulae are listed in Table 2.

5.4. Global formulae

Global formulae relate more than one hidden ground atom. They are designed for two purposes: to add global constraints on the same hidden predicate, and to perform joint inference over the different hidden predicates. Because three tasks are involved in our work, there are three different hidden predicates, i.e. EntityType, Coref and Relation. We show the global constraints and the joint inference formulae next. An example of a global constraint is

EntityType(i, t_i) ∧ t_i ≠ t_j ⇒ ¬EntityType(i, t_j)    (16)

which states that each entity should be labeled with only one label. A constraint on the relation of two entities is

Relation(i, j, r) ⇒ ¬Relation(j, i, r)    (17)

Examples of the joint inference formulae are:

EntityType(i, +t_i) ∧ DepPath(j, +pa, +v) ∧ i ≠ j ⇒ Relation(i, j, +r)    (18)

Relation(i, j, +r) ⇒ EntityType(i, +t)    (19)

Relation(i, j, +r) ⇒ EntityType(j, +t)    (20)

We use the global formulae to implement the bi-directional information flow of Fig. 3.
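As a toy illustration of how weighted formulae of this kind enter the score of Eq. (14), the sketch below counts the true groundings of two simplified formulae in a tiny hand-built world; the predicates, weights and counting functions are hypothetical stand-ins, not the formulae or weights actually used by the model.

```python
# Toy evidence (observed ground atoms) over two tokens.
pos = {0: "NNP", 1: "NNP"}

# A candidate assignment to the hidden ground atoms (the query).
world = {
    ("EntityType", 0): "PER",
    ("EntityType", 1): "LOC",
    ("Relation", 0, 1): "place_of_birth",
}

# Two simplified weighted formulae (placeholder weights, not learned values).
def n_local(world):
    # Pos(i, NNP) => EntityType(i, +et): NNP tokens that received an entity type.
    return sum(1 for i, tag in pos.items()
               if tag == "NNP" and ("EntityType", i) in world)

def n_joint(world):
    # Relation(i, j, +r) => both arguments carry an EntityType atom.
    return sum(1 for atom in world
               if atom[0] == "Relation"
               and ("EntityType", atom[1]) in world
               and ("EntityType", atom[2]) in world)

weights = {n_local: 1.5, n_joint: 2.0}

# Unnormalized log-score of Eq. (14): sum over formulae of weight * (number of true groundings).
log_score = sum(w * n(world) for n, w in weights.items())
print("unnormalized log p(y | x) =", log_score)   # 1.5*2 + 2.0*1 = 5.0
```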

6. Experiments

6.1. Datasets and evaluation metric

We use the Relation Extraction Corpus released by Google. The corpus is about public figures on Wikipedia: nearly 10,000 examples of "place of birth", over 40,000 examples of "graduated from an institution", 3042 examples of "place of death", 2490 examples of "date of birth", and 1850 examples of "education degree". For our experiments, 500 examples are randomly sampled from each type in the corpus. Without gold-standard label boundaries, we use the well-known Stanford NLP group tools to label the datasets, and then apply a rule-based method to remove erroneous examples. The result is a labeled corpus for the three tasks which can be used to train and evaluate knowledge extraction systems; Table 3 illustrates these datasets.

Table 3
Illustration of datasets.
Dataset               | #Examples train | #Examples test | #Unique words | #Sentences
Place of birth        | 400             | 100            | 32739         | 1916
Graduated institution | 450             | 50             | 41310         | 3077
Date of birth         | 350             | 150            | 30054         | 1880
Education degree      | 450             | 50             | 42904         | 3373
Place of death        | 400             | 100            | 30894         | 1955

We use Precision (P), Recall (R) and F-measure (F1) to evaluate overall performance. For coreference resolution, the MUC metric is used, and we only consider coreference between named entities and pronouns.
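A minimal sketch of the sampling and splitting step described above, assuming the corpus has already been loaded as a list of examples per relation type (the loading code and file layout are assumptions); the per-relation counts follow Table 3.

```python
import random

random.seed(42)

# Train/test counts per relation type as in Table 3; corpus loading is assumed.
SPLITS = {
    "place_of_birth":        (400, 100),
    "graduated_institution": (450, 50),
    "date_of_birth":         (350, 150),
    "education_degree":      (450, 50),
    "place_of_death":        (400, 100),
}

def make_split(examples_by_relation):
    """Randomly sample 500 examples per relation and split them per Table 3."""
    train, test = [], []
    for rel, (n_train, n_test) in SPLITS.items():
        sample = random.sample(examples_by_relation[rel], n_train + n_test)
        train += sample[:n_train]
        test += sample[n_train:]
    return train, test
```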

6.2. Baseline system

We adopt the factor graph model (Kschischang, Frey, & Loeliger, 2001) as the pipelined approach to compare against our proposed method. The main reason for choosing factor graphs is to allow a fair comparison under the same feature selection. We train the pipeline systems using the features described in Section 5.1. For this pipeline, we first train an isolated system for entity recognition; we then use the output of the entity recognition task as input to the coreference task; finally, we use the combined output of the two tasks as input to the relation extraction system.
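The staged baseline can be summarized by the chaining sketch below; the three stage objects are placeholders for the factor-graph models, not real implementations.

```python
def run_pipeline(sentences, entity_model, coref_model, relation_model):
    """Staged baseline: each stage only sees the previous stage's output,
    so errors propagate forward and no information flows back."""
    entities = entity_model.predict(sentences)                        # stage 1
    corefs = coref_model.predict(sentences, entities)                 # stage 2: uses stage-1 output
    relations = relation_model.predict(sentences, entities, corefs)   # stage 3
    return entities, corefs, relations
```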

6.3. Results

All of the experiments were conducted on servers with the following specification: Intel Xeon CPU E5-2680, 2.70 GHz, 4 cores, RAM 32.0 GB. In the experiments, we use Eq. (4) to compute the number m of random training examples. Here

m_1 ≥ (1/ε_1)(ln |H_1| + ln δ_1^{−1}) (with respect to c_entity),  m_1 ≥ (1/0.5)(ln 5 + ln 0.5^{−1}) ≈ 4.6
m_2 ≥ (1/ε_2)(ln |H_2| + ln δ_2^{−1}) (with respect to c_coreference),  m_2 ≥ (1/0.5)(ln 4 + ln 0.5^{−1}) ≈ 4.2
m_3 ≥ (1/ε_3)(ln |H_3| + ln δ_3^{−1}) (with respect to c_relation),  m_3 ≥ (1/0.5)(ln 5 + ln 0.5^{−1}) ≈ 4.6    (21)

where m is the minimum number of random training examples that each learner needs. It follows that at least 5 training examples per learner are needed to guarantee Ensemble Markov Logic Networks learnability.

In our first experiment, we evaluated accuracy and running time on the development set, which serves as the training set for each weak learner (an MLN). Fig. 4 shows the average accuracy on the development set, and Fig. 5 shows the corresponding average time. Considering both aspects, we selected 50 examples as the partitioning size for the EMLNs.

Fig. 4. Average accuracy for different numbers of examples per weak learner.
Fig. 5. Average time for different numbers of examples per weak learner.

For our EMLNs, we separately compare the EMLNs with the pipeline system and with a joint model (MLNs) trained on 100 examples for each of the three tasks. The results are shown in Tables 4–6.

Table 4
Entity recognition: results for various models.
Models               | P    | R    | F
Pipeline             | 85.1 | 78.2 | 81.5
Joint model/MLNs-100 | 76.3 | 64.1 | 69.7
Joint model/EMLNs    | 88.4 | 82.1 | 85.1

Table 5
Coreference resolution: results for various models.
Models               | P    | R    | F
Pipeline             | 60.4 | 75.2 | 66.9
Joint model/MLNs-100 | 61.5 | 74.3 | 67.2
Joint model/EMLNs    | 78.2 | 81.1 | 79.6

Table 6
Relation extraction: results for various models.
Models               | P    | R    | F
Pipeline             | 54.0 | 58.3 | 56.0
Joint model/MLNs-100 | 70.3 | 63.5 | 66.7
Joint model/EMLNs    | 76.3 | 74.0 | 75.1
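Putting the pieces together, the EMLNs procedure evaluated here amounts to partitioning the training data into small subsets (50 examples in these experiments), training one joint MLN learner per subset, and combining their answers with the dynamic weights of Section 4.1. The outline below is a high-level sketch under those assumptions; `train_mln` and `infer` are placeholders for an MLN engine, and `combine` is the weighted-voting function from the earlier sketch.

```python
def train_emlns(examples, partition_size=50, train_mln=None):
    """Train one weak joint-MLN learner per partition of the data (divide and conquer)."""
    partitions = [examples[i:i + partition_size]
                  for i in range(0, len(examples), partition_size)]
    return [train_mln(part) for part in partitions]   # one weak learner per partition

def emlns_predict(learners, query, infer, combine):
    """Ask every weak learner for (label, probability) and combine by dynamic weighting."""
    votes = [infer(learner, query) for learner in learners]   # [(label, p), ...]
    return combine(votes)
```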

Compared with the staged pipeline, our method greatly improves accuracy on all three tasks. The main reason is that errors propagate along the pipeline, whereas the joint model avoids error accumulation; this is one of the greatest benefits of using a joint inference model for knowledge extraction. Compared with the joint model based on 100 training examples, the EMLNs achieve higher accuracy. Fig. 5 also shows that a single Markov logic network is intractable on large-scale data; by exploring an ensemble approach to MLNs we remove this limitation, which is one of the contributions of our work.

7. Conclusion

We leveraged an ensemble method for joint inference in knowledge extraction. Our contributions are as follows. We discussed the PAC framework for multiple tasks in the joint inference model and extended PAC learning to multi-concept classes. We presented an ensemble learning approach to joint inference on the three NLP sub-tasks, explained how to combine weak learners into a strong one, and presented the dynamic weighted combination method for the ensemble joint inference model. We used our Ensemble Markov Logic Networks (EMLNs) to experiment on a large-scale corpus; the experiments show that this approach leads to higher precision and recall than pipeline approaches. In future work, since our model has the potential to be parallelized, we will devote our efforts to parallel ensemble Markov logic networks.

Acknowledgment

The work is supported by the 973 Program (No. 2014CB340504), the State Key Program of the National Natural Science Foundation of China (No. 61533018), NSFC-ANR (No. 61261130588), the National Natural Science


Foundation of China (No. 61402220), the State Scholarship Fund of the CSC (No. 201608430240), the Philosophy and Social Science Foundation of Hunan Province (No. 16YBA323), the Tsinghua University Initiative Scientific Research Program (No. 20131089256), the Science and Technology Support Program (No. 2014BAK04B00), the China Postdoctoral Science Foundation (No. 2014M550733), the National Natural Science Foundation of China (No. 1309007), the THU-NUS NExT Co-Lab and the Scientific Research Fund of Hunan Provincial Education Department (No. 16C1378).

References

Ahmad, A., Halawani, S. M., & Albidewi, I. A. (2012). Novel ensemble methods for regression via classification problems. Expert Systems with Applications, 39(7), 6396–6401.
Beedkar, K., Del Corro, L., & Gemulla, R. (2013). Fully parallel inference in Markov logic networks. In BTW (pp. 205–224). Citeseer.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Che, W., & Liu, T. (2010). Jointly modeling WSD and SRL with Markov logic. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 161–169). Association for Computational Linguistics.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In Multiple Classifier Systems (pp. 1–15). Springer.
Domingos, P., & Lowd, D. (2009). Markov logic: An interface layer for artificial intelligence. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1), 1–155.
Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2), 177–221.
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems, 7(10), 231–238.
Krogh, P. S. A. (1996). Learning with ensembles: How over-fitting can be useful. In Proceedings of the 1995 Conference: 8 (p. 190).
Kschischang, F. R., Frey, B. J., & Loeliger, H. A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 498–519.
McCallum, A. (2009). Joint inference for natural language processing. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (pp. 1–1).
McCallum, A., & Jensen, D. (2003). A note on the unification of information extraction and data mining using conditional-probability, relational models.


Meza-Ruiz, I., & Riedel, S. (2009). Jointly identifying predicates, arguments and senses using Markov logic. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 155–163). Association for Computational Linguistics.
Mihalkova, L., & Richardson, M. (2009). Speeding up inference in statistical relational learning by clustering similar query literals. In ILP (pp. 110–122). Springer.
Mitchell, T. M. (1982). Generalization as search. Artificial Intelligence, 18(2), 203–226.
Mitchell, T. M. (1997). Machine learning. Computer Science Series. McGraw-Hill, Burr Ridge.
Niu, F., Ré, C., Doan, A., & Shavlik, J. (2011). Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. Proceedings of the VLDB Endowment, 4(6), 373–384.
Opitz, D., & Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 169–198.
Peronne, B. M. P., & Cooper, L. N. (2012). When networks disagree: Ensemble methods for neural networks. In Neural Networks for Speech and Image Processing.
Poon, H., & Domingos, P. (2007). Joint inference in information extraction. In AAAI: 7 (pp. 913–918).
Poon, H., & Domingos, P. (2009). Unsupervised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 (pp. 1–10). Association for Computational Linguistics.
Richardson, M., & Domingos, P. (2006). Markov logic networks. Machine Learning, 62(1–2), 107–136.
Riedel, S., & Meza-Ruiz, I. (2008). Collective semantic role labelling with Markov logic (pp. 193–197).
Roth, D., & Yih, W.-t. (2007). Global inference for entity and relation identification via a linear programming formulation. In Introduction to Statistical Relational Learning (pp. 553–580).
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Shavlik, J. W., & Natarajan, S. (2009). Speeding up inference in Markov logic networks by preprocessing to reduce the size of the resulting grounded network. In IJCAI: 9 (pp. 1951–1956).
Zhang, J., Wan, J., Li, F., Mao, J., Zhuang, L., Yuan, J., . . . Yu, Z. (2015). Efficient sparse matrix-vector multiplication using cache oblivious extension quadtree storage format. Future Generation Computer Systems, 54(C), 490–500.
Zhou, Z.-H. (2012). Ensemble methods: Foundations and algorithms. CRC Press.
Zhou, Z. H., Wu, J., & Tang, W. (2002). Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137, 239–263.