A neural network-based joint learning approach for biomedical entity and relation extraction from biomedical literature


Journal Pre-proofs

Ling Luo, Zhihao Yang, Mingyu Cao, Lei Wang, Yin Zhang, Hongfei Lin

PII: S1532-0464(20)30011-3
DOI: https://doi.org/10.1016/j.jbi.2020.103384
Reference: YJBIN 103384

To appear in: Journal of Biomedical Informatics

Received Date: 9 August 2019
Revised Date: 19 November 2019
Accepted Date: 3 February 2020

Please cite this article as: Luo, L., Yang, Z., Cao, M., Wang, L., Zhang, Y., Lin, H., A neural network-based joint learning approach for biomedical entity and relation extraction from biomedical literature, Journal of Biomedical Informatics (2020), doi: https://doi.org/10.1016/j.jbi.2020.103384

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Published by Elsevier Inc.

A neural network-based joint learning approach for biomedical entity and relation extraction from biomedical literature

Ling Luo1, Zhihao Yang1,*, Mingyu Cao1, Lei Wang2,*, Yin Zhang2, Hongfei Lin1

1 College of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
2 Beijing Institute of Health Administration and Medical Information, Beijing, 100850, China
* Corresponding authors: [email protected] or [email protected]

Abstract

Recently, joint modeling methods of entity and relation have exhibited more promising results than traditional pipelined methods in the general domain. However, they are inappropriate for the biomedical domain due to the numerous overlapping relations in biomedical text. To alleviate the problem, we propose a neural network-based joint learning approach for biomedical entity and relation extraction. In this approach, a novel tagging scheme that takes into account overlapping relations is proposed. Then the Att-BiLSTM-CRF model is built to jointly extract the entities and their relations with our extraction rules. Moreover, the contextualized ELMo representations pre-trained on biomedical text are used to further improve the performance. Experimental results on biomedical corpora show that our method can significantly improve the performance of overlapping relation extraction and achieves the state-of-the-art performance.

Keywords: Joint learning, Biomedical entity relation extraction, Att-BiLSTM-CRF, Biomedical ELMo

1 Introduction

Exponentially growing biomedical literature contains a wealth of useful biomedical information and has become an important data source for biomedical research. Therefore, automatic extraction of entities and their relations from the biomedical literature has received much attention. Recently, various related tasks have been proposed, such as protein-protein interaction (PPI) extraction [1], drug-drug interaction (DDI) extraction [2], and chemical-protein interaction (CPI) extraction [3].

Take the DDI task for example. As shown in Figure 1, the objective of this task is to recognize the mentions of drug entities and extract possible DDI relations between them. Different from open information extraction [4], the entity and relation types of the tasks studied in this work are from predefined sets.

Figure 1: An example of the DDI task. “mechanism” is a relation in the predefined relation set. The texts in different color denote the entities of different types. Here, “pseudoephedrine” involves the overlapping relations since the entity belongs to two relations.

Traditional methods (pipelined methods) for biomedical relation extraction (RE) partition the extraction process into two subtasks and address them incrementally. First, biomedical entity mentions in a given text are recognized using the technologies of named entity recognition (NER) [5]. Then each entity pair is classified into the task-specific relations (i.e., relation classification, RC) [6]. This separated framework makes the task easy to deal with, and each component can be more flexible. However, it neglects the relevance between these two subtasks and the fact that the results of NER may affect the performance of RC, which leads to error propagation without any feedback.

Recent studies show that joint modeling of entity and relation exhibits promising results in the non-biomedical domain (such as the news domain). However, most existing joint modeling methods are feature-based structured systems [7-9], i.e., they still need complicated feature engineering and heavily rely on other NLP toolkits. In order to avoid the feature engineering, some neural network-based methods were further proposed for the entity and relation joint extraction [10, 11]. For the biomedical domain, the neural joint model was also explored [12]. Although these joint models adopt shared parameters in a single model to encode both entity and relation representations, they are still incremental models and extract the entities and relations with different decoders separately. This leads to a drawback that information between output entities and relations cannot be fully exploited. More recently, Zheng et al. [13] proposed a novel tagging scheme to convert the joint extraction task to a tagging problem. In their joint model, information of entities and relations is integrated into a unified tagging scheme and can be fully exploited by a biased LSTM-based model. However, the method only considers the situation where an entity belongs to at most one relation, and is incapable of identifying the overlapping relations. Therefore, it is inappropriate for the biomedical domain since there are a lot of overlapping relations in biomedical texts (for example, as will be introduced in section 3.1, about 60% and 80% of the relations in the DDI and CPI datasets, respectively, are overlapping ones).

To alleviate the problem, in this paper, we propose a neural network-based joint learning approach to the joint extraction of biomedical entities and relations. First, inspired by the method proposed by Zheng et al. [13], we convert the joint extraction task to the tagging problem, in which the novel tagging scheme and extraction rules are proposed to extract the overlapping relations in biomedical texts. Then, the Att-BiLSTM-CRF model is developed to extract the entities and their relations. The main contributions of our work can be summarized as follows:

(1) We transform the joint extraction of biomedical entities and relations into a tagging task. To extract the overlapping relations, we propose the novel tagging scheme and extraction rules.

(2) We develop an attention-based BiLSTM-CRF (Att-BiLSTM-CRF) model to extract the entities and their relations. It can enhance the long distance dependence relations between related entities and focus on the important words for the model’s predictions.

(3) We also explore the effectiveness of the ELMo embedding [14] for our joint model. The corresponding experiments were conducted to compare the ELMo embeddings of different domains and different dimensionalities. The experimental results show that the in-domain ELMo embedding is the most effective and it can further improve the performance.

We conducted the experiments on the DDI and CPI datasets and the results show that our joint method outperforms most of the existing pipelined methods and achieves the state-of-the-art performance.

2 Methods

In this section, our joint method is described, which contains two phases (i.e., training phase and test phase), as shown in Figure 2. At the training phase, the joint extraction of biomedical entities and relations is first transformed into a tagging task using our tagging scheme. The traditional word2vec embedding and the contextualized ELMo embedding are learned as the pretrained embeddings. Then the Att-BiLSTM-CRF model is trained with the feature embeddings. At the test phase, the test text is first tagged using our model. Then biomedical entities and relations are extracted from the tag sequence with our extraction rules. The process is described in detail in the following sections.

Figure 2: The processing flowchart of our method.

2.1 The Tagging Scheme

Inspired by the method of Zheng et al. [13], we treat the joint extraction of biomedical entities and relations as a sequence labeling task which predicts the label for each token. In their joint method, information of entities and relations is integrated into a unified tagging scheme and can be fully exploited. But their tagging scheme is incapable of identifying overlapping relations. However, there are abundant overlapping relations in the biomedical literature and, therefore, ignoring these relations will lose a wealth of useful biomedical information. To alleviate the problem, we add new tags to extract the overlapping relations in our tagging scheme.

Figure 3 shows an example of how to tag a sentence with our tagging scheme according to the original gold standard annotations of the DDI dataset. Each token is assigned a label that contributes to extracting the results. Since not all tokens are part of an entity and not all entities participate in a relation, the tokens can be divided into three types: I) involving neither entities nor relations; II) only involving entities; III) involving both entities and relations. Concretely, for the type I token, tag “O” represents the “Other” tag, which means that the corresponding token is independent of the extracted results. In addition to “O”, the tag of the type II token consists of two parts (i.e., the entity boundary label and the entity type label), and the tag of the type III token consists of four parts (i.e., the entity boundary label, the entity type label, the relation type label, and the entity role label). For the entity boundary label, the “BIES” (Begin, Inside, End, Single) tagging scheme is used to represent the position information of the token in the entity. The entity type label and relation type label are predefined according to the datasets. For example, in the DDI task, the entity type label is obtained from the predefined set {“drug”, “group”, “brand” and “drug_n”}, and the relation type label is obtained from the predefined set {“mechanism”, “effect”, “advise” and “int”}. Moreover, for the relation type label, we add a tag “M” to represent the “Multiple” tag that denotes that the entity simultaneously participates in different types of relations. The entity role label represents the role of the entity in the relation, defined by “1”, “2” and “m”. Here, “1” denotes the token of the first entity in the relation; “2” denotes the token of the second entity; and “m” denotes the token of an entity that takes different roles in an overlapping relation. Thus, the total number of tags is $N = 3 \times 4 \times |E| \times (|R| + 1) + 4 \times |E| + 1$, where $|E|$ is the size of the predefined entity type set and $|R|$ is the size of the predefined relation type set. Finally, the entity and relation extraction results can be represented by (Entity, EntityType) and (Entity1, RelationType, Entity2).

Figure 3: An example sentence annotated based on our tagging scheme, where “ME” and “AD” are the abbreviations of the relation types “mechanism” and “advise”, respectively.

As shown in Figure 3, the input sentence contains three entity tuples (i.e., (Cholestyramine, drug), (raloxifene, drug) and (EVISTA, brand)) and two entity-relation triples (i.e., {Cholestyramine, mechanism, raloxifene} and {Cholestyramine, advise, EVISTA}). Each token is labeled according to its entity information and relation information. Concretely, the tokens “Cholestyramine” and “raloxifene” are single entities with “drug” type, and “EVISTA” is a single entity with “brand” type. The entity “Cholestyramine” participates in both relations “mechanism” and “advise” (an overlapping relation), so its label is “S-drug-M-1”. The entities “raloxifene” and “EVISTA” are the second entities in the relations “mechanism” and “advise”, respectively. Therefore, they are labeled as “S-drug-ME-2” and “S-brand-AD-2”, respectively. Besides, the other tokens irrelevant to the entity and relation results are labeled as “O”.

2.2 The Extraction Rules

We propose the following four extraction rules which take into account overlapping relations to extract entities and relations from the tag sequence.

(1) The entity is extracted according to the entity boundary label and the entity type label of tokens. If the entity types of the tokens are inconsistent in an entity, the entity type label of the first token in the entity is used as the entity type of the entity. The relation type label and the entity role label are processed with the same rule.

(2) The relation extraction follows the nearest principle. Each entity should find the closest entity which can be matched to form a relation triple.

(3) For the relation type label, an entity can only match another entity with the same relation type, except that an entity with the relation type “M” can match an entity with any relation type. For the entity role label, an entity can only match another entity with a different entity role label.

(4) It is directional when an entity matches another entity. Entities with entity role "1" only look forward to find the next entity (from left to right); entities with entity role "2" only look backwards to find the previous entity (from right to left); entities with entity role "m" look both forward and backward.

As shown by the example in Figure 3, firstly, three entities (Cholestyramine, drug), (raloxifene, drug) and (EVISTA, brand) are extracted by rule 1. Secondly, the entity “Cholestyramine” first finds the entity “raloxifene” backwards by rule 2 and 4. Since the relation types of the two entities are “M” and “ME” respectively, and the entity roles are “1” and “2”, they can be combined to a triple {Cholestyramine, mechanism, raloxifene} by rule 1 and 3. Then, the entity “raloxifene” looks forward to the entity “Cholestyramine” (this triple has been found in the previous stage). Finally, the entity “EVISTA” looks forward to find the entity “Cholestyramine”. The relation types of the two entities are “M” and “AD” respectively, and the entity positions are “1” and “2”, therefore constituting a triple {Cholestyramine, advise, EVISTA}.

Although the problem of the overlapping relation extraction can be alleviated with our method, there are still a few more complex overlapping relations which cannot be extracted (see Supplementary Material A.1 for more details).

2.3 Features

Our method uses word and character embeddings as basic features. In addition, the effects of contextualized ELMo embeddings for our model are investigated.

Word Embedding. Word embedding, also known as distributed word representation, can capture both the semantic and syntactic information of words from a large unlabeled corpus and has attracted considerable attention from many researchers [15]. To achieve a high-quality word embedding, we downloaded all MEDLINE abstracts from the PubMed website (footnote 1). Then these abstracts were used to train the word embedding by the word2vec tool (footnote 2) using the skip-gram model [16] as the pretrained word embedding.

Footnote 1: https://www.ncbi.nlm.nih.gov/pubmed
Footnote 2: code.google.com/p/word2vec

Character Feature. Character features in a word usually contain rich structure information of the word. Moreover, they can alleviate the out-of-vocabulary problem [17]. Therefore, a convolutional neural network (CNN) is used to obtain the character features. Firstly, a character lookup table initialized at random contains an embedding for every character. Then, the character embeddings corresponding to every character in a word are inputted into the convolutional layer. Afterwards, a max pooling layer and an average pooling layer are used to extract global features from the convolutional layer. Finally, these two global features are concatenated together to represent the character feature of the word.

Contextualized ELMo Embedding. The above-mentioned word2vec method only allows a single context-independent representation for each word.
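Looking back at Sections 2.1 and 2.2, the tag-set size formula and the four extraction rules can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `tag_set_size`, `Entity`, `parse_entities`, `compatible` and `extract_triples` are hypothetical, the toy token sequence only reuses the three tags quoted for Figure 3 (the filler words and the token order are assumed), and a pair in which both entities carry the “M” relation label is simply skipped because the rules as stated do not determine its relation type.

```python
from typing import List, NamedTuple, Optional, Set, Tuple

# Abbreviations quoted in the Figure 3 caption; other labels are passed through unchanged.
RELATION_NAMES = {"ME": "mechanism", "AD": "advise"}


def tag_set_size(num_entity_types: int, num_relation_types: int) -> int:
    """N = 3*4*|E|*(|R|+1) + 4*|E| + 1 (Section 2.1)."""
    type3 = 3 * 4 * num_entity_types * (num_relation_types + 1)  # roles x BIES x types x (relations + "M")
    type2 = 4 * num_entity_types                                  # BIES x types
    return type3 + type2 + 1                                      # plus the "O" tag


class Entity(NamedTuple):
    text: str
    etype: str
    rel: Optional[str]   # relation label, "M", or None for an entity outside any relation
    role: Optional[str]  # "1", "2", "m", or None


def parse_entities(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Rule 1: group tokens by BIES boundary; the first token's labels decide type, relation and role."""
    entities: List[Entity] = []
    words: List[str] = []
    labels: Optional[List[str]] = None

    def flush() -> None:
        nonlocal words, labels
        if words and labels is not None:
            rel, role = (labels[2], labels[3]) if len(labels) == 4 else (None, None)
            entities.append(Entity(" ".join(words), labels[1], rel, role))
        words, labels = [], None

    for token, tag in zip(tokens, tags):
        parts = tag.split("-")
        if parts[0] in ("O", "B", "S"):
            flush()                      # any open entity ends here
        if parts[0] != "O":
            if labels is None:
                labels = parts           # labels of the first token win
            words.append(token)
            if parts[0] in ("E", "S"):
                flush()
    flush()
    return entities


def compatible(a: Entity, b: Entity) -> bool:
    """Rule 3: same relation type (or exactly one "M") and different roles."""
    if a.rel is None or b.rel is None or a.role == b.role:
        return False
    if a.rel == "M" and b.rel == "M":
        return False                     # assumption of this sketch: relation type undetermined
    return a.rel == b.rel or "M" in (a.rel, b.rel)


def extract_triples(entities: List[Entity]) -> Set[Tuple[str, str, str]]:
    """Rules 2 and 4: each entity takes the nearest compatible entity in its allowed direction(s)."""
    directions = {"1": (1,), "2": (-1,), "m": (1, -1)}
    triples: Set[Tuple[str, str, str]] = set()
    for i, ent in enumerate(entities):
        if ent.rel is None:
            continue
        for step in directions[ent.role]:
            j = i + step
            while 0 <= j < len(entities):
                other = entities[j]
                if compatible(ent, other):
                    rel = other.rel if ent.rel == "M" else ent.rel
                    ent_is_first = ent.role == "1" or other.role == "2"
                    first, second = (ent, other) if ent_is_first else (other, ent)
                    triples.add((first.text, RELATION_NAMES.get(rel, rel), second.text))
                    break                # nearest principle: stop at the first match
                j += step
    return triples


# Toy sequence (assumed wording and order) carrying the three tags quoted for Figure 3.
tokens = ["Cholestyramine", "reduces", "raloxifene", ";", "avoid", "EVISTA"]
tags = ["S-drug-M-1", "O", "S-drug-ME-2", "O", "O", "S-brand-AD-2"]
entities = parse_entities(tokens, tags)
triples = extract_triples(entities)
print(tag_set_size(4, 4), sorted(triples))
```

With the four DDI entity types and four DDI relation types listed in Section 2.1, `tag_set_size(4, 4)` evaluates the formula for the DDI tag set, and the toy sequence yields the two triples given for Figure 3.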

However, a word can have completely different semantics in different contexts. To overcome this shortcoming, ELMo [14] was recently proposed for learning high-quality deep context-dependent representations from a bidirectional language model. It has shown large improvements in a broad range of NLP tasks. Although several pretrained ELMo models have been released, most of them are trained on general domain text (i.e., the Billion Word Benchmark) [14]. Different from general domain text, biomedical domain texts contain a considerable number of domain specific terms and language structures. Therefore, we pretrained the 256- and 512-dimensional biomedical ELMo models with default configurations on the MEDLINE abstracts from the PubMed website (approximately 2.2B tokens). At the same time, we found that a 1024-dimensional biomedical ELMo model was also released most recently on the ELMo official website (footnote 3).

Footnote 3: https://allennlp.org/elmo

2.4 Att-BiLSTM-CRF Model

In recent years, several neural network-based models have been proposed and widely used in the sequence labeling task [18-20]. Among others, the model of bidirectional Long Short-Term Memory with a conditional random field layer (BiLSTM-CRF) exhibits promising results. Furthermore, our previous work [21] shows that an attention mechanism can alleviate the biased problem of the BiLSTM-CRF model on long sentences and improve the performance. So a similar Att-BiLSTM-CRF model is developed to extract the entity relation. Our previous attention mechanism [21] focuses on the related tokens in the different sentences of a document to address the tagging inconsistency problem for NER. Different from it, the objective of the new attention mechanism in this work is to highlight the important hidden states generated by the BiLSTM layer for the model’s predictions.

Figure 4: The architecture of our Att-BiLSTM-CRF model.

The architecture of our Att-BiLSTM-CRF model is illustrated in Figure 4. Firstly, all features are concatenated as input in the embedding layer. Secondly, two successive BiLSTM layers are used. In the BiLSTM layer, a forward LSTM computes a representation of the sequence from left to right at every word t, and a backward LSTM computes a representation of the same sequence in reverse. These two distinct networks use different parameters, and then the representation of a word $h_t$ is obtained by concatenating its left and right context representations.

Then an attention layer is used to consider all the hidden states generated by the BiLSTM layer to be important when they are used as features for the next layer. In the attention layer, we introduce an attention matrix A to calculate the similarity between the current target hidden state and all hidden states. The attention weight value $\alpha_{t,j}$ in the attention matrix captures the similarity between the t-th current target hidden state $h_t$ and the j-th hidden state $h_j$:

$$\alpha_{t,j} = \frac{\exp(\mathrm{score}(h_t, h_j))}{\sum_{k=1}^{L} \exp(\mathrm{score}(h_t, h_k))} \tag{1}$$

$$\mathrm{score}(h_t, h_j) = W_a \tanh(W_b h_t + W_{b'} h_j + b_a) \tag{2}$$

Then a global vector $g_t$ is computed as a weighted sum of each BiLSTM output $h_j$:

$$g_t = \sum_{j=1}^{L} \alpha_{t,j} h_j \tag{3}$$

Next, the global vector and the BiLSTM output of the target word are concatenated as a vector $[g_t, h_t]$ to be fed to a tanh function to produce the output of the attention layer:

$$z_t = \tanh(W_g [g_t, h_t]) \tag{4}$$

Next, a tanh layer on top of the attention layer is used to predict confidence scores for the word having each of the possible labels as the output scores of the network:

$$e_t = \tanh(W_e h_t + b_e) \tag{5}$$

where the weight matrix set $\{W_a, W_b, W_{b'}, W_g, W_e\}$ and the bias vector set $\{b_a, b_e\}$ are the parameters of the model, and L is the length of the sentence.

Then, instead of modeling tagging decisions independently, the CRF layer is added to decode the best tag path in all possible tag paths. The final score of the sentence X along with a sequence of predictions y is then given by the sum of transition scores and network scores:

$$s(X, y) = \sum_{i=1}^{L} \left( T_{y_{i-1}, y_i} + P_{i, y_i} \right) \tag{6}$$

where $P_{i,j}$ is the network score of the j-th tag of the i-th word in the sentence, and $T_{i,j}$ is the score of transition from tag i to tag j in successive words.

Finally, a softmax function is used to yield the conditional probability of the path y by normalizing the above score over all possible tag paths $\tilde{y}$. During the training phase, the objective of the model is to maximize the log-probability of the correct tag sequence. At inference time, the best tag path that obtains the maximum score is predicted by:

$$y^{*} = \arg\max_{\tilde{y}} s(X, \tilde{y}) \tag{7}$$

3 Results

3.1 Experimental Datasets and Settings

In our experiments, two public biomedical datasets were used to evaluate the performance: (1) the DDI extraction 2013 dataset (DDI) [22] and (2) the BioCreative CHEMPROT dataset (CPI) [3]. See Supplementary Material A.2 for more dataset details. In addition, it should be noted that about 60% and 80% of the relations in the DDI and CPI datasets, respectively, are overlapping relations. For the CPI dataset, like many teams in the challenge, the training set and development set are combined to use as the training set. Then for the two datasets we randomly selected 10% of the corresponding training sets as the validation sets to tune the hyper-parameters. After the hyper-parameters were chosen, the models were evaluated on both test sets. See Supplementary Material A.3 for the hyper-parameter details.
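The data preparation described above (merging the CPI training and development sets, then holding out a random 10% for validation) can be sketched as follows. This is only an illustration under assumptions: the function name `hold_out_validation`, the seed, and the placeholder example identifiers are hypothetical, and the paper does not state how its random selection was seeded or implemented.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def hold_out_validation(
    train: Sequence[T],
    dev: Optional[Sequence[T]] = None,
    fraction: float = 0.1,
    seed: int = 13,  # arbitrary seed, chosen here only for reproducibility
) -> Tuple[List[T], List[T]]:
    """Merge train (and dev, if given), shuffle, and split off `fraction` as validation data."""
    pool = list(train) + list(dev or [])
    rng = random.Random(seed)
    rng.shuffle(pool)
    n_val = round(len(pool) * fraction)
    return pool[n_val:], pool[:n_val]


# Placeholder sentence identifiers standing in for the real corpora.
cpi_train = [f"train-{i}" for i in range(90)]
cpi_dev = [f"dev-{i}" for i in range(10)]
new_train, validation = hold_out_validation(cpi_train, cpi_dev)
print(len(new_train), len(validation))
```

For the DDI dataset the same helper would be called without the `dev` argument, since only the CPI development set is merged into training.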

Micro-averaged Precision (P), Recall (R) and F1-score (F1), which were used in the DDI and CPI tasks [2, 3], are also used by us to evaluate the prediction results. For the entity recognition, an entity is regarded as correct when its left and right boundaries and entity type are all correct. For the relation extraction, a predicted relation is considered to be correct when the left and right boundaries of the corresponding entities and the relation type are all correct. Our statistical significance results are based on the Approximate Randomization test [23].

3.2 The Effect of our Tagging Scheme and Extraction Rules

To explore the effectiveness of our tagging scheme and extraction rules, the results of several comparisons are provided in Table 1. Nearest Rule: the tagging scheme in which the “M” tag of the relation type label and the “m” tag of the entity role label are not used (thus only the last relation of the overlapping relation is tagged), and the relation extraction only follows the nearest principle as used in the method of Zheng et al. [13]. Our Rule: the same tagging scheme as Nearest Rule, but the relation extraction uses our extraction rules described in section 2.2. Our Rule+M: our tagging scheme described in section 2.1 is used and the relation extraction follows our extraction rules. In the experiment, all models are the Att-BiLSTM-CRF models without the ELMo embeddings.

Table 1: The effect of our tagging scheme and extraction rules. “R1” and “R2” denote the recall of non-overlapping and overlapping relations, respectively.

| Method | DDI F1 | DDI R1 | DDI R2 | CPI F1 | CPI R1 | CPI R2 |
|---|---|---|---|---|---|---|
| Nearest Rule | 0.481 | 0.706 | 0.179 | 0.277 | 0.338 | 0.150 |
| Our Rule | 0.685 | 0.699 | 0.604 | 0.421 | 0.409 | 0.337 |
| Our Rule+M | 0.702 | 0.734 | 0.668 | 0.494 | 0.504 | 0.425 |

As shown in Table 1, the model with our rules and tagging scheme achieves the best performances on both datasets. When only our extraction rules are used, the recall of overlapping relations is significantly improved. Furthermore, when the “M” tag of the relation type label and the “m” tag of the entity role label are introduced into the tagging scheme, the performance of the model can achieve further improvements. The experimental results show that our method is effective for the overlapping relation extraction from biomedical text.

3.3 The Effect of the ELMo Embedding

Recently, the ELMo model has exhibited promising results for some NLP tasks [14]. To explore the effectiveness of the ELMo embedding for our model, the corresponding experiments were conducted to compare the ELMo embeddings of different domains and different dimensionalities. The results of several comparisons are provided in Table 2.

Table 2: The effect of the contextualized ELMo embedding. “256”, “512” and “1024” denote the dimensionalities of ELMo embeddings. “Gen” denotes the ELMo pretrained on the general domain text, and “Bio” denotes the ELMo pretrained on the biomedical text.

| Method | DDI NER-F1 | DDI RE-F1 | CPI NER-F1 | CPI RE-F1 |
|---|---|---|---|---|
| Att-BiLSTM-CRF | 0.911 | 0.702 | 0.773 | 0.494 |
| +ELMo-256(Gen) | 0.908 | 0.709 | 0.781 | 0.509 |
| +ELMo-512(Gen) | 0.912 | 0.718 | 0.782 | 0.513 |
| +ELMo-1024(Gen) | 0.915 | 0.737 | 0.786 | 0.517 |
| +ELMo-256(Bio) | 0.913 | 0.730 | 0.787 | 0.522 |
| +ELMo-512(Bio) | 0.917 | 0.736 | 0.796 | 0.538 |
| +ELMo-1024(Bio) | 0.922 | 0.751 | 0.811 | 0.551 |

The results show that even better performances are achieved on both datasets when these ELMo embeddings are added into the model. Coinciding with previous works [14], increasing the dimensionality of the ELMo embeddings can improve the model performance. Moreover, the in-domain biomedical ELMo embeddings can achieve larger improvements than the general domain ones. When the 1024-dimensional ELMo embeddings are added to our model, the best performances are achieved on both datasets (an average improvement of 5.3% in F-score over the model without ELMo embeddings). In addition, we also explored the different locations of applying ELMo embeddings, and found that adding them at the input layer of our model is the best choice. See Supplementary Material A.4 for details.

3.4 Performance Comparison

To further demonstrate the effectiveness of our method, it is compared with several state-of-the-art entity relation extraction methods. These methods can be divided into three categories: the relation extraction methods using golden standard entities, the pipelined methods, and the joint extraction methods.

First, the results of the state-of-the-art relation extraction methods using golden standard entities are shown. (1) BiLSTM: a bidirectional LSTM model only with the inputs of word embedding and entity position embedding was built to extract relations as our baseline. (2) SCNN [24] is a syntax convolutional neural network-based method for DDI extraction. (3) HRNN [25] is a hierarchical recurrent neural network-based method for DDI extraction, which integrates the shortest dependency path and sentence sequence. (4) Peng et al. [26] built an ensemble of SVM, CNN and RNN using majority voting for CPI extraction and achieved the highest performance during the challenge. (5) HieRCNN [27] is a hierarchical recurrent convolutional neural network-based method for CPI extraction.

For the pipelined methods, the NER results are first obtained and then the relations between the entity mentions are classified. (1) BiLSTM-CRF+BiLSTM: the NER results are obtained with the BiLSTM-CRF model (i.e., the Att-BiLSTM-CRF model described in section 2.4 without the attention layer), and then the above-mentioned BiLSTM model is used to extract relations. (2) BiLSTM-CRF+HRNN: the same BiLSTM-CRF model is used for NER, and then the HRNN model [25] is used to extract DDI relations. (3) BiLSTM-CRF+HieRCNN: the same BiLSTM-CRF model is used for NER, and then the HieRCNN model [27] is used to extract CPI relations.

For the joint extraction methods, the state-of-the-art joint methods in the news domain were compared with our method. (1) LSTM-LSTM-Bias: an end-to-end method based on a novel tagging scheme proposed by Zheng et al. [13]. We rebuilt their model on the DDI and CPI datasets. (2) Graph Tagging: an overlapping entity extraction method proposed by Wang et al. [28]. It is a transition-based approach which converts the joint task into a directed graph by designing a novel graph scheme. We retrained their model on the DDI and CPI corpora using their official code (footnote 4). To make the comparison fair, the same biomedical 100-dimensional word2vec embedding was also used for Graph Tagging and its hyper-parameters were tuned on the validation set as for our model. In addition, the BiLSTM-CRF model with our tagging scheme and extraction rules is used as a baseline for comparison.

Footnote 4: https://github.com/hitwsl/joint-entity-relation

Table 3: Performance comparison with the other existing methods. The first part is the relation extraction methods on golden standard entities, the second part is the pipelined methods, and the third part is the joint methods. * and ** denote the statistically significant differences against the best results of our method (i.e., Att-BiLSTM-CRF+ELMo) at p < 0.05 and p < 0.01, respectively. Note: the NER results of the LSTM-LSTM-Bias method are not shown since the method focuses only on the entities of the relation and does not detect the entity types. A “-” marks a result that is not reported.

| Method | DDI NER P | DDI NER R | DDI NER F1 | DDI RE P | DDI RE R | DDI RE F1 | CPI NER P | CPI NER R | CPI NER F1 | CPI RE P | CPI RE R | CPI RE F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BiLSTM (golden) | - | - | - | 0.684 | 0.665 | 0.674 | - | - | - | 0.572 | 0.573 | 0.572 |
| SCNN (golden) [24] | - | - | - | 0.725 | 0.651 | 0.686 | - | - | - | - | - | - |
| HRNN (golden) [25] | - | - | - | 0.741 | 0.718 | 0.729 | - | - | - | - | - | - |
| Peng et al. (golden) [26] | - | - | - | - | - | - | - | - | - | 0.727 | 0.574 | 0.641 |
| HieRCNN (golden) [27] | - | - | - | - | - | - | - | - | - | 0.632 | 0.652 | 0.642 |
| BiLSTM-CRF+BiLSTM | 0.932 | 0.861 | 0.895** | 0.648 | 0.630 | 0.639** | 0.762 | 0.733 | 0.747** | 0.492 | 0.449 | 0.469** |
| BiLSTM-CRF+HRNN | 0.932 | 0.861 | 0.895** | 0.692 | 0.707 | 0.692** | - | - | - | - | - | - |
| BiLSTM-CRF+HieRCNN | - | - | - | - | - | - | 0.762 | 0.733 | 0.747** | 0.486 | 0.515 | 0.500** |
| LSTM-LSTM-Bias | - | - | - | 0.644 | 0.388 | 0.484** | - | - | - | 0.417 | 0.211 | 0.280** |
| Graph Tagging | 0.817 | 0.813 | 0.815** | 0.587 | 0.571 | 0.579** | 0.742 | 0.706 | 0.724** | 0.415 | 0.348 | 0.379** |
| BiLSTM-CRF | 0.894 | 0.918 | 0.906* | 0.684 | 0.669 | 0.677** | 0.777 | 0.754 | 0.765** | 0.545 | 0.430 | 0.480** |
| BiLSTM-CRF+ELMo | 0.903 | 0.933 | 0.918 | 0.737 | 0.740 | 0.738* | 0.823 | 0.790 | 0.806 | 0.585 | 0.495 | 0.536* |
| Att-BiLSTM-CRF | 0.901 | 0.921 | 0.911 | 0.722 | 0.685 | 0.702** | 0.788 | 0.758 | 0.773** | 0.564 | 0.440 | 0.494** |
| Att-BiLSTM-CRF+ELMo | 0.905 | 0.939 | 0.922 | 0.750 | 0.752 | 0.751 | 0.825 | 0.798 | 0.811 | 0.595 | 0.512 | 0.551 |

The experimental results in Table 3 show that our Att-BiLSTM-CRF model outperforms the BiLSTM-CRF model on both datasets. The main reason is that our attention mechanism can capture the long distance dependence and focus on the important words for the model’s predictions. It demonstrates that our attention layer is effective. When the ELMo embeddings are not used, our Att-BiLSTM-CRF model outperforms all pipelined methods on the DDI dataset and achieves competitive performance compared with the state-of-the-art pipelined methods on the CPI dataset. When the ELMo embeddings are added, the Att-BiLSTM-CRF model achieves better performances than all pipelined methods on the two datasets. Compared with the LSTM-LSTM-Bias joint model, our method significantly improves the recall of relation extraction without loss of precision. The main reason is that LSTM-LSTM-Bias only considers the non-overlapping relations while there are a lot of overlapping relations in the datasets. The results show the effectiveness of our method for the overlapping relation extraction. Unlike the LSTM-LSTM-Bias method, the Graph Tagging method can identify overlapping relations, which makes it achieve higher recall on both biomedical corpora. However, our Att-BiLSTM-CRF+ELMo method significantly outperforms it on both datasets. The possible reason is that the Graph Tagging method is proposed to extract the relations in the news domain, which is different from the biomedical domain. For example, there are large amounts of biomedical domain entity names (such as proteins, genes, drugs and diseases) in biomedical literature. Compared with our method, the Graph Tagging method achieves lower performance in NER, which also leads to the lower performance of relation extraction.

In addition, we found that the relation extraction results of pipelined methods are worse than those of the corresponding methods using golden standard entities. The pipeline results of the same BiLSTM relation extraction model drop by 3.5% and 10.3% compared with the results of the corresponding method using golden standard entities on the DDI and CPI datasets, respectively. It indicates the performance of relation extraction is affected by the result of NER due to the

Sentence 1: In a pharmacokinetic study of 8 chronic hepatitis C patients concomitantly receiving methadone, treatment with PEG-Intron once weekly.
  Standard: NER: {(methadone, drug), (PEG-Intron, brand)}; RE: {(methadone, mechanism, PEG-Intron)}
  Pipeline: NER: {(methadone, drug), (PEG, brand)}; RE: {(methadone, mechanism, PEG)}
  Joint: NER: {(methadone, drug), (PEG-Intron, brand)}; RE: {(methadone, mechanism, PEG-Intron)}

Sentence 2: Warfarin users who initiated citalopram, fluoxetine, or mirtazapine had an increased risk of hospitalization for gastrointestinal bleeding.
  Standard: NER: {(Warfarin, drug), (citalopram, drug), (fluoxetine, drug), (mirtazapine, drug)}; RE: {(Warfarin, effect, citalopram), (Warfarin, effect, fluoxetine), (Warfarin, effect, mirtazapine)}
  Pipeline: NER: {(Warfarin, drug), (citalopram, drug), (fluoxetine, drug), (mirtazapine, drug)}; RE: {(Warfarin, effect, citalopram)}
  Joint: NER: {(Warfarin, drug), (citalopram, drug), (fluoxetine, drug), (mirtazapine, drug)}; RE: {(Warfarin, effect, citalopram), (Warfarin, effect, fluoxetine), (Warfarin, effect, mirtazapine)}

Sentence 3: Limited evidence suggests that ascorbic acid may influence the intensity and duration of action of bishydroxycoumarin.
  Standard: NER: {(ascorbic acid, drug), (bishydroxycoumarin, drug)}; RE: {(ascorbic acid, mechanism, bishydroxycoumarin)}
  Pipeline: NER: {(ascorbic acid, drug), (bishydroxycoumarin, drug)}; RE: {}
  Joint: NER: {(ascorbic acid, drug), (bishydroxycoumarin, brand)}; RE: {(ascorbic acid, effect, bishydroxycoumarin)}

Table 4: Examples of the extracted results of different methods on the DDI dataset. Each example contains four rows: the first row is the sentence; the second row is the gold standard answer; the third and the fourth rows are the extracted results of the BiLSTM-CRF+BiLSTM pipelined method and our joint extraction method, respectively. The text in red denotes the wrong extraction result.
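To make the rows of Table 4 concrete, the sketch below scores the pipeline and joint outputs for sentences 1 and 2 against the gold standard by exact matching of (head, relation, tail) triples. Exact matching is an assumption made here for illustration, and `prf` is a hypothetical helper, not the evaluation code used in the experiments.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def prf(gold: Set[Triple], pred: Set[Triple]) -> Tuple[float, float, float]:
    """Precision, recall and F1 under exact triple matching."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Sentence 1: the pipeline truncates "PEG-Intron" to "PEG" (boundary error),
# so its triple no longer matches even though the relation type is right.
gold_s1 = {("methadone", "mechanism", "PEG-Intron")}
pipe_s1 = {("methadone", "mechanism", "PEG")}
joint_s1 = {("methadone", "mechanism", "PEG-Intron")}

# Sentence 2: one head entity related to three parallel entities.
gold_s2 = {
    ("Warfarin", "effect", "citalopram"),
    ("Warfarin", "effect", "fluoxetine"),
    ("Warfarin", "effect", "mirtazapine"),
}
pipe_s2 = {("Warfarin", "effect", "citalopram")}
joint_s2 = set(gold_s2)

print("S1 pipeline:", prf(gold_s1, pipe_s1))  # no triple matches
print("S1 joint:   ", prf(gold_s1, joint_s1))
print("S2 pipeline:", prf(gold_s2, pipe_s2))  # full precision, one third recall
print("S2 joint:   ", prf(gold_s2, joint_s2))
```

Under this matching rule, a single boundary error in NER removes the whole triple from the true positives, and a missed parallel entity pair lowers recall without affecting precision.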

error propagation. In particular, when NER performance is low, it significantly degrades the performance of relation extraction. Although the same BiLSTM-CRF architecture is used for NER in both methods, our joint method obtains a higher NER F-score than the pipelined model. This shows that our method can exploit the dependencies between entities and their relations to improve entity and relation extraction simultaneously. On the DDI dataset, our Att-BiLSTM-CRF model with ELMo embeddings outperforms the state-of-the-art methods that use gold-standard entities. However, it cannot outperform the state-of-the-art CPI extraction methods that use gold-standard entities, since NER on the CPI dataset is still a difficult task (the best F-score, achieved by Att-BiLSTM-CRF+ELMo, is only about 81%). See Supplementary Material A.5 and A.6 for more analyses of the results.

3.5 Case Study

To analyze the advantages and disadvantages of our joint extraction method, we compared some prediction results of the pipelined method and the joint extraction method (the results of Att-BiLSTM-CRF without ELMo are used so that the comparison is fair), and then selected three examples, shown in Table 4.

For sentence 1, the pipelined method makes an entity boundary error, and the error is propagated to relation classification. Even though the relation classification itself is correct, the resulting relation extraction is wrong. In contrast, our joint extraction method can alleviate error propagation through the joint learning of entities and relations.

For sentence 2, the sentence contains overlapping relations between one entity and three parallel entities. Although the pipelined method correctly recognizes all entities, it extracts only the first relation and misses the others. The main reason is that the pipelined method treats the relation extraction of each entity pair as a separate relation classification task, without considering the dependence among the parallel entities. In contrast, our joint extraction method extracts all the relations correctly, since it fully considers the dependencies between the entities and relations in a joint model. The example shows

that our method not only considers the dependence between entities and relations but can also learn the interaction between relations.

For sentence 3, the pipelined method detects the correct entities but misses their relation. In contrast, the joint extraction method correctly predicts the entity boundaries and detects that a relation exists between them, but it misrecognizes an entity type and the relation type. This shows that our method is more capable of detecting entity pairs and their relations, but it needs further improvement in distinguishing the relation types between entities.

4 Related Work

Most previous work on biomedical entity and relation extraction focuses on the pipelined method, i.e., treating the task as two separate tasks, NER and RC.

In previous NER work, the state-of-the-art CRF-based biomedical NER methods [5, 29-31] depend on effective feature engineering. Recently, several neural network architectures [18-20] have been proposed and exhibit promising results in the general domain. Moreover, similar models have been applied in the biomedical field [32, 33], including to genes and proteins [34], diseases [35] and chemicals [21].

After entity mentions in a given text are recognized using NER techniques, each entity pair is examined by RC models to decide whether it has a task-specific relation. Over the past decade, many methods have been proposed and successfully employed for RC in biomedical texts, including pattern-based methods [36, 37], feature-based methods [38, 39] and kernel-based methods [40-42]. Recently, deep neural network-based methods have been proposed and have achieved state-of-the-art performance. For example, Peng and Lu [43] proposed a multichannel dependency-based CNN model to extract PPIs. Sahu and Anand [44] explored an LSTM model with attention pooling for DDI extraction. Zhang et al. [25] applied a hierarchical LSTM model employing an attention mechanism to detect and extract DDIs.

However, the pipelined methods neglect the relevance of the two subtasks (NER and RC) and suffer from the error propagation problem. Recently, joint modeling methods for entities and relations have exhibited more promising results than traditional pipelined methods in the general domain. These methods include feature-based methods [7-9] and neural network-based methods [10, 11, 13, 28].

For the biomedical domain, there is less related work on the joint extraction of entities and relations. Li et al. exploited a transition-based feed-forward neural network to jointly extract drug-disease entity mentions and their adverse drug event relations [45]. Further, they applied a neural joint model to biomedical text [12]. These methods adopt shared neural network parameters to jointly encode the representations for the NER and RC tasks, but do not decode jointly. In contrast, our method converts the joint task into a single task through the novel tagging scheme, so that an end-to-end model can easily be used to extract entities and relations.

5 Conclusion

In this paper, we propose an end-to-end method for the joint extraction of biomedical entities and relations. Specifically, a tagging scheme that takes into account overlapping relations is proposed to convert the joint extraction task to a tagging problem. Then the Att-BiLSTM-CRF model is used to extract entities and their relations in texts with our extraction

rules. Our method can fully exploit the dependencies between entities and relations and significantly improves the performance of overlapping relation extraction. Moreover, biomedical ELMo embeddings are used and further improve the performance. The experimental results on two public biomedical datasets demonstrate the effectiveness of our method.

Acknowledgement

This work was supported by a grant from the National Key Research and Development Program of China (No. 2016YFC0901902).

References

[1] S Pyysalo, F Ginter, J Heimonen, J Björne, J Boberg, J Järvinen, et al. BioInfer: a corpus for information extraction in the biomedical domain. BMC bioinformatics. 2007;8:50.
[2] I Segura-Bedmar, P Martínez, MH Zazo. Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013). Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013); 2013. p. 341-350.
[3] M Krallinger, O Rabal, SA Akhondi. Overview of the BioCreative VI chemical-protein interaction Track. Proceedings of the sixth BioCreative challenge evaluation workshop; 2017. p. 141-146.
[4] M Banko, MJ Cafarella, S Soderland, M Broadhead, O Etzioni. Open information extraction from the web. IJCAI; 2007. p. 2670-2676.
[5] B Settles. Biomedical named entity recognition using conditional random fields and rich feature sets. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP); 2004.
[6] C Giuliano, A Lavelli, L Romano. Exploiting shallow linguistic information for relation extraction from biomedical literature. 11th Conference of the European Chapter of the Association for Computational Linguistics; 2006.
[7] Q Li, H Ji. Incremental joint extraction of entity mentions and relations. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2014. p. 402-412.
[8] M Miwa, Y Sasaki. Modeling joint entity and relation extraction with table representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1858-1869.
[9] X Ren, Z Wu, W He, M Qu, CR Voss, H Ji, et al. Cotype: Joint extraction of typed entities and relations with knowledge bases. Proceedings of the 26th International Conference on World Wide Web: International World Wide Web Conferences Steering Committee; 2017. p. 1015-1024.
[10] M Miwa, M Bansal. End-to-end relation extraction using lstms on sequences and tree structures. arXiv preprint arXiv:1601.00770. 2016.
[11] S Zheng, Y Hao, D Lu, H Bao, J Xu, H Hao, et al. Joint entity and relation extraction based on a hybrid neural network. Neurocomputing. 2017;257:59-66.
[12] F Li, M Zhang, G Fu, D Ji. A neural joint model for entity and relation extraction from biomedical text. BMC bioinformatics. 2017;18:198.
[13] S Zheng, F Wang, H Bao, Y Hao, P Zhou, B Xu. Joint extraction of entities and relations based on a novel tagging scheme. arXiv preprint arXiv:1706.05075. 2017.
[14] M Peters, M Neumann, M Iyyer, M Gardner, C Clark, K Lee, et al. Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); 2018. p. 2227-2237.
[15] S Lai, K Liu, S He, J Zhao. How to generate a good word embedding. IEEE Intelligent Systems. 2016;31:5-14.
[16] T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems; 2013. p. 3111-3119.
[17] M Rei, G Crichton, S Pyysalo. Attending to Characters in Neural Sequence Labeling Models. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers; 2016. p. 309-318.
[18] R Collobert, J Weston, L Bottou, M Karlen, K Kavukcuoglu, P Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research. 2011;12:2493-2537.
[19] G Lample, M Ballesteros, S Subramanian, K Kawakami, C Dyer. Neural Architectures for Named Entity Recognition. Proceedings of NAACL-HLT; 2016. p. 260-270.
[20] X Ma, E Hovy. End-to-end Sequence Labeling via Bidirectional LSTM-CNNs-CRF. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2016. p. 1064-1074.
[21] L Luo, Z Yang, P Yang, Y Zhang, L Wang, H Lin, et al. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics. 2017;34:1381-1388.
[22] M Herrero-Zazo, I Segura-Bedmar, P Martínez, T Declerck. The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. Journal of biomedical informatics. 2013;46:914-920.
[23] A Yeh. More accurate tests for the statistical significance of result differences. Proceedings of the 18th conference on Computational linguistics-Volume 2: Association for Computational Linguistics; 2000. p. 947-953.
[24] Z Zhao, Z Yang, L Luo, H Lin, J Wang. Drug drug interaction extraction from biomedical literature using syntax convolutional neural network. Bioinformatics. 2016;32:3444-3453.
[25] Y Zhang, W Zheng, H Lin, J Wang, Z Yang, M Dumontier. Drug–drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics. 2017;34:828-835.
[26] Y Peng, A Rios, R Kavuluru, Z Lu. Extracting chemical–protein relations with ensembles of SVM and deep learning models. Database. 2018;2018.
[27] C Sun, Z Yang, L Wang, Y Zhang, H Lin, J Wang, et al. Chemical-protein interaction extraction from biomedical literature: a hierarchical recurrent convolutional neural network method. International Journal of Data Mining and Bioinformatics. 2019;22:113-130.
[28] S Wang, Y Zhang, W Che, T Liu. Joint Extraction of Entities and Relations Based on a Novel Graph Scheme. IJCAI; 2018. p. 4461-4467.
[29] R Leaman, C-H Wei, Z Lu. tmChem: a high performance approach for chemical named entity recognition and normalization. Journal of cheminformatics. 2015;7:S3.
[30] C-H Wei, H-Y Kao, Z Lu. GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. BioMed research international. 2015;2015.
[31] B Settles. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005;21:3191-3192.
[32] M Habibi, L Weber, M Neves, DL Wiegandt, U Leser. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics. 2017;33:i37-i48.
[33] TH Dang, H-Q Le, TM Nguyen, ST Vu. D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information. Bioinformatics. 2018;34:3539-3546.
[34] L Li, L Jin, Z Jiang, D Song, D Huang. Biomedical named entity recognition based on extended recurrent neural networks. 2015 IEEE International Conference on bioinformatics and biomedicine (BIBM): IEEE; 2015. p. 649-652.
[35] S Sahu, A Anand. Recurrent neural network models for disease name recognition using domain invariant features. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2016. p. 2216-2225.
[36] A Leeuwenberg, A Buzmakov, Y Toussaint, A Napoli. Exploring pattern structures of syntactic trees for relation extraction. International Conference on Formal Concept Analysis: Springer; 2015. p. 153-168.
[37] DP Corney, BF Buxton, WB Langdon, DT Jones. BioRAT: extracting biological information from full-length papers. Bioinformatics. 2004;20:3206-3213.
[38] S Kim, H Liu, L Yeganova, WJ Wilbur. Extracting drug–drug interactions from literature using a rich feature-based linear kernel approach. Journal of biomedical informatics. 2015;55:23-30.
[39] M Miwa, R Sætre, Y Miyao, Ji Tsujii. A rich feature vector for protein-protein interaction extraction from multiple corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1: Association for Computational Linguistics; 2009. p. 121-130.
[40] A Airola, S Pyysalo, J Björne, T Pahikkala, F Ginter, T Salakoski. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC bioinformatics. 2008;9:S2.
[41] Y Zhang, H Lin, Z Yang, Y Li. Neighborhood hash graph kernel for protein–protein interaction extraction. Journal of biomedical informatics. 2011;44:1086-1092.
[42] W Zheng, H Lin, Z Zhao, B Xu, Y Zhang, Z Yang, et al. A graph kernel based on context vectors for extracting drug–drug interactions. Journal of biomedical informatics. 2016;61:34-43.
[43] Y Peng, Z Lu. Deep learning for extracting protein-protein interactions from biomedical literature. BioNLP 2017. 2017:29-38.
[44] SK Sahu, A Anand. Drug-drug interaction extraction from biomedical texts using long short-term memory network. Journal of biomedical informatics. 2018;86:15-24.
[45] F Li, Y Zhang, M Zhang, D Ji. Joint Models for Extracting Adverse Drug Events from Biomedical Text. IJCAI; 2016. p. 2838-2844.


Highlights

• A neural network-based approach for the joint extraction of biomedical entities and relations is proposed.
• A novel tagging scheme and extraction rules are proposed to extract overlapping relations in biomedical texts.
• The effectiveness of ELMo embeddings is explored for our joint model.
• Our joint method outperforms most of the existing pipelined methods and achieves state-of-the-art performance.


Ling Luo designed the algorithm, conducted the experiments and drafted the manuscript. Zhihao Yang provided the initial ideas and revised the manuscript. Mingyu Cao participated in the model designs and the experiments. Lei Wang provided biomedical support and revised the manuscript. Yin Zhang and Hongfei Lin commented on algorithm designs. All authors read and approved the final manuscript.


Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

☐The authors declare the following financial interests/personal relationships which may be considered as potential competing interests:
