Journal Pre-proofs
A neural network-based joint learning approach for biomedical entity and relation extraction from biomedical literature
Ling Luo, Zhihao Yang, Mingyu Cao, Lei Wang, Yin Zhang, Hongfei Lin
PII: S1532-0464(20)30011-3
DOI: https://doi.org/10.1016/j.jbi.2020.103384
Reference: YJBIN 103384
To appear in: Journal of Biomedical Informatics
Received Date: 9 August 2019
Revised Date: 19 November 2019
Accepted Date: 3 February 2020
Please cite this article as: Luo, L., Yang, Z., Cao, M., Wang, L., Zhang, Y., Lin, H., A neural network-based joint learning approach for biomedical entity and relation extraction from biomedical literature, Journal of Biomedical Informatics (2020), doi: https://doi.org/10.1016/j.jbi.2020.103384
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2020 Published by Elsevier Inc.
A neural network-based joint learning approach for biomedical entity and relation extraction from biomedical literature
Ling Luo1, Zhihao Yang1,*, Mingyu Cao1, Lei Wang2,*, Yin Zhang2, Hongfei Lin1
1 College of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
2 Beijing Institute of Health Administration and Medical Information, Beijing, 100850, China
* Corresponding authors: [email protected] or [email protected]
Abstract
Recently, joint modeling methods for entity and relation extraction have shown more promising results than traditional pipelined methods in the general domain. However, they are inappropriate for the biomedical domain due to the numerous overlapping relations in biomedical text. To alleviate the problem, we propose a neural network-based joint learning approach for biomedical entity and relation extraction. In this approach, a novel tagging scheme that takes into account overlapping relations is proposed. Then the Att-BiLSTM-CRF model is built to jointly extract the entities and their relations with our extraction rules. Moreover, contextualized ELMo representations pre-trained on biomedical text are used to further improve the performance. Experimental results on biomedical corpora show that our method can significantly improve the performance of overlapping relation extraction and achieves state-of-the-art performance.

Keywords: Joint learning, Biomedical entity relation extraction, Att-BiLSTM-CRF, Biomedical ELMo

1 Introduction
Exponentially growing biomedical literature contains a wealth of useful biomedical information and has become an important data source for biomedical research. Therefore, automatic extraction of entities and their relations from the biomedical literature has received much attention. Recently, various related tasks have been proposed, such as protein-protein interaction (PPI) extraction [1], drug-drug interaction (DDI) extraction [2], and chemical-protein interaction (CPI) extraction [3].

Take the DDI task for example. As shown in Figure 1, the objective of this task is to recognize the mentions of drug entities and extract possible DDI relations between them. Different from open information extraction [4], the entity and relation types of the tasks studied in this work are from predefined sets.

Figure 1: An example of the DDI task. “mechanism” is a relation in the predefined relation set. Text in different colors denotes entities of different types. Here, “pseudoephedrine” is involved in overlapping relations, since the entity belongs to two relations.

Traditional methods (pipelined methods) for biomedical relation extraction (RE) partition the
extraction process into two subtasks and address them incrementally. First, biomedical entity mentions in a given text are recognized using named entity recognition (NER) techniques [5]. Then each entity pair is classified into the task-specific relations (i.e., relation classification, RC) [6]. This separated framework makes the task easy to deal with, and each component can be more flexible. However, it neglects the relevance between the two subtasks and the fact that the results of NER may affect the performance of RC, which leads to error propagation without any feedback.

Recent studies show that joint modeling of entities and relations exhibits promising results in non-biomedical domains (such as the news domain). However, most existing joint modeling methods are feature-based structured systems [7-9], i.e., they still need complicated feature engineering and rely heavily on other NLP toolkits. In order to avoid feature engineering, some neural network-based methods were further proposed for joint entity and relation extraction [10, 11]. For the biomedical domain, a neural joint model was also explored [12]. Although these joint models adopt shared parameters in a single model to encode both entity and relation representations, they are still incremental models and extract the entities and relations with different decoders separately. This leads to the drawback that information between output entities and relations cannot be fully exploited. More recently, Zheng et al. [13] proposed a novel tagging scheme to convert the joint extraction task to a tagging problem. In their joint model, information about entities and relations is integrated into a unified tagging scheme and can be fully exploited by a biased LSTM-based model. However, the method only considers the situation where an entity belongs to at most one relation, and is incapable of identifying overlapping relations. Therefore, it is inappropriate for the biomedical domain, since there are a lot of overlapping relations in biomedical texts (for example, as will be introduced in section 3.1, about 60% and 80% of the relations in the DDI and CPI datasets, respectively, are overlapping ones).

To alleviate the problem, in this paper we propose a neural network-based joint learning approach to the joint extraction of biomedical entities and relations. First, inspired by the method proposed by Zheng et al. [13], we convert the joint extraction task to a tagging problem, in which a novel tagging scheme and extraction rules are proposed to extract the overlapping relations in biomedical texts. Then, the Att-BiLSTM-CRF model is developed to extract the entities and their relations. The main contributions of our work can be summarized as follows:

(1) We transform the joint extraction of biomedical entities and relations into a tagging task. To extract the overlapping relations, we propose a novel tagging scheme and extraction rules.

(2) We develop an attention-based BiLSTM-CRF (Att-BiLSTM-CRF) model to extract the entities and their relations. It can enhance the long-distance dependencies between related entities and focus on the words that are important for the model’s predictions.

(3) We also explore the effectiveness of the ELMo embedding [14] for our joint model. Experiments were conducted to compare ELMo embeddings of different domains and different dimensionalities. The experimental results show that the in-domain ELMo embedding is the most effective and that it can further improve the performance.

We conducted the experiments on the DDI and CPI datasets and the results show that our joint method
outperforms most of the existing pipelined methods and achieves state-of-the-art performance.

2 Methods
In this section, our joint method is described. It contains two phases (i.e., a training phase and a test phase), as shown in Figure 2. In the training phase, the joint extraction of biomedical entities and relations is first transformed into a tagging task using our tagging scheme. Traditional word2vec embeddings and contextualized ELMo embeddings are learned as the pretrained embeddings. Then the Att-BiLSTM-CRF model is trained with the feature embeddings. In the test phase, the test text is first tagged using our model. Then biomedical entities and relations are extracted from the tag sequence with our extraction rules. The process is described in detail in the following sections.

Figure 2: The processing flowchart of our method.

2.1 The Tagging Scheme
Inspired by the method of Zheng et al. [13], we treat the joint extraction of biomedical entities and relations as a sequence labeling task which predicts the label for each token. In their joint method, information about entities and relations is integrated into a unified tagging scheme and can be fully exploited. But their tagging scheme is incapable of identifying overlapping relations. There are abundant overlapping relations in the biomedical literature and, therefore, ignoring these relations loses a wealth of useful biomedical information. To alleviate the problem, we add new tags to our tagging scheme to extract the overlapping relations.

Figure 3 shows an example of how to tag a sentence with our tagging scheme according to the original gold standard annotations of the DDI dataset. Each token is assigned a label that contributes to the extraction results. Since not all tokens are part of an entity and not all entities participate in a relation, the tokens can be divided into three types: I) involving neither entities nor relations; II) only involving entities; III) involving both entities and relations. Concretely, for the type I token, tag “O” represents the “Other” tag, which means that the corresponding token is independent of the extracted results. In addition to “O”, the tag of the
type II token consists of two parts (i.e., the entity boundary label and the entity type label), and the tag of the type III token consists of four parts (i.e., the entity boundary label, the entity type label, the relation type label, and the entity role label). For the entity boundary label, the “BIES” (Begin, Inside, End, Single) tagging scheme is used to represent the position of the token in the entity. The entity type label and relation type label are predefined according to the datasets. For example, in the DDI task, the entity type label is obtained from the predefined set {“drug”, “group”, “brand”, “drug_n”}, and the relation type label is obtained from the predefined set {“mechanism”, “effect”, “advise”, “int”}. Moreover, for the relation type label, we add a tag “M” to represent the “Multiple” tag, which denotes that the entity simultaneously participates in different types of relations. The entity role label represents the role of the entity in the relation, defined by “1”, “2” and “m”. Here, “1” denotes a token of the first entity in the relation; “2” denotes a token of the second entity; and “m” denotes a token of an entity that plays different roles in an overlapping relation. Thus, the total number of tags is N = 3 × 4 × |E| × (|R| + 1) + 4 × |E| + 1, where |E| is the size of the predefined entity type set and |R| is the size of the predefined relation type set. The first term counts the type III tags (three role labels, four boundary labels, |E| entity types and |R| + 1 relation labels including “M”), the second term counts the type II tags, and the final 1 is the “O” tag; for the DDI task (|E| = 4, |R| = 4) this gives N = 257. Finally, the entity and relation extraction results can be represented by (Entity, EntityType) and (Entity1, RelationType, Entity2).

As shown in Figure 3, the input sentence contains three entity tuples (i.e., (Cholestyramine, drug), (raloxifene, drug) and (EVISTA, brand)) and two entity-relation triples (i.e., {Cholestyramine, mechanism, raloxifene} and {Cholestyramine, advise, EVISTA}). Each token is labeled according to its entity information and relation information. Concretely, the tokens “Cholestyramine” and “raloxifene” are single entities with “drug” type, and “EVISTA” is a single entity with “brand” type. The entity “Cholestyramine” participates in both relations “mechanism” and “advise” (an overlapping relation), so its label is “S-drug-M-1”. The entities “raloxifene” and “EVISTA” are the second entities in the relations “mechanism” and “advise”, respectively. Therefore, they are labeled as “S-drug-ME-2” and “S-brand-AD-2”, respectively. The other tokens, which are irrelevant to the entity and relation results, are labeled as “O”.

Figure 3: An example sentence annotated based on our tagging scheme, where “ME” and “AD” are the abbreviations of the relation types “mechanism” and “advise”, respectively.

2.2 The Extraction Rules
We propose the following four extraction rules, which take into account overlapping relations, to extract entities and relations from the tag sequence.

(1) The entity is extracted according to the entity boundary label and the entity type label of its tokens. If the entity types of the tokens in an entity are inconsistent, the entity type label of the first token in the entity is used as the entity type of the entity. The relation type label and the entity role label are processed with the same rule.
(2) The relation extraction follows the nearest principle. Each entity should find the closest entity which can be matched to form a relation triple.

(3) For the relation type label, an entity can only match another entity with the same relation type, except that an entity with the relation type “M” can match an entity with any relation type. For the entity role label, an entity can only match another entity with a different entity role label.

(4) Matching is directional. Entities with entity role “1” only look forward to find the next entity (from left to right); entities with entity role “2” only look backwards to find the previous entity (from right to left); entities with entity role “m” look both forward and backward.

As shown by the example in Figure 3, firstly, three entities (Cholestyramine, drug), (raloxifene, drug) and (EVISTA, brand) are extracted by rule 1. Secondly, the entity “Cholestyramine” first finds the entity “raloxifene” backwards by rules 2 and 4. Since the relation types of the two entities are “M” and “ME” respectively, and the entity roles are “1” and “2”, they can be combined into a triple {Cholestyramine, mechanism, raloxifene} by rules 1 and 3. Then, the entity “raloxifene” looks forward to the entity “Cholestyramine” (this triple has already been found in the previous stage). Finally, the entity “EVISTA” looks forward to find the entity “Cholestyramine”. The relation types of the two entities are “M” and “AD” respectively, and the entity roles are “1” and “2”; they therefore constitute the triple {Cholestyramine, advise, EVISTA}.

Although the problem of overlapping relation extraction can be alleviated with our method, there are still a few more complex overlapping relations which cannot be extracted (see Supplementary Material A.1 for more details).

2.3 Features
Our method uses word and character embeddings as basic features. In addition, the effects of contextualized ELMo embeddings for our model are investigated.

Word Embedding  Word embedding, also known as distributed word representation, can capture both the semantic and syntactic information of words from a large unlabeled corpus and has attracted considerable attention from many researchers [15]. To achieve a high-quality word embedding, we downloaded all MEDLINE abstracts from the PubMed website¹. These abstracts were then used to train word embeddings with the word2vec tool² using the skip-gram model [16], and the result serves as the pretrained word embedding.

¹ https://www.ncbi.nlm.nih.gov/pubmed
² code.google.com/p/word2vec

Character Feature  Character features in a word usually contain rich structural information about the word. Moreover, they can alleviate the out-of-vocabulary problem [17]. Therefore, a convolutional neural network (CNN) is used to obtain the character features. Firstly, a randomly initialized character lookup table contains an embedding for every character. Then, the character embeddings corresponding to every character in a word are fed into the convolutional layer. Afterwards, a max pooling layer and an average pooling layer are used to extract global features from the convolutional layer. Finally, these two global features are concatenated to represent the character feature of the word.

Contextualized ELMo Embedding  The above-mentioned word2vec method only allows a single context-independent representation for each word.
However, a word can have completely different semantics in different contexts. To overcome this shortcoming, ELMo [14] was recently proposed for learning high-quality deep context-dependent representations from a bidirectional language model. It has shown large improvements in a broad range of NLP tasks. Although several pretrained ELMo models have been released, most of them are trained on general domain text (i.e., the Billion Word Benchmark) [14]. Different from general domain text, biomedical domain texts contain a considerable number of domain-specific terms and language structures. Therefore, we pretrained 256- and 512-dimensional biomedical ELMo models with default configurations on the MEDLINE abstracts from the PubMed website (approximately 2.2B tokens). At the same time, we found that a 1024-dimensional biomedical ELMo model was also released most recently on the ELMo official website³.

³ https://allennlp.org/elmo

2.4 Att-BiLSTM-CRF Model
In recent years, several neural network-based models have been proposed and widely used in the sequence labeling task [18-20]. Among others, the bidirectional Long Short-Term Memory model with a conditional random field layer (BiLSTM-CRF) exhibits promising results. Furthermore, our previous work [21] shows that an attention mechanism can alleviate the bias problem of the BiLSTM-CRF model on long sentences and improve the performance. So a similar Att-BiLSTM-CRF model is developed to extract the entities and relations. Our previous attention mechanism [21] focuses on the related tokens in the different sentences of a document to address the tagging inconsistency problem for NER. Different from it, the objective of the new attention mechanism in this work is to highlight the important hidden states generated by the BiLSTM layer for the model’s predictions.

The architecture of our Att-BiLSTM-CRF model is illustrated in Figure 4. Firstly, all features are concatenated as input in the embedding layer. Secondly, two successive BiLSTM layers are used. In the BiLSTM layer, a forward LSTM computes a representation of the sequence from left to right at every word t, and a backward LSTM computes a representation of the same sequence in reverse. These two distinct networks use different parameters, and the representation of a word h_t is obtained by concatenating its left and right context representations.

Figure 4: The architecture of our Att-BiLSTM-CRF model.

Then an attention layer is used to weigh the importance of all the hidden states generated by the BiLSTM layer when they are used as features for the next layer. In the attention layer, we introduce an attention matrix A to calculate the similarity between the
current target hidden state and all hidden states. The attention weight value $\alpha_{t,j}$ in the attention matrix captures the similarity between the t-th current target hidden state $h_t$ and the j-th hidden state $h_j$:

$$\alpha_{t,j} = \frac{\exp(\mathrm{score}(h_t, h_j))}{\sum_{k=1}^{L} \exp(\mathrm{score}(h_t, h_k))}$$ (1)

$$\mathrm{score}(h_t, h_j) = W_a \tanh(W_b h_t + W_{b'} h_j + b_a)$$ (2)

Then a global vector $g_t$ is computed as a weighted sum of each BiLSTM output $h_j$:

$$g_t = \sum_{j=1}^{L} \alpha_{t,j} h_j$$ (3)

Next, the global vector and the BiLSTM output of the target word are concatenated as a vector $[g_t, h_t]$ and fed to a tanh function to produce the output of the attention layer:

$$z_t = \tanh(W_g [g_t, h_t])$$ (4)

Next, a tanh layer on top of the attention layer is used to predict confidence scores for the word having each of the possible labels, which serve as the output scores of the network:

$$e_t = \tanh(W_e z_t + b_e)$$ (5)

where the weight matrix set {$W_a$, $W_b$, $W_{b'}$, $W_g$, $W_e$} and the bias vector set {$b_a$, $b_e$} are the parameters of the model, and L is the length of the sentence.

Then, instead of modeling tagging decisions independently, a CRF layer is added to decode the best tag path among all possible tag paths. The final score of the sentence X along with a sequence of predictions y is given by the sum of transition scores and network scores:

$$s(X, y) = \sum_{i=1}^{L} \left( T_{y_{i-1}, y_i} + P_{i, y_i} \right)$$ (6)

where $P_{i,j}$ is the network score of the j-th tag of the i-th word in the sentence, and $T_{i,j}$ is the score of the transition from tag i to tag j in successive words.

Finally, a softmax function is used to yield the conditional probability of the path y by normalizing the above score over all possible tag paths $\tilde{y}$. During the training phase, the objective of the model is to maximize the log-probability of the correct tag sequence. At inference time, the best tag path, i.e. the one that obtains the maximum score, is predicted by:

$$y^{*} = \arg\max_{\tilde{y}} s(X, \tilde{y})$$ (7)

3 Results

3.1 Experimental Datasets and Settings
In our experiments, two public biomedical datasets were used to evaluate the performance: (1) the DDI extraction 2013 dataset (DDI) [22] and (2) the BioCreative CHEMPROT dataset (CPI) [3]. See Supplementary Material A.2 for more dataset details. In addition, it should be noted that about 60% and 80% of the relations in the DDI and CPI datasets, respectively, are overlapping relations. For the CPI dataset, like many teams in the challenge, we combined the training set and development set and used them as the training set. Then, for both datasets, we randomly selected 10% of the corresponding training set as the validation set to tune the hyper-parameters. After the hyper-parameters were chosen, the models were evaluated on both test sets. See Supplementary Material A.3 for the hyper-parameter details.
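Because most relations in both datasets overlap, it may help to see how a tag sequence in the scheme of Section 2.1 can be turned back into triples by the rules of Section 2.2. The following Python sketch is one possible reading of those rules, not the authors' implementation. Several things are assumptions made for illustration: the abbreviations “EF” and “INT”, all function and class names, the decision to skip a pair whose relation labels are both “M”, the handling of role “m” (one match per direction, taking whichever slot the partner leaves free), the requirement that labels contain no hyphen, and the filler tokens “w1” to “w3”, which stand in for the real words around the three tagged entities of Figure 3.

```python
from typing import List, NamedTuple, Optional, Tuple

# "ME" and "AD" are the abbreviations given in Figure 3; "EF" and "INT" are assumed.
REL_NAMES = {"ME": "mechanism", "AD": "advise", "EF": "effect", "INT": "int"}


class Entity(NamedTuple):
    text: str
    etype: str
    rel: Optional[str]   # relation type label ("M" = multiple); None for type II tokens
    role: Optional[str]  # "1", "2" or "m"; None for type II tokens


def extract_entities(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Rule 1: group BIES spans; the labels of the first token decide the entity."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        boundary = tag.split("-")[0]  # assumes no label contains a hyphen
        if boundary == "O":
            start = None
            continue
        if boundary in ("B", "S"):
            start = i
        if boundary in ("E", "S") and start is not None:
            first = tags[start].split("-")
            rel, role = (first[2], first[3]) if len(first) == 4 else (None, None)
            entities.append(Entity(" ".join(tokens[start:i + 1]), first[1], rel, role))
            start = None
    return entities


def match(a: Entity, b: Entity) -> Optional[str]:
    """Rule 3: roles must differ; relation types must agree unless one is "M"."""
    if a.rel is None or b.rel is None or a.role == b.role:
        return None
    if a.rel == "M" and b.rel == "M":
        return None  # assumption: ambiguous type, left unresolved in this sketch
    if "M" not in (a.rel, b.rel) and a.rel != b.rel:
        return None
    return b.rel if a.rel == "M" else a.rel


def extract_triples(entities: List[Entity]) -> List[Tuple[str, str, str]]:
    steps = {"1": (1,), "2": (-1,), "m": (1, -1)}  # rule 4: search direction per role
    triples = []
    for i, ent in enumerate(entities):
        for step in steps.get(ent.role, ()):
            j = i + step
            while 0 <= j < len(entities):
                rel = match(ent, entities[j])
                if rel is not None:  # rule 2: stop at the nearest compatible entity
                    other = entities[j]
                    ent_is_first = ent.role == "1" or other.role == "2"
                    head, tail = (ent, other) if ent_is_first else (other, ent)
                    triple = (head.text, REL_NAMES.get(rel, rel), tail.text)
                    if triple not in triples:
                        triples.append(triple)
                    break
                j += step
    return triples


tokens = ["Cholestyramine", "w1", "raloxifene", "w2", "w3", "EVISTA"]
tags = ["S-drug-M-1", "O", "S-drug-ME-2", "O", "O", "S-brand-AD-2"]
print(extract_triples(extract_entities(tokens, tags)))
```

With these tags the sketch recovers the two triples of the Figure 3 example, (Cholestyramine, mechanism, raloxifene) and (Cholestyramine, advise, EVISTA). “EVISTA” skips “raloxifene” because both carry role “2”, and then matches “Cholestyramine”, whose “M” label accepts any relation type.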
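The path score of Eq. (6) and the decoding step of Eq. (7) in Section 2.4 can be illustrated with a small pure-Python sketch. It is a generic Viterbi decoder rather than the authors' code: the function names and the toy scores are hypothetical, and the start transition is assumed to be zero, so only transitions between neighbouring words are scored.

```python
def path_score(P, T, y):
    """Eq. (6) with a zero start transition: network scores plus tag transitions."""
    score = P[0][y[0]]
    for i in range(1, len(y)):
        score += T[y[i - 1]][y[i]] + P[i][y[i]]
    return score


def viterbi(P, T):
    """Eq. (7): the highest-scoring tag path, found by dynamic programming."""
    n_tags = len(P[0])
    best = list(P[0])  # best[j]: best score of any path ending in tag j
    back = []          # back[i-1][j]: previous tag on that best path at word i
    for i in range(1, len(P)):
        prev, best, ptr = best, [], []
        for j in range(n_tags):
            k = max(range(n_tags), key=lambda t: prev[t] + T[t][j])
            best.append(prev[k] + T[k][j] + P[i][j])
            ptr.append(k)
        back.append(ptr)
    tag = max(range(n_tags), key=lambda t: best[t])
    path = [tag]
    for ptr in reversed(back):  # follow the back-pointers from the last word
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


# Toy example: 3 words, 2 tags. P[i][j] is the network score, T[a][b] the transition score.
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
T = [[0.5, -1.0], [-1.0, 0.5]]
best_path = viterbi(P, T)
print(best_path, path_score(P, T, best_path))  # [0, 0, 0] 3.0
```

In this toy case the network score alone would prefer tag 1 for the second word, but the transition scores make the all-zero path the best one overall, which is the kind of dependency between neighbouring tags that the CRF layer is meant to capture.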
Micro-averaged Precision (P), Recall (R) and F1-score (F1), which were used in the DDI and CPI tasks [2, 3], are also used to evaluate the prediction results. For entity recognition, an entity is regarded as correct when its left and right boundaries and its entity type are all correct. For relation extraction, a predicted relation is considered correct when the left and right boundaries of the corresponding entities and the relation type are all correct. Our statistical significance results are based on the Approximate Randomization test [23].

3.2 The Effect of our Tagging Scheme and Extraction Rules
To explore the effectiveness of our tagging scheme and extraction rules, the results of several comparisons are provided in Table 1. Nearest Rule: the tagging scheme in which the “M” tag of the relation type label and the “m” tag of the entity role label are not used (thus only the last relation of an overlapping relation is tagged), and the relation extraction only follows the nearest principle as used in the method of Zheng et al. [13]. Our Rule: the same tagging scheme as Nearest Rule, but the relation extraction uses our extraction rules described in section 2.2. Our Rule+M: our tagging scheme described in section 2.1 is used and the relation extraction follows our extraction rules. In the experiment, all models are Att-BiLSTM-CRF models without the ELMo embeddings.

As shown in Table 1, the model with our rules and tagging scheme achieves the best performance on both datasets. When only our extraction rules are used, the recall of overlapping relations is significantly improved. Furthermore, when the “M” tag of the relation type label and the “m” tag of the entity role label are introduced into the tagging scheme, the performance of the model improves further. The experimental results show that our method is effective for overlapping relation extraction from biomedical text.

Method          DDI F1   DDI R1   DDI R2   CPI F1   CPI R1   CPI R2
Nearest Rule    0.481    0.706    0.179    0.277    0.338    0.150
Our Rule        0.685    0.699    0.604    0.421    0.409    0.337
Our Rule+M      0.702    0.734    0.668    0.494    0.504    0.425

Table 1: The effect of our tagging scheme and extraction rules. “R1” and “R2” denote the recall of non-overlapping and overlapping relations, respectively.

3.3 The Effect of the ELMo Embedding
Recently, the ELMo model has exhibited promising results for some NLP tasks [14]. To explore the effectiveness of the ELMo embedding for our model, experiments were conducted to compare ELMo embeddings of different domains and different dimensionalities. The results of these comparisons are provided in Table 2.

Method             DDI NER-F1   DDI RE-F1   CPI NER-F1   CPI RE-F1
Att-BiLSTM-CRF     0.911        0.702       0.773        0.494
+ELMo-256(Gen)     0.908        0.709       0.781        0.509
+ELMo-512(Gen)     0.912        0.718       0.782        0.513
+ELMo-1024(Gen)    0.915        0.737       0.786        0.517
+ELMo-256(Bio)     0.913        0.730       0.787        0.522
+ELMo-512(Bio)     0.917        0.736       0.796        0.538
+ELMo-1024(Bio)    0.922        0.751       0.811        0.551

Table 2: The effect of the contextualized ELMo embedding. “256”, “512” and “1024” denote the dimensionalities of the ELMo embeddings. “Gen” denotes the ELMo pretrained on general domain text, and “Bio” denotes the ELMo pretrained on biomedical text.

The results show that adding these ELMo embeddings to the model improves RE F1 on both datasets in every configuration. Consistent with previous work [14], increasing the dimensionality of the ELMo embeddings can improve the model performance. Moreover, the in-domain biomedical ELMo embeddings can achieve larger
improvements than the general domain ones. When the 1024-dimensional biomedical ELMo embeddings are added to our model, the best performance is achieved on both datasets (an average improvement of 5.3 percentage points in RE F-score over the model without ELMo embeddings). In addition, we also explored different locations for applying ELMo embeddings, and found that adding them at the input layer is the best choice for our model. See Supplementary Material A.4 for details.

3.4 Performance Comparison
To further demonstrate the effectiveness of our method, it is compared with several state-of-the-art entity relation extraction methods. These methods can be divided into three categories: relation extraction methods using gold standard entities, pipelined methods, and joint extraction methods.

First, the results of the state-of-the-art relation extraction methods using gold standard entities are shown. (1) BiLSTM: a bidirectional LSTM model with only word embedding and entity position embedding as inputs was built to extract relations as our baseline. (2) SCNN [24] is a syntax convolutional neural network-based method for DDI extraction. (3) HRNN [25] is a hierarchical recurrent neural network-based method for DDI extraction, which integrates the shortest dependency path and the sentence sequence. (4) Peng et al. [26] built an ensemble of SVM, CNN and RNN using majority voting for CPI extraction and achieved the highest performance during the challenge. (5) HieRCNN [27] is a hierarchical recurrent convolutional neural network-based method for CPI extraction.

For the pipelined methods, the NER results are first obtained and then the relations between the entity mentions are classified. (1) BiLSTM-CRF+BiLSTM: the NER results are obtained with the BiLSTM-CRF model (i.e., the Att-BiLSTM-CRF model described in section 2.4 without the attention layer), and then the above-mentioned BiLSTM model is used to extract relations. (2) BiLSTM-CRF+HRNN: the same BiLSTM-CRF model is used for NER, and then the HRNN model [25] is used to extract DDI relations. (3) BiLSTM-CRF+HieRCNN: the same BiLSTM-CRF model is used for NER, and then the HieRCNN model [27] is used to extract CPI relations.

For the joint extraction methods, the state-of-the-art joint methods in the news domain were compared with our method. (1) LSTM-LSTM-Bias: an end-to-end method based on a novel tagging scheme proposed by Zheng et al. [13]. We rebuilt their model on the DDI and CPI datasets. (2) Graph Tagging: an overlapping entity extraction method proposed by Wang et al. [28]. It is a transition-based approach which converts the joint task into a directed graph by designing a novel graph scheme. We retrained their model on the DDI and CPI corpora using their official code⁴. To make the comparison fair, the same biomedical 100-dimensional word2vec embedding was also used for Graph Tagging, and its hyper-parameters were tuned on the validation set in the same way as for our model. In addition, the BiLSTM-CRF model with our tagging scheme and extraction rules is used as a baseline for comparison.

⁴ https://github.com/hitwsl/joint-entity-relation

The experimental results in Table 3 show that our Att-BiLSTM-CRF model outperforms the BiLSTM-CRF model on both datasets. The main reason is that our attention mechanism can capture long-distance dependencies and focus on the words that are important for the model’s predictions. This demonstrates that our attention layer is effective. When the ELMo embeddings are not
CPI
DDI Method BiLSTM (golden) SCNN (golden) [24] HRNN (golden) [25] Peng et al. (golden) [26] HieRCNN (golden) [27] BiLSTM-CRF+BiLSTM BiLSTM-CRF+HRNN BiLSTM-CRF+HieRCNN LSTM-LSTM-Bias Graph Tagging BiLSTM-CRF BiLSTM-CRF+ELMo Att-BiLSTM-CRF Att-BiLSTM-CRF+ELMo
P 0.932 0.932 0.817 0.894 0.903 0.901 0.905
NER R 0.861 0.861 0.813 0.918 0.933 0.921 0.939
F1 0.895** 0.895** 0.815** 0.906* 0.918 0.911 0.922
P 0.684 0.725 0.741 0.648 0.692 0.644 0.587 0.684 0.737 0.722 0.750
RE R 0.665 0.651 0.718 0.630 0.707 0.388 0.571 0.669 0.740 0.685 0.752
F1 0.674 0.686 0.729 0.639** 0.692** 0.484** 0.579** 0.677** 0.738* 0.702** 0.751
P 0.762 0.762 0.742 0.777 0.823 0.788 0.825
NER R 0.733 0.733 0.706 0.754 0.790 0.758 0.798
F1 0.747** 0.747** 0.724** 0.765** 0.806 0.773** 0.811
P 0.572 0.727 0.632 0.492 0.486 0.417 0.415 0.545 0.585 0.564 0.595
RE R 0.573 0.574 0.652 0.449 0.515 0.211 0.348 0.430 0.495 0.440 0.512
F1 0.572 0.641 0.642 0.469** 0.500** 0.280** 0.379** 0.480** 0.536* 0.494** 0.551
Table 3: Performance comparison with the other existing methods. The first part is the relation extraction methods on golden standard entities, the second part is the pipelined methods, and the third part is the joint methods. * and ** denote the statistically significant differences against the best results of our method (i.e., Att-BiLSTM-CRF+ELMo) at p < 0.05 and p < 0.01, respectively. Note: the NER results of the LSTM-LSTM-Bias method are not shown since the method only focuses the entity of the relation and does not detect the entity types.
used, our Att-BiLSTM-CRF model outperforms all pipelined methods on the DDI dataset and achieves competitive performance compared with the state-of-the-art pipelined methods on the CPI dataset. When the ELMo embeddings are added, the Att-BiLSTM-CRF model outperforms all pipelined methods on both datasets. Compared with the LSTM-LSTM-Bias joint model, our method significantly improves the recall of relation extraction without loss of precision. The main reason is that LSTM-LSTM-Bias only considers non-overlapping relations, while the datasets contain many overlapping relations. The results show the effectiveness of our method for overlapping relation extraction. Unlike the LSTM-LSTM-Bias method, the Graph Tagging method can identify overlapping relations, which gives it higher recall on both biomedical corpora. However, our Att-BiLSTM-CRF+ELMo method significantly outperforms it on both datasets. A possible reason is that the Graph Tagging method was proposed to extract relations in the news domain, which differs from the biomedical domain. For example, biomedical literature contains large numbers of domain-specific entity names (such as proteins, genes, drugs and diseases). Compared with our method, the Graph Tagging method achieves lower performance in NER, which also leads to lower performance in relation extraction.
In addition, we found that the relation extraction results of the pipelined methods are worse than those of the corresponding methods using gold standard entities. With the same BiLSTM relation extraction model, the pipeline results drop by 3.5% and 10.3% compared with the results obtained using gold standard entities on the DDI and CPI datasets, respectively. This indicates that the performance of relation extraction is affected by the result of NER due to the
Sentence 1: In a pharmacokinetic study of 8 chronic hepatitis C patients concomitantly receiving methadone, treatment with PEG-Intron once weekly.
  Standard: NER: {(methadone, drug), (PEG-Intron, brand)}; RE: {(methadone, mechanism, PEG-Intron)}
  Pipeline: NER: {(methadone, drug), (PEG, brand)}; RE: {(methadone, mechanism, PEG)}
  Joint: NER: {(methadone, drug), (PEG-Intron, brand)}; RE: {(methadone, mechanism, PEG-Intron)}

Sentence 2: Warfarin users who initiated citalopram, fluoxetine, or mirtazapine had an increased risk of hospitalization for gastrointestinal bleeding.
  Standard: NER: {(Warfarin, drug), (citalopram, drug), (fluoxetine, drug), (mirtazapine, drug)}; RE: {(Warfarin, effect, citalopram), (Warfarin, effect, fluoxetine), (Warfarin, effect, mirtazapine)}
  Pipeline: NER: {(Warfarin, drug), (citalopram, drug), (fluoxetine, drug), (mirtazapine, drug)}; RE: {(Warfarin, effect, citalopram)}
  Joint: NER: {(Warfarin, drug), (citalopram, drug), (fluoxetine, drug), (mirtazapine, drug)}; RE: {(Warfarin, effect, citalopram), (Warfarin, effect, fluoxetine), (Warfarin, effect, mirtazapine)}

Sentence 3: Limited evidence suggests that ascorbic acid may influence the intensity and duration of action of bishydroxycoumarin.
  Standard: NER: {(ascorbic acid, drug), (bishydroxycoumarin, drug)}; RE: {(ascorbic acid, mechanism, bishydroxycoumarin)}
  Pipeline: NER: {(ascorbic acid, drug), (bishydroxycoumarin, drug)}; RE: {}
  Joint: NER: {(ascorbic acid, drug), (bishydroxycoumarin, brand)}; RE: {(ascorbic acid, effect, bishydroxycoumarin)}
Table 4: Examples of the extracted results of different methods on the DDI dataset. Each example contains four rows: the first row is the sentence; the second row is the gold standard answer; the third and the fourth rows are the extracted results of the BiLSTM-CRF+BiLSTM pipelined method and our joint extraction method, respectively. The text in red denotes the wrong extraction result.
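As an illustration of how the Table 4 examples translate into scores, the following Python sketch computes precision, recall and F-score over the extracted relation triples. It assumes strict matching, i.e., a predicted triple counts as correct only if both entity strings and the relation type equal a gold standard triple; this criterion is an assumption made for the example and is not taken from the evaluation protocol of the paper. The names prf, micro_prf, GOLD, PIPELINE and JOINT are hypothetical, and the resulting numbers describe only these three hand-picked sentences, not the results reported in Table 3.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation type, tail entity)

# Relation triples copied from Table 4 (sentence id -> set of triples).
GOLD: Dict[int, Set[Triple]] = {
    1: {("methadone", "mechanism", "PEG-Intron")},
    2: {("Warfarin", "effect", "citalopram"),
        ("Warfarin", "effect", "fluoxetine"),
        ("Warfarin", "effect", "mirtazapine")},
    3: {("ascorbic acid", "mechanism", "bishydroxycoumarin")},
}
PIPELINE: Dict[int, Set[Triple]] = {
    1: {("methadone", "mechanism", "PEG")},       # entity boundary error
    2: {("Warfarin", "effect", "citalopram")},    # two overlapping relations missed
    3: set(),                                     # relation missed
}
JOINT: Dict[int, Set[Triple]] = {
    1: {("methadone", "mechanism", "PEG-Intron")},
    2: {("Warfarin", "effect", "citalopram"),
        ("Warfarin", "effect", "fluoxetine"),
        ("Warfarin", "effect", "mirtazapine")},
    3: {("ascorbic acid", "effect", "bishydroxycoumarin")},  # wrong relation type
}


def prf(gold: Set[tuple], pred: Set[tuple]) -> Tuple[float, float, float]:
    """Precision, recall and F-score under strict matching (assumed criterion)."""
    tp = len(gold & pred)  # triples that match exactly
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


def micro_prf(gold: Dict[int, Set[Triple]],
              pred: Dict[int, Set[Triple]]) -> Tuple[float, float, float]:
    """Pool the triples of all sentences, keyed by sentence id, then score once."""
    gold_all = {(sid, *t) for sid, triples in gold.items() for t in triples}
    pred_all = {(sid, *t) for sid, triples in pred.items() for t in triples}
    return prf(gold_all, pred_all)


if __name__ == "__main__":
    for name, system in (("pipeline", PIPELINE), ("joint", JOINT)):
        for sid in sorted(GOLD):
            print(name, "sentence", sid, prf(GOLD[sid], system[sid]))
        print(name, "micro", micro_prf(GOLD, system))
```

Under this assumed criterion, the boundary error in sentence 1 gives the pipelined output zero precision and recall even though its relation type is correct, and the missed overlapping relations in sentence 2 leave it with a recall of 1/3 at a precision of 1. The joint output is penalized only on sentence 3, where the relation type is wrong.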
error propagation. In particular, when NER performance is low, it significantly degrades the performance of relation extraction. Although the same BiLSTM-CRF architecture is used for NER in both methods, our joint method obtains higher NER F-scores than the pipelined model. This shows that our method can exploit the dependencies between entities and their relations to improve the performance of entity and relation extraction simultaneously. On the DDI dataset, our Att-BiLSTM-CRF model with ELMo embeddings outperforms the state-of-the-art methods using gold standard entities. However, it cannot outperform the state-of-the-art CPI extraction methods using gold standard entities, since NER on the CPI dataset is still a difficult task (the best F-score achieved by Att-BiLSTM-CRF+ELMo is only about 81%). See Supplementary Material A.5 and A.6 for more analyses of the results.
3.5 Case Study
To analyze the advantages and disadvantages of our joint extraction method, we compared some prediction results of the pipelined method and the joint extraction method (where the results of Att-BiLSTM-CRF without ELMo are used so that the comparison is fair), and then selected three examples, shown in Table 4.
For sentence 1, the pipelined method generates an entity boundary error, and the error is propagated to the relation classification. Even if the relation classification is correct, the wrong relation extraction result is obtained. In contrast, our joint extraction method can alleviate error propagation through the joint learning of entities and relations.
For sentence 2, the sentence contains overlapping relations between one entity and three parallel entities. Although the pipelined method correctly recognizes all entities, it only extracts the first relation and misses the other relations. The main reason is that the pipelined method treats the relation extraction of each entity pair as a separate relation classification task without considering the dependence among the parallel entities. In contrast, our joint extraction method extracts all the relations correctly since it fully considers the dependencies of the entities and relations in a joint model. The example shows
that our method not only considers the dependence between entities and relations but can also learn the interaction between relations.
For sentence 3, the pipelined method detects the correct entities but misses their relation. In contrast, the joint extraction method correctly predicts the entities' boundaries and detects that a relation exists between them, but the entity and relation types are recognized incorrectly. This shows that our method is more capable of detecting entity pairs and their relations, but it needs further improvement in distinguishing the type of relation between the entities.
4 Related Work
Most previous work on biomedical entity and relation extraction focuses on the pipelined method, i.e., treating the task as two separate tasks, NER and RC.
In previous NER work, the state-of-the-art CRF-based biomedical NER methods [5, 29-31] depend on effective feature engineering. Recently, several neural network architectures [18-20] have been proposed and exhibit promising results in the general domain. Moreover, similar models have been applied in the biomedical field [32, 33], including genes and proteins [34], diseases [35] and chemicals [21].
After recognizing entity mentions in a given text using NER technologies, each entity pair is examined to decide whether it has a task-specific relation using RC models. Over the past decade, many methods have been proposed and successfully employed for RC in biomedical texts, including pattern-based methods [36, 37], feature-based methods [38, 39] and kernel-based methods [40-42]. Recently, deep neural network-based methods have been proposed and achieved state-of-the-art performance. For example, Peng and Lu [43] proposed a multichannel dependency-based CNN model to extract PPIs. Sahu and Anand [44] explored an LSTM model with attention pooling for DDI extraction. Zhang et al. [25] applied a hierarchical LSTM model employing an attention mechanism to detect and extract DDIs.
However, the pipelined methods neglect the relevance of the two subtasks (NER and RC) and suffer from the error propagation problem. Recently, joint modeling methods for entities and relations have shown more promising results than traditional pipelined methods in the general domain. These methods include feature-based methods [7-9] and neural network-based methods [10, 11, 13, 28].
For the biomedical domain, there is less related work on the joint extraction of entities and relations. Li et al. exploited a transition-based feed-forward neural network to jointly extract drug-disease entity mentions and their adverse drug event relations [45]. Further, they applied a neural joint model to biomedical text [12]. These methods adopt shared neural network parameters to jointly encode the representations for the NER and RC tasks, but do not perform joint decoding. In contrast, our method converts the joint task into a single task through the novel tagging scheme, so that an end-to-end model can easily be used to extract entities and relations.
5 Conclusion
In this paper, we propose an end-to-end method for the joint extraction of biomedical entities and relations. Specifically, a tagging scheme that takes overlapping relations into account is proposed to convert the joint extraction task into a tagging problem. Then the Att-BiLSTM-CRF model is used to extract entities and their relations in texts with our extraction
rules. Our method can fully exploit dependencies of entities and relations and significantly improve the performance of overlapping relation extraction. Moreover, the biomedical ELMo embeddings are used and further improve the performance. The experimental results on two public biomedical datasets demonstrate the effectiveness of our method.
Acknowledgement
This work was supported by the grant from the National Key Research and Development Program of China (No. 2016YFC0901902).
References
[1] S Pyysalo, F Ginter, J Heimonen, J Björne, J Boberg, J Järvinen, et al. BioInfer: a corpus for information extraction in the biomedical domain. BMC bioinformatics. 2007;8:50.
[2] I Segura-Bedmar, P Martínez, MH Zazo. Semeval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (ddiextraction 2013). Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013); 2013. p. 341-350.
[3] M Krallinger, O Rabal, SA Akhondi. Overview of the BioCreative VI chemical-protein interaction Track. Proceedings of the sixth BioCreative challenge evaluation workshop; 2017. p. 141-146.
[4] M Banko, MJ Cafarella, S Soderland, M Broadhead, O Etzioni. Open information extraction from the web. IJCAI; 2007. p. 2670-2676.
[5] B Settles. Biomedical named entity recognition using conditional random fields and rich feature sets. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP); 2004.
[6] C Giuliano, A Lavelli, L Romano. Exploiting shallow linguistic information for relation extraction from biomedical literature. 11th Conference of the European Chapter of the Association for Computational Linguistics; 2006.
[7] Q Li, H Ji. Incremental joint extraction of entity mentions and relations. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2014. p. 402-412.
[8] M Miwa, Y Sasaki. Modeling joint entity and relation extraction with table representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1858-1869.
[9] X Ren, Z Wu, W He, M Qu, CR Voss, H Ji, et al. Cotype: Joint extraction of typed entities and relations with knowledge bases. Proceedings of the 26th International Conference on World Wide Web: International World Wide Web Conferences Steering Committee; 2017. p. 1015-1024.
[10] M Miwa, M Bansal. End-to-end relation extraction using lstms on sequences and tree structures. arXiv preprint arXiv:1601.00770. 2016.
[11] S Zheng, Y Hao, D Lu, H Bao, J Xu, H Hao, et al. Joint entity and relation extraction based on a hybrid neural network. Neurocomputing. 2017;257:59-66.
[12] F Li, M Zhang, G Fu, D Ji. A neural joint model for entity and relation extraction from biomedical text. BMC bioinformatics. 2017;18:198.
[13] S Zheng, F Wang, H Bao, Y Hao, P Zhou, B Xu. Joint extraction of entities and relations based on a novel tagging scheme. arXiv preprint arXiv:1706.05075. 2017.
[14] M Peters, M Neumann, M Iyyer, M Gardner, C Clark, K Lee, et al. Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); 2018. p. 2227-2237.
[15] S Lai, K Liu, S He, J Zhao. How to generate a good word embedding. IEEE Intelligent Systems. 2016;31:5-14.
[16] T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems; 2013. p. 3111-3119.
[17] M Rei, G Crichton, S Pyysalo. Attending to Characters in Neural Sequence Labeling Models. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers; 2016. p. 309-318.
[18] R Collobert, J Weston, L Bottou, M Karlen, K Kavukcuoglu, P Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research. 2011;12:2493-2537.
[19] G Lample, M Ballesteros, S Subramanian, K Kawakami, C Dyer. Neural Architectures for Named Entity Recognition. Proceedings of NAACL-HLT; 2016. p. 260-270.
[20] X Ma, E Hovy. End-to-end Sequence Labeling via Bidirectional LSTM-CNNs-CRF. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2016. p. 1064-1074.
[21] L Luo, Z Yang, P Yang, Y Zhang, L Wang, H Lin, et al. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics. 2017;34:1381-1388.
[22] M Herrero-Zazo, I Segura-Bedmar, P Martínez, T Declerck. The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions. Journal of biomedical informatics. 2013;46:914-920.
[23] A Yeh. More accurate tests for the statistical significance of result differences. Proceedings of the 18th conference on Computational linguistics-Volume 2: Association for Computational Linguistics; 2000. p. 947-953.
[24] Z Zhao, Z Yang, L Luo, H Lin, J Wang. Drug drug interaction extraction from biomedical literature using syntax convolutional neural network. Bioinformatics. 2016;32:3444-3453.
[25] Y Zhang, W Zheng, H Lin, J Wang, Z Yang, M Dumontier. Drug–drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics. 2017;34:828-835.
[26] Y Peng, A Rios, R Kavuluru, Z Lu. Extracting chemical–protein relations with ensembles of SVM and deep learning models. Database. 2018;2018.
[27] C Sun, Z Yang, L Wang, Y Zhang, H Lin, J Wang, et al. Chemical-protein interaction extraction from biomedical literature: a hierarchical recurrent convolutional neural network method. International Journal of Data Mining and Bioinformatics. 2019;22:113-130.
[28] S Wang, Y Zhang, W Che, T Liu. Joint Extraction of Entities and Relations Based on a Novel Graph Scheme. IJCAI; 2018. p. 4461-4467.
[29] R Leaman, C-H Wei, Z Lu. tmChem: a high performance approach for chemical named entity recognition and normalization. Journal of cheminformatics. 2015;7:S3.
[30] C-H Wei, H-Y Kao, Z Lu. GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. BioMed research international. 2015;2015.
[31] B Settles. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics. 2005;21:3191-3192.
[32] M Habibi, L Weber, M Neves, DL Wiegandt, U Leser. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics. 2017;33:i37-i48.
[33] TH Dang, H-Q Le, TM Nguyen, ST Vu. D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information. Bioinformatics. 2018;34:3539-3546.
[34] L Li, L Jin, Z Jiang, D Song, D Huang. Biomedical named entity recognition based on extended recurrent neural networks. 2015 IEEE International Conference on bioinformatics and biomedicine (BIBM): IEEE; 2015. p. 649-652.
[35] S Sahu, A Anand. Recurrent neural network models for disease name recognition using domain invariant features. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2016. p. 2216-2225.
[36] A Leeuwenberg, A Buzmakov, Y Toussaint, A Napoli. Exploring pattern structures of syntactic trees for relation extraction. International Conference on Formal Concept Analysis: Springer; 2015. p. 153-168.
[37] DP Corney, BF Buxton, WB Langdon, DT Jones. BioRAT: extracting biological information from full-length papers. Bioinformatics. 2004;20:3206-3213.
[38] S Kim, H Liu, L Yeganova, WJ Wilbur. Extracting drug–drug interactions from literature using a rich feature-based linear kernel approach. Journal of biomedical informatics. 2015;55:23-30.
[39] M Miwa, R Sætre, Y Miyao, Ji Tsujii. A rich feature vector for protein-protein interaction extraction from multiple corpora. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1: Association for Computational Linguistics; 2009. p. 121-130.
[40] A Airola, S Pyysalo, J Björne, T Pahikkala, F Ginter, T Salakoski. All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning. BMC bioinformatics. 2008;9:S2.
[41] Y Zhang, H Lin, Z Yang, Y Li. Neighborhood hash graph kernel for protein–protein interaction extraction. Journal of biomedical informatics. 2011;44:1086-1092.
[42] W Zheng, H Lin, Z Zhao, B Xu, Y Zhang, Z Yang, et al. A graph kernel based on context vectors for extracting drug–drug interactions. Journal of biomedical informatics. 2016;61:34-43.
[43] Y Peng, Z Lu. Deep learning for extracting protein-protein interactions from biomedical literature. BioNLP 2017. 2017:29-38.
[44] SK Sahu, A Anand. Drug-drug interaction extraction from biomedical texts using long short-term memory network. Journal of biomedical informatics. 2018;86:15-24.
[45] F Li, Y Zhang, M Zhang, D Ji. Joint Models for Extracting Adverse Drug Events from Biomedical Text. IJCAI; 2016. p. 2838-2844.
Graphical abstract
Highlights
A neural network-based approach for the joint extraction of biomedical entities and relations is proposed.
The novel tagging scheme and extraction rules are proposed to extract the overlapping relations in biomedical texts.
The effectiveness of the ELMo embedding is explored for our joint model.
Our joint method outperforms most of the existing pipelined methods and achieves the state-of-the-art performance.
Ling Luo designed the algorithm, conducted the experiments and drafted the manuscript. Zhihao Yang provided the initial ideas and revised the manuscript. Mingyu Cao participated in the model designs and the experiments. Lei Wang provided biomedical support and revised the manuscript. Yin Zhang and Hongfei Lin commented on algorithm designs. All authors read and approved the final manuscript.
Declaration of interests
☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
☐The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: