Multi-granularity sequence labeling model for acronym expansion identification


Jie Liu, Caihua Liu, Yalou Huang

College of Computer and Control Engineering, Nankai University, China
College of Software, Nankai University, China

Article history: Received 3 September 2015; Revised 4 May 2016; Accepted 25 June 2016.

Keywords: Conditional random fields; Neural network; Acronym; Sequence labeling; Multi-granularity

Abstract

Identifying expansion forms for acronyms is beneficial to many natural language processing and information retrieval tasks. In this work, we study the problem of finding expansions in texts for given acronym queries by modeling the problem as a sequence labeling task. However, it is challenging for traditional sequence labeling models like Conditional Random Fields (CRF) due to the complexity of the input sentences and the substructure of the categories. In this paper, we propose a Latent-state Neural Conditional Random Fields model (LNCRF) to deal with the challenges. On one hand, we extend CRF by coupling it with nonlinear hidden layers to learn multi-granularity hierarchical representations of the input data under the framework of Conditional Random Fields. On the other hand, we introduce latent variables to implicitly capture the fine granular information from the intrinsic substructures within the structured output labels. The experimental results on real data show that our model achieves the best performance against the state-of-the-art baselines.

1. Introduction

Acronyms (e.g., CRF) are compressed forms of terms and are used as substitutes for the fully expanded forms (e.g., conditional random fields). In much of the literature, especially in the scientific and engineering fields, the number of acronyms is increasing at an astounding rate. By using acronyms, people avoid repeating frequently used long phrases. For example, 'ROM' is often used to refer to 'Read Only Memory', and 'HIV' often takes the place of the long phrase 'Human Immunodeficiency Virus'. Such abbreviations or acronyms convey exactly the same information with fewer words, which simplifies writing and reading. However, acronyms obstruct readers who lack the domain-specific knowledge. Acronyms also present serious problems for Natural Language Processing (NLP) and Information Retrieval (IR) algorithms. Acronyms and abbreviations that are not common enough to be part of daily conversation are typically absent from lexicons and may be treated as misspelled words, which hurts NLP algorithms. Moreover, the presence of acronyms in text hinders the automatic construction of the very lexicons that are needed. Acronyms must therefore be handled explicitly in related NLP and IR tasks.

Previous acronym search systems are mainly based on two-step methods, i.e., acronym identification and expansion identification. Since acronym identification is relatively easy to solve using lexical methods, expansion finding is




Fig. 1. An example tagging of a training sentence for expansion finding. The label 'B' stands for 'Beginning of an expansion', 'I' stands for 'Inside of an expansion', and 'O' stands for 'Others'.

the major bottleneck for many IR and NLP tasks. Previous methods mainly exploit pattern matching or supervised learning. Pattern-matching methods often fail because of the variety of ways in which acronyms are constructed; it is very difficult to design sufficient and precise rules or patterns to obtain good precision and recall. Recently, supervised methods have been widely adopted to overcome the shortcomings of rule-based methods. Machine learning methods that learn the patterns automatically from large corpora have demonstrated advantages over other kinds of acronym-expansion finding methods. Furthermore, supervised methods based on structured prediction models, i.e., Conditional Random Fields (CRF), have been shown to be more appropriate for the expansion identification task [14].

All these methods face two major challenges in this task. One is the variation of the expansion forms; the other is the complex latent dynamics within the expansions. Regarding the first challenge, the expansion forms vary a lot, e.g., spelling, morphological, syntactic and semantic variations, term synonymy and homonymy. The performance of traditional linear-chain CRFs depends heavily on the quality of the input features, which are costly and difficult to engineer by hand. Hand-crafted features are often noisy and redundant, which constrains the performance of CRF. More importantly, regarding the second challenge, there are complex fine-grained structures within the structured output, which is a sequence of labels. Such substructures can essentially be regarded as granular structure information in granular computing [39]. Ignoring the fine granular structures may lead to important information loss. As far as the expansion identification task is concerned, the contexts of acronym expansions often have more complex underlying structures, and a widely used labeling scheme like 'BIO' is too coarse to fully encapsulate the syntactic and query-matching behavior of word sequences. Taking the acronym query 'NASA' with the expansion 'National Aeronautics and Space Administration' as an example, as shown in Fig. 1, the words 'and' and 'Space' are both labeled as inside-expansion 'I', yet they match the query differently and serve different roles in the constitution of the expansion. Hence, the dependence between the neighboring word token pairs (e.g., <'and', 'Space'> and <'Space', 'Administration'>) should be different, even though their labels are identical. In practice, given limited data, the relationship between specific words and their orthographic or syntactic contexts may be better modeled at a level finer than class labels. In other words, the underlying fine-grained structure is important intermediate information between input features and labels.

In this paper, we propose a novel multi-granularity sequence labeling model, Latent-state Neural Conditional Random Fields (LNCRF), to deal with the two challenges described above in the problem of acronym expansion identification. First, to alleviate the impact of the variations of the input sentences, we combine Conditional Random Fields (CRF) [9] with nonlinear hidden layers, which extends CRF to a deep learning model [11] to some extent. Similar to other deep learning models, the hidden layers empower CRF to learn invariant higher-level features from input sequences automatically. From the point of view of granular computing, such multiple levels of abstraction of data correspond to hierarchical granularity [36].
Moreover, we further introduce a set of latent-state variables to capture the finer granular structures within the structured output. Using the latent states, the coarse output space is partitioned into finer information granules. Learning the latent variables serves as a seamlessly integrated granulation process over the sequential tokens within sentences for acronym expansion identification. In summary, this new sequence labeling model for acronym expansion identification is able to capture the hidden substructures of each class and, at the same time, learn the nonlinear relationships between complex input features and class labels. We evaluate our model on a real dataset collected from Wikipedia [14]. Experimental results show that the proposed approach achieves superior performance against the state-of-the-art baselines.

The rest of this paper is organized as follows. In Section 2, we introduce previous work that is related to this paper. In Section 3, we give a brief formal description of the task. In Section 4, we describe the proposed methodology for the expansion finding task, including the Latent-state Neural Conditional Random Fields architecture and the learning algorithm. We then compare the proposed methods with existing representative methods on real data collected from Wikipedia in Section 5. Finally, conclusions are given in Section 6.

2. Related work

To find acronym expansions, current research mainly extracts pairs of <acronym, expansion> from texts, for example with Conditional Random Fields (CRF). Most existing methods fall into two categories: pattern-matching techniques and machine learning based methods. Pattern-matching techniques design rules and patterns to find the longest common substring. Representative works include AFP (Acronym Finding Program) [32] and TLA (Three Letter Acronym) [37]. Recently, machine learning based methods have been preferred because the pattern-matching methods require more human effort in designing and tuning the rules and patterns; they include example-based methods [16,19,35] and sequence-based methods [14,20].


In this work, the acronym expansion identification task is cast as a sequence labeling task, which is a common NLP problem. Sequence labeling covers tasks such as chunking [2], named entity recognition [18] and information extraction [3]. The CRF [9] model and its related models have achieved state-of-the-art performance in many kinds of structured prediction tasks [20,26]. Standard CRFs model the interactions between labels, which gives them the ability to learn the structures and dynamics of sequence data. To deal with the complexity and nonlinearity of sequence data, Lafferty et al. [10] present a kernelized version of linear CRF to enable nonlinear mapping. Recently, models combining neural networks and CRF [5,25,41] have been proposed. Many CRF-based models have been applied successfully to sequence labeling tasks in natural language processing [24,29,30,33], bioinformatics [15,28], and computer vision [7,8], benefiting from their ability to learn the structures and dynamics of sequence data.

Due to the high variety and complexity of the task, it is favorable to introduce the idea of granular computing [23] to uncover the latent substructure of natural language sentences. Information granules are collections of entities that are arranged together due to their similarity, functional or physical adjacency, indistinguishability, coherency, or the like [39]. Granulation of the tokens in the sentences makes it possible to capture the latent finer granular information contained in the acronym expansions. Fuzzy sets, rough sets and their combinations have been extensively explored for granular computing [22,40,42]. Granular computing approaches have been applied in a wide range of applications [1,6,12]. As a general principle and methodology of problem solving, its core idea exists widely in many important machine learning approaches. In particular, deep learning can be regarded as a multi-granularity method: it learns multi-level features which can be viewed as granules at different levels. In our model, we also exploit a hidden neural layer to learn an intermediate granular feature representation. In the field of natural language processing, there has been very little work on granular computing. Only very recently has the idea of granular computing been exploited in sentence matching [34,38] and short text classification [4]. For sequence labeling tasks, Latent-dynamic Conditional Random Fields (LDCRF) [17] were proposed for gesture recognition via learning latent variables for the substructures of gestures, which is, to some extent, an instance of the concept of granule. Inspired by the success in computer vision tasks, LDCRFs have been applied to some NLP tasks [14,31] and achieve superior performance against standard CRFs. However, despite this progress, these previous latent-variable CRF models cannot learn high-level features and substructures simultaneously.

3. Task definition

The goal of expansion identification is to recognize the tokens that constitute the long form of a given acronym. For example, as shown in Fig. 1, when the acronym 'NASA' is given, the aim is to find the word sequence 'National Aeronautics and Space Administration' in the sentence. Such a task is closely related to sequence labeling tasks [14]. Different from previous acronym recognition applications, the aim of the task considered in this paper is to identify all appropriate long forms existing in the corpus, as there is no context for the issued acronym queries.
In contrast, previous work can only handle co-occurring pairs of acronym and expansion in the same sentence. Actually, such co-occurrences are not common in the corpus, because pairs like <acronym, expansion> are unlikely to appear together once the definition (expansion) of an acronym has been given. Taking into account expansions that appear without the acronym provides more positive examples and useful information for prediction, which is ignored by many methods.

Formally, each sentence x = (x_1, x_2, x_3, ..., x_T) in a corpus consists of a sequence of tokens x_t ∈ R^d, each representing its own characteristics and its compatibility with an acronym query q. Our task is to predict the class label y_t ∈ Y for each token x_t in the sentence. Most previous works treat expansion identification as a binary classification task with Y = {0, 1}, using an example-based classification model [35] such as Support Vector Machines. Even though some works include features of the surrounding tokens of x_t, e.g., the form of x_{t+1} and/or x_{t-1}, the classifier determines the class label y_t for each token x_t independently. It has been a common assumption in token classification tasks that a class label can be decided independently of the other class labels. However, expansions clearly exhibit inter-dependence between neighboring tokens: a token is more likely to be part of the expansion sequence if its preceding token is. Thus, the task is more suitably formalized as a sequence labeling problem: given an acronym query q and a sentence with tokens x = (x_1, ..., x_T), determine the optimal sequence of token labels y = (y_1, ..., y_T) for each sequence in the corpus. With the predicted label sequence y, one can identify whether there is an expansion for the acronym term q and which subsequence of the sentence is the corresponding expansion.

Casting the expansion identification task as a sequence labeling problem, we adopt the widely used 'BIO' scheme to label the training data. This scheme is widely used in many NLP tasks, such as NP chunking [27] and named entity recognition. It has also been shown that attaching B- and I- prefixes to token labels may improve a classifier by associating phrase boundary information with the starts of expansions [14]. An example of the BIO labeling method is shown in Fig. 1. The labels used in our model are listed in Table 1.

4. Methodology

We consider the problem of expansion finding in text given acronym queries. In this section we first describe a Neural Conditional Random Fields (NCRF) model that enables CRF to learn higher-level features from raw input by combining CRF with neural networks. Then we further introduce a latent state layer to model the substructures of the NCRF model, which we call Latent-state Neural Conditional Random Fields (LNCRF).


Table 1
Label list for expansion finding.

Label   Meaning
B       Beginning of expansion
I       Inside of expansion
O       Others
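To make the scheme in Table 1 concrete, the following minimal sketch pairs each token of a sentence with its BIO label for the query 'NASA'. The sentence wording and tokenization are hypothetical stand-ins for the example in Fig. 1, not the authors' preprocessing code.

```python
# Hypothetical BIO annotation for the acronym query "NASA" (cf. Fig. 1 and Table 1).
query = "NASA"
tokens = ["The", "National", "Aeronautics", "and", "Space",
          "Administration", "was", "established", "in", "1958", "."]
labels = ["O", "B", "I", "I", "I",
          "I", "O", "O", "O", "O", "O"]

# One training example is the pair (token sequence, label sequence).
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```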

Fig. 2. Comparison of the standard CRF model (a), the NCRF model (b), and the LNCRF model (c). Gray circles are observed variables. LNCRF combines the strengths of LDCRF and NCRF to learn the intrinsic fine-grained granular states and high-level features, which can naturally be applied to recognize acronym expansions in sentences.

4.1. Neural conditional random fields

Linear-chain CRFs have been widely used in sequence labeling tasks such as Part-Of-Speech (POS) tagging, chunking (CHUNK), Named Entity Recognition (NER) and Semantic Role Labeling (SRL). We apply a CRF-based model to our task for its ability to incorporate arbitrary feature functions over observations without complicating training. Formally, let x = (x_1, x_2, ..., x_T) denote a sequence of T terms and y = (y_1, y_2, ..., y_T) the corresponding state (field) sequence. Each y_t ∈ Y can take one of the pre-defined categorical values shown in Table 1. The conditional probability P(y|x) is given by

P(y \mid x; \theta) = \exp\big( \theta^{\top} F(y, x) - \log Z(x; \theta) \big)    (1)

where F(y, x) = \sum_{t} f(y_t, y_{t-1}, x_t), and Z(x; θ) is a partition function that normalizes the exponential form over all possible label sequences y of length |x| so that it is a probability distribution. It is defined as

Z(x; \theta) = \sum_{y} \exp\Big( \sum_{t} \Phi(y_t, y_{t-1}, x_t; \theta) \Big)    (2)

and \Phi is a parametric potential function

\Phi(y_t, y_{t-1}, x_t; \theta) = \theta^{\top} f(y_t, y_{t-1}, x, t)    (3)

where θ is a vector of linear weights and f(y_t, y_{t-1}, x, t) computes a set of features for the node at position t. Typically, given a set of training examples D = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}, where x^{(n)} ∈ X and y^{(n)} ∈ Y, the linear weights can be estimated by maximizing the penalized log-likelihood of the conditional probability



\max_{\theta} \sum_{n=1}^{N} \log P(y^{(n)} \mid x^{(n)}; \theta)    (4)
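As a worked illustration of Eqs. (1)-(4), the sketch below computes log Z(x; θ) with the standard forward recursion and the per-sequence log-likelihood. It assumes the potentials of Eq. (3) have already been collapsed into dense per-position score arrays; it is a pedagogical sketch, not the authors' implementation.

```python
import numpy as np

def log_partition(unary, pairwise):
    """Forward recursion for log Z(x; theta) of a linear-chain CRF.

    unary:    (T, K) array, unary[t, y] = score of label y at position t
    pairwise: (K, K) array, pairwise[i, j] = score of the transition i -> j
    """
    T, K = unary.shape
    alpha = unary[0].copy()                     # log-scores of length-1 prefixes
    for t in range(1, T):
        # alpha_new[j] = logsumexp_i( alpha[i] + pairwise[i, j] ) + unary[t, j]
        scores = alpha[:, None] + pairwise + unary[t][None, :]
        alpha = np.logaddexp.reduce(scores, axis=0)
    return np.logaddexp.reduce(alpha)

def log_likelihood(unary, pairwise, y):
    """log P(y | x; theta) = score(y, x) - log Z(x; theta), cf. Eqs. (1) and (4)."""
    score = unary[0, y[0]] + sum(unary[t, y[t]] + pairwise[y[t - 1], y[t]]
                                 for t in range(1, len(y)))
    return score - log_partition(unary, pairwise)
```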

There are efficient exact inference algorithms for linear-chain CRFs, such as the Viterbi algorithm [26].

To give the CRF model the ability to learn features, we exploit a nonlinear transformation layer that computes hidden representations of the input observation vectors. The hidden representation can be regarded as powerful features learned from the input data. Considering a sequence of observations x and labels y, NCRF, which is a nonlinear-chain CRF, is defined as



P(y \mid x; \theta, \alpha) = \frac{1}{Z(x; \theta, \alpha)} \exp\Big( \sum_{t} \theta^{\top} f(y_t, y_{t-1}, \phi(x; \alpha), t) \Big)    (5)

In its simplest form, φ ≡ φ(x; α) is a sequence of higher-level feature representations of x obtained by feeding each x_t to the function φ(x_t; α), a nonlinear function x → R^M with parameters α. The function φ(x_t; α) acts as a trainable higher-level feature extractor whose parameters are learned simultaneously with the sequence labeling loss. In other words, the NCRF model, as shown in Fig. 2(b), puts a CRF on top of φ(x_t; α), which amounts to a hidden layer.
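A minimal sketch of this construction, assuming a single tanh hidden layer as φ(x_t; α) (consistent with Eqs. (6)-(7) below) whose output feeds linear CRF potentials. Parameter names and sizes are illustrative, not the authors' code; the resulting scores can be plugged into the forward recursion sketched after Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, K = 20, 10, 3                    # input dim, hidden units, labels (B, I, O)

W, b = rng.normal(size=(M, d)), np.zeros(M)   # alpha = {W, b}: hidden-layer parameters
U = rng.normal(size=(K, M))                   # theta (unary part): label-emission weights
A = rng.normal(size=(K, K))                   # theta (pairwise part): label-transition weights

def ncrf_potentials(x):
    """Map raw token features x (T, d) to CRF potentials, cf. Eq. (5)."""
    phi = np.tanh(x @ W.T + b)        # phi(x_t; alpha), the learned feature layer
    unary = phi @ U.T                 # linear potentials on top of the hidden layer
    return unary, A                   # unary (T, K) and pairwise (K, K) scores
```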


In fact, our implementation of φ(x_t; α) is itself a neural network. The fixed-size vector x_t is first fed to a linear function which performs a linear transformation of the inputs:

g_i(x_t) = \omega_i^{\top} x_t + b_i    (6)

Then a nonlinear transfer function is applied to introduce nonlinearity:

\phi_i(x_t) = \varrho\big( g_i(x_t) \big)    (7)

where i = 1, ..., M, ϱ is a nonlinear (tanh) transfer function, and the parameters α include all the weights ω and bias terms b. The overall NCRF hybrid therefore contains two nonlinear hidden layers and one structured-output layer.

The model in Eq. (5) differs from the traditional CRF in Eq. (1) in its learning process, as shown in Fig. 2. It not only learns the linear parameters θ but also optimizes the features by tuning the transformation parameters α of the transformation function φ. This capacity for feature learning benefits the model by learning higher-level features for classification from complex input data. Recently, Deep Neural Network (DNN) based deep learning approaches [11] have been widely exploited to learn multiple levels of features, and each level can be interpreted as a granularity [42]. Similarly, the representation output by the hidden layer of NCRF can be viewed as a higher-level, more abstract granularity learned from the low-level input, which is the finest granularity in our task.

4.2. Learning latent states for neural CRF

We further present LNCRF (Latent-state Neural Conditional Random Fields) to solve the problem of sequence labeling with finer granular structures. By introducing latent states into NCRF, LNCRF is designed to be a nonlinear sequence labeling model that can learn the substructure of sequence labels. As shown in Fig. 2(c), the LNCRF model contains two layers of latent variables, each serving a different purpose. The first layer, the neural network layer, aims to learn the nonlinear mapping from the input sequences. The second layer, the latent state layer, models sequential substructure in the label sequence.

Given an information system S = (U, At, {V_a | a ∈ At}, {I_a | a ∈ At}), the universe U = {x_1, ..., x_n} is a non-empty set of objects, which in this task are the tokens of the sentences. Assume that each object is associated with a unique class label y ∈ Y. The set of attributes is expressed as At = F ∪ Y, where F is the set of attributes describing the tokens. Objects are divided into disjoint classes which form a partition π_Y of the universe U. In the LNCRF model, the attributes correspond to the features φ_t learned by the neural layer. Each sentence is a sequence of tokens x = {x_1, x_2, ..., x_T}; the nodes are transformed by the lower nonlinear layer, resulting in a sequence of higher-level features φ = {φ_1, φ_2, ..., φ_T}. Each vector φ_t ∈ R^M is shorthand for φ(x_t; α).

In order to capture the finer granular information of the substructure, a sequence of hidden variables h = {h_1, h_2, ..., h_T} is introduced to model the hidden or latent dynamics of the process, where each hidden state h_t is a member of H. Specifically, each coarse granule y_j is further partitioned into finer granules by a set of hidden states H_{y_j} whose elements are the possible hidden states for the class label y_j, and H is the union of all H_{y_j}. With the latent states, we obtain a new partition π_H of U. Since π_H is a refinement of π_Y, we have π_H ⪯ π_Y. Based on this refinement relation, multi-level granulations of the universe are constructed. The membership score of each object (i.e., token) in the finer-level granules is inferred in a sequence learning manner using the following function:



P(h \mid \phi; \theta, \alpha) = \frac{1}{Z(\phi; \theta, \alpha)} \exp\Big( \sum_{k} \theta_k \cdot F_k(h, \phi) \Big)    (8)

where the partition function Z is defined as:

Z(\phi; \theta, \alpha) = \sum_{h} \exp\Big( \sum_{k} \theta_k \cdot F_k(h, \phi) \Big)    (9)

Instead of being calculated separately, the membership function is integrated with the loss function of the sequence labeling model defined in NCRF, as shown in Eq. (5), which results in the Latent-state Neural Conditional Random Fields (LNCRF) model:

P(y \mid \phi; \theta, \alpha) = \sum_{h} P(y \mid h, \phi; \theta, \alpha) \cdot P(h \mid \phi; \theta, \alpha)    (10)

where θ is the parameter vector defining the sequence output layer, and α are the parameters introduced by the nonlinear hidden layer defined in Eq. (7). Since any sequence h containing an h_j ∉ H_{y_j} has P(y | h, φ; θ, α) = 0, the LNCRF model can be formulated as:

P(y \mid \phi; \theta, \alpha) = \sum_{h : \forall h_j \in H_{y_j}} P(h \mid \phi; \theta, \alpha)    (11)

Due to the hierarchical levels of the granules, each coarse granule y_j ∈ Y has a disjoint set of hidden states H_{y_j} corresponding to the finer-level granules; training and inference of the model are therefore tractable.
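A small sketch of the restriction behind Eq. (11), assuming (illustratively) two hidden states per label: each label y owns a disjoint set H_y, and the per-position label probability is obtained by summing the latent-state marginals over H_y. This is an illustration of the granulation idea, not the authors' code.

```python
import numpy as np

labels = ["B", "I", "O"]
states_per_label = 2                       # a tuned hyperparameter (2 to 5 in Section 5.3)
H = {y: [i * states_per_label + k for k in range(states_per_label)]
     for i, y in enumerate(labels)}        # disjoint sets, e.g. {'B': [0, 1], 'I': [2, 3], 'O': [4, 5]}

def label_marginals(state_marginals):
    """Collapse per-position latent-state marginals (T, |H|) into label marginals.

    Because the sets H_y are disjoint, P(y_t = y | x) is the sum of P(h_t | x)
    over h_t in H_y, the per-position analogue of the restriction in Eq. (11).
    """
    return {y: state_marginals[:, idx].sum(axis=1) for y, idx in H.items()}

# Example: uniform latent-state marginals for a 4-token sentence.
marg = label_marginals(np.full((4, 6), 1.0 / 6))
```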


Fig. 3. The architecture of LNCRF with multiple levels of granularity learning.

4.3. Learning the parameters

Given a set of manually labeled pairs {(x_i, y_i)}_{i=1}^{m}, we can estimate the model parameters in a supervised fashion. The architecture of the proposed LNCRF is shown in Fig. 3. Note that the decoding graph for LNCRF is the same as those shown in Fig. 2(a) and (b). In supervised training, we aim to estimate Θ = {θ, α} so as to maximize the conditional likelihood of the training data while regularizing the model parameters:

J = \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \Theta) - \frac{1}{2\sigma^2} \|\Theta\|^2    (12)

We use a gradient-based method to search for the optimal parameters; in our experiments, LBFGS [13] is used to optimize the objective function. The partial derivatives of the supervised LNCRF objective are:

\frac{\partial J}{\partial \theta_i} = \sum_{t} f_i(h_t, h_{t-1}, \phi, t) - \sum_{t} \sum_{h, h'} f_i(h, h', \phi, t)\, P(h, h' \mid \phi) - \frac{1}{\sigma^2} \theta_i    (13)

\frac{\partial J}{\partial \alpha_k} = \sum_{t} \sum_{i} \theta_i \frac{\partial \phi_{t,i}}{\partial \alpha_k} - \sum_{t} \sum_{h, h'} \sum_{i} \theta_i \frac{\partial \phi_{t,i}}{\partial \alpha_k} P(h, h' \mid \phi) - \frac{1}{\sigma^2} \alpha_k    (14)

where φ_{t,i} is the i-th element of the hidden feature vector φ_t at time t. The partial derivative ∂φ_{t,i}/∂α_k can easily be calculated by the back-propagation algorithm of the neural network, which we omit for lack of space. Note that the marginal probabilities P(h, h' | x; Θ) can be computed efficiently using belief propagation [21].

With the hidden layer, each input token x ∈ R^d is mapped to φ ∈ R^M. Hence, the computational complexity of the M neurons over a sequence is O(TMd), where T is the length of the input sequence. For an inference algorithm like belief propagation, the computational complexity of LDCRF is O(TKd + TK^2). Hence, the total computational complexity of LNCRF inference is O(TMd + TKM + TK^2), where K = \sum_{i=1}^{|Y|} |H_{y_i}|. Usually, it is reasonable to expect an increase in computation coming from the hidden nodes and latent states. Under some conditions, however, if we learn a compact high-level feature φ from a high-dimensional input x, i.e., M ≪ d, the computational complexity can even be lower than that of LDCRF.

4.4. Inference

Given the model parameters Θ = {θ, α}, prediction for a new test sequence x amounts to estimating the most probable label sequence y* that maximizes our conditional model:

y^{*} = \arg\max_{y} P(y \mid x, \Theta)    (15)

In this study, belief propagation is used to compute the marginal probability P(y_i = a | x, Θ) for each possible state a ∈ Y. The predicted label y_i^* is then the one associated with the maximum sum of marginal probabilities.
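A sketch of this decoding rule, reusing the label-to-hidden-state mapping H from the sketch in Section 4.2 and assuming the per-position latent-state marginals have already been computed by belief propagation; it is an illustration of the rule described above, not the authors' implementation.

```python
import numpy as np

def decode(state_marginals, H, labels):
    """Pick, at each position, the label whose hidden states carry the largest
    total marginal probability, as described in Section 4.4.

    state_marginals: (T, |H|) array of P(h_t = h | x) from belief propagation
    H:               dict mapping each label to its disjoint list of state indices
    """
    label_scores = np.stack(
        [state_marginals[:, H[y]].sum(axis=1) for y in labels], axis=1)  # (T, |Y|)
    return [labels[i] for i in label_scores.argmax(axis=1)]
```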


4.5. Feature functions

We exploit multiple feature functions. Typically, two types of features are used in CRFs: transition features and emission features. A transition feature is a binary function that indicates whether a transition (h_{t-1} = i, h_t = j) occurs, i.e., it captures the dependency between neighboring states on an edge rather than on a single vertex. The transition function has the form

F^{e}_{i,j}(x, h) = \sum_{t} f^{e}_{i,j}(h_{t-1}, h_t, \phi, t)

f^{e}_{i,j}(h_{t-1}, h_t, \phi, t) = \delta(h_{t-1} = i)\,\delta(h_t = j)    (16)

There are |H| × |H| transition functions, each corresponding to a hidden state pair (h, h').

Algorithm 1. Training algorithm for LNCRF.
Require: a training set of N labeled sequences {(x^(n), y^(n))}, n = 1, ..., N.
Require: a nonlinear transformation function φ which maps input observations to M-dimensional feature vectors.
Require: the feature functions defined in Section 4.5.
  Initialize the parameters Θ randomly.
  repeat
    for all parameters θ_i ∈ Θ and α_k ∈ Θ do
      estimate the partial derivative ∂J/∂θ_i using Eq. (13)
      estimate the partial derivative ∂J/∂α_k using Eq. (14)
    end for
    Use the LBFGS [13] method to update Θ.
  until convergence
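A compact sketch of the loop in Algorithm 1, using SciPy's L-BFGS-B optimizer as a stand-in for the LBFGS routine of [13]. The helper `objective_and_gradient`, assumed to implement Eqs. (12)-(14) over a flattened parameter vector, is hypothetical; this is not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def train_lncrf(data, n_params, objective_and_gradient, sigma=1.0, seed=0):
    """Fit the flattened parameter vector Theta = (theta, alpha) by L-BFGS.

    data:                   list of (x, y) training sequences
    objective_and_gradient: callable(params, data, sigma) -> (J, dJ/dparams),
                            assumed to implement Eqs. (12)-(14)
    """
    params0 = np.random.default_rng(seed).normal(scale=0.1, size=n_params)

    def neg_obj(params):
        J, grad = objective_and_gradient(params, data, sigma)
        return -J, -grad                 # minimize the negative penalized log-likelihood

    result = minimize(neg_obj, params0, jac=True, method="L-BFGS-B",
                      options={"maxiter": 200})
    return result.x
```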

An emission feature is a binary function that indicates whether an observation-dependent feature co-occurs with hidden state j. We define the unigram feature function as

F^{v}_{w,j}(x, h) = \sum_{t} f^{v}_{w,j}(h_{t-1}, h_t, \phi, t)

f^{v}_{w,j}(h_{t-1}, h_t, \phi, t) = \delta(\phi_t = w)\,\delta(h_t = j)    (17)

where φ_t ≡ φ(x_t) is the feature vector for x_t. For a standard CRF, φ(x_t) is the raw input vector, while for LNCRF it is the output of the hidden layers. Assuming φ_t has dimension M, we have |Y| × M state functions.

The emission function above computes features on an individual vertex, whereas a transition function computes features on an edge of the graph. In our study, we also use another type of transition function:

F^{ve}_{w,i,j}(x, h) = \sum_{t} f^{ve}_{w,i,j}(h_{t-1}, h_t, \phi, t)

f^{ve}_{w,i,j}(h_{t-1}, h_t, \phi, t) = \delta(\phi_t = w)\,\delta(h_{t-1} = i)\,\delta(h_t = j)    (18)

There are |H| × |H| × M functions of this type; they consider the emission and transition information simultaneously.

4.6. Features

For expansion identification, we exploit the features used in [14] to capture various kinds of information that might be useful for acronym expansion recognition. Three kinds of features are used in the proposed model. 1) Orthographical features describe the structure of each target token (i.e., the word to be labeled) without considering the query (i.e., the acronym). These features are important because people often use orthographical cues to emphasize the tokens of expansions; they are listed in Table 2. 2) Token-query features describe the relationships between the target token and the given acronym query, for example, whether the first character of the token occurs in the given acronym, or whether the token contains an upper-case letter that occurs in the acronym. 3) Context features represent the compatibility of the acronym query with the context window of each target token, because the neighboring tokens are important indicators of a target token's category. For example, it is more plausible that 'Basketball' is part of the expansion of 'NBA' if one notices the preceding token 'National'. The context features are extracted with a window of size 3, e.g., whether the capitalized letter Cap_t of the token at position t appears in the acronym and the capitalized letter Cap_{t+1} of the next token appears in the acronym after the letter Cap_t.


Table 2
Orthographical features.

Feature name    Regular expression
ALLCAPS         [A-Z]+
INITCAP         ^[A-Z].*
CAPMIX          .*[A-Z][a-z].*|.*[a-z][A-Z].*
HAS DASH        .*-.*
PUNCTUATION     [,.;?!-+]
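A small sketch of extracting the orthographical features of Table 2 with Python regular expressions. Full-match semantics and the feature-name identifiers are assumptions; the regular expressions are adapted from Table 2.

```python
import re

# Regular expressions adapted from Table 2 (full-match semantics assumed).
ORTHO_PATTERNS = {
    "ALLCAPS":     r"[A-Z]+",
    "INITCAP":     r"[A-Z].*",
    "CAPMIX":      r".*[A-Z][a-z].*|.*[a-z][A-Z].*",
    "HAS_DASH":    r".*-.*",
    "PUNCTUATION": r"[,.;?!\-+]",
}

def orthographic_features(token):
    """Return the binary orthographical features of a single token."""
    return {name: bool(re.fullmatch(pat, token)) for name, pat in ORTHO_PATTERNS.items()}

# e.g. orthographic_features("NASA") -> ALLCAPS and INITCAP fire, the others do not.
```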

5. Experiments

In this section, we use real data collected from Wikipedia.org to empirically validate the effectiveness of the proposed LNCRF model for acronym expansion identification. We first describe the experimental configuration: the dataset, metrics and baselines. We then present the experimental results along with sensitivity and efficiency analyses. Finally, some case studies illustrate intuitively how the LNCRF model helps.

5.1. Data

We evaluate the proposed method on a corpus [14] created from Wikipedia.org. The content of Wikipedia.org, a free online encyclopedia, is created by many volunteers. Each entry defines and describes an entity or an event, and contains many acronyms and their definitions, namely expansions. It is worth noting that acronyms and their corresponding definitions may not appear in pairs. With a set of randomly selected acronyms as queries, all possibly relevant pages were crawled using the queries as seeds. The HTML tags were removed and the text was segmented into sentences. All labels in the corpus were assigned manually. The whole dataset contains 255 unique acronyms, 1372 distinct expansions, 6185 expansion sentences and 108,830 words. On average, each acronym has 4.02 letters and 5.2 distinct expansions, and each expansion contains 3.11 words. The average sentence length is 17.6 words.

5.2. Evaluation criteria

We evaluate our approach on the task of expansion identification using metrics of two kinds: (1) token-level evaluation, which measures how well individual word tokens are tagged; and (2) expansion-level evaluation, which measures how well the target subsequences (expansions in our case) are tagged, meaning that the entire decoded state sequence corresponding to an acronym expansion has to be correct. Both strategies are used to evaluate each acronym expansion identification approach.

Three criteria are employed to evaluate the performance per token, namely precision, recall and F1 score. Since the dominant majority of tokens are labeled 'O', the data is very unbalanced, so we report the precision PrecPerToken_i, recall ReclPerToken_i and F1 score F1PerToken_i for each category y_i ∈ {B, I, O} instead of the average over all categories. In this way, the reported measures are not overwhelmed by tokens labeled 'O'. A_i is defined as the number of correctly labeled tokens of class y_i, and B_i, C_i and D_i are defined as the numbers of false negative, false positive and true negative tokens of class y_i, respectively.

\mathrm{PrecPerToken}_i = \frac{A_i}{A_i + C_i}    (19)

\mathrm{ReclPerToken}_i = \frac{A_i}{A_i + B_i}    (20)

\mathrm{F1PerToken}_i = 2 \cdot \frac{\mathrm{PrecPerToken}_i \cdot \mathrm{ReclPerToken}_i}{\mathrm{PrecPerToken}_i + \mathrm{ReclPerToken}_i}    (21)

We note that the recall is biased, because the dataset is not built on a random sample of the whole web. However, it is still expected to correlate highly with the true recall, and we believe it provides useful information for comparing different methods. We also report the accuracy per token:

\mathrm{AccPerToken}_i = \frac{A_i + D_i}{A_i + B_i + C_i + D_i}    (22)

To evaluate the performance per expansion, label sequences beginning with 'B' and ending right before the next 'O' are first extracted; such label sequences correspond to predicted expansions. For instance, 'International Conference on Data Mining' is found from the label sequence 'B-I-I-I-I' shown in Fig. 1. When a detected expansion is exactly right, a correct


Fig. 4. The experimental results per label: (a) precision, (b) recall, and (c) F1 score.

Table 3
Experimental results per expansion.

Method       Precision   Recall   F1
Linear SVM   0.6742      0.7523   0.7106
Kernel SVM   0.6971      0.7311   0.7121
CRF          0.8353      0.8720   0.8532
NCRF         0.8429      0.9170   0.8784
LNCRF        0.9193      0.9163   0.9178

expansion is counted. We define the precision, recall and F1 score per expansion as follows:

\mathrm{PrecPerExpan} = \frac{\#\ \text{of correct expansions found}}{\text{total}\ \#\ \text{of expansions found}}    (23)

\mathrm{ReclPerExpan} = \frac{\#\ \text{of correct expansions found}}{\text{total}\ \#\ \text{of expansions in the corpus}}    (24)

\mathrm{F1PerExpan} = 2 \cdot \frac{\mathrm{PrecPerExpan} \cdot \mathrm{ReclPerExpan}}{\mathrm{PrecPerExpan} + \mathrm{ReclPerExpan}}    (25)
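The sketch below illustrates how predicted expansions can be read off a BIO label sequence and scored with Eqs. (23)-(25). The exact span-matching convention (exact-boundary agreement of (start, end) spans) is an assumption based on the description above, not the authors' evaluation script.

```python
def extract_spans(labels):
    """Return (start, end) pairs of predicted expansions: a span starts at a 'B'
    and extends over the following 'I' labels, as described in Section 5.2."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "B":
            if start is not None:
                spans.append((start, i))
            start = i
        elif lab == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

def per_expansion_prf(gold_seqs, pred_seqs):
    """Precision, recall and F1 per expansion over a corpus of label sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = set(extract_spans(gold)), set(extract_spans(pred))
        tp, fp, fn = tp + len(g & p), fp + len(p - g), fn + len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    recl = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * recl / (prec + recl) if prec + recl else 0.0
    return prec, recl, f1
```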

5.3. Comparison models

Several methods are compared to verify the effectiveness of the proposed approach for the expansion finding problem via five-fold cross-validation. These methods fall into two classes: (1) CRF, NCRF and LNCRF are sequence models; (2) linear SVM and kernel SVM are non-sequence models trained on token-based samples. From another view, the methods can be orthogonally classified into linear models, i.e., CRF and linear SVM, and nonlinear models for the other approaches.

Since linear SVM and kernel SVM are non-sequence models, the dependence between tokens is not encoded in them. We decompose the training set into token-based samples, and the input to the SVM is the extracted feature vector of each token. For the non-sequence models, the linear SVM is a multiclass SVM with a linear kernel. The second baseline, a kernel SVM for multiclass classification, is trained with an RBF kernel; an additional kernel parameter γ needs to be tuned. Both the penalty parameter C and the kernel parameter γ are selected via cross-validation.

The standard linear-chain CRF and the nonlinear-chain NCRF are selected as sequence-model baselines. We add a regularizer to avoid overfitting during training, and the regularization parameter is tuned via cross-validation. We also specify the architecture of the hidden layers of the NCRF model: one hidden layer is used and the number of hidden nodes is enumerated in the set {5, 10, 15, 20, 25}. The marginal probabilities of each state for each token are calculated for unlabeled test sequences via belief propagation, and the state with the highest marginal probability is selected as the token label.

Our proposed LNCRF model is trained with the objective function described in Eq. (12). During evaluation, we compute ROC curves using the maximal marginal probabilities of Eq. (15). To compare fairly with NCRF, LNCRF also uses one hidden layer and the number of hidden nodes is enumerated in the set {5, 10, 15, 20, 25}. During training and validation, we vary the number of hidden states per label (from 2 to 5) and the regularization term (with values 10^k, k = -3, ..., 3). For all the sequence models, the three kinds of feature functions described in Section 4.5 are used in the following evaluation.

5.4. Results

The experimental results per token are shown in Figs. 4 and 5. The performance per expansion is shown in Table 3.


Fig. 5. Accuracy per token.

Table 4
Significance test (t-test) results in terms of accuracy for LNCRF against each baseline.

             B        I        O
Linear SVM   0.0001   0.0002   0.0001
Kernel SVM   0.0001   0.0002   0.0001
CRF          0.0001   0.0003   0.0001
NCRF         0.0002   0.0012   0.0002
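For the significance tests reported in Table 4, the sketch below runs a paired t-test with SciPy. Treating the five fold-wise accuracies as paired samples is an assumption (the paper does not spell out the exact test setup), and the accuracy values are purely hypothetical placeholders.

```python
from scipy import stats

# Hypothetical per-fold accuracies on one label from five-fold cross-validation.
lncrf_acc    = [0.962, 0.958, 0.965, 0.960, 0.963]
baseline_acc = [0.941, 0.938, 0.945, 0.939, 0.942]

t_stat, p_value = stats.ttest_rel(lncrf_acc, baseline_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # p < 0.01 would indicate a significant improvement
```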

The performance per token, shown in Fig. 4, clearly indicates that the proposed LNCRF model significantly outperforms the other models in terms of both precision and recall, and hence achieves the best F1 score. The linear SVM and the standard CRF give unsatisfactory results because they lack the ability to learn high-level features. The nonlinear baselines, i.e., kernel SVM and NCRF, perform better than their linear counterparts. However, kernel SVM does not perform as well as NCRF because it ignores the sequence information. It is worth noticing that NCRF is much better than CRF, which demonstrates the effectiveness of the hidden layers for feature learning. Moreover, LNCRF improves on NCRF by a large margin in F1 score, which indicates the effectiveness of modeling the fine-grained intermediate substructure. It can also be seen from Fig. 4 that label 'O' is predicted more accurately than labels 'B' and 'I' by every method in all evaluation measures, because the data is unbalanced: the number of tokens labeled 'O' is approximately fifteen times that of tokens labeled 'B' and eight times that of tokens labeled 'I'. Fig. 5 gives the accuracy per token, which is consistent with the comparison in Fig. 4.

To verify the statistical significance of our experiments, we perform t-tests on the improvements achieved by LNCRF in terms of the accuracy of each category. As shown in Table 4, all p-values are less than 0.01, which means the improvements of the LNCRF model over the baselines are statistically significant.

Table 3 shows the performance per expansion, which reflects the actual effectiveness of each model more directly and precisely. From Table 3 we make the following observations: (1) the linear and nonlinear sequence models CRF, NCRF and LNCRF outperform their non-sequence counterparts, linear SVM and kernel SVM, respectively; (2) the nonlinear sequence models NCRF and LNCRF further improve on the performance of CRF; (3) the sequence model LNCRF, with its latent layer, achieves the best per-expansion performance. Statistical significance tests (t-tests) on the improvements achieved by LNCRF over the other models indicate results similar to Table 4 and are not shown due to space limitations.

In addition to the recognition performance analysis, we give an experimental study of the running time of the training and testing phases. The training and decoding times were measured in wall-clock seconds on an Intel Xeon 8-core 3.4 GHz machine with 16 GB of RAM running Windows 7. These timings only include the core training and decoding processes and exclude all preparatory steps. Since the number of iterations needed during training and the sentence lengths at test time may vary slightly, we report the per-iteration time for training and the per-sentence decoding time for testing. Fig. 6 shows that both the per-iteration training time and the per-sentence decoding time grow nearly linearly with the number of hidden states and hidden nodes on this dataset. Besides, the feed-forward computation and the belief propagation algorithm lead to a very low per-sentence decoding time, which makes the model quite efficient in the test phase.

5.5. Case studies

Two cases of expansion finding for the given acronyms MDAC and CCRC are illustrated in Table 5. Table 5(a) lists the distinct correct expansions for these two acronyms.
For 'CCRC' and 'MDAC', three and four distinct acronym


Fig. 6. Computation time analysis. Per iteration time and per sentence decoding time are used for training and testing time analysis, respectively.

Table 5
Acronym expansion results on two real examples.

(a) Distinct expansions for "CCRC" and "MDAC".

CCRC: Climate Change Research Center; City of Cambridge Rowing Club; Criminal Cases Review Commission
MDAC: Metro Detroit Athletic Conference; Mississippi Department of Agriculture and Commerce; Microsoft Data Access Components; Mental Disability Advocacy Center

(b) Detected results for "CCRC" (13 occurrences in the corpus) and "MDAC" (18 occurrences in the corpus). For each case, the number of true positives (TP), false positives (FP) and false negatives (FN) is shown together with the distinct expansions detected under each judgment.

Acronym  Model  Judgment  #Occurrences  Distinct expansions detected
CCRC     LNCRF  TP        13            ALL
                FP        0             –
                FN        0             –
         NCRF   TP        12            ALL
                FP        3             Cambridge Rowing Club; Royal Commission on
                FN        1             City of Cambridge Rowing Club
         CRF    TP        12            ALL
                FP        2             Cambridge Rowing Club
                FN        1             City of Cambridge Rowing Club
MDAC     LNCRF  TP        18            ALL
                FP        0             –
                FN        0             –
         NCRF   TP        17            ALL
                FP        1             Metro Detroit Athletic Conference North (Thumb Area Members)
                FN        1             Metro Detroit Athletic Conference
         CRF    TP        14            Metro Detroit Athletic Conference; Microsoft Data Access Components; Mental Disability Advocacy Center
                FP        5             Agriculture and Commerce; Area Members
                FN        4             Mississippi Department of Agriculture and Commerce

expansions exist for each of them. Both NCRF and CRF produce some false positive and false negative detections. Taking CCRC as an example, one of its expansions is 'City of Cambridge Rowing Club'. This is the main cause of NCRF's false detections, such as 'Cambridge Rowing Club', which is broken at the token 'of'. We can infer that the word token 'of' in the expansion carries substructure information which cannot be modeled by NCRF and thus leads to wrong tagging. A similar phenomenon occurs in the other cases, including 'MDAC', which we do not elaborate further. The experimental results demonstrate that the nonlinear sequence model LNCRF fits the acronym expansion recognition task better than the other state-of-the-art models compared.


6. Conclusion

In this paper, we present a Latent-state Neural Conditional Random Fields (LNCRF) model for acronym expansion identification. The LNCRF model aims to learn multiple levels of granularity for each token in a sentence. Coupling the CRF with a neural network enables learning higher-level features from human-engineered features, which are usually noisy and redundant. Introducing the latent state layer helps to model the fine-grained intermediate substructures between term labels and input features. The experimental results on real data show that the proposed approach gives significantly better performance on the expansion finding task than existing state-of-the-art methods, including Support Vector Machines, Conditional Random Fields and Neural Conditional Random Fields. In future work, we will further explore the effects of different settings of the latent states and the application of the model to other related sequence labeling tasks.

Acknowledgment

This research is supported by the National Program on Key Basic Research Project (No. 2013CB329304), the Key Program of National Natural Science Foundation of China (No. 61432011), the National Natural Science Foundation of China (No. 61105049 and No. 61222210), the Natural Science Foundation of Tianjin (No. 14JCQNJC00600), the Science and Technology Planning Project of Tianjin (No. 13ZCZDGX01098), the Tianjin Key Laboratory of Cognitive Computing and Application, and the Open Project Foundation of Information Technology Research Base of Civil Aviation Administration of China (No. CAAC-ITRB-201303).

References

[1] R. Al-Hmouz, W. Pedrycz, A. Balamash, Description and prediction of time series: a general framework of granular computing, Expert Syst. Appl. 42 (10) (2015) 4830–4839.
[2] P.V.S. Avinesh, G. Karthik, Part-of-speech tagging and chunking using conditional random fields and transformation-based learning, in: Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages (SPSAL), 2007, pp. 21–24.
[3] C.H. Chang, M. Kayed, M.R. Girgis, K.F. Shaalan, A survey of web information extraction systems, IEEE Trans. Knowl. Data Eng. 18 (10) (2006) 1411–1428.
[4] M. Chen, X. Jin, D. Shen, Short text classification improved by learning multi-granularity topics, in: IJCAI, 2011, pp. 1776–1781.
[5] T.-M.-T. Do, T. Artieres, Neural conditional random fields, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, vol. 9, JMLR.org, 2010.
[6] A. Gacek, Signal processing and time series description: a perspective of computational intelligence and granular computing, Appl. Soft Comput. 27 (2015) 590–601.
[7] X. He, R.S. Zemel, M.A. Carreira-Perpinan, Multiscale conditional random fields for image labeling, in: CVPR, 2004, pp. II-695–II-702.
[8] S. Kumar, M. Hebert, Discriminative fields for modeling spatial dependencies in natural images, in: NIPS, 2003.
[9] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: probabilistic models for segmenting and labeling sequence data, in: ICML, 2001, pp. 282–289.
[10] J. Lafferty, X. Zhu, Y. Liu, Kernel conditional random fields: representation and clique selection, in: ICML, 2004.
[11] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[12] J. Li, C. Mei, W. Xu, Y. Qian, Concept learning via granular computing: a cognitive viewpoint, Inf. Sci. 298 (2015) 447–467.
[13] D.C. Liu, J. Nocedal, On the limited memory BFGS method for large scale optimization, Math. Program. 45 (3) (1989) 503–528.
[14] J. Liu, J. Chen, Y. Zhang, Y. Huang, Learning conditional random fields with latent sparse features for acronym expansion finding, in: CIKM, 2011, pp. 867–872.
[15] Y. Liu, J. Carbonell, P. Weigele, V. Gopalakrishnan, Segmentation conditional random fields (SCRFs): a new approach for protein fold recognition, in: Proceedings of the 9th Annual International Conference on Computational Biology (RECOMB), ACM Press, 2005, pp. 14–18.
[16] P. Menard, S. Ratte, Classifier-based acronym extraction for business documents, Knowl. Inf. Syst. 29 (2) (2011) 305–334.
[17] L.-P. Morency, A. Quattoni, T. Darrell, Latent-dynamic discriminative models for continuous gesture recognition, in: CVPR, 2007, pp. 1–8.
[18] D. Nadeau, S. Sekine, A survey of named entity recognition and classification, Lingvisticae Investigationes 30 (1) (2007) 3–26.
[19] D. Nadeau, P.D. Turney, A supervised learning approach to acronym identification, in: 8th Canadian Conference on Artificial Intelligence (AI 2005), LNAI 3501, 2005, pp. 319–329.
[20] B.A. Osiek, G. Xexéo, L.A.V. de Carvalho, A language-independent acronym extraction from biomedical texts with hidden Markov models, IEEE Trans. Biomed. Eng. 57 (11) (2010) 2677–2688.
[21] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.
[22] A. Pedrycz, K. Hirota, W. Pedrycz, F. Dong, Granular representation and granular computing with fuzzy sets, Fuzzy Sets Syst. 203 (2012) 17–32.
[23] W. Pedrycz, Granular Computing: Analysis and Design of Intelligent Systems, CRC Press, 2013.
[24] F. Peng, A. McCallum, Information extraction from research papers using conditional random fields, Inf. Process. Manage. 42 (4) (2006) 963–979.
[25] J. Peng, L. Bo, J. Xu, Conditional neural fields, Adv. Neural Inf. Process. Syst. 22 (2009) 1419–1427.
[26] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE 77 (2) (1989) 257–286.
[27] L. Ramshaw, M. Marcus, Text chunking using transformation-based learning, in: D. Yarovsky, K. Church (Eds.), Proceedings of the Third Workshop on Very Large Corpora, Association for Computational Linguistics, Somerset, New Jersey, 1995, pp. 82–94.
[28] K. Sato, Y. Sakakibara, RNA secondary structural alignment with conditional random fields, Bioinformatics 21 (2) (2005) ii237–ii242.
[29] B. Settles, ABNER: an open source tool for automatically tagging genes, proteins, and other entity names in text, Bioinformatics 21 (14) (2005) 3191–3192.
[30] F. Sha, F. Pereira, Shallow parsing with conditional random fields, Association for Computational Linguistics, 2003.
[31] X. Sun, L.-P. Morency, D. Okanohara, J. Tsujii, Modeling latent-dynamic in shallow parsing: a latent conditional model with improved inference, in: Proceedings of the 22nd International Conference on Computational Linguistics, vol. 1, Association for Computational Linguistics, 2008, pp. 841–848.
[32] K. Taghva, J. Gilbreth, Recognizing acronyms and their definitions, Inf. Sci. Res. Inst. 1 (1999) 191–198.
[33] B. Taskar, P. Abbeel, D. Koller, Discriminative probabilistic models for relational data, in: Proceedings of the 18th Annual Conference on Uncertainty in Artificial Intelligence (UAI-02), Morgan Kaufmann, 2002, pp. 485–492.
[34] S. Wan, Y. Lan, J. Guo, J. Xu, L. Pang, X. Cheng, A deep architecture for semantic matching with multiple positional sentence representations, in: AAAI, 2016, pp. 2835–2841.
[35] J. Xu, Y. Huang, Using SVM to extract acronyms from text, Soft Comput. (2007) 369–373.
[36] Y. Yao, Perspectives of granular computing, in: 2005 IEEE International Conference on Granular Computing, vol. 1, IEEE, 2005, pp. 85–90.
[37] S. Yeates, Automatic extraction of acronyms from text, in: New Zealand Computer Science Research Students' Conference, 1999, pp. 117–124.


[38] W. Yin, H. Schütze, MultiGranCNN: an architecture for general matching of text chunks on multiple levels of granularity, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), 2015, pp. 63–73.
[39] L.A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets Syst. 90 (2) (1997) 111–127.
[40] X. Zhang, D. Miao, Quantitative information architecture, granular computing and rough set models in the double-quantitative approximation space of precision and grade, Inf. Sci. 268 (2014) 147–168.
[41] Y. Zhou, Q. Hu, J. Liu, Y. Jia, Combining heterogeneous deep neural networks with conditional random fields for Chinese dialogue act recognition, Neurocomputing 168 (2015) 408–417.
[42] P. Zhu, Q. Hu, W. Zuo, M. Yang, Multi-granularity distance metric learning via neighborhood granule margin maximization, Inf. Sci. 282 (2014) 321–331.
