
To appear in: Neurocomputing. PII: S0925-2312(20)30116-8. DOI: https://doi.org/10.1016/j.neucom.2020.01.064. Reference: NEUCOM 21821.
Received 2 July 2019; Revised 15 November 2019; Accepted 15 January 2020.

Please cite this article as: Yujia Wu, Jing Li, Jia Wu, Jun Chang, Siamese Capsule Networks with Global and Local Features for Text Classification, Neurocomputing (2020), doi: https://doi.org/10.1016/j.neucom.2020.01.064


Siamese Capsule Networks with Global and Local Features for Text Classification

Yujia Wu (a), Jing Li (a,*), Jia Wu (b), Jun Chang (a)

(a) School of Computer Science, Wuhan University, Wuhan 430072, China
(b) Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, NSW 2109, Australia

Abstract

Text classification is a popular research topic in the field of natural language processing and has wide applications. Existing text classification methods based on deep neural networks can effectively extract the local features of text, and the classification models constructed with these methods yield good experimental results. However, these methods generally ignore the global semantic information of different categories of text and the global spatial distance between categories, which to some extent adversely affects the accuracy of classification. In this study, to address this problem, Siamese capsule networks with global and local features were proposed. A Siamese network was used to glean information about the global semantic differences between categories, which could more accurately represent the semantic distance between different categories. A global memory mechanism was established to store global semantic features, which were then incorporated into the text classification model. Capsule vectors were used to obtain the spatial position relationships of local features, thereby improving the representation capabilities of the features. The experimental results showed that the proposed model achieved better results and performed significantly better on six different public datasets, as compared with ten baseline algorithms.

* Corresponding author. Email addresses: [email protected] (Yujia Wu), [email protected] (Jing Li), [email protected] (Jia Wu), [email protected] (Jun Chang).


Keywords: Text Classification, Capsule Networks, Siamese Networks, Neural Networks, Global and Local Features

1. Introduction

Text classification is a very important task in research related to internet information processing. Specifically, text classification is used to classify documents automatically according to predefined categories. Text can be classified

by topic, sentiment, etc., according to the types of documents to be processed. Because of its wide application for personalized recommendations, customer service, market research, and the social sciences, text classification has quickly become a hot research topic in the field of natural language processing (NLP) [1].


Deep learning has been successfully applied in various fields such as computer vision, and the constantly developing deep learning methods enable deep neural networks to process text and obtain distributed representations of words [2, 3]. Additionally, deep learning methods achieve considerably good performance in various NLP tasks such as semantic analyses [4] and information retrieval [5].


Therefore, compared to traditional text classification methods such as support vector machine [6], naive Bayes [7], decision trees [8], logistic regression [9], and term frequency-inverse document frequency (TF-IDF) [10], deep learning models represented by convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have achieved better results in text classification. In 2014,


Kim [11] proposed a text classification model based on CNNs that yielded good experimental results. The study classified the topics of text, mainly short text. The main intention of this study was to extract local text features via convolutions, and then model sentences. In the task of text classification, neural networks can be divided according


to their complexity into shallow neural networks [12] and deep neural networks (which are the more complex ones) [13]. CNNs perform well in text topic classification [11] and have also achieved good results in text sentiment classification


[14]. RNNs [15] are another type of commonly used deep neural network, and these models can use contextual information in sequences of text to extract the

contextual features of text. Hence, they also perform well on certain datasets. Recurrent convolutional neural networks (RCNNs) have the combined characteristics of CNNs and RNNs [16], and their network structure is similar to that of an RNN. Therefore, RCNNs can make full use of contextual information, and employ max pooling to aggregate text features while reducing their computa-


tional complexity. This improves the accuracy of text classification. In recent years, attention mechanisms have also been widely used in various types of deep learning tasks as they can intuitively explain the importance of each text feature in the current document, thereby improving the accuracy of text classification to some extent [17]. Attention mechanisms have also been used to address aspect-level


sentiment classification [18]. CNN algorithms use convolutions to extract the weighted summation of the features of the previous layer, without considering the spatial hierarchy and positional relationship between local features. In addition, CNN algorithms use max pooling or average pooling to aggregate the features. This reduces the computational complexity but leads to a loss of spa-


tial information. The loss of some of the sample information may result in misclassification. Although CNNs and RNNs can improve the performance of text classification, they still have some limitations when extracting local features due to their inability to obtain the spatial position relationship between local features, which may affect the classification performance.


To cope with the failure of CNNs and RNNs to obtain the spatial position relationship between features, Hinton et al. [19] proposed a concept of capsules. Later, Sabour et al. [20] proposed using a dynamic routing algorithm instead of max pooling as the feature aggregation method and capsule vectors instead of neurons, and this combination achieved optimal performance in image classifi-


cation tasks. Such models, which are called capsule networks, can automatically learn the spatial position relationships between features, and therefore are very suitable for processing tasks in image fields. In various NLP tasks, the positional relationships of words in a sentence can also affect the accuracy of classification

to some extent. If the positional relationships of words can be learned during

training, the performance of the model can be improved [21]. The advantages of capsule networks have led to their rapid deployment for applications in the field of NLP. Gong et al. [22] proved that the aggregation mechanism in a capsule network had obvious advantages over traditional methods such as max pooling. Wang et al. [23] proposed a sentiment classification method that combined


RNNs and capsule networks (i.e., RNNs were used to extract text features, and attention mechanisms were used to construct capsule representations). Yang et al.[24, 25] investigated the text classification method based on capsule networks with dynamic routing and proposed three policies for stabilizing the dynamic routing process to reduce the noise interference of capsules. Zhao et al.


[26] published their article Towards Scalable and Reliable Capsule Networks for multi-label text classification and answering questions. Chen et al.[27] proposed a transfer capsule network model for transferring document-level knowledge to aspect-level sentiment classification. Although capsule networks can adequately represent the spatial position re-


lationships of local features and improve the performance of text classification models, they are still associated with a few problems. CNNs and capsule networks can successfully detect and extract local features with a strong ability of categorical representation. On the one hand, capsule networks use capsule vectors instead of neurons, which can possibly represent the positional hierarchical


relationships between local features. On the other hand, capsule networks employ a routing mechanism instead of max/average pooling, which reduces the loss of information to a certain degree. However, these methods usually ignore global semantic information when extracting features. The global semantic or topic features of various categories


of documents are not fully extracted (i.e. the features of spatial distance information indicating different categories are not extracted). The lack of such global features is an important factor that restricts the improvement in performance of text classification. Therefore, we need to propose a model that can represent the spatial position relationship of local features, as well as the global semantic


information or topic features of various categories of documents. A Siamese network is composed of two sub-networks with shared weights. The two sub-networks have identical architectures, and their inputs and outputs have the same form. If the two sub-networks receive the same category of text as input, they extract similar semantic features, and the spatial distance


between the output features will be small. If the two sub-networks receive different categories of text as input, the spatial distance between their output features will be large. Owing to this property, Siamese networks perform very well in learning sentence similarity [28]. Inspired by these observations, in this study, Siamese capsule networks with global and local features were proposed.

Figure 1: Effective separation of unseparated global features ((a) unseparated, (b) separated) by learning the global semantic or topic features of the various categories of documents


For example, as shown in Figure 1(a), the red circle contains the local features of the category c1 , and the blue circle contains the local features of the category c2 . However, some features as shown by the small gray circles in Figure 1(a), may belong to either category c1 or category c2 . When extracting local features, it is difficult to effectively separate these features into their correct cat-


egories. But we can regard these unseparated features as global features, and we can separate these global features by learning the global semantic or topic features of the various categories of document, and then assigning these global features to their correct categories, as shown in Figure 1(b). We learn global features through a Siamese network to solve the above problems. In this study,


in order to obtain the contextual information of a text, a capsule network was first constructed based on a bi-directional gated recurrent unit (BiGRU). The GRU [29] is a simple recurrent neural network. Through the gating

mechanism, the recurrent neural network can memorize past information, and selectively filter out a lot of unimportant information. In addition, capsule vec-

tors have been used to obtain the spatial position relationship of local features to further improve the representational ability of the features. Two identical capsule networks were then combined to form a Siamese network. The Siamese network was used to gather information on the global semantic differences between categories so as to represent more accurately the semantic distance be-


tween different categories. A global memory mechanism was also set up to store global semantic features, from which the topic central capsules of the document categories were extracted using singular value decomposition (SVD). The global semantic features were incorporated into a text classification model, which combined global and local features for text classification. The global and


local features were obtained through a two-stage training process to construct a classification model. The proposed method was tested on six different types of dataset. The experimental results showed that the proposed method significantly improved the performance of text classification. The main contributions of this study are


summarized as follows: (1) A BiGRU layer was proposed to obtain the forward and backward contextual information of local features, and to extract the features with contextual information as the input of the capsule network for text classification, thereby achieving a certain improvement in performance.


(2) A Siamese training mechanism was proposed. Through comparative learning, the capsule network was able to learn the global semantic distance information between different categories and obtain the global differences between different categories of text, thereby improving the performance of the model.


(3) A network framework combining global and local features was proposed, which improved the classification performance of the model. Experiments were carried out on six datasets. The experimental results showed that the


proposed model generally performed better than the baseline algorithms.

2. The Proposed Method

Figure 2 shows the overall framework of the proposed model. The model was functionally divided into three parts: 1) a basic capsule network that was used to extract local features of the text, 2) a Siamese network that was used to learn difference information between categories, and 3) a global memory mechanism for storing global features.


2.1. Formalizations

A dataset contained c categories of documents. The input of the capsule network was a sentence from a certain category of document. Each category of document D = {T1, T2, ..., Tn} included n sentences Td (1 ≤ d ≤ n). Each sentence Td = {i1, i2, ..., ik} contained k words. Each word ik was represented


by an m-dimensional word vector. Each capsule A was an 8-dimensional or 16-dimensional vector. Here, ω was the classification weight that represented the proportion of the capsule network employing local features for classification. The model would input a sentence and then output a classification probability. The model was trained in three steps, as shown in Figure 2. In Step 1,


two randomly initialized capsule networks were used to constitute one Siamese network for learning the difference information between categories. In Step 2, ω was set to 1, and the basic capsule network that was well trained in Step 1 was trained in a supervised manner to obtain the local features. Subsequently, the obtained local capsule features were stored in the global memory mechanism. In


Step 3, an SVD operation was performed for all global capsule features stored in the global memory mechanism to obtain the topic central capsule for each category. The topic central capsule represented the global features. The model was tested as follows: ω was set to 0.5, and the test sample was input into the capsule network to obtain a classification probability. Meanwhile,


the test sample was input into another capsule network with a shared weight



Figure 2: Siamese capsule networks with global and local features. The model was functionally divided into three parts: a basic capsule network for extracting the local features, a Siamese network for learning difference information between categories, and a global memory mechanism for storing global features.

to obtain a group of capsules. This group of capsules was linearly transformed, and then a group of capsules with the same size as the topic central capsule were obtained. Then the distances between these groups and the topic central capsule were calculated to obtain a classification probability. Finally, the two

probabilities were summed to obtain the final classification probability.

2.2. Basic Capsule Networks

This section introduces the basic capsule network. A capsule network has two convolutional layers which are used to extract local features, and it utilizes capsule vectors to represent the spatial position relationships between lo-


cal features. Although the original capsule network is capable of successfully extracting local features, it does not consider the contextual relationship of the features. Because the contextual relationships in the text are an important basis for classification, we added a GRU network to the capsule network to obtain


the contextual information of local features.


Figure 3: Basic Capsule Network, which was used to extract the local features of the text


In addition to adding the GRU layer for the extraction of contextual text information, we also set up two convolutional layers. Convolutional layer 1, designated as Conv1, is used to extract high-level local features that carry context information; its convolution kernel size is 3 and its stride is set to 1 in order to extract nearby local features. In order to


obtain some further local features, the Conv2 layer's stride is set to 2, in order to extract local features that are further away. In addition, in order to reduce the number of parameters, we do not use a fully connected layer; we use a routing mechanism instead. We only route between two consecutive capsule layers, so the features generated by the Conv2 layer need to


be reshaped into capsules. The DigitCaps layer generates a 16D capsule for each category, with each capsule receiving input from all capsules in the previous layer. We use the modulus of the instantiation vector to represent the class probability that the capsule represents. Figure 3 illustrates the framework of a basic capsule network for text classi-


fication. The network consisted of seven layers. The network input layer was a sentence Td in a certain category of document D . Each sentence Td contained k words. The final output layer of the network was the category probability of the document. In order to obtain the contextual relationship between adjacent words in a



sentence Td, we constructed a BiGRU layer to merge contextual information in the document. Therefore, the BiGRU layer was used as the second layer of the capsule network to extract the contextual information of a text. The output feature vector hk was obtained by combining the forward and backward features, as shown in Eq. (1).

h_k = [GRU(h^f_{k-1}, i_k); GRU(h^b_{k+1}, i_k)]    (1)

Here, k new features were extracted for the input sentence Td using the BiGRU layer to form a set of feature maps H . Here H referred to the contextual text features extracted by the BiGRU encoder, as shown in Eq. (2).

H = [h_1, h_2, ..., h_k]    (2)
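As a concrete illustration of Eqs. (1)-(2), the sketch below builds the BiGRU encoder in tf.keras (the framework named in Section 3.3). The sentence length, word-vector size, and number of GRU units are illustrative assumptions; the paper does not fix the GRU hidden size.

```python
import tensorflow as tf

k, m, units = 40, 300, 128   # sentence length, word-vector size, GRU units (illustrative)

# Eqs. (1)-(2): forward and backward GRU states are concatenated for every word,
# producing the contextual feature map H = [h_1, ..., h_k].
words = tf.keras.Input(shape=(k, m))          # pre-trained word2vec vectors of one sentence
H = tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(units, return_sequences=True))(words)   # shape (k, 2 * units)
```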

In order to extract higher-level features and construct capsule vector representations, we used the method proposed in the study by Sabour et al. [20]. 215

Two convolutional layers Conv1 and Conv2 were set as the third and fourth layers of the capsule network. Conv1 had 256 convolution kernels, with a size of 3; stride was set to 1 and the ReLU activation function was used. Conv2 also had 256 convolution kernels, with a size of 3; stride was set to 2 to obtain the information about additional positions, and the ReLU activation function was

220

used. The fifth layer PrimaryCaps was reshaped from the Conv2 layer, with the purpose of creating an eight-dimensional capsule vector with eight adjacent high-level features. Capsules were able to store the positional relationships of the adjacent features. At the sixth layer DigitCaps, the modulus of the activity

225

vector of each capsule represented the existence probability for each document category. The number of capsules to be set was the same as the number of categories. The dimension of the capsules at the DigitCaps layer was set to 16, which was consistent with the study conducted by Sabour et al. [20]. Previous CNN-based neural networks typically used max or average pooling

230

to aggregate features. Their information aggregation methods were simple and

10

efficient, but too much information could be lost. Capsule networks use dynamic routing algorithms instead of pooling operations, which have been proven very effective [22]. The goal of dynamic routing algorithms is to find a set of mappings between the capsules of the PrimaryCaps layer and those of the DigitCaps 235

layer. More specific distance procedures and details can be found in a previous work [20]. Generally, a capsule network achieving optimal performance can be obtained after approximately three route iterations. The last layer of the capsule network was the Class layer, wherein the capsule length was used to indicate the probability of existence of each category. In this study, capsule networks with

240

dynamic routing were applied for text classification, thereby yielding a good experimental result.

2.3. Siamese Capsule Networks

A Siamese network is used to learn the categorical difference information of global semantics. It can more accurately represent the semantic distance between different categories. Section 2.2 introduced the basic capsule network. We created a Siamese network with two basic capsule networks that shared the same structure and network parameters. Their structure is shown in Figure 4.


Figure 4: The Siamese Network which was used to learn difference information between categories.
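A minimal sketch of the weight sharing in Figure 4: applying the same sub-network object to both inputs gives two branches with identical structure and shared parameters. Layer sizes are illustrative, and the DigitCaps/routing part is again omitted.

```python
import tensorflow as tf

k, m = 40, 300
# Reusing one layer stack on both inputs is what ties (shares) the weights.
base = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128, return_sequences=True),
                                  input_shape=(k, m)),
    tf.keras.layers.Conv1D(256, 3, strides=1, activation='relu'),
    tf.keras.layers.Conv1D(256, 3, strides=2, activation='relu'),
    tf.keras.layers.Reshape((-1, 8)),                 # capsule groups later compared via Eq. (3)
])
xa, xb = tf.keras.Input(shape=(k, m)), tf.keras.Input(shape=(k, m))
caps_a, caps_b = base(xa), base(xb)                   # identical structure, shared parameters
siamese = tf.keras.Model([xa, xb], [caps_a, caps_b])
```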

Two samples are needed as the input for a Siamese network. If the two


samples are from the same category, then the output capsules are similar and are spatially close. If the two samples are from different categories, then the spatial distance between the output capsules is larger. Because capsules use vectors rather than scalars to represent features, the distance between two capsule vectors cannot be measured directly by Euclidean distance


or cosine similarity. In this study, we defined an equation for calculating the distance between two groups of capsules. In this case, A and B were set to represent two capsule groups with a length of m, and the size of each capsule was 8. Therefore, a group of capsules could be represented as a matrix with a size of m × 8 . The term R(A, B) indicates the distance between the two capsule


groups A and B , and is given by Eq. (3).

R(A, B) = ⟨A, B⟩ / (‖A‖ ‖B‖)    (3)

where ‖A‖ represents the matrix norm, as given by Eq. (4), and ⟨A, B⟩ represents the inner product of matrices A and B, as given by Eq. (5).

‖A‖ = √⟨A, A⟩    (4)

⟨A, B⟩ = TR(B^T A)    (5)

In Eq. (5), B^T represents the transpose of matrix B, and TR(·) represents the trace, i.e., the sum of the elements of the main diagonal.
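The distance in Eqs. (3)-(5) is the normalised Frobenius inner product of the two capsule-group matrices. A NumPy sketch (the (m, 8) shapes follow the text above):

```python
import numpy as np

def capsule_group_distance(A, B):
    """R(A, B) = <A, B> / (||A|| ||B||), Eqs. (3)-(5), for (m, 8) capsule groups."""
    inner = np.trace(B.T @ A)                                 # <A, B> = TR(B^T A)
    return inner / (np.linalg.norm(A) * np.linalg.norm(B))    # Frobenius norms = sqrt(<A, A>)

A = np.random.randn(10, 8)
print(capsule_group_distance(A, A))   # identical groups give R = 1 (most similar)
```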

When R(A, B) = 0 , the distance between the two capsule groups was the farthest, and the similarity between them was the worst. When R(A, B) = 1 , the distance between the two capsule groups was the closest, and the similarity between them was the best. The training goal of the Siamese network was to ensure that the distance between samples of the same category, and the distance


between samples of different categories, was as close to 1 and 0 as possible, respectively. Therefore, the loss function of the Siamese Network was defined by Eq. (6).


Loss = Σ_{i=1}^{k} L(T_i, T_j, y)    (6)

where L(T_i, T_j, y) is given by Eq. (7).

L(T_i, T_j, y) = (y / 2) R(T_i, T_j) + ((1 − y) / 2) max{b − R(T_i, T_j), 0}    (7)

where b is a fixed margin distance.
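A per-pair NumPy sketch of Eqs. (6)-(7). The label convention and the margin value b = 1.0 are assumptions: with y = 1 for a different-category pair and y = 0 for a same-category pair, minimising Eq. (7) pushes same-category pairs toward R ≈ b and different-category pairs toward R ≈ 0, which matches the stated training goal.

```python
import numpy as np

def siamese_pair_loss(Ti, Tj, y, b=1.0):
    """L(T_i, T_j, y) from Eq. (7); summing it over sampled pairs gives Eq. (6).

    Ti, Tj: (m, 8) capsule groups from the two shared-weight sub-networks.
    y: assumed 1 for a different-category pair, 0 for a same-category pair.
    b: fixed margin (value assumed; the text only states that b is fixed).
    """
    r = np.trace(Tj.T @ Ti) / (np.linalg.norm(Ti) * np.linalg.norm(Tj))  # R(T_i, T_j), Eq. (3)
    return (y / 2.0) * r + ((1.0 - y) / 2.0) * max(b - r, 0.0)
```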

2.4. Global Memory Mechanism

In the previous neural network methods, the global information of the entire document has often not been considered. The global information refers to the information common to all document categories, whereas local features only consider the individualized information of a sample. Therefore, the classifica-


tion performance of the model can be further improved by considering global information. In this study, we proposed a global memory mechanism for storing document information of all categories. As shown in Figure 2, each sample input into a capsule network was output as a group of eight-dimensional capsules. Such a group of eight-dimensional capsules represented, to some extent,


the various features of the corresponding input sample because those features were refined layer by layer. Although it is difficult to explain the meanings of these capsule vectors, they have been proven highly successful at including the spatial position relationships of local features [20], which are probably things such as grammatical structure information or the distance between words. It


also has been proven that if the positional distance relationships between words can be learned during training, the performance of the model can be improved [21]. The global memory mechanism based on a capsule network is shown in Figure 5. All the samples passed through the capsule network and were stored


in the global memory mechanism to form a global memory matrix (G). Because the capsule network was already able to extract capsule vector features with very strong representational capabilities, we directly stored the capsule vectors


Figure 5: Global Memory Mechanism for storing global features

generated by each sample to form a G-matrix. The features extracted by the capsule network are a good representation of the original sample. However, there

are usually a large number of extracted features, and much of the information is redundant, which may affect the performance and computational speed of the classification model. Here, we used an SVD method to further refine the features, in order to reduce the number of dimensions and improve the feature representation capability, and also to reduce the computation time of the model.


An SVD can map the original features from a high-dimensional space to a low-dimensional feature space. Even if two samples do not have many explicit features in common, they may still have a latent semantic similarity, and an SVD is well suited to mining such latent semantics. We used the global memory capsules extracted from the entire training set


to create a G-matrix, in which each capsule needed to be extended as adjacent features. The size of the extension was m × n, indicating that there were n samples of different categories, and each sample had m feature vectors. The SVD of the G-matrix can be expressed by Eq. (8).

G = U Σ V^T    (8)

In order to reduce the dimensions of the feature vectors, we selected the first

k largest values in the singular value matrix Σ and the first k columns and rows

corresponding to the left and right singular feature vectors U and V^T. Then the G-matrix could be approximated as shown in Eq. (9).

G ≈ U_k Σ_k V_k^T    (9)

where U_k is the lower-dimensional matrix formed by the first k columns of matrix U, V_k^T is the lower-dimensional matrix formed by the first k rows of

matrix V^T, and Σ_k contains the first k singular values. The size of V_k^T was k × n, indicating the topic features of the c categories of documents. The topic central capsule B_j (j = 1, 2, ..., c) of each category was formed by a reduction or increase in dimensionality. The term B_j contained k capsules with a size of 8.


For U_k and Σ_k, a linear transformation matrix was formed, as in Eq. (10).

Z = U_k Σ_k^{-1}    (10)
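A NumPy sketch of Eqs. (8)-(10). The m × n layout of G follows the text (m stacked capsule features per column, n samples); the exact reshaping of V_k^T into the topic central capsules B_j is left out because the text describes it only loosely.

```python
import numpy as np

def build_global_memory(G, k):
    """Truncated SVD of the global memory matrix G, Eqs. (8)-(10)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    Uk, sk, Vkt = U[:, :k], s[:k], Vt[:k, :]      # keep the k largest singular values, Eq. (9)
    Z = Uk @ np.diag(1.0 / sk)                    # Z = U_k Sigma_k^{-1}, Eq. (10), shape (m, k)
    return Vkt, Z                                 # Vkt (k x n) carries the topic features
```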

After an SVD operation, the large reduction in dimensionality reduced the computational complexity and filtered out some of the noise. The semantic space thus generated possessed features with greater categorical representation capabilities than the original features, while retaining the most important

global information in the original G-matrix. Meanwhile, the size of the matrix changed from m × n to m × k, thereby significantly improving the computational speed. In the test, we directly input the test sample x_i through the basic capsule network to obtain a group of m capsule vectors, A. The size of


each capsule was 8. The transpose of A and the linear transformation matrix Z were multiplied to obtain a group of k capsules, D. The size of each capsule was 8. Using Eq. (3), the similarity of these capsules was compared with that of the topic central capsules B_j to obtain the classification probability of each category, i.e., P_G(x_i | c_j).
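The test-time use of Z and the topic central capsules described above can be sketched as follows. Converting the similarities R(D, B_j) into probabilities by simple normalisation is an assumption, as the paper does not give the exact rule.

```python
import numpy as np

def global_class_probabilities(A, Z, topic_capsules):
    """P_G(x_i | c_j) for one test sample from its (m, 8) capsule group A."""
    D = (A.T @ Z).T                               # (k, 8): a group of k capsules of size 8
    sims = np.array([np.trace(B.T @ D) / (np.linalg.norm(D) * np.linalg.norm(B))
                     for B in topic_capsules])    # R(D, B_j), Eq. (3), one topic group per category
    sims = np.clip(sims, 0.0, None)               # assumed: negative similarities treated as 0
    return sims / sims.sum()
```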


Finally, a comprehensive classification probability was obtained by combining local features and global features. The classification probability of the pro-


posed method is composed of two parts: local and global. The probability weight obtained by the basic capsule network was set to ω, and the probability weight obtained by the global memory mechanism was set to (1 − ω), as shown in Eq.

(11).

P = ω × P_L(x_i | c_j) + (1 − ω) × P_G(x_i | c_j)    (11)

Some features that contribute significantly to the classification may carry little weight in the global capsule network and thus be ignored there; however, they are fully considered in the basic capsule network. Conversely, the basic capsule network may ignore important information that appears rarely overall but does appear in the global capsule network, along with some information that is not explicitly associated. Eq. (11) considers both local and global features, which improves the classification accuracy to some extent.
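Eq. (11) then reduces to a one-line combination of the two probability vectors; ω = 0.5 at test time, as stated in Section 2.1.

```python
def combined_probability(p_local, p_global, omega=0.5):
    """Eq. (11): P = omega * P_L + (1 - omega) * P_G."""
    return omega * p_local + (1.0 - omega) * p_global
```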

3. Experiment

In the experiment in this study, word2vec was used for preprocessing. Each

word in the sample was represented as a 300-dimensional word vector.

3.1. Datasets

Six common benchmark datasets were used in the experiment to test the proposed model:
MR [30]: A movie review dataset; each review is a sentence labeled as positive or negative.
MPQA [31]: A dataset of opinions; the opinions in the multi-perspective question answering (MPQA) dataset were labeled as positive or negative.
SUBJ [32]: A subjectivity dataset, wherein a sentence was labeled as subjective or objective.
SST1 [33]: The Stanford Sentiment Treebank, a movie review dataset and an extension of the MR dataset. The reviews were divided into five categories according to the sentiment levels of the sentences: very positive, positive, neutral, negative, and very negative.
SST2 [33]: Similar to the SST1 dataset except that neutral reviews are not included; the reviews were classified into two categories, positive and negative.
TREC [34]: A question dataset in which the questions were classified into six categories.


3.2. Baselines

For comparison, we selected several recent baseline algorithms to verify the performance of the proposed algorithm.
CNN [11]: A classic convolutional network structure for text classification. It has only one convolutional layer and one pooling layer for extracting local features.
RNN [15]: A recurrent structure is used to capture contextual information. Three different shared-information mechanisms were proposed to simulate the multi-task processing of text with specific tasks and of text at the shared layer.
RCNN [16]: A model combining convolutional and recurrent neural networks. The convolutional layer extracts features, and the max pooling layer automatically determines which words play a key role in text classification to capture the key information in the text.
HAN [17]: A hierarchical attention network for text classification, with attention applied at both the word and sentence levels.
LSTM/Bi-LSTM [29]: LSTM is a recurrent neural network, while Bi-LSTM consists of two recurrent neural networks.
BLSTM [35]: 2D max pooling is applied to obtain fixed-length text representations and more meaningful features.
ULMFiT [36]: Universal language model fine-tuning (ULMFiT) is an effective transfer learning method that can be applied to any task in NLP.
DR-AGG [22]: A capsule network based on a dynamic routing mechanism for text classification.
Capsule-A/Capsule-B [24]: Capsule-A is a matrix capsule network for text classification, while Capsule-B is a cascade of three capsule networks for improving classification performance.
USE [37]: An encoder that combines unsupervised and supervised training data for multi-task training to learn universal sentence representations.

3.3. Settings and Hyper-parameters

The specifications of the computer used in the experiment were as follows: 3.2 GHz Intel i7-8700 CPU; 11 GB NVIDIA GeForce RTX 2080Ti GPU; 32 GB memory; Windows 10 operating system. The proposed algorithm was implemented in Python under the TensorFlow 1.10 framework. For training, we first used word2vec for preprocessing and set the dimension of


the word vector to 300. Relevant hyper-parameters are listed in Table 1.

Table 1: The Hyper-parameters

Dataset   b     lr       fn    fs   rn   d     caps   svd
MR        128   0.0002   256   3    3    0.2   672    15
MPQA      128   0.0002   256   3    3    0.2   272    50
SUBJ      128   0.0002   256   3    3    0.2   944    100
SST1      128   0.0003   256   3    3    0.2   400    40
SST2      128   0.0003   256   3    3    0.2   400    40
TREC      256   0.0002   256   3    3    0.1   272    30

In Table 1, b is the batch size, lr is the initial learning rate, fn is the number of filters, fs is the filter size, rn is the number of route iterations, d is the regularization constant of the dropout layer, caps is the number of capsules, and svd is the retained dimension.
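For reference, the settings of Table 1 can be written as a configuration mapping (keys follow the table's abbreviations; this is simply a restatement of the table, not additional tuning information).

```python
HYPER_PARAMS = {
    #            b      lr      fn    fs   rn   d     caps   svd
    "MR":   dict(b=128, lr=2e-4, fn=256, fs=3, rn=3, d=0.2, caps=672, svd=15),
    "MPQA": dict(b=128, lr=2e-4, fn=256, fs=3, rn=3, d=0.2, caps=272, svd=50),
    "SUBJ": dict(b=128, lr=2e-4, fn=256, fs=3, rn=3, d=0.2, caps=944, svd=100),
    "SST1": dict(b=128, lr=3e-4, fn=256, fs=3, rn=3, d=0.2, caps=400, svd=40),
    "SST2": dict(b=128, lr=3e-4, fn=256, fs=3, rn=3, d=0.2, caps=400, svd=40),
    "TREC": dict(b=256, lr=2e-4, fn=256, fs=3, rn=3, d=0.1, caps=272, svd=30),
}
```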

In the experiment, we used the 300-dimensional word2vec vector to initialize the embedded vectors. For all datasets, we used 20% of the data as test data. We processed data at a batch size of 256 on the TREC dataset, and at a batch size of 128 for other datasets. We used the Adam optimizer to train the SST1


and SST2 datasets at an initial learning rate of 0.0003, and other datasets at

an initial learning rate of 0.0002. In the capsule network, there were 256 convolutional kernels with a size of 3 in both convolutional layers. Route iteration was performed three times to achieve optimal performance. The dropout regularization was 0.1 on the TREC dataset, and 0.2 on the other datasets. For each sample, 672 capsules were obtained on the MR dataset, 272 on the MPQA and


TREC datasets, 944 on the SUBJ dataset, and 400 on the SST datasets. The SVD retained dimension in the global memory mechanism was determined by repeated experimentation.

3.4. Overall Performance

The accuracy of the results is presented in Table 2. It can be seen from the


table that the proposed method achieved optimal performance on five datasets, and a near-optimal result on the TREC dataset. On each dataset, the optimal results are marked in bold red, and the second-best results are marked in bold black and underlined. Basic Capsules (the baseline capsule network proposed in this study) in-


cluded the BiGRU unit, and achieved the second-best result on three datasets. However, without taking into account the global features, basic Capsules failed to learn the global categorical difference information. The SCN (Siamese Capsule Networks with Global and Local Features) model proposed in this study which took into account the global features achieved the optimal results on five


datasets, and the second-best result on the TREC dataset. It is clear that obtaining global features improves the classification performance to some extent. On the MR dataset, only the basic Capsules considering local features achieved better results than some of the baseline algorithms. The possible reason was that the newly added GRU extracted the contextual information, which im-


proved the accuracy of the classification. MR is a two-category sentiment classification dataset, on which models with recurrent structures usually performed better (e.g., BLSTM-2DCNN also achieved 82.3% accuracy). Similar datasets include MPQA and SUBJ. On such sentiment classification datasets, the models that

Table 2: Experimental Results

Dataset                                 MR     MPQA   SUBJ   SST1   SST2   TREC
CNN (Kim, 2014)                         81.0   89.1   93.0   42.3   86.8   92.8
LSTM (Kyunghyun Cho et al., 2014)       76.3   87.7   82.6   /      79.9   89.3
BiLSTM (Socher et al., 2014)            78.5   88.3   83.2   /      79.3   90.5
RCNN (Siwei Lai et al., 2015)           81.0   88.3   90.3   44.1   82.9   90.7
RNN (Pengfei Liu et al., 2016)          80.6   88.6   89.4   43.8   80.6   90.8
HAN (Zichao Yang et al., 2016)          77.1   87.4   89.1   47.2   83.5   87.1
BLSTM-2DCNN (Peng Zhou et al., 2016)    82.3   /      94.0   50.4   89.5   96.1
ULMFiT (Jeremy Howard et al., 2018)     82.1   88.9   93.2   50.3   89.3   /
DR-AGG (Jingjing Gong et al., 2018)     /      /      /      50.5   87.6   /
Capsule-A (Wei Zhao et al., 2018)       81.3   /      93.3   /      86.4   91.8
Capsule-B (Wei Zhao et al., 2018)       82.3   /      93.8   /      86.8   92.8
USE (Daniel Cer et al., 2018)           81.5   86.8   93.3   /      87.2   97.4
Basic Capsules (ours)                   83.0   89.2   93.8   50.6   88.7   96.1
SCN (ours)                              83.2   89.8   94.1   50.8   89.5   96.5

were able to extract contextual information performed better than other mod-

els overall. On SUBJ, the maximum length of text was 121, and the length of most samples was relatively large, which helped obtain better global features. Therefore, the SCN model considering both local and global features had the best performance on this dataset. SST1 is a movie review dataset, an extension of the MR dataset. Reviews were divided into five categories according to the


sentiment levels of the sentences: very positive, positive, neutral, negative, and very negative. SST2, based on SST1, excluded the neutral reviews and merged the other reviews into two categories, namely positive and negative. On such datasets, BLSTM-2DCNN and ULMFiT showed good results; on SST2 in particular, both BLSTM-2DCNN and SCN achieved the optimal result. TREC


is a question dataset, on which the USE model showed the optimal result. In this dataset, the questions were divided into six categories. Many words such as How, What, and When usually occurred in a sample. However, they were not important global features, and due to the high frequency of their appearance,

they often interfered with other global features that helped in the classification.

Therefore, SCN only achieved the second-best result on this dataset.

Table 3: Variance report

Dataset   Basic Capsules       SCN
MR        83.0 (82.39±0.36)    83.2 (82.80±0.23)
MPQA      89.2 (88.46±0.35)    89.8 (88.94±0.33)
SUBJ      93.8 (92.54±0.48)    94.1 (93.62±0.29)
SST1      50.6 (49.54±0.44)    50.8 (50.06±0.36)
SST2      88.7 (88.27±0.25)    89.5 (88.97±0.28)
TREC      96.1 (95.55±0.49)    96.5 (96.13±0.20)

For our proposed method, Table 3 shows the results in the format "best (mean±std)" for the different datasets. In general, deep learning models such as CNNs have many fully connected layers; such models work well on large datasets but easily overfit on small datasets. In contrast, the capsule

network uses a routing algorithm instead of a full connection, so fewer parameters are required, and the routing algorithm needs only about 3 iterations to train a good classification model. Therefore, the capsule network is well suited to training with small samples and thus to small datasets. To verify the stability of the capsule network, we ran our training algorithms 10


times, and reported the average accuracy and the best accuracy obtained in Table 3. As can be seen from Table 3, over the 6 datasets the standard deviation of the proposed SCN is between 0.20% and 0.36%, while that of Basic Capsules is between 0.25% and 0.48%. Our method therefore has


better stability. In this section, we also assess the effect of the experimental improvements through significance testing. A significance test is a widely used method for detecting whether there is a difference between the experimental group and the control group in a scientific experiment, and it is concerned with


determining whether the difference is significant. The basic idea of the significance test is to first make assumptions about the scientific data and then use

a statistical test to check whether the hypothesis holds. In general, the hypothesis to be tested is called the null hypothesis, expressed as H0, and the hypothesis in opposition to H0 is called the alternative hypothesis, expressed as H1. We

investigated the effects of the experimental improvements by asking the following question: Q: Does the SCN model with global features work better than models that do not have global features?


For our problem, the null hypothesis H0 claims that there is no difference between the SCN model and Basic Capsules, and the alternative hypothesis H1 claims that there is a difference. Following standard practice, the significance level is set to α = 0.05, and the null hypothesis is rejected if the p-value < α, as this indicates that


a significant difference does exist, and this result would thus give us a positive answer to the question: R: The SCN model with global features is better than the model without the global features.
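The test described above can be reproduced with a paired t-test over the repeated runs (10 runs per model, as in Table 3). The per-run accuracy values in this sketch are placeholders, not the paper's raw numbers; only the procedure is illustrated.

```python
import numpy as np
from scipy import stats

# Placeholder per-run accuracies for one dataset (illustrative only).
scn_runs   = np.array([82.8, 83.2, 82.6, 82.9, 83.0, 82.7, 82.9, 83.1, 82.5, 82.8])
basic_runs = np.array([82.4, 82.0, 82.7, 82.3, 82.5, 82.1, 82.6, 82.4, 82.2, 82.7])

t_stat, p_value = stats.ttest_rel(scn_runs, basic_runs)   # paired t-test, H0: no difference
reject_h0 = p_value < 0.05                                # reject H0 at the 0.05 level
```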

Table 4: The p-values of statistical tests on SCN and Basic Capsules.

Dataset   MR         MPQA       SUBJ       SST1       SST2       TREC
p-value   1.01E-02   7.90E-03   1.84E-05   1.26E-02   2.54E-05   3.90E-03

Table 4 displays the statistical evaluation of the differences between SCN and Basic Capsules. We performed the statistical tests on paired models, conducting a t-test between SCN and Basic Capsules at the conventional significance level of 0.05. As shown in Table 4, the


t-test results between SCN and Basic Capsules over the 6 datasets provide p-values less than 0.05, which further verifies the stability and effectiveness of SCN in text classification. It can be seen from the above experimental analysis that SCN performed

better than the baseline algorithms on most datasets. However, due to the

different types of datasets, SCN also had some limitations. It did not achieve optimal results on certain datasets. The global memory mechanism could extract important global features, which could make up for the limitations of the capsule network to a certain extent. The classification model considering both local and global features achieved the optimal results on five datasets.


3.5. Comparison Experiment

Table 5: Comparison Experiment with GRU

Dataset                  MR     MPQA   SUBJ   SST1   SST2   TREC
CNN (without GRU)        81.0   89.1   93.0   42.3   86.8   92.8
CNN (with GRU)           81.5   89.0   93.2   46.8   87.1   94.5
Capsules (without GRU)   80.6   88.4   91.2   47.1   87.6   91.8
Capsules (with GRU)      83.0   89.2   93.8   50.6   88.7   96.1

In order to prove that the GRU-embedded network was able to extract contextual information in a way that was effective for classification and improved classification performance, we controlled other variables and used the CNN and Capsules models to conduct comparative experiments. The CNN and

Capsules models were divided into two types: without a GRU and with a GRU. The experimental results are listed in Table 5. The Capsules model with a GRU performed significantly better on all six datasets than that without a GRU. The CNN model with a GRU was superior to that without a GRU on five datasets. On the MPQA dataset, the CNN model with a GRU achieved a near-optimal


result. Therefore, on most datasets, adding a GRU was able to significantly improve the model performance. The training of the Siamese network was performed in two stages. The spatial distance information between the categories was learned in the first stage. In the second stage, classification training was performed in a supervised man-


ner. In order to prove that the spatial distance information between the categories could be learned in the first stage to improve classification performance, we controlled other variables and used CNN and Capsule models to conduct

Table 6: Comparison Experiment with Siamese networks

Dataset           MR     MPQA   SUBJ   SST1   SST2   TREC
CNN               81.0   89.1   93.0   42.3   86.8   92.8
Siamese CNN       82.3   89.2   93.4   48.6   87.9   94.5
Capsule           83.0   89.2   93.8   50.6   88.7   96.1
Siamese Capsule   83.2   89.6   94.1   50.8   89.4   96.3

comparative experiments. The models were divided into two types, namely, Siamese CNN and Siamese capsule models with two-stage training (wherein the

categorical spatial distance was learned), and the CNN and Capsule models with one-stage training. The experimental results are listed in Table 6. After the two-stage training, the Siamese CNN and Siamese capsule models learned the spatial distance information between categories. Their performances were significantly better than the models that only performed classification training,


i.e., their performances on all six datasets were superior to those of the one-stage training models. Therefore, the Siamese network was able to significantly improve the model performance after learning the spatial distance information between categories.

4. Related Work


This section gives a brief review of the relevant work and research being done in the fields of text classification, capsule networks, and Siamese networks.

4.1. Text Classification

Text classification is an important research topic in the NLP field. Traditional text classification methods such as support vector machine [6], naive


Bayes [7], decision tree [8], logistic regression [9], and TF-IDF [10] have achieved good results. In recent years, as deep learning has developed, it has achieved great success (especially in the field of computer vision) and has also been widely used in various NLP tasks [4, 5, 38]. The deep learning methods as represented by the CNN-based text classification model proposed by Kim [11] achieved bet-


ter experimental results than traditional text classification methods. However,

such methods used a shallow convolutional neural network [11, 12]. Some other researchers have managed to use a deeper neural network [13] to solve the text classification problem. In 2015, Tang et al. [14] used CNNs to solve the sentiment classification prob-

lem by considering influencing factors such as a user's personal preferences and the overall product quality, and obtained good results. In addition, RNNs [15] can use contextual information to extract important text features, and they performed well on some datasets. RNN models based on a sequence autoencoder [39] can improve the classification results. RCNNs combine the characteristics


of CNNs and RNNs [16], and their network structures are similar to RNNs. Text classification methods based on Generative Adversarial Networks [40] can perform semi-supervised training and have also achieved good results. RNNs are good at obtaining long-distance information from documents, but they cannot extract certain local features very well. In contrast, CNNs can ob-


tain better local features of text. In order to take advantage of this, Wang [41] proposed a diagonal recurrent neural network (DRNN) that introduced the local feature extraction function of CNN-based models into the RNN-based models, and also obtained good classification results. Howard et al. [36] proposed a universal language model for text classification, which could be used as pre-training


for different text classification tasks. Zeng et al. [42] proposed a topic memory network to extract the potential topics of text to solve the data sparseness problem of short text. Attention mechanisms have been widely used in various types of deep learning tasks in the last two years. An attention mechanism can intuitively explain


the importance of each text feature in the current document, thereby improving the accuracy of text classification to some extent [17]. In addition, attention models without any convolution have also been successfully applied to text classification [43]. Zhang et al. [13] further explored CNN-based models for text classification at the character level. Dynamic convolutional neural networks


(DCNNs) [44] can improve model applicability by taking the sentence length into account when determining the pooling parameters. Deep pyramid

convolutional neural networks [45] can effectively represent long-range associations in the text, thereby improving the text classification results. In text classification tasks based on deep neural networks, words are usually

considered as features of text, and word vectors are vectors that are used to represent words. Currently, word vectors have gradually become an important basis for processing text in the NLP field. Since the Google team launched the word2vec tool[2], word2vec has been widely used in various NLP tasks. Word2vec is a word vector representation tool that Google obtained from 1 bil-


lion items of Google News data via training. It can better represent the distance relationship between words. Another widely used word vector representation method is GloVe [46]. Based on these word vector representation tools, Wang et al. [47] proposed a word vector representation method combining words and tags to classify text, and also obtained good results.


4.2. Capsule Networks

Although CNNs and RNNs have improved model performance in various computer tasks, they still have some limitations in extracting local features. Namely, they cannot obtain the spatial position relationships between features. The lack of such spatial position relationships affects the performance of the


classifier to a certain extent. To solve this problem, Hinton et al. [19] first introduced the concept of a capsule. Later Sabour et al. [20] proposed a feature aggregation method, in which a dynamic routing algorithm was used instead of max pooling, and capsule vectors were used instead of neurons. This method achieved optimal performance in image classification. Such models, which are


called capsule networks, can automatically learn the spatial position relationships between features, which is very suitable for image processing. Xi et al. [48] further investigated the potential of capsule networks on more complex data. After this, Hinton et al. [49] proposed a new dynamic route mechanism based on the expectation maximization (EM) clustering algorithm and used matrix


capsules instead of vector capsules to obtain better performance on some image data. LaLonde et al. [50] used capsule networks to improve the experimental

results of target segmentation. Jaiswal et al. [51] also obtained better results by introducing capsules into generative adversarial nets. Verma et al. [52] proposed a graph capsule network.

In various NLP tasks, the positional relationships of words in a sentence also affect the accuracy of the classification to a certain extent. If the positional relationships of the words can be learned during the training period, the model performance can be improved [21]. Therefore, the advantages of capsule networks have caused them to be rapidly deployed and applied in the field of NLP.


Gong et al. [22] proved that the aggregation mechanism in capsule networks has obvious advantages over the traditional pooling methods. Wang et al. [23] proposed a sentiment classification method by combining RNNs and capsule networks. In particular, RNNs were used to extract text features, and then an attention mechanism was used to construct capsule representations. Yang et


al. [24, 25] studied the capsule networks with dynamic routing for text classification and proposed three policies for the stable dynamic routing process to reduce the noise interference of capsules. Zhao et al. [26] published their article Towards Scalable and Reliable Capsule Networks for multi-label text classification and answering questions. Chen et al. [27] proposed a transfer capsule


network model for transferring document-level knowledge to aspect-level sentiment classification. Zhang et al. [53] proposed the use of capsule networks to solve the problem of cross-domain sentiment classification.

4.3. Siamese Networks

Siamese networks have been widely used for target tracking. Due to their


computational efficiency and robustness, Siamese networks have recently attracted great attention in visual tracking research [54]. Bertinetto et al. [55] proposed a basic tracking algorithm with a novel, fully convolutional Siamese network that was trained end-to-end to detect objects in video footage. Valmadre et al. [56] proposed a CFNet tracking algorithm, and this work is the first


to overcome the limitation of computational efficiency by interpreting the correlation filter learner, which has a closed-form solution, as a differentiable layer

in a deep neural network. Dong et al. [57] proposed a new quadruplet deep network with four branches to examine potential connections between training instances, aiming to achieve a more powerful representation. Lu et al. [58]

proposed a novel shrinkage loss to penalize or reduce the importance of easy training data, in order to balance the training data. Li et al. [59] proposed the Siamese region proposal network, which is trained offline and end-to-end on large-scale image pairs. In the medical field, the Siamese network architecture has been used for pairwise learning of information from two-dimensional sagittal intermediate-


weighted turbo spin echo slices that were obtained from similar locations on both knees [60]. Siamese networks achieve high performance in learning sentence similarity. Mueller et al. [28] presented a Siamese adaptation of the LSTM network for labeled data comprising pairs of variable-length sequences, and this model


is applied to assess the semantic similarity between sentences. Huang et al. [61] proposed a supervised topic model based on the Siamese network, which can trade off label-specific word distributions with document-specific label distributions in a uniform framework; experiments on real-world datasets validated that their model performs competitively in topic discovery, quantitatively


as well as qualitatively. Arpita et al. [62] proposed a novel approach known as Siamese convolutional neural network for cQA to determine the semantic similarity between current and archived questions. Neculoiu et al. [63] presented a deep architecture for learning a similarity metric on variable length character sequences. This model combines a stack of character-level bidirectional LSTMs


with a Siamese architecture. This model learns to project variable length strings into a fixed-dimensional embedding space by only using information about the similarity between the pairs of strings. In this study, considering the highly successful applications of Siamese networks in the fields of computer vision and NLP, Siamese Capsule Networks with

680

Global and Local Features were proposed.
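To make the shared-weight principle described above concrete, the following is a minimal PyTorch sketch of a Siamese similarity model. It is an illustration only, not the code of any of the cited works: the encoder architecture, the cosine-based contrastive loss, and all hyper-parameters are assumptions chosen for brevity.

```python
# Minimal Siamese similarity sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseEncoder(nn.Module):
    """Shared-weight encoder: both sides of a pair pass through this module."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, tokens):                  # tokens: (batch, seq_len) ids
        outputs, _ = self.lstm(self.embed(tokens))
        return outputs.mean(dim=1)              # fixed-size sequence embedding


def contrastive_loss(emb_a, emb_b, label, margin=0.5):
    """label = 1 for similar pairs, 0 for dissimilar pairs."""
    sim = F.cosine_similarity(emb_a, emb_b)
    return (label * (1.0 - sim) +
            (1.0 - label) * torch.clamp(sim - margin, min=0.0)).mean()


encoder = SiameseEncoder(vocab_size=10000)
a = torch.randint(1, 10000, (8, 20))            # first sentences of 8 pairs
b = torch.randint(1, 10000, (8, 20))            # second sentences of 8 pairs
y = torch.randint(0, 2, (8,)).float()           # pairwise similarity labels
loss = contrastive_loss(encoder(a), encoder(b), y)
loss.backward()
```

Because both inputs of a pair are encoded by the same module, gradients from the pairwise loss update a single set of weights, which is what allows the network to learn a similarity metric rather than a per-input classifier.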


5. Conclusion

In this study, a text classification framework based on Siamese capsule networks with global and local features was proposed. First, a GRU was added to obtain contextual information of local features, which improved the performance of the model. Subsequently, a Siamese network was used for comparative learning, so that the capsule networks could learn the global difference information between the different categories of samples. Finally, the local and global features were combined to further improve the classification performance of the model. Experiments were conducted on six of the most commonly used benchmark datasets. The experimental results showed that the proposed method could successfully learn the contextual information of text, the spatial distance information between categories, and the global difference information between samples, thereby improving the classification performance of the model.
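The following minimal PyTorch sketch summarizes, at a high level, how pooled local contextual features from a GRU can be concatenated with a stored global feature vector before classification. It is a deliberately simplified illustration rather than the proposed model's implementation: the capsule layers, the dynamic routing, and the Siamese training procedure are omitted, and all module names, the global-memory pooling, and the dimensions are assumptions.

```python
# Simplified global + local feature fusion sketch (not the authors' code).
import torch
import torch.nn as nn


class GlobalLocalClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes,
                 embed_dim=128, hidden_dim=64, global_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Local branch: a bidirectional GRU provides contextual features.
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        # Global memory: one learned vector per class, standing in for the
        # category-level semantic features accumulated by the Siamese branch.
        self.global_memory = nn.Parameter(torch.randn(num_classes, global_dim))
        self.classifier = nn.Linear(2 * hidden_dim + global_dim, num_classes)

    def forward(self, tokens):
        local, _ = self.gru(self.embed(tokens))        # (batch, seq, 2*hidden)
        local = local.mean(dim=1)                      # pooled local feature
        global_feat = self.global_memory.mean(dim=0)   # pooled global feature
        global_feat = global_feat.expand(local.size(0), -1)
        combined = torch.cat([local, global_feat], dim=-1)
        return self.classifier(combined)               # class logits


model = GlobalLocalClassifier(vocab_size=10000, num_classes=5)
logits = model(torch.randint(1, 10000, (4, 30)))       # shape (4, 5)
```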

Author Contribution Statement

Yujia Wu: Conceptualization, Methodology, Software, Validation, Visualization, Data curation, Writing - original draft preparation. Jing Li: Writing - review and editing, Funding acquisition. Jia Wu: Investigation, Project administration. Jun Chang: Supervision.


Declaration of Competing Interest

We wish to draw the attention of the Editor to the following facts, which may be considered as potential conflicts of interest, and to significant financial contributions to this work. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.


We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing, we confirm that we have followed the regulations of our institutions concerning intellectual property. We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He/she is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from [email protected].


6. Acknowledgements

This work is supported by the National Natural Science Foundation of China (61772180 and 41201404) and the Fundamental Research Funds for the Central Universities of China (2042015gf0009).

References


[1] G. Liu, J. Guo, Bidirectional LSTM with attention mechanism and convolutional layer for text classification, Neurocomputing. doi:10.1016/j.neucom.2019.01.078.
[2] Q. V. Le, T. Mikolov, Distributed representations of sentences and documents, in: ICML, 2014.
[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NIPS, 2013.
[4] W.-t. Yih, X. He, C. Meek, Semantic parsing for single-relation question answering, in: ACL, 2014.
[5] Y. Shen, X. He, J. Gao, L. Deng, G. Mesnil, Learning semantic representations using convolutional neural networks for web search, 2014, pp. 373–374.
[6] T. Mullen, N. Collier, Sentiment analysis using support vector machines with diverse information sources, in: EMNLP, 2004.
[7] S. Tan, X. Cheng, Y. Wang, H. Xu, Adapting naive Bayes to domain adaptation for sentiment analysis, in: Lecture Notes in Computer Science, 2009. doi:10.1007/978-3-642-00958-7_33.
[8] G. Wang, J. Sun, J. Ma, K. Xu, J. Gu, Sentiment classification: The contribution of ensemble learning, Decision Support Systems 57 (1) (2014) 77–93. doi:10.1016/j.dss.2013.08.002.
[9] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: ACL, 2011.
[10] B. Trstenjak, S. Mikac, D. Donko, KNN with TF-IDF based framework for text categorization, in: Procedia Engineering, 2014. doi:10.1016/j.proeng.2014.03.129.
[11] Y. Kim, Convolutional neural networks for sentence classification, in: EMNLP, 2014.
[12] R. Johnson, T. Zhang, Effective use of word order for text categorization with convolutional neural networks, in: HLT-NAACL, 2015.
[13] X. Zhang, J. J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, in: NIPS, 2015.
[14] D. Tang, B. Qin, T. Liu, Learning semantic representations of users and products for document level sentiment classification, in: ACL, 2015.
[15] P. Liu, X. Qiu, X. Huang, Recurrent neural network for text classification with multi-task learning, in: IJCAI, 2016.
[16] S. Lai, L. Xu, K. Liu, J. Zhao, Recurrent convolutional neural networks for text classification, in: AAAI, 2015.
[17] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, E. H. Hovy, Hierarchical attention networks for document classification, in: HLT-NAACL, 2016.
[18] B. Huang, Y. Ou, K. M. Carley, Aspect level sentiment classification with attention-over-attention neural networks, arXiv:1804.06536.
[19] G. E. Hinton, A. Krizhevsky, S. D. Wang, Transforming auto-encoders, in: ICANN, 2011.
[20] S. Sabour, N. Frosst, G. E. Hinton, Dynamic routing between capsules, in: NIPS, 2017.
[21] Y. Lin, S. Shen, Z. Liu, H. Luan, M. Sun, Neural relation extraction with selective attention over instances, in: ACL, 2016.
[22] J. Gong, X. Qiu, S. Wang, X. Huang, Information aggregation via dynamic routing for sequence encoding, in: COLING, 2018.
[23] Y. Wang, A. Sun, J. Han, Y. Liu, X. Zhu, Sentiment analysis by capsules, in: WWW, 2018.
[24] M. Yang, W. Zhao, J. Ye, Z. Lei, Z. Zhao, S. Zhang, Investigating capsule networks with dynamic routing for text classification, in: EMNLP, 2018.
[25] M. Yang, W. Zhao, L. Chen, Q. Qu, Z. Zhao, Y. Shen, Investigating the transferring capability of capsule networks for text classification, Neural Networks 118 (2019) 247–261.
[26] W. Zhao, H. Peng, S. Eger, E. Cambria, M. Yang, Towards scalable and reliable capsule networks for challenging NLP applications, in: ACL, 2019.
[27] Z. Chen, T. Qian, Transfer capsule network for aspect level sentiment classification, in: ACL, 2019.
[28] J. Mueller, A. Thyagarajan, Siamese recurrent architectures for learning sentence similarity, in: AAAI, 2016.
[29] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in: EMNLP, 2014.
[30] B. Pang, L. Lee, A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, in: ACL, 2004.
[31] J. Wiebe, T. Wilson, C. Cardie, Annotating expressions of opinions and emotions in language, Language Resources and Evaluation 39 (2005) 165–210.
[32] B. Pang, L. Lee, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, in: ACL, 2005.
[33] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts, Recursive deep models for semantic compositionality over a sentiment treebank, in: EMNLP, 2013.
[34] X. Li, D. Roth, Learning question classifiers, in: COLING, 2002.
[35] P. Zhou, Z. Qi, S. Zheng, J. Xu, H. Bao, B. Xu, Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling, in: COLING, 2016.
[36] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, in: ACL, 2018.
[37] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, R. Kurzweil, Universal sentence encoder for English, in: EMNLP, 2018.
[38] L. J. Zhang, S. Wang, B. Liu, Deep learning for sentiment analysis: A survey, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.
[39] A. M. Dai, Q. V. Le, Semi-supervised sequence learning, in: NIPS, 2015.
[40] T. Miyato, A. M. Dai, I. J. Goodfellow, Adversarial training methods for semi-supervised text classification, in: ICLR, 2016.
[41] B. Wang, Disconnected recurrent neural networks for text categorization, in: ACL, 2018.
[42] J. Zeng, J. Li, Y. Song, C. Gao, M. R. Lyu, I. King, Topic memory networks for short text classification, in: EMNLP, 2018.
[43] T. Shen, T. Zhou, G. Long, J. Jiang, C. Zhang, Bi-directional block self-attention for fast and memory-efficient sequence modeling, arXiv:1804.00857.
[44] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, in: ACL, 2014.
[45] R. Johnson, T. Zhang, Deep pyramid convolutional neural networks for text categorization, in: ACL, 2017.
[46] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: EMNLP, 2014.
[47] Y. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao, L. Carin, Joint embedding of words and labels for text classification, in: ACL, 2018.
[48] E. Xi, S. Bing, Y. Jin, Capsule network performance on complex data, arXiv:1712.03480.
[49] G. E. Hinton, S. Sabour, N. Frosst, Matrix capsules with EM routing, in: ICLR, 2018.
[50] R. LaLonde, U. Bagci, Capsules for object segmentation, arXiv:1804.04241.
[51] A. Jaiswal, W. AbdAlmageed, P. Natarajan, CapsuleGAN: Generative adversarial capsule network, in: ECCV Workshops, 2018.
[52] S. Verma, Z.-L. Zhang, Graph capsule convolutional neural networks, arXiv:1805.08090.
[53] B. Zhang, X. Xu, M. Yang, X. Chen, Y. Ye, Cross-domain sentiment classification by capsule network with semantic rules, IEEE Access 6 (2018) 58284–58294.
[54] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, W. Hu, Distractor-aware Siamese networks for visual object tracking, arXiv:1808.06048.
[55] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr, Fully-convolutional Siamese networks for object tracking, arXiv:1606.09549.
[56] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, P. H. S. Torr, End-to-end representation learning for correlation filter based tracking, in: CVPR, 2017, pp. 5000–5008.
[57] X. Dong, J. Shen, D. Wu, K. Guo, X. Jin, F. M. Porikli, Quadruplet network with one-shot learning for fast visual object tracking, IEEE Transactions on Image Processing 28 (2017) 3516–3527.
[58] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, M.-H. Yang, Deep regression tracking with shrinkage loss, in: ECCV, 2018.
[59] B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High performance visual tracking with Siamese region proposal network, in: CVPR, 2018, pp. 8971–8980.
[60] G. H. Chang, D. T. Felson, S. Qiu, A. Guermazi, T. D. Capellini, V. B. Kolachalama, Assessment of knee pain from MRI scans using a convolutional Siamese network, 2019.
[61] M. Huang, Y. Rao, Y. Liu, H. Xie, F. L. Wang, Siamese network-based supervised topic modeling, in: EMNLP, 2018.
[62] A. Das, H. Yenala, M. K. Chinnakotla, M. Shrivastava, Together we stand: Siamese networks for similar question retrieval, in: ACL, 2016.
[63] P. Neculoiu, M. Versteegh, M. Rotaru, Learning text similarity with Siamese recurrent networks, in: ACL, 2016.


Biography of the author(s)

Yujia Wu received the B.S. degree in Electronic Engineering from Hubei University of Economics, Wuhan, China, in 2011, and the M.S. degree in Electronic and Communication Engineering from South-Central University for Nationalities, Wuhan, China, in 2014. He is currently a Ph.D. candidate in the School of Computer Science, Wuhan University, Wuhan, China. His main research interests include data mining and natural language processing. ([email protected])

Jing Li received the Ph.D. degree from Wuhan University, Wuhan, China, in 2006. He is currently a Professor in the School of Computer Science, Wuhan University, Wuhan, China. His research interests include data mining and multimedia technology. ([email protected])

Jia Wu received the Ph.D. degree in computer science from the University of Technology Sydney, Ultimo, NSW, Australia. He is currently an Assistant Professor with the Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney. Prior to that, he was with the Centre for Artificial Intelligence, University of Technology Sydney. He is also an Honorary Professor in the School of Computer Science, Wuhan University. His current research interests include data mining, computer vision, and machine learning. Dr. Wu is an Associate Editor of ACM Transactions on Knowledge Discovery from Data (TKDD). He received the Best Paper Award in the Data Science Track (SDM 2018), the Best Student Paper Award (IJCNN 2017), and the Best Paper Candidate Award (ICDM 2014). Since 2009, he has authored or co-authored over 60 refereed journal and conference papers in these areas, in venues such as the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Cybernetics, Pattern Recognition, the International Joint Conference on Artificial Intelligence, the AAAI Conference on Artificial Intelligence, the International Conference on Data Engineering, the International Conference on Data Mining, the SIAM International Conference on Data Mining, and the Conference on Information and Knowledge Management.

Jun Chang received the Ph.D. degree from Wuhan University, Wuhan, China, in 2011. He is currently an Assistant Professor in the School of Computer Science, Wuhan University, Wuhan, China. His current research interests include computer vision, large-scale machine learning, and stream data mining. ([email protected])