Traffic scene semantic segmentation using self-attention mechanism and bi-directional GRU to correlate context

Min Yan, Junzheng Wang, Jing Li∗, Ke Zhang, Zimu Yang

Key Laboratory of Ministry of Industry and Information Technology, Beijing Institute of Technology, Beijing 100081, P. R. China
Abstract

Context information plays an important role in semantic segmentation of urban traffic scenes, which is one of the key tasks in environment perception for intelligent platforms (such as unmanned vehicles), and has inspired wide interest from researchers. This paper synthesizes three considerations: feature space correlation, information distributed over long distances in the image plane, and long-distance sequence information, and proposes a combination of the self-attention mechanism and a bi-directional gated recurrent unit (GRU) neural network to extract various contextual information on top of a deep feature network, so as to achieve better semantic segmentation performance. In order to explore the optimal implementation, two kinds of topological connections are attempted: one places the self-attention branch and the bi-directional GRU branch in series, and the other in parallel. In addition, in order to train the network better and achieve more precise segmentation results, a cascade refinement supervision method using two losses is proposed. Experiments carried out on the Cityscapes, Mapillary, CamVid and KITTI semantic segmentation datasets demonstrate the outstanding performance and robust generalization ability of our method.

Keywords: semantic segmentation, context, self-attention, gated recurrent unit

∗Corresponding author. Email address: mini [email protected] (Jing Li)
1. Introduction

Visual information plays an important role in people's perception of the surrounding environment and is characterized by its richness. Similarly, the image information captured by cameras plays an important role in the environment perception tasks of unmanned platforms. With the enhancement of hardware computing power and the development of deep learning, more and more perception tasks are implemented by deep networks, such as object detection [1, 2], object tracking [3, 4], pose estimation [5], depth estimation [6, 7], scene semantic segmentation [8, 9, 10, 11], etc. Especially in object detection, depth estimation, and semantic segmentation, end-to-end deep networks greatly outperform traditional methods. Category information about the scene is key for the intelligent decision-making of unmanned platforms. Pixel-level semantic annotation of images achieves class perception of the scene and is generally called semantic segmentation. Semantic segmentation of urban scenes is a part of this task. Pixel-level category annotation does not label each pixel separately according to the information of a single pixel; context information plays a very important role in it. Considering the different distances of objects in the scene, there may be different scales, and the range of context that needs to be corre-
lated is certainly not fixed. So what kind of information should be depended on, and how to correlate it, are worth studying.

Since Fully Convolutional Networks (FCNs) [12] achieved breakthrough results in semantic segmentation tasks, a large number of end-to-end methods based on FCNs have been proposed [13, 14, 15, 16]. Very deep networks can also be trained well thanks to the ResNet structure [17], the rectified linear unit (ReLU) activation function [18], batch normalization (BN) [19], etc. The increase of depth greatly enhances the modeling ability of the network, so that semantic segmentation of complex scenes can also be very effective. The FCN-based semantic segmentation net-
works are essentially Convolutional Neural Networks (CNNs). Theoretically, with the increase of depth, the perceptual field of view of CNNs increases step by step. However, the context correlation directly obtained by CNNs is not strong enough, and the actual field of view is much smaller than the theoretical field of view [20]. So after full convolution encoding and decoding networks, in
order to correlate context in a wider perceptual field of view, subsequent researchers proposed the pyramid pooling module (PPM) in PSPNet[16] and the atrous spatial pyramid pooling module (ASPP) in DeepLabs[15, 21] on the basis of deep feature networks. These structures focus on the information distributed in the spatial distance of the image plane and are a relatively mechanical context
design with a limited range of association. In addition, the self-attention mechanism [22] is also used to extract context information [23, 24, 25, 26]. This method is based on the correlation between different projections of the features and is able to establish correlations across the entire image. It is simple to implement and requires little computation. In addition to the meth-
ods mentioned above for context association, there are other kinds of methods [27, 28, 29] based on recurrent neural network (RNN). The output of RNN is not only related to the distance of the input, but also to the sequence order, so these methods get the ordered long-distance context information by using the RNN structure.
At present, most works adopt only one of these methods of associating non-local context. This paper synthesizes three considerations: feature space correlation, information distributed over long distances in the image plane, and long-distance sequence information, and proposes a method combining the self-attention mechanism and RNN to implement context correlation. Specifically,
we choose the GRU [30] in RNN because it converges faster than vanilla RNN and Long Short-Term Memory (LSTM)[31], and takes less GPU storage[28]. In addition, we use two parallel bi-directional GRUs, one is responsible for the horizontal direction and the other is responsible for the vertical direction, which can make use of the information distributed on the left and right sides and the
upper and lower sides of the target pixel at the same time. The parallel method
is adopted because the GRU is computed serially, and its time efficiency is not high on parallel hardware, so using a parallel structure will be more efficient in terms of time consumption. The self-attention mechanism is also a special case of non-local operations in the embedded Gaussian version mentioned in the
non-local neural network article[32]. As far as we know, the combination of these two modules has only been used in natural language processing (NLP), and we have not seen the case in semantic segmentation yet. After getting the basic features from the basic feature network, there may be many different implementations of how to combine the self-attention mechanism
with GRU, such as considering the way of series connection, in which the output of one structure acts as the input of another structure. At this time, it can be self-attention first, GRU later, or vice versa. There are also parallel ways, that is the two structures share the same input, and then the outputs of the two structures are concatenated for unified classification. The self-attention module
calculates the correlation between each pair of pixels and normalizes the weight of one pixel with respect to each pixel. Finally, the context information is obtained by weighting the information of each pixel. From the results of existing articles [23, 24], which use self-attention module to achieve context-related semantic segmentation, we can see that through the self-attention module, pixels
of the same category as the target pixel provide a larger proportion of information in context information. And this association is global. The collection of global context information makes features more salient and greatly improves the accuracy of semantic annotation. The bi-directional GRU is a kind of sequential modeling method. The output depends on the order and the distance of the
sequence. Moreover, the bi-directional GRU that we use can only correlate the information of a row or a column of pixels. We tend to first use the self-attention module to have a global constraint on the entire network, and collect global information for classification, then revise the previous results by GRU according to one row or column of information. In order to better achieve the desired
results, we propose a two-loss supervising method. One loss is used to supervise the learning of the self-attention module, to guide the self-attention module to learn con-
text information well for classification, and the other loss is used to supervise the learning of the converged features of the self-attention module and the bi-directional GRU module, so as to further improve the segmentation accuracy under the
premise of self-attention module. In this way, the self-attention module can be implemented as the main constraint, and then additional context information can be collected through GRU for refinement. Specifically, GRU can act on the output of self-attention module that has collected global context, or on the input of self-attention module that has not yet been associated with global context,
that is, the so-called difference between series and parallel connections. Analytically, we may prefer the former, but our experiments on the Cityscapes [33] dataset show that under the guidance of two losses, when the number of training iterations is large enough, there is no significant difference in the evaluation metric mean intersection over union (mIoU) [12] between the two cases. We even find
that under the supervision of two losses, the parallel network improves faster in the later period of training. So this paper mainly recommends the parallel implementation. We test the parallel method on Cityscapes [33] dataset, Mapillary [34] dataset, CamVid [35] dataset, and KITTI [36] semantic segmentation dataset and compare it with other methods.
In this paper, our main contributions are as follows:
(1) For the first time, self-attention mechanism and bi-directional GRU are combined for semantic segmentation of traffic scenes.
(2) We compare two kinds of topological connections between the self-attention module and the bi-directional GRU module and propose a cascading refinement supervision strategy with two losses.
(3) We validate the effectiveness of the algorithm on several datasets, obtain results comparable to state-of-the-art methods, and test the generalization performance of the algorithm.
2. Related Works
Image segmentation methods can be divided into three categories: supervised [37], semi-supervised [38] and unsupervised [39]. The difference among them is whether the labels of the training data are provided in whole, in part, or not at all. In addition, the object of image segmentation can be a single image [16], stereo images [40] or a video sequence [41]. The segmentation methods
can also be divided into binary segmentation, such as foreground-background segmentation [42], moving object segmentation [41]; super-pixel segmentation [43]; multi-category semantic segmentation [12]. They can also be divided into non-deep network methods and deep network methods. The non-deep network methods artificially abstract image segmentation into an optimization problem,
then solve the problem by using optimization methods, such as high-order energy optimization [42], Laplacian optimization [44], sub-Markov random walk [45], lazy random walk [46], submodular function optimization [47], etc.. The deep network methods realize image segmentation by constructing deep network and using end-to-end methods. The methods proposed in this paper are super-
vised multi-category semantic segmentation on a single image based on deep networks. With the development of deep learning, the task of semantic segmentation has attracted a lot of researchers’ attention, and a large number of network models have been proposed to improve the accuracy of segmentation
[15, 16, 21, 23, 24] and to achieve real-time performance [48, 49, 50]. One way to improve the segmentation accuracy is to correlate the context information of the target. For example, PSPNet[16] uses multiple average pooling layers of different core sizes to correlate context information of different regions in space. Similarly, the Deeplab series of articles[15, 21] acquire context information with
different field of view sizes by using multiple dilated convolutions with different dilated rates. Both of them achieve very good results. But the disadvantage of these kinds of methods is that the design of context is relatively mechanical, and the scope of context is fixed at the time of design, which is not flexible
enough. Recently, the self-attention mechanism proposed by Vaswani et al. [22] shows strong
contextual modeling ability in semantic segmentation [23, 24]. In those methods, the output features of the basic feature network are projected into two spaces, and the inner product is calculated between the features in the two projected spaces. A third set of projected features is then weighted according to the inner product to construct a new feature map, which stands for the context
information. This information can be combined with basic features for better classification. Among them, OCNet [24] improves the accuracy of the network by combining self-attention mechanism with PPM or ASPP to enhance the context association, while DANet [23] improves network performance by weighting the features in spatial dimension and channel dimension with inner product cor-
relation to correlate the context information in two dimensions. RNN has been widely used in NLP related applications due to its powerful sequence modeling capabilities [51, 52, 53]. Recently, some researchers have used it in semantic segmentation tasks [27, 28] to correlate long-distance context information. The paper [28] uses a set of up-down, left-right bidirectional GRU series
structure to achieve context information collection, and the paper [27] collects context information by combining several different jump steps of up-down, left-right bidirectional GRU series structures in parallel, and gets a good labeling accuracy. Different from the above methods, we propose a method combining self-attention mechanism and RNN to correlate context. On the one hand, it
reflects the flexibility compared with PPM and ASPP. On the other hand, it combines two different context association methods, which formally can better guarantee the richness of context to enhance the expressive ability of features. In order to improve the performance of a deep network, in addition to the study of effective basic network structure, there are many recent studies focused
on how to use the basic network structure scientifically and effectively and how to design effective loss functions. For example, Wang et al. [54] proposed to improve deep network performance by adding extra supervision as an intermediate loss for directly feeding supervision into the hidden layers. Dong et al. [55] applied a triplet loss based on a siamese network to extract features more effectively.
Similarly, Dong et al. [56] designed a shared network with four branches that receives multi-tuples of instances as inputs and is connected by a novel loss function consisting of a pair loss and a triplet loss, which better separates samples of different classes in the feature space and obtains a smaller classification error, making better use of the proposed network structure.
In contrast, our method adds two context modules in series or in parallel on the same feature network and uses two losses to effectively supervise the training of the network. In addition, Wang et al. [37] proposed to use two networks to obtain spatiotemporal saliency in video sequences. Firstly, the static saliency network was used to detect the static saliency of a single frame, and then the
output of the static saliency network and the continuous frame were used as the input of the dynamic saliency network to obtain spatiotemporal saliency maps. The training of our networks with two losses is similar to this method, which is carried out in stages. First, we get a preliminary result and then combine more information to further enhance the result, but our newly added information
considers the context information in the feature space.
3. Methods

In this paper, the basic feature network is used to extract the basic features of the input image. Then, two context modules, the self-attention module and the bidirectional GRU module, are used to enrich the features. Next, the corresponding
loss function is constructed to supervise the training. Finally, a deep network model which can annotate the class of the image at the pixel level is obtained. Specifically, we try two kinds of network connections, as shown in Fig. 1. The two context modules are connected in series and in parallel, respectively.

Pipeline: The overall networks are shown in Fig. 1. Specifically, in Fig. 1 (a),
the basic feature map, X, which is of dimension 2048, is obtained by inputting an outdoor traffic scene picture into the ResNet feature network. We find that in several advanced network implementations in semantic segmentation, such as PSPNet [16], DANet [23], OCNet [24], CCNet [25] etc., the outputs of the
Figure 1: This figure shows two computational diagrams used in this paper. First, the basic feature network ResNet is used to extract the basic features of the input image. Then, two context modules, self-attention module and bidirectional GRU module, are used to enrich the features. Two losses are used to supervise the training of the network, and two kinds of topological connections between two context modules are attempted. (a) is the case of series connection, and (b) is the case of parallel connection. This figure should be printed in color.
convolutional layer adjacent to the classification layer all have 512 channels. Consid-
ering that the number of target classes in traffic scene semantic segmentation task is much smaller than that in ImageNet classification task, whose data are those we use to pre-train the feature network, we directly reduce the feature dimension to the target dimension 512 to reduce the amount of computation, although it may be more expressive to extract context information with higher
dimensions. Specifically, we feed feature map X into a 1*1 convolution layer with BN and ReLU layers to generate a new feature map Y with 512 channels. Next, based on the self-attention module, the global context information is computed. Then the feature map Z is obtained by adding Y with λ weighted global context information. Through 1*1 convolution, 8 times upsampling and
softmax function, the outputs and the ground truth labels are used to construct the cross-entropy loss which is expressed as Loss1. In addition, two feature maps with long-distance sequence relationship are further constructed through up-down, left-right bi-directional GRU based on feature map Z. These two features are concatenated with Z to form a new feature map. Then after two 1*1
convolutions, 8 times upsampling and a softmax function, the cross-entropy loss Loss2 is further constructed. The two losses, Loss1 and Loss2, work together to guide the training process of the network. After training, we get a network that outputs the confidence of each pixel belonging to the given categories. The difference in Fig. 1 (b) is that the two bidirectional GRUs'
inputs change from the feature Z associated with global context information to the basic feature map Y after dimensionality reduction.
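To make the two-loss training concrete, the following is a minimal NumPy sketch of the cascade supervision (not the authors' code): probs_attention and probs_fused stand for the upsampled softmax outputs of the branch supervised by Loss1 and of the convergent branch supervised by Loss2, and the equal-weight sum of the two cross-entropy terms is an assumption, since the paper only states that the two losses work together.

```python
import numpy as np

def pixel_cross_entropy(probs, labels, num_classes, eps=1e-8):
    """Mean pixel-wise cross-entropy; probs is (H, W, C) softmax output, labels is (H, W) int."""
    onehot = np.eye(num_classes)[labels]              # (H, W, C) one-hot ground truth
    return -np.mean(np.sum(onehot * np.log(probs + eps), axis=-1))

def cascade_refinement_loss(probs_attention, probs_fused, labels, num_classes=19):
    """Loss1 supervises the self-attention branch, Loss2 supervises the branch that
    fuses self-attention and bi-directional GRU features; both guide training."""
    loss1 = pixel_cross_entropy(probs_attention, labels, num_classes)
    loss2 = pixel_cross_entropy(probs_fused, labels, num_classes)
    return loss1 + loss2                              # equal weighting is an assumption
```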
3.1. Basic Feature Network

For the basic feature network, we adopt the ResNet101 network proposed in reference [17]. This residual network shows strong performance in visual tasks
such as semantic segmentation and image classification. In addition, considering that many excellent semantic segmentation networks have implementations based on ResNet101, we also adopt this structure to facilitate comparison with similar works. There are 101 layers in the network (including convolution layer 10
and full connection layer). Excluding the first layer and the last layer, every
three layers form a block, and each block is a residual structure. The first seven blocks are mainly composed of convolution, BN, and ReLU operations. The middle convolution layers of the remaining blocks are replaced by dilated convolutions. Through this way, with the same amount of computation, the output resolution can be kept the same while the receptive field can be enlarged. We
use the first 100 layers of ResNet101, that is, the last average pooling layer and fully connected layer are removed. Through this network, the output feature map is reduced to 1/8 of the input resolution, and the output feature has a dimension of 2048. This feature map is expressed as X ∈ R^{H×W×C1}, and its specific shape is [h/8, w/8, 2048], where h and w represent the height
and width of the input image, respectively. The channel is further reduced by convolution to get Y ∈ R^{H×W×C2}, with specific shape [h/8, w/8, 512].

3.2. Self-attention Context Module

The self-attention mechanism used in this paper is a special case of non-local operations in the embedded Gaussian version mentioned in reference [32].
There are many other ways of non-local operations like this. It is mentioned in reference [32] that there is no significant difference in the performance of different implementations in image classification tasks. However, considering a large number of network implementations, such as generative adversarial networks, the SAGAN [57]; machine translation networks, the Transformer [22]; and im-
age generation networks, the Image Transformer [58]; adopt the self-attention mechanism and have achieved good performance, we also choose this kind of implementation to follow the experience of predecessors.

As shown in Fig. 1, the input feature map is Y. After three 1*1 convolution operations, we obtain key = K ∈ R^{H×W×256}, query = Q ∈ R^{H×W×256}, and value = V ∈ R^{H×W×512}, respectively. The actual operations are

K = Y · W_k + b_k,   (1)
Q = Y · W_q + b_q,   (2)
V = Y · W_v + b_v,   (3)

where W_k, W_q ∈ R^{512×256}, b_k, b_q ∈ R^{256}, W_v ∈ R^{512×512} and b_v ∈ R^{512}. Then reshape Q to Q' ∈ R^{N×256}, reshape K to K' ∈ R^{N×256}, and reshape V to V' ∈ R^{N×512}, where N = H × W is the number of pixels in the feature map Y. Transposing K' gives K'' ∈ R^{256×N}. The inner product between the feature vector of each pixel in one transformation space and the feature vector of each pixel in the other transformation space is computed by

S = Q' · K'' ∈ R^{N×N}.   (4)

Then, by normalizing the inner products in each row with

W = softmax(S / √512) ∈ R^{N×N},   (5)

the relative weight of a pixel with respect to each pixel is obtained. √512 is actually √(2 × 256), the square root of twice the feature dimension of key and query. The main purpose of this division is to ensure that the inner product does not grow too large when the dimension of the projection space is set relatively high, so as to avoid causing the softmax function to enter a very small gradient region [22]. The global context information is obtained by multiplying the projected new feature V' with the normalized weights:

C = W · V' ∈ R^{N×512}.   (6)

Then reshape C to C' ∈ R^{H×W×512}. A new and more expressive feature map Z is constituted by adding λ-weighted C' to the input Y:

Z = λ · C' + Y ∈ R^{H×W×512},   (7)
where λ is initialized to 0, so that the network may start training from the state without the self-attention mechanism and increase the proportion of context as needed.
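For reference, Eqs. (1)-(7) can be written as a short NumPy sketch; the function and variable names are ours, and the 1*1 convolutions are expressed as per-pixel matrix multiplications, which is what they reduce to on a flattened feature map.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_context(Y, Wk, bk, Wq, bq, Wv, bv, lam):
    """Eqs. (1)-(7): Y is (H, W, 512); Wk, Wq are (512, 256); Wv is (512, 512)."""
    H, W, C = Y.shape
    N = H * W
    Yf = Y.reshape(N, C)
    K = Yf @ Wk + bk                              # Eq. (1), shape (N, 256)
    Q = Yf @ Wq + bq                              # Eq. (2), shape (N, 256)
    V = Yf @ Wv + bv                              # Eq. (3), shape (N, 512)
    S = Q @ K.T                                   # Eq. (4), shape (N, N)
    A = softmax(S / np.sqrt(512.0), axis=1)       # Eq. (5), row-normalized weights
    ctx = A @ V                                   # Eq. (6), global context, (N, 512)
    return lam * ctx.reshape(H, W, C) + Y         # Eq. (7): Z = lambda * C' + Y
```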
3.3. Parallel Bidirectional GRU Context Module

Considering that RNN is serially calculated, which consumes too much time compared with highly parallel computing structures such as convolution, this paper does not use the serial connection between vertical and horizontal bidirectional GRUs as in references [27, 28]; instead, a parallel connection is used to form a parallel bidirectional GRU context module. According to Fig. 1, the input of this module may be Y or Z, both of which belong to R^{H×W×512}. We construct two parallel bidirectional GRUs, which are responsible for hor-
izontal and vertical directions, respectively. A row or a column of the image is input into the bidirectional GRU network pixel by pixel as a sequence. For the sake of illustration, we describe a row as a sequence: assume a row is (x_1, x_2, ..., x_W). The input order of the forward GRU is from x_1 to x_W, and that of the reverse GRU is from x_W to x_1. The outputs are (y_{11}, y_{12}, ..., y_{1W})
and (y_{2W}, ..., y_{22}, y_{21}), respectively. The specific GRU has only one layer, and dropout is applied to the output. Its input and output formulas are as follows [59]:

r_t = σ(W_r · [h_{t−1}, x_t] + b_r),   (8)
z_t = σ(W_z · [h_{t−1}, x_t] + b_z),   (9)
h̄_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t] + b_h),   (10)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̄_t,   (11)
y_t = σ(W_o · h_t).   (12)
In the above formulas, [ ] represents concatenation; · represents matrix multiplication; and ⊙ represents the element-wise product of vectors. The dimension of the hidden state h is 256, and its initial value is 0. Subscript t denotes the sequential index of the input. The lowercase letter r denotes the reset gate, indicating the degree to which the candidate hidden state uses historical information: the smaller it is, the less historical memory is retained. The lowercase letter z denotes the update gate, which determines the proportion of updates to the hidden state: the larger it is, the more the state is updated. The low-
ercase letter y represents the output of the network. The dimensions of r, z, and y are all 256. σ is the logistic sigmoid function and tanh is the hyperbolic tangent function. W_r, W_z, W_h ∈ R^{256×(256+512)}, and W_o ∈ R^{256×256}. Inputting (x_1, x_2, ..., x_W) in forward and reverse order to the two GRUs yields (y_{11}, y_{12}, ..., y_{1W}) and (y_{2W}, ..., y_{22}, y_{21}), respectively. Then concatenating one with the other reversed one gives ([y_{11}, y_{21}], [y_{12}, y_{22}], ..., [y_{1W}, y_{2W}]), which represents the output feature of the bi-directional GRU at each pixel of a row. Vertical processing is the same, except that the input becomes a column. In this way, the context features of the horizontal direction can be obtained by inputting each row separately, and the context features of the vertical direction can be obtained by inputting each column separately. The output of the parallel bi-directional GRU context module is obtained by concatenating the two context features.
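A minimal NumPy sketch of one horizontal pass of the bi-directional GRU, following Eqs. (8)-(12), is given below; the hidden size 256 and input size 512 follow the text, while the helper names and the parameter packing are ours. The vertical GRU is identical, with columns as sequences.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wr, br, Wz, bz, Wh, bh, Wo):
    """One GRU step, Eqs. (8)-(12); [h, x] denotes concatenation, * is element-wise."""
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ hx + br)                                     # reset gate, Eq. (8)
    z = sigmoid(Wz @ hx + bz)                                     # update gate, Eq. (9)
    h_bar = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # Eq. (10)
    h = (1.0 - z) * h_prev + z * h_bar                            # Eq. (11)
    return h, sigmoid(Wo @ h)                                     # Eq. (12): new state, output

def bi_gru_row(row, fwd_params, bwd_params, hidden=256):
    """row is (W, 512); returns (W, 512): forward and backward outputs concatenated."""
    W = row.shape[0]
    h_f, h_b = np.zeros(hidden), np.zeros(hidden)
    out_f, out_b = [], []
    for t in range(W):                       # forward scan x_1 ... x_W
        h_f, y = gru_step(row[t], h_f, *fwd_params)
        out_f.append(y)
    for t in reversed(range(W)):             # backward scan x_W ... x_1
        h_b, y = gru_step(row[t], h_b, *bwd_params)
        out_b.append(y)
    out_b.reverse()                          # re-align with pixel order before concatenating
    return np.stack([np.concatenate([f, b]) for f, b in zip(out_f, out_b)])
```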
4. Experiments

4.1. Datasets

In order to verify the effectiveness of the algorithm, we have carried out
experiments on four datasets: the Cityscapes [33], KITTI semantic segmentation [36], CamVid [35], and Mapillary [34] datasets.

Cityscapes: This dataset is collected from 50 different cities. The image resolution is 2048*1024. There are 34 classes (e.g. building, tree, road, car), and only 19 of them are used for evaluation. The dataset contains 20000 coarsely
annotated images and 5000 high quality pixel-level finely annotated images. The finely annotated images are divided into training set, validation set, and testing set with 2975 images, 500 images, and 1525 images respectively. The pixel-level labels of the training set and validation set are publicly available. The labeled data of the test set is not open to the public, and the results of the algorithm
need to be submitted to the official website for evaluation. In this paper, only the finely annotated images are used, not the coarsely annotated images.

KITTI Semantic Segmentation: The KITTI dataset itself is a dataset of urban traffic scenes, and is divided into data for various tasks, such as stereo depth estimation, optical flow estimation, odometry, object detection, object
tracking, etc. In this paper, we use the semantic segmentation dataset, which
contains 200 images in the training set and also 200 images in the testing set. The annotated categories of this data are identical to those of Cityscapes, and there are 34 categories. Only 19 of these categories are used in this paper. The size of the image is about 1242*375, and we only use the training set to verify
the generalization performance of the algorithm.

CamVid: The CamVid dataset is very small. It contains 367 images in the training set, 101 in the validation set and 233 in the testing set. A total of 11 classes (e.g. building, tree, road, pedestrian) have been tagged. The tagged data are publicly available. The image resolution we use is 360*480, as with
SegNet [13].

Mapillary: This dataset contains 20,000 images with 65 categories (e.g. rider, street light), in which 18,000 are used for training and 2,000 for validation. The resolution is high but varies greatly, with 3265*2449, 4032*3024, 4608*3456, 5248*3936, etc.
4.2. Implementation Details

Our method is implemented based on TensorFlow on a computer with two GTX 1080 Ti GPUs. The initial learning rate lr is set to 0.01 and decreases as the number of iterations, iter, increases according to the following formula:

lr = lr · (1 − iter / total_iter)^{0.9},   (13)

where total_iter represents the total number of iterations for training. In fact, the learning rates of some parameters of the network are different. For example, the basic feature network is pre-trained on the ImageNet dataset, so the learning rate of the basic feature network is reduced by a factor of 10, and the same goes for the weighting factor λ in the self-attention module. The optimization method we adopt is the momentum optimizer with momentum 0.9.
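The schedule of Eq. (13) is the standard "poly" policy; a one-line sketch, with an optional multiplier for the parameter groups whose learning rate is reduced by a factor of 10, is given below (the helper name and lr_mult argument are ours):

```python
def poly_learning_rate(base_lr, it, total_iter, power=0.9, lr_mult=1.0):
    """Poly learning-rate schedule of Eq. (13); set lr_mult=0.1 for the pre-trained
    backbone and the self-attention weight lambda (an interpretation of the text)."""
    return lr_mult * base_lr * (1.0 - it / float(total_iter)) ** power
```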
L2 regularization is used to improve generalization ability, and the weight decay is set to 0.0005. During training, the data are augmented by random left-right flipping and random resizing between 0.5 and 2. The paper [16] shows that larger cropsize and batchsize can lead to better results. Specifically, for
different datasets and during training or inference, the two hyperparameters
are set differently, which will be mentioned later. The implementation of BN is synchronized on multiple GPUs.

4.3. Results on Cityscapes Dataset and KITTI Dataset

4.3.1. Ablation Study on the Cityscapes Validation Set

Experiment setup: Ablation studies are carried out in order to verify the
validity of our proposed structure and the two-loss supervision method. Specifically, we conduct experiments with several settings, including removing both context modules (denoted by RL2), using only the self-attention context module or the bi-directional GRU context module (denoted by RAL2 and RGL2, respectively), and two context modules in series and in parallel (denoted by RAGL2 and RA|GL2, re-
spectively) under the supervision of only one loss, Loss2, and using only the bi-directional GRU context module (denoted by RGL1L2) and two context modules in series and in parallel (denoted by RAGL1L2 and RA|GL1L2, respectively) under the supervision of Loss1 and Loss2. To avoid the possibility that using two losses is merely equivalent to increasing the learning rate, we multiply the loss by two when only
one loss is used. An experiment with PSPNet is also conducted for comparison, based on the same basic feature network and hyperparameter settings. Note that this PSPNet does not use the auxiliary loss in reference [16]. We provide intersection over union (IoU), pixel accuracy (ACC) and runtime (Time), which is the average time of 500 inference runs for an image with cropsize
672*672, using a GeForce GTX 1080 Ti GPU, while IoU is the main metric used for comparison.

Experiments in this subsection are carried out with batchsize 4 and cropsize 672*672. First, total_iter = 40K iterations are trained on the Cityscapes training set for every model. Then, to further verify that two losses can indeed
better supervise the training of networks adopting two context modules, 50K more iterations are trained on the basis of the 40K iterations (note that the learning rate here is calculated according to total_iter = 90K with the initial iteration number 40K) for the models with two context modules, to obtain 90K-iteration models. We run
inference with a single scale and cropsize 672*672. As the resolution of images in the Cityscapes dataset is 1024*2048, we clip each image into 672*672 crops with stride 480, input them to the network sequentially, and then splice the outputs as the result.
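The tiled inference described above can be sketched as follows; predict is assumed to map a crop to per-pixel class probabilities, and averaging the probabilities where tiles overlap is one reasonable way to splice the outputs (the paper does not say how overlaps are merged).

```python
import numpy as np

def sliding_window_inference(image, predict, num_classes, crop=672, stride=480):
    """Run `predict` on crop*crop windows taken with the given stride and average
    the class probabilities where windows overlap."""
    H, W, _ = image.shape
    probs = np.zeros((H, W, num_classes))
    counts = np.zeros((H, W, 1))
    ys = list(range(0, max(H - crop, 0) + 1, stride))
    xs = list(range(0, max(W - crop, 0) + 1, stride))
    if ys[-1] + crop < H:                      # make sure the bottom border is covered
        ys.append(H - crop)
    if xs[-1] + crop < W:                      # make sure the right border is covered
        xs.append(W - crop)
    for y in ys:
        for x in xs:
            window = image[y:y + crop, x:x + crop]
            probs[y:y + crop, x:x + crop] += predict(window)   # (crop, crop, num_classes)
            counts[y:y + crop, x:x + crop] += 1.0
    return probs / counts
```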
Experiment results: The experimental results obtained are shown in Table 1. Two sets of indicators are provided when two losses are used. The first line is the
indicators of the branch corresponding to Loss1, and the next line is the indicators of the branch corresponding to Loss2.

Inference time. The inference time is related to the size of the input image and the network structure. It can be seen from Table 1 that, under the same basic feature network, the network without any context module (RL2) infers the
fastest, followed by the network with the self-attention context module (RAL2). The inference time of networks with the GRU context module (RGL2, RAG* and RA|G*) is greatly increased, and the inference time is the longest when both context modules exist at the same time. It can be seen from the running times of RAL2 and RL2 that the self-attention module takes very little time, so the infer-
ence times of the parallel and series connections with two context modules are very close. The self-attention module greatly improves the ability of semantic segmentation networks while only adding a small amount of computing time. Its performance exceeds that of PSPNet, and its running time is less than that of PSPNet. Although the GRU, as a context module, can improve the performance of
semantic segmentation networks to a certain extent, it comes at a cost in time. In the future, it is worth further exploring real-time performance.

40K iterations. From the data in Table 1, we can see that when trained for 40K iterations, the results obtained by the other methods, in terms of mIoU and pixel accuracy (Acc), are better than those obtained by PSPNet, except for RL2,
using no context module at all (note that when two losses are used, only the indicators of the convergent branch, that is, the next line, are considered). Under the guidance of Loss2, adding self-attention module (RAL2) or bidirectional GRU module (RGL2) on the basic network RL2 can obtain better indicators
than RL2. On the basis of RL2, when two context modules are added at the
same time, it is found that the mIoU of RAGL2, whose two modules are added in series, is higher than those of RAL2 and RGL2, each with only one module, while that of RA|GL2 in parallel increases compared with RAL2 but decreases slightly compared with RGL2. However, in the case of two losses, the mIoU of two modules in series or parallel is improved compared with
that of one module and only one loss. The mIoU of RGL1L2, with one context module and trained with two losses, is not as good as that of RGL2, with one context module and trained with one loss. That is to say, using the additional loss, Loss1, does not always improve the performance; it is also related to the network structure. The same phenomenon occurs when the backbone is changed from
ResNet to MobileNet; see the Appendix for details. But the additional loss can be used as an option to improve network performance.

90K iterations. When the models with two context modules are trained for 90K iterations, the mIoUs with two losses are higher than those with one loss, and the mIoUs of the series and parallel models under two losses are very close. When
two losses are used, the result of the convergent branch corresponding to Loss2 is also improved compared with that of the upper branch corresponding to Loss1, which indicates that the information of GRU module plays a further role in enhancing the representation ability of features. In addition, when the series or parallel model, that is RAGL1L2 or RA|GL1L2, is trained 90K iterations with
two losses, the mIoU is 3.3% higher than that of PSPNet.

λ. Fig. 2 shows how the weight λ in the self-attention module changes with the number of iterations. As the number of iterations increases, λ increases. From Table 1, we have seen that the result of RA|GL2, with two modules in parallel and with only one loss, is not as good as that of RGL2 with only the GRU module
and with only one loss, although they are very close. The main reason may be that when the training iterations are insufficient, the self-attention module is not fully learned in the parallel case, as shown in Fig. 2, where the λ of RA|GL2 is smaller than those of the other, better-performing models. After adding Loss1, the additional constraint makes the self-attention context module learn
Method and loss | 40K mIoU(%) | 40K Acc(%) | 90K mIoU(%) | 90K Acc(%) | Time(s)
PSPNet | 71.80 | 94.82 | 73.43 | 95.10 | 0.157
Resnet+Loss2 (RL2) | 68.21 | 94.81 | – | – | 0.126
Resnet+SA+Loss2 (RAL2) | 73.98 | 95.12 | – | – | 0.138
Resnet+GRU+Loss2 (RGL2) | 74.28 | 95.41 | – | – | 0.489
Resnet+SA+GRU+Loss2 (RAGL2) | 74.87 | 95.40 | 76.17 | 95.63 | 0.511
Resnet+SA|GRU+Loss2 (RA|GL2) | 74.13 | 95.42 | 75.54 | 95.67 | 0.510
Resnet+GRU+Loss1+Loss2 (RGL1L2), Loss1 branch | 67.85 | 94.71 | – | – | –
Resnet+GRU+Loss1+Loss2 (RGL1L2), Loss2 branch | 73.21 | 95.37 | – | – | –
Resnet+SA+GRU+Loss1+Loss2 (RAGL1L2), Loss1 branch | 74.65 | 95.15 | 75.63 | 95.44 | 0.518
Resnet+SA+GRU+Loss1+Loss2 (RAGL1L2), Loss2 branch | 75.37 | 95.34 | 76.77 | 95.66 | 0.518
Resnet+SA|GRU+Loss1+Loss2 (RA|GL1L2), Loss1 branch | 74.01 | 95.11 | 75.78 | 95.47 | 0.518
Resnet+SA|GRU+Loss1+Loss2 (RA|GL1L2), Loss2 branch | 74.93 | 95.32 | 76.74 | 95.67 | 0.518

Table 1: Ablation study results on the Cityscapes validation set. In abbreviated forms, R represents ResNet; A represents the self-attention module; G represents the bi-directional GRU module; L1 represents Loss1; L2 represents Loss2; and | represents parallel connection. Two sets of indicators are provided when two losses are used: the first row of each pair gives the indicators of the branch corresponding to Loss1, and the second row gives the indicators of the branch corresponding to Loss2. – indicates that the corresponding experiments have not been done.
better and greatly improves the results.

[Figure 2: plot of the self-attention weight λ versus training iterations (0 to 9×10^4) for RA|GL2, RAGL2, RA|GL1L2 and RAGL1L2.]
Figure 2: The global context weight λ in self-attention module. This figure should be printed in color.
Methods | road | swalk | build. | wall | fence | pole | tlight | tsign | veg. | terrain | sky | person | rider | car | truck | bus | train | mbike | bike | mIoU
PSPNet | 97.06 | 80.78 | 91.09 | 52.22 | 55.64 | 58.81 | 66.21 | 74.01 | 91.52 | 61.25 | 93.01 | 80.21 | 60.46 | 93.90 | 55.81 | 71.08 | 71.47 | 65.01 | 75.55 | 73.43
RAGL2 | 97.62 | 82.25 | 91.84 | 53.44 | 61.42 | 59.88 | 66.34 | 75.13 | 91.96 | 64.88 | 94.08 | 80.82 | 62.91 | 94.49 | 78.84 | 81.04 | 67.83 | 66.50 | 75.87 | 76.17
RAGL1L2 | 97.79 | 82.91 | 91.60 | 52.15 | 57.74 | 59.75 | 65.01 | 75.04 | 92.10 | 65.19 | 94.16 | 80.51 | 61.70 | 94.70 | 82.57 | 85.14 | 78.05 | 67.23 | 75.35 | 76.77
RA|GL2 | 97.84 | 83.21 | 91.71 | 51.69 | 58.91 | 60.42 | 65.86 | 75.37 | 91.99 | 64.29 | 94.03 | 80.74 | 62.56 | 94.63 | 74.47 | 82.46 | 64.44 | 64.70 | 75.98 | 75.54
RA|GL1L2 | 97.79 | 83.10 | 91.63 | 53.12 | 58.66 | 60.07 | 65.40 | 75.24 | 92.04 | 65.24 | 94.19 | 80.35 | 61.26 | 94.66 | 82.52 | 84.58 | 75.63 | 67.18 | 75.38 | 76.74

Table 2: The detailed results of the 90K-iteration models on the Cityscapes validation set (%). The best indicators are marked red (results within 0.2 of the best are also marked red), and the worst indicators (excluding PSPNet's results) are marked blue.
Some details. Table 2 records the IoU indicators of the 90K-iteration models in each category. Firstly, the results of the methods combining two context modules are obviously better than those of PSPNet. In addition, we can see that each implementation has its own better categories, whether it is with one
loss or with two losses, and whether it is in series or in parallel. But in terms of overall indicators, implementations with two losses are more advantageous. The number of categories with poor indicators for the parallel model RA|GL1L2 is smaller than that for the series model RAGL1L2 when two losses are used. It can be said that the indicators of RA|GL1L2 are slightly more balanced between categories
than those of RAGL1L2. Given that the mIoUs of RA|GL1L2 and RAGL1L2 are very close and RA|GL1L2 improves faster in the later period of training, we recommend RA|GL1L2, with two losses and two context modules in parallel.
Subsequent experiments have all adopted this combination form. Some specific results of model RA|GL1L2 can be found in Fig. 3. The mIoUs of the two branches of the 90K-iteration model RA|GL1L2 can be raised to 76.17% and 77.42%, respectively, by averaging results from left-right flipped and multi-scale (scales = {0.5, 0.75, 1, 1.25, 1.5, 1.75}) inputs.
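The multi-scale and flip averaging used here can be sketched as below, using scipy for resizing; predict_full is assumed to return per-pixel class probabilities at the input's resolution (for example, the sliding-window routine given earlier).

```python
import numpy as np
from scipy.ndimage import zoom

def multi_scale_flip_inference(image, predict_full, num_classes,
                               scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Average class probabilities over rescaled and left-right flipped copies of the image."""
    H, W, _ = image.shape
    avg = np.zeros((H, W, num_classes))
    for s in scales:
        scaled = zoom(image, (s, s, 1), order=1)          # bilinear rescaling of the input
        for flip in (False, True):
            inp = scaled[:, ::-1] if flip else scaled
            p = predict_full(inp)                         # (sH, sW, num_classes)
            if flip:
                p = p[:, ::-1]                            # undo the flip on the predictions
            p = zoom(p, (H / p.shape[0], W / p.shape[1], 1), order=1)  # back to H*W
            avg += p
    return avg / (2 * len(scales))
```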
[Figure 3 panels: (a) Image, (b) Ground Truth, (c) RA|GL1L2_1, (d) RA|GL1L2_2.]
Figure 3: This figure shows some specific results of two branches of the 90K-iteration model RA|GL1L2 on the Cityscapes validation dataset. 1 represents the self-attention branch corresponding to Loss1 and 2 represents the convergent branch corresponding to Loss2. From the red bounding boxes, we can see that the results of the self-attention module are refined to some extent with the help of bi-directional GRU module. This figure should be printed in color.
4.3.2. Comparison with the State-of-the-art on the Cityscapes Testing Set

Model RA|GL1L2 is trained 200,000 iterations with batchsize 4 and cropsize
672*672 on the Cityscapes training set and validation set for about 4 days and tested on the Cityscapes testing set by averaging results from left-right flipped and multi-scale inputs. The scales are {0.5, 0.75, 1, 1.25, 1.5, 1.75}. The results are shown in Table 3. Note that the results of PSPNet, which is trained with batchsize 8 and cropsize 720*720, come from reference [60] and our
implementation of PSPNet is from the open source code of reference [60]. It
can be seen that the result of model RA|GL1L2 is better than that of PSPNet in almost every category, and the result of the convergent branch is obviously improved on the basis of the self-attention branch. In order to compare our method with the state-of-the-art methods, the re-
sults of some methods are listed in Table 4. These methods all use ResNet101 as the backbone. They are trained only on the finely annotated training and validation set, and then tested on the testing set. Some methods have also been improved by modifying loss functions. For example, reference [60] constructs an affinity field loss to train the PSPNet along with cross-entropy loss, and ref-
erence [24] uses an online hard example mining (OHEM) loss instead of the plain cross-entropy loss. From Table 4, we can see that the performance of our method is better than that of some methods, but there are still gaps between our method and some other methods. As mentioned in reference [16], larger cropsize and batchsize can yield better performance. We are limited by hardware constraints
and do not attempt larger cropsize and batchsize. Therefore, compared with the methods that surpass ours, the proposed method is still slightly insufficient. In addition, we only use the general cross-entropy loss. So the results could be further improved by modifying the loss function and using larger cropsize and batchsize.

Methods | road | swalk | build. | wall | fence | pole | tlight | tsign | veg. | terrain | sky | person | rider | car | truck | bus | train | mbike | bike | mIoU
PSPNet[60] | 98.33 | 84.21 | 92.14 | 49.67 | 55.81 | 57.62 | 69.01 | 74.17 | 92.70 | 70.86 | 95.08 | 84.21 | 66.58 | 95.28 | 73.52 | 80.59 | 70.54 | 65.54 | 73.73 | 76.30
RA|GL1L2 1 | 98.46 | 85.34 | 92.68 | 55.37 | 58.70 | 63.37 | 72.42 | 77.08 | 93.22 | 71.54 | 95.42 | 85.39 | 69.63 | 95.59 | 71.96 | 81.84 | 76.22 | 67.92 | 75.38 | 78.29
RA|GL1L2 2 | 98.55 | 86.05 | 92.95 | 57.62 | 60.23 | 64.79 | 72.93 | 77.46 | 93.35 | 71.81 | 95.45 | 85.83 | 70.71 | 95.71 | 72.36 | 83.59 | 75.30 | 68.23 | 75.47 | 78.86

Table 3: The detailed results of the 200K-iteration model on the Cityscapes testing set (%). The best indicators are marked red, and the worst indicators are marked blue. 1 represents the self-attention branch corresponding to Loss1 and 2 represents the convergent branch corresponding to Loss2.
4.3.3. Generalization Performance Test on KITTI Dataset

The 200K-iteration model RA|GL1L2 trained on the Cityscapes dataset is further tested on the KITTI semantic segmentation training dataset. Results are shown
Methods | Conference | Batchsize*h*w | Iters/epochs | Tool | mIoU(%)
DSSPN[61] | CVPR2018 | 6*513*513 | 90 epochs | Pytorch | 74.0
DUC-HDC[62] | WACV2018 | 4*880*880 | - | MXNet | 77.6
DepthSeg[63] | CVPR2018 | 1*800*800 | - | MATLAB | 78.2
PSPNet[16] | CVPR2017 | 16*769*769 | 90000 iters | Caffe | 78.4
BiSeNet[48] | ECCV2018 | -*1024*1024 | - | - | 78.9
RA|GL1L2 (Ours) | - | 4*672*672 | 200000 iters | TensorFlow | 78.9
AAF[60] | ECCV2018 | 8*720*720 | 90000 iters | TensorFlow | 79.1
DFN[64] | CVPR2018 | 32*800*800 | - | - | 79.3
PSANet[26] | ECCV2018 | 16*-*- | 90000 iters | Caffe | 80.1
CCNet[25] | - | 8*768*768 | - | Pytorch | 81.4
DANet[23] | - | 8*768*768 | 240 epochs | Pytorch | 81.5
OCNet[24] | - | 8*768*768 | 80000 iters | Pytorch | 81.7
Table 4: Comparison with the state-of-the-art methods on the Cityscapes testing set. '-' indicates that the corresponding data are not found in the references or not available. These methods all use ResNet101 as the backbone. They are trained only on the finely annotated training and validation sets, and then tested on the testing set. Our method gets a comparable result with a small batchsize and cropsize.
in Table 5. We record the IoU of the two branches in each category and the mIoU. Compared with the indicators on KITTI's official website, our results can be
ranked sixth, but it should be noted that the indicators on the server are obtained on the testing set, while ours are on the training set. Some specific test results can be found in Fig. 4. From Fig. 4 and Table 5, it can be seen that the convergent branch has slightly better results than the self-attention branch. The images used when calculating the confusion matrices shown in Fig. 5 are
all never seen in the training process of the model, and can be used to test the generalization ability of the model. The results of the testing data from the same source as the training data (shown in Fig. 5(a)) are better than those of data from different sources (shown in Fig. 5(b)). This gap may be caused not only by the appearance difference of data from different sources, but also by
the difference in human annotation between the two datasets. However, they show similar distributions, and the more significant the diagonal distribution is, the stronger the supporting evidence for generalization ability.
Methods | road | swalk | build. | wall | fence | pole | tlight | tsign | veg. | terrain | sky | person | rider | car | truck | bus | train | mbike | bike | mIoU
RA|GL1L2 1 | 88.74 | 39.82 | 82.39 | 22.23 | 35.82 | 52.05 | 65.78 | 61.24 | 82.07 | 43.32 | 94.40 | 65.41 | 52.95 | 86.49 | 55.42 | 44.15 | 74.69 | 52.87 | 52.54 | 60.65
RA|GL1L2 2 | 89.35 | 41.08 | 83.23 | 27.39 | 37.12 | 52.90 | 66.13 | 61.92 | 81.98 | 44.24 | 94.60 | 67.20 | 52.36 | 87.06 | 53.03 | 46.80 | 74.33 | 47.88 | 49.58 | 60.96

Table 5: Generalization performance of the 200K-iteration model RA|GL1L2 tested on the KITTI dataset (%).
[Figure 4 panels: (a) Image, (b) Ground Truth, (c) RA|GL1L2_1, (d) RA|GL1L2_2.]
Figure 4: This figure shows some specific generalization results of the two branches of the 200K-iteration model RA|GL1L2 on the KITTI training dataset. 1 represents the self-attention branch corresponding to Loss1 and 2 represents the convergent branch corresponding to Loss2. From the red bounding boxes, we can see that the results of the self-attention module are refined to some extent with the help of the bi-directional GRU modules. Note that the third row may need to be zoomed in to see the difference in detail. This figure should be printed in color.
Figure 5: Fig. 5(a) shows the confusion matrix of the 90K-iteration model RA|GL1L2 on the Cityscapes validation dataset with multi-scale and left-right flipping data augmentation. While Fig. 5(b) shows the confusion matrix of the 200K-iteration model RA|GL1L2 on the KITTI dataset with multi-scale and left-right flipping data augmentation. This figure should be printed in color.
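Per-class IoU and the mIoU reported throughout this section can be computed directly from such a confusion matrix; a small sketch is given below. The ignore label 255 is the usual Cityscapes convention and an assumption here, and classes that never occur (as for Mapillary) should be dropped before averaging.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_label=255):
    """Accumulate a num_classes x num_classes matrix; rows are ground truth, columns predictions."""
    mask = gt != ignore_label
    idx = gt[mask].astype(int) * num_classes + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_from_confusion(cm):
    """Per-class IoU = TP / (TP + FP + FN), and their mean (mIoU)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1.0)       # guard against empty classes
    return iou, iou.mean()
```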
4.4. Results on CamVid Dataset

On the CamVid dataset, we set batchsize to 10 and cropsize to 360*480. Model
RA|GL1L2 is trained for 20K iterations on the training and validation sets and tested on the testing set by averaging results from multi-scale (scales = {1, 1.25, 1.5, 1.75, 2.0}) and left-right flipped inputs. The results are shown in Table 6. Note that the image resolution used in references [49, 48] is 720*960, while that used in reference [65] is 640*480. A higher resolution
is more conducive to better results. From Table 6, we can see that our result is the best. It should be pointed out that lightweight basic feature networks are used in references [49, 48] to achieve real-time performance, so their results are listed just for reference.

Methods | SegNet[13] | ICNet[49] | BiSeNet[48] | VPN[65] | RA|GL1L2 1 | RA|GL1L2 2
mIoU(%) | 60.1 | 67.1 | 68.7 | 69.5 | 68.8 | 69.6

Table 6: Results on the CamVid dataset.
4.5. Results on Mapillary Dataset
For this dataset, we train model RA|GL1L2 for 200K iterations (about 40 epochs) on the training set with batchsize 4 and cropsize 672*672, and test it on the validation set with left-right flipping. The mIoU excluding unpredicted classes is recorded in Table 7 (during inference, there are no corresponding pixels for 10 of the 65 valid classes). Compared with other methods, our method
deserves further improvement. Besides adopting larger batchsize and cropsize and more iterations, more consideration should be given to class imbalance to improve the performance of the model.

Methods | Batchsize*h*w | Iters/epochs | mIoU(%)
DSSPN[61] | 6*513*513 | 90 epochs | 42.39
In-Place[66] | 12*776*776 | 90 epochs | 53.12
RA|GL1L2 1 | 4*672*672 | 200000 iters | 44.83
RA|GL1L2 2 | 4*672*672 | 200000 iters | 45.45

Table 7: Results on the Mapillary validation dataset.
5. Conclusions

In this paper, we propose a method that combines self-attention mechanism
and bidirectional GRU to correlate context information and enhance the expressiveness of features in the semantic segmentation task. First, the global context is correlated by the self-attention module according to the correlation between different feature spaces, and then the long-distance sequence context information of a row or a column is correlated by bidirectional GRUs in the vertical and horizontal
directions. The combination of the two context modules greatly increases the expressiveness of network features. Aiming at this structure, a two-loss method is proposed, which facilitates the training of the network and improves the accuracy of the network. Our method achieves outstanding performance on many urban traffic scene datasets and also generalizes well on unseen data.
The biggest problems of this method are class imbalance and the lack of real-time performance, which will be the focus of our future research.
Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61103157.
Appendix A. Additional Ablation Experiments with MobileNet Backbone

We have seen in the main text that when ResNet is the backbone, the proposed method can achieve better semantic segmentation results by combining two context modules and using two losses for training. In order to verify whether
these two strategies still have the same effect after replacing the backbone, we change the ResNet in the main text to MobileNet [67] and carry out the same ablation experiments on the Cityscapes training and validation sets. Note that the other structures are consistent with the original, and the training hyper-parameters are also consistent with the original settings. However, MobileNet here does not perform
pre-training on the ImageNet dataset.

Appendix A.1. The MobileNet Architecture

The model structure of MobileNet is shown in Table A.8. In order to output feature maps of the same scale as ResNet, we modify some parameters from reference [67].
Appendix A.2. Ablation Experiment Results and Analysis

The ablation results are shown in Table A.9. It can be seen that when one loss is used for training, the segmentation accuracy can be improved by adding a context module; the results of networks with two context modules are better than those with one context module; and the results of series connection
are better than those of parallel connection. However, unlike networks using
Input | Operator | c | n | s
672²*3 | 3*3 Conv | 32 | 1 | 2
336²*32 | block | 16 | 1 | 1
336²*16 | block | 24 | 2 | 2
168²*24 | block | 32 | 3 | 2
84²*32 | block | 64 | 4 | 1
84²*64 | block | 96 | 3 | 1
84²*96 | block | 160 | 3 | 1
84²*160 | block | 320 | 1 | 1

Table A.8: The MobileNet structure we used, where c represents the output channel number of every layer in the same block, n represents the number of block repetitions, and s stands for stride. The block structure is shown in Fig. A.6.
ResNet, models trained with two losses are not as good as those trained with one loss.

Method and loss | 40K mIoU(%) | 40K Acc(%) | 90K mIoU(%) | 90K Acc(%) | Time(s)
(MNet)PSPNet | 40.99 | 88.50 | 46.80 | 90.10 | 0.055
MNet+Loss2 (ML2) | 31.53 | 85.27 | – | – | 0.038
MNet+SA+Loss2 (MAL2) | 41.02 | 88.88 | – | – | 0.138
MNet+GRU+Loss2 (MGL2) | 46.58 | 90.73 | – | – | 0.413
MNet+SA+GRU+Loss2 (MAGL2) | 48.41 | 90.93 | 54.81 | 92.36 | 0.430
MNet+SA|GRU+Loss2 (MA|GL2) | 47.97 | 90.90 | 54.40 | 92.21 | 0.428
MNet+SA+GRU+Loss1+Loss2 (MAGL1L2), Loss1 branch | 36.81 | 87.56 | 42.17 | 89.26 | 0.425
MNet+SA+GRU+Loss1+Loss2 (MAGL1L2), Loss2 branch | 46.89 | 90.41 | 53.96 | 92.02 | 0.425
MNet+SA|GRU+Loss1+Loss2 (MA|GL1L2), Loss1 branch | 37.27 | 87.69 | 43.03 | 89.44 | 0.428
MNet+SA|GRU+Loss1+Loss2 (MA|GL1L2), Loss2 branch | 46.01 | 90.40 | 52.62 | 91.81 | 0.428

Table A.9: Ablation study results on the Cityscapes validation set. In abbreviated forms, M and MNet represent MobileNet; A represents the self-attention module; G represents the bi-directional GRU module; L1 represents Loss1; L2 represents Loss2; and | represents parallel connection. Two sets of indicators are provided when two losses are used: the first row of each pair gives the indicators of the branch corresponding to Loss1, and the second row gives the indicators of the branch corresponding to Loss2. – indicates that the corresponding experiments have not been done.
[Figure A.6 diagram: the stride=1 block is 1*1 Conv+ReLU6 → 3*3 Dwise Conv+ReLU6 → 1*1 Conv (Linear), with an Add skip connection from the input; the stride=2 block is 1*1 Conv+ReLU6 → 3*3 Dwise Conv (stride=2)+ReLU6 → 1*1 Conv (Linear), without the skip connection.]
Figure A.6: The block structure of MobileNet.
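A Keras-style sketch of the two blocks in Fig. A.6 is given below (an illustration, not the authors' code): the layer order follows the figure, the skip connection is used only for the stride=1 block when the channel count is unchanged, and the expansion factor of 6 is an assumption taken from the MobileNetV2 design; BN layers are omitted, as in the figure.

```python
import tensorflow as tf

def mobilenet_block(x, out_channels, stride, expansion=6):
    """Inverted residual block of Fig. A.6: 1*1 Conv + ReLU6, 3*3 depthwise Conv
    (stride 1 or 2) + ReLU6, 1*1 linear Conv, plus an Add skip when stride == 1."""
    in_channels = int(x.shape[-1])
    h = tf.keras.layers.Conv2D(expansion * in_channels, 1, padding="same")(x)
    h = tf.keras.layers.ReLU(max_value=6.0)(h)
    h = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same")(h)
    h = tf.keras.layers.ReLU(max_value=6.0)(h)
    h = tf.keras.layers.Conv2D(out_channels, 1, padding="same")(h)   # linear projection
    if stride == 1 and in_channels == out_channels:
        h = tf.keras.layers.Add()([x, h])
    return h
```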
References

[1] X. Hu, X. Xu, Y. Xiao, H. Chen, S. He, J. Qin, P.-A. Heng, Sinet: A scale-
insensitive convolutional neural network for fast vehicle detection, IEEE Transactions on Intelligent Transportation Systems 20 (3) (2019) 1010– 1019. [2] J. Wei, J. He, Y. Zhou, K. Chen, Z. Tang, Z. Xiong, Enhanced object detection with deep convolutional neural networks for advanced driving
assistance, IEEE Transactions on Intelligent Transportation Systems.

[3] S. Scheidegger, J. Benjaminsson, E. Rosenberg, A. Krishnan, K. Granström, Mono-camera 3d multi-object tracking using deep learning detections and pmbm filtering, in: 2018 IEEE Intelligent Vehicles Symposium (IV), IEEE, 2018, pp. 433–440.
[4] X. Dong, J. Shen, W. Wang, Y. Liu, L. Shao, F. Porikli, Hyperparameter optimization for tracking with continuous deep q-learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 518–527.
29
[5] N. Yang, R. Wang, J. Stuckler, D. Cremers, Deep virtual stereo odometry: 605
Leveraging deep depth prediction for monocular direct sparse odometry, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 817–833. [6] H. Fu, M. Gong, C. Wang, K. Batmanghelich, D. Tao, Deep ordinal regression network for monocular depth estimation, in: Proceedings of the
610
IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011. [7] X. Guo, H. Li, S. Yi, J. Ren, X. Wang, Learning monocular depth by distilling cross-domain stereo networks, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 484–500.
615
[8] F. Shen, G. Zeng, Semantic image segmentation via guidance of image classification, Neurocomputing 330 (2019) 259–266. [9] L.-C. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, J. Shlens, Searching for efficient multi-scale architectures for dense image prediction, in: Advances in Neural Information Processing
620
Systems, 2018, pp. 8699–8710. [10] F. Cheng, H. Zhang, D. Yuan, M. Sun, Leveraging semantic segmentation with learning-based confidence measure, Neurocomputing 329 (2019) 21– 31. [11] Z. Jiang, Y. Yuan, Q. Wang, Contour-aware network for semantic segmen-
625
tation via adaptive depth, Neurocomputing 284 (2018) 27–35. [12] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440. [13] V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: A deep convolutional
630
encoder-decoder architecture for image segmentation, IEEE transactions on pattern analysis and machine intelligence 39 (12) (2017) 2481–2495. 30
[14] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 635
234–241. [15] L.-C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587. [16] H. Zhao, J. Shi, X. Qi, X. Wang, J. Jia, Pyramid scene parsing network, in: Proceedings of the IEEE conference on computer vision and pattern
640
recognition, 2017, pp. 2881–2890. [17] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [18] V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann
645
machines, in: Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814. [19] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167. [20] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Object detectors
650
emerge in deep scene cnns, arXiv preprint arXiv:1412.6856. [21] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE transactions on pattern analysis and machine intelligence 40 (4) (2018) 834–848.
655
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in neural information processing systems, 2017, pp. 5998–6008.
31
[23] J. Fu, J. Liu, H. Tian, Z. Fang, H. Lu, Dual attention network for scene segmentation, arXiv preprint arXiv:1809.02983. 660
[24] Y. Yuan, J. Wang, Ocnet: Object context network for scene parsing, arXiv preprint arXiv:1809.00916. [25] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, W. Liu, Ccnet:
Criss-cross attention for semantic segmentation, arXiv preprint
arXiv:1811.11721. 665
[26] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, J. Jia, Psanet: Point-wise spatial attention network for scene parsing, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 267–283. [27] Y. Zhuang, F. Yang, L. Tao, C. Ma, Z. Zhang, Y. Li, H. Jia, X. Xie, W. Gao, Dense relation network: Learning consistent and context-aware represen-
670
tation for semantic image segmentation, in: 2018 25th IEEE International Conference on Image Processing (ICIP), IEEE, 2018, pp. 3698–3702. [28] Y. Zhuang, L. Tao, F. Yang, C. Ma, Z. Zhang, H. Jia, X. Xie, Relationnet: Learning deep-aligned representation for semantic image segmentation, in: 2018 24th International Conference on Pattern Recognition (ICPR), IEEE,
675
2018, pp. 1506–1511. [29] B. Shuai, Z. Zuo, B. Wang, G. Wang, Scene segmentation with dagrecurrent neural networks, IEEE transactions on pattern analysis and machine intelligence 40 (6) (2017) 1480–1493. [30] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of
680
gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555. [31] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
32
[32] X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, in: Pro685
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803. [33] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE conference on computer
690
vision and pattern recognition, 2016, pp. 3213–3223. [34] G. Neuhold, T. Ollmann, S. Rota Bulo, P. Kontschieder, The mapillary vistas dataset for semantic understanding of street scenes, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4990– 4999.
695
[35] G. J. Brostow, J. Fauqueur, R. Cipolla, Semantic object classes in video: A high-definition ground truth database, Pattern Recognition Letters 30 (2) (2009) 88–97. [36] H. A. Alhaija, S. K. Mustikovela, L. Mescheder, A. Geiger, C. Rother, Augmented reality meets computer vision: Efficient data generation for
700
urban driving scenes, International Journal of Computer Vision 126 (9) (2018) 961–972. [37] W. Wang, J. Shen, L. Shao, Video salient object detection via fully convolutional networks, IEEE Transactions on Image Processing 27 (1) (2017) 38–49.
705
[38] W. Wang, J. Shen, F. Porikli, R. Yang, Semi-supervised video object segmentation with super-trajectories, IEEE transactions on pattern analysis and machine intelligence 41 (4) (2018) 985–998. [39] W. Wang, J. Shen, R. Yang, F. Porikli, Saliency-aware video object segmentation, IEEE transactions on pattern analysis and machine intelligence
710
40 (1) (2017) 20–33.
33
[40] J. Peng, J. Shen, X. Li, High-order energies for stereo segmentation, IEEE transactions on cybernetics 46 (7) (2015) 1616–1627. [41] J. Shen, J. Peng, L. Shao, Submodular trajectories for better motion segmentation in videos, IEEE Transactions on Image Processing 27 (6) (2018) 715
2688–2700. [42] J. Shen, J. Peng, X. Dong, L. Shao, F. Porikli, Higher-order energies for image segmentation, IEEE Transactions on Image Processing 26 (10) (2017) 4911–4922. [43] J. Shen, X. Hao, Z. Liang, Y. Liu, W. Wang, L. Shao, Real-time superpixel
720
segmentation by dbscan clustering algorithm, IEEE Transactions on Image Processing 25 (12) (2016) 5933–5942. [44] J. Shen, Y. Du, X. Li, Interactive segmentation using constrained laplacian optimization, IEEE Transactions on Circuits and Systems for Video Technology 24 (7) (2014) 1088–1100.
725
[45] X. Dong, J. Shen, L. Shao, L. Van Gool, Sub-markov random walk for image segmentation, IEEE Transactions on Image Processing 25 (2) (2015) 516–527. [46] J. Shen, Y. Du, W. Wang, X. Li, Lazy random walks for superpixel segmentation, IEEE Transactions on Image Processing 23 (4) (2014) 1451–1462.
730
[47] J. Shen, X. Dong, J. Peng, X. Jin, L. Shao, F. Porikli, Submodular function optimization for motion clustering and image segmentation, IEEE transactions on neural networks and learning systems. [48] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, N. Sang, Bisenet: Bilateral segmentation network for real-time semantic segmentation, in: Proceedings
735
of the European Conference on Computer Vision (ECCV), 2018, pp. 325– 341.
34
[49] H. Zhao, X. Qi, X. Shen, J. Shi, J. Jia, Icnet for real-time semantic segmentation on high-resolution images, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 405–420. 740
[50] H. Li, P. Xiong, H. Fan, J. Sun, Dfanet: Deep feature aggregation for real-time semantic segmentation, arXiv preprint arXiv:1904.02216. [51] Z. Cao, F. Wei, L. Dong, S. Li, M. Zhou, Ranking with recursive neural networks and its application to multi-document summarization, in: Twentyninth AAAI conference on artificial intelligence, 2015.
745
[52] B. Plank, A. Søgaard, Y. Goldberg, Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss, arXiv preprint arXiv:1604.05529. [53] J. Zhou, W. Xu, End-to-end learning of semantic role labeling using recurrent neural networks, in: Proceedings of the 53rd Annual Meeting of the
750
Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, 2015, pp. 1127–1137. [54] W. Wang, J. Shen, H. Ling, A deep network solution for attention and aesthetics aware photo cropping, IEEE transactions on pattern analysis
755
and machine intelligence 41 (7) (2018) 1531–1544. [55] X. Dong, J. Shen, Triplet loss in siamese network for object tracking, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 459–474. [56] X. Dong, J. Shen, D. Wu, K. Guo, X. Jin, F. Porikli, Quadruplet network
760
with one-shot learning for fast visual object tracking, IEEE Transactions on Image Processing 28 (7) (2019) 3516–3527. [57] H. Zhang, I. Goodfellow, D. Metaxas, A. Odena, Self-attention generative adversarial networks, arXiv preprint arXiv:1805.08318.
35
[58] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, D. Tran, 765
Image transformer, arXiv preprint arXiv:1802.05751. [59] G.-B. Zhou, J. Wu, C.-L. Zhang, Z.-H. Zhou, Minimal gated unit for recurrent neural networks, International Journal of Automation and Computing 13 (3) (2016) 226–234. [60] T.-W. Ke, J.-J. Hwang, Z. Liu, S. X. Yu, Adaptive affinity fields for seman-
770
tic segmentation, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 587–602. [61] X. Liang, H. Zhou, E. Xing, Dynamic-structured semantic propagation network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 752–761.
775
[62] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, G. Cottrell, Understanding convolution for semantic segmentation, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 1451–1460. [63] S. Kong, C. C. Fowlkes, Recurrent scene parsing with perspective under-
780
standing in the loop, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 956–965. [64] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, N. Sang, Learning a discriminative feature network for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1857–
785
1866. [65] V. Jampani, R. Gadde, P. V. Gehler, Video propagation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 451–461. [66] S. Rota Bul` o, L. Porzi, P. Kontschieder, In-place activated batchnorm for
790
memory-optimized training of dnns, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5639–5647. 36
[67] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
37
Min Yan: Conceptualization, Methodology, Software, Writing - Original Draft, Writing - Review & Editing
Junzheng Wang: Resources, Supervision, Project administration
Jing Li: Resources, Writing - Review & Editing, Supervision, Project administration, Funding acquisition
Ke Zhang: Investigation, Data Curation, Writing - Review & Editing
Zimu Yang: Data Curation, Validation, Formal analysis, Writing - Review & Editing
Declaration of interests

☒ The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Min Yan: Min Yan is a Ph.D. student at Beijing Institute of Technology, Beijing, China. She received her B.S. degree from Beijing Institute of Technology, Beijing, China, in 2014, before starting her Ph.D. in computer vision and artificial intelligence. Her research activities are focused on deep neural networks, mainly applied to semantic segmentation, depth estimation and localization, but also to computer vision in general.
Junzheng Wang: Junzheng Wang was born in 1964. He received his M.S. and Ph.D. degrees in engineering from Beijing Institute of Technology in 1990 and 1994, respectively. He is now a professor and Ph.D. supervisor at the School of Automation, Beijing Institute of Technology. His main research interests include motion drive and control, image detection and tracking, and the static and dynamic performance testing of control systems.
Jing Li: Jing Li was born in 1982. She received her M.S. degree in engineering from Shandong University of Technology in 2007 and her Ph.D. degree in engineering from Beijing Institute of Technology in 2011. She is now an associate professor at the School of Automation, Beijing Institute of Technology. Her research interests include image detection technology and object detection and tracking.
Ke Zhang: Ke Zhang is a Ph.D. student at Beijing Institute of Technology, Beijing, China. He received the B.S. degree in communication engineering from Northwestern Polytechnical University, Xi'an, China, in 2006, and the M.S. degree in information and communication systems from Xi'an Institute of Electromechanical Information Technology, Xi'an, China, in 2009. His research activities are focused on multidimensional information fusion and decision control.
Zimu Yang: Zimu Yang received the B.S. degree from Beijing Institute of Technology in 2017. She is currently pursuing the M.S. degree with the School of Automation, Beijing Institute of Technology. Her research interests include computer vision and robotic environment perception.