Learning attentive dynamic maps (ADMs) for understanding human actions


Chuankun Li a, Yonghong Hou a, Wanqing Li b,*, Pichao Wang c

a School of Electrical and Information Engineering, Tianjin University, Tianjin, China
b Advanced Multimedia Research Lab, University of Wollongong, Australia
c Alibaba Group, United States

Article info

Article history: Received 12 February 2019; Revised 3 June 2019; Accepted 9 September 2019; Available online 14 October 2019

Keywords: Human-robot/machine interaction; Deep learning; Human action recognition

Abstract

This paper presents a novel end-to-end trainable deep architecture to learn an attentive dynamic map (ADM) for understanding human motion from skeleton data. An ADM intends not only to capture the dynamic information over the period of human motion, referred to as an action, as the conventional dynamic image/map does, but also to embed in it the spatio-temporal attention for the classification of the action. Specifically, skeleton sequences are encoded into sequences of Skeleton Texture Maps (STMs); each STM encodes both the joint locations (i.e. spatial) and the relative temporal order (i.e. temporal) of the skeleton in the sequence. The STM sequences are fed into a customized 3DConvLSTM to explore the local and global spatio-temporal information from which a dynamic map is learned. This dynamic map is subsequently used to learn the spatio-temporal attention at each time-stamp. ADMs are then generated from the learned attention weights and all hidden states of the 3DConvLSTM and used for action classification. The proposed method achieved competitive performance compared with the state-of-the-art results on the Large Scale Combined dataset, the MSRC-12 dataset and the NTU RGB+D dataset.

1. Introduction

Deep learning has significantly advanced the research on human action recognition from RGB-D data. The methods reported so far can be broadly categorised into four approaches: two-stream CNNs [1,2], 3D CNNs [3], CNNs combined with an RNN [4] and Dynamic Image (DI) based methods [5]. Compared with RGB data, however, the depth modality is invariant to illumination and provides 3D structural information of the subjects. With the advance of easy-to-use 3D sensors, such as the Microsoft Kinect, and of algorithms [6–9] for highly accurate estimation of joint positions from RGB-D sequences, action recognition based on skeleton sequences has become an active research topic [10–13].

To date, many methods have been proposed for recognising actions from skeleton data. One approach is to convert a skeleton sequence into a texture image that also encodes the temporal information. For example, Wang et al. [14] encoded the joint trajectories of an action instance into texture images (JTMs) and used the HSV (Hue, Saturation, Value) space to represent the temporal information so that the texture images can be fed into CNNs.


However, this particular encoding fails to distinguish some actions, such as "knock" and "catch", because their trajectories overlap. A second approach, proposed by Yan et al. [15] and Li et al. [16], is to represent skeletons naturally as graphs and use graph convolutional networks (GCNs) to extract spatial and temporal features. However, this approach hardly learns long-term temporal information. A third approach is to extract frame-based features from skeleton sequences and feed them into a recurrent neural network, such as an LSTM, to explore the temporal information or long-term patterns for classification. A typical example is the 3DConvLSTM originally proposed for gesture recognition [17,4,18]. This approach has recently been augmented with attention models [19–21] that aim to identify the body parts and/or time-stamps that are most discriminative for classification.

For skeleton-based action recognition, an effective attention model should be able to robustly identify the informative joints (spatial attention) and time-stamps (temporal attention). To date, several methods [22–24] have been proposed. In [22] the temporal attention is ignored and in [24] it is only coarsely designed, while [23] tends to overemphasize the temporal information and underestimate the spatial information.


This paper presents a novel end-to-end trainable architecture that combines the advantages offered by deep neural networks, dynamic images/maps and attention models to learn an attentive dynamic map (ADM) from a sequence of skeletons for action classification. The ADM not only captures the dynamic information over the period of an action, as the conventional dynamic image/map does, but also embeds in it the spatio-temporal attention. In addition, the ADM is learned as an integral part of the network rather than through a separate, unsupervised rank pooling process. To generate an effective ADM, skeleton sequences are encoded into sequences of Skeleton Texture Maps (STMs); each STM encodes both the joint locations and the relative temporal order of the skeleton in the sequence. The STM sequences are fed into a customized 3DConvLSTM to explore the local and global spatio-temporal information from which a dynamic map is learned. This dynamic map is subsequently used to learn the spatio-temporal attention at each time-stamp. ADMs are generated from the learned attention weights and all hidden states of the 3DConvLSTM. The proposed method was evaluated on three popular benchmark datasets, the NTU RGB+D Dataset [25], the MSRC-12 Kinect Gesture Dataset [26] and the Large Scale Combined Dataset (LSC) [27], and achieved state-of-the-art results.

The rest of this paper is organized as follows. Section 2 reviews the work related to skeleton-based action recognition. Section 3 describes the proposed method. Experimental results are presented in Section 4. Finally, Section 5 concludes the paper.

2. Related work

In this section, we briefly review the existing literature that closely relates to the proposed method, including RGB-D based action recognition using deep learning, skeleton-based action recognition using CNN and/or LSTM based models, and attention-based action recognition.

2.1. RGB-D based action recognition using deep learning

Many works [13] have been reported on action recognition from RGB sequences based on deep learning. The four emerging and promising approaches are two-stream CNNs [1,2], 3D CNNs [3], CNNs combined with an RNN [4] and Dynamic Image (DI) based methods [5]. The first approach employs CNNs to extract features from single frames or a stack of frames sampled from a video sequence representing an action instance; the widely used two-stream architecture [1,2] is a typical example. In general, the two-stream CNN architecture captures spatial information and short-term temporal information, but hardly learns long-term temporal information. The second approach applies 3D convolution to a segment of frames and then applies temporal pooling to form video-based features, such as C3D [3], or simply applies 3D convolution to an entire action instance [28–30]. Like the two-stream approach, the 3D CNN approach captures spatial and short-term temporal information, but not much long-term temporal information. The third approach first extracts frame-based features using a CNN and then models the temporal dynamics with an RNN, such as the ConvLSTM [4] and the network proposed in [17]. This approach models long-term features without obviously increasing the complexity of the model and strengthens the ability to model motion dynamics across time in an effective way. The fourth approach generates dynamic images [5] through rank pooling and applies a CNN to extract features from the dynamic images. Although a dynamic image is effective for summarizing a video sequence, it cannot always capture the properties required to identify the video because the ranking constraints are linear.
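To illustrate the dynamic image idea referred to above, the following sketch collapses a clip into a single image using the closed-form approximate rank pooling weighting attributed to Bilen et al. [5]; the frame tensor, its shape and the use of raw pixels (rather than learned features) are assumptions made for illustration only.

```python
import numpy as np

def approximate_rank_pooling(frames):
    """Collapse a video into a single 'dynamic image' by a fixed linear
    weighting of its frames (approximate rank pooling, cf. [5]).

    frames: array of shape (T, H, W, C). Returns an array of shape (H, W, C).
    """
    T = frames.shape[0]
    # Harmonic numbers H_t = sum_{i=1}^{t} 1/i, with H_0 = 0.
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])
    t = np.arange(1, T + 1)
    # Per-frame coefficients of the approximate rank pooling solution.
    alpha = 2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])
    return np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))

# Hypothetical 20-frame clip of 112x112 RGB frames.
clip = np.random.rand(20, 112, 112, 3)
dynamic_image = approximate_rank_pooling(clip)   # shape (112, 112, 3)
```

Because the coefficients alpha are fixed and linear in time, such a summary cannot adapt to which frames or regions matter for a given action, which is the limitation the proposed ADM addresses.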

2.2. Skeleton based action recognition using CNNs

In the category of CNN-based methods, the most commonly used approach is to convert a skeleton sequence into a texture image that encodes the temporal information and to employ one or multiple CNNs to learn features for classification. Although some spatio-temporal information is inevitably lost when a sequence is encoded into images, this approach has achieved promising performance [31,14,32–34]. In [31], the joint coordinates of a skeleton sequence are organized in a matrix and normalized with respect to the entire training dataset, where the three Cartesian components (x, y, z) of the joints are treated as the three channels (R, G, B) of a color image. The temporal dynamics of the sequence are presented across columns and the spatial structure of each frame is arranged within a column. However, this normalization cannot guarantee scale invariance. Wang et al. [14] proposed to encode joint trajectories into Joint Trajectory Maps (JTMs), which represent the spatio-temporal structure of the joint trajectories through color encoding. However, this method cannot distinguish some actions because the trajectories overlap, with a consequent loss of temporal information. To overcome this drawback, Li et al. [32] calculated the pair-wise distances between the skeleton joints of single or multiple subjects and encoded these distances into Joint Distance Maps (JDMs). Ke et al. [33] divided the whole body into five parts and extracted cosine distance (CD) and normalized magnitude (NM) features for each body part of a skeleton sequence. The CD and NM features of each body part over all frames are concatenated and scaled into gray images respectively; these features are designed to be translation, scale and rotation invariant. To capture more spatio-temporal information, Liu et al. [34] proposed an enhanced skeleton visualization method that expresses the 5D space (the three-dimensional coordinates, the time label and the joint label) as a 2D coordinate space and a 3D color space to obtain ten types of texture images, which implicitly describe the spatio-temporal skeleton joints in a compact yet distinctive manner. However, this method requires ten CNNs and hence has a high computational complexity. In the work of Liu et al. [35], a skeleton sequence is converted into multi-temporal sequences and a 3D CNN is employed to learn robust features; this method, however, only captures short-term temporal information.

Graph convolutional networks can be regarded as an extension of CNNs. Yan et al. [15] proposed Spatial-Temporal Graph Convolutional Networks (ST-GCN) to automatically learn both spatial and temporal patterns. Li et al. [16] proposed a spatio-temporal graph convolution (STGC) approach that assembles the strengths of local convolutional filtering and the sequence-learning ability of autoregressive moving average models. However, these methods hardly learn long-term temporal information. Tang et al. [36] proposed a deep progressive reinforcement learning (DPRL) method that selects the most informative frames of the input sequences and leverages a GCN to learn the dependencies among joints. However, the spatial graph is built over clusters, each of which is assigned a weight, so it may not capture the delicate pair-wise spatial relationships among joints, and the computational complexity of the graph learning is high.

Unlike the methods discussed above, the proposed method encodes the skeleton joints of each frame, together with the temporal order of the frame, into a color Skeleton Texture Map (STM), which allows us to explore the spatial and local temporal information using a 3DCNN and the global temporal information using an LSTM.


2.3. Skeleton based action recognition using LSTM models

The basic idea of LSTM-based methods is to extract a frame-based feature from a skeleton sequence and pass the feature sequence into an LSTM model that maps the sequence of features to an action class. These methods tend to overstress the temporal information and underestimate the spatial information, and different improvements have been proposed to overcome this shortcoming. The works in [37,38,25,39] extended LSTM models to spatial domains. Specifically, Shahroudy et al. [25] proposed to model the long-term temporal correlation of the features for each body part, with all body parts connected through a shared output gate. Liu et al. [38] extended traditional LSTM-based learning to spatial-temporal domains (ST-LSTM) and used a trust gate to deal with unreliable input data; a skeleton tree traversal algorithm was used to discover stronger long-term spatial dependency patterns. Zhang et al. [39] designed a view-adaptive LSTM architecture that enables the network to automatically adjust itself to the most suitable observation viewpoints. Zhu et al. [37] proposed an end-to-end fully connected deep LSTM network with a mixed-norm regularization term in the cost function in order to learn the co-occurrence features of skeleton joints. Instead of extending LSTM models, Zhang et al. [40] provided a simple universal spatial modeling method and fed eight geometric relational features into a basic 3-layer LSTM model. However, the concatenation of the eight types of geometric features performs worse than a single type of these features.

The proposed method exploits the spatial and temporal information in a balanced manner. Specifically, both spatial and temporal information is encoded at frame level into a frame-based STM. The sequence of STMs goes through a 3DCNN and a 2-layer ConvLSTM to generate an attentive dynamic map from which the spatio-temporal attention is estimated, and action classification is performed on the attentive dynamic map.

2.4. Attention based action recognition

Typical attention models include soft attention [20], hard attention, additive attention [21] and multiplicative attention [41]. Attention has recently been applied to action recognition. Sharma et al. [42] used pooled convolutional descriptors with a soft-attention model for action recognition. Wang et al. [19] divide an untrimmed video into many clip proposals and design a temporal attention model that highlights the discriminative clip proposals and suppresses the background ones; the attention weight of each clip proposal is learned by a linear transformation of its feature, and the weights of the different proposals are passed through a softmax layer. Yan et al. [43] proposed a soft attention model that combines a convolutional Long Short-Term Memory (LSTM) with a hierarchical system architecture to recognize action categories in videos. Xin et al. [44] proposed a learnable temporal attention mechanism that automatically selects important time points from action sequences. Baradel et al. [45] proposed a spatio-temporal attention mechanism for human action recognition that automatically attends to the most important human hands and detects the most discriminative moments in an action using skeleton and RGB data. Liu et al. [22] proposed the Global Context-Aware Attention LSTM (GCA-LSTM) to focus on key joints for skeleton-based action recognition. Song et al. [23] proposed an end-to-end spatial and temporal attention model for human action recognition from skeleton data; however, they use the hidden state of the previous time stamp of an LSTM, whose context information is too local to measure the attention.


Weng et al. [24] utilized a Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) classifier to exploit the spatio-temporal structure of 3D actions, combining the strengths of non-parametric and parametric models for skeleton-based action recognition.

The proposed method provides a much deeper attention model than previous methods. The frame-based STMs are first processed by multiple CNN layers and a 2-layer ConvLSTM. The spatial attention at each time stamp is estimated by applying a CNN to the hidden state of every time-stamp of the ConvLSTM, and the outputs from all hidden states are then used to estimate the temporal weights. This ensures that the attention is associated with high-level and globally-aware spatial and temporal features rather than low-level features.

3. Proposed method

Fig. 1 shows the diagram of the proposed method. Given a sequence of skeletons that represents an action instance, the skeleton joints of each frame are divided into three body parts and mapped to a skeleton texture map (STM), with each body part encoded in a different color scheme; the STM sequence is fed into a 3DConvLSTM [18] to extract spatial and local temporal features; a spatial convolution is applied to all the hidden states of the second ConvLSTM layer of the 3DConvLSTM to estimate the spatial attention at each hidden state; spatial pooling is applied to the attention-weighted hidden states and the concatenation of all pooled features is fed to a fully connected layer (FC) to estimate the attention weight at each time-stamp; and an attention-based pooling is applied to the outputs of the spatial attention to generate an attention-weighted dynamic map for action classification.

We follow [18] to construct a customized 3DConvLSTM network to process the STM sequences. The 3DConvLSTM consists of a 3DCNN, which outputs a sequence of spatial and local temporal features for a given sequence of STMs. The feature sequence from the 3DCNN is input to a two-layer ConvLSTM to extract high-level spatial and long-term temporal features, represented by the hidden states of the second ConvLSTM layer. The parameters of the 3DConvLSTM and of the convolutional and fully connected layers in the attention and classification components are all trained end-to-end.


Fig. 1. Overview of the proposed method. (1) An action instance is represented by a sequence of skeletons; (2) the skeleton joints of each frame are divided into three body parts and mapped to a skeleton texture map (STM), where each body part and its evolution over the period of the action instance are encoded with a different color scheme; (3) the STM sequence is fed into the 3DConvLSTM [18] to extract spatial and local temporal features; (4) a spatial convolution is applied to each hidden state of the second ConvLSTM layer of the 3DConvLSTM to estimate the spatial attention of that hidden state; (5) spatial pooling is applied to the attention-weighted hidden states and the concatenation of all pooled features is fed to an FC layer to estimate the attention weight at each time-stamp; and (6) an attention-based pooling is applied to the outputs of the spatial attention to generate an attention-weighted dynamic map for action classification.
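For concreteness, the following is a minimal sketch of a backbone of the kind described above (a small 3DCNN followed by a two-layer ConvLSTM) written with tf.keras. The number of 3D convolutional layers, their filter counts and kernel sizes are assumptions chosen only for illustration; the 128/256 ConvLSTM filters, the 112 x 112 input size and the 28 x 28 output feature maps follow the settings reported in Section 4.2.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_3dconvlstm_backbone(seq_len=20, size=112, channels=3):
    """Sketch of a 3DCNN + two-layer ConvLSTM feature extractor.

    Input:  a sequence of STMs, shape (seq_len, size, size, channels).
    Output: one feature map per time-stamp (the hidden states H_i).
    """
    inputs = layers.Input(shape=(seq_len, size, size, channels))
    # 3D convolutions capture spatial and short-term temporal structure.
    x = layers.Conv3D(64, (3, 3, 3), padding='same', activation='relu')(inputs)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    x = layers.Conv3D(128, (3, 3, 3), padding='same', activation='relu')(x)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2))(x)
    # Two ConvLSTM layers model long-term temporal dependencies; all hidden
    # states are returned so that attention can be computed per time-stamp.
    x = layers.ConvLSTM2D(128, (3, 3), padding='same', return_sequences=True)(x)
    hidden_states = layers.ConvLSTM2D(256, (3, 3), padding='same',
                                      return_sequences=True)(x)
    return models.Model(inputs, hidden_states)

model = build_3dconvlstm_backbone()
model.summary()   # hidden states have shape (seq_len, 28, 28, 256)
```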

3.1. Construction of STM

In [14], the joint trajectories of a sequence are encoded into a texture image (JTM) by assigning different colors to different time-stamps so as to encode the evolution of the joints over the period of an action instance. Since the trajectories of all joints over the whole instance are projected onto a single JTM, some spatio-temporal information is inevitably lost due to intra- and inter-trajectory overlapping, even though the problem is converted from the classification of a sequence into the classification of an image (i.e. a JTM). Unlike the JTM, the proposed skeleton texture map (STM) represents the 3D joints of a skeleton at a single time instance, with the body part and the relative time-stamp within the duration of the action encoded by colors. Specifically, in this paper a full body is divided into three parts, the left body part (left arm and left leg), the middle body part and the right body part (right arm and right leg), and each body part has its own color scheme for coloring its joints according to their relative time stamps in the sequence. Fig. 2 shows the three color schemes used in this paper for the three body parts; other color schemes are also possible.

Let a skeleton sequence be $S = \{F_1, F_2, \ldots, F_N\}$, where each frame consists of $m$ joints, $F_i = \{PX_i, PY_i, PZ_i\}$ is the vector of joint coordinates of frame $i$, and $PX_i = \{px_{i1}, px_{i2}, \ldots, px_{im}\}$. In this paper, the joints are projected onto the camera image plane (front plane) and used to construct an STM, represented as a $120 \times 120 \times 3$ matrix $B$; the resultant front STM sequence $G$ is obtained as follows:

$$x_{ik} = \mathrm{floor}\left(100 \cdot \frac{px_{ik} - \min(PX_i)}{\max(PX_i) - \min(PX_i)}\right) + 10 \quad (1)$$

$$y_{ik} = \mathrm{floor}\left(100 \cdot \frac{py_{ik} - \min(PY_i)}{\max(PY_i) - \min(PY_i)}\right) + 10 \quad (2)$$

$$B_i(x_{ik}, y_{ik}, :) = C(i, k) = SC_k\!\left(\mathrm{floor}\!\left(\frac{i}{N} \cdot n\right)\right) \quad (3)$$

$$G = \{B_1, B_2, \ldots, B_N\} \quad (4)$$

where $C(i, k)$ denotes the color assigned to the $k$-th joint in the $i$-th frame and copied to the $(x_{ik}, y_{ik})$ coordinates of the matrix $B_i$, $SC_k(i)$ is the color of the $i$-th element of the color scheme chosen for joint $k$, and $n$ is the number of elements in the color scheme.

Fig. 2. Three color schemes for the three body parts and their relative time-stamps in a sequence. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
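A minimal sketch of the STM construction in Eqs. (1)-(4) is given below, assuming a simple per-part color scheme; the joint-to-body-part assignment and the placeholder gray-level "schemes" are illustrative only and do not reproduce the color schemes of Fig. 2.

```python
import numpy as np

def build_stm_sequence(joints_xy, part_of_joint, schemes, size=120):
    """Construct a front-view STM sequence G = {B_1, ..., B_N} (Eqs. (1)-(4)).

    joints_xy:     array (N, m, 2) of projected joint x/y coordinates per frame.
    part_of_joint: array (m,) mapping each joint to a body-part index (0..2).
    schemes:       list of per-part color schemes, each an (n, 3) array.
    """
    N, m, _ = joints_xy.shape
    G = np.zeros((N, size, size, 3), dtype=np.float32)
    for i in range(N):
        px, py = joints_xy[i, :, 0], joints_xy[i, :, 1]
        # Eqs. (1)-(2): normalize joint coordinates into the 120x120 map.
        x = np.floor(100 * (px - px.min()) / (px.max() - px.min() + 1e-8)).astype(int) + 10
        y = np.floor(100 * (py - py.min()) / (py.max() - py.min() + 1e-8)).astype(int) + 10
        for k in range(m):
            scheme = schemes[part_of_joint[k]]   # color scheme of joint k's body part
            n = len(scheme)
            # Eq. (3): pick the color by the relative time-stamp i/N (clamped to the scheme).
            color = scheme[min(int(np.floor(i / N * n)), n - 1)]
            G[i, x[k], y[k], :] = color
    return G

# Hypothetical example: 20 frames, 25 joints, three gray-level "schemes" of 32 entries each.
joints = np.random.rand(20, 25, 2)
parts = np.random.randint(0, 3, size=25)
schemes = [np.linspace(0, 1, 32)[:, None] * np.ones(3) * s for s in (1.0, 0.8, 0.6)]
stms = build_stm_sequence(joints, parts, schemes)   # shape (20, 120, 120, 3)
```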


It should be pointed out that multiple STM sequences can be constructed from one 3D skeleton sequence by projecting the joints onto different planes, typically the three Cartesian planes. Each sequence of STMs can be processed by the proposed method, and the classifications from all the sequences are fused to produce the final result. This paper uses the front STM sequence to explain the mapping; the side and top STM sequences are obtained by the same operation.

3.2. Spatial attention

Through the processing of the 3DCNN and the ConvLSTM, the spatio-temporal information available at each hidden state only covers the sequence up to the time-stamp of that state. Unlike [18], where the hidden state of the last time stamp of the LSTM is taken as the output and further processed for classification, we focus on the vital spatial regions captured by each hidden state by applying a spatial convolution that constructs a spatial map for the corresponding time stamp. The spatial map $S_i$ at the $i$-th time-stamp is defined as

$$S_i = \mathrm{relu}(w_s * H_i + b_s) \quad (5)$$

where $H_i$ is the $i$-th hidden state of the last ConvLSTM layer, $w_s$ is the convolutional kernel to be trained, $*$ denotes the convolution operation and relu is the activation function. Then a state-level attentive dynamic map $\mathrm{Sout}_i$, which captures the spatio-temporal information up to time stamp $i$, is constructed as

$$\mathrm{Sout}_i = S_i \odot H_i \quad (6)$$

where $\odot$ refers to element-wise multiplication.

3.3. Temporal attention and attentive dynamic map (ADM)

The spatial attention provides the vital spatial context of an action for estimating the importance of each time step. The outputs of the individual steps, however, do not in general carry equal amounts of valuable information. The attention weight of each time-stamp is therefore learned through a simple network consisting of a pooling layer that pools the state-level attentive dynamic map, a flatten layer that concatenates the pooled features from all states, a fully-connected layer and a softmax layer, as shown in Fig. 1:

$$V_i = \mathrm{relu}\!\left(W_a\, \mathrm{Pooling}(\mathrm{Sout}_i) + b_a\right) \quad (7)$$

$$a_i = \frac{\exp(V_i)}{\sum_{q=1}^{T} \exp(V_q)} \quad (8)$$

where $W_a$ are the parameters of the fully-connected layer producing $V$, $T$ is the number of time-stamps (hidden states) and $a$ denotes the temporal weights. An attentive dynamic map (ADM) $D_A$ is a dynamic map weighted by both the state-based spatial attention $\mathrm{Sout}_i$ and the temporal attention $a_i$. In this paper, an ADM is constructed as the weighted average of all hidden states of an action instance and fed to the action classification network, that is,

$$D_A = \sum_{i=1}^{T} a_i\, \mathrm{Sout}_i \quad (9)$$

The action classification network consists of a Batch Normalization layer, a Spatial Pyramid Pooling layer and a fully-connected layer, as shown in Fig. 1.

3.4. Decision fusion

Given a testing skeleton sequence, the STMs of the three orthogonal planes are constructed and fed into the proposed deep networks respectively. The classification scores produced on the three planes are fused, and the maximum score in the resultant score vector gives the recognized class. We use multiply score fusion, that is, the score vectors output by the three networks are multiplied element-wise:

$$\mathrm{label} = \mathrm{Fin}\!\left(\max(v_1 \odot v_2 \odot v_3)\right) \quad (10)$$

where $v$ is a score vector, $\odot$ refers to element-wise multiplication, and $\mathrm{Fin}(\cdot)$ is a function that finds the index of the element with the maximum score.
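As a rough illustration of Eqs. (5)-(9), the following NumPy sketch computes state-level attentive dynamic maps and their temporally weighted average. The random kernels and shapes stand in for trained parameters, and the spatial convolution is applied to the channel-averaged hidden state as a simplification; none of this is the paper's exact implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def attentive_dynamic_map(H, w_s, b_s, W_a, b_a):
    """Compute an ADM from ConvLSTM hidden states (schematic of Eqs. (5)-(9)).

    H:   hidden states, shape (T, height, width, channels).
    w_s: spatial-attention kernel, shape (k, k); b_s: scalar bias.
    W_a: weights of the temporal FC layer, shape (channels,); b_a: scalar bias.
    """
    relu = lambda z: np.maximum(z, 0.0)
    T = H.shape[0]
    Sout, V = [], []
    for i in range(T):
        # Eq. (5): spatial map from a convolution over the channel-averaged state.
        S_i = relu(convolve2d(H[i].mean(axis=-1), w_s, mode='same') + b_s)
        # Eq. (6): state-level attentive dynamic map.
        Sout_i = S_i[..., None] * H[i]
        Sout.append(Sout_i)
        # Eq. (7): pooled features -> scalar score for this time-stamp.
        V.append(relu(Sout_i.mean(axis=(0, 1)) @ W_a + b_a))
    # Eq. (8): softmax over time-stamps gives the temporal weights a_i.
    a = np.exp(V) / np.sum(np.exp(V))
    # Eq. (9): the ADM is the temporally weighted sum of the state-level maps.
    return np.tensordot(a, np.stack(Sout), axes=(0, 0))

# Hypothetical example: 20 time-stamps of 28x28x256 hidden states.
H = np.random.rand(20, 28, 28, 256)
D_A = attentive_dynamic_map(H, np.random.rand(3, 3), 0.0,
                            np.random.rand(256), 0.0)   # shape (28, 28, 256)
```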

4. Experiments

4.1. Datasets

There are many datasets available for evaluation [46]. The proposed method was evaluated on the widely used NTU RGB+D Dataset [25], the MSRC-12 Kinect Gesture Dataset [26] and the Large Scale Combined Dataset (LSC) [27], and compared with existing methods that have achieved state-of-the-art results on these datasets. These datasets cover a wide range of actions, including human-computer interactions, daily activities, and human-object and human-human interactions.

4.1.1. NTU RGB+D dataset

To the best knowledge of the authors, the NTU RGB+D dataset [25] is currently the largest dataset for action recognition. It consists of 56,578 action samples of 60 different classes, captured from the front view, two side views and left and right 45-degree views, and performed by 40 subjects aged between 10 and 35. The dataset is mainly composed of daily activities, human-object and human-human interactions.

4.1.2. MSRC-12 Kinect Gesture dataset

MSRC-12 [26] is a relatively large dataset for gesture/action recognition from 3D skeleton data captured by a Kinect sensor. It has 594 sequences containing 12 gestures performed by 30 subjects, with 6244 gesture instances in total. The 12 gestures are: "lift outstretched arms", "duck", "push right", "goggles", "wind it up", "shoot", "bow", "throw", "had enough", "beat both", "change weapon" and "kick". This dataset mainly consists of human-computer interactions.

4.1.3. Large Scale Combined dataset

The LSC dataset [27] combines nine publicly available datasets, namely MSR Action3D Ext [47], UTKinect-Action3D [48], MSR DailyActivity 3D [49], MSR ActionPairs 3D [50], CAD120 [51], CAD60 [52], G3D [53], RGBD-HuDaAct [54] and UTD-MHAD [55], to form a complex action dataset with 94 actions. As some samples do not have the skeleton modality, they are excluded, leaving 88 actions in total for evaluation. The evaluation follows the protocols recommended in [27]. The combined dataset covers a wide range of actions, and since each individual dataset has its own characteristics in terms of subjects, execution manners, backgrounds, acting positions, view angles, resolutions and sensor types, the combination of a large number of action classes makes the dataset more challenging than each individual dataset due to the large intra-class variation.


4.2. Implementation details

In all experiments, STM sequences of the three views were used to evaluate the proposed method. For a fair comparison, the same frame sampling procedure as in [38] was employed: a skeleton sequence was divided into N = 20 segments and one frame was randomly selected from each segment. We followed [18] to construct a customized 3DConvLSTM network consisting of 3D Convolutional Networks (3DCNN), Convolutional LSTM (ConvLSTM) and Spatial Pyramid Pooling (SPP) [56] with multi-level pooled features. Different from [18], the numbers of convolutional filters of the two ConvLSTM layers are 128 and 256 respectively. The size of the maps input to the 3DCNN is 112 × 112 and the size of the long-term spatio-temporal feature maps out of the second ConvLSTM layer is 28 × 28. The cross-entropy cost function is used as the ultimate objective for classification.

The network was implemented using TensorFlow and TensorLayer. The parameters from the 3DCNN to the action classification layers were trained from scratch because there are no compatible pre-trained models. The stochastic gradient descent (SGD) algorithm was employed to train the entire network end-to-end. The training ran for 30 epochs. The initial learning rate was set to 0.1 and dropped to a tenth of its value every one third of the total epochs. The weight decay was set to 0.005 and the batch size was 15. The dropout probability was set to 0.5. For the attention component, 256 kernels of size 3 × 3 were applied in the convolutional layer.
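A minimal sketch of the training configuration described above (SGD, 30 epochs, batch size 15, learning rate 0.1 dropped by a factor of ten every third of the epochs); the momentum, weight-decay placement and any model/data objects are not specified in the paper and are left out or noted as assumptions.

```python
import tensorflow as tf

EPOCHS, BATCH_SIZE, BASE_LR = 30, 15, 0.1

def lr_schedule(epoch):
    """Drop the learning rate to a tenth of its value every third of the total epochs."""
    return BASE_LR * (0.1 ** (epoch // (EPOCHS // 3)))

# SGD with the paper's initial learning rate; the 0.005 weight decay would be applied
# as L2 regularization on the layers and the 0.5 dropout inside the network itself.
optimizer = tf.keras.optimizers.SGD(learning_rate=BASE_LR)
loss_fn = tf.keras.losses.CategoricalCrossentropy()
lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)

# The schedule yields 0.1 for epochs 0-9, 0.01 for epochs 10-19 and 0.001 for 20-29.
print([lr_schedule(e) for e in (0, 10, 20)])
```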

4.3. Evaluation of key factors

Experiments were conducted to evaluate a few key aspects of the proposed method, including the need for a late fusion of the recognition from the STMs computed in the three orthogonal planes, the late fusion functions, and the effectiveness of data augmentation in training.

4.3.1. Results of individual planes

The results of recognition using the individual planes are listed in Table 1, where STM-xy, STM-yz and STM-xz denote the STMs in the xy, yz and xz planes respectively. From Table 1 we can see that STM-xy and STM-yz perform better than STM-xz.

Table 1
Comparisons of individual planes on the three datasets.

Dataset                      STM-xy    STM-yz    STM-xz
NTU RGB+D (Cross Subject)    77.72%    78.98%    72.04%
NTU RGB+D (Cross View)       85.71%    85.25%    79.61%
MSRC-12                      95.48%    93.60%    91.70%
LSC (Cross Subject)          83.39%    81.30%    75.51%
LSC (Cross Sample)           88.79%    86.12%    79.95%

4.3.2. Comparison of three decision fusion methods

The results of the three score fusion methods on the three datasets are listed in Table 2. From the table it can be seen that multiply score fusion improves the final accuracy more than average and max fusion. This implies that the information captured by the STMs of the three orthogonal planes is likely to be complementary, as the fusion sketch after Table 2 illustrates.

Table 2
Comparison of three score fusion methods on the three datasets.

Dataset                      Max       Average   Multiply
NTU RGB+D (Cross Subject)    82.34%    83.18%    83.55%
NTU RGB+D (Cross View)       89.14%    89.96%    90.24%
MSRC-12                      96.39%    96.49%    96.97%
LSC (Cross Subject)          87.25%    88.35%    90.13%
LSC (Cross Sample)           90.17%    91.15%    92.39%
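The small sketch below implements the multiply score fusion of Eq. (10) alongside the max and average alternatives compared in Table 2; the three score vectors are made-up numbers used purely for illustration.

```python
import numpy as np

def fuse_scores(v1, v2, v3, method="multiply"):
    """Fuse per-plane class-score vectors and return the predicted class index."""
    if method == "multiply":          # Eq. (10): element-wise product of the scores
        fused = v1 * v2 * v3
    elif method == "average":
        fused = (v1 + v2 + v3) / 3.0
    elif method == "max":
        fused = np.maximum(np.maximum(v1, v2), v3)
    else:
        raise ValueError(method)
    return int(np.argmax(fused))      # Fin(max(.)): index of the largest fused score

# Made-up softmax scores from the xy, yz and xz networks for a 4-class problem.
v_xy = np.array([0.10, 0.60, 0.20, 0.10])
v_yz = np.array([0.05, 0.55, 0.30, 0.10])
v_xz = np.array([0.20, 0.30, 0.40, 0.10])
print(fuse_scores(v_xy, v_yz, v_xz, "multiply"))   # -> 1
```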

4.3.3. Rotation

We followed [57] to augment the training data in order to overcome the drawback of viewpoint dependence and to improve performance. Specifically, the skeleton data was rotated with a fixed step of 22.5 degrees along the polar angle θ and the azimuthal angle φ, in the range [0°, 45°] for θ and [−45°, 45°] for φ (see the sketch after Table 3). Table 3 compares the proposed method with and without rotation on the MSRC-12 Kinect Gesture dataset. As expected, data augmentation through rotation improves the performance.

Table 3
Comparison of the proposed method with and without data augmentation through rotation on the MSRC-12 Kinect Gesture Dataset.

Method             Without Rotation    With Rotation
STM-xy             93.75%              95.48%
STM-yz             90.70%              93.60%
STM-xz             91.46%              91.70%
Multiply fusion    96.11%              96.97%
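A sketch of the rotation augmentation described above, generating copies of a skeleton sequence rotated in 22.5-degree steps of the polar and azimuthal angles; the choice of rotation axes and the joint array layout are assumptions made for illustration.

```python
import numpy as np

def rotation_matrix_y(deg):
    """Rotation about the vertical axis (azimuthal angle)."""
    r = np.deg2rad(deg)
    return np.array([[np.cos(r), 0, np.sin(r)], [0, 1, 0], [-np.sin(r), 0, np.cos(r)]])

def rotation_matrix_x(deg):
    """Rotation about a horizontal axis (polar angle)."""
    r = np.deg2rad(deg)
    return np.array([[1, 0, 0], [0, np.cos(r), -np.sin(r)], [0, np.sin(r), np.cos(r)]])

def augment_by_rotation(joints):
    """Generate rotated copies of a skeleton sequence of shape (N, m, 3).

    Polar angle in {0, 22.5, 45} degrees, azimuth in {-45, -22.5, 0, 22.5, 45}.
    """
    augmented = []
    for theta in np.arange(0.0, 45.1, 22.5):
        for phi in np.arange(-45.0, 45.1, 22.5):
            R = rotation_matrix_x(theta) @ rotation_matrix_y(phi)
            augmented.append(joints @ R.T)
    return augmented

sequence = np.random.rand(20, 25, 3)       # hypothetical 20-frame, 25-joint sequence
copies = augment_by_rotation(sequence)     # 3 x 5 = 15 rotated copies
print(len(copies), copies[0].shape)
```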

4.3.4. Effectiveness of the attentive dynamic map (ADM)

To verify the effectiveness of the attentive dynamic map, four alternative schemes are compared: (1) Traditional method: the hidden state of the last time stamp of the ConvLSTM is used for classification; (2) Dynamic map (DM): the average of the hidden states of the ConvLSTM is used for classification; (3) Spatial attention (SA): only the spatial attention is used, and the hidden state of the last time stamp of the ConvLSTM is used for classification; (4) Temporal attention (TA): only the temporal attention is applied to the hidden states of the ConvLSTM. Table 4 shows the accuracies when using STM-xy alone on the Large Scale Combined dataset. It can be seen that the proposed ADM-based method outperformed the traditional method without attention, which demonstrates that the attentive dynamic maps learned by the proposed method are effective.

Table 4
Performance comparison of different schemes using STM-xy alone on the Large Scale Combined Dataset.

Method                Cross subject    Cross sample
Traditional method    81.73%           86.63%
DM                    82.16%           86.92%
SA                    82.95%           87.42%
TA                    82.87%           87.23%
ADM                   83.39%           88.79%

Fig. 3 shows the attention maps at each time-stamp of the ConvLSTM layer and the corresponding STMs of several actions. It can be seen that the attention maps pick up the vital spatial regions, and that the attention weight usually increases over time due to the accumulation of information. However, if the last few steps contain mostly irrelevant information, the attention weight decreases, as expected.


Fig. 3. Example attention maps learned by the proposed method at each time stamp of the ConvLSTM layer and the corresponding STMs of some actions.

4.4. Results on NTU RGB+D dataset

This dataset is challenging and provides two evaluation protocols: cross-subject, in which 20 subjects are used for training and the remaining subjects for testing, and cross-view, in which two camera views are used for training and one camera view for testing. The confusion matrices for the cross-subject and cross-view protocols are shown in Figs. 4 and 5 respectively. From the confusion matrices, it can be seen that the proposed method performed well on the NTU RGB+D Dataset, although it is less effective at distinguishing some hand actions such as "reading" and "writing". The proposed method is compared with state-of-the-art methods in Table 5. It outperformed both hand-crafted feature based methods [58,59] and deep learning methods [60,25,14,22,33,23,61,15]. Specifically, the proposed method outperformed the recent LSTM-based methods [25,38] and attention-based methods [22,40]. We did not compare with the method of [24] since it was only evaluated on three small datasets on which the results are saturated; nevertheless, its performance is expected to be comparable to that of [38] on small datasets. Note that the CNN+LSTM method in [62] employed three RNNs and seven CNNs, which is much more computationally intensive than the proposed architecture, yet the proposed method performed better. It should also be pointed out that the proposed method outperformed the JTMs used in [14] by 10.15 percentage points in cross-subject recognition and 15.04 percentage points in cross-view recognition.

Fig. 4. The confusion matrix of the cross subject protocol on the NTU RGB+D dataset.

Fig. 5. The confusion matrix of the cross view protocol on the NTU RGB+D dataset.

Table 5
Experimental results (accuracies) on the NTU RGB+D Dataset and comparison with the state-of-the-art methods. Note that the CNN+LSTM method [62] fused the results from three RNNs and seven CNNs.

Method                               Cross subject    Cross view
Lie Group [58]                       50.08%           52.76%
Dynamic Skeletons [59]               60.23%           65.22%
HBRNN [60]                           59.07%           63.97%
Deep RNN [25]                        56.29%           64.09%
Part-aware LSTM [25]                 62.93%           70.27%
ST-LSTM+Trust Gate [38]              69.20%           77.70%
JTM (CNN) [14]                       73.40%           75.20%
STA Model [23]                       73.40%           81.20%
SkeletonNet [33]                     75.90%           81.20%
JDM (CNN) [32]                       76.20%           82.30%
Geometric Features [40]              70.26%           82.39%
GCA-LSTM [22]                        74.40%           82.80%
Clips+CNN+MTLN [63]                  79.57%           84.83%
View invariant [34]                  80.03%           87.21%
IndRNN [61]                          81.80%           87.97%
ST-GCN [15]                          81.50%           88.30%
CNN+LSTM (10-channel fusion) [62]    82.89%           90.10%
Proposed Method                      83.55%           90.24%


4.5. Results on MSRC-12 Kinect Gesture dataset

For this dataset, the cross-subject protocol was adopted: odd-numbered subjects were used for training and even-numbered subjects for testing. The confusion matrix is shown in Fig. 6. From the confusion matrix we can see that the proposed method distinguishes most of the actions very well, although it is less effective at distinguishing "lift outstretched arms" from "beat both", which have similar STMs, probably because of the 3D-to-2D projection. The experimental results are reported in Table 6. The proposed method outperformed the state-of-the-art skeleton-based methods in [64–66,14,67], which demonstrates its effectiveness.

Fig. 6. The confusion matrix of the proposed method on the MSRC-12 dataset.

Table 6
Experimental results (accuracies) on the MSRC-12 Kinect Gesture Dataset and comparison with the state-of-the-art methods.

Method             Accuracy
HGM [64]           66.25%
ElC-KSVD [65]      90.22%
Cov3DJ [66]        91.70%
JTM (CNN) [14]     93.12%
SOS (CNN) [67]     94.27%
Proposed Method    96.97%

4.6. Results on Large Scale Combined dataset

We follow the protocol designed in the recent work [27] to conduct two types of experiments, random cross subject and random cross sample. The confusion matrices for the cross-subject and cross-sample protocols are shown in Figs. 7 and 8 respectively. From the confusion matrices, it can be seen that the proposed method performed well on the Large Scale Combined dataset, although it is less effective at distinguishing some interactive actions with small motion, such as "talking on the phone" and "write on a paper". The experimental results are shown in Table 7. Compared with the existing methods [50,59,25,68], the proposed method achieved the best precision and recall.

Fig. 7. The confusion matrix of the cross subject protocol on the Large Scale Combined dataset.


Fig. 8. The confusion matrix of the cross sample protocol on the Large Scale Combined dataset.

Table 7
Experimental results on the Large Scale Combined Dataset and comparison with the state-of-the-art methods.

                          Cross sample              Cross subject
Method                    Precision    Recall       Precision    Recall
HON4D [50]                84.6%        84.1%        63.1%        59.3%
Dynamic Skeletons [59]    85.9%        85.6%        74.5%        73.7%
P-LSTM [25]               84.2%        84.9%        76.3%        74.6%
AGNN [68]                 87.6%        88.1%        84.0%        82.0%
Proposed Method           92.39%       88.30%       90.13%       84.50%

5. Conclusion

In this paper, we have presented a novel deep neural network that learns an attention-based dynamic map for skeleton-based action recognition. A 3DConvLSTM is used to learn spatio-temporal features from which a dynamic map is generated; the dynamic map is used to learn the spatio-temporal attention at each time-stamp, and an attention-based dynamic map is subsequently constructed for action classification. Experimental results have shown the effectiveness of the proposed method. It is expected that the recognition accuracy can be further improved by fusing the classifications obtained by the proposed method on multiple STM sequences constructed from three properly chosen orthogonal planes (rather than planes following the camera coordinate system) or from multiple planes. In addition, the proposed method can easily be extended to the RGB and/or depth modalities.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was partly supported by the National Natural Science Foundation of China (grant 61571325) and Key Projects in the Tianjin Science & Technology Pillar Program (16ZXHLGX001900).

References

[1] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[2] C. Feichtenhofer, A. Pinz, R.P. Wildes, Spatiotemporal multiplier networks for video action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4768–4777.
[3] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[4] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-C. Woo, Convolutional LSTM network: a machine learning approach for precipitation nowcasting, in: Advances in Neural Information Processing Systems, 2015, pp. 802–810.

[5] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, S. Gould, Dynamic image networks for action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3034–3042. [6] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake, Real-time human pose recognition in parts from single depth images, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1297–1304. [7] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2d pose estimation using part affinity fields, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7291–7299. [8] B. Xiao, H. Wu, Y. Wei, Simple baselines for human pose estimation and tracking, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 466–481. [9] J. Song, L. Wang, L. Van Gool, O. Hilliges, Thin-slicing network: a deep structured model for pose estimation in videos, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4220– 4229. [10] X. Chen, M. Koskela, Skeleton-based action recognition with extreme learning machines, Neurocomputing 149 (2015) 387–396. [11] L.L. Presti, M. La Cascia, 3d skeleton-based human action classification: a survey, Pattern Recogn. 53 (2016) 130–147. [12] H. Zhang, P. Zhong, J. He, C. Xia, Combining depth-skeleton feature with sparse coding for action recognition, Neurocomputing 230 (2017) 417–426. [13] P. Wang, W. Li, P. Ogunbona, J. Wan, S. Escalera, Rgb-d-based human motion recognition with deep learning: a survey, Comput. Vis. Image Underst. 171 (2018) 118–139. [14] P. Wang, Z. Li, Y. Hou, W. Li, Action recognition based on joint trajectory maps using convolutional neural networks, in: Proc. ACM on Multimedia Conference, 2016, pp. 102–106. [15] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [16] C. Li, Z. Cui, W. Zheng, C. Xu, J. Yang, Spatio-temporal graph convolution for skeleton based action recognition, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018. [17] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.


[18] G. Zhu, L. Zhang, P. Shen, J. Song, Multimodal gesture recognition using 3d convolution and convolutional lstm, IEEE Access. [19] L. Wang, Y. Xiong, D. Lin, L. Van, Gool, Untrimmednets for weakly supervised action recognition and detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4325–4334. [20] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, Y. Bengio, Show, attend and tell: neural image caption generation with visual attention, in: International Conference on Machine Learning, 2015, pp. 2048–2057. [21] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, ICLR. [22] J. Liu, G. Wang, P. Hu, L.-Y. Duan, A.C. Kot, Global context-aware attention lstm networks for 3d action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1647–1656. [23] S. Song, C. Lan, J. Xing, W. Zeng, J. Liu, An end-to-end spatio-temporal attention model for human action recognition from skeleton data, in: AAAI, 2017, pp. 4263–4270. [24] J. Weng, C. Weng, J. Yuan, Spatio-temporal Naive-Bayes nearest-neighbor (stnbnn) for skeleton-based action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4171– 4180. [25] A. Shahroudy, J. Liu, T.-T. Ng, G. Wang, NTU RGB+ D: a large scale dataset for 3D human activity analysis, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1010–1019. [26] S. Fothergill, H. Mentis, P. Kohli, S. Nowozin, Instructing people for training gestural interactive systems, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, 2012, pp. 1737–1746. [27] J. Zhang, W. Li, P. Wang, P. Ogunbona, S. Liu, C. Tang, A large scale rgb-d dataset for action recognition, in: International Workshop on Understanding Human Activities through 3D Sensors, Springer, 2016, pp. 101–114. [28] S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 221–231. [29] Z. Qiu, T. Yao, T. Mei, Learning spatio-temporal representation with pseudo-3d residual networks, in: proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5533–5541. [30] L. Wang, W. Li, W. Li, L. Van Gool, Appearance-and-relation networks for video classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1430–1439. [31] Y. Du, Y. Fu, L. Wang, Skeleton based action recognition with convolutional neural network, in: Proc. Asian Conference on Pattern Recognition (IAPR), 2016, pp. 579–583. [32] C. Li, Y. Hou, P. Wang, W. Li, Joint distance maps based action recognition with convolutional neural networks, IEEE Signal Process. Lett. 24 (2017) 624–628. [33] Q. Ke, S. An, M. Bennamoun, F. Sohel, F. Boussaid, Skeletonnet: mining deep part features for 3-d action recognition, IEEE Signal Process. Lett. 24 (2017) 731–735. [34] M. Liu, L. Hong, C. chen, Enhanced skeleton visualization for view invariant human action recognition, Pattern Recogn. [35] H. Liu, J. Tu, M. Liu, Two-stream 3d convolutional neural network for skeletonbased action recognition, in: IEEE International Conference on Multimedia Expo Workshops (ICMEW), 2017. [36] Y. Tang, Y. Tian, J. Lu, P. Li, J. 
Zhou, Deep progressive reinforcement learning for skeleton-based action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5323–5332. [37] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks, in: Proc. AAAI Conference on Artificial Intelligence, 2016, pp. 3697– 3704. [38] J. Liu, A. Shahroudy, D. Xu, G. Wang, Spatio-temporal LSTM with trust gates for 3D human action recognition, in: Proc. European Conference on Computer Vision, 2016, pp. 816–833. [39] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, N. Zheng, et al., View adaptive recurrent neural networks for high performance human action recognition from skeleton data, in: IEEE International Conference on Computer Vision, 2017, pp. 2136–2145. [40] S. Zhang, X. Liu, X. Jun, On geometric features for skeleton-based action recognition using multilayer lstm networks, in: Proc. WACV Conference, 2017, pp. 148–157. [41] N. Mishra, M. Rohaninejad, X. Chen, P. Abbeel, Meta-learning with temporal convolutions, arXiv preprint arXiv: 1707.03141. [42] S. Shikhar, K. Ryan, S. Ruslan, Action recognition using visual attention, in: International Conference on Learning Representations (ICLR) Workshop, 2016. [43] S. Yan, J.S. Smith, W. Lu, B. Zhang, Hierarchical multi-scale attention networks for action recognition, arXiv preprint arXiv: 1708.07590.

[44] M. Xin, H. Zhang, M. Sun, D. Yuan, Recurrent temporal sparse autoencoder for attention-based action recognition, in: Neural Networks (IJCNN), 2016 International Joint Conference on, IEEE, 2016, pp. 456–463. [45] F. Baradel, C. Wolf, J. Mille, Human action recognition: pose-based attention draws focus to hands, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 604–613. [46] J. Zhang, W. Li, P.O. Ogunbona, P. Wang, C. Tang, Rgb-d-based action recognition datasets: a survey, Pattern Recogn. 60 (2016) 86–105. [47] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3D points, in: CVPRW, 2010, pp. 9–14. [48] L. Xia, C.-C. Chen, J.K. Aggarwal, View invariant human action recognition using histograms of 3d joints, in: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, IEEE, 2012, pp. 20–27. [49] J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1290–1297. [50] O. Oreifej, Z. Liu, Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences, in: Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, IEEE, 2013, pp. 716–723. [51] H.S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from rgb-d videos, Int. J. Robot. Res. 32 (8) (2013) 951–970. [52] J. Sung, C. Ponce, B. Selman, A. Saxena, Unstructured human activity detection from rgbd images, in: Robotics and Automation (ICRA), 2012 IEEE International Conference on, IEEE, 2012, pp. 842–849. [53] V. Bloom, D. Makris, V. Argyriou, G3d: A gaming action dataset and real time action recognition evaluation framework, in: Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, IEEE, 2012, pp. 7–12. [54] B. Ni, G. Wang, P. Moulin, Rgbd-hudaact: A color-depth video database for human daily activity recognition, in: Consumer Depth Cameras for Computer Vision, Springer, 2013, pp. 193–208. [55] C. Chen, R. Jafari, N. Kehtarnavaz, UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor, in: Proc. IEEE International Conference onImage Processing (ICIP), 2015, pp. 168–172. [56] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, in: European Conference on Computer Vision, Springer, 2014, pp. 346–361. [57] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, P.O. Ogunbona, Action recognition from depth maps using deep convolutional neural networks, IEEE Trans. Hum. Mach. Syst. 46 (4) (2016) 498–509. [58] R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3D skeletons as points in a lie group, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595. [59] E. Ohn-Bar, M. Trivedi, Joint angles similarities and HOG2 for action recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 465–470. [60] Y. Du, W. Wang, L. Wang, Hierarchical recurrent neural network for skeleton based action recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118. [61] S. Li, W. Li, C. Cook, C. Zhu, Y. Gao, Independently recurrent neural network (indrnn): Building a longer and deeper rnn, arXiv preprint arXiv: 1803.04831. [62] C. Li, P. Wang, S. Wang, Y. Hou, W. 
Li, Skeleton-based action recognition using lstm and cnn, in: Proc. IEEE International Conference on Multimedia and Expo Workshop (ICMEW), 2017, pp. 1–6. [63] Q. Ke, M. Bennamoun, S. An, F. Sohel, F. Boussaid, A new representation of skeleton sequences for 3d action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3288– 3297. [64] S. Yang, C. Yuan, W. Hu, X. Ding, A hierarchical model based on latent dirichlet allocation for action recognition, in: Pattern Recognition (ICPR), 2014 22nd International Conference on, IEEE, 2014, pp. 2613–2618. [65] L. Zhou, W. Li, Y. Zhang, P. Ogunbona, D.T. Nguyen, H. Zhang, Discriminative key pose extraction using extended LC-KSVD for action recognition, in: Proc. IEEE International Conference on Digital Image Computing: Techniques and Applications (DlCTA), 2014, pp. 1–8. [66] M.E. Hussein, M. Torki, M.A. Gowayyed, M. El-Saban, Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint locations, in: Twenty-Third International Joint Conference on Artificial Intelligence, 2013. [67] Y. Hou, Z. Li, P. Wang, W. Li, Skeleton optical spectra based action recognition using convolutional neural networks, IEEE Trans. Circ. Syst. Video Technol. [68] C. Li, Z. Cui, W. Zheng, C. Xu, R. Ji, J. Yang, Action-attending graphic neural network, arXiv preprint arXiv: 1711.06427.