Signal Processing: Image Communication 83 (2020) 115776
Global relational reasoning with spatial temporal graph interaction networks for skeleton-based action recognition

Wenwen Ding a, Xiao Li b,∗, Guang Li b, Yuesong Wei a
a School of Mathematical Sciences, Huaibei Normal University, Anhui, China
b School of Computer Science and Technology, Xidian University, Xi'an, China

∗ Corresponding author.
E-mail addresses: [email protected] (W. Ding), [email protected] (X. Li), [email protected] (G. Li), [email protected] (Y. Wei).
ARTICLE INFO

Keywords:
Deep learning
Graph convolutional network
Convolutional neural networks
Spatio-temporal graph
Message passing
ABSTRACT

With the prevalence of accessible depth sensors, dynamic skeletons have attracted much attention as a robust modality for action recognition. Convolutional neural networks (CNNs) excel at modeling local relations within local receptive fields but are typically inefficient at capturing global relations. In this article, we first view dynamic skeletons as a spatio-temporal graph (STG) and then learn the localized correlated features that generate the embedded nodes of the STG by message passing. To better extract global relational information, a novel model called spatial–temporal graph interaction networks (STG-INs) is proposed, which performs long-range temporal modeling of human body parts. In this model, human body parts are mapped to an interaction space where graph-based reasoning can be efficiently implemented via a graph convolutional network (GCN). After reasoning, global relation-aware features are distributed back to the embedded nodes of the STG. To evaluate our model, we conduct extensive experiments on three large-scale datasets. The experimental results demonstrate the effectiveness of our proposed model, which achieves state-of-the-art performance.
1. Introduction

Human action recognition has received considerable attention from both academia and industry, owing to its wide applications in video surveillance, virtual reality, and human–robot interaction. With the development of low-cost RGB-D sensors and efficient algorithms for estimating joint positions, dynamic skeletons have become an accessible and effective modality for human action recognition, thanks to their robustness to illumination changes, background noise, body scales, and motion speeds. Meanwhile, compared with RGB data or optical flow, skeleton data are much smaller, which makes them computationally efficient. Consequently, skeleton-based methods have become prevalent for human action recognition.

With the development of artificial intelligence technology, deep neural networks such as convolutional neural networks (CNNs) [1–3] can extract features automatically, replacing hand-crafted features. One of the key reasons for the success of CNNs is their ability to leverage the local statistical properties of data with an underlying Euclidean structure. Recently, there has been growing interest in applying learning to non-Euclidean geometric data such as graphs and manifolds [4–6]. Graphs are a data structure that models a set of objects (nodes) and their relationships (edges). Graph convolutional networks (GCNs) [7], which generalize CNNs to arbitrarily structured graphs, have shown good performance in learning from graph-structured data. However, GCNs are dedicated to capturing local features and relations, either in space or in time. While global (long-range) features can be aggregated with hierarchical GCNs [8], the node features of a graph might be weakened during long-range diffusion [9]. Repeating local operations also comes with a number of unfavorable issues in practice. First, it is computationally inefficient: stacking multiple convolution operators makes the receptive field unnecessarily large, so the cost of calculation and storage is high and the risk of over-fitting is increased [10]. Second, features from distant locations only become visible in later layers, resulting in inefficient reasoning [11].

A natural way to represent skeleton-based action sequences is a graph-based structure called a spatio-temporal graph (STG). Each human joint is regarded as a vertex or node. Each rigid body in the human skeleton is defined as a spatial edge, and temporal edges are added between corresponding joints across consecutive frames. Yan et al. [12] proposed spatial–temporal graph convolutional networks (ST-GCNs) to apply GCNs to model the STG. However, the GCN-based approaches only represent the local dependencies of spatial and temporal edges within a single convolution layer. Capturing relations among the disjoint and distant joints of the human body requires
stacking multiple such convolution layers, which is highly inefficient. Such a drawback increases the difficulty and cost of global reasoning for CNNs. For example, the hand and the head are physically disconnected, but their dependency is significant for recognizing actions such as drink water or answer the telephone. That is, convolutional operations on an STG only process a local neighborhood, and the global (long-range) interactions occurring between distant body joints are not extracted, either in space or in time.

To solve this problem, we propose a novel model, called spatial–temporal graph interaction networks (STG-INs), motivated by overcoming the intrinsic limitation of convolution operations for modeling global relations. Inspired by [13], the core idea of STG-INs is to treat dynamic skeletons as a complex system and to reason about the interactions and dynamics of its rigid bodies (human bones). Our model takes an STG as input and performs reasoning in a way that is analogous to collisions between rigid bodies. As shown in Fig. 1, each joint in the STG obtains updated embeddings according to the responses of its spatio-temporal neighbors. After message passing, each joint in the updated STG conveys the localized spatial–temporal correlations. The human skeleton is then decomposed into different parts, e.g., two arms, two legs, and one trunk. The aggregated embedding of the joints of each body part is projected into a node to form a fully connected graph G_I, which is fed into a GCN. After deploying a Gaussian function, a similarity graph is constructed for global relationship learning. After reasoning, the long-range relational features are distributed back to the updated STG to help subsequent convolution layers sense the entire space immediately. Finally, the features with synchronous local and global spatio-temporal learning go through a fully connected layer followed by a Softmax activation function to output the classification scores.

The contributions of this article are as follows. (1) For global relationship learning, we propose a new approach that projects each part of the human body into a node to form a fully connected similarity graph, capturing relations among the disjoint and distant joints of the human body. (2) We propose a novel model called spatial–temporal graph interaction networks (STG-INs), which simultaneously mines local detailed dynamics and global spatial–temporal relations for skeleton-based action recognition.

The remainder of this article is organized as follows. Section 2 presents the related work; in Section 3, we first learn the local relationships via message passing and then reason about the global relationships via a GCN; Section 4 presents our experimental results and discussion; and Section 5 concludes this article.

Fig. 1. General framework of the proposed approach.

2. Related work

2.1. Neural networks with graphs

In recent years, more and more research has combined neural networks with graph-structured data; this work can be categorized into two architectures.

The first framework, graph neural networks (GNNs) [14–16], captures the dependencies within graphs by propagating neighbor information via recurrent neural networks (RNNs) in an iterative manner until a stable fixed point is reached [17]. Each node in the graph captures the semantic relations and structural information within its neighborhood. Unlike standard neural networks, GNNs retain a state that can represent information from a neighborhood of arbitrary depth, and they have been shown to be highly effective at relational reasoning tasks and at modeling interacting systems. However, this process is computationally expensive. Li et al. [18] proposed a GNN-based model to efficiently capture dependencies between roles and to predict a consistent structured output for situation recognition. Qi et al. [6] introduced a framework, called a graph parsing neural network (GPNN), that offers a representation of human–object interactions applicable to images and videos within a message passing inference framework. Si et al. [19] proposed a novel model with spatial reasoning that uses a GNN to obtain high-level information about the spatial structure between different body parts.

The second framework is GCNs, which generalize CNNs to graph-structured data and have obtained remarkable performance for skeleton-based action recognition. These works mainly fall into two categories. (1) Spectral perspective. A spectral method is implemented in the spectral domain: a localization operator on the graph is defined as a linear operator that is diagonalized in the Fourier basis [20]. A GCN is applied in the spectral domain derived from the graph structure, where the locality of the graph convolution is expressed in a Fourier basis given by the eigenvectors of the Laplacian operator. Shi et al. [21] proposed a two-stream adaptive spectral graph convolutional network (2s-ASGCN), which constructs the graph convolution in the frequency domain with the help of the graph Fourier transform. Adopting the philosophy of multi-scale convolutional filtering, Li et al. [22] developed a recursive graph convolution model inspired by the autoregressive moving average; based on spectral graph theory, the graph convolution is transformed into the frequency domain using the graph Fourier transform. However, an essential limitation of the spectral method is that the spectral construction is confined to a single spectral domain, because the spectral filter coefficients are basis dependent. (2) Spatial perspective. The convolution filters are applied directly to the nodes and their neighbors via a finite-size convolution kernel. Yan et al. [12] proposed a distance-based sampling function for constructing the graph convolution, which was then employed as the basic module to build the final ST-GCN. Zhang et al. [23] designed a graph edge CNN, in which an edge is described by integrating spatially and temporally neighboring edges to address the consistency of the spatial–temporal variation within an action. However, the sampling function is limited to a heuristic design with a 1-distance neighbor set of joint vertices, and the topology of the graph is set by hand and fixed over all layers; the graph convolution is only partitioned into a fixed number of subsets corresponding to the convolutional kernels. To deal with a fixed number of neighbor nodes with shared kernel weights, Liu et al. [24] converted the human skeleton to a tree structure; a hierarchy of dilated tree convolutions with triangular filters is slid over the human skeleton to learn a multi-level structured semantic representation at each skeleton joint. Cai et al. [25] introduced non-uniform graph convolutional operations by learning different kernel weights for different neighborhood types.

2.2. Skeleton-based action recognition

Deep neural networks can realize automatic feature extraction to replace hand-crafted features. Considering the complexity of the entire human skeleton, a common strategy is to model the trajectories of each joint or rigid body separately. To represent the trajectory of human motion, statistical and geometric methods are often used. Hussein et al. [26] employed a covariance matrix as a discriminative descriptor
for the locations of skeleton joints over time; Vemulapalli et al. [27] employed the relative 3D geometry between different body parts, using the rotation and translation required to take one body part to the position and orientation of another. To encode the temporal motion, linear or non-linear dynamic systems are usually applied to model human actions, e.g., hidden Markov models (HMMs) and linear dynamic systems (LDSs). Ding et al. [28] introduced the profile HMM, applying symbol sequences with the Viterbi and Baum–Welch algorithms for human activity recognition. Ding et al. [29] translated 3D human skeleton sequences into tensor time series and then employed the Tucker decomposition to estimate the parameters of an LDS as action descriptors. However, the selection of hand-crafted features from irregular skeletons is time-consuming and cannot guarantee effectiveness, a problem that deep learning addresses. Tang et al. [30] employed a graph-based deep learning model to capture both the intrinsic and extrinsic dependencies between human joints. Gao et al. [31] constructed a generalized graph over consecutive frames to capture intrinsic physical connections and non-physical connectivities via spectral graph theory. Wen et al. [32] proposed a new motif-based GCN to encode the hierarchical spatial structure, together with a variable temporal dense block architecture to capture the global dependencies of the temporal domain for skeleton-based action recognition. Liu et al. [33] proposed a feature boosting network for 3D hand and whole-body pose estimation, which enables graph convolutional features to sense the long-term and short-term dependencies of different hand (body) parts.

3. Model architecture

In this section, we describe our method, which first learns the local relational features of each joint in a skeleton. After that, a collection of body parts is extracted, which leads to the graph-based global reasoning network. Based on these, we perform feature recalibration and learning. Finally, we elaborate on how to deploy and then test the model.

An STG represents a skeleton sequence. Formally, $\mathrm{STG} = \{\mathsf{G}_1, \mathsf{G}_2, \ldots, \mathsf{G}_t, \ldots, \mathsf{G}_T\}$, where $\mathsf{G}_t = \{\mathsf{V}_t, \mathsf{E}_t, \mathbf{A}_t, \mathbf{X}_t\}$ is a graph at the $t$-th frame depicting the spatial relations of the skeletal joints. Here $\mathsf{V}_t = \{v_i^t\}_{i=1}^{N}$ is the set of nodes corresponding to the skeletal joints and $\mathsf{E}_t$ is the set of edges. We use $\mathbf{A}_t \in \mathbb{R}^{N \times N}$ to denote a (weighted) adjacency matrix, where $a_{ij} = 1$ if joints $v_i^t \in \mathsf{V}_t$ and $v_j^t \in \mathsf{V}_t$ are spatially adjacent, and $a_{ij} = 0$ otherwise. Concretely, the convolution is performed over a regular nearest-neighbor graph defined by the adjacency matrix $\mathbf{A}_t$. The Laplacian matrix $\mathbf{L}$ is defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}_t$, where $\mathbf{D} \in \mathbb{R}^{N \times N}$ is the diagonal degree matrix with $d_{i,i} = \sum_{j=1}^{N} a_{i,j}$; $\mathbf{L}$ can be used to uncover many useful properties of a graph. The symmetric normalized Laplacian matrix is defined as $\mathbf{L}' = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$, where $\mathbf{I}$ is the identity matrix.

The signal matrix $\mathbf{X}_t = [\mathbf{x}_{v_1}^t, \mathbf{x}_{v_2}^t, \ldots, \mathbf{x}_{v_N}^t] \in \mathbb{R}^{N \times 3}$ is supported on the node set $\mathsf{V}_t$. For skeletal data, each joint is first represented as a vector of 3D coordinates $\mathbf{x}_{v_i} = (x_{v_i}, y_{v_i}, z_{v_i})$, and the feature of each edge in the STG is the orientation of that edge: $\mathbf{e}_{v_i v_j} = (x_{v_i} - x_{v_j},\, y_{v_i} - y_{v_j},\, z_{v_i} - z_{v_j})$. Thus, an STG is organized as a $T \times N \times 3$ tensor $\mathcal{X}_{in}$, where $T$ is the number of frames, $N$ is the number of joints in each frame, and 3 is the dimension of the $x, y, z$ coordinates.

3.1. Local relational learning via message passing

The concept of GNNs was first introduced by Scarselli et al. [34] as a generalization of recursive neural networks that effectively embeds nodes. Most methods inspired by GNNs learn neural network primitives that generate node embeddings by passing, transforming, and aggregating node feature information across the graph [35]. A message passing phase and a readout phase should be included when processing data represented in graph domains [36]. In the message passing phase, GNNs learn an embedding of node features propagated between 1-hop neighboring nodes. In general, the hidden state $\mathbf{h}_i^t$ at each node $v_i$ in the graph is updated recurrently based on the message $\mathbf{m}_{v_i}^t$. We initialize this hidden state to the input features, such that $\mathbf{h}_i^0 = \mathbf{x}_i$. At each time $t$, each node sends a message to each of its neighboring nodes. For the STG, there are $K = N \cdot T$ joints. Each joint in the STG is not only connected to its neighboring joints in the same frame but also linked with the relevant joints in the previous and subsequent frames. The local spatio-temporal relationships of joint $v_i$ are denoted by $\mathbf{h}_{v_i}^t$, initialized with $\mathbf{x}_{v_i}^t$ at the $t$-th frame. A message passes from joint $v_i$ to joint $v_j$ at time $t$ by $\mathbf{m}_{v_i v_j}^t = f_m(\mathbf{h}_{v_i}^t, \mathbf{h}_{v_j}^t, \mathbf{e}_{v_i v_j}^t)$. As each joint needs to consider all the messages of the spatial structure, we sum them with
$$\mathbf{m}_{v_i}^t = \sum_{j \in N(v_i)} \mathbf{m}_{v_i v_j}^t = \sum_{j \in N(v_i)} f_m(\mathbf{h}_{v_i}^t, \mathbf{h}_{v_j}^t, \mathbf{e}_{v_i v_j}^t) = \sum_{j \in N(v_i)} (\mathbf{h}_{v_j}^t, \mathbf{e}_{v_i v_j}^t), \qquad (1)$$
where $N(v_i)$ are all the joints that have an edge with joint $v_i$ at the $t$-th frame. The message function used is $f_m(\mathbf{h}_{v_i}^t, \mathbf{h}_{v_j}^t, \mathbf{e}_{v_i v_j}) = (\mathbf{h}_{v_j}^t, \mathbf{e}_{v_i v_j})$, where $(\cdot, \cdot)$ denotes concatenation. Excluding $\mathbf{h}_{v_i}^t$ at the $t$-th frame in this way allows the function to focus on the messages from the other nodes. After aggregating the node features across the STG, the updating function of the hidden state of a node can be defined as follows:

$$\mathbf{h}_{v_i}^t = f_h(\mathbf{h}_{v_i}^{t-1}, \mathbf{x}_{v_i}^t, \mathbf{m}_{v_i}^t). \qquad (2)$$
The vertex update function $f_h$ is a long short-term memory (LSTM) cell [37–40], which has the ability to preserve sequence information over time. The dependence on the previous hidden state $\mathbf{h}_{v_i}^{t-1}$ allows the network to capture time-dependent embeddings. The feature vector $\mathbf{x}_{v_i}^t$ at the $t$-th frame is included so that the update function $f_h$ can remember the initial input. In this case, no readout phase is computed, because we do not need a feature vector for the whole graph here. After the message passing phase, each joint $v_i$ has an embedded feature $\mathbf{h}_{v_i} \in \mathbb{R}^D$ of the local spatial–temporal relations, which is learned not only among the neighborhood of a body joint but also between the states of the same body joint across consecutive frames. The skeleton sequence is embedded as a tensor $\mathcal{Y}_{in} = [\mathbf{Y}_1, \ldots, \mathbf{Y}_t, \ldots, \mathbf{Y}_T] \in \mathbb{R}^{N \times D \times T}$ according to the original input sequence $\mathcal{X}_{in} \in \mathbb{R}^{N \times 3 \times T}$, where $\mathbf{Y}_t \in \mathbb{R}^{N \times D}$ is the hidden feature of each vertex in the STG.
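To make the message passing of Eqs. (1) and (2) concrete, the following PyTorch sketch implements one update step over all $K = N \cdot T$ nodes of the STG. It is a minimal illustration rather than the authors' released code; the dense adjacency handling, the hidden size, and all names are our own assumptions.

```python
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    """One message-passing update following Eqs. (1)-(2) (illustrative)."""

    def __init__(self, in_dim=3, hidden_dim=64):
        super().__init__()
        # f_h is an LSTM cell whose input concatenates the raw joint
        # feature x_v^t with the aggregated message m_v^t.
        self.cell = nn.LSTMCell(in_dim + hidden_dim + in_dim, hidden_dim)

    def forward(self, x, h, c, adj, edge):
        # x:    (K, 3)    joint coordinates for all K = N*T nodes
        # h, c: (K, D)    LSTM states; h is initialized from the input features
        # adj:  (K, K)    0/1 spatio-temporal adjacency of the STG
        # edge: (K, K, 3) edge orientations e_{v_i v_j}
        # Eq. (1): since the message is the concatenation (h_j, e_ij),
        # the sum over neighbors splits into the two halves.
        m_h = adj @ h                                  # sum of neighbor states
        m_e = (adj.unsqueeze(-1) * edge).sum(dim=1)    # sum of edge features
        m = torch.cat([m_h, m_e], dim=-1)
        # Eq. (2): h_v^t = f_h(h_v^{t-1}, x_v^t, m_v^t)
        h, c = self.cell(torch.cat([x, m], dim=-1), (h, c))
        return h, c
```

Stacking the per-node states over all joints and frames then yields the embedded tensor $\mathcal{Y}_{in}$ used below.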
3.2. Global relational reasoning via a GCN

In the global relational reasoning network, a fully connected graph $\mathsf{G}_I = \{\mathsf{V}_I, \mathbf{A}_I\}$ is built to simulate an interaction space for learning the complex dynamics purely from the dynamic skeleton. Clearly, each human body part is influenced by the other body parts. To send messages over the fully connected graph, the embedded skeleton sequence $\mathcal{Y}_{in}$ is projected onto $\mathsf{G}_I$ by learning a spatial mapping matrix $\mathbf{P} \in \mathbb{R}^{S \times N}$ and a temporal mapping matrix $\mathbf{Q} \in \mathbb{R}^{T' \times T}$, where $S$ denotes how many parts the human skeleton is decomposed into and $T' = \lfloor T/\tau \rfloor$; the parameter $\tau$ controls the temporal range. The mapping matrices $\mathbf{P}$ and $\mathbf{Q}$ determine which part of the human body, across several consecutive frames, is mapped to a node of $\mathsf{G}_I$, as shown in Fig. 3(e). Therefore, the new feature $\mathbf{z}_i$ of each node of $\mathsf{G}_I$ is embedded from multiple human joints across several consecutive frames. In particular, a new feature $\mathcal{Z}_{in} \in \mathbb{R}^{S \times D \times T'}$ is generated by

$$\begin{aligned}
\mathcal{Z}_{in} &= \mathrm{reshape}(\mathbf{Q} \circledast \mathrm{reshape}(\mathbf{P} \circledast \mathcal{Y}_{in})) \\
&= \mathrm{reshape}(\mathbf{Q} \circledast \mathrm{reshape}(\mathbf{P} \circledast [\mathbf{Y}_1, \ldots, \mathbf{Y}_t, \ldots, \mathbf{Y}_T])) \\
&= \mathrm{reshape}(\mathbf{Q} \circledast \mathrm{reshape}([\mathbf{P}\mathbf{Y}_1, \ldots, \mathbf{P}\mathbf{Y}_t, \ldots, \mathbf{P}\mathbf{Y}_T])) \\
&= \mathrm{reshape}(\mathbf{Q} \circledast [\mathbf{Y}'_1, \ldots, \mathbf{Y}'_t, \ldots, \mathbf{Y}'_T]) \\
&= \mathrm{reshape}([\mathbf{Q}\mathbf{Y}'_1, \ldots, \mathbf{Q}\mathbf{Y}'_t, \ldots, \mathbf{Q}\mathbf{Y}'_T]) \\
&= [\mathbf{Z}_1, \ldots, \mathbf{Z}_i, \ldots, \mathbf{Z}_{T'}],
\end{aligned} \qquad (3)$$

where $\mathbf{Z}_i = [\mathbf{z}_{1,i}, \ldots, \mathbf{z}_{s,i}, \ldots, \mathbf{z}_{S,i}]^T \in \mathbb{R}^{S \times D}$ and $\mathbf{z}_{s,i} \in \mathbb{R}^D$ is the feature of each node of the graph $\mathsf{G}_I$. The symbol $\circledast$ denotes the multiplication of a matrix $\mathbf{A} \in \mathbb{R}^{a_1 \times a_2}$ and a tensor $\mathcal{B} = [\mathbf{B}_1, \ldots, \mathbf{B}_i, \ldots, \mathbf{B}_{a_4}] \in \mathbb{R}^{a_2 \times a_3 \times a_4}$ with $\mathbf{B}_i \in \mathbb{R}^{a_2 \times a_3}$, namely $\mathbf{A} \circledast \mathcal{B} = [\mathbf{A}\mathbf{B}_1, \ldots, \mathbf{A}\mathbf{B}_i, \ldots, \mathbf{A}\mathbf{B}_{a_4}]$. The function $\mathrm{reshape}(\cdot)$ changes the shape of a tensor from $d_1 \times d_2 \times d_3$ to $d_3 \times d_2 \times d_1$.
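The $\circledast$ products and the two reshapes of Eq. (3) collapse into two einsum contractions, one over joints and one over frames. The sketch below assumes binary mapping matrices built from a fixed body-part partition, as described in the next paragraph; the partition itself is illustrative.

```python
import torch

def binary_mappings(parts, N, T, tau):
    # P (S x N): p_ij = 1 iff joint j belongs to body part i.
    # Q (T' x T): q_ij = 1 iff frame j falls in the i-th temporal range.
    P = torch.zeros(len(parts), N)
    for i, joints in enumerate(parts):
        P[i, joints] = 1.0
    Tp = T // tau
    Q = torch.zeros(Tp, T)
    for i in range(Tp):
        Q[i, i * tau:(i + 1) * tau] = 1.0
    return P, Q

def project_to_interaction_space(Y_in, P, Q):
    # Eq. (3): Z_in = reshape(Q ⊛ reshape(P ⊛ Y_in)) with Y_in of shape
    # (N, D, T); the einsums absorb the intermediate reshapes.
    Yp = torch.einsum('sn,ndt->sdt', P, Y_in)   # P ⊛ Y_in   -> (S, D, T)
    return torch.einsum('ut,sdt->sdu', Q, Yp)   # Q ⊛ (...)  -> (S, D, T')
```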
In practice, it can be difficult to learn the mapping matrices $\mathbf{P}$ and $\mathbf{Q}$ using only a stochastic gradient descent algorithm [8]. To alleviate this issue, the spatial mapping matrix $\mathbf{P}$ can simply be determined from nearby joints, i.e., $p_{ij} = 1$ if joint $v_j$ is inside the $i$-th human body part, and $p_{ij} = 0$ otherwise. Similarly, the temporal mapping matrix $\mathbf{Q}$ can simply be determined from temporally connected frames, i.e., $q_{ij} = 1$ if frame $j$ is inside the $i$-th temporal range, and $q_{ij} = 0$ otherwise. Following this line of thinking, the human skeleton can be decomposed into $S$ parts, e.g., one trunk, two legs, and two arms, as shown in Fig. 3(b). Each part at a given frame is projected onto a node of the fully connected graph $\mathsf{G}_I$, which has $L = S \cdot T'$ nodes. Note that $\mathbf{P}$ and $\mathbf{Q}$ operate on the set of node states, so the projection onto $\mathsf{G}_I$ is invariant to the ordering of the joints. Intuitively, $\mathbf{P}$ and $\mathbf{Q}$ provide a soft assignment of several nodes of the STG to a node of $\mathsf{G}_I$; the matrix multiplication passes the message from each body part across consecutive frames to a node of $\mathsf{G}_I$.

Once we obtain the feature $\mathbf{z}_i$ for each vertex of the graph $\mathsf{G}_I$, capturing global relations between arbitrary human body parts is simplified to modeling the interactions between pairs of nodes over the smaller graph $\mathsf{G}_I$, and the similarity between human body parts in the interaction space can be measured. Following the non-local mean [10], the affinity or similarity between two nodes in $\mathsf{G}_I$ can be represented as

$$f_s(\mathbf{z}_i, \mathbf{z}_j) = e^{\theta(\mathbf{z}_i)^T \phi(\mathbf{z}_j)}, \qquad (4)$$

where $\theta(\mathbf{z}_i) = \mathbf{W}_\theta \mathbf{z}_i$ and $\phi(\mathbf{z}_j) = \mathbf{W}_\phi \mathbf{z}_j$ are two embeddings that represent two different transformations of the primitive features. The parameters $\mathbf{W}_\theta \in \mathbb{R}^{D \times D}$ and $\mathbf{W}_\phi \in \mathbb{R}^{D \times D}$ allow us to learn not only the relations between different human body parts in the same frame but also the correlations between different states of the same human body part across frames. After calculating the similarity matrix with Eq. (4), normalization is applied to each row of the matrix. Motivated by recent work [10], we adopt the normalized matrix as the adjacency matrix $\mathbf{A}_I = (a_{ij})_{L \times L}$, where $a_{ij} = f_s(\mathbf{z}_i, \mathbf{z}_j) / \sum_{j=1}^{L} f_s(\mathbf{z}_i, \mathbf{z}_j)$.

We apply the GCN proposed in [41] to perform reasoning on the graph $\mathsf{G}_I$. The first step of the graph convolution performs Laplacian smoothing, propagating the node features over the graph. During training, the adjacency matrix learns edge weights that reflect the relations between the underlying globally pooled features of each node. If, for example, two nodes contain features that focus on the left hand and the head, learning a strong connection between the two would strengthen the features for a possible action of ''drink''. After information diffusion, each node has received all necessary information and its feature is updated through a linear transformation. In particular, let $\mathbf{A}_I$ denote the $L \times L$ node adjacency matrix for diffusing information across nodes, let $\mathbf{Z}_{in} = [\mathbf{z}_1, \ldots, \mathbf{z}_i, \ldots, \mathbf{z}_L] \in \mathbb{R}^{L \times D}$ be the reshape of $\mathcal{Z}_{in}$, and let $\mathbf{W}_g \in \mathbb{R}^{D \times D}$ denote the state update function. A single-layer graph convolutional network is then defined as

$$\mathbf{Z}_{out} = \mathbf{A}_I \mathbf{Z}_{in} \mathbf{W}_g. \qquad (5)$$

After the relational reasoning, we perform a reverse projection to transform the resulting features $\mathbf{Z}_{out} \in \mathbb{R}^{L \times D}$ back to the original feature space of the STG, providing complementary features for the following convolution layers to learn better task-specific representations. This reverse projection is very similar to the projection in the first step. Given the node-feature tensor $\mathcal{Z}_{out} \in \mathbb{R}^{S \times D \times T'}$, which is the reshape of $\mathbf{Z}_{out}$, we aim to map the features to $\mathcal{Y}_{out} \in \mathbb{R}^{N \times D \times T}$. Similar to the first step, we adopt a linear projection:

$$\mathcal{Y}_{out} = \mathbf{P}' \circledast \mathrm{reshape}(\mathbf{Q}' \circledast \mathcal{Z}_{out}). \qquad (6)$$

The above projection actually performs feature diffusion, forming dense connections from the interaction graph back to the STG. The feature $\mathbf{z}_j$ of node $j$ is assigned to $\mathbf{y}_i$ with weights given by $\mathbf{P}'$ and $\mathbf{Q}'$, which can be predicted by two single convolution layers. Again, one can force the weighted connections to be binary masks or simply use a shallow network to generate these connections. In practice, we find that we can reuse the projections generated in the first step, reducing the computational cost without any negative effect on the final accuracy; in other words, we set $\mathbf{P}' = \mathbf{P}^T$ and $\mathbf{Q}' = \mathbf{Q}^T$.

Fig. 2. Architecture of the proposed STG-IN. Message passing models the local detailed dynamics, and a GCN encodes global features via graph-based reasoning. The projection matrices are placed between the message passing block and the GCN; after global reasoning, the reverse projection matrices are applied to the global relation-aware features. Here ⊕ denotes element-wise sum and ⊗ denotes matrix multiplication.
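Putting Eqs. (4)–(6) together, the whole reasoning step over $\mathsf{G}_I$ can be sketched as a small PyTorch module. Note that the row-normalized exponential of Eq. (4) is exactly a softmax over each row of the $\theta$–$\phi$ similarity matrix; the module names and shape conventions are our own assumptions.

```python
import torch
import torch.nn as nn

class GlobalReasoning(nn.Module):
    """Graph-based reasoning over G_I following Eqs. (4)-(6) (a sketch)."""

    def __init__(self, D):
        super().__init__()
        self.theta = nn.Linear(D, D, bias=False)   # W_theta of Eq. (4)
        self.phi = nn.Linear(D, D, bias=False)     # W_phi of Eq. (4)
        self.Wg = nn.Linear(D, D, bias=False)      # W_g of Eq. (5)

    def forward(self, Z_in, P, Q):
        # Z_in: (S, D, T') output of Eq. (3); P: (S, N); Q: (T', T)
        S, D, Tp = Z_in.shape
        Z = Z_in.permute(0, 2, 1).reshape(S * Tp, D)      # L x D, L = S*T'
        # Eq. (4) + row normalization = softmax over similarity logits
        A_I = torch.softmax(self.theta(Z) @ self.phi(Z).t(), dim=-1)
        # Eq. (5): single-layer GCN, Z_out = A_I Z_in W_g
        Z_out = A_I @ self.Wg(Z)
        # Eq. (6) with P' = P^T and Q' = Q^T: reverse projection to the STG
        Z_out = Z_out.reshape(S, Tp, D).permute(0, 2, 1)  # (S, D, T')
        Y = torch.einsum('ut,sdu->sdt', Q, Z_out)         # Q^T along time
        return torch.einsum('sn,sdt->ndt', P, Y)          # (N, D, T)
```

The returned $\mathcal{Y}_{out}$ is then summed element-wise with $\mathcal{Y}_{in}$ (the ⊕ in Fig. 2) before the classification head of Section 3.3.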
3.3. Action recognition with a network

We aim to develop neural networks that read the STG directly and learn a classification function. After global relational reasoning with local message passing, as shown in Fig. 2, the feature of each node in the STG, $\mathbf{X}_{out} = \mathrm{reshape}(\mathcal{Y}_{in} \oplus \mathcal{Y}_{out})$, is updated with the global features at each time step and used to predict the class of the human action. Following the definition of graph convolution in [7], we adopt the approximation of the spectral convolution by Chebyshev polynomials for an efficient implementation. Thus, for receptive fields of $K$ scales, we define the multi-scale convolutional filtering as

$$\mathbf{O} = \mathrm{ReLU}\Big(\sum_{k=1}^{K} \psi_k(\mathbf{L}') \mathbf{X}_{out} \mathbf{W}_k + \mathbf{b}\Big), \qquad (7)$$

where $\mathbf{W}_k$ is a matrix of weight parameters, the $k$-th Chebyshev coefficients, $\psi_k(\mathbf{L}')$ is the Chebyshev polynomial of order $k$ evaluated on the normalized Laplacian, ReLU is an activation function, and $\mathbf{b}$ is the bias. After the graph convolutional layer, the output features $\mathbf{O}$ are convolved over time and then forwarded to an average pooling layer. A fully connected layer and a Softmax activation function [42] are adopted to generate the final classification scores.
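The sum in Eq. (7) can be evaluated with the standard Chebyshev recurrence $\psi_k = 2\mathbf{L}'\psi_{k-1} - \psi_{k-2}$, so no eigendecomposition of the Laplacian is needed. The following sketch assumes the normalized Laplacian $\mathbf{L}'$ of Section 3 is precomputed; whether the paper rescales it further is not stated, so we leave that out.

```python
import torch
import torch.nn as nn

class ChebGraphFilter(nn.Module):
    """Multi-scale filtering of Eq. (7): O = ReLU(sum_k psi_k(L') X W_k + b)."""

    def __init__(self, D, K):
        super().__init__()
        # One weight matrix W_k per scale k = 1..K (illustrative init)
        self.weights = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(D, D)) for _ in range(K)])
        self.bias = nn.Parameter(torch.zeros(D))

    def forward(self, X, L_norm):
        # X: (N, D) node features; L_norm: (N, N) normalized Laplacian
        T_prev = torch.eye(X.shape[0], device=X.device)   # psi_0 = I
        T_cur = L_norm                                    # psi_1 = L'
        out = T_cur @ X @ self.weights[0]                 # k = 1 term
        for k in range(1, len(self.weights)):             # k = 2 .. K
            T_prev, T_cur = T_cur, 2 * L_norm @ T_cur - T_prev
            out = out + T_cur @ X @ self.weights[k]
        return torch.relu(out + self.bias)
```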
4. Experiments and results

The proposed method is evaluated on three action recognition datasets: the NTU RGB+D dataset [43], the SBU Kinect Interaction dataset [44], and Kinetics [45]. To investigate the effectiveness of our model, we conducted extensive experiments with the three configurations listed below.

– Local Relation (LR), which considers only the local detailed dynamics via message passing.
– Global Relation (GR), which considers only the global spatial–temporal relations via graph-based reasoning.
– STG-IN (LR+GR), which considers both the local detailed dynamics and the global spatial–temporal relations.

4.1. Implementations

Our proposed model was implemented with the PyTorch deep learning framework [46] on four Nvidia GTX 1080 GPUs. To normalize the input data and speed up the training process [47], the batched input data (batch size 64) was passed through a batch normalization layer followed by the graph convolution layer.
Fig. 3. (a)–(d) Skeletons decomposed into {2, 5, 7, 9} parts, respectively, using the skeleton of 20 major body joints as an example. (e) Human body parts (sub-STGs) are mapped to an interaction space via the mapping matrices P and Q.

Table 1
Recognition rates (%) on the NTU RGB+D dataset with cross-subject and cross-view settings.

Methods                  Cross-subject   Cross-view
Lie group [27]           50              52.8
H-RNN [49]               59.1            64.0
Deep LSTM [43]           60.7            67.3
PA-LSTM [43]             62.9            70.3
ST-LSTM+TS [50]          69.2            77.7
Temporal Conv [51]       74.3            83.1
C-CNN+MTLN [52]          79.6            84.8
ST-GCN [12]              81.5            88.3
GE-GCN [23]              84.0            89.4
TSSI+GLAN+SSAN [53]      82.4            89.1
LR                       78.3            83.5
GR                       79.5            84.2
STG-IN                   85.8            88.7
The dropout probability in our network was set to 0.5 to avoid over-fitting. The initial learning rate was set to 0.01 and reduced by a factor of 0.1 every 50 epochs. We applied the Adam optimizer [48] to train the whole model, which can be trained in an end-to-end manner using stochastic gradient descent. Taking the above model as one basic layer, we stacked it into a multi-layer network architecture, in which the output of the previous layer is used as the input of the next layer.

In our network, the key parameters are the number of decomposed body parts S and the temporal range τ, which determine the size of the interaction space simulated by the fully connected graph G_I. The larger the number of body parts and the smaller the temporal range, the higher the cost of calculation and storage; conversely, the smaller the number of body parts and the larger the temporal range, the more easily inefficient reasoning occurs. Following heuristic knowledge and the existing literature [19,49], we decompose the skeleton into S parts so as to express the structure of the human body, using the skeleton of 20 major body joints as an example, as shown in Fig. 3.
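For concreteness, the optimizer and schedule described above can be set up as follows; `model`, `train_loader`, and `num_epochs` are placeholders for the stacked STG-IN network and data pipeline, not names from the authors' code.

```python
import torch
import torch.nn.functional as F

# Adam with initial lr 0.01, decayed by a factor of 0.1 every 50 epochs;
# dropout (p = 0.5) and batch normalization live inside the model itself.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(num_epochs):
    for x, y in train_loader:          # batched input, batch size 64
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```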
Fig. 4. Comparisons of the number of body parts and the temporal sizes on the NTU RGB+D dataset under the cross-subject setting.
4.2. Experiments on the NTU RGB+D dataset

The NTU RGB+D dataset [43] is a large-scale RGB-D dataset and is currently the most widely used dataset for skeleton-based action recognition tasks. It was captured from 40 volunteers aged from 10 to 35 and has 60 action classes, including 50 single-person actions, e.g., drinking water, and 10 mutual actions performed by two people, e.g., kicking another person. The dataset has 56,880 action sequences and 4 million frames collected with four pieces of information: RGB frames, depth maps, 3D skeleton joints, and infrared sequences. Every action was captured simultaneously by three cameras at the same height but from different horizontal angles: −45°, 0°, and 45°. Moreover, the heights of the sensors and the distances to the action performer were varied to obtain further viewpoint variations. The dataset provides the 3D spatial coordinates of 25 major body joints for each human body in an action, and every performer carried out each action twice, facing the left or right sensor, respectively. We apply a similar normalization preprocessing step to obtain position and view invariance [43]. To avoid destroying the continuity of a sequence, no temporal down-sampling was performed.

For evaluating models, the two standard evaluation protocols proposed in [43] are recommended. (1) Cross-subject (X-Sub): the persons in the training and testing sets are different; the training set contains 40,320 action sequences from 20 subjects, and the remaining 16,560 action sequences form the testing set. (2) Cross-view (X-View): the camera views used in the training and testing sets are different; the 37,920 action sequences captured by cameras 2 and 3 were used for training, whereas the other 18,960 action sequences from camera 1 were used for testing. We follow this convention and report the top-1 accuracy on both benchmarks. The performance is evaluated by computing the average recognition rate across all classes.

Table 1 shows the performance of various methods on the NTU RGB+D dataset; these methods were briefly introduced in Section 2. We find that the performance of hand-crafted feature-based approaches is generally inferior to that of deep-learning-based methods. Our proposed model outperforms the other skeleton-based methods, whether deep-learning-based or hand-crafted. This result also shows that LR and GR are complementary, as their fusion STG-IN significantly improves on both (by 5.2% over LR and 4.5% over GR) in the cross-view setting. On the other hand, we also notice that the performance of LR and GR alone is weaker than that of the original ST-GCN. GCNs are dedicated to capturing local features, while global (long-range) features are aggregated with hierarchical GCNs; the ST-GCN model, consisting of 9 layers of spatial–temporal graph convolution operators, combines local and global features, although its node features might be weakened during long-range diffusion. However, LR considers only the local detailed dynamics via message passing and, likewise, GR considers only the global spatial–temporal relations via graph-based reasoning.
Fig. 5. The confusion matrix for the cross-subject evaluation protocol on the NTU RGB+D dataset.
Fig. 6. (a) Confusion matrix for five-fold cross-validation on the SBU Kinect interaction dataset by using the STG-IN method. (b) Confusion matrix for five-fold cross-validation on the SBU Kinect interaction dataset by using the LR method. (c) Confusion matrix for five-fold cross-validation on the SBU Kinect interaction dataset by using the GR method.
Fig. 7. The confusion matrix on the Kinetics-Motion dataset by using the STG-IN method.
Therefore, compared with the original ST-GCN, LR and GR alone perform slightly worse. STG-IN (LR+GR) extracts the global spatial–temporal relational features while retaining the local detailed dynamics; hence, our model further improves the performance in action recognition.

The parameters S and τ directly affect the recognition rate. Each human body in the NTU RGB+D dataset is represented by 25 major body joints. The number of body parts S is selected from {1, 3, 5, 7, 9, 11} and the temporal kernel size τ is selected from 1 to 10. To evaluate the effect of these parameters, we conducted several tests on the NTU RGB+D dataset with different numbers of body parts and different temporal kernel sizes. The cross-comparison results are reported in Fig. 4. From this figure, we note that the recognition rate reaches its highest value when the number of body parts is 7 and the temporal size is 4. This is expected, because a smaller interaction space causes a lack of information, whereas a larger interaction space retains noise and causes inter-class confusion.

The confusion matrix for the cross-subject evaluation protocol on the NTU RGB+D dataset is shown in Fig. 5. We can see that our model distinguishes well the single-person actions from the interaction actions between two people.
Table 2
Recognition rates (%) on the SBU Kinect interaction dataset.

Methods                       Accuracy
Raw skeleton [44]             49.7
Joint feature [54]            86.9
CHARM [55]                    83.9
Hierarchical RNN [56]         80.35
Deep LSTM [40]                86.03
ST-LSTM [50]                  88.6
ST-LSTM+Trust Gate [50]       93.3
Clips+CNN+MTLN [52]           93.6
PA-GCN with inception [57]    89.8
PoT2I+Inception-v3 [58]       95.9
LR                            90.8
GR                            91.9
STG-IN                        94.2

Table 3
Comparison with the state-of-the-art methods on Kinetics.

Method              Top-1 (%)   Top-5 (%)   Type
RGB [45]            57.0        77.3        frame-based
Optical flow [45]   49.5        71.9        frame-based
Feature Enx. [59]   14.9        25.8        skeleton-based
Deep LSTM [40]      16.4        35.3        skeleton-based
TCN [51]            20.3        40.0        skeleton-based
ST-GCN [12]         30.7        52.8        skeleton-based
2S-ASGCN [21]       34.5        56.9        skeleton-based
LR                  32.2        52.5        skeleton-based
GR                  33.5        55.2        skeleton-based
STG-IN              36.4        59.8        skeleton-based
Although the objects involved in the actions are different, single-person actions are still confused with one another owing to similar motion patterns, for example drink water and brushing teeth. Other single-subject actions, such as rub two hands together and clapping, are also easily confused with each other owing to the similarity of their motion patterns.

Comparing with ST-GCN [12], the complexity of STG-IN is measured in terms of accuracy (%), training speed (minutes per epoch), and testing speed (sequences per second). On four Nvidia GTX 1080 GPUs, our method takes about 1.3 min per epoch during training, whereas ST-GCN takes about 12 min per epoch. Each action sequence is padded by repeating the data from the start up to a total of T = 300 frames. The testing speed of our method is about 148.6 sequences per second versus about 53.7 sequences per second for ST-GCN, i.e., approximately three times faster.
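The padding used for the timing comparison, repeating each sequence from its start until T = 300 frames, can be written in a few lines; the (channels, frames, joints) layout below is an assumption.

```python
import torch

def pad_by_repeating(seq, T=300):
    # seq: (C, t, N) tensor with t frames; tile the whole sequence and
    # truncate so that the result has exactly T frames.
    reps = -(-T // seq.shape[1])              # ceil(T / t)
    return seq.repeat(1, reps, 1)[:, :T, :]
```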
Table 4
Mean class accuracies (%) on the Kinetics-Motion dataset.

Method      RGB [45]   Optical flow [45]   ST-GCN [12]   STG-IN
Accuracy    70.4       72.8                72.4          73.27
4.3. Experiments on the SBU Kinect interaction dataset

The SBU Kinect interaction dataset [44] contains eight action classes of two-subject interactions: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. There are 282 skeleton sequences in total, corresponding to 6822 frames. For each action class, 7 participants performed activities containing 21 sets of interactions performed by different pairs of people. The standard experimental protocol of five-fold cross-validation with the provided splits was followed. The two skeletons were processed as two data samples during training, and prediction scores from the five crops at the center and the four corners were averaged for the testing prediction. The SBU dataset is very challenging owing to the relatively low accuracy of the joint 3D coordinates provided by Kinect, the single-view setting captured by only one Kinect, and the similarity in motion of some actions; for example, exchanging objects and shaking hands both involve the two subjects extending their arms. Each human body in the SBU Kinect interaction dataset is represented by 15 major body joints. The number of body parts S is selected from {1, 2, 3, 4, 5, 6, 7} and the temporal kernel size τ is selected from 1 to 10. In these experiments, the recognition rate reached its highest value when the number of body parts was 5 and the temporal size was 4.

The comparison of the proposed network with the state-of-the-art methods is shown in Table 2. The proposed STG-IN maintains high recognition performance on the SBU Kinect interaction dataset. Fig. 6 shows the comparison of the confusion matrices for LR, GR, and STG-IN on the SBU Kinect interaction dataset. From these figures, it is clear that all three approaches work well if two people have only simple interactions, such as departing and approaching. However, if the two people perform more complex interactions, such as kicking, punching, and hugging, our graph method STG-IN works better than the LR and GR methods, which only consider the local and global dependencies of the human skeleton, respectively.

4.4. Experiments on the Kinetics-Motion dataset

The Kinetics dataset [45] is a large-scale RGB action recognition dataset containing 300,000 video clips covering 400 types of actions, ranging from daily activities and sports scenes to complex actions with interactions. The videos are sourced from YouTube and each clip is around 10 s long. As only raw video clips are provided, we obtain skeleton data by estimating joint locations with the OpenPose toolbox [60], which generates 2D pixel coordinates (x, y) and a confidence score for 18 joints on every frame of the clips. We evaluated our model on the released data (the Kinetics-Skeleton dataset), which contains 266,440 samples divided into a training set (246,534 clips) and a validation set (19,906 clips). Following the evaluation method in [45], we trained the models on the training set and report both top-1 and top-5 accuracies on the validation set.

In Table 3, we can see that our network achieves superior performance over the previous skeleton-based methods but inferior performance to the frame-based models [45]. This is because action recognition on the Kinetics dataset depends not only on the motion of the skeleton sequences but also on the objects and scenes that the actors are interacting with. To better evaluate skeleton-based methods on estimated joints, [12] proposed the Kinetics-Motion dataset, a 30-class subset of Kinetics with action labels strongly related to body motion. The selected 30 classes are belly dancing, punching bag, capoeira, squat, windsurfing, skipping rope, swimming backstroke, hammer throw, throwing discus, tobogganing, hopscotch, hitting baseball, roller skating, arm wrestling, snatch weight lifting, tai chi, riding mechanical bull, salsa dancing, hurling (sport), lunge, skateboarding, country line dancing, juggling balls, surfing crowd, dead lifting, clean and jerk, crawling baby, push up, front raises, and pullups. In these experiments, the recognition rate reached its highest value when the number of body parts was 5 and the temporal size was 4.

In Table 4, we list the mean class accuracies of the RGB, optical flow, and skeleton-based models on the Kinetics-Motion dataset. We can see that the performance gap is not large on this subset of Kinetics. The confusion matrix on the Kinetics-Motion dataset is illustrated in Fig. 7, where the gaps among the recognition rates of these actions are much smaller. Classification errors occur when two actions are highly similar to each other, such as hammer throw and throwing discus.

Across the above three datasets, the value of the number of body parts S at which the recognition rate reaches its highest value differs, as shown in Table 5. In the phase of global reasoning, each body part with ceil(N/S) joints across τ consecutive frames is mapped to the interaction space. Our model STG-IN achieves its best performance with 4 × 4 (16 joints) kernels, much as the 3D ConvNet [61] with 3 × 3 × 3 (27 pixels) kernels performs best in its experiments.
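The metrics reported above, top-1/top-5 accuracy for Kinetics and the mean class accuracy of Table 4, can be computed as follows (a sketch; `scores` are the classification scores produced by the Softmax head of Section 3.3).

```python
import torch

def topk_accuracy(scores, labels, k=1):
    # scores: (B, C) class scores; labels: (B,) ground-truth indices.
    topk = scores.topk(k, dim=1).indices
    return (topk == labels.unsqueeze(1)).any(dim=1).float().mean().item()

def mean_class_accuracy(scores, labels, num_classes):
    # Average of the per-class recognition rates (as in Table 4).
    pred = scores.argmax(dim=1)
    per_class = [(pred[labels == c] == c).float().mean()
                 for c in range(num_classes) if (labels == c).any()]
    return torch.stack(per_class).mean().item()
```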
Table 5
Comparisons of the number of body parts and the temporal sizes on the different datasets.

Dataset                     NTU RGB+D [43]   SBU Kinect Interaction [44]   Kinetics-Motion [12]
Body joints N               25               20                            18
Number of body parts S      7                5                             5
Temporal size τ             4                4                             4
5. Conclusions and future work

In this article, a novel model called STG-IN has been proposed, which performs long-range temporal modeling of human body parts for skeleton-based action recognition. To extract the dependencies between different human body parts, we transformed the skeleton sequence into an STG and then learned the localized correlated features by message passing. In this model, human body parts were mapped to an interaction space where graph-based reasoning can be efficiently implemented via a GCN. After reasoning, global relation-aware features were distributed back to the embedded nodes of the STG. To evaluate our model, we conducted extensive experiments on the large-scale NTU RGB+D dataset and the Kinetics-Motion dataset. The experimental results demonstrated the effectiveness of the proposed model, which achieves state-of-the-art performance. In the future, to adapt the GCN more effectively, we will focus on different ways of encoding skeleton data.

CRediT authorship contribution statement

Wenwen Ding: Conceptualization, Methodology, Software, Writing - original draft. Xiao Li: Software, Formal analysis, Validation. Guang Li: Data curation, Formal analysis. Yuesong Wei: Validation, Writing - review & editing.

Acknowledgments

This work was supported in part by the China Postdoctoral Science Foundation funded project under Grant 2018M631125, the National Natural Science Foundation of China (Grant No. 61806155), the Natural Science Foundation of Anhui Province (Grant No. 1908085MF186), and the Natural Science Foundation of the Anhui Higher Education Institutions of China (Grant No. KJ2018A0384).

References

[1] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[2] J. Bruna, S. Mallat, Invariant scattering convolution networks, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1872–1886.
[3] M. Tygert, J. Bruna, S. Chintala, Y. LeCun, S. Piantino, A. Szlam, A mathematical motivation for complex-valued convolutional networks, Neural Comput. 28 (5) (2016) 815–825.
[4] M.M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, P. Vandergheynst, Geometric deep learning: going beyond Euclidean data, IEEE Signal Process. Mag. 34 (4) (2017) 18–42.
[5] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, M.M. Bronstein, Geometric deep learning on graphs and manifolds using mixture model CNNs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5115–5124.
[6] S. Qi, W. Wang, B. Jia, J. Shen, S.-C. Zhu, Learning human-object interactions by graph parsing neural networks, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 401–417.
[7] M. Defferrard, X. Bresson, P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, in: Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[8] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, J. Leskovec, Hierarchical graph representation learning with differentiable pooling, in: Advances in Neural Information Processing Systems, 2018, pp. 4800–4810.
[9] Q. Li, Z. Han, X.-M. Wu, Deeper insights into graph convolutional networks for semi-supervised learning, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[10] X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[11] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[12] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, 2018, arXiv preprint arXiv:1801.07455.
[13] P. Battaglia, R. Pascanu, M. Lai, D.J. Rezende, et al., Interaction networks for learning about objects, relations and physics, in: Advances in Neural Information Processing Systems, 2016, pp. 4502–4510.
[14] P.W. Battaglia, J.B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al., Relational inductive biases, deep learning, and graph networks, 2018, arXiv preprint arXiv:1806.01261.
[15] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, M. Sun, Graph neural networks: A review of methods and applications, 2018, arXiv preprint arXiv:1812.08434.
[16] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, P.S. Yu, A comprehensive survey on graph neural networks, 2019, arXiv preprint arXiv:1901.00596.
[17] H. Dai, Z. Kozareva, B. Dai, A. Smola, L. Song, Learning steady-states of iterative algorithms over graphs, in: International Conference on Machine Learning, 2018, pp. 1114–1122.
[18] R. Li, M. Tapaswi, R. Liao, J. Jia, R. Urtasun, S. Fidler, Situation recognition with graph neural networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4173–4182.
[19] C. Si, Y. Jing, W. Wang, L. Wang, T. Tan, Skeleton-based action recognition with spatial reasoning and temporal stack learning, in: Proceedings of the European Conference on Computer Vision, ECCV, 2018, pp. 103–118.
[20] D.I. Shuman, S.K. Narang, P. Frossard, A. Ortega, P. Vandergheynst, The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Process. Mag. 30 (3) (2013) 83–98.
[21] L. Shi, Y. Zhang, J. Cheng, H. Lu, Adaptive spectral graph convolutional networks for skeleton-based action recognition, 2018, arXiv preprint arXiv:1805.07694.
[22] C. Li, Z. Cui, W. Zheng, C. Xu, J. Yang, Spatio-temporal graph convolution for skeleton based action recognition, 2018, arXiv preprint arXiv:1802.09834.
[23] X. Zhang, C. Xu, D. Tao, Graph edge convolutional neural networks for skeleton based action recognition, 2018, arXiv preprint arXiv:1805.06184.
[24] J. Liu, A. Shahroudy, G. Wang, L.-Y. Duan, A.K. Chichung, Skeleton-based online action prediction using scale selection network, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
[25] Y. Cai, L. Ge, J. Liu, J. Cai, T.-J. Cham, J. Yuan, N.M. Thalmann, Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2272–2281.
[26] M.E. Hussein, M. Torki, M.A. Gowayyed, M. El-Saban, Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations, in: IJCAI, Vol. 13, 2013, pp. 2466–2472.
[27] R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3d skeletons as points in a lie group, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595.
[28] W. Ding, K. Liu, X. Fu, F. Cheng, Profile HMMs for skeleton-based human action recognition, Signal Process., Image Commun. 42 (2016) 109–119.
[29] W. Ding, K. Liu, E. Belyaev, F. Cheng, Tensor-based linear dynamical systems for action recognition from 3D skeletons, Pattern Recognit. 77 (2018) 75–86.
[30] Y. Tang, Y. Tian, J. Lu, P. Li, J. Zhou, Deep progressive reinforcement learning for skeleton-based action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5323–5332.
[31] X. Gao, W. Hu, J. Tang, P. Pan, J. Liu, Z. Guo, Generalized graph convolutional networks for skeleton-based action recognition, 2018, arXiv preprint arXiv:1811.12013.
[32] Y.-H. Wen, L. Gao, H. Fu, F.-L. Zhang, S. Xia, Graph CNNs with motif and variable temporal block for skeleton-based action recognition, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 8989–8996.
[33] J. Liu, H. Ding, A. Shahroudy, L.-Y. Duan, X. Jiang, G. Wang, A.K. Chichung, Feature boosting network for 3D pose estimation, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
[34] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Trans. Neural Netw. 20 (1) (2009) 61–80.
[35] W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, in: Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.
[36] J. Gilmer, S.S. Schoenholz, P.F. Riley, O. Vinyals, G.E. Dahl, Neural message passing for quantum chemistry, in: Proceedings of the 34th International Conference on Machine Learning, Vol. 70, JMLR.org, 2017, pp. 1263–1272.
[37] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in: The IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.
[38] F.A. Gers, J. Schmidhuber, F. Cummins, Learning to Forget: Continual Prediction with LSTM, IET, 1999.
[39] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional LSTM network: A machine learning approach for precipitation nowcasting, in: Advances in Neural Information Processing Systems, 2015, pp. 802–810.
[40] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks, in: AAAI, Vol. 2, No. 5, 2016, p. 6.
[41] T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, 2016, arXiv preprint arXiv:1609.02907.
[42] K. Duan, S.S. Keerthi, W. Chu, S.K. Shevade, A.N. Poo, Multi-category classification by soft-max combination of binary classifiers, in: International Workshop on Multiple Classifier Systems, Springer, 2003, pp. 125–134.
[43] A. Shahroudy, J. Liu, T.-T. Ng, G. Wang, NTU RGB+D: A large scale dataset for 3D human activity analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1010–1019.
[44] K. Yun, J. Honorio, D. Chattopadhyay, T.L. Berg, D. Samaras, Two-person interaction detection using body-pose features and multiple instance learning, in: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, 2012, pp. 28–35.
[45] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al., The Kinetics human action video dataset, 2017, arXiv preprint arXiv:1705.06950.
[46] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in PyTorch, in: NIPS-W, 2017.
[47] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015, arXiv preprint arXiv:1502.03167.
[48] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2014, arXiv preprint arXiv:1412.6980.
[49] Y. Du, W. Wang, L. Wang, Hierarchical recurrent neural network for skeleton based action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1110–1118.
[50] J. Liu, A. Shahroudy, D. Xu, G. Wang, Spatio-temporal LSTM with trust gates for 3D human action recognition, in: European Conference on Computer Vision, Springer, 2016, pp. 816–833.
[51] T.S. Kim, A. Reiter, Interpretable 3D human action analysis with temporal convolutional networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, 2017, pp. 1623–1631.
[52] Q. Ke, M. Bennamoun, S. An, F. Sohel, F. Boussaid, A new representation of skeleton sequences for 3D action recognition, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 4570–4579.
[53] Z. Yang, Y. Li, J. Yang, J. Luo, Action recognition with spatio-temporal visual attention on skeleton image sequences, IEEE Trans. Circuits Syst. 48 (3) (2018) 2405–2415.
[54] Y. Ji, G. Ye, H. Cheng, Interactive body part contrast mining for human interaction recognition, in: 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), IEEE, 2014, pp. 1–6.
[55] W. Li, L. Wen, M. Choo Chuah, S. Lyu, Category-blind human action recognition: A practical recognition system, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4444–4452.
[56] J.-F. Hu, W.-S. Zheng, J. Lai, J. Zhang, Jointly learning heterogeneous features for RGB-D activity recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5344–5352.
[57] Y. Qin, L. Mo, C. Li, J. Luo, Skeleton-based action recognition by part-aware graph convolutional networks, Vis. Comput. (2019) 1–11.
[58] T. Huynh-The, C.-H. Hua, T.-T. Ngo, D.-S. Kim, Image representation of pose-transition feature for 3D skeleton-based action recognition, Inform. Sci. (2019).
[59] B. Fernando, E. Gavves, J.M. Oramas, A. Ghodrati, T. Tuytelaars, Modeling video evolution for action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5378–5387.
[60] Z. Cao, T. Simon, S.-E. Wei, Y. Sheikh, Realtime multi-person 2D pose estimation using part affinity fields, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 1302–1310.
[61] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.