HDS-SP: A novel descriptor for skeleton-based human action recognition


Jingjing Liu, Zhiyong Wang, Honghai Liu*

School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

3D skeleton-based human action recognition has attracted much attention owing to a wide spectrum of promising applications enabled by depth sensors. Based on the principle that a better viewpoint yields better action recognition performance, this paper devises the HDS-SP descriptor for skeleton-based human action recognition, i.e., the histogram of distributed sectors based on specific projections. The HDS-SP descriptor encodes both spatial and temporal information from specific viewpoints; projecting 3D trajectories onto specific planes and building histograms with the proposed voting scheme are the two primary contributions. Inspired by the nature of human action, the spatial information is captured by a histogram in which the displacement of each joint between two successive frames votes for the corresponding bins following weight-based rules on a projection plane, and the specific projection planes are optimized by both a local search algorithm and Particle Swarm Optimization (PSO); the temporal information is captured by a temporal hierarchical construction. The proposed method is evaluated on five widely studied datasets for skeleton-based human action recognition, with results significantly outperforming the state of the art.

Keywords: Human action recognition, skeleton joints, histogram, PSO

*Corresponding author. Email address: [email protected] (Honghai Liu)


1. Introduction

Detection and recognition of human actions is a challenging problem in many domains, including video analysis [1], robotics [2], surveillance [3], human-machine interaction [4] and assistive living [5]. Most early studies recognize human actions using 2D visual data. However, these traditional approaches suffer from numerous problems, including sensitivity to illumination changes and to variations in subject texture. Thanks to affordable depth sensors such as the Microsoft Kinect and ASUS Xtion, additional depth information can now be captured. The primary advantage of depth data is that it provides 3D information, so most methods operating on depth data can achieve view invariance, scale invariance, or both [6]. Generally speaking, existing 3D-based methods can be divided into two groups: depth map-based representations and skeleton-based representations. Depth map-based methods [7][8] rely directly on depth maps or on features extracted from the space-time volume, while skeleton-based methods take skeletons, or equivalently joints, as input. In particular, an increasing number of works [9][10][11][12] focus on classifying actions from sequences of 3D locations of human body joints. For example, human representations were realized by a feature named the histogram of 3D joint locations (HOJ3D) in [13]. The 3D trajectory of each joint was projected onto the XY, XZ and YZ planes to create a 2D descriptor named the histogram of oriented displacements (HOD) [14]. A highly invariant motion representation based on relative geometry, with the aid of the special orthogonal group SO(3), was proposed in [15] to find a bijection between relative geometry and skeleton motion. The representation in [16] focused on rotation and relative velocity rather than joint positions. Ding et al. [17] proposed a tensor-based linear dynamical system (tLDS) for action recognition, treating skeleton sequences as high-order tensor time series. Beyond these hand-crafted features, deep learning networks [18][19][20] have also achieved significant success in this area. Such methods automatically learn spatial-temporal feature representations but demand sufficient data. Compared with deep learning-based methods, hand-crafted features can work on relatively small datasets, such as a motion database of autistic children, for which collecting enough data in a short time is laborious. As a result, hand-crafted features are still essential for action recognition.

Inspired by, yet considerably more effective than, [14], the primary idea of this paper is as follows: the better the viewpoint, the better the action recognition performance. For instance, some actions are easily confused with others, such as Tennis Serve and Forward Punch. It is difficult to distinguish these two actions from the front, but the case may be different from a lateral view. Normally, the 3D data of a human action are transformed into 2D data by projecting them onto the XY, YZ and XZ planes. However, the nature of human action hints that recognition performance may differ across viewpoints. The goal of our study is therefore to seek out the most suitable viewpoint, from which the descriptor can comprise as much useful information as possible. This is realized through both local search and the Particle Swarm Optimization (PSO) method. After projection onto a 2D plane from a specific viewpoint, a histogram is created by the vote of each displacement according to its length and angle. For temporal modeling, a temporal hierarchical construction method is utilized, inspired by spatial pyramid matching [21]. Taking the histogram as the HDS feature, the three most appropriate planes are found and the corresponding HDS features are concatenated as the final HDS-SP descriptor to boost the final result. Our HDS-SP descriptor is applied to five datasets. The results suggest that our method outperforms the state of the art, using an SVM for classification.

This paper is organized as follows. Section 2 reviews related work on 3D skeleton data acquisition and representation descriptors. Section 3 discusses the modelling of the descriptor in detail. Section 4 reports the experimental results and discussion on five datasets, compared with other methods. Section 5 concludes the work with remarks.

2. Related Work

In recent years we have witnessed growing interest in techniques for 3D skeleton-based human action classification. Various 3D skeleton-based human representations have been proposed to extract compact, discriminative information for action recognition. The main challenges of this theme fall into two parts: 3D skeleton data acquisition and representation descriptors.

2.1. Data Acquisition

Several kinds of devices are capable of providing 3D skeleton data, including motion capture (MoCap) systems, structured-light cameras and time-of-flight (TOF) sensors. MoCap systems obtain 3D skeleton data by identifying and tracking markers attached to the human joints. Although the captured data are of high quality and accuracy, such systems are not widely used because of their high price. Compared with binocular distance-measurement systems, structured-light cameras, which use infrared light to capture depth information, can reach a higher depth image resolution. Common structured-light cameras include the Microsoft Kinect v1, ASUS Xtion, PrimeSense and RealSense. Since markers are not needed, structured-light systems are inexpensive and widely used. A TOF sensor emits modulated near-infrared light and measures the time it takes for that light to return, or the phase difference between the emitted and received signals. The Microsoft Kinect v2 provides an affordable TOF alternative and delivers depth images with a resolution of 512 × 424 at 30 Hz. In spite of these available devices, existing acquisition hardware still needs to be improved because of limitations on the environment and the number of users.

2.2. Representation Descriptors

The study of skeleton-based action recognition dates back to the pioneering work by Johansson [22], which demonstrated that the geometric structure of human body motion patterns can be determined from the construction of skeletons. This concept has inspired many works since then. Because of the natural correspondence of skeletons across time, the majority of skeleton-based methods explicitly model temporal dynamics. Skeleton-based approaches are mainly divided into two kinds: sequential approaches and space-time volume approaches.

2.2.1. Sequential Approaches

In early work, Lv et al. decomposed the high-dimensional 3D joint space into a set of feature spaces where each feature corresponds to the motion of a single joint or a combination of related joints [23]. A HMM was built for each feature and action class to model the temporal dynamics. Xia et al. [13] proposed a feature called the Histogram of 3D Joint Locations (HOJ3D) that essentially encodes spatial occupancy information relative to the skeleton root, i.e., the hip center. They defined a modified spherical coordinate system at the hip center and partitioned the 3D space into n bins. However, the heavy reliance on the hip joint might jeopardize recognition accuracy because of the noise embedded in the estimated hip joint location. A more general approach was proposed by Yao et al. [24], where skeleton motion was encoded by relational pose features; these features describe geometric relations between specific joints in a single pose or a short sequence of poses. A local descriptor called the trajectorylet, which captures static and kinematic information within a short temporal interval, was proposed in [25]; each trajectorylet was encoded with a discriminative trajectorylet detector set selected from a large number of candidate detectors trained through exemplar-SVMs. Chen et al. [26] extracted simple but effective frame-level features from skeletal data and built a recognition system based on the extreme learning machine. In most cases, all skeletal joints are used for feature representation; some studies [27] focus on automatically selecting the most informative skeletal joints to obtain a more discriminative result. Sequential approaches usually require a larger set of training data. However, they have the potential to capture complex and general activities because of their explicit modeling of motion dynamics.

2.2.2. Space-Time Volume Approaches

The space-time volume approaches usually extract global features from the joints, sometimes combined with point-cloud data. Shao et al. proposed a descriptor called spatial-temporal Laplacian pyramid coding (STLPC) for holistic representation of human actions [28]; by decomposing each sequence into a set of band-pass-filtered components, the proposed pyramid model localizes features residing at different scales. Diogo et al. [29] extracted sets of spatial and temporal local features from subgroups of joints, which were aggregated by a robust method based on the VLAD algorithm and a pool of clusters; several feature vectors were then combined by a metric learning method, followed by a non-parametric k-NN classifier. Wang et al. [30] proposed an effective method to extract mid-level features from Kinect skeletons for 3D human action recognition: the orientations of limbs connected by two skeleton joints were computed, each orientation was encoded into one of 27 states, limbs were combined into parts, and the limb states were mapped into part states; frequent pattern mining was employed to mine the most frequent and relevant part states over several consecutive frames. Weng et al. [31] proposed a non-parametric method named Spatio-Temporal Naive-Bayes Nearest-Neighbor, which takes the spatial-temporal structure of 3D actions into consideration and relaxes the Naive Bayes assumption of NBNN. Research on space-time volume approaches is relatively new but keeps attracting increasing attention because of their good performance.

Besides traditional algorithms, many deep learning-based methods have emerged, inspired by the great success of deep models in computer vision. Convolutional neural network (CNN)-based methods focus on capturing spatial-temporal information from skeleton sequences. Du et al. [32] treated the three coordinates (x, y, z) of each joint as the three channels (R, G, B) of spatial-temporal images. A deep learning framework named SkeletonNet [33] extracts body part-based features from each frame of a skeleton sequence; instead of treating the features of all frames as a time series, the features are transformed into images to generate high-level and discriminative representations. Compared with CNNs, recurrent neural networks (RNNs) and their extension, Long Short-Term Memory (LSTM), have greater advantages in capturing temporal information since they can store previous inputs and states. An end-to-end model [34] was proposed that learns to selectively focus on discriminative joints of the skeleton within each input frame and pays different levels of attention to the outputs of different frames. Wang et al. [35] proposed a two-stream RNN that combines body parts in a temporal stream with traversed skeleton sequences in a spatial stream. It is undeniable that deep learning-based methods achieve strong performance on large-scale datasets. However, their long training times, high dependence on data and poor interpretability cannot be ignored. For small and medium-sized datasets, traditional algorithms can obtain results no worse than deep learning ones while saving much training time. Furthermore, there is a growing trend of applying traditional algorithms within deep learning models to improve performance and interpretability.

3. Proposed Method

The proposed method operates on the 3D trajectory of each joint separately. The flowchart of the algorithm is shown in Fig. 1.

Figure 1: Overview of the proposed architecture. The 3D trajectory of each individual skeletal joint is projected onto a plane, resulting in a 2D trajectory. A histogram-of-distributed-sectors feature is built from each joint of the 2D trajectory. A temporal hierarchical construction is then used to capture temporal information. The most appropriate projection plane is found by both the local search algorithm and PSO. To better represent human actions and boost the recognition result, the 2D features of all skeletal joints for the three most appropriate planes are concatenated as the final HDS-SP descriptor. A linear SVM is used for classification.

3.1. 2D Representation

3.1.1. Projection of the 3D trajectory

A plane is determined by its normal vector and a point through which it passes. Assume the normal vector of plane $P$ is $\vec{n} = (a, b, c)$ and the point on the plane is the origin $O(x_0, y_0, z_0)$ with $x_0 = y_0 = z_0 = 0$. Given an arbitrary point $V(x, y, z)$, its projection onto the plane $P$ is $V'(x', y', z')$. The geometrical relationships are

$$\vec{VV'} \parallel \vec{n}, \qquad \vec{OV'} \perp \vec{n},$$

and the corresponding formulas are

$$\frac{x' - x}{a} = \frac{y' - y}{b} = \frac{z' - z}{c} = d, \tag{1}$$

$$a(x' - x_0) + b(y' - y_0) + c(z' - z_0) = 0. \tag{2}$$

Based on Eq. (2), $d$ can be calculated, and the values of $x'$, $y'$, $z'$ are then computed by back substitution. Given the trajectory of one skeleton joint, $T = \{V_1, V_2, \dots, V_n\}$, where $V_i$ is the 3D position of the joint at frame $i$, the location at each frame goes through the above projection.
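As a concrete illustration, the following minimal NumPy sketch projects a joint trajectory onto a plane through the origin following Eqs. (1)-(2); the function name, array shapes and the example normal vector are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def project_onto_plane(points, normal):
    """Project 3D joint positions onto the plane through the origin with
    the given normal vector, following Eqs. (1)-(2)."""
    n = np.asarray(normal, dtype=float)
    pts = np.asarray(points, dtype=float)      # shape (frames, 3)
    # Substituting Eq. (1) into Eq. (2) with x0 = y0 = z0 = 0 yields d:
    d = -(pts @ n) / (n @ n)
    return pts + np.outer(d, n)                # V' = V + d * n

# Toy trajectory of one joint over four frames, projected onto the XY plane.
traj = np.array([[0.1, 0.2, 0.9], [0.2, 0.3, 1.0], [0.3, 0.2, 1.1], [0.4, 0.1, 1.2]])
traj_2d = project_onto_plane(traj, (0.0, 0.0, 1.0))
```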

3.1.2. HDS feature

To compute a histogram from the 2D trajectory obtained by projecting onto a specific plane $P$, we define two coordinate axes $\vec{l}_1$ and $\vec{l}_2$ on the plane; $\vec{l}_1$, $\vec{l}_2$ and $\vec{n}$ must be mutually perpendicular.

Figure 2: A circular area corresponding to the displacements. The number in each block is the index of the block. The figure shows a general displacement $\vec{p}_i$ voting for the histogram; in this example, the corresponding bin of $\vec{p}_i$ is 9 and the adjacent bins are 1, 10, 16 and 17.

Given a skeletal joint, let $P_i$ be its 3D position at frame $i$ after projection. The displacement $\vec{p}_i$ between $P_i(x_i, y_i, z_i)$ and $P_{i+1}(x_{i+1}, y_{i+1}, z_{i+1})$ is computed by

$$\vec{p}_i = [x_{i+1} - x_i,\; y_{i+1} - y_i,\; z_{i+1} - z_i]. \tag{3}$$

A circular area, shown in Fig. 2, is designed as follows: the largest length among the displacements $\vec{p}_i$ $(1 \le i < n)$ is taken as the radius $r$ of the circle. The whole circular zone is divided into $k$ sectors of the same size, and each sector is further segmented at the trisection points of the radius. Since there are $3 \times k$ blocks, the histogram naturally has $3 \times k$ bins. The histogram accumulates the lengths of the successive moves. For each displacement $\vec{p}_i$, two indices are computed: $\theta$, the angle between $\vec{p}_i$ and $\vec{l}_2$ given by Eq. (4), and $|\vec{p}_i|$, the length of the displacement.

$$\theta = \cos^{-1}\left(\frac{\vec{p}_i \cdot \vec{l}_2}{|\vec{p}_i|\,|\vec{l}_2|}\right), \quad \theta \in [0^\circ, 360^\circ] \tag{4}$$

Figure 3: Voting rules. For the displacement p ~ shown in this figure, |~ p| will be added to the histogram bin v0 and wx · |~ p| will be added to the histogram bin x. (x = v1 , v2 , v3 , v4 )

value of |~ p| will be added to the corresponding bin v0 which is determined by θ and |~ p| while the value of |~ p| will be proportionally added to the adjacent bin 200

according to the distance between the end point of the displacement p~ and the center of the adjacent block. The position of the center of each block is indicated using an ordered pair Ci (ϕi , gi ), (1 ≤ i ≤ 3 × k) as if in the polar coordinates. gi is the polar radius of the point Ci and ϕi is the polar angle. The distance d between the displacement p~(θ, |~ p|) and the relevant block is

205

calculated using the Cosine formula as follows: 2

dx 2 = |~ p| + gx 2 − 2 · |~ p| · gx · cos(θ − ϕx )

(5)

where x can be taken as b, c, e, f , which are the sequence numbers of the adjacent block v1 , v2 , v3 , v4 respectively. As shown in Fig.3, there are one corresponding block v0 and four adjacent blocks v1 , v2 , v3 , v4 for displacement p~. The weight coefficients of the vote on

10

210

the relevant bins are shown in Eq. (6). The weight coefficient wx is positively associated with the reciprocal of dx , where x can be taken as a, b, c, e, f , which are the sequence numbers of the adjacent block v0 , v1 , v2 , v3 , v4 respectively. The farther the displacement is away from the block, the smaller the weight coefficient is. wx =

215

  

1, x = a 1 db

, x = b, c, e, f

1 dx

(6)

+ d1c + d1e + d1

f

The HDS feature, for a video sequence is the concatenation of histograms of all available skeleton joints. For two action sequences of different lengths with 20 joints, the dimension of corresponding HDS features are the same: 20 × 3 × k. Thus the HDS feature is invariant to the length of sequence of action obviously. It has a clear meaning of the displacements and orientations during the process

220

of the movement. However, this way to describe the motion will miss the order of displacements. If an action is divided into two parts and the second part is executed at first, then the HDS feature would be the same as the original one. To solve this problem, temporal hierarchical construction method is applied to provide temporal information as illustrated in subsection 3.

225

3.1.3. temporal hierarchical construction To compensate for the temporal information, hierarchical construction inspired by the idea of spatial pyramid matching [21] is built as shown in Fig.4. For the top level, HDS feature is computed over the entire trajectory. For the lower levels, HDS features are computed over overlapping subsections of the whole tra-

230

jectory. Fig. 4 shows three levels in the hierarchy. Each hierarchy is corresponding to its ordinal number l, being divided into s subsections where s = 2l−1 . Cij ,  the subsection i of hierarchy j begins from t = (i − 1) · T (2j−1 + 1) and ends  at t = (i + 1) · T (2j−1 + 1). By concatenating the HDS feature over different

subsections of the whole trajectory, temporal variation is caught. For example, 235

if most movement happened in the latter part of the video, HDS features computed over C12 and C22 subsection would make a great difference. Apparently, the number of plies of the temporal hierarchy is closely related to the ability of 11

Figure 4: 3-level temporal hierarchical construction. T is time length of the video sequence.

classification. Within a certain range, we believe that adding temporal information will help with the recognition of different actions. If the number of temporal 240

hierarchical plies is h, the descriptor is made of (2h − 1) parts. Thus, the dimen-

sion of the descriptor representing an action with 20 joints is: 20 × 3k × (2h − 1).

3.2. Optimization of the projection plane Our aim is to find the 2D plane onto which the projection can generate the 245

most distinguishing feature for action types classification. The process of seeking for the most appropriate 2D plane can be considered as a problem of finding the optimal solution with some constraints. Barring unforeseen problems, this can be solved by modern optimization methods since the problem may have many local optimal solutions. It is assumed that all the 2D planes pass through the

250

origin, so a 2D plane can be identified with its normal vector. The normal vector ~n(a, b, c) can be expressed as: ~n(cosi, cosj, cosk). This optimization problem P = (n, f ) can be defined by: the variable n = {i, j, k} ; an objective function f to be maximized, where f is the classification accuracy within the projection onto the selected plane; Constraint conditions are 0 ≤ i, j, k ≤ 180.

255

To solve this problem, we use both local search and Particle Swarm Opti-

12

mization. 3.2.1. Local Search Generally speaking, local search is based on the greedy idea of using neighborhood function to search. That is to say, if a better solution is found than 260

the existing value, then it will take place of the existing one. Although this method is easy to implement, it can only get the local best solution. To be more specific, the local search algorithm starts with several initial solutions and generates neighboring solutions through a neighborhood function. Repeat the above process until meeting the termination conditions.

265

Taking the whole search space into consideration, we start with 13 initial normal vectors ~n(cosi, cosj, cosk): n~1 = (0, 0, 1)

n~2 = (0, 1, 0)

n~3 = (1, 0, 0)

n~4 = (1, 1, 1)

n~5 = (−1, −1, 1)

n~6 = (1, −1, 1)

n~7 = (−1, 1, 1)

n~8 = (1, 1, 0)

n~9 = (1, −1, 0)

n~10 = (1, 0, 1)

n~11 = (−1, 0, 1)

n~12 = (0, 1, 1)

n~13 = (0, −1, 1) As shown in Fig.5, normal vectors ~n(cosi, cosj, cosk) corresponding to initial solutions n = {i, j, k} distributed uniformly in space. For each initial solution ni = {ni (1), ni (2), ni (3)}, the neighborhood function is defined as: Nij = [step × rands + ni (1), step × rands + ni (2), step × rands + ni (3)] 270

(7)

where Nij is the jth neighboring solution generated by the initial solution ni ,step indicates the neighborhood size, rands is a random number within [−1, 1], 1 ≤ j ≤ m, means that getting m neighboring solutions using the neighborhood function. With a known normal solution vector ~n, a linear SVM using LIBSVM [36] is

275

applied for classification and hence the object function value is the classification accuracy using LIBSVM. For LIBSVM, the cost efficient is set to 5 and γ of

13

Figure 5: Initial normal vectors. Collinear vectors generate the same results. Certain neighboring solutions are created near each initial solution.

kernel function is set 1. If the objective function value with the variable n = {i, j, k} is not large, meaning that the projection onto the corresponding plane is confusing to distinguish actions. Conversely, the plane corresponding to a large 280

objective function value is conducive to recognize human actions. According to the objective function value, we choose 8 solutions among all the neighboring solutions to be the initial solutions for the next iteration. Repeat this process for two times and we can get a fairly good solution. 3.2.2. PSO

285

Particle Swarm Optimization (PSO) is an iterative based optimization algorithm [37]. The algorithm was initially inspired by the bird group activities, and then a simplified model was established by swarm intelligence. By sharing information among individuals in a group, the movement of the whole group can evolve from disorder to order in the solution space, and the optimal solution can

290

be obtained. PSO algorithm is shown in Algorithm 1: Firstly, a certain amount of particles are initially created with their locations and velocity assigned. For each particle, calculate the fitness function. Two variables are introduced: pbest 14

Algorithm 1 Procedure PSO for each particle i(1 ≤ i ≤ sp) do Initialization: Xi ← 180 ∗ rands(1, 3) Vi ← rands(1, 3) FIT(Xi ) Set pbesti = Xi end for gbest =max(pbesti ) while iterations < 20 do for i = 1 to 20 do Update the velocity and position of particle i FIT(Xi ) if FIT(Xi )> FIT(pbesti ) then pbesti = Xi end if if FIT(pbesti )>FIT(gbesti ) then gbesti = pbesti end if end for end while function rands(x, y) r ← an array sized (x, y) of random numbers ranged in [0,1] return r end function function fit(X) ~n(a, b, c) ← X Project skeleton joints onto plane ax + by + cz = 0 Calculate descriptor SVM→train accuracy= SVM→predict return accuracy end function 15

and gbest. Each particle has its own best position (pbest) corresponding to the 295

personal best fitness value. The global best particle (gbest) is denoted by all the pbest, which represents the best particle in the entire swarm. Then, update the velocity V and location X for each particle according to the pbest and gbest using Eq. (8) and (9).

V = V + c1 · r1 · (pbest − X) + c2 · r2 · (gbest − X)

(8)

X =X +V

(9)

Where c1 and c2 are acceleration coefficients, r1 and r2 are two independent 300

random numbers distributed in the range of [0, 1]. Generally speaking, the value of Vi can be clamped to the range [Vmin , Vmax ] to control excessive roaming of particles outside the search space. This process is repeated until a user-defined stopping criterion is reached. Since the fitness functions of PSO are not required to be differentiable, derivable and continuous, PSO is applied to find the best

305

solution of normal vector ~n. ~ (i, j, k) since the normal vector is ~n(cos(i), cos(j), cos(k)). The location of particle is V The fitness function is the classification function using SVM while the value of fitness function is the classification accuracy. As for the parameters used in PSO, the size of the population is sp. The range of velocity is [−0.5, 0.5]. The

310

end condition of iteration is that the number of iterations reaching 20. gbest for each process of PSO is taken as the recognition accuracy. 3.3. Boosting the final performance If the highest classification accuracy using LS or PSO still performs not well enough, then several features corresponding to the top classification accuracies

315

can be concatenated as the final descriptor to reach a better result. Whether using the Local Search or PSO we can only get a local best solution, which may be different for each time. So we perform the Local Search or PSO five times and choose three corresponding features to concatenate as the final descriptor

16

which proved a 1%-2% increase in accuracy. The selected normal vectors should 320

be separated as far as possible.

4. Experiments In this section, the recognition performance of the proposed descriptor is evaluated on five public available datasets: MSR-Action3D dataset[38], HDM05 dataset[39], Florence3D-Action dataset [40], UTD-MHAD dataset [41] and UWA3DII 325

dataset [42]. L2 normalization is applied on the descriptor to achieve scaleinvariance. 4.1. Datasets 4.1.1. MSR-Action3D dataset MSR-Action3D dataset provides the skeleton (20 joints) for 20 actions per-

330

formed 2 or 3 times by 10 subjects. There are several validation protocols adopted with this dataset which are described on the authors’ website. Validation protocol of this dataset follows the cross-subject testing procedure, spliting the actions into 3 overlapping sub-sets of 8 classes each one. Captured data of subjects 1, 3, 5, 7, 9 is used for training and the others for test. Table 1: Subsets of data set MSR-Action3D

ActionSet1(AS1)

ActionSet1(AS2)

ActionSet1(AS3)

Horizontal wave(2)

High wave(1)

High throw(6)

Hammer(3)

Hand catch(4)

Forward Kick(14)

Forward Punch(5)

Draw X(7)

Side Kick(15)

High throw(6)

Draw tick(8)

Jogging(16)

Hand Clap(10)

Draw Circle(9)

Tennis Swing(17)

Bend(13)

Hands Wave(11)

Tennis Serve(18)

Tennis Serve(18)

Forward Kick(14)

Golf Swing(19)

Pickup and Throw(20)

Side Boxing(12)

Pickup and Throw(20)

17

335

4.1.2. HDM05 dataset The number of joints recorded in HDM05 dataset is 31, resulting in a longer descriptor. Moreover, the frame rate is much higher, 120 fps instead of 15 fps as in the preceding dataset. Similar to [27], we used 11 actions performed by 5 subjects from the HDM05 dataset, 3 subjects (140 action instances) for training,

340

2 subjects (109 action instances) for testing. The set of actions consisted of deposit floor, elbow to knee, grab high, hop both legs, jog, kick forward, lie down floor, rotate both arms backward, sneak, squat, throw basketball. 4.1.3. Florence3D-Action dataset The dataset collected at the University of Florence during 2012 is also cap-

345

tured using Kinect. The dataset includes 9 actions: wave, drink from a bottle, answer phone, clap, tight lace, sit down, stand up, read watch, bow. The actions are performed by ten subjects for two or three times, resulting in a total of 215 samples. A leave-one-actor-out protocol the same as [43] is used: training the classifier using all the sequences from 9 out of 10 actors and testing the

350

remaining one. Repeat this procedure for all actors and compute the average of 10 classification accuracy values. 4.1.4. UTD-MHAD dataset UTD-MHAD dataset is a multimodal human action dataset which gathers data from one Kinect camera and one wearable inertial sensor. This dataset

355

contains 27 actions performed by 8 subjects (4 females and 4 males) with each subject performing each action for 4 times, generating 861 sequences. For this dataset, cross-subjects protocol is adopted as in [41], namely, the data from the subject numbers 1, 3, 5, 7 used for training while 2, 4, 6, 8 used for testing. 4.1.5. UWA3DII

360

UWA3DII is a multi-view dataset which contains 30 human actions performed by 10 subjects. It is recorded from 4 different viewpoints at different times: front view, left and right side views, and top view. UWA3DII is a challenging database with realistic scenario, variations in subject scales and highly 18

similar actions. For example, the actions “drinking” and “phone answering” 365

are hard to be distinguished using sole skeleton data. Moreover, the lower part of the body is not properly captured because of self occlusion. For cross-view action recognition, we follow the cross view protocol in [42] which takes the samples from two views as training data and the samples from the two remaining views as test data. Transformation method proposed in [44] is used to achieve

370

view invariant. 4.2. Results and Discussion 4.2.1. Parameters Analysis The results of the proposed method are studied under various parameter configurations in which two main kinds of parameters are taken into considera-

375

tion: parameters related to HDS feature and parameters related to optimization of the projection plane. • parameters related to HDS feature: nb, representing the number of historam bins; nl, representing the number of temporal hierarchical levels. Fig.6 reports the quantitative results of recognition accuracy within dif-

380

ferent values of nl. The results show that adding more levels may enhance the classification accuracy. However, increasing the level of temporal hierarchical construction does not have the definitely positive results. Sometimes 2-level of temporal hierarchical construction appears slightly better than 3-level with much time saving. Increasing levels can compensate for

385

the temporal information. But the accuracy may still decrease, meaning that the available frames in each histogram of the new level are too few to make a meaningful histogram. The number of bins nb also affects the final result. We have tried 4 kinds of histogram configurations: 4 × 3, 8 × 3, 12 × 3 and 16 × 3. Fig.7 shows the classification accuracy when using

390

different number of bins. Results show that adding more bins enhances the classification accuracy obviously. Apparently, increasing nb and nl can lead to more time consuming. As shown in Fig.8, the time costs are

19

Figure 6: Evaluation of recognition results over the parameter nl impact while nb is set to 16 × 3.

Figure 7: Evaluation of recognition results over the parameter nb impact while nl is set to 2.

doubled synchronously as the number of nl grows while increasing nb has less impact. 395

• parameters related to optimization of the projection plane: m, the number of neighboring solutions generated by one initial solution in Local Search algorithm; step, the step length of generating neighboring solutions in Local Search algorithm; sp, the size of the population in PSO; c1 and c2 , the acceleration coefficients in PSO. Table 2 reports the impact of these

400

parameters over MSR-Action3D dataset without boosting. It is obvious that the larger step is, the better action recognition performance is. A larger step length in Local Search algorithm means more possibilities to get the local optimal solutions in the solution space. Compared with the step length step, parameter m has less impact on the result. Larger m can 20

Figure 8: Evaluation of time consuming on MSR-Action 3D dataset over the parameter nb and nl impact. Time costs when nb is set to 16 ∗ 3 and nl is set to 3 are taken as 100% for Local Search and PSO respectively.

405

generate more neighboring solutions, but these neighboring solutions do not vary significantly. Thus a larger m cannot improve the accuracy a lot. As for PSO, the role that c1 and c2 play in the algorithm is similar to step in Local Search. However, the same conclusion cannot be summarized as Local Search algorithm. In fact, the step length in PSO is also affected

410

by r1 and r2 , which are two independent random numbers distributed in the range of [0, 1]. Due to the random adjustment brought by r1 and r2 , the impact of c1 and c2 is not absolute. In terms of sp, a larger size of population can have more possibilities to meet the local optimal solutions. But the optimization problem may fall into a bottleneck which explains

415

the performance under setting sp = 40 is slightly worse than the setting sp = 20. After all, the action recognition performance is limited by the quantity and quality of the descriptor since the proposed descriptor focuses on motion information only. Moreover, a larger sp leads to more consumed time on training model.

420

4.2.2. Method comparison • MSR-Action3D: For MSR-Action3D dataset, we get the best classification accuracy of 95.68% without preprocessing of raw data. In addition, the corresponding normal vectors under such circumstance are also listed in Table 3.

21

Table 2: Classification accuracy (%) on the MSR-Action3D dataset using different values of m, step, sp and c1 , c2 .

step = 10

step = 20

step = 30

m=5

92.18

93.98

94.13

m = 10

92.79

94.07

94.07

m = 20

92.79

93.77

94.43

c1 = c2 = 0.2

c1 = c2 = 0.5

c1 = c2 = 0.8

sp = 10

91.77

92.26

92.26

sp = 20

94.13

93.11

93.15

sp = 40

94.07

92.51

93.11

Table 3: The corresponding vector

Set

Accu

corresponding normal vector

AS1

93.48

(1,0.316,1),(-0.779,-0.809,0.950),(0.823,0,0.997)

AS2

96.43

(-0.658,-0.973,0.961),(0.133,1,-0.527),(-0.111,-0.989,0.781)

AS3

97.14

(0.378,1,1)

22

425

As shown in Table 4, our approach outperforms the state-of-the-art methods which take skeleton data as input. In particular, our HDS-SP descriptor is superior to HOD[14] in terms of the different ways to create histogram, the proposal of 2D projection and the temporal perspectives. Table 4: Classification Accuracy Comparison for MSRAction3D dataset.

Method

Input Data

Accuracy(%)

HOD [14]

Skeleton

91.26

Multi-kernel learning [45]

Depth+Skeleton

92.3

RRV [15]

Skeleton

93.44

ST-LSTM+TG [46]

Skeleton

94.8

ST-NBNN [31]

Skeleton

94.8

3RB-tLDS [17]

Skeleton

94.85

Liu et al. [47]

Skeleton

95.3

Our Method

Skeleton

95.68

In spite of the remarkable results, Table 5 gives a comparison of compu430

tation time between the proposed method and other methods in terms of subset AS1 of MSR-Action3D dataset. The computation process is running on a 3.2 GHz machine with 8 GB RAM, using Matlab R2017a. Our method using PSO is less efficient than using Local Search since most of the time is spent on seeking for the most suitable projection plane in the

435

training process. However, the testing time for these two method are the same due to the same size of generated descriptor. Despite of the long duration of training process, the testing process can be quickly enough once the SVM model is well trained. • HDM05: For the HDM05 dataset, the best classification accuracy per-

440

formed by our descriptor is 100% using 16 × 3-bin HOD and 1-level temporal hierarchy by PSO. The corresponding normal vector of projection is ~n(0.6642, 0, 7157, 0.9962). The classification result of our approach outperforms the approach in [27] which get the best classification accuracy 23

Table 5: Average processing time over MSR-Action3D AS1 dataset. The testing time refers to one testing sequence.

Method

Training time(s)

Testing time(ms)

HOD [14]

1.59

3.15

LARP [48]

3636.69

10.41

Our method(LR)

523.27

7.74

Our method(PSO)

1176.30

7.74

of 84.4%. Since the raw data captured by Mocap in HDM05 dataset is 445

of high precision, the proposed method can reach a high classification accuracy. Compared with the other four datasets, the numbers of skeletal joints and frame rate in HDM05 dataset are much higher which can generate a longer descriptor, incorporating more useful information for action recognition.

450

• Florence3D-Action: For Florence3D-Action dataset, the best recognition accuracy performed by our feature descriptor is 95.88% using 16 × 3-bin HOD and 2-level temporal hierarchy and by Local Search, which outperforms the existing methods as shown in Table 6. Table 6: Classification Accuracy Comparison for Florence3D-Action dataset.

Method

Accuracy(%)

Multi-Part Bag-of-Poses [43]

82.0

Vemulapalli et al.[48]

90.9

Shape Analysis of Motion Trajectories[49]

87.04

3DMTG [50]

91.3

RBF kernel machine [51]

91.4

PAM + Pose-Transition Feature [52]

92.09

Our Method

95.88

• UTD-MHAD: Compared with aforementioned datasets which have 10 ac455

tion classes or so, UTD-MHAD dataset contains 27 action classes cover24

ing sport actions, hand gestures, daily activities and training exercises. For UTD-MHAD dataset, Table 7 gives the comparison of the proposed method with previous methods on UTD-MHAD. It is noteworthy that our method using sole skeleton data achieves better performance than the state-of-the-art methods which reveals the effectiveness of proposed

460

method on dataset of more categories. Table 7: Classification Accuracy (%) Comparison for UTD-MHAD dataset.

Method

Input

Accuracy(%)

JTM-CNN [53]

Skeleton

85.8

CNN [54]

Skeleton

86.97

CNN [55]

Skeleton

87.9

JDM [56]

Skeleton

88.1

DPI+att-DTIs [57]

RGB+JEM

88.37

S DDI [58]

Depth+Skeleton

89.04

Our Method

Skeleton

93.95

2

• UWA3DII: Table 8 shows overall recognition accuracies of different methods on UWA3DII dataset using sole skeleton data. Our method achieves better performance than other hand-crafted features. In particular, the recognition performs better when the front view is taken as testing data.

465

The skeletal data of front view bares less self occlusion and wrong joints detection, resulting in higher quality than other views. This suggests the skeletal data requirements of proposed method. The more accurate the data is, the better performance we can achieve. Table 8: Classification Accuracy (%) Comparison for UWA3DII dataset. {1,2} 3

{1,2} 4 28.2

{1,3} 2

17.3

{1,3} 4 27.0

AE [9]

45.0

40.4

35.1

LARP [48]

49.4

42.8

Transferable dictionary learning [59]

59.3

Our Method

59.4

Method HOJ3D [13]

15.3

{1,4} 2

14.6

{1,4} 3

13.4

{2,3} 1 15.0

36.9

34.7

36.0

34.6

39.7

38.1

57.9

50.2

48.1

59.4

56.3

56.3

12.9

{2,4} 1

22.1

{2,4} 3 13.5

20.3

12.7

17.7

49.5

29.3

57.1

35.4

49.0

29.3

39.8

44.8

53.3

33.5

53.6

41.2

56.7

32.6

43.4

59.9

63.4

65.1

67.1

68.2

55.5

73.5

53.4

60.1

61.8

57.4

69.8

65.5

68.2

55.0

69.0

56.7

61.2

25

{2,3} 4

{3,4} 1

{3,4} 2

Mean

470

4.2.3. Ablation Study We conduct extensive ablation study of different units in HDS-SP with following settings: (A). HOD-SP (i.e. 2-level histograms of oriented displacements based on specific projections); (B). 1-level HDS-SP (i.e. 1-level histograms of distributed sectors based on specific projections, without temporal information);

475

(C). HDS (i.e. 2-level histograms of distributed sectors, without optimization of projection plane). Table 9 shows the results of different settings on MSR-Action 3D dataset. Compared with our 2-level HDS-SP descriptor, we have these findings. It is noted that setting C concatenates the HDS features corresponding to XY, YZ and XZ planes. (1) Setting C yields the lowest performance, indicating

480

the importance of proposal of specific projections. (2) The performance under setting B which eliminates the temporal information is highest. This do not definitely mean that temporal information can be absent in the descriptor. In fact, it means that the temporal hierarchical construction proposed in this paper can not present the temporal information with sufficient details which needs to be improved in the future. Table 9: Performance of HOD-SP, 1-level HDS-SP, HDS and 2-level HDS-SP on the MSRAction3D dataset.

Method

HOD-SP

1-level HDS-SP

HDS

2-level HDS-SP

Accuracy

94.01

94.13

92.88

95.68

485

5. Conclusion A novel descriptor called HDS-SP is proposed based on the principle that the better the viewpoint, the better action recognition performance. The main contributions of this paper are listed as three points: First, a new way to form 490

histogram is created which is more elaborate than the traditional way. Second, unlike the usual 2D representation which is obtained by projection onto XY,YZ and XZ planes, projection onto the most discriminating 2D planes is creatively put forward to dig out as much information for classification as possible. Third,

26

the whole sequence is divided into overlapping subsections to provide temporal 495

information besides. The descriptor is scale-invariant and speed-invariant. The method is verified on five datasets and outperforms the state-of-art. It should be noted that the idea of projection onto 2D specific planes instead of XY, YZ and XZ planes provide new direction for dimension-reduction representations. In the past, projection onto XY, YZ and XZ planes is a traditional

500

way for 3D data processing. Compared with this conventional way, the idea of projection onto 2D specific planes can provide more useful information to extract features. Except for 3D skeleton-based human action recognition [60], it can also work in depth maps-based human action recognition [61] [62] while existing depth maps-based methods extract features from front, top and side

505

views. Furthermore, the optimization of the projection plane can be taken as an additional layer for deep learning frameworks [7][53], taking the normal vector ~n as a parameter to be optimized during deep learning. Compared with traditional ways to create histograms, the proposed way to form histograms can provide more information with a more delicate segment for histogram bins and

510

a more reasonable voting rules. Thus the proposed way to form histograms may also improve other descriptors such as histogram of optic flows(HOF)[63] and histogram of oriented gradient(HOG). The improvement of 3D skeleton-based human action recognition accuracy by this method may have some potential applications including somatosensory

515

gaming, advanced human-machine interface and recovery project. Despite the creative contributions of this paper, the proposed method still have some limitations. First, the optimization of projection plane is time consuming. Second, given an incoming video, video segmentation and trained model have to be prepared in advance since the proposed method is built on the model training of

520

clipped video sequences. Our future work will focus on deep learning representations for real-time applications [64]. Combining with temporal convolutional networks for action segmentation, convolutional networks will be explored for action recognition in real time which takes the optimization of projection plane as an additional layer. 27

525

Declaration of interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The authors declare the following financial interests/personal relationships

530

which may be considered as potential competing interests: Acknowledgment This work is supported by the National Natural Science Foundation of China, grants (No. 61733011, 51575338). References

535

References [1] W. Ge, R. T. Collins, R. B. Ruback, Vision-based analysis of small groups in pedestrian crowds, IEEE Transactions on Pattern Analysis & Machine Intelligence 34 (5) (2012) 1003–1016. [2] E. Demircan, D. Kulic, D. Oetomo, M. Hayashibe, Human movement un-

540

derstanding [tc spotlight], Robotics & Automation Magazine IEEE 22 (3) (2015) 22–24. [3] J. Bongjin, C. Inho, K. Daijin, Local transform features and hybridization for accurate face and human detection, IEEE Transactions on Pattern Analysis & Machine Intelligence 35 (6) (2013) 1423–1436.

545

[4] Y. Song, D. Demirdjian, R. Davis, Continuous body and hand gesture recognition for natural human-computer interaction, Acm Transactions on Interactive Intelligent Systems 2 (1) (2012) 1–28. [5] S. Even, O. Kariv, et al., humanoid motion generation system on hrp2-jsk for daily life environment, in: Mechatronics and Automation, 2005 IEEE

550

International Conference, 2005, pp. 1772–1777 Vol. 4. 28

[6] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, J. Gall, A Survey on Human Motion Analysis from Depth Data, Springer Berlin Heidelberg, 2013. [7] A. S. Keceli, Viewpoint projection based deep feature learning for single and dyadic action recognition, Expert Systems with Applications (2018) 555

235–243 Vol. 104. [8] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, Ogunbona, Philip, Action recognition from depth maps using deep convolutional neural networks, IEEE Transactions on Human-Machine Systems 46 (4) (2016) 498–509. [9] J. Wang, Z. Liu, Y. Wu, Learning Actionlet Ensemble for 3D Human Action

560

Recognition, Springer International Publishing, 2014. [10] S. Boubou, E. Suzuki, Classifying actions based on histogram of oriented velocity vectors, Journal of Intelligent Information Systems 44 (1) (2015) 49–65. [11] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, X. Xie, Co-occurrence

565

feature learning for skeleton based action recognition using regularized deep lstm networks (2016) 3697–3703. [12] G. Yu, Z. Liu, J. Yuan, Discriminative orderlet mining for real-time recognition of human-object interaction 9007 (2014) 50–65. [13] L. Xia, C. C. Chen, J. K. Aggarwal, View invariant human action recog-

570

nition using histograms of 3d joints, in: Computer Vision and Pattern Recognition Workshops, 2012, pp. 20–27. [14] M. Torki, M. A. Gowayyed, M. E. Hussein, M. El-Saban, Histogram of oriented displacements (hod): Describing trajectories of human joints for action recognition, in: International Joint Conference on Artificial Intelli-

575

gence, 2013, pp. 1351–1357. [15] S. Guo, H. Pan, G. Tan, C. Lin, C. Gao, A high invariance motion representation for skeleton-based action recognition, International Journal of Pattern Recognition & Artificial Intelligence 30 (08) (2016) 1650018. 29

[16] Y. Guo, Y. Li, Z. Shao, Rrv: A spatiotemporal descriptor for rigid body 580

motion recognition, IEEE Transactions on Cybernetics 48 (5) (2018) 1513– 1525. [17] W. Ding, L. Kai, E. Belyaev, C. Fei, Tensor-based linear dynamical systems for action recognition from 3d skeletons, Pattern Recognition 77 (2018) 75– 86.

585

[18] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, Advances in Neural Information Processing Systems 1 (4) (2014) 568–576. [19] C. Cao, Y. Zhang, C. Zhang, H. Lu, Body joint guided 3-d deep convolutional descriptors for action recognition, IEEE Transactions on Cybernetics

590

48 (3) (2018) 1095. [20] C. Xie, C. Li, B. Zhang, C. Chen, J. Han, C. Zou, J. Liu, Memory attention networks for skeleton-based action recognition. [21] M. E. Hussein, M. Torki, M. A. Gowayyed, M. El-Saban, Human action recognition using a temporal hierarchy of covariance descriptors on 3d joint

595

locations, in: International Joint Conference on Artificial Intelligence, 2013, pp. 639–44. [22] G. Johansson, Spatio-temporal differentiation and integration in visual motion perception, Psychological Research 38 (4) (1976) 379. [23] F. Lv, R. Nevatia, Recognition and segmentation of 3-d human action using

600

hmm and multi-class adaboost, Proc.eur.conf.on Computer Vision Graz. [24] A. Yao, Coupled action recognition and pose estimation from multiple views, International Journal of Computer Vision 100 (1) (2012) 16–37. [25] R. Qiao, L. Liu, C. Shen, A. V. D. Hengel, Learning discriminative trajectorylet detector sets for accurate skeleton-based action recognition, Pattern

605

Recognition 66 (C) (2015) 202–212. 30

[26] C. Xi, M. Koskela, Skeleton-based action recognition with extreme learning machines, Neurocomputing 149 (PA) (2015) 387–396. [27] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, R. Bajcsy, Sequence of the most informative joints (smij): A new representation for human skeletal action 610

recognition, in: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012, pp. 8–13. [28] L. Shao, X. Zhen, D. Tao, X. Li, Spatio-temporal laplacian pyramid coding for action recognition, IEEE Transactions on Cybernetics 44 (6) (2014) 817–827. doi:10.1109/TCYB.2013.2273174.

615

[29] D. Luvizon, H. Tabia, D. Picard, Learning features combination for human action recognition from skeleton sequences, Pattern Recognition Letters 99. doi:10.1016/j.patrec.2017.02.001. [30] P. Wang, W. Li, P. Ogunbona, Z. Gao, H. Zhang, Mining mid-level features for action recognition based on effective skeleton representation, in:

620

International Conference on Digital Lmage Computing: Techniques and Applications, 2015, pp. 1–8. [31] J. Weng, C. Weng, J. Yuan, Spatio-temporal naive-bayes nearest-neighbor (st-nbnn) for skeleton-based action recognition, in: IEEE Conference on Computer Vision & Pattern Recognition, 2017.

625

[32] Y. Du, Y. Fu, L. Wang, Skeleton based action recognition with convolutional neural network, in: Pattern Recognition, 2016. [33] Q. Ke, S. An, M. Bennamoun, F. Sohel, F. Boussaid, Skeletonnet: Mining deep part features for 3d action recognition, IEEE Signal Processing Letters 24 (6) (2017) 731–735.

630

[34] S. Song, C. Lan, J. Xing, W. Zeng, J. Liu, An end-to-end spatio-temporal attention model for human action recognition from skeleton data, in: The Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2016.

31

[35] H. Wang, W. Liang, Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks, in: 2017 IEEE 635

Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [36] C. C. Chang, C. J. Lin, Libsvm: A library for support vector machines. [37] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Icnn’95 - International Conference on Neural Networks, 2002, pp. 1942–1948 vol.4. [38] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3d points,

640

in: Computer Vision and Pattern Recognition Workshops, 2010, pp. 9–14. [39] M. M¨ uller, T. R¨ oder, M. Clausen, B. Eberhardt, B. Kr¨ uger, A. Weber, Documentation mocap database hdm05, Tech. Rep. CG-2007-2, Universit¨ at Bonn (June 2007). [40] http://www.micc.unifi.it/vim/datasets/3dactions/,, in: Florence 3D ac-

645

tion dataset, 2013. [41] C. Chen, Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor, in: IEEE International Conference on Image Processing, 2015. [42] H. Rahmani, A. Mahmood, D. Huynh, A. Mian, Histogram of oriented

650

principal components for cross-view action recognition, IEEE Transactions on Pattern Analysis & Machine Intelligence 38 (12) (2016) 2430–2443. [43] L. Seidenari, V. Varano, S. Berretti, A. D. Bimbo, P. Pala, Recognizing actions from depth cameras as weakly aligned multi-part bag-of-poses, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops,

655

2013, pp. 479–485. [44] M. Liu, L. Hong, C. Chen, Enhanced skeleton visualization for view invariant human action recognition, Pattern Recognition 68 (2017) 346–362.

32

[45] G. Chen, D. Clarke, M. Giuliani, A. Gaschler, A. Knoll, Combining unsupervised learning and discrimination for 3d action recognition, Signal 660

Processing 110 (C) (2015) 67–81. [46] J. Liu, A. Shahroudy, D. Xu, G. Wang, Spatio-temporal lstm with trust gates for 3d human action recognition, in: Computer Vision – ECCV 2016, 2016, pp. 816–833. [47] T. Liu, J. Wang, S. Hutchinson, Q. H. Meng, Skeleton-based human action

665

recognition by pose specificity and weighted voting, International Journal of Social Robotics 11 (2) (2019) 219–234. [48] R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3d human skeletons as points in a lie group, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595.

670

[49] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, B. A. Del, 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold, IEEE Transactions on Cybernetics 45 (7) (2015) 1340– 1352. [50] B. Liu, H. Yu, X. Zhou, D. Tang, H. Liu, Combining 3d joints moving trend

675

and geometry property for human action recognition, in: IEEE International Conference on Systems, Man, and Cybernetics, 2017, pp. 000332– 000337. [51] J. Cavazza, P. Morerio, V. Murino, Scalable and compact 3d action recognition with approximated rbf kernel machines, IEEE Transactions on Pattern

680

Analysis and Machine Intelligence 93. doi:10.1016/j.patcog.2019.03. 031. [52] T. Huynh-The, C. H. Hua, N. A. Tu, T. Hur, J. Bang, D. Kim, M. B. Amin, B. H. Kang, H. Seung, S. Y. Shin, Hierarchical topic modeling with posetransition feature for action recognition using 3d skeleton data , Information

685

Sciences 444 (2018) 20–35. 33

[53] P. Wang, W. Li, C. Li, Y. Hou, Action recognition based on joint trajectory maps with convolutional neural networks, in: Acm on Multimedia Conference, 2016. [54] Y. Hou, Z. Li, P. Wang, W. Li, Skeleton optical spectra-based action recog690

nition using convolutional neural networks, IEEE Transactions on Circuits & Systems for Video Technology 28 (3) (2018) 807–811. [55] W. Pichao, L. Wanqing, L. Chuankun, H. Yonghong, Action recognition based on joint trajectory maps with convolutional neural networks, Knowledge-Based Systems 158 (2018) 43–53.

695

[56] C. Li, Y. Hou, P. Wang, W. Li, Joint distance maps based action recognition with convolutional neural network, IEEE Signal Processing Letters 24 (5) (2017) 624–628. [57] M. Liu, F. Meng, C. Chen, S. Wu, Joint dynamic pose image and space time reversal for human action recognition from videos, in: The Thirty-Third

700

AAAI Conference on Artificial Intelligence (AAAI-19), 2019. [58] P. Wang, W. Shuang, Z. Gao, Y. Hou, W. Li, Structured images for rgb-d action recognition, in: IEEE International Conference on Computer Vision Workshop, 2017. [59] J. Zhang, S. Member, IEEE, H. P. H. Shum, Member, Action recognition

705

from arbitrary views using transferable dictionary learning, IEEE Transactions on Image Processing 27 (10) (2019) 4709–4723. [60] S. Vantigodi, V. B. Radhakrishnan, Action recognition from motion capture data using meta-cognitive rbf network classifier, in: IEEE Ninth International Conference on Intelligent Sensors, 2014.

710

[61] C. Chen, M. Liu, L. Hong, B. Zhang, J. Han, N. Kehtarnavaz, Multitemporal depth motion maps-based local binary patterns for 3d human action recognition, IEEE Access 5 (2017) 22590–22604.

34

[62] C. Q. Le, T. D. Ngo, D. D. Le, S. Satoh, D. A. Duong, Human action recognition from depth videos using multi-projection based representation, 715

in: IEEE International Workshop on Multimedia Signal Processing, 2015. [63] Z. Gao, H. Zhang, G. P. Xu, Y. B. Xue, Multi-perspective and multimodality joint representation and recognition model for 3d action recognition, Neurocomputing 151 (2015) 554–564. [64] A. Liu, N. Xu, W. Nie, Y. Su, Y. Wong, M. Kankanhalli, Benchmark-

720

ing a multimodal and multiview and interactive dataset for human action recognition, IEEE Transactions on Cybernetics 47 (7) (2017) 1781–1794.

Biography

Jingjing Liu received the B.E. degree in mechanical engineering from School 725

of Mechanical Engineering, Beijing Institute of Technology, Beijing, China, in 2017. She is currently pursuing the Ph.D. degree in School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China. Her current research interests include human action recognition and object detection. She is also interested in early screening of autism based on computer vision.

35

730

Zhiyong Wang (S18) received the B.E. degree from the School of Mechanical & Automotive Engineering, South China University of Technology, Guangzhou, China, in 2016. He is currently pursuing the Ph.D. degree with the School of Mechanical Engineering, Shanghai Jiao Tong University, Shang735

hai, China. His research interests include image processing, facial feature points detection, and gaze estimation.

Honghai Liu received the Ph.D. degree in robotics from Kings College London, London, U.K., in 2003. He is the Chair Professor of Intelligent Sys740

tems and Robotics, University of Portsmouth, Portsmouth, U.K. His research interests include biomechatronics, pattern recognition, intelligent video analyt-

36

ics, intelligent robotics, and their practical applications with an emphasis on approaches that could make contribution to the intelligent connection of perception to action using contextual information. Prof. Liu is a fellow of the 745

Institution of Engineering and Technology. He is an Associate Editor of the IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, and IEEE TRANSACTIONS ON CYBERNETICS.

37