Combining depth-skeleton feature with sparse coding for action recognition

Hanling Zhang a,b, Ping Zhong a,*, Jiale He a, Chenxing Xia a

a College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
b Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing 210044, China

ARTICLE INFO

Communicated by Z. Wang

Keywords: Human action recognition; Depth-Skeleton (DS) feature; Sparse coding; Gradient information

ABSTRACT

RGB-D human action recognition is a very active research topic in computer vision and robotics. In this paper, an action recognition method that combines gradient information and sparse coding is proposed. First, we leverage depth gradient information and the distances of skeleton joints to extract a coarse Depth-Skeleton (DS) feature. Then, sparse coding and max pooling are combined to refine the coarse DS feature. Finally, a Random Decision Forest (RDF) is used to perform action recognition. Experimental results on three public datasets show the superior performance of our method.

1. Introduction

Human action recognition is a very fertile research theme due to its strong applicability in several real-world domains, such as human-computer interfaces, content-based video indexing, video surveillance, and robotics, among others [1–3]. The main challenges in traditional action recognition systems are occlusions, shadows and background extraction. Recently, the introduction of RGB-D cameras, such as the Kinect, has alleviated some of the difficulties that reduce action recognition performance in RGB video [4]. Depth maps differ from conventional RGB images: pixels in a depth map record the depth of the scene rather than a measure of color intensity. Working in low light and being invariant to color and texture, depth cameras offer several advantages over traditional RGB cameras. In this context, increasing attention has been directed to the task of recognizing human actions from depth sequences [5]. Several hand-designed RGB-D action recognition approaches have been proposed in the last few years. These methods can be categorized as skeleton-based measures [6–9,4,5] and depth-based measures [9–14], and they have reported good performance on different action datasets. In recent years, machine learning has achieved great success in speech recognition and object recognition, and machine learning approaches [15–17] have also proved effective in action recognition. Despite this progress, human action recognition remains a long-standing challenge in computer vision, with several limitations. (1) In the presence of cluttered backgrounds, camera motion and occlusions, it is difficult to fully capture the spatial-temporal structure of actions from video. Although spatio-temporal interest point based methods [18–20] collect spatial-temporal information of actions,



their performance primarily depends on correctly detecting a large number of interest points. Machine learning methods also collect spatial-temporal information; their disadvantages, however, are time consumption and parameter tuning. (2) There is a large amount of redundant data in an action video. If all of the video data are used, the redundant information will probably reduce the precision of the recognition algorithm; if only specific parts are used, some representative information may be discarded. Therefore, it is necessary to design a method that retains only the information strictly needed for classification. To address these challenges, a novel action recognition method that combines the coarse DS feature and sparse coding is introduced. Firstly, to accurately capture the spatial-temporal information, the coarse DS feature is proposed to collect gradient and spatial information from the RGB-D video. Secondly, unreliable parts of the feature are removed by leveraging the advantages of sparse representation [21]; therefore, sparse coding is applied to the coarse DS feature. This optimization procedure can effectively suppress redundant information and highlight discriminative parts. Fig. 1 summarizes the proposed approach. Our contributions can be summarized as follows. (1) We present an algorithm to extract a coarse DS feature that captures the dynamics in depth videos. (2) The optimization procedure, sparse coding, is applied to the coarse DS feature to produce a compact and robust representation of the action video. (3) The proposed method produces a relatively low-dimensional feature compared to other techniques on the same public datasets. The rest of the paper is organized as follows: Section 2 describes the related work and Section 3 presents the proposed approach in detail.

Corresponding author. E-mail addresses: [email protected] (H. Zhang), [email protected] (P. Zhong).

http://dx.doi.org/10.1016/j.neucom.2016.12.041 Received 11 March 2016; Received in revised form 25 November 2016; Accepted 13 December 2016 0925-2312/ © 2016 Elsevier B.V. All rights reserved.


Fig. 1. Overview of the approach. The illustrated pipeline is composed of two main stages: (1) Extraction of coarse depth-skeleton feature. (2) Optimization procedure using sparse coding.

In Section 4, we analyze the experimental results. The conclusion is presented in Section 5.

2. Related work

Research on action recognition has explored a number of representations of depth sequences. Skeleton-based approaches have become popular thanks to the work of Shotton et al. [22], which described a real-time method to accurately predict the 3D positions of body joints in individual depth maps without using any temporal information. In [6], the skeleton joints were represented by combining the temporal and spatial joint relations. The work in [23] applied the idea of pairwise relative locations of joints to represent human actions. In [7], a human skeleton was represented as points in a Lie group. In [5], the action recognition problem was formulated as computing the similarity between the shapes of trajectories in a Riemannian manifold. Slama et al. [4] addressed the problem of modeling and analyzing human motion by focusing on 3D body skeletons.

Methods based on depth maps involve the entire set of points of the depth map sequence to extract the dynamics of the actions. In [13], a 4D space histogram is used to encode the distribution of surface normal orientations. SNV [19] clusters hypersurface normals in a depth sequence to form the polynormal, which is used to jointly characterize local motion and shape information. Building on the success of HOG-based descriptors for static images, HOG3D [24] viewed videos as spatio-temporal volumes and generalized the key HOG concepts to 3D. Chen et al. [25] employed local binary patterns (LBPs) to describe the DMM and provided two types of feature fusion methods. HOG2 [26] showed that a modified HOG descriptor applied spatially in each frame around each joint can be used to extract useful depth information. Inspired by HOG2, we propose a novel descriptor named HTG-HOG (Histogram of Temporal Gradient-Histogram of Oriented Gradient). This descriptor captures not only the information of the current frame but also that of its neighbors. The main difference between the aforementioned works and our method is the extraction of spatio-temporal information.

Video data contain redundant information, which may harm action recognition. Sparse coding, however, provides a strong capacity to learn a compact feature. Amiri et al. [27] trained dictionaries to compute a non-negative sparse representation of feature vectors. Zhu et al. [16] proposed an approach for action recognition by encoding local 3D spatial-temporal gradient features within the sparse coding framework. Yang et al. [28] proposed a linear spatial pyramid matching kernel based on SIFT sparse codes for image classification.


Fig. 2. Keyframe examples (top one) and the moving pattern (bottom one) on different datasets. (a)–(e) are pictures of MSRAction3D dataset. (f)–(j) are pictures of MSRActionPair3D dataset. (For interpretation of the references to color in this figure, the reader is referred to the web version of this article.)

Our approach differs from these works in applying sparse coding to the DS feature and in the design of the improved feature extraction pipeline.

3. Method

3.1. Depth descriptor

An action depicted by video data contains spatio-temporal changes, and it is desirable to extract this kind of information for the video representation. For example, 'pick up a box' and 'put down a box' (f and g in Fig. 2, respectively) have almost the same moving pattern but a different temporal order. Consequently, we propose a depth descriptor that provides temporal and spatial information for every pixel. First, the HTG (Histogram of Temporal Gradient) is used to collect spatio-temporal information of each depth map, and then a modified HOG (Histogram of Oriented Gradient) is applied on the HTG (over time). A video can be described using x, y, t coordinates (see Fig. 1). The HTG is created as follows. First of all, each map in the depth sequence is divided into N*N cells (N=3 in Figs. 1 and 4). Secondly, a centered mask [−1, 0, 1] is used to compute its gradient. The gradients of the x − t plane are defined as:

G_t(i, j) = I^{t−1}(i, j) − I^t(i, j),   G_x(i, j) = I^t(i − 1, j) − I^t(i + 1, j)          (1)

where I^t(i, j) is the value of the pixel at position (i, j) of the tth image in the depth sequence. From Eq. (1), it is clear that the HTG considers not only the spatial information of the current frame but also the temporal information of its neighbors for each pixel. Thus, the HTG is extremely sensitive to slight dynamic changes in the depth sequence. The gradient magnitude g and the gradient orientation θ are then computed for all pixels in the cell using Eqs. (2) and (3), respectively:

g(i, j) = sqrt( G_t(i, j)^2 + G_x(i, j)^2 )          (2)

θ(i, j) = arctan( G_t(i, j) / G_x(i, j) )          (3)

In each cell, a temporal gradient histogram is constructed by quantizing the angle of each gradient vector into M bins. Lastly, we normalize the histogram using the L2 norm. The HTG of the y − t plane is computed in the same way. Some G_t examples of the action 'two hand wave' are shown in Fig. 3(a); the pictures in Fig. 3(b) are the corresponding I_t images. From the visual results it is clear that the G_t images highlight moving parts between consecutive frames while suppressing static parts, and that these images are sparser than I_t. Besides, from Fig. 2, we can further infer that each action has a specific moving pattern, which can be observed in the I_t images. Therefore, G_t and I_t are selected to compute the gradient magnitude. Fig. 1 shows the orientation histograms of the HTG for the action 'two hand wave', and it indicates that the active cells change with the progression of the action. For example, the active cells of the first depth map are 2, 5, 8, while the kth map's active cells are 1, 2, 3, 4, 5, 8. The histograms of these active cells are emphasized in blue in Fig. 1. In order to accurately capture the difference of active cells along the t axis, we then aggregate the HTGs of all maps to form a 2D array. Next, the modified HOG algorithm is applied on this 2D array to further extract temporal information.
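To make the HTG construction concrete, the following Python sketch computes the x−t plane HTG for a single pair of consecutive depth maps following Eqs. (1)–(3). It is an illustrative approximation, not the authors' implementation: the function name htg_xt, the magnitude-weighted histogram, and the use of arctan2 instead of a plain arctan are assumptions.

```python
import numpy as np

def htg_xt(prev_map, cur_map, n_cells=3, n_bins=9):
    """Sketch of the x-t plane HTG for one pair of consecutive depth maps.

    prev_map, cur_map : 2D float arrays (depth maps I^{t-1} and I^t).
    Returns an (n_cells * n_cells * n_bins) vector of L2-normalized
    per-cell orientation histograms.
    """
    # Eq. (1): temporal gradient and spatial gradient (centered mask [-1, 0, 1])
    g_t = prev_map - cur_map
    g_x = np.zeros_like(cur_map)
    g_x[1:-1, :] = cur_map[:-2, :] - cur_map[2:, :]

    # Eqs. (2)-(3): gradient magnitude and orientation
    mag = np.sqrt(g_t ** 2 + g_x ** 2)
    ang = np.arctan2(g_t, g_x)                      # orientation in (-pi, pi]

    h, w = cur_map.shape
    hists = []
    for ci in range(n_cells):
        for cj in range(n_cells):
            ys = slice(ci * h // n_cells, (ci + 1) * h // n_cells)
            xs = slice(cj * w // n_cells, (cj + 1) * w // n_cells)
            # quantize orientations of the cell into n_bins, weighted by magnitude
            cell_hist, _ = np.histogram(ang[ys, xs], bins=n_bins,
                                        range=(-np.pi, np.pi),
                                        weights=mag[ys, xs])
            # L2 normalization of the per-cell histogram
            cell_hist /= (np.linalg.norm(cell_hist) + 1e-12)
            hists.append(cell_hist)
    return np.concatenate(hists)
```

The y−t plane HTG would be obtained by the same routine with the spatial gradient taken along the other image axis.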


Fig. 3. Some G_t and I_t images of the action 'two hand wave'. (a) Examples of G_t images. (b) The corresponding I_t images.

Therefore, our approach is termed HTG-HOG. Finally, to produce the final depth descriptor, we append a vector of the average (over time) HTG to the HTG-HOG. The average HTG vector is used to distinguish the active cells and improves the discriminative power of the depth descriptor. Different configurations of active cells for three actions are shown in Fig. 4. The active cells of 'bend' are 5 and 8, while the active cells of 'side kick' are 2, 4, 5, 8 and 9. Therefore, the average HTG vectors are highly representative of one action and highly discriminative with respect to the other actions.
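The aggregation into the final depth descriptor can be sketched as follows: the per-frame HTGs are stacked into the 2D array of Section 3.1, a HOG-style gradient histogram is computed over that array, and the temporal average HTG is appended. The array shapes, function name htg_hog, and the particular gradient operator are assumptions for illustration, since the paper does not publish code.

```python
import numpy as np

def htg_hog(htg_per_frame, n_bins=9):
    """Sketch of the HTG-HOG depth descriptor for one depth video.

    htg_per_frame : (T, D) array, one HTG vector (e.g. from htg_xt above) per frame.
    Returns the concatenation of a HOG-like histogram over the stacked HTGs
    (capturing how cells activate over time) and the average HTG vector.
    """
    A = np.asarray(htg_per_frame, dtype=float)          # 2D array of stacked HTGs

    # modified HOG over the 2D array: gradients along time and feature axes
    gt = np.gradient(A, axis=0)                          # change across frames
    gd = np.gradient(A, axis=1)                          # change across histogram bins
    mag = np.sqrt(gt ** 2 + gd ** 2)
    ang = np.arctan2(gt, gd)
    hog_hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    hog_hist /= (np.linalg.norm(hog_hist) + 1e-12)

    avg_htg = A.mean(axis=0)                             # average (over time) HTG
    return np.concatenate([hog_hist, avg_htg])
```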

3.2. Skeleton descriptor

The topology of the body posture can efficiently distinguish different actions, such as 'bend' and 'high arm wave', and depth cameras provide a powerful skeleton tracking algorithm that outputs 20 3D human joint positions for each frame. Thus, the pairwise joint distances of a key frame are an intuitive way to represent the human posture. For example, the action 'high arm wave' can be interpreted as 'arm above the shoulder'. Such spatial relationships effectively discriminate between action classes. To select the most discriminative frame from an action sequence, the mutual information (MI) [29] is employed to measure the quantity of information shared between two maps, so the key frame search problem can be described as follows:

I_key = argmin_{I∈S} MI = argmin_{I∈S} [ h(I^t) + h(I^{t+1}) − h(I^t, I^{t+1}) ]          (4)

where the entropy h(I^t) is a measure of the variability of the tth image I^t in the action sequence and S is the image set of the entire sequence. h(I) and h(I^t, I^{t+1}) are given by

h(I) = − Σ_r p_I(r) log( p_I(r) )          (5)

h(I^t, I^{t+1}) = − Σ_{r,k} p_{I^t I^{t+1}}(r, k) log( p_{I^t I^{t+1}}(r, k) )          (6)

where r and k are respectively the depth values of the two maps I^t and I^{t+1}, and the joint probability distribution function p_{I^t I^{t+1}}(r, k) = P(I^t = r ∩ I^{t+1} = k) is a normalized bi-dimensional histogram of the two images [29]. p_{I^t}(r) = P(I^t = r) is the probability distribution function of r. As suggested by its definition, this MI function is exceedingly sensitive to variation between consecutive frames.

Then, to characterize the static posture information of the key frame, the pairwise distances of the 3D skeleton joint positions are computed. The 3D positions of the N joints are described as S = {s_1, s_2, …, s_N}, S ∈ R^{3×N}, where s_i = [x_i; y_i; d_i], x_i and y_i are the image position of joint i and d_i is its depth value. The skeleton descriptor F_s is defined as:

F_s = {s_i − s_j | i, j = 1, 2, …, N; i ≠ j}          (7)

Finally, the depth descriptor and the skeleton descriptor are concatenated to form the coarse DS feature F_DS ∈ R^{1×D} of the video. Examples of key frames are shown in Fig. 2. These results show that the key frame extracted with mutual information can clearly separate different actions, such as 'high arm wave' and 'jogging'. They also show that the moving region of an action always appears in the center of the picture, such as the red box area in Fig. 2. Thus, DS features from other parts (outside the red box) are redundant and may lead to inaccurate classification when noise exists in those areas. In the next subsection, sparse coding and max pooling are applied to reduce the influence of these data.
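A compact sketch of the key-frame selection and the pairwise-distance skeleton descriptor is given below. The histogram bin count, the helper names, and the representation of joints as image coordinates plus depth are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def mutual_information(a, b, bins=64):
    """MI of two depth maps from their normalized 1D/2D histograms (Eqs. 4-6)."""
    p_ab, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab /= p_ab.sum()
    p_a = p_ab.sum(axis=1)
    p_b = p_ab.sum(axis=0)
    h = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))   # entropy, Eqs. (5)-(6)
    return h(p_a) + h(p_b) - h(p_ab)

def key_frame_index(depth_maps):
    """Index t minimizing MI(I^t, I^{t+1}) over the sequence, as in Eq. (4)."""
    scores = [mutual_information(depth_maps[t], depth_maps[t + 1])
              for t in range(len(depth_maps) - 1)]
    return int(np.argmin(scores))

def skeleton_descriptor(joints):
    """Pairwise joint differences of Eq. (7); joints is an (N, 3) array [x, y, d]."""
    n = joints.shape[0]
    pairs = [joints[i] - joints[j] for i in range(n) for j in range(n) if i != j]
    return np.concatenate(pairs)
```

With the N = 20 Kinect joints, Eq. (7) as written yields 380 ordered difference vectors (190 if only unordered pairs are kept), each with three components.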

3.3. Sparse coding

Recently, machine learning methods have shown their power in image classification, action recognition and other domains. Two fundamentally different types of tasks in machine learning are supervised learning and unsupervised learning. Sparse coding is an unsupervised algorithm that learns to represent input data succinctly using only a small number of bases [21]. Inspired by [28], sparse coding is utilized to refine the coarse DS feature.


Fig. 4. Active cells of different actions. The depth maps are divided into 3*3 cells, and the active cells are covered by the action performer.

Table 1
Performance comparison of the proposed method on three datasets, with different variations.

Variation                             MSRAction3D (%)   MSRDailyActivity3D (%)   MSRActionPair3D (%)
Skeleton descriptor                   71.38             75.64                    66.58
Depth descriptor                      89.22             84.36                    88.14
Coarse DS                             91.27             87.5                     92.22
Skeleton descriptor + Sparse coding   71.49             76.24                    72.61
Depth descriptor + Sparse coding      92.37             91.27                    93.36
Coarse DS + Sparse coding             96                97.5                     97.78

Table 2
Comparison of recognition accuracy on the MSRAction3D dataset with 9 state-of-the-art methods.

Method                     Accuracy (%)
Actionlet Ensemble [23]    88.2
Depth Cuboid [12]          89.3
Hon4d [13]                 88.89
C. Lu [14]                 85.62
DMM-HOG [31]               85.5
DMM-LBP [25]               92.5
HOG2 [26]                  94.84
HDG [30]                   88.82
SNV [19]                   93.09
Ours                       96

The sparse coding problem can be described as:

min_{U,V} Σ_{m=1}^{M} ‖x_m − u_m V‖² + λ|u_m|   s.t.  ‖v_k‖ ≤ 1,  ∀ k = 1, 2, …, K          (8)

Let X = [x_1, x_2, …, x_M]^T ∈ R^{M×D} be a set of DS descriptors in a D-dimensional feature space, where V = [v_1, v_2, …, v_K]^T is the codebook and ‖·‖ denotes the L2 norm of vectors. U = [u_1, u_2, …, u_M]^T is the coefficient matrix, and λ enforces the sparsity of the solution. The optimization problem of Eq. (8) is convex in V (with U fixed) and convex in U (with V fixed), but not in both simultaneously. The conventional way to handle such a problem is to solve it iteratively, alternately optimizing over V or U while fixing the other. In other words, once the codebook is obtained from training data, sparse coding can be applied efficiently to testing data. First, the coarse DS features of all videos are divided into two parts, training data F_TrDS ∈ R^{M1×D} and testing data F_TeDS ∈ R^{M2×D}.


Fig. 5. The confusion matrix for the proposed approach on MSR Action3D dataset. It is recommended to view the figure on the screen.

Table 3
Results of our approach on the MSRDailyActivity3D dataset compared to previously published results.

Method                     Accuracy (%)
Actionlet Ensemble [23]    85.75
Depth Cuboid [12]          88.2
Hon4d [13]                 80
C. Lu [14]                 95.63
HDG [30]                   78.42
SNV [19]                   86.25
Ours                       97.5

10,000 patches are extracted from the training data F_TrDS (X = F_TrDS) to train the codebook V by iterating between Eqs. (9) and (10):

min_V ‖X − UV‖²_F   s.t.  ‖v_k‖ ≤ 1,  ∀ k = 1, 2, …, K          (9)

min_{u_m} ‖x_m − u_m V‖² + λ|u_m|          (10)

The optimization problem in Eq. (9) is a least squares problem with quadratic constraints. Similar to [21], this optimization is carried out efficiently via the Lagrange dual. With V fixed, Eq. (10) can be solved by optimizing over each coefficient u_m individually, which can be done very efficiently by algorithms such as the feature-sign search algorithm [28]. Once the codebook V has been obtained by this off-line training, on-line sparse coding can be performed efficiently as in Eq. (10) on the testing data (X = F_TeDS). Finally, max pooling is applied to each column of U, and the pooled features are concatenated to form the improved feature F_imp for classification.
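The alternating optimization of Eqs. (8)–(10) followed by max pooling can be sketched in Python as below. The ISTA-style update used for the codes, the ridge-regularized codebook update, the fixed iteration counts, and the absolute-max pooling are simplifications standing in for the Lagrange-dual and feature-sign solvers cited above, so this is an illustrative approximation rather than the authors' procedure.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def learn_codebook(X, K=256, lam=0.1, n_iter=30, code_steps=25, seed=0):
    """Alternating minimization for Eq. (8): codes U (ISTA) and codebook V (Eq. 9)."""
    rng = np.random.default_rng(seed)
    M, D = X.shape
    V = rng.standard_normal((K, D))
    V /= np.linalg.norm(V, axis=1, keepdims=True)           # enforce ||v_k|| <= 1
    U = np.zeros((M, K))
    for _ in range(n_iter):
        # Eq. (10): sparse codes with V fixed (proximal-gradient / ISTA sketch)
        step = 1.0 / (np.linalg.norm(V @ V.T, 2) + 1e-12)
        for _ in range(code_steps):
            U = soft_threshold(U - step * (U @ V - X) @ V.T, step * lam)
        # Eq. (9): codebook with U fixed (regularized least squares, then projection)
        V = np.linalg.solve(U.T @ U + 1e-6 * np.eye(K), U.T @ X)
        V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1.0)
    return V

def encode_and_pool(F, V, lam=0.1, code_steps=50):
    """On-line coding of a video's coarse DS descriptors F (P x D) plus max pooling."""
    U = np.zeros((F.shape[0], V.shape[0]))
    step = 1.0 / (np.linalg.norm(V @ V.T, 2) + 1e-12)
    for _ in range(code_steps):
        U = soft_threshold(U - step * (U @ V - F) @ V.T, step * lam)
    # absolute max pooling over each column of U gives the improved feature F_imp
    return np.abs(U).max(axis=0)
```

In practice, the inner ISTA loop would be replaced by a dedicated sparse solver such as the feature-sign search of [21], which is what the paper relies on.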

3.4. Recognition algorithm

Let (Z_k, y_k), k = 1, …, M2, be the feature set with respect to the class labels, where Z_k ∈ F_imp, y_k ∈ {1, …, C}, and C is the number of classes. This is a multi-class problem, and the number of actions to be classified is quite large. Therefore, the improved feature F_imp is used as the input of a classifier, in our case a random decision forest. The random decision forest is made up of decision trees, and each tree independently predicts its label; a majority voting scheme is then applied to obtain the final label. For a given improved feature F_imp, each decision tree is trained on training data F_imps randomly selected from F_imp, and the remaining feature data are used for testing. More specifically, a decision tree consists of split nodes and leaf nodes, and a binary classification is performed at each split node. If the classes are linearly separable, after log2(C) decisions each action class is separated from the remaining C − 1 classes and reaches a leaf node [30]. After the decision trees have been trained, the test data are used for validation. The final label of an action sequence is obtained by applying the majority voting algorithm to the labels of all decision trees.
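As a rough stand-in for the random decision forest classifier, the scikit-learn RandomForestClassifier below trains on the pooled features and reports accuracy on a held-out split. The library, the hyperparameters, and the stratified random split are assumptions: the paper instead splits by subject (see Section 4.4.3) and does not specify an implementation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def classify_actions(F_imp, labels, n_trees=100, seed=0):
    """F_imp : (n_videos, K) pooled sparse codes; labels : (n_videos,) action ids."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        F_imp, labels, test_size=0.5, stratify=labels, random_state=seed)
    # each tree votes independently; predict()/score() use the majority vote
    rdf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rdf.fit(X_tr, y_tr)
    return rdf.score(X_te, y_te)
```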

4. Experiments

This section summarizes the empirical results and evaluates the performance of the proposed approach on three public datasets. We first introduce the three datasets, then verify the effectiveness of the proposed approach, and finally present the experimental results. In all experiments, the RDF is used as the classifier, and the performance is evaluated by the confusion matrix and the recognition accuracy.

4.1. Datasets

MSRAction3D is an RGB-D dataset containing 20 actions, each performed by 10 subjects two or three times. This dataset is challenging due to the small inter-class variations among actions, and the skeleton tracker often fails on it. The MSRDailyActivity3D dataset [22] is captured by a Kinect device; it contains 16 activities performed by 10 subjects, and each subject is asked to perform each activity twice, once in a sitting position and once in a standing position. The MSRActionPair3D dataset [13] is a paired-activity dataset of depth sequences captured by a depth sensor. It contains 12 actions performed by 10 subjects. In this dataset, each pair of actions has similar motion and shape, but the temporal order is reversed. For all datasets, the action performers stand at a short distance from the sensor.


Fig. 6. The confusion matrix for the proposed method on MSR DailyActivity3D dataset. Overall accuracy achieved 97.5%. It is recommended to view the figure on the screen.

Table 4
Comparison of recognition accuracy on the MSRActionPairs3D dataset.

Method                     Accuracy (%)
Actionlet Ensemble [23]    82.22
Hon4d [13]                 96.67
DMM-HOG [31]               66.11
SNV [19]                   98.89
Ours                       97.78

4.2. Validation of the proposed approach

To verify the effectiveness of the proposed approach, we compare the proposed coarse DS feature with HOG2 [26] in terms of recognition accuracy on the MSRAction3D dataset. HOG2 obtains 91.81% while ours reaches only 91.27%. There are two reasons for this. One has been mentioned at the end of Section 3.1; the other is probably the full use of local information around each joint in HOG2, whereas we use only the distances of skeleton joints and global depth information. To evaluate the optimization performance of sparse coding, we compare the optimized result with the coarse DS feature in terms of recognition accuracy on the three datasets. The results in Table 1 show that the optimized feature is superior to the coarse feature. The improvement on the MSRAction3D dataset is much smaller, because less noisy data are involved in this dataset. After sparse coding and max pooling, the representation of a video sequence becomes more compact and efficient. We also compare the proposed depth descriptor with the skeleton descriptor (Table 1). The proposed skeleton descriptor using MI achieves much worse accuracy than the depth descriptor using gradient information, since the skeleton tracker sometimes fails and the tracked joint positions are quite noisy.

4.3. Comparison with other methods

We compare our method with 9 action recognition algorithms on the MSRAction3D dataset, including Actionlet Ensemble [23], Depth Cuboid [12], Hon4d [13], C. Lu [14], DMM-HOG [31], DMM-LBP [25], HOG2 [26], HDG [30] and SNV [19]. To make a fair comparison, we directly use the reported accuracy results; for HOG2, we run the authors' code. The comparison with other methods (Table 2) indicates that our method significantly outperforms other classical action recognition methods. The confusion matrix is illustrated in Fig. 5. For the majority of actions, our approach works very well. For example, the proposed method successfully discriminates 'high arm wave' from 'horizontal arm wave' thanks to its ability to capture differences in spatial structure. Classification errors occur when the occlusion of an action is large, such as for 'bend', because the skeleton tracker fails frequently.

On the MSRDailyActivity3D dataset, we compare our method with 6 action recognition models: Actionlet Ensemble [23], Depth Cuboid [12], Hon4d [13], C. Lu [14], HDG [30] and SNV [19]. Table 3 shows the accuracies of the different methods. This dataset is more challenging than MSRAction3D; most of its actions involve human-object interactions, such as 'read'. The confusion matrix is shown in Fig. 6. The proposed approach can distinguish between 'eat' and 'drink' even though their temporal order is extremely similar. It is noteworthy that the involved objects show different contours, and the depth descriptor can differentiate subtle shape differences of objects. The proposed method achieves 97.5% while SNV obtains only 86.25%. In SNV, most recognition errors occur in nearly static activities, e.g., 'write on paper'; our approach, by contrast, is exceedingly sensitive to slight changes, thanks also to the optimization procedure.

On the MSRActionPairs3D dataset, we compare our method with 4 classic recognition models: Actionlet Ensemble [23], Hon4d [13], DMM-HOG [31] and SNV [19]. The comparison of recognition accuracy with the different algorithms is shown in Table 4, which shows that the proposed method can significantly improve the recognition rate and yields performance similar to state-of-the-art approaches. The accuracy of SNV is higher than ours because its adaptive spatio-temporal pyramid structure can globally capture the spatial and temporal orders, whereas the proposed method does not use a specific temporal pyramid. The confusion matrix is shown in Fig. 7. The experimental results on MSRAction3D, MSRDailyActivity3D and MSRActionPair3D show that the proposed method outperforms the state-of-the-art methods. In addition, our algorithm also achieves good performance in cluttered and static backgrounds.


Fig. 7. Confusion matrix of the proposed method on the MSRActionPairs3D dataset. It is recommended to view the figure on the screen.

Table 6
Performance comparison using different combinations of training subjects on three datasets.

             MSRAction3D    MSRDailyActivity3D   MSRActionPair3D
Accuracy     95.23 ± 0.37   96.94 ± 0.29         97.13 ± 0.68

Table 7
Feature size comparison with different methods on the MSRDailyActivity3D dataset.

Method         HOG2 [26]   DMM-HOG [31]   Hon4d [13]   Ours
Feature size   4320+703    7360           120          256

Fig. 8. Accuracy of correct classification on the MSRDailyActivity3D dataset for varying cell and bin numbers in the depth descriptor.

Table 5
The effects of codebook size on three datasets. '1/5' indicates that one fifth of the videos are used as training data.

Codebook size     128            256            512            1024           2048           4096
Action   1/5      85.65 ± 0.52   90.67 ± 0.61   91.20 ± 0.54   83.24 ± 0.67   80.94 ± 0.48   75.82 ± 0.76
         1/4      88.72 ± 0.43   95.13 ± 0.54   92.13 ± 0.64   93.54 ± 0.63   95.06 ± 0.81   76.42 ± 0.92
Daily    1/5      83.45 ± 0.48   89.57 ± 0.67   94.31 ± 0.48   86.34 ± 0.54   84.49 ± 0.75   75.42 ± 0.76
         1/4      86.82 ± 0.64   93.12 ± 0.95   96.75 ± 0.75   89.12 ± 0.48   84.83 ± 0.45   78.14 ± 0.64
Pair     1/5      90.95 ± 0.67   92.61 ± 0.42   93.94 ± 0.69   93.84 ± 0.41   88.72 ± 0.64   77.36 ± 0.54
         1/4      94.63 ± 0.34   95.91 ± 0.35   95.96 ± 0.76   97.31 ± 0.47   92.04 ± 0.37   81.46 ± 0.29


4.4. Discussion

4.4.1. Cell number and bin number

There are two parameters in the depth descriptor, the cell number N and the bin number M. In this experiment, we find that the setting N=10 and M=18 gives the best performance. As shown in Fig. 8, increasing the number of cells improves performance significantly up to about 10 cells, but makes little difference beyond this.

4.4.2. Efficiency of codebook size

If the codebook size is too small, the histogram feature loses discriminative power; if the codebook size is too large, histograms from the same class will never match [28]. Similar to [28], the effect of the codebook size is also investigated. The depth sequences are divided into training data and testing data, and for each experiment the training data are selected randomly. Detailed results are shown in Table 5. In [28], three codebook sizes (256, 512 and 1024) were used and accuracy was reported to keep increasing as the codebook size grew to 1024. In our experiments, six codebook sizes are evaluated (128, 256, 512, 1024, 2048 and 4096). As shown in Table 5, the performance increases initially and then decreases as the codebook size grows further. There is one free parameter, λ, in Eq. (8), which has to be chosen when the optimization procedure is applied to each feature vector. Empirically, the performance is best with λ = 0.1.

4.4.3. Training data in RDF

In the classification procedure, different combinations of training data also have an impact on recognition accuracy. In order to further investigate the robustness of the proposed method, we try every possible combination. Similar to Hon4d [13] and HDG [30], half of the subjects are used for training and the remaining examples are used as testing data. The results in Table 6 show that our approach is robust to the different combinations of training data.

4.4.4. Time computation and feature size efficiency

Since the proposed approach is based on depth images and skeletal joint positions, it results in a significant decrease in computation time. With a Matlab implementation, it takes 9.2172 s to recognize a sequence of 60 frames on the MSRDailyActivity3D dataset: the depth descriptor extraction takes 6.5042 s, the key frame search and skeleton descriptor extraction take 2.6712 s, the on-line coding takes 0.0321 s and the classification with the RDF takes 0.0097 s. In addition, we compare the performance of our method in terms of feature size with state-of-the-art methods on the MSRDailyActivity3D dataset. The experiments are run on a computer with an Intel Pentium G630 2.70 GHz CPU and 2 GB of RAM. Table 7 shows the feature size comparison between our method and several alternative state-of-the-art techniques. The feature size of the proposed method is larger than that of Hon4d [13], but our approach achieves higher recognition accuracy than Hon4d on all datasets (see Tables 2–4).

4.4.5. Limitations

The proposed approach performs favorably against most of the existing algorithms in terms of recognition accuracy. However, the skeleton tracker sometimes fails and the tracked joint positions are quite noisy, which affects recognition accuracy. Additionally, since the depth descriptor collects the dynamic differences between consecutive maps, the proposed approach does not work as well as other methods when the change of background is drastic. This limitation could be addressed in the future by additional procedures that eliminate the influence of a dynamic background.

5. Conclusion

In this paper, we propose an accurate action recognition algorithm based on depth videos and 3D joint positions. The proposed algorithm relies on depth gradient information, pairwise distances of joints and sparse coding, and an RDF is used for classification. The algorithm is tested on three standard datasets and compared with several state-of-the-art algorithms. It is worth mentioning that all actions are performed in an indoor environment; how to train our method more effectively for recognizing human actions against complicated outdoor backgrounds will be the focus of our future work.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61672222, 61572183, 61472131), the Science and Technology Key Projects of Hunan Province (No. 2015TP1004), and the Priority Academic Program Development of Jiangsu Higher Education Institutions (KJR1510).


References

[1] H. Liu, M. Yuan, F. Sun, RGB-D action recognition using linear coding, Neurocomputing 149 (2015) 79–85.
[2] Z. Gao, W. Nie, A. Liu, H. Zhang, Evaluation of local spatial-temporal features for cross-view action recognition, Neurocomputing 173 (2016) 110–117.
[3] J. Wang, Z. Liu, Y. Wu, Human Action Recognition with Depth Cameras, Springer, 2014.
[4] R. Slama, H. Wannous, M. Daoudi, A. Srivastava, Accurate 3D action recognition using learning on the Grassmann manifold, Pattern Recognit. 48 (2) (2015) 556–567.
[5] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, A. Del Bimbo, Space-time pose representation for 3D human action recognition, in: International Conference on Image Analysis and Processing, Springer, 2013, pp. 456–464.
[6] X. Yang, Y. Tian, Effective 3D action recognition using eigenjoints, J. Vis. Commun. Image Represent. 25 (1) (2014) 2–11.
[7] R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3D skeletons as points in a Lie group, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588–595.
[8] M.A. Gowayyed, M. Torki, M.E. Hussein, M. El-Saban, Histogram of oriented displacements (HOD): describing trajectories of human joints for action recognition, in: IJCAI, 2013.
[9] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 1, IEEE, 2005, pp. 886–893.
[10] M. Liu, H. Liu, Depth context: a new descriptor for human activity recognition by using sole depth sequences, Neurocomputing 175 (2016) 747–758.
[11] S. Tang, X. Wang, X. Lv, T.X. Han, J. Keller, Z. He, M. Skubic, S. Lao, Histogram of oriented normal vectors for object recognition with a depth sensor, in: Computer Vision–ACCV 2012, Springer, 2012, pp. 525–538.
[12] L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2834–2841.
[13] O. Oreifej, Z. Liu, HON4D: histogram of oriented 4D normals for activity recognition from depth sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 716–723.
[14] C. Lu, J. Jia, C.-K. Tang, Range-sample depth feature for action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 772–779.
[15] S.-Z. Li, B. Yu, W. Wu, S.-Z. Su, R.-R. Ji, Feature learning based on SAE-PCA network for human gesture recognition in RGBD images, Neurocomputing 151 (2015) 565–573.
[16] Y. Zhu, X. Zhao, Y. Fu, Y. Liu, Sparse coding on local spatial-temporal volumes for human action recognition, in: Computer Vision–ACCV 2010, Springer, 2010, pp. 660–671.
[17] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 221–231.
[18] Z. Gao, W. Nie, A. Liu, H. Zhang, Evaluation of local spatial-temporal features for cross-view action recognition, Neurocomputing 173 (2016) 110–117.
[19] X. Yang, Y. Tian, Super normal vector for activity recognition using depth sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 804–811.
[20] P. Scovanner, S. Ali, M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, in: Proceedings of the 15th International Conference on Multimedia, ACM, 2007, pp. 357–360.
[21] H. Lee, A. Battle, R. Raina, A.Y. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems, 2006, pp. 801–808.
[22] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, R. Moore, Real-time human pose recognition in parts from single depth images, Commun. ACM 56 (1) (2013) 116–124.
[23] J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 1290–1297.
[24] A. Klaser, M. Marszałek, C. Schmid, A spatio-temporal descriptor based on 3D-gradients, in: BMVC 2008 – Proceedings of the 19th British Machine Vision Conference, British Machine Vision Association, 2008.
[25] C. Chen, R. Jafari, N. Kehtarnavaz, Action recognition from depth sequences using depth motion maps-based local binary patterns, in: 2015 IEEE Winter Conference on Applications of Computer Vision, IEEE, 2015, pp. 1092–1099.
[26] E. Ohn-Bar, M. Trivedi, Joint angles similarities and HOG2 for action recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 465–470.
[27] S.M. Amiri, P. Nasiopoulos, V.C. Leung, Non-negative sparse coding for human action recognition, in: Proceedings of the 19th IEEE International Conference on Image Processing, IEEE, 2012, pp. 1421–1424.
[28] J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), IEEE, 2009, pp. 1794–1801.
[29] A. Dame, E. Marchand, Second-order optimization of mutual information for real-time image registration, IEEE Trans. Image Process. 21 (9) (2012) 4190–4203.
[30] H. Rahmani, A. Mahmood, D.Q. Huynh, A. Mian, Real time action recognition using histograms of depth gradients and random decision forests, in: IEEE Winter Conference on Applications of Computer Vision, IEEE, 2014, pp. 626–633.
[31] X. Yang, C. Zhang, Y. Tian, Recognizing actions using depth motion maps-based histograms of oriented gradients, in: Proceedings of the 20th ACM International Conference on Multimedia, ACM, 2012, pp. 1057–1060.