Signal Processing: Image Communication (journal homepage: www.elsevier.com/locate/image)
Human activity recognition using dynamic representation and matching of skeleton feature sequences from RGB-D images

Qiming Li a, Wenxiong Lin b, Jun Li a,*

a Quanzhou Institute of Equipment Manufacturing, Haixi Institute, Chinese Academy of Sciences, Quanzhou, Fujian, 362216, China
b Fujian Institute of Research on the Structure of Matter, Chinese Academy of Sciences, Fuzhou, Fujian, 350002, China
ARTICLE INFO

Keywords: Human activity recognition; Dynamic representation and matching; Shape dynamic time warping
ABSTRACT

Segmenting a skeleton feature sequence into pose and motion feature segments can improve the effectiveness of key-pose-based human activity recognition. Nevertheless, representing feature segments with a fixed number of atomic motions results in the loss of temporal information, which makes the segmentation technique hard to adapt to all types of feature sequences. To address this issue, this paper proposes a human activity recognition method that builds on the segmentation technique and uses dynamic representation and matching of skeleton feature sequences. In our method, a skeleton feature sequence is first segmented into key pose and atomic motion segments according to the potential differences. Afterwards, a learning strategy is proposed to select frame sets with high confidence values, and we then extract series of poses (i.e., atomic motion series) with different numbers of poses from the learnt frame sets to represent motion feature segments, while a fixed number of centroids obtained by K-Means are used to represent pose feature segments. Finally, the shape dynamic time warping (shapeDTW) algorithm is utilized to measure the distance between the corresponding motion feature segments of two feature sequences. Comprehensive experiments are conducted on three public human activity datasets, and the results show that our method achieves superior performance in comparison with several state-of-the-art methods.
1. Introduction

Human activity recognition is a fundamental requirement in many applications of computer vision [1,2], robotics [3], and wearable sensors [4]. In particular, with the rapid development of intelligent personal assistive robotics, reliably detecting and recognizing daily activities has attracted considerable attention in the research community. However, it is still a challenging task to recognize people's daily activities using only RGB image sequences. Thanks to the availability of inexpensive depth cameras such as the Microsoft Kinect sensor, it has become possible to capture depth information in real time and to recognize activities accurately with the skeleton data extracted from RGB-D images.

Among the skeleton-based human activity recognition methods, many previous works [5-8] focus on key poses. The key idea shared by these methods is to learn the most discriminative key poses, which can effectively grasp the distinguishing nature of an activity. Activity recognition based on key poses can then be treated as a sequence matching task. However, most traditional key pose extraction algorithms do not consider the temporal sequence information among key poses. To address this issue, recent
methods [9,10] propose to segment the feature sequences of an activity into key pose and atomic motion segments according to the potential differences. In these works, the representation of a feature sequence is composed of poses with tiny or significant movements, and the "Key pose-Atomic motion-Key pose" representation is beneficial for extracting high-level features and improving the performance of activity recognition. Nevertheless, these works use a fixed number of atomic motions to represent the motion process, and do not adopt any strategy to extract discriminative poses for feature sequence representation. As a consequence, some temporal information of a complex motion process is inevitably lost. In addition, not all of the obtained atomic motions are representative of the motion process.

In this paper, taking full advantage of the segmentation technique, we propose a human activity recognition method using dynamic representation and matching of skeleton feature sequences from RGB-D images. A skeleton feature sequence is first segmented into key pose and motion feature segments according to the potential differences. Afterwards, a learning strategy is proposed to select frame sets with high confidence values, and we then extract series of poses (i.e., atomic motion series) with different numbers of poses from the learnt
* Corresponding author. E-mail address: [email protected] (J. Li).
https://doi.org/10.1016/j.image.2018.06.013
Received 18 January 2018; Received in revised form 24 June 2018; Accepted 24 June 2018; Available online xxxx
0923-5965/© 2018 Published by Elsevier B.V.
Fig. 1. Architecture of dynamic representation and matching of feature sequences. One atomic motion series in segment 1 (motion feature segment) of sequence 1 contains 3 atomic motions, while another atomic motion series in segment 1 of sequence 2 contains 4 atomic motions.
frame sets to represent motion feature segments, while a fixed number of centroids obtained by K-Means are used to represent pose feature segments. Finally, in order to match the corresponding motion feature segments of two feature sequences, a recent alignment algorithm, named shapeDTW [11], is used to measure the distance between two atomic motion series in the motion feature segments for the nearest neighbour classifier.

Fig. 1 shows the architecture of dynamic representation and matching of feature sequences in our method. As shown in the figure, atomic motion series with different numbers of discriminative poses are extracted for motion feature segments according to their different movement velocities, and shapeDTW is used to measure the distance between the corresponding motion feature segments. The advantages of our method are threefold. Firstly, the temporal information among key pose segments is well preserved by the discriminative frame sets obtained with the proposed learning strategy. Secondly, shapeDTW, which considers both the global optimal solution and local structural information, is applied to match two feature sequences. Thirdly, the dynamic representation and matching of motion feature segments makes our method adaptive to feature sequences with different movement velocities.

The remainder of this paper is organized as follows. Section 2 reviews related work on human activity recognition based on skeleton data. Section 3 describes the details of our method. Section 4 presents extensive experiments that validate the effectiveness of our method on several benchmark datasets. Section 5 concludes the paper.
2. Related work

Since the introduction of a real-time skeleton capturing method [12], many skeleton-based human activity recognition approaches [13-18] have been proposed. For example, Gan and Chen [14] developed a skeleton-based representation named APJ3D to describe a 3D human posture, and exploited an improved Fourier temporal pyramid and random forests to recognize activities. Other studies focused on building descriptors from histograms of skeleton data, such as histograms of oriented displacements (HOD) [17], histograms of 3D joint locations (HOJ3D) [15], and histograms of oriented 4D normals (HON4D) [18]. Moreover, in order to make the skeleton data invariant to the position of human joints in the world coordinate system, several methods [8,10,19] used a set of distance vectors that connect different coordinates of human joints to represent human postures.

Recently, many researchers [5,7-10,20-23] have observed that a human activity can be viewed as a sequence of key poses. In [20], an antieigenvalue-based method was presented to detect key frames by investigating the properties of operators that transform past states to observed future states of human activities. Other methods [5,23] performed the selection of key poses by means of clustering algorithms. Nevertheless, not all frames are representative and discriminative for a human activity, and key poses obtained from the clustering centroids of such frames can negatively impact the recognition performance. Therefore, recent models [21,22] used learning strategies to select discriminative key frames from feature sequences. As a new trend, there is significant interest [9,10] in segmenting feature sequences and then extracting key poses in each segment. These works argue that a feature sequence is composed of poses with tiny and significant movements, and the segmentation is carried out according to the potential differences. However, most of these methods use a fixed number of key poses to represent feature segments. Therefore, this paper focuses on selecting a dynamic number of representative poses for the motion feature segments.

Several classification methods, such as the hidden Markov model (HMM) [24,25], the support vector machine (SVM) [23,26,27], the naive Bayes nearest neighbour (NBNN) [10,28], and dynamic time warping (DTW) [29], have been exploited to classify high-level feature sequences for human activity recognition. Piyathilaka and Kodagoda [24] exploited a Gaussian mixture model (GMM) based HMM, which clusters data into different groups as a collection of multinomial Gaussian distributions, to detect human activities. A multiclass SVM was used to estimate the human activity in [23,26]. In [10], high-level key poses and atomic motions were extracted from feature sequences, and the NBNN algorithm was employed to classify human activities. Because DTW is robust against variations in speed or style, it was utilized in [29] as a distance measure between two activities. Recently, shapeDTW [11] was proposed to enhance DTW by taking point-wise local structural information into consideration. In order to exploit the local information between poses in a feature sequence, we use shapeDTW to measure the distance between two feature sequences for the nearest neighbour classifier in this paper.
3. Our method

This section gives a detailed description of the dynamic representation and matching of feature sequences for human activity recognition. Firstly, the feature extraction process for skeleton data and the segmentation technique for feature sequences are explained. Secondly, the learning strategy for discriminative frame sets is described. Thirdly, the representation of the key pose and atomic motion segments in a feature sequence is presented. Finally, the dynamic matching of feature sequences is introduced.

3.1. Feature sequence segmentation

In RGB-D datasets, the coordinates of the joints in each skeleton are extracted to represent postures of the human body. However, because of the occlusion of different body parts, the coordinates estimated from RGB-D images may sometimes be corrupted. Errors and noise caused by the corrupted data severely degrade the effectiveness of the skeleton data. Therefore, in our implementation we apply a simple moving average to smooth the skeleton data and alleviate the influence of corrupted frames. After smoothing the skeleton data, we compute spatial features between joint coordinates to represent a frame. Several methods for spatial feature extraction are introduced in [6,8,10,30]. In our method, we adopt the normalized relative orientation (NRO) feature [10], which is insensitive to the human body's height, limb length, and distance to the camera. An NRO feature is computed between two relative joints that rotate around each other, such as the elbow and shoulder joints. Suppose that a human skeleton in a frame is composed of $N$ pairs of relative joints, and let $J_{i1}$ and $J_{i2}$ denote the two joints of the $i$th pair. The $i$th NRO feature $f_i$ in a skeleton is computed as

$$ f_i = \frac{J_{i1} - J_{i2}}{\left\| J_{i1} - J_{i2} \right\|}, \quad i = 1, 2, \ldots, N, \qquad (1) $$

where $\|\cdot\|$ denotes the Euclidean distance. A posture feature vector for one skeleton frame is then formulated as

$$ F = [f_1, f_2, \ldots, f_N]. \qquad (2) $$

The NRO features in a skeleton frame are grouped into 5 different limbs (i.e., the left upper limb, the right upper limb, the left lower limb, the right lower limb, and the torso). In the cases of 15 and 20 joints, there are 3 and 4 NRO features per limb (i.e., 3 x 5 = 15 and 4 x 5 = 20 NRO features in a skeleton frame), respectively. More details about the NRO feature vector can be found in [10].
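To make the feature extraction concrete, the following Python/NumPy sketch smooths a skeleton sequence and computes the NRO feature vector of Eqs. (1)-(2). It is an illustrative sketch only, not the authors' implementation (the paper reports a MATLAB implementation); the window size, joint-pair list and array shapes are our own assumptions.

```python
import numpy as np

def smooth_skeleton(joint_seq, window=3):
    """Simple moving average over time to suppress noisy joint estimates.
    joint_seq : (T, num_joints, 3) array of raw joint coordinates."""
    kernel = np.ones(window) / window
    flat = joint_seq.reshape(joint_seq.shape[0], -1)
    smoothed = np.apply_along_axis(lambda x: np.convolve(x, kernel, mode="same"), 0, flat)
    return smoothed.reshape(joint_seq.shape)

def compute_nro_features(joints, joint_pairs):
    """NRO feature vector of one skeleton frame (Eqs. (1)-(2)).
    joints      : (num_joints, 3) joint coordinates of a single frame.
    joint_pairs : list of (i1, i2) index pairs of relative joints, e.g. (elbow, shoulder);
                  the pairing is an assumption, the paper groups NRO features by limb.
    Returns the flat posture vector F = [f_1, ..., f_N]."""
    feats = []
    for i1, i2 in joint_pairs:
        diff = joints[i1] - joints[i2]
        norm = np.linalg.norm(diff)
        # Guard against degenerate (corrupted) joints at zero distance.
        feats.append(diff / norm if norm > 1e-8 else np.zeros(3))
    return np.concatenate(feats)
```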
Intuitively, a human activity can be viewed as a sequence of poses with tiny and significant movements. Therefore, we segment the feature sequence into pose and motion feature segments according to the potential difference [9,10], which measures the degree of movement of a posture. The potential difference between two frames in a feature sequence is calculated by

$$ P_d(t) = E_p(t) - E_p(t-1), \qquad (3) $$

where $E_p(t) = \| F_t - F_1 \|$ is the difference between the NRO feature vectors of the $t$th and the first frame. The frames of a feature sequence are labelled as pose feature segments when their potential difference satisfies

$$ |P_d(t)| < P_{min}, \qquad (4) $$

where $P_{min}$ is a given threshold. Otherwise, they are labelled as motion feature segments. Fig. 2 visualizes the segmentation results of the "drink water" data sequence in the Cornell CAD-60 dataset for different threshold values of the minimum potential difference. As shown in Fig. 2, the obtained pose and motion feature segments appear alternately in a feature sequence. Atomic motion segments consist of the movements between two stationary states of a human activity, and provide more discriminative information than pose feature segments. Furthermore, when the threshold $P_{min}$ is less than 0.02 in our case, some corrupted frames (marked by the blue square in Fig. 2(a) for $P_{min}$ = 0.01) still remain, and the proposed method obtains a low accuracy. When the threshold $P_{min}$ is greater than 0.02, the performance does not improve either, since some discriminative and representative poses are smoothed out by the threshold. Therefore, we set the threshold $P_{min}$ to 0.02 in our implementation.

Fig. 2. Illustration of the segmentation results of the "drink water" data sequence in the Cornell CAD-60 dataset according to different threshold values of the minimum potential difference. (a) Segmentation results of the "drink water" data sequence for $P_{min}$ = 0.01, 0.02, and 0.04. (b) The precision of the proposed method on the "drink water" data sequence for different values of $P_{min}$.
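A compact sketch of this segmentation step, assuming the NRO feature matrix produced by the previous sketch; the output format (label, frame indices) is our own illustrative choice.

```python
import numpy as np

def segment_sequence(features, p_min=0.02):
    """Split a feature sequence into pose / motion segments (Eqs. (3)-(4)).
    features : (T, D) array of NRO feature vectors F_1..F_T."""
    T = features.shape[0]
    energy = np.linalg.norm(features - features[0], axis=1)   # E_p(t) = ||F_t - F_1||
    pot_diff = np.diff(energy, prepend=energy[0])              # P_d(t) = E_p(t) - E_p(t-1)
    labels = np.where(np.abs(pot_diff) < p_min, "pose", "motion")
    segments, start = [], 0
    for t in range(1, T + 1):
        # Close a segment whenever the label changes or the sequence ends.
        if t == T or labels[t] != labels[start]:
            segments.append((labels[start], list(range(start, t))))
            start = t
    return segments
```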
3.2. Learning discriminative frame sets

Generally, key poses are generated by clustering techniques in most traditional recognition methods [10,31,32]. However, not all of the frames used for clustering are discriminative, and some frames may belong to corrupted data. In our method, we propose a learning strategy to select discriminative poses from a constant number of segments. The key idea of our solution is to perform a k-nearest neighbours (kNN) search on some randomly selected frames. More specifically, we randomly select a frame set $J$ times, and let $\tilde{F}^j = \{\tilde{F}^j_1, \tilde{F}^j_2, \ldots, \tilde{F}^j_M\}$ be the $j$th selected set, which consists of $M$ randomly selected frames from a constant number of feature segments $S$. In order to learn the most discriminative frame sets, we construct a training pool, which contains all frames in $S$ and some randomly sampled frames from other activity sequences, for the kNN search of each selected frame in $\tilde{F}^j$. The confidence function $C(\tilde{F}^j)$ of $\tilde{F}^j$ is computed from the votes of the $k$ nearest neighbours of each selected frame:

$$ C(\tilde{F}^j) = P(S \mid \tilde{F}^j) = \sum_{m=1}^{M} P(S \mid \tilde{F}^j_m) = \sum_{m=1}^{M} \frac{N(\tilde{F}^j_m, S)}{N(\tilde{F}^j_m)} = \sum_{m=1}^{M} \frac{n^S_m}{n_m}, \qquad (5) $$
where $P(S \mid \tilde{F}^j)$ denotes the probability that the retrieved $k$ nearest neighbours of $\tilde{F}^j$ belong to the same feature segments $S$ as $\tilde{F}^j$; $n_m$ is the number of nearest neighbours of $\tilde{F}^j_m$ in the training pool, and $n^S_m$ is the number of those neighbours that belong to the same feature segments $S$ as $\tilde{F}^j_m$. We select the top $K$ frame sets with the highest confidence values; the selected frame sets of one activity can be viewed as the most representative characteristics that distinguish it from other activities. Through experiments, we observe that this learning strategy works well in practice if $J$ (i.e., the number of randomly selected frame sets) and $k$ (i.e., the number of nearest neighbours) are large enough. The confidence function captures the intuition that, if it is high, the selected frame sets include a high proportion of frames that are the most representative of the human activity; if it is low, the selected frames are usually common, shared ones, such as standing poses from the start and the end of each sequence. Moreover, corrupted frames, which have far fewer nearest neighbours than the representative frames in the same feature segments, are discarded by this learning strategy. Two examples of frame sets with high confidence values are visualized in Fig. 3.
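A sketch of this confidence computation, assuming each frame is represented by its NRO vector. NearestNeighbors from scikit-learn is used only as a convenient kNN search; the function and variable names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def frame_set_confidence(frame_set, pool_feats, pool_in_segments, k=2000):
    """Confidence C(F~^j) of one randomly selected frame set (Eq. (5)).
    frame_set        : (M, D) features of the M randomly selected frames.
    pool_feats       : (P, D) training pool (all frames of the S segments plus
                       randomly sampled frames from other activities).
    pool_in_segments : (P,) boolean mask, True for pool frames belonging to S."""
    nn = NearestNeighbors(n_neighbors=k).fit(pool_feats)
    _, idx = nn.kneighbors(frame_set)            # (M, k) neighbour indices
    votes = pool_in_segments[idx]                # True where a neighbour lies in S
    return float(np.sum(votes.mean(axis=1)))     # sum_m n_m^S / n_m

def select_top_frame_sets(candidate_sets, pool_feats, pool_in_segments, top_k=10, k=2000):
    """Rank the J candidate frame sets by confidence and keep the top K."""
    scores = [frame_set_confidence(fs, pool_feats, pool_in_segments, k)
              for fs in candidate_sets]
    order = np.argsort(scores)[::-1]
    return [candidate_sets[i] for i in order[:top_k]]
```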
Algorithm 1: The training process of the feature sequence representation of a human activity.
Input: The skeleton data $\{J_1, J_2, \ldots, J_t\}$ of a human activity.
Output: The feature sequence representation of the human activity.
1. Pre-process the input skeleton data with a moving average filter.
2. Compute the NRO features $\{F_1, F_2, \ldots, F_t\}$ of the skeletons according to Eq. (1).
3. Segment the feature sequence into pose feature segments and motion feature segments according to Eqs. (3) and (4).
4. for $j = 1, \ldots, J$ do
5.     Randomly select $M$ frames from a constant number of feature segments.
6.     Combine the selected frames to obtain a frame set $\tilde{F}^j = \{\tilde{F}^j_1, \tilde{F}^j_2, \ldots, \tilde{F}^j_M\}$.
7.     Construct a training pool for the kNN search of $\tilde{F}^j$.
8.     Compute the confidence value $C(\tilde{F}^j)$ of $\tilde{F}^j$ according to Eq. (5).
9. end
10. Select the top $K$ frame sets $\{\tilde{F}^{top_1}, \tilde{F}^{top_2}, \ldots, \tilde{F}^{top_K}\}$ according to their confidence values.
11. Use the K-Means method to cluster the frames in pose feature segments.
12. Extract atomic motion series directly to represent motion feature segments.
13. Return the feature sequence representation of the human activity.
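Putting the steps of Algorithm 1 together, a minimal training-side driver might look as follows. It reuses the helpers sketched earlier (compute_nro_features, segment_sequence, select_top_frame_sets); the data layout, the sampling of the frame sets and the construction of the training pool are simplified assumptions, not the authors' code.

```python
import numpy as np

def build_sequence_representation(joint_seq, joint_pairs, other_activity_feats,
                                  J=500, frames_per_segment=3, num_segments=5,
                                  top_k=10, k=2000, seed=0):
    """End-to-end sketch of Algorithm 1 for one (already smoothed) training sequence.
    joint_seq            : (T, num_joints, 3) skeleton sequence.
    other_activity_feats : (Q, D) NRO features sampled from other activities,
                           used to pad the kNN training pool."""
    rng = np.random.default_rng(seed)
    feats = np.stack([compute_nro_features(fr, joint_pairs) for fr in joint_seq])  # steps 1-2
    segments = segment_sequence(feats)                                             # step 3
    window = segments[:num_segments]          # a constant number of adjacent segments
    window_idx = np.concatenate([idx for _, idx in window])
    # Training pool: all frames of the window plus frames from other activities.
    pool_feats = np.vstack([feats[window_idx], other_activity_feats])
    pool_in_segments = np.r_[np.ones(len(window_idx), bool),
                             np.zeros(len(other_activity_feats), bool)]
    candidate_sets = []                                                            # steps 4-9
    for _ in range(J):
        picked = np.concatenate([rng.choice(idx, size=frames_per_segment)
                                 for _, idx in window])
        candidate_sets.append(feats[picked])
    # Steps 10-12: keep the K most confident frame sets; pose-segment frames are
    # later clustered with K-Means, motion-segment frames are kept as series.
    return select_top_frame_sets(candidate_sets, pool_feats, pool_in_segments, top_k, k)
```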
3.3. Representation of feature segments

In this subsection, we use the selected frame sets with high confidence values to represent the feature segments. As described above, a feature sequence is segmented into pose and motion feature segments according to the potential difference. We extract series of frames (i.e., atomic motion series) with different numbers of poses from the learnt top $K$ frame sets to represent the motion segments. Pose feature segments, in contrast, are composed of still poses or poses with tiny movements. In other words, pose feature segments indicate the normal states of the human body when no significant movement is being performed. Therefore, the pose feature segments provide much less discriminative information than the motion feature segments, and there is no need to represent them with many frames. Unlike the motion feature segments, for which series of frames are extracted directly, we use the K-Means method [33] to cluster the frames that fall in the same pose feature segment of the top $K$ selected frame sets. The reduced number of key poses also lowers the computational cost accordingly. As shown in Fig. 4, all frames in the same pose feature segment of the top $K$ selected frame sets are used as the input samples of K-Means, and the obtained cluster centroids are taken as the key poses of this segment. The atomic motion series in the motion feature segment of each selected frame set is extracted directly to represent that segment. The advantage of the proposed representation is that the motion feature segments of the same activity performed by different people are represented by atomic motion series whose numbers of frames vary with the movement velocity of the activity. Therefore, the centroids obtained by K-Means for pose feature segments and the atomic motion series for motion feature segments together represent a human activity efficiently and effectively. We have now described how a feature sequence of a human activity is represented in the training process; an overview is given in Algorithm 1.
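A sketch of the pose-segment representation using scikit-learn's KMeans; the centroid count of 10 follows the parameter settings reported in Section 4.1, and the rest of the interface is illustrative. Motion feature segments need no such step: each selected frame set simply contributes its atomic motion series as-is.

```python
import numpy as np
from sklearn.cluster import KMeans

def pose_segment_key_poses(top_sets_frames, n_key_poses=10, seed=0):
    """Represent one pose feature segment by K-Means centroids.
    top_sets_frames : (n, D) NRO features of all frames that the top-K selected
                      frame sets contribute to this pose segment.
    Returns (n_key_poses, D) centroids used as the key poses of the segment."""
    n_clusters = min(n_key_poses, len(top_sets_frames))   # guard against small segments
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(np.asarray(top_sets_frames, dtype=float))
    return km.cluster_centers_
```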
3.4. Dynamic matching of feature sequences

During the test stage, a test feature sequence is first segmented into pose and motion feature segments, as in the training stage. Secondly, unlike the training stage, we only select the frame set with the highest confidence to represent the sequence, based on the learning strategy described in Section 3.2. Finally, we split the test feature sequence into sub-sequences for matching, and calculate the distances between the sub-sequences of the test sequence and the training set. The key idea of dynamic matching is to find the best-matching activity pattern for each test sub-sequence in the training set. Suppose a sub-sequence of a test feature sequence is composed of $L$ adjacent segments, which include some pose and motion feature segments. It is then natural to split the matching problem between feature segments into two sub-problems (i.e., matching of pose feature segments and matching of motion feature segments). The distance between the pose feature segments of the test and training sequences is defined as

$$ D^c_s = \sqrt{\, n_s^{-1} \sum_{n=1}^{n_s} \left\| F^s_{pose,n} - K^{c,s}_{best} \right\|^2 }, \quad s = 1, 3, \ldots, \qquad (6) $$

where $n_s$ is the number of selected frames in the pose feature segment $s$ of the test sub-sequence, $F^s_{pose,n}$ is the $n$th selected frame in the segment, and $K^{c,s}_{best}$ is the best-matching key pose of $F^s_{pose,n}$ in the $s$th segment of the $c$th human activity in the training set. Since the pose and motion feature segments appear alternately in a feature sequence, and the pose feature segment appears first because a human activity always starts from some still poses, the index of pose feature segments (i.e., $s$ in Eq. (6)) is odd, while it is even for motion feature segments.

In order to measure the distance between atomic motion series, shapeDTW is used to compare two motion feature segments in our method. DTW is a dynamic-programming-based distance measure between two temporal sequences and is robust against variations in speed or style. Therefore, DTW is widely applied to match two feature sequences with different numbers of poses in human activity recognition. However, DTW does not consider the local structural information of feature sequences. Recently, shapeDTW [11] was proposed to enhance DTW by incorporating point-wise local structures into the matching process. The shapeDTW procedure used for matching two atomic motion series of test and training sub-sequences is described in Algorithm 2. It is worth pointing out that the compound shape descriptors [11], which are invariant to magnitude shift, are used to encode the frames $F_{sub,t}$ in the atomic motion series of a sub-sequence. More details about the shapeDTW algorithm can be found in [11].
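For the pose-segment half of the matching problem, Eq. (6) reduces to a nearest-key-pose lookup followed by a root-mean-square average; a small sketch under that reading (names are ours):

```python
import numpy as np

def pose_segment_distance(test_frames, key_poses):
    """Distance between a test pose segment and one training activity (Eq. (6)).
    test_frames : (n_s, D) selected frames of the test pose segment.
    key_poses   : (K, D) K-Means centroids of the corresponding training segment."""
    # Pairwise squared distances between test frames and key poses.
    d2 = ((test_frames[:, None, :] - key_poses[None, :, :]) ** 2).sum(axis=2)
    best = d2.min(axis=1)              # ||F_pose,n - K_best||^2 for each frame
    return float(np.sqrt(best.mean()))
```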
Fig. 3. Examples of frame sets with high confidence values identified from the training pool by using the proposed learning strategy on the Cornell CAD-60 dataset. (a) Drinking water. (b) Wearing contact lenses.
Fig. 4. Illustration of the representation of feature segments with the top K selected frame sets. All frames in the same pose feature segment of the top K selected frame sets are used for clustering with the K-Means method. A sequence of frames in the motion feature segment of each selected frame set is extracted directly to represent the segment.
As a consequence, given two atomic motion series in the motion feature segments of a test and a training sub-sequence, the distance between the two segments is computed by

$$ D^c_s = D_{shape}\left( F^{test}_{sub}, A^{c,s}_{best} \right), \quad s = 2, 4, \ldots, \qquad (7) $$

where $F^{test}_{sub}$ is the atomic motion series in one motion feature segment of the test sub-sequence, $A^{c,s}_{best}$ is the best-matching atomic motion series in the $s$th segment of the $c$th human activity of the training set, and $D_{shape}$ denotes the shapeDTW distance between the two atomic motion series. Once the key pose distances and atomic motion distances have been obtained according to Eqs. (6) and (7), the best-matching distance between the test and training sub-sequences is defined as

$$ D^c = \sum_{s=1}^{L} D^c_s. \qquad (8) $$

Finally, the naive Bayes nearest neighbour (NBNN) method is used to classify the test sub-sequence according to the obtained distances:

$$ C_{test} = \arg\min_{c} D^c. \qquad (9) $$
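The matching stage of Eqs. (6)-(9) can be sketched as a single loop over candidate activities. The data structures below are our own assumptions (per-segment models aligned with the test segments), pose_segment_distance is the sketch given after Eq. (6), and shape_dtw_distance is any shapeDTW distance function (a simplified version is sketched below, after Algorithm 2).

```python
import numpy as np

def classify_subsequence(test_segments, training_activities, shape_dtw_distance):
    """NBNN-style classification of one test sub-sequence (Eqs. (6)-(9)).
    test_segments       : list of (kind, data) in temporal order; kind is 'pose'
                          (selected frames) or 'motion' (atomic motion series).
    training_activities : dict label -> list of per-segment models: a (K, D) key-pose
                          array for pose segments, a list of candidate atomic motion
                          series for motion segments.
    shape_dtw_distance  : callable implementing the shapeDTW distance of Eq. (7)."""
    best_label, best_dist = None, np.inf
    for label, segments in training_activities.items():
        total = 0.0
        for (kind, data), model in zip(test_segments, segments):
            if kind == "pose":                                           # Eq. (6)
                total += pose_segment_distance(data, model)
            else:                                                        # Eq. (7): best-matching series
                total += min(shape_dtw_distance(data, series) for series in model)
        if total < best_dist:                                            # Eqs. (8)-(9)
            best_label, best_dist = label, total
    return best_label
```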
4. Experiments

In this section, we first present the parameter settings of our method in Section 4.1. In Sections 4.2-4.4, we evaluate the effectiveness of our method on three public RGB-D datasets (the Cornell CAD-60 dataset [25], the MSR Action3D dataset [34], and the MSRC-12 dataset [35]), and compare our results with some state-of-the-art methods. All experiments are conducted using MATLAB on an i7 quad-core machine.

4.1. Parameter settings

Several parameters of our method need to be determined, and all of them are fixed across all experiments to demonstrate the effectiveness of our method. Given a feature sequence of a human activity, it is segmented into segments according to the potential difference. When the potential difference is below the threshold $P_{min}$, which is set to 0.02, the corresponding parts of the feature sequence are treated as pose feature segments. The number of adjacent segments used for learning discriminative poses and matching feature sequences is set to 5, and the number of randomly selected frames per segment is 3. We randomly select frame sets $J$ = 500 times from the 5 adjacent feature segments, and the training pool for the kNN search consists of all the frames in the 5 segments and 10 000 randomly selected frames from the training feature sequences. Although the training process is off-line and in principle all frames could be used, we restrict the pool to 10 000 randomly selected frames to limit the time consumption. We then perform the kNN search with $k$ = 2000 and select the top $K$ = 10 frame sets to represent the segments. When representing feature segments, the number of centroids in K-Means and the number of series in motion feature segments are both set to 10.
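Collected in one place, the fixed settings above might be expressed as a small configuration object; the field names are our own, only the values come from the text.

```python
from dataclasses import dataclass

@dataclass
class RecognitionConfig:
    """Hyper-parameters stated in Section 4.1 (field names are ours)."""
    p_min: float = 0.02              # potential-difference threshold for pose segments
    num_segments: int = 5            # adjacent segments per sub-sequence
    frames_per_segment: int = 3      # randomly selected frames per segment
    num_frame_sets: int = 500        # J: randomly sampled frame sets
    pool_extra_frames: int = 10_000  # frames sampled from other training sequences
    knn_k: int = 2000                # k for the kNN confidence search
    top_k_sets: int = 10             # K: frame sets kept per activity
    n_key_poses: int = 10            # K-Means centroids per pose segment
    n_motion_series: int = 10        # atomic motion series kept per motion segment

CONFIG = RecognitionConfig()
```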
Algorithm 2: The shapeDTW algorithm for matching two series of atomic motion.
Input: Two series of atomic motion $F^{test}_{sub} = \{F^{test}_{sub,1}, F^{test}_{sub,2}, \ldots, F^{test}_{sub,T_1}\}$ and $F^{train}_{sub} = \{F^{train}_{sub,1}, F^{train}_{sub,2}, \ldots, F^{train}_{sub,T_2}\}$ in the motion feature segments of the test and training sub-sequences.
Output: The shapeDTW distance $D_{shape}$ between the two series of atomic motion.
1. Encode each pose descriptor of $F^{test}_{sub}$ and $F^{train}_{sub}$ by its shape descriptors:
2. for $t_1 = 1, \ldots, T_1$ and $t_2 = 1, \ldots, T_2$ do
3.     Encode $F^{test}_{sub,t_1}$ and $F^{train}_{sub,t_2}$ in $F^{test}_{sub}$ and $F^{train}_{sub}$ by the compound shape descriptors $desc(F^{test}_{sub,t_1})$ and $desc(F^{train}_{sub,t_2})$.
4. end
5. Align the descriptor sequences $desc(F^{test}_{sub})$ and $desc(F^{train}_{sub})$ by DTW:
   a. Initialize $D(1,1) = dist(desc\{F^{test}_{sub,1}\}, desc\{F^{train}_{sub,1}\})$, where $dist(\cdot,\cdot)$ is the distance between two shape descriptors;
   b. Initialize $D(1,2) = D(1,1) + dist(desc\{F^{test}_{sub,1}\}, desc\{F^{train}_{sub,2}\})$;
   c. $D(t_1, t_2) = \min\{D(t_1-1, t_2-1), D(t_1-1, t_2), D(t_1, t_2-1)\} + dist(desc\{F^{test}_{sub,t_1}\}, desc\{F^{train}_{sub,t_2}\})$.
6. Construct the warping matrices $W^{test}_{sub}$ and $W^{train}_{sub}$ in shapeDTW.
7. Calculate the shapeDTW distance by solving the optimization problem:
   $D_{shape} = \min_{W^{test}_{sub},\, W^{train}_{sub}} \left\| W^{test}_{sub} \cdot desc(F^{test}_{sub}) - W^{train}_{sub} \cdot desc(F^{train}_{sub}) \right\|_{1,2}$.

Table 1
Precision and recall of our method in the different locations of the Cornell CAD-60 dataset ("new person" setting).

Location        Activity                   Precision (%)   Recall (%)
Bathroom        Brushing teeth             100             90
                Rinsing mouth              96.2            100
                Wearing contact lens       85.8            100
                Average                    94.0            96.7
Bedroom         Talking on phone           95.8            100
                Drinking water             100             94.3
                Opening pill container     100             96.7
                Average                    98.6            97.0
Kitchen         Cooking (chopping)         92.8            99.9
                Cooking (stirring)         100             92.3
                Drinking water             100             94.3
                Opening pill container     100             99.1
                Average                    98.2            96.4
Living room     Talking on phone           97.4            100
                Drinking water             100             94.3
                Talking on couch           100             100
                Relaxing on couch          100             100
                Average                    99.4            98.6
Office          Talking on phone           97.4            100
                Writing on white board     97.6            97.6
                Drinking water             100             94.3
                Working on computer        100             100
                Average                    98.8            98.0
Global average                             97.8            97.3
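The sketch below mirrors the structure of Algorithm 2 in Python: every frame is first encoded by a local shape descriptor, then the descriptor sequences are aligned by DTW. The compound HOG1D-plus-raw descriptor of [11] is replaced here by a simple mean-subtracted temporal window, so this is only a structural illustration, not the authors' implementation. The function can be passed as the shape_dtw_distance callable used in the classification sketch in Section 3.4.

```python
import numpy as np

def local_shape_descriptors(series, width=5):
    """Encode every frame of a (T, D) series by its temporal neighbourhood.
    Each frame is described by a window of `width` neighbouring frames,
    mean-subtracted to give some invariance to magnitude shift (a simplified
    stand-in for the compound shape descriptors of [11])."""
    T, _ = series.shape
    half = width // 2
    padded = np.pad(series, ((half, half), (0, 0)), mode="edge")
    desc = np.stack([padded[t:t + width].ravel() for t in range(T)])
    return desc - desc.mean(axis=1, keepdims=True)

def shape_dtw_distance(test_series, train_series, width=5):
    """shapeDTW-style distance: DTW run on local shape descriptors."""
    a = local_shape_descriptors(np.asarray(test_series, float), width)
    b = local_shape_descriptors(np.asarray(train_series, float), width)
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Standard DTW recursion over the descriptor sequences (step 5c).
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return float(D[T1, T2])
```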
4.2. Cornell CAD-60 dataset

The Cornell CAD-60 dataset [25] is a daily activity dataset of depth sequences captured by a depth camera (i.e., the Microsoft Kinect sensor). The dataset consists of 12 types of activities: "brushing teeth", "cooking (chopping)", "cooking (stirring)", "drinking water", "opening pill container", "relaxing on couch", "rinsing mouth with water", "talking on couch", "talking on the phone", "wearing contact lenses", "working on computer", and "writing on white board". Each activity is performed at least once by 4 persons (2 males and 2 females) in 5 different locations ("bathroom", "bedroom", "kitchen", "living room", "office"). Three persons are right-handed and the remaining one is left-handed. A skeleton in a frame consists of 15 joints, which are divided into five limbs, so the feature vector of a skeleton is composed of 15 NRO features (i.e., 3 NRO features per limb). Two experimental settings ("new person" and "have seen") are introduced in [25]; we adopt the "new person" setting (i.e., leave-one-out cross-validation over the human subjects). In order to make our method adaptive to left-handed subjects, we mirror the training skeleton data across the plane that vertically bisects the person, while the test feature sequences are classified without any mirroring operation.

The performance of our method in terms of precision and recall for each activity in the 5 locations is given in Table 1. As can be seen, our method achieves strong results in all locations, with the best performance obtained in the living room environment. The bathroom environment, which includes brushing teeth, rinsing mouth, and wearing contact lenses, is the most challenging case, with an average precision and recall of 94% and 96.7%. The recognition results for most activities that involve tiny movements are excellent, while the wearing contact lens activity has the highest misrecognition rate among all activities.

We also compare our method with other state-of-the-art skeleton-based methods [8-10,24,25,27,28,36-41] in terms of precision and recall. All the recognition results of these methods in Table 2 are obtained on the Cornell benchmark [25]. As shown in the table, our method outperforms all the other state-of-the-art methods by achieving 97.8% precision and 97.3% recall, and in particular improves on the pose kinetic energy [9] and multi-layer codebooks [10] methods, which are also based on the segmentation technique. The results demonstrate that our method is able to learn discriminative frames to represent feature segments, and that the dynamic representation and matching of motion feature segments is robust to temporal stretching.

Table 2
Performance of our method in comparison with state-of-the-art methods on the Cornell CAD-60 dataset.

Method                              Precision (%)   Recall (%)
MEMM [25]                           67.9            55.5
SSVM [27]                           80.8            71.4
Kinematic feature [36]              86              84
Eigenjoints [28]                    71.9            66.6
HMM+GMM [24]                        70              78
Image fusion [37]                   75.9            69.5
Depth images segment [38]           78.1            75.4
Spatiotemporal interest pt. [39]    93.2            84.6
Probabilistic body motion [40]      91.1            91.9
Self-organizing neural int. [41]    91.9            90.2
Cippitelli et al. [8]               93.9            93.5
Pose kinetic energy [9]             93.8            94.5
Multi-layer codebooks [10]          97.4            95.8
Our method                          97.8            97.3

4.3. MSR Action3D dataset

The MSR Action3D dataset [34] is a challenging activity dataset due to the presence of similar and complex gestures. It contains 20 activity types performed 2-3 times by 10 persons, for a total of 567 activity samples. The 20 activities are divided into three subsets, each having 8 activities.
For a quantitative evaluation of the recognition performance, we perform experiments separately on each subset using the same "new person" setting as in [34], where the feature sequences of 5 persons are used for training and the remaining sequences are used for testing. The average accuracies on the three subsets obtained by our method and by several other state-of-the-art methods [9,10,15,18,22,28,34,42-45] are given in Table 3. As can be seen, our method improves the average accuracy over the pose kinetic energy [9] and multi-layer codebooks [10] methods, which are based on the segmentation technique, by 6.8% and 6.2%, respectively. Our method also achieves competitive performance in comparison with the moving pose method [22], which utilizes the pose, speed, and acceleration of joints. Even compared with state-of-the-art methods using both depth images and joint coordinates, our method still outperforms [15,18,34,42,43] and is competitive with [44,45]. The confusion matrix of our method on the MSR Action3D dataset is shown in Fig. 5. It can be seen that our method works very well on most activities; misrecognitions occur when two activities involve similar interactions. Our method does not achieve the same level of performance on the MSR Action3D dataset as on the CAD-60 dataset, because the MSR Action3D dataset has larger inter-person variance, more actors, and more corrupted and occluded poses than the CAD-60 dataset.

Fig. 5. Confusion matrix of our method on the MSR Action3D dataset.

Table 3
Average accuracy of our method in comparison with state-of-the-art methods on the MSR Action3D dataset.

Method                           Used depth images   Average accuracy (%)
Bag of 3D points [34]            Yes                 74.7
Histogram of 3D Joints [15]      Yes                 78.9
Random occupancy pattern [42]    Yes                 86.2
Actionlet ensemble [43]          Yes                 88.2
HON4D + Ddesc [18]               Yes                 88.9
MMTW [44]                        Yes                 92.7
SNV [45]                         Yes                 93.09
Eigenjoints [28]                 No                  82.3
Pose kinetic energy [9]          No                  84
Multi-layer codebooks [10]       No                  84.6
Moving pose [22]                 No                  91.7
Our method                       No                  90.8

4.4. MSRC-12 dataset

The MSRC-12 activity dataset [35] is collected with the Microsoft Kinect sensor, which provides a noisy estimate of 20 human joints. There are 6,244 gesture instances in 594 videos in the dataset. It consists of 6 iconic and 6 metaphoric gestures performed by 30 persons; we perform experiments on the 6 iconic gestures (crouch, put on goggles, shoot pistol, throw object, change weapon, kick) by employing leave-person-out cross-validation as in [10,46,47]. Because the demographics of the participants include 7% left-handed subjects, we also mirror the skeleton data of the training set, as for the CAD-60 dataset. Table 4 shows the average accuracy of our method in comparison with some state-of-the-art methods [10,46,47] on the MSRC-12 dataset. Our method outperforms all the other methods and improves on the multi-layer codebooks method, which is based on the segmentation technique, by 2.9% in terms of average accuracy.

Table 4
Average accuracy of our method in comparison with state-of-the-art methods on the MSRC-12 dataset.

Method                             Average accuracy (%)
Nonlinear Markov models [46]       90.9
Enhanced sequence matching [47]    96.7
Multi-layer codebooks [10]         94.0
Our method                         96.9

5. Conclusion

In this paper, we have introduced the dynamic representation and matching of feature segments for human activity recognition. Our motivation is based on the observation that the dynamic representation of feature segments with different numbers of atomic motions can provide more temporal information than a representation with a fixed number of atomic motions. As demonstrated in the paper, the proposed learning strategy effectively selects discriminative frames for the dynamic representation of feature sequences. Moreover, the recognition task is formulated as the dynamic matching of atomic motion series between feature sequences with shapeDTW. Experimental results on three public activity datasets have demonstrated that our method effectively improves the recognition performance and outperforms several state-of-the-art methods (especially some methods based on the segmentation technique).

Acknowledgement

This work was supported by the China Postdoctoral Science Foundation (No. 2017M612145).

References
[1] L. Wang, Y. Qiao, X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4305-4314.
[2] B. Zhang, L. Wang, Z. Wang, Y. Qiao, H. Wang, Real-time action recognition with enhanced motion vector CNNs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2718-2726.
[3] H.S. Koppula, A. Saxena, Anticipating human activities using object affordances for reactive robotic response, IEEE Trans. Pattern Anal. Mach. Intell. 38 (1) (2016) 14-29.
[4] L. Liu, Y. Peng, M. Liu, Z. Huang, Sensor-based human activity recognition system with a multilayered model using time series shapelets, Knowl. Based Syst. 90 (2015) 138-152.
[5] A. Chaaraoui, J.R. Padilla-Lopez, F. Flórez-Revuelta, Fusion of skeletal and silhouette-based features for human action recognition with RGB-D devices, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 91-97.
[6] A.A. Chaaraoui, J.R. Padilla-López, P. Climent-Pérez, F. Flórez-Revuelta, Evolutionary joint selection to improve human action recognition with RGB-D devices, Expert Syst. Appl. 41 (3) (2014) 786-794.
[7] S. Baysal, M.C. Kurt, P. Duygulu, Recognizing human actions using key poses, in: Proceedings of the International Conference on Pattern Recognition, 2010, pp. 1727-1730.
[8] E. Cippitelli, S. Gasparrini, E. Gambi, S. Spinsante, A human activity recognition system using skeleton data from RGB-D sensors, in: Computational Intelligence and Neuroscience, 2016, pp. 1-14.
[9] J. Shan, S. Akella, 3D human action segmentation and recognition using pose kinetic energy, in: Proceedings of the IEEE Workshop on Advanced Robotics and its Social Impacts, 2014, pp. 69-75.
[10] G. Zhu, L. Zhang, P. Shen, J. Song, Human action recognition using multi-layer codebooks of key poses and atomic motions, Signal Process., Image Commun. 42 (2016) 19-30.
[11] J. Zhao, L. Itti, shapeDTW: shape Dynamic Time Warping, 2016, arXiv:1606.01601.
[12] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, R. Moore, Real-time human pose recognition in parts from single depth images, Commun. ACM 56 (1) (2013) 116-124.
[13] J.R. Padilla-López, A.A. Chaaraoui, F. Flórez-Revuelta, A discussion on the validation tests employed to compare human action recognition methods using the MSR Action3D dataset, 2014, arXiv:1407.7390.
[14] L. Gan, F. Chen, Human action recognition using APJ3D and random forests, J. Softw. 8 (9) (2013) 2238-2245.
[15] L. Xia, C.C. Chen, J.K. Aggarwal, View invariant human action recognition using histograms of 3D joints, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012, pp. 20-27.
[16] G. Lu, Y. Zhou, X. Li, M. Kudo, Efficient action recognition via local position offset of 3D skeletal body joints, Multimed. Tools Appl. 75 (6) (2016) 3479-3494.
[17] M.A. Gowayyed, M. Torki, M.E. Hussein, M. El-Saban, Histogram of oriented displacements (HOD): Describing trajectories of human joints for action recognition, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013, pp. 1351-1357.
[18] O. Oreifej, Z. Liu, HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 716-723.
[19] J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1290-1297.
[20] N.P. Cuntoor, R. Chellappa, Key frame-based activity representation using antieigenvalues, in: Asian Conference on Computer Vision, 2006, pp. 499-508.
[21] C. Ellis, S.Z. Masood, M.F. Tappen, J.J. LaViola, R. Sukthankar, Exploring the trade-off between accuracy and observational latency in action recognition, Int. J. Comput. Vis. 101 (3) (2013) 420-436.
[22] M. Zanfir, M. Leordeanu, C. Sminchisescu, The moving pose: an efficient 3D kinematics descriptor for low-latency action recognition and detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2752-2759.
[23] S. Gaglio, G.I. Re, M. Morana, Human activity recognition process using 3-D posture data, IEEE Trans. Human-Mach. Syst. 45 (5) (2015) 586-597.
[24] L. Piyathilaka, S. Kodagoda, Gaussian mixture based HMM for human daily activity recognition using 3D skeleton features, Ind. Electron. Appl. (2013) 567-572.
[25] J. Sung, C. Ponce, B. Selman, A. Saxena, Unstructured human activity detection from RGBD images, in: Proceedings of the IEEE Conference on Robotics and Automation, 2011, pp. 47-55.
[26] A. Taha, H.H. Zayed, M.E. Khalifa, E.S.M. El-Horbaty, Human activity recognition for surveillance applications, in: Proceedings of the 7th International Conference on Information Technology, 2015, pp. 577-586.
[27] H.S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from RGB-D videos, Int. J. Robot. Res. 32 (8) (2013) 951-970.
[28] X. Yang, Y. Tian, Effective 3D action recognition using EigenJoints, J. Vis. Commun. Image Represent. 25 (1) (2014) 2-11.
[29] S. Sempena, N.U. Maulidevi, P.R. Aryan, Human action recognition using Dynamic Time Warping, in: Proceedings of the IEEE International Conference on Electrical Engineering and Informatics, 2011, pp. 1-5.
[30] R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3D skeletons as points in a Lie group, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588-595.
[31] L. Liu, L. Shao, X. Zhen, X. Li, Learning discriminative key poses for action recognition, IEEE Trans. Cybern. 43 (6) (2013) 1860-1870.
[32] S. Cheema, A. Eweiwi, C. Thurau, C. Bauckhage, Action recognition by learning discriminative key poses, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2011, pp. 1302-1309.
[33] J. Macqueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[34] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3D points, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2010, pp. 9-14.
[35] S. Fothergill, H. Mentis, P. Kohli, S. Nowozin, Instructing people for training gestural interactive systems, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2012, pp. 1737-1746.
[36] C. Zhang, Y. Tian, RGB-D camera-based daily living activity recognition, J. Comput. Vis. Image Process. 2 (4) (2012) 1-12.
[37] B. Ni, Y. Pei, P. Moulin, S. Yan, Multilevel depth and image fusion for human activity detection, IEEE Trans. Cybern. 43 (5) (2013) 1383-1394.
[38] G. Raj Gupta, Y.S. Chia, D. Rajan, Human activities recognition using depth images, in: Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 283-292.
[39] Y. Zhu, W. Chen, G. Guo, Evaluating spatiotemporal interest point features for depth-based action recognition, Image Vis. Comput. 32 (8) (2014) 453-464.
[40] D.R. Faria, C. Premebida, U. Nunes, A probabilistic approach for human everyday activities recognition using body motion from RGB-D images, in: Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication, 2014, pp. 732-737.
[41] G.I. Parisi, C. Weber, S. Wermter, Self-organizing neural integration of pose-motion features for human action recognition, Front. Neurorobotics (2015) 1-9.
[42] J. Wang, Z. Liu, J. Chorowski, Z. Chen, Y. Wu, Robust 3D action recognition with random occupancy patterns, in: Proceedings of the European Conference on Computer Vision, 2012, pp. 872-885.
[43] J. Wang, Z. Liu, Y. Wu, J. Yuan, Learning actionlet ensemble for 3D human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 36 (5) (2014) 914-927.
[44] J. Wang, Y. Wu, Learning maximum margin temporal warping for action recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2688-2695.
[45] X. Yang, Y.L. Tian, Super normal vector for activity recognition using depth sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 804-811.
[46] A.M. Lehrmann, P.V. Gehler, S. Nowozin, Efficient nonlinear Markov models for human motion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1314-1321.
[47] H.J. Jung, K.S. Hong, Enhanced sequence matching for action recognition from 3D skeletal data, in: Proceedings of the 12th Asian Conference on Computer Vision, 2014, pp. 226-240.