Signal Processing: Image Communication (journal homepage: www.elsevier.com/locate/image)
Human activity recognition using dynamic representation and matching of skeleton feature sequences from RGB-D images

Qiming Li a, Wenxiong Lin b, Jun Li a,*

a Quanzhou Institute of Equipment Manufacturing, Haixi Institute, Chinese Academy of Sciences, Quanzhou, Fujian, 362216, China
b Fujian Institute of Research on the Structure of Matter, Chinese Academy of Sciences, Fuzhou, Fujian, 350002, China
ARTICLE INFO

Keywords: Human activity recognition; Dynamic representation and matching; Shape dynamic time warping
ABSTRACT

Segmenting a skeleton feature sequence into pose and motion feature segments can improve the effectiveness of key-pose-based human activity recognition. Nevertheless, representing feature segments with a fixed number of atomic motions results in the loss of temporal information, which makes the segmentation technique hard to adapt to all types of feature sequences. To address this issue, this paper proposes a human activity recognition method that builds on the segmentation technique and uses dynamic representation and matching of skeleton feature sequences. In our method, a skeleton feature sequence is first segmented into key pose and atomic motion segments according to the potential differences. Afterwards, a learning strategy is proposed to select frame sets with high confidence values, and we then extract series of poses (i.e., atomic motion series) with different numbers of poses from the learnt frame sets to represent motion feature segments, while a fixed number of centroids obtained by K-Means are used to represent pose feature segments. Finally, the shape dynamic time warping (shapeDTW) algorithm is utilized to measure the distance between the corresponding motion feature segments of two feature sequences. Comprehensive experiments are conducted on three public human activity datasets, and the results show that our method achieves superior performance in comparison with several state-of-the-art methods.
1. Introduction

Human activity recognition is a fundamental requirement in many applications of computer vision [1,2], robotics [3], and wearable sensors [4]. In particular, with the rapid development of intelligent personal assistive robotics, reliably detecting and recognizing daily activities has attracted considerable attention in the research community. However, it is still a challenging task to recognize people's daily activities using only RGB image sequences. Thanks to the availability of inexpensive depth cameras such as the Microsoft Kinect sensor, it has become possible to capture depth information in real time and to recognize activities accurately with the skeleton data extracted from RGB-D images.

Among the skeleton-based human activity recognition methods, many previous works [5-8] focus on key poses. The key idea shared by these methods is to learn the most discriminative key poses, which can effectively grasp the distinguishing nature of an activity. Activity recognition based on key poses can then be treated as a sequence matching task. However, most traditional key pose extraction algorithms do not consider the temporal sequence information among key poses. To address this issue, recent
methods [9,10] propose to segment the feature sequences of an activity into key pose and atomic motion segments according to the potential differences. In these works, the representation of a feature sequence is composed of poses with tiny or significant movements, and the "Key pose-Atomic motion-Key pose" representation is beneficial for extracting high-level features and improving the performance of activity recognition. Nevertheless, these works use a fixed number of atomic motions to represent the motion process, and do not adopt any strategy to extract discriminative poses for feature sequence representation. As a consequence, some temporal information of a complex motion process is inevitably lost. In addition, not all of the obtained atomic motions are representative of the motion process.

In this paper, taking full advantage of the segmentation technique, we propose a human activity recognition method using dynamic representation and matching of skeleton feature sequences from RGB-D images. A skeleton feature sequence is first segmented into key pose and motion feature segments according to the potential differences. Afterwards, a learning strategy is proposed to select frame sets with high confidence values, and we then extract series of poses (i.e., atomic motion series) with different numbers of poses from the learnt
* Corresponding author. E-mail address: [email protected] (J. Li).
https://doi.org/10.1016/j.image.2018.06.013
Received 18 January 2018; Received in revised form 24 June 2018; Accepted 24 June 2018; Available online xxxx
0923-5965/© 2018 Published by Elsevier B.V.
Fig. 1. Architecture of dynamic representation and matching of feature sequences. One atomic motion series in segment 1 (motion feature segment) of sequence 1 contains 3 atomic motions, while another atomic motion series in segment 1 of sequence 2 contains 4 atomic motions.
frame sets to represent motion feature segments, while a fixed number of centroids obtained by K-Means are used to represent pose feature segments. Finally, in order to match the corresponding motion feature segments of two feature sequences, a recent alignment algorithm, named shapeDTW [11], is used to measure the distance between two atomic motion series in the motion feature segments for the nearest neighbour classifier.

Fig. 1 shows the architecture of dynamic representation and matching of feature sequences in our method. As shown in the figure, atomic motion series with different numbers of discriminative poses are extracted for motion feature segments according to their different movement velocities, and shapeDTW is used to measure the distance between the corresponding motion feature segments. The advantages of our method are threefold. Firstly, the temporal information among key pose segments is well preserved by the discriminative frame sets obtained with the proposed learning strategy. Secondly, shapeDTW, which considers both the global optimal solution and local structural information, is applied to match two feature sequences. Thirdly, the dynamic representation and matching of motion feature segments makes our method adaptive to feature sequences with different movement velocities.

The remainder of this paper is organized as follows. Section 2 reviews related work on human activity recognition based on skeleton data. Section 3 describes the details of our method. Section 4 presents extensive experiments that validate the effectiveness of our method on several benchmark datasets. Section 5 concludes the paper.
2. Related work

Since the introduction of a real-time skeleton capturing method [12], many skeleton-based human activity recognition approaches [13-18] have been proposed. For example, Gan and Chen [14] developed a skeleton-based representation named APJ3D to describe a 3D human posture, and exploited an improved Fourier temporal pyramid and random forests to recognize activities. Other studies focused on building descriptors from histograms of skeleton data, such as histograms of oriented displacements (HOD) [17], histograms of 3D joint locations (HOJ3D) [15], and histograms of oriented 4D normals (HON4D) [18]. Moreover, in order to make the skeleton data invariant to the position of human joints in the world coordinate system, several methods [8,10,19] used a set of distance vectors that connect different coordinates of human joints to represent human postures.

Recently, many researchers [5,7-10,20-23] have observed that a human activity can be viewed as a sequence of key poses. In [20], an antieigenvalue-based method was presented to detect key frames by investigating the properties of operators that transform past states to observed future states of human activities. Other methods [5,23] performed the selection of key poses by means of clustering algorithms. Nevertheless, not all frames are representative and discriminative for a human activity, and key poses obtained from the clustering centroids of such frames can negatively impact the recognition performance. Therefore, recent models [21,22] used learning strategies to select discriminative key frames from feature sequences. As a new trend, there is significant interest [9,10] in segmenting feature sequences and then extracting key poses in each segment. These works argue that a feature sequence is composed of poses with tiny and significant movements, and the segmentation is carried out according to the potential differences. However, most of these methods use a fixed number of key poses to represent feature segments. Therefore, this paper focuses on selecting a dynamic number of representative poses for the motion feature segments.

Several classification methods, such as the hidden Markov model (HMM) [24,25], the support vector machine (SVM) [23,26,27], the naive Bayes nearest neighbour (NBNN) [10,28], and dynamic time warping (DTW) [29], have been exploited to classify high-level feature sequences for human activity recognition. Piyathilaka and Kodagoda [24] exploited a Gaussian mixture model (GMM) based HMM, which clusters data into different groups as a collection of multinomial Gaussian distributions, to detect human activities. A multiclass SVM was used to estimate the human activity in [23,26]. In [10], high-level key poses and atomic motions were extracted from feature sequences, and the NBNN algorithm was employed to classify human activities. Because DTW is robust against variations in speed or style, it was utilized in [29] as a distance measure between two activities. Recently, shapeDTW [11] was proposed to enhance DTW by taking point-wise local structural information into consideration. In order to exploit the local information between poses in a feature sequence, we use shapeDTW to measure the distance between two feature sequences for the nearest neighbour classifier in this paper.
3. Our method

This section gives a detailed description of the dynamic representation and matching of feature sequences for human activity recognition. Firstly, the feature extraction process for skeleton data and the segmentation technique for feature sequences are explained. Secondly, the learning strategy for discriminative frame sets is described. Thirdly, the representation of the key pose and atomic motion segments in a feature sequence is presented. Finally, the dynamic matching of feature sequences is introduced.

3.1. Feature sequence segmentation

In RGB-D datasets, the coordinates of the joints in each skeleton are extracted to represent postures of the human body. However, because of the occlusion of different body parts, the coordinates estimated from RGB-D images may sometimes be corrupted. Errors and noise caused by the corrupted data severely degrade the effectiveness of the skeleton data. Therefore, in our implementation we apply a simple moving average to smooth the skeleton data and alleviate the influence of corrupted frames. After smoothing the skeleton data, we compute spatial features between joint coordinates to represent a frame. Several methods for spatial feature extraction are introduced in [6,8,10,30]. In our method, we adopt the normalized relative orientation (NRO) feature [10], which is insensitive to the human body's height, limb length, and distance to the camera. An NRO feature is computed between two relative joints that rotate around each other, such as the elbow and shoulder joints. Suppose that a human skeleton in a frame is composed of $N$ pairs of relative joints, and let $J_{i1}$ and $J_{i2}$ denote the two joints of the $i$th pair. The $i$th NRO feature $f_i$ in a skeleton is computed as

$$ f_i = \frac{J_{i1} - J_{i2}}{\left\| J_{i1} - J_{i2} \right\|}, \quad i = 1, 2, \ldots, N, \qquad (1) $$

where $\|\cdot\|$ denotes the Euclidean distance. A posture feature vector for one skeleton frame is then formulated as

$$ F = [f_1, f_2, \ldots, f_N]. \qquad (2) $$

The NRO features in a skeleton frame are grouped into 5 different limbs (i.e., the left upper limb, the right upper limb, the left lower limb, the right lower limb, and the torso). In the cases of 15 and 20 joints, there are 3 and 4 NRO features per limb (i.e., 3 x 5 = 15 and 4 x 5 = 20 NRO features in a skeleton frame), respectively. More details about the NRO feature vector can be found in [10].
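To make the feature extraction concrete, the following Python/NumPy sketch smooths a skeleton sequence and computes the NRO feature vector of Eqs. (1)-(2). It is an illustrative sketch only, not the authors' implementation (the paper reports a MATLAB implementation); the window size, joint-pair list and array shapes are our own assumptions.

```python
import numpy as np

def smooth_skeleton(joint_seq, window=3):
    """Simple moving average over time to suppress noisy joint estimates.
    joint_seq : (T, num_joints, 3) array of raw joint coordinates."""
    kernel = np.ones(window) / window
    flat = joint_seq.reshape(joint_seq.shape[0], -1)
    smoothed = np.apply_along_axis(lambda x: np.convolve(x, kernel, mode="same"), 0, flat)
    return smoothed.reshape(joint_seq.shape)

def compute_nro_features(joints, joint_pairs):
    """NRO feature vector of one skeleton frame (Eqs. (1)-(2)).
    joints      : (num_joints, 3) joint coordinates of a single frame.
    joint_pairs : list of (i1, i2) index pairs of relative joints, e.g. (elbow, shoulder);
                  the pairing is an assumption, the paper groups NRO features by limb.
    Returns the flat posture vector F = [f_1, ..., f_N]."""
    feats = []
    for i1, i2 in joint_pairs:
        diff = joints[i1] - joints[i2]
        norm = np.linalg.norm(diff)
        # Guard against degenerate (corrupted) joints at zero distance.
        feats.append(diff / norm if norm > 1e-8 else np.zeros(3))
    return np.concatenate(feats)
```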
Intuitively, a human activity can be viewed as a sequence of poses with tiny and significant movements. Therefore, we segment the feature sequence into pose and motion feature segments according to the potential difference [9,10], which measures the degree of movement of a posture. The potential difference between two frames in a feature sequence is calculated by

$$ P_d(t) = E_p(t) - E_p(t-1), \qquad (3) $$

where $E_p(t) = \| F_t - F_1 \|$ is the difference between the NRO feature vectors of the $t$th and the first frame. The frames of a feature sequence are labelled as pose feature segments when their potential difference satisfies

$$ |P_d(t)| < P_{min}, \qquad (4) $$

where $P_{min}$ is a given threshold. Otherwise, they are labelled as motion feature segments. Fig. 2 visualizes the segmentation results of the "drink water" data sequence in the Cornell CAD-60 dataset for different threshold values of the minimum potential difference. As shown in Fig. 2, the obtained pose and motion feature segments appear alternately in a feature sequence. Atomic motion segments consist of the movements between two stationary states of a human activity, and provide more discriminative information than pose feature segments. Furthermore, when the threshold $P_{min}$ is less than 0.02 in our case, some corrupted frames (marked by the blue square in Fig. 2(a) for $P_{min}$ = 0.01) still remain, and the proposed method obtains a low accuracy. When the threshold $P_{min}$ is greater than 0.02, the performance does not improve either, since some discriminative and representative poses are smoothed out by the threshold. Therefore, we set the threshold $P_{min}$ to 0.02 in our implementation.

Fig. 2. Illustration of the segmentation results of the "drink water" data sequence in the Cornell CAD-60 dataset according to different threshold values of the minimum potential difference. (a) Segmentation results of the "drink water" data sequence for $P_{min}$ = 0.01, 0.02, and 0.04. (b) The precision of the proposed method on the "drink water" data sequence for different values of $P_{min}$.
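A compact sketch of this segmentation step, assuming the NRO feature matrix produced by the previous sketch; the output format (label, frame indices) is our own illustrative choice.

```python
import numpy as np

def segment_sequence(features, p_min=0.02):
    """Split a feature sequence into pose / motion segments (Eqs. (3)-(4)).
    features : (T, D) array of NRO feature vectors F_1..F_T."""
    T = features.shape[0]
    energy = np.linalg.norm(features - features[0], axis=1)   # E_p(t) = ||F_t - F_1||
    pot_diff = np.diff(energy, prepend=energy[0])              # P_d(t) = E_p(t) - E_p(t-1)
    labels = np.where(np.abs(pot_diff) < p_min, "pose", "motion")
    segments, start = [], 0
    for t in range(1, T + 1):
        # Close a segment whenever the label changes or the sequence ends.
        if t == T or labels[t] != labels[start]:
            segments.append((labels[start], list(range(start, t))))
            start = t
    return segments
```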
3.2. Learning discriminative frame sets

Generally, key poses are generated by clustering techniques in most traditional recognition methods [10,31,32]. However, not all of the frames used for clustering are discriminative, and some frames may belong to corrupted data. In our method, we propose a learning strategy to select discriminative poses from a constant number of segments. The key idea of our solution is to perform a k-nearest neighbours (kNN) search on some randomly selected frames. More specifically, we randomly select a frame set $J$ times, and let $\tilde{F}^j = \{\tilde{F}^j_1, \tilde{F}^j_2, \ldots, \tilde{F}^j_M\}$ be the $j$th selected set, which consists of $M$ randomly selected frames from a constant number of feature segments $S$. In order to learn the most discriminative frame sets, we construct a training pool, which contains all frames in $S$ and some randomly sampled frames from other activity sequences, for the kNN search of each selected frame in $\tilde{F}^j$. The confidence function $C(\tilde{F}^j)$ of $\tilde{F}^j$ is computed from the votes of the $k$ nearest neighbours of each selected frame:

$$ C(\tilde{F}^j) = P(S \mid \tilde{F}^j) = \sum_{m=1}^{M} P(S \mid \tilde{F}^j_m) = \sum_{m=1}^{M} \frac{N(\tilde{F}^j_m, S)}{N(\tilde{F}^j_m)} = \sum_{m=1}^{M} \frac{n^S_m}{n_m}, \qquad (5) $$
where $P(S \mid \tilde{F}^j)$ denotes the probability that the retrieved $k$ nearest neighbours of $\tilde{F}^j$ belong to the same feature segments $S$ as $\tilde{F}^j$; $n_m$ is the number of nearest neighbours of $\tilde{F}^j_m$ in the training pool, and $n^S_m$ is the number of those neighbours that belong to the same feature segments $S$ as $\tilde{F}^j_m$. We select the top $K$ frame sets with the highest confidence values; the selected frame sets of one activity can be viewed as the most representative characteristics that distinguish it from other activities. Through experiments, we observe that this learning strategy works well in practice if $J$ (i.e., the number of randomly selected frame sets) and $k$ (i.e., the number of nearest neighbours) are large enough. The confidence function captures the intuition that, if it is high, the selected frame sets include a high proportion of frames that are the most representative of the human activity; if it is low, the selected frames are usually common, shared ones, such as standing poses from the start and the end of each sequence. Moreover, corrupted frames, which have far fewer nearest neighbours than the representative frames in the same feature segments, are discarded by this learning strategy. Two examples of frame sets with high confidence values are visualized in Fig. 3.
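A sketch of this confidence computation, assuming each frame is represented by its NRO vector. NearestNeighbors from scikit-learn is used only as a convenient kNN search; the function and variable names are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def frame_set_confidence(frame_set, pool_feats, pool_in_segments, k=2000):
    """Confidence C(F~^j) of one randomly selected frame set (Eq. (5)).
    frame_set        : (M, D) features of the M randomly selected frames.
    pool_feats       : (P, D) training pool (all frames of the S segments plus
                       randomly sampled frames from other activities).
    pool_in_segments : (P,) boolean mask, True for pool frames belonging to S."""
    nn = NearestNeighbors(n_neighbors=k).fit(pool_feats)
    _, idx = nn.kneighbors(frame_set)            # (M, k) neighbour indices
    votes = pool_in_segments[idx]                # True where a neighbour lies in S
    return float(np.sum(votes.mean(axis=1)))     # sum_m n_m^S / n_m

def select_top_frame_sets(candidate_sets, pool_feats, pool_in_segments, top_k=10, k=2000):
    """Rank the J candidate frame sets by confidence and keep the top K."""
    scores = [frame_set_confidence(fs, pool_feats, pool_in_segments, k)
              for fs in candidate_sets]
    order = np.argsort(scores)[::-1]
    return [candidate_sets[i] for i in order[:top_k]]
```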
Algorithm 1: The training process of the feature sequence representation of a human activity.
Input: The skeleton data $\{J_1, J_2, \ldots, J_t\}$ of a human activity.
Output: The feature sequence representation of the human activity.
1. Pre-process the input skeleton data with a moving average filter.
2. Compute the NRO features $\{F_1, F_2, \ldots, F_t\}$ of the skeletons according to Eq. (1).
3. Segment the feature sequence into pose feature segments and motion feature segments according to Eqs. (3) and (4).
4. for $j = 1, \ldots, J$ do
5.     Randomly select $M$ frames from a constant number of feature segments.
6.     Combine the selected frames to obtain a frame set $\tilde{F}^j = \{\tilde{F}^j_1, \tilde{F}^j_2, \ldots, \tilde{F}^j_M\}$.
7.     Construct a training pool for the kNN search of $\tilde{F}^j$.
8.     Compute the confidence value $C(\tilde{F}^j)$ of $\tilde{F}^j$ according to Eq. (5).
9. end
10. Select the top $K$ frame sets $\{\tilde{F}^{top_1}, \tilde{F}^{top_2}, \ldots, \tilde{F}^{top_K}\}$ according to their confidence values.
11. Use the K-Means method to cluster the frames in pose feature segments.
12. Extract atomic motion series directly to represent motion feature segments.
13. Return the feature sequence representation of the human activity.
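Putting the steps of Algorithm 1 together, a minimal training-side driver might look as follows. It reuses the helpers sketched earlier (compute_nro_features, segment_sequence, select_top_frame_sets); the data layout, the sampling of the frame sets and the construction of the training pool are simplified assumptions, not the authors' code.

```python
import numpy as np

def build_sequence_representation(joint_seq, joint_pairs, other_activity_feats,
                                  J=500, frames_per_segment=3, num_segments=5,
                                  top_k=10, k=2000, seed=0):
    """End-to-end sketch of Algorithm 1 for one (already smoothed) training sequence.
    joint_seq            : (T, num_joints, 3) skeleton sequence.
    other_activity_feats : (Q, D) NRO features sampled from other activities,
                           used to pad the kNN training pool."""
    rng = np.random.default_rng(seed)
    feats = np.stack([compute_nro_features(fr, joint_pairs) for fr in joint_seq])  # steps 1-2
    segments = segment_sequence(feats)                                             # step 3
    window = segments[:num_segments]          # a constant number of adjacent segments
    window_idx = np.concatenate([idx for _, idx in window])
    # Training pool: all frames of the window plus frames from other activities.
    pool_feats = np.vstack([feats[window_idx], other_activity_feats])
    pool_in_segments = np.r_[np.ones(len(window_idx), bool),
                             np.zeros(len(other_activity_feats), bool)]
    candidate_sets = []                                                            # steps 4-9
    for _ in range(J):
        picked = np.concatenate([rng.choice(idx, size=frames_per_segment)
                                 for _, idx in window])
        candidate_sets.append(feats[picked])
    # Steps 10-12: keep the K most confident frame sets; pose-segment frames are
    # later clustered with K-Means, motion-segment frames are kept as series.
    return select_top_frame_sets(candidate_sets, pool_feats, pool_in_segments, top_k, k)
```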
3.3. Representation of feature segments

In this subsection, we use the selected frame sets with high confidence values to represent the feature segments. As described above, a feature sequence is segmented into pose and motion feature segments according to the potential difference. We extract series of frames (i.e., atomic motion series) with different numbers of poses from the learnt top $K$ frame sets to represent the motion segments. Pose feature segments, in contrast, are composed of still poses or poses with tiny movements. In other words, pose feature segments indicate the normal states of the human body when no significant movement is being performed. Therefore, the pose feature segments provide much less discriminative information than the motion feature segments, and there is no need to represent them with many frames. Unlike the motion feature segments, for which series of frames are extracted directly, we use the K-Means method [33] to cluster the frames that fall in the same pose feature segment of the top $K$ selected frame sets. The reduced number of key poses also lowers the computational cost accordingly. As shown in Fig. 4, all frames in the same pose feature segment of the top $K$ selected frame sets are used as the input samples of K-Means, and the obtained cluster centroids are taken as the key poses of this segment. The atomic motion series in the motion feature segment of each selected frame set is extracted directly to represent that segment. The advantage of the proposed representation is that the motion feature segments of the same activity performed by different people are represented by atomic motion series whose numbers of frames vary with the movement velocity of the activity. Therefore, the centroids obtained by K-Means for pose feature segments and the atomic motion series for motion feature segments together represent a human activity efficiently and effectively. We have now described how a feature sequence of a human activity is represented in the training process; an overview is given in Algorithm 1.
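A sketch of the pose-segment representation using scikit-learn's KMeans; the centroid count of 10 follows the parameter settings reported in Section 4.1, and the rest of the interface is illustrative. Motion feature segments need no such step: each selected frame set simply contributes its atomic motion series as-is.

```python
import numpy as np
from sklearn.cluster import KMeans

def pose_segment_key_poses(top_sets_frames, n_key_poses=10, seed=0):
    """Represent one pose feature segment by K-Means centroids.
    top_sets_frames : (n, D) NRO features of all frames that the top-K selected
                      frame sets contribute to this pose segment.
    Returns (n_key_poses, D) centroids used as the key poses of the segment."""
    n_clusters = min(n_key_poses, len(top_sets_frames))   # guard against small segments
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(np.asarray(top_sets_frames, dtype=float))
    return km.cluster_centers_
```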
3.4. Dynamic matching of feature sequences

During the test stage, a test feature sequence is first segmented into pose and motion feature segments, as in the training stage. Secondly, unlike the training stage, we only select the frame set with the highest confidence to represent the sequence, based on the learning strategy described in Section 3.2. Finally, we split the test feature sequence into sub-sequences for matching, and calculate the distances between the sub-sequences of the test sequence and the training set. The key idea of dynamic matching is to find the best-matching activity pattern for each test sub-sequence in the training set. Suppose a sub-sequence of a test feature sequence is composed of $L$ adjacent segments, which include some pose and motion feature segments. It is then natural to split the matching problem between feature segments into two sub-problems (i.e., matching of pose feature segments and matching of motion feature segments). The distance between the pose feature segments of the test and training sequences is defined as

$$ D^c_s = \sqrt{\, n_s^{-1} \sum_{n=1}^{n_s} \left\| F^s_{pose,n} - K^{c,s}_{best} \right\|^2 }, \quad s = 1, 3, \ldots, \qquad (6) $$

where $n_s$ is the number of selected frames in the pose feature segment $s$ of the test sub-sequence, $F^s_{pose,n}$ is the $n$th selected frame in the segment, and $K^{c,s}_{best}$ is the best-matching key pose of $F^s_{pose,n}$ in the $s$th segment of the $c$th human activity in the training set. Since the pose and motion feature segments appear alternately in a feature sequence, and the pose feature segment appears first because a human activity always starts from some still poses, the index of pose feature segments (i.e., $s$ in Eq. (6)) is odd, while it is even for motion feature segments.

In order to measure the distance between atomic motion series, shapeDTW is used to compare two motion feature segments in our method. DTW is a dynamic-programming-based distance measure between two temporal sequences and is robust against variations in speed or style. Therefore, DTW is widely applied to match two feature sequences with different numbers of poses in human activity recognition. However, DTW does not consider the local structural information of feature sequences. Recently, shapeDTW [11] was proposed to enhance DTW by incorporating point-wise local structures into the matching process. The shapeDTW procedure used for matching two atomic motion series of test and training sub-sequences is described in Algorithm 2. It is worth pointing out that the compound shape descriptors [11], which are invariant to magnitude shift, are used to encode the frames $F_{sub,t}$ in the atomic motion series of a sub-sequence. More details about the shapeDTW algorithm can be found in [11].
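For the pose-segment half of the matching problem, Eq. (6) reduces to a nearest-key-pose lookup followed by a root-mean-square average; a small sketch under that reading (names are ours):

```python
import numpy as np

def pose_segment_distance(test_frames, key_poses):
    """Distance between a test pose segment and one training activity (Eq. (6)).
    test_frames : (n_s, D) selected frames of the test pose segment.
    key_poses   : (K, D) K-Means centroids of the corresponding training segment."""
    # Pairwise squared distances between test frames and key poses.
    d2 = ((test_frames[:, None, :] - key_poses[None, :, :]) ** 2).sum(axis=2)
    best = d2.min(axis=1)              # ||F_pose,n - K_best||^2 for each frame
    return float(np.sqrt(best.mean()))
```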
Fig. 3. Examples of frame sets with high confidence values identified from the training pool by using the proposed learning strategy on the Cornell CAD-60 dataset. (a) Drinking water. (b) Wearing contact lenses.
Fig. 4. Illustration of the representation of feature segments with the top K selected frame sets. All frames in the same pose feature segment of the top K selected frame sets are used for clustering with the K-Means method. A sequence of frames in the motion feature segment of each selected frame set is extracted directly to represent the segment.
As a consequence, given two atomic motion series in the motion feature segments of a test and a training sub-sequence, the distance between the two segments is computed by

$$ D^c_s = D_{shape}\left( F^{test}_{sub}, A^{c,s}_{best} \right), \quad s = 2, 4, \ldots, \qquad (7) $$

where $F^{test}_{sub}$ is the atomic motion series in one motion feature segment of the test sub-sequence, $A^{c,s}_{best}$ is the best-matching atomic motion series in the $s$th segment of the $c$th human activity of the training set, and $D_{shape}$ denotes the shapeDTW distance between the two atomic motion series. Once the key pose distances and atomic motion distances have been obtained according to Eqs. (6) and (7), the best-matching distance between the test and training sub-sequences is defined as

$$ D^c = \sum_{s=1}^{L} D^c_s. \qquad (8) $$

Finally, the naive Bayes nearest neighbour (NBNN) method is used to classify the test sub-sequence according to the obtained distances:

$$ C_{test} = \arg\min_{c} D^c. \qquad (9) $$
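The matching stage of Eqs. (6)-(9) can be sketched as a single loop over candidate activities. The data structures below are our own assumptions (per-segment models aligned with the test segments), pose_segment_distance is the sketch given after Eq. (6), and shape_dtw_distance is any shapeDTW distance function (a simplified version is sketched below, after Algorithm 2).

```python
import numpy as np

def classify_subsequence(test_segments, training_activities, shape_dtw_distance):
    """NBNN-style classification of one test sub-sequence (Eqs. (6)-(9)).
    test_segments       : list of (kind, data) in temporal order; kind is 'pose'
                          (selected frames) or 'motion' (atomic motion series).
    training_activities : dict label -> list of per-segment models: a (K, D) key-pose
                          array for pose segments, a list of candidate atomic motion
                          series for motion segments.
    shape_dtw_distance  : callable implementing the shapeDTW distance of Eq. (7)."""
    best_label, best_dist = None, np.inf
    for label, segments in training_activities.items():
        total = 0.0
        for (kind, data), model in zip(test_segments, segments):
            if kind == "pose":                                           # Eq. (6)
                total += pose_segment_distance(data, model)
            else:                                                        # Eq. (7): best-matching series
                total += min(shape_dtw_distance(data, series) for series in model)
        if total < best_dist:                                            # Eqs. (8)-(9)
            best_label, best_dist = label, total
    return best_label
```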
4. Experiments

In this section, we first present the parameter settings of our method in Section 4.1. In Sections 4.2-4.4, we evaluate the effectiveness of our method on three public RGB-D datasets (the Cornell CAD-60 dataset [25], the MSR Action3D dataset [34], and the MSRC-12 dataset [35]), and compare our results with some state-of-the-art methods. All experiments are conducted using MATLAB on an i7 quad-core machine.

4.1. Parameter settings

Several parameters of our method need to be determined, and all of them are fixed across all experiments to demonstrate the effectiveness of our method. Given a feature sequence of a human activity, it is segmented into segments according to the potential difference. When the potential difference is below the threshold $P_{min}$, which is set to 0.02, the corresponding parts of the feature sequence are treated as pose feature segments. The number of adjacent segments used for learning discriminative poses and matching feature sequences is set to 5, and the number of randomly selected frames per segment is 3. We randomly select frame sets $J$ = 500 times from the 5 adjacent feature segments, and the training pool for the kNN search consists of all the frames in the 5 segments and 10 000 randomly selected frames from the training feature sequences. Although the training process is off-line and in principle all frames could be used, we restrict the pool to 10 000 randomly selected frames to limit the time consumption. We then perform the kNN search with $k$ = 2000 and select the top $K$ = 10 frame sets to represent the segments. When representing feature segments, the number of centroids in K-Means and the number of series in motion feature segments are both set to 10.
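Collected in one place, the fixed settings above might be expressed as a small configuration object; the field names are our own, only the values come from the text.

```python
from dataclasses import dataclass

@dataclass
class RecognitionConfig:
    """Hyper-parameters stated in Section 4.1 (field names are ours)."""
    p_min: float = 0.02              # potential-difference threshold for pose segments
    num_segments: int = 5            # adjacent segments per sub-sequence
    frames_per_segment: int = 3      # randomly selected frames per segment
    num_frame_sets: int = 500        # J: randomly sampled frame sets
    pool_extra_frames: int = 10_000  # frames sampled from other training sequences
    knn_k: int = 2000                # k for the kNN confidence search
    top_k_sets: int = 10             # K: frame sets kept per activity
    n_key_poses: int = 10            # K-Means centroids per pose segment
    n_motion_series: int = 10        # atomic motion series kept per motion segment

CONFIG = RecognitionConfig()
```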
Algorithm 2: The shapeDTW algorithm for matching two series of atomic motion.
Input: Two series of atomic motion $F^{test}_{sub} = \{F^{test}_{sub,1}, F^{test}_{sub,2}, \ldots, F^{test}_{sub,T_1}\}$ and $F^{train}_{sub} = \{F^{train}_{sub,1}, F^{train}_{sub,2}, \ldots, F^{train}_{sub,T_2}\}$ in the motion feature segments of the test and training sub-sequences.
Output: The shapeDTW distance $D_{shape}$ between the two series of atomic motion.
1. Encode each pose descriptor of $F^{test}_{sub}$ and $F^{train}_{sub}$ by its shape descriptors:
2. for $t_1 = 1, \ldots, T_1$ and $t_2 = 1, \ldots, T_2$ do
3.     Encode $F^{test}_{sub,t_1}$ and $F^{train}_{sub,t_2}$ in $F^{test}_{sub}$ and $F^{train}_{sub}$ by the compound shape descriptors $desc(F^{test}_{sub,t_1})$ and $desc(F^{train}_{sub,t_2})$.
4. end
5. Align the descriptor sequences $desc(F^{test}_{sub})$ and $desc(F^{train}_{sub})$ by DTW:
   a. Initialize $D(1,1) = dist(desc\{F^{test}_{sub,1}\}, desc\{F^{train}_{sub,1}\})$, where $dist(\cdot,\cdot)$ is the distance between two shape descriptors;
   b. Initialize $D(1,2) = D(1,1) + dist(desc\{F^{test}_{sub,1}\}, desc\{F^{train}_{sub,2}\})$;
   c. $D(t_1, t_2) = \min\{D(t_1-1, t_2-1), D(t_1-1, t_2), D(t_1, t_2-1)\} + dist(desc\{F^{test}_{sub,t_1}\}, desc\{F^{train}_{sub,t_2}\})$.
6. Construct the warping matrices $W^{test}_{sub}$ and $W^{train}_{sub}$ in shapeDTW.
7. Calculate the shapeDTW distance by solving the optimization problem:
   $D_{shape} = \min_{W^{test}_{sub},\, W^{train}_{sub}} \left\| W^{test}_{sub} \cdot desc(F^{test}_{sub}) - W^{train}_{sub} \cdot desc(F^{train}_{sub}) \right\|_{1,2}$.

Table 1
Precision and recall of our method in the different locations of the Cornell CAD-60 dataset ("new person" setting).

Location        Activity                   Precision (%)   Recall (%)
Bathroom        Brushing teeth             100             90
                Rinsing mouth              96.2            100
                Wearing contact lens       85.8            100
                Average                    94.0            96.7
Bedroom         Talking on phone           95.8            100
                Drinking water             100             94.3
                Opening pill container     100             96.7
                Average                    98.6            97.0
Kitchen         Cooking (chopping)         92.8            99.9
                Cooking (stirring)         100             92.3
                Drinking water             100             94.3
                Opening pill container     100             99.1
                Average                    98.2            96.4
Living room     Talking on phone           97.4            100
                Drinking water             100             94.3
                Talking on couch           100             100
                Relaxing on couch          100             100
                Average                    99.4            98.6
Office          Talking on phone           97.4            100
                Writing on white board     97.6            97.6
                Drinking water             100             94.3
                Working on computer        100             100
                Average                    98.8            98.0
Global average                             97.8            97.3
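The sketch below mirrors the structure of Algorithm 2 in Python: every frame is first encoded by a local shape descriptor, then the descriptor sequences are aligned by DTW. The compound HOG1D-plus-raw descriptor of [11] is replaced here by a simple mean-subtracted temporal window, so this is only a structural illustration, not the authors' implementation. The function can be passed as the shape_dtw_distance callable used in the classification sketch in Section 3.4.

```python
import numpy as np

def local_shape_descriptors(series, width=5):
    """Encode every frame of a (T, D) series by its temporal neighbourhood.
    Each frame is described by a window of `width` neighbouring frames,
    mean-subtracted to give some invariance to magnitude shift (a simplified
    stand-in for the compound shape descriptors of [11])."""
    T, _ = series.shape
    half = width // 2
    padded = np.pad(series, ((half, half), (0, 0)), mode="edge")
    desc = np.stack([padded[t:t + width].ravel() for t in range(T)])
    return desc - desc.mean(axis=1, keepdims=True)

def shape_dtw_distance(test_series, train_series, width=5):
    """shapeDTW-style distance: DTW run on local shape descriptors."""
    a = local_shape_descriptors(np.asarray(test_series, float), width)
    b = local_shape_descriptors(np.asarray(train_series, float), width)
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Standard DTW recursion over the descriptor sequences (step 5c).
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return float(D[T1, T2])
```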
4.2. Cornell CAD-60 dataset

The Cornell CAD-60 dataset [25] is a daily activity dataset of depth sequences captured by a depth camera (i.e., the Microsoft Kinect sensor). The dataset consists of 12 types of activities: "brushing teeth", "cooking (chopping)", "cooking (stirring)", "drinking water", "opening pill container", "relaxing on couch", "rinsing mouth with water", "talking on couch", "talking on the phone", "wearing contact lenses", "working on computer", and "writing on white board". Each activity is performed at least once by 4 persons (2 males and 2 females) in 5 different locations ("bathroom", "bedroom", "kitchen", "living room", "office"). Three persons are right-handed and the remaining one is left-handed. A skeleton in a frame consists of 15 joints, which are divided into five limbs, so the feature vector of a skeleton is composed of 15 NRO features (i.e., 3 NRO features per limb). Two experimental settings ("new person" and "have seen") are introduced in [25]; we adopt the "new person" setting (i.e., leave-one-out cross-validation over the human subjects). In order to make our method adaptive to left-handed subjects, we mirror the training skeleton data across the plane that vertically bisects the person, while the test feature sequences are classified without any mirroring operation.

The performance of our method in terms of precision and recall for each activity in the 5 locations is given in Table 1. As can be seen, our method achieves strong results in all locations, with the best performance obtained in the living room environment. The bathroom environment, which includes brushing teeth, rinsing mouth, and wearing contact lenses, is the most challenging case, with an average precision and recall of 94% and 96.7%. The recognition results for most activities that involve tiny movements are excellent, while the wearing contact lens activity has the highest misrecognition rate among all activities.

We also compare our method with other state-of-the-art skeleton-based methods [8-10,24,25,27,28,36-41] in terms of precision and recall. All the recognition results of these methods in Table 2 are obtained on the Cornell benchmark [25]. As shown in the table, our method outperforms all the other state-of-the-art methods by achieving 97.8% precision and 97.3% recall, and in particular improves on the pose kinetic energy [9] and multi-layer codebooks [10] methods, which are also based on the segmentation technique. The results demonstrate that our method is able to learn discriminative frames to represent feature segments, and that the dynamic representation and matching of motion feature segments is robust to temporal stretching.

Table 2
Performance of our method in comparison with state-of-the-art methods on the Cornell CAD-60 dataset.

Method                              Precision (%)   Recall (%)
MEMM [25]                           67.9            55.5
SSVM [27]                           80.8            71.4
Kinematic feature [36]              86              84
Eigenjoints [28]                    71.9            66.6
HMM+GMM [24]                        70              78
Image fusion [37]                   75.9            69.5
Depth images segment [38]           78.1            75.4
Spatiotemporal interest pt. [39]    93.2            84.6
Probabilistic body motion [40]      91.1            91.9
Self-organizing neural int. [41]    91.9            90.2
Cippitelli et al. [8]               93.9            93.5
Pose kinetic energy [9]             93.8            94.5
Multi-layer codebooks [10]          97.4            95.8
Our method                          97.8            97.3

4.3. MSR Action3D dataset

The MSR Action3D dataset [34] is a challenging activity dataset due to the presence of similar and complex gestures. It contains 20 activity types performed 2-3 times by 10 persons, for a total of 567 activity samples. The 20 activities are divided into three subsets, each having 8 activities.
For a quantitative evaluation of the recognition performance, we perform experiments separately on each subset using the same "new person" setting as in [34], where the feature sequences of 5 persons are used for training and the remaining sequences are used for testing. The average accuracies on the three subsets obtained by our method and by several other state-of-the-art methods [9,10,15,18,22,28,34,42-45] are given in Table 3. As can be seen, our method improves the average accuracy over the pose kinetic energy [9] and multi-layer codebooks [10] methods, which are based on the segmentation technique, by 6.8% and 6.2%, respectively. Our method also achieves competitive performance in comparison with the moving pose method [22], which utilizes the pose, speed, and acceleration of joints. Even compared with state-of-the-art methods using both depth images and joint coordinates, our method still outperforms [15,18,34,42,43] and is competitive with [44,45]. The confusion matrix of our method on the MSR Action3D dataset is shown in Fig. 5. It can be seen that our method works very well on most activities; misrecognitions occur when two activities involve similar interactions. Our method does not achieve the same level of performance on the MSR Action3D dataset as on the CAD-60 dataset, because the MSR Action3D dataset has larger inter-person variance, more actors, and more corrupted and occluded poses than the CAD-60 dataset.

Fig. 5. Confusion matrix of our method on the MSR Action3D dataset.

Table 3
Average accuracy of our method in comparison with state-of-the-art methods on the MSR Action3D dataset.

Method                           Used depth images   Average accuracy (%)
Bag of 3D points [34]            Yes                 74.7
Histogram of 3D Joints [15]      Yes                 78.9
Random occupancy pattern [42]    Yes                 86.2
Actionlet ensemble [43]          Yes                 88.2
HON4D + Ddesc [18]               Yes                 88.9
MMTW [44]                        Yes                 92.7
SNV [45]                         Yes                 93.09
Eigenjoints [28]                 No                  82.3
Pose kinetic energy [9]          No                  84
Multi-layer codebooks [10]       No                  84.6
Moving pose [22]                 No                  91.7
Our method                       No                  90.8

4.4. MSRC-12 dataset

The MSRC-12 activity dataset [35] is collected with the Microsoft Kinect sensor, which provides a noisy estimate of 20 human joints. There are 6,244 gesture instances in 594 videos in the dataset. It consists of 6 iconic and 6 metaphoric gestures performed by 30 persons; we perform experiments on the 6 iconic gestures (crouch, put on goggles, shoot pistol, throw object, change weapon, kick) by employing leave-person-out cross-validation as in [10,46,47]. Because the demographics of the participants include 7% left-handed subjects, we also mirror the skeleton data of the training set, as for the CAD-60 dataset. Table 4 shows the average accuracy of our method in comparison with some state-of-the-art methods [10,46,47] on the MSRC-12 dataset. Our method outperforms all the other methods and improves on the multi-layer codebooks method, which is based on the segmentation technique, by 2.9% in terms of average accuracy.

Table 4
Average accuracy of our method in comparison with state-of-the-art methods on the MSRC-12 dataset.

Method                             Average accuracy (%)
Nonlinear Markov models [46]       90.9
Enhanced sequence matching [47]    96.7
Multi-layer codebooks [10]         94.0
Our method                         96.9

5. Conclusion

In this paper, we have introduced the dynamic representation and matching of feature segments for human activity recognition. Our motivation is based on the observation that the dynamic representation of feature segments with different numbers of atomic motions can provide more temporal information than a representation with a fixed number of atomic motions. As demonstrated in the paper, the proposed learning strategy effectively selects discriminative frames for the dynamic representation of feature sequences. Moreover, the recognition task is formulated as the dynamic matching of atomic motion series between feature sequences with shapeDTW. Experimental results on three public activity datasets have demonstrated that our method effectively improves the recognition performance and outperforms several state-of-the-art methods (especially some methods based on the segmentation technique).

Acknowledgement

This work was supported by the China Postdoctoral Science Foundation (No. 2017M612145).

References
[1] L. Wang, Y. Qiao, X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4305-4314.
[2] B. Zhang, L. Wang, Z. Wang, Y. Qiao, H. Wang, Real-time action recognition with enhanced motion vector CNNs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2718-2726.
[3] H.S. Koppula, A. Saxena, Anticipating human activities using object affordances for reactive robotic response, IEEE Trans. Pattern Anal. Mach. Intell. 38 (1) (2016) 14-29.
[4] L. Liu, Y. Peng, M. Liu, Z. Huang, Sensor-based human activity recognition system with a multilayered model using time series shapelets, Knowl. Based Syst. 90 (2015) 138-152.
[5] A. Chaaraoui, J.R. Padilla-Lopez, F. Flórez-Revuelta, Fusion of skeletal and silhouette-based features for human action recognition with RGB-D devices, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2013, pp. 91-97.
[6] A.A. Chaaraoui, J.R. Padilla-López, P. Climent-Pérez, F. Flórez-Revuelta, Evolutionary joint selection to improve human action recognition with RGB-D devices, Expert Syst. Appl. 41 (3) (2014) 786-794.
[7] S. Baysal, M.C. Kurt, P. Duygulu, Recognizing human actions using key poses, in: Proceedings of the International Conference on Pattern Recognition, 2010, pp. 1727-1730.
[8] E. Cippitelli, S. Gasparrini, E. Gambi, S. Spinsante, A human activity recognition system using skeleton data from RGB-D sensors, in: Computational Intelligence and Neuroscience, 2016, pp. 1-14.
[9] J. Shan, S. Akella, 3D human action segmentation and recognition using pose kinetic energy, in: Proceedings of the IEEE Workshop on Advanced Robotics and its Social Impacts, 2014, pp. 69-75.
[10] G. Zhu, L. Zhang, P. Shen, J. Song, Human action recognition using multi-layer codebooks of key poses and atomic motions, Signal Process., Image Commun. 42 (2016) 19-30.
[11] J. Zhao, L. Itti, shapeDTW: shape Dynamic Time Warping, 2016, arXiv:1606.01601.
[12] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, R. Moore, Real-time human pose recognition in parts from single depth images, Commun. ACM 56 (1) (2013) 116-124.
[13] J.R. Padilla-López, A.A. Chaaraoui, F. Flórez-Revuelta, A discussion on the validation tests employed to compare human action recognition methods using the MSR Action3D dataset, 2014, arXiv:1407.7390.
[14] L. Gan, F. Chen, Human action recognition using APJ3D and random forests, J. Softw. 8 (9) (2013) 2238-2245.
[15] L. Xia, C.C. Chen, J.K. Aggarwal, View invariant human action recognition using histograms of 3D joints, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2012, pp. 20-27.
[16] G. Lu, Y. Zhou, X. Li, M. Kudo, Efficient action recognition via local position offset of 3D skeletal body joints, Multimed. Tools Appl. 75 (6) (2016) 3479-3494.
[17] M.A. Gowayyed, M. Torki, M.E. Hussein, M. El-Saban, Histogram of oriented displacements (HOD): Describing trajectories of human joints for action recognition, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, 2013, pp. 1351-1357.
[18] O. Oreifej, Z. Liu, HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 716-723.
[19] J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1290-1297.
[20] N.P. Cuntoor, R. Chellappa, Key frame-based activity representation using antieigenvalues, in: Asian Conference on Computer Vision, 2006, pp. 499-508.
[21] C. Ellis, S.Z. Masood, M.F. Tappen, J.J. LaViola, R. Sukthankar, Exploring the trade-off between accuracy and observational latency in action recognition, Int. J. Comput. Vis. 101 (3) (2013) 420-436.
[22] M. Zanfir, M. Leordeanu, C. Sminchisescu, The moving pose: an efficient 3D kinematics descriptor for low-latency action recognition and detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2752-2759.
[23] S. Gaglio, G.I. Re, M. Morana, Human activity recognition process using 3-D posture data, IEEE Trans. Human-Mach. Syst. 45 (5) (2015) 586-597.
[24] L. Piyathilaka, S. Kodagoda, Gaussian mixture based HMM for human daily activity recognition using 3D skeleton features, Ind. Electron. Appl. (2013) 567-572.
[25] J. Sung, C. Ponce, B. Selman, A. Saxena, Unstructured human activity detection from RGBD images, in: Proceedings of the IEEE Conference on Robotics and Automation, 2011, pp. 47-55.
[26] A. Taha, H.H. Zayed, M.E. Khalifa, E.S.M. El-Horbaty, Human activity recognition for surveillance applications, in: Proceedings of the 7th International Conference on Information Technology, 2015, pp. 577-586.
[27] H.S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from RGB-D videos, Int. J. Robot. Res. 32 (8) (2013) 951-970.
[28] X. Yang, Y. Tian, Effective 3D action recognition using EigenJoints, J. Vis. Commun. Image Represent. 25 (1) (2014) 2-11.
[29] S. Sempena, N.U. Maulidevi, P.R. Aryan, Human action recognition using Dynamic Time Warping, in: Proceedings of the IEEE International Conference on Electrical Engineering and Informatics, 2011, pp. 1-5.
[30] R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3D skeletons as points in a Lie group, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 588-595.
[31] L. Liu, L. Shao, X. Zhen, X. Li, Learning discriminative key poses for action recognition, IEEE Trans. Cybern. 43 (6) (2013) 1860-1870.
[32] S. Cheema, A. Eweiwi, C. Thurau, C. Bauckhage, Action recognition by learning discriminative key poses, in: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2011, pp. 1302-1309.
[33] J. Macqueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
[34] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3D points, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2010, pp. 9-14.
[35] S. Fothergill, H. Mentis, P. Kohli, S. Nowozin, Instructing people for training gestural interactive systems, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2012, pp. 1737-1746.
[36] C. Zhang, Y. Tian, RGB-D camera-based daily living activity recognition, J. Comput. Vis. Image Process. 2 (4) (2012) 1-12.
[37] B. Ni, Y. Pei, P. Moulin, S. Yan, Multilevel depth and image fusion for human activity detection, IEEE Trans. Cybern. 43 (5) (2013) 1383-1394.
[38] G. Raj Gupta, Y.S. Chia, D. Rajan, Human activities recognition using depth images, in: Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 283-292.
[39] Y. Zhu, W. Chen, G. Guo, Evaluating spatiotemporal interest point features for depth-based action recognition, Image Vis. Comput. 32 (8) (2014) 453-464.
[40] D.R. Faria, C. Premebida, U. Nunes, A probabilistic approach for human everyday activities recognition using body motion from RGB-D images, in: Proceedings of the IEEE International Symposium on Robot and Human Interactive Communication, 2014, pp. 732-737.
[41] G.I. Parisi, C. Weber, S. Wermter, Self-organizing neural integration of pose-motion features for human action recognition, Front. Neurorobotics (2015) 1-9.
[42] J. Wang, Z. Liu, J. Chorowski, Z. Chen, Y. Wu, Robust 3D action recognition with random occupancy patterns, in: Proceedings of the European Conference on Computer Vision, 2012, pp. 872-885.
[43] J. Wang, Z. Liu, Y. Wu, J. Yuan, Learning actionlet ensemble for 3D human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 36 (5) (2014) 914-927.
[44] J. Wang, Y. Wu, Learning maximum margin temporal warping for action recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2688-2695.
[45] X. Yang, Y.L. Tian, Super normal vector for activity recognition using depth sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 804-811.
[46] A.M. Lehrmann, P.V. Gehler, S. Nowozin, Efficient nonlinear Markov models for human motion, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1314-1321.
[47] H.J. Jung, K.S. Hong, Enhanced sequence matching for action recognition from 3D skeletal data, in: Proceedings of the 12th Asian Conference on Computer Vision, 2014, pp. 226-240.