Computers & Graphics 54 (2016) 104–112
Special Issue on CAD/Graphics 2015
3D human motion retrieval using graph kernels based on adaptive graph construction

Meng Li a,⁎, Howard Leung a, Zhiguang Liu a, Liuyang Zhou b
a Department of Computer Science, City University of Hong Kong, 306-c Hall 8, City University of Hong Kong, Tat Chee Avenue, Kowloon Tong, Hong Kong
b Wisers Information Limited, Hong Kong
Article history: Received 12 April 2015; received in revised form 16 June 2015; accepted 3 July 2015; available online 17 July 2015.

Abstract
Graphs provide a powerful representation for structured data, yet modeling 3D human motions remains challenging due to their large spatio-temporal variations. This paper proposes a novel graph-based method for real time 3D human motion retrieval. First, we propose a novel graph construction method that connects the joints deemed important for a given motion. In particular, the top-N Relative Ranges of Joint Relative Distances (RRJRD) are proposed to determine which joints should be connected in the resulting graph, because these measures indicate the normalized activity levels among the joint pairs. Different motions may thus result in different graph structures, so the construction of the graphs adapts to the characteristics of a given motion and is able to represent a meaningful spatial structure. In addition to the spatial structure, the temporal pyramid of covariance descriptors is adopted to preserve a certain level of spatio-temporal local features. The graph kernel is computed by matching the walks of the two graphs to be compared. Furthermore, multiple kernel learning is applied to determine the optimal weights for combining the graph kernels into an overall similarity between two motions. Experimental results show that our method is robust under several variations and demonstrates superior performance in comparison to three state-of-the-art methods. © 2015 Elsevier Ltd. All rights reserved.
Keywords: Motion retrieval; Adaptive graph; Graph kernel; Motion capture; Multiple kernel learning
1. Introduction

3D motion capture (MoCap) is an advanced technology with many applications, such as game and animation production, physical training, and film-making. Although smaller motion sensors like the Kinect tend to be more popular in areas such as home entertainment, professional animation still relies on commercial-grade MoCap systems due to their high accuracy. However, generating new motion data is time consuming and labor intensive, so there is an increasing need to reuse and process existing motion data, and a reliable retrieval mechanism is desired in real applications. Nevertheless, motion data has large spatio-temporal variations, which makes it challenging to represent the logical meaning of motions. Numerous successful motion retrieval methods use relational measures of joints to capture the logical meaning of motions [19,20,24]. These methods focus on modeling the geometric relationship between the joints in several key postures. In our view, we want not only to model the spatial
⁎ Corresponding author. Tel.: +852 6705 6526.
E-mail addresses: [email protected] (H. Leung), [email protected] (Z. Liu), [email protected] (L. Zhou).
http://dx.doi.org/10.1016/j.cag.2015.07.005
0097-8493/© 2015 Elsevier Ltd. All rights reserved.
relationship between joints in specific frames but also to extract the spatio-temporal local features of the joints, capturing their temporal variations at the same time. Graphs are a powerful tool for modeling complex structured data [1,8]. However, few studies construct graphs by modeling the structure of joints for 3D human motion retrieval, for two main reasons: (1) it is challenging to characterize the large spatial variation of joints in motions, and (2) it is challenging to compute the similarity of the structures, vertex attributes, and edge attributes of two constructed graphs. In this paper, we focus on these two tasks and propose a novel graph-based approach for human motion retrieval in which the query motion is matched against each motion in the database as a whole to identify similar matches. The Joint Relative Distance (JRD) has been proposed [3] to provide a meaningful description of human motions, emulating logical rather than numerical similarity. The Variance of Joint Relative Distance (VJRD) was proposed as an extension to model the temporal changes [21]. If we want to determine which joint pairs are more important according to their activity levels, the VJRD may not give a fair ranking, because two joints that are farther apart are more likely to yield larger VJRD values. We therefore propose the Relative Range of Joint Relative Distance (RRJRD), which indicates the normalized activity levels
among the joint pairs. We rank the joint pairs by this measure to obtain the top-N RRJRD, which indicates the joint pairs with the largest normalized variations. Based on the top-N RRJRD, we construct novel attributed graphs to model 3D human motions. The vertex attributes correspond to the temporal pyramid of covariance descriptors on 3D joint locations, and the edge attributes represent the spatial layout relationships of the vertices based on the RRJRD values. Different motions may thus result in different graph structures, so the construction of the graphs adapts to the characteristics of a given motion. Compared with other existing relational measures of joints, our method preserves the spatio-temporal local features of individual joints as well as the complex spatial structure between joint pairs. As the adaptive graphs have complex structures, their similarity cannot be accurately measured by simply computing the weighted similarity of vertex attributes [15], or by regarding their structures as relationships between adjacent edges [16]. We propose the Adaptive-Graph Kernel (AGK) as a bridge between the structured motion representation and the similarity measurement, since it can compute the similarity of the complex structures, the vertex attributes, and the edge attributes of two adaptive graphs. The proposed AGK is in fact a series of decomposition kernels: the adaptive graph is first decomposed into a number of walk groups of different lengths, and the AGK is then obtained by matching the walk groups of the two graphs with walk graph kernels. In traditional walk graph kernels [4,13], all matchings between walk groups are summed with the same weight. To improve the similarity measurement of the AGK, we use the Generalized Multiple Kernel Learning (GMKL) algorithm [12] to set optimal weights on the different matched walk groups.
Our AGK matching combined with GMKL greatly improves the accuracy and efficiency of human motion retrieval. Our proposed framework is illustrated in Fig. 1. The main contributions of this paper include the following: (1) We propose a novel adaptive graph model based on the Relative Range of Joint Relative Distance (RRJRD), which preserves the spatio-temporal local features of individual joints as well as the complex spatial structure between joint pairs, thus generating a
more discriminative representation for 3D human motions. (2) We propose the AGK to measure the similarity of two adaptive graphs in three aspects: the structures, the vertex attributes, and the edge attributes of the two graphs. (3) We propose a method that allows the AGK to assign weights to the different matched walk groups via GMKL for motion retrieval.
2. Related work

3D human motion retrieval has been a hot topic in recent years. A number of methods have been proposed, such as those using relational measures for motion similarity [19,20,24]. Müller et al. [19,24] introduced qualitative geometric features to describe motions. Ref. [20] proposed motion keys that characterize motions by geometric features of the joints based on the Kinect input interface. Although these methods allow efficient motion retrieval in large motion databases, their features only represent the geometric relationships among some joints in a few key postures. As a result, their motion descriptors are not comprehensive enough to model the complex spatial structure of the joints, since human motions have large spatial variations. Besides analyzing the relationships between joints, some studies focus on the relationships between frames in a motion sequence. Ref. [14] adopted a Gaussian Mixture Model (GMM) to quantize the human motion sequence, representing each frame by the center of its motion class. Ref. [26] used a sparse coding representation to describe the temporal relationships between frames in the motion sequence. These methods can characterize the temporal variations between frames for different kinds of motion sequences, but their inter-frame information cannot reflect the spatio-temporal variations of the joints or the spatial relationships between them, which makes them insufficiently discriminative for describing motion sequences. Graphs are an extremely effective tool for modeling complex structured objects, and some graph-based methods have recently been applied to motion representation and analysis on video data.
Fig. 1. Framework of our system.
Ref. [2] represented each frame of a human motion by a graph whose vertices correspond to the spatial local features extracted from that frame. Ref. [10] described a person in a frame as a graphical model containing six vertices that encode the motion label and the positions of five body parts. Ref. [5] constructed a string of feature graphs for the spatio-temporal layout of local features, where the spatial configuration of local features in a small temporal segment is modeled by a graph. Ref. [11] used a hyper-graph to model the extracted spatio-temporal local features, with a hyper-graph matching algorithm designed for motion recognition. Ref. [22] represented motions by a two-graph model that captures the spatio-temporal relations among local features. These graph models have been used successfully in human motion recognition and analysis. However, their vertex and edge attributes are designed for interest points with SIFT descriptors on video data, so they are ill-suited to 3D human motion capture data. Some studies also construct graphs to model 3D human motions on MoCap data. Ref. [15] constructed a graph model representing the relationship between two human motions; their retrieval is performed with a revised Kuhn–Munkres algorithm as the graph matching method. Ref. [16] represented human motions by a hyper-graph model, using Dynamic Programming as the graph matching method for motion recognition. These graph matching methods can compute the similarity between motions, but they cannot reflect the complex spatial structure of the joints and lose the power of the joint descriptors. To measure the similarity between graphs, walk graph kernels have received increasing attention. Ref.
[4] computed the graph kernel of two labeled graphs by counting the number of matched labeled random walks, which was extended by [1] with more complex kernels for continuous attributes. Ref. [13] proposed generalized random walk graph kernels and introduced several techniques for speeding up their computation. Ref. [6] constructed a set of segmentation graph kernels on images and used multiple kernel learning to combine them for image classification. Ref. [22] built a context-dependent graph kernel for human motion recognition on video sequences. These graph kernels can compute the similarity of structures between two graphs. However, they are not fully meaningful for 3D human motion sequences, since their vertex and edge kernels are not suited to measuring the similarity of joint relative movements on motion capture data. In this paper, we construct the Adaptive-Graph Kernel (AGK) to measure the similarity between 3D human motions represented by adaptive graphs. Our approach achieves superior performance in motion retrieval.
3. Motion representation

3.1. Relative range of joint relative distance

In each frame of a motion, we extract the Joint Relative Distance (JRD), which is the pairwise Euclidean distance (i.e., $L_2$-norm) between two joints [3]. The relative range of each JRD over the duration of the movement is used to characterize the motion, and we name it the RRJRD. The RRJRD of a pair of joints is the ratio of the range to the mean of the JRD of this pair over the motion, and its value measures the importance of the variations of each JRD for describing the motion. In particular, the N joint pairs with the top-N RRJRD values (for a properly chosen N) and the structure formed by these joints reflect the main characteristics of the motion.

There are 24 joints in our human model, as shown in Fig. 2. These are the major joints of the human body that articulate motion, giving $24 \times 23 / 2 = 276$ joint pairs. Rigid pairs (i.e., pairs on the same bone or on the torso) are filtered out because their JRDs stay almost unchanged, so the number of joint pairs actually used is $276 - 23 \text{ (bone pairs)} - 24 \text{ (torso pairs)} = 229$.

Fig. 2. The representation of the human body in motion capture data.

The formulation of the RRJRD is as follows. Let $M = \{F_1, F_2, \ldots, F_T\}$ be a motion of $T$ frames, each containing $N_f = 229$ joint pairs. The 24 joints at the $t$-th frame $(1 \le t \le T)$ are denoted $\{J_1(t), J_2(t), \ldots, J_{24}(t)\}$. The JRD of the $f$-th joint pair $(1 \le f \le N_f)$ at the $t$-th frame is the $L_2$-norm between its two joints $J_i$ and $J_j$ at frame $t$:

$$\mathrm{JRD}(t, f) = d_{L_2}(J_i(t), J_j(t)). \quad (1)$$

The RRJRD of the $f$-th joint pair is then

$$\mathrm{RRJRD}_M(f) = \frac{\max_t \mathrm{JRD}(t, f) - \min_t \mathrm{JRD}(t, f)}{\mathrm{mean}_t\, \mathrm{JRD}(t, f)}. \quad (2)$$

Being a ratio, the RRJRD is robust to different body sizes, so it needs no scale normalization, unlike the VJRD.

3.2. Construction of the adaptive graph model

For a motion $M$, an adaptive graph based on the top-N RRJRD is constructed to model the spatial layout relationships of an adaptively selected subset of the 24 joints. This adaptive graph is denoted $G = (V, E, A)$, where $V$ is the vertex set representing the adaptively selected subset of the 24 joints according to the top-N RRJRD of motion $M$, $E$ is the edge set describing the joints' spatial layout relationships based on the top-N RRJRD, and $N$ is the number of edges. $A \in \mathbb{R}^{n \times n}$ is the affinity matrix of the graph, where $n$ is the number of vertices. We apply the $\varepsilon$-graph model to construct $G$:

$$A(i, j) = \begin{cases} 1 & \text{if } \mathrm{RRJRD}_M(f) \ge \varepsilon, \\ 0 & \text{otherwise.} \end{cases} \quad (3)$$

The parameter $\varepsilon > 0$ is the threshold on the RRJRD of two joints in the motion, and its value is set so that exactly the top-N RRJRD values of motion $M$ (in descending order) are selected. When $A(i, j) = 1$, we have $(v_i, v_j) \in E$. Hence, this graph is constructed entirely according to the spatial variations of the motion $M$. Moreover, we
propose the temporal pyramid of covariance on 3D joint locations as the joint descriptor that attributes the vertices (see Section 5), and use the RRJRD value to attribute the edges. In the end, each motion is adaptively represented by a special attributed graph based on its top-N RRJRD. This representation has two properties. First, the graph is adaptively constructed for each motion to reflect the motion's particular spatial structure of joints. Second, the graph preserves the spatio-temporal local features of the joints. Adaptive graphs for several motions based on the top-30 RRJRD are shown in Fig. 3, where six human motions (two walk motions, two clap motions, two squat motions) are used to construct adaptive graphs. In each subfigure, the 24 red points corresponding to the 24 joints of the human body represent the vertices of the graph, and the blue lines between pairs of red points represent its edges. Motions of the same kind yield adaptive graphs with similar structures, while motions of different kinds yield different structures. Hence, our proposed adaptive graph representations of human motions are highly discriminative.
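The construction above can be sketched in a few lines. The following is a minimal NumPy sketch (function and variable names are ours, not from the paper's implementation): it computes Eqs. (1)-(2) for a list of candidate joint pairs and connects the top-N pairs to form the affinity matrix of Eq. (3).

```python
import numpy as np

def rrjrd(motion, pairs):
    """Relative Range of Joint Relative Distance (Eqs. (1)-(2)).

    motion: (T, J, 3) array of 3D joint positions over T frames.
    pairs:  list of (i, j) joint-index pairs (rigid pairs already removed).
    Returns one RRJRD value per pair.
    """
    vals = []
    for i, j in pairs:
        jrd = np.linalg.norm(motion[:, i] - motion[:, j], axis=1)  # Eq. (1)
        vals.append((jrd.max() - jrd.min()) / jrd.mean())          # Eq. (2)
    return np.array(vals)

def adaptive_graph(motion, pairs, n_joints, top_n):
    """Affinity matrix of the adaptive graph: connect the top-N pairs (Eq. (3))."""
    scores = rrjrd(motion, pairs)
    order = np.argsort(scores)[::-1][:top_n]   # top-N pairs, descending RRJRD
    A = np.zeros((n_joints, n_joints), dtype=int)
    for k in order:
        i, j = pairs[k]
        A[i, j] = A[j, i] = 1
    return A
```

Because the RRJRD is a range/mean ratio, a joint pair whose distance stays constant scores zero and is never connected, whichever threshold the top-N selection implies.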
4. Adaptive-graph kernel

4.1. Construction of AGK

Given a motion, let $G = (V, E)$ be its adaptive graph with $n$ vertices, where $V = \{v_i\}_{i=1}^{n}$ is the vertex set and $E$ is the edge set. A vertex represents a joint selected by the top-N RRJRD. The attribute of a vertex is the probability distribution of the joint's location in the motion; since this distribution is not known, inspired by [23], we use its sample covariance matrix (the temporal pyramid of covariance on 3D joint locations of Section 5) instead. An edge represents the connection of a pair of joints selected by the top-N RRJRD; its attribute is the variation of the JRD between the pair of joints in the motion, represented by the RRJRD value. To measure the similarity between two adaptive graphs, we must evaluate the similarity of their structures, their vertex attributes, and their edge attributes. Hence, we propose the Adaptive-Graph Kernel (AGK) to perform this evaluation. We first decompose the two graphs into a number of walk groups, and the AGK is then obtained by walk group matching.
Fig. 3. Several adaptive graphs based on top-30 RRJRD. (For interpretation of the references to color in this figure, the reader is referred to the web version of this paper.)
The decomposition of our adaptive graph is based on walks. A walk of length $p$ in graph $G$ is defined as a sequence of vertices connected by edges, $w = (v_{w_0}, e_{w_1}, \ldots, e_{w_p}, v_{w_p})$, where $e_{w_i} = (v_{w_{i-1}}, v_{w_i}) \in E$, $1 \le i \le p$. To avoid the problems of tottering and halting, we only consider walks whose vertices are all distinct, so these walks are in fact paths. Let $\varphi_G^p$ be the set of all walks of length $p$ in the adaptive graph, and let $\varphi_G^p(i, j) \subseteq \varphi_G^p$ be a walk group, the subset of $\varphi_G^p$ containing walks starting at vertex $v_i$ and ending at $v_j$; that is, for a walk $w \in \varphi_G^p(i, j)$ we have $v_{w_0} = v_i$ and $v_{w_p} = v_j$. A walk of length $p = 0$ is a single vertex, so $\varphi_G^0 = V$ and $\varphi_G^0(i, i) = v_i$. A walk group can be regarded as a subgraph of the adaptive graph.

Given two motions, let $G = (V, E)$ and $G' = (V', E')$ be their adaptive graphs, and let $k_v(v, v')$ and $k_e(e, e')$ be the vertex kernel and the edge kernel on the two adaptive graphs, where the vertices $v$ and $v'$ in $k_v(v, v')$ represent joints of the human body. $k_v$ and $k_e$ measure the similarity of the vertex attributes and edge attributes, respectively; their specific formulations are given in Section 5. Let $k_w(w, w')$ be the walk kernel on two walks of the same length. For $p = 0$, $k_w(w, w') = k_v(w, w')$. For $p \ge 1$,

$$k_w(w, w') = \prod_{i=0}^{p} k_v(v_{w_i}, v'_{w'_i}) \prod_{j=1}^{p} k_e(e_{w_j}, e'_{w'_j}). \quad (4)$$

In the two adaptive graphs $G$ and $G'$, the kernel on two walk groups of length $p$ is defined as a summation of walk kernels over all matched walk pairs from the two groups:

$$k_g(\varphi_G^p(i, j), \varphi_{G'}^p(s, r)) = \sum_{w \in \varphi_G^p(i, j)} \; \sum_{w' \in \varphi_{G'}^p(s, r)} k_w(w, w'). \quad (5)$$
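As an illustration, the walk kernel of Eq. (4) and the walk-group kernel of Eq. (5) can be sketched as follows; the walk representation (a vertex-attribute list plus an edge-attribute list) and all names are our own illustrative choices, not the paper's code:

```python
from itertools import product

def walk_kernel(walk, walk2, kv, ke):
    """Walk kernel of Eq. (4): product of vertex and edge kernels along two
    equal-length walks. A walk is (vertex_attrs, edge_attrs) with p+1
    vertices and p edges; for p = 0 it reduces to a single vertex kernel."""
    (v, e), (v2, e2) = walk, walk2
    k = 1.0
    for a, b in zip(v, v2):
        k *= kv(a, b)
    for a, b in zip(e, e2):
        k *= ke(a, b)
    return k

def walk_group_kernel(group, group2, kv, ke):
    """Walk-group kernel of Eq. (5): sum of walk kernels over all matched
    walk pairs drawn from the two groups."""
    return sum(walk_kernel(w, w2, kv, ke) for w, w2 in product(group, group2))
```

Because Eq. (4) is a product, a single mismatched vertex or edge (kernel value zero) zeroes out the whole walk pair, so only fully compatible walks contribute to Eq. (5).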
The proposed AGK on two adaptive graphs is computed as a summation of similarities between all pairs of walk groups in the two graphs. For the two adaptive graphs $G$ and $G'$, we denote the AGK with walk length $p$ as the $p$-th order AGK, defined as

$$k_G^p(G, G') = \frac{1}{P_{GG'}^p} \sum_{\substack{\varphi_G^p(i,j) \subseteq \varphi_G^p \\ \varphi_{G'}^p(r,s) \subseteq \varphi_{G'}^p}} k_g(\varphi_G^p(i, j), \varphi_{G'}^p(r, s)), \quad (6)$$

where $P_{GG'}^p$ is the number of matched walk pairs in walk groups of length $p$ in $G$ and $G'$. With a series of weight coefficients $(\mu_0, \mu_1, \mu_2, \ldots)$ to emphasize the importance of each order $k_G^p(G, G')$, the final AGK is calculated as a weighted summation of the different $p$-th order AGKs:

$$k_G(G, G') = \sum_{p=0} \mu_p k_G^p(G, G'). \quad (7)$$

From Eq. (7), we can see that if all $k_v(v, v')$ and $k_e(e, e')$ are equal to one, the AGK simplifies to counting the number of matched walks, which is the graph kernel proposed by [4]. In that case, the vertex attributes and edge attributes of the two adaptive graphs are treated as identical, and the AGK only measures the similarity of the structures of the two adaptive graphs.

4.2. Calculation of AGK

In [7], it has been proved that performing a simultaneous walk in two graphs is equivalent to performing a walk on their direct product graph. For two adaptive graphs $G$ and $G'$, we use the direct product graph to make the calculation of the AGK efficient. The direct product graph of $G$ and $G'$ is denoted $G_\times = (V_\times, E_\times)$, where

$$V_\times = \{(v_i, v'_r) \mid v_i \in V, \; v'_r \in V', \; k_v(v_i, v'_r) > 0\}, \quad (8)$$

$$E_\times = \{((v_i, v'_r), (v_j, v'_s)) \mid (v_i, v_j) \in E, \; (v'_r, v'_s) \in E', \; k_e((v_i, v_j), (v'_r, v'_s)) > 0\}. \quad (9)$$

Each vertex $(v_i, v'_r) \in V_\times$ is assigned a weight $w_{ir} = k_v(v_i, v'_r)$, and each edge $((v_i, v'_r), (v_j, v'_s)) \in E_\times$ is assigned a weight $w_{ir,js} = k_e((v_i, v_j), (v'_r, v'_s))$, where $ir$ and $js$ are vertex indices in $G_\times$. Given two matrices $W_V, W_E \in \mathbb{R}^{|V_\times| \times |V_\times|}$ containing the vertex and edge weights, with $[W_V]_{ir,ir} = w_{ir}$ and $[W_E]_{ir,js} = w_{ir,js}$, respectively, the final $p$-th order weight matrix $W^p$ of $G_\times$ is

$$W^p = W_V (W_E W_V)^p. \quad (10)$$

Let $W_E W_V = Q D Q^{-1}$ denote the spectral decomposition of $W_E W_V$; that is, the columns of $Q$ are its eigenvectors and $D$ is a diagonal matrix of the corresponding eigenvalues. The $p$-th order weight matrix $W^p$ of $G_\times$ can then be written as

$$W^p = W_V Q D^p Q^{-1}. \quad (11)$$

This simplifies Eq. (10) by taking the $p$-th power of a diagonal matrix, which decouples into scalar powers of its entries. According to Eq. (6), the $p$-th order AGK is then

$$k_G^p(G, G') = \frac{1}{P_{GG'}^p} \sum_{ir,js} [W^p]_{ir,js}. \quad (12)$$

Substituting the above equation into Eq. (7), the final AGK of $G$ and $G'$ is obtained.
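The calculation above can be sketched as follows, assuming $W_E W_V$ is diagonalizable and leaving out the $1/P_{GG'}^p$ normalization of Eq. (12); `kv_mat` holds the vertex-kernel values and `edge_weight_matrix` assembles $W_E$ from the two affinity matrices and edge attributes (all names are ours, not the paper's):

```python
import numpy as np

def edge_weight_matrix(A, A2, R, R2, ke):
    """W_E of the direct product graph (Eq. (9)): nonzero only where an
    edge exists in both graphs. R, R2 hold the edge attributes (RRJRD)."""
    n, n2 = A.shape[0], A2.shape[0]
    WE = np.zeros((n * n2, n * n2))
    for i in range(n):
        for j in range(n):
            if A[i, j]:
                for r in range(n2):
                    for s in range(n2):
                        if A2[r, s]:
                            WE[i * n2 + r, j * n2 + s] = ke(R[i, j], R2[r, s])
    return WE

def agk_order_p(kv_mat, WE, p):
    """Unnormalized p-th order AGK via Eqs. (10)-(12). kv_mat[i, r] =
    k_v(v_i, v'_r); its flattening gives the diagonal of W_V (Eq. (8))."""
    WV = np.diag(kv_mat.reshape(-1))
    D, Q = np.linalg.eig(WE @ WV)            # assumes W_E W_V is diagonalizable
    Wp = WV @ Q @ np.diag(D ** p) @ np.linalg.inv(Q)   # Eq. (11)
    return float(Wp.sum().real)              # Eq. (12) without the 1/P factor
```

The one-time eigendecomposition is what makes evaluating many orders $p$ cheap: each further order costs only scalar powers of the diagonal of $D$.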
5. Motion retrieval based on AGK matching

Equipped with the above information, we apply our AGK to the matching between two adaptive graphs. To do so, we must define the vertex kernel and the edge kernel.

First, the vertex kernel. As mentioned above, the vertex kernel measures the similarity of the vertex attributes. Following [23], we use the sample covariance matrix to represent the vertex attribute, since the probability distribution of the joint location is not known. To add temporal information to the sample covariance matrix, we construct the temporal pyramid of covariance on 3D joint locations as the vertex attribute, partly inspired by the Fourier Temporal Pyramid [18].

Suppose a human motion is represented by an adaptive graph with $K$ vertices denoting $K$ joints of the human body, and the motion spans $T$ frames. Let $J_i(t) = (x_i(t), y_i(t), z_i(t))$ $(1 \le i \le K)$ be the $i$-th joint at the $t$-th frame $(1 \le t \le T)$. The covariance descriptor of $J_i$ for the motion is $\mathrm{Cov}(J_i)$. Since the probability distribution of $J_i$ is not known, we take the sample covariance instead:

$$C_{J_i} = \frac{1}{T-1} \left( J_i - \bar{J}_i \right)' \left( J_i - \bar{J}_i \right), \quad (13)$$

where $\bar{J}_i$ is the sample mean of $J_i$ and $'$ is the transpose operator. The sample covariance matrix $C_{J_i}$ is a symmetric $3 \times 3$ matrix, so we only use its upper triangle, which has $3(3+1)/2 = 6$ entries; this is the length of the descriptor vector. This 3D joint descriptor captures the dependence between the coordinates of the joint over the course of the motion, but not its temporal properties. Hence, we propose a temporal pyramid of covariance of the joint location as the joint descriptor. In the temporal pyramid, the top level of the 3D joint descriptor is calculated over the entire motion sequence; Fig. 4 shows a three-level temporal pyramid. In this pyramid, each covariance matrix is identified by two indices: the temporal pyramid level and the index within the level. The top-level matrix covers the entire motion and is denoted $L_0^0$; a covariance matrix at level $l$ is calculated over $T/2^l$ frames of the motion. Using this 3D joint descriptor as the vertex attribute, the vertex kernel is defined as

$$k_v(v, v') = \begin{cases} \exp\!\left(-\dfrac{\|d_1 - d'_1\|_2^2}{2\sigma_1^2}\right) & \text{if } \|d_1 - d'_1\|_2 < \varepsilon_{d_1}, \\ 0 & \text{otherwise,} \end{cases} \quad (14)$$

where $d_1$ and $d'_1$ are the 3D joint descriptors of the vertices $v$ and $v'$, which represent the same joint of the human body. $\sigma_1 > 0$ is a scale parameter of the Gaussian function, and $\varepsilon_{d_1} > 0$ is a threshold. Similarly, the edge kernel is defined as

$$k_e(e, e') = \begin{cases} \exp\!\left(-\dfrac{\|d_2 - d'_2\|_2^2}{2\sigma_2^2}\right) & \text{if } \|d_2 - d'_2\|_2 < \varepsilon_{d_2}, \\ 0 & \text{otherwise,} \end{cases} \quad (15)$$

where $d_2$ and $d'_2$ are the RRJRD values forming the attributes of the edges $e$ and $e'$, respectively. $\sigma_2 > 0$ is a scale parameter of the Gaussian function, and $\varepsilon_{d_2} > 0$ is a threshold.

Fig. 5 shows that the RRJRD values used in the edge kernels are reasonable. In Fig. 5, six human motions (two cartwheel motions, two sit-down-chair (sdchair) motions, two clap motions) are chosen to compare their top-50 RRJRD values. For the same kind of motion, the value distributions of the RRJRD tend to be similar, while for different kinds of motion they tend to differ. Hence, the edge kernels are meaningful.

Let $M$ and $M'$ be two 3D human motion sequences represented by their adaptive graphs $G$ and $G'$, respectively, and let $m$ be the maximal order of the AGK. The final AGK matching of the two motions is

$$k(G, G') = \sum_{p=0}^{m} \mu_p k_G^p(G, G'), \quad (16)$$

where $\Omega = (\mu_0, \mu_1, \ldots, \mu_m)$, $\Omega > 0$, is a weight vector, and $k_G^0(G, G'), \ldots, k_G^m(G, G')$ are calculated by Eq. (12). With the final AGK matching in place, the 3D human motion retrieval system is complete.
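A minimal sketch of the vertex attribute (Eq. (13) plus the temporal pyramid) and of the truncated Gaussian form shared by Eqs. (14)-(15); segment splitting and all names are our illustrative choices:

```python
import numpy as np

def cov_descriptor(traj):
    """Vertex-attribute building block: the 6 upper-triangle entries of the
    3x3 sample covariance (Eq. (13)) of one joint trajectory (T, 3)."""
    C = np.cov(traj, rowvar=False)          # (1/(T-1))(J - mean)'(J - mean)
    return C[np.triu_indices(3)]            # 3(3+1)/2 = 6 values

def temporal_pyramid(traj, levels):
    """Concatenated covariance descriptors: level l splits the T frames
    into 2**l equal segments (cf. Fig. 4)."""
    T = traj.shape[0]
    parts = []
    for l in range(levels):
        seg = T // 2 ** l
        for i in range(2 ** l):
            parts.append(cov_descriptor(traj[i * seg:(i + 1) * seg]))
    return np.concatenate(parts)

def truncated_gaussian_kernel(d, d2, sigma, eps):
    """Form shared by the vertex kernel (Eq. (14)) and edge kernel (Eq. (15)):
    a Gaussian of the attribute distance, zero beyond the threshold eps."""
    diff = np.linalg.norm(np.atleast_1d(d) - np.atleast_1d(d2))
    return float(np.exp(-diff ** 2 / (2 * sigma ** 2))) if diff < eps else 0.0
```

A three-level pyramid yields $1 + 2 + 4 = 7$ covariance matrices per joint, i.e. a $7 \times 6 = 42$-dimensional vertex attribute.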
6. Generalized multiple kernel learning

We use the GMKL method [12] to learn the weight of each AGK order. Given $z$ training motions $\{(M_q, y_q)\}_{q=1}^{z}$, where $M_q$ is an input motion sequence represented by an adaptive graph $G_q$ and $y_q$ is the motion label of $M_q$, we build a set of $z \times z$ kernel matrices $\{K_0, K_1, \ldots, K_m\}$ for the training motions, with $[K_i]_{q,q'} = k_G^i(G_q, G_{q'})$ for $0 \le i \le m$. According to Eq. (16), the final kernel matrix $K$ for the training motions is

$$K = \mu_0 K_0 + \sum_{p=1}^{m} \mu_p K_p, \quad (17)$$

where $\Omega = (\mu_0, \mu_1, \ldots, \mu_m)$ is the kernel weight vector defined in Eq. (16). Following a similar method to [22], we apply an $l_{1,2}$-norm regularization on the AGK weights to obtain an optimal linear combination of the different-order AGKs:

$$\tau(\Omega) = \frac{1}{2} \left\| \left( |\mu_0|_1, \; |\mu_1, \ldots, \mu_m|_1 \right) \right\|_2^2 = \frac{1}{2} \left( \mu_0^2 + \Big( \sum_{i=1}^{m} \mu_i \Big)^2 \right). \quad (18)$$

Let $Y$ be a diagonal matrix with the motion class labels $y_q$ on the diagonal, and let $K$ be the kernel matrix of Eq. (17). The dual problem of GMKL is expressed as $\min_{\Omega} D(\Omega)$, where

$$D(\Omega) = \max_{\alpha} \; \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T Y K Y \alpha + C_1 \tau(\Omega), \quad (19)$$

$$\text{subject to } \mathbf{1}^T Y \alpha = 0, \quad 0 \le \alpha \le C_2, \quad \Omega \ge 0,$$

where $\alpha$ is the vector of Lagrange multipliers, and $C_1$ and $C_2$ are two constants controlling the importance of the regularization on the AGK weights and of the hinge loss, respectively. Due to the differentiability of $D(\Omega)$, its derivatives can be formulated as

$$\frac{\partial D(\Omega)}{\partial \mu_0} = C_1 \mu_0 - \frac{1}{2} \alpha^T Y K_0 Y \alpha, \qquad \frac{\partial D(\Omega)}{\partial \mu_i} = C_1 \sum_{j=1}^{m} \mu_j - \frac{1}{2} \alpha^T Y K_i Y \alpha, \quad (20)$$

where $1 \le i \le m$. The $\alpha$ and $\tau(\Omega)$ are then obtained by solving this minimax optimization problem. The iterative process is divided into two phases. In the first phase, $\Omega$ is fixed, so $K$ and $\tau(\Omega)$ are constants, and $\alpha$ can be computed with any SVM solver. In the second phase, $\alpha$ is kept fixed and $\Omega$ is updated by projected gradient descent: the weights are updated by $\mu_k^{t+1} = \mu_k^t - s_t \, (\partial D / \partial \mu_k)$ and projected onto the feasible set via $\mu_k^{t+1} = \max(0, \mu_k^{t+1})$, where $k \in \{0, 1, 2, \ldots, m\}$ and the step size $s_t$ is chosen by the Armijo rule. The two phases are iterated until convergence.

Fig. 4. Temporal construction of the covariance descriptor. $L_i^l$ is the $i$-th covariance matrix in the $l$-th level of the temporal pyramid. A covariance matrix at the $l$-th level covers $T/2^l$ frames of the sequence, where $T$ is the length of the entire human motion sequence.
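The weight-update phase can be sketched as follows. For clarity this sketch uses a plain $l_2$ regularizer $\frac{1}{2}\|\Omega\|^2$ in place of the $l_{1,2}$ regularizer of Eq. (18) (which changes only the first gradient term), and assumes $\alpha$ has already been obtained from an SVM solver; all names are ours:

```python
import numpy as np

def combined_kernel(mu, K_list):
    """Final kernel matrix K = sum_p mu_p K_p (Eq. (17))."""
    return sum(m * K for m, K in zip(mu, K_list))

def gmkl_weight_step(mu, alpha, Y, K_list, C1, step):
    """One projected-gradient update of the AGK weights (second phase),
    with alpha held fixed. NOTE: uses a plain l2 regularizer, so the
    gradient is C1 * mu_p plus the data term of Eq. (20)."""
    grad = np.empty_like(mu)
    for p, Kp in enumerate(K_list):
        grad[p] = C1 * mu[p] - 0.5 * alpha @ Y @ Kp @ Y @ alpha
    return np.maximum(0.0, mu - step * grad)   # projection onto mu >= 0
```

Alternating this update with an SVM solve for $\alpha$ gives the two-phase iteration described above.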
7. Runtime complexity analysis
Fig. 5. Values of top-50 RRJRD in several human motions.
In the AGK matching method, the runtime complexity of the retrieval process depends on the size of the database and the number of vertices in the adaptive graph. In this method, the vertices $v$ and $v'$ in $k_v(v, v')$ represent the same joints of the human body. Hence, in
Section 4.2, $W_E W_V \in \mathbb{R}^{|V_\times| \times |V_\times|}$. Since computing the spectral decomposition of a dense matrix takes time cubic in its size [9], the runtime complexity of our motion retrieval method is $O(m^3 n)$, where $m = |V_\times|$ is no more than 24 and $n$ is the number of motion files. Our proposed motion retrieval algorithm is therefore linear in the number of motion files in the database, which allows the real time human motion retrieval reported in Section 8.
8. Experimental results

In this section, we test the proposed motion retrieval approach on the public HDM05 database. All experiments were executed on a desktop computer with an Intel Core 2 Duo 3.17 GHz processor. The HDM05 database consists of 130 different motion classes, with multiple trials performed by five subjects in each class. 13 different kinds of human motions were chosen to set up our experimental database: clapping, walking, running, kicking, jumping jack (jumpj), sitting down on a chair (sdchair), hopping, punching, cartwheeling, squatting, grabbing the floor (gfloor), throwing, and rotating the right arm (rrarm). In this experimental database, we use 1132 motion clips for training and the remaining 273 motion clips, with a total of 51,087 frames, for testing. In the rest of this section, we describe the experiments that verify the effectiveness of our proposed human motion retrieval system. We evaluate the performance of the proposed retrieval system in two aspects: accuracy and robustness.

Accuracy: To evaluate the accuracy of our method, we compared our human motion retrieval method with three other works [24–26], using the top-50 RRJRD with a six-level temporal pyramid. To compare with the three other works in retrieval accuracy, we follow a similar setting to that in [23]. The objective of motion retrieval is to retrieve similar motions that belong to the same class as the query motion. Let ST represent the single-type data set that contains all other motions in the same class as the query motion, and let MTs represent the mixed types of motions. Given a query motion, retrieval is performed by computing and ranking the similarities with the MTs. The true positive (TP) ratio is used as the accuracy criterion to evaluate the retrieval results; it is defined as the percentage of correctly retrieved results from MTs that also belong to ST. Here, we used the top 20 results for the TP calculation. Fig. 6 demonstrates that our method outperforms the other three methods. The average true positive ratio of our method is 0.952, compared with 0.793, 0.835, and 0.869 for the other three methods. The average retrieval runtime per query motion is 49 ms. These results demonstrate that our method achieves higher accuracy in real time retrieval.
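The TP ratio described above can be computed directly from a ranked result list. A minimal sketch (function name and arguments are our own illustration):

```python
def true_positive_ratio(ranked_labels, query_label, k=20):
    """Fraction of the top-k retrieved motions whose class label
    matches the query's class (the TP ratio used in the evaluation).

    ranked_labels: class labels of retrieved motions, best match first.
    """
    top_k = ranked_labels[:k]
    return sum(1 for lbl in top_k if lbl == query_label) / len(top_k)
```

For example, if 15 of the top-20 results share the query's class, the TP ratio is 0.75.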
Fig. 6. True positive ratios of the proposed method and the other methods: Müller et al. [24], Deng et al. [25], and Zhou et al. [26].
In addition, we generate average precision–recall curves to compare the retrieval performance of our approach with the three other works. Precision is defined as the ratio of correctly retrieved motions to the total number of retrieved motions, and recall is defined as the ratio of correctly retrieved motions to the total number of relevant motions in the motion database. The average precision–recall curves are shown in Fig. 7. From these curves, it can be seen that our approach performs considerably better than the other three methods.

Robustness: Besides evaluating retrieval accuracy, we also analyze the robustness of our approach. In the rest of this subsection, we verify the robustness of our proposed retrieval system in three aspects: changing the edge attributes by varying the value of the parameter N, changing the vertex attributes by varying the number of levels, and changing the number of available frames in the query motion.

In the adaptive graph construction, the top-N RRJRD is used to determine which N edges should be used to construct a graph representing a given motion. We vary the value of N and perform the retrieval experiment to examine how sensitive the accuracy is to this parameter, which is related to the edge attributes. As illustrated in Fig. 8, the retrieval performance is optimal and stable when N ranges from 40 to 70, and gradually decreases when N deviates from this range.
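The precision and recall values underlying such curves can be accumulated while walking down the ranked list. A minimal sketch (names are our own illustration):

```python
def precision_recall_curve(ranked_labels, query_label, n_relevant):
    """Precision and recall after each retrieved item.

    precision = correct retrieved / total retrieved so far,
    recall    = correct retrieved / relevant motions in the database.
    """
    precisions, recalls = [], []
    correct = 0
    for i, lbl in enumerate(ranked_labels, start=1):
        if lbl == query_label:
            correct += 1
        precisions.append(correct / i)
        recalls.append(correct / n_relevant)
    return precisions, recalls
```

Averaging these per-query curves over all test queries yields the curves plotted in Fig. 7.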
Fig. 7. The precision–recall curves of the proposed method and the other methods: Müller et al. [24], Deng et al. [25], and Zhou et al. [26].
Fig. 8. The average retrieval accuracy for different values of N in the top-N RRJRD.
Table 1
Results obtained using different numbers of levels in the temporal pyramid.

Levels (l):  1      2      3      4      5      6
TP ratio:    0.923  0.931  0.936  0.945  0.949  0.952
9. Conclusion

In this paper, we have proposed a novel graph-based method for real time 3D human motion retrieval. First, we constructed a novel graph model based on the top-N RRJRD, which was proposed to model the spatial structures of different motions. Then, we adopted the temporal pyramid of covariance descriptors to preserve a certain level of spatio-temporal local features. Next, we computed the graph kernel by matching the walks from each of the two graphs being compared. Finally, we applied multiple kernel learning to determine the optimal weights for combining the graph kernels to measure the overall similarity between two motions. Our experimental results show that our approach is robust under several variations and performs better than three state-of-the-art methods.
Appendix A. Supplementary data

Supplementary data associated with this paper can be found in the online version at http://dx.doi.org/10.1016/j.cag.2015.07.005.

Fig. 9. The retrieval accuracy when a fraction of the original query motion is presented as input. (For interpretation of the references to color in this figure, the reader is referred to the web version of this paper.)
In the temporal pyramid, the vertex attributes are obtained as the covariance descriptor at each level. In this experiment, we evaluated the effect of using various amounts of vertex attributes by applying different numbers of levels in the temporal pyramid. From the results presented in Table 1, it can be seen that the retrieval accuracy increases with the number of levels in the temporal pyramid. In general, we can deduce that adding more levels improves the retrieval accuracy, provided that there are enough frames at the lowest levels to extract meaningful vertex attributes. It can also be observed that even with one-level vertex attributes, our approach still performs better than the three other existing approaches.

Our approach assumes that the input query contains a whole motion with which we try to retrieve similar whole motions from the data set. In this experiment, we study the performance of our proposed approach when only a fraction of the original query motion is presented as input. We applied two ways of reducing the number of frames taken from the original query motion: (1) sampling several consecutive frames; (2) sampling arbitrary frames. The retrieval accuracy versus the percentage of sampled frames is illustrated in Fig. 9. The bottom green curve corresponds to the first case, sampling consecutive frames, while the top blue curve corresponds to sampling arbitrary frames. In our experimental settings, the average number of frames in the original query motions is 189. From the results, it can be observed that our approach still performs well when 40% of the original frames are sampled as consecutive frames and used as the input query. On the other hand, when the frames are sampled arbitrarily, our approach still gives good performance when only 10% of the original frames are used.
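One common way to realize such a pyramid of covariance descriptors is sketched below (an assumption for illustration, following the general idea of [23] rather than the paper's exact windowing): level l splits the clip into 2^(l-1) equal temporal segments and computes one covariance descriptor per segment.

```python
import numpy as np

def covariance_descriptor(frames):
    """Covariance of joint-coordinate features over a window of frames.
    frames: (T, d) array with T frames and d = 3 * number of joints."""
    return np.cov(frames, rowvar=False)

def temporal_pyramid(frames, levels):
    """Level l splits the clip into 2**(l-1) equal segments and returns
    one (d x d) covariance descriptor per segment, for all levels."""
    descriptors = []
    for l in range(1, levels + 1):
        segments = np.array_split(frames, 2 ** (l - 1))
        descriptors.extend(covariance_descriptor(seg) for seg in segments)
    return descriptors
```

With this layout, a six-level pyramid produces 1 + 2 + ... + 32 = 63 descriptors per motion, and the lowest-level segments must still contain enough frames for the covariance to be meaningful.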
The reason why sampling arbitrary frames yields better performance than sampling consecutive frames is that the frames in the former case are more spread out, so they are more likely to capture the overall characteristics of the original whole query motion.
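The two sampling schemes compared above can be sketched as follows (function names are our own illustration):

```python
import numpy as np

def sample_consecutive(frames, fraction, rng):
    """Keep a random contiguous block covering `fraction` of the clip."""
    n = len(frames)
    k = max(1, int(round(fraction * n)))
    start = rng.integers(0, n - k + 1)
    return frames[start:start + k]

def sample_arbitrary(frames, fraction, rng):
    """Keep `fraction` of the frames chosen uniformly at random, in
    temporal order; spread-out samples cover more of the motion."""
    n = len(frames)
    k = max(1, int(round(fraction * n)))
    idx = np.sort(rng.choice(n, size=k, replace=False))
    return frames[idx]
```

A consecutive block can miss entire phases of a motion, while uniformly scattered frames touch all of them, which is consistent with the gap between the two curves in Fig. 9.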
References

[1] Borgwardt KM, Ong CS, Schönauer S, Vishwanathan S, Smola AJ, Kriegel HP. Protein function prediction via graph kernels. Bioinformatics 2005:47–56.
[2] Borzeshi EZ, Piccardi M, Xu R. A discriminative prototype selection approach for graph embedding in human action recognition. In: ICCV; 2011. p. 1295–301.
[3] Tang JK, Leung H, Komura T, Shum HP. Emulating human perception of motion similarity. Comput Animat Virtual Worlds 2008;19:211–21.
[4] Gärtner T, Flach P, Wrobel S. On graph kernels: hardness results and efficient alternatives. In: Learning theory and kernel machines, vol. 2777; 2003. p. 129–43.
[5] Gaur U, Zhu Y, Song B, Roy-Chowdhury A. A string of feature graphs model for recognition of complex activities in natural videos. In: ICCV; 2011. p. 2595–602.
[6] Harchaoui Z, Bach F. Image classification with segmentation graph kernels. In: CVPR; 2007. p. 1–8.
[7] Imrich W, Klavžar S, Gorenec B. Product graphs: structure and recognition. New York: Wiley; 2000.
[8] Gauzère B, Brun L, Villemin D, Brun M. Graph kernels based on relevant patterns and cycle information for chemoinformatics. In: ICPR; 2012. p. 1775–8.
[9] Golub GH, Van Loan CF. Matrix computations. 3rd ed. Baltimore, MD: Johns Hopkins University Press; 1996.
[10] Raja K, Laptev I, Pérez P, Oisel L. Joint pose estimation and action recognition in image graphs. In: ICIP; 2011. p. 25–8.
[11] Ta AP, Wolf C, Lavoué G, Baskurt A. Recognizing and localizing individual activities through graph matching. In: AVSS; 2010. p. 196–203.
[12] Varma M, Babu BR. More generality in efficient multiple kernel learning. In: ICML; 2009. p. 1065–72.
[13] Vishwanathan S, Schraudolph NN, Kondor R, Borgwardt KM. Graph kernels. J Mach Learn Res 2010;11:1201–42.
[14] Qi T, Feng Y, Xiao J, Zhuang Y, Yang X, Zhang J. A semantic feature for human motion retrieval. Comput Animat Virtual Worlds 2013;24:399–407.
[15] Xiao Q, Wang Y, Wang H. Motion retrieval using weighted graph matching. Soft Comput 2015;19:133–44.
[16] Çeliktutan O, Wolf C, Sankur B, Lombardi E. Fast exact hyper-graph matching with dynamic programming for spatio-temporal data. J Math Imaging Vis 2015;51:1–21.
[18] Wang J, Liu Z, Wu Y, Yuan J. Mining actionlet ensemble for action recognition with depth cameras. In: CVPR; 2012. p. 1290–7.
[19] Müller M, Röder T, Clausen M. Efficient content-based retrieval of motion capture data. ACM Trans Graph 2005:677–85.
[20] Kapadia M, Chiang IK, Thomas T, Badler NI, Kider JT Jr. Efficient motion retrieval in large motion databases. In: Proceedings of the ACM SIGGRAPH symposium on interactive 3D graphics and games (I3D); 2013. p. 19–28.
[21] Tang JKT, Leung H. Retrieval of logically relevant 3D human motions by adaptive feature selection with graded relevance feedback. Pattern Recognit Lett 2012;4:420–30.
[22] Wu B, Yuan C, Hu H. Human action recognition based on context-dependent graph kernels. In: CVPR; 2014. p. 4321–7.
[23] Hussein M, Torki M, Gowayyed M, El-Saban M. Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In: IJCAI; 2013. p. 2466–72.
[24] Müller M, Röder T. Motion templates for automatic classification and retrieval of motion capture data. In: Proceedings of the 2006 ACM SIGGRAPH/Eurographics symposium on computer animation (SCA); 2006. p. 137–46.
[25] Deng Z, Gu Q, Li Q. Perceptually consistent example-based human motion retrieval. In: Proceedings of the 2009 symposium on interactive 3D graphics and games (I3D '09); 2009. p. 191–8.
[26] Zhou L, Lu Z, Leung H, Shang L. Spatial temporal pyramid matching using temporal sparse representation for human motion retrieval. Vis Comput 2014;30:845–54.