ARTICLE IN PRESS
JID: INS
[m3Gsc;January 12, 2016;19:59]
Information Sciences xxx (2016) xxx–xxx
Contents lists available at ScienceDirect
Information Sciences journal homepage: www.elsevier.com/locate/ins
Human motion recovery jointly utilizing statistical and kinematic information
Guiyu Xia, Huaijiang Sun∗, Guoqing Zhang, Lei Feng
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, PR China
Article info
Article history: Received 22 April 2015; Revised 18 November 2015; Accepted 25 December 2015; Available online xxx
Keywords: Motion capture data; Sparse coding; Motion recovery; Kinematic constraint
Abstract
Human motion data that are captured by markers attached to an actor’s body have been widely used in many areas. However, occlusion caused by the actor’s body or clothing might make several markers go missing for a period of time during the capture process, which highlights the need for motion recovery in the human motion capture process. In recent years, low-rank matrix completion and sparse coding have been used in many data-driven motion recovery methods. However, applying them directly to recover missing data is not effective because low rank is only a basic statistical property of human motion. In addition, the dictionary is usually learned and used in a complete feature space, while human motion must be recovered from an incomplete feature space. Moreover, low-rank matrix completion and sparse coding take advantage only of the statistical property and ignore another important property, i.e., the kinematic property of human motion. Inspired by coupled dictionary learning, we modify the traditional dictionary learning process and propose a new process for the special task of motion recovery. The new recovery process jointly utilizes statistical and kinematic information. Within the proposed method, we first learn a dictionary from a large number of complete–incomplete training frame pairs to preserve the statistical information of motion data. Then, with the smoothness constraint and the bone-length constraint, which bring the kinematic information into the recovery process, we recover motions using sparse representations of incomplete frames and the learned dictionary through an optimization model. Additionally, we employ two gradient-based optimization algorithms for dictionary learning and motion recovery. Extensive experimental results and comparisons with four other state-of-the-art methods demonstrate the effectiveness of the proposed method.
© 2016 Published by Elsevier Inc.
1. Introduction
With the rapid development of virtual reality technology, human motion capture data are widely used in many areas, such as the movie industry, computer games and sports training. It is a new type of multi-media data that mainly describe human motions. Although traditional multi-media data such as videos can also record human motion with image sequences, human motion related studies based on videos, such as human motion analysis [25] and pose estimation [12,14,31], are more difficult because image data are very complicated and may include a considerable amount of useless or redundant
∗ Corresponding author. Tel.: +8613905172533.
E-mail addresses: [email protected] (G. Xia), [email protected] (H. Sun), [email protected] (G. Zhang), [email protected] (L. Feng).
http://dx.doi.org/10.1016/j.ins.2015.12.041
0020-0255/© 2016 Published by Elsevier Inc.
Please cite this article as: G. Xia et al., Human motion recovery jointly utilizing statistical and kinematic information, Information Sciences (2016), http://dx.doi.org/10.1016/j.ins.2015.12.041
Fig. 1. Sketch of human motion capture system and motion occlusion.
information. In contrast, human motion capture data are very clean and contain only the positions or orientations of a tree structure that corresponds to a human skeleton. A recent notable application of motion capture data is the movie Avatar, where the exciting applications start with an accurate acquisition of high-quality human motion data.

Motion capture is a prevalent technique for capturing and analyzing human articulation. Corresponding to a temporal sequence, human motion is captured as motion frames that record the positions or orientations of all of the joints at every capture time point. As shown in the left subfigure of Fig. 1, an optical motion capture system such as Vicon [1] uses video cameras to track the movements of a set of reflective markers that are strategically attached to the actor’s body. However, even with costly professional motion capture equipment, motion capture data may still be inaccurate or incomplete. When the actor is performing, some markers may not be visible to certain cameras because of occlusion by the actor’s body or clothing, so that the captured motion resembles the right subfigure of Fig. 1. To ensure high-quality performance in every application, the corrupted data should be pre-processed. Therefore, an important branch of motion capture research handles two highly correlated and frequently co-occurring sub-problems: one is to predict missing values in motion data, and the other is to remove noise and outliers. The former sub-problem is called motion recovery, and the latter is called motion denoising. Although their respective tasks are slightly different, the techniques used to implement them are very similar. In this paper, we mainly focus on motion recovery; however, we do not strictly distinguish motion recovery from motion denoising because the techniques they use are so similar.

Human motion recovery is challenging because human motion involves highly coordinated movements.
Standard signal denoising technologies (if we take the missing values as noise), such as the Gaussian low-pass filter and the Kalman filter [8,42,43], often process each DOF independently, so the filtered motions often appear uncoordinated or unnatural because the spatial-temporal characteristics are undermined. Interpolation methods [13,26,34,45] can preserve the spatial-temporal characteristics embedded in natural human motion and can effectively estimate the missing markers. Although motion data are recorded in a linear space, they are essentially nonlinear. Using interpolation methods implicitly assumes that the motion data are linear; thus, these methods are not theoretically well founded. On the other hand, assuming that motion data are linear makes recovery easy and fast in a linear space. Inspired by locally linear embedding [35], we assume that motion data are locally linear even though they form a manifold structure globally, so that a locally linear method can address the motion recovery problem.

Sparse coding [30,36] has become a hot research topic in recent years, and has been used to solve many practical problems in signal processing, statistical recognition and pattern recognition [32]. Additionally, locality-constrained linear coding, a variant of sparse coding, has even been successfully used to perform human pose estimation [37] and colorization of gray-scale facial images [24]. The core idea of sparse coding is to find a sparse representation for the target signal using an overcomplete dictionary. Its essence is a locally linear interpolation, so the atoms selected from the dictionary are in fact scaled neighbors of the target signal. This characteristic of sparse coding matches our assumption well. In this paper, we present a human motion recovery method that uses a dictionary to represent and recover incomplete frames.
Moreover, we borrow the ideas of compressive sensing [2] and image super-resolution [50,51], which are very successful applications of sparse coding. Their dictionary learning processes aim to enable the dictionary to reconstruct complete signals from incomplete signals. We adopt this idea and propose a new dictionary learning process tailored to the special task of motion recovery in practice, which will be discussed in detail in Section 4.1.

Another challenge of human motion recovery is that the recovered motion must satisfy certain kinematic constraints. Therefore, the proposed method exploits not only the statistical characteristics but also the kinematic characteristics, while other data-driven methods mostly focus on the former and seldom utilize the latter. Bone length is a rigid kinematic constraint that is often ignored in many data-driven methods [10,16,26,27,46,47]. It provides some useful information that can
help the optimization algorithm shrink the search space. For instance, if only one middle marker is missing, we need to search for it only on a circle, because the positions of its two neighbor markers and the lengths of the two segments connected to it determine that the missing marker must lie on that circle. We take bone length as a constraint of the objective function (7) while using a learned dictionary to recover motions. This ensures that the recovered motion satisfies the bone-length constraint. Additionally, smoothness is incorporated into the objective function (7) to guarantee the temporal stability of the recovered motion.

In brief, our method provides the following three contributions: (1) inspired by the ideas of compressive sensing [2] and image super-resolution [50,51], we propose a new dictionary learning process for the special problem of motion recovery; (2) we successfully solve the nonlinear problem of motion recovery in a linear space by using our learned dictionary; and (3) the proposed method simultaneously takes the constraints of smoothness and bone length into account, while many other works [3,11,16,39,46,47] usually ignore one or even both of them. In addition, our learned dictionary is robust to different sets of missing markers because the training set contains incomplete frames with a variety of missing markers. When a marker is missing for a period of time (e.g., 1 s), which is a reasonable circumstance in practice, the missing trajectory of the marker may be longer than a sample trajectory. In this case, a trajectory-based recovery system cannot even form a sample; therefore, some trajectory-based methods [16,39] may be ineffective. In our method, a sample is composed of the positions of all of the joints in one frame, and the incomplete motion data are recovered frame by frame. Therefore, the length of time for which a marker is missing has nearly no influence on the accuracy of recovery.
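The circle mentioned in the bone-length example above can be written out explicitly. The following worked equations are an illustrative sketch; the symbols $p_1$, $p_2$ (positions of the two neighbor markers), $l_1$, $l_2$ (lengths of the two connected segments), $s$, $u$, $a$, $c$ and $r$ are introduced here for illustration and are not part of the paper's notation.

$$
\|x - p_1\|_2 = l_1, \qquad \|x - p_2\|_2 = l_2, \qquad s = \|p_2 - p_1\|_2, \qquad u = \frac{p_2 - p_1}{s}
$$

Subtracting the two squared constraints shows that every admissible $x$ has the same projection $a$ onto the axis $u$:

$$
a = \frac{l_1^2 - l_2^2 + s^2}{2s}, \qquad c = p_1 + a\,u, \qquad r = \sqrt{l_1^2 - a^2}
$$

Hence the missing marker lies on the circle with center $c$ and radius $r$ in the plane orthogonal to $u$ (provided $l_1^2 \ge a^2$), which reduces the search from three degrees of freedom to a single angle.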
Our experiments show promising results. The remaining part of this paper is structured as follows. Section 2 briefly surveys the related work, and Section 3 presents the framework of our method. In Section 4, we elaborate on the detailed algorithms of the proposed method. The experimental results and the performance analysis are presented in Section 5. In Section 6, we discuss the optimization algorithm used and the RGB-D camera. Finally, we draw conclusions in Section 7.
2. Related work
Because the filled data must satisfy some intrinsic constraints such as smoothness and bone-length constraints, human motion recovery is a special case of the missing-data filling problems that have long been studied in statistics. Thus, some specific methods have been proposed for motion recovery. Linear interpolation [13,26,34,45] is a class of traditional methods. These approaches can preserve the spatial-temporal characteristics embedded in natural human motion, but they can effectively predict the missing markers for only a very short period of time (typically less than 0.5 s). Kalman filters [8,18,42] have also been used to predict the missing markers in the current frame with the available temporal information. However, these two kinds of traditional methods tend to be ineffective when markers are missing for longer periods, and they often require manual intervention. Li et al. [22] employ B-spline wavelets to decompose the incomplete motion data into a multi-resolution framework and recover the missing data by smoothing the high-magnitude coefficients. This approach is usually utilized for rigid motion data processing. Unfortunately, it may be unsuitable for handling articulated motion capture data. To tackle this problem, Lou and Chai [28] first learned a series of filter bases from complete motion data and subsequently employed robust statistics techniques to filter the noisy motions. This approach is a data-driven method whose key advantage is the preservation of spatial-temporal patterns; however, only the top K filter bases are kept and used in the subsequent motion denoising phase. Therefore, this approach cannot recover some motion details. Recently, Xiao et al. [46] cast the prediction of missing markers as finding a sparse representation of the incomplete pose over the training set, and then use this representation to predict the missing data.
They successfully introduce the ℓ1-sparse representation method into human motion recovery, but they use the training set instead of a learned dictionary to represent incomplete frames. The accuracy of reconstruction with a learned dictionary is generally higher than that with an unlearned dictionary of the same size. Hou et al. [16] use a dictionary learned from the trajectories of all of the joints to predict missing data. This approach makes full use of the smoothness of trajectories, but ignores the correlations among different joints. More importantly, when the missing trajectories are longer than the atoms in the dictionary, this approach becomes invalid. Xiao et al. [47] divide each human pose into five partitions and learn five dictionaries for these partitions. Then, they use these dictionaries to represent the noisy poses and denoise them. This approach can reduce the dimensionality of a sample and make the dictionaries capture more detailed information for each partition. However, it abandons the relations between these partitions.

In addition to sparse coding, low-rank matrix completion [4] has also been used to recover human motion [10,27,39]. Tan et al. [39] propose to use a trajectory-sequenced matrix instead of a frame-sequenced matrix [10,19,27] for completion, because the former has a lower rank than the latter and the lower rank can improve the reliability of low-rank matrix completion. Of course, like other trajectory-based methods, this method is sensitive to the length of time for which markers are missing. Feng et al. [10] propose to exploit the temporal stability and low-rank structure to implement motion refinement, because imposing only the low-rank structure property of motion data in the objective function does not guarantee that the recovered motion is smooth enough. Liu et al. [27] propose an automatic motion data denoising approach via filtered subspace clustering and low-rank matrix completion.
To reduce the rank of every matrix, human motion is separated into several segments that are represented by motion matrices. However, low rank is only a basic statistical property of human motion, and many
particular and advanced statistical properties of human motion are not used. Moreover, the low-rank matrix completion theory would fail to handle some badly corrupted motion sequences.

Considering the above problems, our approach uses a frame-based dictionary, which is learned from both incomplete and complete capture data, to represent and recover incomplete data. The learned dictionary can capture a considerable amount of advanced statistical information from a large amount of complete motion data, while low-rank based methods only utilize the basic property of low rank from the to-be-recovered motion data. Thanks to sparse coding, the linear combination coefficients of the selected atoms can yield a more detailed recovery result. Moreover, our approach is insensitive to the length of time for which markers are missing. In addition, the bone-length constraint is taken into account as a rigid constraint of the objective function, while many researchers ignore it.

Another type of related work is image super-resolution [50,51]. Yang et al. [50] use a joint dictionary to reconstruct high-resolution images from low-resolution images. By jointly training two dictionaries for the low-resolution and high-resolution image patches, they can enforce the similarity of the sparse representations of a low-resolution and high-resolution image patch pair with respect to their own dictionaries. However, they later found that this joint sparse coding scheme can only be considered optimal in the concatenated feature space but not in each feature space individually. They therefore propose [51] a new coupled dictionary training method for image super-resolution. The learning process enforces that the sparse representation of a low-resolution image patch in terms of the low-resolution dictionary can satisfactorily reconstruct its underlying high-resolution image patch.
Different from the problem of image super-resolution, the feature subspace of incomplete motion data is a part of the feature space of the complete motion data. Additionally, the missing dimensions correspond exactly to the missing markers, while the feature space of the low-resolution image does not have an understandable relation with the feature space of the high-resolution image. More importantly, the dimensionality of incomplete motion data is not fixed, while the dimensionalities of both the low-resolution image and the high-resolution image are fixed. In our work, we extend the idea of coupled dictionary training to human motion recovery, but we learn only one dictionary because the dimensionality of the “small” dictionary cannot be fixed when learning a pair of dictionaries. Our dictionary learning process enforces that the sparse representation of incomplete motion data satisfactorily recovers its underlying complete motion data. The new dictionary learning algorithm will be introduced in Section 4.1.
3. Problem formulation and solution overview
Let $M \in \mathbb{R}^{3d \times F}$ denote a motion sequence that records the positions of all joints in all frames, where $d$ denotes the number of joints in a human skeleton, $3d$ represents the dimensionality of one frame and $F$ denotes the total number of frames. Let $f \in \mathbb{R}^{3d}$ denote a frame, which is a column of $M$. Motion data are captured as a sequence of frames, and some frames are complete while others are incomplete. Our task is to use the complete frames to recover the incomplete frames via a learned dictionary. Our method contains two main steps: dictionary learning and motion recovery. Fig. 2 illustrates the framework of our method.
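The data layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: the joint count, frame count, random positions and the `observed` mask are hypothetical, and the per-joint ordering (x, y, z of joint 1, then joint 2, and so on) is one plausible convention rather than the paper's stated one.

```python
import numpy as np

d, F = 4, 5                               # hypothetical: 4 joints, 5 frames
rng = np.random.default_rng(0)
joints = rng.normal(size=(F, d, 3))       # joints[t, j] = (x, y, z) of joint j at frame t

# Flatten every frame into a 3d-vector and use the frames as columns of M.
M = joints.reshape(F, 3 * d).T            # shape (3d, F)
f = M[:, 2]                               # one frame f in R^{3d}

# A frame is incomplete when at least one joint was not observed.
observed = np.ones((d, F), dtype=bool)    # hypothetical visibility mask
observed[1, 2] = False                    # joint 1 is occluded in frame 2
incomplete_frames = np.where(~observed.all(axis=0))[0]
```

Keeping a boolean visibility mask next to `M` is a design convenience of this sketch; the paper instead expresses the same information through degradation operators.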
Fig. 2. Framework of our method.
Fig. 3. Sparse representations of complete and incomplete frames. The white part in each sparse representation denotes that values in this part are all zero. In each subfigure, the total error denotes the percentage of the difference between two representations.
We abandon the traditional learning process and provide a new dictionary learning process specifically for signal recovery. Current dictionary learning methods mainly focus on training an overcomplete dictionary in a complete feature space for various recovery or recognition tasks. Such a dictionary can be considered optimal only in the complete feature space, not in an incomplete one. Nevertheless, many recovery tasks (e.g., motion recovery) try to recover data from an incomplete feature space, and there is not sufficient evidence to guarantee the accuracy of recovery via such a dictionary. Therefore, our dictionary learning process enforces that the sparse representation of incomplete data can satisfactorily recover its underlying complete data. Specifically, in dictionary learning, we remove several joints from a complete frame to create an incomplete frame, compute the sparse representation of this incomplete frame with the corresponding part of the dictionary, and use that representation with the complete dictionary to reconstruct the complete frame. Our target is to minimize the reconstruction error. Notably, one complete frame can create many complete–incomplete training pairs because we can remove different joints from the same frame each time.

The core idea of our dictionary learning process is to use the sparse representations of partial data to reconstruct the whole data. We do not have a theoretical proof for this idea, but we can use an instance to verify its rationality in the special task of motion recovery. We use our learned dictionary to represent a complete–incomplete frame pair by sparse coding, and we find that the sparse representations of the complete and incomplete frames are very similar. Fig. 3 shows two examples that we use to verify that the sparse representations of partial data can satisfactorily reconstruct the whole data. From Fig. 3, we can see that the atoms used to represent the complete frame are the same as those used to represent the incomplete frame (i.e., the same variable IDs). Additionally, even the corresponding nonzero values of each representation are nearly equal, and the total differences of 0.48% and 0.45% can almost be ignored.

Human motion is a complex natural phenomenon, but it contains many regular patterns. An important property of human motion is that different joints in a human skeleton are highly coordinated, so the motion data of the joints are highly correlated during movement. In addition, one frame in a motion sequence is a training sample, so the training set can be large enough. These two properties further support the feasibility of our motion recovery scheme.

Next, we describe our method in detail. Let $A^k$ denote a degradation operator that removes certain $q_k$ joints, i.e., sets the $3q_k$ corresponding features in $f$ to zero, where $k = 1, 2, 3, \ldots, K$ indexes the different sets of removed joints, $K = \sum_{i=1}^{JN} C_d^i$, and $JN$ is the maximum number of joints that we permit to be missing. In fact, the operators $A^k$ cover all of the situations in which the number of missing joints is less than or equal to $JN$. Let $\bar{A}^k$ denote the complementary degradation operator of $A^k$, which removes the remaining joints that $A^k$ does not remove. For example,
$$
A^k = \begin{bmatrix} 0 & & & & & \\ & 0 & & & & \\ & & 0 & & & \\ & & & 1 & & \\ & & & & \ddots & \\ & & & & & 1 \end{bmatrix} \tag{1}
$$

denotes removing the first joint, and

$$
\bar{A}^k = \begin{bmatrix} 1 & & & & & \\ & 1 & & & & \\ & & 1 & & & \\ & & & 0 & & \\ & & & & \ddots & \\ & & & & & 0 \end{bmatrix} \tag{2}
$$
denotes removing all of the joints except the first one. Let $f^k \in \mathbb{R}^{3d}$ denote the incomplete frame degraded by $A^k$, i.e., $f^k = A^k f$. Let $\bar{f}^k \in \mathbb{R}^{3d}$ denote the missing features in $f$, i.e., $\bar{f}^k = \bar{A}^k f$. Our dictionary learning process learns a dictionary $D \in \mathbb{R}^{3d \times n}$ with which we can obtain $\bar{f}^k$ using the representation of $f^k$. An ideal dictionary $D$ should satisfy the following equation for any $f_i$:

$$
\arg\min_{\alpha_i} \left\| A^k f_i - A^k D \alpha_i \right\|_2^2 + \lambda \left\| \alpha_i \right\|_1 = \arg\min_{\alpha_i} \left\| f_i - D \alpha_i \right\|_2^2, \quad \forall i = 1, 2, \ldots, N;\ \forall k = 1, 2, \ldots, K \tag{3}
$$

where $\{f_i\}_{i=1}^{N}$ are the complete frames for dictionary learning. Because the goal of our dictionary learning is to minimize the reconstruction error of the missing joints, we define the following loss function:

$$
L(D, f, z, k) = \frac{1}{2} \left\| \bar{A}^k f - \bar{A}^k D z \right\|_2^2 \tag{4}
$$
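The degradation operators and the loss of Eq. (4) can be sketched as follows. This is a minimal illustration, assuming that the features of joint j occupy rows 3j to 3j+2 of a frame; the function names `degradation_operators`, `num_operators` and `loss` are hypothetical and not taken from the paper.

```python
import numpy as np
from math import comb

def degradation_operators(d, missing_joints):
    """Return (A, A_bar) as 3d x 3d diagonal 0/1 matrices.

    A zeroes the coordinates of the joints in `missing_joints`;
    A_bar zeroes all other coordinates, so A + A_bar is the identity.
    """
    keep = np.ones(3 * d)
    for j in missing_joints:
        keep[3 * j: 3 * j + 3] = 0.0
    return np.diag(keep), np.diag(1.0 - keep)

def num_operators(d, JN):
    """K = sum_{i=1}^{JN} C(d, i): ways to drop between 1 and JN of d joints."""
    return sum(comb(d, i) for i in range(1, JN + 1))

def loss(D, f, z, A_bar):
    """Eq. (4): half the squared reconstruction error on the removed features."""
    r = A_bar @ f - A_bar @ D @ z
    return 0.5 * float(r @ r)

# Removing the first of five joints reproduces the patterns of Eqs. (1) and (2).
A, A_bar = degradation_operators(5, [0])
print(np.diag(A)[:5], np.diag(A_bar)[:5])
```

In practice a boolean mask over the 3d features carries the same information as the diagonal matrices and avoids storing them explicitly; the matrices are used here only to mirror the notation.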
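Then, the objective function of our dictionary learning process can be written as:

$$
\begin{aligned}
\min_{D} \quad & \sum_{i=1}^{N} \sum_{k=1}^{K} L\left(D, f_i, z_i^k, k\right) \\
\text{s.t.} \quad & z_i^k = \arg\min_{\alpha} \left\| A^k f_i - A^k D \alpha \right\|_2^2 + \lambda \left\| \alpha \right\|_1, \quad i = 1, 2, \ldots, N;\ k = 1, 2, \ldots, K \\
& \left\| D(:, j) \right\|_2 \le 1, \quad j = 1, 2, \ldots, n
\end{aligned} \tag{5}
$$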
After obtaining the dictionary $D$, we can use it to reconstruct motions. However, the reconstructed motion is only an intermediate product that is not needed in the end. Given an incomplete frame $\tilde{f} \in \mathbb{R}^{3d}$ (missing features are set to zero), the reconstruction of its missing features $\bar{f} \in \mathbb{R}^{3d}$ consists of two consecutive steps. First, find the sparse representation of $\tilde{f}$ according to

$$
z = \arg\min_{\alpha} \left\| \tilde{f} - A_{\tilde{f}} D \alpha \right\|_2^2 + \lambda \left\| \alpha \right\|_1 \tag{6}
$$

where $A_{\tilde{f}}$ removes the rows in $D$ corresponding to the features that are missing in $\tilde{f}$. Then, many methods [16,46] compute $\bar{f}$ as $\bar{f} = \bar{A}_{\tilde{f}} D z$, where $\bar{A}_{\tilde{f}}$ is the complementary degradation operator of $A_{\tilde{f}}$, but ours is different. We take the smoothness and bone-length constraints into consideration, and reconstruct the complete motion sequence $M$ (assuming that certain markers in $M$ are continuously missing for the entire capture time) by minimizing the following objective function:
$$
\min_{\bar{M}} \left\| \bar{M} - \bar{A} D Z \right\|_F^2 + \beta \left\| O \bar{M} C \right\|_F^2 \quad \text{s.t.} \quad \Phi(\bar{M}) = 0 \tag{7}
$$

where $\bar{M} = \bar{A} M$, $A$ removes the rows in $D$ and $M$ corresponding to the features that are missing in $M$, $Z = [z_1, z_2, \ldots, z_F]$ contains the sparse representations of all of the incomplete frames in $AM$, which denotes all of the observed features in $M$, and $\bar{A}$ is the complementary degradation operator of $A$. The term $\| \bar{M} - \bar{A} D Z \|_F^2$ is the loss, $\beta \| O \bar{M} C \|_F^2$ is the smoothness constraint term, and $\Phi(\bar{M}) = 0$ is the rigid bone-length constraint. More details will be given in Section 4.2. This objective function makes the recovered motion satisfy the smoothness and bone-length constraints. Our method does not need any post-processing of the recovered motion, whereas post-processing is usually required when the missing features are computed as $\bar{f} = \bar{A} D z$.
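For comparison, the direct reconstruction $\bar{f} = \bar{A} D z$ that the text contrasts with Eq. (7) can be sketched as below. This is an illustrative sketch only: it solves Eq. (6) with plain ISTA (iterative soft thresholding), which is a swapped-in solver chosen for brevity and not the solver used in the paper, and the names `sparse_code_ista` and `recover_frame`, the boolean `observed` mask and the default `lam` are assumptions.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def sparse_code_ista(G, l, lam, n_iter=500):
    """Approximately solve min_a ||l - G a||_2^2 + lam * ||a||_1 with ISTA."""
    L = 2.0 * np.linalg.norm(G, 2) ** 2 + 1e-12   # Lipschitz constant of the smooth part
    a = np.zeros(G.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * G.T @ (G @ a - l)
        a = soft_threshold(a - grad / L, lam / L)
    return a

def recover_frame(D, f_tilde, observed, lam=0.01):
    """Fill the missing features of one frame from the observed ones.

    D: (3d, n) dictionary; f_tilde: (3d,) frame with missing features set to 0;
    observed: (3d,) boolean mask, True where a feature was captured.
    """
    z = sparse_code_ista(D[observed], f_tilde[observed], lam)   # Eq. (6)
    f_hat = f_tilde.copy()
    f_hat[~observed] = (D @ z)[~observed]                       # A_bar D z
    return f_hat, z
```

Because each frame is filled in independently, nothing in this baseline ties consecutive frames together or keeps bone lengths fixed, which is the gap that the penalty and the constraint in Eq. (7) are meant to close.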
4. The proposed methodology
In this section, we present the two steps of our motion recovery framework in turn. In the dictionary learning step, the difference between the traditional learning process and our process is introduced in detail. In the motion recovery step, we describe how the kinematic constraints are used in the recovery process.
4.1. Dictionary learning
The dictionary directly determines the accuracy of recovery. Therefore, we propose a new dictionary learning process for these types of signal recovery tasks. The common characteristic of these tasks is that the feature subspace of incomplete signals is a part of the complete feature space of complete signals, and the missing dimensions correspond exactly to the missing features of the signals. Although this differs from the super-resolution problem, in which the feature space of a low-resolution image does not have an understandable relation with the feature space of a high-resolution image, we can borrow the idea of coupled dictionary learning [51] but learn only one dictionary. The recovery process uses sparse representations of incomplete frames to reconstruct their underlying complete frames. To make the dictionary tailored for recovery, we follow the recovery process during dictionary learning.

The conventional dictionary learning process minimizes the objective function by alternately optimizing over $D$ and $z_i$ while keeping the other fixed. Our objective function (5) is different because we have only one optimization variable, $D$, while the objective function of the conventional dictionary learning process has two optimization variables: the dictionary and the sparse representations. We employ a descent method as developed in [52] to solve problem (5). For the descent method, we need to find a descent direction along which a feasible step will decrease the objective function value. First, we use $W_p^T$ to denote the $p$th row of matrix $W$, $W_{pq}$ to denote the element of $W$ at the $p$th row and $q$th column, $\alpha_q$ to denote the $q$th element of vector $\alpha$, and $\bar{A}$ to denote $\bar{A}^k$ for ease of presentation; then we have
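$$
\begin{aligned}
\frac{\partial L}{\partial D_{pq}} &= \sum_{i=1}^{3d} \left[ \left( \bar{A}_i^T D z - \bar{A}_i^T f \right) z_q \bar{A}_{ip} + \left( \bar{A}_i^T D z - \bar{A}_i^T f \right) \bar{A}_i^T D \frac{\partial z}{\partial D_{pq}} \right] \\
&= \left[ \bar{A}^T \left( \bar{A} D z - \bar{A} f \right) z^T \right]_{pq} + \left( \bar{A} D z - \bar{A} f \right)^T \bar{A} D \frac{\partial z}{\partial D_{pq}}
\end{aligned} \tag{8}
$$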
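where the derivative $\frac{\partial z}{\partial D_{pq}}$ is well defined according to [46]. Let $z_\Omega$ denote the vector built with the elements $\{z_q\}_{q \in \Omega}$, and let $D_\Omega$ denote the dictionary that consists of the columns in $D$ with indices in $\Omega$, where $\Omega$ is defined as $\Omega = \{q \mid z_q \neq 0\}$. Although there is no analytical link between $z$ and $D_{pq}$, we use the technique developed in [52] to compute the derivative $\frac{\partial z}{\partial D_{pq}}$, which works well in practice. Also for ease of presentation, we let $G = A^k D_\Omega$ and $l = A^k f$. Then, the derivative can be written as

$$
\frac{\partial z_\Omega}{\partial D_{\Omega, pq}} = \frac{\partial z_\Omega}{\partial G_{pq}} \frac{\partial G_{pq}}{\partial D_{\Omega, pq}} = \left( G^T G \right)^{-1} \left( \frac{\partial G^T l}{\partial G_{pq}} - \frac{\partial G^T G}{\partial G_{pq}} z_\Omega \right) A^k_{pp} \tag{9}
$$

where we assume that the solution to the following problem

$$
\arg\min_{\alpha_i} \left\| A^k f_i - A^k D \alpha_i \right\|_2^2 + \lambda \left\| \alpha_i \right\|_1 \tag{10}
$$

is unique and that $\left( G^T G \right)^{-1}$ exists. Eq. (9) only gives us the derivative of $z_\Omega$ with respect to $D_\Omega$, which builds only on the index set $\Omega$. To evaluate $\frac{\partial z}{\partial D}$, we can set the remaining gradient elements of $\frac{\partial z}{\partial D}$ to zero according to [51]. The gradients $\frac{\partial G^T l}{\partial G_{pq}}$ and $\frac{\partial G^T G}{\partial G_{pq}} z_\Omega$ are easy to compute but complicated to express:

$$
\frac{\partial G^T l}{\partial G_{pq}} = \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_n \end{bmatrix}, \qquad g_i = \begin{cases} l_p, & i = q \\ 0, & i \neq q \end{cases} \tag{11}
$$

Additionally,

$$
\frac{\partial G^T G}{\partial G_{pq}} z_\Omega = \begin{bmatrix} h_1 \\ h_2 \\ \vdots \\ h_n \end{bmatrix}, \qquad h_i = \begin{cases} G_p^T z_\Omega + G_{pq} z_q, & i = q \\ G_{pi} z_q, & i \neq q \end{cases} \tag{12}
$$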
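Eqs. (9), (11) and (12) can be checked numerically. The sketch below is an illustration under an explicit assumption: on a fixed active set with a fixed sign pattern `s`, the solution of problem (10) has the closed form $z_\Omega = (G^T G)^{-1}(G^T l - \tfrac{\lambda}{2} s)$, which is the standard stationarity condition of the ℓ1-regularized least-squares problem. The function names are hypothetical, and the factor $A^k_{pp}$ of Eq. (9) is left out because the derivative is taken with respect to $G_{pq}$ directly.

```python
import numpy as np

def z_closed_form(G, l, lam, s):
    """Solution on a fixed active set with sign pattern s (assumption, see lead-in)."""
    return np.linalg.solve(G.T @ G, G.T @ l - 0.5 * lam * s)

def dz_dG(G, l, z, p, q):
    """Derivative of z_Omega with respect to G[p, q], following Eqs. (9), (11), (12)."""
    g = np.zeros(G.shape[1])
    g[q] = l[p]                                # Eq. (11)
    h = G[p, :] * z[q]                         # Eq. (12), case i != q
    h[q] = G[p, :] @ z + G[p, q] * z[q]        # Eq. (12), case i == q
    return np.linalg.solve(G.T @ G, g - h)

# Compare the analytic derivative with a central finite difference.
rng = np.random.default_rng(0)
G = rng.normal(size=(8, 3))
l = rng.normal(size=8)
s = np.array([1.0, -1.0, 1.0])
lam, p, q, eps = 0.1, 2, 1, 1e-6
z = z_closed_form(G, l, lam, s)
Gp, Gm = G.copy(), G.copy()
Gp[p, q] += eps
Gm[p, q] -= eps
fd = (z_closed_form(Gp, l, lam, s) - z_closed_form(Gm, l, lam, s)) / (2 * eps)
print(np.max(np.abs(dz_dG(G, l, z, p, q) - fd)))
```

The sign term drops out of the derivative because it does not depend on $G$ as long as the active set and signs stay fixed, which is why Eq. (9) contains no $\lambda$.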
Because the objective function (5) is highly nonconvex, there is no ideally suited optimization algorithm for minimizing it. Therefore, inspired by coupled dictionary learning [51], we employ a projected stochastic gradient descent procedure for the optimization of $D$ because of its fast convergence and good behavior in practice. Although the solution we obtain may only be a local minimum, it turns out to be sufficient for practical use, as demonstrated in the experimental part. Algorithm 1 summarizes the complete procedure of our dictionary learning algorithm based on the projected stochastic gradient descent method.
Algorithm 1 Dictionary learning.
Input: $\{f_i\}_{i=1}^{N}$, $\{A^k\}_{k=1}^{K}$; initialize $D_0$ by conventional sparse coding with $\{f_i\}_{i=1}^{N}$, $s = 0$, $t = 1$, $\alpha_1 = 1$
1: repeat
2:   for $i = 1$ to $N$ do
3:     for $k = 1$ to $K$ do
4:       Compute gradient $U = \partial L(D_s, f_i, z_i^k, k) / \partial D$ according to (8);
5:       Update $D_s = D_s - \alpha_t U$;
6:       Normalize every column of $D_s$;
7:       $t = t + 1$;
8:       $\alpha_t = \frac{2}{t+1}$;
9:     end for
10:  end for
11:  Update $D_{s+1} = D_s$, $s = s + 1$;
12: until Convergence
Output: $D_s$.
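The loop structure of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the gradient of Eq. (8) is not reproduced, so the caller passes a gradient function, and `toy_gradient` is a hypothetical stand-in (a least-squares code held fixed) used only to make the example runnable. Column normalization is assumed to mean scaling to unit $\ell_2$ norm, and the "until convergence" test is replaced by a fixed number of passes.

```python
import numpy as np

def normalize_columns(D):
    """Projection step: scale every column to unit l2 norm (assumed meaning of 'normalize')."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1e-12)

def projected_sgd(D0, frames, masks, grad_fn, epochs=5):
    """Loop structure of Algorithm 1 with a caller-supplied gradient."""
    D = normalize_columns(D0.copy())
    t, alpha = 1, 1.0
    for _ in range(epochs):                # "repeat ... until convergence", capped here
        for f in frames:                   # i = 1..N
            for A in masks:                # k = 1..K
                U = grad_fn(D, f, A)
                D = normalize_columns(D - alpha * U)
                t += 1
                alpha = 2.0 / (t + 1)      # step-size schedule of Algorithm 1
    return D

def toy_gradient(D, f, A):
    """Hypothetical stand-in for Eq. (8): gradient of ||f - D z||^2 in D, with z the
    least-squares code of the incomplete frame A f, held fixed."""
    z, *_ = np.linalg.lstsq(A @ D, A @ f, rcond=None)
    return 2.0 * np.outer(D @ z - f, z)
```

Here each `A` is a row-selection matrix that keeps the observed coordinates of a frame; any other degradation operator with compatible shapes would work the same way.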
4.2. Motion recovery
After obtaining the dictionary, current dictionary-based motion recovery methods [46] usually compute the complete frame as $f = Dz$, where $z$ (which can be computed by least angle regression [9]) is the sparse representation of the incomplete frame. However, human motion data differ from other signals because they are regular and must satisfy kinematic constraints. After artificial processing, some kinematic constraints may be violated. Data-driven approaches can preserve many spatial-temporal characteristics embedded in natural human motion, so most of the kinematic constraints, such as pose coordination, can be satisfied. However, what we finally recover is the positions of joints, which can easily break the bone-length constraint. Moreover, if the motion is recovered frame by frame, the smoothness of the motion sequence cannot be guaranteed. In light of this, the smoothness constraint is used in (7) as a penalty term, and bone length is a rigid constraint of (7).
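The baseline recovery $f = Dz$ described above can be sketched as follows. The text names least angle regression [9] for computing $z$; to keep the example self-contained, this sketch swaps in plain iterative soft-thresholding (ISTA) for the same $\ell_1$-regularized least-squares problem. The function names, the boolean mask `observed` and the value of `lam` are illustrative assumptions.

```python
import numpy as np

def ista(G, y, lam, iters=500):
    """Minimise ||y - G z||_2^2 + lam * ||z||_1 by iterative soft-thresholding."""
    L = 2.0 * np.linalg.norm(G, 2) ** 2            # Lipschitz constant of the smooth part
    z = np.zeros(G.shape[1])
    for _ in range(iters):
        v = z - 2.0 * G.T @ (G @ z - y) / L        # gradient step on the quadratic term
        z = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)  # soft threshold
    return z

def recover_frame(D, f_incomplete, observed, lam=0.01):
    """Code the observed coordinates against the matching rows of D, then form f = D z."""
    z = ista(D[observed], f_incomplete[observed], lam)
    return D @ z, z
```

Only the rows of `D` that correspond to observed coordinates take part in the sparse coding step; the full dictionary is used afterwards to synthesize the complete frame.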
As described in Section 3, $\|M - ADZ\|_F^2$ constrains $M$ to be as similar to $ADZ$ as possible, and $ADZ$ is the part directly recovered with $AD$. For a convenient and clear expression, we use $J$ to represent a joint that includes three coordinate values, $J_{ij}$ to represent the missing joint that is the element of $M$ in the $i$th row and $j$th column, and $\mathrm{former}(J_{ij})$ and $\mathrm{next}(J_{ij})$ to represent the former and next joints of $J_{ij}$ in a human skeleton. Then, $M$ can be denoted by:

$$M = \left[\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_F\right] = \begin{bmatrix} J_{11} & J_{12} & \cdots & J_{1F} \\ J_{21} & J_{22} & \cdots & J_{2F} \\ \vdots & \vdots & \ddots & \vdots \\ J_{d1} & J_{d2} & \cdots & J_{dF} \end{bmatrix} \tag{13}$$

where $\tilde{f}_j$ represents a frame made up of the missing markers, while $\bar{f}_j$ is made up of the observed markers, and $d$ represents the number of missing markers.

$\beta \|O(M) C\|_F^2$ is the smoothness constraint term, and

$$O(M) = \left[\tilde{f}_0, \tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_F, \tilde{f}_{F+1}\right] \tag{14}$$

where $f_0$ is the former frame of $f_1$, $f_{F+1}$ is the next frame of $f_F$, and they are both complete frames. Notably, $\tilde{f}_0$ and $\tilde{f}_{F+1}$ are not variables but constants in (7). $\beta$ is the tuning parameter that is usually determined by experience. $C$ is a tridiagonal square matrix that is used to measure the distance between neighboring frames [10] in $M$. Because $O(M)$ has $F+2$ columns, $C$ has $F+2$ rows and columns:

$$C = \begin{bmatrix} -1 & 1 & & & \\ 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \\ & & & 1 & -1 \end{bmatrix}_{(F+2)\times(F+2)} \tag{15}$$
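The smoothness term built from Eqs. (14) and (15) can be sketched as below. The matrix is sized with one row per column of $O(M)$ so that the product $O(M)C$ is defined; the function names and the layout (features in rows, frames in columns) are assumptions made for illustration. The last function returns the gradient with respect to the unknown columns only, since the two boundary frames are constants.

```python
import numpy as np

def smoothness_matrix(n):
    """Tridiagonal matrix of Eq. (15) with n rows, one per column of O(M)."""
    C = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    C[0, 0] = C[-1, -1] = -1.0
    return C

def smoothness_penalty(M, f_prev, f_next, beta=1.0):
    """beta * ||O(M) C||_F^2 with O(M) = [f_prev, M, f_next]."""
    O = np.column_stack([f_prev, M, f_next])
    C = smoothness_matrix(O.shape[1])
    return beta * np.linalg.norm(O @ C, "fro") ** 2

def smoothness_gradient(M, f_prev, f_next, beta=1.0):
    """Gradient of the penalty in M: 2 * beta * (O C C^T) restricted to the interior columns."""
    O = np.column_stack([f_prev, M, f_next])
    C = smoothness_matrix(O.shape[1])
    return 2.0 * beta * (O @ C @ C.T)[:, 1:-1]
```

For a trajectory that moves at constant velocity, every interior column of `O @ C` is a second difference and therefore vanishes, which is the behavior the penalty is meant to reward.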
$\Phi(M) = \sum_{i=1}^{d} \sum_{j=1}^{F} \mathrm{Lgtcon}(J_{ij})$ is the rigid bone-length constraint, where $\mathrm{Lgtcon}(J_{ij})$ is a function that measures the distance between the true bone length and the recovered bone length. $\mathrm{Lgtcon}(J_{ij})$ is computed as

$$\mathrm{Lgtcon}(J_{ij}) = \begin{cases} \left( \|J_{ij} - a\|_2^2 - l_i^2 \right)^2 & \text{case 1} \\ \left( \|J_{ij} - a\|_2^2 - l_i^2 \right)^2 + \left( \|b - J_{ij}\|_2^2 - l_{i+1}^2 \right)^2 & \text{case 2} \\ \left( \|J_{ij} - \mathrm{former}(J_{ij})\|_2^2 - l_i^2 \right)^2 + \left( \|b - J_{ij}\|_2^2 - l_{i+1}^2 \right)^2 & \text{case 3} \\ \left( \|J_{ij} - \mathrm{former}(J_{ij})\|_2^2 - l_i^2 \right)^2 & \text{case 4} \end{cases} \tag{16}$$

where $a$ is the position of $\mathrm{former}(J_{ij})$ and $b$ is the position of $\mathrm{next}(J_{ij})$ in a complete frame, and $l_i$ is the bone length between $J_{ij}$ and $\mathrm{former}(J_{ij})$ in a complete frame. Case 1 denotes that $\mathrm{former}(J_{ij})$ is not missing and $\mathrm{next}(J_{ij})$ is missing or does not exist. Case 2 denotes that $\mathrm{former}(J_{ij})$ and $\mathrm{next}(J_{ij})$ are both not missing. Case 3 denotes that $\mathrm{former}(J_{ij})$ is missing and $\mathrm{next}(J_{ij})$ is not missing. Case 4 denotes that $\mathrm{former}(J_{ij})$ and $\mathrm{next}(J_{ij})$ are both missing. Objective function (7) is an optimization problem with an equality constraint. Naturally, we employ the augmented Lagrange multiplier method to solve the problem. First, we introduce a Lagrange function:
$$L(M, \lambda) = \|M - ADZ\|_F^2 + \beta \|O(M) C\|_F^2 + \lambda \Phi(M) \tag{17}$$

where $\lambda$ is the Lagrange multiplier. Then, we utilize an augmented Lagrange function to transform problem (7) into an unconstrained problem as follows:

$$P(M, \lambda, \sigma) = L(M, \lambda) + \frac{\sigma}{2} \Phi^2(M) \tag{18}$$
where $\sigma$ is the penalty factor of the exterior penalty function $\frac{\sigma}{2} \Phi^2(M)$. The optimization problem $\min_M P(M, \lambda, \sigma)$ is easy to solve with gradient-based optimization algorithms, such as APG [40]. The gradient $\partial P(M, \lambda, \sigma) / \partial M$ can be easily computed as

$$\frac{\partial P(M, \lambda, \sigma)}{\partial M} = \frac{\partial L(M, \lambda)}{\partial M} + \sigma \Phi(M) \frac{\partial \Phi(M)}{\partial M}.$$

Additionally,

$$\frac{\partial L(M, \lambda)}{\partial M} = 2\left(M - ADZ\right) + 2\beta \left[ O(M)\, C C^T \right](:, 2 : \mathrm{end} - 1) + \lambda \frac{\partial \Phi(M)}{\partial M} \tag{19}$$

where

$$\frac{\partial \Phi(M)}{\partial J_{ij}} = \begin{cases} \dfrac{\partial \mathrm{Lgtcon}(J_{ij})}{\partial J_{ij}} & \text{if } \mathrm{next}(J_{ij}) \text{ is not missing} \\[2ex] \dfrac{\partial \mathrm{Lgtcon}(J_{ij})}{\partial J_{ij}} + \dfrac{\partial \mathrm{Lgtcon}(\mathrm{next}(J_{ij}))}{\partial J_{ij}} & \text{if } \mathrm{next}(J_{ij}) \text{ is missing} \end{cases} \tag{20}$$

Both $\partial \mathrm{Lgtcon}(J_{ij}) / \partial J_{ij}$ and $\partial \mathrm{Lgtcon}(\mathrm{next}(J_{ij})) / \partial J_{ij}$ are easily computed, but the expressions are complicated, so we do not show them here. According to APG, we can use Algorithm 2 to obtain the optimal $M^*$ for a certain $\lambda$ and $\sigma$, but it may not be the solution of problem (7). Therefore, we employ the PH algorithm (as shown in Algorithm 3) to obtain the optimal $\lambda$ and $\sigma$ that make $M^*$ the solution of problem (7).

Algorithm 2 Optimization for $\min_M P(M, \lambda, \sigma)$.
Input: $A$, $D$, $Z$, $C$, $\beta$, $\lambda$, $\sigma$, $d$, $F$; initialize $M_0$ and $V_0$ with $ADZ$, and let $\alpha_0 = 1$, $t = 0$;
1: repeat
2:   $U_t = (1 - \alpha_t) M_t + \alpha_t V_t$;
3:   $V_{t+1} = V_t - \frac{1}{dF\alpha_t} \frac{\partial P(U_t, \lambda, \sigma)}{\partial U_t}$;
4:   $M_{t+1} = (1 - \alpha_t) M_t + \alpha_t V_{t+1}$;
5:   $\alpha_{t+1} = \frac{2}{t+1}$;
6:   $t = t + 1$;
7: until Convergence
Output: $M_t$.
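Algorithm 3 Optimization for Eq. (7).
Input: $A$, $D$, $Z$, $C$, $\beta$, $d$, $F$; initialize $M_0$ with $ADZ$, $\lambda_1 > 0$, $\sigma_1 > 0$, $c > 1$, $\varepsilon > 0$, $\theta \in (0, 1)$, and let $t = 1$;
1: while $\Phi(M) > \varepsilon$ do
2:   Solve the problem $\min_M P(M, \lambda, \sigma)$ with Algorithm 2, and obtain $M_t$;
3:   if $\Phi(M_t) / \Phi(M_{t-1}) \le \theta$ then
4:     $M_{t+1} = (1 - \alpha_t) M_t + \alpha_t V_{t+1}$;
5:     $\sigma_{t+1} = \sigma_t$;
6:   else
7:     $\sigma_{t+1} = c \sigma_t$;
8:   end if
9:   $\lambda_{t+1} = \lambda_t + \sigma_t \Phi(M_t)$;
10:  $t = t + 1$;
11: end while
Output: $M_t$, $\lambda_t$, $\sigma_t$.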
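The outer multiplier loop can be illustrated on a deliberately small toy problem: a single missing joint `x` attached to one known joint `a` by a bone of length `l` (case 1 of Eq. (16)). The sketch below makes several simplifying assumptions that are not in the original: the data term is $\|x - x_0\|_2^2$ in place of $\|M - ADZ\|_F^2$, the smoothness term is omitted, and the inner problem is solved by gradient descent with backtracking rather than by Algorithm 2. All function names and default parameters are hypothetical.

```python
import numpy as np

def phi(x, a, l):
    """Bone-length violation of one joint, case 1 of Eq. (16)."""
    return (np.sum((x - a) ** 2) - l ** 2) ** 2

def grad_phi(x, a, l):
    return 4.0 * (np.sum((x - a) ** 2) - l ** 2) * (x - a)

def P(x, x0, a, l, lam, sigma):
    """Augmented Lagrange function in the form of Eq. (18) for the toy problem."""
    p = phi(x, a, l)
    return np.sum((x - x0) ** 2) + lam * p + 0.5 * sigma * p ** 2

def grad_P(x, x0, a, l, lam, sigma):
    return 2.0 * (x - x0) + (lam + sigma * phi(x, a, l)) * grad_phi(x, a, l)

def inner_solve(x, x0, a, l, lam, sigma, iters=300):
    """Gradient descent with backtracking; a stand-in for Algorithm 2."""
    for _ in range(iters):
        g = grad_P(x, x0, a, l, lam, sigma)
        gg = g @ g
        if gg < 1e-20:
            break
        f, step = P(x, x0, a, l, lam, sigma), 1.0
        while step > 1e-14 and P(x - step * g, x0, a, l, lam, sigma) > f - 0.5 * step * gg:
            step *= 0.5
        if step <= 1e-14:
            break
        x = x - step * g
    return x

def recover_joint(x0, a, l, lam=1.0, sigma=1.0, c=10.0, theta=0.5, eps=1e-6, max_outer=40):
    """Outer loop in the spirit of Algorithm 3: update lambda, enlarge sigma when progress stalls."""
    x = np.array(x0, dtype=float)
    prev = phi(x, a, l)
    for _ in range(max_outer):
        if prev <= eps:
            break
        x = inner_solve(x, x0, a, l, lam, sigma)
        cur = phi(x, a, l)
        lam = lam + sigma * cur            # multiplier update uses the current sigma
        if cur > theta * prev:             # violation did not shrink by the factor theta
            sigma = c * sigma
        prev = cur
    return x
```

Starting from an estimate that is twice the bone length away from `a`, the loop pulls the joint back towards the sphere of radius `l` around `a` while staying close to the initial estimate.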
5. Experiments
5.1. Design
We have conducted several experiments on the CMU human motion database and the HDM05 database to show the effectiveness of our method. All of the algorithms are implemented in Matlab and run on a personal computer with a Pentium(R) Dual-Core CPU and 2.00 GB RAM. Before the experiments, we translate every pose so that its root marker is at the origin and take the y-axis (the up axis) of the original world coordinate system as our y-axis. We project the vector pointing from the left shoulder marker to the right shoulder marker onto the horizontal plane and use the projected vector as the x-axis. The cross product of the x-axis and the y-axis gives the z-axis. In our experiments, we select six types of motion sequences, including repetitive motions (walk, jump, and run) and non-repetitive motions (badminton, boxing, and dance). However, given the length of the paper, we only show part of the results for each experiment. Ten thousand frames from these six types of motion sequences are used to learn the dictionary. From every complete frame, we remove different markers each time, so that one complete frame forms many training frame pairs for dictionary learning. To evaluate the performance of our method, we compare it with four other recent methods. In the following sections, we use L1-SRMMP, Traj-based SR, TSMC and Traj-based MC to denote these four methods: L1-SRMMP takes the complete frames directly as the dictionary [46], without a dictionary learning process; Traj-based SR learns a trajectory-based dictionary [16] to recover the incomplete trajectories of any joint; TSMC [10] utilizes low-rank matrix completion [4,5] with a smoothness constraint to recover the motion matrix; and Traj-based MC, a variant of TSMC, recovers motions from an incomplete trajectory-based motion matrix [39].
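The pose normalization described above (root at the origin, y up, x along the horizontally projected shoulder vector, z as the cross product) can be sketched as follows. The array layout and the marker indices passed as arguments are hypothetical; the original implementation is in Matlab and is not shown in the paper.

```python
import numpy as np

def normalize_pose(markers, root, left_shoulder, right_shoulder):
    """markers: (n, 3) array in world coordinates with y up; the index arguments are hypothetical."""
    y_axis = np.array([0.0, 1.0, 0.0])
    shoulder = markers[right_shoulder] - markers[left_shoulder]
    horizontal = shoulder - (shoulder @ y_axis) * y_axis    # drop the vertical component
    x_axis = horizontal / np.linalg.norm(horizontal)
    z_axis = np.cross(x_axis, y_axis)
    R = np.stack([x_axis, y_axis, z_axis])                  # rows are the new axes
    return (markers - markers[root]) @ R.T                  # translate, then rotate
```

After this transform the left-to-right shoulder vector has no z component, so poses facing different directions in the world frame become directly comparable.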
More specifically, the dictionary size of L1-SRMMP is 500 because a larger dictionary quickly slows down recovery while only slightly improving recovery accuracy. The Traj-based SR dictionary contains 1000 atoms and is learned from more than 100 000 trajectories of length 72. Following [10,16,39], we use the root mean squared error (RMSE) to quantify the recovered results.
$$\mathrm{RMSE}(f, \hat{f}) = \sqrt{\frac{1}{n_p} \left\| A f - A \hat{f} \right\|_2^2} \tag{21}$$
where $f$ is the original complete frame, $\hat{f}$ is the recovered frame, $A$ is a degradation operator that removes all of the non-missing joints, and $n_p$ is the total number of missing markers in $\hat{f}$. In the testing part, we select several motion sequences that are not used to learn the dictionary. Our method can recover any missing joint in the human skeleton, but some joints are not active or important for a certain type of motion, such as the fingers in a jump. To increase the recovery difficulty, we remove joints that are active for each type of motion, such as the knee for walking and the elbow for dancing, and show the recovery results in the following figures. During dictionary learning, the dictionary size is set to 500, and each frame is made into $K$ (as described in Section 3) complete–incomplete frame pairs corresponding to different missing markers. During motion recovery, we set $\beta = 1$ and the initial $\lambda = 15$, $\sigma = 5 \times 10^{-4}$.
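The error measure of Eq. (21) can be sketched as below, assuming the square root of the mean squared distance over the missing markers. The degradation operator $A$ is represented here by a boolean mask over markers, which is an illustrative choice; the function name and array layout are likewise assumptions.

```python
import numpy as np

def rmse(f, f_hat, missing):
    """Eq. (21): root mean squared error over the missing markers only.
    f, f_hat: (n_markers, 3) arrays; missing: boolean mask of length n_markers."""
    diff = f[missing] - f_hat[missing]
    n_p = int(np.count_nonzero(missing))
    return float(np.sqrt(np.sum(diff ** 2) / n_p))
```

Errors on markers that were observed do not enter the measure, so a method is scored only on the joints it actually had to recover.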
Fig. 4. Results of our method for three motion sequences. In each subfigure, the top row has the original frames, the middle row has the incomplete frames, and the bottom row has the recovered frames where the green segments correspond to the missing joints. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 5. Performance comparison results on bone length between our method and other methods without bone-length constraints. In each subfigure, the left pose is from original motion data, the middle pose is recovered by our method, the right pose is recovered by other methods without bone-length constraints, and the larger ellipses are amplifications of the smaller ellipses.
5.2. Results of our method
In this section, we first show the overall characteristics of our method. As shown in Fig. 4, we remove several joints that are active for each type of motion, and the recovered frames are very close to the original frames, which demonstrates the good performance of our method. For example, in Fig. 4(a), the missing elbow and foot are two joints that move fast with a large amplitude, but the difference between the original and recovered frames is hardly distinguishable. Human motion data differ from other data because they must satisfy many physical constraints. The bone-length constraint is a very important and rigid constraint, yet most methods [10,16,26,46] ignore it. Whether the recovery process contains a bone-length constraint directly affects the accuracy of the recovered motions. As shown in Fig. 5, because of the bone-length constraint, the bone lengths of the frames recovered by our method are all extremely close to the values recorded in the ASF files. In contrast, the segments recovered by other methods are easy to spot because of their unreasonable bone lengths. We compare our method with the other four methods on bone-length errors in the next section. A natural drawback of frame-based methods [46,47] is their sensitivity to the number of missing joints, because more missing joints mean that less information is available for recovery. However, our method jointly utilizes statistical information (the relation between different joints) and kinematic information (the smoothness constraint and the bone-length constraint). Therefore, as the number of missing joints increases, the recovery error does not increase noticeably. As shown in Fig. 6, when the number of missing joints is less than 10, the recovery accuracy is acceptable, which makes our method feasible for practical use because only a few joints are typically missing in a motion capture system.
5.3. Quantitative comparisons
In the first experiment of this section, we remove several joints from 40% of the frames of each sequence, with the missing frames chosen at random. As shown in Fig. 7, although the most significant advantage of our method lies in recovering motions with continuous missing time, our method outperforms the other competitors except TSMC. TSMC is good at handling motions with random missing time, but its performance declines sharply in the next experiment (as shown in Fig. 8) with continuous missing time.

Fig. 6. Recovery errors with different numbers of missing joints.

Fig. 7. Recovery errors using different methods with 40% random frames having missing joints.

In practical motion capture systems, it is very common for a marker to disappear for a continuous period of time. To simulate this situation, we remove several active joints from 40% of the continuous frames of each sequence. Then, we use our method and the four competing methods to recover these motions; the results are shown in Fig. 8. Our method outperforms the other methods in recovery error and has a smaller variance in the recovery results, which reflects the spatial-temporal stability of our method. Spatial-temporal stability is an essential property of human motion data, and we preserve it by using the smoothness constraint and the bone-length constraint in objective function (7).

Fig. 8. Recovery errors using different methods with 40% of continuous frames having missing joints.

Fig. 9. Recovery errors using different methods with different percentages of continuous frames having missing joints.

A significant advantage of our method is that it is not sensitive to missing time. In contrast, many other methods, such as interpolation methods [13,26,34,45], trajectory-based methods [16,39] and low-rank-based methods [10,39], are very sensitive to missing time, especially continuous missing time. When the missing time becomes longer or the percentage of incomplete frames becomes larger, the performance of these methods declines sharply. As shown in Fig. 9, as the percentage of incomplete frames increases, the recovery errors of Traj-based SR, TSMC and Traj-based MC increase fast, but the performances of L1-SRMMP and our method are very stable. This is because L1-SRMMP and our method recover motion data frame by frame, and the prior information they utilize is the relationship among joints. By contrast, the prior information utilized by Traj-based SR, TSMC, and Traj-based MC is the relationship among frames. Therefore, an increase in incomplete frames means a decrease in prior information, so the decline in their performance is predictable and reasonable.

Fig. 10. Moving trajectories of the right knee and recovery errors using different motion smoothing methods. Subfigures (a)–(c) are the moving trajectories of three coordinate axes, and subfigure (d) shows the recovery errors.

In addition, the motions recovered by our method are very stable in time and space owing to the smoothness constraint and the bone-length constraint. As shown in Fig. 10(a)–(c), the trajectories of the motion sequence recovered without the smoothness constraint fluctuate violently, so the animated character vibrates during motion. With the smoothness constraint, the trajectories recovered by our method are considerably smoother and the animated character moves smoothly. Fig. 10 also includes the result of a moving average filter [29], which is widely used for motion smoothing [27]. The filter smooths a trajectory by averaging the neighboring frames within a specified span:
$$f_i^s = \frac{1}{2N_s + 1} \left( f_{i-N_s} + f_{i-N_s+1} + \cdots + f_{i+N_s-1} + f_{i+N_s} \right) \tag{22}$$
where $f_i^s$ is the $i$th smoothed frame, and $2N_s + 1$ is the span. In our experiments, $N_s$ is set to 5. Although motion smoothness can also be achieved by post-processing methods such as the moving average filter, they are independent of the recovery process and do not have an obvious positive effect on the recovery accuracy. Fig. 10(d) shows that our method achieves higher recovery accuracy. This is because we smooth the trajectories together with the bone-length constraint, which helps improve the recovery accuracy. Although the recovered segments with unreasonable bone lengths are easy to spot in Fig. 5, we also give a quantitative comparison among different methods in Table 1. The bone-length errors of our method are considerably smaller than those of the other methods, so the distortion is difficult for the eye to catch. The bone-length errors of the other methods may even generate malformed poses, and such unnatural poses are easy to spot in a motion sequence.
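The moving average filter of Eq. (22) can be sketched as follows. The paper does not say how the first and last $N_s$ frames are handled, so truncating the window at the sequence boundaries is an assumption of this sketch, as are the function name and the layout with frames stored as columns.

```python
import numpy as np

def moving_average(frames, n_s=5):
    """Eq. (22) applied to frames of shape (n_features, n_frames).
    Near the two ends the window is truncated to the frames that exist (an assumption)."""
    n = frames.shape[1]
    out = np.empty_like(frames, dtype=float)
    for i in range(n):
        lo, hi = max(0, i - n_s), min(n, i + n_s + 1)
        out[:, i] = frames[:, lo:hi].mean(axis=1)
    return out
```

Because the filter only averages already recovered frames, it cannot use bone lengths or the dictionary, which is consistent with the observation above that it smooths the trajectories without improving accuracy.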
5.4. Qualitative comparisons
Another big difference between human motion data and other data is that human motion data can be visualized vividly, while other data are only a pile of numbers. Charts can show the performance numerically, but they are very abstract. We therefore visualize the results of our method and of every competing method. For every type of motion, we draw six animated characters corresponding to six frames sampled at equal time intervals. These animated characters vividly show what the charts have shown in Section 5.3. Fig. 11 shows that the animated characters of our method are very similar to the original characters, whereas the animated characters of the other methods are more or less unnatural and some bones even have unreasonable lengths. By using the relationship among all of the joints, we ensure the coordination of the recovered poses, and by benefiting from the bone-length constraint, we ensure that every recovered bone has a reasonable length.

Table 1. Comparisons of our method with the other four methods for bone-length errors of recovered motions. The average bone-length errors of each frame are reported. The results of our method are highlighted in each case.

| Motion | L1-SRMMP [46] | Traj-based SR [16] | TSMC [10] | Traj-based MC [39] | Ours |
|---|---|---|---|---|---|
| Walk | 0.337 | 0.599 | 0.490 | 0.450 | **3.91e-4** |
| Run | 0.218 | 1.853 | 1.813 | 1.141 | **5.10e-4** |
| Jump | 0.120 | 0.513 | 0.576 | 0.607 | **2.92e-4** |
| Badminton | 0.354 | 0.699 | 0.171 | 0.136 | **4.47e-4** |
| Boxing | 0.239 | 0.603 | 0.914 | 1.114 | **4.54e-4** |
| Dance | 0.769 | 0.885 | 1.314 | 0.746 | **6.15e-4** |

Fig. 11. Results of different methods for three motion sequences. In each subfigure, green segments are recovered parts. The malformed or unnatural parts recovered by competitive methods are circled. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
5.5. Time complexity analysis and comparisons
The computational cost of our proposed method mainly comes from two steps: learning the dictionary by solving objective function (5), and recovering the motion by solving objective function (7). For the first step, we employ a projected stochastic gradient descent method whose time complexity depends on its convergence rate. In our experiments, the optimization method converges within only several loops, so the time complexity is $O(L_p \times N \times K)$, where $L_p$ (usually less than 10) is the number of loops until convergence, $N$ is the number of training samples, and $K$ (a constant) is the number of types of missing joints. Although the complexity is only linear in $N$, computing the gradient according to (8) is not fast. The whole dictionary learning process takes 5.2 h in our experiments. Fortunately, the dictionary does not need to be updated unless a large number of new types of human motion appear, so the dictionary learning algorithm only needs to be run once. For the second step, we use the augmented Lagrange multiplier method to optimize objective function (7). Although the expressions of some related equations are very complex, objective function (18) is still a standard augmented Lagrangian objective, which is simple to solve. However, it is not fast enough because it requires the iterative processes of Algorithms 2 and 3. We compare the elapsed time for recovering each frame in Table 2, where the percentages in parentheses in the first column denote the proportion of each situation with its corresponding missing percentage. During recovery, L1-SRMMP and Traj-based SR only need to solve an $\ell_1$-norm optimization problem, so they are very fast. TSMC and Traj-based MC recover the whole motion matrix, and the degree of corruption of the matrix seldom influences the matrix completion speed. Therefore, their elapsed time per recovered frame decreases as the missing percentage increases. However, this increase in per-frame speed comes at the expense of a decline in recovery accuracy. Because the computational complexity of our method depends on the number of variables, the elapsed time of our method increases as the missing percentage increases. Fortunately, situations with a high missing percentage make up only a small portion of cases (see the percentages in parentheses in Table 2), so the weighted average elapsed time of our method (0.737 s/frame) is not very high compared with that of TSMC (0.399 s/frame), which usually performs better than the other three competing methods when the missing percentage is less than 50%. More importantly, motion recovery is not a real-time task, and the recovery accuracy of our method is very stable in every situation, while the recovery accuracies of Traj-based SR, TSMC, and Traj-based MC decline sharply as the missing percentage increases. Nevertheless, additional algorithm optimization is needed to reduce the computational cost of our method in the future.

Table 2. Elapsed time comparisons of our method with the other four methods for different missing percentages. The average elapsed time per frame (second/frame) is reported. The percentages in parentheses in the first column denote the proportions of each situation with its corresponding missing percentage. The results of our method are highlighted in each case.

| Missing percentage (proportion) | Ours | L1-SRMMP [46] | Traj-based SR [16] | TSMC [10] | Traj-based MC [39] |
|---|---|---|---|---|---|
| 5% (25%) | **0.263** | 0.054 | 0.014 | 0.876 | 0.489 |
| 10% (25%) | **0.415** | 0.055 | 0.014 | 0.444 | 0.254 |
| 20% (15%) | **0.901** | 0.053 | 0.014 | 0.222 | 0.126 |
| 30% (10%) | **1.04** | 0.053 | 0.014 | 0.134 | 0.077 |
| 40% (8%) | **1.304** | 0.067 | 0.017 | 0.115 | 0.064 |
| 50% (6%) | **1.224** | 0.056 | 0.014 | 0.093 | 0.052 |
| 60% (5%) | **1.294** | 0.058 | 0.014 | 0.075 | 0.043 |
| 70% (3%) | **1.371** | 0.053 | 0.014 | 0.061 | 0.034 |
| 80% (2%) | **1.455** | 0.053 | 0.014 | 0.053 | 0.029 |
| 90% (1%) | **1.562** | 0.053 | 0.014 | 0.047 | 0.025 |
| Weighted average | **0.737** | 0.055 | 0.014 | 0.399 | 0.225 |
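The weighted averages reported in Table 2 can be reproduced from the per-situation times and the proportions in parentheses. The short sketch below does this for our method and for TSMC; the variable and function names are illustrative, and the numbers are copied from Table 2.

```python
# Proportion of each missing-percentage situation (5%, 10%, ..., 90%), from Table 2
share = [0.25, 0.25, 0.15, 0.10, 0.08, 0.06, 0.05, 0.03, 0.02, 0.01]
# Average elapsed time per frame in seconds, from Table 2
ours = [0.263, 0.415, 0.901, 1.04, 1.304, 1.224, 1.294, 1.371, 1.455, 1.562]
tsmc = [0.876, 0.444, 0.222, 0.134, 0.115, 0.093, 0.075, 0.061, 0.053, 0.047]

def weighted_average(times, weights):
    """Sum of per-situation times weighted by how often each situation occurs."""
    return sum(t * w for t, w in zip(times, weights))
```

Because half of the weight sits on the 5% and 10% situations, where our method is at its fastest, its weighted average stays well below its worst-case time of 1.562 s/frame.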
6. Discussion
In this section, we discuss two issues: the reason why we choose two gradient-based optimization algorithms to optimize our objective functions, and the difference between the human motion data we handle and the human motion data captured by RGB-D cameras, as well as their different applications. In this paper, we have two objective functions to optimize, i.e., (5) and (7), and we choose a gradient-based algorithm for both. Time cost is a very important criterion for evaluating an optimization algorithm. Gradient-based algorithms are easy to apply and efficient for objective functions whose gradients are convenient to compute, such as objective function (7). Objective function (5) is more complex because the sparse representation $z_i^k$ depends on the dictionary $D$ through an $\ell_1$-minimization. When the training set is very large, computing all $z_i^k$ is quite time consuming. For instance, in our experiments (10 000 training samples), it takes more than 1 h to compute all $z_i^k$ once. This makes some state-of-the-art optimization algorithms, such as Particle Swarm Optimization [17,33,49], the Simulated Annealing Algorithm [38,48], the Genetic Algorithm [21,44] and Animal Migration Optimization [23], unsuitable for the optimization of (5), because they all need to evaluate the fitness according to objective function (5). The high computational expense of evaluating the fitness in each iteration prevents these algorithms from obtaining the optimal solution in an acceptable time frame. In contrast, the projected stochastic gradient descent algorithm employed in coupled dictionary learning [51] computes only the gradient with respect to the dictionary $D$, rather than the value of objective function (5). The gradient computation is much cheaper than solving the large-scale $\ell_1$-minimization problem, and the algorithm has been demonstrated to be effective for a similar problem in coupled dictionary learning [51].
The RGB-D camera has been widely used for gesture recognition [6] and humanoid robots [20]. It combines depth information and color images to extract human skeleton information, which can be used for person verification [7] and the localization of RGB-D camera networks [15]. An RGB-D camera is a simple motion capture device, and the motion data it captures are rough. Compared with high-precision motion capture equipment, such as Vicon [1], its notable advantages are low price and simple operation. Since Microsoft introduced its commercial RGB-D camera, Kinect, many people have been able to use Kinect quickly and conveniently for entertainment and related studies. On the other hand, its disadvantages are also obvious. The motion data captured by RGB-D cameras are not precise enough to drive character animation in applications that demand high-quality data, such as the movie industry and sports training, because imprecise data make the animated characters move unnaturally. In this paper, the motion data we recover
7. Conclusions
In this paper, we have presented a novel human motion recovery method that jointly utilizes the statistical and kinematic information of motion data. The proposed method uses a dictionary to recover human motions, while a smoothness constraint and a bone-length constraint are imposed during the recovery process. The dictionary preserves the statistical information, and the kinematic information enters through the kinematic constraints. The proposed method contains two main steps: dictionary learning and motion recovery. For the first step, inspired by coupled dictionary learning [51], we modified the traditional dictionary learning process to suit the motion recovery process. In the learning process, we use the sparse representation of the incomplete frame to reconstruct the corresponding complete frame and minimize the reconstruction error. The new learning process forces the learned dictionary to reconstruct the underlying complete frames well from the sparse representations of incomplete frames. For the second step, we take the optimized frames that satisfy the smoothness constraint and the bone-length constraint, rather than the frames directly recovered with the dictionary, as the final recovery results. This is because the smoothness constraint and the bone-length constraint guarantee the kinematic characteristics and help shrink the search space during optimization, whereas many other methods ignore one or even both of them. The experiments have shown improved performance. However, several problems deserve further research. First, a great deal of high-quality motion data is needed for dictionary learning. Second, when the numbers of training motions of different types differ greatly, the recovery accuracy across motion types becomes unbalanced. Third, if a substantial number of markers are missing, our approach fails to provide an accurate result because there is not enough useful information. Finally, our algorithms need further optimization to improve the recovery speed.
Acknowledgments
The authors would like to express their gratitude to the anonymous referees, the editor and the associate editor for their valuable comments, which led to substantial improvements of this paper. This work was supported by the Graduate Innovation Project of Jiangsu Province (KYZZ15_0124, KYZZ15_0125).
References
[1] Vicon Systems, http://www.vicon.com, 2009.
[2] R.G. Baraniuk, Compressive sensing [lecture notes], IEEE Signal Process. Mag. 24 (4) (2007) 118–121.
[3] J. Baumann, B. Krüger, A. Zinke, A. Weber, Filling long-time gaps of motion capture data, in: Proceedings of ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA), vol. 33, 2011.
[4] E.J. Candès, B. Recht, Exact matrix completion via convex optimization, Found. Comput. Math. 9 (6) (2009) 717–772.
[5] J.F. Cai, E.J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (4) (2010) 1956–1982.
[6] X. Chen, M. Koskela, Using appearance-based hand features for dynamic RGB-D gesture recognition, in: 2014 22nd International Conference on Pattern Recognition (ICPR), 2014, pp. 411–416.
[7] W. Chi, J. Wang, Q.H. Meng, Person verification based on skeleton biometrics by RGB-D camera, in: 2014 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2014, pp. 671–676.
[8] K. Dorfmüller-Ulhaas, Robust optical user motion tracking using a Kalman filter, in: ACM Symposium on Virtual Reality Software & Technology, 2003.
[9] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, et al., Least angle regression, Ann. Stat. 32 (2) (2004) 407–499.
[10] Y. Feng, J. Xiao, Y. Zhuang, X. Yang, J.J. Zhang, R. Song, Exploiting temporal stability and low-rank structure for motion capture data refinement, Inform. Sci. 277 (2) (2014) 777–793.
[11] E.E. Galin, Data-driven completion of motion capture data, Vriphys 33 (2) (2011) 283–292.
[12] W. Gong, Y. Huang, J. Gonzalez, L. Wang, Enhanced mixtures of part model for human pose estimation, CoRR abs/1501.05382 (2015).
[13] S. Guo, J. Roberg, A high-level control mechanism for human locomotion based on parametric frame space interpolation, Eurographics (1996).
[14] X. Han, G. Li, W. Lin, X. Su, H. Li, H. Yang, H. Wei, Periodic motion detection with ROI-based similarity measure and extrema-based reference-frame selection, in: APSIPA, IEEE, 2012, pp. 1–4.
[15] Y. Han, S.L. Chung, J.S. Yeh, Q.J. Chen, Localization of RGB-D camera networks by skeleton-based viewpoint invariance transformation, in: 2013 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2013, pp. 1525–1530.
[16] J. Hou, L.P. Chau, Y. He, J. Chen, N. Magnenat-Thalmann, Human motion capture data recovery via trajectory-based sparse representation, in: 2013 20th IEEE International Conference on Image Processing (ICIP), 2013, pp. 709–713.
[17] J. Kennedy, Particle swarm optimization, in: Encyclopedia of Machine Learning, Springer, 2010, pp. 760–766.
[18] H.S. Ko, S. Tak, A physically-based motion retargeting filter, ACM Trans. Graph. 24 (1) (2003) 98–117.
[19] R. Lai, P.C. Yuen, K. Lee, Motion capture data completion and denoising by singular value thresholding, in: Proceedings of Eurographics, Eurographics Association, 2011.
[20] J. Lei, M. Song, Z.N. Li, C. Chen, Whole-body humanoid robot imitation with pose similarity evaluation, Signal Process. (2015) 136–146.
[21] B. Li, J. W., A novel stochastic optimization algorithm, IEEE Trans. Syst. Man Cybern. B 30 (1) (2000) 193–198.
[22] L. Li, J. McCann, N. Pollard, C. Faloutsos, BoLeRO: A principled technique for including bone length constraints in motion capture occlusion filling, in: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2010, pp. 179–188.
[23] X. Li, J. Zhang, M. Yin, Animal migration optimization: An optimization algorithm inspired by animal migration behavior, Neural Comput. Appl. 24 (7–8) (2014) 1867–1877.
[24] Y. Liang, M. Song, J. Bu, C. Chen, Colorization for gray scale facial image by locality-constrained linear coding, J. Signal Process. Syst. 74 (1) (2014) 59–67.
[25] W. Lin, M.T. Sun, H. Li, Z. Chen, W. Li, B. Zhou, Macroblock classification method for video applications involving motions, IEEE Trans. Broadcast. 58 (1) (2012) 34–46.
[26] G. Liu, L. McMillan, Estimation of missing markers in human motion capture, Vis. Comput. Int. J. Comput. Graph. 22 (9–11) (2006) 721–728.
are types of structured signals based on a predefined human skeleton. For some studies based on an RGB-D camera, skeleton extraction is an important task, so motion recovery based on an RGB-D camera [41] aims to recover human actions, whereas our proposed method is essentially a special signal recovery method.
Please cite this article as: G. Xia et al., Human motion recovery jointly utilizing statistical and kinematic information, Information Sciences (2016), http://dx.doi.org/10.1016/j.ins.2015.12.041
[27] X. Liu, Y.M. Cheung, S.J. Peng, Z. Cui, B. Zhong, J.X. Du, Automatic motion capture data denoising via filtered subspace clustering and low rank matrix approximation, Signal Process. 105 (12) (2014) 350–362.
[28] H. Lou, J. Chai, Example-based human motion denoising, IEEE Trans. Visual. Comput. Graph. 16 (5) (2010) 870–879.
[29] V. Lyandres, S. Briskin, On an approach to moving-average filtering, Signal Process. 34 (2) (1993) 163–178.
[30] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online dictionary learning for sparse coding, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 689–696.
[31] T. Pfister, J. Charles, A. Zisserman, Flowing convnets for human pose estimation in videos, CoRR abs/1506.02897 (2015).
[32] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recogn. 43 (1) (2010) 331–341.
[33] J. Robinson, Y. Rahmat-Samii, Particle swarm optimization in electromagnetics, IEEE Trans. Antennas Propag. 52 (2) (2004) 397–407.
[34] C. Rose, M.F. Cohen, B. Bodenheimer, Verbs and adverbs: Multidimensional motion interpolation, IEEE Comput. Graph. Appl. 18 (5) (1998) 32–40.
[35] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5) (2000) 2323–2326.
[36] B. Schölkopf, J. Platt, T. Hofmann, Efficient sparse coding algorithms, in: Proceedings of the 2006 Conference on Advances in Neural Information Processing Systems 19, 2007, pp. 801–808.
[37] L. Sun, M. Song, D. Tao, J. Bu, C. Chen, Motionlet LLC coding for discriminative human pose estimation, Multimedia Tools Appl. 73 (1) (2014) 327–344.
[38] A. Suppapitnarm, K.A. Seffen, G.T. Parks, P.J. Clarkson, A simulated annealing algorithm for multiobjective optimization, Eng. Optim. 33 (1) (2000) 59–85.
[39] C.H. Tan, J. Hou, L.P. Chau, Human motion capture data recovery using trajectory-based matrix completion, Electron. Lett. 49 (12) (2013) 752–754.
[40] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, SIAM J. Optim. (2008).
[41] J. Wang, M. Garratt, P. Li, S. Anavatti, Motion recovery using the image interpolation algorithm and an RGB-D camera, in: 2014 IEEE International Conference on Robotics and Biomimetics (ROBIO), 2014.
[42] G. Welch, G. Bishop, An Introduction to the Kalman Filter, vol. 785, University of North Carolina at Chapel Hill, Springer-Verlag, Berlin Heidelberg, 1995, pp. 127–132. ISBN 978-3-642-02287-6.
[43] G. Welch, G. Bishop, L. Vicci, S. Brumback, K. Keller, D. Colucci, The HiBall tracker: High-performance wide-area tracking for virtual and augmented environments, in: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 1999.
[44] D. Whitley, A genetic algorithm tutorial, Stat. Comput. 4 (2) (1994) 65–85.
[45] D.J. Wiley, J.K. Hahn, Interpolation synthesis of articulated figure motion, IEEE Comput. Graph. Appl. 17 (6) (1997) 39–45.
[46] J. Xiao, Y. Feng, W. Hu, Predicting missing markers in human motion capture using l1-sparse representation, Comput. Anim. Virtual Worlds 22 (2–3) (2011) 221–228.
[47] J. Xiao, Y. Feng, M. Ji, X. Yang, J.J. Zhang, Y. Zhuang, Sparse motion bases selection for human motion denoising, Signal Process. (2015) 108–122.
[48] Y. Xin, A new simulated annealing algorithm, Int. J. Comput. Math. 56 (3) (1995) 161–168.
[49] B. Xue, M. Zhang, W.N. Browne, Particle swarm optimization for feature selection in classification: A multi-objective approach, IEEE Trans. Cybern. 43 (6) (2013) 1656–1671.
[50] J. Yang, Image super-resolution via sparse representation, IEEE Trans. Image Process. 19 (11) (2010) 2861–2873.
[51] J. Yang, Z. Wang, Z. Lin, S. Cohen, T. Huang, Coupled dictionary training for image super-resolution, IEEE Trans. Image Process. 21 (8) (2012) 3467–3478.
[52] J. Yang, K. Yu, T. Huang, Supervised translation-invariant sparse coding, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 3517–3524.