Computer Vision and Image Understanding 115 (2011) 375–389
Multifactor feature extraction for human movement recognition

Bo Peng a,b, Gang Qian a,b,*, Yunqian Ma d, Baoxin Li c
a School of Arts, Media and Engineering, Arizona State University, Tempe, AZ 85287, USA
b School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287, USA
c School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85287, USA
d Honeywell Labs, 1985 Douglas Drive North, Golden Valley, MN 55422, USA

* Corresponding author at: School of Arts, Media and Engineering, Arizona State University, Tempe, AZ 85287, USA. Fax: +1 480 965 0961.
E-mail addresses: [email protected] (B. Peng), [email protected] (G. Qian), [email protected] (Y. Ma), [email protected] (B. Li).
doi:10.1016/j.cviu.2010.11.001
Article info

Article history: Received 1 March 2010; Accepted 1 November 2010; Available online 12 November 2010

Keywords: Feature extraction; View-invariance; Multifactor analysis; Pose recognition; Gesture recognition
Abstract

In this paper, we systematically examine multifactor approaches to human pose feature extraction and compare their performance in movement recognition. Two multifactor approaches have been used in pose feature extraction: a deterministic multilinear approach and a probabilistic approach based on the multifactor Gaussian process. These two approaches are compared in terms of their degree of view-invariance, reconstruction capacity, and performance in human pose and gesture recognition using real movement datasets. The experimental results show that the deterministic multilinear approach outperforms the probabilistic approach in movement recognition.

© 2010 Elsevier Inc. All rights reserved.
1. Introduction

3D human movement sensing and analysis research, including articulated motion tracking and movement (e.g., pose, gesture and action) recognition, enables machines to read and understand human movement. This capability is critical to developing embodied human–machine interaction systems and allows users to communicate with computers through actions and gestures in a much more intuitive and natural manner than traditional human–computer interfaces based on mouse clicks and keystrokes. Embodied human–machine interfaces have found many important applications, including immersive virtual reality systems such as the CAVE [1], industrial control [2], embodied gestural control of media [3], healthcare [4], automatic sign language analysis and interpretation [5–8], mediated and embodied learning [9], computer games [10], human–robot interaction [11] and interactive dance works [12–15]. The sensing and analysis of 3D human motion is also an enabling component of intelligent systems that need to understand human actions or activities (e.g., video-based surveillance, motion-based expertise analysis as in sports, etc.). Due to its nonintrusive nature, video-based 3D human movement sensing and analysis has received significant attention in the past decade, evidenced by numerous quality research papers in prestigious computer vision and pattern analysis journals and
conferences. For example, vision-based pose recognition has been extensively studied in the literature [16]. Numerous methods exist for recognizing hand [17–19], arm [20], and full-body [21–23] poses. Existing methods can also be categorized according to the number of cameras used, i.e., single-view [18,21] versus multiview [23], and the types of features extracted, e.g., 2D silhouettes [21,22] and 3D volumetric reconstructions [23]. Likewise, numerous video-based methods have been developed for hand [24–26], arm [27,28] and full-body [29–32] gesture recognition. Recent literature surveys on gesture recognition can be found in [33]. According to the system methodology, video-based gesture recognition systems can be classified into kinematic-based [10,25,29,34–37] and template-based approaches [28,30–32,38–40]. In practice, it is often necessary for a 3D human movement sensing and analysis system to have a high degree of robustness with respect to adverse or uncooperative acquisition conditions, including changes of camera view-angle. For example, view-invariance is important for video-based gesture recognition in many HCI applications. Many existing movement recognition methods, especially the monocular approaches, are view-dependent [41,42,8,43–45], i.e., they assume that the relative torso orientation with respect to the cameras is known. While this may be a valid assumption in some scenarios, such as automatic sign language interpretation, having to know the body orientation with respect to the camera presents an undesirable constraint that hampers the flexibility, and sometimes the usability, of an HCI system in many applications such as interactive dance [15] and embodied learning [46]. In these applications, a truly practical solution needs to be able to recognize a gesture from any view-point so that the subject can freely move and
orient in the space. Therefore, the reliable extraction of invariant human pose descriptors from video has become one of the most fundamental research problems in video-based 3D human movement sensing and analysis. In this paper, we systematically examine two view-invariant pose feature extraction methods based on multifactor analysis and compare them in terms of the quality of the extracted features and their performance in human pose and gesture recognition.

The common practice for extracting view-invariant features is to first represent the visual hull data or their transformations (e.g., FFT or 3D shape context) in a body-centered spherical or cylindrical coordinate system and then suppress the angular dimension to achieve some level of view-invariance. Angular suppression inevitably confuses different poses in some cases and introduces ambiguity into pose and gesture recognition. For example, two poses in the anatomically neutral position with the right arm fully extended to the front and to the right, respectively, will result in the same feature under the angular suppression approach. Therefore, robust extraction of pose descriptors that are invariant to view changes and at the same time preserve differences caused by the actual body configuration is a practical challenge that existing work has not solved. In addition, such view-invariant features are designed for pose and gesture recognition, and it remains yet another challenge to establish invariant pose descriptors for 3D articulated body tracking. To summarize, reliable invariant pose descriptor extraction for 3D human movement sensing and analysis is a largely underdeveloped research area with many open problems.

Below we briefly review existing approaches to extracting view-invariant pose features from 3D visual hull data, which is the dominant method in the literature. In such approaches, the volumetric data is first transformed into an alternative representation such as the 3D shape context (e.g., in [27,47]) or the 3D motion history volume (MHV) (e.g., in [30]). Data points in these alternative 3D representations can be indexed in a body-centered cylindrical coordinate system using (h, r, θ) coordinates, which respectively correspond to the height, radius and angular location of the data point. The h-axis of the body-centered coordinate system coincides with the vertical central axis of the subject. To further obtain view-invariant features, the angular dimension θ is suppressed in the feature extraction process so that the final extracted feature is independent of the data point distribution in θ. To achieve this goal, in [27,30,47], data points are first grouped into rings centered at and orthogonal to the h-axis, so that data points on the same ring correspond to different θ while sharing the same h and r. A ring-based feature is then extracted from each ring, and the ring-based features of all the rings constitute the final θ-independent feature vector of the input volumetric data. A number of methods have been used to obtain such ring-based features. In the case of the 3D shape context [27,47], the sum of the bin values on a ring is taken as the corresponding ring-based feature. Similarly, in the case of the 3D MHV [30], the Fourier transform of the data points along a ring is first computed and the sum of their Fourier magnitudes is taken as the ring-based feature.
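For concreteness, the angular-suppression idea just described can be sketched as follows. This is our own illustrative code, not taken from [27,30,47]; the function and parameter names are hypothetical. It bins body-centered points into (h, r) rings and keeps, per ring, either the sum of the bin values or the sum of the Fourier magnitudes along θ.

```python
import numpy as np

def ring_features(points, n_h=20, n_r=10, n_theta=36, use_fft=False):
    """Theta-suppressed ring features from body-centered 3D points.

    points: (N, 3) array of (x, y, z) coordinates, already centered on the
    subject's vertical axis.  Returns an (n_h * n_r,) feature vector that does
    not depend on how the points are distributed in the angular dimension.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x ** 2 + y ** 2)                     # radius from the body axis
    theta = np.mod(np.arctan2(y, x), 2 * np.pi)      # angular location (suppressed below)

    # 3D histogram over the body-centered cylindrical coordinates (h, r, theta)
    hist, _ = np.histogramdd(
        np.column_stack([z, r, theta]),
        bins=(n_h, n_r, n_theta),
        range=((z.min(), z.max()), (0.0, r.max() + 1e-9), (0.0, 2 * np.pi)),
    )

    if use_fft:
        # 3D-MHV style: sum of Fourier magnitudes along each ring
        ring = np.abs(np.fft.fft(hist, axis=2)).sum(axis=2)
    else:
        # 3D-shape-context style: sum of the bin values along each ring
        ring = hist.sum(axis=2)
    return ring.ravel()
```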
Pose features extracted using these methods are view-invariant since the orientation of the subject no longer affects the extracted features. However, as mentioned earlier, suppression of the angular dimension may cause information loss and introduce ambiguity in gesture recognition. In [23,48], another invariant human pose descriptor based on the 3D shape context has been introduced and used for pose classification [23] and action clustering [48]. This pose descriptor is obtained as the weighted average of a number of local 3D shape context features centered at sample points from a reference visual hull. It has good invariance properties for translation and scaling, but not for view changes.
Multifactor analysis has been a popular computational tool for decomposing ensembles of static data into perceptually independent sources of variation. Previous successful applications include multifactor face image representation using tensorface [49], modeling of 3D face geometry [50], texture and reflectance [51], and image synthesis for tracking [52]. Multilinear analysis has also been used to establish a generative model in a manifold learning framework capturing the separate contributions of poses and body shapes [53] to the image observations. To handle potential non-multilinearity, various kernelized multilinear analysis methods have also been developed [54]. Moreover, probabilistic multifactor analysis in the Gaussian process framework has also been proposed [55].

Multilinear analysis has also been used for extracting view-invariant features. We have developed the Full-Body TensorPose approach for view-invariant pose descriptor extraction from video using multilinear analysis [40,56–61]. The observation vector in the data tensor can be formed using either raw image observations such as silhouettes or 3D volumetric reconstructions such as visual hulls. In both cases, the observation of a human pose is affected by three main factors: the joint angle configuration of the subject (different poses), the subject's body orientation with respect to the camera system, and the body shape of the subject. The Full-Body TensorPose framework is able to extract view-invariant pose descriptors using multilinear analysis. It has been successfully applied to view-invariant static and dynamic gesture recognition [40,56–61], and encouraging gesture recognition results have been obtained. Using these pose coefficients as feature vectors, the support vector machine (SVM) [62] was used for static gesture recognition and the hidden Markov model (HMM) for dynamic gesture recognition [40,56]. It is worth mentioning that the term Tensorposes has been used in [63,64] to describe a multilinear approach to locating nose tips and estimating head orientation from face images. In [63,64], a tensor model is used to characterize the appearance variations caused by different subjects and head pose angles. To avoid confusion, we have adopted the term Full-Body TensorPose to refer to the approach we introduce in this paper for full-body pose feature extraction.

Multilinear analysis is a deterministic approach based on high-order singular value decomposition. Recently, probabilistic multifactor analysis in the Gaussian process framework has also been proposed [55] to factorize different contributing factors in human movement. In this paper, we extend our research on robust extraction of invariant human pose descriptors from video data by systematically investigating multilinear, non-multilinear (kernel-based), deterministic and probabilistic multifactor analysis techniques for view-invariant human pose feature extraction. The two different multifactor analysis approaches to pose feature extraction have been evaluated in terms of the following key performance metrics: degree of invariance to view changes, representation capacity, and performance in video-based pose and gesture recognition. The key contribution of this paper is a systematic comparative study of two different multifactor approaches to view-invariant human pose extraction and their performances in movement recognition.
The experimental results have indicated that the view-invariant pose features obtained using the multilinear analysis are more discriminative and lead to better movement recognition performance than the MGP features. Another important contribution of this paper is the use of MGP for view-invariant pose feature extraction, which has not been done in previous work, e.g., [55]. Our experimental results have indicated that the MGP-based method is slightly superior to the multilinear approach in terms of the degree of view-invariance and the reconstruction quality. Such properties are valuable for movement estimation and tracking.

The remainder of the paper is organized as follows. In Section 2, we briefly review key elements in deterministic
multilinear analysis and probabilistic kernel-based multifactor analysis in the form of Gaussian processes. In Section 3, we introduce our approaches to view-invariant pose feature extraction using both multifactor analysis techniques. In Section 4, our methods for static pose and dynamic gesture recognition are introduced. Experimental results and performance analysis of both multifactor analysis techniques in pose feature extraction and movement recognition are presented in detail in Section 5. Finally, in Section 6, we present our conclusions and future research directions.
2. Background on multifactor analysis

2.1. Deterministic multilinear analysis

Multifactor analysis can be performed in both multilinear and non-multilinear fashions. Linear multifactor analysis, also referred to as multilinear analysis, assumes a multilinear relationship between the contributing factors and the observation data. That is, when all the factors are known except one, the observation becomes a linear function of the remaining unknown factor. In multilinear analysis, a tensor representation of the data is often used. A tensor, also known as an n-way array, multidimensional matrix, or n-mode matrix, is a higher-order generalization of a vector (1-mode tensor) and a matrix (2-mode tensor). Higher-order tensors can represent a collection of data in a richer way. When a data vector is determined by a combination of m factors, the collection of data vectors can be represented as an (m + 1)-mode tensor T ∈ R^(Nv × N1 × N2 × ⋯ × Nm), in which Nv is the length of the data vector and Ni (i = 1, 2, ..., m) is the number of possible values of the ith contributing factor, e.g., the number of people, view-points, or types of illumination in the case of tensorface.

A tensor can be unfolded into a matrix along each mode. The mode-j unfolding matrix of a tensor A ∈ R^(N1 × N2 × ⋯ × Nn) is denoted as A_(j) ∈ R^(Nj × (N1 ⋯ Nj−1 Nj+1 ⋯ Nn)). An illustration of the unfolding of a 3-mode tensor is shown in Fig. 1a and b. As an analog of matrix–matrix multiplication, an n-mode tensor can be multiplied by compatible matrices in each mode. The mode-j multiplication of A with a matrix B ∈ R^(M × Nj) is C = A ×_j B, with C ∈ R^(N1 × ⋯ × Nj−1 × M × Nj+1 × ⋯ × Nn). The entries of C are given by

\[ \mathcal{C}(i_1,\ldots,i_{j-1},i,i_{j+1},\ldots,i_n) = \sum_{k=1}^{N_j} B(i,k)\,\mathcal{A}(i_1,\ldots,i_{j-1},k,i_{j+1},\ldots,i_n), \quad i = 1,\ldots,M \qquad (1) \]
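As a concrete illustration of mode-j unfolding and the mode-j product in Eq. (1), the following NumPy sketch (our own illustration; the function names are hypothetical) implements both operations for a general n-mode tensor.

```python
import numpy as np

def unfold(A, j):
    """Mode-j unfolding: a matrix of shape (N_j, product of the remaining dimensions)."""
    return np.moveaxis(A, j, 0).reshape(A.shape[j], -1)

def fold(M, j, shape):
    """Inverse of unfold(): rebuild a tensor of the given shape from its mode-j unfolding."""
    moved = [shape[j]] + [s for i, s in enumerate(shape) if i != j]
    return np.moveaxis(M.reshape(moved), 0, j)

def mode_multiply(A, B, j):
    """Mode-j product A x_j B for a matrix B of shape (M, N_j), as in Eq. (1)."""
    new_shape = list(A.shape)
    new_shape[j] = B.shape[0]
    return fold(B @ unfold(A, j), j, new_shape)

# Small sanity check on a random 3-mode tensor
A = np.random.rand(4, 5, 6)
B = np.random.rand(3, 5)        # compatible with mode j = 1, whose size is 5
C = mode_multiply(A, B, 1)      # C.shape == (4, 3, 6)
```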
For example, the multiplication of a 3-mode tensor and a matrix in each mode is illustrated in Fig. 1c. When a 3-mode tensor is multiplied by a compatible row vector, it degenerates into a matrix (Fig. 1d). As a generalization of the singular value decomposition (SVD) of matrices, a high-order singular value decomposition (HOSVD) [49] can also be performed on tensors. A tensor A ∈ R^(N1 × N2 × ⋯ × Nn) can be decomposed into

\[ \mathcal{A} = \mathcal{S} \times_1 U_1 \times_2 U_2 \cdots \times_n U_n, \qquad (2) \]

where U_j ∈ R^(Nj × N'j), N'j ≤ Nj, j = 1, ..., n, are mode matrices containing orthonormal column vectors, which are analogous to the left and right matrices in SVD. The tensor S ∈ R^(N'1 × N'2 × ⋯ × N'n) is called the core tensor and is analogous to the diagonal matrix in SVD. To calculate the mode matrices U_j (j = 1, ..., n), we first compute the SVD of the unfolding matrix A_(j); U_j is then obtained by taking the columns of the left matrix of the SVD of A_(j) corresponding to the N'j largest singular values. The core tensor S can then be calculated as

\[ \mathcal{S} = \mathcal{A} \times_1 U_1^T \times_2 U_2^T \cdots \times_n U_n^T. \qquad (3) \]

Fig. 1. Basic tensor operations. (a) and (b) Unfolding a 3-mode tensor, (c) and (d) multiplication of a 3-mode tensor with a matrix and a vector in each mode.

Let u_{j,k} be the kth row vector of the matrix U_j. The element of A at location (i_1, i_2, ..., i_n) is then a multilinear function of the u vectors:
\[ \mathcal{A}(i_1,i_2,\ldots,i_n) = \mathcal{S} \times_1 u_{1,i_1} \times_2 u_{2,i_2} \cdots \times_n u_{n,i_n}. \qquad (4) \]

Let A(i_1, ..., i_{j−1}, :, i_{j+1}, ..., i_n) be the column vector containing the elements A(i_1, ..., i_{j−1}, i_j, i_{j+1}, ..., i_n), i_j = 1, ..., N_j. It can also be represented in a multilinear form:

\[ \mathcal{A}(i_1,\ldots,i_{j-1},:,i_{j+1},\ldots,i_n) = \mathcal{S} \times_j U_j \times_1 u_{1,i_1} \times_2 u_{2,i_2} \cdots \times_{j-1} u_{j-1,i_{j-1}} \times_{j+1} u_{j+1,i_{j+1}} \cdots \times_n u_{n,i_n}. \qquad (5) \]
Consider the data tensor T ∈ R^(Nv × N1 × N2 × ⋯ × Nm), where the first mode corresponds to the dimension of the data vector. Thus, a data vector T(:, i_1, ..., i_j, ..., i_m) can be represented in the above multilinear form, in which the coefficients u_{k,i_k} in each mode are independent factors contributing to this data vector. The interaction of these factors is governed by the tensor S ×_1 U_1. In the case that no decomposition is done along the data mode, U_1 becomes an identity matrix and S ×_1 U_1 = S. When a new data vector is available, the corresponding coefficient vectors can be found by solving a multilinear equation using the alternating least squares method. This multilinear analysis framework has been widely used for the factorization of individual contributing factors in a few applications, including face image representation using tensorface [49], modeling of 3D face geometry [50], texture and reflectance [51], and image synthesis for tracking [52].
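The HOSVD of Eqs. (2) and (3) and a single least-squares step for fitting a new data vector can be sketched as follows. This is our own illustration: it reuses the unfold() and mode_multiply() helpers from the previous listing, and the tensor sizes, ranks, and the one-step projection are simplified stand-ins for the full alternating least squares fitting mentioned above.

```python
import numpy as np
# uses unfold() and mode_multiply() from the previous listing

def hosvd(T, ranks):
    """Truncated HOSVD (Eqs. (2)-(3)): returns the core tensor S and the mode matrices U_j."""
    U = []
    for j, r in enumerate(ranks):
        # Left singular vectors of the mode-j unfolding, truncated to rank r
        Uj, _, _ = np.linalg.svd(unfold(T, j), full_matrices=False)
        U.append(Uj[:, :r])
    S = T
    for j, Uj in enumerate(U):
        S = mode_multiply(S, Uj.T, j)     # S = T x_1 U_1^T x_2 U_2^T ... x_n U_n^T
    return S, U

# Example: a toy data tensor with modes (data vector, factor 1, factor 2)
T = np.random.rand(60, 8, 16)
S, (U_data, U_f1, U_f2) = hosvd(T, ranks=(60, 8, 16))

# One least-squares step for a new data vector y: solve for the factor-1
# coefficients while the factor-2 coefficients are held at an initial guess
# (alternating such steps between the factors gives the ALS fitting above).
y = np.random.rand(60)
B = mode_multiply(S, U_data, 0)                   # interaction tensor S x_1 U_1
u2 = np.ones(16) / np.sqrt(16.0)                  # initial guess for factor 2
A1 = mode_multiply(B, u2[None, :], 2)[:, :, 0]    # (60, 8) matrix, linear in factor 1
u1, _, _, _ = np.linalg.lstsq(A1, y, rcond=None)  # factor-1 coefficients given u2
```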
2.2. Probabilistic multifactor analysis

Recently, the multifactor Gaussian process (MGP) model has been proposed [55] as a probabilistic, kernel-based multifactor analysis framework for the separation of style and content in movement data and for movement synthesis. MGP was developed based on the Gaussian process (GP) method [65] and the Gaussian process latent variable model [66] by inducing kernel functions based on the multilinear model. Because of the use of kernel functions, MGP is able to represent more general multifactor models beyond multilinear relationships. In addition, since MGP is rooted in GP, it is inherently a probabilistic framework.

2.2.1. Gaussian processes

Let f be a zero-mean Gaussian process, defined as a function of X = {x_1, x_2, ..., x_N}, such that the function values f = {f(x_n) | n = 1, 2, ..., N} satisfy a multivariate Gaussian distribution

\[ p(\mathbf{f}; K) = \mathcal{N}(\mathbf{f}; 0, K). \qquad (6) \]

The elements of the covariance matrix K are defined by the covariance function or kernel function, K_{i,j} = k(x_i, x_j). Two commonly applied kernel functions are the linear kernel k(x, x') = x^T x' and the RBF kernel k(x, x') = exp(−γ‖x − x'‖²). Gaussian processes can be applied to predict the function value at an unknown point. If a set of input variables X and the corresponding function values f are known, the conditional distribution of the function value f_* at a new input point x_* is [65]

\[ p(f_* \mid \mathbf{f}) = \mathcal{N}\big(f_*;\ \mathbf{k}_*^T K^{-1}\mathbf{f},\ k_{**} - \mathbf{k}_*^T K^{-1}\mathbf{k}_*\big), \qquad (7) \]

where k_{**} = k(x_*, x_*) is the unconditional variance of f_*; k_* is the cross-covariance vector of f_* and f, i.e., k_*(i) = k(x_*, x_i); and K is the covariance matrix of f determined by X and the kernel function as defined above. This conditional distribution of f_* given f is still Gaussian since they are jointly Gaussian. The conditional mean of f_* is a linear combination of the sample function values f. The conditional variance of f_* is smaller than the unconditional variance due to the knowledge of the sample points. In addition, this conditional distribution of f_* is completely determined by the kernel function and the input point x_* once the sample input and output points are given.

2.2.2. Multifactor Gaussian process

MGP is a special case of the Gaussian process with a particular design of kernel function. MGP models the effect of multiple independent factors on the output; the input space is therefore divided into multiple factor spaces. An input vector x = {x^(1), ..., x^(M)} is made up of components from these factor spaces. The kernel function of MGP is then given by

\[ k(x, x') = \prod_{i=1}^{M} k_i\big(x^{(i)}, x'^{(i)}\big) + \beta^{-1}\delta(x, x'), \qquad (8) \]

in which δ(x, x') is 1 when x = x' and zero otherwise, and β is the output noise factor. The kernel function of each factor, k_i(x^(i), x'^(i)), can be defined independently. Therefore, the influence of each factor on the output can be specified separately, as discussed in Section 3.2. MGP can be applied to model the mapping from a latent space consisting of multiple factor spaces to a high-dimensional observation space. In this case, each dimension of the observation vector is an MGP, and the kernel functions for all dimensions are the same or differ at most by a linear scaling factor. If, given a set of observations, the latent points are unknown, the model can be viewed as a special case of the Gaussian process latent variable model (GPLVM) [66].

3. Multifactor pose feature extraction

In our proposed method, given a 3D human movement sensing and analysis task, training data are collected from typical movement related to the specific application, and a mapping function from the video observation (e.g., visual hull data) to the pose descriptor is then established through learning. Based on this mapping function, invariant pose descriptors can be extracted directly from the input video. Such an intermediate representation of human poses can be used in a wide range of applications involving video-based 3D human movement sensing and analysis, such as articulated movement tracking and gesture recognition. In our research, we have examined two multifactor analysis techniques, the multilinear approach and the multifactor Gaussian process approach, in developing a new human pose descriptor extraction scheme that achieves the desired invariance to changes in the orientation of the subject. In this section, we present our proposed approaches to invariant pose descriptor extraction using the multilinear and the multifactor Gaussian process techniques.
3.1. Invariant pose descriptor extraction using multilinear analysis

We propose to extract invariant pose descriptors using multilinear analysis. The observation vector in the data tensor can be formed from a 3D volumetric reconstruction such as visual hull data. The visual hull observation of a human pose is affected by three main factors: the joint angle configuration of the subject (different poses), the subject's body orientation with respect to the camera system, and the physical body variations of the subject (e.g., height and weight). The basic idea of the proposed framework for invariant pose feature extraction is that, given a set of key body poses identified either manually or automatically, we construct a multilinear analysis framework to obtain invariant descriptors for these key poses. In the application of gesture recognition, the key poses can be landmarks representing the dynamic gesture.

3.1.1. Model learning and pose descriptor extraction

To extract invariant pose descriptors using multilinear analysis, a tensor including visual hull data of a number of key poses in
different body orientations needs to be constructed. For example, Fig. 2 illustrates a visual hull tensor with three modes, voxel, body orientation, and pose, which can be used to obtain view-invariant pose descriptors. For a specific application, a set of key poses needs to be selected. In our proposed approach, the key poses are selected automatically based on the motion energy of the movement data [30]. Once the set of key poses is selected, each sample visual hull of a key pose can be rotated about the axis perpendicular to the ground plane in order to generate training visual hulls of the pose in each designated orientation. In our approach, the designated orientation angles are evenly distributed between 0° and 360°. To remove the effect of translation and scaling on the visual hull, a simple normalization procedure [58] can be applied to the visual hull data. The normalized visual hull is then vectorized and concatenated to form a complete input vector. As shown in Fig. 2, the input observation vectors are spanned along the orientation mode and the pose mode to form the training tensor D, which can be decomposed as
\[ \mathcal{D} = \mathcal{C} \times_1 U_{\mathrm{Voxel}} \times_2 U_{\mathrm{Orientation}} \times_3 U_{\mathrm{Pose}} \qquad (9) \]
by using high-order singular value decomposition (HOSVD) [49] to extract the core tensor C and the basis matrices U. The rows of U_Pose correspond to the pose features of the training key poses used to establish the pose tensor, and they are orthonormal vectors. To keep these key pose feature vectors orthogonal to each other, no dimension reduction is performed at this step. Orthogonality implies large distances in the feature space, and such a discriminative property is beneficial to movement recognition. Given any input visual hull, the corresponding pose and body orientation coefficient vectors can be computed using the Tucker alternating least squares (ALS) method [67–70] based on the core tensor C. Given input visual hulls of the same pose from different views, the extracted pose coefficient vectors will be similar to each other. Thus the pose coefficient vector can be used as a view-invariant pose descriptor.

3.2. Invariant pose descriptor extraction using the multifactor Gaussian process model

In our proposed method, we have used the kernel-based nonlinear multifactor Gaussian process approach [55] to tackle the challenge of achieving invariance in extracting pose descriptors.
Fig. 2. Formation of pose tensor using visual hulls.
Recently, kernel-based multifactor analysis has been developed to separate content and style from motion capture data. Such a kernel-based approach has been found to be effective in representing and modeling the potential nonlinearity in how contributing factors such as pose and orientation affect the visual hull observation. Although the kernel-based multifactor analysis [55] has been proposed, it has mainly been used in modeling and synthesizing movement data; it has not been exploited for view/body-invariant pose feature extraction. Our experimental results show that the advances made in view-invariant pose descriptor extraction using kernel-based multifactor analysis are complementary to those made using the multilinear approach.

3.2.1. Model learning

In our proposed approach, the observed visual hull is modeled as being affected by two factors: the body pose and the body orientation. Therefore, a latent point can be represented as x = {x^(p), x^(o)}, in which x^(p) is the pose descriptor vector and x^(o) is the orientation vector. Training data can be obtained in the same way as in the multilinear approach. Let the training samples form a matrix Y_t in which each row is a vectorized volumetric reconstruction. Since zero-mean MGPs are applied, a mean vector needs to be subtracted from each row of Y_t. After centering the training data, we also normalize them by dividing the elements in each dimension by the standard deviation of that dimension. The resulting training observations are denoted as Y. Let N_p be the number of key poses and N_o the number of designated orientations. Denoting by N = N_p N_o the number of training samples and by D the dimensionality of the observation vector, the dimensionality of Y is N × D.

The body orientation variable has one degree of freedom. In order to model the periodicity of rotation, we use a 2D orientation vector x^(o) to represent the body orientation, which can be intuitively considered as the location of the body orientation angle on the unit circle. On the other hand, D_p, the dimensionality of the pose feature vector x^(p), can be determined according to the distribution of the latent points. Initially, D_p is set equal to N_p and the latent points corresponding to Y are obtained according to the training procedure presented in the following paragraph. These latent points contain redundant information, and dimension reduction can be performed to improve computational efficiency in pose feature extraction. Similar to dimension reduction in principal component analysis (PCA), D_p is found based on an eigen-analysis of the covariance matrix of the latent points. To reduce D_p, in our research we first compute M, the covariance matrix of the learned pose vectors, and find the eigenvalues of M, which correspond to the energy of the covariance matrix. We then find the smallest set of eigenvalues such that a certain percentage of the energy is still preserved when the remaining eigenvalues are dropped. D_p is then selected to be the cardinality of the smallest set of eigenvalues satisfying the energy preservation constraint; this eigenvalue set usually contains the D_p largest eigenvalues. In our research, 99.5% of the energy is preserved and D_p is found to be 12.

Given the dimensions of the latent space, the following training procedure can be taken to learn the latent points. During the training process, we constrain the latent points of samples belonging to the same pose to remain the same.
Likewise, the orientation coefficients of samples belonging to the same orientation are constrained to remain the same. In this training process, the orientation coefficients of the designated orientations are initialized as 2D points evenly distributed on the unit circle. The latent points of the key poses are initialized as non-negative D_p-dimensional vectors, and the pair-wise distances between these initial pose latent points are made similar to each other. With such constraints and initial latent points, given the training observations Y, the corresponding set of latent points X = {x_n}_{n=1}^N and the parameters of the kernel function C = {c_p, c_o, β} are optimized by maximizing the
following log-likelihood function with respect to X = {x_n}_{n=1}^N and C = {c_p, c_o, β}:

\[ L(X, C) = \log p(Y \mid X, C) = \log \prod_{i=1}^{D} p\big(y_i \mid K(X, C)\big) = -\frac{ND}{2}\log\big(2\pi |K(X, C)|\big) - \frac{1}{2}\sum_{i=1}^{D} y_i^T K(X, C)^{-1} y_i, \qquad (10) \]

where y_i is the ith column vector of Y. The covariance matrix K(X, C) is specified as

\[ K_{i,j}(X, C) = k(x_i, x_j; C), \qquad (11) \]

where the kernel function is defined as

\[ k(x, x'; C) = \exp\Big(-\frac{c_p}{2}\,\|x^{(p)} - x'^{(p)}\|^2\Big)\exp\Big(-\frac{c_o}{2}\,\|x^{(o)} - x'^{(o)}\|^2\Big) + \beta^{-1}\delta(x, x'). \qquad (12) \]

The quasi-Newton method [71] is applied for the optimization. At each optimization step, the latent points and the kernel parameters are optimized together. In this case, the RBF kernel is applied for both the pose and the orientation factors. In our future research, we will also explore other types of kernel functions and their combinations.

After the latent points X* and model parameters C* are learned from the training data Y, the conditional distribution of the observation y* corresponding to a new latent point x* = {x*^(p), x*^(o)} can be found as a normal distribution according to Gaussian process theory. Define k* = k(x*, X*, C*) to be the cross-covariance vector between x* and X*. The ith element of k(x, X, C) is given by

\[ k_i(x, X, C) = k(x, x_i; C), \quad i = 1,\ldots,d, \qquad (13) \]

where d is the dimension of the latent space. Define K* = K(X*, C*) and k** = k(x*, x*, C*). The conditional distribution of y* is given by

\[ p(y_* \mid x_*, Y, X^*, C^*) = \prod_{i=1}^{D} \mathcal{N}\big(y_*(i);\ k_*^T K^{*-1} y_i,\ k_{**} - k_*^T K^{*-1} k_*\big), \qquad (14) \]

where y*(i) is the ith element of y* and y_i is the ith column vector of Y.

3.2.2. Invariant pose descriptor extraction

To extract invariant pose descriptors, we need to infer the latent point x* = {x*^(p), x*^(o)} from a new observation y*. In our proposed framework, this is done by solving the following optimization problem

\[ x_* = \arg\max_x\ \log p(y_* \mid x, Y, X^*, C^*), \qquad (15) \]

in which the distribution p(y* | x, Y, X*, C*) is defined above. To solve this optimization problem, the quasi-Newton method is applied. Since the quasi-Newton method can only find local optima, it is important to find a good initial point for the optimization process, and it is often beneficial to use multiple initial points to improve the optimization result. In our research, we first select the m observations in Y that are closest to y* and take their corresponding latent points as the initial points. After optimization, m locally optimal latent points are obtained, and the latent point yielding the largest likelihood of y* is chosen as the solution for x*. In practice, there is a tradeoff in selecting m between precision (favoring a larger m) and computational efficiency (favoring a smaller m). In our research on pose feature extraction using MGP, we have used m = 10 initial points. Once the latent point x* is inferred, the pose coefficient x*^(p) and the orientation coefficient x*^(o) are obtained, and the pose component x*^(p) is taken as the invariant pose descriptor of the corresponding pose.
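As a rough illustration of Eqs. (12)–(15), the following sketch (our own code with hypothetical names and simplified shapes; the learned latent points, kernel parameters and training observations are assumed to be given) evaluates the MGP product kernel, the conditional likelihood of a new observation, and the multi-start quasi-Newton search for its latent point.

```python
import numpy as np
from scipy.optimize import minimize

def mgp_kernel(X1, X2, cp, co, beta, dp):
    """Product RBF kernel over pose (first dp dims) and orientation factors, as in Eq. (12)."""
    P1, P2 = X1[:, :dp], X2[:, :dp]
    O1, O2 = X1[:, dp:], X2[:, dp:]
    dP = ((P1[:, None, :] - P2[None, :, :]) ** 2).sum(-1)
    dO = ((O1[:, None, :] - O2[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * cp * dP) * np.exp(-0.5 * co * dO)
    if X1 is X2:
        K += np.eye(len(X1)) / beta          # noise term beta^{-1} delta(x, x')
    return K

def neg_log_pred(x, y_new, Y, X, params, dp):
    """Negative log of p(y* | x, Y, X*, C*) from Eq. (14), up to an additive constant."""
    cp, co, beta = params
    K = mgp_kernel(X, X, cp, co, beta, dp)
    k_star = mgp_kernel(x[None, :], X, cp, co, beta, dp)[0]
    k_ss = 1.0 + 1.0 / beta                  # k(x, x) for the RBF product kernel plus noise
    alpha = np.linalg.solve(K, k_star)
    mean = Y.T @ alpha                       # predictive mean, one value per output dimension
    var = max(k_ss - k_star @ alpha, 1e-9)   # predictive variance shared by all dimensions
    return 0.5 * np.sum((y_new - mean) ** 2) / var + 0.5 * len(y_new) * np.log(var)

def infer_latent(y_new, Y, X, params, dp, m=10):
    """Eq. (15): multi-start quasi-Newton search, initialized at the latent points
    of the m training observations closest to y_new."""
    idx = np.argsort(((Y - y_new) ** 2).sum(1))[:m]
    best = None
    for i in idx:
        res = minimize(neg_log_pred, X[i], args=(y_new, Y, X, params, dp),
                       method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best.x[:dp], best.x[dp:]          # pose descriptor, orientation vector
```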
4. Movement recognition using invariant features

In this section, we present two exemplar applications showing how the invariant human pose descriptors can be applied in movement recognition.

4.1. Pose recognition

The view-invariant pose features can be directly used for view-invariant 3D pose recognition. In our research, pose recognition is achieved by using support vector machine (SVM) classifiers in a one-versus-the-rest manner [72]. For each pose, we train a binary classifier to identify whether the input pose feature "is" or "is not" the target pose. The training data of the classifier for a specific pose consist of positive samples of the corresponding pose and negative samples obtained from the other poses. When all the classifiers are trained, pose recognition is achieved by a traversal of all the classifiers. Each classifier returns a binary label indicating whether it accepts the input as the corresponding pose, as well as the value of its discriminative function. In our pose recognition experiments, the kernel types and kernel parameters are set to be the same for all binary SVM classifiers; therefore, all the SVM classifiers work in the same feature space. It is possible that multiple SVM classifiers return positive results for a testing frame. In this case, the signed distances from the testing point to the dividing hyperplanes of these SVMs are used to select the final class: the pose class corresponding to the SVM classifier yielding the maximum signed distance to its dividing hyperplane is selected as the recognized pose. For an SVM classifier, this distance from a testing frame to its dividing hyperplane can be found by dividing the corresponding value of the discriminative function by a normalization term, namely the length of the linear coefficient vector of the dividing hyperplane. Details of SVM classification can be found in [73].

4.2. Gesture recognition

A dynamic gesture can be viewed as a time series of static body poses. Once invariant pose descriptors can be extracted from the input video data of the static poses, a gesture can be represented by a sequence of invariant pose descriptors. Using these pose descriptors as the observation vectors, gestures can be recognized from a continuous movement stream by using hidden Markov models (HMMs), as shown in Fig. 3. These movement HMMs are learned from training data to represent the specific non-gesture movement patterns. Because of the use of the invariant pose descriptor, the resulting gesture recognition system is also invariant to changes in orientation.
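A minimal sketch of such an HMM-based gesture classifier, assuming the hmmlearn package and hypothetical variable names, is given below; it trains one Gaussian HMM per gesture class on sequences of pose descriptors and assigns a test sequence to the class with the highest log-likelihood. The left-to-right topology used later in our experiments would additionally require constraining the initial-state and transition probabilities.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_gesture_hmms(train_sequences, n_states=12):
    """train_sequences: dict mapping gesture label -> list of (T_i, d) pose-descriptor arrays."""
    models = {}
    for label, seqs in train_sequences.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        model.fit(X, lengths)                    # Baum-Welch training
        models[label] = model
    return models

def classify_gesture(models, seq):
    """Assign a pose-descriptor sequence to the gesture whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(seq))
```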
5. Experimental results and performance comparison

Due to their different nature, as discussed above, the two proposed approaches may perform differently for the same application, and it is important to identify their applicability. To provide concrete insights into the applicability of these potentially complementary approaches for practical applications, we have systematically evaluated and compared the two invariant pose descriptors in terms of the following key performance metrics: degree of invariance to view changes, representation capacity, and performance in video-based movement recognition.
Fig. 3. Using the invariant pose descriptor in gesture recognition.
5.1. Exemplar view-invariant features

Using the multilinear and the multifactor Gaussian process approaches, we have obtained two different view-invariant pose descriptor results using the 3-mode (voxel, pose, and view angle) pose tensor with 25 key poses and 16 views. The first row of Fig. 4 shows 3D visual hull data (new testing data outside of the training set) of a key pose from 16 views (Fig. 4a), the 25-dimensional pose descriptors extracted using the multilinear approach (Fig. 4b), and the 12-dimensional pose descriptors extracted using the multifactor Gaussian process approach (Fig. 4c). The first row of Fig. 5 presents visual hulls and the corresponding pose features of a non-key pose. From these figures it can be seen that the pose descriptors extracted from visual hull data of the same pose viewed from different view-angles are indeed close to each other.
Fig. 4. Visual hull data of a key pose in 16 body orientations (a) and the corresponding multilinear (b) and MGP (c) pose vectors. Subplots (d–f) show sample data (d) of the same pose corrupted by protrusion errors (circled) and the corresponding multilinear (e) and MGP (f) pose vectors. Subplots (g–i) show sample data (g) of the pose with occlusion errors and the corresponding multilinear (h) and MGP (i) pose vectors.
Fig. 5. Visual hull data of a non-key pose in 16 body orientations (a) and the corresponding multilinear (b) and MGP (c) pose vectors. Subplots (d–f) show sample data (d) of the same pose corrupted by protrusion errors (circled) and the corresponding multilinear (e) and MGP (f) pose vectors. Subplots (g–i) show sample data (g) of the pose with occlusion errors and the corresponding multilinear (h) and MGP (i) pose vectors.
In practice, visual hull data are often noisy due to 3D reconstruction errors. Visual hull errors often appear as large blocks of uncarved background voxels or large blocks of miscarved foreground voxels. In our research, the first type of error is referred to as protrusion errors and the second type as partial occlusion errors. To verify and compare the robustness of the two proposed feature extraction approaches in the presence of such errors, we have examined pose features extracted from noisy visual hull data. In our experiment, noisy visual hull data were obtained by adding either protrusion or partial occlusion errors to the original data used in the previous analysis (upper left subfigures in Figs. 4 and 5). To add a protrusion error to a visual hull, we first select a protrusion sphere with a random center in the background voxels and a diameter of three voxels (a tenth of the side length of the volumetric reconstruction). To realistically synthesize a protrusion error, this random sphere has to overlap with the visual hull, with an overlapping volume of less than half of the sphere. Otherwise, another random sphere is selected and tested, until a valid
protrusion error is synthesized. Once a protrusion sphere is found, all the voxels inside the sphere are considered protrusion voxels and their values are set to 1. Likewise, to add a partial occlusion error, a partial occlusion sphere is first generated with a random center in the foreground voxels and a diameter of three voxels; all the voxels within this sphere are then considered occluded and their values are set to 0. Examples of such noisy visual hull data are shown in Fig. 6. The noisy visual hull data for a key pose and the corresponding pose vectors extracted using the multilinear and the MGP approaches are presented in the middle (protrusion error) and bottom (partial occlusion) rows of Fig. 4. The noisy data and the corresponding pose vectors of the non-key pose are given in Fig. 5. In total, 12 sets of pose features have been extracted and shown in Figs. 4 and 5, corresponding to 12 different scenarios indexed according to the dataset (key pose versus non-key pose), the error type (no error, protrusion error, and partial occlusion error), and the feature extraction method (multilinear versus MGP).
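The two error types can be synthesized roughly as described above with the following sketch (our own illustration with hypothetical names), operating on a binary voxel volume.

```python
import numpy as np

def _sphere_mask(shape, center, diameter):
    zz, yy, xx = np.indices(shape)
    dist2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return dist2 <= (diameter / 2.0) ** 2

def add_protrusion(vol, diameter=3, max_tries=100, rng=np.random):
    """Set background voxels inside a random sphere to 1; the sphere must overlap
    the visual hull, with the overlap below half of the sphere volume."""
    bg = np.argwhere(vol == 0)
    for _ in range(max_tries):
        mask = _sphere_mask(vol.shape, bg[rng.randint(len(bg))], diameter)
        overlap = np.logical_and(mask, vol == 1).sum()
        if 0 < overlap < 0.5 * mask.sum():
            out = vol.copy()
            out[mask] = 1
            return out
    return vol

def add_occlusion(vol, diameter=3, rng=np.random):
    """Set all voxels inside a random sphere centered on a foreground voxel to 0."""
    fg = np.argwhere(vol == 1)
    mask = _sphere_mask(vol.shape, fg[rng.randint(len(fg))], diameter)
    out = vol.copy()
    out[mask] = 0
    return out
```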
Fig. 6. Examples of two types of visual hull errors: (a) original visual hull, (b) noisy data with a protrusion error (circled), (c) noisy data with a partial occlusion error (circled).
It can be clearly seen from these figures that both the multilinear and the MGP approaches are robust to visual hull errors. To obtain a quantitative measure of the error-resilience of both approaches, we have further computed the pair-wise Euclidean inter-orientation distances between the pose vectors for each scenario. For each dataset and each feature extraction method, the average norm of the corresponding pose features extracted from the error-free (no additional error added) data is used as the normalization constant for that dataset and feature extraction method.
Table 1
Normalized/relative average inter-orientation distances for the 12 scenarios.

Method and noise type              Key pose        Non-key pose
Multilinear, original data         0.055 / 1       0.112 / 1
Multilinear, protrusion            0.072 / 1.31    0.159 / 1.42
Multilinear, partial occlusion     0.085 / 1.55    0.177 / 1.58
MGP, original data                 0.074 / 1       0.025 / 1
MGP, protrusion                    0.074 / 1       0.027 / 1.08
MGP, partial occlusion             0.083 / 1.12    0.028 / 1.12
The inter-orientation distances computed for each scenario are then normalized using the corresponding normalization constant according to the dataset and the feature extraction method. The average distances for all the scenarios are shown in Table 1. According to Table 1, both the multilinear and the MGP approaches are robust to visual hull errors in the sense that the average inter-orientation distances for the noisy cases are all comparable to the average distance in the corresponding error-free case. In Table 1, we have also given, as the second number in each cell, the relative average inter-orientation distance for each scenario, to examine the relative amount by which the average distance has increased due to the added noise for a particular dataset, noise type, and pose feature extraction method. It can be seen that when the MGP method is used, the corresponding incremental percentages in the average distances caused by noise are much smaller than those obtained using the multilinear method. Hence, the MGP method is much more robust to visual hull errors than the multilinear approach, most likely due to its probabilistic nature.
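The normalized inter-orientation distances reported in Table 1 can be computed along the lines of the following sketch (our own illustration; array names are hypothetical).

```python
import numpy as np

def avg_inter_orientation_distance(F):
    """F: (n_orientations, d) pose features of one pose seen from different orientations."""
    diffs = F[:, None, :] - F[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(F), k=1)          # each unordered pair of orientations once
    return dists[iu].mean()

def normalized_distance(F_noisy, F_clean):
    """Average inter-orientation distance of noisy features, normalized by the
    average norm of the error-free features (as used for Table 1)."""
    norm_const = np.linalg.norm(F_clean, axis=1).mean()
    return avg_inter_orientation_distance(F_noisy) / norm_const
```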
Fig. 7. Examples of noisy visual hull data in the IXMAS dataset. Typical errors (in the circles) include remaining uncarved blocks in the background (a and b) and missing (wrongly carved) blocks in the foreground (c–f).
5.2. Degree of invariance

It is critical to produce a measure of invariance for the proposed pose descriptors obtained using the different algorithms. In our research, we have systematically studied and evaluated the invariance of the proposed pose descriptors to view changes. To evaluate the view-invariance, we first randomly select a number of poses and their visual hull data from the IXMAS dataset. The selected data include poses close to the key poses as well as poses very different from the key poses. Many of the visual hull data in the IXMAS dataset suffer from the protrusion and partial occlusion errors introduced before; examples of such visual hull construction errors are shown in Fig. 7. Therefore, the results reported in this section also reflect the performance of the proposed features on noisy visual hull data. Using the selected visual hull data, a view-invariance evaluation test dataset is then synthesized by rotating each of the testing poses to a number of different (e.g., 16) facing directions. Once the test dataset is constructed, for each pose we extract the corresponding pose descriptors from these visual hulls and compute their pair-wise Euclidean distances. The maximum of these distances, defined as the maximum inter-orientation distance (MIOD), is used as a measure of the view-invariance of the pose vectors obtained at this frame. The histogram of the MIOD over all the testing poses gives an overall picture of the view-invariance of the corresponding pose descriptor. We have obtained results evaluating the view-invariance property of the pose descriptors extracted using the 3-mode tensor (multilinear) analysis and using MGP. In our study, we randomly selected 10 frames of visual hull data from each of 10 subjects, so that in total 100 frames were used for the view-invariance evaluation. Among these 100 frames, 18 frames are close to one of the key poses. To put such a view-invariance difference measure
into the proper context, we also computed the pair-wise inter-frame distances using the 100 original visual hull frames. The histograms of the MIOD of the key pose frames and of the non-key pose frames, for the pose features obtained using multilinear analysis, are shown in Fig. 8a and b. The histogram of the overall inter-frame distances is shown in Fig. 8c. Using the same dataset, pose features were also extracted using the MGP method and the corresponding histograms are shown in Fig. 9. Please note that the typical distance ranges of the multilinear and MGP features are different: 0–2 for the multilinear features and 0–6 for the MGP features (see Figs. 8 and 9). To obtain a normalized view of the distance distributions, all the histograms use 10 bins, and the size of each bin is a tenth of the maximum overall inter-frame distance (MOIFD) of the corresponding type of pose descriptor. From Figs. 8 and 9 it can be seen that the MIOD values of nearly all the key pose frames (17 out of 18 for the multilinear pose descriptors and all 18 for the MGP pose descriptors) and of the majority of the non-key pose frames (73 out of 82 for the multilinear pose descriptors and 78 out of 82 for the MGP pose descriptors) are less than a tenth of the MOIFD. Therefore, only a small percentage (10% for the multilinear pose descriptors and 4% for the MGP pose descriptors) of the testing frames has high MIOD values. Hence, we have experimentally verified that the proposed pose feature extraction methods using both multilinear analysis and MGP can effectively extract view-invariant features from visual hull data. Moreover, between the two view-invariant feature extraction approaches, the MGP-based approach exhibits a slightly stronger view-invariance property than the multilinear analysis-based approach.

5.3. Representation capacity

For a pose descriptor, it is important to assess how well it can represent different body poses. To this end, we have examined its representation capacity by looking at the visual hull reconstruction quality for both key poses and non-key poses.
Fig. 8. Distance distributions of pose vectors obtained by multilinear analysis. (a) Inter-orientation distances of pose vectors obtained from key pose frames. (b) Inter-orientation distances of pose vectors obtained from non-key pose frames. (c) Inter-frame distances between pose vectors obtained from 100 frames.
Fig. 9. Distance distributions of pose vectors obtained using MGP. (a) Inter-orientation distances of pose vectors obtained from key pose frames. (b) Inter-orientation distances of pose vectors obtained from non-key pose frames. (c) Inter-frame distances between pose vectors obtained from 100 frames.
Fig. 10. Illustration of visual hull reconstruction: (a) the original visual hull, (b) reconstruction from the multilinear method, (c) reconstruction from the MGP method.
Since both the multilinear method and the MGP method are essentially generative models, given testing visual data y*, it is straightforward to find the reconstruction y*' from the inferred latent point. The distance between y* and y*' measures how well the pose descriptor represents y*. Computing this reconstruction error for a number of testing poses provides a picture of the representation capacity of the pose descriptor. Fig. 10 shows a testing visual hull and the reconstructions obtained from the pose descriptors computed using the multilinear (middle) and the MGP (right) methods. It can be seen that both approaches can reconstruct the original input visual data to a certain extent, and that the MGP-based method achieves better reconstruction results than the multilinear analysis-based method.

5.4. Results on pose recognition

We have applied the view-invariant pose descriptors obtained using multilinear analysis and those obtained using MGP to pose recognition. We performed tests on two datasets: one dataset is composed of 20 dance poses, and the other contains 20 poses selected from the IXMAS action dataset [30].

5.4.1. Recognition of dance poses

We performed pose recognition on a dataset containing 20 dance poses choreographed by a professional dancer. These poses are shown in Fig. 11a. This dataset also contains 20 trick poses.
These trick poses are outliers, but each is similar to one of the 20 standard poses. The trick poses are shown in Fig. 11b. For this dataset, we used a pair of 2D silhouettes obtained from orthogonal views as the input data for a pose. The view-invariant pose extraction algorithms using the multilinear analysis and the MGP method introduced in Section 3 can still be used; in this case, an input data sample is obtained by concatenating the two vectorized body silhouettes of the subject captured by the two cameras. In our research, SVMs with radial basis function (RBF) kernels have been used for pose recognition in a one-versus-the-rest manner. To train the SVM classifiers, we used images synthesized using motion capture data and animation software as well as real images captured by two video cameras. For testing, we used additional real images different from those in the training set. For each pose, there are 192 synthetic training samples and eight real training samples, and 16 real image pairs are used as testing samples. For the trick poses, we used 320 real image pairs, 160 of them as training samples and the remaining 160 as testing samples. In this test, a testing data sample, such as a sample from a trick pose, could belong to none of the poses in the pose vocabulary (i.e., an outlier). In this case, if none of the classifiers accepts a testing sample, it is identified as an outlier.
Fig. 11. (a) The 20 dance poses. (b) The 20 trick poses.
Table 2
Recognition results of 20 dance poses.

Feature extraction method    SVM parameters     Recognition rate (%)    False detection rate (%)
Multilinear analysis         r = 0.7, C = 5     87.81                   5
Multilinear analysis         r = 1.2, C = 6     88.75                   8.75
Multilinear analysis         r = 1, C = 2       89.06                   10.63
MGP                          r = 0.2, C = 6     62.81                   5
MGP                          r = 0.3, C = 4     68.43                   8.75
MGP                          r = 0.4, C = 9     74.38                   10.63
Otherwise, the testing sample is recognized as the pose corresponding to the classifier yielding the maximum signed distance from the testing point to the SVM dividing hyperplane in the feature space, among all the classifiers accepting the testing sample. Therefore, each testing sample is recognized as one of the 20 poses or as an outlier. We evaluate the recognition results using the recognition rate (RR) and the false detection rate (FDR). The recognition rate is the percentage of testing samples of standard poses that are correctly recognized as their corresponding poses. The false detection rate is the percentage of testing samples of trick poses that are wrongly recognized as one of the standard poses. The RR and FDR of pose recognition using both the multilinear analysis based features and the MGP based features are shown in Table 2. It can be seen that the features obtained from multilinear analysis have led to better pose recognition results than the features obtained using the MGP method.
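For concreteness, the one-versus-the-rest SVM scheme with outlier rejection described in Section 4.1 can be sketched as follows, assuming scikit-learn; this is our own illustration, and the parameter values are placeholders rather than those in Table 2.

```python
import numpy as np
from sklearn.svm import SVC

def train_pose_svms(X, y, gamma=1.0, C=5.0):
    """Train one binary RBF-SVM per pose label (one-versus-the-rest)."""
    classifiers = {}
    for label in np.unique(y):
        clf = SVC(kernel="rbf", gamma=gamma, C=C)
        clf.fit(X, (y == label).astype(int))
        classifiers[label] = clf
    return classifiers

def recognize_pose(classifiers, x):
    """Return the accepting classifier with the largest decision value, or None
    (outlier) if no classifier accepts the sample.  The paper additionally divides
    by the hyperplane coefficient norm to obtain a true signed distance."""
    scores = {label: clf.decision_function(x[None, :])[0]
              for label, clf in classifiers.items()}
    accepted = {label: s for label, s in scores.items() if s > 0}
    return max(accepted, key=accepted.get) if accepted else None
```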
case, we simply assign the sample to the pose corresponding to the classifier yielding the maximum signed distance to the SVM dividing hyperplane in the feature space and every testing sample is assigned to one and only one pose class. To evaluate the performance of pose recognition, we have computed the recognition rate (RR) and the false alarm rate (FAR). For a particular pose p, the corresponding RR is computed as the percentage of correctly recognized in-class pose samples, and the corresponding FAR is the ratio of the number of testing samples misrecognized as pose p to the total number of out-class samples for pose p (i.e., testing samples of the other poses). In our research, we have obtained pose recognition results from the multilinear and the MGP features as well as from raw visual hull data using the SVM and the K-nearest neighbors (K-NN) classifiers. The Euclidean distances have been used in all cases. In the case of SVM, the RBF kernel has been used for the multilinear and MGP features. When the raw visual hull data is used in SVM, the linear kernel has been adopted, due to the high dimensionality of the visual hull data [74]. For each classifier-feature (or raw data) scenario, a grid search has been carried out in the corresponding parameter space to identify the best parameters. The pose recognition results obtained using the optimal classification parameters for each case are given in Table 3. It can be seen from Table 3 that the multilinear feature has led to the best pose recognition results using either SVM or K-NN. The result obtained from MGP features using SVM is better than those directly from the raw visual hull data using SVM or K-NN. 5.5. Results on gesture recognition Using the view-invariant descriptors from the multilinear and MGP approaches, we have obtained results for gesture recognition using the pre-segmented gesture data from the IXMAS dataset for gesture training and testing. Similar to the previous case in pose recognition, the 3D visual hull data has been used for extracting the pose descriptors. In our experiment, each gesture is modeled
Table 3
Recognition results of 20 poses in the IXMAS dataset.

Classifier   Feature extraction method        Recognition rate (%)   False alarm rate (%)
SVM          Multilinear analysis             74.40                  1.35
             MGP                              68.09                  1.68
             Raw visual hull data             66.42                  1.77
K-NN         Multilinear analysis, K = 13     71.43                  1.51
             MGP, K = 9                       64.38                  1.88
             Raw visual hull data, K = 1      65.49                  1.81
Fig. 12. Twenty poses selected from the IXMAS dataset.
Table 4
Gesture recognition results using pre-segmented data.

Method                                           Recognition rate (%)   False alarm rate (%)
Weinland 3D [31]                                 93.33                  0.67
Weinland 2D [30]                                 81.3                   1.97
Proposed method using multilinear descriptors    94.59                  0.54
Proposed method using MGP descriptors            87.90                  1.17
5.5. Results on gesture recognition

Using the view-invariant descriptors from the multilinear and MGP approaches, we have obtained gesture recognition results using the pre-segmented gesture data from the IXMAS dataset for training and testing. As in the pose recognition experiments, the 3D visual hull data has been used for extracting the pose descriptors. In our experiment, each gesture is modeled as a 12-state left-to-right HMM. Following [31,30], we used the data from the first 10 subjects and applied the same cross-validation strategy described in the previous section for training and testing, which is also the strategy used in [31,30]. The comparison of gesture recognition results is shown in Table 4. It can be seen that the gesture recognition results obtained using pose descriptors from multilinear analysis are better than those obtained using the MGP-based method, as well as those obtained by state-of-the-art algorithms using the same training and testing data from the IXMAS dataset [30].
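The gesture classifier can be sketched as follows (a minimal illustration using the hmmlearn package, which is an assumption on our part; the paper does not name an implementation). One 12-state left-to-right Gaussian HMM is trained per gesture on sequences of view-invariant pose descriptors, and a pre-segmented test sequence is assigned to the gesture whose model yields the highest log-likelihood.

```python
import numpy as np
from hmmlearn import hmm

def make_left_to_right_hmm(n_states=12):
    """Left-to-right Gaussian HMM: each state may only stay or move to the next state."""
    model = hmm.GaussianHMM(n_components=n_states, covariance_type='diag',
                            n_iter=50, init_params='mc', params='tmc')
    startprob = np.zeros(n_states)
    startprob[0] = 1.0                               # always start in the first state
    transmat = np.zeros((n_states, n_states))
    for i in range(n_states):
        transmat[i, i] = 0.5                         # self-transition
        transmat[i, min(i + 1, n_states - 1)] += 0.5  # forward transition (last state absorbs)
    model.startprob_ = startprob
    model.transmat_ = transmat                       # zero entries stay zero under Baum-Welch
    return model

def train_gesture_models(sequences_per_gesture):
    """sequences_per_gesture: {gesture_label: list of (T_i, d) descriptor sequences} (hypothetical)."""
    models = {}
    for g, seqs in sequences_per_gesture.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        model = make_left_to_right_hmm()
        model.fit(X, lengths)
        models[g] = model
    return models

def classify_gesture(models, seq):
    """Assign a pre-segmented descriptor sequence to the model with the highest log-likelihood."""
    return max(models, key=lambda g: models[g].score(seq))
```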
5.6. Discussions

We have presented our approaches and experimental results for extracting invariant pose descriptors using both the multilinear and the multifactor Gaussian process-based methods, and compared their performances in movement recognition. It is clear from the experimental results that, in terms of the degree of view-invariance and the reconstruction quality, the MGP-based method is slightly better than the multilinear-based method. On the other hand, the multilinear-based pose feature extraction method performs much better than the MGP-based method in movement recognition, including both static pose and dynamic gesture recognition. In the following, we attempt to provide insights into and explanations for these experimental results.

Both the multilinear and MGP methods conduct multifactor analysis, but in different manners. The multilinear approach is deterministic: once trained, the core tensor is fixed and it encodes how the different factors interact to generate the observed visual hull data. For this reason, pose descriptors extracted by the multilinear approach might be more sensitive to changes in additional, unmodeled factors (e.g., the body shape factor). In contrast, the MGP-based approach is based on a probabilistic generative model (the Gaussian process). Essentially, it assumes that the core tensor follows a zero-mean normal distribution. In model learning, instead of finding the core tensor as in multilinear analysis, the kernel parameters are trained from the training data. The benefit of such a probabilistic model is that it is likely more flexible with respect to changes in other, non-modeled contributing factors, and it is also more resilient to the observation noise present in the visual hull reconstruction. For this reason, the MGP-based method exhibits slightly better view-invariance and reconstruction quality than the multilinear-based method. On the other hand, the multilinear-based method is superior to the MGP-based method in movement recognition. This is largely due to the inherent discriminative capacity of the multilinear analysis framework in pose feature extraction. In this approach, the key poses are mapped to orthogonal pose descriptors during training, which is a major benefit for discriminative analysis such as pose and gesture recognition. In contrast, the pose descriptors obtained using the MGP approach do not have this orthogonal structure for the key poses, and the general assumption of zero-mean, normally distributed core tensor elements in the MGP approach may not effectively capture the underlying structure of the pose descriptor space, leading to inferior classification.

6. Conclusions and future work

It is clear from our research that both multilinear analysis and the multifactor Gaussian process are effective approaches for view-invariant pose feature extraction. Experimental results show that the MGP-based method is slightly superior to the multilinear approach in terms of the degree of view-invariance and the reconstruction quality. On the other hand, the multilinear-based approach greatly outperforms the MGP-based method in movement recognition, due to the inherent discriminative power of the multilinear analysis framework.

In our future work, we will further our research on a number of fronts. First of all, we will investigate the potential of integrating the multilinear and MGP-based approaches. For example, the core tensor obtained using multilinear analysis can be used as the mean tensor in the MGP approach. In doing so, we can maintain an anchoring structure for the learning of the latent points X while preserving the representation flexibility provided by the probabilistic framework. When nonlinear kernel functions are used in the MGP approach, the kernelized multilinear analysis method [54] will then be used to provide the mean core tensor for MGP training. Fully exploring these issues will be one task of our future research.

Another important area of our future research is to include other contributing factors in pose feature extraction. People in different gender, age, and weight groups tend to have different body shapes. In addition, loose clothing can also introduce extra variations in body shape. Pose features extracted using the above multilinear analysis of the 3-mode (voxel, pose, and orientation) pose tensor are not invariant to body shape. In other words, the features extracted from the same pose performed by subjects with different body shapes (e.g., overweight versus slim) can be different. This is a critical problem for practical action recognition systems. To address this challenge, in our future research we will introduce an additional mode in the pose tensor to reflect changes in body shape, so that the resulting pose descriptor will be both view-invariant and body-shape-invariant. In this case, we need to establish a training set of visual hull data of the key poses in different body shapes. To this end, we will both collect real data and generate synthetic data using animation software.

In our future research, we will also explore kernelized multilinear analysis [54] for extracting invariant pose descriptors and compare its performance against the methods discussed in this paper. We will also examine the application of the invariant pose features in video-based articulated movement tracking. The basic idea is that, once the view-invariant descriptors are extracted, a bidirectional mapping between the invariant descriptors and the body kinematics (e.g., joint angles) can be established in a shared-GPLVM framework [75]. The backward mapping from descriptors to kinematics can provide initialization in tracking and recover the tracker from tracking failures. During tracking, the forward mapping from kinematics to descriptors can be used to evaluate the likelihood in a Kalman or particle filter tracking framework. In addition, in a particle filtering framework, given input visual hull data, samples can be drawn around the backward mapping results to further explore the local kinematic space.

Acknowledgments

The authors are grateful to the anonymous referees for their insightful comments on this paper.
This work was supported in part by the US National Science Foundation (NSF) Grants RI-0403428 and DGE-05-04647. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the US NSF.
References [1] C. Cruz-Neira, D.J. Sandin, T.A. DeFanti, R.V. Kenyon, J.C. Hart, The CAVE: audio visual experience automatic virtual environment, Communications of the ACM 35 (1992) 64–72. [2] T.E. Starner, Visual recognition of american sign language using hidden markov models, in: Media Arts and Sciences, vol. Master: MIT, 1995, pp. 52. [3] C. Sul, K. Lee, K. Wohn, Virtual stage: a location-based Karaoke system, IEEE Multimedia 05 (1998) 42–52. [4] C. Keskin, K. Balci, O. Aran, B. Sankur, L. Akarun, A multimodal 3D healthcare communication system, in: Proceedings of the 3DTV Conference, 2007, pp. 1–4. [5] S.C.W. Ong, S. Ranganath, Automatic sign language analysis: a survey and the future beyond lexical meaning, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 873–891. [6] T. Starner, J. Weaver, A. Pentland, Real-time American sign language recognition using desk and wearable computer based video, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 1371– 1375. [7] J.F. Lichtenauer, E.A. Hendriks, M.J. Reinders, Sign language recognition by combining statistical DTW and independent classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2008) 2040–2046. [8] H.-D. Yang, S. Sclaroff, S.-W. Lee, Sign language spotting with a threshold model based on conditional random fields, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 1264–1277. [9] D. Birchfield, T. Ciufo, G. Minyard, G. Qian, W. Savenye, H. Sundaram, H. Thornburg, C. Todd, SMALLab: a mediated platform for education, in: Proceedings of ACM SIGGRAPH, 2006. [10] H.S. Park, D.J. Jung, H.J. Kim, Vision-based game interface using human gesture, in Advances in Image and Video Technology. Berlin/Heidelberg: Springer, 2006, pp. 662–671. [11] S.-W. Lee, Automatic gesture recognition for intelligent human–robot interaction, Proceedings of the FGR (2006) 645–650. [12] A. Camurri, B. Mazzarino, M. Ricchetti, R. Timmers, G. Volpe, Multimodal analysis of expressive gesture in music and dance performances, in: G.V.A. Camurri (Ed.), Gesture-based Communication in Human–Computer Interaction, Springer-Verlag, 2004. [13] A. Camurri, S. Hashimoto, M. Ricchetti, A. Ricci, K. Suzuki, R. Trocca, G. Volpe, EyesWeb: towards gesture and affect recognition in dance/music interactive systems, Computer Music Journal 24 (2000) 57–69. [14] A. Camurri, R. Trocca, Analysis of expressivity in movement and dance, in: presented at Colloquium on Musical Informatics, L’Aquila, Italy, 2000. [15] G. Qian, F. Guo, T. Ingalls, L. Olson, J. James, T. Rikakis, A Gesture-Driven Multimodal Interactive Dance System, in: Presented at IEEE International Conference on Multimedia and Expo, 2004. [16] Y. Wu, T.S. Huang, Vision-based gesture recognition: a review, in: Lecture Notes in Artificial Intelligence 1739, Gesture-Based Communication in Human–Computer Interaction, (International Gesture Workshop, GW’99), 1999. [17] Y. Cui, D. L. Swets, J. Weng, Learning-based hand sign recognition using SHOSLIF-M, in: Proceedings of the IEEE International Conference on Computer Vision, 1995, pp. 631–636. [18] Y. Wu and T. Huang, View-independent recognition of hand postures, In: Proceedings of the IEEE International Conference On Computer Vision And Pattern Recognition, 2000, pp. 88–94. [19] A. Imai, N. Shimada, Y. Shirai, 3-D hand posture recognition by training contour variation, AFGR (2004) 895–900. [20] M. Singh, M. Mandal, A. 
Basu, Pose recognition using the Radon transform, Circuits and Systems, 2005. 48th Midwest Symposium on, vol. 2, 2005, pp. 1091–1094. [21] I. Haritaoglu, D. Harwood, L. Davis, Ghost: A human body part labeling system using silhouettes, in: Proceedings of the International Conference on Pattern Recognition, 1998. [22] G.R. Bradski, J.W. Davis, Motion segmentation and pose recognition with motion history gradients, Machine Vision and Applications 13 (2002) 174– 184. [23] I. Cohen, H. Li, Inference of human postures by classification of 3D human body shape, in: Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures, 2003, p. 74. [24] H. Francke, J. Ruiz-del-Solar, a. R. Verschae, Real-time hand gesture detection and recognition using boosted classifiers and active learning, in: Advances in Image and Video Technology, Springer, Berlin/Heidelberg, 2007, pp. 533–547. [25] G. Ye, J.J. Corso, D. Burschka, G.D. Hager, VICs: a modular HCI framework using spatiotemporal dynamics, Machine Vision and Applications 16 (2004) 13–20. [26] G. Ye, J.J. Corso, G.D. Hager, Gesture recognition using 3D appearance and motion features, in: Proceedings of the CVPR Workshops, 2004, pp. 160–166. [27] M.B. Holte, T.B. Moeslund, View invariant gesture recognition using 3D motion primitives, in: Proceedings ICASSP, 2008, pp. 797–800. [28] T. Kirishima, K. Sato, K. Chihara, Real-time gesture recognition by learning and selective control of visual interest points, Pattern Analysis and Machine Intelligence 27 (2005) 351–364. [29] C. Lee, Y. Xu, Online, Interactive learning of gestures for human/robot interfaces, in: 1996 IEEE International Conference on Robotics and Automation, vol. 4, 1996, pp. 2982–2987.
[30] D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding 104 (2006) 249–257. [31] D. Weinland, E. Boyer, R. Ronfard, Action Recognition from Arbitrary Views using 3D Exemplars, in: Proceedings of the IEEE International Conference on Computer Vision, 2007, pp. 1–7. [32] F. Lv, R. Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8. [33] S. Mitra, T. Acharya, Gesture recognition: a survey, systems, man, and cybernetics, IEEE Transactions on Part C: Applications and Reviews 37 (2007) 311–324. [34] A.F. Bobick, Y.A. Ivanov, Action recognition using probabilistic parsing, in: Presented at Computer Vision and Pattern Recognition, 1998. Proceedings. 1998 IEEE Computer Society Conference on, 1998. [35] A. Yilmaz, Recognizing human actions in videos acquired by uncalibrated moving cameras, in: Proceedings of IEEE International Conference on Computer Vision, 2005, pp. 150–157. [36] Y. Shen, H. Foroosh, View-invariant action recognition from point triplets, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 1898– 1905. [37] V. Parameswaran, R. Chellappa, View invariance for human action recognition, International Journal of Computer Vision 66 (2006) 83–101. [38] L. Gorelick, M. Blank, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 2247–2253. [39] A.F. Bobick, J.W. Davis, The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (2001) 257–267. [40] B. Peng, G. Qian, S. Rajko, View-invariant full-body gesture recognition via multilinear analysis of voxel data, in: Proceedings of International Conference on Distributed Smart Cameras, 2009, pp. 1–8. [41] S. Eickeler, A. Kosmala, G. Rigoll, Hidden Markov model based continuous online gesture recognition, in: Proceedings of the International Conference on Pattern Recognition, 1998, pp. 1206–1208. [42] J. Alon, V. Athitsos, Q. Yuan, S. Sclaroff, A unified framework for gesture recognition and spatiotemporal gesture segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 1685–1699. [43] Y.X. Zhu, G.Y. Xu, D.J. Kriegman, A real-time approach to the spotting, representation, and recognition of hand gestures for human–computer interaction, Computer Vision and Image Understanding (2002) 189–208. [44] H.-K. Lee, J.H. Kim, An HMM-based threshold model approach for gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (1999) 961–973. [45] H.-D. Yang, A.-Y. Park, S.-W. Lee, Gesture spotting and recognition for human– robot interaction, IEEE Transactions on Robotics 23 (2007) 256–270. [46] D. Birchfield, T. Ciufo, G. Minyard, SMALLab: a mediated platform for education, in: Proceedings of the 33rd International Conference and Exhibition on Computer Graphics and Interactive Techniques in Conjunction with SIGGRAPH, 2006. [47] C. Chu, I. Cohen, Pose and Gesture Recognition using 3D Body Shapes Decomposition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 69–78. [48] M. Pierobon, M. Marcon, A. Sarti, S. 
Tubaro, Clustering of human actions using invariant body shape descriptor and dynamic time warping, in: IEEE Conference on Advanced Video and Signal Based Surveillance, 2005, pp. 22–27. [49] M.A.O. Vasilescu, D. Terzopoulos, Multilinear analysis of image ensembles: TensorFaces, ECCV (1) (2002) 447–460. [50] D. Vlasic, M. Brand, H. Pfister, J. Popovi, Face transfer with multilinear models, in: ACM SIGGRAPH 2005 Papers, ACM Press, Los Angeles, California, 2005. [51] M.A.O. Vasilescu, D. Terzopoulos, Tensortextures: multilinear image-based rendering, ACM Transactions on Graphics 23 (2004) 334–340. [52] C.-S. Lee, A. Elgammal, Modeling view and posture manifolds for tracking, in: Presented at Proceedings of International Conference on Computer Vision, Rio de Janeiro, Brazil, 2007. [53] A. Elgammal, C.-S. Lee, Separating style and content on a nonlinear manifold, Presented at Proceedings of The IEEE International Conference on Computer Vision and Pattern Recognition, 2004. [54] Y. Li, Y. Du, X. Lin, Kernel-based multifactor analysis for image synthesis and recognition, in: Presented at Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, 2005. [55] J.M. Wang, D.J. Fleet, A. Hertzmann, Multifactor Gaussian process models for style-content separation, in Proceedings of the 24th International Conference on MACHINE LEARNING, ACM Press, Corvalis, Oregon, 2007. [56] B. Peng, G. Qian, S. Rajko, A view-invariant video based full body gesture recognition system, in: Presented at International Conference on Pattern Recognition, Tampa, FL, 2008. [57] B. Peng, G. Qian, Binocular dance pose recognition and body orientation estimation via multilinear analysis, in: S. Aja-Fernández, R. de Luis García, D. Tao, X. Li (Eds.), Tensors in Image Processing and Computer Vision, Springer-Verlag, 2009. [58] B. Peng, G. Qian, S. Rajko, View-invariant full-body gesture recognition via multilinear analysis of voxel data, in: Presented at ACM/IEEE International Conference on Distributed Smart Cameras Como, Italy, 2009. [59] B. Peng, G. Qian, Y. Ma, Recognizing body poses using multilinear analysis and semi-supervised learning, Pattern Recognition Letters 30 (14) (2009) 1289– 1294.
[60] B. Peng, G. Qian, Binocular dance pose recognition and body orientation estimation via multilinear analysis, in: Proceedings of the Workshop on Tensors in Image Processing and Computer Vision in Conjunction with the IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8. [61] B. Peng, G. Qian, S. Rajko, View-Invariant Full-Body Gesture Recognition from Video, in: Proceedings of the International Conference on Pattern Recognition, 2008, pp. 1–5. [62] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995. [63] J. Tu, Y. Fu, T.S. Huang, Locating nose-tips and estimating head poses in images by tensorposes, IEEE Transactions on Circuits and Systems for Video Technology 19 (2009) 90–102. [64] J. Tu, T. Huang, Locating nosetips and estimating head pose in images by tensorposes, in: Proceedings of IEEE International Conference on Image Processing IV, 2007, pp. 513–516. [65] C.E. Rasmussen, Gaussian Processes for Machine Learning, MIT Press, 2006. [66] N. Lawrence, Probabilistic non-linear principal component analysis with gaussian process latent variable models, Journal of Machine Learning Research 6 (2005) 1783–1816.
[67] P. Kroonenberg, J. de Leeuw, Principal component analysis of three-mode data by means of alternating least squares algorithms, Psychometrika 45 (1980) 69–97. [68] J. ten Berge, J. de Leeuw, P. Kroonenberg, Some additional results on principal components analysis of three-mode data by means of alternating least squares algorithms, Psychometrika 52 (1987) 183–191. [69] R. Sands, F. Young, Component models for three-way data: an alternating least squares algorithm with optimal scaling features, Psychometrika 45 (1980) 39–67. [70] M.A.O. Vasilescu, D. Terzopoulos, Multilinear subspace analysis of image ensembles, Computer Vision and Pattern Recognition (2) (2003) 93–99. [71] J. Nocedal, S.J. Wright, Numerical Optimization, Springer-Verlag, 1999. [72] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998. [73] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. [74] C.-W. Hsu, C.-C. Chang, C.-J. Lin, A Practical Guide to Support Vector Classification, Technical Report, Department of Computer Science, National Taiwan University, 2003. [75] C.H. Ek, J. Rihan, P.H.S. Torr, G. Rogez, N.D. Lawrence, Ambiguity modeling in latent spaces, Machine Learning for Multimodal Interaction (2008) 62–73.