Discriminative fusion of shape and appearance features for human pose estimation


Pattern Recognition 46 (2013) 3223–3237


S. Sedai (a), M. Bennamoun (b), D.Q. Huynh (b)

(a) IBM Research Australia, 204 Lygon Street, Carlton, 3053 VIC, Australia
(b) School of Computer Science and Software Engineering, The University of Western Australia, Crawley, WA, Australia

Article history: Received 22 October 2012; Received in revised form 8 April 2013; Accepted 21 May 2013; Available online 6 June 2013

Abstract

This paper presents a method for combining the shape and appearance feature types in a discriminative learning framework for human pose estimation. We first present a new appearance descriptor that is distinctive and resilient to noise for 3D human pose estimation. We then combine the proposed appearance descriptor with a shape descriptor computed from the silhouette of the human subject using discriminative learning. Our method, which we refer to as a localized decision level fusion technique, is based on clustering the output pose space into several partitions and learning a decision level fusion model for the shape and appearance descriptors in each region. The combined shape and appearance descriptor allows the complementary information of the individual feature types to be exploited, leading to improved performance of the pose estimation system. We evaluate our proposed fusion method against feature level fusion and kernel level fusion methods using a synchronized video and 3D motion dataset. Our experimental results show that the proposed feature combination method gives more accurate pose estimation than that obtained from each individual feature type. Among the three fusion methods, our localized decision level fusion method is demonstrated to perform the best for 3D pose estimation.

Keywords: 3D human pose estimation; discriminative fusion; mixture of regressors; appearance descriptors; shape descriptors

1. Introduction

Image-based human pose estimation systems have a wide range of applications, including content-based image retrieval, character animation, human computer interaction (HCI) and visual surveillance. For example, in video-based smart surveillance systems, 3D human poses can be used to infer the action of the human subject in a scene. The estimation of 3D human poses from monocular (single-view) images is challenging because it involves not only the estimation of a large number of parameters but also dealing with artifacts such as clutter, occlusions, viewpoint changes and pose ambiguities. In order to accurately estimate the 3D pose from a single image, relevant and informative features should first be extracted from the images. Shape and appearance are the commonly used features for pose estimation. Shape-based features such as silhouettes are insensitive to background variations. However, one silhouette can be associated with more than one pose, resulting in ambiguities. Although such ambiguities can be resolved using temporal constancy [36], such temporal information is not always available, e.g., in the case of pose estimation from a single image.


Appearance-based features, e.g., image textures, are more discriminative than shape features and do not require background subtraction once the person is detected in an image. However, appearance-based features are affected by background clutter and variations in the clothing of the human subject. This can make appearance features unstable. While neither shape nor appearance features are self-sufficient for a robust estimation of human poses, they have the potential to complement each other because one may not be sensitive to conditions that affect the other. This paper demonstrates that such discriminative features can be combined to improve the performance of 3D pose estimation. Approaches, such as [9,21], combine the scores of the shape and appearance features to obtain a fused score for each pose hypothesis in a generative pose estimation framework. Generative approaches can elegantly combine multiple feature types but they require image likelihoods of the hypothesized poses to be estimated. In a discriminative pose estimation approach, on the other hand, the poses are estimated by learning a mapping between the image features and poses. A similar approach has been adopted in the biometric area where binary decisions from multiple feature types, such as faces and fingerprints, are combined [28]. To the best of our knowledge, no similar work has been done in the area of 3D human pose estimation. In this paper, our main contribution is the proposal and evaluation of a framework in which appearance descriptors and shape descriptors are combined using discriminative learning for


human pose estimation. Our approach is based on dividing the output pose space into clusters and using a local fusion strategy that is optimal for each cluster. The local fusion strategy involves learning a discriminative model for each feature type and combining the outputs of the discriminative models in the local pose space. Our method, which we name the localized decision level fusion technique, is trained to maximize the performance of the local fusion. This concept is intuitive because some parts of the data space may prefer the shape feature over the appearance feature for pose estimation, whereas other parts may prefer the opposite. To demonstrate the effectiveness of our method, we provide a comparative evaluation of our proposed fusion method against two other fusion methods, namely feature level fusion and kernel level fusion. The second contribution of this paper is the development of a novel appearance descriptor that is made discriminative and more resilient to noise by dimensionality reduction and histogramming of local features. Our appearance descriptor is called the histogram of local appearance context (HLAC) [29]. This appearance descriptor is combined with the histogram of shape context descriptor [2] using our proposed feature combination framework. The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the feature descriptors for pose estimation. Section 4 presents how pose estimation can be accomplished from a single feature type. Section 5 presents the proposed feature combination framework for human pose estimation. Section 6 discusses the experimental results. Section 7 provides our conclusion and future directions.

2. Related work

Estimation methods for predicting the human pose can be broadly classified into two categories: generative and discriminative inference methods. In generative inference, the pose that best explains the observed image features is determined from a population of hypothesized poses. A search algorithm is used to search the prior pose space for the pose that exhibits the maximum likelihood value for the observed image feature [37,9]. A generative method therefore consists of three basic components: a state prior, an observation likelihood model and a search algorithm. Successful search algorithms include annealing [9], Markov chain Monte-Carlo [44,21], covariance scaled sampling [37], dynamic programming [26,11] and linear programming [15]. In order to compute the likelihood of a pose hypothesis, image features such as silhouettes, edges [9,37] and color [26,11,15] have been used. Recently, it has been shown that optimizing the human pose in the context of the locations of surrounding objects can improve the accuracy [35]. In order to reduce the search space, pose priors based on a predefined set of activities such as walking and jogging have been used [33]. Methods such as [42,43,20] learn the pose prior in a lower dimensional subspace using non-linear dimensionality reduction techniques; pose tracking is then performed in that lower dimensional subspace. A pose prior based on a physical model that constrains the human body to physically plausible configurations has also been used to reduce the search space [7]. In discriminative estimation, a direct mapping between the image features and the pose parameters is learned from a training dataset. Examples of discriminative learning include nearest neighbor based regression [32], sparse regression [2] and boosting regression [4]. Often the relationship between the image features and the pose space is multimodal. This multimodal relationship is established by learning one-to-many mapping functions that are defined in terms of a mixture of regressors [1,16,13]. This setting produces a multiplicity of pose solutions that can be

ranked using a gating function [6], verified using an observation likelihood [27], or disambiguated using temporal constancy [36]. Other approaches use metric learning [25] and dimensionality reduction [29] to suppress irrelevant feature components so as to improve the accuracy of pose estimation. It has also been shown that pose estimation performance can be improved by taking into account the dependencies between the output dimensions [5]. Other approaches use dimensionality reduction to address the correlation between the output dimensions [24]. The combination of generative and discriminative methods has also gained more attention recently. To combine these two methods, the observation likelihood obtained from a generative model is used to verify the (often multimodal) pose hypotheses obtained from discriminative mapping functions for pose estimation [27] and for pose tracking [31]. Our work falls within the discriminative estimation category. In particular, we focus on the fusion of shape and appearance feature descriptors for 3D human pose estimation. The combination of different features in discriminative estimation has been performed in the area of classifier fusion, where strategies such as majority voting and the average, minimum and maximum combination operators are commonly used [18]. Such fusion strategies are designed to combine the binary outputs of classifiers trained on different feature sets and are suitable for applications such as object detection and recognition [28]. Recently, methods that combine features by assembling the kernels of individual feature types have been introduced, e.g., using multiple kernel learning [39] and kernel basis selection [12]. However, they only work for kernel regression and are commonly used to combine kernels of the same feature type but with different kernel parameters. To the best of our knowledge, the combination of shape and appearance features by discriminative learning has not been explored in the context of 3D human pose estimation. In this paper, we fill this gap and propose a discriminative learning based feature combination framework for human pose estimation that works for both linear and kernel regression. We also propose a novel appearance descriptor that is distinctive and resilient to noise.

3. Feature representation

Given an image frame containing a human subject, we first extract the silhouette of the human subject using background subtraction. For better accuracy, we use the technique that models the background as a non-parametric kernel density, as described in [10]. This technique is known to be robust in suppressing the effect of shadows and supports dynamic updates of the background model. Our shape descriptor is obtained by encoding the silhouette boundary using the histogram of shape context descriptor [2]. To retrieve the cropped image window for our proposed appearance descriptor, the bounding rectangle that encloses the silhouette is cropped from the input image and rescaled to a fixed size of 128 × 64 pixels. This window size is the most commonly used size for human subjects in an upright pose. The scaling of the image to a fixed size can change the aspect ratio of the human body, for example when the person stretches out the arms or legs. However, if such images are included in the training set, then the proposed discriminative fusion model should be able to generalize and cater for images with different aspect ratios of the person's body, because the performance of the discriminative model (being a machine learning approach) is highly dependent on the data used for training. We empirically found that padding five pixels around the edges of the bounding rectangle prior to cropping is sufficient to reduce the segmentation error. Our appearance descriptor is then computed inside the cropped window.
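
For concreteness, a minimal sketch of this windowing step is given below, assuming an OpenCV/NumPy environment and a binary silhouette mask; the function name, the default five-pixel padding and the bilinear resizing are illustrative choices, not the authors' implementation.

```python
import cv2
import numpy as np

def crop_person_window(image, silhouette_mask, pad=5, out_size=(64, 128)):
    """Crop the padded bounding box of a binary silhouette and rescale it.

    `out_size` is (width, height); 64 x 128 corresponds to the fixed
    128 x 64 (height x width) window used in Section 3.
    """
    ys, xs = np.nonzero(silhouette_mask)
    if len(xs) == 0:
        raise ValueError("empty silhouette mask")
    h, w = silhouette_mask.shape[:2]
    # Pad the bounding rectangle by a few pixels to absorb segmentation error.
    x0, x1 = max(xs.min() - pad, 0), min(xs.max() + pad + 1, w)
    y0, y1 = max(ys.min() - pad, 0), min(ys.max() + pad + 1, h)
    window = image[y0:y1, x0:x1]
    return cv2.resize(window, out_size, interpolation=cv2.INTER_LINEAR)
```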


3.1. Appearance descriptors

Given a cropped image window of a human subject, our appearance descriptor is computed in two steps. First, descriptors which we refer to as local appearance context (LAC) descriptors are computed at regularly sampled locations in the image window. In the subsequent step, a histogram of these local descriptors is constructed, resulting in the HLAC image-descriptor, as illustrated in Fig. 1.

3.1.1. Local appearance context descriptor

Our local appearance context (LAC) descriptor, centered at a given point of a local image region inside the cropped image window, is computed in two steps. First, the local image region is partitioned into log-polar spatial blocks designated by n_r concentric squares, resulting in 4n_r spatial blocks as shown in Fig. 1. Our approximation of the log-polar partitioning leads to four square blocks around the center and 4(n_r − 1) L-shaped spatial blocks wedged between the remaining concentric squares. The concept of log-polar partitioning of a local image region has been used in the literature [3,23,14] to make the descriptor more sensitive to positions that are closer to the center than to those that are farther away. Our approximation of log-polar partitioning using concentric squares allows us to efficiently compute features using integral images [29]. As described in our previous work [29], only the pixel values of the integral images at the corners of the squares are needed to compute the descriptor. Next, we compute the histogram of oriented gradients in each of the 4n_r blocks. The histogram of oriented gradients is computed by accumulating the magnitude of the gradient in each orientation bin


as done in [8]. In the approach of [8], the orientation histograms are computed in a set of spatially connected square blocks and the resulting feature is used to detect people in images. Our method, on the other hand, computes the orientation histograms in log-polar spatial blocks in order to incorporate local contextual information for 3D human pose estimation. For our pose estimation problem, we take n_r = 4, where the widths of the squares are taken in log-scale to be 12, 20, 30 and 48 pixels. The width of the innermost square is chosen such that it covers the limbs. The width of the outermost square is chosen to cover the width of the torso. The numbers of orientation bins for the blocks, from the innermost to the outermost concentric square ring, are taken respectively as 16, 12, 9 and 6, where the orientations of the gradients of pixels take values in the range [0°, 360°]. The reason for having more orientation bins in the innermost concentric square is to make the descriptor more informative. These chosen numbers of orientation bins are sufficient to encode the gradient orientations while keeping the dimension of the local descriptor to a manageable size. The orientation bins of all spatial blocks are concatenated to form a single LAC vector of (16 + 12 + 9 + 6) × 4 = 172 dimensions. The LAC vector is then normalized by the square root of the sum of the squared components of the vector. The next step of the process is to apply principal component analysis (PCA) to reduce the dimension of the LAC vectors. First, the PCA is performed on the covariance matrix of the LAC vectors computed on 5000 image windows containing human subjects to obtain the principal feature space. We empirically found the optimal number of PCA dimensions that gives the maximum pose estimation performance to be 25 (as detailed in Section 3.1.4 and shown in Fig. 2). Consequently, the feature vector is projected onto the principal feature space and the 25 most significant components, corresponding to the 25 eigenvectors with the largest eigenvalues, are retained as our final LAC descriptor. The LAC descriptors computed this way have been found to be more robust to noise in the image. Such a process of dimensionality reduction has also been used in other local descriptors reported in the literature, such as the gradient location and orientation histogram (GLOH) [23], PCA-SIFT [17] and PCA-HOG [22].

Fig. 1. A block diagram showing how to compute a HLAC descriptor. The LAC is an intermediate local descriptor and the HLAC is the final image-descriptor that encodes appearance information within an image window of a human subject.

Fig. 2. Comparison of the performance of the LAC and PCA-HOG descriptors for different dimensions. The abscissa of the chart shows the reduced dimensions and the ordinate shows the pose estimation error for the corresponding HLAC and HPHOG descriptors.

3.1.2. Distinction between LAC and GLOH

From the description above, the LAC and GLOH descriptors [23] may seem similar. However, they differ in two major


respects. First, the spatial locations in our LAC descriptor are designated by concentric squares, as opposed to the concentric circles adopted for the GLOH feature. By doing so, the LAC descriptor can be efficiently computed using the integral image (see [29] for details). Second, as opposed to the GLOH descriptor, where the number of orientation bins is the same for all spatial blocks, the LAC descriptor uses fewer orientation bins for spatial blocks away from the center, making the LAC local descriptor more contextual.
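
The following sketch illustrates the integral-image idea behind the LAC descriptor under simplifying assumptions: each concentric-square ring is histogrammed as a single block (the paper splits every ring into four blocks, giving the 172-dimensional vector), and the subsequent PCA projection to 25 dimensions is omitted. All names and parameter defaults are illustrative.

```python
import numpy as np

def orientation_integral_images(gray, n_bins):
    """Integral images of gradient magnitude, one per orientation bin over [0, 360)."""
    gy, gx = np.gradient(gray.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0
    bins = np.minimum((ang / 360.0 * n_bins).astype(int), n_bins - 1)
    integrals = []
    for b in range(n_bins):
        layer = np.where(bins == b, mag, 0.0)
        # Zero-padded integral image: any axis-aligned box sum needs only 4 lookups.
        ii = np.zeros((layer.shape[0] + 1, layer.shape[1] + 1))
        ii[1:, 1:] = layer.cumsum(0).cumsum(1)
        integrals.append(ii)
    return integrals

def box_histogram(integrals, y0, x0, y1, x1):
    """Orientation histogram of the box [y0:y1, x0:x1) from the integral images."""
    return np.array([ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0] for ii in integrals])

def lac_descriptor(gray, cy, cx, widths=(12, 20, 30, 48), bins=(16, 12, 9, 6)):
    """Concentric-square approximation of a log-polar descriptor centred at (cy, cx)."""
    feats = []
    prev = None
    for w, nb in zip(widths, bins):
        integrals = orientation_integral_images(gray, nb)
        y0, y1 = max(cy - w // 2, 0), min(cy + w // 2, gray.shape[0])
        x0, x1 = max(cx - w // 2, 0), min(cx + w // 2, gray.shape[1])
        h = box_histogram(integrals, y0, x0, y1, x1)
        if prev is not None:
            # Subtract the inner square so only the ring between squares contributes.
            h = h - box_histogram(integrals, *prev)
        feats.append(h)
        prev = (y0, x0, y1, x1)
    v = np.concatenate(feats)
    return v / (np.sqrt(np.sum(v ** 2)) + 1e-12)  # square-root-of-sum-of-squares normalization
```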

3.1.3. Histogram of local appearance context (HLAC)

As the name of the descriptor suggests, the HLAC descriptor encodes the distribution of the LAC descriptors computed at various locations of the cropped window that contains the human subject. The HLAC descriptor is constructed in two steps via vector quantization. First, a LAC feature dictionary is obtained once and for all by k-means clustering of the combined set of LAC descriptors of all the training images. The centroids of the M clusters together constitute the feature dictionary. We take the number of cluster centroids to be M = 200, as no obvious improvement in performance was found for M > 200. To construct the HLAC descriptor of a cropped image window, the LAC descriptors are computed at sparsely sampled points, both horizontally and vertically, within the cropped image window. Each LAC descriptor is then allowed to soft vote on each bin of the LAC feature dictionary. Soft voting can be implemented by placing a Gaussian window around the descriptor vector and computing the probability of each LAC vector falling into each bin. The votes for each bin are accumulated and then normalized to yield the M-dimensional HLAC descriptor of the cropped image window. The process of constructing the HLAC descriptor is illustrated in Fig. 1.
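
A minimal sketch of the dictionary construction and soft voting might look as follows; the scikit-learn k-means call and the Gaussian bandwidth `sigma` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_dictionary(all_lac_vectors, n_words=200, seed=0):
    """k-means codebook over LAC descriptors pooled from the training images."""
    km = KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(all_lac_vectors)
    return km.cluster_centers_

def hlac_descriptor(lac_vectors, dictionary, sigma=1.0):
    """Soft-voted histogram of LAC descriptors over the codebook (one image window)."""
    # Squared distances between each local descriptor and each dictionary entry.
    d2 = ((lac_vectors[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    votes = np.exp(-d2 / (2.0 * sigma ** 2))            # Gaussian soft assignment
    votes /= votes.sum(axis=1, keepdims=True) + 1e-12   # each descriptor votes with total weight 1
    hist = votes.sum(axis=0)
    return hist / (hist.sum() + 1e-12)                  # normalized M-dimensional HLAC descriptor
```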

3.1.4. Comparison with the histogram of PCA-HOG (HPHOG)

To demonstrate the effectiveness of the HLAC descriptor in the context of 3D human pose estimation, we conducted experiments to compare it with the histogram of PCA-HOG descriptor. The PCA-HOG descriptor is a truncated version of the HOG descriptor. It was first used in [22] for person tracking and action recognition. The construction of this descriptor in our experiment is summarized below. First, the HOG descriptor is computed inside a 4 × 4 grid of a local image region. Each cell of the grid is taken to be 8 × 8 pixels. The number of orientation bins is taken to be 11, resulting in a 176-dimensional local HOG descriptor. The PCA-HOG descriptor is obtained by reducing the dimension of the HOG descriptor using PCA, where the optimal dimension of the PCA-HOG descriptor is estimated to be 25 from our evaluation (see Fig. 2). The histogram of PCA-HOG descriptors (HPHOG) of an image is obtained by voting each of the PCA-HOG descriptors computed on the local image regions into each bin of the PCA-HOG feature dictionary. As before, we take the size of the PCA-HOG feature dictionary to be M = 200. It is evident from Fig. 2 that, for all the different dimensions examined for the LAC and PCA-HOG descriptors, the 3D pose estimation errors from the corresponding HLAC descriptors are consistently lower than those from the HPHOG descriptors. The error plot in this figure was generated using a sparse linear regression method (see Section 4) for the walking sequences from camera C2 (our validation set) of the HumanEva-I dataset. These experiments show that the LAC descriptor is more discriminative than the PCA-HOG descriptor. An explanation for this outcome is that the LAC descriptor is computed over log-polar spatial locations, whereas the PCA-HOG descriptor is computed on square grid locations.

3.2. Shape descriptor

We use the histogram of shape context (HoSC) descriptor [2] as the shape descriptor for encoding the silhouette of the human subject. The HoSC descriptor encodes the silhouette contour as a distribution of shape context (SC) descriptors. The SC descriptor [3] describes the shape around a point by histogramming the neighboring contour points in different spatial bins defined in a relative log-polar coordinate system. Following the same parameter settings as in [2], we use the SC descriptor characterized by 12 angular bins and 5 radial bins. The dimension of the SC descriptor is thus equal to 60. The SC descriptors are computed at sparsely sampled points along the boundary of the silhouette. The HoSC descriptor is constructed by vector quantization of these SC descriptors using the same procedure as described in Section 3.1.3. As before, the dimension of the HoSC descriptor is set to M = 200, which is the number of cluster centroids (feature dictionary).
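
For reference, a compact sketch of a single shape context descriptor with 12 angular and 5 radial bins is shown below; the radial limits and the normalization by the mean pairwise distance are common conventions assumed here rather than settings stated in the paper.

```python
import numpy as np

def shape_context(points, index, n_theta=12, n_r=5, r_min=0.125, r_max=2.0):
    """Log-polar histogram of contour points relative to points[index].

    `points` is an (N, 2) array of silhouette boundary coordinates; radii are
    normalized by the mean pairwise distance, a common shape-context convention.
    """
    diff = np.delete(points, index, axis=0) - points[index]
    r = np.linalg.norm(diff, axis=1)
    theta = np.arctan2(diff[:, 1], diff[:, 0]) % (2 * np.pi)
    mean_dist = np.mean(np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1))
    r = r / (mean_dist + 1e-12)
    r_edges = np.logspace(np.log10(r_min), np.log10(r_max), n_r + 1)
    t_edges = np.linspace(0, 2 * np.pi, n_theta + 1)
    hist, _, _ = np.histogram2d(r, theta, bins=[r_edges, t_edges])
    return hist.ravel()  # 5 x 12 = 60-dimensional SC descriptor
```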

4. Pose estimation from a single feature type

Let the image feature vector be represented by x ∈ R^m and the pose vector by y ∈ R^d. The feature vector x can be a shape descriptor (HoSC) or an appearance descriptor (HLAC). The goal of discriminative learning is to obtain the mapping from the feature vector space to the pose vector space, given a training set T = {x^(i), y^(i)}_{i=1}^N. The mapping is commonly referred to as a regression function. Given a general basis function Φ(x): R^m → R^q, the regression function approximates the mapping via

y = W^T Φ(x) + ε,   (1)

where W = [w_1 ⋯ w_d] ∈ R^{q×d} is a parameter weight matrix mapping Φ(x) to the multidimensional output y, and ε ∈ R^d is an additive noise term assumed to follow a Gaussian distribution with zero mean and covariance Σ = diag(σ_1, …, σ_d). Each diagonal element of the covariance matrix denotes the variance of the error in the corresponding output component. For the linear regression case, the basis function Φ is the identity mapping, i.e., Φ(x) = x, so the dimension of the basis vector is q = m; for the kernel regression case, each component of the basis function is associated with a data point from the training set, i.e., Φ(x) = [Φ_1(x), …, Φ_N(x)]^T, so q = N. In this paper, we use a Gaussian kernel for each Φ_i(x), i.e.,

Φ_i(x) = exp(−c ∥x − x^(i)∥²),   (2)

with c being the kernel width, which is determined empirically, and i = 1, …, N. To keep the notation clear, we do not explicitly show the bias term of the feature vector; throughout this paper, we assume that the bias term is included by concatenating a 1 to each basis vector before regression. The components of the output vector y can be assumed to be independent of each other, so the weight vector w_j ∈ R^q that corresponds to y_j, the jth component of y, can be trained individually. We use the relevance vector machine (RVM) [40] to estimate the weight vector w_j using the training set {x^(i), y_j^(i)}_{i=1}^N, where y_j^(i) denotes the jth component of y^(i). The RVM is a sparse Bayesian learning technique in which the prior over the parameter vector is first defined as a zero-mean Gaussian with the covariance governed by an independent set of hyper-parameters. The posterior distribution of the parameter vector w_j, after observing the training data, is a Gaussian expressed in terms of the hyper-parameters. The point estimates of the hyper-parameters are found by maximizing the log of the marginal likelihood. We use the fast marginal likelihood maximization method based on forward basis selection [40] to find the value of each hyper-parameter. The weight vector w_j is then computed using the hyper-parameter estimates.


Since the RVM is based on Bayesian learning, it gives a regularized solution as the most general solution is automatically selected thus preventing over-fitting of the model. Moreover, in the RVM, the weights of the irrelevant basis components would be zero as a subset of non-zero weights is sufficient to define the model. This produces a sparse regression model which can generalize to unseen image observations. Furthermore, by selecting a small number of weights, we gain a significant computational advantage.
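
The regression of Eqs. (1) and (2) can be prototyped as follows; note that a ridge (regularized least-squares) solver is used here as a simple stand-in for the sparse RVM training described above, and the kernel width `c` and ridge strength are illustrative values.

```python
import numpy as np

def gaussian_basis(X, X_train, c):
    """Kernel basis of Eq. (2): Phi_i(x) = exp(-c * ||x - x_i||^2), plus a bias column."""
    d2 = ((X[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-c * d2)
    return np.hstack([Phi, np.ones((Phi.shape[0], 1))])  # append the bias term

def fit_regressor(X_train, Y_train, c=1e-3, ridge=1e-4):
    """Least-squares stand-in for the RVM fit of Eq. (1); returns W with y = W^T Phi(x)."""
    Phi = gaussian_basis(X_train, X_train, c)
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    W = np.linalg.solve(A, Phi.T @ Y_train)  # (q+1) x d weight matrix
    return W

def predict(W, X, X_train, c=1e-3):
    """Predict pose vectors for the rows of X."""
    return gaussian_basis(X, X_train, c) @ W
```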

5. Pose estimation from multiple feature types

Let x_1 ∈ R^{m_1} and x_2 ∈ R^{m_2} be the appearance and shape descriptor vectors extracted from an image. As before, let y ∈ R^d be the corresponding output pose. Given the training examples T = {x_1^(i), x_2^(i), y^(i)}_{i=1}^N, our goal is to find a mapping from the joint shape and appearance feature space (x_1, x_2) to the pose space y. Our proposed fusion framework consists of a two-tier hierarchy. At the first tier, the 3D pose space y is partitioned into K clusters using the k-means algorithm. At the second tier, an optimal fusion model of the two feature types is trained for each partition. The motivation behind the first-level partition is intuitive, since different parts of the data space may have preferences for different fusion strategies. For example, in some regions the shape feature may get more weight than its appearance counterpart. Our approach performs localized fusion of more than one feature type in each region. The relation between the two feature types x_1 and x_2 and the 3D pose y is approximated by the following fusion model:

p(y | x_1, x_2) = Σ_{k=1}^{K} g_k(χ) p_k(y | x_1, x_2),   (3)

where χ = [x_1^T x_2^T]^T ∈ R^{m_1+m_2} is the joint feature vector, K is the number of fusion models and p_k(y | x_1, x_2) is the kth fusion model, which combines the two feature types to predict the real-valued multidimensional output in the local pose space. The gating function g(χ) = [g_1(χ), …, g_K(χ)] provides the weights for the K local fusion models. The gating function is sensitive to χ and is modeled using a K-class classifier given by g_k(χ) = exp(ν_k^T χ) / Σ_{i=1}^{K} exp(ν_i^T χ), where ν_k ∈ R^{m_1+m_2}, for k = 1, …, K, are parameter vectors. Using the fusion model described in Eq. (3), we obtain a locally optimal fusion of shape and appearance features. Splitting the data space into several subsets and training regressors in each subset has been attempted before, e.g., in the Bayesian committee machine [41]. However, that approach handles only a single feature type and predicts a single-dimensional output. In contrast, our approach can incorporate different input modalities and performs a localized, cluster-specific fusion of those modalities to predict a multidimensional human pose vector. Moreover, our localized fusion strategy includes a training phase to find the optimal way of combining the two feature types.
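
A small sketch of the gating computation in Eq. (3) is given below; the per-partition predictive means are assumed to come from the local fusion models described in Section 5.1, and the variable names are illustrative.

```python
import numpy as np

def gate_weights(chi, V):
    """Softmax gating g_k(chi) of Eq. (3); V is a K x (m1+m2) matrix stacking the nu_k vectors."""
    scores = V @ chi
    scores = scores - scores.max()         # numerical stability
    w = np.exp(scores)
    return w / w.sum()

def fused_mean(chi, V, local_means):
    """Mean of the mixture in Eq. (3), given the mean prediction of each local fusion model."""
    g = gate_weights(chi, V)               # (K,)
    return g @ np.asarray(local_means)     # weighted average of K local d-dimensional means
```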

5.1. Decision level fusion

Decision level fusion is a late fusion strategy where individual regressors are trained separately for each feature type and the output poses from the regressors are combined to obtain the final output pose. In our previous work [30], we adopted a linear regression technique for the fusion. In general, both linear regression and kernel regression techniques can be used in decision level fusion. In each of the clusters obtained from the division process described above, local fusion of more than one feature type is performed. The decision level fusion model for the kth partition corresponds to a mixture of local predictors and is given by

p_k(y | x_1, x_2) = Σ_{j=1}^{2} β_kj(χ) p_kj(y | x_j),   (4)

where β_kj is the combiner function operating on the joint feature space χ for the kth partition. In our case, the outputs from two regressors are combined, so the combiner can be modeled by a logistic function. Since β_k2 = 1 − β_k1, only the parameter of β_k1 needs to be computed. Writing β_k1 as β_k, the combiner function for the kth partition can be rewritten as β_k(χ) = 1 / (1 + exp(−λ_k^T χ)), where λ_k ∈ R^{m_1+m_2} is a parameter vector. Individual regressors for the kth partition can be trained for the feature types x_1 and x_2 and modeled as conditional Gaussian distributions

p_kj(y | x_j) = N(W_kj^T Φ(x_j), Σ_kj),  for j = 1, 2,   (5)

where W_kj ∈ R^{q×d} is a transformation matrix that maps Φ(x_j) to y and Σ_kj is a covariance matrix. As in pose estimation from a single feature type (Section 4), the basis function Φ is m_j-dimensional for linear regression and N-dimensional for kernel regression. When a kernel basis is used, we refer to the fusion method as kernelized decision level fusion (KDF); when a linear basis is used, we refer to it as linear decision level fusion (LDF). Given the feature vectors x_1 and x_2, the fusion in the kth partition is performed via

p_k(y | x_1, x_2) = β_k(χ) p_k1(y | x_1) + (1 − β_k(χ)) p_k2(y | x_2).

The outputs from each partition are further aggregated using the weights from the partition function g = [g_1, …, g_K] to obtain the final output distribution, as described by Eq. (3). Fig. 3 shows a block diagram for decision level fusion. The output pose space y is divided into K partitions. For each partition, a decision level fusion of the local appearance regressor, p_k1, and the local shape regressor, p_k2, is performed. Local fusion uses weights proportional to the confidence of the regressors in that partition: the confidence of the regressors p_k1 and p_k2 is given by β_k(χ) and 1 − β_k(χ), respectively. Finally, the fusion results from all the partitions are aggregated using the weights given by the partition function g_k(χ) to yield the final output pose. The final output pose given by Eq. (3) is thus expressed as a mixture of conditional Gaussian distributions.

Fig. 3. A block diagram showing an example of two-level decision fusion (LDF and KDF).
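
The prediction path of the decision level fusion (Eqs. (3)–(5)) can be sketched as follows; the basis functions, weight matrices and parameter vectors are placeholders for the quantities learned in Section 5.4, and only the mixture mean is computed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kdf_predict(x1, x2, gate_V, combiners, regressors):
    """Mean pose of the LDF/KDF mixture, illustrative only.

    combiners[k]  : lambda_k vector of the logistic combiner beta_k
    regressors[k] : ((W_k1, basis_1), (W_k2, basis_2)) for the appearance and
                    shape regressors of partition k; basis_j maps x_j to Phi(x_j).
    """
    chi = np.concatenate([x1, x2])
    scores = gate_V @ chi
    g = np.exp(scores - scores.max())
    g /= g.sum()
    fused = 0.0
    for k, (lam, ((W1, b1), (W2, b2))) in enumerate(zip(combiners, regressors)):
        beta = sigmoid(lam @ chi)                       # confidence of the appearance regressor
        mean_k = beta * (W1.T @ b1(x1)) + (1 - beta) * (W2.T @ b2(x2))
        fused = fused + g[k] * mean_k                   # aggregate with the gating weights
    return fused
```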


5.2. Feature level fusion

Feature level fusion is an early fusion strategy where the feature vectors x_1 and x_2 are concatenated to form a joint feature vector χ for training. The mapping from χ to the pose space for the kth partition can be expressed as y_k = W_k^T Φ(χ) + ε_k, where Φ(χ) ∈ R^q and W_k ∈ R^{q×d} is a parameter matrix. As before, for linear basis vectors we have q = m_1 + m_2; for kernel basis vectors we have q = N. When a kernel basis is used, the fusion strategy is referred to as kernelized feature level fusion (KFF); when a linear basis is used, it is referred to as linear feature level fusion (LFF). For LFF and KFF, the local fusion model of Eq. (3) can be written as a conditional Gaussian, i.e., p_k(y | x_1, x_2) = N(W_k^T Φ(χ), Σ_k), where Σ_k is the covariance matrix. Learning this fusion model involves finding the parameter matrix W_k using the training set T. As before, we use the RVM method described in Section 4 to train the parameter matrix.

5.3. Kernel level fusion

In kernel level fusion, the prediction is made using a combined kernel basis of the individual feature types. This technique is therefore an early fusion strategy where the fusion is performed in the kernel space. The local kernel level fusion for the kth pose partition is given by

y_k = W_k^T Φ(x_1, x_2) + ε_k,   (6)

where Φ(x_1, x_2) is the resultant kernel basis vector obtained by combining the kernel basis vectors of the individual feature types, i.e., Φ(x_1) and Φ(x_2). Similarly, for kernel level fusion, the local fusion model of Eq. (6) can be written as a conditional Gaussian, i.e., p_k(y | x_1, x_2) = N(W_k^T Φ(x_1, x_2), Σ_k), where Σ_k is the covariance matrix. There are essentially two ways that the kernels can be combined for pose estimation. The first is to use a basis selection method [12], which we call kernel basis fusion (KBF). The second is via multiple kernel learning [39], which we call multiple kernel learning based fusion (MKLF).

5.3.1. Kernel basis fusion (KBF)

In the kernel basis fusion approach of [12], the kernel bases of each feature type are concatenated to obtain the combined kernel bases. Let Φ(x_1) and Φ(x_2) be the kernel basis vectors corresponding to the two feature types, each of dimension N × 1. Then the combined kernel basis vector in Eq. (6) is given by Φ(x_1, x_2) = [Φ(x_1)^T, Φ(x_2)^T]^T and is of dimension 2N × 1. Intuitively, KBF can be thought of as feature level fusion in the kernel space. This allows the independent selection of the bases of each feature type during regression (Section 4). In the approach of [12], the relevant bases are selected by least angle regression, which imposes a regularization parameter to reduce the complexity of the model but still needs cross validation to find that complexity parameter. We therefore resort to RVM learning [40] to find the relevant bases and their corresponding weights in the model.

5.3.2. Multiple kernel learning fusion (MKLF)

The multiple kernel learning approach in [39] combines the features by constructing a composite kernel as a weighted linear combination of the kernels obtained from the individual feature types. Each component of the kernel basis vector of Eq. (6) is a composite kernel constructed as Φ_i(x_1, x_2) = η_1 Φ_i(x_1) + η_2 Φ_i(x_2) for i = 1, …, N, where the subscript on the kernel denotes the kernel basis corresponding to the ith data instance and the combining coefficients are constrained by ∥η∥ ≤ 1 with η = [η_1, η_2]^T. The learning process therefore optimizes not only the parameters of the kernel regressor W_k but also the vector η. MKLF is different from KBF (Section 5.3.1) because the combined kernel bases are obtained by a weighted linear combination of the kernels of the individual feature types. We use the fast method proposed by [39] to train the MKLF model in each partition of the pose space. The vector η can be different for each pose space partition since we optimize the parameters for each partition.
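
The two ways of building the combined kernel basis of Eq. (6) can be contrasted with the following sketch; in the paper the MKLF weights η are learned (via the SHOGUN MKL routine mentioned in Section 6.1), whereas here they are fixed for illustration.

```python
import numpy as np

def gaussian_kernel_basis(x, X_train, c):
    """N-dimensional kernel basis Phi(x) with respect to the training set (Eq. (2))."""
    return np.exp(-c * ((X_train - x) ** 2).sum(axis=1))

def kbf_basis(x1, x2, X1_train, X2_train, c1, c2):
    """KBF: concatenate the per-feature kernel bases (2N-dimensional)."""
    return np.concatenate([gaussian_kernel_basis(x1, X1_train, c1),
                           gaussian_kernel_basis(x2, X2_train, c2)])

def mklf_basis(x1, x2, X1_train, X2_train, c1, c2, eta=(0.5, 0.5)):
    """MKLF: weighted sum of the per-feature kernels (N-dimensional), with ||eta|| <= 1."""
    return (eta[0] * gaussian_kernel_basis(x1, X1_train, c1)
            + eta[1] * gaussian_kernel_basis(x2, X2_train, c2))
```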

5.4. Training for late fusion methods

In this section, we elaborate the training steps for the late fusion strategies LDF and KDF described in Section 5.1. The parameters to be estimated are W_kj, Σ_kj, ν_k and λ_k, for k = 1, …, K and j = 1, 2. We use the N training examples T = {x_1^(i), x_2^(i), y^(i)}_{i=1}^N to estimate the values of these parameters. The training steps for these fusion models are given below.

1. Cluster the pose space Y = {y^(i)}_{i=1}^N into K partitions using the k-means algorithm.
2. For each partition k:
   (a) Train the RVM regression models p_kj, for j = 1, 2, of Eq. (5) with the training set {x_j^(i), y^(i)}, for all i such that y^(i) belongs to the kth partition, using the RVM training method described in Section 4. This gives a weight matrix W_kj. The covariance matrix is obtained from the errors e_{k,j}^(i) = y^(i) − W_kj^T Φ(x_j^(i)) as Σ_kj = (1/N_k) Σ_{i=1}^{N_k} (e_{k,j}^(i))(e_{k,j}^(i))^T, where N_k is the number of training samples that belong to the kth partition. Since the output dimensions are assumed to be independent, we retain only the diagonal elements of these covariance matrices.
   (b) Train the combiner function β_k to estimate its parameter λ_k with the training samples {χ^(i), γ_k^(i)}, for all i such that y^(i) belongs to the kth partition. The confidence measure γ_k^(i) of the regressor p_k1 on the kth cluster is computed as

       γ_k^(i) = p_k1(y^(i) | x_1^(i)) / Σ_{j=1}^{2} p_kj(y^(i) | x_j^(i)).   (7)

   The combiner function is a classifier modeled using a logistic function, hence we use the RVM classifier training method proposed in [40] to estimate the weight vector λ_k (an illustrative sketch of this computation is given after Section 5.5).
3. Train the gating function g_k(χ) from the training data {χ^(i), l^(i)}_{i=1}^N, where l^(i) is a K-dimensional vector and each component l_k^(i) denotes the probability that the pose vector y^(i) belongs to the kth cluster. We set l_k^(i) = 1 if y^(i) belongs to the kth cluster, and 0 otherwise. The maximum likelihood estimation of the parameter vectors ν_k, for k = 1, …, K, is then performed using iteratively re-weighted least squares. In particular, we use the fast method based on the bound optimization approach described in [19] to train the parameter vectors.

Once the fusion model is trained, we can use Eq. (4) to estimate the pose from the feature types x_1 and x_2 in the testing phase.

5.5. Training for early fusion methods

To train the fusion methods that adopt an early fusion strategy, such as LFF, KFF, KBF and MKLF, similar steps can be undertaken except that step 2(a) involves training a single regressor in each partition and step 2(b) is not required. This is because, in early fusion methods, a local fusion model consists of a single regressor, i.e., the combined basis vector is obtained either by feature level fusion or by kernel level fusion. Late fusion methods, on the other hand, consist of two regressors (one for each feature type) and a classifier, which is used to combine the outputs from the two regressors.
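
As an illustration of step 2(b) in Section 5.4, the combiner targets of Eq. (7) could be computed as in the sketch below, using the diagonal-Gaussian regressor likelihoods of Eq. (5); the helper names and the log-space evaluation are assumptions made for numerical convenience.

```python
import numpy as np

def gaussian_logpdf_diag(y, mean, var):
    """Log of N(y; mean, diag(var)), matching the independent-output assumption."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def combiner_targets(Y, mean1, mean2, var1, var2):
    """Per-sample targets gamma of Eq. (7) for one pose-space partition.

    mean1/mean2 hold the predictions of the appearance and shape regressors on
    the partition's training samples; var1/var2 are their diagonal noise variances.
    The returned values are the labels used to fit the logistic combiner beta_k.
    """
    gamma = np.empty(len(Y))
    for i, y in enumerate(Y):
        l1 = gaussian_logpdf_diag(y, mean1[i], var1)
        l2 = gaussian_logpdf_diag(y, mean2[i], var2)
        gamma[i] = 1.0 / (1.0 + np.exp(l2 - l1))  # = p_k1 / (p_k1 + p_k2), computed in log-space
    return gamma
```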


We also train the non-fusion models that use a single feature type for prediction. The non-fusion models are: linear shape-based regressor (LSR), kernelized shape-based regressor (KSR), linear appearance-based regressor (LAR) and kernelized appearance-based regressor (KAR). As the names suggest, the first two are regression methods applied to the shape feature (HoSC descriptor) only and the last two are applied to the appearance feature, which in our case is the HLAC descriptor.


Table 2. Various fusion and non-fusion models.

Method       Acronym   Full name
Fusion       LDF       Linear decision level fusion
             KDF       Kernelized decision level fusion
             KFF       Kernelized feature level fusion
             LFF       Linear feature level fusion
             KBF       Kernel basis fusion
             MKLF      Multiple kernel learning fusion
             KNNF      K-nearest neighbor-based fusion
Non-fusion   LAR       Linear appearance-based regressor
             KAR       Kernelized appearance-based regressor
             LSR       Linear shape-based regressor
             KSR       Kernelized shape-based regressor

6. Experimental results

We trained and evaluated our proposed 3D human pose estimation method using the HumanEva dataset [34] provided by Brown University. We used video frames and the corresponding 3D poses of three subjects from the walking, jogging and boxing sequences of the dataset to train and evaluate our approach. The dataset was originally partitioned into training, validation and testing sets. However, as the ground truth of the testing set was not provided, we used the validation set as our testing set and the original training set as our training set. Table 1 shows the number of images in the training and testing sets for each action. For each image, the corresponding ground truth 3D pose is given by the (x, y, z)-coordinates of the 15 human body joint locations. We regard the pelvis joint location as the root position and normalize the pose by subtracting the pelvis joint from all the other 14 joint positions. Since the pelvis joint is then always at the origin, we use the other 14 joint positions to construct the d = 42 dimensional pose vector. We only used images taken by camera C1, as our focus was on pose estimation from monocular images. For the validation set, we used the images and pose vectors from camera C2. Although the HumanEva dataset provides images from seven cameras (where each camera covers a viewpoint of the scene), we only use images from a single camera to predict the 3D human pose since our method is designed for monocular pose estimation. The appearance descriptors (HLAC) were extracted for all the images in the training and test sets using the method described in Section 3.1. Similarly, the shape descriptors (HoSC) were extracted for all the images in the training and test sets using the method described in Section 3.2. As discussed in Section 3, the dimension of both the shape and appearance descriptors is 200.

Table 1. Number of images used for training and testing.

Event        Training   Testing
Walking      1528       1479
Jogging      1027       1158
Boxing       1045       1236
Gestures     1314       1298
Throwcatch   753        545
All          3600       3873

6.1. Training of the fusion models

All the fusion models that we trained are summarized in Table 2. We used the training steps described in Section 5.4 to train the LDF and KDF. The early fusion methods (KFF, LFF, KBF and MKLF) and the non-fusion methods (LAR, KAR, LSR and KSR) were trained using the method described in Section 5.5. All the kernelized models utilize a Gaussian kernel of the form given in Eq. (2) to compute the basis vector. We computed the optimal value of the kernel width parameter c by performing a grid search for c over suitable ranges using the validation set. To train the weight parameters of MKLF, we used the SVM-based multiple kernel learning routine of the SHOGUN machine learning toolbox [38].
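
The grid search for the kernel width mentioned above could be organized as in the following sketch; `fit_fn` and `error_fn` are hypothetical callables standing in for the model training and validation-error routines, and the candidate range is illustrative.

```python
import numpy as np

def select_kernel_width(candidates, fit_fn, error_fn, train, val):
    """Grid search for the Gaussian kernel width c (Eq. (2)) on a validation set.

    fit_fn(train, c) returns a trained model; error_fn(model, val) returns the
    mean 3D error of that model on the validation data.
    """
    errors = [error_fn(fit_fn(train, c), val) for c in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

# Example usage with a logarithmic range of candidate widths:
# c_best, err_best = select_kernel_width(np.logspace(-5, -1, 9), fit_fn, error_fn, train, val)
```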

In order to compare the performance of the fusion models for different numbers of pose space partitions K, we let K vary from 1 to 10. For each value of K, we trained separate models and used the same pose cluster seeds for all the models. We also combined the descriptors using K-nearest neighbor (KNN) regression. KNN is a non-parametric kernel regression technique where the pose of the query image is estimated as the weighted sum of the poses corresponding to the L nearest images in the whole training set. The weight of each pose vector is set inversely proportional to the average distance over the shape and appearance feature types. We determined the optimal value for the number of nearest neighbors to be L = 10 using our validation set. We refer to this K-nearest neighbor-based fusion method as KNNF for the rest of the paper.

6.2. Evaluation

Once the models were trained, we predicted the poses of the images in the testing set using the trained models. The final output of the fusion and non-fusion models is a d = 42-dimensional pose vector, which collectively denotes a particular pose. In order to evaluate the models, we computed the 3D error between the estimated pose y and the ground truth pose ŷ according to [34]:

E(y, ŷ) = (1/J) Σ_{i=1}^{J} ∥m_i(y) − m_i(ŷ)∥,

where m_i(y) ∈ R^3 denotes the three-dimensional coordinates of the ith joint location extracted from the pose vector y ∈ R^d, ∥·∥ denotes the Euclidean distance and J is the number of joint locations. The formula measures the 3D error in mm between the two pose vectors. Intuitively, the error can be thought of as the average Euclidean distance in mm between the ground truth and the estimated joint locations in 3D space. The mean 3D error over the T test images is computed as E = (1/T) Σ_{i=1}^{T} E(y^(i), ŷ^(i)).

Figs. 4 and 5 compare the mean 3D errors of different fusion methods and non-fusion methods for different numbers of pose clusters for the following actions: walking (Fig. 4(a)), jogging (Fig. 4(b)), boxing (Fig. 4(c)), gestures (Fig. 5(a)), and throwcatch (Fig. 5(b)). Comparisons for the linear models are shown in the left column and those for the kernelized models are shown in the right column of the figures. We found that, for all sequences, the fusion models give a lower mean 3D error than the models that use an individual feature type. The mean errors of LDF and KDF were found to be the lowest among the linear models and kernelized models, respectively, for all of the different numbers of pose space partitions K. While the performance of MKLF and KFF is close to that of KDF, the performance of KBF was found to be the lowest amongst all the fusion approaches. Comparing the left and the right columns of each row of Figs. 4 and 5, one can see that the mean error for the kernelized models is less than the mean error for the linear models. This is an expected result, as the kernelized models handle the nonlinear mapping more effectively.
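
A sketch of the 3D error metric defined above, operating on the root-centred 42-dimensional pose vectors described at the beginning of this section (so J = 14 joints), is given below.

```python
import numpy as np

def pose_error_mm(y_est, y_true, n_joints=14):
    """Mean Euclidean distance over joints between two 42-D pose vectors (Section 6.2)."""
    a = y_est.reshape(n_joints, 3)
    b = y_true.reshape(n_joints, 3)
    return np.linalg.norm(a - b, axis=1).mean()

def mean_error_mm(Y_est, Y_true):
    """Mean 3D error over a test set of pose vectors."""
    return float(np.mean([pose_error_mm(e, t) for e, t in zip(Y_est, Y_true)]))
```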


Fig. 4. The mean 3D errors corresponding to different numbers of pose space partitions for different fusion and non-fusion methods. The linear models are compared in the left column and the kernelized models are compared in the right column. Evaluations were done on our test set of: (a) walking sequence, (b) jogging sequence and (c) boxing sequence.

It is interesting to observe that performing localized fusion in each cluster of the pose space improves the performance of pose estimation. For the walking and jogging sequences, the optimal numbers of pose clusters are 7 and 6, respectively, for both the linear and kernelized models. For the boxing sequence, the lowest pose estimation error is attained when the number of clusters is 3. For the gestures and throwcatch sequences, the optimal numbers of pose clusters are

2 and 3, respectively, for the kernelized fusion models. The optimal number of clusters is low for the boxing, gestures and throwcatch sequences because, in these activities, there is less pose variation in comparison to the walking and jogging activities. However, when the linear fusion models are applied to the gestures and throwcatch sequences, the optimal numbers of clusters are 6 and 5, respectively. This is understandable, as linear models require more clusters than the kernelized models to model the non-linear mappings.


Fig. 5. The mean 3D errors corresponding to different numbers of pose space partitions for different fusion and non-fusion methods. The linear models are compared in the left column and the kernelized models are compared in the right column. Evaluations were done on our test set of: (a) gestures sequence and (b) throwcatch sequence.

Table 3. Comparison of the average pose estimation errors ± standard deviations in mm for different fusion and non-fusion models on the walking, jogging and boxing activities. KAR, KSR, LAR and LSR are non-fusion models; the rest are fusion models.

Method               Walking        Jogging        Boxing
Kernelized models
  KDF                52.3 ± 28.8    49.2 ± 24.9    58.1 ± 23.2
  KFF                53.2 ± 29.4    51.4 ± 23.8    62.6 ± 23.0
  MKLF               52.5 ± 29.5    50.7 ± 23.6    59.5 ± 23.3
  KBF                57.5 ± 30.7    52.8 ± 21.7    60.5 ± 24.1
  KNNF               56.9 ± 38.6    57.8 ± 30.0    68.6 ± 30.8
  KAR                58.4 ± 28.1    53.6 ± 21.7    65.1 ± 29.2
  KSR                74.0 ± 40.1    62.5 ± 33.9    65.3 ± 24.8
Linear models
  LDF                55.2 ± 28.7    56.6 ± 25.5    63.7 ± 22.8
  LFF                58.6 ± 28.1    57.3 ± 25.8    65.9 ± 22.4
  LAR                66.5 ± 28.1    62.3 ± 24.4    74.4 ± 34.7
  LSR                82.0 ± 44.3    72.2 ± 35.0    68.5 ± 24.1

Table 4. Comparison of the average pose estimation errors ± standard deviations in mm for different fusion and non-fusion models on the gestures and throwcatch activities. KAR, KSR, LAR and LSR are non-fusion models; the rest are fusion models.

Method               Gestures         Throwcatch
Kernelized models
  KDF                36.38 ± 32.95    61.83 ± 26.86
  KFF                38.11 ± 36.55    63.24 ± 26.11
  MKLF               37.21 ± 33.72    63.47 ± 25.31
  KBF                38.60 ± 33.63    65.58 ± 27.65
  KNNF               37.51 ± 35.11    65.73 ± 27.35
  KAR                38.76 ± 32.74    65.02 ± 29.06
  KSR                41.16 ± 38.92    67.75 ± 28.97
Linear models
  LDF                38.68 ± 31.42    67.21 ± 24.39
  LFF                41.21 ± 36.72    74.46 ± 28.43
  LAR                42.80 ± 40.10    73.13 ± 27.75
  LSR                43.25 ± 42.86    73.85 ± 35.34

The mean 3D errors of the different fusion and non-fusion models evaluated on the different test action sequences are shown in Tables 3 and 4. For each model and each test sequence, the optimal number of clusters is chosen in accordance with Figs. 4 and 5. The results show that the proposed decision level fusion has the lowest average 3D errors

compared to the other fusion methods and those from each individual regressor for both linear and kernelized cases. KDF performs the best among the kernelized fusion models whereas LDF performs the best among the linear models. Although the standard deviation of the errors of the proposed fusion method is slightly higher than that of KAR for the walking, jogging, gestures and throwcatch activities, the decrease in the mean error indicates


that the fusion has an advantage over the individual regressors. For the boxing activity, improvements in both the average 3D error and the standard deviation are achieved. Figs. 6 and 7 compare the mean 3D errors per joint location of the proposed fusion model KDF with the non-fusion models KAR and KSR for the following actions: walking (Fig. 6(a)), jogging (Fig. 6(b)), boxing (Fig. 6(c)), gestures (Fig. 7(a)), and throwcatch (Fig. 7(b)). It can be observed that the proposed fusion method improves the pose estimation performance for most of the joints, with the most significant improvements found for the wrist and elbow joints of the boxing and throwcatch activities.

Fig. 6. The mean 3D errors per joint produced by the proposed fusion model (KDF) and the individual shape regressor (KSR) and appearance regressors (KAR) on our test set of: (a) walking sequence, (b) jogging sequence and (c) boxing sequence.

Fig. 7. The mean 3D errors per joint produced by the proposed fusion model (KDF), the shape regressor (KSR) and the appearance regressors (KAR) on our test set of: (a) gestures sequence and (b) throwcatch sequence.

Fig. 9 shows a comparison of the 3D error for each test image frame from the walking sequence of subject S2 for the KDF fusion and the non-fusion models KAR and KSR. It shows that the fusion produces more accurate pose estimates across most of the image frames. Fig. 8(a)–(d) respectively shows the affinity matrix of the ground truth poses, the output poses estimated from the shape features alone, the appearance features alone, and the KDF fusion method. It can be observed that the output poses of the shape regressor differ from the ground truth for most of the frames. This is because the shape features yield incorrect poses due to forward/backward ambiguities and self-occlusions. For the appearance regressor, these ambiguous estimates are corrected to some extent, but there are still inaccurate pose estimates due to the clutter and noise inherent in appearance features. The accuracy of the pose estimates is significantly improved when the two features are combined using KDF, as the affinity matrix of the poses estimated using the fusion is more similar to the affinity matrix of the ground truth poses. The cluster-specific local fusion strategy allows the complementary information of each feature type to be utilized for accurate pose estimation. If one feature type does not perform well in a cluster, then the other feature type compensates for a better pose prediction. However, if both feature types perform poorly in a cluster, then the fusion does not improve the performance. In our experiments, we found that for 94% of the test data the fusion improved the performance, whereas for the remaining 6% of the test data no improvement from the fusion was found.
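
The affinity matrices of Fig. 8 can be reproduced, up to display scaling, with a pairwise pose-distance computation such as the following sketch; using the mean per-joint Euclidean distance mirrors the error metric of Section 6.2 but is an assumption about how the figure was generated.

```python
import numpy as np

def pose_affinity_matrix(Y):
    """Pairwise pose-distance matrix for visualizing a sequence (cf. Fig. 8).

    Y is a (T, 42) array of pose vectors; entry (i, j) is the mean per-joint 3D
    distance between frames i and j, so dark off-diagonal bands indicate
    recurring poses (e.g., the circular walking motion).
    """
    J = Y.reshape(len(Y), -1, 3)
    diff = J[:, None, :, :] - J[None, :, :, :]
    return np.linalg.norm(diff, axis=-1).mean(axis=-1)
```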


Fig. 8. Comparison of the ground truth and estimated poses of our test walking sequences using affinity matrices. Affinity matrix of: (a) the ground truth poses, (b) the poses estimated using the shape regressor (KSR), (c) the poses estimated using the appearance regressor (KAR), and (d) the poses estimated using our fusion method (KDF). For an affinity matrix, each row and column corresponds to one pose. The darker pixels denote smaller distances between the poses and brighter ones denote larger distances. The off-diagonal similarity band is due to the circular motion of the subject.

Fig. 9. Comparison of the 3D error per frame for the kernelized decision fusion and non-fusion models evaluated on subject S2 of our test walking sequence.

An example of how our fusion method reduces the level of ambiguity during pose estimation is shown in Fig. 10 for the case where the walking subject is facing the camera. It can be seen in

the second row of Fig. 10 that the KSR model predicts three different poses, B2, B3 and B4, with probabilities of 0.61, 0.37 and 0.006, respectively. The pose B4 can be discarded as the probability associated with it is negligible. The remaining poses B2 and B3 have higher probabilities associated with them because the silhouettes of a person moving towards or away from the camera are generally similar. When the weighted average of these solutions is taken, the result is the meaningless pose B5. This is referred to as the forward/backward ambiguity in the literature, and motion information is often used to disambiguate or select the correct pose solution [1,36]. We can observe in the first row of Fig. 10 that, in contrast to the shape model (KSR), for the appearance model (KAR) the probabilities are higher for the more correct pose solution A1 and lower for the less correct solutions A3 and A4. When we combine the models using KDF, we can observe from the third row of Fig. 10 that the probabilities are concentrated on the more correct output pose C2 rather than on the less correct output poses C3 and C4. The error of the average prediction C5 from the KDF fusion is less than that of the average prediction B5 from the KSR model and the average prediction A5 from the KAR model. This example illustrates that our fusion reduces the level of ambiguity during pose estimation by providing a more correct pose estimate than that obtained from the shape or appearance features alone. Fig. 11 displays some of the output poses predicted using the KDF fusion method alongside their estimation errors. Our experiments show that our fusion method can effectively combine more than one feature type to improve the pose estimation performance.


Fig. 10. Examples of how the level of ambiguities caused by shape and appearance features are reduced by our fusion methods. The first row (A) displays the output poses predicted by the KAR model (uses appearance feature only), the second row (B) displays the output poses predicted by the KSR model (uses shape feature only). The third row (C) displays the output poses predicted by the combination of two feature types using the KDF model. Columns 2–4 show the pose solutions having the top three gating probabilities and are ordered in such a way that poses in a particular column from all three models come from the same cluster index. The gating probabilities are shown above each pose. Column 5 shows the final average pose. Below each output pose, the 3D error (in mm) is shown. The 3D poses are rendered as cylinders with the red color denoting the left limbs and blue denoting the right limbs (Figures best viewed in color). (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

Fig. 12 shows a comparison of the pose estimation time per image frame for the different fusion models. The linear fusion models are the fastest, with an estimation time of less than 1 ms per frame, followed by KBF, KFF and KDF, with estimation times between 5 and 10 ms per frame. MKLF is the slowest, with an estimation time in the range of 25–30 ms per frame. MKLF is slower than the other kernelized models because it needs more training samples to make a prediction: in our experiments, MKLF retains on average 65% of the training samples, whereas KDF and KFF retain 24% and KBF retains 18%.
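The dependence of the prediction time on the number of retained training samples follows from the form of a sparse kernel prediction. The following is a minimal sketch under the assumption of an RBF kernel; the kernel choice, the value of gamma, the array sizes and the function names are illustrative, not the settings used in this paper.

```python
import numpy as np

def rbf_kernel(x, X, gamma=1e-3):
    """RBF kernel responses between a query x of shape (D,) and retained inputs X of shape (N, D)."""
    d2 = np.sum((X - x) ** 2, axis=1)
    return np.exp(-gamma * d2)

def kernel_predict(x, X_retained, W):
    """Pose prediction of a sparse kernelized regressor.

    X_retained : (N, D) retained training inputs
    W          : (N, P) weights mapping the N kernel responses to the P pose dimensions

    The cost is dominated by the N kernel evaluations, so the prediction time
    grows linearly with the number of retained training samples.
    """
    k = rbf_kernel(x, X_retained)   # O(N * D)
    return k @ W                    # O(N * P)

# Usage example with random data (values are meaningless, only the shapes matter).
rng = np.random.default_rng(0)
X_ret = rng.normal(size=(500, 100))   # 500 retained samples, 100-D descriptor
W = rng.normal(size=(500, 30))        # 30-D pose vector
pose = kernel_predict(rng.normal(size=100), X_ret, W)
```

Since the per-query cost is dominated by the N kernel evaluations, a model retaining 65% of the training samples would be expected to be roughly 2.7 times slower per frame than one retaining 24%, which is consistent with the timings reported in Fig. 12.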


[Fig. 11 image grid: the per-pose 3D estimation errors shown below the estimated poses range from 19.79 mm to 68.0 mm.]

Fig. 11. Some images from our test set and their corresponding 3D poses estimated by the fusion of the shape and appearance feature types. Below each output pose, the estimation error (in mm) with respect to the ground truth pose is shown.

The computation time of our proposed appearance descriptor is 150 ms per image frame. The computation time of the HoSC descriptor is 18 s per frame, using the implementation of the shape context descriptor provided by Ref. [3].

7. Conclusion

In this paper, we have presented a learning-based localized decision level fusion method for 3D human pose estimation. Our fusion method can effectively combine the shape and appearance feature types to predict the multidimensional human pose by learning the locally optimal fusion of the discriminative models corresponding to the shape and appearance features. We evaluated our method against two other fusion methods, namely feature level fusion and kernel level fusion. We have experimentally shown that all the fusion methods produce more accurate pose estimates than those obtained from the shape or appearance features alone. Among the fusion methods, our learning-based localized decision level fusion method produced the most accurate pose estimates for both the linear and kernelized models. Our fusion method is effective because it combines the different feature types by learning an input-specific fusion strategy in local regions of the pose space. Although this paper describes an application of the fusion method to 3D human pose estimation, our fusion method is equally applicable to other systems that use discriminative prediction models. The advantage of our approach lies in the speedy estimation of the human pose. However, learning the fusion model requires a large amount of training data, which is difficult to obtain. For future work, we aim to extend our proposed technique using semi-supervised learning in a lower-dimensional pose space so that a more accurate fusion model can be trained with partially labeled data.


[Fig. 12 bar chart: average estimation time in milliseconds (0–30) per frame for the linear and kernelized fusion models.]

Fig. 12. Comparison of the average estimation time per frame for different fusion methods tested on our test set of the walking sequence. The feature extraction time is not included.

Conflict of interest

None.

Acknowledgments

This work was in part supported by the ARC Discovery Project grant DP0771294.

References

[1] A. Agarwal, B. Triggs, Monocular human motion capture with a mixture of regressors, in: Computer Society Conference on CVPR, vol. 3, 2005, pp. 72–72.
[2] A. Agarwal, B. Triggs, Recovering 3D human pose from monocular images, IEEE Transactions on PAMI 28 (1) (2006).
[3] S. Belongie, J. Malik, J. Puzicha, Shape matching and object recognition using shape contexts, IEEE Transactions on PAMI 24 (4) (2002) 509–522.
[4] A. Bissacco, M.-H. Yang, S. Soatto, Fast human pose estimation using appearance and motion via multi-dimensional boosting regression, in: IEEE Conference on CVPR, 2007, pp. 1–8.
[5] L. Bo, C. Sminchisescu, Twin Gaussian processes for structured prediction, International Journal of Computer Vision 87 (1–2) (2010) 28–52.
[6] L. Bo, C. Sminchisescu, A. Kanaujia, D. Metaxas, Fast algorithms for large scale conditional 3D prediction, in: IEEE Conference on CVPR, 2008, pp. 1–8.
[7] M.A. Brubaker, D.J. Fleet, A. Hertzmann, Physics-based person tracking using the anthropomorphic walker, International Journal of Computer Vision 87 (1–2) (2010) 140–155.
[8] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on CVPR, vol. 1, 2005, pp. 886–893.
[9] J. Deutscher, I.D. Reid, Articulated body motion capture by stochastic search, International Journal of Computer Vision 61 (2) (2005).
[10] A. Elgammal, R. Duraiswami, D. Harwood, L. Davis, Background and foreground modeling using nonparametric kernel density estimation for visual surveillance, Proceedings of the IEEE 90 (7) (2002) 1151–1163.
[11] V. Ferrari, M. Marin-Jimenez, A. Zisserman, Progressive search space reduction for human pose estimation, in: IEEE Conference on CVPR, 2008.
[12] V. Guigue, A. Rakotomamonjy, S. Canu, Kernel basis pursuit, in: European Conference on Machine Learning, 2005.
[13] W. Guo, I. Patras, Discriminative 3D human pose estimation from monocular images via topological preserving hierarchical affinity clustering, in: IEEE Conference on Computer Vision, 2009, pp. 9–15.
[14] C.-R. Huang, C.-S. Chen, P.-C. Chung, Contrast context histogram—a discriminating local descriptor for image matching, in: International Conference on Pattern Recognition, vol. 4, 2006, pp. 53–56.
[15] H. Jiang, Human pose estimation using consistent max-covering, in: International Conference on Computer Vision, 2009, pp. 1357–1364.
[16] A. Kanaujia, D. Metaxas, Learning ambiguities using Bayesian mixture of experts, in: IEEE International Conference on Tools with Artificial Intelligence, 2006, pp. 436–440.
[17] Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 506–513.
[18] J. Kittler, M. Hatef, R.P. Duin, J. Matas, On combining classifiers, IEEE Transactions on PAMI 20 (3) (1998) 226–239.
[19] B. Krishnapuram, L. Carin, M.A.T. Figueiredo, A.J. Hartemink, Sparse multinomial logistic regression: fast algorithms and generalization bounds, IEEE Transactions on PAMI 27 (6) (2005) 957–968.
[20] C.S. Lee, A. Elgammal, Coupled visual and kinematic manifold models for tracking, International Journal of Computer Vision 87 (1–2) (2010) 118–139.
[21] M.W. Lee, R. Nevatia, Human pose tracking in monocular sequence using multilevel structured models, IEEE Transactions on PAMI 31 (1) (2009) 27–38.
[22] W.-L. Lu, J.J. Little, Simultaneous tracking and action recognition using the PCA-HOG descriptor, in: Canadian Conference on Computer and Robot Vision, 2006, pp. 6–6.
[23] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Transactions on PAMI 27 (10) (2005) 1615–1630.
[24] R. Navaratnam, A. Fitzgibbon, R. Cipolla, The joint manifold model for semi-supervised multi-valued regression, in: International Conference on Computer Vision, 2007, pp. 1–8.
[25] H. Ning, W. Xu, Y. Gong, T. Huang, Discriminative learning of visual words for 3D human pose estimation, in: IEEE Conference on CVPR, 2008, pp. 1–8.
[26] D. Ramanan, D. Forsyth, A. Zisserman, Strike a pose: tracking people by finding stylized poses, in: IEEE Conference on CVPR, vol. 1, 2005, pp. 271–278.
[27] R. Rosales, S. Sclaroff, Combining generative and discriminative models in a framework for articulated pose estimation, International Journal of Computer Vision 67 (3) (2006) 251–276.
[28] A. Ross, A. Jain, Information fusion in biometrics, Pattern Recognition Letters 24 (2003) 2115–2125.
[29] S. Sedai, M. Bennamoun, D. Huynh, Context-based appearance descriptor for 3D human pose estimation from monocular images, in: DICTA, 2009, pp. 484–491.
[30] S. Sedai, M. Bennamoun, D. Huynh, Localized fusion of shape and appearance features for 3D human pose estimation, in: Proceedings of the BMVC, 2010, pp. 51.1–10.
[31] S. Sedai, D. Huynh, M. Bennamoun, Supervised particle filter for tracking 2D human pose in monocular video, in: IEEE Workshop on Applications of Computer Vision (WACV), 2011, pp. 367–373.
[32] G. Shakhnarovich, P. Viola, T. Darrell, Fast pose estimation with parameter-sensitive hashing, in: International Conference on Computer Vision, vol. 2, 2003, pp. 750–757.
[33] H. Sidenbladh, M.J. Black, D.J. Fleet, Stochastic tracking of 3D human figures using 2D image motion, in: European Conference on Computer Vision, vol. 2, 2000, pp. 702–718.
[34] L. Sigal, A. Balan, M. Black, HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion, International Journal of Computer Vision 87 (2010) 4–27.
[35] V. Singh, F. Khan, R. Nevatia, Multiple pose context trees for estimating human pose in object context, in: IEEE Conference on CVPR, 2010, pp. 17–24.
[36] C. Sminchisescu, A. Kanaujia, Z. Li, D. Metaxas, Discriminative density propagation for 3D human motion estimation, in: IEEE Conference on CVPR, vol. 1, 2005, pp. 390–397.
[37] C. Sminchisescu, B. Triggs, Estimating articulated human motion with covariance scaled sampling, International Journal of Robotics Research, 2003.
[38] S. Sonnenburg, G. Raetsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, V. Franc, The shogun machine learning toolbox, Journal of Machine Learning Research 11 (2010) 1799–1802.
[39] S. Sonnenburg, G. Rätsch, C. Schäfer, B. Schölkopf, Large scale multiple kernel learning, Journal of Machine Learning Research 7 (2006) 1531–1565.
[40] M.E. Tipping, A. Faul, Fast marginal likelihood maximisation for sparse Bayesian models, in: Proceedings of the AISTATS, 2003, pp. 3–6.
[41] V. Tresp, A Bayesian committee machine, Neural Computation 12 (2000) 2719–2741.
[42] R. Urtasun, D.J. Fleet, A. Hertzmann, P. Fua, Priors for people tracking from small training sets, in: International Conference on Computer Vision, 2005, pp. 403–410.
[43] J. Wang, D. Fleet, A. Hertzmann, Gaussian process dynamical models for human motion, IEEE Transactions on PAMI 30 (2008) 283–298.
[44] X. Zhang, C. Li, X. Tong, W. Hu, S. Maybank, Y. Zhang, Efficient human pose estimation via parsing a tree structure based human model, in: International Conference on Computer Vision, 2009, pp. 1349–1356.

Suman Sedai received his MSc from Inha University, South Korea, and his PhD from the University of Western Australia, Perth, Australia, in 2007 and 2012, respectively. His research interests include image processing, visual tracking, object recognition, pattern recognition and machine learning.

Mohammed Bennamoun received his MSc from Queen's University, Kingston, Canada in the area of Control Theory, and his PhD from Queen's/QUT in Brisbane, Australia in the area of Computer Vision. He lectured Robotics at Queen's and then joined QUT in 1993 as an Associate Lecturer. He became a Lecturer in 1996 and a Senior Lecturer in 1998 at QUT. He was also the Director of a research Centre from 1998 to 2002. In January 2003, he joined the School of Computer Science and Software Engineering at the University of Western Australia (UWA) as an Associate Professor and was promoted to full Professor in 2007. He has been the Head of the School of Computer Science and Software Engineering at UWA since February 2007. He was an Erasmus Mundus Scholar and Visiting Professor in 2006 at the University of Edinburgh. He was also a Visiting Professor at CNRS (Centre National de la Recherche Scientifique) and Telecom Lille1, France in 2009, the Helsinki University of Technology in 2006, and the University of Bourgogne and Paris 13 in France in 2002–2003. He is the co-author of the book "Object Recognition: Fundamentals and Case Studies", Springer-Verlag, 2001. He won the "Best Supervisor of the Year" Award at QUT and also received an award for research supervision at UWA in 2008. He has published over 140 journal and conference publications. He has served as a guest editor for a couple of special issues in international journals, such as the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI). He was selected to give conference tutorials at the European Conference on Computer Vision (ECCV) and the International Conference on Acoustics, Speech and Signal Processing (IEEE ICASSP). He has organized several special sessions for conferences, including a special session for the IEEE International Conference on Image Processing (IEEE ICIP), and has also contributed to the organization of many local and international conferences. His areas of interest include control theory, robotics, obstacle avoidance, object recognition, artificial neural networks, signal/image processing and computer vision (particularly 3D).

Du Q. Huynh is an Associate Professor at the School of Computer Science and Software Engineering, The University of Western Australia. She obtained her PhD in Computer Vision, in 1994, at the same university. Since then, she has worked for the Australian Cooperative Research Centre for Sensor Signal and Information Processing (CSSIP) and Murdoch University. She has been a visiting scholar at Chinese University of Hong Kong, Malmö University, Gunma University, and the University of Melbourne. Associate Professor Huynh is currently a recipient of a Discovery Project grant and two Linkage Project grants funded by the Australian Research Council. Her research interests include shape from motion, multiple view geometry, video image processing, visual tracking, and signal processing.