A local descriptor based on Laplacian pyramid coding for action recognition


Pattern Recognition Letters 34 (2013) 1899–1905


Xiantong Zhen, Ling Shao*

Department of Electronic & Electrical Engineering, The University of Sheffield, Sir Frederick Mappin Building, Mappin Street, Sheffield S1 3JD, United Kingdom

* Corresponding author. Tel.: +44 (0) 114 222 5841. E-mail address: ling.shao@sheffield.ac.uk (L. Shao).


Article history: Available online 10 November 2012

Keywords: Action recognition; Laplacian pyramid; Localized soft-assignment coding; Max pooling

Abstract

We present a new descriptor for the local representation of human actions. In contrast to state-of-the-art descriptors, which use spatio-temporal features to describe cuboids detected from video sequences, we propose to employ a 2D descriptor based on the Laplacian pyramid for efficiently encoding spatio-temporal regions of interest. Image templates, including structural planes and motion templates, are first extracted from a cuboid to encode its structural and motion features. A 2D Laplacian pyramid is then applied to decompose each of these images into a series of sub-band feature maps, which is followed by a two-stage feature extraction, i.e., Gabor filtering and max pooling. Motion-related edge and orientation information is enhanced after the filtering. To capture more discriminative and invariant features, max pooling is applied to the outputs of the Gabor filtering, between scales within filter banks and over spatial neighborhoods. The obtained local features associated with cuboids are fed to localized soft-assignment coding with max pooling on the Bag-of-Words (BoW) model to represent an action. The image templates, i.e., MHI and TOP, explicitly encode the motion and structure information in the video sequences, and the proposed Laplacian pyramid coding descriptor provides an informative representation of them due to the multi-scale analysis. The employment of localized soft-assignment coding and max pooling gives a robust representation of actions. Experimental results on the benchmark KTH dataset and the newly released and challenging HMDB51 dataset demonstrate the effectiveness of the proposed method for human action recognition.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Automatic recognition and categorization of actions in video sequences is a very active research topic in computer vision and machine learning, with applications in many areas, including content-based video indexing, detecting activities and behaviors in surveillance videos, organizing digital video libraries according to specified actions, human-computer interfaces and robotics. The challenge is to obtain robust action recognition and classification under variable illumination, background changes, camera motion and zooming, viewpoint changes, and partial occlusion. Moreover, the intra-class variation is often very large and ambiguity exists between actions (Schuldt et al., 2004). Feature representation, as a fundamental part of action recognition, greatly influences the performance of the recognition system. Human actions in video data inherently contain both spatial and temporal information, which requires that descriptors of actions in video sequences accurately capture and robustly encode this kind of information. Many representation methods have recently been proposed, most of which evolve from extending popular techniques in the

2D image domain, e.g., histograms of spatio-temporal gradients (HOG3D) (Kläser et al., 2008), histograms of oriented gradients and histograms of optic flow (HOGHOF) (Laptev et al., 2008), the 3D scale-invariant feature transform (SIFT3D) (Scovanner et al., 2007) and local trinary patterns (LTP) (Yeffet and Wolf, 2009). However, the direct extension from the 2D domain to its 3D counterpart has not been validated, because the temporal axis, as the third dimension, has different properties from the spatial dimensions. Dollár et al. (2005) treated the spatial and temporal dimensions separately and proposed a spatio-temporal interest point (STIP) detector, which has demonstrated many advantages over extended versions of 2D detectors such as the 3D Harris corner detector (Laptev and Lindeberg, 2003) and the 3D Hessian matrix detector (Willems et al., 2008). Holistic representations treat the raw video sequence as a whole and directly extract spatio-temporal features from it rather than applying sparse sampling using STIP detectors. Such representations play important roles in human action recognition (Bobick and Davis, 2002; Ji et al., 2010; Taylor et al., 2010), because they are able to encode more visual information by preserving the spatial and temporal structures of the human action occurring in a video sequence. However, holistic representations are highly sensitive to background variations and clutter. Moreover, preprocessing steps, such as background subtraction, spatial and temporal alignment,


segmentation and tracking, are often required, which unfortunately makes holistic representations computationally expensive and sometimes even intractable in real-time applications. Recently, local representations based on detected spatio-temporal interest points have been drawing much attention (Laptev and Lindeberg, 2003; Dollár et al., 2005; Oikonomopoulos et al., 2005; Willems et al., 2008), among which human action recognition systems based on the bag-of-words (BoW) model have achieved good results in many tasks. This is largely because the BoW model has many advantages, such as being less sensitive to partial occlusions and clutter and avoiding some preliminary steps, e.g., the background subtraction and target tracking required by holistic methods. However, since the BoW model is based on mapping local features of each video sequence onto a pre-learned dictionary, quantization errors are inevitably introduced during codebook creation. These errors propagate to the final representation and degrade the recognition performance, although soft-assignment coding can, to a certain extent, alleviate them (van Gemert et al., 2008). In this paper, we approach the problem by using 2D image templates to represent actions. A video sequence with actions occurring in it contains two kinds of principal information, namely the spatial and temporal structures and the motion. Given a set of cuboids extracted from the video sequence, we propose to encode each cuboid by structural planes and motion templates. The structural planes are three orthogonal slices whose point of intersection falls on the center of the cuboid. These planes sparsely capture the structural features spatially and temporally, while the motion templates are motion history images (MHI), which encode the motion cues of the cuboid. Therefore, a 3D cuboid is represented by four images, i.e., the three orthogonal planes (TOP) and the motion history image (MHI), which retain most of the structural and motion information. We can now encode these four images with informative and discriminative 2D descriptors. Describing those images directly is often inefficient due to information redundancy. Inspired by the success of Laplacian pyramid coding in images (Burt and Adelson, 1983), we propose to encode the images based on the Laplacian pyramid model. Each image is decomposed into a series of sub-bands. Features of different scales are segregated in each sub-band and are enhanced simultaneously. Motion and structural information exists in each sub-band in the form of edges and boundaries, which are extracted by Gabor filters. Finally, a max pooling technique is applied to the responses of the filtering to make the extracted features more discriminative and invariant. The advantages of the proposed method over the state of the art lie in the following aspects: (1) the combination of motion history images (MHI) and three orthogonal planes (TOP) explicitly encodes the motion and structure information in cuboids; moreover, operations are transferred from 3D to 2D, which enables the adoption of the 2D Laplacian pyramid coding descriptor. (2) The proposed descriptor based on Laplacian pyramid coding (LPC) inherits the advantages of multi-scale analysis and provides an informative representation of the image templates, i.e., MHI and TOP.
(3) The employment of localized soft-assignment coding alleviates the quantization error induced by hard assignment, and the max pooling operation makes the representation of actions more invariant and robust. Having obtained the local features for all the cuboids extracted from a video sequence, we then need a global representation of the human action. Based on the standard BoW model, we employ an improved version using localized soft-assignment coding and max pooling, which has been proven to outperform the baseline BoW model in image classification (Liu et al., 2011). In the remainder of the paper, we first briefly review related work on human action classification and recognition in Section 2.

The key methods used in our action recognition algorithm are described in Section 3. Then Section 4 details the experiments and results. Finally, we draw the conclusion in Section 5.

2. Related work

The past decade has witnessed dramatic advances in human action recognition. Different approaches have been proposed and developed for both global and local representations. A comprehensive overview can be found in a recent survey on vision-based human action recognition (Weinland et al., 2011). Based on spatio-temporal features, a large number of local and holistic approaches have been proposed for human action representation.

In holistic representations, structural information is well preserved and features are directly extracted from raw video sequences rather than based on interest point detection. Bobick and Davis (2002) presented motion templates obtained by projecting frames onto a single image, namely motion history images (MHI) and motion energy images (MEI). The MHI indicates how the motion happens and the MEI records where it is. Motion templates can capture the motion patterns occurring in a video sequence; therefore this representation gives a satisfactory performance where the background is relatively static. However, certain structural information is unavoidably lost during the projections. Jhuang et al. (2007) proposed a biologically inspired system based on a hierarchy of increasingly complex and invariant spatio-temporal feature detectors. The spatio-temporal oriented features obtained from Gabor filtering achieve the best recognition rates compared with features based on spatio-temporal gradients and optical flow. Unfortunately, their recognition system is computationally expensive due to the complex hierarchy. Schindler and Van Gool (2008) found that very short snippets (1-7 frames) are sufficient for basic action recognition. They applied log-Gabor filters and optical flow filters to raw frames to extract form and motion features, respectively. Since these models take the motion and structural information (shape) of human actions into consideration, they perform well in action recognition tasks. By extending restricted Boltzmann machines (RBMs) to the spatio-temporal domain, Taylor et al. (2010) proposed a novel convolutional gated restricted Boltzmann machine (GRBM) for learning spatio-temporal features. A probabilistic max pooling technique (Lee et al., 2009) was integrated into their model. Similarly, Ji et al. (2010) developed a 3D convolutional neural network (CNN), directly extended from its two-dimensional counterpart, for feature extraction. In a 3D CNN, motion information in multiple adjacent frames is captured by performing convolutions over the spatial and temporal dimensions. However, the number of parameters to be learned in those deep learning models (Le et al., 2011; Taylor et al., 2010) is very large, sometimes too large relative to the available number of training samples, which unfortunately restricts their applicability.

In local representations based on the BoW model, algorithms have been proposed for both spatio-temporal feature extraction and representation. Schuldt et al. (2004) calculated local jets to capture local features carrying information about the motion and spatial appearance of events in image sequences, and employed a local SVM approach for action classification. Dollár et al. (2005) combined normalized pixel values, the brightness gradient and windowed optical flow to capture the motion and structure features in cuboids. With the success of SIFT (Lowe, 2004) in image description, Scovanner et al. (2007) extended it to the spatio-temporal domain and proposed a descriptor named SIFT3D.
Similarly, Kläser et al. (2008)
proposed a novel descriptor based on histograms of oriented spatio-temporal gradients, which has demonstrated effective performance on many action recognition tasks. Laptev et al. (2008) combined HOG and HOF to describe cuboids extracted from video sequences, and used a non-linear support vector machine with a multi-channel chi-square kernel that robustly combines feature channels for action classification. They achieved state-of-the-art recognition rates on several datasets. Kovashka and Grauman (2010) exploited the relationships among interest points by taking into account the visual words to which neighboring features correspond. They created a hierarchy of codebooks to capture space-time configurations. Recently, Le et al. (2011) applied independent subspace analysis (ISA), an extension of independent component analysis (ICA), to learn hierarchical invariant spatio-temporal features from local video patches. It achieved state-of-the-art performance when combined with deep learning techniques such as stacking and convolution. However, those deep learning models are usually computationally expensive because learning the parameters takes a long time.

3. Methodology

3.1. Spatio-temporal interest point detection

The feature detection method used for the local representation is the periodic feature detector proposed by Dollár et al. (2005). The detector is based on a set of separable linear filters which treat the spatial and temporal dimensions in different ways. The response function is given by:

R = (I \ast g \ast h_{even})^2 + (I \ast g \ast h_{odd})^2 \qquad (1)

where ∗ denotes the convolution operation, g(x, y; σ) is a 2D Gaussian smoothing kernel applied only along the spatial dimensions, and h_even and h_odd are a quadrature pair of 1D Gabor filters applied only temporally. They are defined as:

h_{even}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2} \qquad (2)

and

h_{odd}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2} \qquad (3)
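For illustration, the following is a minimal NumPy/SciPy sketch of the response function in Eqs. (1)-(3); it is our own sketch rather than the authors' implementation, and it assumes the video is a grayscale array of shape (T, H, W) with σ and τ set by the user as discussed next.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def periodic_detector_response(video, sigma=2.0, tau=2.5):
    """Response of the periodic detector of Dollar et al. (2005), Eqs. (1)-(3).

    video : float array of shape (T, H, W), grayscale intensities.
    sigma : spatial scale of the 2D Gaussian kernel g.
    tau   : temporal scale of the quadrature Gabor pair.
    """
    omega = 4.0 / tau                       # suggested setting: omega = 4 / tau
    # Spatial smoothing with g(x, y; sigma), applied frame-wise only.
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))

    # 1D temporal Gabor quadrature pair h_even, h_odd (Eqs. (2), (3)).
    t = np.arange(-int(3 * tau), int(3 * tau) + 1, dtype=float)
    envelope = np.exp(-t ** 2 / tau ** 2)
    h_even = -np.cos(2 * np.pi * t * omega) * envelope
    h_odd = -np.sin(2 * np.pi * t * omega) * envelope

    # Temporal convolution along axis 0, then the squared quadrature sum (Eq. (1)).
    r_even = convolve1d(smoothed, h_even, axis=0, mode='nearest')
    r_odd = convolve1d(smoothed, h_odd, axis=0, mode='nearest')
    return r_even ** 2 + r_odd ** 2
```

Interest points are then taken as local maxima of the response volume, and cuboids are cropped around them.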

The parameters σ and τ correspond roughly to the spatial and temporal scales of the detector, and they are set manually by the user. The authors suggested keeping ω = 4/τ. We follow the original settings in Dollár et al. (2005) for all the parameters. The response function gives the strongest responses where there are periodic motions; however, the detector also responds strongly to a range of other motions, e.g., local regions exhibiting complex motion patterns such as space-time corners. This method can detect a high number of space-time interest points, and has been shown to be faster, simpler and more precise, and to give better performance even though only one scale is used (Dollár et al., 2005; Shao and Mattivi, 2010). It is therefore adopted for STIP detection in this work. Cuboids are extracted around the detected spatio-temporal interest points.

3.2. Structural planes

The spatial and temporal structural information of a cuboid is encoded in our structural planes. The three orthogonal planes, namely the XY, XT and YT planes, are slices whose point of intersection falls on the center of a cuboid. The XY plane captures the mid-frame pose of the cuboid, which gives the main spatial structure, while the XT and YT planes record the temporal structures, namely the dynamic structure of the cuboid. Therefore, these three planes contain complementary structural information. Fig. 1(b) illustrates an example of the three orthogonal planes.

3.3. Motion templates

Motion history images (MHI), proposed by Bobick and Davis (2002), are used to encode the motion in a cuboid. All frames in a cuboid are projected onto one image along the temporal axis, with recent motion emphasized more than motion that happened in the past. Assume I(x, y, t) is an image sequence and let D(x, y, t) be a binary image sequence indicating regions of motion, which can be obtained from image differencing. The motion history image (MHI), H_τ(x, y, t), records how the motion image is moving, and is obtained with a simple replacement and decay operator:

H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max(0,\, H_\tau(x, y, t-1) - 1) & \text{otherwise} \end{cases} \qquad (4)

Fig. 1. (a) A cuboid extracted from a video sequence. (b) Image templates, i.e., the three orthogonal planes (TOP), and (c) the motion history image (MHI) extracted from the cuboid with the action 'boxing'.
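As a concrete illustration of Sections 3.2 and 3.3, the sketch below extracts the three orthogonal planes and the motion history image of Eq. (4) from a cuboid of shape (T, H, W). It is our own sketch: the frame-differencing threshold and the choice of τ equal to the cuboid length are assumptions, not values stated in the paper.

```python
import numpy as np

def three_orthogonal_planes(cuboid):
    """Central XY, XT and YT slices of a (T, H, W) cuboid (Section 3.2)."""
    T, H, W = cuboid.shape
    xy = cuboid[T // 2, :, :]      # mid-frame pose: main spatial structure
    xt = cuboid[:, H // 2, :]      # temporal slice through the central row
    yt = cuboid[:, :, W // 2]      # temporal slice through the central column
    return xy, xt, yt

def motion_history_image(cuboid, diff_threshold=25):
    """MHI of a cuboid following Eq. (4); diff_threshold is an assumed value."""
    T = cuboid.shape[0]
    tau = float(T)                                # assumed duration: cuboid length
    H_mhi = np.zeros(cuboid.shape[1:], dtype=float)
    for t in range(1, T):
        # Binary motion mask D(x, y, t) from simple frame differencing.
        D = np.abs(cuboid[t].astype(float) - cuboid[t - 1].astype(float)) > diff_threshold
        H_mhi = np.where(D, tau, np.maximum(0.0, H_mhi - 1.0))
    return H_mhi
```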


Fig. 2. Construction of the Laplacian pyramid.

In Eq. (4), τ is the duration defining the temporal range of the motion. Fig. 1(c) shows an example of the motion history image for a cuboid extracted from a video sequence with the action 'boxing'.

3.4. Laplacian pyramid coding

The Laplacian pyramid was introduced by Burt and Adelson (1983) for compact image coding. In Laplacian pyramid coding, images are decomposed into a series of sub-bands, and features are localized in spatial frequency as well as in space. The image pyramid is a data structure designed to support efficient scaled convolution through reducing the image resolution. It consists of a sequence of copies of an original image in which both sampling density and resolution are decreased in regular steps. A pyramid is thus a multi-scale representation built by a recursive method. Images are composed of features of many different sizes; therefore, to encode the motion and structural information in the feature maps, a multi-scale analysis technique is needed. Fig. 2 shows an example of the Gaussian pyramid and the Laplacian pyramid.

Gaussian pyramid. The Gaussian pyramid is a widely used multi-scale representation of images, and is the first step in the construction of the Laplacian pyramid. By convolving each input feature map with a dyadic Gaussian kernel and downsampling, a series of low-pass filtered images is obtained. A main advantage of the pyramid operation is that the image size decreases exponentially with the scale level, and hence so does the amount of computation required to process the data. To be precise, the levels of the pyramid are obtained iteratively as follows:

G_l(i, j) = \sum_{m} \sum_{n} w(m, n)\, G_{l-1}(2i + m,\, 2j + n) \qquad (5)

where l indexes the level of the pyramid and w(m, n) is the Gaussian weighting function.

Laplacian pyramid. The Laplacian pyramid is a sequence of error images L_0, L_1, ..., L_N. Each is the difference between two adjacent levels of the Gaussian pyramid. Thus, for 0 ≤ l < N,

L_l = G_l - \mathrm{EXPAND}(G_{l+1}) \qquad (6)
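A minimal sketch of the decomposition in Eqs. (5) and (6) is given below, using OpenCV's pyrDown/pyrUp as the REDUCE and EXPAND operators; this is an implementation choice of ours, as the paper does not prescribe a particular library.

```python
import cv2
import numpy as np

def laplacian_pyramid(image, num_levels=3):
    """Decompose an image template into Laplacian sub-bands (Eqs. (5)-(6)).

    Returns [L_0, ..., L_{N-1}, G_N]: the band-pass levels plus the final
    low-pass residual.
    """
    image = image.astype(np.float32)
    gaussian = [image]
    for _ in range(num_levels):
        gaussian.append(cv2.pyrDown(gaussian[-1]))           # Eq. (5): REDUCE

    laplacian = []
    for l in range(num_levels):
        expanded = cv2.pyrUp(gaussian[l + 1],
                             dstsize=(gaussian[l].shape[1], gaussian[l].shape[0]))
        laplacian.append(gaussian[l] - expanded)              # Eq. (6): G_l - EXPAND(G_{l+1})
    laplacian.append(gaussian[-1])                            # keep the low-pass residual
    return laplacian
```

In our experiments three pyramid layers are used (see Section 4), so num_levels=3 mirrors that setting.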

The construction of a Laplacian pyramid is illustrated in Fig. 2, from which we can see that motion-related features are intensified in each level of the Laplacian pyramid.

3.5. Feature extraction

Feature extraction using an architecture with two stages, namely a filter bank and a feature pooling technique, performs

better than that with a single stage (Jarrett et al., 2009). Hence, we employ a two-stage approach for our feature extraction: (i) applying a bank of Gabor filters to intensify the edge information at multiple orientations; and (ii) performing nonlinear max pooling within each band of Gabor filters and over spatial neighborhoods.

Gabor filtering. Due to their biologically plausible properties, Gabor filters are widely used in visual recognition systems (Jhuang et al., 2007; Song and Tao, 2010; Siagian and Itti, 2007; Serre et al., 2005), and they provide a useful and reasonably accurate description of most spatial aspects of simple receptive fields. The Laplacian pyramid representation does not introduce any spatial orientation selectivity into the decomposition process, whereas Gabor filters are able to extract orientation information owing to the properties they share with mammalian cortical cells, such as spatial localization, orientation selectivity and spatial frequency characterization. Moreover, motion-related edges and boundaries are enhanced after the Gabor filtering. The 2D Gabor mother function is defined as:

F(x, y) = \exp\!\left(-\frac{x_0^2 + \gamma y_0^2}{2\sigma^2}\right) \cos\!\left(\frac{2\pi x_0}{\lambda}\right) \qquad (7)

where x_0 = x cos θ + y sin θ and y_0 = -x sin θ + y cos θ; the kernel size range determines the scales of the Gabor filters and θ determines their orientations. Gabor filters with eight scales, ranging from 7 × 7 to 21 × 21 pixels, and four orientations (0°, 45°, 90° and 135°) are used.

Max pooling. With good feature selectivity and invariance, feature pooling is employed in many modern visual recognition algorithms, from pooling over image pixels (Mutch and Lowe, 2008; Song and Tao, 2010) to pooling across activations of local features on a dictionary in sparse coding (Yang et al., 2009). It preserves task-related information while removing irrelevant details. Pooling is used to achieve invariance to image transformations, more compact representations, and better robustness to noise and clutter (Boureau et al., 2010). The MAX mechanism was proposed by Riesenhuber et al. (1999) in a hierarchical model of object recognition. This max-like feature selection operation provides a more robust response in the case of recognition in clutter or with multiple stimuli in the receptive field. We adopt max pooling in the second stage of feature extraction. Pooling between scales within each band of Gabor filters results in invariance to a range of scales, while pooling over spatial neighbors leads to local robustness to position shifts and invariance to noise.
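To illustrate the two-stage extraction on a single sub-band, the sketch below builds a small Gabor bank with OpenCV's getGaborKernel and then max-pools the filter responses across scales within each band and over a grid of spatial neighborhoods. It is our illustration only: the per-kernel σ and λ settings and the 4 × 4 pooling grid are assumptions, as the paper does not report them.

```python
import cv2
import numpy as np

def gabor_bank(sizes=(7, 9, 11, 13, 15, 17, 19, 21),
               orientations=(0, 45, 90, 135)):
    """Eight scales (7x7 to 21x21 pixels) x four orientations, as in Section 3.5."""
    bank = {}
    for theta_deg in orientations:
        kernels = []
        for ksize in sizes:
            # sigma and lambd are chosen heuristically; the paper does not report them.
            k = cv2.getGaborKernel((ksize, ksize), sigma=0.3 * ksize,
                                   theta=np.deg2rad(theta_deg),
                                   lambd=0.8 * ksize, gamma=0.3)
            kernels.append(k)
        bank[theta_deg] = kernels
    return bank

def filter_and_pool(sub_band, bank, grid=4):
    """Gabor filtering followed by max pooling over scales within each band and
    over a grid of spatial neighborhoods; returns one feature vector per sub-band."""
    sub_band = sub_band.astype(np.float32)
    features = []
    for kernels in bank.values():
        # Max over scales within the filter band.
        responses = [cv2.filter2D(sub_band, cv2.CV_32F, k) for k in kernels]
        pooled_scales = np.max(np.stack(responses, axis=0), axis=0)
        # Max over spatial neighborhoods (grid x grid cells).
        for rows in np.array_split(pooled_scales, grid, axis=0):
            for cell in np.array_split(rows, grid, axis=1):
                features.append(cell.max())
    return np.asarray(features)
```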
3.6. Feature coding

To be self-contained, this section revisits current coding methods based on a visual codebook created by a clustering algorithm, e.g., k-means, or on a set of basis vectors generated by sparse coding.

Notations. Let b_j denote a visual word or a basis vector, and let B ∈ R^{d×n} denote a codebook or a set of basis vectors, where d is the dimensionality of the local feature vectors. Given the ith local feature x_i corresponding to a cuboid extracted from a video sequence, u_i ∈ R^n is the coding coefficient vector of x_i on the codebook or basis vectors, and u_{i,j} is the coefficient associated with the word b_j.

Hard-assignment coding. Given a set of local features x_i for a video sequence, their coding coefficients are determined by assigning each local feature x_i to its nearest codeword in the codebook in terms of a certain distance metric. If the Euclidean distance is used, then

u_{i,j} = \begin{cases} 1 & \text{if } j = \arg\min_{j=1,\dots,n} \lVert x_i - b_j \rVert_2^2 \\ 0 & \text{otherwise} \end{cases} \qquad (8)

Since local features are assigned to their nearest codewords, quantization errors are inevitably induced, which is one of the key deficiencies of hard-assignment coding. Besides, hard-assignment coding ignores the relationships between different codewords. Soft-assignment coding was therefore introduced to alleviate the quantization errors (van Gemert et al., 2008).

Soft-assignment coding. The coefficient u_{i,j} is the degree of membership of a local feature x_i to the jth codeword:

u_{i,j} = \frac{\exp(-\beta \lVert x_i - b_j \rVert_2^2)}{\sum_{k=1}^{n} \exp(-\beta \lVert x_i - b_k \rVert_2^2)} \qquad (9)

where β is the smoothing factor controlling the softness of the assignment.

Sparse coding. A local feature is represented by a linear combination of a sparse set of basis vectors. The coefficients are obtained by solving an ℓ1-norm regularized approximation problem:

u_i = \arg\min_{u \in \mathbb{R}^n} \lVert x_i - B u \rVert_2^2 + \lambda \lVert u \rVert_1 \qquad (10)

where λ controls the sparsity of the coefficients. Sparse coding achieves lower reconstruction errors by using multiple bases. However, similar patches may be reconstructed by quite different bases due to the over-completeness of the codebook, which can be overcome by introducing locality constraints into the coding process (Wang et al., 2010).

Locality-constrained linear coding (LLC). Instead of enforcing sparsity, LLC (Wang et al., 2010) confines a local feature x_i to be coded by its local neighbors in the codebook. The locality constraint ensures that similar patches have similar codes. The coding coefficients are obtained by solving the following optimization problem:

u_i = \arg\min_{u \in \mathbb{R}^n} \lVert x_i - B u \rVert_2^2 + \lambda \lVert d_i \odot u \rVert_2^2, \quad \text{s.t. } \mathbf{1}^{T} u_i = 1 \qquad (11)

where ⊙ denotes element-wise multiplication, and d_i ∈ R^M is the locality adaptor that gives a different degree of freedom to each basis vector, proportional to its similarity to the input descriptor x_i. Specifically,

d_i = \exp\!\left(\frac{\mathrm{dist}(x_i, B)}{\sigma}\right) \qquad (12)

where dist(x_i, B) = [dist(x_i, b_1), ..., dist(x_i, b_M)]^T, dist(x_i, b_j) is the Euclidean distance between x_i and b_j, and σ adjusts the weight decay speed of the locality adaptor.


As an approximation of LLC, one can simply use the k nearest neighbors of x_i as the local bases B_i and solve a much smaller linear system.

Pooling. Having obtained the coefficients for all the local features, a final representation p ∈ R^n of a video sequence is obtained by pooling over the coefficients. If sum pooling is used, p_j = \sum_{i=1}^{l} u_{i,j}, where l is the total number of local features in the video sequence; average pooling is obtained by dividing p_j by l. These pooling operations are widely used with the BoW model in action recognition. With max pooling, the jth component of p is obtained as p_j = \max_i u_{i,j}, i = 1, 2, ..., l. Max pooling combined with sparse coding or soft-assignment coding has been used in image classification, but had not previously been applied to action recognition. In this work, we integrate localized soft-assignment coding with max pooling for human action representation.

4. Experiments and results

The proposed Laplacian pyramid coding (LPC) descriptor is evaluated on the benchmark KTH dataset and the newly released HMDB51 dataset. To demonstrate its effectiveness and efficiency as a descriptor, we compare it with popular descriptors such as local binary patterns (LBP) and pyramid histograms of oriented gradients (PHOG). We use three layers of pyramids in our LPC descriptor, and two levels of pyramids in the PHOG descriptor, which give satisfactory results (Shao et al., 2011). For the LBP descriptor, we follow the original settings in Ojala et al. (2002). A linear SVM is employed for action classification.

4.1. Datasets and experimental settings

The KTH dataset (Schuldt et al., 2004) is a commonly used benchmark action dataset with 2391 video clips. Six human action classes, including walking, jogging, running, boxing, hand waving and hand clapping, are performed by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors with lighting variation (s4). We follow the original experimental setup of the authors, i.e., we divide the samples into a test set (9 subjects: 2, 3, 5, 6, 7, 8, 9, 10, and 22) and a training set (the remaining 16 subjects).

The HMDB51 dataset (Kuehne et al., 2011) has recently been released and contains 51 distinct categories with at least 101 clips each, for a total of 6766 video clips extracted from a wide range of sources. It is the largest and perhaps the most realistic dataset to date. The action categories in this dataset can be grouped into five types: (1) general facial actions; (2) facial actions with object manipulation; (3) general body movements; (4) body movements with object interaction and (5) body movements for human interaction. As the focus of our work is action recognition, we test our algorithm on a subset of this dataset, i.e., the general body movements, with 19 action categories including cartwheel, clap hands, climb, climb stairs, dive, fall on the floor, backhand flip, handstand, jump, pull up, push up, run, sit down, sit up, somersault, stand up, turn, walk and wave. 2963 stabilized clips with one person involved in the action and all three levels of video quality (i.e., bad, medium and good) are used in our evaluation.

4.2. Results

Baseline. We first conduct experiments on the BoW model with hard assignment as the baseline.
To make the comparison fair, we replace our Laplacian pyramid coding (LPC) descriptor with the PHOG and LBP descriptors while keeping all other settings the same. Tables 1 and 2 show the baseline comparison results on the KTH and HMDB51 datasets. The proposed LPC descriptor


Table 1
Recognition rate (%) comparison of the Laplacian pyramid coding (LPC) with local binary pattern (LBP) and pyramid histograms of oriented gradients (PHOG) on the KTH dataset.

#Words    1000    1500    2000    2500    3000
LPC       88.5    89.7    89.9    91.4    90.6
LBP       80.5    84.0    84.0    85.7    86.6
PHOG      87.0    86.3    88.9    86.4    87.9

Table 2
Recognition rate (%) comparison of LPC with LBP and PHOG on the HMDB51 dataset.

#Words    1000    1500    2000    2500    3000
LPC       20.8    21.2    25.6    22.2    22.4
LBP       19.3    17.3    19.9    19.4    18.2
PHOG      21.2    20.8    19.9    20.5    21.5

outperforms the LBP and PHOG descriptors over a wide range of codebook sizes, which demonstrates that our LPC descriptor is informative and discriminative for action representation. The superior performance of the proposed LPC descriptor results from the multi-scale analysis, i.e., the Laplacian pyramid, and the max pooling operations. The Laplacian pyramid model encodes more information than the single-scale LBP and PHOG descriptors, and the max pooling operations select invariant features over local patches and between adjacent scales, which makes the representation more robust. Note that even on the baseline BoW model with hard assignment, our descriptor still achieves a state-of-the-art recognition rate and is comparable to the HOG/HOF descriptor proposed by Laptev et al. (2008); however, their recognition system employs a nonlinear support vector machine with a multi-channel χ² kernel. In addition, a comparison with state-of-the-art descriptors is shown in Table 3. Our LPC descriptor is the best among all the descriptors listed in Table 3.

Localized soft-assignment coding and max pooling. To improve the performance, we employ localized soft-assignment coding based on the BoW model. In the feature coding, we use max pooling over the activations of all local features on the codewords. As the main parameter is the number of nearest neighbors k, we report the recognition rates with respect to k. Figs. 3 and 4 plot the accuracy curves as k varies on the KTH and HMDB51 datasets. The proposed LPC descriptor achieves recognition rates of 92.2% and 27.1% on KTH and HMDB51, respectively (the accuracy on HMDB51 is the average over three split settings). The result on the KTH dataset is comparable with the state-of-the-art performance, which is impressive considering the sophisticated techniques used in those recognition systems. To evaluate the performance of different coding methods, we carry out comparison experiments with our proposed LPC descriptor. The experimental results are shown in Table 4. To make the comparison fair, we fix the codebook size at 1500 and 2000 for the KTH and HMDB51 datasets, respectively. In the feature pooling stage, we employ max pooling for all the methods except hard assignment. From Table 4, we can see that all the improved versions of the BoW model, i.e., soft assignment, locality-constrained linear coding (LLC) and localized soft-assignment coding (LSC), boost the performance, especially LLC and LSC.
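For reference, the following is a minimal sketch of the coding and pooling steps compared in Table 4: localized soft-assignment coding, i.e., Eq. (9) restricted to the k nearest codewords as in Liu et al. (2011), followed by max pooling over all local features of a video. It is our own sketch under the stated assumptions, not the authors' code, and the default parameter values are illustrative only.

```python
import numpy as np

def localized_soft_assignment(features, codebook, k=5, beta=1.0):
    """Coding coefficients of local features (rows of `features`) on a codebook
    (rows of `codebook`), with soft assignment restricted to the k nearest
    codewords (Eq. (9)); all other coefficients are zero."""
    # Squared Euclidean distances between every feature and every codeword.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    codes = np.zeros_like(d2)
    for i, row in enumerate(d2):
        nearest = np.argsort(row)[:k]                  # k nearest codewords
        weights = np.exp(-beta * row[nearest])
        codes[i, nearest] = weights / weights.sum()    # normalized memberships
    return codes

def max_pool_video(codes):
    """Video-level representation: max over all local features per codeword."""
    return codes.max(axis=0)

# Usage: codes = localized_soft_assignment(X, B); video_repr = max_pool_video(codes)
```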

Table 3
Comparison of the Laplacian pyramid coding (LPC) with state-of-the-art descriptors on the KTH dataset.

Descriptor      Local jets    Gradients    HOG     HOF     LPC
Accuracy (%)    71.7          86.7         81.6    89.7    91.4

Fig. 3. The impact of k on the performance of LPC, LBP and PHOG on the KTH dataset. The codebook size is 1500.

Fig. 4. The impact of k on the performance of LPC, LBP and PHOG on the HMDB51 dataset. The codebook size is 2000.

Table 4
Comparison of different coding methods with the LPC descriptor on the KTH and HMDB51 datasets.

Dataset    Hard    Soft    LLC     LSC
KTH        89.7    90.5    92.1    92.2
HMDB51     25.6    26.1    27.5    27.1

5. Conclusion

In this paper, we have introduced a new descriptor based on Laplacian pyramid coding (LPC) for the local representation of human actions. Three orthogonal planes (TOP) and the motion history image (MHI) are first extracted from a cuboid. A Laplacian pyramid is employed to decompose each of these images into a series of sub-band feature maps. For each feature map, Gabor filtering is applied to enhance the motion-related edges and boundaries as well as the orientation information. Max pooling is performed on the outputs of the Gabor filtering to obtain discriminative and invariant local features for the description of each cuboid. Finally, localized soft-assignment coding with max pooling on the BoW model is used to encode the local features as the final representation of human actions. The proposed LPC descriptor is evaluated on the benchmark KTH dataset and the challenging HMDB51 dataset. Experimental results demonstrate that LPC outperforms state-of-the-art descriptors such as HOG, HOF, LBP and PHOG.


Although our proposed LPC descriptor is slower to compute than PHOG and LBP, the computational costs remain comparable, and LPC significantly outperforms PHOG and LBP in terms of recognition rates. Our method achieves comparable and even better performance than state-of-the-art methods based on spatio-temporal features. This is due to the explicit encoding of motion and structure: the employed image templates, i.e., the motion history image (MHI) and the three orthogonal planes (TOP), are able to capture sufficient action-related information while enjoying the benefit of low computational cost. The proposed Laplacian pyramid coding descriptor has also proven to be informative and discriminative, outperforming popular 2D descriptors such as LBP and PHOG thanks to the multi-scale analysis employed in the Laplacian pyramid. We have also introduced feature coding methods originally used in image classification, which are demonstrated to be effective for action recognition in the spatio-temporal domain.

References

Bobick, A., Davis, J., 2002. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Machine Intell. 23, 257–267.
Boureau, Y., Ponce, J., LeCun, Y., 2010. A theoretical analysis of feature pooling in visual recognition. In: Internat. Conf. on Machine Learning, pp. 111–118.
Burt, P., Adelson, E., 1983. The Laplacian pyramid as a compact image code. IEEE Trans. Comm. 31, 532–540.
Dollár, P., Rabaud, V., Cottrell, G., Belongie, S., 2005. Behavior recognition via sparse spatio-temporal features. In: 2nd Joint IEEE Internat. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. IEEE, pp. 65–72.
Jarrett, K., Kavukcuoglu, K., Ranzato, M.A., LeCun, Y., 2009. What is the best multi-stage architecture for object recognition? In: Internat. Conf. on Computer Vision (ICCV'09), pp. 2146–2153.
Jhuang, H., Serre, T., Wolf, L., Poggio, T., 2007. A biologically inspired system for action recognition. In: IEEE 11th Internat. Conf. on Computer Vision (ICCV 2007). IEEE, pp. 1–8.
Ji, S., Xu, W., Yang, M., Yu, K., 2010. 3D convolutional neural networks for human action recognition. In: Proc. of the 27th Internat. Conf. on Machine Learning.
Kläser, A., Marszałek, M., Schmid, C., 2008. A spatio-temporal descriptor based on 3D-gradients. In: British Machine Vision Conf. (BMVC'08), pp. 995–1004.
Kovashka, A., Grauman, K., 2010. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In: 2010 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 2046–2053.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T., 2011. HMDB: A large video database for human motion recognition. In: 2011 IEEE Internat. Conf. on Computer Vision (ICCV). IEEE, pp. 2556–2563.
Laptev, I., Lindeberg, T., 2003. Space-time interest points. In: IEEE Internat. Conf. on Computer Vision (ICCV'03).
Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B., 2008. Learning realistic human actions from movies. In: IEEE Internat. Conf. on Computer Vision and Pattern Recognition (CVPR'08), pp. 1–8.
Lee, H., Grosse, R., Ranganath, R., Ng, A., 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Proc. of the 26th Annual Internat. Conf. on Machine Learning. ACM, pp. 609–616.
Le, Q., Zou, W., Yeung, S., Ng, A., 2011. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: 2011 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 3361–3368.
Liu, L., Wang, L., Liu, X., 2011. In defense of soft-assignment coding. In: 2011 IEEE Internat. Conf. on Computer Vision (ICCV). IEEE, pp. 2486–2493.
Lowe, D., 2004. Distinctive image features from scale-invariant keypoints. Internat. J. Comput. Vision 60, 91–110.
Mutch, J., Lowe, D., 2008. Object class recognition and localization using sparse features with limited receptive fields. Internat. J. Comput. Vision 80, 45–57.
Oikonomopoulos, A., Patras, I., Pantic, M., 2005. Spatiotemporal salient points for visual recognition of human actions. IEEE Trans. Systems Man Cybernet. Part B: Cybernetics 36, 710–719.
Ojala, T., Pietikainen, M., Maenpaa, T., 2002. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Machine Intell. 24, 971–987.
Riesenhuber, M., Poggio, T., 1999. Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019–1025.
Schindler, K., Van Gool, L., 2008. Action snippets: How many frames does human action recognition require? In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2008). IEEE, pp. 1–8.
Schuldt, C., Laptev, I., Caputo, B., 2004. Recognizing human actions: A local SVM approach. In: Proc. 17th Internat. Conf. on Pattern Recognition (ICPR 2004). IEEE, pp. 32–36.
Scovanner, P., Ali, S., Shah, M., 2007. A 3-dimensional SIFT descriptor and its application to action recognition. In: Proc. 15th Internat. Conf. on Multimedia. ACM, pp. 357–360.
Serre, T., Wolf, L., Poggio, T., 2005. Object recognition with features inspired by visual cortex. In: IEEE Internat. Conf. on Computer Vision and Pattern Recognition (CVPR'05), pp. 994–1000.
Shao, L., Mattivi, R., 2010. Feature detector and descriptor evaluation in human action recognition. In: Proc. ACM Internat. Conf. on Image and Video Retrieval. ACM, pp. 477–484.
Shao, L., Zhen, X., Liu, Y., Ji, L., 2011. Human action representation using pyramid correlogram of oriented gradients on motion history images. Internat. J. Comput. Math. 88, 3882–3895.
Siagian, C., Itti, L., 2007. Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Trans. Pattern Anal. Machine Intell. 29, 300–312.
Song, D., Tao, D., 2010. Biologically inspired feature manifold for scene classification. IEEE Trans. Image Process. 19, 174–184.
Taylor, G.W., Fergus, R., LeCun, Y., Bregler, C., 2010. Convolutional learning of spatio-temporal features. In: European Conf. on Computer Vision (ECCV'10).
van Gemert, J., Geusebroek, J., Veenman, C., Smeulders, A., 2008. Kernel codebooks for scene categorization. In: Computer Vision – ECCV 2008, pp. 696–709.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y., 2010. Locality-constrained linear coding for image classification. In: 2010 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). IEEE, pp. 3360–3367.
Weinland, D., Ronfard, R., Boyer, E., 2011. A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vision Image Understand. 115, 224–241.
Willems, G., Tuytelaars, T., Van Gool, L., 2008. An efficient dense and scale-invariant spatio-temporal interest point detector. In: Computer Vision – ECCV 2008, pp. 650–663.
Yang, J., Yu, K., Gong, Y., Huang, T., 2009. Linear spatial pyramid matching using sparse coding for image classification. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR 2009). IEEE, pp. 1794–1801.
Yeffet, L., Wolf, L., 2009. Local trinary patterns for human action recognition. In: IEEE Internat. Conf. on Computer Vision (ICCV'09), pp. 492–497.