A unified framework for local visual descriptors evaluation




Pattern Recognition 48 (2015) 1174–1184



Olivier Kihl a, David Picard a, Philippe-Henri Gosselin a,b

a ETIS/ENSEA – Université Cergy-Pontoise, CNRS, UMR 8051, 6 avenue du Ponceau, CS 20707 Cergy, F-95014 Cergy-Pontoise Cedex, France
b INRIA Rennes Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes Cedex, France


Abstract

Article history: Received 29 November 2013; received in revised form 7 August 2014; accepted 20 November 2014; available online 28 November 2014.

Local descriptors are the ground layer of feature-based recognition systems for still images and videos. We propose a new framework for the design of local descriptors and their evaluation. This framework is based on the decomposition of descriptors into three levels: primitive extraction, primitive coding and code aggregation. With this framework, we are able to explain most of the popular descriptors in the literature, such as HOG, HOF or SURF. The framework provides an efficient and rigorous approach for the evaluation of local descriptors, and allows us to uncover the best parameters for each descriptor family. Moreover, we are able to extend the usual descriptors by changing the code aggregation or by adding new primitive coding methods. The experiments are carried out on an image dataset (VOC 2007) and on video datasets (KTH, Hollywood2, UCF11 and UCF101), and achieve performances equal to or better than the literature.

Keywords: Image processing and computer vision; Vision and scene understanding; Video analysis; Image/video retrieval; Object recognition; Feature representation

1. Introduction

Most multimedia retrieval systems compare multimedia documents (images or videos) through three main stages: extract a set of local visual descriptors from the multimedia document; learn a mapping of the set of descriptors into a single vector to obtain a signature; compute the similarity between signatures. In this paper, we focus on the computation of visual descriptors.

The main goal of local visual descriptors is to extract local properties of the signal. These properties are chosen so as to capture discriminative characteristic atoms in images or videos. Since local descriptors are the ground layer of recognition systems, efficient descriptors are necessary to achieve good accuracies. Such descriptors have become essential tools in still image classification [1,2] and video action classification [3–5].

The main contribution of this paper is a unified framework for visual descriptor evaluation that includes all the usual descriptors from the literature, such as SIFT (scale-invariant feature transform) [6], SURF (speeded up robust features) [7], HOG (histogram of oriented gradients) [8], HOF (histogram of oriented flow) and MBH (motion boundary histogram) [9]. This framework is based on the decomposition of the descriptor into three levels: primitive extraction, primitive coding and code aggregation. Each popular descriptor is composed of a given primitive, a given coding and a given aggregation.



Following these principles, we are able to perform a rigorous evaluation of many common descriptors and pinpoint which of the primitive, coding or aggregation is the source of their effectiveness. The consequence is that this evaluation allows us to improve our understanding of local descriptors. For example, we are able to show that the best coding method so far for the motion and gradient of motion primitives is a rectification, contrary to the widely used orientation coding as in the well-known HOF and MBH.

Moreover, our framework allows the design of novel, more efficient and complementary descriptors, which we take as our second contribution. Using the framework as a method to explore the possible combinations of primitive, coding and aggregation allows us to know precisely the gain of each changed step compared to existing descriptors. For example, we are able to propose new descriptors based on oscillating function aggregation which achieve the best performances for single descriptors once combined with the relevant primitive and coding steps. By knowing precisely the improvement caused by each changed step, we believe subsequent research can efficiently be focused on the steps where much of the gain is expected.

The paper is organized as follows. In Section 2, we present the most popular descriptors in the literature, for still images and for human action videos. Then, in Section 3, we present our framework, explain the most popular descriptors, and extend them by modifying some of these three steps. In Section 4, we propose an evaluation of the framework hyperparameters and parameters on one still image classification dataset and on two action classification datasets. Finally, in Section 5, we compare our results with the literature on one still image classification dataset and four action classification datasets, according to the best descriptors of our evaluation.

2. Related work

In this section, we present the most popular descriptors in the literature, first for still images and then for human action videos.

2.1. Still image descriptors

In the past 10 years, several descriptors have been proposed for key-point matching and successfully used for still image classification. The most commonly used are SIFT [6], SURF [7] and the histogram of oriented gradients (HOG) [9]. These descriptors are sometimes referred to as "edge descriptors" since they mostly consider the spatial repartition of gradient vectors around the key-point. SIFT and SURF are both interest point detectors and local image descriptors; in this paper, we only consider the descriptors. Several descriptors have been proposed with the aim of decreasing the computation time without loss of performance, for example SURF [7] and GLOH [10]. Similarly, Daisy [11] is a SIFT-like descriptor designed to be faster to compute in the case of dense matching extraction.

2.2. Action descriptors

In the early work on action recognition, silhouette-based descriptors were used. These descriptors are computed from the evolution of a silhouette obtained by background subtraction methods or by taking the difference of frames (DOF). The main silhouette-based descriptors are the "motion energy image" (MEI) [12], the "motion history image" (MHI) [12], the "average motion energy" (AME) and the "mean motion shape" (MMS) [13]. In [14], Kellokumpu et al. use histograms of "local binary patterns" (LBP) [15] to model the MHI and MEI images. As time is an important information in video, Gorelick et al. [16,17] study the silhouettes as space–time volumes. Space–time volumes are modeled with Poisson equations, from which they extract seven spatio-temporal characteristic components. The main drawback of all these methods is the computation of the silhouettes. Indeed, this computation is not robust, making these methods only relevant in controlled environments such as the Weizmann dataset [16] and the KTH dataset [5]. As a result, they tend to fail on more realistic datasets such as the UCF11 [18] or Hollywood2 [4] datasets.

Assuming that action recognition is closely linked to the notion of movement, many authors have proposed descriptors modeling the optical flow motion field [19–24]. The descriptor proposed by Ali and Shah [24] is based on the computation of many kinematic features on the motion field. Descriptors based on a polynomial approach for modeling the global optical flow are proposed in [25,26]. In [27], a local space–time descriptor based on polynomial approximation is proposed, named series of polynomial approximation of flow (SoPAF).

Finally, the most successful descriptors developed in recent years are extensions to video of the HOG [8] still image descriptor. The most commonly used are the histogram of oriented flow (HOF) [9] and the motion boundary histogram (MBH) [9]. HOF is the same as HOG but is applied to the optical flow instead of the gradient. MBH models the spatial derivatives of each component of the optical flow vector field with a HOG. In this context, several extensions of still image descriptors have been proposed, such as cuboids [28], 3DHOG [29], 3D-SIFT [30] and ESURF [31]. Recently, Wang et al. [3] proposed to model these usual descriptors along dense trajectories. The time evolution of trajectories, HOG, HOF and MBH is modeled using a space–time grid following pixel trajectories. The use of dense trajectories for descriptor extraction tends to increase the performance of the popular descriptors (HOG, HOF and MBH).

3. Primitive/coding/aggregation framework

In this section, we present the main contribution of this paper. We propose a framework providing a formal description of the steps needed to design local visual descriptors. Our framework splits descriptor extraction into three levels: primitive extraction, primitive coding and code aggregation. These three steps can be seen as hyperparameters of a descriptor.

3.1. Primitive extraction

At the primitive level, we extract a specific type of low-level information from an image or a video. The objective is to extract local properties of the signal. Generally, it relies on high-frequency filtering, linear for the gradient or non-linear in the case of motion (optical flow), filter banks such as Haar (SURF), easy extensions of popular filters [32], or non-linear operators. The primitive extraction induces a choice of relevant information and introduces data loss. Such primitives include the gradient (SIFT [6], HOG [8] and Daisy [11]), the responses to 2D Haar wavelets (SURF [7]), the motion flow (HOF [9], SoPAF [27]), or the gradient of the motion flow (MBH [9]). In Fig. 1, we show three examples of primitives used in the literature: the gradient, the motion flow and the gradient of the motion flow.

Fig. 1. Examples of primitives: (a) horizontal gradient; (b) vertical gradient; (c) horizontal motion flow; (d) vertical motion flow; (e) horizontal gradient of the horizontal motion flow; (f) vertical gradient of the horizontal motion flow; (g) horizontal gradient of the vertical motion flow; and (h) vertical gradient of the vertical motion flow.
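To make the primitive level concrete, the following minimal NumPy sketch computes the gradient primitive with first-order central differences. The exact derivative filter and border handling used by the authors are not specified here, so this is only an assumed illustration.

```python
import numpy as np

def gradient_primitive(img):
    """Return the horizontal and vertical gradient components Gx, Gy
    of a grayscale image using first-order central differences."""
    img = img.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    # central differences in the interior, borders left at zero
    gx[:, 1:-1] = (img[:, 2:] - img[:, :-2]) / 2.0
    gy[1:-1, :] = (img[2:, :] - img[:-2, :]) / 2.0
    return gx, gy
```

The motion primitive would instead provide the optical flow components (U, V), and the gradient of motion primitive applies the same differences to U and V separately.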


3.2. Primitive coding

The primitive coding corresponds to a non-linear mapping of the primitive to a higher dimensional space. The objective is to improve the representation by grouping together the primitive properties that are similar. In the literature, the most popular primitive coding is the quantization of local vector field orientations, as used in SIFT [6], HOG [8], Daisy [11], HOF [9] and MBH [9]. The quantization is usually performed on 8 bins. Let $G_x(\mathbf{x})$ and $G_y(\mathbf{x})$ be the horizontal and vertical primitive components of an image at position $\mathbf{x}$; the principal orientation bin is computed by

$$o(\mathbf{x}) = \left\lfloor \frac{\left(\operatorname{atan2}\left(G_y(\mathbf{x}), G_x(\mathbf{x})\right) \bmod 2\pi\right) \cdot 4}{\pi} \right\rfloor \qquad (1)$$

In order to limit the effect of the floor operator on the coding, the distance to the next orientation bin is computed by

$$r(\mathbf{x}) = \frac{\left(\operatorname{atan2}\left(G_y(\mathbf{x}), G_x(\mathbf{x})\right) \bmod 2\pi\right) \cdot 4}{\pi} - o(\mathbf{x}) \qquad (2)$$

The values associated with the bins $o(\mathbf{x})$ and $o(\mathbf{x})+1$ are

$$O(\mathbf{x}, o(\mathbf{x})) = \rho(\mathbf{x}) \cdot (1 - r(\mathbf{x})) \qquad (3)$$

$$O(\mathbf{x}, (o(\mathbf{x})+1) \bmod 8) = \rho(\mathbf{x}) \cdot r(\mathbf{x}) \qquad (4)$$

with $\rho(\mathbf{x})$ being the magnitude of the horizontal and vertical primitive components, $\rho(\mathbf{x}) = \sqrt{G_x(\mathbf{x})^2 + G_y(\mathbf{x})^2}$. This primitive coding does not introduce any loss of information or redundancy.

Another primitive coding is proposed in SURF [7]; here, we call it "absolute coding". In the SURF descriptor, it is applied to the gradient primitive. This is a four-dimensional code defined as

$$A(\mathbf{x}, 0) = G_x(\mathbf{x}) \qquad (5)$$
$$A(\mathbf{x}, 1) = G_y(\mathbf{x}) \qquad (6)$$
$$A(\mathbf{x}, 2) = |G_x(\mathbf{x})| \qquad (7)$$
$$A(\mathbf{x}, 3) = |G_y(\mathbf{x})| \qquad (8)$$

This primitive coding introduces redundancy. However, it produces lower-dimensional codes than the orientation coding.

In the context of action recognition and classification, the rectified coding proposed by Efros et al. [20] has been used by several authors [21,22]. They decompose the horizontal ($U$) and vertical ($V$) components of a vector field (usually obtained by optical flow approaches) with a half-wave rectification:

$$R(\mathbf{x}, 0) = \begin{cases} U(\mathbf{x}) & \text{if } U(\mathbf{x}) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

$$R(\mathbf{x}, 1) = \begin{cases} |U(\mathbf{x})| & \text{if } U(\mathbf{x}) < 0 \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

$$R(\mathbf{x}, 2) = \begin{cases} V(\mathbf{x}) & \text{if } V(\mathbf{x}) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

$$R(\mathbf{x}, 3) = \begin{cases} |V(\mathbf{x})| & \text{if } V(\mathbf{x}) < 0 \\ 0 & \text{otherwise} \end{cases} \qquad (12)$$
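To make the codings of Eqs. (1)–(12) concrete, here is a small NumPy sketch, an illustrative reading of the equations rather than the authors' implementation:

```python
import numpy as np

def orientation_coding(gx, gy, n_bins=8):
    """Soft assignment of each pixel's magnitude to two adjacent
    orientation bins, following Eqs. (1)-(4)."""
    theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)        # angle in [0, 2*pi)
    pos = theta * n_bins / (2 * np.pi)                   # continuous bin position
    o = np.floor(pos).astype(int) % n_bins               # principal bin, Eq. (1)
    r = pos - np.floor(pos)                              # distance to next bin, Eq. (2)
    rho = np.sqrt(gx ** 2 + gy ** 2)                     # magnitude
    code = np.zeros(gx.shape + (n_bins,))
    np.put_along_axis(code, o[..., None],
                      (rho * (1 - r))[..., None], axis=-1)               # Eq. (3)
    np.put_along_axis(code, ((o + 1) % n_bins)[..., None],
                      (rho * r)[..., None], axis=-1)                     # Eq. (4)
    return code

def absolute_coding(gx, gy):
    """Four-dimensional SURF-like code, Eqs. (5)-(8)."""
    return np.stack([gx, gy, np.abs(gx), np.abs(gy)], axis=-1)

def rectified_coding(u, v):
    """Half-wave rectification of a vector field, Eqs. (9)-(12)."""
    return np.stack([np.maximum(u, 0), np.maximum(-u, 0),
                     np.maximum(v, 0), np.maximum(-v, 0)], axis=-1)
```

The double rectified coding introduced below then simply amounts to concatenating the rectified and absolute codes along the channel axis.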

Unlike the absolute coding, the rectified coding does not introduce redundancy, while keeping the same small dimension. Orientation coding, absolute coding and rectified coding are the most used in the literature. Examples of these primitive codings are shown in Fig. 2. We also propose a new primitive coding called double rectified coding. This coding corresponds to the four components of the rectified coding together with the four components of the absolute coding.

3.3. Code aggregation

Finally, the code aggregation is used to model the encoded primitives. The objective of aggregation is to improve the robustness to deformation by allowing inexact matching between deformed image or video patches.

Most descriptors from the literature (SIFT, HOG, HOF, MBH and SURF) use an accumulation of each primitive code (typically a simple sum). In order to improve robustness, the accumulation is done inside the cells of a grid of N × N cells. In the case of video, the grid can be extended to N × N × T cells, with T being the number of cell bins in the time direction. In the case of SIFT descriptors, the grid is usually 4 × 4 cells. In the case of HOG, HOF and MBH for action recognition in [3], the grid is 2 × 2 spatially and three cells temporally. The spatial window can be weighted by a Gaussian to give more importance to the cells close to the center, as in SIFT. We show a 4 × 4 cell aggregation in Fig. 3a.

The regular grid can be replaced with concentric circles arranged in a polar manner, as proposed in DAISY [11]. The final pattern resembles a flower and is shown in Fig. 3b. In the following, we name this code aggregation "Flower". The flower aggregation is defined by three parameters R, Q and T. The radius R defines the distance from the center pixel to the outermost grid point. The quantization of the radius Q defines the number of convolved primitive layers associated with different Gaussian sizes (Q = 3 in Fig. 3b). The parameter T defines the angular quantization of the pattern at each layer (T = 8 in Fig. 3b).

The aggregation proposed in [25] is based on the projection of the primitive onto a two-dimensional orthogonal polynomial basis. The family of polynomial functions of two real variables is defined as follows:

$$P_{K,L}(x_1, x_2) = \sum_{k=0}^{K} \sum_{l=0}^{L} a_{k,l}\, x_1^k x_2^l \qquad (13)$$

where $K \in \mathbb{N}^+$ and $L \in \mathbb{N}^+$ are respectively the maximum degrees of the variables $(x_1, x_2)$ and $\{a_{k,l}\}_{k \in \{0..K\}, l \in \{0..L\}} \in \mathbb{R}^{(K+1)(L+1)}$ are the polynomial coefficients. The global degree of the polynomial is $D = K + L$. Let $B = \{B_{k,l}\}_{k \in \{0..K\}, l \in \{0..L\}}$ be an orthogonal basis of polynomials. A basis of degree $D$ is composed of $n$ polynomials, with $n = (D+1)(D+2)/2$, as follows:

$$B = \{B_{0,0}, B_{0,1}, \ldots, B_{0,L}, B_{1,0}, \ldots, B_{1,L-1}, \ldots, B_{K-1,0}, B_{K-1,1}, B_{K,0}\} \qquad (14)$$

An orthogonal basis can be created using the following three-term recurrence:

$$\begin{cases} B_{-1,l}(\mathbf{x}) = 0, \\ B_{k,-1}(\mathbf{x}) = 0, \\ B_{0,0}(\mathbf{x}) = 1, \\ B_{k+1,l}(\mathbf{x}) = (x_1 - \lambda_{k+1,l})\, B_{k,l}(\mathbf{x}) - \mu_{k+1,l}\, B_{k-1,l}(\mathbf{x}), \\ B_{k,l+1}(\mathbf{x}) = (x_2 - \lambda_{k,l+1})\, B_{k,l}(\mathbf{x}) - \mu_{k,l+1}\, B_{k,l-1}(\mathbf{x}), \end{cases} \qquad (15)$$

where $\mathbf{x} = (x_1, x_2)$ and the coefficients $\lambda_{k,l}$ and $\mu_{k,l}$ are given by

$$\lambda_{k+1,l} = \frac{\langle x_1 B_{k,l}(\mathbf{x}) \mid B_{k,l}(\mathbf{x}) \rangle}{\lVert B_{k,l}(\mathbf{x}) \rVert^2}, \quad \lambda_{k,l+1} = \frac{\langle x_2 B_{k,l}(\mathbf{x}) \mid B_{k,l}(\mathbf{x}) \rangle}{\lVert B_{k,l}(\mathbf{x}) \rVert^2}, \quad \mu_{k+1,l} = \frac{\langle B_{k,l}(\mathbf{x}) \mid B_{k,l}(\mathbf{x}) \rangle}{\lVert B_{k-1,l}(\mathbf{x}) \rVert^2}, \quad \mu_{k,l+1} = \frac{\langle B_{k,l}(\mathbf{x}) \mid B_{k,l}(\mathbf{x}) \rangle}{\lVert B_{k,l-1}(\mathbf{x}) \rVert^2} \qquad (16)$$

and $\langle \cdot \mid \cdot \rangle$ is the usual inner product for polynomial functions

$$\langle B_1 \mid B_2 \rangle = \iint_{\Omega} B_1(\mathbf{x})\, B_2(\mathbf{x})\, w(\mathbf{x})\, d\mathbf{x} \qquad (17)$$

with $w$ being the weighting function that determines the polynomial family and $\Omega$ the spatial domain covered by the window $W(i,j,t)$. Legendre polynomials ($w(\mathbf{x}) = 1$, $\forall \mathbf{x}$) are usually used.



Fig. 2. Examples of coding; on the first line: absolute coding of the gradient primitive (A(0), A(1), A(2), A(3)); on the second line: rectified coding of the gradient primitive (R(0), R(1), R(2), R(3)); on the third and fourth lines: orientation coding of the gradient primitive (O(0), O(1), O(2), O(3), O(4), O(5), O(6), O(7)): (a) A(0); (b) A(1); (c) A(2); (d) A(3); (e) R(0); (f) R(1); (g) R(2); (h) R(3); (i) O(0); (j) O(1); (k) O(2); (l) O(3); (m) O(4); (n) O(5); (o) O(6); (p) O(7).

Fig. 3. Examples of aggregation; (a) 4 × 4 cell aggregation; (b) flower aggregation with Q = 3 and T = 8; and (c) representation of the four-degree basis for spatial polynomial aggregation.

Fig. 4. Images from PASCAL Visual Object Classes Challenge 2007.

Using this basis, the approximation of a decomposed primitive component $P$ is

$$\tilde{P} = \sum_{k=0}^{D} \sum_{l=0}^{D-k} \tilde{p}_{k,l}\, \frac{B_{k,l}(\mathbf{x})}{\lVert B_{k,l}(\mathbf{x}) \rVert} \qquad (18)$$

The polynomial coefficients $\tilde{p}_{k,l}$ are given by the projection of the component $P$ onto the normalized elements of $B$:

$$\tilde{p}_{k,l} = \frac{\langle P \mid B_{k,l}(\mathbf{x}) \rangle}{\lVert B_{k,l}(\mathbf{x}) \rVert} \qquad (19)$$
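As an illustration of this polynomial aggregation, the sketch below projects one coded primitive channel of a square window onto an orthonormal two-dimensional Legendre-like basis, in the spirit of Eqs. (18)–(19). For brevity it builds the basis from separable NumPy Legendre polynomials and orthonormalizes them numerically on the discrete window instead of using the recurrence of Eq. (15); this shortcut is an assumption, not the authors' implementation.

```python
import numpy as np
from numpy.polynomial import legendre

def polynomial_aggregation(channel, degree=4):
    """Project a square patch (one coded primitive channel) onto a
    2D orthonormal polynomial basis of global degree `degree` and
    return the (degree+1)(degree+2)/2 coefficients."""
    n = channel.shape[0]
    x = np.linspace(-1.0, 1.0, n)
    xx, yy = np.meshgrid(x, x)
    # separable Legendre products P_k(x1) * P_l(x2), restricted to k + l <= degree
    basis = []
    for k in range(degree + 1):
        for l in range(degree + 1 - k):
            bk = legendre.legval(xx, [0] * k + [1])
            bl = legendre.legval(yy, [0] * l + [1])
            basis.append((bk * bl).ravel())
    basis = np.stack(basis, axis=1)                    # shape (n*n, n_basis)
    # orthonormalize on the discrete window, then project the patch (Eq. (19))
    q, _ = np.linalg.qr(basis)
    return q.T @ channel.astype(np.float64).ravel()
```

For a descriptor with several coded channels, the coefficient vectors of all channels are concatenated; the temporal extension described next further projects the time series of each coefficient onto a one-dimensional Legendre (or sine) basis.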


We show the polynomials associated with a four-degree basis in Fig. 3c (defined on a spatial domain of 32 × 32 pixels).

In the case of video classification, a space–time aggregation is considered. Kihl et al. [25] propose to model the spatial polynomial coefficients with a degree-d temporal basis of Legendre polynomials defined by

$$\begin{cases} B_{-1}(t) = 0, \\ B_0(t) = 1, \\ T_n(t) = \left(t - \langle t\, B_{n-1}(t) \mid B_{n-1}(t) \rangle\right) B_{n-1}(t) - B_{n-2}(t), \\ B_n(t) = \dfrac{T_n(t)}{\lVert T_n \rVert}. \end{cases} \qquad (20)$$

Using this basis of degree $d$, the approximation of $p_{k,l}(i,j,t)$ is

$$\tilde{p}_{k,l}(i,j,t) = \sum_{n=0}^{d} \tilde{p}_{k,l,n}(i,j,t)\, \frac{B_n(t)}{\lVert B_n(t) \rVert} \qquad (21)$$

The model has $d+1$ coefficients $\tilde{p}_{k,l,n}(i,j,t)$ given by

$$\tilde{p}_{k,l,n}(i,j,t) = \frac{\langle p_{k,l}(i,j,t) \mid B_n(t) \rangle}{\lVert B_n(t) \rVert} \qquad (22)$$

The time evolution of a given coefficient $\tilde{p}_{k,l}(i,j)$ is given by the vector $m_{l,k}(i,j,t_0)$ defined as

$$m_{l,k}(i,j,t_0) = \left[\tilde{p}_{k,l,0}(i,j,t_0), \tilde{p}_{k,l,1}(i,j,t_0), \ldots, \tilde{p}_{k,l,d}(i,j,t_0)\right] \qquad (23)$$

Finally, the descriptor is the concatenation of all the $m_{l,k}(i,j,t_0)$ vectors for each coded primitive. In this paper, we also propose an easy extension of this aggregation using a sine basis in place of the Legendre polynomials.

3.4. Extended descriptors

According to our framework, a descriptor depends on the following three hyperparameters:

1. the primitive,
2. the primitive coding,
3. the aggregation method.

In Table 1, we summarize the different primitives, codings and aggregations currently used for image and video classification.

Table 1. A new framework for local descriptors.

Primitive          Coding        Aggregation
Gradient           raw           Regular cells
Motion             rectified     Flower
Haar               absolute      Polynomial basis
Motion gradient    orientation   Sine basis
⋮                  ⋮             ⋮

According to specific combinations of primitive, coding and aggregation, we can explain most of the usual descriptors. In Table 2, we explain the usual descriptors of the literature with our framework.

Table 2. Rewriting of the usual descriptors; raw means the vector field is represented by its horizontal and vertical components.

Name    Primitive         Coding         Aggregation
HOG     Gradient          orientations   Cells
Daisy   Gradient          orientations   Flower
HOF     Motion            orientations   Cells
MBH     Motion gradient   orientations   Cells
SURF    Haar              abs            Cells
Efros   Motion            rectified      Cells
SoPAF   Motion            raw            Polynomials

Each new primitive, coding or aggregation defines a new family of descriptors, and each new combination of primitive–coding–aggregation defines a new descriptor. For example, using the gradient primitive extraction as in HOG, the orientation primitive coding as in HOG, and the polynomial aggregation as in SoPAF, we create a new descriptor. By replacing the orientation coding step of HOF with the rectified coding, we obtain another new descriptor. We present in Table 3 several new descriptors (a non-exhaustive list) proposed thanks to our framework.

Table 3. New descriptors created through our framework.

Name   Primitive         Coding          Aggregation
GoP    Gradient          orientation     Polynomials
MoP    Motion            orientation     Polynomials
MrC    Motion            rectification   Cells
MrP    Motion            rectification   Polynomials
MGoP   Motion gradient   orientation     Polynomials
MGrC   Motion gradient   rectification   Cells
MGrP   Motion gradient   rectification   Polynomials

Since different primitives correspond to different properties of the signal, we argue that adapted coding and aggregation schemes have to be used to produce efficient descriptors. Indeed, our framework allows us to explore and evaluate the possible combinations so as to find the best descriptors.
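To illustrate how the three hyperparameters compose into a descriptor, here is a hedged sketch of a cell-aggregated local descriptor in the spirit of Tables 2 and 3 (for example HOF when fed orientation-coded optical flow, or MrC when fed rectified-coded flow). It assumes coding functions such as those sketched in Section 3.2 and a plain sum inside each spatial cell; the Gaussian weighting, temporal cells and normalization used in practice are omitted.

```python
import numpy as np

def cell_aggregation(code, n_cells=2):
    """Sum a coded primitive of shape (H, W, C) inside an n_cells x n_cells
    grid and concatenate the per-cell sums into one vector."""
    h, w, _ = code.shape
    ys = np.linspace(0, h, n_cells + 1, dtype=int)
    xs = np.linspace(0, w, n_cells + 1, dtype=int)
    cells = [code[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].sum(axis=(0, 1))
             for i in range(n_cells) for j in range(n_cells)]
    return np.concatenate(cells)

def local_descriptor(primitive, coding, aggregation, patch):
    """Compose primitive extraction, coding and aggregation on one patch.
    The three steps are callables, so swapping any one of them yields
    another descriptor family in the sense of Table 3."""
    gx, gy = primitive(patch)     # e.g. gradient or optical flow components
    code = coding(gx, gy)         # e.g. orientation, rectified or absolute coding
    return aggregation(code)      # e.g. cell or polynomial aggregation
```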

4. Evaluation of framework components

In this section, we evaluate the influence of the combinations of hyperparameters (primitives, codings and aggregations). These evaluations are performed for still image descriptors and for action descriptors.

Moreover, we evaluate both hyperparameters and parameters (number of cells, polynomial degree, etc.) for each combination. As dense sampling outperforms key-point extraction [1,3] for category recognition, we use dense sampling in all our descriptor evaluations. We carry out experiments on an image dataset (VOC2007) for still image descriptors and on two well-known human action recognition datasets (the KTH dataset [5] and the Hollywood2 Human Actions dataset [4]) for action descriptors. For the experiments, we aggregate all the descriptors of a given image or video into a single signature as in [33,34,3]. In this paper, we consider a compressed version of VLAT, which achieves state-of-the-art results for image retrieval [35]. This method uses an encoding procedure based on high-order statistics deviations of clusters. In our case, the dense sampling both in spatial and temporal directions leads to highly populated sets, which is consistent with the second-order statistics computed in VLAT signatures.

4.1. Still image descriptor evaluation

We first present results on still image descriptor evaluation in the context of image categorization. The gradient is the only primitive considered. The gradient is extracted with the simple first-order finite difference method, at a single resolution. The PASCAL-VOC 2007 dataset [1] consists of about 10,000 images and 20 categories (Fig. 4), and is divided into three parts: "train", "val" and "test". We use a linear SVM classifier trained on the "train" + "val" sets and tested on the "test" set. We use four primitive codings: absolute, rectified, double rectified and orientation. These codings are combinatorially associated with the following three code aggregations: regular cells, flower and polynomial basis. For the regular grid aggregation, we use 4 × 4 cells.


The cells are evaluated at four scales: 4 × 4, 6 × 6, 8 × 8 and 10 × 10 pixels. For the flower aggregation, the parameter Q is set to 3 and the parameter T to 8. We consider the flower aggregation at four scales by setting the radius R to 9, 12, 15 and 18 pixels. For the polynomial aggregation, we set the basis degree to 4. The polynomial spatial domain is considered at four scales: 16 × 16, 24 × 24, 32 × 32 and 40 × 40 pixels. For the VLAT signature, we use a dictionary of 256 visual words. In each cluster, we retained 70 eigenvectors, limiting the dimension of the descriptors to 70 for every descriptor used in these evaluations.

Results for each descriptor are shown in Table 4. We remark that the orientation coding clearly outperforms the other primitive codings for all the code aggregations experimented on this dataset. The GoF, GoC and GoP descriptors (cf. Table 4) provide the best results; these three features have the highest mean average precision for each category of the VOC2007 dataset. The best result of this evaluation is obtained for a descriptor corresponding to the Daisy descriptor. However, we show that our framework allows easy extensions of HOG (GoC), for example by changing the code aggregation from cells to polynomials: the new GoP descriptor obtains the same results as the well-known HOG (GoC) descriptor.

Table 4. Classification results expressed as mean average precision for combinations of primitive, coding and aggregation on the VOC2007 dataset.

Name    Coding        Aggregation    mAP
SURF    absolute      Cells          58.2
GaF     absolute      Flower         56.6
GaP     absolute      Polynomials    57.6
GrC     rectified     Cells          58.1
GrF     rectified     Flower         57.2
GrP     rectified     Polynomials    57.4
HOG     orientation   Cells          63.2
DAISY   orientation   Flower         63.7
GoP     orientation   Polynomials    63.2
GdC     double        Cells          58.2
GdF     double        Flower         56.9
GdP     double        Polynomials    57.8

4.2. Video action descriptor evaluation

In this section, we present the evaluation of the hyperparameters for video action descriptors. First, we evaluate our framework on the KTH [5] dataset and then on the Hollywood2 [4] dataset.

4.2.1. Evaluation on KTH dataset

The KTH dataset [5] contains six types of human actions: walking, jogging, running, boxing, hand waving and hand clapping (Fig. 5). These actions are performed by 25 different subjects in four scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. For all experiments, we use the same experimental setup as in [5,3], where the videos are divided into a training set (eight persons), a validation set (eight persons) and a test set (nine persons). The classification accuracy results are obtained on the test set.

On this dataset, we compare three primitive extractions (gradient, motion and gradient of motion), three primitive codings (raw, rectified and orientation) and three code aggregations (cells, polynomials and sine). We extract the gradient with the simple first-order finite difference method. For motion estimation, we use the Horn and Schunck optical flow algorithm [36] with 25 iterations and the regularization parameter λ set to 0.1. We extract the primitives at a single resolution. We aggregate the extracted descriptors with the compressed VLAT signature approach.

We evaluate several descriptors according to the hyperparameters of our framework and for several spatial and temporal parameter settings. We present in Tables 5–7 the main results obtained for each primitive extraction on the KTH dataset.

We present in Table 5 the results associated with the gradient primitive. As in the still image experiments, the orientation coding clearly outperforms the other primitive codings for the three code aggregations. When the orientation coding is associated with the cell aggregation, it produces the best results for the gradient primitive extraction. For all code aggregations, the best results are obtained with the lowest level of modeling along the time axis.

We present in Table 6 the results associated with the motion primitive. In the case of the motion primitive, the rectified coding yields the best results. For the motion primitive, higher-order time modeling improves the results for a given spatial modeling.

Fig. 5. Example of videos from KTH.

Table 5 Results for combination of gradient primitives, coding and aggregation on the KTH dataset; dim means the dimension of the descriptor; coding represent the code primitives (raw, rectified or orientation); SP means the number of spatial cells or the degree D of spatial polynomials or the spatial degree of the sinus basis; TP means the number of temporal cells, or the degree d of temporal polynomials or the degree of sinus basis. Dim

Coding

Gradient Cell

32 36 40 40 64 80

raw raw raw raw raw raw

48 48 60 64 64 72 80 80 128 144

rect rect rect rect rect rect rect rect rect rect

96 96 120 128 128

ori ori ori ori ori

Poly

SP

TP

Usual name

4 2 2 1 2 3

1 2 5 4 4 3

x x x x x x

1 2 4 4 3 2 2 1 2 3

2 3 0 1 0 2 5 4 4 4

x x x x x x x x x x

2 1 4 4 3

3 2 0 1 0

HOG x x HOG x

Sinus

80.4 81.0 82.8 86.8 84.5 83.1 88.5 84.8 83.2 86.5 83.3 84.5 87.2 88.5 88.0 88.5 92.4 91.4 92.6 93.4 93.3


Table 6 Results for combination of motion primitive, coding and aggregations on the KTH dataset; the legend is the same as Table 5. Dim

Coding

Motion Cell

32 32 36 40 40 64 80

raw raw raw raw raw raw raw

87.0

48 48 60 64 64 72 80 80 128 144

rect rect rect rect rect rect rect rect rect rect

90.7

96 96 120 128 128

ori ori ori ori ori

89.2

Poly

SP

TP

Usual name

4 3 2 2 1 2 3

1 0 2 5 4 4 3

x x SoPAF x x SoPAF SoPAF

2 1 4 4 3 2 2 1 2 3

3 2 0 1 0 2 5 4 4 4

x x x x x x x x x x

2 1 4 4 3

3 2 0 1 0

HOF x x HOF x

Sinus

85.1 89.8 89.6 88.0 90.4 91.1

91.3 90.7 90.4 87.7 90.5 91.4 91.0 91.7 92.0

90.0 90.6 91.8 87.8

Table 7 Results for combination of gradient of motion primitive, coding and aggregations on the KTH dataset; the legend is the same as Table 5. Dim

Coding

Gradient of motion Cell

48 48 60 64 72 80 80 128

raw raw raw raw raw raw raw raw

90.0

32 32 48 96 96 120

rect rect rect rect rect rect

92.2

64 64 96

ori ori ori

92.5

Poly

TP

Usual name

2 1 4 4 2 2 1 2

3 2 0 1 2 5 4 4

x x x x x x x x

2 1 2 2 1 4

1 0 0 3 2 0

x x x x x x

2 1 2

1 0 0

MBH x x

Sinus

90.0 90.3 90.0 90.6 89.9 89.4 91.0

91.5 93.1 94.2 93.4 93.7

91.5 93.6

SP

For instance, for the rectified coding and the polynomial aggregation with a spatial polynomial basis of degree 2, if the temporal polynomial basis is of degree 2 the classification accuracy is 90.5%, and if the temporal polynomial basis is of degree 4 the accuracy is 91.7%.

We present in Table 7 the results associated with the gradient of motion primitive. The best results, for each code aggregation, are obtained with the rectified coding. It is interesting to note that we have only generated descriptors whose size does not exceed 144 dimensions. Note that the gradient of motion primitive provides four components, and the orientation coding decomposes each component into eight orientation maps. Consequently, descriptors associating the gradient of motion primitive with orientation coding easily reach high dimensions.

4.2.2. Evaluation on Hollywood2 dataset

We now present an evaluation of hyperparameters and parameters on the more challenging Hollywood2 [4] action recognition dataset. This dataset consists of a collection of video clips and extracts from 69 films in 12 classes of human actions (Fig. 6). It accounts for approximately 20 h of video and contains about 150 video samples per action. It contains a variety of spatial scales, camera zooms, deleted scenes and compression artifacts, which allows a more realistic assessment of human action classification methods. We use the official train and test splits for the evaluation.

On this dataset, we compare three primitive extractions (gradient, motion and gradient of motion), two primitive codings (rectified and orientation) and two code aggregations (cells, polynomials). We extract the gradient with the simple first-order finite difference method. For motion estimation, we use the Horn and Schunck optical flow algorithm [36] with 25 iterations and the regularization parameter λ set to 0.1. We extract the primitives at seven resolutions with the resolution factor set to 0.8. The resolutions are obtained by downsampling the images; we do not use any upsampling in this work. We aggregate the extracted descriptors with the compressed VLAT signature approach. For the VLAT signature, we use a dictionary of 256 visual words. In each cluster, we retain 50 eigenvectors, limiting the dimension of the descriptors to 50.

For the gradient primitive evaluation presented in Table 8, we remark that the cell aggregation produces the best results. The combination of the gradient primitive with orientation coding and cell aggregation corresponds to the usual HOG descriptor. In this case, our framework only allows us to evaluate the best parameters for the spatial and temporal grid.

For the motion primitive evaluation presented in Table 9, we remark that the orientation coding produces clearly lower results than the rectified coding. In this case, our framework allows us to design two new descriptors better than the usual HOF. Moreover, the polynomial aggregation produces better results than the cell aggregation. For the motion primitive, the best combination is obtained with rectified coding and polynomial aggregation.

Fig. 6. Example of videos from Hollywood2 dataset.
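For reference, the multi-resolution dense extraction used here (seven scales with a factor of 0.8, obtained by downsampling only) can be sketched as follows; the interpolation method is an assumption, since it is not detailed in the paper.

```python
import numpy as np
from scipy.ndimage import zoom

def resolution_pyramid(frame, n_scales=7, factor=0.8):
    """Return downsampled versions of a frame at n_scales resolutions
    with scale factors 1.0, 0.8, 0.8**2, ...; no upsampling is used."""
    frame = frame.astype(np.float64)
    # order=1 (bilinear) interpolation is an assumption, not a stated choice
    return [zoom(frame, factor ** s, order=1) for s in range(n_scales)]
```

Descriptors are then densely extracted on every level of the pyramid before being aggregated into the signature.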


Table 8 Results for combination of gradient primitives, coding and aggregation on the Hollywood2 dataset; dim means the dimension of the descriptor; coding represents the code primitives (rectified or orientation); SP means the number of spatial cells or the degree D of spatial polynomials; TP means the number of temporal cells, or the degree d of temporal polynomials. Dim

Coding

Gradient Cell

64 96 128 72 144 216 128 256 144 192 240 240 320 120 360

ori ori ori ori ori ori ori ori ori ori ori ori ori ori ori

SP

TP

Usual name

Coding

51.1 50.6 50.7 47.8 50.0 48.4 49.6

Motion Cell

64 72 108 64 128 72 96 120 120 160 60 180 64 96 128 72 144 216 128 256 144 192 240 80 240 120

rect rect rect rect rect rect rect rect rect rect rect rect ori ori ori ori ori ori ori ori ori ori ori ori ori ori

Dim

Coding

Gradient of motion Cell

2 2 2 3 3 3 4 4 2 2 2 3 3 4 4

2 3 4 1 2 3 1 2 2 3 4 2 3 0 2

HOG HOG HOG HOG HOG HOG HOG HOG GoP GoP GoP GoP GoP GoP GoP

Table 9 Results for combination of motion primitives, coding and aggregation on the Hollywood2 dataset; dim means the dimension of the descriptor; coding represents the code primitives (rectified or orientation); SP means the number of spatial cells or the degree D of spatial polynomials; TP means the number of temporal cells, or the degree d of temporal polynomials. Dim

Table 10 Results for combination of gradient of motion primitives, coding and aggregation on the Hollywood2 dataset; dim means the dimension of the descriptor; coding represents the code primitives (rectified or orientation); SP means the number of spatial cells or the degree D of spatial polynomials; TP means the number of temporal cells, or the degree d of temporal polynomials.

Poly

50.8 51.4 52.0 44.8 48.8 50.9 44.4 48.6

SP

TP

Usual name

2 3 3 4 4 2 2 2 3 3 4 4 2 2 2 3 3 3 4 4 2 2 2 3 3 4

4 2 3 1 2 2 3 4 2 3 0 2 2 3 4 1 2 3 1 2 2 3 4 0 2 0

MrC MrC MrC MrC MrC MrP MrP MrP MrP MrP MrP MrP HOF HOF HOF HOF HOF HOF HOF HOF MoP MoP MoP MoP MoP MoP


64 96 128 144 216 128 256 144 192 240 144 240 320 120 360 128 192 256 144 288 256 288 384 160 288 240

rect rect rect rect rect rect rect rect rect rect rect rect rect rect rect ori ori ori ori ori ori ori ori ori ori ori

SP

TP

Usual name

2 2 2 3 3 4 4 2 2 2 3 3 3 4 4 2 2 2 3 3 4 2 2 3 3 4

2 3 4 2 3 1 2 2 3 4 0 2 3 0 2 2 3 4 1 2 1 2 3 0 1 0

GMrC GMrC GMrC GMrC GMrC GMrC GMrC GMrP GMrP GMrP GMrP GMrP GMrP GMrP GMrP MBH MBH MBH MBH MBH MBH GMoP GMoP GMoP GMoP GMoP

Poly

55.3 56.2 56.2 55.2 56.1 51.3 54.8 56.2 57.0 56.4 50.6 55.4 55.7 54.9 55.5 52.0 53.4 53.7 48.1 53.1 47.5 53.7 54.1 48.0 53.2 47.6

Poly

50.9 53.6 54.4 50.2 53.7 53.2 53.7 52.7 54.2 53.7 49.8 54.5 48.5 49.5 49.4 44.0 49.1 49.5 45.2 49.5 49.9 49.3 49.2 44.3 50.0 45.1

For the gradient of motion primitive evaluation presented in Table 10, we remark that the rectified coding produces clearly better results than the orientation coding. The polynomial aggregation also yields better results than the cell aggregation. Note that using the rectified coding rather than the orientation coding allows for a faster computation of the descriptors, with the polynomial aggregation as well as with the cell aggregation. Moreover, for given spatial and temporal parameters, the descriptors with rectified coding are more compact than those with orientation coding.

In the case of motion and gradient of motion, our framework evaluation clearly shows the benefit of using rectified coding rather than orientation coding.

5. Comparison with the literature

In this section, we compare several interesting combinations of primitives, codings and aggregations, identified by our hyperparameter and parameter evaluations, with results from the literature on still image categorization and action recognition.

5.1. Still image categorization

For still image categorization, we experiment on the Pascal VOC 2007 dataset [1]. For comparison, we use the GoF, GoC and GoP descriptors (cf. Table 4), which provide the best results in our evaluation. Using a simple concatenation of the signatures, we obtain a mean average precision of 64.2%. This result is reported in Table 11 and compared with the results from [33]. Note that our approach provides a global image signature which does not include any kind of spatial information such as spatial pyramid matching (SPM) [37] or object detectors [38]. We compare our results to those of Sanchez et al. [33], which are given without spatial information. In [33], the SIFT descriptors are densely extracted at seven resolutions and then aggregated with the Fisher vector signature approach. We show that our framework allows easy extensions of HOG (GoC), for example by changing the code aggregation from cells to polynomials. With this new descriptor, we improve the categorization results obtained with only HOG descriptors. Moreover, our framework is compatible with adding spatial information as in [37], which should further improve the results.


Table 11. Image classification results on the Pascal VOC 2007 dataset (average precision per class and mAP, %).

Class       Our method   SIFT + FV [33]
mAP         64.2         62.7
aeroplane   83.3         80.2
bicycle     73.0         69.1
bird        59.9         52.8
boat        73.5         72.9
bottle      33.2         37.6
bus         71.2         69.5
car         84.2         81.8
cat         65.7         61.8
chair       53.3         54.9
cow         49.5         47.2
table       58.8         61.5
dog         52.3         50.5
horse       83.0         79.1
bike        72.0         67.1
person      87.5         85.8
plant       37.2         37.6
sheep       47.4         46.6
sofa        55.4         57.0
train       85.5         82.3
tv          58.0         59.0

5.2. Video action recognition

For video action recognition, we compare our results on the KTH [5], Hollywood2 [4], UCF11 [18] and UCF101 [39] datasets. For KTH and Hollywood2, we use the best descriptors from our evaluation on the given dataset. For UCF11 and UCF101, we use the best descriptors obtained during the evaluation on Hollywood2.

We present in Table 12 the classification accuracy of several combinations of descriptors on KTH. We show the best descriptor results of our study for each primitive and code aggregation, and compare them to recent results from the literature. Every single descriptor presented in Table 12 is comparable to those proposed by Wang et al. [3]. Moreover, the simple concatenation of all our signatures (9) outperforms the classification accuracy of Wang [3] and Gilbert [40]. Let us note that our approach uses linear classifiers, and thus leads to better efficiency both for training classifiers and for classifying video shots, as opposed to the methods of [3,40]. Moreover, we do not use dense trajectories to follow descriptors along the time axis as in [3].

We present in Table 13 the mean average precision of the three best combinations of hyperparameters and parameters from our evaluation on the Hollywood2 dataset. We compare these descriptors with the HOG, HOF and MBH descriptor results extracted from [3]. Moreover, we compare our three best descriptors with our implementation of the HOG, HOF and MBH descriptors with the parameters reported in [3]; in Table 13, we refer to these descriptors as the baseline lines. The results presented here improve the state of the art for single-descriptor setups when comparing to HOG (gradient primitive), HOF (motion primitive) and MBH (gradient of motion primitive), both for the results reported in [3] and for our own implementation of this baseline. Note that, as opposed to [3], we do not use dense trajectories to obtain these results. Our framework allows us to improve over the usual descriptors by selecting the best hyperparameters and parameters for each primitive. Note that two of the best descriptors in our proposed combination are entirely new and designed thanks to the framework (MrP and GMrP). By concatenating our three best descriptors, we obtain a mean average precision of 60.3%.

In Table 14, we compare our results on the UCF11 actions dataset. The UCF11 [18] dataset is an action recognition dataset with 11 action categories, consisting of realistic videos taken from YouTube (Fig. 7). The dataset is very challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background and illumination conditions. The videos are grouped into 25 groups, where each group consists of more than four action clips. The video clips in the same group may share some common features, such as the same person, similar background or similar viewpoint. The experimental setup is a leave-one-group-out cross-validation. We present in Table 14 the mean average precision obtained with the three best combinations of hyperparameters and parameters from our evaluation on the Hollywood2 dataset. We compare these descriptors with the HOG, HOF and MBH descriptor results extracted from [3], and

Table 12. Classification accuracy on the KTH dataset; ND means the number of descriptors used; NL stands for non-linear classifiers.

Method                         ND    NL   Results (%)
Wang (HOG + traj) [3]          1     X    86.5
Wang (HOF + traj) [3]          1     X    93.2
Wang (MBH + traj) [3]          1     X    95.0
Wang (All) [3]                 4     X    94.2
Gilbert [40]                   3^a   X    94.5
A = G + ori + Cell (4,1)       1          93.4
B = G + ori + Poly (4,0)       1          92.6
C = G + ori + Sine (3,0)       1          93.3
D = M + ori + Cell (4,1)       1          91.8
E = M + rect + Poly (2,4)      1          91.7
F = M + rect + Sine (1,2)      1          91.3
G = GM + rect + Cell (2,3)     1          94.2
H = GM + rect + Poly (4,0)     1          93.7
I = GM + rect + Sine (1,2)     1          93.4
A + D + G                      3          94.2
B + E + H                      3          94.4
C + F + I                      3          93.5
A + ⋯ + I                      9          94.7

^a In [40], the same feature is iteratively combined with itself three times.

Table 13. Mean average precision on the Hollywood2 dataset; ND is the number of descriptors; NL stands for non-linear classifiers.

Method                        ND    NL   Results (%)
Gilbert [40]                  3     X    50.9
Ullah [41] HOG + HOF          2     X    51.8
Ullah [41]                    2^a   X    55.3
Wang [3] traj                 1     X    47.7
Wang [3] HOG                  1     X    41.5
Wang [3] HOF                  1     X    50.8
Wang [3] MBH                  1     X    54.2
Wang [3] all                  4     X    58.3
baseline HOG (2,3)            1          51.4
baseline HOF (2,3)            1          49.5
baseline MBH (2,3)            1          53.4
A = G + ori + Cell (2,4)      1          52.0
B = M + rect + Poly (2,3)     1          54.5
C = GM + rect + Poly (2,3)    1          57.0
baseline                      3          57.2
A + B + C                     3          60.3

^a In [41], HOG/HOF descriptors are accumulated over 100 spatio-temporal regions, each one leading to a different BoW signature.

we compare our three best descriptors with our implementation of the HOG, HOF and MBH descriptors with the parameters reported in [3]. In our experiments on UCF11, we extract the primitives at five resolutions. The results presented here improve the state of the art for single-descriptor setups when comparing to HOG (gradient primitive), HOF (motion primitive) and MBH (gradient of motion primitive), both for the results reported in [3] and for our own implementation of this baseline.

Note that the two entirely new descriptors designed thanks to the framework (MrP and GMrP) produce a significant improvement over the baseline descriptors. When combining descriptors, we improve on the results of Wang et al. [3] without using dense trajectories.

Table 14. Mean average precision on the UCF11 dataset; ND, number of descriptors; NL, non-linear classifiers.

Method                        ND   NL   Results (%)
Wang [3] traj                 1    X    67.2
Wang [3] HOG                  1    X    74.5
Wang [3] HOF                  1    X    72.8
Wang [3] MBH                  1    X    83.9
Wang [3] all                  4    X    84.2
baseline HOG (2,3)            1         80.9
baseline HOF (2,3)            1         79.5
baseline MBH (2,3)            1         82.0
A = G + ori + Cell (2,4)      1         81.3
B = M + rect + Poly (2,3)     1         82.7
C = GM + rect + Poly (2,3)    1         85.3
baseline                      3         83.7
A + B + C                     3         86.9

Finally, we present in Table 15 our results on the UCF101 dataset [39]. UCF101 is an action recognition dataset of realistic action videos collected from YouTube. The dataset is composed of 13,320 videos from 101 action categories. UCF101 gives the largest diversity in terms of actions, in the presence of large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. On this dataset, we compare a combination of our three best descriptors from the evaluation on Hollywood2 with our implementation of the HOG, HOF and MBH descriptors (without dense trajectories). The parameters and hyperparameters are taken directly from the evaluation on Hollywood2, without any adaptation (or cross-validation) to this specific dataset. We also report the results obtained by Soomro et al. [39] when they proposed the dataset, and the results obtained by Wang and Schmid [42], who won the THUMOS challenge [43].

On this dataset, our three best descriptors produce better results than the baseline HOG, HOF and MBH descriptors, both individually and taken as a combination. The improvement is 2.6% in multiclass accuracy, which is encouraging considering that the parameters are optimized for a completely different dataset. As for the UCF11 dataset, note that the two entirely new descriptors designed thanks to the framework (MrP and GMrP) produce a significant improvement over their baseline counterparts (HOF and MBH). We improve on the results of Soomro et al. [39] on this dataset by 35.5%, but our results are lower than those of Wang and Schmid [42], winners of the THUMOS challenge, by 6.5%. However, we want to stress that Wang and Schmid [42] use a much heavier setup compared to ours. In particular, all their descriptors use filtered dense trajectories where the background motion has been removed, as well as a spatio-temporal pyramid of three horizontal bands and two temporal blocks. All these improvements are clearly compatible with our approach, and could lead to even better results.

Table 15. Average accuracy over three train-test splits on the UCF101 dataset; ND, number of descriptors; NL, non-linear classifiers.

Method                        ND        NL   Results (%)
HOG-HOF [39]                  2         X    43.9
Wang and Schmid [42]          3 (18)^a       85.9
baseline HOG (2,3)            1              65.3
baseline HOF (2,3)            1              68.6
baseline MBH (2,3)            1              74.0
A = G + ori + Cell (2,4)      1              66.7
B = M + rect + Poly (2,3)     1              72.5
C = GM + rect + Poly (2,3)    1              76.8
baseline                      3              76.9
A + B + C                     3              79.4

^a Taking into account the spatio-temporal pyramid additional features.

6. Conclusion

In this paper, we introduced a new framework to describe local visual descriptors. This framework consists of the decomposition of descriptors into three levels: primitive extraction, primitive coding and code aggregation. Our framework allows us to easily explain popular descriptors of the literature and to propose extensions of popular descriptors, for instance by introducing a function-based aggregation. Moreover, thanks to our framework, we propose a rigorous exploration of the possible combinations of primitive, coding and aggregation methods. This allows us to design more efficient and complementary descriptors. We obtain better or equivalent results compared to the usual descriptors on popular benchmarks for still image and video categorization. This emphasizes the relevance of our framework for the design of new low-level visual and motion descriptors.

We are confident that our framework can be used to implement descriptor families not covered in this paper, for example by using dense trajectories at the primitive step or non-negative sparse coding approaches [44] at the coding step. Future work also involves the optimization of the primitive step by using machine learning algorithms. For example, the primitive can be an adapted filter bank trained on some training set, in a way similar to deep learning approaches [45] or infinite kernel learning approaches [46].

Furthermore, it is interesting to compare our framework to the coding/pooling approaches [47] used to compute signatures. Indeed, the last two steps of our framework (primitive coding and code aggregation) are related to the ones involved in the computation of signatures. Future work involves adapting recent signature computation methods to the descriptors using our framework. For example, dictionary-based approaches [48,47] and model deviation approaches [49–51] can be used for the coding and aggregation steps.

Fig. 7. Example of videos from UCF11.


Conflict of interest None declared. References [1] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 〈http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html〉. [2] K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, in: BMVC, vol. 76, 2011, pp. 1–12. [3] H. Wang, A. Klaser, C. Schmid, C. Liu, Action recognition by dense trajectories, in: Conference on CVPR, IEEE, Colorado Springs, USA, 2011, pp. 3169–3176. [4] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: Conference on CVPR, IEEE, Anchorage, Alaska, USA, 2008, pp. 1–8. [5] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local svm approach, in: ICPR, vol. 3, IEEE, Cambridge, UK, 2004, pp. 32–36. [6] D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110. [7] H. Bay, T. Tuytelaars, L. Van Gool, Surf: speeded up robust features, in: ECCV, 2006, pp. 404–417. [8] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Conference on CVPR, IEEE, Graz, Austria, 2005, pp. 886–893. [9] N. Dalal, B. Triggs, C. Schmid, Human detection using oriented histograms of flow and appearance, in: ECCV, 2006, pp. 428–441. [10] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1615–1630. [11] E. Tola, V. Lepetit, P. Fua, Daisy: an efficient dense descriptor applied to widebaseline stereo, Trans. Pattern Anal. Mach. Intell. 32 (5) (2010) 815–830. [12] J. Davis, A. Bobick, The representation and recognition of action using temporal templates, in: Conference on CVPR, IEEE, San Juan, Puerto Rico, 1997, pp. 928–934. [13] L. Wang, D. Suter, Learning and matching of dynamic shape manifolds for human action recognition, IEEE Trans. Image Process. 16 (6) (2007) 1646. [14] V. Kellokumpu, G. Zhao, M. Pietikäinen, Texture based description of movements for activity analysis, in: VISAPP, vol. 1, 2008, pp. 206–213. [15] T. Ojala, M. Pietikäinen, T. Mäenpää, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, Trans. Pattern Anal. Mach. Intell. 24 (2002) 971–987. [16] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space–time shapes, in: ICCV, vol. 2, IEEE, Beijing, China, 2005, pp. 1395–1402. [17] L. Gorelick, M. Blank, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, Trans. Pattern Anal. Mach. Intell. 29 (12) (2007) 2247–2253. [18] J. Liu, J. Luo, M. Shah, Recognizing realistic actions from videos in the wild, in: Conference on CVPR, IEEE, Miami, Florida, USA, 2009, pp. 1996–2003. [19] R. Polana, R. Nelson, Low level recognition of human motion, in: Proceedings of the IEEE Workshop on Nonrigid and Articulate Motion, 1994, pp. 77–82. [20] A. Efros, A. Berg, G. Mori, J. Malik, Recognizing action at a distance, in: ICCV, vol. 2, IEEE, Nice, France, 2003, pp. 726–733. [21] A. Fathi, G. Mori, Action recognition by learning mid-level motion features, in: Conference on CVPR, IEEE, 2008, pp. 1–8. [22] S. Danafar, N. Gheissari, Action recognition for surveillance applications using optic flow and svm, in: ACCV, vol. 4844, 2007, pp. 457–466. [23] D. Tran, A. Sorokin, Human activity recognition with metric learning, in: ECCV, 2008, pp. 548–561. [24] S. Ali, M. 
Shah, Human action recognition in videos using kinematic features and multiple instance learning, Trans. Pattern Anal. Mach. Intell. 32 (2010) 288–303. [25] O. Kihl, B. Tremblais, B. Augereau, M. Khoudeir, Human activities discrimination with motion approximation in polynomial bases, in: ICIP, IEEE, Hong Kong, China, 2010, pp. 2469–2472.

[26] V.F. Mota, E. Perez, M.B. Vieira, L. Maciel, F. Precioso, P.-H. Gosselin, A tensor based on optical flow for global description of motion in videos, in: TwentyFifth SIBGRAPI Conference on Graphics, Patterns and Images, IEEE, Ouro Preto, Brazil, 2012, pp. 298–301. [27] O. Kihl, D. Picard, P.-H. Gosselin, Local polynomial space-time descriptors for actions classification, in: IAPR MVA, Kyoto, Japan, 2013. [28] P. Dollár, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: Second Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, IEEE, Beijing, China, 2005, pp. 65–72. [29] A. Klaser, M. Marszalek, C. Schmid, A spatio-temporal descriptor based on 3dgradients, in: BMVC, 2008. [30] P. Scovanner, S. Ali, M. Shah, A 3-dimensional sift descriptor and its application to action recognition, in: Proceedings of the 15th International Conference on Multimedia, ACM, Augsburg, Germany, 2007, pp. 357–360. [31] G. Willems, T. Tuytelaars, L. Van Gool, An efficient dense and scale invariant spatio-temporal interest point detector, in: ECCV, 2008, pp. 650–663. [32] M. Varma, A. Zisserman, A statistical approach to material classification using image patch exemplars, Trans. Pattern Anal. Mach. Intell. 31 (11) (2009) 2032–2047. [33] J. Sánchez, F. Perronnin, T.d. Campos, Modeling the spatial layout of images beyond spatial pyramids, Pattern Recognit. Lett, http://dx.doi.org/10.1016/j. patrec.2012.07.019. [34] H. Wang, M.M. Ullah, A. Klaser, I. Laptev, C. Schmid, Evaluation of local spatiotemporal features for action recognition, in: BMVC, 2009. [35] R. Negrel, D. Picard, P. Gosselin, Using spatial pyramids with compacted vlat for image categorization, in: ICPR, 2012, pp. 2460–2463. [36] B. Horn, B. Schunck, Determining optical flow, Artif. Intell. 17 (1) (1981) 185–203. [37] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: Conference on CVPR, vol. 2, IEEE, New York, NY, USA, 2006, pp. 2169–2178. [38] L.-J. Li, H. Su, E.P. Xing, L. Fei-Fei, Object bank: a high-level image representation for scene classification and semantic feature sparsification, Adv. Neural Inf. Process. Syst. 24 (2010). [39] K. Soomro, A.R. Zamir, M. Shah, Ucf101: A Dataset of 101 Human Actions Classes from Videos in the Wild, arXiv preprint arXiv: arXiv:1212.0402. [40] A. Gilbert, J. Illingworth, R. Bowden, Action recognition using mined hierarchical compound features, Trans. Pattern Anal. Mach. Intell. 99 (2011) 883–897. [41] M. Ullah, S. Parizi, I. Laptev, Improving bag-of-features action recognition with non-local cues, in: BMVC, 2010. [42] H. Wang, C. Schmid, Lear-inria submission for the thumos workshop, in: ICCV Workshop on Action Recognition with a Large Number of Classes, 2013. [43] R.Z.L.P.S. Jiang, Liu, Sukthankar, Thumos challenge: Action Recognition with a Large Number of Classes, 〈http://crcv.ucf.edu/ICCV13-Action-Workshop/〉, 2013. [44] T. Guthier, V. Willert, A. Schnall, J. Kreuter, K. Eggert, Non-negative sparse coding for motion extraction, in: Joint Conference on Neural Networks, IEEE, Dallas, TX, USA, 2013. [45] C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1915–1929. [46] A. Rakotomamonjy, R. Flamary, F. Yger, Learning with infinitely many features, Mach. Learn. 91 (1) (2013) 43–66. http://dx.doi.org/10.1007/s10994-012-5324-5. [47] J. 
Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: Conference on CVPR, IEEE, San Francisco, CA, USA, 2010, pp. 3360–3367. [48] J. Sivic, A. Zisserman, Video google: a text retrieval approach to object matching in videos, in: ICCV, vol. 2, IEEE, Nice, France, 2003, pp. 1470–1477. [49] H. Jégou, M. Douze, C. Schmid, P. Pérez, Aggregating local descriptors into a compact image representation, in: Conference on CVPR, IEEE, San Francisco, CA, USA, 2010, pp. 3304–3311. [50] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, C. Schmid, Aggregating local image descriptors into compact codes, Trans. Pattern Anal. Mach. Intell. 34 (2012) 1704–1716. [51] D. Picard, P.-H. Gosselin, Improving Image Similarity with Vectors of Locally Aggregated Tensors, in: IEEE ICIP, IEEE, Brussels, Belgium, 2011, pp. 669–672.

Olivier Kihl received the M.Sc. in Fundament for Engineering on Informatics and Image from the University of Poitiers in 2007 and the Ph.D. in image and signal processing in 2012. He joined the ETIS laboratory at the ENSEA (France) in 2012 as a post-doctoral researcher. His research interests include image processing and computer vision. He focuses on motion analysis and local visual descriptors for multimedia indexing.

David Picard received the M.Sc. in Electrical Engineering in 2005 and the Ph.D. in image and signal processing in 2008. He joined the ETIS laboratory at the ENSEA (France) in 2010 as an associate professor within the MIDI team. His research interests include computer vision and machine learning for visual information retrieval, with focus on kernel methods for multimedia indexing.

Philippe Henri Gosselin received the Ph.D. degree in image and signal processing in 2005 (Cergy, France). After 2 years of post-doctoral positions at the LIP6 Lab. (Paris, France) and at the ETIS Lab. (Cergy, France), he joined the MIDI Team in the ETIS Lab as an assistant professor, and was then promoted to full professor in 2012. His research focuses on machine learning for online multimedia retrieval. He developed several statistical tools for dealing with the special characteristics of content-based multimedia retrieval. This includes studies on kernel functions on histograms, bags and graphs of features, as well as weakly supervised semantic learning methods. He is involved in several international research projects, with applications to image, video and 3D object databases.