
PII: S0262-8856(16)30194-9
DOI: 10.1016/j.imavis.2016.11.004
Reference: IMAVIS 3565
To appear in: Image and Vision Computing
Received date: 26 May 2016
Accepted date: 16 November 2016


Sparse Composition of Body Poses and Atomic Actions for Human Activity Recognition in RGB-D Videos


Ivan Lillo^a, Juan Carlos Niebles^b, Alvaro Soto^a


^a Pontificia Universidad Catolica de Chile, 4860 Vicuna Mackenna, Santiago, Chile.
^b Universidad del Norte, Colombia; Stanford University, USA


Abstract


This paper presents an approach to recognize human activities using body poses estimated from RGB-D data. We focus on recognizing complex activities composed of sequential or simultaneous atomic actions characterized by body motions of a single actor. We tackle this problem by introducing a hierarchical compositional model that operates at three levels of abstraction. At the lowest level, geometric and motion descriptors are used to learn a dictionary of body poses. At the intermediate level, sparse compositions of these body poses are used to obtain meaningful representations for atomic human actions. Finally, at the highest level, spatial and temporal compositions of these atomic actions are used to represent complex human activities. Our results show the benefits of using a hierarchical model that exploits the sharing and composition of body poses into atomic actions, and atomic actions into activities. A quantitative evaluation using two benchmark datasets illustrates the advantages of our model to perform action and activity recognition.

Keywords:

Activity recognition, Hierarchical recognition model, RGB-D videos.



Corresponding author

Email addresses: [email protected] (Ivan Lillo), [email protected] (Juan Carlos Niebles), [email protected] (Alvaro Soto)


1. Introduction


Human activity recognition is a key technology for the development of relevant applications such as visual surveillance, human-computer interaction, and video annotation, among others. Consequently, it has received wide attention [1, 2], with a strong focus on the recognition of atomic actions in RGB videos. Recently, the emergence of capable and affordable RGB-D cameras has opened an attractive new scenario for recognizing complex human activities. As a major advantage, RGB-D data facilitates the segmentation of the human body and the identification of body joint positions [3], which are much more difficult to obtain from color images only. In this paper, we build upon this recent success and present an approach for human activity recognition that operates on body poses estimated from color and depth images.

In general, a human activity is composed of a set of atomic actions that can be executed sequentially or simultaneously by different body regions. These compositions can occur spatially and/or temporally, and they can involve interactions with the environment, other people, or specific objects. For instance, people can text while walking, or wave one hand while holding a phone to their ear with the other (Fig. 1(b)). Note that different compositional arrangements of poses and atomic actions can yield different semantics at a higher level. In this work, we leverage this observation and present a hierarchical compositional model for recognizing human activities using RGB-D data. In particular, we focus on activities that can be characterized by the body motions of a single actor.

As a key aspect, our model operates at three levels of abstraction. At the bottom level, it learns a dictionary of representative body pose primitives, such as a particular motion of the torso or an arm. At the mid-level, these body poses are combined to compose atomic actions such as walking or reading. Finally, at the top level, atomic actions are combined to compose complex human activities such as walking while talking on the phone and waving a hand.

The use of intermediate abstraction levels that have a direct semantic interpretation, such as body poses or atomic actions, provides several advantages. At training time, the model can take advantage of labeled data, such as relevant atomic actions, that may be available at an intermediate abstraction level. At test time, it enables the model to


automatically identify useful information, such as the temporal span of atomic actions or the regions of the body executing an action. Explicitly modeling and inferring semantic information at the intermediate levels makes a notable difference with respect to blind compositional models based on deep learning techniques [4]. Additionally, the use of a compositional model can naturally handle scenarios of partial occlusions and pose estimation failures by inferring an appropriate spatial composition of the visible and relevant body regions.

We formulate learning as an energy minimization problem, where structural hierarchical relations are modeled by sub-energy terms that constrain compositions among poses and actions, as well as their spatial and temporal relations. Additionally, the energy function is complemented by regularization terms that foster the inference of a dictionary of body poses that shares discriminative poses among action classes. This enables the model to use small dictionary sizes, reduce over-fitting problems, and improve computational efficiency.

A preliminary version of our model appeared in [5]. Here, we introduce three extensions to the preliminary model, as well as new extensive experimental validation. First, the work in [5] is based only on geometric features extracted from body joint positions; here we explore the addition of feature descriptors based on appearance and motion cues extracted from RGB images. Second, the model in [5] associates all body poses from all video frames to one of the available atomic actions; in this paper, we introduce a new mechanism that identifies and discards non-informative frames. Third, the model in [5] uses a quadratic regularizer to constrain model parameters; we now explore the use of a new regularizer that fosters group sparsity among the coefficients of the atomic action classifiers. These extensions result in improved recognition performance, easier semantic interpretation of the intermediate representations, and increased computational efficiency with respect to the work in [5]. We validate these contributions through extensive experimental evaluation.

The rest of the paper is organized as follows. Sec. 2 reviews relevant prior work. Sec. 3 introduces the details of our model and discusses learning and inference algorithms. Sec. 4 presents empirical evaluations. Finally, Sec. 5 concludes the paper.


2. Related Work


Visual recognition of human activities is a very active topic in the computer vision literature. Recent surveys show the breadth of prior work [1, 2, 6, 7]. Here, we briefly review some of the most relevant previous work that relates to the methods proposed in this paper.

In terms of recognition of composable activities, a number of researchers have tackled this problem by composing actions from low-level representations based on local interest points [8] and modeling their temporal arrangement. Some researchers have extended single-image representations, such as correlatons [9] and spatial pyramids [10], to videos [11], and have applied them to the problem of simple human action categorization. Others have proposed models for decomposing actions into short temporal motion segments [12, 13], but these cannot capture spatial compositions of actions.

Recently, interest in the recognition of human actions and activities from RGB-D videos has increased rapidly [15], mainly due to the availability of capable and inexpensive new sensors. Some methods for low-level feature extraction have been proposed to leverage the 3D information available in RGB-D data [16, 17]. Furthermore, the availability of fast algorithms for human pose estimation [3, 18] from depth images helps to overcome the difficulty and high computational expense of human pose estimation from color images only. This has motivated a significant number of methods that build representations on top of human poses [19, 20, 21]. In addition to body pose features, other researchers have also proposed fusing them with low-level features from color [22] or depth [23]. These methods usually focus on categorizing simple, non-composed activities.

From a learning perspective, our work is related to dictionary learning methods. Early frameworks for dictionary learning focus on vector quantization using k-means to cluster low-level keypoint descriptors [24]. These approaches have spawned algorithmic variations that use alternative quantization methods, discriminative dictionaries, or different pooling schemes [10]. Sparse coding methods have also been used to obtain meaningful dictionaries that achieve low reconstruction error, high recognition rates, and attractive computational properties [25]. In contrast to our approach, these


methods have mostly focused on non-hierarchical cases, where mid-level dictionaries and top-level classifiers are trained independently.

Our model also builds on ideas related to learning classifiers using a discriminative approach and latent variables. In particular, [26] uses a latent SVM scheme to develop an object recognition approach based on mixtures of multiscale deformable part models. [13] extends this model to the case of action recognition. In contrast to our approach, the model in [13] is limited to binary classification problems. Recently, [27] proposed a hierarchical latent variable approach to action recognition that directly considers the multiclass classification case. Their layered model incorporates information about patches, hidden parts, and the action class, where the meaning of the hidden layers is not clear. In contrast, our hierarchical model integrates semantically meaningful information at all layers: poses, actions, and activities. Unlike [27], our model can account for compositions of actions into activities and, as a byproduct, outputs per-body-region and per-frame action classification, so it has the appealing property that mid-level semantics are produced in addition to the final activity classification decision. [23] proposes a model for action recognition in static images, but it is not clear if an extension to spatio-temporal compositions is possible. Similarly to our approach, [27] and [23] rely on a latent SVM scheme for model learning and inference; however, the details of the model formulation and regularization schemes are substantially different from our proposed approach. In this sense, the novelty of our work lies in the proposed model, not in the actual learning/inference schemes.

In terms of hierarchical compositional models, our work is related to recent recognition approaches based on deep learning (DL) [28, 29]. Similarly to DL schemes, our approach also incorporates joint hierarchical estimation of connected layers, spatial pooling schemes, and intermediate representations based on linear filters. DL is usually applied over the raw image representation using several layers of generic structures. As a consequence, DL architectures have a large number of parameters and are usually difficult to train. In contrast, we embed semantic knowledge into our model by explicitly exploiting compositional relations among poses, actions, and activities. This leads to simpler architectures and, during training, enables us to incorporate labeled data at intermediate layers, such as atomic action labels.


In summary, our method tackles some limitations of previous work with a new framework that models spatio-temporal compositions of activities using a hierarchy of three semantic levels. The compositional properties of our model enable it to provide meaningful annotations and to handle occlusions naturally. We discuss the details of our model in the following section.

3. Model Description

In this section, we describe the input data and video representation used in our framework. We then present the main details of our hierarchical model, as well as its learning and inference schemes.

3.1. Video Representation

We represent an input RGB-D video as a sequence of human body poses estimated at each frame. Unlike previous work [19, 30], we do not compute a global pose descriptor for the entire body. Instead, we divide the body pose into R fixed spatial regions and independently compute a pose feature vector for each region. Fig. 3 illustrates the case of R = 4, which we use in all our experiments. Our body pose feature vector consists of the concatenation of the following two descriptors.


Geometric descriptor. At frame t and region r, a descriptor $x^{g}_{t,r}$ encodes geometric information about the spatial configuration of body joints; it is inspired by [31]. As illustrated in Fig. 3, we use a body pose model given by the positions of sixteen joints, which are grouped manually into R = 4 fixed regions. For each region, we compute relative angles among six segments, where each segment is a line that connects a pair of joints; see Fig. 3 for details. We also compute relative angles between each segment and a plane spanned by sets of three joints. This provides a total of 18 dimensions (angles) for the geometric descriptor associated to each region.
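To make the construction concrete, the sketch below computes segment-segment and segment-plane angles from 3D joint positions, which is the kind of quantity the geometric descriptor concatenates. The joint indices, segment pairs, and plane definitions used here are illustrative assumptions; the exact six segments and planes per region follow Fig. 3 and are not reproduced here.

```python
import numpy as np

def segment_angle(p_a, p_b, p_c, p_d):
    """Angle between segment (p_a -> p_b) and segment (p_c -> p_d), in radians."""
    u, v = p_b - p_a, p_d - p_c
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def segment_plane_angle(p_a, p_b, plane_pts):
    """Angle between segment (p_a -> p_b) and the plane spanned by three joints."""
    u = p_b - p_a
    n = np.cross(plane_pts[1] - plane_pts[0], plane_pts[2] - plane_pts[0])
    sin = np.dot(u, n) / (np.linalg.norm(u) * np.linalg.norm(n) + 1e-8)
    return np.arcsin(np.clip(sin, -1.0, 1.0))

def geometric_descriptor(joints, segment_pairs, segment_plane_defs):
    """Concatenate segment-segment and segment-plane angles for one body region.

    joints: (J, 3) array of 3D joint positions of the region.
    segment_pairs: list of ((a, b), (c, d)) joint-index tuples (hypothetical choice).
    segment_plane_defs: list of ((a, b), (i, j, k)) joint-index tuples (hypothetical).
    """
    feats = [segment_angle(joints[a], joints[b], joints[c], joints[d])
             for (a, b), (c, d) in segment_pairs]
    feats += [segment_plane_angle(joints[a], joints[b], joints[[i, j, k]])
              for (a, b), (i, j, k) in segment_plane_defs]
    return np.asarray(feats)

# Toy usage with random joints and made-up segment definitions.
rng = np.random.default_rng(0)
joints = rng.normal(size=(4, 3))                    # four joints of one region
pairs = [((0, 1), (1, 2)), ((1, 2), (2, 3))]        # two segment-segment angles
planes = [((0, 1), (1, 2, 3))]                      # one segment-plane angle
print(geometric_descriptor(joints, pairs, planes))  # 3-dimensional toy descriptor
```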

Motion descriptor. While the geometric descriptor captures body pose configurations, it does not encode information about the dynamics of each body pose. We argue that motion cues are relevant to disambiguate poses that are similar in configuration but move differently. Our motion descriptor $x^{m}_{t,r}$ encodes local motion information around each body joint, and is based on the trajectory descriptor from [32] using the default settings. While the descriptor in [32] can also encode appearance information, our experiments indicate that encoding motion alone, using a histogram of optical flow (HOF), is sufficient to obtain improvements in the recognition accuracy of our model. Furthermore, instead of calculating a dense descriptor as in [32], we only encode motion information around each detected body joint location. Specifically, following the setup in [32], at each joint location we compute a HOF using RGB patches centered at the joint location over a temporal window of 15 frames. This process produces a 108-dimensional descriptor for each joint location. We concatenate these per-joint descriptors to obtain a motion descriptor for each body region. Finally, to reduce dimensionality, we apply Principal Component Analysis (PCA) to transform the concatenated descriptor into a 20-dimensional vector, keeping the dimensionality of our final descriptor relatively low.
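As an illustration of this pipeline, the sketch below bins optical-flow orientations around each joint into a small histogram, concatenates the per-joint histograms, and reduces the result with PCA. It is a simplified stand-in for the trajectory-based HOF of [32] (which uses a 15-frame window and yields 108 dimensions per joint); the patch sizes, bin counts, and helper names here are assumptions for the example only.

```python
import numpy as np

def hof(flow_patch, n_bins=9):
    """Histogram of optical-flow orientations, weighted by flow magnitude.
    flow_patch: (H, W, 2) array with (dx, dy) flow vectors around one joint."""
    dx, dy = flow_patch[..., 0].ravel(), flow_patch[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)                    # orientation in [-pi, pi]
    hist, _ = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

def region_motion_descriptor(joint_flow_patches):
    """Concatenate per-joint HOFs into a raw motion descriptor for one region."""
    return np.concatenate([hof(p) for p in joint_flow_patches])

def fit_pca(X, n_components=20):
    """Plain PCA via SVD; returns (mean, projection matrix)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

# Toy usage: 4 joints per region, random flow patches, 200 training descriptors.
rng = np.random.default_rng(0)
train = np.stack([region_motion_descriptor(rng.normal(size=(4, 15, 15, 2)))
                  for _ in range(200)])
mean, proj = fit_pca(train, n_components=20)
x_reduced = (train[0] - mean) @ proj            # 20-dimensional motion descriptor
print(x_reduced.shape)
```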

3.2. Hierarchical Model

Our hierarchical compositional model spans three semantic levels. At the top level, our model assumes that each activity is composed of a temporal and spatial arrangement of atomic actions. Similarly, at the intermediate level, our model assumes that each atomic action is composed of a temporal and spatial arrangement of body poses. Finally, at the bottom level, our model identifies local body poses using a bank of linear classifiers that are applied to the incoming frame descriptors. Given a query video, our goal is to correctly recognize the activity being performed, while also inferring at each frame the corresponding spatial and temporal arrangement of atomic actions and body poses. We achieve such detailed inference by optimizing an energy function that captures the hierarchical and compositional structure of each activity. In the following, we detail our proposed energy function and present the corresponding learning and inference schemes. To simplify the notation, we first consider the case of representing the human body using only one spatial region (R = 1). Later, we extend the model to the case of R > 1 regions.


Energy function. Given a video D with T frames, we define its energy function as:

$$E(D) = E_{\text{poses}}(Z) + E_{\text{actions}}(Z, V) + E_{\text{activity}}(V) + E_{\text{pose-transition}}(Z) + E_{\text{action-transition}}(V), \quad (1)$$

where Z and V denote pose and atomic action labels, respectively. Here, the energy E(D) is expressed in terms of energy potentials associated to body poses ($E_{\text{poses}}$), atomic actions ($E_{\text{actions}}$), as well as the activity in video D ($E_{\text{activity}}$). Eq. (1) also includes energy potentials to encode temporal transitions between pairs of body poses ($E_{\text{pose-transition}}$) and pairs of atomic actions ($E_{\text{action-transition}}$). In the following, we provide details about each term in Eq. (1).

$E_{\text{poses}}$: At the lowest level of the hierarchy, we encode each frame descriptor $x_t = \{x^g_t, x^m_t\}$, $t \in [1, \dots, T]$, using one out of K body pose primitives. To achieve this, we introduce a latent vector $Z = (z_1, \dots, z_T)$, where component $z_t \in \{1, \dots, K\}$ indicates the entry assigned to frame t from a dictionary of K body poses. As we detail in Sec. 3.3, we obtain the dictionary of body poses by learning a set of K linear classifiers $w_k$ that define the entries of the dictionary. In this way, the score of a candidate body pose i in frame t is given by the dot product between the frame descriptor $x_t$ and the corresponding linear classifier $w_{z_t = i}$.

In contrast to the preliminary version of our model [5], we augment the dictionary of body poses by adding a new mechanism that allows the model to adaptively select informative poses, while discarding irrelevant and noisy ones. We implement this by introducing an additional (K+1)-th entry into the dictionary of body poses, which is associated to a penalty score $\theta$ that we learn during the training stage of the model. Consequently, for an input video, we define the energy potential $E_{\text{poses}}$ associated to the pose assignments Z as the sum of the pose entry scores over all its frames:

$$E_{\text{poses}}(Z) = \sum_{t} \left[ \sum_{k=1}^{K} w_k^{\top} x_t\, \delta(z_t = k) + \theta\, \delta(z_t = K+1) \right], \quad (2)$$

where $\delta(\ell) = 1$ if $\ell$ is true and $\delta(\ell) = 0$ if $\ell$ is false. Note that every frame t is associated with a single dictionary entry given by the corresponding pose entry $z_t$. Intuitively, a high energy score $E_{\text{poses}}$ indicates that the pose descriptors $x_t$ are highly consistent with the assignment Z to the dictionary of body poses.
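A minimal sketch of Eq. (2) is shown below, assuming 0-based pose labels where index K denotes the additional non-informative entry; the array shapes and toy data are illustrative, not the authors' implementation.

```python
import numpy as np

def pose_energy(X, Z, W, theta):
    """E_poses of Eq. (2): sum of pose-classifier scores over frames.

    X: (T, d) frame descriptors; Z: (T,) pose assignments in {0..K}, where
    K (the last index) is the non-informative entry; W: (K, d) pose classifiers;
    theta: scalar penalty score for non-informative frames."""
    K = W.shape[0]
    informative = np.einsum('td,td->t', W[np.minimum(Z, K - 1)], X)
    scores = np.where(Z < K, informative, theta)
    return scores.sum()

# Toy usage.
rng = np.random.default_rng(0)
T, d, K = 8, 6, 3
X = rng.normal(size=(T, d))
W = rng.normal(size=(K, d))
Z = rng.integers(0, K + 1, size=T)   # value K marks the (K+1)-th "discard" entry
print(pose_energy(X, Z, W, float(rng.normal())))
```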


$E_{\text{actions}}$: At the second level of the hierarchy, we measure the compatibility between the inferred pose assignments Z and the set of A mid-level atomic actions. To do this, we introduce an assignment vector $V = (v_1, \dots, v_T)$, where component $v_t \in \{1, \dots, A\}$ indicates the atomic action label assigned to frame t. To aggregate evidence for an atomic action a, we build a histogram $h^a(Z, V)$ calculated over the body pose assignments Z that are associated by V to atomic action a. Specifically, for an atomic action a in an input video, the k-th entry of its histogram is $h^a_k(Z, V) = \sum_{t=1}^{T} \delta(z_t = k)\,\delta(v_t = a)$. To quantify the compatibility between each atomic action a and the observed evidence $h^a(Z, V)$, we use a set of training videos to train a linear classifier to identify each atomic action. Specifically, let $\beta_a = (\beta_{a,1}, \dots, \beta_{a,K+1})$ be the coefficients of the resulting linear classifier to identify atomic action $a \in \{1, \dots, A\}$; we obtain a compatibility score between atomic action a and its corresponding histogram $h^a(Z, V)$ by computing the dot product $\beta_a^{\top} h^a(Z, V)$. By aggregating this score over all candidate actions in a video, we obtain the energy potential $E_{\text{actions}}$ associated to the action assignment V:

$$E_{\text{actions}}(Z, V) = \sum_{t=1}^{T} \sum_{a=1}^{A} \sum_{k=1}^{K+1} \beta_{a,k}\, \delta(z_t = k)\, \delta(v_t = a). \quad (3)$$

Intuitively, a high energy score in Eq. (3) indicates a high degree of consistency between the selected pose and atomic action labels in the input video. In this work, we assume that the set of atomic action labels for each training video is available at training time.

$E_{\text{activity}}$: At the third level of the hierarchy, we use the histogram of atomic actions h(V), accumulated over all T frames, to build a representation for the activity being performed. Each entry a in h(V) is given by $h_a(V) = \sum_{t=1}^{T} \delta(v_t = a)$, and the energy term for the activity level is given by:

$$E_{\text{activity}}(V) = \alpha_y^{\top} h(V) = \sum_{a=1}^{A} \sum_{t=1}^{T} \alpha_{y,a}\, \delta(v_t = a), \quad (4)$$

where $\alpha_y \in \mathbb{R}^A$ is the linear classifier for activity class y. This energy captures the compatibility between the atomic actions in the action histogram h(V) and the activity class y. A high score for $E_{\text{activity}}$ means that the atomic actions present in the video and their time span are consistent with the global activity y.
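The following sketch accumulates the per-action pose histograms $h^a(Z, V)$ and evaluates the scores of Eqs. (3) and (4) for one candidate activity classifier; labels are 0-based and all data are toy values, so this is only an illustration of how the two energy terms are computed from the assignments.

```python
import numpy as np

def action_activity_energy(Z, V, beta, alpha_y, K, A):
    """E_actions (Eq. 3) and E_activity (Eq. 4) from pose/action assignments.

    Z: (T,) pose labels in {0..K} (K = non-informative entry);
    V: (T,) atomic-action labels in {0..A-1};
    beta: (A, K+1) atomic-action classifiers over pose histograms;
    alpha_y: (A,) activity classifier for the candidate activity class y."""
    h = np.zeros((A, K + 1))                 # h[a, k] = #frames with pose k and action a
    for t in range(len(Z)):
        h[V[t], Z[t]] += 1
    e_actions = np.sum(beta * h)             # sum_a beta_a . h^a(Z, V)
    e_activity = alpha_y @ h.sum(axis=1)     # alpha_y . h(V)
    return e_actions, e_activity

# Toy usage.
rng = np.random.default_rng(1)
T, K, A = 10, 4, 3
Z = rng.integers(0, K + 1, size=T)
V = rng.integers(0, A, size=T)
beta = rng.normal(size=(A, K + 1))
alpha_y = rng.normal(size=A)
print(action_activity_energy(Z, V, beta, alpha_y, K, A))
```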


$E_{\text{pose-transition}}$ and $E_{\text{action-transition}}$: Departing from an orderless representation, we incorporate energy potentials that model local temporal co-occurrences between assignments of poses and atomic actions in neighboring frames. Specifically, let coefficients $\eta_{k,k'} \in \mathbb{R}$ and $\gamma_{a,a'} \in \mathbb{R}$ quantify the co-occurrence strength between a neighboring pair of poses (k, k') and a neighboring pair of actions (a, a'), respectively. The energy potentials for pose and atomic action transitions in Eq. (1) are given by:

$$E_{\text{pose-transition}}(Z) = \sum_{t=1}^{T-1} \sum_{k=1}^{K+1} \sum_{k'=1}^{K+1} \eta_{k,k'}\, \delta(z_t = k)\, \delta(z_{t+1} = k') \quad (5)$$

$$E_{\text{action-transition}}(V) = \sum_{t=1}^{T-1} \sum_{a=1}^{A} \sum_{a'=1}^{A} \gamma_{a,a'}\, \delta(v_t = a)\, \delta(v_{t+1} = a') \quad (6)$$

So far, we have defined our model for the case R = 1. In practice, our model uses R = 4 semantic regions, as shown in Fig. 3. Therefore, we extend Eqs. (2), (3), (4), (5), and (6) to the case of multiple regions R as follows:

$$E_{\text{poses}}(Z) = \sum_{r,t} \left[ \sum_{k=1}^{K} w_k^{r\top} x_{t,r}\, \delta(z_{t,r} = k) + \theta_r\, \delta(z_{t,r} = K+1) \right] \quad (7)$$

$$E_{\text{actions}}(Z, V) = \sum_{r=1}^{R} \sum_{a=1}^{A} \beta_a^{r\top} h^a(Z_r, V_r) \quad (8)$$

$$E_{\text{activity}}(V) = \sum_{r=1}^{R} \alpha_y^{r\top} h(V_r) \quad (9)$$

$$E_{\text{pose-transition}}(Z) = \sum_{r,k,k'} \sum_{t=1}^{T-1} \eta^{r}_{k,k'}\, \delta(z_{t,r} = k)\, \delta(z_{t+1,r} = k') \quad (10)$$

$$E_{\text{action-transition}}(V) = \sum_{r,a,a'} \sum_{t=1}^{T-1} \gamma^{r}_{a,a'}\, \delta(v_{t,r} = a)\, \delta(v_{t+1,r} = a') \quad (11)$$

3.3. Learning

The goal of learning is to obtain the optimal parameters for our energy function in Eq. (1), so that it can be used to correctly classify new activity videos. In particular, given a set of training videos D with corresponding atomic action labels V and activity labels Y, we look for pose assignments Z(D) and energy parameters $[\alpha, \beta, w, \gamma, \eta, \theta]$ that maximize the energy function corresponding to the true assignment of atomic action and activity labels in the training videos.

We cast our learning formulation as an error minimization problem, resembling a hinge loss. Our goal is to learn all relevant parameters simultaneously. The input to our training algorithm is a set of videos D, where each video $D_i$ contains labels for the activity $y_i$ and the atomic actions $V_i$, and its set of T frames is described by the set of feature vectors $X_i = \{x_1, \dots, x_T\}$. Note that the atomic action labels $V_i$ are region dependent; for instance, the right arm could be executing the atomic action "drinking", while at the same time both legs could be executing the atomic action "walking", and the left arm could be in a resting position or "idle". This setup enables not only a temporal, but also a spatial composition of atomic actions. The optimization model is stated as:

$$\min_{W} \; \frac{1}{2}\|W_{-\beta}\|_2^2 + \frac{\mu}{2}\|\beta\|_2^2 + \rho\|\beta\|_1 + \frac{C}{|D|}\sum_{i=1}^{|D|}\xi_i, \quad (12)$$

where

$$\xi_i = \max_{Z,V,y}\left\{E(X_i, Z, V, y) + \Delta((y_i, V_i), (y, V)) - \max_{Z_i} E(X_i, Z_i, V_i, y_i)\right\}, \quad i \in [1, \dots, |D|], \;\; \beta_j \ge 0. \quad (13)$$

Eq. (12) corresponds to an Elastic Net regularizer [33] on the atomic action classifiers ($\beta$) that favors classifiers with a sparse set of values in their coefficients. Eq. (13) expresses the slack variables $\xi_i$, where each slack $\xi_i$ quantifies the error of the inferred labeling for the corresponding video $D_i$. It is important to note that the sparsity constraint on the coefficients of the atomic action classifiers implies, at the next level of the hierarchy, that each activity is also influenced by a reduced number of poses, encouraging the learning of poses specialized to the identification of certain activities.

The previous optimization minimizes errors in the training set by identifying parameter values such that the total energy provided by the assignment of the correct labels must be higher than the one provided by the assignment of any alternative configuration. The formulation also enforces a margin between such labelings by introducing a loss function $\Delta$, which penalizes incorrect labelings at the activity and atomic action levels. In our implementation, we favor predicting the correct labels as follows:

$$\Delta((y_i, V_i), (y, V)) = \lambda_1\, \delta(y \neq y_i) + \frac{\lambda_2}{RT} \sum_{r=1}^{R} \sum_{t=1}^{T} \delta(v_{t,r} \neq v_{(t,r)_i}) \quad (14)$$

Note that by selecting a large value of $\lambda_1$, we assign a large penalty to incorrect activity labelings. With $\lambda_2$, the loss penalizes incorrect atomic action labelings $V \neq V_i$ proportionally to the total number of frames that are incorrectly labeled.
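A small sketch of the loss in Eq. (14) is given below; it assumes per-region, per-frame action labels stored as an R x T array and uses the λ values reported later in Sec. 4.2 purely as an example.

```python
import numpy as np

def label_loss(y_true, V_true, y_pred, V_pred, lam1=100.0, lam2=20.0):
    """Loss of Eq. (14): activity mismatch plus the fraction of mislabeled
    per-region, per-frame atomic-action labels.

    V_true, V_pred: (R, T) arrays of per-region, per-frame action labels."""
    R, T = V_true.shape
    activity_term = lam1 * float(y_pred != y_true)
    action_term = (lam2 / (R * T)) * np.sum(V_pred != V_true)
    return activity_term + action_term

# Toy usage with lambda1 = 100 and lambda2 = 20.
rng = np.random.default_rng(2)
V_true = rng.integers(0, 5, size=(4, 30))
V_pred = V_true.copy()
V_pred[0, :10] = (V_pred[0, :10] + 1) % 5      # corrupt 10 frame labels in one region
print(label_loss(0, V_true, 1, V_pred))
```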

3.4. Learning Algorithm

By using the Latent Structured SVM (LSSVM) framework [34], Eq. (12) can be solved with the Concave-Convex Procedure (CCCP) [35], which guarantees convergence to a local minimum. The CCCP algorithm alternates between maximizing Eq. (12) with respect to the latent variables and solving a structural SVM (SSVM) optimization problem [36] that treats the latent variables as completely observed¹. We rewrite the energy terms in Eq. (13) in terms of a linear classifier over a structured vector, i.e., $E(X, Z, V, y) = W^{\top}\psi(X, Z, V, y)$. This allows us to formulate Eq. (12) as an LSSVM problem, which is solved using the cutting-plane algorithm [37]. The solution to this LSSVM problem implies the sequential iteration of two main steps. The first step consists of inferring, for each input video, the corresponding latent variables Z. To achieve this, we solve:

$$Z_i^{*} \mid W = \max_{Z}\left\{W^{\top}\psi(X_i, Z, V_i, y_i)\right\}, \quad (15)$$

where the subscript i indicates the true activity and set of atomic action labels. Then, given the latent variable assignment $Z_i = Z_i^{*}$, we solve the following SSVM optimization problem:

$$\min_{W,\xi} \; \frac{1}{2}\|W_{-\beta}\|_2^2 + \frac{\mu}{2}\|\beta\|_2^2 + \rho\|\beta\|_1 + \frac{C}{|D|}\sum_{i=1}^{|D|}\xi_i \quad (16)$$

subject to:

$$W^{\top}\left(\psi_i(X_i, Z_i, V_i, y_i) - \psi_i(X_i, Z, V, y)\right) \ge \Delta_i(y, V) - \xi_i \quad \forall y \in Y,\; Z \in \mathcal{Z},\; V \in \mathcal{V},\; \xi_i \ge 0,$$

where $\Delta_i(y, V) = \Delta((y_i, V_i), (y, V))$. It is worth highlighting that, given $Z_i$, the optimization problem in Eq. (16) generates linear classifiers that maximize the energy function corresponding to the known activity and action labels annotated in the training set. In a test video, we do not know in advance the pose labels Z nor the atomic action labels V, so both must be inferred. In both cases, each pose label $z_t$ depends on the pose dictionary entry $w_{z_t}$, the feature descriptor of the frame $x_t$, the atomic action label $v_t$, and the pose and atomic action labels $z_{t-1}$ and $v_{t-1}$ associated to the previous frame. Note that the labeling of poses is done jointly with the activity and atomic action labeling of the complete video; we can interpret this behavior as a contextual priming for the poses. Algorithm 1 summarizes the learning process.

¹A more detailed formulation can be found at http://web.ing.puc.cl/∼ialillo/activities (IVC 2016 supp.pdf).

Algorithm 1 Learning algorithm using the one-slack formulation.
1:  procedure LEARN W
2:    Z ← Z_0
3:    t ← 0, W ← W_0
4:    repeat
5:      Z^t ← Z
6:      WS ← ∅                        (working set of constraints)
7:      M ← number of videos
8:      j ← 0
9:      repeat
10:       W_j ← W
11:       for i = 1 : M do
12:         Find (y_hat_i, V_hat_i, Z_hat_i) of the most violated constraint of video i given W_j
13:       end for
14:       Δψ_bar ← average margin of ψ(X_i, y_i, V_i, Z_i^t) and ψ(X_i, y_hat_i, V_hat_i, Z_hat_i) given W_j
15:       δ_bar ← average loss function
16:       Add the new constraint W_j^T Δψ_bar ≥ δ_bar − ξ to the working set WS
17:       W ← argmin_w { Ω(w) + Cξ  s.t. constraints in WS }
18:       j ← j + 1
19:     until [violation in working set] < ε
20:     for i = 1 : M do
21:       Find Z_i = argmax_z { W^T ψ(X_i, y_i, V_i, z) }
22:     end for
23:     t ← t + 1
24:   until assignments Z are (almost) the same as Z^t
25: end procedure

3.5. Inference

The input to the inference algorithm is a new video sequence with features X. The task is to infer the best activity label $\hat{y}$, the best atomic action labels $\hat{V}$, and the best pose assignments $\hat{Z}$, i.e., we solve:

$$\hat{y}, \hat{V}, \hat{Z} = \operatorname*{argmax}_{y, V, Z} \; E(X, Z, V, y) \quad (17)$$

As in the training procedure, we can solve this by using a dynamic programming approach that exhaustively enumerates all values of the activity y, and solves for the atomic action assignments $\hat{V}$ and pose assignments $\hat{Z}$ using:

$$\hat{V}, \hat{Z} \mid y = \operatorname*{argmax}_{V, Z} \sum_{t=1}^{T} \Big[ \alpha_{y,v_t} + \beta_{v_t, z_t} + w_{z_t}^{\top} x_t\, \delta(z_t \le K) + \theta\, \delta(z_t = K+1) + \gamma_{v_{t-1}, v_t} + \eta_{z_{t-1}, z_t} \Big]. \quad (18)$$

Using dynamic programming, Eq. (18) can be solved efficiently.

AC CE

PT

ED

We validate the effectiveness of our proposed model by considering two experimental scenarios. First, we test the ability of our approach to discriminate simple actions on a benchmark dataset. Second, we test the performance of our model in recognizing complex and composable human activities. We also study the contribution of key components of our model and their impact in recognition performance. In particular, we highlight the contribution of the extensions of our framework with respect to its preliminary version in [5]. 4.1. Recognition of simple actions

While our main goal is the recognition of human activities that can be composed of simpler atomic actions, we experimentally verify that the model can also handle the recognition of atomic actions. Towards this goal, we evaluate our algorithm on the MSR-Action3D dataset [38], which consists of 10 subjects performing 20 simple atomic actions related to gaming in front of a TV. This dataset provides pose estimation data (joint locations) and low resolution depth maps. Our hierarchical model need two levels of annotations, for activities and atomic actions. For simple action videos like in MSR-Action3D dataset, only one level of 14

ACCEPTED MANUSCRIPT

94.3% 92.9%

97.2% 95.5% 93.3% 94.9% 89.8% – – – 95.3% – –

94.6% 87.0% 93.6% – – – 83.9% – –

AS3

Setup 2

99.1% 99.1% 95.5% 100.0% 97.0% – – – 98.2% – –

Average Accuracy 95.4% 97.2% 94.5% 94.0% 93.5% – – – 92.5% – –

93.0% 96.7% – – 93.6% 95.6% 93.1% 90.2% 89.5% 89.5% 88.2%

T

Setup 1

RI P

AS2

SC

Ours J. Luo et al. [39] Du et al. [29] Y. Zhu et al. [22] Tao and Vidal [21] Lu & Tang [40]* Yang & Tian [41]* C. Wang et al.[42] Vemulapalli et al. [20] Lillo et al. [5] J. Wang et al. [23]

AS1

NU

Algorithm

MA

Table 1: Recognition accuracy rates in the MSR-Action3D dataset for our approach and previous methods. * indicates the use of depth features instead of 3D pose estimation data.

AC CE

PT

ED

We therefore augment the annotations by creating an intermediate level of A + 1 atomic actions, with A the number of activities, adding an idle action to this level. The atomic actions are related to the time span when the activity is being performed: we detect idle poses in all frames via a simple heuristic and assign the remaining frames the corresponding activity label. In this way, the complete video is assigned a single activity label (the original annotation), while only a subset of frames and regions are assigned an atomic action at the intermediate level.

In our experiments, we use a total of 557 sequences, as in [23]. To facilitate comparison with previous work, we test our approach using two different setups. Setup 1: as in [38], we split the dataset into three subsets, each containing 8 actions, independently calculate the recognition accuracy in each subset, and report the average accuracy over the three subsets. Setup 2: as in [23], we test our model using all 20 action categories at once. As in previous work, we use cross-subject validation in all our experiments, with subjects 1-3-5-7-9 assigned to the training set and the rest to the testing set. In our evaluation, we use K = 100 poses for each body region, for a total of 400 pose dictionary entries.

Table 1 reports the recognition accuracy of our model on the MSR-Action3D dataset using Setups 1 and 2. Accuracy is measured by the average of the diagonal of the normalized confusion matrix. We can observe that, although designed for complex activity recognition, our model can also achieve competitive results in the task of single atomic action recognition.

Table 2: Recognition accuracy of our approach in MSR-Action3D: i) using the geometric descriptor only (GEO), ii) combining geometric and motion descriptors (GEO/MOV), and iii) adding a mechanism to identify and discard non-informative poses (NI).

Algorithm | GEO   | GEO/MOV | GEO/MOV + NI
Ours      | 89.5% | 90.6%   | 93.0%


An important aspect of these results is that our model achieves this performance using a total of 400 pose dictionary entries (100 entries per body region). This compares favorably with the results reported in [23] and [42], which use dictionaries with thousands of entries. Our reduced dictionary translates into compact representations that provide more meaningful interpretations and require less computation at the inference stage.

Table 2 reports the performance of our model under several settings. First, we report performance when the model uses a quadratic regularizer combined with the geometric descriptor (GEO) described in Section 3.1. This corresponds to the setup used in [5]. We also consider the case when we combine the geometric descriptor with the motion descriptor described in Section 3.1 (GEO/MOV). Additionally, we consider the case when the model includes the method to identify and discard non-informative frames (NI).

4.2. Recognition of composable activities

In order to test the suitability of our model for recognizing complex and composable activities, we use the Composable Activities benchmark dataset from [5]. This dataset consists of 694 RGB-D videos that contain activities in 16 classes performed by 14 actors. Each RGB-D sequence is captured using a Microsoft Kinect sensor and is provided with the positions of relevant body joints estimated using [18]. Each activity in this dataset is spatio-temporally composed of a variable number of mid-level (atomic) actions. Every actor performs each activity 3 times on average. The total number of actions in the videos is 25 (plus an idle action), while the number of actions that compose each particular activity ranges between 2 and 10. Fig. 4 summarizes the composition of atomic actions for each activity; note that activities such as Composed Activity 4 can be composed of up to 10 atomic actions. The RGB-D data and annotations for this dataset are publicly available².

²http://web.ing.puc.cl/∼ialillo/ActionsCVPR2014/

Table 3: Recognition accuracy of our method compared to several baselines (see Section 4.1). It is noteworthy that our 3-level model outperforms all 2-level models. Also, including motion cues in the descriptor (GEO/MOV) and using non-informative pose handling (NI) improve the accuracy over our previous model. The best performance is obtained when using all the contributions described in this work.

Algorithm   | GEO   | GEO/MOV
Ours, SR+NI | −     | 92.2%
Ours, QR+NI | −     | 91.8%
Ours, SR    | 84.9% | 90.6%
Ours, QR    | 85.7% | 90.9%
BoW         | 67.2% | 74.1%
HMM         | 76.5% | 78.9%
H-BoW       | 74.2% | 82.4%
2-lev-HIER  | 79.6% | 83.8%
LG [20]     | 74.7% | −

Recognition rates for the Composable Activities dataset are summarized in Table 3. We report performance averaged over multiple runs using a leave-one-subject-out cross-validation strategy. We use a validation set to experimentally adjust the values of all the main parameters. In practice, we set λ1 = 100 and λ2 = 20. We set K = 50 pose dictionary entries per body region when using the quadratic regularizer (QR), and K = 150 per body region when using the proposed sparse regularizer (SR). Also, we use fixed parameter values µ = 0.1 and ρ = 5 for SR, and µ = 1 and ρ = 0 for QR. We also reduce the temporal resolution by a factor of 6 for faster processing, so the effective frame rate for all videos in training and recognition is 5 fps.

Table 3 puts our recognition results in context by comparing them to the performance of two baseline methods, two simplified versions of our model, and a state-of-the-art algorithm. The first baseline is a bag-of-words model (BoW), which only captures very coarse per-region pose orderings and uses an independently pre-trained pose dictionary. Specifically, this baseline uses k-means to quantize pose descriptors for each body region independently, which are aggregated into a temporal pyramid histogram representation. This is then fed into a multi-class linear SVM that maps directly from video descriptors to activities. The accuracy of this baseline is 74.1%


when using the combined descriptor based on geometric and motion information.

As a second baseline, we implement a Hidden Markov Model (HMM). The HMM can directly encode pose and action transitions built upon an independently pre-trained pose dictionary. In our implementation, states are trained with supervision by assigning one state to each atomic action. Quantized poses are the observed variables. We train models independently for each class and, at testing time, we classify new sequences by assigning the label that corresponds to the highest-scoring model. The accuracy of this baseline is 78.9% when using the combined descriptor based on geometric and motion information. While the ordering encoded by the HMM helps to improve accuracy over BoW, it still lacks the discriminative power provided by the joint learning of mid- and top-level representations, as performed by our proposed model.

We also compare performance against two simplified versions of our hierarchical model. The first simplified version (H-BoW) does not jointly learn the pose dictionary, but uses a fixed pose quantization obtained with k-means, and omits the transition terms. Unlike our full model, this simplified version does not take advantage of jointly learning the pose dictionary, which leads to sub-optimal pose encoding at the lower level of the hierarchy. Also, by omitting the transition terms, the model cannot capture patterns in the evolution of actions and poses. These simplifications lead to a 10% drop in performance in comparison to our full model. As a second simplification of our full model, we construct a hierarchical model with only two coupled layers (2-lev-HIER) that are jointly trained to encode poses and atomic actions. In this simplified model, activity recognition is performed by an independently trained linear classifier that operates on top of the inferred atomic actions. In this case, the performance of this model simplification is 8.4% lower than our full model. This indicates the clear benefit of jointly learning the mid-level representations and the top-level activity classifier.

We also compare against an existing state-of-the-art algorithm from the literature. In this case, we compare to the method recently described in [20] (LG). We select this algorithm because it achieves state-of-the-art performance on several pose-based action datasets. We train this model to directly predict the activities from poses, omitting the


mid-level annotations, as in the BoW baseline. While the accuracy of LG is above BoW, it is still 11% lower than our model that only uses geometric information (GEO).

The confusion matrix obtained with our full model is reported in Fig. 5. Note that for some activities the prediction is perfect, while for others there is high confusion between some activities. Large confusion may be caused by highly similar poses, for instance between calling with hands and waving hand, where many actors perform the calling with hands action with only one arm.

As a main advantage, in addition to the high recognition performance, our model also generates a rich video interpretation in the form of detailed per-frame and per-body-region action annotations. Fig. 6 shows the action labels associated to each body part in a test video from the Composable Activities dataset. This example illustrates the capability of our model to correctly identify, spatially and temporally, the main body parts that are involved in the execution of a given action. To illustrate the semantic interpretation of the poses learned by our model, Fig. 7 shows top-scoring frames for three activities executed by different subjects. In general, our model produces highly interpretable poses that are associated to characteristic body configurations of the underlying atomic actions. To further illustrate this observation, Fig. 8 shows the highest activations for eight pose dictionary entries associated to the body region corresponding to the left arm. In each case, Fig. 8 also indicates the atomic action that assigns the greatest relevance (weight) to the corresponding pose.

4.3. Impact of motion descriptor

Tables 2 and 3 compare the performance of the geometric (GEO) and combined (GEO/MOV) descriptors under different model configurations using the MSR-Action3D and Composable Activities datasets. Since the MSR-Action3D dataset does not include RGB information, we modify the motion descriptor presented in Section 3.1. Specifically, we encode differences of displacements for every joint, in a setup similar to [32]. By incorporating this motion descriptor, we achieve a recognition accuracy of 90.6% in the MSR-Action3D dataset, reducing the error rate by 1.1% with respect to the model presented in [5]. In the case of the Composable Activities dataset, we use the motion descriptor presented in Section 3.1. By incorporating this descriptor, we achieve a recognition accuracy of 90.9% in the Composable Activities dataset when using QR, reducing the error rate by 5.2% with respect to the model presented in [5].

4.4. Impact of handling non-informative poses


During learning, we need to initialize the set of candidate frame poses that can be considered as NI. In practice, we initialize the NI poses by using the initial pose dictionary obtained with k-means, and selecting as NI the poses that are most distant from their assigned cluster centers. For each video, we initially assign a total of 20% of the frames to the NI bucket. As learning progresses, on each iteration poses can be reassigned to a pose dictionary entry or to NI. In general, we observe that after convergence approximately 17% of all training frames are assigned as non-informative. When we initialize the NI assignment with 40% of the frames, our final model reduces this to 19%. This high degree of robustness with respect to the initialization indicates that the model is effectively learning to detect non-informative frames. Moreover, nearly 40% of the initial pose assignments are updated throughout the learning iterations. We can compare this with the version of the model in [5], which does not include non-informative pose handling, and where only 25% of the initial pose assignments are updated throughout the learning iterations. This indicates that our full model is capable of updating the pose representations more effectively.

In terms of accuracy, the use of NI reduces the error rate by 1.8% and raises recognition accuracy from 90.6% to 92.4% in the MSR-Action3D dataset. In the Composable Activities dataset, the introduction of NI reduces the error rate by 1.4% and raises recognition accuracy to 92.2%.

An important ability of the NI mechanism is that occluded regions are often assigned as NI poses. Fig. 9 shows a sequence of the activity Walking while reading. In this figure, the bottom graph shows with black boxes the frames where a body region is identified by our method as corresponding to a non-informative pose. Observing the body region corresponding to the arms, the long sequences of non-informative poses nearly coincide with the occlusion periods of the arms (thick gray lines). Other frames considered non-informative tend to be sparser in time, and they can be explained by rare poses or noisy body joint estimation. This behavior is advantageous in two ways: during learning, it allows the model to automatically disregard many occluded regions when learning the pose classifiers; and during testing, it allows the model to identify possible occluded regions.
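A sketch of this initialization is shown below: frames whose descriptors are farthest from their assigned k-means centroid are marked as non-informative. The squared-distance criterion and quantile cutoff are assumptions used for illustration; only the 20% initialization fraction comes from the text.

```python
import numpy as np

def initial_ni_assignment(X, centers, ni_fraction=0.2):
    """Assign each frame to its nearest pose centroid, then mark the ni_fraction
    of frames farthest from their centroid as non-informative (entry K).

    X: (T, d) frame descriptors; centers: (K, d) k-means centroids."""
    K = centers.shape[0]
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (T, K)
    Z = d2.argmin(axis=1)
    dist = d2[np.arange(len(X)), Z]
    cutoff = np.quantile(dist, 1.0 - ni_fraction)
    Z[dist > cutoff] = K                    # K = index of the non-informative entry
    return Z

# Toy usage.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
centers = rng.normal(size=(5, 8))
Z0 = initial_ni_assignment(X, centers)
print("fraction initialized as NI:", np.mean(Z0 == 5))
```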

4.5. Impact of sparse regularizer


We test the effects of sparse regularization on β using the Composable Activities dataset, since it provides richer hierarchical annotations. In the experiments, we use the same parameter values described in Section 4.2. We observe that when using SR only 8 to 11% of the coefficients of β remain non-zero. On average, this means a reduction of 85% in the number of non-zero terms with respect to the use of QR. In terms of accuracy, as we see in Table 3, recognition performance using SR is similar to the case of QR, decreasing the error rate by 0.4% when using the fused descriptor and NI handling, and slightly increasing the error when these modifications are not applied. Despite the similar recognition rates provided by both regularizers, we identify at least three relevant advantages of using the SR setup:


i) Increased robustness to overfitting when using a sparse regularizer. Figure 10 shows recognition accuracy as a function of the pose dictionary size K when using QR and SR in the Composable Activities dataset. We observe that when using QR, the model clearly overfits as the dictionary grows larger than K = 50. In contrast, using the SR setup, the accuracy tends to increase, or at least hold, when larger dictionaries are used. This behavior shows that the model using SR is better suited for larger recognition problems, since the learned poses are more specialized and the model shows an increased robustness to overfitting.

ii) Fewer poses influence each activity. One of the goals of using a sparse regularizer

is to decrease the number of entries from the pose dictionary that are used by each action classifier. Since few pose dictionary entries are required to explain each action (and therefore each activity), this encourages every action to rely on a smaller group of poses, which usually helps to improve the generalization power of the model, the


semantic interpretability of the poses, and the efficiency of the pose dictionary. To illustrate this, Fig. 11 shows the influence of the pose dictionary entries related to the right arm body region across all activities. The left diagram reports the influence in the QR setup, while the bottom diagram corresponds to the SR setup. We measure this influence by relating the coefficients of β, corresponding to the relevant poses for each action classifier, and the coefficients of α, which measure the influence of the atomic actions over the activities. To facilitate the visualization, we binarize the influence using a threshold corresponding to the top 5% of influence values. For a fair comparison, K = 50 is used in both setups, and we use the same initial pose labels Z to make the poses comparable. We can observe that for the QR setup each activity is influenced by many poses (and therefore each pose entry has activations in many activities), without a clear pattern of pose activations. In this setup, on average 17.8 pose entries influence each activity. In contrast, for the SR setup, on average only 8.6 poses influence each activity, showing a sparse pattern. Since the pose dictionary and atomic action classifiers are shared by all activities, the grouping effect is straightforward: activities that share a common action will also share the action's poses. As an example, if we observe the activations for the three activities that include the atomic action Talking on cellphone, many common poses are shared by these three activities and few are not, indicating a grouping effect. Moreover, for activities that do not share atomic actions, the poses tend to also be non-shared. We omit the idle case from the analysis to emphasize the influence of the actually annotated actions.

iii) Improved semantic interpretation of poses. The SR setup helps to improve the

semantic meaning of the learned poses. As illustrated by Fig. 11, the learned poses are related to the actions involved in the activities. As an example, consider poses 1 to 5: we can observe that under the SR setup these poses mainly influence the activities Composed Activity 4, Waving hand and drinking, and Walking while waving hand. All these activities are strongly related to configurations or motions of the right arm body region. In fact, they are related to the waving hand action, since it is the only atomic action that appears in all of these activities. As can also be noticed in Fig. 11, a similar semantic interpretation is not possible under the QR setup. Note that, for clarity, we


omit the idle action when computing the influence scores (all activities share the idle action), that is why some rows in Fig. 11 are all white (not influential to any activity).


4.6. Inference of per-frame annotations.


The hierarchical structure and compositional properties of our model enable it to output a predicted global activity, as well as per-frame annotations of predicted atomic actions and poses for each body region. We highlight that the generation of the per-frame annotations requires no prior temporal segmentation of atomic actions. Also, no post-processing of the output is performed. The ability of our model to produce per-frame annotations, enabling temporal and spatial detection of atomic actions, is a key advantage. Fig. 6 illustrates the capability of our model to provide per-frame annotation of the atomic actions that compose each activity.

The accuracy of the mid-level action prediction can be evaluated as in [43]. Specifically, we first obtain segments of the same predicted action in each sequence, and then compare these segments with ground-truth action labels. The estimated label of a segment is assumed correct if the detected segment is completely contained in a ground-truth segment with the same label, or if the Jaccard index between the segment and the ground-truth segment is greater than 0.6. Using these criteria, the accuracy of the mid-level actions is 79.4%. In many cases, the wrong action prediction is only highly local in time or space, and the model is still able to correctly predict the activity label of the sequence. Considering only the correctly predicted videos in terms of global activity prediction, the accuracy of action labeling reaches 83.3%. When considering this number, it is important to note that not every ground-truth action label is accurate: the videos were hand-labeled by volunteers, so there is a chance for mistakes in the exact temporal boundaries of an action. In this sense, in our experiments we observe cases where the predicted labels show more accurate temporal boundaries than the ground truth.
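The segment-level evaluation described above can be sketched as follows; the run-length segmentation and Jaccard computation are straightforward, and the containment-or-Jaccard>0.6 rule mirrors the criterion in the text, while the helper names and toy labels are our own.

```python
import numpy as np

def segments(labels):
    """Split a per-frame label sequence into (label, start, end) runs (end exclusive)."""
    out, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            out.append((labels[start], start, t))
            start = t
    return out

def jaccard(a, b):
    """Temporal Jaccard index between two (start, end) intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def segment_accuracy(pred, gt, thr=0.6):
    """A predicted segment is correct if it is fully contained in a ground-truth
    segment with the same label, or overlaps one with Jaccard index > thr."""
    pred_segs, gt_segs = segments(pred), segments(gt)
    correct = 0
    for lab, s, e in pred_segs:
        correct += any(g_lab == lab and
                       ((s >= gs and e <= ge) or jaccard((s, e), (gs, ge)) > thr)
                       for g_lab, gs, ge in gt_segs)
    return correct / len(pred_segs)

# Toy usage: slightly shifted predicted boundaries still count as correct.
gt = np.array([0] * 10 + [1] * 10 + [2] * 10)
pred = np.array([0] * 9 + [1] * 12 + [2] * 9)
print(segment_accuracy(pred, gt))
```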

4.7. Robustness to occlusion and noisy joints.

Our method is also capable of inferring action and activity labels even if some joints are not observed. This is a common situation in practice, as body motions induce


temporal self-occlusions of body regions. Nevertheless, due to the joint estimation of poses, actions, and activities, our model is able to reduce the effect of this problem. To illustrate this, we simulate a totally occluded region by fixing its geometry to the position observed in the first frame. We select the region to be completely occluded in every sequence using uniform sampling. In this scenario, the accuracy of our preliminary model in [5] drops by 7.2%. Using our new SR setup including NI handling, the accuracy only drops by 4.3%, showing that the detection of non-informative poses helps to reduce the effect of occluded regions. In fact, as we show in Section 4.4, many of the truly occluded regions in the videos are identified using NI handling. In contrast, the drop in performance is 12.5% for BoW and 10.3% for HMM. This is expected, as simpler models are less capable of robustly dealing with occluded regions, since their pose assignments rely only on the descriptor itself, while in our model the assigned pose depends on the descriptor, the sequences of poses and actions, and the activity being evaluated, making inference more robust. Fig. 12 shows some qualitative results for cases displaying occluded regions.

In terms of noisy joints, we add random Gaussian noise to the 3D joint locations of the testing videos, using the SR setup and the GEO descriptor to isolate the effect of the joints without mixing in the motion descriptor. Fig. 13 shows the accuracy on the testing videos as a function of the noise dispersion $\sigma_{\text{noise}}$, measured in inches. For small noise levels, there is little effect on the accuracy of our model, as expected due to the robustness of the geometric descriptor. However, for more drastic noise levels, the accuracy drops dramatically. This behavior is expected, since for highly noisy joints the model can no longer predict well the sequence of actions and poses.

5. Conclusions and Future Work

We present a novel hierarchical compositional model to recognize human activities using RGB-D data. The proposed method is able to jointly learn suitable representations at different abstraction levels, leading to compact and robust models, as shown by the experimental results. In particular, our model achieves multi-class discrimination while providing useful annotations at the intermediate semantic level. The compositional


capabilities of our model also bring robustness to partial body occlusions. Through our experiments, we show how a fused descriptor, composed of geometric features and motion descriptors, improves the accuracy of activity prediction by over 5% compared to the geometric descriptor alone on the Composable Activities dataset. Moreover, the model makes efficient use of the learned body poses by imposing sparsity over the coefficients of the mid-level (atomic action) representation. We observe that this capability produces specialization of poses at the higher levels of the hierarchy. Recognition accuracy using sparsity over mid-level classifiers is similar to the case of using a quadratic regularizer; however, the semantic interpretation of the output of the model is improved, since interactions of pose classifiers with atomic action and activity classifiers are more efficient.

There are several research avenues for future work. In particular, during training our current model requires annotated data at the level of atomic actions, which can be problematic for a large-scale application. An improvement could be to treat action labels as latent variables and use only a list of possible action labels for every activity. Also, for real-time video recognition, we would need to include inference with respect to the temporal position and span of each activity, which can also be considered as latent variables. Finally, as we mentioned before, our model can be extended to the case of identifying the composition of novel activities that are not present in the training set.

References

[1] J. K. Aggarwal, M. S. Ryoo, Human activity analysis, ACM Computing Surveys 43 (3) (2011) 16:1–16:43.

[2] D. Weinland, R. Ronfard, E. Boyer, A survey of vision-based methods for action representation, segmentation and recognition, Journal of Computer Vision and Image Understanding 115 (2) (2011) 224–241.

[3] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake, Real-time human pose recognition in parts from a single depth image, in: Communications of the ACM, 2011, pp. 116–124.

[4] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.

[5] I. Lillo, A. Soto, J. C. Niebles, Discriminative hierarchical modeling of spatio-temporally composable human activities, in: CVPR, 2014, pp. 812–819.

[6] L. Lo Presti, M. La Cascia, 3D skeleton-based human action classification: A survey, Pattern Recognition 53 (2015) 130–147.
[7] M. Ziaeefard, R. Bergevin, Semantic human activity recognition: A literature review, Pattern Recognition 48 (8) (2015) 2329–2345.
[8] I. Laptev, On Space-Time Interest Points, International Journal of Computer Vision 64 (2-3) (2005) 107–123.
[9] S. Savarese, J. Winn, A. Criminisi, Discriminative object class models of appearance and shape by correlatons, in: CVPR, 2006, pp. 2033–2040.
[10] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: CVPR, 2006, pp. 2168–2178.
[11] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: CVPR, 2008, pp. 1–8.
[12] A. Gaidon, Z. Harchaoui, C. Schmid, Temporal Localization of Actions with Actoms, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11) (2013) 2782–2795.
[13] J. C. Niebles, C.-W. Chen, L. Fei-Fei, Modeling temporal structure of decomposable motion segments for activity classification, in: ECCV, 2010, pp. 392–405.
[14] H. Chen, G. Wang, J.-H. Xue, L. He, A Novel Hierarchical Framework for Human Action Recognition, Pattern Recognition, doi:10.1016/j.patcog.2016.01.020.
[15] J. K. Aggarwal, L. Xia, Human activity recognition from 3D data: A review, Pattern Recognition Letters 48 (2014) 70–80.
[16] O. Oreifej, Z. Liu, HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences, in: CVPR, 2013, pp. 716–723.
[17] J. Luo, W. Wang, H. Qi, Spatio-temporal feature extraction and representation for RGB-D human action recognition, Pattern Recognition Letters 50 (1) (2014) 139–148.
[18] Microsoft, Kinect for Windows SDK (2012).
[19] V. Escorcia, M. A. Davila, M. Golparvar-Fard, J. C. Niebles, Automated vision-based recognition of construction worker actions for building interior construction operations using RGBD cameras, in: Construction Research Congress, 2012, pp. 879–888.
[20] R. Vemulapalli, F. Arrate, R. Chellappa, Human action recognition by representing 3D skeletons as points in a Lie group, in: CVPR, 2014, pp. 588–595.
[21] L. Tao, R. Vidal, Moving Poselets: A Discriminative and Interpretable Skeletal Motion Representation for Action Recognition, in: IEEE International Conference on Computer Vision Workshops, 2015, pp. 61–69.
[22] Y. Zhu, W. Chen, G. Guo, Fusing spatiotemporal features and joints for 3D action recognition, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 486–491.
[23] J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in: CVPR, 2012, pp. 1290–1297.
[24] G. Csurka, C. R. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: Workshop on Statistical Learning in Computer Vision, ECCV, 2004, pp. 1–2.
[25] Y. Boureau, F. Bach, Y. LeCun, J. Ponce, Learning mid-level features for recognition, in: CVPR, 2010, pp. 2559–2566.
[26] P. Felzenszwalb, D. Mcallester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: CVPR, 2008, pp. 1–8.
[27] Y. Wang, G. Mori, Hidden part models for human action recognition: Probabilistic versus max margin, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (7) (2011) 1310–1323.
[28] A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097–1105.
[29] Y. Du, W. Wang, L. Wang, Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition, in: CVPR, 2015, pp. 1110–1118.
[30] J. Sung, C. Ponce, B. Selman, A. Saxena, Unstructured human activity detection from RGBD images, in: ICRA, 2012, pp. 842–849.
[31] C. Chen, Y. Zhuang, F. Nie, Y. Yang, F. Wu, J. Xiao, Learning a 3D Human Pose Distance Metric from Geometric Pose Descriptor, IEEE Transactions on Visualization and Computer Graphics 17 (11) (2010) 1676–1689.
[32] H. Wang, A. Klaser, C. Schmid, C.-L. Liu, Action recognition by dense trajectories, in: CVPR, 2011, pp. 3169–3176.
[33] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2) (2005) 301–320.
[34] C. Yu, T. Joachims, Learning structural SVMs with latent variables, in: ICML, 2009, pp. 1169–1176.
[35] A. Yuille, A. Rangarajan, The concave-convex procedure, Neural Computation 15 (4) (2003) 915–936.
[36] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun, Support vector machine learning for interdependent and structured output spaces, in: ICML, 2004, pp. 104–111.
[37] T. Joachims, T. Finley, C. Yu, Cutting-plane training of structural SVMs, Machine Learning 77 (1) (2009) 27–59.
[38] W. Li, Z. Zhang, Z. Liu, Action recognition based on a bag of 3D points, in: CVPR, 2010, pp. 9–14.
[39] J. Luo, W. Wang, H. Qi, Group sparsity and geometry constrained dictionary learning for action recognition from depth maps, in: ICCV, 2013, pp. 1809–1816.
[40] C. Lu, C.-k. Tang, Range-Sample Depth Feature for Action Recognition, in: CVPR, 2014, pp. 772–779.
[41] X. Yang, Y. Tian, Super Normal Vector for Activity Recognition Using Depth Sequences, in: CVPR, 2014, pp. 804–811.
[42] C. Wang, Y. Wang, A. L. Yuille, An approach to pose-based action recognition, in: CVPR, 2013, pp. 915–922.
[43] P. Wei, N. Zheng, Y. Zhao, S.-C. Zhu, Concurrent action detection with structural prediction, in: ICCV, 2013, pp. 3136–3143.

Highlights

- A novel hierarchical model to recognize human activities using RGB-D data is proposed.
- The method jointly learns suitable representations at different abstraction levels.
- The model achieves multi-class discrimination providing useful mid-level annotations.
- The compositional capabilities of our model also bring robustness to body occlusions.
