Journal Pre-proof

Double-layer conditional random fields model for human action recognition

Tianliang Liu, Xiaodong Dong, Yanzhang Wang, Xiubin Dai, Quanzeng You, Jiebo Luo
PII: S0923-5965(19)30064-5
DOI: https://doi.org/10.1016/j.image.2019.115672
Reference: IMAGE 115672
To appear in: Signal Processing: Image Communication
Received date: 29 January 2019
Revised date: 19 October 2019
Accepted date: 19 October 2019

Please cite this article as: T. Liu, X. Dong, Y. Wang et al., Double-layer conditional random fields model for human action recognition, Signal Processing: Image Communication (2019), doi: https://doi.org/10.1016/j.image.2019.115672.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier B.V.
Manuscript Number: IMAGE_2019_54
Manuscript Title: Double-Layer Conditional Random Fields Model for Human Action Recognition
Research highlights can be listed as follows:
► We formulate a double-layer conditional random fields model with higher-order dependencies and intermediate variable representations for the action recognition task. An augmented top layer with global variables is added to capture higher-order dependencies of the target states, while intermediate representation variables are introduced to explicitly model deeper intermediate representations within the target states, constructing the underlying structure of the target states themselves.
► We derive an efficient and exact model inference technique for our designed graph, which can be decomposed into a top linear-chain CRF model and a bottom linear-chain CRF model that are solved efficiently using dynamic programming.
► We suggest that block-coordinate primal-dual Frank-Wolfe (BCFW) optimization with a gap sampling approach be employed to learn the assumed DL-CRFs model parameters effectively and efficiently in a structured support vector machine framework.
Double-Layer Conditional Random Fields Model for Human Action Recognition
Received: 29/Jan/2019 / Revised: 19/Oct/2019 / Accepted: 19/Oct/2019
Tianliang Liu1 · Xiaodong Dong1 · Yanzhang Wang1 · Xiubin Dai2 · Quanzeng You3 · Jiebo Luo4

Abstract The conditional random fields (CRFs) model, as one of the most successful discriminative approaches, has received renewed attention recently for human action recognition. However, existing CRFs formulations typically have limited capability to capture higher-order dependencies among the given states and deeper intermediate representations within the target states, which are potentially useful and significant for modeling complex action recognition scenarios. In this paper, we present a novel double-layer CRFs (DL-CRFs) model for human action recognition in a graphical model framework. In the problem formulation, an augmented top layer, as a high-level, global variable, is designed in the DL-CRFs model, with a global perception perspective, to acquire higher-order dependencies between the target states. Meanwhile, we exploit additional intermediate variables to explicitly perceive the intermediate representations between the target states and observation features. We then propose to decompose the DL-CRFs model into two parts, namely the top linear-chain CRFs model and the bottom one, in order to ease inference both during the parameter learning phase and at test time. Lastly, the assumed DL-CRFs model parameters can be learned with a block-coordinate primal-dual Frank-Wolfe algorithm with a gap sampling scheme in a structured support vector machine framework. Experimental results and discussions on two public benchmark datasets demonstrate that the proposed approach performs better than other state-of-the-art methods on several evaluation criteria.

Tianliang Liu: [email protected]
Xiaodong Dong · Yanzhang Wang · Xiubin Dai: {dongxd, wangyz, daixb}@njupt.edu.cn
Quanzeng You: [email protected]
Jiebo Luo (SPIE/IAPR/IEEE/ACM/AAAI Fellow): [email protected]

1 College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, 210003 Nanjing, Jiangsu, China.
2 School of Geography and Biological Information, Nanjing University of Posts and Telecommunications, 210046 Nanjing, Jiangsu, China.
3 Computer Vision Team, Microsoft Cloud + AI, Redmond, WA, USA.
4 Department of Computer Science, University of Rochester, Rochester, NY 14627, USA.
Keywords Action recognition · Double-layer CRFs · Graphical model · RGB-D video · Structured SVM
1 Introduction

Human action recognition is a general and fundamental problem in computer vision. It can enable robots, as companions, to help people in their daily lives, and there also exist other applications such as video surveillance, patient monitoring systems, etc. Among the many challenges in providing an overall understanding of RGB-D video scenes captured by RGB-D sensors, the core task in most of these applications is to predict the simple or atomic actions of a person, which constitute the high-level activities. Human actions refer to the atomic movements of a certain person that involve at most one object in the environment, such as opening the microwave, reaching food, moving food, placing food, and closing the microwave [1]. In contrast, high-level activities refer to a complete sequence composed of different actions in a certain order [2]. Microwaving food, shown in Figure 1, can be thought of as a high-level activity composed of the aforementioned actions.
Fig. 1 An illustration of the relationship between human actions and a high-level activity: an RGB-D video of the microwaving-food activity, composed of the actions opening microwave, reaching food, moving food, placing food, and closing microwave.
The visual recognition of transitive actions comprising human-object interactions is a key component for artificial systems operating in natural environments. For real tasks in human-robot interaction (HRI), we are more concerned with recognizing actions, which can enable robots, as companions, to execute the next action that the person is going to perform. Most existing techniques consider human action recognition as a sequential prediction problem. Early work [39] exploits hypergraph matching with dynamic programming for the task of action recognition. More generally, probabilistic graphical models can be applied to model the human actions to be classified in practical applications such as smart-home and robotics scenarios. The broad family of probabilistic graphical models can be divided into two categories based on the nature of the learning method: generative models [3, 4] and discriminative models [2, 5, 6]. Generative models require making certain assumptions about the prior distribution p(x), including its complex dependencies. This generative strategy may be risky or unsuitable, since the assumed joint distribution may not reflect the real and complex dependencies among the input data. In contrast, discriminative models directly model the posterior probability p(y|x), regardless of how the input data are distributed. This intuition provides a natural way to fuse multi-modal or structured input data for action recognition, and also admits efficient and effective model inference.
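To make the generative/discriminative distinction concrete, here is a toy sketch (not from the paper; the data, the Gaussian class-conditional assumption, and the logistic weights are all illustrative): a generative model fits p(x|y) and p(y) and applies Bayes' rule, while a discriminative model scores p(y|x) directly, without assuming how x is distributed.

```python
import math

# Toy 1-D data: feature x for two action classes y in {0, 1}.
data = [(0.2, 0), (0.4, 0), (0.6, 0), (1.4, 1), (1.6, 1), (1.8, 1)]

# Generative route: fit p(x|y) as a Gaussian per class plus a prior p(y),
# then obtain p(y|x) via Bayes' rule -- this requires assuming a form for p(x|y).
def fit_gaussian(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

params = {y: fit_gaussian([x for x, yy in data if yy == y]) for y in (0, 1)}
prior = {y: sum(1 for _, yy in data if yy == y) / len(data) for y in (0, 1)}

def generative_posterior(x):
    joint = {y: gauss_pdf(x, *params[y]) * prior[y] for y in (0, 1)}
    z = joint[0] + joint[1]
    return {y: joint[y] / z for y in (0, 1)}

# Discriminative route: model p(y=1|x) directly (here a fixed logistic score),
# with no assumption about how x itself is distributed.
def discriminative_posterior(x, w=4.0, b=-4.0):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

print(generative_posterior(1.5))      # class 1 dominates
print(discriminative_posterior(1.5))
```

Both routes agree on the label here; the difference is only in what must be assumed and estimated along the way.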
In this work, we aim to address the sequential prediction problem of action recognition from an RGB-D video that captures complex behaviors. The key challenge is to simultaneously capture and efficiently integrate higher-order dependencies of the target states and intermediate representations within the given states in a uniform and hierarchical framework. Figure 2 shows the proposed double-layer conditional random fields (DL-CRFs) model framework for the action recognition task. The given DL-CRFs model is inspired by Figure 1 and previous work [32], where we utilize the basic idea of the CRFs model to recognize the sequence of actions and augment intermediate representation variables. For simplicity, we introduce intermediate variables to explicitly model the intermediate representations within the target states, such as human-object interaction [8, 9, 33]. According to real human habits, there exist few dependencies between high-level activities, e.g., microwaving food and taking medicine; we can consider such activities to be independent of each other. However, the activity of microwaving food is composed of the aforementioned actions in a certain order, as shown in Figure 1. This hints that the sequence of human actions within a certain high-level activity should have intrinsic, higher-order dependencies. We therefore add an augmented top layer with high-level, global variables to capture the intrinsic, higher-order dependencies of the actions in a certain high-level activity.
The linear-chain conditional random fields (CRFs) model, as we know, is one of the most successful discriminative approaches for many applications; it is a log-linear model representing the conditional distribution of the observation labels. To some extent, linear-chain CRFs models are generally effective and efficient, since they may capture certain interactions between the target states, and exact inference is tractable and can be implemented efficiently. However, these models are deficient in explicitly capturing intermediate structures within the target states. Meanwhile, these models neglect higher-order dependencies of the target states, since they can only capture interactions over one or a few timesteps, which are potentially useful in real-life applications.
In order to devise an effective and efficient inference method that applies exact inference in the designed graph, we suggest that the DL-CRFs model be divided into two components, the bottom linear-chain CRFs and the top linear-chain CRFs, which can be treated as two linear-chain CRFs models. The potential functions can then encode the additional constraints on the intermediate representations within the target states and the higher-order dependencies of the target states, based on the given model parameters. To effectively and efficiently label a sequence of human actions for a given video with high-level activities, we present a structural graphical model with a double-layer structure to achieve both spatial and temporal consistency for the human action recognition task. We then extensively and comprehensively evaluate the proposed DL-CRFs approach on two publicly available 3D action benchmark datasets, the Cornell Activity Dataset-60 (CAD-60) [3] and the Cornell Activity Dataset-120 (CAD-120) [2]. Compared with several traditional and highly related approaches, our experimental results demonstrate that the proposed DL-CRFs model outperforms the state-of-the-art approaches [2, 3, 10, 32].
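As a rough illustration of why a linear-chain decomposition admits exact inference, here is a generic Viterbi dynamic-programming sketch over hypothetical unary and transition score tables (this stands in for chain-structured decoding in general, not for the paper's actual potentials):

```python
# Generic Viterbi decoding for a linear-chain model: given per-step unary
# scores and pairwise transition scores, dynamic programming recovers the
# exact highest-scoring label sequence in O(T * K^2) time.
def viterbi(unary, transition):
    T, K = len(unary), len(unary[0])
    score = [unary[0][:]]               # best score ending in each label
    back = []                           # backpointers for path recovery
    for t in range(1, T):
        row, ptr = [], []
        for k in range(K):
            best_prev = max(range(K), key=lambda j: score[-1][j] + transition[j][k])
            row.append(score[-1][best_prev] + transition[best_prev][k] + unary[t][k])
            ptr.append(best_prev)
        score.append(row)
        back.append(ptr)
    # Trace back the optimal path from the best final label.
    k = max(range(K), key=lambda j: score[-1][j])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return list(reversed(path))

# Tiny example with 3 timesteps and 2 labels.
unary = [[2.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
transition = [[0.5, 0.0], [0.0, 0.5]]
print(viterbi(unary, transition))   # [0, 1, 1]
```

Each of the two chains in the decomposition can be decoded exactly by such a recursion, which is what keeps overall inference tractable.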
Fig. 2 An illustration of our proposed double-layer CRFs model for the human action recognition task. y_t denotes a predicted action label; h_t and o_t are augmented intermediate variables, representing the atomic human pose and the object interacting with the human, respectively; x_a is the global variable in the augmented top layer.

Our contributions can be summarized as follows.

– We formulate a double-layer conditional random fields model with higher-order dependencies and intermediate variable representations for the action recognition task. An augmented top layer with global variables is added to capture higher-order dependencies of the target states, while intermediate representation variables are introduced to explicitly model deeper intermediate representations within the target states, constructing the underlying structure of the target states themselves.
– We derive an efficient and exact model inference technique for our designed graph, which can be decomposed into a top linear-chain CRF model and a bottom linear-chain CRF model that are solved efficiently using dynamic programming.
– We suggest that block-coordinate primal-dual Frank-Wolfe (BCFW) optimization with a gap sampling approach be employed to learn the assumed DL-CRFs model parameters effectively and efficiently in a structured support vector machine framework.

Our work is quite different from previous approaches. The target states in our DL-CRFs model are human actions belonging to a certain high-level activity, and we are more concerned with their intermediate representations and higher-order dependencies. Firstly, without defining latent variables, we explicitly model the intermediate state variable representations, such as the human-object interaction in [8, 9], and learn them from the training dataset. Secondly, we add an augmented top layer with global variables to capture the intrinsic, higher-order dependencies of the target states. Thirdly, we can divide our designed graph into two linear-chain CRFs models, the bottom CRFs and the top one, so that exact inference is tractable and efficiently implemented using dynamic programming. In such a way, the assumed DL-CRFs model parameters are learned and relaxedly optimized for the task of human action recognition in a structured SVM framework [29]. Some parts of this work have appeared in a preliminary conference paper [32]; the present manuscript introduces a new double-layer conditional random fields model for human action recognition. The remainder of the paper is structured as follows. Section II gives a categorized overview of related works, which differ from our work basically in factors such as methodologies and datasets. Section III formulates the recognition task with the designed DL-CRFs model and the given potential functions. Sections IV and V introduce model inference and learning, respectively. Section VI presents the experimental results and discussion.
2 Related Work

2.1 Generative model and discriminative model

The first methodology divides existing techniques into discriminative models and generative models based on the nature of the learning method. Human activity recognition is an important integral component of HRI [15], and has attracted much attention recently. Early works mainly prefer generative graphical models, e.g., hidden Markov models (HMMs) [3, 16], dynamic Bayesian networks (DBNs) [17, 18] and semi-Markov models [9, 19]. Since the features of generative and discriminative models have been discussed in the introduction, more recent approaches attempt to employ discriminative probabilistic models [5] and moderate-order CRFs models, and also consider adding latent variables to exploit the underlying structure of the target states. Here, we
focus on reviewing most of the related works that adopt discriminative models for action recognition.

The linear-chain CRFs model is one of the most popular discriminative models among the probabilistic graphical models. Ninghang Hu et al. augmented an additional layer of latent variables in the linear-chain CRFs model [1, 23]. They learn the latent variables directly from the training data, which gives the model more flexibility so that it can be applied to more complex data. They resolve the model inference problem by reforming the defined structured graph into a set of cliques, so that exact inference can be executed and implemented using a dynamic programming technique. However, these linear-chain CRFs models with introduced latent variables are still linear models and not expressive enough to model the high-level dependencies.

Ju Yong Chang proposed a non-parametric feature-matching-based CRF model for gesture recognition and its learning strategy [20], which utilizes the skeletal joint position feature, the skeletal joint distance feature, and appearance features that correspond to the left and right hands. Inspired by this work, we utilize not only the skeletal joint features but also the objects' features to resolve the human action recognition task within the framework of the improved CRFs model in our work. Then, Natraj Raman brought together HCRFs and non-parametric models in a fully Bayesian context to address the action classification problem [21]. However, the related existing models do not represent well the different semantic intervals of human actions occurring in or associated with the high-level activity. Shih-Yao Lin presented a temporal pyramid representation to express an action with various semantic granularities, which can be thought of as hidden states that enrich the CRFs model to capture the implicit structure of the input features [22]. However, this temporal pyramid model for the human action recognition task could not explicitly model or embed the intermediate representations within the target states for the underlying structure construction of the target states themselves in the given CRFs model.

2.2 Single-layer model and multiple-layer model

The second methodology divides related works into single-layer models and multiple-layer models in terms of their hierarchical structure. Depending on the complexity and duration of the high-level activities, the hierarchical framework of the designed model for the activity recognition task can be separated into two categories, single-layer models and multiple-layer models. Single-layer models [24–27] directly recognize human actions from the given data without any action hierarchy definition. Generally, these actions are both simple and short, such as walking, waiting, falling, jumping and waving. However, in the real world, activities are not always as simple as these basic actions. For example, the preparing-breakfast activity may consist of multiple actions, such as opening the fridge, getting salad and making coffee. Generally speaking, multiple related actions have the high-level intrinsic dependence discussed in the introduction. In general, a multiple-layer recognition model should be considered to represent and model the sub-level actions and high-level activity [1, 10, 28].

Sung et al. proposed a hierarchical maximum entropy Markov model for the action recognition task, in which the related actions are considered as hidden nodes that are learned implicitly to detect the activities from RGB-D videos [3]. Due to the limitations of the Markov property, it is not possible to model the high-order correlation of simple behaviors under the high-level behavior framework. Then, Koppula et al. presented an interesting loopy CRFs-based approach to model the temporal and spatial interactions between humans and objects, which models both human actions and object affordances as random variables [2]. However, this approach cannot capture the global information that is important for modeling high-level activities over sub-actions. Hu et al. then proposed a hierarchical CRFs approach that jointly estimates the actions and activities from RGB-D videos [10]. This overall linear model has limited capabilities to capture deeper intermediate representations within the target states and higher-order dependence between the given states, which are potentially useful and significant in the modeling of complex action recognition scenarios. Additionally, Wang et al. presented a latent hierarchical model (LHM) to describe the decomposition of the assumed high-level activity into sub-actions in a hierarchical way [14]. In terms of efficiency, the training of the latent hierarchical model needs to be improved further [14].

In terms of the type of driven pattern, human action recognition approaches with multiple-layer models can be classified into model-centric recognition techniques and data-centric recognition methods. Generally, the data-centric multiple-layer recognition methods, using deep learning solutions and long short-term memory (LSTM), make it difficult to render the related black-box models explainable or interpretable, and require huge amounts of training data. In contrast, our presented double-layer recognition method, belonging to the model-centric pattern, has an explicit and obvious physical meaning and mechanism, and does not need too large a training dataset to learn the model parameters.
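As background for the learning discussion in Section 2.3, the structured hinge loss minimized by an SSVM can be sketched generically; the feature map, toy label space and Hamming task loss below are illustrative assumptions, and the exhaustive loss-augmented decoding would be replaced by dynamic programming in practice:

```python
import itertools

# Generic structured hinge loss for one training pair (x, y):
#   loss = max_{y'} [ Delta(y, y') + w . phi(x, y') ] - w . phi(x, y),
# where the max is the "loss-augmented" decoding step.
def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def hamming(y_true, y_pred):
    return sum(a != b for a, b in zip(y_true, y_pred))

def structured_hinge(w, phi, x, y_true, label_space):
    # Exhaustive loss-augmented decoding (real systems use DP instead).
    aug = lambda y: hamming(y_true, y) + dot(w, phi(x, y))
    y_hat = max(label_space, key=aug)
    return max(0.0, aug(y_hat) - dot(w, phi(x, y_true)))

# Toy problem: sequences of length 2 over labels {0, 1}; the (hypothetical)
# feature map just counts label matches with the observation sequence.
phi = lambda x, y: [sum(a == b for a, b in zip(x, y))]
label_space = list(itertools.product((0, 1), repeat=2))
print(structured_hinge([0.5], phi, (0, 1), (0, 1), label_space))  # 1.0
print(structured_hinge([1.0], phi, (0, 1), (0, 1), label_space))  # 0.0
```

With the smaller weight the margin is violated and the loss is positive; with the larger weight the correct sequence scores high enough and the loss vanishes.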
2.3 Model learning and structured prediction

The structured support vector machine (SSVM), which generalizes the classical binary SVM to problems with structured outputs, can be applied to train conditional random fields models, and is one of the most popular learning objectives for structured prediction tasks [33, 37]. A novel class of margin losses for training conditional random fields was presented in [36] for the purpose of learning CRF models; the training losses there are categorized into margin-based losses and likelihood-based losses. In our opinion, the L2-regularized L1-slack structured SVM can be considered to learn the CRF model. The minimization of the regularized structured hinge loss on the labeled training set is one of the most important components of the SSVM method. Multiple optimization techniques have been applied to solve this problem, including cutting-plane methods [30, 33, 37] and stochastic subgradient methods [11, 38], among others [13].

For structured prediction tasks, the recent BCFW approach [11] achieves superior performance and is currently one of the state-of-the-art algorithms for SSVMs. The BCFW approach belongs to the family of randomized block-coordinate methods and works on block-separable convex compact domains, in contrast with the traditional (batch) Frank-Wolfe algorithm. The BCFW optimization method for solving the SSVM objective is more effective than earlier approaches such as the cutting-plane [37] and stochastic sub-gradient [38] methods. Distinctive features of the BCFW technique for SSVMs include optimal step-size selection, leading to the absence of a step-size parameter; convergence guarantees for the primal objective; and the ability to compute the duality gap as a stopping criterion. Notably, the duality gap obtained by BCFW can be written as a sum of block gaps, where each block of dual variables corresponds to one training example. The improved BCFW with gap sampling [13] exploited this property in multiple ways. First, the standard uniform sampling of training examples at each iteration was replaced with an adaptive non-uniform sampling, to accelerate the convergence of the optimization process for the structured SVM objective. Second, motivated by the fact that batch algorithms based on pairwise and away steps have linear convergence rates [12], whereas the convergence of standard Frank-Wolfe is sublinear, pairwise and away steps of Frank-Wolfe were extended to the block-coordinate setting [13]; the original BCFW optimization method can guarantee convergence and reduce computation during the iterations, but it does not make significant progress when solving the maximization oracle. Third, the oracle calls were cached, with a gap-based criterion for calling the oracle (cache miss vs. cache hit); when the oracle operation is computationally expensive, e.g., in the case of cutting-plane methods [37], caching the oracle calls was shown to deliver significant speed-ups.

2.4 Human action/activity datasets

To verify the effectiveness of human action recognition methods, we introduce the related experimental datasets in terms of computational modality, modularity and compositionality. Tej Singh et al. outlined different types of video datasets and highlighted their merits and demerits under practical considerations in [40].

Computational modality in 3D action. Based on the available information inside each dataset, the datasets that include 3D cues for human action understanding can be categorized into RGB-D with depth and 3D points. Traditional 3D action recognition approaches depend on hand-crafted features to capture spatio-temporal information, and one dominant alternative is the skeleton-based approach. For example, the CAD-60 dataset [3] was provided to detect unstructured human activity from RGB-D images captured with Kinect cameras, in which human skeleton sequences with joint positions and specific actions were performed in 5 different environments. The CAD-120 dataset, with 10 high-level activities and 10 sub-activity labels with 15-joint skeleton data, contains over 60,000 RGB-D video frames of activities performed by 4 subjects [2]. Owing to its intra-class variations and choices of action classes, the MSR-Daily Activity dataset [42] is one of the most challenging benchmarks for human action recognition; it contains 16 types of activities with skeletons of 20 joint positions. Although most of the activities involve human-object interactions, there are no ground-truth labels of the interactions with which to explicitly augment the global variables of the top augmented layer in our pipeline. Additionally, the MSR-Action 3D dataset [43] was captured using a depth sensor similar to Kinect for action recognition based on a bag of 3D points; it contains 20 actions with joint-position skeletons in 3D data only, in a single modality excluding human-object interactions.

To the best of our knowledge, the two public 3D action datasets CAD-60 and CAD-120 are the most suitable for parsing a complete sequence composed of different actions guided by high-level activities and explicit intermediate representations, facing the challenging problem of complex action recognition scenarios.

Modularity and compositionality. Modular approaches are important for data efficiency in visual understanding. Since the subjects interacting with the objects
generally perform the related actions only when given a high-level activity instruction, there can be real-world execution varieties in the videos for different persons in different environments across several human action or activity datasets. The recent neural graph matching (NGM) network framework [45] exploits compositionality to learn to recognize a previously unseen 3D action class with only a few examples on the CAD-120 dataset [2], where a set of entities and their interactions are used to describe the action. Guo et al. [45] focused on evaluating the action or sub-activity labels (e.g., reaching, moving, placing) and their combination with objects in the scene (e.g., bowl, milk, and microwave). The inherent structure in 3D data can be naturally leveraged to provide modularity for the 3D representation, and thus leads to more effective few-shot learning in [45]. The fine-grained interactions with the objects make classification challenging in the few-shot setting of [45]. However, such human-object interactions are generally beneficial for explicitly modeling deeper intermediate representations within the target states, which construct the underlying structure of the target states themselves in our CRFs model formulation.

The larger MPII Cooking Activity database 2.0 was presented to evaluate the tasks of fine-grained and composite activity recognition, using hand-centric visual features and textual script data with the available annotations (activities, objects, human pose, text descriptions) [41]. Natural-language textual information was utilized in [41] to enhance the recognition of fine-grained and composite activities using hand-centric visual features. However, geometric spatial information in 3D data was not considered sufficiently to boost the recognition performance in [41]. Although the larger MPII-2 dataset includes composed actions with sub-activities in a temporal ordering [41], its RGB videos with various complexity and available natural-language annotations do not contain the spatial information from the depth channel of the captured scene that is inherent in 3D data, with which to geometrically encode the additional intermediate variables respectively representing the atomic human pose and the interacting objects. The method in [41] adopted RGB video and the textual modality to recognize fine-grained and composite activities, which differentiates it, in terms of the computational modality of the input, from our work using purely RGB-D visual features. Note that the unavailability of depth data for the MPII Cooking 2 dataset is no longer a hard limitation, since high-quality 3D poses can now be retrieved from RGB data (e.g., with LCR-Net).

3 Model Formulation

The proposed double-layer CRFs model for human action recognition is illustrated in a graphical model framework in Figure 2. From a local perception perspective, we let x = {x_1, ..., x_t, ..., x_T} ∈ X be the input sequence of visual observation features in the bottom layer, where T denotes the total number of temporal segments, or actions, in the recognized video. From a global perception perspective, we let x_a be the augmented input variable in the top layer, which can be assumed to be a global feature extracted from the input features x, or a global structural attribute obtained from the whole video, to enhance the overall representation ability according to practical requirements. Our goal is to estimate or predict the most likely underlying action label sequence y = {y_1, ..., y_t, ..., y_T} ∈ Y from the input visual evidence x and the augmented global variable x_a. We also define h = {h_1, ..., h_t, ..., h_T} ∈ H and o = {o_1, ..., o_t, ..., o_T} ∈ O to be intermediate representation variables in the assumed DL-CRFs model, where the variable h_t denotes the human pose and the variable o_t represents the object interacting with the human in the t-th temporal segment of the video sequence.

3.1 Objective Function

When the input sequence of visual observation evidence x and the augmented global variable x_a ∈ X_a are given, we can model the conditional probability of the output sequence y of the predefined action labels by:

p(y | x, x_a; ω) = Σ_{h∈H, o∈O} p(y, h, o | x, x_a; ω)
                = Σ_{h∈H, o∈O} e^{Ψ(y, h, o, x, x_a; ω)} / Σ_{y∈Y, h∈H, o∈O} e^{Ψ(y, h, o, x, x_a; ω)}    (1)

The given potential function Ψ(y, h, o, x, x_a; ω) ∈ R, parameterized by the vector of assumed model parameters ω, measures the compatibility between the sequence of target action labels y, the augmented top input variable x_a, the bottom input observation features x, and the assumed intermediate latent variables of the mid-level state representation, h and o, using spatial information.
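To make Eq. (1) concrete, the following brute-force sketch evaluates the marginalized softmax over a tiny state space; the potential function psi here is a hypothetical stand-in for Ψ, and real inference would exploit the chain structure instead of enumeration:

```python
import math
import itertools

# Brute-force evaluation of Eq. (1): p(y | x, xa) is a softmax over joint
# potentials, with the intermediate variables h and o summed out.
def conditional_prob(y, x, xa, psi, Y, H, O):
    T = len(x)
    def mass(y_seq):
        # Sum over all joint assignments of the intermediate variables.
        return sum(
            math.exp(psi(y_seq, h, o, x, xa))
            for h in itertools.product(H, repeat=T)
            for o in itertools.product(O, repeat=T)
        )
    return mass(y) / sum(mass(y_seq) for y_seq in itertools.product(Y, repeat=T))

# Hypothetical potential: rewards agreement between labels and observations.
psi = lambda y, h, o, x, xa: sum(float(yt == xt) for yt, xt in zip(y, x))
Y, H, O = (0, 1), (0, 1), (0, 1)
x, xa = (0, 1), None
p = conditional_prob((0, 1), x, xa, psi, Y, H, O)
print(round(p, 4))
```

Since the toy potential ignores h and o, marginalizing them only rescales numerator and denominator equally; the probabilities over all label sequences still sum to one, as the normalization in Eq. (1) requires.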
3.2 Potential Function

To be more discriminative in the problem formulation, the overall potential function Ψ(y, h, o, x, x_a; ω), with double-layer structure potentials summed over the whole sequence, can be built in our DL-CRFs model with four
types of energies: from the visual observations, the augmented variable, the human poses, and the objects.

Ψ(y, h, o, x, x_a; ω) = E_bot + E_top
                      = Σ_{t=1}^{T} { Ψ1(y_t, h_t, o_t, x_t; ω1) + · · · } + Σ_{t=2}^{T} Ψ4(y_t, y_{t-1}, x_a; ω4)    (2)

where E_bot and E_top represent the bottom-layer and top-layer potentials of the double-layer structure, respectively.

3.2.1 Bottom layer potential E_bot

Our goal is to predict the target label y_t that is most likely, given the observations x_t at the t-th moment. From a bottom-to-top perception perspective, the ability to computationally model human-object interactions can build an effective bridge to reliably promote action recognition involving interactions between people and objects in real-world scenarios. The first term measures the score of making a specific visual observation x_t with a joint state assignment (y_t, h_t, o_t) of the given target variable y_t and the intermediate state variables h_t, o_t. We define Φ(x) to be the function mapping the input data into the discriminative feature space. This first potential term models the full connectivity among y_t, h_t, o_t and x_t w.r.t. the moment t, which avoids making any conditional independence assumptions, and can be formulated and written as follows.

on whether o_t denotes a microwave or a bottle or not, to some degree. The third potential term can be written as

Ψ3(h_t, o_t; ω3) = ω3(h_t, o_t)

3.2.2 Top layer potential E_top
According to the human’s ability such as intuition and common sense, people perform some interesting actions purposefully in a certain order. From our proposed model, we know that current action yt may be closely linked to the previous moment action yt−1 , while being also associated with or restricted by the augmented input variable xa . The assumed global input variable xa could contain not only the whole history actions with the comprehensive dependencies, but also the information of forwarding towards or completing the certain actions with the given special purpose in the future. For example, for the video of microwaving food, when the activity of opening microwave or reaching food is performed, the current action is most likely to be moving food, the next action is most likely to be placing food, and it always contains closing the microwave as the final action. So the fourth potential term can be written as follows
of
t=1
+ Ψ2 (yt , ht , ot ; ω2 ) + Ψ3 (ht , ot ; ω3 )}
pro
=
Ψ4 (yt , yt−1 , xa ; ω4 ) = ω4 (yt , yt−1 , xa )
(3)
Jo
The second potential term can measure the compatibility score between yt and ht , ot in terms of cooccurrence frequency statistics. The given score can be thought of as the bias entry of (2) or the prior of seeing joint-state assignment (yt , ht , ot ). The second potential term can be formulated as Ψ2 (yt , ht , ot ; ω2 ) = ω2 (yt , ht , ot )
(4)
The third potential can measure the score of the interior dependence between the intermediate state variables ht and ot in itself. The score can be considered as the prior information between ht and ot occurring at the same time. Let us consider a familiar example. If ht denotes the human pose with respect to the action of opening, whether the intermediate state refers to opening microwave or opening bottle or not depends
(6)
According to the definition of the four potential terms, we can rewrite total potential function in Equation (2) for the proposed double-layer CRFs model as:
urn al P
Ψ1 (yt , ht , ot , xt ; ω1 ) = ω1 (yt , ht , ot ) · Φ(xt )
(5)
Ψ (y, h, o, x, xa ; ω) =
T X
{ω1 (yt , ht , ot ) · Φ(xt )
T X
ω4 (yt , yt−1 , xa )
t=1
+ ω2 (yt , ht , ot ) + ω3 (ht , ot )} +
(7)
t=2
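Equation (7) reduces evaluating the total potential to table lookups and one dot product per frame. A minimal Python sketch, where w1 through w4 are hypothetical lookup tables standing in for the learned parameters ω:

```python
def total_potential(y, h, o, feats, xa, w1, w2, w3, w4):
    """Total DL-CRFs potential of Eq. (7) for one sequence.

    y, h, o : integer state sequences of length T
    feats   : feats[t] is the feature vector Phi(x_t)
    xa      : index of the augmented global variable
    w1[(yt, ht, ot)] : weight vector dotted with Phi(x_t)  (Psi_1)
    w2[(yt, ht, ot)] : joint-state prior score              (Psi_2)
    w3[(ht, ot)]     : pose-object compatibility score      (Psi_3)
    w4[(yt, yt-1, xa)] : label-transition score             (Psi_4)
    All tables are hypothetical stand-ins for the learned parameters."""
    score = 0.0
    for t in range(len(y)):
        wv = w1[(y[t], h[t], o[t])]
        score += sum(a * b for a, b in zip(wv, feats[t]))  # Psi_1: w1 . Phi(x_t)
        score += w2[(y[t], h[t], o[t])]                    # Psi_2: joint-state prior
        score += w3[(h[t], o[t])]                          # Psi_3: pose-object prior
        if t >= 1:
            score += w4[(y[t], y[t - 1], xa)]              # Psi_4: transition term
    return score
```

Since the score is an un-normalized log-probability, inference only needs to compare these sums across candidate label sequences, which is what the decomposition in Section 4 exploits.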
The assumed whole potential function measures the matching score among the bottom input visual features x, the top augmented input variable xa, the intermediate representation state variables h, o, and the output label sequence of target states y. Since this matching score can be seen as the un-normalized joint probability in log space, we only need to maximize it to infer the output target states y.

4 Model Inference

Given the constructed graphical model and the assumed parameters ω, the model inference task, with an input sequence of observed visual evidence x and the augmented global variable xa ∈ Xa, is to predict the most likely target states y by maximizing the objective score:

ŷ = arg max_{y∈Y} log p(y, h, o | x, xa; ω) ∝ arg max_{y∈Y} Ψ(y, h, o, x, xa; ω)   (8)

The main challenge comes from the fact that solving Equation (8) is in general NP-hard, since it requires evaluating the whole potential function over an exponential number of state sequences. Exact inference is desirable because it guarantees a global optimum, but exact procedures usually apply only to acyclic graphs. Although our designed graph contains loops, we can transform it into two linear-chain structures, which allows an efficient exact inference approach: we divide the constructed graph into a top layer and a bottom layer, and collapse the intermediate representation state variables ht, ot together with the action state yt into a single node at the bottom level. This yields two classical linear-chain CRF structures, so the DL-CRFs model decomposes into a top linear-chain CRFs model and a bottom one that can be computed efficiently. Inspired by the recursive scheme of the first-order linear chain, Equation (8) can then be solved efficiently with dynamic programming.

Firstly, we calculate the matching score of the objective at the start time (t = 1) over the label space Y = {1, ..., V }:

ζ1(v) = ω1(y1 = v, h1, o1) · Φ(x1) + ω2(y1 = v, h1, o1) + ω3(h1, o1)   (9)

For convenience in deriving the inference, we define

Ft(yt, ht, ot, xt) = ω1(yt, ht, ot) · Φ(xt) + ω2(yt, ht, ot) + ω3(ht, ot)   (10)

Then the maximum score at the next time (t = 2) for each label u ∈ Y can be written as

ζ2(u) = max_{v∈Y} { ζ1(v) + F2(y2 = u, h2, o2, x2) + ω4(y1 = v, y2 = u, xa) }   (11)

In general, the recursive formula calculates the maximum score at time t for each label u ∈ Y and records the recursive path ρt(u) attaining that maximum:

ζt(u) = max_{v∈Y} { ζt−1(v) + Ft(yt = u, ht, ot, xt) + ω4(yt−1 = v, yt = u, xa) }   (12)

ρt(u) = arg max_{v∈Y} { ζt−1(v) + Ft(yt = u, ht, ot, xt) + ω4(yt−1 = v, yt = u, xa) }   (13)

When the last segment (t = T) is evaluated, we obtain the maximum matching score at time T:

max_{y∈Y} Ψ(y, h, o, x, xa; ω) = max_{u∈Y} ζT(u)   (14)

The optimal assignment at time T can then be predicted:

ŷT = arg max_{u∈Y} ζT(u)   (15)

Finally, we track back the most likely assignments of the previous segments along the recorded recursive path ρt(u), for t = T − 1, ..., 1:

ŷt = ρt+1(ŷt+1)   (16)

5 Parameter Learning

We apply the max-margin approach to learn the assumed graphical model parameters. The labeled training set D = {((xi, yi), xia)}_{i=1}^{N} of N samples consists of pairs of an observation sequence xi and the related ground-truth action label sequence yi of the high-level activities with chain length Ti, together with an augmented global variable xia. The intermediate variables h, o are unknown in the dataset and are inferred automatically during training. The goal of model learning is to find the optimal weight parameters ω that minimize the objective function:

min_ω { λ/2 ‖ω‖² + Σ_{i=1}^{N} Li(y, ŷ) }   (17)

where λ is a trade-off constant providing a regularization balance between model complexity and data fitting, used to avoid or reduce over-fitting in model learning. The function Li(y, ŷ) represents the cost of making incorrect predictions for the i-th training sample, and ŷi denotes the most likely action sequence predicted from Equation (8). Here we adopt a normalized Hamming distance between the predicted action sequence ŷi and the ground-truth sequence yi with chain length Ti:

Li(y, ŷ) = (1/Ti) Σ_{t=1}^{Ti} (1 − δ(yti, ŷti)),  where δ(yti, ŷti) = 1 if yti = ŷti and 0 otherwise   (18)

where t indexes positions in the predicted and ground-truth sequences.
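Once (yt, ht, ot) is collapsed into a single bottom-layer node, Eqs. (9) through (16) are ordinary first-order Viterbi decoding. A minimal Python sketch, where the emission table F and transition table w4 are hypothetical stand-ins for the learned potentials:

```python
def viterbi_decode(T, labels, F, w4, xa):
    """Viterbi decoding for Eqs. (9)-(16).

    F[t][u] plays the role of F_t(y_t = u, h_t, o_t, x_t) after the
    intermediate variables have been resolved, and w4[(v, u, xa)] is the
    transition score omega_4(y_{t-1} = v, y_t = u, xa). Both tables are
    hypothetical stand-ins, not the paper's learned parameters."""
    zeta = [{u: F[0][u] for u in labels}]             # Eq. (9)
    rho = [None]
    for t in range(1, T):
        zt, rt = {}, {}
        for u in labels:                              # Eqs. (12)-(13)
            best_v = max(labels, key=lambda v: zeta[t - 1][v] + w4[(v, u, xa)])
            zt[u] = zeta[t - 1][best_v] + F[t][u] + w4[(best_v, u, xa)]
            rt[u] = best_v
        zeta.append(zt)
        rho.append(rt)
    y_T = max(labels, key=lambda u: zeta[T - 1][u])   # Eqs. (14)-(15)
    path = [y_T]
    for t in range(T - 1, 0, -1):                     # Eq. (16): backtrack
        path.append(rho[t][path[-1]])
    return list(reversed(path)), zeta[T - 1][y_T]
```

The two nested loops make the cost O(V²) per frame over the collapsed label space, which matches the complexity analysis in Section 6.4.1.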
Generally speaking, optimizing Equation (17) directly is not possible, since the defined loss function involves the costly computation of the "arg max" in Equation (8). Following relaxation strategies similar to [10] and [23], we substitute the original loss in Equation (17) with the margin-rescaling surrogate [33,34], which is an upper bound on the given loss. Equation (17) can then be rewritten as the minimization of a relaxed function subject to a set of constraints, by adding slack variables:

min_{ω,ξ} { λ/2 ‖ω‖² + (1/N) Σ_{i=1}^{N} ξi },  s.t. ⟨ω, ϕi(y)⟩ ≥ Li(y, ŷ) − ξi,  ∀i ∈ {1, 2, ..., N}, y ∈ Yi := Y(xi)   (19)

where ξi is the slack variable measuring the surrogate loss for the i-th sample, and the margin ⟨ω, ϕi(y)⟩ := Ψ(yi, hi, oi, xi, xia; ω) − Ψ(ŷi, hi, oi, xi, xia; ω) can be computed in the same way as the inference problem using (12). Equation (19) formulates a so-called N-slack structural SVM problem with margin-rescaling as a convex function. However, this formulation has an exponential number of constraints, owing to the combinatorial nature of Yi.

Inspired by previous works in [11], we exploit the block-separability of the domain of the function in Equation (19), replacing the linear constraints over Y with N piecewise-linear ones, i.e. M := Δ|Y1| × ... × Δ|YN|, and adopt the loss-augmented decoding sub-problem to define the structured hinge loss:

H̃i(ω) := max_{y∈Yi} { Li(y, ŷ) − ⟨ω, ϕi(y)⟩ } =: max_{y∈Yi} Hi(y; ω)   (20)

which acts as the maximization "oracle". We then use the nonlinear constraints ξi ≥ H̃i(ω) to replace the constraints in Equation (19). It is important to note that this loss-augmented decoding sub-problem can be tackled under the sole assumption of access to an efficient maximization oracle. To estimate the structural model parameters ω, the equivalent non-smooth unconstrained formulation of Equation (19) can be rewritten as:

min_ω { λ/2 ‖ω‖² + Σ_{i=1}^{N} H̃i(ω) }   (21)

The maximization oracle allows us to exploit sub-gradient methods to solve the constrained convex problem, with a sub-gradient of H̃i(ω) with respect to ω given by −ϕi(ŷ), where ŷ is any maximizer of the loss-augmented decoding sub-problem. This optimization problem can then be tackled effectively and efficiently by the block-coordinate primal-dual Frank-Wolfe (BCFW) approach invoked by previous works in [11].

Inspired by the works in [11,13], we exploit the improved BCFW algorithm with the gap sampling strategy shown in Algorithm 1 to accelerate the convergence of learning the optimal structural model parameters for the given structural SVM problem. At each iteration, BCFW selects a training sample at random and performs a block-coordinate step with respect to the corresponding dual variables. Unfortunately, if these dual variables are already close to optimal, the original BCFW technique [11] makes no significant progress at that iteration. So that coordinate blocks with larger sub-optimality are sampled more frequently, we obtain a coordinate block gap (Line 7 of Alg. 1) at each iteration by quantifying the sub-optimality of the blocks, similarly to [13]. We then randomly choose a block coordinate (a training sample from D = {((xi, yi), xia)}_{i=1}^{N}) at each iteration of the parameter learning phase, with sampling probability proportional to its current gap estimate.

Algorithm 1 Block-coordinate Frank-Wolfe (BCFW) method with gap sampling strategy for structured SVM
1: Let ω(0) = ωi(0) = 0, l(0) = li(0) = 0, gi(0) = +∞, ki = 0
2: for k = 0...K do
3:   Pick i at random with probability proportional to the gap estimate gi(ki)
4:   Solve ŷi = arg max_{y∈Yi} Hi(y; ω(k))   cf. Equ. (20)
5:   Let ki = k
6:   Calculate ωs = ϕi(ŷi)/(λN) and ls = Li(y, ŷ)/N
7:   gi(ki) = λ (ωi(k) − ωs)ᵀ ω(k) − li(k) + ls
8:   Let γ = gi(ki) / (λ ‖ωi(k) − ωs‖²) and clip to [0, 1]
9:   Update ωi(k+1) = (1 − γ) ωi(k) + γ ωs and li(k+1) = (1 − γ) li(k) + γ ls
10:  Update ω(k+1) = ω(k) + ωi(k+1) − ωi(k) and l(k+1) = l(k) + li(k+1) − li(k)
11:  if updating the global duality gap then
12:    for i = 1...N do
13:      ki = k + 1
14:      Solve ŷi = arg max_{y∈Yi} Hi(y; ω(k))   cf. Equ. (20)
15:      Calculate ωs = ϕi(ŷi)/(λN) and ls = Li(y, ŷ)/N
16:      gi(ki) = λ (ωi(k) − ωs)ᵀ ω(k) − li(k) + ls
17:    end for
18:  end if
19: end for

6 Experiments and Discussions

6.1 Experimental Datasets

We evaluated our proposed DL-CRFs model on two public benchmark 3D action datasets, CAD-60 [3] and CAD-120 [2]. Both datasets contain sequences of color and depth images together with the skeleton joints of the person, covering a large variety of human-object interactions. However, the two datasets differ considerably, which makes them suitable for evaluating the generalization of our proposed DL-CRFs model.

The CAD-60 dataset was provided by Sung et al. [3] for unstructured human activity detection from RGB-D images. It contains 68 RGB-D video clips of four different subjects captured with Kinect cameras, performing 12 action classes in five different environments: office, kitchen, bedroom, bathroom, and living room. Each video can be represented as a sequence of human skeletons with 15 joint positions, with a total of 13 specific actions operating in the 5 environments. The action labels in this dataset include rinsing mouth, brushing teeth, wearing contact lens, talking on the phone, drinking water, opening pill container, cooking (chopping), cooking (stirring), talking on couch, relaxing on couch, writing on white board, and working on computer. Though these actions are not grouped into related high-level activities, we can still consider them in our experiments, because actions performed in the same environment exhibit higher-order dependencies: the whole action sequence in each environment of CAD-60 can be regarded as an explicit high-level activity.

In contrast, the CAD-120 dataset, provided by Koppula et al. [2], contains 120 activity sequences with over 60,000 RGB-D video frames of ten different high-level activities performed by four different subjects. It contains 10 high-level activities and 10 sub-activity labels, with 15 skeleton joints per frame. Each activity sequence contains a high-level activity consisting of a long sequence of sub-level actions, and our goal is to recognize the sub-level action categories. The actions include reaching, moving, pouring, eating, drinking, opening, placing, closing, scrubbing and null. The objects in the CAD-120 dataset are automatically detected as in [2], and the object locations are also provided with the dataset.

6.2 Experimental Setup

Several approaches have been tested on these two public datasets and can be compared directly using the quantitative experimental results. For a fair comparison, the same input features are extracted as in the previous work [2]. The observation visual features comprise human skeleton features, object features, object-object interaction features, object-subject relation features, and temporal object and subject features. These features are concatenated into a single vector, which is treated as the observation of one action segment for the bottom-layer input. Since the augmented global variable xa represents global or high-level properties, its form as a top-layer input feature is quite flexible; we adopt two kinds of top-layer input features as the augmented global variable xa. High-label features directly encode the assumed high-level activity label of the full video segment, while occlusion features are obtained by dividing the video into N uniform-length segments and finding the fraction of objects that are fully or partially occluded in each temporal segment.

To initialize the intermediate variables for the DL-CRFs model, we apply a classical clustering technique to the bottom-layer input features. The number of clusters is set equal to the number of intermediate states for each action. We run K-means clustering on the bottom input data 10 times and choose the result with the minimal within-cluster distances. The resulting cluster labels are assigned to initialize the intermediate state representation in the preparation phase.

On the CAD-120 dataset, we can extract the complete set of observation features embedded with the object locations provided with the training data, and initialize the intermediate variables with features covering both the human skeleton and object information. For the CAD-60 dataset, only skeletal features are extracted since no object information exists, and it is tested with the DL-CRFs (Single) model, our baseline that only adds an augmented top layer as the global variable and uses no intermediate representation states (number of variables = 1), i.e. a single intermediate representation variable denoting the feature space directly mapped from the input data.

6.3 Experimental Results and Extensive Analysis

Our system can be divided into three parts: graphical model formulation, model inference, and parameter learning. First, we construct the graphical model and build the learning algorithm. For model inference, we apply the inference engine from the LibDAI library [31]. For parameter learning, we exploit the structural SVM framework provided by [33] to learn the model parameters. We extensively compare the performance of the proposed approach with other state-of-the-art methods [2,3,10,23,32,45]. In this section, the
experimental results are reported in detail. To compare the performance of the different approaches fairly, the quantitative evaluation criteria of accuracy, precision, recall and F-score are given and analyzed. In the CAD-120 dataset, the actions of reaching and moving account for more than half of all instances. The precision and recall are therefore relatively more informative than accuracy among the evaluation criteria, since they remain meaningful and significant even when the classes are imbalanced. We also show the confusion matrix, which visualizes the objective and comprehensive performance of the proposed DL-CRFs-based human action recognition. We then analyze the exhibited results in depth to illustrate how the DL-CRFs approach improves on the original linear CRFs model.

Table 1 The quantitative performance of the DL-CRFs model on the CAD-120 dataset compared with other methods, reported in terms of accuracy, precision, recall and F-score

Method              Accuracy   Precision  Recall     F-Score
Koppula et al. [2]  86.0±0.9   84.2±1.3   76.9±2.6   80.4±1.5
LC-CRFs             85.7±2.9   86.4±6.1   82.4±4.0   82.6±6.2
Hu et al. [10]      89.7±0.6   90.2±0.7   88.2±0.6   89.2±0.6
Guo et al. [45]     91.1       N/A        N/A        N/A
Liu et al. [32]     89.8±2.8   92.7±3.8   86.8±2.2   87.1±4.4
DL-CRFs (Single)    87.6±0.7   90.4±2.1   85.9±1.4   86.7±2.7
DL-CRFs (Full)      90.2±2.6   90.7±5.7   86.6±2.5   89.9±0.7

Note that the symbol "N/A" means not available.

6.3.1 The importance of the augmented top layer

Table 1 shows the quantitative performance of our DL-CRFs models on the CAD-120 dataset during the test phase, compared with other state-of-the-art approaches [2,10,32,45]. The DL-CRFs (Single) model denotes the proposed DL-CRFs (Full) model without intermediate representation states, which are degenerated and shrunk to one node connected only to the input observation features and the target label at the given moment, while the other components of the DL-CRFs model remain unchanged. In terms of overall performance, the DL-CRFs (Single) model performs better than the first-order linear-chain CRFs (LC-CRFs) model without intermediate states, especially on the precision and recall among the four evaluation criteria reported in [2]. As Table 1 shows, the LC-CRFs model itself performs slightly better than [2] in terms of precision and recall. In contrast, our DL-CRFs (Single) model exhibits significant improvements in precision and recall over both the LC-CRFs model and Koppula's method [2]. The reason may be that the DL-CRFs (Single) model adds an augmented top layer with global variables, which captures higher-order dependencies of the target states. This confirms that an augmented top layer with global variables should be considered an important part of the DL-CRFs model for recognizing sequences of human actions. The higher-order dependencies of the target states can also be captured by a recursive mean-field-like approximation, as in [32]; however, that approximation may lose effectiveness and incurs a complex calculation process for very long input sequences. Compared with the previous work [32], the proposed DL-CRFs model has slightly better performance on the whole. These results further demonstrate that the augmented top layer captures higher-order dependencies of the target states. Since few-shot 3D action recognition using neural graph matching (NGM) networks, proposed by Guo et al. in [45], is a model-driven technique among the state-of-the-art methods, we can compare it fairly with our DL-CRFs method. As Table 1 shows, the accuracy of our presented DL-CRFs method is very close to that of few-shot 3D action recognition with NGM networks using 5 shots [45]. NGM explicitly leverages the interaction graph as a graphical representation and exploits the graph tensor in the graph matching stage to compare not only the vector representations of nodes but also the structure of the graph through edge matching features; furthermore, the graph generator and graph matching metric are jointly trained in an end-to-end fashion to directly optimize the few-shot learning objective.

Figure 3 compares the proposed DL-CRFs model with the top layer augmented with different feature variables xa on the CAD-120 dataset. Although behaving slightly worse on recall and F-score, the DL-CRFs model with high-label features outperforms the one with occlusion features on accuracy and precision. These results demonstrate that different top-layer input features have different influences in our DL-CRFs model. However, with either high-label or occlusion features, the proposed DL-CRFs model outperforms Ninghang Hu's "No Top" method [10], which does not consider the higher-order dependencies of the target states, on all four evaluation metrics.
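The discussion above argues that precision and recall remain meaningful under class imbalance, unlike raw accuracy. The four metrics of Table 1 can be computed per class and averaged; the sketch below uses macro-averaging, which is an illustrative assumption since the paper does not state its exact averaging scheme:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F-score.

    Macro-averaging (unweighted mean over classes) is an assumption here;
    it keeps minority classes such as 'pouring' meaningful even when
    'reaching' and 'moving' dominate the instances."""
    classes = sorted(set(y_true) | set(y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precs, recs, fscores = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); fscores.append(f)
    n = len(classes)
    return acc, sum(precs) / n, sum(recs) / n, sum(fscores) / n
```

On a toy split with 8 "reach" and 2 "pour" instances where one "pour" is misclassified, accuracy stays at 0.9 while macro recall drops to 0.75, illustrating why the imbalanced-class discussion above favors precision and recall.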
Fig. 3 The comprehensive performance of the proposed DL-CRFs model with different augmented variables xa, keeping the other environment configuration and model parameters constant, on the CAD-120 dataset. For the different augmented variables xa in the DL-CRFs model, the symbol "HF" represents the high-label features and the symbol "OF" denotes the occlusion features. The symbol "No Top" shows the results of the technique presented in [23] without capturing higher-order dependencies of the target states.

6.3.2 The importance of intermediate representation

To demonstrate the importance of intermediate representation, we make a fair comparison between the proposed whole DL-CRFs (Full) model and the assumed DL-CRFs (Single) model without intermediate representation states. The bottom two rows of Table 1 show that the proposed DL-CRFs (Full) model outperforms the DL-CRFs (Single) model on all four evaluation metrics. Notably, the accuracy and precision of the DL-CRFs (Full) model increase by over 3 and 4 percentage points, respectively, after adding the intermediate representation states. The improvements on both criteria are significant (p < 0.05). The recall and F-score of the DL-CRFs (Full) model also improve slightly over the DL-CRFs (Single) model. These results demonstrate the significant contribution of explicitly modeling deeper intermediate representations within the target states in the whole DL-CRFs model.

Figure 4 shows the influence of the assumed number of intermediate representation variables on the proposed DL-CRFs model on the CAD-120 dataset. With only two intermediate representation variables, the DL-CRFs model achieves the best performance on the four evaluation metrics. When the number of intermediate representation variables exceeds two, the corresponding performance worsens, since the DL-CRFs model then has more complicated parameters to tune and is easily over-fitted. The appropriate number of intermediate representation variables therefore depends on the experimental dataset and the application: if a specific recognition task with more complex activities must be resolved, more intermediate representation states should be assumed and exploited in our proposed DL-CRFs model.

Fig. 4 The performance of the proposed DL-CRFs model with different numbers of intermediate representation states on the CAD-120 dataset.

Table 2 The quantitative performance of the presented DL-CRFs model on the CAD-60 dataset compared with other methods, reported in terms of precision (P, %) and recall (R, %)

Method              Bathroom    Bedroom     Kitchen     Living      Office      Average
                    P     R     P     R     P     R     P     R     P     R     P     R
Sung et al. [3]     72.7  65.0  76.1  59.2  64.4  47.9  52.6  45.7  73.8  59.8  67.9  55.5
Koppula et al. [2]  88.9  61.1  73.0  66.7  96.4  85.4  69.2  68.7  76.7  75.0  80.8  71.4
Hu et al. [23]      77.6  81.5  81.8  76.9  88.2  92.0  80.6  75.9  81.7  75.1  80.8  80.1
Ours method         83.3  75.0  92.9  75.0  93.3  83.3  83.3  83.3  78.3  83.3  86.2  80.1

Note that the human actions are grouped in terms of the human's locations, and a separate model is trained and tested based on each environment label. The results are therefore reported in terms of precision (P, %), recall (R, %) and their averages with respect to our proposed DL-CRFs model, differently from Table 1.
6.3.3 Discussions on two public 3D action datasets

On the CAD-60 dataset, the DL-CRFs model is compared with the other state-of-the-art approaches using a strategy similar to that employed in the previous methods [2,3]: we group the human actions according to their environment labels, and the DL-CRFs model is learned and tested for each group separately. The overall performance of the DL-CRFs algorithm is comprehensively assessed by the average precision and recall over all the defined environments. Table 2 shows that our presented DL-CRFs model performs best in most of the given environments. On the whole, the recall of our proposed DL-CRFs model equals that of [23], while its precision exceeds that of [23] by over 6 percentage points in Table 2.

On the CAD-120 dataset, Table 1 also shows that our presented approach with the DL-CRFs model performs better than the technique in [10]. Although the proposed DL-CRFs model performs slightly worse than [10] in terms of recall and F-score, it achieves significant improvements in accuracy and precision, which increase by around 1 and 6 percentage points, respectively. Figure 5 shows the confusion matrix of human action classification by the whole DL-CRFs model with two intermediate state representations. In the confusion matrix, the higher the values on the diagonal, the more accurate the classification of human actions. On the whole, our proposed DL-CRFs model performs excellently, except for the action of scrubbing, which is commonly confused with reaching and placing. Recognizing the scrubbing class remains challenging, since scrubbing an object can look very similar to reaching for or placing an object in the real world.

Fig. 5 Confusion matrix over the different classes (reaching, moving, pouring, eating, drinking, opening, placing, closing, scrubbing, null) on the CAD-120 dataset. Rows are ground truth and columns are detections.

In the improved BCFW algorithm with gap sampling in the model optimization learning process of Section V, we replace the exact block gaps with estimates computed from past oracle calls on each of the traversed coordinate blocks. However, such estimates may be quite far from the current duality gaps of the selected coordinate blocks. To compensate for this staleness, we refresh the block gaps with a full gap computation (one pass over the given dataset) after several block-coordinate passes. Figure 6 shows the optimization learning convergence of the BCFW algorithm with the gap sampling technique, comparing uniform sampling with gap sampling (where Gap 1, Gap 5 and Gap 10 denote a gap computation pass after 1, 5 and 10 block-coordinate passes, respectively). As shown in Figure 6, the BCFW algorithm with the gap sampling schemes Gap 5 and Gap 10 converges faster than with uniform sampling. Meanwhile, the BCFW approach with the Gap 1 scheme wastes many extra calculations and converges slowly, since it performs the gap computation after every pass of the BCFW optimization process. The proposed DL-CRFs method is not very sensitive to the given gap parameter (we tried values of 5, 10 and 20), which allows us to always adopt the value of 10.

Fig. 6 The learning convergence (duality gap against the number of effective passes over the training dataset) of the BCFW optimization with uniform sampling [11] and with gap sampling at different frequencies of batch passes for updating the gap estimates. Note that Gap 1, Gap 5 and Gap 10 denote updates every one, 5 and 10 passes over the given training dataset, respectively.
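The gap-sampling loop with periodic staleness refresh described above can be sketched as follows; block_step and full_gap are hypothetical callbacks standing in for the oracle calls of Algorithm 1, not the authors' implementation:

```python
import random

def gap_sampling_bcfw(n, K, block_step, full_gap, refresh_every=10):
    """Skeleton of BCFW sampling with stale gap estimates (cf. Alg. 1, Fig. 6).

    block_step(i) performs one block-coordinate Frank-Wolfe step on
    training sample i and returns its fresh block gap; full_gap()
    recomputes all block gaps (one pass over the data). Both are
    hypothetical hooks for the structured-SVM oracle calls."""
    gaps = [float("inf")] * n                   # stale block-gap estimates g_i
    for k in range(K):
        finite = [g for g in gaps if g != float("inf")]
        cap = max(finite) if finite else 1.0    # unseen blocks get the largest gap
        weights = [g if g != float("inf") else cap for g in gaps]
        if sum(weights) > 0:                    # sample prop. to gap estimates
            i = random.choices(range(n), weights=weights)[0]
        else:                                   # all gaps zero: uniform fallback
            i = random.randrange(n)
        gaps[i] = block_step(i)                 # FW step + fresh gap on block i
        if (k + 1) % (refresh_every * n) == 0:
            gaps = full_gap()                   # full pass: refresh stale gaps
    return gaps
```

Setting refresh_every to 1, 5 or 10 corresponds to the Gap 1, Gap 5 and Gap 10 schemes compared in Figure 6.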
6.4 Discussions on computational analysis 6.4.1 The efficiency of model inference
In Section IV, owing to the human pose ht being welldetermined from input features, traversing Equation (9) once requires O(Ny No ) computations, where No is the number of appearing objects. The proposed DLCRFs model should be evaluated for all possible assignments of (yt , ht , ot ), so that Equation (9) computation is proportional to Ny No operations. Since the given DL-CRFs model with double layers can be decomposed into top linear-chain CRFs model and the bottom one using dynamic programming technique, the total computational cost in model inference is O(Ny2 No2 T ) for a video sequence, so that this calculation can be manageable and efficient, when O(Ny No ) is not very large for the action recognition task especially with the subject interacting with the limited objects.
6.4.2 The optimization of parameter learning
Fig. 7 Training error of the stochastic parameter solvers for the proposed DL-CRFs model: BCFW with gap sampling, BCFW with uniform sampling [11], and the cutting-plane method [37].
Inspired by the recent block-coordinate Frank-Wolfe (BCFW) method [11], which optimizes the structured support vector machine (SSVM) objective in the context of structured prediction, we modify the BCFW approach to obtain the model parameters for human action recognition. In Section V, we suggest that the improved block-coordinate primal-dual Frank-Wolfe (BCFW) approach with gap sampling, together with the sub-gradient method, effectively and efficiently tackles the constrained convex problem of the maximization oracle in Equation (19) within the max-margin framework. This technique, given in Algorithm 1, accelerates convergence toward the optimal structural model parameters for the structural SVM problem. Compared with the original BCFW method [11], the improved BCFW algorithm has three notable advantages: adaptive sampling, pairwise and away steps, and caching. First, the estimates of the block gaps maintained by the BCFW method reveal the block sub-optimality, which can serve as an adaptive criterion: compared with traditional BCFW using uniform sampling, the improved BCFW with adaptive non-uniform sampling converges faster [13]. At each iteration the algorithm obtains the block gap quantifying the sub-optimality on that block, and the block gaps are used to randomly choose a block (an object of the training set) so that blocks with larger sub-optimality are sampled more often. Meanwhile, the max oracle with the margin-rescaling surrogate can be implemented efficiently with a cache-hit criterion by minding the block gaps. At each step, the max oracle in the BCFW and BCPFW algorithms is called to find the Frank-Wolfe corner. If the max-oracle operations are computationally expensive, this step is prone to become a computational bottleneck. Caching the results of the max oracle [13] overcomes this problem naturally by reusing previous oracle calls to store potentially promising corners.

Figure 7 shows the training error of the stochastic parameter solvers for the proposed DL-CRFs model on the CAD-120 dataset. At the early stage of the iteration process, the training error of BCFW with gap sampling decreases faster than those of the uniform-sampling BCFW algorithm [11] and the cutting-plane method [37]. This behavior follows from the key characteristic of the BCFW algorithm with gap sampling: its sampling probabilities favor the most informative training samples, so that each iteration achieves as much optimization progress as possible.

7 Conclusion and Future Work

This paper proposes a double-layer CRFs model for human action recognition from RGB-D sensor data. We assume that a sequence of human actions belonging to a certain high-level activity has higher-order dependencies, and an augmented top layer, acting as a global or high-level variable, is applied to capture these dependencies. We then explicitly model deeper intermediate representations between the target states and the input observation features, which capture the underlying perception structure in the designed DL-CRFs graph for predicting the target states. To execute model inference efficiently with dynamic programming, the proposed DL-CRFs model is devised so that it decomposes into a top linear-chain CRFs model and a bottom one, while a structured support vector machine is exploited to learn the DL-CRFs model parameters. Finally, we show that the proposed DL-CRFs model outperforms other state-of-the-art methods on two public benchmark datasets. In the future, we would like to investigate how a set of human actions can belong to different high-level activities, and how to automatically decide at which time a high-level activity finishes.
Acknowledgements This work was supported in part by the National Natural Science Foundation of China (61001152, 31671006, 61772286 and 61071091), the Natural Science Foundation of Jiangsu Province of China (BK2012437), and the China Scholarship Council. We thank the reviewers for their valuable and insightful comments.
References

1. N. Hu, G. Englebienne, and B. Krose: A two-layered approach to recognize high-level human activities. in: Proc. Ro-Man, 2014, pp. 243-248.
2. H. S. Koppula, R. Gupta, and A. Saxena: Learning human activities and object affordances from RGB-D videos. Int. J. Robot. Res, vol. 32, no. 8, pp. 951-970, 2013.
3. J. Sung, C. Ponce, B. Selman, and A. Saxena: Unstructured human activity detection from RGBD images. in: Proc. ICRA, 2011, pp. 842-849.
4. C. Zhu, and W. Sheng: Multi-sensor fusion for human daily activity recognition in robot-assisted living. in: Proc. Int. Conf. Hum. Robot. Interact, 2009, pp. 303-304.
5. T. L. M. Van Kasteren, G. Englebienne, and B. J. A. Krose: Activity recognition using semi-Markov models on real world smart home datasets. J. Amb. Intell. Smart Environ, vol. 2, no. 3, pp. 311-325, 2010.
6. N. Hu, G. Englebienne, and B. Krose: Posture recognition with a top-view camera. in: Proc. IROS, 2013, pp. 2152-2157.
7. Y. Wang, and G. Mori: Max-margin hidden conditional random fields for human action recognition. in: Proc. CVPR, 2009, pp. 872-879.
8. P. Wei, Y. Zheng, and S. C. Zhu: Modeling 4D human-object interactions for event and object recognition. in: Proc. ICCV, 2013, pp. 3272-3279.
9. B. Yao, and L. Fei-Fei: Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. IEEE Trans. Pattern Anal. Mach. Intell, vol. 34, no. 9, pp. 1691-1703, 2012.
10. N. Hu, G. Englebienne, Z. Lou, and B. Krose: Latent hierarchical model for activity recognition. IEEE Trans. Robot, vol. 31, no. 6, pp. 1472-1482, 2015.
11. S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher: Block-coordinate Frank-Wolfe optimization for structural SVMs. in: Proc. ICML, 2013, pp. 53-61.
12. S. Lacoste-Julien, and M. Jaggi: On the global linear convergence of Frank-Wolfe optimization variants. in: Proc. NIPS, 2015, pp. 496-504.
13. A. Osokin, J. B. Alayrac, I. Lukasewitz, P. Dokania, and S. Lacoste-Julien: Minding the gaps for block Frank-Wolfe optimization of structured SVMs. in: Proc. ICML, 2016, pp. 593-602.
14. L. Wang, Y. Qiao, and X. Tang: Latent hierarchical model of temporal structure for complex activity classification. IEEE Trans. Image Processing, vol. 23, no. 2, pp. 810-822, 2014.
15. F. Amirabdollahian, S. Bedaf, R. Bormann, H. Draper, V. Evers, J. G. Perez, G. J. Gelderblom, C. G. Ruiz, D. Hewson, N. Hu, et al.: Assistive technology design and development for acceptable robotics companions for ageing years. J. Behavior. Robot, vol. 4, no. 2, pp. 94-112, 2013.
16. C. Zhu, and W. Sheng: Human daily activity recognition in robot-assisted living using multi-sensor fusion. in: Proc. ICRA, 2009, pp. 2154-2159.
17. N. N. Vo, and A. F. Bobick: From stochastic grammar to Bayes network: probabilistic parsing of complex activity. in: Proc. CVPR, 2014, pp. 2641-2648.
18. Q. Xiao, and R. Song: Action recognition based on hierarchical dynamic Bayesian network. Multimed. Tool. Appl, pp. 1-14, 2017.
19. H. Koppula, and A. Saxena: Learning spatio-temporal structure from RGB-D videos for human activity detection and anticipation. in: Proc. ICML, 2013, pp. 792-800.
20. J. Y. Chang: Nonparametric feature matching based conditional random fields for gesture recognition from multimodal video. IEEE Trans. Pattern Anal. Mach. Intell, vol. 38, no. 8, pp. 1612-1625, 2016.
21. N. Raman, and S. J. Maybank: Non-parametric hidden conditional random fields for action classification. in: Proc. IJCNN, 2016, pp. 3256-3263.
22. S. Y. Lin, Y. Y. Lin, C. S. Chen, and Y. P. Hung: Learning and inferring human actions with temporal pyramid features based on conditional random fields. in: Proc. ICASSP, 2017, pp. 2617-2621.
23. N. Hu, G. Englebienne, Z. Lou, and B. Krose: Learning latent structure for activity recognition. in: Proc. ICRA, 2014, pp. 1048-1053.
24. P. Matikainen, R. Sukthankar, and M. Hebert: Model recommendation for action recognition. in: Proc. CVPR, 2012, pp. 2256-2263.
25. I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld: Learning realistic human actions from movies. in: Proc. CVPR, 2008, pp. 1-8.
26. J. Liu, and M. Shah: Learning human actions via information maximization. in: Proc. CVPR, 2008, pp. 1-8.
27. Q. Shi, L. Cheng, L. Wang, and A. Smola: Human action segmentation and recognition using discriminative semi-Markov models. Int. J. Comput. Vision, vol. 93, no. 1, pp. 22-32, 2011.
28. H. S. Koppula, and A. Saxena: Anticipating human activities for reactive robotic response. in: Proc. IROS, 2013, pp. 2071-2071.
29. A. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun: Efficient structured prediction with latent variables for general graphical models. in: Proc. ICML, 2012.
30. J. E. Kelley, Jr.: The cutting-plane method for solving convex programs. J. Soc. Industr. Appl. Math, vol. 8, no. 4, pp. 703-712, 1960.
31. J. M. Mooij: LibDAI: a free and open source C++ library for discrete approximate inference in graphical models. J. Mach. Learn. Res, vol. 11, no. 8, pp. 2169-2173, 2010.
32. T. Liu, X. Wang, X. Dai, and J. Luo: Deep recursive and hierarchical conditional random fields for human action recognition. in: Proc. WACV, 2016, pp. 1-9.
33. I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun: Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res, vol. 6, no. 2, pp. 1453-1484, 2005.
34. C. N. J. Yu, and T. Joachims: Learning structural SVMs with latent variables. in: Proc. ICML, 2009, pp. 1169-1176.
35. R. L. S. Torres, Q. Shi, A. van den Hengel, and D. C. Ranasinghe: A hierarchical model for recognizing alarming states in a batteryless sensor alarm intervention for preventing falls in older people. Perva. Mobile Comput, vol. 40, pp. 1-16, 2017.
36. E. Ahmadi, and Z. Azimifar: Margin losses for training conditional random fields. J. Math. Imaging Vis, vol. 56, no. 3, pp. 499-510, 2016. doi:10.1007/s10851-016-0651-y
37. T. Joachims, T. Finley, and C. N. J. Yu: Cutting-plane training of structural SVMs. Mach. Learn, vol. 77, no. 1, pp. 27-59, 2009.
38. N. Ratliff, J. A. Bagnell, and M. Zinkevich: (Online) subgradient methods for structured prediction. in: Proc. AISTATS, 2007.
39. O. Çeliktutan, C. Wolf, B. Sankur, and E. Lombardi: Fast exact hyper-graph matching with dynamic programming for spatio-temporal data. J. Math. Imaging Vis, vol. 51, no. 1, pp. 1-21, 2014. doi:10.1007/s10851-014-0503-6
40. T. Singh, and D. K. Vishwakarma: Video benchmarks of human action datasets: a review. Artificial Intelligence Review, vol. 150, pp. 1-48, 2018.
41. M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele: Recognizing fine-grained and composite activities using hand-centric features and script data. Int. J. Comput. Vision, vol. 119, no. 3, pp. 346-373, 2016.
42. J. Wang, Z. C. Liu, and Y. Wu: Learning actionlet ensemble for 3D human action recognition. IEEE Trans. Pattern Anal. Mach. Intell, vol. 36, no. 5, pp. 914-927, 2014.
43. W. Q. Li, Z. Y. Zhang, and Z. C. Liu: Action recognition based on a bag of 3D points. in: Proc. CVPRW, 2010, pp. 9-14.
44. L. Mici, G. I. Parisi, and S. Wermter: A self-organizing neural network architecture for learning human-object interactions. Neurocomputing, vol. 307, pp. 14-24, 2018.
45. M. Guo, E. Chou, D. A. Huang, S. Song, S. Yeung, and F. F. Li: Neural graph matching networks for few-shot 3D action recognition. in: Proc. ECCV, 2018, pp. 673-689.

Tianliang Liu obtained his Ph.D. degree in Biology and Medical Engineering from the School of Biology and Medical Engineering, Southeast University, China, in 2010. From 2013 to 2014, he was a visiting scholar in the Department of Computer Science at the University of Rochester. He is currently an associate professor in the College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, China. His current research interests include image processing, computer vision, pattern recognition, and machine learning.

Xiaodong Dong earned his B.E. degree in Mathematics and Applied Mathematics from Xi'an University of Science and Technology, China, in 2014. He was a graduate student in Signal and Information Processing at Nanjing University of Posts and Telecommunications. His research interests include image processing, computer vision and machine learning.

Yanzhang Wang earned his B.E. degree in Telecommunications Engineering from Yangtze University, China, in 2017. He is now a graduate student in Signal and Information Processing at Nanjing University of Posts and Telecommunications. His research interests include computer vision, action recognition and machine learning.

Xiubin Dai received his Ph.D. degree in Biology and Medical Engineering from the Department of Biology and Medical Engineering, Southeast University, China, in 2009. He is currently an associate professor in the School of Geography and Biological Information, Nanjing University of Posts and Telecommunications, China. His recent work concentrates on pattern recognition and image processing.

Quanzeng You received the B.E. and M.E. degrees from the Dalian University of Technology, Dalian, China, in 2009 and 2012, respectively, and obtained the Ph.D. degree from the Department of Computer Science at the University of Rochester, Rochester, NY, USA, in 2017. He is currently a researcher at Microsoft AI Perception and Mixed-Reality. His research focuses on social multimedia, social networks and data mining. He is interested in developing effective machine learning algorithms to help us understand the data.

Jiebo Luo joined the University of Rochester in Fall 2011 after over fifteen prolific years at Kodak Research Laboratories, where he was a Senior Principal Scientist leading research and advanced development. He has been involved in numerous technical conferences, including serving as the program co-chair of ACM Multimedia 2010, IEEE CVPR 2012 and IEEE ICIP 2017. He has served on the editorial boards of the IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, ACM Transactions on Intelligent Systems and Technology, Pattern Recognition, Machine Vision and Applications, and the Journal of Electronic Imaging. He is a Fellow of SPIE, IAPR, IEEE, ACM, and AAAI. His research spans image processing, computer vision, machine learning, data mining, social media, biomedical informatics, and ubiquitous computing. He is a pioneer of contextual inference in semantic understanding of visual data and social multimedia data mining.
CONFLICT OF INTEREST DECLARATION

We wish to draw the attention of the Editor to the following facts which may be considered as potential conflicts of interest and to significant financial contributions to this work.

We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.

We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He/she is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from [email protected].

Signed by all authors as follows:

Tianliang Liu, Xiaodong Dong, Yanzhang Wang, Xiubin Dai, Quanzeng You, Jiebo Luo