Three-stream CNNs for action recognition


Accepted Manuscript — to appear in: Pattern Recognition Letters.
Three-stream CNNs for action recognition. Liangliang Wang, Lianzheng Ge, Ruifeng Li, Yajun Fang.
PII: S0167-8655(17)30107-1. DOI: 10.1016/j.patrec.2017.04.004. Reference: PATREC 6788.
Received 27 June 2016; revised 1 February 2017; accepted 3 April 2017.
Please cite this article as: Liangliang Wang, Lianzheng Ge, Ruifeng Li, Yajun Fang, Three-stream CNNs for action recognition, Pattern Recognition Letters (2017), doi: 10.1016/j.patrec.2017.04.004.


Research Highlights

• A novel three-stream CNNs architecture for action feature extraction is proposed.


• An effective encoding scheme for action representation is presented.


• Promising action recognition results on challenging datasets.


Three-stream CNNs for action recognition

Liangliang Wang a,∗∗, Lianzheng Ge a, Ruifeng Li a, Yajun Fang b

a State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin, Heilongjiang, 150001, China
b CSAIL, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA

ABSTRACT


Existing Convolutional Neural Network (CNN) based methods for action recognition are either purely spatial or only temporally local, whereas actions are 3D spatio-temporal signals. In this paper, we propose a global spatial-temporal three-stream CNNs architecture for action feature extraction. Specifically, the three-stream CNNs comprise spatial, local temporal and global temporal streams, generated respectively by deep learning single frames, optical flow and globally accumulated motion features in the form of a new formulation named the Motion Stacked Difference Image (MSDI). Moreover, a novel soft Vector of Locally Aggregated Descriptors (soft-VLAD) is developed to further represent the extracted features; it combines the advantages of Gaussian Mixture Models (GMMs) and VLAD by encoding data according to their overall probability distribution and their differences with respect to the clustered centers. To deal with the inadequacy of training samples during learning, we introduce a data augmentation scheme which is very efficient because it is based on cropping across videos. We conduct our experiments on the UCF101 and HMDB51 datasets, and the results demonstrate the effectiveness of our approach.
© 2017 Elsevier Ltd. All rights reserved.


Key words: Action recognition, three-stream Convolutional Neural Networks, soft Vector of Locally Aggregated Descriptors, Support Vector Machines.


1. Introduction


Human action recognition in videos has attracted increasing attention in the computer vision community in recent decades because of its considerable demand in a variety of fields such as human-computer interaction, intelligent transportation systems, video surveillance and video indexing. As is widely known, to enable computers to recognize actions in different scenarios, the core task is to characterize actions with discriminative features, from which action types can be obtained through a classification scheme. Unlike static human detection, in addition to spatial appearance features, temporal motion features are available and even play a more important role (Simonyan and Zisserman, 2014) in action recognition. Accordingly, how to extract effective spatial

∗∗ Corresponding author. Tel.: +86-138-3610-3115; e-mail: [email protected] (Liangliang Wang).

detectors and how to further formulate temporal descriptors efficiently are the two primary issues for this topic, which is subject to cluttered backgrounds, illumination variation, viewpoint changes, the large number of degrees of freedom of actions and so forth. Therefore, despite the great number of efforts that have been made, accomplishing the action recognition task remains a difficult challenge due to both intra- and inter-class diversity. Conventional action recognition methods (Bobick and Davis, 2001; Yuan et al., 2016; Sekma et al., 2015; Wang and Schmid, 2013; Laptev et al., 2008; Scovanner et al., 2007) focus on manually extracting powerful spatial-temporal features, and then categorize the handcrafted features using a relatively simple classifier. More recently, following the big success of convolutional neural networks (CNNs) in image classification (Krizhevsky et al., 2012), many attempts (Xu et al., 2015; Sun et al., 2015; Yuan et al., 2016; Karpathy et al., 2014) have been inspired to obtain action features from raw images automatically by learning a hierarchy of multi-layer convolution and pooling channels, on the assumption that every layer shares the same threshold. Compared to images, the design of CNNs for actions is much more complicated because of the temporal extension. To deal with the 3D complexity, a majority of action recognition approaches (Wang et al., 2015, 2016) based


on CNNs follow a two-step implementation: first build spatial CNNs on frames and then fuse them temporally, which loses the temporal relationships between motions. Instead, Simonyan and Zisserman (Simonyan and Zisserman, 2014) investigate a two-stream CNNs architecture consisting of both learnt image CNNs features and temporal CNNs features generated by learning the optical flow between frames; more importantly, they show that the temporal CNNs features are more discriminative. Directly constructing a 3D CNN model (Ji et al., 2013) is another choice; unfortunately, the related work often suffers from low accuracy. Generally, owing to the lack of adequate labeled samples for training (Sun et al., 2015), action recognition methods based on CNNs-learnt features currently yield worse results than the best reported algorithms using handcrafted features, despite having a promising prospect for real-world applications.

Motivated by the above observations, in this paper we build a three-stream CNNs architecture including spatial, local temporal and global temporal streams for action feature extraction. Benefiting from the advances of two-stream CNNs (Simonyan and Zisserman, 2014), action image CNNs features and optical flow CNNs features are preserved as our spatial and local temporal streams, and are extended by incorporating a global temporal stream produced by learning Motion Stacked Difference Image (MSDI) CNNs features, aiming at deeply representing global temporal action features. Furthermore, in order to classify the three-stream CNNs descriptors, instead of directly fusing the scores obtained by softmax regression, we first apply a PCA-Whitening operation to the learnt features, and further encode the preprocessed data with a novel soft Vector of Locally Aggregated Descriptors (soft-VLAD) scheme, based on which the final classification is accomplished under a Support Vector Machines (SVMs) framework on two relatively large datasets, UCF101 and HMDB51, whose volume is enlarged by an efficient augmentation scheme. Our experimental results outperform previous CNNs based and most state-of-the-art methods, which demonstrates the effectiveness of the algorithm and also indicates that the three CNN streams are complementary. Figure 1 illustrates the architecture of our work.

This paper makes two significant contributions. First, we propose a novel three-stream video feature deep learning architecture, incorporating spatial, local temporal and global temporal convolutional networks for action feature extraction. Second, an effective soft-VLAD encoding scheme is presented to further represent the learnt CNNs descriptors.

2. Related Work

As far as we know, action recognition is essentially a two-stage process involving representing actions from videos and classifying the acquired descriptors. We review action recognition techniques with a focus on different action representation methods, and discuss the issue from two perspectives: hand-crafted features based and deep learnt features based approaches. More comprehensive surveys are available in (Weinland et al., 2011; Borges et al., 2013).

The first step of standard hand-crafted features based action recognition methods is to extract local features. Broadly, the techniques for action feature extraction can be divided into two categories: appearance features based and attention models based methods. Among various appearance extractors, the Histogram of Oriented Gradients (HOG) (Dalal and Triggs, 2005) is widely studied because of its high efficiency and robustness for human spatial characterization. Inspired by HOG, Laptev et al. formulated HOG/Histogram of Optical Flow (HOF) descriptors in (Laptev et al., 2008) by combining HOG with temporal optical flow information. Furthermore, HOG was extended into HOG3D for spatial-temporal feature extraction in (Kläser et al., 2008). Wang and Schmid introduced Improved Trajectories (Wang and Schmid, 2013), fusing HOG, HOF and Motion Boundary Histograms (MBH), which achieved big success in action recognition. On this basis, the authors of (Murthy and Goecke, 2015) developed Ordered Trajectories to lengthen the temporal duration of trajectory searching. Similarly, Harris3D (Laptev, 2005), Hessian3D (Willems et al., 2008) and 3D-SIFT (Scovanner et al., 2007) are all popular local detectors transformed from discriminative 2D ones for action description. In contrast, attention models based approaches often tackle actions using a saliency map, selecting features of interest according to a prior mathematical model; for example, in (Zhang et al., 2008), Zhang et al. proposed a Bayesian framework called SUN to analyze visual features. Other popular attention models cover cognitive models, decision theoretic models, graphical models and so on, as detailed in (Borji and Itti, 2013). In all, appearance features are more accessible than attention models because of the latter's complexity.

Usually, hand-crafted features based approaches need to further represent the local features to obtain more useful information for action classification, relying on encoding schemes. Bag of Words (BoWs) is the most widely applied model (Soomro et al., 2012; Kuehne et al., 2011; Wang and Schmid, 2013; Murthy and Goecke, 2015), rewriting each feature as a histogram of the number of units matched to the centers clustered by the K-Means algorithm over the whole feature set. However, BoWs disregards the relationship information between features. In (Perronnin et al., 2010), the Fisher Vector (FV) was proposed to overcome this shortcoming by going beyond count statistics. Another attractive feature encoding framework is the Vector of Locally Aggregated Descriptors (VLAD) (Jegou et al., 2012), which describes features based on their difference relative to the nearest centers, also clustered by K-Means. The work of (Jegou et al., 2012; Xu et al., 2015; Wu et al., 2014) evaluated different encoding methods for video representation, where VLAD displayed more merits in many facets. After action representation, action recognition can be finished by classifying the hand-crafted descriptors. Undoubtedly, the SVMs framework is the most successful in visual classification tasks, as in (Xu et al., 2015; Soomro et al., 2012; Kuehne et al., 2011; Wang et al., 2015; Sekma et al., 2015; Wu et al., 2014; Murthy and Goecke, 2015). Other prevalent classifiers include Bayesian classifiers, K-Nearest Neighbors (KNN), Decision Trees (DT), Neural Networks (NNs), etc. Recently, multi-task learning based methods (Liu et al., 2017; Yang et al., 2016) have also attracted a lot of interest.

A great amount of deep learnt features based work has been conducted since CNNs were successfully applied in the image classification domain (Krizhevsky et al., 2012). As actions are 3D spatial-temporal signals, one way to deep learn actions under CNNs is to use 3D convolution kernels; as a good example, Ji et al. (Ji et al., 2013) developed a 3D CNNs model for actions by convolving 3D kernels over stacked frame cubes. However, the computational complexity of the 3D convolution operation is very large. To handle this problem, in (Sun et al., 2015), Sun et al. introduced Factorized Spatio-Temporal CNNs, factorizing the 3D kernel convolution into a 2D spatial convolution followed by a 1D temporal convolution. Also, Wang et al. (Wang et al., 2016) built a temporal pyramid pooling based CNN for action representation, additionally comprising an encoding, a pyramid pooling and a concatenation layer, which was capable of transforming both motion and appearance features. Moreover, in (Wang et al., 2015), trajectory-pooled CNNs were designed by fusing Improved Trajectories into the CNNs architecture, while an adaptive recurrent convolutional hybrid network consisting of a data module and a learning module was proposed in (Xin et al., 2016). Another solution to represent actions based on CNNs is to treat the learnt descriptors as local features, so that encoding schemes such as BoWs, FV or VLAD can be employed, as explored in (Xu et al., 2015; Yuan et al., 2016; Karpathy et al., 2014; Feichtenhofer et al., 2016). In addition to CNNs, Recurrent Neural Networks (RNNs) are another prevalent deep learning structure for action recognition, learning actions utilizing either a hidden layer or a memory cell, as in (Xin et al., 2016; Veeriah et al., 2015; Du et al., 2015). However, classical RNNs fail to deal with short-term videos owing to exponential decay (Veeriah et al., 2015). To address this problem, RNNs with long short-term memory (LSTM) were designed very recently and gained big success; for instance, Donahue et al. (Donahue et al., 2015) proposed a unified model combining LSTM with CNNs to describe actions, based on which encouraging results were obtained. Other RNN-CNN based deep learning methods include (Sharma et al., 2016; Srivastava et al., 2015).


Fig. 1: Overview of our action recognition method. We first employ three fundamental inputs derived from the action sequence for Convolutional Neural Networks (CNNs) learning: single frames, optical flow and the Motion Stacked Difference Image (MSDI). The CNNs architecture includes 5 convolutional layers (Conv), with or without normalization (Norm) and pooling, and 2 fully connected layers (Full). The features learnt on the three streams (spatial, local temporal and global temporal) are then preprocessed by PCA-Whitening, further represented by soft-VLAD and classified using SVMs.

3. Three-stream CNNs

In this section, we present the three-stream CNNs for deep action feature extraction. We first briefly review the two-stream CNNs, then describe the configuration of our three-stream CNNs, and finally propose a data augmentation method.



3.1. Two-stream CNNs revisited

We start our approach by revisiting the two-stream CNNs (Simonyan and Zisserman, 2014). The purpose of the two-stream CNNs is to adapt CNNs designed for image representation to action recognition through action decomposition and temporal information fusion. Owing to the 3D property of actions, the authors initially hypothesize that actions can be decomposed into a spatial object component and a temporal motion component. Following this pipeline, the two components of the two-stream CNNs, the spatial stream CNNs and the temporal stream CNNs, are introduced, learnt respectively from single action frames and optical flow stacks across an action sequence. The two streams follow the same CNNs learning architecture containing 5 convolutional and max-pooling layers followed by 2 fully connected layers, with rectification (ReLU) activation functions. Table 1 shows the details of the two-stream CNNs in terms of kernel, stride and channel size. In order to train the spatial CNNs, the ImageNet ILSVRC-2012 dataset (Krizhevsky et al., 2012) is employed for pre-training. Labeled images are first resized and randomly cropped to 224 × 224 on the 3 (RGB) channels as inputs of the network. After applying mini-batch stochastic gradient descent to minimize the discrepancy between the labels and the distributions predicted by softmax regression on the outputs (the 2048-dimensional descriptors of the Full7 layer), the weights can be learnt. Based on this, fine-tuning is then carried out according to the number of action classes. The temporal CNNs, in contrast, are trained from scratch without pre-training because optical flow stacks (Brox et al., 2004) are used as inputs; correspondingly, the channel size is changed to 2L, considering the stack length L and the 2D property of optical flow. Also, each channel is constrained to 224 × 224 for iteratively training optimum weights under the same architecture. Action classification is accomplished by fusing the scores of the testing sequence by averaging or with SVMs, which reports promising results.
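As a concrete illustration of the 2L-channel optical flow input described above, the following Python sketch stacks the horizontal and vertical flow fields of L consecutive frame pairs into a single multi-channel array. It uses OpenCV's Farnebäck flow merely as a stand-in for the flow algorithm actually adopted, and the helper name build_flow_stack is ours.

import cv2
import numpy as np

def build_flow_stack(frames, start, L=10, size=(224, 224)):
    """Stack L dense optical flow fields into a 2L-channel network input.

    frames: list of grayscale frames (H x W, uint8); start: index of the
    first frame of the stack. Returns an array of shape (2L, 224, 224).
    """
    channels = []
    for t in range(start, start + L):
        # Dense flow between frame t and frame t+1, shape (H, W, 2).
        flow = cv2.calcOpticalFlowFarneback(frames[t], frames[t + 1], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Horizontal and vertical flow components become two channels each.
        channels.append(cv2.resize(flow[..., 0], size))
        channels.append(cv2.resize(flow[..., 1], size))
    return np.stack(channels, axis=0).astype(np.float32)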

Table 1: Architecture of two-stream CNNs.

Layer     Conv1   Pool1   Conv2   Pool2   Conv3   Conv4   Conv5   Pool5   Full6   Full7
Kernel    7×7     3×3     5×5     3×3     3×3     3×3     3×3     3×3     -       -
Stride    2       2       2       2       1       1       1       2       -       -
Channel   96      96      256     256     512     512     512     512     4096    2048
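For readers who prefer code to a table, the sketch below expresses the per-stream architecture of Table 1 in PyTorch. The padding, normalization and dropout placements are our assumptions, since the table only fixes kernel, stride and channel sizes, and the class name StreamCNN is ours.

import torch.nn as nn

class StreamCNN(nn.Module):
    """Rough sketch of the per-stream architecture of Table 1.

    in_channels is 3 for the spatial stream and 2L (or 6L, see Section 3.2)
    for the temporal streams; num_classes is 101 for UCF101, etc.
    """
    def __init__(self, in_channels=3, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5),                                              # Conv1 (+ Norm)
            nn.MaxPool2d(kernel_size=3, stride=2),                                # Pool1
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(inplace=True),   # Conv2
            nn.MaxPool2d(kernel_size=3, stride=2),                                # Pool2
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),  # Conv3
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),  # Conv4
            nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),  # Conv5
            nn.MaxPool2d(kernel_size=3, stride=2),                                # Pool5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True), nn.Dropout(0.9),    # Full6
            nn.Linear(4096, 2048), nn.ReLU(inplace=True), nn.Dropout(0.9),  # Full7
            nn.Linear(2048, num_classes),                                   # softmax scores
        )

    def forward(self, x):   # x: (batch, in_channels, 224, 224)
        return self.classifier(self.features(x))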

3.2. Three-stream CNNs configurations

Based on the assumption that actions can be decomposed spatially and temporally, we follow the configurations of the two-stream CNNs, except that we replace the original optical flow algorithm with a more efficient one introduced in (Brox et al., 2004), applied on the RGB channels, so that the input of our optical flow stack stream has 6L channels. Thus, given a video clip V, a set of two-stream CNNs descriptors can be obtained:

\[
\mathcal{C}(V) = \{\mathcal{C}^1_s, \mathcal{C}^2_s, \cdots, \mathcal{C}^n_s;\ \mathcal{C}^1_t, \mathcal{C}^2_t, \cdots, \mathcal{C}^{n-1}_t\}
\tag{1}
\]


where C_s^i (i ∈ [1, n]) denotes the spatial CNNs descriptor and C_t^j (j ∈ [1, n−1]) denotes the temporal one; n is selected as 10 according to (Simonyan and Zisserman, 2014) by cropping and flipping each action video clip. Furthermore, we regard the temporal CNNs originating from optical flow as local temporal CNNs, and extend the two-stream CNNs to three-stream CNNs by adding a global temporal CNNs stream that deep learns Motion Stacked Difference Image (MSDI) features, meaning that we consider motions both locally and globally. The MEI (Bobick and Davis, 2001) is a very efficient temporal template that represents actions by accumulating time-independent local features derived from image differences. Because of the accumulation, some significant features are overlapped when the MEI is used for action recognition directly. However, the CNNs architecture is well suited to learning deep hidden features of complicated images. Based on this observation, we seek to construct a global template for action deep learning from the perspective of stacked differences. Specifically, unlike the MEI, which obtains binary images by thresholding image differences, our MSDI template is established by fusing local action features in the form of absolute differences between consecutive images across the whole action sequence, which can be defined as follows:

\[
\mathcal{M}_\tau(x, y) = \sum_{t=2}^{\tau} \mathcal{D}(x, y, t)
\tag{2}
\]

where τ denotes the length of the video, and D(x, y, t) is defined as:

\[
\mathcal{D}(x, y, t) = \lvert \mathcal{I}(x, y, t) - \mathcal{I}(x, y, t-1) \rvert
\tag{3}
\]

where t ∈ [2, τ], and I(x, y, t) denotes the brightness of the pixel (x, y) at time t.
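A minimal NumPy sketch of eqs. (2)-(3) is given below; the function name msdi and the suggested color-map rendering are our own choices.

import numpy as np

def msdi(frames):
    """Motion Stacked Difference Image, eqs. (2)-(3).

    frames: sequence of grayscale frames I(x, y, t), t = 1..tau, each an
    (H, W) array. Returns M_tau(x, y), the accumulated absolute differences
    between consecutive frames.
    """
    acc = np.zeros(frames[0].shape, dtype=np.float32)
    for t in range(1, len(frames)):                       # t = 2..tau in the paper
        acc += np.abs(frames[t].astype(np.float32) -
                      frames[t - 1].astype(np.float32))   # D(x, y, t)
    return acc

# A color-map image for the CNN input can then be produced from the 8-bit
# normalized result, e.g. with cv2.applyColorMap (the choice of map is ours).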

Next, we render M_τ(x, y) as a color map and employ this color map as the input of the third stream CNNs, with the same configuration as the two-stream CNNs illustrated in Table 1. After learning, for a video clip V we obtain one global CNNs descriptor C_g; by combining it with the two-stream CNNs descriptors from the spatial and local temporal streams, the three-stream CNNs descriptors can be denoted as:

\[
\mathcal{C}(V) = \{\mathcal{C}^i_s;\ \mathcal{C}^j_t;\ \mathcal{C}_g \mid i \in [1, n],\ j \in [1, n-1]\}
\tag{4}
\]

3.3. Data augmentation

So far, an important reason why deep learning based action recognition methods have not completely beaten hand-crafted features based methods is the lack of data. Owing to the large complexity of the CNNs architecture, a large dataset is necessary to alleviate overfitting during training. In (Krizhevsky et al., 2012), randomly extracting patches from the initial images and adding Gaussian distributed perturbations to Principal Component Analysis (PCA) processed quantities are used to enlarge the dataset, which is commonly applied (Simonyan and Zisserman, 2014; Brox et al., 2004). However, these operations are purely spatial. Here, we present a quick temporal data augmentation method for our temporal CNNs training.

Fig. 2: Flowchart of our data augmentation algorithm.

For a video sequence V(x, y, l) with l frames, we sample temporally by cropping with a stride s across the whole sequence iteratively. In order to avoid repeated sampling of one image, we restrict the starting frame to the range V(x, y, 1) to V(x, y, s), which also implicitly limits the number of iterations to s. After each iteration, we get a sparser video sequence with about l/s frames. To align the sparser sequences, we remove the last one or several frames so that every clip has an integer number of frames; the final length of the augmented videos is therefore INT(l/s). Thus, for a stride s, we can obtain s video clips: {V(x, y, 1), V(x, y, 1+s), · · · , V(x, y, 1+[INT(l/s)−1]s)}, {V(x, y, 2), V(x, y, 2+s), · · · , V(x, y, 2+[INT(l/s)−1]s)}, · · · , {V(x, y, s), V(x, y, 2s), · · · , V(x, y, s+[INT(l/s)−1]s)}, as illustrated in Figure 2. Considering the duration of different videos, in our settings, for short videos (< 1 s), s is selected as 2, which yields 2 clips; for long videos (> 5 s), s is set to 2, 3 and 5 respectively, which yields 10 (2+3+5) clips; and for moderate videos (1-5 s), we set s to 2 and 3, yielding 5 (2+3) clips.
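The strided temporal cropping just described can be written compactly as follows; the helper name temporal_crops and the 1-based frame indexing are ours.

def temporal_crops(num_frames, stride):
    """Temporal data augmentation by strided sampling (Section 3.3).

    Returns `stride` lists of 1-based frame indices; the k-th clip starts at
    frame k and takes every stride-th frame, so every clip keeps INT(l/s) frames.
    """
    length = num_frames // stride                 # INT(l / s)
    return [[start + i * stride for i in range(length)]
            for start in range(1, stride + 1)]

# Example: a 23-frame video with s = 3 yields 3 sub-clips of 7 frames each;
# temporal_crops(23, 3)[0] == [1, 4, 7, 10, 13, 16, 19].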


4. Classification scheme

We introduce our classification scheme for action recognition in this section. First, we apply a soft Vector of Locally Aggregated Descriptors (soft-VLAD) operation to the learnt features of each stream; then the obtained descriptors are categorized under the SVMs framework.

4.1. soft-VLAD encoding

VLAD (Jegou et al., 2012) is an effective global representation scheme first proposed for image classification. By clustering K centers using the K-means algorithm on the scattered data, VLAD achieves encoding through concatenating the differences between the data and their nearest centers. Unfortunately, VLAD is a hard assignment process (Peng et al., 2016), since the final descriptors are directly quantized with respect to only one codeword (the nearest center) instead of taking multiple codewords into account (soft assignment). It is easy to understand that soft assignment is more appropriate than hard assignment because some descriptors share multiple candidates. Also, the assignment of VLAD's core component, K-means, is itself hard. In our work, to improve the performance of VLAD, we introduce a soft-VLAD algorithm to encode a group of data.

First of all, we preprocess the input data using the PCA-Whitening technique to reduce their correlation, which has been shown to be very important for encoding (Peng et al., 2016). From this, a data set can be obtained: {d_1, · · · , d_m | m ≥ 2}. Then, we select K̄ (K̄ ≤ m) Gaussian distributions to fit the overall distribution of the preprocessed data. Let g(x) denote the Gaussian probability density function, and choose as the objective function the log-likelihood \(\sum_{i=1}^{m} \log P(x_i)\) of a Gaussian Mixture Model (GMM) (Stauffer and Grimson, 1999), whose probability density function has the form \(P(x) = \sum_{k=1}^{\bar{K}} w_k\, g(x \mid k)\), where the weight w_k denotes the probability that the variable x matches the k-th Gaussian distribution. We can obtain the optimum parameters of the GMM by the Expectation Maximization (EM) algorithm: first initialize the GMM empirically, then for each x_i estimate the posterior probability w_ij according to the following equation:

\[
w_{ij} = \frac{w_j\, g(x_i \mid j)}{\sum_{r=1}^{\bar{K}} w_r\, g(x_i \mid r)}
\tag{5}
\]

By maximizing the objective function with a gradient ascent step using the probabilities estimated from eq. (5), we can update the parameters from the initialized ones; moreover, we can continue the update iteratively until the change is small enough, obtaining the final parameters. After that, based on the estimated GMM, we can derive K̄ centers {c_1, c_2, · · · , c_K̄} by distributing the data according to the weights. Finally, each d_i can be encoded into e_i by summing its weighted distances with respect to the related centers:

\[
e_i = \sum_{j=1}^{\bar{K}} w_{ij}\, (d_i - c_j)
\tag{6}
\]
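The sketch below outlines one possible soft-VLAD implementation with scikit-learn. The diagonal-covariance GMM, the per-center pooling of the posterior-weighted residuals of eq. (6) and the final L2 normalization are our assumptions about how the scheme is concretized.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_soft_vlad(train_descs, k=512, dims=1000):
    """Fit the PCA-Whitening transform and the GMM vocabulary (eq. (5))."""
    pca = PCA(n_components=dims, whiten=True).fit(train_descs)
    gmm = GaussianMixture(n_components=k, covariance_type='diag').fit(
        pca.transform(train_descs))
    return pca, gmm

def soft_vlad_encode(pca, gmm, descs):
    """Encode the descriptors of one video on one stream."""
    d = pca.transform(descs)          # preprocessed data d_i, shape (m, dims)
    w = gmm.predict_proba(d)          # posteriors w_ij,        shape (m, k)
    c = gmm.means_                    # centers c_j,            shape (k, dims)
    # Posterior-weighted residuals, pooled per center and concatenated.
    v = np.einsum('ij,ijd->jd', w, d[:, None, :] - c[None, :, :]).ravel()
    return v / (np.linalg.norm(v) + 1e-12)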

It is worth noting that in our work, the learnt descriptors of all videos, {C(V_1), C(V_2), · · · , C(V_ℏ)}, are encoded on the three streams separately, where ℏ indicates the size of the action dataset.

4.2. SVMs classification

Having encoded the learnt features, we categorize them for action classification under the SVMs framework with extended kernels using the χ2 distance, following (Zhang et al., 2007). For two arbitrary q-dimensional encoded descriptors E_1 = (α_1, · · · , α_q) and E_2 = (β_1, · · · , β_q), we define the χ2 distance as follows:

\[
\Upsilon(E_1, E_2) = \frac{1}{2} \sum_{i=1}^{q} \frac{(\alpha_i - \beta_i)^2}{\alpha_i + \beta_i}
\tag{7}
\]
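A small sketch of the χ2 distance of eq. (7), turned into a precomputed SVM kernel, is shown below; the exponential kernel form and the γ heuristic are our assumptions about how the extended kernel of (Zhang et al., 2007) is instantiated.

import numpy as np
from sklearn.svm import SVC

def chi2_distance(e1, e2):
    """Chi-square distance of eq. (7) between two encoded descriptors."""
    den = np.maximum(e1 + e2, 1e-12)          # guard against zero denominators
    return 0.5 * np.sum((e1 - e2) ** 2 / den)

def chi2_kernel(X, Y, gamma=None):
    """Exponential chi-square kernel K = exp(-gamma * chi2) between row sets."""
    D = np.array([[chi2_distance(x, y) for y in Y] for x in X])
    if gamma is None:
        gamma = 1.0 / np.mean(D)              # common heuristic; an assumption here
    return np.exp(-gamma * D)

# Usage sketch (train_X, train_y, test_X assumed given):
# clf = SVC(kernel='precomputed', C=100).fit(chi2_kernel(train_X, train_X), train_y)
# pred = clf.predict(chi2_kernel(test_X, train_X))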

5. Experiments

We evaluate our action recognition approach on four popular and comparatively large datasets: UCF101 (Soomro et al., 2012), HMDB51 (Kuehne et al., 2011), Hollywood2 (Marszałek et al., 2009) and YouTube (Liu et al., 2009). UCF101 comprises 13,320 videos annotated into 101 classes, with complex action scenarios. HMDB51 includes 6,766 videos divided into 51 categories. Hollywood2 contains 12 actions, and we select the clean subset with 1,707 samples. YouTube includes 11 actions with 1,168 videos. To keep the results consistent with others for comparison, we follow the three-split protocol for training and testing on UCF101 and HMDB51, and adopt the mean accuracy as our recognition rate. We employ the Caffe toolbox for CNNs learning (Jia et al., 2014), and choose the VLFeat toolbox (Vedaldi and Fulkerson, 2010) for the soft-VLAD and SVMs implementation.



5.1. Implementation details

When training the three-stream CNNs, the mini-batch size of the gradient descent algorithm is set to 256, and the dropout ratio is chosen as 0.9, according to (Simonyan and Zisserman, 2014). Because pre-training is used for the spatial and global temporal streams while the local temporal stream is trained from scratch, we set the initial learning rate to 10^-2, which has proved very suitable (Wang et al., 2015), and change it to 10^-3 after 50K and 15K iterations, respectively. Besides, the learning rate is adjusted to 10^-4 for fine-tuning after 70K iterations, and the training is stopped after 90K and 30K iterations, respectively. Moreover, the default optical flow stack number L is set to 10. When performing PCA, the 2048-dimensional deep learnt descriptors are reduced to 1000-dimensional vectors while guaranteeing that the contribution rate is higher than 95%; furthermore, when running the GMM for clustering, we use random means and covariances to initialize 512 vocabularies by default. Note that these parameters are adjustable, and the impact of the subjective ones on the action recognition rate is analyzed in Section 5.3.
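For reference, the training and encoding hyper-parameters stated above can be collected into a single configuration; the dictionary layout below is merely our illustrative organisation of the values given in the text.

# Hyper-parameters gathered from Section 5.1 (values as stated in the text).
TRAIN_CONFIG = {
    "batch_size": 256,
    "dropout": 0.9,
    "learning_rate": {
        # spatial and global temporal streams (pre-trained); our reading is
        # that the 10^-4 fine-tuning step applies to these streams.
        "pretrained": {"initial": 1e-2, "to_1e-3_after": 50_000,
                       "to_1e-4_after": 70_000, "stop_after": 90_000},
        # local temporal stream (trained from scratch)
        "from_scratch": {"initial": 1e-2, "to_1e-3_after": 15_000,
                         "stop_after": 30_000},
    },
    "optical_flow_stack_L": 10,
    "pca_dims": 1000,             # keeps > 95% of the variance of the 2048-D features
    "gmm_vocabulary_size": 512,
}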

Table 2: Average recognition rate of three-stream CNNs (average accuracy).

Method                                                                           UCF101   HMDB51   Hollywood2   YouTube
Spatial CNNs + softmax regression + averaging (Simonyan and Zisserman, 2014)     72.8%    -        -            -
Temporal CNNs + softmax regression + averaging (Simonyan and Zisserman, 2014)    81.2%    55.4%    -            -
Two-stream CNNs + SVMs fusion (Simonyan and Zisserman, 2014)                     88.0%    59.4%    -            -
Spatial CNNs + softmax regression + averaging                                    72.4%    49.7%    50.9%        52.7%
Local temporal CNNs + softmax regression + averaging                             81.6%    56.0%    62.6%        68.5%
Global temporal CNNs + softmax regression + averaging                            78.8%    52.9%    63.3%        65.8%
Three-stream CNNs + SVMs fusion                                                  89.7%    61.3%    70.6%        78.2%

Table 3: Recognition results under different classification frameworks based on three-stream CNNs features (average accuracy).

Classification framework             UCF101   HMDB51   Hollywood2   YouTube
FV + SVMs                            91.7%    65.0%    73.3%        81.0%
BoWs + SVMs                          88.4%    62.2%    70.1%        78.4%
VLAD + SVMs                          90.9%    66.6%    72.9%        80.8%
Softmax regression + SVMs fusion     89.7%    61.3%    70.6%        78.2%
Soft-VLAD + SVMs                     92.1%    67.2%    73.4%        81.3%

Table 4: Recognition results using CNNs features with or without data augmentation (average accuracy).

Action features                                 UCF101   HMDB51   Hollywood2   YouTube
Two-stream CNNs without data augmentation       88.0%    59.4%    62.1%        69.4%
Two-stream CNNs with data augmentation          89.3%    61.9%    65.5%        73.7%
Three-stream CNNs without data augmentation     92.1%    67.2%    73.4%        81.3%
Three-stream CNNs with data augmentation        93.4%    68.3%    74.6%        83.8%

Table 5: Recognition results using different methods (average accuracy on UCF101, HMDB51, Hollywood2 and YouTube).

Compared methods: Multiresolution CNNs (Karpathy et al., 2014); LSTM with 30 frame unroll (Ng et al., 2015); Factorized spatio-temporal CNNs (Sun et al., 2015); Two-stream CNNs (Simonyan and Zisserman, 2014); Temporal pyramid CNNs (Wang et al., 2016); Adaptive RNN-CNNs (Xin et al., 2016); Trajectory-pooled CNNs (Wang et al., 2015); VideoDarwin (Fernando et al., 2015); Multiple dynamic images (Bilen et al., 2016); RLSTM-g3 (Mahasseni and Todorovic, 2016); Hierarchical clustering multi-task learning (Liu et al., 2017); Discriminative representation (Yuan et al., 2016); Ordered trajectories (Murthy and Goecke, 2015); Improved trajectories (Wang and Schmid, 2013); Super-category exploration (Yang et al., 2016); Multi-layer Fisher Vector (Sekma et al., 2015).

Reported accuracies of the compared methods, per dataset (in the order above; methods without a result on a dataset are omitted):
UCF101: 65.4%, 88.6%, 88.1%, 88.0%, 89.1%, 91.5%, 89.1%, 86.9%, 76.3%, 79.7%, 72.8%
HMDB51: 59.1%, 59.4%, 63.1%, 61.1%, 65.9%, 63.7%, 65.2%, 55.3%, 51.4%, 28.2%, 47.3%, 57.2%, 60.8%, 68.5%
Hollywood2: 71.9%, 63.1%, 73.7%, 78.5%, 64.3%, 70.3%
YouTube: 89.7%, 79.7%

Proposed algorithm: 93.4% (UCF101), 68.3% (HMDB51), 74.6% (Hollywood2), 83.8% (YouTube).

5.2. Recognition results


We also evaluate the impact of our classification scheme by classifying the three-stream CNNs features using softmax regression followed by SVMs fusion, and Bag of Words (BoWs), Fisher Vector (FV) and Vector of Locally Aggregated Descriptors (VLAD) each followed by SVMs. Table 3 lists the recognition results on the different action benchmarks under the different categorization frameworks, from which we can see that the proposed soft-VLAD is better suited to the classification of three-stream CNNs features. To verify the effectiveness of our data augmentation strategy, we perform action recognition under our classification scheme using augmented and non-augmented two-stream and three-stream learnt data, respectively. Table 4 shows the corresponding accuracy. Clearly, the recognition results are improved to a large extent by the data augmentation. Finally, Table 5 comprehensively compares our approach with other state-of-the-art techniques; the action recognition rate


of our algorithm reaches 93.4% on UCF101, which is very encouraging. The 68.3% accuracy on HMDB51, 74.6% on Hollywood2 and 83.8% on YouTube are also rather competitive. We also find that the failure cases mainly happen when the scenario changes strongly, and especially when two actions are similar, such as "eat", "talk" and "drink" in the most challenging HMDB51 dataset. Considering the scenario variation and action similarity of UCF101, Hollywood2 and YouTube, better action recognition results are attainable there. Meanwhile, it is easy to see that some hand-crafted features based work obtains good experimental results thanks to feature-encoding techniques such as BoW, FV or VLAD, while some deep learning methods recognize actions well owing to well-fitted learning architectures and training settings.

5.3. Exploration of parameter sensitivity

We notice that three parameters, the dropout ratio, the GMM vocabulary size K̄ and the optical flow stack number L, are influential for our action recognition scheme. The sections above directly give the optimum values; however, those settings are based on experimental analysis. Here, we show the exploration process.

According to most CNNs learning procedures (Wang et al., 2015; Sun et al., 2015; Szegedy et al., 2015), the dropout ratio is often set empirically in the range 0.3-0.9. Likewise, the GMM vocabulary size is usually chosen as 2^κ, where κ is sampled within 6-11. The optical flow stack number is selected as 1 (single-frame optical flow), 5 or 10 according to (Simonyan and Zisserman, 2014). Figures 3, 4 and 5 illustrate the effect of these parameters on our action recognition based on three-stream CNNs under the proposed classification scheme with augmented data, which suggests that our default values are the most appropriate.

Fig. 3: Impact of dropout ratio on action recognition accuracy.

Fig. 4: Impact of GMM vocabulary size K̄ on action recognition accuracy.

Fig. 5: Impact of optical flow stack length L on action recognition accuracy.

6. Conclusion

In this paper, we proposed a novel three-stream CNNs system incorporating deep learnt single frame, optical flow and MSDI features. The developed architecture is complementary in the spatial and temporal domains. Furthermore, we presented a soft-VLAD encoding scheme to represent the three-stream CNNs features, built upon the overall GMM probability distribution of the data and the clustered centers. A special data augmentation scheme for action datasets was also introduced, based on cropping across action videos. We evaluated our methodology on challenging datasets, and the results validate its effectiveness. As more and more powerful deep learning architectures have emerged in recent years (Krizhevsky et al., 2012; Xin et al., 2016; Szegedy et al., 2015), we will explore the applicability of our framework with a more promising learning architecture in future work. Meanwhile, it would be very interesting to build a more extensive action benchmark, which is extraordinarily needed for the coming action recognition tasks in real-world environments.

Acknowledgment

This research is supported by the National Natural Science Foundation of China (Grant No: 661273339). The authors would also like to thank Berthold K. P. Horn for his good ideas during the first author's visiting study at MIT CSAIL.

References

Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., Gould, S., 2016. Dynamic image networks for action recognition, in: CVPR, pp. 3034–3042.
Bobick, A., Davis, J., 2001. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 1106–1114.
Borges, P.V.K., Conci, N., Cavallaro, A., 2013. Video-based human behavior understanding: A survey. IEEE Transactions on Circuits and Systems for Video Technology 11, 1993–2008.
Borji, A., Itti, L., 2013. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 185–207.
Brox, T., Bruhn, A., Papenberg, N., Weickert, J., 2004. High accuracy optical flow estimation based on a theory for warping, in: ECCV, pp. 25–36.
Dalal, N., Triggs, B., 2005. Histograms of oriented gradients for human detection, in: CVPR, pp. 886–893.
Donahue, J., Hendricks, L., Guadarrama, S., Rohrbach, M., 2015. Long-term recurrent convolutional networks for visual recognition and description, in: CVPR, pp. 2625–2634.
Du, Y., Wang, W., Wang, L., 2015. Hierarchical recurrent neural network for skeleton based action recognition, in: ICCV, pp. 1110–1118.
Feichtenhofer, C., Pinz, A., Zisserman, A., 2016. Convolutional two-stream network fusion for video action recognition, in: CVPR, pp. 1933–1941.
Fernando, B., Gavves, E., Oramas, M., Ghodrati, A., Tuytelaars, T., 2015. Modeling video evolution for action recognition, in: CVPR, pp. 5378–5387.
Jegou, H., Perronnin, F., Douze, M., Sanchez, J., 2012. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 1704–1716.
Ji, S., Xu, W., Yang, M., Yu, K., 2013. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 221–231.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T., 2014. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., 2014. Large-scale video classification with convolutional neural networks, in: CVPR, pp. 1725–1732.
Kläser, A., Marszałek, M., Schmid, C., 2008. A spatio-temporal descriptor based on 3D-gradients, in: British Machine Vision Conference, pp. 995–1004.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, 1106–1114.




Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T., 2011. HMDB: A large video database for human motion recognition, in: ICCV, pp. 2556–2563.
Laptev, I., 2005. On space-time interest points. International Journal of Computer Vision 64, 107–123.
Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B., 2008. Learning realistic human actions from movies, in: CVPR, pp. 1–8.
Liu, A., Su, Y., Nie, W., Kankanhalli, M., 2017. Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 102–114.
Liu, J., Luo, J., Shah, M., 2009. Recognizing realistic actions from videos in the wild, in: CVPR, pp. 1996–2003.
Mahasseni, B., Todorovic, S., 2016. Regularizing long short term memory with 3D human-skeleton sequences for action recognition, in: CVPR, pp. 3054–3062.
Marszałek, M., Laptev, I., Schmid, C., 2009. Actions in context, in: CVPR, pp. 2929–2936.
Murthy, O.V.R., Goecke, R., 2015. Ordered trajectories for human action recognition with large number of classes. Image and Vision Computing 42, 22–34.
Ng, J.Y., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G., 2015. Beyond short snippets: Deep networks for video classification, in: CVPR, pp. 4694–4702.
Peng, X., Wang, L., Wang, X., Qiao, Y., 2016. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. Computer Vision and Image Understanding 150, 109–125.
Perronnin, F., Sánchez, J., Mensink, T., 2010. Improving the Fisher kernel for large-scale image classification, in: ECCV, pp. 119–133.
Scovanner, P., Ali, S., Shah, M., 2007. A 3-dimensional SIFT descriptor and its application to action recognition, in: International Conference on Multimedia, pp. 357–360.
Sekma, M., Mejdoub, M., Amar, C.B., 2015. Human action recognition based on multi-layer Fisher vector encoding method. Pattern Recognition Letters 65, 37–43.
Sharma, S., Kiros, R., Salakhutdinov, R., 2016. Action recognition using visual attention, in: ICLR.
Simonyan, K., Zisserman, A., 2014. Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems, 568–576.
Soomro, K., Zamir, A.R., Shah, M., 2012. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01.
Srivastava, N., Mansimov, E., Salakhutdinov, R., 2015. Unsupervised learning of video representations using LSTMs, in: ICML, pp. 843–852.
Stauffer, C., Grimson, W.E.L., 1999. Adaptive background mixture models for real-time tracking, in: CVPR, pp. 246–252.
Sun, L., Jia, K., Yeung, D., Shi, B.E., 2015. Human action recognition using factorized spatio-temporal convolutional networks, in: ICCV, pp. 4597–4605.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: CVPR, pp. 1–9.
Vedaldi, A., Fulkerson, B., 2010. VLFeat: An open and portable library of computer vision algorithms, in: International Conference on Multimedia, pp. 1469–1472.
Veeriah, V., Zhuang, N., Qi, G., 2015. Differential recurrent neural networks for action recognition, in: ICCV, pp. 4041–4049.
Wang, H., Schmid, C., 2013. Action recognition with improved trajectories, in: ICCV, pp. 3551–3558.
Wang, L., Qiao, Y., Tang, X., 2015. Action recognition with trajectory-pooled deep-convolutional descriptors, in: CVPR, pp. 4305–4314.
Wang, P., Cao, Y., Shen, C., Liu, L., 2016. Temporal pyramid pooling based convolutional neural network for action recognition. IEEE Transactions on Circuits and Systems for Video Technology.
Weinland, D., Ronfard, R., Boyer, E., 2011. A survey of vision-based methods for action representation, segmentation and recognition. Computer Vision and Image Understanding 115, 224–241.
Willems, G., Tuytelaars, T., Gool, L.V., 2008. An efficient dense and scale-invariant spatio-temporal interest point detector, in: ECCV, pp. 650–663.
Wu, J., Zhang, Y., Lin, W., 2014. Towards good practices for action video encoding, in: CVPR, pp. 2577–2584.
Xin, M., Zhang, H., Wang, H., Sun, M., Yuan, D., 2016. ARCH: Adaptive recurrent-convolutional hybrid networks for long-term action recognition. Neurocomputing 178, 87–102.
Xu, Z., Yang, Y., Hauptmann, A.G., 2015. A discriminative CNN video representation for event detection, in: CVPR, pp. 1798–1807.
Yang, Y., Liu, R., Deng, C., Gao, X., 2016. Multi-task human action recognition via exploring super-category. Signal Processing 124, 36–44.
Yuan, Y., Zheng, X., Lu, X., 2016. A discriminative representation for human action recognition. Pattern Recognition 59, 88–97.
Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C., 2007. Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision 73, 213–238.
Zhang, L., Tong, M., Marks, T., Shan, H., Cottrell, G., 2008. SUN: A Bayesian framework for saliency using natural statistics. Journal of Vision 8, 1–20.
Computer Vision and Image Understanding 115, 224–241. Willems, G., Tuytelaars, T., Gool, L.V., 2008. An efficient dense and scaleinvariant spatio-temporal interest point detector, in: ECCV, pp. 650–663. Wu, J., Zhang, Y., Lin, W., 2014. Towards good practices for action video encoding, in: CVPR, pp. 2577–2584. Xin, M., Zhang, H., Wang, H., Sun, M., Yuan, D., 2016. Arch: Adaptive recurrent-convolutional hybrid networks for long-term action recognition. Neurocomputing 178, 87–102. Xu, Z., Yang, Y., Hauptmann, A.G., 2015. A discriminative cnn video representation for event detection, in: CVPR, pp. 1798–1807. Yang, Y., Liu, R., Deng, C., Gao, X., 2016. Multi-task human action recognition via exploring super-category. Signal processing 124, 36–44. Yuan, Y., Zheng, X., Lu, X., 2016. A discriminative representation for human action recognition. Pattern Recognition 59, 88–97.