Correlational Convolutional LSTM for human action recognition


Mahshid Majd, Reza Safabakhsh∗
Amirkabir University of Technology, Tehran, Iran
∗ Corresponding author. E-mail addresses: [email protected] (M. Majd), [email protected] (R. Safabakhsh).

Article history: Received 6 March 2018; Revised 4 September 2018; Accepted 12 October 2018.

Keywords: Human action recognition; Deep learning; Convolutional networks; LSTM; ConvLSTM

Abstract

In light of the recent exponential growth of video data, the need for automated video processing has increased substantially. To learn the intrinsic structure of video data, many representation approaches have been proposed that focus on learning spatial features and time dependencies, while motion features are hand-crafted and left out of the learning process. In this work, we present an extended version of the LSTM unit named C2 LSTM, in which motion data is perceived as well as spatial features and temporal dependencies. We leverage convolution and correlation operators to capture both the spatial and the motion structure of video data. Furthermore, a deep network is designed for human action recognition using the proposed units. The network is evaluated on two well-known benchmarks, UCF101 and HMDB51. The results confirm the potency of C2 LSTM in capturing motion as well as spatial features and time dependencies.

1. Introduction

A large amount of video data is produced every day by the users of modern commonplace technology. Video is a rich source of information with numerous applications to enhance the quality of life. As the volume of video data, its quality, and its accessibility increase, implementing effective video processing methods becomes a more crucial goal. A widely applicable way of processing video is recognizing the human actions detected in it, which is called human action recognition. From assisted living and health-care monitoring to surveillance systems, there are many applications for human action recognition methods.

Over recent decades, action recognition techniques have progressed from traditional methods limited to controlled environments to advanced algorithms covering a wide variety of actions. Currently, deep neural networks are the preferred advanced algorithms applied to many different areas [1], including action recognition. These networks focus on learning the best representation of information (i.e., features). As in other knowledge representation models, learned features outperform hand-crafted ones [2].

A deep learning approach for action recognition in videos aims to understand the video in terms of both its spatial and its temporal aspects. Several deep architectures have been proposed which have one or more of the following components: convolutional layers to extract spatial features, a mechanism to acquire and then process motion features, and LSTM layers to extract the temporal dependencies of features over time [3–5].




While spatial features and temporal dependencies are automatically learned through the network layers, motion features are usually extracted with a traditional hand-crafted method such as optical flow. Automatic estimation of optical flow using deep networks has recently been proposed [6,7], where a cross-correlation operator is introduced to extract the relation between two consecutive video frames. Generally, cross-correlation measures the similarity of two signals and is usually used to search for a short signal within a longer one. The same concept is used in the aforementioned studies to calculate the motion in one video frame relative to its preceding frame. This idea was also utilized to recognize actions in a two-stream network [8].

Motivated by the potency of the cross-correlation operator, we extend the LSTM unit so that it considers both spatial and motion features while constructing temporal dependencies. The proposed unit is named Correlational Convolutional LSTM, C2 LSTM. We incorporate the two basic operators of convolution and cross-correlation inside the LSTM unit to cover the spatial and motion aspects of a video while extracting the temporal dependencies: convolution handles the spatial features and cross-correlation extracts the motion. To reduce overfitting, spatial and temporal augmentations are applied. We report results on two well-known benchmark datasets, UCF101 and HMDB51, which confirm the effectiveness of the proposed unit.

In summary, the main contribution of this paper is twofold. First, a new LSTM-like unit for representing video is presented which can effectively extract spatial and motion features in addition to temporal dependencies.


Second, a network architecture based on the proposed unit is introduced for action recognition, where video representation is quite influential. Extensive examination of the network's performance demonstrates the success of our unit in extracting useful features.

The remainder of this paper is organized as follows. First, the literature on deep learning methods for human action recognition is briefly reviewed in Section 2. The LSTM unit proposed in this paper is described in Section 3. Next, the experimental results are presented in Section 4. Finally, the paper is concluded in Section 5.

2. Related work

Human action recognition in video has been a popular subject of research in recent decades, and this interest has increased significantly since the emergence of deep learning. There are several valuable surveys on action recognition methods using shallow and deep models [9,10]. Here, we summarize some of the most important action recognition methods based on deep networks.

Since video data has both spatial and temporal features in addition to long temporal dependencies, complex feature extraction networks have been proposed which differ mainly in how they handle the time dimension. One set of methods expands the well-known 2D convolutional methods to 3D filters [11–13]. These methods expectedly outperform 2D methods in representing videos, but they have two main drawbacks: (1) a staggering number of learnable parameters and (2) restrictive assumptions on the number of video frames. Another set of methods suggests two separate streams of network layers that process video frames as the spatial data and dense optical flow as the motion data [14,15]. Although calculating optical flow can be time consuming, this motion data has been shown to be much more informative than the spatial part of the video. A third set of methods attempts to model the temporal dependency of the features using LSTM networks [3–5], since recurrent connections provide the means to process temporal data. The basic idea is to serialize some convolutional layers and feed their output to an LSTM layer.

The convolutional and recurrent streams were not actually merged into one network until the work of Xingjian et al. [16], although before that, Eigen et al. had introduced the concept of recursion for convolutional networks [17], where the weights of the higher convolutional layers were tied between layers and this relation was considered equivalent to recurrence. The architecture was quite novel, but its performance and characteristics were not examined thoroughly. ConvLSTM was first introduced in [16]: the LSTM network is remodeled so that it can handle spatial information as well as temporal dependencies. To do so, a convolution operator is used in the state-to-state and input-to-state transitions of the LSTM equations, and the input, output, and hidden cells plus the gates are all treated as 3D tensors. The network was applied to a synthetic video prediction dataset, showed promising results, and surpassed FC-LSTM. In another attempt [18], the product operator was replaced with convolution in the Attention-based LSTM (ALSTM) to cover the spatial information of videos, and the network was called VideoLSTM. Moreover, the authors introduced a motion-based attention mechanism realized with a shallow convolutional network.
Recently, Jung et al. [19] noted that recurrent CNNs suffer from slow computation and proposed a temporal normalization method called adaptive detrending, which removes a slowly changing trend from a time series to make it stationary. They applied this method to a convolutional gated recurrent unit, which led to more efficient results. To evaluate their method, the authors utilized two contextual video datasets that require both spatial and temporal information for categorization.

A hierarchical architecture combining ConvLSTM and soft attention was proposed in [11] to recognize actions. The attention map was built on a Residual-152 network trained on ImageNet, and the ConvLSTM was the same as in the work of Xingjian et al. [16]. Sun et al. [20] introduced Lattice LSTM, which learns different sets of patch-level filters for different spatial locations of the video. They introduced a local superposition operator applied to the hidden transition of the cell memory within the recurrence, while the other linear combinations of their LSTM module were convolutional. Moreover, they used a sampling technique in which temporal augmentation is applied to the videos to obtain samples with different movement ratios. The video clips were fed through pretrained spatial and temporal networks at each time step and then processed through a two-stream framework; the input and forget gates of the two streams shared weights, while the other gates were trained independently.

What we can see so far is that motion acts as a key source of information in human action recognition, and optical flow is usually the favored representation of motion. Apart from traditional methods for estimating optical flow, there is ongoing research on estimating optical flow with deep networks [6,7]. In these works, two consecutive frames of the input video pass through two streams of convolutional layers, and then a new layer, the correlation layer, computes the correlation of the resulting features in a patch-wise manner. In a simpler setting, the two images are stacked and fed to a generic convolutional network. Inspired by the success of this network, a stream of convolutional layers is proposed in [21] to perform action recognition and optical flow estimation at the same time, as a multi-objective network. The network, named ActionFlowNet, uses a 3D ResNet to extract features; the extracted features pass through two streams, a stream of deconvolutional layers to estimate the optical flow and a stream of fully connected layers to classify the actions. In another attempt [8], the two-stream action recognition network was altered so that it estimates the optical flow through an encoder-decoder without supervision and then applies the usual temporal-stream CNN.

3. Correlational Convolutional LSTM: C2 LSTM

While learned features are widely preferred over hand-crafted ones, there is still a long way to go towards learning all aspects of video data. Three major types of information shape typical video data: spatial information, motion information, and the temporal dependencies of this information. The structure of existing networks for modeling video data is typically based on these three aspects: such networks serialize components that extract each aspect individually, with convolutional layers responsible for learning the spatial features and LSTM layers discovering the temporal dependencies. Unlike current networks, where the motion data is captured through optical flow either with traditional methods or with a deep network, the proposed architecture extracts the motion information through a spatiotemporal component. The proposed LSTM-like unit is used to extract both the spatial and the motion characteristics of the video simultaneously. The overall view of the proposed network architecture is presented in Fig. 1.
First, the input video data is fed to two parallel identical convolutional networks with shared weights, so that each frame goes through a series of convolutional and pooling layers that extract basic spatial features. Next, these features are passed to the spatiotemporal component, which consists of a layer of the proposed C2 LSTM units. Finally, a classifier, usually softmax, labels the resulting features. Note that each input video clip contains several frames, which are separately fed to the convolutional towers.
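As an illustration of the pipeline just described, the following PyTorch-style sketch applies one shared convolutional tower to every frame (reusing a single module is equivalent to parallel towers with shared weights), stacks the per-frame feature maps over time, and hands them to a recurrent layer followed by a softmax classifier. The tiny tower, the feature sizes, and the injected c2lstm_layer module are illustrative placeholders rather than the exact configuration of the paper, which uses the first three blocks of VGG16 (see Section 4.2).

```python
import torch
import torch.nn as nn

class C2LSTMActionNet(nn.Module):
    """Sketch of the Fig. 1 pipeline; names, sizes and the tower are illustrative."""

    def __init__(self, c2lstm_layer, num_classes=101, feat_ch=128):
        super().__init__()
        # Stand-in for the convolutional tower; one module reused per frame
        # plays the role of the parallel towers with shared weights.
        self.tower = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, feat_ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.c2lstm = c2lstm_layer            # recurrent layer over frame features (assumed given)
        self.classifier = nn.Linear(feat_ch, num_classes)

    def forward(self, clip):                  # clip: (batch, time, 3, H, W)
        t = clip.shape[1]
        feats = [self.tower(clip[:, i]) for i in range(t)]   # per-frame feature maps
        feats = torch.stack(feats, dim=1)     # (batch, time, feat_ch, H', W')
        h_last = self.c2lstm(feats)           # assumed to return the last hidden map
        pooled = h_last.mean(dim=(2, 3))      # global average pooling -> (batch, feat_ch)
        return self.classifier(pooled).softmax(dim=-1)       # per-clip class probabilities
```

In practice one would train on the pre-softmax logits with a cross-entropy loss; the explicit softmax here simply mirrors the classifier shown in Fig. 1.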


Fig. 1. Proposed architecture of the network. Each frame is fed to a convolutional tower, i.e., a series of convolutional and pooling layers. Next, the outputs of all towers are concatenated and passed to the C2 LSTM layer. At the end, a classifier, usually softmax, maps the extracted features to a label.

Fig. 2. Proposed C2 LSTM unit. ρ represents the correlation operator defined in Eq. (2).

The outputs of the towers are concatenated and passed on as a single output. Compared to the usual networks for processing video data, our network concurrently extracts the spatial and motion information as well as the time dependencies through a single layer of C2 LSTM units. In other words, the proposed C2 LSTM unit is the heart of this architecture. In the following, this unit and its operators are described in more detail.

We attempt to upgrade the LSTM unit so that, in addition to the temporal dependencies, the spatial and motion characteristics of the video are also taken into account. C2 LSTM is presented in Fig. 2. This unit has two fundamental differences from the general LSTM unit. First, it is based on ConvLSTM units, where the inputs, gates, and weights are 2D arrays and the multiplication operators are replaced with convolution. Second, the input from the previous time step t−1 is also fed to the unit to compute the correlation of the two consecutive inputs. The unit is therefore formulated as follows:

CR_t = X_{t−1} ⊛ X_t
I_t = σ(W_xi ∗ X_t + W_hi ∗ H_{t−1} + W_ci ∗ CR_t + b_i)
F_t = σ(W_xf ∗ X_t + W_hf ∗ H_{t−1} + W_cf ∗ CR_t + b_f)
O_t = σ(W_xo ∗ X_t + W_ho ∗ H_{t−1} + W_co ∗ CR_t + b_o)
C̃_t = σ(W_xc ∗ X_t + W_hc ∗ H_{t−1} + W_cc ∗ CR_t + b_c)
C_t = σ(F_t ⊙ C_{t−1} + I_t ⊙ C̃_t)
H_t = O_t ⊙ tanh(C_t)                                                        (1)

where ⊛ represents the batch-wise correlation operator, ∗ is the convolution operator, ⊙ is the element-wise (dot) product, and CR_t denotes the correlation matrix at time t. To calculate the correlation of two input matrices, the matrices are divided into patches and the normalized cross-correlation of each pair is computed. The operator is formulated as follows:

C = A ⊛ B
C(x, y, z) = A_z ◦ B(x, y) = (1 / (σ_{A_z} σ_B)) Σ_{j=−N}^{N} Σ_{i=−N}^{N} (A_z(i, j) − Ā_z)(B(x + i, y + j) − B̄)

where

A_z = { A(i, j) | (i, j) ∈ {0, …, w} × {0, …, h},  z = ⌊i/p⌋ + ⌊j/p⌋ · (w/p) },  p = 2N + 1        (2)

where ◦ represents the cross-correlation operation. The first operand, A, of size w × h, is partitioned into smaller square patches A_z of size (2N + 1) × (2N + 1). For example, let w = 12, h = 18, and N = 1; then p = 3 and A_1 contains all A(i, j) with i ∈ {3, 4, 5} and j ∈ {0, 1, 2}. For higher robustness, zero-normalized cross-correlation is applied, i.e., the inputs are normalized before the computation: σ_{A_z} and Ā_z are the standard deviation and mean of A_z, respectively, and likewise σ_B and B̄ are the standard deviation and mean of B. This operation has no weights and therefore does not need training.

We utilize the two basic operators of convolution and correlation to capture the spatial and motion features while looking for time dependencies. Both operators are shift-invariant and linear and hence easy to use. The application of convolution in LSTM has been studied before and has proven useful in capturing the spatial relations of data. Correlation, in turn, is used to locate the patches of the current input in the previous one; that is to say, a patch slides over the other input looking for a location where their overlap is high. Note that as the correlation of two inputs increases, the Euclidean distance between them decreases. Fig. 3 shows the output of zero-normalized cross-correlation on a sample image; the output shows a high intensity in the center since the two images are the same. Applying this operation to a single patch of one frame and its subsequent frame leaves a trace of the motion between the two.

The output of the defined correlation operator is a 3D matrix which is treated as another source of information, CR_t, and used alongside the input at the current time, X_t, and the output of the previous time step, H_{t−1}, to compute all gates. As in the standard LSTM, I_t is the input gate, F_t is the forget gate, O_t is the output gate, and C_t is the memory cell. The output is represented by H_t.
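To make the operator concrete, the following NumPy sketch computes the patch-wise zero-normalized cross-correlation of Eq. (2) with plain loops. It assumes single-channel feature maps whose width and height are divisible by the patch size, normalizes B by its global mean and standard deviation as stated above, and zero-pads the borders; these choices, and all names, are assumptions of the sketch rather than details taken from the paper.

```python
import numpy as np

def c2lstm_correlation(A, B, N=1):
    """Patch-wise zero-normalized cross-correlation, a sketch of Eq. (2).

    A, B : 2-D feature maps of shape (w, h) from two consecutive time steps.
    A is cut into non-overlapping p x p patches (p = 2N + 1, indexed by z);
    every patch is correlated with the p x p neighbourhood of B around each
    position (x, y).  The result is a 3-D map C[x, y, z] that serves as CR_t.
    """
    p = 2 * N + 1
    w, h = A.shape
    assert w % p == 0 and h % p == 0, "sketch assumes w and h divisible by p"
    # Zero-normalize B globally, as stated after Eq. (2); pad with zeros so
    # every (x, y) has a full p x p window (border handling is an assumption).
    Bn = (B - B.mean()) / (B.std() + 1e-8)
    Bp = np.pad(Bn, N, mode="constant")
    n_patches = (w // p) * (h // p)
    C = np.zeros((w, h, n_patches))
    for z in range(n_patches):
        zi, zj = z % (w // p), z // (w // p)          # z = floor(i/p) + floor(j/p) * (w/p)
        Az = A[zi * p:(zi + 1) * p, zj * p:(zj + 1) * p]
        Az_n = (Az - Az.mean()) / (Az.std() + 1e-8)   # zero-normalize the patch
        for x in range(w):
            for y in range(h):
                C[x, y, z] = np.sum(Az_n * Bp[x:x + p, y:y + p])
    return C
```

Inside the unit, the resulting CR_t enters every gate of Eq. (1) through its own convolutional weights, exactly like X_t and H_{t−1}; the operator itself has no trainable parameters, so only the convolutions around it are learned. How multi-channel feature maps are handled (e.g., per channel) is not specified above and is left open in this sketch.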

4. Experimental results

In this section, the benchmark datasets and evaluation metrics are first described, and then the implementation details of our network are presented.


Fig. 3. Result of applying zero normalized cross correlation on a sample image where the correlation of B to A is computed.

Next, the results and discussion of the proposed method, together with a comparison with the state of the art, are reported.

4.1. Datasets and metrics

We conduct experiments on two widely used benchmarks for human action recognition: UCF101 [22] and HMDB51 [23]. The first consists of 13,320 videos from 101 human action categories, with 2.4M frames in total at a resolution of 320 × 240 and 25 fps. The second has 6766 videos at 30 fps in 51 action classes; it is considered a challenging dataset since it is both small and complex. The videos of both datasets are gathered from different sources, and both datasets provide three splits for training and testing. We report the mean accuracy over all three splits.

The proposed method is assessed with video accuracy as the standard evaluation protocol. First, the label of each video is obtained by averaging the per-clip softmax scores and taking the class with the maximum averaged score. Then, the video accuracy is measured as the fraction of correctly labeled videos.

4.2. Implementation details

We choose the first three blocks of VGG16 [24] as the convolutional tower, since we only need a basic spatial feature extractor whose output retains enough spatial and motion information to pass to the C2 LSTM layer. The towers are identical with shared weights, pre-trained on ImageNet [25] and fine-tuned on the corresponding dataset. In the C2 LSTM layer, the convolutional filters are of size 3 × 3, and the image patches used to compute the cross-correlation are also of size 3 × 3. Categorical cross-entropy is employed as the training loss, and Adam is chosen as the optimizer owing to its fast convergence.

We extract video clips of 30 frames from the videos. To reduce overfitting, common spatial data augmentations, namely random cropping and horizontal flipping, are performed on each frame. Moreover, augmentation is applied in the temporal dimension to cover multiple clips of the same video with different motion speeds and different cycles of the action: the video clips are sampled with random starting points and strides. Formally speaking, given a video V ∈ R^{t×w×h×ch}, where t is the number of frames, w is the width, h is the height, and ch is the number of channels, a series of clips V_clip ∈ R^{t′×w′×h′×ch} is extracted, where t′ < t is the number of selected frames and w′ < w and h′ < h are the cropped width and height of the clip. Considering the frame sequence of the input video, V = (f_i)_{i=1}^{t}, the video clip V_clip = (f_{i_j})_{j=1}^{t′} is selected as follows:

V_clip = { f_{i_j} | i_1 ∈ {1, 2, …, t − t′ × s}, i_j = i_{j−1} + s }        (3)

where i_1 is the starting frame index of the clip and s is the stride between consecutive selected frames. Both parameters are selected randomly.
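A minimal sketch of this clip sampling, combined with the spatial augmentations mentioned above, is given below. The crop size, the uniform choice of the stride, and the 0-based indexing are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sample_clip(video, t_prime=30, crop=(112, 112)):
    """Random-start, random-stride clip sampling (Eq. (3)) plus spatial augmentation.

    video : float array of shape (t, w, h, ch).  The clip length t_prime matches the
    30-frame clips of Section 4.2; crop size and 0-based indexing are assumptions.
    """
    t, w, h, _ = video.shape
    assert t >= t_prime and w >= crop[0] and h >= crop[1]
    s = np.random.randint(1, t // t_prime + 1)             # random stride that still fits
    i1 = np.random.randint(0, t - t_prime * s + 1)          # random starting frame
    clip = video[i1 + s * np.arange(t_prime)]               # i_j = i_{j-1} + s
    # spatial augmentation: random crop and horizontal flip
    x0 = np.random.randint(0, w - crop[0] + 1)
    y0 = np.random.randint(0, h - crop[1] + 1)
    clip = clip[:, x0:x0 + crop[0], y0:y0 + crop[1], :]
    if np.random.rand() < 0.5:
        clip = clip[:, ::-1, :, :]                          # flip along the width axis
    return clip
```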

Table 1
The influence of temporal augmentation and pre-training on the proposed method.

Conditions                               UCF101   HMDB51
Baseline                                  89.3     45.7
Temporally augmented                      92.0     59.0
Temporally augmented and pre-trained      92.8     61.3

Table 2
The impact of the convolution and correlation operators on LSTM units for human action recognition.

Methods      UCF101   HMDB51
LSTM          80.5     49.8
ConvLSTM      85.2     54.0
C2 LSTM       92.8     61.3

We found data augmentation critical for avoiding overfitting in our network.

4.3. Results and discussion

Table 1 demonstrates the contribution of temporal data augmentation and pre-training of the network. As mentioned before, temporal augmentation enlarges the dataset and prevents overfitting; moreover, it provides samples with different temporal speeds and thus increases the generalization of the method. Since HMDB51 is relatively small, the augmentation has a higher impact on it. Pre-training is expected to speed up convergence, and our experiments confirm this. Note that the baseline in Table 1 uses only the common spatial data augmentations, i.e., random cropping and horizontal flipping. The rest of the experiments are reported on the proposed method with both temporal augmentation and pre-training.

The results of the proposed method on both benchmarks are shown in Table 2. To show the benefit of considering motion information in LSTM units, the results of the proposed network are compared to those of the same network with plain LSTM units and with ConvLSTM units. Recall that while LSTM units extract only the time dependencies of their inputs, ConvLSTM units inspect the spatial relations as well, and C2 LSTM units additionally consider motion through the correlation of the data. In all cases, temporal augmentation and pre-training are utilized to enhance the results. The C2 LSTM results show a notable improvement.

In Table 3, we compare the state of the art on HMDB51 and UCF101 with our results. The comparison is made along two basic axes: the type of model and the type of input data. The models are categorized into traditional models and deep models, and the input data can be RGB images, optical flow, or both; the table is therefore divided into four sections. The first section contains non-deep state-of-the-art models. The deep models with only RGB images as input are listed in the second section; the proposed network, which also takes only RGB images as input, belongs to this section.


Table 3
Comparison with the state of the art.

Models                                          UCF101   HMDB51
Non-deep models
  iDT + FV [26]                                  85.9     57.2
  iDT + HSV [27]                                 87.9     61.1
Deep models, RGB as input
  Spatial stream network [14]                    73.0     40.5
  Spatial net conv4 and conv5 [28]               82.8     –
  C3D [12]                                       85.2     –
  scLSTM [29]                                    84.0     55.1
  C2 LSTM                                        92.8     61.3
Deep models, optical flow as input
  Temporal network [14]                          83.7     –
  Temporal net conv3 and conv4 [28]              82.2     –
  Deeper temporal net [30]                       84.9     –
Deep models, RGB + optical flow as input
  Two stream [14]                                88.0     59.4
  VideoLSTM [18]                                 89.2     56.4
  L2 STM [20]                                    93.6     66.2

Fig. 4. Visualization of the motion-aware ConvLSTM on detection of motion. The video sequences are followed by the back-propagation of the output neuron to the video sequence. The proposed method detects the moving points accurately.

The third section lists the deep methods that use only optical flow as input, and the last section shows the deep networks utilizing both RGB images and optical flow as input.

As expected, our method performs meaningfully better than the non-deep models, since the proposed deep structure provides a powerful feature extraction method. Furthermore, our superior results compared to the deep models with RGB inputs verify the effect of the proposed motion detector inside the unit. Although motion information is substantial for understanding video, it is not sufficient on its own; therefore the proposed method, which extracts both spatial and temporal information, surpasses the deep models that use only optical flow as input. While our results are not the best among the methods with both RGB and optical flow inputs, they still outperform most of them. L2 STM, which achieves the highest accuracy, applies a local spatially varying superposition operation on the memory cell and uses a multi-modal learning procedure for the input and forget gates to leverage both spatial and temporal information. Although this approach has some similarities to ours, in that spatial and temporal information are leveraged simultaneously, it relies on optical flow precomputed on the video frames. Compared to traditional optical flow estimation, the proposed correlation-based motion estimator has low complexity and does not need any precomputation. In summary, the presented results support the effectiveness of C2 LSTM in modeling motion, and yet motivate the search for an even better motion detector.

As suggested in [20], Back-Propagation Through Time is used to visualize the effect of the kernelized cross-correlation in modeling the motion information. To do so, the neuron of the action category corresponding to the input video's class is back-propagated to the input image and the salient region is visualized. Fig. 4 shows two sample video sequences and the back-propagation of our method to the input. The proposed method successfully detects the moving points in the video.
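For reference, this visualization can be reproduced with a simple gradient-based saliency computation: the score of the target class is back-propagated (through time) to the input clip and the gradient magnitude is kept. The model interface and the reduction over color channels below are assumptions, since the text only states that the class neuron is back-propagated to the input; this is a sketch, not the authors' exact procedure.

```python
import torch

def saliency_map(model, clip, target_class):
    """Gradient-based saliency sketch for a Fig. 4-style visualization.

    `model` is assumed to map a clip of shape (1, T, C, H, W) to a vector of
    class scores; reducing over color channels is an assumption of this sketch.
    """
    clip = clip.clone().detach().requires_grad_(True)
    score = model(clip)[0, target_class]      # scalar score of the chosen action class
    score.backward()                          # back-propagate through time to the input
    return clip.grad.abs().max(dim=2).values  # (1, T, H, W) saliency per frame
```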

5. Conclusions

To automate feature extraction through learning, we proposed to change the LSTM unit so that it covers all three types of information needed to categorize action classes in a video. Convolution and correlation operators are used to form a new unit which is capable of extracting spatial and motion features and their temporal dependencies. We tested the proposed unit, called C2 LSTM, in a deep architecture with convolutional layers. The experiments show the success of the proposed unit in enhancing LSTM units with motion information. In future work, we intend to examine different unit architectures that use correlation information, and the network architecture could be altered to obtain better results. In addition, we plan to test our method on longer and more complex datasets to better measure its full capabilities.

Declarations of interest

None.

References

[1] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48.
[2] M. Budnik, E.-L. Gutierrez-Gomez, B. Safadi, G. Quénot, Learned features versus engineered features for semantic video indexing, in: Proceedings of the 13th International Workshop on Content-Based Multimedia Indexing (CBMI), IEEE, 2015, pp. 1–6.
[3] N. Srivastava, E. Mansimov, R. Salakhudinov, Unsupervised learning of video representations using LSTMs, in: Proceedings of the International Conference on Machine Learning, 2015, pp. 843–852.
[4] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[5] F.J. Ordóñez, D. Roggen, Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition, Sensors 16 (1) (2016) 115.
[6] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox, FlowNet: learning optical flow with convolutional networks, 2015, pp. 2758–2766.
[7] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox, FlowNet 2.0: evolution of optical flow estimation with deep networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2462–2470.
[8] Y. Zhu, Z. Lan, S. Newsam, A.G. Hauptmann, Hidden two-stream convolutional networks for action recognition, arXiv:1704.00389 (2017).
[9] S. Herath, M. Harandi, F. Porikli, Going deeper into action recognition: a survey, Image Vis. Comput. 60 (2017) 4–21.
[10] R. Poppe, A survey on vision-based human action recognition, Image Vis. Comput. 28 (6) (2010) 976–990.
[11] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 221–231.
[12] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 4489–4497.
[13] G. Varol, I. Laptev, C. Schmid, Long-term temporal convolutions for action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 40 (6) (2017) 1510–1517.


[14] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[15] C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, 2016.
[16] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W.-c. Woo, Convolutional LSTM network: a machine learning approach for precipitation nowcasting, in: Proceedings of the Advances in Neural Information Processing Systems, 2015, pp. 802–810.
[17] D. Eigen, J. Rolfe, R. Fergus, Y. LeCun, Understanding deep architectures using a recursive convolutional network, arXiv:1312.1847 (2013).
[18] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, C.G. Snoek, VideoLSTM convolves, attends and flows for action recognition, Comput. Vis. Image Underst. 166 (2018) 41–50.
[19] M. Jung, H. Lee, J. Tani, Adaptive detrending to accelerate convolutional gated recurrent unit training for contextual video recognition, Neural Networks 105 (2018) 356–370.
[20] L. Sun, K. Jia, K. Chen, D.Y. Yeung, B.E. Shi, S. Savarese, Lattice long short-term memory for human action recognition, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2147–2156.
[21] J.Y.-H. Ng, J. Choi, J. Neumann, L.S. Davis, ActionFlowNet: learning motion representation for action recognition, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1616–1624.
[22] K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, arXiv:1212.0402 (2012).
[23] H. Kuehne, H. Jhuang, R. Stiefelhagen, T. Serre, HMDB51: a large video database for human motion recognition, in: Proceedings of the High Performance Computing in Science and Engineering '12, Springer, 2013, pp. 571–582.
[24] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv:1409.1556 (2014).
[25] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[26] H. Wang, C. Schmid, Action recognition with improved trajectories, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), IEEE, 2013, pp. 3551–3558.
[27] X. Peng, L. Wang, X. Wang, Y. Qiao, Bag of visual words and fusion methods for action recognition: comprehensive study and good practice, Comput. Vis. Image Underst. 150 (2016) 109–125.

[28] L. Wang, Y. Qiao, X. Tang, Action recognition with trajectory-pooled deep-convolutional descriptors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4305–4314.
[29] X. Wang, L. Gao, J. Song, H. Shen, Beyond frame-level CNN: saliency-aware 3-D CNN with LSTM for video action recognition, IEEE Signal Process. Lett. 24 (4) (2017) 510–514.
[30] Y. Han, P. Zhang, T. Zhuo, W. Huang, Y. Zhang, Going deeper with two-stream convnets for action recognition in video surveillance, Pattern Recognit. Lett. 107 (2018) 83–90.

Mahshid Majd received her B.Sc. and M.Sc. degrees in computer engineering from Shiraz University in 2007 and 2010, respectively. She is currently a Ph.D. student in Artificial Intelligence and Robotics at Amirkabir University of Technology. Her research interests include deep learning, computer vision, video understanding, and bio-inspired learning algorithms.

Reza Safabakhsh received the B.S. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 1976, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Tennessee, Knoxville, in 1980 and 1986, respectively. He worked at the Center of Excellence in Information Systems, Nashville, TN, USA, from 1986 to 1988. Since 1988, he has been with the Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran, where he is currently a Professor and the director of the Computer Vision Laboratory. His current research interests include neural networks, computer vision, and deep learning. Dr. Safabakhsh is a member of the IEEE and several honor societies, including Phi Kappa Phi and Eta Kappa Nu. He was the founder and a member of the Board of Executives of the Computer Society of Iran, and was the President of this society for its first four years.
