Two-stage transfer network for weakly supervised action localization

Qiubin Su
School of Computer Science & Engineering, South China University of Technology, Guangzhou, Guangdong, China

Neurocomputing 339 (2019) 202–209

Article history: Received 21 September 2018; Revised 27 December 2018; Accepted 11 February 2019; Available online 18 February 2019. Communicated by Dr. Yongmin Li.

Keywords: Weakly supervised learning; Action localization; Untrimmed videos

Abstract

Action localization is a central yet challenging task for video analysis. Most existing methods rely heavily on supervised learning, where the action label for each frame must be given beforehand. Unfortunately, for many real applications, it is often costly and resource-consuming to obtain frame-level action labels for untrimmed videos. In this paper, a novel two-stage paradigm, in which only video-level action labels are required, is proposed for weakly supervised action localization. To this end, an Image-to-Video (I2V) network is first developed to transfer the knowledge learned from the image domain (e.g., ImageNet) to the specific video domain. Relying on the model learned by the I2V network, a Video-to-Proposal (V2P) network is further designed to identify action proposals without the need for temporal annotations. Lastly, a proposal selection layer is devised on top of the V2P network to choose the maximal proposal response for each class and thus obtain a video-level prediction score. By minimizing the difference between the prediction score and the video-level label, we fine-tune our V2P network to learn enhanced discriminative ability for classifying proposal inputs. Extensive experimental results show that our method outperforms the state-of-the-art approaches on ActivityNet 1.2, and that the mAP is improved from 13.7% to 16.2% on THUMOS14. More importantly, even with weak supervision, our networks attain accuracy comparable to those employing strong supervision, thus demonstrating the effectiveness of our method.

1. Introduction

Action localization is a fundamental task for video understanding. Given a video, action localization should simultaneously answer "what kind of action is in this video?" and "when does it start and end?" This problem is important because long and untrimmed videos are dominant in real-world applications such as surveillance. In recent years, temporal action localization in videos has been an active area of research, and great progress has been facilitated by abundant methods for learning video representations [1–7] and by new datasets [8–10]. Benefiting from the development of parallel computing power, deep learning based methods have recently achieved great improvements for video analysis. Many of these works exploit convolutional neural networks as feature extractors and train classifiers to categorize sliding windows or segment proposals [11–15]. These methods heavily rely on the temporal annotations of the untrimmed videos in a supervised setting, where proposals should be labeled with action categories together with the start time and end time. With these dense temporal annotations, the proposal-level loss can be calculated and back-propagation can be applied to train the networks.


The problems, however, are how to collect action annotations and how to guarantee their quality. In particular, manually annotating actions frame by frame is not only time-consuming but also subjective to the annotators, making the annotations severely biased. The annotation issue above also exists in the research area of still images. For example, in object detection, it is costly to manually collect object bounding boxes. As a result, weakly supervised object detection has been extensively studied [16,17]. However, for the video domain, it is more challenging to solve the action localization problem given the video labels only. The reason is that action localization needs not only to learn spatial features but also to extract temporal patterns. Hence, very few attempts have been made at weakly supervised action localization.

This paper tackles weakly supervised action localization as a proposal classification problem without frame-level annotations. To this end, we propose a two-stage paradigm to localize actions in untrimmed videos. Particularly, in the first stage, we propose an Image-to-Video (I2V) net for untrimmed video classification. Such a network can capture coarse action patterns from the untrimmed videos but neglects the precise discriminative information between action categories.


Hence, we further propose a Video-to-Proposal (V2P) network in the second stage. It is natural to feed proposals into the network and obtain prediction outputs. However, the loss cannot be calculated directly since the proposal labels are unavailable; therefore, a specific proposal selection layer is designed in the V2P network. Through this layer, the proposals contributing most to each class are selected and gradients are propagated via these proposals. In other words, the video-level loss is transferred to the proposal-level one, and the network parameters can be updated accordingly. The main contributions of our paper are summarized as follows:


(1) We devise a principled technique to tackle the problem of weakly supervised action localization. We successfully train a two-stage network to localize actions in untrimmed videos without the need of temporal annotations of action instances.
(2) We provide an efficient proposal selection layer to bridge the proposal inputs and video labels. With the help of this layer, the video-level loss is transferred to the proposal-level one, and thus the network parameters can be fine-tuned efficiently.
(3) Our proposed network outperforms the state-of-the-art methods on ActivityNet 1.2 [9] and shows results that are comparable to the state of the art on THUMOS14 [8]. We significantly improve the current best results from 16.0% to 18.5% on ActivityNet 1.2.


2. Related work

Action localization. Action localization in videos requires not only recognizing action categories, but also localizing the start time and end time of each action instance. Previous works focused on classifying proposals generated by sliding windows with hand-crafted features [18,19]. Over the past few years, deep learning based methods have been extensively studied [11,12,14,15,20–23]. Shou et al. [12] proposed a multi-stage approach involving three segment-based 3D ConvNets. Dai et al. [23] constructed features using the context around proposals and further introduced a ranking mechanism to select action proposals. Xu et al. [13] exploited the 3D ConvNet as the feature extractor and proposed a framework that shares a similar idea with Faster R-CNN. Zhao et al. [15] introduced the completeness of proposals and designed a pyramid structure to classify and refine the boundaries of proposals simultaneously. To further enhance the performance, there are also studies on generating action proposals of good quality [21,24]. Apart from the two-stage framework, Lin et al. [25] and Buch et al. [26] developed end-to-end architectures to perform proposal generation and proposal classification. Another class of popular approaches applies recurrent neural networks to learn temporal features for localizing actions [11]. These methods have one thing in common: they employ temporal annotations to train classifiers for proposals. Our method is distinct from these approaches, for we only need video labels rather than temporal annotations for training.

Weakly supervised action localization. Weakly supervised learning has been successfully applied to object detection. However, only a few approaches have been proposed to tackle the action localization problem with weak supervision. Duchenne et al. [27] proposed to localize action instances with the help of movie scripts. Bojanowski et al. [28] employed a weakly supervised scheme to label actions with a clustering method. Recently, Hide-and-seek [29] applied an attention mechanism to weakly supervised object detection and action localization. This method works well for spatial localization rather than in the temporal domain. UntrimmedNet [30] learns to recognize actions with an attention module and performs dense prediction in an untrimmed video to localize actions relying on the attention weights. Our method is different from UntrimmedNet since our localization results are attained by selecting the generated proposals according to the prediction scores. Recently, Shou et al. [31] proposed a method, namely AutoLoc, to tackle this problem. An Outer-Inner-Loss was devised to predict the boundary of action instances. However, this method heavily relies on the classification score of UntrimmedNet. Our method differs from AutoLoc in that it does not require the help of other methods and performs action localization individually.

3. Proposed method

We are given an untrimmed video set $\{V_i, y_i\}_{i=1}^{N}$, where N is the number of videos and $y_i \in \{0, 1\}^c$ is the label vector of the ith video, with c being the total number of action classes in the video set. Here, each video may belong to one class or multiple classes, depending on how many types of actions are included in the video. Each video may also contain multiple actions, but the position of each action, denoted by $(t_{start}, t_{end})$, is unknown, where $t_{start}$ and $t_{end}$ denote the start and end time points, respectively. Given a coming video, the task of action localization is to recognize the actions and identify their positions in the video simultaneously.

The task of action localization, however, is extremely difficult in practice, since manually annotating actions with precise locations is extremely time-consuming and expensive. More critically, the annotation of action positions can be very subjective, leading to a severe bias issue. In other words, we often do not have the exact position of each action, but the video-level labels only. In this paper, we instead propose a weakly supervised learning paradigm for action localization, in which only video-level labels are required. The overall scheme of the proposed method is shown in Fig. 1, with details given below.

First of all, we reformulate action localization as an action proposal recognition problem. Specifically, for a coming untrimmed video V, we apply the TAG (Temporal Actionness Grouping) method [15] to generate Q candidate action proposals.¹ Each action proposal contains multiple continuous frames. Without loss of generality, we describe each proposal by $(t_{start}^q, t_{end}^q)$. As a result, we obtain an action proposal set $\{(t_{start}^q, t_{end}^q)\}_{q=1}^{Q}$ for the video V. Since each proposal contains position information, the action localization task is reduced to learning a prediction function to classify the action proposal set $\{(t_{start}^q, t_{end}^q)\}_{q=1}^{Q}$. Note that we still do not have the label for each action proposal (i.e., no proposal-level labels), but the video-level labels only. The key issue of the proposed method is therefore how to effectively represent and classify action proposals relying on video-level labels.

Unfortunately, in practical video analysis, we face two additional difficulties. First, the training data are often insufficient to train a deep network. Taking the THUMOS14 dataset [8] as an example, there are only 200 untrimmed videos for training; a deep network trained directly on such limited data is prone to overfitting. Second, for video analysis, the long-term temporal dependency within long videos is very important but hard to capture. For example, in THUMOS14, the average duration of the untrimmed videos is more than 209 s. It is difficult to discover the temporal dependency in such long videos.

To address the key issue and the two challenges mentioned above, we propose a two-stage weakly supervised learning paradigm to learn effective and robust representations for action proposals. Each stage of the proposed method is associated with the learning of a neural network. In Stage 1, an Image-to-Video (I2V) network is trained to learn coarse action patterns by directly classifying untrimmed videos.

¹ The number Q can be different for different videos.
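To make the weakly supervised setting concrete, the following minimal Python sketch (not the authors' code) shows what a training example carries: a video-level multi-hot label and a set of unlabeled TAG-style proposals. The file name, class count, and proposal boundaries are hypothetical.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class UntrimmedVideo:
    path: str
    video_label: List[int]                                   # y_i in {0,1}^c, c = number of classes
    proposals: List[Tuple[float, float]] = field(default_factory=list)  # (t_start, t_end), unlabeled

# Hypothetical example with c = 3 classes; proposal boundaries would come from TAG [15].
video = UntrimmedVideo(
    path="video_test_0000001.mp4",
    video_label=[0, 1, 0],                                   # only "which actions occur" is known
    proposals=[(12.4, 18.9), (51.0, 63.2), (80.5, 84.1)],
)

# Weak supervision: the proposals must be classified even though
# no (t_start, t_end) pair is annotated with an action class.
assert all(len(p) == 2 for p in video.proposals)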


Fig. 1. Overview of the two-stage paradigm. Stage 1: The Image-to-Video (I2V) net is developed for untrimmed video classification to capture coarse action patterns. Stage 2: The Video-to-Proposal (V2P) net is built upon I2V nets and designed for action proposal classification.

This network, as will be shown, is able to learn general action patterns among different videos but will neglect the discriminative information within videos. Hence, in Stage 2, a Video-to-Proposal (V2P) network is learned to classify the generated action proposals. The schematic depiction of the proposed paradigm is shown in Fig. 1. In the following sections, we give more details about the I2V net and the V2P net, including the main motivations.

3.1. I2V net for learning video-level features

The Image-to-Video (I2V) network, as shown in Fig. 1, is a convolutional neural network (CNN) for extracting video-level features based on image-level features. Specifically, due to the lack of training data, we first pre-train a CNN model on ImageNet. Since the pre-trained model is originally for 1000-way ImageNet classification, we replace the 1000-way fully-connected layer with a c-way fully-connected layer, where c, as defined before, is the number of classes in the videos. The c-way fully-connected layer is initialized randomly. Once the pre-trained CNN model is obtained, we fine-tune the I2V network with the c-way fully-connected layer. However, directly fine-tuning the network with untrimmed videos is infeasible due to the limited training data and the complex temporal changes in long videos. To address this issue, we extend the learning strategy of TSN (Temporal Segment Network) [5] to learn the I2V network. TSN is originally designed for trimmed videos, where the background instances are filtered out beforehand and each trimmed video is divided into three segments uniformly. Then, TSN represents the whole video by combining the features extracted from the three segments. Unfortunately, for untrimmed videos, sampling three segments is not sufficient to model the input video precisely, since most untrimmed videos consist of various action and background instances.
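As a rough illustration of the I2V initialization described above, the PyTorch-style sketch below swaps the 1000-way ImageNet classifier of a pre-trained backbone for a randomly initialized c-way layer. The paper uses BN-Inception; since torchvision does not ship that model, a ResNet-50 stands in here purely as an assumed substitute.

# Sketch only: replace the 1000-way head of an ImageNet-pre-trained model
# with a c-way head and keep the remaining weights for fine-tuning.
import torch.nn as nn
import torchvision.models as models

def build_i2v_backbone(num_classes: int) -> nn.Module:
    model = models.resnet50(weights="IMAGENET1K_V1")   # ImageNet weights, 1000-way classifier
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # randomly initialized c-way layer
    return model

i2v = build_i2v_backbone(num_classes=20)               # e.g. the 20 THUMOS14 classes

The sampling limitation of TSN noted above for long untrimmed videos is what the dense segment sampling strategy described next addresses.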

To address the above issue, we instead propose a Dense Segments Sampling method. In practice, we observe that sampling more segments in a long untrimmed video helps to capture richer action information and thus leads to better prediction performance. Specifically, we divide the untrimmed video V into K segments, where K > 3. In this paper, K is set to 21 by default; an ablation study on the effect of K is presented in the experiments. For the kth segment, we randomly sample one snippet² within its duration. By feeding this snippet to the CNN, we obtain the probabilities of the kth segment belonging to the c classes, denoted by $\hat{z}_k \in [0, 1]^c$, which can be regarded as the segment-level score vector of the kth segment. Then, we use an average voting mechanism over these segment-level scores to obtain the video-level feature vector $z$ by

$z^j = \frac{1}{K}\sum_{k=1}^{K} \hat{z}_k^j$.    (1)
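The following sketch illustrates the dense segment sampling and the average-voting consensus of Eq. (1). It assumes a hypothetical callable snippet_scores that maps a snippet's frame indices to a score tensor in [0, 1]^c; how frames are decoded and fed to the CNN is omitted, and the video is assumed long enough for K snippets.

import random
import torch

def i2v_video_score(num_frames: int, snippet_scores, K: int = 21, snippet_len: int = 5) -> torch.Tensor:
    """Divide the video into K segments, sample one snippet per segment,
    and average the K segment-level score vectors (Eq. (1))."""
    seg_len = num_frames // K
    seg_scores = []
    for k in range(K):
        lo = k * seg_len                                  # segment start
        hi = max(lo, (k + 1) * seg_len - snippet_len)     # latest snippet start inside the segment
        start = random.randint(lo, hi)                    # random snippet within the k-th segment
        seg_scores.append(snippet_scores(range(start, start + snippet_len)))
    return torch.stack(seg_scores).mean(dim=0)            # z in [0,1]^c

# Dummy usage with a scorer that returns uniform scores over c = 20 classes.
z = i2v_video_score(6300, lambda frames: torch.full((20,), 1.0 / 20))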

Apparently, we have $z \in [0, 1]^c$, which represents the probabilities of the video belonging to the c classes. In this sense, given the ground-truth label y of an untrimmed video V, the cross-entropy loss can be computed by

$L_{I2V} = -\sum_{i=1}^{N}\sum_{j=1}^{c} y_i^j \log(z_i^j)$.    (2)

Based on the loss in Eq. (2), we fine-tune the I2V network using the mini-batch SGD method. Relying on the above training on segments with video labels, the I2V network is able to capture the global information of the video. Thus, it can be directly applied to extract features for representing action proposals.
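A minimal sketch of the video-level objective in Eq. (2), written with PyTorch tensors; the epsilon term is an added numerical safeguard not mentioned in the paper, and the toy values are made up.

import torch

def i2v_loss(video_scores: torch.Tensor, labels: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Cross-entropy of Eq. (2): video_scores z in [0,1]^(N x c), labels y in {0,1}^(N x c)."""
    return -(labels * torch.log(video_scores + eps)).sum()

# Toy check with N = 2 videos and c = 3 classes.
z = torch.tensor([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]], requires_grad=True)
y = torch.tensor([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
loss = i2v_loss(z, y)
loss.backward()   # gradients flow back to the segment scores / network parameters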

² In this paper, a snippet denotes a stack of continuous frames.

Fig. 2. Details of the proposal selection layer. We compare the proposal scores over each class and select the maximum one as the representative score for each class. For example, the pink and green bars represent the selected scores for the "Baseball" and "High_Jump" classes, respectively. The green dotted arrows indicate the forward process, while the red ones represent the backward pass.

However, the representation ability is still quite limited, since the local information within proposals is not yet considered.

3.2. V2P net for proposal-level classification

The V2P network is built upon the I2V network; it takes the action proposal set as input and outputs a score vector for each video, as shown in Fig. 1. Since the proposal-level labels are absent, we develop a proposal selection layer on top of the I2V network to connect the action proposals and the video labels. Without loss of generality, we assume that each video contains at least one positive action proposal which contributes significantly to the target video class. Relying on this assumption, we propose to choose the maximum response over all Q action proposals as the prediction for the video. For each class in the video, we compute the responses of all action proposals and compare them; the proposal with the highest response to the class is selected, as shown in Fig. 2. Accordingly, for each video, c class scores are chosen for the final stage. We then combine these c scores to obtain a video-level score vector (see Fig. 2). Here, we select proposals according to the absolute scores rather than the soft-max scores. Compared to the soft-max scores, the absolute ones yield sharper responses, which helps to alleviate noise. Let $r_q$ denote the output of the qth proposal and $r_q^j$ the response of the jth class in $r_q$. The proposal selection can be done by

$v^j = \max\left(r_1^j, r_2^j, \ldots, r_Q^j\right)$,    (3)

where $v$ denotes the video-level absolute score vector and $v^j$ specifies the jth element of $v$. We further perform soft-max normalization on $v$ by

$\hat{p} = \mathrm{softmax}(v)$.    (4)

The resulting $\hat{p}$ represents the video-level prediction. Via the selection in Eq. (3), the proposal with the highest score among all proposals is retained for each class. The loss will only be propagated through these c samples that contribute most to the output (as illustrated in Fig. 2). Consequently, our specially designed selection layer bridges the proposal inputs and the video labels.
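The selection in Eqs. (3) and (4) can be sketched as follows in PyTorch; because a class-wise max is taken, autograd routes gradients only through the selected proposals, matching the backward pass illustrated in Fig. 2. The target vector in the toy example is hypothetical.

import torch
import torch.nn.functional as F

def proposal_selection(proposal_scores: torch.Tensor) -> torch.Tensor:
    """proposal_scores: (Q, c) absolute scores, one row per proposal.
    Returns the video-level prediction p_hat of shape (c,)."""
    v, _ = proposal_scores.max(dim=0)   # Eq. (3): class-wise max over the Q proposals
    return F.softmax(v, dim=0)          # Eq. (4): video-level prediction

# Toy example: Q = 4 proposals, c = 2 classes (e.g. "Baseball", "High_Jump").
scores = torch.randn(4, 2, requires_grad=True)
p_hat = proposal_selection(scores)
target = torch.tensor([1.0, 0.0])       # hypothetical normalized video label
loss = F.mse_loss(p_hat, target)
loss.backward()
print(scores.grad)                      # non-zero only in the rows selected by the class-wise max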

Since untrimmed videos are often composed of more than one category of action, our V2P network formulates the proposal classification problem as a multi-label video classification problem. Following [32], we use a mean squared loss to measure the error between the prediction and the ground-truth label. Note that $y_i \in \{0, 1\}^c$, where c is the number of action classes. To ensure that the probabilities sum to 1, we normalize $y_i$ by $p_i = y_i / \|y_i\|_1$. Then, the loss function of the network is defined as

$L_{V2P} = \frac{1}{N}\sum_{i=1}^{N} \|p_i - \hat{p}_i\|_2^2$,    (5)

for a mini-batch of N videos.
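A sketch of the V2P objective of Eq. (5), including the label normalization $p_i = y_i / \|y_i\|_1$. The clamp guarding against an all-zero label is an added safeguard, and the example values are made up.

import torch

def v2p_loss(p_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """p_hat: (N, c) video-level predictions; y: (N, c) multi-hot labels in {0,1}."""
    p = y / y.sum(dim=1, keepdim=True).clamp(min=1)   # p_i = y_i / ||y_i||_1
    return ((p - p_hat) ** 2).sum(dim=1).mean()       # squared L2 error, averaged over N videos

# Example: a video containing two of c = 3 classes, so p = [0.5, 0.5, 0.0].
y = torch.tensor([[1.0, 1.0, 0.0]])
p_hat = torch.tensor([[0.45, 0.40, 0.15]])
print(v2p_loss(p_hat, y))   # tensor(0.0350)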

With the loss defined in Eq. (5), we can fine-tune the parameters of the V2P network by back-propagation. Even in the weak supervision setting, the proposal selection layer implicitly guides the network to learn the most discriminative pattern for each action category. For convenience, we summarize the overall structure of the V2P net below.

Overall structure. For each coming untrimmed video, we first generate Q action proposals with the TAG method [15]. We then feed the Q action proposals to the V2P network and obtain Q prediction scores. Lastly, with the proposal selection layer, we obtain the video prediction score by comparing the responses over the Q proposals. We thus formulate the loss between prediction scores and video labels to enable back-propagation for network training.

4. Experiments

We evaluate the performance of the proposed method on two benchmark datasets and compare our method with several state-of-the-art methods.

4.1. Datasets and evaluation metrics

THUMOS14 [8] is composed of four parts: training, validation, testing and background sets. Following the common setting in [8], we use the validation set, which contains 200 videos of 20 classes, for training. Here, we adopt only the video-level labels to respect the weak supervision constraint. The performance is evaluated on the testing set, which contains 213 videos in total. The ActivityNet 1.2 [9] dataset covers 100 activity classes and consists of 4819 videos for training, 2383 videos for validation, and 2480 videos for testing. We train our network on the training set and evaluate it on the validation set.

Table 1
Comparison to state-of-the-art methods on the THUMOS14 testing set (AP@IoU, %).

Method                  0.1     0.2     0.3     0.4     0.5
Hide-and-seek           36.4    27.8    19.5    12.7    6.8
AutoLoc                 –       –       35.8    29.0    21.2
UntrimmedNet            44.4    37.7    28.2    21.1    13.7
Two-stage (RGB)         24.5    22.3    17.1    12.0    8.3
Two-stage (Flow)        35.6    32.2    25.2    18.5    13.0
Two-stage (RGB+Flow)    47.0    42.7    32.9    23.9    16.2

We use the mean Average Precision (mAP) as the comparison metric. Following the conventional evaluation set-ups, we report the mAP at different IoU thresholds. A predicted proposal is correct if its category matches the ground truth and its temporal IoU with this ground-truth instance is larger than the IoU threshold. On THUMOS14, the IoU thresholds are chosen from [0.1, 0.2, 0.3, 0.4, 0.5]; on ActivityNet 1.2, the IoU thresholds are chosen from [0.5, 0.75, 0.95].

4.2. Implementation details

Training: We choose TSN [5] with two-stream input modalities (i.e., optical flow and RGB images) as the feature extractor and exploit the Inception architecture [33] with Batch Normalization, namely BN-Inception, as the basic model. The inputs for the spatial and temporal streams are 1-frame RGB images and 5-frame stacks of TVL1 optical flows, respectively. For the spatial stream, the initial learning rate is 0.0001 and is decreased to 1/10 of its value every 50 epochs. For the temporal stream, we set the initial learning rate to 0.001 and decrease it to 1/10 of its value every 20 epochs. The dropout ratio is set to 0.8 for the spatial stream and 0.7 for the temporal stream. Common data augmentation techniques, including cropping augmentation and scale jittering, are used in our experiments. We use the proposals generated by the TAG algorithm [15]. Our networks are optimized by mini-batch stochastic gradient descent. For the I2V net, we set the number of segments to 21; the parameters are initialized by the pre-trained models from ImageNet, and the mini-batch size is set to 80. For the V2P net, we sample 3 segments for each proposal. As presented before, the parameters of the V2P net are initialized by the I2V net. We choose only 1 video for training at each iteration.

Prediction: At the testing stage, we compute the scores of all proposals for each video via our V2P network. We employ standard non-maximum suppression (NMS) to remove redundant proposals and attain the final results.
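The prediction step above relies on temporal IoU and greedy NMS over scored proposals; the plain-Python sketch below illustrates both. The NMS threshold is illustrative, since the paper does not state the value used.

from typing import List, Tuple

def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection over union of two 1-D temporal segments (t_start, t_end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals: List[Tuple[float, float, float]], iou_thr: float = 0.4):
    """proposals: list of (t_start, t_end, score) for one class; returns the kept ones."""
    kept = []
    for p in sorted(proposals, key=lambda x: x[2], reverse=True):   # highest score first
        if all(temporal_iou(p[:2], q[:2]) < iou_thr for q in kept):
            kept.append(p)
    return kept

kept = temporal_nms([(10.0, 20.0, 0.9), (11.0, 21.0, 0.8), (40.0, 50.0, 0.7)])
print(kept)   # the second proposal is suppressed (IoU with the first is about 0.82)

At evaluation time, a kept proposal counts as correct if its class matches the ground truth and its temporal IoU with that ground-truth instance exceeds the chosen IoU threshold, as described above.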

Table 2
Comparison to state-of-the-art methods on the ActivityNet 1.2 validation set (AP@IoU and average mAP, %). Our method with ∗ is pre-trained on the Kinetics dataset.

Method                   0.5     0.75    0.95    Average
ActivityNet 1.2 [9]      9.7     –       –       –
AutoLoc                  27.3    15.1    3.3     16.0
UntrimmedNet             12.2    6.0     1.3     6.5
Two-stage (RGB)          21.6    13.4    3.2     12.8
Two-stage (Flow)         9.1     5.3     0.9     7.0
Two-stage (RGB+Flow)     24.9    15.8    3.6     14.9
Two-stage (RGB+Flow)∗    32.0    18.8    4.6     18.5

4.3. Comparison with state-of-the-art methods

4.3.1. THUMOS14 dataset

We compare our proposed method with the following state-of-the-art methods: (1) UntrimmedNet [30]; (2) Hide-and-seek [29]; (3) AutoLoc [31]. All these compared methods are weakly supervised. The results are listed in Table 1. Following the same settings as in [30], we employ a late fusion strategy to combine results from the RGB and optical flow streams with a ratio of 1:2. We can see that our method outperforms UntrimmedNet by at least 2.5% for all IoU values. When the IoU is set to 0.5, our method improves the result from 13.7% to 16.2%. Interestingly, even with weak supervision, our result (16.2%) is comparable to the result of [11], i.e., 17.1% at IoU = 0.5, and it is worth noting that [11] applies dense temporal annotations for training. The AutoLoc method of Shou et al. [31] shows good performance. However, it should be noticed that this method heavily relies on the Class Activation Sequence (CAS) produced by UntrimmedNet [30].

Unlike AutoLoc, our method is self-contained and able to perform action localization individually, without requiring any help from external models. Besides, the late fusion strategy for the two-stream models in our method is more efficient than the early fusion in AutoLoc, as empirically shown by Chao et al. [34].

4.3.2. ActivityNet dataset

We compare our two-stage network with UntrimmedNet [30] and its reported baseline [9]. We implement UntrimmedNet using the trained model provided by the authors and follow the same settings as in the paper to produce the localization results. As shown in Table 2, our method achieves the best performance. In terms of average mAP, our method remarkably outperforms UntrimmedNet by 8.4%. In contrast to UntrimmedNet, our method is able to score proposals with enhanced discriminative ability without the help of an attention module. AutoLoc [31] shows good performance in terms of average mAP. However, our method outperforms AutoLoc when the IoU threshold is large, meaning that our method is able to locate actions more precisely when the predictions are required to highly match the ground truths. We conduct additional experiments by pre-training our models on a large-scale video dataset, i.e., the Kinetics dataset [35]. The performance is further improved; the large-scale video dataset provides better initialization for the networks.

4.4. Ablation studies

Importance of sampling dense segments. In Section 3.1, we raised the number of segments when extracting features from the 200 untrimmed videos of THUMOS14. We verify this operation by varying the number of segments from 3 to 21.

Fig. 3. Performance with a varying number of segments. We compare the precision obtained by varying the number of sampling segments in the I2V net on the THUMOS14 testing set. The results are reported for both RGB and optical flow modalities.

Fig. 4. Soft-max score vs. absolute score. The performance is measured by mAP for IoU = 0.5.

We use precision as the evaluation metric. Results are shown in Fig. 3. We can see that the performance improves as the number of sampled segments grows. This observation is unsurprising because more sampled segments allow richer action features to be extracted. However, a denser sampling strategy brings more computation cost. Hence, we choose 21 as the number of segments in our I2V network, since the improvement is relatively small around 21.

The effectiveness of the I2V and V2P networks. We conduct additional experiments on the THUMOS14 and ActivityNet 1.2 datasets to examine the effectiveness of the I2V and V2P nets. For this purpose, we try three different protocols: 1) using V2P only, without I2V pre-training; 2) employing I2V only, without V2P fine-tuning; 3) applying both I2V and V2P, namely our proposed coarse-to-fine framework. The mAP results are reported in Table 3. It is observed that directly training the V2P net on the raw videos achieves severely worse performance.

Fig. 5. Visualization of action localization results on (a), (b) THUMOS14 and (c), (d) ActivityNet 1.2. The blue and green bars stand for the ground-truth action instances and the model prediction results, respectively. The numbers with # represent frame IDs.


Table 3
Comparison between different variants of our method on the THUMOS14 and ActivityNet datasets, where mAP is measured for IoU = 0.5.

Dataset       Modality    Component    mAP     Gain
THUMOS14      RGB         V2P          0.1     –
                          I2V          7.6     7.5
                          I2V+V2P      8.2     8.1
              Flow        V2P          0.4     –
                          I2V          8.6     8.2
                          I2V+V2P      13.0    12.6
              RGB+Flow    V2P          0.5     –
                          I2V          12.6    12.1
                          I2V+V2P      16.2    15.7
ActivityNet   RGB         V2P          3.2     –
                          I2V          21.2    18.0
                          I2V+V2P      21.6    18.4
              Flow        V2P          1.5     –
                          I2V          7.6     6.1
                          I2V+V2P      9.1     7.6
              RGB+Flow    V2P          3.9     –
                          I2V          24.2    20.3
                          I2V+V2P      24.9    21.0

This could be due to the fact that, without the initialization from the I2V net, the V2P net fails to select the proposal response corresponding to the target class. In contrast, applying the I2V net only can still attain promising results, although it captures general rather than precise action patterns. As expected, our integrated framework outperforms each single model, thus verifying the necessity of both the I2V and V2P nets.

Soft-max score vs. absolute score. In the proposal selection layer, we employ the absolute score for proposal selection. Here, we perform additional experiments on the V2P network using the scores after the soft-max computation, and compare their results in Fig. 4 with those obtained using the absolute scores. The comparison demonstrates that the absolute method outperforms the soft-max one for both modalities (RGB and optical flow). We conjecture that applying the absolute score is able to alleviate the influence of noise and thus improves the final performance.

4.5. Qualitative results

We visualize action localization results on THUMOS14 and ActivityNet 1.2 in Fig. 5. As presented in Section 3.2, the V2P net can learn action patterns in a more precise way than the I2V net. Such a conclusion is verified by the results in Fig. 5. Clearly, the I2V network is able to learn general patterns but is somewhat misled by the background samples. In contrast, the V2P network is capable of distinguishing the action instances from the background and localizing them more accurately.

5. Conclusions

We address the weakly supervised action localization problem by developing a novel coarse-to-fine framework. Given only the video labels, our method is able to transfer knowledge learned from the video domain to the proposal domain by automatically mining positive samples for training. We achieved state-of-the-art performance on two benchmark datasets, THUMOS14 and ActivityNet 1.2. One future direction to enhance our network could be considering more advanced feature extraction methods and post-processing techniques.

References

[1] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Advances in Neural Information Processing Systems, 2014, pp. 568–576.

[2] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3d convolutional networks, in: Proceedings of the ICCV, 2015, pp. 4489–4497. [3] C. Cao, Y. Zhang, C. Zhang, H. Lu, Action recognition with joints-pooled 3d deep convolutional descriptors., in: Proceedings of the IJCAI, 2016, pp. 3324–3330. [4] P.T. Bilinski, F. Bremond, Video covariance matrix logarithm for human action recognition in videos., in: Proceedings of the IJCAI, 2015, pp. 2140–2147. [5] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L. Van Gool, Temporal segment networks: Towards good practices for deep action recognition, in: Proceedings of the ECCV, Springer, 2016, pp. 20–36. [6] J. Liu, A. Shahroudy, G. Wang, L.-Y. Duan, A.C. Kot, Ssnet: scale selection network for online 3d action prediction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [7] X. Shu, J. Tang, G.-J. Qi, W. Liu, J. Yang, in: Hierarchical long short-term concurrent memory for human interaction recognition, 2018. arXiv: 1811.00270. [8] Y. Jiang, J. Liu, A.R. Zamir, G. Toderici, I. Laptev, M. Shah, R. Sukthankar, in: Thumos challenge: action recognition with a large number of classes, 2014. [9] F. Caba Heilbron, V. Escorcia, B. Ghanem, J. Carlos Niebles, Activitynet: A large-scale video benchmark for human activity understanding, in: Proceedings of the CVPR, 2015, pp. 961–970. [10] C. Gu, C. Sun, S. Vijayanarasimhan, C. Pantofaru, D.A. Ross, G. Toderici, Y. Li, S. Ricco, R. Sukthankar, C. Schmid, et al., in: AVA: a video dataset of spatiotemporally localized atomic visual actions, 2017. arXiv: 1705.08421, 3(4) 6. [11] S. Yeung, O. Russakovsky, G. Mori, L. Fei-Fei, End-to-end learning of action detection from frame glimpses in videos, in: Proceedings of the CVPR, 2016, pp. 2678–2687. [12] Z. Shou, D. Wang, S.-F. Chang, Temporal action localization in untrimmed videos via multi-stage CNNS, in: Proceedings of the CVPR, 2016, pp. 1049–1058. [13] H. Xu, A. Das, K. Saenko, R-c3d: region convolutional 3d network for temporal activity detection, in: Proceedings of the ICCV, 2017. [14] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, S.-F. Chang, Cdc: Convolutional-de– convolutional networks for precise temporal action localization in untrimmed videos, in: Proceedings of the CVPR, 2017. [15] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, D. Lin, Temporal action detection with structured segment networks, in: Proceedings of the ICCV, 2017. [16] H. Bilen, A. Vedaldi, Weakly supervised deep detection networks, in: Proceedings of the CVPR, 2016. [17] B. Lai, X. Gong, Saliency guided end-to-end learning forweakly supervised object detection, in: Proceedings of the IJCAI, AAAI Press, 2017, pp. 2053–2059. [18] D. Oneata, J. Verbeek, C. Schmid, Action and event recognition with fisher vectors on a compact feature set, in: Proceedings of the ICCV, 2013, pp. 1817–1824. [19] J. Yuan, B. Ni, X. Yang, A.A. Kassim, Temporal action localization with pyramid of score distribution features, in: Proceedings of the CVPR, 2016, pp. 3093–3102. [20] T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, S. Savarese, Social scene understanding: End-to-end multi-person action localization and collective activity recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [21] J. Gao, Z. Yang, C. Sun, K. Chen, R. Nevatia, in: Turn tap: temporal unit regression network for temporal action proposals, 2017. arXiv: 1703.06189. [22] H. Zhu, R. Vial, S. Lu, X. Peng, H. 
Fu, Y. Tian, X. Cao, Yotube: Searching action proposal via recurrent and static regression networks, IEEE Trans. Image Process. 27 (6) (2018) 2609–2622. [23] X. Dai, B. Singh, G. Zhang, L.S. Davis, Y. Qiu Chen, Temporal context network for activity localization in videos, in: Proceedings of the ICCV, 2017. [24] F. Caba Heilbron, J. Carlos Niebles, B. Ghanem, Fast temporal activity proposals for efficient detection of human actions in untrimmed videos, in: Proceedings of the CVPR, 2016, pp. 1914–1923. [25] T. Lin, X. Zhao, Z. Shou, Single shot temporal action detection, in: Proceedings of the ACM on Multimedia Conference, MM, Mountain View, CA, USA, 2017, pp. 988–996. [26] S. Buch, V. Escorcia, B. Ghanem, L. Fei-Fei, J.C. Niebles, End-to-end, single-stream temporal action detection in untrimmed videos, in: Proceedings of the British Machine Vision Conference, 2017. [27] O. Duchenne, I. Laptev, J. Sivic, F. Bach, J. Ponce, Automatic annotation of human actions in video, in: Proceedings of the ICCV, IEEE, 2009, pp. 1491–1498. [28] P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce, C. Schmid, J. Sivic, Weakly supervised action labeling in videos under ordering constraints, in: Proceedings of the ECCV, Springer, 2014, pp. 628–643. [29] K. Kumar Singh, Y. Jae Lee, Hide-and-seek: forcing a network to be meticulous for weakly-supervised object and action localization, in: Proceedings of the CVPR, 2017, pp. 3524–3533. [30] L. Wang, Y. Xiong, D. Lin, L. Van Gool, Untrimmednets for weakly supervised action recognition and detection, in: Proceedings of the CVPR, 2017. [31] Z. Shou, H. Gao, L. Zhang, K. Miyazawa, S.-F. Chang, Autoloc: Weaklysupervised temporal action localization in untrimmed videos, in: Proceedings of the ECCV, 2018, pp. 162–179. [32] L. Jing, L. Yang, J. Yu, M.K. Ng, Semi-supervised low-rank mapping learning for multi-label classification, in: Proceedings of the CVPR, 2015, pp. 1483–1491.

[33] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: Proceedings of the ICML, 2015, pp. 448–456. [34] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D.A. Ross, J. Deng, R. Sukthankar, Rethinking the faster R-CNN architecture for temporal action localization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. [35] A. Zisserman, J. Carreira, K. Simonyan, W. Kay, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, et al., The kinetics human action video dataset, 2017.

Qiubin Su received the Master of Science in Mathematics and Applied Mathematics from Sun Yat-sen University, Guangzhou, China, in 2005. He has been working at South China University of Technology since 2005 and has been studying in the School of Computer Science & Engineering, South China University of Technology, since 2013.