SLTFNet: A spatial and language-temporal tensor fusion network for video moment retrieval


Information Processing and Management 56 (2019) 102104


Bin Jiang a,⁎, Xin Huang a, Chao Yang a, Junsong Yuan b

a College of Computer Science and Electronic Engineering, Hunan University, Lushan Road (S), Yuelu District, Changsha, China
b Computer Science and Engineering Department, State University of New York at Buffalo, NY 14260-2500, USA

ARTICLE INFO

ABSTRACT

Keywords: Cross-modal retrieval; Moment localization; Spatial attention network; Language-temporal attention network; Tensor fusion network

This paper focuses on temporal retrieval of activities in videos via sentence queries. Given a sentence query describing an activity, temporal moment retrieval aims at localizing the temporal segment within the video that best matches the textual query. This is a general yet challenging task as it requires comprehension of both video and language. Existing research predominantly employs coarse frame-level features as the visual representation, obfuscating the specific details (e.g., the desired objects “girl”, “cup” and action “pour”) within the video which may provide critical cues for localizing the desired moment. In this paper, we propose a novel Spatial and Language-Temporal Tensor Fusion (SLTF) approach to resolve those issues. Specifically, the SLTF method first takes advantage of object-level local features and attends to the most relevant local features (e.g., the local features “girl”, “cup”) by spatial attention. Then we encode the sequence of the local features on consecutive frames by employing an LSTM network, which can capture the motion information and interactions among these objects (e.g., the interaction “pour” involving these two objects). Meanwhile, language-temporal attention is utilized to emphasize the keywords based on moment context information. Thereafter, a tensor fusion network learns both the intra-modality and inter-modality dynamics, which can enhance the learning of the moment-query representation. Therefore, our two proposed attention sub-networks can adaptively recognize the most relevant objects and interactions in the video, and simultaneously highlight the keywords in the query for retrieving the desired moment. Experimental results on three public benchmark datasets (obtained from TACoS, Charades-STA, and DiDeMo) show that the SLTF model significantly outperforms current state-of-the-art approaches, and demonstrate the benefits of the new components incorporated into SLTF.

2010 MSC: 00-01 99-00

1. Introduction

Video has become a new way of communication between Internet users with the proliferation of sensor-rich mobile devices. This has encouraged the development of advanced techniques for a broad range of video understanding applications. Searching videos of interest from large collections has long been an open problem in the field of multimedia information retrieval (Vallet, Hopfgartner, Jose, & Castells, 2011). Language-based video retrieval only needs to judge whether the query occurs in a video and returns an entire video. However, in many real-world scenarios (e.g., autonomous driving, robotic navigation, and surveillance), the untrimmed videos usually contain complex scenes and involve a large number of objects, actions, and interactions, whereby only some parts of the complex scene convey the desired cues or match the description.

⁎ Corresponding author.
E-mail addresses: [email protected] (B. Jiang), [email protected] (X. Huang), [email protected] (C. Yang), [email protected] (J. Yuan).

https://doi.org/10.1016/j.ipm.2019.102104 Received 17 July 2019; Received in revised form 9 August 2019; Accepted 20 August 2019 Available online 28 August 2019 0306-4573/ © 2019 Elsevier Ltd. All rights reserved.


Therefore, there has been increased interest in finding specific moments within a video in response to a textual query rather than simply retrieving an entire video. This task is known as temporal moment retrieval (i.e., localization). Solving this task requires recognizing and localizing dynamic human activities or interactions among multiple objects in a long video. Fig. 1 shows an example in which the query is “a girl pours water into a cup”, which describes an activity “pour” involving two objects: girl and cup. In this paper, we focus on the task of temporal moment retrieval, which aims at identifying the specific start and end time points within a video in response to the given description query. In our work, a desired moment refers to a query-aware temporal segment whose content is in accordance with the given query. The key challenges are as follows:

(1) Recognition of relevant objects and interactions. The untrimmed videos usually contain complex scenes and involve a large number of objects, human activities, and interactions among objects. However, only a minority of these objects and interactions are mentioned in the query language. For example, the sentence “a girl pours water into a cup” is a typical query, and what should be identified in videos are the objects “girl”, “cup” and the action “pour”. Most existing methods (Gao, Sun, Yang, & Nevatia, 2017; Liu, Wang, Nie, He, et al., 2018) feed the whole video moments into a pre-trained C3D network to establish one feature vector. Despite their success, simply treating video moments holistically as one feature vector may obfuscate the meaningful objects on each frame and the interactions among them. Therefore, how to distinguish the video moment containing the relevant objects and interactions from other scenes is highly challenging.

(2) Comprehension of crucial query information. Some keywords in the query language convey the crucial semantic cues for retrieving the desired moment. Taking the query “A person puts dishes away in a cabinet” as an example, the words “dishes”, “cabinet” and the temporal action “put” should contribute most to the moment retrieval. A direct solution as employed in Gao, Sun, et al. (2017) and Hendricks et al. (2017) is to represent the entire query with a global feature vector through a general LSTM network (Liu, Nie, Wang, & Chen, 2017) or an offline language processor (e.g., Skip-thoughts). Nevertheless, these methods obfuscate the keywords that carry rich semantic cues; that is, the crucial details in the query are not fully explored and leveraged. Therefore, how to comprehend the complex query language and emphasize the keywords in queries is important for localizing the desired moment.

Based on the above considerations, we propose a novel Spatial and Language-Temporal Tensor Fusion (SLTF) model for temporal moment retrieval, which comprises two attention sub-networks: a spatial attention network and a language-temporal attention network. In the first branch, in order to recognize the relevant objects and interactions in videos, we extract local features on each frame by Faster R-CNN (Anderson et al., 2017), and introduce spatial attention to selectively attend to the most relevant local features mentioned in the query. Then we utilize an LSTM network to encode the sequence of the local features on consecutive frames, capturing the motion information and interactions among relevant objects.
Through the above process, we obtain the local interaction features (i.e., the interaction information among the relevant objects) for the video moment. Meanwhile, we extract global motion features on each video moment by a C3D network (Tran, Bourdev, Fergus, Torresani, & Paluri, 2015). Then we integrate the local interaction features with the global motion features as the visual representations. In the second branch, to comprehend the crucial query information, we utilize the language-temporal attention proposed by Liu, Wang, Nie, He, Tian, et al. (2018) to adaptively assign weights to different words in queries and obtain effective query representations. Thereafter, we introduce a tensor fusion network to jointly model the textual and visual features, built on the inter- and intra-modal embedding interactions. Finally, we feed the moment-query representation into a multi-layer perceptron (MLP) network to predict the relevance scores and the location of the desired moment. Extensive experiments on three public datasets have well justified that our model outperforms the state-of-the-art baselines significantly. The main contributions of this work are summarized as follows:

• We address the problem of temporal moment localization in videos by proposing a Spatial and Language-Temporal Tensor Fusion Network, which jointly characterizes the local interaction features, the global motion features, and the attentive textual features through a tensor fusion network. Experiments show that integrating them within a unified model can significantly improve the performance of the model.
• For the purpose of accurately localizing moments in a video with natural language, we are the first to introduce the spatial attention network in the temporal moment retrieval task. By adaptively assigning different weights to the relevant objects on each frame and then encoding the interaction information, the SLTF method is able to capture the crucial details, and thus it can leverage the missing details to improve the localization accuracy.
• We evaluate our proposed model on three benchmark datasets, TACoS, Charades-STA and DiDeMo, to demonstrate the performance improvement. Meanwhile, we have released the dataset and our implementation to facilitate the research community for further exploration.1

The rest of the paper is organized as follows. After introducing related works in Section 2, we elaborate our proposed methods in Section 3. We then perform experimental evaluation in Section 4. Finally, we conclude the whole paper and give an outlook of future work in Section 5.

1 https://github.com/SLTA-VideoRetrieval/SLTFNet.


Fig. 1. Temporal moment localization in an untrimmed video.

2. Related works

2.1. Sentence-based image/video retrieval

Given a set of video/image candidates and a language query, this task aims at retrieving the videos/images that match the query. Technically, the retrieval problem is generally treated as a ranking task (Feng, He, Liu, Nie, & Chua, 2018; Song et al., 2017), returning results based on their matching scores. For image retrieval, Karpathy and Fei-Fei (2015) proposed the Deep Visual-Semantic Alignment (DVSA) model to tackle the problem of visual-text alignment. DVSA employed bi-directional LSTMs to encode textual embeddings and an R-CNN object detector (Girshick, Donahue, Darrell, & Malik, 2014) to extract features from object proposals, which achieved top performance in the image retrieval task. Sun, Gan, and Nevatia (2015) designed a model to discover visual concepts from image-query pairs and applied the concept detectors for image retrieval. Hu et al. (2016) formulated the problem of natural language object retrieval. They proposed a novel Spatial Context Recurrent ConvNet (SCRC) model as a scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information into the network.

As for video retrieval, similar to image-language embedding models, current methods are designed to incorporate deep video-language embeddings. Lin, Fidler, Kong, and Urtasun (2014) parsed the sentence descriptions into a semantic graph, and then matched the visual concepts in the videos with the semantic graphs (Shin, Jin, Jung, & Lee, 2019). Alayrac et al. (2016) introduced a strategy to tackle the problem of video-text alignment by assigning a temporal interval to the given video and the set of sentences with the temporal ordering. Different from the aforementioned works, the input of our model is only one sentence query, and the temporal ordering is not used.

There are also some efforts dedicated to retrieving temporal segments within a video under constrained settings. Tellex and Roy (2009) considered retrieving video clips from a home surveillance camera by text queries with a fixed set of spatial prepositions (e.g., “across” and “through”). Later, Lin et al. (2014) developed a model to retrieve temporal segments in 21 videos from a dashboard car camera. Hendricks et al. (2017) presented a moment context network for matching candidate video clips and sentence queries, which incorporated the contextual information by integrating both local and global video features over time. However, these models can only verify the segments containing the corresponding moment; namely, there are many background noises in the returned results. In addition, they retrieve the corresponding video moments by densely sampling video moments at different scales, which is not only computationally expensive but also increases the search space. Refining the temporal boundaries of proposals by learning regression parameters has been successfully adopted in object localization (Ren, He, Girshick, & Sun, 2015). Inspired by this, we adopt a temporal regression localizer to identify the specific start and end time points of the desired video moment in our model.

2.2. Temporal action localization

Temporal action localization is the task of predicting the start and end times of activities in a given long untrimmed video (Escorcia, Heilbron, Niebles, & Ghanem, 2016; Lin, Zhao, & Shou, 2017; Ma, Sigal, & Sclaroff, 2016).
Gaidon, Harchaoui, and Schmid (2011) introduced the problem of temporally localizing actions in untrimmed videos, focusing on limited actions (e.g., “drinking and smoking” and “open the door and sit down”). Later, researchers worked on constructing large-scale datasets consisting of complex action categories, and proposed different models for localizing activities in videos. Sun, Shetty, Sukthankar, and Nevatia (2015) tackled the problem of fine-grained action localization from temporally untrimmed web videos by transferring image labels into their model. Shou, Wang, and Chang (2016) proposed an end-to-end segment-based 3D Convolutional Neural Network (CNN) framework for temporal action localization in untrimmed videos, which outperforms other Recurrent Neural Network (RNN)-based methods by capturing spatio-temporal information simultaneously.


Ma et al. (2016) introduced novel ranking losses within the RNN learning objective, which can capture the progression of activities better. Meanwhile, Singh, Marks, Jones, Tuzel, and Shao (2016) extended a two-stream bidirectional RNN network (Simonyan & Zisserman, 2014a) to predict activity labels or activity segments at each time step. Later, Gao, Yang, Sun, Chen, and Nevatia (2017) proposed a novel temporal coordinate regression network, which jointly predicts the action proposals and refines the temporal boundaries by temporal coordinate regression. However, these action localization methods are restricted to a pre-defined list of actions. Gao, Sun, et al. (2017) proposed to use natural language queries to localize activities. They designed a cross-modal temporal regression localizer to jointly model the text query and video moments. Hendricks et al. (2017) designed a moment context network to localize language queries in videos by integrating the local and global video features. Although these two models perform well in their tasks, they only consider the frame-level global features as the visual representation, which may overlook the local detail information in videos.

2.3. Object-level local features in videos

Compared with images, videos contain more types of information, including appearance features and motion features. However, when processing video features, existing works mainly adopt frame-level appearance features, which may result in missing details. To tackle this issue, Shetty and Laaksonen (2015) employed a pre-trained SVM classifier to deal with local features and integrated these features with global features in the video captioning task. However, these local features are processed by a simple averaging strategy, overlooking the spatial structure of each video frame; that is, it is incapable of expressing the relative significance of individual objects from the collapsed features. Yu, Wang, Huang, Yang, and Xu (2016) employed optical flow to roughly extract patch features on each video frame, and then these features are pooled together. Nevertheless, this method may result in misjudgment of objects. In addition, they only used patch features, which overlooks context information. Considering the limitations of these works, we utilize a pre-trained Faster R-CNN model (Anderson et al., 2017), which can recognize objects more precisely, to extract local features on each video frame. At the same time, we introduce spatial attention to selectively attend to the most relevant local features. Thus, our method can capture more significant details.

3. Methodology

The framework of SLTF is illustrated in Fig. 2. Generally speaking, our proposed SLTF model is comprised of four components: (1) the spatial attention network, which selectively attends to the most relevant objects on each frame to enhance the local features; (2) the language-temporal attention network, which adaptively assigns weights to the keywords in the given query based on temporal moment contexts; (3) the tensor fusion network, which fuses the query representations, local interaction features, and global motion features; and (4) the MLP, which estimates the relevance scores and predicts the location of the desired moment. Specifically, we first present the notations and formulate the temporal moment retrieval problem to be solved (Section 3.1). We then introduce the input features of our model (Section 3.2). After that, we elaborate the four key ingredients of our proposed framework (Sections 3.3–3.6).

3.1. Task definition and notation

Let $\mathcal{V}$ denote a video and $q$ denote a query. The language query is affiliated with a temporal annotation $(\tau_s, \tau_e)$, where $\tau_s$ and $\tau_e$ are the start time and end time of the desired moment. The video is segmented into a set of moment candidates $\mathcal{C} = \{c_1, c_2, \ldots, c_M\}$ via multi-scale temporal sliding windows, and each candidate moment $c_i$ is affiliated with a temporal bounding box $(t_s, t_e)$.

Fig. 2. An illustration of our proposed SLTF model.


Then we extract local features $v_{l_i}$ from the video frames, while extracting the global motion features $v_{c_i}$ from the video moments. Thus, a video is presented as a sequence of frames $\mathcal{V} = \{v_1, v_2, \ldots, v_N\}$, where each $v_i = \{v_{l_i}, v_{c_i}\}$. Moreover, since the positive moment candidates have overlaps with the ground truth on different sliding scales, we pair each positive moment-query pair $(c, q)$ with a time location offset (i.e., $(t_s - \tau_s,\ t_e - \tau_e)$). The detailed data construction is shown in Section 3.6. As such, the temporal moment retrieval task is formally defined as:

Input: A set of moment candidates $\mathcal{C}$ and a query $q$.
Output: A ranking model that maps each moment-query pair $(c, q)$ to a relevance score and estimates the time location offset of the desired moment.

3.2. Input features

Global motion features. For each untrimmed video $\mathcal{V}$, we first evenly pre-segment it into a set of moment candidates $\mathcal{C} = \{c_1, c_2, \ldots, c_M\}$. Then we apply the pre-trained C3D (Tran et al., 2015) network to encode these video moments,

$$x_{c_i} = C3D(c_i). \tag{1}$$

Here $x_{c_i}$ is the fc7-layer C3D feature of the video moment $c_i$. Thus, we obtain a set of global motion features $v_c = \{x_{c_1}, x_{c_2}, \ldots, x_{c_M}\}$, where $x_{c_i} \in \mathbb{R}^{4096}$ for each moment.

Local features. Each video moment consists of a sequence of video frames. We use the pre-trained Faster R-CNN (Anderson et al., 2017) for object proposal and local feature extraction. Specifically, we select the top-$K$ candidate objects from each frame, and the regional features are stacked together to form local features. Finally, we obtain a set of local features $v_{l_i} = \{v_{l_i}^{1}, v_{l_i}^{2}, \ldots, v_{l_i}^{K}\}$ on the $i$-th frame of the video moment, where $v_{l_i}^{j} \in \mathbb{R}^{2048}$. The number of detected objects $K$ is fixed to be 36.

Query features. Suppose a sequence of $T$ words $\{w_t\}_{t=1}^{T}$ represents the given query. We use GloVe (Pennington, Socher, & Manning, 2014) to embed each word and obtain a vector sequence $\{e_t\}_{t=1}^{T}$. In order to explore the details in sentences, we employ a bi-directional LSTM to represent sentences. Unlike the general LSTM, which encodes the sentence as a whole, the bi-directional LSTM takes the vector sequence $\{e_t\}_{t=1}^{T}$ as input and stacks the hidden states from both directions at each time step. It can be formulated with the following equations,

$$\begin{aligned}
e_t &= \mathrm{embedding}(w_t) \\
h_t^{(f)} &= \mathrm{LSTM}^{(f)}(e_t,\ h_{t-1}^{(f)}) \\
h_t^{(b)} &= \mathrm{LSTM}^{(b)}(e_t,\ h_{t+1}^{(b)}) \\
h_t &= [h_t^{(f)},\ h_t^{(b)}]
\end{aligned} \tag{2}$$

Through the query feature encoding process, we obtain the query representation $\{h_t\}_{t=1}^{T}$. The hidden state size of $h_t$ is set to 2000 in this paper.
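To make the query encoding concrete, the following is a minimal PyTorch-style sketch of Eq. (2); the paper's implementation uses TensorFlow, and apart from the 300-dimensional GloVe inputs and the 2000-dimensional stacked hidden states taken from the text, all names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Bi-directional LSTM query encoder of Eq. (2): 300-d GloVe inputs,
    forward and backward hidden states stacked at every time step."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=1000):
        super().__init__()
        # In the paper the embedding table would be initialized from pre-trained GloVe vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True yields h_t = [h_t^(f), h_t^(b)], i.e. 2 * hidden_dim per word.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, word_ids):                  # word_ids: (batch, T)
        e = self.embedding(word_ids)              # (batch, T, 300), the sequence {e_t}
        h, _ = self.bilstm(e)                     # (batch, T, 2000), the sequence {h_t}
        return e, h                               # e_t is kept because Eq. (8) attends over it

# Usage sketch: a batch of two queries, maximum length 10 (Section 4.1.1).
encoder = QueryEncoder(vocab_size=10000)
e, h = encoder(torch.randint(0, 10000, (2, 10)))  # h.shape == (2, 10, 2000)
```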

3.3. Spatial attention and local interaction feature extraction

In order to focus on the most relevant local features, we introduce a spatial attention mechanism to learn the weights of the top-$K$ local features $v_{l_i} = \{v_{l_i}^{1}, v_{l_i}^{2}, \ldots, v_{l_i}^{K}\}$. Here, we select the top 36 ($K = 36$) object regions, and each region is represented by a 2048-dimensional feature. The spatial attention network is shown in Fig. 3. We extract nouns from the word sequence $\{w_t\}_{t=1}^{T}$ by Stanford CoreNLP (Manning et al., 2014) and then project each noun into an embedding vector, which yields a sequence of $L$ vectors $\{n_l\}_{l=1}^{L}$. Then an averaging strategy is employed on these noun embeddings to obtain one single vector,

Fig. 3. Left (blue dotted wireframe): object detection using Faster R-CNN on a video frame. Right (red dotted wireframe): spatial attention corresponding to nouns in the query. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


$$\hat{n} = \frac{1}{L}\sum_{l=1}^{L} n_l. \tag{3}$$

Thereafter, we employ a single-layer neural network to take both $v_{l_i}$ and $\hat{n}$ as inputs to generate a fusion feature (Al-Smadi, Al-Ayyoub, Jararweh, & Qawasmeh, 2019), and then a softmax function is used to compute a normalized attention weight (Yang, Zhang, Jiang, & Li, 2019),

$$\begin{aligned}
a &= \tanh\big(W_{v_l} v_{l_i}^{j} + W_{n}\hat{n} + b_n\big) \\
\eta &= \mathrm{softmax}(a)
\end{aligned} \tag{4}$$

where $W_{v_l}$, $W_n$, and $b_n$ are the common space embedding matrices and bias vector, respectively. With the attention weight $\eta$, the attended local feature is computed as follows,

$$V_{L_i} = \sum_{j=1}^{K} \eta_j\, v_{l_i}^{j}, \tag{5}$$

where $V_{L_i} \in \mathbb{R}^{2048}$ is the attended local feature on the $i$-th frame of the video moment. Thereafter, in order to capture the motion information and interactions among relevant objects, we employ an LSTM network to encode the sequence of local features,

$$\begin{aligned}
o_t &= \mathrm{LSTM}(V_{L_t},\ o_{t-1}) \\
V_I &= o_T
\end{aligned} \tag{6}$$

We thus obtain the local interaction feature $V_I$ of the current moment, which equals the last output of the LSTM ($o_T$). The length of the video moments is set to 64 frames. As such, our spatial attention network can recognize the most relevant objects and interactions in each moment based on the given query.
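For illustration, a minimal PyTorch sketch of the spatial attention and the local-interaction LSTM (Eqs. (3)–(6)) is given below; the 2048-dimensional region features, $K = 36$ regions per frame, and 64-frame moments follow the text, while the attention dimension and the extra scalar scoring projection are assumptions the equations leave unspecified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionLSTM(nn.Module):
    """Attend to the K region features of every frame with the averaged noun
    embedding (Eqs. (3)-(5)), then encode the attended frame sequence with an
    LSTM to obtain the local interaction feature V_I (Eq. (6))."""
    def __init__(self, region_dim=2048, noun_dim=300, att_dim=512, out_dim=2048):
        super().__init__()
        self.W_vl = nn.Linear(region_dim, att_dim, bias=False)  # W_{v_l}
        self.W_n = nn.Linear(noun_dim, att_dim)                 # W_n and b_n
        self.w_a = nn.Linear(att_dim, 1, bias=False)            # assumed scalar scoring layer
        self.lstm = nn.LSTM(region_dim, out_dim, batch_first=True)

    def forward(self, regions, nouns):
        # regions: (batch, frames, K, 2048); nouns: (batch, L, 300)
        n_hat = nouns.mean(dim=1)                               # Eq. (3): averaged noun embedding
        a = torch.tanh(self.W_vl(regions) + self.W_n(n_hat)[:, None, None, :])
        eta = F.softmax(self.w_a(a).squeeze(-1), dim=-1)        # Eq. (4): weights over the K regions
        v_att = (eta.unsqueeze(-1) * regions).sum(dim=2)        # Eq. (5): attended feature per frame
        out, _ = self.lstm(v_att)                               # Eq. (6): encode the frame sequence
        return out[:, -1]                                       # V_I = o_T, the last LSTM output

# Usage sketch: one 64-frame moment, 36 regions per frame, 3 nouns in the query.
V_I = SpatialAttentionLSTM()(torch.randn(1, 64, 36, 2048), torch.randn(1, 3, 300))  # (1, 2048)
```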

3.4. Language-temporal attention

We introduce the language-temporal attention network proposed by Liu, Wang, Nie, He, Tian, et al. (2018) to capture the varying importance of each word by assigning an attentive weight to its embedding. The language-temporal attention network is shown in Fig. 4. Given the previously calculated query features $\{h_t\}_{t=1}^{T}$ and the current moment global motion feature $x_{c_i}$, suppose the temporal context moments are $c_j$ ($j \in \{i-n, \ldots, i-1, i+1, \ldots, i+n\}$), where $n$ denotes the neighbor size of moment contexts. We present the language-temporal attention network as follows,

$$\begin{aligned}
b &= \mathrm{ReLU}\Big(W_q h_t + \sum_{j=i-n}^{i+n} W_c\, x_{c_j} + b_q\Big) \\
\beta &= \mathrm{softmax}(b)
\end{aligned} \tag{7}$$

where $W_q$, $W_c$, and $b_q$ are the common space embedding matrices and bias vector, respectively. With the attention weight $\beta$, the attended query feature $q$ is computed as follows,

$$q = \sum_{t=1}^{T} \beta_t\, e_t. \tag{8}$$

As such, our language-temporal attention network can emphasize the keywords in the query based on the temporal moment contexts.
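A sketch of Eqs. (7)–(8) along the same lines follows (PyTorch, illustrative; the scalar scoring projection and the attention dimension are assumptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageTemporalAttention(nn.Module):
    """Weight each query word by its relevance to the current moment and its temporal
    context (Eq. (7)), then pool the word embeddings with those weights (Eq. (8))."""
    def __init__(self, hidden_dim=2000, c3d_dim=4096, att_dim=512):
        super().__init__()
        self.W_q = nn.Linear(hidden_dim, att_dim)               # W_q and b_q act on h_t
        self.W_c = nn.Linear(c3d_dim, att_dim, bias=False)      # W_c acts on the C3D features
        self.w_b = nn.Linear(att_dim, 1, bias=False)            # assumed scalar scoring layer

    def forward(self, h, e, x_context):
        # h: (batch, T, 2000) Bi-LSTM states; e: (batch, T, 300) word embeddings
        # x_context: (batch, 2n+1, 4096) C3D features of the moment and its neighbors
        ctx = self.W_c(x_context).sum(dim=1)                    # summed context term of Eq. (7)
        b = F.relu(self.W_q(h) + ctx.unsqueeze(1))              # (batch, T, att_dim)
        beta = F.softmax(self.w_b(b).squeeze(-1), dim=-1)       # attention over the T words
        q = (beta.unsqueeze(-1) * e).sum(dim=1)                 # Eq. (8): attended query feature
        return q, beta

# Usage sketch with T = 10 words and n = 1 context moment on each side.
q, beta = LanguageTemporalAttention()(torch.randn(2, 10, 2000),
                                      torch.randn(2, 10, 300),
                                      torch.randn(2, 3, 4096))
```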

Fig. 4. Illustration of the Language-Temporal Attention Network.


Fig. 5. Illustration of the Tensor Fusion Network.

3.5. Tensor fusion network

Thus far, we have obtained the attended query feature $q$, the local interaction feature $V_I$, and the global motion feature $x_{c_i}$ of the current input moment along with those of its surrounding moments $x_{c_j}$. Then we integrate the local interaction features with the global motion features as the visual representations,

$$m_c = x_{c_{i-n}} \oplus \cdots \oplus x_{c_{i-1}} \oplus x_{c_i} \oplus x_{c_{i+1}} \oplus \cdots \oplus x_{c_{i+n}}, \tag{9}$$

where ⊕ represents vector concatenation and $m_c$ denotes the fused moment representation. We hence can derive a cross-modal joint representation for the current query-moment pair by feature fusion. Existing multimodal research predominantly employs vector concatenation as the approach for multimodal feature fusion (Lee & Jung, 2019). However, this method is unable to efficiently model the inter-modality dynamics. In this paper, we introduce a tensor fusion network to capture both the intra-modal and the inter-modal embedding interactions. The former is implemented by the concatenation operation, which retains the information within each individual modality, while the latter is implemented by the tensor fusion, which explicitly models the interactions between the visual and textual embeddings. Finally, we concatenate these intra-modal and inter-modal embeddings to obtain a fused moment-query representation. The tensor fusion network is shown in Fig. 5 and consists of two parts: the mean pooling and the tensor fusion. We employ a mean pooling operation before conducting the tensor fusion, because high-dimensional vectors may lead to expensive time complexity when computing the tensor fusion. Specifically, we apply the mean pooling layer on $m_c$ and $q$ to obtain the dimension-reduced features $\hat{m}_c$ and $\hat{q}$ for the moment and the query, respectively. Then, we input these two embeddings into the tensor fusion model,

$$x_{l,c,q} = \begin{bmatrix} \hat{m}_c \\ 1 \end{bmatrix} \otimes \begin{bmatrix} \hat{q} \\ 1 \end{bmatrix} = \big[\,\hat{m}_c \otimes \hat{q},\ \hat{m}_c,\ \hat{q},\ 1\,\big], \tag{10}$$

where ⊗ indicates the outer product between vectors. As a result, $x_{l,c,q}$ is the cross-modal joint representation of the textual features, local interaction features, and global motion features, which is capable of encoding the information across the textual and visual modalities.
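A compact sketch of this fusion step (Eqs. (9)–(10)) is shown below: concatenation of the context C3D features, mean pooling to a low dimension, and the outer product of the two augmented embeddings. The pooled dimension is an assumption, and for simplicity the local interaction feature $V_I$ is folded into the concatenated moment representation.

```python
import torch
import torch.nn.functional as F

def tensor_fusion(x_contexts, v_i, q, pool_dim=128):
    """x_contexts: (batch, 2n+1, 4096) C3D features of the current moment and its neighbors;
    v_i: (batch, 2048) local interaction feature; q: (batch, 300) attended query feature."""
    # Eq. (9)-style concatenation of the visual features (V_I folded in for simplicity).
    m_c = torch.cat([x_contexts.flatten(1), v_i], dim=1)
    # Mean pooling to low-dimensional embeddings before the outer product (Section 3.5).
    m_hat = F.adaptive_avg_pool1d(m_c.unsqueeze(1), pool_dim).squeeze(1)
    q_hat = F.adaptive_avg_pool1d(q.unsqueeze(1), pool_dim).squeeze(1)
    ones = torch.ones(m_c.size(0), 1)
    z_m = torch.cat([m_hat, ones], dim=1)                 # [m_hat ; 1]
    z_q = torch.cat([q_hat, ones], dim=1)                 # [q_hat ; 1]
    x_lcq = torch.einsum('bi,bj->bij', z_m, z_q)          # Eq. (10): outer product
    return x_lcq.flatten(1)                               # flattened cross-modal joint representation

# Usage sketch with n = 1 context moment on each side.
x = tensor_fusion(torch.randn(2, 3, 4096), torch.randn(2, 2048), torch.randn(2, 300))
print(x.shape)                                            # torch.Size([2, 16641]) = (128 + 1) ** 2
```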

3.6. Learning

Above the feature fusion layer is a multi-layer perceptron (MLP) (Jiang, Huang, Yang, & Yuan, 2019; Li, Liu, Li, & Zomaya, 2015; Liu, Nie, Wang, Tian, & Chen, 2018; Wang, He, Nie, & Chua, 2017), which enables the model to capture complicated interactions within the cross-modal joint representation. We feed $x_{l,c,q}$ into the MLP network. Then we can leverage the prediction layer to obtain the relevance score $s_{c,q}$ of the moment-query pair $(c, q)$, as well as the time location offset $[t_s - \tau_s,\ t_e - \tau_e]$ between the moment candidate and the desired moment,

$$\begin{aligned}
g_1 &= \mathrm{ReLU}(W_1 x_{l,c,q} + b_1) \\
g_2 &= \mathrm{ReLU}(W_2 g_1 + b_2) \\
&\ \ \vdots \\
g_h &= \mathrm{ReLU}(W_h g_{h-1} + b_h)
\end{aligned} \tag{11}$$

where $W_h$, $b_h$ and $g_h$ denote the weight matrix, bias vector, and output vector of the $h$-th hidden layer, respectively. The ReLU function is used as the non-linear activation function, which has empirically been proved to work well. Finally, the output vector $g_h = [s_{c,q}, \delta_s, \delta_e] \in \mathbb{R}^{3}$ comprises the relevance score $s_{c,q}$ and the time location offsets $\delta_s = t_s - \tau_s$ and $\delta_e = t_e - \tau_e$. Inspired by Gao, Sun, et al. (2017), Luo, Huang, and Cao (2018), and Yang, Kurahashi, Ono, and Terano (2012), the loss function of our model is designed with two parts: an alignment loss for visual-semantic alignment and a localization regression loss for the


localization offsets.

3.6.1. Alignment loss
Similar in spirit to Gao, Sun, et al. (2017), we cast the visual-semantic alignment as a binary classification task. Given a set of video moment candidates $\mathcal{C}$ and a query $q$, the moment-query pairs are divided into two groups: the aligned pairs $\mathcal{A}$ and the misaligned pairs $\mathcal{M}$. We adopt the alignment loss to encourage the aligned moment-query pairs to have positive scores and the misaligned ones to have negative scores. Formally, it is formulated as,

$$L_{align} = \lambda_1 \sum_{(c,q)\in\mathcal{A}} \log\big(1 + \exp(-s_{c,q})\big) + \lambda_2 \sum_{(c,q)\in\mathcal{M}} \log\big(1 + \exp(s_{c,q})\big), \tag{12}$$

where $\lambda_1$ and $\lambda_2$ are two hyper-parameters controlling the balance of weights between the positive and negative moment-query pairs.

3.6.2. Localization regression loss
As the positive moment candidates with bounding box $[t_s, t_e]$ may not exactly match the desired moment $[\tau_s, \tau_e]$, there is a location offset between the positive candidates and the ground truth. Hence, we adopt the moment boundary adjustment strategy proposed by Gao, Yang, et al. (2017). The ground-truth localization offsets between the positive moment candidates and the ground truth are denoted as $[\delta_s^*, \delta_e^*]$. The location offset regression is then formulated as follows,

$$L_{loc} = \sum_{(c,q)\in\mathcal{A}} \big(|\delta_s^* - \delta_s| + |\delta_e^* - \delta_e|\big), \tag{13}$$

where $\mathcal{A}$ is the set of positive aligned pairs. Here, we adopt the L1 norm. Hence, based on the location offset regression, our model can adaptively adjust the alignment points of the current moments to match the exact temporal duration. We devise the optimization framework consisting of the alignment loss and the localization regression loss as,

$$L = L_{align} + \lambda L_{loc}, \tag{14}$$

where $\lambda$ is a hyper-parameter which balances the two loss terms.
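The complete objective of Eqs. (12)–(14) amounts to something along the following lines (illustrative PyTorch; the boolean-mask bookkeeping of aligned versus misaligned pairs is a simplification, and the default weights are the values reported in Section 4.2.3).

```python
import torch

def sltf_loss(scores, offsets, offsets_gt, aligned, lam1=1.03, lam2=0.03, lam=0.01):
    """scores: (batch,) relevance scores s_{c,q}; offsets and offsets_gt: (batch, 2) predicted and
    ground-truth (delta_s, delta_e); aligned: (batch,) boolean mask of positive moment-query pairs."""
    pos, neg = scores[aligned], scores[~aligned]
    # Eq. (12): push aligned pairs toward positive scores and misaligned pairs toward negative ones.
    l_align = lam1 * torch.log1p(torch.exp(-pos)).sum() + lam2 * torch.log1p(torch.exp(neg)).sum()
    # Eq. (13): L1 regression on the location offsets of the aligned pairs only.
    l_loc = (offsets_gt[aligned] - offsets[aligned]).abs().sum()
    # Eq. (14): weighted combination of the two terms.
    return l_align + lam * l_loc

# Usage sketch on a random mini-batch of 30 moment-query pairs.
loss = sltf_loss(torch.randn(30), torch.randn(30, 2), torch.randn(30, 2), torch.rand(30) > 0.5)
```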

4. Experiments

We first evaluate the effectiveness of our proposed model on three temporal moment localization datasets: TACoS, the Distinct Describable Moments (DiDeMo) dataset, and Charades-STA. We then investigate how the well-designed attention networks affect the localization.

4.1. Data description

4.1.1. TACoS
The dataset was first constructed by Regneri et al. (2013) on top of the MPII-Compositive dataset (Rohrbach et al., 2012). It contains 127 videos. Each video is affiliated with two types of annotations. One is the fine-grained activity labels with temporal location annotations (i.e., the start and end time). The other is natural language descriptions of the temporal annotations. The dataset was used in Gao, Sun, et al. (2017) for temporal activity localization, named TACoS. We briefly describe the dataset construction process. We sample frames at 3 fps for each video. In our paper, each training video is sampled by multi-scale temporal sliding windows with sizes of [16, 32, 48, 64] frames and 80% overlap. As for the testing videos, we coarsely sample them using sliding windows with sizes of [16, 32] frames. A sliding window moment c with temporal bounding box (t_s, t_e) and a query description q with temporal annotation (τ_s, τ_e) are aligned as a training pair if they satisfy the following conditions: (1) the Intersection over Union (IoU) is larger than 0.5; (2) the non-Intersection over Length (nIoL) is smaller than 0.15; and (3) one sliding window moment can be aligned with only one query description. In the TACoS dataset, there are 75 training videos, 25 testing videos, and 26,963 training moment-query pairs satisfying the above conditions. We extracted C3D features and Faster R-CNN features for each video moment. As for query features, we set the maximum query length to 10 and adopted 300-dimensional dense GloVe word embeddings (Pennington et al., 2014). Thus, the dimensions of the visual local features, global motion features, and word embeddings are 36 × 2048, 4096, and 300, respectively.

4.1.2. Charades-STA
The dataset was constructed by Gao, Sun, et al. (2017) on top of the Charades dataset (Sigurdsson et al., 2016) for evaluating temporal activity localization in videos. It contains 6672 videos. To generate the clip-level sentence annotations used in the retrieval task, Gao, Sun, et al. (2017) introduced a semi-automatic approach. As the released Charades-STA dataset only contains the video-description file, we downloaded the original videos from the website and sampled frames at 3 fps for each video. We segmented each training video into multi-scale sliding windows with sizes of [16, 32, 48, 64] frames and 80% overlap. The testing videos are sampled using sliding windows with sizes of [16, 32] frames. Similarly, we employ the same settings as in the experiments on the TACoS dataset to extract features. The temporal feature of each moment candidate is the mean pooling of the features of the corresponding units.
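For concreteness, the sliding-window candidate generation and the IoU/nIoL pairing criteria described above can be sketched as follows (plain Python; the window sizes, overlap, and thresholds follow the text, while the helper names and the nIoL reading, paraphrased from Gao, Sun, et al. (2017), are ours).

```python
def sliding_windows(num_frames, sizes=(16, 32, 48, 64), overlap=0.8):
    """Multi-scale temporal sliding windows over a video, expressed in frame indices."""
    windows = []
    for size in sizes:
        stride = max(1, int(size * (1 - overlap)))
        for start in range(0, max(1, num_frames - size + 1), stride):
            windows.append((start, start + size))
    return windows

def iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def niol(window, gt):
    """One common reading of nIoL: the fraction of the sliding window not covered by the ground truth."""
    inter = max(0, min(window[1], gt[1]) - max(window[0], gt[0]))
    return (window[1] - window[0] - inter) / (window[1] - window[0])

# A window is paired with a query whose ground truth is gt if IoU > 0.5 and nIoL < 0.15.
gt = (100, 160)
pairs = [w for w in sliding_windows(300) if iou(w, gt) > 0.5 and niol(w, gt) < 0.15]
```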


Table 1
The summary of the TACoS, Charades-STA and DiDeMo datasets.

Dataset        #Videos   #Queries   Domain    Video source
TACoS          100       14,229     Cooking   Lab Kitchen
Charades-STA   6672      16,128     Homes     Daily Activities
DiDeMo         10,464    40,543     Open      Flickr

4.1.3. DiDeMo
The dataset was constructed by Hendricks et al. (2017) for language-based moment retrieval, named the Distinct Describable Moments (DiDeMo) dataset. It contains 10,464 personal videos lasting 25–30 seconds and 40,543 localized description annotations. Descriptions in DiDeMo are referring expressions that describe specific moments in a video. Besides, the construction of the DiDeMo dataset contains a verification step to ensure that the descriptions align with a single moment within a video. In the dataset, each video is broken into six five-second moments, and each moment is represented by a 4096-dimensional VGG (Simonyan & Zisserman, 2014b) vector. In addition, we extracted 36 × 2048-dimensional Faster R-CNN (Anderson et al., 2017) features for each frame. For textual features, each word is represented by a 300-dimensional GloVe word embedding (Pennington et al., 2014).

The statistics of the datasets are summarized in Table 1. The reported experimental results in this paper are based on the datasets stated above. In the following experiments, we set the context moment number n to 1, and the length of the context window is set to 128 frames on the TACoS and Charades-STA datasets and 5 seconds on the DiDeMo dataset.

4.2. Experimental settings

4.2.1. Evaluation protocols
To thoroughly measure the performance of our method and the baselines, we adopt the “R@n, IoU=m” setup of Gao, Sun, et al. (2017) as the evaluation metric. To be more specific, for each sentence query, we calculate the temporal Intersection over Union (IoU) between the predicted moment candidates and the ground truth. Then we compute the percentage of queries for which at least one of the top-n predicted moments has an IoU larger than m with the ground truth. In the following paragraphs, we use R(n, m) to denote “R@n, IoU=m”. This metric itself is on the query level, and the overall performance is the average over all sentence queries,

$$R(n, m) = \frac{1}{N_q} \sum_{i=1}^{N_q} r(n, m, q_i), \tag{15}$$

where $r(n, m, q_i)$ is the recall for a given query $q_i$, $N_q$ represents the total number of queries, and $R(n, m)$ denotes the averaged overall performance.
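Computed directly, the metric looks as follows (plain Python, illustrative; the ranked-prediction input format is an assumption).

```python
def recall_at_n(predictions, ground_truths, n=1, m=0.5):
    """predictions: for each query, a list of (start, end) candidates sorted by relevance score;
    ground_truths: one (start, end) ground-truth segment per query. Returns R(n, m)."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        # r(n, m, q_i) = 1 if any of the top-n candidates has IoU larger than m with the ground truth.
        hits += any(iou(seg, gt) > m for seg in ranked[:n])
    return hits / len(ground_truths)              # Eq. (15): average over all N_q queries

# Usage sketch: two queries, each with two ranked candidate moments (in seconds).
print(recall_at_n([[(10, 25), (40, 60)], [(5, 8), (0, 4)]], [(12, 26), (0, 5)], n=1, m=0.5))
```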

4.2.2. Comparative approaches
We compared our proposed SLTF model with the following state-of-the-art baselines to justify the effectiveness of our method:

• MCN (Hendricks et al., 2017): This is a moment context network which localizes natural language queries in videos by integrating local and global video features. The query feature is extracted by an LSTM model, and the video appearance feature is extracted by a pre-trained VGG model. As it simply assumes that the given queries and the video features of the corresponding moments should be close in a common space, the loss function only enforces their features to be similar in the shared embedding space.
• CTRL (Gao, Sun, et al., 2017): This is a cross-modal temporal regression localizer that jointly captures the interaction between the query description and video moments, and outputs alignment scores and action boundary regression results. The query feature is extracted via Skip-Thoughts. As for the visual features, it concatenates the current moment and its contextual moments.
• ACRN (Liu, Wang, Nie, He, et al., 2018): This is an attentive cross-modal retrieval network that introduces a memory attention mechanism (Song, Park, & Shin, 2019) to emphasize the visual features based on the query information and simultaneously incorporates its context moments. Meanwhile, a cross-modal fusion sub-network is adopted to learn both the intra-modality and inter-modality dynamics, which can enhance the learning of the moment-query representation.
• ROLE (Liu, Wang, Nie, He, Tian, et al., 2018): This is a language-temporal attention network that learns the word attention based on the temporal context information in the video. It can automatically select “what words to listen to” for localizing the desired moment. The query feature is extracted by a Bi-LSTM, and the visual feature is the concatenation of the current moment and its contextual moments.
• L-Net (Chen, Ma, Chen, Jie, & Luo, 2019): This is an end-to-end localization network for the task of natural language localization in videos. With the proposed cross-modal interactor and self interactor, it can take advantage of the fine-grained interactions between the two modalities and the evidence from the context to semantically localize the video segment corresponding to the natural sentence.
• ACL (Ge, Gao, Chen, & Nevatia, 2019): This is a novel actionness-score-enhanced activity-concepts-based localizer, which localizes the activities of natural language queries. It mines the activity concepts from both the videos and the sentence queries to facilitate the localization.


Table 2
Performance comparison between our proposed model and the state-of-the-art baselines on TACoS. (p-value*: p-value over R(1, 0.5)).

Method   R@1 IoU=0.5   R@1 IoU=0.3   R@1 IoU=0.1   R@5 IoU=0.5   R@5 IoU=0.3   R@5 IoU=0.1   p-value*
MCN      0.86%         1.25%         2.62%         1.01%         1.82%         2.88%         3.68E−10
CTRL     8.67%         12.89%        18.26%        19.89%        30.09%        43.08%        5.61E−06
ACRN     10.17%        14.93%        20.28%        20.24%        31.52%        44.01%        2.72E−05
ROLE     9.94%         15.38%        20.37%        20.13%        31.17%        45.45%        3.48E−05
L-Net    9.28%         14.32%        23.99%        20.65%        30.92%        45.67%        3.17E−04
ACL      12.31%        16.88%        23.65%        22.04%        32.58%        47.65%        4.34E−04
SLTF     12.36%        18.07%        24.67%        22.86%        33.20%        48.78%        –

4.2.3. Implementation and hyper-parameter setting
We trained our model on a server equipped with four high-performance NVIDIA GPUs, each with 12 GB of video memory. Our model was implemented in Python with the TensorFlow deep learning library. For the gradient descent parameters, we selected the Adam optimizer and started training with a learning rate of 0.01 and a batch size of 30. The hyper-parameters λ1, λ2 and λ are 1.03, 0.03 and 0.01, respectively. For the comparisons with other models, we used the source code provided by the authors. The optimal hyper-parameters for those models were obtained by a grid search using our datasets.

4.3. Performance comparison
We conducted an empirical study to investigate whether our proposed model can achieve better localization performance. The experimental results of the above methods on the three datasets are displayed in Table 2, Table 3 and Table 4, respectively. Several observations are as follows:

• MCN achieves poor performance compared with the other baselines, since simply treating the entire set of moment candidates as the context features may introduce noisy features and lead to negative transfer. Moreover, as it models the relations between the given query and moment features by only enforcing their distance to be close in the common space, the cross-modal relations are not fully explored. In addition, it simply employs coarse frame-level appearance features as video features, which fails to identify the relevant objects and interactions in videos and overlooks the local detail information.
• When performing our moment retrieval task, CTRL outperforms MCN. The difference is that it not only considers the neighbor moments as contextual information but also contains a cross-modal processing part, which can exploit the interactions across the visual and textual modalities. In addition, it utilizes a temporal coordinate regression localizer to refine the temporal boundaries.
• ACRN and ROLE achieve better moment localization results than MCN and CTRL. Both introduce an attention mechanism into the temporal moment localization task: one considers a different attention weight for each context moment, and the other refines the word attention of the queries based on the temporal context information. However, we argue that their performance is limited owing to simply treating video moments holistically as one global feature vector, because only a minority of the objects, activities, and interactions in the video are mentioned in the query language. Such coarse-grained features result in crucial details being missed and inaccurate localization.
• L-Net performs better on the three datasets compared with CTRL. The reason is that L-Net is capable of exploiting the fine-grained interactions between the natural sentence and the video through a cross-gated attended recurrent network.
• ACL achieves better performance than the other baselines, because it not only mines the activity concepts from both the videos and the sentence queries but also introduces an actionness-score-enhanced pipeline, which enhances the localization.
• SLTF achieves the best performance, substantially surpassing all the baselines. Particularly, SLTF shows consistent improvements over the aforementioned baselines. This verifies the importance of selecting relevant local features by spatial attention and of integrating local interaction features with global motion features in the tensor fusion module.

We also evaluated our proposed SLTF model and the baseline methods on DiDeMo, and report the results regarding IoU ∈ {0.5, 0.7, 0.9} and R@{1, 5}.

Table 3
Performance comparison between our proposed model and the state-of-the-art baselines on Charades-STA. (p-value*: p-value over R(1, 0.5)).

Method   R@1 IoU=0.3   R@1 IoU=0.5   R@1 IoU=0.7   R@5 IoU=0.3   R@5 IoU=0.5   R@5 IoU=0.7   p-value*
MCN      32.59%        11.67%        2.63%         89.52%        54.21%        14.56%        5.20E−09
CTRL     35.32%        18.46%        6.04%         90.11%        65.81%        26.74%        1.81E−02
ACRN     38.06%        20.26%        7.64%         92.47%        71.99%        27.79%        2.17E−03
ROLE     37.68%        21.74%        7.82%         92.79%        70.37%        30.06%        3.79E−03
L-Net    40.82%        21.62%        6.94%         94.25%        66.39%        28.43%        2.46E−03
ACL      40.24%        22.81%        9.18%         94.37%        68.48%        32.03%        3.48E−04
SLTF     41.56%        23.73%        9.75%         94.81%        73.39%        32.26%        –


Table 4
Performance comparison between our proposed model and the state-of-the-art baselines on DiDeMo. (p-value*: p-value over R(1, 0.5)).

Method   R@1 IoU=0.5   R@1 IoU=0.7   R@1 IoU=0.9   R@5 IoU=0.5   R@5 IoU=0.7   R@5 IoU=0.9   p-value*
MCN      23.32%        15.36%        15.31%        41.03%        20.37%        19.77%        6.24E−09
CTRL     26.45%        15.39%        15.35%        68.23%        28.57%        26.23%        5.62E−04
ACRN     27.35%        16.23%        16.47%        69.25%        28.73%        27.21%        3.98E−03
ROLE     28.94%        15.47%        15.58%        69.11%        32.93%        28.26%        2.32E−02
L-Net    31.32%        18.69%        17.65%        72.12%        28.43%        26.37%        4.27E−03
ACL      31.77%        19.21%        18.36%        72.65%        32.48%        30.67%        5.74E−04
SLTF     32.28%        19.37%        19.47%        73.86%        34.17%        30.94%        –

Note that since the positive moment-query pairs in the DiDeMo dataset are well aligned, there are no location offsets between them. We therefore only used the alignment loss to train CTRL, ACRN, ROLE, ACL, and SLTF for localizing the corresponding moment. In addition, we conducted a significance test between our model and each of the baselines. We can see that all the p-values are substantially smaller than 0.05, indicating that the advantage of our model is statistically significant.

4.4. Study of SLTF
To verify the effectiveness of our SLTF model, we studied variants of our model to further investigate the effectiveness of the spatial attention network, the tensor fusion network, and the visual integration of local interaction features and global motion features:

• SLTF-a: Instead of using the spatial attention, we utilized average pooling on the top-K region features to generate the local features for each video frame.
• SLTF-l: We eliminated the local interaction features in our tensor fusion network. That is, we only fuse the attended query features with the global motion features as the cross-modal joint representation.
• SLTF-g: We eliminated the global motion features in our tensor fusion network. Namely, we only fuse the attended query features with the local interaction features as the cross-modal joint representation.
• SLTF-f: Instead of using the tensor fusion network in Eq. (10), we simply adopted the vector concatenation approach to fuse the multimodal features.

We explore these model variants on the TACoS, Charades-STA, and DiDeMo datasets, respectively. The experimental results of the component-wise comparison are displayed in Fig. 6.

• Jointly comparing the performance of SLTF-a and SLTF in Fig. 6(a) and (b), SLTF achieves a measurable improvement over SLTF-a on the TACoS dataset, and the same holds on the Charades-STA and DiDeMo datasets. This illustrates that simply adopting the averaging strategy for local features is insufficient to identify the most relevant objects and the underlying interactions among these objects, as average pooling assumes that the multiple local features contribute equally to the moment retrieval and thus obfuscates the crucial detail information in videos. Therefore, the improvement achieved by SLTF verifies the effectiveness of the spatial attention.
• The performances of SLTF-g and SLTF-l are worse than SLTF-a, indicating that removing the global motion features or the local interaction features hurts the visual representation and further degrades the retrieval results, especially in terms of R@1. This confirms that only considering the global motion features is insufficient to identify the objects and interactions based on the query information, and only using the local interaction features without the global motion features overlooks the context information.
• SLTF beats SLTF-f by a large margin on the three different datasets. This indicates that the tensor fusion network is beneficial for enhancing the moment-query representations. One possible explanation is that the concatenation of the moment and query representations models the intra-modal interactions only and limits the expressiveness of the moment-query pair representations, while the tensor fusion network can capture both the intra-modal and inter-modal embedding interactions.

4.5. Attention visualization
Apart from achieving superior performance, a key advantage of SLTF over other methods is that its spatial attention is able to attend to the most relevant objects and capture the interactions among these objects. Meanwhile, the language-temporal attention can emphasize keywords in the sentence. In order to understand our model more intuitively, we show some examples and visualize the attention values and moment retrieval results in Fig. 7.

Fig. 7(a) describes a scene in which a black dog chows down on its meal in a silver-colored bowl on the floor. Given the query “a dog starts eating from a bowl”, we expect to return the moment in which a dog is eating the meal in a bowl. Intuitively, the vital information within the video should be the objects “dog”, “bowl” and the action “eat”. The video moments and the given query are fed into our SLTF model, and we obtain the attention weights for the relevant objects on each frame, together with the attention score for


Fig. 6. Performance comparison among the variants of our proposed model w.r.t. R@1 vs IoU ∈ {0.1, 0.3, 0.5, 0.7, 0.9} and R@5 vs IoU ∈ {0.1, 0.3, 0.5, 0.7, 0.9} on the TACoS, Charades-STA and DiDeMo datasets.

Fig. 7. Visualization of the spatial and language-temporal attention on DiDeMo and Charades-STA. The ground-truth moments are outlined in the orange box and the green dashed box is our retrieval result. The white aperture area represents the importance of each local object on each frame in the spatial attention stage. The word attention is represented by different colors; the depth of color indicates the level of importance. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

each word in the query. We can observe that (1) the objects “dog” and “bowl” are covered by the white aperture, indicating that these objects are the most related to the given query; moreover, the local features on different frames are constantly changing, and we capture the motion information and interactions by encoding the sequence of local features; and (2) the words “dog”, “bowl” and “eating” are marked in the darkest orange, i.e., they obtain the most attention. Therefore, our proposed SLTF is able to adaptively identify the most relevant objects and interactions, and the most important words.


Fig. 8. Moment retrieval results on the TACoS dataset.

Fig. 7 (b) shows another example, where the video describes a scene: a man is sitting on the couch with a book in his hand, picking up a sandwich from a plate and then eating it. When retrieving the corresponding moment by the query “A person is sitting on the couch eating a sandwich”, we expected our model to identify the relevant objects “person”, “couch”, “sandwich” and the action “eat”. As shown in Fig. 7(b), it can be observed that the objects “person”, “couch” and “sandwich” are covered with the white aperture. Meanwhile, the words “eating” and “sandwich” obtain more attention. This agrees with our previous analysis. 4.6. Qualitative results To gain the deep insights into our proposed SLTF model, we illustrated several moment localization results via different language queries. In particular, the examples from TACoS and Charades-STA are shown in Figs. 8 and 9, respectively. In addition, we also displayed the localization results by the baselines. Fig. 8 describes a complex cooking scene, where a woman firstly opened the drawer to find a knife and placed it on the counter, and then she took out a handful of vegetables and washed them in the sink. Later, she used the knife to slice the vegetables and finally put the fresh vegetables back into the refrigerator. We select the sentence “She used the knife and sliced the vegetables into small parts” as the given query, and compare the moment retrieve results on the previous methods. We have the following observations:

• As Fig. 8(b) illustrates, MCN returns a moment in which “The person rinsed vegetables in the sink” from the moment candidates, instead of the desired moment “She used the knife and sliced the vegetables into small parts”. This is probably because (1) MCN treats the entire set of moment candidates as the context feature and introduces noisy visual features; when most moments within the video are related to the scene “rinsed vegetables in the sink”, it fails to represent the desired scene; and (2) it adopts the pre-trained VGG model to extract appearance features, which is insufficient to identify the interaction “slice” between the key objects “knife” and “vegetables”.
• As Fig. 8(c) illustrates, CTRL achieves an unsatisfactory alignment result by returning a moment containing the irrelevant scene “rinse vegetables in the sink”. Compared with MCN, the CTRL model adopts the neighbor contextual moments as the context instead of the whole background. However, it only returns part of the desired moment, since the whole video moment is simply represented as a C3D vector, overlooking the crucial detail information “knife” and “slice”.


Fig. 9. Moment retrieval results on the Charades-STA dataset. All of the above figures are the R@1 results.

• Although ACRN, ROLE, L-Net and ACL generate more accurate results than CTRL, they still return some irrelevant scenes besides the desired moment. Since they only consider the global motion features, they are unable to accurately detect the objects and interactions according to the textual query information. Their suboptimal performance confirms the importance of fusing local interaction and global motion features.
• Our proposed SLTF outperforms the other baselines, as shown in Fig. 8(h). The localized moment with the spatial attention indicates that our model can capture not only the relevant objects “vegetables” and “knife” but also the interaction “slice”. This again indicates the effectiveness of our proposed temporal moment retrieval network.

Similarly, for the example in Fig. 9, our model generates more accurate results than the other models do. In this video, a woman walked into the bathroom, turned on the light and then looked into the mirror. Therefore, the objects “woman” and “mirror” and the action “look into the mirror” are the crucial clues to distinguish the desired moment from the others, as we expected. Since MCN, CTRL, ACRN, ROLE, L-Net and ACL simply consider the global motion features for video moments, local details may be missed. As a result, MCN returns a moment in which the woman is leaving the mirror, while the other models return some irrelevant scenes. Compared with these six methods, our SLTF returns the moment having the largest IoU with the ground truth. However, it is not equal to the ground truth; the reason may be that we adopted coarse-grained sliding windows to generate moment candidates for Charades-STA.

5. Conclusion and future work

In this paper, we develop a novel cross-modal retrieval method to localize the desired moment via a given query. To well align the given textual query and the video moment candidates, we devise a spatial and language-temporal tensor fusion model to adaptively identify the relevant objects and interactions based on the query information. Meanwhile, we also integrate local interaction features with global motion features as the visual representation. Moreover, we adopt a tensor fusion network to incorporate cross-modal information into the moment-query alignment. To verify the effectiveness of our model, extensive experiments are performed on three public datasets. The results demonstrate that our proposed model can achieve better performance compared to the state-of-the-art baselines. As a byproduct, we have released the data, code, and parameter settings to facilitate research in the community.


In the future, we plan to deepen or widen our work from the following aspects: (1) We will model the visual relation among relevant objects on consecutive frames, because their position relationship can be important to capture the interaction information among objects such as a query like “put the vegetables on the cutting board” ; (2) We will incorporate reinforcement learning into our model to adaptively decide both where to look at next and when to predict, instead of the traditional costly “scan and localize” framework; And (3) we will consider incorporating hashing module into our model to speed up the retrieval process. Acknowledgments An earlier version of this paper was presented at the 2019 ACM International Conference on Multimedia Retrieval. This work was supported in part by National Natural Science Foundation of China under Grants No. 61702176 and Hunan Provincial Natural Science Foundation of China under Grant No. 2017JJ3038. Supplementary material Supplementary material associated with this article can be found, in the online version, at 10.1016/j.ipm.2019.102104. References Al-Smadi, M., Al-Ayyoub, M., Jararweh, Y., & Qawasmeh, O. (2019). Enhancing aspect-based sentiment analysis of arabic hotels reviews using morphological, syntactic and semantic features. Information Processing & Management, 56(2), 308–319. Alayrac, J.-B., Bojanowski, P., Agrawal, N., Sivic, J., Laptev, I., & Lacoste-Julien, S. (2016). Unsupervised learning from narrated instruction videos. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE4575–4583. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2017). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE6077–6086. Chen, J., Ma, L., Chen, X., Jie, Z., & Luo, J. (2019). Localizing natural language in videos. Proceedings of the American association for artificial intelligence. AAAI. Escorcia, V., Heilbron, F. C., Niebles, J. C., & Ghanem, B. (2016). Daps: Deep action proposals for action understanding. Proceedings of the European conference on computer vision. Springer768–784. Feng, F., He, X., Liu, Y., Nie, L., & Chua, T.-S. (2018). Learning on partial-order hypergraphs. Proceedings of the international conference on world wide web. International World Wide Web Conferences Steering Committee1523–1532. Gaidon, A., Harchaoui, Z., & Schmid, C. (2011). Actom sequence models for efficient action detection. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE3201–3208. Gao, J., Sun, C., Yang, Z., & Nevatia, R. (2017). Tall: Temporal activity localization via language query. Proceedings of the IEEE international conference on computer vision. IEEE5267–5275. Gao, J., Yang, Z., Sun, C., Chen, K., & Nevatia, R. (2017). Turn tap: Temporal unit regression network for temporal action proposals. Proceedings of the IEEE international conference on computer vision. IEEE3628–3636. Ge, R., Gao, J., Chen, K., & Nevatia, R. (2019). Mac: Mining activity concepts for language-based temporal localization. Proceedings of the IEEE winter conference on applications of computer vision. IEEE245–253. Girshick, R. B., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE580–587. Hendricks, L. 
Acknowledgments

An earlier version of this paper was presented at the 2019 ACM International Conference on Multimedia Retrieval. This work was supported in part by the National Natural Science Foundation of China under Grant No. 61702176 and the Hunan Provincial Natural Science Foundation of China under Grant No. 2017JJ3038.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.ipm.2019.102104.

References

Al-Smadi, M., Al-Ayyoub, M., Jararweh, Y., & Qawasmeh, O. (2019). Enhancing aspect-based sentiment analysis of Arabic hotels reviews using morphological, syntactic and semantic features. Information Processing & Management, 56(2), 308–319.
Alayrac, J.-B., Bojanowski, P., Agrawal, N., Sivic, J., Laptev, I., & Lacoste-Julien, S. (2016). Unsupervised learning from narrated instruction videos. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 4575–4583.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2017). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 6077–6086.
Chen, J., Ma, L., Chen, X., Jie, Z., & Luo, J. (2019). Localizing natural language in videos. Proceedings of the American Association for Artificial Intelligence. AAAI.
Escorcia, V., Heilbron, F. C., Niebles, J. C., & Ghanem, B. (2016). DAPs: Deep action proposals for action understanding. Proceedings of the European conference on computer vision. Springer, 768–784.
Feng, F., He, X., Liu, Y., Nie, L., & Chua, T.-S. (2018). Learning on partial-order hypergraphs. Proceedings of the international conference on world wide web. International World Wide Web Conferences Steering Committee, 1523–1532.
Gaidon, A., Harchaoui, Z., & Schmid, C. (2011). Actom sequence models for efficient action detection. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 3201–3208.
Gao, J., Sun, C., Yang, Z., & Nevatia, R. (2017). TALL: Temporal activity localization via language query. Proceedings of the IEEE international conference on computer vision. IEEE, 5267–5275.
Gao, J., Yang, Z., Sun, C., Chen, K., & Nevatia, R. (2017). TURN TAP: Temporal unit regression network for temporal action proposals. Proceedings of the IEEE international conference on computer vision. IEEE, 3628–3636.
Ge, R., Gao, J., Chen, K., & Nevatia, R. (2019). MAC: Mining activity concepts for language-based temporal localization. Proceedings of the IEEE winter conference on applications of computer vision. IEEE, 245–253.
Girshick, R. B., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 580–587.
Hendricks, L. A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., & Russell, B. C. (2017). Localizing moments in video with natural language. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 5804–5813.
Hu, R., Xu, H., Rohrbach, M., Feng, J., Saenko, K., & Darrell, T. (2016). Natural language object retrieval. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 4555–4564.
Jiang, B., Huang, X., Yang, C., & Yuan, J. (2019). Cross-modal video moment retrieval with spatial and language-temporal attention. Proceedings of the ACM international conference on multimedia retrieval. ACM, 217–225.
Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 3128–3137.
Lee, O.-J., & Jung, J. J. (2019). Integrating character networks for extracting narratives from multimodal data. Information Processing & Management, 56(5), 1894–1923.
Li, K., Liu, C., Li, K., & Zomaya, A. Y. (2015). A framework of price bidding configurations for resource usage in cloud computing. IEEE Transactions on Parallel and Distributed Systems, 27(8), 2168–2181.
Lin, D., Fidler, S., Kong, C., & Urtasun, R. (2014). Visual semantic search: Retrieving videos via complex textual queries. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 2657–2664.
Lin, T., Zhao, X., & Shou, Z. (2017). Single shot temporal action detection. Proceedings of the ACM international conference on multimedia. ACM, 988–996.
Liu, M., Nie, L., Wang, M., & Chen, B. (2017). Towards micro-video understanding by joint sequential-sparse modeling. Proceedings of the ACM international conference on multimedia. ACM, 970–978.
Liu, M., Nie, L., Wang, X., Tian, Q., & Chen, B. (2018). Online data organizer: Micro-video categorization by structure-guided multimodal dictionary learning. IEEE Transactions on Image Processing, 28(3), 1235–1247.
Liu, M., Wang, X., Nie, L., He, X., Chen, B., & Chua, T.-S. (2018). Attentive moment retrieval in videos. Proceedings of the international ACM SIGIR conference on research and development in information retrieval. ACM, 15–24.
Liu, M., Wang, X., Nie, L., Tian, Q., Chen, B., & Chua, T.-S. (2018). Cross-modal moment localization in videos. Proceedings of the ACM international conference on multimedia. ACM, 843–851.
Luo, J., Huang, W., & Cao, B. (2018). A novel approach to identify the miRNA-mRNA causal regulatory modules in cancer. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 15(1), 309–315.
Ma, S., Sigal, L., & Sclaroff, S. (2016). Learning activity progression in LSTMs for activity detection and early detection. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 1942–1950.
Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd annual meeting of the Association for Computational Linguistics: System demonstrations, 55–60.
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. Proceedings of the conference on empirical methods in natural language processing. ACL, 1532–1543.
Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., & Pinkal, M. (2013). Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1, 25–36.
Ren, S., He, K., Girshick, R. B., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Proceedings of the advances in neural information processing systems. NIPS, 91–99.
Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M., & Schiele, B. (2012). Script data for attribute-based recognition of composite activities. Proceedings of the European conference on computer vision. Springer, 144–157.
Shetty, R., & Laaksonen, J. (2015). Video captioning with recurrent networks based on frame- and video-level features and visual content classification. arXiv:1512.02949.
Shin, S., Jin, X., Jung, J., & Lee, K.-H. (2019). Predicate constraints based question answering over knowledge graph. Information Processing & Management, 56(3), 445–462.
Shou, Z., Wang, D., & Chang, S.-F. (2016). Temporal action localization in untrimmed videos via multi-stage CNNs. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 1049–1058.
Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., & Gupta, A. (2016). Hollywood in homes: Crowdsourcing data collection for activity understanding. Proceedings of the European conference on computer vision. Springer, 510–526.
Simonyan, K., & Zisserman, A. (2014a). Two-stream convolutional networks for action recognition in videos. Proceedings of the advances in neural information processing systems. NIPS, 568–576.
Simonyan, K., & Zisserman, A. (2014b). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
Singh, B., Marks, T. K., Jones, M. J., Tuzel, O., & Shao, M. (2016). A multi-stream bi-directional recurrent neural network for fine-grained action detection. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 1961–1970.
Song, M., Park, H., & Shin, K.-s. (2019). Attention-based long short-term memory network using sentiment lexicon embedding for aspect-level sentiment analysis in Korean. Information Processing & Management, 56(3), 637–653.
Song, X., Feng, F., Liu, J., Li, Z., Nie, L., & Ma, J. (2017). NeuroStylist: Neural compatibility modeling for clothing matching. Proceedings of the ACM international conference on multimedia. ACM, 753–761.
Sun, C., Gan, C., & Nevatia, R. (2015). Automatic concept discovery from parallel text and visual corpora. Proceedings of the IEEE international conference on computer vision. IEEE, 2596–2604.
Sun, C., Shetty, S., Sukthankar, R., & Nevatia, R. (2015). Temporal localization of fine-grained actions in videos by domain transfer from web images. Proceedings of the ACM international conference on multimedia. ACM.
Tellex, S., & Roy, D. (2009). Towards surveillance video search by natural language query. Proceedings of the ACM international conference on image and video retrieval. ACM, 38.
Tran, D., Bourdev, L. D., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 4489–4497.
Vallet, D., Hopfgartner, F., Jose, J. M., & Castells, P. (2011). Effects of usage-based feedback on video retrieval: A simulation-based study. ACM Transactions on Information Systems, 29(2), 11.
Wang, X., He, X., Nie, L., & Chua, T.-S. (2017). Item silk road: Recommending items from information domains to social users. Proceedings of the international ACM SIGIR conference on research and development in information retrieval. ACM, 185–194.
Yang, C., Kurahashi, S., Ono, I., & Terano, T. (2012). Pattern-oriented inverse simulation for analyzing social problems: Family strategies in civil service examination in imperial China. Advances in Complex Systems, 15(07), 1250038.
Yang, C., Zhang, H., Jiang, B., & Li, K. (2019). Aspect-based sentiment analysis with alternating coattention networks. Information Processing & Management, 56(3), 463–478.
Yu, H., Wang, J., Huang, Z., Yang, Y., & Xu, W. (2016). Video paragraph captioning using hierarchical recurrent neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE, 4584–4593.
