DeepImSeq: Deep image sequencing for unsynchronized cameras

Gagan Kanojia*, Shanmuganathan Raman
Indian Institute of Technology Gandhinagar, Gandhinagar 382355, India

Pattern Recognition Letters 117 (2019) 9–15

Article history: Received 19 May 2018; available online 19 November 2018.
Keywords: image sequencing, deep learning, neural network, pattern recognition

Abstract

Consider a set of n images of a dynamic scene captured using multiple hand-held devices. The order in which these images are captured is unknown. For n images, there are n! possible arrangements, which makes this problem extremely challenging. In this work, we address the problem of sequencing such a set of unordered images in its temporal order. We propose an LSTM-based deep neural network which addresses this problem in an end-to-end manner. The network takes the set of images as input and outputs their order of capture. We formulate the problem as a sequence-to-sequence mapping task, in which each image is mapped to its position in the ordered sequence. We do not provide any other information to the network apart from the input images. We show that the proposed approach obtains state-of-the-art results on the standard dataset. Further, we show through experimental results that the network learns better when the target sequence is reversed.

1. Introduction

Analysis of a dynamic scene using still images has been an active area of research in computer vision for a long time. When a dynamic scene is captured using a hand-held device, the analysis becomes even more challenging because camera motion is added to the motion of the dynamic objects. Consider a set of still images of a dynamic scene captured with multiple hand-held devices. Such a set of images can be obtained from the internet or directly from a group of people. Images obtained from the internet may not carry a time-stamp (the information about the time at which they were captured) and are available in an unsorted manner. Even when the images are obtained directly from a group of people, their time-stamps can be unsynchronized due to reasons such as imperfect device clocks. In this work, we address the problem of sequencing such a set of images in its temporal order. We assume that no information is available regarding the order of capture of any image in the given set. If there are n images in the set, there are n! possible arrangements, which makes this problem extremely challenging. Fig. 1 demonstrates the task of sequencing given an unordered image sequence.

Recurrent neural networks (RNNs) are powerful models for exploiting temporal dependencies. However, in practice, they suffer from the problem of vanishing/exploding gradients [5].

Conflict of interest: none.
* Corresponding author. E-mail address: [email protected] (G. Kanojia).


Long short-term memory (LSTM) is a variant of recurrent neural networks designed to handle these drawbacks, and it has been used successfully in several natural language processing (NLP) and computer vision tasks [7,10,11,16,24,32,34]. In this work, we utilize the ability of LSTMs to capture the temporal dependencies present in the set of input images. The major contributions of this work are as follows.

• We propose an LSTM-based neural network which addresses the problem of sequencing an unordered set of images in its temporal order in an end-to-end fashion.
• We formulate this problem as a sequence-to-sequence mapping task, in which each image is mapped to its position in the ordered sequence.
• We show that, even without any preprocessing and any other assistance, the proposed network achieves state-of-the-art accuracy.
• We show through experimental results that the network learns better when the target sequence is reversed.

The rest of the paper is organized as follows. Section 2 provides an overview of the relevant works. Section 3 discusses the neural network architecture proposed to address the sequencing problem. Section 4 discusses the dataset used for the experiments along with the training and implementation details; it also discusses several experiments and a comparison with the state-of-the-art approach. Section 5 provides the conclusion and discusses future work.


Fig. 1. The figure demonstrates the task of sequencing, given an unordered image sequence. The first column shows the set of unordered input images. The second column shows the input images in the sequence in which they were captured.

2. Related works

Sequencing a set of temporally unordered images can be very helpful for tasks which exploit temporal coherence [3,8,18,27,29,40]. In [35], the authors argue that sequence learning is the most powerful form of human learning. In [27], the authors propose a learning method for deep neural networks which exploits the coherence present in sequential data and use it to improve tasks such as face recognition. In [3], the authors examine whether exposure to temporally sequenced data from a scene helps the observer predict future events; their results show that providing sequential data improves the prediction of future events.

The early works by Basha et al. address the problem of sequencing the images of a dynamic scene captured by multiple hand-held cameras [4,9]. In their work, they assume that they know which images were captured from the same camera and the order of capture of the images from the same camera. They also assume that two of the images are captured from the same location [4]. In [19], the authors overcome these assumptions. Given a set of images of a dynamic scene captured using multiple cameras, their method splits the given images into different sets such that the images captured from the same camera fall in the same set. Afterward, they arrange the images belonging to each set in their order of capture. In [2], the authors focus on finding the spatial order of the images taken surrounding an event. These approaches address the problem of photo-sequencing in an unsupervised manner.

There are works which perform sequencing in a different context [20,23,25,28,35]. In [25], the authors present an approach for temporally ordering the events in news. In [28], the authors learn feature representations by recognizing the 2-D rotation applied to the input images and show that this helps in better semantic feature learning. In [30], the authors propose an automated approach for photo album creation from an unordered collection; they order the photos such that they make a story. A similar problem is addressed in [1], in which the authors take a jumbled set of image-caption pairs that belong to an event and sort them such that they make a consistent story. In [31], the authors sort images based on relative attributes like age, smile, etc. In [26], the authors learn visual representations from videos; in their work, they learn to determine whether a given sequence of frames is in the correct temporal order or not. The recent work by Lee et al. [22] is the most relevant to ours. They address this problem as a multi-class classification problem in which the network has n!/2 outputs. They do not feed the frames directly; instead, they extract spatial patches with large motions and feed them to their network. They learn visual representations by training their network to order a shuffled set of input images of a dynamic scene. Unlike Agrawal et al. [1] and Santa Cruz et al. [31], we only take a set of images as the input to predict its temporal order. We do not perform any preprocessing on the input set of images and directly feed them to the network. The proposed network has only n output nodes.

3. Proposed approach

We propose an LSTM-based neural network which takes an unordered sequence of n images as input and outputs their actual positions in the ordered sequence. Let Iin = {I1, I2, I3, I4, I5} be the unordered input sequence, and let the order in which these images have been captured be Io = {I3, I4, I1, I5, I2}. The network takes the input Iin in the given order and outputs the mapping of each image to its position in the ordered sequence, i.e., {I1, I2, I3, I4, I5} → {3, 5, 1, 2, 4}. Fig. 2 shows the proposed network architecture.
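To make the input–target mapping concrete, here is a tiny Python illustration (ours, not from the paper) that reproduces the example above: given the unordered input order and the true capture order, the target for each input image is its position in the temporal order.

```python
# Illustrative only: build the target positions for the example in the text.
unordered = ["I1", "I2", "I3", "I4", "I5"]       # order in which the images are fed
capture_order = ["I3", "I4", "I1", "I5", "I2"]   # true temporal order Io

# Position (1-indexed) of each input image in the ordered sequence.
targets = [capture_order.index(img) + 1 for img in unordered]
print(targets)  # [3, 5, 1, 2, 4]
```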

3.1. LSTM

In many cases, it is hard to infer a situation just by looking at each of its instances separately. For example, to get the context of a statement, one has to go through each of its words while keeping a memory of the previous words. LSTMs have proven to be very successful in tasks which require keeping a memory of previous inputs to infer from the current input [6,15,36,38,39]. LSTM networks are a variant of RNNs [12,13]. They were introduced in the work by Hochreiter and Schmidhuber [17]. They are capable of gathering information over long durations and then using it for inference. However, over time, some of the accumulated information becomes less essential; hence, there is a need to forget such information. LSTMs are well equipped to learn which information to remember and which to forget based on the inputs at each time step.

In this work, we need to sequence an unordered set of images. In this case, even a human being cannot decide the location of an image in the ordered set just by looking at a single image. First, a person has to look at all the images and then assign each image to its corresponding index in the ordered sequence. As we will see in the next section, the proposed network does the same thing. The encoder takes an overview of the set of images and passes the context to the decoder. Then, the decoder assigns each image its corresponding index. We use LSTMs in our network to exploit their ability to accumulate information over time and take an appropriate decision based on the previous inputs.


[Figure 2 (diagram): feature extractor and extracted features feeding the LSTM, with per-time-step FC3 + ReLU, FC1 + ReLU, and FC2 + Softmax blocks. Legend: FC1 + ReLU and FC3 + ReLU denote a fully connected layer followed by ReLU activation; FC2 + Softmax denotes a fully connected layer followed by Softmax activation.]

Fig. 2. The figure shows the proposed LSTM-based neural network for sequencing the set of unordered input images. It shows the LSTM unrolled over the length of the input image sequence. Here, the methodology is demonstrated for an unordered image sequence of length 5.

3.2. Model

Our proposed model consists of three parts: a feature extractor, an encoder, and a decoder. The feature extractor is applied to each of the n images of the unordered sequence to provide a fixed-length feature vector for each image. We adapt the ResNet (18 layers) architecture for the feature extractor [14]. We replace its classification layer by a ReLU activation layer and use its output as the input for the encoder. These feature vectors are then fed to the encoder in the given order to obtain a fixed-length context vector for the complete sequence. For example, if Iin = {I1, I2, I3, I4, I5} is the input sequence, first I1 is fed to the encoder, then I2, and so on. This context vector contains an overview of the input sequence.

We employ a multi-layer LSTM as the encoder. Let the number of layers be m. The encoder reads the extracted feature vector $q_t$ corresponding to the t-th image in the given input sequence, one at each time step $t \in \{1, \ldots, n\}$, and provides a fixed-length vector representation for the complete sequence by executing the following equations at each time step.

$i_t^l = \sigma(W_{xi}^l x_t^l + W_{hi}^l h_{t-1}^l + b_i^l)$   (1)

$f_t^l = \sigma(W_{xf}^l x_t^l + W_{hf}^l h_{t-1}^l + b_f^l)$   (2)

$g_t^l = \tanh(W_{xc}^l x_t^l + W_{hc}^l h_{t-1}^l + b_c^l)$   (3)

$o_t^l = \sigma(W_{xo}^l x_t^l + W_{ho}^l h_{t-1}^l + b_o^l)$   (4)

$c_t^l = f_t^l \odot c_{t-1}^l + i_t^l \odot g_t^l$   (5)

$h_t^l = o_t^l \odot \tanh(c_t^l)$   (6)

Here, $\sigma$ is the logistic sigmoid function, $\odot$ denotes element-wise multiplication, $x_t^l$ is the input to the l-th layer of the encoder at time t, $h_t^l$ is the hidden state, $h_{t-1}^l$ is the hidden state at the previous time step, $c_t^l$ is the cell state, $i_t^l$ is the input gate, $f_t^l$ is the forget gate, and $g_t^l$ is the input modulation gate of the l-th layer of the encoder. $b_i^l$, $b_f^l$, $b_c^l$, and $b_o^l$ are the bias vectors, and $W_{xi}^l$, $W_{hi}^l$, $W_{xf}^l$, $W_{hf}^l$, $W_{xc}^l$, $W_{hc}^l$, $W_{xo}^l$, and $W_{ho}^l$ are the learnable weights. The initial values $h_0$ and $c_0$ are set to zero. In a multi-layer LSTM, the LSTMs are stacked over each other: for the first layer of the encoder ($l = 1$), $x_t^1$ is equal to $q_t$, i.e., the output of the feature extractor, and for the remaining layers $x_t^l$ is the hidden state of the LSTM in the previous layer at the same time step. The context vector comprises the hidden state and the cell state at the n-th time step of the encoder, where n is the number of input images.

The decoder takes the context vector and the unordered image sequence as input and maps each image to its corresponding position in the ordered sequence. For the decoder, we employ a multi-layer LSTM-based neural network with n outputs, where n is the length of the sequence. In [37], the authors show that by reversing the order of the source sentences, the network is able to learn better. Similar to that work, we reverse the target sequence and map the images to their corresponding indices in the reverse order. For example, we provide Iin = {I1, I2, I3, I4, I5} to the encoder in the given order, i.e., first I1, then I2, and so on, one at each time step, and map the images to their corresponding indices using the decoder in the following order: {I5, I4, I3, I2, I1}, i.e., first I5, then I4, and so on, one at each time step. The reason behind doing this is that the decoder will be able to take a better decision for the last image seen by the encoder, since the context vector remembers the last image the most.
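Before turning to the decoder equations, the following is a minimal PyTorch sketch of the feature-extractor-plus-encoder side described above. It is our reconstruction, not the authors' code; the layer sizes follow Section 4.2, and initialization and training details are omitted.

```python
import torch.nn as nn
import torchvision

class Encoder(nn.Module):
    """ResNet-18 feature extractor followed by a multi-layer LSTM encoder."""

    def __init__(self, feat_dim=512, hidden_size=2560, num_layers=2):
        super().__init__()
        resnet = torchvision.models.resnet18(pretrained=True)
        resnet.fc = nn.ReLU()  # replace the classification layer by ReLU (Sec. 3.2)
        self.feature_extractor = resnet
        self.lstm = nn.LSTM(feat_dim, hidden_size, num_layers, batch_first=True)

    def forward(self, images):
        # images: (batch, n, 3, 224, 224) -> per-image features q_t: (batch, n, 512)
        b, n = images.shape[:2]
        q = self.feature_extractor(images.flatten(0, 1)).view(b, n, -1)
        # Context vector: hidden and cell states after the n-th time step.
        _, (h_n, c_n) = self.lstm(q)
        return q, (h_n, c_n)
```

The context (h_n, c_n) is what initializes the decoder's hidden and cell states, as described next.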


The decoder executes Eqs. (1)-(12) iteratively for $t = 1, \ldots, n$ in order to arrive at the solution $\{s_1, s_2, \ldots, s_n\}$, which are the corresponding positions of the images $\{I_n, I_{n-1}, \ldots, I_1\}$ in the ordered sequence.

$x_t^1 = \phi(W_{xq}\,[q_{n-t+1} \mid p] + b_q)$   (7)

$x_t^l = h_t^{l-1}, \quad \forall l \in \{2, \ldots, m\}$   (8)

$\bar{o}_t = \phi(W_{ho} h_t^m + b_o)$   (9)

$o_t = W_{o_n} \bar{o}_t + b_{o_n}$   (10)

$\hat{o}_t = \Omega(o_t)$   (11)

$p_{s_{t-1}} = 1, \quad \text{if } t > 1$   (12)

Here, $p = [p_1\; p_2\; \cdots\; p_n]$, n is the length of the sequence, $\phi$ is the ReLU activation function, and $\Omega$ is the softmax activation function. $x_t^1$ is the input to the first layer of the decoder, which is the concatenation of the extracted feature vector $q_{n-t+1}$ of the $(n-t+1)$-th image at the t-th time step and the vector $p$. $b_q$, $b_o$, and $b_{o_n}$ are the bias vectors, and $W_{xq}$, $W_{ho}$, and $W_{o_n}$ are the learnable weights. We initialize all the entries of $p$ to zero. At each time step $t = 2, \ldots, n$, the $s_{t-1}$-th entry of the vector $p$ is set to 1. Here, $s_t$ is the position of the $(n-t+1)$-th image of the given sequence in the ordered sequence; $s_t$ is estimated as the index of the maximum value of $\hat{o}_t$. The intuition behind $p$ is that, at each step, it tells the decoder which locations in the ordered sequence are already occupied and which are not; that is, if an entry of $p$ is 0, then the corresponding position in the ordered sequence is not yet occupied. We initialize the hidden state $h_0$ and the cell state $c_0$ of the decoder with the hidden and cell states at the n-th time step of the encoder.

3.3. Loss function

Let the target vector be $\{z_1, z_2, \ldots, z_n\}$, where $z_t$ is the position of the $(n-t+1)$-th image of the given sequence in the ordered sequence. We compute the loss $L$ for an input sequence using the outputs of the decoder as shown in Eq. (13).

$L(o_1, \ldots, o_n) = -\sum_{t=1}^{n} \log \dfrac{\exp((o_t)_{z_t})}{\sum_{j=1}^{n} \exp((o_t)_{z_j})}$   (13)

Here, $(o_t)_{z_j}$ is the $z_j$-th entry of the vector $o_t$.
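Under our reading of Eqs. (7)-(12) and the loss in Eq. (13), one pass of the decoder could be sketched in PyTorch as follows. The variable names are ours, the FC sizes (64 and n) follow Section 4.2, targets are assumed 0-indexed, and batching details are simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Sketch of the decoder: maps images, taken in reverse input order, to positions."""

    def __init__(self, n, feat_dim=512, in_size=128, hidden_size=2560, num_layers=2):
        super().__init__()
        self.n = n
        self.fc_in = nn.Linear(feat_dim + n, in_size)     # Eq. (7): W_xq, b_q
        self.lstm = nn.LSTM(in_size, hidden_size, num_layers, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, 64)             # Eq. (9): FC1 + ReLU
        self.fc2 = nn.Linear(64, n)                       # Eq. (10): FC2 (logits o_t)

    def forward(self, q, context, targets=None):
        # q: (batch, n, feat_dim) image features; context: (h, c) from the encoder.
        p = q.new_zeros(q.shape[0], self.n)               # occupancy vector p, all zeros
        state, loss, positions = context, 0.0, []
        for t in range(1, self.n + 1):
            x = F.relu(self.fc_in(torch.cat([q[:, self.n - t], p], dim=1)))  # Eq. (7)
            h, state = self.lstm(x.unsqueeze(1), state)
            o = self.fc2(F.relu(self.fc1(h.squeeze(1))))  # Eqs. (9)-(10)
            s = o.argmax(dim=1)       # argmax of logits = argmax of softmax, Eq. (11)
            positions.append(s + 1)   # 1-indexed position of image I_{n-t+1}
            if targets is not None:   # Eq. (13): summed cross-entropy over time steps
                loss = loss + F.cross_entropy(o, targets[:, t - 1], reduction="sum")
            p = p.scatter(1, s.unsqueeze(1), 1.0)         # Eq. (12): mark position occupied
        return positions, loss
```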

4. Evaluation

Fig. 3. The figure shows some sets of unordered 6-tuples of frames extracted from the UCF-101 dataset.

4.1. Dataset

In [22], the authors extract 3-tuples and 4-tuples of frames from the videos of the UCF-101 dataset [33]. They estimate the optical flow and use its magnitude to select the frames for the training data. Then, they apply some pre-processing on the frames, which involves the selection of spatial patches from the frames, spatial jittering, and channel splitting, before feeding them to their network. In our experiments, we use 4 datasets comprised of 3-tuples, 4-tuples, 5-tuples, and 6-tuples of frames extracted from the UCF-101 dataset, respectively. We use the 3-tuples and 4-tuples of frames extracted from the UCF-101 dataset used by Lee et al. [22]. We obtain 5-tuples by adding a frame to each 4-tuple, extracted from the left of the 4-tuple in the video. We obtain 6-tuples by adding 2 frames to each 4-tuple, extracted from the left and the right side of the 4-tuple in the video. The frames for the 5-tuples and 6-tuples are extracted in such a way that the distances between the frames are maintained. We do not apply any pre-processing on these tuples and directly feed them to the proposed network. Fig. 3 shows some examples of 6-tuple inputs. We randomly split each n-tuple dataset into training, validation, and testing sets comprising 70%, 10%, and 20% of the data, respectively.

4.2. Training and implementation details

Similar to [22], we consider the forward and the backward permutations as a single class. Hence, an n-tuple has n!/2 classes. We train separate networks for 3-tuples, 4-tuples, 5-tuples, and 6-tuples using 85K, 85K, 82K, and 80K training samples, respectively. For the weight update, we use Adam with a learning rate of $10^{-4}$, momentum = 0.9, $\epsilon = 10^{-8}$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$ [21]. We apply a dropout of 0.3 on all fully connected layers. We train the networks by feeding them all the permutations of the n-tuples and comparing the outputs with the target sequences. For example, for 6-tuples, we feed 80K × (6!/2) ≈ 30.9M sequences to the network for training.
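As an illustration of how such training pairs can be enumerated (our sketch; the frame selection and the forward/backward bookkeeping of the paper are not reproduced), each n-tuple yields one shuffled input sequence and one reversed target sequence per permutation:

```python
import itertools

def make_training_pairs(frames):
    """Enumerate shuffles of an n-tuple of frames (given in temporal order).

    For each permutation, the input is the shuffled frames and the target lists
    the temporal position (0-indexed) of each input image in reverse input
    order (I_n, ..., I_1), i.e., the reversed target sequence described above.
    """
    n = len(frames)
    pairs = []
    for perm in itertools.permutations(range(n)):
        inputs = [frames[j] for j in perm]   # shuffled input sequence
        target = list(perm)[::-1]            # reversed target sequence
        pairs.append((inputs, target))
    return pairs
```

For a 6-tuple this enumerates 6! orderings; the paper treats a forward ordering and its reverse as a single class, giving 6!/2 = 360 classes.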

Fig. 4. The figure shows results obtained on some test sets of unordered images. The first row of (a), (b), (c), and (d) shows the unordered sequence provided as input to the proposed network. The second row of (a), (b), (c), and (d) shows the sequence obtained using the proposed network.

The feature extractor takes an image of size 224 × 224 × 3 as input. We adapt the ResNet (18 layers) architecture for the feature extractor, replace its classification layer by a ReLU activation layer, and use its output as the input for the encoder. The length of the feature vector is 512. We use 2560 hidden units in the networks for 3-tuples and 4-tuples, and 3072 hidden units in the networks for 5-tuples and 6-tuples, in both the encoder and the decoder LSTMs, each with two layers. The input size of the multi-layer LSTM of the encoder is 512 and of the decoder is 128. The output size of FC1 is 64 and that of FC2 is equal to n.


Initially, we keep the weights of the feature extractor fixed and train only the encoder and the decoder. After reaching saturation, we fine-tune the complete network. While fine-tuning, we use $10^{-5}$ as the learning rate for the feature extractor and keep the learning rate of the decoder and the encoder at $10^{-4}$. For data augmentation, we randomly crop a region of 224 × 224 from the training images, whose smaller side is resized to 256 pixels. We implemented our network and performed all the experiments using PyTorch on a system with an Intel i7-7820X processor, 64 GB RAM, and an Nvidia Titan Xp GPU.
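One way to realize this two-stage schedule in PyTorch is sketched below. The module names (feature_extractor, encoder, decoder) are our assumptions; the learning rates and Adam settings follow the paper.

```python
import torch

def build_optimizer(model, fine_tune=False):
    """Adam optimizer for the two training stages described in Section 4.2."""
    if not fine_tune:
        # Stage 1: keep the feature extractor fixed; train encoder and decoder only.
        for param in model.feature_extractor.parameters():
            param.requires_grad = False
        params = list(model.encoder.parameters()) + list(model.decoder.parameters())
        return torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

    # Stage 2: fine-tune the whole network, with a smaller learning rate (1e-5)
    # for the feature extractor than for the encoder and decoder (1e-4).
    for param in model.feature_extractor.parameters():
        param.requires_grad = True
    return torch.optim.Adam([
        {"params": model.feature_extractor.parameters(), "lr": 1e-5},
        {"params": model.encoder.parameters(), "lr": 1e-4},
        {"params": model.decoder.parameters(), "lr": 1e-4},
    ], betas=(0.9, 0.999), eps=1e-8)
```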

4.3. Experiments

We compare our results with the state-of-the-art method proposed by Lee et al. [22]. The proposed network outperforms the state-of-the-art results. Table 1 shows the comparison of the order prediction accuracy of Lee et al. with that of our approach. We obtain even better order prediction accuracy on the test set for 6-tuples, which have 6!/2 = 360 possibilities, than the accuracy obtained by Lee et al. [22] for 4-tuples, which have 4!/2 = 12 possibilities. Table 1 also shows the comparison between the accuracies obtained on the test set with the pre-trained weights of ResNet (18 layers) and with the fine-tuned network. Due to fine-tuning, ResNet learns better task-specific features, which provide better information to the encoder and the decoder for the sequencing task; this can be seen from the results shown in Table 1. After fine-tuning, we obtain better order prediction accuracy. Fig. 4 shows the results obtained on a few test sets using the proposed approach.

Table 2 shows the comparison between the accuracies obtained (without fine-tuning) on the test set using the reversed and the unreversed target sequences while having similar training accuracy. We obtain better accuracy when the target sequence is reversed. We also found that it is easier to train the network with the reversed target sequences than with the unreversed ones: to train with the unreversed target sequences, we first had to train the network with the reversed target sequences for a few epochs and then use the learned weights to train for the unreversed target sequences. Table 3 shows the average accuracies computed for each position of the sequence. As expected, the individual accuracies are superior to those obtained for the complete sequence. Table 4 shows the comparison between the CPU run-times (in seconds) for the feature extractor, the encoder, and the decoder for sequences of different lengths.

Table 1. The table shows the comparison of the order prediction accuracy (in percentage) between Lee et al. [22] and the proposed approach.

Sequence   Lee et al. [22]   Ours (w/o fine-tuning)   Ours (with fine-tuning)
3-tuple    63                59.15                    67.18
4-tuple    41                53.04                    60.33
5-tuple    NA                48.64                    54.78
6-tuple    NA                43.25                    51.30

Table 2. The table shows the comparison between accuracies (in percentage) obtained (without fine-tuning) on the test set using the reversed and the unreversed target sequence.

Sequence   Reversed   Unreversed
3-tuple    59.15      49.63
4-tuple    53.04      44.70

Table 3. The table shows the prediction accuracy (in percentage) on the test set individually for each position of the given sequences.

Sequence   1st     2nd     3rd     4th     5th     6th
3-tuple    79.99   67.70   79.90   −       −       −
4-tuple    78.95   71.03   71.01   78.58   −       −
5-tuple    75.94   70.19   71.83   70.24   76.63   −
6-tuple    74.72   69.65   69.72   69.71   69.61   74.06

Table 4. The table shows the comparison between the CPU run-time (in seconds) for the feature extractor, encoder, and decoder of the proposed networks trained on the datasets of 3-tuple, 4-tuple, 5-tuple, and 6-tuple of frames.

Sequence   Feature extractor   Encoder   Decoder
3-tuple    0.127               0.019     0.019
4-tuple    0.161               0.024     0.025
5-tuple    0.192               0.042     0.130
6-tuple    0.234               0.051     0.168

5. Conclusion and future work

In this paper, we present a novel technique for temporally sequencing a given unordered set of images. We formulate the problem as a sequence-to-sequence mapping task in which each image of a given sequence is mapped to its position in the ordered sequence. We propose an LSTM-based neural network architecture for the task. We experimentally show that our technique provides state-of-the-art results for the problem of sequencing. We also experimentally show that the network learns better when the target sequence is reversed.

There is still scope for improvement in terms of order prediction accuracy. In this work, we have shown results up to 6-tuples. However, in practice, there could be more than 10 images in the given set. Even a 10-tuple has 10!/2 = 1,814,400 possibilities if we consider forward and backward permutations as a single class. If the training data has 80K samples, then we would have to train the network by feeding 80K × 1,814,400 = 145,152M combinations, which would take a huge amount of time. As future work, we would like to improve the prediction accuracy by making modifications to the proposed network. We would also like to propose a technique for learning the task of sequencing using small image sequences and using the learned knowledge for recursively ordering larger image sequences.

Acknowledgment

Gagan Kanojia was supported by a TCS Research Scholarship.

References

[1] H. Agrawal, A. Chandrasekaran, D. Batra, D. Parikh, M. Bansal, Sort story: sorting jumbled images and captions into stories, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 925–931.
[2] H. Averbuch-Elor, D. Cohen-Or, RingIt: ring-ordering casual photos of a temporal event, ACM Trans. Gr. (TOG) 34 (3) (2015) 33.
[3] R. Baker, M. Dexter, T.E. Hardwicke, A. Goldstone, Z. Kourtzi, Learning to predict: exposure to temporal sequences facilitates prediction of future events, Vis. Res. 99 (2014) 124–133.
[4] T. Basha, Y. Moses, S. Avidan, Photo sequencing, in: Proceedings of the European Conference on Computer Vision, Springer, 2012, pp. 654–667.
[5] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5 (2) (1994) 157–166.
[6] L. Chen, Y. He, L. Fan, Let the robot tell: describe car image with natural language via LSTM, Pattern Recognit. Lett. 98 (2017) 75–82.
[7] Y. Chherawala, P.P. Roy, M. Cheriet, Combination of context-dependent bidirectional long short-term memory classifiers for robust offline handwriting recognition, Pattern Recognit. Lett. 90 (2017) 58–64.
[8] A. Cleeremans, J.L. McClelland, Learning the structure of event sequences, J. Exp. Psychol. Gen. 120 (3) (1991) 235.
[9] T. Dekel, Y. Moses, S. Avidan, Space-time tradeoffs in photo sequencing, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2013, pp. 977–984.

[10] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[11] F.A. Gers, E. Schmidhuber, LSTM recurrent networks learn simple context-free and context-sensitive languages, IEEE Trans. Neural Netw. 12 (6) (2001) 1333–1340.
[12] I. Goodfellow, Y. Bengio, A. Courville, Y. Bengio, Deep Learning, 1, MIT Press, Cambridge, 2016.
[13] K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, J. Schmidhuber, LSTM: a search space odyssey, IEEE Trans. Neural Netw. Learn. Syst. 28 (10) (2017) 2222–2232.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 770–778.
[15] R.G. Hefron, B.J. Borghetti, J.C. Christensen, C.M.S. Kabban, Deep long short-term memory structures model temporal dependencies improving cognitive workload estimation, Pattern Recognit. Lett. 94 (2017) 96–104.
[16] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[17] S. Hochreiter, J. Schmidhuber, LSTM can solve hard long time lag problems, in: Proceedings of the Advances in Neural Information Processing Systems, 1997, pp. 473–479.
[18] T.-H. K. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra, et al., Visual storytelling, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1233–1239.
[19] G. Kanojia, S.R. Malireddi, S.C. Gullapally, S. Raman, Who shot the picture and when?, in: Proceedings of the International Symposium on Visual Computing, Springer, 2014, pp. 438–447.
[20] G. Kim, E.P. Xing, Reconstructing storyline graphs for image recommendation from web community photos, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2014, pp. 3882–3889.
[21] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv:1412.6980, 2014.
[22] H.-Y. Lee, J.-B. Huang, M. Singh, M.-H. Yang, Unsupervised representation learning by sorting sequences, in: Proceedings of the IEEE International Conference on Computer Vision, IEEE, 2017, pp. 667–676.
[23] H. Li, A short introduction to learning to rank, IEICE Trans. Inf. Syst. 94 (10) (2011) 1854–1862.
[24] Y. Liu, J. Wang, X. Wang, Learning to recognize opinion targets using recurrent neural networks, Pattern Recognit. Lett. 106 (2018) 41–46.
[25] I. Mani, B. Schiffman, Temporally anchoring and ordering events in news, Time and Event Recognition in Natural Language, John Benjamins, 2005.


[26] I. Misra, C.L. Zitnick, M. Hebert, Shuffle and learn: unsupervised learning using temporal order verification, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 527–544.
[27] H. Mobahi, R. Collobert, J. Weston, Deep learning from temporal coherence in video, in: Proceedings of the Twenty-Sixth Annual International Conference on Machine Learning, ACM, 2009, pp. 737–744.
[28] M. Noroozi, P. Favaro, Unsupervised learning of visual representations by solving jigsaw puzzles, in: Proceedings of the European Conference on Computer Vision, Springer, 2016, pp. 69–84.
[29] L.C. Pickup, Z. Pan, D. Wei, Y. Shih, C. Zhang, A. Zisserman, B. Scholkopf, W.T. Freeman, Seeing the arrow of time, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2035–2042.
[30] F. Sadeghi, J.R. Tena, A. Farhadi, L. Sigal, Learning to select and order vacation photographs, in: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2015, pp. 510–517.
[31] R. Santa Cruz, B. Fernando, A. Cherian, S. Gould, DeepPermNet: visual permutation learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3949–3957.
[32] N. Si, H. Wang, Y. Shan, Exploring global sentence representation for graph-based dependency parsing using BLSTM-SCNN, Pattern Recognit. Lett. 105 (2017) 96–104.
[33] K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human action classes from videos in the wild, CRCV-TR-12-01, November 2012.
[34] N. Srivastava, E. Mansimov, R. Salakhudinov, Unsupervised learning of video representations using LSTMs, in: Proceedings of the Thirty-Second Annual International Conference on Machine Learning, 2015, pp. 843–852.
[35] R. Sun, C.L. Giles, Sequence learning: from recognition and prediction to sequential decision making, IEEE Intell. Syst. 16 (4) (2001) 67–70.
[36] M. Sundermeyer, R. Schlüter, H. Ney, LSTM neural networks for language modeling, in: Proceedings of the Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[37] I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in: Proceedings of the Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[38] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko, Sequence to sequence - video to text, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4534–4542.
[39] C.-Y. Wu, A. Ahmed, A. Beutel, A.J. Smola, H. Jing, Recurrent recommender networks, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, ACM, 2017, pp. 495–503.
[40] Y. Zhou, T.L. Berg, Temporal perception and prediction in ego-centric video, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4498–4506.