Neurocomputing 312 (2018) 154–164, https://doi.org/10.1016/j.neucom.2018.05.086

Deep sequential fusion LSTM network for image description

Pengjie Tang a,b,d, Hanli Wang a,b,c,∗, Sam Kwong e

a Department of Computer Science and Technology, Tongji University, Shanghai 201804, PR China
b Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, PR China
c Shanghai Engineering Research Center of Industrial Vision Perception and Intelligent Computing, Shanghai 200092, PR China
d College of Mathematics and Physics, Jinggangshan University, Ji’an 343009, PR China
e Department of Computer Science, City University of Hong Kong, Hong Kong, PR China

∗ Corresponding author at: Department of Computer Science and Technology, Tongji University, Shanghai 201804, PR China. E-mail address: [email protected] (H. Wang).

Article history: Received 15 November 2017; Revised 24 March 2018; Accepted 25 May 2018; Available online 29 May 2018. Communicated by Dr. Xinmei Tian.

Keywords: Image description; Long short term memory network; Layer-wise optimization; Deep supervision; Deep sequential fusion

Abstract: It is a challenging task to perform automatic image description, which aims to translate the visual information of an image into natural language that conforms to proper grammar and sentence structure. In this work, an optimal learning framework called the deep sequential fusion based long short term memory network is designed. In the proposed framework, a layer-wise strategy is introduced into the generation process of the recurrent neural network to increase the depth of the language model and produce more abstract and discriminative features. Then, a deep supervision method is developed to enrich the model capacity with extra regularization. Moreover, the prediction scores from all of the auxiliary branches in the language model are fused into the final decision output with the product rule, which further exploits the optimized model parameters and hence boosts performance. The experimental results on two public benchmark datasets verify the effectiveness of the proposed approaches, with the consensus-based image description evaluation metric (CIDEr) reaching 103.4 on the MSCOCO dataset and the metric for evaluation of translation with explicit ordering (METEOR) reaching 20.6 on the Flickr30K dataset. © 2018 Elsevier B.V. All rights reserved.

1. Introduction

It is challenging to generate descriptions or captions for images, as both natural language processing and computer vision techniques play important roles. In order to describe an image with natural language, one is required to understand the image contents correctly, e.g., scenes, objects, attributes, actions and relations of objects, and then to generate the corresponding sentences with correct words, proper grammar and suitable structures. Image description has wide application prospects such as early childhood education, assistance for visually impaired people, human-computer interaction, etc. A number of approaches have been proposed for image description, including template based models [1–4], semantic transfer based strategies [5–8], neural language based methods [9–15] and hybrid approaches [16–19]. Nowadays, the neural language based methods and hybrid approaches have become the most popular solutions to this task, along with the successes of deep learning, especially the convolutional neural network (CNN) model, in speech recognition [20,21] and a series of visual tasks [22–27].


However, the language processing part in neural language based models and hybrid approaches is usually insufficiently optimized, even though CNN models offer abstract and discriminative visual features. For instance, the long short term memory (LSTM) networks that are usually employed to construct the language model are often so shallow that there are not enough non-linear transformation layers to fuse the multi-modal information. As a result, the resultant sentences are often semantically poor and the model performance is difficult to improve further. To address these issues, we explore building a deeper LSTM network to act as the language processing model, leading to the proposed layer-wise and deep supervision based LSTM network for image description. With the proposed methods, the model parameters of both the vision and language processing parts are optimized jointly. Inspired by the concept of layer-wise optimization [28–30], the proposed framework first generates a shallow LSTM network, and then adds new LSTM layers which are retrained based on the previously optimized model. Meanwhile, in order to prevent the model from over-fitting and falling into local optima, the technique of deep supervision such as [31,32] is introduced into the proposed model, so that the previous auxiliary branches and objective functions are retained and retrained together with the newly added layers and objective functions. In addition, a sequential voting strategy is employed to further boost performance.

Unlike previous works which fuse prediction scores from different models in an additive manner, the sequential scores from each auxiliary branch, from bottom to top, are fused in this work with the product rule [33,34]. State-of-the-art results are achieved by the proposed approaches on the benchmark MSCOCO [35] and Flickr30K [36] datasets. The main contributions of this work are given below.

• A method to establish deeper LSTM networks with layer-wise optimization is proposed to improve the semantic quality of the generated sentences, so that the multi-modal information including vision and language is transformed more sufficiently with more LSTM layers.
• During layer-wise optimization, deep supervision is employed to offer extra regularization that prevents deeper LSTM models from over-fitting and local optima, so that the LSTM parameters can be optimized effectively.
• A voting strategy for word prediction is designed to fuse the deep sequential scores based on the product rule, and the fused score from all auxiliary branches is utilized in deeper LSTM networks.

The rest of this paper is organized as follows. The related works are discussed in Section 2, including the fundamentals of CNN and LSTM, as well as the current popular methods for image description. In Section 3, the motivation and the details of the proposed approaches are introduced. Next, experiments and results are presented in Section 4 to show the performances under different settings and the comparison with state-of-the-art methods. Finally, Section 5 concludes this work.

2. Related works

2.1. CNN model

Nowadays, CNN has become one of the most popular technologies in computer vision because of a series of breakthroughs on various visual tasks. It simulates the process of visual perception through convolution operations analogous to the receptive fields of human vision, and it employs the strategies of parameter sharing and pooling to reduce the number of involved neurons and connections. As the original information is transformed a number of times via linear or non-linear transformations, the features extracted by CNN gradually become abstract and discriminative. Moreover, the techniques of Dropout [37] and DropConnect [38] have been developed to suppress over-fitting to some extent, and several effective methods such as local response normalization (LRN) [22] and batch normalization (BN) [39] have been proposed for feature normalization and speeding up training convergence. There are a number of benchmark CNN models in the literature. AlexNet [22] was first designed for large-scale image classification and achieved an astonishing success in the ImageNet competition. Hereafter, several elaborately designed CNN models were developed, such as GoogLeNet [23], VGG-Net [24], ZF-Net [40] and the recent ResNet [25]. The progress on CNN models reveals that the deeper a CNN model is, the stronger the generalization ability of the extracted features becomes and the better the performance is. Inspired by this trend, an effective mechanism is designed in this work to incorporate more layers into the LSTM network in order to pursue better performance on image description.

2.2. LSTM model

LSTM is a special recurrent neural network (RNN), which is designed to achieve long term dependency and to overcome the limited memory ability caused by gradient vanishing or explosion [41] in traditional RNN models.

Fig. 1. The classical LSTM unit employed in this work.

In general, there are three types of gates employed by LSTM, including the forget gate, the input gate and the output gate. The forget gate is designed to retain or delete the information from the previous status, the input gate controls the information of the current input status and the previous output status, and the goal of the output gate is to control the output according to the current status. An illustration of a typical LSTM structure is shown in Fig. 1, where $x_t$ and $h_{t-1}$ represent the original input at the time step of $t$ and the output from the previous time step, respectively; $(w_f, b_f)$, $(w_i, b_i)$, $(w_o, b_o)$ and $(w_c, b_c)$ denote the weights and biases for the forget gate, the input gate, the output gate and the state unit "Cell", respectively; $\sigma$ stands for the sigmoid activation function; and meanwhile, for updating the state unit, the intermediate status $\tilde{c}_t$ is also required. In the process of back propagation, the memory cell plays an important role in transferring the remembered information to the previous status. Currently, there are also a few variants of LSTM such as the gated recurrent unit (GRU) [42], guiding LSTM (gLSTM) [18] and so on. In this work, the classical LSTM unit is employed, whose standard update equations are recalled below for reference.
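With the notation above, the textbook formulation of the classical LSTM cell in Fig. 1 can be written as follows (this is the standard form of the update equations, not a transcription of the authors' implementation, whose exact parameterization may differ slightly):

$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i), \quad o_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o),$
$\tilde{c}_t = \tanh(w_c \cdot [h_{t-1}, x_t] + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t),$

where $[h_{t-1}, x_t]$ denotes the concatenation of the previous output and the current input, and $\odot$ is element-wise multiplication.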

2.3. Image description

The aim of image description is to represent the content of an image with natural language that is in line with the habits of human expression. Different from previous image understanding tasks such as image retrieval [43,44], clustering [45,46], annotation and tagging [43,45,47,48], the task of image description detects a full range of information in the candidate image and generates natural sentences with correct grammar and decent structure. In earlier works, researchers employed template based methods [1–4] and semantic transfer based methods [5–8] to achieve this goal. The template based methods usually require a fixed, structured sentence template as the basis, and then employ multiple technologies such as recognition of scenes, objects and actions to grasp the meaningful details of an image. After that, the corresponding words are filled into the fixed template to generate sentences. However, the resultant sentences of this type of method are usually rigid, with large gaps from the reference sentences annotated by humans. On the contrary, the sentences produced by the semantic transfer approach are more flexible; however, this kind of method depends too much on the query database, and consequently there are large deviations between the generated sentences and the reference sentences when there are no images similar to the target image in the reference database.

Inspired by machine translation, encoding-decoding based techniques have been proposed recently [9–15], in which the target image is regarded as the source language and is encoded by extracting its visual features. Then, conditional random field (CRF) or RNN models are usually employed to decode the visual features and generate candidate sentences. In the encoding part of the pipeline, CNN plays an irreplaceable role, and the visual features extracted by CNN are fed to the RNN for training together with the embedded features of the reference sentences. One of the well-known works is m-RNN [12], in which the CNN features of the image and the embedding features of the sentences are combined as a multi-modal input and fed to the RNN for learning. Furthermore, the probability distribution of the words of an image is employed in a multi-modal representation with the log-bilinear neural language model [13]. In a similar manner, the CNN features extracted from AlexNet are fed into LSTM directly [14] for image description, and impressive performances are achieved due to the stronger memory ability of LSTM as compared with the traditional RNN. Analogously in [15], LSTM is applied as the language model by the neural image caption (NIC) method; however, the CNN features extracted from the GoogLeNet model are only fed to the first time step rather than all the time steps in LSTM, which can be further improved. Besides the neural language based encoding-decoding methods, there are a number of hybrid approaches [16–19] which employ more advanced vision technologies such as object recognition and saliency detection. In [16], the method of region-based CNN (RCNN) is introduced to make use of the semantic information of the target image, by which the objects are first detected and then the relation between the vision model and the language model is established by the inner product of the RCNN and RNN outputs. Since salient regions in an image more easily catch human attention and description sentences generally focus on these pieces of information, Xu et al. propose to apply the CNN features of salient regions to LSTM for image captioning [17]. Accordingly, the resultant sentences are decent and contain more semantic information, particularly for images with complex backgrounds. Nevertheless, in [18], the researchers argue that hybrid approaches such as [16,17] focus too much on local information of images and ignore the global information, so that the positional relations of different objects in an image become less accurate. To overcome this problem, a model named gLSTM is proposed in [18], which constructs the relation between an image and its reference sentences as the global semantic information to guide image captioning. Moreover, the semantic attributes and the multi-label training strategy are utilized in [19] for image description. However, even in the up-to-date neural language based methods and hybrid approaches, the language models are usually shallow, with only one or two LSTM layers, which is desired to be further improved in order to produce more abstract and representative multi-modal features. In this work, we follow the encoding-decoding pipeline to build the basic image description framework, but aim to address the difficulty of constructing deeper language models, which is achieved by the proposed layer-wise optimization, deep supervision and deep sequential fusion.

3. Proposed methods
3.1. Motivation

It is well known that depth is a key factor affecting the performance of deep learning models such as CNN. As mentioned before, few research efforts have been devoted to strengthening the capacity of multi-modal features by exploring deep LSTM models for image description. Currently, only one or two layers are designed for the LSTM language model, and it is easy to fall into a local optimum when more layers are incorporated into the existing model architectures due to gradient vanishing. In order to illustrate the effect of LSTM depth on image description, the following three model configurations are evaluated for comparison. In the first configuration, a two-layer LSTM network is deployed as the basic module, in which the embedding features of words are fed to the first layer, the output of the first layer and the CNN features extracted by GoogLeNet are fed to the second layer for multi-modal learning, and then a fully connected layer and a softmax layer are applied to generate the probability of each word in the vocabulary. In the second and third configurations, two and three of the above basic modules are stacked to generate deeper LSTM networks, i.e., four-layer and six-layer LSTMs. These three model configurations are evaluated on the benchmark MSCOCO dataset [35], with the performance comparison shown in Fig. 2. From Fig. 2, it can be seen that when more layers are used for LSTM, it becomes more difficult to reduce the loss value (Fig. 2(a)), and the performances in B-4 [49] and CIDEr [50] consequently degrade (Fig. 2(b)). As one of the widely used evaluation metrics for image description, B-4 counts the 4-gram matches between the reference sentence and the generated candidate sentence, counts all the 4-grams in the candidate sentence, and then calculates the ratio of the above two numbers; a minimal sketch of this matching is given at the end of this subsection. Regarding CIDEr, it is another popular evaluation criterion, which employs the concept of consensus and computes the matching degree between the reference sentence set and the candidate sentence. Generally speaking, the higher these two metrics are, the better the coherence and the richer the semantics of the generated sentence. The results in Fig. 2 indicate that it is harder to optimize deeper LSTM models, since the errors from high layers cannot be effectively transferred to low layers with back propagation, and thus the parameters in low layers are not optimized sufficiently. A similar conclusion is also made in [14], where a two-layer LSTM network outperforms its four-layer counterpart. In order to solve the problems caused by the increase in LSTM depth as mentioned above, three novel techniques are designed in this work. First, a layer-wise optimization approach is introduced to learn the multi-modal parameters of deep LSTM networks effectively. Second, a deep supervision method is developed to provide more regularization for the entire deep model to further avoid over-fitting and local optima. Third, a product-rule based deep sequential fusion strategy is conceived to determine the final output word at each time step to further improve the quality of the description sentence. Before introducing these three proposed techniques, the underlying framework for image description is described in the following Section 3.2, which is employed to train both the CNN and LSTM networks integrally.
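To make the above description of B-4 concrete, the following minimal Python sketch computes the clipped 4-gram precision between a candidate and a single reference; it is only an illustration and omits the corpus-level aggregation over 1- to 4-grams, the multiple references, and the brevity penalty used by the official BLEU implementation.

```python
from collections import Counter

def four_gram_precision(candidate, reference):
    """Clipped 4-gram precision between a candidate and one reference sentence."""
    def ngrams(tokens, n=4):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    matched = sum(min(cnt, ref[g]) for g, cnt in cand.items())  # clipped 4-gram matches
    total = max(sum(cand.values()), 1)                          # all candidate 4-grams
    return matched / total

# Toy example: 3 of the 5 candidate 4-grams also appear in the reference -> 0.6
print(four_gram_precision("a man riding a horse on the beach",
                          "a man is riding a horse on the beach"))
```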

3.2. Joint modeling of CNN and LSTM

The general framework for image description usually consists of CNN and LSTM networks for visual and linguistic modeling, respectively, and the following two steps are used to optimize the model parameters. At the first step, a CNN model is trained on a large-scale image dataset to generate a pre-trained model, and then this pre-trained model is employed for feature extraction to encode an image with visual features. Afterwards, the visual features are fed to the LSTM network for parameter optimization with the following cost function:



$\mathcal{L}\big((X_d, S); (\theta_{cnn}, \theta_{lm})\big) = \mathcal{L}_{cnn}\big((X_s, Y_s); \theta_{cnn}\big) + \mathcal{L}_{lm}\big(\big(G(X_d, \theta_{cnn}) \rightarrow \mathbb{R}^{D_{cnn}\times|X_d|}, S\big); \theta_{lm}\big), \qquad (1)$

Fig. 2. Illustration of the effect of LSTM depth on the performance of the MSCOCO dataset with different layer configurations. (a) Training loss comparison. (b) B-4 and CIDEr comparison.

where $\mathcal{L}_{cnn}$ and $\mathcal{L}_{lm}$ stand for the cost functions to optimize CNN and LSTM, respectively; $G$ is the CNN model; $X_s$ and $X_d$ denote the input images in the source reference dataset and in the target description dataset, respectively; $Y_s$ is the corresponding image label of $X_s$; $\theta_{cnn}$ and $\theta_{lm}$ represent the parameter sets of the CNN model and the LSTM network; $D_{cnn}$ is the feature dimension and $|X_d|$ represents the number of source images in the target description dataset. In addition, $S$ is the set of the reference sentences of $X_d$. During training, the cross entropy function is employed and $\mathcal{L}_{cnn}$ can be written as

$\mathcal{L}_{cnn}\big((X_s, Y_s); \theta_{cnn}\big) = -\frac{1}{N_s}\sum_{k=1}^{N_s} Y_s^k \log\big(p_s^k \,\big|\, X_s^k; \theta_{cnn}\big), \qquad (2)$
where $N_s$ is the number of training samples in the source reference dataset, and $p_s^k$ is the output probability vector of the $k$th image produced by the CNN model $G$ under the condition of $(X_s^k; \theta_{cnn})$. Regarding $\mathcal{L}_{lm}$, it can be considered as the sum of all the errors of the predicted words. For notational simplicity, let $F \triangleq G(X_d, \theta_{cnn}) \rightarrow \mathbb{R}^{D_{cnn}\times|X_d|}$, and $\mathcal{L}_{lm}$ can be formulated as

$\mathcal{L}_{lm}\big((F, S); \theta_{lm}\big) = -\frac{1}{|X_d|}\sum_{k=1}^{|X_d|}\frac{1}{L_k}\sum_{t=1}^{L_k} T(S_k^t)\log\big(p_k^t \,\big|\, (F_k, S_k^{t-1}); \theta_{lm}\big), \qquad (3)$

where $L_k$ is the length of the reference sentence corresponding to the $k$th image. The function $T(\cdot)$ is used to map each word to the corresponding dummy label of the target word in the vocabulary $V$. For the $k$th image, the probability vector of the output word at the time step of $t$ under the condition of $((F_k, S_k^{t-1}); \theta_{lm})$ is denoted as $p_k^t$, and it can be calculated by

$p_k^t = \frac{1}{\sum_{l=1}^{|V|}\exp\big((\theta_l)^T E(S_k^{t-1})\big)}\begin{pmatrix}\exp\big((\theta_1)^T E(S_k^{t-1})\big)\\ \exp\big((\theta_2)^T E(S_k^{t-1})\big)\\ \vdots\\ \exp\big((\theta_{|V|})^T E(S_k^{t-1})\big)\end{pmatrix}, \qquad (4)$

where $[\theta_1, \theta_2, \ldots, \theta_{|V|}]^T$ is the parameter vector in $\theta_{lm}$, $E(\cdot)$ is used to encode the previous output word $S_k^{t-1}$ into a one-hot vector, and $S_k^{t-1}$ can be obtained by

$S_k^{t-1} = D\big(\max_{1:|V|}\{p_k^{t-1}\}\big), \qquad (5)$

where $D(\cdot)$ is the function that picks the word with the maximum probability from the vocabulary $V$. The objective of the entire system formulated in Eq. (1) is to minimize both $\mathcal{L}_{cnn}$ and $\mathcal{L}_{lm}$ through the optimization of $\theta_{cnn}$ and $\theta_{lm}$, which can be achieved by end-to-end strategies such as [51,52]. In general, a pre-trained model optimized on the large-scale source reference dataset is employed to initialize the CNN model. Then, the LSTM model is trained on the target description dataset with both images and sentences, and meanwhile the CNN model is further refined. During the whole process, let $\theta = (\theta_{cnn}, \theta_{lm})$; the parameters of both the CNN model and the LSTM model are optimized jointly by back propagation with the following cost function to be minimized for the $k$th image in one iteration:

$\mathcal{L}\big((X_d^k, S_k); \theta\big) = -\frac{1}{L_k}\sum_{t=1}^{L_k} C(S_k^t)\log\big(p_k^t \,\big|\, (X_d^k, S_k^{t-1}); \theta\big), \qquad (6)$

where the function $C(\cdot)$ is to map the word $S_k^t$ to the dummy label for the $k$th target image $X_d^k$.
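As an illustration of the word-level cross-entropy in Eqs. (3) and (6), the following PyTorch-style sketch builds the two-layer basic module described in Section 3.1 (the first layer on word embeddings, the second layer on the concatenation of the first-layer output and the CNN feature fed at every time step) and scores its per-step word logits with a cross-entropy loss. It is a simplified approximation under assumed names and shapes, not the authors' Caffe implementation.

```python
import torch
import torch.nn as nn

class CaptionModule(nn.Module):
    """One factored two-layer LSTM module: layer 1 sees word embeddings,
    layer 2 sees [layer-1 output, CNN feature] at every time step."""
    def __init__(self, vocab_size, embed_dim=512, hidden=1000, cnn_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden + cnn_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, cnn_feat, words):
        # cnn_feat: (B, cnn_dim), words: (B, T) indices of the previous words
        h1, _ = self.lstm1(self.embed(words))
        feat = cnn_feat.unsqueeze(1).expand(-1, words.size(1), -1)
        h2, _ = self.lstm2(torch.cat([h1, feat], dim=-1))
        return self.classifier(h2)                     # (B, T, |V|) word logits

# Word-level cross-entropy as in Eqs. (3)/(6), averaged over the T time steps.
model = CaptionModule(vocab_size=8800)
logits = model(torch.randn(4, 1024), torch.randint(0, 8800, (4, 20)))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 8800), torch.randint(0, 8800, (4 * 20,)))
```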

3.3. Layer-wise optimized LSTM

Although the joint modeling of CNN and LSTM is able to generate semantically meaningful and well-structured sentences, it is difficult to further improve the description performance when deeper LSTM networks are used, because of gradient vanishing and insufficient optimization of low layer parameters. To address this problem, a layer-wise optimization technique is proposed herein to deepen LSTM networks. In order to illustrate the proposed layer-wise optimization technique, a four-layer LSTM model is built as an example, as shown in Fig. 3, and then its general form with multiple layers is derived. As shown in Fig. 3(a), we first employ a factored model as the basic module (denoted as "M1" for subsequent discussions), which includes two LSTM layers, during the first stage of training. In this basic module, the first LSTM layer is applied as a single language model with the input embedding feature of each word, while the purpose of the second LSTM layer is to build a multi-modal model by combining the CNN feature and the output of the first LSTM layer. The training of this basic module follows the strategy of joint modeling of CNN and LSTM networks as presented in Section 3.2. Then, the proposed layer-wise optimization technique is implemented to build a two-module LSTM network with four layers based upon the aforementioned basic module, as shown in Fig. 3(b), in which another factored module (denoted as "M2") is added on top of the basic module, and its first layer receives the output of the second layer of the basic module. This refined model, including the CNN network, the previously trained basic module and the two newly added LSTM layers, is optimized in a jointly fine-tuned manner during the second stage of training. In a similar way, deeper LSTM networks with more than four layers can be designed with more stages of training. In general, the cost function at the nth stage can be formulated as


Fig. 3. Illustration of a four-layer LSTM structure with the proposed layer-wise optimization technique, where $S_d^t$ and $S_c^t$ stand for the reference word and the output candidate word at the $t$th time step, "BoS" and "EoS" represent the beginning and ending word tokens, and the function $E(S_d^t, S_c^t)$ calculates the error between the reference word $S_d^t$ and the output word $S_c^t$.



$\mathcal{L}^n\big((X_d^k, S_k); (\theta_{pre}, \theta_{new})^n\big) = -\frac{1}{L_k}\sum_{t=1}^{L_k} T(S_k^t)\log\big(p_k^t \,\big|\, (X_d^k, S_k^{t-1}); (\theta_{pre}, \theta_{new})^n\big), \qquad (7)$

where $\theta_{pre}$ and $\theta_{new}$ represent the parameter sets of the previous stage and of the newly added LSTM layers, respectively. After defining $\theta_{pre}^1 = \theta_{cnn}$ and $\theta_{new}^1 = \theta_{lm}^1$, we have $\theta_{pre}^2 = (\theta_{cnn}, \theta_{lm}^1) = (\theta_{pre}^1, \theta_{lm}^1)$ and $\theta_{lm}^n = (\theta_{lm}^{n-1}, \theta_{new}^n)$. And $\theta_{pre}$ at the $n$th stage can be obtained as

$\theta_{pre}^n = \big(\theta_{pre}^{n-1}, \theta_{new}^{n-1}\big) = (\theta_{pre}, \theta_{new})^{n-1}. \qquad (8)$

When looking into the details of the parameter update at the $j$th stage when the total number of stages is $n$, we have

$(\theta_{pre}, \theta_{new})^j = (\theta_{pre}, \theta_{new})^j - \eta_j\,\frac{\alpha_n \cdot \partial \mathcal{L}^n\big((X_d^k, S_k); (\theta_{pre}, \theta_{new})^n\big)}{\partial (\theta_{pre}, \theta_{new})^j}, \qquad (9)$

where $\eta_j$ is a decay factor to regulate the learning rate at the $j$th stage, which is in the range of $[0, 1]$ when $j < n$ and is equal to 1 when $j = n$; $\alpha_n$ is the learning rate at the $n$th stage, and the gradient computation at the top layer (i.e., the fully connected layer for word classification) is

$\frac{\partial \mathcal{L}^n\big((X_d^k, S_k); (\theta_{pre}, \theta_{new})^n\big)}{\partial (\theta_{pre}, \theta_{new})^n} = -\frac{1}{L_k}\sum_{t=1}^{L_k}\Big[p_k^t \,\big|\, \big((X_d^k, S_k); (\theta_{pre}, \theta_{new})^n\big) - C(S_k^t)\Big]. \qquad (10)$
Based on Eqs. (9) and (10), it can be observed that the optimized parameters of low layers provide initialization for those of high layers, and then all the model parameters are further refined as the depth of the language model is gradually increased.

3.4. Layer-wise optimized LSTM with deep supervision

The layer-wise optimization technique is able to deepen the LSTM based language model and thus boost the semantic capacity of the resultant sentences. However, as the depth of the LSTM network increases, the number of model parameters grows dramatically, which makes the model prone to over-fitting and local optima. To address this emerging issue, a deep supervision approach is further designed based upon our layer-wise LSTM model. For discussion convenience, we also take the four-layer LSTM model (i.e., two modules) as an example to introduce the proposed deep supervision approach, as illustrated in Fig. 4.

Fig. 4. Illustration of the proposed deep supervision approach based on a two-module (i.e., four-layer LSTM) structure, where $S_d^{0:T-1}$ is the input word sequence of a reference sentence.

As shown, an auxiliary branch is virtually built to represent the objective error function of each module, e.g., Branch-1 corresponds to the output error function of M1 and Branch-2 relates to M2. During the first stage of training, only the error function of Branch-1 is used to update the model parameters of M1, while during the second stage of training, both the error functions of Branch-1 and Branch-2 are employed to update the parameters of the entire model, including M1 and M2. As an extension, if there are n modules, then a total of n error functions (i.e., Branch-1, Branch-2, . . . , Branch-n) will be utilized to optimize the entire model, with the cost function of the entire model written as
$\mathcal{L}^n\big((X_d^k, S_k); (\theta_{pre}, \theta_{new})^n\big) = \sum_{j=1}^{n} \mathcal{L}^j\big((X_d^k, S_k); (\theta_{pre}, \theta_{new})^j\big) = -\frac{1}{L_k}\sum_{j=1}^{n}\sum_{t=1}^{L_k} T(S_k^t)\log\big((p_k^t)^j \,\big|\, (X_d^k, S_k^{t-1}); (\theta_{pre}, \theta_{new})^j\big). \qquad (11)$
Since there are n − j objective error functions after the jth stage, the number of parameter update operations for the jth module will be n − j + 1, and this process can be formulated as

$(\theta_{pre}, \theta_{new})^j = (\theta_{pre}, \theta_{new})^j - \eta_j \sum_{l=j}^{n}\frac{\alpha_l \cdot \partial \mathcal{L}^l\big((X_d^k, S_k); (\theta_{pre}, \theta_{new})^l\big)}{\partial (\theta_{pre}, \theta_{new})^j}, \qquad (12)$

Fig. 5. Illustration of the proposed deep sequential probability score fusion.

where the gradients at the jth stage can be calculated by



$\frac{\partial \mathcal{L}^j\big((X_d^k, S_k); (\theta_{pre}, \theta_{new})^j\big)}{\partial (\theta_{pre}, \theta_{new})^j} = -\frac{1}{L_k}\sum_{l=j}^{n}\sum_{t=1}^{L_k}\Big[(p_k^t)^l \,\big|\, \big((X_d^k, S_k); (\theta_{pre}, \theta_{new})^l\big) - C(S_k^t)\Big]. \qquad (13)$
According to Eq. (12), the number of parameter update operations in low layers increases as the model depth grows. The supervision from low layers to high layers makes the parameters of the low layers optimized sufficiently, and it also provides extra regularization for the entire model because of the disturbance introduced by the low layer branches as well as their corresponding objective functions.
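To illustrate how layer-wise growth and deep supervision fit together, the sketch below stacks factored two-layer modules stage by stage and sums one cross-entropy branch loss per module, as in Eq. (11). It is a PyTorch-style approximation under illustrative assumptions (the module and variable names, the omission of CNN fine-tuning, and the omission of the stage-wise decay factors and learning rates of Eqs. (9) and (12)), not the authors' Caffe implementation.

```python
import torch
import torch.nn as nn

class FactoredModule(nn.Module):
    """Two LSTM layers plus an auxiliary word classifier (one supervision branch)."""
    def __init__(self, in_dim, cnn_dim, hidden, vocab):
        super().__init__()
        self.lstm1 = nn.LSTM(in_dim, hidden, batch_first=True)
        self.lstm2 = nn.LSTM(hidden + cnn_dim, hidden, batch_first=True)
        self.branch = nn.Linear(hidden, vocab)       # auxiliary Branch-j classifier

    def forward(self, x, cnn_feat):
        h1, _ = self.lstm1(x)
        feat = cnn_feat.unsqueeze(1).expand(-1, x.size(1), -1)
        h2, _ = self.lstm2(torch.cat([h1, feat], dim=-1))
        return h2, self.branch(h2)                   # features for the next module, logits

def train_stage(modules, embed, cnn_feat, words, targets, lr=0.01):
    """One optimization step at stage n: every existing branch supervises the model."""
    params = list(embed.parameters()) + [p for m in modules for p in m.parameters()]
    opt = torch.optim.SGD(params, lr=lr)             # fresh optimizer keeps the sketch short
    x, loss = embed(words), 0.0
    for module in modules:                           # bottom-to-top pass, Eq. (11)
        x, logits = module(x, cnn_feat)
        loss = loss + nn.CrossEntropyLoss()(logits.flatten(0, 1), targets.flatten())
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 1: train the basic module M1; Stage 2: append M2 and fine-tune everything.
vocab, hidden, cnn_dim = 8800, 1000, 1024
embed = nn.Embedding(vocab, 512)
modules = [FactoredModule(512, cnn_dim, hidden, vocab)]
words, targets = torch.randint(0, vocab, (4, 20)), torch.randint(0, vocab, (4, 20))
train_stage(modules, embed, torch.randn(4, cnn_dim), words, targets)
modules.append(FactoredModule(hidden, cnn_dim, hidden, vocab))   # layer-wise growth
train_stage(modules, embed, torch.randn(4, cnn_dim), words, targets)
```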

3.5. Deep sequential fusion

Deep LSTM networks can improve the semantic quality of the generated sentences, and it is found that the gaps among the results from the auxiliary branches are small. Inspired by the concept of classifier ensemble, the outputs from the auxiliary branches can be fused for performance improvement. In the following, we also take the two-module model to present the proposed score fusion strategy. As shown in Fig. 5, M1−t and M2−t stand for the status of the M1 and M2 modules at the $t$th time step, $p_1^t$ and $p_2^t$ are the output probability scores from M1 and M2 at the $t$th time step, $S_c^t$ is the output from the top module with layer-wise optimization and deep supervision, and $(S_c^t)_f$ denotes the final output at the $t$th time step after the score fusion of M1 and M2. In order to achieve deep sequential score fusion, the product rule [33,34] is utilized, which has been proved to be superior to the traditional addition rule for CNN output score fusion [53]. For the $t$th time step at the $j$th stage, let the feature space be $\chi_j^t$; it can be derived as

$\chi_j^t = \mathcal{Q}\big(\chi_{j-1}^t\big), \qquad (14)$
where the function $\mathcal{Q}(\cdot)$ represents a set of non-linear transform operations. The feature spaces can be viewed as different and independent because each LSTM layer contains a series of non-linear transformations such as sigmoid and tanh. Suppose $S^t$ is the output word at the $t$th time step; its probability of belonging to the $m$th word given the image $I$ can be expressed as
$p\big(S_1^t, S_2^t, \ldots, S_n^t \,\big|\, I_m\big) = p\big(S_1^t \,\big|\, I_m\big)\cdot p\big(S_2^t \,\big|\, I_m\big)\cdots p\big(S_n^t \,\big|\, I_m\big), \qquad (15)$
where $n$ is the number of stages. According to the Bayes rule, we can derive

$p\big(I_m \,\big|\, S_1^t, S_2^t, \ldots, S_n^t\big) = \frac{\prod_{i=1}^{n} p\big(S_i^t \,\big|\, I_m\big)}{\sum_{m=1}^{|V|}\prod_{i=1}^{n} p\big(S_i^t \,\big|\, I_m\big)}, \qquad (16)$

where $|V|$ is the number of words in the vocabulary $V$. Therefore, the final probability score at the $t$th time step is computed by fusing the scores of all the language processing modules (i.e., auxiliary branches) as formulated in Eq. (16).
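A minimal numpy sketch of the product-rule fusion in Eq. (16) at a single time step is given below; the branch distributions are multiplied element-wise over the vocabulary and renormalized before the output word is selected (the variable names are illustrative, and in the full system this fusion is applied at every time step).

```python
import numpy as np

def fuse_branch_scores(branch_probs):
    """branch_probs: (n, |V|) array, one softmax distribution per auxiliary branch.
    Returns the fused distribution of Eq. (16) and the index of the winning word."""
    fused = np.prod(branch_probs, axis=0)        # element-wise product over branches
    fused = fused / fused.sum()                  # renormalize over the vocabulary
    return fused, int(np.argmax(fused))

# Two branches (M1, M2) over a toy 4-word vocabulary.
p1 = np.array([0.10, 0.60, 0.20, 0.10])
p2 = np.array([0.20, 0.50, 0.25, 0.05])
fused, word_id = fuse_branch_scores(np.stack([p1, p2]))
print(fused, word_id)   # the product rule favors words that all branches agree on
```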

4. Experiments

4.1. Datasets and metrics

Two popular public datasets are employed to evaluate the proposed techniques, namely MSCOCO [35] and Flickr30K [36]. The MSCOCO dataset [35] contains 123,287 images, with each image owning at least 5 reference sentences annotated by humans. Two evaluation protocols are used on MSCOCO, including the standard mode [14,18] and the extended mode [16]. Under the standard mode, 82,783 images are utilized for training, 5000 images are used for validation, and another 5000 images are applied for test. Under the extended mode, 113,287 images are used for training, and both the validation and test sets contain 5000 images. Regarding the Flickr30K dataset [36], 31,783 images are included and each image has 5 reference sentences; we follow the split of [16] so that 29,000 images are used for training, 1000 images for test, and the rest for validation. As far as the evaluation metrics for image description are concerned, the following widely used metrics are employed for performance presentation: the bi-lingual evaluation understudy (BLEU) [49], the consensus-based image description evaluation metric (CIDEr) [50], the metric for evaluation of translation with explicit ordering (METEOR) [54], and the recall oriented understudy of gisting evaluation (ROUGE_L) [55]. Specifically, BLEU can be represented in the form of B-n, which counts the n-gram matches between the reference sentence and the generated candidate sentence. For each of these four criteria, the larger the value, the better the corresponding image description method performs.

4.2. Implementation

The widely utilized open source deep learning framework Caffe [56] is employed to implement CNN for visual modeling, and the long-term recurrent convolutional network (LRCN) [14] is applied as the basis to deploy the linguistic modeling.


Fig. 6. Comparison of performances in B-4 and CIDEr on the MSCOCO dataset with different number of LSTM layers, where “Baseline” indicates the baseline model which contains two LSTM layers without employing any of the proposed techniques, “LW” means that only the layer-wise optimization technique is applied, “LW+DS” stands for the model with both layer-wise optimization and deep supervision being used, and “LW+DS+Fusion” represents that all of the three proposed techniques including LW, DS and score fusion are utilized. In addition, “DS” denotes the model with deep supervision alone, while “DS+Fusion” is the model with deep supervision and deep sequential fusion strategy.

Moreover, the visual representation is sent to LSTM at each time step to avoid the visual information loss along with the increase of time steps, which may easily happen if the visual representation is only fed to LSTM at the first time step. In addition, the pre-trained model optimized on ImageNet is used to initialize the CNN parameters. The number of time steps is set to 20, because the number of words in most of the reference sentences is less than 20 based on the observation of the MSCOCO and Flickr30K datasets. The stochastic gradient descent solver is adopted with the batch size equal to 16, while the learning rate is initially set to 0.01 and is gradually decreased by multiplying by 0.1 in a step-size manner after every 20K iterations. Moreover, the decay factor η in Eqs. (9) and (12) is set to 0.1 to update the parameters of the CNN and LSTM models. For the MSCOCO dataset, the vocabulary sizes are 8800 and 10,019 for the standard mode and the extended mode, respectively, and the number of hidden units in each LSTM layer is set to 1000 because of the large-scale data. For training, the maximum numbers of iterations are set to 110K and 200K for the standard mode and the extended mode, respectively. Regarding the Flickr30K dataset, the vocabulary size is 7405, the number of hidden units of each LSTM layer is 512, and the maximum number of iterations is 110K. These settings are summarized in the configuration sketch below.
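For reference, the hyper-parameters listed above can be collected into the following configuration sketch; the values are taken from the text, while the dictionary layout itself is only an illustrative summary rather than the actual Caffe solver configuration.

```python
# Hyper-parameters reported in Section 4.2 (MSCOCO standard/extended, Flickr30K).
train_config = {
    "solver": "SGD",
    "batch_size": 16,
    "base_lr": 0.01,
    "lr_decay": {"factor": 0.1, "step_iters": 20_000},  # multiply lr by 0.1 every 20K iterations
    "time_steps": 20,                                    # maximum caption length
    "eta_decay": 0.1,                                    # decay factor in Eqs. (9) and (12)
    "mscoco": {"vocab": {"standard": 8800, "extended": 10019},
               "lstm_hidden": 1000,
               "max_iters": {"standard": 110_000, "extended": 200_000}},
    "flickr30k": {"vocab": 7405, "lstm_hidden": 512, "max_iters": 110_000},
}
```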

4.3. Evaluation of proposed techniques

First of all, a number of experiments are carried out to evaluate the individual performances achieved by the proposed techniques, including layer-wise (denoted as LW for short) optimization, deep supervision (DS for short) and deep sequential score fusion. The GoogLeNet [23] model is employed to extract the CNN visual features. As shown in Fig. 6, the performances in terms of B-4 and CIDEr on the MSCOCO dataset are presented for the baseline model (which does not contain the proposed LW, DS and fusion techniques) and the proposed methods with different numbers of LSTM layers (e.g., Deep-n indicates that n LSTM layers are employed). As seen in Fig. 6, the proposed techniques can greatly improve both the B-4 and CIDEr performances as compared with the baseline model. When only the layer-wise optimization technique is employed (i.e., LW), the performances are gradually enhanced with the increase of the number of LSTM layers, except for the case of Deep-6 in B-4. After the technique of deep supervision is applied together with layer-wise optimization (i.e., LW+DS), the performances are further boosted, especially in CIDEr. And if all of the three proposed techniques are employed (i.e., LW+DS+Fusion), the best performances are achieved under all of the test conditions. Therefore, this set of experiments verifies the effectiveness of the proposed layer-wise optimization, deep supervision and score fusion techniques.

Table 1. Performance of the proposed LW+DS+Fusion model with different numbers of LSTM layers on the MSCOCO dataset.

Method            B-1    B-2    B-3    B-4    M      R_L    C
Baseline (Std.)   69.4   51.9   37.6   27.1   23.6   51.3   85.1
Deep-4 (Std.)     70.5   53.1   38.9   28.3   24.0   51.9   89.1
Deep-6 (Std.)     70.4   53.1   38.8   28.4   24.3   52.1   90.4
Deep-8 (Std.)     70.4   53.3   39.1   28.5   24.5   52.2   91.5
Baseline (Ext.)   70.0   52.5   38.2   27.7   24.0   51.6   88.0
Deep-4 (Ext.)     71.0   53.7   39.5   28.9   24.7   52.4   91.6
Deep-6 (Ext.)     71.0   53.9   39.7   29.1   24.9   52.7   93.4
Deep-8 (Ext.)     71.3   54.1   39.8   29.2   25.0   52.7   94.6

Table 2. Performance of the proposed LW+DS+Fusion model on the Flickr30K dataset.

Method     B-1    B-2    B-3    B-4    M      R_L    C
Baseline   61.8   43.7   30.2   20.9   19.5   45.6   43.0
Deep-4     62.4   44.4   30.9   21.3   19.8   45.8   44.6
Deep-6     64.5   46.2   32.2   22.4   19.6   45.7   45.2
Moreover, it is observed that deep supervision alone provides only minor improvements in both the B-4 and CIDEr performances when Deep-4 is employed, and when the model becomes deeper (e.g., Deep-6 and Deep-8), the performances deteriorate. The possible reason is that the additional parameters in a deeper model bring about over-fitting when only the proposed deep supervision method is applied. Although the deep sequential fusion strategy slows down the decline (as seen by DS+Fusion in Fig. 6), the performances achieved are similar to or worse than those of the baseline. Moreover, the performances achieved by the proposed overall model (i.e., LW+DS+Fusion) with different numbers of LSTM layers on the MSCOCO dataset under both the standard mode and the extended mode are detailed in Table 1, where "M", "R_L" and "C" stand for the metrics of METEOR, ROUGE_L and CIDEr, respectively, and "Std." and "Ext." mean the standard mode and the extended mode, respectively. From Table 1, it can be observed that the proposed LW+DS+Fusion model is superior to the baseline model in all the evaluation criteria, and the performances are generally improved with the increase of LSTM layers under both the standard mode and the extended mode. In addition to the MSCOCO dataset, the proposed LW+DS+Fusion model is also tested on the Flickr30K dataset, with the results shown in Table 2. Since the training samples in Flickr30K are much fewer than those of MSCOCO, there is no need to employ very deep language models; otherwise, the model may easily fall into over-fitting caused by large-scale model parameters with small-scale training samples.

P. Tang et al. / Neurocomputing 312 (2018) 154–164 Table 3 Performance of the proposed LW+DS+Fusion model with VGG-16 and ResNet152 on the MSCOCO dataset. Method

B-1

B-2

B-3

B-4

M

R_L

C

VGG-16 Baseline (Std.) Deep-4 (Std.) Deep-6 (Std.) Baseline (Ext.) Deep-4 (Ext.) Deep-6 (Ext.)

70.2 70.2 70.6 71.1 71.4 71.3

52.7 52.8 53.2 53.8 54.2 54.0

38.3 38.4 38.8 39.1 39.7 39.5

27.6 27.8 28.2 28.1 28.7 28.5

23.7 24.0 24.3 24.3 24.7 24.7

51.6 51.8 52.1 52.2 52.6 52.5

86.2 88.5 90.0 89.0 91.2 91.9

ResNet152 Baseline (Std.) Deep-4 (Std.) Deep-6 (Std.) Baseline (Ext.) Deep-4 (Ext.) Deep-6 (Ext.)

71.5 72.1 72.2 71.6 72.6 73.1

54.6 55.2 55.4 54.6 55.8 56.5

40.2 41.0 41.1 40.3 41.5 42.2

29.3 30.2 30.2 29.4 30.6 31.3

24.7 25.2 25.3 24.8 25.4 25.7

52.9 53.4 53.5 52.8 53.6 54.0

92.8 96.7 97.7 94.2 97.8 99.9

Table 4 Performance of the proposed LW+DS+Fusion model with VGG-16 and ResNet152 on the Flickr30K dataset. Method

B-1

B-2

B-3

B-4

M

R_L

C

VGG-16 Baseline Deep-4

62.2 62.8

43.5 44.6

29.7 30.5

20.0 20.8

19.2 19.6

44.8 45.3

41.4 43.8

ResNet152 Baseline 64.0 Deep-4 65.1

46.0 47.2

32.4 33.4

22.8 23.8

20.1 20.5

46.4 47.0

49.1 50.8

As a consequence, the maximum number of LSTM layers is set to 6 for the Flickr30K dataset. From the results, it can be seen that the performances are continuously improved along with the increase of LSTM model depth, except for the METEOR performance achieved by Deep-6, which is degraded slightly as compared to that of Deep-4. Besides the GoogLeNet [23] model, another two state-of-the-art CNN models, VGG-16 [24] and ResNet152 [25], are applied for CNN feature extraction to evaluate the proposed overall LW+DS+Fusion model. The corresponding results on the MSCOCO dataset and the Flickr30K dataset are shown in Table 3 and Table 4, respectively. From the results, it can be concluded that the proposed LW+DS+Fusion model with either VGG-16 or ResNet152 is better than the baseline model for image description. It also exhibits the general trend that the Deep-6 model performs better than the Deep-4 model, except for the test case with VGG-16 under the extended mode on the MSCOCO dataset, where the Deep-4 model is a little better than the Deep-6 model in the criteria of BLEU and ROUGE_L. This is because the VGG-16 model contains more parameters than the GoogLeNet and ResNet152 models. Large-scale parameters easily result in over-fitting and hence degrade the correctness of the generated words, leading to worse BLEU and ROUGE_L performances. However, this does not affect the continuous improvement of sentence semantics, which is reflected in CIDEr.

4.4. Evaluation with beam search

The results in Section 4.3 are obtained by only selecting the word with the maximum probability at each time step during sentence generation. A heuristic search strategy named beam search has recently been adopted for image description and is widely employed by a number of state-of-the-art works such as [14,15,18,19,57]. In this work, we also incorporate the beam search method into our proposed model to further evaluate the proposed model comprehensively.
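For completeness, the following sketch contrasts the greedy decoding of Section 4.3 with a small beam search over per-step log-probabilities; `step_fn` is an assumed stand-in that maps a partial word sequence to the next-word log-probability vector, not the interface of the actual implementation, and length normalization is omitted.

```python
import numpy as np

def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    """Keep the `beam_size` highest-scoring partial captions at every step."""
    beams = [([bos], 0.0)]                                # (word sequence, summed log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                            # finished captions are kept as-is
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)                      # (|V|,) next-word log-probabilities
            for w in np.argsort(log_probs)[-beam_size:]:  # expand with the top words
                candidates.append((seq + [int(w)], score + float(log_probs[w])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]                                    # best-scoring caption

# Greedy decoding (the setting of Section 4.3) is the special case beam_size=1.
```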


Table 5. Performance comparison with beam search on the MSCOCO dataset.

Method            B-1    B-2    B-3    B-4    M      R_L    C
GoogLeNet
Baseline (Std.)   70.2   53.3   39.9   29.9   24.1   52.0   89.9
Deep-4 (Std.)     71.7   54.6   40.7   30.3   24.4   52.8   92.6
Baseline (Ext.)   70.9   54.0   40.7   31.0   24.7   52.6   93.5
Deep-4 (Ext.)     72.6   55.7   41.8   31.5   25.2   53.4   96.9
VGG-16
Baseline (Std.)   71.3   54.4   40.8   30.5   24.3   52.6   92.0
Deep-4 (Std.)     72.6   55.8   41.8   31.2   24.7   53.3   94.8
Baseline (Ext.)   71.9   55.1   41.4   31.1   24.7   53.0   93.8
Deep-4 (Ext.)     73.2   56.3   42.0   31.2   25.1   53.7   95.9
ResNet152
Baseline (Std.)   72.6   56.2   42.7   32.3   25.4   53.9   99.3
Deep-4 (Std.)     74.1   57.8   43.8   33.0   25.8   54.6   101.8
Baseline (Ext.)   73.4   57.1   43.5   33.0   25.6   54.1   100.9
Deep-4 (Ext.)     74.8   58.4   44.3   33.5   25.9   54.8   103.4

Table 6. Performance comparison with beam search on the Flickr30K dataset.

Method                B-1    B-2    B-3    B-4    M      R_L    C
Baseline+GoogLeNet    65.2   46.7   32.9   22.9   19.1   45.4   45.3
Deep-4+GoogLeNet      65.7   47.4   33.4   23.5   19.2   45.9   46.9
Baseline+VGG-16       65.1   46.0   31.7   21.6   18.9   45.1   42.0
Deep-4+VGG-16         66.3   47.5   33.0   22.7   19.2   45.8   45.5
Baseline+ResNet152    66.9   48.2   34.1   23.9   20.0   46.7   50.5
Deep-4+ResNet152      68.1   50.1   36.0   25.8   20.6   47.7   54.6
Table 7. Summary of the comparative results on the MSCOCO dataset.

Methods                        B-1    B-2    B-3    B-4    M      C
Multi-modal RNN [16]           62.5   45.0   32.1   23.0   19.5   66.0
Google NIC [15]                66.6   46.1   32.9   24.6   –      –
LRCN-CaffeNet [14]             62.8   44.2   30.4   21.0   –      –
m-RNN [12]                     67.0   49.0   35.0   25.0   –      –
Soft-Attention [17]            70.7   49.2   34.4   24.3   23.9   –
Hard-Attention [17]            71.8   50.4   35.7   25.0   23.0   –
emb-gLSTM, Gaussian [18]       67.0   49.1   35.8   26.4   22.7   81.3
RA-SF [57]                     69.7   51.9   38.1   28.2   23.5   83.8
Att+CNN+LSTM [19]              74.0   56.0   42.0   31.0   26.0   94.0
Deep-8+GoogLeNet (Std.)        70.4   53.3   39.1   28.5   24.5   91.5
Deep-6+VGG-16 (Std.)           70.6   53.2   38.8   28.2   24.3   90.0
Deep-6+ResNet152 (Std.)        72.2   55.4   41.1   30.2   25.3   97.7
Deep-8+GoogLeNet (Ext.)        71.3   54.1   39.8   29.2   25.0   94.6
Deep-6+VGG-16 (Ext.)           71.3   54.0   39.5   28.5   24.7   91.9
Deep-6+ResNet152 (Ext.)        73.1   56.5   42.2   31.3   25.7   99.9
Deep-4+GoogLeNet+BS (Std.)     71.7   54.6   40.7   30.3   24.4   92.6
Deep-4+VGG-16+BS (Std.)        72.6   55.8   41.8   31.2   24.7   94.8
Deep-4+ResNet152+BS (Std.)     74.1   57.8   43.8   33.0   25.8   101.8
Deep-4+GoogLeNet+BS (Ext.)     72.6   55.7   41.8   31.5   25.2   96.9
Deep-4+VGG-16+BS (Ext.)        73.2   56.3   42.0   31.2   25.1   95.9
Deep-4+ResNet152+BS (Ext.)     74.8   58.4   44.3   33.5   25.9   103.4
The corresponding results with four LSTM layers on the MSCOCO dataset and the Flickr30K dataset are presented in Table 5 and Table 6, respectively. From the results, it can be observed that when the beam search method is applied to both the baseline model and the proposed LW+DS+Fusion model, the proposed model outperforms the baseline in all the performance criteria on both datasets.

4.5. Comparison with state-of-the-arts

The proposed LW+DS+Fusion model is also compared with other state-of-the-art models, with the performance comparison shown in Table 7 and Table 8 for the MSCOCO dataset and the Flickr30K dataset, respectively, where "BS" indicates that the beam search method is employed, and the best performances are highlighted in bold for each of the criteria.


For the proposed LW+DS+Fusion model, only the best general performance with each of the underlying CNN models (i.e., GoogLeNet, VGG-16, ResNet152), with beam search either enabled or disabled, is presented. On MSCOCO, the best performance is achieved with the proposed Deep-4+ResNet152+BS model under the extended mode, which is generally better than the other state-of-the-art models except for the METEOR performance obtained by Att+CNN+LSTM [19]. In this case, the METEOR performance achieved by Att+CNN+LSTM is 26.0, which is slightly higher (i.e., +0.1) than that of our model. However, our performances in the other criteria are much higher than those of Att+CNN+LSTM; e.g., the B-4 and CIDEr performances achieved by Deep-4+ResNet152+BS (Ext.) are 33.5 and 103.4, which are +2.5 and +9.3 higher than the corresponding performances of Att+CNN+LSTM.

Moreover, it can be observed that beam search is very useful when incorporated into the proposed LW+DS+Fusion model, as the performances are generally improved when beam search is employed. In addition, the deployment of beam search can decrease the depth requirement of the LSTM network and thus reduce the complexity of the language model; e.g., when GoogLeNet is employed for CNN feature extraction under the standard mode, the best performance without beam search is achieved with an eight-layer LSTM network (i.e., Deep-8), while after enabling beam search only a four-layer LSTM network (i.e., Deep-4) is needed to achieve the best performance. As far as the Flickr30K dataset is concerned, we can draw a similar conclusion that the proposed LW+DS+Fusion model (specifically, the Deep-4+ResNet152+BS model) is able to achieve better performance than all the other competing models, as shown in Table 8.

Fig. 7. Examples from MSCOCO test set (trained on the extended set) with ResNet152 feature. {R1, R2, R3, R4, R5} is the set of references, while C1 and C2 are the sentences generated by the baseline model and the proposed LW+DS+Fusion model (Deep-4+ResNet152+BS).

Table 8. Summary of the comparative results on the Flickr30K dataset.

Method                     B-1    B-2    B-3    B-4    M
LogBilinear [13]           60.0   38.0   25.4   17.1   16.9
Multi-modal RNN [16]       57.3   36.9   24.0   15.7   15.3
Google NIC [15]            66.3   42.3   27.7   18.3   –
LRCN-CaffeNet [14]         58.7   39.1   25.1   16.5   –
m-RNN [12]                 60.0   41.0   28.0   19.0   –
Soft-Attention [17]        66.7   43.4   28.8   19.1   18.5
Hard-Attention [17]        66.9   43.9   29.6   19.9   18.5
emb-gLSTM, Gaussian [18]   64.6   44.6   30.5   20.6   17.9
RA-SF [57]                 67.0   47.5   33.0   24.3   19.4
Deep-4+GoogLeNet           64.5   46.2   32.2   22.4   19.6
Deep-4+VGG-16              62.8   44.6   30.5   20.8   19.6
Deep-4+ResNet152           65.1   47.2   33.4   23.8   20.5
Deep-4+GoogLeNet+BS        65.7   47.5   33.5   23.5   19.2
Deep-4+VGG-16+BS           66.3   47.5   33.0   22.7   19.2
Deep-4+ResNet152+BS        68.1   50.1   36.0   25.8   20.6
4.6. Examples and discussion

Some typical examples from the MSCOCO test set with the ResNet152 CNN feature are shown in Fig. 7. Compared to the baseline model, the sentences generated with the proposed LW+DS+Fusion model contain more accurate words and have richer semantics. For the first image, the baseline model generates "stop sign" to describe the object next to "a man", as the model may notice the red light on the traffic sign; however, "traffic light" is more suitable for a human reader according to the references. Regarding the second example, the sentence generated by the baseline model includes the phrase "wearing a hat and tie", which is not the real content of the candidate image, while "laying on the bed" in C2 by the proposed model is semantically closer to the fact. Also, the C2 sentences in the other examples possess richer semantics and more elegant words. However, it is obvious that the sentence patterns of the references are more flexible and their words are more vivid and personalized as compared to the description sentences generated by the proposed model.

The model complexity of the proposed methods is analyzed in the following. When the LW and DS methods are employed, the amount of model parameters increases and the computation required by the proposed model is higher than that of the baseline. Let $O_p$ and $O_t$ denote the amount of model parameters and the computational complexity of a visual CNN model, and let $W_p$ and $W_t$ represent the amount of model parameters and the computational complexity of a two-layer LSTM module. When there are $n$ stages, the amount of model parameters required by the proposed LW approach becomes $O_p + n \times W_p$ for both training and testing, while the computational complexity of LW is $n \times O_t + \frac{n(n+1)}{2} \times W_t$ for training and $O_t + n \times W_t$ for testing. Regarding the proposed DS approach, its model parameter scale is the same as that of LW, and its computational complexity is $O_t + n \times W_t$ for both training and testing. When both LW and DS are applied, the model parameter scale is also $O_p + n \times W_p$, while the computational complexity is $n \times O_t + \frac{n(n+1)}{2} \times W_t$ for training and $O_t + n \times W_t$ for testing.

5. Conclusion

In this paper, a deep sequential fusion LSTM network is proposed for image description. First, the layer-wise optimization technique is designed to deepen the LSTM based language model and enhance the representation ability of the description sentences. Second, in order to prevent the model from falling into over-fitting and local optima, the deep supervision method is proposed to optimize the model parameters in an effective manner. Third, the product rule based fusion strategy is developed to fuse the output scores from each of the linguistic modules to further improve the image description capacity. The experimental results on the benchmark datasets of MSCOCO and Flickr30K have verified the effectiveness of the proposed layer-wise, deep supervision and score fusion approaches, and the overall model performs the best as compared with other state-of-the-art approaches.

Acknowledgments

This work was supported in part by National Natural Science Foundation of China under Grants 61622115 and 61472281, Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005), Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), IBM Shared University Research Awards Program, and Scientific Research Foundation of Education Bureau of Jiangxi Province (No. GJJ170643).

References

[1] A. Farhadi, M. Hejrati, M.A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, D. Forsyth, Every picture tells a story: generating sentences from images, in: Proceedings of the 2010 European Conference on Computer Vision, Springer, 2010, pp. 15–29. [2] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, et al., Babytalk: understanding and generating simple image descriptions, IEEE Trans. Pattern Anal. Mach. Intell. 35 (12) (2013) 2891–2903. [3] M. Mitchell, J. Dodge, A. Goyal, K. Yamaguchi, K. Stratos, X. Han, et al., Midge: generating image descriptions from computer vision detections, in: Proceedings of the 2012 European Association of Computational Linguistics, ACL, 2012, pp. 747–756. [4] Y. Yang, C.L. Teo, H.D. III, Y. Aloimonos, Corpus-guided sentence generation of natural images, in: Proceedings of the 2011Conference on Empirical Methods in Natural Language Processing, ACL, 2011, pp. 27–31. [5] P. Kuznetsova, V. Ordonez, A.C. Berg, Collective generation of natural image descriptions, in: Proceedings of the 2012 Annual Meeting of the Association for Computational Linguistics, ACL, 2012, pp. 359–368. [6] P. Kuznetsova, V. Ordonez, A.C. Berg, T. Berg, Y. Choi, Generalizing image captions for image-text parallel corpus, in: Proceedings of the 2013 Annual Meeting of the Association for Computational Linguistics, ACL, 2013, pp. 790–796. [7] P. Kuznetsova, V. Ordonez, T. Berg, Y. Choi, TREETALK: composition and compression of trees for image descriptions, Trans. ACL 2 (10) (2014) 351–362. [8] R. Mason, E. Charniak, Nonparametric method for data driven image captioning, in: Proceedings of the 2014 Annual Meeting of the Association for Computational Linguistics, ACL, 2014, pp. 592–598. [9] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: Proceedings of the 2015 International Conference on Learning Representations, 2015. [10] K. Cho, B.V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, ACL, 2014, pp. 1724–1734. [11] I. Sutskever, O. Vinyals, Q.V. Le, Sequence to sequence learning with neural networks, in: Proceedings of the 2014 Annual Conference on Neural Information Processing Systems, MIT Press, 2014, pp. 3104–3112. [12] J. Mao, W. Xu, Y. Yang, J. Wang, A. Yuille, Deep captioning with multimodal recurrent neural networks (m-RNN), in: Proceedings of the 2014 International Conference on Learning Representations, 2014. [13] R. Kiros, R. Salakhutdinov, R. Zemel, Multimodal neural language models, in: Proceedings of the 2014 International Conference on Machine Learning, ACM, 2014, pp. 595–603. [14] J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, T. Darrell, Long-term recurrent convolutional networks for visual recognition and description, in: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 2625–2634. [15] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: a neural image caption generator, in: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 3156–3164. [16] A. Karpathy, F.-F. Li, Deep visual-semantic alignments for generating image descriptions, in: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 3128–3137. [17] K. 
Xu, J.L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R.S. Zemel, Y. Bengio, Show, attend and tell: neural image caption generation with visual attention, in: Proceedings of the 2015 International Conference on Machine Learning, ACM, 2015, pp. 2048–2057. [18] X. Jia, E. Gavves, B. Fernando, T. Tuytelaars, Guiding the long-short term memory model for image caption generation, in: Proceedings of the 2015 International Conference on Computer Vision, IEEE, 2015, pp. 2407–2415. [19] Q. Wu, C. Shen, L. Liu, A. Dick, A. Hengel, What value do explicit high level concepts have in vision to language problems? in: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 203–212.


[20] O. Abdel-Hamid, A. Mohamed, H. Jiang, G. Penn, Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, in: Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2012, pp. 4277–4280.
[21] T.N. Sainath, A. Mohamed, B. Kingsbury, B. Ramabhadran, Deep convolutional neural networks for LVCSR, in: Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 8614–8618.
[22] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Proceedings of the 2012 Annual Conference on Neural Information Processing Systems, MIT Press, 2012, pp. 1097–1105.
[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 1–9.
[24] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proceedings of the 2014 International Conference on Learning Representations, 2014.
[25] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 770–778.
[26] Y. Wei, Y. Zhao, C. Lu, S. Wei, L. Liu, Z. Zhu, S. Yan, Cross-modal retrieval with CNN visual features: a new baseline, IEEE Trans. Cybern. 47 (2) (2016) 449–460.
[27] R. Girshick, J. Donahue, T. Darrell, J. Malik, Region-based convolutional networks for accurate object detection and segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 38 (1) (2016) 142–158.
[28] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507.
[29] P.D. Djurdjevic, M. Huber, Deep belief network for modeling hierarchical reinforcement learning policies, in: Proceedings of the 2013 International Conference on Systems, Man, and Cybernetics, IEEE, 2013, pp. 2485–2491.
[30] M. Hermans, B. Schrauwen, Training and analysing deep recurrent neural networks, in: Proceedings of the 2013 Annual Conference on Neural Information Processing Systems, MIT Press, 2013, pp. 190–198.
[31] C.-Y. Lee, S. Xie, P.W. Gallagher, Z. Zhang, Z. Tu, Deeply-supervised nets, in: Proceedings of the 2015 International Conference on Artificial Intelligence and Statistics, JMLR, 2015, pp. 562–570.
[32] L. Wang, C.-Y. Lee, Z. Tu, S. Lazebnik, Training deeper convolutional networks with deep supervision, arXiv preprint arXiv:1505.02496, 2015.
[33] J. Kittler, M. Hatef, R. Duin, J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell. 20 (3) (1998) 226–239.
[34] D.M. Tax, M.V. Breukelen, R.P. Duin, J. Kittler, Combining multiple classifiers by averaging or by multiplying? Pattern Recognit. 33 (9) (2000) 1475–1485.
[35] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: common objects in context, in: Proceedings of the 2014 European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[36] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions, Trans. ACL 2 (2) (2014) 67–78.
[37] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R.R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580, 2012.
[38] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, R. Fergus, Regularization of neural networks using DropConnect, in: Proceedings of the 2013 International Conference on Machine Learning, ACM, 2013, pp. 1058–1066.
[39] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167, 2015.
[40] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: Proceedings of the 2014 European Conference on Computer Vision, Springer, 2014, pp. 818–833.
[41] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Netw. 5 (2) (1994) 157–166.
[42] K. Yao, T. Cohn, K. Vylomova, K. Duh, C. Dyer, Depth-gated LSTM, arXiv preprint arXiv:1508.03790, 2015.
[43] Z. Li, J. Tang, Weakly supervised deep matrix factorization for social image understanding, IEEE Trans. Image Process. 26 (1) (2017) 276–288.
[44] Z. Li, J. Tang, Weakly supervised deep metric learning for community-contributed image retrieval, IEEE Trans. Multimed. 17 (11) (2015) 1989–1999.
[45] Z. Li, J. Liu, J. Tang, H. Lu, Robust structured subspace learning for data representation, IEEE Trans. Pattern Anal. Mach. Intell. 37 (10) (2015) 2085–2098.
[46] Z. Li, J. Tang, X. He, Robust structured nonnegative matrix factorization for image representation, IEEE Trans. Neural Netw. Learn. Syst. PP (99) (2017) 1–14.
[47] J. Tang, H. Li, G.-J. Qi, T.S. Chua, Integrated graph-based semi-supervised multiple/single instance learning framework for image annotation, in: Proceedings of the 2008 ACM International Conference on Multimedia, ACM, 2008, pp. 631–634.
[48] F. Sun, H. Li, Y. Zhao, X. Wang, D. Wang, Towards tags ranking for social images, Neurocomputing 120 (2013) 434–440.

[49] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 2002 Annual Meeting of the Association for Computational Linguistics, ACL, 2002, pp. 311–318.
[50] R. Vedantam, C.L. Zitnick, D. Parikh, CIDEr: consensus-based image description evaluation, in: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2015, pp. 4566–4575.
[51] R. Xu, C. Xiong, W. Chen, J.J. Corso, Jointly modeling deep video and compositional text to bridge vision and language in a unified framework, in: Proceedings of the 2015 AAAI Conference on Artificial Intelligence, AAAI, 2015, pp. 2346–2352.
[52] Y. Pan, T. Mei, T. Yao, H. Li, Y. Rui, Jointly modeling embedding and translation to bridge video and language, in: Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2016, pp. 4594–4602.
[53] P. Tang, H. Wang, S. Kwong, G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition, Neurocomputing 225 (2017) 188–197.
[54] S. Banerjee, A. Lavie, METEOR: an automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the 2005 Annual Meeting of the Association for Computational Linguistics Workshop, ACL, 2005, pp. 65–72.
[55] C.-Y. Lin, F.J. Och, Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics, in: Proceedings of the 2004 Annual Meeting of the Association for Computational Linguistics, ACL, 2004, pp. 21–26.
[56] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, in: Proceedings of the 2014 ACM International Conference on Multimedia, ACM, 2014, pp. 675–678.
[57] J. Jin, K. Fu, R. Cui, F. Sha, C. Zhang, Aligning where to see and what to tell: image caption with region-based attention and scene factorization, arXiv preprint arXiv:1506.06272, 2015.

Pengjie Tang received the M.S. degree in computer software and theory from Nanchang University, China, in 2009. He is currently a Ph.D. candidate at the Department of Computer Science and Technology, Tongji University, China. His research interests include computer vision and deep learning.

Hanli Wang received the B.E. and M.E. degrees in electrical engineering from Zhejiang University, Hangzhou, China, in 2001 and 2004, respectively, and the Ph.D. degree in computer science from City University of Hong Kong, Kowloon, Hong Kong, in 2007. From 2007 to 2008, he was a Research Fellow with the Department of Computer Science, City University of Hong Kong, and also a Visiting Scholar at Stanford University, Palo Alto, CA, USA. From 2008 to 2009, he was a Research Engineer with Precoad, Inc., Menlo Park, CA, USA. From 2009 to 2010, he was an Alexander von Humboldt Research Fellow at the University of Hagen, Hagen, Germany. Since 2010, he has been a full Professor with the Department of Computer Science and Technology, Tongji University, Shanghai, China. His current research interests include digital video coding, computer vision, and machine learning.

Sam Kwong received the B.Sc. and M.A.Sc. degrees in electrical engineering from the State University of New York at Buffalo, USA, and the University of Waterloo, Canada, in 1983 and 1985, respectively. In 1996, he obtained his Ph.D. degree from the University of Hagen, Germany. From 1985 to 1987, he was a diagnostic engineer with Control Data Canada, where he designed diagnostic software to detect manufacturing faults in the VLSI chips of the Cyber 430 machine. He later joined Bell-Northern Research Canada as a Member of Scientific Staff. In 1990, he joined the City University of Hong Kong as a lecturer in the Department of Electronic Engineering. He is currently a Professor in the Department of Computer Science. His research interests include pattern recognition, evolutionary algorithms, and video coding.