Discriminative multimodal embedding for event classification

Fan Qi (a,d), Xiaoshan Yang (b,c,d,*), Tianzhu Zhang (b,c), Changsheng Xu (a,b,c,d)
a School of Computer and Information, Hefei University of Technology, PR China
b Institute of Automation, Chinese Academy of Sciences, Beijing 100190, PR China
c University of Chinese Academy of Sciences, Beijing 100190, PR China
d Peng Cheng Laboratory, Shenzhen, China
Article info
Article history: Received 30 July 2017; Revised 3 October 2017; Accepted 2 November 2017; Available online xxx
Keywords: Social media; Event classification; Multimodal embedding
Abstract

Most existing multimodal event classification methods fuse traditional hand-crafted features with manually defined weights, which may not be suitable for event classification over large amounts of photos. Besides, feature extraction and the event classification model are usually handled separately, so the most useful features for describing the semantic concepts of complex events cannot be captured. To deal with these issues, we propose a novel discriminative multimodal embedding (DME) model for event classification in user-generated photos, which jointly learns the representation together with the classifier in a unified framework. By applying contrastive constraints to the multimodal event data, the proposed DME model effectively addresses the multimodal, intra-class variation and inter-class confusion challenges. Extensive experimental results on two collected datasets demonstrate the effectiveness of the proposed DME model for event classification.

© 2019 Elsevier B.V. All rights reserved.
1. Introduction

With the rapid progress of the mobile Internet, more and more smartphones with digital cameras have been connected to the Internet, which greatly facilitates photo sharing and propagation. As a result, a huge number of photos have been uploaded by users on the Internet. Taking Instagram as an example, about 80 million photos are uploaded every day.1 Besides, many other well-known web sites, such as Facebook, Flickr, and Tumblr, also host millions of photos, which record various kinds of events happening around us, such as concerts or weddings. This huge amount of photos makes it easy for users to share experiences, but it also requires more scalable, effective and robust technologies to manage and index them. Most existing social web sites serve a large number of users with different requirements but only provide keyword-based or time-based search and browsing, which cannot meet the needs of all users. Therefore, automatic event analysis over massive user-generated photos is important and helpful for users, companies and governments to better browse, search and monitor events.
* Corresponding author at: Institute of Automation, Chinese Academy of Sciences, Beijing 100190, PR China.
E-mail addresses: [email protected] (F. Qi), [email protected] (X. Yang), [email protected] (T. Zhang), [email protected] (C. Xu).
1 https://www.instagram.com/press/.
Many mono-modality methods have been proposed for event analysis in photos. A number of them rely on annotation information, such as titles, tags, descriptions and GPS, and have achieved good performance [1–10]. However, these annotation data are often too noisy to fully represent the rich concepts related to the events expressed by the photos, and a large number of photos do not have any annotation information at all. To deal with this issue, other methods make use of the rich visual information for event analysis [11–15]. Vision-based event analysis on individual photos is difficult, so most of these methods focus on event recognition in photo collections, which are well organized in annotated albums. A photo collection is similar to a set of sparsely sampled key frames from a video, so conventional video event recognition models can be applied to photo collections. However, a single modality is far from enough for recognizing events in photos. Thus, some methods utilize multimodal features for event analysis, because different modalities of data typically carry complementary information. Most of them rely on combining multimodal features of titles, tags, descriptions and GPS information. For example, in [7], event annotation is carried out by two tasks: grouping photos into event clusters, and classifying each cluster into different event categories. The clustering algorithm is based on time and GPS features. A small time interval may also suggest that the photos are taken in the same place, so the time tag is a useful feature for event clustering.
Fig. 1. Overview of the proposed discriminative multimodal embedding model for event analysis.
However, this method is limited to analyzing the private photos of a single user rather than large-scale photos from a vast user base. Some other methods employ traditional feature fusion algorithms. For example, in [8], image features and GPS features are combined through a confidence-based fusion formulated from the outputs of the corresponding classifiers and confidence weights. More recently, some methods combine low-level features of different modalities for event classification in social media. For example, in [16], the TF-IDF text feature and the GIST image feature are combined by early fusion and by late fusion, which is a simple composition of the classifiers' outputs.

Although the above multimodal event recognition methods have achieved acceptable performance on specific datasets, there is still significant room for improvement, especially in the following two aspects: (1) Existing methods mainly fuse traditional hand-crafted features with manually defined weights, which may be ineffective for event classification over large amounts of photos. (2) Existing methods perform feature extraction and event classification separately, which cannot capture the most useful features for describing the concepts of events. To deal with the above issues, this paper aims to map multimodal data (e.g., photos and texts) into a common space so that they enhance and complement each other, and to learn compact event semantics. As a result, data with different modalities can be represented comprehensively and easily associated with a specific event class. To embed multimodal data in a common feature space for event classification, we have to deal with the following three challenges: (1) Multimodal property. A single modality is far from enough to describe a social event. For example, an image of a big hall without any description is ambiguous; it may be recognized as a concert, an exhibition or a wedding. To increase the accuracy of social event analysis, we must make full use of the multimodal information uploaded by users. Furthermore, it is also vital to choose an appropriate method to enhance the discriminability of the features. (2) Intra-class variation. Photos uploaded by users are sparsely sampled visual appearances of an event over time. Thus, each photo may only capture a single object or scene of a specific complex event. For example, people gathered in a big hall or outdoors may both be attending a concert. (3) Inter-class confusion.
Photos related to different events may contain similar objects or scenes. For instance, an image of a man dressed in a green giant costume for Halloween may be recognized as St. Patrick's Day.

To overcome the above three challenges, we propose a novel discriminative multimodal embedding (DME) model for event classification in user-generated photos. As shown in Fig. 1, in the training stage, given images and their corresponding textual descriptions as input, we map the features of the images and their descriptions into a compact feature space where the event classes can be assigned easily. The proposed DME model has the following three advantages. (1) To address the multimodal issue, we minimize the distances between image features and their corresponding text features in the common feature space by learning a joint image-text embedding. (2) To reduce intra-class variation, for each single modality (image or text), the distance between two samples with the same class label is minimized. (3) To alleviate inter-class confusion, the distance between two samples with different class labels is maximized. Thus, we can not only capture the complementary patterns in images and texts, but also obtain discriminative feature representations for effective event classification. The main contributions of this paper are highlighted as follows.

1. We propose a unified visual-semantic embedding framework to jointly learn discriminative multimodal feature representations for event classification in user-generated photos.
2. In the proposed discriminative multimodal embedding (DME) model, we address the multimodal, intra-class variation and inter-class confusion issues by applying contrastive constraints to the multimodal and single-modal data.
3. Experiments on two collected large-scale datasets containing about 450 K images demonstrate that the proposed DME model performs better than other state-of-the-art algorithms.

The rest of this paper is organized as follows. In Section 2, we summarize the related work. The proposed discriminative multimodal embedding model is described in detail in Section 3. Experimental results are reported and analyzed in Section 4. Finally, we conclude the paper with future work in Section 5.
2. Related work

In this section, we briefly review the existing work most related to ours, including social event analysis and multimodal feature learning.

Social event analysis in user generated photos: In the early work [12], a purely image-based event classification approach is proposed. A Bag-of-Words (BoW) representation is extracted from dense SIFT and color features, PageRank is used to select the most important features, and Support Vector Machines (SVM) finally predict the event type. In [13,14], event recognition is evaluated on an album dataset of about 40 K photos covering 10 holiday events; object semantics are extracted from the images and albums are classified by compositional object pattern frequency. In [15], an adapted discriminative hidden Markov model is applied to a new dataset containing about 60 K Flickr images annotated with 14 events. As event detection becomes prevalent, the rich contextual metadata available on the web provide new opportunities for event classification, and some text-based methods [17,18] have been proposed. The text-based methods require considerable textual preprocessing: stop-word removal, stemming, PoS tagging and tokenization. Most of these early methods are designed for a single modality and small datasets, and they are not effective in practical applications where numerous photos related to much more complex events need to be recognized. Recently, deep learning has also been applied to event classification in photos. In [19], an effective method is developed to recognize events from static images by semantic fusion of a holistic deep representation and a spatial detection map. The architecture in [20] is decomposed into an object net and a scene net, which extract useful information for event understanding from the perspectives of objects and scene context, respectively. However, the semantic information they use, such as the time of day, the number of faces and the type of scene, is all detected from the image. All these deep learning based event classification methods mainly focus on the visual content of the photos. Compared with these methods, we propose a multimodal feature learning method that combines CNNs for capturing the visual content and RNNs for the textual content. Besides, we also provide a new large multimodal event dataset consisting of about 200 K photos and their corresponding textual descriptions.

Multimodal feature learning: Human perception is diverse, including vision, hearing, touch, taste, smell and so on, and the loss of any one perception is likely to impair intelligence or ability. Motivated by this, multimodal deep learning provides multimodal data processing capabilities for machines. In the early research work, there are mainly three kinds of multimodal feature learning methods. (1) Feature subspace: The most widely used feature subspace method for multiple modalities is canonical correlation analysis (CCA) [21]. CCA can be seen as the problem of finding basis vectors for variables with different modalities such that the correlation between the projections of the variables onto the basis vectors is mutually maximized; the basis vectors are determined by a set of linear transformations, one for each modality. (2) Semantic integration: In [22], the query-by-example paradigm is extended to the semantic domain by defining a semantic feature space where each image is represented by a vector of posterior concept probabilities.
In [23], the semantic representation of each image is constructed in a correlation space where the original features are mapped using CCA; the CCA-based correlation between the two modalities and the semantic representation obtained by multiclass logistic regression are then combined. (3) Kernel methods: In [24], a semi-supervised learning approach is proposed to leverage the information contained in the tags associated with unlabeled images. The multiple kernel learning (MKL) framework
is used to combine a kernel based on the image content with a second kernel encoding the tags associated with each image. With the success of deep learning in computer vision and natural language processing, it has also been applied to multimodal feature learning. Srivastava and Salakhutdinov propose to learn multimodal data representations with a Deep Boltzmann Machine [25]. Evaluations on classification and retrieval tasks using the fused image-tag representation of the MIR Flickr data show that this method performs noticeably better than a multimodal Deep Belief Network (DBN) and deep autoencoders [26]. This method mainly targets image datasets of general objects, while the proposed method targets event data, which are much more difficult due to intra-class variation and inter-class confusion. More recently, a number of approaches have been proposed to learn a joint image-text embedding by mapping images and sentences into a common space [27–30]. In [27], Weston et al. propose a scalable method that learns a low-dimensional joint embedding space for images and annotations by optimizing precision at k of the ranked list of annotations for a given image; the loss function is a weighted approximate-rank pairwise loss. In [29], the DT-RNN model uses dependency trees to embed sentences into a vector space in order to retrieve images described by those sentences. These methods mainly focus on image-to-text alignment and visual description tasks, so they do not need to consider the discriminability of the learned feature representation as we do in this paper. The method in [31] is the most closely related work to the proposed DME model: the image and the text description are embedded into a common feature space through a contrastive loss. Different from this method, where the embedded image and text features are mainly used for cross-modal retrieval and image captioning, the proposed DME model jointly learns the image-text embedding and a discriminative feature representation for event classification.

3. The proposed method

In this section, we introduce the proposed multimodal deep learning model in detail. First, we show the overall flowchart of the embedding model. Second, we describe the text encoder. Finally, we present the proposed loss functions for the multimodal constraint and the discriminative constraint.

3.1. Problem formulation

Given images and their descriptions as input, our goal is to learn a joint image-sentence embedding that maps images and sentences into discriminative feature representations, which are further used for event classification. Fig. 2 shows the overall flowchart of the proposed DME model. Sentences are encoded using recurrent neural networks (RNNs), while image features are extracted by a pretrained CNN. The multimodal embedding model is optimized by simultaneously considering three kinds of constraints: the multimodal contrastive constraint, the discriminative constraint and the classification constraint. Let D, K, H and V be the dimension of an image feature vector, the dimension of the multimodal embedding space, the dimension of the sentence feature vector, and the dimension of the word embedding vectors, respectively. The event class label of an image-sentence pair is denoted as y and the total number of event classes is C.
The image and sentence embedding matrices are denoted as $W_I \in \mathbb{R}^{K \times D}$ and $W_T \in \mathbb{R}^{K \times H}$, respectively. We use $\{w_1, \ldots, w_N\}$, $w_i \in \mathbb{R}^V$, to denote the word embedding vectors of a sentence S.
Fig. 2. Framework of the proposed discriminative multimodal embedding (DME) method.
These word embedding vectors will be learned during the optimization of the proposed DME model. The feature representation $p \in \mathbb{R}^H$ of a sentence is the output of the last hidden layer of the sentence encoder, which is described in Section 3.2. The embedded sentence feature in the common feature space is computed as $v = W_T p \in \mathbb{R}^K$. We use $q \in \mathbb{R}^D$ to denote the image feature vector of the image corresponding to the description sentence S. The embedded image feature is computed as $f = W_I q \in \mathbb{R}^K$. The fused feature $x \in \mathbb{R}^K$ is obtained by element-wise addition of $f$ and $v$. The objective of the proposed multimodal embedding model is formulated as follows:
$$L = \gamma_1 L_1 + \gamma_2 L_2 + \gamma_3 L_3, \qquad (1)$$
where $L_1$ constrains the different modalities, $L_2$ is for discriminative feature learning, and $L_3$ is a softmax loss for explicit event classification. $\gamma_1$, $\gamma_2$ and $\gamma_3$ are factors controlling the balance of the three terms. The formulation of $L_3$ is given as follows:
$$L_3 = -\sum_i \log \frac{e^{W_{y_i} x_i + b_{y_i}}}{\sum_{j=1}^{C} e^{W_j x_i + b_j}}. \qquad (2)$$
Here, $W_j \in \mathbb{R}^K$ denotes the jth row of the weight matrix $W \in \mathbb{R}^{C \times K}$ of the last fully connected layer after the feature embedding, and $b \in \mathbb{R}^C$ is the bias term. In Section 3.2, we show how to encode the representations of all words in a sentence S. More details about the modality constraint $L_1$ and the discriminative feature constraint $L_2$ are given in Sections 3.3 and 3.4.
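To make the formulation concrete, the following sketch shows one way the embedding, fusion and classification terms of Eqs. (1) and (2) could be wired together. The use of PyTorch, the class name and the default dimensions are our own illustrative assumptions (D = 4096 matches the fc7 features of Section 4.2 and C = 13 matches the YFCC-MES events); the L1 and L2 terms are sketched in Sections 3.3 and 3.4.

```python
# Illustrative sketch only (PyTorch assumed); names and hyper-parameters are hypothetical.
import torch
import torch.nn as nn

class DMESketch(nn.Module):
    def __init__(self, D=4096, H=1024, K=512, C=13):
        super().__init__()
        self.W_I = nn.Linear(D, K, bias=False)   # image embedding matrix W_I (K x D)
        self.W_T = nn.Linear(H, K, bias=False)   # sentence embedding matrix W_T (K x H)
        self.classifier = nn.Linear(K, C)        # softmax classifier (W, b) of Eq. (2)

    def forward(self, q, p):
        f = self.W_I(q)          # embedded image feature f = W_I q
        v = self.W_T(p)          # embedded sentence feature v = W_T p
        x = f + v                # fused feature x: element-wise addition
        return f, v, self.classifier(x)

# Combined objective of Eq. (1): L = g1*L1 + g2*L2 + g3*L3
def total_loss(f, v, logits, labels, L1_fn, L2_fn, g1=0.9, g2=0.1, g3=1.0):
    L3 = nn.functional.cross_entropy(logits, labels)   # softmax loss, Eq. (2)
    return g1 * L1_fn(f, v) + g2 * L2_fn(f, v, labels) + g3 * L3
```

The default weights g1 = 0.9, g2 = 0.1 and g3 = 1 follow the parameter analysis of Section 4.3.4.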
Fig. 3. Illustration of (a) LSTM and (b) gated recurrent units. (a) i, f and o are the input, forget and output gates, respectively; c and c̃ denote the memory cell and the new memory cell content. (b) r and z are the reset and update gates, and h and h̃ are the activation and the candidate activation.
3.2. Sentence encoder

In this section, we show how to obtain the text feature v from a sentence S. In [16], term frequency-inverse document frequency (TF-IDF) features are employed to represent the text information. This kind of feature cannot capture the sequential semantic patterns contained in a sentence. Thus, in this paper, we employ a recurrent neural network based text encoder. Next, we introduce how the text features are obtained in detail.

The LSTM shown in Fig. 3(a) has achieved considerable success in deep learning and has been widely used in natural language processing, while the gated recurrent unit (GRU) shown in Fig. 3(b) is a modified version of the LSTM. Both models are reviewed in [32]. Similarly to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, but without a separate memory cell. The activation $h_t^j$ of the GRU at time t is a linear interpolation between the previous activation $h_{t-1}^j$ and the candidate activation $\tilde{h}_t^j$:

$$h_t^j = \big(1 - z_t^j\big)\, h_{t-1}^j + z_t^j\, \tilde{h}_t^j, \qquad (3)$$

where the update gate $z_t^j$ decides how much the unit updates its activation, or content. The update gate is computed by

$$z_t^j = \sigma\big(W_z w_t + U_z h_{t-1}\big)^j. \qquad (4)$$

This procedure of taking a linear sum between the existing state and the newly computed state is similar to the LSTM unit. The GRU, however, does not have any mechanism to control the degree to which its state is exposed, but exposes the whole state each time. The candidate activation is computed similarly to that of the traditional recurrent unit:

$$\tilde{h}_t^j = \tanh\big(W w_t + U(r_t \odot h_{t-1})\big)^j, \qquad (5)$$

where $r_t$ is a set of reset gates and $\odot$ denotes element-wise multiplication.
When off ($r_t^j$ close to 0), the reset gate effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state. The reset gate $r_t^j$ is computed similarly to the update gate:

$$r_t^j = \sigma\big(W_r w_t + U_r h_{t-1}\big)^j. \qquad (6)$$

Many works have compared the LSTM and the GRU and found their performances to be comparable. However, the GRU has a simpler structure and uses fewer matrix multiplications, which saves time on large training data. Therefore, our text representation is extracted with the GRU.
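As a concrete illustration, the sketch below implements the GRU update of Eqs. (3)-(6) and encodes a sentence by taking the last hidden state as the sentence feature p (cf. Section 4.2). The use of PyTorch and the dimensions V = 300 and H = 1024 are our own assumptions, not the authors' exact implementation.

```python
# Minimal GRU sentence encoder sketch (PyTorch assumed); sizes are illustrative.
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """Implements Eqs. (3)-(6): update gate z, reset gate r, candidate activation h~."""
    def __init__(self, V=300, H=1024):
        super().__init__()
        self.Wz, self.Uz = nn.Linear(V, H, bias=False), nn.Linear(H, H, bias=False)
        self.Wr, self.Ur = nn.Linear(V, H, bias=False), nn.Linear(H, H, bias=False)
        self.W,  self.U  = nn.Linear(V, H, bias=False), nn.Linear(H, H, bias=False)

    def forward(self, w_t, h_prev):
        z = torch.sigmoid(self.Wz(w_t) + self.Uz(h_prev))        # Eq. (4)
        r = torch.sigmoid(self.Wr(w_t) + self.Ur(h_prev))        # Eq. (6)
        h_cand = torch.tanh(self.W(w_t) + self.U(r * h_prev))    # Eq. (5)
        return (1 - z) * h_prev + z * h_cand                     # Eq. (3)

def encode_sentence(cell, word_embeddings, H=1024):
    """word_embeddings: (N, V) learned word vectors w_1..w_N of a sentence S."""
    h = torch.zeros(H)                    # initial hidden state
    for w_t in word_embeddings:           # iterate over the word sequence
        h = cell(w_t, h)
    return h                              # sentence feature p (last hidden state)
```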
3.3. Modality constraint

The $L_1$ term in Eq. (1) is designed to constrain the different modalities so that their complementary information can be captured in the embedded feature space. Specifically, we adopt the following contrastive loss between images and sentences:
$$L_1 = \sum_i \sum_k \max\{0,\ \varepsilon_1 - s(f_i, v_i) + s(f_i, v_k)\} + \sum_i \sum_k \max\{0,\ \varepsilon_1 - s(v_i, f_i) + s(v_i, f_k)\}, \qquad (7)$$

where $v_k$ is the feature of a contrastive (non-descriptive) sentence for the image embedding $f_i$, and vice versa for $f_k$. The contrastive terms are chosen randomly from the training set and resampled every epoch. The scoring function is defined as $s(f, v) = f \cdot v$, which corresponds to the cosine similarity.
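A sketch of how the constraint of Eq. (7) could be computed over a mini-batch is given below. Treating the other samples in the batch as the contrastive terms, and L2-normalizing the embeddings so that the dot product equals the cosine similarity, are our own simplifications of the random per-epoch resampling described above (PyTorch assumed).

```python
# Sketch of the multimodal contrastive constraint L1 of Eq. (7) (PyTorch assumed).
import torch

def modality_constraint(f, v, eps1=0.2):
    """f, v: (B, K) embedded image and sentence features of B matched pairs."""
    f = torch.nn.functional.normalize(f, dim=1)   # assumption: makes f.v the cosine similarity
    v = torch.nn.functional.normalize(v, dim=1)
    scores = f @ v.t()                            # scores[i, k] = s(f_i, v_k)
    pos = scores.diag().unsqueeze(1)              # s(f_i, v_i) for each matched pair
    # image-anchored and sentence-anchored hinge terms with margin eps1
    cost_img = torch.clamp(eps1 - pos + scores, min=0)         # rows: fixed image f_i
    cost_txt = torch.clamp(eps1 - pos + scores.t(), min=0)     # rows: fixed sentence v_i
    # do not penalize the matched pairs themselves
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_img = cost_img.masked_fill(mask, 0)
    cost_txt = cost_txt.masked_fill(mask, 0)
    return cost_img.sum() + cost_txt.sum()
```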
3.4. Discriminative constraint

The $L_2$ term in Eq. (1) encourages each modality (image or text) to have a more compact feature representation in the embedded feature space, so that the embedded features become more discriminative for the final event classification. Specifically, our discriminative constraint is based on the L2-norm contrastive loss originally proposed by Hadsell et al. [33] for dimensionality reduction. The contrastive loss is applied to both the text and the image features:

$$L_2 = \sum_{i,j,k} \big[\, l_v\, E_u(v_i, v_j) + (1 - l_v)\, E_d(v_i, v_k) \,\big] + \sum_{i,j,k} \big[\, l_f\, E_u(f_i, f_j) + (1 - l_f)\, E_d(f_i, f_k) \,\big]. \qquad (8)$$
Here l is a 0/1 indicator: l = 1 means that the two vectors are from the same class, and l = 0 means that they are from different classes. The terms $E_u(f_i, f_j)$ and $E_u(v_i, v_j)$ encourage image representations and sentence representations of the same class to have small distances, while $E_d(f_i, f_k)$ and $E_d(v_i, v_k)$ constrain the single-modality feature representations of different classes. They are formulated as follows:

$$E_d(f_i, f_k) = \tfrac{1}{2} \max\big(0,\ \varepsilon_2 - D^2(f_i, f_k)\big)^2, \qquad (9)$$
$$E_d(v_i, v_k) = \tfrac{1}{2} \max\big(0,\ \varepsilon_2 - D^2(v_i, v_k)\big)^2, \qquad (10)$$
$$E_u(f_i, f_j) = \tfrac{1}{2} D^2(f_i, f_j), \qquad (11)$$
$$E_u(v_i, v_j) = \tfrac{1}{2} D^2(v_i, v_j), \qquad (12)$$

where $f_i$, $f_j$ are from the same class while $f_i$, $f_k$ are from different classes, and the same holds for the text features $v_i$, $v_j$, $v_k$. $D(a, b) = \|a - b\|_2$ denotes the Euclidean distance between vectors a and b.
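The discriminative constraint of Eqs. (8)-(12) could be computed over the labelled pairs of a mini-batch roughly as follows; forming the same-class/different-class pairs from the batch labels and the default margin eps2 = 1 are our own illustrative choices (PyTorch assumed).

```python
# Sketch of the discriminative constraint L2 of Eqs. (8)-(12) (PyTorch assumed).
import torch

def E_u(a, b):
    """Eqs. (11)/(12): pull same-class features together, 0.5 * D^2(a, b)."""
    return 0.5 * (a - b).pow(2).sum(dim=-1)

def E_d(a, b, eps2=1.0):
    """Eqs. (9)/(10): push different-class features apart up to margin eps2."""
    d2 = (a - b).pow(2).sum(dim=-1)                          # D^2(a, b)
    return 0.5 * torch.clamp(eps2 - d2, min=0).pow(2)

def discriminative_constraint(f, v, labels, eps2=1.0):
    """f, v: (B, K) embedded image/text features; labels: (B,) event classes."""
    same = labels.unsqueeze(0) == labels.unsqueeze(1)        # indicator l for every pair
    i, j = torch.triu_indices(len(labels), len(labels), offset=1)
    l = same[i, j].float()
    loss_v = l * E_u(v[i], v[j]) + (1 - l) * E_d(v[i], v[j], eps2)
    loss_f = l * E_u(f[i], f[j]) + (1 - l) * E_d(f[i], f[j], eps2)
    return (loss_v + loss_f).sum()
```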
4. Experiment

In this section, we first introduce the two datasets we collected for multimodal event classification (MES). Then we describe the implementation details. Finally, we report and analyze the evaluation results.

4.1. Dataset
YFCC-MES: The first dataset is based on the publicly available Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset. The YFCC100M dataset contains 99,206,564 images from Flickr along with their corresponding metadata, including title, description, camera type, tags, and geotags when available. This dataset cannot be used for event classification directly because it does not have any category annotations. Thus, from this large dataset, we select nearly 200 K images of 13 social events (Birthday, Christmas, Concert, Cruise, Easter, Exhibition, Graduation, Halloween, Hiking, Road trip, St Patrick Day, Skiing, Wedding [15]) together with their corresponding titles and descriptions. Specifically, we use the event names mentioned above as keywords to search the YFCC dataset. Although the event dataset in [15] also comprises these 13 social events, it only contains photos without any text descriptions. Here we briefly introduce how the new multimodal event dataset is collected. The first step is selecting the information type (title, description, camera type, tags, or geotags). Some other datasets, such as [16], select the title as the text information of a photo. In [34], the tags, title, and description of a photo are concatenated as the text information for landmark analysis. In our dataset, however, each photo has many tags, which would decrease the precision of event classification, and the description is more precise than the title and contains more abundant concepts. For example, for an image of two children sitting in front of a Christmas tree, the title is "Christmas Morning" while the description is "Nothing beats the smiles of kids on Christmas morning". Therefore, there is no need to concatenate the three types of text information, and descriptions are the most appropriate text information. After fixing the text type, the next step is data denoising. Since the metadata are downloaded directly from Flickr, the descriptions inevitably contain unknown numbers, marks and emotional symbols, which would affect the quality of the event dataset. We therefore filter out all these unusual symbols and abnormal letters, and we delete sentences that have fewer than three or more than fifteen words. We also replace all event names in the description sentences with the unknown symbol "UNK" to prevent the prospective event classifier from using them to perfectly predict the event category. Only photos that meet all requirements of the above two processing steps are selected into our multimodal event dataset. The number of photos finally obtained for each event class is shown in Table 1, together with the train/dev/test split. Fig. 4 shows several examples of the photos and their descriptions in the collected multimodal event dataset.
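For illustration only, the description cleaning described above (symbol filtering, the 3-15 word length constraint, and the replacement of event names with "UNK") could be implemented along the following lines; the regular expression and the helper function are hypothetical rather than the authors' released code.

```python
# Sketch of the description denoising described above (illustrative only).
import re

EVENT_NAMES = ["Birthday", "Christmas", "Concert", "Cruise", "Easter", "Exhibition",
               "Graduation", "Halloween", "Hiking", "Road trip", "St Patrick Day",
               "Skiing", "Wedding"]

def clean_description(text):
    """Return the cleaned description, or None if the sentence should be discarded."""
    # strip numbers, marks and other unusual symbols, keeping plain words
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # replace event names with the unknown symbol "UNK"
    for name in EVENT_NAMES:
        text = re.sub(name, "UNK", text, flags=re.IGNORECASE)
    words = text.split()
    # keep only sentences with 3 to 15 words
    if len(words) < 3 or len(words) > 15:
        return None
    return " ".join(words)
```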
Table 1. Overview of the collected YFCC-MES dataset.

Event-Name        #Training   #Development   #Test    Total
Birthday          15,677      5,359          5,174    26,210
Christmas         22,411      7,319          7,501    37,231
Concert           9,683       3,310          3,274    16,267
Cruise            7,808       2,573          2,621    13,002
Easter            9,410       3,064          3,027    15,501
Exhibition        4,439       1,422          1,434    7,259
Graduation        4,605       1,563          1,633    7,801
Halloween         7,054       2,378          2,390    11,822
Hiking            2,428       838            797      4,063
Road trip         2,514       813            850      4,177
St Patrick Day    1,690       559            614      2,863
Skiing            866         287            233      1,386
Wedding           26,495      8,876          8,813    44,184
Total             115,080     38,244         38,920   192,244
Fig. 4. Image and text examples in our collected YFCC-MES dataset.
SED2014-MES: The second dataset we collected is based on a publicly available benchmark dataset of the Social Event Detection (SED) task from 2014 (SED 2014). The SED 2014 dataset contains 473,119 images, and each image is accompanied by metadata typically found on the social web (including time-stamps, tags, and, for a small subset, geotags). This dataset is only weakly annotated and is exclusively used for event detection. We select nearly 260 K images for 9 of the most common social events (Music, Tour, Theater, Festival, Conference, Parade, Protest, Show, Game). For this dataset, we choose tags as the text information of the photos, and we adopt the same data processing scheme as for the YFCC-MES dataset. The number of photos finally obtained for each event class is shown in Table 2.
Table 2. Overview of the collected SED2014-MES dataset.

Event-Name    #Training   #Development   #Test    Total
Tour          12,083      8,129          8,129    28,341
Music         85,315      17,384         17,384   120,084
Theater       6,479       470            470      7,419
Conference    2,321       233            233      2,787
Parade        1,928       425            425      2,778
Protest       997         57             57       1,111
Show          18,855      2,545          2,544    23,944
Game          1,086       145            144      1,375
Festival      55,409      8,402          8,402    72,213
Total         184,473     37,790         37,788   260,051
4.2. Implementation details

The textual and visual features are extracted as follows. For the text features, two preprocessing steps are conducted: the documents are first segmented into separate words, and stop words are then removed. After preprocessing, the remaining words are used as the feature items of the event texts. We use the GRU described in Section 3.2 to represent the text features, and each sentence is represented by the output of the last hidden layer of the RNN. For the visual features, the representations are extracted from the fc7 layer of the popular AlexNet model [35]. The architecture of AlexNet is as follows.
Table 3. Precision of multimodal event classification on the YFCC-MES dataset.

Method             Precision on Training Set   Precision on Testing Set
Image&SVM          45.83                       44.49
Text&SVM           46.67                       44.50
DC_T&DC_I&SVM      34.8                        33.18
MC&SVM [31]        47.05                       46.92
MC&DC_T&Softmax    48.25                       47.35
MC&DC_I&Softmax    55.9                        55.12
Our DME            72.85                       60.79
The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3 and a stride of 4 pixels (the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer, the fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully connected layers have 4096 neurons each, so the dimension of the final visual feature vector is 4096.

4.3. Evaluation

We evaluate the proposed discriminative multimodal embedding (DME) method on the collected YFCC-MES and SED2014-MES datasets. Next, we present the performance together with some analysis.

4.3.1. Baselines

We compare the proposed DME model with several popular baselines. Two variants of our model are also compared to explore the impact of each term in Eq. (1). As shown in Table 3, there are six baselines for the proposed DME method. Image&SVM and Text&SVM perform event classification with an SVM classifier on image features and text features, respectively. These two simple baselines are useful since many images on social networks have no descriptions, titles or tags. The third, DC_T&DC_I&SVM, learns features with only the contrastive losses for images and texts. The fourth, MC&SVM [31], learns features with the contrastive loss on the multimodal image-text data. We employ an SVM as the event classifier for the above four baselines. The fifth, MC&DC_T&Softmax, is the proposed DME method without the discriminative constraint for images. The sixth, MC&DC_I&Softmax, is the proposed DME method without the discriminative constraint for texts.

4.3.2. Classification results

We evaluate all baselines with the precision metric. The results on the YFCC-MES dataset are shown in Table 3. We can see that the Image&SVM method has a rather low accuracy. This is because only visual features are used to decide the class labels of the event samples, and vision-based classification of social events is difficult since the same object or scene may appear in different events. The precision of the Text&SVM method is similar to the result based on image features. The results of these two baselines show that SVM classifiers based on a single modality do not perform well on event classification.
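To make the baseline setup concrete, the Image&SVM baseline could be reproduced roughly as follows; the use of scikit-learn, the linear SVM variant, the averaging of the precision metric and the file names are our own assumptions about details the paper does not specify.

```python
# Sketch of the Image&SVM baseline (scikit-learn assumed); file names are hypothetical.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_score

# fc7 AlexNet features (4096-dim) and event labels, prepared as in Section 4.2
X_train, y_train = np.load("fc7_train.npy"), np.load("labels_train.npy")
X_test, y_test = np.load("fc7_test.npy"), np.load("labels_test.npy")

clf = LinearSVC(C=1.0)                 # linear SVM on the visual features
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# the averaging choice for the precision metric is an assumption
print(precision_score(y_test, y_pred, average="micro"))
```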
Fig. 5. The precision result of each class. 1: Birthday 2: Christmas 3: Concert 4: Cruise 5: Easter 6: Exhibition 7: Graduation 8: Halloween 9: Hiking 10: Road Trip 11: St Patrick Day 12: Skiing 13: Wedding.
Table 4. Precision of multimodal event classification on the SED2014-MES dataset.

Method             Precision on Training Set   Precision on Testing Set
Image&SVM          27.9                        21.47
Text&SVM           32.85                       22.07
DC_T&DC_I&SVM      25.46                       22.18
MC&SVM [31]        43.71                       21.6
MC&DC_T&Softmax    47.68                       25.45
MC&DC_I&Softmax    46.78                       23.08
Our DME            41.7                        26.07
The DC_T&DC_I&SVM method achieves 33.18%, which is lower than the results of the first two baselines. For the MC&SVM method [31], the accuracy is increased by 13.74%, which indicates that the key to building discriminative feature vectors is how the different modalities are embedded. The MC&DC_T&Softmax and MC&DC_I&Softmax methods perform much better, and the latter improves the accuracy by 7.77% compared with the former. These results further demonstrate that the discriminative constraint reduces the intra-class difference and expands the inter-class variation of the image features. Compared with the mono-modality methods, Image&SVM and Text&SVM, the proposed DME method with multimodal features improves the accuracy by 16.29%. Compared with MC&SVM [31], which only uses the modality constraint on the embedded features, our DME method, which combines the multimodal constraint, the discriminative constraint and the explicit softmax loss, improves the performance by 13.87%. Overall, the proposed method shows the best performance in this experiment, and it is easy to conclude from Table 3 that the proposed discriminative multimodal embedding model has strong potential in social event analysis. In Fig. 5, we show the precision of the proposed DME for each event class. The Wedding event has the highest precision, because it has many more images than the other classes. Similarly, the performances on Birthday, Christmas, Concert and Cruise are also high (larger than 0.5), while the Road Trip event only has a precision of 0.17. These results indicate that too few images and noisy metadata lead to inferior performance. We thus conclude that two key factors affect the performance of social event classification: the number of images and the noise level of the metadata. The results on the SED2014-MES dataset are shown in Table 4.
Fig. 6. The feature visualization of three baselines and our proposed DME method.
We can see that the results in Table 4 are relatively low compared with those in Table 3. This is because the SED2014-MES dataset suffers from a serious class imbalance problem, where some classes have far fewer instances than others. For example, as shown in Table 2, the Music class has about 120 K instances while the Protest class has only about 1 K instances. The SED2014 dataset was originally designed for event detection, so it probably has much larger intra-class differences. For example, a photo of an "outside concert in summer in Beijing" probably has a very different visual appearance from another photo of an "inside concert in Argentina". In spite of the difficulty of the SED2014-MES dataset, our DME method still performs better than the other baselines. DC_T&DC_I&SVM performs better than the mono-modality baselines, and compared with MC&SVM, our DME improves the classification precision by about 5%. The MC&DC_T&Softmax and MC&DC_I&Softmax methods perform better than the mono-modality and MC&SVM methods. These results also demonstrate that the proposed DME method can decrease the intra-class difference and enlarge the inter-class variation for multimodal event classification.

4.3.3. Feature visualization

To qualitatively analyze the feature representations learned by the proposed DME method, we employ t-SNE [36] to visualize the learned features by reducing their dimension to 2. Here we show the features in the embedded feature space of all training data on the YFCC-MES dataset. The distribution of the features obtained by the Image&SVM method is shown in Fig. 6(a); the feature points are heavily confused. The DC_T&DC_I&SVM method, as shown in Fig. 6(b), produces better features than the Image&SVM method, which demonstrates that the visual-semantic feature is more discriminative for a specific event than a single modality. Fig. 6(c) shows the feature distribution of the MC&DC_I&Softmax method.
We can see that features of the same class are gradually clustered together. Fig. 6(d) shows the feature distribution of our proposed DME: the points of the same class are clearly gathered together and can be distinguished. These results qualitatively demonstrate that the feature vectors embedded by the proposed DME method are discriminative for event classification. To verify the effectiveness of the adopted modality constraint, we also visualize the embedded features used by the MC&SVM method. Fig. 7 shows the features for different values of the parameter ε1. When ε1 = 0.002, the feature representations are totally confused, and when ε1 = 0.02 the distribution does not get better; when ε1 = 0.2, most of the samples of different classes can be easily distinguished. As illustrated in Eq. (7), ε1 controls the distance margin between the paired image-text and the unpaired image-text. The modality constraint with a large ε1 forces the unpaired image-text to have a much larger distance than the paired image-text. Thus ε1 is always set to a large value in the experiments.

4.3.4. Parameter analysis

Several parameters play significant roles in the proposed model. In this section, we show their impact on the performance. To save time, we randomly select 20% of the training and test instances of the YFCC-MES dataset for this experiment. The parameters γ1, γ2 and γ3 in Eq. (1) determine the weights of the modality constraint, the discriminative constraint and the softmax loss. In Fig. 8, we show the accuracies for different values of γ2 with γ3 = 1, because the softmax loss is indispensable for classification; meanwhile, we constrain γ1 + γ2 = 1. We can see that γ1 and γ2 have a small impact on the performance of our DME model, and the results are overall quite stable. Based on these results, it is good to set γ1 to 0.8 or 0.9 and γ2 to 0.2 or 0.1.
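For reference, the two-dimensional feature visualizations of Section 4.3.3 (Figs. 6 and 7) can be reproduced with a t-SNE sketch along the following lines; the use of scikit-learn, matplotlib and the array names are our own assumptions.

```python
# Sketch of the t-SNE feature visualization used in Section 4.3.3 (scikit-learn assumed).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embedded training features (N, K) and their event labels (N,); file names are hypothetical
features = np.load("embedded_train_features.npy")
labels = np.load("train_labels.npy")

points = TSNE(n_components=2).fit_transform(features)   # reduce the dimension to 2
plt.scatter(points[:, 0], points[:, 1], c=labels, s=2, cmap="tab20")
plt.title("t-SNE of the embedded training features (YFCC-MES)")
plt.show()
```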
Fig. 7. The feature visualization of the MC&SVM method with different values of ε1.
The parameter ε1 plays an important role in the modality constraint, while ε2 is also important for the discriminative constraint. For the experiment shown in Fig. 9, we report the accuracies of the proposed DME for different values of ε1 and ε2. We can see that better performance is achieved when ε2 is set to 1 or 10; with an appropriate value of ε2, ε1 can be set to any value from 0.02 to 200. Besides, some other combinations of ε1 and ε2 can also obtain good performance, such as ε1 = 20 and ε2 = 100.
Fig. 8. Accuracy versus different γ2 while γ1 is equal to 1 − γ2. γ3 is fixed to 1.
5. Conclusion

In this paper, we propose a discriminative multimodal embedding model for event classification in user-generated photos. The proposed DME method fuses data with multiple modalities by using the multimodal and discriminative constraints in a unified framework, learning the representation together with the classifier jointly. In future work, we will extend our method to fuse social media data with more than two modalities. We will also try to resolve the class imbalance problem in the proposed multimodal embedding framework.

Declaration of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Fig. 9. Accuracy versus different ε1 and ε2.
Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (No. 2017YFB1002804), the National Natural Science Foundation of China (Nos. 61702511, 61720106006, 61728210, 61751211, 61620106003, 61632007, 61532009, 61572498, 61572296, 61432019, U1836220, U1705262) and the Key Research Program of Frontier Sciences, CAS (Grant No. QYZDJSSWJSC039). This work was also supported by the Research Program of the National Laboratory of Pattern Recognition (No. Z-2018007) and the CCF-Tencent Open Fund.

References
[1] A. Scherp, R. Jain, M.S. Kankanhalli, V. Mezaris, Modeling, detecting, and processing events in multimedia, in: Proceedings of the ACM Multimedia, ACM, 2010, pp. 1739–1740.
[2] G. Petkos, S. Papadopoulos, V. Mezaris, Y. Kompatsiaris, Social event detection at mediaeval 2014: Challenges, datasets, and evaluation, in: Proceedings of the MediaEval, in: CEUR Workshop Proceedings, 1263, CEUR-WS.org, 2014.
[3] T. Reuter, S. Papadopoulos, G. Petkos, V. Mezaris, Y. Kompatsiaris, P. Cimiano, C.M.D. Vries, S. Geva, Social event detection at mediaeval 2013: Challenges, datasets, and evaluation, in: Proceedings of the MediaEval, in: CEUR Workshop Proceedings, 1043, 2013.
[4] C.S. Firan, M. Georgescu, W. Nejdl, R. Paiu, Bringing order to your photos: event-driven classification of Flickr images based on social knowledge, in: Proceedings of the CIKM, ACM, 2010, pp. 189–198.
[5] T. Reuter, P. Cimiano, Event-based classification of social media streams, in: Proceedings of the ICMR, ACM, 2012, p. 22.
[6] X. Liu, B. Huet, Heterogeneous features and model selection for event-based media classification, in: Proceedings of the ICMR, ACM, 2013, pp. 151–158.
[7] L. Cao, J. Luo, H.A. Kautz, T.S. Huang, Image annotation within the context of personal photo collections using hierarchical event and scene models, IEEE Trans. Multimed. 11 (2) (2009) 208–219.
[8] J. Yuan, J. Luo, Y. Wu, Mining compositional features from GPS and visual cues for event recognition in photo collections, IEEE Trans. Multimed. 12 (7) (2010) 705–716.
[9] X. Yang, T. Zhang, C. Xu, Cross-domain feature learning in multimedia, IEEE Trans. Multimed. 17 (1) (2015) 64–78.
[10] A. Rosani, G. Boato, F.G.B.D. Natale, Eventmask: a game-based framework for event-saliency identification in images, IEEE Trans. Multimed. 17 (8) (2015) 1359–1371.
[11] R. Mattivi, J. Uijlings, F. de Natale, N. Sebe, Exploitation of time constraints for (sub-) event recognition, in: Proceedings of the ACM Workshop on Modeling and Representing Events, 2011, doi:10.1145/2072508.2072511.
[12] N. Imran, J. Liu, J. Luo, M. Shah, Event recognition from photo collections via pagerank, in: Proceedings of the ACM Multimedia, ACM, 2009, pp. 621–624.
[13] S. Tsai, T.S. Huang, F. Tang, Album-based object-centric event recognition, in: Proceedings of the ICME, IEEE Computer Society, 2011, pp. 1–6.
[14] S. Tsai, L. Cao, F. Tang, T.S. Huang, Compositional object pattern: a new model for album event recognition, in: Proceedings of the ACM Multimedia, ACM, 2011, pp. 1361–1364.
[15] L. Bossard, M. Guillaumin, L.J.V. Gool, Event recognition in photo collections with a stopwatch HMM, in: Proceedings of the ICCV, 2013, pp. 1193–1200.
[16] M. Zeppelzauer, D. Schopfhauser, Multimodal classification of events in social media, Image Vis. Comput. 53 (2016) 45–56.
[17] T. Sutanto, R. Nayak, ADMRG @ mediaeval 2013 social event detection, in: Proceedings of the MediaEval, in: CEUR Workshop Proceedings, 1043, CEUR-WS.org, 2013.
[18] I. Gupta, K. Gautam, K. Chandramouli, Vit@mediaeval 2013 social event detection task: Semantic structuring of complementary information for clustering events, in: Proceedings of the MediaEval, in: CEUR Workshop Proceedings, 1043, CEUR-WS.org, 2013.
[19] Y. Xiong, K. Zhu, D. Lin, X. Tang, Recognize complex events from static images by fusing deep channels, in: Proceedings of the CVPR, IEEE Computer Society, 2015, pp. 1600–1609.
[20] L. Wang, Z. Wang, W. Du, Y. Qiao, Object-scene convolutional neural networks for event recognition in images, in: Proceedings of the CVPR Workshops, IEEE Computer Society, 2015, pp. 30–35.
[21] D.R. Hardoon, S. Szedmák, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput. 16 (12) (2004) 2639–2664.
[22] N. Rasiwasia, P.J. Moreno, N. Vasconcelos, Bridging the gap: query by semantic example, IEEE Trans. Multimed. 9 (5) (2007) 923–938.
[23] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G.R. Lanckriet, R. Levy, N. Vasconcelos, A new approach to cross-modal multimedia retrieval, in: Proceedings of the MM, ACM, 2010, pp. 251–260.
[24] M. Guillaumin, J.J. Verbeek, C. Schmid, Multimodal semi-supervised learning for image classification, in: Proceedings of the CVPR, IEEE Computer Society, 2010, pp. 902–909.
[25] N. Srivastava, R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, J. Mach. Learn. Res. 15 (1) (2014) 2949–2980.
[26] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A.Y. Ng, Multimodal deep learning, in: Proceedings of the ICML, 2011, pp. 689–696.
[27] J. Weston, S. Bengio, N. Usunier, Large scale image annotation: learning to rank with joint word-image embeddings, Mach. Learn. 81 (1) (2010) 21–35.
[28] A. Frome, G.S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, Devise: a deep visual-semantic embedding model, in: Proceedings of the NIPS, 2013, pp. 2121–2129.
[29] R. Socher, A. Karpathy, Q.V. Le, C.D. Manning, A.Y. Ng, Grounded compositional semantics for finding and describing images with sentences, TACL 2 (2014) 207–218.
[30] A. Karpathy, A. Joulin, F. Li, Deep fragment embeddings for bidirectional image sentence mapping, in: Proceedings of the NIPS, 2014, pp. 1889–1897.
[31] R. Kiros, R. Salakhutdinov, R.S. Zemel, Unifying visual-semantic embeddings with multimodal neural language models, CoRR abs/1411.2539 (2014).
[32] J. Chung, Ç. Gülçehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, CoRR abs/1412.3555 (2014).
[33] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: Proceedings of the CVPR (2), IEEE Computer Society, 2006, pp. 1735–1742.
[34] W. Min, B. Bao, C. Xu, Multimodal spatio-temporal theme modeling for landmark analysis, IEEE MultiMed. 21 (3) (2014) 20–29.
[35] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Proceedings of the NIPS, 2012, pp. 1106–1114.
[36] L. van der Maaten, G.E. Hinton, Visualizing high-dimensional data using T-SNE, J. Mach. Learn. Res. 9 (2008) 2579–2605.
Fan Qi received the bachelor’s degree in computer science from Beijing Jiaotong University. She is currently pursuing the Ph.D. degree at Hefei University of Technology.
Xiaoshan Yang received the Ph.D. degree at the Multimedia Computing Group, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. He is currently an assistant researcher at Institute of Automation, Chinese Academy of Sciences (CASIA). His research interests include multimedia analysis and computer vision.
Tianzhu Zhang (M'11) received the bachelor's degree in communications and information technology from Beijing Institute of Technology, Beijing, China, in 2006, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2011. Currently, he is an Associate Professor at the Institute of Automation, Chinese Academy of Sciences. His current research interests include computer vision and multimedia, especially action recognition, object classification and object tracking.
Changsheng Xu (M'97-SM'99-F'14) is a Professor in the National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences and Executive Director of the China-Singapore Institute of Digital Media. His research interests include multimedia content analysis/indexing/retrieval, pattern recognition and computer vision. He holds 30 granted/pending patents and has published over 200 refereed research papers in these areas. Dr. Xu is an Associate Editor of IEEE Trans. on Multimedia, ACM Trans. on Multimedia Computing, Communications and Applications and the ACM/Springer Multimedia Systems Journal. He received the Best Associate Editor Award of ACM Trans. on Multimedia Computing, Communications and Applications in 2012 and the Best Editorial Member Award of the ACM/Springer Multimedia Systems Journal in 2008. He served as Program Chair of ACM Multimedia 2009, and has served as associate editor, guest editor, general chair, program chair, area/track chair, special session organizer, session chair and TPC member for over 20 IEEE and ACM prestigious multimedia journals, conferences and workshops. He is an IEEE Fellow, IAPR Fellow and ACM Distinguished Scientist.