Multi-label learning with label relevance in advertising video




Neurocomputing 171 (2016) 932–948



Sujuan Hou a,b,c, Shangbo Zhou a,b,*, Ling Chen c, Yong Feng a,b, Karim Awudu b

a Key Laboratory of Dependable Service Computing in Cyber Physical Society, Ministry of Education, Chongqing University, Chongqing 400030, China
b College of Computer Science, Chongqing University, 174# Shazheng Street, Chongqing 400030, China
c Centre for Quantum Computation & Intelligent Systems, FEIT, University of Technology, Sydney, Australia

Article info

Article history: Received 10 November 2014; Received in revised form 26 June 2015; Accepted 12 July 2015; Available online 21 July 2015. Communicated by Yongdong Zhang.

Keywords: Multi-label learning; Advertising video; Label relevance

Abstract

The recent proliferation of videos has brought out the need for applications such as automatic annotation and organization, which could greatly benefit from the thematic content of the respective type of video. Unlike other kinds of video, an advertising video usually conveys a specific theme within a certain time period (e.g. drawing the audience's attention to a product or emphasizing the brand). Traditional multi-label algorithms may not work effectively on advertising videos, due mainly to their heterogeneous nature. In this paper, we propose a new learning paradigm that resolves the problems of traditional multi-label learning in advertising videos through label relevance. To address label relevance, we first assign each label a label degree (LD) that classifies all labels into three groups: first label (FL), important label (IL) and common label (CL). We then propose a Directed Probability Label Graph (DPLG) model to mine the most related labels from multi-label data with label relevance, in which the interdependency between labels is considered. In the implementation of DPLG, labels that appear only occasionally and inconspicuous co-occurrences between labels are effectively eliminated by λ-filtering and τ-pruning processes, respectively. Graph theory is then utilized in DPLG to acquire Correlative Label-Sets (CLSs), which are finally used to enhance multi-label annotation. Experimental results on advertising videos and several publicly available datasets demonstrate the effectiveness of the proposed method for multi-label annotation with label relevance. © 2015 Elsevier B.V. All rights reserved.

1. Introduction

In traditional label learning, each object is represented by a single instance and associated with a single label. In reality, however, many real-world objects carry multiple semantic meanings, and such an object should be assigned a set of corresponding labels to express its semantics. For example, a news document could cover several topics such as economy and politics; one image may contain various information, such as building, sky, street and pedestrian; a video may express its content through various objects, concepts or situations. All of these belong to the field of multi-label learning. At present, the study of multi-label learning has attracted considerable attention from the research community [1–7] and has found extensive application in areas such as text classification [8,9], scene classification [10,11], automatic web page categorization [12], image classification [13–15], video annotation [16] and analgesia effect prediction [17]. Existing multi-label algorithms fall into the two groups

* Corresponding author at: College of Computer Science, Chongqing University, 174# Shazheng Street, Chongqing 400030, China. Tel.: +86 13132399578. E-mail address: [email protected] (S. Zhou).

http://dx.doi.org/10.1016/j.neucom.2015.07.022 0925-2312/© 2015 Elsevier B.V. All rights reserved.

proposed in [1,2]: (i) problem transformation, and (ii) algorithm adaptation. The first group tackles the multi-label learning problem by transforming it into other well-established learning scenarios, such as binary classification (e.g. the binary relevance (BR) algorithm [18], its extension the classifier chain (CC) algorithm [19] and the probabilistic classifier chain (PCC) algorithm [20]), label ranking (e.g. Calibrated Label Ranking [21], Mr.KNN [22]), or multi-class classification (e.g. Random k-label sets [23] and LP [24]). The second group solves multi-label problems by extending specific learning techniques to deal with multi-label data. For instance, BoosTexter [25] and Adtboost.MH [26] adopt boosting-style algorithms, Rank-SVM [27] utilizes the kernel method, ML-KNN [12] employs both Bayesian reasoning and lazy learning, BCC [28] implements both Bayesian networks and the chain classifier technique, Sorted Label Classifier Chains [29] uses a topology-based algorithm, CDN [30] develops a conditional dependency network by adapting graph theory and sampling techniques, PLST [31] addresses multi-label classification from the hypercube view, and Local Pairwise Label Correlation (LPLC) [32] constructs a Bayesian model for online social media content categorization. Due to the recent proliferation of multimedia content, managing a large number of images and videos is challenging. Besides


the multi-label learning techniques, other similar label learning techniques, called automatic tagging/annotation, are also known to solve this problem and to facilitate searching, organizing and accessing. They include image tagging/annotation [33–38] and video tagging/annotation [39,40], among others. Although there have been many studies in the video field, most of them focus on other kinds of video (e.g. movies, sports videos and news reports). To the best of our knowledge, there are few previous works on multi-label learning for advertising videos. Existing works on multimedia advertising mainly focus on contextual advertising [41–47], aiming to associate advertisements with appropriate texts, images, videos or users. Two aspects of research are involved: contextual relevance, which determines the selection of relevant advertisements, and contextual intrusiveness, which is the key to detecting appropriate ad insertion positions within multimedia content. Unlike other types of video, advertising videos have unique features: a strict time restriction (necessitating the condensation of special effects such as the usage of color, camerawork, viewpoint, sound and rhythm), drawing the audience's attention to products (by utilizing artifacts such as cartoons and ad spokesmen), attracting as many potential purchasers as possible, and emphasizing the brand (logo or trademark). The research into multi-label learning in advertising videos is motivated by the difficulty of concept ambiguity and diversity. However, traditional multi-label learning methods are not applicable to advertising videos, based on the following observations. On one hand, in annotating an advertising video, a logo usually plays an essential role: it is usually associated with a particular service or product and should not be treated equally with other labels.
In this case, a "car" label cannot be assigned to a trousers advertising video just because a car appears in one or several frames of the video; therefore, traditional multi-label learning methods may not work effectively in this situation. Some specific examples are shown in Fig. 1, which contains three images from three different advertising videos of "Coca-Cola". In traditional multi-label learning, labels may be assigned as "shoes" and "ground" in image (a), "water" and "hair" in image (b), and "food" and "bowl" in image (c). Obviously, the labels obtained by traditional multi-label learning approaches make little sense for the advertising theme and cannot effectively annotate the advertising videos. On the other hand, labels are usually not independent (e.g. the appearance of one label may influence the occurrence of other labels); for instance, a logo often appears in an advertising video superimposed on objects such as geometrical renderings, shirts of persons, jerseys of players, shop boards, billboards, or posters in


sports playfields. The labels for these superimposed objects also possess extra significance. Taking Fig. 2, extracted from a "Coca-Cola" advertising video, as another example: there are several objects in it, and labels such as "bottle", "shoes" and "ground" could be learned to describe it. However, none of the labels in the video is related to the goal of the advertisement except "bottle", which acts as the superimposed object carrying the label "Coca-Cola". Owing to these differences between advertising videos and other kinds of videos, and based on the above observations, traditional multi-label learning or tagging approaches may not work well for this kind of video, and a new learning paradigm is required to solve the problem of multi-label learning in advertising videos effectively. In addition, there are several other motivations for research on advertising video. Considering the high cost of promoting commercial products and trademarks, both sponsors and media agencies are keen on gaining the best advertising impact at a reasonable economic cost. Other motivations include marketing strategy analysis, e.g. evaluating, based on user behavior, the types of advertisement videos that consumers are interested in, or analyzing the advertisement models of competitors. Motivations may also relate to company brands: intelligent image analysis may be applied in a wide range of economically relevant applications, such as searching for similar existing logos, discovering malicious use of logos, unveiling slightly varied logos designed to deceive customers, and obtaining logo display statistics. In summary, traditional multi-label learning or tagging approaches cannot work well due to the differences between advertising videos and other kinds of video. In this paper, we

Fig. 2. Only part of the labels (e.g. Coca-Cola, bottle) are related to the concepts of advertisement videos.

Fig. 1. Images (a)–(c) are three different key frames from three different advertising videos of "Coca-Cola"; traditional multi-label learning does not work well for the advertising theme.


devise a solution to the problems of traditional multi-label learning in advertising videos through label relevance, as a new learning paradigm. Specifically, we present a Directed Probability Label Graph (DPLG) model aiming to annotate advertising videos effectively. In brief, we first assign each label its corresponding label degree (LD), classifying the labels into three groups, First Label (FL), Important Label (IL) and Common Label (CL), according to their importance. The interdependency between labels is then taken into account to build the DPLG model. Graph theory is finally employed in DPLG to search for Correlative Label-Sets (CLSs) that enhance multi-label learning with label relevance. The rest of this paper is organized as follows: Section 2 gives a brief overview of the DPLG model, and Section 3 illustrates its implementation in detail. Section 4 presents the experimental evaluation criteria, setup and performance. Section 5 concludes the paper.

2. Directed probability label graph model

In this section, a novel method for multi-label learning with label relevance, named Directed Probability Label Graph (DPLG), is presented. To begin, several notations are introduced to simplify the derivation of DPLG.

Definition 1. First Label (FL): The most important label in the learning paradigm is called the "First Label"; it is highly relevant to an advertising video's theme. In this model, we define the "logo" as the FL. In most cases the FL is a single label, since there is usually only a single kind of logo in one advertising video. If an advertising video contains multiple logos, the logo which appears most frequently is chosen as the FL.

Definition 2. Important Label (IL): A label which is highly correlated with the First Label and often appears simultaneously with the FL is defined as an "Important Label". In Fig. 3, the label "Coca-Cola" is the FL while the label "bottle" is an IL.

Definition 3. Common Label (CL): All labels other than the FL and ILs are called "Common Labels (CL)".

Definition 4. Label Degree (LD): The Label Degree reveals the importance of a label to the multi-label problem itself. The range of LD is $[0, 1]$. For any label $l$, the LD is defined as

$$LD(l) = \begin{cases} 1 & \text{if } l = FL \\ p(l \mid FL) & \text{otherwise} \end{cases} \tag{1}$$

where

$$p(l \mid FL) = p(l, FL)/p(FL) \tag{2}$$

The quantity $p(l \mid FL)$ in Eq. (2) is the posterior probability of $l$ given FL; both $p(l, FL)$ and $p(FL)$ can be estimated from sample data. Based on the above definition of Label Degree, any label $l$ can be assigned to a category, i.e. FL, IL or CL, according to

$$\delta(LD(l)) = \begin{cases} FL & \text{if } LD(l) = 1 \\ IL & \text{if } LD(l) \in [\theta, 1) \\ CL & \text{if } LD(l) \in [\lambda, \theta) \end{cases} \tag{3}$$

where $\delta: LD(l) \to \{FL, IL, CL\}$ is a function mapping the value of LD to a label category. Both $\lambda$ and $\theta$ are threshold values, which will be discussed in detail in a later section.

Definition 5. Correlative Label-Set (CLS): Let $I = \{l_1, l_2, \ldots, l_l\}$ be a subset of labels with $l$ labels, $I \subseteq V$, $|I| = l$. We call $I$ an $l$-correlative label set if every label $l_i \in I$ is learnable from every other label $l_j \in I$ ($l_i \neq l_j$) at a certain probability, and vice versa.

We now propose the DPLG model, illustrated in Fig. 4. A directed probability label graph $D$ is an ordered pair $\{V, U\}$:

$$D = \{V, U\} \tag{4}$$

where the vertex set is

$$V := D(V) = \{\tilde{l}_1, \tilde{l}_2, \ldots, \tilde{l}_k\} \tag{5}$$

Each element consists of two parts, a label from the set $\{v_1, v_2, \ldots, v_k\}$ and its corresponding label degree:

$$\tilde{l}_k = \{v_k : LD(v_k)\} \tag{6}$$

The edge set $U$ is a relationship matrix revealing the dependencies between labels, whose entries are given by an incidence function $\varphi_D(v_i, v_j)$:

$$U := D(U) = \begin{bmatrix} \varphi_D(v_1, v_1) & \cdots & \varphi_D(v_1, v_k) \\ \vdots & \ddots & \vdots \\ \varphi_D(v_k, v_1) & \cdots & \varphi_D(v_k, v_k) \end{bmatrix} \tag{7}$$

where

$$\varphi_D(v_i, v_j) = \begin{cases} p(v_j \mid v_i) & v_i \neq v_j \\ 0 & v_i = v_j \end{cases} \tag{8}$$

and $p(v_j \mid v_i) \in [0, 1]$ is the posterior probability of $v_j$ given $v_i$. It can be rewritten using the Bayesian rule as

$$p(v_j \mid v_i) = p(v_j, v_i)/p(v_i) \tag{9}$$

where $p(v_j, v_i)$ is the joint probability of labels $v_j$ and $v_i$, and $p(v_i)$ is the probability of label $v_i$. In DPLG, both the relationships between labels (relationship matrix $U$) and the label relevance (label set $V$) are taken into account.
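To make the definitions concrete, the following minimal Python sketch (not from the paper; the probabilities and the θ, λ values are invented for illustration) computes label degrees via Eqs. (1)–(2) and maps them to categories via Eq. (3):

```python
def label_degree(label, fl, p_joint, p_fl):
    """Eqs. (1)-(2): LD(l) = 1 if l is the First Label, else p(l | FL) = p(l, FL) / p(FL)."""
    if label == fl:
        return 1.0
    return p_joint.get(label, 0.0) / p_fl

def categorize(ld, theta, lam):
    """Eq. (3): map a label degree to FL / IL / CL; degrees below lambda get None."""
    if ld == 1.0:
        return "FL"
    if theta <= ld < 1.0:
        return "IL"
    if lam <= ld < theta:
        return "CL"
    return None

# Toy example: joint probabilities p(l, FL) and p(FL) estimated from training data
p_fl = 0.40
p_joint = {"car": 0.30, "truck": 0.05, "boat": 0.001}
theta, lam = 0.7, 0.1

for l in ["Audi", "car", "truck", "boat"]:
    ld = label_degree(l, "Audi", p_joint, p_fl)
    print(l, round(ld, 4), categorize(ld, theta, lam))
```

Labels whose LD falls below λ return None here; in the full approach they are discarded by the λ-filtering process of Section 3.2.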

3. Employing DPLG to multi-label learning with label relevance

Fig. 3. Example of FL (Coca-Cola) and IL (bottle) in one frame.

This section discusses how to apply DPLG model to generate the multi-label annotation with label relevance.


Fig. 4. The illustration of proposed DPLG model.

Fig. 5. Implementation of DPLG.


Fig. 6. Example of searching the Correlative Label-Sets (CLS).

Fig. 7. One example of logo datasets.

3.1. DPLG implementation

To start, some notations are introduced before giving the algorithm flowchart of DPLG. Let $\chi = \{x_i \mid 1 \le i \le M\}$ be the instance domain with $M$ instances and let $\Upsilon = \{y_1, y_2, \ldots, y_\varrho\}$ be the label domain containing $\varrho$ labels. Given training data $T = \{(x_1, Y_{x_1}), \ldots, (x_i, Y_{x_i}), \ldots \mid x_i \in \chi, Y_{x_i} \subseteq \Upsilon\}$, for any instance $x_i$ in $T$ there is an associated label set $Y_{x_i} \subseteq \Upsilon$. For all the labels in $T$, the dependency relationship matrix can be constructed from the training data as in Eq. (9), revealing the conditional probability between label pairs.

For each test instance $t$, there are two steps to obtain the annotated label set $Y_t$: the first step is to identify the FL, IL and CL to build the DPLG model, during which logo detection is


primarily executed utilizing SIFT features [48,49] to determine the FL. Then object detectors [50] are applied to obtain the latent IL and CL labels. After identifying the FL, IL and CL, the λ-filtering and τ-pruning processes are implemented to build the DPLG model for t. The second step is to get the CLSs from the DPLG model; the CLS that contains the FL is finally returned as the learned multi-labels for t. The algorithmic implementation of DPLG is given in Fig. 5. Overall, there are two stages in the proposed approach: a training stage and a test stage. To elaborate on the implementation: first, in the training stage, we find the pairwise label correlations in Υ (the label domain). We define a ϱ × ϱ matrix U (ϱ is the number of labels) for storing the label correlations; for further details, refer to the description of Eqs. (7)–(9). If two labels are strongly correlated, they must often co-occur. For instance, when U_ij = 1, y_j has the strongest pairwise correlation with y_i, meaning y_j is always accompanied by y_i. When U_ij = 0, there is no dependency relation between y_i and y_j, or y_i and y_j are the same label. Otherwise, the value of U_ij is between 0 and 1. It should be noted that U_ij may not equal U_ji. Meanwhile, in the test stage, for any test instance t, the first phase is to construct the DPLG model for t, involving the following steps. The first step is to construct the candidate labels C by utilizing the existing techniques of logo detection and object


detectors. We use a conservative threshold in order to minimize false positive detections. Some bad labels may still exist in C, mainly because this step partially relies on robust SIFT-based matching and high-precision object detectors. The second step is to compute the LD of each label ℓ ∈ C according to Eqs. (1) and (2) and the dependency matrix U. After obtaining the LD of ℓ, we further determine the FL, IL and CL according to the result of logo detection and Eq. (3). The third step is to implement the λ-filtering and τ-pruning processes to obtain the refined label set C_r; this step ensures that bad labels are effectively eliminated from C. From the perspective of graph theory, the λ-filtering and τ-pruning processes eliminate bad vertices (corresponding to infrequent labels in C) and low-weight edges (corresponding to weak label correlations in U), respectively. The fourth step is to construct the DPLG model for t, that is $D_t = \{V_t, U_t\}$, where $V_t = C_r$ and $U_t = \{U_{ij} \mid x_i, x_j \in C_r\}$. At this point, a directed probability label graph $D_t$ is finally built. The second phase is to search the CLS that includes the FL based on the DPLG $D_t$; this phase comprises two steps: the first is to obtain the CLS set (CLSs) by running Tarjan's algorithm on $D_t$, and the second is to choose the final $Y_t$. We can conclude that the labels in $Y_t$ are the ones that have a direct or indirect dependency with the FL.
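The training stage described above can be sketched as follows. This is a hypothetical illustration (the toy label sets are invented), estimating the dependency matrix U of Eqs. (7)–(9) from label co-occurrence counts:

```python
from collections import Counter
from itertools import permutations

def build_dependency_matrix(train_label_sets):
    """Estimate U[(y_i, y_j)] = p(y_j | y_i) = p(y_j, y_i) / p(y_i) from co-occurrence
    counts (Eqs. (7)-(9)); diagonal entries are 0 by Eq. (8)."""
    n = len(train_label_sets)
    single = Counter()   # how many instances contain each label
    pair = Counter()     # how many instances contain each ordered label pair
    for labels in train_label_sets:
        for y in labels:
            single[y] += 1
        for yi, yj in permutations(labels, 2):
            pair[(yi, yj)] += 1
    U = {}
    for yi in single:
        for yj in single:
            if yi == yj:
                U[(yi, yj)] = 0.0
            else:
                U[(yi, yj)] = (pair[(yi, yj)] / n) / (single[yi] / n)
    return U

# Hypothetical training data: label sets of four advertising videos
T = [{"Audi", "car"}, {"Audi", "car", "truck"}, {"Audi", "car"}, {"Audi", "truck"}]
U = build_dependency_matrix(T)
print(U[("Audi", "car")])   # p(car | Audi) = 3/4
```

Note that U is asymmetric, as the text points out: in the toy data p(car | Audi) = 0.75 while p(Audi | car) = 1.0, since every video containing "car" also contains "Audi".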

Fig. 8. Matching of "Audi" between the logo image and key frames from the advertising video; in each image, the left part is the logo image and the right part is the key frame from the advertising video.
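The logo-matching step compares local descriptors of the logo image against those of a key frame under a conservative threshold. Below is a self-contained sketch of nearest-neighbour descriptor matching with a ratio test; the 4-dimensional descriptors and the 0.7 ratio are invented stand-ins for real 128-dimensional SIFT vectors and the paper's (unstated) threshold:

```python
import math

def match_descriptors(logo_desc, frame_desc, ratio=0.7):
    """Nearest-neighbour matching with a Lowe-style ratio test: a logo descriptor is
    matched only if its best frame match is clearly closer than the second best,
    which keeps false positive matches low."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    matches = []
    for i, d in enumerate(logo_desc):
        ranked = sorted(range(len(frame_desc)), key=lambda j: dist(d, frame_desc[j]))
        best, second = ranked[0], ranked[1]
        if dist(d, frame_desc[best]) < ratio * dist(d, frame_desc[second]):
            matches.append((i, best))
    return matches

# Toy descriptors (real SIFT descriptors are 128-dimensional)
logo = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
frame = [(1.0, 0.1, 0.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.5, 0.5, 0.5, 0.5)]
print(match_descriptors(logo, frame))
```

Here the first logo descriptor matches frame descriptor 0 unambiguously, while the second fails the ratio test and is discarded.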


3.2. λ-Filtering and τ-pruning

Several preprocessing steps are indispensable before performing the proposed DPLG for multi-label learning with label relevance.

3.2.1. λ-Filtering

Given a directed probability label graph $D$ as in Eq. (4) of Section 2 and a threshold value $\lambda$, a graph $D_\lambda$ is obtained from $D$ by the λ-filtering process:

$$D_\lambda = \{V_\lambda, U_\lambda\} \tag{10}$$

where

$$V_\lambda := V(D_\lambda) = \{(v_i : LD(v_i)) \mid 1 \le i \le k,\ LD(v_i) \ge \lambda\}$$

3.2.2. τ-Pruning

Given a directed probability label graph $D$ as in Eq. (4) of Section 2 and a threshold value $\tau$, a graph $D_\tau$ is obtained from $D$ by the τ-pruning process:

$$D_\tau = \{V_\tau, U_\tau\} \tag{11}$$

where

$$U_\tau = \{\varphi_{D_\tau}(v_i, v_j) \mid \varphi_{D_\tau}(v_i, v_j) = p(v_j \mid v_i) \ge \tau,\ 1 \le i, j \le k,\ i \neq j\}$$

From the above definitions, we can see that labels which appear only occasionally and inconspicuous co-occurrences between labels are effectively eliminated by the λ-filtering and τ-pruning processes, respectively.

3.3. Searching correlative label-sets

In the proposed approach, searching correlative label-sets (CLSs) amounts to finding the Strongly Connected Component (SCC) that includes the FL in the DPLG model, which can be formulated as

$$\tilde{y} = \underset{SCC}{\arg}\ \big((SCC \subseteq CLSs) \cap (FL \in SCC)\big) \tag{12}$$

In graph theory there are several ways to obtain the SCCs of a directed graph, e.g. Kosaraju's algorithm [51], Tarjan's algorithm [52] and Gabow's algorithm [53]. Among these, Tarjan's algorithm [52] allows us to find correlative label-sets very quickly based on a depth-first search strategy; we give some theoretical basis for it below. According to graph theory [54], Tarjan's algorithm is established on the following theorem:

Fig. 9. Matching of "Chrysler" between the logo image and key frames from the advertising video; in each image, the left part is the logo image and the right part is the key frame from the advertising video.


Theorem 1. In any depth-first search, all the vertices within a strongly connected component lie in the same branch of the depth-first tree. In other words, a strongly connected component of a graph must be a sub-tree of the depth-first search tree.

Fig. 6 gives a DPLG model after the λ-filtering and τ-pruning processes:

$$D^* = \{V^*, U^*\}$$

where $V^*$ is the set of candidate labels and $U^*$ represents the relationships between the labels in $V^*$:

$$V^* = \{\tilde{l}_k = \{v_k : LD(v_k)\} \mid k \in \{a, b, c, d, e, f, g, h\}\}, \qquad U^* = \{\varphi_{D^*}(p, q) \mid p, q \in V^*\}$$


Two steps are implemented to complete the process of Eq. (12). The first step is to get all the CLSs in $D^*$ by adopting Tarjan's algorithm based on the depth-first search strategy. However, unlike Tarjan's algorithm, which begins from an arbitrary start node, the start node here is not arbitrary but the specific node representing the FL. In Fig. 6 the label a is assumed to be the FL. Starting from a, Tarjan's algorithm performs a single pass of depth-first search. It maintains a stack of vertices that have been explored by the search but not yet assigned to a component, and calculates a "low number" for each vertex (the index of the highest ancestor reachable in one step from a descendant of the vertex), which is used to determine when a set of vertices should be popped off the stack into a new component. Finally, three CLSs are obtained using Tarjan's algorithm: CLSs = {{a,b,c}, {d,f}, {e,g,h}}. The next task is to refine the CLS that includes the FL (here {a,b,c}) as the final multi-label set.
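The whole search of Sections 3.2–3.3 can be sketched as follows: λ-filter the vertices, τ-prune the edges, then run Tarjan's algorithm and return the SCC containing the FL. The graph, LD values and thresholds below are toy data, not the paper's:

```python
def find_cls(vertices, ld, edges, fl, lam, tau):
    """Sketch of Sections 3.2-3.3: lambda-filter vertices, tau-prune edges, then run
    Tarjan's algorithm and return the strongly connected component containing FL.
    edges[(u, v)] holds p(v | u)."""
    # lambda-filtering (Eq. 10): drop labels with LD below lambda
    V = {v for v in vertices if ld[v] >= lam}
    # tau-pruning (Eq. 11): drop weakly correlated directed edges
    adj = {v: [] for v in V}
    for (u, v), p in edges.items():
        if u in V and v in V and p >= tau:
            adj[u].append(v)

    # Tarjan's algorithm: single DFS pass with a stack and "low numbers"
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in adj[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in V:
        if v not in index:
            strongconnect(v)
    return next(c for c in sccs if fl in c)   # Eq. (12): the CLS containing FL

# Toy DPLG: FL "Audi" forms a cycle with "car"; "boat" falls below lambda,
# and the Audi -> truck edge falls below tau
ld = {"Audi": 1.0, "car": 0.75, "truck": 0.12, "boat": 0.002}
edges = {("Audi", "car"): 0.75, ("car", "Audi"): 1.0,
         ("Audi", "truck"): 0.12, ("truck", "Audi"): 0.5,
         ("car", "boat"): 0.3}
print(find_cls(ld.keys(), ld, edges, "Audi", lam=0.1, tau=0.4))
```

On this toy graph the surviving cycle is Audi ↔ car, so the returned CLS is {"Audi", "car"}; "truck" ends up in a singleton component and is excluded.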

Fig. 10. Parts of key frames with logo detection for “Audi”.


4. Empirical results and analysis

In this section, in order to measure the performance of the proposed DPLG approach, the effectiveness of multi-label learning with label relevance is evaluated via an empirical study on advertising videos and several publicly available datasets. We implement and compare the proposed DPLG with BR [18], its extension CC [19], ML-KNN [12] (since it outperforms BoosTexter, Adtboost.MH and Rank-SVM), BCC [28], CDN [30] and PLST [31].

4.1. Evaluation criteria for experiments To analyze the effectiveness of the proposed approach, six criteria, namely Hamming Loss, Rank Loss, Avg Precision,

ExampleF, MacroF and MicroF are used to measure the relative performance of DPLG. The related definitions are formulated as follows. We first identify a set of test instances $S = \{(x_i, \Upsilon_{x_i}) \mid 1 \le i \le N,\ \Upsilon_{x_i} \subseteq \Upsilon\}$. For each test instance $x$, $\Upsilon_x$ is the ground-truth label set while $O_x$ is the learned label set.

$$\mathrm{Hamming\ Loss}\ [2,12] = \frac{1}{N} \sum_{i=1}^{N} \left| \Upsilon_{x_i}\, \Delta\, O_{x_i} \right|$$

where $\Delta$ represents the symmetric difference between $\Upsilon_x$ and $O_x$. Hamming Loss measures how close $O_x$ is to $\Upsilon_x$; the smaller the value, the better the performance.

$$\mathrm{Rank\ Loss}\ [2,12] = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|\Upsilon_{x_i}|\,|\bar{\Upsilon}_{x_i}|} \left| \left\{ (y_1, y_2) \mid f(x, y_1) \le f(x, y_2),\ (y_1, y_2) \in \Upsilon_{x_i} \times \bar{\Upsilon}_{x_i} \right\} \right|$$

where $\bar{\Upsilon}_x$ denotes the complement of $\Upsilon_x$ in $\Upsilon$. Rank Loss measures the average fraction of label pairs that are reversely ordered for an instance; the smaller the value, the better the performance.

Fig. 11. Parts of key frames with logo detection for "CocaCola Zero".

$$\mathrm{Avg\ Precision}\ [2] = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|\Upsilon_{x_i}|} \sum_{y \in \Upsilon_{x_i}} \frac{\left| \{ y' \mid \mathrm{rank}_f(x, y') \le \mathrm{rank}_f(x, y),\ y' \in \Upsilon_{x_i} \} \right|}{\mathrm{rank}_f(x, y)}$$

Avg Precision measures the average fraction of relevant labels ranked higher than a particular label $y \in \Upsilon_x$; the bigger the value, the better the performance.

Let TP, TN, FP, FN stand for the number of true positive, true negative, false positive and false negative instances/labels, respectively; F


is defined as follows:

$$F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where precision and recall are given as follows:

$$\mathrm{precision} = \frac{TP}{TP + FP} \times 100\%, \qquad \mathrm{recall} = \frac{TP}{TP + FN} \times 100\%$$

ExampleF evaluates the system's performance according to its performance on each test instance, while the MacroF criterion estimates an approach's performance from the perspective of each label. Furthermore, the MicroF is computed

Fig. 12. Parts of key frames with logo detection for “ThinkPad, Lenovo, Intel”.


Fig. 13. Example of response objects for one video.


based on all the labels. The detailed definitions are given as follows:

$$\mathrm{ExampleF} = \frac{1}{N} \sum_{i=1}^{N} F(x_i), \quad x_i \in S$$

$$\mathrm{MacroF} = \frac{1}{\varrho} \sum_{j=1}^{\varrho} F(\Upsilon_j), \quad \Upsilon_j \in \Upsilon$$

where $\Upsilon_j$ is the $j$th label from the label domain $\Upsilon = \{\Upsilon_1, \Upsilon_2, \ldots, \Upsilon_\varrho\}$, and MicroF is the F-measure computed over the TP, FP and FN counts pooled across all $\varrho$ labels:

$$\mathrm{MicroF} = F\Big( \sum_{j=1}^{\varrho} TP_j,\ \sum_{j=1}^{\varrho} FP_j,\ \sum_{j=1}^{\varrho} FN_j \Big)$$
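As a concrete illustration of two of these criteria, the following sketch (with invented toy label sets) computes Hamming Loss in the unnormalized form given above and the example-based F-measure:

```python
def hamming_loss(true_sets, pred_sets):
    """Average size of the symmetric difference between the ground-truth and
    learned label sets, following the (unnormalized) form given above."""
    n = len(true_sets)
    return sum(len(t ^ p) for t, p in zip(true_sets, pred_sets)) / n

def example_f(true_sets, pred_sets):
    """ExampleF: mean F-measure computed per test instance."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        tp = len(t & p)
        prec = tp / len(p) if p else 0.0
        rec = tp / len(t) if t else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy ground-truth and learned label sets for three test instances
Y = [{"Audi", "car"}, {"CocaCola", "bottle"}, {"ThinkPad"}]
O = [{"Audi", "car"}, {"CocaCola"}, {"ThinkPad", "bag"}]
print(hamming_loss(Y, O))
print(example_f(Y, O))
```

The first instance is predicted exactly (F = 1), while the other two each miss or add one label, pulling both metrics away from their ideal values.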

4.2. Experiments

4.2.1. Experiments on advertising video

(1) Datasets: To carry out the proposed approach, two groups of datasets were collected from the web using a crawler: a logo dataset and an advertising video dataset. The logo dataset contains nearly 20k classes of different logos, from which 401 groups are chosen; each class is composed of several tens to a hundred logo images. An example for the logo "Audi" is shown in Fig. 7. The advertising video dataset includes 11,426 advertising videos for different products such as drinks, cosmetics, cars, footwear and digital devices (cellphones, computers, etc.).

(2) Preprocessing and searching FL: The primary task of preprocessing advertising videos is to segment them and characterize each shot by analyzing representative key frames. We use an existing adaptive thresholding approach [55] for shot segmentation, which uses color features in each frame to reflect changes among frames. To facilitate the task of multi-label annotation in advertising videos with label relevance, the indispensable step is to search for the FL. In our learning task, the FL is the logo detected from the advertising video. After video segmentation, the representative frames are extracted from each shot and logo detection is applied to the extracted key frames. After that, we classify all key frames into two categories: logo and non-logo. Several local descriptors have been used by many authors to support the detection of logos in real-world images. Among these local descriptors, SIFT [48,49], a typical local visual descriptor, has been proved able to capture sufficiently discriminative local elements, with invariance to geometric and photometric transformations and robustness to occlusions. In this paper, we use a set of SIFT feature descriptors to represent the logo and key frames. Suppose there are N feature points detected in a logo or video key frame; then F can be represented as

$$F = \{M_k, x_k, y_k, s_k, d_k \mid k \in \{1, 2, \ldots, N\}\}$$

where $M_k$ is a 128-dimensional local edge orientation vector of the SIFT point, and $x_k$, $y_k$, $s_k$, $d_k$ are the x- and y-positions, the scale and the dominant direction of the kth feature point, respectively. Logo matching is accomplished by comparing the set of local features representing the logo with the local features detected in the key frames. In order to minimize false positive detections [56], we utilize a conservative threshold, which gives good results in terms of robustness. Some examples of logo matches in advertising videos are shown in Figs. 8 and 9. In addition, several illustrative examples of logo detection in advertising videos are shown in Figs. 10–12.

(3) Labelling, from candidate to formal: In this section, we discuss how to get the candidate labels, how to build the DPLG model with the candidate labels, and how to obtain the formal labels by employing the DPLG model. In the field of image-related processing, image representation has become high-level-oriented [50,57–59], because low-level image features carry little semantic meaning while high-level representations convey much more semantic information, benefiting from powerful local descriptors. In this paper, the Object Bank (OB) representation [50] is adopted to facilitate the task of multi-label learning with label relevance. OB is an image representation constructed from the responses of a generalized object convolution. We use the responses of detected "general objects" based on SIFT features to constitute the label list for key frames, and the accumulative response of objects over all key frames to compose the candidate labels for one advertising video. The distribution of objects follows Zipf's law: a small proportion of object classes accounts for the majority of object instances. We employ 177 general object detectors [50] to detect objects in an advertising video; each detector was trained at 6 scales, 2 components and 3 spatial pyramid levels. Considering the relevance between labels, we add the logo as the FL of the advertising video, as elaborated in the section above (we set the FL to 0.5 in coordinates in our experiment). Based on observation of the training data, a label set containing 178 object classes is built, which can be found in the Appendix. As illustrated in Fig. 13, we provide one example of the object responses for one video, in which part (a) is the frame list of the example video, part (b) gives some of the key frames, part (c) describes the corresponding responses of the object detectors, and part (d) collects the accumulative object response of the example video over all key frames.

Based on the accumulative object response of the video, we can obtain the candidate labels and their occurrence numbers. Taking the advertising videos of "Audi" as an example, Fig. 14 demonstrates the object response of Audi's advertising videos, with the first bar as the FL. It is worth noting that our approach partially relies on the techniques of robust SIFT-based matching and high-precision object detection in the first step. Both techniques have been extensively studied and have demonstrated good performance in various domains and applications. For instance, in [60] the detection rate for logo matching is up to 94%, and in [50] the classification accuracy reaches 80% or more. When the matching and detection tasks do not work well, as illustrated in Fig. 15, where bad labels still appear in some frames, our algorithm can still obtain relatively true CLSs, thanks to the strategies we use to minimize false positive


Fig. 14. The object response of Audi's advertising video.


S. Hou et al. / Neurocomputing 171 (2016) 932–948

Fig. 15. An illustrative example of bad labels.

Table 1
The experimental results for different parameter combinations.

(θ, λ, τ)        Hamm-Loss  MacroF  MicroF  ExampleF  Avg-Prec  Rank Loss
(0.8, 0.1, 0.4)  0.0203     0.0260  0.3184  0.3811    0.2989    0.7207
(0.8, 0.2, 0.4)  0.0206     0.0211  0.2925  0.3703    0.2906    0.7319
(0.8, 0.3, 0.4)  0.0206     0.0211  0.2925  0.3703    0.2906    0.7319
(0.8, 0.1, 0.6)  0.0203     0.0256  0.3157  0.3798    0.2985    0.7221
(0.8, 0.1, 0.5)  0.0203     0.0263  0.3216  0.3827    0.2997    0.7195
(0.8, 0.4, 0.6)  0.0206     0.0211  0.2925  0.3703    0.2906    0.7319
(0.7, 0.1, 0.5)  0.0203     0.0263  0.3216  0.3827    0.2997    0.7195
(0.9, 0.1, 0.4)  0.0203     0.0260  0.3184  0.3811    0.2989    0.7207
(0.9, 0.2, 0.4)  0.0206     0.0211  0.2925  0.3703    0.2906    0.7319
(0.7, 0.2, 0.5)  0.0206     0.0211  0.2925  0.3703    0.2906    0.7319
(0.9, 0.1, 0.5)  0.0203     0.0263  0.3216  0.3827    0.2997    0.7195

Our approach minimizes false positive detections and filters out unpromising labels via two strategies: (1) for logo matching, we use a conservative threshold in order to minimize false positive detections; (2) for the object detectors, infrequent objects in the object response are removed by the λ-filtering and τ-pruning processes, and independent labels are further excluded from the CLSs by Tarjan's algorithm. Fig. 15 gives an illustrative example, in which part (a) lists some of the key frames of an advertising video, while parts (b) and (c) show two of those key frames. Each key frame in (a) is expressed as a 178-dimensional response vector obtained from the object detectors. After collecting the accumulated object responses of all key frames, we rank the labels by their frequency of appearance in the video and select the labels in the top 30% as the final raw label set, that is C = {"Audi", "truck", "car", "boat", "baggage", "basket", "bird"}. Calculating with Eqs. (1) and (2), the resulting label degrees are LD(car) = 0.7067 > θ, LD(truck) = 0.1163 > λ, LD(boat) = 0.0017 < λ, LD(baggage) = 0.0017 < λ and LD(basket) = 0.0677 < λ (here θ = 0.7, λ = 0.1). By assigning the identifiers FL, IL and CL with Eq. (3), and following the principles of λ-filtering and τ-pruning, we get FL = "Audi", IL = {"car"} and CL = {"truck"}. Thus, we build the DPLG model with the labels "Audi",

"truck" and "car". The final CLS also contains "Audi", "truck" and "car" after running Tarjan's algorithm. Moreover, when building the candidate labels, we obtain "lion", "turtle" and "car" from (b), and "kitchen" from (c). However, the infrequent labels such as "lion", "turtle" and "kitchen" do not appear in the final CLS; in other words, these infrequent labels are effectively eliminated at the preprocessing stage. As for labels such as "boat", "baggage", "basket" and "bird", DPLG treats them as inconspicuous labels with a weak dependency relationship to the first label, so they are eliminated by the λ-filtering and τ-pruning processes. Similarly, when processing other types of advertising videos, such as the "Coca-Cola" example in the introduction, labels like "shoes", "ground" and "bowl" can also be excluded from the final CLS by DPLG, despite the fact that they appear in several frames of the video. To evaluate the proposed approach, appropriate values of the parameters θ, λ and τ need to be worked out carefully. We therefore conduct 11 sets of experiments on part of the data to guide the choice of the three parameters. We report the experimental results for the different parameter combinations in Table 1, with the best result for each criterion in bold font.
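The label grading and CLS extraction just described can be sketched as below. The threshold semantics (LD > θ for IL, λ ≤ LD ≤ θ for CL, drop below λ) are our reading of the Audi example, and the directed graph is a toy instance, not the authors' implementation:

```python
def classify_labels(ld, theta=0.7, lam=0.1):
    """λ-filtering plus grading: drop labels with LD < λ; the rest
    become important labels (LD > θ) or common labels (λ ≤ LD ≤ θ)."""
    il = [l for l, v in ld.items() if v > theta]
    cl = [l for l, v in ld.items() if lam <= v <= theta]
    return il, cl

def tarjan_scc(graph):
    """Tarjan's algorithm: strongly connected components of the
    directed label graph that survives τ-pruning; each non-trivial
    component containing the FL is a candidate CLS."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]
    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)
    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

# Label degrees from the Audi example in the text.
ld = {"car": 0.7067, "truck": 0.1163, "boat": 0.0017,
      "baggage": 0.0017, "basket": 0.0677}
print(classify_labels(ld))   # IL and CL after λ-filtering

# Toy directed label graph over the surviving labels.
g = {"Audi": ["truck"], "truck": ["car"], "car": ["Audi"]}
print(tarjan_scc(g))         # a single component: Audi, truck, car
```

On this toy graph the three surviving labels form one strongly connected component, matching the final CLS {"Audi", "truck", "car"} reported above.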


Table 2
The comparison results between BR, CC, ML-KNN, CDN, BCC, PLST and DPLG.

Eval-Crit (mean)  Hamm-Loss  MacroF  MicroF  ExampleF  Avg-Prec  Rank Loss
BR                0.011      0.062   0.509   0.290     0.394     0.161
CC                0.011      0.062   0.501   0.276     0.465     0.198
ML-KNN            0.0174     0.0715  0.5347  0.4931    0.4124    0.5582
CDN               0.040      0.164   0.805   0.796     0.132     0.148
BCC               0.010      0.062   0.509   0.288     0.394     0.159
PLST              0.0146     0.0801  0.6019  0.5160    0.4398    0.5515
DPLG              0.0766     0.1847  0.8127  0.8033    0.6780    0.0976

Based on the experimental results, we can conclude that the proposed model is more sensitive to λ and τ than to θ; this may be because the final CLS depends largely on λ and τ rather than θ. Three parameter combinations, i.e. (0.7, 0.1, 0.5), (0.8, 0.1, 0.5) and (0.9, 0.1, 0.5), achieve the best performance. For the sake of simplicity, we use the parameter setting (θ = 0.7, λ = 0.1, τ = 0.5) in our subsequent experiments. We compare the proposed model with BR, CC, ML-KNN, CDN, BCC and PLST. In these experiments, we couple the BR and CC algorithms with linear support vector machines. For ML-KNN, as suggested in [12], "experimental results show that the number of nearest neighbors used by ML-KNN does not significantly affect the performance of the algorithm when k varies from 8 to 12", so we set the parameter k to a moderate value of 10. We use an SVM as the base classifier for CDN. BCC adopts the default parameter settings suggested in [28]. The sparsity parameter of the reconstruction algorithm is set as recommended by PLST [31]. The experimental results are shown in Table 2, with the best result for each criterion in bold font. According to the results in Table 2, DPLG achieves better performance than BR, CC, ML-KNN, CDN and PLST on all evaluation criteria (namely Hamming Loss, MacroF, MicroF, ExampleF, Avg Precision and Rank Loss). Compared to BCC, DPLG achieves reasonable performance on nearly all criteria except Hamming Loss.

4.2.2. Experiments on other datasets
In this section, we perform experiments on three publicly available benchmarks: scene, core5k and delicious. Scene and core5k belong to the image domain, while delicious is a collection of texts from the web.
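Two of the evaluation criteria used throughout the tables can be made concrete with a small sketch (illustrative function names; binary indicator matrices with rows as instances and columns as labels are assumed):

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted wrongly (lower is better)."""
    return float(np.mean(y_true != y_pred))

def example_f1(y_true, y_pred):
    """Example-based F1 (ExampleF): harmonic mean of per-instance
    precision and recall, averaged over instances."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = np.sum(t & p)               # correctly predicted labels
        denom = np.sum(t) + np.sum(p)
        scores.append(0.0 if denom == 0 else 2.0 * inter / denom)
    return float(np.mean(scores))

# Two instances, three labels; one slot is predicted wrongly.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(hamming_loss(y_true, y_pred))   # 1 wrong slot out of 6
print(example_f1(y_true, y_pred))
```

MacroF and MicroF differ only in where the averaging happens (per label vs. pooled over all label slots), and Rank Loss operates on label scores rather than hard predictions.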
Table 3 presents the details of the three benchmarks, in which "#instances" stands for the number of instances, "#features" for the number of features and "#labels" for the number of labels. These datasets and the related reference papers are available in the Mulan datasets,1 and each consists of both a training and a testing part. (1) Experiment setup: Before applying the proposed DPLG to the publicly available benchmarks, several questions need to be resolved, because these datasets have no obvious relevance structure among their labels. Q1: How do we determine FL, IL and CL? Unlike advertising videos, all labels in the public datasets are treated equally. In this case, the k-Nearest Neighbor algorithm (kNN, with k = 10 in our experiments) is used. Specifically, the label that occurs most often among the k nearest neighbors is taken as the FL; a label whose probability of co-occurrence with the FL is greater than 70% is determined to be an IL; and a label whose probability of co-occurrence with the FL lies between 30% and 70% is called a CL. Q2: How do we set up the experiments for DPLG and the other approaches?
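The kNN-based role assignment of Q1 can be sketched as follows (hypothetical function; `X` is the feature matrix, `Y` the binary label matrix, and the 0.7/0.3 cut-offs are the co-occurrence thresholds stated above):

```python
import numpy as np

def roles_from_neighbours(X, Y, query, k=10):
    """Pick the FL as the most frequent label among the k nearest
    neighbours of `query`, then grade the remaining labels by their
    co-occurrence rate with the FL (IL > 0.7, CL in [0.3, 0.7])."""
    d = np.linalg.norm(X - query, axis=1)
    nn = np.argsort(d)[:k]
    counts = Y[nn].sum(axis=0)
    fl = int(np.argmax(counts))                 # first label
    has_fl = Y[:, fl] == 1
    il, cl = [], []
    for j in range(Y.shape[1]):
        if j == fl:
            continue
        rate = Y[has_fl, j].mean() if has_fl.any() else 0.0
        if rate > 0.7:
            il.append(j)
        elif rate >= 0.3:
            cl.append(j)
    return fl, il, cl

# Toy run: label 0 dominates the neighbourhood, label 1 always co-occurs with it.
X = np.array([[0.0], [0.1], [5.0]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]])
print(roles_from_neighbours(X, Y, np.array([0.0]), k=2))  # → (0, [1], [])
```

With the roles fixed this way, the rest of the DPLG pipeline proceeds exactly as in the advertising-video case.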

1 http://mulan.sourceforge.net/datasets.html

Table 3
Descriptions of three publicly available benchmark datasets.

Name       #instances  #features  #labels
scene      2407        294        6
core5k     5000        499        374
delicious  16 105      500        983

In this section, 10-fold cross-validation is performed on the above benchmark datasets: in each run of the experiment, we randomly sample 90% of the dataset for training and reserve the rest for testing. We again compare DPLG with the BR, CC, ML-KNN, CDN, BCC and PLST algorithms. (2) Experimental results: We conduct the experiments on the three publicly available datasets (scene, core5k and delicious) from Mulan.2 In the experiments, we use the six algorithms (BR, CC, ML-KNN, CDN, BCC and PLST) for comparison on the three datasets. The experimental results are reported in Tables 4–6. As the results indicate, DPLG performs favourably compared with the other algorithms in the evaluation. Specifically, on scene, DPLG outperforms the other approaches in terms of MacroF and ExampleF, and achieves reasonable performance on the other criteria, e.g. MicroF and Avg Precision. On core5k, DPLG stands out in terms of Hamming Loss and Avg Precision. On delicious, DPLG is superior to the other approaches in terms of MacroF, MicroF, ExampleF and Avg Precision. In Table 6, "–" indicates that CDN and BCC are not suitable for large-scale datasets. From the above analysis, it can be concluded that DPLG performs well relative to the other six algorithms. We note that DPLG is especially suitable for large-scale datasets; this may be because the more data there is, the richer the prior statistical information.
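The repeated random 90/10 sampling used in each run can be sketched as follows (a minimal illustration with a hypothetical helper; a fixed seed is added only for reproducibility):

```python
import random

def random_splits(n, runs=10, train_ratio=0.9, seed=0):
    """Ten random 90/10 train/test index splits over n instances,
    as in the experiment setup above."""
    rng = random.Random(seed)
    splits = []
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(n * train_ratio)
        splits.append((idx[:cut], idx[cut:]))
    return splits

splits = random_splits(100)
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # → 10 90 10
```

Each algorithm is then trained on the 90% partition and evaluated on the held-out 10%, and the criteria are averaged over the ten runs (the ± values in Tables 4–6).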

5. Conclusion

In this paper, a Directed Probability Label Graph (DPLG) model is proposed to resolve the issue of label relevance in multi-label learning applied to advertising videos. Specifically, we first use a logo detection technique and trained object detectors with SIFT features to obtain the candidate label set for a target, and then classify the labels into FL, IL and CL according to their LD statistical probability information, after removing infrequent object labels from the object responses of the videos. We associate the value of each graph node with the LD of the corresponding label. At the same time, we use the conditional probability between label pairs to determine the edge weights among nodes, building the DPLG model

2 http://mulan.sourceforge.net/


Table 4
The comparison results between BR, CC, ML-KNN, CDN, BCC, PLST and DPLG on scene.

Eval-Crit  Hamm-Loss        MacroF           MicroF           ExampleF         Avg-Prec         Rank Loss
BR         0.106 ± 0.009    0.687 ± 0.024    0.683 ± 0.029    0.62 ± 0.032     0.385 ± 0.015    0.211 ± 0.023
CC         0.102 ± 0.011    0.717 ± 0.023    0.712 ± 0.031    0.716 ± 0.031    0.355 ± 0.015    0.175 ± 0.018
ML-KNN     0.0851 ± 0.0092  0.7369 ± 0.0208  0.7387 ± 0.0278  0.6874 ± 0.0311  0.8254 ± 0.0138  0.3169 ± 0.0332
CDN        0.22 ± 0.007     0.33 ± 0.021     0.366 ± 0.022    0.361 ± 0.02     0.427 ± 0.018    0.418 ± 0.018
BCC        0.108 ± 0.011    0.69 ± 0.03      0.685 ± 0.037    0.646 ± 0.041    0.373 ± 0.013    0.189 ± 0.024
PLST       0.1113 ± 0.0070  0.5843 ± 0.0282  0.5924 ± 0.0326  0.4588 ± 0.0346  0.7108 ± 0.0307  0.5452 ± 0.0347
DPLG       0.0924 ± 0.0076  0.7383 ± 0.0174  0.7328 ± 0.0210  0.7381 ± 0.0207  0.8249 ± 0.0130  0.2728 ± 0.0216

Table 5
The comparison results between BR, CC, ML-KNN, CDN, BCC, PLST and DPLG on core5k.

Eval-Crit  Hamm-Loss        MacroF           MicroF           ExampleF         Avg-Prec         Rank Loss
BR         0.016 ± 0        0.003 ± 0        0.202 ± 0.012    0.2 ± 0.012      0.014 ± 0.001    0.144 ± 0.004
CC         0.013 ± 0        0.047 ± 0.005    0.189 ± 0.012    0.177 ± 0.012    0.013 ± 0.001    0.322 ± 0.008
ML-KNN     0.0094 ± 0.011   0.0086 ± 0.0036  0.0287 ± 0.0117  0.0178 ± 0.0069  0.0251 ± 0.0055  0.9862 ± 0.0057
CDN        0.014 ± 0        0.02 ± 0.002     0.165 ± 0.009    0.16 ± 0.009     0.013 ± 0.001    0.33 ± 0.006
BCC        0.011 ± 0        0.04 ± 0.005     0.141 ± 0.009    0.111 ± 0.006    0.013 ± 0.001    0.345 ± 0.006
PLST       0.0094 ± 0.0001  0.0111 ± 0.0017  0.0801 ± 0.0081  0.0587 ± 0.0054  0.0525 ± 0.0042  0.7575 ± 0.0043
DPLG       0.0094 ± 0.0001  0.0013 ± 0.0003  0.0901 ± 0.0091  0.0919 ± 0.0091  0.0665 ± 0.0030  0.9446 ± 0.0030

Table 6
The comparison results between BR, CC, ML-KNN, CDN, BCC, PLST and DPLG on delicious.

Eval-Crit  Hamm-Loss        MacroF           MicroF           ExampleF         Avg-Prec         Rank Loss
BR         0.03 ± 0         0.008 ± 0        0.244 ± 0.002    0.234 ± 0.003    0.027 ± 0.004    0.173 ± 0.002
CC         0.02 ± 0         0.113 ± 0.005    0.215 ± 0.008    0.181 ± 0.007    0.026 ± 0.001    0.502 ± 0.002
ML-KNN     0.0182 ± 0.0002  0.0481 ± 0.0018  0.1738 ± 0.0047  0.1517 ± 0.0044  0.1213 ± 0.0033  0.8945 ± 0.0040
CDN        –                –                –                –                –                –
BCC        –                –                –                –                –                –
PLST       0.0184 ± 0.0002  0.0283 ± 0.0008  0.1423 ± 0.0058  0.1131 ± 0.0039  0.0983 ± 0.0030  0.9165 ± 0.0034
DPLG       0.0288 ± 0.0022  0.1593 ± 0.0070  0.4235 ± 0.0412  0.4233 ± 0.0437  0.4779 ± 0.0384  0.6019 ± 0.0434

after λ-filtering and τ-pruning. Finally, Tarjan's algorithm is employed to obtain the Correlative Label-Sets (CLSs). Experimental results on real-world advertising videos reveal the promising performance of DPLG. Furthermore, the results of applying DPLG to several publicly available datasets (i.e. scene, core5k and delicious) also demonstrate the effectiveness of the proposed approach. In fact, this approach is suitable not only for processing advertising videos but also for videos containing other salient symbols, not limited to logos.

Acknowledgment

The authors acknowledge the financial support of the National Natural Science Foundation of China (No. 61103114), the Frontier and Application Foundation Research Program of CQ CSTC (No. cstc2014jcyjA40037), the Fundamental Research Funds for the Central Universities (Nos. CDJZR13180089, CDJZR185502 and CDJZR188801) and the China Scholarship Council (CSC).

Appendix A. Object classes

0.5 Logo; 1 rug, carpet, carpeting; 2 shield; 3 tree; 4 truck, motortruck; 5 printer; 6 keyboard; 7 television, television system; 8 bookshelf; 9 plate; 10 French horn, horn; 11 helmet; 12 pup tent, shelter tent; 13 pot; 14 aquarium; 15 newspaper, paper; 16 building, edifice; 17 lamp; 18 desktop computer; 19 jersey, T-shirt, tee shirt; 20 blind, screen; 21 cabinet; 22 bus, autobus, coach, charabanc; 23 mug; 24 soil, dirt; 25 box; 26 shoe; 27 kitchen; 28 door; 29 vase; 30 balloon; 31 ball; 32 sofa, couch; 33 table; 34 television, telecasting; 35 lion; 36 basket, basketball hoop; 37 laptop; 38 eyeglasses; 39 electric refrigerator, fridge; 40 coral; 41 airplane; 42 table-tennis table, pingpong table; 43 rack; 44 buckle; 45 writing desk; 46 pool table, snooker table; 47 aircraft; 48 Ferris wheel; 49 roller coaster; 50 desk; 51 tower; 52 snail; 53 computer screen; 54 microwave, oven; 55 beach; 56 switch; 57 elephant; 58 bottle; 59 fork; 60 cow; 61 dressing table, dresser; 62 clam; 63 monkey; 64 bed; 65 button; 66 clock; 67 male horse; 68 soccer ball; 69 toilet seat; 70 curtain, drape; 71 car, auto; 72 cross; 73 cat, true cat; 74 loudspeaker; 75 bag; 76 cell, electric cell; 77 wheel; 78 dishwasher; 79 computer monitor; 80 turtle; 81 guitar; 82 stove; 83 car, elevator car; 84 window; 85 bench; 86 shelf; 87 umbrella; 88 train, railroad train; 89 sky; 90 wall; 91 baggage, luggage; 92 wedding gown; 93 duck; 94 railing, rail; 95 baseball; 96 suit of clothes; 97 drum; 98 radio, wireless; 99 bridge, span; 100 bear; 101 horse; 102 camera; 103 homo, human; 104 fruit; 105 goggles; 106 blanket, cover; 107 computer mouse; 108 motorcycle, bike; 109 gravel, crushed rock; 110 face veil; 111 towel; 112 shower curtain; 113 rabbit; 114 chair; 115 seashore, coast; 116 flower; 117 floor; 118 ship; 119 bathtub; 120 streetlight; 121 hat, chapeau; 122 filter; 123 snake, serpent; 124 key; 125 mirror; 126 cesspool, cesspit; 127 bus stop; 128 grass; 129 cupboard, closet; 130 basketball court; 131 people; 132 computer, computing device; 133 pool ball; 134 bouquet; 135 bird; 136 gallery, art gallery; 137 mouse; 138 bride; 139 skyscraper; 140 candle, taper; 141 animal, beast; 142 attire, dress; 143 propeller; 144 fence, fencing; 145 drawer; 146 public toilet; 147 mountain, mount; 148 faucet, spigot; 149 aqualung; 150 ocean; 151 dog, Canis familiaris; 152 bridegroom; 153 pen; 154 swing; 155 boot; 156 hook; 157 glove, mitt; 158 telephone; 159 garage; 160 sailboat, sailing boat; 161 cloud; 162 knife; 163 wing; 164 backboard; 165 basketball; 166 glove; 167 boat; 168 microphone, mike; 169 room light; 170 stick; 171 flipper, fin; 172 oxygen mask; 173 rock, stone; 174 sail; 175 light, light source; 176 squash racket; 177 saddle.

References [1] G. Tsoumakas, I. Katakis, Multi-label classification: an overview, Int. J. Data Warehous. Min. 3 (2007) 1–13. [2] M. Zhang, Z. Zhou, A review on multi-label learning algorithms, IEEE Trans. Knowl. Data Eng. 26 (2013) 1819–1837.


[3] T. Zhou, D. Tao, X. Wu, Compressed labeling on distilled labelsets for multilabel learning, Mach. Learn. 88 (2012) 69–126. [4] T. Zhou, D. Tao, Labelset anchored subspace ensemble (LASE) for multi-label annotation, in: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ACM, 2012, p. 42, Hong Kong. [5] G. Doquire, M. Verleysen, Mutual information-based feature selection for multilabel classification, Neurocomputing 122 (2013) 148–155. [6] M.-L. Zhang, Z.-J. Wang, MIMLRBF: RBF neural networks for multi-instance multi-label learning, Neurocomputing 72 (2009) 3951–3956. [7] H. Ma, E. Chen, L. Xu, H. Xiong, Capturing correlations of multiple labels: a generative probabilistic model for multi-label learning, Neurocomputing 92 (2012) 116–123. [8] T.-Y. Wang, H.-M. Chiang, Solving multi-label text categorization problem using support vector machine approach with membership function, Neurocomputing 74 (2011) 3682–3689. [9] A.F. De Souza, F. Pedroni, E. Oliveira, P.M. Ciarelli, W.F. Henrique, L. Veronese, C. Badue, Automated multi-label text categorization with VG-RAM weightless neural networks, Neurocomputing 72 (2009) 2209–2217. [10] M.R. Boutell, J. Luo, X. Shen, C.M. Brown, Learning multi-label scene classification, Pattern Recognit. 37 (2004) 1757–1771. [11] Z.-H. Zhou, M.-L. Zhang, Multi-instance multi-label learning with application to scene classification, in: Advances in Neural Information Processing Systems, 2006, pp. 1609–1616. [12] M.-L. Zhang, Z.-H. Zhou, ML-KNN: a lazy learning approach to multi-label learning, Pattern Recognit. 40 (2007) 2038–2048. [13] Z.-J. Zha, X.-S. Hua, T. Mei, J. Wang, G.-J. Qi, Z. Wang, Joint multi-label multi-instance learning for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), IEEE, 2008, pp. 1–8, Anchorage, Alaska, USA. [14] Z. Chen, Z. Chi, H. Fu, D. Feng, Multi-instance multi-label image classification: a neural approach, Neurocomputing 99 (2013) 298–306. [15] Z. Wei, H. Wang, R. Zhao, Semi-supervised multi-label image classification based on nearest neighbor editing, Neurocomputing 119 (2013) 462–468. [16] A. Dimou, G. Tsoumakas, V. Mezaris, I. Kompatsiaris, I. Vlahavas, An empirical study of multi-label learning methods for video annotation, in: Seventh International Workshop on Content-Based Multimedia Indexing (CBMI'09), IEEE, 2009, pp. 19–24, Chania, Crete. [17] G. Qu, H. Wu, C.T. Hartrick, J. Niu, Local analgesia adverse effects prediction using multi-label classification, Neurocomputing 92 (2012) 18–27. [18] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, in: Data Mining and Knowledge Discovery Handbook, Springer, US, 2010, pp. 667–685. [19] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, Mach. Learn. 85 (2011) 333–359. [20] W. Cheng, E. Hüllermeier, K.J. Dembczynski, Bayes optimal multilabel classification via probabilistic classifier chains, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 279–286. [21] J. Fürnkranz, E. Hüllermeier, E.L. Mencía, K. Brinker, Multilabel classification via calibrated label ranking, Mach. Learn. 73 (2008) 133–153. [22] X. Lin, X.-W. Chen, Mr. KNN: soft relevance for multi-label classification, in: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, ACM, 2010, pp. 349–358, Toronto, Canada. [23] G. Tsoumakas, I. Vlahavas, Random k-labelsets: an ensemble method for multilabel classification, in: Machine Learning: ECML 2007, Springer, 2007, pp. 406–417, Warsaw, Poland. [24] E.A. Cherman, M.C. Monard, J. Metz, Multi-label problem transformation methods: a case study, CLEI Electron. J. 14 (2011). [25] R.E. Schapire, Y. Singer, BoosTexter: a boosting-based system for text categorization, Mach. Learn. 39 (2000) 135–168.
[26] F. De Comité, R. Gilleron, M. Tommasi, Learning multi-label alternating decision trees from texts and data, in: Machine Learning and Data Mining in Pattern Recognition, Springer, 2003, pp. 35–49, Leipzig, Germany. [27] A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: Advances in Neural Information Processing Systems, 2001, pp. 681–687. [28] J.H. Zaragoza, L.E. Sucar, Bayesian chain classifiers for multidimensional classification, in: IJCAI, vol. 11, 2011, pp. 2192–2197. [29] X. Liu, Z. Shi, Z. Li, X. Wang, Z. Shi, Sorted label classifier chains for learning images with multi-label, in: Proceedings of the International Conference on Multimedia, ACM, 2010, pp. 951–954, Firenze, Italy. [30] Y. Guo, S. Gu, Multi-label classification using conditional dependency network, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), vol. 22, 2011, p. 1300. [31] F. Tai, H.-T. Lin, Multilabel classification with principal label space transformation, Neural Comput. 24 (2012) 2508–2542. [32] J. Huang, G. Li, S. Wang, Q. Huang, Categorizing social multimedia by neighborhood decision using local pairwise label correlation, in: 2014 IEEE International Conference on Data Mining Workshop (ICDMW), December 2014, pp. 913–920. [33] D. Liu, X.-S. Hua, L. Yang, M. Wang, H.-J. Zhang, Tag ranking, in: Proceedings of the 18th International Conference on World Wide Web, ACM, 2009, pp. 351–360, Madrid, Spain. [34] D. Liu, S. Yan, X.-S. Hua, H.-J. Zhang, Image retagging using collaborative tag propagation, IEEE Trans. Multimed. 13 (2011) 702–712. [35] J. Wang, J. Zhou, H. Xu, T. Mei, X.-S. Hua, S. Li, Image tag refinement by regularized latent Dirichlet allocation, Comput. Vis. Image Underst. 124 (2014) 61–70.


[36] D. Wang, S. Hoi, Y. He, J. Zhu, T. Mei, J. Luo, Retrieval-based face annotation by weak label regularized local coordinate coding, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2013) 353–362. [37] X. Li, C.G. Snoek, M. Worring, Learning social tag relevance by neighbor voting, IEEE Trans. Multimed. 11 (2009) 1310–1322. [38] L. Wu, L. Yang, N. Yu, X.-S. Hua, Learning to tag, in: Proceedings of the 18th International Conference on World Wide Web, ACM, 2009, pp. 361–370, Madrid, Spain. [39] Z. Chen, J. Cao, T. Xia, Y. Song, Y. Zhang, J. Li, Web video retagging, Multimed. Tools Appl. 55 (2011) 53–82. [40] T. Yao, T. Mei, C.-W. Ngo, S. Li, Annotation for free: video tagging by mining user search behavior, in: Proceedings of the 21st ACM International Conference on Multimedia, ACM, 2013, pp. 977–986, Barcelona, Spain. [41] G. Armano, A. Giuliani, E. Vargiu, Studying the impact of text summarization on contextual advertising, in: 2011 22nd International Workshop on Database and Expert Systems Applications (DEXA), IEEE, 2011, pp. 172–176, Toulouse, France. [42] X.-S. Hua, T. Mei, S. Li, When multimedia advertising meets the new internet era, in: 2008 IEEE 10th Workshop on Multimedia Signal Processing, IEEE, 2008, pp. 1–5, Sydney, Australia. [43] J. Liu, C. Wang, W. Yao, Keyword extraction for contextual advertising, China Commun. 7 (2010) 51–57. [44] T. Mei, X.-S. Hua, Contextual internet multimedia advertising, Proc. IEEE 98 (2010) 1416–1433. [45] T. Mei, X.-S. Hua, S. Li, VideoSense: a contextual in-video advertising system, IEEE Trans. Circuits Syst. Video Technol. 19 (2009) 1866–1879. [46] D. Horowitz, B. Rudolph, Method and System for Multimedia Advertising, Google Patents, 2004. [47] T. Mei, R. Zhang, X.-S. Hua, Internet multimedia advertising: techniques and technologies, in: Proceedings of the 19th ACM International Conference on Multimedia, ACM, 2011, pp. 627–628, Scottsdale, USA. [48] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2004) 91–110. [49] H. Sahbi, L. Ballan, G. Serra, A. Del Bimbo, Context-dependent logo matching and recognition, IEEE Trans. Image Process. 22 (2013) 1018–1031. [50] L.-J. Li, H. Su, L. Fei-Fei, E.P. Xing, Object bank: a high-level image representation for scene classification & semantic feature sparsification, in: Advances in Neural Information Processing Systems, 2010, pp. 1378–1386. [51] S. Alshomrani, G. Iqbal, An extended experimental evaluation of SCC (Gabow's vs Kosaraju's) based on adjacency list, Glob. J. Comput. Sci. Technol. E: Netw. Web Secur. 13 (2013). [52] R. Tarjan, Depth-first search and linear graph algorithms, SIAM J. Comput. 1 (1972) 146–160. [53] P. Lammich, Verified efficient implementation of Gabow's strongly connected component algorithm, in: Interactive Theorem Proving, Springer, 2014, pp. 325–340, Vienna, Austria. [54] J.A. Bondy, U.S.R. Murty, Graph Theory with Applications, Macmillan, London, 1976. [55] Y. Yusoff, W.J. Christmas, J. Kittler, Video shot cut detection using adaptive thresholding, in: BMVC, 2000, pp. 1–10. [56] A.D. Bagdanov, L. Ballan, M. Bertini, A. Del Bimbo, Trademark matching and retrieval in sports video databases, in: Proceedings of the International Workshop on Multimedia Information Retrieval, ACM, 2007, pp. 79–86, Augsburg, Bavaria, Germany. [57] F. Perronnin, J. Sánchez, T. Mensink, Improving the Fisher kernel for large-scale image classification, in: Computer Vision – ECCV 2010, Springer, 2010, pp. 143–156, Heraklion, Crete, Greece. [58] E. Nowak, F. Jurie, B. Triggs, Sampling strategies for bag-of-features image classification, in: Computer Vision – ECCV 2006, Springer, 2006, pp. 490–503, Graz, Austria. [59] H. Jégou, M. Douze, C. Schmid, P. Pérez, Aggregating local descriptors into a compact image representation, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 3304–3311, San Francisco, CA, USA. [60] A.P. Psyllos, C.-N. Anagnostopoulos, E. Kayafas, Vehicle logo recognition using a SIFT-based enhanced matching scheme, IEEE Trans. Intell. Transp. Syst. 11 (2010) 322–328.

Sujuan Hou received the M.S. degree from Chongqing University, Chongqing, China, in 2010. She is currently pursuing the Ph.D. degree at Chongqing University, Chongqing, China, and the University of Technology, Sydney, as a joint Ph.D. student. Her research interests include video data mining, multimedia retrieval and pattern recognition.

Shangbo Zhou received the B.Sc. degree from Guangxi National College in 1985 and the M.Sc. degree from Sichuan University in 1991, both in Mathematics, and the Ph.D. degree in Circuit and System from Electronic Science and Technology University. From 1991 to 2000, he was with the Chongqing Aerospace Electronic and Mechanical Technology Design Research Institute. Since 2003, he has been with the Department of Computer Science and Engineering of Chongqing University, where he is now a Professor. His current research interests include artificial neural networks, physical engineering simulation, visual object tracking and nonlinear dynamical systems.

Ling Chen received her Ph.D. in 2008, in Computer Engineering, from Nanyang Technological University, Singapore. She is currently a Senior Lecturer with Faculty of Engineering and Information Technology (FEIT), University of Technology, Sydney. Prior to that, she was a Postdoc Research Fellow with L3S Research Center, Leibniz University Hannover, Germany. Ling's main research interests include data mining, machine learning, social media etc.

Yong Feng is a Professor at the College of Computer Science of Chongqing University in China. He received the Bachelor's degree in Computer Applied Technology in 2000, the Master's degree in Computer Systems Organization in 2003, and the Ph.D. in Computer Software and Theory in 2006, all from Chongqing University. From July 2007 to July 2010, he did postdoctoral research at the Control Science and Engineering Center of Chongqing University. Dr. Feng has published more than 50 academic papers and 2 monographs.

Karim Awudu was born in 1976, at Dunkwa in the central region of Ghana. He graduated from a 3-year college of education program in 1999 with a teaching certificate ‘A’, and became a Professional Junior High School Science Teacher with Ghana education service afterwards. He received his Bachelor's degree in Information Technology from the University of Education, Winneba, in 2008. He worked briefly as an ICT Coordinator/Instructor with the Ministry of Education. He received his M.Eng. degree in Computer Science from Hunan University, P.R.C., in 2012. He was a member of IOT team of Hunan University headed by Dr. Zhang Xiaoming from 2011 to 2012. He is currently a Ph.D. student in Computer Science, Chongqing University, P.R.C.