Advertising object in web videos




Neurocomputing 119 (2013) 118–124



Richang Hong a,*, Linxie Tang b, Jun Hu c, Guangda Li d, Jian-Guo Jiang a

a School of Computer and Information, Hefei University of Technology, Hefei 230009, China
b Department of EEIS, University of Science and Technology of China, Hefei 230026, China
c College of Computer Science, Zhejiang University, Hangzhou 310027, China
d School of Computing, National University of Singapore, 117417 Singapore, Singapore

* Corresponding author. E-mail address: [email protected] (R. Hong).
http://dx.doi.org/10.1016/j.neucom.2012.04.040

Article info

Available online 4 January 2013

Keywords: Video advertising; Visual relevance; Product

Abstract

We have witnessed the booming of contextual video advertising in recent years. However, existing advertisement systems take only the metadata into account, such as titles, descriptions and tags. This kind of text-based contextual advertising reveals a number of shortcomings in ad insertion and ad association. In this paper, we present a novel video advertising system called VideoAder. The system leverages the well-organized media information of a video corpus to embed visually relevant ads into a set of precisely located insertion positions. Given a product, we utilize a content-based object retrieval technique to identify the relevant ads and their potential embedding positions in the video stream. We then formulate ad association as an optimization problem that maximizes the total revenue of the system. Specifically, the "Single-Merge" and "Merge" methods are proposed to tackle complex queries in visual representation, and Typical Feature Intensity (TFI) is used to train a classifier that automatically decides which method is more representative. Experimental results demonstrate the accuracy and feasibility of the system.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

The explosively growing online multimedia data has brought new challenges to online video advertising. In traditional video advertising, the association between videos and ads is decided by keyword matching, as in Google AdWords and AdSense [1]. The relevance is calculated on the basis of video metadata such as titles, descriptions and tags. However, traditional text-based contextual advertising reveals a number of disadvantages. First, conventional video advertising systems usually determine the ad insertion points by metadata analysis [2] or video structure [15], without considering the visual coherence between the ads and the insertion points of the video. This makes the ads highly intrusive to the audience. Second, user-tagged text is generally incomplete and inaccurate; that is, its quality varies in diversity and accuracy with the subjectivity of the text creator, which might decrease the revenue of the advertising system.

There exists a wide variety of advertising schemes for contextual video advertisement. A typical scheme for contextual relevance is the content relevance system based on the video's webpage (e.g., Google AdSense [16]). YouTube [19] and Hulu [21] select relevant ads by mining contextual metadata and insert the ads at the beginning or the last frame of a video, or into several clips. The vADeo system [22] leverages scene detection and face recognition for ad insertion. These kinds of ad association disregard the coherence and relevance between ads and insertion points, bringing an intrusive feeling. On the other hand, Mei et al. [2] proposed the VideoSense system, which inserts ads at the positions with the highest discontinuity and lowest attractiveness while maximizing the overall global and local relevance. Yi et al. [3] provided a contextual advertising system which is able to select ads for individual scenes in video content by taking advantage of video scripts. However, all the above works are limited in deeply mining the visual content to decide which product is better to advertise.

In this paper, we propose a novel advertising system called VideoAder [25]. Instead of scanning given textual information or relying on video structure to find potential points to insert ads, we directly detect products in all videos of a video corpus by a content-based object retrieval technique [6,17,18]. In order to represent the complicated visual characteristics of products for querying, we utilize two approaches to represent a product and train a classifier for approach selection, making the product query representation more comprehensive, representative and precise [26,28]. We also consider how to associate the ads with relevant video content. Previous video advertising systems generally process videos one by one. This is a video-to-ads mapping, which means we have to use all ads to search in a video to find the most relevant video stream and advertising points in the stream, and therefore the whole video corpus needs to be re-processed whenever the ad set is updated [7]. Here, we instead utilize a retrieval method to find advertising points: we extract keyframes from the videos to form a large image set, so that, given a targeted product, we can directly search for the object in the image set. We argue that VideoAder represents one of the first attempts towards video advertising that leverages visual-based techniques.

Fig. 1. The framework of VideoAder.

Fig. 2. Examples of products suitable for (a) the Single-Merge approach (Nano 6, Amazon Kindle, iPhone 3GS) and (b) the Merge approach (camera).

2. System framework

The overall system framework is depicted in Fig. 1.

2.1. Preprocessing

All videos of a video community website are decomposed into a series of keyframes, one every 5 s. Intuitively, one keyframe per 5 s can both basically ensure the video's contextual continuity and satisfy the need for product searching. After that, a large corpus of keyframes is formed. For visual content processing, the Laplacian of Gaussian [4,20] method is utilized to detect feature points and the scale-invariant feature transform (SIFT) [5,29] is used to describe them. Each feature point is represented by a 140-dimensional vector: 2 dimensions for the coordinates, 128 for the description of the feature point and 10 for its 10 nearest neighbors. Hierarchical clustering is then utilized to cluster these descriptors into a vocabulary of about 100,000 cluster centers.
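
To make this pipeline concrete, the following is a minimal sketch (our illustration, not the authors' implementation) of keyframe sampling and local feature description using OpenCV; it assumes OpenCV 4.x, and the 10-nearest-neighbor part of the 140-dimensional vector and the hierarchical clustering into the ~100,000-word vocabulary are only indicated in comments.

import cv2
import numpy as np

def extract_keyframes(video_path, interval_s=5.0):
    # Sample one keyframe every interval_s seconds.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(round(fps * interval_s)))
    keyframes, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            keyframes.append(frame)
        idx += 1
    cap.release()
    return keyframes

def describe_keyframe(frame):
    # Scale-space (LoG/DoG) detection and 128-d SIFT description.
    # Each row is [x, y] + 128-d descriptor (130 dims); the paper's 140-d
    # vector additionally encodes the 10 nearest neighboring feature points.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    kps, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    if desc is None:
        return np.zeros((0, 130), dtype=np.float32)
    coords = np.array([kp.pt for kp in kps], dtype=np.float32)
    return np.hstack([coords, desc])

# The per-point vectors from all keyframes would then be quantized into a
# visual vocabulary (about 100,000 cluster centers) by hierarchical clustering.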

2.2. Visual representation of the query

A precise, comprehensive and representative visual representation is of crucial importance for product search. For content-based product retrieval, the query representation can be considerably complex, so in order to attain a compelling result the query representation method is adapted to the characteristics of the product.

2.2.1. Single-Merge

In this paper, we represent the product query by two approaches. The first approach is called Single-Merge. We observe that some products, such as the iPod Nano 6th Generation, the Amazon Kindle and the iPhone, shown in Fig. 2(a), are mainly represented by only one view of the product images; hence we can detect the feature points from that view only. The Single-Merge query images are selected from Amazon products, since Amazon product images are always representative.

2.2.2. Merge

We observe that for some products, such as computers and cameras, seen in Fig. 2(b), the feature points are homogeneously distributed over all sides of the 3D object surface, and all sides occur with similar frequency in videos. The first query representation approach can hardly represent such a query integrally; thus, we select images taken from different viewpoints to form a merged feature representation. We first collect the top 200 images from Google Images by text search. In order to filter noisy images from the collected image set, we then use the Amazon product image to re-rank the 200 images with the content-based object retrieval method. Finally, we keep the top 100 images, merge all their visual words, and select the most typical visual words to represent the product.

2.2.3. Classifier for approach selection

Given a product name, we cannot know beforehand which approach would better describe the visual characteristics of the product. We could easily pick a preferable approach by hand, but that would be subjective and laborious. In order to automatically select a preferable approach to represent the product query, we train a classifier. First, we introduce a concept called typical feature intensity (TFI). TFI indicates the intensity of typical features which describe an object representatively. The TFI of product p is calculated by

TFI(p) = \frac{1}{2|S(p)|} \sum_{i=1}^{|S(p)|} \sum_{j=1}^{|M(p)|} E(S_i(p), M_j(p))    (1)

where S_i(p) is a visual word of product p under the Single-Merge approach, M_j(p) is a visual word of product p under the Merge approach, and E(w_1, w_2) returns 1 if visual word w_1 equals w_2 and 0 otherwise. Table 1 gives all 10 kinds of products and their TFI values. Eq. (1) indicates that, for a query with a high TFI value, the Single-Merge visual words appear more intensively within the filtered merged words; in other words, the object is more easily represented by the Single-Merge approach. Therefore, for a product, if the TFI value is larger than a threshold (0.3), the Single-Merge approach is selected to represent the product query; otherwise, the Merge approach is selected. The threshold is learned from training data constructed from 100 product queries.
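
A minimal sketch of the TFI computation in Eq. (1) and the threshold rule described above; the 0.3 threshold follows the text, while the function names and the representation of visual words as integer IDs are our own illustrative assumptions.

def typical_feature_intensity(single_words, merge_words):
    # TFI(p) = (1 / (2|S(p)|)) * sum_i sum_j E(S_i(p), M_j(p)),
    # where E(w1, w2) = 1 if the two visual words are equal and 0 otherwise.
    if not single_words:
        return 0.0
    matches = sum(1 for s in single_words for m in merge_words if s == m)
    return matches / (2.0 * len(single_words))

def select_representation(single_words, merge_words, threshold=0.3):
    # High TFI: the Single-Merge words recur in the merged word set, so one
    # representative view suffices; otherwise fall back to the Merge approach.
    tfi = typical_feature_intensity(single_words, merge_words)
    return "Single-Merge" if tfi > threshold else "Merge"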

2.3. Searching

Table 1
The TFI values for the 10 query products.

Product               TFI
1. Amazon Kindle      0.593
2. iPhone 3GS         0.669
3. iPod Nano 6        0.516
4. BlackBerry 9700    0.368
5. Cisco 7600 Phone   0.284
6. MacBook Pro        0.157
7. Nikon P7000        0.223
8. Nintendo Wii       0.122
9. ThinkPad           0.161
10. Xbox 360          0.101

2.3.1. Indexing and ranking

By treating each video keyframe as a document and the visual words extracted from the keyframe as textual words, we utilize a textual inverted index to build the indexing structure. TF-IDF is used to evaluate the importance of a certain visual word: TF represents the visual word frequency within one keyframe, while the IDF value represents the frequency of the visual word in the whole image corpus [27,30]. The IDF value is learned from a large corpus of about 10 million images. We calculate the ranking value as follows:

R(d) = \sum_{i=1}^{|W(d)|} TFIDF(W_i(d))    (2)

where W_i(d) represents the i-th visual word in document (keyframe) d, and TFIDF(W_i(d)) returns the weight of that visual word for ranking. The ranking value is accumulated over all the visual words in a certain frame.

Table 2
Approach selected for each query (S represents Single-Merge, M represents Merge).

Product number   1  2  3  4  5  6  7  8  9  10
Approach         S  S  S  S  M  M  M  M  M  M

Fig. 3. Search results by the two approaches (Single-Merge and Merge; number of satisfied keyframes among the top 50 results).
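
As a rough illustration (under our own naming, not the paper's code), the sketch below builds an inverted index over keyframes treated as bags of visual-word IDs and accumulates the ranking value of Eq. (2); in practice the IDF values would be learned offline from the 10-million-image background corpus mentioned above.

import math
from collections import Counter, defaultdict

class VisualWordIndex:
    # Inverted index over keyframes treated as documents of visual-word IDs.
    def __init__(self, keyframe_words):
        # keyframe_words: dict {keyframe_id: [visual_word_id, ...]}
        self.docs = {k: Counter(ws) for k, ws in keyframe_words.items()}
        self.inverted = defaultdict(set)
        for k, counts in self.docs.items():
            for w in counts:
                self.inverted[w].add(k)
        n_docs = max(len(self.docs), 1)
        self.idf = {w: math.log(n_docs / len(ks)) for w, ks in self.inverted.items()}

    def rank(self, query_words, top_k=100):
        # R(d) = sum over query visual words present in d of TF(w, d) * IDF(w).
        scores = defaultdict(float)
        for w in set(query_words):
            for k in self.inverted.get(w, ()):
                tf = self.docs[k][w] / sum(self.docs[k].values())
                scores[k] += tf * self.idf[w]
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]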

2.3.2. Filtering

We apply several filters to make the search result more satisfactory. After ranking, we filter out the following keyframes: (a) keyframes with too many or too few detected feature points; (b) keyframes ranked high but with few matching points; (c) in the case of one-to-many matching, the redundant matching points are removed.

2.3.3. Re-ranking

We apply the computationally expensive geometric verification to the small retrieved image set, i.e., the top 100 keyframes obtained after searching and filtering. The spatial consistency of the k (k = 10) spatially nearest neighbors is used to filter the visual words.

2.4. Advertisement association optimization

In AdWords [1], each advertiser places bids on a number of keywords and specifies a maximum daily budget. The objective is to maximize the total revenue while respecting ad relevance. In our framework, let A = {a_i}_{i=1}^{N_a} denote the keywords (products) bid by advertisers, containing N_a keywords. Let B = {b_j}_{j=1}^{N_b} denote all possible insertion points in the video corpus, containing N_b insertion points. Let C = {c_k}_{k=1}^{N_c} denote the candidate ad images for insertion. Online ad insertion can then be described as the selection of M keywords, N insertion points and N ad images from A, B and C, where M and N are given by the publisher of VideoAder. The objective of ad association is to maximize the total revenue while respecting ad relevance. Ad relevance can be measured by two factors: the contextual relevance between a keyword and an insertion point, and the local similarity between an ad image and the keyframe at the insertion point. Thus, we introduce the following three items for optimization. Let R_b(a_i) denote the daily budget of keyword a_i, and R_r(a_i, b_j) denote the contextual relevance between keyword a_i and insertion point b_j. Let R_s(b_j, c_k) denote the local similarity between ad image c_k and the keyframe at insertion point b_j. Local similarity requires that priority be given to products whose product images are most relevant to the video scenes (e.g., product images taken from the same viewpoint as the product in the video have higher priority than those from different viewpoints).

The ad association can now be formulated as an optimization problem which maximizes the three items simultaneously [22–24]. We introduce the design variables X = [x_1, x_2, ..., x_{N_a}], x_i ∈ {0,1}, Y = [y_1, y_2, ..., y_{N_b}], y_j ∈ {0,1}, and Z = [z_1, z_2, ..., z_{N_c}], z_k ∈ {0,1}, where x_i, y_j and z_k indicate whether keyword a_i, insertion point b_j and ad image c_k are selected (x_i = 1, y_j = 1, z_k = 1) or not (x_i = 0, y_j = 0, z_k = 0). The optimization can be expressed as the following nonlinear 0-1 programming problem [23]:

\max_{(x,y,z)} f(x,y,z) = w_b \sum_{i=1}^{N_a} x_i R_b(a_i) + w_r \sum_{i=1}^{N_a} \sum_{j=1}^{N_b} x_i y_j R_r(a_i, b_j) + w_s \sum_{j=1}^{N_b} \sum_{k=1}^{N_c} y_j z_k R_s(b_j, c_k)

s.t. \sum_{i=1}^{N_a} x_i = M, \quad \sum_{j=1}^{N_b} y_j = N, \quad \sum_{k=1}^{N_c} z_k = N, \quad x_i, y_j, z_k \in \{0,1\}    (3)
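
For clarity, Eq. (3) can be transcribed directly as a score plus a feasibility check; the array layout of R_b, R_r and R_s below is our own assumption, chosen only to keep the sketch short.

import numpy as np

def objective(x, y, z, Rb, Rr, Rs, wb, wr, ws):
    # f(x,y,z) = wb*sum_i x_i Rb(a_i) + wr*sum_{i,j} x_i y_j Rr(a_i,b_j)
    #          + ws*sum_{j,k} y_j z_k Rs(b_j,c_k)
    # x, y, z are 0/1 vectors; Rb is (Na,), Rr is (Na, Nb), Rs is (Nb, Nc).
    x, y, z = map(np.asarray, (x, y, z))
    return wb * (x @ Rb) + wr * (x @ Rr @ y) + ws * (y @ Rs @ z)

def feasible(x, y, z, M, N):
    # Constraints of Eq. (3): exactly M keywords, N insertion points, N ad images.
    return sum(x) == M and sum(y) == N and sum(z) == N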

The parameters (w_b, w_r, w_s) control the emphasis on the daily budget and the two relevance terms, and satisfy the constraints 0 ≤ w_b, w_r, w_s ≤ 1 and w_b + w_r + w_s = 1. The parameters can be set according to the importance of each optimization item. By examining Eq. (3), we can observe that there are C_{N_a}^{M} C_{N_b}^{N} C_{N_c}^{N} M N N! solutions in total, so the search space becomes tremendous when the dataset is large. For practical usage, we introduce a heuristic searching algorithm to solve the optimization problem [24]. As described in Algorithm 1, the number of candidate solutions can be significantly decreased to O(N_a + N_b + N_c + M · N' · N').

Algorithm 1. The heuristic searching algorithm for Eq. (3)

1. Initialize: set the labels of all the elements in X, Y and Z to "0".
2. Rank all the elements in X according to w_b R_b(a_i) in descending order, select the top M elements, and set their labels to "1".
3. Rank all the elements in Y according to w_b R_b(a_i) + w_r R_r(a_i, b_j) in descending order and select the top N' (N < N' ≪ N_b) elements.
4. For each y_j in the top N' elements of Y, select the z_k with the maximum w_r R_r(a_i, b_j) + R_s(b_j, c_k).
5. For each x_i in the top M elements of X, select the not-yet-selected y_j and z_k with the maximum w_b R_b(a_i) + w_r R_r(a_i, b_j) + R_s(b_j, c_k), and set the labels of y_j and z_k to "1".
6. Output all the triples with (x_i = 1, y_j = 1, z_k = 1).
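
One possible reading of Algorithm 1 as code is sketched below; the greedy tie-breaking, the data layout and the weighting of the similarity term by w_s (the text writes R_s unweighted in steps 4 and 5) are our assumptions, so this is an approximation of the procedure rather than the authors' implementation.

import numpy as np

def heuristic_ads_association(Rb, Rr, Rs, wb, wr, ws, M, N, N_prime):
    # Rb: (Na,) keyword budgets; Rr: (Na, Nb) contextual relevance;
    # Rs: (Nb, Nc) local similarity. Selects M keywords and up to N
    # (insertion point, ad image) pairs, greedily following steps 1-6.
    Na, Nb = Rr.shape
    Nc = Rs.shape[1]
    x = np.zeros(Na, dtype=int)
    y = np.zeros(Nb, dtype=int)
    z = np.zeros(Nc, dtype=int)

    # Step 2: top-M keywords by weighted daily budget.
    top_keywords = np.argsort(-wb * Rb)[:M]
    x[top_keywords] = 1

    # Step 3: keep the top N' candidate insertion points, scored against
    # the selected keywords (N < N' << Nb).
    point_scores = np.max(wb * Rb[top_keywords, None] + wr * Rr[top_keywords, :], axis=0)
    candidate_points = [int(j) for j in np.argsort(-point_scores)[:N_prime]]

    # Step 4: for each candidate point, the ad image with maximal local similarity.
    best_ad = {j: int(np.argmax(Rs[j])) for j in candidate_points}

    # Step 5: for each selected keyword, take the best still-unused pair.
    triples, used_points, used_ads = [], set(), set()
    for i in top_keywords:
        best = None
        for j in candidate_points:
            k = best_ad[j]
            if j in used_points or k in used_ads:
                continue
            score = wb * Rb[i] + wr * Rr[i, j] + ws * Rs[j, k]
            if best is None or score > best[0]:
                best = (score, j, k)
        if best is None:
            continue
        _, j, k = best
        y[j], z[k] = 1, 1
        used_points.add(j)
        used_ads.add(k)
        triples.append((int(i), j, k))
        if len(triples) == N:
            break

    # Step 6: the selected (keyword, insertion point, ad image) triples.
    return x, y, z, triples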

3. Experiment

To ensure that our algorithm and methods work in practice, we conduct experiments on a web video set consisting of 307 videos collected from YouTube. In order to ensure that the results would make a significant impact in practice, we concentrate on 10 popular and distinctive product queries for searching and annotation. Typical queries include "iPod", "iPhone", "Camera", etc. All the ads are collected from Amazon. For each product query, we extract the most typical or representative product image from Amazon for the Single-Merge query representation. To obtain multi-view images of products with salient 3D appearance, we search Google Images to form the merged product query representation.

3.1. Evaluation of the classifier

To evaluate the performance, we collect the top 50 images returned by our searching and ranking method and count the number of keyframes which contain the query product, as shown in Fig. 3. Table 2 depicts the query representation approach selected by our classifier. We observe that, for each kind of product, the approach selected in Table 2 always corresponds to the preferable result of the two approaches shown in Fig. 3. This means our classifier works perfectly for the approach selection: given a targeted product name, we can automatically determine the appropriate approach to form the query representation.



3.2. Evaluation of product searching

It is challenging to quantify the quality of the search results. First of all, product occurrences in web videos are heavily influenced by complex video scenes. For instance, we may see only one corner of a camera in a keyframe, and it is hard to decide whether the camera has actually appeared in the frame, yet we must make the choice. Thus, the result we obtain is empirical and intuitive. As in Section 3.1, we use the number of satisfied keyframes within the top 50 results to evaluate the performance. In our scenario, we pay more attention to precision; in other words, we care more about finding accurate advertising points, even if the recall is not compelling.

Fig. 3 shows the search results obtained by our method. From the results, we clearly see that products with salient 3D appearance are less likely to reach a satisfactory result. First, query representations of products suited to the Merge approach seem to fit the video object searching scenario less well. Moreover, it is more difficult to reach a compelling viewpoint of the product for searching or object matching. Finally, we must consider that the TF-IDF weight is more suitable for text ranking, and content-based object retrieval can be fairly complex when considering the geometric information that is vital to a good result. Fig. 4 depicts some search results.

4. Conclusions and future work

The explosive amount of videos from Internet sharing communities has entailed the need for an effective video advertising system. Such a system ought to be both user-friendly and able to generate high revenue. In this paper, we have proposed an intelligent video advertising system, VideoAder, shown in Fig. 5. The evaluation indicates that this system can associate visually relevant advertisements with web videos. It is a commendable way to lessen the intrusiveness of web ads and to give video viewers a fresh feeling while browsing videos. In order to identify points for embedding ads, we utilize a content-based object retrieval technique to construct the system framework [8,9,14]. We use adaptive visual representation approaches to represent different query products. However, given a product name, how can we decide which approach is more suitable to describe that product? We therefore introduce the TFI concept and train a classifier to achieve a preferable approach selection. Hence, we obtain a relatively precise and comprehensive query representation of a certain product.

Fig. 4. Search results for three products: (a) search results for the Amazon Kindle, (b) search results for the iPod Nano 6th and (c) search results for the Nikon P7000.

Fig. 5. The demo system of VideoAder: (a) main interface and (b) video displaying and advertising interface.

Furthermore, VideoAder does not rely on the textual information of videos such as tags, titles and descriptions, which releases the system from dependence on video text information [10–13]. Beyond that, a more integrated video advertising system could be built to provide more comprehensive services. We will explore further in this direction.

Acknowledgment

This work was supported by a grant from the National Natural Science Foundation of China, No. 61172164.

References

[1] A. Mehta, A. Saberi, U. Vazirani, V. Vazirani, AdWords and generalized on-line matching, in: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, Pittsburgh, USA, October 2005.
[2] T. Mei, X.-S. Hua, L. Yang, S. Li, VideoSense: towards effective online video advertising, ACM Multimedia, Augsburg, Germany, 2007.
[3] B.-J. Yi, J.-T. Lee, H.-W. Woo, H.-C. Rim, Contextual video advertising system using scene information inferred from video scripts, in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, Switzerland, July 2010.
[4] T. Lindeberg, Feature detection with automatic scale selection, Int. J. Comput. Vision 30 (2) (1998) 79–116.
[5] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the International Conference on Computer Vision, Vancouver, Canada, 1999.
[6] M. Wang, X.-S. Hua, J. Tang, R. Hong, Beyond distance measurement: constructing neighborhood similarity for video annotation, IEEE Trans. Multimedia 11 (3) (2009) 465–476.
[7] M. Wang, X.-S. Hua, Active learning in multimedia annotation and retrieval: a survey, ACM Trans. Intell. Syst. Technol. 2 (2) (2011) 10.
[8] M. Wang, X.-S. Hua, R. Hong, J. Tang, G.-J. Qi, Y. Song, Unified video annotation via multigraph learning, IEEE Trans. Circuits Syst. Video Technol. 19 (5) (2009) 733–746.
[9] R. Hong, M. Wang, M. Xu, S. Yan, T.-S. Chua, Dynamic captioning: video accessibility enhancement for hearing impairment, ACM Multimedia, Beijing, China, 2010.
[10] Z.-J. Zha, L. Yang, T. Mei, M. Wang, Z. Wang, T.-S. Chua, X.-S. Hua, Visual query suggestion: towards capturing user intent in Internet image search, ACM Trans. Multimedia Comput. Commun. Appl. 6 (3) (2010).
[11] R. Hong, J. Tang, Z.-J. Zha, Z. Luo, T.-S. Chua, Mediapedia: mining web knowledge to construct multimedia encyclopedia, Lect. Notes Comput. Sci. 5916 (2010) 556–566.
[12] M. Wang, K. Yang, X.-S. Hua, H.-J. Zhang, Towards a relevant and diverse search of social images, IEEE Trans. Multimedia 12 (8) (2010) 829–842.
[13] M. Wang, Y. Sheng, B. Liu, X.-S. Hua, In-image accessibility indication, IEEE Trans. Multimedia 12 (4) (2010) 330–336.
[14] M. Wang, X.-S. Hua, T. Mei, R. Hong, G.-J. Qi, Y. Song, L.-R. Dai, Semi-supervised kernel density estimation for video annotation, Comput. Vis. Image Understand. 113 (3) (2009) 384–396.
[15] T. Mei, J. Guo, X.-S. Hua, F. Liu, AdOn: toward contextual overlay in-video advertising, Multimedia Syst. 16 (4) (2010) 335–344.
[16] AdSense. Available at: http://adsense.google.com.
[17] R. Hong, J. Tang, H.-K. Tan, S. Yan, C.-W. Ngo, T.-S. Chua, Beyond search: event-driven summarization for web videos, ACM Trans. Multimedia Comput. Commun. Appl. 7 (4) (2011) 35.
[18] R. Hong, M. Wang, G. Li, Z.-J. Zha, T.-S. Chua, Multimedia question answering, IEEE Multimedia 19 (4) (2012) 72–78.
[19] YouTube. Available at: http://www.youtube.com.
[20] Y. Gao, J. Tang, R. Hong, S. Yan, Q. Dai, N. Zhang, T.-S. Chua, Camera constraint-free view-based 3-D object retrieval, IEEE Trans. Image Process. 21 (4) (2012) 2269–2281.
[21] Hulu. Available at: http://www.hulu.com.
[22] S.H. Srinivasan, N. Sawant, S. Wadhwa, vADeo: video advertising system, ACM Multimedia, Augsburg, Germany, 2007.
[23] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, UK, 2004.

[24] C.R. Reeves, Modern Heuristic Techniques for Combinatorial Problems, Blackwell Scientific Publications, Oxford, 1993.
[25] J. Hu, G. Li, Z. Lu, J. Xiao, R. Hong, VideoAder: a video advertising system based on intelligent analysis of visual content, ICIMCS, Chengdu, China, 2011.
[26] Y. Gao, M. Wang, Z.-J. Zha, Q. Tian, Q. Dai, N. Zhang, Less is more: efficient 3-D object retrieval with query view selection, IEEE Trans. Multimedia 13 (5) (2011) 1007–1018.
[27] J. Shen, D. Tao, X. Li, QUC-tree: integrating query context information for efficient music retrieval, IEEE Trans. Multimedia 11 (2) (2009) 313–323.
[28] J. Shen, J. Shepherd, B. Cui, K.-L. Tan, A novel framework for efficient automated singer identification in large music databases, ACM Trans. Inf. Syst. 27 (3) (2009).
[29] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, Y. Pan, A multimedia retrieval framework based on semi-supervised ranking and relevance feedback, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2012) 723–742.
[30] Y. Yang, F. Wu, F. Nie, H.T. Shen, Y. Zhuang, A.G. Hauptmann, Web and personal image annotation by mining label correlation with relaxed visual graph embedding, IEEE Trans. Image Process. 21 (3) (2012) 1339–1351.

Richang Hong is a Professor in the Department of EEIS, Hefei University of Technology. Before that, he was a Postdoctoral Research Fellow in the School of Computing, National University of Singapore. He received his Ph.D. degree in July 2008 from the University of Science and Technology of China (USTC). His current research interests include multimedia question answering, social media mining and content-based image retrieval. From February 2006 to June 2006, he worked as a research intern in the Web Search and Data Mining group at Microsoft Research Asia. He has authored over 50 journal and conference papers in these areas. He was a recipient of the Best Paper Award at ACM Multimedia 2010 in Florence, Italy. He has served as a technical program committee member of more than 20 worldwide top conferences and as a reviewer for over 20 prestigious international journals. Dr. Hong is a member of the Association for Computing Machinery (ACM).

Linxie Tang is a PhD candidate at the University of Science and Technology of China. His research interests are multimedia and computer vision.

Jun Hu is an undergraduate at Zhejiang University. His research interest is computer vision.

Guangda Li is a Research Fellow in the Lab for Media Search, School of Computing, National University of Singapore. He received his PhD degree from the National University of Singapore in 2012. His research interests are social media analysis and multimedia question answering.

Jian-Guo Jiang is a professor at Hefei University of Technology. His research interests are digital signal processing and computer vision.