IM(S)2: Interactive movie summarization system


Mehdi Ellouze (a,*), Nozha Boujemaa (b), Adel M. Alimi (a)

(a) REGIM: Research Group on Intelligent Machines, University of Sfax, ENIS, BP 1173, Sfax 3038, Tunisia
(b) INRIA: IMEDIA Team, BP 105 Rocquencourt, 78153 Le Chesnay Cedex, France

Article history: Received 15 December 2008; Accepted 18 January 2010; Available online 25 January 2010

Keywords: Video analysis; Video summarization; Users' preferences; Interactive multimedia system; Content analysis; Pattern recognition; Genetic algorithm; One-class SVM

Abstract

The need for summarization methods and systems has become more and more crucial as audio-visual material continues its rapid growth. This paper presents a novel vision and a novel system for movie summarization. A video summary is an audio-visual document displaying the essential parts of an original document; however, the definition of the term "essential" is user-dependent. The advantage of this work, unlike others, is the involvement of users in the summarization process. By means of IM(S)2, people generate on the fly customized video summaries responding to their preferences. IM(S)2 is made up of an offline part and an online part. In the offline part, we segment the movies into shots and compute features describing them. In the online part, users indicate their preferences by selecting interesting shots. The system then analyzes the selected shots to bring out the user's preferences and finally generates a summary of the whole movie that puts more focus on these preferences. To show the efficiency of IM(S)2, it was tested on the database of the European project MUSCLE, made up of five movies. We invited 10 users to evaluate the usability of our system by generating a semi-supervised summary for every movie of the database and judging its quality. The obtained results are encouraging and show the merits of our approach.

© 2010 Elsevier Inc. All rights reserved.

1. Introduction

Nowadays, the movie industry is flourishing as video content consumption grows, presenting challenging problems for the scientific community. In the everyday search for entertainment, people are becoming more and more interested in products and services related to the video content industry, and especially movies [43]. Besides, the increasing use of new technologies such as the Internet brings information sources closer to people. Web sites such as YouTube, Google Video or Dailymotion make thousands of video sequences available every day, and consulting video sequences has become part of the daily routine of the majority of Internet users. All this has created a problem, namely: how can I find what I need in a minimum of time without getting lost? [41]. Video summarization systems try to solve this problem. Summarizing a video document consists in producing a reduced video sequence showing the important events of the original sequence, so that after consulting the summary, viewers have an idea about the context and the semantics of the original sequence [18,19,21].


Many video summarization systems have been proposed in the literature. They differ in two respects: the form of the generated summary and the rules (assumptions) on which the authors rely to generate a summary that is shorter than the original sequence yet gives an appropriate overview of the whole content. However, what is appropriate for one viewer may be inappropriate for others. In fact, selecting the parts of the original video that should be included in the summary is still a challenging problem: no clear criteria have been established to decide what should and should not be included in the final summary. This also poses the challenging problem of evaluation, which has been solved in part in the TRECVID evaluation campaigns [9,20]. Aware of the importance of this problem, we introduce in this paper a new vision for video summarization: "a tailor-made summary for each user". Every user has to specify what should be included in the final summary. Contrary to the majority of existing video summarization systems, which either neglect the users' preferences or do not pay much attention to them, the contribution of this paper is a complete framework for user-oriented video summarization. We propose a novel summarization system that involves users in the process of movie summarization. We have been motivated by the lessons learned from the study on "requirement specifications for personalized multimedia summarization" by Agnihotri et al. [1].


The authors organized a panel which gathered five experienced researchers in the fields of video indexing, video summarization and user studies in order to analyze user needs. Over four separate brainstorming sessions, the panel examined four questions: (i) Who needs summarization and why? (ii) What should summaries contain? (iii) How should summaries be personalized? (iv) How can results be validated? The brainstorming sessions concluded unanimously that "summary is user context dependent, and so it is important for summarization to be personalized to cover user aspects: preferences, context, time, location, and task. They concluded also that both content requirements and user requirements should be validated". Moreover, we concentrate our efforts on movies. The choice of movies as an application field is justified by the growing demand for watching movies and the growing number of produced movies. Indeed, according to IMDB [17], there is now a stock of 328,530 movies representing a total of 740,803 hours. The rest of the paper is organized as follows: in Section 2, we discuss works related to video summarization. In Section 3, we state the problem and present how some works include users in the summarization process. In Section 4, we present our approach. Results are shown in Section 5. We conclude with directions for future work.

2. Related work

Recently, two prominent surveys [32,49] have presented an exhaustive analysis of the majority of works proposed in the field of video summarization. A video summarization system aims at generating a reduced version of the original sequence containing the essential information. It may take two forms:

- Storyboard: a set of keyframes extracted from the original sequence giving an overview of the whole video sequence [49].
- Video skim: a set of video excerpts extracted from the original sequence because they have been judged important; they are joined by either a cut or a gradual effect [49].

Existing works can be classified according to the form of the generated summary.

2.1. Storyboards

Early works such as [31] and [46] produced storyboards by random or uniform sampling: at each important visual change, a single keyframe is selected to appear in the storyboard. These approaches are no longer used because they suffer from a lot of redundancy. Later works are based on the shot unit, so a storyboard is made up of keyframes representing all or part of the video shots. In [54], the authors select the first image from every shot, and if there is an important change between two successive frames, the shot is represented by more than one image. The authors of [52] proposed to summarize video by a set of keyframes of different sizes: the selection of keyframes is based on eliminating uninteresting and redundant shots, and the selected keyframes are sized according to the importance of the shots from which they have been extracted. In [45], the authors represent every shot in the storyboard by a set of frames arranged according to the type of camera work. More sophisticated works such as [53] and [55] use classification to eliminate redundancy, and others such as [47] use motion information to measure the activity of a shot and to estimate the number of frames that must be selected for every shot.

2.2. Video skim

Although the storyboard is useful for an overview of the whole video sequence, it suffers from a lack of semantics due to the absence of temporal and auditory information. That is why more effort has been devoted to video skims. In the literature, we find two kinds of video skims:

- the summary sequence, which provides users with an impression of the entire video content;
- the highlights sequence, which provides users with exciting scenes and events.

In the category of systems generating highlights-oriented summaries, we can cite the VAbstract system [37], which selects important segments in action movies: the selected segments either present high contrast or contain frames having the largest differences, so the system targets action events. The MOCA project [22], an improved version of the VAbstract system, tries to locate high-level events such as gunfire, explosions and fights to improve the quality of the generated trailer. In the sport field, there are also many systems that try to generate highlights sequences [2,23,36,39]. In the context of sport, highlights depend on the competition itself; they correspond in general to goals, tries, points, red cards, penalties, substitutions, etc.

In the category of works generating summary-oriented skims, we can cite the Informedia system [42] from Carnegie Mellon University, which transcribes the audio track of the video sequence and uses NLP techniques to locate important segments. It also uses the visual track to locate important segments by analyzing motion information and detecting faces, and fuses the results of the two analyses to obtain an audio-visual summary. IBM, Microsoft and Mitsubishi labs proposed systems [33,35,38] that summarize the original video sequence by changing the playback speed: important segments are played slowly and non-important segments are played fast; the difference between the three systems lies in the criteria used to delimit important segments. More sophisticated works use psychological and perceptual features to generate video skims. In [28], the authors simulate user attention with a function that evaluates a given video excerpt using visual and auditory features such as contrast, motion, audio volume, caption text, etc.; to find the best summary, they maximize this function. In [26], the authors proceed in nearly the same way: they assign scores to shots according to high-level features that generally attract the attention of the viewer, such as face occurrences, text occurrences, voices, camera zooming, explosion noises, etc., and treat the summarization process as a constraint-satisfaction problem aimed at satisfying the attention of the viewer by means of the evoked features. In our work [9], we also consider the problem of summarization as an optimization problem: during our participation in TRECVID 2008, we proposed a system that summarizes BBC rushes with shots selected according to fixed criteria [9], using a genetic algorithm to compute the optimal summary. In [51], Tsoneva et al. propose a novel approach for video summarization that uses subtitles and scripts to segment movies into meaningful parts called subscenes; the subscenes are then ranked through an importance function, and the important subscenes are selected to form the final summary.
The evaluation of the generated summaries shows the merits of using the textual information in the summarization process.


3. Problem statement

A video summary should describe the original content briefly and concisely. It must be shorter than the original sequence while giving an appropriate overview of the whole content. However, what is appropriate for one viewer may be inappropriate for others. That is why a new trend of works tries to involve users in the process of video summarization. The first work to consider including users in the summarization process is [12], in which the authors suggest a summarization process in two stages: the first consists in computing descriptors such as color and motion, and the second consists in generating the summary according to one simple user preference, namely the number of key frames of the summary (a preference specified in the MPEG-7 User Preference Description scheme). In [13], the authors propose a system that produces on-the-fly storyboards and allows users to select the number of key frames and the processing time. In [34], the authors collect the user preferences through weights over a set of high-level features; the major drawback of this approach is that users operate directly on technical parameters that may not be easily understood. More sophisticated works such as [44] try to collect the preferences of a given person using the pictures stored on his personal computer. They base their work on the assumption that it is nowadays common, especially among young people, to store thousands of photos on one's own PC; these photos generally reflect the user's taste, personality and lifestyle, and so may be used to estimate the user's preferences. The authors categorize the personal photo library to extract pertinent classes, and then classify the key frames of the video to be summarized according to the detected classes in order to bring out clips meaningful to users.

Even if some efforts have been made to integrate users in the summarization process, they remain insufficient. Using users' personal photos to generate summaries, or asking users about the duration (video skim) or the number of frames (storyboard) of the targeted summary, does not reflect the real users' preferences. Even if Parshin and Chen [34] proposed an adjustable system that tries to gather a set of information related to the users' preferences, it remains difficult for users to understand technical parameters and to express their preferences in terms of technical features. Our contribution is the design of a system that collects the user's preferences appropriately and generates a summary that reflects these preferences while describing the original content briefly and concisely. It combines successful methods from computer vision, video processing and pattern recognition [6,9,11,15,40] into an integrated architecture.

4. Proposed approach

4.1. System framework

A video summary is an audio-visual document displaying the essential parts of the original document. The term essential depends on the user and on the audio-visual document itself. In fact, it is only after knowing the context of the movie that a user may express his/her preferences: we may have a preference for one kind of scene in action movies and for another kind in dramatic movies, etc. Our idea consists in displaying to every user an overview of the movie that can help him illustrate his preferences easily and efficiently. We have been inspired by interactive image indexing systems in which users operate directly on the database and try, through many iterations, to reach the targeted class. In our previous work [11], for instance, the system displays at each round two images randomly selected from a database containing thousands of images; the user is required to choose, among the two images, the one that is closest to the targeted class, and little by little, through many rounds, the system delimits the targeted class. The framework that we propose may be summarized as follows: the user takes a look at an overview, interacts with it to express his preferences, and the system replies by generating a summary that displays the essential parts of the original movie while including the user's preferences. Fig. 1 illustrates our framework schematically. This kind of framework is original in the context of video summarization, and several questions arise [3,4]:

- What kind of overview should be displayed to the user?
- How will the user express his preferences?
- How many rounds will there be between the user and the system?
- How will the system bring out and understand the user's preferences?
- How will the system generate a summary that takes into consideration the user's preferences and covers the essential parts of the movie at the same time?

[Fig. 1. Overview of the IM(S)2 system: an offline part (scene and shot segmentation, feature computation, feature database) and an online part (movie overview, user preferences, video summarization, video summary).]


[Fig. 2. GUI of IM(S)2.]

We will try in the following sections to answer all these questions.

4.2. Movie overview

An overview should be both simple and quick to understand. It must give an idea about the semantics of the original movie by informing about the actors involved, the places in which the story takes place, the important events, etc. Clustering the shots of the movie and showing a representative image for each cluster was one of the solutions that we considered; however, this kind of overview may hide many details that are important for understanding the context of the movie. As mentioned by Hanjalic et al. [16]: "Humans tend to remember different events after watching a movie and think in terms of events during the video retrieval process. Such an event can be a dialog, action scene or, generally, any series of shots unified by location or dramatic incident. Therefore, an event as a whole should be treated as an elementary retrieval unit in advanced movie retrieval systems." For this reason, we think that segmenting the movie into scenes and making an overview showing all the scenes of the movie is a suitable solution. Generally, a movie contains between 1500 and 2500 shots and between 50 and 80 scenes. A scene gathers a set of correlated shots that share the same actors and take place at the same moment and in the same place. So, presenting a mosaic image composed of small images representing the scenes of the movie removes a lot of redundancy and gives a general overview of the whole movie. This overview also gives the user the opportunity to decide, before starting the summarization process, whether the movie is interesting or not, avoiding the waste of time of summarizing and browsing it needlessly.

4.3. User interaction

After watching the overview, the user is required to express his/her preferences.

These preferences are generally centered on certain types of scenes (action, dialog, landscape, romance), or around certain contents such as locations (forest, mountain, city, buildings, indoor, outdoor, etc.), actors, time (day or night), or simply particular kinds of objects (plane, tank, car, etc.). As it is impossible to predict and model all the users' preferences and to implement appropriate detectors, the user interacts directly with the system by selecting, directly from the overview, shots that represent what he would like to watch in the summary. Besides, we ask him to specify the upper duration that the generated summary must not exceed (see Fig. 2). The system then analyzes the selected shots, locates the user's center of interest inside the movie and finally takes his preferences into consideration to generate a summary not exceeding the fixed duration (see Fig. 3). In the following sections, we present the two main parts of IM(S)2: the offline part and the online part. In the offline part, the movies of our database are segmented into shots and features are computed to describe these shots. In the online part, users interact with the system to generate summaries that take their preferences into consideration.

4.4. Offline part

4.4.1. Scene segmentation and overview making

Semantically speaking, a movie is considered as a set of scenes rather than a set of shots because scenes, contrary to shots, are semantic units. The database of the European project MUSCLE, on which we tested our system, is provided with the ground truth of the shot boundaries. To segment movies into scenes, we used our system called "Scene Pathfinder", which has been tested on films of different cinematographic genres and gives encouraging results [8]. To build the overview of a given movie, we perform a clustering inside every detected scene to extract the shot clusters, relying on the work of Casillas et al. [7]. The representative shot of every cluster is the longest one (in duration).


[Fig. 3. Constraints taken into consideration by the IM(S)2 system: the user's preferences (type: action, dialogue, romance; content: persons, locations, time) and the summary constraints (no redundancy, temporal coverage, duration).]

An image composed of the keyframes of the representative shots is computed to obtain a representative image of the scene. At most four shots, representing the four largest clusters, are selected so as not to disturb the user's attention and to let him focus on the core of the scene. If the user is interested in a given scene, he can zoom in and all the clusters of the scene are shown (see Fig. 2). If a cluster is selected by the user, all the shots of the cluster are considered as positive examples representing the preferences of this user. A minimal sketch of how such an overview mosaic could be assembled is given below.
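The following sketch illustrates the overview-building step just described: keeping the four largest clusters of a scene, taking the longest shot of each as representative, and tiling its keyframe into a mosaic. The shot/cluster data model and image paths are hypothetical, not the authors' actual implementation.

```python
# Sketch of the per-scene overview mosaic of Section 4.4.1 (assumed data model).
from PIL import Image

def scene_mosaic(clusters, thumb_size=(160, 120)):
    """clusters: list of clusters, each a list of shots;
    a shot is a dict with 'duration' (seconds) and 'keyframe' (image path)."""
    # Keep the four largest clusters, as in the paper.
    largest = sorted(clusters, key=len, reverse=True)[:4]
    # Representative shot of a cluster = its longest shot.
    reps = [max(c, key=lambda shot: shot["duration"]) for c in largest]
    # Tile the representative keyframes into a 2x2 mosaic image.
    cols, rows = 2, 2
    mosaic = Image.new("RGB", (cols * thumb_size[0], rows * thumb_size[1]))
    for i, shot in enumerate(reps):
        thumb = Image.open(shot["keyframe"]).resize(thumb_size)
        mosaic.paste(thumb, ((i % cols) * thumb_size[0], (i // cols) * thumb_size[1]))
    return mosaic
```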

4.4.2. Shot representation

The major problem in multimedia remains the semantic gap: we still do not have features that fully describe the content of images. In our case, the features used must describe as well as possible the concepts present in the selected shots. The preferences of a user may be divided along two axes, type and content. In fact, by selecting specific shots, the user may inform about:

- His preference for certain types of shots: action shots, dialog shots, landscape, romance, etc. Action shots are characterized by a high tempo, a lot of motion and many sound effects, whereas the other types are related to specific content: dialog shots are related to the presence of the actors of the movie, landscape shots are related to a special texture or color (greenery, sunset, etc.), and romance shots are characterized by a high percentage of skin color in the image.
- His preference for specific contents such as locations (forest, mountain, city, buildings, indoor, outdoor, etc.), actors, time (day or night), or objects (planes, cars, tanks, etc.).

For this reason, we use two types of descriptors in our system: tempo descriptors and content descriptors.

4.4.3. Tempo descriptors and shot clustering

In order to know whether the user is interested in action scenes, we have to delimit them in the movie. To do that, we perform an unsupervised clustering to classify the shots of the movie into two classes, action shots and non-action shots, using the Fuzzy C-Means classifier [5] and tempo features. Indeed, compared to non-action scenes (dialogue, romance, etc.), which are characterized by a calm tempo, action scenes (fight, war, car chase, etc.) display three important phenomena. First, they contain a lot of motion: motion of objects (actors, cars, etc.) and camera motion (pan, tilt, zoom, etc.). The second phenomenon is the use of special sound effects to excite and stimulate the viewer's attention, by amplifying the actors' voices, introducing explosion and gunfire sounds from time to time, etc. The third phenomenon concerns the duration and the number of shots per minute: action scenes are filmed by many cameras, so the filmmaker switches permanently between cameras to film the scene from many views. The features that we use to quantify these three phenomena are:

- The shot activity: we use the Lucas-Kanade optical flow [27] to estimate the direction and speed of object motion from one frame to another in the same shot.
- The shot audio energy: we compute the short-time average energy of the audio track of every shot. The short-time average energy of a discrete signal s over N samples is defined as follows:

E(shot) = (1/N) \sum_i s(i)^2    (1)
- The shot frequency: we place every shot at the center of a one-minute interval and count the number of shots that fall in this interval; this count is the third feature of every shot (see Fig. 4).

[Fig. 4. The shot frequency feature: the reference shot is placed at the center of a 60 s interval and the shots falling inside the interval are counted.]

As a result of the classification performed by the Fuzzy C-Means classifier with the number of classes set to two, we obtain compact zones composed of consecutive action shots representing action scenes [8]. A minimal sketch of this clustering step is given below.
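The sketch below illustrates the tempo-based clustering of Section 4.4.3 with a minimal fuzzy c-means implementation over per-shot tempo features. The feature values are hypothetical placeholders; the paper computes them from optical flow, short-time audio energy and shot frequency.

```python
# Minimal fuzzy c-means (two classes: action / non-action) over tempo features.
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, n_iter=100, seed=0):
    """X: (n_shots, n_features) array. Returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # fuzzy memberships sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance of every shot to every cluster center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        # Standard FCM membership update.
        U = 1.0 / (d ** (2 / (m - 1)) * np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
    return centers, U

# Hypothetical tempo features per shot: [activity, audio energy, shot frequency].
shots = np.array([[0.9, 0.8, 22], [0.1, 0.2, 6], [0.8, 0.7, 18], [0.2, 0.1, 7]], float)
shots = (shots - shots.mean(0)) / shots.std(0)   # normalize features
centers, U = fuzzy_cmeans(shots)
action_cluster = np.argmax(centers[:, 0])        # cluster with the highest activity
print("action shots:", np.where(U[:, action_cluster] > 0.5)[0])
```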

4.4.4. Content descriptors

The quality of the descriptors is very important in any video processing work; in fact, it is crucial for bridging the semantic gap. In our system, the goal of the content descriptors is to provide visual cues describing the content of non-action shots (dialog, romance and landscape) and the content of shots in general: location, time, objects, concepts, themes, etc. The content descriptors used in IM(S)2 have already been implemented and integrated in our content-based image retrieval engine IKONA [6]. In general, every concept or object is related to particular colors, textures and edges; that is why the descriptors integrated into IKONA are varied and cover the colors, the textures and the edges of images. To describe the color content, we use a standard HSV histogram computed on 120 bins. HSV histograms inform about the color distribution of images; however, they suffer from a lack of spatial information because all pixels are equally important. For this reason, we also use two other descriptors. First, LapRGB, a histogram computed on 216 bins, uses edge information to weight every color pixel; the idea is that pixels situated on edges and corners should contribute more to the histogram. Second, ProbRGB, a histogram also computed on 216 bins, uses texture information to weight every color pixel, favoring the contribution of pixels where there is an important color activity. We also use the Fourier histogram descriptor, which exploits texture information to measure the energy and its distribution in a given image; this histogram is computed on 64 bins. Besides, in order to get a general idea of the different shapes that a given image may contain, we use two descriptors: the Hough histogram and the Local Edge Oriented Histogram. The Hough histogram, computed on 49 bins, gathers information about how pixels in the image behave with respect to the tangent line to the local edge passing through each pixel and with respect to the distance from the origin point. In addition, we compute the Local Edge Oriented Histogram (LEOH) on 32 bins to get an idea of the pixels belonging to horizontal and vertical lines. The vector gathering all the evoked features has 697 dimensions, which is very high. To reduce the dimensionality, we use linear principal component analysis: a previous study [10], done on several image databases in a query-by-example context, shows that the resulting vector dimension may be reduced by a factor of about 5 with a loss of less than 3%. A minimal sketch of this feature pipeline is given below.
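The sketch below illustrates the descriptor pipeline of Section 4.4.4: an HSV colour histogram plus placeholder vectors concatenated into one 697-dimensional descriptor, then reduced with linear PCA. The bin counts follow the paper (120 + 216 + 216 + 64 + 49 + 32 = 697), but the non-HSV descriptors are stubbed with random values because their exact computation (LapRGB, ProbRGB, Fourier, Hough, LEOH) is specific to the IKONA engine.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def hsv_histogram(bgr_image):
    """120-bin HSV histogram (12 hue x 5 saturation x 2 value bins)."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, [12, 5, 2],
                        [0, 180, 0, 256, 0, 256]).flatten()
    return hist / (hist.sum() + 1e-9)

def shot_descriptor(bgr_keyframe, rng):
    """697-dimensional descriptor of a shot's middle frame (IKONA features stubbed)."""
    other = rng.random(216 + 216 + 64 + 49 + 32)   # placeholder descriptors
    return np.concatenate([hsv_histogram(bgr_keyframe), other])

rng = np.random.default_rng(0)
keyframes = [np.uint8(rng.integers(0, 256, (120, 160, 3))) for _ in range(200)]
X = np.stack([shot_descriptor(k, rng) for k in keyframes])   # (n_shots, 697)
X_reduced = PCA(n_components=140).fit_transform(X)           # roughly 5x smaller
print(X.shape, "->", X_reduced.shape)
```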
4.5. Online part

In the online part, the user is required to specify the duration of the targeted summary and to select the shots that he judges correspond, or are close, to his preferences. Relying on these selected shots, the system generates a summary including shots that may be interesting. However, a summary also has to give an idea of the whole content of the original sequence; for this reason, other constraints are taken into consideration, namely ensuring a good temporal coverage, avoiding redundancy and ensuring a smooth summary. We consider the summarization process as an optimization problem in which we try to find a compromise among all the evoked constraints.

4.5.1. Summary duration

The user has to specify the duration of the summary (see Fig. 1). This duration is not necessarily the duration of the generated summary but an upper bound that the summary must not exceed.

4.5.2. Learning the user's interests

The shots selected by the user are used to understand his preferences: the types and contents of shots that he is targeting.

To know whether the user is interested in action scenes, we perform a clustering of the shots into action and non-action shots; if one of the selected shots belongs to the action cluster, we suppose that the user is interested in this kind of shot. Consequently, two shots from every action scene (the two shots with the highest motion) will be included in the final summary. For the other types of shots, which are related to content, a content classification is performed to determine whether they belong to the user's center of interest. The aim of this classification is to discriminate between interesting and non-interesting shots. However, we are confronted with a particular classification problem because we have only positive examples: starting from the shots selected by the user (positive examples) and the content features computed in the offline part, we have to classify the shots into two categories, the shots that may be of great interest to the user and the shots that are not particularly interesting. This situation is not new and has been treated in other fields such as text categorization [29] and gene prediction [30]. In [29] and [30], the authors solved this problem using the one-class SVM, a type of SVM that uses a few samples of one class (either positive or negative) to enclose all samples that belong to this class. The encouraging results obtained in [29] and [30] led us to use this type of SVM to classify movie shots into interesting and non-interesting. In the literature, there are many implementations of the one-class SVM; the most widespread, and the one that gives the best results, is that of Schölkopf et al. [40], in which the authors consider the origin as the only member of the second class after transforming the features through a kernel function, and then try to separate the positive samples from the origin by a hyperplane. In our context, the user's preferences can be non-linearly separable and have irregular shapes. One of the advantages of SVMs is their flexibility toward the class shape, thanks to the use of kernels. The study done in [29] in the context of document classification shows that the radial kernel generally produces the best results relative to other kernels; this is not surprising because the radial kernel is known for its ability to capture different class shapes. The form of the radial kernel that we use is the following:

K(X, Y) = e^{-\gamma ||X - Y||^2}    (2)
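As an illustration, the sketch below shows how this one-class classification step could look. It assumes scikit-learn's OneClassSVM (which wraps LIBSVM, the toolbox mentioned next) with an RBF (radial) kernel; the feature vectors stand in for the PCA-reduced content descriptors and are random placeholders here.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
all_shots = rng.random((2000, 140))        # content descriptors of every shot (stub)
selected_idx = [12, 57, 301, 880]          # shots picked by the user (positive examples)

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
model.fit(all_shots[selected_idx])          # learn a region around the positives

# Shots predicted as +1 lie inside the learned region: candidate "interesting" shots.
interesting = np.where(model.predict(all_shots) == 1)[0]
print(f"{len(interesting)} shots retained as interesting")
```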

We chose the LIBSVM toolbox, available at [24], which provides an implementation of the one-class SVM; we used the standard parameters. Every shot is represented by its middle frame, and the positive examples are the middle frames of the selected shots. The one-class SVM classifies the shots using the content features evoked in Section 4.4.4, computed on the middle frames, to extract all the interesting shots. The extracted shots represent the basic elements of the final summary.

4.5.3. Generating the summary

This step aims at finding a compromise between what the user likes to watch and how a summary must be. If the user is interested in action scenes, we take from every action scene the two shots with the highest motion and integrate them in the summary. We add up the durations of these shots and subtract their sum from the upper duration; the remaining duration is used to integrate other shots representing the user's content preferences. From the shots classified as having interesting content by the one-class SVM, some are selected for inclusion in the video skim. The integration of these shots is not random.


[Fig. 5. Results of the genetic algorithm compared to other approaches on the TRECVID benchmark: (a) distribution of "Presence of repeated material" scores and (b) distribution of "Pleasant tempo" scores.]

Indeed, we have to keep in mind that a summary has to inform about the whole content of the original sequence (temporal and spatial coverage), with a pleasant tempo and without redundancy. Genetic algorithms [15] can support heterogeneous criteria in the evaluation and are naturally suited to incremental selection, which may be applied to streaming media such as video. We consider the summarization process as a constrained optimization problem and propose to solve it with a genetic algorithm. The reason for this choice is the encouraging results we obtained in the summarization task of the TRECVID 2008 evaluation campaign [48].

4.5.3.1. TRECVID benchmark and genetic algorithms

Over the last two years, TRECVID has added a summarization task to its evaluation campaigns. In our participation in TRECVID 2008 [9], we computed summaries for the BBC Rushes (the TRECVID summarization database) using genetic algorithms, selecting from every rush the shots considered important to obtain an informative skim. We randomly generate a set of summaries (the initial population) and then run the genetic algorithm (selection, crossover, mutation, etc.) many times on this population with the hope of enhancing the quality of the summaries. In the TRECVID campaign, the evaluation of the generated summaries is done by assessors who judge the summaries and attribute, for every one of them and for a set of criteria, scores ranging from 1 (worst) to 5 (best). We ranked first and second among 43 participants in the criteria "Pleasant tempo" and "Presence of repeated material", respectively (see Fig. 5). This shows the merits of using genetic algorithms in the summarization task and justifies why we reuse them here, with some modifications to adapt them to the context of real-time movie summarization. In our current context, we have to generate a summary responding to the following criteria:
- not exceeding the remaining duration (after deducting the total duration of the action shots from the upper duration fixed by the user);
- including a maximum of shots (among those judged interesting);
- presenting a minimum of redundancy;
- having a pleasant tempo;
- being spread over the whole movie.

4.5.3.2. Features of the genetic algorithm

To take into consideration the constraints evoked above, we suggest the following features for our genetic algorithm.

Binary encoding: we chose to encode our chromosomes (summaries) with binary encoding because of its popularity and relative simplicity. In binary encoding, every chromosome is a string of bits (0, 1); in our genetic solution, the bits of a given chromosome correspond to the candidate shots representing the user's preferences, and a 1 denotes a shot selected for the summary (see Fig. 6).

Fitness function: the fitness function evaluates a given summary. It has to favor smooth summaries that cover the whole movie, present a minimum of redundancy and do not exceed the duration specified by the user.

[Fig. 6. Encoding of the genetic algorithm: a chromosome is a bit string over the candidate shots selected by the user; shots present in the summary are encoded by 1, the remaining ones by 0.]

[Fig. 7. Position of the summary shots in the movie relative to the reference shots, which are spaced 5 min apart.]


Let S be a video summary composed of m shots, S = {shot_i, 1 <= i <= m}. We evaluate the chromosome representing the summary S by maximizing the following fitness function:

Fit(S) = [min(Dist(S)) - max(Dist(S))] + N_shots(S) + SD(S) + SS(S, U)    (3)

Let RS be the set of reference shots of the movie: the first shot of RS is the first shot of the movie, and every two successive shots of RS are separated by 5 min (see Fig. 7):

RS = {shot_i : shot_0 = first shot of the movie, shot_{i+1} - shot_i = 5 min}
U = RS \cup S, with the shots of U temporally ordered.

A summary has to cover the major parts of the original movie and should be as smooth as possible; in fact, video jerkiness may disturb the users. SS(S, U) is a score computed to evaluate the continuity (smoothness) of the summary and the coverage of the movie by the summary (through the reference shots). It is based on a measure called the Temporal Cost Function (TCF); SS and TCF are introduced in [25]. SS has the following form:

SS = \sum_{sh_{i-1}, sh_i \in U} TCF(|sh_{i-1} - sh_i|)    (4)

where

TCF(x) = 1.25^{-k_z x}, with k_z a constant.    (5)

N_shots(S) is the number of shots selected in the chromosome; we want to include a maximum of shots so as to cover a maximum of the original material. Dist(S) is the vector containing the pairwise distances between the shots of the summary; maximizing the term min(Dist(S)) - max(Dist(S)) penalizes redundancy that may occur in the final summary. The distance used to compare two shots, called Hist, is defined as the complement of the histogram intersection. The distance between two shots A and B is computed as follows:

Hist(A, B) = 1 - \sum_{k=1}^{256} min(H_A(k), H_B(k)) / \sum_{k=1}^{256} H_A(k)    (6)

SD is a weight computed on the summary duration according to a Gaussian function represented in Fig. 8; it has the following form:

SD(x) = exp[ -ln(2) (2(x - MaxDuration))^2 / MaxDuration^2 ]    (7)

The closer the duration x of the summary is to the maximum duration (the remaining duration), the higher this score, approaching 1. Since the summary must not exceed the specified duration, this coefficient penalizes summaries exceeding it.
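The sketch below shows how the fitness of one chromosome could be evaluated with these terms. The additive combination of the four terms, the value of k_z and the toy shot data are assumptions (the original equation is only partially legible), so this is an illustrative reading rather than the authors' exact implementation.

```python
import numpy as np

KZ = 0.02  # constant of the temporal cost function (assumed value)

def tcf(gap_seconds):
    return 1.25 ** (-KZ * gap_seconds)                       # Eq. (5)

def hist_dist(ha, hb):
    return 1.0 - np.minimum(ha, hb).sum() / ha.sum()         # Eq. (6)

def fitness(bits, starts, durations, histograms, max_duration, movie_length):
    idx = np.flatnonzero(bits)                                # shots kept in the summary
    if len(idx) < 2:
        return -np.inf
    # Redundancy term: min - max of pairwise histogram distances.
    d = [hist_dist(histograms[i], histograms[j]) for i in idx for j in idx if i < j]
    redundancy = min(d) - max(d)
    # Coverage/smoothness: reference shots every 5 min united with the summary shots.
    refs = np.arange(0, movie_length, 300.0)
    times = np.sort(np.concatenate([refs, starts[idx]]))
    ss = sum(tcf(t2 - t1) for t1, t2 in zip(times[:-1], times[1:]))   # Eq. (4)
    # Duration weight (Gaussian centred on the allowed duration), Eq. (7).
    dur = durations[idx].sum()
    sd = np.exp(-np.log(2) * (2 * (dur - max_duration)) ** 2 / max_duration ** 2)
    return redundancy + len(idx) + sd + ss                    # Eq. (3), assumed additive

# Toy usage: 40 candidate shots in a 2 h movie, 3 min budget.
rng = np.random.default_rng(0)
starts = np.sort(rng.uniform(0, 7200, 40)); durs = rng.uniform(2, 15, 40)
hists = rng.random((40, 256)); bits = rng.integers(0, 2, 40)
print(fitness(bits, starts, durs, hists, max_duration=180, movie_length=7200))
```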

[Fig. 9. The crossover operation between two chromosomes representing two summaries: the two parents exchange part of their bits to produce two children.]

[Fig. 10. The mutation operation: a bit of the chromosome is flipped.]

The crossover operation: the crossover consists in exchanging a part of the genetic material of the two parents to construct two new chromosomes (see Fig. 9). This technique is used to explore the space of solutions by proposing new chromosomes that may improve the fitness function. In our genetic algorithm the crossover operation is classic, with the crossover probability fixed at 0.75.

The mutation operation: after a crossover is performed, mutation takes place. Mutation is intended to prevent all the solutions in the population from falling into a local optimum (see Fig. 10). The mutation probability is fixed at 0.02. A minimal sketch of these two operators is given below.
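The sketch below illustrates the binary encoding and the two operators with the probabilities stated above (0.75 for crossover, 0.02 for mutation). It is a generic one-point crossover and bit-flip mutation, not the authors' Matlab "gatool" implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
P_CROSSOVER, P_MUTATION = 0.75, 0.02

def crossover(parent_a, parent_b):
    """One-point crossover: swap the tails of the two parent bit strings."""
    if rng.random() < P_CROSSOVER:
        point = rng.integers(1, len(parent_a))
        child_a = np.concatenate([parent_a[:point], parent_b[point:]])
        child_b = np.concatenate([parent_b[:point], parent_a[point:]])
        return child_a, child_b
    return parent_a.copy(), parent_b.copy()

def mutate(chromosome):
    """Flip each bit independently with probability P_MUTATION."""
    flips = rng.random(len(chromosome)) < P_MUTATION
    return np.where(flips, 1 - chromosome, chromosome)

# Toy usage: two parent summaries over 15 candidate shots.
a, b = rng.integers(0, 2, 15), rng.integers(0, 2, 15)
child_a, child_b = crossover(a, b)
print(mutate(child_a), mutate(child_b))
```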


5. Experimental results

5.1. Dataset and test bed

To show the effectiveness of IM(S)2, we were inspired by the evaluation done in the TRECVID workshop [48] (the video summarization task) and by several works that evaluate the summaries generated by their systems [13,14,25,50] through the MOS (Mean Opinion Score). We invited 10 users (assessors) to test IM(S)2 on a database composed of 5 movies of the European project MUSCLE (Multimedia Understanding through Semantics, Computation and Learning). The 10 users are not familiar with the system, but they are regular movie consumers; they belong to different age categories and different backgrounds (students, PhD students, professors and office workers). The list of movies is presented in Table 1. IM(S)2 is entirely implemented in Matlab, using the Matlab genetic algorithm toolbox "gatool". The hardware platform is a PC with a 2.66 GHz processor and 1 GB of RAM.

5.2. Results

The system displays to every user an overview of the movie composed of its scenes. Then the user selects the shots that correspond to his preferences.

[Fig. 8. Duration weighting: the Gaussian weight SD as a function of the summary duration.]

Table 1
Details of the MUSCLE database.

Movie                   Genre    Duration
Cold Mountain           Drama    2 h 30 min
Jackie Brown            Drama    2 h 30 min
Platoon                 Action   1 h 55 min
Secret Window           Horror   1 h 30 min
The Lord of the Rings   Action   2 h 45 min


The system studies these preferences and generates a summary. After watching the summary, every user is asked to evaluate its quality, in an assessment similar to that of the TRECVID workshop. The criteria used in our evaluation are:

- the summary gives a clear idea about the story of the movie;
- the summary includes the preferences of the user;
- the summary has a pleasant tempo and is easy to understand;
- the lack of redundancy.

As in TRECVID, every user is asked to give a score ranging from 1 (worst) to 5 (best) to indicate how well the summaries respect each criterion. The overall results for the individual measures are presented as Tukey-style boxplots. Tukey boxplots (also known as box-and-whisker plots) give a graphic image of several key measures of a distribution: the minimum, the maximum, the median, and the 25th and 75th percentiles (the middle 50%).

5.2.1. Giving a clear idea about the story of the movie

The first goal of a video summary is to give a clear idea about the story of the original video sequence. For this reason, we ask the 10 users whether the generated summaries give a clear idea about the stories of the original sequences. Every user is given the textual synopsis of the original movie found on the official movie producer's website and is asked how well this synopsis matches the generated video skim. We then plot, for every movie, the distribution of its scores (see Fig. 11). All the scores are around four. The fitness function played a key role here: it favors summaries that contain a maximum of shots thanks to the term N_shots(S) (a maximum of events), with a minimum of redundancy thanks to the term min(Dist(S)) - max(Dist(S)) (all included events are original), and covering the major parts of the movie thanks to the term SS(S, U) (the included events are spread over the entire movie). However, we notice a small difference between action movies and non-action movies: action movies obtain the best scores. Shots of non-action movies are long compared to shots of action movies, so summaries of non-action movies contain fewer shots; this reduces the coverage of the movie by the summary and thus affects the understandability of non-action movies.

[Fig. 11. Distribution of "Giving a clear idea about the story of the movie" scores according to movies.]

5.2.2. Respecting the user's preferences

We plot, for every movie, the distribution of the scores of its summaries for the criterion "respecting the user's preferences" (see Fig. 12). Generally, users are satisfied by the output of the system: they retrieve the content that they targeted. This confirms our technical choices; in particular, it confirms the efficiency of the one-class SVM and of the content features that we use in bridging the semantic gap and in understanding the users' preferences in terms of content information.

The high scores obtained for the action movies (Platoon and The Lord of the Rings) also confirm the efficiency of the tempo features and of the Fuzzy C-Means classifier in locating action scenes. Besides, we noticed while observing the test sessions that users do not focus on only one preference; for example, a user may select from the overview both shots showing soldiers fighting in the forest and a close-up shot showing another actor. From the obtained results we can also deduce that the one-class SVM is able to capture different concepts (heterogeneous preferences), which we attribute to the use of the radial kernel. Moreover, specifying the preferences directly in the movie by means of the overview makes the context of the movie easier to grasp; it avoids the ambiguity that may occur when preferences are specified with keywords, for example. Preferences in the form of images are more concrete and more closely linked to the context of the original movie. Some users suggested specifying their preferences directly by typing the targeted concepts: for the movie "Platoon", for instance, a user may ask for shots showing a soldier in the forest, and instead of selecting such a shot he would write Concept{Soldier} + Concept{Forest}. However, this would require a heavy concept-indexing effort to extract the concepts present in every movie according to an ontology; users would then navigate this ontology to specify their preferences.

[Fig. 12. Distribution of "Respecting the user's preferences" scores according to movies.]

5.2.3. Pleasant tempo and easy to understand

Taking into consideration the smoothness of the video summary was useful: introducing the term SS(S, U) in the fitness function makes the generated summaries easy to understand (not jumpy) and gives them a pleasant tempo. Besides, we use fades as transitions between the shots of the generated summaries to overcome the change-blindness issue and make the summaries more comfortable to watch. This has an important impact on the scores attributed to the generated summaries, which are generally around 4 (see Fig. 13).

[Fig. 13. Distribution of "Pleasant tempo and easy to understand" scores according to movies.]

5.2.4. Lack of redundancy

Our strategy of integrating the users' preferences in the summaries may produce a sort of redundancy: the preferences of a given user may be centered on a given content, and this content may then be present throughout the summary, giving an impression of redundancy. Here the genetic algorithm played a key role in avoiding this problem. Our participation in TRECVID 2008 showed that genetic algorithms are efficient at this kind of task, and these results are confirmed once more: we plot, for every movie, the distribution of the "Lack of redundancy" scores of its summaries, and the scores are high for all movies (see Fig. 14).

[Fig. 14. Distribution of "Lack of redundancy" scores according to movies.]

5.3. Comparison results

We compared our system with Parshin's system [34], which is the only summarization system in the literature that proposes an effective interaction with the users and tries to collect effective preferences related to the content of the movie. In fact, in [12] the only preference is the number of key frames of the summary (summary size); in [13] the preferences are the number of key frames and the processing time; and in [44] users are required to provide their own photos so that their preferences can be deduced. Parshin's system is based on quantifying the preferences of the user through some high-level features, namely the place of action (indoor or outdoor), the time for outdoor shots (day or night), the percentage of human skin, the average quantity of motion, and the duration of the semantic segments in the shots (speech, music, silence and noise).

Two of the ten users who judged our summarization system were invited to test Parshin's system on the MUSCLE database; the results of the comparison are shown in Table 2. Although we explained all the features and their impact on the generated summaries, the users did not much appreciate specifying their preferences by quantifying high-level features: they consider that the used features are general and not related to the exact context of every movie, and that their preferences are more complicated than "indoor/outdoor" or "day/night". For this reason, we perform better on the criterion of respecting the users' preferences. In our system, we also pay a lot of attention to the quality of the generated summaries: we try to avoid jumpy, non-smooth summaries and to ensure, through the SS function, that the summaries cover the whole original sequence. Contrary to our system, Parshin's system does not take these two constraints into consideration; it is simply based on tracking some high-level features in the movie, all shots showing these features are automatically integrated, and no effort is made to check whether the selected shots are well spread over the entire movie. That is why we perform better on the criteria "giving a clear idea about the story of the movie" and "pleasant tempo and easy to understand". Parshin's system gives good results on the criterion "lack of redundancy" because it clusters the shots to gather similar ones, and the score attributed to every shot of a cluster takes into consideration the "originality" of this shot relative to the other shots of the cluster. The results of our system and of Parshin's system on this criterion are nearly the same.

Table 2
Comparison results of our system with Parshin's system.

                        Our system                        Parshin's system
                        First user       Second user      First user        Second user
Movie                   CS  RP  PT  LR   CS  RP  PT  LR   CS  RP  PT  LR    CS  RP  PT  LR
Cold Mountain           4   4   4   5    5   5   4   4    3   3   3   4     2   3   4   5
Jackie Brown            3   4   5   5    4   3   4   5    2   3   2   5     2   2   2   3
Platoon                 4   3   3   5    4   4   4   5    3   4   1   4     3   3   2   5
Secret Window           5   4   4   4    3   4   3   4    4   3   2   4     3   3   3   4
The Lord of the Rings   4   4   4   5    4   4   3   4    2   3   3   4     2   3   2   3
Average                 4   3.8 4   4.8  4   4   3.6 4.4  2.8 3.2 2.2 4.2   2.4 2.8 2.6 4

CS: giving a clear idea about the story of the movie. RP: respecting the user's preferences. PT: pleasant tempo and easy to understand. LR: lack of redundancy.


6. Conclusions and perspectives

We have proposed the IM(S)2 system for generating user-oriented video summaries. Contrary to existing systems, which either neglect the users' preferences or pay them little attention, our system involves users directly in the summarization process. In our work, we tried to encompass all the preferences a user may have: the type of scenes (action, romance, dialog), their contents (characters, locations, time, etc.) and the duration of the final summary. To demonstrate the effectiveness of IM(S)2, we invited 10 users to judge the quality of the generated summaries according to four criteria. The results are generally encouraging and show that one-class SVMs are successful in bringing out the users' preferences and that the genetic algorithms used are efficient in generating optimal summaries, confirming the results obtained in the TRECVID workshop. However, the IM(S)2 system needs some improvements. In the near future we will try to enhance the interaction between the user and the system. We plan to refine the way in which users specify their preferences: in a given shot, the user may be interested in particular objects or concepts, so we propose that users specify, inside the shots, the regions or objects that may be interesting instead of selecting the whole shot. The encouraging results obtained on the movie corpus also motivate us to extend the use of our system to other corpora; in fact, we have begun to investigate adapting the architecture of our system to news and documentary videos.

Acknowledgments

The authors would like to thank several individuals and groups for making the implementation of this system possible. The authors acknowledge the financial support of this work by grants from the General Direction of Scientific Research and Technological Renovation (DGRSRT), Tunisia, under the ARUB program 01/UR/11/02. We are also grateful to EGIDE and INRIA, France, for sponsoring this work and the three-month research placement of Mehdi Ellouze (1/11/2007 to 31/1/2008) in the INRIA IMEDIA team, during which parts of this work were done. We are also grateful to the European project MUSCLE and to Prof. Constantine Kotropoulos from Aristotle University of Thessaloniki for providing the data.

References

[1] L. Agnihotri, N. Dimitrova, J. Kender, J. Zimmerman, Study on requirement specifications for personalized multimedia summarization, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Baltimore, USA, 2003, pp. 757–760.
[2] J. Assfalg, M. Bertini, A. Del Bimbo, W. Nunziati, P. Pala, Soccer highlights detection and recognition using HMM's, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, 2002, pp. 825–828.
[3] A. Benammar, Enhancing query reformulation by combining content and hypertext analyses, in: Proceedings of the European Conference on Information Systems, 2004.
[4] A. Benammar, G. Hubert, J. Mothe, Automatic profile reformulation using a local document analysis, in: Proceedings of the European Conference on IR Research, 2002, pp. 124–134.
[5] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum, New York, 1981.
[6] N. Boujemaa, J. Fauqueur, M. Ferecatu, F. Fleuret, V. Gouet, B.L. Saux, H. Sahbi, IKONA: interactive generic and specific image retrieval, in: Proceedings of the International Workshop on Multimedia, Rocquencourt, France, 2001.
[7] A. Casillas, M.T. Gonzalez, R. Martinez, Document clustering into an unknown number of clusters using a genetic algorithm, in: Proceedings of the International Conference on Text, Speech and Dialogue, Ceske Budejovice, Czech Republic, 2003, pp. 43–49.
[8] M. Ellouze, N. Boujemaa, A.M. Alimi, Scene pathfinder: unsupervised clustering techniques for movie scenes extraction, Multimedia Tools and Applications, in press, doi:10.1007/s11042-009-0325-5.
[9] M. Ellouze, H. Karray, A.M. Alimi, REGIM, Research Group on Intelligent Machines, Tunisia, at TRECVID 2008, BBC Rushes summarization, in: Proceedings of the International Conference ACM Multimedia, TRECVID BBC Rushes Summarization Workshop, Vancouver, British Columbia, Canada, 2008, pp. 105–108.
[10] M. Ferecatu, Image retrieval with active relevance feedback using both visual and keyword-based descriptors, Ph.D. thesis, University of Versailles Saint-Quentin-en-Yvelines, 2005.
[11] M. Ferecatu, N. Boujemaa, M. Crucianu, Semantic interactive image retrieval combining visual and conceptual content description, Multimedia Systems 13 (2008) 309–322.
[12] A.M. Ferman, A.M. Tekalp, Two-stage hierarchical video summary extraction to match low-level user browsing preferences, IEEE Transactions on Multimedia 5 (2003) 244–256.
[13] M. Furini, F. Geraci, M. Montangero, VISTO: Visual STOryboard for web video browsing, in: Proceedings of the ACM International Conference on Image and Video Retrieval, Amsterdam, The Netherlands, 2007, pp. 635–642.
[14] M. Furini, G. Vittorio, An audio-video summarization scheme based on audio and video analysis, in: Proceedings of the Consumer Communications and Networking Conference, Las Vegas, USA, 2006, pp. 1209–1213.
[15] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1989.
[16] A. Hanjalic, L.L. Reginald, J. Biemond, Automatically segmenting movies into logical story units, in: Proceedings of the International Conference on Visual Information Systems, Amsterdam, The Netherlands, 1999, pp. 229–236.
[17] IMDB, http://www.imdb.com/, last visited June 2008.


[18] H. Karray, M. Ellouze, A.M. Alimi, KKQ: K-frames and K-words extraction for quick news story browsing, International Journal of Information and Communication Technology 1 (2008) 69–76.
[19] H. Karray, M. Ellouze, A.M. Alimi, Indexing video summaries for quick video browsing, Springer, London, 2009, pp. 77–95.
[20] H. Karray, A. Wali, N. Elleuch, A. BenAmmar, M. Ellouze, I. Feki, A.M. Alimi, REGIM at TRECVID 2008: high-level features extraction and video search, TRECVID 2008.
[21] M. Kherallah, H. Karray, M. Ellouze, A.M. Alimi, Toward an interactive device for quick news story browsing, in: Proceedings of the International Conference on Pattern Recognition, Florida, USA, 2008, pp. 1–4.
[22] R. Lienhart, S. Pfeiffer, W. Effelsberg, Video abstracting, Communications of the ACM (1997) 55–62.
[23] B.X. Li, M.I. Sezan, Event detection and summarization in American football broadcast video, in: Proceedings of the Symposium on Electronic Imaging: Science and Technology: Storage and Retrieval for Media Databases, 2002, pp. 202–213.
[24] LIBSVM 2.0, http://www.csie.ntu.edu.tw, Last visited May 2008.
[25] W.-N. Lie, K.-C. Hsu, Video summarization based on semantic feature analysis and user preference, in: Proceedings of the IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing, Taichung, Taiwan, 2008, pp. 486–491.
[26] S. Lu, I. King, M.R. Lyu, Video summarization using greedy method in a constraint satisfaction framework, in: Proceedings of the International Conference on Distributed Multimedia Systems, Florida, USA, 2003, pp. 456–461.
[27] B. Lucas, T. Kanade, An iterative image registration technique with an application to stereo vision, in: Proceedings of the International Joint Conference on Artificial Intelligence, 1981, pp. 674–679.
[28] Y.F. Ma, L. Lu, H.J. Zhang, M.J. Li, A user attention model for video summarization, in: Proceedings of ACM Multimedia, Juan-les-Pins, France, 2002, pp. 533–542.
[29] L.M. Manevitz, M. Yousef, One-class SVMs for document classification, Machine Learning Research Archive 2 (2002) 139–154.
[30] D. Marcu, The automatic construction of large-scale corpora for summarization research, in: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, California, USA, 1999, pp. 137–144.
[31] M. Mills, A magnifier tool for video data, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, California, USA, 1992, pp. 93–98.
[32] A.G. Money, H. Agius, Video summarization: a conceptual framework and survey of the state of the art, Journal of Visual Communication and Image Representation (2007) 121–143.
[33] N. Omoigui, L. He, A. Gupta, J. Grudin, E. Sanoki, Time-compression: system concerns, usage, and benefits, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Pennsylvania, USA, 1999, pp. 136–143.
[34] V. Parshin, L. Chen, Video summarization based on user-defined constraints and preferences, in: Proceedings of the Conference Recherche d'Information Assistée par Ordinateur, 2004.
[35] K. Peker, A. Divakaran, Adaptive fast playback-based video skimming using a compressed-domain visual complexity measure, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, 2004, pp. 2055–2058.
[36] M. Petkovic, V. Mihajlovic, M. Jonker, S. Djordjevic-Kajan, Multi-modal extraction of highlights from TV Formula 1 programs, in: Proceedings of the IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland, 2002, pp. 817–820.
[37] S. Pfeiffer, R. Lienhart, S. Fischer, W. Effelsberg, Abstracting digital movies automatically, Journal of Visual Communication and Image Representation 7 (1996) 345–353.
[38] D. Ponceleon, A. Amir, CueVideo: automated multimedia indexing and retrieval, in: Proceedings of the ACM Multimedia Conference, Florida, USA, 1999, p. 199.
[39] D. Sadlier, N. O'Connor, Event detection in field sports video using audio-visual features and a support vector machine, IEEE Transactions on Circuits and Systems for Video Technology (2005) 1225–1233.
[40] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, Estimating the support of a high-dimensional distribution, Neural Computation 7 (2001) 1443–1471.
[41] A.F. Smeaton, B. Lehane, N.E. O'Connor, C. Brady, G. Craig, Automatically selecting shots for action movie trailers, in: Proceedings of the ACM International Workshop on Multimedia Information, New York, USA, 2006, pp. 231–238.
[42] M.A. Smith, T. Kanade, Video skimming and characterization through the combination of image and language understanding techniques, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Puerto Rico, USA, 1997, pp. 775–781.
[43] STUDIO4NETWORKS, http://www.studio4networks.com/, Last visited June 2008.
[44] Y. Takeuchi, M. Sugimoto, Video summarization using personal photo libraries, in: Proceedings of the ACM International Workshop on Multimedia Information Retrieval, Santa Barbara, USA, 2006, pp. 213–222.
[45] Y. Taniguchi, A. Akutsu, Y. Tonomura, Panorama Excerpts: extracting and packing panoramas for video browsing, in: Proceedings of ACM Multimedia, Seattle, USA, 1997, pp. 427–436.
[46] Y. Taniguchi, A. Akutsu, Y. Tonomura, H. Hamada, An intuitive and efficient access interface to real-time incoming video based on automatic indexing, in: Proceedings of the ACM International Conference on Multimedia, San Francisco, USA, 1995, pp. 25–33.
[47] C. Toklu, S.P. Liou, M. Das, Video abstract: a hybrid approach to generate semantically meaningful video summaries, in: Proceedings of the IEEE International Conference on Multimedia and Expo, New York, USA, 2000, pp. 268–271.
[48] TRECVID, 2003–2008, TREC Video Retrieval Evaluation, Last visited May 2008.
[49] B.T. Truong, S. Venkatesh, Video abstraction: a systematic review and classification, ACM Transactions on Multimedia Computing, Communications and Applications 3 (2007).
[50] T. Tsoneva, Automated summarization of movies and TV series on a semantic level, Ph.D. thesis, University of Eindhoven, 2007.
[51] T. Tsoneva, M. Barbieri, H. Weda, Automated summarization of narrative video on a semantic level, in: Proceedings of the International Conference on Semantic Computing, California, USA, 2007, pp. 169–176.

[52] S. Uchihashi, J. Foote, A. Girgensohn, J. Boreczky, Video Manga: generating semantically meaningful video summaries, in: Proceedings of ACM Multimedia, Florida, USA, 1999, pp. 383–392.
[53] X.D. Yu, L. Wang, Q. Tian, P. Xue, Multi-level video representation with application to keyframe extraction, in: Proceedings of the International Conference on Multimedia Modelling, Brisbane, Australia, 2004, pp. 117–121.
[54] H.J. Zhang, D. Zhong, S.W. Smoliar, An integrated system for content-based video retrieval and browsing, Pattern Recognition 30 (1997) 643–658.
[55] Y. Zhuang, Y. Rui, T.S. Huang, S. Mehrotra, Adaptive key frame extraction using unsupervised clustering, in: Proceedings of the IEEE International Conference on Image Processing, Chicago, USA, 1998, pp. 73–82.