A usage study of retrieval modalities for video shot retrieval

Alan F. Smeaton *, Paul Browne
Adaptive Information Cluster and Centre for Digital Video Processing, Dublin City University, Glasnevin, Dublin 9, Ireland

Information Processing and Management 42 (2006) 1330–1344
doi:10.1016/j.ipm.2005.11.003

Received 11 July 2005; received in revised form 24 November 2005; accepted 30 November 2005; available online 27 January 2006

* Corresponding author. E-mail address: [email protected] (A.F. Smeaton).

Abstract

As an information medium, video offers many possible retrieval and browsing modalities, far more than text, image or audio. Some of these, like searching the text of the spoken dialogue, are well developed, others like keyframe browsing tools are in their infancy, and others are not yet technically achievable. For those modalities for browsing and retrieval which we cannot yet achieve we can only speculate as to how useful they will actually be. In our work we have created a system to support multiple modalities for video browsing and retrieval, including text search through the spoken dialogue, image matching against shot keyframes and object matching against segmented video objects. For the last of these, automatic segmentation and tracking of video objects is a computationally demanding problem which is not yet solved for generic natural video material; when it is, it is expected to open up possibilities for user interaction with objects in video, including searching and browsing. In this paper we achieve object segmentation by working in a closed domain of animated cartoons. We describe an interactive user experiment on a medium-sized corpus of video in which we were able to measure users' use of video objects versus other modes of retrieval during multiple-iteration searching. Results of this experiment show that although object searching is used far less than text searching in the first iteration of a user's search, it is a popular and useful search type once an initial set of relevant shots has been found.

© 2005 Elsevier Ltd. All rights reserved.

Keywords: Video indexing and retrieval; Video object detection; User study

1. Introduction

Video is a rich medium for carrying information, far richer than text, image or audio. Video in digital format started to receive much attention when personal computers became powerful enough to handle video, namely when they were able to capture, compress and store it. Because technology now allows us to easily capture, compress, store, edit, transmit and render video, resulting in large collections of video, it follows that managing collections of video information is also now important.

There are two broad approaches for managing digital video information, or indeed managing any kind of digital information: using manually-assigned tags and metadata, and automatically processing video content in
order to derive content descriptors (Smeaton, 2004). In the case of video, archivists have, for years, been using manual annotations and metadata (date, time, location, etc.) as the basis for systems which manage video content in broadcasters’ libraries, news archives, video stock shot vendors, etc. The second approach of automatically processing video data in order to derive content descriptors for use as a basis for retrieval is a relatively recent development. Even in the few years since such video analysis techniques have reached the stage of being efficient, scalable and accurate as well as also being computationally tractable, a variety of approaches or modalities for automatic content description have emerged, which we shall see later. We are now at an interesting point in the development of content-based retrieval from digital video archives in that the easy things like searching the spoken dialogue have been researched, prototypes have been developed and are under evaluation, and we are now facing having to tackle the harder things in terms of video processing such as matching tracked video in shots if we want to make further progress in this area. This requires us to now propose and develop new modalities for video retrieval as these new video analysis and matching technologies turn into tools which are usable in video retrieval. As we move headlong into tackling the ‘‘hard’’ video retrieval techniques, a question we should ask is whether these will be worth the effort—do we know what kinds of modalities for retrieval that users want, or will even find useful? Because video is so different to say text, simply replicating techniques based on statistical word occurrences developed for textbased IR on video is not acceptable because it would not necessarily leverage the richness of information associated with video when compared to text or other media. To address the question of whether, and perhaps in what order, different modalities for video retrieval are actually useful in retrieval, in this paper we set out to examine how a range of such video retrieval techniques are used in a certain type of video retrieval scenario. By limiting ourselves in the domain and type of video we use, and also in the type of search a user is asked to perform, we are able to extend the range of search/browse modalities available into a richer set of retrieval options which we would expect to see developed generally. Thus, restricting the video genre and search type gives us a glimpse of what we can expect on general purpose video sometime in the future. Operating in a controlled search environment, we are able to monitor how a set of users each runs a set of given search tasks and on this we carry out a usage study analysing what features of search our users use, and when. This gives us some guidelines as to what kinds of video retrieval people will actually use. The rest of paper is organised as follows. The next section contains a brief review of contemporary approaches to video IR. This mostly covers shot-level retrieval, i.e. retrieval of video shots rather than single frames, video scenes or whole programmes. Our review includes text-based retrieval on the closed captions or speech as appearing in the video dialogue or in video OCR, using a sample image/keyframe as a query for video shot retrieval, indexing video by semantic concepts and the growing interest in ontologies for video retrieval and the different approaches to video browsing. 
A recent development in available video retrieval techniques is object-based retrieval where a user uses a video object or a component of an object as part of a query. In Section 3, we examine the problem of object detection and what is currently feasible in that area, including ongoing work on objectbased video retrieval. Section 4 gives a brief overview of a system we have built which supports search and browsing of a 21-hour corpus of video using a range of retrieval options. We describe the video corpus, user queries, user characteristics and the search task we set our users. In Section 5, we present results of our usage survey of this system and we conclude by highlighting the contribution of this paper. 2. Approaches to video retrieval In this paper we are interested in video retrieval where metadata such as date and time, location, title, etc., or other manually assigned descriptors, is not sufficient to give the richness of retrieval and interaction we are interested in. Such a situation occurs when a user has already narrowed down a video archive using such metadata and there is a considerable amount remaining to be searched or the user’s information need cannot take advantage of existing metadata to narrow the search scope. In these cases we need to use search and/or browsing techniques based directly on the video content and we refer to this as content-based video retrieval. Contemporary approaches to content-based video retrieval usually involve a combination of techniques including text matching against the spoken dialogue or against video text OCR, image matching against shot keyframes, or using semantic features to index video shots. Text matching is the traditional approach of comparing a user’s input query against text obtained from the video such as from closed captions (if available),

recognized speech, or against video OCR, which is text recognized from a video frame where it appears as a caption or as part of the actual image itself like a road sign or name on the side of a building (Sato, Kanade, Hughes, & Smith, 1998). The usual text-based IR approaches can be used and term weighting approaches from text IR such as BM25 have been applied, as well as attempts to develop language models for video text retrieval. Text-based video retrieval works adequately for certain types of user search where the focus of the search is the actual item under discussion in the dialogue from the video, but when the focus of the search is to be found in the visual content then image matching and indexing techniques based on semantic features such as indoor, outdoor, or the number of faces appearing, are more suitable. One of the first steps in video analysis is automatic structuring of video into units referred to as shots. These correspond to the video footage taken by a single camera in a contiguous time period and may involve camera motion (zoom, pan, tilt, etc.) and/or the movement of objects in the video. Boundaries between shots can be ‘‘hard’’, where the last frame of the finishing shot is followed directly by the first frame of the incoming shot, or the shot transition can be ‘‘gradual’’ as in fades, cross-fades, etc. Automatic techniques for shot boundary detection are now very reliable with greater than 95% precision and recall being reported for hard cuts and greater than 80% precision and recall for gradual transitions using some techniques (Smeaton, Over, & Kraaij, 2004). Shot boundary detection is usually accompanied by the automatic extraction of a keyframe or single image as a representative for each shot. In video retrieval, this representative image is selected using somewhat injudicious mechanisms, the most popular being to simply choose the keyframe in the middle of the shot! Keyframes, however they are chosen, are used in video retrieval for two purposes: to present a visual summary of a shot to a user during video browsing, and to act as a representative of a shot’s contents for image-based shot retrieval. In the case of using keyframes to browse shots this can include browsing sets of keyframe images representing sets of shots. Other approaches to browsing sets of shots such as fast-forward and advanced displays of keyframe images have been tried, especially in the interactive search task in the annual TRECVid benchmarking activity (Smeaton et al., 2004). In the case of image-based shot retrieval then using the simplest kind of image matching will match images on the basis of low-level image features such as global or local colour, texture, edges or perhaps shapes (Smeulders, Worring, Santini, Gupta, & Jain, 2000). However using the poor approaches to keyframe selection that we use, it is not surprising that matching a user’s query image(s) against keyframes using low-level features can be successful only if the user’s information need can be captured in a still image, and the keyframe is genuinely representative of a shot. Queries such as a ‘‘rocket launch’’ with a query image consisting of a blaze of flames coming from a rocket against a blue sky background will be visually similar to keyframes taken from rocket launch footage, so there are cases when this type of retrieval works well. 
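To make this kind of low-level matching concrete, the following is a minimal sketch, not the system described later in this paper, of ranking shot keyframes against a query image using a global colour histogram; the histogram size and the intersection measure are illustrative choices.

```python
import numpy as np

def colour_histogram(rgb_image, bins=8):
    """Global colour histogram: an 8x8x8 RGB histogram, L1-normalised."""
    pixels = rgb_image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-9)

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical normalised histograms."""
    return np.minimum(h1, h2).sum()

def rank_keyframes(query_image, keyframes):
    """Rank shot ids by visual similarity of their keyframe to the query image.
    `keyframes` maps shot_id -> RGB array of that shot's keyframe."""
    q = colour_histogram(query_image)
    scores = {sid: histogram_intersection(q, colour_histogram(img))
              for sid, img in keyframes.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

As the text notes, this kind of matching only helps when the information need is captured by a still image and the keyframe genuinely represents the shot.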
Apart from text-based and image matching approaches, the automatic detection of semantic concepts and the relationship among these concepts and how they can be structured into an ontology has received much attention from the video retrieval research community. The common approach is to manually annotate a large corpus of shots as having feature X or not having feature X and then use a machine learning algorithm to learn an automatic classifier. Examples of features which have been used for this include ‘‘indoor/outdoor’’, ‘‘buildings’’, ‘‘countryside/cityscape’’, ‘‘people present’’, various types of camera motion, ‘‘bicycle’’, ‘‘automobile’’, ‘‘waterside/beach’’, and so on (Rautiainen, Ojala, & Sepp, 2004). Clearly the range of possible features which we could hope to automatically detect is enormous but building such concept feature detectors is a difficult task for a number of reasons. Firstly, the definition of what constitutes a semantic concept is debatable. Is a shot taken indoors but including outdoor scenery through a window indoor or outdoor? A shot of a park with trees and grass and a lake, but with a New York city skyline in the background, is it countryside or cityscape? How much of a bicycle needs to be visible for a shot to be deemed as featuring a bicycle? A second reason why building feature detectors for video retrieval is difficult is that the selection of a set of concepts which is broad enough to cover a large range of information needs yet discriminative enough to be useful in user searching, is an open question. Finally the task of collecting enough positive and negative examples of a feature concept and then learning a classifier is something that requires much effort, all of it manual, and thus is expensive. Nevertheless, if a concept feature detector is available and has been applied to a corpus of video with a high level of accuracy then it can certainly be useful in helping a users search. The way in which such semantic features can be most usefully used is to arrange the semantic features into an ontology, usually

hierarchical in nature, and to allow the user to navigate this ontology to help focus their information need (Snoek et al., 2005). The problem with using semantic features and ontologies is that there are only a small number of classifiers built for these features and many have a level of accuracy which is less than that required for reliable retrieval. This is shown each year in TRECVid, where the average precision of some feature detections can be as low as 20% for the best-performing groups (Smeaton et al., 2004).

One other type of video shot retrieval which has not received much attention to date is retrieval based on the presence, or absence, of given video objects. Sometimes we want to retrieve video shots based on an identifiable object in the video such as a car, a dog, the Eiffel Tower or a motorbike. In such cases we can hope that the dialogue will mention that what is on-screen is a dog in a car beside a bicycle in front of the Eiffel Tower, but we cannot rely on this as it usually does not happen at all. We could also hope that reliably detected and useful features will help narrow down the corpus so that we can do keyframe browsing based on filtering shots as being outdoor and containing cars, but we cannot rely on this either. Another approach is to find an example image of a dog or a car or the Eiffel Tower and match this example image containing our desired object(s) against all keyframes, with the match being based on colour, texture or edges. This is likely to be useful only for the very few shots where the colour, texture and edges are very predictable, such as a rocket launch against a blue sky; the range of colours, textures and edges possible in a shot of the Eiffel Tower is huge. What would be really useful for this kind of retrieval is if we could specify one or more actual objects, segmented from their backgrounds, and use those as part of the query (Sivic & Zisserman, 2003), which is exactly what we do in this paper as we shall see later.

With a range of search modalities available for video retrieval, and each working best only on certain search types, clearly the best overall approach is to make as many modalities as possible available to a user and let the user use these, possibly in some weighted combination. A recent approach in (Yan et al., 2004) has been to automatically assign a user's search to one of four pre-defined search types, and for each search type the system will have previously been trained to determine the optimal combination of different retrieval tools such as those outlined above. In the case of (Yan et al., 2004) this was done by training search types using TRECVid data (Smeaton et al., 2004), which has known relevance assessments. This appears to be accepted as the best overall approach to video shot retrieval, but it does not allow the user to interactively adjust the dynamics of the search depending on whether he/she prefers to use one modality more than another.

In the work reported here we are interested in how a user chooses for himself/herself from among the different search modalities on offer in video retrieval. We are particularly interested in how a user would use, or choose not to use, video objects as part of retrieval. With so much effort going into object segmentation and tracking, and the promise that object-based retrieval offers, we are keen to try to understand whether object-based retrieval will actually be useful in practice. In the next section we summarise prior work in video object segmentation and object-based retrieval.
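Although concept detectors are not the route taken in this paper, the supervised approach described earlier in this section can be sketched as follows; the SVM classifier, scikit-learn and the feature representation are illustrative assumptions rather than the setup of any system cited above.

```python
import numpy as np
from sklearn.svm import SVC

def train_concept_detector(keyframe_features, labels):
    """keyframe_features: (n_shots, n_dims) low-level descriptors (e.g. colour/texture);
    labels: 1 if the shot was manually annotated as containing the concept, else 0."""
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(keyframe_features, labels)
    return clf

def score_corpus(clf, corpus_features):
    """Probability that each unseen shot contains the concept, usable as one
    more ranked evidence source at search time."""
    return clf.predict_proba(corpus_features)[:, 1]
```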
3. Video objects: segmentation and retrieval

Segmentation, tracking and compression of objects in video underpins the MPEG-4 compression standard (Ebrahimi & Horne, 2000), and when it is achieved reliably, accurately and on a large scale, object detection is expected to open up many interaction possibilities between users and objects appearing in video. It will also offer large gains in video compression. When the MPEG-4 standard was specified in the late 1990s that was the goal, and this remains the objective behind much current research in video and image processing, an objective which is proving hard to achieve because of the complex nature of object segmentation in the general case. We will now examine work in video object segmentation and then look at how this can be used in object-based retrieval.

3.1. Video object detection and segmentation

Despite much focus and attention, fully automatic object detection, segmentation and tracking across video is not yet achievable for natural video, though detecting objects in images and video is typically used by humans to understand visual media and is thus very important for content access to such media (Duygulu, Barnard, de Freitas, & Forsyth, 2002).
Available object segmentation systems usually require user involvement and are thus semi-automatic. What this means in practice is that such systems are assisted by a human who draws either a rough sketch or an exact boundary contour around an object of interest, and the system will detect and segment this object, separate it from its background, and track it temporally through the remainder of the video shot. Fully automatic detection of objects has had some success (Adamek & O'Connor, 2003), but only in limited shot types such as in-studio news anchorpersons and field sports players against grass backgrounds. It is limited by the availability of training data and of models of objects which work across different shot types.

Object-based feature extraction, which is an important part of the work in this paper, is an enabling technology for the type of video retrieval we are interested in. It is primarily based on edge information computed from the image or from a video frame. An edge is defined as a sharp variation in intensity or colour at some local part of an image which corresponds to the boundaries between objects appearing in the image, such as the boundaries between trees and sky, people and grass, cars and buildings, etc. One of the main reasons why edge is more important than other image features for segmentation is that the colours of an object can vary greatly under different lighting conditions, making object detection a haphazard affair; while colour information is useful for object detection, it is extracted separately. The following requirements for object detection from natural video explain why this is the case:

- Object detection must be invariant to scale, as an object could fill the entire frame or occupy only a small area of the frame;
- Object detection must be invariant to rotation, as a matching object in the image might be rotated along the X-, Y- or Z-axes and we need to take account of this during detection;
- Part of an object may be occluded, either by another object or by the object itself, and so a detection algorithm needs to be able to fill in the missing data, and in many cases there could be more than one possibility;
- Object detection must be able to cater for variations of the same object type which generate very different comparisons, such as a car object for example. If we picture all the different types and colours of cars we can think of, it is easy to see how different object matches would be produced; indeed one approach to object segmentation is based on using templates for shapes of objects as viewed from different angles;
- Detection will have to deal with noise, such as edges from other objects in the image and/or errors in the edge detection process itself;
- Images and frames are 2D representations of a 3D world without depth information. For a symmetrical object like a vase it does not matter what angle we look at it from, as the shape will always be the same, but most objects are not like this;
- Each object for detection requires a number of positive and negative training examples which illustrate true and false occurrences of the object.

In light of the difficulties outlined above it is not surprising that detection of all possible object types is generally considered a computationally infeasible problem.
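Before turning to the segmentation approaches themselves, the edge information referred to above can be sketched as a simple gradient-magnitude computation; the Sobel filter and the threshold value are illustrative choices, not those of any particular system discussed here.

```python
import numpy as np
from scipy import ndimage

def edge_magnitude(grey_image):
    """Per-pixel gradient magnitude; strong values mark sharp intensity
    variations, i.e. candidate object boundaries."""
    gx = ndimage.sobel(grey_image.astype(float), axis=1)
    gy = ndimage.sobel(grey_image.astype(float), axis=0)
    return np.hypot(gx, gy)

def edge_map(grey_image, threshold=60.0):
    """Binary edge map obtained by thresholding the gradient magnitude.
    The threshold is an illustrative value, not one used in the paper."""
    return edge_magnitude(grey_image) > threshold
```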
Broadly speaking, there are three classes of approach to object segmentation. The first approach decomposes an input image into a tree or graph of constituent sub-regions and then performs graph and sub-graph matching to detect objects, as demonstrated in (Tu, Saxena, & Hartley, 1999). In this approach we detect a bicycle, for example, by detecting wheels and a frame as sub-components which are matched against a graph for a typical bicycle. The second approach is based around computing some kind of correspondence measure between regions of an image and models of the objects being detected. A number of templates of the object to be detected are constructed in advance and possible matches against image regions are extracted from the input image. Shape matching is often used during the comparison/detection to compare each image against the available templates (Adamek & O'Connor, 2003; Doufour & Miller, 2002). The third approach is the crudest and tries to identify objects by their active movement. The idea behind this approach is that foreground objects will move from frame to frame to a greater degree than their backgrounds, which will exhibit less movement.

In this paper we take the second approach to object segmentation, based on template matching, because we carry out our experiments on animated video content, and this approach performs better on such animated content compared to natural video for the following reasons:
(1) In animated content, the shape and colour of the characters, objects and backgrounds do not tend to differ greatly, whereas in natural video animate objects such as people can change constantly and colours can fluctuate under variable lighting conditions.
(2) In animated content, the characters usually have a limited number of detectable and repeatable facial expressions, whereas in natural video an individual's expression is more varied.
(3) In animated content the boundaries for each of the objects are very clear and detectable, because a reduced set of colours compared to natural video is normally used, whereas the edges in natural video are more broken and less easily separated. Lighting is also not an issue for animated content, making object boundary detection easier.

To illustrate these differences, Figs. 1 and 2 show a set of edges automatically detected from an animated image and from a natural scene which have more or less the same composition and arrangement of people, yet the edges from the animated image are much more pronounced and clear.

Fig. 1. Edges from an animated cartoon image.

Fig. 2. Edges from a natural video image.

If we were to take the first approach to segmenting objects then we would have a pre-defined template for "head and shoulders" recognition and we would search for occurrences of eyes, mouths, head and shoulder silhouettes, etc., which we would map onto the pre-defined template. In the second approach we would have a template for an entire head-and-shoulders view and we would search for occurrences of that template, and all other templates, in the image or video. In the third approach we would detect any movement of the characters in the sequence of frames and use this as a basis for object segmentation.

One of the disadvantages of using animated video content and template matching for object detection is that the set of objects which are to be segmented, and then used in user querying, must be pre-defined. This reduces user querying using objects to just that set of pre-defined objects; in a more realistic object-based retrieval scenario a user would need to identify, or perhaps segment, objects at query time.

3.2. Video object retrieval: related work

Fig. 3. MPEG table tennis sequence (taken from O'Connor et al. (2003)).

Despite the technical difficulties with object segmentation there are some examples of work done which support object-based retrieval from video. Some of these are briefly summarised here. The QIMERA system is a collaborative video segmentation platform that incorporates natural object detection using a number of colour and texture features to identify object boundaries. QIMERA also
demonstrates user-assisted object segmentation on a number of standard MPEG test sequences such as that in Fig. 3 (Adamek & O’Connor, 2003). In (Hohl, Souvannavong, Merialdo, & Huet, 2004) the work presented segments images into regions, visually, and refers to a set of homogeneous regions as an object, though these are not objects in the semantic sense. An ‘‘object’’ retrieval system based on this approach is evaluated on a short cartoon rather than on natural content, with a ground truth of object occurrences and in (Hohl et al., 2004) they demonstrate the accuracy of an approach to locating objects in video based on a user query specified as a set of regions. Similarly in (Erol & Kossentini, 2005) there is another proposal for locating arbitrary-shaped objects, this time based on shapes and shape deformations over time, with another set of evaluation figures on measuring the accuracy of locating query objects in video sequences as measured against humans locating the same objects. Sivic and his colleagues at the University of Oxford take an approach whereby the user also does the object segmentation in the query and this is then matched and highlighted against similar objects appearing in shots. The approach uses contiguous frames within a shot to improve the estimation of objects and addresses changes in viewpoint, illumination and object occlusion. The approach is illustrated working on the movie Groundhog Day in (Sivic, Shaffalitzky, & Zisserman, 2004) and a more detailed presentation of the image processing can be found in (Sivic & Zisserman, 2003) on the movie Run Lola Run where again, the task is to locate a user-specified visual object. Work reported in (Liu & Ahuja, 2004) addresses object segmentation and retrieval based on a complex approach to motion representation and concentrates on the object tracking without actually segmenting the semantic object (whether a person, a bird or a car). The paper reports some preliminary experiments where similar objects which have a similar trajectory to the query clip and appear in similar video compositions, can be located, though a thorough evaluation is needed. Similar work is reported in (Smith & Khotanzad, 2004) where they automatically segment video frames into regions based on colour and texture, and then track the largest of these ‘‘objects’’ through a video sequence, though they are more like blobs than objects. A user query is not a still object but an object appearing in a query video clip. The work reported in (Smith & Khotanzad, 2004) will search using a query video clip to find video sequences similar in terms of object motion, as well as edges, texture, and colours and this has been tested on a corpus of natural video. In this paper we are more interested in information retrieval from video where the user is not quite sure what their query object might look like, though video object detection as described here, will be part of such a system. Finally, while most of the work mentioned above is quite recent and suggests that object-based video retrieval is a new development, this is not quite true, with work on video-object retrieval being reported more than 10 years ago (e.g. Oomoto & Tanaka, 1993). Clearly the notion of using video objects for retrieval has been desirable for some time, but only very recently has technology started to allow even very basic object-location functions on video. 
Our interest in video object retrieval is partly in the challenge of the technique itself, but more so in the way in which it can be used by users in searching for video information, and in trying to measure the actual advantages it offers in searching. In the next section we shall introduce the system for video retrieval we have built, including details of how it segments objects to support object-based retrieval.

4. The Físchlár–Simpsons system

The television show "The Simpsons" has been a phenomenon since it started broadcasting in 1990. With nearly 360 episodes completed in 16 seasons, the show is hugely popular.
The 360 episodes of the Simpsons add up to 144 hours, or 6 full days, of content. The basic premise of the show revolves around the exploits of a dysfunctional American family, the Simpsons, and their hometown of Springfield. The Simpsons family members are Homer, Marge, Bart, Lisa, Maggie, Grandpa Abe Simpson, Patty and Selma. Over the years, the show has evolved both in terms of the visual quality of the animation and the storylines. The first two years of the show focussed on Bart as the main character but in the second year this focus shifted to Homer, who became (and remains) the most popular character. After the first season, Homer became more stupid and continues to get worse as seasons progress. After the first season Lisa's character changed too, becoming more intelligent.

We have built a complete video indexing, searching and browsing system for a corpus of video of the Simpsons. For our corpus we used 52 episodes (20.8 hours) taken from seasons 2 and 3, and from themed DVDs covering seasons 4–12. These were taken from DVDs and transcoded into MPEG-1 for image analysis, storage and playback. Extraction of the closed captions presented a challenge as the DVD subtitle text was stored as a series of images and therefore optical character recognition (OCR) was required to convert the image text into machine-readable text. The video system had several general requirements, including the ability to support searching based on a range of search modalities and the ability to support iterative relevance feedback.

In preparing our video content for indexing and search, we first applied a shot boundary detection algorithm to this corpus (Browne & Smeaton, 2004) to determine the 20,529 shots regarded as units of retrieval. For each shot we identified a single keyframe as the frame in the middle of the shot and we extracted this as a still image. In terms of user searching, we support text search through the closed captions manually extracted from the DVD where they are available as tagged text files (Browne & Smeaton, 2004). Each word in the extracted text was associated with a shot based on the timing information. Stemming and stopword removal were then applied to the extracted text, and the remaining terms were indexed using a standard TF*IDF method. Text comparison between a given text query and the text used to index each shot was based on our implementation of the vector space retrieval model (Ferguson, Gurrin, Wilkins, & Smeaton, 2005), and the resulting comparison scores were ranked in descending order to generate a shot ranking from a text query.

In addition to supporting text search, we also support keyframe-based image matching, whereby a keyframe can be used as a query to find similar keyframes based on visual appearance. In our system we used the following four low-level visual features to index each shot keyframe and facilitate visual searching, similar to how we perform keyframe matching in our participation in TRECVid (Cooke et al., 2004):

- 4 region * 18-bin hue histogram,
- 9 region * average colour,
- 9 region * median colour,
- 4 region * 16-bin edge histogram.
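To make the text side of this indexing concrete, the sketch below builds a TF*IDF index over per-shot closed-caption text and ranks shots by cosine similarity against a query; the naive tokeniser and stopword list are simplified stand-ins for the stemming and stopping actually applied.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # illustrative only

def tokenize(text):
    return [t for t in text.lower().split() if t.isalpha() and t not in STOPWORDS]

def build_index(shot_texts):
    """shot_texts: shot_id -> closed-caption text aligned to that shot."""
    tfs = {sid: Counter(tokenize(txt)) for sid, txt in shot_texts.items()}
    df = Counter(term for tf in tfs.values() for term in tf)  # document frequency
    n = len(shot_texts)
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = {sid: {t: freq * idf[t] for t, freq in tf.items()} for sid, tf in tfs.items()}
    return vectors, idf

def rank_shots(query, vectors, idf):
    """Rank shot ids by cosine similarity between query and shot TF*IDF vectors."""
    qvec = {t: f * idf.get(t, 0.0) for t, f in Counter(tokenize(query)).items()}
    def cosine(v):
        num = sum(w * v.get(t, 0.0) for t, w in qvec.items())
        den = (math.sqrt(sum(w * w for w in qvec.values()))
               * math.sqrt(sum(w * w for w in v.values())))
        return num / den if den else 0.0
    return sorted(vectors, key=lambda sid: cosine(vectors[sid]), reverse=True)
```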

As the key interest in the work reported here is to explore how useful object-based retrieval can be as part of video shot retrieval, we decided to automatically detect the presence of the major characters on-screen as video objects. Our approach was to have a number of shape templates for each character, manually selected from representative images, to extract all yellow coloured objects from each keyframe and then to match these yellow objects against the character templates. This procedure for object detection is straightforward because all Simpsons characters are always yellow and have a small number of facial poses each. Each of our character templates is compared against all yellow objects in each keyframe using a shape matching algorithm which generates a number of comparison scores, and scores above a pre-defined threshold are regarded as a positive match. Colour plays an important part in the shape matching process, as positive matches require yellow areas to match the templates. All other colours are ignored and this reduces false detection considerably (Adamek & O'Connor, 2003; Browne, Smeaton, O'Connor, Marlow, & Berrut, 2000).

Fig. 4 is a screenshot taken from the template-based shape matching application we used (Adamek & O'Connor, 2003). At the bottom of the screen we can see 5 members of the Simpsons family and their representative templates. At the top left of the screen we can see the keyframe image that is being compared, while the top right shows positive template matches for Homer, Lisa and Marge.

Fig. 4. Screenshot of template matching application.

The choice of which Simpsons characters to detect automatically was based primarily on their popularity and on the availability of representative examples for template matching. The template matching technique needs multiple training examples to detect each character, and the overall accuracy of the detection needs to be high in order for subsequent searching to be reliable. To measure the accuracy of the object detection we used a development corpus of 12 episodes (4.8 hours, 6525 shots) and manually tagged each shot which had any of our target characters (objects). Table 1 gives a list of the 10 characters detected automatically using this approach, their occurrence in the development corpus, and the accuracy of detection measured against a manually tagged ground truth from the development corpus using precision and recall. The results of our object detection accuracy test on the development corpus show that precision is acceptably high for almost all characters, with an average overall value of 0.947 (weighted by the number of occurrences per character/object), but recall is low at 0.421, indicating that we are missing many character detections. These results (high precision, low recall) are similar to those observed in other work on video object detection (e.g. Erol & Kossentini, 2005; Hohl et al., 2004), and are mostly due to the character not being present, or being occluded, in the keyframe while being present in some other part of the shot.

Table 1
Objects detected automatically in the Simpsons

Character            Occurrences    Recall    Precision
Homer                2066 (32%)     0.36      0.98
Marge                1343 (21%)     0.32      0.87
Bart                 1361 (21%)     0.49      0.98
Lisa                 647 (10%)      0.57      0.93
Maggie               321 (5%)       0.31      0.88
Mr. Burns            269 (4%)       0.47      1.0
Mr. Smithers         175 (2.6%)     0.46      0.98
Abe Simpson          83 (1.3%)      0.27      0.91
Principal Skinner    246 (4%)       0.8       1.0
Moe                  43 (0.7%)      0.61      0.78
Weighted accuracy                   0.421     0.947
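The detection procedure described above can be sketched roughly as follows. The yellow thresholds, the overlap-based shape score and the cut-off value are illustrative stand-ins for the contour-based shape matching of Adamek and O'Connor (2003) that was actually used, and for brevity the sketch matches templates against a whole-frame yellow mask rather than against individually segmented yellow objects.

```python
import numpy as np

def yellow_mask(rgb):
    """Crude 'Simpsons yellow' test: strong red and green, weak blue.
    The thresholds are illustrative, not values used in the system."""
    r, g, b = (rgb[..., i].astype(float) for i in range(3))
    return (r > 180) & (g > 140) & (b < 120)

def shape_score(region, template):
    """Dice overlap between two binary masks of identical size, a stand-in
    for the contour-based shape matching used in the real system."""
    inter = np.logical_and(region, template).sum()
    return 2.0 * inter / (region.sum() + template.sum() + 1e-9)

def detect_characters(keyframe_rgb, templates, threshold=0.7):
    """templates: character name -> list of binary template masks, assumed to be
    registered to the keyframe size.  Returns the characters whose best
    template score exceeds the threshold."""
    mask = yellow_mask(keyframe_rgb)
    hits = {}
    for name, shapes in templates.items():
        best = max(shape_score(mask, t) for t in shapes)
        if best >= threshold:
            hits[name] = best
    return hits
```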

Fig. 5. Browse shot context screen from Físchlár–Simpsons.

The Físchlár–Simpsons system we built allows a user to search for video shots using three different modalities, namely text matching, similarity between a query image and a shot keyframe, and the presence or absence of combinations of video objects, the 10 major characters in the Simpsons in our case. The three different modalities, or any combination of them, can also be combined into one search, as most often happens in practice once a search has commenced. At each iteration of a search, whatever input the user has provided for the query, whether text, positive or negative example frames, or positive or negative example objects, separate shot rankings are generated for each component (text, image and object), and these three separate rankings are combined using late fusion (Mc Donald & Smeaton, 2005).

Fig. 5 shows the shot context screen from the system, in which the 5 shots preceding and following a given ranked shot are presented with the ranked shot in the centre of the screen. For this given ranked shot, any video objects detected are presented with a white border and can be added to the query as positive or negative feedback by clicking with either mouse button. The Físchlár–Simpsons system is described in more detail in (Browne & Smeaton, 2005).
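A minimal sketch of this kind of late fusion is given below; the min-max normalisation and the equal default weights are illustrative choices, not the fusion scheme of Mc Donald and Smeaton (2005).

```python
def normalise(scores):
    """Min-max normalise a shot_id -> score mapping into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {sid: (s - lo) / span for sid, s in scores.items()}

def late_fuse(text_scores, image_scores, object_scores, weights=(1.0, 1.0, 1.0)):
    """Combine per-modality shot scores into a single ranking.
    Shots missing from a modality simply contribute nothing for it."""
    fused = {}
    for w, scores in zip(weights, map(normalise, (text_scores, image_scores, object_scores))):
        for sid, s in scores.items():
            fused[sid] = fused.get(sid, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)
```

Normalising each modality's scores before combining them matters because the text, image and object matchers produce scores on very different scales.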

5. Results of usage analysis

In order to assess the importance and usefulness of video objects as part of a user's query strategy, we conducted a series of user experiments with our Físchlár–Simpsons video retrieval system. The user interacts with the system by starting a search, which can use either text, a keyframe, or an object from a keyframe as the query, and this generates a ranked list of shots. The user can then browse a shot's context (surrounding shots) and add other shot keyframes as positive or negative feedback, in which case an image similarity measure is used to re-rank unseen shots, or add a segmented object (a character from the Simpsons) as positive or negative feedback, in which case a re-ranking is also generated. In this way a user's query can be composed of a combination of text, keyframe(s) and/or object(s). Our interest here is in examining the usage of different types of search features, namely text, image and object searching.

To evaluate our system, 15 users, not familiar with the system but knowledgeable about the Simpsons, each ran the same 12 topic searches. The users were a mixture of undergraduate and graduate students of computing and were thus experienced computer users, and all were either regular viewers of the Simpsons or had watched the programme on several occasions. We formulated the 12 topics shown in Appendix A, which could be narrow (N), general (G), or broad (B), defined according to the number of relevant shots to be found in the evaluation corpus of 20,529 shots. Appendix A also indicates whether text searching (T), low-level image keyframe matching (L), object-object matching (O) or simple keyframe context browsing (X) are likely to be useful for each of the topic types. The nature of these topics broadly reflects the nature of topics created in each year of the TRECVid experiments in terms of the mix of topic types (broad, general, etc.). For each topic we manually determined relevant shots from within the corpus in order to establish the ground truth for retrieval. While this was time-consuming, given our familiarity with the content we believe we have ensured a high degree of completeness in these relevance judgments.

Our 15 users were each shown a demonstration of the system and then went through a training period by searching for 3 topics under the supervision of the system developer. Each user then searched for each of the 12 topics in turn, with 7 minutes allocated per topic. Users were not grouped together for either the demonstration or search sessions and performed these solo. Users were allowed to use any search features (i.e. text-, image- or object-based querying), individually or in combination, that they considered suitable at any point of their iterative search process. Topic ordering was rotated among users using a Latin squares design, following the guidelines for interactive search in TREC and in TRECVid. The rationale for a time-constrained search was to simulate the case where an end-user is required to find some shots, but not necessarily all shots, which satisfy some search constraint, and the answer set is required as soon as possible. In the TRECVid interactive search task the time limit is set to 15 minutes, but we felt that with a corpus of just under half the size of a TRECVid search corpus we should also reduce the allocated search time. All searches by all users for all topics used up the full 7 minutes.

With a maximum of 7 minutes per topic it is clear that for some topics, such as topic 11 with 631 relevant shots, not all of these could possibly be found, whereas for others, such as topic 6 with only 22 relevant shots, locating all shots is possible. In the case of the broad type of topic the aim is to find as many of the 631 shots as a user can (since all relevant shots are treated equally) and so high recall is not possible, while for topics like topic 6 a combination of high precision and high recall is a realistic target. That means an overall averaged performance of both high precision and high recall is not attainable. In Fig. 6 we see the precision for each topic, i.e. the proportion of relevant shots in the set retrieved for each topic. Average recall across the topics is approximately 28%, reflecting the fact that there are very many relevant shots and only a limited time to complete searches.

However, in this paper we are primarily interested in what search features were used, and especially how object-based retrieval was used in the 180 searches (15 users, 12 searches each) which were logged. Table 2 summarises those results. As we can see, text search was the most common search type, with keyframe matching following close by.
Object-based searches were used in 54% of searches, and there was good use of objects as both positive and negative relevance feedback judgments, whereas keyframe image similarity was used mostly for positive rather than negative feedback.
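The counts reported in Tables 2 and 3 can be derived from the interaction logs with a simple tally of the kind sketched below; the (user, topic, iteration, search_type) log format is a hypothetical stand-in for the actual log files.

```python
from collections import Counter

def tally_by_iteration(log_events):
    """log_events: iterable of (user, topic, iteration, search_type) tuples,
    where search_type is 'text', 'image' or 'object'.  Returns a
    (iteration, search_type) -> count mapping, as in Table 3."""
    return Counter((iteration, search_type)
                   for _user, _topic, iteration, search_type in log_events)

def totals_by_type(log_events):
    """Overall usage per search type, as summarised in Table 2."""
    return Counter(search_type for *_rest, search_type in log_events)
```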

Fig. 6. System precision for each topic.

Table 2
Distribution of search types

Number of topic searches                                   180
Number of topic searches using text search                 175 (95%)
Number of individual text searches                         640
Number of topic searches using keyframe similarity         161 (89%)
# Positive relevance feedback using a keyframe (image)     602
# Negative relevance feedback using a keyframe (image)     82
Number of topic searches using object search               97 (54%)
# Positive relevance feedback using an object              211
# Negative relevance feedback using an object              155

Table 3
Search types by iteration

Iteration   Object search   Text search   Keyframe matching   Total
1           10              243           1                   274
2           58              75            83                  235
3           61              64            75                  202
4           47              56            70                  177
5           39              46            73                  159
6           36              35            65                  141
7           28              25            61                  118
8           18              18            56                  97
9           18              22            41                  81
10          11              13            40                  67
11          14              13            26                  55
12          7               6             30                  43
13          6               7             14                  28
14          3               6             12                  21
15          2               6             9                   17
16          4               0             8                   12
17          2               2             6                   10
18          1               2             6                   9
19          0               0             5                   5
20          1               0             2                   3
21          0               1             0                   1
22          0               0             1                   1

Fig. 7. Distribution of search types per search iteration.

The distribution of object searching across the 12 topics indicates that it was naturally highly concentrated among topics in which named characters were also among the detected objects (topics 3, 4, 5, 6 and 12). Table 3 shows the total number of search types at each iteration over the 180 searches; this is shown graphically in Fig. 7. Each time a user marks an object as a relevant or non-relevant object, or adds positive or negative feedback for a shot keyframe, or changes the content of their text search by adding or deleting terms, their query expands and the iteration level increments. When the user restarts with a new search the iteration level is reset to zero. Looking at the search types we can see that users try a number of different query searches during the 7 minutes, as the total number of searches (1690) is far greater than the total number of topics completed by users (180). The maximum number of iterations completed by a user is 22, and the most popular search strategy employed by users is to start with a text search followed by a keyframe match in the subsequent iteration, while the second most popular is to start with a text and then an object search. This can be explained by the fact that users like to start doing visual search as soon as they can, rather than have an initial text search followed by another text search, for example. Eighty percent of feedback judgments are in the first 8 iterations while 40% of feedback is in the first 3 judgments.

6. Conclusions

In this paper we have combined text-, image- and object-based searching into an iterative video shot retrieval system, and evaluated its interactive use with users in a controlled experiment. This was done in an attempt to gauge how useful object-based search is in comparison with keyframe matching and text searching. Fifteen users each completed 12 different searches, each in a controlled and measured environment with a maximum 7-minute time limit to complete each search. The study does have limitations in terms of the relatively small size of the sample and the artificiality of the search tasks, but it does reveal some interesting patterns in how object-based video searching can be used.

The findings from the log data analysis show that text-based search is the most popular method for initial querying and was used in 97% of query topics. This confirms our expectation, and we have found that the retrieval performance of text quickly gets people to relevant areas of the content, at which point keyframe (image) retrieval can be used. Object-based searching was performed in over 50% of queries, and this shows that when image objects are available and appropriate for the query topic, users will use them. This result goes some way towards validating the approach of allowing users to select objects as a basis for searching video archives when the search dictates it as appropriate, though the technology to do this on natural, as opposed to animated, video is still under development for scalable applications. Our approach of using animated cartoons to enable object interaction allowed users in our work to manipulate video objects as part of their searching and browsing interaction.

Acknowledgement

The authors wish to acknowledge the support of the Informatics Directorate of Enterprise Ireland. This work was supported by Science Foundation Ireland under Grant No. 03/IN.3/I361.
Appendix A. Evaluation query topics

The table below illustrates the query topics and their nature, where topics are narrow (N), general (G), or broad (B) and the potentially useful search modes are text searching (T), low-level image keyframe matching (L), object-object matching (O) or simple keyframe context browsing (X).

Topic ID   Relevant shots   Topic type   Topic description
1          149              G(TL)        Find shots of any Simpsons character smoking. Shots where the character is holding a cigarette are also relevant
2          153              G(TX)        Find shots of people hurt and in pain. The sound must be audible
3          43               N(OLT)       Find shots of Bart, Lisa and Maggie together in the sitting room WITHOUT Homer. Shots can contain other characters but must contain Bart, Lisa and Maggie and NOT Homer
4          221              G(OLT)       Find shots of Homer working at the nuclear power plant. Valid results can show him at his desk or 'working' around the plant. Shots in the plant canteen are also valid
5          225              G(OX)        Find shots of Bart sad, upset, or worried; emotion should be evident on his face and/or in his voice
6          22               N(O)         Find shots that show Bart and a teacher. Valid results will show JUST Bart and a teacher, no other character should be visible
7          45               N(TX)        Find shots showing an explosion; sound must be audible
8          349              B(TLX)       Find shots that show Homer's neighbour Flanders
9          62               N(TLX)       Find shots of the 'new' owners of the Springfield nuclear power plant when Mr. Burns decided to sell. Valid results are any of the new owners
10         155              G(TLX)       Find shots of the Simpsons character that proposes to Marge's sister Selma
11         631              B(TL)        Find shots of video that feature any sporting event. A valid result can show a sporting event taking place and/or being watched by any of the Simpsons characters
12         99               N(OL)        Find shots of Homer and Marge in the kitchen without Bart, Lisa and Maggie
References Adamek, T., & O’Connor, N. (2003). Efficient contour-based shape representation and matching. In Proceedings of the 5th ACM SIGMM international workshop on multimedia information retrieval. Berkeley, CA, 7 November 2003. Browne, P., & Smeaton, A. F. (2004). Video information retrieval using objects and Ostensive relevance feedback. In Proceedings of SAC 2004-ACM symposium on applied computing. Nicosia, Cyprus, 14–17 March 2004. Browne, P., & Smeaton, A. F. (2005). Video retrieval using dialogue, keyframe similarity and video objects. In Proceedings of the international conference on image processing, ICIP2005. Genoa, Italy, September 2005. Browne, P., Smeaton, A. F., Murphy N., O’Connor, N., Marlow, S., & Berrut, C. (2000). Evaluating and combining digital video shot boundary detection algorithms. In Proceedings of IMVIP 2000—Irish machine vision and image processing conference. Belfast, Northern Ireland, 31 August–2 September 2000. Cooke, E. et al. (2004). TRECVID 2004 experiments in Dublin City University. In Proceedings of TRECVID 2004—text retrieval conference TRECVID. Gaithersburg, Maryland, 15–16 November 2004. Doufour, R. M., Miller, E. L., & Galatsanos, N. P. (2002). Template matching based object recognition with unknown geometric parameters. IEEE Transactions on Image Processing, 11(12), 1385–1396. Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. (2002). Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In Proceedings of the European conference on computer vision (ECCV). Copenhagen, 2002. Ebrahimi, T., & Horne, C. (2000). MPEG-4 natural video coding—an overview. Signal Processing—Image Communication, 15(4–5), 365–385. Erol, B., & Kossentini, F. (2005). Shape-based retrieval of video objects. IEEE Transactions on Multimedia, 7(1), 179–182. Ferguson, P., Gurrin, C., Wilkins, P., & Smeaton, A. F. (2005). Fı´sre´al: a low cost terabyte search engine. In Proceedings of 27th European conference on information retrieval (ECIR2005). Santiago de Compostela, Spain, 21–23 March 2005. Hohl, L., Souvannavong, F., Merialdo, B., & Huet, B. (2004). Enhancing latent semantic analysis video object retrieval with structural information. In Proceedings of the International Conference on Image Processing, ICIP2004 (Vol. 3, pp. 1609–1612). Liu, C.-B., & Ahuja, N. (2004). Motion based retrieval of dynamic objects in videos. In Proceedings of the 12th annual ACM international conference on multimedia (pp. 288–291). New York, NY, USA. Mc Donald, K., & Smeaton, A. F. (2005). A comparison of score, rank and probability-based fusion methods for video shot retrieval. In Proceedings of the international conference on image and video retrieval (CIVR2005), 20–22 July 2005. LNCS (3568). Singapore: Springer. O’Connor, N., Sav, S., Adamek, T., Mezaris, V., Kompatsiaris, I., Lui, T. Y., et al. (2003). Region and object segmentation algorithms in the QIMERA segmentation platform. In Proceedings of CBMI 2003—international workshop on content-based multimedia indexing. Rennes, France, 22–24 September 2003.

Oomoto, E., & Tanaka, K. (1993). OVID: design and implementation of a video-object database system. IEEE Transactions on Knowledge and Data Engineering, 5(4), 629–643. Rautiainen, M., Ojala, T., & Sepp, T. (2004). Analysing the performance of visual, concept and text features in content-based video retrieval. In MIR ’04: Proceedings of the 6th ACM SIGMM international workshop on multimedia information retrieval (pp. 197–204). New York, NY, USA, October 2004. Sato, T., Kanade, T., Hughes, E., & Smith, M. (1998). Video OCR for digital news archive. In Proceedings of the workshop on contentbased access of image and video databases (pp. 52–60). Los Alamitos, CA, January 1998. Sivic, J., Shaffalitzky, F., & Zisserman, A. (2004). Efficient object retrieval from videos. In Proceedings of the 12th European signal processing conference (EUSIPCO’04). Vienna, Austria, September 2004. Sivic, J., & Zisserman, A. (2003). Video google: a text retrieval approach to object matching in videos. In Proceedings of the 9th IEEE international conference on computer vision (ICCV 2003). Nice, France. Smeaton, A. F. (2004). Indexing, browsing and searching of digital video. ARIST—Annual Review of Information Science and Technology, 38, 371–407 (chapter 8). Smeaton, A. F., Over, P., & Kraaij, W. (2004). TRECVid: evaluating the effectiveness of information retrieval tasks on digital video. In: MULTIMEDIA ’04: Proceedings of the 12th annual ACM international conference on multimedia (pp. 652–655). New York, NY, USA, October 2004. Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transaction on Pattern Analysis and Machine Intelligence, 22(12), 1349–1380. Smith, M., & Khotanzad, A. (2004). An object-based approach for digital video retrieval. In Proceedings of the international conference on information technology: coding and computing (ITCC 2004), (Vol. 1, pp. 456–459). Las Vegas, NV, USA, April 2004. Snoek, C., Worring, M., van Gemert, J., Geusebroek, J. -M., Koelma, D., Nguyen, G. P., et al. (2005). MediaMill: exploring news video archives based on learned semantics. In MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on multimedia (pp. 225–226) Singapore, November 2005. Tu, P., Saxena, T., & Hartley, R. (1999). Recognising objects using colour-annotated adjacency graphs. I: Shape, contour and grouping in computer vision. Lecture notes in computer science (Vol. 1681). Springer-Verlag, pp. 249–263. Yan, R., Yang, J., & Hauptmann A. G. (2004). Learning query-class dependent weights in automatic video retrieval. In Proceedings of the 12th annual ACM international conference on multimedia (pp. 548–555). New York, NY, USA.
