An integrated system for content-based video retrieval and browsing


Pattern Recognition, Vol. 30, No. 4, pp. 643-658, 1997
© 1997 Pattern Recognition Society. Published by Elsevier Science Ltd. Printed in Great Britain. All rights reserved.
0031-3203/97 $17.00+.00
PII: S0031-3203(96)00109-4

AN INTEGRATED SYSTEM FOR CONTENT-BASED VIDEO RETRIEVAL AND BROWSING

HONG JIANG ZHANG,†* JIANHUA WU, DI ZHONG and STEPHEN W. SMOLIAR

Institute of Systems Science, National University of Singapore, Heng Mui Keng Terrace, Kent Ridge, Singapore 0511, Singapore

* Author to whom correspondence should be sent.
† Current address: HP Labs, Page Mill Road, Palo Alto, CA 94304, U.S.A.

(Received 12 June 1996; received for publication 30 July 1996)

Abstract--This paper presents an integrated system solution for computer-assisted video parsing and content-based video retrieval and browsing. The effectiveness of this solution lies in its use of video content information derived from a parsing process driven by visual feature analysis. That is, parsing temporally segments and abstracts a video source based on low-level image analyses; retrieval and browsing of video are then based on key-frame, temporal and motion features of shots. These processes, and a set of tools that facilitate content-based video retrieval and browsing using the resulting feature data set, are presented in detail as functions of an integrated system. © 1997 Pattern Recognition Society. Published by Elsevier Science Ltd.

Video parsing     Video retrieval     Video browsing     Multimedia     Database

1. INTRODUCTION

With rapid advances in communication and multimedia computing technologies, accessing mass amounts of multimedia data is becoming a reality. However, interaction with multimedia data, video in particular, is in general still not easy. For example, selection of a video clip in conventional video-on-demand (VOD) systems rarely involves anything better than keyword search or category browsing, and any manipulation of the video itself is limited to the lowest level of VCR control. The problem is that, from the point of view of content, the resources managed by such systems are unstructured, and therefore can be neither indexed nor accessed on the basis of structural properties. Fundamentally, apart from other requirements, what we need are video parsing tools to extract structure and content information from video. Only after such information becomes available can content-based retrieval and manipulation of video data be facilitated.

We see parsing of video data as encompassing two tasks: temporal segmentation of a video program into elemental units, and content extraction from those units, based on both video and audio semantic primitives.(1) The temporal segmentation process is analogous to sentence segmentation and paragraphing in parsing textual documents, and many effective algorithms are now available for this task.(2-7) However, fully automated content extraction is a much more difficult task, requiring both signal analysis and knowledge representation techniques, so human assistance is still needed. We thus feel the most fruitful research approach is to concentrate on facilitating tools, using low-level visual features. Such


tools are clearly feasible, and research in this direction should ultimately lead to an intelligent video parsing system.(8-10)

Retrieval and browsing require that the source material first be effectively indexed. While most previous research in indexing has been text-based,(11,12) content-based indexing of video with visual features is still an open research problem. Visual features can be divided into two levels: low-level image features, and semantic features based on objects and events. The semantic level includes the name, appearance, motion, and temporal variation of characteristics of the constituent objects, as well as the relationships among different objects at different times and the contributions of all these attributes and relationships to the story being presented in a video sequence.(11,12) To date, automation of low-level feature indexing has been far more successful than that of semantic indexing in image databases,(13-15) especially in some special applications such as databases of human faces.(15) Perhaps the biggest problem with indexing video using the low-level image features of every frame is the enormous data volume involved; uniform subsampling(16) may reduce some of that data, but it risks losing important frames. We believe that a viable solution is to index representative key-frames(17) extracted from the video sources. The problem then becomes how to obtain the key-frames from video sources automatically and in a content-based manner, which is one of the key contributions of the work presented in this paper.

While we tend to think of indexing as supporting retrieval, browsing is equally significant for video source material. By browsing we mean an informal perusal of content which may lack any specific goal or focus. The task of browsing is actually intimately related to retrieval. On the one hand, if a query is too general, browsing is the best way to examine the results. This


should provide some indication of why the query was poorly expressed, so browsing also serves as an aid to formulating queries, making it easier for the user to "ask around" while figuring out the most appropriate query to pose. Unfortunately, the only major technological precedent for video browsing is the VCR (even when available in soft form for computer viewing), with its support for sequential fast forward and reverse play. Browsing a video this way is a matter of skipping frames: the faster the play speed, the larger the skip factor. The content-based browser of reference (16) takes this approach, but a uniform skip factor does not really account for video content. Furthermore, there is always the danger that some skipped frames may contain the content of greatest interest. Finally, there is the problem that high-speed viewing requires increased attention, because of the higher data rate, while we prefer to associate browsing with a certain relaxation of attention. A truly content-based approach to video browsing should be based on some level of analysis of the actual image content, rather than simply providing a more sophisticated view of temporal context. However, only a few published research efforts(9,10,18) have discussed how parsing results may be applied to support more powerful browsing tools.

This paper presents our contributions, which have resulted in an integrated system solution to parsing, retrieval, and browsing based on three levels of automated content analysis. What makes our solution effective are its novel ideas of using key-frames and motion features to represent shot content and similarity, and its algorithms for automatically extracting key-frames during the video parsing process. Based on such a representation, we have developed a set of tools that enable content-based video retrieval and browsing. As shown in Fig. 1, there are three processes in our solution that capture different levels of content information. The first is temporal segmentation to identify shot

boundaries. At the second level, each segment is abstracted into key-frames, based on the novel idea of applying a simple analysis of content variation within video shots, which yields results far more useful than subsampling at a fixed rate. Finally, visual features, such as color and texture, are used to represent the content of key-frames and to measure shot similarity. In addition, variations among frames from the same shot are calculated and integrated with information about camera operation and object motion to provide event-based cues. Indexing is then supported by a clustering process that classifies key-frames into different visual categories; this categorization may also support manual user annotation. These results constitute the meta-data set of the video, which facilitates retrieval and browsing in a variety of ways. Retrieval may be based not only on the annotated index but also on these meta-data; and key-frames enable browsing with a fast forward/backward player, a hierarchical time-space viewer, and cluster-based clip windows.

This paper is organized as follows: Section 2 presents our approaches to video parsing, including a brief summary of temporal segmentation and a detailed discussion of our contribution to automatic and content-based key-frame extraction. Section 3 reviews the visual features of key-frames and the motion-related features used in shot content representation, and presents our novel shot-based similarity measures. The resulting approach to retrieval and browsing is then presented in detail in Section 4. Finally, Section 5 offers concluding remarks and a brief view of our current and future work.
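Before turning to the individual processes, it may help to picture the meta-data set of Fig. 1 as a small schema. The sketch below is purely illustrative: the field names and Python layout are our own and are not taken from the paper's implementation.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class KeyFrame:
    frame_index: int                  # position of the key-frame in the source video
    histogram: List[float]            # 64-bin color histogram (Section 3.1)
    dominant_colors: List[int]        # indices of the most populated histogram bins
    color_moments: List[float]        # mean, variance, skewness per color component

@dataclass
class Shot:
    first_frame: int
    last_frame: int
    key_frames: List[KeyFrame] = field(default_factory=list)
    camera_motion: str = "static"     # "static", "panning-like" or "zooming-like"
    motion_stats: List[float] = field(default_factory=list)  # directional speed statistics (Section 3.2)

@dataclass
class VideoMetaData:
    shots: List[Shot] = field(default_factory=list)
    clusters: Dict[int, List[int]] = field(default_factory=dict)  # class label -> shot indices
    annotations: Dict[int, str] = field(default_factory=dict)     # optional manual annotation per shot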

2. VIDEO PARSING: SEGMENTATION AND ABSTRACTION

2.1. Temporal segmentation and key-frame extraction algorithms

Fig. 1. System processing overview and data flow for an integrated and content-based solution for video parsing, retrieval and browsing.

The first task in video parsing is temporal segmentation, which partitions a video program into individual

camera shots, consisting of one or more frames generated and recorded contiguously and representing a continuous action in time and space. These shots are the basic units of video to be represented and indexed. In our system we have developed and integrated a twin-comparison algorithm that detects not only simple camera breaks but also gradual transitions implemented by special effects such as fades, wipes, and dissolves.(7) The algorithm is also able to filter out false positives in detecting gradual transitions, which result from frame sequences involving camera operations that produce temporal variation patterns similar to gradual transitions. This is achieved by using a simple yet effective camera operation detection algorithm that identifies sequences involving particular global motion, based on the characteristic pattern of motion vectors resulting from camera operations.(7) To handle compressed video data streams without decompressing them, we have also developed and integrated into the system a hybrid segmentation algorithm that makes use of both DCT coefficients and motion vectors of MPEG data. These segmentation algorithms have been shown to achieve both high processing speed and accuracy.(19)

Given the large data volume of video, viewing every frame, or a subset of frames uniformly subsampled within a shot as in conventional VCR systems, is not what users ultimately want. Even when videos are compressed, it is rarely desirable to index and/or store all frames for retrieval. What we need is a representation that presents the information landscape or structure of a video in a more abstracted or summarized manner. In our system, we treat video abstraction as a problem of mapping an entire video segment to a small set of representative images, usually called key-frames. Such key-frames can be displayed directly as a visual abstraction for browsing of shot content. More importantly, shots can be indexed and similarities between shots can be defined by the similarities of their key-frames; thus, visual-similarity-based retrieval of shots can be facilitated.

Key-frames are still images which best represent the content of a video sequence in an abstracted manner, and they may be either extracted or reconstructed from the original video data. Key-frames have frequently been used to supplement the text of a video log,(17) and often only key-frames are maintained on-line if digital storage space is limited. However, there has been little progress in identifying them automatically. In some prototype systems(5) and commercial products, the first frame of each shot has been used as the only key-frame to represent the shot content. While such a representation does reduce the data volume, its representational power is very limited, since it often gives insufficient clues as to what actions are presented in a shot, except for shots with no change or motion. For the same reason, key-frames selected in this way may fail when they are used in comparing shot similarity. The challenge in extracting key-frames is that the process needs to be automatic and content-based, so that the resulting key-frames maintain the dynamic nature of the video content while removing redundancy. In theory, semantic primitives of video, such as interesting objects, actions and events, should be used.


However, because such general semantic analysis is not currently feasible, especially when information from soundtracks and/or closed captions is not available, we have to rely on low-level image features and other readily available information instead. Based on these constraints, we have developed a novel and robust extraction technique which utilizes information computed by the parsing processes.² Instead of using only the first frame of each shot, our technique determines a set of key-frames (one or more) according to the following criteria.

² US Patent; Japan and Europe patents pending.

Shot-based criteria: Key-frames are extracted at the shot level, based on features and content information of the shot. Given a shot, the first frame is always selected as the first key-frame; whether more than one key-frame needs to be chosen depends on the following analyses.

Color-feature-based criteria: After the first key-frame is selected, the following frames in the shot are compared sequentially against the last key-frame as they are processed, based on their similarities defined by color histograms or moments, as presented in Section 3. If a significant content change occurs between the current frame and the last key-frame, the current frame is selected as a new key-frame. This process is iterated until the ending frame of the shot is reached. In this way, any significant action in a shot will be captured by a key-frame, while static shots will result in only a single key-frame. When compressed data are used, frame similarity may be calculated from the DCT coefficients associated with all macro-blocks in the frames.(6,19)

Motion-based criteria: Dominant or global motion resulting from camera operations and large moving objects is the most important source of content change and thus an important input for key-frame selection. For instance, camera operations are used explicitly to express a camera operator's intention in tracking, or focusing/defocusing on, an event or object. On the other hand, color histogram and moment representations often do not capture such motions quantitatively, due to their insensitivity to motion. Therefore, we need to detect sequences involving two types of global motion: panning and zooming. In reality, there are more than just these two types of motion in video sequences, such as tilting, tracking, and dollying. But, due to their similar visual effects, at least for key-frame selection, we treat camera panning, tilting and tracking, as well as horizontal and vertical motion of large objects, as one type of motion, panning-like; and camera zooming and dollying, together with motion of large objects perpendicular to the imaging plane, as another type, zooming-like.

To select key-frames representing these two types of motion, we have introduced two criteria. For a zooming-like sequence, at least two frames will be selected, the first and the last, since one represents a global view while the other represents a more focused view. For a panning-like sequence, the number of frames to be selected depends on the scale of panning: ideally, the spatial context covered by each frame should have


little overlap, or each frame should capture a different but sequential part of the object activities. The spatial overlap we have chosen is 30%; in other words, if the scene covered by a frame is shifted by 30% in any direction from the last key-frame, a new key-frame should be chosen.

To detect the two major types of global motion for key-frame extraction, we use the same algorithm as in filtering false positives in the detection of gradual transitions,(7) as mentioned above. The only difference is that in filtering false transition detections we only need to identify whether a global motion occurs in a sequence, while here we also need to identify the type of global motion. Compared with other camera operation detection algorithms for a similar purpose,(2,20) this algorithm is a simple yet very effective first pass. A more quantitative algorithm, developed by Tse and Baker,(21) is then applied to the detected panning or zooming sequences to calculate the actual panning vectors and zooming factors between frames for selecting key-frames. When MPEG video is used, motion vectors from B and/or P frames can be extracted directly for motion analysis.(19)

In our implementation, the key-frame extraction process is integrated with the temporal segmentation process. That is, each time a new shot is identified, the key-frame extraction process is invoked, using parameters already computed during segmentation. In addition, users can adjust several parameters to control the density of key-frames in each shot. These parameters include the thresholds used in comparing color feature changes and the spatial overlap used in selecting key-frames in panning sequences. The default is that at least two key-frames will be selected for each shot; in the simplest case, they could be the first and last frames of the shot.
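As an illustration of the color-feature and motion criteria just described, the following sketch selects key-frames within one shot. The histogram quantization, the threshold value and the helper names are our own assumptions, not the system's actual parameters; the perceptually weighted distance of Section 3 is replaced here by a plain L1 distance.

import numpy as np

def color_histogram(frame, levels=4):
    """64-bin color histogram: each RGB channel of a uint8 HxWx3 frame quantized to `levels` values."""
    q = (frame // (256 // levels)).astype(int)
    idx = q[..., 0] * levels * levels + q[..., 1] * levels + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=levels ** 3).astype(float)
    return hist / hist.sum()

def histogram_distance(h1, h2):
    """Plain L1 distance between two normalized histograms (stand-in for eq. (1) of Section 3)."""
    return float(np.abs(h1 - h2).sum())

def select_key_frames(frames, threshold=0.3, min_key_frames=2):
    """Color-feature criterion: the first frame is always a key-frame; a new key-frame
    is declared whenever the current frame differs significantly from the last key-frame."""
    key_frames = [0]
    last_hist = color_histogram(frames[0])
    for i in range(1, len(frames)):
        hist = color_histogram(frames[i])
        if histogram_distance(hist, last_hist) > threshold:
            key_frames.append(i)
            last_hist = hist
    if len(key_frames) < min_key_frames and len(frames) > 1:
        key_frames.append(len(frames) - 1)      # default: at least first and last frame
    return key_frames

def pan_key_frame_needed(shift_x, shift_y, width, height, max_shift=0.30):
    """Motion criterion for panning-like sequences: a new key-frame is due once the
    accumulated pan from the last key-frame exceeds 30% of the frame size in any direction."""
    return abs(shift_x) > max_shift * width or abs(shift_y) > max_shift * height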

2.2. Video segmentation and key-frame extraction: experimental evaluation

The performance of various temporal segmentation systems has been discussed elsewhere(2-7) and will not be repeated here. Table 1 summarizes key-frame extraction results obtained from four test sets, which represent the effectiveness of the techniques described above in abstracting different types of video material. The first test was based on two stock footage videos consisting of unedited raw shots of various lengths, covering a wide range of scenes and objects. Singapore, the second test set, is travelogue material from an elaborately produced documentary which draws upon a variety of sophisticated

editing effects. Finally, the Dance video is the entirety of Changing Steps, a "dance for television" containing shots with fast-moving objects (dancers), complex and fast-changing lighting and camera operations, and highly imaginative editing effects, many of which are far less conventional than those in Singapore. We feel that these four source videos are representative of both the content material and the style of presentation that one will generally find in professionally produced television and film.

Because key-frame extraction takes place at the segment level, segmentation results are also listed in Table 1. Furthermore, because it is hard to measure the accuracy of key-frame extraction quantitatively (even human selection can be subjective), only the number of key-frames extracted for each test is given. There are two sets of key-frame numbers: one (Nk2) extracted with the default of at least two key-frames per shot, and one (Nk1) extracted without this limitation, i.e. with a minimum of one key-frame per shot.

Table 1 shows that the segmentation algorithm performs with an accuracy of over 95% (counting both missed and false detections as errors) in the first three tests. The Dance video yields the lowest accuracy, due to its elaborate production techniques, and can be considered the lowest achievable accuracy of the segmentation algorithm. For all segments which were correctly detected, key-frames were correctly extracted, yielding an average of between two and three key-frames per shot, which tends to be a suitable abstraction ratio. In addition, the test results demonstrate that camera operations (both panning and zooming) have been properly abstracted by the key-frames detected.

It is also worth observing that, particularly in the Dance example, key-frame extraction actually compensated visually for segmentation errors. That is, in many of the cases where a shot boundary was missed, one or more key-frames were still detected to represent the material in the missed shot. This means that for the purpose of video browsing, as will be discussed later in this paper, the scheme of using a set of key-frames is more reliable and informative than using a single key-frame. This is because, apart from the reasons discussed in the last subsection, the selection of a single key-frame, whether the first, last or middle frame of a shot, depends entirely on the correct detection of the shot boundaries, and no key-frame will be detected for a shot that is missed by the partitioning process.

Table 1. Video segmentation and key-frame extraction results

Video              Length (s)    Nd    Nm    Nf    Nk2    Nk1    Nl
Stock footage 1       451.8      35     1     1    116     81    76
Stock footage 2      1210.7      78     1     5    271    193   186
Singapore             173.8      31     1     0     71     56    45
Dance                2109.1      90    17     4    205    178    **

Nd: number of shots correctly detected; Nm: number of shots missed by the detection algorithm; Nf: number of false detections of shot boundaries; Nk2: number of key-frames extracted with a default of at least two key-frames per shot; Nk1: number of key-frames extracted with a default of one key-frame per shot; Nl: number of key-frames selected by the librarians; **: not assessed.


To systematically evaluate the effectiveness and accuracy of our key-frame extraction system, we conducted a user trial comparing automatic extraction with human identification of key-frames from the two stock footage videos listed in Table 1, which were provided by the Singapore Broadcasting Corporation (SBC), and from the Singapore video. Members of the SBC Film/Videotape Library staff were instructed to identify key-frames from these source tapes as they usually do in their daily video library indexing process, on the basis of which they then evaluated the results of our system.

The evaluation showed that our system, with either default value, did not miss any of the key-frames selected by the librarians. The only difficulty was that our results were more "generous", extracting more key-frames than were selected manually; but this discrepancy has been remedied by providing user control of the density of frames selected. On the other hand, such a condensed selection of key-frames could only be obtained after the librarians had watched through a number of neighboring shots to understand the content and context of any particular shot. Also, when they were choosing key-frames at this higher level of abstraction, the librarians tended to assume that the users of the key-frames would have the same level of knowledge about the content of the video as they have. In reality, this is not the case; the level of abstraction which the librarians chose is adequate only for users who have a priori knowledge about the video content, for whom the key-frames serve merely as a memory aid. Based on this, the librarians agreed that our automatic extraction was more objective (our system always tried to identify at least one key-frame per shot) and adequate, while they sometimes tended to ignore certain shots which they felt were not important.

Figure 2(a) shows an example in which our system chose three key-frames from a tilt-up shot, while the librarians chose only the last frame as the key-frame. From the three key-frames one can see clearly that it is a tilt-up sequence, which is impossible to see from any single key-frame. In this respect, extracting three key-frames is a more adequate abstraction than only one key-frame. On the other hand, the key-frame chosen by the librarians does capture the visual content of the shot (business district buildings) as well as the three key-frames, except that it ignores the camera movement effect, which is important for users (especially producers and editors) who want to choose shots from stock footage. Figure 2(b) shows a similar example where our system chose three key-frames while the librarians chose only one. Therefore, while we can conclude that our key-frame extraction algorithm achieves adequate abstraction and produces good key-frames, we need a more systematic user study of key-frame extraction performance, involving the analysis of opinion scores, which will be a task in the future development of the system.

Fig. 2. Key-frames selected by the system and the librarians: (a) three key-frames selected by the system from a tilt-up shot, of which the librarians selected only the last; (b) three key-frames selected by the system from a zooming-out shot, of which the librarians selected only the last.

3. FEATURE-BASED REPRESENTATION AND SIMILARITY MEASURES

After partitioning and abstraction, the next step in video analysis is to identify and compute representation primitives based on which the content of shots can be indexed, compared, and classified. Ideally these should be semantic primitives that a user can employ to define interesting or significant events. However, such an ideal solution is not feasible, so our representation primitives are based on information accessible through the


techniques described in Section 2. The resulting representation is divided into two parts: primitives based on key-frames, and those based on temporal variation and motion information. With these two representations we can then index video shots and define the shot similarity used in video retrieval, which is another major contribution of the work presented in this paper.

3.1. Representation of shot content based on key-frame features

Following the techniques used in image database systems,(13-15) key-frames are represented by properties of color, texture, shape, and edge features.

Color features: Color has excellent discrimination power in image retrieval systems; it is very rare that two images of totally different objects will have similar colors.(22) Our representation primitives for color features include the color histogram, dominant colors, and statistical moments. The color histogram is the most important feature in image representation. We have chosen the Munsell space to define color histograms because it is close to human perception of colors.(23) As in QBIC, we have quantized the color space into 64 super-cells using a standard minimum sum of squares clustering algorithm.(14) A 64-bin color histogram is then calculated for each key-frame, where each bin is assigned the normalized count of the number of pixels that fall in its corresponding super-cell. The distance between two color histograms, I and Q, each consisting of N bins, is quantified by the following metric:(14)

D_{his}^2(I, Q) = \sum_{i=1}^{N} \sum_{j=1}^{N} a_{ij} (I_i - Q_i)(I_j - Q_j),    (1)

where the matrix a_{ij} represents the similarity between the colors corresponding to bins i and j, respectively. This matrix needs to be determined from human visual perception studies, and we have derived it using the method of QBIC.(14) Notice that if a_{ij} is the identity matrix, this measure becomes the Euclidean distance. This distance, or dissimilarity measure, has also been used in our temporal segmentation, where consecutive frames are compared, and in the key-frame selection process, where the current frame is compared with the last selected key-frame, as presented in the last section. As pointed out there, by adjusting the threshold for this metric, a user can determine the density of key-frames.

Since, in most images, a small number of color ranges capture the majority of pixels, these dominant colors can be used to construct an approximate representation of the color distribution. Experiments have shown that using only a few dominant colors will not degrade the performance of color image matching.(13,22) Our current implementation is based on twenty dominant colors, corresponding to the histogram bins containing the maximum numbers of pixels.

Since a probability distribution is uniquely characterized by its moments, following reference (24), we also

represent a color distribution by its first three moments:

\mu_i = \frac{1}{N} \sum_{j=1}^{N} p_{ij},    (2)

\sigma_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - \mu_i)^2 \right)^{1/2},    (3)

s_i = \left( \frac{1}{N} \sum_{j=1}^{N} (p_{ij} - \mu_i)^3 \right)^{1/3},    (4)

where p_{ij} is the value of the ith color component of the jth image pixel. The first-order moment, μ_i, defines the average intensity of each color component; the second and third moments, σ_i and s_i, respectively define the variance and skewness. For comparing image similarity using this feature set, a weighted Euclidean distance is used. Experimental evidence has shown that this measure is more robust in matching color images than color histograms,(24) and it is thus used as one of the color similarity measures in key-frame-based retrieval.

Texture features: Texture has long been recognized as being as important a property of images as color, if not more so, since textural information can be conveyed as readily with gray-level images as with color. Among the many alternatives we have chosen two models which are both popular and effective in image retrieval: the Tamura features (contrast, directionality, and coarseness) and the simultaneous auto-regressive (SAR) model. For details on the definition and derivation of these features, and the associated similarity measures, see references (14,25,26).

Shape and edge features: Dominant objects in key-frames represent important semantic content and are best represented by their shapes, if they can be identified by either automatic or semi-automatic spatial segmentation algorithms. In our implementation, dominant objects in key-frames are obtained by a color-based segmentation algorithm(13) and an interactive outlining algorithm.(27) In comparing the similarity between shapes, cumulative turning angles are used, because they provide a measure closer to human perception of shape than algebraic moments or parametric curves.(28)
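To make equations (1)-(4) concrete, the sketch below computes the quadratic-form histogram distance and the three color moments. The similarity matrix A and the weights are assumed to be supplied (the paper derives A from perceptual studies following QBIC); the function names are ours.

import numpy as np

def histogram_quadratic_distance(I, Q, A):
    """Equation (1): D^2 = (I - Q)^T A (I - Q); A[i, j] encodes the perceptual
    similarity between the colors of bins i and j."""
    d = np.asarray(I, dtype=float) - np.asarray(Q, dtype=float)
    return float(d @ A @ d)

def color_moments(channel_values):
    """Equations (2)-(4): mean, standard deviation and skewness of one color
    component over the N pixels of a key-frame."""
    p = np.asarray(channel_values, dtype=float).ravel()
    mu = p.mean()
    sigma = np.sqrt(((p - mu) ** 2).mean())
    skew = np.cbrt(((p - mu) ** 3).mean())
    return mu, sigma, skew

def moment_distance(m1, m2, weights=None):
    """Weighted Euclidean distance between two moment feature vectors, as used for
    key-frame matching (the default unit weights are a placeholder)."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    w = np.ones_like(m1) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(w * (m1 - m2) ** 2)))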

3.2. Temporal features for representation of shot content

Key-frame-based features, presented in the last subsection, capture most of the spatial information but little of the temporal nature of video. Such a representation alone is insufficient to support event-based classification and retrieval, since motion is an essential and unique feature of video sequences. We now review some of the temporal features we are currently using to represent temporal characteristics at the shot level.

Temporal variation and camera operations: The means and variances of the average brightness and of a few dominant colors, calculated over all frames in a shot, are used to represent quantitative temporal variations of brightness and color. As an example, such temporal variations have been used to classify news video clips

into anchorperson shots and news shots.(29) The algorithms for detection of camera operations and global motions, as presented in Section 2.1, classify video sequences into sequences with no global motion, panning-like sequences, or zooming-like sequences. Such classification provides useful data for shot indexing, and queries such as "find all shots with a camera panning at 10° s⁻¹" can easily be satisfied based on such meta-data.

Statistical motion features: Since motion features have to roughly match human perception, and it is still not clear how humans describe motions, we base our motion representation on statistical motion features, rather than on object trajectories as in some other works. More specifically, these features include the directional distributions of motion vectors and the average speeds in different directions and areas. These features are derived from the optical flow calculated between consecutive frames within shots. The directional distribution of motion vectors can be defined as follows:

d_i = \frac{N_i}{N_{mt}}, \quad i = 1, \ldots, M,    (5)

where i represents one of the M evenly divided directions, N_i is the number of moving points in direction i, and N_{mt} is the total number of moving points over all directions. N_{mt} can be replaced by the total number of points in the optical flow so as to take into account how large the moving area is in the scene. Similarly, we can estimate the average speed and its standard deviation in a given direction as follows:

\bar{v}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} s_{ij}, \qquad \sigma_i = \sqrt{\frac{\sum_{j=1}^{N_i} (s_{ij} - \bar{v}_i)^2}{N_i - 1}}, \quad i = 1, \ldots, M,    (6)

where s_{ij} is the speed of the jth moving point in direction i. These two feature sets provide a general description of the speed distribution and are used for motion-based shot retrieval and clustering. To obtain localized motion information, we can also calculate the average speed and its variance in uniformly divided blocks of frames. That is, instead of calculating average speeds in M directions over entire frames, we calculate a set of motion statistics in M blocks of frames according to equation (6). The motion-based comparison of shot contents is then based on comparing the motion statistics of corresponding blocks in consecutive frames. When shot-based temporal and motion features are used to represent the content of video shots, the Euclidean or L1 distance between the feature parameters of two shots can be used as the difference/similarity measure in comparing shots.
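A sketch of equations (5) and (6) applied to a dense optical-flow field follows; the flow computation itself, the moving-point threshold and the function name are assumptions of this illustration.

import numpy as np

def motion_statistics(flow, num_directions=16, min_speed=0.5):
    """Directional distribution d_i (eq. 5) and per-direction mean speed and
    standard deviation (eq. 6) from an optical flow field of shape (H, W, 2)."""
    u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
    speed = np.hypot(u, v)
    moving = speed > min_speed                        # keep only "moving" points
    angle = np.arctan2(v[moving], u[moving])
    bins = ((angle + np.pi) / (2 * np.pi) * num_directions).astype(int) % num_directions
    moving_speed = speed[moving]
    n_total = max(len(moving_speed), 1)
    d, mean_speed, std_speed = [], [], []
    for i in range(num_directions):
        s = moving_speed[bins == i]
        d.append(len(s) / n_total)                                    # eq. (5)
        mean_speed.append(float(s.mean()) if len(s) else 0.0)         # eq. (6), mean
        std_speed.append(float(s.std(ddof=1)) if len(s) > 1 else 0.0) # eq. (6), std
    return np.array(d), np.array(mean_speed), np.array(std_speed)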


3.3. Shot similarity

The visual features presented above provide content representations of shots, but the goal is to define shot similarity based on these representations, to enable shot comparison or clustering for video retrieval and browsing, as will be presented in Section 4. When key-frames are used as the representation of each video shot, we can define video shot similarity based on the similarities between the two key-frame sets. If two shots are denoted as S_i and S_j, and their key-frame sets as K_i = {f_{i,m}, m = 1, ..., M} and K_j = {f_{j,n}, n = 1, ..., N}, then the similarity between the two shots can be defined as

S_k(S_i, S_j) = \max_{1 \le m \le M, \; 1 \le n \le N} s_k(f_{i,m}, f_{j,n}),    (7)

where s_k is a similarity metric between two images defined by any one, or a combination, of the image features presented above; there are in total M x N similarity values, from which the maximum is selected. This definition assumes that the similarity between two shots can be determined by the pair of key-frames which are most similar, and it guarantees that if there is a pair of similar key-frames in the two shots, the shots are considered similar.

Another definition of key-frame-based shot similarity is

S_k(S_i, S_j) = \frac{1}{M} \sum_{m=1}^{M} \max_{1 \le n \le N} s_k(f_{i,m}, f_{j,n}).    (8)

This definition states that the similarity between two shots is the normalized sum of the best-matching key-frame pairs. When only one pair of frames matches, this definition is equivalent to the first one.
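The two shot-similarity definitions translate directly into code; in the sketch below, sk is any of the key-frame similarity measures of Section 3.1, and the function names are ours.

def shot_similarity_max(key_frames_i, key_frames_j, sk):
    """Equation (7): the similarity of two shots is that of their best-matching key-frame pair."""
    return max(sk(f, g) for f in key_frames_i for g in key_frames_j)

def shot_similarity_avg_best(key_frames_i, key_frames_j, sk):
    """Equation (8): average, over the key-frames of shot i, of their best match in shot j."""
    return sum(max(sk(f, g) for g in key_frames_j)
               for f in key_frames_i) / len(key_frames_i)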

3.4. Feature-based clustering

To support class-based video browsing and indexing, as will be discussed later in Section 4.2.2, it is necessary to classify shots into groups of similar visual content and to obtain a representative icon for each class. For this purpose, we have developed and integrated into our system a set of clustering algorithms using the content features and similarity measures discussed above. There are mainly two types of clustering: partition clustering, which arranges data in separate clusters, and hierarchical clustering, which leads to a hierarchical classification tree.(30) For our purpose of clustering a large data set of shots to allow class-based video browsing with different levels of abstraction, partition methods are more suitable, since they can find an optimal clustering at each level and obtain a good abstraction of the data items. Based on this, we have adapted a flexible hierarchical partition clustering approach, which performs iterative partition clustering of the data at each level of a hierarchy. This approach is flexible in that different feature sets, similarity metrics and iterative clustering algorithms can be applied at different levels.

Considering our requirements, we have chosen and implemented two iterative clustering methods: the K-means method and the Self-Organizing Map (SOM) method.(31,32) The advantage of SOM is its learning


ability without prior knowledge and its good classification performance, which have been demonstrated extensively by many researchers. Another benefit of using SOM is that similarities among the extracted classes can be seen directly from the two-dimensional map. This allows horizontal exploration as well as vertical browsing of the video data, which is very useful when we have a large number of classes. Our K-means clustering method has been enhanced by incorporating fuzzy classification ideas: it computes the membership function of a data item with respect to all classes and can thus assign data items at the boundary between two classes to both of them. This is useful especially at higher levels of hierarchical browsing, where users expect all similar data items to fall under a smaller number of nodes. At the same time, outliers, i.e. data items whose membership functions to all classes are very small, can be detected and put into a miscellaneous class. The calculation of membership functions and the clustering process in the fuzzy K-means method are as follows:

(1) Obtain N classes using the K-means algorithm.
(2) For every data item v_i, i = 1, ..., M, calculate its similarity s_{ik} with each class k (k = 1, ..., N) as

s_{ik} = \frac{d_{ik}^{-2/(\phi - 1)}}{\sum_{j=1}^{N} d_{ij}^{-2/(\phi - 1)}},

where d_{ij} is the distance between data item v_i and the reference vector of class j, and φ is the fuzzy exponent (φ > 1.0). If s_{ik} ≥ ρ, where ρ is a threshold set by the user (0 < ρ < 1), add item v_i to class k. If v_i is not assigned to any class in the above step, assign it to the miscellaneous class.

We have used mainly the color features for clustering shots into classes of similar content, as discussed in Section 4.2.2. Experimental results are also given in that section. A more comprehensive evaluation of the SOM method for clustering images based on color and texture features has been performed and can be found in references (31,32).
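A minimal sketch of the fuzzy membership step described above, starting from an ordinary K-means result, is given below; the Euclidean distance, fuzzy exponent and threshold values are illustrative choices, not the system's actual settings.

import numpy as np

def fuzzy_memberships(items, centers, phi=2.0):
    """Membership s_ik of item i to class k, proportional to d_ik^(-2/(phi-1))
    and normalized over the N classes (phi > 1)."""
    d = np.linalg.norm(items[:, None, :] - centers[None, :, :], axis=2)  # (M items, N classes)
    d = np.maximum(d, 1e-12)                      # guard against zero distances
    w = d ** (-2.0 / (phi - 1.0))
    return w / w.sum(axis=1, keepdims=True)

def assign_fuzzy_classes(items, centers, rho=0.2, phi=2.0):
    """Assign each item to every class whose membership exceeds rho; items that
    reach no class are placed in a miscellaneous class (label -1)."""
    s = fuzzy_memberships(items, centers, phi)
    return [list(np.flatnonzero(s[i] >= rho)) or [-1] for i in range(len(items))]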

3.5. Summary

This section has discussed a set of visual features and similarity metrics for video content representation, and two feature-based clustering methods for shot classification. It should be noted that each of the visual features presented above represents a particular property of images or image sequences and is effective in matching images in which that feature is the salient one. Therefore, it is important to identify the salient features for a given image or shot and to apply the appropriate similarity metrics. Developing such algorithms remains a long-term research topic. In addition, how to integrate the different features optimally into a feature set that has the combined

representation power of each individual feature is another challenging problem.

4. VIDEO RETRIEVAL AND BROWSING TOOLS

The video parsing processes presented in Section 2 determine shot boundaries and extract key-frames and their visual features, as well as the temporal and motion features of shots. These features, along with textual data where available, provide the meta-data set based on which content-based video indexing, retrieval and browsing can be facilitated. This section presents the retrieval and browsing tools implemented in our system, which utilize this meta-data, together with some experimental results of video retrieval and browsing.

4.1. Content-based retrieval of video shots: tools and examples

With the representation and similarity measures described in Section 3, querying a video database to retrieve shots can be performed based on key-frame features, temporal shot features, or a combination of the two. The retrieval system we are building supports retrieval as a more interactive process than is provided by simply formulating and processing conventional queries, placing fewer constraints on the possibilities the user may wish to explore. The query process is still iterative, with the system accepting feedback to narrow or reformulate searches, change its link-following behavior, and refine any queries that are given.

4.1.1. Key-frame-based retrieval. Once a video has been abstracted and is represented by its key-frame features, the indexing schemes of image databases, such as QBIC,(14) can be applied to shot indexing. That is, each key-frame is linked to a shot and each shot is linked to all its key-frames, and the search for video shots becomes a matter of identifying those key-frames in the database which are similar to the query, according to the similarities defined in Section 3. As in the QBIC image database, to accommodate different user requests, three basic query approaches are supported in our system: query by template manipulation, query by object feature specification, and query by visual example. The user can also specify the particular feature set to be used for a retrieval.

Query by visual templates is based on the assumption that a user often wants to retrieve images which contain some known color patterns, such as a sunny sky, sea, beach, lawn, or forest. These pre-defined templates are stored and displayed as color texture maps and can be selected by the user to form a query, as shown in Fig. 3(a). The color distribution of a selected template can also be manipulated: the color manipulator consists of three scalars, corresponding to red, green, and blue intensities, and a display of up to 20 significant colors in the selected template, in descending order of their contribution to the template image. The user can modify the red, green, and blue content of any selection to approximate

Fig. 3. Key-frame-based query of video database. (a) Template panel with color template selection and color manipulation tools; (b) composed template image which forms a query; (c) retrieved key-frames based on color similarity to the query; (d) a player to browse a shot represented by a retrieved key-frame.

more closely what he/she has in mind. The manipulation is performed in real time, meaning that the user is able to see changes in the template color while making adjustments. As shown in Fig. 3(b), it is not necessary that the entire area of the target image be filled by templates: images can be retrieved based on matching only those regions for which templates have been specified. For example, one can form a query by selecting a sky template and a green-area template and placing them in the regions shown in Fig. 3(b). This partially specified query produced the five best histogram-based matches displayed in Fig. 3(c), from a database of about 700 images. It should be noted, however, that in this example the retrieval was based primarily on the color texture of the background of the images. For a more comprehensive evaluation of color

feature-based image retrieval implemented in our system, see reference (33). That evaluation showed that the best retrieval rate (87%) was achieved by matching the similarity between the top 20 dominant bins of nine local histograms and a global histogram of each image.

Once a key-frame has been identified, the user may view its associated video clip by clicking the "Video" button in the "Retrieved Images" window. This initiates a video player, as shown in Fig. 3(d), which is cued to the location of the key-frame. That frame may also be used to derive a hierarchical display, as will be seen in Fig. 7, in which the key-frames are elements of the lowest level. Retrieval may thus be followed by browsing as a means of examining the broader context of the retrieved key-frame. On the other hand, a query can also


be initiated from browsing. That is, a user may select a key-frame while browsing and offer it as a query: a request to find all key-frames which resemble that image.

The query composition window need not necessarily have 3 x 3 uniform sub-regions, as shown in Fig. 3(b). Our system also provides a drawing window that allows the user to draw arbitrary shapes with any selected patterns or colors, though the matching will still be based on the color and/or texture features of nine or more sub-regions in the query image and the database images if no object segmentation has been performed on the database images. Though such visual interfaces for query by template composition may still be somewhat cumbersome for novice users, they are a useful first-level visual query definition tool, and they provide a stepping stone to the use of query-by-visual-example tools.

It should be pointed out that, though there have been many research efforts, including our own, the development of color and texture feature-based image retrieval algorithms is still in its early stages. How to link low-level features to high-level semantics remains, and will for a long time remain, an open problem in computer vision. On the other hand, it is believed that low-level

features in image retrieval algorithms and tools will provide a bridge to more intelligent solutions and will improve our productivity in image and video retrieval, though the current tool set is far from satisfactory.

4.1.2. Shot-based retrieval. Shot-based temporal features provide another set of primitives that capture the motion nature of video, and they are used in our system as another retrieval tool to improve retrieval performance. As discussed in Section 3.3, the two sets of features can be used separately or in combination, based on the user's selection and the associated similarity measures of each feature set. In this subsection, we present two examples to illustrate the tools in our system that support example-based shot retrieval. The data set consists of a total of 91 shots: 32 from the Singapore video used in Table 1 and the rest from a 10-minute news program.

Figure 4 shows an example of combining key-frame-based features and temporal features in a single query, where the anchorperson shot shown in Fig. 4(a) is used as the query example, and retrieval is based on the color histograms of the key-frames (only one per shot, since they

Fig. 4. Shot-based video retrieval. (a) Query formed by this example shot of an anchorperson. (b) Retrieved shots, represented by their key-frames, from the video database, based on their similarity to the query example in terms of shot features.


Table 2. Numbers of shots in the four pre-classified classes

Classes of shots    Crowds    TalkingHeads    Sports    Classroom    Others
Number of shots        5           15             8          7          57

are talking head shots), and on the temporal mean and variance of the histogram within the shots. That is, the similarity between two shots is defined as the sum of the color histogram similarity between the key-frames of the two shots, computed by equation (1) using only the 20 dominant colors, and the average and variance of the frame-to-frame histogram difference within each shot. This is, in fact, the technique used to distinguish anchorperson shots from general news shots in the news parsing process.(29) Here, all five other anchorperson shots in the test set of 91 video shots are found.

Another, more comprehensive, example of temporal-feature-based retrieval is based on the motion features discussed in Section 3.2. In this example, 35 of the 91 shots were pre-classified into four classes according to their motion patterns: Crowds of walking people, TalkingHeads, Sports, and Classroom audience. The TalkingHeads class also includes the six anchorperson shots. The number of shots in each class is given in Table 2, and Fig. 5 shows an example frame of each class.

In the experiments, each shot (i.e. its features) in a class was used as a query, and the retrieval rate was calculated based on the number of shots from the same class which were retrieved among the top candidates. The mathematical definition of the retrieval rate for a class is thus:

R = \frac{1}{N_c} \sum_{k=1}^{N_c} \frac{N_k}{M_k} \times 100\%,    (9)

where k indexes the query shots of a class, N_c is the number of shots in the class (each used once as a query), N_k is the number of retrieved candidates from the same class as query shot k, and M_k is the total number of top candidates examined, which equals the number of shots in the class minus one.

Two sets of features were tested in this experiment. First, we calculated the average speeds in four directions

in four evenly divided blocks of each scene. The average retrieval rate of each class and the overall average over all classes are listed in Table 3, where the number of top candidates examined was 15. In the second test, the directional distribution of motion vectors defined by equation (5) and the average speeds and standard deviations defined by equation (6) were combined in retrieving video shots. That is, 30 candidate shots were first retrieved by each feature set, equations (5) and (6); the two candidate sets were then combined by an AND operation and ranked according to the second feature. Finally, 15 shots were picked according to the similarity ranking from the combined set. The average recall rates are also listed in Table 3. We chose to examine 16 directions, i.e. M was set to 16 in equations (5) and (6), and the sum of absolute differences between the components of the two feature vectors was used as the distance measure. A higher weight was given to the standard deviation.

Comparing the two tests, the second test obtained a much higher retrieval rate for the class Crowds, but a much lower one for Sports. This is because the Crowds shots have a dominant motion direction (left) while the Sports shots do not. As a result, when more direction information was taken into account in test 2, it benefited the retrieval of Crowds while penalizing Sports. The TalkingHeads and Classroom classes both contain activities with small motion speeds and random motion directions, so shots of these two classes were often mixed together, which reduces the recall rates. We conclude that our statistical motion feature set obtains the best retrieval results when it is applied to shots which contain consistent motion patterns over the entire shot duration, though the motion need not be in only one direction.

Fig. 5. Example shots in four perceptual classes.

Table 3. Results of temporal-feature-based shot retrieval (average retrieval rate, %)

Method                           Crowds    TalkingHeads    Sports    Classroom    Average
Regional motion                   64.0         43.2         62.5        51.0        54.0
Motion directions and speeds      80.0         52.8         53.6        56.4        56.6


Fig. 6. Query result using the combined motion features.

Figure 6 shows the retrieval result obtained using the example shot of the Crowds class as the query and applying the combined features in retrieval. Five video shots with people walking towards the left are returned for the query, among which four belong to the Crowds class; the third candidate is a false positive.

The experiments presented in this section utilize only low-level motion features to represent the activities in video shots, and the robustness of the retrieval algorithms is rather limited. Another key issue is again how to link these low-level features to a higher level, which calls for more semantically based motion features and similarity measures. We are therefore studying more event-based features, such as human appearance and motion patterns, especially for videos in specific domains.
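The feature-combination step used in the second test above can be sketched as follows; the distance functions are assumed to be the motion-feature distances of Section 3.2, the list sizes follow the description in the text, and the function name is ours.

def combined_motion_retrieval(query, shots, dist_direction, dist_speed,
                              first_pass=30, final=15):
    """Retrieve top candidates by each motion feature set (eqs. 5 and 6), AND the
    two candidate sets, rank by the second feature and keep the best `final` shots."""
    by_dir = sorted(shots, key=lambda s: dist_direction(query, s))[:first_pass]
    by_spd = sorted(shots, key=lambda s: dist_speed(query, s))[:first_pass]
    candidates = [s for s in by_dir if s in by_spd]       # AND of the two sets
    return sorted(candidates, key=lambda s: dist_speed(query, s))[:final]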

4.2. Content-based browsing of video

Interactive browsing of full video content is probably the most essential feature of new forms of interactive access to digital video. As we pointed out in the Introduction, the novelty and uniqueness of our video browsing tools lie in their use of the content information, or meta-data, obtained from video parsing, including segment boundaries, key-frames and temporal features. This is another significant contribution of our system development. This section presents a set of content-based video browsing tools, which support two different approaches to accessing video source data: sequential access and random access. In addition, these tools accommodate two levels of granularity, overview and detail, along with an effective bridge between the two levels.

4.2.1. Key-frame-based browsing tools. Our sequential-access browsing takes place through an improved VCR-like interface, illustrated in Fig. 3(d). Overview granularity is achieved by playing only the extracted key-frames at a selected rate. Detail granularity is provided by normal viewing, with frame-by-frame single stepping; this mode is further enhanced to freeze the display each time a key-frame is reached. Functions such as "go to next shot" can also be supported based on the meta-data of shot boundaries. Finally, viewing at both levels of granularity is supported in the reverse as well as the forward direction.

The hierarchical browser is designed to provide random access to any point in a given video: a video

sequence is spread out in space and represented by frame icons which function rather like a light table of slides. In other words, display space is traded for time to provide a rapid overview of the content of a long video. The first attempt at such a hierarchical browser was the Video Magnifier,(34) but it only used successive horizontal lines, each of which offers greater time detail and narrower time scope, by selecting frames from the video. We have improved the content accessibility of such browsers by utilizing the structural content of the video obtained in video parsing. As shown in Fig. 7, at the top of the hierarchy a whole video program is represented by five key-frames, each corresponding to a sequence consisting of an equal number of consecutive camera shots. Any one of these segments may then be subdivided to create the next level of the hierarchy. As we descend through the hierarchy, our attention focuses on smaller groups of shots, single shots, the representative frames of a specific shot, and finally a sequence of frames represented by a key-frame. We can also move to a more detailed granularity by opening the first type of video player to view sequentially any particular segment of video selected from this browser at any level of the hierarchy.

As we pointed out earlier, another advantage of using key-frames in browsing is that we are able to browse the video content down to the key-frame level without necessarily storing the entire video. This is particularly advantageous if storage space is limited. Such a feature is very useful not only in video databases and information systems but also to support previewing in VOD systems. What is particularly important is that the network load for transmitting small numbers of static images is far less than that required for transmitting video; and, because the images are static, quality of service is no longer such a critical constraint. Through the hierarchical browser, one may also identify the specific sequence of the video which is all that one may wish to "demand." Thus, the browser not only reduces network load during browsing but may also reduce the need for network services when the time comes to request the actual video.
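The hierarchy behind Fig. 7 can be derived from the parsed shot list roughly as follows; the node layout is our own illustration and assumes each shot carries its list of key-frames.

def build_hierarchy(shots, fanout=5):
    """Recursively group consecutive shots into `fanout` segments; each node is
    represented by the first key-frame of its first shot."""
    if len(shots) <= fanout:
        return [{"icon": shot.key_frames[0], "shots": [shot], "children": []}
                for shot in shots]
    size = -(-len(shots) // fanout)                 # ceiling division
    nodes = []
    for start in range(0, len(shots), size):
        group = shots[start:start + size]
        nodes.append({"icon": group[0].key_frames[0],
                      "shots": group,
                      "children": build_hierarchy(group, fanout)})
    return nodes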

4.2.2. Class-based browsing tools. The browsing tools presented above are basically program- or clip-based. That is, video programs or clips are loaded into

Fig. 7. Key-frame-based hierarchical video browser.

the browser either from a list of names specified by retrieval results, a database index, or the user. As shown in Fig. 7, the only criterion used to select the representative icons displayed at the top level of the hierarchy is time order: they are the first key-frames of the first shots of each group. In other words, the shots at the higher levels are grouped only according to their sequential relations, not their content. As a result, though random access is provided, a user has to browse down to the second or third level to get a sense of the content of all the shots in a group. This is less of a problem if the browsing is launched from a retrieval result, such as those shown in Figs 3 and 4, but it is not very convenient when the browser is used on a large collection of video clips. To overcome this problem, we need class-based browsing tools that present the content similarity between shots.

To achieve this, we have developed and integrated into our system a set of clustering algorithms, as discussed in Section 3.4, and a class-based hierarchical browser. When a list of video programs or clips is provided, the system clusters shots into classes consisting of shots of similar visual content, using either key-frame and/or

shot features. After such clustering, each class of shots is represented by an icon frame determined by the centroid of the class, which is then displayed at the higher levels of the hierarchical browser. Figure 8 shows an example of class-based hierarchical browsing, in which 71 shots have been classified using the 64-bin color histograms of key-frames calculated in Munsell space and the K-means clustering method discussed in Section 3.1. As this example shows, with cluster-based hierarchical browsing the viewer can get a rough sense of the content of the shots in a class even without moving down to the lower levels of the hierarchy. Such clustering is also useful in index building and computer-assisted video content annotation. A more comprehensive evaluation of the color-based key-frame clustering algorithms has been performed on the same set of 700 test images used in evaluating color histogram-based image retrieval.(33) Two types of color features were used: a 64-bin color histogram, and a total of six first- and second-order color moments (μi and σi), one for each component of Lu*v*, as defined by equations (2) and (3). The L1 distance was used for color histogram comparison.



Fig. 8. Cluster-based hierarchical video browsing.

For color moment matching, the following weighted L1 distance has been used:

d(i,j) = \sum_{c \in \{L,\, u^*,\, v^*\}} w_c \left( \left| \mu_{c,i} - \mu_{c,j} \right| + \left| \sigma_{c,i} - \sigma_{c,j} \right| \right), \qquad (w_L, w_{u^*}, w_{v^*}) = (1.0,\ 4.0,\ 4.0),

where \mu_{c,i} and \sigma_{c,i} are the first- and second-order color moments of component c of image (key-frame) i. The SOM method was used as the partition clustering method, and performance was measured by recall (R) and precision (P), defined as

R = \frac{\text{number of items hit and relevant}}{\text{total relevant in class}}, \qquad P = \frac{\text{number of items hit and relevant}}{\text{total hit}}.
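A direct, minimal transcription of these measures might look like the following Python sketch. The function names and the array layout (means first, then standard deviations) are our own assumptions; the weights follow the values given above.

```python
import numpy as np

# Per-channel weights for (L, u*, v*), applied to both the mean and the
# standard deviation of each color channel.
MOMENT_WEIGHTS = np.array([1.0, 4.0, 4.0])

def histogram_l1(h1: np.ndarray, h2: np.ndarray) -> float:
    """Plain L1 distance between two (e.g. 64-bin) color histograms."""
    return float(np.abs(h1 - h2).sum())

def moment_distance(m1: np.ndarray, m2: np.ndarray) -> float:
    """Weighted L1 distance between two color-moment vectors.

    Each vector holds (mu_L, mu_u, mu_v, sigma_L, sigma_u, sigma_v)
    computed over one key-frame in Lu*v* space."""
    mu1, sigma1 = m1[:3], m1[3:]
    mu2, sigma2 = m2[:3], m2[3:]
    return float(np.sum(MOMENT_WEIGHTS * (np.abs(mu1 - mu2) + np.abs(sigma1 - sigma2))))

def recall_precision(retrieved: set, relevant: set) -> tuple:
    """Recall and precision of one clustered class against a ground-truth class."""
    hit = len(retrieved & relevant)
    recall = hit / len(relevant) if relevant else 0.0
    precision = hit / len(retrieved) if retrieved else 0.0
    return recall, precision
```

These measures correspond to the R and P columns reported in Table 4 below.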

The images were clustered into 20 classes, among which seven predefined classes were identified and used to evaluate the clustering performance. The results are listed in Table 4.(32)

Table 4. Clustering results using color features and the SOM method

    Features      R (%)    P (%)
    Histogram     76.6     79.0
    Moments       74.9     84.1

It can be seen from Table 4 that good performance has been achieved by the SOM clustering algorithm using either color histograms or color moments. On the other hand, since the clustering of shots is based on low-level features rather than the semantic content of shots, it suffers from the same limitations as low-level feature-based image retrieval. In particular, the clustering result depends heavily on the dominant colors in the images, which may come from the background, the foreground or large objects. However, since browsing is a more interactive and informal process of content searching, a larger degree of misclustering can be tolerated than in retrieval, though high performance is still desirable. Nevertheless, such a clustering scheme based on low-level shot features provides an enabling tool for content-based browsing of video, which can greatly improve the productivity of human operators searching for particular video content. Developing clustering algorithms to facilitate content-based video browsing is one of our on-going research topics, and the results presented here are still limited. We are currently evaluating the effectiveness of both key-frame-based and temporal-feature-based representations with different clustering schemes.
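As a rough illustration of how the class icons of Fig. 8 could be chosen, the sketch below clusters key-frame feature vectors with K-means and picks, for each class, the key-frame nearest the class centroid as its icon. The evaluation above used SOM; scikit-learn is used here purely for convenience and is not implied by the paper, and all function and variable names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_key_frames(features: np.ndarray, n_classes: int = 20):
    """Cluster key-frame feature vectors (one row per key-frame) into classes
    and return, per class, the index of the key-frame nearest its centroid.

    `features` could be 64-bin color histograms or 6-element moment vectors."""
    km = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit(features)
    labels, centroids = km.labels_, km.cluster_centers_

    icon_indices = []
    for c in range(n_classes):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            icon_indices.append(None)  # empty cluster can occur with K-means
            continue
        # L1 distance to the centroid, consistent with the retrieval measures above.
        dists = np.abs(features[members] - centroids[c]).sum(axis=1)
        icon_indices.append(int(members[dists.argmin()]))
    return labels, icon_indices
```

The returned icon indices give the frames displayed at the upper levels of the class-based browser; descending a level then reveals the individual shots of the selected class.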

5. CONCLUDING REMARKS AND FUTURE WORK

In this paper we have presented an integrated and content-based solution for computer-assisted video parsing, abstraction, retrieval and browsing. The core of our solution is its use of low-level visual features as a representation of the structural content of video, together with an automatic abstraction process. The extracted data, namely the video structure, key-frame-based features, temporal and motion features, and some semantic primitives derived at the shot level, can then be organized and used to facilitate indexing, retrieval and browsing. Some of these system components have been transferred to industry for the development of video indexing and archival systems. Experimental results and usability studies have shown that such a solution is effective and feasible, and that it provides many enabling tools for a variety of video applications, though traditional text-based indexing and retrieval tools will still play an important role in video database systems.

Many of the algorithms presented in this paper still need more comprehensive evaluation and user studies before they can be fully integrated into video systems for content-based video retrieval and browsing. From these studies, we wish to identify a set of the most effective visual features and associated similarity measures for content-based video retrieval and, more importantly, for interactive video browsing. We are also working on more robust algorithms to extract event-based features at the shot level, especially in specific application domains such as news video parsing and indexing. Case studies on parsing and indexing news broadcasts have already allowed us to support queries such as "find me all interview shots in a news program".(29) An important extension of our work will be to incorporate audio analysis and text parsing in both video parsing and content representation, as proposed by the Informedia Project.(18) With such multiple information sources, video indexing will no longer be limited to visual data alone, and more semantic-level indices can be derived using text or natural language processing techniques.

REFERENCES

1. H. J. Zhang and S. W. Smoliar, Developing power tools for video indexing and retrieval, Proc. IS&T/SPIE Conf. Storage and Retrieval for Image and Video Databases II, San Jose, California, 140-149 (1994).
2. A. Akutsu, Y. Tonomura, H. Hashimoto and Y. Ohba, Video indexing using motion vectors, Proc. SPIE Conf. Visual Communications and Image Processing, Vol. 1818, Amsterdam, 1522-1530 (1992).
3. P. Aigrain and P. Joly, The automatic real-time analysis of film editing and transition effects and its applications, Computer Graphics 18(1), 93-103 (1994).
4. A. Nagasaka and Y. Tanaka, Automatic video indexing and full-video search for object appearances, in Visual Database Systems II, E. Knuth and L. M. Wegner, eds, pp. 119-133. North-Holland, Amsterdam (1991).
5. B. Shahraray and D. C. Gibbon, Automatic generation of pictorial transcripts of video programs, Proc. IS&T/SPIE Conf. Digital Video Compression: Algorithms and Technologies, San Jose, pp. 512-519 (1995).
6. B.-L. Yeo and B. Liu, Rapid scene analysis on compressed video, IEEE Trans. Circuits Systems Video Tech. (to appear).
7. H. J. Zhang, A. Kankanhalli and S. W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems 1(1), 10-28 (1993).
8. D. Swanberg, C.-F. Shu and R. Jain, Knowledge guided parsing in video databases, Proc. IS&T/SPIE Conf. Storage and Retrieval for Image and Video Databases, San Jose, California (1993).
9. H. J. Zhang, S. W. Smoliar and J. H. Wu, Content-based video browsing tools, Proc. IS&T/SPIE Conf. Multimedia Computing and Networking, San Jose, California (1995).
10. M. M. Yeung, B.-L. Yeo, W. Wolf and B. Liu, Video browsing using clustering and scene transitions on compressed sequences, Proc. IS&T/SPIE Conf. Multimedia Computing and Networking, San Jose, 399-413 (1995).
11. M. Davis, Media streams: An iconic visual language for video annotation, Proc. Symp. Visual Languages, Bergen, Norway (1993).
12. L. A. Rowe, J. S. Boreczky and C. A. Eads, Indexes for user access to large video databases, Proc. IS&T/SPIE Conf. Storage and Retrieval for Image and Video Databases II, San Jose, California, 150-161 (1994).
13. Y. Gong, H. J. Zhang, H. C. Chuan and M. Sakauchi, An image database system with content capturing and fast image indexing abilities, Proc. Int. Conf. Multimedia Computing and Systems, Boston, Massachusetts, 121-130 (1994).
14. C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic and W. Equitz, Efficient and effective querying by image content, J. Intell. Inf. Systems 3, 231-262 (1994).
15. A. Pentland, R. W. Picard and S. Sclaroff, Photobook: Tools for content-based manipulation of image databases, Proc. IS&T/SPIE Conf. Storage and Retrieval for Image and Video Databases II, San Jose, California, 34-47 (1994).
16. F. Arman, A. Hsu and M. Y. Chiu, Content-based browsing of video sequences, Proc. ACM Multimedia '94, San Francisco, California (1994).
17. B. C. O'Connor, Selecting key frames of moving image documents: A digital environment for analysis and navigation, Microcomputers Inf. Management 8(2), 119-133 (1991).
18. A. G. Hauptmann and M. Smith, Text, speech and vision for video segmentation: The Informedia project, CMU Technical Report (1995).
19. H. J. Zhang, C. Y. Low and S. Smoliar, Video parsing using compressed data, Proc. IS&T/SPIE Conf. Image and Video Processing II, San Jose, California, 142-149 (1994).
20. L. Teodosio and W. Bender, Salient video stills: Content and context preserved, Proc. ACM Multimedia '93, Anaheim, California, 39-46 (1993).
21. Y. T. Tse and R. L. Baker, Global zoom/pan estimation and compensation for video compression, Proc. ICASSP '91, Vol. 4 (1991).
22. M. J. Swain and D. H. Ballard, Color indexing, Int. J. Comput. Vision 7, 11-32 (1991).
23. M. Miyahara and Y. Yoshida, Mathematical transform of (R,G,B) color data to Munsell (H,V,C) color data, Proc. SPIE Visual Commun. Image Process. 1001, 650-657 (1988).
24. M. Stricker and M. Orengo, Similarity of color images, Proc. IS&T/SPIE Conf. Storage and Retrieval for Image and Video Databases III, San Jose, California (1995).
25. H. Tamura, S. Mori and T. Yamawaki, Texture features corresponding to visual perception, IEEE Trans. Systems Man Cybernet. 6(4), 460-473 (1979).
26. J. Mao and A. K. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models, Pattern Recognition 25(2), 173-188 (1992).
27. D. Daneels, D. Van Compenhouk, W. Niblack, W. Equitz, R. Barber, E. Bellon and F. Fierons, Interactive outlining: An improved approach using active geometry features, Proc. IS&T/SPIE Conf. Storage and Retrieval for Image and Video Databases II, San Jose, California (1993).
28. E. M. Arkin et al., An efficiently computable metric for comparing polygonal shapes, IEEE Trans. Pattern Analysis Mach. Intell. 13(3), 209-226 (1991).
29. H. J. Zhang, S. Y. Tan, S. Smoliar and Y. Gong, Automatic parsing and indexing of news video, Multimedia Systems 2(6), 256-266 (1995).
30. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, New Jersey (1988).
31. H. J. Zhang and D. Zhong, A scheme for visual feature based image indexing, Proc. IS&T/SPIE Conf. Image and Video Processing III, San Jose, California, 36-46 (1995).
32. D. Zhong, Visual feature based image and video indexing and retrieval, M.Sc. Thesis, Institute of Systems Science, National University of Singapore (1995).
33. H. J. Zhang, Y. Gong, C. Y. Low and S. Smoliar, Image retrieval based on color features: An evaluation study, Proc. SPIE Photonics East Conference on Digital Storage and Archiving, Philadelphia (1995).
34. M. Mills, J. Cohen and Y. Y. Wong, A magnifier tool for video data, Proc. CHI '92, Monterey, California, 93-98 (1992).

About the Author--HONG JIANG ZHANG obtained his Ph.D. from the Technical University of Denmark in 1991 and his B.Sc. from Zhengzhou University, Zhengzhou, China, in 1982, both in Electrical Engineering. From December 1991 he was with the Institute of Systems Science, National University of Singapore, where he led the work on video/image content analysis, representation, indexing and retrieval. He joined the Broadband Information System Lab of Hewlett-Packard Labs, Palo Alto, in October 1995. His current research interests are in video/image content analysis and retrieval, interactive video, and image processing. He has published over 40 papers and book chapters in these areas and is a co-author of "Image and Video Processing in Multimedia Systems", a book published by Kluwer Academic Publishers. He serves on the program committees of several international conferences on multimedia. He is also a member of the Editorial Board of Kluwer's international journal "Multimedia Tools and Applications".

About the Author--JIANHUA WU obtained his B.E. and his first M.Sc. in computer science from Shanghai JiaoTong University, Shanghai, China, in 1984 and 1987, respectively. He obtained his second M.Sc. in Computer Science from the National University of Singapore and joined the Institute of Systems Science as a software engineer in 1993, where he has been working on the Video Classification Project. His research interests include databases, multimedia systems, natural language processing and client-server computing.

About the Author--DI ZHONG is currently a Ph.D. student at Columbia University. He obtained his B.E. and his first M.Sc. in computer science from Zhejiang University, Hangzhou, China, in 1990 and 1993, respectively. He studied at and obtained his second M.Sc. from the Institute of Systems Science, National University of Singapore, in 1996. His current research interests are in content-based image and video analysis and retrieval.

About the Author--STEPHEN WILLIAM SMOLIAR obtained his Ph.D. in Applied Mathematics and his B.Sc. in Mathematics from MIT. He has taught Computer Science at both the Technion, in Israel, and the University of Pennsylvania. He has worked on problems involving the specification of distributed systems and expert systems development. He is currently interested in the role of hypermedia in communication. From May 1991 until August 1994, he led a project on video classification at the Institute of Systems Science at the National University of Singapore. He is now managing Extended Media research at the Fuji-Xerox Palo Alto Laboratory.