Copyright © IFAC 12th Triennial World Congress, Sydney, Australia, 1993

AUTOMATIC VIDEO PARTITIONING AND INDEXING

H. J. Zhang, A. Kankanhalli and S. W. Smoliar
Institute of Systems Science, National University of Singapore, Heng Mui Keng Terrace, Kent Ridge, Singapore 0511, Republic of Singapore

Abstract. Partitioning a video source into meaningful segments is an important first step towards achieving automatic indexing. In this paper we discuss video partitioning algorithms based on image analysis techniques. Experimental results are presented and used to evaluate the algorithms. Future work in utilizing computer vision techniques for video indexing is outlined.

Key words. Video indexing; image processing; multimedia; information retrieval.

1. INTRODUCTION

Recently, video has become an important element of multimedia computing environments. The time-dependent nature of video, however, makes it a very difficult medium to represent, model, edit, retrieve and manipulate. The impediment to easy and effective organization and retrieval of information in video sources is the lack of good indexing facilities. Unfortunately, when it comes to solving this problem, the intuitions we have acquired through the study of information retrieval do not translate very well into non-text media.

Clearly, research is required to improve this situation; but relatively little has been achieved towards the development of video handling tools [Tonomura 1991]. The Video Classification Project, at the Institute of Systems Science of the National University of Singapore, is an effort to change this situation. It aims to develop an intelligent system that can automatically classify the content of a given video package. The tangible result of this classification will consist of an index structure and a table of contents. In this way a video package will become more like a book to anyone interested in accessing specific information.

In this paper we discuss our work on developing computer tools for automatic video partitioning and retrieval based on image processing technologies. We first present difference metrics for video partitioning and a twin-comparison technique for detecting gradual transitions. Automatic determination of the threshold used in applying the difference metrics is addressed in Section 2.3. Experimental partitioning results are presented in Section 3. Finally, we review currently outstanding research issues.

2. VIDEO PARTITIONING TECHNIQUES

The first step in automating video indexing is to develop tools which identify the basic elements to be indexed. By way of analogy to sentences in a body of text, these elements will take the form of continuous segments from the time-line of the source video. Appropriate index terms may then be assigned to each such segment, based on a variety of approaches to content analysis.

2.1 Difference metrics

A natural way to define a video segment is as a single, uninterrupted camera shot. This reduces the partitioning task to detecting the boundaries between consecutive camera shots. The simplest transition is a camera break, where there is a significant qualitative difference across the boundary. If that difference can be expressed by a suitable metric, then a segment boundary can be declared whenever the metric exceeds a given threshold. Hence, establishing useful metrics and techniques for applying them are the fundamental tasks for automatic partitioning of video packages.

The most suitable metrics for video partitioning are based on the comparison of pixel intensity histograms of two frames. The principle behind these metrics is that two frames with an unchanging background and unchanging objects will show little difference in their overall intensity distributions. Let H_i(j) denote the histogram for the ith frame, where j is one of the G possible pixel values. Then the difference between the ith frame and its successor is given by

    SD_i = Σ_{j=1}^{G} |H_i(j) − H_{i+1}(j)|        (1)

If SD_i is larger than a given threshold, a segment boundary is declared. Fig. 1 illustrates the application of histogram comparison to a documentary video: the graph displays the sequence of SD_i values defined by (1).

Fig. 1. A sequence of frame-to-frame histogram differences obtained from a documentary video, where differences corresponding to both camera breaks and transitions implemented by special effects can be observed.
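The histogram comparison of (1), and the 6-bit color-code variant discussed in Section 2.1, can be sketched as follows. This is a minimal illustration using NumPy; the function names, the 64-bin choice for the gray-level case, and the synthetic frames are our own assumptions, not part of the original system.

```python
import numpy as np

def gray_histogram(frame: np.ndarray, bins: int = 64) -> np.ndarray:
    """Histogram H_i(j) of pixel intensities for one frame (values 0..255)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist

def histogram_difference(frame_a: np.ndarray, frame_b: np.ndarray, bins: int = 64) -> int:
    """SD_i = sum_j |H_i(j) - H_{i+1}(j)|, the metric of equation (1)."""
    ha = gray_histogram(frame_a, bins)
    hb = gray_histogram(frame_b, bins)
    return int(np.abs(ha - hb).sum())

def color_code_histogram(frame_rgb: np.ndarray) -> np.ndarray:
    """64-bin histogram over a 6-bit color code: the two most significant
    bits of each of R, G and B, as suggested in Section 2.1."""
    r = frame_rgb[..., 0] >> 6
    g = frame_rgb[..., 1] >> 6
    b = frame_rgb[..., 2] >> 6
    code = (r << 4) | (g << 2) | b
    hist, _ = np.histogram(code, bins=64, range=(0, 64))
    return hist

# A camera break: two frames with very different intensity distributions.
dark = np.full((120, 160), 30, dtype=np.uint8)
bright = np.full((120, 160), 200, dtype=np.uint8)
# Nearly identical frames, as within a single camera shot.
noisy = dark + np.random.default_rng(0).integers(0, 3, size=dark.shape).astype(np.uint8)

assert histogram_difference(dark, bright) > histogram_difference(dark, noisy)
```

Comparing histograms rather than individual pixels is what makes the metric tolerant of small object motion within a shot: a moving object changes which pixels hold which values, but barely changes the overall distribution.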

The differences in Fig. 1 are computed between every two consecutive frames over an excerpt from this source. The graph exhibits two high pulses which correspond to two camera breaks. If an appropriate threshold is set, the breaks can be easily detected. Nagasaka and Tanaka (1991) have also used the following χ²-test formula,

    SD_i = Σ_{j=1}^{G} (H_i(j) − H_{i+1}(j))² / H_{i+1}(j)        (2)

to make the histogram comparison reflect the difference between two frames more strongly. However, our experiments show that the overall performance is not necessarily better than using (1), while (2) also requires more computation time.

One approach to increasing segmentation accuracy is to consider the intensity distributions of the individual color channels. A simple but effective approach is to compare histograms based on a color code [Nagasaka and Tanaka 1991]: instead of gray levels, j in (1) denotes a code value derived from the three color intensities of a pixel. To reduce the computation task, we can choose only the two or three most significant bits of each color component to compose a color code. We shall see in Section 3 that a 6-bit code, providing 64 bins, gives sufficient accuracy.

2.2 Detecting gradual transitions and camera movements

Techniques more sophisticated than simple camera breaks include dissolve, wipe, fade-in, and fade-out. Such effects involve much more gradual changes between consecutive frames than does a camera break, which downgrades the power of a simple difference metric and a single threshold for camera break detection. In addition, the changes introduced by camera movements, such as pan and zoom, may be of the same order as those introduced by such gradual transitions, which further complicates detecting the boundaries of camera shots.

An example of a dissolve is represented by the inset of the graph in Fig. 1: a sequence of difference values which are higher than those of their neighbors (outside the inset) but significantly lower than the cutoff threshold. It is difficult to detect such gradual transitions using any of the difference metrics. If we lower the threshold, false detections will increase dramatically, because the difference values between some transition frames may be smaller than those which occur between the frames within a camera shot. The problem is that a single threshold value is being made to account for all segment boundaries, regardless of context, which is asking too much of a single number.

To solve this problem, a so-called twin-comparison approach has been developed [Zhang et al. 1992]. This approach introduces a reduced threshold to detect the potential frames where a gradual transition may occur; then, the difference metric is used to compare the first potential transition frame with each following frame until the cumulative difference exceeds a second threshold. This interval is then interpreted as the extent of the transition. The key idea of twin-comparison is thus that two distinct threshold conditions must be satisfied at the same time. Furthermore, the algorithm is designed in such a way that gradual transitions are detected in addition to ordinary camera breaks.

Unfortunately, since camera movements tend to induce successive difference values of the same order as those of gradual transitions, they may be falsely detected as such transitions. It is necessary to distinguish changes associated with those transitions from changes introduced by camera panning or zooming. The specific feature which serves to detect camera movements is optical flow. This feature may be computed by the block-matching algorithm developed for motion compensation in video compression. The distribution of motion vectors resulting from camera panning should exhibit a single strong modal value which will correspond to the movement of the camera. On the other hand, the field of motion vectors resulting from zooming has its minimum value at the focus center. Our algorithm can detect these particular types of motion vector fields and thereby distinguish changes introduced by camera movements from those due to gradual transitions [Zhang et al. 1992]. In the future we anticipate foregoing the computation of motion vectors by using chips developed for video compression under the H.261 or MPEG standards.

2.3 Threshold selection

Selection of appropriate threshold values is a key issue in applying the segmentation metrics. Thresholds must be assigned which tolerate variations in individual frames while still ensuring a desired level of performance. Automatic selection is based on analyzing all frame-to-frame differences over an entire video source. If there is no camera shot change or camera movement in a video sequence, the frame-to-frame difference value can only be due to noise, which may be assumed to be Gaussian. This means the distribution of frame-to-frame differences can be decomposed into a sum of two parts: the Gaussian noise and the differences introduced by camera breaks, gradual transitions, and camera movements. Thus, threshold selection requires distinguishing these two types of differences.

Let σ be the standard deviation and μ the mean of the frame-to-frame differences. If the only departure from μ is due to Gaussian noise, then a normal distribution will account for most of the frames within a few standard deviations of the mean value. In other words, the frame-to-frame differences from the non-transition frames will fall in the range 0 to μ + ασ, for a small constant value α. Therefore, the threshold Tb can be selected as

    Tb = μ + ασ        (3)

3. EXPERIMENT AND EVALUATION

The output of the segmentation process applied to a given video package is a sequence of segment boundaries. These boundaries can be compared with those detected manually as a basis for evaluation. Our experiment and evaluation are based on a twenty-minute documentary about the Faculty of Engineering at the National University of Singapore. The entire documentary consists of about 200 camera shots, and the transitions are implemented by both dissolves and camera breaks. There are also several sequences of camera panning and zooming, as well as object motion. This video thus provides suitable material for test and evaluation of algorithms for both camera break and gradual transition detection, as well as our technique for camera movement detection.

The twin-comparison approach was applied to detect gradual transitions as well as camera breaks, and the segmentation results are summarized in Table 1. The transitions and breaks detected algorithmically, as well as those missed and misdetected, are listed.

    Algorithms used               Camera breaks       Transitions + camera movements
                                  Nd    Nm    Nf      Nd    Nm    Nf
    Gray level comparison         65    13    2       101   8     9+3
    χ² gray level comparison      60    18    16      93    16    9+5
    Color code comparison         73    5     3       95    14    13+2

Table 1. Detection results of applying twin-comparison with three difference metrics. Source: a twenty-minute documentary produced at the National University of Singapore. Nd: number of successful detections. Nm: number of detections missed by the algorithm. Nf: number of false detections identified by the algorithm. In the transition results Nf also includes the false detections arising from camera movements (second number).

The first two lines of Table 1 are results based on gray level histograms, using difference metrics (1) and (2), respectively. In the third line difference metric (1) was applied to histograms of the 6-bit color code. Table 1 demonstrates that histogram comparisons, based on either gray level or color code, give very high accuracy in detecting both camera breaks and gradual transitions. In fact, no effort was made to tune the thresholds to obtain these data. About 90% of the breaks and transitions are correctly detected. Among the three algorithms, color gives the most promising result: besides the high accuracy, it is also the fastest of the three. Note that the χ²-test histogram comparison algorithm does not yield a better result, even after tuning the threshold. This is contrary to the conclusions of [Nagasaka and Tanaka 1991], where no experimental data were presented.

The false camera break detections are mainly due to sharp changes in lighting arising from flashing lights and flickering of some objects in the image. The missed camera breaks result from computed differences which are lower than the given threshold. Turning from camera breaks to gradual transitions, we discovered that the missed gradual transitions are again due to threshold selection problems. Movements of individual objects cannot currently be detected by our optical flow analysis, and they are the main source of false detections. An alternative approach is to identify a moving object as a cluster in the motion field and then use that cluster of vectors to track the object. One can assume that a transition will not take place in the middle of an object's trajectory and thus eliminate false detections.
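The combination of automatic threshold selection (3) with the twin-comparison scan of Section 2.2 can be sketched as follows. This is a simplified illustration under our own assumptions: the function names and threshold values are invented, and the sketch accumulates consecutive differences as a stand-in for the first-to-current frame comparison of the full algorithm [Zhang et al. 1992], which also consults optical flow to reject camera movements.

```python
import numpy as np

def select_threshold(diffs, alpha=3.0):
    """Equation (3): Tb = mu + alpha * sigma over all frame-to-frame differences."""
    return float(np.mean(diffs) + alpha * np.std(diffs))

def twin_comparison(diffs, t_break, t_candidate):
    """Return (breaks, transitions): break frame indices and (start, end) pairs.

    t_break:     high threshold for camera breaks.
    t_candidate: reduced threshold marking potential transition frames.
    A gradual transition is declared when the difference accumulated from
    the candidate start frame exceeds t_break.
    """
    breaks, transitions = [], []
    start, acc = None, 0.0
    for i, d in enumerate(diffs):
        if d >= t_break:                      # ordinary camera break
            breaks.append(i)
            start, acc = None, 0.0
        elif d >= t_candidate:                # inside a potential transition
            if start is None:
                start, acc = i, 0.0
            acc += d
            if acc >= t_break:                # cumulative difference test
                transitions.append((start, i))
                start, acc = None, 0.0
        else:                                 # candidate run ended quietly
            start, acc = None, 0.0
    return breaks, transitions

# Synthetic difference sequence: quiet shot, one camera break, then a
# gradual dissolve of moderate frame-to-frame differences.
diffs = [1, 2, 1, 60, 1, 2, 8, 9, 10, 9, 8, 2, 1]
t_auto = select_threshold(diffs)              # about 54 for this sequence
breaks, transitions = twin_comparison(diffs, t_break=40, t_candidate=5)
assert breaks == [3]
assert transitions == [(6, 10)]
```

The two thresholds play the roles described in the text: the break at frame 3 trips the high threshold in a single step, while the dissolve over frames 6 to 10 only exceeds it cumulatively.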


Thus, motion analysis is an important issue in future improvement of the partitioning techniques.
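The pan/zoom discrimination described in Section 2.2 can be approximated directly from a block-matching motion field: panning yields one strong modal vector shared by most blocks, while zooming yields vectors whose magnitude grows with distance from the focus center. The sketch below classifies a synthetic field on that basis; the decision heuristics and the 0.8 and 0.5 constants are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def classify_motion_field(vectors: np.ndarray) -> str:
    """Classify a field of block motion vectors, shape (rows, cols, 2).

    'pan'  -> most vectors agree with a single strong modal direction.
    'zoom' -> vector magnitudes grow with distance from the block of
              minimum motion (the focus center).
    """
    mags = np.linalg.norm(vectors, axis=-1)
    mean_v = vectors.reshape(-1, 2).mean(axis=0)
    if np.linalg.norm(mean_v) > 0.8 * mags.mean():   # single dominant direction
        return "pan"
    # Zoom check: correlate magnitude with distance from the minimum-motion block.
    r0, c0 = np.unravel_index(np.argmin(mags), mags.shape)
    rows, cols = np.indices(mags.shape)
    dist = np.hypot(rows - r0, cols - c0)
    corr = np.corrcoef(dist.ravel(), mags.ravel())[0, 1]
    return "zoom" if corr > 0.5 else "other"

# Pan: every block moves with (nearly) the same vector.
pan = np.tile(np.array([4.0, 1.0]), (8, 8, 1))
# Zoom centered on block (4, 4): vectors point away, growing with radius.
rows, cols = np.indices((8, 8))
zoom = np.stack([(rows - 4).astype(float), (cols - 4).astype(float)], axis=-1)
assert classify_motion_field(pan) == "pan"
assert classify_motion_field(zoom) == "zoom"
```

Fields that match neither pattern, such as those produced by independently moving objects, fall into the "other" class, which is exactly the case the object-tracking work discussed below aims to handle.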

Our experiments show that those false detections of transitions in Table 1 which are due to camera movements are correctly identified as camera movements by our optical flow analysis. This demonstrates that motion detection and analysis are effective in distinguishing camera movements from gradual transitions with a high degree of accuracy. While identifying such distinctions was our primary purpose, we have also tested the applicability of the camera movement detection algorithm on the entire video source to classify camera movements. The results were very satisfactory [Zhang et al. 1992].

4. VIDEO INDEXING: CHALLENGES AND FUTURE WORK

The task of video indexing may be viewed as a two-step process: segmentation and index construction. The techniques presented in Section 2 can be used to break down a given body of source material into clips (camera shots) or episodes (sequences of camera shots). Then, content analysis needs to be performed on individual segments to identify appropriate index terms. Automatic video partitioning is thus only the first step in our effort towards computer-assisted video indexing. The second step of content analysis is obviously a more challenging task, and more sophisticated image processing techniques will be required to provide useful tools.

Before such an index can be developed, it is first necessary to work out an architectural specification. If one is working with a dynamic medium like video, any index which ultimately takes the form of a printed document is likely to be of limited value. Thus, it is preferable to view the index itself as a piece of computer software through which the user may interact with the video source material. This is a vision we share with the InfoScope project [Swanberg et al. 1992], along with their specification of an architecture which integrates databases, knowledge bases, and vision systems. Thus, it is necessary to analyze the contributions which are most desirable from each of these classes of software. The contributions of database and knowledge base technologies have been discussed elsewhere [Smoliar 1992]; so this section concentrates on further contributions anticipated from vision systems.

Most important is that we plan to develop our study of the use of object tracking techniques. For instance, if we can extract a moving object from its background and track its motion, we should then be able to construct a description of that motion which may be employed for subsequent retrieval of the camera shot. There is also the value of identifying the object, once it has been extracted; but even without sophisticated identification techniques, merely constructing an icon from the extracted image may serve as a valuable visual index term. Also, as was observed in Section 3, object motion is a major obstacle to segmentation and detection of camera movement. Therefore, the next phase of our project will involve an extensive study of object tracking algorithms.

As more and more video sources become available in compressed digital form, it would be advantageous to perform video partitioning directly on that digital source, saving the computational cost of decompressing every frame. A difference metric based on the DCT (Discrete Cosine Transform) coefficients has been developed and used for partitioning video compressed in the JPEG standard [Arman et al. 1993]. We are currently carrying out a study of using the same metric on video compressed by MPEG and other motion video compression standards. In addition, the vectors used for motion compensation in video compression may benefit object tracking.

Another important source of information in most video packages is the audio track. As any filmmaker knows, the audio signal provides a very rich source of information to supplement the understanding of any video source; and this information may also be engaged for tasks of segmentation and indexing. For instance, significant changes in spectral content may serve as segment boundary cues. Tracking audio objects will also provide useful information for segmentation and indexing. Therefore, an effective analysis of the audio track and its integration with information obtained from image analysis will be an important part of our future work on video segmentation and indexing.

5. ACKNOWLEDGEMENTS

The authors wish to thank Professor Louis Pau for his many fruitful discussions about the Video Classification Project.

6. REFERENCES

Arman, F., A. Hsu, and M.-Y. Chiu (1993). Feature Management for Large Video Databases. Proc. SPIE Conf. on Storage and Retrieval for Image and Video Databases, San Diego.

Nagasaka, A. and Y. Tanaka (1991). Automatic Video Indexing and Full-Video Search for Object Appearances. Proc. 2nd Working Conference on Visual Database Systems, pp. 119-133.

Smoliar, S. W. (1992). Applying Knowledge Representation Technology to the Management of Video Information. Submitted to IJCAI 93.

Swanberg, D., C. F. Shu, and R. Jain (1992). Architecture of a Multimedia Information System for Content-Based Retrieval. Proc. Third International Workshop on Network and Operating Systems Support for Digital Audio and Video, pp. 345-350.

Tonomura, Y. (1991). Video Handling Based on Structured Information for Hypermedia Systems. Proc. International Conference on Multimedia Information Systems, Singapore, pp. 333-344.

Zhang, H. J., S. W. Smoliar, and A. Kankanhalli (1992). Automatic Partitioning of Video. Submitted to Multimedia Systems.
