Computers & Graphics, Vol. 22, No. 6, pp. 675–685, 1998
© 1999 Elsevier Science Ltd. All rights reserved. Printed in Great Britain
PII: S0097-8493(98)00088-0    0097-8493/98/$ - see front matter
Graphics in/for Digital Libraries
INTEGRATED INFORMATION MINING FOR TEXTS, IMAGES, AND VIDEOS

O. HERZOG†, A. MIENE, TH. HERMES and P. ALSHUTH

Image Processing Department, TZI – Center for Computing Technologies, University of Bremen, Bremen, Germany

† Corresponding author. E-mail: [email protected].
‡ ImageMiner is a trademark of IBM Corp.

Abstract: The large amount and the ubiquitous availability of multimedia information (e.g., video, audio, image, and also text documents) require efficient, effective, and automatic annotation and retrieval methods. As videos start to play an even more important role in multimedia, content-based retrieval of videos becomes an issue, especially as there should be an integrated methodology for all types of multimedia documents. Our approach for the integrated retrieval of videos, images, and text comprises three necessary steps: first, the detection and extraction of shots from a video; second, the construction of a still image from the frames in a shot, achieved by the extraction of key frames or by a mosaicing technique; and third, the analysis of the resulting single-image visualization of a shot by the ImageMiner‡ system. The ImageMiner system was developed in cooperation with IBM at the University of Bremen in the Image Processing Department of the Center for Computing Technologies. It realizes the content-based retrieval of single images through a novel combination of techniques and methods from computer vision and artificial intelligence. Its output is a textual description of an image, and thus, in our case, of the static elements of a video shot. In this way, the annotations of a video can be indexed with standard text retrieval systems, along with text documents or annotations of other multimedia documents, thus ensuring an integrated interface for all kinds of multimedia documents. © 1999 Elsevier Science Ltd. All rights reserved
1. INTRODUCTION
Digital media originating from images, audio, video, and text are a comparatively new kind of data in today's information systems. Although it is well known how to index and retrieve text documents, the same task is very difficult for, e.g., images or single sequences out of long videos. It is the aim of this paper to contribute to the research in the automatic analysis of graphical data, such as videos, and to extend it to a semantic level for static properties.

There are several well-known systems for the analysis and retrieval of multimedia data which mainly concentrate on non-textual graphical data, such as color and texture vector information, or on video cut detection. The ART MUSEUM [13] is used to find images from a database which contains only images of artistic paintings and photographs. The algorithm for a sketch retrieval and/or a similarity retrieval is based on graphical features. A user can formulate a query by using sketches, which are taken from templates or which can be drawn. The PHOTOBOOK [24] system is a set of interactive tools for browsing and searching single
images and video sequences. A query is based on image content rather than on text annotations. The VIRAGE VIDEO ENGINE [10] uses video-specific data, like motion, audio, closed captions, etc., to build up an information structure representing the content information of a video. A user can formulate queries to retrieve, e.g., commercials, scenes with special camera motions or a talking head, or just a scene denoted by a short text. One of the first image retrieval projects is QBIC. Using an interactive graphical query interface, a user can draw a sketch to find images with similar shapes, find images with colors or textures positioned at specific places, or denote an object motion for the video domain [22].

In this paper we concentrate on MPEG videos and describe a special algorithm for a fast automatic shot detection based on the differences of the chrominance and luminance values. This shot detection is a first step towards a logical segmentation of a video, which finally leads to the selection of key frames or the generation of a single image using a fast mosaicing technique. A video is then indexed by textual information describing the camera parameters of the shots and also the content of the key frames or the mosaic images. The ImageMiner system [17] can be used to process the representative frames for color, texture, and contour features.
2. SHOT DETECTION IN MPEG VIDEOS
To support videos in a multimedia retrieval system, the high number of frames in a video must be reduced in order to remove the enormous amount of redundant information. As the video data is not structured by tags, the frames are grouped into semantic units in a first step through an automatic shot analysis which detects cuts in the video stream. Basically, there are two different methods to perform a shot detection: using DCT coefficients, or determining the differences in the color distribution of successive frames in the video stream.

Using the DCT coefficients (discrete cosine transformation coefficients) is a very fast way to do the shot detection, because the compressed image data is used directly [29]. The basic idea of the MPEG coding mechanism is to predict motion from frame to frame and to use DCTs to organize the redundancy in the temporal direction. The DCTs are done on 8 x 8 blocks, and the motion prediction is done in the luminance (Y) channel on 16 x 16 blocks. For the shot detection, the DC images are used to perform the comparison. The extraction of a DC image from an I-frame of an MPEG stream is very fast, because only a very small fraction of the entire video data is used. MPEG streams use the DCT for block coding, which provides enough information for finding segments or groups of shots. The first component of a transformed block is known as the DC value, which is proportional to the average of all the pixel values in the raw block. This can be seen by analyzing the transformation expression at location (0,0), which is

    C(0,0) = \frac{1}{8} \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)        (1)

where C(0,0) is the DC term and f(x,y) are the related pixel values. This information provides the average intensity of the block. Due to the small size of the DC image, the computation of the histogram differences between two frames is very fast. Since the image area is smaller than the original image, cut detection algorithms also become less sensitive to the camera or object motion found in a typical shot. For a detailed description see [28] and [2].

Patel and Sethi use the DC coefficients of I-frames to treat the problem of cut detection as a statistical hypothesis test on luminance histograms [23]. The exact location of abrupt changes cannot be found with this method, because P- and B-frames are not analyzed. Liu and Zick [18] make use of only the information in P- and B-frames for the detection. Meng et al. [20] use the variance of the DC coefficients in I- and P-frames, together with motion vector information, to find the cut points.
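As an illustration of the DC-image idea (this sketch is not from the paper), the following Python fragment computes the DC image of a grayscale frame directly from its 8 x 8 blocks according to Equation (1) and compares two successive DC images by a simple histogram difference; the frame arrays, the bin count, and the comparison measure are assumptions made for the example.

```python
import numpy as np

def dc_image(frame: np.ndarray) -> np.ndarray:
    """Reduce a grayscale frame to its DC image: every 8 x 8 block is
    replaced by (1/8) * sum of its pixel values, cf. Equation (1)."""
    h, w = frame.shape
    h8, w8 = h // 8, w // 8
    blocks = frame[:h8 * 8, :w8 * 8].astype(float).reshape(h8, 8, w8, 8)
    return blocks.sum(axis=(1, 3)) / 8.0

def dc_histogram_difference(dc_a: np.ndarray, dc_b: np.ndarray, bins: int = 64) -> int:
    """Sum of absolute bin-wise differences of the DC-image histograms
    (a simple stand-in for the comparison discussed in [28, 2])."""
    lo, hi = 0.0, 8.0 * 255.0          # DC values of 8-bit frames lie in this range
    ha, _ = np.histogram(dc_a, bins=bins, range=(lo, hi))
    hb, _ = np.histogram(dc_b, bins=bins, range=(lo, hi))
    return int(np.abs(ha - hb).sum())
```

A cut candidate would then be flagged whenever this difference exceeds a chosen threshold.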
In contrast to these methods, our approach to shot detection is based on an analysis of the differences in chrominance and luminance of every two succeeding frames [5]. The color values U and V of the chrominance are treated separately. In a first step, we sum up the luminance and chrominance values. To achieve resolution independence, the total number of macro blocks serves as a normalization factor. In the standard MPEG-1 resolution of 352 x 288 pixels, each frame consists of 396 macro blocks:

    YSum_norm = \frac{1}{396} \sum_{i=0}^{N} Y_i        (2)

    USum_norm = \frac{1}{396} \sum_{i=0}^{N/4} U_i        (3)

    VSum_norm = \frac{1}{396} \sum_{i=0}^{N/4} V_i        (4)
Let frames A and B be two directly succeeding frames. Then their differences in the Y, U, and V values,

    YDiff = YSum_norm(B) - YSum_norm(A)        (5)

    UDiff = USum_norm(B) - USum_norm(A)        (6)

    VDiff = VSum_norm(B) - VSum_norm(A)        (7)
are compared with the thresholds Th_Y, Th_U, and Th_V for the Y, U, and V differences. A shot boundary between frames A and B is detected if the following condition holds:

    (YDiff > Th_Y \wedge UDiff > Th_U \wedge VDiff > Th_V)
    \vee (YDiff > 2 Th_Y \wedge UDiff > 2 Th_U)
    \vee (YDiff > 2 Th_Y \wedge VDiff > 2 Th_V)
    \vee (UDiff > 2 Th_U \wedge VDiff > 2 Th_V)        (8)

The use of all frames of the stream guarantees precise shot boundaries, and the calculation takes only a small amount of extra time compared to the decoding of the MPEG stream. In the same step, the camera motion within a shot can be analyzed. For this task we use the motion estimation part of the MPEG format in order to automatically detect camera pans or tilts in a shot. This information can be used later on to create a mosaic image from a shot exhibiting camera movement.
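The decision rule of Equations (2)-(8) can be sketched in a few lines of Python; the per-frame Y, U, and V value sequences, the use of absolute differences, and the threshold values are assumed inputs, so this is only a minimal illustration of the rule, not the implementation used in [5].

```python
MACRO_BLOCKS = 396   # MPEG-1, 352 x 288 pixels -> 396 macro blocks per frame

def normalized_sums(y, u, v):
    """Normalized luminance and chrominance sums of one frame, Equations (2)-(4)."""
    return (sum(y) / MACRO_BLOCKS, sum(u) / MACRO_BLOCKS, sum(v) / MACRO_BLOCKS)

def is_shot_boundary(frame_a, frame_b, th_y, th_u, th_v):
    """Decision rule of Equation (8) for two succeeding frames.

    frame_a and frame_b are (y, u, v) tuples of value sequences; the
    thresholds th_y, th_u, th_v are assumed to be chosen empirically.
    """
    ya, ua, va = normalized_sums(*frame_a)
    yb, ub, vb = normalized_sums(*frame_b)
    dy, du, dv = abs(yb - ya), abs(ub - ua), abs(vb - va)   # Equations (5)-(7)
    return ((dy > th_y and du > th_u and dv > th_v)
            or (dy > 2 * th_y and du > 2 * th_u)
            or (dy > 2 * th_y and dv > 2 * th_v)
            or (du > 2 * th_u and dv > 2 * th_v))
```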
3. GENERATING STILL IMAGES FROM VIDEO SHOTS
As the result of the shot detection, a video is divided into several shots. Each shot is now treated as one unit. From these units, only some frames should be analyzed by an image analysis system to derive information about color, texture, and contours. For browsing purposes, the computation of indices, or other analysis functions, two different kinds of images can be used: significant key frames or generated mosaic images. The key frame method is in principle suitable for arbitrary shots, but choosing the most significant frames is a difficult task. The mosaicing technique condenses the amount of video data without losing information, but it requires the right kind of camera movement within the shot.

3.1. Key frame extraction
Key frames are used to represent the frames of a given shot and can be used for retrieval purposes. As key frames are simply frames of the video, there is no need to store them separately. There are many well-known approaches, among others the selection of the first and the last frame of a shot, or of every n-th frame of a shot, e.g., every second frame. The advantage of this key frame technique is obvious: its simplicity. But it has two important disadvantages. The first is that it cannot be guaranteed that, for example, the first and the last frame of a shot represent its key information. The second is that by choosing every second frame of a shot, the enormous amount of video data is reduced only by a factor of two.

To cope with these problems, we again use the difference of the chrominance and luminance values and the information about the motion which is available from the shot analysis phase. On the assumption that the camera concentrates on objects or on parts of a scene which carry the important message, we use the following heuristic to determine significant key frames: for each shot, take the frame where the motion in the scene as well as the difference in chrominance and luminance to the neighboring frames is minimal.

In some cases it is not necessary to extract one frame for each shot. In an interview, e.g., a first shot might show the interviewer and the next one the interview partner, and so on. In such a case we use the results of the shot clustering described in the following and extract one key frame for each cluster instead of one for each shot. The next step after the shot detection is to group successive shots into clusters based on visual similarity. A time-constrained clustering uses a time measure as a distance to reduce the comparison effort: only those shots are compared which lie within certain time boundaries [30].
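The key-frame heuristic can be summarized by the following sketch; the per-frame motion and chrominance/luminance difference values are assumed to be supplied by the shot-analysis phase, and the way the two measures are combined into a single score is an assumption made for the example.

```python
def select_key_frame(shot):
    """Return the index of the 'most static' frame of a shot.

    `shot` is a list of per-frame records with the fields `motion`
    (estimated motion in the scene) and `yuv_diff` (difference in
    chrominance and luminance to the neighboring frames); both are
    assumed to come from the shot-analysis phase.
    """
    def stillness(indexed_frame):
        _, record = indexed_frame
        return record["motion"] + record["yuv_diff"]   # combined score (assumption)

    best_index, _ = min(enumerate(shot), key=stillness)
    return best_index
```

For shots grouped by the time-constrained clustering, the same selection can be applied to the frames of a whole cluster instead of a single shot.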
3.2. Mosaicing
The basic idea of the mosaicing technique is the creation of a single image for each shot in the entire video sequence. This image then constitutes a graphical index of the complete information of the scene. However, as explained later in this section, this approach does not work in all cases; fortunately, the missing cases can be covered by the key frame approach. In order to create a mosaic image, all the images in a shot have to be aligned with respect to the coordinate transformation (motion) from image to image. However, if the directions of the object movements are disparate, the mosaicing technique must fail, as it is based on the projective flow method. The following sections describe the basic ideas of the proposed mosaicing procedure, the algorithm, and the first results obtained with this approach.

3.2.1. Coordinate transformation. We consider two frames taken at time t and t' = t + 1. The coordinate transformation maps the image coordinates of the first frame, X = (x, y)^T, to a new set of coordinates X' = (x', y')^T at time t', corresponding to the second frame. Finding the related coordinates relies on the assumption of a transformation model. The most common models and their transformations are shown in Table 1, taken from [19]. The implemented algorithm, which was introduced by [19], is based on the projective flow model. It determines the eight parameters which are necessary to take into account all possible camera motions (zoom, rotate, pan, and tilt).

Table 1. Image coordinate transformation models

Model               Coordinate transformation from X to X'                             Parameters
Translation         X' = X + b                                                         b \in R^{2x1}
Affine              X' = AX + b                                                        A \in R^{2x2}, b \in R^{2x1}
Bilinear            x' = q_{x'xy} xy + q_{x'x} x + q_{x'y} y + q_{x'};                 q_i \in R
                    y' = q_{y'xy} xy + q_{y'x} x + q_{y'y} y + q_{y'}
Projective          X' = (AX + b) / (C^T X + 1)                                        A \in R^{2x2}, b, C \in R^{2x1}
Pseudoperspective   x' = q_{x'x} x + q_{x'y} y + q_{x'} + q_a x^2 + q_b xy;            q_i \in R
                    y' = q_{y'x} x + q_{y'y} y + q_{y'} + q_a xy + q_b y^2

3.2.2. Projective flow method. The brightness constancy constraint equation contains the optical flow velocities u_f and v_f. They contain the information how two successive images are related to each other:

    u_f E_x + v_f E_y + E_t = 0        (9)

E_x and E_y are the spatial derivatives and E_t is the temporal derivative of the intensity (brightness) for
each point in the image. u_f and v_f are the optical flow in the horizontal and vertical direction, respectively. To solve the underdetermination problem (one equation for two unknown parameters), it is common practice to compute the flow over some neighborhood. This means that it is computed for at least two pixels, but it is also possible to use the whole image, as is done in this approach. Using the projective flow model for the transformation, we can compute the new coordinates X' = (x', y')^T by

    X' = \frac{A (x,y)^T + b}{C^T (x,y)^T + 1}        (10)

where A \in R^{2x2} and b, C \in R^{2x1} are the parameters describing the transformation. The optical flow which can be derived from the above equation is the model velocity with its components u_m and v_m. Minimizing the sum of the squared differences between the flow velocity and the model velocity, and expanding the result into a Taylor series using only the first three terms, leads to a formula corresponding to the bilinear model:

    u_m + x = q_{x'xy} xy + q_{x'x} x + q_{x'y} y + q_{x'}
    v_m + y = q_{y'xy} xy + q_{y'x} x + q_{y'y} y + q_{y'}        (11)
When these two terms for the model velocity are inserted into the brightness constancy Equation (9), the result is a set of eight linear equations with eight unknown parameters. Finally, we obtain the eight approximate parameters q_k (k = 1, ..., 8), which have to be related to the eight exact parameters of the projective flow model.

3.2.3. The 'four-point method'. We use four points in the first frame; these could be the four corners of the image, s = (s_1, s_2, s_3, s_4). In order to determine their positions in the second frame, we apply the approximate parameters to these points:

    r_{k,x} = u_m(s_k) + s_{k,x},    r_{k,y} = v_m(s_k) + s_{k,y}        (12)

The result is a new vector r = (r_1, r_2, r_3, r_4). Its components are the coordinates of the four selected points calculated with the model flow u_m, v_m. The correspondence between r and s gives, for each point, two linear equations:

    \begin{pmatrix} x'_k \\ y'_k \end{pmatrix} =
    \begin{pmatrix} x_k & y_k & 1 & 0 & 0 & 0 & -x_k x'_k & -y_k x'_k \\
                    0 & 0 & 0 & x_k & y_k & 1 & -x_k y'_k & -y_k y'_k \end{pmatrix}
    (a_{x'x}, a_{x'y}, b_{x'}, a_{y'x}, a_{y'y}, b_{y'}, c_x, c_y)^T        (13)

where 1 <= k <= 4 is the index of the point, (x_k, y_k) = s_k, and (x'_k, y'_k) = r_k. Taking into account all four points, we have eight linear equations for the eight unknown parameters. The solution of these equations is the vector P = (a_{x'x}, a_{x'y}, b_{x'}, a_{y'x}, a_{y'y}, b_{y'}, c_x, c_y), whose components are the parameters of the projective flow model.
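A sketch of the four-point step in Python/NumPy: the approximate bilinear parameters q (assumed to come from the least-squares fit of Equation (11)) are applied to four points, typically the image corners, and the eight linear equations of Equation (13) are solved for the projective parameters. This only illustrates the computation; it is not the original implementation of [19].

```python
import numpy as np

def projective_from_bilinear(q, corners):
    """Four-point method: convert approximate bilinear parameters into the
    eight projective parameters (a_x'x, a_x'y, b_x', a_y'x, a_y'y, b_y', c_x, c_y).

    q       -- (q_x'xy, q_x'x, q_x'y, q_x', q_y'xy, q_y'x, q_y'y, q_y'),
               assumed given by the least-squares fit of Equation (11)
    corners -- four points s_k = (x_k, y_k), e.g. the image corners
    """
    qxxy, qxx, qxy, qx0, qyxy, qyx, qyy, qy0 = q
    rows, rhs = [], []
    for (x, y) in corners:
        # position r_k predicted by the bilinear model, Equations (11) and (12)
        xp = qxxy * x * y + qxx * x + qxy * y + qx0
        yp = qyxy * x * y + qyx * x + qyy * y + qy0
        # two linear equations per point, Equation (13)
        rows.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        rows.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        rhs.extend([xp, yp])
    return np.linalg.solve(np.array(rows, float), np.array(rhs, float))
```

The returned vector corresponds to P in Equation (13) and can be used to warp all frames of a shot into the coordinate system of a reference frame.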
4. THE IMAGEMINER SYSTEM: OVERVIEW
The ImageMiner system is a system for the automatic annotation of still images. Key frames and mosaic images obtained by the techniques described in Section 3.1 and Section 3.2 can be analyzed by the ImageMiner system. The ImageMiner system consists of two main modules: the image analysis module, which extracts the content information, and the image retrieval module. The functionality of the retrieval module is discussed in Section 4.3. The next paragraphs give an overview of the image analysis module.

This module consists of four submodules. Three of them each extract one of the low-level features: color, texture, and contour. These feature extraction modules are independent of each other; therefore, the user is able to configure the image analysis by choosing the relevant features depending on the application (see Section 4.1). The fourth module (Section 4.2) performs an automatic knowledge-based object recognition. This module differentiates ImageMiner from other image retrieval systems like those mentioned in Section 1.

Each of the low-level submodules extracts segments for one of the three features, and the content description of these segments consists of plain ASCII text. This description comprises the low-level annotations of the analyzed images, which are stored as three different aspects. The object recognition process is based on the generated annotations of the three low-level modules. First of all, the neighborhood relations of the extracted low-level segments are computed. Graphs are an adequate representation for the neighborhood relations of segments or (simple) objects: each node represents an object, and each edge symbolizes the neighborhood relation between segments or objects. Secondly, the object recognition is realized by graph operations on the graph representing the spatial relations of the image. It is triggered by a graph grammar, a compiled taxonomy, which defines the objects related to the application domain knowledge. This object recognition module provides the information for the fourth aspect of the annotation belonging to the analyzed image.

In this way, a textual description of an image is automatically generated. The textual descriptions of images (and of videos, for analyzed key frames or mosaic images) can subsequently be indexed using standard text retrieval techniques, which also provide query functions. It is this textual description which constitutes an integrated level of description for multimedia documents.
4.1. The low-level image analysis
4.1.1. Color-based segmentation. After the transformation from RGB to HLS color space [6], an arbitrary, homogeneously sized grid divides the image into grid elements. A color histogram is computed for every grid element. The color appearing most frequently defines the color of the subwindow. In the next step, subwindows with the same color are grouped together, and the circumscribing rectangles are determined. Segmented rectangles may overlap. The result of the color-based segmentation are color rectangles with attributes such as size and position in relation to the underlying grid size, and of course the resulting color. Another attribute is the color density, which gives the ratio of the size of the color rectangle to the number of grid elements containing the resulting color.
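A minimal sketch of this grid-based color segmentation is given below; it assumes an HLS image whose first channel holds 8-bit hue values, quantizes only the hue, and simplifies the grouping of subwindows to one circumscribing rectangle per quantized color, so it only illustrates the idea of the color rectangles and their density attribute.

```python
import numpy as np

def dominant_colors(image_hls, grid=16, hue_bins=12):
    """Most frequent quantized hue per grid element (assumption: only the
    H channel of an 8-bit HLS image is used, quantized into hue_bins)."""
    h, w = image_hls.shape[:2]
    gy, gx = h // grid, w // grid
    labels = np.empty((gy, gx), dtype=int)
    for i in range(gy):
        for j in range(gx):
            cell = image_hls[i * grid:(i + 1) * grid, j * grid:(j + 1) * grid, 0]
            hist, _ = np.histogram(cell, bins=hue_bins, range=(0, 256))
            labels[i, j] = hist.argmax()
    return labels

def color_rectangles(labels):
    """Circumscribing rectangle (in grid coordinates) and density for each
    dominant color: density = matching grid elements / rectangle area."""
    rectangles = []
    for color in np.unique(labels):
        ys, xs = np.nonzero(labels == color)
        area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
        rectangles.append({"color": int(color),
                           "rect": (int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())),
                           "density": len(ys) / area})
    return rectangles
```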
4.1.2. Texture-based segmentation. The local distribution and variation of the gray values within a region determines the texture of the region. A possible method to classify natural textures is to use an artificial neural network trained on the textures of an application domain. The ImageMiner system first divides the whole image into an arbitrary homogeneous grid, similar to the color-based segmentation described before. For every grid element, the trained neural network maps the main texture features onto a texture class such as forest, water, ice, stone, etc. The result of the texture-based segmentation done by the ImageMiner system are texture rectangles with attributes like size with respect to the underlying grid size, position with respect to the grid, and the classified texture.

The use of an artificial neural network implies its training with respect to an application domain. This process precludes a flexible and easy change of the domain. Therefore, during the last months we have developed a new texture segmentation methodology, which is described in the following. Instead of dividing the image into a fixed grid, an edge- and region-based texture segmentation is performed to find homogeneously textured regions in the image [16]. After segmenting the texture regions, each region exceeding a certain minimum size is selected, and a rectangular texture sample is taken. These samples are analyzed and described by the texture analysis method [21] described below.

To find the mapping between visual properties and statistical features, we implemented 42 different statistical features described in several statistical texture analysis approaches [1, 8, 12, 25-27]. Then we performed a significance analysis to find, for each visual property, the one statistical feature that best computes its characteristic value. The parameters of the statistical features were also varied within the significance analysis in order to find the most useful parameter settings. For an overview of all seven properties and the corresponding statistical features see Table 2. The value of the statistical feature corresponds to that of the visual property, except for softness, where a high value of the complexity feature (f_com) implies a non-soft texture while a low value corresponds to a soft one. The estimation of the statistical feature complexity is based on a matrix called the neighborhood gray-tone difference matrix (NGTDM) [1].
The parameter d specifies the width of the local neighborhood for which the NGTDM is calculated. To analyze whether the shape of the primitives of a given texture is homogeneous, multi-area, or blob-like, the statistical feature roughness F_rgh [27] fits best. The roughness of a texture is estimated by means of a statistical feature matrix, which calculates the difference of the gray levels for each pair of pixels at a certain distance. The parameters L_c and L_r specify the maximum distance. The statistical features F_lin, F_crs, F_reg, and F_dir [26] are used to calculate the line-likeness, coarseness, regularity, and directionality of a texture. The algorithm for the calculation of the line-likeness of a texture is based on an edge detection and a count of the edges which appear as lines, where t is a threshold on the gray level difference used for the edge detection. The directionality is derived from a histogram over the directions of the edges in the image. A single peak in the direction histogram shows that the texture has a main direction; the height of the peak corresponds to the degree of directionality. Regularity is calculated by measuring the variation of some measuring feature over the whole texture region: the texture sample is split into s^2 sub-images, the measuring feature is calculated for each sub-image, and the differences between the results give a measure of regularity. The best results were achieved with the gray level variance as measuring feature; it is also used to measure the contrast of a texture. The algorithm for the calculation of coarseness is described in detail in [21, 26]. The statistical features are calculated on the original gray scale texture samples, with the exception of directionality, which needs a linear histogram scaling as pre-processing.

The result of our texture analysis is an automatically generated texture description based on a set of visual texture properties. Its advantage is its usability for several texture domains [3] without the need to train the neural net again. The visual properties allow a user to specify textures for domain-independent searches. The definition and classification of textures appearing in landscape scenes, like water, forest, or clouds, is just one possible field of application.
Table 2. Significant statistical features

Visual property                                   Statistical feature     Parameter
Shape of primitives: homogeneity                  F_rgh [27]              L_c = L_r = 2
Shape of primitives: blob-likeness, multi-areas   F_rgh [27]              L_c = L_r = 8
Line-likeness                                     F_lin [26]              t = 32
Coarseness                                        F_crs [26]              d = 2
Regularity                                        F_reg [26]              s = 4
Directionality                                    F_dir [26]              –
Contrast                                          gray level variance     –
Softness                                          f_com [1]               d = 12
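Two of the properties in Table 2 are easy to illustrate: contrast as the gray level variance of the sample, and regularity as the variation of that measure over s^2 sub-images. The sketch below follows this description only loosely and is not the exact feature definition of [26]; the normalization of the regularity value is an assumption.

```python
import numpy as np

def contrast(texture: np.ndarray) -> float:
    """Contrast measured as the gray level variance of the texture sample."""
    return float(texture.var())

def regularity(texture: np.ndarray, s: int = 4) -> float:
    """Regularity estimated from the variation of the gray level variance
    over s * s sub-images (cf. Table 2, parameter s = 4): small variation
    means a regular texture."""
    h, w = texture.shape
    sh, sw = h // s, w // s
    variances = [texture[i * sh:(i + 1) * sh, j * sw:(j + 1) * sw].var()
                 for i in range(s) for j in range(s)]
    spread = float(np.std(variances) / (np.mean(variances) + 1e-9))
    return 1.0 / (1.0 + spread)   # map to (0, 1]: 1 = perfectly regular (assumption)
```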
4.1.3. Contour-based segmentation. The contour-based segmentation of an image consists of the following three steps: gradient-based edge detection, determination of object contours, and shape analysis. The shape analysis results in a list of region parameters, which are passed on to the module for object recognition. We present here only the basic ideas.

To detect image edges, we first convolve a gray value image with two convolution kernels that approximate the first derivative of the Gaussian function. The direction and magnitude of the image intensity gradient can then be computed for each pixel. Once the image gradient is known, the next step is to locate the pixels with the steepest slope along the local gradient direction. According to edge detection theory, these points give the true positions of the image edges. The method of edge detection in our system was first proposed by Korn [15]; a similar one can be found in Canny [4].

The successful detection of edge points depends on the use of convolution kernels that are suited to the local changes of the image gray values. The selection of optimal convolution kernels has a direct influence on the extraction of image structures; this is the so-called scale-space problem in computer vision. Instead of finding optimal kernels for each point, we try to determine optimal convolution kernels with a different deviation for the whole image. To realize this, we implemented the edge detection algorithm [15] in a pyramid structure. For image queries, the features with a larger scale are indeed more useful than those with a smaller one. This can be considered one of the differences between image processing techniques for image retrieval and those for customary computer vision applications.

Unfortunately, edge detection also yields edge points that are caused by noise, and at the same time it may produce incomplete edge points of objects. Therefore, edge points cannot be used directly for image queries; they have to be connected to form object or region contours. For this purpose we use a contour point-connecting algorithm, which is described in detail by Zhang [31].
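The first step, convolution with derivative-of-Gaussian kernels and computation of the gradient magnitude and direction, can be sketched as follows. SciPy's gaussian_filter with a first-order derivative along one axis is used here as a convenient stand-in; it is not the Korn [15] implementation, and sigma plays the role of the kernel deviation discussed above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def image_gradient(gray: np.ndarray, sigma: float = 1.5):
    """Gradient magnitude and direction of a gray value image.

    order=(0, 1) smooths along the rows and differentiates along the
    columns (and vice versa), which approximates convolution with the
    first derivative of a Gaussian of deviation `sigma`.
    """
    img = gray.astype(float)
    gx = gaussian_filter(img, sigma, order=(0, 1))   # d/dx
    gy = gaussian_filter(img, sigma, order=(1, 0))   # d/dy
    magnitude = np.hypot(gx, gy)
    direction = np.arctan2(gy, gx)
    return magnitude, direction
```

Edge candidates are then the pixels whose magnitude is a local maximum along the gradient direction; repeating the computation with several values of sigma roughly corresponds to the pyramid of kernel deviations mentioned above.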
4.2. Knowledge-based object recognition
To solve the problem of knowledge-based object recognition by syntactical pattern recognition, two essential steps are necessary:
1. Bridge the gap between the low-level (quantitative) information, i.e., the information generated by the methods described in Section 4.1, and the atomic entities of the high-level (qualitative) information, i.e., the concepts described by a taxonomy. The results of this first step are hypotheses concerning the primitive objects.
2. Combine the primitive objects according to the compositional semantics of more complex objects described by a taxonomy, our knowledge base. A hypothesis used in one description of the analyzed image becomes a thesis.
Inherent in the information about color, texture, and contour from the low-level image analysis phase is the information about the topological relations between these different segments in the image data, as illustrated in Fig. 1. These neighborhood relations are distinguished into three cases: overlaps, meets, and contains, plus their inverse relations. One fundamental assumption of the ImageMiner system is that these neighborhood relations restrict the complexity of object recognition, i.e., a (primitive) object is built out of segments which are in such a neighborhood relation. Based on this assumption, the process of object recognition can be understood as a process of graph transformations, i.e., graph rewriting. In the ImageMiner system, the graph parser GraPaKL [14] is used for the object recognition phase. The underlying graph grammar formalism and the parser algorithm are described in [14]. In the ImageMiner system two graph grammars are used [17]:
1. A grammar to bridge the gap between the low-level information and the primitive objects.
2. A grammar to combine the primitive objects.
Fig. 1. Come together: Color (CL), Texture (T) and Contour (CT)
Fig. 2. A complex object like mountain_lake consists of several simple objects like sky, clouds, and so on. A simple object like clouds consists of a color, texture, and contour segment. Additionally, the rules for clouds are also specified.
An example grammar is given in Fig. 2. It shows that a complex object (mountain_lake) consists of simple objects (sky, clouds, lake, forest). Additionally, the rules for the simple object clouds are specified. The grammars are compiled from our knowledge base; in this sense, the model of our recognizable world is represented by this taxonomy.

4.2.1. Knowledge representation. The complete high-level knowledge needed for object recognition is stored in a knowledge base. This tool is a combination of a logic-based thesaurus [9], a KL-ONE-like system [11], and a front-end visualization tool [7]. In Fig. 2 the visualization of a typical complex object is shown. The representation component stores the entire knowledge; several functions are provided for the access and modification of the knowledge base.

4.2.2. Strategies for modeling the domain knowledge. Following an approach of syntactical pattern recognition, a graph grammar is a powerful method to handle object recognition by the substitution of topological graphs. A prerequisite is to find an adequate and consistent model of the domain. This paragraph concentrates on the underlying modeling strategies of a grammar which describes our landscape domain. The graph grammar consists of three different object types: goal, terminal, and nonterminal (see Fig. 3). Terminal nodes are represented by the input of the color, texture, and contour modules. The nonterminal nodes are composed of color, texture, and contour segments. Hence it follows that the nonterminal nodes are divided into different object classes: the primitive objects, which are
supported directly by the color, texture, and contour segments (specified by the grammar for the primitive objects), and the complex objects, which are composed of primitive objects. In the following, the strategies for the modeling of primitive objects for the landscape domain are presented:
. Texture segments of size xsmall are neglected in favour of bigger connected texture segments.
. Primitive objects always consist of a color, a texture, and a contour segment (see Fig. 2). The sizes of the color and the texture segment are only allowed to differ by a factor of two, and both segments must be contained in the contour segment. Thereby, the color and the texture annotation correspond to the same region.
In the landscape grammar we modeled eight primitive objects, where each one is defined by one grammar rule: sky, clouds, forest, grass, sand, snow, stone, and water. Modeling complex objects incorporates the following rules:
. Complex objects are composed of primitive objects. In order to reduce the number of rules for the definition of a concept, supersort relations are introduced: e.g., instead of three rules for the definition of the complex object landscape_scene containing clouds, snow, or water, the number of rules is reduced to one rule by introducing the knowledge that water_form can be assumed to be a supersort of these simple objects.
Fig. 3. Object definitions visualized in a KL-ONE-like graph.
. Primitive objects are specified in general by their size: e.g., if a complex object should be dominated by the segment forest, its definition requires a forest segment of size large or xlarge.
. Primitive objects are related by the topological relations meets, contains, and overlaps to guarantee their necessary neighborhood relations.
In this way, the landscape grammar is composed of three layers: segments as terminals (first level), primitive objects as non-terminals (second level), and complex objects as goals (third level) (see Fig. 2). It is remarkable that our approach of combining subsymbolic and symbolic knowledge leads to a very small knowledge base, which makes it very easy to adapt the system to new domains.
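The neighborhood conditions that such a grammar rule encodes for a primitive object can be illustrated by a small check; the segment records and the topological predicate `contains` are assumed inputs (in ImageMiner this test is performed by the graph parser, not by hand-written code like this).

```python
def matches_primitive_rule(color_seg, texture_seg, contour_seg, contains):
    """Neighborhood conditions used for primitive objects: the color and
    texture segments must both be contained in the contour segment, and
    their sizes may differ by at most a factor of two.

    Each segment is a dict with at least a `size` field; `contains(a, b)`
    is an assumed predicate for the topological relation 'a contains b'.
    """
    small, large = sorted((color_seg["size"], texture_seg["size"]))
    return (large <= 2 * small
            and contains(contour_seg, color_seg)
            and contains(contour_seg, texture_seg))
```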
4.3. Image retrieval
The generated textual description of an image is a standard text document. A simple text query interface can then be used to search for multimedia documents in a database comprising images, videos, and texts. However, it is also desirable to extend this user interface with visual properties for queries (query by example).
Fig. 4. The query graphical user interface with an example query result for the item mountain_lake
The ImageMiner system provides two types of queries. First, it supports queries at the syntactical level, which means a user can compose a query for specific features like color, texture, or contour. The second possibility is a query at the semantic level, e.g., for cloud, forest, mountain, sky_scene, etc. A combination of these queries is possible, and it can also include search terms for text documents. The complete interaction between a user and the system takes place through a graphical user interface. Figure 4 shows an example query result for the item mountain_lake.
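Because the output of the analysis is plain text, any standard text retrieval technique can index it. The following purely illustrative sketch builds a small inverted index over hypothetical annotation strings and document names (both made up for the example) and answers a semantic-level query.

```python
from collections import defaultdict

def build_index(annotations):
    """Simple inverted index over generated textual annotations;
    `annotations` maps a document id to its annotation text."""
    index = defaultdict(set)
    for doc_id, text in annotations.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical annotations for one mosaic image and one text document.
annotations = {
    "amazonas_shot_03_mosaic": "forest_scene forest water sky clouds",
    "field_report_12": "field study of forest and water quality",
}
index = build_index(annotations)
print(sorted(index["forest"] & index["water"]))   # semantic query: forest AND water
```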
5. EXAMPLES
This section shows an example of the whole process described in the previous sections. The first step of the annotation process (see Section 2) is the shot detection. Table 3 gives an overview of the performance of our shot detection approach on several clips of different genres; the accuracy of the shot detection is given in percent.

Table 3. Performance of the shot detection approach on several clips of different genres

Genre          Accuracy
Sport          93%
News           97%
Movies         99%
Cartoons       89%
Advertising    96%

For this specific example we have tested the complete analysis (video analysis and still image analysis with ImageMiner) on a short MPEG-1 video stream containing several scenes from a feature movie presenting the forest around the Amazonas river. The analyzed clip contained 1100 frames with a total length of 44 s. Our algorithm detected all five shot boundaries. The precision of the shot detection can vary with the genre of a movie: because of their intrinsic features, it is obviously more difficult to detect shots in a cartoon movie or in a commercial than in a feature clip or in an action movie.
Fig. 6. Mosaiced image from a shot with 200 frames. This image contains the complete information of the video shot
However, with our cut detection approach we obtain very good results for all kinds of videos with the same set of parameters, without adapting the algorithm to a specific genre (see Table 3). To demonstrate the idea of annotating videos with the ImageMiner system, we took one shot from the Amazonas movie for further processing. Figure 5 shows three frames of this shot, which consists of 200 individual frames: the left frame is the first frame of the shot, the second is taken from the middle of the shot, and the third is the last frame of the shot. It can be seen that the content of the individual frames changes strongly from the first to the last frame of the shot. Using the key frame method to represent the content of the shot, at least the three frames shown would have to be stored and annotated. However, as described in Section 3.2.3, we can use the mosaicing technique to create a single image containing the complete information of the shot. The result of the mosaicing procedure over the 200 single frames is shown in Fig. 6. This image was analyzed with the ImageMiner system: based on the color, texture, and contour information of the image, the object recognition process was invoked and identified a forest_scene.
Fig. 5. Three individual frames taken from a shot with 200 frames to represent the dominant camera motion. The three frames are the first, the middle, and the last frame of the shot.
6. SUMMARY AND CONCLUSIONS
We have shown a successful approach to analyzing MPEG videos by dividing them into shots (Section 2). Then a representative image is constructed for each shot: if the shot contains camera motion, the mosaicing technique (Section 3.2) is used; otherwise, a significant key frame is extracted (Section 3.1). The still image (key frame or mosaic image) representing a shot is then analyzed with image processing methods as implemented in the ImageMiner system (Section 4). The novel feature of an automatic knowledge-based object recognition using graph grammars builds on the results of the image analysis and on the domain knowledge, which is defined once by a domain expert as a set of graph grammar rules. This approach has been tested on video sequences from the landscape domain and delivered the correct interpretation of, e.g., a scene represented by a mosaic image. These results extend to the identification of static objects in a video. Our current efforts are aimed at an integrated spatio-temporal semantics in order to be able to model also the dynamic properties of videos. In addition, we are looking for ways to incorporate the information of sound tracks and textual inserts into the video analysis.
REFERENCES
1. Amadasun, M. and King, R., Textural features corresponding to textural properties. IEEE Transactions on Systems, Man and Cybernetics, 1989, SMC-19(5), 1264–1274.
2. Arman, F., Hsu, A. and Chiu, M. Y., Image processing on encoded video sequences. Multimedia Systems, 1994, 1(5), 211–219.
3. Asendorf, G. and Hermes, Th., On textures: an approach for a new abstract description language. In Proceedings of IS&T/SPIE's Symposium on Electronic Imaging: Science & Technology, pp. 98–106, San Jose, CA, USA, 29 January–1 February 1996.
4. Canny, J., A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1986, PAMI-8(6), 679–698.
5. Dammeyer, A., Jürgensen, W., Krüwel, C., Poliak, E., Ruttkowski, S., Schäfer, T., Sirava, M. and Hermes, T., Videoanalyse mit DiVA. In KI-98 Workshop Inhaltsbezogene Suche von Bildern und Videosequenzen in digitalen multimedialen Archiven (accepted), Bremen, Germany, 15–17 September 1998.
6. Foley, J. D., van Dam, A., Feiner, S. K. and Hughes, J. F., Computer Graphics: Principles and Practice, 2nd edn. Addison-Wesley, 1990; revised 5th printing, 1993.
7. Fröhlich, M. and Werner, M., Demonstration of the interactive graph visualization system daVinci. In Proceedings of DIMACS Workshop on Graph Drawing '94, pp. 266–269. Springer-Verlag, LNCS 894, 1994.
8. Galloway, M. M., Texture analysis using gray level run lengths. Computer Graphics and Image Processing, 1975, 4, 172–179.
9. Goeser, S., A logic-based approach to thesaurus modelling. In Proceedings of the International Conference on Intelligent Multimedia Information Retrieval Systems and Management (RIAO) '94, pp. 185–196. C.I.D.–C.A.S.I.S., 1994.
10. Hampapur, A., Virage Video Engine. In IS&T/SPIE Symposium on Electronic Imaging: Science & Technology, pp. 188–198, San Jose, CA, February 1997.
11. Hanschke, P., Abecker, A. and Drollinger, D., TAXON: a concept language with concrete domains. In Proceedings of the International Conference on Processing Declarative Knowledge (PDK) '91, pp. 411–413. Springer-Verlag, LNAI 567, 1991.
12. Haralick, R. M., Shanmugam, K. and Dinstein, I., Textural features for image classification. IEEE Transactions on Systems, Man and Cybernetics, 1973, SMC-3(6), 610–621.
13. Hirata, K. and Kato, T., Query by visual example. In Proceedings of the Third International Conference on Extending Database Technology, pp. 56–71, Vienna, Austria, March 1992.
14. Klauck, Ch., Eine Graphgrammatik zur Repräsentation und Erkennung von Features in CAD/CAM. DISKI No. 66, infix-Verlag, St. Augustin, 1994. Dissertation (Ph.D. thesis), University of Kaiserslautern.
15. Korn, A., Toward a symbolic representation of intensity changes in images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1988, PAMI-10, 610–625.
16. Kreyenhop, P., Textursegmentierung durch Kombination von bereichs- und kantenorientierten Verfahren. Master thesis, University of Bremen, 1998.
17. Kreyss, J., Röper, M., Alshuth, P., Hermes, Th. and Herzog, O., Video retrieval by still image analysis with ImageMiner. In Proceedings of SPIE, Storage and Retrieval for Image and Video Databases V, pp. 36–44, February 1997.
18. Liu, H. C. and Zick, G. L., Scene decomposition of MPEG compressed video. In IS&T/SPIE Symposium on Electronic Imaging: Science & Technology (Digital Video Compression: Algorithms and Technologies), Vol. 2419, San Jose, CA, February 1995.
19. Mann, S. and Picard, R. W., Video orbits of the projective group: a new perspective on image mosaicing. Technical Report 338, MIT, 1995.
20. Meng, J., Juan, Y. and Chang, S. F., Scene change detection in a MPEG compressed video sequence. In IS&T/SPIE Symposium on Electronic Imaging: Science & Technology (Digital Video Compression: Algorithms and Technologies), Vol. 2419, San Jose, CA, February 1995.
21. Miene, A. and Moehrke, O., Analyse und Beschreibung von Texturen. Master thesis, University of Bremen, 1997.
22. Niblack, W., Barber, R., Equitz, W., Flickner, M., Glasman, E., Petkovic, D., Yanker, P., Faloutsos, C. and Taubin, G., The QBIC project: querying images by content using color, texture, and shape. In IS&T/SPIE Symposium on Electronic Imaging: Science & Technology, Vol. 1908, pp. 13–25, San Jose, CA, February 1993.
23. Patel, N. V. and Sethi, I. K., Compressed video processing for cut detection. IEE Proceedings: Vision, Image and Signal Processing, 1996, 143(5), 315–323.
24. Pentland, A., Picard, R. W. and Sclaroff, S., Photobook: content-based manipulation of image databases. In IS&T/SPIE Symposium on Electronic Imaging: Science & Technology (Storage and Retrieval for Image and Video Databases II), pp. 34–47, San Jose, CA, February 1994.
25. Sun, C. and Wee, W. G., Neighboring gray level dependence matrix for texture classification. Computer Vision, Graphics and Image Processing, 1982, 23, 341–352.
26. Tamura, H., Mori, S. and Yamawaki, T., Textural features corresponding to visual perception. IEEE Transactions on Systems, Man and Cybernetics, 1978, SMC-8, 460–473.
27. Wu, C.-M. and Chen, Y.-C., Statistical feature matrix for texture analysis. CVGIP: Graphical Models and Image Processing, 1992, 54(5), 407–419.
28. Yeo, B. L. and Liu, B., A unified approach to temporal segmentation of Motion JPEG and MPEG compressed video. In Second International Conference on Multimedia Computing and Systems, May 1995.
29. Yeo, B. L. and Liu, B., Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology, 1995, 5(6), 533–544.
30. Yeung, M., Yeo, B. L. and Liu, B., Extracting story units from long programs for video browsing and navigation. In International Conference on Multimedia Computing and Systems, July 1996.
31. Zhang, J., Region-based road recognition for guiding autonomous vehicles. Ph.D. thesis, Department of Computer Science, University of Karlsruhe, Germany, February 1994 (in German). VDI Berichte 298, VDI Verlag, Düsseldorf, 1994.