J. Vis. Commun. Image R. 15 (2004) 261–264 www.elsevier.com/locate/jvci
Editorial
Multimedia database management systems
Over the last decade we have witnessed significant growth in the digital media market. Digital video cameras have become ubiquitous with the proliferation of Internet cameras, security and monitoring cameras, and personal hand-held cameras. Meanwhile, advances in digital storage technology have made the digitization, compression, archiving, and streaming of multimedia data popular and inexpensive. Finally, the expansion of the Internet and the development of video streaming technology are providing convenient ways to distribute and use these data widely. Together, these trends indicate a promising future for digital media in a variety of applications, including entertainment, education, medicine, and online information services.

This transition is both a blessing and a challenge. On one hand, it allows an extremely flexible way of producing, delivering, and consuming audiovisual content. On the other hand, the huge amount of multimedia data poses a crucial challenge: how to efficiently store, access, index, represent, browse, and search the data. Traditional techniques that are effective in processing alphanumeric data do not work well with multimedia data. In this context, innovative research areas are emerging and new technologies are being developed to address these issues in the fields of multimedia database management, multimedia content analysis, video summarization, video indexing, browsing, and video retrieval. The objective of this special issue is to review the latest developments in multimedia data management technologies, bringing various multimedia research efforts together.
It contains the following 10 articles, covering five major research topics: multimedia feature extraction; video event detection; semantic video context modeling and concept detection; video summarization and adaptation; and multimedia content indexing, browsing, and retrieval.

“Framework for Measurement of the Intensity of Motion Activity of Video Segments” by Peker and Divakaran
“Evaluation of Shape Similarity Measurement Methods for Spine X-Ray Images” by Antani, Lee, Long, and Thoma
“Integrated Use of Different Content Derivation Techniques Within A Multimedia Database Management System” by Petkovic and Jonker
“Real-Time View Recognition and Event Detection for Sports Video” by Zhong and Chang
“On Supervision and Statistical Learning for Semantic Multimedia Analysis” by Naphade
“Video Personalization and Summarization System for Usage Environment” by Tseng, Lin, and Smith
“Bridging the Semantic Gap in Sports Video Retrieval and Summarization” by Li, Errico, Pan, and Sezan
“Organizing a Personal Image Collection with Statistical Model-Based ICL Clustering on Spatio-temporal Camera Phone Meta-data” by Pigeau and Gelgon
“Content-Based Retrieval for Human Motion Data” by Chiu, Chao, Wu, Yang, and Lin
“Automatic Generation of Conference Video Proceedings” by Amir, Ashour, and Srinivasan

1047-3203/$ - see front matter © 2004 Elsevier Inc. All rights reserved. doi:10.1016/j.jvcir.2004.08.004

The first article, by Peker and Divakaran, explores a framework for measuring the intensity of motion activity in order to characterize video segments. Motion activity, which captures object motion and camera motion, provides an effective measure for discriminating video content in applications such as sports video retrieval and summarization.

The second article, by Antani et al., evaluates two popular shape similarity measures (polygon approximation and Fourier descriptors) applied to a collection of digitized medical X-ray images of vertebrae. In the reported experiments, polygon approximation performed better than Fourier descriptors. The authors discuss other factors, such as efficiency, partial matching, and similarity of closely related shapes, that need further consideration and investigation.

In the third article, Petkovic and Jonker present their work on extending a traditional database management system with content-based video retrieval functionality.
Specifically, important issues regarding video data models, dynamic feature extraction, and extensions of the different layers of the database architecture are addressed in detail. Moreover, content analysis techniques that automatically detect and recognize diverse sports events (e.g., net playing, forehand, and backhand in tennis, and passing and flying-out in Formula 1 car racing) are described. Integrating these techniques with the proposed database management system allows efficient, scalable, and domain-independent content-based video retrieval.

A real-time structure parsing and event detection system is described in the fourth article, by Zhong and Chang. It aims at recognizing important recurrent scenes in sports videos (e.g., pitching in baseball and serving in tennis) and detecting high-level sports events such as strokes, net plays, and baseline plays. A three-stage framework is proposed to achieve this goal. Specifically, in the first stage, the training phase, feature models and object rules are learned automatically or semi-automatically. Then, in the second stage, the operation phase, optimal models are selected to adapt to new videos and
subsequently used to detect target scenes. Finally, high-level events within the detected scenes are recognized based on constraint models on the spatio-temporal properties of segmented video objects. To facilitate access to the content structure and the detected events, a summarization and browsing application is also investigated in the article.

In the fifth article, “On Supervision and Statistical Learning for Semantic Multimedia Analysis,” the author reviews the state of the art in semantic multimedia analysis and presents his own extensive work on various challenging problems in this area. The issues discussed in the article, such as context modeling, active learning, and unsupervised structure discovery, are all active research fields. Overall, this is a very informative article with interesting results.

The sixth article, “Video Personalization and Summarization System for Usage Environment,” describes a thorough system for personalizing and summarizing video content according to the usage environment and addresses a number of technical issues within the system. In particular, the proposed approach leverages MPEG-7 and MPEG-21 for various aspects of the system. The article also describes the overall system structure, the set of tools built for the system, and various application scenarios.

A general framework for indexing and summarizing sports broadcast programs is presented in the seventh article, “Bridging the Semantic Gap in Sports Video Retrieval and Summarization,” by Li et al.
This work attempts to bridge the semantic gap between the rich meanings that a user desires and the shallowness of the content descriptions available for sports video, through two efforts. First, it applies an event detection scheme to automatically detect all segments that contain interesting events of a particular game, such as a pitch in baseball or a field goal in football, by exploiting specific domain knowledge and well-established production patterns. Second, it analyzes and interprets independently generated rich textual metadata that describe key events of the game and synchronizes them with the detected event segments. The event segment data and the synchronized textual metadata are then merged into a rich media content description that facilitates the retrieval and summarization of various sports content.

The eighth article, “Organizing a Personal Image Collection with Statistical Model-Based ICL Clustering on Spatio-temporal Camera Phone Meta-data,” discusses clustering an image collection according to time and geolocation metadata, with a particular view toward camera phones, which are becoming increasingly popular. A model-based unsupervised classification method is used, in which the ICL criterion is examined and optimized using the EM technique. The paper discusses some example photograph-taking scenarios and presents experimental results.

The ninth article, “Content-Based Retrieval for Human Motion Data,” presents a video indexing and retrieval system based on extracted human motion information. Specifically, assuming that each frame contains a posture, it first represents each posture with a hierarchical skeletal structure using an affine-invariant feature vector. Then, for each skeletal segment (e.g., an arm, a leg, or the torso), it constructs an
index map according to the segment-posture distribution through self-organizing map (SOM) clustering. During the retrieval stage, the start and end frames (postures) of the query example are used to find candidate clips from a video collection, and then the similarity between the query and each candidate is computed using a dynamic time warping algorithm. A video collection containing Tai Chi Chuan data (a traditional Chinese martial art) is used to demonstrate the system's performance.

The last article, by Amir et al., describes an application that allows nearly automatic, real-time creation of video proceedings. In contrast to regular paper proceedings, the proposed video proceedings contain videos of all conference talks and provide the end user with the following functionalities: full conference coverage including presentations and panels, automated video and speech processing and indexing, efficient content search using free-text queries and keywords, random and nonlinear content access, and efficient video browsing with a generated video summary and multiple synchronized views. To achieve this goal, various video analysis and indexing approaches are described, including shot boundary detection, automatic speech recognition (ASR), speech analysis, audio processing, and slide show streaming.

We thank all the authors in this special issue, who worked very hard to write their papers within a very strict time frame. We also hope that you, the reader, find this special issue an enjoyable mix and a spotlight on new themes emerging in the multimedia database management field. Finally, we thank the Journal of Visual Communication and Image Representation staff for helping us produce this issue.
Ying Li and Dr. John Smith
IBM T.J. Watson Research Center
19 Skyline Drive, Hawthorne, NY 10532, USA

Dr. Tong Zhang
Hewlett-Packard Labs, 1501 Page Mill Road, MS1203
Palo Alto, CA 94304, USA

Prof. Shih-Fu Chang
Department of Electrical Engineering, Columbia University
New York, NY 10027