Real-Time Imaging 2, 153–162 (1996)
A Software Multimedia Platform with Real-Time Video Manipulation Capability

A new type of software multimedia platform GOLS (Graphical Objective Language System), which can provide efficient multimedia database systems including manipulation of video data streams, is proposed. GOLS integrates three capabilities: data acquisition, data management and data presentation, which are necessary to develop advanced multimedia database applications. Using GOLS, object-oriented database applications can be easily constructed, because objects of GOLS are persistent and distributed, and the objects can be easily attached to graphical user interfaces. In addition, data acquisition capability supported by video recognition libraries enables the system to link real-time raw video data dynamically. A video recognition model based on a state-transition type model and Video Scene Description Language is proposed to describe flexible and reusable models for recognition of video data. Experience of developing applications, including Live Hypermedia system, revealed that multimedia applications can be easily developed using GOLS.
© 1996 Academic Press Limited
Takashi Satou and Masao Sakauchi* Institute of Industrial Science, University of Tokyo, 7-22-1, Roppongi, Minato-ku Tokyo 106, Japan, *E-mail:
[email protected]
Introduction

Developing a variety of multimedia database applications requires a powerful and flexible general-purpose hypermedia development environment. A hypermedia platform, GOLS (Graphical Objective Language System), developed for this purpose by our group, is discussed here. GOLS supports an adaptive integration framework for data model construction and utilization [1, 12]. This framework applies image or real-time video understanding techniques to the database acquisition phase (input phase) and flexible (or intermediate-level) data models to the database presentation phase (output phase). These are shown in Figure 1 as the ‘database vision’ component and the ‘database presentation’ component of the framework, respectively.
1077-2014/96/030153 + 10 $18.00
This framework allows multiple levels of abstraction and recognition. The top-level data model is fully abstracted (or completely recognized), corresponding to the abstract data model required by ordinary databases. Though this top-level model is ideal for ordinary applications, it is very difficult to realize for image and video data in an automated way. Lower-level models are introduced to overcome this problem. They are not fully abstracted (in other words, they are incomplete), which eases the difficulty of image recognition. To make these lower-level models useful, we need to develop intelligent presentation methods for each situation. The three major functions of GOLS shown in Figure 2 correspond to this framework as follows.
T. SATOU AND M. SAKAUCHI
Figure 1. Adaptive integration framework for image data model construction and utilization.
Figure 2. Three major functions of GOLS.
Database Vision: GOLS inherits reasoning capability from Prolog, so users can easily describe recognition models and processes for images and real-time video data. For example, using image processing libraries, users can divide images into segments as tokens for recognition, then match the token structures with recognition models. VSDL and the state-transition type recognition engine, which will be explained later, are also included. With these functions, the system converts raw images and videos into material objects with higher recognition levels for hypermedia applications.

Object-oriented Database Management: GOLS also works as an object-oriented database language, which makes the database facilities already inherent in Prolog more powerful. GOLS can have nested dictionaries of horn clauses (units of procedures and data in Prolog), and it can treat each dictionary as an object. GOLS objects have many useful features. For example, they can be stored in files and re-used on the next execution or in other programs. These features enable a much more flexible database expansion.

Database Presentation: GOLS also includes graphical user interface class libraries for presenting objects (such as real-time video objects, image objects, database objects, etc.). These libraries include interfaces such as windows, icons, menus, editors, images, and movies. High-level classes such as hypermedia kits, a color selector, an image viewer, and a file chooser are also available. Users can design interfaces either declaratively through Prolog scripts or interactively through an interface builder. Prolog's programming style of enumerating horn clauses is well suited to defining event handlers.
Language Specification

GOLS is a software platform based on an object-oriented Prolog programming language with class libraries for recognition, database and graphical user interface. The following are detailed descriptions of the language specification.
Object-oriented Prolog

GOLS is an object-oriented language similar to SELF [2], except that GOLS also has logic programming language features. From the viewpoint of object-oriented languages, GOLS is classified as a prototype-based language, in which an object is created by cloning (copying) a prototype instead of instantiating a class. There is no difference between class and instance. The main features of a prototype-based language are simple object relationships, creation by copying, examples of pre-existing modules, support for special customized objects, and elimination of meta-regress [2]. GOLS inherits all of these merits.

From the viewpoint of logic programming languages, GOLS is upward compatible with Prolog. Ordinary Prolog has only one dictionary of horn clauses (or data and rules). GOLS, on the other hand, can have nested dictionaries of clauses, so a large program can be divided into small components. In other words, GOLS adds a module facility to Prolog in order to support the maintenance and development of large programs.
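Creation by cloning can be illustrated outside GOLS. The following Python fragment is only a rough sketch under our own assumptions: the dictionary layout, the `clone` helper and the slot names are hypothetical and are not part of GOLS. It shows a new object being made by copying a prototype and then customizing the copy, with nested dictionaries standing in for nested objects.

```python
import copy

# A prototype is just an ordinary object (here a nested dictionary of slots).
# The slot names are illustrative assumptions, not GOLS definitions.
prototype = {"shape": (0, 0, 100, 100), "label": "untitled", "children": {}}


def clone(proto, **overrides):
    """Create a new object by copying a prototype, then customizing the copy."""
    obj = copy.deepcopy(proto)   # deep copy so nested dictionaries are not shared
    obj.update(overrides)        # slots given here replace the copied ones
    return obj


button = clone(prototype, label="Push me")
button["children"]["icon"] = clone(prototype, label="icon")
```

Because the copy is deep, adding a child to `button` leaves `prototype` unchanged, which is the property that makes a prototype safely reusable as a template.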
REAL-TIME VIDEO MANIPULATION
Figure 3. Class libraries of GOLS.
One reason why we employed the Prolog language is that Prolog has a good affinity with rule-based systems and relational database systems. In our laboratory, many advanced recognition systems for drawing, image and video have been developed using Prolog [3, 4, 14]. GOLS has taken over these properties.

Another reason is that the declarative programming style of the Prolog language closely matches the description of slots in objects. This feature is suitable for knowledge representation and event handling.

Class Libraries

GOLS provides class libraries with which application programs can easily be built. The class libraries consist of the following libraries, as shown in Figure 3.

Recognition: recognition engines and basic models based on the state-transition type model. Details will be discussed later.
Database management: a database management library which defines data and container classes.
Graphical User Interfaces: graphical components which enable users to access objects directly. Windows, dialog boxes, buttons, icons, movies and an interface builder are defined.
Hypertext: a card-type Hypertext engine and keyword database.
Communication: a communication library which enables applications to call procedures in other applications.
Image processing: the image data class has many image processing functions such as sizing, M × N mask filtering, segmentation, etc.
Others: some customized versions of GOLS have a 3-D drawing and interface library, which enables users to manipulate objects in a 3-D space directly, and a vector data library, which deals effectively with 100,000 vectors in drawing images.

Object

An object of GOLS is a dictionary of horn clauses. The following is a declarative definition of an object.
    @ a {
        foo(1).
        bar(2).
        zot(3).
        do(M) :- write(M), nl.
    }.
“@” marks the declaration of an object, and a pair of curly braces marks the beginning and the end of the contents of the object. The object a has four slots (or horn clauses): foo, bar and zot represent data, and do represents a procedure to print a variable M (in GOLS, as in Prolog, a word starting with an upper-case character is a variable).
The following describes an example of inheritance.

    @ b : a {
        bar(4).
        zot(_) :- !, fail.
    }.
The object b is defined as a sub-object of a. The slots foo and do are inherited from the object a. The slot bar overrides the definition in a, which changes the behavior of the base object. Therefore, calling b::bar(X) results in X = 4 instead of X = 2. The new definition of zot also overrides the previous definition; however, the new behavior is simply to fail. This definition amounts to eliminating the slot zot, because failure means negation in Prolog. Therefore, calling b::zot(X) will fail as if there were no definition of zot in the object. In GOLS, elimination of slots is easier than in ordinary object-oriented languages.

The contents of the dictionary can be stored on disk and restored in other contexts or execution units. In other words, GOLS objects can be persistent.
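The inheritance behavior described above (inherit, override, eliminate) can be mimicked in Python as a rough sketch. This is not how GOLS is implemented: slots are modelled here as single values rather than horn clauses with unification, and the `ProtoObject` class and the `FAIL` marker are hypothetical names introduced only for illustration.

```python
FAIL = object()  # hypothetical marker for a slot defined to always fail


class ProtoObject:
    """A dictionary of slots that delegates missing slots to a parent object."""

    def __init__(self, parent=None, **slots):
        self.parent = parent
        self.slots = dict(slots)

    def lookup(self, name):
        """Return the nearest definition of a slot, searching the child first."""
        obj = self
        while obj is not None:
            if name in obj.slots:
                value = obj.slots[name]
                if value is FAIL:
                    # An overriding definition that fails hides the parent's slot.
                    raise LookupError(f"slot {name!r} is eliminated")
                return value
            obj = obj.parent
        raise LookupError(f"no slot {name!r}")


# Counterparts of the objects a and b from the text.
a = ProtoObject(foo=1, bar=2, zot=3)
b = ProtoObject(parent=a, bar=4, zot=FAIL)
```

Here `b.lookup("foo")` is answered by a, `b.lookup("bar")` by the overriding slot in b, and `b.lookup("zot")` raises `LookupError`, which plays the role of Prolog failure in this sketch.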
Communication

In GOLS, an object can contain child objects. Objects are located in a tree structure, like directories in UNIX file systems, and each object is addressed by a path. The top object is called global. A path starting with global is an absolute path. For example, the object a defined above can be addressed as global$a, and the slot foo can be accessed as:

    :- global$a::foo(X).    % X will be 1.
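Path addressing of this kind amounts to walking a tree of named children. The Python sketch below illustrates the idea under our own assumptions: the nested-dictionary layout and the helper names `resolve` and `get_slot` are hypothetical, and only absolute paths beginning with global are handled.

```python
def resolve(root, path):
    """Resolve a '$'-separated absolute path such as 'global$a' to an object."""
    parts = path.split("$")
    if parts[0] != "global":
        raise ValueError("this sketch handles absolute paths only")
    node = root
    for name in parts[1:]:
        node = node["children"][name]   # descend one level per path component
    return node


def get_slot(root, path, slot):
    """Look up a slot on the object addressed by the path."""
    return resolve(root, path)["slots"][slot]


# The top object 'global' containing the object a from the earlier example.
GLOBAL = {
    "slots": {},
    "children": {
        "a": {"slots": {"foo": 1, "bar": 2, "zot": 3}, "children": {}},
    },
}

x = get_slot(GLOBAL, "global$a", "foo")
```

The call corresponds to the query global$a::foo(X) in the text, with `x` taking the value stored in the slot foo.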
More than one object can be sent messages.

    :- [video, voice]::play.
In this case, the method play of the two objects video and voice is called simultaneously. In this manner, the behavior of objects is synchronized.

Furthermore, an application can communicate with other GOLS applications through the inter-application communication library. For example, with

    :- app##a::foo(X).

the slot foo in the object a in the application app is called, and the variable X will be bound. This function performs like RPC and supports the development of client-server systems. Using GOLS, we have developed several specific servers, such as an image processing server, a video recognition server, a video data providing server, a meta-data providing server and so on.

Examples

The following example shows how easily applications can be developed. First, a simple database is defined.

    @ images : c_image_collection {    % Define an image database
        @ girl : c_image_data {        % item 'girl'
            file_('girl.gif').
        }.
        @ boy : c_image_data {         % item 'boy'
            file_('boy.jpg').
            keyword_('Tom').
        }.
    }.

This code is fairly clear. images contains two image items, girl and boy. Each item is related to an image file. An image item can have keywords and can be retrieved by querying the container. Then a simple window is defined.

    @ win : c_window {                 % Define a window
        shape(0, 0, 200, 200).         % Position and size
        @ button : c_button {          % Define a button
            shape(55, 160, 80, 30).
            label('Push me').
            action :-                  % Action when clicked
                parent$image::change_image(global$images$boy).
        }.
        @ image : c_image {            % Define an image icon
            shape(36, 25, 128, 128).
            image(global$images$girl).
        }.
    }.
    :- win::(draw, show).              % Activate the window 'win'.
    :- x_run.                          % Start event loop.
The window win has a button and an image icon. At first the image icon displays the image girl. If the button is clicked, the image icon changes its content to the image boy, as shown in Figure 4. The point is that the structure of objects corresponds well to the structure of GUIs. The readability of the listing is fairly good, because the program in GOLS is written declaratively. GOLS seems to be simpler and more powerful than Smalltalk, SELF, Tcl/Tk, C++ with InterViews, and VisualBasic, because it has the following features:

(i) good capability to design databases;
(ii) good affinity for graphical user interfaces; and
(iii) simple syntax to define and address objects.

Figure 4. Example of screen of GOLS (object trees @ images { @ girl, @ boy } and @ win { @ image, @ button }, with the labels "clicked" and "change").

Video Recognition Model

In order to deal with real-time video data effectively, a video recognition model is required in GOLS. The model processes video shots and analyses their contents as follows. The recognition model accepts a set of regions in a video shot, detects the scene categories, and extracts objects in the shot. These functions are essential to real-time video clipping, authoring, video editing, and other advanced interactions for video. The recognition model is implemented by the Video Scene Description Language (VSDL) [5] and the state-transition type model [4]. The VSDL realizes a primitive description of a shot. The state-transition type model infers the contents of shots using the primitive description. An object-oriented approach is adopted in the model construction process, in order to improve the modularity and re-usability of the models.

Video Scene Description Language

The VSDL was originally proposed by our research group for automatic detection and classification of TV shots [5]. The features shown in Table 1 are selected for the description of typical video shots which users may imagine. They provide primitive parameters and spatiotemporal relations of color regions in a video scene. It can be said that VSDL is responsible for low-level video processing.

Table 1. VSDL predicates

Predicate       Meaning
color           Color of region
area            Area of region
position        Position and size of region
motion          Motion of region
relation        Spatial relation between regions
consist_of      Set of regions
appear          Appearance of region
disappear       Disappearance of region
color_change    Color change rate
camera          Camera works

State-Transition Type Model

The state-transition type model was originally proposed by our research group in order to realize flexible and reusable models for recognition of drawing images [4, 6]. This model has been improved and extended to handle video data recognition. Primitive data (e.g. a line for a drawing, or a region for a video scene generated by color image processing) is called a token. Each token is given a state. The state is changed from one to another according to arrows between the states. For example, each region has an initial state ‘reg_color’ as shown in Figure 5. Then the state can be changed to ‘reg_skin’, ‘reg_face’, ..., ‘caster’, etc., in accordance with bottom-up and/or top-down recognition rules given for each state-transition link. In this manner, the structure of regions can be recognized by this type of model. In other words, the state-transition type model is responsible for high-level processing.
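As a rough illustration of tokens carrying states that rules advance, consider the following Python sketch. It is not the GOLS recognition engine: the state names reg_color, reg_skin and reg_face are taken from the text, but the token fields, the rule conditions and the `advance` function are assumptions made for this example, and only bottom-up transitions are modelled.

```python
# Each rule is a transition link: (from_state, to_state, condition on the token).
# The conditions are illustrative assumptions, not the rules used in GOLS.
RULES = [
    ("reg_color", "reg_skin", lambda t: t["color"] == "skin"),
    ("reg_skin", "reg_face", lambda t: 0.005 < t["size_ratio"] < 0.2),
]


def advance(tokens, rules):
    """Apply transition rules repeatedly until no token changes state."""
    changed = True
    while changed:
        changed = False
        for token in tokens:
            for src, dst, cond in rules:
                if token["state"] == src and cond(token):
                    token["state"] = dst
                    changed = True
                    break  # re-examine this token on the next pass
    return tokens


# Every region starts in the initial state 'reg_color'.
tokens = [
    {"color": "skin", "size_ratio": 0.05, "state": "reg_color"},
    {"color": "skin", "size_ratio": 0.60, "state": "reg_color"},
    {"color": "blue", "size_ratio": 0.10, "state": "reg_color"},
]
advance(tokens, RULES)
states = [t["state"] for t in tokens]
```

After the loop settles, the first token has reached reg_face, the second has stopped at reg_skin because it is too large, and the third remains at reg_color. Tokens that cannot reach a final state simply keep the last state they attained.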
Figure 5. Example of state-transition type model.
Figure 6. Structure of models for recognition.
The model consists of a set of transition rules. Each arrow in Figure 5 represents an individual rule. Each rule describes a condition for changing from one state to another. The rules are written in the VSDL and Prolog. The inference engine interprets the rules and updates the state of each token. The interpretation strategy is a combination of bottom-up and top-down analysis. At first, bottom-up rules are invoked and token states are updated until no more state transitions can occur. Then top-down rules are invoked for further state updating. Several tokens are often grouped into one. This strategy reduces the alternative choices of state updating, and contributes to high-speed analysis.

The merit of this expression lies in its modularity: rule sets can be easily replaced according to the recognition target. This feature is further improved by the object-oriented construction of models, as described in the next section.

Another merit is that the model allows imperfect recognition results. Even if the perfect goal for recognition cannot be reached (and this is often the case for image and video), the intermediate results, or sub-goals, remain as the states on the tokens. These imperfect results can be considered as a kind of recognition, and utilized effectively for some applications using devised utilization procedures. Another research project of our group revealed that this model was successfully applicable to complicated still image retrieval systems with fully automated keyword extraction capability [3]. The user can choose the matching level from fine to coarse and from narrow to wide.

Object-Oriented Approach for Construction of Models
The structure of the model for video scene recognition in GOLS consists of three parts: kernel, basic model and application models, as shown in Figure 6. This structure improves the modularity of models. Packages of recognition facilities can be distributed and used in user systems in a plug-and-play manner.

In the kernel, a recognition engine is provided. The predicates of the VSDL and the recognition engine which executes the rules are implemented here.

The basic model includes common rules and is shared by the application models. In our system, color is considered to be one of the most important features of regions. Therefore, the rules in the basic model deal with color categories. When other features, such as the shape of a region, are also important in other applications, the basic model may include rules about them.

Each application model contains application-dependent rules. For example, the NEWS model deals with the contents of typical news clips, such as detection of an anchorperson. Some heuristics can be contained in the application models. The following code is a part of the NEWS model description. The shot news_shot, an application model, is defined as inheriting from the basic shot color_shot.

    % definition of NEWS shot
    @ news_shot : color_shot {
        % bottom-up rules
        bu_rule(reg_face, reg_skin, Obj) :-
            not #back_ground(Obj),
            #size_ratio(Obj, Size),
            0.005 < Size, Size < 0.2,
            #Obj::position(_, _, OW, OH),
            R is OW / OH,
            R < 1.6, R > 0.4.
        bu_rule(reg_hair, reg_black, Obj) :-
            #size_ratio(Obj, Size),
            0.005 < Size, Size < 0.2.
        % import bottom-up rules of base scene class.
        bu_rule(To, From, Obj) :-
            base ^: bu_rule(To, From, Obj).
        % top-down rules
        td_rule(human_face, Objs, human_face([Face|NearFace]), Rest) :-
            select1(Objs, reg_face, Face, Rest1), !,
            collect_near(reg_face, Rest1, NearFace, Face, Rest).
        td_rule(human_hair, Objs, human_hair([Hair|NearHair]), Rest) :-
            select(Objs, reg_hair, Hair, Rest1),
            collect_near(reg_black, Rest1, NearHair, Hair, Rest).
        td_rule(caster, Objs, caster([Face, Hair]), Rest) :-
            td_rule(human_face, Objs, Face, Rest1),
            td_rule(human_hair, Rest1, Hair, Rest),
            Face = human_face(FaceReg),
            Hair = human_hair(HairReg),
            regs_relation(on, HairReg, FaceReg).
        % ...
    }.

Model and Result

After the recognition process, several results are generated. The model matched with the scene can be referred to through “is-a” links between models and results, as shown in Figure 6. Although it may seem strange that a result is a model, this relationship increases the reusability of the results. Results possess not only parameters obtained by analysis of data, but also the same recognition capability as the matched models. Results can be customized and reused for acquisition of other scenes, where more detailed or more specialized matching will be done. Thus, results are bound with the models by “is-a” links.

For example, suppose that a generic face model detects two different persons' faces from video data, and generates two results, face-A and face-B. Each result has the parameters generated by the model matching process, such as face color, shape, area and so on. The user will be able to utilize face-A to detect the same person in other scenes.

Applications

Many applications and prototype systems for research have been developed using GOLS in our research group. The following are examples of applications using GOLS, and subjective evaluations of GOLS by developers of the applications.

Live Hypermedia

Live information (such as news, weather forecasts, sports programmes on TV, etc.) seems to be important for people's daily life. It is, however, scattered independently throughout various TV programmes. There is no relationship between TV channels, or between TV programmes.

Figure 7. Concept of the Live Hypermedia.

On the other hand, hypermedia technology has been developed to store and represent various kinds of information and relationships between information nodes. Hypermedia has been used in such application fields as online manuals (e.g. the Windows help system) and education (e.g. Intermedia [7]). Recently, the World Wide Web (WWW) [8] has become popular on the Internet, and shows another feature of hypermedia as a general information server. However, available information in traditional hypermedia systems is not fresh, because they manage only
stored information. Although old information such as a bibliography can be valuable, freshness is one of the most important features of the kind of information that comes in through broadcasting channels. If hypermedia can manage live information, real-time information may be associated with other stored or even live information, and consequently the value of the information can be increased.

Live Hypermedia is a new type of hypermedia system which can link live information, especially live video coming into the system in real time, as shown in Figure 7. The live hypermedia can recognize real-time video data using the video recognition technique discussed in the previous section. Live hypermedia can generate nodes from video data automatically, manage them in hypermedia networks and attach them to GUIs.

Figure 8 shows a prototype of the live hypermedia developed using GOLS, which works like a “Video scrapbook”. There is a card-type hypermedia viewer in the bottom right of the figure, where each card represents a “logical channel” and is linked with a recognition model. In this example, a model linked with the NEWS card detects scenes from the 6-channel TV broadcasting window in the top right of the figure. Detected scenes and iconic objects are spooled in the spool window on the left, and pasted onto the card if needed. In this way, the user can manage news clippings concerning a specific topic, as if making a scrapbook of clippings from newspapers. This corresponds to the freezing of real-time data.

Figure 8. Example of screen from the video scrapbook.

Evaluation: First, the three major functions of GOLS (acquisition, management and presentation of data) play an important part in the live hypermedia. Satou and Sakauchi [9, 13] provide evaluation details such as recognition performance.

Second, the GOLS communication facility makes it easy to develop distributed systems. The prototype system consists of three distributed components: a TV broadcaster, a recognition server and a hypermedia management system. They are integrated into one application using the communication facility.

Figure 9. Example of screen from the drawing recognition workbench.
Workbench for Drawing Recognition

This workbench supports research into the design of new recognition methods for drawing images such as a vectorized map [10]. New recognition models can be made and tested easily on this workbench. GOLS is used to realize the recognition engine, describe production rules, implement the model learning system and display data, as shown in Figure 9.

Evaluation: First, it is very convenient to watch the recognition process through GUIs. For example, during the recognition of boundary lines in a map, the processed parts of lines are highlighted. This helps the early discovery of problems in recognition methods.

Second, GUIs also help to give examples to the system for inductive learning of models. For example, to learn the size parameter of a model of buildings, positive and negative examples of buildings have to be pointed out in a map. It is difficult to specify these spatial parameters without a GUI.

Third, the object-oriented programming capability helps to construct knowledge-based and data models.

Interface for Neural Network System

In this application, a neural network (NN) system classifies image data by calculating the correlation between colors in the images and emotional keywords [11]. After learning the images and the keywords, the NN can answer which category a given image belongs to.

GOLS works as an interface language for the NN system, as shown in Figure 10.

Figure 10. Example of screen from the neural network system.
Evaluation: The difficulty lies in the linkage between GOLS and the NN system. However, only a small amount of additional code was necessary to link the two. After the integration, interactivity with the NN system increased. GUIs help to input parameters to the NN, watch the internal parameters of the NN, and display the output of the system. Before the integration, all inputs had to be given textually, and all output was also displayed textually. Though the developer was not familiar with window programming, he was able to attach GUIs to the system easily. As an interface language, the GOLS description seems to be easier to understand and use than Tcl/Tk.
Conclusions

GOLS offers an integrated platform for multimedia database applications. It supports not only management and presentation but also the acquisition of multimedia data. The video data acquisition capability makes it possible to link real-time video data with stored information in hypermedia document networks. The value of live information linked with other information is expected to increase. We will continue to improve GOLS as well as the live hypermedia system. Future work is to integrate a temporal model effectively into the state-transition type model.
References

1. Sakauchi, M. (1994) Database vision and image retrieval, IEEE Multimedia, 1(1): 79–81.
2. Ungar, D. & Smith, R.B. (1987) SELF: the power of simplicity, OOPSLA ’87, SIGPLAN Notices, 22(12): 227–241.
3. Yamane, J. & Sakauchi, M. (1995) A construction of a new image database system which realizes fully automated image keyword extraction, IEEE Trans. Inf. Sys., E74-D, 1216–1233.
4. Satoh, S., Ohsawa, Y. & Sakauchi, M. (1990) Drawing image understanding framework using state transition models, Proceedings of the 10th ICPR, pp. 491–495.
5. Gong, Y. & Sakauchi, M. (1992) A method for color moving image classification using the color and motion features of moving images, ICARCV ’92.
6. Satoh, S., Moh, H. & Sakauchi, M. (1994) A drawing image understanding system cooperating with rule generation supporting using man-machine interaction, IEICE Trans. Inf. Sys., E77-D.
7. Garrett, N.L., Smith, K.E. & Meyrowitz, N. (1986) Intermedia: issues, strategies, and tactics in the design of a hypermedia document system, Proc. Conf. on CSCW.
8. Berners-Lee, T., Cailliau, R., Groff, J.F. & Pollermann, B. (1992) World-Wide Web: The Information Universe, Electronic Networking: Research, Applications and Policy, Vol. 1, No. 2, Meckler, Westport CT.
9. Satou, T. & Sakauchi, M. (1995) Video acquisition on live hypermedia, Proceedings of the Int’l Conference on Multimedia Computing and Systems, pp. 175–181.
10. Lu, W. & Sakauchi, M. (1994) An interactive map drawing recognition system with learning ability, Proceedings of the IAPR Workshop on Machine Vision Applications, pp. 235–238.
11. Hiwatashi, M. & Sakauchi, M. (1995) Keyword extraction for image retrieval using color sensitive neurons, Proceedings of the 1995 IEICE General Conference, Information Systems 2, p. 118 (in Japanese).
12. Sakauchi, M., Satou, T. & Yaginuma, Y. (1996) Multimedia database system for contents mediator, IEICE Trans. In press.
13. Satou, T. & Sakauchi, M. (1996) Video information acquisition on live hypermedia, IEICE Trans. Inf. Sys. In press.
14. Ono, A., Amano, M., Hakaridani, M., Satou, T. & Sakauchi, M. (1996) Synthesis and evaluation of the image database with fully automated keyword extraction by state transition model and scene description language, IEICE Trans. Inf. Sys. In press.