Computers & Graphics 28 (2004) 485–495
Adaptation of virtual human animation and representation for MPEG

Thomas Di Giacomo*, Chris Joslin, Stéphane Garchery, HyungSeok Kim, Nadia Magnenat-Thalmann

MIRALab, University of Geneva, C.U.I., 24 rue General Dufour, 1211 Geneva 4, Switzerland

*Corresponding author. Tel.: +41-22-379-7618; fax: +41-22-379-7780. E-mail address: [email protected] (T. Di Giacomo).
Abstract

While level of detail (LoD) methods for the representation of 3D models are efficient and established tools to manage the trade-off between rendering speed and quality, LoD for animation has not yet been studied intensively by the community, and the animation of virtual humans in particular has received little attention. Animation, a major step towards immersive and credible virtual environments, involves heavy computations; as such, its complexity must be controllable if it is to be embedded into real-time systems. Today, such control becomes even more critical and necessary with the emergence of powerful new mobile devices and their increasing use for cyberworlds. With the help of suitable middleware solutions, executables are becoming more and more multi-platform. However, the adaptation of content to various network and terminal capabilities, as well as to different user preferences, is still a key feature that needs to be investigated. It would ensure the adoption of the "Multiple Target Devices, Single Content" concept for virtual environments, and it would in theory make such virtual worlds possible under any conditions without the need for multiple versions of the content. It is on this issue that we focus, with a particular emphasis on 3D objects and animation. This paper presents theoretical and practical methods for adapting a virtual human's representation and animation stream, both for skeleton-based body animation and for deformation-based facial animation; we also discuss practical details of the integration of our methods into the MPEG-21 and MPEG-4 architectures.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Multi-resolution animation and representation; Adaptation; Graphics standardization
1. Introduction

Since the invention of the computer, content has been tailored towards specific devices, mainly by hand. Computer games have been developed with specific computer capabilities in mind, video has been produced at various sizes so that users can select the version most likely to run best, and different formats have been provided so that the same material can run on different types of machines. In recent years, this
trend of multiple devices, multiple content (MDMC) has slowly been shifting towards multiple devices, single content (MDSC), with great advantages both to the content provider and to the end user. Firstly, only a single (usually high-quality) piece of content needs to be provided for an entire suite of devices; secondly, the user is provided with content that is optimised for, and fits, not only the requirements of the device, but also the network and the user's own, often specific, preferences. This highly motivating goal can only be achieved within a standardisation framework, in this case the Moving Picture Experts Group (MPEG), mainly because of the extensibility and range of applications to which it can be taken. Though early work in MPEG-7, e.g. that proposed by Heuer et al. [1], adapts certain types of content, this kind of
MDSC concept is being constructed under the framework of Digital Item Adaptation (DIA). A Digital Item is the MPEG name given to any media type or content; adaptation, because the content is adapted, rather than changed, from its original form into something tailored towards the context of the user and the user's equipment. DIA is part of the MPEG-21 framework [2] for content control, and is in its final stages of standardisation. In this paper, we discuss DIA as it relates to virtual humans, and in particular to animation, since animation is the main influence on any session involving avatars. Representation is also discussed, but merely as a transition from the general conditions in which it currently exists towards the aforementioned context of standardisation. We are mainly concerned here with adapting animation, under a variety of different conditions, to a specific context. We split this paper into four major sections: Section 2 describes related work, focusing especially on other types of context-based adaptation, both for representation and for animation. Section 3 introduces the adaptation of representation and of both Face/Body Animation (FBA), see Preda et al. [3], and Bone-Based Animation (BBA), see Preda et al. [4], together with the high-level adaptation schema and the methodology. Section 4 provides some preliminary results, and Section 5 gives a conclusion and an overview of the future work planned for the next development period.
2. Related work

This section presents methods relevant to the adaptation of the body and facial animation of virtual humans. We first discuss work on adapting the representation of these 3D objects, then methods for adaptable animation, and finally approaches related to scalability within MPEG and various attempts at adaptable 3D applications. In computer graphics, there are many methods, such as the one proposed by Hoppe [5], for creating LoD representations, also known as multi-resolution models, of virtual objects. Basically, they consist of refining or simplifying the polygonal mesh according to certain criteria, such as the distance of the object to the camera, in order to save computation during rendering and/or to meet timing requirements. Garland et al. [6] also propose a method based on error quadrics to simplify the surfaces of 3D objects. Some of these geometrical methods are specifically geared towards virtual humans: for instance, Fei et al. [7] propose LoD for virtual human body meshes and Seo et al. [8] for virtual human face meshes. The main consideration for virtual humans, compared with other objects, is to retain a specific number of polygons at the joints (or near the control points
for the facial animation). Another approach to rendering numerous virtual humans in real time on different devices is to convert 3D objects into 2D objects. This procedure is also referred to as transmoding within MPEG, and such methods have been proposed by Aubel et al. [9], with impostors, and by Tecchia et al. [10]. Concerning methods that adapt the animation itself, most work has been done in the field of physically based animation. Adaptive techniques, such as those proposed by Wu et al. [11] using a progressive mesh, and by Debunne et al. [12] combining it with an adaptive time step, reduce the processing time required for animation. Capell et al. [13] also use a multi-resolution hierarchical volumetric subdivision to simulate dynamic deformations with finite elements. Hutchinson et al. [14] refine mass-spring systems when certain constraints on the system are violated. James et al. [15] propose to animate deformations with "dynamic response textures" in a system called DyRT, with the help of precomputations. Adaptation of non-physically based animation has received little attention until now; however, some relevant methods have been proposed: Granieri et al. [16] adapt the sampling frequency of virtual human motions to be played back and support a basic reduction of the degrees of freedom of virtual humans. For natural scenes, Di Giacomo et al. [17] present a framework to animate trees with a level of detail (LoD) for the animation, and Guerraz et al. [18] animate prairies with three different animation models used as three possible levels. A specific adaptation of an MPEG file is also possible, depending on the media type. While scalable video is an intensively studied topic, such as the work of Kim et al. [19], as is scalable audio, such as the method proposed by Aggarwal et al. [20], a progression towards graphics is only slowly being initiated, and only by a few, such as Van Raemdonck et al. [21] and Boier-Martin [22]. This is probably even more the case for animation, which, to our knowledge, is mainly investigated in the work of Joslin et al. [23]. Though it is slightly outside the scope of this paper, there is an important point to be mentioned for DIA, namely the impact of the adaptation. Quality of Service (QoS) is a tool to estimate such an influence; for instance, Pham Ngoc et al. [24] propose a QoS framework for 3D graphics. Yet there is a need for a dedicated adaptation QoS for 3D animation, to evaluate the impact on aesthetic quality and the usability of adaptation for animation and specific 3D objects, e.g. virtual characters. Finally, some research is being done on specifying adaptable architectures for 3D applications, though most of them are not geared towards multiple scenarios. For instance, Schneider et al. [25] integrate, in their Network Graphics Framework, various transmission methods for downloading 3D models in a client–server environment. A particular transmission method is
selected by comparing the quality of the models and the performance of the network. Lamberti et al. [26] propose a platform with accelerated remote rendering on a cluster, which transmits the images to a PDA; user navigation is then calculated on the PDA side and transmitted back to the server. Optimizations for particular devices are also being investigated: apart from all the methods for standard desktop PC graphics boards, some dedicated optimizations, e.g. those by Kolli et al. [27] for the ARM processor, are explored for various mobile and new devices. Another way to adapt 3D applications to different devices is to adapt directly the underlying model of standard, well-known computer graphics techniques. One of the most relevant examples is the use of image-based rendering techniques by Chang et al. [28]. Another example is the work by Stam [29], demonstrating stable fluids on a PDA with consideration for fixed-point arithmetic, because of the lack of an FPU on these devices. Most of the previously described methods perform their respective tasks well; however, we believe there is currently no global and generic architecture to adapt the content, representation and animation of virtual humans to network and target client capabilities, as well as to user preferences. This is the focus of our work.
3. Adaptation of content

3.1. Introduction
The adaptation of content, based on MPEG-21 Digital Item Adaptation [30], appears quite complex at first, but is actually quite simple and allows for extremely practical applications. The principles of adaptation are based on XML schemas called the Bit Stream Description Language (BSDL) and its generic form, the generic Bit Stream Description Language (gBSDL), introduced by Amielh et al. [31,32]. The idea is that the codec is described using these schemas. BSDL uses a codec-specific language, meaning that the adaptation engine needs to understand the language and use a specific XML style sheet (explained later) in order to adapt the bitstream. gBSDL uses a generic language, which means that any adaptation engine can transform the bitstream without a specific style sheet (i.e. multiple style sheets are not required for each adaptation). The adaptation works on multiple levels and is very flexible.
The Bit Stream Description (BSD) is basically an XML document that describes the bitstream at a high level (i.e. not on a bit-by-bit basis). It can contain either the bitstream itself, represented as hexadecimal strings, or URI links to the bitstream in another file (usually the original file). This BSD is generated using a Binary to Bitstream Description engine (more commonly BintoBSD), as shown in Fig. 1. Once the bitstream is in BSD format, it can be adapted using an XML style sheet, which contains information on how to adapt the XML document according to a set of rules passed by the adaptation engine. This adaptation basically removes elements from the XML document according to the adaptation schema (described in the following sections). During this stage, the header might be changed to take account of the elements of the bitstream that were removed; for example, the initial mask might indicate the presence of all elements, but this would be adapted to indicate which elements remain after adaptation. The XML document is then parsed by a BSDtoBin converter, which takes the XML document and converts it back into a bitstream, as shown in Fig. 2. In general, the header is converted back from its human-readable form directly into its binary representation, and the remaining elements of the payload are assembled back into the binary stream following the header (using either the URI data or the hexadecimal strings embedded in the BSD). To give an overview at the bitstream level, Fig. 3 provides a syntactic perspective on the entire process.
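As a rough, non-normative illustration of this pipeline, the Python sketch below adapts a toy bitstream description by dropping the elements that exceed a target LoA and reassembling the remaining payload. The element and attribute names (Frame, loa) and the inline hexadecimal payload are hypothetical stand-ins for a gBSDL description and its transformation rules; they are not the normative MPEG-21 tools.

```python
# Minimal sketch of BSD-based adaptation (hypothetical element/attribute names):
# parse a bitstream description, drop elements above the target LoA,
# then reassemble the payload bytes carried by the remaining elements.
import xml.etree.ElementTree as ET

LOA_ORDER = ["VeryLow", "Low", "Medium", "High"]

def adapt_bsd(bsd_xml: str, target_loa: str) -> bytes:
    root = ET.fromstring(bsd_xml)
    keep_rank = LOA_ORDER.index(target_loa)
    out = bytearray()
    for element in list(root):
        loa = element.get("loa", "VeryLow")
        if LOA_ORDER.index(loa) > keep_rank:
            root.remove(element)          # adaptation: drop this element
            continue
        # in this sketch the payload is carried inline as a hexadecimal string
        out += bytes.fromhex(element.text.strip())
    # a real BSDtoBin stage would also rewrite the header mask here
    return bytes(out)

if __name__ == "__main__":
    bsd = """<bitstream>
               <Frame loa="VeryLow">0a0b</Frame>
               <Frame loa="High">0c0d</Frame>
             </bitstream>"""
    print(adapt_bsd(bsd, "Low").hex())    # keeps only the VeryLow frame: "0a0b"
```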
Fig. 1. Binary to bitstream description conversion.
Fig. 2. Bitstream description to binary.
Fig. 3. Adaptation process at bitstream level.
Whilst the header might also be adapted, in general most of it will be retained, as it contains the information outlining the format of the rest of the packet. In the following sections we present the adaptation schema for both body and face animation.

3.2. Virtual human adaptation

3.2.1. Introduction
As the schema is used to define the bitstream layout at a high level, it must basically represent a decoder in XML format. This does not mean that it will decode the bitstream, but the structure of the bitstream is important. It also means that the adaptation methods must be in line with the lowest level defined in the schema. For example, if the adaptation needs to skip frames, and since MPEG codecs are byte-aligned, it is practical in this case to search for the "start code" of each frame; this means that a frame can be dropped without needing to understand most of the payload (possibly with the exception of updating the frames-skipped field of the header, though even this could be avoided). However, as will be seen in the following sections, the schema is based on a much lower level in order to be more flexible.

3.2.2. Schema
In terms of the bitstream level, the schema is defined at a low level in order to take into account the adaptations described in Sections 3.4 and 3.5. This is necessary because the FBA codec is defined using groups (which basically define limbs and expressive regions), but these are not sufficiently marked to identify the payload. This does not pose a serious problem, as the schema generally remains on the server side (i.e. size is not a problem); however, it does take longer to process and it is obviously more prone to error.

3.2.3. Scalability of the FBA codec
As a final discussion, we describe the scalability of the FBA codec. A closer examination of the codec shows that although it performs well for its general operation, coding face and body movements and even expressions, there are several shortcomings, mainly in the area of scalability. These rest mainly in the grouping of the parameters; this grouping is suitable for body/face interest zones, detailed in Sections 3.4.3 and 3.5.2, but not for the Level of Articulation (LoA), which would be just as useful. This means that the schema must be defined at quite a low level in order for it to adapt properly. In addition, the codec defines each frame with quite a high level of independence, which means that the coding scheme can be swapped on practically a frame-by-frame basis; in fact, each region of the face can be coded differently (some as expressions, and some as visemes). This means that the codec is quite flexible in its approach, but again quite impractical for adaptation, as the schema must account for this diversity.
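As a rough illustration of the frame-skipping idea mentioned in Section 3.2.1, the sketch below halves the frame rate of a byte-aligned stream by splitting it on a start code and keeping every second frame. The start-code bytes used here are made up for the example and do not correspond to the actual FBA syntax.

```python
# Hypothetical frame-dropping sketch: split a byte-aligned stream on a start
# code and keep only every n-th frame (the start-code value is made up).
START_CODE = b"\x00\xb8"   # hypothetical frame start code

def drop_frames(stream: bytes, keep_every: int = 2) -> bytes:
    # the chunk before the first start code is treated as the header
    header, *frames = stream.split(START_CODE)
    kept = [f for i, f in enumerate(frames) if i % keep_every == 0]
    return header + b"".join(START_CODE + f for f in kept)

example = b"HDR" + START_CODE + b"f0" + START_CODE + b"f1" + START_CODE + b"f2"
print(drop_frames(example))   # keeps frames f0 and f2, drops f1
```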
3.3. Shape representation
The adaptation operation should be simple enough to be implemented in a light-weight environment. There has been much research on multi-resolution shape representation [33]; so far, the proposed schemes either consume too much space or require relatively complex decoding mechanisms. In this work we devise a simple representation that conforms to the current standards, by clustering all the data so that a specific complexity can be obtained by simply choosing a set of clusters.

Starting from the complex mesh M_n, the mesh is sequentially simplified to M_{n-1}, ..., M_1, M_0. By following this sequence, we can identify the set of vertices and faces that are removed from the mesh of level i to produce the mesh of level i-1, denoted C(i), with M_0 = C(0). There is also a set of vertices and faces that are newly generated by the simplification, denoted N(i). By the properties of the simplification, N(i) is a subset of the union of the C(j) for all i > j. Using this property, the cluster C(i) is sub-clustered into a set of C(i, j), which belong to N(j) with j > i, and C(i, i), which does not belong to any N(j). The level-i mesh is then represented by the following equation, which requires only simple set selections among the clusters C(k, j):

M_i = \sum_{k=0}^{i} \sum_{j=i+1}^{n} \bigl( C(k, k) + C(k, j) \bigr).   (1)

The clusters are ordered so that each level requires a small number of selections. Fig. 4 shows an ordered example of clusters. The representation does not have any explicit redundancy, and an adaptation process is simply a selection of part of the stream. The streaming and decoding process can be stopped at almost any cluster, except the clusters of C(0), and it is guaranteed that at least one LoD can be composed from the current subset of clusters. If more of the stream is processed, a higher level is provided.

In the mesh, there are other properties that have to be taken into account, such as normals, colors, and texture coordinates.
Fig. 4. Illustration of clusters.
Based on the mesh structure, we make a unique mapping from a pair of vertex and face to a property value. By assigning the property to the lowest-level vertex and its corresponding clusters, property values can be represented within the clusters. In real applications, it is often not necessary to generate all levels of detail: differences of a few polygons do not usually make a significant difference in either performance or quality. Using the proposed representation, the modeling system is able to generate any combination of levels, depending on the model and the application.
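A minimal sketch of the cluster selection implied by Eq. (1): the mesh of level i is assembled by taking, for every k <= i, the cluster C(k, k) together with the clusters C(k, j) whose j lies above i. The cluster contents below are placeholder vertex/face identifiers used only to illustrate the mechanism.

```python
# Sketch of level selection over sub-clusters C(k, j), following Eq. (1).
# clusters[(k, j)] holds the vertex/face ids removed at level k that were
# created at level j; clusters[(k, k)] holds ids present in the original mesh.
def select_level(clusters: dict, i: int, n: int) -> set:
    mesh = set()
    for k in range(0, i + 1):
        mesh |= clusters.get((k, k), set())
        for j in range(i + 1, n + 1):
            mesh |= clusters.get((k, j), set())
    return mesh

# toy example with n = 2 (three levels M0, M1, M2)
clusters = {
    (0, 0): {"v0", "v1"}, (0, 2): {"v2"},
    (1, 1): {"v3"},       (1, 2): {"v4"},
    (2, 2): {"v5", "v6"},
}
print(sorted(select_level(clusters, 0, 2)))  # coarsest mesh M0
print(sorted(select_level(clusters, 2, 2)))  # finest mesh M2
```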
Table 3
Definitions of the new LoA profiles

Name          Definition
LoA High      All joints are animatable.
LoA Medium    Fingers, toes, and spine joints are composed into a single joint.
LoA Low       The only animatable joints are those of the VeryLow profile, plus the neck joint, the elbow joints and the knee joints.
LoA VeryLow   The only animatable joints are the shoulders, the root, and the hip joints.
3.4. Body animation

3.4.1. Introduction
Body animation is commonly performed by skeleton-based animation, where the hierarchical structure of bones and joints allows a rather straightforward notion of level of complexity, which basically corresponds to the depth in this hierarchy. The H-Anim group specifies some basic LoA [34], see Table 1, which we extend with modifications and new methods in the following sections.

3.4.2. Modified LoA
Following the H-Anim specification, we redefine the LoA, as shown in Table 2, linked to the original values but more suitable for our global adaptation architecture. For instance, the lowest level includes the shoulders and hips to enable minimalist motions even at this level. The medium level is also a stronger simplification of the spine compared to the original H-Anim definition. The detailed definition of our new LoA is given in Table 3, which lists the animatable joints at each level. Such levels of complexity are used to select an adapted LoD for the animation of the virtual human.

Table 1
H-Anim LoA profiles

Level   Brief description     No. of joints
0       Minimum               1
1       Low-end/real-time     18
2       Simplified spine      71
3       Full hierarchy        89
Table 2
Newly defined LoA profiles

Level      Brief description     No. of joints
Very low   Minimum               5
Low        Low-end/real-time     10
Medium     Simplified spine      35
High       Full hierarchy        89
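As an illustration of how such LoA profiles could drive the adaptation of a skeleton-based animation stream, the Python sketch below filters a frame of joint rotations down to the joints allowed at a given level. The joint names and per-level sets are illustrative stand-ins, not the exact 5/10/35/89-joint definitions of Tables 2 and 3.

```python
# Sketch of LoA-based joint filtering for skeleton-driven body animation.
# The per-level joint sets are illustrative stand-ins for the profiles of
# Tables 2 and 3, not the exact normative joint lists.
LOA_JOINTS = {
    "VeryLow": {"l_shoulder", "r_shoulder", "root", "l_hip", "r_hip"},
}
LOA_JOINTS["Low"] = LOA_JOINTS["VeryLow"] | {
    "neck", "l_elbow", "r_elbow", "l_knee", "r_knee"}
LOA_JOINTS["Medium"] = LOA_JOINTS["Low"] | {
    "spine", "l_wrist", "r_wrist", "l_ankle", "r_ankle"}

def adapt_frame(frame, loa):
    """Keep only the joint rotations that are animatable at the given LoA."""
    allowed = LOA_JOINTS.get(loa)
    if allowed is None:              # "High": every joint is animatable
        return dict(frame)
    return {joint: rot for joint, rot in frame.items() if joint in allowed}

frame = {"root": (0.0, 0.1, 0.0), "l_elbow": (0.3, 0.0, 0.0),
         "l_wrist": (0.1, 0.0, 0.2)}
print(adapt_frame(frame, "VeryLow"))   # only the root joint survives
```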
Table 4
Elementary LoA zones

Zone   Brief description     No. of joints
1      Face only             2
2      Right and left arms   49
3      Torso                 25
4      Pelvis                4
5      Legs                  4
6      Feet                  4
For instance, some virtual humans that are part of an audience in a theatre do not require a high LoA, so LoA VeryLow is enough, while the main character on stage, which is moving, talking and the focus of the scene, requires a high LoA (thus LoA Medium or High). Though the gain is not so significant for a single virtual human, this feature provides efficient control for the animation of a whole crowd and/or for adapting animations according to terminal and network capabilities.

3.4.3. Body regions
The main purpose of defining interest zones is to allow the user to select the parts they are most interested in, as shown in Table 4. For instance, if the virtual human is presenting a complete script on how to cook, but the user is only interested in listening to the presentation (and not in the motions), then the animation can be adapted to either the upper body, or even just the face and the shoulders. Another example would be a tourist guide pointing at several locations such as theatres or parks; the user may only wish to focus on the arms and hands of the presenter. The lower-level interest zones, consisting of the six zones shown in Fig. 5, can be composed into higher-level ones by two different methods:

* Predefined zones: a set of higher-level zones predefined as combinations of elementary zones, as described in Table 5.
Table 5
Predefined zones and their correspondence with elementary zones

Name                     Definition
LoAZone All              1+2+3+4+5+6
LoAZone Face             1
LoAZone FaceShoulders    1+2+3
LoAZone UpperBody        1+2+3+4
LoAZone LowerBody        4+5+6
Table 6
Mask to select elementary zones

Predefined   Mask   No. of joints
All          0      94
Face         1      1
Shoulders    7      77
UpperBody    15     82
LowerBody    45     17
* Mask: a value representing which elementary zones are active. The mask is a binary accumulator that can take different values; a few of them, those corresponding to the predefined zones, are presented in Table 6 (a small sketch of this mechanism is given below).
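The sketch below shows one plausible way to encode and decode such a mask, assuming elementary zone z (1 to 6) is mapped to bit z-1. The exact bit assignment used by the codec is not specified here, so this is only one possible convention; under it, the combination 1+2+3 happens to yield the value 7 listed for the Shoulders entry of Table 6.

```python
# Sketch of an elementary-zone mask (zones 1..6 of Table 4), assuming zone z
# maps to bit z-1; this bit assignment is one plausible convention only.
ZONES = {1: "face", 2: "arms", 3: "torso", 4: "pelvis", 5: "legs", 6: "feet"}

def make_mask(zones):
    mask = 0
    for z in zones:
        mask |= 1 << (z - 1)
    return mask

def active_zones(mask):
    return [z for z in ZONES if mask & (1 << (z - 1))]

face_shoulders = make_mask([1, 2, 3])   # elementary zones of FaceShoulders
print(face_shoulders)                   # 7
print(active_zones(face_shoulders))     # [1, 2, 3]
```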
3.5. Face animation

3.5.1. Introduction
The body LoA is based on a skeleton. For the face we do not have this kind of base information, so we define different areas of the face, and different levels of detail for each of these areas. The different parts of the face, shown in Fig. 6, are specified first; we then explain in detail the different LoA values for the face before introducing the possible applications of these definitions according to the platform or system used.

3.5.2. Face regions
The face is segmented into different interest zones. This segmentation makes it possible to group the Facial Animation Parameters (FAP) according to their influence zone with regard to deformation. For each zone we also define different levels of complexity, as shown in Fig. 6. The zone segmentation is based on the MPEG-4 FAP grouping, although we have merged the tongue, inner and outer lips into a single group because their displacements are strongly linked. The zones are defined as shown in Table 7. Note that the LoA does not influence the head rotation value. Depending on the user's interest, we can further increase or reduce the complexity in specific zones.
Fig. 5. Body interest zones.
Table 7
Defined facial regions/zones

Group/Zone   Contains
0            Jaw (jaw, chin, lips and tongue)
1            Eyeballs (eyeballs and eyelids)
2            Eyebrow
3            Cheeks
4            Head rotation
5            Nose
6            Ears
For example, an application based on speech requires precise animation around the lip area and less in the other parts of the face. In this case, we can reduce the complexity to a very low level for the eyeball, eyebrow, nose and ear zones, a medium level for the cheek zone, and a very high level for the jaw zone.

3.5.3. Levels of complexity
For the face, we define four levels, as shown in Table 8, with the same direct hierarchy as defined for the body, but based on the FAP influence according to the desired complexity or user preferences. Two different techniques were used to reduce the level of complexity. The first consists of grouping FAP values together (e.g. all upper-lip values can be grouped into one value). To select which FAPs should be grouped, we defined the following constraints:
Table 8
Level of articulation for face

Name          Definition
LoA High      All FAP values.
LoA Medium    Group equivalent FAP values together, 2 by 2.
LoA Low       Suppress less important FAP values and group FAP or LoA Medium FAP results together.
LoA VeryLow   Suppress unimportant facial animation parameters and add full symmetry of the face parameters.
Fig. 6. Face zones of interest.
1. All grouped FAP values must be in the same area.
2. All grouped FAP values must be under the influence of the same FAP units.
3. When two FAP values or groups are grouped by symmetry, the controlling FAP values are defined to be those of the right part of the face.

The second technique is more destructive to the overall set of values, meaning that quality is reduced more rapidly: after a certain distance from the viewing camera, some FAP values become insignificant (e.g. at a low level we remove the FAP values pertaining to the dilation of the pupils, since this deformation becomes invisible beyond a short distance).

* LoA High: this represents the use of all FAP values, i.e. the 2 high-level FAP values (never reduced, because they already adapt to the level of complexity) and the 66 low-level FAP values.
* LoA Medium: here we group certain FAP values together in order to maintain a high LoD. At this level, after regrouping, we obtain 44 FAP values rather than the maximum of 68, a reduction of over 35%.
* LoA Low: at this level, we remove many of the FAP values that become unimportant for face animation and continue to group FAP values together. After regrouping/deleting, 26 FAP values remain against the maximum of 68, a reduction of 62%.
* LoA VeryLow: here, we remove most FAP values and concentrate on the base values necessary for minimal animation. This level represents the minimum set of parameters required for animating a face: we mostly link symmetric LoA Low parameters together and remove all unimportant values. At this level, the number of FAP values is reduced to 14, an overall reduction of 79%.
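To make the grouping and suppression mechanism concrete, the sketch below applies hypothetical LoA Medium rules to an incoming FAP frame: grouped parameters are replaced by a single representative value and suppressed parameters are dropped. The FAP names and the example groups are illustrative only and do not reproduce the actual rule set.

```python
# Sketch of FAP grouping/suppression for a facial LoA (hypothetical groups).
# Each group is replaced by the mean of its members; parameters listed in
# SUPPRESSED_MEDIUM are dropped entirely.
GROUPS_MEDIUM = [("raise_l_inner_eyebrow", "raise_r_inner_eyebrow"),
                 ("stretch_l_cornerlip", "stretch_r_cornerlip")]
SUPPRESSED_MEDIUM = {"dilate_l_pupil", "dilate_r_pupil"}

def reduce_faps(faps):
    out = {k: v for k, v in faps.items() if k not in SUPPRESSED_MEDIUM}
    for group in GROUPS_MEDIUM:
        values = [out.pop(name) for name in group if name in out]
        if values:
            out[group[0]] = sum(values) / len(values)  # representative value
    return out

frame = {"raise_l_inner_eyebrow": 120.0, "raise_r_inner_eyebrow": 80.0,
         "dilate_l_pupil": 5.0, "open_jaw": 300.0}
print(reduce_faps(frame))
# {'open_jaw': 300.0, 'raise_l_inner_eyebrow': 100.0}
```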
3.5.4. Levels of articulation: overview
Fig. 7 shows a global view of all FAP values and the links between them for each LoA profile. Table 9 shows, for each LoA, how many FAP values drive each part of the face, and Fig. 8 indicates the overall reduction of FAP values.

Table 9
No. of FAP values shown against LoA profile

                 High   Medium   Low    V. Low
High Level       2      2        2      2
Jaw              31     19       12     6
Eyeballs         12     8        4      1
Eyebrows         8      3        2      1
Cheeks           4      4        1      1
Head rotation    3      3        3      1
Nose             4      3        0      0
Ears             4      2        2      0
Total            68     44       26     14
Ratio            100%   65%      38%    21%

3.5.5. Frame rate reduction
Depending on the capabilities of the target platform to reproduce the exact and complex face animation, we can also reduce the frame rate in order to reduce the network and animation load; this is done for each LoA, as shown in Table 10. In the case of a different LoA for each zone of interest, we assume the maximum of the corresponding frame rates. For each frame, we assume a bitstream of less than 1 Kbit.

3.5.6. Advantages of adaptation
With the simplification explained above, we can transmit the minimum facial information required by the context. After simplification, the FAP stream remains compatible with the MPEG-4 specification, but with a reduced size; on the client, there are two possibilities. The first consists of decoding the stream and reconstructing all FAP values according to the LoA rules; in this case, we have only reduced the bitstream. The second technique also simplifies the deformation. Most MPEG-4 facial animation engines compute a deformation for each FAP value and then combine them, and on mobile platforms the lower
number of computations made during animation, the better. We could simply apply the FAP stream to the deformation engine; in this case we do not compute 66 deformation areas and combine them, but design the deformation area directly for each level of complexity. This regrouping work could be done automatically during the pre-processing step, or be transmitted with the model. In the case of a very low LoD, only 11 FAP values involved in the linear deformation would have to be combined on the face, rather than 58.
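The following sketch illustrates this second possibility: FAP displacements are applied through deformation regions precomputed for one level of complexity, rather than through the full set of per-FAP regions. The data layout (one table of vertex/weight pairs per retained FAP, with displacement along a single axis) is an assumption of this sketch, not the MPEG-4 normative deformation model.

```python
# Sketch of the second client-side option: apply a simplified FAP stream
# through deformation regions precomputed for one level of complexity.
# Each retained FAP owns a table of (vertex index, weight) pairs; this data
# layout is an assumption of the sketch only.
def deform(vertices, fap_frame, regions):
    """Displace vertices in place; displacement here is along y only."""
    for fap, value in fap_frame.items():
        for vertex_index, weight in regions.get(fap, []):
            x, y, z = vertices[vertex_index]
            vertices[vertex_index] = (x, y + weight * value, z)
    return vertices

vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
regions = {"open_jaw": [(0, 0.002), (1, 0.001)]}   # precomputed for a low LoA
print(deform(vertices, {"open_jaw": 100.0}, regions))
# [(0.0, 0.2, 0.0), (1.0, 0.1, 0.0)]
```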
Fig. 7. FAP grouping.

Fig. 8. Graph showing reduction of FAP values.

Table 10
Frame rate reduction

              Frame rate (fps)
LoA High      25
LoA Medium    20
LoA Low       18
LoA VeryLow   15

4. Results
Table 11 shows the results of the multi-resolution model representation. The progressive mesh (PM) approach is known as a near-optimal method for its compactness in size. The discrete mesh is a set of discrete levels, which is still quite common in real-world applications. The numbers are the sizes of the VRML and BIFS files, respectively. Since the PM cannot be encoded in the BIFS format, only the approximate size of the text file is given. The highest levels of detail have 71 K and 7 K polygons for the two models, whilst the lowest have 1 K and 552 polygons, respectively. The models are constructed with five different levels. The proposed method lies in between these approaches, and is flexible and simple enough to allow adaptation with a relatively small file size. More importantly, it can be transmitted via standard MPEG streams. It also uses a simple adaptation mechanism, very similar to the simplest discrete level selection.
To verify the effectiveness of the adaptation in terms of network performance, we performed some specific comparisons using several standard animation sequences (Table 12). We ran server experiments with different compressed FAP files (using the same frame rate and compression methods) to estimate the influence of the LoA on a complete animation, comparing the overall sizes (Fig. 9). Overall, the profiles provide a mean value of 77% for the medium profile, 63% for the low profile, and 55% for the very low profile. All these files, except the last one, contain animation for all parts of the face, and for each of them we obtain the same reduction factor. For the last file, which only animates the lips, we observe a smaller reduction factor, due to the absence of FAP suppression (Table 13).
Table 11
Size of data (bytes)

              # of polygons (K)   Original model (M)   Proposed method (M)   Progressive mesh (M)   Discrete mesh (M)
Body model    71                  12.7/4.5             17.5/6.5              ~13                    51.6/12.0
Face model    7                   0.8/0.3              1.0/0.5               ~0.9                   3.9/1.6
Table 12
Overall data size comparison

                 Wow      Baf      Face23    Macro    Lips
Size
  High           21255    32942    137355    18792    44983
  Medium         16493    24654    104273    14445    36532
  Low            13416    20377    78388     11681    30948
  Very low       12086    17033    70498     8788     29322
Ratio (%)
  High           100      100      100       100      100
  Medium         78       75       76        77       81
  Low            63       62       57        62       69
  Very low       57       52       51        47       65

Table 13
Overall time computation comparison

                 Wow      Baf      Face23    Macro    Lips
Time
  High           4.17     6.10     25.04     4.08     10.90
  Medium         3.56     5.23     21.23     3.54     9.81
  Low            3.09     4.56     17.58     3.10     8.81
  Very low       2.87     4.11     15.98     2.60     8.40
Ratio (%)
  High           100      100      100       100      100
  Medium         85       86       85        87       90
  Low            74       75       70        76       81
  Very low       69       67      64        64       77
Fig. 9. Facial results with no symmetry ((a) LoA High, (b) LoA Medium, (c) LoA Low, (d) LoA VeryLow).

5. Conclusion and future work

In conclusion, we have presented a completely new area of adaptation within the MPEG framework. We have shown that whilst in most media types the adaptation process is already quite extensive, virtual humans offer an even greater multitude of adaptations and variety of context-based situations. We are currently concentrating on applying this work to an entire system, including video, audio, 2D and 3D graphics, within a European project called ISIS. In addition, we are also developing a new codec that is much more scalable and therefore better suited to adaptation; this will also include a better integration of body and face animation in the context of virtual humans (which are currently quite separate). As discussed in Section 3.2.3, the current codec does not provide much flexibility in terms of adaptation; hence we are working on a similar codec that provides the same kind of framework as the current one, but with a better structure for adaptation.

Acknowledgements

This research has been funded through the European Project ISIS (IST-2001-34545) by the Swiss Federal Office for Education and Science (OFES). The authors would like to thank T. Molet for his consultation.

References

[1] Heuer J, Casas J, Kaup A. Adaptive multimedia messaging based on MPEG-7: the M3-Box. Proceedings of the Second International Symposium on Mobile Multimedia Systems & Applications, 2000. p. 6–13.
[2] MPEG-21, ISO/IEC 21000-7 Committee Draft, ISO/IEC/JTC1/SC29/WG11/N5534, March 2003.
[3] Preda M, Preteux F. Critic review on MPEG-4 face and body animation. Proceedings of the IEEE International Conference on Image Processing (ICIP), 2002.
[4] Preda M, Preteux F. Advanced animation framework for virtual character within the MPEG-4 standard. Proceedings of the IEEE International Conference on Image Processing, September 2002.
[5] Hoppe H. Progressive meshes. SIGGRAPH, 1996. p. 99–108.
[6] Garland M, Heckbert P. Surface simplification using quadric error metrics. SIGGRAPH, 1997.
[7] Fei G, Wu E. A real-time generation algorithm of progressive mesh with multiple properties. Proceedings of the Symposium on Virtual Reality Software and Technology, 1999.
[8] Seo H, Magnenat-Thalmann N. LoD management on animating face models. Proceedings of IEEE Virtual Reality, 2000.
[9] Aubel A, Boulic R, Thalmann D. Real-time display of virtual humans: levels of detail and impostors. IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on 3D Video Technology 2000;10(2):207–17.
[10] Tecchia F, Loscos C, Chrysanthou Y. Image-based crowd rendering. IEEE Computer Graphics and Applications 2002:36–43.
[11] Wu X, Downes MS, Goktekin T, Tendick F. Adaptive nonlinear finite elements for deformable body simulation using dynamic progressive meshes. Proceedings of Eurographics EG'01, 2001. p. 349–58.
[12] Debunne G, Desbrun M, Cani MP, Barr A. Dynamic real-time deformations using space & time adaptive sampling. SIGGRAPH, 2001. p. 31–6.
[13] Capell S, Green S, Curless B, Duchamp T, Popovic Z. A multiresolution framework for dynamic deformations. Proceedings of the ACM SIGGRAPH Symposium on Computer Animation, 2002.
[14] Hutchinson D, Preston M, Hewitt T. Adaptive refinement for mass/spring simulations. Proceedings of the Eurographics Workshop on Computer Animation and Simulation, 1996.
[15] James DL, Pai DK. DyRT: dynamic response textures for real-time deformation simulation with graphics hardware. Proceedings of SIGGRAPH, 2002.
[16] Granieri JP, Crabtree J, Badler N. Production and playback of human figure motion for visual simulation. ACM Transactions on Modeling and Computer Simulation, 1995. p. 222–41.
[17] Di Giacomo T, Capo S, Faure F. An interactive forest. Proceedings of the Eurographics Workshop on Computer Animation and Simulation, 2001.
[18] Guerraz S, Perbet F, Raulo D, Faure F, Cani MP. A procedural approach to animate interactive natural sceneries. Proceedings of Computer Animation and Social Agents, 2003.
[19] Kim J, Wang Y, Chang S. Content-adaptive utility based video adaptation. Proceedings of the IEEE International Conference on Multimedia & Expo, 2003.
[20] Aggarwal A, Rose K, Regunathan S. Compander domain approach to scalable AAC. Proceedings of the 110th Audio Engineering Society Convention, 2001.
[21] Van Raemdonck W, Lafruit G, Steffens E, Otero-Pérez C, Bril R. Scalable 3D graphics processing in consumer terminals. IEEE International Conference on Multimedia and Expo, 2002.
[22] Boier-Martin I. Adaptive graphics. IEEE Computer Graphics and Applications 2003:6–10.
[23] Joslin C, Magnenat-Thalmann N. MPEG-4 animation clustering for networked virtual environments. IEEE International Conference on Multimedia and Expo, 2002.
[24] Pham Ngoc N, Van Raemdonck W, Lafruit G, Deconinck G, Lauwereins R. A QoS framework for interactive 3D applications. Proceedings of the Winter School of Computer Graphics, 2002.
[25] Schneider B, Martin I. An adaptive framework for 3D graphics in networked and mobile environments. Proceedings of Interactive Applications on Mobile Computing, 1998.
[26] Lamberti F, Zunino C, Sanna A, Fiume A, Maniezzo M. An accelerated remote graphics architecture for PDAs. Proceedings of the Web3D 2003 Symposium, 2003.
[27] Kolli G, Junkins S, Barad H. 3D graphics optimizations for ARM architecture. Proceedings of the Game Developers Conference, 2002.
[28] Chang C, Ger S. Enhancing 3D graphics on mobile devices by image-based rendering. Proceedings of the Third IEEE Pacific-Rim Conference on Multimedia, 2002.
[29] Stam J. Stable fluids. SIGGRAPH, 1999. p. 121–8.
[30] MPEG-21 DIA, ISO/IEC/JTC1/SC29/WG11/N5612, March 2003.
[31] Amielh M, Devillers S. Multimedia content adaptation with XML. Proceedings of the International Conference on MultiMedia Modeling, 2001.
[32] Amielh M, Devillers S. Bitstream syntax description language: application of XML-Schema to multimedia content adaptation. Proceedings of the International WWW Conference, 2002.
[33] Heckbert P, et al. Multiresolution surface modeling. ACM SIGGRAPH Course Notes No. 25, 1997.
[34] H-Anim version 1.1 LoA Specification, http://h-anim.org/Specifications/H-Anim1.1/appendices.html, 2000.

Thomas Di Giacomo completed a Master's degree on multiresolution methods for animation with the iMAGIS lab and the Atari (ex-Infogrames) R&D department. He is now a research assistant and a Ph.D. candidate at MIRALab, University of Geneva. His work focuses on level of detail for animation and physically based animation.

Chris Joslin obtained his Master's degree from the University of Bath, UK, and his Ph.D. in Computer Science from the University of Geneva, Switzerland. His research has focused on networked virtual environment systems and real-time 3D spatial audio, and he is currently a Senior Research Assistant developing scalable 3D graphics codecs for animation and representation, specifically for MPEG-4 and MPEG-21 adaptation.

Stephane Garchery is a computer scientist who studied at the Universities of Grenoble and Lyon, France. He is working at the University of Geneva as a senior research assistant in MIRALab, participating in research on facial animation for real-time applications. One of his main tasks is developing MPEG-4 facial animation engines, applications and tools for automatic facial data construction. He has developed different kinds of facial animation engines based on MPEG-4 Facial Animation Parameters for different platforms (stand
alone, web applet and mobile device), and various tools for quick and interactive design.
HyungSeok Kim is a post-doctoral assistant at MIRALab, University of Geneva. He received his Ph.D. in Computer Science in February 2003 at VRLab, KAIST: "Multiresolution model generation of texture-geometry for the real-time rendering". His main research field is Real-time Rendering for Virtual Environments, more specifically Multiresolution Modeling for Geometry and Texture. He is also interested in 3D Interaction techniques and Virtual Reality Systems.
Nadia Magnenat-Thalmann has pioneered research into virtual humans over the last 20 years. She obtained several Bachelor’s and Master’s degrees in various disciplines and a Ph.D. in Quantum Physics from the University of Geneva. From 1977 to 1989, she was a Professor at the University of Montreal in Canada. In 1989, she founded MIRALab at the University of Geneva.