Synthetic and SNHC audio in MPEG-4


Signal Processing: Image Communication 15 (2000) 445-461

Eric D. Scheirer^a,*, Youngjik Lee^b, Jae-Woo Yang^b

^a Machine Listening Group, MIT Media Laboratory, E15-401D, Cambridge, MA 02143-4307, USA
^b Switching and Transmission Technology Laboratories, ETRI, Korea

Abstract

In addition to its sophisticated audio-compression capabilities, MPEG-4 contains extensive functions supporting synthetic sound and the synthetic/natural hybrid coding of sound. We present an overview of the Structured Audio format, which allows efficient transmission and client-side synthesis of music and sound effects. We also provide an overview of the Text-to-Speech Interface, which standardizes a single format for communication with speech synthesizers. Finally, we present an overview of the AudioBIFS portion of the Binary Format for Scene Description, which allows the description of hybrid soundtracks, 3-D audio environments, and interactive audio programming. The tools provided for advanced audio functionality in MPEG-4 are a new and important addition to the world of audio standards. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: MPEG-4; Structured audio; Text-to-Speech; AudioBIFS; SNHC audio; Audio standards; Audio synthesis

(Parts of this paper will also appear in A. Puri and T. Chen (Eds.), Advances in Multimedia: Systems, Standards, and Networks, Marcel Dekker, New York, 2000.)

* Corresponding author. Tel.: +1-617-253-0112; fax: +1-617-258-6264. E-mail address: [email protected] (E.D. Scheirer).

1. Introduction

This article describes the parts of MPEG-4 that govern the compression, representation, and transmission of synthetic sound and the combination of synthetic and natural sound into hybrid soundtracks. Through these tools, MPEG-4 provides advanced capabilities for ultra-low-bit-rate sound transmission, interactive sound scenes, and flexible, repurposable delivery of sound content.

We will discuss three MPEG-4 audio tools. The first, MPEG-4 Structured Audio, standardizes precise, efficient delivery of synthetic music and sound effects. The second, the MPEG-4 Text-to-Speech Interface, standardizes a representation protocol for synthesized speech, an interface to text-to-speech synthesizers, and the automatic synchronization of synthetic speech and "talking head" animated face graphics [24]. The third, MPEG-4 AudioBIFS (part of the main BIFS framework [22]), standardizes terminal-side mixing and post-production of audio soundtracks. AudioBIFS enables interactive soundtracks and 3-D sound presentation for virtual-reality applications. In MPEG-4, the capability to mix and synchronize real sound with synthetic sound is termed Synthetic/Natural Hybrid Coding of Audio, or SNHC Audio.

The organization of the present paper is as follows. First, we provide a general overview of the objectives for synthetic and SNHC audio in MPEG-4.



This section also introduces concepts from speech and music synthesis to readers whose primary expertise may not be in the field of audio. Next, a detailed description of the synthetic-audio codecs in MPEG-4 is provided. Finally, we describe AudioBIFS and its use in the creation of SNHC audio soundtracks.

2. Synthetic audio in MPEG-4: concepts and requirements

In this section, we introduce speech synthesis and music synthesis. Then we discuss the inclusion of these technologies in MPEG-4, focusing on the capabilities provided by synthetic audio and the types of applications that are better addressed with synthetic audio coding than with natural audio coding.

2.1. Relationship between natural and synthetic coding

Modern standards for natural audio coding [1,2] use perceptual models to compress natural sound. In coding synthetic sound, perceptual models are not used; rather, very specific parametric models are used to transmit sound descriptions. The descriptions are received at the decoding terminal and converted into sound through real-time sound synthesis. The parametric model for the Text-to-Speech Interface is fixed in the standard; in the Structured Audio toolset, the model itself is transmitted as part of the bitstream and interpreted by a reconfigurable decoder.

Natural and synthetic audio are not unrelated methods for transmitting sound. Especially as sound models in perceptual coding grow more sophisticated, the boundary between "decompression" and "synthesis" becomes somewhat blurred. Vercoe, Gardner and Scheirer [28] have discussed the relationships among parametric models of sound, digital sound creation and transmission, perceptual coding, parametric compression, and various techniques for algorithmic synthesis.

2.2. Concepts in speech synthesis

Text-to-speech (TTS) systems generate speech sound according to given text. This technology enables the translation of text information into speech so that the text can be transferred through speech channels such as telephone lines. Today, TTS systems are used for many applications, including automatic voice-response systems (the "telephone menu" systems that have become popular recently), e-mail reading, and information services for the visually handicapped [9,10].

TTS systems typically consist of multiple processing modules, as shown in Fig. 1. Such a system accepts text as input and generates a corresponding phoneme sequence. Phonemes are the smallest units of human language; each phoneme corresponds to one sound used in speech. A surprisingly small set of phonemes, about 120, is sufficient to describe all human languages. The phoneme sequence is used in turn to generate a basic speech sequence without prosody, that is, without pitch, duration, and amplitude variations. In parallel, a text-understanding module analyzes the input for phrase structure and inflections. Using the result of this processing, a prosody generation module creates the proper prosody for the text. Finally, a prosody control module changes the prosody parameters of the basic speech sequence according to the results of the text-understanding module, yielding synthesized speech.

Fig. 1. Block diagram of a text-to-speech system, showing the interaction between text-to-phoneme conversion, text understanding, and prosody generation and application.
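To make the block diagram of Fig. 1 concrete, the following Python sketch wires the same stages together. It is only an illustration of the data flow: every class, function, and rule in it is hypothetical and greatly simplified, and it is not part of any TTS system or of the MPEG-4 standard.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Phoneme:
    symbol: str          # e.g. an IPA symbol
    duration_ms: float   # assigned by prosody control
    pitch_hz: float      # base frequency assigned by prosody control
    energy: float        # relative amplitude

def text_to_phonemes(text: str) -> List[Phoneme]:
    """Placeholder text-to-phoneme conversion (normally dictionary/rule based)."""
    return [Phoneme(symbol=c, duration_ms=80.0, pitch_hz=120.0, energy=1.0)
            for c in text if c.isalpha()]

def analyze_text(text: str) -> dict:
    """Placeholder text-understanding step: phrase structure, inflections."""
    return {"is_question": text.strip().endswith("?")}

def apply_prosody(phonemes: List[Phoneme], analysis: dict) -> List[Phoneme]:
    """Prosody generation/control: modify pitch, duration, amplitude."""
    if analysis["is_question"]:
        for i, p in enumerate(phonemes):
            # crude rising contour for questions
            p.pitch_hz *= 1.0 + 0.3 * i / max(1, len(phonemes) - 1)
    return phonemes

def synthesize(phonemes: List[Phoneme]) -> bytes:
    """Placeholder waveform generation from the prosody-annotated phonemes."""
    return b""  # a real synthesizer would return audio samples here

speech = synthesize(apply_prosody(text_to_phonemes("Is it raining?"),
                                  analyze_text("Is it raining?")))
```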


One of the first successful TTS systems was the DecTalk English speech synthesizer developed in 1983 [11]. This system produces very intelligible speech and supports eight different speaking voices. However, developing speech synthesizers of this sort is a difficult process, since it is necessary to extract all the acoustic parameters for synthesis. It is a painstaking process to analyze enough data to accumulate the parameters that are used for all kinds of speech.

In 1992, CNET in France developed the pitch-synchronous overlap-and-add (PSOLA) method to control the pitch and phoneme duration of synthesized speech [25]. Using this technique, it is easy to control the prosody of synthesized speech. Thus synthesized speech using PSOLA sounds more natural; it can also use human speech as a guide to control the prosody of the synthesis, in an analysis-synthesis process that can also modify the tone and duration. However, if the tone is changed too much, the resulting speech is easily recognized as artificial.

In 1996, ATR in Japan developed the CHATR speech synthesizer [10]. This method relies on short samples of human speech without modifying any characteristics; it locates and sequences phonemes, words, or phrases from a database. A large database of human speech is necessary to develop a TTS system using this method. Automatic tools may be used to label each phoneme of the human speech to reduce the development time; typically, hidden Markov models (HMMs) are used to align the best phoneme candidates to the target speech. The synthesized speech is very intelligible and natural; however, this method of TTS requires large amounts of memory and processing power.

The applications of TTS are expanding in telecommunications, personal computing, and the Internet. Current research in TTS includes voice conversion (synthesizing the sound of a particular speaker's voice), multi-language TTS, and enhancing the naturalness of speech through more sophisticated voice models and prosody generators.
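The heart of the PSOLA method described above is easy to sketch: pitch-synchronous, windowed grains of the original speech are extracted around analysis pitch marks and overlap-added at new synthesis pitch marks, so that moving the marks closer together or further apart changes the pitch and duration. The following numpy sketch shows only that overlap-add core, under the strong assumptions that the pitch marks are already known and the pitch period is constant; it is an illustration of the idea, not the CNET algorithm itself.

```python
import numpy as np

def psola_resynthesize(signal, analysis_marks, synthesis_marks, period):
    """Overlap-add two-period Hann-windowed grains taken at analysis pitch
    marks onto new synthesis pitch marks (all positions in samples)."""
    out = np.zeros(int(synthesis_marks[-1] + 2 * period))
    for t_s in synthesis_marks:
        # pick the analysis grain whose centre is closest to the target time
        t_a = analysis_marks[np.argmin(np.abs(analysis_marks - t_s))]
        lo, hi = int(t_a - period), int(t_a + period)
        if lo < 0 or hi > len(signal):
            continue
        grain = signal[lo:hi] * np.hanning(hi - lo)
        out[int(t_s - period):int(t_s + period)] += grain
    return out

# Toy example: lower the pitch of a 200 Hz pulse-like signal to roughly 160 Hz.
fs = 16000
period = fs // 200
x = np.tile(np.r_[1.0, np.zeros(period - 1)], 50)      # crude "voiced" source
analysis_marks = np.arange(period, len(x) - period, period)
synthesis_marks = np.arange(period, len(x) - period, int(period * 1.25))
y = psola_resynthesize(x, analysis_marks, synthesis_marks, period)
```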


2.3. Applications for speech synthesis in MPEG-4

The synthetic speech system in MPEG-4 was designed to support interactive applications using text as the basic content type. Some of these applications include on-demand story telling, motion picture dubbing, and "talking head" synthetic videoconferencing.

In the story-telling-on-demand (STOD) application, the user can select a story from a huge database stored on fixed media. The STOD system reads the story aloud, using the MPEG-4 Text-to-Speech Interface (henceforth, TTSI) with the MPEG-4 facial animation tool or with appropriately selected images. The user can stop and resume the speech at any moment through the user interface of the local machine (for example, mouse or keyboard). The user can also select the gender, age, and speech rate of the electronic story teller.

In a motion-picture-dubbing application, synchronization between the MPEG-4 TTSI decoder and the encoded moving picture is the essential feature. The architecture of the MPEG-4 TTS decoder provides several levels of synchronization granularity. By aligning the composition time of each sentence, coarse synchronization can be easily achieved. To get more finely tuned synchronization, information about the speaker's lip shape can be used. The finest granularity of synchronization can be achieved by using detailed prosody transmission and video-related information such as sentence duration and offset time within the sentence. With this synchronization capability, the MPEG-4 TTSI can be used for motion picture dubbing by following the lip shape and the corresponding time in the sentence.

To enable synthetic video teleconferencing, the TTSI decoder can be used to drive the facial-animation decoder in synchronization. Bookmarks in the TTSI bitstream control an animated face by using facial animation parameters (FAPs); in addition, the animation of the mouth can be derived directly from the speech phonemes. Other applications of the MPEG-4 TTSI include speech synthesis for avatars in virtual reality (VR) applications, voice newspapers, and low-bit-rate Internet voice tools.

2.4. Concepts in music synthesis

The field of music synthesis is too large and varied to give a complete overview here. An artistic history by Chadabe [4] and a technical overview by Roads [16] are sources that provide more background on the concepts developed over the last 35 years.


The techniques used in MPEG-4 for synthetic music transmission were originally developed by Mathews [13,14], who demonstrated the first digital synthesis programs. The so-called unit-generator model of synthesis he developed has proven to be a robust and practical tool for musicians interested in the precise control of sound. This paradigm has been refined by many others, particularly Vercoe [27], whose language Csound is very popular today with composers. In the unit-generator model (also called the Music-N model, after Mathews' languages Music-III, Music-IV, and Music-V and Vercoe's languages Music-11 and Music-360), algorithms for sound synthesis are described as the interaction of a number of basic primitives, such as oscillators and envelope functions. Modern languages, such as Csound and the MPEG-4 language SAOL (described below), provide a rich set of built-in functions that musicians can use to create synthesizers.

The notion of transmitting sound by sending algorithms in a synthesis-description language was suggested as early as 1991 by J.O. Smith [23], but apparently not attempted in practice until the "NetSound" experiment by Casey and Smaragdis [3], which used Csound to code and transmit sound.

Contemporaneously with the development of advanced software-synthesis technology, which occurred primarily in the academic world, the MIDI (Musical Instrument Digital Interface) protocol [15] was standardized by the music-synthesizer industry and became popular. MIDI is a protocol for communication between controllers (such as keyboards) and synthesizer modules; the protocol allows the specification of which note to play, but not which algorithms should be used for synthesis. The algorithms for synthesis are implementation dependent in a MIDI synthesizer. In recent years, inexpensive soundcards for personal computers have become available. These devices typically provide limited sound quality and synthesis features (typically, each sound card provides only a single algorithm for synthesis) as well as direct audio output. It is somewhat ironic that although general-purpose software synthesis was the first method developed for sound creation on digital computers, by the time digital computers became popular, much simpler and less satisfying technology had emerged as a de facto standard.

2.5. Requirements and applications for audio synthesis in MPEG-4

The goal in the development of MPEG-4 Structured Audio (the toolset providing audio synthesis capability in MPEG-4) was to reclaim a general-purpose software-synthesis model for use by a broad spectrum of musicians and sound designers. By incorporating this technology in an international standard, the development of compatible tools and implementations is encouraged, and such capabilities will become available as part of everyday multimedia sound hardware.

Including high-quality audio synthesis in MPEG-4 also serves a number of important goals within the standard itself. It allows the standard to provide capabilities that would not be possible through natural sound, or through simpler MIDI-driven parametric synthesis. We list some of these capabilities below.

The Structured Audio specification allows sound to be transmitted at very low bit-rates. Many useful soundtracks and compositions can be coded in Structured Audio at bit-rates from 0.1 to 1 kbps; as content developers become more practiced in low-bit-rate coding with such tools, the bit-rates can undoubtedly be pushed even lower. In contrast to perceptual coding, there is no necessary trade-off in algorithmic coding between audio quality and bit-rate. Low-bit-rate compressed streams can still decode into full-bandwidth, full-quality stereo output. Using synthetic coding, the trade-off is more accurately described as one of flexibility (the quality and generality of the sound models) versus bit-rate [19].

Interactive accompaniment, dynamic scoring, synthetic performance [26], and other new-media music applications can be made more functional and sophisticated by using synthetic music rather than natural music. In any application requiring dynamic control over the music content itself, a structured representation of music is more appropriate than a perceptually coded one.
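As a rough, purely illustrative check on the bit-rate figures quoted above (our own arithmetic, not drawn from the standard), consider a moderately busy piece transmitted as score events:

```python
# Back-of-the-envelope estimate of a score-driven streaming bit-rate.
# The sizes below are illustrative assumptions, not figures from the standard.
notes_per_second = 4          # a moderately busy musical texture
bytes_per_note_event = 16     # timestamp, instrument, duration, a few parameters
bitrate = notes_per_second * bytes_per_note_event * 8
print(bitrate, "bits per second")   # 512 bits/s, i.e. about 0.5 kbps
```

The synthesis algorithms themselves are a one-time cost in the decoder configuration header, so the steady-state rate is dominated by the score events.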


Unlike existing music-synthesis standards such as the MIDI protocol¹, structured coding with downloaded synthesis algorithms allows accurate sound description and tight control over the sound produced. Allowing any method of synthesis to be used, not only those included in a low-cost MIDI device, provides composers with a broader range of options for sound creation.

There is an attractive unification in the MPEG-4 standard between the capabilities for synthesis and those used for effects processing. By carefully specifying the capabilities of the Structured Audio synthesis tool, the AudioBIFS tool for audio scene description (see Section 5) is much simplified and the standard as a whole is cleaner.

Finally, Structured Audio is an example of a new concept in coding technology: that of the flexible, or downloadable, decoder. This idea, considered but abandoned for MPEG-4 video coding, is a powerful one whose implications have yet to be fully explored. The Structured Audio toolset is computationally complete in that it is capable of simulating a Turing machine [7], and thus of executing any computable sound algorithm. It is possible to download new audio decoders, even new perceptual coders, into the MPEG-4 terminal as Structured Audio bitstreams; the requirements and applications for such a capability remain as a question for future research [29].

The Structured Audio tools can be used to describe algorithms of arbitrary complexity. To enable guaranteed decodability, a complexity-measurement tool for Structured Audio bitstreams is included in the Conformance part of the MPEG-4 standard. The capability required of a conforming decoder at a certain Level is described in reference to this tool. A content author can measure his bitstreams by simulating them with this complexity-analysis tool in order to guarantee that they are decodable at the desired Level of decoder performance.

¹ The MIDI protocol is only specified for communicating between a control device and a synthesizer. However, the need for efficient structured-audio coding schemes has led MIDI files (computer files made up of MIDI commands) to fill a niche for Internet representation of music.


3. The MPEG-4 text-to-speech interface

Text, that is, a sequence of words written in some human language, is a widely used representation for speech data in stand-alone applications. However, it is difficult with existing technology to use text as a speech representation in multimedia bitstreams for transmission. The MPEG-4 Text-to-Speech Interface (TTSI) is defined so that speech can be transmitted as a bitstream containing text. It also provides interoperability among text-to-speech (TTS) synthesizers by standardizing a single bitstream format for this purpose.

Synthetic speech is becoming a rather common media type; it plays an important role in various multimedia application areas. For instance, by using TTS functionality, multimedia content with narration can be easily created without recording natural speech. Before MPEG-4, however, there was no way for a multimedia content provider to easily give instructions to an unknown TTS system. In MPEG-4, a single common interface for TTS systems is standardized; this interface allows speech information to be transmitted in the International Phonetic Alphabet (IPA), or in a textual (written) form of any language.

The MPEG-4 TTSI tool is a hybrid, multi-level, scalable TTS interface that can be considered a superset of the conventional TTS framework. This extended TTSI can utilize prosodic information taken from natural speech in addition to input text, and can thus generate much higher-quality synthetic speech. The interface and its bitstream format are strongly scalable in terms of this added information; for example, if some parameters of prosodic information are not available, a decoder can generate the missing parameters by rule. Algorithms for speech synthesis and text-to-phoneme translation are not normative in MPEG-4, but to meet the goal that underlies the MPEG-4 TTSI, a decoder should have the capability to utilize all the information provided in the TTSI bitstream.

As well as an interface to text-to-speech synthesis systems, MPEG-4 specifies a joint coding method for phonemic information and facial animation (FA) parameters.


Using this technique, a single bitstream may be used to control both the TTS interface and the facial animation visual object decoder. The functionality of this extended TTSI thus ranges from conventional TTS to natural speech coding, and its application areas from simple TTS to audiovisual presentation with TTS and moving-picture dubbing with TTS. The next section describes the functionality of the MPEG-4 TTSI and its decoding process.

3.1. MPEG-4 TTSI functionality

The MPEG-4 TTSI has important functionalities both as an individual codec and in synchronization with the facial animation techniques described by Tekalp et al. [24]. As a standalone codec, the bitstream format provides hooks to control the language being transmitted, the gender and age of the speaker, the speaking rate, and the prosody (pitch contour) of the speech. It can pause with no cost in bandwidth, by transmission of a silence sentence that contains only a silence duration. A "trick mode" allows operations such as start, stop, rewind, and fast forward to be applied to the synthesized speech.

The basic TTSI format is extremely low bit-rate. In the most compact method, one can send a bitstream that contains only the text to be spoken and its length. In this case, the bit-rate is 200 bits per second. The synthesizer will add predefined or rule-generated prosody to the synthesized speech (in a non-normative fashion). The synthesized speech with predefined prosody will deliver emotional content to the listener. On the other hand, one can send a bitstream that contains text as well as the detailed prosody of the original speech, that is, the phoneme sequence, the duration of each phoneme, the base frequency (pitch) of each phoneme, and the energy of each phoneme. The synthesized speech in this case will be very similar to the original speech, since it employs the original prosody. Thus, one can send speech with subtle nuances, without any loss of intonation, using the MPEG-4 TTSI.

One of the important features of the MPEG-4 TTSI is the ability to synchronize synthetic speech with the lip movements of a computer-generated avatar or "talking head". In this technique, the TTS synthesizer generates phoneme sequences and their durations, and communicates them to the facial animation visual object decoder so that it can control the lip movement. With this feature, one can not only hear the synthetic speech but also see the synchronized lip movement of the avatar.

The MPEG-4 TTSI has the additional capability to send facial expression bookmarks through the text. The bookmark is identified by the opening marker "<FAP" and lasts until the closing bracket ">". In this case, the TTS synthesizer transfers the bookmark directly to the face decoder so that it can control the facial animation visual object accordingly. The facial animation parameter (FAP) of the bookmark is applied to the face until another bookmark resets the FAP. Playing sentences correctly, even under trick-mode manipulations, requires that the bookmarks for the text to be spoken are repeated at the beginning of each sentence. These bookmarks initialize the face to the state that was defined by the previous sentence. In such a case, some mismatch of synchronization can occur at the beginning of a sentence; however, the system recovers when the new bookmark is processed.

Through the MPEG-4 elementary stream synchronization capabilities [6], the MPEG-4 TTSI can perform synthetic motion-picture dubbing. The MPEG-4 TTSI decoder can use the system clock to select an adequate speech location in a sentence, and communicates this to the TTS synthesizer, which assigns an appropriate duration to each phoneme. Using this method, synthetic speech can be synchronized with the lip shape in the moving image.

3.2. MPEG-4 TTSI decoding process

Fig. 2 shows a schematic of the MPEG-4 TTSI decoder. The architecture of the decoder can be described as a collection of interfaces. The normative behavior of the MPEG-4 TTSI is described in terms of these interfaces, not the sound and/or animated faces that are produced.

Fig. 2. Overview of the MPEG-4 TTSI decoding process, showing the interaction between the syntax parser, the TTS synthesizer, and the face animation decoder. The shaded blocks are not normatively described, and operate in a terminal-dependent manner.

In particular, the TTSI standard specifies:

1. The interface between the demux and the syntactic decoder. Upon receiving a multiplexed MPEG-4 bitstream, the demux passes coded MPEG-4 TTSI elementary streams to the syntactic decoder. Other elementary streams are passed to other decoders.
2. The interface between the syntactic decoder and the TTS synthesizer. Receiving a coded MPEG-4 TTSI bitstream, the syntactic decoder passes a number of different pieces of data to the TTS synthesizer. The input type specifies whether TTS is being used as a standalone function, or in synchronization with facial animation or motion-picture dubbing. The control commands sequence specifies the language, gender, age, and speech rate of the speaking voice. The input text specifies the character string for the text to be synthesized. Auxiliary information such as IPA phoneme symbols (which allow text in a language foreign to the decoder to be synthesized), lip shape patterns, and trick-mode commands is also passed along this interface.
3. The interface from the TTS synthesizer to the compositor. Using the parameters described in the previous paragraph, the synthesizer constructs a speech sound and delivers it to the audio composition system (described in Section 5).
4. The interface from the compositor to the TTS synthesizer. This interface allows local control of the synthesized speech by users. Using this interface and an appropriate interactive scene, the user can start, stop, rewind, and fast-forward the TTS system. Controls can also allow the user to change the speech rate, pitch range, gender, and age of the synthesized speech.

5. The interface between the TTS synthesizer and the phoneme/bookmark-to-FAP converter. In the MPEG-4 framework, the TTS synthesizer and the face animation can be driven synchronously by the same input control stream, which is the text input to the MPEG-4 TTSI. From this input stream, the TTS synthesizer generates synthetic speech and, at the same time, phoneme symbols, phoneme durations, word boundaries, stress parameters, and bookmarks. The phonemic information is passed to the phoneme/bookmark-to-FAP converter, which generates the relevant facial animation accordingly. Through this mechanism, the synthesized speech and facial animation are synchronized when they enter the scene composition framework.
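The five interfaces above are normative in terms of the information they carry, not in terms of any programming API. Purely as an aid to reading Fig. 2, the following Python sketch groups that information into data structures; all of the names and types here are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TTSIInput:
    """Data passed from the syntactic decoder to the TTS synthesizer (interface 2)."""
    input_type: str                     # "standalone", "facial_animation", or "dubbing"
    language: str
    gender: str
    age: int
    speech_rate: float
    text: str
    ipa_phonemes: Optional[List[str]] = None   # for text in a language foreign to the decoder
    prosody: Optional[dict] = None             # phoneme durations, pitch, energy, if transmitted
    bookmarks: List[str] = field(default_factory=list)  # facial-expression bookmarks

@dataclass
class PhonemeEvent:
    """Data passed from the synthesizer to the phoneme/bookmark-to-FAP converter (interface 5)."""
    symbol: str
    duration_ms: float
    word_boundary: bool
    stress: float

def drive_face(events: List[PhonemeEvent]) -> List[dict]:
    """Hypothetical phoneme-to-FAP conversion: one viseme-like entry per phoneme."""
    return [{"t_ms": sum(e.duration_ms for e in events[:i]), "viseme": e.symbol}
            for i, e in enumerate(events)]
```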

4. MPEG-4 structured audio

The tool that provides audio synthesis capability in MPEG-4 is termed the Structured Audio coder. This name originates in the comparison by Vercoe et al. [28] of different methods of parameterized sound generation; it refers to the fact that this tool provides general access to any method of structuring sound.


While music-synthesis technology is typically conceived mainly as a tool for composers and musicians, MPEG-4 Structured Audio is, finally, a codec like the other audio tools in MPEG-4. That is, the standard specifies a bitstream format and a method of decoding it into sound. While the techniques used in decoding the bitstream are those taken from the practice of general-purpose digital synthesis, and the bitstream format is somewhat unusual, the overall paradigm is identical to that of the natural audio codecs in MPEG-4.

This section will describe the organization of the Structured Audio standard, focusing first on the bitstream format and then on the decoding process. There is a second, simpler tool for using parameterized wavetable synthesis with downloaded sounds; we will discuss this tool at the end of the section, and then conclude with a short discussion on encoding Structured Audio bitstreams.

4.1. Structured audio bitstream format

The Structured Audio bitstream format makes use of the new coding paradigm known as algorithmic structured audio, described by Vercoe et al. [28] and Scheirer [18]. In this framework, a sound transmission is decomposed into two pieces: a set of synthesis algorithms that describe how to create sound, and a sequence of synthesis controls that specify which sounds to create. The synthesis model is not fixed in the MPEG-4 terminal; rather, the standard specifies a framework for reconfigurable software synthesis. Any current or future method of digital sound synthesis can be used in this framework.

Like the other MPEG-4 media types, a Structured Audio bitstream consists of a decoder configuration header that tells the decoder how to begin the decoding process, and then a stream of bitstream access units that contain the compressed data. In Structured Audio, the decoder configuration header contains the synthesis algorithms and auxiliary data, and the bitstream access units contain the synthesis control instructions.
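Conceptually, then, a Structured Audio stream is a one-time configuration object followed by a sequence of timed control events. The Python sketch below models that container structure only; the field names are hypothetical, and the real bitstream uses a compact binary syntax rather than objects like these.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DecoderConfig:
    """Decoder configuration header: everything needed before synthesis starts."""
    saol_orchestra: str                                       # tokenized SAOL in the real bitstream
    sample_banks: List[bytes] = field(default_factory=list)   # e.g. auxiliary wavetable data
    preload_score: List[str] = field(default_factory=list)    # score lines known at session start

@dataclass
class AccessUnit:
    """One streaming access unit: a (possibly timestamped) SASL control line."""
    composition_time: float
    sasl_line: str

@dataclass
class StructuredAudioStream:
    config: DecoderConfig
    access_units: List[AccessUnit]
```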

4.1.1. Decoder configuration header and SAOL

The decoder configuration header specifies the synthesis algorithms using a new unit-generator language called SAOL (pronounced "sail"), which stands for Structured Audio Orchestra Language. The syntax and semantics of SAOL are specified precisely in the standard; MPEG-4 contains the formal specification of SAOL as a language. The similarities and differences between SAOL and other popular music languages have been discussed elsewhere [21].

While space does not provide for a full tutorial on SAOL in the present article, we give a short example so that the reader may understand the flavor of the language. Fig. 3 shows the textual representation of a complete SAOL synthesizer, or orchestra. This synthesizer defines one instrument (called "beep") for use in a Structured Audio session. Each bitstream begins with a SAOL orchestra that provides the instruments needed in that session.

Fig. 3. A SAOL orchestra, containing one instrument that makes a ramped complex tone. See text for in-depth discussion of the orchestra code.

The synthesizer description shown in Fig. 3 begins with a global header that specifies the sampling rate (in this case, 32 kHz) and control rate (in this case, 1 kHz) for this orchestra. SAOL is a two-rate signal language: every variable represents either an audio signal that varies at the sampling rate or a control signal that varies at the control rate. The sampling rate of the orchestra limits the maximum audio frequencies that may be present in the sound, and the control rate limits the speed with which parameters may vary. Higher values for these parameters lead to better sound quality but require more computation. This trade-off between quality and complexity is left to the decision of the content author (who should respect the Level definitions discussed in Section 2.5) and can differ from bitstream to bitstream.

After the global header comes the specification for the instrument beep. This instrument depends on two parameter fields (p-fields) named pitch and amp. The number, names, and semantics of the p-fields for each instrument are not fixed in the standard but decided by the content author. The values for the p-fields are set in the score, which is described in Section 4.1.2. The instrument defines two signal variables: out, which is an audio signal, and env, which is a control signal. It also defines a stored-function table called sound.

Stored-function tables, also called wavetables, are crucial to general-purpose software synthesis. Many synthesis algorithms can be realized as the interaction of a number of oscillators creating appropriate signals; wavetables are used to store the periodic functions needed for this purpose. A stored-function table in SAOL is created by using one of several wavetable generators (in this case, harm) that allocate space and fill the table with data values. The harm wavetable generator creates one cycle of a periodic function by summing a set of zero-phase, harmonically related sinusoids; the function placed in the table called sound consists of the sum of three sine waves at relative frequencies 1, 2, and 4, with amplitudes 1, 0.5, and 0.2, respectively. This function is sampled at 2048 points per cycle to create the wavetable.

To create sound, the beep instrument uses an interaction of two unit generators, kline and oscil. There is a set of about 100 unit generators specified in the standard, and content authors can also design and deliver their own. The kline unit generator generates a control-rate envelope signal; in the example instrument it is assigned to the control-rate signal variable env. The kline unit generator interpolates a straight-line function between several (time, value) control points; in this example, a line-segment function is specified which goes from 0 to the value of the amp parameter in 0.1 seconds, and then back down to 0 in dur-0.1 seconds; dur is a standard name in SAOL that always contains the duration of the note as specified in the score.


In the next line of the instrument, the oscil unit generator converts the wavetable sound into a periodic audio signal, by oscillating over this table at a rate of cps cycles per second. Not every point in the table is used (unless the frequency is very low); rather, the oscil unit generator knows how to select and interpolate samples from the table in order to create one full cycle every 1/cps seconds. The sound that results is multiplied by the control-rate signal env and the overall sound amplitude amp. The result is assigned to the audio signal variable out.

The last line of the instrument contains the output statement, which specifies that the sound output of the instrument is contained in the signal variable out.

When the SAOL orchestra is transmitted in the bitstream header, the plain-text format is not used. Rather, an efficient tokenized format is standardized for this purpose. The Structured Audio specification contains a description of this tokenization procedure.

The decoder configuration header may also contain auxiliary data to be used in synthesis. For example, a type of synthesis popular today is "wavetable" or "sampling" synthesis, in which short clips of sound are pitch-shifted and added together to create sound. The sound samples for use in this process are not included directly in the orchestra (although this is allowed if the samples are short), but placed in a different segment of the bitstream header.

Score data, which normally resides in the bitstream access units as described below, may also be included in the header. By including in the header score instructions that are known when the session starts, the synthesis process may be able to allocate resources more efficiently. Also, real-time tempo control over the music is only possible when the notes to be played are known beforehand.

For applications in which it is useful to reconstruct a human-readable orchestra from the bitstream, a symbol table may also be included in the bitstream header. This element is not required and has no effect on the decoding process, but allows the compressed bitstream representation to be converted back into a human-readable form.


4.1.2. Bitstream access units and SASL

The streaming access units of the Structured Audio bitstream contain instructions that specify how the instruments that were described in the header should be used to create sound. These instructions are specified in another new language called SASL, for "Structured Audio Score Language". An example set of such instructions, or score, is given in Fig. 4.

Each line in this score corresponds to one note of synthesis. That is, for each line in the score, a different note is played using one of the synthesizers defined in the orchestra header. Each line contains, in order: a timestamp indicating the time at which the note should be triggered, the name of the instrument that should perform the synthesis, the duration of the note, and the parameters required for synthesis. The semantics of the parameters are not fixed in the standard, but depend on the definition of the instrument. In this case, the first parameter in each line corresponds to the cps field in the instrument definition in Fig. 3, and the second parameter to the amp field. Thus, the score in Fig. 4 includes four notes that correspond to the musical notation shown in Fig. 5.

Fig. 4. A SASL score, which uses the orchestra in Fig. 3 to play four notes. In an MPEG-4 Structured Audio bitstream, each score line is compressed and transmitted as an Access Unit.

Fig. 5. The musical notation corresponding to the SASL score in Fig. 4.
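Because Figs. 3-5 are not reproduced here, the following numpy sketch approximates in plain Python the computation that the figures and the surrounding text describe: a harm-style wavetable holding partials 1, 2, and 4 at amplitudes 1, 0.5, and 0.2; a kline-style ramp envelope; an oscil-style table-lookup oscillator; and a four-note score rendered at a 32 kHz sampling rate. It is not SAOL or SASL syntax, and the particular note times, pitches, and amplitudes are invented for illustration.

```python
import numpy as np

SRATE = 32000   # orchestra sampling rate from the global header
KRATE = 1000    # control rate (not modelled further in this sketch)

# harm-style stored-function table: one cycle, 2048 points,
# harmonics 1, 2 and 4 with amplitudes 1, 0.5 and 0.2.
phase = 2 * np.pi * np.arange(2048) / 2048
sound_table = 1.0*np.sin(phase) + 0.5*np.sin(2*phase) + 0.2*np.sin(4*phase)

def kline(amp, dur):
    """kline-style envelope: 0 -> amp in 0.1 s, then back to 0 in dur-0.1 s."""
    up = np.linspace(0.0, amp, int(0.1 * SRATE), endpoint=False)
    down = np.linspace(amp, 0.0, int((dur - 0.1) * SRATE))
    return np.concatenate([up, down])

def oscil(table, cps, n_samples):
    """oscil-style oscillator: read the wavetable at cps cycles per second."""
    idx = (np.arange(n_samples) * cps * len(table) / SRATE) % len(table)
    return table[idx.astype(int)]     # nearest-neighbour lookup (no interpolation)

def beep(cps, amp, dur):
    env = kline(amp, dur)
    # env already peaks at amp here, so we do not scale by amp a second time
    return oscil(sound_table, cps, len(env)) * env

# A four-note "score": (start time, duration, cps, amp) -- values are illustrative.
score = [(0.0, 0.5, 262, 0.8), (0.5, 0.5, 330, 0.8),
         (1.0, 0.5, 392, 0.8), (1.5, 1.0, 523, 0.8)]

out = np.zeros(int(3.0 * SRATE))
for start, dur, cps, amp in score:
    note = beep(cps, amp, dur)
    i = int(start * SRATE)
    out[i:i + len(note)] += note
```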

In the streaming bitstream, each line of the score is packaged as an access unit. The multiplexing of the access units with those in other streams, and the actual insertion of the access units into a bitstream for transport, are performed according to the MPEG-4 multiplex specification [6].

There are many other sophisticated instructions in the orchestra and score formats; space does not permit a full review, but more details can be found in the standard and in other references on this topic [21]. In the SAOL orchestra language, there are built-in functions corresponding to many useful types of synthesis; in the SASL score language, tables of data can be included for use in synthesis, and the synthesis process can be continuously manipulated with customizable parametric controllers. In addition, timestamps can be removed from the score lines, allowing a real-time mode of operation such as the transmission of live performances.

4.2. Decoding process

The decoding process for Structured Audio bitstreams is somewhat different from the decoding process for natural audio bitstreams. The streaming data does not typically consist of "frames" of data that are decompressed to give buffers of audio samples; rather, it consists of parameters that are fed into a synthesizer. The synthesizer creates the audio buffers according to the specification given in the header. A schematic of the Structured Audio decoding process is given in Fig. 6.

Fig. 6. Overview of the MPEG-4 Structured Audio decoding process. See text for details.

The first step in decoding the bitstream is processing and understanding the SAOL instructions in the header. This stage of the bitstream processing is similar to compiling or interpreting a high-level language. The MPEG-4 standard specifies the semantics of SAOL (the sound that a given instrument declaration is supposed to produce) exactly, but does not specify the exact manner of implementation. Software, hardware, or dual software/hardware solutions are all possible for Structured Audio implementation; however, programmability is required, and thus fixed-hardware (ASIC) implementations are difficult to realize. The SAOL pre-processing stage results in a set of instrument definitions that are used to configure the reconfigurable synthesis engine. The capabilities and proper functioning of this engine are described fully in the standard.

After the header is received and processed, synthesis from the streaming access units begins. Each access unit contains a score line that directs some aspect of the synthesis process. As each score line is received by the terminal, it is parsed and registered with the Structured Audio scheduler as an event. A time-sequenced list of events is maintained, and the scheduler triggers each at the appropriate time. When an event is triggered to turn on a note, a note object, or instrument instantiation, is created. A pool of active notes is always maintained; this pool contains all of the notes that are currently active (or "on"). As the decoder executes, it examines each instrument instantiation in the pool in turn, performing the next small amount of synthesis that the SAOL code describing that instrument specifies. This processing generates one frame of data (the length of the frame depends on the control rate specified by the content author) for each active note event. The frames are summed together for all notes to produce the overall decoder output.

Since SAOL is a very powerful format for the description of synthesis, it is not possible to characterize in general the specific algorithms executed in each note event. The content author has complete control over the methods used for creating sound and the resulting sound quality. Although the specification is flexible, it is still strictly normative (specified in the standard); this guarantees that a bitstream produces the same sound when played on any conforming decoder.
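The scheduler behaviour just described can be pictured with a small, hedged Python sketch (not the normative decoder): pending score events are kept in time order, triggered events create note objects in a pool of active notes, and each pass of the loop asks every active note for one control-period frame of audio, which is summed into the output.

```python
class Note:
    """One instrument instantiation; synth_one_frame stands in for running SAOL code."""
    def __init__(self, instrument, dur_frames, params):
        self.instrument, self.remaining, self.params = instrument, dur_frames, params

    def synth_one_frame(self, frame_len):
        self.remaining -= 1
        return [0.0] * frame_len        # placeholder audio frame

def decode(score_events, total_frames, frame_len):
    """score_events: list of (trigger_frame, instrument, dur_frames, params)."""
    pending = sorted(score_events)       # time-sequenced event list
    pool = []                            # pool of currently active notes
    output = []
    for frame in range(total_frames):
        while pending and pending[0][0] <= frame:        # trigger due events
            _, instr, dur, params = pending.pop(0)
            pool.append(Note(instr, dur, params))
        mix = [0.0] * frame_len
        for note in pool:                                # one frame per active note
            for i, s in enumerate(note.synth_one_frame(frame_len)):
                mix[i] += s
        pool = [n for n in pool if n.remaining > 0]      # retire finished notes
        output.extend(mix)
    return output

out = decode([(0, "beep", 10, (262, 0.8)), (5, "beep", 10, (330, 0.8))],
             total_frames=20, frame_len=32)
```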

4.3. Wavetable synthesis in MPEG-4

A simpler format for music synthesis is also provided in MPEG-4 Structured Audio for applications that require low-complexity operation and do not require sophisticated or interactive music content; karaoke systems are the primary example.


A format for representing banks of wavetables, the Structured Audio Sample Bank Format or SASBF, was created in collaboration with the MIDI Manufacturers Association for this purpose. Using SASBF, wavetable synthesizers can be downloaded to the terminal and controlled with MIDI sequences. This type of synthesis processing is readily available today; thus, a terminal using this format may be manufactured very cheaply. Such a terminal still allows synthetic music to be synchronized and mixed with recorded vocals or other natural sounds. Scheirer and Ray [19] have presented a comparison of the algorithmic-synthesis and wavetable-synthesis capabilities in MPEG-4, describing the relative advantages of each as well as situations in which they are profitably used together.

4.4. Encoding Structured Audio bitstreams

As with all MPEG standards, only the bitstream format and decoding process are standardized for the Structured Audio tools. The method of encoding a legal bitstream is outside the scope of the standard. However, the natural audio coders in MPEG-4 [2], like those in previous MPEG Audio standards, at least have well-known starting points for automatic encoding. Many tools have been constructed that allow an existing recording (or live performance) to be automatically turned into a legal bitstream for a given perceptual coder. This is not yet possible for Structured Audio bitstreams; the techniques required to do this fully automatically are still at a basic research stage, where they are known as polyphonic transcription [12] or computational auditory scene analysis [17]. Thus, for the foreseeable future, human intervention is required to produce Structured Audio bitstreams. Since the tools required for this are very similar to other tools used in a professional music studio today, such as sequencers and multitrack recording equipment, this is not an impediment to the utility of the standard.

5. MPEG-4 Audio/Systems interface and AudioBIFS

This section describes the relation between the MPEG-4 audio decoders and the MPEG-4 Systems functions of elementary stream management and composition. By including sophisticated capabilities for mixing and post-producing multiple audio sources, MPEG-4 enables a great number of advanced applications such as virtual-reality sound, interactive music experiences, and adaptive soundtracks. Companion papers in this special issue provide detailed introductions to elementary stream management in MPEG-4 [6] and to the MPEG-4 Binary Format for Scenes (BIFS) [22]. The part of BIFS controlling the composition of a sound scene is called AudioBIFS. AudioBIFS provides a unified framework for sound scenes that use streaming audio, interactive presentation, 3-D spatialization, and dynamic download of custom signal-processing effects. Scheirer et al. [20] have presented a more in-depth discussion of the AudioBIFS tools.

5.1. AudioBIFS requirements

Many of the main BIFS concepts originate from the Virtual Reality Modeling Language (VRML) standard [8], but the audio toolset is built from a different philosophy. AudioBIFS contains significant advances in quality and flexibility over VRML audio. There are two main modes of operation that AudioBIFS is intended to support. We term them virtual-reality and abstract-effects compositing.

In virtual-reality compositing, the goal is to recreate a particular acoustic environment as accurately as possible. Sound should be presented spatially according to its location relative to the listener in a realistic manner; moving sounds should have a Doppler shift; distant sounds should be attenuated and low-pass filtered to simulate the absorptive properties of air; and sound sources should radiate sound unevenly, with a specific frequency-dependent directivity pattern. This type of scene composition is most suitable for "virtual world" applications and video games, where the application goal is to immerse the user in a synthetic environment. The VRML sound model embraces this philosophy, with fairly lenient requirements on how various sound properties must be realized in an implementation.


In abstract-effects compositing, the goal is to provide content authors with a rich suite of tools from which artistic considerations can be used to choose the right effect for a given situation. As Scheirer [18] discusses in depth, the goal of sound designers for traditional media such as films, radio, and television is not to recreate a virtual acoustic environment (although this would be well within the capability of today's film studios), but to apply a body of knowledge regarding "what a film should sound like". Spatial effects are sometimes used, but often in a non-physically-realistic way; the same is true for the filters, reverberations, and other sound-processing techniques used to create various artistic effects that are more compelling than strict realism would be. This model of content production demands stricter normativity in playback than does the virtual-reality model.

MPEG realized early in the development of the MPEG-4 sound-compositing toolset that if the tools were to be useful to the traditional content community (always the primary audience of MPEG technology), then the abstract-effects composition model would need to be embraced in the final MPEG-4 standard. However, game developers and virtual-world designers demand high-quality sonification tools as well, so the VRML model should also be available. MPEG-4 AudioBIFS therefore integrates these two components into a single standard. Sound in MPEG-4 may be post-processed with arbitrary downloaded filters, reverberators, and other digital-audio effects; it may also be spatially positioned according to the simulated parameters of a virtual world. These two types of post-production may be freely combined in MPEG-4 audio scenes.

5.2. The MPEG-4 audio system

A schematic diagram of the overall audio system in MPEG-4 is shown in Fig. 7 as a reference for the discussion to follow. Sound is conveyed in the MPEG-4 bitstream as several elementary streams which contain coded audio in the formats described earlier. There are four elementary streams in the sound scene in Fig. 7. Each of these elementary streams contains a primitive media object, which in the case of audio is a single-channel or multichannel sound that will be composited into the overall scene.


In Fig. 7, the stream coded with the GA coder (MPEG-4 General Audio [2], used for wideband music) decodes into a stereo sound, and the other streams into monophonic sounds. The different primitive audio objects may each make use of a different audio decoder, and decoders may be used multiple times in the same scene.

The multiple elementary streams are conveyed together in a multiplexed representation. Multiple multiplexed streams may be transmitted from multiple servers to a single MPEG-4 receiver, or terminal. There are two multiplexed MPEG-4 bitstreams shown in Fig. 7; each originates from a different server. Encoded video content can also be multiplexed into the same MPEG-4 bitstreams. As they are received in the MPEG-4 terminal, the MPEG-4 bitstreams are demultiplexed, and each primitive media object is decoded. The resulting sounds are not played directly, but rather made available for scene compositing using AudioBIFS.

5.3. AudioBIFS nodes

Also transmitted in the multiplexed MPEG-4 bitstream is the BIFS scene graph itself. BIFS, and AudioBIFS, are simply parts of the content like the media objects themselves; there is nothing "hardwired" about the scene graph in MPEG-4. Content developers have wide flexibility to use BIFS in a variety of ways. In Fig. 7, the BIFS part and the AudioBIFS part of the scene graph are separated because it is convenient to imagine them this way, but there is no technical difference between them (AudioBIFS is just a subset of BIFS).

Like the rest of the BIFS capabilities [22], AudioBIFS consists of a number of nodes which are interlinked in a scene graph. However, the concept of the AudioBIFS scene graph is somewhat different; it is termed an audio subgraph. Whereas the main (visual) scene graph represents the spatiotemporal position of visual objects in presentation space, and their properties such as color, texture, and layering, an audio subgraph represents a signal-flow graph describing digital-signal-processing manipulations. Sounds flow in from MPEG-4 audio decoders at the bottom of the scene graph; each "child" node presents its results from processing to one or more "parent" nodes.


Fig. 7. The MPEG-4 Audio system, showing the demux, decode, AudioBIFS, and BIFS layers. This schematic shows the interaction between the frames of audio data in the bitstream, the decoders, and the scene composition process. See text for details.

Through this chain of processing, sound streams eventually arrive at the top of the audio subgraph. The "intermediate results" in the middle of the manipulation process are not sounds to be played to the user; only the result of the processing at the top of each audio subgraph is presented.

The AudioBIFS nodes are summarized in Table 1 and discussed in more detail in the following paragraphs.

Table 1
The AudioBIFS nodes

Node name        Function
AudioSource      Connect decoder to scene graph
Sound            Connect audio subgraph to visual scene
AudioMix         Mix multiple channels of sound together
AudioSwitch      Select a subset of a set of channels of sound
AudioDelay       Delay a set of audio channels
AudioFX          Perform audio effects-processing
AudioBuffer      Buffer sound for interactive playback
ListeningPoint   Control position of virtual listener
TermCap          Query resources of terminal

The AudioSource node is the point of connection between real-time streaming audio and the AudioBIFS scene. The AudioSource node attaches an audio decoder, of one of the types specified in the MPEG-4 audio standard, to the scene graph; audio flows out of this node.

The Sound node is used to attach sound into audiovisual scenes, either as 3-D directional sound or as non-spatial ambient sound. All of the spatial and non-spatial sounds produced by Sound nodes in the scene are summed and presented to the listener. The semantics of the Sound node in MPEG-4 are similar to those of the VRML standard; i.e., the sound attenuation region and spatial characteristics are defined in the same way as in the VRML standard, to create a simple model of attenuation. In contrast to VRML, where the Sound node accepts raw sound samples directly and no intermediate processing is done, in MPEG-4 any of the AudioBIFS nodes may be attached. Thus, if an AudioSource node is the child node of the Sound node, the sound as transmitted in the bitstream is added to the sound scene; however, if a more complex audio scene graph is beneath the Sound node, the mixed or effects-processed sound is presented.

The AudioMix node allows M channels of input sound to be mixed into N channels of output sound through the use of a mixing matrix.

The AudioSwitch node allows N channels of output to be taken as a subset of M channels of input, where M ≥ N. It is equivalent to, but easier to compute than, an AudioMix node where M ≥ N and all matrix values are 0 or 1. This node allows efficient selection of certain channels, perhaps on a language-dependent basis.

The AudioDelay node allows several channels of audio to be delayed by a specified amount of time, enabling small shifts in timing for media synchronization.

The AudioFX node allows the dynamic download of custom signal-processing effects to apply to several channels of input sound. Arbitrary effects-processing algorithms may be written in SAOL and transmitted as part of the scene graph. The use of SAOL to transmit audio effects means that MPEG does not have to standardize the "best" artificial reverberation algorithm (for example), but also that content developers do not have to rely on terminal implementors and trust in the quality of the algorithms present in an unknown terminal. Since the execution method of SAOL algorithms is precisely specified, the sound designer has control over exactly which reverberation algorithm (for example) is used in a scene. If a reverb with particular properties [5] is desired, the content author transmits it in the bitstream; its use is then guaranteed. The complexity of processing in the AudioFX node is defined and restricted through levels, in a manner similar to the complexity of the Structured Audio decoder. The position of the Sound node in the overall scene, as well as the position of the listener, are also made available to the AudioFX node, so that effects-processing may depend on the spatial locations (relative or absolute) of the listener and sources.

The AudioBuffer node allows a segment of audio to be excerpted from a stream, and then triggered and played back interactively. Unlike the VRML node AudioClip, the AudioBuffer node does not itself contain any sound data. Instead, it records the first n seconds of sound produced by its children, capturing this sound into an internal buffer. It may later be triggered interactively (see the section on interaction below) to play that sound back. This function is most useful for "auditory icons" such as feedback to button-presses. It is impossible to make streaming audio provide this sort of audio feedback, since the stream is (at least from moment to moment) independent of user interaction, and the backchannel capabilities of MPEG-4 are not intended to allow the rapid response required for audio feedback. There is a special function of AudioBuffer which allows it to cache samples for sampling synthesis in the Structured Audio decoder. This technique allows perceptual compression to be applied to sound samples, which can greatly reduce the size of bitstreams using sampling synthesis.

The ListeningPoint node allows the user to set the listening point in a scene. The listening point is the position relative to which the spatial positions of sources are calculated. By default (if no ListeningPoint node is used), the listening point is the same as the visual viewpoint.


The TermCap node is not specifically an AudioBIFS node, but provides capabilities which are useful in creating terminal-adaptive scenes. The TermCap node allows the scene graph to query the terminal on which it is running, to discover hardware and performance properties of that terminal. For example, in the audio case, TermCap may be used to determine the ambient signal-to-noise ratio of the environment. The result can be used to control "switching" between different parts of the scene graph, so that (for example) a compressor is applied in a noisy environment such as an automobile, but not in a quiet environment such as a listening room. Other audio-pertinent resources that may be queried with the TermCap node include the number and configuration of loudspeakers, the maximum output sampling rate of the terminal, and the level of sophistication of the 3-D audio functionality available.

The MPEG-4 Systems standard contains specifications for the resampling, buffering, and synchronization of sound in AudioBIFS. Although we will not discuss these aspects in detail, for each of the AudioBIFS nodes there are precise instructions in the standard for the associated resampling and buffering requirements. These aspects of MPEG-4 are normative, which makes the behavior of an MPEG-4 terminal highly predictable to content developers.
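The kind of terminal-adaptive switching described for TermCap might look as follows in procedural form; in a real scene this logic is expressed with BIFS routes rather than code, and the capability name and threshold below are hypothetical.

```python
def choose_audio_branch(ambient_snr_db: float) -> str:
    """Pick a scene-graph branch based on a TermCap-style ambient-noise query."""
    # In a noisy environment (e.g. a car), route the sound through a compressor
    # branch; in a quiet listening room, use the unprocessed branch.
    return "compressed_branch" if ambient_snr_db < 40.0 else "plain_branch"

print(choose_audio_branch(25.0))   # "compressed_branch"
```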

6. Summary

We have described the tools for synthetic and SNHC audio in MPEG-4. By using these tools, content developers can create high-quality, interactive content and transmit it at extremely low bitrates over digital broadcast channels or the Internet. The Structured Audio toolset provides a single standard to unify the world of algorithmic music synthesis and to drive forward the capabilities of the PC audio platform; the Text-to-Speech Interface provides a greatly needed measure of interoperability between content and text-to-speech systems.

References

[1] K. Brandenburg, Perceptual coding of high quality digital audio, in: M. Kahrs, K. Brandenburg (Eds.), Applications of Digital Signal Processing to Audio and Acoustics, Kluwer Academic, New York, 1998, pp. 39–83.
[2] K. Brandenburg, O. Kunz, A. Sugiyama, MPEG-4 natural audio coding, Signal Processing: Image Communication 15 (4–5) (2000) 423–444.
[3] M.A. Casey, P.J. Smaragdis, Netsound: real-time audio from semantic descriptions, in: Proceedings of the International Computer Music Conference, Hong Kong, 1996, p. 143.
[4] J. Chadabe, Electric Sound: The Past and Promise of Electronic Music, Prentice-Hall, Upper Saddle River, NJ, 1997.
[5] W.G. Gardner, Reverberation algorithms, in: M. Kahrs, K. Brandenburg (Eds.), Applications of Digital Signal Processing to Audio and Acoustics, Kluwer Academic, New York, 1998, pp. 85–132.
[6] C. Herpel, A. Eleftheriadis, MPEG-4 systems: Elementary stream management, Signal Processing: Image Communication 15 (4–5) (2000) 299–320.
[7] J.E. Hopcroft, J.D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, MA, 1979.
[8] International Organisation for Standardisation, Virtual reality modeling language (VRML), International Standard ISO 14472-1:1997, ISO, 1997.
[9] D. Johnston, C. Sorin, C. Gagnoulet et al., Current and experimental applications of speech technology for telecom services in Europe, Speech Communication 23 (1–2) (1997) 5–16.
[10] M. Kitai, K. Hakoda, S. Sagayama et al., ASR and TTS telecommunications applications in Japan, Speech Communication 23 (1–2) (1997) 17–30.
[11] D.H. Klatt, Review of text-to-speech conversion for English, J. Acoust. Soc. Amer. 82 (3) (1987) 737–793.
[12] K.D. Martin, Automatic transcription of simple polyphonic music: Robust front-end processing, Technical Report #399, MIT Media Laboratory Perceptual Computing, 1996.
[13] M.V. Mathews, An acoustic compiler for music and psychological stimuli, Bell Systems Tech. J. 40 (1961) 677–694.
[14] M.V. Mathews, The Technology of Computer Music, MIT Press, Cambridge, MA, 1969.
[15] MIDI Manufacturers Association, The Complete MIDI 1.0 Detailed Specification, Protocol specification, MIDI Manufacturers Association, 1996.
[16] C. Roads, The Computer Music Tutorial, MIT Press, Cambridge, MA, 1996.
[17] D.F. Rosenthal, H.G. Okuno, Computational Auditory Scene Analysis, Lawrence Erlbaum, Mahwah, NJ, 1998.
[18] E.D. Scheirer, Structured audio and effects processing in the MPEG-4 multimedia standard, Multimedia Systems 7 (1) (1999) 11–22.
[19] E.D. Scheirer, L. Ray, Algorithmic and wavetable synthesis in the MPEG-4 multimedia standard, in: Proceedings of the 105th Convention of the Audio Engineering Society (reprint #4811), San Francisco, 1998.
[20] E.D. Scheirer, R. Väänänen, J. Huopaniemi, AudioBIFS: Describing audio scenes with the MPEG-4 multimedia standard, IEEE Trans. on Multimedia 1 (3) (1999) 237–250.
[21] E.D. Scheirer, B.L. Vercoe, SAOL: the MPEG-4 structured audio orchestra language, Comput. Music J. 23 (2) (1999) 31–51.
[22] J. Signes, Y. Fisher, A. Eleftheriadis, MPEG-4's binary format for scene description, Signal Processing: Image Communication 15 (4–5) (2000) 321–345.
[23] J.O. Smith, Viewpoints on the history of digital synthesis, in: Proceedings of the International Computer Music Conference, Montreal, 1991, pp. 1–10.
[24] A.M. Tekalp, J. Ostermann, Face and 2-D mesh animation in MPEG-4, Signal Processing: Image Communication 15 (4–5) (2000) 387–421.
[25] H. Valbret, E. Moulines, J.P. Tubach, Voice transformation using PSOLA technique, Speech Communication 11 (2–3) (1992) 175–187.
[26] B.L. Vercoe, The synthetic performer in the context of live performance, in: Proceedings of the International Computer Music Conference, Paris, 1984, pp. 199–200.
[27] B.L. Vercoe, Csound: A Manual for the Audio-Processing System, Program Reference Manual, MIT Media Lab, 1996.
[28] B.L. Vercoe, W.G. Gardner, E.D. Scheirer, Structured audio: the creation, transmission, and rendering of parametric sound representations, Proc. IEEE 85 (5) (1998) 922–940.
[29] E.D. Scheirer, Y.E. Kim, Generalized audio coding with MPEG-4 Structured Audio, in: Proc. of the Audio Engineering Society 17th Internat. Conf. on High Quality Audio Coding, Florence, 1999, pp. 189–204.

Eric D. Scheirer is a Ph.D. candidate in the Machine Listening Group at the MIT Media Laboratory, where his research focuses on the construction of music-understanding computer systems. He received B.S. degrees in computer science and linguistics from Cornell University, Ithaca, NY, in 1993, and the M.S. degree from the Media Lab in 1995. He was an Editor of the MPEG-4 audio standard and was the primary developer of the Structured Audio and AudioBIFS components of MPEG-4. Eric has published articles on a range of audio topics, including music analysis, structured coding, advanced psychoacoustic models, audio pattern recognition, and sound synthesis. He is an active speaker and writer for non-technical audiences, with a particular interest in the application of audio technology to multimedia systems design. He is also an accomplished and award-winning jazz trombonist.

Youngjik Lee received the B.S. degree in electronics engineering from Seoul National University, Seoul, Korea, in 1979, the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1981, and the Ph.D. degree in electrical engineering from Polytechnic University, Brooklyn, New York, USA. From 1981 to 1985 he was with Samsung Electronics Company, where he developed video display terminals. From 1985 to 1988 he worked on sensor array signal processing. Since 1989, he has been with the Electronics and Telecommunications Research Institute, pursuing research in multimodal interfaces, speech recognition, speech synthesis and speech translation, neural networks, pattern recognition, and digital signal processing.

Jae-Woo Yang received the B.S. degree in electrical engineering and the M.S. degree in control engineering from Seoul National University in 1978 and 1982, respectively, and the Ph.D. degree in computer science from the Korea Advanced Institute of Science and Technology in 1997. He worked at Samsung Electronics Company during 1978–1979 and joined ETRI in 1980. Since then he has developed operation support systems, artificial intelligence systems, spoken-language processing systems, and multimedia systems. He is now director of the Telecommunication Terminal Technology department and serves as a board member of the Acoustical Society of Korea. His current research areas are human interfaces, multimedia communication, Internet terminal equipment, and spoken-language processing.