Signal Processing: Image Communication 4 (1992) 153-159 Elsevier
The MPEG systems coding specification
Alexander G. MacInnis
International Business Machines, Zip 9261, 11400 Burnet Road, Austin, TX 78758-3493, USA
Abstract. The MPEG Systems Committee has produced a specification for the syntax and semantics of system layer coding of combined MPEG compressed digital video and audio. The systems layer provides a framework and information required for these functions: a multiplex of various numbers of audio, video and private streams; synchronization of audio and video; management of buffers for coded information; random access and start-up conditions; and absolute time identification. In addition the specification allows for the later inclusion by ISO of other data streams.
Keywords. Synchronization, multiplex, start-up, multi-media.
1. Introduction
The ISO-IEC/JTC1/SC29/WG11 MPEG (Moving Picture Experts Group) working group has produced a Committee Draft of a specification for coding of combined video and audio information. The specification is composed of three parts: Part 1, Systems; Part 2, Video; and Part 3, Audio. The video and audio parts specifically address the coding of video and audio, respectively. The systems part specifies a system coding layer for combining coded video and audio, and provides the capability of also combining private data streams and streams that may be defined by ISO at a later date.
The MPEG system layer specifies a multiplex of multiple elementary streams such as audio and video, with a syntax that includes data fields which directly support synchronization of the elementary streams and the data source mechanism. The system data fields also assist in parsing the multiplexed stream after a random access; managing coded data buffers in the decoders; and identifying the absolute time of the coded program.
The MPEG system specification specifies the syntax and semantic rules of the coded data stream. The semantics imply some requirements on the part of decoders. The encoding process is not specified and can be implemented in a variety of ways, as long as the resulting data stream meets the requirements of the specification.
2. System concept
2.1. Representative encoder
While the encoder portion of the overall system is not part of the specification, it is instructive to illustrate the system function with a complete system including an encoder. Figure 1 illustrates the encoder function.
In this representative encoder model the video encoder receives digitized, uncoded pictures (video Presentation Units or PUs) at discrete times and the audio digitizer receives digitized, uncoded blocks of audio samples called Audio Presentation Units at discrete times. The times of arrival of the pictures are not necessarily aligned with the times of arrival of the audio PUs. The System Time Clock (STC) is a reference time base operating at 90 kHz, and is not necessarily phase locked to either the audio or video sample clocks. The STC produces 33 bit time values incremented at 90 kHz. At the arrival at the encoders of some, but not necessarily all, of the video and audio PUs the current value of the STC is sampled and stored with the PUs through the coding, transmission and decoding processes. These values are called Presentation Time Stamps (PTS), and the PTS values are specified in the MPEG Systems specification. The time stamps, using a reference frequency of 90 kHz and unsigned binary values from 0 to 2**33 - 1, allow unique identification of operating time within the data stream over an interval exceeding 24 hours.
Fig. 1. Representative encoder system. [Block diagram; labelled elements: Video Source, Video Encoder, Audio Source, Audio Encoder, System Time Clock, System Multiplexer and Encoder, MPEG Data Stream.]
The video and audio encoders, respectively, encode digital video and audio in accordance with the MPEG specification Parts 2 and 3. They produce as output coded pictures or Video Access Units (VAU) and Audio Access Units (AAU). These outputs are referred to as elementary streams. The system encoder and multiplexer produce a multiplex containing the elementary streams as well as system layer coding, which is defined below. The output of the system encoder is a single data stream identified as M(i). The multiplex operation, rather than being specified by ISO/MPEG, has specific constraints placed upon it which are described in a later section.
In addition to the PTS, there are time stamps associated with the coded data stream itself. These time stamps, called System Clock References
(SCR) are created, in the preceding model, as samples of the STC such that the value of the SCR equals the value of the STC at the time the last byte of the SCR exits the system encoder. Decoding Time Stamps (DTS) are similar to PTS, except that the permutation ordering of pictures in the video coding process is reflected in the DTS values. DTS are included where appropriate in the data stream along with PTS; DTS is described later. The information contained in PTS, DTS and SCR fields as described above is useful for ensuring synchronization between the various decoders and the data stream source in a decoding system. This information is also directly usable for managing decoder buffers, as will be illustrated below.
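The claim that 33-bit counts of a 90 kHz clock cover more than 24 hours can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are hypothetical, and the wrap-around handling in ticks_between is an assumption about how a decoder might compare time stamps, not something stated in the text.

```python
STC_HZ = 90_000            # System Time Clock frequency given in the text
TS_BITS = 33               # width of the time stamp values
TS_MODULUS = 1 << TS_BITS  # time stamps run from 0 to 2**33 - 1


def seconds_to_ticks(seconds: float) -> int:
    """Quantize a time in seconds to a 33-bit count of 90 kHz ticks."""
    return round(seconds * STC_HZ) % TS_MODULUS


def wrap_interval_hours() -> float:
    """Interval after which a 33-bit, 90 kHz counter repeats, in hours."""
    return TS_MODULUS / STC_HZ / 3600.0


def ticks_between(earlier: int, later: int) -> int:
    """Elapsed ticks from earlier to later, assuming at most one wrap."""
    return (later - earlier) % TS_MODULUS
```

The wrap interval works out to roughly 26.5 hours, which is consistent with the "exceeding 24 hours" statement above.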
2.2. Reference decoder model
Since the MPEG specification specifies syntax and semantics of the coded data stream, the encoder model above is not part of the specification; however, the reference decoder model is specified in the MPEG system specification as part of the semantic definition of the data stream at the system layer. Formally the reference decoder model, which includes a System Target Decoder (STD), is a bit-stream verifier; in practice an
encoder may include a similar model in order to ensure that its data stream output is proper. While an encoder model may be inferred as being similar to an inverse of the STD model, there is no requirement that the encoder follow any particular model as long as the resulting data stream meets the specification.
The reference decoder model represents the actions and processes of a real decoder system, although it makes simplifications in the interests of generality and understanding. Real decoders that are assured to perform properly must be designed with proper consideration taken of the differences between the STD and the real design. Such differences may require different amounts of buffering and timing offsets.
The STD is shown in Fig. 2. M(i) is the data stream, a stream of bytes, input to the STD. It may come from any Digital Storage Medium (DSM) or channel. The term Tm(i) is the time of arrival at the STD of the ith byte of stream M(i). The {Bn} are data buffers for coded data, one per elementary stream; their sizes are specified in the data stream. The term Tdn(j) is the time of decoding the jth access unit of elementary stream n. The {Dn} are elementary stream decoders. O1 is a re-order buffer which stores I and P pictures in a video decoder system due to the permutation ordering of coded pictures. The term Pn(k) represents the kth presentation unit of stream n, and Tpn(k) is the time of presentation of Pn(k).
Fig. 2. Diagram of system target decoder. [Block diagram; labelled elements: M(i), Tm(i) (i-th byte of multiplex stream and arrival time at STD); Demultiplex; Coded Data Buffers B1 to Bn; An(j), Tdn(j) (access unit and decoding time); decoders Dn; Reorder Buffer O1; P1(k) to Pn(k), Tpn(k) (k-th presentation unit and presentation time); System Control Data.]
The system decoder operates instantaneously; as soon as a byte in M(i) that is part of elementary stream n arrives, it is instantly put in buffer Bn. The elementary stream decoders {Dn} operate instantaneously. That is, all of the bits of access unit j are removed from Bn instantaneously and the access unit is decoded instantaneously at the time Tdn(j). Since the decoders operate instantaneously, the Tpn(k) are equal to the Tdn(j) except in cases where decoded video access units are reordered by O1. The index j is in order of the occurrence of access units in the coded data stream, which may be different from the order of video presentation units due to the reordering.
The time values above, Tm(i), Tdn(j) and Tpn(k), are continuously-valued times. The SCR, PTS and DTS fields in the data stream are discrete
values from a common time base that has a nonzero accuracy tolerance. The SCR fields indicate the time, as a value of a common system time clock, that the last byte of the SCR field itself arrives at the system decoder. SCR fields occur intermittently in the data stream, and may be spaced no more than 700 ms apart, as measured by the values of the SCRs. The PTS fields are quantized values of Tpn(k), as values of the common system time clock. PTS fields need not be present for every access unit of each stream, and they may be spaced no more than 700 ms apart in terms of the PTS values.
There are Decoding Time Stamp (DTS) values included in the data stream; these are quantized values of Tdn(j), and are values of the common STC. DTS values are not coded where their values are equal to PTS values, and they are only coded for access units that have coded PTS fields; i.e. DTS fields only exist where PTS fields are coded for I and P pictures. The common STC used to produce the SCR, PTS and DTS fields may have a tolerance of no more than 50 parts per million.
A formal requirement imposed on the coded data stream is that any valid MPEG bit stream, when decoded by the STD, must never cause the decoder buffers {Bn} to overflow or underflow. This behavior is a characteristic of the data stream itself given the definition of the STD.
The STD reference model applies to both fixed and variable rate data streams. In the case of variable rate operation the data rate of M(i) is piecewise constant between instances of SCRs. The piecewise constant data rate is specified via fields in the data stream. The semantics of the rate specification support bursty transmission via intervals of zero-rate transmission following each peak.
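The overflow and underflow requirement can be illustrated with a small simulation of one STD buffer Bn. This is a simplified sketch under stated assumptions, not the formal verifier: data are modelled as arriving in chunks at single instants rather than byte by byte, arrivals at the same instant as a decoding time are processed first, and the function name and return values are hypothetical.

```python
from typing import List, Tuple


def check_std_buffer(buffer_size: int,
                     arrivals: List[Tuple[float, int]],
                     access_units: List[Tuple[float, int]]) -> str:
    """Replay one buffer Bn and report 'ok', 'overflow' or 'underflow'.

    arrivals:     (arrival time, number of bytes) for data entering Bn
    access_units: (decoding time Tdn(j), size in bytes) for each access unit
    """
    # Tag arrivals with 0 and removals with 1 so that, at equal times,
    # arriving bytes are counted before the access unit is removed.
    events = [(t, 0, n) for t, n in arrivals]
    events += [(t, 1, n) for t, n in access_units]
    events.sort()

    occupancy = 0
    for _, kind, nbytes in events:
        if kind == 0:
            occupancy += nbytes
            if occupancy > buffer_size:
                return "overflow"
        else:
            # The whole access unit is removed instantaneously at Tdn(j),
            # so all of its bytes must already be in the buffer.
            if nbytes > occupancy:
                return "underflow"
            occupancy -= nbytes
    return "ok"
```

For example, a 100-byte buffer that receives 60 bytes and then 40 bytes before either access unit is decoded is exactly full but still valid, whereas receiving 50 bytes in the second chunk would overflow it.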
2.3. STD buffer behavior
Given the STD model above, the time fields indicate when data enter and leave the {Bn}; therefore the amount of time that coded data spend in the {Bn} is specified. Since the buffer occupancy is specified in terms of time, using a time base that is common to the decoding process, there is no need to compute the relationship between bits in the buffer, bits-per-second and buffer delay. If data enter and leave the buffers at the intended times then, given that the {Bn} buffer sizes are known, the buffers will not overflow or underflow, regardless of the data rate. The number of bits in the buffer need not be specified nor computed.
3. Synchronization
As a general principle, for systems that decode multiple audio and video streams from a storage or transmission medium to operate in synchronism, there must be exactly one autonomous time base, referred to as a time master, in the decoding system. The time master can be any of the decoders, the data stream source, or an external time base. All other entities (decoders and data source) must slave their timing to the time master; otherwise problems such as buffer underflow, or loss of presentation synchronization, will occur. MPEG does not specify which entity is the time master; many configurations are possible, and the system data stream definition provides necessary and sufficient information to implement practical systems. Decoders can implement phase-locked loops or other timing means to assure proper slaving of their operation to the time master.
If a decoder is the time master, the time when it presents a presentation unit is considered to be the correct and instantaneous time for use by the other entities. Since the SCRs are samples of the common time base, they can be used to ensure that the data source provides the correct amount of data at the correct time.
If the data stream source is the time master, the SCR values indicate the correct and instantaneous time at the moment that these values are received. The decoders then use this information about the correct time to pace their decoding and presentation timing.
If the time base is an external entity, all of the decoders and the data source are expected to slave the timing of their operations to the external timing source.
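As a rough illustration of slaving a decoder to a data stream source that acts as time master, the sketch below keeps a local estimate of the common time base and nudges it toward each received SCR. It is a first-order software correction chosen for brevity, not a phase-locked loop design from the specification; the class name, the gain parameter and the omission of 33-bit wrap-around handling are all assumptions made for this example.

```python
class SlavedClock:
    """Local estimate of the common time base, corrected by received SCRs."""

    def __init__(self, gain: float = 0.1) -> None:
        self.gain = gain    # fraction of each observed error that is corrected
        self.offset = None  # estimated (time master - local counter), in ticks

    def on_scr(self, scr_ticks: int, local_ticks: int) -> float:
        """Update the estimate when an SCR arrives; return the observed error."""
        if self.offset is None:
            # The first SCR simply initializes the estimate.
            self.offset = float(scr_ticks - local_ticks)
            return 0.0
        error = scr_ticks - (local_ticks + self.offset)
        self.offset += self.gain * error
        return error

    def now(self, local_ticks: int) -> float:
        """Current estimate of the time master's clock, in 90 kHz ticks."""
        if self.offset is None:
            raise RuntimeError("no SCR received yet")
        return local_ticks + self.offset
```

A decoder using such an estimate would compare it against PTS values to decide when to present each unit; a small gain smooths jitter in SCR arrival at the cost of slower convergence.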
4. System stream coding
The MPEG system specification includes a syntax with three coding layers above the layers encompassed by the elementary streams. These are the ISO11172 Stream layer, the Pack layer and the Packet layer. The ISO11172 Stream layer includes a sequence of Packs followed by an end code. The Pack layer includes the SCR field, the mux_rate field, an optional System Header packet, and the packet layer. The ISO11172 and Pack layers are indicated in Fig. 3. The mux_rate field bounds the rate, in bytes per second, as measured by the current and succeeding SCR values and the number of coded bytes intervening. The system header packet is detailed below.
The following figures which indicate the syntax are simplified for clarity. The formal syntax is specified in the ISO11172 Committee Draft, Part 1.
The Packet layer contains packets containing data from individual elementary streams. There is data from exactly one elementary stream in each packet. Packet contents, and all system layer coding, are byte aligned; note that the individual coding elements within elementary streams may not be byte aligned.
ISO11172 Stream Layer:
    [0 or more Packs]
    end code                                            32 bits
Pack Layer:
    pack start code                                     32 bits
    system clock reference (including marker bits)      40 bits
    mux rate (including marker bits)                    24 bits
    optional system_header_packet
    [0 or more Packets]
Fig. 3.
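The Pack layer fields of Fig. 3 can be packed and unpacked as shown in the Python sketch below. The field widths (32, 40 and 24 bits) come from the figure; the start code value 0x000001BA and the exact positions of the '0010' prefix and marker bits are not given in this paper and are assumed here from the published ISO/IEC 11172-1 layout. The function names are hypothetical.

```python
PACK_START_CODE = 0x000001BA  # assumed value; not stated in this paper


def encode_pack_header(scr: int, mux_rate: int) -> bytes:
    """Build the 12-byte pack header: start code, SCR field, mux_rate field."""
    # 40-bit SCR field: '0010', SCR[32..30], marker, SCR[29..15], marker,
    # SCR[14..0], marker (assumed bit layout).
    scr_field = ((0b0010 << 36)
                 | (((scr >> 30) & 0x7) << 33) | (1 << 32)
                 | (((scr >> 15) & 0x7FFF) << 17) | (1 << 16)
                 | ((scr & 0x7FFF) << 1) | 1)
    # 24-bit mux_rate field: marker, 22-bit mux_rate, marker (assumed layout).
    mux_field = (1 << 23) | ((mux_rate & 0x3FFFFF) << 1) | 1
    return (PACK_START_CODE.to_bytes(4, "big")
            + scr_field.to_bytes(5, "big")
            + mux_field.to_bytes(3, "big"))


def decode_pack_header(data: bytes) -> tuple:
    """Return (scr, mux_rate) from a pack header, checking the fixed bits."""
    if len(data) < 12 or int.from_bytes(data[:4], "big") != PACK_START_CODE:
        raise ValueError("not a pack header")
    f = int.from_bytes(data[4:9], "big")
    m = int.from_bytes(data[9:12], "big")
    markers_ok = (f >> 36) == 0b0010 and (f >> 32) & 1 and (f >> 16) & 1 and f & 1
    if not markers_ok or not ((m >> 23) & 1 and m & 1):
        raise ValueError("marker bits missing")
    scr = (((f >> 33) & 0x7) << 30) | (((f >> 17) & 0x7FFF) << 15) | ((f >> 1) & 0x7FFF)
    return scr, (m >> 1) & 0x3FFFFF
```

Spreading marker bits through the time stamp is what keeps a long run of zero bits, and hence an accidental start code prefix, from appearing inside the header.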
Packet Layer:
    packet start code                                             32 bits
    packet length                                                 16 bits
    if (packet_start_code != private 2) {
        optional stuffing bytes
        optional STD_buffer_size                                  16 bits
        if no presentation time stamp following, constant value    8 bits
        optional presentation_time_stamp (including marker bits)  40 bits
        optional decoding time stamp                              40 bits
    }
    packet_data
Fig. 4.
Each packet consists of a packet start code followed by the packet length; three optional fields: STD buffer size, PTS and DTS; and packet data from the elementary stream. The packet length is in bytes, and ranges up to 2**16 - 1. The amount of packet data is limited only by the total available packet length, minus the data in the packet header itself, and by constraints imposed by the STD. The Packet layer is indicated in Fig. 4.
There are 69 different values of packet start code. Of these, 16 are for video, 32 are for audio, 2 are private, 1 is for padding, and the remainder are reserved for future use by ISO. It is anticipated that the Multimedia and Hypermedia Experts Group (MHEG) may define the uses for up to 16 of these reserved packet start codes in the future. The multiple audio and video packet start codes provide for similar numbers of multiple audio and video elementary streams. Note that a single audio stream may be coded stereo audio.
While the system codes (Pack, Packet, and system header start codes and the end code) cannot be emulated by data within video packets, they can be emulated within audio and private data packets.
Private data is unrestricted, other than by the syntax and STD model which applies to the entire stream. There are two types of private streams defined; private 1 includes the Packet syntax including optional STD buffer size, PTS and DTS fields, and private 2 does not. It is possible for users of private data streams to define local syntaxes which support multiple data types within the single MPEG private data framework.
A system header packet is defined which specifies certain global upper limits on parameters within the entire data stream. These include the value of the mux_rate field, the number of simultaneous audio streams in the data stream, the number of simultaneous video streams, and the STD buffer size fields for each stream. The fixed_flag indicates fixed bit-rate operation. The system header packet is indicated in Fig. 5.
system_header_packet:
    system_header_start_code           32 bits
    packet length                      16 bits
    rate bound (with marker bits)      24 bits
    audio_bound, fixed_flag             8 bits
    video_bound                         8 bits
    reserved_byte                       8 bits
    while (next bit == '1') {
        stream_id                       8 bits
        std_buffer_size_bound          16 bits
    }
Fig. 5.
5. Multiplexing
The pattern and method of multiplexing are not directly specified in MPEG. There are specific constraints which must be followed by an encoder and multiplexer in order to produce a valid MPEG data stream. The multiplex itself is not required to follow any of the commonly used constraints such as fixed packet sizes or consistent ordering of elementary streams within the multiplex.
The STD model requires that the individual stream buffers in the STD must never overflow or underflow. The definition of the STD is such that the timing of all data entering and leaving these buffers is specified by the data stream itself, ensuring that such verification is possible. The sizes of the STD individual stream buffers impose limits on the behavior of the multiplexer. These limits are sufficient to guarantee proper operation in any decoder system which can itself decode data streams that are acceptable under the STD model.
It should be noted that real decoding systems may not behave in the same manner as is specified in the STD. For example, they may not remove from their individual buffers all of the bits associated with a coded picture or audio frame instantaneously and immediately before completion of decoding the picture or audio frame. The designs of actual decoders must take into account the differences between their behavior and the STD model in order to ensure that sufficient buffering exists; typically a real decoder will require greater buffer capacity than is specified in the STD.
6. Startup operation
Decoding of the MPEG data stream starting at the beginning is straightforward; there are no ambiguous bit patterns. This is true in part due to the Length field which is part of every packet, and which points to the succeeding Pack or Packet start code or to the end code. Startup at points other than the beginning is similarly straightforward if known starting points, e.g. Pack start codes, are located directly.
Starting decoding operation at random points without exact knowledge of specific codes requires locating Pack or Packet start codes within the data stream. Typically a decoder will search for a Pack or Packet start code within the data stream. Since such codes can be emulated within non-video data packets, the confidence of error-free operation is increased if the existence of a succeeding start code is verified at the location pointed to by the packet length field. The confidence level is also increased by taking into account the fact that a Pack start code is always followed (after the SCR) by another system start code (or system header stream). A data stream parser may choose to buffer or discard the data stream until a high-confidence entry point is found.
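The search-and-verify strategy described above can be sketched as follows. The numeric start code values (prefix 0x000001, pack 0xBA, end code 0xB9, system codes from 0xB9 upward) and the fixed 12-byte pack header are assumptions taken from the published ISO/IEC 11172-1 standard rather than from this paper, and the function names are hypothetical. A candidate is accepted only if another start code is found where the length field (or the fixed pack header size) says the next one should be.

```python
PREFIX = b"\x00\x00\x01"   # assumed start code prefix
PACK, END = 0xBA, 0xB9     # assumed pack start code and end code values


def is_start_code(data: bytes, pos: int) -> bool:
    """True if a system layer start code begins at pos (assumed code range)."""
    return (data[pos:pos + 3] == PREFIX
            and pos + 3 < len(data)
            and data[pos + 3] >= 0xB9)


def next_position(data: bytes, pos: int):
    """Offset where the next start code should begin, or None if unknown."""
    code = data[pos + 3]
    if code == PACK:
        return pos + 12    # start code + 40-bit SCR field + 24-bit mux_rate field
    if code == END:
        return pos + 4
    if pos + 6 > len(data):
        return None
    length = int.from_bytes(data[pos + 4:pos + 6], "big")
    return pos + 6 + length  # the length counts the bytes after the length field


def find_entry_point(data: bytes, start: int = 0):
    """Return the offset of the first verified start code, or None."""
    pos = data.find(PREFIX, start)
    while pos != -1:
        if is_start_code(data, pos):
            nxt = next_position(data, pos)
            if nxt is not None and is_start_code(data, nxt):
                return pos
        pos = data.find(PREFIX, pos + 1)
    return None
```

Checking a single succeeding start code is the minimum described in the text; a more cautious parser could require several consecutive confirmations before it starts decoding.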
7. Conclusion
Described here is a syntax with semantic rules for multiplexing multiple dissimilar coded data streams, in a flexible way, along with the necessary information to synchronize the decoding of the data and presentation of the decoded data, using coding layers that are specified at a higher layer than the coded data. The MPEG system syntax is structured in such a way that decoding in a wide variety of environments and conditions is possible.
It can be expanded, using reserved or private packet start codes, to accommodate data types that are as yet undefined by ISO.
References
[1] ISO/IEC JTC1/SC29/WG11 CD1-11172, Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbits/s, Part 1 (Systems), November 1991.
Vol. 4, No. 2, April 1992