Signal Processing: Image Communication 9 (1997) 327-342
Tests on MPEG-4 audio codec proposals

L. Contin(a,*), B. Edler(b), D. Meares(c), P. Schreiner(d)

(a) Centro Studi e Laboratori Telecomunicazioni, Via Reiss Romoli 274, 10148 Torino, Italy
(b) Universität Hannover, Schneiderberg 32, 30167 Hannover, Germany
(c) BBC, Kingswood Warren, Tadworth, Surrey, KT20 6NP, UK
(d) Scientific Atlanta, P.O. Box 6850, Norcross, GA 30091, USA

* Corresponding author. E-mail: [email protected]
Abstract

During December 1995, subjective tests were carried out by members of the Moving Picture Experts Group (MPEG, ISO/IEC JTC1/SC29/WG11) to select the proposed technology for inclusion in the audio part of the new MPEG-4 standard. The new standard addresses coding for more than just the functionality of data rate compression; material coded at very low bit-rates is also included. Thus, different testing methodologies were applied: ITU-R Rec. BS 1116 for a bit-rate of 64 kbit/s per channel, and ITU-T Rec. P.80 for lower bit-rates or for functionalities other than data rate compression. Proposals were subjectively tested for coding efficiency, error resilience, scalability and speed change: a subset of the MPEG-4 'functionalities'. This paper describes how two different evaluation methods were used and adjusted to fit the different testing requirements. This first major effort to test coding schemes at low bit-rates proved successful, and decisions for MPEG-4 technology were made on the basis of the test results. This was the first opportunity for MPEG members to carry out tests on the submitted functionalities, and much was learnt in the process. As a result, some suggestions are made to improve the way new functionalities can be subjectively evaluated. © 1997 Elsevier Science B.V.

Keywords: Audio quality; Subjective assessment methods; MPEG-4
1. Introduction

The goal of the current MPEG-4 audio activities is the definition of a standard for the representation of audio signals at data rates of 64 kbit/s per channel and below. In addition to the requirement of high coding efficiency, additional functionalities are envisaged in order to cover a wider range of applications. The standardisation process is based on a so-called Verification Model (VM). The VM is defined
by a signal processing structure composed of basic modules and appropriate interfaces between these modules. The use of the VM approach allows the collaboration of as many different companies as possible: various alternatives for the different modules can be evaluated in core experiments, in which new proposals are compared against the current VM. In order to define an initial VM, an MPEG-4 Call for Proposals was issued in July 1995 [6], accompanied by a document describing the test procedures for subjective evaluation [12]. In response to this Call for Proposals, a total of 17 organisations submitted proposals, several of them
covering more than one functionality and bit-rate. During December 1995, tests were carried out in order to establish a rank order, based on perceived quality, for the various algorithms proposed. This was the first step towards the definition of an audio VM. The functionalities that were considered the most representative for MPEG-4 audio coding, and which were covered in this first round of subjective tests, were: coding efficiency, error resilience, scalability, and speed change. For submissions addressing other functionalities mentioned in the Call for Proposals, an additional evaluation by experts was arranged [2].

This paper describes in Section 2 the assessment methods used and the reasons behind their choice. Section 3 illustrates the laboratory set-up adopted, while Sections 4 and 5 provide details about the test material and the experimental design, respectively. Section 6 then examines the statistical analysis which was applied to the results; details of the test results themselves can be found in [3,4]. Finally, in the conclusions, suggestions are given on possible improvements to the way new functionalities can be subjectively evaluated.
2. Test methods

The main differences between this first round of tests on MPEG-4 audio proposals and the previous MPEG audio tests conducted for MPEG-1 and MPEG-2 relate to the range of bit-rates considered and the new functionalities implemented. In order to allow the evaluation of high quality audio codecs as well as of codecs operating at very low bit-rates, two different test methods were applied; they are outlined in the two subsections below. Concerning the testing of the new functionalities, since the perceived audio quality was considered an evaluation criterion suited to all the functionalities taken into account, no specific test methods were designed for any of them. The data obtained from the testing of the new functionalities were processed separately, as illustrated in Section 6.2.
2.1. The ITU-R method

For broadcasting applications requiring high quality audio codecs, MPEG carried out the evaluations according to the triple-stimulus, hidden-reference, double-blind test method. This method, which is fully described in ITU-R Rec. BS 1116 [8], is designed to evaluate systems generating only small impairments to the original signal, by direct comparison of the coded signal with the original. It is important that this technique be used to identify small differences in basic audio quality only when the coded signal sounds quite similar to the original source signal. Though more time consuming than some other test methods, this approach, if carried out rigorously enough, has been proven within MPEG to give very consistent results [5].

In this method, three audio stimuli, reference 'Ref', signal 'A' and signal 'B', are assessed by the listeners. 'Ref' and one of the audio signals 'A' or 'B' are the reference or uncoded source material, whilst the remaining stimulus 'B' or 'A' is the coded material. The allocation of 'A' and 'B' to the hidden reference or the coded version is decided at random, and the identity is known neither by the listener nor by the person running the test. In the case of these MPEG tests, the stimuli were grouped into a number of test sequences or trials pre-recorded onto tape in the sequence shown in Fig. 1. The progress of the tests was conveyed to the listener by means of synthetic voiced announcements recorded in sequence, i.e. 'Item N', 'R', 'A' and 'B'. Between consecutive sessions the listeners were required to take a break of at least 20 min.

Rec. BS 1116 allows for assessment by either loudspeaker or headphone reproduction. However, as these tests were of monophonic codecs and the testing needed to be performed in a simple, stable manner that would allow many companies to share in the work, it was decided to use specific high quality headphones.

Because of the complexity of the test method, and because it was expected that the coded signals would be of high quality, the method includes a requirement that all test sites conduct training of the listeners prior to running the formal sections of the tests.
Fig. 1. Protocol of the triple-stimulus, hidden-reference, double-blind test method (ITU-R method).

Fig. 3. Protocol of absolute category rating (ITU-T method).
5.0 - Imperceptible
4.0 - Perceptible, but not annoying
3.0 - Slightly annoying
2.0 - Annoying
1.0 - Very annoying

Fig. 2. The ITU-R impairment scale used in the MPEG-4 tests.
The aim of this training is to eliminate variations in the test results due to learning effects, by familiarising the listeners both with the test method and with the types of artefacts which they will be judging. The subjects were instructed to give their grades according to the ITU-R five-point impairment scale shown in Fig. 2, which was described to them as a continuous scale with anchor points. In awarding their grades to stimuli A and B, the subjects were required to grade at least one of A or B as 5.0 and to give their results to one decimal place.
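As an illustration of the double-blind assignment described above, the following minimal sketch builds a trial key in which the hidden reference is allocated to 'A' or 'B' at random. The function and the key format are hypothetical tooling, not part of the BS 1116 specification; in a real test the key would be hidden from listeners and administrators until analysis.

```python
import random

def make_trial(item_id, rng):
    """Randomly hide the reference behind 'A' or 'B' for one trial."""
    hidden = rng.choice(["A", "B"])
    return {"item": item_id,
            "A": "reference" if hidden == "A" else "coded",
            "B": "reference" if hidden == "B" else "coded"}

rng = random.Random(1995)                 # fixed seed so the key is reproducible
key = [make_trial(n, rng) for n in range(1, 11)]
# Listeners grade both A and B on the five-point impairment scale,
# with at least one of the two graded 5.0; the key is consulted only
# when the diff-grades are computed (see Section 6.1).
```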
2.2. The ITU-T method

One of the methods recommended by ITU-T for assessing speech quality in telecommunications is the so-called Absolute Category Rating (ACR). It is intended to give a mean grade for the subjective impression of signals which may contain clearly audible degradation. Through the use of anchor signals representing various levels of degradation, this technique can be used to provide a rank ordering, in terms of mean opinion scores (MOS), for a wide variety of processed audio signals. A complete description of the ACR method can be found in ITU-T Rec. P.80 [9].

The ITU-T method was also run from pre-recorded tapes, but the procedure was in other respects much simpler than that used in the ITU-R method. The recorded stimuli followed the sequence shown in Fig. 3. Beeps signalled the start of each test and helped align the subject with the results sheet; to aid this alignment, every tenth stimulus was preceded by a double beep and all the rest by a single beep, as shown in Fig. 3. The subjects were required to grade the presentation quality using a five-grade scale (Excellent/Good/Fair/Poor/Bad), described as integer grades to be marked on the score sheet. The mean of the grades for a particular codec was the Mean Opinion Score (MOS). As in the previous method, training and instruction were included in the test procedure to acclimatise the listeners to the test method and to the range of quality of the stimuli. Reproduction was by means of high quality headphones.
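A minimal sketch of the MOS calculation implied above: the integer ACR votes (1 = Bad ... 5 = Excellent) collected for each condition are simply averaged. The condition names and votes below are invented for illustration.

```python
from statistics import mean

# Hypothetical ACR votes (1 = Bad ... 5 = Excellent) per test condition
votes = {
    "codec_X_6kbps": [4, 3, 4, 5, 3, 4],
    "MNRU_Q24dB":    [2, 3, 2, 2, 3, 2],
}

mos = {cond: mean(v) for cond, v in votes.items()}
print(mos)   # e.g. {'codec_X_6kbps': 3.83..., 'MNRU_Q24dB': 2.33...}
```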
Table 1
Test site details

Site | Room size (m x m) | Noise level | Headphone type | Audio tape deck model
CCETT | 1 x 4 | Approx. 22 dB(A) | STAX Lambda pro | Sony PCM-2500
FhG | 5 x 3 | < 30 dB(A) | STAX Lambda pro, STAX Sigma pro | Panasonic SV-3700
Motorola | 8 x 8 | < 50 dB(A) | STAX Lambda pro, STAX Lambda nova signature | Sony PCM-7010
Mitsubishi | 4 x 8 | Approx. 28 dB(A) | STAX Lambda pro | Sony DTC-75ES
NTT | 4 x 3 | Sound shielded | STAX Lambda pro | Sony DTC-77ES
Sony | 8 x 8 | Approx. 30 dB(A) | STAX Lambda pro, STAX Lambda nova classic | Sony DTC-55ES
3. Laboratory arrangements

Many organisations were involved in the subjective evaluation of the MPEG-4 audio coding proposals. Some details of the test sites and the facilities used in these tests are shown in Table 1. As the tests were of monophonic codecs, a consistent form of high quality headphones was used; this, to a great extent, excludes the effects of the different listening room acoustics at the different test sites. However, care was still required to ensure a relatively low noise environment, to avoid listener distraction. The use of headphones also allowed several listeners to make their assessments simultaneously.

4. Test material

This section describes the test signals used, which comprise the audio sources, the reference codec signals, the anchor signals and the coded excerpts.

4.1. Source material

For the MPEG-4 audio tests, an input library was established containing all the source audio excerpts. These were selected by the MPEG Audio Subgroup members on the basis of previous experience of what constitutes a critical signal, and of the requirement that the excerpts cover speech and music applications. Source material was also selected to cover three types of content complexity: single sources, single sources with background, and complex sources. To aid intelligibility within the tests, programme items were available both for Japanese language listeners and for English and German language listeners. The duration of the test items varied between 6 and 10 s.

The test signals were available with an original sampling rate of 48 kHz. Where required by the conditions for a test, these were converted to 16 and 8 kHz by a specified sampling rate conversion program (the sketch below illustrates the operation). The items used in the test are listed in Table 2. Each item was assigned an item type: items of type 'E' contain European (English or German) speech, items of type 'J' contain Japanese speech, and items of type 'M' contain music (single instrument or orchestra).
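The paper does not identify the conversion program that was specified; purely as an illustration, polyphase resampling with a built-in anti-aliasing filter (here via SciPy) performs the same 48 to 16/8 kHz operation.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

def convert_rate(x, src_hz, dst_hz):
    """Polyphase sampling-rate conversion with anti-aliasing filtering."""
    g = gcd(src_hz, dst_hz)
    return resample_poly(x, dst_hz // g, src_hz // g)

x48 = np.random.randn(48000)            # stand-in for 1 s of 48 kHz source audio
x16 = convert_rate(x48, 48000, 16000)   # material for the 16 kbit/s tests
x8  = convert_rate(x48, 48000, 8000)    # material for the 2 and 6 kbit/s tests
```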
Table 2
MPEG-4 audio library

Item type | Content complexity | Audio test material
E | Single source | German male speech 1; German male speech 2; English female speech; German female speech
E | Complex sources | 2 English talkers
E | Single source with background | English female speech with car noise
J | Single source | Japanese male speech 1; Japanese female speech 1; Japanese female speech (sentence pair); Japanese male speech (sentence pair); Japanese female speech 2; Japanese female speech 3; Japanese male speech 2; Japanese male speech 3
J | Complex sources | 2 Japanese talkers
J | Single source with background | Japanese male speech with background noise; Japanese female speech with car background noise
M | Single source | Harpsichord; Castanets; Female vocal
M | Complex sources | Trumpet solo with orchestra; Haydn

(a) This item was only used at 2 kbit/s.
4.2. Reference conditions

In order to compare the quality of the proposals with that of proven technology, excerpts coded by standard schemes were added to the test material. For each bit-rate, the standard coding scheme used as a reference was chosen from those expected to be very close, in terms of performance, to the coding schemes under test. The reference codecs are specified for each bit-rate in Table 4 of Section 5.2. The listeners were not informed when a reference codec was being played, so they evaluated it exactly like any other test condition.
4.3. Anchor conditions for the ITU-T method

When the ITU-T method is used, anchor conditions are usually added to the set of test conditions to allow comparison of the results coming from different laboratories. Usually, these anchor conditions are produced by a Modulated Noise Reference Unit (MNRU), complying with ITU-T Rec. P.81 [10]. The MNRU produces random noise with amplitude proportional to the instantaneous speech amplitude (speech-correlated or multiplicative noise). This noise is perceptually very similar to the quantisation noise produced by many speech coders, especially those employing non-linear companding laws and operating at about 16 kbit/s or more. Ref. [10] specifies MNRUs for the conventional telephone bandwidth (MNRUn) and for a 7 kHz wide band (MNRUw). The ratio of speech level to multiplicative noise level, expressed in dB, is known as Q; both MNRUn and MNRUw are characterised by their Q value. For a given coder, an equivalent Q value can be determined by means of listening tests. When the degradation produced by the MNRU is perceptually very close to that produced by all the
coders under test, the conversion of the Mean Opinion Score (MOS) into the equivalent Q value is used to calibrate the test results of each laboratory, and hence to compare the results from different laboratories. When the degradation produced by the MNRU is perceptually different from that of at least one of the codecs under test, the MNRU is used simply to expand or complete the range of quality. The latter was the case for the MPEG-4 audio tests.

In the MPEG-4 tests, a different set of MNRU anchors was chosen for each bit-rate, in order to cover the full range of quality encountered in each test without unnecessarily expanding the range of any individual test. In the tests at bit-rates above 16 kbit/s, the sampling rate was 48 kHz, for which no standard MNRU generation is available. In order to obtain anchors whose error spectra were not too annoying, a special MNRU was designed which produced modulated noise band-limited to 8 kHz; it was available with speech to multiplicative noise ratios of 17, 24, 31, 38 and 45 dB. The anchor conditions are specified for each bit-rate in Table 4 of Section 5.2. During the tests, the listeners were not informed when an anchor condition was being played, so they evaluated them exactly like any other test condition.
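The core of an MNRU is the relation y(i) = x(i) * (1 + 10^(-Q/20) * N(i)), with N unit-variance white noise. The sketch below reproduces the band-limited 48 kHz variant described above, under the assumption of a generic Butterworth low-pass filter in place of the exact filters specified in P.81.

```python
import numpy as np
from scipy.signal import butter, lfilter

def mnru(speech, q_db, fs=48000, noise_bw=8000, seed=0):
    """Speech-correlated (multiplicative) noise at a given Q in dB,
    band-limited as in the special 48 kHz MNRU described above."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))   # unit-variance white noise
    b, a = butter(8, noise_bw / (fs / 2))      # generic low-pass, not the P.81 filter
    noise = lfilter(b, a, noise)
    noise /= noise.std()                       # restore unit variance after filtering
    gain = 10.0 ** (-q_db / 20.0)              # noise Q dB below the speech level
    return speech * (1.0 + gain * noise)

speech = np.random.randn(48000)                # stand-in excerpt
anchor_24db = mnru(speech, 24)                 # one of the five anchor conditions
```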
4.4. Coded excerpts

The submissions to the tests had to contain a detailed technical description, the encoded data in the form of bit stream files, and executable decoder programs for a defined hardware platform. One proposal could address more than one bit-rate and/or functionality. For submissions operating at sampling rates lower than 48 kHz, a single software sampling rate converter was used to produce the input material and to convert the output back to 48 kHz prior to the test tape preparation. The candidate codecs evaluated in the subjective tests based on the ITU-T method are listed in Table 3, which also gives some information about the key technologies and the addressed functionalities.
5. Tests performed

5.1. Test conditions and assessment methods
In the MPEG-4 audio tests, a wide range of bit-rates (from 2 to 64 kbit/s) was considered. For each bit-rate investigated, suitable test conditions were defined, including the type of test items, the input/output sampling frequency, the reference codecs, and the anchor conditions. Table 4 summarises the test conditions used for each bit-rate.

Considering that the lower bit-rates (from 2 to 16 kbit/s) are likely to be used for speech communication and the higher bit-rates (above 16 kbit/s) for 'broadcasting' applications, items of type 'E' and 'J', which are primarily speech, were mainly used at the lower bit-rates, while items of type 'M' were mainly used at the higher bit-rates. However, in order to obtain a more complete evaluation of the proposals' performance, a few items of type 'M' were included in the tests at 6 and 16 kbit/s, and one speech item was included at bit-rates above 16 kbit/s. In order to adapt the source material to the bit-rates, different sampling frequencies were specified. In the case of scalability, the input sampling rate had to be in accordance with that specified for the highest bit-rate; when decoding the scalability subsets, a decoder had to deliver its output at the sampling rates specified for the particular bit-rates in each group.

After a quality check by an expert panel, it was decided to evaluate with the ITU-R test method only the codecs operating in error-free conditions at the highest bit-rate (i.e. 64 kbit/s). In the case of error resilience testing, even for the codecs operating at 64 kbit/s, preference was given to the ITU-T ACR method for the two conditions of random errors and burst errors.
5.2. Functionalities

A subset of the functionalities defined in the MPEG-4 audio requirements was evaluated during the tests; this subset was recognised to be the most representative for the standard.
Table 3
Candidate codecs and addressed functionalities

Company/Institution | Basic technologies | Compression (kbit/s)
Alcatel/Philips/RAI | Subband adaptive coding | 24, 40
AT&T | Transform coding | 24, 40
AT&T | Waveform interpolation | 2, 6
Bosch/CSELT/MATRA (MAVT) | Subband coding/LPC + enhancement for fine step scalability | -
INESC | Subband/transform coding, Huffman coding, harmonic component extraction | 16, 24, 40
Matsushita | CELP + postprocessing for speed control + enhancement for error robustness | 6
Motorola | Transform coding, Huffman coding + enhancement for error robustness | 24, 40
NEC | Transform coding + enhancement for scalability | 24, 40
NEC | CELP | 6
NTT | Transform coding, VQ + enhancement for error robustness | 64
NTT | CELP, pitch synchronous innovation + enhancement for error robustness | 6
NTT DoCoMo | Transform coding + scalability with low bit-rate codecs | 24, 40
Philips | CELP with efficient search strategies | 6
Samsung | Subband coding, ADPCM, Huffman coding | 16
Sony | Subband/transform coding + scalability with LPC | -
Sony IPC | Subband/transform coding of LPC residual + enhancement for scalability | -
Sony IPC | LPC with harmonic vector excitation + enhancement for error robustness | 2, 6
University of Hannover/Deutsche Telekom | Analysis/synthesis coding for individual spectral lines + enhancement for scalability | 6, 16
University of Erlangen/FhG | CELP with multi mode coding | 6
JVC | Transform coding, VQ | 16, 64
Table 4
Test conditions and test methods adopted

Bit-rate 2 kbit/s - in/out sampling frequency: 8 kHz; anchors (MNRU): 10, 20, 30, 40 dB; reference codec: FS 1016 (4.8 kbit/s); assessment method: ITU-T
- Coding efficiency: 6E + 11J items
- Speed change 'normal speed': 6E + 11J items
- Speed change 'speed up': 6E + 11J items
- Scalability 3: 4E + 8J + 3M items

Bit-rate 6 kbit/s - in/out sampling frequency: 8 kHz; anchors (MNRU): 10, 20, 30, 40 dB; reference codec: G.729 (8 kbit/s); assessment method: ITU-T
- Coding efficiency: 5E + 8J + 3M items
- Error res. 'no errors': 5E + 8J + 3M items
- Error res. 'burst errors': 5E + 8J + 3M items
- Error res. 'random errors': 5E + 8J + 3M items
- Speed change 'normal speed': 5E + 8J + 3M items
- Speed change 'speed up': 5E + 8J + 3M items
- Scalability 1: 5M items
- Scalability 2: 5M items
- Scalability 3: 4E + 8J + 3M items

Bit-rate 16 kbit/s - in/out sampling frequency: 16 kHz; anchors (MNRU): 10, 17, 24, 31, 38 and 45 dB; reference codec: G.722 (64 kbit/s); assessment method: ITU-T
- Coding efficiency, Error res. 'no errors', Error res. 'burst errors', Error res. 'random errors', Scalability 3: 4E + 8J + 3M items each

Bit-rate 24 kbit/s - in/out sampling frequency: 48 kHz; anchors (MNRU): 17, 24, 31, 38 and 45 dB; reference codec: MPEG-2 Layer III (low sampling frequency mode); assessment method: ITU-T
- Coding efficiency: 5M items
- Scalability 2: 5M items

Bit-rate 40 kbit/s - in/out sampling frequency: 48 kHz; anchors (MNRU): 17, 24, 31, 38 and 45 dB; reference codec: MPEG-1 Layer III; assessment method: ITU-T
- Coding efficiency: 5M items

Bit-rate 64 kbit/s - in/out sampling frequency: 48 kHz; 5M items
- Error res. 'no errors' (2 proposals): reference codec MPEG-1 Layer III; ITU-R method
- Error res. 'burst errors' (2 proposals): anchors (MNRU) 17, 24, 31, 38 and 45 dB; ITU-T method
- Error res. 'random errors' (2 proposals): anchors (MNRU) 17, 24, 31, 38 and 45 dB; ITU-T method
- Scalability 1 (4 proposals): reference codec MPEG-1 Layer III; ITU-R method
- Scalability 2 (6 proposals): reference codec MPEG-1 Layer III; ITU-R method

Note: in the case of scalability, the excerpts used are those specified for the highest bit-rate.
The functionalities investigated were:
- Coding efficiency - This is the ability to provide subjectively better audio quality at comparable bit-rates, compared with existing or emerging standards. It is the functionality that is traditionally evaluated in MPEG tests.
- Error resilience - The error resilience tests assessed a codec's operation in error-prone environments by subjecting the compressed bitstreams to channel bit error conditions representative of those present in a variety of networks and storage media. The following error conditions were taken into account: 'no errors', 'random errors' and 'burst errors' (see the sketch after this list).
- Scalability - This is the possibility of decoding a subset of the coded data, specified by a lower bit-rate than the total. Three scalability operating modes were defined according to the bit-rates: scalability 1 (64/6 kbit/s), scalability 2 (64/24/6 kbit/s) and scalability 3 (16/6/2 kbit/s). In the case of scalability 2 and 3, the subset specified by the third bit-rate had to be a subset of that specified by the second bit-rate.
- Speed change - This is the possibility of changing the playback speed of the decoder without affecting the pitch. At 2 and 6 kbit/s, the two conditions 'normal speed' and 'speed up' were evaluated.

The functionalities listed above were not evaluated at every bit-rate; their distribution across the bit-rates is shown in Table 4.
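The exact error patterns applied to the bitstreams were specified by the test plan and are not reproduced in the paper; the following sketch merely illustrates the two kinds of channel impairment named above, with hypothetical error rates and burst lengths.

```python
import random

def random_bit_errors(bits, ber, rng):
    """Flip each bit independently with probability `ber`."""
    return [b ^ (rng.random() < ber) for b in bits]

def burst_bit_errors(bits, burst_len, start_prob, rng):
    """Start an error burst with probability `start_prob` per bit and
    corrupt the next `burst_len` bits."""
    out = list(bits)
    i = 0
    while i < len(out):
        if rng.random() < start_prob:
            for j in range(i, min(i + burst_len, len(out))):
                out[j] ^= 1
            i += burst_len
        else:
            i += 1
    return out

rng = random.Random(0)
stream = [rng.getrandbits(1) for _ in range(64000)]   # 1 s at 64 kbit/s
noisy_random = random_bit_errors(stream, ber=1e-3, rng=rng)
noisy_burst  = burst_bit_errors(stream, burst_len=64, start_prob=1e-4, rng=rng)
```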
5.3. Experimental design

5.3.1. Distribution of the tests across the test sites
Tests were conducted at several test sites. Four laboratories carried out the tests using the method recommended by ITU-R in Rec. BS 1116, to evaluate the proposals at 64 kbit/s for scalability 1 and 2 and the proposals for error robustness (without channel errors). Five laboratories carried out the tests using the ACR method of ITU-T Rec. P.80 to evaluate all the other conditions. In a few cases, the same laboratory carried out both kinds of test (using both the ITU-R and the ITU-T methods).

In the tests according to the ITU-R method, a total of 33 listeners (5 + 5 + 7 + 16) carried out the evaluation using headphones. The experience of the subjects in audio evaluation varied among the test sites, from 'inexperienced' to 'experienced'. At each test site, the test material included the whole set of excerpts (4 music + 1 speech) coded through both the reference codec (MPEG-1 Layer III) and all the codecs under test at 64 kbit/s.

For the tests at bit-rates up to 16 kbit/s, the test material was divided into two subsets: one including the English/German speech items and three music items, the other including the Japanese speech items and the same three music items. The first subset was used in the USA (Mitsubishi and Motorola) and Germany (Fraunhofer-Gesellschaft, Institut für Integrierte Schaltungen), and the second subset was used in Japan (Sony). In total, 53 listeners (3 + 5 + 10 + 35) took part in these tests. Some of the listeners in the US/Germany category were high fidelity audio experts but had no experience of low rate speech evaluation; the rest of the listeners had no experience of audio/speech evaluation.

For the tests at bit-rates above 16 kbit/s but below 64 kbit/s, the same set of items was used in the four laboratories that carried out the tests: two in the USA (Mitsubishi and Motorola), one in Germany (Fraunhofer-Gesellschaft, Institut für Integrierte Schaltungen) and one in Japan (NTT). The common set of items included four music items and an English speech item. In total, 40 listeners (3 + 5 + 8 + 24) carried out these tests.

5.3.2. Presentation order

In the tests using the ITU-T ACR method, where no direct quality reference is used, the presentation order could have an impact on the votes: a listener's grade could be higher or lower than usual if the quality of the signal previously played was, respectively, very low or very high. In order to reduce this ordering effect, different randomisation orders were used with different groups of listeners, and a repetition of the test with a different randomisation was conducted for each group.

Each test session included stimuli at only one bit-rate, regardless of functionality. The signals decoded after the introduction of channel errors, or decoded with modified playback speed, were included on the test tapes corresponding to their standard bit-rate, whilst the analysis of the results was carried out separately; the same applies to the scalable codecs. Test sessions lasted no more than 20 min; to achieve this, when necessary, the whole set of test conditions corresponding to a particular bit-rate was split between two or more sessions. To give an idea of the number of test conditions evaluated, Table 5 shows, for each bit-rate, the total number of
Table 5
Number of items on the test tapes

Bit-rate (kbit/s) | Test sites in Europe/US | Test sites in Japan
2 | 89 | 147
6 | 296 | 391
16 | 112 | 176
24/40/64 | 160 | 160
conditions (excerpts processed through either a proposal codec, a reference codec, or an MNRU) taken into account.

In the tests using the ITU-R method, the same pseudo-random order was used in all the laboratories, and repetitions were not applied. Repetitions and different presentation orders were not considered necessary because of the presence of the reference. (Subsequent experience has shown that this is not necessarily a safe assumption, particularly where one set of trials covers a wide range of achieved quality.)
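A minimal sketch of the randomisation scheme described above for the ACR tests: each listener group receives its own shuffled order, followed by a repetition in a second, independently shuffled order. The group count and stimulus names are placeholders, not values from the test plan.

```python
import random

def presentation_orders(stimuli, n_groups, seed=1995):
    """One independent randomisation per listener group, plus a
    differently randomised repetition for each group."""
    rng = random.Random(seed)
    orders = []
    for _ in range(n_groups):
        first_pass  = rng.sample(stimuli, len(stimuli))
        second_pass = rng.sample(stimuli, len(stimuli))  # the repetition
        orders.append(first_pass + second_pass)
    return orders

stimuli = [f"cond_{i:03d}" for i in range(89)]   # e.g. the 2 kbit/s tape in Europe/US
tapes = presentation_orders(stimuli, n_groups=3)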
6. Results

This section describes how the data were statistically processed and how the results were presented. Since the goal of this paper is to report the test methodology rather than the performance of the proposals, the detailed results are not reported here; as previously mentioned, the full test results can be found in [3,4]. The following subsections illustrate examples of the tables and graphs obtained for each test method.
6.1. Statistical analysis of ITU-R method test data
A strict analysis of the data from these tests would have required separate analyses of the results from each test centre, to determine whether or not they could be combined into a common pool of results (i.e. to determine whether a common population existed). However, for that to be realised, each test site would have been required to use 20 or more listeners (as recommended in BS 1116). On the other hand, these were intended as preliminary assessments of the proposals, and so it was decided to break from strict adherence to statistical protocol and to pool the results so that an overview could be taken.

Detailed examination of the test results threw doubt on the results from one test site, which showed a number of differences from those obtained at the other sites. On further examination, a number of significant inconsistencies were discovered in the way the tests had been conducted at that particular site, despite the careful pre-test instructions. As a result, it was decided to exclude the results from that site from the final analysis.

An example of the test results, averaged over all the remaining test sites and all programme items, is shown in Fig. 4. It should be remembered that, in this test method, both 'A' and 'B' are given a grade by each listener. From these grades, the difference between the grade for the coded stimulus and the grade for the hidden reference is calculated; this is known as the 'diff-grade', and it is the diff-grades that are subjected to the statistical analysis. The presentation of Fig. 4 shows the mean score (mean diff-grade) and the upper and lower 95% confidence interval bounds. The full results included an analysis according to the different programme excerpts and according to the test sites, to show what effect these parameters might have had on the results.

Very roughly, one can draw conclusions about the rank order of the codecs if the lower bound for one codec exceeds the upper bound for another. However, the procedure used in these tests was insufficient to draw any conclusions other than an estimate of the rank order of the codecs under test. Thus, the project team was able to conclude that some of the codecs performed better than the reference version of MPEG-1 Layer III at 64 kbit/s [7]. The best of these, it was noted, were using techniques already in use in the NBC reference model work [1]. Additionally, there was some evidence that, in going from scalability 1 to scalability 2, there is a loss of performance of the best codecs, possibly due to the increased overhead associated with the increased flexibility.
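A minimal sketch of the diff-grade statistic just described, with a t-based 95% confidence interval computed over listeners; the grades below are invented for illustration.

```python
import numpy as np
from scipy import stats

def diffgrade_ci(coded_grades, ref_grades, conf=0.95):
    """Mean diff-grade (coded minus hidden reference) and its
    t-based confidence interval across listeners."""
    d = np.asarray(coded_grades, float) - np.asarray(ref_grades, float)
    m, se = d.mean(), stats.sem(d)
    half = se * stats.t.ppf(0.5 + conf / 2, len(d) - 1)
    return m, m - half, m + half

# Hypothetical grades from five listeners for one codec/item pair
coded = [3.8, 4.1, 3.5, 4.0, 3.6]
ref   = [5.0, 5.0, 4.9, 5.0, 5.0]
print(diffgrade_ci(coded, ref))   # mean diff-grade with 95% CI bounds
```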
Fig. 4. Example of the test results for 64 kbit/s codecs (all excerpts, all sites, 28 subjects): mean diff-grades with 95% confidence intervals for the error resilience, scalability 1 and scalability 2 proposals and the MPEG-1 Layer III reference.
6.2. Statistical analysis of ITU-T test data

For the statistical analysis at a particular bit-rate, all functionalities which were tested with the same test conditions (i.e. in/out sampling frequency, type of channel error, speed) were combined into a common ranking group, as shown in Table 6. The statistical analysis was carried out within each ranking group separately; thus, one graph or table may contain the results of different functionalities. Within each ranking group, the Mean Opinion Score (MOS) and the 95% confidence interval (CI) were calculated for each codec. The ranking calculation was based on a complete statistical Tukey analysis, which provides a probability of equivalence for each pair of proposals [11]. The criterion used to decide that two proposals were statistically significantly different (SSD) was a probability of equivalence lower than 0.05.

The statistical analysis was carried out at different levels of detail. A first analysis was performed on the whole set of data, including the votes coming from all the test sites. An Analysis of Variance (ANOVA) was carried out to investigate whether the test site was a significant factor of variation; the results showed that within each ranking group this was the case. The influence of the factor 'test site' was expected because, in spite of the efforts to make the test conditions as similar as possible across the test sites, important differences, such as the differences in the test material (due to the different languages) and in the number of listeners, could not be avoided. For this reason, the statistical analysis was carried out on the data coming from each laboratory separately.

For each ranking group and each test site, an ANOVA was performed to check for the dependency on item type and for unwanted dependencies on the stimuli presentation order and the repetition. As expected, the item type was a significant factor, while the factors 'presentation order' and 'repetition' were significant in a few cases, but at a considerably lower level than 'codec' and 'item type'. In order to further investigate the dependency of the results on the item type, in the ranking groups corresponding to bit-rates up to 16 kbit/s, MOS and CI were calculated separately for each item type (i.e. European speech, Japanese speech, and music).
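A minimal sketch of this processing chain: a one-way ANOVA for the factor 'codec', followed by Tukey pairwise comparisons at alpha = 0.05, here using SciPy and statsmodels. The votes are invented, and the paper's actual analysis also covered the factors 'test site', 'item type', 'presentation order' and 'repetition'.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical ACR votes per condition within one ranking group
votes = {
    "codec_B": [2, 3, 2, 3, 3, 2, 3, 2],
    "codec_G": [3, 3, 4, 3, 2, 3, 3, 4],
    "MN24":    [2, 2, 3, 2, 2, 2, 3, 2],
}

# One-way ANOVA: is 'codec' a significant factor of variation?
print(stats.f_oneway(*votes.values()))

# Tukey analysis: pairs with p < 0.05 are statistically significantly
# different (SSD) in the sense used in the text
grades = np.concatenate([np.array(v, float) for v in votes.values()])
labels = np.repeat(list(votes), [len(v) for v in votes.values()])
print(pairwise_tukeyhsd(grades, labels, alpha=0.05))
```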
Table 6
Ranking groups used for statistical analysis

Bit-rate (kbit/s) | Functionality | In/out sampling frequency (kHz) | Ranking group
2 | Coding efficiency | 8 | A
2 | Scalability 3 | 16 | H
2 | Speed change 'normal speed' | 8 | A
2 | Speed change 'speed up' | 8 | O
6 | Coding efficiency | 8 | B
6 | Error res. 'no errors' | 8 | B
6 | Error res. 'burst errors' | 8 | I
6 | Error res. 'random errors' | 8 | J
6 | Scalability 1 | 48 | F
6 | Scalability 2 | 48 | F
6 | Scalability 3 | 16 | G
6 | Speed change 'normal speed' | 8 | B
6 | Speed change 'speed up' | 8 | P
16 | Coding efficiency | 16 | C
16 | Error res. 'no errors' | 16 | C
16 | Error res. 'burst errors' | 16 | K
16 | Error res. 'random errors' | 16 | L
16 | Scalability 3 | 16 | C
24 | Coding efficiency | 48 | D
24 | Scalability 2 | 48 | D
40 | Coding efficiency | 48 | E
64 | Error res. 'burst errors' | 48 | M
64 | Error res. 'random errors' | 48 | N
In the other ranking groups, given the differences among the single items, MOS and CI were also calculated for each individual item. As an example, several graphs for the 16 kbit/s test results are shown in Figs. 5-8. In order to enable a condensed presentation, the format is slightly changed with respect to that of the test report [4]: all results for the functionalities compression, scalability and error resilience are now combined in one graph, which also contains the MNRU and the reference codec (G.722 at 64 kbit/s). For the error resilience test results, the three evaluated error conditions are denoted in the following way: 'o' = error-free, 'b' = burst errors, 'r' = random errors. The first graph shows the averages over all items of the three item types, while the other three graphs present the individual averages for the item types 'E', 'J' and 'M'. Comparison of the results for different item types shows that there are clear differences in the behaviour of the codecs for different input material.

The results of the Tukey test were presented for each bit-rate in tables like Tables 7 and 8, which relate to the 16 kbit/s tests. Each row corresponds to a test condition (a codec, a reference or an anchor), as indicated in the first column. The next columns report the MOS for each item type, followed by the MOS calculated over all item types. The rightmost column presents the results of the Tukey test: codecs (or MNRUs) that are SSD are designated by different Roman numerals. For example, in the table corresponding to the Japanese test site, codec G is not SSD with respect to codec C, but it is SSD with respect to the MNRU with a speech to multiplicative noise ratio of 24 dB (MN24).
7. Conclusions

This paper has presented an overview of the first tests carried out with the aim of characterising the proposals for MPEG-4 audio codecs. The task was complex, primarily because of the number of functionalities being evaluated: of these, only basic data rate compression had previously been evaluated by MPEG, so new test procedures were being introduced at the same time as the tests were being planned and conducted.

The scope and complexity of the evaluations required a considerable level of co-operation between many codec developers and test sites. The fact that valuable insights into the usefulness and relative qualities of the different functionalities were achieved is a testament to the mutual support available within the MPEG team. The final result was sufficient knowledge of the candidates for the MPEG-4 audio functionalities to enable the first audio VM to be designed. A further benefit of the process has been a better understanding of the test methods and of the ways in which test planning and conduct can be improved.
Fig. 5. Results of the 16 kbit/s tests (average over all items). For error resilience at 16 kbit/s just one proposal (G) was submitted for evaluation; the subscripts o, b, r correspond to error-free, burst error and random error conditions, respectively.
Fig. 6. Results of the 16 kbit/s tests (average over all 'E' items). For error resilience at 16 kbit/s just one proposal (G) was submitted for evaluation; the subscripts o, b, r correspond to error-free, burst error and random error conditions, respectively.
Whilst it is commendable to share the evaluation task amongst many test sites, concentration of the tests at one or, possibly, two sites would have reduced some of the elements of variability in the results. Regardless of the care taken in running the tests, it is inevitable that features at one site will not be truly replicated at other sites; this would be particularly true if or when tests need to be conducted using loudspeaker listening rather than headphone listening. It would also be beneficial to build into the test method some way of checking listener reliability, so that those listeners who were guessing, rather than voting reliably, could be eliminated.
Fig. 7. Results of the 16 kbit/s tests (average over all 'J' items).
Fig. 8. Results of the 16 kbit/s tests (average over all 'M' items).
If these features cannot be built into future tests, then it is important to use larger numbers of listeners at each site; this factor alone will reduce the uncertainty in the results.

Finally, it is worth noting that the development of the MPEG-4 audio VM is now very much underway. Intermediate tests are being planned and conducted as the optimisation process continues, leading to the formal evaluation of the MPEG-4 functionalities in tests to be conducted during the final phase of the standardisation process. In the meantime, the optimisation of the test methodologies will continue, so that those evaluations are not affected in any way by the test procedures.
Table 7
Tukey results for European test stimuli

Codec | E | M | TOT | Tukey test results
MN45 | 4.63 | 4.43 | 4.54 | I
G.722 | 4.26 | 3.25 | 3.83 | II
MN38 | 3.44 | 4.15 | 3.74 | II
MN31 | 2.79 | 3.48 | 3.09 | III
F | 2.81 | 2.02 | 2.47 | IV
C | 3.09 | 1.45 | 2.39 | IV
MN24 | 2.18 | 2.55 | 2.34 | IV V
B | 2.05 | 2.32 | 2.16 | IV V
G | 1.79 | 2.32 | 2.01 | V VI
E | 2.15 | 1.30 | 1.79 | VI VII
D | 1.30 | 2.18 | 1.68 | VI VII
MN17 | 1.49 | 1.83 | 1.64 | VII
A | 1.48 | 1.53 | 1.50 | VII
MN10 | 1.05 | 1.12 | 1.08 | VIII
Table 8
Tukey results for Japanese test stimuli

Codec | J | M | TOT | Tukey test results
MN45 | 4.51 | 4.52 | 4.55 | I
MN38 | 4.10 | 4.08 | 4.10 | II
G.722 | 3.87 | 3.45 | 3.76 | III
MN31 | 3.09 | 3.41 | 3.18 | IV
B | 2.67 | 3.14 | 2.80 | V
F | 2.95 | 2.28 | 2.76 | V
G | 2.45 | 2.87 | 2.56 | VI
C | 2.80 | 1.66 | 2.49 | VI
MN24 | 2.25 | 2.53 | 2.33 | VII
D | 1.85 | 2.96 | 2.15 | VIII
E | 2.08 | 1.51 | 1.92 | IX
A | 1.73 | 1.78 | 1.74 | X
MN17 | 1.55 | 1.60 | 1.57 | XI
MN10 | 1.09 | 1.09 | 1.09 | XII
Acknowledgements

This paper records the process and results of tests involving contributions made by a large number of organisations and individuals to an international collaborative set of tests. The authors are pleased to acknowledge the contributions from, and the support of, the following organisations:

Source material preparation: BBC, AT&T, Philips, Motorola, Sony IPC
Codec development: Alcatel/Philips/RAI, Bosch/CSELT/MATRA/MAVT, Motorola, NEC, NTT, Samsung, Sony, University of Erlangen/FhG, AT&T, INESC, JVC, Matsushita, NTT DoCoMo, Philips, University of Hannover/Deutsche Telekom, Sony IPC
Bit stream decoding: University of Hannover
Test tape preparation: AT&T, NTT, Sony IPC
Subjective tests: Sony, NTT, Mitsubishi, CCETT, FhG, Motorola, Sony IPC
Results analysis: Tektronix, BBC, CSELT
References

[1] M. Bosi, MPEG-2 Audio NBC (13818-7) Reference Model 3 (RM3), International Organisation for Standardisation, Coding of Moving Pictures and Associated Audio, ISO/IEC JTC1/SC29/WG11/N1132.
[2] K. Brandenburg, Report of the Ad Hoc Group on the Evaluation of Tools for Non-tested Functionalities of Audio Submissions to MPEG-4, International Organisation for Standardisation, Coding of Moving Pictures and Associated Audio, ISO/IEC JTC1/SC29/WG11/N1063, November 1995.
[3] S. Diamond, MPEG-4 Audio 64 kbit/s subjective tests: overall results, International Organisation for Standardisation, Coding of Moving Pictures and Associated Audio, ISO/IEC JTC1/SC29/WG11/N1136, January 1996.
[4] B. Edler and L. Contin, MPEG-4 Audio Test Results (MOS Tests), International Organisation for Standardisation, Coding of Moving Pictures and Associated Audio, ISO/IEC JTC1/SC29/WG11/N1144, January 1996.
[5] F. Feige and D. Kirby, Report on formal subjective listening tests of MPEG-2 multichannel audio, International Organisation for Standardisation, Coding of Moving Pictures and Associated Audio, ISO/IEC JTC1/SC29/WG11/N685, March 1994.
[6] ISO, Call for proposals, International Organisation for Standardisation, Coding of Moving Pictures and Associated Audio, ISO/IEC JTC1/SC29/WG11/N0991, July 1995.
[7] ISO, Information technology - Generic coding of moving pictures and associated audio information - Part 3: Audio, International Organisation for Standardisation, International Standard ISO/IEC 13818-3, May 1995.
[8] ITU-R, Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems, ITU-R Recommendation BS 1116, November 1993.
[9] ITU-T, Methods for subjective determination of transmission quality, ITU-T Recommendation P.80, 1993.
[10] ITU-T, Modulated Noise Reference Unit (MNRU), ITU-T Recommendation P.81, 1993.
[11] J. Neter et al., Applied Linear Statistical Models, 3rd Edition, Irwin, 1990.
[12] F. Pereira, MPEG-4 testing and evaluation procedures document, International Organisation for Standardisation, Coding of Moving Pictures and Associated Audio, ISO/IEC JTC1/SC29/WG11/N999, July 1995.
Laura Contin obtained her degree in Mathematics in 1985 from the University of Turin. A few months later she joined CSELT, where she worked for several years on the development of new video compression algorithms. Since 1990 she has been involved in the definition and validation of subjective test methods for multimedia services. Her work has mainly focused on subjective assessment methods for very low bit-rate coded sequences and on the joint evaluation of audio and video quality. She coordinated the development of three ITU-T recommendations related to subjective quality evaluation in multimedia services. She has been actively engaged in the MPEG-4 test activities and currently chairs the MPEG Test Subgroup.

Bernd Edler received the Diploma (M.S. degree) in Electrical Engineering from the University of Erlangen, Germany, in 1985. From 1986 to 1993 he was a research assistant at the Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung of the University of Hannover, Germany. His main research areas were filter banks and transforms for source coding, and he contributed to the development of the filter bank used in the MPEG-1 Layer 3 audio coder. He received the Ph.D. degree in July 1994. Since 1993 he has been a staff member of the Systems Technology department at the Laboratorium für Informationstechnologie, a research institution of the University of Hannover. His current work focuses on very low bit-rate audio coding based on parametric signal representations for MPEG-4.

David Meares graduated from Salford University, England, in 1968 with an honours degree in Electrical Engineering and joined the BBC's Research Department. Over the years, David has worked on subjects such as the original research into video analogue-to-digital conversion, digital video processing, studio acoustics, acoustic modelling and digital audio developments. More recently, he has worked on the standardisation of surround sound systems, particularly the evaluation of psycho-acoustic codecs. David is currently R&D Manager (Audio and Acoustics) at the BBC Research and Development Department. He is a Chartered Engineer, a Fellow of the Institute of Acoustics and a member of the Institution of Electrical Engineers. He has presented many papers on sound and acoustic topics and has represented the BBC on both ITU-R and EBU study groups. David is heavily involved in the audio work of ISO MPEG, being the Audio Subgroup secretary and UK Head of Delegation.

Peter G. Schreiner III received a B.S.E.E., an M.S.E. in Electrical Science, and a Ph.D. in Bioengineering from the University of Michigan. He conducted clinical research in speech processing for the hearing impaired before joining Scientific-Atlanta in Atlanta, Georgia, USA in 1978. Dr. Schreiner has provided technical support for R&D work on satellite communications systems for SCPC, message, video and digital audio transmission. In 1980, Dr. Schreiner developed the audio source coding for the US commercial radio satellite network digital audio distribution system, and he is presently involved with the evaluation of reduced bit-rate digital audio technology for use with digital video compression systems for satellite distribution. He is the current chairman of the ISO/MPEG Audio Subgroup.