Applied Acoustics 134 (2018) 125–130
Quality aspects of music used as a background noise in speech communication over mobile network
Peter Počta a,⁎, Scott Isabelle b
a Dept. of Multimedia and Information-Communication Technology, FEE, University of Žilina, SK-01026 Žilina, Slovakia
b Knowles Electronics, LLC, USA
ARTICLE INFO
Keywords: Overall quality; Background noise; Speech coding; Music; Noise intrusiveness; Speech distortion; Mobile speech communication
ABSTRACT
This paper compares the effects of send-side music and environmental noise as background noise in telephone communication. The study focuses on the quality experienced by the end user in the context of NB, WB and SWB mobile speech communication, and follows the subjective test procedure defined in ITU-T Rec. P.835. The results show that music as background noise in a telephone conversation deteriorates the overall quality experienced by the end user. Moreover, the impact of music background noise on quality is similar to that of environmental noise. Furthermore, the music background noise appears to be slightly less intrusive than the environmental noise, especially at the lower SNR.
1. Introduction

Speech communication over telecommunication networks currently mostly involves mobile terminals, so the speech capture process is exposed to unpredictable background noise in the speaker’s environment. The background noise, which is transmitted together with the speech signal over the network to the other end or ends of the communication chain, is perceived as a quality degradation by remote interlocutors and reduces the overall quality perceived by the end user. This issue has recently received renewed interest with the introduction, by many mobile operators and speech service providers around the world, of wideband (WB) telephony, which offers a wider frequency band (50 to 7000 Hz) than traditional telephony. Moreover, the foreseen advent of super-wideband (SWB) telephony, using a frequency band ranging from 20 to 14,000 Hz, further increases the importance of this problem for mobile operators and speech service providers.

To assess the quality perceived by the end user, a subjective test methodology is defined in ITU-T Rec. P.835 [1]. Test subjects are asked to focus on and rate speech distortion (S-MOS), background noise intrusiveness (N-MOS) and overall quality (G-MOS) separately on a five-point scale. The average score per test condition across listeners is called the Mean Opinion Score (MOS). At least 32 naïve listeners are required for this subjective test, and the listeners should be native speakers of the language used for the test. In the objective domain, a model predicting speech quality in the presence of background noise for wideband telephony is described in [2,3]. A modified version of this model and its super-wideband extension are published in [4] and [5], respectively. Moreover, it is worth noting that an objective model predicting the quality of noise-reduced speech is currently being developed by Question 9 of ITU-T Study Group 12 under the work item P.ONRA (series P Recommendations Objective Noise Reduction Assessment), which is planned to build upon the foundation of the ETSI TS 103 281 model [5].

Some work has been carried out to study the impact of the informational content of background noise on the speech quality experienced by the end user, and the dependence of the dimensions of the subjective test procedure defined in ITU-T Rec. P.835 on different factors. In [6], Leman et al. analyzed the effect of the meaning of background noise on the quality experienced by the end user of telephone communication. They ran two subjective tests according to ITU-T Rec. P.800 [7] to compare the effect of stationary background noises (pink noise, cocktail-party noise and electric noise) and non-stationary environmental noises (city noise, restaurant noise and background speech) on the quality experienced by the end user in the context of narrowband (NB) telephone communication, i.e. traditional telephony. The first test focused on the interaction between the background noises and their levels. Three loudness levels determined in a preliminary experiment and the G.711 codec [8] were used. No packet loss was present in the samples used in this test. The results show that test subjects are more indulgent towards the non-stationary noises than the stationary noises when an increase of loudness is considered. For the high loudness level, a difference of 0.5 MOS was reported between the noises. The second test dealt with the interaction between the background noise characteristics and network degradations, i.e. codec and packet loss. Only the middle loudness level
⁎ Corresponding author. E-mail address: [email protected] (P. Počta).
https://doi.org/10.1016/j.apacoust.2018.01.002 Received 21 September 2017; Received in revised form 21 November 2017; Accepted 4 January 2018 0003-682X/ © 2018 Elsevier Ltd. All rights reserved.
was used in this experiment. For network degradations, the G.711 and G.729 [9] codecs with their native packet loss concealment algorithms were used, at packet loss levels of 0% and 3%. The results of this test confirm those reported for the first test. The authors therefore concluded that the test subjects were more tolerant of the non-stationary, i.e. environmental, noises than of the stationary noises, as they found them native to the environment of the talker.

In [10], Ullmann et al. investigated the dependence of the three quality dimensions of the subjective test procedure defined in ITU-T Rec. P.835, i.e. speech distortion, background noise intrusiveness and overall quality, on the following three factors: bandwidth context, presence of Lombard speech and application of noise reduction processing. Two subjective tests in a super-wideband context and one in a narrowband context were run according to ITU-T Rec. P.835 in this study. Ten different environmental noises from the ETSI EG 202 396-1 database [11] were used with one (realistic) presentation level for each noise. Different transmission conditions involving the AMR-NB [12], AMR-WB [13] and EVRC-B [14] codecs and live transmission over mobile networks with or without noise reduction were covered by the test. With regard to the bandwidth context, the results show that noise intrusiveness is scored almost identically in the NB and SWB contexts. Consequently, bandwidth limitations in a SWB context influence speech degradation and overall quality, but not noise intrusiveness scores. The presence of Lombard speech, in turn, had no effect on noise intrusiveness and only improved the speech degradation scores of conditions with low signal-to-noise ratios. Background noise conditions were perceived as being significantly more intrusive when played back at a higher level despite an unchanged signal-to-noise ratio.
Finally, while noise reduction processing always significantly reduced perceived noise intrusiveness, the accompanying degradation of foreground speech cancelled the benefits on overall quality for all conditions with lower levels of background noise.

To the best of our knowledge, no work deals specifically with the effect of music used as background noise in the context of telephone communication. We have therefore decided to focus on this issue in this paper, in particular on comparing the effects of music and environmental noise on the quality experienced by the end user in the context of NB, WB and SWB mobile speech communication. Two different noises per noise type, i.e. music and environmental, and two SNR values are employed in this study. Typical codecs deployed in current mobile networks and bit rates as per [15] are used. The subjective test procedure defined in ITU-T Rec. P.835 is followed. The aim of this study is twofold. First, we would like to know how music as background noise in a telephone conversation influences the overall quality experienced by the end user. Second, we would like to know whether the impact of music as background noise on the quality experienced by the end user is the same as that of the typical environmental noises currently used for the development of objective models; see [11] for more details.

The remainder of the paper is organized as follows. Section 2 describes the subjective test carried out within this study. In Section 3, the experimental results obtained from the subjective test are presented and discussed. Section 4 provides the final conclusions.
Table 1 Rating scales defined in ITU-T Rec. P.835 and used in the subjective test.

| Score | Speech distortion (S-MOS) | Background noise intrusiveness (N-MOS) | Overall quality (G-MOS) |
|---|---|---|---|
| 5 | Not distorted | Not noticeable | Excellent |
| 4 | Slightly distorted | Slightly noticeable | Good |
| 3 | Somewhat distorted | Noticeable but not intrusive | Fair |
| 2 | Fairly distorted | Somewhat intrusive | Poor |
| 1 | Very distorted | Very intrusive | Bad |
distortion (S-MOS), background noise intrusiveness (N-MOS) and overall quality (G-MOS), separately on a five-point scale. The corresponding rating scales for all the assessed aspects are presented in Table 1.

Four 8-s speech samples (2 male and 2 female), sampled at 48 kHz and each containing 2 sentences in the Slovak language, were used as speech material in this test. The individual sentences were used in more than one trial during the test. Moreover, 2 environmental noise recordings from the ETSI TS 103 224 background noise database [16], namely “Cafeteria” and “TrainStation”, and 2 songs in CD quality, i.e. “Pride” by U2 and “More Than Words” by Jon Schmidt & Steven Sharp Nelson, representing the music noises in this study, were used as background noise material. Both songs are moderately popular in Slovakia, so their similar and moderate popularity should not affect the judgement of the test subjects. In both cases, 8-s excerpts of the particular noises were used as background noise samples in the test. In other words, the same 8 s of noise were used for all test material generated for a given noise type.

As for the environmental noise samples, it is worth noting that, according to the Linear Discriminant Analysis published in [17], the selected noise recordings represent noises with a very diverse impact on S-MOS and N-MOS scores, covering almost the whole impact range reported in that analysis for the NB and WB handset scenarios. The selected songs, and in particular the excerpts, also represent very different signals in terms of their complexity and intrusiveness.

The speech and noise samples were processed as defined in Appendix I of ITU-T Rec. P.835. Firstly, they were filtered by the corresponding filter or combination of filters in order to simulate the response of a handset.
As the focus of this study is on mobile speech communication, the responses of mobile handsets in the context of NB, WB and SWB speech communication were simulated. The following filters were applied: LP35 and MSIN (NB scenario), P341 (WB scenario) and 14KBP (SWB scenario); see ITU-T Rec. G.191 [18] for more details. Secondly, the level of the filtered speech and noise signals was adjusted to –26 dBov as defined in ITU-T Rec. P.835. The software tools included in ITU-T Rec. G.191 were used to filter the speech and noise samples and to adjust the level to the required value. In the final step, the filtered and level-adjusted speech and noise samples were mixed together at 2 SNR values, i.e. 0 dB and 12 dB, to form 32 output files to be processed by the speech codecs involved in this study.

Three speech codecs typically deployed in current mobile networks, i.e. AMR-NB, AMR-WB and EVS [19], were used. For each codec, the typical bit rates for each bandwidth were employed, resulting in 5 codec conditions (see Table 2). The 5 codec conditions combined with the 2 SNR values formed the 10 test conditions listed in Table 2 and investigated in the subjective test. The in-built noise reduction feature of the codec, if available, was switched off.

As required by the test procedure defined in ITU-T Rec. P.835, reference conditions should be included in the test. They are intended to provide perceptual anchors for the three rating scales defined in ITU-T Rec. P.835. The use of reference conditions also allows comparison of experiments made in different laboratories or at different times in the same laboratory.
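The level adjustment and mixing steps described above can be illustrated with a minimal Python sketch. It assumes floating-point samples with full scale at ±1.0 and uses a plain RMS level in place of the active speech level measured by the G.191 tools, so it is a simplification and not the processing chain used in the study. The function names (`level_dbov`, `scale_to_dbov`, `mix_at_snr`) and the toy sine signals are hypothetical. The sketch keeps the speech at the target level and attenuates the noise; the source does not state which of the two signals was scaled.

```python
import math

def rms(x):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def level_dbov(x):
    """RMS level in dB relative to the overload point (assumed full scale = 1.0)."""
    return 20.0 * math.log10(rms(x))

def scale_to_dbov(x, target_dbov=-26.0):
    """Scale a signal so that its RMS level equals target_dbov."""
    gain = 10.0 ** ((target_dbov - level_dbov(x)) / 20.0)
    return [gain * s for s in x]

def mix_at_snr(speech, noise, snr_db, target_dbov=-26.0):
    """Normalise both signals to target_dbov, attenuate the noise by snr_db, then add."""
    speech_n = scale_to_dbov(speech, target_dbov)
    noise_n = scale_to_dbov(noise, target_dbov)
    noise_gain = 10.0 ** (-snr_db / 20.0)
    noise_s = [noise_gain * n for n in noise_n]
    mixed = [s + n for s, n in zip(speech_n, noise_s)]
    return mixed, speech_n, noise_s

# Toy signals standing in for the filtered speech and noise excerpts (not real audio).
speech = [0.5 * math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
noise = [0.1 * math.sin(2 * math.pi * 1370 * t / 8000) for t in range(8000)]

for snr in (0.0, 12.0):  # the two SNR values used in the study
    _, s, n = mix_at_snr(speech, noise, snr)
    print(f"target {snr:4.1f} dB -> measured {level_dbov(s) - level_dbov(n):.2f} dB")
```

With 4 speech samples, 4 noises and 2 SNR values, calling such a routine for every combination would give the 32 mixed files mentioned in the text.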
2. Subjective test

The subjective listening test was performed in accordance with ITU-T Rec. P.835. In all test sessions, up to 2 listeners were seated in a small, acoustically treated listening room with background noise below 20 dB SPL (A). All subjects were Slovak nationals whose first language was Slovak. Altogether, 33 listeners (16 male, 17 female, aged 20–58 years, mean 36.33 years) participated in the test. The subjects were remunerated for their efforts. The samples were played out in random order using high-quality studio equipment and presented diotically at 73 dB SPL (A) to the test subjects. The subjects listened to each stimulus three times and assessed three different aspects, i.e. speech
Table 2 Description of test conditions used in the subjective test.

| No. | SNR (dB) | Codec | Bandwidth | Bit rate (kbps) |
|---|---|---|---|---|
| 1 | 12 | EVS [19] | SWB | 24.4 as per [15] |
| 2 | 0 | EVS [19] | SWB | 24.4 as per [15] |
| 3 | 12 | AMR-WB [13] | WB | 12.65 as per [15] |
| 4 | 0 | AMR-WB [13] | WB | 12.65 as per [15] |
| 5 | 12 | EVS [19] | WB | 5.9 |
| 6 | 0 | EVS [19] | WB | 5.9 |
| 7 | 12 | AMR-NB [12] | NB | 12.2 as per [15] |
| 8 | 0 | AMR-NB [12] | NB | 12.2 as per [15] |
| 9 | 12 | EVS [19] | NB | 5.9 |
| 10 | 0 | EVS [19] | NB | 5.9 |

[Figure: S-MOS (scale 1 to 5) versus test conditions No.1 to No.10; series: Music, Environmental.]
Fig. 1. Effect of test conditions (see Table 2 for more information about the investigated test conditions) on average S-MOS. The vertical bars show 95% CI computed over 264 S-MOS values (33 subjective scores per sample for 8 samples).
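As an illustration of how a per-condition mean score and a 95% confidence interval can be obtained from 264 individual ratings, consider the short Python sketch below. The ratings are synthetic, the helper name `mos_with_ci` is hypothetical, and the normal-approximation critical value of 1.96 is an assumption: the source does not state which CI formula was used, and the sketch treats all votes as independent.

```python
import math
import random
import statistics

def mos_with_ci(scores, z=1.96):
    """Return (mean, CI half-width) for a list of ratings on the 1..5 scale.

    z = 1.96 is the normal-approximation critical value for a 95% CI,
    which is close to the Student-t value for samples of this size.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width

# Synthetic ratings: 33 listeners x 8 samples = 264 votes for one test condition.
random.seed(0)
votes = [random.choice([2, 3, 3, 4, 4, 5]) for _ in range(33 * 8)]

mos, ci = mos_with_ci(votes)
print(f"n = {len(votes)}, MOS = {mos:.2f} +/- {ci:.2f}")
```

The same routine would apply to the N-MOS and G-MOS ratings, and to the 132 values per reference condition (33 listeners × 4 samples).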
Table 3 Description of the reference conditions used in the subjective test.

| No. | Processing of signal | SNR (dB) | Type of noise |
|---|---|---|---|
| 1 | Source (filtered) | – | No noise |
| 2 | Source (filtered) | 0 | Full-size car 130 km/h [16] |
| 3 | Source (filtered) | 12 | Full-size car 130 km/h [16] |
| 4 | Source (filtered) | 24 | Full-size car 130 km/h [16] |
| 5 | Source (filtered) | 36 | Full-size car 130 km/h [16] |
| 6 | NS Level 1 | – | No noise |
| 7 | NS Level 2 | – | No noise |
| 8 | NS Level 3 | – | No noise |
| 9 | NS Level 4 | – | No noise |
| 10 | NS Level 3 | 24 | Full-size car 130 km/h [16] |
| 11 | NS Level 2 | 12 | Full-size car 130 km/h [16] |
| 12 | NS Level 1 | 0 | Full-size car 130 km/h [16] |

the codecs and different bandwidths. The aim of this part was to familiarize the subjects with the test procedure and rating scales. This was carried out by simulating a short part of the main test.

3. Experimental results

Fig. 1 presents the results of the subjective test for the test conditions, averaged over 264 values (33 scores per sample for 8 samples), for S-MOS. As can be clearly seen from this figure, the average S-MOS values reported for the music noise decrease mostly monotonically with decreasing bandwidth and bit rate or a different codec, for both values of the SNR, i.e. 0 dB and 12 dB; the exceptions are test conditions 3 and 4, which represent the AMR-WB codec. The departure of AMR-WB from the trend seems to be caused by its rather poor ability to handle speech mixed with music signals, which even caused a degradation of the speech signal in this case, especially for the “Pride” excerpt. On the other hand, a purely monotonic decrease was obtained for the environmental noise type. Moreover, when comparing both types of noise in terms of the obtained average S-MOS values, slightly higher values were obtained for the music noise than for the environmental noise, again except for the AMR-WB test conditions. The difference naturally becomes bigger for the lower SNR and is always statistically significant.

A three-way analysis of variance (ANOVA) was conducted on the S-MOS values using codec conditions, noise type and SNR as fixed factors (Appendix A, Table 4). The effect of SNR was found to be highly statistically significant, with the highest F-ratio (F = 538.76, p < .001). The effect of codec conditions was also highly statistically significant (F = 527.82, p < .001). The last factor investigated in the ANOVA was the noise type, which had a weaker effect on the S-MOS values than the previous factors on their own but was also highly statistically significant (F = 141.72, p < .001). Regarding the interactions of the involved factors, the results show that all of them were highly statistically significant. Moreover, the highest F-ratio among the interactions was reported for the interaction of the noise type and SNR (F = 43.74, p < .001),
Reference conditions defined in [20] and listed in Table 3 were used in this test. These reference conditions were successfully used for the P.835 training and validation databases developed for [4] and [5]. The NS Level processing provides controlled speech distortion similar to that introduced by noise reduction algorithms. All 4 source speech samples sampled at 48 kHz, representing a fullband (FB) scenario, were used in this case, and the 20KBP filter (see ITU-T Rec. G.191 for more details) was applied.

The 32 output files (4 source speech signals mixed with each of the 4 background noises at 2 different SNR values) were encoded by the codecs and bit rates listed in Table 2 to yield 160 test items. Moreover, the 4 source speech signals were processed by the 12 reference conditions described in Table 3 to yield an additional 48 test items, so the test comprised 208 test items. The 208 test items were divided into four test sessions, balanced in terms of size and degradation, in order to avoid subject fatigue as advised in ITU-T Rec. P.835. In the first and third sessions, the order of the ratings was speech distortion, background noise intrusiveness, and then overall quality. In the other sessions, the order of the first two ratings, i.e. speech distortion and background noise intrusiveness, was swapped.

To ensure that each participant properly understood the task and the test procedure, a familiarization session consisting of two parts preceded each test.
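The item counts quoted above follow directly from the experimental design, which the following Python sketch enumerates. The speaker labels are hypothetical placeholders; the noises, SNR values and codec conditions are taken from the text and Table 2.

```python
from itertools import product

speech_samples = ["male_1", "male_2", "female_1", "female_2"]  # hypothetical labels
noises = ["Cafeteria", "TrainStation", "Pride", "More Than Words"]
snrs_db = [0, 12]
codec_conditions = [  # (codec, bandwidth, bit rate in kbps), as listed in Table 2
    ("EVS", "SWB", 24.4),
    ("AMR-WB", "WB", 12.65),
    ("EVS", "WB", 5.9),
    ("AMR-NB", "NB", 12.2),
    ("EVS", "NB", 5.9),
]
n_reference_conditions = 12  # rows of Table 3

# Every speech sample is mixed with every noise at every SNR value.
mixed_files = list(product(speech_samples, noises, snrs_db))
# Every mixed file is encoded by every codec condition.
test_items = list(product(mixed_files, codec_conditions))
# Every speech sample is processed by every reference condition.
reference_items = len(speech_samples) * n_reference_conditions
total_items = len(test_items) + reference_items

print(len(mixed_files), len(test_items), reference_items, total_items)
# prints: 32 160 48 208
```

If the four sessions were of exactly equal size, each would hold 52 items; the source only states that the sessions were balanced.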
In the first part of the familiarization session, the test subjects listened to 16 items representing 4 source speech samples, each containing 2 sentences in the Slovak language, degraded by 4 different conditions corresponding to those used in the main test: the clean reference condition, the most distorted distortion-only condition, the lowest-SNR additive noise condition, and the condition combining the strongest distortion with the lowest-SNR additive noise. In contrast to the speech samples used in the main test, the samples used in the familiarization session contained different sentences uttered by the same speakers and different noises. The aim of the first part was to put the subjects into the context of this experiment. The second part consisted of 16 items covering the whole range of degradations included in the main test in terms of noise types, SNR values and degradations induced by
Table 4 Summary of ANOVA test conducted on the S-MOS values.

| Effect | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Codec Conditions | 1612.55 | 4 | 403.137 | 527.82 | .0000 |
| Noise Type | 108.25 | 1 | 108.245 | 141.72 | .0000 |
| SNR | 411.49 | 1 | 411.492 | 538.76 | .0000 |
| Codec Conds. * Noise Type | 71.46 | 4 | 17.865 | 23.39 | .0000 |
| Codec Conds. * SNR | 16.98 | 4 | 4.245 | 5.56 | .0002 |
| Noise Type * SNR | 33.41 | 1 | 33.409 | 43.74 | .0000 |
| Error | 4020.53 | 5264 | 0.764 | | |
| Total | 6274.67 | 5279 | | | |
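The F-ratios in Table 4 can be reproduced from the SS and df columns, since each F is the mean square of the effect divided by the mean square of the error. The Python sketch below performs this check on values transcribed from Table 4; the variable and function names are hypothetical, and small deviations from the printed values are expected because the table entries are rounded.

```python
# (SS, df) pairs transcribed from Table 4 (S-MOS).
effects = {
    "Codec Conditions": (1612.55, 4),
    "Noise Type": (108.25, 1),
    "SNR": (411.49, 1),
    "Codec Conds. * Noise Type": (71.46, 4),
    "Codec Conds. * SNR": (16.98, 4),
    "Noise Type * SNR": (33.41, 1),
}
ss_error, df_error = 4020.53, 5264
ss_total, df_total = 6274.67, 5279

ms_error = ss_error / df_error

def f_ratio(ss, df):
    """F = (SS_effect / df_effect) / MS_error."""
    return (ss / df) / ms_error

for name, (ss, df) in effects.items():
    print(f"{name:28s} MS = {ss / df:8.3f}  F = {f_ratio(ss, df):7.2f}")

# Consistency checks: the effect rows plus the Error row should add up to the Total row.
df_sum = sum(df for _, df in effects.values()) + df_error
ss_sum = sum(ss for ss, _ in effects.values()) + ss_error
print(df_sum == df_total, round(ss_sum, 2))
```

The total of 5279 degrees of freedom corresponds to 5280 ratings, i.e. 10 test conditions × 2 noise types × 264 values. The same check can be applied to the N-MOS and G-MOS tables.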
[Figure: N-MOS (scale 1 to 5) versus test conditions No.1 to No.10; series: Music, Environmental.]
Fig. 2. Effect of test conditions (see Table 2 for more information about the investigated test conditions) on average N-MOS. The vertical bars show 95% CI computed over 264 N-MOS values (33 subjective scores per sample for 8 samples).

[Figure: G-MOS (scale 1 to 5) versus test conditions No.1 to No.10; series: Music, Environmental.]
Fig. 3. Effect of test conditions (see Table 2 for more information about the investigated test conditions) on average G-MOS. The vertical bars show 95% CI computed over 264 G-MOS values (33 subjective scores per sample for 8 samples).
followed by the interaction of the codec conditions and noise type (F = 23.39, p < .001) and that of the codec conditions and SNR (F = 5.56, p < .001). To summarize, the results of the ANOVA revealed that subjects were more sensitive to the SNR and codec conditions than to the noise type, and that the interactions between all the investigated factors were highly statistically significant.

The average N-MOS values obtained from the subjective test are presented in Fig. 2. It is interesting to see that the values are mostly consistent across the bandwidths and types of noise for both values of the SNR. Moreover, the bit rate, though it has an impact, has a mostly negligible effect on the average N-MOS values. This implies that neither increasing the noise bandwidth nor reducing the bit rate increases the intrusiveness; the SNR appears to be the dominant factor here.

A three-way analysis of variance (ANOVA) was conducted on the N-MOS values using codec conditions, noise type and SNR as fixed factors (Appendix A, Table 5). The highest F-ratio was determined for the SNR factor (F = 4033.93, p < .001), whose effect was highly statistically significant. The noise type factor had a much weaker, but again highly statistically significant, effect on the N-MOS values (F = 19.53, p < .001). The last factor investigated, the codec conditions, had a weaker effect on the N-MOS values than the SNR and noise type factors on their own and was also highly statistically significant (F = 17.21, p < .001). Regarding the interactions of the involved factors, the interactions of the codec conditions with the noise type and with the SNR are highly statistically significant, whereas the interaction of the noise type and SNR is considerably weaker (F = 3.07, p = .0797 in Table 5) and does not reach significance at the conventional .05 level.
To summarize, the results of the ANOVA revealed that subjects were much more sensitive to the SNR than to the noise type and codec conditions, and that the interactions involving the codec conditions were statistically significant, while the interaction of the noise type and SNR was the weakest (p = .0797, Table 5).

Fig. 3 reports the average G-MOS values obtained from the subjective test. For the music noise, the trend is the same as that reported for the S-MOS values above. By contrast, the purely monotonic decrease reported above for the S-MOS values and the
environmental noise does not hold for the G-MOS values at the higher SNR. As can be clearly seen from the figure, the AMR-WB codec test condition is again responsible for this change. Moreover, slightly higher values are again reported for the music noise than for the environmental noise, except for the AMR-WB test conditions. As in the S-MOS case, the difference becomes bigger for the lower SNR and is mostly statistically significant. It is interesting to note that test condition 1, i.e. the EVS-SWB codec operating at 24.4 kbps at the higher SNR (the most transparent test condition), is the only condition showing a very clear and statistically significant preference for the music noise over the environmental noise at the higher SNR. Unfortunately, the test setup does not allow us to reveal the reason behind this interesting behavior, i.e. whether the preference is driven more by the bandwidth or by the bit rate.

A three-way analysis of variance (ANOVA) was conducted on the G-MOS values using codec conditions, noise type and SNR as fixed factors (Appendix A, Table 6). The effect of SNR was again found to be highly statistically significant, with the highest F-ratio (F = 1695.07, p < .001). The effect of codec conditions was also highly statistically significant (F = 381.68, p < .001). The last factor investigated, the noise type, showed a weaker effect on the G-MOS values than the previous factors on their own (F = 199.1, p < .001) but was also highly statistically significant. Regarding the interactions of the involved factors, the results show that almost all of them were highly statistically significant.
Moreover, the highest F-ratio among the interactions was reported for the interaction of the noise type and SNR (F = 64.66, p < .001), followed by the interaction of the codec conditions and noise type (F = 31.33, p < .001) and that of the codec conditions and SNR (F = 2.93, p < .02). To summarize, the results of the ANOVA revealed that subjects were much more sensitive to the SNR than to the codec conditions and noise type, and that most of the interactions between the investigated factors were highly statistically significant.

Fig. 4 presents the results of the subjective test for the reference conditions, averaged over 132 values (33 scores per sample for 4 samples), for S-MOS. In principle, the reference conditions can be split into 3
Table 5 Summary of ANOVA test conducted on the N-MOS values.

| Effect | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Codec Conditions | 35.92 | 4 | 8.98 | 17.21 | .0000 |
| Noise Type | 10.19 | 1 | 10.19 | 19.53 | .0000 |
| SNR | 2105.22 | 1 | 2105.22 | 4033.93 | .0000 |
| Codec Conds. * Noise Type | 19.23 | 4 | 4.81 | 9.21 | .0000 |
| Codec Conds. * SNR | 18.33 | 4 | 4.58 | 8.78 | .0000 |
| Noise Type * SNR | 1.6 | 1 | 1.6 | 3.07 | .0797 |
| Error | 2747.17 | 5264 | 0.52 | | |
| Total | 4937.67 | 5279 | | | |

Table 6 Summary of ANOVA test conducted on the G-MOS values.

| Effect | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Codec Conditions | 773.93 | 4 | 193.482 | 381.68 | .0000 |
| Noise Type | 100.93 | 1 | 100.928 | 199.1 | .0000 |
| SNR | 859.26 | 1 | 859.261 | 1695.07 | .0000 |
| Codec Conds. * Noise Type | 63.53 | 4 | 15.884 | 31.33 | .0000 |
| Codec Conds. * SNR | 5.95 | 4 | 1.488 | 2.93 | .0195 |
| Noise Type * SNR | 32.78 | 1 | 32.776 | 64.66 | .0000 |
| Error | 2668.42 | 5264 | 0.507 | | |
| Total | 4504.8 | 5279 | | | |
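One way to put the relative sensitivity to each factor on a common scale is to express every effect's sum of squares as a share of the total sum of squares (eta squared). This quantity is not reported in the source; the Python sketch below merely derives it from the SS columns of Tables 5 and 6 as an illustration, and the names used are hypothetical.

```python
# SS values transcribed from Table 5 (N-MOS) and Table 6 (G-MOS).
tables = {
    "N-MOS": {
        "Codec Conditions": 35.92, "Noise Type": 10.19, "SNR": 2105.22,
        "Codec Conds. * Noise Type": 19.23, "Codec Conds. * SNR": 18.33,
        "Noise Type * SNR": 1.6, "Error": 2747.17, "Total": 4937.67,
    },
    "G-MOS": {
        "Codec Conditions": 773.93, "Noise Type": 100.93, "SNR": 859.26,
        "Codec Conds. * Noise Type": 63.53, "Codec Conds. * SNR": 5.95,
        "Noise Type * SNR": 32.78, "Error": 2668.42, "Total": 4504.8,
    },
}

def share_of_total(ss_table):
    """Eta squared per effect: SS_effect / SS_total (Error and Total rows excluded)."""
    total = ss_table["Total"]
    return {k: v / total for k, v in ss_table.items() if k not in ("Error", "Total")}

for name, ss_table in tables.items():
    shares = share_of_total(ss_table)
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    print(name + ": " + ", ".join(f"{k} {100 * v:.1f}%" for k, v in ranked))
```

In both tables the SNR accounts for the largest share of the total sum of squares, which is in line with the summary statements in the text.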
[Figure: S-MOS (scale 1 to 5) versus reference conditions No.1 to No.12.]
Fig. 4. Effect of reference conditions (see Table 3 for more information about the investigated reference conditions) on average S-MOS. The vertical bars show 95% CI computed over 132 S-MOS values (33 subjective scores per sample for 4 samples).

[Figure: G-MOS (scale 1 to 5) versus reference conditions No.1 to No.12.]
Fig. 6. Effect of reference conditions (see Table 3 for more information about the investigated reference conditions) on average G-MOS. The vertical bars show 95% CI computed over 132 G-MOS values (33 subjective scores per sample for 4 samples).
groups, namely SNR (No. 2–5), NS Level without noise (No. 6–9) and NS Level with noise (No. 10–12). Reference condition 1 has no added noise or speech distortion and therefore achieves the maximum S-MOS value. For the first group, there is essentially no dependence of S-MOS on SNR, with the possible exception of condition 2, which has the lowest SNR. Conditions 3 and 4 received the same score, and there is no statistically significant difference between the scores reported for this group, apart from condition 2. Furthermore, the S-MOS values for the conditions in the second group increase monotonically, as expected, with increasing processing level, which corresponds to decreasing speech distortion. On the other hand, the S-MOS values reported for the third group decrease monotonically with decreasing processing level and SNR, also as expected.

Fig. 5 reports the average N-MOS values obtained for the reference conditions in the subjective test. The N-MOS values reported for the first group increase monotonically with increasing SNR. The ceiling is again naturally reached for reference condition 1. The values reported for the conditions in the second group oscillate around a single value, which effectively represents the ceiling. There is thus no impact of the NS Level processing on the N-MOS values, as expected, as NS Level imparts speech distortion. Regarding the third group, the values again decrease monotonically with decreasing processing level and SNR.

The average G-MOS values obtained for the reference conditions in the test are presented in Fig. 6. The G-MOS values increase monotonically with increasing SNR for the conditions in the first group. As in the S-MOS case, the G-MOS values reported for the conditions in the second group increase monotonically with increasing processing level.
On the other hand, the values obtained for the conditions belonging to the last group monotonically decrease with decreasing processing level and SNR values. It is worth noting here that
[Figure: N-MOS versus S-MOS, both on a scale of 1 to 5; series: Reference, Test − Environmental, Test − Music; labelled regions: NB, WB and SWB at 12 dB and at 0 dB.]
Fig. 7. Correlation between the S-MOS and N-MOS obtained for the reference (informative) and test conditions.
the results reported for the reference conditions and all three investigated quality aspects, i.e. S-MOS, N-MOS and G-MOS, fully follow the expected trends.

Fig. 7 compares the S-MOS and N-MOS values obtained from the subjective test for the reference (informative) and test conditions. As indicated in this figure, 6 different regions, each representing a combination of bandwidth and SNR, can be clearly identified for the test conditions. The figure thus shows the main effects of bandwidth and SNR on the S-MOS and N-MOS values. The overlapping areas of the regions are both clearly dominated by the music noise test conditions. Moreover, the 12 dB overlapping area does not contain the EVS-NB codec and is slightly dominated by the AMR-NB codec. This is due to the bit rates associated with these two codecs in this study. In the 0 dB overlapping area, all codecs involved in this study are present, with a very slight dominance of the AMR-NB and AMR-WB codecs. The reason for the dominance of AMR-NB is the same as in the previous case, whereas the dominance of AMR-WB seems to be connected to its poor ability to handle speech mixed with music signals.
4. Conclusions
In this paper, we compared the effects of music and environmental noise used as background noise in telephone communication on the quality experienced by the end user in the context of NB, WB and SWB mobile speech communication. It is worth noting that modern communication endpoints, such as mobile phones, are expected to embed advanced signal processing techniques, such as speech enhancement or noise suppression, to alleviate the impact of background noise on the quality experienced by the user. Such techniques were not considered in this study.

The first conclusion is that, as expected, music used as a background
[Figure: N-MOS (scale 1 to 5) versus reference conditions No.1 to No.12.]
Fig. 5. Effect of reference conditions (see Table 3 for more information about the investigated reference conditions) on average N-MOS. The vertical bars show 95% CI computed over 132 N-MOS values (33 subjective scores per sample for 4 samples).
129
Applied Acoustics 134 (2018) 125–130
P. Počta, S. Isabelle
noise in telephone conversation deteriorates the overall quality experienced by the end user. Comparing the impact of the music used as a background noise on the quality experienced by the end user with that of typical environmental noises used in this study, we can conclude that the impact is more or less the same, besides the AMR-WB codec. Moreover, the slightly higher values of the overall quality (G-MOS) are reported for the music noise than for the environmental noise, besides again the AMR-WB test conditions. Statistically significant difference between the G-MOS values obtained for the music and environmental noise is observed for most of the test conditions involving the lower SNR. The only test condition revealing the very clear and statistically significant preference of the music noise over the environmental noise for the higher SNR is the test condition 1 representing the EVS-SWB codec operating at 24.4 kbps and naturally the higher SNR. With regard to noise intrusiveness, the music noise seems to be slightly less intrusive than the environmental noise, especially for the lower SNR. Statistically significant differences were achieved for the EVS-SWB and AMR-NB codecs, for both values of the SNR used in this study.
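The 95% confidence intervals reported with the MOS values (132 ratings per condition, i.e. 33 subjective scores for each of 4 samples) and the significance statements above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the study: the ratings are synthetic, the interval uses a normal approximation (z = 1.96), and the two noise types are compared with Welch's t statistic as if the ratings were independent samples, which is a simplification because the same listeners rated both. The helper names `mos_with_ci`, `welch_t` and `synthetic_ratings` are hypothetical.

```python
import math
import random
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (mean, half-width of the 95% CI) using a normal approximation."""
    n = len(scores)
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(var_a + var_b)


def synthetic_ratings(weights, n=132, seed=0):
    """Draw n hypothetical ratings on the 1..5 scale (33 listeners x 4 samples)."""
    rng = random.Random(seed)
    return rng.choices([1, 2, 3, 4, 5], weights=weights, k=n)


# Hypothetical data only; these are NOT the ratings collected in the study.
music = synthetic_ratings([5, 15, 35, 30, 15], seed=1)
environmental = synthetic_ratings([15, 30, 30, 20, 5], seed=2)

for name, scores in (("music", music), ("environmental", environmental)):
    mean, hw = mos_with_ci(scores)
    print(f"{name}: MOS = {mean:.2f} +/- {hw:.2f} (n = {len(scores)})")

print(f"Welch t = {welch_t(music, environmental):.2f}")
```

With 132 ratings per condition the normal approximation is close to a t-based interval; a paired or repeated-measures analysis, such as the ANOVA reported in Appendix I, would account for the fact that each listener rated every condition.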
Appendix I
Tables 4–6 show the results of the ANOVA test carried out on the subjective data (dependent variable: S-MOS, N-MOS and G-MOS, respectively) described in more detail in Section 3.
References
[1] ITU. Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm. International Telecommunication Union, Geneva, Switzerland, ITU-T Rec. P.835; 2003.
[2] ETSI. Speech processing, Transmission and Quality Aspects (STQ); Speech quality performance in the presence of background noise; Part 3: Background noise transmission - Objective test methods. European Telecommunications Standards Institute, Sophia-Antipolis, France, ETSI EG 202 396-3; 2008.
[3] Reimes J, Gierlich HW, Kettler F, Poschen S, Lepage M. The relative approach algorithm and its applications in new perceptual models for noisy speech and echo performance. Acta Acustica united with Acustica, vol. 97, No. 2; March/April 2011. p. 325–41. [ISSN 1610-1928].
[4] ETSI. Speech and multimedia Transmission Quality (STQ); Speech quality performance in the presence of background noise: Background noise transmission for mobile terminals-objective test methods. European Telecommunications Standards Institute, Sophia-Antipolis, France, ETSI TS 103 106; 2013.
[5] ETSI. Speech and multimedia Transmission Quality (STQ); Speech quality performance in the presence of background noise: Objective test methods for superwideband and fullband terminals. European Telecommunications Standards Institute, Sophia-Antipolis, France, ETSI TS 103 281; 2017.
[6] Leman A, Faure J, Parizet E. Influence of informational content of background noise on speech quality evaluation for VoIP application. J Acoust Soc Am May 2009;123(5).
[7] ITU. Methods for subjective determination of transmission quality. International Telecommunication Union, Geneva, Switzerland, ITU-T Rec. P.800; 1996.
[8] ITU. Pulse code modulation (PCM) of voice frequencies. International Telecommunication Union, Geneva, Switzerland, ITU-T Rec. G.711; 1988.
[9] ITU. Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP). International Telecommunication Union, Geneva, Switzerland, ITU-T Rec. G.729; 2007.
[10] Ullmann R, Bourlard H, Berger J, Llagostera Casanovas A. Noise intrusiveness factors in speech telecommunications. In: Proceedings of AIA-DAGA joint conference on acoustics, Merano, Italy; 2013. p. 436–9.
[11] ETSI. Speech processing, Transmission and Quality Aspects (STQ); Speech quality performance in the presence of background noise; Part 1: Background noise simulation technique and background noise database. European Telecommunications Standards Institute, Sophia-Antipolis, France, ETSI EG 202 396-1; 2008.
[12] 3GPP. Mandatory speech CODEC speech processing functions; AMR speech Codec; General description. Third Generation Partnership Project, 3GPP TS 26.071; 2012.
[13] 3GPP. Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; General description. Third Generation Partnership Project, 3GPP TS 26.171; 2012.
[14] 3GPP2. Enhanced variable rate codec, speech service options 3, 68, 70, 73 and 77 for wideband spread spectrum digital systems. Third Generation Partnership Project 2, 3GPP2 C.S0014-E; 2011.
[15] ETSI. Universal Mobile Telecommunications System (UMTS); LTE; Speech and Video telephony terminal acoustic test specification. European Telecommunications Standards Institute, Sophia-Antipolis, France, ETSI TS 126 132; 2016.
[16] ETSI. Speech and multimedia Transmission Quality (STQ); A sound field reproduction method for terminal testing including a background noise database. European Telecommunications Standards Institute, Sophia-Antipolis, France, ETSI TS 103 224; 2014.
[17] ETSI. On speech quality in noise for NB and WB wireless telephones. European Telecommunications Standards Institute, Sophia-Antipolis, France, ETSI STQ(16) 53_025; 2016.
[18] ITU. ITU-T Software Tool Library 2009 User's manual. International Telecommunication Union, Geneva, Switzerland, ITU-T Rec. G.191; 2009.
[19] ETSI. Universal Mobile Telecommunications System (UMTS); LTE; EVS Codec Detailed Algorithmic Description (3GPP TS 26.445 version 12.0.0 Release 12). European Telecommunications Standards Institute, Sophia-Antipolis, France, ETSI TS 126 445; 2014.
[20] 3GPP. Revision of DESUDAPS-1: Common subjective testing framework for training and validation of SWB and FB P.835 test predictors. Third Generation Partnership Project, 3GPP S4(16)0548; 2016.