Intelligibility of speech in noise under diotic and dichotic binaural listening

Intelligibility of speech in noise under diotic and dichotic binaural listening

Applied Acoustics 125 (2017) 173–175 Contents lists available at ScienceDirect Applied Acoustics journal homepage: www.elsevier.com/locate/apacoust ...

427KB Sizes 3 Downloads 130 Views

Applied Acoustics 125 (2017) 173–175

Contents lists available at ScienceDirect

Applied Acoustics journal homepage: www.elsevier.com/locate/apacoust

Intelligibility of speech in noise under diotic and dichotic binaural listening Noam R. Shabtai a,⇑, Itai Nehoran b, Matan Ben-Asher b, Boaz Rafaely a a b

Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer Sheva 8410501, Israel Waves Audio Ltd., Azrieli Center 3, The Triangle Tower, 32nd Floor, Tel-Aviv 6701101, Israel

a r t i c l e

i n f o

Article history: Received 26 January 2017 Received in revised form 7 April 2017 Accepted 2 May 2017

Keywords: Speech intelligibility Binaural sound reproduction Hearing in noise test

a b s t r a c t Binaural sound reproduction (BSR) can improve speech intelligibility due to spatial release from masking (SRM), in the case where the signals are reproduced in such a manner that they are perceived as arriving to the listener from different directions. However, the effect of BSR on speech intelligibility has not yet been thoroughly studied for different noise conditions that involve diotic or binaural playback via headphones. In this work, a hearing in noise test (HINT) is performed, in which both the target and the noise are presented over headphones, in order to compare the speech intelligibility in the following conditions: (1) target and noise signals are displayed as diotic signals, (2) target is binaurally reproduced and noise is diotic, (3) target is diotic and noise is binaurally reproduced, and (4) target and noise signals are binaurally reproduced at the same direction. The results show that the speech intelligibility is significantly better in condition 2 relative to the other conditions. Ó 2017 Published by Elsevier Ltd.

1. Introduction The intelligibility of a speech signal is degraded in the presence of noise that is generated by other acoustic sources in natural environments. The reason for this degradation is the masking characteristic of acoustic signals. If, however, the target speech signal can be spatially separated from the noise signals, the cocktail party effect is triggered in the human auditory system, resulting in spatial release from masking (SRM), which may be evident by a decrease in the speech reception threshold (SRT) in speech intelligibility listening tests [1]. The SRM phenomenon has already been studied using loudspeakers that generate signals from different locations. Listening tests have shown that spatially separating the noise from the target signal leads to a maximum SRM level of about 10 dB [2]. The level of SRM is dependent on the angular separation between the target and the noise and, based on further results [3], a model has been suggested to predict the SRM [4]. However, recent tests show that a separation of 15° results in an 8 dB SRM, and a wider separation may result in a 12 dB SRM [5]. Based on these recent results, the original SRM model has been revised [6]. Other studies used binaural sound reproduction (BSR) via headphones by directly filtering the signals with the head related ⇑ Corresponding author. E-mail address: [email protected] (N.R. Shabtai). http://dx.doi.org/10.1016/j.apacoust.2017.05.002 0003-682X/Ó 2017 Published by Elsevier Ltd.

transfer functions (HRTFs), reaching lower SRM levels of 5 dB [7] or 6 dB [8]. The BSR effect on SRM has also been studied in a more general manner in the spherical harmonics domain [9,10]. The spherical harmonics order up to which the plane-wave amplitude density function and the HRTFs were decomposed was investigated for its effect on the SRM [11]. The improvement of speech intelligibility due to SRM has motivated the advanced design of microphone array systems with an output via headphones, where the HRTFs are employed in the beamforming weights [12–14]. In natural environments, however, the noise signals can arrive from many directions, including directions that are near the target signal. Furthermore, a recent study has shown that BSR does not result in SRM under ambient noise [15]. This particular result shows that under certain noise conditions BSR does not improve the intelligibility over a regular system that displays a diotic mono signal over headphones. In light of these recent results, this work aims to investigate the SRM of a diotic signal in the case where the other signal is dichotic and generated using BSR. Therefore, a hearing in noise test (HINT) that includes four conditions of target-noise signals in dioticdichotic sound reproduction modes are studied in this work. The results show that BSR of the target signal can result in significant SRM when the noise is displayed as a diotic signal, but to a much lesser extent the other way around, where the noise is dichotic and binaurally reproduced and the target is diotic.

174

N.R. Shabtai et al. / Applied Acoustics 125 (2017) 173–175

2. Experimental setup A HINT has been performed in the control room of the Acoustics Laboratory at Ben-Gurion University of the Negev. Both the target and the noise signals were delivered to the listener via headphones of type AKG K701. The Nx – Virtual Mix Room over Headphones of Waves Audio Ltd., version No. 9.6, was used to generate all the dichotic signals, where the sources were configured to be at the frontal direction when the listener looks directly at the screen. This direction is represented using an azimuth angle of h ¼ 0 and an elevation of / ¼ 0 in an interaural-polar coordinate system, similar to the one used in the CIPIC database [16]. Room ambience is used in order to introduce early room reflections that improve externalization [11]. Head tracking was obtained using a face detection algorithm and a web camera recording at 30 frames per second, where the latency was too small to be perceived by the subjects. The sound was rendered according to the position of the listener head. The energy level at the left and right channels, calculated using the RMS of the Nx response to a white noise signal, is shown in Fig. 1a as a function of the azimuth angle h between 90 and 90 . At a zero elevation, these azimuth angles correspond to a 180 rotation of the head in the transverse plane. A few representative cases of elevation angles, / ¼ 30 ; 0 , and 30 , are presented. It can be seen that the energy levels are symmetrical for azimuthal variations around the frontal direction. The frontal direction frequency response is presented in Fig. 1b, showing almost identical left and right channel responses, where differences may be explained by the simulated room effect. Four different spatial target-noise conditions were investigated, as also summarized in Table 1:

Table 1 HINT target-noise diotic-dichotic condition. Condition 1 2 3 4

Target

Noise

Diotic Dichotic Diotic Dichotic

Diotic Diotic Dichotic Dichotic

A preliminary calibration process was performed in order to ensure that the sound levels of the target and the noise signals are the same at the listeners’ ears, mainly in the cases of conditions 2 and 3. The target and noise signals at the output of the Nx system in the frontal direction were measured using software, and the signal to noise ratio (SNR) was calculated for each condition. At a 0 dBA input SNR, around ±1.5 dBA output SNR levels were measured, which were then used to calibrate the SNR under the four conditions of the experiment. The target signal contained triplets of digits from 0 to 9, where every triplet was repeated three times, and the noise was a babble speech signal. The initial SNR level was 24 dBA and, in an adaptive manner, increased by 2 dB for each incorrect response of the listener, and decreased by 2 dB for each correct response. The amplitude of the target signal was adjusted to compensate for the SNR gain measured in the calibration process. Six experienced normal hearing subjects, aged between 20 and 30, participated in the test, where each was presented with 50 triplets of digits. At each condition, any digit triplet was repeated only once. In a preceding training session, the subjects were allowed to tune the volume level according to their convenience. 3. Experimental results

1. Both target and noise signals are displayed as diotic signals, 2. the target signal is binaurally reproduced as a source at the frontal direction and displayed as a dichotic signal, while the noise signal is displayed as a diotic signal, 3. opposite to condition 2, the target signal is diotic and the noise signal is dichotic, and 4. both the target and the noise signals are dichotic and binaurally reproduced as sources at the frontal direction. Specifically, condition 2 may represent a real-life scenario of noise from a far-end speaker in a multi-speaker conference scene, where the channel displays a diotic mono signal, but the signal from the desired speaker arrives at a separate stream from the noise, and is made binaural. Furthermore, condition 2 may be relevant to conference applications that apply noise reduction, where the background noise is added back to the original signal as a mono signal in order to listen to other speakers in the room.

For each condition, SRT was calculated by averaging over the SNR levels for the last 40 trials, further averaged over all subjects. Fig. 2 shows the SRT result for each condition along with the 95% confidence interval displayed on vertical bars. First, it may be seen that the lowest SRT value was obtained in condition 2, where the target is dichotic and the noise is diotic. Second, it may be seen that in condition 2, the obtained SRM is only 1 dB higher than in conditions 1 and 4, in which there is no spatial difference between the target and the noise signals. Hence, it is shown that the SRM obtained by a dichotic target signal using BSR under a diotic noise signal is significantly higher than the SRM obtained using the opposite condition, in which the target signal is diotic and the noise signal is dichotic. A hypothetical reason for the large difference in the SRM between the two opposite conditions might be that dichotic target signals sound more natural to the subjects and, therefore, are more

Fig. 1. (a) RMS of Nx response to white noise at azimuth h and a few elevation angles /. (b) Frequency response at the frontal direction ðh; /Þ ¼ ð0 ; 0 Þ.

N.R. Shabtai et al. / Applied Acoustics 125 (2017) 173–175

175

signal improved the intelligibility by approximately 4.5 dB in the presence of a babble speech noise that is displayed as a diotic signal. An improvement was observed for the opposite condition, in which the target signal was diotic and the noise signal was dichotic, but to a lesser extent. Acknowledgement This work was supported by the Israel Science Foundation (ISF) under Grant 146/13. References

Fig. 2. SRT calculated by averaging over the SNR levels for the last 40 trials and over all subjects. 95% confidence interval is shown on vertical bars.

intelligible when displayed as dichotic signals than as diotic signals. Particularly, it is known that the SRM is worse when the masker is noise, compared to another speaker. This fact was validated, both in experiments that included loudspeakers [17] and in experiments that included headphones [18]. The main reason is that noise is a continuous signal, which is not sparse in the timefrequency domain. As a result, the human hearing system cannot employ the precedence effect in a room, as it does it for speech. In condition 2, the speech signal is dichotic and the noise is diotic. Hence, the human hearing system can employ the precedence effect for the speech signal, and gain some spatial information that separates the speech from the noise. In condition 3, however, the speech signal is displayed as a diotic signal and the noise is dichotic. The human hearing system cannot employ the precedence effect, and the spatial information that can be obtained is more limited. Spatial separation of the two signals is therefore reduced in condition 3 compared to condition 2, which may explain the differences in SRT. 4. Conclusion A HINT was performed via headphones using four target-noise diotic-dichotic conditions, where the dichotic signals were obtained using the Nx – Virtual Mix Room over Headphones system of Waves Audio Ltd. It was observed that the BSR of the target

[1] Litovsky RY. Spatial release from masking. Acoust Today 2012;8(2):18–25. [2] Plomp R, Mimpen AM. Effect of the orientation of the speakers head and the azimuth of a noise source on the speech reception threshold for sentences. Acustica 1981;48(5):325–32. [3] Peissig J, Kollmeier B. Directivity of binaural noise reduction in spatial multiple noise-source arrangements for normal and impaired listeners. J Acoust Soc Am 1997;101(3):1660–70. [4] Bronkhorst AW. The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions. Acustica 2000;86 (1):117–28. [5] Marrone N, Mason CR, Kidd G. Tuning in the spatial dimension: evidence from a masked speech identification task. J Acoust Soc Am 2008;124(2):1146–58. [6] Jones GL, Litovsky RY. A cocktail party model of spatial release from masking by both noise and speech interferers. J Acoust Soc Am 2011;130(3):1463–74. [7] Bronkhorst AW, Plomp R. Effect of multiple speechlike maskers on binaural speech recognition in normal and impaired hearing. J Acoust Soc Am 1992;92 (6):3132–9. [8] Ihlefeld A, Shinn-Cunningham B. Spatial release from energetic and informational masking in a selective speech identification task. J Acoust Soc Am 2008;123(6):4380–92. [9] Duraiswami R, Zotkin D, Li Z, Grassi E, Gumerov N, Davis L. High order spatial audio capture and its binaural head-tracked playback over headphones with HRTF cues. In: 119th Convention of the AES, New York, NY, USA. p. 1–16. [10] Rafaely B, Avni A. Interaural cross correlation in a sound field represented by spherical harmonics. J Acoust Soc Am 2010;172(2):823–8. [11] Avni A, Wierstorf H, Geier M, Ahrens J, Spors S, Rafaely B. Spatial perception of sound fields recorded by spherical microphone arrays with varying spatial resolution. J Acoust Soc Am 2013;133(5):2711–21. [12] Shabtai NR, Rafaely B. Generalized spherical array beamforming for binaural speech reproduction. IEEE Trans Audio Speech Lang Process 2014;22 (1):238–47. [13] Jeffet M, Shabtai NR, Rafaely B. Theory and perceptual evaluation of the binaural reproduction and beamforming trade-off in the generalized spherical array beamformer. IEEE Trans Audio Speech Lang Process 2016;24(4):708–18. [14] Shabtai NR. Optimization of the directivity in binaural-sound-reproduction beamforming. J Acoust Soc Am 2015;138(5):3118–28. [15] Shabtai NR, Ben-Hur Z, Nehoran I, Ben-Asher M, Rafaely B. Intelligibility of spatially reproduced speech over headphones under ambient noise. In: 42nd Jahrestagung für Akustik (DAGA 2016), Aachen, Germany. p. 345–7. [16] Algazi VR, Duda RO, Thompson DM, Avendano C. The CIPIC HRTF database. In: Proc. 2001 IEEE Workshop Applicat. Signal Process. Audio Acoust. (WASPAA 2001), New Paltz, NY, USA. p. 99–102. [17] Freyman R, Helfer K, McCall D, Clifton R. The role of perceived spatial separation in the unmasking of speech. J Acoust Soc Am 1999;106(6):3578–87. [18] Brungart DS, Simpson BD, Freyman RL. Precedence-based speech segregation in a virtual auditory environment. J Acoust Soc Am 2005;118(5):3241–51.