Speech intelligibility for different spatial configurations of target speech and competing noise source in a horizontal and median plane



Available online at www.sciencedirect.com

Speech Communication 55 (2013) 1021–1032 www.elsevier.com/locate/specom

Edward Ozimek *, Jędrzej Kociński, Dariusz Kutzner, Aleksander Sęk, Andrzej Wicher

Institute of Acoustics, Faculty of Physics, Adam Mickiewicz University in Poznań, Umultowska 85, 61-614 Poznań, Poland

Received 5 February 2013; received in revised form 5 June 2013; accepted 13 June 2013. Available online 24 June 2013

Abstract

Speech intelligibility was investigated for different configurations of a target signal (speech) and a masker (babble noise) in the horizontal and median planes. The sources were placed in front of, behind or on the right-hand side (at different angular configurations) of a dummy head. The speech signals were presented to listeners via headphones at different signal-to-noise ratios (SNR). Three listening modes (binaural, and monaural for the right or left ear) were tested. It was found that the binaural mode gave the lowest, i.e. 'the best', speech reception threshold (SRT) values compared to the other modes, except for the cases when both the target and masker were at the same position. With regard to the monaural modes, SRTs were generally worse than those for the binaural mode. The new data gathered for the median plane revealed that a change in elevation of the speech source had a small, but statistically significant, influence on speech intelligibility. It was found that when speech elevation was increased, speech intelligibility decreased. © 2013 Elsevier B.V. All rights reserved.

Keywords: Speech intelligibility; Speech-in-noise test; Spatial perception; Monaural and binaural perception

1. Introduction

In natural acoustic environments speech often coexists with signals generated by other sound sources. Communication is therefore often difficult, because speech is masked by other sounds (speech, traffic noise, music, etc.). However, the auditory system is capable of separating signals coming from different directions and extracting the information of interest. Many experiments on speech intelligibility for different configurations of sources have been carried out so far. For example, Bronkhorst and Plomp (1988) investigated the effect of interaural time delay (ITD) and acoustic headshadow on binaural speech intelligibility in noise. Recordings were made of speech reproduced in front of a manikin; and of noise emanating

* Corresponding author. Address: Institute of Acoustics, Adam Mickiewicz University, ul. Umultowska 85, 61-614 Poznań, Poland. Tel.: +48 618295133; fax: +48 618295123. E-mail address: [email protected] (E. Ozimek).

0167-6393/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.specom.2013.06.009

from seven angles in the horizontal plane ranging from 0° (frontal) to 180° in steps of 30°. Freyman et al. (2001) determined the extent to which the perceived separation of speech and interference improves speech recognition in the free field. The target talker was always presented from a loudspeaker directly in front (0°). The interference was presented either from the front or from both a right loudspeaker (60°) and a front loudspeaker, with the right leading the front by 4 ms. In the experiment carried out by Hawley et al. (2004), speech reception thresholds (SRTs) were measured for Harvard IEEE sentences presented from the front in the presence of one, two, or three interference sources. Four types of interferer were used: other sentences spoken by the same speaker, time-reversed sentences of the same speaker, speech-spectrum-shaped noise, and speech-spectrum-shaped noise modulated by the temporal envelope of the sentences. Kociński and Sęk (2005) investigated speech intelligibility in the presence of one or two statistically independent speech-shaped noise sources in varying configurations. Litovsky (2005) tested children between the ages of 4 and 7, and adults, in free field


E. Ozimek et al. / Speech Communication 55 (2013) 1021–1032

for speech intelligibility. The target speech was presented from the front (0°); speech or modulated speech-shaped-noise competitors were either in front or on the right (90°). Brungart and Iyer (2012) showed that listeners with normal hearing are able to efficiently extract information from better-ear glimpses that fluctuate rapidly across frequency and across the two ears. A general aim of these experiments was to investigate spatial release from masking (also called spatial unmasking or spatial suppression). It turned out that speech intelligibility depended, among other things, on the mutual spatial configuration of the speech and masker sources. Moreover, it was found that for a given signal-to-noise ratio (SNR) speech intelligibility was higher when the target and masker sources were spatially separated than when they were collocated. Spatial unmasking is related to the influence of the listener's head on the propagation of the signal: the head creates an acoustic shadow (also called the head shadow effect), which leads to an interaural level difference. When the speech and noise sources are spatially separated, the listener is able to take advantage of this difference, which changes the SNRs at the respective ears. Moreover, interaural time (phase) differences play an important role in spatial unmasking, which can be partially interpreted in terms of the binaural masking level difference (Kociński and Sęk, 2005). This effect (originally related to the detection of tones presented in noise) was generalized to speech perception and called the binaural intelligibility level difference (Peissig and Kollmeier, 1997), which incorporates both binaural and monaural components of auditory processing (Garadat and Litovsky, 2006; Hawley et al., 2004; Lin and Feng, 2003).
Many investigators have analysed configurations in which the target speech source was placed directly in front of the listener and the azimuth of the disturbing source was varied, e.g. 0° (frontal) to 180° in steps of 30° (Bronkhorst and Plomp, 1988), 0° and 60° (Freyman et al., 1999) or 0°, 45°, 90° (Drullman and Bronkhorst, 2000). A scenario in which speech and masker sources are spatially distributed was also used in a recent study by Allen et al. (2008) that concerned auditory streaming and spatial speech unmasking. In that study considerable release from masking was demonstrated when two noise sources were located symmetrically at azimuths of −30° and +30°, respectively. The purpose of the present study was to determine the SRT (i.e. the SNR yielding 50% speech intelligibility) for different configurations of the speech and masker sources. The experiments were carried out in monaural and binaural listening modes, in the horizontal plane (the plane that cuts through the head at ear level) (experiment 1) and in the median plane (the plane that splits the head into left and right halves, i.e. the sagittal plane in the anatomical coordinate system) (experiment 2). In Fig. 1, the above-mentioned planes, the spatial configurations of speech and noise considered in the experiments as well as

angles describing the direction of an incoming sound (i.e. azimuth and elevation), are schematically presented. Since the vast majority of previous experiments focused mainly on the influence of the masker azimuth, while the speech azimuth was kept at 0°, in this study the speech azimuth was also varied (experiment 1) for the following listening modes: monaural-left ear, monaural-right ear and binaural (for details see Section 2). Experiment 2 analyses a novel aspect of speech intelligibility, namely effects in the median plane. In this case the speech azimuth was kept constant, while the speech source elevation and the masker azimuth were modified, and the signals were presented monaurally and binaurally. There are several situations in which the sources of speech signals are situated higher than listeners' heads. For example, in most public places, such as stations, airports or churches, speech sources are located above our heads. It therefore seems important to check experimentally how intelligible speech is when its source is located above the head. The Polish Sentence Test (PST) (Ozimek et al., 2009) was used as the target speech material for the first time in such a study. It is worth adding that the rationale for this study is both scientific and practical. For example, comparison of the intelligibility data obtained for the monaural-left and monaural-right ear modes in acoustically adverse conditions can provide relevant information on where to place speech and noise sources to obtain optimal hearing conditions, especially for patients with a unilateral hearing loss. Furthermore, in different human environments, speech and noise sources are often distributed relative to each other in a complex way. The results of the present study could be helpful in defining the spatial distribution of those sources that is optimal for speech intelligibility.

2. Materials and methods

2.1. Stimuli: speech and masking noise

The PST, which reflects the basic features of the Polish language, consists of 25 different lists, each containing 20 sentences. The test is characterized by a large number of fricatives and a limited number of vowels. As a result, it has a relatively high level of energy in the frequency range above 5 kHz and a high variability in amplitude envelope in comparison with other languages. When presented against a background of the so-called babble noise masker, the lists of sentences produce relatively steep intelligibility functions, i.e. functions that link the probability of a correct response to SNR. A large slope of the intelligibility function at the SRT point implies low inter- and intra-variability across the lists and, consequently, a high precision in SRT determination. Using such a test it is possible to detect subtle differences in speech intelligibility between measurement conditions, which might be difficult to detect with test materials producing less steep


Fig. 1. Schematic diagram of the spatial configuration of speech (S) and noise (N) sources examined in experiment 1 (a) and experiment 2 (b).

intelligibility functions, for example logatome or single consonant–vowel–consonant tests (Bosman and Smoorenburg, 1995). All the PST lists contained grammatically correct and semantically neutral utterances and were statistically and phonemically balanced. During the measurements, the speech was presented against a background of babble noise. The power spectrum of the babble noise optimally matched the power spectra of the sentences presented (for more details on the PST and masking noise see Ozimek et al., 2009). It should be stressed that precise spectral matching of the masked speech and the masker has been shown to be very important for obtaining a steep intelligibility function, i.e. for accurate SRT measurement. The reference (normative) SRT and slope of the intelligibility function for the PST, obtained for a group of otologically normal subjects, are −6.1 dB and 25.6%/dB, respectively. Thus, any statistically significant decrease in the SRT may be regarded as a reduction of masking effectiveness and a release from masking. Conversely, any increase in SRT with respect to the normative value could be a consequence of an increase in masking magnitude for a given measurement scenario. The adaptive method was used in the present study to measure SRT (Cox et al., 1991; Kollmeier and Wesselkamp, 1997). Signals for different locations of speech and noise sources were recorded by means of a dummy head. During the recordings, the speech source was placed on the right-hand side of the dummy head, in front of it or behind it, while the noise source could be moved clockwise. In this way, different combinations of recordings were obtained and on this basis different listening modes could be arranged. The signals recorded at different locations of speech and masker sources were mixed and presented to the subjects via headphones (Sennheiser HD 580). During the listening sessions the following three modes were examined: monaural-right ear (i.e. speech and masker signals recorded by the right channel of the dummy head were presented monaurally to the subject), monaural-left ear (i.e. speech and masker signals recorded by the left channel of the dummy head were presented monaurally to the subject) and binaural mode (i.e. speech and masker signals recorded

by the left and right channel of the dummy head were presented binaurally to the subject).

2.2. Target-masker spatial configurations

The sound sources were placed in an anechoic chamber on a net hung at a distance of 2 m from the dummy head. The distance was chosen because of the large size of the loudspeaker (Altus 300), which was about 1 m high. The dummy head was always placed at the height of the loudspeaker's centre. Different configurations of the target speech source were chosen: five in the horizontal plane, i.e. for the azimuth θ of 0° (in front of the head), 30°, 60°, 90° (on the right side of the head) and 180°, clockwise, with the elevation angle φ equal to 0° only; and three in the median plane, i.e. for the elevation angle φ of 0°, 45° and 90°, with the azimuth θ equal to 0° only (Table 1). An elevation angle φ of 0° means that the sound source was placed at the same height as the dummy head, 45° means that it was placed at an elevation of 45° above the head, and 90° means that the source was placed directly above the head. For each of the target speech placements eight different azimuth angles of the masker were used, i.e. 0°, 15°, 30°, 45°, 60°, 75°, 90° and 180°, clockwise (for the elevation angle of 0° only). Thus, the notation S^φ_θ used in the paper stands for the target speech signal placed at azimuth θ, clockwise, and elevation angle φ; N_θ stands for a noise source placed at azimuth θ, clockwise. For example, the notation S^0_30 N_75 means that the speech source is placed at 0° in the median plane and 30° in the horizontal plane, and the noise source is placed at 75° in the horizontal plane.

2.3. Recordings

The aim of the recording sessions was to collect signals that reflect the physical features of sounds perceived by both ears for different locations of sound sources. To do this, the entire PST was recorded for each spatial position of the target source, and the masking signal was recorded for each position of the masker. The recording procedure was controlled by custom software implemented in Matlab 6.5.


Table 1. Set of spatial configurations used in the experiment. S stands for the target signal and N for the noise signal; the lower index of S indicates the azimuth angle θ of the target source, the upper index of S indicates its elevation angle φ, and the index of N indicates the noise azimuth. In experiment 1, the speech elevation angle was 0°, whereas the speech azimuth and noise azimuth were modified. In experiment 2, the speech azimuth was kept at 0°, while the speech elevation angle and masker azimuth were changed.

Experiment | 1 | 1 | 1 | 1 | 1 | 2 | 2
Target speech elevation in median plane (°) | 0 | 0 | 0 | 0 | 0 | 45 | 90
Target speech azimuth in horizontal plane (°) | 0 | 30 | 60 | 90 | 180 | 0 | 0
Noise azimuth in horizontal plane 0° | S^0_0 N_0 | S^0_30 N_0 | S^0_60 N_0 | S^0_90 N_0 | S^0_180 N_0 | S^45_0 N_0 | S^90_0 N_0
Noise azimuth in horizontal plane 15° | S^0_0 N_15 | S^0_30 N_15 | S^0_60 N_15 | S^0_90 N_15 | S^0_180 N_15 | S^45_0 N_15 | S^90_0 N_15
Noise azimuth in horizontal plane 30° | S^0_0 N_30 | S^0_30 N_30 | S^0_60 N_30 | S^0_90 N_30 | S^0_180 N_30 | S^45_0 N_30 | S^90_0 N_30
Noise azimuth in horizontal plane 45° | S^0_0 N_45 | S^0_30 N_45 | S^0_60 N_45 | S^0_90 N_45 | S^0_180 N_45 | S^45_0 N_45 | S^90_0 N_45
Noise azimuth in horizontal plane 60° | S^0_0 N_60 | S^0_30 N_60 | S^0_60 N_60 | S^0_90 N_60 | S^0_180 N_60 | S^45_0 N_60 | S^90_0 N_60
Noise azimuth in horizontal plane 75° | S^0_0 N_75 | S^0_30 N_75 | S^0_60 N_75 | S^0_90 N_75 | S^0_180 N_75 | S^45_0 N_75 | S^90_0 N_75
Noise azimuth in horizontal plane 90° | S^0_0 N_90 | S^0_30 N_90 | S^0_60 N_90 | S^0_90 N_90 | S^0_180 N_90 | S^45_0 N_90 | S^90_0 N_90
Noise azimuth in horizontal plane 180° | S^0_0 N_180 | S^0_30 N_180 | S^0_60 N_180 | S^0_90 N_180 | S^0_180 N_180 | S^45_0 N_180 | S^90_0 N_180
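The set of conditions in Table 1 can also be enumerated programmatically. The Python sketch below is illustrative only: the constant names, the function names and the plain-text label format are assumptions introduced here, and it is not the study's own software (which was implemented in Matlab).

```python
from itertools import product

# (elevation, azimuth) of the target speech source, in degrees, as listed in Table 1.
SPEECH_POSITIONS = [(0, 0), (0, 30), (0, 60), (0, 90), (0, 180), (45, 0), (90, 0)]
# Azimuths of the noise source, in degrees (elevation 0 only).
NOISE_AZIMUTHS = [0, 15, 30, 45, 60, 75, 90, 180]
LISTENING_MODES = ["monaural-left", "monaural-right", "binaural"]


def label(elevation, azimuth, noise_azimuth):
    """Plain-text label: upper index = speech elevation, lower index = speech azimuth."""
    return f"S^{elevation}_{azimuth} N_{noise_azimuth}"


def all_configurations():
    """Every target-masker pair of Table 1, column by column."""
    return [label(e, a, n) for (e, a), n in product(SPEECH_POSITIONS, NOISE_AZIMUTHS)]


if __name__ == "__main__":
    configs = all_configurations()
    print(len(configs), "target-masker pairs")
    print(len(configs) * len(LISTENING_MODES), "conditions over all listening modes")
```

Seven speech positions combined with eight noise azimuths give 56 target-masker pairs, and three listening modes give 168 measurement conditions in total.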

A standard 44.1 kHz sampling rate and 16-bit resolution were used. Speech and noise were recorded separately: first the PST was recorded for the desired angles, and then analogous recordings were made for all chosen angles of the masking source. This procedure meant that different SNRs and angular target-masker configurations could be obtained simply by mixing the speech and noise on a PC. Since the distance between the dummy head and the loudspeaker was kept constant, a change in a source's location can be regarded as a change in the angular separation between the masked speech and the masking signal. The setup of the recording apparatus used in this procedure is shown schematically in Fig. 2.

2.4. Experimental sessions

During the listening sessions the recorded signals were played back by means of the Tucker-Davis Technologies (TDT) System 3 (the real-time 24-bit signal processor RP2 and the headphone buffer HB7). The measurements were controlled by a PC using software implemented in Matlab 6.5. In order to generate different SNRs, the following procedure was used (Fig. 3). The root-mean-square (rms) value of the signals recorded via the reference microphone was calculated. According to the determined rms, the signals from both the left and the right ear were adjusted to obtain the

Fig. 2. Scheme of the recording equipment. All the signals were stored in .wma format (each sentence separately) on a PC hard drive. They were sent via an ADAT interface to the Yamaha 01 V digital console used as a D/A converter and were delivered to the Pioneer A-505R ampliﬁer to adjust the ampliﬁcation and then fed to the Tonsil Altus 300 loudspeaker placed in an anechoic chamber. The signal from the loudspeaker was recorded using a Neumann KU100 dummy head and an additional small (1/2 inch) reference omni-directional microphone (GRAS 40AE with Svantek SV01A pre-amp) placed just above the dummy head. The reference microphone was used to be able to adjust the levels of speech and noise signals (and obtain diﬀerent signal-to-noise ratios) in the adaptive procedure used in the experiment. All three signals (i.e. from the left and the right ear and from the additional microphone) were fed to the Yamaha 01 V console, A/D converted and delivered via an ADAT to a PC where they were labelled unambiguously and ﬁnally stored on a hard drive.


Fig. 3. Scheme of SNR adjustment and signal presentation used in measurements (see text for details).

desired SNR. For the speech signal, the rms was calculated for the whole utterance (including natural pauses between words). The speech and masker signals were then mixed (separately for each channel) and delivered to both channels of the TDT RP2 (D/A conversion). Next, the level of the signal was adjusted (the total level was adjusted to 70 dB SPL – reference signal). Finally, the signals were presented to the subject via Sennheiser HD 580 headphones, according to three different scenarios (monaural-right ear, monaural-left ear and binaural mode). During the listening sessions the subjects were seated in a double-walled, acoustically insulated booth. The number of subjects was chosen in such a way that SRTs for 6 listeners were determined for each angular configuration and each listening mode (left ear, right ear or binaural presentation). In order to avoid learning effects that might influence the speech intelligibility data, each subject was presented with 24 different speech-noise spatial configurations (8 noise angles in 3 listening modes; the PST consists of 25 lists). Hence, the total number of subjects participating in the measurements was 42. All the subjects were native speakers of Polish aged between 21 and 28. The subjects were paid for participation in the measurements. All the subjects had normal hearing threshold levels (<15 dB HL) at all the audiometric frequencies and no history of hearing disorders.
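The SNR adjustment described above can be illustrated with a short Python sketch. It is a simplified reading of the procedure, not the authors' Matlab code: the function names are hypothetical, the choice to scale the speech (rather than the noise) is an assumption, and the final calibration of the total level to 70 dB SPL is not modelled.

```python
import math


def rms(samples):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def snr_db(speech, noise):
    """Level difference between speech and noise in dB, based on rms."""
    return 20.0 * math.log10(rms(speech) / rms(noise))


def speech_gain_for_nominal_snr(speech_ref, noise_ref, target_snr_db):
    """Gain to apply to the speech so that the reference-microphone SNR equals the target.

    Assumption: only the speech is scaled; the text does not say which signal was adjusted.
    """
    return 10.0 ** ((target_snr_db - snr_db(speech_ref, noise_ref)) / 20.0)


def mix_channel(speech_ear, noise_ear, gain):
    """Mix one ear channel; the same gain is used for the left and right ears."""
    n = min(len(speech_ear), len(noise_ear))
    return [gain * speech_ear[i] + noise_ear[i] for i in range(n)]
```

Because the gain is derived from the reference microphone only, the SNR at each ear (the effective SNR) differs from the nominal SNR by whatever attenuation the head imposes on the speech and the noise at that ear.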

2.5. Measurement method

An adaptive staircase procedure with a 1-up/1-down decision rule was used to determine SRT values (Brand and Kollmeier, 2002; Versfeld et al., 2000). Unlike the constant stimuli paradigm, in this method SNR was varied adaptively with respect to the subject's most recent response. The SNR was increased or decreased by some value (step) when the most recent response was incorrect (1-up) or correct (1-down), respectively. This method makes the SNR converge to the 50% equilibrium point on the intelligibility function, i.e. the SRT. A 2 dB step was used until the first incorrect answer was recorded; it was then changed to 1 dB to improve the resolution of the adaptive procedure. SNR was calculated according to the rms of the signal recorded by the reference microphone and did not depend on the target-masker spatial distribution; thus it could be regarded as a sort of nominal SNR. Apart from the nominal value, a change in source spatial configuration had a strong influence on the SNR recorded by the respective microphones of the dummy head because of the head shadow effect. The SNR that depended both on the adaptive procedure and on the spatial configuration of sound sources was called the effective SNR. SRT was calculated as the mean of the last 12 nominal SNRs, including the 21st SNR determined according to the response to the 20th sentence. Thus SRT was also expressed in terms of the nominal SNR on the dB scale. For a given subject one PST list of 20 sentences was used to estimate one SRT value.
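A 1-up/1-down staircase of the kind described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the starting SNR, the function names and the simulated listener are hypothetical, each sentence is scored simply as correct or incorrect, and the step is assumed to drop to 1 dB starting with the update that follows the first incorrect response.

```python
import math
import random


def run_staircase(respond, start_snr_db=0.0, n_sentences=20, n_average=12):
    """1-up/1-down staircase; returns (srt, track) with n_sentences + 1 nominal SNRs.

    respond(snr_db) must return True for a correct response and False otherwise.
    start_snr_db is an assumed value; the text does not give the starting SNR.
    """
    track = [start_snr_db]
    step_db = 2.0
    for _ in range(n_sentences):
        correct = respond(track[-1])
        if not correct:
            step_db = 1.0  # assumed: finer step from the first incorrect answer onwards
        # Correct response lowers the SNR (1-down); incorrect raises it (1-up).
        track.append(track[-1] - step_db if correct else track[-1] + step_db)
    # Mean of the last n_average SNRs, including the one set by the final response.
    srt = sum(track[-n_average:]) / n_average
    return srt, track


def simulated_listener(true_srt_db=-6.0, slope_per_db=0.25, seed=0):
    """Hypothetical listener with a logistic intelligibility function.

    slope_per_db is the slope at the 50% point, as a proportion per dB.
    """
    rng = random.Random(seed)

    def respond(snr_db):
        p = 1.0 / (1.0 + math.exp(-4.0 * slope_per_db * (snr_db - true_srt_db)))
        return rng.random() < p

    return respond


if __name__ == "__main__":
    srt, track = run_staircase(simulated_listener())
    print(f"estimated SRT: {srt:.2f} dB from {len(track)} SNR values")
```

With 20 sentences the track holds 21 SNR values, so averaging the last 12 covers the 10th to the 21st value, in line with the rule quoted above.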

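3. Results

3.1. Experiment 1. SRTs for different speech-noise spatial separations and different presentation modes in a horizontal plane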

The results of experiment 1, i.e. SRTs vs. masker azimuth for different speech azimuths, are depicted in Fig. 4. The successive panels present the data obtained for


Fig. 4. SRT as a function of noise source azimuth (clockwise). Zero degree means that the noise source is placed in the front of the head. Subsequent panels present the results for diﬀerent azimuths of the target speech, i.e. (a) 0°, (b) 30°, (c) 60°, (d) 90°, (e) 180°. Circles present data (i.e. mean SRT averaged across six subjects) obtained for monaural-right ear mode, crosses depict data for monaural-left ear mode and asterisks present data for binaural mode. Error bars present standard deviation across subjects.

different spatial configurations of the speech source (S0, Fig. 4a; S30, Fig. 4b; S60, Fig. 4c; S90, Fig. 4d; and S180, Fig. 4e). The data gathered for the three listening modes are also depicted, namely: binaural (asterisks with solid line), monaural-left ear (crosses with dotted line) and monaural-right ear (circles with dashed line). The error bars depict standard deviations across subjects.

3.1.1. Monaural-right ear mode

In general, the highest SRTs were obtained for this listening mode. In most cases SRT ranged from −6.0 to −4.0 dB; the minimal and maximal SRTs for this presentation mode were −9.8 and −2.3 dB and were observed for S90N0 and S180N90, respectively. The relationship between SRT and the masker azimuth is a non-monotonic function. The shapes of the result patterns are comparatively similar for different speech azimuths. For all the configurations considered the highest SRTs were obtained for noise azimuths ranging from 45° to 90°. For S0 and S180, except when the target azimuth and masker azimuth were the same, SRTs obtained for the monaural-right ear mode were always higher than those obtained for the monaural-left ear and binaural modes. When the azimuths of the speech source and the masker source are

the same, SRTs obtained for all the listening modes are similar and tend to approximately −6.0 dB.

3.1.2. Monaural-left ear mode

In most cases, SRTs measured for the monaural-left ear mode (Fig. 4, crosses with dotted line) were lower than those determined for the monaural-right ear mode. The minimal and maximal SRTs were observed for S0N75 and S60N0 and were equal to −15.4 and 0.2 dB, respectively. The resulting pattern is 'U-shaped' (i.e. SRT is a non-monotonic function of masker azimuth) and shifts along the Y-axis for different speech azimuths. In general, the lowest SRTs were observed for S0, while the highest SRTs were obtained for S60. The SRTs obtained for the monaural-left ear mode were higher than those for the monaural-right ear and binaural modes for S30N0, S30N15, S60N0, S60N15, S60N30, S30N180 and for S90N0, S90N15, S90N180.

3.1.3. Binaural mode

For the binaural mode (Fig. 4, asterisks, solid line) the lowest SRTs were observed. The exceptions were when both the target and the masker were placed at the same position or when the speech source and the masker source were placed exactly in front of and behind the


dummy head, respectively. In these cases the highest SRTs were observed for the binaural mode and there were no differences across the listening modes. The minimal and maximal SRTs for the binaural mode were −17.4 and −5.5 dB and they were obtained for S0N75 and S30N30, respectively.

For each combination of speech source and masker angle in the horizontal plane and each listening mode, the SRTs collected were pooled across listeners and subjected to a three-way analysis of variance (ANOVA) with respect to the following factors: speech azimuth (‘speech’), masker azimuth (‘noise’) and listening mode (‘mode’). All the main effects were statistically significant (p < 0.001): ‘speech’ {F(4, 719) = 176.82}, ‘noise’ {F(7, 719) = 9.94} and ‘mode’ {F(2, 719) = 2014.23}. All the interactions among the main factors were also statistically significant (p < 0.001): ‘speech’ × ‘noise’ {F(28, 719) = 55.02}, ‘speech’ × ‘mode’ {F(8, 719) = 323.01}, ‘noise’ × ‘mode’ {F(14, 719) = 170.76} and ‘speech’ × ‘noise’ × ‘mode’ {F(56, 719) = 28.08}. To examine the statistical significance of ‘local’ differences between chosen data samples (for example, only SRTs obtained for S0N0 and S0N180), the ANOVA was followed by a post hoc analysis by means of the Tukey test. This analysis showed that SRTs measured for the masker placed behind the head (S0N180) are better than those determined in the case when both target and noise were collocated (p < 0.05). For the S0 condition, some differences between mean SRTs can be noticed; however, they are not statistically significant (p = 0.12).

3.2. Experiment 2. SRTs for different speech elevation angles, masker azimuths and presentation modes

Fig. 5 depicts the results of experiment 2, i.e. mean SRTs (averaged across subjects) and the corresponding standard deviations for different speech elevation angles, 0° (Fig. 5a, replotted from Fig. 4a), 45° (Fig. 5b) and 90° (Fig. 5c), as a function of the masker azimuth (speech azimuth was always kept at 0°).
As can be seen from Fig. 5, there are considerable qualitative differences across the listening modes. The highest SRTs are observed for the monaural-right ear mode, while the lowest SRTs are obtained for the binaural mode. For each listening mode the relationship between SRT and the masker azimuth is a non-monotonic function. The patterns observed for the monaural-left ear and binaural modes are 'U-shaped'. The minimal and maximal SRTs obtained in this experiment are −16.3 and 1.0 dB, respectively, and these values were obtained for the same target-masker configuration, S^45_0 N_75, yet for different listening modes, i.e. the binaural mode and the monaural-right ear mode, respectively. Except for the monaural-right ear mode, for each speech elevation angle, SRT decreases as the noise azimuth approaches 75°. For the monaural-right ear mode, the SRT increases with increasing noise azimuth. The local maximum in the result pattern for monaural-right ear


depends on the speech elevation angle: for 0°, 45° and 90° the local maxima correspond to the noise azimuths 60°, 75° and 90°, respectively. The individual SRTs were pooled across the listeners for each combination of the masker horizontal angle (‘horizontal’), the speech elevation angle (‘elevation’) and the presentation mode (‘mode’), and subjected to a three-way ANOVA. All the main effects were statistically significant: ‘elevation’ {F(2, 287) = 31.89, p < 0.001}, ‘horizontal’ {F(7, 287) = 142.93, p < 0.001} and ‘mode’ {F(2, 287) = 2230.29, p < 0.001}. Two interactions were shown to be statistically significant: ‘mode’ × ‘horizontal’ {F(14, 287) = 138.32, p < 0.001} and ‘mode’ × ‘elevation’ {F(2, 287) = 11.34, p < 0.001}, whereas the two other interactions turned out to be insignificant: ‘horizontal’ × ‘elevation’ {F(7, 287) = 1.37, p = 0.21} and ‘mode’ × ‘horizontal’ × ‘elevation’ {F(14, 287) = 11.37, p = 0.16}.

4. Discussion

4.1. Experiment 1

4.1.1. Monaural presentation

In the case of the monaural-right ear mode, the range of SRT change, determined for different noise azimuths, is relatively small. This means that a change in the horizontal angle of the speech and masker sources has a relatively small influence on speech intelligibility when the speech and noise sources are placed on the right-hand side of the head and perceived by the right ear. As can be seen from Fig. 4a and e, if the speech azimuth is kept at 0° or 180° and the noise azimuth varies from 60° to 90°, some growth in the SRT is noticeable, i.e. an increase in masking effectiveness is observed and no spatial unmasking occurs. This is because the benefit of spatial separation is regarded as a binaural phenomenon, supported additionally by the head shadow effect. This mechanism, however, does not operate in the monaural-right ear mode, for which both speech and masker are presented without the head shadow effect. The increase in the masking magnitude observed for noise azimuths approaching 60°–90° is related to the differences in the head-related transfer functions (HRTFs) for different azimuths, i.e. the dummy head 'introduces' slightly more attenuation when the noise source is placed in front of or behind it than when the noise is presented directly to the ear (i.e. N90). Consequently, the masking effectiveness is lower for S0N0 than for S0N90, and better listener performance is observed when the speech source is placed in front of or behind the head. This explanation is supported by the data reported by Drullman and Bronkhorst (2000), who determined individual HRTFs for 12 human subjects and showed the difference between HRTFs obtained for azimuths of 0° and 90°. The poorest speech intelligibility obtained for the monaural-right ear mode is in agreement with the data reported by Edmonds and Culling (2006), who showed that


Fig. 5. SRTs data vs. noise source azimuth for the speech source placed in the median plane: (a) 0°, (b) 45°, (c) 90°. See description of Fig. 1 for details.

SRTs were higher for monaural listening than those observed for binaural presentation. When the speech and the masker azimuths are the same, the local maximum in the result pattern (Fig. 4c and d) corresponds exactly to the source azimuth. In these cases, speech intelligibility is the worst and the masking effectiveness the highest. Furthermore, both sources are located on exactly the same side, so perceptual fusion can additionally deteriorate speech intelligibility. In the monaural-left ear mode, the pattern of results is qualitatively different from that observed for the monaural-right ear mode. In most cases, speech intelligibility in the monaural-left ear mode is considerably better than in the monaural-right ear mode. In the monaural-left ear mode, the subjects were presented with speech and noise recorded at the left ear of the dummy head, while the azimuths of the speech and noise sources were changed from 0° to 180° clockwise (i.e. on the right side of the dummy head). Thus, the result pattern is mainly determined by the acoustic shadow effect, which improves the effective SNR in the left ear of the dummy head. For each speech azimuth, the highest SRTs, i.e. the poorest intelligibility, were measured for the masker source placed in front of or behind the head (N0 and N180). Moreover, except for S0, SRTs measured for the masker placed behind the head are better than those determined when both target and noise were collocated.

When the noise azimuth is increased, a considerable improvement in the listeners’ performance is observed. This unmasking can be interpreted in terms of the acoustic shadow. Theoretically, the effect of the acoustic shadow on speech intelligibility should be strongest, i.e. the attenuation ‘introduced’ by the head should reach a maximum, when the noise is presented directly (N90) to one ear while the signals are perceived by the opposite ear. Nevertheless, as can be seen from Fig. 4, no local minima are observed for N90, while the lowest SRTs (the best intelligibility) are found for N75, i.e. an azimuth close, yet not equal, to 90°. The higher masking effectiveness observed for N90 than for N75 is a consequence of the so-called Babinet’s effect, which had been observed in earlier studies on spatial unmasking (Muller, 1992; Peissig and Kollmeier, 1997). Although imprecisely, a dummy head can be regarded as a spherical obstacle; thus, according to Babinet’s theorem, a maximum of the diffraction pattern is observed on the symmetry axis ‘linking’ the noise source and the target source. Therefore, in the monaural-left ear mode, the effective SNR is smaller for S0N90 (since the left microphone is placed in the mentioned maximum of the diffraction pattern) than for S0N75 and, consequently, less spatial unmasking is observed for S0N90 than for S0N75. Statistical analysis was used to determine the significance


of Babinet’s effect, i.e. for each speech azimuth the SRTs obtained for N90 and N75 were subjected to post hoc tests, which revealed that for S0, S60 and S90 the influence of Babinet’s effect was statistically significant (p < 0.001), whereas for S30 it was insignificant (p = 0.12). In general, for the monaural-left ear mode, the best intelligibility data are obtained when the speech source is placed in front of or behind the head. For S30, S60 and S90 (i.e. when the speech azimuth increases) the subjects’ performance becomes worse. Again, the reduction in speech intelligibility can be explained by the acoustic shadow, keeping in mind that in these cases (i.e. S30, S60 and S90) the head attenuates both the masking signal and the speech, since both sources are placed on the right side of the dummy head but are recorded by the left microphone. Therefore, when the speech azimuth increases, the speech attenuation grows, the effective SNR in the left ear decreases and the listener’s performance becomes worse: the patterns of results measured for S30, S60 and S90 are shifted towards higher (‘worse’) values relative to the SRTs obtained for S0 and S180. For the S30, S60 and S90 conditions, the poorest speech intelligibility should be expected for S90, since in this case the acoustic shadow caused by the head should introduce the maximum attenuation of the speech presented to the right ear but recorded by the left ear. In contrast, if one compares the SRTs measured for S60N90 and S90N90, it turns out that the speech intelligibility determined for S90N90 is better than that for S60N90 (the difference was confirmed by the post hoc analysis, p = 0.007). This is also a consequence of Babinet’s effect, since for S90N90 the left microphone and the speech source are placed on a symmetry axis linking the left ear and the speech source.
Therefore, a maximum of the diffraction pattern of the speech waveform occurs at the left ear, the effective SNR in this ear is higher than that for S60N90 (for which the diffraction maximum does not occur) and better intelligibility is observed for S90N90. When the speech and the masker azimuths are the same, or the speech and the masker are presented in front of or behind the head, respectively, the SRTs obtained for the monaural-left ear mode approach those obtained for the monaural-right ear and binaural modes, i.e. approximately 6.0 dB. In these cases the acoustic shadow attenuates both speech and masker in the same way and the energetic relations between them remain unaltered, so speech unmasking does not occur. It is worth noting that the comparison of the intelligibility data obtained for the monaural-right ear and monaural-left ear modes is a new finding and gives relevant information about the ability of listeners with unilateral deafness (or with a unilateral profound hearing loss) to understand speech in acoustically adverse conditions.

4.1.2. Binaural presentation

The data show that the benefit from the spatial separation of speech and masker is largest when the listeners are


presented binaurally with stimuli incorporating the physical features of the signals presented to the respective ears. For S0 and S180 the SRT results are ‘U-shaped’ and similar to the corresponding pattern observed for the monaural-left ear mode, but unmasking is more apparent for the binaural presentation. When the speech is presented in front of or behind the head (S0 and S180), the poorest speech intelligibility is obtained when the masker is also placed in front of or behind the head. In these cases (S0N0, S0N180, S180N0 and S180N180), although the stimuli are presented binaurally, there are no interaural differences between the signals delivered to the two ears and the listeners do not benefit from the binaural mechanisms. Conversely, when the speech azimuth is kept at 0° (or 180°) while the masker azimuth is increased, not only does the effective SNR in the left ear increase considerably, but the phase shift between the noises delivered to the two ears also increases. The maximum unmasking magnitude (i.e. the difference between the SRTs measured for collocated S0N0 and a given configuration) is observed for N75 and is equal to 11.4 dB. This noise azimuth is in line with the results of the study on binaural speech unmasking carried out by Peissig and Kollmeier (1997), although they observed less unmasking, i.e. approximately 8.0 dB. In the case of the binaural presentation, for each speech azimuth, the mean SRTs obtained for N90 are slightly higher than those observed for N75; however, except for S30 (post hoc test p = 0.02), the differences were not statistically significant (post hoc test p > 0.05). The lack of a significant Babinet’s effect in these cases can be explained by the mechanism of binaural perception. In the study by Peissig and Kollmeier (1997), SRTs for N90 are approximately 2 dB higher than the SRT for N75 (and both means lie outside the ranges of the standard deviations).
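The unmasking magnitude defined above, i.e. the SRT for the collocated configuration minus the SRT for a given configuration, can be written out as a short calculation. The following sketch is illustrative only: the SRT values in `srt_by_noise_azimuth_db` are hypothetical and were not taken from the present data, and the function and variable names are assumptions introduced for this example.

```python
# Spatial unmasking relative to the collocated configuration.
# A positive value means that spatial separation lowered (improved) the SRT.


def spatial_unmasking_db(srt_collocated_db, srt_separated_db):
    """Difference between the collocated SRT and the SRT of interest."""
    return srt_collocated_db - srt_separated_db


# Hypothetical SRTs (dB SNR) for speech at 0 degrees and several noise
# azimuths in degrees; illustrative numbers only, not results of the study.
srt_by_noise_azimuth_db = {0: -6.0, 30: -13.5, 60: -16.0, 75: -17.5, 90: -16.5}

reference_srt_db = srt_by_noise_azimuth_db[0]
unmasking_db = {
    azimuth: spatial_unmasking_db(reference_srt_db, srt)
    for azimuth, srt in srt_by_noise_azimuth_db.items()
}

# Noise azimuth that yields the largest release from masking.
best_noise_azimuth = max(unmasking_db, key=unmasking_db.get)

print(unmasking_db)
print(best_noise_azimuth)
```

With SRTs expressed in dB SNR, a lower (more negative) SRT for the separated configuration gives a larger positive unmasking value, so the configuration with the best intelligibility is also the one with the largest unmasking.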
However, if a single talker acted as the masking signal (Peissig and Kollmeier, 1997), SRTs for N75 and N90 were shown to be almost identical, so Babinet’s effect was not observed. For S0, a statistically insignificant (post hoc p = 0.23) difference between the mean SRTs measured for S0N90 and S0N60 can be noted in Fig. 4a. A similar relation was observed by Hawley et al. (2004), who showed that for four types of interfering signals the difference between the mean values for S0N90 and S0N60 was within the standard deviation range. The binaural unmasking obtained for S0N30, equal to 7.7 dB, is smaller than the release from masking for the same configuration reported by Allen et al. (2008), which was approximately 12 dB. This difference might be due to the masker used in the quoted study (speech of a single talker), since more effective spatial speech unmasking has been demonstrated for speech-like maskers than for noise-like maskers (Brungart and Simpson, 2002; Hawley et al., 2004). For S30, S60 and S90 the result patterns are also nonmonotonic functions of the masker azimuth, but with one local maximum. As can be seen, when the speech and the masker azimuths are the same (S30N30, S60N60 and


S90N90), the SRT obtained for the monaural-left ear and binaural modes approaches approximately 6.0 dB, i.e. unmasking does not occur. This is because, although stimuli are delivered to both ears, in these cases the subjects do not benefit from the binaural presentation, since for S30N30, S60N60 and S90N90 the speech and noise are collocated at the same azimuth. Hence, the acoustic shadow influences both the masked and the masking signal, i.e. for the left ear the speech and the masker are attenuated by the same amount, while for the right ear both signals are slightly ‘amplified’ by the same amount. Accordingly, the effective SNR remains unaltered and speech unmasking does not occur. For the other configurations, SRTs for the binaural presentation turned out to be considerably lower than those obtained for both monaural modes, since a change in the speech and the masker azimuths results in an increase in the effective SNR. Apart from the acoustic shadow effect, the principal finding in this experiment is that, despite the decorrelation of the speech waveforms delivered to both ears and the decorrelation of the noise signals presented to both ears, not much speech intelligibility improvement is observed for S30N30, S60N60 and S90N90. This means that the central binaural interaction (promoting binaural unmasking) influences speech perception only if the interaural cross-correlation coefficients of the speech and the masker differ. This observation is consistent with the data reported by Shinn-Cunningham et al. (2001), who showed that the ‘binaural advantage’ decreased when target and masker were collocated at an azimuth of 90°. However, an alternative explanation for the lack of binaural unmasking for S30N30, S60N60 and S90N90 can be based on the perceptual fusion of signals emanating from collocated sources.

4.2. Experiment 2

The most important finding from experiment 2 is that the intelligibility of speech presented in a noisy environment depends on the speech source elevation angle.
First let us consider the data for the monaural-right ear mode. As can be seen, SRTs obtained for $S_0^{45}$ and $S_0^{90}$ are ‘worse’ than SRTs observed for $S_0^{0}$. This is due to the differences in HRTFs for different elevation angles; the speech is more ‘attenuated’ when it comes from a source above the head. Moreover, for each speech source elevation the listeners’ performance was shown to be worst when the masker azimuth was 60°–90°. This configuration reflects a non-optimal condition for speech perception: the speech is presented from a source above the head (i.e. it is ‘attenuated’ by the HRTF), whereas the masker is presented directly to the ear (i.e. it is ‘amplified’ by the HRTF). This result therefore has some useful applications, since in many public buildings (for example railway stations, airports, etc.) loudspeakers are often placed above listeners’ heads, while noises come from sound sources located at an elevation close to 0° (for example an arriving train, other speakers, etc.). Considering the data

obtained in this experiment, it is suggested that target sound sources should be placed at a lower elevation with respect to the listeners’ heads in order to increase the effective SNR and, consequently, to improve speech understanding in these conditions. Although SRTs obtained for the monaural-left ear mode are markedly lower than those measured for the monaural-right ear mode because of the acoustic shadow, the speech coming from above the head is also more attenuated than that presented at an elevation angle of 0°; thus the effective SNR is lower and, consequently, the speech intelligibility is worse for $S_0^{45}$ and $S_0^{90}$ than for $S_0^{0}$. Again, SRTs observed for N90 are worse than for N75, which is a consequence of Babinet’s effect. For $S_0^{90}$ the difference between SRTs determined for N90 and N75 was confirmed by means of the post hoc test (p = 0.003), whereas for $S_0^{45}$ some difference between the mean values of SRTs for N90 and N75 can be noted, but because of large standard deviations this difference was shown to be insignificant (p = 0.22). The effect of speech source elevation on intelligibility is comparatively small, especially when compared with the ranges of SRT related to the changes in the source azimuths. This small, although statistically significant, dependence of SRT on the target source elevation can be explained by taking into account the values of interaural time difference (ITD) and interaural intensity difference (IID), which hardly depend on the source elevation angle when its azimuth is kept at 0°. These differences in the median plane are much smaller than the differences in the azimuth plane, hence larger changes in SRTs are observed when the azimuth is modified than when the elevation is changed. However, the data reported by Yost et al. (1993) show that when the signal azimuth is 60°, for example, a change in the elevation considerably influences ITD and IID.
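The dependence of ITD on elevation described above can be illustrated with Woodworth's spherical-head approximation, ITD = (a/c)(α + sin α), where α is the lateral angle of the source. This model is swapped in here purely as an illustration and was not used in the present study. The head radius, the speed of sound and the function names are assumptions, and the IID is not modelled.

```python
import math

# Assumed constants for the rigid-sphere approximation of the head.
HEAD_RADIUS_M = 0.0875
SPEED_OF_SOUND_M_S = 343.0


def lateral_angle_rad(azimuth_deg, elevation_deg):
    """Angle between the source direction and the median plane.

    For azimuth measured in the horizontal plane and elevation measured
    upwards from it, sin(lateral angle) = sin(azimuth) * cos(elevation).
    """
    azimuth = math.radians(azimuth_deg)
    elevation = math.radians(elevation_deg)
    return math.asin(math.sin(azimuth) * math.cos(elevation))


def woodworth_itd_us(azimuth_deg, elevation_deg):
    """Far-field ITD in microseconds from Woodworth's formula."""
    alpha = lateral_angle_rad(azimuth_deg, elevation_deg)
    return 1e6 * HEAD_RADIUS_M / SPEED_OF_SOUND_M_S * (alpha + math.sin(alpha))


# Compare the effect of elevation at 0 degrees and at 60 degrees azimuth.
for azimuth_deg in (0, 60):
    for elevation_deg in (0, 45, 90):
        itd = woodworth_itd_us(azimuth_deg, elevation_deg)
        print(f"azimuth {azimuth_deg:3d}, elevation {elevation_deg:3d}: {itd:7.1f} us")
```

In this simplified model the ITD is zero for every elevation when the azimuth is 0°, whereas at an azimuth of 60° it decreases as the source is raised, which agrees qualitatively with the contrast between the median plane and lateral azimuths.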
Therefore, one may expect that if different speech azimuths were considered in experiment 2, the range of SRTs for different target elevation angles would be considerably larger.

The results of the present study additionally confirm the importance of spectral matching of the speech and the interfering signal for accurate and reliable speech-in-noise assessment. This matching has been shown to be very important for obtaining a steep intelligibility function. Steep intelligibility functions are obtained when the speech and the masker are ‘spectrally matched’, i.e. the SNR across the respective frequency channels (auditory filters) is kept constant. This is supported by the error bars presented in Fig. 4, which give information on the accuracy of SRT measurement for each spatial configuration and presentation mode considered. It was found that for many spatial configurations both speech and noise were affected by HRTFs that depended on azimuth and elevation. Thus, the greater the difference between the HRTFs for a given speech and masker placement (azimuths), the poorer the spectral matching and the greater the standard deviation of the SRT that should be observed. This effect can be clearly


visible for the monaural-left ear mode, since in this case a considerable spectral mismatch (related to the head shadow effect) occurs: the greater the spatial separation between the speech and masker sources, the larger the spectral mismatch and the larger the standard deviations of SRT. By contrast, when the speech and masker sources are collocated, regardless of the azimuth, their power spectra are ‘multiplied’ by the same HRTF and the spectral matching is maintained.

The results of the present study extend knowledge of the subject’s ability to discriminate between sound source locations in two spatial dimensions: azimuth and elevation. Rather little is known about the parameters and mechanisms that determine elevation perception. It is worth adding that the obtained data may also have practical relevance in areas such as: three-dimensional audio-display technology (based on manipulation in azimuth, elevation, and distance) used to improve the perception and sensation of natural auditory spatial information (in civilian and military environments, for example); combined azimuth and elevation psychoacoustic experiments; noise control; acoustic-environment design; and specific measurements of telephone systems, headphones, personal hearing protectors and hearing aids.

5. Conclusions

The following conclusions can be drawn from the present study. The best performance in speech intelligibility was found for the binaural mode. In this case, the subjects benefit from the head shadow effect and the interaural differences between the stimuli delivered to both ears. The worst performance was obtained in the monaural-right ear mode, i.e. when both the speech and masker sources were placed at the right side of the head and perceived by the right ear. In this case no spatial unmasking occurs, since listeners do not benefit from the head shadow effect or interaural differences. In the case of the monaural-left ear mode, i.e.
when the target and masker sources are placed at the right side of the head but perceived by the left ear, the subject benefits from the head shadow effect, which attenuates the masking signal. However, if the speech azimuth increases, the shadow effect also influences the speech and the speech intelligibility decreases. Regardless of the presentation mode, spatial unmasking does not occur when the speech and masker sources are collocated. In this case, the highest accuracy of SRT measurement is obtained. As the separation between speech and masker grows, the standard deviation of SRT increases. When speech elevation is increased, speech intelligibility decreases. However, the effect of speech elevation is smaller than the effect of modifying the source azimuth. Until now the PST has not been used together with an adaptive procedure for measuring different aspects of spatial speech intelligibility. Therefore, the presented study includes a new set of data which is rarely available for other languages.


Acknowledgments

This work was supported by grants from: the European Union FP6: Project 004171 HEARCOM, the Polish-Norwegian Research Fund, State Ministry of Science and Higher Education: Project Number N N518 502139 and National Science Centre: Project Number UMO-2011/03/B/HS6/03709.

References

Allen, K., Carlile, S., Alais, D., 2008. Contributions of talker characteristics and spatial location to auditory streaming. Journal of the Acoustical Society of America 123 (3), 1562–1570.
Bosman, A.J., Smoorenburg, G.F., 1995. Intelligibility of Dutch CVC syllables and sentences for listeners with normal hearing and with three types of hearing impairment. Audiology 34 (5), 260–284.
Brand, T., Kollmeier, B., 2002. Efficient adaptive procedures for threshold and concurrent slope estimates for psychophysics and speech intelligibility tests. Journal of the Acoustical Society of America 111 (6), 2801–2810.
Bronkhorst, A.W., Plomp, R., 1988. The effect of head-induced interaural time and level differences on speech intelligibility in noise. Journal of the Acoustical Society of America 83 (4), 1508–1516.
Brungart, D.S., Iyer, N., 2012. Better-ear glimpsing efficiency with symmetrically-placed interfering talkers. Journal of the Acoustical Society of America 132 (4), 2545–2556.
Brungart, D.S., Simpson, B.D., 2002. The effects of spatial separation in distance on the informational and energetic masking of a nearby speech signal. Journal of the Acoustical Society of America 112 (2), 664–676.
Cox, R.M., Alexander, G.C., Rivera, I.M., 1991. Comparison of objective and subjective measures of speech intelligibility in elderly hearing-impaired listeners. Journal of Speech and Hearing Disorders 34, 904–915.
Drullman, R., Bronkhorst, A.W., 2000. Multichannel speech intelligibility and talker recognition using monaural, binaural, and three-dimensional auditory presentation. Journal of the Acoustical Society of America 107, 2224–2235.
Edmonds, B.A., Culling, J.F., 2006. The spatial unmasking of speech: evidence for better-ear listening. Journal of the Acoustical Society of America 120 (3), 1539–1545.
Freyman, R.L., Balakrishnan, U., Helfer, K.S., 2001. Spatial release from informational masking in speech recognition. Journal of the Acoustical Society of America 109 (5 Pt 1), 2112–2122.
Freyman, R.L., Helfer, K.S., McCall, D.D., Clifton, R.K., 1999. The role of perceived spatial separation in the unmasking of speech. Journal of the Acoustical Society of America 106, 3578–3588.
Garadat, S.N., Litovsky, R., 2006. Speech intelligibility in free field: spatial unmasking in preschool children. Journal of the Acoustical Society of America 121 (2), 1047–1055.
Hawley, M.L., Litovsky, R.Y., Culling, J.F., 2004. The benefit of binaural hearing in a cocktail party: effect of location and type of interferer. Journal of the Acoustical Society of America 115 (2), 833–843.
Kociński, J., Sęk, A.P., 2005. Speech intelligibility in various spatial configurations of background noise. Archives of Acoustics 30 (2), 173–191.
Kollmeier, B., Wesselkamp, M., 1997. Development and evaluation of a sentence test for objective and subjective speech intelligibility assessment. Journal of the Acoustical Society of America 102 (4), 1085–1099.
Lin, W.Y., Feng, A.S., 2003. GABA is involved in spatial unmasking in the frog auditory midbrain. Journal of Neuroscience 23, 8143–8151.
Litovsky, R., 2005. Speech intelligibility and spatial release from masking in young children. Journal of the Acoustical Society of America 117 (5), 3091–3099.
Muller, C., 1992. Perzeptive Analyse und Weiterentwicklung eines Reimtestverfahrens für Sprachaudiometrie. Göttingen.
Ozimek, E., Kutzner, D., Sęk, A., Wicher, A., 2009. Polish sentence test for measuring the intelligibility of speech in interfering noise. International Journal of Audiology 48, 440–450.
Peissig, J., Kollmeier, B., 1997. Directivity of binaural noise reduction in spatial multiple noise-source arrangements for normal and impaired listeners. Journal of the Acoustical Society of America 101, 1660–1670.
Shinn-Cunningham, B.B., Schickler, J., Kopocko, N., Litovsky, R., 2001. Spatial unmasking of nearby speech sources in a simulated anechoic environment. Journal of the Acoustical Society of America 110, 1118–1129.
Versfeld, N.J., Daalder, L., Festen, J.M., Houtgast, T., 2000. Method for the selection of sentence material for efficient measurement of the speech reception threshold. Journal of the Acoustical Society of America 107, 1671–1684.
Yost, W.A., Popper, A.N., Fay, R.R., 1993. Human Psychophysics. Springer-Verlag, New York.
