SPEECH COMMUNICATION ELSEVIER
Speech Communication 14 (1994) 103-118
The masking of narrowband noise by broadband harmonic complex sounds and implications for the processing of speech sounds
Changxue Ma, Douglas O'Shaughnessy
INRS-Telecommunications, 16 Place du Commerce, Ile des Soeurs, Quebec, H3E 1H6, Canada
(Received 19 May 1993; revised 26 September 1993)
Abstract

The evaluation of processed and synthesized speech is closely related to the auditory perception of complex sounds. An understanding of the perception of complex sounds is therefore helpful for improving the quality of processed sounds. The perceptual study of speech sounds in this paper is mainly concerned with auditory masking. Unlike most such studies, the targets in our experiment are narrowband noise signals and the maskers are wideband harmonic complex sounds. We show that the detection of targets at low frequencies is mainly determined by the spectral properties of the maskers. At high frequencies, the detection of targets is predominantly determined by the temporal behaviour of the maskers. The relative contributions of spectral and temporal analysis strongly depend on the fundamental frequency of the masker. Better temporal resolution is associated with a higher masker level.
Key words: Auditory masking; Speech processing

0167-6393/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0167-6393(93)E0077-9
1. Introduction
The evaluation of processed and synthesized speech is closely related to the auditory perception of complex sounds. An understanding of the perception of complex sounds is therefore helpful for improving the quality of processed sounds. Because the auditory system has limited frequency and time resolution, speech signals can be manipulated or transformed for different purposes without damaging their subjective sound quality. Phase equalization and dispersion techniques can be used to manipulate the short-term phase spectra of speech (Moriya and Honda, 1986; Quatieri et al., 1990; Griffin and Lim, 1984). Such manipulations take advantage of the fact that the human auditory system seems rather insensitive to phase, although under certain conditions the phase spectrum does play a role in the judgment of sound quality (Plomp and Steeneken, 1969). For the same reason, LPC synthesis with pulse excitation produces speech with high intelligibility (Flanagan, 1972). In particular, speech or audio signals can be encoded at a very low bit rate without audible distortion after reproduction (Schroeder et al., 1979). Spectral weighting has been used to shape the spectrum of the noise so that its power spectrum is similar to that of the speech and the noise is masked effectively (Schroeder et al., 1979). This technique has been remarkably successful in the coding of wideband audio signals (Johnston, 1988). One of the main factors in this achievement is the masking behaviour of the auditory system (Zwicker and Fastl, 1990).
Auditory masking phenomena have been widely studied by using simple stimuli (Zwicker and Fastl, 1990). These studies provide masking thresholds of sounds and a model for the working mechanism of the auditory system. The use of simple signals with a very compact distribution in the frequency domain (sinusoids) as maskers or targets can provide an appropriate measure of frequency resolution. On the other hand, sounds with compact time distribution are utilized to measure temporal resolution (Zwicker and Fastl, 1990). The spectral weighting in speech coding is based on the masking patterns of pure tones by noise bands or masking of noise bands by pure tones (Schroeder et al., 1979). Masking studies of pure-tone targets masked by harmonic complex sounds (Duifhuis, 1970; Schroeder and Mehrgardt, 1982; Kohlrausch, 1988) have revealed the ability of the auditory system to perform spectral and temporal analysis. The response of the auditory system to complex signals like speech plus noise signals, however, cannot easily be predicted from the response to simple sinusoids. In this paper we are mainly concerned with auditory masking phenomena using sounds with complex spectra. Therefore, we have used harmonic complex sounds and synthetic vowels as maskers and noise bands as the targets in our experiments. The maskers are still simple compared to natural speech, but they make it possible to systematically study some important aspects of masking by complex sounds. We will first study the masking of narrowband noise of a critical bandwidth by equal-amplitude and zero-phase broadband harmonic complexes.
This experiment focuses on how the detection thresholds of the narrowband noise targets vary as a function of (1) their center frequency and (2) the fundamental frequency of the maskers. In Section 4 experiments are performed to study how the spectral tilt and level of the maskers influence target thresholds. The phase effects of the maskers on target thresholds are studied in Section 5. As an application to speech perception, we use synthetic vowels as maskers in the final experiment in Section 6.
2. Experimental method

2.1. Procedure

A two-interval, two-alternative, forced-choice (2I2AFC), adaptive procedure was used to determine thresholds (Levitt, 1971). Each stimulus interval contained either 200 ms of masker alone or 200 ms of masker plus target, both intervals including 25-ms onset and offset half-cycle cosine ramps. The pause between the two stimulus intervals was 500 ms and their order of presentation was random. The target level was initially well above the expected threshold; it was then decreased after two consecutive correct responses and increased after each incorrect response. A step size of 8 dB was used for the first three reversals (a reversal being defined as a transition from down to up, or vice versa) and 2 dB for the 11 reversals that followed. The average of the midpoints between consecutive reversals, excluding the first three points, was taken as the threshold level. This procedure was repeated three times for each parameter and subject. The response time was controlled by the subjects.

Stimuli were generated by using two identical 16-bit D/A converters operated at a sampling frequency of 20 kHz, and were filtered by two low-pass filters with a cut-off frequency of 7.8 kHz and an attenuation rate of 90 dB/octave. The masker and target levels were controlled by programmable analog attenuators. The stimuli were presented diotically through Etymotic Research ER-2 insert earphones. Colleagues from the laboratory as well as paid subjects participated in the experiments. They had normal hearing sensitivity for pure tones and had experience with this experiment.

Fig. 1. (a) Waveform of the masker with a flat spectrum and zero phase. (b) Masker with a spectral slope of -3 dB/oct and zero phase. (c) Masker with a spectral slope of -6 dB/oct and zero phase. (d) Masker with a flat spectrum and alternating phase. (e) The $m_-$ masker. (f) The $m_+$ masker. Waveforms in (a)-(d) are normalized to the same RMS value, while the RMS value for the Schroeder-phase maskers is a factor of 5 larger. For the latter two, the time scale is also increased by a factor of 2.
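The tracking rule of Section 2.1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' software: the function name `run_staircase`, the callback `respond` and the deterministic `toy_listener` are hypothetical names introduced here, while the step sizes, reversal counts and averaging rule follow the description in the text. A rule that lowers the level after two consecutive correct responses and raises it after each error converges on the 70.7%-correct point of the psychometric function (Levitt, 1971).

```python
from statistics import mean


def run_staircase(respond, start_level, big_step=8.0, small_step=2.0,
                  n_big=3, n_small=11):
    """Two-down, one-up adaptive track.

    respond(level) is a hypothetical callback that returns True when the
    listener picks the correct interval at the given target level (dB).
    Returns (threshold_estimate, reversal_levels).
    """
    level = start_level
    consecutive_correct = 0
    direction = 0            # -1: last change was down, +1: up, 0: none yet
    reversals = []

    while len(reversals) < n_big + n_small:
        if respond(level):
            consecutive_correct += 1
            if consecutive_correct < 2:
                continue     # a single correct response leaves the level unchanged
            consecutive_correct = 0
            new_direction = -1
        else:
            consecutive_correct = 0
            new_direction = +1

        # A reversal is a change from down to up or vice versa.
        if direction != 0 and new_direction != direction:
            reversals.append(level)
        direction = new_direction

        # 8-dB steps until three reversals have occurred, 2-dB steps afterwards.
        step = big_step if len(reversals) < n_big else small_step
        level += new_direction * step

    # Average the midpoints between consecutive reversals, skipping the first three.
    used = reversals[n_big:]
    midpoints = [(a + b) / 2.0 for a, b in zip(used, used[1:])]
    return mean(midpoints), reversals


def toy_listener(level, true_threshold=30.0):
    """Idealised (noise-free) listener, an assumption made for this demo only."""
    return level >= true_threshold


threshold, reversal_levels = run_staircase(toy_listener, start_level=60.0)
print(threshold)
```

With this idealised listener the track settles into an alternation between 28 and 30 dB, so the estimate is 29 dB. A real listener responds probabilistically, which is why the experiment repeats each track three times.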
2.2. Stimuli

Maskers were synthesized by adding harmonics, according to the formula

$$ s(t) = \sum_{i=1}^{N} A_i \cos(2\pi i f_0 t + \psi_i), \qquad (1) $$

where $f_0$ is the fundamental frequency and $N$ was chosen so that the spectrum of $s(t)$ covered the frequency range up to 10 kHz. The choices for amplitude and phase parameters are enormous for harmonic complexes. We chose them so that the harmonic complex sounds were close to speech sounds and reflected the working mechanism of the auditory system. The global spectral slopes of the maskers were chosen to be 0 dB/oct (flat spectrum, $A_i = 1$), -3 dB/oct ($A_i = 1/\sqrt{i}$), and -6 dB/oct ($A_i = 1/i$). For zero-phase maskers, $\psi_i$ was equal to zero for all $i$. For cosine-sine alternating-phase stimuli, $\psi_i$ was equal to $\pi/2$ for odd harmonic numbers and zero for even harmonic numbers. For maskers with the two Schroeder-phase conditions, $\psi_i = -i(i+1)\pi/N$ and $\psi_i = +i(i+1)\pi/N$ (Schroeder, 1970), which will be called masker $m_-$ and masker $m_+$, respectively. Schroeder-phase maskers have low peak factors in the temporal waveform; this contrasts with the zero-phase maskers, which have high peak factors. Waveforms of the maskers with alternating phase are quasi-periodic, with the quasi-period being half the period of the zero-phase masker. Fig. 1 shows that the zero-phase masker has a much larger peak factor than the Schroeder-phase maskers. The peak factor also becomes smaller for complexes with a tilted spectral slope.

The targets were narrowband noise signals whose bandwidth was equal to or smaller than the critical bandwidth. The use of these narrowband noise signals restricted the study to the detectability of the targets as determined by the temporal and spectral variations of the maskers. These target signals were calculated according to the following formula:

$$ n(t) = C \sum_{i} \cos(2\pi i f_n t + \phi_i), \qquad (2) $$

where $\phi_i$ is the phase angle randomly distributed over the range $(-\pi, \pi)$, $f_n$ was equal to 4 Hz, and the range for $i$ was chosen such that the above formula could produce a particular narrowband noise. The noise signals varied randomly from trial to trial in our experiments. The threshold of the noise band in a specified frequency region was calculated as the ratio of the average energy of the noise to that of the masker in a 1-Hz band. In other words, the target threshold, $T_D$, was defined as
$$ T_D = 10 \log_{10}\left[\frac{C^2 f_0}{A^2 f_n}\right], \qquad (3) $$

where $A^2/2f_0$ and $C^2/2f_n$ are the spectral power densities of the masker and the target at the center frequency of the noise band.
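The stimulus definitions of Section 2.2 lend themselves to a short numerical sketch. The code below is an illustration under stated assumptions rather than the authors' synthesis program: the function names, the 40-kHz sampling rate, the 950-1050 Hz example band and the unit target amplitude are all choices made here. It builds the harmonic maskers for the four phase conditions, compares their peak factors (peak divided by RMS), generates one random-phase noise target on the 4-Hz grid, and evaluates the threshold measure as the decibel ratio of $C^2/2f_n$ to $A^2/2f_0$.

```python
import math
import random


def masker_params(n_harm, slope="flat", phase="zero"):
    """Amplitudes A_i and phases psi_i for harmonics i = 1..n_harm."""
    exponent = {"flat": 0.0, "-3dB/oct": 0.5, "-6dB/oct": 1.0}[slope]
    amps = [i ** -exponent for i in range(1, n_harm + 1)]
    if phase == "zero":
        phases = [0.0] * n_harm
    elif phase == "alternating":          # pi/2 for odd i, zero for even i
        phases = [math.pi / 2 if i % 2 else 0.0 for i in range(1, n_harm + 1)]
    elif phase in ("m-", "m+"):           # Schroeder phases as given in the text
        sign = -1.0 if phase == "m-" else 1.0
        phases = [sign * i * (i + 1) * math.pi / n_harm
                  for i in range(1, n_harm + 1)]
    else:
        raise ValueError(phase)
    return amps, phases


def synthesize(amps, phases, f_base, times, first_index=1):
    """Sum of cosines at multiples of f_base, starting at first_index."""
    return [sum(a * math.cos(2 * math.pi * (first_index + k) * f_base * t + p)
                for k, (a, p) in enumerate(zip(amps, phases)))
            for t in times]


def peak_factor(x):
    """Peak magnitude divided by RMS value."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return max(abs(v) for v in x) / rms


def noise_target(f_lo, f_hi, times, f_n=4.0, c=1.0, rng=random):
    """Equal-amplitude components on a 4-Hz grid with random phases."""
    first = math.ceil(f_lo / f_n)
    last = math.floor(f_hi / f_n)
    phases = [rng.uniform(-math.pi, math.pi) for _ in range(first, last + 1)]
    return synthesize([c] * len(phases), phases, f_n, times, first_index=first)


def threshold_db(c, a, f0, f_n=4.0):
    """10 log10 of (C^2 / 2 f_n) divided by (A^2 / 2 f0)."""
    return 10.0 * math.log10((c * c * f0) / (a * a * f_n))


FS = 40000.0                                # assumed sampling rate (Hz)
F0 = 100.0                                  # masker fundamental (Hz)
N = int(10000 / F0)                         # harmonics up to 10 kHz
t = [n / FS for n in range(int(FS / F0))]   # one masker period

pf = {ph: peak_factor(synthesize(*masker_params(N, "flat", ph), F0, t))
      for ph in ("zero", "alternating", "m-", "m+")}
target = noise_target(950.0, 1050.0, t)

for name, value in pf.items():
    print(name, round(value, 2))
print(round(threshold_db(1.0, 1.0, F0), 2))
```

The zero-phase complex concentrates its energy in one pulse per period, so its peak factor is far larger than that of the Schroeder-phase complexes, in line with the waveforms described for Fig. 1.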
3. Experiment 1. Detection of a narrowband noise of critical-band width in broadband spectrally-flat and zero-phase harmonic complexes
This experiment is concerned with the threshold of narrowband noise as a function of the fundamental frequency of the masker and of the center frequency of the target. The broadband maskers (0-10 kHz) were spectrally-flat harmonic complex sounds with the initial phases of the harmonics set to zero. The fundamental frequencies of the complexes were 100, 150, 200, 250 and 400 Hz. Thus a typical masker was $\sum_{i=1}^{N} \cos(2\pi i f_0 t)$, where $N f_0 = 10$ kHz. Noise bands with critical-band widths served as targets. The values of the critical-band width were computed according to the formula proposed by Zwicker and Terhardt (1980). When the spectral spacing of the masker components was larger than the bandwidth of the target signal, the spectrum of the target was either centered on a specific harmonic or placed between two successive harmonics. When the bandwidths of the noise targets were greater than the spacing of two successive harmonics, the noise targets were added without consideration of the harmonic structures of the maskers. The maskers were presented at a sound pressure level of 80 dB.
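The placement rule for the targets can be made concrete with a small sketch. The paper does not print the critical-band formula, so the expression below, $25 + 75\,[1 + 1.4 (f/\mathrm{kHz})^2]^{0.69}$ Hz, is the commonly quoted analytic form attributed to Zwicker and Terhardt (1980) and is used here as an assumption; the function names are likewise hypothetical.

```python
def critical_bandwidth_hz(f_hz):
    """Critical bandwidth (Hz) at center frequency f_hz, analytic approximation."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


def target_placement(f0_hz, fc_hz):
    """Decide how a critical-band noise target relates to the masker harmonics.

    If the harmonic spacing (f0) exceeds the target bandwidth, the target is
    centered on a harmonic or between two harmonics; otherwise the harmonic
    structure is ignored.
    """
    bandwidth = critical_bandwidth_hz(fc_hz)
    if f0_hz > bandwidth:
        return "on or between harmonics", bandwidth
    return "harmonic structure ignored", bandwidth


for f0 in (100.0, 400.0):
    for fc in (500.0, 1000.0, 2000.0, 4000.0):
        rule, bw = target_placement(f0, fc)
        print(f"f0={f0:.0f} Hz, fc={fc:.0f} Hz: bandwidth {bw:.0f} Hz, {rule}")
```

Under this approximation the critical band never falls below 100 Hz, so the harmonic-aligned placement only arises for the maskers with higher fundamental frequencies.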
3.1. Results

Since the results from four subjects were similar, only the averages of the measurements are presented. The standard deviations are indicated by vertical bars in our data. Figs. 2(a-e) show the results for masker fundamental frequencies of 100, 150, 200, 250 and 400 Hz, respectively. One sees from panels (a)-(c) that, in the high-frequency region, the threshold of the target decreases with an increase of the center frequency of the noise band. On the other hand, in the low-frequency region in panels (b) and (c), and to a certain extent in panel (a), the threshold of the noise target increases globally with an increase of the center frequency of the noise band until the threshold reaches a maximum whose frequency position is dependent on the fundamental frequency of the masker. Besides the global increase of the thresholds of the noise targets towards high frequencies, the thresholds of the noise bands in the low-frequency region also show peaks and dips (most clearly seen in panels (c) and (d)). Peaks occur for noise targets which have a center frequency equal to the frequency of a harmonic of the masker, and dips occur for noise targets which are situated between two successive masker harmonics.
3.2. Discussion

The masking patterns in this experiment are strongly dependent on the relationships between the fundamental frequency of the masker and the bandwidth of the auditory filter. In the low-frequency region, if the bandwidth of the filter is smaller than or close to the fundamental frequency of the masker, the threshold for noise targets is predominantly determined by the sharpness of spectral resolution. In the high-frequency region, auditory filters with a wide bandwidth pass more than three harmonics and the interaction of these harmonics produces a temporally modulated waveform. In this case, the detection of noise targets is predominantly determined by the temporal resolution of the auditory system. Thresholds which are mainly determined by spectral resolution reflect the harmonic structure of the maskers. In the low-frequency region, the masking patterns show clear peaks and dips for the maskers with high fundamental frequencies because the critical-band widths of the auditory system at low frequencies are smaller than the spacing between harmonics. For maskers with high fundamental frequencies, detection is dominated by spectral resolution up to frequencies as high as 3-4 kHz (see Figs. 2(d) and 2(e)).

For maskers with a low fundamental frequency, the thresholds of high-frequency noise targets decrease as a consequence of the increasingly better temporal resolution of the auditory channels at high frequencies and as a result of the energy increase of the target as critical bands expand towards higher frequencies. In these situations, the response of the auditory filter to the pulsed maskers decays quickly and the filtered waveforms are therefore more deeply modulated. For a given channel, this modulation becomes shallower for maskers with higher fundamental frequencies. Consequently, the thresholds of the noise bands in maskers with higher fundamental frequencies are higher. As the critical-band width increases monotonically with frequency, the spectral resolution degrades and the temporal resolution improves.
Fig. 2. Thresholds of noise bands of critical-band width in flat-spectrum, zero-phase maskers, plotted as a function of the center frequency of the noise band (abscissa: target frequency in Hz). The parameter in the panels is the fundamental frequency of the masker: (a) the 100-Hz masker, (b) the 150-Hz masker, (c) the 200-Hz masker, (d) the 250-Hz masker, (e) the 400-Hz masker. The target threshold represents the ratio of the spectral densities of target and masker expressed in decibels. The masker level is 80 dB SPL. Vertical bars represent standard deviations.
Masking patterns therefore show global maxima in the middle frequency region in Figs. 2(b-c), but the global maxima are less pronounced in Figs. 2(a) and 2(d), corresponding to maskers with fundamental frequencies of 100 and 250 Hz. The results suggest that the detection of targets in harmonic complex tones is optimally realized by listening either in the spectral valleys of the masker or in the valleys in the temporal envelopes of the masker. Since the envelopes of the auditory filter responses represent a distribution of the energy of the masker in the time-frequency plane, the masking patterns could be qualitatively explained by examining the valleys in the time-frequency distribution of the masker energy.

Figs. 3(a) and 3(b) show two examples of temporal envelopes of simulated auditory filter responses (center frequency 4 kHz) to the 100-Hz and 400-Hz maskers. The solid line represents the envelope of the response to the masker alone and the dotted line the response envelope for the masker plus noise target at threshold. We used a gamma-tone filter here, whose impulse response was $h(t) = t^{3} e^{-a\omega_0 t} \sin \omega_0 t$, where $a = 0.125$ and $\omega_0$ was the center frequency (Patterson, 1987). The envelopes were obtained by Hilbert transform and are represented on a decibel scale. These figures suggest that subjects may be able to listen in the temporal valleys of the 100-Hz masker and that therefore less target energy is necessary to reach the threshold. The masking period pattern of the harmonic complexes obtained with pulsed tones as targets (Duifhuis, 1971) clearly showed that the detection of the target was realized by listening during the valleys of the cochlear-filter responses to the maskers.

The masking patterns in this experiment have been assumed to be given by the envelope of the output of the auditory filters. No attempt was made to account for nonlinear characteristics and phase dispersion of the auditory system, although they are very important factors for a quantitative description of temporal masking patterns (Jesteadt et al., 1982; Kohlrausch, 1988). These two factors will be studied in the experiments that follow.
4. Experiment 2. Masking of narrowband noise by broadband harmonic complex sounds as a function of spectral tilt and level

Natural sounds generally have some spectral tilt. It is therefore more realistic to investigate the detection of noise targets in such maskers. Tilting the spectrum of the masker while keeping its overall level constant redistributes the energy of the masker in the frequency domain. The local spectral level of the masker is therefore a function of component frequency, and the waveform of the maskers is more dispersed in time than for the flat-spectrum signal (see Fig. 1). Therefore, to understand the influences of masker level and spectral tilt on the threshold, the maskers with spectral tilts were presented at different sound pressure levels. Narrowband noise targets were centered at 500, 1000, 2000 and 4000 Hz, with bandwidths of 100, 100, 200 and 400 Hz, respectively. The sound pressure level of the masker was changed from 44 dB to 64 dB in steps of 5 dB for maskers with a spectral slope of 0 dB/oct, and in steps of 10 dB for maskers with spectral slopes of -3 and -6 dB/oct. For spectrally tilted maskers, the threshold of the narrowband noise targets was defined as the spectral density relation between the target and the masker at the center frequency of the target noise, expressed in decibels.

Fig. 3. Envelopes of gamma-tone filter responses at 4 kHz for (a) the 100-Hz and (b) the 400-Hz masker, both with a flat spectrum and zero phase, plotted against time t (ms). Dotted lines show the envelopes for masker plus noise signal at threshold.
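The envelope comparison of Fig. 3 (Section 3.2) can be approximated with a linear steady-state calculation. The sketch below is an illustration, not the simulation used in the paper. It assumes a fourth-order gamma-tone (a $t^3$ onset term) with $a = 0.125$, evaluates the filter's transfer function at each masker harmonic, and sums the positive-frequency components, which is equivalent to taking the Hilbert envelope of the periodic response. Function names, the 200-point time grid and the normalization are choices made here.

```python
import cmath
import math


def gammatone_gain(f_hz, fc_hz, a=0.125):
    """Transfer function of h(t) = t^3 exp(-a w0 t) sin(w0 t) at frequency f_hz."""
    w0 = 2 * math.pi * fc_hz
    b = a * w0
    s = 1j * 2 * math.pi * f_hz
    # Laplace transform of t^3 exp(-b t) exp(+/- j w0 t) is 6 / (s + b -/+ j w0)^4.
    return (6 / (2j)) * (1 / (s + b - 1j * w0) ** 4 - 1 / (s + b + 1j * w0) ** 4)


def envelope_db(f0_hz, fc_hz=4000.0, f_max_hz=10000.0, n_points=200):
    """Envelope (dB) over one period for a flat-spectrum, zero-phase masker."""
    n_harm = int(f_max_hz // f0_hz)
    ref = abs(gammatone_gain(fc_hz, fc_hz))      # normalize to the gain at fc
    gains = [gammatone_gain(i * f0_hz, fc_hz) / ref for i in range(1, n_harm + 1)]
    period = 1.0 / f0_hz
    env = []
    for k in range(n_points):
        t = k * period / n_points
        z = sum(g * cmath.exp(2j * math.pi * i * f0_hz * t)
                for i, g in enumerate(gains, start=1))
        env.append(20 * math.log10(abs(z)))
    return env


def modulation_depth_db(f0_hz):
    """Difference between envelope maximum and minimum within one period."""
    env = envelope_db(f0_hz)
    return max(env) - min(env)


depth_100 = modulation_depth_db(100.0)
depth_400 = modulation_depth_db(400.0)
print(round(depth_100, 1), round(depth_400, 1))
```

In this linear model the 4-kHz channel rings down almost completely between the pulses of the 100-Hz masker, whereas successive responses overlap for the 400-Hz masker, so the envelope valleys are much deeper at 100 Hz. A real cochlea is nonlinear, so the absolute depths printed here should not be read as predictions.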
4.1. Results

The average thresholds from three subjects are plotted in Figs. 4(a-c) against the center frequency of the narrowband noise target, with the sound pressure level as a parameter. The threshold increases when the masker sound pressure level decreases from 64 to 44 dB. The threshold increase is the largest for the spectrally-flat masker, and at high frequencies. The threshold increase is reduced for maskers with spectral slopes. Towards low frequencies, the level effect becomes small for all three maskers. In addition, for the -6 dB/oct masker, the target threshold at 500 Hz increases slightly with the increase of the masker level.
4.2. Discussion

The spectral slope of the masker has a significant influence on the thresholds of narrowband noise targets. The differences between the thresholds in the three spectrally-tilted maskers increase with an increase of the center frequency of the noise signal. There are two factors contributing to these threshold differences. On the one hand, the modulation depth in the temporal waveform at the output of the auditory filters is reduced as a result of spectral tilt. This can be seen in Fig. 5, where the envelopes of the auditory filter responses to the three maskers at 4 kHz are plotted on a decibel scale. The auditory filter is here simulated by a gamma-tone filter and the spectral levels of the three maskers at 4 kHz are identical. On the other hand, due to the spectral tilt, the local spectral levels are different when the three maskers are presented at the same overall level. Because the detection of narrowband noise targets at high frequencies is predominantly determined by the temporal resolution of the auditory system, level effects similar to those seen in temporal masking influence the noise thresholds. For forward masking, target thresholds do not correspond to a constant target-to-masker ratio; they decrease with an increase in masker level (Jesteadt et al., 1982). Since a steeper spectral slope of the masker is associated with a lower level for high-frequency masker components, these level effects are a second contribution to the high thresholds for sloping versus flat-spectrum maskers.

In contrast to thresholds at high frequencies, the threshold of a noise target at 500 Hz in a masker with a spectral slope of -6 dB/oct increases somewhat with an increase in masker level (see Fig. 4(c)). This behaviour is expected if spectral resolution and, especially, upward spread of masking play a dominant role (Wegel and Lane, 1924). For a masker with a slope of -6 dB/oct, the spectral level at low frequencies is quite high, e.g., the first harmonic of the masker is 18 dB higher than in the spectrally-flat masker. Due to the upward spread of masking, the threshold should therefore increase with an increase of masker level. Thus, two factors may contribute to the threshold differences at high frequencies in Fig. 4 for the spectrally-tilted maskers.

We now examine the results obtained by varying the overall masker level. Figs. 4(a-c) indicate that the threshold of the noise target (expressed relative to the spectrum level of the masker) changes systematically as the sound pressure level changes. The rate of the threshold change is strongly dependent on the center frequency of the noise target; see Fig. 6 for the spectrally-flat masker, where the thresholds are replotted as a function of the sound pressure level of the maskers, with the center frequency of the noise target as a parameter. The data in Fig. 6 are well fitted by straight lines, whose slope becomes steeper with an increase in the center frequency of the noise target. This implies that the influence of level on the target thresholds decreases as the auditory responses to the masker become less modulated.

Fig. 4. Thresholds of noise bands in the 100-Hz zero-phase maskers with spectral slopes of (a) 0 dB/oct, (b) -3 dB/oct and (c) -6 dB/oct, plotted against target frequency (Hz). The masker levels are 64 dB (△), 59 dB (□), 54 dB (O), 49 dB (o) and 44 dB (▽).

Fig. 5. Envelopes of gamma-tone-filter responses at 4 kHz for the 100-Hz maskers with three different spectral slopes: 0 dB/oct (solid line), -3 dB/oct (dashed line) and -6 dB/oct (dotted line), plotted against time t (ms). The maskers have the same harmonic amplitude at 4 kHz.

Fig. 6. Thresholds of noise bands in spectrally-flat, zero-phase maskers as a function of masker level (dB). The center frequency of the noise band is the parameter: 4000 Hz (△), 2000 Hz (+), 1000 Hz (▽) and 500 Hz (○).
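The redistribution of masker energy by spectral tilt, including the figure of about 18 dB quoted in Section 4.2 for the first harmonic of the -6 dB/oct masker, can be checked with a short calculation. The sketch assumes a 100-Hz masker with 100 harmonics (up to 10 kHz) and harmonic amplitudes proportional to $i^{-p}$ with $p = 0$, $1/2$ and $1$ for the three slopes; the function name is hypothetical.

```python
import math


def harmonic_level_re_flat_db(i, n_harm, exponent):
    """Level of harmonic i in a masker with A_k = k**(-exponent), relative to
    one harmonic of a flat-spectrum masker with the same overall power."""
    total_power = sum(k ** (-2.0 * exponent) for k in range(1, n_harm + 1))
    return 10.0 * math.log10(n_harm * i ** (-2.0 * exponent) / total_power)


N_HARM = 100   # 100-Hz fundamental, harmonics up to 10 kHz
for label, p in (("0 dB/oct", 0.0), ("-3 dB/oct", 0.5), ("-6 dB/oct", 1.0)):
    first = harmonic_level_re_flat_db(1, N_HARM, p)
    at_4k = harmonic_level_re_flat_db(40, N_HARM, p)
    print(f"{label}: first harmonic {first:+.1f} dB, 4-kHz harmonic {at_4k:+.1f} dB")
```

At equal overall level, the first harmonic of the -6 dB/oct masker comes out close to 18 dB above the flat-spectrum case, while its 4-kHz component lies well below it. This illustrates why the local masker level at high frequencies drops as the slope steepens.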
5. Experiment 3. Masking of narrowband noise by broadband harmonic complexes with different phase relations

This experiment investigates the influence of phase on the masking of narrowband noise. Firstly, two Schroeder-phase maskers ($m_-$ and $m_+$) were used with a fundamental frequency of 100 Hz. The maskers were presented at sound pressure levels of 44 and 64 dB. The noise targets were centered at frequencies of 500, 1000, 2000 and 4000 Hz, with bandwidths of 100, 100, 200 and 400 Hz, respectively. Secondly, we used a masker with a fundamental frequency of 100 Hz, an alternating-phase relationship for all harmonic components and a flat spectrum. The targets were narrowband noise signals of critical bandwidth, as used in Experiment 1. For comparison with the results from Experiment 1, the maskers were presented at a sound pressure level of 80 dB.

5.1. Results

The thresholds of the noise targets in the two Schroeder-phase maskers ($m_-$ and $m_+$) were averaged among three subjects and plotted in Fig. 7. Thresholds in zero-phase maskers with the same levels are also plotted for comparison. There are significant threshold differences for the three types of maskers, but only at high center frequencies for the noise targets. For the maskers at 64 dB SPL, the threshold at high frequencies is lowest in the zero-phase masker and is highest in the $m_-$ masker; the difference is 16 dB. For the maskers at 44 dB SPL, the differences between the thresholds are reduced and the thresholds for the $m_+$ and $m_-$ maskers are nearly identical.

Results for the alternating-phase masker are shown in Fig. 8. For comparison, the thresholds of critical-band noise targets in zero-phase maskers with fundamental frequencies of 100 and 200 Hz are replotted from Fig. 2. Comparing all thresholds for low target frequencies, the thresholds for alternating-phase and zero-phase maskers are close when the maskers have the same fundamental. For high target frequencies, on the other hand, the thresholds in the 100-Hz alternating-phase masker are within 5 dB of those for the 200-Hz zero-phase masker.

Fig. 7. Thresholds of noise bands for three maskers with a fundamental of 100 Hz: the zero-phase masker (○), the $m_-$ masker (△) and the $m_+$ masker (▽), plotted against target frequency (Hz). Panel (a): masker level 64 dB; panel (b): masker level 44 dB.

Fig. 8. Thresholds of critical-band noise targets in a 100-Hz masker with an alternating phase for all harmonics (▽), plotted against target frequency (Hz). Thresholds of the noise bands in a 100-Hz zero-phase masker (△) and in a 200-Hz zero-phase masker (○) are plotted for comparison. The masker level is 80 dB SPL.

5.2. Discussion
As the detection of noise targets at high frequencies is predominantly determined by the temporal waveform of the masker, the target threshold for the two types of Schroeder-phase maskers is expected to be higher than that for the zero-phase masker. Fig. 7 indeed shows that the noise targets masked by the zero-phase masker have the lowest thresholds. The difference between the thresholds for the two Schroeder-phase maskers, however, is not reflected in the envelopes of their waveforms; the envelopes of the cochlear-filter responses to the two maskers are used to explain it instead. Since the cochlear model implemented by Strube (1985) has been used to explain the threshold differences between the two Schroeder-phase maskers (Strube, 1985; Smith et al., 1986; Kohlrausch, 1988), we used this model to calculate the cochlear response to the three maskers, with the model parameters set according to Strube (1985). The response waveforms at a 4 kHz resonance frequency and their envelopes (plotted on a decibel scale) for the three maskers are shown in Fig. 9. We see that the valleys in the envelope for the zero-phase masker are the deepest and the widest, whereas the valleys in the envelope for the m− masker are the shallowest. The valleys in the envelope of the m+ masker are shallower and narrower than those for the zero-phase masker. This is in line with the experimental finding at 4 kHz that the lowest target threshold is obtained with the zero-phase masker and the highest with the m− masker, and that the threshold for the m+ masker is lower than that for the m− masker.

Fig. 9. Left panels: Waveforms of basilar-membrane filter responses, as a function of time t (ms), to (a) the m+ masker, (b) the m− masker, (c) the zero-phase masker. Right panels: Corresponding envelopes, plotted on a decibel scale. The resonance frequency of the filter is 4 kHz. The maskers are normalized to have the same RMS value.

The masking of the targets by the masker with an alternating-phase relationship differs at high frequencies from the masking by the 100-Hz zero-phase masker (see Fig. 8), because the alternating-phase manipulation introduces a secondary peak in the middle of a period. Therefore the 100-Hz masker with an alternating phase has a quasi-period of 5 ms, and the responses of the cochlear filters at high frequencies are very similar to the responses for the 200-Hz zero-phase masker. Its masking pattern in the high-frequency region is therefore close to that obtained with the 200-Hz masker. This is clearly illustrated in Fig. 10, where the responses of the basilar-membrane filter at 4 kHz to these three maskers are plotted. The valleys in the log-envelope plot for the 100-Hz zero-phase masker are much deeper than those for the 100-Hz masker with an alternating-phase relationship. On the other hand, the log-envelope plots for the 200-Hz zero-phase masker and the alternating-phase masker are similar. In the low-frequency region, frequency resolution plays a dominant role in determining the masking thresholds. Therefore, the phase choice does not influence the masked threshold.

Fig. 10. Left panels: Waveforms of basilar-membrane filter responses, as a function of time t (ms), to (a) the 100-Hz zero-phase masker, (b) the 100-Hz masker with all harmonics in an alternating phase, (c) the 200-Hz zero-phase masker. Right panels: Corresponding envelopes, plotted on a decibel scale. The resonance frequency of the filter is 4 kHz. The maskers are normalized to have the same RMS value.
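The quasi-period argument for the alternating-phase masker can be checked numerically with a minimal sketch. It is not the Strube (1985) cochlear model: the 4-kHz filter is replaced by a crude brick-wall band that keeps only the harmonics between 3.5 and 4.5 kHz, and the alternating-phase relationship is assumed here to mean even harmonics in cosine phase and odd harmonics in sine phase. The sampling rate and all function names are illustrative assumptions.

```python
import cmath
import math

FS_HZ = 48000.0          # assumed sampling rate
BAND_100 = range(35, 46)  # harmonics of 100 Hz from 3.5 to 4.5 kHz (brick-wall stand-in for a 4-kHz filter)
BAND_200 = range(18, 23)  # harmonics of 200 Hz from 3.6 to 4.4 kHz


def envelope(f0_hz, harmonics, phase_of, n_samples, fs=FS_HZ):
    """Hilbert envelope of sum_n cos(2*pi*n*f0*t + phase_of(n)) over the given harmonics."""
    env = []
    for i in range(n_samples):
        t = i / fs
        # The analytic signal of a sum of cosines is the sum of the matching complex exponentials.
        analytic = sum(
            cmath.exp(1j * (2.0 * math.pi * n * f0_hz * t + phase_of(n)))
            for n in harmonics
        )
        env.append(abs(analytic))
    return env


def zero_phase(n):
    return 0.0


def alternating_phase(n):
    # Assumed definition: even harmonics in cosine phase, odd harmonics in sine phase.
    return 0.0 if n % 2 == 0 else -math.pi / 2.0


N = 480        # 10 ms at 48 kHz, one period of the 100-Hz maskers
HALF = N // 2  # 5 ms

env_zero_100 = envelope(100.0, BAND_100, zero_phase, N)
env_alt_100 = envelope(100.0, BAND_100, alternating_phase, N)
env_zero_200 = envelope(200.0, BAND_200, zero_phase, N)

# Largest difference between the first and second 5-ms halves of each envelope.
repeat_error_alt = max(abs(a - b) for a, b in zip(env_alt_100[:HALF], env_alt_100[HALF:]))
repeat_error_zero = max(abs(a - b) for a, b in zip(env_zero_100[:HALF], env_zero_100[HALF:]))
print(f"alternating phase, 100 Hz: {repeat_error_alt:.2e}")
print(f"zero phase, 100 Hz:        {repeat_error_zero:.2e}")
```

Under these assumptions the envelope of the alternating-phase 100-Hz complex repeats every 5 ms, as does that of the 200-Hz zero-phase complex, whereas the 100-Hz zero-phase envelope has a single main peak per 10-ms period.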
6. Experiment 4. Masking of narrowband noise by synthetic vowel sounds
As an application to speech perception, it is desirable to determine whether the masking patterns of speech sounds such as vowels can be understood in terms of the concepts presented earlier in this paper. Since the spectra of vowel sounds consist of several formant regions represented by peaks and valleys, the spectral level of a vowel masker is a function of frequency. This experiment therefore investigates whether these differences in the spectral levels of vowel maskers influence the thresholds of noise targets. Vowel sounds were synthesized by using spectrally flat zero-phase harmonic complexes with fundamental frequencies of 100 and 200 Hz as inputs to a linear-predictive-coding (LPC) filter. The formant frequencies of the vowel were 680, 1110, 2347, 3202 and 4500 Hz. The targets were noise bands of critical-band width as used in Experiment 1. The targets were passed through the same LPC filter as used for the vowel sound and therefore had locally the same spectral envelope as the vowel sound. The threshold of the target was then defined as the spectral level difference between the masker and the target. All vowel maskers were presented at a sound pressure level of 80 dB.
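The synthesis described above can be sketched as follows. The sketch replaces the LPC filter by a cascade of two-pole resonators at the five formant frequencies given in the text and evaluates its gain at the harmonics of a flat-spectrum complex. The sampling rate and the common formant bandwidth are assumptions (the text does not state them), only magnitudes are computed, and the function names are illustrative.

```python
import cmath
import math

FORMANTS_HZ = (680.0, 1110.0, 2347.0, 3202.0, 4500.0)  # from the text
FS_HZ = 16000.0       # assumed sampling rate (not stated in the text)
BANDWIDTH_HZ = 80.0   # assumed bandwidth for every formant (not stated in the text)


def allpole_gain_db(freq_hz, formants=FORMANTS_HZ, bw=BANDWIDTH_HZ, fs=FS_HZ):
    """Gain in dB, at freq_hz, of a cascade of two-pole resonators."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    r = math.exp(-math.pi * bw / fs)  # pole radius corresponding to the bandwidth
    gain = 1.0
    for f_k in formants:
        theta = 2.0 * math.pi * f_k / fs  # pole angle
        denom = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r * z_inv) ** 2
        gain /= abs(denom)
    return 20.0 * math.log10(gain)


def vowel_harmonic_levels(f0_hz, f_max_hz=5000.0):
    """(frequency in Hz, relative level in dB) of each harmonic after filtering.

    The input complex is spectrally flat, so each harmonic level equals the filter gain.
    """
    n_harm = int(f_max_hz // f0_hz)
    return [(n * f0_hz, allpole_gain_db(n * f0_hz)) for n in range(1, n_harm + 1)]


levels_100 = vowel_harmonic_levels(100.0)
levels_200 = vowel_harmonic_levels(200.0)
for freq, level in levels_200[:6]:
    print(f"{freq:6.0f} Hz  {level:6.1f} dB")
```

With these assumed parameters the envelope has peaks at the formant frequencies and a valley between the second and third formants, and the 200-Hz complex samples that envelope only half as densely as the 100-Hz complex.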
6.1. Results
The average thresholds for three subjects are plotted in Figs. 11(a,b) for vowel fundamentals of 100 and 200 Hz, respectively. The spectral envelope of the vowel masker is plotted in Fig. 11(c). In contrast to Experiments 1 and 2, there is no global decrease of threshold towards high frequencies. At low target frequencies, Fig. 11(b) shows threshold dips at 300 and 500 Hz and peaks at 400 and 600 Hz for the 200-Hz vowel masker, since masker harmonics are spectrally well resolved. By comparing Figs. 11(a) and 11(c), it can be seen that threshold peaks correspond to spectral valleys and threshold valleys correspond to spectral peaks of the vowel masker.
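A rough way to see why the harmonics of the 200-Hz masker are resolved at low frequencies is to count how many harmonics fall, on average, into one critical band. The sketch below uses the analytical critical-bandwidth expression of Zwicker and Terhardt (1980); treating fewer than about one harmonic per band as "resolved" is a rule of thumb assumed here, not a criterion taken from the experiments.

```python
def critical_bandwidth_hz(f_hz):
    """Critical bandwidth after Zwicker and Terhardt (1980): 25 + 75 [1 + 1.4 (f/kHz)^2]^0.69 Hz."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69


def harmonics_per_band(center_hz, f0_hz):
    """Average number of masker harmonics falling into one critical band."""
    return critical_bandwidth_hz(center_hz) / f0_hz


for f0 in (100.0, 200.0):
    for fc in (400.0, 1000.0, 4000.0):
        ratio = harmonics_per_band(fc, f0)
        print(f"f0 = {f0:.0f} Hz, fc = {fc:.0f} Hz: {ratio:.2f} harmonics per critical band")
```

At 400 Hz the 200-Hz masker places less than one harmonic in a critical band, whereas at 4 kHz several harmonics of either masker interact within one band, which is where the temporal envelope becomes the relevant cue.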
6.2. Discussion
In principle, the (relative) threshold in vowel maskers should be globally less frequency-dependent than in a flat-spectrum zero-phase masker (see Experiment 1), because the spectrum of a vowel sound has an overall slope of -6 dB/oct. As shown in Experiment 3, the threshold of the noise target increases as a result of the spectral slope of the masker. In Fig. 11(a) the valleys in the masking pattern were located approximately at the formant frequencies, i.e., at the spectral peaks of the masker, and the peaks of the masking pattern were located at the spectral valleys of the masker. In particular, the threshold dips around 1110 and 2350 Hz correspond to the second and the third formant frequencies of the vowel masker. The threshold peak at about 1900 Hz corresponds to the spectral valley at 1900 Hz. These low thresholds at the formant frequencies are quite similar to the result of Experiment 3, where the thresholds of noise targets in the maskers with high levels have a lower spectral density ratio.

Fig. 11. Thresholds of critical-band noise targets for vowel maskers as a function of center frequency of the noise band. (a) 100-Hz vowel masker. (b) 200-Hz vowel masker. (c) Spectral envelope of the vowel masker.

Due to the limited frequency resolution of the auditory system, the masking pattern of a vowel sound is a blurred version of its physical spectrum (Moore and Glasberg, 1983; Tyler and Lindblom, 1982; Houtgast, 1974). The threshold difference between peaks and valleys in the masking pattern is less than that found in the physical spectrum of the masker. Our result suggests that, among other reasons, the reduction of the peak-valley difference in the masking pattern can also be attributed to the level effects due to the spectral peaks and valleys in the vowel masker. The thresholds expressed in spectral density ratios in this experiment can easily be transformed to absolute target levels by incorporating masker power density and target bandwidth. The threshold values transformed in this way are plotted in Fig. 12. They show indeed that the masked threshold curve is a blurred version of the physical spectrum of the masker.

Fig. 12. Thresholds of critical-band noise targets in the 100-Hz vowel masker, expressed in dB SPL.

The masking pattern of the 200-Hz vowel masker of Fig. 11(b) clearly reflects the spectral composition of the masker at low frequencies, and the formant regions are not well delineated. This is a consequence of the spectral resolution of the auditory system associated with a vowel with a high fundamental frequency. Only at high frequencies do threshold peaks occur at spectral valleys of the envelope of the vowel masker, at about 1900 and 3000 Hz.
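The transformation from a spectral density ratio to an absolute target level mentioned in this discussion amounts to simple decibel arithmetic, sketched below. The sign convention (ratio = target spectral level minus masker spectral level) is an assumption, since the text does not spell it out, and the numbers in the example are hypothetical rather than measured values.

```python
import math


def target_level_db_spl(masker_spectral_level_db, ratio_db, target_bandwidth_hz):
    """Overall level of a noise band, given its threshold as a spectral density ratio.

    Assumption: ratio_db is the target spectral level minus the masker spectral
    level, both in dB re 20 uPa per Hz, at the target center frequency.
    """
    target_spectral_level_db = masker_spectral_level_db + ratio_db
    # Integrating a flat power density over the band adds 10 log10(bandwidth).
    return target_spectral_level_db + 10.0 * math.log10(target_bandwidth_hz)


# Hypothetical numbers: masker density 40 dB/Hz, threshold ratio -10 dB, 100-Hz-wide target.
example_level = target_level_db_spl(40.0, -10.0, 100.0)
print(f"target level at threshold: {example_level:.1f} dB SPL")
```

Because the bandwidth term grows with center frequency for critical-band targets, the same spectral density ratio corresponds to a higher absolute target level at high frequencies than at low frequencies.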
7. General discussion
The auditory system can be modelled as a bank of bandpass filters with varying frequency resolution. Frequency resolution decreases and,
accordingly, temporal resolution increases with an increase of the center frequency of the filter. Therefore, the detection of targets at low frequencies is mainly determined in the frequency domain by a spectral analysis of the masker. At high frequencies, on the other hand, the detection of targets is predominantly determined by a temporal analysis of the masker. The relative contributions of spectral and temporal analysis strongly depend on the fundamental frequency of the masker. The auditory system can therefore easily detect targets at high frequencies when the envelopes of auditory-filter responses show deep valleys. Similarly, targets at low frequencies can be easily detected in maskers with high fundamental frequencies. The temporal resolution changes nonlinearly with masker level. A better resolution is associated with a higher masker level. Therefore, in deeply modulated maskers with high levels the targets are detected more easily (if the threshold is expressed as a spectral density ratio of target and masker) than in low level maskers. For maskers with spectral slopes, such as speech sounds, the thresholds of targets at high frequencies increase due to the low spectral level associated with a strong spectral slope. The audibility of quantization noise is one of the main concerns in speech coding. The spectral weighting technique is based on the masking of pure tones; the spectrum of the noise is shaped such that it is similar to the spectral envelope of the coded speech signal (Schroeder et al., 1979). If the spectral level of the noise is significantly lower than the spectral level of the speech sound, the quantization noise can be made inaudible. Our experiments showed that the masking of noise targets by speech-like harmonic complex sounds is mainly determined by local details of the masker in the frequency domain or in the time domain, and not primarily by global features of the maskers' spectra. 
This suggests that the weighting in the low-frequency region and for high-pitched sounds should be associated with the harmonic structure of the speech signal. At high frequencies, the perceptual weighting should be associated with the temporal waveform of the speech signal. Although the masked thresholds of the noise target at high frequencies were generally higher in vowel sounds than in zero-phase harmonic complex sounds, the threshold could still be lower in the spectral peak region than in the valley region. In view of the continuously changing time-frequency structure of the sounds, a dynamic adaptation of perceptual weights could improve the quality of low-bit-rate coded speech, especially for the coding of transient sounds.

The effects of phase on the detection of a narrowband noise target in our experiments imply reduced sensitivity to quantization noise in the phase dispersion system (Quatieri et al., 1990), but also reduced detectability of any weak but wanted sounds that are present together with a speech signal. The phase effects are mainly determined by the temporal resolution of the auditory system and depend on the fundamental frequency and the spectral slope of the stimuli. Manipulating the phase spectra of speech signals certainly influences their subjective sound quality, especially for transient sounds. Although the long-term spectrum of speech is usually preserved, the temporal waveform is changed (Moriya and Honda, 1986; Quatieri et al., 1990). The auditory system can detect this kind of waveform change, especially at high frequencies, where good temporal resolution is retained. For example, the phase dispersion system (Quatieri et al., 1990), which replaced the phase spectra of vowel sounds by the phase spectra of upward frequency sweeps while leaving the amplitude spectra unchanged, slightly changed subjective voice quality. Due to the global spectral tilt of vowel sounds, the phase changes at high frequencies become less noticeable.
8. Acknowledgments
The experiments reported in this paper were performed at the Institute for Perception Research (IPO), Eindhoven, The Netherlands. The comments of Prof. A. Houtsma and Dr. A. Kohlrausch of IPO are gratefully acknowledged. The authors thank Prof. B.C.J. Moore and one anonymous reviewer for their very helpful comments and suggestions.
9. References
H. Duifhuis (1970), "Audibility of high harmonics in a periodic pulse", J. Acoust. Soc. Amer., Vol. 48, pp. 888-893.
H. Duifhuis (1971), "Audibility of high harmonics in a periodic pulse. II. Time effect", J. Acoust. Soc. Amer., Vol. 49, pp. 1155-1162.
J.L. Flanagan (1972), Speech Analysis, Synthesis and Perception (Springer, New York), 2nd Edition.
D.W. Griffin and J.S. Lim (1984), "Signal estimation from modified short-time Fourier transform", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-32, No. 2, pp. 236-243.
T. Houtgast (1974), "Auditory analysis of vowel-like sound", Acustica, Vol. 31, pp. 320-324.
W. Jesteadt, S.P. Bacon and J.R. Lehman (1982), "Forward masking as a function of frequency, masker level, and signal delay", J. Acoust. Soc. Amer., Vol. 71, pp. 951-963.
J.D. Johnston (1988), "Transform coding of audio signals using perceptual noise criteria", IEEE J. Selected Areas in Communication, Vol. 6, pp. 314-323.
A. Kohlrausch (1988), "Masking patterns of harmonic complex tone maskers and the role of the inner ear transfer function", in Basic Issues in Hearing, ed. by H. Duifhuis, J.W. Horst and H.P. Wit (Academic Press, London), pp. 339-350.
H. Levitt (1971), "Transformed up-down method in psychoacoustics", J. Acoust. Soc. Amer., Vol. 49, pp. 467-477.
B.C.J. Moore and B.R. Glasberg (1983), "Masking patterns for synthetic vowels in simultaneous and forward masking", J. Acoust. Soc. Amer., Vol. 73, pp. 906-917.
T. Moriya and M. Honda (1986), "Speech coder phase equalization and vector quantization", Proc. Internat. Conf. Acoust. Speech Signal Process. '86, pp. 1701-1704.
R.D. Patterson (1987), "A pulse ribbon model of monaural phase perception", J. Acoust. Soc. Amer., Vol. 82, pp. 1560-1586.
R. Plomp and H.J.M. Steeneken (1969), "Effect of phase on the timbre of complex tones", J. Acoust. Soc. Amer., Vol. 46, pp. 409-421.
T.F. Quatieri, J.T. Lynch, M.L. Malpass, R.J. McAulay and C.J. Weinstein (1990), The VISTA speech enhancement system for AM radio broadcasting, Final Technical Report, Lincoln Lab., MIT, 29.
M.R. Schroeder (1970), "Synthesis of low-peak-factor signals and binary sequences with low autocorrelation", IEEE Trans. Inform. Theory, Vol. 16, pp. 85-89.
M.R. Schroeder and S. Mehrgardt (1982), "Auditory masking phenomena in the perception of speech", in The Representation of Speech in the Peripheral Auditory System, ed. by R. Carlson and B. Granström (Elsevier, Amsterdam), pp. 79-87.
M.R. Schroeder, B.S. Atal and J.H. Hall (1979), "Optimizing digital speech coder by exploiting masking properties of the human ear", J. Acoust. Soc. Amer., Vol. 66, pp. 1647-1652.
B.K. Smith, U.K. Sieben, A. Kohlrausch and M.R. Schroeder (1986), "Phase effects in masking related to dispersion in the inner ear", J. Acoust. Soc. Amer., Vol. 80, pp. 1631-1637.
H.W. Strube (1985), "A computationally efficient basilar-membrane model", Acustica, Vol. 58, pp. 207-214.
R.S. Tyler and B. Lindblom (1982), "Preliminary study of simultaneous-masking and pulsation-threshold patterns of vowels", J. Acoust. Soc. Amer., Vol. 71, pp. 220-224.
R.L. Wegel and C.E. Lane (1924), "The auditory masking of one pure tone by another and its probable relation to the dynamics of the inner ear", Phys. Rev., Vol. 23, pp. 266-285.
E. Zwicker and H. Fastl (1990), Psychoacoustics: Facts and Models (Springer, Berlin).
E. Zwicker and E. Terhardt (1980), "Analytical expressions for critical-band rate and critical bandwidth as a function of frequency", J. Acoust. Soc. Amer., Vol. 68, pp. 1523-1525.