Hearing Research 309 (2014) 75–83
Contents lists available at ScienceDirect
Hearing Research journal homepage: www.elsevier.com/locate/heares
Research paper
The effects of noise vocoding on speech quality perception
Melinda C. Anderson*, Kathryn H. Arehart, James M. Kates
University of Colorado, Speech Language, Hearing Sciences, 2501 Kittredge Loop Road, 409 UCB, Boulder, CO 80309, USA
Article info
Article history: Received 23 July 2012 Received in revised form 22 November 2013 Accepted 25 November 2013 Available online 11 December 2013
Abstract
Speech perception depends on access to spectral and temporal acoustic cues. Temporal cues include slowly varying amplitude changes (i.e., temporal envelope, TE) and quickly varying amplitude changes associated with the center frequency of the auditory filter (i.e., temporal fine structure, TFS). This study quantifies the effects of TFS randomization through noise vocoding on the perception of speech quality by parametrically varying the amount of original TFS available above 1500 Hz. The two research aims were: 1) to establish the role of TFS in quality perception, and 2) to determine if the role of TFS in quality perception differs between subjects with normal hearing and subjects with sensorineural hearing loss. Ratings were obtained from 20 subjects (10 with normal hearing and 10 with hearing loss) using an 11-point quality scale. Stimuli were processed in three different ways: 1) a 32-channel noise-excited vocoder with random envelope fluctuations in the noise carrier, 2) a 32-channel noise-excited vocoder with the noise-carrier envelope smoothed, and 3) removal of high-frequency bands. Stimuli were presented in quiet and in babble noise at 18 dB and 12 dB signal-to-noise ratios. TFS randomization had a measurable detrimental effect on quality ratings for speech in quiet and a smaller effect for speech in background babble. Subjects with normal hearing and subjects with sensorineural hearing loss provided similar quality ratings for noise-vocoded speech. © 2013 Elsevier B.V. All rights reserved.
1. Introduction
Many of the approximately 35 million Americans with hearing loss are candidates for hearing aids (Kochkin, 2010). While recent clinical trials document the benefit of hearing aids (e.g., Larson et al., 2000), only 20–40% of individuals who are candidates actually own them (Dubno et al., 2008; Kochkin, 2010). Of those who own hearing aids, approximately 65–80% are satisfied with their instruments (Dubno et al., 2008; Kochkin, 2010). Sound quality, along with speech intelligibility, is correlated with overall user satisfaction with hearing aids (Kochkin, 2010). Modifications to the signal caused by environmental noise and/or by nonlinear and linear hearing aid signal processing can affect both speech intelligibility and speech quality (e.g., Moore and Tan, 2003; Arehart et al., 2007; Davies-Venn et al., 2007; Tan and Moore, 2008; Anderson et al., 2009; Arehart et al., 2010). These modifications affect speech in both the spectral and temporal domains.
Abbreviations: TFS, temporal fine structure; TE, temporal envelope; SNR, signal-to-noise ratio; BC, band cutoff.
* Corresponding author. Permanent address: University of Colorado Hospital, 1635 Aurora Court Suite 6200, Mail Stop F736, Aurora, CO 80045, USA. Tel.: +1 720 848 7218; fax: +1 720 848 2857.
E-mail addresses: [email protected] (M.C. Anderson), kathryn.[email protected] (K.H. Arehart), [email protected] (J.M. Kates).
0378-5955/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.heares.2013.11.011
A complex signal such as speech can be separated into multiple frequency bands. The temporal information in each band can be divided into two components. The temporal envelope (TE) is the slowly varying amplitude modulation. Temporal fine structure (TFS) is the more rapidly varying carrier signal (Shannon et al., 1995). In recent years, researchers have developed several speech quality indices to predict perceptual effects caused by changes in one or more signal characteristics. However, these indices may not accurately reflect the impact on speech quality of modifications to the TFS of the signal. The Perceptual Evaluation of Speech Quality (PESQ) index (Beerends et al., 2002) focuses on the change in excitation patterns introduced by the signal modifications, and will be affected by signal modifications only to the degree that the modifications change the average power in each band. The PEMO-Q quality index (Huber and Kollmeier, 2006) measures the change in the signal envelope modulation. The Hearing Aid Speech Quality Index (HASQI) (Kates and Arehart, 2010) measures the change in envelope time-frequency modulation and the change in the signal long-term spectrum. Neither PEMO-Q nor HASQI directly measures the change in TFS, although both indices will be indirectly affected by how changes in TFS are reflected in changes to the envelope modulation. For example, additive noise will randomize the TFS and also reduce the depth of the envelope modulation. Another quality model (Moore et al., 2004; Moore and Tan, 2004; Tan et al.,
2004) uses the normalized cross-correlation between the output and input signals after the signal has been filtered into bands to estimate the effect of noise and nonlinear distortion on the signal. The signals are divided into 30-ms segments and the cross-correlation between the input and output is computed. The cross-correlation value is normalized by the signal energy in the segments, and a level weighting function is applied to reduce the importance of low-intensity segments. The normalized and weighted cross-correlations are then averaged within each frequency band. The cross-correlation directly measures changes in the TFS, but assumes the same sensitivity to TFS modification at all frequencies despite the reduction in neural phase locking at frequencies above 1500 Hz (Johnson, 1980).

In summary, current models of speech quality perception focus primarily on TE modifications without accurately quantifying the effects of TFS modifications. The data presented here provide information regarding the effects of TFS modifications on speech quality perception. This information is needed, in part, for improvements in models of speech quality perception to more accurately predict the effects of hearing aid signal processing.

Traditional hearing aid processing, such as dynamic-range compression or noise suppression, directly modifies the signal envelope (Anderson et al., 2009). However, recently developed hearing-aid signal processing algorithms directly modify the TFS of the signal. Hearing aid processing algorithms are being developed that replace the high frequencies in the input speech signal with noise modulated by the speech envelope (Kates, 2011a; Ma et al., 2011). The envelope at high frequencies is preserved, while the speech TFS is replaced by that of the random noise. Because the original TFS has been replaced, the processed high-frequency output signal is uncorrelated with the input.
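The decorrelation produced by TFS replacement can be illustrated with a segment-wise normalized cross-correlation of the kind described above. The following is a minimal sketch under stated assumptions, not the published quality model: it uses only the zero-lag correlation, omits the level weighting, and works on a single synthetic band. The function name `normalized_xcorr`, the 3 kHz carrier, and the 4 Hz modulation are all hypothetical choices made for illustration.

```python
import numpy as np

def normalized_xcorr(x, y, fs, seg_ms=30.0):
    """Mean zero-lag normalized cross-correlation over fixed-length segments.

    Simplification (assumption): no level weighting and no lag search.
    """
    n = int(round(fs * seg_ms / 1000.0))
    vals = []
    for start in range(0, len(x) - n + 1, n):
        xs = x[start:start + n]
        ys = y[start:start + n]
        denom = np.sqrt(np.sum(xs ** 2) * np.sum(ys ** 2))
        if denom > 0:
            vals.append(np.sum(xs * ys) / denom)
    return float(np.mean(vals))

fs = 22050
t = np.arange(int(0.5 * fs)) / fs

# Synthetic "speech band": slow envelope on a 3 kHz carrier (illustrative only).
env = 1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)
band = env * np.sin(2 * np.pi * 3000 * t)

# Same envelope, but the fine structure is replaced by Gaussian noise.
rng = np.random.default_rng(0)
replaced = env * rng.standard_normal(t.size)

same = normalized_xcorr(band, band, fs)         # identical fine structure
swapped = normalized_xcorr(band, replaced, fs)  # fine structure replaced
print(f"unchanged TFS: {same:.3f}, replaced TFS: {swapped:.3f}")
```

In this toy case the envelope is identical in both signals, so the drop in correlation comes entirely from the change in fine structure.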
The accuracy of the feedback path estimation in feedback cancellation increases as the cross-correlation of the input and output decreases, so the TFS replacement improves the performance of the adaptive feedback cancellation implemented in the device (Kates, 2011a; Ma et al., 2011). While these techniques improve stability, the impact of this type of signal processing on speech quality has not been determined. Other types of hearing aid processing modify the TFS even though the processing objective is to change the signal envelope or spectrum. An example of this involves shifting the high frequency content of a signal. This shift may be implemented in multiple ways: 1) by moving a block of frequencies to a lower frequency region (Korhonen and Kuk, 2008), 2) by proportionally reducing the frequencies of the signal components above a cutoff frequency (Aguilera-Muñoz et al., 1999; Simpson et al., 2005; Souza et al., 2013), or 3) by shifting the frequencies towards the center of each frequency band in a multi-band system (Kulkarni et al., 2012). These frequency-shifting strategies reduce the correlation between the TFS of the processed signal and that of the original unprocessed version, and it is important to understand the impact of these TFS changes on speech quality. The frequency modification algorithms described above are designed to maximize speech understanding and usable gain for hearing aid users. While maintaining high levels of speech intelligibility is important for user satisfaction with hearing aids, it is possible to have high levels of intelligibility combined with poor sound quality (e.g., Preminger and Van Tasell, 1995; Souza et al., 2013). To date, no literature explicitly explores the effects of TFS manipulation on speech quality perception. 
The focus of the present study was to determine speech quality with parametric variation of TFS randomization in specific frequency regions for situations in which speech intelligibility remains at high levels. This study used noise vocoding to explore the effects of TFS randomization on speech quality perception.
Vocoding has been used to study the separate effects of TE and TFS on speech perception (e.g., Dudley, 1939; Shannon et al., 1995). To vocode a signal, it is filtered into a number of bands, and the envelope of each band is used to modulate a carrier signal (either noise or sine waves). For the noise vocoder, all frequencies within a band receive the same modulation. The uniform modulation causes a nearly constant amplitude across the band, resulting in a stairstep spectrum shape (Stone and Moore, 2003). As the number of bands increases, the spectrum becomes smoother and more of the envelope time-frequency modulation remains intact (Kates, 2011b). Each band is re-filtered (using the same filter bank) to remove any out-of-band components and the bands are combined. The resulting signal includes the modified TE and limited portions of the original TFS (dependent on specific envelope filter cutoff frequencies and whether noise or tone carriers are used). In the vocoding process, the TFS is modified, not removed. The TFS of the vocoded output comprises two components. The first is the residual speech TFS, and the second is the TFS associated with the vocoder carrier. The amount of each type of TFS is dependent on the signal processing configuration of the vocoder. The Gaussian noise traditionally used in noise vocoders has intrinsic random amplitude fluctuations over time, meaning that at any given point in time, the noise has its own random envelope. This intrinsic noise envelope may have a detrimental impact on speech understanding when combined with the temporal envelope of the speech (Whitmal et al., 2007; Stone et al., 2008; Souza and Rosen, 2009). It is possible to remove a substantial portion of the envelope from Gaussian noise. Both noise-envelope-intact vocoding noise and noise-envelope-removed vocoding noise have been described in the literature, with improved speech intelligibility for noise-envelope removed vocoding (Whitmal et al., 2007; Kates, 2011b). 
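The intrinsic envelope of a Gaussian noise carrier, and its removal, can be sketched numerically. This is a minimal illustration under assumptions that are not taken from the study: the band edges (2800 to 3200 Hz), the filter length, and the use of the coefficient of variation of the Hilbert envelope as a fluctuation measure are all hypothetical choices.

```python
import numpy as np
from scipy.signal import firwin, lfilter, hilbert

fs = 22050
rng = np.random.default_rng(1)
noise = rng.standard_normal(fs)  # 1 s of Gaussian noise

# Band-limit the noise (band edges are illustrative, not the study's filter bank).
taps = firwin(513, [2800, 3200], pass_zero=False, fs=fs)
band_noise = lfilter(taps, 1.0, noise)[1000:]  # drop the filter start-up transient

# Intrinsic envelope of the band-limited noise.
env = np.abs(hilbert(band_noise))

# Divide the noise by its own envelope to flatten it (guard against division by zero).
flat = band_noise / np.maximum(env, 1e-12)
flat_env = np.abs(hilbert(flat))

# Coefficient of variation of the envelope: larger means more fluctuation.
cv_before = env.std() / env.mean()
cv_after = flat_env.std() / flat_env.mean()
print(f"envelope fluctuation before: {cv_before:.2f}, after: {cv_after:.2f}")
```

The flattened carrier is not perfectly smooth, because dividing by the envelope spreads the spectrum and any later band-pass filtering reintroduces some modulation.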
Regardless of the type of carrier used in the vocoding process, a signal processing confound exists (Kates, 2011b). Although vocoding is designed to remove original TFS cues, it also affects the TE. The results of Kates (2011b) show that vocoding may not accurately reproduce envelope behavior across frequency bands. Each TFS modification technique considered in Kates (2011b) resulted in a loss in the accuracy of the envelope time-frequency modulation reproduction. In addition, while vocoding removes original TFS, TFS is still present in the vocoded signal and may show resemblance to the original TFS. Even with this limitation, vocoding is still a valuable signal processing tool because it provides a consistent method of TFS modification and allows for the study of TFS cues in speech perception.

While the role of TFS in speech quality perception is unclear, recent studies have examined how TFS influences speech intelligibility (e.g., Shannon et al., 1995; Qin and Oxenham, 2003; Lorenzi et al., 2006; Başkent, 2006; Hopkins et al., 2008; Hopkins and Moore, 2009, 2010). For a single talker in quiet, speech with limited original TFS is highly intelligible for subjects with normal hearing and for subjects with mild to moderate hearing loss due to cochlear damage (Shannon et al., 1995; Başkent, 2006). However, when listening to speech in the presence of competition, original TFS plays a more important role (Qin and Oxenham, 2003; Lorenzi et al., 2006; Başkent, 2006; Hopkins et al., 2008; Hopkins and Moore, 2009, 2010). When presented with a competing sound, speech with primarily TE cues is insufficient for high speech intelligibility for both subjects with normal hearing and subjects with hearing loss. The inclusion of original TFS for speech in the presence of noise improves speech understanding to differing degrees for subjects with normal hearing and subjects with sensorineural hearing loss.
Subjects with normal hearing achieve better understanding of speech in noise from inclusion of original TFS up to about 5000 Hz, while subjects with sensorineural hearing loss
benefit from inclusion of TFS only up to about 1500 Hz (Hopkins et al., 2008). This decreased ability to use original TFS in subjects with hearing loss has been attributed, in part, to broader auditory filters associated with cochlear hearing loss (Hopkins et al., 2008). Given that subjects with sensorineural hearing loss show limited abilities to utilize original TFS in understanding speech, there may be a similar variation in quality perception for speech with TFS modification between subjects with normal hearing and subjects with sensorineural hearing loss.

The purpose of this study was to establish the relationship between TFS manipulation and speech quality perception. Several factors were manipulated to increase our understanding of the role TFS plays in quality perception:

Effects of hearing loss: Both subjects with normal hearing and subjects with sensorineural hearing loss were included in the study.
Effects of signal processing: Three types of signal processing were used to manipulate the speech signal. The first two involved noise vocoding and included noise-envelope-intact and noise-envelope-removed vocoding noise. The third was removal of high frequency bands of the signal (low-pass filtering of the signal) as a control condition. This control condition allows for exploration of the overall benefit of including any acoustic information in the frequency regions of interest.
Effects of frequency region: The effects of frequency region were explored through changes to the band cutoff (BC), such that signal content above the BC was vocoded or removed, and signal content below and including the BC was left intact.
Effects of background noise: Multi-talker babble was used as background noise at different signal-to-noise ratios (SNR).

2. Methods and materials
2.1. Stimuli
The test materials for speech quality were two sentences spoken by a female talker from the IEEE corpus (Rosenthal, 1969).
All stimuli were digitized at 44.1 kHz and were down-sampled to 22.05 kHz to reduce computation time. The two concatenated sentences were “A saw is a tool used for making boards” and “Take the winding path to reach the lake.” These two sentences cover a broad range of sounds typical of American English. The average fundamental frequency was 237 Hz (range: 143–383 Hz), with the third formant region extending to 3600 Hz and the fourth formant region extending to 4700 Hz. The multi-talker babble background noise was taken from the Connected Speech Test (CST) (Cox et al., 1988). The duration of the babble was matched to the duration of each sentence, and pauses were inserted in the babble to duplicate the pauses between the sentences. The sentences and babble were gated on and off using 5 ms raised-cosine windows. The overall root-mean-square (RMS) level of the speech samples was kept constant, so in conditions where background noise was added there was a slight decrease in the target speech level.

For the conditions using background babble, the speech sample was first combined with the background babble at the selected SNR (18 or 12 dB). The signal was then passed through a bank of 32 band-pass, linear-phase finite impulse response (FIR) filters. The band edges and center frequencies of the 32 bands were based on a standard equivalent rectangular bandwidth (ERB_N) filter design (Glasberg and Moore, 1990; Slaney, 1993). The signal envelope for each vocoded band was generated via the Hilbert transform. The signal in each band was then divided by the envelope to give the TFS. In instances where the envelope amplitude was zero (which potentially would return an output with no meaning), MATLAB provided a division output of zero. It is noted that the TFS extracted from the Hilbert transform may contain discontinuities and abrupt changes in phase. A linear-phase envelope-modulation low-pass filter (filter cutoff = 300 Hz) was used to filter the envelope in each vocoded band. The filters were 512-tap FIR filters designed using the MATLAB fir1 function with a Hamming window shape. Sidelobe suppression was greater than 50 dB. The speech was then reconstructed with the original fine structure and filtered envelopes and passed through the bandpass filter to remove modulation sidebands. The resultant speech had essentially the original TFS with envelope modulation restricted to frequencies below 300 Hz.

The Gaussian noise used for the noise vocoding was passed through the same linear-phase FIR filter bank as the speech. One of two things was then done to the noise: 1) the noise was multiplied by the speech envelope (fluctuating; FL; Fig. 1a), or 2) the noise envelope was removed by dividing the noise signal in the frequency band by its own envelope after which it was multiplied by the speech envelope (smooth; SM; Fig. 1b). The second filtering stage did reintroduce some modulations in the noise carrier, and in doing so introduced some envelope distortion to the vocoded signal. The RMS level of the noise and speech in each frequency band was set equal to the RMS level of the original input speech in the corresponding band. Low-pass filtered conditions (LPF) were created using the same 32 band-pass linear phase FIR filters, and the amplitudes of the signals in the bands above the BC were set to zero.

A total of 73 test conditions were included: 24 FL vocoding conditions, 24 SM vocoding conditions, 24 LPF conditions, and 1 full bandwidth quiet condition. Specific experimental conditions were chosen based on preliminary tests which established the intelligibility of each condition.
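The per-band processing described above can be sketched for a single band as follows. This is a hedged illustration, not the authors' MATLAB implementation: the helper names (`rms`, `fir_same`, `vocode_band`), the band edges, and the synthetic input are hypothetical, and the filters use 513 taps (rather than the 512 stated in the text) so that the linear-phase delay can be removed exactly with a centered convolution.

```python
import numpy as np
from scipy.signal import firwin, hilbert

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def fir_same(x, taps):
    """Zero-delay filtering with an odd-length, linear-phase FIR filter."""
    return np.convolve(x, taps, mode="same")

def vocode_band(mix, noise, band_taps, env_taps, smooth):
    """Noise-vocode one band; smooth=False gives FL, smooth=True gives SM."""
    band = fir_same(mix, band_taps)            # band-pass the speech (+ babble)
    env = np.abs(hilbert(band))                # Hilbert envelope of the band
    env = np.maximum(fir_same(env, env_taps), 0.0)  # 300 Hz envelope low-pass
    carrier = fir_same(noise, band_taps)       # noise through the same band filter
    if smooth:
        c_env = np.abs(hilbert(carrier))
        # Divide out the carrier's own envelope; output zero where it is zero.
        carrier = np.divide(carrier, c_env,
                            out=np.zeros_like(carrier), where=c_env > 0)
    out = fir_same(env * carrier, band_taps)   # second band-pass stage
    r = rms(out)
    return out * (rms(band) / r) if r > 0 else out  # match the band RMS

fs = 22050
n = fs // 2
t = np.arange(n) / fs
# Synthetic stand-in for one speech band (illustrative only).
mix = (1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 3000 * t)
noise = np.random.default_rng(2).standard_normal(n)

band_taps = firwin(513, [2800, 3200], pass_zero=False, fs=fs)  # assumed band edges
env_taps = firwin(513, 300, fs=fs)             # Hamming-window low-pass, 300 Hz

fl = vocode_band(mix, noise, band_taps, env_taps, smooth=False)
sm = vocode_band(mix, noise, band_taps, env_taps, smooth=True)
target = rms(fir_same(mix, band_taps))
```

A full implementation would repeat this for each of the bands above the BC, leave the bands at or below the BC unprocessed, and sum all bands.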
All noise-vocoded conditions led to intelligibility above 90% for both subjects with normal hearing and subjects with sensorineural hearing loss. LPF conditions gave more variable intelligibility scores, ranging from 44% to 100% for subjects with sensorineural hearing loss and from 54% to 100% for subjects with normal hearing.

2.2. Subjects
There were 10 subjects with normal hearing (mean age of 44; range: 20–64; standard deviation = 16) and 10 with sensorineural hearing loss (mean age of 67; range: 47–81; standard deviation = 11). Subjects underwent an audiometric evaluation at their initial visit. Subjects in the normal-hearing group (NH) had air conduction thresholds of 20 dB HL or better at octave frequencies from 250 to 8000 Hz (ANSI, 2004). Subjects in the group with hearing impairment (HI) were required to have at least a mild sensorineural hearing loss that would be compatible with a hearing aid fitting. The air-bone gap was less than or equal to 10 dB at octave frequencies between 500 and 4000 Hz and acoustic reflexes were consistent with the degree of hearing loss. The better hearing ear was used for testing, with the default of the right ear when hearing was symmetrical. All subjects had a passing Mini Mental Status Exam (MMSE) score of 27 or higher, which indicated intact general cognitive function (Folstein et al., 1975). Table 1 shows the age, sex, and air conduction thresholds for the test ear for all subjects. All subjects were native speakers of American English. Although there was a positive relationship between age and amount of hearing loss, this relationship was not statistically significant (Pearson correlation coefficient: r = 0.10, p = 0.782).

An excitation pattern model (Hopkins et al., 2008) was employed in order to quantify the audibility of the target signal after NAL-R gain (Byrne and Dillon, 1986) had been applied for each subject with hearing loss. Mean excitation levels between 100 and 10,000 Hz were calculated for each subject.
Fig. 1. a. Stimulus generation block diagram: The additive noise is modulated by the filtered speech envelope (fluctuating noise; FL). The background babble is added to the speech at the appropriate SNR before processing begins. The speech line refers to the speech + background babble. The noise line refers to the vocoding noise. b. The noise envelope is replaced by the filtered speech envelope (smooth noise; SM). Otherwise as in Fig. 1a.

This model incorporated a middle ear transfer function (Glasberg and Moore, 2006). Default values for the proportion of inner and outer hair cell damage were assumed based on audiometric thresholds in dB HL (Moore and Glasberg, 2004). The model gave estimates of the excitation level at threshold as a function of frequency for each subject. The excitation level evoked by the stimulus was compared with the excitation level at threshold to estimate the effective sensation level (SL). Subjects in the NH group had SLs of at least 20 dB for the full frequency range of the stimuli. Each of the 10 subjects in the HI group had an SL of at least 15 dB through 3545 Hz with an average of 26 dB SL. Five of the 10 subjects had a more modest 10 dB SL between 5575 and 10,000 Hz. For three of the 10 HI subjects, the SL was limited to an average of 5 dB from 3545 to 7100 Hz, with less than 5 dB SL above 7100 Hz. For the final two of the 10 HI subjects, the frequency range of at least 15 dB SL was limited to below 3545 Hz, with an average less than 5 dB SL above 3545 Hz.
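The weak relationship between age and hearing loss reported earlier in this section can be checked approximately against the Table 1 values. The sketch below is an assumption-laden illustration: the text does not state which hearing-loss measure was correlated with age, so the three-frequency PTA of the 10 HI subjects is used here as a stand-in, and the result need not match the published coefficient exactly.

```python
import numpy as np
from scipy.stats import pearsonr

# Age (years) and PTA (dB HL) for subjects HI1 to HI10, taken from Table 1.
age = np.array([81, 62, 68, 51, 73, 61, 47, 75, 74, 55])
pta = np.array([52, 25, 35, 28, 30, 12, 55, 37, 27, 23])

# Assumption: PTA is the hearing-loss measure; the paper does not say which was used.
r, p = pearsonr(age, pta)
print(f"r = {r:.2f}, p = {p:.3f}")
```

Under this assumption the correlation is small and positive and far from significance, which is consistent in direction and size with the relationship described in the text.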
2.3. Experimental procedures
Subjects judged speech quality for the 73 conditions using an 11-point quality rating scale ranging from 0 (minimum speech quality) to 10 (maximum speech quality) in 0.1 increments. The scale was modeled after the overall impression scale described by Gabrielsson et al. (1988). The ratings were made on a computer monitor using a computer mouse to click on the point that the subject felt described the quality. A practice set was rated on the scale and included a sample of 27 of the 73 experimental conditions, including the end points (most and least processing) for the three signal processing conditions and each of the SNRs. This practice set was intended to familiarize the listener with the quality rating task and range of experimental conditions. No feedback was provided. The 72 LPF and vocoded conditions were each presented four times. The unmodified speech sample was presented twelve times. The order of conditions was randomized, with one full set of
conditions presented before a second presentation of any condition. Subjects were responsible for the pace of presentation and were given breaks after 50 trials, or more frequently as needed. Test instructions for the speech quality task are included in the Appendix. Two full sets were completed in each of the two sessions. Subjects were seated in a double-wall sound booth facing a computer monitor. The speech stimuli were processed through a digital-to-analog converter (Tucker Davis Technologies [TDT] RX8), an attenuator (TDT PA5) and a headphone buffer amplifier (TDT HB7), and presented monaurally to the test ear through a Sennheiser HD 580 earphone. Speech was played out at an average level of 65 dBA for subjects in the NH group.

Table 1
Individual subject information. HA user = hearing aid user, PTA = pure tone average (average of 0.5, 1, and 2 kHz), HFPTA = high frequency pure tone average (average of 1, 2, and 4 kHz). PTA, HFPTA and thresholds are in dB HL.

| Sub ID | Sex | Age (yrs) | HA user | Test ear | PTA | HFPTA | 250 Hz | 500 Hz | 1000 Hz | 2000 Hz | 3000 Hz | 4000 Hz | 6000 Hz | 8000 Hz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HI1 | F | 81 | Y | R | 52 | 55 | 40 | 45 | 55 | 55 | 45 | 55 | 50 | 70 |
| HI2 | M | 62 | Y | L | 25 | 38 | 15 | 15 | 20 | 40 | 55 | 55 | 40 | 15 |
| HI3 | M | 68 | N | L | 35 | 47 | 35 | 35 | 35 | 35 | 60 | 70 | 60 | 65 |
| HI4 | M | 51 | N | R | 28 | 33 | 20 | 25 | 25 | 35 | 40 | 40 | 45 | 40 |
| HI5 | M | 73 | Y | R | 30 | 43 | 35 | 25 | 15 | 50 | 55 | 65 | 55 | 65 |
| HI6 | M | 61 | N | R | 12 | 22 | 5 | 10 | 15 | 10 | 20 | 40 | 30 | 35 |
| HI7 | F | 47 | Y | L | 55 | 68 | 20 | 30 | 60 | 75 | 65 | 70 | 65 | 55 |
| HI8 | M | 75 | N | R | 37 | 48 | 25 | 30 | 30 | 50 | 60 | 65 | 65 | 65 |
| HI9 | F | 74 | N | L | 27 | 30 | 20 | 25 | 20 | 35 | 35 | 35 | 35 | 40 |
| HI10 | M | 55 | Y | R | 23 | 35 | 25 | 15 | 25 | 30 | 30 | 50 | 45 | 50 |
| NH1 | F | 51 | N | L | 10 | 7 | 5 | 15 | 5 | 10 | 5 | 5 | 5 | 5 |
| NH2 | F | 56 | N | R | 5 | 5 | 5 | 5 | 5 | 5 | 0 | 5 | 0 | 0 |
| NH3 | F | 57 | N | R | 12 | 10 | 20 | 20 | 5 | 10 | 10 | 15 | 10 | 10 |
| NH4 | F | 60 | N | R | 8 | 10 | 5 | 5 | 10 | 10 | 10 | 10 | 10 | 10 |
| NH5 | F | 64 | N | R | 18 | 18 | 20 | 20 | 15 | 20 | 20 | 20 | 20 | 10 |
| NH6 | M | 39 | N | R | 12 | 12 | 15 | 10 | 15 | 10 | 10 | 10 | 5 | 0 |
| NH7 | F | 21 | N | R | 13 | 13 | 5 | 10 | 10 | 20 | 15 | 10 | 0 | 0 |
| NH8 | F | 46 | N | R | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 15 | 20 |
| NH9 | F | 20 | N | R | 13 | 10 | 20 | 15 | 15 | 10 | 10 | 5 | 10 | 0 |
| NH10 | F | 29 | N | R | 5 | 7 | 0 | 5 | 10 | 0 | 0 | 10 | 10 | 15 |

3. Results
The quality ratings were first analyzed to determine the reliability and consistency of the ratings within and across visits. The quality ratings were then considered statistically in terms of four main factors: 1) type of signal processing (FL, SM, and LPF), 2) effect of SNR (quiet, 18 dB and 12 dB SNR), 3) BC (vocoding or removal of the highest 16 bands in 8 consecutive 2-band steps moving from highest to lowest), and 4) effects of hearing status. All analyses were conducted in SPSS version 18 using mixed-model repeated-measures analysis of variance (RM ANOVA). When the assumption of sphericity was violated, the Greenhouse–Geisser correction was used.

3.1. Within visit and across visit reliability
Bivariate correlations were used to quantify how consistent subjects were within and across visits for the same condition. The Pearson correlation coefficients for within-session consistency were 0.89 (first session) and 0.92 (second session) (p < 0.001) for the NH group and 0.93 (first session) and 0.96 (second session) (p < 0.001) for the HI group.
When the two quality ratings from visit 1 were averaged and two quality ratings from visit 2 were averaged, the Pearson correlation coefficient across sessions one and two was 0.86 (p < 0.001) for the NH group and 0.92 (p < 0.001) for the HI group. The high correlations for both groups suggest that the rating scale was a reliable instrument for quantifying the effects of stimulus
processing on quality perception. The mean scores across all four trials for each condition were used for the remaining analyses.
3.2. Statistical analysis of quality ratings
The average quality ratings given by the NH group and the HI group are shown in Figs. 2 and 3, respectively. Each figure panel contains the average scores with standard error bars for all three conditions (FL, SM, and LPF) as a function of BC. Each panel shows results for one SNR (quiet, 18 dB SNR, or 12 dB SNR).

An RM ANOVA was conducted with within-subject factors of signal processing (FL, SM, and LPF), SNR (quiet, 18 dB and 12 dB SNR), and BC (16 BC to 30 BC) and a between-subject factor of group. Selected results are included in Table 2. All three within-subject main effects were significant. Ratings were dependent on the type of signal processing. Post hoc Bonferroni pairwise comparisons showed a significant difference between the three signal processing types. LPF conditions were rated most poorly and SM conditions most highly. Ratings decreased as the amount of background noise increased, with post hoc pairwise comparisons showing significant differences between all pairs of noise levels. Stimuli in quiet were given the highest quality ratings and stimuli at 12 dB SNR were given the lowest quality ratings. Ratings significantly decreased as more information was removed from the signal by varying BC. The between-subject factor of group was not significant.

The interaction of SNR with signal processing indicates that as SNR decreased the difference in ratings between signal processing types decreased. The other significant interactions were based on BC: SNR with BC, signal processing with BC, and BC with group. The effect of BC was reduced as SNR decreased, the effect of BC was more pronounced for the LPF condition, and NH subjects were more sensitive to BC than HI subjects. There was a single significant three-way interaction. The effect of signal processing type on BC varied with SNR. As SNR worsened the effects of signal processing type with BC were reduced. Post hoc comparisons were performed on the BC variable.
An increase in the number of vocoded bands caused a significant decrease in quality ratings. Adding original TFS information was beneficial for frequencies up to 4594 Hz (p < 0.05). There was no significant difference in quality when the speech in bands above this frequency was replaced by the vocoder output. In contrast, ratings continued to improve for LPF conditions until a plateau was reached at 8023 Hz, BC = 30 (p < 0.05).

Fig. 2. The average quality ratings for listeners in the NH group. Each panel shows results for three signal processing types and one SNR (quiet, 18, or 12 dB). The FL vocoding noise is represented by closed triangles, the SM vocoding noise by closed circles, and the LPF conditions by closed squares. Speech in quiet is in the left panels, 18 dB SNR in the middle panels, and 12 dB SNR in the right panels. The error bars represent 1 standard error.

4. Discussion
The main finding was that replacement of original TFS has a small, but significant, negative effect on ratings of speech quality for subjects with both normal and impaired hearing. In this section, we consider four factors related to the observed effects of TFS modification on speech quality perception.

4.1. Effects of signal processing type
The type of vocoding noise influenced the nature of the speech quality degradation. Speech vocoded with a noise-envelope-intact vocoder (FL) was rated significantly lower than speech vocoded with a noise-envelope-removed vocoder (SM). This finding supports the idea that subjects are sensitive to the presence of the noise envelope in combination with the speech envelope. An intact noise envelope has been shown to be detrimental to speech intelligibility (Whitmal et al., 2007; Stone et al., 2008) by introducing noise-carrier envelope fluctuations to the signal. The results from this study indicate that the noise-carrier envelope is also detrimental to speech quality. Ratings were also negatively affected when the speech was low-pass filtered, as in the LPF condition. Subjects gave lower ratings to speech in this condition compared to the vocoded conditions. This indicates that subjects preferred a signal where the TFS was replaced to a signal where the same frequency bands were removed.

4.2. Effects of frequency region
The results are consistent with the expectation that as more original TFS is included in the speech, there is a corresponding increase in quality ratings, until quality ratings plateau at 4594 Hz. These results are also consistent with studies that show that subjects are relatively insensitive to the presence of original TFS above approximately 5000 Hz (Heinz and Swaminathan, 2009; Moore and Sek, 2009). When lower-frequency original TFS is available, subjects do not use higher-frequency original TFS for speech understanding (Hopkins and Moore, 2010). Similarly, these results indicate that removal of higher-frequency original TFS does not impair speech quality perception when lower-frequency original TFS is available. Ratings for subjects with normal and impaired hearing improved when the low-pass filter cut-off frequency was increased from 1500 to 8023 Hz. Previous studies have shown that quality ratings for subjects with normal hearing consistently improve as the bandwidth of a signal increases, while subjects with hearing loss may not show the same quality improvement for bandwidth increases (e.g., Gabrielsson et al., 1990; Moore and Tan, 2003; Ricketts et al., 2008; Arehart et al., 2010, 2011; Füllgrabe et al., 2010). Arehart et al. (2010) reported that increasing the low-pass filter cutoff frequency from 2 to 7 kHz improved speech quality for subjects with normal hearing, and Ricketts et al. (2008) found that increasing the low-pass filter cutoff from 5.5 to 9 kHz
Fig. 3. As in Fig. 2 but for the HI group.
Table 2. Statistical results for the RM ANOVA with within-subject variables of signal processing, SNR, and band cutoff (BC). The between-group variable is hearing status (group). The dependent variable is quality rating. Significant effects are marked with an asterisk.

Variable                             df        F        P         Partial η²   Observed power
Signal process                       2, 36     31.505   <0.001*   0.636        1
SNR                                  2, 36     40.523   <0.001*   0.692        1
BC                                   7, 126    74.444   <0.001*   0.805        1
Group                                1, 18     0.023    0.882     0.001        0.052
Signal process × group               2, 36     1.042    0.329     0.055        0.17
SNR × group                          2, 36     0.114    0.752     0.006        0.062
BC × group                           7, 126    3.453    0.039*    0.161        0.634
SNR × BC                             14, 252   17.576   <0.001*   0.494        1
SNR × signal process                 4, 72     8.69     0.001*    0.326        0.964
Signal process × BC                  14, 252   28.474   <0.001*   0.613        1
SNR × signal process × group         4, 72     0.496    0.624     0.027        0.127
SNR × BC × group                     14, 252   1.138    0.336     0.059        0.253
Signal process × BC × group          14, 252   0.979    0.409     0.052        0.25
SNR × signal process × BC            28, 504   3.951    0.001*    0.18         0.979
SNR × signal process × BC × group    28, 504   1.122    0.353     0.059        0.466
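As a consistency check on the effect sizes in Table 2, partial η² can be recovered from each F ratio and its degrees of freedom with the standard identity partial η² = F·df_effect / (F·df_effect + df_error). The sketch below applies that identity to a few rows of the table. The function name and the selection of rows are illustrative, and the identity is assumed to match how the effect sizes were computed (it holds when partial η² is defined as SS_effect / (SS_effect + SS_error)).

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F, df_effect, df_error, reported partial eta squared), copied from Table 2
rows = {
    "Signal process": (31.505, 2, 36, 0.636),
    "SNR": (40.523, 2, 36, 0.692),
    "BC": (74.444, 7, 126, 0.805),
    "Group": (0.023, 1, 18, 0.001),
    "BC x group": (3.453, 7, 126, 0.161),
}

for name, (f_value, df_effect, df_error, reported) in rows.items():
    computed = partial_eta_squared(f_value, df_effect, df_error)
    print(f"{name}: computed {computed:.3f}, reported {reported}")
```

For the rows shown, the computed values agree with the reported ones to within rounding, which also confirms that the degrees of freedom and F values line up with the right variables.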
improved speech and music quality for subjects with normal hearing. In contrast, quality ratings did not improve for extended bandwidth between 2 and 7 kHz for subjects with sensorineural hearing loss in Arehart et al. (2010), while only some subjects with sensorineural hearing loss judged sound quality to be better for increased bandwidth from 5.5 to 9 kHz in Ricketts et al. (2008). In the current study both groups of subjects showed improved quality ratings when bandwidth was extended up to 8023 Hz.

The present results show that the addition of the original TFS information above 4594 Hz did not lead to higher speech quality ratings. In contrast, quality ratings continued to improve as the low-pass filter cutoff frequency was increased up to 8023 Hz in the LPF condition. Both NH and HI subjects gave lower quality ratings to speech where a band was removed than they gave to speech where that band was replaced by modulated noise that reproduced the TE. This finding indicates that quality may not depend on TFS for frequencies above 5000 Hz and that reproducing the TE is sufficient in this frequency region. This result is consistent with the gradual loss of phase locking to the signal TFS as the frequency is increased above approximately 1500 Hz, with an almost complete loss of synchronization by 5000 Hz (Johnson, 1980). Additional explanations may also exist. Information at high frequencies in speech may be perceived as noise-like (e.g. frication), so it may be that replacing original TFS with the modified TFS used in the noise-vocoding process does not change the signal enough to affect the perception of speech quality. The finding that there was increasing speech quality benefit for the low-pass-filtered signal up to a frequency of 8023 Hz indicates that audibility of the signal was not the reason for the limited benefit of original TFS above 5000 Hz.

4.3. Effects of background babble

Adding background babble to the speech significantly decreased ratings. However, the addition of background babble also reduced the negative impact of TFS manipulation on quality ratings. If the signal was already degraded by additive noise, then the effects of modifying the TFS were less apparent. It was expected that as the amount of background noise increased, the importance of original TFS to speech quality would also increase. Instead, the addition of noise appeared to dominate speech quality perception. Even limited amounts of background noise, where intelligibility remained above 90%, reduced the impact of TFS modification on speech quality ratings.

The limited effect of TFS modification on sound quality for speech in background noise may be due to the temporal envelope modifications introduced by the background noise (Stone et al., 2012). The addition of background noise, even low-level background noise, may mask the more subtle effects of TFS modification due to the vocoding process. In quiet, a listener may be better able to perceive the modification in TFS (and its associated TE effects) from the vocoding process. The addition of background babble has a greater effect on the TE of the signal than the modifications introduced from vocoding, as well as affecting TFS in the lower frequency regions not affected by the vocoding process. The addition of the background babble has the effect of altering the overall TE by introducing energy into low-level valleys in the target speech, thereby reducing the overall peak-to-valley ratio of the envelope. The background babble may also act as a partial masker of the TFS of the target speech in frequency regions where the original TFS is intact. These larger effects from the addition of background babble may overwhelm the comparatively small effects of TFS modification, causing controlled TFS modification to be less important in quality perception for speech in noise.

4.4. Effects of hearing status
The presence of a sensorineural hearing loss did not significantly affect the perceived effects of TFS modification. This finding was unexpected given the difference in speech quality perception found for other types of processing, and because of the difference in the abilities of listeners with normal and impaired hearing to use TFS for speech intelligibility (Qin and Oxenham, 2003; Başkent, 2006; Lorenzi et al., 2006; Hopkins et al., 2008; Hopkins and Moore, 2010). However, the significant interaction between band cutoff and group showed that there were some effects of hearing loss on sensitivity to TFS modification. There was a greater effect of BC for the group with normal hearing. One possible explanation for the lack of an overall group difference is that modifications to the TE may be the primary determining factor in the quality ratings. Research has shown that modification to the TE is a strong predictor of quality ratings for subjects with normal and impaired hearing (van Buuren et al., 1999; Anderson et al., 2009; Kates and Arehart, 2010). It is possible that while TFS modification plays a role in quality perception, the effects of TFS modifications are overshadowed by the effects of TE modification on quality perception.

4.5. Implications for hearing-aid signal processing

Some hearing-aid signal processing algorithms directly affect the TFS characteristics of the signal (e.g., Aguilera-Muñoz et al., 1999; Simpson et al., 2005; Korhonen and Kuk, 2008; Kates, 2011a; Ma et al., 2011; Kulkarni et al., 2012). The results of this study show a negative impact of TFS modification on speech quality. However, a greater reduction in quality was observed for additive noise at SNRs of 18 and 12 dB. Low-pass filtering also degraded the perception of sound quality to a greater extent than TFS modification in the same spectral region. Both low-pass filtering and the addition of background noise affect the TE of the signal.
TE modification from signal processing decreases sound quality (van Buuren et al., 1999; Anderson et al., 2009; Kates and Arehart, 2010). Signal processing algorithms that primarily affect TE (e.g., wide dynamic range compression and spectral subtraction) may have a greater impact on speech quality perception than signal processing algorithms that primarily affect TFS. As a result, algorithms that affect TFS, such as feedback management or frequency compression, may have less influence on quality perception.
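The statement that additive noise alters the TE by filling low-level valleys, and so reduces the envelope peak-to-valley ratio, can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not the analysis used in this study: a 4 Hz amplitude-modulated noise stands in for speech, steady Gaussian noise stands in for babble, the envelope is a sliding-window RMS, and the peak-to-valley ratio is taken as the ratio of the 95th to the 5th percentile of that envelope.

```python
import math
import random


def envelope(x, fs, win_ms=16.0):
    """Crude temporal envelope: RMS over a sliding window (running-sum form)."""
    n = max(1, int(fs * win_ms / 1000))
    csum = [0.0]
    for v in x:
        csum.append(csum[-1] + v * v)
    # One envelope value per fully covered window position
    return [math.sqrt((csum[i + n] - csum[i]) / n) for i in range(len(x) - n + 1)]


def peak_to_valley_db(env):
    """Ratio of the 95th to the 5th percentile of the envelope, in dB."""
    s = sorted(env)
    lo = s[int(0.05 * (len(s) - 1))]
    hi = s[int(0.95 * (len(s) - 1))]
    return 20 * math.log10(hi / lo)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale the noise to the requested long-term SNR, then add it to the signal."""
    p_sig = sum(v * v for v in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(signal, noise)]


random.seed(0)
fs = 8000            # sample rate in Hz (illustrative)
n_samples = 2 * fs   # two seconds of signal

# Stand-in for speech: noise carrier with a deep 4 Hz amplitude modulation
target = [
    0.5 * (1 + 0.95 * math.cos(2 * math.pi * 4 * i / fs)) * random.gauss(0, 1)
    for i in range(n_samples)
]
# Stand-in for babble: steady Gaussian noise
masker = [random.gauss(0, 1) for _ in range(n_samples)]

results = {"quiet": peak_to_valley_db(envelope(target, fs))}
for snr_db in (18, 12):
    mix = add_noise_at_snr(target, masker, snr_db)
    results[f"{snr_db} dB SNR"] = peak_to_valley_db(envelope(mix, fs))

for label, value in results.items():
    print(f"{label}: peak-to-valley ratio = {value:.1f} dB")
```

In this toy setting the peak-to-valley ratio shrinks as the SNR decreases, because the added noise raises the envelope valleys far more than the peaks. Real babble has its own envelope fluctuations, so the sketch only captures the valley-filling part of the argument.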
Models of quality perception have shown that quantification of the change in the TE from a clean signal to a modified signal can be used to accurately predict the quality ratings given by subjects with normal and impaired hearing (e.g., Huber and Kollmeier, 2006; Kates and Arehart, 2010). While these models provide accurate estimations of sound quality perception, they do not account for the entire picture. The addition of a variable for TFS modification in a model of speech quality perception may increase the accuracy in predicting a listener’s perception of speech quality for signal modifications that more directly affect TFS.

5. Conclusions

This study established the role of original TFS in ratings of speech quality and quantified the difference between subjects with normal hearing and subjects with sensorineural hearing loss. The results showed that removal of original TFS information between 1500 and 4594 Hz had a small, but measurable, negative effect on quality ratings. Modification of TFS above 4594 Hz did not adversely affect speech quality. This was true for speech in quiet and speech in babble at SNRs of 18 and 12 dB. The results indicate that the addition of babble has a greater influence on quality ratings than TFS modification. However, it is important to acknowledge that when the original TFS is removed from the signal, there is a coexisting degradation in the TE. Given this confound, it is difficult to isolate the effects of TFS modification alone on quality ratings. However, for the 12 dB SNR, when envelope changes were most pronounced, there was still a small drop in quality ratings as the amount of noise vocoding increased. Therefore, high-frequency TFS modification is a factor, albeit a more minor factor, to consider in models of sound quality perception for subjects with both normal and impaired hearing.
Future work should explore the contribution of TFS below 1500 Hz to speech quality perception and the effects of TFS modification on music quality perception for subjects with normal and impaired hearing. The results from this study may be useful in the refinement of models of speech quality perception and may influence the development of novel signal processing algorithms for use in hearing aids by providing evidence for the limited role of TFS modification above 5000 Hz on speech quality.

Acknowledgements

This article is based upon a dissertation submitted to the Graduate School of the University of Colorado at Boulder in partial fulfillment of the requirements of the doctoral degree. This research was supported by a research grant to the University of Colorado at Boulder from GN ReSound.

Appendix. Subject instructions

Instructions provided to subjects are reproduced below. In addition to verbal instruction, instructions were printed and left with the subject for reference during the experimental session. The instructions were adapted from Gabrielsson et al. (1988) and Davies-Venn et al. (2007).

Quality instructions

Your task today is to judge the sound quality of the programs you listened to in the previous sessions. You shall now try to describe how they sound by means of an overall impression scale. It is graded from 10 (maximum) to 0 (minimum). You decide yourself on the accuracy that you consider necessary. As you can see it is also possible to use decimals. The integers 9, 7, 5, 3, and 1 are defined on the scale. 10 means maximum (highest
possible) sound quality, 9 means very good, 7 rather good, 5 midway, 3 rather bad, 1 very bad, and 0 minimum (lowest possible) sound quality. To begin each trial, click on the button marked PLAY. After the sample has ended, mark your rating on the slider bar using the mouse. Click CONFIRM to indicate that you have made a final decision. Click PLAY again to begin the next trial. If you would like a break before the end of the block, do not click PLAY.

References

Aguilera-Muñoz, C., Nelson, P., Rutledge, C., Gago, A., 1999. Frequency lowering processing for listeners with significant hearing loss. In: Proceedings of the Sixth IEEE International Conference on Electronics, Circuits and Systems (Cat. No.99EX357), Part 2(2), pp. 741–744.
American National Standards Institute (ANSI), 2004. Specifications for audiometers ANSI S3.6. New York.
Anderson, M.C., Arehart, K.H., Kates, J.M., 2009. The acoustic and perceptual effects of series and parallel processing. EURASIP J. Adv. Signal Process., 619805. http://dx.doi.org/10.1155/2009/619805.
Arehart, K.H., Kates, J.M., Anderson, M.C., 2011. Effects of noise, nonlinear processing and linear filtering on perceived music quality. Ear Hear 50, 177–190.
Arehart, K.H., Kates, J.M., Anderson, M.C., 2010. The effects of noise, nonlinear and linear processing on speech quality. Ear Hear 31, 420–436.
Arehart, K.H., Kates, J.M., Anderson, M.C., Harvey, L.O., 2007. Effects of noise and distortion on speech quality judgments in normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Am. 122, 1150–1164.
Başkent, D., 2006. Speech recognition in normal hearing and sensorineural hearing loss as a function of the number of spectral channels. J. Acoust. Soc. Am. 120, 2908–2925.
Beerends, J., Hekstra, A., Rix, A., Hollier, M., 2002. Perceptual Evaluation of Speech Quality (PESQ): the new ITU standard for end-to-end speech quality assessment. Part II – psychoacoustic model. J. Audio Eng. Soc. 50, 765–778.
Byrne, D., Dillon, H., 1986. The National Acoustical Laboratories’ (NAL) new procedure for selecting the gain and frequency response of a hearing aid. Ear Hear 7, 257–265.
Cox, R., Alexander, G.C., Gilmore, C., Pusakulich, K.M., 1988. Use of the connected speech test (CST) with hearing-impaired listeners. Ear Hear 9, 198–207.
Davies-Venn, E., Souza, P., Fabry, D., 2007. Speech and music quality ratings for linear and nonlinear hearing aid circuitry. J. Am. Acad. Audiol. 18, 688–699.
Dubno, J., Matthews, L., Fu-Shing, L., Ahlstrom, J., Horwitz, A., 2008. Predictors of hearing-aid ownership and success by older adults. In: Paper presented at the International Hearing Aid Research Conference, Lake Tahoe, CA.
Dudley, H., 1939. Remaking speech. J. Acoust. Soc. Am. 11, 169–177.
Folstein, M., Folstein, S., McHugh, P., 1975. “Mini-mental state”. A practical method for grading the cognitive state of patients for the clinician. J. Psychiatr. Res. 12, 189–198.
Füllgrabe, C., Baer, T., Stone, M.A., Moore, B.C.J., 2010. Preliminary evaluation of a method for fitting hearing aids with extended bandwidth. Int. J. Audiol. 49, 741–753.
Gabrielsson, A., Hagerman, B., Bechkristensen, T., Lundberg, G., 1990. Perceived sound quality of reproductions with different frequency responses and sound levels. J. Acoust. Soc. Am. 88, 1359–1366.
Gabrielsson, A., Schenkman, B.N., Hagerman, B., 1988. The effects of different frequency responses on sound quality judgments and speech intelligibility. J. Speech Hear Res. 31, 166–177.
Glasberg, B.R., Moore, B.C.J., 1990. Derivation of auditory filter shapes from notched-noise data. Hear Res. 47, 103–138.
Glasberg, B.R., Moore, B.C.J., 2006. Prediction of absolute thresholds and equal-loudness contours using a modified loudness model. J. Acoust. Soc. Am. 120, 585–588.
Heinz, M., Swaminathan, J., 2009. Quantifying envelope and fine-structure coding in auditory nerves responses to chimeric speech. J. Assoc. Res. Otolaryngol. 10, 407–423.
Hopkins, K., Moore, B.C.J., 2009. The contribution of temporal fine structure to the intelligibility of speech in steady and modulated noise. J. Acoust. Soc. Am. 125, 442–446.
Hopkins, K., Moore, B.C.J., 2010. The importance of temporal fine structure information in speech at different spectral regions for normal-hearing and hearing-impaired subjects. J. Acoust. Soc. Am. 127, 1595–1608.
Hopkins, K., Moore, B.C.J., Stone, M.A., 2008. Effects of moderate cochlear hearing loss on the ability to benefit from temporal fine structure information in speech. J. Acoust. Soc. Am. 123, 1140–1153.
Huber, R., Kollmeier, B., 2006. PEMO-Q – a new method for objective audio quality assessment using a model of auditory perception. IEEE 14, 1902–1911.
Johnson, D.H., 1980. The relationship between spike rate and synchrony in responses of auditory-nerve fibers to single tones. J. Acoust. Soc. Am. 68, 1115–1122.
Kates, J.M., 2011a. Stability improvements in hearing aids. U.S. Patent Application 20110249845, Oct 13, 2011.
Kates, J.M., 2011b. Spectro-temporal envelope changes caused by temporal fine structure modification. J. Acoust. Soc. Am. 129, 3981–3990.
Kates, J.M., Arehart, K.H., 2010. The Hearing Aid Speech Quality Index (HASQI). J. Audio Eng. Soc. 58, 363–381.
Kochkin, S., 2010. MarkeTrak VIII: customer satisfaction with hearing aids is slowly increasing. Hear. J. 63, 11–19.
Korhonen, P., Kuk, F., 2008. Use of linear frequency transposition in simulated hearing loss. J. Am. Acad. Audiol. 19, 639–650.
Kulkarni, P., Pandey, P., Jangamashetti, D., 2012. Binaural dichotic presentation to reduce the effects of spectral masking in moderate bilateral sensorineural hearing loss. Ear. Hear. 51, 334–344.
Larson, V.D., Williams, D.W., Henderson, W.G., Luethke, L.E., Beck, L.B., Noffsinger, D., et al., 2000. Efficacy of 3 commonly used hearing aid circuits – a crossover trial. JAMA – J. Am. Med. Assoc. 284, 1806–1813.
Lorenzi, C., Gilbert, G., Carn, H., Garnier, S., Moore, B.C.J., 2006. Speech perception problems of the hearing impaired reflect inability to use temporal fine structure. Proc. Natl. Acad. Sci. U.S.A. 103, 18866–18869.
Ma, G., Gran, F., Jacobsen, F., Agerkvist, F.T., 2011. Adaptive feedback cancellation with band-limited LPC vocoder in digital hearing aids. IEEE Trans. Audio Speech Lang. Proc. 19, 677–687.
Moore, B.C.J., Glasberg, B.R., 2004. A revised model of loudness perception applied to cochlear hearing loss. Hear. Res. 188, 70–88.
Moore, B.C.J., Sek, A.P., 2009. Sensitivity of the human auditory system to temporal fine structure at high frequencies. J. Acoust. Soc. Am. 125, 3186–3193.
Moore, B.C.J., Tan, C., 2003. Perceived naturalness of spectrally distorted speech and music. J. Acoust. Soc. Am. 114, 408–419.
Moore, B.C.J., Tan, C., 2004. Development and validation of a method for predicting the perceived naturalness of sounds subjected to spectral distortion. J. Audio Eng. Soc. 52, 900–914.
Moore, B.C.J., Tan, C., Zacharov, N., Matilla, V., 2004. Measuring and predicting the perceived quality of music and speech subjected to combined linear and nonlinear distortion. J. Audio Eng. Soc. 52, 1228–1244.
Preminger, J.E., Van Tasell, D.J., 1995. Quantifying the relation between speech quality and speech-intelligibility. J. Speech Hear. Res. 38, 714–725.
Qin, M.K., Oxenham, A.J., 2003. Effects of simulated cochlear-implant processing on speech reception in fluctuating maskers. J. Acoust. Soc. Am. 114, 446–454.
Ricketts, T.A., Dittberner, A.B., Johnson, E.E., 2008. High-frequency amplification and sound quality in listeners with normal through moderate hearing loss. J. Speech Lang. Hear. Res. 51, 160–172.
Rosenthal, S., 1969. IEEE: recommended practices for speech quality measurements. IEEE Trans. Audio Electroacoust. 17, 227–246.
Shannon, R.V., Zeng, F.G., Kamath, V., Wygonski, J., Ekelid, M., 1995. Speech recognition with primarily temporal cues. Science 270, 303–304.
Simpson, A., Hersbach, A., McDermott, H.J., 2005. Improvements in speech perception with an experimental nonlinear frequency compression hearing device. Ear Hear 44, 281–292.
Slaney, M., 1993. An Efficient Implementation of the Patterson–Holdsworth Auditory Filter Bank. Apple Computer Technical Report #35. Apple Computer Library, Cupertino, CA.
Souza, P., Rosen, S., 2009. Effects of envelope bandwidth on the intelligibility of sine- and noise-vocoded speech. J. Acoust. Soc. Am. 126, 792–805.
Souza, P., Arehart, K.H., Kates, J.M., Croghan, N., Gehani, N., 2013. Exploring the limits of frequency lowering. J. Speech Lang. Hear. Res. 56, 1349–1363.
Stone, M.A., Füllgrabe, C., Moore, B.C.J., 2012. Notionally steady background noise acts primarily as a modulation masker of speech. J. Acoust. Soc. Am. 132, 317–326.
Stone, M.A., Moore, B.C.J., 2003. Effect of the speed of a single channel dynamic range compressor on intelligibility in a competing speech task. J. Acoust. Soc. Am. 114, 1023–1034.
Stone, M.A., Moore, B.C.J., Greenish, H., 2008. Discrimination of envelope statistics reveals evidence of sub-clinical hearing damage in a noise-exposed population with ‘normal’ hearing thresholds. Int. J. Audiol. 47, 737–750.
Tan, C., Moore, B.C.J., 2008. Perception of nonlinear distortion by hearing-impaired people. Ear Hear 47, 246–256.
Tan, C., Moore, B.C.J., Zacharov, N., Matilla, V., 2004. Predicting the perceived quality of nonlinearly distorted music and speech signals. J. Audio Eng. Soc. 52, 699–711.
van Buuren, R.A., Festen, J.M., Houtgast, T., 1999. Compression and expansion of the temporal envelope: Evaluation of speech intelligibility and sound quality. J. Acoust. Soc. Am. 105, 2903–2913.
Whitmal, N., Poissant, S., Freyman, R., Helfer, K., 2007. Speech intelligibility in cochlear implant simulations: Effects of carrier type, interfering noise, and subject experience. J. Acoust. Soc. Am. 122, 2376–2388.