Speech Communication 48 (2006) 549–558 www.elsevier.com/locate/specom
Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environments
Mark D. Skowronski *, John G. Harris Computational Neuro-Engineering Lab, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA Received 22 December 2004; received in revised form 27 July 2005; accepted 21 September 2005
Abstract Previous studies have documented phenomena involving the modification of human speech in special communication circumstances. Whether speaking to a hearing-impaired person (clear speech) or in a noisy environment (Lombard speech), speakers tend to make similar modifications to their normal, conversational speaking style in order to increase the understanding of their message by the listener. One strategy characteristic of the above speech types is to increase consonant power relative to the signal power of adjacent vowels and is referred to as consonant–vowel (CV) ratio boosting. An automated method of speech enhancement using CV ratio boosting is called energy redistribution voiced/unvoiced (ERVU). To characterize the performance of ERVU, 25 listeners responded to 500 words in a two-word, forced-choice experiment in the presence of energetic masking noise. The test material was a vocabulary of confusable monosyllabic words spoken by 8 male and 8 female speakers, and the conditions tested were a control (unmodified speech), ERVU, and a high-pass filter (HPF). Both ERVU and the HPF significantly increased recognition accuracy compared to the control. Nine of the 16 speakers were significantly more intelligible when ERVU or the HPF was used, compared to the control, while no speaker was less intelligible. The results show that ERVU successfully increased intelligibility of speech using a simple automated segmentation algorithm, applicable to a wide variety of communication systems such as cell phones and public address systems. 2005 Elsevier B.V. All rights reserved. PACS: 43.72.Ew; 43.60.Dh; 43.71.Es Keywords: Clear speech; Speech enhancement; Energy redistribution
1. Introduction
Speech-enhancement algorithms improve the quality of speech (naturalness and intelligibility) in
Supported by the iDEN Technology Group of Motorola. Corresponding author. Tel.: +1 352 392 2626; fax: +1 352 392 0044. E-mail address:
[email protected]fl.edu (M.D. Skowronski). *
communication channels, such as phone networks or public address systems. The speech signal is degraded during communication by noise from three sources: (1) acoustic noise in the talker's environment (recording environment), (2) channel noise (e.g., electronic echo in phone networks, compression artifacts), and (3) acoustic noise in the listener's environment (broadcast environment). Noise reduction techniques have been used to reduce the noise
0167-6393/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2005.09.003
M.D. Skowronski, J.G. Harris / Speech Communication 48 (2006) 549–558
power from the first two sources relative to the speech signal power (Boll, 1979; Ephraim and Van Trees, 1995). However, noise reduction algorithms generally do not target noise added to the speech signal after the electro-acoustic transducer, such as the speaker of a telephone handset or public address system. A simple solution is to increase the power of the speech signal before broadcasting in the listener's noisy environment, thus increasing the signal-to-noise ratio (SNR). But what can be done when "turning up the volume" is no longer an option? Assuming the broadcast transducer is operating at its peak power output without clipping (i.e., the maximum volume on a cell phone or a public address system), a speech-enhancement algorithm has been proposed by the authors that redistributes the speech signal energy in order to increase intelligibility (Harris and Skowronski, 2002). The energy redistribution algorithm emphasizes temporal regions critical to intelligibility: energy redistribution voiced/unvoiced (ERVU), which boosts unvoiced regions, and energy redistribution spectral transitions (ERST), which boosts regions of large spectral change. The current work shows the similarities between ERVU and speech-production characteristics of clear and Lombard speech and extends the previous analysis of ERVU by considering the effects of listener nativeness and talker differences on speech intelligibility.
2. Clear and Lombard speech
Research from the past two decades has investigated various types of speech in the presence of degraded communication conditions. A significant portion of that research was designed to study the changes made during speech communication in noisy acoustic environments in order to better understand the strategies speakers employ to overcome the degraded communication conditions and also to apply the acquired knowledge to algorithms that improve speech communications in man-made devices and systems (e.g., hearing aids, public address systems, and cell phones).
2.1. Clear speech
A seminal work in this area is the study of clear speech by Picheny et al. (1985). Clear speech is the style of speaking a speaker adopts when talking with someone who has a hearing impairment. Clear speech is characterized by a slower speaking rate, more and longer pauses, elevated speech intensity, increased word duration, "targeted" vowel formants, increased consonant intensity compared to adjacent vowels, and phonological changes such as fewer reduced vowels and more released stop bursts (Picheny et al., 1986). Using nonsense English sentences, clear speech was shown to increase the intelligibility of speech for hearing impaired as well as normal hearing listeners (Payton et al., 1994). Furthermore, using simple English sentences, clear speech was shown to increase intelligibility for normal hearing native listeners more so than for non-native listeners (Bradlow and Bent, 2002). In a study of the importance of acoustic cues for vowel recognition for young and elderly listeners, monosyllabic /bVd/ words were excised from sentences of clear and conversational speech (Ferguson and Kewley-Port, 2002). The young normal hearing listeners averaged a higher vowel recognition score on the clear speech compared to conversational speech, while elderly hearing-impaired listeners had the same average score for both clear and conversational speech. One explanation of these results is that clear speech enhances many cues for speech intelligibility, not all of which are utilized by hearing-impaired listeners.
2.2. Lombard speech
A style of communication related to clear speech is the Lombard effect. In the Lombard effect, a speaker talking in a noisy environment makes several vocal changes in order to improve intelligibility of the speech signal against the competing acoustic noise. The changes in vocal characteristics include an increase in speech signal amplitude, slower delivery, higher pitch, higher formants, and an increase in the spectral energy center of gravity as well as an increase in the consonant-to-noise ratio and decrease in spectral tilt (Junqua, 1993; Summers et al., 1988). Junqua showed that Lombard speech may be more or less intelligible than conversational speech, depending on the vocabulary under test, noise type, and speaker gender.
2.3. Applications
Some success has been reported in applying the observed phenomena of clear and Lombard speech to speech enhancement. Gordon-Salant (1986) boosted the consonant–vowel (CV) ratio of
hand-labeled nonsense CV words by 10 dB and also linearly doubled the duration of words in an intelligibility experiment for normal hearing listeners in the presence of masking babble noise. While the crude duration increase yielded negligible benefits, boosting the CV ratio by 10 dB (the difference between clear and conversational speech observed by Picheny et al.) significantly improved CV word intelligibility for both young and elderly normal hearing listeners. A similar experiment was performed by Kennedy et al. (1998), who investigated the effects of increasing CV intensity ratio on intelligibility for hearing-impaired listeners. With hand-labeled nonsense VC words, consonants were enhanced in separate trials with gains ranging from 0 to 24 dB. The consonant enhancement at maximum recognition, CEmax, was 8 dB for voiced consonants and 11 dB for unvoiced consonants. The experiments described below used the ERVU algorithm to extend the existing experiments on CV ratio boosting. The goals of the experiments were the following: (1) quantify the performance of CV ratio boosting using the ERVU algorithm, and (2) observe the interactions of CV ratio boosting for the following factors: vocabulary set, SNR, and listener nativeness (i.e., whether the listener was a self-described native or non-native speaker of English).
3. Method
In the following experiments, speech enhancement using CV ratio boosting was characterized with closed-vocabulary tests. Boosting was achieved using the ERVU algorithm, which redistributes energy across time, and was compared with a high-pass filter (HPF), which redistributes energy across frequency.
3.1. Algorithms
Two methods of automatic speech enhancement using energy redistribution were developed previously (Harris and Skowronski, 2002). The first method, ERVU, takes energy from voiced regions and boosts unvoiced regions while conserving global energy (over the duration of words).
Vowels, liquids, nasals, diphthongs, and voiced plosives and fricatives are attenuated in their voiced regions while all other phonemes are amplified. The decision boundary between voiced and unvoiced regions is determined using a voice activity detector. The second method, ERST, takes energy from spectrally
stationary regions and boosts spectral transitions (Reinke, 2001). Spectral transitions are identified using cepstral-based features. Furui showed that spectral transitions contain important intelligibility cues in an experiment using truncated vowels in CV utterances (Furui, 1986). When vowels were truncated beyond a certain point, intelligibility greatly decreased. These points were found to occur in regions of the greatest spectral change. Boosting spectral transitions in hand-labeled logatomes (nonsense consonant–vowel combinations) and sentential material was also demonstrated to improve intelligibility in noisy environments (Hazan and Simpson, 1998). The current work details the performance of energy redistribution using voiced/unvoiced information (ERVU). Several voice activity detectors are available from speech codec standards to segment voiced and unvoiced regions. One compact and effective algorithm is the spectral flatness measure (SFM) (Gray and Markel, 1974). Eq. (1) is the expression for the SFM used for voice activity detection in ERVU.

$$\mathrm{SFM}(j) = \frac{\left( \prod_{k=1}^{N} X_j(k) \right)^{1/N}}{\frac{1}{N} \sum_{k=1}^{N} X_j(k)} \qquad (1)$$

where X_j is the magnitude of the N-point DFT of the jth temporal window of the uttered word. The SFM is the ratio of the geometric mean to the arithmetic mean of the magnitude spectrum of windowed speech, bounded between zero and one. An SFM value near unity indicates that the spectrum is flat, while an SFM value near zero indicates a peaky spectrum. Voiced regions, with peaks at harmonics of the fundamental frequency, have low SFM values while unvoiced regions, which lack fundamental frequency harmonics, have high SFM values. A two-level Schmidt trigger decision boundary is used to provide robustness to variations in the SFM. The levels for the Schmidt trigger were determined using probability distribution functions of phoneme-labeled speech. Fig. 1 shows an example of the SFM for the word "clarification", and Fig. 2 shows the time domain signal of "clarification" with voiced/unvoiced boundaries delineated using the SFM and Schmidt trigger thresholds. After the voiced/unvoiced decision is made, the unvoiced regions are amplified and the entire word is scaled by a normalizing gain factor such that the modified word energy is the same as the original word energy. The transition between voiced and
unvoiced scale factors is smoothed by a 10 ms linear interpolation. Any spurious signals larger than the dynamic range of the original signal are clipped. Informal listening tests indicated that artifacts from such infrequent clipping were not perceptible for the scale factors used in the following experiments. The ERVU algorithm redistributes energy across time, analogous to the redistribution of energy across frequency through filtering. A third-order Butterworth HPF with cutoff frequency fc = 1.5 kHz was used to improve the intelligibility of clean speech in the presence of noise (Niederjohn and Grotelueschen, 1976). Furthermore, the HPF increases the spectral energy center of gravity for speech, which is a characteristic of the Lombard effect. The HPF is included in the experiments to gauge the performance of CV ratio boosting using ERVU.

Fig. 1. Spectral flatness measure (SFM) for the word "clarification". Window length is 20 ms with 50% overlap. The two horizontal lines represent the Schmidt trigger decision boundaries between voiced and unvoiced regions (0.36 and 0.47). See Fig. 2 for time domain plot showing transition labels.

Fig. 2. Time domain signal for the word "clarification" with SFM depicted in Fig. 1. The vertical lines represent the voiced/unvoiced decision boundaries determined from the SFM and Schmidt trigger thresholds in Fig. 1. Regions labeled V denote voiced regions, and all other regions are considered unvoiced.

3.2. Subjects
The listening test was performed by 25 individuals who reported having no known hearing impairments. Listener age was 27 ± 5 years, and all listeners were engineering graduate students who had no prior experience with speech research studies involving listening tests. Four listeners were female and 21 were male, and 9 listeners reported being native speakers of English while 16 listeners reported being non-native speakers of English. All listeners were sufficiently proficient with English to enter an English-speaking university in the United States. The native countries of the listeners and the number of listeners from each country were the following: USA (8), China (6), India (7), Turkey (1), Portugal (1), Pakistan (1), and Spain (1).
3.3. Stimuli
The vocabulary used was from the confusable sets listed in Table 1. The confusable sets were used by Junqua (1993) in experiments on the Lombard effect and represent a challenging vocabulary for intelligibility. The words were drawn from the TI-46 speech corpus (Doddington and Schalk, 1981). The corpus consists of isolated utterances of each
Table 1
Vocabulary sets of confusable words used for human listening test
Words were taken from the TI-46 speech corpus, spoken in isolation by 8 male and 8 female speakers as part of a corpus of 46 words. Speakers were not instructed to pronounce the words differentially as if listed in the following confusable sets. The letters below were pronounced as the names of the letters in the alphabet (e.g., /ef/ for "F," /et/ for "EIGHT," /bi/ for "B," /em/ for "M").

Set      Words
S-set    F, S, X, YES
A-set    A, H, K, EIGHT
E-set    B, C, D, E, G, P, T, V, Z, THREE
M-set    M, N

word in the vocabulary spoken 26 times by 16 speakers (8 male and 8 female) in a quiet environment (SNR > 30 dB). The speakers were not instructed to pronounce the words in any way that may aid in their recognition with respect to the confusable sets chosen for this experiment. That is, no emphasis was placed on the discriminating phonemes of each word (e.g., the phoneme /f/ in the word "F" in the S-set).
3.4. Procedure
Each test participant listened to utterances of single words in a two-choice, forced-decision experiment. For each trial of the experiment, pairs of words were displayed on a personal computer using a graphical user interface. An utterance of one of the displayed words was randomly drawn from the TI-46 speech corpus, then one of the three conditions (unmodified control, ERVU, or the HPF) was randomly selected. Utterances from trials that were selected for control were not modified, while the other test conditions (ERVU or the HPF) were applied accordingly. After the selected condition was appropriately applied, the utterance was corrupted with additive white Gaussian noise before being played through Sony MDR-V200 padded stereo headphones for the listener to hear and identify. During each trial, the test participant could listen to the noisy utterance as many times as necessary before choosing one of the two displayed words. Test conditions and responses for each trial were recorded automatically as part of the graphical user interface software, and the listener could not advance to the next trial without responding to the current trial. To select the pair of words displayed during each trial, one of the four confusable sets was randomly selected from Table 1, then two words from the set were randomly selected.
All random selections were made with equal a priori probabilities; thus, each of the four confusable sets was used approximately the same number of times during each experiment, and the same was true for each of the 16 speakers and each of the conditions under test. For all trials using ERVU, consonant regions were boosted by 7.4 dB before normalization. The boost in CV ratio was similar to that determined to be optimum for consonants as reported by Kennedy et al. (1998). The Schmidt trigger lower and upper thresholds used by the spectral flatness measure were 0.36 and 0.47, respectively.
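As a rough illustration of the stimulus preparation described above, the following Python sketch applies a fixed 7.4 dB boost to samples flagged as unvoiced, rescales the word so that its total energy is unchanged, and then adds white Gaussian noise at a requested SNR. The function names (`ervu_gain`, `add_white_noise`), the per-sample boolean mask, and the use of Python's `random` module are assumptions made for illustration. The 10 ms gain interpolation and the clipping step are omitted, so this should be read as a sketch under those assumptions rather than the implementation used in the experiments.

```python
import math
import random


def ervu_gain(samples, unvoiced, boost_db=7.4):
    """Boost unvoiced samples by boost_db, then rescale to the original energy.

    samples  -- list of float sample values for one word
    unvoiced -- list of bools, True where the sample lies in an unvoiced region
                (hypothetical per-sample mask; the paper decides per frame)
    """
    gain = 10.0 ** (boost_db / 20.0)  # dB boost expressed as an amplitude factor
    boosted = [s * gain if u else s for s, u in zip(samples, unvoiced)]

    energy_in = sum(s * s for s in samples)
    energy_out = sum(s * s for s in boosted)
    if energy_out == 0.0:
        return list(samples)

    # Normalizing gain: the modified word keeps the energy of the original word.
    norm = math.sqrt(energy_in / energy_out)
    return [s * norm for s in boosted]


def add_white_noise(samples, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power = snr_db."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


if __name__ == "__main__":
    word = [0.5, -0.5, 0.4, -0.4, 0.05, -0.05, 0.04, -0.04]
    mask = [False, False, False, False, True, True, True, True]
    processed = ervu_gain(word, mask)
    noisy = add_white_noise(processed, snr_db=0.0, rng=random.Random(1))
    print(len(noisy))
```

Because the normalization is applied after the boost, voiced samples end up attenuated and unvoiced samples amplified, while the ratio between them changes by exactly the requested boost.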
Each experiment consisted of 500 trials. The SNR of the first 250 trials was 0 dB, and the SNR of the second 250 trials was −10 dB. The volume was set to a comfortable level before the test of the first participant, and the gain was fixed for all tests for all participants. Each test was completed in 30–45 min, and no effects from fatigue were observed. The first 25 responses at 0 dB and −10 dB SNR of each test participant were considered practice and were discarded for the following analysis.
4. Results
4.1. Condition effects
A 2 × 4 × 3 × 2 analysis of variance (ANOVA) was performed on the dependent variable (score) for the four factors SNR, vocabulary set, condition, and listener nativeness. All factors were treated as fixed effects. The dependent variable was the output of each trial for each listener: 0 for an incorrect response and 1 for a correct response. Therefore, mean results for each factor and interaction from the ANOVA were interpreted as percent correct. The residual error variance of the modeling in the ANOVA was not significantly affected by transformation of percentage scores using the arc sine root transform (Hopkins, 2000); therefore, the analysis was performed, and results are reported, as percent correct. Perfect recognition was 100% correct, while the chance level was 50%. All four main effects were significant at a confidence level of 95% [SNR: F = 168, degrees of freedom (d.f.) = 1, p < 0.0001; vocabulary set: F = 38, d.f. = 3, p < 0.0001; condition: F = 4.7, d.f. = 2, p < 0.01; listener nativeness: F = 32, d.f. = 1, p < 0.001]. Using the Tukey–Kramer post-hoc test, the mean percentage scores at 0 dB and −10 dB SNR were 89 ± 0.6% and 77 ± 0.6%, respectively. The two SNR levels were chosen such that the listening experiments were neither too easy nor too hard, and the post-hoc results validate the choice of SNR levels, although the scores at 0 dB SNR showed signs of ceiling effects. Test scores varied significantly due to vocabulary set as well.
Using the Tukey–Kramer post-hoc test, the mean percentage scores for the S-set, A-set, E-set, and M-set were 85 ± 0.9%, 89 ± 1.0%, 86 ± 0.6%, and 73 ± 1.1%, respectively. The percentage scores for the M-set were much lower than scores for the other sets, similar to the results reported by Junqua (1993).
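The scoring convention used in this section (0 for an incorrect response, 1 for a correct response, with means read as percent correct) and the arc sine root transform can be sketched in a few lines of Python. The function names are hypothetical, and the transform is written in one common form, asin(sqrt(p)); the source does not state which variant was used, so that choice is an assumption.

```python
import math


def percent_correct(responses):
    """Mean of 0/1 trial outcomes, expressed as a percentage."""
    return 100.0 * sum(responses) / len(responses)


def arcsine_root(p):
    """Arc sine root transform of a proportion p in [0, 1] (one common form)."""
    return math.asin(math.sqrt(p))


if __name__ == "__main__":
    trials = [1, 1, 0, 1, 1, 0, 1, 1]  # made-up outcomes, not data from the study
    score = percent_correct(trials)
    print(score, arcsine_root(score / 100.0))
```

The transform stretches proportions near 0 and 1, which is why it is often checked when scores approach the ceiling.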
Whether or not a listener was a native speaker of English was also a significant factor. The mean percentage scores from the post-hoc tests for native and non-native English speakers were 86 ± 0.8% and 80 ± 0.6%, respectively. However, nativeness was not a significant factor in any interactions, although the interaction of nativeness and vocabulary was marginally significant (F = 2.3, d.f. = 3, p = 0.074). All other interactions involving nativeness had p > 0.66. The remaining main effect, condition, also produced significant differences among percentage scores for the control, ERVU, and the HPF. Using the Tukey–Kramer post-hoc test, the mean scores
Fig. 3. Tukey–Kramer post-hoc test means and standard errors for the interaction SNR, vocabulary set, and condition. For each vertical bar, the upper error bars are for results at 0 dB SNR, while the lower error bars are for results at −10 dB SNR.
for the control, ERVU, and the HPF were 85 ± 0.8%, 88 ± 0.8%, and 87 ± 0.8%, respectively. All remaining two-way interactions among SNR, condition, and vocabulary set were significant except for the interaction of SNR and vocabulary set (F = 1.1, d.f. = 3, p = 0.36). The three-way interaction of SNR, vocabulary set, and condition was also significant (F = 5.1, d.f. = 6, p < 0.001), and the results from the Tukey–Kramer post-hoc test are plotted in Fig. 3.
4.2. Speaker effects
The current study used speech material from 8 male and 8 female speakers from the TI-46 corpus. The effects of speaker on percentage score were analyzed with a 2 × 3 × 16 ANOVA on the dependent variable (score) for the three factors SNR, condition, and speaker. The large number of degrees of freedom prohibited the inclusion of the factors vocabulary and listener nativeness that were used in the previous ANOVA. To decrease the variation in score due to vocabulary, the M-set was removed from the analysis. Fig. 3 shows that percentage scores for the M-set were significantly lower than the scores for the other three vocabulary sets. All main effects were significant at a confidence level of 95% [SNR: F = 181, d.f. = 1, p < 0.001; condition: F = 13.3, d.f. = 2, p < 0.001; speaker: F = 4.4, d.f. = 15, p < 0.001]. In addition, all two-way interactions were significant [SNR and speaker: F = 1.7, d.f. = 15, p < 0.05; SNR and condition: F = 5.0, d.f. = 2, p < 0.01; speaker and condition:
Fig. 4. Tukey–Kramer post-hoc means and standard errors for three-way interaction of SNR, condition, and speaker. For each vertical bar, the upper error bars are for results at 0 dB SNR, while the lower error bars are for results at −10 dB SNR. In (a), the results are for the 8 female speakers F1–F8. In (b), the results are for the 8 male speakers M1–M8. (a) Female speakers and (b) male speakers.
F = 2.2, d.f. = 30, p < 0.001]. The interaction of all terms was significant as well [SNR, condition, and speaker: F = 2.0, d.f. = 30, p < 0.001]. Means and standard errors from Tukey–Kramer post-hoc tests for the three-way interaction are shown in Fig. 4. The speaker labels are the same as those used in the TI-46 corpus. When the factor speaker was replaced with speaker gender in the ANOVA, speaker gender was not a significant main effect (F = 0.9, d.f. = 1, p > 0.30). Even after the factors listener nativeness and vocabulary were added to the ANOVA, speaker gender still was not a significant main effect (F = 1.3, d.f. = 1, p > 0.25). Junqua (1993) reported in a similar experiment that intelligibility of Lombard speech was higher or lower than normal speech, depending on the speaker gender. In contrast, our results show that intelligibility due to one aspect of the Lombard effect, namely CV ratio boosting, is not dependent on speaker gender.
5. Discussion
5.1. Condition effects
The results in Fig. 3 considering the three-way interaction among condition, SNR, and vocabulary set show that the S-set produced significantly different recognition scores between the control and both ERVU and the HPF. The HPF produced a significantly lower score for the S-set at 0 dB in contrast to a significantly higher score at −10 dB, compared to the control. ERVU produced similar results for the S-set compared to the control, although the results at 0 dB were only marginally significant. The control results for the S-set had the largest change in score between 0 and −10 dB SNR for any vocabulary group, indicating an acute sensitivity by the control of the S-set to the noise source (white Gaussian). Although they operate on different parts of the signal, ERVU and the HPF maintain cues for intelligibility that aid in distinguishing words in the S-set at −10 dB while recognition using the control dropped sharply.
Experiments with different noise sources and more trials would better explain the results for the S-set. Both ERVU and the HPF significantly improved the intelligibility of the E-set compared to the control according to the post-hoc tests in Fig. 3. The HPF was marginally significant at 0 dB while ERVU was marginally significant at −10 dB. The E-set, with 10 words, was the largest set tested and
exhibited the most phonetic differences among word pairs of any set. These results are encouraging when considering the performance of the HPF and ERVU in open-vocabulary experiments where phonetic differences increase.
5.2. SNR effects
Post-hoc tests of the differences between the control and the algorithms ERVU and the HPF produced significant differences only at −10 dB SNR. Observations of the scores revealed that the test at 0 dB SNR was significantly affected by ceiling effects. At −10 dB SNR, control recognition error was 24.9%, compared to 19.2% for ERVU and 16.7% for the HPF. That is, relative to the control at −10 dB, ERVU reduced recognition error by 23% [(24.9 − 19.2)/24.9] and the HPF reduced recognition error by 33% [(24.9 − 16.7)/24.9]. For the two-choice experiment employed, random chance was 50% error for all vocabulary sets, so the results at −10 dB can be considered the least affected by floor and ceiling effects since the control error was midway between that for perfect recognition and chance.
5.3. Nativeness effects
Non-native English speakers scored significantly lower than native speakers because of a lack of experience with the acoustic cues distinguishing the words in each confusable set. However, nativeness had no significant interactions with any of the other factors. These results indicate that the acoustic cues not used by non-native speakers were evenly distributed among the vocabulary sets tested and also that the acoustic cues enhanced by either ERVU or the HPF were utilized to the same extent by both groups of listeners. These results contrast with the findings of Bradlow and Bent (2002), who concluded that listeners for whom English was a non-native language benefited from English clear speech but not nearly to the extent to which native listeners benefited.
This suggests that CV ratio boosting benefits both native and non-native listeners of English to the same degree while other cues present in clear speech (e.g., rate changes, hyperarticulation) provide more benefit to listeners thoroughly familiar with English. However, the test material in the experiments of Bradlow and Bent used open-vocabulary meaningful sentences as opposed to the closed-vocabulary isolated words used in the current work, and the difference in test
material may partly explain the contrasting conclusions regarding native listener performance between the previous study and the current study.
5.4. Speaker effects
The results in Fig. 4 show that, at 0 dB SNR, results from ERVU were significantly higher than the control for 6 speakers, while the control was significantly higher than ERVU for only 2 speakers. The maximum difference between results from ERVU and the control was 33 percentage points (M7) while the minimum difference was 12 percentage points (F1). At −10 dB SNR, results from ERVU were significantly higher than the control for 7 speakers, while the control was significantly higher than ERVU for only 1 speaker. Differences between ERVU and the control for all other speakers were not significant. The maximum difference between results from ERVU and the control was 19 percentage points (F6) while the minimum difference was 7 percentage points (F4). Results from the HPF, at 0 dB SNR, were significantly greater than the control for only 1 speaker (M7), while the results from the control were significantly greater than results from the HPF for 2 speakers. At −10 dB SNR, the results from the HPF were significantly greater than the control for 6 speakers, while the results from the control were not significantly greater than results from the HPF for any speakers. Differences between the HPF and the control for all other speakers were not significant. Averaged over both SNRs, results from 9 of the 16 speakers were significantly positively affected by either ERVU or the HPF compared to the control. ERVU did not significantly negatively affect results for any of the speakers, and only one speaker (F1) had significantly lower results for the HPF compared to the control.
These results show that intelligibility significantly increased for about half of all speakers when either ERVU or the HPF was used, and intelligibility did not significantly decrease when either ERVU or the HPF was used for the vast majority of speakers.
5.5. Alternative explanation of algorithm effectiveness
The increase in intelligibility due to CV ratio boosting can be explained outside of the context of clear or Lombard speech. Early research on speech intelligibility used low-pass/high-pass filters
to characterize intelligibility as a function of filter cutoff frequency (Fletcher, 1953). Fletcher suggested that speech contains redundant phonetic information in the frequency domain after finding that vowels and nasals were recognized more accurately than fricatives and stops after filtering. In other words, phonetic information in vowels and nasals is distributed throughout the spectrum while the information in fricatives and stops is concentrated in spectral sub-bands. Similar results were found for filtered phonemes corrupted with additive noise (Miller and Nicely, 1955). In that experiment, voicing and nasality were little affected by the additive noise, while fricative, place-of-articulation and duration features were more affected. Results from subsequent listening tests demonstrated that the dynamic information between consonants and vowels in a CVC word was sufficient for vowel recognition even when the vowel was removed from the word (Strange et al., 1983). These studies show that speech contains regions in time and frequency that (1) contain phonetic cues important for intelligibility, and (2) vary in robustness to additive noise. Both ERVU and the HPF are effective because they directly target the above-described regions of speech. The HPF boosts the amplitudes of the second and higher formants at the expense of the first formant, thus improving the local SNR for the higher formants. This strategy is effective when the SNR around the first formant is sufficiently high, as is true for the white noise used in the experiments of this work. ERVU employs CV ratio boosting, which primarily improves the SNR for low-energy frames of speech.
5.6. Spectral flatness measure
Of practical interest is the effectiveness of the automated segmentation routine used by ERVU. Fig. 5 shows a receiver operating characteristic (ROC) curve for the spectral flatness measure with a single threshold.
The ROC curve, generated from 1000 phonetically labeled TIMIT sentences, indicates the trade-off between false positives and false negatives for detecting voiced segments using the SFM. The "X" in Fig. 5 indicates the operating point from the Schmidt trigger thresholds used in the current work and lies just inside the single-threshold ROC. For the current work using the Schmidt trigger thresholds, voiced frames were classified as unvoiced at a rate of 2.9% while unvoiced frames were classified as voiced at a rate of 18.8%.
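The voicing decision discussed here can be illustrated with a minimal sketch. It assumes the common definition of the spectral flatness measure as the ratio of the geometric mean to the arithmetic mean of a frame's power spectrum, and it assumes that low flatness indicates a voiced frame. The function names, the Hann window, and the threshold values are hypothetical choices for illustration and are not taken from the experiments reported in this work.

```python
import numpy as np

def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the frame's power spectrum.

    Values near 0 indicate a peaky (harmonic) spectrum; values near 1
    indicate a flat (noise-like) spectrum. The Hann window and the eps
    floor are illustrative assumptions.
    """
    windowed = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2 + eps
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

def schmidt_trigger(sfm_values, low, high, start_voiced=False):
    """Two-threshold (hysteresis) voicing decision per frame.

    The state switches to voiced when the SFM drops below `low` and back
    to unvoiced when it rises above `high`; between the two thresholds
    the previous decision is kept. `low` and `high` are hypothetical.
    """
    voiced = start_voiced
    decisions = []
    for s in sfm_values:
        if voiced and s > high:
            voiced = False
        elif not voiced and s < low:
            voiced = True
        decisions.append(voiced)
    return decisions
```

For instance, with hypothetical thresholds low = 0.1 and high = 0.3, the SFM sequence 0.5, 0.05, 0.2, 0.2, 0.5, 0.2 yields the decisions unvoiced, voiced, voiced, voiced, unvoiced, unvoiced, whereas a single threshold at 0.15 would mark only the second frame as voiced. The hysteresis keeps the decision from toggling on frames whose flatness hovers between the two thresholds.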
[Figure 5: ROC plot; axes are false positives, % and false negatives, %; annotation: Schmidt trigger threshold, false negative 2.9%, false positive 18.8%.]
Fig. 5. ROC curve for spectral flatness measure using single threshold. The 'X' indicates the operating point from the Schmidt trigger thresholds used in the current work.
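A single-threshold ROC of this kind can be traced by sweeping one threshold over per-frame SFM values and comparing the decisions with reference voicing labels. The sketch below is illustrative only: it assumes a frame is called voiced when its SFM falls below the threshold, follows the error definitions used in this section (a false negative is a voiced frame classified as unvoiced, a false positive is an unvoiced frame classified as voiced), and uses hypothetical function names. Both classes are assumed to be present in the labels.

```python
import numpy as np

def error_rates(sfm, is_voiced, threshold):
    """Return (false_negative_pct, false_positive_pct) for one threshold.

    A frame is predicted voiced when its SFM is below `threshold`
    (an assumed decision rule for this sketch).
    """
    sfm = np.asarray(sfm, dtype=float)
    is_voiced = np.asarray(is_voiced, dtype=bool)
    predicted_voiced = sfm < threshold
    # False negative: a voiced frame classified as unvoiced.
    fn = 100.0 * np.mean(~predicted_voiced[is_voiced])
    # False positive: an unvoiced frame classified as voiced.
    fp = 100.0 * np.mean(predicted_voiced[~is_voiced])
    return float(fn), float(fp)

def roc_points(sfm, is_voiced, thresholds):
    """Sweep thresholds and collect one (FN %, FP %) point per threshold."""
    return [error_rates(sfm, is_voiced, t) for t in thresholds]
```

Raising the threshold labels more frames as voiced, which lowers the false negative rate and raises the false positive rate; plotting the resulting points gives the trade-off curve against which a two-threshold operating point can be compared.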
The spectral flatness measure relies on sharp contrast between spectral peaks and adjacent valleys, and additive noise can raise the noise floor, which significantly reduces spectral contrast. However, the current work demonstrates that a simple segmentation scheme used with CV ratio boosting is sufficient to significantly improve speech intelligibility. Future work remains to develop a more robust voicing detector and also to find the optimum operating point on the detector's ROC curve. 6. Conclusions We have extended the previous experiments on CV ratio boosting in the following four ways. First, we used speech from 8 male and 8 female speakers in our recognition experiments. Previous studies that applied phenomena from clear and Lombard speech considered very few speakers (typically 1–4), and we feel the experimental results would be more representative coming from a larger population of speakers. Results from the current study show that ERVU or the HPF improves intelligibility for about half the speakers (9 of 16) without degrading the intelligibility of the other speakers. Second, we used listeners who were native speakers of American English as well as listeners who were non-native speakers of American English. Bradlow and Bent (2002) showed that listeners who were native speakers of English benefited more from clear speech than did listeners who were non-native English speakers, but our results show that the increase in percentage
scores due to CV ratio boosting was the same for both groups of listeners. Our results suggest that some aspects of clear speech, specifically an increased CV ratio, are equally beneficial to all listeners, regardless of their extent of familiarity with English. Third, we used a simple, real-time segmentation algorithm, instead of hand labeling, to distinguish consonants from vowels. Hand labeling is not appropriate for real-time implementation. Fourth, we used monosyllabic words, instead of logatomes, from the TI-46 speech corpus, arranged in the same confusable sets as used by Junqua (1993). Confusable sets offer a challenging recognition task and are easily administered in a closed-vocabulary test environment. Furthermore, the corpus of confusable words emphasizes word recognition over phoneme recognition and is a stepping stone towards recognition of sentential material. The application of CV ratio boosting on conversational speech in real time is the ultimate goal of this work. Several interesting questions about ERVU and CV ratio boosting remain unanswered. Since ERVU and the HPF target different speech cues, combining the algorithms should improve intelligibility beyond that for each algorithm separately, provided that one algorithm does not negatively impact the cues targeted by the other. Also, algorithm effectiveness depends on the acoustic noise characteristics. Energetic white noise was employed in the current work, yet other noise sources may expose strengths and weaknesses of the algorithms not considered here. Acknowledgement This work was funded in part by the Motorola iDEN Technology Group. The authors would like to thank Rahul Shrivastav for fruitful discussions on the manuscript subject. References Boll, S.F., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27 (2), 113–120. Bradlow, A.R., Bent, T., 2002. The clear speech effect for nonnative listeners. J. Acoust. Soc. Amer.
112 (1), 272–284. Doddington, G.R., Schalk, T.B., 1981. Speech recognition: turning theory to practice. IEEE Spectrum, 26–32. Ephraim, Y., Van Trees, H.L., 1995. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 3 (4), 251–266. Ferguson, S.H., Kewley-Port, D., 2002. Vowel intelligibility in clear and conversational speech for normal-hearing and
hearing-impaired listeners. J. Acoust. Soc. Amer. 112 (1), 259–271. Fletcher, H., 1953. Speech and Hearing in Communication. D. Van Nostrand Company, Inc., New York. Furui, S., 1986. On the role of spectral transition for speech perception. J. Acoust. Soc. Amer., 1016–1025. Gordon-Salant, S., 1986. Recognition of natural and time/intensity altered CVs by young and elderly subjects with normal hearing. J. Acoust. Soc. Amer. 80 (6), 1599–1607. Gray Jr., A.H., Markel, J.D., 1974. A spectral-flatness measure for studying the autocorrelation method of linear prediction of speech analysis. IEEE Trans. Acoust. Speech Signal Process. 22 (3), 207–217. Harris, J.G., Skowronski, M.D., 2002. Energy redistribution speech intelligibility enhancement, vocalic and transitional cues. J. Acoust. Soc. Amer. 112 (5), 2305. Hazan, V., Simpson, A., 1998. The effect of cue-enhancement on the intelligibility of nonsense word and sentence materials presented in noise. Speech Comm. 24 (3), 211–226. Hopkins, W.G., 2000. A new view of statistics. Internet Society for Sport Science. Available from:
(April 30 2005). Junqua, J.C., 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. J. Acoust. Soc. Amer. 93 (1), 510–524. Kennedy, E., Levitt, H., Neuman, A.C., Weiss, M., 1998. Consonant–vowel intensity ratios for maximizing consonant recognition by hearing-impaired listeners. J. Acoust. Soc. Amer. 103 (2), 1098–1114.
Miller, G.A., Nicely, P.E., 1955. An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Amer. 27 (2), 338–352. Niederjohn, R.J., Grotelueschen, J.H., 1976. The enhancement of speech intelligibility in high noise levels by high-pass filtering followed by rapid amplitude compression. IEEE Trans. Acoust. Speech Signal Process. 24 (4), 277–282. Payton, K.L., Uchanski, R.M., Braida, L.D., 1994. Intelligibility of conversational and clear speech in noise and reverberation for listeners with normal and impaired hearing. J. Acoust. Soc. Amer. 95 (3), 1581–1592. Picheny, M.A., Durlach, N.I., Braida, L.D., 1985. Speaking clearly for the hard of hearing I: intelligibility differences between clear and conversational speech. J. Speech Hear. Res. 28, 96–103. Picheny, M.A., Durlach, N.I., Braida, L.D., 1986. Speaking clearly for the hard of hearing II: acoustic characteristics of clear and conversational speech. J. Speech Hear. Res. 29, 434–445. Reinke, T.L., August 2001. Automatic speech intelligibility enhancement. Master's thesis, University of Florida, Gainesville, FL, USA. Strange, W., Jenkins, J.J., Johnson, T.L., 1983. Dynamic specification of coarticulated vowels. J. Acoust. Soc. Amer. 74 (3), 695–705. Summers, W.V., Pisoni, D.B., Bernacki, R.H., Pedlow, R.I., Stokes, M.A., 1988. Effects of noise on speech production: acoustic and perceptual analyses. J. Acoust. Soc. Amer. 84 (3), 917–928.