Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environments



Speech Communication 48 (2006) 549–558 www.elsevier.com/locate/specom


Mark D. Skowronski *, John G. Harris

Computational Neuro-Engineering Lab, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, USA

Received 22 December 2004; received in revised form 27 July 2005; accepted 21 September 2005

Abstract

Previous studies have documented phenomena involving the modification of human speech in special communication circumstances. Whether speaking to a hearing-impaired person (clear speech) or in a noisy environment (Lombard speech), speakers tend to make similar modifications to their normal, conversational speaking style in order to increase the understanding of their message by the listener. One strategy characteristic of the above speech types is to increase consonant power relative to the signal power of adjacent vowels and is referred to as consonant–vowel (CV) ratio boosting. An automated method of speech enhancement using CV ratio boosting is called energy redistribution voiced/unvoiced (ERVU). To characterize the performance of ERVU, 25 listeners responded to 500 words in a two-word, forced-choice experiment in the presence of energetic masking noise. The test material was a vocabulary of confusable monosyllabic words spoken by 8 male and 8 female speakers, and the conditions tested were a control (unmodified speech), ERVU, and a high-pass filter (HPF). Both ERVU and the HPF significantly increased recognition accuracy compared to the control. Nine of the 16 speakers were significantly more intelligible when ERVU or the HPF was used, compared to the control, while no speaker was less intelligible. The results show that ERVU successfully increased intelligibility of speech using a simple automated segmentation algorithm, applicable to a wide variety of communication systems such as cell phones and public address systems.

© 2005 Elsevier B.V. All rights reserved.

PACS: 43.72.Ew; 43.60.Dh; 43.71.Es

Keywords: Clear speech; Speech enhancement; Energy redistribution

Supported by the iDEN Technology Group of Motorola.

* Corresponding author. Tel.: +1 352 392 2626; fax: +1 352 392 0044. E-mail address: [email protected]fl.edu (M.D. Skowronski).

0167-6393/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2005.09.003

1. Introduction

Speech-enhancement algorithms improve the quality of speech (naturalness and intelligibility) in communication channels, such as phone networks or public address systems. The speech signal is degraded during communication by noise from three sources: (1) acoustic noise in the talker's environment (recording environment), (2) channel noise (e.g., electronic echo in phone networks, compression artifacts), and (3) acoustic noise in the listener's environment (broadcast environment). Noise reduction techniques have been used to reduce the noise power from the first two sources relative to the speech signal power (Boll, 1979; Ephraim and Van Trees, 1995). However, noise reduction algorithms generally do not target noise added to the speech signal after the electro-acoustic transducer, such as the speaker of a telephone handset or public address system.

A simple solution is to increase the power of the speech signal before broadcasting in the noisy listener's environment, thus increasing the signal-to-noise ratio (SNR). But what can be done when "turning up the volume" is no longer an option? Assuming the broadcast transducer is operating at its peak power output without clipping (i.e., the maximum volume on a cell phone or a public address system), a speech-enhancement algorithm has been proposed by the authors that redistributes the speech signal energy in order to increase intelligibility (Harris and Skowronski, 2002). Two variants of the energy redistribution algorithm emphasize temporal regions critical to intelligibility: energy redistribution voiced/unvoiced (ERVU), which boosts unvoiced regions, and energy redistribution spectral transitions (ERST), which boosts regions of large spectral change. The current work shows the similarities between ERVU and speech-production characteristics of clear and Lombard speech, and extends the previous analysis of ERVU by considering the effects of listener nativeness and talker differences on speech intelligibility.

2. Clear and Lombard speech

Research from the past two decades has investigated various types of speech in the presence of degraded communication conditions. A significant portion of that research was designed to study the changes made during speech communication in noisy acoustic environments in order to better understand the strategies speakers employ to overcome the degraded communication conditions and also to apply the acquired knowledge to algorithms that improve speech communications in man-made devices and systems (e.g., hearing aids, public address systems, and cell phones).

2.1. Clear speech

A seminal work in this area is the study of clear speech by Picheny et al. (1985). Clear speech is the style of speaking a speaker adopts when talking with someone who has a hearing impairment. Clear speech is characterized by a slower speaking rate, more and longer pauses, elevated speech intensity, increased word duration, "targeted" vowel formants, increased consonant intensity compared to adjacent vowels, and phonological changes such as fewer reduced vowels and more released stop bursts (Picheny et al., 1986). Using nonsense English sentences, clear speech was shown to increase the intelligibility of speech for hearing-impaired as well as normal-hearing listeners (Payton et al., 1994). Furthermore, using simple English sentences, clear speech was shown to increase intelligibility for normal-hearing native listeners more so than for non-native listeners (Bradlow and Bent, 2002). In a study of the importance of acoustic cues for vowel recognition for young and elderly listeners, monosyllabic /bVd/ words were excised from sentences of clear and conversational speech (Ferguson and Kewley-Port, 2002). The young normal-hearing listeners averaged a higher vowel recognition score on the clear speech compared to conversational speech, while elderly hearing-impaired listeners had the same average score for both clear and conversational speech. One explanation of these results is that clear speech enhances many cues for speech intelligibility, not all of which are utilized by hearing-impaired listeners.

2.2. Lombard speech

A style of communication related to clear speech is the Lombard effect. In the Lombard effect, a speaker talking in a noisy environment makes several vocal changes in order to improve intelligibility of the speech signal against the competing acoustic noise. The changes in vocal characteristics include an increase in speech signal amplitude, slower delivery, higher pitch, higher formants, and an increase in the spectral energy center of gravity as well as an increase in the consonant-to-noise ratio and decrease in spectral tilt (Junqua, 1993; Summers et al., 1988). Junqua showed that Lombard speech may be more or less intelligible than conversational speech, depending on the vocabulary under test, noise type, and speaker gender.

2.3. Applications

Some success has been reported in applying the observed phenomena of clear and Lombard speech to speech enhancement. Gordon-Salant (1986) boosted the consonant–vowel (CV) ratio of hand-labeled nonsense CV words by 10 dB and also linearly doubled the duration of words in an intelligibility experiment for normal-hearing listeners in the presence of masking babble noise. While the crude duration increase yielded negligible benefits, boosting the CV ratio by 10 dB (the difference between clear and conversational speech observed by Picheny et al.) significantly improved CV word intelligibility for both young and elderly normal-hearing listeners. A similar experiment was performed by Kennedy et al. (1998), who investigated the effects of increasing CV intensity ratio on intelligibility for hearing-impaired listeners. With hand-labeled nonsense VC words, consonants were enhanced in separate trials with gains ranging from 0 to 24 dB. The consonant enhancement at maximum recognition, CEmax, was 8 dB for voiced consonants and 11 dB for unvoiced consonants.

The experiments described below used the ERVU algorithm to extend the existing experiments on CV ratio boosting. The goals of the experiments were the following: (1) quantify the performance of CV ratio boosting using the ERVU algorithm, and (2) observe the interactions of CV ratio boosting for the following factors: vocabulary set, SNR, and listener nativeness (i.e., whether the listener was a self-described native or non-native speaker of English).

3. Method

In the following experiments, speech enhancement using CV ratio boosting was characterized with closed-vocabulary tests. Boosting was achieved using the ERVU algorithm, which redistributes energy across time, and was compared with a high-pass filter (HPF), which redistributes energy across frequency.

3.1. Algorithms

Two methods of automatic speech enhancement using energy redistribution were developed previously (Harris and Skowronski, 2002). The first method, ERVU, takes energy from voiced regions and boosts unvoiced regions while conserving global energy (over the duration of words). Vowels, liquids, nasals, diphthongs, and voiced plosives and fricatives are attenuated in their voiced regions while all other phonemes are amplified. The decision boundary between voiced and unvoiced regions is determined using a voice activity detector.

The second method, ERST, takes energy from spectrally stationary regions and boosts spectral transitions (Reinke, 2001). Spectral transitions are identified using cepstral-based features. Furui showed that spectral transitions contain important intelligibility cues in an experiment using truncated vowels in CV utterances (Furui, 1986). When vowels were truncated beyond a certain point, intelligibility greatly decreased. These points were found to occur in regions of the greatest spectral change. Boosting spectral transitions in hand-labeled logatomes (nonsense consonant–vowel combinations) and sentential material was also demonstrated to improve intelligibility in noisy environments (Hazan and Simpson, 1998).

The current work details the performance of energy redistribution using voiced/unvoiced information (ERVU). Several voice activity detectors are available from speech codec standards to segment voiced and unvoiced regions. One compact and effective algorithm is the spectral flatness measure (SFM) (Gray and Markel, 1974). Eq. (1) is the expression for the SFM used for voice activity detection in ERVU.

\[
\mathrm{SFM}(j) = \frac{\left( \prod_{k=1}^{N} X_j(k) \right)^{1/N}}{\frac{1}{N} \sum_{k=1}^{N} X_j(k)} \qquad (1)
\]

where X_j is the magnitude of the N-point DFT of the jth temporal window of the uttered word. The SFM is the ratio of the geometric mean to the arithmetic mean of the magnitude spectrum of windowed speech, bounded between zero and one. An SFM value near unity indicates that the spectrum is flat, while an SFM value near zero indicates a peaky spectrum. Voiced regions, with peaks at harmonics of the fundamental frequency, have low SFM values while unvoiced regions, which lack fundamental frequency harmonics, have high SFM values. A two-level Schmitt trigger decision boundary is used to provide robustness to variations in the SFM. The levels for the Schmitt trigger were determined using probability distribution functions of phoneme-labeled speech. Fig. 1 shows an example of the SFM for the word "clarification", and Fig. 2 shows the time domain signal of "clarification" with voiced/unvoiced boundaries delineated using the SFM and Schmitt trigger thresholds.

After the voiced/unvoiced decision is made, the unvoiced regions are amplified and the entire word is scaled by a normalizing gain factor such that the modified word energy is the same as the original word energy. The transition between voiced and unvoiced scale factors is smoothed by a 10 ms linear interpolation. Any spurious signals larger than the dynamic range of the original signal are clipped. Informal listening tests indicated that artifacts from such infrequent clipping were not perceptible for the scale factors used in the following experiments.

The ERVU algorithm redistributes energy across time, analogous to the redistribution of energy across frequency through filtering. A third-order Butterworth HPF with cutoff frequency fc = 1.5 kHz was used to improve the intelligibility of clean speech in the presence of noise (Niederjohn and Grotelueschen, 1976). Furthermore, the HPF increases the spectral energy center of gravity for speech, which is a characteristic of the Lombard effect. The HPF is included in the experiments to gauge the performance of CV ratio boosting using ERVU.

[Figure 1: SFM (vertical axis, 0 to 0.9) versus time (horizontal axis, 0 to 875 ms), with voiced and unvoiced regions labeled.]

Fig. 1. Spectral flatness measure (SFM) for the word "clarification". Window length is 20 ms with 50% overlap. The two horizontal lines represent the Schmitt trigger decision boundaries between voiced and unvoiced regions (0.36 and 0.47). See Fig. 2 for time domain plot showing transition labels.

[Figure 2: amplitude versus time (0 to 875 ms), with voiced regions labeled V.]

Fig. 2. Time domain signal for the word "clarification" with SFM depicted in Fig. 1. The vertical lines represent the voiced/unvoiced decision boundaries determined from the SFM and Schmitt trigger thresholds in Fig. 1. Regions labeled V denote voiced regions, and all other regions are considered unvoiced.

3.2. Subjects

The listening test was performed by 25 individuals who reported having no known hearing impairments. Listener age was 27 ± 5 years, and all listeners were engineering graduate students who had no prior experience with speech research studies involving listening tests. Four listeners were female and 21 were male, and 9 listeners reported being native speakers of English while 16 listeners reported being non-native speakers of English. All listeners were sufficiently proficient with English to enter an English-speaking university in the United States. The native countries of the listeners and the number of listeners from each country were the following: USA (8), China (6), India (7), Turkey (1), Portugal (1), Pakistan (1), and Spain (1).

3.3. Stimuli

The vocabulary used was from the confusable sets listed in Table 1. The confusable sets were used by Junqua (1993) in experiments of the Lombard effect and represent a challenging vocabulary for intelligibility. The words were drawn from the TI-46 speech corpus (Doddington and Schalk, 1981). The corpus consists of isolated utterances of each word in the vocabulary spoken 26 times by 16 speakers (8 male and 8 female) in a quiet environment (SNR > 30 dB). The speakers were not instructed to pronounce the words in any way that may aid in their recognition with respect to the confusable sets chosen for this experiment. That is, no emphasis was placed on the discriminating phonemes of each word (e.g., the phoneme /f/ in the word "F" in the S-set).

Table 1
Vocabulary sets of confusable words used for human listening test

Words were taken from the TI-46 speech corpus, spoken in isolation by 8 male and 8 female speakers as part of a corpus of 46 words. Speakers were not instructed to pronounce the words differentially as if listed in the following confusable sets. The letters below were pronounced as the names of the letters in the alphabet (e.g., /ef/ for "F," /et/ for "EIGHT," /bi/ for "B," /em/ for "M").

Set      Words
S-set    F, S, X, YES
A-set    A, H, K, EIGHT
E-set    B, C, D, E, G, P, T, V, Z, THREE
M-set    M, N

3.4. Procedure

Each test participant listened to utterances of single words in a two-choice, forced-decision experiment. For each trial of the experiment, a pair of words was displayed on a personal computer using a graphical user interface. An utterance of one of the displayed words was randomly drawn from the TI-46 speech corpus, then one of the three conditions (unmodified control, ERVU, or the HPF) was randomly selected. Utterances from trials that were selected for control were not modified, while the other test conditions (ERVU or the HPF) were applied accordingly. After the selected condition was appropriately applied, the utterance was corrupted with additive white Gaussian noise before being played through Sony MDR-V200 padded stereo headphones for the listener to hear and identify. During each trial, the test participant could listen to the noisy utterance as many times as necessary before choosing one of the two displayed words. Test conditions and responses for each trial were recorded automatically as part of the graphical user interface software, and the listener could not advance to the next trial without responding to the current trial.

To select the pair of words displayed during each trial, one of the four confusable sets was randomly selected from Table 1, then two words from the set were randomly selected. All random selections were made with equal a priori probabilities; thus, each of the four confusable sets was used approximately the same number of times during each experiment, and the same was true for each of the 16 speakers and each of the conditions under test.

For all trials using ERVU, consonant regions were boosted by 7.4 dB before normalization. The boost in CV ratio was similar to that determined to be optimum for consonants as reported by Kennedy et al. (1998). The Schmitt trigger lower and upper thresholds used by the spectral flatness measure were 0.36 and 0.47, respectively.
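The ERVU processing chain described in Section 3.1, with the parameter values just given (7.4 dB boost, thresholds of 0.36 and 0.47), can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names are hypothetical, the DFT is computed naively for clarity, the hysteresis direction of the trigger is an assumption, and the 10 ms gain interpolation and clipping steps are omitted.

```python
import cmath
import math


def spectral_flatness(frame):
    """Spectral flatness of one frame, following Eq. (1):
    geometric mean over arithmetic mean of the N-point DFT magnitudes."""
    n = len(frame)
    mags = []
    for k in range(n):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc) + 1e-12)  # small floor avoids log(0)
    log_geometric_mean = sum(math.log(m) for m in mags) / n
    arithmetic_mean = sum(mags) / n
    return math.exp(log_geometric_mean) / arithmetic_mean


def schmitt_voicing(sfm_values, low=0.36, high=0.47, start_voiced=False):
    """Two-level (hysteresis) voicing decision on a sequence of SFM values.
    Assumption: a frame becomes voiced when SFM drops below `low`, becomes
    unvoiced when SFM rises above `high`, and otherwise keeps its state."""
    voiced = start_voiced
    labels = []
    for s in sfm_values:
        if s < low:
            voiced = True
        elif s > high:
            voiced = False
        labels.append(voiced)
    return labels


def ervu_boost(samples, unvoiced_mask, boost_db=7.4):
    """Amplify unvoiced samples by `boost_db`, then rescale the whole word
    so that its total energy equals that of the original word."""
    gain = 10.0 ** (boost_db / 20.0)
    boosted = [x * gain if u else x for x, u in zip(samples, unvoiced_mask)]
    energy_in = sum(x * x for x in samples)
    energy_out = sum(x * x for x in boosted)
    scale = math.sqrt(energy_in / energy_out) if energy_out > 0 else 1.0
    return [x * scale for x in boosted]
```

In use, `spectral_flatness` would be applied to each analysis window of a word, `schmitt_voicing` would turn the resulting SFM track into frame labels, and the frame labels would be expanded to a per-sample mask for `ervu_boost`. Because the word is renormalized after boosting, voiced regions end up attenuated relative to the original even though only unvoiced regions are explicitly amplified.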

Each experiment consisted of 500 trials. The SNR of the first 250 trials was 0 dB, and the SNR of the second 250 trials was -10 dB. The volume was set to a comfortable level before the test of the first participant, and the gain was fixed for all tests for all participants. Each test was completed in 30–45 min, and no effects from fatigue were observed. The first 25 responses at 0 dB and -10 dB SNR of each test participant were considered practice and were discarded for the following analysis.

4. Results

4.1. Condition effects

A 2 × 4 × 3 × 2 analysis of variance (ANOVA) was performed on the dependent variable (score) for the four factors SNR, vocabulary set, condition, and listener nativeness. All factors were treated as fixed effects. The dependent variable was the output of each trial for each listener: 0 for an incorrect response and 1 for a correct response. Therefore, mean results for each factor and interaction from the ANOVA were interpreted as percent correct. The residual error variance of the modeling in the ANOVA was not significantly affected by transformation of percentage scores using the arc sine root transform (Hopkins, 2000); therefore, the analysis was performed, and results are reported, as percent correct. Perfect recognition was 100% correct, while the chance level was 50%.

All four main effects were significant at a confidence level of 95% [SNR: F = 168, degrees of freedom (d.f.) = 1, p < 0.0001; vocabulary set: F = 38, d.f. = 3, p < 0.0001; condition: F = 4.7, d.f. = 2, p < 0.01; listener nativeness: F = 32, d.f. = 1, p < 0.001]. Using the Tukey–Kramer post-hoc test, the mean percentage scores at 0 dB and -10 dB SNR were 89 ± 0.6% and 77 ± 0.6%, respectively. The two SNR levels were chosen such that the listening experiments were neither too easy nor too hard, and the post-hoc results validate the choice of SNR levels, although the scores at 0 dB SNR showed signs of ceiling effects.

Test scores varied significantly due to vocabulary set as well. Using the Tukey–Kramer post-hoc test, the mean percentage scores for the S-set, A-set, E-set, and M-set were 85 ± 0.9%, 89 ± 1.0%, 86 ± 0.6%, and 73 ± 1.1%, respectively. The percentage scores for the M-set were much lower than scores for the other sets, similar to the results reported by Junqua (1993).
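The scoring convention used in this analysis (each trial scored 0 or 1 and averaged into percent correct) and the arc sine root transform mentioned above can be illustrated with a short Python sketch. The function names are hypothetical, and the form asin(sqrt(p)) is one common convention for the transform (some texts scale it by 2); the exact form used in the cited reference is not stated here, so this is an assumption.

```python
import math


def percent_correct(outcomes):
    """Mean of 0/1 trial outcomes, expressed as a percentage."""
    return 100.0 * sum(outcomes) / len(outcomes)


def arcsine_root(proportion):
    """Arcsine square-root transform of a proportion in [0, 1], in radians.
    Assumed convention: asin(sqrt(p))."""
    return math.asin(math.sqrt(proportion))


def scored_trials(total_trials=500, practice_per_snr=25, snr_levels=2):
    """Trials kept per listener after discarding the practice responses."""
    return total_trials - practice_per_snr * snr_levels
```

With the procedure described above, `scored_trials()` gives 450 analyzed trials per listener, and a listener at the chance level of 50% maps to `arcsine_root(0.5)`, which equals π/4.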

Whether or not a listener was a native speaker of English was also a significant factor. The mean percentage scores from the post-hoc tests for native and non-native English speakers were 86 ± 0.8% and 80 ± 0.6%, respectively. However, nativeness was not a significant factor in any interactions, although the interaction of nativeness and vocabulary was marginally significant (F = 2.3, d.f. = 3, p = 0.074). All other interactions involving nativeness had p > 0.66.

The remaining main effect, condition, also produced significant differences among percentage scores for the control, ERVU, and the HPF. Using the Tukey–Kramer post-hoc test, the mean scores for the control, ERVU, and the HPF were 85 ± 0.8%, 88 ± 0.8%, and 87 ± 0.8%, respectively. All remaining two-way interactions among SNR, condition, and vocabulary set were significant except for the interaction of SNR and vocabulary set (F = 1.1, d.f. = 3, p = 0.36). The three-way interaction of SNR, vocabulary set, and condition was also significant (F = 5.1, d.f. = 6, p < 0.001), and the results from the Tukey–Kramer post-hoc test are plotted in Fig. 3.

Fig. 3. Tukey–Kramer post-hoc test means and standard errors for the interaction SNR, vocabulary set, and condition. For each vertical bar, the upper error bars are for results at 0 dB SNR, while the lower error bars are for results at -10 dB SNR.

4.2. Speaker effects

The current study used speech material from 8 male and 8 female speakers from the TI-46 corpus. The effects of speaker on percentage score were analyzed with a 2 × 3 × 16 ANOVA on the dependent variable (score) for the three factors SNR, condition, and speaker. The large number of degrees of freedom prohibited the inclusion of the factors vocabulary and listener nativeness that were used in the previous ANOVA. To decrease the variation in score due to vocabulary, the M-set was removed from the analysis. Fig. 3 shows that percentage scores for the M-set were significantly lower than the scores for the other three vocabulary sets.

All main effects were significant at a confidence level of 95% [SNR: F = 181, d.f. = 1, p < 0.001; condition: F = 13.3, d.f. = 2, p < 0.001; speaker: F = 4.4, d.f. = 15, p < 0.001]. In addition, all two-way interactions were significant [SNR and speaker: F = 1.7, d.f. = 15, p < 0.05; SNR and condition: F = 5.0, d.f. = 2, p < 0.01; speaker and condition: F = 2.2, d.f. = 30, p < 0.001]. The interaction of all terms was significant as well [SNR, condition, and speaker: F = 2.0, d.f. = 30, p < 0.001]. Means and standard errors from Tukey–Kramer post-hoc tests for the three-way interaction are shown in Fig. 4. The speaker labels are the same as those used in the TI-46 corpus.

Fig. 4. Tukey–Kramer post-hoc means and standard errors for three-way interaction of SNR, condition, and speaker. For each vertical bar, the upper error bars are for results at 0 dB SNR, while the lower error bars are for results at -10 dB SNR. In (a), the results are for the 8 female speakers F1–F8. In (b), the results are for the 8 male speakers M1–M8.

When the factor speaker was replaced with speaker gender in the ANOVA, speaker gender was not a significant main effect (F = 0.9, d.f. = 1, p > 0.30). Even after the factors listener nativeness and vocabulary were added to the ANOVA, speaker gender still was not a significant main effect (F = 1.3, d.f. = 1, p > 0.25). Junqua (1993) reported in a similar experiment that intelligibility of Lombard speech was higher or lower than normal speech, depending on the speaker gender. In contrast, our results show that intelligibility due to one aspect of the Lombard effect, namely CV ratio boosting, is not dependent on speaker gender.

5. Discussion

5.1. Condition effects

The results in Fig. 3 considering the three-way interaction among condition, SNR, and vocabulary set show that the S-set produced significantly different recognition scores between the control and ERVU and the HPF. The HPF produced a significantly lower score for the S-set at 0 dB in contrast to a significantly higher score at -10 dB, compared to the control. ERVU produced similar results for the S-set compared to the control, although the results at 0 dB were only marginally significant. The control results for the S-set had the largest change in score between 0 and -10 dB SNR for any vocabulary group, indicating an acute sensitivity by the control of the S-set to the noise source (white Gaussian). Although they operate on different parts of the signal, ERVU and the HPF maintain cues for intelligibility that aid in distinguishing words in the S-set at -10 dB while recognition using the control dropped sharply.
Experiments with different noise sources and more trials would better explain the results for the S-set.

Both ERVU and the HPF significantly improved the intelligibility of the E-set compared to the control according to the post-hoc tests in Fig. 3. The HPF was marginally significant at 0 dB while ERVU was marginally significant at -10 dB. The E-set, with 10 words, was the largest set tested and exhibited the most phonetic differences among word pairs of any set. These results are encouraging when considering the performance of the HPF and ERVU in open-vocabulary experiments where phonetic differences increase.

5.2. SNR effects

Post-hoc tests of the differences between the control and the algorithms ERVU and the HPF produced significant differences only at -10 dB SNR. Observations of the scores revealed that the test at 0 dB SNR was significantly affected by ceiling effects. At -10 dB SNR, control recognition error was 24.9%, compared to 19.2% for ERVU and 16.7% for the HPF. That is, relative to the control error, ERVU reduced recognition error by 23% and the HPF reduced recognition error by 33% at -10 dB. For the two-choice experiment employed, random chance was 50% error for all vocabulary sets, so the results at -10 dB can be considered the least affected by floor and ceiling effects since the control error was midway between that for perfect recognition and chance.

5.3. Nativeness effects

Non-native English speakers scored significantly lower than native speakers because of a lack of experience with the acoustic cues distinguishing the words in each confusable set. However, nativeness had no significant interactions with any of the other factors. These results indicate that the acoustic cues not used by non-native speakers were evenly distributed among the vocabulary sets tested and also that the acoustic cues enhanced by either ERVU or the HPF were utilized to the same extent by both groups of listeners. These results contrast with the findings of Bradlow and Bent (2002), who concluded that listeners for whom English was a non-native language benefited from English clear speech but not nearly to the extent at which native listeners benefited.
This suggests that CV ratio boosting beneﬁts both native and non-native listeners of English to the same degree while other cues present in clear speech (e.g., rate changes, hyperarticulation) provide more beneﬁt to listeners thoroughly familiar with English. However, the test material in the experiments of Bradlow and Bent used open-vocabulary meaningful sentences as opposed to the closed-vocabulary isolated words used in the current work, and the diﬀerence in test

556

M.D. Skowronski, J.G. Harris / Speech Communication 48 (2006) 549–558

material may partly explain the contrasting conclusions regarding native listener performance between the previous study and the current study. 5.4. Speaker eﬀects The results in Fig. 4 show that, at 0 dB SNR, results from ERVU were signiﬁcantly higher than the control for 6 speakers, while the control was signiﬁcantly higher than ERVU for only 2 speakers. The maximum diﬀerence between results from ERVU and the control was 33 percentage points (M7) while the minimum diﬀerence was 12 percentage points (F1). At 10 dB SNR, results from ERVU were signiﬁcantly higher than the control for 7 speakers, while the control was signiﬁcantly higher than ERVU for only 1 speaker. Diﬀerences between ERVU and the control for all other speakers were not signiﬁcant. The maximum diﬀerence between results from ERVU and the control was 19 percentage points (F6) while the minimum diﬀerence was 7 percentage points (F4). Results from the HPF, at 0 dB SNR, were significantly greater than the control for only 1 speaker (M7), while the results from the control were signiﬁcantly greater than results from the HPF for 2 speakers. At 10 dB SNR, the results from the HPF were signiﬁcantly greater than the control for 6 speakers, while the results from the control were not signiﬁcantly greater than results from the HPF for any speakers. Diﬀerences between the HPF and the control for all other speakers were not signiﬁcant. Averaged over both SNRs, results from 9 of the 16 speakers were signiﬁcantly positively aﬀected by either ERVU or the HPF compared to the control. ERVU did not signiﬁcantly negatively aﬀect results for any of the speakers, and only one speaker (F1) had signiﬁcantly lower results for the HPF compared to the control. These results show that intelligibility signiﬁcantly increased for about half of all speakers when either ERVU or the HPF was used, and intelligibility did not signiﬁcantly decrease when either ERVU or the HPF was used for the vast majority of speakers. 5.5. 
Alternative explanation of algorithm effectiveness

The increase in intelligibility due to CV ratio boosting can be explained outside of the context of clear or Lombard speech. Early research on speech intelligibility used low-pass/high-pass filters to characterize intelligibility as a function of filter cutoff frequency (Fletcher, 1953). Fletcher suggested that speech contains redundant phonetic information in the frequency domain after finding that vowels and nasals were recognized more accurately than fricatives and stops after filtering. In other words, phonetic information in vowels and nasals is distributed throughout the spectrum while the information in fricatives and stops is concentrated in spectral sub-bands. Similar results were found for filtered phonemes corrupted with additive noise (Miller and Nicely, 1955). In that experiment, voicing and nasality were little affected by the additive noise, while fricative, place-of-articulation and duration features were more affected. Results from subsequent listening tests demonstrated that the dynamic information between consonants and vowels in a CVC word was sufficient for vowel recognition even when the vowel was removed from the word (Strange et al., 1983). These studies show that speech contains regions in time and frequency that (1) contain phonetic cues important for intelligibility, and (2) vary in robustness to additive noise.

Both ERVU and the HPF are effective because they directly target the above-described regions of speech. The HPF boosts the amplitudes of the second and higher formants at the expense of the first formant, thus improving the local SNR for the higher formants. This strategy is effective when the SNR around the first formant is sufficiently high, as is true for the white noise used in the experiments of this work. ERVU employs CV ratio boosting, which primarily improves the SNR for low-energy frames of speech.

5.6. Spectral flatness measure

Of practical interest is the effectiveness of the automated segmentation routine used by ERVU. Fig. 5 shows a receiver operating characteristic (ROC) curve for the spectral flatness measure (SFM) with a single threshold. The ROC curve, generated from 1000 phonetically labeled TIMIT sentences, indicates the trade-off between false positives and false negatives for detecting voiced segments using the SFM. The "X" in Fig. 5 indicates the operating point from the Schmidt trigger thresholds used in the current work and lies just inside the single-threshold ROC. For the current work using the Schmidt trigger thresholds, voiced frames were classified as unvoiced at a rate of 2.9% while unvoiced frames were classified as voiced at a rate of 18.8%.
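As a rough illustration of the segmentation idea discussed here, the following Python sketch computes a spectral flatness value for one frame and applies a two-threshold (Schmidt trigger) decision across frames. It is a minimal sketch under stated assumptions, not the authors' implementation: the SFM is taken to be the common ratio of geometric mean to arithmetic mean of a power spectrum, low values are assumed to indicate voiced (peaky) frames, and the function names and the threshold values `low` and `high` are hypothetical placeholders rather than the values used in the experiments.

```python
import math


def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean over arithmetic mean of a power spectrum.

    Assumed definition: values near 1 mean a flat (noise-like) spectrum,
    values near 0 mean a peaky (harmonic, voiced-like) spectrum.
    """
    n = len(power_spectrum)
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    arith_mean = sum(power_spectrum) / n
    return math.exp(log_mean) / (arith_mean + eps)


def schmidt_trigger_voicing(sfm_values, low=0.3, high=0.5, start_voiced=False):
    """Label each frame voiced (True) or unvoiced (False) with hysteresis.

    The state switches to voiced only when the SFM drops below `low`, and
    back to unvoiced only when it rises above `high`. Both thresholds are
    hypothetical values chosen for illustration.
    """
    voiced = start_voiced
    labels = []
    for sfm in sfm_values:
        if voiced and sfm > high:
            voiced = False
        elif not voiced and sfm < low:
            voiced = True
        labels.append(voiced)
    return labels


# Frames hovering between the two thresholds keep their previous label.
labels = schmidt_trigger_voicing([0.6, 0.4, 0.2, 0.4, 0.6, 0.4])
# labels == [False, False, True, True, False, False]
```

The hysteresis keeps the decision from toggling when the SFM hovers near a single threshold, which is one plausible reason a two-threshold operating point can differ from any point on the single-threshold curve.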

[Figure 5: ROC plot. Horizontal axis: False positives, %. Vertical axis: False negatives, %. Axis ticks at 10^0, 10^1 and 10^2. Annotated operating point, Schmidt trigger threshold: false negative 2.9%, false positive 18.8%.]

Fig. 5. ROC curve for spectral flatness measure using single threshold. The "X" indicates the operating point from the Schmidt trigger thresholds used in the current work.
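The trade-off plotted in Fig. 5 can be sketched as a sweep of one threshold over per-frame SFM values with reference voicing labels. The sketch below is only illustrative and rests on assumptions: a frame is detected as voiced when its SFM falls below the threshold, a false negative is a voiced frame detected as unvoiced, a false positive is an unvoiced frame detected as voiced (matching the error rates quoted in the text), and the function names and sample data are hypothetical.

```python
def error_rates(sfm_values, is_voiced, threshold):
    """Return (false_negative_percent, false_positive_percent).

    Assumption: a frame is detected as voiced when its SFM is below
    `threshold`. `is_voiced` holds the reference label for each frame.
    """
    fn = fp = n_voiced = n_unvoiced = 0
    for sfm, voiced in zip(sfm_values, is_voiced):
        detected = sfm < threshold
        if voiced:
            n_voiced += 1
            fn += not detected   # voiced frame missed
        else:
            n_unvoiced += 1
            fp += detected       # unvoiced frame flagged as voiced
    if n_voiced == 0 or n_unvoiced == 0:
        raise ValueError("need at least one voiced and one unvoiced frame")
    return 100.0 * fn / n_voiced, 100.0 * fp / n_unvoiced


def roc_curve(sfm_values, is_voiced, thresholds):
    """One (false negative %, false positive %) point per threshold."""
    return [error_rates(sfm_values, is_voiced, t) for t in thresholds]


# Hypothetical toy data: six frames with reference labels.
sfm = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]
ref = [True, True, True, False, False, False]
points = roc_curve(sfm, ref, [0.0, 0.5, 1.0])
```

Raising the threshold lowers the false negative rate at the cost of more false positives, which is the trade-off a single-threshold ROC curve traces out.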

The spectral flatness measure relies on sharp contrast between spectral peaks and adjacent valleys, and additive noise can raise the noise floor, which significantly affects spectral contrast. However, the current work demonstrates that a simple segmentation scheme used with CV ratio boosting is sufficient to significantly improve speech intelligibility. Future work remains to develop a more robust voicing detector and also to find the optimum operating point on the detector's ROC curve.

6. Conclusions

We have extended the previous experiments of CV ratio boosting in the following four ways. First, we used speech from 8 male and 8 female speakers in our recognition experiments. Previous studies that applied phenomena from clear and Lombard speech considered very few speakers (typically 1–4), and we feel the experimental results would be more representative coming from a larger population of speakers. Results from the current study show that ERVU or the HPF improves intelligibility for about half the speakers (9 of 16) without degrading the intelligibility of the other speakers.

Second, we used listeners who were native speakers of American English as well as listeners who were non-native speakers of American English. Bradlow and Bent (2002) showed that listeners who were native speakers of English benefited more from clear speech than did listeners who were non-native English speakers, but our results show that the increase in percentage scores due to CV ratio boosting was the same for both groups of listeners. Our results suggest that some aspects of clear speech, specifically an increased CV ratio, are equally beneficial to all listeners, regardless of their extent of familiarity with English.

Third, we used a simple, real-time segmentation algorithm, instead of hand labeling, to distinguish consonants from vowels. Hand labeling is not appropriate for real-time implementation.

Fourth, we used monosyllabic words, instead of logatomes, from the TI-46 speech corpus, arranged in the same confusable sets as used by Junqua (1993). Confusable sets offer a challenging recognition task and are easily administered in a closed-vocabulary test environment. Furthermore, the corpus of confusable words emphasizes word recognition over phoneme recognition and is a stepping stone towards recognition of sentential material. The application of CV ratio boosting on conversational speech in real time is the ultimate goal of this work.

Several interesting aspects of ERVU and CV ratio boosting remain unexplored. Since ERVU and the HPF target different speech cues, combining the algorithms should improve intelligibility beyond that for each algorithm separately, provided that one algorithm does not negatively impact the cues targeted by the other. Also, algorithm effectiveness depends on the acoustic noise characteristics. Energetic white noise was employed in the current work, yet other noise sources may expose strengths and weaknesses of the algorithms not considered currently.

Acknowledgement

This work was funded in part by the Motorola iDEN Technology Group. The authors would like to thank Rahul Shrivastav for fruitful discussions on the manuscript subject.

References

Boll, S.F., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 27 (2), 113–120.
Bradlow, A.R., Bent, T., 2002. The clear speech effect for nonnative listeners. J. Acoust. Soc. Amer. 112 (1), 272–284.
Doddington, G.R., Schalk, T.B., 1981. Speech recognition: turning theory to practice. IEEE Spectrum, 26–32.
Ephraim, Y., Van Trees, H.L., 1995. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 3 (4), 251–266.
Ferguson, S.H., Kewley-Port, D., 2002. Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. J. Acoust. Soc. Amer. 112 (1), 259–271.
Fletcher, H., 1953. Speech and Hearing in Communication. D. Van Nostrand Company, Inc., New York.
Furui, S., 1986. On the role of spectral transition for speech perception. J. Acoust. Soc. Amer., 1016–1025.
Gordon-Salant, S., 1986. Recognition of natural and time/intensity altered CVs by young and elderly subjects with normal hearing. J. Acoust. Soc. Amer. 80 (6), 1599–1607.
Gray Jr., A.H., Markel, J.D., 1974. A spectral-flatness measure for studying the autocorrelation method of linear prediction of speech analysis. IEEE Trans. Acoust. Speech Signal Process. 22 (3), 207–217.
Harris, J.G., Skowronski, M.D., 2002. Energy redistribution speech intelligibility enhancement, vocalic and transitional cues. J. Acoust. Soc. Amer. 112 (5), 2305.
Hazan, V., Simpson, A., 1998. The effect of cue-enhancement on the intelligibility of nonsense word and sentence materials presented in noise. Speech Comm. 24 (3), 211–226.
Hopkins, W.G., 2000. A new view of statistics. Internet Society for Sport Science. Available from: (April 30 2005).
Junqua, J.C., 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. J. Acoust. Soc. Amer. 93 (1), 510–524.
Kennedy, E., Levitt, H., Neuman, A.C., Weiss, M., 1998. Consonant–vowel intensity ratios for maximizing consonant recognition by hearing-impaired listeners. J. Acoust. Soc. Amer. 103 (2), 1098–1114.
Miller, G.A., Nicely, P.E., 1955. An analysis of perceptual confusions among some English consonants. J. Acoust. Soc. Amer. 27 (2), 338–352.
Niederjohn, R.J., Grotelueschen, J.H., 1976. The enhancement of speech intelligibility in high noise levels by high-pass filtering followed by rapid amplitude compression. IEEE Trans. Acoust. Speech Signal Process. 24 (4), 277–282.
Payton, K.L., Uchanski, R.M., Braida, L.D., 1994. Intelligibility of conversational and clear speech in noise and reverberation for listeners with normal and impaired hearing. J. Acoust. Soc. Amer. 95 (3), 1581–1592.
Picheny, M.A., Durlach, N.I., Braida, L.D., 1985. Speaking clearly for the hard of hearing I: intelligibility differences between clear and conversational speech. J. Speech Hear. Res. 28, 96–103.
Picheny, M.A., Durlach, N.I., Braida, L.D., 1986. Speaking clearly for the hard of hearing II: acoustic characteristics of clear and conversational speech. J. Speech Hear. Res. 29, 434–445.
Reinke, T.L., August 2001. Automatic speech intelligibility enhancement. Master's thesis, University of Florida, Gainesville, FL, USA.
Strange, W., Jenkins, J.J., Johnson, T.L., 1983. Dynamic specification of coarticulated vowels. J. Acoust. Soc. Amer. 74 (3), 695–705.
Summers, W.V., Pisoni, D.B., Bernacki, R.H., Pedlow, R.I., Stokes, M.A., 1988. Effects of noise on speech production: acoustic and perceptual analyses. J. Acoust. Soc. Amer. 84 (3), 917–928.
