Analysis and recognition of whispered speech


Taisuke Ito, Kazuya Takeda *, Fumitada Itakura
Graduate School of Engineering, Nagoya University, Information Electronics, Nagoya 464-8603, Japan

Speech Communication 45 (2005) 139–152. Accepted 13 October 2003.

Abstract

In this study, we have examined the acoustic characteristics of whispered speech and addressed some of the issues involved in recognizing whispered speech used for communication over a mobile phone in a noisy environment. The acoustic analysis shows an upward shift of the formant frequencies of vowels in the whispered speech data compared to the normal speech data. Voiced consonants in whispered speech have lower energy at low frequencies, up to 1.5 kHz, and their spectral flatness is greater than in normal speech. In the whispered speech recognition experiments, our studies on adaptation of the acoustic models show that adaptation using a small amount of whispered speech data from a target speaker can be used effectively for recognition of whispered speech. In a noisy environment, the recognition accuracy decreases significantly for whispered speech compared to normal speech of the same sentences. A method of increasing the SNR by covering the mouth with a hand is shown to give higher recognition accuracy for the whispered speech that is frequently used for private communication in a noisy environment.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Speech recognition; Whispered speech; Telephone handset; Noise robustness

* Corresponding author. Tel.: +81 52 789 3629; fax: +81 52 789 3172. E-mail addresses: [email protected] (T. Ito), [email protected] (K. Takeda), itakuraf@ccmfs.meijo-u.ac.jp (F. Itakura).

1. Introduction

Advances in wireless communication technology have led to the widespread use of mobile phones for private communication as well as for information access using speech. Speaking loudly into a mobile phone in public places is considered a nuisance to others, and conversations are often overheard. Whispering can be used effectively for quiet and private communication over mobile phones, and recognition of whispered speech is therefore important for speech communication over mobile phones. In normal speech, voiced sounds are produced by modulation of the air flow from the lungs by vibration of the vocal cords. However, there is no such vibration of the vocal cords in the production of whispered speech.


Since exhalation is the source of excitation in whispered speech, its acoustic characteristics differ from those of normal speech (Fujimura and Lindqvist, 1971). In particular, the magnitude (power) spectrum in the low-frequency region is weaker for whispered speech than for normal speech. Furthermore, in real-world environments where background noise is present, the signal-to-noise ratio (SNR) of whispered speech is low. Therefore, processing and recognition of whispered speech signals is expected to be more difficult than that of normal speech (Morris and Clements, 2002; Wenndt et al., 2002).

The generation mechanism of whispered speech differs from that of normal speech in the following ways (Leggetter and Woodland, 1995; Meyer-Eppler, 1957; Thomas, 1969; Holmes and Stephens, 1983; Sugito et al., 1991): exhalation of air is used as the sound source, and the shape of the pharynx is adjusted such that the vocal cords do not vibrate. Due to these differences in the generation mechanism, the acoustic characteristics of whispered speech differ from those of normal speech. A study on the acoustic analysis of vowel sounds (Konno et al., 1996) has shown that the formant frequencies of a vowel in whispered speech shift to higher frequencies compared to normal speech; the shift is larger for vowels with low formant frequencies. The boundaries of the vowel regions in the F1–F2 plane are also found to differ between normal and whispered speech (Kallail and Emanuel, 1984; Eklund and Traunmuller, 1996).

Recognition systems trained with normal speech perform poorly on whispered speech because of these differences in acoustic characteristics. In this paper, we investigate how accurately whispered speech can be recognized within a statistical ASR framework. Since acoustic modeling of the sounds in whispered speech using hidden Markov models (HMMs) requires large training data sets, we have built a large whispered speech corpus for this task. In the next section, we describe our corpus of whispered speech recorded in various environments. In Section 3, we examine the acoustic characteristics of different sounds in whispered speech in terms of the cepstral distance.

In Section 4, we present results of our studies on recognition of whispered speech under several conditions.

2. Corpus

In this section, we describe the corpus built for whispered speech. The corpus consists of two parts: close-talking-microphone (CTM) recordings and telephone handset (TH) recordings. CTM recording of whispered speech was carried out in a soundproof room, whereas TH recording was carried out both in a soundproof room and in a computer laboratory. The background noise level was 16.1 dB(A) in the soundproof room and 62.3 dB(A) in the computer room. For the CTM recording sessions, a digital video camera (Sony DCR-TRV900) was used for recording; image data was also collected for future studies on multi-sensor voice recognition. Speech data was digitized at a sampling frequency of 48 kHz for CTM speech and 8 kHz for TH speech. An example recorded image is shown in Fig. 1. A personal handyphone system (PHS) was used for TH recording. PHS is a Japanese cellular phone standard that uses a micro-cell (100–500 m radius) system and a 32 kbps ADPCM (G.726) codec. The TH recording setup consists of a transmitter unit (Panasonic KX-HS110) and a receiver unit (Kyocera PS-C1). A digital audio tape recorder (Sony TCD-D10) was used to record the data from the receiver.
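The CTM and TH recordings are stored at different sampling rates (48 kHz and 8 kHz), while the acoustic models described later operate on 16 kHz audio (Table 2). A minimal preprocessing sketch of the rate conversion this implies is given below; the file names and the use of the soundfile and scipy libraries are our own assumptions and are not part of the original recording setup.

# Hypothetical preprocessing sketch: convert a 48 kHz CTM recording to the
# 16 kHz rate used for HMM training (see Table 2). File names and libraries
# (soundfile, scipy) are assumptions for illustration only.
import soundfile as sf
from scipy.signal import resample_poly

def ctm_to_16k(in_wav: str, out_wav: str) -> None:
    """Downsample a 48 kHz close-talking-microphone recording to 16 kHz."""
    audio, rate = sf.read(in_wav)
    assert rate == 48000, "expected a 48 kHz CTM recording"
    audio_16k = resample_poly(audio, up=1, down=3)  # 48 kHz / 3 = 16 kHz
    sf.write(out_wav, audio_16k, 16000)

# Example (hypothetical paths):
# ctm_to_16k("ctm_whisper_048k.wav", "ctm_whisper_016k.wav")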

Fig. 1. An example recorded image of a face.


In the CTM recording sessions, normal speech and whispered speech are collected from 68 male speakers and 55 female speakers. Each speaker reads 60 ATR phonetically balanced (PB) sentences (Kurematsu et al., 1990) and 50 sentences from the ASJ database of Japanese newspaper article sentences (JNAS) (Itou et al., 1998). Ten of the 60 ATR PB sentences are read by all speakers. The data for the ATR sentences is used for training, and the data for the JNAS sentences is used for evaluation.

Various speaking styles are considered for the TH recording sessions. We consider styles typically used for quiet and private communication, such that a conversation does not disturb others in the room and the speech is not overheard. Such communication is normally carried out using one or a combination of the following speaking manners: (1) speaking in a low voice, (2) covering the mouth with one's hand, and/or (3) covering the mouth as well as the input section of the cellular phone with one's hand.


The position of the speaker's hand in the latter two styles is shown in Fig. 2. The speech data is collected from 10 speakers for each of the following six speaking styles: (1) normal speech, (2) normal speech in a low voice (hereafter called low-voice speech), (3) low-voice speech with a hand covering the mouth, (4) whispered speech, (5) whispered speech with a hand covering the mouth, and (6) whispered speech with a hand covering the mouth as well as the input section of the cellular phone. Each speaker reads 10 sentences from set A of the ATR PB sentences and 20 sentences from the JNAS database.

3. Acoustic analysis of whispered speech

In the acoustic analysis part of our studies, we have compared the waveforms of different categories of sounds, the short-term spectra of vowels, and the average spectra of different sounds for normal and whispered speech. We have also studied the suitability of the cepstral distance as a quantitative measure of the difference between a given sound in normal speech and in whispered speech.

Fig. 2. Usage of mobile phone. Covering mouth (bottom left). Covering mouth and input section of mobile phone (bottom right).


Fig. 3. Waveforms of the signal in the normal and whispered speech.

In Fig. 3, the waveforms of the speech signal for the phrase /chi:sanaunagiyani/ are displayed for both normal and whispered speech. Unlike in normal speech, the amplitude of vowels is lower than that of consonants in whispered speech. There is a significant reduction in the amplitude of vowels and voiced consonants in whispered speech compared to normal speech, because there is no vibration of the vocal cords during the production of voiced sounds in whispered speech. The amplitude of unvoiced consonants is observed to be similar for normal and whispered speech.

Next we compare the short-term spectra of different vowels for normal and whispered speech. For each of the five vowels, the short-term magnitude spectrum is obtained by computing the FFT over a frame of 32 ms using a Hamming window. Examples of spectra for vowels in normal and whispered speech in the same context are shown in Figs. 4 and 5, respectively. It is observed that the spectra of vowels in normal speech show a periodic structure that is not present in the spectra of whispered speech. However, the formant structure of the vowels is present in both types of speech.

Fig. 4. An example of the spectrum of vowel /a/ in normal speech (magnitude [dB] versus frequency [kHz]).

Fig. 5. An example of the spectrum of vowel /a/ in whispered speech (magnitude [dB] versus frequency [kHz]).
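As a concrete illustration of the short-term analysis described above, the following sketch computes the magnitude spectrum of a single 32 ms Hamming-windowed frame. The sampling rate, frame position and signal handling are illustrative assumptions, not settings taken from the paper.

# Sketch of the short-term spectral analysis described above: one 32 ms
# Hamming-windowed frame and its FFT magnitude spectrum in dB.
import numpy as np

def frame_spectrum_db(signal, sample_rate=16000, start=0, frame_ms=32.0):
    """Return (frequencies in Hz, magnitude spectrum in dB) for one frame."""
    frame_len = int(sample_rate * frame_ms / 1000.0)   # 512 samples at 16 kHz
    frame = np.asarray(signal[start:start + frame_len], dtype=float)
    frame = frame * np.hamming(frame_len)
    spectrum = np.fft.rfft(frame)
    magnitude_db = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return freqs, magnitude_db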

There is a shift of the formant frequencies towards higher frequencies for the vowels in whispered speech compared to normal speech. The first and second formant frequencies, F1 and F2, estimated from a typical spectrum of each of the five vowels, are given in Table 1 for whispered speech and normal speech of the same speaker in the same context. The formant frequencies are estimated through visual inspection of the LPC spectrum. The subscripts n and w denote normal and whispered speech, respectively. It is observed that the value of F1 for the whispered-speech vowels is about 1.3–1.6 times the value for the corresponding vowels in normal speech. The value of F2 increases by a factor in the range 1.0–1.2. However, the spectral magnitude decreases by about 20–25 dB for vowels in whispered speech compared to normal speech.

For a comparison of the spectra of a phoneme, we first perform phoneme alignment using the HMM acoustic models trained separately with the normal speech data and the whispered speech data.


Details of the training are presented in Section 4. For each phoneme, an averaged spectrum is obtained from the spectra of the aligned segments of that phoneme in continuous speech and is then averaged separately for male and female speakers. In this analysis, the averaging is done in the linear spectral domain. The averaged spectra of each of the 24 phonemes obtained from the normal speech data and the whispered speech data are shown in Fig. 6 for male speakers and Fig. 7 for female speakers. The spectra of the whispered speech data are normalized so that the range of magnitude is the same as that of the normal speech. It is observed that the spectral magnitude at frequencies below 1.5 kHz is lower for voiced consonants in whispered speech than in normal speech. It is also observed that the spectral flatness is greater for voiced consonants in whispered speech. An increase in the frequencies of the lower formants of voiced consonants, as in the case of vowels, was also prevalent. The whispered speech spectra of voiced and unvoiced consonants with the same place of articulation are found to be similar.

Next we present a quantitative analysis of the dissimilarity between the spectra of phonemes in normal speech and those in whispered speech. The cepstral distance is used as a measure of dissimilarity between the average spectra of a phoneme in the two types of speech. It is computed from 29 cepstral coefficients as follows:

D_{\mathrm{cep}} = \frac{10}{\ln 10}\sqrt{2\sum_{i=1}^{29}\bigl(c_i^{(n)} - c_i^{(w)}\bigr)^2}\quad[\mathrm{dB}] \qquad (1)

where c_i^{(n)} and c_i^{(w)} are the average cepstral coefficients for normal speech and whispered speech, respectively.

Table 1
The formant frequencies and the ratio of the formant frequency of whispered speech to that of normal speech

Vowel   F1n [Hz]   F2n [Hz]   F1w [Hz]   F2w [Hz]   F1w/F1n   F2w/F2n
/a/        680       1130        890       1400        1.3       1.2
/i/        280       2540        440       2590        1.5       1.0
/u/        320       1370        460       1400        1.4       1.0
/e/        440       2200        600       2250        1.3       1.0
/o/        450        820        760       1040        1.6       1.2
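The F1 and F2 values in Table 1 are read off the LPC spectrum by visual inspection. The sketch below shows one conventional way to obtain such an LPC spectral envelope (autocorrelation method with the Levinson–Durbin recursion); the LPC order and FFT size are our own illustrative assumptions, since the paper does not specify its LPC settings.

# Sketch of LPC spectral-envelope computation of the kind used above to read
# off F1 and F2 by inspection. Order and FFT size are illustrative assumptions.
import numpy as np

def lpc_coefficients(frame, order=14):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def lpc_envelope_db(frame, sample_rate=16000, order=14, nfft=1024):
    """Smooth spectral envelope 1/|A(e^{jw})| in dB; its peaks suggest formants."""
    windowed = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    a = lpc_coefficients(windowed, order)
    spectrum = np.fft.rfft(a, nfft)                 # A(e^{jw}) sampled on a grid
    envelope_db = -20.0 * np.log10(np.abs(spectrum) + 1e-12)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / sample_rate)
    return freqs, envelope_db

In this form, F1 and F2 for a vowel frame would be read as the first two clear peaks of the returned envelope, mirroring the visual inspection used for Table 1.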

Fig. 6. Average spectra of the normal/whispered speech of male speakers (one panel per phoneme: a, i, u, e, o, w, r, y, m, n, b, d, g, j, z, p, t, k, s, sh, ts, ch, h, f; magnitude [dB] versus frequency [kHz], with the normal and whispered curves overlaid).

Fig. 7. Average spectra of the normal/whispered speech of female speakers (one panel per phoneme, as in Fig. 6; magnitude [dB] versus frequency [kHz]).

Dissimilarities are measured using the following procedure:

• Phone-level segmentation is performed on both the normal and the whispered speech of the same sentence, using the normal speech model and the whispered speech model, respectively;
• an averaged cepstrum is calculated for each phone segment;
• cepstral distances are calculated between the pairs of average cepstra of phone segments at the same position in the normal and whispered speech; and
• the calculated cepstral distances are averaged over each of the 24 phones.

The average value of the cepstral distance for each phoneme is plotted in Fig. 8. Different shades are used for the distance plots of vowels, voiced consonants and unvoiced consonants. It is seen that the cepstral distance is about 3.5 dB for vowels and voiced consonants, and about 2 dB for unvoiced consonants. The results of the acoustic analysis of normal and whispered speech indicate that the characteristics of vowels and voiced consonants change more significantly than those of unvoiced consonants.
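A compact sketch of this measurement procedure, including Eq. (1), is given below. It assumes per-frame cepstra and phone alignments are already available as plain arrays; the data-structure choices are our own, not the paper's.

# Sketch of the dissimilarity procedure above: average the cepstra within each
# aligned phone segment, apply Eq. (1) to segment pairs at the same position,
# and average the distances per phone label. Input structures are illustrative
# assumptions (per-frame cepstra plus (label, start, end) frame indices).
import numpy as np
from collections import defaultdict

def cepstral_distance_db(c_normal, c_whisper):
    """Eq. (1): cepstral distance in dB between two average cepstra."""
    diff = np.asarray(c_normal, float) - np.asarray(c_whisper, float)
    return (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2))

def per_phone_distances(cep_n, segs_n, cep_w, segs_w):
    """cep_*: (frames x 29) cepstra; segs_*: aligned [(label, start, end)] lists."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (lab, s_n, e_n), (_, s_w, e_w) in zip(segs_n, segs_w):
        avg_n = cep_n[s_n:e_n].mean(axis=0)       # average cepstrum per segment
        avg_w = cep_w[s_w:e_w].mean(axis=0)
        sums[lab] += cepstral_distance_db(avg_n, avg_w)
        counts[lab] += 1
    return {lab: sums[lab] / counts[lab] for lab in sums}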

Fig. 8. Cepstral distances (CD [dB]) between normal and whispered speech for each phoneme (/a i u e o b d g j z w r y m n p t k ts ch s sh h f/).

Fig. 9. Error counting method for phoneme recognition experiments (recognition results are aligned with the manual labels and counted as correct, substitution or deletion).

4. Recognition of whispered speech

4.1. Phoneme recognition and confusion matrix

We now present phoneme recognition results and error tendencies for whispered speech, using HMM acoustic models trained separately with the normal speech data and with the whispered speech data collected in the soundproof room. For training each HMM set, 4000 ATR PB sentences uttered by 40 male and 40 female speakers were used. Each set of triphone HMMs shares 500 states, each of which has a 16-mixture Gaussian distribution. The experimental conditions for training the HMMs are given in Table 2.

Table 2
Conditions for training the HMMs

Sampling frequency    16 kHz
Analysis window       Hamming
Frame width           25 ms
Frame shift           10 ms
Feature parameters    MFCC(12) + ΔMFCC(12) + Δpower

Whispered speech data of 200 phonetically balanced sentences from two male speakers and two female speakers are used for evaluation. The evaluation data is manually segmented into phoneme segments and labeled in order to calculate a phoneme confusion matrix. The recognition system (Kawahara et al., 1999) uses the Japanese phonotactic constraint as a language model, i.e., no consonant can directly precede another consonant, etc. The recognition results are evaluated based on the center frame positions, as shown in Fig. 9. The overall phoneme accuracy given by the normal speech models is 52.2%, whereas the models trained with whispered speech data yield an accuracy of 75.9%.

The confusion matrices for recognition of the whispered speech using the two different model sets are shown as grayscale plots in Figs. 10 and 11 for the normal speech models and the whispered speech models, respectively. In these figures, a darker shade denotes a higher accuracy. The regions of phonemes in different groups, such as vowels, unvoiced consonants, voiced plosives and fricatives, and other voiced consonants, are demarcated by solid lines. Deletion errors, corresponding to a phoneme being recognized as silence, are denoted "del" in the figures. Substitution errors between voiced and unvoiced consonants with the same place of articulation are denoted by circles. Phonemes with a frequency of occurrence of less than 30 are enclosed in parentheses on the y-axis.

From the confusion matrix for the normal speech models in Fig. 10, it is seen that almost all the phonemes have substitution errors with the phoneme /h/. This is mainly because the exhalation sounds generated in whispered speech are similar to /h/. For vowels, the confusion besides the one associated with /h/ is mainly with the other vowels. The highest accuracy is obtained for /a/, and the lowest accuracy for /o/. For plosives and fricatives, many substitution errors occur with phonemes having the same place of articulation. This is mainly because the voiced and unvoiced consonants have a similar manner of articulation in whispered speech. For the whispered speech models, it is seen from Fig. 11 that the number of substitution errors with /h/ is insignificant. However, the other tendencies are similar to those of the normal speech models.
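One plausible reading of the center-frame counting scheme used for these results is sketched below; the exact decision rules of Fig. 9 are not fully specified in the text, so the handling of segment boundaries and silence here is an assumption.

# Hedged sketch of center-frame scoring in the spirit of Fig. 9: each manually
# labelled phone segment is scored by whichever recognized label covers its
# center frame. Boundary handling and the silence symbol are assumptions.
def score_by_center_frame(manual_segs, recog_segs, silence="sil"):
    """Segments are (label, start_frame, end_frame); returns error counts."""
    counts = {"correct": 0, "substitution": 0, "deletion": 0}
    for label, start, end in manual_segs:
        center = (start + end) // 2
        hyp = next((lab for lab, s, e in recog_segs if s <= center < e), silence)
        if hyp == label:
            counts["correct"] += 1
        elif hyp == silence:
            counts["deletion"] += 1       # recognized as silence at the center
        else:
            counts["substitution"] += 1
        # confusion pairs (label, hyp) could be accumulated here for Figs. 10-11
    return counts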


Fig. 10. Confusion matrix of the recognition results of whispered speech using the normal speech models (darker shades denote higher accuracy; see the text for the marking conventions).

Substitution errors in phoneme recognition may be handled by taking advantage of linguistic information in continuous speech recognition.

4.2. Continuous speech recognition results

For continuous speech recognition, in addition to the two previously mentioned sets of triphone HMMs, i.e., the normal speech models and the whispered speech models, a speaking-style-independent model set was built using both the whispered speech and the normal speech as training data. Instead of using the speaking-style-independent models, we can also consider a model selection mechanism that uses either the set of normal speech models or the set of whispered speech models.

In this approach, the model set giving the higher likelihood for each sentence is selected. The approach is aimed at recognizing speech without specifying whether it is normal or whispered speech. The same HMM models and evaluation data as in the previous subsection are used for the continuous speech recognition experiments. A trigram language model trained on articles from the Mainichi newspaper covering a 75-month period is used; the vocabulary size of the dictionary is 20,000 words. The word recognition performance is given in terms of the percent correct rate (% correct) and the accuracy:

\%\ \mathrm{correct} = \frac{N - S - D}{N} \times 100, \qquad \%\ \mathrm{accuracy} = \frac{N - S - D - I}{N} \times 100,

where S, D and I are the numbers of substitution, deletion and insertion errors, respectively, and N is the total number of words.

Fig. 11. Confusion matrix of the recognition results of whispered speech using the whispered speech models (darker shades denote higher accuracy; marking conventions as in Fig. 10).

The performance of the four approaches in recognizing the normal speech data is displayed in Fig. 12. It is seen that the normal speech models yield about 82% accuracy, whereas the whispered speech models give an accuracy of about 53%. The speaking-style-independent models and the model-selection-based recognition approach give an accuracy of about 80%.

Fig. 13 shows the performance of the four approaches for the whispered speech data. It should be noted that the normal speech models give a poor recognition accuracy of about 27%, whereas the whispered speech models give about 68% accuracy. The speaking-style-independent models and the model-selection-based recognition approach give an accuracy of about 66%. These results indicate that the speaking-style-independent models and the model selection approach can be used for recognition of normal as well as whispered speech data, with an accuracy close to that of the dedicated normal speech models and whispered speech models, respectively. These two approaches can also be used for recognition of speech consisting of both styles.
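A hedged sketch of the model selection mechanism is shown below: the utterance is decoded with both model sets and the output of the higher-likelihood decode is kept. The decode() interface, returning a total log-likelihood and a word sequence, is a hypothetical stand-in for the dictation decoder cited in the paper (Kawahara et al., 1999).

# Sketch of likelihood-based model selection between the normal-speech and
# whispered-speech model sets. The decode() interface is an assumption.
def recognize_with_model_selection(utterance, decode, normal_models, whisper_models):
    """Decode with both model sets and keep the higher-likelihood result."""
    loglik_n, words_n = decode(utterance, normal_models)
    loglik_w, words_w = decode(utterance, whisper_models)
    if loglik_n >= loglik_w:
        return "normal", words_n       # utterance treated as normal speech
    return "whispered", words_w        # utterance treated as whispered speech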

Fig. 12. Performance (% correct and accuracy, recognition rate [%]) of the four approaches in recognizing the normal speech: the normal speech models, the whispered speech models, the speaking-style-independent models, and model-selection-based recognition.

Fig. 13. Performance (% correct and accuracy, recognition rate [%]) of the four approaches in recognizing whispered speech: the normal speech models, the whispered speech models, the speaking-style-independent models, and model-selection-based recognition.

In the above experiments, the misclassification rate of normal speech as whispered speech is about 5%, and the misclassification rate of whispered speech as normal speech is almost zero.

4.3. Speaker and speaking style adaptation

Next we tested the improvement in recognition accuracy obtained by speaker and speaking style adaptation. Maximum likelihood linear regression (MLLR) is applied to the baseline models, i.e., the normal speech and whispered speech models used in the previous subsections. In the experiment, MLLR adaptation updates the mean and variance vectors by applying a linear transformation defined for each group of acoustically similar HMM states. The number of groups is controlled automatically by the amount of adaptation data; in our studies, the maximum number of groups was set to 16. The evaluation sentences and recognition task are the same as in the previous subsection, i.e., newspaper sentences.

First we consider speaker adaptation of the whispered speech models, where the models are adapted using whispered speech data from the target speaker. Fig. 14 shows the recognition performance without adaptation and with adaptation using the whispered speech of 10 and of 50 phonetically balanced sentences. The accuracy improves by about 10% after adaptation, and there is no significant further increase when the number of adaptation sentences is increased from 10 to 50.

Next we consider adaptation of the normal speech models with the whispered speech data of the target speaker (speaker and speaking style adaptation) and with that of nontarget speakers (speaking style adaptation). In Fig. 15, we present the recognition performance without adaptation, with adaptation using the whispered speech of 10/50 phonetically balanced sentences of the target speaker, and with adaptation using the whispered speech data of 80 phonetically balanced sentences from 80 nontarget speakers.
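As a reminder of what MLLR does once its transforms have been estimated, the sketch below applies a per-regression-class affine transform to the Gaussian means, with a simplified diagonal variance scaling. The data structures are illustrative assumptions and the transform estimation itself is not shown; this is not the toolkit used in the paper.

# Illustrative sketch of applying MLLR transforms: each regression class
# (a group of acoustically similar HMM states, at most 16 here) shares one
# affine transform of the Gaussian means plus a diagonal variance scaling.
import numpy as np

def apply_mllr(means, variances, class_of_state, transforms):
    """means/variances: dicts state -> (dim,) arrays; transforms: class -> (A, b, h)."""
    new_means, new_vars = {}, {}
    for state, mu in means.items():
        A, b, h = transforms[class_of_state[state]]
        new_means[state] = A @ mu + b            # mean update: mu' = A mu + b
        new_vars[state] = h * variances[state]   # simplified diagonal variance scaling
    return new_means, new_vars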

Fig. 14. Performance (% correct and accuracy, recognition rate [%]) after applying MLLR adaptation to the whispered speech models, for no adaptation and for 10 and 50 adaptation sentences.

Fig. 15. Performance (% correct and accuracy, recognition rate [%]) after applying MLLR adaptation to the normal speech models, for no adaptation, 10 and 50 target-speaker sentences, and 80 nontarget-speaker sentences.

Table 3
SNR for different speaking styles

Speaking style                                        SNR [dB], computer room   SNR [dB], soundproof room
Normal speech                                         24.6                      35.7
Low-voice speech                                      15.7                      24.7
Low-voice speech with covering mouth                  19.1                      33.7
Whispered speech                                      11.5                      18.1
Whispered speech with covering mouth                  15.5                      26.6
Whispered speech with covering mouth and handset      24.3                      36.9

It is seen that adaptation using the whispered speech data from the target speaker yields a significantly higher accuracy of about 60%, compared to about 20% when no adaptation is used. The accuracy increases to about 63% when the number of adaptation sentences is increased from 10 to 50. Adaptation using a greater number of sentences from the population of nontarget speakers gives an accuracy of only about 51%. These results show that even a small amount of adaptation data from the target speaker is effective when using the normal speech models for recognition of whispered speech.

4.4. Recognition in a noisy environment

For private communication using a mobile phone in noisy conditions, one may consider methods such as speaking in a low voice, or whispering while covering the mouth and/or the input section of the mobile phone with one's hand, to improve the SNR of the recorded data. In this subsection, we discuss the effects of environmental noise on the recognition of whispered speech with the mobile phone as the input device. As discussed earlier, the TH corpus is used throughout this section. First, we compare the SNR values for speech recorded in the computer room and in the soundproof room. The SNRs of one female speaker's data for a sentence in each of the six speaking styles are listed in Table 3.
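The values in Table 3 are segment-level signal-to-noise ratios; a minimal sketch of how such a figure can be estimated is given below, assuming a speech region and a noise-only region of the recording can be marked. The paper does not describe its exact measurement procedure, so this is an assumption for illustration.

# Hedged sketch of a segment-level SNR estimate like those in Table 3:
# compare the average power of a marked speech region with that of a
# noise-only region (the speech region still contains noise, so this is
# an approximation).
import numpy as np

def snr_db(samples, speech_region, noise_region):
    """Regions are (start, end) sample indices into the recording."""
    s0, s1 = speech_region
    n0, n1 = noise_region
    p_speech = np.mean(np.asarray(samples[s0:s1], dtype=float) ** 2)
    p_noise = np.mean(np.asarray(samples[n0:n1], dtype=float) ** 2)
    return 10.0 * np.log10(p_speech / p_noise)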

It is seen that the SNR decreases by about 9–11 dB for the low-voice speaking style compared to normal speech. However, when the mouth is covered with a hand while speaking in a low voice, the SNR improves by about 3.4 dB for the data recorded in the computer room and by about 9.0 dB for the data recorded in the soundproof room. For whispered speech, the SNR decreases by about 13 dB and 17.5 dB for the recordings in the computer room and the soundproof room, respectively. Covering the mouth with a hand improves the SNR by 4.0 dB and 8.5 dB, respectively. Covering both the mouth and the input section of the mobile phone with a hand improves the SNR substantially, by about 12.8 dB for the data recorded in the computer room and by about 18.8 dB for the speech recorded in the soundproof room. These results clearly show that covering the mouth and the mobile phone with a hand is helpful in improving the SNR.

Next we evaluate the recognition performance for the speech recorded in the different speaking styles, for each of the two recording conditions, i.e., the soundproof room and the computer room. The test sentences are the same 200 newspaper sentences as in the above experiments, but uttered by 10 different speakers. The same language models are used. Acoustic models are derived by applying MLLR adaptation to the standard HMMs, which are trained on a band-limited version of the 4000 PB sentences in the CTM part of the corpus. Since the MLLR adaptation is performed for each speaker using 10 PB sentences uttered in the same speaking style as the test utterances, the models are adapted to both the speaker and the speaking style.


In Fig. 16, we present the word recognition accuracy for the different speaking styles recorded in the soundproof room and the computer room. For the data recorded in the soundproof room, the accuracy decreased by about 5% for the low-voice speech compared to the performance for normal speech, and by about 10% when the mouth was covered with a hand while speaking in a low voice. For the whispered speech, the accuracy decreased by about 20% compared to that for normal speech; covering the mouth with a hand resulted in a decrease of about 5%, and covering both the mouth and the mobile phone in a decrease of about 12%, compared to the accuracy for the whispered speech. These results show that the accuracy decreases significantly when the mouth and the mobile phone are covered with a hand for sentences spoken in the soundproof room. It should be noted that the recognition accuracy decreased even though the SNR improved.

For the data recorded in the computer room, the accuracy for normal speech is about 3% lower than that for the data recorded in the soundproof room. The accuracy decreased by about 30% for the low-voice speech compared to the performance for normal speech in the computer room. Covering the mouth with a hand improves the accuracy for the low-voice speech by about 25%.

Fig. 16. Word accuracy [%] for the different speaking styles (normal; low-voice; low-voice with mouth covered; whispered; whispered with mouth covered; whispered with mouth and handset covered) in the soundproof room and the computer room.

For whispered speech, the accuracy decreased by about 48% compared to that for normal speech. Covering the mouth with a hand improved the accuracy by about 20% for the whispered speech in the computer room, whereas covering both the mouth and the mobile phone with a hand gave an improvement of only about 10%. These results show that although covering the mouth and/or the mobile phone improves the SNR of whispered speech in a real environment, the whispered speech models should be retrained for the particular covering condition in order to improve the recognition accuracy.

5. Summary and conclusions

In this study, we have examined the acoustic characteristics of whispered speech and addressed some of the issues involved in the recognition of whispered speech used for communication over a mobile phone in a noisy environment. The acoustic analysis shows an upward shift of the formant frequencies of vowels in the whispered speech data compared to the normal speech data. Voiced consonants in whispered speech have lower energy at low frequencies, up to 1.5 kHz, and their spectral flatness is greater than in normal speech. An analysis based on the cepstral distance between the spectra of a sound in normal and whispered speech shows that the characteristics of the voiced sounds (vowels and voiced consonants) change more significantly than those of the unvoiced sounds.

A corpus of whispered speech data was built to develop a continuous speech recognition system for whispered speech. Data for a large number of sentences was recorded in a soundproof room and in noisy conditions, and was collected for different speaking styles such as normal speech, whispered speech, speaking in a low voice, and covering the mouth with a hand while speaking. In the whispered speech recognition experiments, we used acoustic models trained with normal speech, acoustic models trained with whispered speech, acoustic models trained with both normal and whispered speech, and a model selection strategy that selects the model set with the higher likelihood.


Results have revealed that the speaking-style-independent models and the model selection strategy show good recognition performance. Our adaptation studies have shown that adaptation using a small amount of whispered speech data from a target speaker can be used effectively for recognition of whispered speech. In noisy surroundings, the recognition accuracy decreases significantly for the low-voice and whispered speaking styles compared to the normal speaking style. The method of improving the SNR by covering the mouth with a hand has been shown to yield improved recognition accuracy for the low-voice and whispered speaking styles used for private communication in a noisy environment.

References

Eklund, I., Traunmuller, H., 1996. Comparative study of male and female whispered and phonated versions of the long vowels of Swedish. Phonetica 54, 1–21.
Fujimura, O., Lindqvist, J., 1971. Sweep-tone measurements of vocal-tract characteristics. J. Acoust. Soc. Am. 49, 541–558.
Holmes, J.N., Stephens, A.P., 1983. Acoustic correlates of intonation in whispered speech. J. Acoust. Soc. Am. 73, S87.
Itou, K., Takeda, K., Takezawa, T., Matsuoka, T., Shikano, K., Kobayashi, T., Itahashi, S., 1998. Design and development of Japanese speech corpus for large vocabulary continuous speech recognition. In: Proceedings of Oriental COCOSDA, May 1998, Tsukuba.
Kallail, K.J., Emanuel, F.W., 1984. Formant-frequency differences between isolated whispered and phonated vowel samples produced by adult female subjects. J. Speech Hearing Res. 27, 245–251.
Kawahara, T., Kobayashi, T., Takeda, K., Minematsu, N., Itou, K., Yamamoto, M., Yamada, A., Utsuro, T., Shikano, K., 1999. Japanese dictation toolkit: plug-and-play framework for speech recognition R&D. In: Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU'99), pp. 393–396.
Konno, H., Toyama, J., Shimbo, M., Murata, K., 1996. The effect of formant frequency and spectral tilt of unvoiced vowels on their perceived pitch and phonemic quality. IEICE Technical Report, SP95-140, March, pp. 39–45.
Kurematsu, A., Takeda, K., Kuwabara, H., Shikano, K., Sagisaka, Y., Katagiri, S., 1990. ATR Japanese speech database as a tool of speech recognition and synthesis. Speech Communication 9 (4), 357–363.
Leggetter, C.J., Woodland, P.C., 1995. Flexible speaker adaptation using maximum likelihood linear regression. In: Proceedings of the ARPA Spoken Language Technology Workshop, 1995, Barton Creek.
Meyer-Eppler, W., 1957. Realisation of prosodic features in whispered speech. J. Acoust. Soc. Am. 29, 104–106.
Morris, R., Clements, M., 2002. Estimation of speech spectra from whispers. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7–11 May 2002, Orlando, p. IV-4159.
Sugito, M., Higasikawa, M., Sakakura, A., Takahashi, H., 1991. Perceptual, acoustical, and physiological study of Japanese word accent in whispered speech. IEICE Technical Report, SP91-1, May 1991, pp. 1–8.
Thomas, I.B., 1969. Perceived pitch of whispered vowels. J. Acoust. Soc. Am. 46, 468–470.
Wenndt, S.J., Cupples, E.J., Floyd, R.M., 2002. A study on the classification of whispered and normally phonated speech. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP), 16–20 September 2002, Denver, pp. 649–652.