Journal of Phonetics 31 (2003) 579–584
www.elsevier.com/locate/phonetics

Glimpsing speech

Martin Cooke*
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK

Received 16 September 2002; received in revised form 12 February 2003; accepted 14 February 2003

1. Introduction

The purpose of this paper is to provide further support for the notion of multiple looks or 'glimpses' in everyday speech perception, and to highlight some of the obstacles that confront any system, computational or biological, wishing to exploit them. Moore (2003) provides arguments for an intelligent temporal integration of speech, motivated by the 'multiple looks' model of Viemeister and Wakefield (1991). Their model involved the detection of brief tones in noise, but Moore argues that such a process might also apply to speech perception. The idea that listeners are able to integrate glimpses of speech to form linguistic percepts was suggested by Miller and Licklider (1950), who demonstrated the high intelligibility of interrupted speech. Informally, a glimpse may be defined as an arbitrary time–frequency region which contains a reasonably undistorted view of the target signal.

Section 2 describes some studies which contribute to the appeal of glimpses, while the problem of defining a glimpse is tackled in Section 3. The glimpses notion has also influenced computational approaches to robust automatic speech recognition (ASR), leading to the development of missing data theory (Cooke, Green, & Crawford, 1994; Cooke, Green, Josifovski, & Vizinho, 2001). A recent implementation of these ideas (Barker, Cooke, & Green, 2001) was one of the top performers in the 2001 AURORA global evaluation of robust ASR. Computational aspects of glimpsing are discussed further in Section 4.

2. Arguments for glimpsing in speech perception

A number of factors contribute to the appeal of glimpses. There is ample evidence that listeners can handle partial information brought about by experimental manipulations which produce significant holes in the spectrum (Lippmann, 1996; Kasturi, Loizou, Dorman, & Spahr, 2002) or gaps in the temporal waveform (Strange, Jenkins, & Johnson, 1983). A substantial body of experiments using distorted speech lends further support to the glimpses notion (see the comprehensive review in Assmann & Summerfield, in press). Additionally, there are good reasons to believe that missing information is the natural condition for speech in noisy or reverberant environments. Masking due to additional sound sources, or self-masking in the case of reverberation, distorts the speech signal in ways that manifest as regions where the energy surface of speech is locally swamped by noise.

An experiment by Drullman (1995a) demonstrated that relatively few glimpses are necessary to achieve high intelligibility of speech in noise. He filtered noisy speech into 24 quarter-octave bands, extracted the temporal envelope in each band, and replaced those parts of the envelope below a target amplitude level with a value equal to the target level. By manipulating the target level used to define which regions to retain, Drullman was able to control how many glimpses were available to listeners. In one condition he found that, even with 98% of the information in the signal lost, listeners achieved a mean sentence intelligibility score of 60%. Similar results have been demonstrated for computational models which use partial information (Cooke et al., 1994).

Although glimpsing can be seen as a strategy for handling the nonuniform distribution of relevant information in a noisy speech signal, there is growing interest in phonetics in nonuniform processes for the analysis of clean speech too. For instance, Stevens (2002) argues that listeners base identification on 'acoustic landmarks', while van Son and Pols (1999) and Hawkins and Smith (2001) consider factors underlying the integration of widely dispersed cues, an issue which has been tackled quantitatively by Yang, Van Vuuren, Sharma, and Hermansky (2000) using the mutual information concept.

The computational appeal of glimpsing stems from the possibility of a simpler approach to ASR in noise than the strategies employed at present. Current approaches to robust ASR attempt to clean up or 'enhance' noisy signals before matching them against models of clean speech. In spite of significant efforts, no general-purpose noise-removal algorithm has been built to date. 'Blind' approaches, which use multiple microphones and powerful algorithms based on statistical independence, are now generally accepted to be incapable of separating audio signals under realistic conditions (Hyvärinen, Karhunen, & Oja, 2001). Alternative approaches employing constraints based on primitive auditory grouping principles have likewise achieved only limited success (Slaney, 1998; Cooke & Ellis, 2001). Enhancement schemes have to solve the difficult problem of splitting the energy at each time–frequency location into contributions from the target speech and the background noise. However, the dynamic nature of speech ensures that, in any given region, the target speech will frequently be either dominant or hopelessly dominated by the background. A strategy which glimpses the dominant regions and treats the nondominant regions as masked avoids the hard problem of energy splitting.
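The contrast between energy splitting and glimpsing can be made concrete. The Python sketch below labels each spectro-temporal cell as glimpsed or masked according to its local SNR. It assumes, as in a simulation but not in practice, that the clean speech and noise waveforms are available separately; the function and parameter names (glimpse_mask, snr_threshold_db) are illustrative, not taken from the literature.

```python
# Sketch: binary glimpse mask from local SNR, assuming separate access
# to the clean speech and noise signals (a simulation-only convenience).
import numpy as np
from scipy.signal import stft

def glimpse_mask(speech, noise, fs, snr_threshold_db=0.0, nperseg=512):
    """Mark time-frequency cells where speech dominates the noise."""
    _, _, S = stft(speech, fs, nperseg=nperseg)
    _, _, N = stft(noise, fs, nperseg=nperseg)
    eps = np.finfo(float).eps
    local_snr_db = 10 * np.log10((np.abs(S)**2 + eps) / (np.abs(N)**2 + eps))
    return local_snr_db > snr_threshold_db  # True = glimpsed, False = masked

# Toy usage: what fraction of cells could a listener glimpse?
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)  # stand-in 'speech'
noise = 0.5 * np.random.randn(fs)                      # white-noise masker
mask = glimpse_mask(speech, noise, fs)
print(f"glimpsed proportion: {mask.mean():.2f}")
```

Because decisions are binary, no attempt is made to apportion the energy in each cell between the two sources; a cell is simply kept or treated as masked.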

3. What constitutes a glimpse?

Although the notion that everyday speech perception is mediated by glimpsing the incoming spectro-temporal excitation pattern (STEP) is attractive, there are several practical questions that any computational implementation (biological or otherwise) must address. The discussion up to this point has considered a glimpse to be a region of favorable local signal-to-noise ratio (SNR) in some time–frequency representation such as Moore's STEP. However, this working definition leaves several factors unspecified.

One such factor is the size of the spectro-temporal window used for glimpsing. Lower bounds are provided by the frequency and time resolution of the auditory system, but experimental data on glimpse windows are sparse and contradictory. Assmann and Summerfield (in press) note that a glimpsing model of concurrent vowel identification (Culling & Darwin, 1994) required only a sufficiently small window (some tens of milliseconds) to exclude the nondominant vowel, while sentence pairs will display longer regions of dominance for one or other source. Similar concerns apply to estimates of the frequency extent of a glimpse. A recent glimpse-based model of vowel identification (de Cheveigné & Kawahara, 1999) operates at the frequency resolution of the auditory periphery. However, their model exploits simultaneous glimpses. Howard-Jones and Rosen (1993) examined listeners' ability to use nonsimultaneous glimpses occurring in distinct frequency regions, employing a noise masker whose spectrogram resembles a checkerboard. They demonstrated that listeners can exploit such 'uncomodulated' glimpsing, but by varying the bandwidth of the individual squares of the checkerboard they showed that the fluctuations must extend over a wide frequency region. It is not clear whether similar results would be found for structured maskers such as speech.

A further issue concerns the degree to which one source must dominate in a given spectro-temporal region for it to be considered a useful glimpse. Presumably, the detection threshold of −4 dB SNR for complex signals (Moore, 1997) provides a lower bound, while at small positive SNRs speech energy undergoes minimal distortion. Drullman (1995b) demonstrated that weak speech elements at up to 2 dB below the noise level contribute to intelligibility.
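The checkerboard masker described above is straightforward to approximate computationally. The following sketch gates log-spaced noise bands in alternation to produce a checkerboard-like spectrogram; the band count, band edges, and gating rate are illustrative assumptions, not the values used by Howard-Jones and Rosen.

```python
# Sketch: a checkerboard-style noise masker in the spirit of
# Howard-Jones & Rosen (1993). All parameter values are illustrative.
import numpy as np
from scipy.signal import butter, sosfilt

def checkerboard_noise(dur=1.0, fs=16000, n_bands=4, gate_hz=10.0,
                       f_lo=100.0, f_hi=7000.0):
    t = np.arange(int(dur * fs)) / fs
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)      # log-spaced edges
    gate = (np.floor(2 * gate_hz * t) % 2).astype(float)  # 0/1 square wave
    masker = np.zeros_like(t)
    for b in range(n_bands):
        sos = butter(4, [edges[b], edges[b + 1]], btype='band',
                     fs=fs, output='sos')
        band = sosfilt(sos, np.random.randn(len(t)))
        # Opposite gating phase in adjacent bands -> checkerboard pattern
        masker += band * (gate if b % 2 == 0 else 1.0 - gate)
    return masker

noise = checkerboard_noise()
```

Varying n_bands changes the bandwidth of the individual checkerboard squares, which is the manipulation used to probe how far in frequency the fluctuations must extend.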

4. Computational aspects of glimpsing

Any system wishing to exploit the computational benefits of glimpsing has to solve the twin problems of detection and integration. This section examines possible solutions to each problem, and describes the computations involved in bringing glimpses into contact with stored templates for speech.

How might glimpses of speech be detected? One approach is to examine each spectro-temporal region for distinctive evidence of speech. For example, Berthommier and Glotin (1999) used a measure of harmonicity in four sub-bands, while Tchorz and Kollmeier (2002) estimated local speech-to-noise ratios from amplitude modulation spectrograms using a neural network learning procedure. One drawback with these techniques is that many nonspeech sources share properties such as harmonicity with speech. Further, much useful information in speech is not conveyed via harmonics. Consequently, it is more likely that glimpse detection occurs alongside glimpse integration. In principle, primitive auditory scene analysis cues such as harmonicity and common onset can be used for across-frequency integration, while continuity constraints are available for sequential integration (Bregman, 1990). Common amplitude fluctuations in the speech or masking sources may indicate beneficial times at which to glimpse the entire spectrum. It is worth noting that, due to their common origin, glimpses of speech in noise will not be 'independent looks' in a statistical sense. Truly independent glimpses would lack primitive grouping cues and would have to rely on purely linguistic factors for temporal integration.
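To illustrate harmonicity-based detection, here is a crude frame-level harmonicity score computed from the normalized autocorrelation. It is a generic stand-in for this class of detector, assuming a plausible pitch-lag search range; it is not the feature used by Berthommier and Glotin (1999).

```python
# Sketch: frame-level harmonicity via normalized autocorrelation.
# A generic stand-in for harmonicity-based glimpse detection.
import numpy as np

def harmonicity(frame, fs, f0_min=80.0, f0_max=400.0):
    """Peak autocorrelation in the plausible pitch-lag range, in [0, 1]."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    return float(ac[lo:hi].max() / ac[0])

fs = 16000
t = np.arange(480) / fs                                # 30 ms frame
print(harmonicity(np.sin(2 * np.pi * 150 * t), fs))    # high: periodic
print(harmonicity(np.random.randn(480), fs))           # low: aperiodic
```

The weaknesses noted in the text show up directly here: any periodic source (music, a competing voice) scores high, while unvoiced speech scores low, so harmonicity alone cannot sort glimpses into speech and nonspeech.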


A further issue is the longer-term integration of glimpses into linguistic percepts. Apart from the temporal warping and spectral shifting of templates suggested by Moore (both in widespread use in ASR), several obstacles prevent glimpses from being brought directly into contact with templates. First, since glimpses represent partial information, standard distance metrics cannot be applied. Instead, missing data techniques (Cooke et al., 1994, 2001) have been developed which, at their core, allow the likelihood of a spectral frame containing missing portions to be evaluated with respect to models of clean speech. These techniques go beyond simply replacing missing values with zero or some constant, instead computing the likelihood of an observation by integrating over all possible values for the missing portions. Furthermore, the missing parts themselves contain useful information, which missing data techniques exploit: the observed energy in missing regions acts as an upper bound on the energy of the masked speech signal.

A more important problem is that it is unreasonable to expect glimpses to be sorted into speech and nonspeech categories prior to recognition, owing to the difficulty of finding nonlinguistic cues unique to speech. A related problem occurs when the interfering 'noise' is another speaker, in which case all glimpses arrive as speech, but without any speaker assignation. Temporal integration of glimpses is thus akin to solving a jigsaw puzzle containing a subset of pieces from each of a number of jigsaws. In principle, every combination of glimpses could be evaluated with respect to stored templates for speech, but the number of combinations is prohibitive. However, Barker, Cooke, and Ellis (submitted) have recently demonstrated a speech decoder capable of efficient glimpse integration. In contrast to traditional ASR decoders, which determine the most likely word sequence given the acoustic observations, the decoder of Barker et al. jointly determines the most likely word sequence and set of glimpses. Consequently, temporal integration of speech glimpses is possible without the glimpses needing to be labeled as speech in advance. This finding eases the pressure on the glimpse detection stage, to the point at which it may be possible to use primitive auditory grouping cues (Bregman, 1990) to determine which time–frequency regions belong together to form a glimpse.
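The bounded-marginalization idea can be made concrete with a minimal sketch for a single diagonal-Gaussian model of spectral energies. Real missing data systems use mixture models attached to HMM states; the single Gaussian and the variable names here are simplifying assumptions of mine, not the formulation of Cooke et al. (2001).

```python
# Sketch: missing data likelihood with bounded marginalization for one
# diagonal Gaussian. Present channels use the ordinary density; masked
# channels integrate over all speech values up to the observed energy,
# which upper-bounds the masked speech. A simplified illustration only.
import numpy as np
from scipy.stats import norm

def log_likelihood(x, mask, mu, sigma):
    """x: observed spectral frame; mask: True where speech is glimpsed;
    mu, sigma: per-channel mean/std of a clean-speech Gaussian model."""
    ll = 0.0
    for xi, present, mu_i, sd_i in zip(x, mask, mu, sigma):
        if present:
            ll += norm.logpdf(xi, mu_i, sd_i)   # ordinary likelihood
        else:
            ll += norm.logcdf(xi, mu_i, sd_i)   # integrate up to the bound
    return ll

# Toy usage: a 4-channel frame with channels 3 and 4 masked by noise.
x = np.array([2.0, 1.5, 5.0, 4.0])
mask = np.array([True, True, False, False])
mu = np.array([2.0, 1.0, 3.0, 2.0])
sigma = np.ones(4)
print(log_likelihood(x, mask, mu, sigma))
```

The logcdf term is where the masked regions contribute: a clean-speech model whose predicted energy greatly exceeds the observed noise level in a masked channel is penalized, even though no speech value was observed there.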

5. Conclusions

Substantial perceptual evidence suggests that listeners can make decisions about speech targets in noise backgrounds based on partial information glimpsed from the signal. However, important aspects of the underlying processes remain obscure. The key challenges for the glimpsing approach to speech perception lie in finding robust mechanisms to selectively detect and integrate regions of the signal. Computational approaches based on missing data and novel decoders promise to solve the integration problem, but the detection of glimpses remains a significant challenge. A compelling argument in favor of glimpsing models is that conventional ASR techniques break down in the face of even modest departures from noise stationarity (Lippmann, 1997), yet it is precisely that kind of masker which presents the best opportunities for glimpsing. By understanding the mechanisms listeners employ to exploit such opportunities, progress can be made in ASR under adverse conditions.


Acknowledgements

I would like to thank the editors, Alain de Cheveigné, and one anonymous reviewer for their insightful comments on the manuscript.

References

Assmann, P., & Summerfield, Q. (in press). The perception of speech under adverse acoustic conditions. In S. Greenberg, & W. Ainsworth (Eds.), Speech processing in the auditory system, Springer Handbook of Auditory Research, Vol. 14. Berlin: Springer.

Barker, J., Cooke, M. P., & Ellis, D. P. W. (submitted for publication). Decoding speech in the presence of other sources. Speech Communication.

Barker, J., Cooke, M. P., & Green, P. D. (2001). Robust ASR based on clean speech models: An evaluation of missing data techniques for connected digit recognition in noise. In Proceedings of Eurospeech, Aalborg 2001 (pp. 213–216).

Berthommier, F., & Glotin, H. (1999). A new SNR feature mapping for robust multistream speech recognition. In Proceedings of the XIVth international congress of phonetic sciences, San Francisco (pp. 711–715).

Bregman, A. S. (1990). Auditory scene analysis. Cambridge, MA: MIT Press.

Cooke, M. P., Green, P. D., & Crawford, M. D. (1994). Handling missing data in speech recognition. In Proceedings of the third international conference on spoken language processing, Yokohama (pp. 1555–1558).

Cooke, M. P., Green, P. D., Josifovski, L., & Vizinho, A. (2001). Robust automatic speech recognition with missing and uncertain acoustic data. Speech Communication, 34, 267–285.

Cooke, M. P., & Ellis, D. P. W. (2001). The auditory organization of speech and other sources in listeners and computational models. Speech Communication, 35, 141–177.

Culling, J. F., & Darwin, C. J. (1994). Perceptual and computational separation of simultaneous vowels: Cues arising from low frequency beating. Journal of the Acoustical Society of America, 95, 1559–1569.

de Cheveigné, A., & Kawahara, H. (1999). Missing-data model of vowel identification. Journal of the Acoustical Society of America, 105, 3497–3508.

Drullman, R. (1995a). Temporal envelope and fine structure cues for speech intelligibility. Journal of the Acoustical Society of America, 97, 585–592.

Drullman, R. (1995b). Speech intelligibility in noise: Relative contributions of speech elements above and below the noise level. Journal of the Acoustical Society of America, 98, 1796–1798.

Hawkins, S., & Smith, R. (2001). Polysp: A polysystemic, phonetically rich approach to speech understanding. Italian Journal of Linguistics–Rivista di Linguistica, 13, 99–188.

Howard-Jones, P. A., & Rosen, S. (1993). Uncomodulated glimpsing in checkerboard noise. Journal of the Acoustical Society of America, 93, 2915–2922.

Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.

Kasturi, K., Loizou, P. C., Dorman, M., & Spahr, T. (2002). The intelligibility of speech with holes in the spectrum. Journal of the Acoustical Society of America, 112, 1102–1111.

Lippmann, R. P. (1996). Accurate consonant perception without mid-frequency speech energy. IEEE Transactions on Speech and Audio Processing, 4, 66–69.

Lippmann, R. P. (1997). Speech recognition by machines and humans. Speech Communication, 22, 1–16.

Miller, G. A., & Licklider, J. C. R. (1950). The intelligibility of interrupted speech. Journal of the Acoustical Society of America, 22, 167–173.

Moore, B. C. J. (1997). An introduction to the psychology of hearing (4th ed.). San Diego: Academic Press.

Moore, B. C. J. (2003). Temporal integration and context effects in hearing. Journal of Phonetics, doi:10.1016/S0095-4470(03)00011-1, this issue.

Slaney, M. (1998). A critique of pure audition. In D. F. Rosenthal, & H. G. Okuno (Eds.), Readings in computational auditory scene analysis (pp. 27–42). London: Lawrence Erlbaum.

Stevens, K. N. (2002). Towards a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111, 1872–1891.


Strange, W., Jenkins, J. J., & Johnson, T. L. (1983). Dynamic specification of coarticulated vowels. Journal of the Acoustical Society of America, 74, 695–705.

Tchorz, J., & Kollmeier, B. (2002). Estimation of the signal-to-noise ratio with amplitude modulation spectrograms. Speech Communication, 38, 1–13.

van Son, R. J. J. H., & Pols, L. C. W. (1999). Perisegmental speech improves consonant and vowel identification. Speech Communication, 29, 1–22.

Viemeister, N. F., & Wakefield, G. H. (1991). Temporal integration and multiple looks. Journal of the Acoustical Society of America, 90, 858–865.

Yang, H. H., Van Vuuren, S., Sharma, S., & Hermansky, H. (2000). Relevance of time–frequency features for phonetic and speaker-channel classification. Speech Communication, 31, 35–50.