Int. J. Man-Machine Studies (1972) 4, 105-118
Basic Directions in Automatic Speech Recognition

DAVID J. BROAD
Speech Communications Research Laboratory Inc., 35 West Micheltorena Street, Santa Barbara, California, U.S.A.

(Received 10 February 1972)

This paper presents a view of basic problems in automatic speech recognition drawn from the vantage point of applied linguistics. The basic areas of speech production, articulatory phonetics, acoustic analysis, acoustic phonetics, and phonetic sequences of natural speech are crucial to flexible automatic speech recognition, especially for the long-range goal of recognizing continuous-flow large-vocabulary natural speech of different speakers. Articulatory phonetics provides a link between the physical events of speech and the elements of the phonological code. The processes of speech production are basic to articulatory phonetics, while the acoustic theory of speech production provides a basis for the phonetic interpretation of the acoustic speech waveform. Formant frequencies are particularly important acoustic parameters, and recent improvements in formant tracking are encouraging for speech recognition. Even if complete phonetic recognition is realized, however, there remains the conversion of phonetic strings to lexical strings. This problem requires extensive phonetic analysis of actual spoken language.
Introduction

Automatic speech recognition (ASR) has been defined and approached in many ways. A speech recognizer is, generally speaking, a device which will decode the acoustic signals of speech into discrete messages. The messages may be meaning-related actions of a machine, or may be some other discrete representation of the utterance, such as a typescript of the speech. Ideally, the goal is to design an ASR system to handle the continuous utterances of any number of speakers in moderate or even poor noise environments. With this as a long-range goal it seems evident that a basic understanding of human speech communication is needed.
This requires knowledge in many areas, including the mechanisms of speech production, articulatory phonetics, the acoustic structure of the speech signal, the relations between the continuous physical events of speech and phonetic units, the organization of phonetic units into higher-ordered sets as specified by the phonological structures of given dialects, and the rules by which strings of phonological units encode lexical strings. Significant problems for research exist in each of these areas. Several reviews of the ASR field are available in the literature (Lindgren, 1965a, b; Hyde, 1968; Lea, 1970; Hill, 1971).

The purpose of this paper is to outline one general approach to automatic speech recognition. No actual recognizer is described; instead the relevance of basic research in the above areas to a general resolution of ASR problems is discussed, and, somewhat more strongly, it is asserted that these basic research problems are critical to the realization of flexible ASR. Research in ASR has often been pursued as a short-range effort to develop some special-purpose device, with the usual result that although the device may sometimes function according to specification, its extension to a more powerful system does not appear possible and little of general value to the field has been learned. Clearly, it is desirable to realize limited-purpose ASR systems that can find immediate useful application, especially from the point of view that success with such systems may justify interest in more ambitious systems (Hill, 1969). At the same time, the pursuit of ASR via basic research on several major sub-problems has the advantage that its results are also applicable both to the more limited short-range goals in ASR and to the solution of general-purpose ASR. In addition, we might justifiably expect to learn something significant about speech by taking a basic approach to what is usually regarded as an applied problem. There are enough unresolved basic problems that the need for the co-operative efforts of many researchers and laboratories should be evident.
Speech Production and Articulatory Phonetics

The mechanisms of speech production determine the phonetic structure of speech, and it is through a basic understanding of these mechanisms that properties of speech which are crucial to ASR can be found. Such a basic understanding is necessary for the proper direction of ASR research. For example, some acoustic speech analyzer may have an apparently intractable output. In such a case it is good to have some idea about whether the analyzer's complex behavior is due to an inappropriate analyzer design or whether, on the other hand, the complexity arises from the speech process itself. An understanding of the speech mechanisms indicates that the latter is indeed often the case. Also, a knowledge of the intricate behavior of the speech mechanisms provides an idea of the kinds of problems that an acoustic speech analyzer and its associated decision programs will have to handle. Hence speech production imposes some of the design criteria on an ASR system. More than this, it also provides some of the design concepts, because every regular effect that must be taken into account is actually a form of redundancy that can help reduce the unknown part of the variability of the speech signal. For example, we know that vowel shapes are influenced by adjacent consonants. By taking this into account, vowel recognition can be improved and, in addition, some information about consonant identity can be obtained.

Speech production has also been the physical basis for the phonetic description of speech via articulatory phonetics. A speaker produces an utterance by a sequence of complicated movements of the structures of his vocal mechanism. These movements control the flow of the breath stream in the generation of sound in the vocal tract; they also determine how the sound is modified before it is radiated as an acoustic wave to be heard and understood by a listener. Articulatory phonetics is founded on the observation that utterances can be described as sequences of individual speech "sounds", or "phones", which can be defined according to the physiological gestures that produce them. The notational system of phonetics traditionally refers to physiological formations in the vocal mechanism. Physiological systems for describing speech sounds have existed for a very long time. Especially noteworthy is the level of phonetic sophistication achieved by the ancient Sanskrit scholars in the Atharva-Veda Prātiçākhya (Whitney, 1862). A universal notational system adopted in 1886 by the International Phonetic Association (IPA, 1949) has long been used by phoneticians and linguists alike. Contemporary phoneticians, such as Pike (1943) and others, have used the IPA as a reference for developing their more elaborate systems. The significance of such systems goes far beyond notational conventions; the systems specify a basic relation between observable physical events and linguistic signs that underlies all spoken language. Such a relation is crucial to automatic speech recognition, of course, since at some stage it is necessary to interpret a physical event of speech as an instance of an element of some language code.

Traditional articulatory phonetics has been the empirical basis for the phonological study of languages for some time and has served this function remarkably well. For the algorithmic specification of speech, however, it has been found that some concepts are needed that have not been previously formalized. For example, it is not at all easy to state the formal conditions which define the phone, the fundamental unit of phonetics. A substantial effort has therefore been directed toward the rigorous specification of a physiological phonetic theory (Peterson & Shoup, 1966a).
This theory defines the phone through a hierarchy of more primitive concepts, including those of articulatory minimum, articulatory steady state, controlled articulatory movement, and a set of phonetic parameters and parameter values. The constructs are based on a set of axioms, which are assumptions about the behavior of the vocal mechanisms (for example, that turbulent airflow can be generated at a constriction, or that there exist minima in the absolute average rate of change of the shape and position of an articulator). Although the theory is fairly complex, it appears that any much simpler theory would overlook phenomena that are crucial to defining the phonetic segment. One of the problems handled by the theory, for example, is that articulators seldom occupy steady states at the same time, and so it is necessary to specify which of the many simultaneous actions of the vocal mechanism are most relevant to defining the current segment. Thus, to take one of the simplest examples, it is usually not necessary to take account of tongue movements during the bilabial closure to specify the plosive [b]. In other sounds, such as the rounded vowel [u], however, both tongue position and lip shape need to be considered. Many similar details must be considered when a definition for all the phonetic types is sought together with the conditions that permit a discrete segment to be identified with some interval of the continuous behavior of the vocal mechanism.

For ASR a rigorous definition of the phonetic type is needed, especially when the phonetic type itself is the initial object of recognition. Although the theory mentioned above goes far in realizing this goal, in some respects it is still not complete. It is of some theoretical and practical importance, for example, to have some explicit quantitative definition for the rate of change in the position and shape of an articulator; the theory in its present form assumes that such a time function exists for each articulator, but it does not give the forms of these functions. These forms would provide very worthwhile objects of research. Some tentative formulations based on the vocal tract area function have been suggested elsewhere (Broad, to be published), but the question is still open.
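The kind of quantitative definition at issue can at least be suggested by a small numerical sketch. The Python fragment below is not part of the theory cited above; it merely shows one hypothetical way of locating candidate articulatory steady states, as intervals where the absolute rate of change of a sampled articulator coordinate stays below a threshold. The trajectory, the sampling step, and the threshold are all assumed inputs.

import numpy as np

def steady_state_intervals(x, dt, rate_threshold):
    """Given a sampled articulator trajectory x (one position
    coordinate per frame), return (start, end) index pairs of the
    intervals where the absolute rate of change stays below
    rate_threshold.  All three arguments are hypothetical inputs;
    the theory discussed above would also require combining such
    intervals across several articulators."""
    rate = np.abs(np.gradient(x, dt))   # |dx/dt| at each frame
    slow = rate < rate_threshold        # frames moving "slowly"
    intervals, start = [], None
    for i, s in enumerate(slow):
        if s and start is None:
            start = i
        elif not s and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:
        intervals.append((start, len(x) - 1))
    return intervals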
Physiological phonetics provides a link between physical events and code elements in speech, and through the acoustic theory of speech production we can use physiological phonetics to help understand the acoustic structure of speech sounds. The importance of relating the physiological and acoustical events becomes obvious once it is realized that though speech is well represented by phonetic classes based on actions of the vocal mechanism (articulatory events), the input to an automatic speech recognizer is typically the acoustic speech wave. Past attempts in ASR which used only acoustic data without reference to the underlying physiological behavior have been severely limited in their performance. Sometimes, with sufficient constraints on vocabulary size and manner of speaking, it is possible to bypass the recognition of phonetic types and recognize utterances as total patterns, or Gestalten. However, the number of template patterns to be stored grows at least as fast as the vocabulary size. Even more seriously, when vocabulary words are spoken in continuous utterances, their forms are modified by the phonetic context; and, further, the segmentation into words is not obvious at the acoustic level. The phonetic types, on the other hand, form a finite "alphabet" that permits an unlimited number of vocabulary items and utterances to be represented. It should be simpler, then, first to resolve an input utterance into its sequence of phonetic types and then to apply rules of phonetic assimilation and pronunciation matching to these sequences, rather than applying them to the acoustic analyzer output directly. The actual phonetic sequences employed by a speaker do not usually correspond simply with "dictionary" pronunciations of the word sequences. Sounds may be modified by the assimilation rules; they may be surprisingly modified when the redundancy of the language permits it; or they may even be altogether deleted under the same constraint. These facts pose significant problems for the linguistic description of speech needed for ASR, but they should not be confused with the problems of identifying the actual phonetic content of the speech signal. To lump the two types of problems together, in the hopes that some sort of acoustic analysis will yield word sequences directly, is to overlook the true complexity of the challenge of ASR.
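To indicate what a rule of phonetic assimilation might look like when applied to phone sequences rather than to raw analyzer output, the following sketch applies one well-known English assimilation, the tendency of an alveolar nasal toward [m] before a bilabial, to a phone string. The rule set and the phone notation are simplified assumptions for illustration, not part of any system described in this paper.

# A minimal sketch of rule-based phonetic assimilation, assuming
# phones are represented as simple strings.  A real system would
# need many such rules, applied in both directions for matching.
BILABIALS = {"p", "b", "m"}

def apply_assimilation(phones):
    """Return a variant of the phone sequence with /n/ realized
    as [m] before a bilabial."""
    out = list(phones)
    for i in range(len(out) - 1):
        if out[i] == "n" and out[i + 1] in BILABIALS:
            out[i] = "m"
    return out

# "in person" spoken fluently: the dictionary form with /n/ yields
# a predicted surface form with [m] before the bilabial [p].
print(apply_assimilation(["ɪ", "n", "p", "ɝ", "s", "ə", "n"]))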
There are good reasons to believe that phonetic type recognition should be possible. First, trained human transcribers can listen to tape recordings of speech which is meaningless to them (for example, speech in a language unknown to them) and generate reasonable phonetic representations of what they hear. Since in this case no information is available to the transcriber other than the acoustic waveforms, the phonetic information is available in the acoustic signal alone. From this point of view the solution of automatic phonetic type recognition depends on identifying the information that is actually in the acoustic signal. A second reason for believing that phonetic type recognition should be possible is the more constructive one based on physiological phonetics and the acoustic theory of speech production. We know a substantial amount about the acoustical properties of speech sounds; we know properties which should provide phonetically complete descriptions of the sounds. Beyond this theoretical knowledge, we also now have good ways of identifying acoustic features that we did not have until recently; the most important of these is improved formant frequency tracking (see below).

This does not mean, of course, that phonetic type recognition has been solved, but rather that the necessary tools are available. The benefits of realizing phonetic type recognition might be summarized here:
(1) phonetic type recognition is a non-trivial component of more general ASR;
(2) phonetic type recognition would be a powerful tool for research on the phonetic sequences of speech; in this way, it would be valuable for the further research needed to solve higher-level ASR problems;
(3) phonetic type recognition would have some direct practical applications in limited-purpose ASR, such as the generation of rough draft typescripts of tape recordings in various languages, or the control of automatic devices by verbal commands;
(4) phonetic type recognition would be a good test for the explicit and necessarily algorithmic specification of the phonetic concepts required, and would thus enlarge our understanding of speech.
Acoustic Analysis

The acoustic speech waveform can be segmented into intervals which may be classified into acoustic speech wave types. The acoustic speech wave types have been defined formally elsewhere (Peterson & Shoup, 1966b). Work in acoustic analysis for ASR has generally been concentrated on the acoustical parameters of speech, for example, the formant frequencies, and has been only minimally directed to the speech wave types, which also carry significant information. The study by Denes & von Keller (1968) used speech wave type information as part of an automatic segmentation scheme. An aspirated voiceless plosive, for example [pʰ], is a phonetic type that is, acoustically, a succession of three speech wave types: quiescent, burst, and quasi-random. Conversely, an utterance that is voiced throughout, such as may we all, is an example of a sequence of several phonetic elements all of the same quasi-periodic speech wave type. Clearly segmentation of the speech wave into speech wave types does not have a simple one-to-one correspondence to its segmentation into phonetic units.

Although speech wave type segmentation does not directly yield phonetic segmentation, it is a necessary step in such segmentation. For example, further phonetic segmentation of may we all depends on identifying the steady-states and transitions of formants¹ within the extended quasi-periodic speech wave type. But in order to decide that formant behavior is the appropriate tool for subsequent analysis, it is necessary to have identified the interval during which the speech signal is quasi-periodic and therefore possessed of a significant information-bearing formant structure. In the converse case of a sequence of different wave types mapping into a single phonetic unit, of course, the wave type identification is still crucial because the recognition of the phonetic type depends on the detection of the correct sequence of wave types. Thus the identification of a speech wave type implies the selection of a particular subsequent analysis as, for example, the selection of a particular set of parameters relevant to that wave type. While formant parameters seem appropriate to identifying most quasi-periodic speech wave types, a measurement of upper and lower cutoff frequencies together with a measure of the total energy is, perhaps, more appropriate to the identification of the quasi-random waveforms of an [s] or [f].

¹ A formant is a damped sinusoidal component of the acoustic impulse response of the vocal tract. That is, a formant corresponds to a conjugate-pole-pair component of the vocal tract transfer function.
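To make the idea of wave-type identification concrete, the sketch below classifies fixed-length frames of a sampled waveform into rough categories using two classic measurements, short-time energy and zero-crossing rate. The thresholds and the three-way labeling are illustrative assumptions only; they are not the formal definitions of Peterson & Shoup (1966b), and the burst type, which would require detecting an abrupt energy onset, is omitted.

import numpy as np

def classify_frames(signal, frame_len, energy_floor, zcr_split):
    """Assign each non-overlapping frame of a numpy float array a
    rough speech wave type.  energy_floor and zcr_split are
    hypothetical thresholds that would have to be tuned; the real
    wave-type definitions are considerably more involved."""
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        # zero-crossing rate: fraction of adjacent samples changing sign
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        if energy < energy_floor:
            labels.append("quiescent")
        elif zcr > zcr_split:
            labels.append("quasi-random")    # e.g. [s]-like frication
        else:
            labels.append("quasi-periodic")  # e.g. voiced intervals
    return labels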
The extraction of formant parameters has long been recognized as one of the more important, as well as more difficult, problems in the acoustic analysis of speech. Formant frequencies have been associated with vowel quality since the nineteenth century. Scripture (1906) was one of the first to recognize the significance of formants as non-harmonic components of the speech wave. Since the invention of the sound spectrograph (Koenig, Dunn & Lacy, 1946) many acoustic phonetic studies have been addressed to the relation between phonetic elements and formant frequency behavior. As a result the characteristic formant steady-states and transitions for vowels and consonants have become relatively well known. Formant frequency studies have yielded considerable information on speech dynamics and on the effects of phonetic context on the formation of speech sounds.

In his classic book, Acoustic Theory of Speech Production, Fant (1960) showed that the formant frequencies could be predicted accurately from the articulatory configuration in the vocal tract as determined by X-rays. This established perhaps the most important relation that exists between the physiological and acoustic domains in speech. It was not at all obvious from the computations, however, how the individual formant frequencies depended on particular features of the vocal tract geometry. Then Schroeder (1967) demonstrated a one-to-one correspondence between formant frequencies and the terms of the Fourier cosine series representation of the normalized logarithmic vocal tract area function. This remarkable result for the first time permitted the estimation of vocal tract shapes on the basis of acoustic measurements and made a more direct interpretation of acoustic behavior in terms of articulatory phonetics possible. The formant frequencies are seen to be directly related to the physiological formations that are basic to useful phonetic classifications. The formants are thus, in this sense, basic elements of speech production.
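The general shape of this correspondence can be indicated here as a sketch only; the exact assumptions, normalization, and sign conventions are those of Schroeder's paper. For a lossless tube of length $L$, closed at the glottis and open at the lips, first-order perturbation theory gives relations of the type

\[
\ln\frac{A(x)}{A_0} \;=\; \sum_{m=1}^{\infty} a_m \cos\frac{m\pi x}{L},
\qquad
\frac{\Delta F_n}{F_n} \;\approx\; -\tfrac{1}{2}\, a_{2n-1},
\]

so that the measured deviation of the $n$th formant from its uniform-tube value $F_n = (2n-1)c/4L$ determines the odd-indexed cosine coefficients of the logarithmic area function, while the even-indexed coefficients are, to first order, acoustically invisible. This is the sense in which the correspondence between formant shifts and cosine coefficients is one-to-one, and also the source of the well-known residual non-uniqueness of area functions recovered from formants alone.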
Formant frequencies provide a common language for the dissemination of acoustic phonetic information that is relatively independent of the particular analyzer design. For example, formant information is more universally usable than the measured voltage levels of a particular filter bank's outputs. Further, the behavior of practically any other type of acoustic analysis for speech can be understood in terms of its response to the formants. This is clearly true of filter bank analysis, especially in the light of the elegant acoustic and perceptual study of the vowels by Klein, Plomp & Pols (1970), which related the two major factors of a filter bank output to the first two formant frequencies; it is also true of other apparently very different analysis schemes, such as that of Gaussian wave function representation (Markel, 1970). Hence, even if an actual ASR system does not explicitly use formant analysis, it seems likely that its behavior with respect to formant patterns will be a basic concern in the phonetic interpretation of whatever acoustic analysis it does employ. Stevens & Klatt (1968), for example, used a filter bank as a basic acoustic analyzer and interpreted the behavior of the filter bank by examining the movements of formant frequencies on sound spectrograms.

The most serious objection to the use of formant analysis in ASR has been the unsatisfactory performance of formant analyzers. The main problem has not been the accuracy of formant estimates per se, but the more basic problem of identifying the gross formant structure. Often a formant analyzer will measure a formant frequency to an accuracy of a few Hertz, but will ignore the existence of another formant or will insert a formant where none exists. Formant misidentifications have been the plague of recognition and transmission systems based on formant analysis. There have recently been significant improvements in formant extraction techniques, however, and the prospects for a serviceable formant tracker seem quite good. These techniques have been based on cepstral analysis, the chirp-z transform, and digital inverse filtering. The latter approach is quite promising, yielding continuous tracking of formants even in the difficult environments of closely-spaced formants, fast transitions, and high fundamental frequencies (Markel, 1971a, b). The formulation is parallel to those of Itakura & Saito (1968) and of Atal (1970). The complete analysis for each frame of data is roughly as fast as a single 256-point complex fast Fourier transform (FFT) and can be implemented on a small computer, such as a PDP-8 with only 4K of core.
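The inverse-filtering formulation itself is given in the papers cited above; the sketch below is only a bare-bones modern restatement of the underlying idea, not Markel's algorithm. A short all-pole model is fitted to a frame by the autocorrelation method, and formant candidates are read off from the complex roots of the prediction polynomial. The frame, sampling rate, and model order are assumed inputs, and a real tracker would further screen candidates by bandwidth (pole radius) and enforce continuity across frames.

import numpy as np

def formant_candidates(frame, fs, order=10):
    """Estimate formant-candidate frequencies (Hz) for one frame
    of speech sampled at fs Hz, via an all-pole model fitted by
    the autocorrelation method of linear prediction."""
    x = frame * np.hamming(len(frame))              # taper the frame
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Solve the normal equations R a = r for the predictor a.
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Roots of the inverse filter A(z) = 1 - sum_k a_k z^-k.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]               # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)      # pole angle -> Hz
    return np.sort(freqs[freqs > 0])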
Acoustic Phonetics of Speech

Since good acoustic parameter analysis seems to be imminently available, the next major challenge in ASR is the complete description of the acoustic-phonetic structure of speech.
Much is known about the acoustic shapes of the different speech sounds. The situation is complicated by the following factors.
(1) Sounds cannot be characterized by fixed absolute values of the parameters. For example, men on the average tend to have lower formant frequencies than women for the same sounds. This means that even ideal steady-state sounds are each represented by sizeable, and partly overlapping, regions of parameter space (Peterson & Barney, 1952).
(2) Parameter patterns are influenced by phonetic context. This results from constraints in the articulatory mechanism. For example, vowels in context tend to be somewhat more centralized than the "same" vowels uttered in isolation (Lindblom, 1963).
(3) Parameter patterns are in continuous dynamic movement, and transitions, as well as "steady-states", contain important information (Liberman, 1957).
It seems doubtful that these difficulties can be avoided by selecting some better parameter set, since the complexity is inherent in the phonetic structure of speech and cannot be greatly simplified without a critical loss of information. This calls for the investigation of acoustic parameters as they actually exist in speech, and for the formulation of descriptive models of their behavior.

Substantial work in this area has been done: notably the study of vowel reduction by Lindblom (1963); the study of formant trajectories in symmetric consonant-vowel-consonant (CVC) utterances by Stevens, House & Paul (1966); the study of coarticulation in VCV utterances by Öhman (1966); and the study of VCC utterances by Menon, Jensen & Dew (1969). This work has been extended by Broad & Fertig (1970) in a study of C/ɪ/C utterances. The influences from the initial and final consonants on the vowel were separated and, in addition, statistical variations in the formant frequencies for repetitions of the same utterance were studied. These are important for characterizing the variability of formant patterns and for forming a statistical basis for evaluating similarities and differences between different formant patterns.

Acoustic phonetic studies in the past have been limited by the available analysis techniques. Most of the above studies were based on sound spectrogram measurements. Stevens, House & Paul used an automatic analysis-by-synthesis formant measurement in their study, but even there the number of data considered was still relatively small. The more recent formant analysis techniques noted above promise considerable improvement in acoustic phonetics research in terms both of the amount and the accuracy of the formant data that can be obtained. Thus improved parameter analysis is not only a part of the overall ASR solution; it is also a research tool for ASR problems at the acoustic phonetic level.
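The statistical characterization just described suggests treating each vowel not as a point but as a distribution in formant space. As a minimal sketch, using rough (F1, F2) mean values of the kind reported by Peterson & Barney (1952) for adult male speakers and an assumed common spread, a vowel token could be scored against overlapping Gaussian regions:

import numpy as np

# Illustrative (F1, F2) means in Hz, of the kind reported by
# Peterson & Barney (1952) for adult male speakers; the shared
# diagonal covariance below is an assumption for this sketch.
VOWEL_MEANS = {"i": (270.0, 2290.0), "a": (730.0, 1090.0), "u": (300.0, 870.0)}
SIGMA = np.array([50.0, 150.0])   # assumed std devs for F1, F2

def classify_vowel(f1, f2):
    """Return the vowel whose Gaussian region gives the token the
    smallest squared Mahalanobis distance, i.e. overlapping
    regions rather than hard boundaries."""
    token = np.array([f1, f2])
    def dist(mean):
        return float(np.sum(((token - np.array(mean)) / SIGMA) ** 2))
    return min(VOWEL_MEANS, key=lambda v: dist(VOWEL_MEANS[v]))

print(classify_vowel(300.0, 2100.0))   # -> 'i'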
Phonetic Sequences of Speech

Once a string of phonetic types has been derived from the speech signal, the final stage of ASR is to decode the string as a string of lexical terms, i.e. words. For example, the phonetic string [wiwərəweɪ] might be decoded as we were away. At first glance this might appear to be a trivial conversion using a table look-up to match lexical items to phonological strings. Indeed, if the set of items to be recognized is limited and the items are spoken carefully and in isolation, this is the case. However, when the recognition is extended to the continuous strings of large-vocabulary natural language, the situation is vastly altered.

First, the phonetic information by itself usually does not suffice to identify word boundaries and, as utterances become even only moderately long, the number of potential word segmentations becomes astronomically large. For example, a typical English utterance only 25 phones long can be segmented in literally hundreds of different ways, each of which totally accounts for the phonological sequence as a sequence of English words. All but a few such sequences, of course, are ungrammatical nonsense that would not occur to a human listener, but unless the ASR system has more information it will have no way to choose the right sequence out of the hundreds of apparently equally valid wrong ones. Second, individual words may be pronounced, quite legitimately, in a variety of ways, depending on the context in which they occur, on the speed of talking, and on individual variation due to regional dialects or to idiolectal idiosyncrasies.

These complications are properties of human speech communication which must be faced if effective general-purpose ASR is to be realized. Our present ability to deal with them is further limited by the fact that almost no studies have been done to discover the phonetic sequences that are actually used in speech, and substantial progress in resolving ASR problems at this level will depend on extensive empirical studies of the actual phonetic sequences that human speakers use in their utterances. Descriptions of speech based on conventional dictionary pronunciations or standard phonologies of English, and this includes most of the studies in existence, are surprisingly inadequate for this. Although a great deal of information is needed, there is every reason to expect that it can be obtained through systematic study. The problem has apparently often been avoided because of its size and complexity. This has sometimes led linguists to view real speech as "ideal" speech that has been "degraded" in some indescribable manner. People do talk and listen to one another, however, and unless we are to hypothesize some sort of metaphysical mechanism in this process, we must infer that this communication does make use of definable code structures. These predictable and describable events can be obtained from sufficient study of the phonetic sequences that speakers use.
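The scale of the segmentation ambiguity is easy to demonstrate. The sketch below exhaustively enumerates every way a phone string can be parsed as a concatenation of entries from a pronouncing dictionary; the toy dictionary and the flat phone spellings in it are assumptions for illustration only, and on a realistic dictionary the number of parses grows rapidly with utterance length.

# A sketch of exhaustive word segmentation of a phone string,
# assuming a toy pronouncing dictionary keyed by flat phone
# spellings.  Real dictionaries and phone strings would, of
# course, be far larger and messier.
TOY_DICT = {
    "wi": ["we"], "wər": ["were"], "ə": ["a"],
    "weɪ": ["way", "weigh"], "əweɪ": ["away"],
}

def segmentations(phones):
    """Yield every parse of the phone string as a list of words."""
    if not phones:
        yield []
        return
    for cut in range(1, len(phones) + 1):
        head = phones[:cut]
        for word in TOY_DICT.get(head, []):
            for rest in segmentations(phones[cut:]):
                yield [word] + rest

for parse in segmentations("wiwərəweɪ"):
    print(parse)
# -> ['we', 'were', 'a', 'way'], ['we', 'were', 'a', 'weigh'],
#    ['we', 'were', 'away']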
Three major difficulties in performing the conversion from acoustic phonetic transcriptions to lexical items are immediately apparent to anyone who has done work in natural language problems as they relate to ASR. First, no automatic system will give 100% correct identification of the phones, any more than any human being will consistently "hear" 100% of all sounds spoken. This deficiency should not deter the attempt to build general-purpose speech recognizers. A speaker may utter entire words, as well as portions of words, which never reach the listener's ear, but the lost segments usually will not significantly affect the reception of the message by the listener. At this time it is difficult to know whether semantic rules of any nature can ever be written to allow a machine to fill in missing phones as done by a human listener. However, even if no semantic modifications were possible, an imperfect general-purpose recognizer could still be very valuable with a final edit by the user.

Second, as mentioned previously, the continuous string of phonetic elements does not lend itself to a simple segmentation into words. Initial studies using a simulated phonetic representation (not automatically derived) of properly enunciated words indicate the complexity of this problem. To illustrate, a relatively simple sequence of words, There is a justified pride in..., yielded many possible segmentations, most of which would never occur to a human listener, for example, Theirs a just if I'd pryed inn. Two significant tasks necessary to resolve the segmentation are, first, the selection of the "correct" homonym (there-their, pride-pryed, in-inn) and, second, the decision of how many cuts are proper, and where to make them (there is-there's, justified-just if I'd, pride in-pry din). Present grammatical theories of competence do not provide the linguistic information necessary to solve these segmentation problems. Fortunately, there is a growing desire among researchers in natural speech to develop a linguistic theory of performance. Such a theory should be directly relevant to ASR since it would model what people do say, rather than what people should say, or would say if they had unimpeded access to their linguistic "competence". Many studies of natural language will be required before any adequate theory of performance can be written.
The third major difficulty, already mentioned above, is that the phonetic representation does not conform to any "dictionary" pronunciation of the string of words. Consequently, the utilization of dictionary pronunciation patterns in ASR has not proved feasible, and it will not become useful until either the dictionary listings of pronunciation are more inclusive of what is said or the dictionary listings can be subjected to algorithmic phonetic rules that permit the phonetic sequences of natural speech to be converted to the proper lexical items. For example, a commonly used word, just, is normally indicated in a North American dictionary to have only the pronunciation /dʒʌst/, but in conversation or natural speech in North America nearly all occurrences of just are realized phonetically as [dʒɪst] (jist). A simple editing of the dictionary to allow for both these phonetic strings permits an ASR system to recognize the word in its more common form. Other types of deviation from the "dictionary" pronunciation can best be handled by phonetic rules.
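Concretely, the simple editing suggested above amounts to storing several phonetic keys per lexical entry. A minimal sketch, with an assumed two-entry dictionary and illustrative phone spellings:

# A sketch of a pronouncing dictionary admitting variant phonetic
# forms per word, as suggested above for "just"; the entries and
# phone spellings are illustrative assumptions.
PRONUNCIATIONS = {
    "just": ["dʒʌst", "dʒɪst"],   # dictionary form and the common [dʒɪst]
    "in":   ["ɪn"],
}

# Invert to a lookup table from phonetic form to word.
LOOKUP = {form: word
          for word, forms in PRONUNCIATIONS.items()
          for form in forms}

print(LOOKUP.get("dʒɪst"))   # -> 'just', despite the non-dictionary form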
Conclusion
Basic research problems in the areas of speech production, articulatory phonetics, acoustic analysis, acoustic phonetics, and the phonetic sequences of natural speech are all crucial for the realization of the automatic recognition of continuous large-vocabulary natural speech. Articulatory phonetics provides a connection between the physical events of speech and the phonological units of language. An understanding of the mechanisms of speech production, in turn, is necessary for a completely rigorous definition of the phonetic unit. More research is needed to bring the definitions of phonetic concepts to a level of explicitness that is adequate for the algorithmic specification of the phonetic unit. The acoustic theory of speech production permits a description of the acoustic structure of speech that can be related to the articulatory structure of speech. The recognition of phonetic types is an important problem in ASR. The automatic tracking of formant frequencies is perhaps the paramount problem in the acoustic analysis of speech. Recent advances in formant tracking are significant for the design of actual ASR systems as well as for the basic phonetic research related to ASR. A reasonably complete system for acoustic phonetics will require extended study of the acoustic parameters of speech. Finally, even if good recognition of phonetic units is realized, we need much more information about the phonetic sequences of natural language before really powerful automatic speech recognition can be achieved. This area has often been neglected as a worthy object for extended empirical observation.

There is every reason to believe that ASR can be realized in forms far more general than have been realized to date. This will require research on the various basic problems outlined above. While the basic research may not lead to an immediate ASR design, it should make such a design possible.
The Directorate of Mathematical and Information Sciences of the United States Air Force Office of Scientific Research (AFSC) supported this research under contract F44620-69-C-0078.
References
ATAL, B. S. (1970). Determination of the vocal tract shape directly from the speech wave. J. acoust. Soc. Am., 47, 65.
BROAD, D. J. (to be published). Formants in automatic speech recognition.
BROAD, D. J. & FERTIG, R. H. (1970). Formant frequency trajectories in selected CVC syllable nuclei. J. acoust. Soc. Am., 47, 1572.
DENES, P. B. & VON KELLER, T. G. (1968). Articulatory segmentation for automatic recognition of speech. In Reports of the 6th International Congress on Acoustics, Vol. II, pp. B-143-B-146. Ed. Y. Kohasi. New York: Elsevier.
FANT, G. (1960). Acoustic Theory of Speech Production. The Hague: Mouton.
HILL, D. R. (1969). An ESOTerIC approach to some problems in automatic speech recognition. Int. J. Man-Machine Studies, 1, 101.
HILL, D. R. (1971). Man-machine interaction using speech. Advances in Computers, Vol. II, pp. 165-230. Eds. F. L. Alt, M. Rubinoff (M. Yovitts, guest ed.). New York: Academic Press.
HYDE, S. R. (1968). Automatic speech recognition, literature survey and discussion. Research Department Report No. 45. Eastcote, England: Joint Speech Research Unit.
INTERNATIONAL PHONETIC ASSOCIATION (1949). The Principles of the International Phonetic Association. London: University College.
ITAKURA, F. & SAITO, S. (1968). An analysis-synthesis telephony based on maximum likelihood method. Reports of the 6th International Congress on Acoustics, Vol. II, pp. C-17-C-20. Ed. Y. Kohasi. New York: Elsevier.
KLEIN, W., PLOMP, R. & POLS, L. C. W. (1970). Vowel spectra, vowel spaces and vowel identification. J. acoust. Soc. Am., 48, 999.
KOENIG, W., DUNN, H. K. & LACY, L. Y. (1946). The sound spectrograph. J. acoust. Soc. Am., 18, 19.
LEA, W. A. (1970). Towards versatile speech communication with computers. Int. J. Man-Machine Studies, 2, 107.
LIBERMAN, A. M. (1957). Some results of research on speech perception. J. acoust. Soc. Am., 29, 117.
LINDBLOM, B. (1963). Spectrographic study of vowel reduction. J. acoust. Soc. Am., 35, 1773.
LINDGREN, N. (1965a). Machine recognition of human language. Part I: Automatic speech recognition. IEEE Spectrum, 2, 114.
LINDGREN, N. (1965b). Machine recognition of human language. Part II: Theoretical models of speech perception and language. IEEE Spectrum, 2, 45.
MARKEL, J. D. (1970). On the interrelationships between a wave function representation and a formant model of speech. SCRL Monograph No. 5. Santa Barbara: Speech Communications Research Laboratory.
MARKEL, J. D. (1971a). Prony method and its application to speech analysis. J. acoust. Soc. Am., 49, 105.
MARKEL, J. D. (1971b). A linear least-squares inverse filter formulation for formant trajectory estimation. SCRL Monograph No. 7. Santa Barbara: Speech Communications Research Laboratory.
MENON, K. M. N., JENSEN, P. J. & DEW, D. (1969). Acoustic properties of certain VCC utterances. J. acoust. Soc. Am., 46, 449.
ÖHMAN, S. E. G. (1966). Coarticulation in VCV utterances: spectrographic measurements. J. acoust. Soc. Am., 39, 151.
PETERSON, G. E. & BARNEY, H. L. (1952). Control methods used in a study of the vowels. J. acoust. Soc. Am., 24, 175.
PETERSON, G. E. & SHOUP, J. E. (1966a). A physiological theory of phonetics. J. Speech Hear. Res., 9, 5.
PETERSON, G. E. & SHOUP, J. E. (1966b). The elements of an acoustic phonetic theory. J. Speech Hear. Res., 9, 68.
PIKE, K. L. (1943). Phonetics. Ann Arbor: Univ. Michigan.
SCHROEDER, M. R. (1967). Determination of the geometry of the human vocal tract by acoustic measurements. J. acoust. Soc. Am., 41, 1002.
SCRIPTURE, E. W. (1906). Researches in Experimental Phonetics. The Study of Speech Curves. Washington: The Carnegie Institution of Washington.
STEVENS, K. N., HOUSE, A. S. & PAUL, A. (1966). Acoustical description of syllable nuclei: an interpretation in terms of a dynamic model of articulation. J. acoust. Soc. Am., 40, 123.
STEVENS, K. N. & KLATT, M. M. (1968). Study of Acoustic Properties of Speech Sounds. Cambridge, Massachusetts: Bolt, Beranek & Newman, Inc., Report No. 1669 (AD 676 979).
WHITNEY, W. D. (1862). The Atharva-Veda Prātiçākhya, or Çaunakīyā Caturādhyāyikā: Text, Translation, and Notes. New Haven: The American Oriental Society.