Journal of Phonetics (1989) 17, 371-373
Review

Speech Recognition by Machine*
By W. A. Ainsworth
Peter Peregrinus Ltd, on behalf of the Institution of Electrical Engineers, London, 1988

Wiktor Jassem
Department of Acoustic Phonetics, Institute of Fundamental Technological Research, Polish Academy of Sciences, Poznan, Poland
There are three main kinds of reader with an interest in this book: (a) phoneticians who have a limited knowledge of the engineering and mathematical aspects of automatic speech recognition (ASR) but who intend or are called upon to co-operate in this area with computer, electronics and artificial intelligence specialists, (b) advanced (perhaps postgraduate) students of computer science and telecommunications who are becoming engaged in ASR projects, and (c) specialists with an engineering background currently engaged in ASR who would like to have a broad overview of the relevant areas, with representative referencing. The author has probably assumed this somewhat differentiated readership and, on the whole, he has done a good job of meeting the requirements of all sections of it, at least by some large portions of the book, if not indeed by most or all of it. This is no easy proposition considering the differing qualifications of linguistic phoneticians and, say, computer specialists, and some of the readers should be prepared to have some difficulty with the mathematical formulations, whilst others may find, e.g., the details of the auditory apparatus less than absolutely necessary.

The Introduction (Chapter 1), whose most essential part deals with the advantages of man-machine communication by voice, is followed by a discussion of speech production and perception (Chapter 2). The relation of the former to ASR is not immediately obvious except for the theoretical basis of linear prediction coding. It would appear, on the other hand, that problems of speech perception should be crucial in this context. This book contains only some basic information in this particular area, which is all to the good.
* ISBN 0 86341 115 0. 0095-4470/89/040371 + 03 $03.00/0 © 1989 Academic Press Limited

The last two decades have seen a proliferation of research papers in the field of speech perception in scientific journals devoted to psychology, psychophysics, speech communication, hearing and phonetics, and at least one very serious report series (Reports on Speech Perception, Indiana University) is entirely concerned with this domain of study. But even a superficial familiarity with the recent writings on this subject, or a perusal of such overviews as Lobacz (1985) or Schwab & Nusbaum (1986), shows that amongst the large number of different theories and models of speech perception some are diametrically opposed to each other, although most of them are based on extensive experimentation. Under these circumstances it is not surprising that work at those
stages of ASR that correspond to perception tends to go its own way, though occasionally paying lip service to this or the other of the rivalling theories and models of human processing of the speech signal.

Chapter 3 of Ainsworth's book is a balanced and objective presentation of the most general problems and the history of ASR. Chapters 4 (Techniques of Signal Processing), 5 (Speech Recognition Algorithms) and 6 (Architectures) tend to show that the techniques of front-end signal analysis, e.g., autocorrelation analysis, cepstral processing or linear prediction, are very largely independent of the knowledge of the physiology and neurology of hearing. Indeed, it may well be asked why, in emulating the auditory stage, ASR techniques should simulate human physiology or human functions, seeing that man must be able to receive not only speech but a large variety of other sounds, with very different properties. Some parallels can, however, be observed at the stage of pattern processing (Chapter 4, Section 2). There is at least one evaluative method, viz. multidimensional scaling, that can successfully be applied to discrimination of both percepts and physical phenomena (Section 4.2.6).

Chapters 4 and 5 are so constructed that the individual sections correspond exactly: 4.1 Signal Processing - 5.1 Speech Analysis, 4.2 Pattern Processing - 5.2 Speech Pattern Matching, 4.3 Knowledge Processing - 5.3 Speech Understanding Systems. It is at the "knowledge-understanding" stage of ASR that machine models seem closest to human models. Although psychological, neurolinguistic and psycholinguistic evidence relating to speech understanding is not entirely consistent or free from controversy, it presents a reasonably coherent image of what the different knowledge sources are and how they are used. Sections 4.3, 5.3 and 6.3 (Knowledge Based Systems) indicate that the human-machine parallels are probably closest at this highest stage of speech recognition.
There are, however, as shown in this book, some basic differences here also: speech understanding (SU) systems do not work in real time and cover only a very restricted semantic and pragmatic field. The first problem is being gradually overcome with advancing computer technology. As to the other, it should be realized that, as far as can be predicted, ASR (or SU) systems will be used for specific purposes. In other words, a machine will, at least for a very long time to come, be required to "understand" only a fraction of what a human is normally able to comprehend when spoken to.

Chapter 7 of Ainsworth's book (Performance Assessment) presents a balanced statement of what existent ASR (SU) systems are capable of doing, whilst the problem of how they are being used and how future ones are likely to work is addressed in Chapter 8 (Applications). The closing, brief Chapter 9 (The Future) is mainly devoted to the likely further lines of research.

The book is not free from some minor weaknesses. The spectrograms on pp. 13-19 do not do justice to the sonagraph, which makes it possible to present much more detail. In some of them, nothing is visible above 2 kHz. Table 3.1 "Syllable Initial and Syllable Final Consonant Clusters Occurring in British English" in fact includes single consonants as well as clusters, and /tʃj/ and /ʒ/ are mistakenly given as some of the initial possibilities. That the oscillographic traces given at various places are stylized rather than original is not by itself a fault but, regrettably, in some of them (e.g., in Fig. 4.4) the trace goes back along the time axis over some short intervals. One important addition might be made in Chapter 8, viz. controlling microscopes in microsurgery. The third paragraph on p. 166 is incomprehensible, probably due to some proof-reading oversight. There is a
fair sprinkling of hyphenation errors and misprints, e.g., imm-ediate (p. 2), involv-ed (p. 9), permissable (p. 84), (finite) stage (legend to Fig. 4.18; read: state), cons-ist (p. 92), utt-erances (p. 97), in increased (p. 109; read: is-), as-ymmetric (p. 115), wole-word (p. 121; read: whole-). These minor flaws may easily be corrected in a second edition, which will probably follow soon because the topic of the book is very much "in the news" and is treated with the specialist's high competence and the writer's proficiency.

References

Lobacz, P. (1985) Processing and decoding the signal in speech perception. Hamburger Phonetische Beiträge, Band 44. Hamburg: H. Buske.
Schwab, E. C. & Nusbaum, N.C. (editors) (1986) Pattern recognition by humans and machines. Orlando: Academic Press.