Journal of Communication Disorders 8 (1975), 181-188

From acoustic signal to phonetic message*

MICHAEL STUDDERT-KENNEDY
Queens College and Graduate Center, City University of New York, and Haskins Laboratories, New Haven
Among the distinctive properties of language as a medium of communication is its "duality of patterning" (Hockett, 1958), its construction of a pattern of meaning from patterned combinations of meaningless elements. The meaningless elements are the phonemes of linguistic theory and the distinctive features that compose them. We have been warned by Helen Cairns that these elements are abstract, and that we should not expect to find them physically present in the articulatory gesture or in the acoustic signal: linguistic form is primary, its physical realization secondary. Nonetheless, we may well wonder about the origins of linguistic form. How did features arise? How do they facilitate linguistic communication? What is their biological function? We shall not find the answers to such questions in linguistic analysis. And I should emphasize from the outset that (at least, until we consider perception) I am not concerned with abstract features or with any particular feature system. My concern is rather with physical features of articulation and of the acoustic signal. Certainly, some correspondence between the abstract and the physical must ultimately be established if linguistics is to claim its place as a branch of biology rather than of mathematics (cf. Chomsky and Halle, 1968, pp. 297-299). However, it is already evident that the relation will prove extraordinarily complex. And if we are to gain insight into the function of features in general, it is to the structure and function of the organism that we must turn.

The general function of meaningless commutable elements, finite in number, but essentially infinite in combinatorial possibilities, is clear enough: they provide flexibility, range and ease of acquisition. Without them, we would be reduced to a cumbersome language of acoustic ideographs, each meaningful symbol holistically distinct from every other. The invention of new symbols would then be an act of creation, and the child's learning of its language a labored accumulation rather than the discovery of system. The system, of course, lies in the features. For, again, if each phoneme were an unanalyzed entity, having nothing in common with any other phoneme, the sounds of a language would have no pattern, and the child's task would again be one of rote accumulation. Thus, one function of features in sound and articulation is analogous to the function of phonemes in language: to provide pattern or system.

* Revised version of a paper read before the Fifth Conference in the Mount Sinai Series in Communication Disorders on Articulation and Related Issues, March 22, 1974. © American Elsevier Publishing Company, Inc., 1975

Nonetheless, pattern can hardly be essential to an inventory of acoustic elements, since we are surely capable of differentiating among a few dozen holistically distinct sounds. In fact, it is not until we turn to speech production that we begin to see the possible origin of features.

The suprapulmonary apparatus is surprisingly limited. It has five main movable structures: larynx, tongue, velum, lips and jaw, and the last may be regarded as merely ancillary to tongue and lips, since we can speak intelligibly with our teeth clenched. Each of these articulators has a limited number of discriminably different states (not all of which are used in speech). I will not attempt to list the states, but simply refer the reader to the important work of Fant (1968), Lindblom (1972) and their colleagues in Stockholm. However, we may note that no more than 2 to 4 states each for larynx, velum and lips and perhaps 10 for the tongue are used contrastively in speech, and that it is from these 15 or so values, with the help of a few contrasts in laryngeal tension, pharyngeal area, subglottal action and duration, that the sounds of language are generated. These states are the physical features that define the articulatory correlates of speech sounds.
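To make the arithmetic of that inventory concrete, the sketch below (not part of the original paper) simply multiplies out one plausible assignment of contrastive states to the four principal articulators; the particular state labels and counts are illustrative choices within the ranges just quoted, not measurements.

    # Illustrative only: one plausible assignment of contrastive states per
    # articulator, chosen within the ranges quoted in the text (2-4 each for
    # larynx, velum and lips; about 10 for the tongue).
    from itertools import product

    states = {
        "larynx": ["voiced", "voiceless"],
        "velum": ["raised (oral)", "lowered (nasal)"],
        "lips": ["spread", "rounded", "closed"],
        "tongue": [f"tongue state {i}" for i in range(1, 11)],
    }

    # Roughly 15 component values in all with these choices (2 + 2 + 3 + 10 = 17) ...
    n_values = sum(len(v) for v in states.values())

    # ... but their combinations define a far larger space of vocal-tract
    # configurations (2 * 2 * 3 * 10 = 120), of which a language uses only a
    # contrastive subset.
    n_configurations = len(list(product(*states.values())))

    print(n_values, n_configurations)   # 17 120

The point is only the shape of the arithmetic: a dozen or so cheap component states, recombined, yield far more distinct vocal-tract configurations than any language needs.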

There are several consequences of these facts for the concatenation of elements in running speech. First, although not every articulator need be engaged in the production of every contrast, it is only by the coordinated action of several distinct components that we can speak at all. Second, a change in state of a single articulator may provide a distinctive contrast between neighboring configurations of the vocal tract. Third, by corollary, any pair of configurations may have more articulatory states in common than in contrast.

Let me illustrate all this with a spectrogram that provides a fairly direct display of the acoustic correlates of vocal tract configurations. Figure 1 is a spectrogram of the utterance, "She began to read her book." We will not examine the utterance segment by segment (for extensive discussion of the spectrographic correlates of speech sounds, see Stevens and House, 1972), but let me pick out several striking aspects. Perhaps most obvious are the vertical striations and heavy concentrations of energy (formants) associated with vowels: these reflect laryngeal vibration, a relatively unconstricted vocal tract and open lips. Equally obvious is the disappearance or attenuation of these patterns during the occlusions of /b/, /g/, /d/ and during the moderate constriction of /r/. These contrasts are introduced, in each instance, by a change in state of a single articulator (lips or tongue), which, for the stops, also checks the flow of air and automatically halts laryngeal vibration. Direct laryngeal action (glottal opening), combined with heightened tongue constriction, is reflected in the frication of /ʃ/ and of initial /t/ by the absence of the first formant, by the attenuation of higher formants and by a band of noise. Notice, too, that for several successive segments (as in "began" and "to read"), the lips may remain open and the larynx may continue to vibrate, while the tongue introduces formant movement as it passes smoothly from one state to another. Notice, finally, that rounding of the lips for the vowel of "to" before the tongue has released the /t/ emphasizes the downward shift of second and third formant frequencies before the onset of laryngeal vibration.

What we see in the spectrogram, then, are some of the sound feature contrasts generated by the coordinated changes in state of a few articulatory components. Each may change state more or less independently of the others to yield a new configuration and a discriminably different sound pattern. Each configuration may be succinctly described by specifying the states of the components, and, since many configurations necessarily share articulatory states, a given component may maintain a single state over several segments of an utterance. Finally, since not every component is required to specify every contrastive configuration, an unengaged articulator is free to move into a nonantagonistic, anticipatory state before other articulators have completed their current action. By such maneuvers the gestural load is distributed over the articulators to get "high-speed performance with low-speed machinery" (Liberman et al., 1967, p. 446).

What I am suggesting, then, is that the origin of feature oppositions is not to be found in the abstract linguistic system where they are now firmly lodged, but in the peculiar structure of our sound-producing apparatus. Feature contrasts may thus be seen, not as an advantage, but as a necessary solution to the mechanical problem of blending phonetic elements into a fluent stream. The solution is purchased at the cost of a complex relation between the acoustic signal and the phonetic message it is intended to convey. For what began in the mind of the speaker as a sequence of phonemes emerges from his mouth as a more or less continuous signal in which much of the original segmentation has been lost.

Consider, in Fig. 1, the spectral pattern associated with "began." If we were to excise from a tape recording the segment along the time line between roughly 0.40 and 0.55 sec and were to play it back, we would hear the utterance /bɛg/ (or perhaps /bɪg/), with an unreleased final /g/. How is this utterance articulated? The lips are first closed, then opened. During their closure the tongue begins to assume its position for the following vowel. As the lips open, or very shortly thereafter, laryngeal vibration begins and continues while the tongue completes its gesture for /ɛ/ and moves on, uninterrupted, into closure for /g/. Notice that the load is distributed over lips, tongue and larynx. We can detect in the spectrogram the vertical striations of laryngeal vibration, the rising formants of labial articulation as the mouth opens, movement of all formants through a pattern associated with a central vowel, and the rising second and third formant movements associated with velar occlusion following /ɛ/. However, the coordinated actions of the articulators accomplish what Stetson (1952, p. 4) called a "single ballistic movement," the movement of the articulated syllable. The resulting acoustic pattern is itself a single movement in which the three phonetic segments have been lost. Figure 2 was devised by my colleague, Alvin Liberman, to illustrate this point.

[Fig. 1. Spectrogram of the utterance "She began to read her book."]


Fig. 2. A stylized spectrogram of the utterance /bæg/, illustrating the spread of phonetic information over an acoustic syllable. (Reprinted from Liberman, A.M., The grammars of speech and language. Cog. Psychol., 1970, 1, 301-323, with permission of the author and publisher.)
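To give a rough sense of what such a schematic looks like, the sketch below draws two smoothly shifting formant trajectories for a /bæg/-like syllable. It is my own illustration, not a reproduction of Liberman's figure; the frequency values and timing are nominal, textbook-style numbers, not measurements from the paper.

    # Illustrative only: nominal first- and second-formant trajectories for a
    # /baeg/-like syllable. Loci and vowel targets are rough, assumed values.
    import numpy as np
    import matplotlib.pyplot as plt

    t = np.linspace(0.0, 0.25, 200)        # a 250 ms syllable
    anchors = [0.0, 0.08, 0.17, 0.25]      # /b/ release, vowel region, /g/ closure

    f1 = np.interp(t, anchors, [250, 700, 700, 250])     # low - open vowel - low
    f2 = np.interp(t, anchors, [900, 1500, 1900, 2400])  # rises throughout: /b/ onset,
                                                          # vowel coloring, velar offset

    plt.plot(t, f1, linewidth=6, label="F1")
    plt.plot(t, f2, linewidth=6, label="F2")
    plt.xlabel("time (s)")
    plt.ylabel("frequency (Hz)")
    plt.legend()
    plt.show()

Even in this crude version, no stretch of the second formant belongs to one segment alone: the consonantal transitions and the vowel coloring overlap throughout the syllable.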

Figure 2 presents the first and second formants, in a stylized spectrogram, of the utterance /bæg/, and is intended to show how the discrete phonetic segments are merged in the acoustic signal. The effects of the vowel (grey stippling) are spread throughout the syllable: the entire formant pattern would be different if the vowel were different (as may be judged from the spectrographic display of another labial stop-vowel-velar stop syllable, "book," in Fig. 1). Similarly, the rising second formant for /b/ (diagonal hatching), which continues well into the center of the utterance, would be quite different if the utterance were /gæ/ instead of /bæ/ (as may be judged from the spectrographic display for "-gan" in Fig. 1). Finally, the rising second formant for final /g/ (black dots), a mirror image of the falling formant for initial /g/ in "-gan" (Fig. 1), begins so early in this unstressed syllable that there is no steady state portion associated with the vowel nucleus at all. In short, the acoustic pattern has lost all trace of the distinct gestures that went into its making.

Just how the perceptual system recovers the features and phonemes that have been woven into the signal we do not know. But there are grounds for believing that it requires specialized neural machinery to do so, and that this machinery is located in the language hemisphere of the brain. As many of you know, about a dozen years ago Doreen Kimura (1961a,b) showed that, if pairs of different spoken digits were presented dichotically (one digit to the left ear, another to the right, at the same time), normal right-handed listeners were more likely to recall correctly the digit presented to the right ear than the digit presented to the left.
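For readers unfamiliar with the procedure, a dichotic pair is simply a two-channel recording in which two different items begin at the same instant, one in each ear. The sketch below is my own illustration of how such a pair might be assembled; the file names are hypothetical placeholders, and the recordings are assumed to be mono and to share a sample rate.

    # Assemble a dichotic pair: one item to the left ear, a different item to
    # the right, starting simultaneously. Assumes two mono recordings at the
    # same sample rate; file names are hypothetical.
    import numpy as np
    from scipy.io import wavfile

    rate_left, left = wavfile.read("digit_three.wav")
    rate_right, right = wavfile.read("digit_five.wav")
    assert rate_left == rate_right, "items must share one sample rate"

    # Trim to a common length so the two items are strictly simultaneous.
    n = min(len(left), len(right))
    stereo = np.column_stack([left[:n], right[:n]])  # column 0 -> left ear, column 1 -> right ear

    wavfile.write("dichotic_pair.wav", rate_left, stereo)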


Kimura attributed the effect to prepotency of the contralateral pathways from the right ear to the left (language) hemisphere. Since Kimura had used digits, it was not known whether the effect rested on the semantic or on the phonetic properties of the stimuli. My colleague, Donald Shankweiler, and I therefore repeated Kimura's test with nonsense syllables in which the dichotic contrasts lay entirely in a single phonetic segment (e.g., /ba/ vs. /da/) (Shankweiler and Studdert-Kennedy, 1967). We obtained the right ear advantage. Furthermore, we found that the degree of the advantage varied as a function of the phonetic feature contrasts between members of a pair. In a later experiment (Studdert-Kennedy and Shankweiler, 1970), we found that the features of place and voicing in stop consonants were independently lateralized to the language hemisphere. We concluded that, while the general auditory system of both hemispheres was probably equipped to extract the acoustic parameters of a speech signal, the task of transforming these parameters into phonetic features was a function of the language hemisphere. We were thus making a distinction between auditory perception and phonetic perception that has become increasingly important in current research (see Studdert-Kennedy, in press, for a review), and that is warranted by the gross discrepancies between acoustic signal and phonetic percept that I have been discussing.

Striking evidence to support this distinction has recently been reported by Wood (1975). He synthesized two CV syllables, /ba/ and /ga/, each at two fundamental frequencies, 104 Hz (low) and 140 Hz (high). From these syllables he constructed two types of random test order: in one, items differed only in pitch (e.g., /ba/ [low] vs. /ba/ [high]); in the other, they differed only in phonetic class (e.g., /ba/ [low] vs. /ga/ [low]). Subjects were asked to identify either the pitch or the phonetic class of the test items with reaction-time buttons. While they did so, evoked potentials were recorded from a temporal and a central location over each hemisphere. Records from each location were averaged and compared for the two types of test. Notice that both tests contained an identical item (e.g., /ba/ [low]), identified on the same button by the same finger. Since cross-test comparisons were made only between EEG records for identical items, the only possible source of differences in the records was in the task being performed, auditory (pitch) or phonetic. Results showed highly significant differences between records for the two tasks at both left-hemisphere locations, but at neither of the right-hemisphere locations. A control experiment, in which the "phonetic" task was to identify isolated initial formant transitions (50 msec), revealed no significant differences at either location over either hemisphere. Since these transitions carry all acoustic information by which the full syllables are phonetically distinguished, and yet are not recognizable as speech, we may conclude that the original left-hemisphere differences arose during phonetic, rather than auditory, analysis. The entire set of experiments strongly suggests that different neural processes go on during phonetic, as opposed to auditory, perception in the left hemisphere, but not in the right hemisphere.
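The force of Wood's design lies in restricting cross-test comparisons to physically identical items. The sketch below is a schematic of that logic only, with simulated data and assumed numbers of trials and samples; it is not Wood's analysis.

    # Schematic of the cross-test comparison: average evoked responses per
    # (task, item) cell and compare tasks only on the item that is physically
    # identical in both tests, so any difference must reflect the task.
    # All data here are simulated for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    n_trials, n_samples = 60, 120          # assumed trials per cell, samples per epoch

    def simulated_epochs(task_shift):
        """Fake single-trial potentials: shared waveform + task-dependent shift + noise."""
        base = np.sin(np.linspace(0, 3 * np.pi, n_samples))
        return base + task_shift + rng.normal(0.0, 1.0, (n_trials, n_samples))

    # The same item, /ba/ at low pitch, occurs in both the pitch test and the
    # phonetic test; here we pretend the phonetic task adds a small component.
    epochs = {
        ("pitch", "ba_low"): simulated_epochs(0.0),
        ("phonetic", "ba_low"): simulated_epochs(0.3),
    }

    averages = {key: ep.mean(axis=0) for key, ep in epochs.items()}
    difference = averages[("phonetic", "ba_low")] - averages[("pitch", "ba_low")]

    # A reliable difference wave can only come from the task (auditory vs.
    # phonetic analysis), since stimulus and response finger were identical.
    print(float(np.abs(difference).mean()))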


If, then, phonetic perception and auditory perception are indeed distinct processes, even to the extent of engaging different neural mechanisms, how are we to characterize the phonetic percept? There is ample evidence that we should do so in terms of abstract features, and I have argued that these features may have their origin in the peculiar structure of our sound-producing apparatus rather than in any auditory contrast they may be presumed to carry. At the same time, we must be careful not to identify any feature with any particular articulatory gesture or even with any particular vocal tract shape. This is not only because we must not confuse abstract classificatory features with their physical realizations, but also because we have experimental evidence that, in implementing the set of features associated with a particular phonetic segment, speakers have at their disposal a range of functionally equivalent gestures and even of functionally equivalent vocal tract shapes. Lindblom and Sundberg (1971), for example, found that, if subjects were thwarted in their habitual articulatory gestures by the presence of a bite block between their front teeth, they were nonetheless able to approximate normal vowel quality, even within the first pitch period of the utterance. Bell-Berti (1975) has shown that the pattern of electromyographic potentials associated with pharyngeal enlargement during medial voiced stop consonant closure varies from individual to individual and from time to time within an individual. Finally, Ladefoged et al. (1972) have demonstrated that different speakers of the same dialect may use different patterns of tongue height and tongue root advancement to achieve phonetically identical vowels.

In short, perhaps the closest we can come, at present, to specifying a particular value of a phonetic feature is to characterize it as the control system for a range of functionally equivalent vocal gestures. We are, of course, trapped in a logical circle if we claim functional equivalence on the basis of our equivalent percepts. Nonetheless, the construct is abstract in the sense that the control system for any motor act that may be performed in a variety of ways is abstract, and yet it has the virtue of neurological plausibility. If, finally, it seems strange to characterize the perception of a human act as the recovery of the control system by which it was initiated, you may reflect on your doctor's handwriting.

Preparation of this paper was supported in part by a grant to Haskins Laboratories, New Haven, from the National Institute for Child Health and Human Development, Bethesda, Md. Several of the ideas discussed have come from conversations with my colleague, Alvin Liberman, and I thank him.

References

Bell-Berti, F. Control of pharyngeal cavity size for English voiced and voiceless stops. J. Acoust. Soc. Am., 1975, 57.
Chomsky, N., Halle, M. The sound pattern of English. New York: Harper and Row, 1968.

Fant, C.G.M. Analysis and synthesis of speech processes. In B. Malmberg (Ed.), Manual of phonetics. Amsterdam: North-Holland, 1968.
Hockett, C.F. A course in modern linguistics. New York: Harper and Row, 1958.
Kimura, D. Some effects of temporal lobe damage on auditory perception. Can. J. Psychol., 1961a, 15, 156-165.
Kimura, D. Cerebral dominance and the perception of verbal stimuli. Can. J. Psychol., 1961b, 15, 166-171.
Ladefoged, P., DeClerk, J., Lindau, M., Papcun, G. An auditory-motor theory of speech production. In Working papers in phonetics. University of California at Los Angeles, 1972, 22, 48-75.
Liberman, A.M., Cooper, F.S., Shankweiler, D.P., Studdert-Kennedy, M. Perception of the speech code. Psychol. Rev., 1967, 74, 431-461.
Lindblom, B.E.F. Phonetics and the description of language. In Proceedings of the 7th international congress of phonetic sciences. The Hague: Mouton, 1972, pp. 63-97.
Lindblom, B.E.F., Sundberg, J. Neurophysiological representation of speech sounds. Paper presented at the XVth World Congress of Logopedics and Phoniatrics, August 14-19, 1971, Buenos Aires, Argentina.
Shankweiler, D.P., Studdert-Kennedy, M. Identification of consonants and vowels presented to left and right ears. Quart. J. Exp. Psychol., 1967, 19, 59-63.
Stetson, R.H. Motor phonetics. Amsterdam: North-Holland, 1952.
Stevens, K.N., House, A.S. The perception of speech. In J. Tobias (Ed.), Foundations of modern auditory theory. New York: Academic Press, 1972, Vol. 2, pp. 3-62.
Studdert-Kennedy, M. Speech perception. In N.J. Lass (Ed.), Contemporary issues in experimental phonetics. Springfield, Ill.: Charles C. Thomas, in press.
Studdert-Kennedy, M., Shankweiler, D.P. Hemispheric specialization for speech perception. J. Acoust. Soc. Am., 1970, 48, 579-594.
Wood, C.C. Auditory and phonetic levels of processing in speech perception: Neurophysiological and information-processing analyses. J. Exp. Psychol.: Hum. Percept. Perform., 1975, 1, 104-133.