Journal of Phonetics (1974) 2, 91-108
The theory of local linearity
Olle Gunnilstam
Department of Linguistics, Uppsala University, Uppsala, Sweden
Received 2nd September 1973
Abstract:
A new theory accounting for the production of vowels as well as consonants is proposed. It postulates that the fundamental dimension of consonants is horizontal, that is, consonants are the product of varying place of articulation, whereas the main dimension of vowels lies perpendicular to that of consonants, i.e. vertically: vowels are the product of varying degree of constriction. How we learn and maintain these dimensions, which may be regarded as scales to "play" on, is discussed briefly.
Introduction
The current concept of speech production rests on the conclusions drawn from the experiments that Fant (1960) carried out with a model vocal tract. Stevens & House (1955) have presented similar experiments, but the following refers to Fant's description of the three-parameter model. An approximation of the vocal tract should consist of three sections, with an optional fourth lip section. Two models meet these requirements. The first consists of four cylindrical sections, as can be seen in Fig. 1; the tongue constriction is thus represented by a cylindrical tube. In the second type of three-parameter model this tube is replaced by a horn-shaped tongue section, the shape of which is shown in Fig. 2. The two types of model produce signals which differ very little acoustically.

Figure 1. Three-parameter model of the vocal tract, based on four cylindrical sections representing the lip opening, a front cavity, the tongue section, and a back cavity. The tongue section is 5 cm long, and its position x from the glottis is restricted to 2·5 cm ≤ x ≤ 12·5 cm. The variable parameters are (a) the place and (b) the degree of constriction, and (c) the ratio l1/A1 (length over cross-sectional area of the lip section). (From Fant.)

Moving the cylindrical tongue section through the tract causes very sudden changes in the signal; this is not the case with the horn-shaped tongue section model. Other acoustic differences are negligible. We will here briefly describe the properties of the horn-shaped tongue section model only, as it reflects reality somewhat better than the other one. This model is shown in Fig. 2.
Figure 2. Three-parameter model of the vocal tract, based on a horn-shaped tongue section, the stepwise area function of which is adjusted to the line analogue LEA. The parameters are the same as in Fig. 1, i.e. the place and degree of constriction and the amount of lip rounding. The factors kept constant are the total length of the tube, the cross-sectional area of the cavities surrounding the constriction, and the symmetric catenoidal shape of the constriction. (From Fant.)

Legend to Fig. 3: Amin (= A3) = 0·65 cm². Curve 1: A1 = 8 cm², l1 = 0 cm (no lip section); curve 2: A1 = 2 cm², l1 = 1 cm; curve 3: A1 = 0·16 cm², l1 = 1 cm.
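As a rough check on the reference case used throughout (the undistorted tube of Fig. 4), the resonances of a uniform tube closed at the glottis and open at the lips fall where the tube length is an odd multiple of a quarter wavelength, Fn = (2n − 1)c/4L. The sketch below assumes c = 35 000 cm/s and L = 17·5 cm; these values are not given in the text, but they reproduce the round figures of 500, 1500, 2500 Hz used in the caption of Fig. 4.

```python
# Resonances of a uniform (undistorted) tube, closed at the glottis and open at the lips.
# Assumed values (not from the text): speed of sound and glottis-to-lips length.
SPEED_OF_SOUND_CM_S = 35_000.0  # cm/s, assumed
TUBE_LENGTH_CM = 17.5           # cm, assumed


def uniform_tube_formants(n_formants, length_cm=TUBE_LENGTH_CM, c=SPEED_OF_SOUND_CM_S):
    """Return F1..Fn in Hz: the tube length is an odd multiple of a quarter wavelength."""
    return [(2 * n - 1) * c / (4 * length_cm) for n in range(1, n_formants + 1)]


if __name__ == "__main__":
    for n, f in enumerate(uniform_tube_formants(5), start=1):
        print(f"F{n} = {f:.0f} Hz")
```

Lengthening the assumed tube lowers every formant in the same proportion, which is why only the product c/L matters for this reference pattern.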
The effect of varying the place of constriction, ceteris paribus (Fig. 3)
A forward movement of the constriction shortens the front cavity and raises its resonance frequency, hence the upward slopes (seen from the right) in the nomogram. A rising resonance is gradually taken over by higher and higher formants. At the same time the back cavity becomes longer and its resonance frequency falls, hence the downward slopes. A given formant must therefore represent sometimes a rising and sometimes a falling resonance frequency, i.e. it must now and then change its cavity affiliation.

The effect of varying the degree of lip rounding, ceteris paribus (Fig. 3)
Increasing the degree of lip rounding gradually changes the front cavity from a quarter-wavelength resonator into a half-wavelength resonator. This lowers the frequency of the upward slopes (seen from the right) and, owing to the coupling to the back cavity, to some extent also the frequency of the downward slopes.

The effect of varying the degree of constriction, ceteris paribus (Fig. 4)
When a very narrow constriction passes through the tube, the most drastic formant variations take place. When a non-existent constriction "passes" through the tube, nothing, of course, happens: we have an undistorted tube whose resonances occur where the tube length equals an odd multiple of a quarter wavelength. A constriction located in the anterior part of the tube raises F2 and lowers F1; in the posterior part F2 is lowered and F1 raised. F2 deviates most appreciably from its value under undistorted conditions and has the greatest influence on the sound quality. Since these findings were presented by Fant in 1958 (1960), they have been interpreted as will be shown below.

Figure 3. Resonance frequencies as a function of the place of constriction at various degrees of lip rounding. Nomogram of the five lowest resonance frequencies of the model in Fig. 2. Ordinate: the frequencies of the first five formants (Hz). Abscissa: from right to left, the location of the centre of the tongue section in cm from the glottis. The constriction used in this nomogram has a cross-sectional area of 0·65 cm². A narrower constriction would imply decreased coupling, i.e. greater formant variations. Completely uncoupled resonances would cross each other; the vocal tract, however, is mostly a coupled system, in which resonance frequencies cannot coincide unless the constriction is made maximal. A wider constriction would yield only minor formant variations, see Fig. 4. Continuous lines indicate absent lip section, curves 2 and 3 increasing lip rounding. (Simplified from Fant.)

Figure 4. Resonance frequencies as a function of the place of constriction at various degrees of constriction. Nomogram of the three lowest resonance frequencies (F1, F2, F3) of the model in Fig. 2. Ordinate and abscissa as in Fig. 3. The lip section is missing, i.e. no lip rounding occurs (the effect of lip rounding is clear from Fig. 3). The cross-sectional area of the constriction varies from 0·32 cm² (curve 1) through 1·3 cm² (curve 2) and 5·0 cm² (curve 3) to 8·0 cm² (curve 4), i.e. from very narrow to complete absence of constriction. For the sake of simplicity, the resonance frequencies of the undistorted tube are indicated as exactly 500, 1500, 2500 Hz, etc. (Modified from Fant.)

The Maxima Theory
With these data at hand, we will discuss the relations between articulation and formant pattern. These relations are the basis of the concept of natural places of articulation, a widely accepted theory which will be briefly outlined here. The places between the glottis and the lips are not all equally well suited as places of articulation. A suitable place for a constriction must produce a sound that differs acoustically from other possible sounds and cannot be misinterpreted; this is an auditory condition. Further, it is reasonable to require that a certain amount of negligence in controlling the place of articulation should be hardly noticeable auditorily; this is an articulatory condition. We may call these the distinctiveness and stability conditions, respectively. A "natural place of articulation" is then defined as a place of constriction that fulfills both conditions. The stability condition is fulfilled at those places where two formant curves simultaneously have their maxima or minima. Here, quite obviously, a small backward or forward displacement of the constriction causes formant frequency changes that are hardly perceptible. On the other hand, at places where the formant curves slope steeply, even a rather small displacement at once shifts the formant frequencies radically, and a noticeable change in sound quality results. If we look at Fig. 3, we find at 4 cm from glottis, for unrounded conditions, that F1 displays its maximally positive (upward) deviation and that F2 simultaneously has its maximally negative position (stability). These formant positions are not found elsewhere in the nomogram (distinctiveness). This is an excellent place of articulation and corresponds to the vowel [a].
However, it is important not to round this vowel, since F1 and F2 will then be lowered to positions that are also found at another place, namely at 10·5 cm from glottis, if the curve representing maximally rounded conditions is traced. A speech sound articulated at 4 cm with lip rounding not only runs this risk of confusion, but also has its F2 located on a sloping part of the nomogram curve. This is unfavourable, since negligent articulation of the sound then becomes risky. The F1–F2 proximity found at 4 cm from glottis (the vowel [a]) is also important. Two formants in proximity will, as is well known, amplify each other, and these two formants will then be auditorily characteristic of the sound in question, whereas the influence of the other formants is reduced. The reason that two formants run parallel and close together is, as mentioned previously, that they must sometimes change cavity affiliation. All these facts explain why articulation at 4 cm from glottis accompanied by unrounding produces a "good" sound. This is a natural place of articulation. Another natural place of articulation is found at 8 cm from glottis. This, however, presupposes that the lips are rounded. The F2 curve of maximal rounding then reaches its maximally negative position. In addition, F3 here makes its intermediate maximally positive deviation and consequently fulfills the stability condition, although it is of minor importance at this place. At 8 cm from glottis, too, an extremely "good" sound is obtained, provided that the lips are rounded. This is the vowel [u] or the semivowel [w].
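The stability condition can be phrased as a search for places where the slopes of two formant curves are simultaneously close to zero. The sketch below applies that criterion to sampled curves. The curves themselves are hypothetical toy functions (an F1 peak and an F2 dip placed at 4 cm), not Fant's nomogram data, and the tolerance of 20 Hz per cm is an arbitrary assumption; the functions `slopes` and `stable_places` are illustrative names, not part of any published method.

```python
import math


def slopes(values, step_cm):
    """Finite-difference slope in Hz per cm: central differences inside, one-sided at the ends."""
    last = len(values) - 1
    out = []
    for i in range(len(values)):
        if i == 0:
            out.append((values[1] - values[0]) / step_cm)
        elif i == last:
            out.append((values[last] - values[last - 1]) / step_cm)
        else:
            out.append((values[i + 1] - values[i - 1]) / (2 * step_cm))
    return out


def stable_places(positions_cm, formant_curves, tolerance_hz_per_cm):
    """Places where every given formant curve is locally flat (at a maximum or minimum)."""
    step = positions_cm[1] - positions_cm[0]
    curve_slopes = [slopes(curve, step) for curve in formant_curves]
    return [
        x for i, x in enumerate(positions_cm)
        if all(abs(s[i]) <= tolerance_hz_per_cm for s in curve_slopes)
    ]


# Hypothetical toy curves (NOT Fant's data): F1 peaks and F2 dips at 4 cm from glottis.
STEP_CM = 0.5
positions = [i * STEP_CM for i in range(35)]  # 0.0 ... 17.0 cm
f1 = [500 + 250 * math.cos(math.pi * (x - 4) / 14) for x in positions]
f2 = [1500 - 700 * math.cos(math.pi * (x - 4) / 7) for x in positions]

print(stable_places(positions, [f1, f2], tolerance_hz_per_cm=20.0))
```

With these toy curves F2 is also flat at 11 cm, but F1 is still sloping there, so only the place where both curves are flat survives; this mirrors the requirement that two formants reach their extremes simultaneously.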
At moderate lip rounding, the maximally negative deviation of the second formant is found midway between the places requiring minimal and maximal lip rounding, i.e. at about 6 cm from glottis. This would seem to correspond to the vowel [o], supposing it has moderate lip rounding. The unrounded F2 curve displays its maximally positive deviation at 11 cm from glottis. Here F2 and F3 fulfill the same conditions as F1 and F2 did at 4 cm from glottis. The vowel [i] at 11 cm is again a natural and "good" sound. A gradual rounding of the lips, resulting in [y] and [ʉ], requires a simultaneous forward movement of the constriction to 12 and 13 cm from glottis in order to obtain the optimal places of articulation. The model also accounts for the articulation of consonants. For instance, a [g] is located at 11 cm in the F1–F2 proximity region. At the moment of release the formants are at their closest. In coarticulation with, e.g., [u], F2 must drop from 2500 to 600 Hz. Dental consonants can be placed at 14–16 cm. Labials may be derived by looking at the curve of maximal lip rounding of the vowel with which the labial is coarticulated. If, for instance, [b] precedes an unrounded vowel corresponding to a constriction at, e.g., 13 cm, F3 will rise from 2200 to 3000 Hz. It is reasonable to suggest that the optimal places of articulation produce the most "elementary" sounds, which are found in all the languages of the world. The vowels found in every language are [i], [a], and [u]. As predicted, these vowels correspond to the optimal places at 11, 4, and 8 cm from glottis, respectively. The relations seem obvious. On the other hand, it seems more difficult to explain why so few languages have the vowel corresponding to the obviously quite natural place of articulation at 13 cm from glottis: the vowel [y] is rather rare, not to mention [ʉ].
These discussions of the relations between articulation and formant pattern are usually brought under the heading "the theory of natural places of articulation". Considering the fundamental concepts of the theory, namely the maximally positive and negative deviations of the formants, it can also be referred to as "the maxima theory".

Criticisms of the Maxima Theory
Against the maxima theory it may be objected that it seems to concern vowels and consonants to the same extent and to handle them in the same way. Above all, too much emphasis has been placed on the vowels as sounds of the front-back dimension. This dimension should chiefly concern the consonants, which should be represented by a model with a considerable constriction whose degree is held constant while its place varies. As for the vowels, too little regard is paid to the fact that they may also have a varying degree of constriction, not only a varying place. The place of articulation should rather be regarded as a secondary factor, as it seems unreasonable to imagine the vowels placed along a front-back dimension with a large number of intermediate points as well. Nobody has probably ever seen a vowel system like this, with long series of points between front and back articulation. Vowel systems usually define the place of a vowel by the highest point of the tongue. Fant does state that this point does not coincide with the point of maximal constriction, yet he sometimes argues as if all vowels had the same degree of constriction. He nowhere stresses the phonetically important fact that the degree of constriction may vary. The effect of varying the degree of constriction may be studied not only in Fig. 4, but also in Fig. 5, which is taken from Stevens & House (1955) and displays the effect on vowel quality. The three-parameter model used by these authors is in several respects similar to that of Fant in Fig. 2. Figure 5(a) shows the articulation regions for the different vowels when the cross-sectional area of the constriction is very small: the radius is 0·4 cm, which corresponds to an area of 0·5 cm², comparable with the constriction in Fig. 3. The vowel quality is displayed as a function of the place of constriction (abscissa) and the lip dimensions (ordinate; rounding at the bottom, unrounding at the top). Figure 5(a) shows how easy or how difficult it is to produce the indicated vowel sounds with this narrow constriction. If a vowel region is large, the vowel is evidently easy to obtain; if a region is small, the vowel is difficult to hit without considerable effort. At 6 cm from glottis this constriction yields the vowel [u] when the lips are rounded. The region is large and thus easy to hit. When the lips are unrounded, the vowel yielded at this place is [a:], but this requires great precision, as the region is small. If, in this situation, one widens the constriction, that is, goes to Fig. 5(b), and makes the constriction radius 0·8 cm (area 2 cm²), it is clearly easier to hit the [a:] region, which has grown larger. It is still larger in Fig. 5(c), where the constriction is almost eliminated (the area is now 3·14 cm²).¹ If we travel from Fig. 5(a) to 5(c) and find that a certain vowel region shrinks, we can conclude that the vowel in question is most easily produced with a narrow constriction; if the region grows larger, the vowel should be produced without any constriction at all. Fant's discussions do not do justice to evidence of this sort.

Figure 5. Contours of vowel articulation. The contours show the regions where different articulatory configurations produce the indicated vowel quality. Abscissa: constriction in cm from glottis (4 to 13 cm). r = radius of the cross-sectional area of the constriction: (a) r = 0·4 cm; (b) r = 0·8 cm; (c) r = 1·0 cm. A/l for the mouth opening implies rounding below and unrounding on top. (Simplified from Stevens & House.)

¹ In the unconstricted state the vocal tract model of Stevens & House is narrower than that of Fant, only 4·5 cm² in contrast to 8 cm².

The question now arises of what the most fundamental dimensions in the system of speech sounds are. Aside from binary features of minor articulatory importance, such as voiced–unvoiced, nasal–oral, continuant–abrupt, etc., we move along gradual dimensions. What dimensions does this articulation proper really constitute, and which of them are most fundamental? That this is a real problem appears from the enormous number of proposals for systematizing the speech sounds that have been presented in the course of the last centuries. As for the consonants, the majority of writers, including Fant, probably hold the view that the place of articulation, or the place of constriction, is a very fundamental dimension. As for the vowels, however, there is a confusing variety of systems and dimensions. If we choose the 12 most well-reasoned vowel systems proposed since 1781 (Ungeheuer, 1962), we find that 11 use the high-low dimension. Fant alone differs. Noreen also holds a peculiar view concerning this dimension: "Besides, the whole distinction seems to be of minor importance . . . since the capacity of a vowel of being high or low principally seems to depend not on the raising of the tongue towards the palate but rather on the shape of the palate at the specific place of articulation, a shape that brings the palate and the tongue now closer to, now away from each other, completely independent of the raising of the latter" (Noreen, 1907). Otherwise, however, the high-low dimension is maintained throughout. Eight systems contain the front-back dimension; it is missing in the four oldest systems. In seven cases lip rounding occurs as a dimension of its own. These very simple statistics thus suggest that the high-low dimension is the most fundamental vowel dimension. This happens to favour the theory presented below, but the purpose of this excursus is only to illustrate how troublesome it is to reach a unified view on the fundamental dimensions of speech sounds. The maxima theory also fails to account for the vowels between the corner ones. From the point of view of this theory the vowel [c] is as bad as [i], etc. But we know that vowels tend to arrange themselves into series, above all a back (rounded) one and a front (rounded or unrounded) one. The theory of discrete maxima seems unable to explain this crucial fact. Another reflection concerning the maxima theory is more general.
The theory raises a problem that follows naturally from the discussion of good and bad places of articulation that pervades it, yet no one has ever put this problem forward: some sounds are poorly suited as speech sounds, others are better and more natural, and we prefer these. If this is true, which is most probable, how do we manage to decide which sounds are and which are not well suited as speech sounds, or as sounds at all? Certainly nobody tells a child which sounds it shall use and which not. To refer to the child's imitation of the people around him would be wrong, as it has been proved that babbling appears without any sort of stimulation. The babbling may very likely be regarded as eager experimenting in order to discover all sorts of sounds, not only those used as speech sounds by the parents. It seems quite reasonable to suggest that the child learns on its own to find the "good" sounds, the sounds he will later use as speech sounds (and only then does imitation begin); and if he does not learn to find them quite perfectly, he apparently tries to find them. This is the great problem: how does the child manage to find these "good" places of articulation? What are we actually doing when we learn how to produce sounds? If we could solve these problems, we would learn what actually happens when we utter sounds, and not only in the child's situation. In the following section a theory will be outlined which shows how well justified the above-mentioned criticism of the maxima theory's insufficient distinction between vowels and consonants is. Further, the new theory attempts to settle the basic question of how we find the fundamental dimensions and thereby the natural sounds.

The Theory of Local Linearity
Very few potential changes in the position of the speech organs produce an acoustic change that directly and adequately reflects the movement of the speech organ in question. For instance, from Fig. 4 it appears that narrowing the constriction at 8 cm from glottis, ceteris paribus, would be completely unmotivated: one does not achieve any acoustic result adequate to the movement of the tongue. Many similar examples could be given, where the acoustic result in no way reflects the mechanical movement of the speech organ. In fact, there are only a few places in the vocal tract where a proportional relationship holds between a mechanical movement and its acoustic result. This fact is fundamental to the theory of local linearity. The significance of linear proportion for the production of speech sounds will be explained in detail as the discussion proceeds. The theory requires a separate treatment of consonants and vowels.

Consonants
During the discussion of consonants, Fig. 3 will be used as reference material. It must, however, be noted that the cross-sectional area of the constriction is 0·65 cm² in this figure; this corresponds approximately to the constriction of the vowel [i] and is actually too wide a passage to be suited for consonants. This will be overlooked here; the argument will not suffer from it. A more adequate nomogram would show a still higher degree of proximity in the proximity regions, owing to the decreased coupling between the resonance cavities. The narrowest constriction for vowels may be ca. 0·3 cm² (Stevens, 1971); a narrower constriction will yield fricatives. As for the consonants (primarily stops and fricatives), the effect of antiresonances must be taken into consideration; more will be said about this later. Let us examine Fig. 3 and look for some perceptually significant component in the acoustic signal that changes as a result of an equally significant mechanical change. This change may be a tongue, lip, or jaw movement, in fact any movement of the speech organs. We find that F2 makes an upward movement as a tongue constriction moves forward from 4 to 11 cm from glottis.
During the whole length of this 7 cm movement, a simple mechanical gesture thus causes a change in the undoubtedly most significant acoustic component of the resulting sound, a change that quite obviously corresponds to the mechanical gesture in question. The two changes are related to each other in linear proportion. This proportional behaviour is the basis of the theory of linearity. The second formant displays a movement and provides, as it were, a scale along which one can "play". That this formant is the dominant one is quite clear; this has been shown earlier in connection with the figures. The first formant is cancelled by an antiresonance in voiceless consonants and thus has no significance. The third formant, and to a still greater extent the higher ones, change cavity affiliation too often and are not suited to reflect any mechanical movements. The consonants are lined up along the "F2 scale" just described. If we use the voiceless fricatives as an example, a posterior one like [χ] will be located at 4 cm from glottis, at the beginning of the F2 scale. As the tongue moves forwards, it will then produce [x–ç–s–θ].² The anterior fricatives are, of course, located more than 11 cm from glottis (the anterior end of the 7 cm F2 scale), but the figure is not to be taken too literally. If we overlook the fact that the quarter-wavelength resonance at 11 cm will be controlled by F3 and further forward by F4, etc., F2 will, so to speak, also control the anterior sounds. The crucial point of the theory, as far as consonants are concerned, is that the forward movement of the tongue constriction, i.e. the shortening of the front cavity, is in linear proportion to the raising of the quarter-wavelength resonance of that cavity, and this resonance is for the most part controlled by F2 in Fig. 3. Moreover, when a constriction passes the 11 cm limit, an antiresonance will cancel the half-wavelength resonance of the back cavity, in the same way as an antiresonance always cancels F1. The quarter-wavelength resonance of the front cavity is thus always the first resonance and also the most prominent one. This obvious fact is a perfect means of distinguishing consonants from vowels. Consonants may generally be characterized by one single main formant being active, owing to the presence of antiresonances, whereas vowels are characterized by many formants (of varying prominence) being active, owing to the absence of antiresonances. It would be very convenient here to adopt the (often unnoticed) proposal of Fant (1962) to call the vowels "zero free" as opposed to the consonants, which are "zero attached". In this way the distinction between vowels and consonants made in this paper is well justified.

² [ʃ] is missing, as the three-parameter model cannot handle sounds produced with a retroflexed tongue. Retroflexion will be treated later.

In Fig. 3, thus, F2 between 4 and 11 cm may for the sake of simplicity represent the important F2 scale. The notion "F2 scale"³ is used here merely as a substitute for the clumsy "quarter-wavelength resonance continuum". The scale is in linear proportion to the movement of the constriction. Therefore, it does not pay to do anything but move the constriction: if the constriction is simultaneously widened or the lips are moved, the movement of the constriction no longer produces the simple proportional change in the acoustic output. The importance of the proportional relations is evident from the following facts. The linearity implies that it is easy to "practise" the scale in question. It is easy to do something if the result corresponds in a natural way to the action; that is exactly the simple condition for being able to control what one does. It explains our eminent control of the vocal tract: the thing being controlled has a proportional relation to the controlling mechanism.
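The cavity behaviour described in this section can be illustrated with a crude, uncoupled estimate: the front cavity is treated as a quarter-wavelength resonator (effectively closed at a narrow constriction, open at the lips) and the back cavity as a half-wavelength resonator (closed at both ends). The sketch assumes c = 35 000 cm/s and a 17·5 cm tract, and, for simplicity, measures both cavities from the centre of the constriction while ignoring its length and any antiresonances. Coupled curves such as those in Fig. 3 differ from these values, so the numbers only show the direction in which each cavity resonance moves.

```python
# Crude, uncoupled estimate of the two cavity resonances as the constriction moves forward.
# Assumptions (not from the text): c, tract length, and cavities measured from the
# constriction centre with the constriction's own length ignored.
SPEED_OF_SOUND_CM_S = 35_000.0
TRACT_LENGTH_CM = 17.5


def front_cavity_quarter_wave(x_cm):
    """Front cavity (constriction to lips): closed-open tube, f = c / (4 * length)."""
    return SPEED_OF_SOUND_CM_S / (4 * (TRACT_LENGTH_CM - x_cm))


def back_cavity_half_wave(x_cm):
    """Back cavity (glottis to constriction): closed-closed tube, f = c / (2 * length)."""
    return SPEED_OF_SOUND_CM_S / (2 * x_cm)


if __name__ == "__main__":
    # Place of constriction from the back (4 cm) to the front (11 cm) of the F2 scale.
    for x in range(4, 12):
        print(f"x = {x:2d} cm  front = {front_cavity_quarter_wave(x):6.0f} Hz"
              f"  back = {back_cavity_half_wave(x):6.0f} Hz")
```

The front-cavity estimate rises and the back-cavity estimate falls over the same range, which is the crossing pattern that forces formants to change cavity affiliation in the coupled nomogram.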
Along the linear scale, for instance, we know perfectly at what place a [ç] is produced if air is pressed through a narrow constriction. The stops are controlled in the same way, etc. One must, of course, also be able to discuss the significance of moving the lips from this angle of linearity. If we study the rounded curves in Fig. 3, we find that lip rounding shifts the process which, with unrounded lips, took place between 4 and 11 cm, so that it instead takes place between 8 and 14 cm from glottis, i.e. the whole F2 scale is moved somewhat to the left in the nomogram, since, as was previously mentioned, the quarter-wavelength resonator is transformed into a half-wavelength resonator. The acoustic result of lip rounding may be regarded as proportional to the change in the position of the lips: when the tongue constriction is fixed at a certain place, rounding the lips, ceteris paribus, causes F2 to decrease to a degree proportional to the degree of lip rounding. This linearity of lip rounding can always be observed, regardless of the location of the tongue. Lip rounding could in fact be regarded as a binary action as far as consonants are concerned. Performing a binary articulatory gesture implies that no sort of acoustic "scale" is employed. For instance, switching vocal cord vibration on (and off) is a most typically binary phenomenon. Instead of practising a scale, we rather change the "instrument" in order to play on another scale. Linearity obviously cannot exist with a binary phenomenon, as linearity implies a movement along an axis on which more than two values are established. For consonants it is difficult to point out varying degrees of lip rounding, whereas vowels prove to have different degrees of rounding. Thus, concerning consonants, it may be superfluous to speak about the above-mentioned "linearity of lip rounding". Moreover, there is reason for classifying retroflexion as a binary action. Once the tongue is retroflexed, it may, if desired, practise a certain scale. In most languages, however, no more than one place of retroflex articulation can be established. Yet it is not uncommon for people to vary the lip opening during constant retroflex hissing, as a variant of ordinary whistling. A last binary action will be mentioned as an example, namely the appearance of semivowels: the constriction is widened to the degree at which noise ceases, allowing semivowels to be produced along the F2 scale. We realize that the binary phenomena may profitably be sorted among the manner features, whereas the so-called place features concern the gradual events which belong to the "articulation proper" and whose nature is described by the theory of linearity.

³ The word "scale", borrowed from music, is in fact misleading, since it means "ladder"; the conditions of the vocal tract are not to be compared with a stepwise scale but with a non-quantized continuum. This will be discussed later.

Our ability to control the vocal tract perfectly is even more striking considering that we are able to whistle. In ordinary whistling the lips are protruded and act as an anterior obstacle. A posterior constriction is made by the tongue somewhere along the palate. The two obstacles cause such turbulence that a whistle tone is produced. The pitch of this tone depends solely on the distance between the two obstacles, and the frequency coincides with the F2 of rounded consonants in Fig. 3 when the constriction is located at 8–14 cm from glottis. This proves that whistling is produced in exactly the same dimension as consonants. Whistling at the lowest possible pitch corresponds to the production of a rounded [xʷ] (note: not [χʷ], as the F2 scale for rounded conditions "begins" at 8 cm from glottis). A gradual increase of the whistle pitch implies a forward movement of the posterior constriction until a rounded [çʷ] is reached, which corresponds to the highest possible whistle pitch, owing to physiological constraints.
Figure 6 shows a narrow-band spectrogram of a whistle started at the highest pitch possible for the whistling informant and finished at the lowest possible pitch. Note how the highest level coincides with the main formant of [çʷ], shown in the broad-band spectrogram to the left, and how the lowest pitch coincides with the main formant of [xʷ], shown to the right in the figure. Note also the resemblance of the whistle-tone curve to the rounded F2 curve between 8 and 14 cm in Fig. 3. For whistling, the cross-sectional area of the posterior constriction may be at most ca. 0·3 cm² to yield the necessary turbulence (Stevens, 1971). If one wants to compare whistling with the production of vowels, one must find vowels with this considerable degree of constriction and also with maximally rounded lips. Only [ʉː] and [u] meet these requirements. The mouth-cavity resonance of these vowels, in both cases F2, actually coincides with the highest and the lowest whistle pitch, respectively. Otherwise, no reasonable comparison can be made between whistling and vowels. The close relationship between whistling and consonants is thus established. The process of whistling is also explained by the theory of linearity: we use the scale represented by F2 in Fig. 3. The comparative ease of auditory control in whistling is also explained: it is simply easy to whistle, as the pitch variation correlates proportionally with the movement of the tongue passage. This ease appears most strikingly when it comes to reproducing the interval between two presented tones. If a seventh is given as stimulus, one can easily reproduce a seventh. The ability to hit exactly the same frequency, however, depends on training.
But what matters here is not the ability to hit the absolute pitch but the general ability to reproduce the interval, since it is the latter that reveals our remarkable instinct for frequency and, most importantly, proves that linearity is the condition for our mastering these scales. As will be shown presently, varying the degree of constriction is, according to the principles of the theory, a dimension that is in all respects associated with vowels. It will be assumed that vowels are produced in a manner quite different from that of consonants.
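Reproducing an interval without hitting the same absolute pitch amounts to preserving a frequency ratio rather than a frequency difference in Hz. A minimal Python sketch using the standard cents measure (1200 times the base-2 logarithm of the ratio); the paper's "seventh" is rendered here, as an assumption, as a minor seventh of 10 equal-tempered semitones, and the frequencies are arbitrary illustrative values, not data from the whistling experiment:

```python
import math


def interval_cents(f_low: float, f_high: float) -> float:
    """Size of the interval between two frequencies in cents (100 cents = 1 equal-tempered semitone)."""
    return 1200.0 * math.log2(f_high / f_low)


def reproduce_interval(stimulus_low: float, stimulus_high: float, start: float) -> float:
    """Upper tone that reproduces the stimulus interval from a new starting pitch.

    The ratio of the two stimulus tones is kept, not their difference in Hz.
    """
    return start * (stimulus_high / stimulus_low)


# Hypothetical stimulus: a minor seventh above 1000 Hz.
low = 1000.0
high = low * 2 ** (10 / 12)
print(round(interval_cents(low, high)))  # 1000

# The listener starts from a different pitch and still produces a minor seventh.
answer = reproduce_interval(low, high, 1200.0)
print(round(interval_cents(1200.0, answer)))  # 1000
```

Working with ratios is what makes the interval independent of where on the scale it is placed, which is the distinction drawn above between reproducing an interval and hitting the same frequency.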
Figure 6. Whistling from high to low pitch (retouched narrow-band spectrogram). Broad-band spectrograms of [çʷ] (left) and [xʷ] (right) are shown for comparison; calibration marks at 500 Hz.
Vowels
Figure 4 will be the reference material during the discussion of vowels. The theory of linearity is based on the assumption that vowels are mainly arranged in the dimension closed-open, or high-low. In this dimension there are always plenty of intermediate points, which is not the case with the front-back dimension. Although vowel diagrams of this sort define the position of a vowel according to the highest point of the back of the tongue, we are still convinced that vowels have degrees of height also when the criterion is their place of maximal constriction. Note that the notion "maximal constriction" does not exclude different vowels having different degrees of constriction! According to the theory of linearity, vowels are supposed to be produced by varying the degree of constriction, while consonants are produced by varying the place of constriction. Lip rounding occurs in both categories and is of secondary importance, but it nevertheless fulfils the conditions of linearity (if it is not regarded as binary). As for the production of vowels, Fig. 4 shows acoustic changes that are caused by mechanical movements, the relations between the changes being linear. The starting point when producing vowels is the unperturbed tube, i.e. the tube without any constriction. These conditions correspond to the vowel [re] and to the straight lines in Fig. 4, i.e. formant frequencies at odd multiples of ca. 500 Hz. Producing vowels then requires constricting the tube at places where the movement of the tongue (or of some other articulator) makes some important acoustic component deviate adequately from the original, undistorted level. These conditions are not found everywhere in the mouth. An adequate reaction of the important second formant may be achieved by constricting at 4 cm from the glottis. The same is attained by constricting at 11 cm. These two places are the only ones suited as constriction places, at least as far as the F2 reactions are concerned.
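The "odd multiples of ca. 500 Hz" follow from treating the unperturbed vocal tract as a uniform tube closed at the glottis and open at the lips, i.e. a quarter-wavelength resonator with resonances at F_n = (2n - 1)c/4L. A minimal Python sketch, assuming a tract length L = 17.5 cm and a speed of sound c = 35 000 cm/s; these are common textbook values, not figures given in this paper:

```python
def uniform_tube_formants(length_cm: float = 17.5,
                          c_cm_per_s: float = 35_000.0,
                          n_formants: int = 4) -> list[float]:
    """Resonances of a uniform tube closed at one end (glottis) and open at the other (lips).

    F_n = (2n - 1) * c / (4 * L): every resonance is an odd multiple of the lowest one.
    """
    lowest = c_cm_per_s / (4.0 * length_cm)
    return [(2 * n - 1) * lowest for n in range(1, n_formants + 1)]


print(uniform_tube_formants())  # [500.0, 1500.0, 2500.0, 3500.0]
```

Because nothing in this calculation depends on a place of constriction, the unperturbed formants are the same wherever one looks along the tube, which is why they appear as straight lines in a diagram of formant frequency against place.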
If the lips are rounded, these conditions are fulfilled at 8 and 14 cm, respectively (see Fig. 3); the whole F2-curve is moved to the left but keeps the same form. It is true that the constriction is then better placed at two other places in the mouth, but these, too, are places where F2 reacts adequately. The production of vowels is consequently accomplished by narrowing or widening the passage at only two well-suited places in the vocal tract, namely where the relation between the mechanical and the acoustic change is linear. For consonants the degree of constriction is constant while the constriction moves horizontally, whereas for vowels the degree of constriction varies at a given place, thus moving vertically, as it were. The production of the series [i-e-cre- a- a] may then be traced in Fig. 4: the constriction is at first maximally narrow at 11 cm; the passage is widened and the constriction disappears, all at 11 cm; the tube is now straight and unperturbed; a constriction then forms at 4 cm and finally grows maximal at that place. With moderately rounded lips the series [y-0-re- - o] may be traced in exactly the same way, but the two places of widening and narrowing are located at 12·5 and 6 cm, respectively. Again in the same way, maximally rounded lips yield the series [w-e- u], the two places now being located at 14 and 8 cm, respectively. So far in this discussion of vowels, the tongue may have seemed unable to perform a horizontal movement. That is, of course, not the case; it is perfectly capable of doing so, for instance between the vowels [a-o-u], but the procedure is then rather this: the major factor is the lip opening, performing an increasing protrusion, while the forward movement of the tongue follows secondarily as a compensation for "remaining" in the F2-valley in Fig. 4. The fact that the tongue always constricts in the F2-valley does not upset the principle that the main vowel dimension is the vertical one.
When, in this discussion, the tongue is said to move vertically, this is meant in relation to the F2 maximum and minimum, respectively, rather than in relation to particular places in the mouth.
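The "vertical" vowel dimension can be made concrete by describing each vowel as a (place, degree) pair in which, for a given lip setting, the place takes one of only two fixed values and the degree varies continuously, the place switching only where the tube is unperturbed. A minimal Python sketch: the constriction places per lip setting are the distances from the glottis given in the text, while the names LIP_SETTINGS, VowelState, vowel_state and the signed coordinate s are hypothetical devices introduced here for illustration only:

```python
from dataclasses import dataclass
from typing import Optional

# Constriction places quoted in the text, in cm from the glottis,
# as (front place, back place) for each lip setting.
LIP_SETTINGS = {
    "unrounded": (11.0, 4.0),
    "moderately rounded": (12.5, 6.0),
    "maximally rounded": (14.0, 8.0),
}


@dataclass
class VowelState:
    place_cm: Optional[float]  # None when the tube is unperturbed
    degree: float              # 0 = no constriction, 1 = maximal constriction


def vowel_state(lips: str, s: float) -> VowelState:
    """Map a hypothetical signed 'vertical' coordinate s in [-1, 1] to a constriction.

    s = +1: maximal constriction at the front place (the [i] end of the unrounded series)
    s =  0: unperturbed tube
    s = -1: maximal constriction at the back place (the other end of the series)
    Only the degree varies continuously; the place is one of two fixed values.
    """
    if not -1.0 <= s <= 1.0:
        raise ValueError("s must lie in [-1, 1]")
    front, back = LIP_SETTINGS[lips]
    if s == 0.0:
        return VowelState(place_cm=None, degree=0.0)
    return VowelState(place_cm=front if s > 0 else back, degree=abs(s))


# Trace the unrounded series from the [i] end to the other end.
for s in (1.0, 0.5, 0.0, -0.5, -1.0):
    print(s, vowel_state("unrounded", s))
```

Running the same loop with "maximally rounded" only moves the two places to 14 and 8 cm, mirroring the statement that the three series are traced "in exactly the same way".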
The first and the third formants are here assigned secondary roles. Their purpose is only to amplify the second formant as a consequence of the latter's movement. When F2 is lowered by posterior constriction, F1 and F2 amplify each other; this, therefore, may be regarded as a consequence of the F2 movement. With anterior constriction F2 is raised, and the same conditions hold for F2 and F3. As previously mentioned, it is characteristic of vowels that F1, F3, and all the other formants are present at all. The consonants, on the other hand, are characterized by the prominence of only one active formant (or, in order to include sonorant consonants as well, by the mere presence of any one antiformant).
Discussion
It has been shown that the mechanical movements of the speech organs only seldom yield linear changes in the acoustic output. The linearity appears only locally, hence the name of the theory. It is most favourable for the tongue, for example, to move along one of the dimensions at a time and not along two at the same time. Two simultaneous movements would lead to acoustic results that are very hard to control. This might be compared with a walk on a mountain ridge. Following the crest straight implies a movement in one single dimension. Walking downhill at the same time, and thus deviating from the straight crest line, would mean a movement in two dimensions (forwards + downwards). When a narrow tongue constriction passes forwards along the palate, it is uncomfortable to deviate from this sole action by, for instance, moving the lips or widening the constriction. It is illustrative to study the effect of hitting the wrong place or degree of constriction. First, let us look at the F2-scale of consonants in Fig. 3, which refers to a constant degree of constriction. If one articulates along this scale and keeps the degree of constriction constant, one moves along one dimension only.
If one happens to miss and makes the constriction too wide, the amount of friction decreases, yielding a semivowel; one has fallen down from the crest and reached another manner of articulation. We now repeat the walk along the same F2-scale, but this time we hit the wrong place for the intended consonant. This gives a consonant other than the one intended, still, however, of the same manner. It seems reasonable to expect that falling down from the crest is less probable than "reeling" a little on top of it. Moving along the ridge matters more to the sound quality than reeling on it. Hence, when aiming at a consonant, it is more important to hit the right place than to hit the right degree of constriction. Let us now study the vertical scale of the vowels in Fig. 4, where the mountain ridge is represented by the varying degrees of constriction; this ridge is easy to imagine if the degree of constriction is thought of as a third dimension, instead of all (four) degrees of constriction being pictured in the same diagram. Articulating at a certain point on top of the crest yields a certain vowel. If you walk in one direction along the crest, the result will very soon be another vowel. However, if you walk down the hillside perpendicularly to the ridge, nothing special happens at first, but after some time the sound is no longer the same as it was on top of the crest. We see that the shortest way to another vowel goes along the crest. It is, therefore, more important to hit the right degree than to hit the right place of constriction when aiming at a vowel. The conclusion is that the consequence of missing when articulating is the same for all sounds: it is more important to hit the right point along the crests than across them. It does not matter very much if you miss perpendicularly to the crests, since deviating from them is so "ineffective". The most "effective" action is to move along a crest.
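The mountain-ridge argument rests on a geometric asymmetry: along the crest the output changes in proportion to the size of the step, whereas across the crest, at its top, the first-order change vanishes and only a second-order change remains ("nothing special will happen at first"). A toy Python illustration of that asymmetry, assuming a made-up surface h(u, v) = u - k·v² with u along the crest and v across it; this surface is not a model from the paper, and its units are arbitrary:

```python
def ridge(u: float, v: float, k: float = 1.0) -> float:
    """Toy 'acoustic output' surface: linear along the crest (u), flat-topped across it (v)."""
    return u - k * v ** 2


def change(du: float, dv: float, u0: float = 0.0, v0: float = 0.0) -> float:
    """Absolute change in output for a small articulatory miss (du, dv) from a point on the crest."""
    return abs(ridge(u0 + du, v0 + dv) - ridge(u0, v0))


step = 0.1
along = change(step, 0.0)   # miss along the crest: first-order effect
across = change(0.0, step)  # miss perpendicular to the crest: second-order effect
print(along, across)        # the along-crest change is about ten times larger here

# Halving the step halves the along-crest change but quarters the across-crest one.
print(change(step / 2, 0.0) / along, change(0.0, step / 2) / across)
```

For consonants, u would stand for the place and v for the degree of constriction; for vowels the roles are exchanged, which is the point of the comparison above.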
Such actions offer the best and most adequate acoustic results.
The problem, mentioned earlier, of how we as children manage to find the natural sounds, the fundamental dimensions, may be given a satisfactory solution in terms of the theory of linearity. The joyfully babbling child actually tries to find the articulatory movements that are most interesting and fun to play with, and that give the most "beautiful" result, a sound that may be repeated over and over again if it was fun enough to play with. The child learns to make those articulatory movements that yield an adequate sound it can control (i.e. one it can repeat; cf. da-da). The child thus finds in its mouth the linear places, where there are scales to practise. There is reason to suspect that we possess a great unconscious love for scales, which in turn could be owing to an internalized capability that enables us to experience three-dimensional coordinate systems, for instance spectra changing with time. It is an attractive thought in view of the obvious ease with which most of us can whistle, irrespective of "musicality", and in view also of the close relation between whistling and articulation. Musicality may also be an important cue, given that music is something specific to human beings: all and only humans are capable of creating music. Our sense of scales need not include only frequency scales. Many other phenomena may also manifest themselves in scales. For instance, a very young child might spontaneously arrange some toy blocks in order of size, an achievement that could be explained by the above suggestion. Many a reader might object: why, then, is it so hard to learn to play the piano, for instance? Some do not succeed, no matter how much they practise! Such an objection, however, fails to take into account that the scale of the piano consists of an array of discrete positions and consequently not of a continuum. Learning to use such a scale (i.e. in the right sense of the word) is in fact very difficult. Very few of us manage to play on the piano a piece of music that we have heard once but never seen in print.
But if one uses an instrument with a continuous scale, which is unusual, for instance a toy trombone, the player needs only a short training time (to get acquainted with the scale and the frequency range of the instrument; cf. the babbling of the child) to master the instrument and play the melody. This speaks in favour of the assumption that our sense of scales is not unfounded. It might, incidentally, also be mentioned how the production of speech sounds would be done if it were analogous to piano playing. It is impossible to play any piece of music unless printed music is provided, on the music stand or in the mind. The analogy would then be that the tongue strikes predestined points along the palate. It would surely take a considerable time to learn to speak in this way. From all this it may be concluded that our sense of scales most probably concerns continuous scales (in that case they ought to be logarithmic, but this detail will not be discussed here). Only a few words have been said about the articulatory movements from a purely physiological point of view; the discussion has so far been characterized by acoustic reasoning. What has mostly just been hinted at, but will now be stressed, is that the features characterizing the consonants as sounds of a horizontal dimension and the vowels as sounds of a vertical dimension are features concerning mechanical movements of a constriction and the acoustic output of those movements. It is important to connect each movement with the adequate "scale". The description of these conditions has not called for much physiological reasoning. Earlier it was said that the horizontal consonant dimension implies articulatory movements along the tube with a constant degree of constriction, and that the vertical vowel dimension implies articulatory movements perpendicular to the tube with a constant place of constriction.
Not only the tongue is responsible for the various dimensions; the teeth and lips may also participate (mainly in the consonant dimension). The tongue, of course, plays the leading part. By considering further physiological correlates of the theory of linearity, we find that the vowel "scale" from [i] to [a] is situated according to Fig. 7(a), and the scale [w]-[u] according to Fig. 7(b). A point somewhere in the middle of the tongue runs along a line, which is straight in the somewhat schematized pictures but in reality should be somewhat curved in order to pass perpendicularly through the tube walls at two places. The form of the lips determines the inclination of the line in relation to the vocal tract, which in turn determines which vowel series will appear. Finally, the location of the point along the line determines the exact vowel quality. This physiological outline is the basis of Fig. 7(c), a model of vowel articulation that reflects physiological as well as acoustic facts.
The model in Fig. 7(c), however, suffers from certain elementary shortcomings. The most unrealistic one is the claim that the movements [i-w] and [a-u] are produced in completely analogous ways. The purpose of the discussion of Fig. 7 was, in fact, to call attention to these conditions. Acoustically these movements seem related to each other, but, as we all know, the formant frequency decrease caused by both of these movements is not due to articulatory movements of the same sort. The distinction [i]-[w] may, from a purely articulatory point of view, be brought about by moving the lips only. The tongue need not be moved forwards, though that would be a more natural gesture (according to the maxima theory). As for the distinction [a]-[u] (better: [a]-[u]), the rounding of the lips needs to be accompanied by a forward movement of the tongue. This explains why most languages have fewer back vowels than front vowels. A model must be able to demonstrate these facts, and we intend to make one that is, in these respects, better than the previous one in Fig. 7(c).
Figure 7. A possible suggestion for systematizing the vowels. Panels (a) and (b) are the basis of panel (c), whose axes are lip rounding and degree of constriction (each from 0 to Max). According to the text, this model suffers from certain shortcomings.
Let us first have a look at Fig. 8, where the vocal tract is seen as a tube that may be perturbed at different places and to various degrees. In the form of very schematized area function diagrams, the series 1-4 displays a series of asymmetric constrictions. No. 3 implies no constriction at all, but is considered as belonging to the series. This series should be produced with unrounded lips. If the lips are rounded, the asymmetric series is no longer appropriate: we know that there is no point in rounding an [a]. If we make a symmetric constriction (i.e. in the middle of the tube), however, there is a very good point in rounding the lips; it has a significant effect on the sound quality. During the movement 4-7 in Fig. 8, it is thus well-founded to let the lip rounding increase. The series 1-3 is independent of lip rounding, i.e. rounding is completely optional, yielding quite different vowels, ceteris paribus.
Figure 8. Left: schematized area functions of the vocal tract when producing the indicated vowels (numbered 1-7). Circles indicate the place of maximal constriction. Right: the circles from the left-hand pictures collected into a single diagram.
Now we turn to Fig. 9, where the numbers correspond to those in Fig. 8. Figure 9 is meant as an approximation of the physiological reality, while at the same time indicating the acoustic dimensions according to the theory of linearity. The section 1(')-4(') refers articulatorily to the asymmetric series of Fig. 8. During the course of this section, the place where the tube is undistorted is passed, but that fact is not indicated in the model, as it is of no significance to the "vertical F2-scale" of the vowels. In the horizontal plane, unrounding is located on the left-hand side. The gradation of increasing lip rounding is distorted in the picture: [w] and [a] do not have the same degree of lip rounding. The section 4-7 refers to the change-over from asymmetric to symmetric perturbation of the tube during constant vowel constriction; this dimension thus coincides with that of rounding. The "F2-scale" of the vowels runs, as mentioned, vertically. For unrounded vowels it runs between [i] and [a]; for maximally rounded ones it runs from [w] to [u]. The latter route is not straight, since point 7 cannot be placed in line below points 1'-4' in the model, which is intended to be very schematic. In Fig. 9 we may see the "acoustic lines" that are optimal according to the theory of linearity, and the physiological movements that are most closely related to them.
Figure 9. Schematized outline of the articulation of vowels and their acoustic properties, based on facts abstracted from Fig. 8 and from the theory of the acoustic dimensions of the vowels. Axes: lip rounding and degree of constriction (each from 0 to Max), with F2 running vertically from high to low. 1-4: the "vertical scale" of the unrounded vowels. 1'-4', 7: the "vertical scale" of maximally rounded vowels. 4-7: increasing lip rounding accompanied by a change-over to a symmetric constriction during constant maximal constriction. See text.
Conclusion
A theory has been presented showing that articulatory movements and the corresponding acoustic changes are related to each other in a linear way. It is thus postulated that there is local linearity in the vocal tract. This paper does not primarily deal with the dynamic aspects of what happens when we talk; the intention has been to give an idea of the properties per se of the vocal tract. The theory of local linearity claims to outline the fundamental dimensions of the speech sound system.
I wish to express my sincere thanks to Prof. Sven Öhman for the support and encouragement to which this paper owes its existence. In fact, the general idea of this work originates from him. I also thank my colleagues and friends at the Department of Linguistics for the valuable outcome of numerous discussions on this subject. I wish to extend special thanks to Bengt Svensson for his careful scrutiny of the English version of this paper. This work has been supported by N.I.H. Grant No. FROl HDO 5360-01.
References
Fant, G. (1960). Acoustic Theory of Speech Production. The Hague: Mouton.
Fant, G. (1962). Descriptive Analysis of the Acoustic Aspects of Speech. Logos 5 (1).
Noreen, A. (1903). Vårt språk I. Lund.
Stevens, K. N. & House, A. S. (1955). Development of a Quantitative Description of Vowel Articulation. Journal of the Acoustical Society of America 27 (3).
Stevens, K. N. (1971). Airflow and Turbulence Noise for Fricative and Stop Consonants: Static Considerations. Journal of the Acoustical Society of America 50 (4) (II).
Ungeheuer, G. (1962). Elemente einer akustischen Theorie der Vokalartikulation. Berlin, Göttingen, Heidelberg: Springer.