Journal of Phonetics (1993) 21, 205-229
Three approaches to the classification of American English diphthongs Michael Gottfried, James D. Miller and Donald J. Meyer Central Institute for the Deaf, 818 S. Euclid Avenue, Saint Louis, MO 63110, U.S.A. Received 22nd January 1991, and in revised form 1st June 1992
Various investigations of diphthongs suggest that they can be effectively classified in terms of (1) the pattern of fundamental frequency and formants at the onset and offset of the production, (2) the onset formant pattern and the F2 rate of transition or (3) the onset formant pattern and the direction of formant movement in an acoustic space. These three hypotheses were assessed in (a) a log F1 × log F2 space and (b) an "auditory-perceptual space" in which the dimensions are based on ratios between pairs of formants and between F1 and a reference value related to the vowel's average formant frequency. Values for relevant parameters were obtained for a corpus of 768 tokens of six American English diphthongs produced in two contexts ([b_d], [h_d]) at two tempos (slow, fast) with differing stress (stressed, unstressed). The hypotheses were evaluated with respect to classification performance using a statistical pattern recognition procedure. All three hypotheses produced correct classification of the corpus exceeding 90%, although the highest correct classification was obtained by specification of onset and offset formant patterns (an average of 96%). Slightly higher percent correct classifications were obtained for each hypothesis when parameters were specified in the auditory-perceptual space rather than in the log F1 × log F2 space.
1. Introduction
Diphthongs, as a class of speech sounds, are commonly characterized in terms of movement from one vowel to another (Ladefoged, 1982, p. 76). In phonetic transcriptions, diphthongs are accordingly rendered as a sequence of two monophthongal vowels (or, sometimes, as in Trager & Smith, 1951, by a vowel followed by a semivowel). Earlier experimental investigations of American English diphthongs can be organized into four general approaches which emphasize one or another aspect of this phonetic characterization. The introductory sections of this report provide a brief historical survey of these approaches in order to determine those acoustic parameters which have been proposed for the specification of American English diphthongs. The work reported here extends these previous studies in two respects. First, earlier studies were in large measure descriptive and did not evaluate the proposed parameters as bases for classification schemes. In the present work, acoustic parameters are used to specify several hypotheses regarding the means by which American English diphthongs may be most effectively classified. Second, while the effects of speech rate on acoustic properties of diphthongs have been considered (Gay, 1968; Dolan & Mimori, 1986), the effect of stress has not. The present study considers the acoustic properties of diphthongs produced in systematically varied conditions of tempo and stress. In sum, this paper considers which combination of acoustic parameters is most effective in the classification of diphthongs produced under varied conditions of tempo and stress.
0095-4470/93/030205+25 $08.00/0 © 1993 Academic Press Limited
1.1. Steady-state and glide segments
Some investigators have sought to describe diphthongs with respect to sections isolated from a particular production, which correspond to "vowel-targets". These targets are identified as sections of the analyzed signal in which the formants are essentially "steady-state", i.e., are parallel to the time-axis. The intervening movement of one or more formants between the targets composes a section termed a "glide". By this view, a diphthong is characterized as a sequence consisting of an initial steady-state followed by a glide and a terminal steady-state. This is the approach taken in Lehiste & Peterson (1961). For a production to be considered a diphthong, they required that it exhibit "a vocalic nucleus containing two target positions" (p. 277), each of which is associated with a steady-state section. Lehiste & Peterson noted that the phonemes /eɪ/ and /oʊ/ "should not properly be classified as diphthongs" (p. 276) as the phoneme /eɪ/ was found to consist of a glide followed by a steady-state, while /oʊ/ consisted of a typically short steady-state followed by a glide. The phonemes /aʊ/, /aɪ/, /ɔɪ/ were, in contrast, considered diphthongs. However, the requirement that diphthong productions possess two steady-state target positions appears unduly restrictive, in particular when the effects of different speaking rates are considered. Gay (1968) examined the effects of differing rates of speech (slow, moderate, fast) on productions of /aʊ/, /aɪ/, /eɪ/, /oʊ/, /ɔɪ/. For slow and moderate speaking rates the presence of two steady-states was noted for all the phonemes studied, but for the fast speaking rate either the initial or final steady-state was found to be "negligible or not present" (p. 1571).
1.2. Diphthongs and monophthongs
The practice of transcribing diphthongs as a sequence of two monophthongs suggests that the diphthongs may in part be characterized in terms of formant values which identify these monophthongal segments. If particular diphthongs were identifiable with a distinctive set of initial and final monophthongal segments, such a characterization might provide a basis for distinguishing among the various diphthongs. Several studies have compared formant values at the beginning and end of diphthongs with those of the monophthongs by which the diphthong is usually transcribed. These studies differed in the manner by which the comparisons were made. Lehiste (1964), citing formant values (F1, F2 and F3) reported in the earlier study with Peterson, compared average formant values for initial and final steady-states of diphthongs for five subjects with those obtained for monophthongs by one subject (GEP). Holbrook & Fairbanks (1962) compared formant values obtained at the beginning and end of diphthongs with those for a set of representative vowel samples as displayed in an F1 by F2 space. Wise (1965) compared formant frequencies for the final steady-states of diphthong productions with "sustained utterances of /ɪ-i/ and /ʊ-u/" (p. 592). Though it is not clear what exactly Wise intends in the passage cited, his results are noted below.
The results of these studies can be summarized by noting the closest monophthongal vowel for the initial or final portion of the diphthong. For the initial portion, there was agreement among the studies only for the diphthong /ɔɪ/: both Lehiste (1964) and Holbrook & Fairbanks (1962) reported the closest monophthongal vowel as /ɔ/. For the diphthongs /aʊ/ and /aɪ/, the closest monophthongal vowel was reported variously as /a/, /ɑ/, /æ/, /ʌ/. Only Holbrook & Fairbanks (1962) record any monophthongal vowels as being close to the initial segments of /eɪ/, /oʊ/, /ju/: the monophthongs are given as /ɛ/, /ɔ/ and /ɪ/ (or /i/), respectively. For the final portion of the diphthong, there is even less consistency. For example, in Holbrook & Fairbanks (1962), the closest monophthongal vowel to the final portion of diphthongs transcribed with a final /-ɪ/ is recorded as /ɪ/, /i/ and even /ɛ/ (for the diphthong /aɪ/). Wise (1965) noted that the terminal frequencies of /oʊ/ were, on differing occasions, comparable to any one of the three monophthongs /ʊ/, /u/, /o/. In contrast, Lehiste (1964) noted that terminal frequencies of /oʊ/ were not comparable to either /ʊ/ or /u/. In addition, even though the diphthongs /aʊ/, /oʊ/ and /aɪ/, /eɪ/, /ɔɪ/ are sometimes transcribed as ending in the semivowels /w/ and /j/, respectively, both Lehiste (1964) and Wise (1965) noted that the terminal steady-states for these diphthongs were not comparable in formant frequencies to these semivowels. Gay (1968) reported that the final targets were not reached when diphthongs were produced at a fast tempo.
The diphthongs /aɪ/, /eɪ/, /ɔɪ/ showed higher first formant and lower second formant offsets as phoneme duration decreased while the diphthongs /aʊ/, /oʊ/ showed both higher F1 and higher F2 offsets. Dolan & Mimori (1986) reported the same pattern as Gay for F2 offsets (F1 offsets were not mentioned). The diphthong /ju/, examined only in Holbrook & Fairbanks (1962), ends with formant values comparable to those of /u/. It may be mentioned here that /ju/ is often distinguished from the other phonemes as an "ongliding diphthong". That is, it is usually considered to have a nucleus steady-state which follows rather than precedes the glide segment. Ladefoged, in his Course in Phonetics (1982), considers /ju/ as a diphthong on the basis of historical considerations (its development from a vowel) and simplicity of phonological statement.
Thus, it appears that there is, in general, variability with respect to the vowel positions at which the various diphthongs begin and end. In particular, no one diphthong can be distinguished as consisting of a singular sequence of monophthongs. Some of this variation may be due to the use of speakers from different dialects, a point which cannot be established from the studies cited. Nevertheless, an optimal scheme for discriminating among diphthongs should accommodate this variation. Further, even if initial and final portions of diphthongs are comparable to particular monophthongs in formant values, it is not clear whether it is legitimate to identify these portions with those monophthongs given their differences in other acoustic properties. Thus, Lehiste (1964) noted that the diphthongs /aʊ/, /aɪ/, /ɔɪ/ exhibited durations which were comparable to those of intrinsically long monophthongs and not to a summation of corresponding monophthong durations (cf. pp. 185-186 of that work for other considerations). Such differences may explain the comment in Lehiste & Peterson (1961) that "neither of the elements comprising the diphthongs is ordinarily phonetically identifiable with any stressed English monophthong" (p. 276). A similar remark is made by Ladefoged (1982), discussing the auditory quality of initial and final portions of diphthongs: "... contrary to the traditional transcriptions, the diphthongs often do not begin and end with any of the sounds that occur in simple vowels" (p. 76).
It may still be maintained that the initial and final portions of diphthongs (their "endpoints") provide acoustically relevant information. In particular, one might hypothesize that the various diphthongs are distinguished in terms of initial and final "targets" which need not coincide with any one monophthong. This "dual target" hypothesis has been proposed in Bladon (1985, pp. 145-146) and Miller & Chang (1989). Nearey & Assmann (1986, p. 1303), in discussion of Pols (1977), present a "dual target" hypothesis in which "two explicit vowel targets" are required to specify a diphthong. If it is assumed that these "targets" are defined in terms of steady-state segments, this claim should be differentiated from the broader thesis that the endpoints of a diphthong (regardless of steady-states) serve as acoustically relevant parameters in specifying diphthongs. This latter claim will be referred to as the "onset + offset hypothesis". It is evaluated in Section 3.2.2.
1.3. F2 rate of transition
Another basis for acoustically distinguishing diphthongs is suggested by the results of Gay (1968). In this study, the glide component of the diphthongs /aʊ/, /aɪ/, /eɪ/, /oʊ/, /ɔɪ/ was identified with the interval of movement in the second formant (F1 rates of change were not considered due to "possible measurement error effects", fn. 7, p. 1571). Particular diphthongs were found to show little variation in the rate of F2 transition (in Hz/ms) across changes in tempo. Moreover, the mean values reported for the F2 rate of transition at the different tempi fall into a distinct range for each diphthong. This suggests that these diphthongs may be distinguishable, at least in part, with respect to this parameter (cf. Gay, 1970, for a perceptual study of synthetic diphthongs in which the F2 rate of transition was varied). In contrast to Gay's results, Dolan & Mimori (1986) found that F2 rate of transition was in general not invariant across changes in tempo; the F2 rate of transition was found to increase significantly with increased tempo for the diphthongs /aʊ/, /eɪ/, /oʊ/, /ɔɪ/. In his 1968 paper, Gay also suggested that the formant values for the onset of a diphthong were relatively stable across changes in rate. The claim that diphthong productions can be classified in terms of the formant values at diphthong onset and the F2 rate of transition is referred to as the "onset + slope" hypothesis and is considered in further detail in Section 3.2.2 (cf. the "target plus slope hypothesis" noted in Nearey & Assmann, 1986, p. 1303).
1.4. Overall spectral properties of diphthongs
In Holbrook & Fairbanks (1962) the overall formant pattern displayed in the course of a particular diphthong production was considered without segmentation into steady-state and glide components. Instead, frequency and amplitude measurements for the first three formants of a given production were made from five sampling points along a spectrogram. Median frequencies of F1 and F2 at each of the successive sampling points were obtained from spectrographic displays of a given diphthong and then plotted as points in an F1 by F2 space (log scale). The successive points representing a given diphthong were then connected, forming a path within the space. Examination of these paths permitted Holbrook & Fairbanks to establish three classes of diphthongs. One class of "diverging diphthongs", comprising /aɪ/, /eɪ/, /ɔɪ/, was characterized by a falling first and a rising second formant. A class of "parallel" diphthongs, consisting solely of /oʊ/, displayed a decrease in both formants with F1 and F2 maintaining a constant ratio. The phonemes /aʊ/ and /ju/ were considered "converging diphthongs". That is, the values for the first two formants were found to converge in the course of the production of these phonemes. For the phoneme /aʊ/ the values of F1 and F2 both fell, the second more sharply than the first. For the phoneme /ju/ the value of the first formant rose slightly, whereas the second formant descended considerably.
Pols (1977, pp. 101-104), studying the Dutch diphthongs /au/, /ʌy/, /ɛi/, also emphasizes the overall course of spectral change in diphthong productions. His hypothesis, dubbed the "target plus direction hypothesis" in Nearey & Assmann (1986), considers target formant frequencies of an initial steady-state, together with direction of formant change rather than rate of formant change, to be crucial in specifying diphthongs. In Pols's work, direction was plotted within a two-dimensional space derived from a principal component analysis of average bandfilter spectra of vowel sounds (Pols, 1977). Miller & Chang (1989) propose essentially the same hypothesis with onset and direction specified in terms of an auditory-perceptual space proposed by Miller (1987; 1989). This hypothesis is addressed further in Section 3.2.2.
1.5. Overview of the present study
The present investigation considers three hypotheses as to the acoustically relevant properties of diphthongs. The three hypotheses are (1) the "onset plus offset" hypothesis, (2) the "onset plus slope" hypothesis, and (3) the "onset plus direction" hypothesis. These hypotheses are evaluated for productions of the English phonemes /aʊ/, /aɪ/, /eɪ/, /oʊ/, /ɔɪ/, and /ju/ in two segmental contexts ([b_d], [h_d]) and in two stress and two rate conditions. Each of the hypotheses is considered with respect to two forms of representation of the acoustic properties of vowel-like sounds. One form of representation is the log F1 × log F2 space. The other form of representation is the auditory-perceptual space (APS) proposed by Miller (1987; 1989). The speech research group at the Central Institute for the Deaf has been actively investigating the adequacy of the APS as a means for establishing, representing and evaluating the acoustic features of both consonantal and vocalic speech sounds (Chang, 1987; Miller, 1989; Hawks, 1990; Jongman & Miller, 1991; Fourakis, 1991). Since, within the APS, differences presented by talker, age and gender are normalized by consideration of formant ratios and their relation to a low-frequency reference (cf. Section 2.4.1), we anticipated that this space would provide a more adequate means for evaluating the three hypotheses. The principal results of the present study are that each of the hypotheses provided correct classifications of the diphthong corpus exceeding 90% and that slightly higher percent correct classifications were obtained when acoustic parameters were specified in the auditory-perceptual space rather than in the log F1 × log F2 space.
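To make the contrast between the three hypotheses concrete, the following Python sketch derives one possible feature set for each from the same pair of onset and offset measurements in a log F1 × log F2 space. It is only an illustration under stated assumptions: the function name, the use of base-10 logarithms, the expression of direction as an angle, the omission of F0 and F3, and the example formant values are all hypothetical, and the parameterization actually evaluated is the one given in Section 3.2.2.

```python
import math


def hypothesis_features(onset_hz, offset_hz, duration_ms):
    """Return illustrative feature tuples for the three hypotheses.

    onset_hz, offset_hz: (F1, F2) in Hz at diphthong onset and offset.
    duration_ms: time between onset and offset in milliseconds.
    All names here are hypothetical; F0 and F3 are left out for brevity.
    """
    # Onset and offset positions in a log F1 x log F2 space (base 10 assumed).
    on = tuple(math.log10(f) for f in onset_hz)
    off = tuple(math.log10(f) for f in offset_hz)

    # F2 rate of transition in Hz/ms, the unit used by Gay (1968).
    f2_slope = (offset_hz[1] - onset_hz[1]) / duration_ms

    # Direction of movement in the log space, expressed as an angle (assumption).
    direction_deg = math.degrees(math.atan2(off[1] - on[1], off[0] - on[0]))

    return {
        "onset_plus_offset": on + off,
        "onset_plus_slope": on + (f2_slope,),
        "onset_plus_direction": on + (direction_deg,),
    }


# Hypothetical token: F1 falls from 700 to 400 Hz, F2 rises from 1200 to 2000 Hz.
features = hypothesis_features((700.0, 1200.0), (400.0, 2000.0), 200.0)
```

In this sketch the three feature sets share the onset position and differ only in how the rest of the trajectory is summarized: by its endpoint, by the speed of F2 movement, or by the heading of the path.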
2. Methods
2.1. Speakers and speech material
The subjects were four native speakers of Midwestern American English, two male and two female, with no known speech or hearing disorders. Two sets of real English words were constructed containing the six diphthongs in two contexts: [b_d] and [h_d]. To minimize the number of non-English words used, the first context for the phoneme /ju/ was [b_t], an exception yielding the slang form "beaut" instead of the nonsense word [bjud] (see Table I). The target words were embedded in the carrier sentence shown, and four listings were constructed for each speaker, each randomizing over six repetitions of each test word. In two of these four listings, the word kay in the carrier sentence was displayed in upper-case and subjects were instructed to produce the sentence with stress on that syllable (making the target word "unstressed"). In the other two listings, the word kay was displayed in lower-case, and subjects were instructed to produce the sentence with stress on the target word (the "stressed" condition). Different randomizations were made for the lists for each speaker. One "stressed" list and one "unstressed" list were presented to the subjects in each of two recording sessions. In one recording session subjects were instructed to produce the sentences in a deliberative, careful manner (slow); in the other, the instructions were to speak as rapidly as comfortable without making mistakes (fast). Only four of the six repetitions were used in the analysis. If a particular utterance was heard as containing extraneous noise such as shuffling of papers, or presented problems in the determination of formant values (see below), another of the available repetitions was used instead. This process yielded 768 utterances [2 (rate) × 2 (stress) × 2 (contexts) × 6 (diphthongs) × 4 (repetitions) × 4 (subjects)].

TABLE I. Speech material
Carrier sentence:
  I will say kay ___ again. (Stress on test word)
  I will say KAY ___ again. (Stress on KAY)
Test words:
  IPA symbol    Context b_d    Context h_d
  aʊ            bowed          howed
  aɪ            bide           hide
  eɪ            bade           hayed
  oʊ            bode           hoed
  ɔɪ            Boyd           Hoyd
  ju            beaut          hewed
Four randomized lists of six repetitions of each word:
  List 1: Slow rate, stress on test item
  List 2: Slow rate, stress on KAY
  List 3: Fast rate, stress on test item
  List 4: Fast rate, stress on KAY

2.2. Recording and digitization
Each speaker read the lists in an anechoic chamber with the microphone placed at the height of the speaker's mouth and 0.5 m away from it. The recording was made using a low-noise microphone/pre-amplifier combination (Bruel and Kjaer 4179/2660). The speakers were instructed to speak in a normal conversational manner, which resulted in a signal of approximately 70 dBA at the microphone. Output from the microphone was channeled directly into a digital audio recorder (Sony PCM-501ES) operated in its 16-bit mode. The frequency response of the total system was within ±2 dB over the range of 10 Hz to 10 kHz. The signal-to-noise ratio was > 65 dB A-weighted. Immediately prior to the recording, subjects were requested to begin a preliminary recitation of the utterances, so that the experimenter could set the recording level to about -9 dB VU. After setting this level, the experimenter recorded a calibration tone. The calibration tone, generated by a synthesizer (Hewlett-Packard 3325A), consisted of a 1-kHz sine wave and was kept at a constant output level of 69.5 mV (equal to 70 dB SPL at the microphone, 1 V/Pa). The recording of the calibration tone allows calculation of the speaker's sound level in dB SPL. Digitizations of four of the six repetitions of each test word were made at 20 kHz with 16-bit precision, through a Digisound-16, a stimulus access processor, and both a 50 Hz analog high-pass filter (to remove incidental noise) and a 10 kHz anti-aliasing filter. Any residual AC noise or "hum" was removed by digitally notch filtering the files at 60 Hz after digitization. The files were stored on a MicroVax II or VAXstation 3200 computer to facilitate further processing by the Interactive Laboratory System (ILS) commercial software package.
2.3. Measurements-durational
Three interval durations were measured from the speech wave as displayed on a graphics terminal (Hewlett-Packard, Model 2623), using a hardware/software system which allows cursors to be placed at significant points along the waveform and returns durations for the marked intervals to the nearest millisecond.
(1) Total sentence duration was measured from the onset of the first glottal pulse marking the onset of voicing for the word I to the end of the nasal murmur in the final word again.
(2) Target word duration was measured from the onset of the initial closure following the syllable kay (in the context [b_d]) or from the onset of noticeable [h] frication (in the context [h_d]) until the end of the final closure. (The onset of [h] frication was sometimes difficult to locate, because the [h]s were generally voiced. In some cases, listening to various windowed sections of the waveform helped in determining the onset of the [h] sound.)
(3) Vocalic nucleus duration was measured from the onset of voicing following the release of the [b] or from the end of the frication associated with [h] until the onset of closure for the final consonant ([d] or [t]).
2.4. Spectral processing
The waveforms were then modified so that the signal for all but the isolated vocalic nucleus was set to zero, and the spectrum was analyzed using the API (LPC analysis with cepstral pitch extraction) command of ILS, with a 24 ms Hamming window moving in 1 ms steps, 24 poles, and a pre-emphasis factor of 98%. A fast Fourier transform (FFT) was then performed on the LPC autoregression coefficients using the spectrogram display (SGM) command of ILS in order to obtain values for the first three formants. The values for F0, F1, F2 and F3 were stored in a tabular-formant file which could be hand-edited. In cases where the cepstral analysis did not yield values for the fundamental frequency, these values were obtained by determining the time interval for three successive pulse periods, and then calculating the F0 values as the inverse of the average period determined. In cases where any of the first three formants was missing for brief intervals, direct FFTs were made on the waveform in order to determine appropriate formant values to be inserted into the file. In cases where the LPC analysis did not yield values for formants over large intervals, the root-solving command of ILS was used to obtain formant values for the token. This situation arose most frequently for those productions during which F2 and F3 were close in value (i.e., for tokens of /aɪ/, /eɪ/, /ɔɪ/, /ju/). Formant values for more than half the tokens of /eɪ/ and /ju/ were obtained by means of the root-solving command.
The resulting formant values were first smoothed by means of a 25-ms median smoother and then by an 11-ms rolling average. The size of the smoothing window is increased from 1 to N as it moves into the vocalic nucleus and is decreased from N to 1 as it reaches the end of the nucleus (N equals 25 for the median smoother and equals 11 for the rolling average smoother). This method of smoothing, reached after much trial and error, was found to eliminate seemingly irrelevant perturbations in formant values which were not eliminated in the prior hand-editing, while preserving major formant movements associated with consonant transitions and glide portions of the diphthong.
2.4.1. The APS transform
To obtain a representation of the analyzed productions within the auditory-perceptual space, the values of the first three spectral prominences (denoted as SF1-SF3) together with the fundamental frequency for each millisecond of a given production were transformed into values of three co-ordinates by the following equations:
$$x = \log(\mathrm{SF3}/\mathrm{SF2}),$$
$$y = \log(\mathrm{SF1}/\mathrm{SR}),$$
$$z = \log(\mathrm{SF2}/\mathrm{SF1}).$$
The SR is a reference frequency, given by the following equation:
$$\mathrm{SR} = 168\,(\mathrm{GMF0}/168)^{1/3}$$
where GMF0 is the geometric mean of the speaker's F0, which is usually calculated from the beginning of a syllable to the point of formant measurement. In Miller (1989) it is noted that most of the monophthongal vowels of American English fall into a narrow slab, called the "vowel slab". The axes of the space can, by the following equations, be linearly transformed so that this slab stands perpendicular to the "floor" of the transformed space:
$$x' = 0.7071\,(y - x)$$
$$y' = 0.8162\,(z) - 0.4081\,(x + y)$$
$$z' = 0.5772\,(x + y + z)$$
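The transform above can be written out as a short Python sketch. The function names are hypothetical, and the logarithm is assumed to be base 10, which is consistent with the conversion given later in the text between 0.005 log units per ms and roughly 1.67 octaves per 100 ms. The coefficients are copied as printed in the equations rather than recomputed.

```python
import math


def reference_frequency(gm_f0):
    """SR = 168 * (GMF0 / 168) ** (1/3), with GMF0 in Hz."""
    return 168.0 * (gm_f0 / 168.0) ** (1.0 / 3.0)


def aps_xyz(sf1, sf2, sf3, gm_f0):
    """Map SF1-SF3 (Hz) and the geometric mean F0 (Hz) to (x, y, z)."""
    sr = reference_frequency(gm_f0)
    x = math.log10(sf3 / sf2)  # base-10 logarithm is an assumption
    y = math.log10(sf1 / sr)
    z = math.log10(sf2 / sf1)
    return x, y, z


def aps_rotated(x, y, z):
    """Rotate (x, y, z) to (x', y', z') using the coefficients as printed."""
    xp = 0.7071 * (y - x)
    yp = 0.8162 * z - 0.4081 * (x + y)
    zp = 0.5772 * (x + y + z)
    return xp, yp, zp


# Hypothetical frame: SF1 = 500 Hz, SF2 = 1500 Hz, SF3 = 2500 Hz, GMF0 = 168 Hz.
x, y, z = aps_xyz(500.0, 1500.0, 2500.0, 168.0)
xp, yp, zp = aps_rotated(x, y, z)
```

Because x + y + z reduces to log(SF3/SR), the z' co-ordinate in this sketch depends only on the third spectral prominence relative to the reference frequency.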
When rhotacization or nasalization do not occur, most of the variation of vocalic spectra is captured within the x' and y' axes. Under these circumstances only small variations in z' are observed, which are related to lip-rounding. Diphthongs, as vocalic speech sounds which are neither rhotacized nor nasalized in English, would also be expected to fall within this slab. At the risk of losing some indications of lip-rounding or of changes in lip-rounding in the diphthongs /aʊ/, /oʊ/, /ju/, we used the two-dimensional x'y' space. To a first approximation, y' represents Fant's (1973) compact-diffuse dimension, while x' represents his grave-acute dimension (cf. Miller, 1989, p. 2126).
The values of F0 and the first three spectral prominences for each ms yield a point within the space; successive points over the course of a production yield what we term a "sensory path". The distance between two given points along a path indicates the amount of spectral change between the two formant patterns represented by these points. For purposes of illustration, the sensory path for /ɔɪ/ produced as an isolated citation form is displayed in Fig. 1. In the figure, triangles, oriented in the direction of the path, are positioned along the path at 50 ms intervals. In the research presented below sensory paths such as the one illustrated were obtained for the diphthong productions. The three hypotheses of Section 1.5 were then considered both in terms of various features of these paths in APS and in terms of related features within a log F1 × log F2 space.
2.4.2. Determination of initial and final transitions
In earlier discussion, the term "vocalic nucleus" has been employed to refer to the diphthongal segment inclusive of transitions from and to the surrounding consonants. Transitions from the preceding [b] and [h] to the diphthongal element are here termed "initial transitions"; transitions into the final consonant are "final transitions". The end of the initial transition is considered the diphthong "onset" while the starting point of the final transition is taken as the "offset" of the diphthong. (Note that this terminology differs from that in Lehiste & Peterson, 1961, where initial and final consonant transitions, as described here, are termed "onglide" and "offglide". We use "initial" and "final transition" instead because the terms "onglide" and "offglide" may also refer to the position of the glide within the diphthong with respect to a presumed "nucleus" steady-state, cf. Andruski & Nearey, 1992, p. 408, fn. 3, for further discussion.)

Figure 1. Sensory path for male production of /ɔɪ/ in null context. Triangles are placed every 50 ms along path indicating the extent of spectral change over the period and the direction of the path.

Consonant transitions are characterized by movement of one or more formants, from frequency positions associated with the preceding consonant into frequency positions characteristic of an "onset vowel" (the "initial transition"), and, conversely, from the frequencies characteristic of an "offset vowel" into those associated with the final consonant (the "final transition"). The formant movements expected in transitions depend upon the particular combination of consonant and vowel (Lehiste & Peterson, 1961). One approach to identifying consonant transitions is to quantify "rate of spectral change" and then find objective criteria for segmenting them by looking to see whether the formant transitions expected in a particular consonantal environment coincide with distinctive patterns in rate of spectral change (cf. Kewley-Port, 1982). The assumptions underlying this approach are that the end of an initial transition, if discernible, will follow upon a period of sizeable spectral change, and that a final transition will be initiated by a large spectral change. Another approach is to consider some fixed percentage of the total duration of the segment to comprise the initial or final transition (cf. Jenkins, Strange & Edman, 1983; Nearey & Assmann, 1986). The method described here incorporates features of both approaches.
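As a rough guide to the procedure that the remainder of this section describes, the following Python sketch computes a per-millisecond speed from a track of log-valued points, marks the "speed regions" that exceed the 0.005 log units per ms threshold, merges regions separated by short gaps, and applies the 15% and 30% criteria to pick the onset and offset frames. The function names and the exact reading of "began within" and "occurred within" are assumptions, and the smoothing of the speed plot is omitted; it is a sketch, not the authors' implementation.

```python
import math

THRESHOLD = 0.005  # log units per ms, the threshold stated in the text


def speed_per_frame(track):
    """Distance travelled between successive 1-ms frames of a log-valued track."""
    return [math.dist(a, b) for a, b in zip(track, track[1:])]


def speed_regions(speed, threshold=THRESHOLD):
    """Return (start, end) frame indices of runs where speed exceeds threshold."""
    regions, start = [], None
    for i, s in enumerate(speed):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(speed) - 1))
    return regions


def merge_regions(regions, max_gap_fraction=0.10):
    """Merge neighbours whose gap is under 10% of the span they jointly cover."""
    merged = []
    for start, end in regions:
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end - 1
            span = end - prev_start + 1
            if gap < max_gap_fraction * span:
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged


def find_onset(regions, n_frames, edge=0.15, max_region=0.30):
    """Frame after the last region beginning in the first 15% (assumed reading)."""
    early = [r for r in regions if r[0] < edge * n_frames]
    if not early:
        return 0
    start, end = early[-1]
    if (end - start + 1) > max_region * n_frames:
        return 0  # region too long: treated as a glide, not a transition
    return end + 1


def find_offset(regions, n_frames, edge=0.15, max_region=0.30):
    """Frame before the last region ending in the final 15% (assumed reading)."""
    late = [r for r in regions if r[1] >= (1.0 - edge) * n_frames]
    if not late:
        return n_frames - 1
    start, end = late[-1]
    if (end - start + 1) > max_region * n_frames:
        return n_frames - 1
    return start - 1


# Hypothetical speed plot: fast change in the first and last 10 ms of 100 ms.
speed = [0.01] * 10 + [0.001] * 80 + [0.01] * 10
regions = merge_regions(speed_regions(speed))
onset = find_onset(regions, len(speed))
offset = find_offset(regions, len(speed))
```

Keeping the region detection, merging, and onset/offset rules in separate functions makes it easy to vary one criterion, such as the 15% window, without touching the others.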
The rate of spectral change was calculated both in a four-dimensional space log F0 , log F~> log F 2 , log F 3 ) and in the auditory-perceptual space defined by x ', y' and z' in terms of the distance travelled over each successive ms of the vocalic segment (each ms corresponds to a frame). The resulting plots of speed ("speed plots") were then smoothed in a manner similar to that used for the formants themselves-in this case, a 5 ms median smoother was employed and then a 5 ms rolling average smoother. A sample of these speed plots was compared to formant displays in order to determine whether readily delimitable segments of the plots coincided with periods in which formants moved in the expected direction. A threshold for a significant rate of spectral change was determined as a speed of 0.005 log units per ms (this corresponds to a rate of about 1.67 octaves per 100 ms). Figure 2(a) displays a smoothed plot of the first three formants and fundamental frequency for a stressed production of /aJ/ in the context [b_d] by a male speaker at a slow tempo (using smoothing procedures of Section 2.4) . Figure 2(b) shows the speed plot which corresponds to the formant display in the upper panel. A "speed region" was defined as an interval during which speed values exceeded threshold. Successive speed regions were merged when the interval between the two was less than 10% of the interval between the beginning of the first and the end of the second. In Fig. 2(b), speed regions are indicated as the cross-hatched areas under the graph of the speed. Onset. In tokens with no significant speed regions in the first 15 % of the vocalic segment (i.e., no significant spectral change in this period), the onse t of the diphthong was taken at the beginning of the vocalic segment. Otherv.•ise . the onset was set to the frame value following the last speed region which be ga n wit hin the
Figure 2. (a) Display of the values of F0 and the first three formants for a male production of stressed /ai/ at a slow tempo in the context [b_d]. Vertical lines indicate the position of the onset and offset formant pattern. (b) Corresponding plot of speed of F0 and the first three formants over each successive ms for the token in the top panel. Cross-hatched areas indicate intervals during which speed exceeds the threshold value (= 0.005 log units per ms). Onset of the diphthong (i.e., end of transition from preceding consonant) is determined as the frame (ms) value after which speed falls below threshold within the initial 15% of the production. Offset (i.e., start of transition into following consonant) is determined as the frame (ms) value before speed exceeds threshold over the final 15% of the production. Vertical lines indicate position of onset and offset formant pattern. The abscissa in both panels is frame (ms).
initial 15% of the vocalic segment, unless the entire speed region comprised more than 30% of the segment. In this case, the onset of the diphthong was again taken to be the initial frame. These conditions are meant to preclude confounding glides within the diphthong with initial consonant transitions. Figure 2(b) illustrates the onset selected in accord with these criteria for this production of /ai/. The onset in this case follows immediately upon the first speed region. Offset. The criteria for choosing the offset were similar. In tokens with no speed regions in the last 15% of the vocalic segment, the offset of the diphthong was taken as the end of the vocalic segment. Otherwise, the beginning of the final transition was taken as the frame value preceding the last speed region which occurred within the last 15% of the vocalic segment, unless the entire speed region comprised more than 30% of the vocalic segment. In this case, the offset of the diphthong was taken as the last frame. The condition is stated to preclude confounding glides within the diphthong with final consonant transitions. Figure 2(b) illustrates the offset selected in accord with these criteria. The offset in this case immediately precedes the final speed region. This method provided acceptable values for the end of the initial transition and for the beginning of the final transition in the large majority of cases. There were 22
cases (3% of the total) in which the end of the initial transition was judged to be incorrect; most of these errors occurred for tokens of /au/ produced in the context [b_d] at a fast tempo. There were 18 cases (2% of the total) in which the beginning of the final transition was judged to be incorrect; most of these occurred for tokens of /au/ and /ou/. For those cases judged to be incorrect, two of the authors (MG, JDM) re-evaluated the formant and speed plots and determined appropriate values. These re-evaluations did not appreciably alter the classification results which were obtained (cf. Section 3.2.2).
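As an illustration only, the segmentation criteria described above can be sketched in Python. This is not the authors' implementation. The function names, the truncated windows at the edges of the smoother, the reading of "the interval between the two" as the frame distance from the end of one speed region to the start of the next, and the reading of "occurred within the last 15%" as "ended within the last 15%" are all assumptions made for the sketch. Logarithms are taken to base 10, which is consistent with the stated equivalence between 0.005 log units per ms and roughly 1.67 octaves per 100 ms.

```python
import math
from statistics import median

THRESHOLD = 0.005  # log10 units per ms, the value given in the text


def speed_track(frames):
    """Euclidean distance between successive 1-ms frames.

    `frames` is a list of tuples such as (log F0, log F1, log F2, log F3).
    """
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]


def smooth(values, width=5):
    """Running median, then running mean, each `width` frames wide.

    Assumption: windows are simply truncated at the edges.
    """
    half = width // 2

    def run(vals, fn):
        return [fn(vals[max(0, i - half): i + half + 1]) for i in range(len(vals))]

    return run(run(values, median), lambda w: sum(w) / len(w))


def speed_regions(speed, threshold=THRESHOLD):
    """Inclusive [first, last] frame spans where speed exceeds threshold.

    Neighbouring spans are merged when the gap between them is less than
    10% of the span from the start of the first to the end of the second.
    """
    regions, start = [], None
    for i, s in enumerate(speed):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            regions.append([start, i - 1])
            start = None
    if start is not None:
        regions.append([start, len(speed) - 1])
    merged = []
    for first, last in regions:
        if merged and (first - merged[-1][1]) < 0.10 * (last - merged[-1][0]):
            merged[-1][1] = last
        else:
            merged.append([first, last])
    return merged


def find_onset(regions, n_frames):
    """Frame after the last region that begins in the first 15% of the segment."""
    early = [r for r in regions if r[0] < 0.15 * n_frames]
    if not early:
        return 0
    first, last = early[-1]
    if (last - first + 1) > 0.30 * n_frames:  # too long to be a consonant transition
        return 0
    return min(last + 1, n_frames - 1)


def find_offset(regions, n_frames):
    """Frame before the last region that ends in the final 15% of the segment."""
    late = [r for r in regions if r[1] >= 0.85 * n_frames]
    if not late:
        return n_frames - 1
    first, last = late[-1]
    if (last - first + 1) > 0.30 * n_frames:
        return n_frames - 1
    return max(first - 1, 0)
```

For a hypothetical 200-frame token whose speed exceeds threshold only during the first and last 20 frames, `find_onset` returns frame 20 and `find_offset` returns frame 179, i.e., the frames immediately after and before the two transitions.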
2.4.3. Segmentation: methodological issues
After a number of preliminary efforts, we decided not to attempt segmentation of the productions into steady-state and glide components (cf. Gottfried, 1989). Instead, the parameters for classification of the productions were established on the basis of the diphthong's onset and offset (as described above and in Section 3.2.1). Identification of steady-state and glide components within an analyzed production presumes that a set of reliable criteria can be stated regarding the requisite extent of formant movement throughout some minimal duration which will at least operationally define these segments. Establishing reliable criteria is complicated by a number of factors. First, with the linear-predictive coding (LPC) analyses employed, plots of the first three formants for a production often exhibit small (< 10 ms) intervals in which one or more of the formants move. Such intervals are questionably glide segments and may well be an artefact of the LPC procedure. Second, whereas steady-state portions of monophthongal segments may be identifiable in respect to some presumed vowel target, it is not clear that any vowel target can be supposed for the nucleus of a diphthong (cf. Section 1.2). Finally, the assumption that portions of a diphthong admit of a bipartite categorization into steady-state and glide may itself require re-examination, given the well established presence of vowel-inherent spectral change or diphthongization even in some productions of monophthongs (see, inter alia, Nearey & Assmann, 1986; Nearey, 1989; Strange, 1989a,b; Andruski & Nearey, 1992). A consequence of this decision not to segment the analyzed productions is that possibly monophthongal productions of /ei/ and /ou/ are included in the data set of diphthong tokens to be classified. As just noted, even such monophthongal productions may exhibit a small but noticeable spectral change.
The establishment of parameters with respect to the "onset" and "offset" of a production provides a uniform basis for evaluating the classification procedure (cf. Section 4 for further discussion).

3. Results
3.1. Durations
The durational measurements were submitted to an analysis of variance performed in a four-factor design with 48 cells: two tempos x two stresses x two contexts x six diphthongs as correlated repeated measures on subjects. Thus, the subject by conditions mean square was used as the error term in calculating the F-ratios. For example, the stress by rate interaction was tested against the subjects by rate by stress interaction. The F-ratios and degrees of freedom are reported for significant
TABLE II. Mean durations in ms for sentences, with standard errors in parentheses. n = 192 for cells and 384 for column and row means

                                  Rate
Stress condition      Slow            Fast            Row mean
Stressed              1636 (19.59)    1176 (16.04)    1406 (17.27)
Unstressed            1637 (10.26)    1273 (5.65)     1455 (10.98)
Column mean           1636 (11.04)    1224 (8.84)     1430 (10.26)
effects (p < 0.05). The analyses indicated a number of interactions between the identity of the intended diphthong and tempo, stress, or context. Many of these interactions involved the diphthong /ju/, which, in the context of a preceding [b], was followed by [t], yielding the exceptional form "beaut". In those cases where these interactions were due solely to the inclusion of /ju/ in the data set (as confirmed by separate analyses of variance excluding these tokens), the interaction is not reported.

3.1.1. Sentence durations
Table II shows the mean duration (in ms) for sentences in the different tempo-stress conditions. Sentences spoken at a fast tempo were shorter in duration than those spoken at a slow tempo by 25%. The effect was significant [F(1, 3) = 17.84, p < 0.03]. This result indicates that subjects did change their rate of speech as instructed. There was also a three-way interaction among stress, context, and identity of the diphthong [F(5, 15) = 3.86, p < 0.02].

3.1.2. Target syllable durations
Table III shows the mean durations (in ms) for the target syllables pooled across diphthongs. Target syllables produced at a fast tempo were 23% shorter than those produced at a slow tempo. While sizeable, the effect only approaches significance [F(1, 3) = 7.75, p < 0.07]. Target syllables in the unstressed condition were 27% shorter than those in the stressed condition; this effect was significant [F(1, 3) = 24.35, p < 0.02]. The duration of the target syllable was 8% shorter in the [h_d] context than in the [b_d] context. While the difference was small, the effect of context was highly significant [F(1, 3) = 60.65, p < 0.01].

3.1.3. Vocalic segment durations
Tables IV and V provide mean durations (in ms) for the vocalic segments containing the various diphthongs, i.e., the diphthong inclusive of the transitions from and to the adjacent consonants. Vocalic segments produced at a fast tempo were 24% shorter than those produced at a slow tempo. While the effect was not significant, its size of about 25% is consistent with the effect for sentence and word durations. A
TABLE III. Mean durations in ms for target syllables pooled across diphthongs, with standard errors in parentheses. n = 96 for cells and 192 for column and row means

                                      Rate
Stress condition          Slow          Fast          Row mean
Stressed
  [b_d]                   387 (7.50)    298 (5.49)    342 (5.65)
  [h_d]                   354 (6.69)    261 (4.72)    308 (5.30)
  Column mean             371 (5.15)    279 (3.84)    325 (3.97)
Unstressed
  [b_d]                   272 (4.24)    220 (3.11)    247 (3.25)
  [h_d]                   260 (3.84)    199 (3.46)    229 (3.39)
  Column mean             266 (2.89)    209 (2.44)    238 (2.38)
Grand mean                318 (3.98)    244 (2.89)    281 (2.80)
TABLE IV. Mean durations in ms for each diphthong under each condition of stress and rate in context [b_d], with standard errors in parentheses. n = 16 for cells, 64 for rows and 96 for columns

                              Rate/stress condition
Diphthong       SS             SU             FS            FU            Row mean
au              263 (15.31)    176 (9.33)     191 (9.67)    138 (5.23)    192 (7.66)
ai              252 (15.51)    164 (12.85)    174 (7.76)    125 (4.88)    179 (7.93)
ei              220 (14.37)    153 (8.86)     171 (7.89)    116 (3.38)    165 (6.61)
ou              220 (12.53)    144 (9.41)     163 (7.03)    113 (4.35)    160 (6.53)
ɔi              244 (14.09)    168 (10.73)    197 (8.66)    129 (3.13)    185 (7.20)
ju              171 (9.96)     127 (7.23)     139 (7.15)    103 (4.27)    135 (4.76)
Column mean     228 (6.27)     155 (4.28)     173 (3.75)    121 (2.06)    169 (2.94)

SS = slow-stressed; SU = slow-unstressed; FS = fast-stressed; FU = fast-unstressed
TABLE V. Mean durations in ms for each diphthong under each condition of stress and rate in context [h_d], with standard errors in parentheses. n = 16 for cells, 64 for rows and 96 for columns

                              Rate/stress condition
Diphthong       SS             SU             FS            FU            Row mean
au              207 (10.59)    149 (8.58)     148 (3.78)    119 (4.20)    156 (5.39)
ai              203 (10.51)    150 (10.98)    139 (5.75)    116 (4.70)    152 (5.77)
ei              192 (10.38)    138 (6.42)     138 (3.92)    115 (4.28)    146 (4.86)
ou              182 (8.49)     132 (8.75)     143 (4.59)    101 (3.77)    140 (4.95)
ɔi              209 (11.19)    161 (9.43)     159 (6.32)    128 (4.11)    164 (5.43)
ju              179 (10.14)    131 (9.20)     138 (6.16)    110 (4.75)    140 (4.96)
Column mean     195 (4.25)     143 (3.74)     144 (2.20)    115 (1.92)    149 (2.17)

SS = slow-stressed; SU = slow-unstressed; FS = fast-stressed; FU = fast-unstressed
nearly equivalent reduction of duration was seen in the effect of stress: mean durations of the various vocalic segments when unstressed were 26% shorter than when stressed. This effect was significant [F(1, 3) = 17.05, p < 0.03]. Vocalic segments in the [h_d] context were about 12% shorter than those in the [b_d] context; this effect of context was significant [F(1, 3) = 14.19, p < 0.04]. Vocalic segments requiring less articulatory movement (/ei/, /ou/, /ju/) were 9% shorter than those which require greater articulatory movement (/au/, /ai/, /ɔi/). This effect of the identity of the diphthong was statistically significant [F(5, 15) = 38.00, p < 0.01]. This pattern was preserved in each rate condition and is in agreement with Holbrook & Fairbanks (1962) and Gay (1968). There were three interactions of note: between tempo and identity of the diphthong; between context and identity of the diphthong; and between stress and context. The difference in duration between vocalic segments produced at a slow tempo and those produced at a fast tempo was largest for /ai/ and /au/ (27%) and smallest for /ju/ (19%). The difference in duration between the slow and the fast tempo for the other diphthongs was intermediate in size (approximately 23%). The interaction between tempo and diphthong identity described above was significant [F(5, 15) = 6.5, p < 0.01]. A similar pattern among the diphthongs was found in respect to context. While all vocalic segments were shorter in duration in the [h_d] context than in the corresponding [b_d] context, the differences were largest (17%) for /au/ and /ai/, and were about the same (12%) for the other vocalic segments. In contrast, the duration of /ju/ was shorter in the [b_t] context than in the [h_d] context. The interaction between context and identity of the diphthong was highly significant [F(5, 15) = 18.00, p < 0.001].
Unstressed vocalic segments were 31% shorter than stressed segments in the [b_d] context whereas in the [h_d] context
unstressed segments were 24% shorter in duration. This interaction between stress and context effects was statistically significant [F(1, 3) = 12.41, p < 0.04].

3.1.4. Summary
In summary, the subjects altered their speaking rate as instructed. While there were differences among subjects, the effect of tempo was consistent for each subject at the level of sentence, target syllable and vocalic segment. For three of the subjects, sentence durations in the fast tempo were reduced by approximately 15% relative to the slow tempo, while for the fourth, this reduction was 37%. For each subject, target-syllable and vocalic segment durations were reduced by approximately the same percentages. Subjects also placed stress on the target- or dummy-syllable as instructed, as can be noted from the significant effect of stress on the duration of target words and vocalic segments.
3.2. Classification
While other investigators have noted various properties which might distinguish particular diphthongs (cf. Section 1), these properties have not been evaluated as parameters within a classification procedure. The principal aim of this study is to formulate in terms of well-delimited parameters the various hypotheses which have been proposed concerning the acoustically relevant properties of diphthongs (Section 3.2.1), and then to evaluate these hypotheses in terms of a Bayesian classification procedure applied to the data set of diphthongs produced in varying stress and tempo conditions (Section 3.2.2). A further concern is to compare the adequacy of the APS and the traditional log F1 x log F2 space as schemes for representing the appropriate parameters.

3.2.1. Three hypotheses
The onset + offset hypothesis states that the F-patterns at the onset and offset of diphthong tokens provide acoustically relevant parameters for classification of the tokens. Within a log F1 x log F2 space, the onset and offset of a particular diphthong token are given as the log values of F1 and of F2. Within the auditory-perceptual space, both onset and offset are given as pairs of x'y' values (cf. Section 2.4.1). A schematic illustration of this hypothesis, in a log F1 x log F2 space, is displayed in Fig. 3(a). The onset + slope hypothesis emphasizes rate of spectral change as well as the F-pattern at onset in the classification of diphthongs. To facilitate comparison with the results of Gay (1968), this hypothesis is presented both (a) in terms of the values of the first two formants at onset along with the F2 rate of transition (Hz/ms) and (b) in terms of the log-values of the first two formants and of the F2 rate of transition. A schematic illustration of this hypothesis is displayed in Fig. 3(b). There is no ready counterpart to this hypothesis within the formant-ratio based auditory-perceptual space. Nonetheless, within this space, slope, as rate of spectral change, can be stated as the Euclidean distance between the onset and offset over the duration of the diphthong. The onset and offset of the token are identified as pairs of x'y' values as above.
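The parameter sets for the first two hypotheses can be written out as a short Python sketch. The function names and the example formant values are hypothetical, and base-10 logarithms are an assumption. Measuring the duration between the onset and offset frames is likewise an assumption about what "the duration of the diphthong" denotes.

```python
import math


def onset_offset_features(f1_on, f2_on, f1_off, f2_off):
    """Onset + offset parameters in the log F1 x log F2 space (Hz in, log10 out)."""
    return (math.log10(f1_on), math.log10(f2_on),
            math.log10(f1_off), math.log10(f2_off))


def f2_slope_hz_per_ms(f2_on, f2_off, duration_ms):
    """F2 rate of transition: change in F2 from onset to offset per ms."""
    return (f2_off - f2_on) / duration_ms


def path_slope(onset, offset, duration_ms):
    """Slope in a ratio-based space: Euclidean onset-to-offset distance per ms.

    `onset` and `offset` are coordinate pairs such as (x', y').
    """
    return math.dist(onset, offset) / duration_ms


# Hypothetical token: F2 rises from 1200 Hz to 2000 Hz over 160 ms.
example_slope = f2_slope_hz_per_ms(1200.0, 2000.0, 160.0)  # 5.0 Hz/ms
```

Only the second formant enters the Hz/ms slope, whereas `path_slope` uses both coordinates of the space in which it is computed; this is the sense in which the auditory-perceptual formulation is a counterpart rather than an equivalent of the F2 rate of transition.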
Figure 3. Schematic illustration of parameters within a log F1 x log F2 space used to specify (a) onset + offset hypothesis, (b) onset + slope hypothesis, and (c) onset + direction hypothesis.
The onset + direction hypothesis emphasizes the significance of the course of formant movement in the classification of diphthongs. The course of formant movement is established from the onset and offset of each diphthong in the following manner. The straight line connecting the onset and offset points in each of the two spaces is referred to as a diphthong line. The length of this line (in log units) is an indication of the amount of overall spectral change associated with a particular token. If a set of co-ordinates is established with origin at the onset of the diphthong, the angle formed between the abscissa and the diphthong line represents the direction of formant movement over the course of production. A schematic illustration of this hypothesis, in a log F1 x log F2 space, is provided in Fig. 3(c). Figure 4 illustrates the determination of the angular value for the /ai/ token shown above in Fig. 2 within a log F1 x log F2 space. Figure 5 shows the direction of formant movement within the auditory-perceptual space for the same token. The diphthong line is illustrated in these figures by a dashed line between the point of onset and offset. The angular
Figure 4. Plot of formant movement in log F1 x log F2 space for the token of /ai/ presented in Fig. 2. A set of co-ordinates is erected at onset of the token. The diphthong line connecting onset and offset is indicated by a dashed line. The angle formed between the abscissa erected at onset and the diphthong line is marked by an arc, and measured in respect to the quadrants indicated in the figure.

Figure 5. Sensory path for the token of /ai/ presented in Fig. 2. A set of co-ordinates is erected at onset of the token. The diphthong line connecting the onset and offset points is indicated by a dashed line. The angle formed between the abscissa erected at onset and the diphthong line is marked by an arc, and measured in respect to the quadrants indicated in the figure. Use of the parameters illustrated is described later in the text.
value, indicating direction of formant movement, is established in respect to the four quadrants marked in the figures. Use of angular values in representing the course of formant movement poses two methodological problems. One is the matter of accurately recording the variance associated with the angular values of diphthongs (this is of consequence for the
classification procedure described below). This problem emerges in considering diphthongs whose angular values extend across the discontinuity between 0° and 359°. In order to circumvent this problem, we examined those diphthongs whose range of angular values extended across the zero-axis and adjusted angular values in those cases where assignment of values > 360° (or < 0°) yielded a smaller range of angles for a given diphthong. In the log F1 x log F2 space, this situation arose for two tokens of /au/ and one token of /ju/, which had values in the first quadrant (0.4% of the data set) instead of the fourth: in these three cases, the values were increased by 360°. In the auditory-perceptual space, the situation arose for four tokens (0.5% of the data set) of /ou/ in the fourth quadrant instead of the first. In these cases, angular values were decreased by 360°. The other problem is the possibly monophthongal productions of /ei/ and /ou/. As Nearey & Assmann (1986, p. 1305) note, "For vowels with near zero magnitudes of change, even small measurement errors would lead to large changes in direction." For reasons noted in Section 2.4.3, we decided not to screen the data set for monophthongal productions, a decision which might introduce some "noise" into the results presented below.
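The diphthong line and the wraparound adjustment can be illustrated with the following Python sketch. It assumes that points are given as (abscissa, ordinate) pairs, e.g. (log F1, log F2), and that angles are measured counter-clockwise from the abscissa erected at onset. The rule of shifting whichever subset of tokens is smaller when two shifts give the same range is also an assumption, chosen to mirror the adjustments reported above, and the function names are hypothetical.

```python
import math


def diphthong_line(onset, offset):
    """Return (length, angle in degrees within [0, 360)) of the onset-to-offset line."""
    dx = offset[0] - onset[0]
    dy = offset[1] - onset[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx)) % 360.0


def adjust_for_wraparound(angles):
    """Re-express one diphthong's angles so that their range is as small as possible.

    Three candidates are compared: the angles as given, the angles with values
    below 180 raised by 360, and the angles with values of 180 or more lowered
    by 360. Ties in range are broken in favour of moving fewer tokens.
    """
    up = [a + 360.0 if a < 180.0 else a for a in angles]
    down = [a - 360.0 if a >= 180.0 else a for a in angles]

    def cost(candidate):
        moved = sum(1 for a, b in zip(angles, candidate) if a != b)
        return (max(candidate) - min(candidate), moved)

    return min([list(angles), up, down], key=cost)
```

With hypothetical angles of 350°, 355° and 5° for one diphthong, the single first-quadrant value is raised to 365°, so the range shrinks from 350° to 15° and the class variance is no longer inflated by the discontinuity.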
3.2.2. Classification results
The three hypotheses were evaluated by means of a Bayesian classifier. A Bayesian classifier makes use of the statistical properties which specify classes of diphthongs to obtain a classification decision. The formula for Bayes' theorem is:

$$p(\omega_i \mid X) = \frac{p(\omega_i)\, p(X \mid \omega_i)}{\sum_{j=1}^{m} p(X \mid \omega_j)\, p(\omega_j)}$$

where $m$ is the number of classes, $p(\omega_i \mid X)$ is the probability that token $X$ is from class $\omega_i$, $p(X \mid \omega_i)$ is the multivariate normal probability density function, and $p(\omega_i)$ is the a priori probability. The denominator of the equation normalizes $p(\omega_i \mid X)$ into the range [0, 1] (Blake, 1979; Tou & Gonzalez, 1974). In order to evaluate the success of the three hypotheses (in their various formulations) in classifying diphthong tokens, the tokens were assembled into four groups corresponding to each tempo-stress condition. Each group comprises 192 tokens. For each of the three hypotheses, Bayesian classifications were first run using the resubstitution (R) method. To further appraise the success of the hypotheses, Bayesian classifications were run again, this time using the classification matrix obtained for the slow-stressed group of tokens to classify tokens in the other tempo-stress groups ("cross-classification"). Onset + offset hypothesis. Table VI lists the percent correct classification of diphthongs within the various tempo-stress groups and for the entire data set (noted as "All" in the tables). Results obtained by the R-method indicate that percent correct classification was greater than 95% for each of the conditions and greater than 93% overall regardless of the space within which the parameters are specified. There was higher correct classification for the slow-tempo groups than for the fast-tempo groups. For diphthong tokens spoken at a fast tempo, there was higher correct classification for the stressed groups than for the unstressed groups. This pattern also obtained for tokens in the slow tempo with the parameters specified in the log F1 x log F2 space. Other results, not recorded in Table VI, indicated that further specification of onsets and offsets by the log value of F3 (or of F3 as well as
TABLE VI. Percent correct classification: onset + offset hypothesis. Resubstitution method (first value) and cross-classification (in parentheses)

                                                        Condition
Parameters                              SS      SU             FS             FU             All
log F1i, log F2i, log F1f, log F2f      99.0    98.4 (90.1)    96.4 (92.2)    95.8 (83.3)    93.2
x'i, y'i, x'f, y'f                      97.9    99.5 (99.0)    97.4 (96.4)    96.4 (89.1)    97.8

SS = slow-stressed; SU = slow-unstressed; FS = fast-stressed; FU = fast-unstressed. Subscript i indicates values at onset of diphthong; subscript f indicates values at offset of diphthong
F0) yielded percent correct classifications which were comparable or slightly worse, whereas further specification of the onset and offset by z' values within APS yielded comparable or slightly better results for each condition. There appears to be a slight advantage gained by specification of the parameters within the APS. Results obtained by cross-classification (indicated in parentheses in Table VI) confirm this advantage: while there is a reduction in percent correct classification for each of the groups, this reduction was considerably less when the parameters were specified within the APS. Onset + slope hypothesis. Table VII lists the percent correct classifications for the three formulations of the onset + slope hypothesis. The R-method yields percent correct classifications > 90% for each tempo-stress condition and for the entire data set regardless of the formulation of onset and slope. Performance for the stressed tokens exceeds or is equal to that of the unstressed tokens for each tempo. The slow-stressed tokens are most effectively classified; the fast-unstressed tokens are generally least so. Within each of the tempo-stress conditions, one of the two specifications using F2 rate of transition is most effective; all three formulations of

TABLE VII. Percent correct classification: onset + slope hypothesis. Resubstitution method (first value) and cross-classification (in parentheses)

                                                   Condition
Parameters                         SS      SU             FS             FU             All
F1i, F2i, slope                    99.0    97.4 (86.5)    97.4 (94.8)    97.4 (82.8)    93.6
log F1i, log F2i, log slope        99.0    97.9 (85.9)    97.4 (94.8)    96.9 (78.1)    93.1
x'i, y'i, APS slope                96.9    93.2 (90.6)    96.4 (89.1)    91.1 (82.8)    93.8

SS = slow-stressed; SU = slow-unstressed; FS = fast-stressed; FU = fast-unstressed. Subscript i indicates values at onset of diphthong
TABLE VIII. Percent correct classification: onset + direction hypothesis. Resubstitution method (first value) and cross-classification (in parentheses)

                                               Condition
Parameters                     SS      SU             FS             FU             All
log F1i, log F2i, angle        99.0    93.2 (84.9)    96.9 (90.1)    89.1 (72.9)    91.5
x'i, y'i, angle                99.5    95.3 (87.0)    96.9 (94.8)    85.4 (80.2)    94.1

SS = slow-stressed; SU = slow-unstressed; FS = fast-stressed; FU = fast-unstressed. Subscript i indicates values at onset of diphthong
the onset + slope hypothesis perform comparably (93%) for the entire data set. Correct classifications are markedly lower using the cross-classification technique for the two unstressed groups in comparison with those obtained by resubstitution (tokens in the fast-unstressed group being least effectively classified). Onset + direction hypothesis. Table VIII lists the percent correct classification obtained for the various tempo-stress groups and the entire data set for the two formulations of the onset + direction hypothesis. When the R-method is employed, correct classification is high (> 93%) for all tempo-stress groups except the fast-unstressed condition, where correct classification exceeds 85%. For the entire data set, both formulations of the hypothesis yielded better than 91% effective classification. Correct classification is higher for the tokens produced at slow tempo. Within a given tempo, effective classification is higher for the stressed tokens. Both formulations of the hypothesis yield comparable results for each condition; for the entire data set, the APS representation is slightly more effective. Classification results obtained using the slow-stressed matrix are given in parentheses. For both representations, correct classification is highest for the fast-stressed group (> 90%) and lowest for those in the fast-unstressed group (< 81%). The APS representation is more effective within each of the tempo-stress groups.
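The resubstitution and cross-classification procedures of Section 3.2.2 can be sketched in Python as follows. To keep the sketch free of external libraries, it makes a simplifying assumption that departs from the text: each class density is a Gaussian with a diagonal covariance matrix (independent parameters), whereas the study used the multivariate normal density. The function names are hypothetical, and every class is assumed to contain at least two tokens with nonzero variance on each parameter.

```python
import math
from collections import defaultdict


def fit(tokens):
    """Estimate prior, means and variances per class.

    `tokens` is a list of (label, parameter_tuple) pairs.
    """
    by_class = defaultdict(list)
    for label, x in tokens:
        by_class[label].append(x)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / (n - 1)
                     for col, m in zip(columns, means)]
        model[label] = (n / len(tokens), means, variances)
    return model


def log_density(x, means, variances):
    """Log of a Gaussian density with diagonal covariance (simplifying assumption)."""
    return sum(-0.5 * math.log(2.0 * math.pi * var) - (v - m) ** 2 / (2.0 * var)
               for v, m, var in zip(x, means, variances))


def posteriors(model, x):
    """Bayes' theorem: prior times likelihood, normalized over all classes."""
    logs = {c: math.log(prior) + log_density(x, mu, var)
            for c, (prior, mu, var) in model.items()}
    top = max(logs.values())  # subtracted before exponentiating, for stability
    weights = {c: math.exp(value - top) for c, value in logs.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def classify(model, x):
    """Assign the token to the class with the highest posterior probability."""
    post = posteriors(model, x)
    return max(post, key=post.get)


def percent_correct(model, tokens):
    """Percentage of tokens whose assigned class matches the intended label."""
    hits = sum(classify(model, x) == label for label, x in tokens)
    return 100.0 * hits / len(tokens)
```

Under these assumptions, resubstitution for a group corresponds to `percent_correct(fit(group), group)`, and cross-classification to `percent_correct(fit(slow_stressed), other_group)`, where `slow_stressed` and `other_group` are hypothetical token lists for two tempo-stress conditions.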
4. Discussion
The research reported here was directed at investigating alternative hypotheses regarding the relevant parameters by which diphthongs in American English can be classified. An additional concern was to examine whether these parameters were better represented within a log F1 x log F2 space or within the auditory-perceptual space. Results from the pattern-recognition tests (Section 3.2.2) indicate that each of the proposed hypotheses provides effective parameters. Each of the hypotheses yielded correct classifications exceeding 90% over the entire data set of 768 tokens (R-method). For the entire data set (condition "All" of Tables VI-VIII), representation of these parameters in the auditory-perceptual space yielded slightly better classification performance in tests of each hypothesis. The three highest percentages are obtained for classification in the auditory-perceptual space (97.8, 94.1, 93.8), followed by those obtained in the log F1 x log F2 space (93.2, 93.1, 91.5). Taking each of the hypotheses in turn, we discuss the results further and consider ways in which they may be further evaluated.
4.1. Onset + offset hypothesis
Common to all of the hypotheses (onset + offset, onset + slope, onset + direction) is the onset parameter. The effects of tempo on the onset position of diphthongs have been considered by Gay (1968) and Dolan & Mimori (1986). The latter study suggested, in contrast to Gay (1968), that "the first element of a complex vowel tends to move toward the center of the vowel space as rate is increased" (Dolan & Mimori, 1986, pp. 142-144). The effect of tempo and stress on the onset position of diphthongs in the present work was evaluated, following Fourakis (1991), by first establishing a point in the auditory-perceptual space which corresponds to the hypothetical neutral reference vowel described by Fant (1960). This neutral vowel is associated with a uniform male vocal tract, 17.6 cm long, in which the first three resonances are 500, 1500 and 2500 Hz and the fundamental is 133 Hz. An identical point in the space is obtained for the neutral reference vowel produced by a uniform female vocal tract in which the first three resonances are 600, 1800 and 3000 Hz and the fundamental frequency is 230 Hz. The mean Euclidean distances between this point and the onsets of each diphthong in each tempo-stress condition were calculated in order to determine whether there is centralization of diphthong onsets. It was found that, for a given stress condition, the onsets of the diphthongs /ai/ and /ɔi/ did centralize with an increase in tempo. For the diphthongs /au/, /ai/, /ou/ and /ɔi/, the distance of the onset from the neutral point decreased going from the stressed to unstressed tokens. The absence of centralization for the onsets of the diphthongs /ei/ and /ju/ is in line with results reported in a comparable investigation of the effects of stress and tempo on American English monophthongs (Fourakis, 1991). Fourakis reports (p. 1823) that the front vowels [e] and [ɪ] did not centralize with an increase in tempo or decrease in stress.
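The resonances cited for the neutral reference vowel follow the standard quarter-wavelength relation for a uniform tube closed at one end, F_n = (2n - 1)c / 4L, so that successive resonances fall at odd multiples of the lowest one. The Python sketch below evaluates this relation and the mean onset-to-neutral distance used to assess centralization. The speed of sound (35 200 cm/s) is an assumption chosen so that a 17.6 cm tract yields a first resonance of 500 Hz, and the function names are hypothetical.

```python
import math


def uniform_tube_resonances(length_cm, n=3, c_cm_per_s=35200.0):
    """First n resonances (Hz) of a uniform tube closed at one end."""
    return [(2 * k - 1) * c_cm_per_s / (4.0 * length_cm) for k in range(1, n + 1)]


def mean_distance(neutral, onsets):
    """Mean Euclidean distance from the neutral point to a set of onset points."""
    return sum(math.dist(neutral, p) for p in onsets) / len(onsets)


# A 17.6 cm tract gives resonances of approximately 500, 1500 and 2500 Hz.
male_resonances = uniform_tube_resonances(17.6)
```

A smaller value of `mean_distance` for one tempo-stress condition than for another would indicate that onsets lie closer to the neutral point, i.e., that they are more centralized, in that condition.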
In respect to offset positions, it was found that, with the exception of /ou/, less extreme positions were reached in the unstressed condition than in the stressed. Within a given stress condition, the pattern across changes in tempo is in accord with Gay (1968): offsets reached by diphthongs produced at a fast tempo are less extreme than those produced at a slow tempo. The offsets of diphthongs produced at a fast tempo also displayed higher F1 values than those produced at a slow tempo. However, in contrast with Gay's (1968) results, the F2 values of diphthongs produced at a fast tempo were slightly less extreme than those produced at a slow tempo. It is of note that, despite the presence of centralization for the onsets of some diphthongs in differing tempo and stress conditions and the effects reported for offsets, classification performance for the onset + offset hypothesis remained high. The average correct classification across the tempo-stress groups was greater than 97% regardless of the acoustic space considered.

4.2. Onset + slope hypothesis
The classification results reported for the onset + slope hypothesis provide support for the suggestion in Gay (1968) that diphthongs display distinctive F2 rates of transition [cf. Manrique (1979) and Jha (1986) for supportive evidence in acoustic studies of Argentinian Spanish and Maithili respectively, and Dolan & Mimori (1986) for conflicting evidence for American English]. In the present study, the highest slope
values for F2 (calculated in Hz/ms) were obtained for /ɔi/, /ai/, /ju/, /au/ (in that order) and the smallest values for /ei/, /ou/. This ordering is largely in agreement with that of Gay (1968). The diphthongs /ai/, /ɔi/ in the present study exhibited specific slope values across conditions (5.11 and 8.11 Hz/ms, respectively), which are quite close to those observed by Gay (5.5 and 8.42 Hz/ms, respectively). These diphthongs traversed the greatest (Euclidean) distances within each of the acoustic spaces. The position of /au/ in the ordering above is consistent with the observation of Flege, Fletcher, McCutcheon & Smith (1986) that this diphthong can be articulated with relatively little tongue movement (and a presumed compensatory increase in lip-rounding). However, in Gay (1968) the slope values for /ei/ exceed those of /ou/ for each of three speaking rates (slow, moderate, fast), whereas this pattern obtained only for stressed tokens of /ou/ and /ei/ in the present study. Overall, these observations lend support to the articulatory principle proposed in Kent & Moll (1972, p. 296) that "the farther the tongue body has to move in executing a vowel gesture, the greater is the articulatory velocity", in brief, "the further, the faster". The F2 rate of transition is clearly an effective parameter in the classification of diphthongs. The average correct classification (across the tempo-stress groups) obtained for the onset + slope parameters (in F1 x F2 space, in log F1 x log F2 space) was 97.8%.

4.3. Onset + direction hypothesis
The present study may be compared with that of Holbrook & Fairbanks (1962) in considering direction of formant movement (evaluated as "angle" in each of the acoustic spaces). In their study, Holbrook & Fairbanks (fig. 4, p. 49) provide paths for each of the diphthongs by connecting for each diphthong those (sampled) points representing the median frequencies of F1 and F2 (on a log-scale).
Angular values for these paths are obtained by considering the angle formed by the initial point and arrow head for each diphthong path in their fig. 4. The following list provides the Holbrook-Fairbanks values, each followed in parentheses by the corresponding value obtained in the present work: /au/ 255 (212), /ai/ 115 (144), /ei/ 135 (167), /ou/ 245 (146), /ɔi/ 95 (101), /ju/ 285 (283). As may be seen, the angular values obtained for the various diphthongs are comparable, with the exception of /ou/. This discrepancy between the values for /ou/ may be due to the presence of monophthongal tokens of /ou/ in the present data set.

The onset + direction hypothesis was only slightly less effective than the others in classifying diphthong tokens. The average correct classification across the various tempo-stress groups was about 94% in either acoustic space.

4.4. Further work

The general success of each of the hypotheses lends support to the earlier studies reviewed in the Introduction which emphasized one or other of the parameters. The effect of stress and tempo does not appear to present any major difficulties in obtaining correct classification of diphthong tokens. This leads to the question, in what way might a more decisive evaluation of these hypotheses be obtained? Note
that all of the hypotheses make reference to the onset and offset values of one or more formants. Since (as we noted in Section 2.4.2) the determination of onset and offset values is dependent upon an assessment of initial and final transitions, a more stringent evaluation of these hypotheses should consider a wider range of consonantal environments. A follow-up project is currently underway which extends the consonantal environments to include the contexts [d_d] and [g_d]. Further demands would be imposed if classification were attempted for diphthongs together with monophthongs. This is also planned in the project mentioned.

Apart from such considerations, there is the question of how these hypotheses may themselves be further refined. One obvious direction is to consider values for the angle and slope parameters as derived from glide segments of the diphthongs. Speed of formant movement, considered above in delimiting transitions, may provide one indication of such segments, though there are outstanding questions to be answered regarding both appropriate ways of assessing spectral distance and the perceptual import of these formant movements in identification of diphthongs.

The authors wish to recognize the assistance of Caroline Monahan on statistics and Aaron Schlafly on data collection as well as the efforts of Eric Frederick and Mervyn Pow. Marios Fourakis, John Hawks and Ira Hirsh provided useful discussion throughout the course of this work. Particularly helpful comments were provided by an anonymous reviewer and, especially, by Terry Nearey and the editor. This research was supported by NIDCD Grant R01-DC00296 to the Central Institute for the Deaf.
References

Andruski, J. E. & Nearey, T. M. (1992) On the insufficiency of compound target specification of isolated vowels and vowels in /bVb/ syllables, Journal of the Acoustical Society of America, 91, 390-410.
Bladon, A. (1985) Diphthongs: A case study of dynamic auditory processing, Speech Communication, 4, 145-154.
Blake, I. F. (1979) An introduction to applied probability. New York: John Wiley.
Chang, H. (1987) SWIS: See what I say, a speaker-independent word recognition system by phoneme-oriented mapping on a phonetically encoded auditory-perceptual speech map. Washington University Doctoral dissertation, St Louis.
Dolan, W. & Mimori, Y. (1986) Rate-independent variability in English and Japanese complex F2 transitions, UCLA Working Papers in Phonetics, 63, 125-153.
Fant, G. (1960) Acoustic theory of speech production. The Hague: Mouton.
Fant, G. (1973) Speech sounds and features. Cambridge, MA: MIT Press.
Flege, J. E., Fletcher, S., McCutcheon, M. J. & Smith, S. C. (1986) The physiological specification of American English vowels, Language and Speech, 29, 361-388.
Fourakis, M. (1991) Tempo, stress and vowel reduction in American English, Journal of the Acoustical Society of America, 90, 1816-1827.
Gay, T. (1968) Effects of speaking rate on diphthong formant movements, Journal of the Acoustical Society of America, 44, 1550-1573.
Gay, T. (1970) A perceptual study of American English diphthongs, Language and Speech, 13, 65-88.
Gottfried, M. (1989) Some acoustic properties of diphthongs, Journal of the Acoustical Society of America, 86, S123.
Hawks, J. W. (1990) Perceptual aspects of a three-dimensional vowel space. Washington University Doctoral dissertation, St Louis.
Holbrook, A. & Fairbanks, G. (1962) Diphthong formants and their movements, Journal of Speech and Hearing Research, 5, 38-58.
Jenkins, J. J., Strange, W. & Edman, T. R. (1983) Identification of vowels in "vowelless" syllables, Perception and Psychophysics, 5, 441-450.
Jha, S. K. (1985) Acoustic analysis of the Maithili diphthongs, Journal of Phonetics, 13, 107-115.
Jongman, A. & Miller, J. D. (1991) Method for the location of the burst-onset spectra in the auditory-perceptual space: a study of place articulation in voiceless-stop consonants, Journal of the Acoustical Society of America, 89, 867-873.
Kent, R. D. & Moll, K. L. (1972) Tongue body articulation during vowel and diphthong gestures, Folia Phoniatrica, 24, 278-300.
Kewley-Port, D. (1982) Measurement of formant transitions in naturally produced stop consonant-vowel syllables, Journal of the Acoustical Society of America, 72, 379-389.
Ladefoged, P. (1982) A course in phonetics. New York: Harcourt Brace Jovanovich.
Lehiste, I. (1964) Acoustical characteristics of selected English consonants. Bloomington: Indiana University.
Lehiste, I. & Peterson, G. E. (1961) Transitions, glides and diphthongs, Journal of the Acoustical Society of America, 33, 268-277.
Manrique, A. M. B. (1979) Acoustic analysis of the Spanish diphthongs, Phonetica, 36, 194-206.
Miller, J. D. (1987) Auditory-perceptual processing of speech waveforms. In Auditory processing of complex sounds (W. A. Yost & C. S. Watson, editors), pp. 257-266. Hillsdale, NJ: Lawrence Erlbaum Associates.
Miller, J. D. (1989) Auditory-perceptual interpretation of the vowel, Journal of the Acoustical Society of America, 85, 2114-2134.
Miller, J. D. & Chang, H. (1989) Patent #4820059, U.S. Patent Office. (49 figures, 58 columns).
Nearey, T. M. (1989) Static, dynamic and relational properties in vowel perception, Journal of the Acoustical Society of America, 85, 2088-2113.
Nearey, T. M. & Assmann, P. F. (1986) Modeling the role of inherent spectral change in vowel identification, Journal of the Acoustical Society of America, 80, 1297-1308.
Pols, L. C. W. (1977) Spectral analysis and identification of Dutch vowels in monosyllabic words, University of Amsterdam, Doctoral dissertation, Academische Pers B. V., Amsterdam.
Strange, W. (1989a) Evolving theories of vowel perception, Journal of the Acoustical Society of America, 85, 2081-2087.
Strange, W. (1989b) Dynamic specification of coarticulated vowels spoken in sentence context, Journal of the Acoustical Society of America, 85, 2135-2153.
Strange, W., Jenkins, J. J. & Johnson, T. L. (1983) Dynamic specification of coarticulated vowels, Journal of the Acoustical Society of America, 74, 695-705.
Tou, J. C. & Gonzalez, R. C. (1974) Pattern recognition principles. Reading, MA: Addison-Wesley.
Trager, G. & Smith, H. L. (1951) An outline of English structure. Norman: Battenburg Press.
Wise, C. M. (1965) Acoustic structure of English diphthongs and semi-vowels vis-a-vis their phonemic symbolization. In Proceedings of the fifth international congress of phonetic sciences (E. Zwirner & W. Bethge, editors), pp. 589-593. Basel: S. Karger.