Vowel normalization: the role of fundamental frequency and upper formants


Journal of Phonetics 32 (2004) 423–434 www.elsevier.com/locate/phonetics

Benjamin Halberstam (a,*), Lawrence J. Raphael (b)

(a) Lehman College, City University of New York, 250 Bedford Park Boulevard West, Bronx, NY 10468, USA
(b) Adelphi University, 1 South Avenue, Garden City, NY 11530, USA

Received 12 March 2002; received in revised form 29 January 2004; accepted 2 March 2004

Abstract

Some vowel normalization schemes assume the perceptual exploitation of f0 and F3 information. These schemes implicitly predict that access to such information should improve listeners' classification of vowels in a mixed-speaker condition relative to their classification in a blocked-speaker condition. In this study listeners classified naturally produced phonated and whispered (no f0 information) vowels in which formants above F2 were either present or filtered out. Results provided some support for the use of f0 in vowel normalization; results for F3 were inconclusive. An unexpected finding was that upper formants were more important for whispered vowel classification than for phonated vowel classification.
© 2004 Elsevier Ltd. All rights reserved.

1. Introduction

It has been evident for some time that a simple acoustic target model, using the first two formants as the sole cues for vowel perception, is inadequate to explain vowel perception (Jenkins, 1987). "Vowel normalization models" (Disner, 1980) claim that supplemental acoustic information is used to disambiguate vowels that have similar frequency values for the first and second formants. Such vowel normalization approaches suggest that the resultant normalized formant frequency values are invariant for each vowel across speakers. The present study focuses on intrinsic normalization schemes that incorporate f0 and F3 as perceptual cues to disambiguate vowels (Miller, 1989; Syrdal, 1985). Although Miller's approach and Syrdal's approach differ in several respects, they both report data showing that the use of f0

*Corresponding author. Tel.: +1-718-960-8135. E-mail address: [email protected] (B. Halberstam).
0095-4470/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.wocn.2004.03.001


and F3 as additional parameters significantly enhances algorithmic vowel classification rates. Disner (1980) and Hillenbrand and Gayvert (1993) report similar findings. However, as Hillenbrand and Gayvert (1993, p. 698) point out, data-analytic evaluations of perceptual models "can suggest logically possible perceptual strategies, but other information is required to determine whether listeners actually adopt a proposed strategy."

Many studies (Fujisaki & Kawashima, 1968; Slawson, 1968; Ainsworth, 1975; Traunmüller, 1981; Nearey, 1989; Hirahara & Kato, 1992; Hoemeke & Diehl, 1994; and Fahey, Diehl, & Traunmüller, 1996) have shown that for given F1 and F2 values, vowel perception is influenced by f0, and some studies (Fujisaki & Kawashima, 1968; Nearey, 1989; Slawson, 1968) have shown that vowel perception is influenced by F3. None of the above-mentioned studies, however, specifically examines whether f0 and F3 information improve vowel classification in the case of multiple speakers. Thus the key claim of the aforementioned vowel normalization schemes, that f0 and F3 information contribute to the accurate classification of vowels spoken by multiple, unknown speakers, has not been thoroughly tested.

This issue has been examined in a previous version of the present study (Halberstam, 1998) and in a study done by Nusbaum and Morin (1992). Nusbaum and Morin (1992) synthesized simulated "phonated" and "whispered" vowels with and without acoustic energy at frequencies higher than F2. In a blocked-speaker condition (where each block simulated a single speaker), approximately 95% of the stimuli of all types were correctly identified. In a mixed-speaker condition (in which successive vowels simulated four different "speakers"), unfiltered phonated vowels were identified best, followed closely by phonated filtered vowels, followed more distantly by whispered unfiltered vowels and, finally, by whispered filtered vowels.
The implication of this research is that in a mixed-speaker condition, in which listeners must categorize vowels whose formant frequencies are similar, f0 performs a significant role in disambiguation of vowels, while the formants above F2 seem to perform a secondary role. There are, however, two potentially important concerns about generalizing from the results of Nusbaum and Morin (1992). The first is that perhaps the whispered vowels in their study would have been classified more accurately if their formant frequencies had been higher than those of the phonated vowels, as they are in naturally produced vowels (Peterson, 1961; Kallail & Emanuel, 1984; Eklund & Traunmüller, 1997). Second, had Nusbaum and Morin (1992) used a greater number of "speakers," it is possible that they would have found lower mixed-speaker classification rates. Creelman (1957) reported that there is a tendency for percent correct identification to decrease as the number of talkers increases.

The present study was designed to test whether results similar to those of Nusbaum and Morin (1992) would be obtained for naturally-spoken rather than for synthesized vowels. In addition, 15 speakers varying in age and sex were used to increase the potential for overlap of vowel categories in the F1 × F2 space. While it is true that formant frequencies and other acoustic parameters can differ significantly between phonated and whispered vowels produced by the same speaker (Peterson, 1961; Kallail & Emanuel, 1984; Eklund & Traunmüller, 1997; Katz & Assman, 2001), any perceptual difficulties associated with whispered vowels per se should logically have similar effects on vowels presented both in blocked-speaker and in mixed-speaker conditions. If the decrement in classification rates associated with the mixed-speaker condition is greater for the whispered than for the phonated vowels, we would conclude that the absence of f0 information had limited the listener's ability to normalize in both conditions.


2. Stimuli

2.1. Speakers

Speakers were five adult males, 19–29 years of age, five adult females, 18–49 years of age, and five children, 7–14 years of age. All were native speakers of New York City English.

2.2. Vowels

The speakers were recorded producing each of nine phonated and nine whispered vowels in a /hV/ syllable. The uncharacteristic occurrence of short or "lax" vowels in syllable-final position was mitigated through the use of an elicitation technique described below. The vowels in the set were /i ɪ ɛ æ ɑ ɔ ʌ ʊ u/.

2.3. Elicitation technique

The experimenter spoke each vowel to the speaker, who then said "I just heard the /hV/ again" twice, once fully phonated and once in a whisper. The speakers were provided with a printed English word corresponding to each /hVd/ utterance (e.g., "head") and were told to omit the /d/. The /hV/ context follows Kahn (1978) and allows the use of vowels in a familiar context (i.e., the /hVd/ "words"), without the coarticulation cues found in other consonantal environments. Additionally, speakers paused before the word "again" in order to minimize the coarticulatory effects of the following schwa on the test vowel. The experimenter corrected the speaker's pronunciation of a vowel when he perceived it to be inaccurately produced. Recordings were made with an Electrovoice omnidirectional microphone, model 635A, and a Marantz tape recorder, model PMD420.

2.4. Measurement and filtering of stimuli

The vowels and their carrier phrases were digitized at a 16 bit, 20 kHz sampling rate on a Gateway 2000 computer. Using a sound-editing program, the target vowels were excised from the carrier phrase and their average amplitude levels were equated. Wide-band spectrograms and averaged amplitude spectra for all of the vowel segments were made.
F2 and F3 frequencies were measured for each vowel by locating a cursor at the peaks of averaged wide-band amplitude spectra for the entire phonated segment that was displayed on a computer monitor. Two copies of each of the excised vowels were made. One copy of each vowel was low-pass filtered using a bank of two Stanford Research Systems model SR650 programmable filters. The output was recorded on a digital tape using a Sony portable DAT, model TCD-D7. The filters had an extremely steep cut-off of approximately 120 dB per octave. The cut-off frequencies were set to 100 Hz above the center frequency of F2. This effectively removed F3 and higher formant information. Second formant energy was always clearly present after filtering, as confirmed visually through spectrograms and averaged amplitude spectra. Fig. 1 shows amplitude spectra for filtered and unfiltered /i/ stimuli produced by a male speaker. Despite the closeness of some of the F2 and


Fig. 1. (a) Amplitude spectra for the unfiltered /i/ stimulus produced by a male speaker. (b) Amplitude spectra for the same stimulus shown in Fig. 1(a) after filtering.


F3 peaks, it can be seen that the procedure removes F3 information while sparing the F2 information. Four stimulus types were generated: phonated-filtered, whispered-filtered, phonated-unfiltered, and whispered-unfiltered. The total number of stimuli generated was 540 (9 vowels × 15 speakers × 4 types).
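The filtering step described above can be approximated digitally. The sketch below is a minimal stand-in, not the authors' procedure: it uses a windowed-sinc FIR low-pass in place of the analog SR650 filters, and assumes a hypothetical /i/ token with F2 near 2300 Hz and F3 near 3000 Hz, with the cut-off set 100 Hz above F2 as in the text.

```python
import numpy as np

def lowpass_sinc(signal, cutoff_hz, fs, numtaps=1001):
    """Windowed-sinc low-pass FIR filter (digital stand-in for the
    steep analog low-pass filters described in the text)."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(2 * cutoff_hz / fs * n) * np.hamming(numtaps)
    h /= h.sum()  # unity gain in the passband
    return np.convolve(signal, h, mode="same")

fs = 20_000        # sampling rate used in the study
f2 = 2300          # hypothetical F2 of an /i/ token, Hz
cutoff = f2 + 100  # cut-off 100 Hz above F2, as in the text

t = np.arange(0, 0.2, 1 / fs)
low = np.sin(2 * np.pi * f2 * t)      # "F2" component, to be kept
high = np.sin(2 * np.pi * 3000 * t)   # "F3" component, to be removed
filtered = lowpass_sinc(low + high, cutoff, fs)

# spectrum of the filtered signal, for inspection
spec = np.abs(np.fft.rfft(filtered))
freqs = np.fft.rfftfreq(len(filtered), 1 / fs)
```

Inspecting `spec` shows the 2300 Hz peak essentially intact while the 3000 Hz peak is attenuated by roughly 50 dB, mirroring the spectra in Fig. 1.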

3. Testing procedure

3.1. Subjects

Listeners were four males and four females, 18–21 years of age. All were native speakers of New York City English who reportedly had normal hearing and who were drawn from the same community as the speakers. None of the subjects had any experience in vowel transcription. The subjects were paid for their participation.

3.2. Stimulus presentation

Stimuli were presented over Telephonics TDH50 headphones with an impedance of 60 Ω, using a Sony portable DAT, model TCD-D7, with an output impedance of 27 Ω.

3.3. Training

All listeners were trained immediately before testing. Listeners were presented with stimuli recorded from one adult male speaker and one female child speaker. The training stimuli and those of the listening test were generated using identical procedures. The training stimuli were not employed in the listening test. Examples of each stimulus type were included. The total number of stimuli used in training was 72 (9 vowels × 2 speakers × 4 types). After the presentation of each training stimulus, subjects circled one of the keywords on an answer sheet. The keywords were "heed, hid, head, had, hod, hawed, hud, hood, who'd."1 The experimenter informed the subject whether the response matched the speaker's intended vowel. If it did not, the experimenter informed the subject what the intended vowel was and presented the same stimulus again. Potential subjects whose initial responses were below the criterion of 80% correct repeated the training. All potential subjects met the criterion in the first or second training session before proceeding to the experiment. The criterion for subject inclusion was the average percent correct score for all of the stimulus types.

3.4. Testing

The stimuli were presented in blocks of 45, at 4.5-s inter-stimulus intervals, with an interblock interval of 30 s.

1. "HUD" is the name of a federal housing program in the United States. A "hod" is a bucket used to hold solid fuels such as coal or wood.


Each of the four stimulus types (phonated-filtered, whispered-filtered, phonated-unfiltered, and whispered-unfiltered) was presented separately in quasi-random order in a blocked-speaker condition and in a mixed-speaker condition. All subjects heard all eight groups of stimuli (4 types × 2 conditions), for a total of 1080 stimuli (4 types × 2 conditions × 15 speakers × 9 vowels). The order of presentation of these groups was counterbalanced across subjects. Following Assman, Nearey, and Hogan (1982), subjects circled words on answer sheets containing lines of the following keywords: "heed, hid, head, had, hod, hawed, hud, hood, who'd."

4. Results

4.1. Group means and standard deviations

Each subject's data were scored for percent correct identification of the target vowels. The group means and standard deviations were calculated for each of the eight conditions and are presented in Fig. 2. The best classification rates were found for the phonated-unfiltered vowels

Fig. 2. Percent correct identification of phonated and whispered vowels, with and without filtering to remove F3 and upper formants, under two presentation conditions. The data displayed are group means and standard errors for eight listeners.


Table 1
Three-way ANOVA for presentation condition × phonatory type × availability of upper formants

Source of      Estimated      df   Error term      Estimated    df   F ratio   p-level
variance       mean square         (listeners)     mean square
Condition        214.51       1    C × S             20.83      7     10.30    0.015
Phonation       2725.49       1    P × S             16.60      7    164.16    0.000004
Filtering        245.12       1    F × S             20.68      7     11.85    0.011
C × P            139.45       1    C × P × S         29.82      7      4.68    0.067
C × F             28.88       1    C × F × S         13.67      7      2.11    0.189
P × F            145.41       1    P × F × S          4.93      7     29.48    0.001
C × P × F          0.07       1    C × P × F × S     17.20      7      0.00    0.952

(C = presentation condition, P = phonatory type, F = filtering, S = subjects.)
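Because each F ratio in Table 1 is simply the effect mean square divided by its listener-interaction error mean square, the table can be spot-checked mechanically. The sketch below (row values transcribed from Table 1) recomputes each F ratio; agreement is approximate only because the printed mean squares are rounded.

```python
# (source, MS_effect, df_effect, MS_error, df_error, F_reported)
rows = [
    ("Condition",  214.51, 1, 20.83, 7,  10.30),
    ("Phonation", 2725.49, 1, 16.60, 7, 164.16),
    ("Filtering",  245.12, 1, 20.68, 7,  11.85),
    ("C x P",      139.45, 1, 29.82, 7,   4.68),
    ("C x F",       28.88, 1, 13.67, 7,   2.11),
    ("P x F",      145.41, 1,  4.93, 7,  29.48),
    ("C x P x F",    0.07, 1, 17.20, 7,   0.00),
]

for source, ms_eff, _, ms_err, _, f_reported in rows:
    f = ms_eff / ms_err  # F ratio: effect MS over its error MS
    # tolerance allows for rounding in the published mean squares
    assert abs(f - f_reported) < 0.1, source
```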

when presented in blocked-speaker or mixed-speaker condition and the phonated-filtered vowels presented in blocked-speaker condition (all approximately 87%).2 A three-way ANOVA was carried out on the mean identification data. The factors in the design were presentation condition (blocked or mixed speakers), phonatory type (phonated or whispered) and upper formant availability (unfiltered or filtered). The error terms were provided by interlistener differences. The results of the ANOVA are reported in Table 1.

4.2. Main effects

4.2.1. Presentation condition

Percent correct identification was higher in the blocked-speaker than in the mixed-speaker condition. On average, 82.0% of the vowels were correctly identified in the blocked-speaker condition, compared with 78.3% in the mixed-speaker condition. The difference in classification rates was significant (p=0.015). The significant main effect for presentation condition indicates that when the stimuli used in this study were presented in a mixed-speaker condition, they did present the sort of perceptual difficulty caused by overlap in formant frequencies that vowel normalization models attempt to overcome.

4.2.2. Phonatory type

A highly significant effect was found for phonatory type. Percent correct identification was higher for the phonated than for the whispered stimuli. On average, 86.7% of the phonated vowels were correctly identified, compared with 73.6% of the whispered vowels. The difference of

2. Correct identification in the best condition was only about 87%. It might be argued that a substantial transcription error affected apparent additivity. Additional analysis was done assuming a "true" error rate of 0.05, similar to those reported by Hillenbrand and Gayvert (1993) and others, rather than the higher error rate found in the present study. This analysis used logic comparable to "Abbott's formula" for the adjustments for "natural responsiveness" in bioassay (Finney, 1971).
The observed error was adjusted as follows to get a rough estimate of true error rates: Wc = (Wo − C)/(1 − C), where Wc is the corrected error rate, Wo is the observed error rate, and C is the estimated transcription error rate. The estimated correct proportion is Rc = 1 − Wc. The data were then transformed using the formula arcsin(sqrt(Rc)), and the ANOVA was then applied. No main effect or interaction was affected by the reanalysis. All significant effects remained significant, and non-significant effects remained non-significant. The p-levels did not change in any important ways.


approximately 13.1 percentage points was significant (p < 0.0001). This finding is consistent with those reported by Kallail and Emanuel (1984), Tartter (1990) and Eklund and Traunmüller (1997).

4.2.3. Upper formant availability

Percent correct identification was higher for the unfiltered than for the filtered condition. On average, 82.1% of the unfiltered vowels were correctly identified, compared with 78.2% of the filtered vowels. The difference of nearly 4 percentage points was significant (p=0.011). It is important to note, however, that there was a significant interaction between phonatory type and availability of upper formants. In fact, percent correct identification collapsed across blocked-speaker and mixed-speaker conditions for phonated unfiltered vowels (87.1%) and phonated filtered vowels (86.2%) differed by less than one percentage point. Post hoc Newman–Keuls analysis revealed that this difference was not significant (p=0.823).

4.2.4. Interaction between presentation condition and phonatory type

For whispered vowels, percent correct scores were more than six and a half percentage points lower in the mixed-speaker condition (70.3%) than in the blocked-speaker condition (76.9%). For phonated vowels, percent correct scores were less than one percentage point lower in the mixed-speaker condition (86.3%) than in the blocked-speaker condition (87.0%). The interaction approached significance (p=0.067). Post hoc Newman–Keuls analysis revealed that the difference in classification rates between the mixed-speaker and blocked-speaker conditions was significant for the whispered vowels (p=0.0112), but not for the phonated vowels (p=0.724). Although the n of 8 used in the experiment was low,3 it is unlikely that an increased n alone would have resulted in a significant effect of presentation condition for the phonated vowels, as the difference in classification rates between the blocked-speaker and mixed-speaker conditions was less than one percentage point.
The finding that classification rates for the mixed-speaker and blocked-speaker conditions were significantly different for the whispered vowels but not for the phonated vowels indicates that f0 is likely to contribute to vowel normalization.

4.2.5. Interaction between presentation condition and upper formant availability

For filtered vowels, percent correct scores were approximately five percentage points lower in the mixed-speaker condition (75.7%) than in the blocked-speaker condition (80.6%). For unfiltered vowels, percent correct scores were approximately 2.5 percentage points lower in the mixed-speaker condition (80.9%) than in the blocked-speaker condition (83.3%). The interaction failed to reach significance. Thus, although the results were in the expected direction, there is insufficient evidence in the present study to suggest that F3 contributes to vowel normalization. If F3 does contribute to vowel normalization, our results indicate that its contribution is not of the same magnitude as the contribution associated with f0.

3. A power analysis was conducted assuming an alpha of 0.05, an ES of 0.50 and a power of 0.80 (Cohen, 1992). This analysis revealed that in order to find a significant interaction in an ANOVA such as the one conducted in this experiment, an n of 64 would be recommended.
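The recommended sample size can be roughly reproduced with the standard normal-approximation formula for a two-group comparison at effect size d. This is only a sketch of the arithmetic behind such tables, not the exact calculation in Cohen (1992), whose t-based tables give 64.

```python
import math

Z_ALPHA_2 = 1.95996  # z for two-tailed alpha = 0.05
Z_BETA = 0.84162     # z for power = 0.80

def n_per_group(d):
    """Per-group n from the normal approximation
    n = 2 * ((z_alpha/2 + z_beta) / d)**2, rounded up."""
    return math.ceil(2 * ((Z_ALPHA_2 + Z_BETA) / d) ** 2)

n_per_group(0.50)  # 63 by this approximation; Cohen's (1992) tables give 64
```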


4.2.6. Interaction between phonatory type and upper formant availability

For whispered vowels, percent correct scores were seven percentage points lower for the filtered stimuli (70.1%) than for the unfiltered stimuli (77.1%). For phonated vowels, percent correct scores were less than one percentage point lower for the filtered stimuli (86.2%) than for the unfiltered stimuli (87.1%). The interaction was significant (p=0.001).

5. Discussion

5.1. Classification rate

The best mean classification rate for the phonated stimuli was 87%. Although many studies have reported classification rates as low as or lower than this level, many others have reported substantially better classification rates. It is likely that transcription errors are responsible for the relatively low classification rates. Despite the fact that training was provided to subjects, the training was fairly limited, and subjects had no prior experience in vowel classification. Also, the criterion for proceeding to the listening test was a fairly low 80% correct classification rate. As reported above, if there was transcription error, it did not affect the statistical analyses reported.

5.2. Evidence for normalization

The most critical issues for this study relate to the influence of f0 information and F3 frequency information on classification rates for the two presentation conditions. This is because both f0 information (available in the phonated and not in the whispered vowels) and F3 frequency information (available in the unfiltered and not in the filtered vowels) have been used as parameters in vowel normalization schemes. Our study demonstrated that presentation of vowels in a mixed-speaker condition rather than a blocked-speaker condition resulted in significant losses in classification rates only for stimuli lacking f0 frequency information and not for stimuli containing such information. The implication is that vowel normalization schemes that incorporate f0 information, such as those of Syrdal (1985) and Miller (1989), reflect actual perceptual processes. The results are not definitive for the role of F3. The results also reduce the need to rely on extrinsic normalization schemes, which assume that the perceptual system must adjust to an individual's speech before effectively perceiving his or her vowels (Joos, 1948; Fant, 1975; Lobanov, 1971).
A similar argument with regard to f0 has been made by Katz and Assman (2001), based on their finding that synthetic whispered vowels presented in a mixed-speaker condition were identified more poorly than phonated vowels. However, the present results indicate that the addition of f0 information is nearly sufficient to achieve normalization without the addition of F3 information. This finding raises new questions about the perceptual basis of the vowel normalization models of Syrdal (1985) and Miller (1989). Both of these models suggest that the parameters of vowel perception are derived from comparisons between F3 and F2, F2 and F1, and F1 and f0, rather than from f0 and the formant frequencies themselves. The finding that vowel classification in single-speaker phonated vowels is undisturbed by a lack of F3 information makes it unlikely that the


parameters for normal vowel perception are specifically derived from comparisons of formant frequencies with one another, rather than from the formant frequencies themselves. While the results do provide support for intrinsic vowel normalization, it should not be thought that the normalization process removes talker-specific information. Rather, it appears that, as Pisoni (1992) has argued, detailed information relating to talker voice attributes is both used in normalization processes and retained in memory. This view is consistent with the present findings, as well as with the fact that listeners are able to use talker-specific information to recognize the voices of familiar speakers.

5.3. Upper formants in whispered vowels

What may be the most important finding of this study was not expected: namely, the significant interaction between phonatory type and availability of upper formants. This significant interaction suggests that the third formant is more important for the perception of whispered than of phonated vowels. The implication of this finding is that the absence of third formant information is less important for the classification of phonated vowels than for the classification of whispered vowels. This finding is partially consistent with Nusbaum and Morin (1992), who reported that filtering F3 reduced classification rates only for stimuli that contained no f0 information. Nusbaum and Morin, however, reported that filtering influenced classification rates only in a mixed-speaker condition, and that there was no significant effect of filtering on the perception of synthetic whispered vowels presented in a blocked-speaker condition. In the present study, filtering had a similar effect on whispered vowel classification in both blocked-speaker and mixed-speaker conditions.
The difference between the findings of the two studies may relate to the difference between the stimuli: unlike the naturally produced stimuli in the present study, Nusbaum and Morin's synthetic "whispered" stimuli had formant frequencies that matched the formants of the "phonated" stimuli. Under those circumstances it is possible that F3 information is redundant except in perceptually challenging situations such as mixed-speaker presentation. On the other hand, the whispered vowels in the present study can be considered to be acoustically deficient for purposes of perception, even with blocked-speaker presentation. This deficiency may be the result of the difference between the formant frequencies of whispered vs. phonated vowels. Although no statistical analyses were performed on the formant frequencies measured in this study, there is an established tendency for formant frequencies to be higher in whispered vowels than in phonated vowels (Peterson, 1961; Kallail & Emanuel, 1984; Eklund & Traunmüller, 1997). As a result of the deficient nature of whispered vowels, the acoustic information provided by F3 (and possibly by higher formants) can be used effectively for vowel perception.

Acknowledgements

The research reported here was part of a thesis completed by the first author and submitted to the City University of New York in partial fulfillment of the requirements for the doctoral degree in Speech and Hearing Sciences. We thank Arthur Boothroyd and Katherine Harris, who served


as thesis advisors. We also thank the Associate Editor, Amanda C. Walley, and two anonymous reviewers for numerous helpful suggestions.

References

Ainsworth, W. A. (1975). Intrinsic and extrinsic factors in vowel judgments. In G. Fant, & M. Tatham (Eds.), Auditory analysis and perception of speech (pp. 103–113). London: Academic Press.
Assman, P. F., Nearey, T. M., & Hogan, J. T. (1982). Vowel identification: Orthographic, perceptual, and acoustic aspects. Journal of the Acoustical Society of America, 71, 975–989.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Creelman, C. D. (1957). Case of the unknown talker. Journal of the Acoustical Society of America, 29, 655.
Disner, S. F. (1980). Evaluation of vowel normalization procedures. Journal of the Acoustical Society of America, 67, 253–261.
Eklund, I., & Traunmüller, H. (1997). Comparative study of male and female whispered and phonated versions of the long vowels of Swedish. Phonetica, 54, 1–21.
Fahey, R. P., Diehl, R. L., & Traunmüller, H. (1996). Perception of back vowels: Effects of varying F1–f0 Bark distance. Journal of the Acoustical Society of America, 99, 2350–2357.
Fant, G. (1975). Nonuniform vowel normalization. Royal Institute of Technology Quarterly Progress Status Report (Speech Transmission Laboratory), 2/3, 1–19.
Finney, D. J. (1971). Probit analysis. Cambridge: Cambridge University Press.
Fujisaki, H., & Kawashima, T. (1968). The roles of pitch and higher formants in the perception of vowels. IEEE Transactions on Audio and Electroacoustics, AU-16(1), 73–77.
Halberstam, B. (1998). Vowel normalization: The role of fundamental frequency and upper formants. Unpublished doctoral dissertation, City University of New York, New York.
Hillenbrand, J., & Gayvert, R. T. (1993). Vowel classification based on fundamental frequency and formant frequencies. Journal of Speech and Hearing Research, 36, 694–700.
Hirahara, T., & Kato, H. (1992). The effect of f0 on vowel identification. In Y. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka (Eds.), Speech perception, production and linguistic structure (pp. 88–111). Burke, VA: IOS Press.
Hoemeke, K. A., & Diehl, R. L. (1994). Perception of vowel height: The role of F1–f0 distance. Journal of the Acoustical Society of America, 96, 661–674.
Jenkins, J. J. (1987). A selective history of issues in vowel perception. Journal of Memory and Language, 26, 542–549.
Joos, M. A. (1948). Acoustic phonetics. Language, 24(2), 1–136.
Kahn, D. (1978). On the identifiability of isolated vowels. UCLA Working Papers, 41, 26–31.
Kallail, K. J., & Emanuel, F. W. (1984). An acoustic comparison of isolated whispered and phonated vowel samples produced by adult male subjects. Journal of Phonetics, 12, 175–186.
Katz, W. F., & Assman, P. F. (2001). Identification of children's and adults' vowels: Intrinsic fundamental frequency, fundamental frequency dynamics, and presence of voicing. Journal of Phonetics, 29, 23–51. doi:10.1006/jpho.2000.0135.
Lobanov, B. M. (1971). Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America, 49, 606–608.
Miller, J. D. (1989). Auditory-perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85, 2114–2133.
Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85, 2088–2113.
Nusbaum, H. C., & Morin, T. M. (1992). Paying attention to differences among talkers. In Y. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka (Eds.), Speech perception, production and linguistic structure (pp. 113–123). Burke, VA: IOS Press.
Peterson, G. E. (1961). Parameters of vowel quality. Journal of Speech and Hearing Research, 4, 10–29.
Pisoni, D. (1992). Talker normalization in speech perception. In Y. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka (Eds.), Speech perception, production and linguistic structure (pp. 143–151). Burke, VA: IOS Press.


Slawson, A. W. (1968). Vowel quality and musical timbre as functions of spectrum envelopes and fundamental frequency. Journal of the Acoustical Society of America, 43, 87–101.
Syrdal, A. K. (1985). Aspects of a model of the auditory representation of American English vowels. Speech Communication, 4, 121–135.
Tartter, V. C. (1990). Identifiability of vowels and speakers from whispered vowels. Perception and Psychophysics, 49, 365–372.
Traunmüller, H. (1981). Perceptual dimension of openness in vowels. Journal of the Acoustical Society of America, 69, 1465–1475.