Vowel normalization: the role of fundamental frequency and upper formants



Journal of Phonetics 32 (2004) 423–434 www.elsevier.com/locate/phonetics

Benjamin Halberstam (a,*), Lawrence J. Raphael (b)

(a) Lehman College, City University of New York, 250 Bedford Park Boulevard West, Bronx, NY 10468, USA
(b) Adelphi University, 1 South Avenue, Garden City, NY 11530, USA

Received 12 March 2002; received in revised form 29 January 2004; accepted 2 March 2004

Abstract

Some vowel normalization schemes assume the perceptual exploitation of f0 and F3 information. These schemes implicitly predict that access to such information should improve listeners’ classification of vowels in a mixed-speaker condition relative to their classification in a blocked-speaker condition. In this study listeners classified naturally produced phonated and whispered (no f0 information) vowels in which formants above F2 were either present or filtered out. Results provided some support for the use of f0 in vowel normalization; results for F3 were inconclusive. An unexpected finding was that upper formants were more important for whispered vowel classification than for phonated vowel classification.

© 2004 Elsevier Ltd. All rights reserved.

*Corresponding author. Tel.: +1-718-960-8135. E-mail address: [email protected] (B. Halberstam).
0095-4470/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.wocn.2004.03.001

1. Introduction

It has been evident for some time that a simple acoustic target model, using the first two formants as the sole cues for vowel perception, is inadequate to explain vowel perception (Jenkins, 1987). “Vowel normalization models” (Disner, 1980) claim that supplemental acoustic information is used to disambiguate vowels that have similar frequency values for the first and second formants. Such vowel normalization approaches suggest that the resultant normalized formant frequency values are invariant for each vowel across subjects. The present study focuses on intrinsic normalization schemes that incorporate f0 and F3 as perceptual cues to disambiguate vowels (Miller, 1989; Syrdal, 1985). Although Miller’s approach and Syrdal’s approach differ in several respects, they both report data showing that the use of f0 and F3 as additional parameters significantly enhances algorithmic vowel classification rates. Disner (1980) and Hillenbrand and Gayvert (1993) report similar findings. However, as Hillenbrand and Gayvert (1993, p. 698) point out, data-analytic evaluations of perceptual models “can suggest logically possible perceptual strategies, but other information is required to determine whether listeners actually adopt a proposed strategy.”

Many studies (Fujisaki & Kawashima, 1968; Slawson, 1968; Ainsworth, 1975; Traunmüller, 1981; Nearey, 1989; Hirahara & Kato, 1992; Hoemeke & Diehl, 1994; Fahey, Diehl, & Traunmüller, 1996) have shown that for given F1 and F2 values, vowel perception is influenced by f0, and some studies (Fujisaki & Kawashima, 1968; Nearey, 1989; Slawson, 1968) have shown that vowel perception is influenced by F3. None of the above-mentioned studies, however, specifically examines whether f0 and F3 information improves vowel classification in the case of multiple speakers. Thus the key claim of the aforementioned vowel normalization schemes, that f0 and F3 information contributes to the accurate classification of vowels spoken by multiple, unknown speakers, has not been thoroughly tested.

This issue has been examined in a previous version of the present study (Halberstam, 1998) and in a study by Nusbaum and Morin (1992). Nusbaum and Morin (1992) synthesized simulated “phonated” and “whispered” vowels with and without acoustic energy at frequencies higher than F2. In a blocked-speaker condition (where each block simulated a single speaker), approximately 95% of the stimuli of all types were correctly identified. In a mixed-speaker condition (in which successive vowels simulated four different “speakers”), unfiltered phonated vowels were identified best, followed closely by phonated filtered vowels, followed more distantly by whispered unfiltered vowels and, finally, by whispered filtered vowels.
The implication of this research is that in a mixed-speaker condition, in which listeners must categorize vowels whose formant frequencies are similar, f0 plays a significant role in the disambiguation of vowels, while the formants above F2 seem to play a secondary role.

There are, however, two potentially important concerns about generalizing from the results of Nusbaum and Morin (1992). The first is that the whispered vowels in their study might have been classified more accurately if their formant frequencies had been higher than those of the phonated vowels, as they are in naturally produced vowels (Peterson, 1961; Kallail & Emanuel, 1984; Eklund & Traunmüller, 1997). Second, had Nusbaum and Morin (1992) used a greater number of “speakers,” it is possible that they would have found lower mixed-speaker classification rates. Creelman (1957) reported that there is a tendency for percent correct identification to decrease as the number of talkers increases.

The present study was designed to test whether results similar to those of Nusbaum and Morin (1992) would be obtained for naturally spoken rather than synthesized vowels. In addition, 15 speakers varying in age and sex were used to increase the potential for overlap of vowel categories in the F1 × F2 space. While it is true that formant frequencies and other acoustic parameters can differ significantly between phonated and whispered vowels produced by the same speaker (Peterson, 1961; Kallail & Emanuel, 1984; Eklund & Traunmüller, 1997; Katz & Assmann, 2001), any perceptual difficulties associated with whispered vowels per se should logically have similar effects on vowels presented both in blocked-speaker and in mixed-speaker conditions. If the decrement in classification rates associated with the mixed-speaker condition is greater for the whispered than for the phonated vowels, we would conclude that the absence of f0 information had limited the listener’s ability to normalize in both conditions.
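To make the intrinsic schemes under test more concrete, the following is a minimal sketch of the kind of relational parameters they rely on, namely comparisons between F1 and f0, F2 and F1, and F3 and F2 (see Section 5.2). It is not the implementation of Syrdal (1985) or Miller (1989). The Hz-to-Bark conversion (a commonly used approximation attributed to Traunmüller), the function names, and the example frequencies are all assumptions introduced here for illustration. The sketch also shows which comparisons become unavailable for whispered stimuli (no f0) and filtered stimuli (no F3).

```python
from typing import Dict, Optional


def hz_to_bark(f_hz: float) -> float:
    """Approximate Hz-to-Bark conversion.

    ASSUMPTION: this particular approximation is not given in the paper;
    any auditory frequency scale could be substituted.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def relational_parameters(
    f1: float,
    f2: float,
    f0: Optional[float] = None,   # None models a whispered vowel (no f0)
    f3: Optional[float] = None,   # None models a low-pass filtered vowel (no F3)
) -> Dict[str, float]:
    """Return the auditory-scale differences that can be computed."""
    z1, z2 = hz_to_bark(f1), hz_to_bark(f2)
    params = {"F2-F1": z2 - z1}
    if f0 is not None:
        params["F1-f0"] = z1 - hz_to_bark(f0)
    if f3 is not None:
        params["F3-F2"] = hz_to_bark(f3) - z2
    return params


if __name__ == "__main__":
    # Hypothetical /i/-like values in Hz, not measurements from this study.
    print(relational_parameters(f1=270, f2=2290, f0=120, f3=3010))  # phonated-unfiltered
    print(relational_parameters(f1=270, f2=2290))                   # whispered-filtered
```

In this toy representation a whispered, filtered stimulus leaves only the F2 to F1 comparison, which is why the four stimulus types of the experiment bear on whether listeners actually use such parameters.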


2. Stimuli

2.1. Speakers

Speakers were five adult males, 19–29 years of age, five adult females, 18–49 years of age, and five children, 7–14 years of age. All were native speakers of New York City English.

2.2. Vowels

The speakers were recorded producing each of nine phonated and nine whispered vowels in a /hV/ syllable. The uncharacteristic occurrence of short or “lax” vowels in syllable-final position was mitigated through the use of an elicitation technique described below. The vowels in the set were /i ɪ ɛ æ ɑ ɔ ʌ ʊ u/.

2.3. Elicitation technique

The experimenter spoke each vowel to the speaker, who then said twice “I just heard the /hV/ again,” once fully phonated and once in a whisper. The speakers were provided with a printed English word corresponding to each /hVd/ utterance (e.g. “head”) and were told to omit the /d/. The /hV/ context follows Kahn (1978) and allows the use of vowels in a familiar context (i.e. the /hVd/ “words”), without the coarticulation cues found in other consonantal environments. Additionally, speakers paused before the word “again” in order to minimize the coarticulatory effects of the following schwa on the test vowel. The experimenter corrected the speaker’s pronunciation of a vowel when he perceived it to be inaccurately produced. Recordings were made with an Electro-Voice omnidirectional microphone, model 635A, and a Marantz tape recorder, model PMD420.

2.4. Measurement and filtering of stimuli

The vowels and their carrier phrases were digitized with 16-bit resolution at a 20 kHz sampling rate on a Gateway 2000 computer. Using a sound-editing program, the target vowels were excised from the carrier phrase and their average amplitude levels were equated. Wide-band spectrograms and averaged amplitude spectra for all of the vowel segments were made.
F2 and F3 frequencies were measured for each vowel by locating a cursor at the peaks of averaged wide-band amplitude spectra for the entire phonated segment, displayed on a computer monitor. Two copies of each of the excised vowels were made. One copy of each vowel was low-pass filtered using a bank of two Stanford Research Model SR 650 Programmable Filters. The output was recorded on digital tape using a Sony portable DAT, model TCD-D7. The filters had an extremely steep cut-off of approximately 120 dB per octave. The cut-off frequencies were set to 100 Hz above the center frequency of F2. This effectively removed F3 and higher formant information. Second formant energy was always clearly present after filtering, as confirmed visually through spectrograms and averaged amplitude spectra. Fig. 1 shows amplitude spectra for filtered and unfiltered /i/ stimuli produced by a male speaker. Despite the closeness of some of the F2 and F3 peaks, it can be seen that the procedure removes F3 information while sparing the F2 information.

Fig. 1. (a) Amplitude spectra for unfiltered /i/ stimulus produced by a male speaker. (b) Amplitude spectra for the same stimulus shown in Fig. 1(a) after filtering.

Four stimulus types were generated: phonated-filtered, whispered-filtered, phonated-unfiltered, and whispered-unfiltered. The total number of stimuli generated was 540 (9 vowels × 15 speakers × 4 types).
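The effect of the filtering step can be illustrated with a rough calculation. The sketch below assumes an idealized low-pass filter whose attenuation rises linearly, in dB per octave, above the cut-off. Real filters, including the ones used here, behave less simply near the cut-off, and the F2 and F3 values in the example are hypothetical rather than measurements from the study.

```python
import math

CUTOFF_OFFSET_HZ = 100.0      # cut-off placed 100 Hz above F2 (from the text)
SLOPE_DB_PER_OCTAVE = 120.0   # approximate filter slope (from the text)


def cutoff_hz(f2_hz: float) -> float:
    """Cut-off frequency for a vowel whose F2 center frequency is f2_hz."""
    return f2_hz + CUTOFF_OFFSET_HZ


def attenuation_db(freq_hz: float, f2_hz: float) -> float:
    """Attenuation at freq_hz under the idealized-slope ASSUMPTION.

    Frequencies at or below the cut-off are treated as unattenuated.
    """
    fc = cutoff_hz(f2_hz)
    if freq_hz <= fc:
        return 0.0
    octaves_above_cutoff = math.log2(freq_hz / fc)
    return SLOPE_DB_PER_OCTAVE * octaves_above_cutoff


if __name__ == "__main__":
    f2, f3 = 2300.0, 3000.0   # hypothetical /i/-like values in Hz
    print(f"cut-off: {cutoff_hz(f2):.0f} Hz")
    print(f"attenuation at F2: {attenuation_db(f2, f2):.1f} dB")
    print(f"attenuation at F3: {attenuation_db(f3, f2):.1f} dB")
```

Under these assumptions even an F3 lying only a few hundred hertz above F2 is attenuated by tens of decibels while F2 itself is untouched, which is consistent with the description of Fig. 1.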

3. Testing procedure

3.1. Subjects

Listeners were four males and four females, 18–21 years of age. All were native speakers of New York City English who reportedly had normal hearing and who were drawn from the same community as the speakers. None of the subjects had any experience in vowel transcription. The subjects were paid for their participation.

3.2. Stimulus presentation

Stimuli were presented over Telephonics TDH50 headphones with an impedance of 60 Ω, using a Sony portable DAT, model TCD-D7, with an output impedance of 27 Ω.

3.3. Training

All listeners were trained immediately before testing. Listeners were presented with stimuli recorded from one adult male speaker and one female child speaker. The training stimuli and those of the listening test were generated using identical procedures. The training stimuli were not employed in the listening test. Examples of each stimulus type were included. The total number of stimuli used in training was 72 (9 vowels × 2 speakers × 4 types). After the presentation of each training stimulus, subjects circled one of the keywords on an answer sheet. The keywords were “heed, hid, head, had, hod, hawed, hud, hood, who’d.”¹ The experimenter informed the subject whether the response matched the speaker’s intended vowel. If it did not, the experimenter informed the subject what the intended vowel was and presented the same stimulus again. Potential subjects whose initial responses were below the criterion of 80% correct repeated the training. All potential subjects met the criterion in the first or second training session before proceeding to the experiment. The criterion for subject inclusion was applied to the average percent correct score across all of the stimulus types.

¹ “HUD” is the name of a federal housing program in the United States. A “hod” is a bucket used to hold solid fuels such as coal or wood.

3.4. Testing

The stimuli were presented in blocks of 45, at 4.5-s inter-stimulus intervals, with an inter-block interval of 30 s. Each of the four stimulus types (phonated-filtered, whispered-filtered, phonated-unfiltered, and whispered-unfiltered) was presented separately in quasi-random order in a blocked-speaker condition and in a mixed-speaker condition. All subjects heard all eight groups of stimuli (4 types × 2 conditions), for a total of 1080 stimuli (4 types × 2 conditions × 15 speakers × 9 vowels). The order of presentation of these groups was counterbalanced across subjects. Following Assmann, Nearey, and Hogan (1982), subjects circled words on answer sheets containing lines of the following keywords: “heed, hid, head, had, hod, hawed, hud, hood, who’d.”
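The stimulus counts and the two presentation conditions can be laid out in a short sketch. The speaker labels, the function names, and the random seed are hypothetical. The text does not specify exactly how trials were ordered within the blocked-speaker lists, so grouping all nine vowels of one speaker together is an assumption made here purely for illustration.

```python
import itertools
import random

VOWELS = ["heed", "hid", "head", "had", "hod", "hawed", "hud", "hood", "who'd"]
SPEAKERS = [f"S{i:02d}" for i in range(1, 16)]   # hypothetical labels, 15 speakers
STIMULUS_TYPES = ["phonated-filtered", "whispered-filtered",
                  "phonated-unfiltered", "whispered-unfiltered"]
CONDITIONS = ["blocked-speaker", "mixed-speaker"]
BLOCK_SIZE = 45   # stimuli per block (from the text)

# Every distinct stimulus: 9 vowels x 15 speakers x 4 types = 540.
STIMULI = list(itertools.product(VOWELS, SPEAKERS, STIMULUS_TYPES))


def presentation_order(condition, rng):
    """Return (speaker, vowel) pairs for one stimulus type in one condition."""
    if condition == "mixed-speaker":
        trials = list(itertools.product(SPEAKERS, VOWELS))
        rng.shuffle(trials)               # speaker changes unpredictably
        return trials
    # ASSUMPTION: in the blocked-speaker condition each speaker's nine
    # vowels are presented consecutively, in random vowel order.
    speakers = list(SPEAKERS)
    rng.shuffle(speakers)
    trials = []
    for speaker in speakers:
        vowels = list(VOWELS)
        rng.shuffle(vowels)
        trials.extend((speaker, vowel) for vowel in vowels)
    return trials


def split_into_blocks(trials, size=BLOCK_SIZE):
    """Cut a trial list into consecutive blocks of `size` stimuli."""
    return [trials[i:i + size] for i in range(0, len(trials), size)]


if __name__ == "__main__":
    rng = random.Random(2004)
    total_trials = 0
    for stim_type, condition in itertools.product(STIMULUS_TYPES, CONDITIONS):
        order = presentation_order(condition, rng)
        total_trials += len(order)
        print(stim_type, condition, len(split_into_blocks(order)), "blocks")
    print("distinct stimuli:", len(STIMULI), "trials per listener:", total_trials)
```

Each of the eight groups contains 135 trials, that is, three blocks of 45, which matches the totals given in the text.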

4. Results

4.1. Group means and standard deviations

Each subject’s data were scored for percent correct identification of the target vowels. The group means and standard deviations were calculated for each of the eight conditions and are presented in Fig. 2. The best classification rates were found for the phonated-unfiltered vowels when presented in blocked-speaker or mixed-speaker condition and the phonated-filtered vowels presented in blocked-speaker condition (all approximately 87%).²

A three-way ANOVA was carried out on the mean identification data. The factors in the design were presentation condition (blocked or mixed speakers), phonatory type (phonated or whispered) and upper formant availability (unfiltered or filtered). The error terms were provided by inter-listener differences. The results of the ANOVA are reported in Table 1.

² Correct identification in the best condition was only about 87%. It might be argued that a substantial transcription error affected apparent additivity. Additional analysis was done assuming a “true” error rate of 0.05, similar to those reported by Hillenbrand and Gayvert (1993) and others, rather than the higher error rate found in the present study. This analysis used logic comparable to “Abbott’s formula” for the adjustments for “natural responsiveness” in bioassay (Finney, 1971). The observed error was adjusted as follows to get a rough estimate of true error rates: $W_c = (W_o - C)/(1 - C)$, where $W_c$ is the corrected error rate, $W_o$ is the observed error rate and $C$ is the estimated transcription error rate. The estimated correct proportion is $R_c = 1 - W_c$. The data were then transformed using the formula $\arcsin(\sqrt{R_c})$, and the ANOVA was then applied. No main effect or interaction was affected by the reanalysis. All significant effects remained significant, and non-significant effects remained non-significant. The p-levels did not change in any important ways.

Fig. 2. Percent correct identification of phonated and whispered vowels, with and without filtering to remove F3 and upper formants, under two presentation conditions. The data displayed are group means and standard errors for eight listeners.

Table 1
Three-way ANOVA for presentation condition × phonatory type × availability of upper formants

Source of     Estimated     Degrees of   Error term      Estimated     Degrees of   F ratio   p-level
variance      mean square   freedom                      mean square   freedom
Condition     214.51        1            C × Subject     20.83         7            10.30     0.015
Phonation     2725.49       1            P × S           16.60         7            164.16    0.000004
Filtering     245.12        1            F × S           20.68         7            11.85     0.011
C × P         139.45        1            C × P × S       29.82         7            4.68      0.067
C × F         28.88         1            C × F × S       13.67         7            2.11      0.189
P × F         145.41        1            P × F × S       4.93          7            29.48     0.001
C × P × F     0.07          1            C × P × F × S   17.20         7            0.00      0.952

4.2. Main effects

4.2.1. Presentation condition

Percent correct identification was higher in the blocked-speaker than in the mixed-speaker condition. On average, 82.0% of the vowels were correctly identified in the blocked-speaker condition, compared with 78.3% in the mixed-speaker condition. The difference in classification rates was significant (p=0.015). The significant main effect for presentation condition indicates that when the stimuli used in this study were presented in a mixed-speaker condition, they did present the sort of perceptual difficulty caused by overlap in formant frequencies that vowel normalization models attempt to overcome.

4.2.2. Phonatory type

A highly significant effect was found for phonatory type. Percent correct identification was higher for the phonated than for the whispered stimuli. On average, 86.7% of the phonated vowels were correctly identified, compared with 73.6% of the whispered vowels. The difference of approximately 13.1 percentage points was significant (p < 0.0001). This finding is consistent with those reported by Kallail and Emanuel (1984), Tartter (1990) and Eklund and Traunmüller (1997).

4.2.3. Upper formant availability

Percent correct identification was higher for the unfiltered than for the filtered condition. On average, 82.1% of the unfiltered vowels were correctly identified, compared with 78.2% of the filtered vowels. The difference of nearly 4 percentage points was significant (p=0.011). It is important to note, however, that there was a significant interaction between phonatory type and availability of upper formants. In fact, percent correct identification collapsed across blocked-speaker and mixed-speaker conditions differed by less than one percentage point between phonated unfiltered vowels (87.1%) and phonated filtered vowels (86.2%). Post hoc Newman-Keuls analysis revealed that this difference was not significant (p=0.823).

4.2.4. Interaction between presentation condition and phonatory type

For whispered vowels, percent correct scores were more than six and a half percentage points lower in the mixed-speaker condition (70.3%) than in the blocked-speaker condition (76.9%). For phonated vowels, percent correct scores were less than one percentage point lower in the mixed-speaker condition (86.3%) than in the blocked-speaker condition (87.0%). The interaction approached significance (p=0.067). Post hoc Newman-Keuls analysis revealed that the difference in classification rates between mixed-speaker and blocked-speaker condition was significant for the whispered vowels (p=0.0112), but not for the phonated vowels (p=0.724). Although the n of 8 used in the experiment was low,³ it is unlikely that an increased n alone would have resulted in a significant effect for presentation condition for the phonated vowels, as the difference in classification rates between the blocked-speaker and mixed-speaker condition was less than one percentage point.
The finding that classification rates for mixed-speaker and blocked-speaker condition were significantly different for the whispered vowels but not for the phonated vowels indicates that f0 is likely to contribute to vowel normalization.

4.2.5. Interaction between presentation condition and upper formant availability

For filtered vowels, percent correct scores were approximately five percentage points lower in the mixed-speaker condition (75.7%) than in the blocked-speaker condition (80.6%). For unfiltered vowels, percent correct scores were approximately 2.5 percentage points lower in the mixed-speaker condition (80.9%) than in the blocked-speaker condition (83.3%). The interaction failed to reach significance (p=0.189; Table 1). Thus although the results were in the expected direction, there is insufficient evidence in the present study to suggest that F3 contributes to vowel normalization. If F3 does contribute to vowel normalization, our results indicate that its contribution is not of the same magnitude as the contribution associated with f0.

³ A power analysis was conducted assuming an alpha of 0.05, an ES of 0.50 and power of 0.80 (Cohen, 1992). This analysis revealed that in order to find a significant interaction for an ANOVA such as the one conducted in this experiment, an n of 64 would be recommended.
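The logic of Sections 4.2.4 and 4.2.5 amounts to a difference of differences: the mixed-speaker decrement for stimuli lacking a cue, minus the decrement for stimuli that contain it. The sketch below reproduces only this descriptive contrast from the group means reported above. The function and variable names are illustrative, and the calculation is no substitute for the ANOVA or the post hoc tests.

```python
# Group mean percent correct as (blocked-speaker, mixed-speaker), from the text.
MEANS = {
    "phonated":   (87.0, 86.3),   # f0 available
    "whispered":  (76.9, 70.3),   # no f0
    "unfiltered": (83.3, 80.9),   # F3 and higher formants available
    "filtered":   (80.6, 75.7),   # formants above F2 removed
}


def mixed_speaker_decrement(group):
    """Percentage-point loss when moving from blocked to mixed speakers."""
    blocked, mixed = MEANS[group]
    return blocked - mixed


def interaction_contrast(without_cue, with_cue):
    """Extra mixed-speaker loss attributable to the missing cue."""
    return mixed_speaker_decrement(without_cue) - mixed_speaker_decrement(with_cue)


f0_contrast = interaction_contrast("whispered", "phonated")    # about 6.6 - 0.7
f3_contrast = interaction_contrast("filtered", "unfiltered")   # about 4.9 - 2.4

if __name__ == "__main__":
    print(f"f0 contrast: {f0_contrast:.1f} percentage points")
    print(f"F3 contrast: {f3_contrast:.1f} percentage points")
```

The contrast for f0 is roughly twice the size of the contrast for F3, in line with the statement that any contribution of F3 is smaller than that of f0.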


4.2.6. Interaction between phonatory type and upper formant availability

For whispered vowels, percent correct scores were seven percentage points lower for the filtered stimuli (70.1%) than for the unfiltered stimuli (77.1%). For phonated vowels, percent correct scores were less than one percentage point lower for the filtered stimuli (86.2%) than for the unfiltered stimuli (87.1%). The interaction was significant (p=0.001).
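The transcription-error adjustment described in footnote 2 can also be sketched in a few lines. The footnote does not state the numerical value used for the estimated transcription error rate C, so the helper that derives C from the best observed condition (about 87% correct) and the assumed “true” error rate of 0.05 is one possible reading, not the authors’ documented procedure. Clipping negative corrected error rates to zero is a further assumption, and the example scores are group means rather than the per-listener data that were actually analyzed.

```python
import math


def corrected_error(observed_error, transcription_error):
    """Wc = (Wo - C) / (1 - C), as given in footnote 2.

    ASSUMPTION: values below zero are clipped so that the corrected
    proportion correct never exceeds 1.
    """
    wc = (observed_error - transcription_error) / (1.0 - transcription_error)
    return max(wc, 0.0)


def implied_transcription_error(observed_error, true_error):
    """Solve Wc = (Wo - C) / (1 - C) for C (a hypothetical helper)."""
    return (observed_error - true_error) / (1.0 - true_error)


def transformed_score(percent_correct, transcription_error):
    """arcsin(sqrt(Rc)) with Rc = 1 - Wc, for one percent-correct score."""
    observed_error = 1.0 - percent_correct / 100.0
    rc = 1.0 - corrected_error(observed_error, transcription_error)
    return math.asin(math.sqrt(rc))


if __name__ == "__main__":
    # One reading: choose C so that the best condition (13% observed error)
    # maps onto the assumed "true" error rate of 0.05.
    c = implied_transcription_error(observed_error=0.13, true_error=0.05)
    for label, pct in [("phonated, blocked", 87.0), ("whispered, mixed", 70.3)]:
        print(label, round(transformed_score(pct, c), 3))
```

Because the arcsine transform is monotonic, the ordering of conditions is preserved; what the reanalysis tests is whether the pattern of significant effects survives the rescaling, which the footnote reports it does.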

5. Discussion

5.1. Classification rate

The best mean classification rate for the phonated stimuli was 87%. Although many studies have reported classification rates as low as or lower than this level, many others have reported substantially better classification rates. It is likely that transcription errors are responsible for the relatively low classification rates. Despite the fact that training was provided to subjects, the training was fairly limited, and subjects had no prior experience in vowel classification. Also, the criterion for proceeding to the listening test was a fairly low 80% correct classification rate. As reported above, if there was transcription error, it did not affect the statistical analyses reported.

5.2. Evidence for normalization

The most critical issues for this study relate to the influence of f0 information and F3 frequency information on classification rates for the two presentation conditions. This is because both f0 information (available in the phonated and not in the whispered vowels) and F3 frequency information (available in the unfiltered and not in the filtered vowels) have been used as parameters in vowel normalization schemes. Our study demonstrated that presentation of vowels in a mixed-speaker condition rather than a blocked-speaker condition resulted in significant losses in classification rates only for stimuli lacking f0 information and not for stimuli containing such information. The implication is that vowel normalization schemes that incorporate f0 information, such as those of Syrdal (1985) and Miller (1989), reflect actual perceptual processes. The results are not definitive for the role of F3. The results also reduce the need to rely on extrinsic normalization schemes, which assume that the perceptual system must adjust to an individual’s speech before effectively perceiving his or her vowels (Joos, 1948; Fant, 1975; Lobanov, 1971).
A similar argument with regard to f0 has been made by Katz and Assmann (2001), based on their finding that synthetic whispered vowels presented in a mixed-speaker condition were identified more poorly than phonated vowels. However, the present results indicate that the addition of f0 information is nearly sufficient to achieve normalization without the addition of F3 information. This finding raises new questions about the perceptual basis of the vowel normalization models of Syrdal (1985) and Miller (1989). Both of these models suggest that the parameters of vowel perception are derived from comparisons between F3 and F2, F2 and F1, and F1 and f0, rather than from f0 and the formant frequencies themselves. The finding that vowel classification in single-speaker phonated vowels is undisturbed by a lack of F3 information makes it unlikely that the parameters for normal vowel perception are specifically derived from comparisons of formant frequencies with one another, rather than from the formant frequencies themselves.

While the results do provide support for intrinsic vowel normalization, it should not be thought that the normalization process removes talker-specific information. Rather, it appears that, as Pisoni (1992) has argued, detailed information relating to talker voice attributes is both used in normalization processes and retained in memory. This view is consistent with the present findings, as well as with the fact that listeners are able to use talker-specific information to recognize the voices of familiar speakers.

5.3. Upper formants in whispered vowels

What may be the most important finding of this study was not expected: namely, the significant interaction between phonatory type and availability of upper formants. This significant interaction suggests that the third formant is more important for the perception of whispered than of phonated vowels. The implication of this finding is that the absence of third formant information is less important for the classification of phonated vowels than for the classification of whispered vowels. This finding is partially consistent with Nusbaum and Morin (1992), who reported that filtering F3 reduced classification rates only for stimuli that contained no f0 information. Nusbaum and Morin, however, reported that filtering influenced classification rates only in a mixed-speaker condition, and that there was no significant effect of filtering on the perception of synthetic whispered vowels presented in a blocked-speaker condition. In the present study, filtering had a similar effect on whispered vowel classification in both blocked-speaker and mixed-speaker conditions.
The difference between the findings of the two studies may relate to the difference between the stimuli: unlike the naturally produced stimuli in the present study, Nusbaum and Morin’s synthetic “whispered” stimuli had formant frequencies that matched the formants of the “phonated” stimuli. Under those circumstances it is possible that F3 information is redundant except in perceptually challenging situations such as mixed-speaker presentation. On the other hand, the whispered vowels in the present study can be considered to be acoustically deficient for purposes of perception, even with blocked-speaker presentation. This deficiency may be the result of the difference between the formant frequencies of whispered vs. phonated vowels. Although no statistical analyses were performed on the formant frequencies measured in this study, there is an established tendency for formant frequencies to be higher in whispered vowels than in phonated vowels (Peterson, 1961; Kallail & Emanuel, 1984; Eklund & Traunmüller, 1997). As a result of the deficient nature of whispered vowels, the acoustic information provided by F3 (and possibly by higher formants) can be used effectively for vowel perception.

Acknowledgements

The research reported here was part of a thesis completed by the first author and submitted to the City University of New York in partial fulfillment of the requirements for the doctoral degree in Speech and Hearing Sciences. We thank Arthur Boothroyd and Katherine Harris, who served as thesis advisors. We also thank Associate Editor Amanda C. Walley and two anonymous reviewers for numerous helpful suggestions.

References

Ainsworth, W. A. (1975). Intrinsic and extrinsic factors in vowel judgments. In G. Fant, & M. Tatham (Eds.), Auditory analysis and perception of speech (pp. 103–113). London: Academic Press.
Assmann, P. F., Nearey, T. M., & Hogan, J. T. (1982). Vowel identification: Orthographic, perceptual, and acoustic aspects. Journal of the Acoustical Society of America, 71, 975–989.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Creelman, C. D. (1957). Case of the unknown talker. Journal of the Acoustical Society of America, 29, 655.
Disner, S. F. (1980). Evaluation of vowel normalization procedures. Journal of the Acoustical Society of America, 67, 253–261.
Eklund, I., & Traunmüller, H. (1997). Comparative study of male and female whispered and phonated versions of the long vowels of Swedish. Phonetica, 54, 1–21.
Fahey, R. P., Diehl, R. L., & Traunmüller, H. (1996). Perception of back vowels: Effects of varying F1–f0 Bark distance. Journal of the Acoustical Society of America, 99, 2350–2357.
Fant, G. (1975). Nonuniform vowel normalization. Royal Institute of Technology Quarterly Progress Status Report (Speech Transmission Laboratory), 2/3, 1–19.
Finney, D. J. (1971). Probit analysis. Cambridge: Cambridge University Press.
Fujisaki, H., & Kawashima, T. (1968). The roles of pitch and higher formants in the perception of vowels. IEEE Audio Electroacoustics, AU-16(1), 73–77.
Halberstam, B. (1998). Vowel normalization: The role of fundamental frequency and upper formants. Unpublished doctoral dissertation, City University of New York, New York.
Hillenbrand, J., & Gayvert, R. T. (1993). Vowel classification based on fundamental frequency and formant frequencies. Journal of Speech and Hearing Research, 36, 694–700.
Hirahara, T., & Kato, H. (1992). The effect of f0 on vowel identification. In Y. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka (Eds.), Speech perception, production and linguistic structure (pp. 88–111). Burke, VA: IOS Press.
Hoemeke, K. A., & Diehl, R. L. (1994). Perception of vowel height: The role of F1–f0 distance. Journal of the Acoustical Society of America, 96, 661–674.
Jenkins, J. J. (1987). A selective history of issues in vowel perception. Journal of Memory and Language, 26, 542–549.
Joos, M. A. (1948). Acoustic phonetics. Language, 24(2), 1–136.
Kahn, D. (1978). On the identifiability of isolated vowels. UCLA Working Papers, 41, 26–31.
Kallail, K. J., & Emanuel, F. W. (1984). An acoustic comparison of isolated whispered and phonated vowel samples produced by adult male subjects. Journal of Phonetics, 12, 175–186.
Katz, W. F., & Assmann, P. F. (2001). Identification of children’s and adult’s vowels: Intrinsic fundamental frequency, fundamental frequency dynamics, and presence of voicing. Journal of Phonetics, 29, 23–51. doi:10.1006/jpho.2000.0135.
Lobanov, B. M. (1971). Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America, 49, 606–608.
Miller, J. D. (1989). Auditory-perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85, 2114–2133.
Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85, 2088–2113.
Nusbaum, H. C., & Morin, T. M. (1992). Paying attention to differences among talkers. In Y. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka (Eds.), Speech perception, production and linguistic structure (pp. 113–123). Burke, VA: IOS Press.
Peterson, G. E. (1961). Parameters of vowel quality. Journal of Speech and Hearing Research, 4, 10–29.
Pisoni, D. (1992). Talker normalization in speech perception. In Y. Tohkura, E. Vatikiotis-Bateson, & Y. Sagisaka (Eds.), Speech perception, production and linguistic structure (pp. 143–151). Burke, VA: IOS Press.
Slawson, A. W. (1968). Vowel quality and musical timbre as functions of spectrum envelopes and fundamental frequency. Journal of the Acoustical Society of America, 43, 87–101.
Syrdal, A. K. (1985). Aspects of a model of the auditory representation of American English vowels. Speech Communication, 4, 121–135.
Tartter, V. C. (1990). Identifiability of vowels and speakers from whispered vowels. Perception and Psychophysics, 49, 365–372.
Traunmüller, H. (1981). Perceptual dimension of openness in vowels. Journal of the Acoustical Society of America, 69, 1465–1475.
