The interaction of formant frequency and pitch in the perception of voice category and jaw opening in female singers


Molly L. Erickson
Knoxville, Tennessee

Summary: This study represents a first step toward understanding the contribution formant frequency makes to the perception of female voice categories. The effects of formant frequency and pitch on the perception of voice category were examined by constructing a perceptual study that used two sets of synthetic stimuli at various pitches throughout the female singing range. The first set was designed to test the effects of systematically varying formants 1 through 4. The second set was designed to test the relative effects of lower frequency formants (F1 and F2) versus higher frequency formants (F3 and F4) through construction of mixed stimuli. Generally, as the frequencies of all four formants decreased, perception of soprano voice category decreased at all but the highest pitch, A5. However, perception of soprano voice category also increased as a function of pitch. Listeners appeared to need agreement between all four formants to perceive voice categories. When upper and lower formants were inconsistent in frequency, listeners were unable to judge voice category, but they could use the inconsistent patterns to form perceptions about degree of jaw opening.

Key Words: Voice classification, Perception, Pitch, Formant, Jaw opening.

INTRODUCTION

Accepted for publication August 6, 2003. Presented at the 31st Annual Symposium: Care of the Professional Voice, June 6, 2002, Philadelphia, Pennsylvania. From the Department of Audiology and Speech Pathology, University of Tennessee, Knoxville, Tennessee. Address correspondence and reprint requests to Molly Erickson, PhD, Department of Audiology and Speech Pathology, 578 South Stadium Hall, University of Tennessee, Knoxville, TN 37996. E-mail: [email protected] Journal of Voice, Vol. 18, No. 1, pp. 24–37. 0892-1997/$30.00 © 2004 The Voice Foundation. doi:10.1016/j.jvoice.2003.08.001

In spite of great advances in the science of the singing voice, most vocal pedagogues still rely on perceptual cues to help them determine whether a young male singer is a baritone or a tenor or whether a young female singer is a mezzo-soprano or a soprano. Traditionally, voice classification has been based on three perceptual parameters: range, timbre, and tessitura1; however, most research studies have focused on the acoustic correlates of one parameter, timbre.2,3 Yet even this parameter is not well understood. The accepted definition of timbre is as follows: two tones are of different timbre if they are judged to be dissimilar and yet have the same loudness and pitch,4 and thus timbre includes both spectral characteristics (eg, spectral slope and formant frequency) and temporal characteristics (eg, onset time, vibrato rate and extent, and offset time). Yet the influence of timbre on voice classification is typically investigated through analysis of spectral information, most


particularly, formant frequencies.3,5 In fact, Cleveland3 states that an individual singer has a characteristic timbre that is a function of the laryngeal source and vocal tract resonances, with singers having similar timbres constituting members of the same voice category. Experiments conducted by Cleveland3 and Sundberg6 have provided support for the notion that voice category perception is more related to formant frequency and less related to source spectrum differences. Berndtsson and Sundberg7 provide further support for this idea. They found that the center frequency of the singer’s formant is related to the perception of voice category. As the center frequency of a resonance is related to vocal tract length, singing voice researchers have suggested that vocal tract length, independent of the laryngeal source, is one of the primary physiological predictors of voice category.5,8 Some researchers have speculated that voice perception is based largely on third and fourth formant frequency, because the first two formants vary as a function of the vowel,5,9 but this hypothesis has not been tested experimentally.

Although formant frequency may play a large role in the perception of voice categories, there may be other factors that influence such perception. In fact, there is likely an interaction between formant frequency and pitch in the perception of voice categories. Erickson10 found that spectral centroid differentiated voice categories in a dissimilarity experiment at pitches below A5, and yet when listeners were asked to categorize these same stimuli as mezzo-soprano or soprano, formant frequency and pitch interacted to affect these judgments (Figure 1).
Likewise, the perception of male voice categories appears to be affected by both formant frequency and pitch.3,9

The existence of voice categories does not necessarily imply that the perception of voice types meets the traditional definition of categorical perception. According to this strict definition, a dichotomous stimulus is perceived categorically if it meets two criteria: (1) the perception of the two categories must shift abruptly and (2) listeners must have a difficult time discriminating between two stimuli in the same category.11 However, current speech perception research using mismatch negativity (MMN) suggests that the auditory system can detect subtle differences in stimuli that are within a category, even when listeners do not perceive these differences.12

These findings imply that categorical perception is a higher level brain function. Yet Pisoni and colleagues13 have shown that listeners can be trained to perceive subtle differences within a speech category. Thus, the traditional definition of categorical perception may be too restrictive, and listener experience may be a factor. Erickson10 compared the results of judgments of female voice category to a multidimensional scaling based on dissimilarity and found that inexperienced listeners perceived virtually no differences between voices judged as the same voice category, whereas experienced listeners did perceive differences between voices judged as the same category, although these differences were very small. Thus, based on a less-restrictive definition of categorical perception, it might be concluded that voice types are categorically perceived. Even if the perception of voice type is not truly categorical, much can be learned about these judgments by employing classic techniques often used in speech identification studies14 where one parameter is systematically varied and its effect on perception is measured, generating an identification curve.

This study represents a first step toward understanding the contribution formant frequency makes to the perception of female voice categories. The paper investigates the perception of voice category in female singers as a function of both formant frequency and pitch using a classic category identification paradigm. The paper also investigates whether voice category can be perceived using the third and fourth formants only.

FIGURE 1. Experienced listeners’ ratings of voice category for four natural singing voices.

METHOD

This study consists of two parts, Part A and Part B. In Part A, the effects of formant frequency and pitch on the perception of voice category were examined by constructing a perceptual study that used two sets of synthetic stimuli at various pitches throughout the female singing range. The first set was designed to test the effects of systematically varying formants 1 through 4. The second set was designed to test the relative effects of lower frequency formants (F1 and F2) versus higher frequency formants (F3 and F4). Comments made by participants after completion of a pilot version of Part A resulted in the inclusion of a second perceptual experiment, Part B. Participants in the pilot study commented that for some stimuli, it seemed as if the singer was singing with the jaw clenched. Given that F1 is inversely related to jaw height,15 it is possible that these listeners might have been interpreting mixed formant stimuli (eg, those where lower frequency first and second formants were paired with higher frequency third and fourth formants) as being produced by singers with high jaw positions. As F1 is also affected by lip opening,15 it may be true that listeners can also perceive mixed formant stimuli in terms of lip opening, particularly if the synthesized vowel is one with a high degree of lip rounding such as /u/. However, in the pilot study, which used only synthesized /a/, listeners made no comments about lip opening. Thus, although it may be true that listeners might have been able to interpret these mixed stimuli in terms of lip opening, jaw opening, or both, Part B was designed to test only the hypothesis that mixed formant stimuli could be interpreted in terms of jaw position.

Stimuli

Two stimulus sets were synthetically generated using a terminal analogue digital synthesizer. The

synthesis model was built using the Aladdin Interactive DSP Workbench (Hitech Development, Stockholm, Sweden). All stimuli were synthesized using first and second formant patterns appropriate for the vowel /a/. Frequencies of formants 1 through 4 were varied depending on the stimulus set. Regardless of experiment, each stimulus was synthesized using a constant source slope of -12 dB per octave with a constant frequency vibrato rate of 5.9 Hz and a constant frequency vibrato extent of 50 cents. Formant bandwidths were determined using a synthesis by analysis procedure described by Klatt and Klatt.16 The synthetic output spectrum was compared with the output spectrum of a female singer. Formant bandwidths of the synthetic signal were adjusted until the shape and level of each peak approximated those observed in the spectrum of the female singer. Based on this procedure, bandwidths for formants 1 through 4 were held constant at 125 Hz, 150 Hz, 150 Hz, and 150 Hz, respectively. For all experiments, each stimulus was synthesized at seven pitches: C4, E4, G4, B4, D5, F5, and A5. Typically, perception of timbre is studied using stimuli that range from 0.5 second in length17 to 2 seconds in length.18 In this experiment, stimuli were 1 second in duration. Spline curves were applied to onsets and offsets to avoid clicks. Average RMS amplitudes were normalized to ±1.5 dB SPL.

For stimulus set 1, frequencies for formants 1 through 4 were systematically varied from lower to higher in an attempt to simulate the acoustic results of corresponding changes in vocal tract length. Five formant patterns were synthesized (Patterns A-E). The pattern with the lowest frequencies, pattern A, was modeled from a professional mezzo-soprano who had been unambiguously categorized as such for over 8 years. The pattern with the highest frequencies, pattern E, was modeled from a professional soprano who also had been unambiguously categorized as such for over 8 years. The remaining patterns, B, C, and D, were interpolated using a linear frequency scale to fall at equal intervals between patterns A and E. The resulting stimulus set consisted of 35 stimuli (5 formant patterns × 7 pitches) (Table 1). Spectra for patterns A and E at each pitch are presented in Figure 2.

TABLE 1. Formant frequencies in Hertz for formant patterns A–E

Pattern   F1    F2     F3     F4
A         625   1074   3027   3600
B         680   1141   3098   3674
C         741   1212   3170   3749
D         806   1287   3244   3827
E         878   1367   3320   3906

The synthesized samples in stimulus set 2 were designed to test the relative importance of upper and lower formants in the perception of the female voice categories, soprano and mezzo-soprano. A set of two types of stimuli comprising conflicting upper and lower formants was constructed. The first type was modeled using the prototypical mezzo-soprano F1 and F2 values from pattern A combined with the prototypical soprano F3 and F4 values of pattern E. In contrast, the second type of stimuli was modeled using the prototypical soprano F1 and F2 values of pattern E combined with the prototypical mezzo-soprano F3 and F4 values of pattern A. The resulting stimulus set consisted of 14 stimuli (2 formant patterns × 7 pitches).

Attempts were made to create stimuli that were as natural as possible. The synthesized signals included

natural parameters such as mild breathiness, small perturbations, and slight irregularities in vibrato. The stimuli were not specifically tested for naturalness; however, upon completion of the experiment, all listeners were asked to describe how they felt about the experiment. In no case did any participant mention that they suspected that the stimuli were synthetic.

Listeners

Listeners provided informed consent using a procedure previously approved by the Institutional Review Board of the University of Tennessee, Knoxville. Two groups of listeners participated, experienced and inexperienced. All experienced listeners were vocal pedagogues or opera professionals who routinely made decisions based on their perceptions of singing voice categories. These listeners had either 10 years of experience teaching singing at the university level or 10 years of experience as a principal, conductor, or director with a professional opera company. Fifteen experienced listeners were recruited from the voice faculty at the University of Tennessee, Knoxville School of Music, and from the Knoxville Opera Company. All inexperienced listeners had no history of choral singing or voice study and expressed no interest in classical singing or opera. Fifteen inexperienced listeners were recruited from undergraduate courses in psychology. All listeners had bilateral hearing within normal limits as determined by a 20-dB hearing screening at 500 Hz, 1000 Hz, 2000 Hz, and 4000 Hz.19

Procedure

For each of the total 49 stimuli (35 stimuli from set 1 and 14 stimuli from set 2), a trial was constructed that consisted of three repetitions of the stimulus separated by 0.5 seconds. In Part A, listeners were presented with each of the 49 trials in random order twice for a total of 98 trials. Listeners were instructed to rate each trial as either mezzo-soprano or soprano.
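The stimulus and trial counts described above can be reproduced with a short sketch. This is only an illustration under stated assumptions, not the software used in the study: the function names (`mixed`, `build_stimuli`, `part_a_trials`) and the pattern labels are hypothetical, the pitch list assumes E4 is the seventh pitch (the Results refer to a cross-over between C4 and E4), and randomizing separately within each of the two presentations is an assumption, since the text says only that the trials were presented in random order twice.

```python
import random

# Formant frequencies in Hz (F1, F2, F3, F4) for patterns A-E, from Table 1.
FORMANTS = {
    "A": (625, 1074, 3027, 3600),
    "B": (680, 1141, 3098, 3674),
    "C": (741, 1212, 3170, 3749),
    "D": (806, 1287, 3244, 3827),
    "E": (878, 1367, 3320, 3906),
}

# Seven pitches; E4 is assumed to be the seventh (see lead-in).
PITCHES = ["C4", "E4", "G4", "B4", "D5", "F5", "A5"]


def mixed(low, high):
    """Conflicting pattern: F1-F2 from pattern `low`, F3-F4 from pattern `high`."""
    return FORMANTS[low][:2] + FORMANTS[high][2:]


def build_stimuli():
    """Return (set 1, set 2) as lists of (label, pitch, formants) tuples."""
    set1 = [(name, pitch, f) for name, f in FORMANTS.items() for pitch in PITCHES]
    conflicting = {
        "A(F1-F2)+E(F3-F4)": mixed("A", "E"),  # hypothetical labels
        "E(F1-F2)+A(F3-F4)": mixed("E", "A"),
    }
    set2 = [(name, pitch, f) for name, f in conflicting.items() for pitch in PITCHES]
    return set1, set2


def part_a_trials(set1, set2, presentations=2, seed=0):
    """Part A trial list: every stimulus once per presentation, shuffled per block.

    Shuffling within each presentation block is an assumption.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(presentations):
        block = set1 + set2
        rng.shuffle(block)
        trials.extend(block)
    return trials


if __name__ == "__main__":
    s1, s2 = build_stimuli()
    print(len(s1), len(s2), len(part_a_trials(s1, s2)))
```

With the values from Table 1, set 1 has 5 × 7 = 35 entries, set 2 has 2 × 7 = 14, and two presentations of all 49 give the 98 Part A trials.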
Listeners were also instructed to indicate how certain they were of their decision using a visual analog scale implemented as a scroll bar ranging from “not very certain” to “very certain.” Listeners were allowed to replay each stimulus as many times as they wished. In Part B, listeners were presented with the 14 stimuli from stimulus set 2, the set comprising mismatched lower and upper formants. The 14 trials were presented in random order. Listeners responded via a visual analog scale interface implemented as a scroll bar. Each end of the scroll bar was anchored with a picture of a singer’s face, mouth barely open at one end and mouth wide open at the other end. Listeners were instructed to rate each trial according to the perceived degree of mouth opening. Listeners were allowed to replay each stimulus as many times as they wished. Experienced listeners completed Part A followed by Part B. Inexperienced listeners completed Part B only.

FIGURE 2. Spectra for formant patterns A and E for all pitches.

Data coding

For consistent and inconsistent stimuli of Part A, data were coded as follows. Data were coded as 1 for judgments of soprano and as −1 for judgments of mezzo-soprano. Listener responses to degree of certainty were coded on a scale of 0 to 100, with “not very certain” coded as 0 and “very certain” coded as 100. Judgment codes were multiplied by certainty scores to create a response range from −100 to 100. Judgments for jaw opening data in Part B were coded on a scale of 0 to 100, with 0 being the most closed position and 100 the most open position.
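The Part A coding rule described above (category code multiplied by certainty) can be written as a minimal sketch. The function name `code_category` and the string labels are hypothetical and chosen only for illustration.

```python
def code_category(judgment: str, certainty: float) -> float:
    """Combine a forced-choice judgment with a 0-100 certainty score.

    Soprano is coded +1 and mezzo-soprano -1; the product with the
    certainty score gives a rating between -100 and 100.
    """
    signs = {"soprano": 1, "mezzo-soprano": -1}
    if judgment not in signs:
        raise ValueError(f"unknown judgment: {judgment!r}")
    if not 0 <= certainty <= 100:
        raise ValueError("certainty must lie between 0 and 100")
    return signs[judgment] * certainty


if __name__ == "__main__":
    print(code_category("soprano", 80))
    print(code_category("mezzo-soprano", 100))
```

Under this coding, a rating near 0 reflects low certainty in either direction, so mean ratings near 0 can arise from uncertain responses or from confident responses that cancel across listeners.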

RESULTS

Voice category judgments

Consistent stimuli

Based on previous research,3,9,10 it was hypothesized that formant pattern and pitch would interact in the perception of voice category. To test this hypothesis, a 5 × 7 repeated measures ANOVA was performed to determine the effects of formant pattern and pitch on the perception of voice category. Significant effects were observed for formant pattern [F(4,56) = 27.942, p < 0.001], for pitch [F(6,84) = 26.329, p < 0.001], and for the interaction of formant pattern and pitch [F(24,336) = 2.512, p < 0.001]. The interaction between formant pattern and pitch in the perception of voice category is presented graphically as a function of formant pattern in Figure 3.
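As a reading aid, the degrees of freedom reported above match the uncorrected values for a design in which both factors are within-subject and 15 experienced listeners took part. The sketch below shows that arithmetic only; the function name `rm_anova_df` is hypothetical, and no sphericity correction is assumed because the text does not mention one.

```python
def rm_anova_df(levels_a: int, levels_b: int, n_subjects: int) -> dict:
    """Uncorrected (effect, error) df for a two-way within-subject ANOVA.

    Each effect's error term is its interaction with subjects, so the
    error df is the effect df multiplied by (n_subjects - 1).
    """
    s = n_subjects - 1
    a = levels_a - 1
    b = levels_b - 1
    return {
        "A": (a, a * s),
        "B": (b, b * s),
        "AxB": (a * b, a * b * s),
    }


if __name__ == "__main__":
    # 5 formant patterns x 7 pitches, 15 experienced listeners.
    for effect, df in rm_anova_df(5, 7, 15).items():
        print(effect, df)
```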


Generally, as the frequencies of all four formants increased, mean perception of soprano voice category increased at all but the highest pitch, A5, where all formant patterns were perceived as soprano. Visual inspection of the data suggests that the mezzo-soprano/soprano cross-over point occurs between successively higher formant patterns as pitch increases. This hypothesis was tested by calculating the area under the curve for each pitch and statistically testing the results, because the area under the curve will decrease as the cross-over point increases in pitch.14 The first step in this process was the conversion of all listener ratings to positive values by adding 100 to each rating. The second step in this process was the calculation of the area under the curve for each pitch for each participant. In this case, because the data are discrete, the area under the curve for each pitch for each participant is simply the sum of each participant’s ratings for each formant pattern and is represented by the formula:

$$\Sigma_{pP} = R(A)_{pP} + R(B)_{pP} + R(C)_{pP} + R(D)_{pP} + R(E)_{pP}$$

where $R$ = rating, $p$ = pitch, and $P$ = participant. Calculation of Σ for one participant is illustrated in Figure 4. At the pitch C4, the ratings for each pattern are as follows: A = 9.5, B = 38.5, C = 38.5, D = 59, and E = 133.5. Therefore, for this participant, $\Sigma_{C4} = 9.5 + 38.5 + 38.5 + 59 + 133.5 = 279$. Examination of the entire graph shows an inverse relationship between mezzo-soprano/soprano cross-over point and Σ. Thus, Σ provides a measure that correlates with cross-over point and can be used in parametric statistical analyses. The resulting data were subjected to a 1 × 7 repeated measures ANOVA to determine the effect of pitch on the area under the curve. A significant effect for pitch was found [F(6,78) = 24.008, p < 0.001]. The interaction between formant pattern and pitch in the perception of voice category is presented graphically as a function of pitch in Figure 5.
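The two-step area calculation described above (shift each rating by 100, then sum across formant patterns) can be sketched as follows. The function name `area_under_curve` is hypothetical. The raw ratings in the usage example are inferred by subtracting 100 from the shifted values quoted in the text for pitch C4, so they are an assumption rather than reported data.

```python
def area_under_curve(ratings: dict) -> float:
    """Sum of ratings across formant patterns after shifting each by +100.

    `ratings` maps a formant pattern label to a rating in [-100, 100]
    for one participant at one pitch.
    """
    return sum(r + 100 for r in ratings.values())


if __name__ == "__main__":
    # Raw ratings inferred from the shifted C4 values in the text
    # (9.5, 38.5, 38.5, 59, 133.5) by subtracting 100.
    c4 = {"A": -90.5, "B": -61.5, "C": -61.5, "D": -41.0, "E": 33.5}
    print(area_under_curve(c4))
```

For the C4 example this reproduces the value of 279 given in the text. The possible range is 0 (every pattern rated as certainly mezzo-soprano) to 1000 (every pattern rated as certainly soprano).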
Mean perception of soprano voice category by experienced listeners increased with pitch for the higher frequency formant patterns C, D, and E. However, patterns A and B appear less affected by pitch, with experienced listeners perceiving these stimuli as being produced by a mezzo-soprano at all pitches but A5. One way to consider this interaction is to examine the mezzo-soprano/soprano cross-over point as a function of pitch. The cross-over is inversely related to pattern, with patterns A and B crossing between F5 and A5, patterns C and D crossing between B4 and D5, and pattern E crossing between C4 and E4. All patterns are perceived as highly soprano at A5, which suggests that the widely spaced harmonics at this pitch do not provide enough acoustic information to separate voices into categories, a finding that is supported by previous research.10,20 The hypothesis that the mezzo-soprano/soprano cross-over point varies inversely with formant pattern was tested using the area under the curve procedure as described above. The area under the curve was calculated for each formant pattern for each participant. The 1 × 5 repeated measures ANOVA revealed a significant effect for formant pattern [F(4,52) = 30.510, p < 0.001].

FIGURE 3. Mean ratings of voice category and standard errors of the mean as a function of formant pattern for each pitch.

FIGURE 4. Category ratings and corresponding Σ values at each pitch for one sample participant.

FIGURE 5. Mean ratings of voice category as a function of pitch.

Conflicting stimuli

It has been speculated that the perception of voice category is largely based on perception of F3 and F4, because F1 and F2 vary as a function of vowel.5,9 In Figure 6, mean perception of voice category for consistent stimuli, where all four formant frequencies were either those of pattern A or those of pattern E, is compared with mean perception of voice category for conflicting stimuli, where the first and second formant frequencies were those of pattern A and the third and fourth formant frequencies were those of pattern E or vice-versa. Although for the consistent stimuli pattern E is perceived by experienced listeners as more soprano-like than pattern A at all pitches but A5, there appears to be no difference in perception of voice category between the two types of conflicting stimuli; that is, there appears to be no consistent or significant difference in judgment of voice category regardless of whether F1 and F2 were typical of a mezzo-soprano while F3 and F4 were typical of a soprano, or vice-versa. However, there does appear to be an effect of pitch, with perception of voice category moving toward more soprano-like with increasing pitch. Also, there appears to be an interaction between type of conflicting pattern and pitch. Stimuli where the frequencies of the third and fourth formant are those of pattern E (gray triangles) are perceived by experienced listeners as slightly mezzo-soprano at all pitches except A5. On the other hand, stimuli where the frequencies of the third and fourth formant are those of pattern A (black circles) appear linearly related to pitch.

To test these observations, a 2 × 7 repeated measures ANOVA was performed to determine the effects of conflicting formant pattern and pitch on the perception of voice category. Conflicting formant pattern did not have a significant effect on the perception of voice category. Significant effects were observed for pitch [F(6,84) = 20.837, p < 0.001] and the interaction of conflicting formant pattern and pitch [F(6,84) = 2.478, p = 0.03].
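The results above single out A5 as the pitch at which category ratings converge, and the text attributes this to widely spaced harmonics. The sketch below illustrates that spacing by counting the harmonics that fall below a cutoff. It is an illustration only: the fundamental frequencies assume equal temperament with A4 = 440 Hz (the paper does not state the exact values used), the 4000 Hz cutoff is an arbitrary choice just above the highest F4 in Table 1, and the function names are hypothetical.

```python
# Semitone offsets of the natural notes within an octave.
SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(note: str, a4: float = 440.0) -> float:
    """Equal-tempered frequency of a natural note such as 'C4' or 'A5'."""
    name, octave = note[0], int(note[1:])
    midi = 12 * (octave + 1) + SEMITONE[name]
    return a4 * 2 ** ((midi - 69) / 12)


def harmonics_below(f0: float, limit_hz: float) -> list:
    """Harmonic frequencies k * f0 that do not exceed limit_hz."""
    return [k * f0 for k in range(1, int(limit_hz // f0) + 1)]


if __name__ == "__main__":
    for note in ["C4", "E4", "G4", "B4", "D5", "F5", "A5"]:
        f0 = note_to_hz(note)
        count = len(harmonics_below(f0, 4000.0))
        print(f"{note}: f0 = {f0:.1f} Hz, harmonics below 4000 Hz: {count}")
```

Under these assumptions, A5 has a fundamental of 880 Hz and only four harmonics below 4000 Hz, whereas C4 has fifteen, so the spectral envelope is sampled far more sparsely at the top of the range.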

FIGURE 6. Mean voice category judgments and standard errors of the mean as a function of pitch for formant patterns A and E (A) and for mixed formant patterns (B).

Jaw position judgments

It was hypothesized that in cases where the lower two formant frequencies were in conflict with the upper two formant frequencies, listeners would interpret the unexpectedly high or low first formant in terms of physiology. Although F1 is affected by both lip opening and jaw position,15 only the perception of jaw position was examined in this study. That is, it was speculated that when F1 and F2 were from pattern E while F3 and F4 were from pattern A, listeners would perceive an open jaw position, whereas in the reverse case, listeners would perceive a closed jaw position. Experienced and inexperienced participants’ mean judgments of jaw position based on conflicting stimuli are presented in Figure 7. For both groups, jaw position is perceived to be more open when F1 and F2 are of pattern E while F3 and F4 are of pattern A than when the patterns are reversed.

FIGURE 7. Mean jaw position judgments and standard errors of the mean as a function of pitch for experienced (A) and inexperienced (B) listeners.

A 2 × 2 × 6 repeated measures ANOVA was performed to test the effects of listener experience, formant pattern, and pitch on the perception of jaw position. The highest pitch, A5, was not included in the analysis because in experiments presenting isolated sounds, perception of categorical differences at this pitch is typically absent regardless of listener experience.10,20 Significant effects were found for listener experience [F(1,28) = 5.887, p = 0.022], pattern [F(1,28) = 35.834, p < 0.001], pitch [F(5,140) = 2.718, p = 0.022], and the interaction between pitch and listener experience [F(5,140) = 2.616, p < 0.027].

The interaction between pitch and listener experience can further be understood by visually inspecting Figure 7. With the exception of the highest pitch, A5, pitch does not appear to affect experienced listener perception of jaw opening, but it does appear to affect the responses of inexperienced listeners. To test this hypothesis, two separate 2 × 6 repeated measures ANOVAs were performed. For experienced listeners, significant effects were observed for formant pattern [F(1,14) = 19.978, p = 0.001] and for the interaction of formant pattern and pitch [F(5,70) = 4.893, p = 0.047], but not for pitch. For inexperienced listeners, significant effects were observed for formant pattern [F(1,14) = 15.915, p = 0.001] and pitch [F(5,70) = 3.943, p = 0.003]. The interaction of conflicting formant pattern and pitch was not significant.

When examining individual listener ratings of jaw opening in response to conflicting stimuli, four distinct patterns of response could be seen: formant-based, random, positively correlated with pitch, and negatively correlated with pitch. Prototypical examples of each pattern are presented in Figure 8. The primary difference between experienced and inexperienced listeners can be seen in the distribution of these patterns (Table 2). Two thirds of experienced listeners perceived large differences in jaw opening, with high-frequency first and second formants relative to low-frequency third and fourth formants being perceived as being produced with an open jaw, whereas the reverse pattern was perceived as being produced with a closed jaw. One third of experienced listeners responded in a random fashion. No experienced listener was influenced by pitch. On the other hand, only six inexperienced listeners perceived an open jaw position when F1 and F2 were high in frequency relative to F3 and F4 and a closed jaw position when the pattern was reversed. Five inexperienced listeners responded in a random fashion, whereas three perceived increasing jaw opening with pitch, and one listener perceived decreasing jaw opening with pitch.

FIGURE 8. Graphic representations of the four basic response patterns observed in the jaw position task: formant-based (A), random (B), positively correlated with pitch (C), and negatively correlated with pitch (D).

TABLE 2. Jaw position response patterns for experienced and inexperienced listeners (number of listeners)

Pattern                  Experienced listeners   Inexperienced listeners
Formant-based            10                      6
Random                   5                       5
Increasing with pitch    0                       3
Decreasing with pitch    0                       1

DISCUSSION

Consistent stimuli

The results for the consistent stimuli suggest that there is a strong interaction between formant frequency and pitch. Yet this may not be entirely true for all experienced listeners or for all formant patterns. Consider the situation for pattern A, modeled after a prototypical mezzo-soprano, and pattern E, modeled after a prototypical soprano. The mean ratings of voice category suggest that pattern A was generally perceived by experienced listeners as mezzo-soprano at all pitches but the highest pitch, A5, and that this perception was essentially unaffected by pitch. On the other hand, the mean ratings of voice category suggest that pattern E, while being perceived by experienced listeners as soprano at all pitches, was linearly related to pitch. However, the data exhibited a great deal of variability. When individual category ratings are examined, three response patterns emerge: pitch-independent category ratings, pitch-dependent category ratings, and random category ratings. Some experienced listeners were able to easily perceive differences between patterns A and E with little or no effect of pitch below A5. In fact, 9 of the 15 experienced listeners displayed patterns similar to that seen in Figure 9A, although many did not perceive the categories quite so clearly. Only 4 listeners were highly influenced by pitch (Figure 9B), with pattern E being much more heavily influenced by pitch than pattern A. The remaining two experienced listeners were unable to consistently perceive voice categories, but rather responded in an apparently random fashion.

However, for the intermediate patterns, B to D, there was a strong interaction between pitch and formant frequency in the perception of voice category for most experienced listeners, particularly for the intermediate formant frequencies used in pattern C. Thus, it appears that when presented with the most extreme formant patterns, A and E, many, but not all, experienced listeners will be able to clearly perceive voice categories independent of pitch at all pitches but A5, but they will have more difficulty perceiving voice categories independent of pitch for formant frequency patterns that fall between these two extremes.

Conflicting stimuli

It may be that when classifying auditory stimuli, listeners extract perceptual dimensions separately and weight them to arrive at a classification. On the other hand, it may be that the classification of auditory stimuli is a gestalt process, wherein the


whole is greater than the sum of its parts.21 To provide a visual analogy, consider what might happen if a novel creature is created by attaching the head of a dog to the body of a cat or vice-versa. When asked to categorize these hybrids as dogs or cats, people might perceive two parameters, head and body, and weight these parameters to arrive at a decision. Or they might perceive the animal in its totality, concluding that it is neither cat nor dog, but something new. In the first case, the more strongly weighted parameter should dominate the categorization, resulting in category decisions that are greater than chance. In the second case, the gestalt catdog would not be perceived as either a cat or a dog, resulting in random category decisions at chance levels.

In this study, when presented with novel auditory stimuli synthesized with conflicting F1-F2 and F3-F4, experienced listeners could not systematically assign stimuli to either voice category, mezzo-soprano or soprano. Instead, all stimuli were perceived approximately equally (and therefore at chance levels), with pitch acting as the only perceptual parameter in their decision making. There are two possible reasons for this outcome. First, it is possible that the listeners were able to extract separate information concerning F1-F2 and F3-F4, but applied equal weights to both parameters, resulting in perceptual confusion. Second, it is possible that listeners perceive each voice category as a gestalt, where the whole is greater than the sum of its parts, and therefore, they were not able to place the novel stimuli in either category, resulting in random responses.

On the other hand, it appears that both experienced and inexperienced listeners attempt to make sense of conflicting stimuli in terms of physiology. Although both lip opening and jaw position affect F1,15 listeners in this study were only asked to rate degree of jaw opening.
It seems that when asked to do so, listeners can interpret a first formant frequency that is either too high or too low relative to F3 and F4 in terms of jaw position. The question remains, is this ability innate or learned? One possibility is that listeners in general have an innate ability to perceive the effects of raising and lowering the jaw. This implies an inherent articulatory knowledge on the part of the listener. Such a theory is not unlike

FIGURE 9. Exemplars of the two basic response patterns for patterns A and E: pitch-independent (A) and pitch-dependent (B).

the Motor Theory of Speech Perception proposed by researchers from Haskins Laboratories.22 However, this theory is not without its detractors.23 Another possibility is that the perception of jaw opening is learned. By comparing the perception of jaw opening by experienced listeners to that of inexperienced listeners, it is possible to test which of these two hypotheses seems more likely. If this ability is innate, then we would expect to see similar patterns of perception in both experienced and inexperienced listeners. This was not the case. Experienced listeners were better able to clearly perceive jaw position regardless of pitch. Such a finding supports the notion that perception of jaw position is learned.

Synthetic versus natural voices

In this study, the synthetic stimuli did not incorporate effects typically seen across pitch in natural voices. No attempt was made to simulate the effects of laryngeal elevation with pitch. If formant patterns were systematically raised with pitch, it might result in greater interaction between formant pattern and pitch, especially for the typical mezzo-soprano stimuli. Conversely, it might also be postulated that experienced listeners expect such increases in formant frequency with pitch and can compensate for these
increases. This would not be unlike the ability of listeners to adjust their perception of pitch accent as a function of declination. In the case of pitch accent, when the strength of the pitch accent remains constant as the overall pitch of the sentence declines, listeners perceive successive pitch accents as being stronger. However, if such a process were operating in these data, we would have expected to see an inverse relationship between the perception of voice category and pitch in this study. No such relationship was observed.

The effects of jaw opening on the first formant were not explicitly incorporated in these data. In the consistent stimuli, the first formant was not raised with pitch to simulate jaw lowering. In the conflicting stimuli, both the first and second formants were either higher or lower than what would be expected relative to the frequencies of the third and fourth formants. Listeners were able to perceive differences in voice category based on formant pattern in the consistent stimuli, but not in the conflicting stimuli. However, experienced listeners were able to perceive a jaw position in the conflicting stimuli. What remains unknown is whether listeners are able to perceive both jaw opening and voice category if F1 is raised with pitch while F2, F3, and F4 are held constant.
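
One reason pitch can interact with formant pattern may be illustrated numerically. A voiced source has energy only at integer multiples of the fundamental, so the higher the pitch, the more coarsely the harmonics sample the vocal tract resonances. The sketch below is an illustration only and is not part of the method of this study. The pitch frequencies assume equal temperament with A4 = 440 Hz; the four formant values, the 4 kHz ceiling, and the function names are hypothetical choices and are not the synthesis parameters used here.

```python
def harmonics(f0_hz, max_hz=4000.0):
    """Return the harmonic frequencies n * f0 (n >= 1) that lie at or below max_hz."""
    n_max = int(max_hz // f0_hz)
    return [n * f0_hz for n in range(1, n_max + 1)]


def nearest_harmonic_offset(formant_hz, f0_hz):
    """Return the distance in Hz from a formant to the closest harmonic of f0."""
    n = max(1, round(formant_hz / f0_hz))  # harmonic number closest to the formant
    return abs(formant_hz - n * f0_hz)


# Equal-tempered pitches, assuming A4 = 440 Hz (illustrative selection).
PITCHES_HZ = {"A3": 220.0, "A4": 440.0, "A5": 880.0}

# Hypothetical F1-F4 values for illustration; NOT the values used in the study.
FORMANTS_HZ = [700.0, 1200.0, 2800.0, 3700.0]

if __name__ == "__main__":
    for name, f0 in PITCHES_HZ.items():
        count = len(harmonics(f0))
        offsets = [nearest_harmonic_offset(f, f0) for f in FORMANTS_HZ]
        print(f"{name}: {count} harmonics below 4 kHz, "
              f"formant-to-harmonic offsets (Hz) = {offsets}")
```

Under these assumptions, only 4 harmonics fall below 4 kHz at A5, compared with 18 at A3, and a formant can lie as far as half the fundamental (440 Hz at A5) from the nearest harmonic.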

CONCLUSION

As in other studies,3,9,10 this research has shown that the perception of voice category is affected by both formant frequency and pitch when mean data are considered. However, it should be noted that there is a great deal of variability in experienced listener response; some experienced listeners will likely be able to perceive voice categories independent of pitch, and others will not. It is unclear what variables contribute to the ability to perceive voice categories independent of pitch. Also, as in other studies,10,20 listeners find it difficult to perceive voice categories at high pitches when presented with stimuli in isolation. This is not surprising because formants are not represented by spectral peaks in the output spectrum when harmonics are widely spaced, minimizing spectral differences between sopranos and mezzo-sopranos. Finally, when presented with stimuli comprising conflicting upper and lower formants, experienced listeners find it difficult to perceive voice categories, but instead they are able to interpret the mismatch in terms of jaw position.

Acknowledgements: This research was funded through a grant from the Fulbright Foundation. Sincere thanks must be extended to Johan Sundberg for use of facilities and to Sten Ternström for technical assistance.

REFERENCES

1. Vennard W. Singing, the Mechanism and the Technique. New York: Fisher; 1967.
2. Bloothooft G, Plomp R. The timbre of sung vowels. J Acoust Soc Am. 1988;84:847–860.
3. Cleveland TF. Acoustic properties of voice timbre types and their influence on voice classification. J Acoust Soc Am. 1977;61:1622–1629.
4. ANSI. Psychoacoustical Terminology. S3.20. New York: American National Standards Institute; 1973.
5. Dmitriev L, Kiselev A. Relationship between the formant structure of different types of singing voices and the dimensions of the supraglottal cavities. Folia Phoniatr (Basel). 1979;31:238–241.
6. Sundberg J. The source spectrum in professional singing. Folia Phoniatr. 1973;25:71–90.
7. Berndtsson G, Sundberg J. Perceptual significance of the center frequency of the singer’s formant. Scand J Logop Phoniatr. 1995;20:35–41.
8. Titze IR. Principles of Voice Production. Englewood Cliffs, NJ: Prentice Hall; 1994.
9. Sundberg J. Perceptual aspects of singing. J Voice. 1994;8:106–122.
10. Erickson ML. Dissimilarity and the classification of female singing voices: A preliminary study. J Voice. 2003;17:195–206.
11. Borden GJ, Harris KS, Raphael LJ. Speech Science Primer: Physiology, Acoustics, and Perception of Speech. 3rd ed. Baltimore: Williams and Wilkins; 1994.
12. Sharma A, Kraus N, McGee T, Carrell T, Nicol T. Acoustic versus phonetic representation of speech as reflected by the mismatch negativity event-related potential. Electroencephalogr Clin Neurophysiol. 1993;88:64–71.
13. Pisoni D, Carrell T, Gans S. Perception of the duration of rapid spectrum changes in speech and nonspeech signals. Percept Psychophys. 1983;34:314–322.
14. Hedrick MS, Schulte L, Jesteadt W. Effect of relative and overall amplitude on perception of voiceless stop consonants by listeners with normal and impaired hearing. J Acoust Soc Am. 1995;98:1292–1303.
15. Lindblom B, Sundberg J. Acoustical consequences of lip, tongue, jaw, and larynx movements. J Acoust Soc Am. 1971;50:1166–1179.
16. Klatt DH, Klatt LC. Analysis, synthesis, and perception of voice quality variations among male and female talkers. J Acoust Soc Am. 1990;87:820–857.
17. Warrier C, Zatorre RJ. Influence of tonal context and timbral variation on perception of pitch. Percept Psychophys. 2002;64:198–207.
18. Lakatos S. A common perceptual space for harmonic and percussive timbres. Percept Psychophys. 2000;62:1426–1439.
19. ASHA. Guidelines for screening for hearing impairments and middle ear disorders. ASHA. 1990;32:17–24.
20. Erickson ML, Perry S, Handel S. Discrimination functions: Can they be used to classify singing voices? J Voice. 2001;15:492–502.
21. Handel S. Listening: An Introduction to the Perception of Auditory Events. Cambridge, MA: MIT Press; 1989.
22. Liberman AM, Mattingly IG. The motor theory of speech perception revised. Cognition. 1985;21:1–36.
23. Schouten MEH. The case against a speech mode of perception. Acta Psychol. 1980;44:71–98.

Journal of Voice, Vol. 18, No. 1, 2004