ELSEVIER
Speech Communication 16 (1995) 359-368

Voice quality analysis of male and female Spanish speakers

Pamela Jean Trittin, Andrés de Santos y Lleó *

Depto. Ingeniería Electrónica, E.T.S.I. Telecomunicación, Ciudad Universitaria, 28040 Madrid, Spain
Received 29 November 1993; revised 30 June 1994
Abstract

This paper describes the results of an acoustical analysis which compares the quality of female and male voices of Spanish speakers. The analysis is a pilot study based on a similar one presented by Klatt and Klatt (1990). Results indicate that the Spanish female voice does differ in some respects from that of Spanish males, but to a lesser extent than was found by Klatt and Klatt. The breathy quality found in Spanish females is not too different from that of Spanish males, which may support the assumption that breathiness is a learned, cultural behaviour.

Zusammenfassung

Dieser Artikel beschreibt die Ergebnisse einer akustischen Analyse, die die Qualität männlicher und weiblicher Stimmen von spanischen Sprechern vergleicht. Diese Analyse ist eine Pilot-Studie ähnlich einer von Klatt und Klatt (1990). Die Ergebnisse zeigen, daß die Stimmen von männlichen und weiblichen spanischen Sprechern sich zwar unterscheiden, aber in geringerem Maße als Klatt und Klatt gefunden haben. Die Atmungsgeräusche und die von ihnen beeinflußte Qualität ist bei spanischen Männern und Frauen nicht sehr verschieden; dies kann die Aussage unterstützen, daß Atmungsgeräusche einen Lern- und kulturellen Hintergrund haben.

Résumé

Cet article décrit les résultats d'une analyse acoustique comparant les qualités des voix féminines et masculines des espagnols. Cette analyse est une étude pilote fondée sur une étude semblable réalisée par Klatt et Klatt (1990). Les résultats obtenus indiquent que les voix des femmes espagnoles présentent de grandes différences par rapport à celles des hommes; ces différences sont, cependant, moins marquées que celles signalées par Klatt et Klatt. L'aspiration observée dans la parole des femmes espagnoles ne diffère pas essentiellement de celle des hommes, ce qui permettrait d'affirmer qu'il s'agit d'une aspiration due à des origines d'ordre culturel.

Keywords: Male/female speech; Voice analysis; Voice quality; Spanish
* Corresponding author. Tel.: +34 1 336 73 10. Fax: +34 1 336 73 23.
0167-6393/95/$09.50 © 1995 Elsevier Science B.V. All rights reserved
SSDI 0167-6393(95)00004-6
1. Introduction

Klatt and Klatt (1990) describe the results of an analysis which compares voice quality variations in American male and female speech. The present article presents the results of a similar analysis for Castilian Spanish speakers. The purpose of this work is twofold: (1) to replicate a subset of the Klatt and Klatt analysis (not including perceptual analysis) for the Spanish language and (2) to present data on Spanish vowel quality. Throughout the paper, the reference to Klatt and Klatt (1990) will be assumed, unless otherwise indicated.

In the past, female and children's voices have been somewhat neglected in speech processing. This has also been true for the Spanish language (Rodriguez Crespo et al., 1991). The difficulty in reproducing a female voice is perhaps due to insufficient knowledge of the glottal source characteristics of female speech (Karlsson, 1992a,b; Monsen and Engebretson, 1977; Price, 1989). For excellent literature reviews pertaining to the problem, see Klatt and Klatt (1990), Savoji (1990), Karlsson (1991b) and Fant (1982, 1993). In addition to complications at higher frequencies which make formant frequencies difficult to estimate, it has also been shown that female voices tend to have a breathy quality; that is, the aspiration noise in the vowel spectrum has been found to be greater for women. Finally, it has also been hypothesized that women's voices do not conform as well to an all-pole model.

The results presented here are part of a project to produce a natural-sounding, high-quality female voice for Spanish language synthesis. In order to achieve this objective, it is first necessary to analyze the Spanish female voice to get a better idea of its qualities and characteristics. This paper summarizes the results of a pilot study of Spanish female and male speech to determine which attributes are necessary in order to synthesize a Spanish female voice.
Results indicate that Spanish female speech seems to be less breathy than that of the American English speaker. Other researchers (Karlsson, 1989, 1991a, 1992a) have also found, by analyzing the voice source of different Swedish female voices, that the breathiness quality in women was not always a necessary attribute in female voice synthesis. These findings seem to support the suspicion that a breathy voice quality may be a learned behaviour.

Table 1
Database consisting of fourteen natural sentences and iterative imitations of two of these sentences

Sentence   Text
S1         Juan toma café.
S2         La flor es blanca.
S3         Es para Rosa.
S4         Pepe come pan.
S5         Luz tiene frío.
S6         El río suena.
S7         Rita lo dice.
S8         Pablo se ducha.
S9         ¿Cuál es su nombre?
S10        ¿Qué hora tienes?
S11        ¡Qué tarde llegas!
S12        ¡Haz los deberes!
S13        Olga toma té.      [?V ?V ?V ?V ?V]   [xV xV xV xV xV]
S14        La luna brilla.    [?V ?V ?V ?V ?V]   [xV xV xV xV xV]

Stress markings a: [per-syllable stress-marking column not recoverable from the source]

a The stress mark ‘ refers to verbal stress, and ’ refers to stress of content words.
2. Method

The present acoustic analysis follows a similar speech analysis experiment conducted by Klatt and Klatt (1990). It is based on a database of fourteen natural sentences with varying syllable stress patterns. Reiterative versions of two of the natural sentences were recorded using the syllables [?V] and [xV] ¹, where V corresponds to one of the five vowels of the Spanish language, [a, e, i, o, u]. The purpose of using reiterative imitations of natural sentences is to control consonantal influences on the vowels as much as possible. A list of the sentences is shown in Table 1.

In total, ten speakers (five males and five females) were recorded in a sound-isolated booth. A Shure SM10A microphone was used and placed approximately two inches to the left side of the breath stream. The database was recorded using a Sony DAT 57ES and digitized at 16,000 samples per second. The subjects chosen were all university students of about the same age. All of them have lived at least the first thirteen years of their life in the same dialect region of Castile. The speakers recorded, along with their dialect history, are shown in Table 2.

Acoustic speech waveforms were viewed, as well as spectrograms, frequency curves, and FFTs and LPCs of the spectra, using PCVox, a software package developed in our institution. The following section describes the techniques used to analyze the subjects' speech. The results of this analysis are then presented and compared to those obtained by Klatt and Klatt (1990).

Table 2
Dialect history of ten Spanish speakers

Female   Age   Dialect history        Male   Age   Dialect history
                (years)   (region)                  (years)   (region)
AG       24    0-13      Madrid       AM     23    0-13      Madrid
AS       25    0-13      Madrid       GG     26    0-13      Daimiel
EH       26    0-13      Madrid       JF     27    0-13      Madrid
RF       23    0-13      Madrid       JL     25    0-13      Toledo
SF       25    0-13      Madrid       MA     26    0-13      Madrid

¹ The Castilian dialect of Spanish does not include [h] in its phoneme set; therefore, [x] was chosen as a sound in which the glottis is partly open while there is sound generation in the larynx.
3. Results

One well-known difference between male and female voices is fundamental frequency, f0. Overall, the f0 values of Spanish females averaged 230 Hz, whereas Spanish males averaged 130 Hz. The average f0 of Spanish females is therefore approximately 1.8 times that of Spanish males, which comes very close to the 1.7 value obtained by Klatt (1987).

Another potential difference between men and women pertains to the degree of noise in the higher-formant region of the spectrum, one acoustic correlate of breathiness. As in the analysis by Klatt and Klatt, the following hypothesized cues to breathiness were analyzed: (1) the relative strength of the first harmonic, (2) the presence of aspiration noise in the vowel spectrum at higher frequencies, and (3) changes in the vocal-tract transfer function when the glottis is partially closed. The following subsections describe these parts of the analysis.

3.1. First harmonic relative strength

One of the hypothesized cues to breathiness is the relative strength of the first harmonic, which has a tendency to increase as the open quotient increases. Open quotient is defined as the ratio of open time to total period, i.e., the fraction of each vibratory period during which the glottis is open. Breathy voices tend to have larger open quotients than clear voices (Klatt and Klatt, 1990).

A fast Fourier transform (FFT) spectrum of 1024 points was used to measure the relative strength of the first harmonic of the reiterative sentences. As suggested by Bickley (1982), the second harmonic, H2, was used as the reference against which the relative strength of the first harmonic, H1, was determined ². The H1 and H2 amplitudes were measured without the usual first difference, at vowel midpoint, to reduce consonant influences as much as possible. A typical spectrum of the vowel /a/ can be seen in Fig. 1. Results of the relative strength of the first harmonic are shown in Table 3. The values corresponding to each syllable were calculated in the same manner as Klatt and Klatt ³: H1 + 10 dB - H2.

Fig. 1. A typical spectrum of the vowel /a/ by male speaker AM, measured with the usual first difference.

Table 3
Amplitude (dB) of the first harmonic relative to the amplitude of the second harmonic of iterative sentences [?a] and [xa]

Females     Syll1   Syll2   Syll3   Syll4   Syll5   Ave.
S13 [?a]    10.6    10.6    10.6    10.0    10.6    10.5
S13 [xa]     9.4     9.0    10.4     9.6    10.4     9.8
S14 [?a]     8.8    10.8    11.0    10.8     8.6    10.0
S14 [xa]     7.8    10.4     9.4     9.4     9.2     9.2
Ave.         9.2    10.2    10.4    10.0     9.7     9.9

Males       Syll1   Syll2   Syll3   Syll4   Syll5   Ave.
S13 [?a]     8.4     8.0     6.4     7.2     8.6     7.7
S13 [xa]     9.2    10.0     9.6     9.2     6.6     8.9
S14 [?a]     6.6     8.4     7.8     6.4     5.8     7.0
S14 [xa]     7.8     8.4     8.0     6.8     6.8     7.6
Ave.         8.0     8.7     8.0     7.4     7.0     7.8

It has been seen with American female speakers that their first harmonic value is 5.7 dB higher than that of American males (Klatt and Klatt, 1990). This indicates that American females are more breathy than males, to the extent that H1 is an acoustic cue to breathiness. Table 3 gives the Spanish male/female averages ⁴. The amplitude of the first harmonic compared to the second harmonic of the vowel /a/ for Spanish males is 7.8 dB, which is 2.1 dB lower than for Spanish females. This difference is statistically significant (p < 0.001, using the paired t test). The difference is similar for natural sentences: on average, the relative first-harmonic strength of natural sentences of men is 2.8 dB lower than that of women. These results seem to indicate that Spanish females are slightly more breathy than their male counterparts.

Looking at other Spanish vowels, we found that the medium, open vowels /e/ and /o/ have fairly close averages for women and men. The female averages for the vowels /i/ and /u/, on the other hand, are considerably larger than the male averages. This was found to be true for both iterative and natural sentences. These results can perhaps best be explained by the nature of these high, closed vowels at higher frequencies, because H1 is found to be greater than H2. Table 4 shows the relative amplitude of the first harmonic for all Spanish vowels of both iterative and natural sentences.

Table 4
Relative first harmonic amplitude (dB) of Spanish vowels /a, e, i, o, u/ of both natural and iterative sentences

            Natural   Iterative
Females
/a/         10.0       9.9
/e/          5.8       3.5
/i/         11.0      18.0
/o/          5.4       3.1
/u/         11.1      10.8
Males
/a/          7.2       7.8
/e/          5.5       5.3
/i/          1.8       2.3
/o/          5.1       5.3
/u/          3.5       2.1

Table 5 illustrates the results of the relative first-harmonic amplitude of sentences S2 and S14. These two sentences were chosen because the first and last syllables involve the vowel /a/. For women's voices, the average value for the first syllable is 1.4 dB less than for the last syllable, but for men the last syllable is 2.3 dB less than the first (the difference for women is not statistically significant at p = 0.10, but for men it is significant at p = 0.01, according to the t test). Klatt and Klatt found that the first harmonic is about 2 dB weaker in the last syllable for women and 2.5 dB weaker for males, which seems to indicate that males laryngealize slightly more than females.

Table 5
Amplitude (dB) of first harmonic relative to second harmonic of natural sentences S2 and S14

          Females             Males
          Syll1    Syll5      Syll1    Syll5
S2         9.2     11.4        9.2      6.4
S14        7.0      7.6        6.4      4.6
Ave.       8.1      9.5        7.8      5.5

² Other references such as rms amplitude or first formant values could also have been used, but to be consistent with Klatt and Klatt's analysis, we chose the second harmonic amplitude value.
³ Adding 10 dB ensures positive number results.
⁴ In Tables 3, 4, 5 and 7, the values shown are the averages of the 5 female or 5 male measures. The numbers in bold face type are the averages of the values in the same row or column.
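The relative first-harmonic measure reported in Tables 3-5 (H1 + 10 dB - H2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PCVox procedure: all function names are hypothetical, f0 is assumed to be known in advance, the frame is analysed with a rectangular window, and the harmonic levels are taken from a direct DFT of the few bins around f0 and 2 f0 rather than from a full 1024-point FFT.

```python
import math


def bin_level_db(frame, k):
    """Level (dB, arbitrary reference) of DFT bin k of a real-valued frame."""
    n = len(frame)
    re = sum(x * math.cos(2.0 * math.pi * k * i / n) for i, x in enumerate(frame))
    im = sum(x * math.sin(2.0 * math.pi * k * i / n) for i, x in enumerate(frame))
    return 20.0 * math.log10(math.hypot(re, im) + 1e-12)


def harmonic_level_db(frame, fs, freq_hz, search_bins=2):
    """Largest bin level within +/- search_bins of the expected harmonic."""
    n = len(frame)
    centre = int(round(freq_hz * n / fs))
    first = max(1, centre - search_bins)
    last = min(n // 2, centre + search_bins)
    return max(bin_level_db(frame, k) for k in range(first, last + 1))


def relative_h1_db(frame, fs, f0_hz, offset_db=10.0):
    """H1 + offset - H2; the 10 dB offset only keeps the values positive."""
    h1 = harmonic_level_db(frame, fs, f0_hz)
    h2 = harmonic_level_db(frame, fs, 2.0 * f0_hz)
    return h1 + offset_db - h2


def synthetic_frame(fs, n, f0_hz, a1, a2):
    """Hypothetical test signal: two harmonics with amplitudes a1 and a2."""
    return [a1 * math.sin(2.0 * math.pi * f0_hz * i / fs)
            + a2 * math.sin(2.0 * math.pi * 2.0 * f0_hz * i / fs)
            for i in range(n)]


# 1024-point frame at 16 kHz; H2 is half the amplitude of H1 (about 6 dB lower).
frame = synthetic_frame(16000, 1024, 125.0, 1.0, 0.5)
print(round(relative_h1_db(frame, 16000, 125.0), 1))  # 16.0
```

With real speech, f0 would have to be estimated first and a tapered window would normally be applied; both steps are left out here to keep the arithmetic of the measure visible.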
Results of natural and iterative sentences indicate that this is also true for Spanish.

Between the speakers, there is a wide range of values pertaining to the first-harmonic amplitude (see Table 6). However, our results differ from Klatt and Klatt's in that the average difference between males and females is smaller (2.1 dB compared to Klatt and Klatt's 5.7 dB), although significant. Also, we find a wider range for male speakers than for female speakers. Speaker JL, for example, has an average relative first harmonic value of 5.3 dB, whereas speaker JF's value is 10.5 dB, which is higher than four of the female speakers' averages. These average H1-amplitude data suggest that males have a slightly shorter open quotient than females on average.

3.1.1. Stress and syllable amplitude

As in the Klatt and Klatt analysis, the rms amplitude was measured at vowel midpoint with a 25-ms Hamming window. Table 7 shows the results for iterative sentences [?a] and [xa]. The results show that stressed syllables (indicated in italics) are on average 1.9 dB more intense for females and 1.3 dB more intense for males than unstressed syllables. These differences, however, are not statistically significant (even for p < 0.1 according to the t test for unpaired data). The difference between American males and females is greater: American females and males speak stressed syllables on average 3.0 dB and 4.3 dB more intensely, respectively.

Also, there tends to be a final syllable drop in amplitude, even if the last syllable is stressed. The amplitude with respect to syllable position of iterative sentences [?a] and [xa] has an average fall of 1.3 dB (1.0 dB fall for Klatt and Klatt) from syllable 1 to syllable 3 and an average drop of 3.8 dB (4.3 dB fall for Klatt and Klatt) from syllable 3 to syllable 5. All other Spanish vowels behave similarly. Over all iterative sentences, an average fall of 0.7 dB from syllable 1 to syllable 3 and a 3.6 dB drop from syllable 3 to syllable 5 is found. The overall average fall of natural sentences from syllable 1 to syllable 3 is 2.5 dB, and a drop of 6.2 dB is found between syllables 3 and 5.

Table 7
Rms amplitude (dB) at vowel midpoint of iterative sentences [?a] and [xa]

Female averages
Sentence    Syll1   Syll2   Syll3   Syll4   Syll5   Ave.
S13 [?a]    78.3    75.7    75.6    73.5    72.4    75.1
S13 [xa]    77.9    76.0    73.4    72.1    70.6    74.0
S14 [?a]    75.2    77.8    74.4    73.3    69.5    74.0
S14 [xa]    75.2    76.5    74.2    74.3    65.1    73.1

Male averages
Sentence    Syll1   Syll2   Syll3   Syll4   Syll5   Ave.
S13 [?a]    79.1    78.6    75.9    77.2    75.0    77.2
S13 [xa]    78.7    77.4    74.7    74.3    77.0    76.4
S14 [?a]    77.6    79.7    79.2    78.6    75.1    79.9
S14 [xa]    77.6    77.6    77.6    73.1    69.8    75.1
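The rms measurement of Section 3.1.1 (a 25-ms Hamming window centred on the vowel midpoint) and the stressed/unstressed comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the vowel boundaries are assumed to be known from a prior segmentation, and the dB reference is arbitrary, so only level differences are meaningful.

```python
import math


def hamming_window(n):
    """Hamming window of n points (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def rms_db_at_midpoint(samples, fs, vowel_start_s, vowel_end_s, window_s=0.025):
    """Rms level (dB, arbitrary reference) of a windowed frame at vowel midpoint."""
    n = int(round(window_s * fs))
    mid = int(round(0.5 * (vowel_start_s + vowel_end_s) * fs))
    first = mid - n // 2
    if n < 2 or first < 0 or first + n > len(samples):
        raise ValueError("analysis window does not fit inside the signal")
    window = hamming_window(n)
    segment = samples[first:first + n]
    mean_square = sum((x * w) ** 2 for x, w in zip(segment, window)) / n
    return 10.0 * math.log10(mean_square + 1e-20)


def stress_difference_db(levels_db, stressed):
    """Mean level of stressed syllables minus mean level of unstressed ones."""
    s = [lv for lv, flag in zip(levels_db, stressed) if flag]
    u = [lv for lv, flag in zip(levels_db, stressed) if not flag]
    return sum(s) / len(s) - sum(u) / len(u)


# Hypothetical example: the same tone at two amplitudes, sampled at 16 kHz.
fs = 16000
quiet = [math.sin(2.0 * math.pi * 200.0 * i / fs) for i in range(fs)]
loud = [2.0 * x for x in quiet]
gain = rms_db_at_midpoint(loud, fs, 0.4, 0.6) - rms_db_at_midpoint(quiet, fs, 0.4, 0.6)
print(round(gain, 1))  # 6.0 (doubling the amplitude adds about 6 dB)
```

Because the window and frame position are identical for both measurements, the window shape cancels out of the level difference, which is why a difference in dB is a convenient way to compare stressed and unstressed syllables.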
Table 6
Individual data pertaining to relative first harmonic values (dB) of the iterative sentences [?a] and [xa]

Females   S13 [?a]   S13 [xa]   S14 [?a]   S14 [xa]   Ave.
AG        12.6        9.4        8.0        6.8        9.2
AS        12.4       10.4       13.0       11.0       11.7
EH         9.0        9.6        9.8       10.0        9.6
RF         9.0       11.0       10.8       10.8       10.4
SF         9.4        8.4        8.4        7.6        8.5
Ave.                                                   9.9

Males     S13 [?a]   S13 [xa]   S14 [?a]   S14 [xa]   Ave.
AM         9.0        8.4        8.8        8.2        8.6
GG         8.0        8.6        6.4        7.8        7.7
JF         9.0       13.4        8.8       10.6       10.5
JL         4.6        6.8        4.2        5.6        5.3
MA         8.0        7.4        6.8        5.6        7.0
Ave.                                                   7.8

3.2. Aspiration noise in the vowel spectrum
The second proposed cue to breathiness is the amount of noise found in the spectrum at higher frequencies. In order to assess the degree of breathiness, a filter was used to single out the periodic components of the F3 region. This was done because lower frequencies tend to dominate the visual impression of waveforms. F3 formant locations were estimated by viewing an LPC curve with 20 points at vowel midpoint of each syllable. Then, the original digitized waveform was filtered using an optimal bandpass filter with a bandwidth of 600 Hz, using individual F3 values as the centre frequency. The filtered waveform was then subjectively viewed to determine the amount of noise present. Fig. 2 shows filtered waveforms of the vowel /a/. The top waveform contains little or no aspiration noise and the bottom shows evidence of aspiration noise.

Fig. 2. Top: third formant bandpass filtered waveform of /a/, with little or no aspiration noise. Bottom: /a/ with aspiration noise present.

The following four-step scale by Klatt and Klatt was used to quantify the presence or absence of noise:
(1) periodic, no visible noise;
(2) periodic but occasional noise intrusion;
(3) weakly periodic, clear evidence of noise excitation;
(4) little or no periodicity, noise is prominent.

The degree of aspiration noise of iterative sentences S13 and S14 is shown in Tables 8 and 9. In general, noise increases towards the end of the sentence, even if the last syllable is stressed (as in the case of S13). We see from the results that the noise content for females is only slightly higher than that of males (no significant difference even at level p = 0.10 according to the Mann-Whitney test). The Spanish male/female difference (0.1/0.0) compares to the American male/female difference (1.0/0.4).

Table 8
Degree of periodicity versus noise excitation of F3 of the iterative sentences S13 [xa] (á a à a á)

Females   F3 (Hz)   Syll1   Syll2   Syll3   Syll4   Syll5   Ave.
AG        2750      2       1       1       1       2       1.4
AS        2950      2       2       3       2       3       2.4
EH        2500      2       2       2       2       3       2.2
RF        2950      2       2       3       4       4       3.0
SF        2750      2       2       2       2       3       2.2
Ave.                2.0     1.8     2.2     2.2     3.0     2.2

Males     F3 (Hz)   Syll1   Syll2   Syll3   Syll4   Syll5   Ave.
AM        2600      1       2       2       2       3       2.0
GG        2300      2       2       2       3       4       2.4
JF        2200      3       2       2       3       2       2.4
JL        2400      1       2       2       2       3       2.0
MA        2200      1       1       2       2       2       1.6
Ave.                1.6     1.8     2.0     2.6     2.8     2.1

Table 9
Degree of periodicity versus noise excitation of F3 of the iterative sentences S14 [xa] (a á a à a)

Females   F3 (Hz)   Syll1   Syll2   Syll3   Syll4   Syll5   Ave.
AG        2750      -       -       -       -       -       1.8
AS        2950      1       2       2       2       3       2.0
EH        -         2       2       3       3       4       2.8
RF        -         2       3       3       3       4       3.0
SF        -         -       -       -       -       -       -
Ave.                1.6     2.0     2.4     2.4     3.4     2.4

Males     F3 (Hz)   Syll1   Syll2   Syll3   Syll4   Syll5   Ave.
AM        -         1       2       2       3       3       2.2
GG        2300      2       2       2       2       4       2.4
JF        2200      3       2       3       3       3       2.8
JL        2400      2       3       2       2       3       2.4
MA        2200      2       2       1       2       3       2.0
Ave.                2.0     2.2     2.0     2.4     3.2     2.4

(Entries marked "-" are missing in the source. The assignment of the surviving female values to speakers AS, EH and RF, and of the surviving F3 values to speakers, is inferred from the row averages and from Table 8.)

At the syllable level, we find that there is more noise in the last syllable. Also, unstressed vowels tend to have more aspiration noise than stressed vowels. In sentence S14, we find a stress difference of 0.1 units (unstressed vowels are more noisy) and a difference between first and final syllables of 1.5 units (final syllables being more noisy). In sentence S13, however, we find a stress difference of 0.2 (stressed syllables having more noise) and a first-final syllable difference of 1.1 units (final syllables being more noisy). The noisier stressed vowels in S13 are probably due to the fact that the final syllable is a stressed syllable, which has a considerable amount of noise.

The paradox observed by Klatt and Klatt also pertains to our results. They found the following to be general tendencies: (1) more noise in the F3 region implies greater glottal airflow in final syllables and (2) weak 1st-harmonic amplitude implies pressed voice with a shorter-duration open quotient.

To conclude, results show that aspiration noise is observed in the F3 region of all speakers, and although Spanish female speech tends to have slightly more noise than that of the average Spanish male, this difference is not significant. Finally, it seems that more noise is evident in unstressed and final syllables as opposed to stressed syllables, which corresponds to the results obtained with American English speakers.

3.3. Changes in the vocal-tract transfer function

The third and final proposed acoustic correlate of the degree of breathiness of a vowel has to do with the acoustic effects of tracheal coupling on the vocal-tract transfer function. Among those effects are the potential (1) additional tracheal poles and zeros and (2) increases in the first-formant bandwidth.

3.3.1. Extra poles and zeros

The speech sound /x/, in which the glottis is partially open, was chosen to locate tracheal poles and zeros. Using the LPC and FFT data, the [x] portions of the iterative sentences [xa] were analyzed, looking for additional formant peaks and zeros in the spectra. Results are shown in Table 10. Locations P1, P2, etc. refer to their positions relative to formant 1, formant 2, etc., respectively.

Table 10
Extra poles and zeros found in the [x] portion of S13 [xa] iterative sentences

           Extra poles   Extra zeros   Syll   Location
Females    1249 Hz       1425 Hz       1      P1-P2
           3798 Hz                     2      P3-P4
           1337 Hz       1476 Hz       3      P1-P2
Males      1880 Hz                     3      P2-P3
           2271 Hz       2296 Hz       5      P2-P3

(The speaker labels AG, AS, EH, RF, SF and AM, GG, JF, JL, MA cannot be aligned with individual rows in the source; the grouping of rows and the pairing of zeros with poles follow the order of the surviving values.)
The results indicate minimal extra poles and zeros for Spanish speakers; they are not as common as in American English. Also, all of the extra poles and zeros were found in the [x] segment of the [xa] iterative sentences and in general did not creep into adjacent vowel spectra, at least as far as vowel midpoint is concerned. Observing initial and final parts of the vowel, there were fewer cases of extra poles and zeros. For speaker AG, extra poles were found at vowel beginning of syllables 1 and 2, with no extra zeros. The only other case was found in speaker AS at the end of syllable 2, where an extra pole was found. In all other cases, no extra poles or zeros were found.

In the case of natural sentences, we also found few cases of extra poles and zeros, and in those cases all speakers happened to be female. Concentrating on 4 different natural sentences (11 syllables in all) containing the vowel [a], only 5 instances of extra poles were noted and they exhibited no set pattern. Table 11 shows the results for the chosen natural sentences.

It was found in both male and female speakers that a vowel either preceded or followed by a nasal behaved distinctly. It is known (Fant, 1960, 1973; Fujimura, 1962, 1968) that nasal coupling produces a zero in the F1 region. We have found in such cases that the third formant seems to disappear in the regions closest to the nasal. This may suggest that extra zeros exist with nasals. Fig. 3 illustrates this behaviour with the vowel /o/. Note the appearance of the extra zero due to nasalization. Also, it was found that more noise was evident in nasalized vowels. Further study is needed to improve the characterization of nasals.

Table 11
Extra poles and zeros of natural sentences including the phoneme /a/

Speaker   Extra pole   Extra zero   Syll.    Loc.    Sentence
AG        1097 Hz                   1 [pa]   P1-P2   S8
AS        1577 Hz                   5 [ka]   P3-P4   S2
          1665 Hz                   2 [pa]   P3-P4   S3
EH        1148 Hz                   1 [wa]   P1-P2   S1
SF        1135 Hz                   1 [wa]   P1-P2   S1
Fig. 3. Visual appearance of the vowel /o/ by male speaker AM, (top) without nasal influences and (bottom) with a nasal zero. Note the clear zero that this speaker has at around 4,200 Hz.

3.3.2. First formant bandwidth

The first formant bandwidth of the vocal-tract transfer function determines the resonance peak width as well as the relative strength of the first formant peak. The same methods used by Klatt and Klatt (with one addition) to quantify the relative strength of the first formant were used for Spanish speakers.

First, the amplitude of the F1 peak relative to the amplitude of the second formant peak was measured. Measurements were taken at the beginning, middle and final locations of the vowel in [xa] iterative sentences. A1rel is the average of the first formant amplitude relative to the second formant amplitude.

The second measure had to do with the visibility of the first formant peak and followed the same scale initiated by Klatt and Klatt: 0 = no evidence of an F1 peak, 1 = inflection point in smoothed spectrum, and 2 = obvious local spectral maximum. The value F1vis is the sum of the estimated relative strength of the first formant peak measured at vowel beginning, midpoint and end. Fig. 4 illustrates spectra of two speakers (MA and SF). In the top figure, the first formant is very well defined and its bandwidth is small. In comparison, the first formant of the bottom figure is difficult to distinguish and its bandwidth is relatively large.

Fig. 4. Visual appearance of the first formant using an FFT of 1024 points and an LPC of 30 points. Top: male speaker MA, bottom: female speaker SF.

A third measure consisted of finding the spectral peak value, SP, and then recording the frequency values at -3 dB from SP to arrive at a first formant bandwidth value, SW. Sharp spectral peaks will have a frequency range much smaller than smoothed spectral peaks.

The results obtained by Klatt and Klatt (1990) and Karlsson (1991b) allowed us to expect higher F1 bandwidths in female voices. As can be seen from the results in Tables 12-15, a difference between male and female A1rel is noticed ⁵. A difference of 3.0 units is found between the genders in the stressed, first syllable of S13. Unstressed vowels show a difference of 4.5, 1.1 and 2.8 units (see Tables 13, 14 and 15, respectively). On the whole (considering stressed and unstressed syllables), the female A1rel value is lower than the male value. This difference is statistically significant (level p < 0.01 according to the t test).

Perhaps a better measure of the male/female differences is the SW value. On average, the F1 bandwidth for women is greater than for men (this difference is significant only at level p = 0.10), although male speaker JF has considerably large first formant bandwidths. Also, the F1vis values more or less correspond to the SW values, relatively speaking. The expanded bandwidth of female speakers may indicate that the glottis is partially open, which can sometimes almost totally erase the spectral peak at F1 (Klatt and Klatt, 1990; Fant, 1993). This effect, in addition to extra poles and zeros, can cause problems for formant trackers to represent male, female and child voices.

Table 12
Prominence of the F1 peak in vowel spectra of S13 [xa] iterative sentences (syllable 1, stressed)

Female   A1rel   F1vis   SW (Hz)     Male   A1rel   F1vis   SW (Hz)
AG        -4.1   4       341         AM     -4.3    6       177
AS        -4.7   5       235         GG     -4.3    6       122
EH       -13.0   6       139         JF     -2.1    6       168
RF        -8.7   5       222         JL     -2.0    5       299
SF        -5.3   5       248         MA     -8.0    6       122
Ave.      -7.3   5.0     237         Ave.   -4.3    5.8     178

Table 13
Prominence of the F1 peak in vowel spectra of S13 [xa] iterative sentences (syllable 2, unstressed)

Female   A1rel   F1vis   SW (Hz)     Male   A1rel   F1vis   SW (Hz)
AG        -5.0   3       383         AM     -7.0    6       185
AS        -6.0   6       156         GG     -3.0    6       185
EH       -13.3   6       118         JF      0.3    3       403
RF       -11.3   5       226         JL     -5.0    6       135
SF        -2.7   6       189         MA     -1.3    5       232
Ave.      -7.7   5.2     214         Ave.   -3.2    5.2     228

Table 14
Prominence of the F1 peak in vowel spectra of S14 [xa] iterative sentences (syllable 1, unstressed)

Female   A1rel   F1vis   SW (Hz)     Male   A1rel   F1vis   SW (Hz)
AG        -3.3   3       404         AM     -3.3    6       186
AS        -8.3   6       189         GG     -4.3    6       135
EH        -6.7   6       172         JF     -2.3    6       131
RF        -7.0   5       247         JL     -3.3    6       172
SF        -2.0   4       375         MA     -8.7    6        75
Ave.      -5.5   4.8     277         Ave.   -4.4    6.0     140

Table 15
Prominence of the F1 peak in vowel spectra of S14 [xa] iterative sentences (syllable 3, unstressed)

Female   A1rel   F1vis   SW (Hz)     Male   A1rel   F1vis   SW (Hz)
AG        -5.5   4       269         AM     -9.1    6       130
AS        -7.7   6       177         GG     -2.7    5       261
EH        -6.0   5       240         JF      1.7    3       408
RF        -9.7   6       184         JL     -5.0    6       189
SF        -4.3   4       299         MA     -3.3    6       156
Ave.      -6.6   5.0     234         Ave.   -3.8    5.2     229

⁵ In Tables 12-15, each value of A1rel shown is the average of the measures taken at the beginning, middle and end of the indicated syllable. The last row is the average value of the columns.
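The third measure of Section 3.3.2, the width of the first-formant peak between the points 3 dB below the spectral peak SP, can be sketched as below. This is an illustrative reading of the description rather than the authors' procedure: the function names are hypothetical, the spectrum is assumed to be available as sampled levels in dB (for instance from a smoothed LPC curve), and the crossing frequencies are found by linear interpolation between samples, which the paper does not specify.

```python
import math


def peak_index(levels_db):
    """Index of the largest spectral level."""
    return max(range(len(levels_db)), key=lambda i: levels_db[i])


def bandwidth_3db(freqs_hz, levels_db, peak, drop_db=3.0):
    """Width (Hz) between the points drop_db below the peak on either side.

    Returns None when the spectrum never falls drop_db below the peak on one
    side, e.g. for a peak that is only an inflection in the spectrum.
    """
    target = levels_db[peak] - drop_db

    def crossing(step):
        i = peak
        while 0 <= i + step < len(levels_db):
            j = i + step
            if levels_db[j] <= target:
                # Linear interpolation between the last sample above the
                # target (i) and the first sample at or below it (j).
                frac = (levels_db[i] - target) / (levels_db[i] - levels_db[j])
                return freqs_hz[i] + frac * (freqs_hz[j] - freqs_hz[i])
            i = j
        return None

    low = crossing(-1)
    high = crossing(+1)
    if low is None or high is None:
        return None
    return high - low


# Hypothetical resonance at 700 Hz with a nominal 200 Hz half-power bandwidth,
# sampled every 5 Hz.
freqs = [5.0 * i for i in range(401)]
levels = [10.0 * math.log10(1.0 / ((f - 700.0) ** 2 + 100.0 ** 2)) for f in freqs]
width = bandwidth_3db(freqs, levels, peak_index(levels))
print(width)  # close to the 200 Hz nominal bandwidth
```

Returning None for peaks that never drop by 3 dB on one side mirrors the situation described in the text, where a heavily damped F1 is hard to distinguish at all; how such cases were scored in Tables 12-15 is not stated, so the sketch leaves that decision to the caller.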
4. Discussion

The male/female differences for Spanish speakers are summarized as follows.

It was found that the fundamental frequency of Spanish females is about 1.8 times that of Spanish males, which agrees with the results of other investigations.

To the extent that relative first-harmonic strength is an indication of the aspiration noise found in the vowel spectrum, our results seem to indicate that Spanish females are slightly more breathy than Spanish males. The difference is smaller than that reported by Klatt and Klatt, but statistically significant. Comparing the first and last syllables of natural and iterative sentences, we conclude from our results that Spanish males laryngealize slightly more than females. Also, between the speakers, the average difference between male and female Spanish speakers is nominal (males slightly lower than females), but a wider range is found among male speakers. The results seem to indicate that the open quotient is slightly smaller for Spanish males.

As far as stress and syllable amplitude are concerned, Spanish males and females seem to stress syllables with about the same intensity. The stress of final syllables is attenuated. The average drop of natural sentences is much steeper than that of the iterative ones, but even if the last syllable is stressed, a decrease in intensity is found.

The results from the test which subjectively measured the aspiration noise in the third formant region indicate that Spanish female speech has a bit more noise than Spanish male speech, but the difference has no statistical significance. On average, more aspiration noise is found in the last syllable and also in unstressed vowels.

Extra poles and zeros were not as commonplace as with American speakers. Although the occurrences took place more frequently in female speech, there seems to be no apparent pattern of extra poles/zeros. An interesting observation concerns the vowels influenced by nasals. In such cases, the third formant is weakened or completely vanishes. In most instances, at vowel midpoint the nasal influence disappears and the formant is restored. Also, it was observed that nasalized vowels contain a considerable amount of noise.

In general, the first formant bandwidth of vowels produced by Spanish females is greater than that of Spanish males, with the exception of speaker JF, who had considerably large first formant bandwidths. This may indicate that the glottis is partially open in female speech.

The analysis presented here attempts to differentiate Spanish male and female speakers. These findings, in addition to future perceptual analyses, will assist in producing a synthesized female voice for the Spanish language. As Karlsson (1991a,b) has found, incorporating attributes such as breathiness does indeed enhance the quality of voice synthesizers. However, more investigation of acoustical male/female differences must continue at the same time, to help augment the quality and naturalness of future speech synthesizers.
Acknowledgements
We wish to thank the J. William Fulbright Commission, the Spanish Ministry of Education and Science and the Comunidad de Madrid (CO65/91) for providing funding for this investigation. In addition, we wish to express our gratitude to Xavier Menéndez-Pidal, Javier Macias, Francisco Giménez and Jose Angel Vallejo for their helpful comments and insights.
References

C. Bickley (1982), "Acoustic analysis and perception of breathy vowels", Working Papers, Vol. 1, MIT Speech Communication, pp. 71-81.
G. Fant (1960), Acoustic Theory of Speech Production (Mouton, The Hague) ([...] in 1970).
G. Fant (1973), Speech Sounds [...] (MIT Press, Cambridge, MA).
G. Fant (1982), "Preliminaries to analysis of the [...] source", [...] pp. [...]
G. Fant (1993), "Some problems in voice source analysis", Speech [...] Vol. [...] Nos. 1-2, pp. 7-22.
O. Fujimura (1962), "Analysis of nasal [...]", [...] Soc. Amer., [...] 34, pp. 1865-1875.
O. Fujimura (1968), "An [...] voice aperiodicity", [...] Trans. Audio Electroacoust., Vol. AU-16, No. [...] pp. 68-72.
I. Karlsson (1989), "[...] for a text-to-speech sys[...]", [...] Conf. Speech Communication and [...] Vol. 1, pp. 349-352.
I. Karlsson (1991a), [...] Vol. [...] 111-120.
I. Karlsson (1992a), "[...] between male [...] female voices; A pilot study", [...] pp. 19-31.
I. Karlsson (1992b), [...] Vol. [...] Nos. 4-5, pp. [...]
D.H. Klatt (1987), "[...] for English", [...] Soc. Amer., Vol. 82, No. 3, pp. 737-793.
D.H. Klatt and L. Klatt (1990), "[...] and perception of voice quality [...] and male talkers", J. Acoust. Soc. Amer., [...] pp. 820-857.
R.B. Monsen and A.M. Engebretson (1977), "[...] male [...] female glottal wave", [...] Soc. Amer., Vol. 62, No. 4, pp. 981-993.
P.J. Price (1989), "[...] and female [...]", [...] Vol. 8, [...] 3, pp. 261-277.
M.A. Rodriguez Crespo et al. (1991), "Teoria y aplicaciones [...]", [...] Vol. 2, [...] 4, [...] 35-53.
M.H. Savoji (1990), "[...] speech with regards [...]", [...] pp. 1-13.