Speech Communication 10 (1991) 521-531 North-Holland
Prosodic and segmental speaker variations
Gunnar Fant, Anita Kruckenberg and Lennart Nord
Department of Speech Communication and Music Acoustics, Royal Institute of Technology (KTH), Box 700 14, S-100 44 Stockholm, Sweden
Received 28 June 1991
Abstract. The purpose of our presentation is to make an inventory of knowledge developed at the KTH about speaker variabilities including findings from our more recent databank projects on text reading. We shall have something to say about male/female differences, voice source characteristics, and about prosodic and segmental features in connected speech. We also have some data of more general statistical nature such as pause durations, long time average spectrum, and about relative proportions of voiced and voiceless segments in speech.
Zusammenfassung. Das Ziel dieses Beitrags ist es eine Aufstellung zu machen von dem Wissen welches am KTH zusammengestellt wurde und welches die Sprechervariabilität betrifft. Der Beitrag beinhaltet auch Resultate von unserem neuen Datenbank Projekt über das Lesen von Texten. Wir werden sprechen über Unterschiede zwischen Männern und Frauen, über die Charakteristika des Quellensignals und über prosodische und segmentale Merkmale von zusammenhängender Sprache. Wir haben auch allgemeine statistische Daten über die Dauer von Pausen, das gemittelte Langzeitspektrum und den relativen Anteil von stimmhaften und stimmlosen Segmenten in der Sprache.
Résumé. Notre but est d'inventorier les connaissances développées au KTH à propos des variabilités inter-locuteurs en y incluant les résultats les plus récents de nos projets de banque de données sur la lecture de textes. Nous présentons nos données sur les différences entre locuteurs féminins et masculins, les caractéristiques de la source vocale et les traits prosodiques et segmentaux dans la parole continue. Quelques données de nature statistique plus générale sur la durée des pauses, le spectre moyen à long terme et les proportions relatives de segments voisés et non voisés dans la parole sont également présentées.
Keywords. Prosody, timing, male/female speech, voice source.
1. Introduction
There is a strong mutual dependency between studying individual variations and studying general features of the speech code. Without sufficient background of data from several speakers we lose insight into invariance and essentials. Conversely, without a proper base of general knowledge from speech analysis and speech production and perception we are at a loss how to collect relevant data on speaker variations. Studies of variabilities pave the way for studies of invariance and vice versa. In our project to determine what constitutes good reading of Swedish texts and what speaker variations we may expect (Fant and Kruckenberg, 1989), we have indeed been confronted with this dilemma. We have processed data from 17 subjects, but the major part of the work has been concerned with general studies of the realisation of prosody. We have also initiated a contrastive study of Swedish, English and French (Fant et al., 1989). Speaker variabilities are language specific. The situation is similar in the more detailed but fundamentally important matter of tracking voice source parameters in connected speech. We have new efficient models of the voice source and well established source-filter interaction theory, but we still lack insight into normal contextual variations as a basis for describing individual variations.
0167-6393/91/$03.50 © 1991 Elsevier Science Publishers B.V. All rights reserved
Fig. 1. Scale factors of female/male formant frequencies in per cent. Average data from six languages (Fant, 1975). [Three panels: k1, k2 and k3 in per cent for each vowel.]
Fig. 2. Tenor/bass formant frequency scale factors in per cent (Fant, 1975). [Female/male scale factors for six languages compared with tenor/bass scale factors for Swedish (Cleveland, 1975).]
2. Vocal tract scale factors and normalization
Female speakers have a shorter vocal tract than males, but the scaling of dimensions is not uniform. Thus females and males differ more in terms of pharynx length than in terms of mouth cavity length. This is but one of several deviations from a uniform scaling, and individual variations are frequent within males as well as females. The corresponding formant frequency differences can be expressed in terms of scale factors, one for each formant of each prototype vowel of interest (Fant, 1975). Typical deviations from a uniform scaling inverse to the overall vocal tract length are the relatively large scale factor in F1 of open vowels such as /æ/ and /a/, the relatively low scale factor in F1 of maximally close and rounded vowels such as /u/, and the relatively high scale factor of F2 of the vowel /i/, see Figure 1. It is interesting to note that these female/male formant frequency scale factors, which are averages over six different languages, show similarities with corresponding tenor/bass singer formant scale factors, see Figure 2. The specific vowel and formant scale factors thus appear to be systematically related to an overall vocal tract length. In the first approximation we may thus adopt the particular vocal tract length as an entry for normalization. This procedure is described in Fant (1975) and has been applied to female and male data of Swedish, Dutch and American English and also to a separate study of sociolects in French (Mettas and Fant, 1977). It is a production oriented method suited for separating out individual and dialectal traits from average language data. Although F2 and F3 and higher formants may be combined into an effective "F2 prime", F2', (Fant, 1959, 1983; Carlson et al., 1975; Bladon and Fant, 1978), we recommend a more detailed specification in the same graph of both F2 versus F1 and F3 versus F1 as in Figure 3. It pertains to the reference data of the 1-6 language study (Fant, 1975). The vowel and formant specific scale factors in percent are indicated for each data point.
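As a minimal sketch of how such scale factors can be computed, assume that the per-cent scale factor of formant n is k_n = 100 (F_n(female) / F_n(male) - 1), which is our reading of "scale factors in per cent". The function name and the formant values below are illustrative placeholders, not data from Fant (1975).

```python
def scale_factors(female_hz, male_hz):
    """Return per-cent scale factors k_n = 100 * (F_female / F_male - 1)."""
    if len(female_hz) != len(male_hz):
        raise ValueError("formant lists must have the same length")
    return [100.0 * (f / m - 1.0) for f, m in zip(female_hz, male_hz)]


# Hypothetical F1-F3 values in Hz for one close front vowel
# (placeholders for illustration, not measurements from the paper).
male_formants = [270.0, 2290.0, 3010.0]
female_formants = [310.0, 2790.0, 3310.0]

k1, k2, k3 = scale_factors(female_formants, male_formants)
print(f"k1 = {k1:.1f} %, k2 = {k2:.1f} %, k3 = {k3:.1f} %")
```

A uniform scaling would give the same k for every formant and every vowel; unequal values within one vowel are the kind of non-uniformity discussed above.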
3. Glottal functions and source filter interaction
We shall now review some data on source characteristics which reflect individual and sex related variations. A fairly extensive study of Mártony (1965) on frequency domain inverse filtering, comparing narrowband sections of human
Fig. 3. F3 versus F1 and F2 versus F1 of the 1-6 reference language data of vowels (Fant, 1975). Average female/male scale factors in percent are indicated. [Panel (a): F1-F2 of male vowels with female-male scale factors k1 and k2; panel (b): F1-F3 of male vowels with female-male scale factors k1 and k3.]
vowels and cascade synthesized vowels, gave the results shown in Figure 4. These spectral data are normalized by a +12 dB/oct emphasis, which under ideal conditions would give a flat spectrum. The upper and lower curves mark the range of variation, which is ±10 dB at lower frequencies and ±15 dB at higher frequencies. The overall trend is a fall with frequency, indicating a roll-off of the glottal flow spectrum by more than 12 dB/oct. The minima at about 800 Hz and 2-2.7 kHz could be of some significance. We need more data of this type supplementing our present standard technique of derivation of equivalent LF-model parameters (Fant et al., 1985; Gobl, 1988, 1989; Karlsson, 1988, 1989). However, there is some evidence for systematic departures from model data. A recent frequency domain matching of Fant and Lin (1988), Figure 5, shows a good fit for the male voice but dips around 900 Hz and 2500 Hz for the female voice.
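The +12 dB/oct normalization can be illustrated with a small sketch. It assumes an idealized source spectrum with a constant slope in dB per octave and an arbitrary reference frequency of 100 Hz; the function names and the -15 dB/oct comparison slope are hypothetical and serve only to show why a source falling faster than 12 dB/oct leaves a downward trend after emphasis.

```python
import math


def preemphasis_db(freq_hz, ref_hz=100.0, slope_db_per_oct=12.0):
    """Gain in dB of an idealized emphasis rising by slope_db_per_oct per octave."""
    return slope_db_per_oct * math.log2(freq_hz / ref_hz)


def source_level_db(freq_hz, ref_hz=100.0, slope_db_per_oct=-12.0):
    """Level in dB of an idealized source spectrum with a constant slope."""
    return slope_db_per_oct * math.log2(freq_hz / ref_hz)


for f in (100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0):
    # A -12 dB/oct source is flattened exactly by the +12 dB/oct emphasis.
    ideal = source_level_db(f) + preemphasis_db(f)
    # A steeper (assumed -15 dB/oct) source keeps a residual fall of 3 dB/oct.
    steeper = source_level_db(f, slope_db_per_oct=-15.0) + preemphasis_db(f)
    print(f"{f:6.0f} Hz  ideal: {ideal:6.1f} dB  steeper source: {steeper:6.1f} dB")
```

Real source spectra are not straight lines, so this only captures the overall trend and not the local minima mentioned in the text.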
Fig. 4. Mean voice source spectra of Swedish vowels derived from spectrum matching (Mártony, 1965). Spectral preemphasis +12 dB/oct. Average of 12 male subjects. The upper and lower limits are indicated. [Axes: level in dB versus frequency in kc/s.]
Vol. 10, Nos. 5-6, December 1991
Fig. 5. Frequency domain inverse filtering of a vowel /a/. A +6 dB/oct differentiation is included to take into account ideal source slope and radiation. A male subject above and a female subject below. From Fant and Lin (1988).
The 900 Hz dip might originate from acoustic interaction, see e.g. Fant (1986). The 2500 Hz minimum is definitely due to a zero from subglottal coupling, which is often encountered in female phonation, see also Klatt and Klatt (1990). These examples illustrate the importance of documenting and looking into the frequency domain properties of the human voice source, which apparently has properties that are not built into present time domain based parametric models. Examples of long time narrow band average spectra are shown in Figure 6. In both the male and the female data we here also observe the local minima at approximately 900 Hz and 2.5-3.0 kHz, which we have already encountered in Figures 4
Fig. 6. Mean long time average spectra of the voiced part of a text passage of approximately 40 s, read by one male speaker (top), five males (middle) and four females (bottom). The curves have been displaced vertically to avoid overlap. Two dips in the spectra are indicated by arrows.
and 5. In view of the discussion above, these features probably reflect interaction effects rather than uneven formant distributions. A few general comments on female/male source differences should be noted. We know that female vowels are on the average a few dB lower in intensity than male vowels. The corner effect at closure, also referred to as the residual closing phase, is more pronounced than for males, which means that the drop off in excess of 12 dB/oct starts at a lower frequency, around 1000 Hz instead of 2000 Hz for males. As far as overall pulse shape is concerned we may expect a larger open quotient for females. The flow pulse amplitude as well as the derivative at closure is about one half of that of males. The latter scale factor would account for a 6 dB lower formant excitation. However, because of the larger rate of excitations at a higher pitch, this is partially compensated. In addition, the larger female formant bandwidths tend to reduce vowel intensity. All these factors would sum up to a 3 dB lower overall level for females than for males. Other attributes of the female voice source are related to a greater tendency of breathiness than in males, enhancing the level of the fundamental versus higher harmonics and promoting more apparent noise components (Klatt and Klatt, 1990).
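The 6 dB figure follows directly from halving an amplitude, as the sketch below shows. The compensation term is only a first-order assumption on our part, namely that average power grows in proportion to the number of excitations per second; the F0 values are hypothetical and the bandwidth effect is not modelled.

```python
import math


def amplitude_ratio_to_db(ratio):
    """Level difference in dB for a ratio of amplitudes."""
    return 20.0 * math.log10(ratio)


def power_ratio_to_db(ratio):
    """Level difference in dB for a ratio of powers."""
    return 10.0 * math.log10(ratio)


# Excitation (flow derivative at closure) about one half of the male value.
excitation_db = amplitude_ratio_to_db(0.5)

# Assumed compensation: average power proportional to the excitation rate,
# i.e. to F0. The two F0 values are hypothetical.
f0_male_hz, f0_female_hz = 120.0, 200.0
rate_db = power_ratio_to_db(f0_female_hz / f0_male_hz)

print(f"halved excitation amplitude: {excitation_db:.1f} dB")
print(f"higher excitation rate:      +{rate_db:.1f} dB")
```

The overall 3 dB difference quoted in the text also reflects the larger female formant bandwidths, which this sketch leaves out.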
Another feature of female voice production is that the conditions of a high F0 approaching a low F1 increase aerodynamic interaction (Fant, 1986). As already pointed out in the introduction we are only gradually becoming acquainted with voice source contextual variations in connected speech. Such dynamic variations appear to be more important for speech naturalness than a proper static model per se. Here is also a domain with apparent individual variations. Gobl and Ní Chasaide (1988) have shown how preocclusive aspiration may start as assimilated breathiness in the previous vowel, which manifests as progressive weakening of the source, starting already at an early stage of the vowel. The prominent reduction of formant amplitudes is to a considerable extent also caused by increased formant bandwidths. If the abduction has proceeded far enough, the first part of the occlusion will show noise in the closing phase of the supraglottal constriction. This phenomenon is language specific, typical of stressed aspirated stops in Swedish, but it is also an individual and sex related feature worth noting (Fant and Lin, 1988). This is illustrated in Figure 7, which shows spectrograms of the Swedish word /detta/, extracted from a read text. The two female informants, AO and IK on
Fig. 7. The Swedish word "detta" in stressed context. Preocclusion aspiration is more apparent in the female voices above than in the male voices below. Observe the extra formant at 1400 Hz between F1 and F2 in the vowel /e/ of the female subject AO which is due to a subglottal resonance. See further Fant and Lin (1988).
the top part, display typical signs of preocclusion aspiration as discussed above. Observe the pattern break halfway in the vowel /e/ for IK, and the appearance of an extra formant between F1 and F2 in the /e/ of AO. As discussed by Fant and Lin (1988), this extra formant reflects subglottal coupling. The two male subjects in the bottom part of the figure exhibit much cleaner vowels, but we may observe a weakening of the last three pitch periods of /e/ for speaker LN.
4. Stress and intonation
Prosodic aspects of speech production such as stress, intonation, phrasing, tempo and pausing probably contribute more to characterize a speaker than his segmental realizations. The range of prosodic variability is large and more difficult to structure than segmentals. It is well known that stress and accents manifest by duration, F0 contours, intensity and degree of articulatory enhancement versus reduction. The relative importance of these components is language specific. Articulatory and corresponding acoustic pattern reduction is illustrated in Figure 8. It shows three individual utterances of the Swedish verb /legat/ extracted from a passage of a read novel (Fant et al., 1986). The three individual productions illustrate three degrees of consonant /g/ versus vowel contrast: normal, reduced and much reduced. The latter is an example of the /g/ being reduced to a voiced continuant. Still it is perfectly acceptable in its context of local high tempo and low stress level. Here the drastic differences in the individual spectral pattern are subjectively less apparent. Another example is the very diffuse boundary between a nasal consonant and a following dental stop. An incomplete oral closure produces a nasalization instead of a nasal murmur. This is quite analogous to the reduction of the voiced stop in the example above. In our text reading experiments we have observed considerable individual variations in the degree of contrast between stressed and unstressed syllables. A distinct reading style requires a good contrast. An indirect overall measure of the
Fig. 8. The word "legat" of three subjects reading a Swedish text (Fant et al., 1986), illustrating normal, reduced and a very reduced vowel-consonant /g/-vowel contrast.
stressed/unstressed duration contrast may be deduced from the statistics of interstress foot length as a function of the number of phonemes in the foot as exemplified in Figure 9, which pertains to our reference subject AJ. The duration of a foot can be expressed in the form of a linear regression

$$T_n = a + b\,n,$$

where n is the number of phonemes in the foot, b is the average increment of duration per added phoneme and a is a constant, which represents the stress induced lengthening. On the whole, at a constant level of stress the duration of the stressed syllable is independent of the number of phonemes in the foot. This holds approximately down to n = 1, which implies a foot consisting of a long stressed vowel only, of duration a + b. The average duration of feet not spanning a pause or other syntactic boundary domain is of the order of 550 ms in English as well as in Swedish and French, and the average number of phonemes per such "free foot" is of the order of 7. The larger the a and the smaller the b, the greater will the contrast be between phonemes that are affected by stress and those that are not, i.e., stressed syllables stand out more clearly from unstressed syllables. A plot of the individual variations of 17 subjects' a and b constants averaged over a sequence of four long noun phrases is shown in Figure 10. The range of variation of a is much greater than that of b. A direct control of stress induced
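The regression constants a and b can be estimated by ordinary least squares, as in the sketch below. The helper `fit_line` and the list of foot durations are hypothetical; the data are invented for illustration and are not taken from the databank.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # duration increment per added phoneme
    a = mean_y - b * mean_x  # constant term (stress induced lengthening)
    return a, b


# Hypothetical free feet: number of phonemes and measured duration in ms.
n_phonemes = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
duration_ms = [270, 310, 375, 420, 480, 525, 590, 630, 690, 795]

a, b = fit_line(n_phonemes, duration_ms)
print(f"a = {a:.0f} ms, b = {b:.1f} ms/phoneme")
print(f"predicted duration of a 7-phoneme foot: {a + 7 * b:.0f} ms")
```

A reader with a larger a and a smaller b than another reader would, by the argument above, show a stronger stressed/unstressed durational contrast.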
phoneme lengthening in the domain of stressed syllables showed that readers with a large a parameter prolonged the vowel nucleus, and especially the postvocalic consonants, more than those who had a low a value. These dimensionalities were studied in more detail in the context of varying reading style. Figure 11 refers to our reference subject AJ producing (1) normal relaxed reading, (2) normal but somewhat forced reading, (3) low overall voice effort, (4) loud voice and (5) distinct voice. The abscissa in these graphs is the average phoneme duration within a free foot.
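The average phoneme duration within a foot follows from the regression as D = (a + b n) / n = a/n + b, so a change in a is diluted by the number of phonemes while a change in b enters in full. The sketch below uses the a and b values given for subject AJ in Figure 9, the foot size of about 7 phonemes stated in the text, and the parameter increments reported for distinct reading later in this section; the function names are ours.

```python
def mean_phoneme_duration(a_ms, b_ms, n):
    """Average phoneme duration in a foot of n phonemes, D = (a + b * n) / n."""
    return (a_ms + b_ms * n) / n


def duration_change(delta_a_ms, delta_b_ms, n):
    """Split a change in D into the parts due to a and to b at fixed n."""
    from_a = delta_a_ms / n  # a is shared by all n phonemes of the foot
    from_b = delta_b_ms      # b applies to every phoneme
    return from_a, from_b


n_avg = 7  # average number of phonemes per free foot

d_avg = mean_phoneme_duration(158.0, 53.0, n_avg)
from_a, from_b = duration_change(115.0, 9.0, n_avg)

print(f"D = {d_avg:.1f} ms per phoneme")
print(f"change due to a: {from_a:.1f} ms, due to b: {from_b:.1f} ms")
```

This assumes that the number of phonemes per foot stays the same across reading styles, which is a simplification.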
Fig. 9. Growth of free foot interstress intervals as a function of the number of phonemes in the foot. From Fant and Kruckenberg (1989). [Subject AJ, nine sentences; regression line with a = 158 ms and b = 53 ms/phoneme; abscissa: number of phonemes, n.]
Fig. 10. The a and b parameters of the free foot regression equation for an ensemble of 14 subjects reading a long sentence. A large a and a small b conditions a relatively large stressed/unstressed durational contrast. [Abscissa: free foot phoneme duration in ms.]
Fig. 11. The a and b parameters of the reference Swedish subject's reading in 5 different styles. From Fant and Kruckenberg (1989). [Subject AJ, sentence 7, varying reading mode; abscissa: free foot phoneme duration $D_a = (a + b\,n_a)/n_a$ in ms.]
With distinct and also with high voice effort the average phoneme duration is prolonged, i.e. the speed of reading is reduced. Compared to the datapoints of normal and low voice effort reading, the a parameter has increased by 115 ms but the b parameter by 9 ms only. The change in average phoneme duration, from 70 ms to 95 ms, is thus not evenly contributed by stressed and unstressed phonemes. Of the 25 ms difference, 16 ms derives from the a parameter (115 ms spread over about seven phonemes per foot) but only 9 ms from the b parameter. One may conclude that the major burden of the tempo change lies in the stressed syllables, whilst the unstressed syllables are more stable. The speed variation associated with more distinct and louder speech affected the stressed syllables about twice as much as the unstressed syllables. Additional data on speech tempo variations are reported by Fant et al. (1991).

A large source of individuality lies in intonation, i.e. in patterns of F0 movements, generally combined with other acoustic parameters to signal syntactical, semantic, pragmatic, situational and emotional elements, which are subject to dialectal variations and social conventions. Even though many of these features belong to an accepted code, the range of free variations is large, which makes it difficult to develop descriptive models and quantitative rules. These extralinguistic aspects of intonation have been studied by Fónagy (1983). One observation that we have made is that a departure from the normal location of an F0 minimum or other intonational gesture at a syntactical boundary creates an impression of affective speaker attitude.

A commonly encountered problem in handling F0 data is to decide whether to use a linear or a logarithmic scale. In our studies of Swedish prose reading we found strong support for quantifying F0 in semitones, which made female word accent data come out quantitatively the same as male data (Fant and Kruckenberg, 1989).

Subjects vary much in their realization of potential syntactic boundaries. A boundary before a preposition phrase may be ignored by some readers or may be realized by a proper pause or by a less marked gesture of final lengthening and a voice source perturbation involving a local drop of F0 and a shift towards a creaky voice (Fant et al., 1986; Fant and Kruckenberg, 1989).
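The semitone scaling of F0 discussed above can be sketched as follows. The conversion 12 log2(f/f_ref) is the standard definition of a semitone interval; the reference frequency and the F0 values are hypothetical and only show why equal relative F0 movements give equal semitone values in a low and a high register.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """F0 in semitones relative to ref_hz (12 semitones per octave)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def interval_semitones(f_low_hz, f_high_hz):
    """Size of an F0 movement in semitones, independent of the reference."""
    return hz_to_semitones(f_high_hz) - hz_to_semitones(f_low_hz)


# Hypothetical accent rises of 20 % in a low and in a high register.
low_register = interval_semitones(100.0, 120.0)
high_register = interval_semitones(200.0, 240.0)

print(f"low register:  {120.0 - 100.0:.0f} Hz rise = {low_register:.2f} semitones")
print(f"high register: {240.0 - 200.0:.0f} Hz rise = {high_register:.2f} semitones")
```

On a linear Hz scale the second rise is twice the first, whereas on the semitone scale the two coincide.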
5. Pauses
We have found that the average duration of a pause within a sentence, e.g. between two phrases, is of the order of 300-600 ms with a typical value of 450 ms, see Figure 12. The prepause segmental lengthening in the syllable preceding a pause is about 100 ms. The sum of pause duration and final lengthening, 550 ms, is of the same order of magnitude as the average free foot. We have found a tendency of skilled readers to plan their sentence internal pauses plus prepause lengthening to match a local average of free foot length. Pauses between sentences are longer and often display quantal jumps. Skilled readers favour effective pause durations (pause plus final lengthening) of the order of two, three or four times the local free foot length. This is exemplified in Figure 13, which shows four subjects' pause durations at 8 successive sentence boundaries within a paragraph. Subject AJ serves as our reference reader for these analyses. His pauses tend to attain values of three and two times his average free foot length of Ta = 550 ms. Subject IK favours pause durations of 2Ta and subject BG favours 1Ta. Subject EJ also favours 1Ta with one jump up to 2Ta. Other subjects show the same trend to a lesser degree, with less discrete steps but with some consideration to the semantic demands of a text. We have found similar trends in French and English text reading. These phenomena appear to reflect a general demand of rhythmical continuity across a pause, just as in music. We need to study more closely how a subject's pauses change with reading speed and local tempo, and to what extent final lengthening enters. One difficulty is to define a relevant short time average of foot duration considering the very large fluctuation in the duration of successive feet, some spanning a previous pause.

Fig. 12. Individual spread of subjective boundary rating versus duration of a specific pause between two noun phrases. From Fant and Kruckenberg (1989).
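The quantal tendency described in this section can be checked for a given reader by expressing each effective pause as a multiple of the mean free foot duration. The sketch below assumes a fixed final lengthening of 100 ms, the typical value quoted above; the function names and the pause values are hypothetical.

```python
def effective_pause_ms(pause_ms, final_lengthening_ms=100.0):
    """Pause plus prepause lengthening, in ms."""
    return pause_ms + final_lengthening_ms


def quantal_step(pause_ms, foot_ms, final_lengthening_ms=100.0):
    """Return (ratio, nearest integer multiple) of effective pause to mean foot."""
    ratio = effective_pause_ms(pause_ms, final_lengthening_ms) / foot_ms
    return ratio, max(1, round(ratio))


# Hypothetical sentence-boundary pauses in ms for a reader with Ta = 550 ms.
foot_ms = 550.0
for pause in (450.0, 1000.0, 1560.0):
    ratio, step = quantal_step(pause, foot_ms)
    print(f"pause {pause:4.0f} ms -> {ratio:.2f} x Ta, nearest multiple {step}")
```

How closely the ratios cluster around integers, and which local average of foot duration should serve as Ta, remain open questions as noted in the text.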
Fig. 13. Duration of pauses between sentences in a paragraph of text reading. The ratio of pause duration to the speaker's average free foot duration tends to favour discrete quantal values.
We have observed interesting trends in pause duration versus speech rate for an ensemble of 17 subjects. In Figure 14 we have plotted the average length of pauses between sentences against the subjects' effective speech rate, here represented by the average phoneme duration. There is a weak negative correlation (|r| = 0.5) between pause duration and average phoneme duration. Fast readers appear to compensate by longer pauses, which makes sense both from a production point of view and from the view of communication efficiency. The longer pauses of the fast readers tend to aim at a higher integer multiple of Ta than those of slow readers. These effects go opposite to the findings that pause durations increase when a subject is asked to speak more slowly, see Fant et al. (1991). A most stable speaker characteristic is the accumulated pause time versus total reading time,
Fig. 14. Average duration of pauses between sentences versus average phoneme duration for 16 subjects. The correlation is negative. [Regression line: Tps = 2.7 - 0.022 Dp, r = 0.52; ordinate: pause duration in s; abscissa: average phoneme duration Dp in ms.]
Fig. 15. Total accumulated pause time versus total reading time for five selected subjects. [Axes: pause time in s versus reading time in s.]
which attains a stable value after a relatively short time. This pause/speech ratio varies within the range of 15-30%, see Figure 15.
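A running version of this measure can be sketched as below. We assume that the ratio is accumulated pause time divided by total elapsed reading time, which is our reading of "accumulated pause time versus total reading time"; the segment list is hypothetical.

```python
def running_pause_ratio(segments):
    """Yield (elapsed_s, pause_fraction) after each segment.

    segments: iterable of (kind, duration_s) with kind "speech" or "pause".
    The fraction is accumulated pause time divided by total elapsed time.
    """
    elapsed = 0.0
    paused = 0.0
    for kind, duration_s in segments:
        if kind not in ("speech", "pause"):
            raise ValueError(f"unknown segment kind: {kind!r}")
        elapsed += duration_s
        if kind == "pause":
            paused += duration_s
        yield elapsed, (paused / elapsed if elapsed > 0 else 0.0)


# Hypothetical reading: alternating speech stretches and pauses, in seconds.
reading = [("speech", 3.2), ("pause", 0.5), ("speech", 4.1), ("pause", 1.1),
           ("speech", 2.8), ("pause", 0.6), ("speech", 3.9), ("pause", 1.2)]

for elapsed, fraction in running_pause_ratio(reading):
    print(f"{elapsed:5.1f} s  pause share {100.0 * fraction:4.1f} %")
```

Plotting the pause share against elapsed time for a real recording would show how quickly it settles for a given reader.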
6. Voiced/unvoiced relations
A comparison was also made between the groups of male and female speakers, with respect to the amount of voicing versus overall speaking time. The reason for this was that we had observed that in some cases female speakers more often than the male speakers devoiced parts of the vowels in unvoiced contexts (see Figure 7). As a preliminary attempt to quantify these differences, the readings were segmented into three categories: voice, noise or silence. Quotients of unvoiced/voiced and (unvoiced + silence)/total reading time were calculated. In Figure 16 the mean results are shown for eight males and four females. As can be seen there is a small but consistent difference between the speaker groups, the female speakers usually having a relatively greater proportion of unvoiced parts of the produced speech.

Fig. 16. Relative amount of segment type for male and female speakers (8 males, 4 females). [Bars in per cent for the quotients unvoiced/voiced and (unvoiced+silence)/total.]

Acknowledgment
This work has been financed by the Swedish Council for Research in the Humanities and Social Sciences (HSFR) and by the Swedish Board for Technical Development (STU) and the Bank of Sweden Tercentenary Foundation (RJ).
References
R.A.W. Bladon and G. Fant (1978), "A two-formant model and the cardinal vowels", STL-QPSR, Vol. 1, pp. 1-8.
R. Carlson, G. Fant and B. Granström (1975), "Two-formant models, pitch and vowel perception", in Auditory Analysis and Perception of Speech, ed. by G. Fant and M.A.A. Tatham (Academic Press, New York), pp. 55-82.
G. Fant (1959), Acoustic analysis and synthesis of speech with applications to Swedish, Ericsson Technics No. 1.
G. Fant (1975), "Non-uniform vowel normalization", STL-QPSR, Vols. 2-3, pp. 1-19.
G. Fant (1983), "Feature analysis of Swedish vowels - A revisit", STL-QPSR, Vols. 2-3, pp. 1-19.
G. Fant (1986), "Glottal flow: Models and interaction", J. Phonetics, Vol. 14, Nos. 3-4 (Theme issue, Voice Acoustics and Dysphonia, Gotland, Sweden, August 1985), pp. 303-399.
G. Fant and A. Kruckenberg (1989), "Preliminaries to the study of Swedish prose reading and reading style", STL-QPSR, Vol. 2, pp. 1-83.
G. Fant and Q. Lin (1988), "Frequency domain interpretation and derivation of glottal flow parameters", STL-QPSR, Vols. 2-3, pp. 1-21.
G. Fant, A. Kruckenberg and L. Nord (1989), "Rhythmical structures in text reading - A language contrasting study", Proc. Eurospeech 89, European Conf. on Speech Communication and Technology, Paris, 1989, Vol. I, pp. 498-501.
G. Fant, A. Kruckenberg and L. Nord (1991), "Temporal organization and rhythm in Swedish", XIIème Congrès International des Sciences Phonétiques, 19-24 August 1991, Aix-en-Provence, France.
G. Fant, J. Liljencrants and Q. Lin (1985), "A four-parameter model of glottal flow", STL-QPSR, Vol. 4, pp. 1-13.
G. Fant, L. Nord and A. Kruckenberg (1986), "Individual variations in text reading. A data-bank pilot study", STL-QPSR, Vol. 4, pp. 1-17.
I. Fónagy (1983), La Vive Voix, Payot, Paris, 1983.
C. Gobl (1988), "Voice source dynamics in connected speech", STL-QPSR, Vol. 1, pp. 123-159.
C. Gobl (1989), "A preliminary study of acoustic voice quality correlates", STL-QPSR, Vol. 4, pp. 9-22.
C. Gobl and A. Ní Chasaide (1988), "The effects of adjacent voiced/voiceless consonants on the vowel voice source: A cross language study", STL-QPSR, Vols. 2-3, pp. 23-59.
I. Karlsson (1988), "Glottal waveform parameters for different speaker types", STL-QPSR, Vols. 2-3, pp. 61-67.
I. Karlsson (1989), "Dynamic voice source parameters in a female voice", STL-QPSR, Vol. 1, pp. 75-77.
D.H. Klatt and L.C. Klatt (1990), "Analysis, synthesis and perception of voice quality variations among female and male talkers", J. Acoust. Soc. Amer., Vol. 87, No. 2, pp. 820-857.
J. Mártony (1965), "Studies of the voice source", STL-QPSR, Vol. 1, pp. 4-9.
O. Mettas and G. Fant (1977), "Front vowels in Parisian sociolects", STL-QPSR, Vols. 2-3, pp. 1-7.