Journal of Phonetics (1985) 13, 155-162
Perceptual isochrony in English and in French
Donia R. Scott, S. D. Isard
Centre for Research on Perception and Cognition, University of Sussex, Brighton BN1 9QG, England
and Benedicte de Boysson-Bardies
Laboratoire de Psychologie, 54, bvd. Raspail, F-75006 Paris, France
Received 10th October 1984
Abstract:
It has often been claimed that English speech is characterized by equally-spaced (isochronous) intervals between stressed syllables, as opposed to French where all syllables are said to have equal length. To date, speech production data have failed to provide support for this claim. Some investigators, however, have suggested that English speech rhythm is perceptually isochronous, giving rise to a perceptual illusion of regularly occurring stress events. We report the results of two experiments which show (1) that the illusion of regularity is not specific to so-called stress-timed languages and (2) that it is not even specific to speech. We conclude that the phenomenon of regularization as such cannot be used as evidence for an underlying isochronous rhythm in English.
0095-4470/85/020155+08 $03.00/0 © 1985 Academic Press Inc. (London) Ltd.

Introduction
It is one thing to subscribe to the widespread intuition that stressed syllables play a special role in English speech; it is quite another to explain just how this special role manifests itself at the level of the speech wave. Pike (1945) and Abercrombie (1964), writing at a time when it was much more difficult than it is today to make precise measurements on large numbers of waveforms, claimed that the intervals between successive stressed syllables, known as feet, are all (approximately) equal. In fact, they went further and proposed a division of all the world's languages into two classes: stress-timed languages, like English, with equal length (or isochronous) feet, and syllable-timed languages, in which all syllables are supposed to be of equal length. Russian and Arabic were said to be other examples of stress-timed languages, and French, Telugu and Yoruba were classified as syllable-timed (Abercrombie, 1967).

Subsequent measurements of recorded English speech have failed to reveal a pattern of regularly spaced stresses (although Fowler, 1977 and Dauer, 1983 among others have noted that it is difficult to say just how irregular the stresses should be to count as evidence against the hypothesis). One might at least expect that a tendency toward equal length for feet with differing numbers of syllables would make syllable lengths more variable in English than in a syllable-timed language. However, Roach (1982) measured spontaneous, unscripted speech of French and English native speakers, and found no more variation in the English syllable lengths than in the French. In fact, the clearest and most consistent finding about English foot lengths has been that the more syllables a foot contains, the longer its duration (O'Connor, 1968; Lea, 1974; Thompson, 1980).

On the other hand, Huggins (1972) and Fowler (1977) constructed sentences in which the same stressed syllables were separated by differing numbers of unstressed syllables (e.g. "the FACT STARTed the ARGument", "the FACTor STARTed the ARGument", "the FACTor reSTARTed the ARGument") and found a tendency for the individual syllables of a foot to become shorter as their number increased. Fowler, surveying a number of studies on both production and perception of English stress, is led to conclude that they "provide support for a weak version of the stress-timing view, namely that the intervals between stressed syllables are constrained in some regular, predictable way. They do not demonstrate necessarily that the constraint is isochrony" (p. 64). Further evidence for this view comes from Scott (1982), who showed that listeners' perceptions of syntactically ambiguous sentences could be influenced by manipulations of phoneme durations when the manipulations changed the lengths of inter-stress intervals, and from Carlson, Granstrom & Klatt (1979), who found that listeners preferred synthetic utterances whose inter-stress intervals had been copied from natural speech to other, equally isochronous, versions of the same utterances.

So although the intuition that inter-stress intervals are important units in English speech is supported by empirical findings, we are left with the question of why these intervals have been felt to be equal in length. Two main sorts of explanation have been advanced in the literature. They are not necessarily in contradiction, and it may well be that both have a role to play, but writers have tended to concentrate on one or the other.
One tradition, which Dauer (1983) traces back through Ladefoged (1975) and Classe (1939), says that the language manages to give a global impression of regularity by "conspiring" to make stresses crop up every two or three syllables. One element in the conspiracy is the process, which has received much recent attention following the work of Liberman & Prince (1977), whereby stresses change location within a word or phrase so as to avoid successive pairs of stressed syllables and produce an alternating pattern instead. Thus we have "loch NESS" and "MONster" pronounced in isolation, but "LOCH ness MONster". Giegerich (1980) documents a complementary tendency for normally unstressed function words to take on stress so as to break up what would otherwise be long sequences of unstressed syllables. Thus "I've been FISHing", with no stress on "been", but "I've BEEN at the PUB". Bolinger has often noted (e.g. Bolinger, 1965) that when the conspiracy fails to prevent a pair of stressed syllables from occurring in succession, then the first of the pair is appreciably longer than it would be if followed by an unstressed syllable, so that the stressed syllable "JOHN" is longer in "MAKE JOHN TELL" than in "MAKE JOHNny TELL". This brings monosyllabic and disyllabic feet closer in length than they would otherwise be. The findings of Cutler (1980) can be interpreted as the conspiracy caught in the act of trying to overreach itself; Cutler examined a corpus of speech errors and found that when there were errors of syllable omission, they generally led to a more regularly alternating stress pattern than that of the correct target utterance (e.g. "IN the metROLitan ARea" for "IN the metroPOLitan ARea").

The other tradition, which Lehiste (1977) also traces back to Classe, as well as to Barnwell (1971), has it that we hear stresses as being more regular than they really are.
The proposed explanation for this perceptual illusion is that regularity really does exist at an underlying level, but that it is lost in performance because of constraints on phoneme and syllable durations imposed by the speech production mechanism. The speech perception system then corrects for the performance distortion and restores the underlying regularity, rather in the way that the visual system sees a circle as round, even when it is viewed obliquely and casts an elliptical image on the retina.

The most convincing experimental evidence in support of the perceptual illusion account has been provided by Donovan & Darwin (1979) and Darwin & Donovan (1980), who found that when subjects are asked to tap out the rhythm of the stressed syllables of an English utterance, their taps are more regular than the stresses they are trying to imitate. The effect disappears when the subjects are asked to tap out the same rhythm presented as a sequence of noise bursts in silence, a task at which they are remarkably accurate.

However, if Donovan and Darwin's results are the manifestation of an underlying isochrony in English, then the effect should also disappear when French listeners are asked to tap out the rhythm of selected syllables in French utterances, where there is presumably no underlying isochrony present. Testing English speakers with French sentences and French speakers with English sentences might help us to determine whether the source of the illusion lies in the nature of English speech, in the competence of the native speaker, or in some interaction between the two.
Experiment I
We constructed sets of English and French sentences that were matched for number of syllables, syntax and general meaning (Table I). Following Donovan and Darwin, each English sentence contained four stressed syllables. Each stressed syllable of a given English sentence began with a stop consonant (which was always the same within a sentence, but varied between sentences). This device made it easy to refer to the stressed syllables without using the term "stress". In the corresponding French sentences, which we will refer to as Final-Syllable French sentences, the syllables whose rhythm was to be tapped out, and which began with the same stop consonant, were always word-final (usually, in fact, they belonged to monosyllabic words). Final syllables in French are traditionally lengthened, and therefore carry some of the acoustic correlates of English stress. So English listeners, at least, should have heard the syllables they were tapping to as stressed.

We also constructed a second set of French sentences (Table I). In these sentences, which we will refer to as Non-Final-Syllable French, some of the stop consonants marked non-final syllables of polysyllabic words. In these cases, English listeners might have had the impression that they were tapping to unstressed syllables.

We recorded each set of sentences spoken by a female native speaker. The resulting utterances each contained one tone group. Sixty-four subjects listened to 10 repetitions of all sentences and were instructed to tap out the rhythm of the target sounds after each presentation. (The target sounds are italicized in Table I.) Thirty-two of the 64 subjects were native speakers of British English who could neither read nor speak French. Potential subjects were rejected if they had passed O Level French or could translate any of the sentences. All of the English subjects were tested at the University of Sussex.
The remaining 32 subjects were native speakers of French and were tested in Paris. We were unable to find French subjects who had had as little exposure to English as our English subjects had had to French, but none of the French subjects could converse in English.

All 64 subjects were tested on both English and French. Half of the English subjects and half of the French heard the English sentences before the French sentences and the other half heard them the other way around. Half of the subjects were tested on the Final-Syllable French stimuli, and the other half on the Non-Final-Syllable French stimuli. As in Donovan and Darwin's experiments, subjects also participated in a control condition, where they heard five sets of four noise bursts in silence, with the rhythm of each set of noise bursts taken from one of the sentences. Each set of noise bursts was repeated five times. Each block of stimuli was preceded by a practice session consisting of five repetitions of each stimulus in the speech condition and two repetitions in the noise burst condition.

Table I
Stimulus sentences

English
1. the Dame Doubts that her Debts have reDoubled
2. the Beer is always Bad in that Bar where he Boozes
3. the Turk takes his Tiger from the Tower
4. the Play will Please both Peter and Paul
5. the Prince surPrized his Page in the Park

Final-Syllable French
1. la Dame se Doute que ses Dettes reDoublent
2. la Bière est toujours Bonne dans ce Bar où il Boit
3. le Turc Tient son Tigre près de la Tour
4. la Pièce va Plaire à Pierre et à Paul
5. le Prince surPrend son Page dans le Parc

Non-Final-Syllable French
1. le Dentiste Doute que ses Dettes aient Doublé
2. les Brioches sont Bonnes dans la Brasserie à Boulogne
3. le Tunisien Tient son Turban dans le Tiroir
4. la Perruche Plaira au Prince du Portugal
5. la Princesse surPrendra son Page en Pologne
Results
We measured the intervals between the four tap responses made to each stimulus, and calculated the deviation from regularity of the taps. For each block of 10 repetitions of each sentence, we ignored the responses to the first three and last two presentations. We adopted a measure of irregularity of taps shown in Table II. The value of this measure is 0 if the taps are perfectly regular, and increases as the ratios of the intervals between the taps diverge from 1. That is, a ratio of 3:1 counts as more irregular than a ratio of 2:1, regardless of the actual durations of the intervals. The measure is insensitive to the order in which the intervals occur; a set of intervals with ratios 1:2:3 is assigned the same irregularity as one with ratios 2:3:1. To determine whether the rhythm of the stresses in the sentences had been regularized, we compared the irregularity of each set of taps with that of the stimulus sentence (Table II).

Table II
Measures of irregularity and regularization

where I_1 is the first interval, I_2 is the second interval and I_3 is the third interval

Regularization measure = Irregularity (response intervals) - Irregularity (actual intervals)
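The irregularity formula itself is not legible in Table II as reproduced here, so the sketch below does not claim to be the authors' measure. It assumes one simple form that has every property the text states (zero for equal intervals, growing as ratios diverge from 1, independent of absolute duration and of order): the sum of absolute log ratios over all pairs of intervals. The function names `irregularity`, `regularization` and `retained` are hypothetical.

```python
import math
from itertools import combinations


def irregularity(intervals):
    """Sum of |ln(a / b)| over every pair of intervals.

    ASSUMED form, not taken from the paper: it is 0 for equal intervals,
    depends only on ratios, and ignores the order of the intervals.
    Intervals must be positive durations (e.g. in ms).
    """
    return sum(abs(math.log(a / b)) for a, b in combinations(intervals, 2))


def regularization(response_intervals, stimulus_intervals):
    """Irregularity of the taps minus irregularity of the stimulus.

    Negative values mean the taps were more regular than the stimulus.
    """
    return irregularity(response_intervals) - irregularity(stimulus_intervals)


def retained(block, skip_first=3, skip_last=2):
    """Drop the first and last presentations of a block before scoring.

    Defaults follow the speech blocks (10 repetitions, first three and
    last two ignored); the control blocks would use skip_first=1 and
    skip_last=1.
    """
    return block[skip_first:len(block) - skip_last]
```

Under this assumed measure, taps at intervals of 400, 420 and 410 ms made in response to stimulus intervals of 300, 500 and 350 ms give a negative regularization score, which is the sign the text associates with regularized responses.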
If the rhythm of the sentence has been regularized (i.e. if the rhythm of the taps is less irregular than that of the stress beats of the stimulus), then subtracting the irregularity value for the target from that of the taps will yield a negative number. A positive or zero value means that the rhythm of the stresses has not been regularized. The same calculations were made for all but the first and last presentations of each block of five repetitions of the control stimuli.

English listeners
Our results for English listeners tapping to English and control stimuli replicate the Donovan and Darwin finding (see Table III). The rhythm of the tap responses is significantly more regular than that of the stress beats of the target utterances, whereas that of the responses to the control noise-bursts is not. If the subjects' responses to English reflect some property of the speech itself, which it possesses by virtue of English being a stress-timed language, then, given the categorization of French as syllable-timed, one would expect to find no regularization of the French stimuli. But the English listeners did in fact regularize their responses to the French stimuli, and their degree of regularization of the two languages did not differ significantly (F(1,70) = 1.91, P < 0.20).
Table III
English listeners: mean regularity score and t score for mean deviation from zero (tz). A negative regularity score indicates that the rhythms of the taps are more regular than the rhythms of the stimuli, and vice versa for a positive value

Stimuli         Mean     tz      df    P
English        -0.12    8.38    799   < 0.001
French         -0.18    8.23    799   < 0.001
Noise bursts    0.19   10.53    479   < 0.001
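The "t score for mean deviation from zero" reads like a one-sample t statistic on the regularization scores, although the paper does not spell out the computation. A minimal sketch under that assumption follows; `t_against_zero` is a hypothetical name, and the tables appear to list tz without a sign, so the sketch returns the signed value and leaves taking the magnitude to the reader.

```python
import math
from statistics import mean, stdev


def t_against_zero(scores):
    """One-sample t statistic testing whether the mean score differs from 0.

    Returns (t, degrees_of_freedom). ASSUMES each score is an independent
    observation; needs at least two scores.
    """
    n = len(scores)
    standard_error = stdev(scores) / math.sqrt(n)  # sample SD over sqrt(n)
    return mean(scores) / standard_error, n - 1
```

A negative t here corresponds to a negative mean regularization score, i.e. to taps that are on average more regular than the stimulus.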
It is possible to argue that, from the point of view of a monolingual English listener, our French stimuli did possess an important property of English, namely that, on the whole, every second or third syllable could be heard as stressed, and that this might be enough to trigger regularization. We would then predict a difference in responses to final-stressed and non-final-stressed sentences, because in the latter case some of the syllables determining the rhythm to be tapped out lacked the acoustic correlates of English stress. However, the English listeners did not regularize the final-syllable stimuli any more than they did the non-final-syllable French stimuli (F(1,7) = 5.52, P < 0.06). So if regularization is the result of stress-timing at all, it must be by way of a tendency to regularize that is developed by speakers of the language. In this case, one would expect French listeners not to regularize either set of stimuli as the English do.

French listeners
Like their English counterparts, French listeners regularize their responses to both English and French, but not to the control noise bursts (see Table IV). Furthermore, they do significantly more regularizing to the French stimuli than to the English stimuli (F(1,7) = 32.92, P < 0.0001). Patently, this should not be the case if responses on the tapping task are to be seen as reflecting the underlying stress-timed and syllable-timed rhythms of English and French respectively. As with the English listeners, French listeners do not respond differently to the different types of French stimuli (F(1,7) = 5.03, P < 0.06).
Table IV
French listeners: mean regularity score and t score for mean deviation from zero (tz). A negative regularity score indicates that the rhythms of the taps are more regular than the rhythms of the stimuli, and vice versa for a positive value

Stimuli         Mean     tz      df     P
English        -0.13    7.31    799    < 0.001
French         -0.43   19.53    799    < 0.001
Noise bursts    0.10    5.20    464*   < 0.001

*Noise-burst data missing for one subject
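The degrees of freedom in Tables III and IV can be checked against the design. The arithmetic below assumes that every retained set of taps counts as one observation, which is a reading that reproduces the tabled values rather than something the text states; `expected_df` and the two constants are hypothetical names.

```python
# Retained presentations per stimulus, from the Results section.
SPEECH_RETAINED = 10 - 3 - 2  # 10 repetitions, first three and last two ignored
NOISE_RETAINED = 5 - 1 - 1    # 5 repetitions, first and last ignored


def expected_df(subjects, stimuli, retained_repetitions):
    """Degrees of freedom if each retained response is one observation (ASSUMED)."""
    return subjects * stimuli * retained_repetitions - 1


english_speech_df = expected_df(32, 5, SPEECH_RETAINED)  # 32 listeners, 5 sentences
english_noise_df = expected_df(32, 5, NOISE_RETAINED)    # 5 noise-burst sequences
french_noise_df = expected_df(31, 5, NOISE_RETAINED)     # one subject's data missing
```

On this reading the three values come out as 799, 479 and 464, matching the df columns, including the footnoted French noise-burst entry.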
Comparisons by language of listeners and stimuli
The tendency toward regularization, it appears, is specific neither to English speech nor to English listeners. Indeed, both groups of listeners tap more regularly to French than they do to English (F(1,60) = 13.92, P < 0.0005), and French listeners do more regularizing overall than do English listeners (F(1,60) = 6.82, P < 0.02).

The most obvious difference between the two classes of listeners is that the French take significantly more time to tap to the speech stimuli (i.e. they put longer gaps between successive taps) than do the English listeners (F(1,60) = 10.24, P < 0.003). Both sets of subjects overestimate the gaps between the target syllables, but the French listeners overestimate significantly more than the English (F(1,300) = 7.69, P < 0.004). This effect is not due to any general superiority of the English subjects at tapping out rhythms, since the French are in fact more accurate on the non-speech control stimuli (t = 3.78, df = 943, P < 0.0001). It may simply indicate that the task of tapping to the given syllables is a more unnatural task for a French speaker than for an English one. This hypothesis is supported by the fact that the mean standard error of tap durations to speech stimuli is larger for French listeners than for English listeners (French: 50.11 ms; English: 30.60 ms).

Discussion
The results of this experiment call into question the conclusions Donovan and Darwin draw from their results and the assumption, implicit in their studies, that subjects' performance on the tapping task accurately reflects their perceived rhythmic structure. It is clear from the results presented here that Donovan and Darwin's results do not represent evidence for any more perceptual isochrony in English than in French, and as such do not constitute support for the claim that English has an underlying isochronous rhythm (unless French is claimed to have one as well).
What is not clear from our results is why listeners should regularize their taps to speech stimuli and not to the noise stimuli. Is the phenomenon of regularization language-bound, but not to any specific language? One possible answer lies in the differing complexity of the two tasks: tapping to speech is a more difficult task than tapping to noise bursts in silence. The speech stimuli are acoustically more complex than the corresponding sequences of noise bursts, and the memory load in remembering a sentence is greater than that in remembering four noise bursts. Given the difficulty involved, listeners may respond to the speech stimuli in a very conservative manner, that is, by producing regularly spaced taps. This explanation is tested in Experiment II.

Experiment II
Three sets of stimuli were employed in this experiment: (1) the five English utterances and (2) matched noise-burst sequences used in Experiment I and (3) a new set of five matched stimuli which were not speech but which were similar to speech in terms of acoustic complexity. The new set of stimuli was constructed by distorting the English utterances in such a way that, outside the target syllables, segmental information was very severely degraded. This was achieved by performing a linear prediction analysis on the English utterances, replacing the reflection coefficients of each analysis frame with the mean of the coefficients 100 ms (10 frames) on either side of it, restoring 200 ms of the original target syllable and, finally, resynthesizing the waveform with these new coefficient values. The result bore a certain resemblance to speech heard underwater, and was not intelligible.

Nine English subjects listened to all three sets of stimuli, and were given the same instructions as the subjects in Experiment I. Subjects were divided into three groups, each hearing the blocks of stimuli in a different order: (1) noise bursts, intact speech, distorted speech; (2) intact speech, distorted speech, noise bursts; (3) distorted speech, noise bursts, intact speech. Within each block, each stimulus was repeated 10 times, and each block was preceded by a practice session consisting of two repetitions of each stimulus.

If the differences found in Experiment I between responses to speech and to noise burst stimuli are a function of the differences in acoustic complexity or memory load between these two types of stimuli, then listeners in the present experiment should respond to the distorted speech as they do to intact speech; they should regularize it. If, on the other hand, the phenomenon of regularization is exclusive to language, then listeners should respond to the distorted speech as they do to the noise bursts; they should not regularize it.
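The coefficient-smoothing step used to build the distorted stimuli in Experiment II can be sketched as follows. This is only an illustration: the linear prediction analysis and the resynthesis are not shown, each frame is taken to be a plain list of reflection coefficients, the averaging window is assumed to include the centre frame and to be truncated at the edges of the utterance, and the target onsets passed to `restore_targets` are hypothetical inputs. The 10-frame half window and the 20-frame (200 ms) restored stretch follow from the 10 ms frames implied by the text.

```python
def smooth_frames(frames, half_window=10):
    """Replace each frame by the mean of the frames around it.

    frames: list of per-frame reflection-coefficient lists (equal length).
    half_window: frames averaged on either side (10 frames = 100 ms).
    ASSUMPTION: the centre frame is included and the window is truncated
    at the start and end of the utterance.
    """
    smoothed = []
    for i in range(len(frames)):
        lo = max(0, i - half_window)
        hi = min(len(frames), i + half_window + 1)
        window = frames[lo:hi]
        order = len(frames[i])
        smoothed.append(
            [sum(frame[k] for frame in window) / len(window) for k in range(order)]
        )
    return smoothed


def restore_targets(original, smoothed, target_starts, target_frames=20):
    """Copy the original coefficients back over each target syllable.

    target_starts: frame index where each target syllable begins
    (hypothetical input; the paper does not say how targets were located).
    target_frames: frames restored per target (20 frames = 200 ms).
    """
    result = [list(frame) for frame in smoothed]
    for start in target_starts:
        for i in range(start, min(start + target_frames, len(original))):
            result[i] = list(original[i])
    return result
```

Averaging over a 21-frame window blurs segmental detail between targets while the restored stretches keep the target syllables intact, which is the contrast the experiment relies on.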
Results
As in Experiment I, listeners in this experiment regularized the intact speech but not the noise burst stimuli. However, they also regularized the distorted speech. Analysis of variance of these data shows a significant effect of type of stimulus (intact speech, distorted speech, noise bursts) on the regularity of responses (F(2,12) = 5.4517, P < 0.03). There was no significant effect of the order in which stimuli were presented (F(2,6) = 1.3562, P < 0.4), and this variable did not interact significantly with type of stimulus (F(4,12) = 1.64, P < 0.3). Listeners' responses to the distorted speech were not found to be significantly different from their responses to the intact speech (F(1,8) = 0.5142, P < 0.6), suggesting that acoustic complexity of the signal alone contributes to the regularity of listeners' responses. This hypothesis is further supported by the finding that subjects' responses to the distorted speech are significantly different from their responses to the non-speech controls (F(1,8) = 22.19, P < 0.002).

The results of this experiment show that the phenomenon of regularization is not even specific to speech, but extends to other unintelligible noises with some speech-like properties. They raise the possibility that the subjects are not actually doing anything very interesting at all: that they are simply exhibiting a response bias toward evenly spaced taps when the task becomes difficult. Evidence against such a hypothesis comes from the second experiment by Darwin & Donovan (1980), in which small manipulations of the stimulus sentences, which changed syntactic structure and tone group boundaries while leaving interstress intervals intact, produced different responses from listeners. Their result suggests that much less regularizing occurs across tone group boundaries, but still leaves open the possibility that the true interstress intervals within tone groups are largely ignored.
Conclusions
Donovan and Darwin's results raised the possibility that the phenomenon of regularization in the tapping task might identify Fowler's constraint on interstress intervals as underlying isochrony, distorted by the demands of speech production. The situation now appears more complicated.
(1) Regularization as such does not differentiate between English and French, and so does not even help to support the weak stress-timed/syllable-timed distinction, much less the isochrony principle.
(2) The version of the tapping task that we used does elicit different performance from English and French speakers. This may well reflect different rhythmic characteristics of the two languages. In particular, it is possible that the French would find it more natural to tap to a different selection of syllables.
(3) The phenomenon of regularization may be in part a response bias that comes into play as the task becomes more difficult.

This investigation was supported in part by the European Science Foundation and the Science and Engineering Research Council, UK. The authors wish to acknowledge the technical assistance of John Doyle, Gerri Foster and David Miller, and thank Andrew Donovan, Anne Cutler and Marie Jefsoutine for valuable discussion.

References
Abercrombie, D. (1964). Syllable quantity and enclitics in English. In In Honour of Daniel Jones (Abercrombie, D., Fry, D. F., McCarthy, P. A. D., Scott, N. C. & Trim, J. M. L., eds). London: Longmans.
Abercrombie, D. (1967). Elements of General Phonetics. Edinburgh: Edinburgh University Press.
Barnwell, T. P. (1971). An algorithm for segment durations in reading machine context. Technical Report 479. Cambridge, Mass.: M.I.T. Research Laboratory of Electronics.
Bolinger, D. W. (1965). Pitch accent and sentence rhythm. In Forms of English: Accent, Morpheme, Order (I. Abe & T. Kanekiyo, eds). Cambridge: Cambridge University Press.
Carlson, R., Granstrom, B. & Klatt, D. H. (1979). Some notes on the perception of temporal patterns in speech. In Proceedings of the Ninth International Congress of Phonetic Sciences, Vol. 2, Copenhagen.
Classe, A. (1939). The Rhythm of English Prose. Oxford: Blackwell.
Cutler, A. (1980). Syllable omission errors and isochrony. In Temporal Variables in Speech: Studies in Honour of Frieda Goldman-Eisler (H. W. Dechert, ed.). The Hague: Mouton.
Darwin, C. J. & Donovan, A. (1980). Perceptual studies of speech rhythm: isochrony and intonation. In Proceedings of NATO ASI on Spoken Language Generation and Understanding (J. C. Simon, ed.), pp. 77-95. Dordrecht: Reidel.
Dauer, R. M. (1983). Stress-timing and syllable-timing reanalysed. Journal of Phonetics, 11, 51-62.
Donovan, A. & Darwin, C. J. (1979). The perceived rhythm of speech. In Proceedings of the Ninth International Congress of Phonetic Sciences, Vol. 2, Copenhagen.
Fowler, C. (1977). Timing Control in Speech Production. Bloomington: Indiana University Linguistics Club.
Giegerich, H. J. (1980). On stress timing in English phonology. Lingua, 51, 73-79.
Huggins, A. W. F. (1972). On the perception of temporal phenomena in speech. Journal of the Acoustical Society of America, 51, 1279-1290.
Ladefoged, P. (1975). A Course in Phonetics. New York: Harcourt Brace Jovanovich.
Lea, W. A. (1974). Prosodic aids to speech recognition: IV. A general strategy for prosodically-guided speech understanding. Univac Report No. PX10791. St. Paul, Minn.: Sperry Univac, DSD.
Lehiste, I. (1977). Isochrony reconsidered. Journal of Phonetics, 5, 253-263.
Liberman, M. & Prince, A. (1977). On stress and linguistic rhythm. Linguistic Inquiry, 8, 249-336.
O'Connor, J. D. (1968). The duration of the foot in relation to the number of component sound segments. In Progress Report 3, Phonetics Laboratory, University College, London.
Pike, K. (1945). The Intonation of American English. Ann Arbor, Mich.: University of Michigan Press.
Roach, P. (1982). On the distinction between "stress-timed" and "syllable-timed" languages. In Linguistic Controversies (D. Crystal, ed.), pp. 73-79. London: Edward Arnold.
Scott, D. R. (1982). Duration as a cue to the perception of a phrase boundary. Journal of the Acoustical Society of America, 71, 996-1007.
Thompson, H. S. (1980). Stress and salience in English: theory and practice. Palo Alto, Calif.: Xerox Palo Alto Research Centre.