Time-warping and the perception of rhythm in speech

Journal of Phonetics (1986) 14, 231-246

Andre-Pierre Benguerel and Janet D'Arcy

School of Audiology and Speech Sciences, The University of British Columbia, Vancouver, B.C., Canada V6T 1W5

Received August 1985, and in revised form April 1986

The perception of rhythm in speech may be affected by two factors: (1) a tendency, by speakers, to lengthen syllables as the utterance progresses; (2) a tendency, by listeners, to impose rhythmicity on speech sequences. Four tests were constructed in which each stimulus consisted of a sequence of six clicks or six syllables; each test contained time-warped stimuli. Time-warping was non-linear and progressive. Native speakers of English, French and Japanese were asked to rate each sequence as accelerating, regular, or decelerating. Results indicate that, for a range of values of the time-warping parameter, stimuli were perceived as regular. However, most of these stimuli were not acoustically isochronous, but decelerating. The native language of the listeners, the nature of the stimulus, and the order of the tests did not have any significant effect on the results. The difference limen for regularity was estimated, then measured through additional testing.

1. Introduction

Amidst the great diversity and variability of human behaviour, one is struck by the number of its features which involve regularity or repetitiveness. Nowhere is this more evident than in language and speech. Predictability implies some sort of regularity, and communication would be hard to imagine without it. However, not all speech utterances follow these regularities equally well; breaks and inflections in the regularity of an utterance can be used to draw the listener's attention and convey additional information.

Studies of the phonetic aspects of rhythm are numerous, but the importance of rhythm in linguistics in general, and its place in phonology in particular, has been considered seriously only in the recent past (Liberman & Prince, 1977; Hayes, 1984). It is already obvious, however, that a detailed account of language will require a lot more knowledge about rhythm.

Rhythm is very difficult to define satisfactorily, but it seems to have a dual nature whose two aspects are periodicity and structure (Fraisse, 1974, 1982). Periodicity is evidenced by the repetitiveness of certain events (here sound events), whereas structure deals with the constitutive elements (or subunits) of what is being repeated periodically. In many cases, the temporal organization of events seems to be hierarchical and rhythm can be considered at any of several levels (e.g. mora, syllable, foot, phrases, pauses, etc.).

0095-4470/86/020231+16 $03.00/0 © 1986 Academic Press Inc. (London) Ltd.

This dichotomy is very similar to the dichotomy encountered in music, i.e. grouping structure vs. metrical structure (Lerdahl & Jackendoff, 1983, Chapter 2). Language-specific constraints (e.g. syllable structure, existence and nature of linguistic stress, existence of tones, etc.) appear to be important in determining which one of these levels is most prominent in speech rhythm.

In fact, speech rhythm can be (and has been) characterized in several different fashions. It has been proposed (Pike, 1945) that languages can be categorized as having either syllable-timing or stress-timing. According to this proposal, in languages having syllable-timing, syllables would tend to recur at regular intervals, whereas in languages having stress-timing, the tendency would be for stressed syllables to recur at regular intervals. How strictly the word "tendency" is taken is, of course, the source of much argument. Other possible categories of timing have been added, such as mora-timing (Bloch, 1950) and tone-timing (Williamson, 1965). Although there is generally some agreement that English is a stress-timed language whereas French is a syllable-timed language, there is considerable argument as to how a language whose timing status is still undetermined should be categorized on the basis of specific measurements, whether articulatory, acoustical or perceptual. O'Connor (1973, pp. 238-239), envisaging a weaker version of the stress- vs. syllable-timing dichotomy, proposes stress- and syllable-based rhythms.

But even for English and French, the "archetypes" usually cited as examples of stress-timing and syllable-timing, agreement is not unanimous. Shen & Peterson (1962) have expressed doubt about English being a stress-timed language in the first sense defined above. Wenk & Wieland (1982) have likewise expressed doubts about French being a syllable-timed language and have proposed instead that English is "leader-timed" or "regulated group-initially", whereas French is "trailer-timed" or "regulated group-finally". The moraic nature of timing in Japanese, similarly, is far from being unanimously recognized (Beckman, 1982). Discussing stress-timed and syllable-timed languages, Ladefoged (1975, p. 222) proposes that:

Perhaps a better typology of rhythmic differences among languages would be to divide languages into those that have variable word stress (such as English and German), those that have fixed word stress (such as Czech, Polish and Swahili), and those that have fixed phrase stress (such as French).

For Hoequist (1983), on the other hand:

In a more general view, it may be better to compare the languages as something other than exemplars of various types of isochrony, though Japanese does fit the description fairly well. Japanese could instead be viewed as a duration-controlling language, English as a duration-compensating language, while Spanish is neither.

Obviously, a lack of agreement exists not only as to how to assign a given language to a particular timing category, but also as to how many categories there may be, which languages belong (or do not belong) to the same category, and how these categories should be labeled.

Evidence for the interaction between production and perception in speech is abundant. Several hypotheses have been proposed to account for this interrelation, e.g. the motor theory of speech perception (Liberman, Cooper, Shankweiler & Studdert-Kennedy, 1967) and the auditory-motor theory of speech production (Ladefoged, De Clerk, Lindau & Papçun, 1972). Prosodies in general, and rhythm in particular, have added to the evidence for perception-production links (Lehiste, 1977; Fowler, 1979, 1983), and the results of this study give further support to this view.

The variability of the speech signal can be considered from two points of view. To the extent that this variability can be perceived and identified, it can transmit useful information. To the extent that it is beyond perceptual limits, it must be viewed as an acceptable departure from some canonical categorization. The mapping between the acoustic parameters of the speech signal and their perceptual (psychoacoustic or linguistic) correlates is not one-to-one, and the equality of two parameters in one domain is not, in general, preserved in the other. For example, sounds equal in acoustic intensity may be perceived as subjectively unequal; thus, at equal intensity, an [a] will sound less loud than an [i], or conversely, at equal loudness, an [a] will be more intense than an [i] (Lehiste & Peterson, 1959). Somewhat similarly, the quality of the vowel in a CVC word repeated identically in a spectrally different frame sentence may be perceived differently, depending on the context (Ladefoged & Broadbent, 1957). It should thus be no surprise that perceived regularity (or equality) need not be based on objective regularity (or equality).

Whereas the notion of regularity is fairly straightforward, deviation from regularity can be of at least two types.

(1) Random irregularity: the magnitude of the elements contributing to regularity varies in a random fashion.

(2) Time-warping: the magnitude of the elements contributing to regularity varies in some systematic fashion (e.g. linear increase, exponential decrease, etc.).

It should be noted at this point that, whereas random irregularity varies basically in one direction only, away from regularity, time-warping can vary (and can be rated) in more than one direction with respect to regularity or isochrony, e.g. speeding up vs. slowing down.
The perception of random irregularity in repetitive sequences has been investigated by Hibi (1983). He found that random timing distortion was more readily detected at repetition rates above 3 per second than below. He also found that at repetition rates faster than 3 per second, there appears to be a holistic (or Gestalt) processing mechanism, whereas below this rate, rhythm perception appears to be more analytic or, to use his wording, to proceed "on an ongoing basis". Since syllable rate seems to be typically above 3 per second (Gerber, 1974, p. 244; Malecot, 1975), whereas intervals between primary stresses correspond to rates below 3 per second (Shen & Peterson, 1962), it is imperative to use caution in relating results obtained at a particular repetition rate to stress-timing or to syllable-timing.

The present study deals with time-warping, using both speech and non-speech repetitive sequences having repetition rates above 3 per second, and the results should be looked at in this limited perspective. It is more directly related to syllable-timed languages. The possible influence of the linguistic background of the subjects, however, cannot be disregarded a priori, hence the use of subjects belonging to three substantially different linguistic backgrounds. Somewhat similarly, the possibility that subjects may respond differently to speech and non-speech stimuli (Lehiste, 1979) is the motivation for using both types of stimuli. The study considers aspects of perceived rhythmic structure (as inferred by the listener) that do not arise from variations in pitch or loudness.

The goal of the experiments presented here is to study the ability of individuals to perceive progressive time-warping in the rhythm of a repetitive sound sequence, such as a sequence of repeated syllables. This paper proposes to find answers to the following two questions:

(1) What kind and what amount of time-warping must be present in the acoustic signal for regularity to be detected?

(2) What increase or decrease of time-warping, above or below perceived regularity, is necessary for a deviation from regularity to be detected?

2. Experimental procedures

2.1. Test preparation

Four tests were designed to investigate the perception of time-warping. The goal of the first three tests was to answer question (1) above. Test 1 used sequences of six clicks, whereas Test 2 and Test 3 used sequences of six syllables, [ta] and [na], respectively. In order to make the timing of the sequences comparable from test to test, and because it was found (Allen, 1972) that "the onset of the nuclear vowel is a good first approximation as 'the' location of the rhythmic beat", the location of the clicks (in Test 1) was made to correspond to the location of the discontinuity between consonant (C) and vowel (V) (in Test 2 and Test 3). An additional test (Test 4), consisting of items similar to those in Test 3, was designed to answer question (2) above.

In order to arrive at suitable stimulus parameter values for the main tests, several pilot studies were run. Their results will be mentioned here only where they are relevant to the main tests. Out of the large number of possible time-warping functions that can be envisaged, the following ones were tested in one of the pilot studies:

(1) identical durations or intervals, except for the last one of the sequence, which is lengthened, shortened or unchanged;

(2) durations or intervals changing linearly with the sequential position of the syllable (or of the interval in the case of clicks);

(3) durations or intervals changing non-linearly, according to some equation such as the one given below.

The last of these three types seemed to be the most promising. Typical values of timing irregularities encountered in natural speech were also examined. The production data of Oller (1973) were used as a guide. The data, obtained from speakers reading nonsense words of the form "say [babab]" of up to five syllables, were recompiled and plotted for the five-syllable sequences; values were obtained for the following syllable duration ratios:

$$R_1 = S_5 / S_4, \qquad R_2 = S_5 / S_{123}, \qquad R_3 = S_4 / S_{123},$$

where $S_4$ and $S_5$ stand for the durations of the fourth and fifth syllables, respectively, and $S_{123}$ stands for the average duration of the first three syllables. The range of values obtained was quite comparable with the data of Benguerel (1971) and of Nooteboom (1973).

Lehiste (1979) has observed that listeners tend to hear as "longest" the last element of a sequence of four equal intervals. Hoequist (1985) also found partial support for such a perceptual bias. The perceptual data obtained from this pilot study indicated that, for stimuli covering symmetrically positive and negative values of the time-warping parameter a (defined below), the stimuli perceived as regular corresponded to a slight slowing-down, i.e. a positive value of a. It was thus decided to adopt a non-linear warping function which would generate syllable durations that, for some value of a, would fit the corresponding median values of $R_1$, $R_2$ and $R_3$ obtained from Oller's (1973) data. In addition, setting the value of a to zero should correspond to the acoustically isochronous case.

The function finally selected to compute the durations of the successive intervals (five in this experiment) between the (six) clicks, or to compute the durations of the (six) repeated syllables, was

$$D = N_1\, e^{a\,x^{n}/z},$$

where

D is the duration of the between-click interval,

$N_1$ is the duration of the first interval (in milliseconds),

a is a scale factor for the exponent of the exponential,

x is a parameter associated with the position of the interval in the sequence; it varies linearly and discretely between 0 (first interval) and 1 (last interval),

n is the exponent of x (when n = 2, D has the form of the Gaussian distribution function),

z is a constant adjusting the value of the exponential so that for x = 1 (i.e. for the last interval), it will always have the same value, regardless of the value of n.

For negative values of a, the value D'(a) = 1 - [D(-a) - 1] was used instead of D(a), in order to preserve the symmetry of the duration function with respect to the value a = 0. The duration function presented above is very similar, in its fitting capabilities, to the one proposed by Lindblom & Rapp (1973, p. 10, Equation 2). It has more parameters, but also more flexibility. It can be generalized, similarly to their Equation 3 (p. 11), to the case involving bi-directional effects.

2.2. Stimuli

The stimuli were prepared on a digital computer, using a sampling rate of 10 kHz. Low-pass filtering at 5 kHz (48 dB/octave) was used both before A/D conversion and after D/A conversion for all manipulated signals.

Each click stimulus (Test 1) consisted of six 1-ms clicks and each click consisted of one cycle of a 1-kHz sinewave. This cycle always had the same (zero) starting phase. The duration of the (silent) intervals between clicks was determined using the duration function given above. The range of a values was chosen on the basis of the results of the pilot study. Figure 1 shows in graphical form the durations of the five successive between-click intervals (in milliseconds) of the seven click stimuli, with the corresponding a value for each stimulus. Stimuli CL1 and CL2 are accelerating, whereas stimuli CL3 to CL7 are decelerating. A horizontal line corresponds to acoustic isochrony.
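The duration function can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function name `warped_durations` is hypothetical, the exponent is taken as a·x^n/z with n = 4 and z = 1 (values assumed here because they reproduce the syllable durations listed for Test 4 in Table III to within rounding), and the mirrored form for negative a is applied to durations expressed relative to the first one.

```python
import math


def warped_durations(a, first=143.0, count=6, n=4, z=1.0):
    """Durations (ms) of `count` successive units under progressive warping.

    Assumed form: D = first * exp(|a| * x**n / z), with x running linearly
    from 0 (first unit) to 1 (last unit). For negative a the curve is
    mirrored about the isochronous line: D' = first * (2 - exp(...)).
    n = 4 and z = 1 are assumptions, not values stated in the text.
    """
    durations = []
    for i in range(count):
        x = i / (count - 1)                    # position parameter, 0..1
        growth = math.exp(abs(a) * x ** n / z)
        factor = growth if a >= 0 else 2.0 - growth
        durations.append(first * factor)
    return durations


if __name__ == "__main__":
    # a values taken from Table III (NA-6, an isochronous case, NA 0, NA+6)
    for a in (-0.021, 0.0, 0.187, 0.395):
        row = warped_durations(a)
        print(f"a = {a:+.3f}:", [round(d, 1) for d in row],
              "total", round(sum(row), 1))
```

Setting a = 0 gives six equal durations, i.e. the acoustically isochronous case described above.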
Each speech stimulus (Test 2, Test 3 and Test 4) consisted of a six-syllable sequence where the syllables ([ta] for Test 2, [na] for Test 3 and Test 4) were phonetically identical, of the same amplitude and monotone, but where the durations varied according to the duration function given earlier. A phonetician, a native speaker of French, recorded the desired syllables, [ta] and [na], in a soundproof room. Fundamental frequency was kept constant at 100 Hz by using a reference pure tone which the speaker heard through a headphone and endeavoured to match. He also endeavoured to produce the syllables in such a way that both segments (C and V) would be long enough for the intended purpose. One of each of the two desired syllables was then selected, computer sampled and edited. Seven six-syllable sequences were constructed in such a way that (1) the duration of each syllable was equal to its value calculated from the above function, and that (2) between-onset intervals for each consonant were equal to the between-click intervals of the click stimuli. The proportion used for the duration of the consonantal part ([t] or [n]) of the CV stimuli was chosen as 47.5% of the total syllable duration; this proportion was based on measurements of naturally spoken sequences of the same type. Figure 2, similar to Fig. 1, shows the durations of the six syllables of each CV stimulus (in ms), with the corresponding a value for each stimulus. The duration of the first syllable of each CV stimulus was 143 ms, corresponding to a syllabic rate of approximately 7 per second, thus in the range encountered in the production of CV syllables. This rate, clearly in the "holistic" range determined by Hibi (1983), is also consistent with the average value given by Gerber (1974, p. 244) for English (440 syllables/min) and with the range given by Malecot (1975) for fast French speakers (360-580 syllables/min). All of the preceding values were obtained for data that also included short pauses and syllables other than just CVs, and thus possibly underestimate the rate for CV sequences.

Figure 1. Durations of the between-click intervals for the stimuli of Test 1, with the corresponding values of a (abscissa: interval; ordinate: duration in ms; curves labelled a = -0.139, -0.049, 0.069, 0.194, 0.346, 0.498 and 0.679).

Figure 2. Durations of the six syllables for the stimuli of Tests 2 and 3, with the corresponding values of a (abscissa: syllable; ordinate: duration in ms; curves labelled a = -0.139, -0.049, 0.069, 0.194, 0.346, 0.498 and 0.679).

Tests 1, 2 and 3 consisted of five repetitions of the seven stimulus-types, plus seven practice items at the beginning and three extra items at the end of each test. Each test contained 45 items, but only items 8 to 42 were used for the analysis. The pseudo-random order imposed on the test items included the constraint that no stimulus-type could be repeated until at least three other stimulus-types had intervened. Test 4, designed to estimate the threshold of time-warping, will be described in Section 4.1.
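The ordering constraint for the 35 scored items can be illustrated with a short sketch. The approach shown (random greedy filling with restarts) is an assumption chosen for simplicity; the text does not say how the original order was produced, and the names `constrained_order` and `min_gap` are hypothetical.

```python
import random


def constrained_order(types, repetitions, min_gap=3, rng=None, max_tries=10_000):
    """Random sequence with each type used `repetitions` times and at least
    `min_gap` other items between two occurrences of the same type.

    Greedy random filling; a dead end simply triggers a fresh attempt.
    """
    rng = rng or random.Random()
    total = len(types) * repetitions
    for _ in range(max_tries):
        remaining = {t: repetitions for t in types}
        order = []
        while len(order) < total:
            recent = order[-min_gap:] if min_gap else []
            candidates = [t for t, left in remaining.items()
                          if left > 0 and t not in recent]
            if not candidates:
                break                      # dead end: restart from scratch
            # Favour types with more repetitions left to reduce dead ends.
            weights = [remaining[t] for t in candidates]
            choice = rng.choices(candidates, weights=weights)[0]
            order.append(choice)
            remaining[choice] -= 1
        else:
            return order                   # loop finished without a dead end
    raise RuntimeError("no valid order found")


if __name__ == "__main__":
    click_types = [f"CL{i}" for i in range(1, 8)]   # seven stimulus-types
    print(constrained_order(click_types, repetitions=5, rng=random.Random(1)))
```

The practice items and the three extra items are left out of the sketch because their composition is not fully specified.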

2.3. Subjects

Twenty-four adults, none of whom was linguistically naive, were used as subjects for Tests 1, 2 and 3. These subjects were divided into three groups according to their native language, as indicated below.

Group I consisted of 12 subjects (six females and six males) who were native speakers of English and whose age varied between 23 and 43 years.

Group II consisted of six subjects (three females and three males) who were native speakers of French and whose age varied between 25 and 50 years.

Group III consisted of six subjects (three females and three males) who were native speakers of Japanese and whose age varied between 24 and 37 years.

All of the subjects had at least a working knowledge of English. None of the subjects had any known speech or hearing problems, and all had hearing within normal limits at the frequencies considered most important for speech.

2.4. Testing procedure

Subjects were seated individually in a quiet room. They were informed that they would hear sequences of non-speech and speech sounds: in one test (Test 1), they would hear stimuli consisting of sequences of clicks; in another test (Test 2), they would hear stimuli consisting of sequences of [ta] syllables; and in another test (Test 3), they would hear stimuli consisting of sequences of [na] syllables. They were further informed that they might notice that the timing was altered (either speeding up or slowing down) in some of the sequences and were asked to put a check in the appropriate column of the seven-point, forced-choice response form, after listening to each test item, and without assuming that the responses should necessarily be evenly distributed. Columns 1, 4 and 7 were labeled "speeded up", "regular" and "slowed down", respectively. Subjects were encouraged to use columns 2, 3, 5 and 6 whenever they felt that the timing was intermediate between the three labels given.

Occurrences of the most time-warped stimuli were always included among the practice items, and the subjects were made aware of this fact. For each of the tests, the first seven items were presented to the subject, the tape was then stopped, and any questions the subject had were answered. The tape was then restarted at item 1. The subject was allowed to rest for 1 to 3 minutes between tests while the experimenter set up the equipment for the next test.

The test tape was played on a Revox A77 tape recorder and presented over TDH-39 headphones at a level of 60-70 dB SPL. For each group of six subjects, the three tests were presented in each of the six possible orders.

3. Results

3.1. Data analysis

In order to analyze the results of the perception tests, each listener response was assigned a number from 1 (speeded up) to 7 (slowed down), and the responses of each subject in each test were collected in a response matrix. Table I presents an example of such a matrix for Test 2 and Subject E2. The responses were cumulated by stimulus-type, and means and standard deviations were computed. Figures 3, 4 and 5 show for Tests 1, 2 and 3, respectively, the response mean and ±1 SD range for each stimulus type, averaged for all subjects. The ±1 SD ranges indicate the variability in response for each stimulus type in each test. It can be seen that the variability range does not vary greatly as a function of the stimulus number.

TABLE I. Response matrix for subject E2 in Test 2

Stimulus   Responses        Mean   SD
TA1        1  1  1  1  1    1.00   0.00
TA2        1  1  1  4  1    1.60   1.34
TA3        4  4  4  4  4    4.00   0.00
TA4        4  4  4  5  4    4.20   0.45
TA5        5  4  6  6  6    5.40   0.89
TA6        7  7  6  6  6    6.40   0.55
TA7        7  7  7  7  7    7.00   0.00

Figure 3. Mean score and ±1 SD range for each stimulus of Test 1, averaged over all subjects.

Figure 4. Mean score and ±1 SD range for each stimulus of Test 2, averaged over all subjects.

Figure 5. Mean score and ±1 SD range for each stimulus of Test 3, averaged over all subjects.

3.2. Perception of regularity

In order to estimate typical values of time-warping (as measured by a) likely to be perceived consistently as regular by a particular subject, the data for each subject were first plotted separately for each test and for each subject, with each stimulus number plotted linearly on the abscissa and each corresponding response mean plotted linearly on the ordinate. A best-fit curve through the seven points was then obtained by fourth-degree polynomial interpolation, yielding 72 such curves (24 subjects x 3 tests). For each of the 72 curves, the stimulus number corresponding to a response mean of 4 was obtained by interpolation on the fitted curve. These 72 values are shown in Table II. Values from this table were used in the analysis of variance (ANOVA), which was performed for this study on a balanced subset of 18 subjects (E1-E6, F1-F6, J1-J6) with native language of subject (L), test order (O) and test type (T) as the factors. Sex of subject was nested with language and test order, and was crossed with test type, yielding a repeated measures design in which all subjects responded to all three test types.
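The summary of Table I and the curve-fitting step just described can be sketched as follows. This is an illustration under assumptions rather than the original analysis: the "best-fit" curve is taken to be an ordinary least-squares quartic, the crossing with a response of 4 is found by bisection, and all function names are hypothetical. Table II lists 3.51 for this subject and test; whether the sketch reproduces that figure depends on fitting details that the text does not specify.

```python
from statistics import mean, stdev

# Ratings of subject E2 in Test 2 (Table I): five responses per stimulus type.
RESPONSES = {
    "TA1": [1, 1, 1, 1, 1],
    "TA2": [1, 1, 1, 4, 1],
    "TA3": [4, 4, 4, 4, 4],
    "TA4": [4, 4, 4, 5, 4],
    "TA5": [5, 4, 6, 6, 6],
    "TA6": [7, 7, 6, 6, 6],
    "TA7": [7, 7, 7, 7, 7],
}


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    m = degree + 1
    rows = [[sum(x ** (i + j) for x in xs) for j in range(m)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                f = rows[r][col] / rows[col][col]
                rows[r] = [v - f * w for v, w in zip(rows[r], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]


def evaluate(coeffs, x):
    return sum(c * x ** k for k, c in enumerate(coeffs))


def crossing(coeffs, target, lo, hi, tol=1e-9):
    """Bisection for evaluate(x) == target, assuming a sign change on [lo, hi]."""
    f_lo = evaluate(coeffs, lo) - target
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = evaluate(coeffs, mid) - target
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2


means = [mean(r) for r in RESPONSES.values()]
sds = [stdev(r) for r in RESPONSES.values()]   # sample SD, as in Table I
# Centre the stimulus numbers on 4 to keep the normal equations well conditioned.
centred = [s - 4 for s in range(1, 8)]
coeffs = fit_polynomial(centred, means, degree=4)
regular_point = 4 + crossing(coeffs, target=4.0, lo=-3, hi=3)

if __name__ == "__main__":
    for name, m, s in zip(RESPONSES, means, sds):
        print(f"{name}  mean {m:.2f}  SD {s:.2f}")
    print("stimulus number rated 'regular':", round(regular_point, 2))
```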

TABLE II. Stimulus number corresponding to perception of regularity

Group I
Subject   Test 1   Test 2   Test 3
E1        3.44     3.39     3.96
E2        3.53     3.51     3.91
E3        4.02     3.65     3.80
E4        3.89     3.03     3.72
E5        4.52     3.35     4.01
E6        6.04     4.92     4.73
E7        3.67     3.58     3.90
E8        3.59     3.41     3.73
E9        4.36     4.15     3.94
E10       3.90     3.33     4.49
E11       5.41     4.21     5.04
E12       3.84     3.27     3.34

Group II
Subject   Test 1   Test 2   Test 3
F1        3.73     3.55     4.05
F2        3.56     3.68     3.79
F3        4.15     4.89     4.05
F4        3.91     3.87     3.49
F5        4.69     5.35     5.17
F6        4.15     3.33     4.21

Group III
Subject   Test 1   Test 2   Test 3
J1        3.42     5.46     4.53
J2        4.68     3.56     4.27
J3        4.52     3.33     3.16
J4        4.52     4.37     4.14
J5        5.84     4.73     3.84
J6        3.42     3.64     4.43

The data displayed in Table II were used as input for this ANOVA. They were the values estimated for each test as corresponding to each subject's perception of regularity. The results indicate that none of the factors reached statistical significance. In particular, the language background of the subjects did not show an effect on the results, although this might have been expected. The absence of such an effect may have been due to the large intra- and inter-subject variabilities; the average degree of time-warping required for the perception of regularity clearly differs from subject to subject. The nature of the stimulus and the test order did not show any effect on the results either.

A look at the distribution of the "regular" stimulus number values obtained as explained above and listed in Table II shows that the median value, for each group and for each test, lies between 3.5 and 4.5. This confirms that the value of a chosen for stimulus-type 4 on the basis of the pilot study was appropriate. Similarly, the median for all subjects and all tests is 3.9, with the interquartile range going from 3.56 to 4.46. This also confirms the value obtained from the pilot study.

3.3. Threshold of time-warping

In order to try to estimate the values of a corresponding to the stimuli likely to be perceived 50% of the time as regular, thus corresponding to some "difference limen" (DL) of time-warping, two procedures were used. For both of them, and for Test 4 as well, it was deemed of greater interest to determine the optimal performance of a few better subjects, since the experimental context itself is not typical of (average) running speech but more of idealized (and experimentally optimal) speech.

In the first procedure, the response numbers 2.5 and 5.5 (halfway between 1 and 4, and 4 and 7, respectively) were taken to correspond to the points of greatest uncertainty, thus one DL away from "regular". A value of a was determined for each one of the three better subjects and for each test in a two-step procedure. In the first step, an interpolation on the individual plots "response vs. stimulus" (described in the previous section) yielded the stimulus number values corresponding to responses of 2.5 and 5.5, for each subject and each test. In the second step, a value of a was obtained by quadratic interpolation, using the same curve used to obtain the stimulus parameter values for the tests. From these values (corresponding to 2.5, 4.0 and 5.5), increment and decrement in a were obtained.

The second procedure looked at the individual plots "standard deviation vs. stimulus number". For the four better subjects, these plots exhibited a camel-back shape, as expected. The distances between the central minimum (the anchor point for regularity) and each side hump (the point of greatest uncertainty) were converted to an increment or decrement in a.

A distribution of the Δa values obtained was established for each procedure separately, but in both cases, the mean and median values were found to be close to Δa = 0.2. Obviously, the two procedures used are relatively crude, and a more sophisticated one is needed to estimate a DL of time-warping with more precision, even if these preliminary estimates already give an idea of what was meant earlier by "more or less regular".

4. Additional testing

4.1. Test 4

Test 4, designed to refine the estimate obtained in the previous section and to measure a DL for time-warping, consisted of six subtests (4.1 through 4.6) of increasing difficulty, each consisting of five practice items, followed by 15 repetitions of each of three stimulus-types, for a total of 50 stimuli per subtest. The stimuli were of the same type as those used in Test 3, namely sequences of six [na]-syllables. For each subtest, one of the stimulus-types used was NA 0 (cf. Table III). The value of a used for stimulus-type NA 0 was chosen at a = 0.187, the median of the a values inferred from the results of Test 3 and corresponding to a response of 4 (i.e. perceptually neutral). The other two stimulus-types for each subtest had a values each on one side of the neutral value and equidistant from it (measured in terms of a), but their distance to the neutral value decreased gradually from subtest 4.1, which was presented first and consisted of items NA-6, NA 0 and NA+6, to subtest 4.6, which was presented last and consisted of items NA-1, NA 0 and NA+1. The stimuli were pseudo-randomized with two constraints: there had to be 15 stimuli of each stimulus-type, and no stimulus-type could be repeated more than once before another stimulus-type had been presented.

TABLE III. Durational structure (in ms) for the stimuli of Test 4

Stimulus   S1      S2      S3      S4      S5      S6      Total    a
NA-6       143.0   143.0   142.9   142.6   141.8   140.0   853.3    -0.021
NA-5       143.0   143.0   143.1   143.3   143.8   145.0   861.2     0.014
NA-4       143.0   143.0   143.2   143.9   145.9   150.1   869.1     0.049
NA-3       143.0   143.0   143.3   144.5   148.0   155.4   877.2     0.083
NA-2       143.0   143.0   143.4   145.2   150.1   160.9   885.6     0.118
NA-1       143.0   143.0   143.6   145.9   152.2   166.6   894.3     0.153
NA 0       143.0   143.0   143.7   146.5   154.4   172.4   903.0     0.187
NA+1       143.0   143.1   143.8   147.2   156.6   178.5   912.2     0.222
NA+2       143.0   143.1   143.9   147.8   158.8   184.8   921.4     0.256
NA+3       143.0   143.1   144.1   148.5   161.1   191.3   931.1     0.291
NA+4       143.0   143.1   144.2   149.2   163.4   198.1   941.0     0.326
NA+5       143.0   143.1   144.3   149.8   165.7   205.0   950.9     0.360
NA+6       143.0   143.1   144.5   150.5   168.1   212.3   961.5     0.395

The subjects for this test received instructions similar to those for the first three tests, but they were asked to check the appropriate column of a three-point response form. The three columns were labeled "accelerating", "regular" and "decelerating". It was felt that the labels used earlier might be ambiguous. In none of the tests, however, did any of the subjects appear to be unclear about the task requested. The four subjects participating in Test 4 were selected for their consistency in the other tests. Each subject was administered the six subtests in the same session, with a pause of 2-3 min between subtests.

The data of Test 4 were examined in terms of confusion matrices. A 3 x 3 matrix and a percentage correct score were obtained for each subject in each subtest, as well as a cumulative matrix and an average score for each subtest. Only the scores are presented in Fig. 6. A statistical distribution of random scores for a thousand similar tests was established by random assignment of 45 responses to three possible categories. This distribution shows that a score of 19 out of 45, or 42% (and a score of 21 out of 45, or 47%) can be obtained 5% (respectively, 1%) of the time by chance alone.

Figure 6. Individual scores (subjects S1, S2, S3 and S4) and mean scores for Test 4 (abscissa: subtest; ordinate: percentage correct).

Since the 13 stimuli all have a different duration, it could be argued that duration is a second cue and that subjects are really detecting duration differences rather than time-warping differences. Creelman's (1962) data predict a differential threshold value of 12% (for 75% correct) for auditory durations of about one second such as we have here. As can be calculated from the total durations given in Table III, this value of 12% is reached only in Subtest 4.1, where stimuli NA+6 and NA-6 differ from each other by just 12%.
In all the other subtests, the two extreme stimuli differ from each other by less than this value. An examination of scores and of the confusion matrices leads to several observations regarding these data. (1) Subtests 4.1 to 4.4 were easy for all four subjects and scores were always above 70%, going all the way to 100% .


(2) Starting with 4.5, perhaps already with 4.4, performance deteriorated for all subjects. In 4.6, performance of subjects S1 and S3 was lower than chance level (99% confidence level), that of subject S4 (49%) was very close to that level, while that of subject S2, although higher (60%), was also definitely declining. Figure 5 shows that the mean score for the four subjects would be equal to 75% at a point almost exactly halfway between Subtests 4.4 and 4.5, thus corresponding to Δa = 0.0865, a value which, not surprisingly, is less than the estimate obtained in Section 3.3. For these subjects, none of the sequences perceived as "regular" was actually isochronous; in fact, these were decelerating sequences.

(3) The number of cross-over confusions (i.e. between stimuli on opposite sides of the "neutral" stimulus) was never more than one per subject and subtest, except in 4.6 where it represented 8.33% of the responses. Subject S4 did not have any cross-over confusion in any of the subtests.

(4) The number of confusions above the main diagonal [i.e. the responses corresponding to stimuli that were perceived as more (positively) time-warped than expected, based on the results of Tests 2 and 3] was greater than that below the diagonal in 17 of the 24 matrices, while the situation was reversed in only three of the 24 matrices. This seems to indicate that stimuli that were perceived as accelerating (or predicted to be perceived as accelerating) presented more difficulty than those corresponding to deceleration. This is further supported by the casual remarks of some of the subjects that they had more difficulty with the (perceptually) accelerating stimuli than with the decelerating ones. However, a determination of this asymmetry was not attempted here, primarily because of the increased complication in preparing the stimuli, which would have to be tailored to each subject if such small differences are to be determined with sufficient accuracy.
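The quantities used in the observations above (percentage correct, confusions above and below the main diagonal, cross-over confusions, and a chance-level score) can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the row and column order of the matrix, the function names, and the example responses are all assumptions made for illustration, and the chance simulation assumes that every response is an independent, uniform guess among the three categories. Because the original randomization is not described in detail, thresholds obtained under this assumption need not coincide with those reported in the text.

```python
import random

# Assumed row/column order; the original matrices may be ordered differently.
LABELS = ("accelerating", "regular", "decelerating")


def confusion_matrix(stimuli, responses):
    """Count responses: m[i][j] = stimuli of type i labelled as type j."""
    index = {label: k for k, label in enumerate(LABELS)}
    m = [[0] * len(LABELS) for _ in LABELS]
    for stim, resp in zip(stimuli, responses):
        m[index[stim]][index[resp]] += 1
    return m


def summarize(m):
    """Percentage correct plus the confusion counts discussed in the text."""
    total = sum(sum(row) for row in m)
    correct = m[0][0] + m[1][1] + m[2][2]
    return {
        "percent_correct": 100.0 * correct / total,
        "above_diagonal": m[0][1] + m[0][2] + m[1][2],
        "below_diagonal": m[1][0] + m[2][0] + m[2][1],
        # Cross-over: stimuli on opposite sides of the neutral one.
        "cross_over": m[0][2] + m[2][0],
    }


def chance_threshold(tail, n_trials=45, n_categories=3, n_tests=1000, seed=0):
    """Smallest score reached or exceeded in at most `tail` of the simulated
    tests, each response being an independent uniform guess (an assumption)."""
    rng = random.Random(seed)
    scores = [
        sum(rng.randrange(n_categories) == 0 for _ in range(n_trials))
        for _ in range(n_tests)
    ]
    for score in range(n_trials + 2):
        if sum(s >= score for s in scores) / n_tests <= tail:
            return score


# Hypothetical responses for one subject in one subtest (not data from the study).
stimuli = ["accelerating"] * 15 + ["regular"] * 15 + ["decelerating"] * 15
responses = (["accelerating"] * 12 + ["regular"] * 3
             + ["regular"] * 14 + ["decelerating"] * 1
             + ["decelerating"] * 13 + ["regular"] * 1 + ["accelerating"] * 1)

matrix = confusion_matrix(stimuli, responses)
summary = summarize(matrix)
print(matrix)
print(summary)
print(chance_threshold(0.05), chance_threshold(0.01))
```

Which side of the diagonal corresponds to "more positively time-warped than expected" depends on the ordering chosen for `LABELS`, so the above/below counts should be read against that ordering.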
Observations 1 and 2 support the suggestion that Subtest 4.6 indicates a limit to how small an increment or decrement in progressive irregularity can be, while the stimulus can still be detected as not regular in an identification task. Observation 3 further supports the above suggestion by the fact that cross-over confusions (i.e. confusions between stimuli which are 2 DL apart) represent such a small proportion of the responses. Observation 4 indicates that the DL value arrived at earlier is correct since stimuli 2 DL apart get confused only shortly before those which are one DL apart.

In addition to the six subtests, subjects S1 and S2 were given an extra (and easier) subtest (4.0) before they took 4.1; they both had higher scores on 4.1 (100 and 91%) than on 4.0; subjects S3 and S4 started with 4.1 and also obtained better scores on some of the subsequent tests. Some practice effect is thus present, but it is best observable in the first two or three subtests presented. One subject was re-administered Subtests 4.1 through 4.6 two weeks later. His results were very close to the earlier ones and did not reveal any improvement, nor any worsening. These last two observations, together with Fig. 6, show that there is some practice effect, but that it plateaus early and is not large enough to obscure the object of the investigation.

5. Conclusion

This study has clarified several points regarding the perception of rhythm in speech.


(1) What is perceived as a "regular" sequence of syllables is not, most of the time, acoustically isochronous, but is time-warped; a relatively simple function has been presented, describing this time-warping by the specification of a single parameter a.

(2) What is acoustically isochronous, conversely, is in many cases (such as in the last experiment reported here) perceived as "regular".

(3) The time-warping increment or decrement allowable while preserving the perception of "regularity" (DL) is measurable and not negligible. It was first estimated, then measured, in better than average conditions, for better than average subjects, and for repetitive sequences of identical elements. In a real-life situation, a number of other factors would most certainly interfere and enlarge this DL value, e.g. the abilities of the subjects, the nature and variability (in amplitude, duration, and spectral distribution) of the phonetic segments, other prosodic features (stress, intonation), and idiosyncratic, emotional, and behavioural variables.

Rhythm is an indispensable concept in the description of language, but its regular and repetitive nature must be sought and investigated at the perceptual level rather than at the acoustic level, where it has often been sought, unsuccessfully (Classe, 1939; Shen & Peterson, 1962; O'Connor, 1965). Crystal (1969, p. 162) wrote the following:

Clearly, if one means by isochrony a direct perception of regular peaks of prominence running through all the utterances of an individual, then English is not isochronous: careful measurement plus elementary statistics shows such regularity to be the exception, not the rule ... Classe (1939) demonstrates this, and it is also clear from the work of O'Connor (1965) and Shen & Peterson (1962);

It seems surprising that anyone interested in finding evidence for perceptual regularity would expect measurements (no matter how careful) of the acoustic signal to provide such evidence. It also appears that rhythmicity, or regularity, corresponds to a fuzzy area rather than to a clear line. This should not be a surprise when compared with other phonetic constructs. The overall variance corresponding to this fuzzy area is the sum of at least two main components: the first corresponds to random irregularity and was measured by Hibi (1983), and the second corresponds to time-warping and was measured in this study.

In our view, if there exists any regularity or rhythmicity in speech, it is at the perceptual level, and possibly at the pre-production level, but not at the acoustic level. This hypothesis has been made before (Lehiste, 1973; Fowler, 1979). In the latter study, the results obtained point in the same direction as those in this study, although the experimental contexts differ on at least two points: Fowler's study investigated more specifically stress-timing, and for syllabic rates that are in the "ongoing processing" region (Hibi, 1983), whereas the stimuli of the present study were definitely not stress-timed and had a syllabic rate clearly in the "holistic processing" region (Hibi, 1983). We hypothesize that, due to articulatory, linguistic and other constraints, what is intended to be regular at the pre-production stage becomes time-warped, usually in the direction of a deceleration, resulting in a lengthening which is most marked at the end of an utterance or of a breath-group.
This lengthening has been investigated by several authors (Benguerel, 1971; Lindblom & Rapp, 1973; Nooteboom, 1973; Oller, 1973; Klatt, 1975) and is known to be affected by several factors, chief among them the number of syllables in the utterance, the position of the syllable in the utterance, the intonation pattern and the nature of the phonetic segments involved. The results of the present study indicate that at the perceptual stage, a time-warping in the reverse direction takes place, in some sense "de-warping" the acoustic signal.

These considerations raise two questions.

(1) Is regularity at the pre-production stage really intended or just a by-product of the speech production process? Can this point be tested, and how?

(2) Does the "de-warping" involved in perception cancel out, at least within the limits of discrimination, the warping assumed to take place at production time?

No obvious strategy is available to try to answer the first question, but some experiments are already underway to try to answer the second one.

References

Allen, G. D. (1972) The location of rhythmic stress beats in English: an experimental study I and II, Language and Speech, 15, 72-100, 179-195.
Beckman, M. (1982) Segment duration and the 'Mora' in Japanese, Phonetica, 39, 113-135.
Benguerel, A-P. (1971) Duration of French vowels in unemphatic stress, Language and Speech, 14, 383-391.
Bloch, B. (1950) Studies in colloquial Japanese IV: phonemics, Language, 26, 86-125.
Classe, A. (1939) The rhythm of English prose. Oxford: Blackwell.
Creelman, C. D. (1962) Human discrimination of auditory duration, Journal of the Acoustical Society of America, 34, 582-593.
Crystal, D. (1969) Prosodic systems and intonation in English. Cambridge: Cambridge University Press.
Fraisse, P. (1974) Psychologie du rythme. Paris: Presses Universitaires de France.
Fraisse, P. (1982) Rhythm and tempo. In The Psychology of Music (D. Deutsch, editor), Chapter 6, pp. 149-180. New York: Academic Press.
Fowler, C. A. (1979) "Perceptual centers" in speech production and perception, Perception and Psychophysics, 25, 375-388.
Fowler, C. A. (1983) Converging sources of evidence on spoken and perceived rhythms of speech: cyclic production of vowels in monosyllabic stress feet, Journal of Experimental Psychology: General, 112, 386-412.
Gerber, S. E. (1974) Introductory hearing science: physical and psychological concepts. Philadelphia: W. B. Saunders Co.
Hayes, B. (1984) The phonology of rhythm in English, Linguistic Inquiry, 15, 33-74.
Hibi, S. (1983) Rhythm perception in repetitive sound sequence, Journal of the Acoustical Society of Japan, 4, 83-95.
Hoequist, Ch. (1983) Syllable duration in stress-, syllable- and mora-timed languages, Phonetica, 40, 203-237.
Hoequist, Ch. (1985) Parameters of speech perception, Arbeitsberichte des Instituts für Phonetik Kiel, 20, 99-138.
Klatt, D. H. (1975) Vowel lengthening is syntactically determined in a connected discourse, Journal of Phonetics, 3, 129-140.
Ladefoged, P. (1975) A course in phonetics. New York: Harcourt Brace Jovanovich.
Ladefoged, P. & Broadbent, D. E. (1957) Information conveyed by vowels, Journal of the Acoustical Society of America, 29, 98-104.
Ladefoged, P., De Clerk, J., Lindau, M. & Papçun, G. (1972) An auditory-motor theory of speech production, UCLA Working Papers in Phonetics, 22, 48-75.
Lehiste, I. (1973) Rhythmic units and syntactic units in production and perception, Journal of the Acoustical Society of America, 54, 1228-1234.
Lehiste, I. (1977) Isochrony reconsidered, Journal of Phonetics, 5, 253-263.
Lehiste, I. (1979) The perception of duration within sequences of four intervals, Journal of Phonetics, 7, 313-316.
Lehiste, I. & Peterson, G. E. (1959) Vowel amplitude and phonemic stress in American English, Journal of the Acoustical Society of America, 32, 693-703.
Lerdahl, F. & Jackendoff, R. (1983) A generative theory of tonal music. Cambridge: MIT Press.
Liberman, A. M., Cooper, F. S., Shankweiler, D. S. & Studdert-Kennedy, M. (1967) Perception of the speech code, Psychological Review, 74, 431-461.
Liberman, M. Y. & Prince, A. (1977) On stress and linguistic rhythm, Linguistic Inquiry, 8, 249-336.
Lindblom, B. E. F. & Rapp, K. (1973) Some temporal regularities of spoken Swedish, Papers of the Institute of Linguistics of the University of Stockholm, Publication 21.


Malecot, A. (1975) French liaison as a function of grammatical, phonetic and paralinguistic variables, Phonetica, 32, 161-179.
Nooteboom, S. G. (1973) The perceptual reality of some prosodic durations, Journal of Phonetics, 1, 25-45.
O'Connor, J. D. (1965) The perception of time intervals, Phonetics laboratory, University College, London, Progress Report 2, pp. 11-15.
O'Connor, J. D. (1973) Phonetics. Harmondsworth: Penguin.
Oller, D. K. (1973) The duration of speech segments: the effect of position in utterance and word length, Journal of the Acoustical Society of America, 54, 1235-1247.
Pike, K. L. (1945) The Intonation of American English. Ann Arbor: The University of Michigan Press.
Shen, Y. & Peterson, G. G. (1962) Isochronism in English, University of Buffalo, studies in linguistics, occasional papers, 9, 1-36.
Wenk, B. J. & Wioland, F. (1982) Is French really syllable-timed? Journal of Phonetics, 10, 193-216.
Williamson, K. (1965) A grammar of the Kolokuma dialect of Ijo. Cambridge: Cambridge University Press.