Journal of Phonetics (1979) 7, 393-402
Range and frequency effects in consonant categorization
Stuart Michael Rosen
University College London, Department of Phonetics and Linguistics, Annexe, Wolfson House, 4 Stephenson Way, London NW1 2HE, England
Received 20th June 1978
Abstract:
It is commonly thought that the existence of shifts in the judgment of a speech continuum after repeated presentation of one of the stimuli is due to the fatiguing ("adaptation") of detectors sensitive to some aspect ("feature") of the stimuli. Shifts of this sort are common, however, in classical psychophysics and are not due to feature detectors subserving perception. Experiments are reported here that show a shift in judgment of a speech continuum with changes only in the range or relative presentation frequencies of the stimuli. It is argued that these shifts, as well as those in "adaptation" experiments, have the same perceptual origins as those in the nonspeech continua; that is, they are due to a general phenomenon common to all perceptual continua.
0095-4470/79/040393+10 $02.00 © 1979 Academic Press Inc. (London) Ltd.

Introduction
In traditional psychophysics, it is well known that a subject's judgment of stimuli is affected by a change in their range or relative presentation frequencies. For instance, in a loudness judgment task, where subjects are asked to classify the stimuli as "loud" or "soft", a stimulus that is judged "loud" in the context of softer stimuli will most likely be judged "soft" in the context of louder stimuli. This may be termed a range effect. Similarly, even when a constant range is used, stimuli that are classified "loud" when a soft stimulus occurs more frequently than other stimuli will more likely be classified "soft" when a loud stimulus is presented more often than the other stimuli. This we may call a frequency effect, which arises from a non-uniform distribution of stimulus occurrence.

Two theories have been formulated to account for these (as well as other) phenomena: adaptation level (AL) theory, associated with Helson (1964), and range-frequency (RF) theory, associated with Parducci (1965). Range-frequency theory seems to have been more successful in quantitatively describing the results of many categorization experiments, but this issue will not concern us here. We note only that reasonable theories to account for range and frequency effects do exist, and have been extensively tested.

It is currently asserted that range and frequency effects do not occur in the judgment of speech sounds, at least in the case of stop consonants (Eimas & Miller, in press). This assertion is important in the interpretation of experiments of the so-called selective adaptation type, where repeated presentation of, say, a "ba" ("adaptation") makes it more likely that stimuli will be judged as "da" than is the case when no repetition is performed. Similarly, adaptation with a "da" causes more "ba" responses than is otherwise the case. The current interpretation of these results is that there are feature detectors (either at the acoustic feature or phonetic level) for "ba" and "da" that are fatigued by repeated presentation of the patterns to which they are sensitive (Eimas & Miller, in press).

Another possibility is that these shifts in categorization can be explained by AL/RF theories if we think of adaptation as a change in the relative presentation frequencies of the stimuli. In experiments using simple stimuli (where the adaptation technique is known as anchoring), shifts in judgment are obtained whether the subject judges the extra stimuli or not (Parducci, 1974). While many experiments have been done (and continue to be done) using the adaptation technique, only a handful have been designed to test the basic hypothesis that feature detectors are indeed responsible for the shifts.

Range effects were in fact found in an experiment by Brady & Darwin (1978). The classical result of shifts in judgment with shifts in range of stimuli is clearly seen. Darwin (1976) reports this result, however, as a "cautionary footnote", and its implications for all the adaptation work are not stressed sufficiently. Frequency effects have also been studied in two papers (Sawusch & Pisoni, 1973; Sawusch, Pisoni & Cutting, 1974), where shifts in categorization due to non-uniform stimulus presentation frequency have been compared for speech and non-speech continua. Although both papers state that no shift was found in the speech judgment situation, some shift definitely has occurred in the later experiment. The shift for the non-speech judgment condition is much larger, but what is at issue is not the relative lability of boundaries, but whether it is possible to shift boundaries at all with such a technique.
It should not surprise us that non-speech judgment may be more easily influenced, since we have little prior experience in classifying sounds as "soft" or "loud", whereas we constantly name sounds as "ba" or "da". In the two experiments to follow, the effects of range and of frequency are shown, in fact, to be significant factors in the judgment of a speech continuum.

Experiment 1: range effects

Subjects
Subjects were chosen from visitors to, and a subject pool maintained at, the Communications Biophysics Laboratory at MIT. There was no selection procedure; the first four people to volunteer were used: two males and two females. Three of the four subjects had not participated in experiments involving synthetic speech before. One of the subjects had performed in some short experiments involving synthetic speech, and had much experience in experiments involving non-speech auditory stimuli. All subjects were naive as to the purposes of the experiment, were native speakers of American English and had no known history of hearing impairment.

Stimuli
Ten stimuli, numbered from 0 to 9, were synthesized using a software serial speech synthesizer (Klatt, 1972) running on a PDP-9 computer. Stimulus 0 represented a clear "ba" while stimulus 9 represented a clear "da". None of the stimuli had an initial onset noise burst. The changes in identity of the stimuli were cued primarily by the second and third formant starting frequencies. Further details of stimulus construction may be found in Appendix A.

Apparatus
The stimuli were transferred to the disk of a PDP-12 computer equipped with digital-to-analog converters (DACs). During the experiments, the stimuli were played from the
disk in real time through the DACs, low-pass filtered, amplified to a comfortable level and played through earphones to the subjects, who sat in a sound-proof chamber. The subjects gave judgments by pressing buttons on boxes in the testing chamber. The computer controlled all details of stimulus presentation and recording of subject responses.

Procedure
At the start of each day subjects gave responses to all of the stimuli possible. These were presented in a random order, subject to the condition that all were presented before any were repeated. Normally, this session contained 50 trials (5 observations per stimulus) but on the first day it contained 100 trials (10 observations per stimulus). After this the subjects gave responses to stimuli drawn from a sub-range of the entire continuum. Four sub-ranges were possible and all subjects performed them in the same order in an attempt to counterbalance the possible effects of practice. The sub-ranges, in the order the subjects judged them, were stimuli 2 through 6, stimuli 3 through 7, stimuli 4 through 8 and stimuli 1 through 5. These sessions contained 100 trials (20 observations per stimulus), again presented in a random order such that no stimuli were repeated before all were presented. In their judgments subjects pushed one of six buttons corresponding to the following rating scale:
(1) definitely a ba
(2) probably a ba
(3) maybe a ba
(4) maybe a da
(5) probably a da
(6) definitely a da

Results and discussion
The categorization functions, averaged over all subjects, are shown as a function of sub-range in Figs. 1 and 2. The function for judgments of the total continuum, averaged over the 4 days, is marked with crosses. Figure 1 uses the mean rating given to each stimulus, while for the construction of Fig. 2, all responses have been collapsed into two categories, and the proportion of "da" responses is shown.¹
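The two summaries used in Figs. 1 and 2 can be sketched in a few lines. This is only an illustration: it assumes that ratings 4 to 6 (the three "da" labels on the scale) are the ones counted as "da" when responses are collapsed, and the function names and the example ratings are hypothetical, not data from the experiment.

```python
# Six-point scale from the Procedure: 1-3 are "ba" ratings, 4-6 are "da" ratings.
# Treating 4-6 as the collapsed "da" category is an assumption based on the scale labels.
DA_RATINGS = {4, 5, 6}


def mean_rating(ratings):
    """Mean of the six-point ratings given to one stimulus (as plotted in Fig. 1)."""
    return sum(ratings) / len(ratings)


def proportion_da(ratings):
    """Proportion of ratings falling in the collapsed "da" category (as plotted in Fig. 2)."""
    return sum(r in DA_RATINGS for r in ratings) / len(ratings)


# Hypothetical ratings for a single stimulus, for illustration only.
example = [3, 4, 4, 5, 2, 4, 6, 3, 5, 4]
print(mean_rating(example), proportion_da(example))  # 4.0 0.7
```

The mean rating keeps the graded information in each response, while the proportion discards it in exchange for a summary that makes no assumption about the spacing of the categories.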
¹ The results are presented both in the form of mean ratings and binary judgments for a number of reasons. The former technique is preferable because more information is given by the subject in each judgment (assuming that these divisions are used consistently and meaningfully, which they most certainly appear to be). Thus differences between stimuli will often show up that do not show up using a binary response only. There are also drawbacks. First, it is difficult to provide a rigorous justification for using just the mean rating, as this assumes that the distance between adjacent categories is constant across the range. More advanced techniques of analysis are available (e.g. Thurstone's Method of Successive Categories), though these seem unnecessarily complicated. Secondly, statistical techniques for dealing with mean ratings are not very well developed. The advantages and disadvantages of binary judgments (or in this case, binary classification of responses) precisely complement those of the mean rating technique. Although it seems that less information is gathered per observation, and distinctions between stimuli are less apt to be realized, no assumption about the use of the categories need be made, and well developed statistical techniques for data analysis and statistical testing do exist. For these reasons, both types of analysis will be used, although all statistical analyses of significance will be done on binary responses.

Figure 1. Mean rating for each stimulus (plotted against stimulus number) under five different experimental conditions. In four of the conditions, only five of the stimuli are judged in one session and these categorization functions are marked with the number of the first stimulus in the sub-range. The crosses mark the categorization function obtained when all ten stimuli are judged during a session.

Figure 2. Proportion of "da" responses (plotted against stimulus number) under the same five experimental conditions as in Fig. 1. Symbols as in Fig. 1.

Both figures tell essentially the same story: as the sub-range shifts through the continuum, so do the categorization functions, such that the phoneme boundaries tend to shift in the same direction as the sub-range. In order to quantify this shift, estimates of the phoneme boundary were calculated from the binary judgments,² and these are shown, along with 95% confidence intervals, in Fig. 3. The phoneme boundaries are displaced with increasing sub-range as one would expect from AL/RF theory. Although the difference between the phoneme boundaries of the 2-6 and 3-7 sub-ranges is not significant, all the other differences are. These effects are not in any way the result of data averaging. All four subjects show the shift in categorization with shifting sub-range. It is interesting to note the effect of sub-range on individual stimuli shared by the different sub-ranges. In particular, stimulus 4 goes from being called "da" 65% of the time in the lowest sub-range to 16.25% in the highest. Similarly, stimulus 5 goes from 93.75% "da" judgments to 52.5%.

Figure 3. Phoneme boundaries as a function of sub-range (stimulus ranges) with 95% confidence intervals.

² Phoneme boundaries, as well as their standard errors, were estimated using a maximum likelihood technique (also known as probit analysis) to fit a cumulative normal to the observed proportions (Bock & Jones, 1968). In each of the experiments, one of the four fits had chi-square values too large at the 0.05 level to be attributable to chance. This is not as bad as it sounds, since the probability of getting one or more successes in four Bernoulli trials with a success probability of 0.05 is, in fact, 0.1855, although the probability of two or more successes in eight trials is only 0.0572. Adopting a 0.05 significance level still leaves us unable to reject the hypothesis that the psychometric functions are cumulative normals. More investigation of this matter is required.

Experiment 2: frequency effects

Subjects
Subjects were chosen from visitors and staff members at the Department of Phonetics, University College London. A total of nine subjects started the experiment, but one was discarded since she heard all the stimuli as "ba". This left five females and three males. Although most of the subjects had not previously participated in experiments involving synthetic speech, they were almost all well acquainted with it. All subjects were naive as to the purposes of the experiment, were native speakers of English (British or American), and had no known history of hearing impairment.

Stimuli
Seven stimuli were used, those numbered 1 to 7 in Experiment 1. Further details may be found in Appendix A.

Apparatus
The stimuli were again transferred to the disk of a (different) PDP-12 computer, at UCL, equipped with DACs. This time, however, the experimental sessions were recorded from the PDP-12 on a Revox tape recorder, after low-pass filtering and amplification. Subjects listened to the recordings through earphones in sound-proof cubicles and marked their responses on an answer sheet.

Procedure
Four experimental tapes were created using the same basic scheme. One block of trials was the basic unit of a session. In a block, one may specify the frequency of occurrence of each stimulus in the set separately. The stimuli that comprise one block are presented in a random permutation without regard to stimulus identity. A session consists of an arbitrary number of blocks. This is a way of obtaining the desired number of stimulus occurrences without applying the rule that all stimuli must be presented before any are repeated (in this context one occurrence per stimulus per block), and lends itself readily to unequal frequencies of stimulus occurrence. The first session was a practice session which consisted of 3 blocks with two occurrences of each stimulus per block, making for a total of 42 trials. The second session consisted of 5 blocks of the type used in the practice session, making for a total of 70 stimuli. This will be called the Uniform condition, since the stimuli occur with uniform frequency. The third session consisted of 5 blocks of stimuli where stimulus 1 occurred 12 times and all the other stimuli once, known as Skewed-1, making for a total of 90 stimuli.
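The block scheme described so far can be sketched as follows. This is a minimal illustration, not the program used to make the tapes: the function name `build_session`, the dictionary layout and the fixed random seed are all assumptions introduced here. Any table of per-block counts can be passed in, so other skewed conditions follow the same pattern.

```python
import random


def build_session(counts_per_block, n_blocks, rng):
    """Build one session as a flat list of stimulus numbers.

    counts_per_block maps stimulus number -> occurrences within a block.
    Each block is shuffled as a whole, without regard to stimulus identity.
    """
    session = []
    for _ in range(n_blocks):
        block = [s for s, n in counts_per_block.items() for _ in range(n)]
        rng.shuffle(block)
        session.extend(block)
    return session


STIMULI = range(1, 8)  # stimuli 1 to 7 of the continuum

two_each = {s: 2 for s in STIMULI}                      # practice and Uniform blocks
skewed_1 = {s: (12 if s == 1 else 1) for s in STIMULI}  # Skewed-1 blocks

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
practice_trials = build_session(two_each, 3, rng)
uniform_trials = build_session(two_each, 5, rng)
skewed_1_trials = build_session(skewed_1, 5, rng)

print(len(practice_trials), len(uniform_trials), len(skewed_1_trials))  # 42 70 90

# Occurrences of the frequent stimulus relative to all other stimuli in Skewed-1.
frequent = skewed_1_trials.count(1)
print(frequent / (len(skewed_1_trials) - frequent))  # 2.0
```

The session lengths reproduce the totals given in the text (42, 70 and 90 trials), which is a useful check that the block specification has been read correctly.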
The fourth session also contained 5 blocks of stimuli, but this time stimulus 7 occurred 12 times in a block, and all the other stimuli once: the Skewed-7 condition. All subjects started with the practice session. Half the subjects were assigned to the Skewed-1 group and half to the Skewed-7 group. In each group, half the subjects ran the Uniform condition first, and half ran the Skewed condition first. All subjects ran each session twice in a row (except for the practice session) with a 3 min rest between sessions. They were not told that sessions were repeated. Subjects made their judgments using a six-point rating scale similar to the one used in Experiment 1, with the words "definitely a", "probably a", and "maybe a" replaced by "an excellent", "a good" and "a poor". It might be noted that the subjects were much more hesitant to use "an excellent" in this experiment than to use "definitely a" in the previous one, but this has little bearing on the main results.

Results and discussion
Categorization functions, again both as binary judgments and as mean ratings, are shown for the two experimental groups in Figs. 4 and 5. For each experimental group, the results from both the Uniform and Skewed conditions are shown together. In both groups, the function as a whole shifts towards the stimulus presented most frequently. This shift may be quantified by calculating the phoneme boundaries, as was done in the previous experiment. These are shown along with 95% confidence intervals in Fig. 6, where it may be seen that the shift in the phoneme boundary is, in both cases, significant.

Figure 4. Categorization functions in the Uniform and Skewed-1 conditions shown together both as mean rating and proportion of "da" responses (plotted against stimulus number).

Figure 5. Categorization functions in the Uniform and Skewed-7 conditions shown together both as mean rating and proportion of "da" responses (plotted against stimulus number).

Figure 6. Phoneme boundaries as a function of experimental condition (Uniform and Skewed-1; Uniform and Skewed-7) with 95% confidence intervals.

It is interesting to note that the boundary in the Uniform condition is significantly different between the two groups. Recall that half the subjects started on the Skewed condition in each group. It seems likely that the shift in judgment due to the skewing of stimulus frequencies is carried at least partially through to judgments in the Uniform condition. This is verified in Fig. 7, where phoneme boundaries in the Uniform condition are shown separately for those subjects who began with the Skewed condition and those who began with the Uniform condition. The difference in phoneme boundaries between the two experimental groups for subjects who started with the Uniform condition is not significant, while the difference between the two groups for subjects who started with the Skewed condition is highly significant.

Again, the shifts occur in the individual subjects' data. Seven of the eight subjects showed a clear shift of the phoneme boundary in the expected direction. The size of the shift seems large, and somewhat comparable to the shifts reported in selective adaptation experiments.

Why didn't Sawusch & Pisoni (1973) and Sawusch et al. (1974) find any comparable shifts? Because they didn't try so hard. The stimulus frequency distributions in the present experiment are quite a bit more skewed than those in their presentation. In the 1973 paper, the most frequent stimulus occurred only once for every three other stimuli judged, while in the 1974 paper, the most frequent stimulus occurred twice for every three other stimuli judged. In this experiment the most frequent stimulus occurred six times for every three other stimuli judged.
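The boundary estimates used in both experiments come from the probit analysis described in footnote 2. A minimal sketch of such a fit is given below, using only the Python standard library. Several things here are assumptions rather than the procedure actually used: the Fisher-scoring iteration, the delta-method standard error, the function names, and above all the counts, which are generated from a hypothetical cumulative normal (boundary 4.5, standard deviation 1.0) and are not data from this paper. The last function reproduces the two binomial probabilities quoted in footnote 2.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def fit_probit(x, k, n, max_iter=100, tol=1e-10):
    """Maximum likelihood fit of P("da" | x) = Phi(a + b*x) by Fisher scoring.

    x: stimulus values, k: "da" counts, n: trials per stimulus.
    Returns (boundary, sigma, standard error of the boundary).
    """
    a, b = 0.0, 0.0
    for _ in range(max_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, ki, ni in zip(x, k, n):
            eta = a + b * xi
            p = min(max(norm_cdf(eta), 1e-12), 1.0 - 1e-12)  # guard against 0 and 1
            phi = norm_pdf(eta)
            w = ni * phi * phi / (p * (1.0 - p))              # Fisher information weight
            wz = w * eta + ni * phi * (ki / ni - p) / (p * (1.0 - p))
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += wz
            s_wxz += wz * xi
        det = s_w * s_wxx - s_wx * s_wx
        a_new = (s_wxx * s_wz - s_wx * s_wxz) / det
        b_new = (s_w * s_wxz - s_wx * s_wz) / det
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    # Covariance of (a, b) from the information matrix of the last iteration.
    var_a, var_b, cov_ab = s_wxx / det, s_w / det, -s_wx / det
    boundary = -a / b  # the 50% point of the fitted cumulative normal
    var_boundary = var_a / b**2 + a**2 * var_b / b**4 - 2.0 * a * cov_ab / b**3
    return boundary, 1.0 / b, math.sqrt(var_boundary)


def prob_at_least(m, trials, p):
    """Probability of m or more successes in independent Bernoulli trials."""
    return sum(math.comb(trials, j) * p**j * (1.0 - p) ** (trials - j)
               for j in range(m, trials + 1))


# Hypothetical counts generated from a known cumulative normal (not experimental data).
TRUE_BOUNDARY, TRUE_SIGMA, N = 4.5, 1.0, 1000
stimuli = [1, 2, 3, 4, 5, 6, 7]
trials = [N] * len(stimuli)
da_counts = [round(N * norm_cdf((s - TRUE_BOUNDARY) / TRUE_SIGMA)) for s in stimuli]

boundary, sigma, se = fit_probit(stimuli, da_counts, trials)
print(f"boundary = {boundary:.2f}, 95% interval half-width = {1.96 * se:.2f}")

print(round(prob_at_least(1, 4, 0.05), 4))  # 0.1855
print(round(prob_at_least(2, 8, 0.05), 4))  # 0.0572
```

Because the counts are generated from a known function, the fitted boundary should fall close to 4.5, which gives a simple check on the fitting routine before it is applied to real proportions.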
Figure 7. Phoneme boundaries for the Uniform condition as a function of whether the Skewed or Uniform condition was performed first, with 95% confidence intervals.

Conclusions
Both experiments have shown how easily judgments in a speech categorization task may be shifted. It seems unlikely that the shifts are due to "adaptation", as the ratio of "adaptors" to other stimuli is quite small, especially compared to the ratios used in selective adaptation experiments. A typical adaptation experiment involves anywhere from 9.4 (Sawusch, 1977) to 24 (Pisoni & Tash, 1975) to 77 (Eimas & Corbit, 1973) adaptors per judged stimulus. In Experiment 2, if we think of the stimulus occurring most often as the adaptor, the ratio of adaptors to other stimuli is only two. It is rather more difficult to make a similar estimation in Experiment 1, as none of the stimuli may be properly classified as adaptors. It can be said, though, that while it took only 100 trials altogether to show a strong effect, adaptation experiments often start with 100 to 200 presentations of the adaptor (Eimas & Corbit, 1973; Pisoni & Tash, 1975). Besides this, the "adaptors" in both experiments are much more dispersed over time, and in that way likely to have less of an "adaptation" effect.

Thus, these experiments imply that the greater part of shifts in the judgment of speech continua previously attributed to "adaptation" are, in fact, due to effects common to all sensory continua. They are not due to the existence of fatigable feature detectors, but to general principles of judgment.

The experiments reported in this paper have received support from many quarters. Pilot experiments were started at the Speech Transmission Laboratory, Royal Institute of Technology in Stockholm with the kind assistance of Bjorn Granstrom and Rolf Carlson. Sheila Blumstein at MIT synthesized the stimuli used in both experiments. Many of the staff members at the Department of Phonetics, University College London consented to run as subjects. Ken Stevens and Nat Durlach at MIT, and Adrian Fourcin and Peter Howell at University College London provided much useful discussion. My thanks to them all, and also to Maureen Paley who served as a subject in all the experiments.

References
Bock, R. D. & Jones, L. V. (1968). The Measurement and Prediction of Judgement and Choice. San Francisco: Holden-Day.
Brady, S. A. & Darwin, C. J. (1978). Range effect in the perception of voicing. Journal of the Acoustical Society of America 63, 1556-1558.
Darwin, C. J. (1976). The perception of speech. In Handbook of Perception (Carterette, E. C. & Friedman, M. P. eds), Vol. VII. New York: Academic Press.
Eimas, P. D. & Corbit, J. D. (1973). Selective adaptation of linguistic feature detectors. Cognitive Psychology 4, 99-109.
Eimas, P. D. & Miller, J. L. (in press). Effects of selective adaptation on the perception of speech and visual patterns: Evidence for feature detectors. In Perception and Experience (Pick, H. L. & Walk, R. D. eds).
Helson, H. (1964). Adaptation Level Theory. New York: Harper & Row.
Klatt, D. (1972). Acoustic theory of terminal analog synthesis. Proceedings of the 1972 International Conference of Speech Communication and Processing. Boston, MA.
Parducci, A. (1965). Category judgement: A range-frequency model. Psychological Review 72, 407-418.
Parducci, A. (1974). Contextual effects: A range-frequency analysis. In Handbook of Perception (Carterette, E. C. & Friedman, M. P. eds), Vol. II. New York: Academic Press.
Pisoni, D. B. & Tash, J. B. (1975). Auditory property detectors and processing place features in stop consonants. Perception and Psychophysics 18, 401-408.
Sawusch, J. R. (1977). Peripheral and central processes in selective adaptation of place of articulation in stop consonants. Journal of the Acoustical Society of America 62, 738-750.
Sawusch, J. R. & Pisoni, D. B. (1973). Category boundaries for speech and nonspeech sounds. Paper presented at the 86th meeting of the Acoustical Society of America, Los Angeles.
Sawusch, J. R., Pisoni, D. B. & Cutting, J. E. (1974). Category boundaries for linguistic and nonlinguistic dimensions of the same stimuli. Paper presented at the 87th meeting of the Acoustical Society of America, New York.

Appendix A: stimulus construction
A sampling frequency of 10 kHz was used. The following table gives the values of the parameters that varied over the stimulus continuum, the first being the stimulus number. The next two columns indicate the moments in time (start of stimulus at time = 0) at which F1 reached the indicated frequencies. F1 was always 250 Hz at the start of the stimulus. Linear transitions were made between these points (and in all other places where transitions are effected). The fourth and fifth columns are the onset frequencies of the second and third formants, respectively. (All formants turned on at time = 0.) The transitions from these frequencies to the formant frequencies of the following vowel were 40 ms long. The vowel formant frequencies were 720, 1240, 2500, 3500 and 4500 Hz, and the bandwidths were 50, 70, 110, 170 and 250 Hz. The fourth and fifth formants were constant over the entire duration of the stimulus, which was 255 ms.

            Time (ms) where F1 is:    F2 onset          F3 onset
Stimulus    450 Hz      720 Hz        frequency (Hz)    frequency (Hz)
0           10          20            900               2000
1           10          20            990               2090
2           10          20            1080              2180
3           10          25            1170              2270
4           10          25            1260              2360
5           15          30            1350              2450
6           15          30            1440              2540
7           15          35            1530              2630
8           15          35            1620              2720
9           15          35            1700              2800

The fundamental frequency contour was constructed with linear transitions from the points on the following table:

Time (ms)         0      35     95     255    295
Frequency (Hz)    103    125    125    94     50
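
The linear transitions described in this appendix amount to piecewise-linear interpolation between breakpoints. The sketch below shows how the fundamental frequency contour and a formant track could be evaluated at any time. It is an illustration only and not the synthesizer's own code: the function names are hypothetical, holding the end values outside the breakpoint range is an assumption, and the F1 breakpoints for stimulus 0 follow the reading of the parameter table in which F1 reaches 450 Hz at 10 ms and 720 Hz at 20 ms.

```python
def piecewise_linear(points, t):
    """Interpolate linearly between (time_ms, value) breakpoints.

    Outside the breakpoint range the nearest end value is held (an assumption).
    """
    if t <= points[0][0]:
        return points[0][1]
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    return points[-1][1]


# Fundamental frequency breakpoints (ms, Hz) from the table above.
F0_POINTS = [(0, 103), (35, 125), (95, 125), (255, 94), (295, 50)]

# F1 for stimulus 0: 250 Hz at onset, 450 Hz at 10 ms, 720 Hz (vowel value) at 20 ms.
F1_STIMULUS_0 = [(0, 250), (10, 450), (20, 720)]


def formant_track(onset_hz, vowel_hz, t, transition_ms=40):
    """F2 or F3 at time t: a 40 ms linear glide from the onset to the vowel frequency."""
    return piecewise_linear([(0, onset_hz), (transition_ms, vowel_hz)], t)


print(piecewise_linear(F0_POINTS, 17.5))     # 114.0
print(piecewise_linear(F0_POINTS, 175))      # 109.5
print(piecewise_linear(F1_STIMULUS_0, 5))    # 350.0
print(formant_track(900, 1240, 20))          # 1070.0 (F2 of stimulus 0 halfway through its glide)
```

Sampling these functions every 0.1 ms would give one parameter value per sample at the 10 kHz rate mentioned above.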