Acoustic and perceptual study of phonetic integration in Spanish voiceless stops


Speech Communication 27 (1999) 1–18

S. Feijoo *, S. Fernandez, R. Balsa

Departamento de Fisica Aplicada, Facultad de Fisica, Universidad de Santiago, 15706 Santiago, Spain

Received 28 July 1997; received in revised form 26 May 1998; accepted 30 September 1998

Abstract

The relationship between the acoustic content and the perceptual identification of Spanish voiceless stops, /p/, /t/ and /k/, in word initial position, has been studied in two different conditions: (a) isolated plosive noise (C condition); (b) plosive noise plus 51.2 ms of the following vowel (CV condition). The purpose of the study was to assess whether there was a clear correspondence between the perceptual identification made by listeners and the acoustic classification performed using a spectral representation combined with the duration, energy and zero-crossings of the plosive noise. The acoustic classification was represented by a distance profile, formed by the acoustic distances between a given token and the three classes corresponding to the three stops. The acoustic distances were defined as the a posteriori probabilities of membership in each class (APP scores). The perceptual identification was represented by a response profile, formed by the number of listeners' responses assigned to each of the classes for a given token. The correlation between the acoustic and perceptual distances increased from the C condition (overall correlation 0.81) to the CV condition (0.95), indicating that as the signal becomes better defined from the perceptual point of view, the acoustic content also becomes less ambiguous, since both the perceptual and acoustic classifications improve in the CV condition. The best correlations were achieved when the variables obtained in the temporal domain (duration, energy and zero-crossings of the plosive noise) were included in the analysis. © 1999 Elsevier Science B.V. All rights reserved.

* Corresponding author. E-mail: [email protected]

0167-6393/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0167-6393(98)00064-8


Keywords: Perceptual identification; Acoustic analysis; Voiceless stops; Phonetic integration; Acoustic-phonetics; Speech recognition

1. Introduction

One of the main problems in the study of speech is how to characterize and represent the process employed by listeners to perceive and identify the phonetic content of the acoustic waveform. Closely related to that problem is the fact that speech sounds are not organized sequentially and independently from the point of view of perception. The listener is able to decode the complex speech signal and sequentially organize it into a series of meaningful linguistic units: phonemes, syllables, etc. Those units, however, do not seem to be present in the acoustic signal in the same way as in the perceptual representation. That means that each phoneme does not necessarily correspond to a given portion of sound in the utterance. It is a well-known fact that adjacent segments often influence each other in the perception of a given phoneme (Nygaard and Pisoni, 1995).

Stop consonants are a class of phonemes whose acoustic and perceptual study is interesting because of their highly dynamic characteristics and the coarticulatory processes involved. Voiceless stop consonants are formed by a sudden transient, usually called the release burst, plus a friction segment and an aspiration noise, which may or may not be present in the signal. The unvoiced segment formed by the release burst, the friction segment and the aspiration noise, whenever it exists, will henceforth be denoted the plosive noise. After that unvoiced segment comes the onset of voicing and a vocalic transition following the movement of the articulators. Voiceless Spanish stops have three places of articulation: labial (/p/), dental (/t/) and velar (/k/). Their acoustic characteristics have been described in several works (Quilis, 1989; Torres and Iparaguirre, 1996), and are similar to those of French or Dutch stops, which are also unaspirated (Smits et al., 1996; Bonneau et al., 1996).

The burst of labial stops shows a diffuse spectrum, although concentrations of energy can sometimes be seen in some vocalic contexts. The dental burst shows, on average, a concentration of energy roughly between 2500 and 4000 Hz in the context of /i/, while in the /u/ context that prominence is located between 2000 and 3500 Hz. In other contexts, the spectrum is relatively flat. The velar burst shows a spectral prominence close to 1000 Hz in the /o/ and /u/ contexts, and around 3500 Hz in the /i/ context (Fig. 1).

Numerous papers have been devoted to the study of the perception of stop consonants in a vowel environment, from different points of view. Basically, the main issues in stop perception concern the amplitude, durational and spectral characteristics of the burst section, and whether the associated vocalic transition is a necessary or sufficient cue for stop perception. Associated with those issues is the question of whether invariant cues for the determination of place of articulation can be found in the speech signal, or whether those signals exhibit a lack of invariance, context-dependent cues being necessary for identification. The relative amplitude of the release burst in synthetic CV syllables was studied by Ohde and Stevens (1983), who found that the relative amplitude of the burst significantly affected the perception of the place of articulation of voiceless stop consonants. Hedrick and Jesteadt (1996) studied the effects of the relative amplitude of the burst, together with the presentation level and vowel duration, on the perception of voiceless stop consonants by normal and hearing-impaired listeners. Their results suggest that normal listeners weight the relative amplitude and vocalic transitions differently from hearing-impaired subjects. The time-intensity envelope of speech was studied by Van Tasell et al. (1987), who found that envelope features can be efficiently used for stop perception even in the absence of spectral information.


Fig. 1. Spectra of the three stops /p/, /t/ and /k/ in the contexts of /i/ and /u/.
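As an illustration of how such burst spectral prominences can be located automatically (not a procedure used in the paper), a minimal NumPy sketch that finds the dominant peak of a windowed frame inside a frequency band:

```python
import numpy as np

def spectral_peak(frame, fs, f_lo=500.0, f_hi=4500.0):
    """Frequency (Hz) of the largest magnitude-spectrum peak of a
    Hamming-windowed frame, searched inside the band [f_lo, f_hi]."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[band][np.argmax(spectrum[band])]

# A 256-sample burst-like frame dominated by a 3000 Hz component,
# sampled at 10 kHz as in the paper's recordings.
fs = 10000
rng = np.random.default_rng(0)
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 3000 * t) + 0.1 * rng.standard_normal(256)
```

With a 256-point frame at 10 kHz the bin spacing is about 39 Hz, so the located peak is only accurate to that resolution.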

Some authors have also studied the temporal characteristics of voiceless stops, particularly the durations required for the identification of place. It is generally accepted that 20–30 ms of initial stops are enough for stop identification (Tekieli and Cullinan, 1979; Krull, 1990; Bonneau et al., 1996). That does not necessarily mean that shorter portions should exhibit lower scores. Tekieli and Cullinan (1979) determined the minimum initial portion durations required by listeners for the identification of consonant–vowel syllables. They found that the first 10 ms of the CV syllable contained enough information for better than chance correct identification of place of articulation in voiceless stop consonants. They came to the conclusion that the burst is sufficient for the correct identification of stop consonants. This opinion is shared by other authors such as LaRiviere et al. (1975), Winitz et al. (1972), and Cole and Scott (1974a): the vocalic transition is neither a sufficient nor a necessary cue for voiceless stop recognition. More recent works take the opposite view.


For instance, the results of Ohde (1988) support a model of stop consonant perception that includes spectral and time-varying spectral properties as integral components of analysis. A similar view is shared by Dorman et al. (1977), who found trading relationships between the release bursts and vocalic transitions: when the perceptual weight of one increased, the weight of the other declined. Bonneau et al. (1996) found that, although 20–30 ms of the initial CV syllables contained enough cues for a correct identification of the voiceless stops without a priori knowledge of the subsequent vowel, performance was context-dependent. Moreover, they also found that near-perfect identification of stops could only be achieved when all the main cues (burst spectrum, burst duration and onset of vocalic formants) were present simultaneously.

Acoustic analyses have also been extensively carried out on voiceless stop consonants (see, for instance, Crystal and House, 1988; Deng and Braam, 1994; Jongman et al., 1985; Kobatake and Ohtani, 1987; Stevens and Blumstein, 1978; Halle et al., 1957; Blumstein and Stevens, 1979; Kewley-Port, 1982). Perhaps the most interesting works have been devoted to the acoustic classification of the stop consonants, using two basic approaches. The first adopts the invariant theory of Stevens and Blumstein (1978), based on static cues obtained at consonant release (Jongman and Miller, 1991; Torres and Iparaguirre, 1996). The second is based on the dynamic approach of Kewley-Port (1982), which emphasizes the need to use the dynamic information contained in the vocalic transition (Nossair and Zahorian, 1991; Tanaka, 1981; Forrest et al., 1988). A similar approach had already been considered by Searle et al. (1979) in their study of stop consonant discrimination. The correct recognition scores for the voiceless stop consonants in the first approach are generally below 80%, while those of the second approach vary roughly between 80% and 95%. Moreover, the results of Nossair and Zahorian (1991) emphasize the need for dynamic spectral transition cues, which they found to be invariant for place of articulation in initial voiceless stops.

The purpose of this paper is to study the phonetic integration between the static and dynamic

cues to place of articulation in word initial Spanish stop consonants from both an acoustic and a perceptual point of view, and to assess the degree of correlation between the acoustic representation and the perceptual identification. There is no clear definition of phonetic integration. From a perceptual point of view, two segments corresponding to acoustically different components of the signal are integrated when they both contribute to the perception of a given phonetic category (Repp, 1988). From an acoustical point of view, phonetic integration can be interpreted as a procedure that combines the acoustic characteristics of both segments and evaluates them jointly. Two different perceptual experiments were carried out: first, the plosive noise alone was presented to a set of listeners (henceforth denoted as the C condition); and second, 51.2 ms of the following vowel were added to the noise (henceforth denoted as the CV condition). Those two segments were then submitted to an acoustical analysis, and the signals were acoustically classified in terms of place of articulation. Then, the correlation between the acoustic and perceptual representations was calculated. Our hypothesis is that, if the stop consonant is better defined in the CV condition from a perceptual point of view, then the acoustic representation must reflect that fact, and the correlation between the perceptual and acoustic representations should be higher than in the C condition. If there is no improvement in the perceptual identification for the CV condition, the correlations should also reflect that fact.

2. Method

2.1. Perceptual analysis

Eighteen subjects (9 men and 9 women) served as speakers in the experiments. They were all native Spanish speakers with no known history of speech or hearing disorders, aged between 20 and 40 years. They were asked to utter a series of two-syllable words (CVCV) in citation form, whose first syllable was formed by the combination of a voiceless stop (/p/, /t/ or /k/) with one of the five vowels (/a/, /e/, /i/, /o/ and /u/). The total


number of stimuli was 270 (3 stops × 5 vowels × 18 speakers). All the vowels were stressed in the first syllable. The recording of the signals took place in a normal office in the Faculty of Physics. The distance between the microphone and the lips was adjusted so as to take advantage of the quantization range of the A/D card (DT-2801-A). No further instructions were given to the subjects, except to utter the words in a natural way. After passing through an antialiasing filter with cut-off frequencies between 100 and 4500 Hz, the signals were digitized at a 10 kHz sampling frequency with a precision of 12 bits/sample, and then stored on the computer disk. This sampling frequency was chosen because it has already been used by other authors in the study of stop consonants (see for instance Smits et al., 1996; Torres and Iparaguirre, 1996; Suomi, 1987), and the resulting frequency range of the signals contains the important spectral prominences of the burst. Prior to any further analysis, the signals were automatically normalized with respect to the maximum amplitude value in the signal. This procedure was adopted to keep the level of the stressed vowel more or less constant across the signals, since the intensity of the words varied a great deal among the speakers. We were interested in keeping the vowel/burst intensity ratio as in the original signals; no further normalization of the burst level was carried out. The signals were manually segmented using an interactive sound editing program developed at our laboratory that permits the playback of any portion of the signal, over which the FFT and LPC-smoothed spectrum can be computed and displayed. The segment corresponding to the plosive noise was then selected, extending from the onset of the release burst to the first pulse of vowel onset. Usually it is straightforward to find the point where the first vocalic pulse starts. In case of doubt, the LPC spectrum of the candidate for the first vocalic segment was checked, looking for a steep rise of the second formant, which signals the onset of voicing. The manually selected points were relocated to the closest zero-crossing point, in order to avoid the presence of undesired noises due to a sudden rise in amplitude not directly associated with the stop


itself. The plosive noise was then taken as the first segment for the analysis (C condition). The second segment was formed by the plosive noise plus 51.2 ms of the following vowel (CV condition). Since the final part of the selected vocalic segment corresponds to a region where the signal intensity is quite high, the truncated vowel may introduce a transient sound that could distort the perception of the plosive. In order to avoid that undesired effect, the final part of the vocalic segment was smoothed using a 10 ms cosine-type window. The perceptual tests were carried out over the two segments described above: isolated plosive noise (C condition); and plosive noise plus 51.2 ms of the following vowel (CV condition). The stimuli were presented blocked by condition. Eleven subjects acted as listeners for the experiments. They were all native Spanish speakers between 20 and 37 years old, who participated voluntarily in the experiments after passing an audiometric test. None of the listeners had participated as speakers in the recording of the stimuli. The experiments took place in a quiet office of the Faculty of Physics devoted to these tasks. The segmented signals were presented to the listeners via headphones (Sony MDR-CD5790) at an approximate sound pressure level of 70 dBA, the whole procedure being controlled by a perceptual testing program developed at our laboratory. The program allows the listener to perform a training procedure prior to the perceptual test itself. This training was considered necessary for the C condition, since isolated plosive noises are not commonly heard and may sound like non-speech noises to the listeners. The goal of the training session was to familiarize the listeners with the sounds they were going to hear, not to train them to achieve a very high identification rate; their normal perceptual capabilities were assumed to suffice for the experiment. Listeners usually spent 5–15 min on the training session, after which they decided for themselves when to finish it. During the experiments, listeners could repeat the playback of each stimulus only once, by clicking on the appropriate place in the display. It was mandatory for the listener to select an answer for each stimulus, after which the listener had to click on the appropriate place to listen to the


playback of the following stimulus. Thus, the listener controlled the pace at which the stimuli were presented. The order of presentation was random and different for each listener. After the training procedure, the task of the listeners was to identify the voiceless stop in the stimuli as /p/, /t/ or /k/. A fourth option (another sound) was also included in order to reduce the amount of guessing by the listeners. The purpose and procedures of the perceptual experiments, including the use of the perceptual program, were clearly explained orally to the subjects, and written in a script that the listeners could consult any time they wished. The answers of each listener in each experiment were stored for further calculations.
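The 10 ms cosine-type smoothing of the truncated vowel could be implemented as follows (an illustrative sketch, not the authors' code; the exact window shape is assumed here to be a raised-cosine fall, which the paper does not specify):

```python
import numpy as np

def taper_offset(segment, fs, taper_ms=10.0):
    """Apply a raised-cosine fall to the last taper_ms milliseconds of a
    segment, so an abruptly truncated vowel does not end with a click."""
    n = int(round(fs * taper_ms / 1000.0))
    out = segment.astype(float).copy()
    # Half-cosine ramp from 1 down to 0 over the final n samples.
    ramp = 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, n)))
    out[-n:] *= ramp
    return out

fs = 10000
vowel = np.ones(512)            # stand-in for 51.2 ms of vowel at 10 kHz
smoothed = taper_offset(vowel, fs)
```

At a 10 kHz sampling rate the taper covers the final 100 samples, leaving the rest of the segment untouched.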

2.2. Acoustic parameters extracted from the signals

Since our purpose was to compare the identification made by listeners with the automatic classification performed on the acoustic data, the acoustic variables were extracted from exactly the same portions of the signals employed in the perceptual experiments. Thus the plosive noise (C condition) was represented by the same segment of signal used in the perceptual experiments, whose length was different for each stimulus, since the durations of the plosive noises varied from one signal to another. The vocalic portion was represented by two non-overlapping 25.6 ms long segments that corresponded to the 51.2 ms long segment of the vowel attached to the plosive noise in the CV condition. In this way, the C condition was represented by one segment of variable duration, and the CV condition by three segments, two of them of fixed duration: the plosive noise plus the two non-overlapping 25.6 ms segments corresponding to the vowel. The main reason for choosing a variable length segment to characterize the plosive noise was that we wanted to extract the acoustic parameters from exactly the same stimuli used in the perceptual experiments. The maximum plosive noise length encountered in our database was 52 ms; therefore we fixed a maximum window duration of 51.2 ms for the plosive noise analysis. Fig. 2 shows the speech signals corresponding to the word initial combinations of /p/, /t/ and /k/ with the vowel /a/ of the same speaker, together with the LPC-smoothed spectra of the segments considered in our analysis.

Fig. 2. Speech signals corresponding to the plosives /p/, /t/ and /k/, in the context of /a/, uttered by the same speaker. The marked section A corresponds to the plosive noise; B corresponds to the first 25.6 ms long vocalic segment; and C corresponds to the second 25.6 ms long vocalic segment. The LPC spectra corresponding to those segments are also shown.

Other alternatives to this analysis might have been used, such as several overlapping windows of fixed length covering the duration of the plosive noise. The problem with that approach is that, since plosive noise durations vary between a few milliseconds and approximately 50 ms, the use of, for instance, a 25.6 ms long analysis window could result in parts of the vowel being included in the characterization of the plosive, an undesired effect that should be avoided. The following parameters were calculated in each of the three segments.

2.2.1. LPC cepstral coefficients (CEP)

The cepstral coefficients were computed using the LPC approach (correlation method) described in Markel and Gray (1976). They represent a gross characterization of the smooth spectral content of the signal, and have been successfully used in many speech recognition tasks (see, for instance, Nossair and Zahorian, 1991; Zahorian and Jagharghi, 1993; Nocerino et al., 1985; Shikano and Sugiyama, 1982). The order of the analysis was 12 and, prior to the calculation, each segment was Hamming windowed and high-frequency pre-emphasized (transfer function 1 − 0.9z⁻¹). The composition of the vectors for each condition was:

C condition: {Cp_i}, i = 1, …, 12;

CV condition: {Cp_i, Cvf_i, Cvs_i}, i = 1, …, 12;

where Cp_i is the ith cepstral coefficient of the plosive noise, Cvf_i is the ith cepstral coefficient of the first vocalic segment, and Cvs_i is the ith cepstral coefficient of the second vocalic segment. The vector representing the C condition has 12 variables, and the vector representing the CV condition has 36 variables.

2.2.2. Filter outputs (FIL)

This is another way of characterizing the spectral content of the signal. A set of triangular filters was used in order to obtain a representation of the spectrum of the segment. Within each filter, the spectrum is integrated and reduced to a single logarithmic output, in such a way that the spectral dimensionality is drastically reduced. The design of the triangular filters was similar to that employed by Davis and Mermelstein (1980): the triangular-shaped filters are linearly spaced every 100 Hz up to 1000 Hz, and then logarithmically spaced up to the Nyquist frequency. As a result, a total of 20 filters were obtained for spectral characterization. This alternative to the LPC spectral representation is commonly employed for acoustic


classification of speech (Suomi, 1987; Davis and Mermelstein, 1980). The vectors for each condition were:

C condition: {Fp_i}, i = 1, …, 20;

CV condition: {Fp_i, Fvf_i, Fvs_i}, i = 1, …, 20;

where Fp_i is the output of the ith filter for the plosive noise, Fvf_i is the output of the ith filter for the first vocalic segment, and Fvs_i is the output of the ith filter for the second vocalic segment.

2.2.3. Energy, duration and zero-crossing rate (LOG)

These three simple parameters were included in our study since they have proved valuable for plosive noise characterization and acoustic classification of the Spanish voiceless stop consonants (Feijoo et al., 1995; Torres and Iparaguirre, 1996). Moreover, Krull (1990) also demonstrated that the highest correlation between predicted and observed percent confusions for voiced stops was obtained with a formant-based model in combination with burst-length data. Energy and zero-crossings were calculated without normalizing them with respect to the number of samples in the segment. Obviously, the number of samples in the plosive segment varies for each signal, while the segments of the vocalic portion have a fixed duration of 25.6 ms each. It could be argued that energy and zero-crossings should have been normalized with respect to the duration of the plosive segment, since this segment is of variable duration. However, the perceived intensity of the plosive noise depends not only on the amplitude of the samples but also on the duration of the segment, longer segments being perceived as louder than shorter segments of similar amplitude. In the case of zero-crossings, it was found in previous experiments that normalizing with respect to the plosive noise duration lowered their discriminatory ability. The logarithmic values of energy, duration and zero-crossings were used, since they have proved superior to non-logarithmic values in voiceless stop classification. Their importance is restricted to the plosive noise, since the acoustic characteristics that define the vocalic portion are not well represented by such a gross characterization.
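Under the conventions above, the three temporal-domain measures might be computed as follows (a sketch, not the authors' code; natural logarithms and a sign-change count for zero-crossings are assumed, since the paper does not specify them):

```python
import numpy as np

def log_dez(segment, fs):
    """Log duration (s), log energy and log zero-crossing count of a segment.
    Energy and zero-crossings are NOT normalized by segment length,
    matching the choice described in the text. Assumes at least one
    zero-crossing is present."""
    duration = len(segment) / fs
    energy = float(np.sum(segment ** 2))
    # Count sign changes between consecutive samples.
    zc = int(np.sum(np.signbit(segment[1:]) != np.signbit(segment[:-1])))
    return np.log(duration), np.log(energy), np.log(zc)

fs = 10000
t = np.arange(300) / fs                 # a 30 ms segment at 10 kHz
seg = np.sin(2 * np.pi * 1000 * t)
D, E, Z = log_dez(seg, fs)
```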

The vectors for each condition were:

C condition: {D_p, E_p, Z_p};

CV condition: {D_p, E_p, Z_p, E_vf, Z_vf, E_vs, Z_vs};

where D, E and Z stand for, respectively, the logarithmic values of duration, energy and zero-crossings, while the subscripts p, vf and vs denote, respectively, the plosive segment, the first vocalic segment and the second vocalic segment. The duration of the vocalic segments was not included since those segments are of fixed duration, equal to 25.6 ms each.

In order to obtain a first glimpse of the ability of the different acoustic representations to characterize the stops, classification procedures were carried out. The method used for classification was linear discriminant analysis. In linear discriminant analysis, different linear combinations of the variables are formed, each associated with a discriminant function, the number of functions being the number of groups minus one. The first function has the largest ratio of the between-groups to the within-groups sum of squares; the second function is independent of the first and has the second largest ratio. The linear discriminant equation is

S = B_0 + B_1 X_1 + … + B_p X_p,   (1)

where {X_i}, i = 1, …, p, represents the values of the variables obtained for that particular sample. For instance, for the set CEP in the C condition, these variables would be represented by {Cp_i}, i = 1, …, 12. The coefficients {B_i}, i = 1, …, p, of the discriminant function are estimated from a training sample; they are calculated in such a way as to maximize the differences among groups. During the classification of a given sample, the values of all the discriminant functions are calculated for that sample, and each classification group is represented by a centroid computed using each discriminant function. For each sample to be classified, the a posteriori probability of membership in each group is calculated from the distance from that sample to the centroids representing each classification group, using Bayes' theorem:

P(w_i | X) = p(X | w_i) P(w_i) / Σ_{j=1}^{n} p(X | w_j) P(w_j),   (2)


where X represents the acoustic vector of variables of the token to be classified; w_i is the centroid of the ith group; n is the number of classification groups; P(w_i) is the a priori probability of each group; p(X|w_i) is the conditional probability function for the ith group; and P(w_i|X) is the a posteriori probability of X belonging to the ith group, characterized by w_i. Each token is classified as belonging to the group for which the calculated a posteriori probability is largest. In order to take advantage of the maximum number of cases in our corpus, the leave-one-out method was used for classification. In this method, when a token is tested for classification, the discriminant functions and a posteriori probabilities are calculated from the remaining samples, which implies that the computed misclassification rates are close to the true ones (Fukunaga, 1972). In this method, therefore, the set of discriminant functions is different for each token to be classified.

2.3. Comparison between acoustic measurements and the stop classification performed by listeners

In order to compare the acoustic variables with the identification of voiceless stops made by listeners, we first have to define how this comparison is to be performed. Since the perceptual tests involve identification of phonetic classes by the listeners, the perceptual data obtained consist of the responses given by the set of listeners to each individual token. Let us suppose that a given token belongs to a given class (for instance /k/). The responses of the listeners could be grouped, for instance, like this: {1, 3, 7}, where the first component of the vector is the number of listeners that considered the token to be /p/, the second the number of listeners that considered it to be /t/, and the third the number of /k/ responses. That vector is called the response profile, and is computed as the number of responses assigned to each of the three stop classes by the selected listeners. For instance, a response profile like {11, 0, 0} would indicate that the signal is clearly identified as /p/, since all listeners select the same answer. A response profile like {4, 3, 4} would indicate that the signal is ambiguous from a phonetic point of view, since some


listeners identify it as /p/, others as /t/ and others as /k/. Then, each component of the response pro®le can be considered as a kind of inverse distance (perceptual distance) from the token to the corresponding class, since the component with the higher score would be the class for which the phonetic distance from the token would be minimum. The concept of the response pro®les, then, would be intimately related to the notion of a perceptual-phonetic space, in which the input signals are identi®ed as belonging to the class for which the perceptual distance is minimized (Nearey and Assmann, 1986). Once the perceptual distances have been de®ned, there is the question of how to de®ne the acoustic distances. The approach used in this study is the same adopted by Nearey and Assmann (1986) and Assmann et al. (1982). The a posteriori probabilities (APP scores) of membership in each classi®cation group were used to de®ne the inverse acoustic distance from a given token to a reference class. In the classi®cation scheme de®ned above (Linear discriminant), the token is classi®ed in the class for which the APP score is maximized, in a way that is most similar to the way the perceptualphonetic space supposedly operates: a given token is compared with the prototypes of di€erent classes of sounds and is identi®ed as belonging to the class for which the perceptual distance is minimum. In a similar way, a given token would be acoustically represented in the acoustic-phonetic space by a vector formed by the three APP scores obtained for that token, each of the components corresponding to one of the classes (distance pro®le). For instance, a token could be described by a distance pro®le like {0.94, 0.02, 0.04}. The value 0.94 would correspond to the APP score obtained using Eq. 
(2) above, with w_i representing the centroid of the /p/ group obtained from the training samples for a particular set of coefficients; 0.02 would be the APP score for the /t/ group; and 0.04 would be the APP score for the /k/ group. That vector indicates that the token is unambiguously classified as /p/. A distance profile like {0.43, 0.45, 0.12} would indicate that the signal is classified as /t/, although its acoustic content is highly ambiguous. The components of the distance profile thus represent a kind of inverse acoustic


S. Feijoo et al. / Speech Communication 27 (1999) 1–18

distance from the token to each of the classes (acoustic distance). Since each component of both the distance profile and the response profile corresponds to a kind of inverse distance from the token to each of the classes, the comparison between the acoustic classification and the listeners' responses was carried out by computing the correlation coefficients between the corresponding components: each row of the response profile is correlated with the corresponding row of the distance profile. In this way, three correlations are calculated, each corresponding to one of the stops. It should be expected that the stops that are better defined auditorily would also obtain the highest correlations with the acoustic data, since their acoustic content should also be less ambiguous. Actually, in the perceptual experiments listeners had a fourth option ("another sound"). The selection of that answer indicates that the listener is unable to identify the token as a stop. The number of listeners selecting that option could be high enough to cast doubt on the appropriateness of including that response profile in the correlation analysis, since the acoustic classification groups do not include that possibility. Therefore it was decided that those samples for which at least 20% of the listeners selected the option "another sound" would be excluded from the correlation analysis. That procedure ensures that the remaining samples are identified as a stop by at least 80% of the listeners.
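The procedure just described can be sketched numerically. The following illustration (Python with NumPy; all profile values are hypothetical, not the paper's data) builds response profiles and distance profiles for a few tokens, applies the 20% "another sound" exclusion rule, and computes one Pearson correlation per stop class:

```python
import numpy as np

def pearson(x, y):
    """Product-moment correlation between two 1-D arrays."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical data: 4 tokens, 10 listeners each.
# Response profiles: counts for /p/, /t/, /k/ and "another sound".
responses = np.array([
    [9, 1, 0, 0],
    [1, 8, 1, 0],
    [0, 2, 8, 0],
    [3, 3, 1, 3],   # 30% "another sound" -> excluded
])
# Distance profiles: APP scores for /p/, /t/, /k/ (rows sum to 1).
app = np.array([
    [0.94, 0.02, 0.04],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.40, 0.35, 0.25],
])

n_listeners = responses.sum(axis=1)
keep = responses[:, 3] / n_listeners < 0.20   # 20% exclusion rule
r, d = responses[keep, :3], app[keep]

# One correlation per stop class, over the retained tokens.
corr = {stop: pearson(r[:, j], d[:, j])
        for j, stop in enumerate(["/p/", "/t/", "/k/"])}
print(corr)
```

With these made-up profiles the perceptual and acoustic distances agree closely, so all three per-class correlations come out high.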

3. Results

3.1. Perceptual tests

Prior to any analysis of the perceptual data, a statistical procedure was carried out on the listeners' responses in order to select a group of listeners whose correct identification scores lie within the interval mean ± one standard deviation of the sample. This procedure was intended to avoid including listeners with extremely low or high correct scores in the final results. For that purpose, the average value and standard deviation of the correct identification scores of all listeners were calculated. If the correct score of a listener fell within the interval mean ± one standard deviation, the data of that listener were selected for further analysis. For both the C and CV conditions, all listeners were within that interval.

Table 1 shows the confusion matrices, pooled over listeners, for the C and CV conditions. Average identification scores were 68.70% for the C condition and 94.37% for the CV condition, a clear improvement in the CV condition. In both experiments, /k/ is the best recognized stop, followed by /p/, while /t/ attains the lowest score. For each condition, the listeners' responses were submitted to a one-way analysis of variance and the Scheffé test, in order to find the groupings of plosives. In the C condition, the Scheffé test found two subgroups: Subgroup 1 (/t/); Subgroup 2 (/p/, /k/). Thus /p/ and /k/ are equally well recognized in the C condition, and better recognized than /t/. In the CV condition there are two subgroups: Subgroup 1 (/t/, /p/); Subgroup 2 (/p/, /k/). Thus, in the CV condition, the three stops are almost equally well recognized by listeners, although /k/ is slightly better recognized than /t/. The option "another sound" was selected in only 0.1% of the responses for the /t/ tokens and 0.1% of the responses for the /k/ tokens. In Table 2 the listeners' responses are shown separately by stop and vowel context for both conditions. For each condition, the responses of

Table 1
Confusion matrices obtained from the listeners' responses in the auditory identification of the isolated plosive noise (C condition), and the plosive noise plus 51.2 ms of the following vowel (CV condition)

            C condition                      CV condition
         /p/    /t/    /k/   other        /p/    /t/    /k/   other
/p/     70.7   21.3    8.0    0.0        94.4    4.0    1.6    0.0
/t/     20.9   61.3   17.7    0.1         4.1   92.0    3.9    0.0
/k/      3.7   22.1   74.1    0.1         0.0    3.3   96.7    0.0


Table 2
Correct percent of listeners' responses grouped by stop and vowel combination, shown separately for men and women, and for both sexes together

               Men                     Women                    Both
         /p/    /t/    /k/       /p/    /t/    /k/       /p/    /t/    /k/
C
/a/     87.8   48.9   63.3      77.8   60.0   60.0      82.8   54.4   61.7
/e/     80.0   64.4   65.6      61.1   72.2   53.3      70.6   68.3   59.4
/i/     63.3   58.9   54.4      73.3   51.1   57.8      68.3   55.0   56.1
/o/     84.4   67.8   95.6      71.1   75.6   95.6      77.8   71.7   95.6
/u/     54.4   43.3   98.9      53.3   71.1   96.7      53.9   57.2   97.8

CV
/a/    100.0   83.3   93.3      96.7   86.7   95.6      98.3   85.0   94.4
/e/     94.4   88.9   97.8      92.2   94.4   88.9      93.3   91.7   93.3
/i/     97.8  100.0   92.2      75.6   91.1   98.9      86.7   95.6   95.6
/o/    100.0   93.3  100.0      95.6   90.0  100.0      97.8   91.7  100.0
/u/     96.7   97.8  100.0      95.6   94.4  100.0      96.1   96.1  100.0

Top: C condition; Bottom: CV condition.
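The "Both" columns in Table 2 pool the two speaker groups; assuming the balanced design implied by the paper (equal numbers of male and female speakers per cell), they are simply the average of the men's and women's scores. A minimal check with the /p/ column of the C condition:

```python
import numpy as np

# Table 2, C condition, /p/ column: rows /a/, /e/, /i/, /o/, /u/.
men   = np.array([87.8, 80.0, 63.3, 84.4, 54.4])
women = np.array([77.8, 61.1, 73.3, 71.1, 53.3])

# Pooled score over two equally sized groups.
both = (men + women) / 2
print(np.round(both, 1))   # compare with the "Both" /p/ column of Table 2
```

Up to rounding, this reproduces the published "Both" values (82.8, 70.6, 68.3, 77.8, 53.9).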

the listeners were submitted to a three-way analysis of variance with factors plosive × vowel × sex of the speaker. The results can be seen in Table 3. In the C condition, there are significant main effects of the stop and the vowel, plus a significant plosive × vowel interaction (p < 0.0005) and a significant plosive × sex interaction (p < 0.05). In the CV condition, there is only a significant effect of the plosive, plus a significant plosive × vowel interaction (p < 0.05). The data of each plosive in each condition were then analyzed separately using a one-way analysis of variance and the Scheffé test in order to find the groupings of vowels (vowel as independent factor). A significant effect of the vowel appeared for /p/ in the C condition: Subgroup 1 (/u/, /i/, /e/); Subgroup 2 (/i/, /e/, /o/, /a/). Those groupings indicate that /p/ in the context of /o, a/ is better recognized than /p/ in the context of /u/. For /k/ in the C condition, a significant effect of the vowel also showed up: Subgroup 1 (/i/, /e/, /a/); Subgroup 2 (/o/, /u/). Stop /k/ in the C condition is thus better recognized in the context of /o/ and /u/ than in any other context. No groupings of vowels showed up for /t/ in the C condition. A similar analysis for the CV condition showed that the three stops were equally well recognized in the five vocalic contexts (no groupings of vowels for any of the three stops).

The next step was to investigate the effect of the experiment (C vs. CV condition). A one-way analysis of variance (with experiment as the independent factor) was carried out on the listeners' responses.

Table 3
Results obtained in the ANOVA of the listeners' responses, for both conditions (C and CV), with plosive, vowel and sex as the independent factors

                          C                                 CV
Plosive                   F(2,267) = 9.06,  p < 0.0005      F(2,267) = 3.39, p < 0.035
Vowel                     F(4,265) = 8.10,  p < 0.0005      F(4,265) = 2.10, p < 0.081
Sex                       F(1,268) = 0.001, p < 0.977       F(1,268) = 3.32, p < 0.070
Plosive × vowel           F(8,261) = 8.69,  p < 0.0005      F(8,261) = 2.20, p < 0.028
Plosive × sex             F(2,267) = 3.63,  p < 0.028       F(2,267) = 1.94, p < 0.146
Vowel × sex               F(4,265) = 1.04,  p < 0.385       F(4,265) = 1.02, p < 0.397
Plosive × vowel × sex     F(8,261) = 1.15,  p < 0.331       F(8,261) = 1.71, p < 0.096

F denotes Fisher's F statistic, with the degrees of freedom within parentheses; p indicates the level of significance of the test.

The effect of the experiment was significant (F(1,538) = 229.714; p < 0.0001). The listeners' responses were then separated by plosive, and the data were again submitted to a one-way analysis of variance (experiment as factor). For each plosive, the effect of the experiment was again significant (p < 0.0001). Since the data also revealed a significant main effect of vocalic context for /p/ and /k/ in the C condition, the listeners' responses were separated by vocalic context for each stop, and the data were submitted to a one-way analysis of variance (experiment). The effect of the experiment was significant for /k/ in the context of /u/ (F(1,34) = 4.86; p < 0.05), and for /p/ in the context of /i/ (F(1,34) = 7.12; p < 0.05). For the rest of the stop × vowel combinations, the effect of the experiment was significant at level p < 0.001. Thus, including the vowel causes a significant increase in the listeners' identifications for every stop + vowel combination.

Summarizing the results of the perceptual experiments: in the C condition, /k/ is the best recognized stop followed by /p/ and /t/, whereas in the CV condition /k/ is better recognized than /t/. The perceptual identification of the isolated plosive noise is affected by vocalic context for /p/ and /k/, while in the CV condition each of the three stops is equally well recognized in any vocalic context. Adding the vowel to the plosive noise caused a significant increment in correct responses for the three stops in every vocalic context, the effect being less marked for /k/ in the context of /u/, which was one of the combinations best recognized by the listeners in the C condition, and for /p/ in the context of /i/, due especially to the poor improvement in the identification of women's /p/ in the context of /i/ in the CV condition with respect to the C condition (see Table 2).
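The one-way analyses of variance used throughout this section reduce to the ratio of between-group to within-group mean squares. A self-contained sketch (hypothetical score samples, not the reported data) of how such an F statistic is computed:

```python
import numpy as np

def one_way_anova_F(*groups):
    """F statistic for a one-way ANOVA: between / within mean squares."""
    groups = [np.asarray(g, float) for g in groups]
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df1, df2 = k - 1, n - k
    return (ss_between / df1) / (ss_within / df2), (df1, df2)

# Hypothetical correct-score samples for the C and CV conditions.
c_scores  = np.array([68.0, 70.0, 65.0, 72.0, 66.0])
cv_scores = np.array([93.0, 95.0, 96.0, 92.0, 94.0])

F, dof = one_way_anova_F(c_scores, cv_scores)
print(F, dof)   # a large F signals a significant effect of the experiment
```

The p-values reported in the paper would then be obtained from the F distribution with the corresponding degrees of freedom.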

3.2. Acoustic analysis and classification

Table 4 shows the mean duration of the burst for the stops in every vocalic context. It can be seen that the duration of /k/ bursts is well separated from the duration of /t/ bursts, while there was a certain degree of overlap between the durations of /t/ and /p/ bursts, especially for women. An analysis of variance, with duration as the dependent variable and stop, vowel and sex as factors, revealed that the durations of the three stops differ (F(2,267) = 157.70; p < 0.0005): /p/ bursts are shorter than /t/ bursts, and /t/ bursts are shorter than /k/ bursts, a result similar to that obtained in other languages (Winitz et al., 1972; Smits et al., 1996). It was also found that women's bursts were shorter than men's (F(1,268) = 6.82; p < 0.05). The vocalic context also seems to play a certain role in the duration of the bursts (F(4,265) = 7.136; p < 0.005). A stop × vowel interaction also showed up (F(8,261) = 2.23; p < 0.05). The Scheffé test revealed that /p/ bursts were longer in the context of /u/, and that women's /t/ bursts were longer in the context of /i/. Nevertheless, the effect of the plosive was stronger than any other effect.

The parameters described in the preceding section were used as inputs for a statistical classification procedure employing the linear discriminant method described in Section 2.2. Apart from the sets CEP, FIL and LOG, the combinations of CEP and FIL with the parameters of the LOG set were also considered (CEP + LOG, FIL + LOG). The reason is that both CEP and FIL offer information about the spectral characteristics of the segments, while in the LOG set

Table 4
Mean durations of the bursts in every stop + vowel combination, expressed in ms, shown separately for men and women, and for both sexes together

               Men                     Women                    Both
         /p/    /t/    /k/       /p/    /t/    /k/       /p/    /t/    /k/
/a/      9.3   15.5   24.5       8.7   10.5   22.0       9.0   13.0   23.3
/e/     10.5   14.8   29.1       8.2   12.4   24.7       9.3   13.6   26.9
/i/     10.5   18.9   27.4      10.9   19.9   31.1      10.7   19.4   29.3
/o/     12.6   13.1   27.7      11.7   11.2   27.7      12.1   12.1   27.7
/u/     17.3   19.3   31.2      16.4   13.5   24.2      16.9   16.4   28.0


other aspects of the temporal domain of the signal are included. For the C condition, the parameters calculated on the plosive noise were used to build the corresponding sets. Using the leave-one-out method, tokens were classified with each of the five sets as belonging to one of the three groups, /p/, /t/ or /k/. For the CV condition, the parameters calculated on the two vocalic segments were added to the parameters of the plosive noise, as described in Section 2.2. Fig. 3 shows the overall correct identification percents obtained by the five sets of parameters in both the C and CV conditions. As was evident in previous works (Feijoo et al., 1995), the log-transformed values of energy, duration and zero-crossings (LOG set) offer a good characterization of the plosive noise in terms of stop classification. When added to a complete spectral representation, such as the filter band spectra (FIL + LOG set) or the cepstral coefficients (CEP + LOG set), the results improve with respect to those of either group of parameters alone. Adding the information contained in the vocalic portion contributed a higher recognition score for all parameters, this increase being lower for the log-transformed values of energy, duration and zero-crossings, since they offer a very poor description of the vocalic portion. Particularly striking is the performance attained by the combination of the filter band outputs with the log-transformed values of energy, duration and zero-crossings (97.04%). The confusion matrices for that particular combination in both conditions, C and CV, can be seen in Table 5.

Fig. 3. Correct classification percents obtained in the classification of the stimuli by the five sets of parameters, for both the C and CV conditions. CEP corresponds to the cepstral coefficients, FIL to the filter outputs, and LOG to the log values of energy, duration and zero-crossings.

Data corresponding to the APP scores obtained in the classification of the five sets of parameters considered above, for both the C and CV conditions, were submitted to a one-way analysis of variance in order to assess the effect of the experiment on the acoustic classification. As expected, the effect of the experiment was significant for cepstral coefficients (F(1,538) = 79.92; p < 0.0001); for filter band outputs (F(1,538) = 57.52; p < 0.0001); and for the combination of the log values of energy, duration and zero-crossings with both cepstral (F(1,538) = 62.28; p < 0.0001) and filter outputs (F(1,538) = 58.05; p < 0.0001).

The next step consisted in the analysis of the acoustic data of the plosive noise. Data from the five selected sets of parameters corresponding to the plosive noise were submitted to a multivariate analysis of variance (Wilks' Lambda statistic), with the acoustic parameters as the dependent variables and plosive, vowel and sex as the independent factors. Table 6 shows the outcome of the analysis. All the main effects (plosive, vowel and sex) were significant for all the sets of parameters, as were most of the interactions. The only interactions that turned out to be non-significant corresponded to the plosive × vowel × sex interaction and vowel

Table 5
Confusion matrices obtained by the acoustic representation formed by the combination of filter outputs and the log energy, duration and zero-crossings (FIL + LOG), for both conditions

            C condition               CV condition
         /p/    /t/    /k/         /p/    /t/    /k/
/p/     91.1    6.7    2.2        98.9    0.0    1.1
/t/      5.6   88.9    5.6         3.3   94.4    2.2
/k/      1.1   12.2   86.7         0.0    2.2   97.8
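The leave-one-out classification that produced Table 5 can be sketched as follows. This is an illustration only: it uses synthetic two-dimensional features and a minimum-distance (nearest-centroid) rule as a simple stand-in for the paper's linear discriminant applied to the actual acoustic parameter sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the acoustic parameter vectors of the three stops.
labels = np.repeat([0, 1, 2], 30)                       # /p/, /t/, /k/
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = centers[labels] + rng.normal(scale=0.8, size=(90, 2))

def loo_nearest_centroid(X, y, n_classes=3):
    """Leave-one-out classification by distance to class centroids."""
    conf = np.zeros((n_classes, n_classes))
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                   # hold out token i
        cents = np.stack([X[mask & (y == c)].mean(axis=0)
                          for c in range(n_classes)])
        pred = np.argmin(((X[i] - cents) ** 2).sum(axis=1))
        conf[y[i], pred] += 1
    # Express each row as percentages, as in Table 5.
    return 100.0 * conf / conf.sum(axis=1, keepdims=True)

conf = loo_nearest_centroid(X, labels)
print(np.round(conf, 1))   # rows: true stop; columns: assigned stop
```

The paper's method additionally converts the discriminant scores into a posteriori probabilities (the APP scores), which this simplified sketch omits.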


Table 6
Results of the MANOVA (independent factors plosive, vowel and sex) obtained for the five sets of parameters, in the C condition

                        CEP                    FIL                    LOG                    CEP + LOG              FIL + LOG
Plosive                 F = 24.69, p < 0.0005  F = 17.65, p < 0.0005  F = 143.75, p < 0.0005 F = 41.61, p < 0.0005  F = 28.85, p < 0.0005
Vowel                   F = 13.05, p < 0.0005  F = 7.47,  p < 0.0005  F = 25.08,  p < 0.0005 F = 14.04, p < 0.0005  F = 8.50,  p < 0.0005
Sex                     F = 5.91,  p < 0.0005  F = 5.05,  p < 0.0005  F = 16.77,  p < 0.0005 F = 7.97,  p < 0.0005  F = 5.77,  p < 0.0005
Plosive × vowel         F = 9.76,  p < 0.0005  F = 4.59,  p < 0.0005  F = 9.41,   p < 0.0005 F = 8.49,  p < 0.0005  F = 4.49,  p < 0.0005
Plosive × sex           F = 5.97,  p < 0.0005  F = 1.881, p < 0.002   F = 1.24,   p < 0.283  F = 5.08,  p < 0.0005  F = 1.85,  p < 0.001
Vowel × sex             F = 2.48,  p < 0.0005  F = 1.29,  p < 0.050   F = 1.08,   p < 0.373  F = 2.28,  p < 0.0005  F = 1.21,  p < 0.094
Plosive × vowel × sex   F = 2.35,  p < 0.0005  F = 1.02,  p < 0.425   F = 0.64,   p < 0.911  F = 2.01,  p < 0.0005  F = 0.99,  p < 0.512

F represents the approximation to Wilks' Lambda, and p indicates the level of significance.
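Wilks' Lambda, the statistic underlying Table 6, compares within-group scatter to total scatter: values near 0 indicate well-separated groups, values near 1 indicate no group effect. A minimal sketch on synthetic data (not the paper's measurements):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-variable data for three groups (e.g., the three stops).
groups = [rng.normal(loc=m, scale=1.0, size=(30, 2))
          for m in ([0, 0], [2, 0], [0, 2])]
X = np.vstack(groups)

# Within-group scatter W (sum over groups) and total scatter T.
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
Xc = X - X.mean(axis=0)
T = Xc.T @ Xc

# Wilks' Lambda: ratio of generalized variances.
wilks_lambda = np.linalg.det(W) / np.linalg.det(T)
print(wilks_lambda)   # near 0: groups well separated; near 1: no effect
```

The F values in Table 6 are the standard approximate transformation of this Lambda to an F distribution, which statistical packages carry out automatically.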

× sex interaction (for the filter outputs, the log values of energy, duration and zero-crossings, and the combination of both), and the plosive × sex interaction (for the log energy, duration and zero-crossings). The results thus show that the acoustic information contained in the plosive noise is affected by diverse factors such as vocalic context and the sex of the speaker, although the influence of the plosive is greater than any other influence, since the plosive noise is best defined in terms of the stop consonant.

3.3. Correlation between acoustic and perceptual data

The comparison between the perceptual and acoustic distances was performed separately for each condition (C and CV). The comparison was carried out by computing the product-moment Pearson correlation between the perceptual and acoustic distances to a given class. Since none of the tokens received more than 20% "another sound" responses, all the tokens were included in the correlation procedure. Results of the correlation between the APP scores of the different sets of parameters and the components of the response profile for each stop class can be seen in Fig. 4 for the C condition, and in Fig. 5 for the CV condition. The highest correlation coefficients in both conditions are obtained for /p/ and /k/, which are the stops that attain the highest correct classification scores by both listeners and the acoustic

method. It is again surprising that such a simple characterization of the plosive noise as that represented by the log energy, duration and zero-crossings obtained higher correlations than either the cepstral coefficients or the filter outputs. The results of the correlation improve when the vowel is added, and the differences between the stop classes are reduced, as happens with their acoustic or perceptual identification. If the correlations between the distance and response profiles are calculated over the three classes together, the resulting coefficients represent a kind of overall correlation between the perceptual and acoustic data. Results of that correlation can be seen in Fig. 6 for both the C and CV conditions. The highest coincidence between acoustic and perceptual data for the C

Fig. 4. Correlation coefficients obtained between the perceptual and acoustic distances to each class, for the five sets of parameters in the C condition.


Fig. 5. Correlation coefficients obtained between the perceptual and acoustic distances to each class, for the five sets of parameters in the CV condition.

Fig. 6. Correlation coefficients obtained between the response profiles and distance profiles for the five sets of parameters, in both the C and CV conditions.

condition is achieved by the set formed by log energy, duration and zero-crossings (0.81), while for the CV condition, the combination of that set with the filter outputs attains the highest value among the correlation coefficients (0.95).

4. Discussion

In this paper we have studied the perception and acoustic characterization of the Spanish voiceless stops, and the relation between them. For that purpose, two different conditions were chosen: isolated plosive noise (C condition), and plosive noise plus 51.2 ms of the following vowel (CV condition), in five vocalic contexts. The results obtained in the perceptual experiments do not completely agree with those of other authors. In our study, in the C condition, /k/ is


better identified than /p/, and /t/ has the lowest correct recognition score. For instance, Winitz et al. (1972), in their study of English stops, found the opposite: /t/ is the best identified stop, followed by /p/ and /k/. It should be remembered, though, that the stimuli used by Winitz et al. corresponded to aspirated stops, which are longer than the unaspirated stops used in the present study. In their study of unaspirated prevocalic stops of Dutch, Smits et al. (1996) found an average perceptual recognition rate of burst-only stimuli close to that obtained in the present study (73.6%). They found that /k/ is the best recognized stop (91.1%), followed by /p/ (80.0%) and /t/ (49.6%). Bonneau et al. (1996) found that in French, /k/ is the best identified stop, followed by /t/ and /p/. Their average identification results for the burst only were also higher than those encountered in our study, the correct recognition scores being 94% for /k/, 91% for /t/ and 76% for /p/. Bonneau et al. attributed the poorer results for /p/ to the fact that /p/ bursts have very short durations. In our case, listeners reported that the shorter duration and lower amplitude of /p/ bursts sometimes helped them in their identification. Since the amplitude and duration characteristics of /t/ are intermediate between those of /p/ and /k/, that fact could explain, at least to some extent, the poorer performance for /t/. It seemed that listeners used different kinds of acoustic information depending on the stimuli: some were identified mainly by their spectral content, while others were identified mainly by their temporal and intensity characteristics. Our results agree with theirs in that /k/ in the C condition is better identified in the context of /o/ and /u/ than in any other context.
Since it is well known that the burst of /k/ shows the effects of anticipatory lip rounding in the context of /o/ and /u/, it is clear that vowel influence on the plosive noise is not necessarily a drawback for perceptual identification of the stop in the C condition. A possible explanation for that result is that, /k/ being a velar stop and /o/ and /u/ back vowels, the low-frequency spectral characteristics of both sounds reinforce each other, through the coarticulatory effects involved in the production of some stops, to produce a distinct percept.


The results for the CV condition clearly suggest that the vocalic transition definitely helps in the perceptual identification of the voiceless stops in all vocalic contexts, this transition being necessary for the near-perfect identification of the stops. It is interesting to observe that the analysis of variance of the listeners' responses in the CV condition was little affected by factors such as the vocalic context, despite the fact that the vocalic segment provides information about vowel identity. Since the tokens used in our study corresponded to natural Spanish words, there is no conflict between the place of articulation encoded by the burst and the following vocalic transition, so the process of perceptual recognition is not disturbed by the presence of conflicting cues (Whalen, 1984). The perceptual invariance of stops thus seems to be achieved when the contextual information is available. The trend observed in our acoustic analysis is not exactly the same as that of the perceptual experiments: for instance, for the set formed by the filter outputs and the log energy, duration and zero-crossings in the C condition, /p/ is better identified than either /t/ or /k/, with little difference among the three stops, while in the CV condition /p/ is again the best identified stop, but /k/ is better recognized than /t/. The acoustic representation thus seems to be more robust than listeners in the characterization of short unvoiced sounds which, although part of the speech signal, are not commonly heard in isolation from their context. On the other hand, the perceptual system has durational constraints that may severely limit its ability to identify short unvoiced sounds.
As already shown by Nossair and Zahorian (1991), adding the information contained in the vocalic portion helped to improve the correct classification rates for all the sets of parameters, and also reduced the differences in correct classification between the listeners and the acoustic method. It seems, then, that as more significant perceptual information is present in the signal, the acoustic representation needs to be more precise in order to compete with the listeners' abilities. The acoustic analysis also showed that the spectral characteristics and the duration of the plosive noise are influenced by the vocalic context and the sex of the speaker. It is doubtful that these two are the only

factors that influence the plosive noise, since the human voice conveys different kinds of information that may play a relevant role in speech communication (e.g., social, physiological, psychological), although that question was not explicitly treated in our study (Nygaard and Pisoni, 1995). The influence of factors such as vocalic context and the sex of the speaker nevertheless did not prevent the achievement of good acoustic classification rates in the C condition. The analyses of variance for both the spectral representation and the burst duration showed that the effect of the stop was stronger than any other effect. Thus, although the plosive noise shows the influence of many factors, it is still better defined in terms of the stop than in terms of any other factor.

The correlation achieved between the acoustic and perceptual distances confirmed our basic hypothesis. The correlation coefficients were generally quite high, especially for the combination of cepstral coefficients and filter outputs with log energy, duration and zero-crossings in the CV condition. As expected, correlations were lower in the C condition. This fact can be explained by a certain difficulty of the listeners in identifying the isolated plosive noises. That implies that listeners could not distinctly perceive the input stimuli in some cases of the C condition, but in some way judged them as belonging to one of the stop classes considered, probably taking advantage of characteristics such as burst duration, intensity or average spectral content. When presented with more usual stimuli, as in the CV condition, the amount of uncertainty in the signal diminishes, as does the uncertainty in the perceptual identification. The results do not imply that the features used here are the ones employed by listeners in the perceptual identification.
This question has been extensively treated in other studies, especially concerning the importance of formants in stop characterization (see, for instance, Nossair and Zahorian, 1991). The importance of an adequate spectral representation of speech sounds has been repeatedly stressed. For voiceless stop consonants, and especially for burst characterization, it seems that some temporal measures, particularly burst duration, should also be considered. The improvement in the correlation between the observed and predicted responses when the burst intensity, duration and zero-crossings were included confirms that, at least in the present study, these temporal variables played a certain role in the listeners' responses, confirming some of the listeners' impressions. Moreover, those measures obtained the highest correlation with the listeners' responses in the C condition, outperforming the more detailed spectral representations. This points to a different type of auditory processing for plosive noise and vocalic transitions, as was suggested by Cole and Scott (1974b), although that question is beyond the scope of the present paper.

The results presented have implications for the design of speech recognition or acoustic-phonetic decoding systems, since they clearly show that listeners use all the available important information for stop identification, including both the plosive noise and the transitions. An approach of that type had already been considered by Furui (1986a), who showed the improvement in speech recognition when temporal changes in the spectra are incorporated into the analysis. The phonetic integration of that complementary information can be paralleled by an acoustic representation of the type shown in the present paper, where all the important acoustic characteristics of the different parts involved are considered and used. We have not studied whether the formant transitions or the quality of the vocalic part are the factors that help improve the perceptual identification and, as a consequence, the acoustic classification, but have treated the problem from an explicitly statistical point of view. In that sense, it is clear that complete acoustic representations of the adjacent portions of speech benefit each other in the acoustic determination of place of articulation in stop consonants, as they do in the perceptual identification. The reasons for that improvement are not clear.
Furui (1986b) had shown, in experiments with Japanese syllables, that there exist in the speech signal perceptual critical points associated with positions of maximum spectral transition. Some works (Cole and Scott, 1974a; LaRiviere et al., 1975; Smits et al., 1996; Dorman et al., 1977) have shown that eliminating the transitions, or splicing bursts produced in one vocalic context into other contexts, did not cause a significant decrement in stop identification, even with conflicting-cue stimuli, except for some stops in certain vocalic contexts. If, as suggested by Smits, the burst is the strong cue in some contexts while the transitions are the strong cue in others, the improvement in the CV condition could be attributed to the fact that the natural way of perceiving stops is within a syllable, so listeners always expect the presence of a vocalic context, where a clear spectral transition is going to take place. We agree with Suomi (1987) that "... rather than being noise in the signal, the systematic context-dependent variation is a factor that can be used for benefit to increase the amount of phonetic information extractable from the speech signal". The results of the correlation between the acoustic and perceptual distances, and the percent correct scores obtained by the acoustic representation, further support that hypothesis.

Acknowledgements

The authors would like to thank Francisco Cruz for his assistance in the perceptual experiments.

References

Assmann, P.F., Nearey, T.M., Hogan, J.T., 1982. Vowel identification: orthographic, perceptual and acoustic aspects. J. Acoust. Soc. Amer. 71, 975–989.
Blumstein, S.E., Stevens, K.N., 1979. Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. J. Acoust. Soc. Amer. 64, 1001–1016.
Bonneau, A., Djezzar, L., Laprie, Y., 1996. Perception of the place of articulation of French stop bursts. J. Acoust. Soc. Amer. 100, 555–564.
Cole, R.A., Scott, B., 1974a. The phantom in the phoneme: Invariant cues for stop consonants. Perception and Psychophysics 15, 101–107.
Cole, R.A., Scott, B., 1974b. Toward a theory of speech perception. Psychological Review 81, 348–374.
Crystal, T.H., House, A.S., 1988. The duration of American-English stop consonants: An overview. J. Phonetics 16, 285–294.
Davis, S.B., Mermelstein, P., 1980.
Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28, 357–366.


Deng, L., Braam, D., 1994. Context-dependent Markov model structured by locus equations: Applications to phonetic classification. J. Acoust. Soc. Amer. 96, 2008–2025.
Dorman, M.F., Studdert-Kennedy, M., Raphael, L.J., 1977. Stop-consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues. Perception and Psychophysics 22, 109–122.
Feijoo, S., Dominguez, J.A., Viso, M., Balsa, R., 1995. A pattern recognition approach for the identification of non-vocalic stop consonants. In: Proceedings of the Sixth Spanish Symposium on Pattern Recognition and Image Analysis, Granada, Spain, pp. 279–285.
Forrest, K., Weismer, G., Milenkovic, P., Dougall, R.N., 1988. Statistical analysis of word-initial voiceless obstruents: Preliminary data. J. Acoust. Soc. Amer. 84, 115–123.
Fukunaga, K., 1972. Statistical Pattern Recognition. Academic Press, New York.
Furui, S., 1986a. Speaker-independent isolated word recognition using dynamic features of speech spectrum. IEEE Trans. Acoust. Speech Signal Process. 34, 52–59.
Furui, S., 1986b. On the role of spectral transition for speech perception. J. Acoust. Soc. Amer. 80, 1016–1035.
Halle, M., Hughes, G.W., Radley, J.-P.A., 1957. Acoustic properties of stop consonants. J. Acoust. Soc. Amer. 29, 107–116.
Hedrick, M.S., Jesteadt, W., 1996. Effect of relative amplitude, presentation level, and vowel duration on perception of voiceless stop consonants by normal and hearing-impaired listeners. J. Acoust. Soc. Amer. 100, 3398–3407.
Jongman, A., Miller, J.D., 1991. Method for the location of burst-onset spectra in the auditory-perceptual space: A study of place of articulation in voiceless stop consonants. J. Acoust. Soc. Amer. 89, 867–873.
Jongman, A., Blumstein, S., Lahiri, A., 1985. Acoustic properties for dental and alveolar stop consonants: A cross-language study. J. Phonetics 13, 235–251.
Kewley-Port, D., 1982. Measurement of formant transitions in naturally produced stop consonant-vowel syllables. J.
Acoust. Soc. Amer. 72, 379±389. Kobatake, H., Ohtani, S., 1987. Spectral transition dynamics of voiceless stop consonants. J. Acoust. Soc. Amer. 81, 1146± 1151. Krull, D., 1990. Relating acoustic properties to perceptual responses: A study of Swedish voiced stops. J. Acoust. Soc. Amer. 88, 2557±2570. LaRiviere, C., Winitz, H., Herriman, E., 1975. Vocalic transitions in the perception of voiceless initial stops. J. Acoust. Soc. Amer. 57, 470±475. Markel, J.D., Gray, A.H., 1976. Linear Prediction of Speech. Springer, New York. Nearey, T.M., Assmann, P.F., 1986. Modeling the role of inherent spectral change in vowel identi®cation. J. Acoust. Soc. Amer. 80, 1297±1308. Nocerino, N., Soong, F.K., Rabiner, L.R., Klatt, D.H., 1985. Comparative study of several distortion measures for speech recognition. Speech Communication 4, 317±331.

Nossair, Z.B., Zahorian, S.A., 1991. Dynamic spectral shape features as acoustic correlates for initial stop consonants. J. Acoust. Soc. Amer. 89, 2978±2991. Nygaard, L.C., Pisoni, D.B., 1995. Speech perception: New directions in research and theory. In: Miller, J.L., Eimas, P.D. (Eds.), Recent advances in Speech understanding. Academic Press, San Diego. Ohde, R.N., 1988. Revisiting stop-consonant perception for two-formant stimuli. J. Acoust. Soc. Amer. 84, 1551±1555. Ohde, R.N., Stevens, K.N., 1983. E€ect of burst amplitude on the perception of stop consonants place of articulation. J. Acoust. Soc. Amer. 74, 706±714. Quilis, A., 1989. Fonetica Ac ustica de la Lengua Espa~ nola. Gredos, Madrid. Repp, B.H., 1988. Integration and segregation in speech perception. Language and Speech 31, 239±271. Searle, C.L., Jacobson, J.Z., Rayment, S.G., 1979. Stop consonant discrimination based on human audition. J. Acoust. Soc. Amer. 65, 799±809. Shikano, K., Sugiyama, M., 1982. Evaluation of LPC Spectral matching measures for spoken word recognition. Transactions of the IECE J65-D, 535±541. Smits, R., Bosch, L.T., Collier, R., 1996. Evaluation of various sets of acoustic cues for the perception of prevocalic stop consonants. Part I. Perception experiment. J. Acoust. Soc. Amer. 100, 3852±3864. Stevens, K.N., Blumstein, S.E., 1978. Invariant cues for place of articulation in stop consonants. J. Acoust. Soc. Amer. 64, 1358±1368. Suomi, K., 1987. On spectral coarticulation in stop-vowel-stop syllables: Implications for automatic speech recognition. J. Phonetics 15, 85±100. Tanaka, K., 1981. A parametric representation and a clustering method for phoneme recognition ± application to stops in a CV environment. IEEE Trans. Acoust. Speech Signal Process. 29, 1117±1127. Tekieli, M.E., Cullinan, W.L., 1979. The perception of temporally segmented vowels and consonant-vowel syllables. J. Speech Hear. Res. 22, 13±121. Torres, M.I., Iparaguirre, P., 1996. 
Acoustic parameters for place of articulation identi®cation and classi®cation of Spanish unvoiced stops. Speech Communication 18, 369± 379. Van Tasell, D.J., Soli, S.D., Kirby, V.M., Widin, G.P., 1987. Speech waveform envelope cues for consonant recognition. J. Acoust. Soc. Amer. 82, 1152±1161. Whalen, D.H., 1984. Subcategorical phonetic mismatches slow phonetic judgments. Perception and Psychophysics 35, 49±64. Winitz, H., Scheib, M.E., Reeds, J.A., 1972. Identi®cation of stops and vowels for the burst portion of /p, t, k/ isolated from conversational speech. J. Acoust. Soc. Amer. 51, 1309±1317. Zahorian, S.A., Jagharghi, A.J., 1993. Spectral-shape features versus formants as acoustic correlates for vowels. J. Acoust. Soc. Amer. 94, 1966±1982.