Exemplar-theoretic integration of phonetics and phonology: Detecting prominence categories in phonetic space

Journal of Phonetics 77 (2019) 100915

Special Issue: Integrating Phonetics and Phonology, eds. Cangemi & Baumann

Antje Schweitzer, Institute for Natural Language Processing, University of Stuttgart, Germany

Article history: Received 30 March 2018; Received in revised form 26 July 2019; Accepted 26 July 2019

Keywords: Prominence; Intonation; Pitch accents; Exemplar theory; Clustering; Prosodic categories

Abstract: This article explores an exemplar-theoretic approach to the integration of phonetics and phonology in the prosodic domain. From an exemplar-theoretic perspective, prominence categories, here specifically pitch-accented syllables and unaccented syllables, are assumed to correspond to accumulations of similar exemplars in an appropriate perceptual space. It should then be possible, as suggested for instance by Pierrehumbert (2003), to infer the (phonological) prominence categories by clustering speech data in this (phonetic) space, thus modeling the acquisition of prominence categories according to an exemplar-theoretic account. The present article explores this approach on one American English and two German databases. The experiments extend an earlier study (Schweitzer, 2011) by assuming more acoustic-prosodic dimensions, by excluding higher-linguistic or phonological dimensions, and by suggesting a procedure that adjusts the space for clustering by modeling the perceptual relevance of these dimensions relative to each other. The procedure employs linear weights derived from a linear regression model trained to predict categorical distances between prominence categories from phonetic distances, using prosodically labeled speech data. It is shown that clusterings obtained after adjusting the perceptual space in this way exhibit a better cluster-to-category correspondence, comparable to the one found for vowels, and that both the detection of vowel categories and the detection of prominence categories benefit from the perceptual adjustment.

1. Introduction

Within the autosegmental-metrical (AM) framework, recent work has emphasized that intonation research should consider phonological as well as continuous aspects (Cangemi & Grice, 2016; Grice, Ritter, Niemann, & Roettger, 2017). It is not new to investigate continuous parameters in the AM tradition (see, e.g., Arvaniti, Ladd, & Mennen, 1998; Barnes, Veilleux, Brugos, & Shattuck-Hufnagel, 2012; Kügler & Gollrad, 2015; Liberman & Pierrehumbert, 1984; Peters, Hanssen, & Gussenhoven, 2015); however, the investigation of phonetic implementation in terms of continuous parameters has usually focused on the role of these parameters in motivating and implementing the categories, i.e. on systematic variation between categories. In contrast, Grice et al. (2017) found within-category variation of these parameters, and showed that this variation could be related to different linguistic functions. In their experiment, pitch accents of the same category sometimes led to different linguistic interpretations depending on their phonetic implementation, i.e. meaning depended on continuous phonetic parameters and not only on phonological categories.


Similarly, Cangemi and Grice (2016) observed different amounts of variation in phonetic implementation depending on linguistic meaning. They argue for a distributional approach which views phonological categories as clusters in multidimensional phonetic space. This idea is consistent with exemplar theory (e.g. Johnson, 1997; Pierrehumbert, 2003, 2016): exemplar-theoretic work in the area of speech assumes that all instances of speech that listeners perceive are stored and retained in memory. Perceptually similar exemplars are stored close together, and thus phonological categories are expected to form clusters in multidimensional perceptual space, because their exemplars should all be perceptually very similar. Phonological knowledge then consists of the unconsciously acquired "implicit knowledge" (Pierrehumbert, 2016, p. 34) encoded in the stored exemplars of each category. Phonological knowledge thus integrates phonetic perceptual knowledge in a natural way, in the form of each category's probability distribution over perceptual space (Pierrehumbert, 2003).


Exploring the idea that phonological categories form clusters in perceptual space, a previous study (Schweitzer, 2011) conducted experiments on a prosodically annotated German speech corpus. If exemplars of pitch accents form clusters of similar exemplars for each category, then the acquisition of pitch accent categories could be "simulated" by applying clustering algorithms to speech data. In those experiments, however, there was nothing close to a one-to-one correspondence between automatically derived clusters and categories. The best correspondence between categories and clusters was achieved when allowing for 1000–2000 clusters. Then, more than 80% of the accented (or unaccented) syllables within a cluster would correspond to the same prosodic category. However, assuming more than 1000 clusters for just a few pitch accent categories seemed inappropriate: with five to ten pitch accent categories, this would amount to 200 clusters on average per category.

Of course, pitch accent categories may exhibit variation depending on their positional context, analogous to the variation evidenced in allophones in the segmental domain. Indeed, it is well known that context influences pitch accent implementation: pitch accent shape depends on segmental structure, syllable structure, and/or position in phrase, for instance in English (e.g., Silverman & Pierrehumbert, 1990; van Santen & Hirschberg, 1994) and Spanish (e.g., Prieto, van Santen, & Hirschberg, 1995). These effects are not universal: work by Grabe (1998), for instance, suggests that the same factors can condition such variation in two languages, but with different results: both in German and British English, upcoming phrase boundaries affect the phonetic implementation of falls and rises, but in English this results in compressed (i.e. steeper) rises and falls, whereas in German, rises are compressed but falls are truncated. For German, vowel height has also been shown to influence peak height in H*L accents (Jilka & Möbius, 2007), and the earlier study had identified position in word as affecting peak alignment in L*H accents (Schweitzer, 2011). Thus numerous segmental and prosodic factors can govern the implementation of pitch accents, and so different clusters may simply represent context-dependent implementation variants of categories, like positional allophones in the segmental domain. However, I would claim that 200 phonetically distinct positional variants of each category still constitute an unexpectedly high number, given that in the segmental domain, for instance, usually no more than a few allophones are assumed per phoneme (such as, say, an aspirated, an unaspirated, and a glottalized version of an underlyingly unvoiced stop).

The present study extends the earlier experiments, this time using data from two German databases of read speech (Barbisch, Dogil, Möbius, Säuberlich, & Schweitzer, 2007), as well as American English data from the Boston Radio News Corpus (Ostendorf, Price, & Shattuck-Hufnagel, 1995). The difficulty in finding reasonable cluster-to-category correspondences may have been due to several problems, and the present study suggests solutions to these problems. First of all, clustering algorithms quantify similarity by some distance measure, usually by Euclidean distance across all dimensions. However, considering irrelevant dimensions in clustering introduces noise: in the worst case, the meaningful variation in a few relevant dimensions is "buried" under irrelevant variation in many other dimensions.

The present study therefore limits itself to those dimensions that are confirmed to be most relevant perceptually. Second, most similarity measures give equal importance to all dimensions. However, this may not be perceptually adequate: listeners may be more susceptible to small differences in one parameter than in another. Indeed, the perceptual relevance of dimensions such as F0 rise amplitude, F0 peak alignment, or F0 rise or fall steepness relative to each other has not been thoroughly investigated. The present study addresses this issue by exploring a new way of scaling the acoustic-prosodic space so that it becomes perceptually more adequate, and then comparing clustering results on the original vs. the perceptually adjusted data. This is of particular importance here since the acoustic-prosodic space will include a comparably high number of perceptual dimensions, including aspects beyond F0, since it is well known that factors other than F0 are also related to pitch accents (e.g., Bolinger, 1957; Campbell & Beckman, 1997; Kochanski, Grabe, Coleman, & Rosner, 2005; Niebuhr & Pfitzinger, 2010; Okobi, 2006; Turk & White, 1999). Nevertheless, the relative importance of these dimensions is not known beforehand.

The present article thus contributes to the overarching topic of this Special Issue by exploring the perceptual basis of prominence in terms of pitch-accentedness from an exemplar-theoretic perspective. This allows for a natural integration of phonological form, i.e. the pitch accent categories, with phonetic substance, i.e. the phonetic implementation in terms of perceptually motivated phonetic dimensions that constitute the space for storing memory traces of perceived pitch-accented and unaccented syllables.

Regarding the definition of prominence, I assume that prominence is a property of linguistic units which makes them salient in perception relative to other units at the same level (cf. Terken & Hermes, 2000). Similarly, the editors of this Special Issue stated in their Call for Papers that "prominence is a relational property that refers to any unit of speech that somehow 'stands out'". In the present article, I am looking at prominence in terms of pitch accent categories, and thus at the relative prominence of syllables bearing different types of pitch accents over other syllables at sentence level. Thus the definition of prominence assumed here is the presence of a pitch accent. This is consistent not only with a view from which all pitch-accented syllables are prominent, while all unaccented syllables are non-prominent, but also with the idea that different pitch accent categories can lead to different degrees of prominence (e.g. Baumann & Röhr, 2015; Cole et al., 2019, this Special Issue). Please note that while it is certainly uncontroversial that pitch accents, at least in Germanic languages, lend prominence at sentence level (e.g., Bolinger, 1958; Gussenhoven & Rietveld, 1988; Rietveld & Gussenhoven, 1985), pitch-accentedness is not exactly equivalent to prominence in those languages. Pitch-accentedness is taken to be a categorical phenomenon by many scholars (e.g., Bruce, 1977; Ladd, 1996; Pierrehumbert, 1980), while prominence is sometimes assumed to be more gradient (see for instance the discussion of prominence scales by Wagner, Ćwiek, & Samlowski (2019)). However, the approach taken in the present study requires the investigation of prominence categories rather than more gradient levels of prominence, since only the former can by definition be expected to be separable in perceptual space.


In support of this categorical approach, the strong relation between possibly gradient prominence at sentence level and pitch accent in German is corroborated by findings in Baumann and Winter (2018), who investigate phonetic and phonological factors in the perception of prominence by naïve listeners and find that, while a number of acoustic and discrete linguistic factors are related to prominence as perceived by the listeners, pitch accent type and pitch accent position are most predictive of listeners' judgments; similarly, Cole et al. (2019, this Special Issue) confirm a strong relationship between pitch accents and prominence for English, Spanish, and French, in that pitch-accentedness is clearly reflected in prominence ratings of untrained listeners for these languages. Thus, while it may be a simplification to equate prominence with pitch accent, and non-prominence with absence of pitch accent here, it is not an ad hoc one: the relation between prominence and pitch-accentedness is well established.

To conclude this introduction, the aim of this contribution is to investigate whether categories can be detected based on their distribution in perceptual space. To this end, I investigate the substance, i.e. those dimensions that are expected to be relevant in pitch accent implementation. The same approach could be taken to investigate prominence in terms of word stress, i.e. prominence of one syllable relative to other syllables at the word level. Then, presumably, other dimensions would constitute the phonetic substance, viz. those that have been found to be relevant for the perception of word stress.

Before presenting the method and results of the present study, I will first elaborate in more detail how exemplar theory readily integrates form and substance in Section 1.1, and then in Section 1.2 present more details on the Schweitzer (2011) study on detecting clusters of syllables corresponding to prosodic categories, which the present study takes as a starting point. Based on these preliminaries, Section 2 will state the research question addressed in this article. Method and data will be at issue in Section 3. Section 4 will then introduce the new procedure for scaling the perceptual dimensions. Clustering results using this scaling procedure will be discussed in Section 5, followed by an overall discussion in Section 6.

1.1. Exemplar theory and categories

The central idea in exemplar theory as applied to speech (e.g. Goldinger, 1996, 1997, 1998; Johnson, 1997; Lacerda, 1995; Pierrehumbert, 2001, 2003, 2016; Wade, Dogil, Schütze, Walsh, & Möbius, 2010; Walsh, Möbius, Wade, & Schütze, 2010) is that the mental representations of speech categories are not only abstract, symbolic entities; instead, they are accumulations of concrete, previously perceived instances of speech that have been stored in memory by listeners, and these memory traces include considerable phonetic detail in addition to abstract information. Details that are assumed to be stored comprise frequency scale information such as formant location or aspiration noise, but also voice details beyond speaker identity. The latter is supported by memory effects of identical and of perceptually similar voices in memory tests (Goldinger, 1996). Some models, for instance Lacerda's (1995), Johnson's (1997), or Pierrehumbert's (2003, 2016) models, assume further that the exemplars are labeled with category information.


This assumption constitutes the link between the stored phonetic detail (the "substance") and the phonological category (the phonological "form"), and allows for the natural integration of the two alluded to above.

It is not entirely clear yet what the units are when storing exemplars—most models are not very explicit about this.1 However, the models that assume category labels do state what these categories are. For instance, Lacerda (1995) models vowel classification and assumes at least vowel categories as labels. Johnson (1997) states that "the set of category labels includes any classification that may be important to the perceiver, and which was available at the time the exemplar was stored" (p. 147). He mentions name of the speaker and gender as possible category labels, in addition to linguistic categories. Pierrehumbert (2003) exemplifies her model using vowel categories as labels for illustration, whereas her later model (Pierrehumbert, 2016) assumes word categories. Walsh et al. (2010) assume category labels at least at segment, syllable, and word level.

Fig. 1 is a rough sketch of what exemplar clouds could look like, illustrated in a two-dimensional space for simplicity, along with some category labels—in this case, word category labels. I used words as the units for this sketch, to be consistent with Pierrehumbert's more recent publication (Pierrehumbert, 2016); but please note that I have chosen one-syllable words here since in all the experiments below we will be dealing with exemplars of syllables,2 so one could also imagine that the labels are actually syllable category labels. For the sake of tidiness, only a few exemplars in Fig. 1 have been labeled (a few exemplars of "ball" and "bell"), but it can be assumed that all exemplars are at least labeled with their word category. The dots indicate further word category labels. The figure does not specify what the perceptual dimensions are; the exemplar-theoretic approaches mentioned above have used locations of the first formant (F1) in open vs. closed vowels (Lacerda, 1995), of the second formant (F2) in /I/ vs. /e/ (Pierrehumbert, 2001), or of the third formant (F3) in /ɹ/ vs. /ɾ/ (Pierrehumbert, 2016) for demonstrating exemplar-theoretic classification in the one-dimensional case. Johnson (1997) employed several dimensions, namely fundamental frequency (F0), F1, F2, F3, and duration, for classifying vowels. Similarly, Pierrehumbert (2003, p. 183) illustrates exemplar clouds using an F1/F2 plot of American English vowels.

Exemplar-theoretic work has also addressed the prosodic domain. For instance, Calhoun and Schweitzer (2012) have argued that short phrases with specific discourse functions are stored along with their intonation contours. In that study, clustering was used to identify similar intonation contours, and the contours were described by the same parameters as in the present study.

1 Two exceptions are Walsh et al. (2010), who assume that the size of stored units is variable and depends on the frequency of the respective unit, and Wade et al. (2010), who assume that perceived speech is not stored in units at all, but stored as a whole, i.e. in complete longer utterances, with category labels at various levels.
2 This is not because I necessarily want to claim that the syllable is the unit for exemplar storage, but because at least in the AM tradition there is consensus that pitch accents are linked to specific syllables, prosodic labels always refer to specific syllables, and consequently I will use acoustic-prosodic properties of single syllables, or of single syllables relative to neighboring syllables, and the corresponding prosodic label for the clustering experiments below.


Fig. 1. Rough sketch of exemplars in two-dimensional perceptual space. Exemplars are labeled at least with word category information, but possibly with further categories that the listener had access to. Exemplars of the same linguistic category are expected to form clusters in perceptual space, indicated by clouds of differently colored exemplars here.
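To make the sketch in Fig. 1 concrete, the following minimal R example simulates two labeled exemplar clouds in a two-dimensional perceptual space; the dimension names, distribution parameters, and cloud sizes are invented purely for illustration.

```r
# Hypothetical simulation of labeled exemplar clouds as in Fig. 1:
# each row is one stored exemplar (phonetic detail = coordinates,
# plus a word category label).
set.seed(1)
n <- 200
exemplars <- rbind(
  data.frame(dim1 = rnorm(n, mean = -1), dim2 = rnorm(n, mean = 0),   word = "ball"),
  data.frame(dim1 = rnorm(n, mean =  1), dim2 = rnorm(n, mean = 0.5), word = "bell")
)
# The "cloud" for a word is simply the set of its rows; plotting the
# two clouds shows two overlapping but distinct regions.
plot(exemplars$dim1, exemplars$dim2,
     col = ifelse(exemplars$word == "ball", "blue", "red"),
     xlab = "perceptual dimension 1", ylab = "perceptual dimension 2")
```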

Similarly, Schweitzer et al. (2015) provide evidence of exemplar storage of intonation. They describe frequency effects on the phonetic implementation of pitch accent contours, which can be explained by assuming an exemplar-theoretic account of intonation instead of the traditional post-lexical view.

In any case, linguistic categorical knowledge then arises from abstracting over the stored exemplars (e.g. Pierrehumbert, 2001, 2003), or, in other words, phonological form arises from abstracting over phonetic substance. Goldinger exemplifies this kind of abstraction by a metaphor that he attributes to Richard Semon, stating that the blending of many photographs results in a generic image of a face (Goldinger, 1998, p. 251). Similarly, the mass of specific exemplars of a category "blends" into an abstract idea of the properties of that category. Or, coming back to Fig. 1, the many points of a word category, which all have individual specific values, together define a less specified region in perceptual space that corresponds to that word category. This categorical knowledge can then be used both in speech perception and in production: in perception, categorizing new instances is achieved by comparing them to the stored exemplars and their categories (Johnson, 1997; Lacerda, 1995; Pierrehumbert, 2001, 2003); in speech production, production targets are derived from them (Pierrehumbert, 2001, 2003), for instance by random selection from the cloud corresponding to the intended word category.

I will not go into the details here of exactly how exemplar-theoretic perception and production work, as this would be beyond the scope of this article. Here, it is only of interest that exemplar theory does not separate phonetic detail, or substance, on the one hand, and phonological form, or categories, on the other hand—the two are closely related, or maybe indeed, as Grice et al. (2017, p. 105) put it, two sides of the same coin.

1.2. Clustering intonation categories

If phonological knowledge arises by abstracting over clouds of exemplars stored in memory, then speech acquisition is initiated by accumulating those exemplars in memory and by starting to label these exemplars with meaning.

As more and more exemplars are stored, implicit phonological knowledge begins to build up when exemplars associated with the same abstract meaning categories exhibit similar perceptual features, i.e. when they are located in similar regions in perceptual space. In segmental acquisition, for instance, exemplars referring to ball objects would end up in the same region in perceptual space, and very close to exemplars referring to bell objects. This implicitly encodes the phonological identity of the "ball" exemplars as well as the phonological proximity of the "ball" and the "bell" exemplars.

In illustrating this view, Pierrehumbert (2003) mentions results obtained by Kornai (1998), who showed that unsupervised clustering of F1/F2 data for vowels yields clusters that are "extremely close to the mean values for the 10 vowels of American English" (Pierrehumbert, 2003, p. 187). She interprets these results as supporting evidence that the detection of phonetic categories in human speech acquisition may be guided by identifying regions in perceptual space which correspond to peaks in population density. She also cites experiments by Maye and Gerken (2000) and Maye, Werker, and Gerken (2002) in which participants interpreted stimuli in a continuum as belonging to two distinct categories if the stimuli exhibited a bimodal distribution over the continuum (Pierrehumbert, 2003, p. 187). In general, she assumes that "well-defined clusters or peaks in phonetic space support stable categories, and poor peaks do not" (Pierrehumbert, 2003, p. 210).

Following this assumption, an earlier study (Schweitzer, 2011) explored the idea of clustering as a means to simulate the acquisition of speech categories in the prosodic domain. If clusters of F1/F2 vowel data can be shown to correspond to vowel categories, then clustering intonation data in several prosodic dimensions could possibly give insight into the reality of intonation categories. In those experiments, altogether 29 linguistic, phonetic, and phonological (segmental and prosodic) features were extracted for each syllable from a database of read speech which had been manually annotated for prosody and contained six different pitch accent categories. The clustering results were evaluated by comparing the obtained clusters to the manual prosodic labels. It turned out that even though k-means, probably the most widely used clustering algorithm, yielded satisfactory results in terms of cluster-to-category correspondence as quantified by an accuracy-based evaluation measure proposed in Schweitzer (2011), it did so only when allowing for an extremely high number of clusters: in the case of k-means clustering, for instance, the best results were obtained for around 1600 clusters. Clearly, the ratio of the number of clusters (1600) to the number of pitch accent categories in the data (six) is very imbalanced. Thus it was suggested, among other things, (i) that future work should investigate which dimensions are relevant in the perception of prosodic categories, and limit the dimensions in clustering to these relevant dimensions, and (ii) that these dimensions might not contribute with equal weight, i.e. that dimensions might have to be scaled differently in order to model their individual importance to human perception more closely. The present article suggests a procedure that addresses both these problems at the same time.
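As a rough illustration of this kind of evaluation, the sketch below clusters a feature matrix with k-means and computes a simple majority-based cluster-to-category score. Here `feats` and `labels` are hypothetical stand-ins for the syllable features and manual prosodic labels, and the score is a generic purity measure, not necessarily the exact accuracy-based measure of Schweitzer (2011).

```r
# Hypothetical sketch: cluster syllable exemplars with k-means and
# measure how well clusters correspond to manually labeled categories.
cluster_purity <- function(feats, labels, k) {
  cl <- kmeans(feats, centers = k, nstart = 5)$cluster
  # for each cluster, the number of exemplars of its majority category
  majority <- tapply(labels, cl, function(l) max(table(l)))
  sum(majority) / length(labels)  # proportion grouped with their majority
}
# Purity tends to increase with k, which is why a good score at
# k = 1600 is not by itself evidence for well-separated categories:
# cluster_purity(feats, labels, k = 6)
# cluster_purity(feats, labels, k = 1600)
```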


2. Research question

The aim of this article is to explore an exemplar-theoretic perspective on the detection of what I call "prominence categories" in the following, using prosodically annotated data from German and American English databases of read speech. The novelty of the experiments presented here lies in the fact that I employ perceptually motivated weights for modeling the relative importance of a number of potentially relevant acoustic-prosodic dimensions. The experiments are intended to "simulate", in a very simple way, the acquisition of prominence categories by detecting clusters of similar syllable exemplars in this perceptually more adequate space.

Since I do not want to assume that this detection relies on any phonological knowledge, the clustering has to take into account every single syllable, irrespective of whether it is stressed or not, simply because that knowledge would not be available in acquisition. Instead, it would probably be learned in the same way, possibly even jointly with the acquisition of the prominence category. Therefore the term "prominence category" in the remainder of this article will comprise the pitch accent categories assumed for the two languages, plus a "NONE" category for unaccented syllables: the listener would have to figure out which syllables are accented and by which accent type, and which syllables are unaccented, and they would have to start detecting these categories without prior knowledge of the location of pitch accents. In fact, listeners in prosody acquisition would not even know what pitch accents are and what it means for a syllable to be unaccented; instead they would simply (unconsciously) notice that the perceived syllable instances fall into groups based on their properties in the acoustic-prosodic dimensions, thus initiating the prosodic categories. Only then would learners start to notice that these groups share meaning aspects and label them accordingly, for instance maybe relating accumulations of short syllables with flat F0 contours to non-prominence. The present article aims to simulate the instantiation of the prominence categories in this way.

The clusters detected will then be compared to established pitch accent categories. I do not assume that speakers have access to those categories in acquisition; the "correct" category labels are used only for evaluating the plausibility of the detected categories. Similarly, I do not want to assume any phonological knowledge that may not have been acquired at the stage when the prosodic categories start to form; thus I will limit the dimensions for clustering to acoustic dimensions only. However, I assume that phoneme categories are established at this stage already, which allows for normalizing some of the acoustic parameters by phoneme category.

3. Data preparation for clustering pitch accents

3.1. Databases

The clustering experiments were carried out on three databases, viz. the German SWMS (2 h, male speaker) and SWRK (3 h, female speaker) databases, and the part of the Boston Radio News Corpus (BRN, Ostendorf et al., 1995) for which prosodic labels were available (approx. 1 h, 5 speakers).


The SWMS and SWRK databases had been recorded for unit selection speech synthesis in the SmartWeb project (Barbisch et al., 2007; Wahlster, 2004), hence the "SW" in their names. The speakers are professional speakers of Standard German. The utterances represent typical sentences from five different text genres, and they usually consist of one or at most two short sentences, corresponding to just a few prosodic phrases. The utterances were annotated on the segment, syllable, and word level, and prosodically labeled according to GToBI(S) (Mayer, 1995). Prosodic labeling for each utterance was carried out by one of three human labelers, all supervised and instructed by myself, without having the Schweitzer (2011) study, or the present experiments, in mind. The SWMS database amounts to 28,000 syllables and 14,000 words. The SWRK data contain 34,000 syllables and 17,000 words.

The BRN database contains recordings of professional American English speakers, partly as recorded during radio broadcasts, and partly re-recorded in the lab. Prosodic annotation followed the ToBI guidelines for American English (Beckman & Ayers, 1994). Only a portion of the database was labeled prosodically; this part was used in the experiments presented below. It amounts to 23,000 syllables and 14,000 words, from five different speakers (three female, two male).

3.2. Pitch accent inventories

The BRN database provides pitch accent labels according to ToBI (Beckman & Ayers, 1994). Within the database, H* is most frequent, followed by L+H*, downstepped !H*, L+!H*, and H+!H*. L*+H and its downstepped version L*+!H are very infrequent in BRN. A number of syllables have labels that indicate labeler uncertainty; these were excluded from the analyses.

The prosodic labels in the SWMS and SWRK databases are to some extent similar to the ToBI labels in BRN, as GToBI(S) is a German labeling system based on the original ToBI (Beckman & Ayers, 1994) used for BRN. GToBI distinguishes five basic types of pitch accents, L*H, H*L, L*HL, HH*L, and H*M, which are claimed to serve different functions in the domain of discourse interpretation (Mayer, 1995). They can be described as rise, fall, rise-fall, early peak, and stylized contour, respectively. The stylized contour, H*M, is extremely infrequent—it is mostly used in calling out and thus usually does not occur in read speech. In addition to the basic types, GToBI assumes "allotonic" variants of L*H and H*L in pre-nuclear contexts, viz. L* and H*: in these, only the starred tone is realized on the pitch-accented syllable, while the trail tone (annotated as ..H in L*H, and as ..L in H*L) is realized on the syllable immediately preceding the next pitch-accented syllable, or even omitted completely. Even though the linked trail tones are annotated in GToBI(S), they do not have accent status because they do not lend prominence to the syllable that they occur on (therefore no * symbol in their name), and consequently they are not treated as pitch accents here. Thus in the following we will be dealing with L*H, H*L, L*HL, HH*L, as well as the variants L* and H*. Just as in BRN, labeler uncertainties in the SWMS and SWRK databases were indicated by "?" as a diacritic on accent labels; these were discarded for the analyses here.


Similar to ToBI for American English (Beckman & Ayers, 1994), GToBI provides a diacritic "!" to indicate downsteps (i.e., an H* target which is realized significantly lower than a preceding H* target in the same phrase). Mayer (1995) notes, however, that although it is recommended to label downsteps, it is not clear whether the downstepped pitch accents differ in discourse meaning from their non-downstepped counterparts (Mayer, 1995, p. 8). Accordingly, in labeling the two German databases that I will be using for clustering, less attention was paid to consistently labeling downstep. Indeed, downsteps were only labeled in 19 and 27 cases in the SWMS and SWRK databases, respectively, which is why I will not treat downstepped and non-downstepped accents as belonging to different categories here.

Downsteps were consistently labeled in the English data. With !H* and L+!H*, downstep constitutes a special case: although I do not want to deny that downstepped accents, at least in English and possibly also in German (cf. Grice, Baumann, & Jagdfeld, 2009), have different pragmatic functions, I would like to claim that categories involving downstep have to be acquired later than other prosodic categories: in order to perceive that a category is implemented with a downstepped high target, relative to a high target in a preceding category in the same phrase, at least this preceding category as well as possibly intervening phrase boundaries have to be perceived with adult competence. However, in the approach taken here, when clustering the syllable data to detect possible categories, we of course do not yet have access to the "true" prosodic categories, nor to those of the context syllables: it is not yet known which preceding syllable qualifies as the preceding category with the high target. In contrast, H+!H*, an accent in which the downstep is relative to the immediately preceding syllable, is unproblematic. In any case, as a consequence of these considerations, even for English I will treat downstepped !H* as H* and (the very infrequent) L+!H* as L+H* in the following.
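The label preprocessing just described amounts to a simple recoding step; a minimal sketch in R, assuming a hypothetical data frame `syls` with one row per syllable and an `accent` column holding the ToBI/GToBI(S) labels:

```r
# Drop syllables whose accent labels carry the uncertainty diacritic "?"
syls <- syls[!grepl("\\?", syls$accent), ]
# Map downstepped accents onto their non-downstepped counterparts (BRN)
syls$accent[syls$accent == "!H*"]   <- "H*"
syls$accent[syls$accent == "L+!H*"] <- "L+H*"
```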

3.3. Features for clustering

In contrast to the earlier study, I focus on acoustic-prosodic parameters for clustering here, without making use of linguistic or phonological parameters such as which syllables are stressed, or which syllables occurred in function or content words, in order to model acquisition more realistically, as explained above in Section 2. However, I consider acoustic parameters beyond F0 shape and duration here; other recent work on prosody also emphasizes that dimensions beyond F0 need to be taken into account when discovering meaningful elements of prosody (e.g. Niebuhr, 2013; Niebuhr & Ward, 2018; Ward & Gallardo, 2017).

3.3.1. PaIntE

The present paper, as well as the study it extends, employs the PaIntE model (Möhler, 2001; Möhler & Conkie, 1998) to quantify the F0 contour around pitch-accented syllables. PaIntE is short for "Parameterized Intonation Events" and was originally developed for F0 modeling in speech synthesis. The model uses six linguistically motivated parameters to describe the shape of the F0 contour in and around accented syllables. Mathematically, PaIntE employs a function of time, with $f(x)$ giving the F0 value at time $x$. It is defined as follows:

$$f(x) = d - \frac{c_1}{1 + e^{-a_1(b-x)+c}} - \frac{c_2}{1 + e^{-a_2(x-b)+c}} \qquad (1)$$

This function yields a peak shape (Fig. 2), where the first term, the constant $d$, can be interpreted as the peak height parameter. The amplitude of the rise towards the peak is determined by $c_1$, termed rise amplitude in the following, and the peak itself is reached at (syllable-normalized) time $b$; in other words, $b$ can be interpreted as the peak alignment parameter. The amplitude of the fall after the peak is given by parameter $c_2$ (fall amplitude). Finally, the steepness of both movements is captured by parameters $a_1$ (steepness of rise) and $a_2$ (steepness of fall). For determining the PaIntE values of a given syllable, the F0 contour is estimated using ESPS's get_f0 from the Entropic waves+ software package, then smoothed using a median smoother from the Edinburgh Speech Tools (Taylor, Caley, Black, & King, 1999), interpolating across unvoiced regions but not across silences. The PaIntE model then uses an optimization procedure to find those parameter values for which the function contour in a three-syllable window around the syllable of interest is optimally close to the smoothed F0 contour.

Since the aim of the present experiments is to focus on parameters that can be employed in perception to distinguish between pitch-accented and unaccented syllables, and between different pitch accents, I extracted the PaIntE parameters for every syllable in the three databases, irrespective of whether they had been manually labeled as pitch-accented or not. It is of course expected, for instance, that pitch-accented syllables should exhibit higher rise and fall amplitudes than unaccented syllables, that peak alignment in early peak accents is earlier than in other accents, or that the peak height of downstepped accents should be lower than that of non-downstepped accents. In an exemplar-theoretic account of speech acquisition, such generalizations would be implicitly learned by abstracting over clouds of pitch accents of different categories.

Fig. 2. Example PaIntE contour in a window of three syllables around a pitch-accented syllable (σ*). See text for more details.
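For illustration, the PaIntE function of Eq. (1) can be written down directly; the parameter values below are invented, and the constant `cc` (the fixed offset c in the exponents, whose value the text does not specify) is set arbitrarily.

```r
# Sketch of the PaIntE approximation function from Eq. (1)
painte <- function(x, d, c1, c2, a1, a2, b, cc = 2) {
  d - c1 / (1 + exp(-a1 * (b - x) + cc)) -
      c2 / (1 + exp(-a2 * (x - b) + cc))
}
# A peak aligned at syllable-normalized time b = 1.5, i.e. in the middle
# of the second syllable of a three-syllable window (cf. Fig. 2):
x <- seq(0, 3, by = 0.01)
plot(x, painte(x, d = 180, c1 = 60, c2 = 40, a1 = 4, a2 = 4, b = 1.5),
     type = "l", xlab = "syllable-normalized time", ylab = "F0 (Hz)")
```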


3.3.2. Duration features

Categorization experiments in Schweitzer (2011) had shown that normalized nucleus durations were helpful in distinguishing accented from unaccented syllables. Specifically, that study had used phoneme-specific z-scores of the nucleus duration for predicting whether syllables were accented or not, and those of word-final phones for predicting the location of phrase boundaries. Converting absolute scores to z-scores is a common statistical transformation: absolute values are replaced by their deviation from the overall mean, divided by the overall standard deviation. To get phoneme-specific z-scores, means and standard deviations are calculated for each phoneme class separately, and in transforming a particular exemplar to its z-score, the mean and standard deviation of the respective phoneme class are used. Using z-scores, it is possible to model the lengthening or shortening related to prosodic context, because phoneme-specific mean and standard deviation are eliminated. Also, in an exemplar-theoretic account of prosody perception, it is plausible that listeners have access to phoneme-specific duration z-scores—after all, these can in principle be interpreted as the location of an exemplar in terms of duration relative to other exemplars of the same category: in the duration dimension, an exemplar of a specific phoneme with a high duration z-score would be located "to the right" of most other exemplars of that type, whereas an exemplar with a z-score of 0 would be located in the middle of all exemplars of that type. Since prosody is acquired later than segmental aspects, listeners should already have accumulated enough phoneme exemplars to make such generalizations.

A second duration-related feature that I employ for the current study had not been used in the earlier experiments: here, I include the number of voicing periods within a syllable. The rationale for including it is that in order to realize a pitch accent on a syllable, the speaker has to produce voicing for a reasonably long time span. This feature was calculated using a Praat script (Boersma & Weenink, 2017). To this end, I first estimated the pitch range of the respective speaker by extracting F0 values from all speech data of that speaker, using the ESPS program get_f0 from the Entropic waves+ software package, as this does not require the specification of speaker-dependent parameters for the expected minimum and maximum pitch. Then, I used the 5th percentile of these data as the minimum pitch for the speaker, and the 99th percentile as the maximum pitch within the Praat script. Provided that the speech signal corresponding to a syllable was long enough, the script then calculated a Praat point process using these minimum and maximum pitch parameters and retrieved the number of periods in it.
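In R, such phoneme-specific z-scores can be computed in a few lines; `syls`, `phone`, and `dur` are hypothetical names for the syllable table, the nucleus phoneme class, and the raw nucleus duration.

```r
# Phoneme-specific z-scores of nucleus duration (sketch)
z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
syls$dur_z <- ave(syls$dur, syls$phone, FUN = z)
# A z-score of 0 now means "averagely long for this phoneme class";
# positive values indicate lengthening relative to other exemplars of
# the same phoneme, e.g. under accent or before a phrase boundary.
```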

3.3.3. Spectral parameters

It is well known that syllables that carry word stress differ from unstressed syllables in their spectral balance, or spectral tilt, in many languages (e.g., Aronov & Schweitzer, 2016; Crosswhite, 2003; Okobi, 2006; Sluijter & van Heuven, 1997). It should be noted, however, that Campbell and Beckman (1997) have challenged this result at least for American English, relating the effect of spectral tilt in a speech corpus to pitch-accentedness rather than word stress. For the experiments here, it is less important whether spectral tilt is correlated with pitch accent directly, or indirectly through word stress.


In any case, if there are systematic differences between unaccented and pitch-accented syllables in terms of spectral tilt, then this parameter should also play a role in exemplar-theoretic acquisition of prominence categories. Thus I include two measures of spectral tilt: first, the spectral balance as operationalized by Sluijter and van Heuven (1997), specifically the difference between the energy in the frequency band from 0 to 500 Hz and that in the frequency band from 500 to 1000 Hz; and second, spectral tilt in terms of the regression line in a long-term average spectrum. Both measures had been investigated by Aronov and Schweitzer (2016) in a database of German conversational speech, interestingly with opposite findings: the spectral balance approach confirmed the claim that stressed syllables exhibit a flatter spectrum than unstressed syllables, i.e. a more even spectral balance for stressed syllables, while the spectral tilt approach employing the regression line gave higher tilt values for stressed syllables.

For the present experiments, both values were extracted for each syllable by a Praat script. Specifically, I first calculated long-term average spectra for each syllable, using the pitch range parameters estimated as explained above and Praat's default values for all other parameters; I then extracted the slope of the regression line for the range of 100–5000 Hz using Praat's function Report spectral tilt with the default parameters to obtain the spectral tilt, and used the Praat function for retrieving the mean energy values in the two relevant frequency bands to calculate the spectral balance.

3.3.4. Intensity

In addition, I included overall intensity values. Intensity has long been recognized as a parameter related to word stress and pitch accent: Bolinger (1957, p. 176) already mentions that intensity is a frequent correlate of pitch accent, and both Fry (1955) and Lieberman (1960) found that it is one correlate of perceived word stress in American English. More recent work found that intensity is a correlate of word stress only in pitch-accented syllables in American English (Okobi, 2006). The correlation of intensity and pitch accent has also been confirmed for other languages, for instance for British and Irish English (Kochanski et al., 2005), or for German, where Niebuhr and Pfitzinger (2010) found that two types of pitch accent differ with respect to their intensity patterns. For the present experiments, intensities were again determined by a Praat script, using the speaker-specific minimum pitch parameter mentioned above. I included the intensity within the vowel (vowel intensity) as well as the intensity across the whole syllable (syllable intensity) as raw values. For the experiments below, however, I will use the deltas between subsequent syllables rather than the absolute values; this is described in Section 3.5 below.

3.4. Outlier removal

All of the acoustic parameters described above were obtained by automatic procedures. In addition, the segmentation of the databases, albeit manually checked in most cases, was originally based on forced alignment for all three databases. In the case of the BRN database, I derived syllable label files from the phone label files using an automatic syllabification procedure with the segment labels as input.


Thus all three databases, even though they are in general very clean, may occasionally contain erroneous phone labels. These, however, together with the syllable labels, are the basis for deriving the acoustic parameters by Praat scripts. Some further noise is introduced in calculating the acoustic parameters by scripts even in cases with perfect labels. I tried to reduce this kind of noise by a thorough procedure for outlier removal. Outlier removal, as well as normalization and all following analyses described in the remainder of this article, were carried out using R (R Core Team, 2017).

In a first step, I removed cases where the Praat script had not yielded spectral or intensity values due to shortness or lack of voicing, or simply because the syllable did not contain a vowel. This concerned a considerable amount of data (approx. 9–15%, depending on the database). I also removed cases where the PaIntE values indicated rises or falls that reached beyond the approximation window and where the rise or fall amplitude inside the window did not reach a value within 5 Hz of the full amplitude as expressed by the PaIntE parameter.3 This was the case for another roughly 20% of the data points. In general, outliers were removed for each parameter individually by removing all data points where the observed value was more than 1.5 times the interquartile range below the first or above the third quartile. For the peak height parameter this was done separately for each speaker, and for spectral balance, spectral tilt, and the intensity measures, separately for each vowel. Altogether this reduced the number of data points by roughly 5%. Finally, I removed syllables that contained infrequent vowels (defined as cases where there were less than 100 instances of that vowel in the database), labeler uncertainties, and syllables with infrequent pitch accent types (defined as cases where the accent occurred less than 200 times in the database4). Outlier removal significantly reduced the number of data points; however, it ensured that the following analyses only use parameter values that we can be very confident are correct. Table 1 gives an overview of the data points available for each database after each step; the last line indicates the final numbers.
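A minimal sketch of the 1.5 × IQR fence criterion described above, applied per grouping variable (speaker or vowel); `within_fences` returns a logical index of the data points to keep, and the column names are hypothetical.

```r
# Keep only values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (sketch)
within_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x >= q[1] - 1.5 * iqr & x <= q[2] + 1.5 * iqr
}
# e.g. peak height per speaker:
# keep <- ave(syls$peak_height, syls$speaker, FUN = within_fences) == 1
# syls <- syls[keep, ]
```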

Table 1. Approximate numbers of data points left after each step in the outlier removal process, by database.

                            SWMS     SWRK     BRN
  Full data set             41,000   34,000   23,000
  Long enough vowels        35,000   30,000   21,500
  Fall/rise completed       28,500   24,500   17,000
  No outliers               26,500   22,500   16,500
  No labeler uncertainties  26,500   22,500   16,000

3 These cases constitute "degenerate" approximation results, treated as outliers here: inside the approximation window, the optimization procedure guarantees that the PaIntE function is optimally close to the smoothed F0 contour. Outside the approximation window, the two contours may of course be quite different from each other. So if the PaIntE contour reaches its maximum far outside the approximation window, this does not guarantee that the smoothed contour follows the same course and reaches the same maximum inside the approximation window. Therefore the amplitude that PaIntE assumes may overestimate the true amplitude. The origin of this problem is discussed in more detail in an article submitted elsewhere (Schweitzer, Möhler, Dogil, & Möbius, in preparation), and future work on PaIntE will address it further. For the time being, cases that might be problematic can be identified and ignored as suggested here. It should be noted that the problem occurs similarly often for accented and unaccented syllables; thus removing such data points does not significantly affect the overall distribution of accented/unaccented syllables.

4 Given that there are fewer accent categories than vowel categories, it seemed appropriate to assume a higher threshold for accents; also, for the adjustment procedure described below I needed at least 200 instances of each accent category.

3.5. Normalization

All parameters were z-scored to obtain means of 0 and standard deviations of 1. In the case of the vowel-dependent parameters syllable intensity, vowel intensity, spectral balance, and spectral tilt, this was done on a by-vowel basis, as motivated and described for the vowel-specific duration z-scores above (cf. Section 3.3.2). The peak height parameter was first normalized by speaker, by subtracting the speaker's individual mean, then z-scored across all speakers. Finally, instead of using the raw intensity values, I calculated the difference in intensity between each syllable and its preceding syllable, as well as the difference between the syllable and the following syllable, once based on the syllable intensities and once based on the vowel intensities. This yielded four intensity parameters: vowel intensity delta (the difference in vowel intensity between a syllable and its preceding syllable), next vowel intensity delta (the difference in vowel intensity between the next syllable and the current syllable), and syllable intensity delta and next syllable intensity delta (analogously, but using intensities across whole syllables instead of intensities within the vowels).
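The normalization steps can again be sketched in R, reusing the hypothetical `syls` data frame (rows in utterance order) and the `z` helper from above; a full implementation would additionally respect utterance boundaries when computing the deltas.

```r
# By-vowel z-scoring of the vowel-dependent parameters (sketch)
for (p in c("syl_int", "vwl_int", "spec_balance", "spec_tilt"))
  syls[[p]] <- ave(syls[[p]], syls$vowel, FUN = z)
# Intensity deltas relative to the preceding and following syllable
syls$vwl_int_delta     <- syls$vwl_int - c(NA, head(syls$vwl_int, -1))
syls$nxt_vwl_int_delta <- c(tail(syls$vwl_int, -1), NA) - syls$vwl_int
# (analogously for syl_int_delta and nxt_syl_int_delta)
```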

4. Procedure for finding perceptual weights

Two potential problems when clustering data for detecting perceptually relevant categories were addressed in the introduction above: firstly, keeping irrelevant dimensions in clustering introduces noise. Clusters are characterized by small distances among their members. Distances are usually quantified by a distance metric such as the Euclidean distance: if $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are two points in $n$-dimensional space, then their Euclidean distance is calculated by summing the squared distances in all individual dimensions and taking the square root:

$$\mathrm{dist}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

So assuming that, say, the second dimension is not relevant for perception, the term $(x_2 - y_2)^2$, the squared distance between the two points in dimension 2, will add perceptually irrelevant noise to the overall distance. Secondly, all dimensions contribute to the Euclidean distance to the same extent. However, distances in some dimensions may be perceptually more relevant than distances in others, so we might want to factor this into the overall distance. This could be achieved, for instance, by introducing a weight $w_i$ for the distance in each dimension, as in the following equation:

$$\mathrm{dist}_w(x, y) = \sqrt{\sum_{i=1}^{n} w_i (x_i - y_i)^2} \qquad (2)$$

This is, however, equivalent to

$$\mathrm{dist}_w(x, y) = \sqrt{\sum_{i=1}^{n} (x'_i - y'_i)^2} \qquad (3)$$

where $x'_i = w'_i x_i$, $y'_i = w'_i y_i$, and $w'_i = \sqrt{w_i}$. In general, adding dimension-specific weights in calculating the distance as in (2) is conceptually the same as scaling all values in each dimension with an appropriate weight and then taking the usual unweighted Euclidean distance. If the weights used for scaling the dimensions before taking the distance are the square roots of the weights used in the weighted distance, as in Eqs. (2) and (3) above, then the two approaches even yield identical distances.
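A quick numerical check of this equivalence in R, with arbitrary points and weights:

```r
# Eq. (2): weighted Euclidean distance vs. Eq. (3): plain Euclidean
# distance after scaling each dimension by sqrt(w_i)
x <- c(1.0, 2.0, 0.5)
y <- c(0.2, 1.5, 1.5)
w <- c(2.0, 0.1, 1.3)
d_weighted <- sqrt(sum(w * (x - y)^2))
d_scaled   <- sqrt(sum((sqrt(w) * x - sqrt(w) * y)^2))
all.equal(d_weighted, d_scaled)  # TRUE
```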


The question is then: which would be perceptually appropriate weights for scaling the dimensions? To address this problem, we would have to assess for each dimension how much specific differences along this dimension are reflected in perception. Unfortunately, we do not have perceptual data available in the form of gradient perceptual ratings of similarity between instances of the prominence categories. However, we do have prosodic annotations from human annotators, and thus categorical perceptual ratings of similarity: we can be fairly sure that two syllables that have been labeled as belonging to different categories are perceptually more different than two syllables that have been labeled as belonging to the same category. So, re-phrasing the above problem of how much specific differences along a dimension are reflected in perception, we can ask: how big does a difference have to be to cause a change in perceived category as labeled by the human annotators?

The solution that I am proposing is to fit a linear regression model that predicts whether two syllables will be perceived (i.e. labeled by the annotator) as belonging to the same category or not, given the individual distances in all potentially relevant dimensions:

$$\mathrm{dist}_{cat}(x, y) \approx a_0 + \sum_{i=1}^{n} b_i\,(x_i - y_i) \qquad (4)$$

where $\mathrm{dist}_{cat}(x, y)$ is the categorical distance between x and y (i.e. $\mathrm{dist}_{cat} = 1$ if x and y belong to different categories, and 0 if they are from the same category), $a_0$ is the intercept of the model, and $b_i$ are the coefficients of the model.5 It is easy to see that the coefficients obtained from this model can be interpreted as weights for scaling the dimensions: the model approximates the difference in categories as a weighted sum of differences in the individual dimensions, with the coefficients as weights. These weights thus reflect the importance of each dimension: a large coefficient shows that this dimension can be particularly important in distinguishing categories.

The second benefit of the linear regression approach to estimating perceptually motivated weights for clustering is that the model also yields significance values for each dimension. Thus, in addition to the desired weights for scaling the dimensions, we get an assessment of how likely each dimension is to play a role in distinguishing categories. That is, it also solves the first problem stated above: that we want to separate relevant from irrelevant dimensions before clustering.

5 Please note that I am using a simple linear regression model, not a generalized one. In an experiment where one would truly be interested in a model that can predict whether two syllable instances belong to the same category or not, one would use a generalized model, i.e. a model that does not directly predict the distance, but the odds of the distance being 1. This would avoid predicting non-integer values, which could even lie outside the interval [0,1] and would then have to be mapped to 0 or 1. However, for the present purpose, where we are only interested in the relative contribution of each dimension when predicting the distance, but not in the predicted value itself, I prefer the immediate relationship in the simple linear regression model between the overall (categorical) distance and the distances in each dimension.


The next section will describe how the proposed procedure was applied to the three databases before clustering.

5. Clustering experiments

5.1. Estimating the weights

For each database, I first randomly sampled 100 data points from each category. These data were used to estimate the weights for scaling the dimensions as described in Section 4. Sampling the same amount of data for each category ensured that even relatively infrequent categories would be represented in these data.6 Thus parameters that might be helpful in distinguishing relatively infrequent categories would not be neglected. I then built a new data set containing the pairwise distances between all sampled data points, i.e. a set which contained for each pair of data points their categorical distance (0 if same category, 1 if different), and their distances in all the dimensions corresponding to the parameters introduced in Section 3.3 above. Thus, if $n$ was the number of categories considered, the resulting data set consisted of $\frac{1}{2} \cdot 100n \cdot (100n - 1)$ data points. After mapping downstepped accents to their non-downstepped counterparts, the number of categories for all three databases was five, and thus 124,750 distances contributed to estimating the weights in each database.

For estimating the weights, I used the lm function in R (R Core Team, 2017) to fit a linear regression model that predicted the categorical distance of each pair using the distances in all parameters as predictors. I then selected only those parameters for which R indicated significance at p < 0.05, and fit a second model using only these significant predictors. If any of the parameters were no longer significant in the simpler second model, I fitted a third one, again keeping only significant predictors. There was no case in which that third model still contained insignificant predictors. Table 2 shows the resulting coefficients of the final models for all three databases.

Interestingly, the two speakers of the two German databases differ considerably in the parameters they use to encode category differences. The SWMS speaker uses almost all parameters considered in these experiments, while the SWRK speaker uses only three of the six PaIntE parameters, plus spectral balance and the number of voicing periods. It is especially noteworthy here that the largest part of these databases is identical in terms of text, so speaking style or content cannot explain the difference. The five speakers in the BRN database, similar to the SWMS speaker, also used more parameters for encoding the categories—in fact, they used almost all. In order to check whether this range of parameters can be attributed to the fact that the data came from several speakers, I ran the same procedure on the female speaker for whom I had the most data (speaker f2b, with approx. 11,500 syllables before outlier removal). Of the parameters in Table 2, this speaker used all except steepness of rise and next vowel intensity delta, so the diversity of parameters used in the BRN case can only weakly, if at all, be related to the fact that several speakers contributed.

6 The extremely infrequent categories already excluded above were not represented anymore at this stage.
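The core of this procedure can be sketched as follows; `sample_df` is a hypothetical data frame holding the 100 sampled syllables per category, with a `cat` column and one numeric column per acoustic dimension, and absolute per-dimension differences are assumed as the distances.

```r
# Build the pairwise-distance data set and fit the weight model (sketch)
dims  <- setdiff(names(sample_df), "cat")
pairs <- t(combn(nrow(sample_df), 2))    # all 1/2 * m * (m - 1) row pairs
dist_df <- as.data.frame(
  abs(sample_df[pairs[, 1], dims] - sample_df[pairs[, 2], dims]))
dist_df$dist_cat <- as.numeric(
  sample_df$cat[pairs[, 1]] != sample_df$cat[pairs[, 2]])
m <- lm(dist_cat ~ ., data = dist_df)
summary(m)  # coefficients = candidate weights; refit without
            # insignificant predictors as described above
```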


Table 2
Coefficients of the linear regression models for the three databases: estimated values ("Est."), p-values, and the level at which the coefficients were significant ("sig": * p < .05, ** p < .01, *** p < .001). Intercepts are reported for the sake of completeness; only the remaining coefficients will be used as weights in the following. Empty cells indicate that the coefficient was not in the model for that database.

                          SWMS                    SWRK                    BRN
Parameter            Est.     p      sig     Est.    p      sig     Est.     p      sig
Intercept            0.645    0.000  ***     0.701   0.000  ***     0.652    0.000  ***
Steepness of rise    0.008    0.000  ***                            0.004    0.001  **
Steepness of fall    0.005    0.000  ***                            0.008    0.000  ***
Peak alignment       0.034    0.000  ***     0.031   0.000  ***     0.030    0.000  ***
Rise amplitude       0.030    0.000  ***     0.017   0.000  ***     0.024    0.000  ***
Fall amplitude       0.033    0.000  ***     0.021   0.000  ***     0.008    0.000  ***
Peak height          0.005    0.006  **                             0.018    0.000  ***
Syl int delta                                                       0.011    0.000  ***
Vwl int delta        -0.004   0.003  **                             0.005    0.001  **
Nxt vwl int delta    0.009    0.000  ***                            0.006    0.000  ***
Spectral balance     -0.002   0.047  *       0.003   0.018  *       0.005    0.000  ***
Spectral tilt                                                       0.004    0.001  **
Nucleus duration     0.007    0.000  ***                            -0.005   0.000  ***
Voicing periods      0.020    0.000  ***     0.022   0.000  ***     0.017    0.000  ***

Instead we can conclude that speakers seem to employ a variety of parameters, in fact nearly all parameters investigated here, and that additionally, speakers may differ in which parameters they use for systematically encoding category differences.7

Some of the parameters used above are of course expected to be correlated—the deltas derived from syllable intensity and from vowel intensity, for instance. Similarly, it could be suspected that spectral balance, in terms of Sluijter and van Heuven's (1997) definition, and spectral tilt, in terms of the regression line in a long-term average spectrum, are correlated (although the fact that they yielded opposite results in an earlier study (Aronov & Schweitzer, 2016) indicates otherwise). Thus, in order to make sure that multicollinearity did not constitute a problem for fitting the models, I used the usdm R package (Naimi, Hamm, Groen, Skidmore, & Toxopeus, 2014) to calculate the variance inflation factor (VIF) for all coefficients; a check of this kind is sketched below. All VIFs were below 2 in all cases, indicating no problem with multicollinearity. It is possible that the low VIFs are due to the fact that the potentially problematic intensity predictors are not raw values but delta values, and that these deltas are again not used directly; rather, I employ distances between these delta values. Indeed, in the sample data used from the SWMS database, the correlation between vowel intensity delta and syllable intensity delta is 0.65, while that of the pairwise distances between the two is considerably lower (0.42). Similarly, spectral balance and spectral tilt are negatively correlated at −0.38, while the distances for the two parameters are positively correlated at a lower value (0.22). To be conservative, I fitted two other models for the BRN data, as this was the only data set where both intensity parameters were part of the final model. In these models I retained only one of the two factors vowel intensity delta and syllable intensity delta, and compared these to the full model and to each other using ANOVAs. The results indicate that none of the three models provides a significantly worse fit than the others. I thus kept both factors in the model to adhere to the procedure as described above.
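A minimal sketch of this check, assuming the dists data frame and the final params from the sketch above (the intensity column names are again illustrative):

```r
library(usdm)

## variance inflation factors for the predictors of the final model; all were < 2 here
vif(dists[, params])

## raw deltas vs. their pairwise distances (SWMS sample: ~0.65 vs. ~0.42)
cor(samp$vwl_int_delta, samp$syl_int_delta)
cor(dists$vwl_int_delta, dists$syl_int_delta)
```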

7 Interestingly, such individual differences are even expected from an exemplar-theoretic perspective, since the collection of exemplars stored in a speaker’s memory, which is assumed to be the basis for production, is unique for each speaker.

In any case the linear model has identified parameters for which differences in that parameter may be related to differences in category. Thus we would expect to see different distributions of these parameters for each accent category. Figs. 3 through 5 show density plots, by category, for each of the significant parameters from above. These plots indicate the likelihood of specific parameter values: peaks in these plots occur at values that are more likely than the surrounding values. The plots thus show the underlying parameter distribution of each prominence category. It can be seen that for all parameters we find visible differences in the distributions for at least one pair of categories. The differences correspond well to expectations based on knowledge about these categories.

For instance, for the SWMS data in Fig. 3, unaccented syllables (category NONE, dotted line) are most likely to exhibit no rise; thus their distributions for steepness of rise (top left) and rise amplitude (second row, right panel) have peaks at the lowest values for these two parameters. Rising L*H accents (the dot-dashed line with the longer dashes), in contrast, tend to have a steeper rise (the broader peak between −1 and 0 in the top left panel indicates that these values are most likely), greater rise amplitudes (the very broad plateau ranging from 0 to 2 in the right panel in the second row), and a peak alignment that is relatively late or even very late (the two peaks in the left panel in row 2). Similarly, the characterization of H* (solid line) as a less pronounced peak with moderate steepness and moderate amplitudes, aligned in the middle of the syllable, is borne out (first five panels). Beyond these F0-related parameters, there are very subtle differences in terms of vowel intensity delta (left panel in row 4) that seem to indicate that the distribution for H*L accents (dashed line) might be shifted to the right compared to the other categories, i.e. they seem to exhibit subtly greater deltas in intensity than the other categories, including NONE. This unexpected finding is more pronounced in the delta to the next syllable (right panel in row 4). The remaining panels mostly indicate differences between the accent categories and unaccented syllables: while unaccented syllables are mostly characterized by very neutral spectral balance values of almost exactly 0, there is much more variation, with much more extreme values in both directions, for the accent categories. The last two panels show that nucleus duration is shorter in


Fig. 3. Density plots showing the distributions of the parameters identified as important for each prominence category, for the SWMS database. See text for further details.

unaccented syllables (right panel in row 5), and this is also reflected in the fact that unaccented syllables either have no detectable voicing periods (the pronounced left peak in their distribution in the bottom panel) or only few periods (the right peak in that distribution). For the parameters that were identified for the SWRK data, similar observations can be made. In her case, even for unaccented syllables voicing periods could usually be detected, as evident from the very small left peak in the dotted line in the bottom panel. One could speculate that this is the reason why she

made no use of intensity and duration—her more consistent voicing may have allowed her to encode more differences via F0-related parameters. The BRN data, finally, look similar to the SWMS data: unaccented syllables (dotted lines) have the lowest values for steepness of rise and steepness of fall (top panels), and low rise amplitudes (right panel in second row) and fall amplitudes (left panel in row 3). H+!H* accents have the earliest peak alignment (the peak in the dashed line in the left panel in row 2). Again, there are only subtle differences in the intensity deltas,


Fig. 4. Density plots showing the distributions of the parameters identified as important for each prominence category, for the SWRK database. See text for further details.

and for the spectral parameters we find a clear prevalence of 0 values in the case of unaccented syllables, for both spectral balance (right panel in row 5) and spectral tilt (left panel in row 6). Interestingly, in the BRN database many unaccented syllables have no detectable voicing (the sharp left peak in the dotted line in the bottom panel). All in all, Figs. 3–5 should make clear that in an acoustic space whose dimensions correspond to the parameters presented above, there are indeed differences between the prominence categories in terms of where they are located in that space. However, within each individual dimension there is strong overlap, despite the differences discussed above, so it is not clear whether the category differences are sufficiently pronounced to identify clusters in that space that would correspond to the categories. This question will be addressed in the following section.

5.2. Finding clusters

For clustering the data, I next randomly selected another 100 data points for each category, creating a second subset with equal proportions of accent categories for each database. Since I had excluded accent categories with fewer than 200 instances in the database, it is guaranteed that enough data points of each category were left to fill this subset entirely with new data points that had not been used for finding the weights above. I then clustered these data twice: once keeping the original values, and once adjusting the values by multiplying them with the dimension-specific perceptual weights determined above. In both cases, I used only those dimensions that had been identified as perceptually relevant by the procedure described in Section 4 above. Thus the clustering space was 11-dimensional in the case of SWMS, 5-dimensional in the case of SWRK, and 13-dimensional in the case of BRN.

As a first experiment, I clustered the data using k-means clustering as implemented in R (R Core Team, 2017) with k = 5, as we would optimally want to find clusters for the 5 different categories of accents in each case. Please note that the assumption of a fixed number of clusters originates exclusively from the fact that k-means clustering does not decide on the appropriate number of clusters; it only looks for a given number. The assumption of a specific number of clusters is thus a necessary technicality in which the simulation differs from perceptual reality: humans would of course not look for a given number of clusters but would detect the clusters by perceiving particularly dense regions in perceptual space. As a starting point for illustrating the idea of clustering as a means to detect categories, I will assume five clusters for now, since we know that the number of prominence categories in the data is five. In this respect, the clustering has an advantage over human detection, since humans would not even know beforehand what the "correct" number of clusters should be. It will be argued below that the number of clusters that should be expected is actually higher than that.

Fig. 6 shows visual representations of the results for clustering with and without adjusting the dimensions by the weights found above. For illustrating the effects of the weights, I used the same 11 dimensions that were found to be significant in the analysis above, i.e. here the only difference in clustering was whether the weights were used for adjustment or not. Both representations were generated by mapping the 11-dimensional space to the first two discriminant dimensions using the plotcluster function provided by Hennig (2015). Each number in the plots corresponds to one data point. The number indicates the cluster that the data point belongs to, while the color indicates its prominence category.
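The clustering and visualization step can be sketched as follows; newdata (the fresh 100-points-per-category subset, significant dimensions only) and weights carry over from the sketches above, and the exact plotting options are an assumption:

```r
library(fpc)

X_orig <- as.matrix(newdata[, names(weights)])
X_adj  <- sweep(X_orig, 2, weights, `*`)   # scale each dimension by its weight

km_orig <- kmeans(X_orig, centers = 5)
km_adj  <- kmeans(X_adj,  centers = 5)

## project onto the first two discriminant dimensions (cf. Fig. 6)
plotcluster(X_adj, km_adj$cluster)
```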


Fig. 5. Density plots showing the distributions of the parameters identified as important for each prominence category, for the BRN database. See text for further details.

In both plots, the clusters occupy specific regions in the space spanned by the two dimensions, i.e. numbers tend to appear in similar regions. For instance in the upper plot, data

points from cluster 4 tend to be in the lower left region, while those from cluster 1 are more towards the upper left, those from cluster 2 in the upper right region, etc. This is of course


Fig. 6. 2-dimensional projection of the data points in clustering space using the SWMS data before (upper panel) and after (lower panel) adjusting the dimensions by the perceptual weights from Section 4. Numbers indicate the cluster number of the data point, while colors indicate its accent category. Optimally, data points with same colors should be crowded together, and have identical number labels. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

expected, since the clusters arise from grouping together points that were close in the original (upper panel) or in the adjusted (lower panel) 11-dimensional space, and they are still close when projected onto the two discriminant dimensions. However, we hope, first, to get a better separation of the clusters after the adjustment, which is confirmed here: the data points before adjustment seem to be grouped around just one denser region slightly above and to the left of the center of the plot in the upper panel, while there are several regions of higher density after the adjustment. Second, the goal of the

present paper is to show that prominence categories will form clusters in the appropriate perceptual space, so optimally, data points of the same prominence category should belong to the same cluster, or at least they should end up in similar regions. It can be observed in Fig. 6 that, while each category (each color) appears in various areas in the upper panel, the colors are much better separated in the lower panel, confirming that the prominence categories are better separated after adjustment. They also tend to belong to similar clusters in the lower panel, i.e. data points of the same color tend to have the same number: most red data points, for instance, belong to cluster 4, most green data points appear to belong to cluster 3, cyan data points to clusters 1 and 4, etc. There is no perfect correspondence between cluster number and color after adjustment, but a better one than in the upper panel before the adjustment.

To illustrate the cluster-to-category correspondence after adjusting the weights more objectively, Table 3 indicates which prominence categories were found in which cluster after adjustment of the weights. Obviously, cluster 3 is dominated by L*H cases; they are more than twice as frequent as the other categories in that cluster. In cluster 4, H*L is most frequent, while cluster 5 is dominated by L*HL. However, there are always considerable numbers of other categories in each cluster. Also, the majority of cases in clusters 1 and 2 belong to two categories each: H* and NONE in the case of cluster 1, and H* and L*HL in the case of cluster 2. Analyzing the results from a category perspective, most H* cases belong to cluster 1. The correspondence is not perfect, as a considerable number of them occur in cluster 2; however, only few of them are grouped into any of the other clusters. Similarly, H*L accents are mostly in cluster 4, and also often in cluster 1, but rarely in other clusters. L*H is usually in cluster 3, L*HL in cluster 2, and the NONE cases mostly in cluster 1.

To illustrate how well the clusters found on these relatively few data, with only 100 cases of each category, are compatible with new data, I applied the obtained clustering to more data, taking up to another 100 cases of each category, if available, and assigning each of them to the closest of the cluster centers found above. Table 4 shows the result. Obviously, the distribution of categories across clusters is very similar even for new data, confirming that the regions derived from the clusters on one set of data generalize well to new data points.

In order to verify the impression gained from visual inspection of the plots, namely that the adjustment led to better separation, I used the silhouette index (Rousseeuw, 1987), a well-established measure of cluster separability relative to cluster cohesion, as implemented in the cluster package in R (Maechler, Rousseeuw, Struyf, Hubert, & Hornik, 2017). The silhouette index ranges from −1 to 1, with 1 indicating appropriate clustering.

Table 3
Category-to-cluster correspondence for the SWMS data after adjusting the weights.

           1     2     3     4     5
H*        43    35    13     2     7
H*L       33    17     2    45     3
L*H       21    10    56     2    11
L*HL      15    38    22     0    25
NONE      50    19     6    23     2

Table 4
Category-to-cluster correspondence for the SWMS data when classifying further data by assigning them to the cluster that they are closest to.

           1     2     3     4     5
H*        48    30    13     2     7
H*L       30    16     1    46     7
L*H       30     9    53     0     8
L*HL       0     7     1     0     2
NONE      48     8    14    18    12
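Classifying held-out points by their nearest cluster center, as done for Table 4, can be sketched as follows; more_data is a hypothetical data frame holding the additional cases, with km_adj and weights as above:

```r
## assign each new point to the closest k-means center
## (Euclidean distance in the weighted space)
nearest_center <- function(x, centers) {
  d2 <- apply(centers, 1, function(ctr) rowSums(sweep(x, 2, ctr)^2))
  max.col(-d2)   # index of the closest center for each row
}

new_adj  <- sweep(as.matrix(more_data[, names(weights)]), 2, weights, `*`)
assigned <- nearest_center(new_adj, km_adj$centers)
table(more_data$cat, assigned)   # cf. Table 4
```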

Table 5 gives an overview of the results. It can be seen that the silhouette index improves in all three cases when the dimensions are adjusted, though it is far from ideal even then. Regarding the additional claim that the clusters are not only better separated but also a better match to the categories after adjusting the dimensions, I evaluated the clusterings in terms of another well-established measure, the Corrected Rand index (Gordon, 1999), using the implementation by Hennig (2015). Given two different groupings of data points, in our case the grouping found by k-means clustering vs. the grouping into the manually labeled prominence categories, the Rand index indicates how many pairs of elements are in the same group in both groupings, relative to the overall number of pairs, and is thus a measure of the match between the detected clusters and the categories. Values for the Corrected Rand index can range from 0 to 1, with 1 indicating a perfect match. Table 6 lists Corrected Rand indices for prominence categories and clusterings obtained on the original data vs. those obtained on the adjusted data, for each of the three databases; both measures are sketched in code below. The indices are not close to 1, indicating a far from perfect fit in all cases; however, they are consistently higher in all three databases after adjusting the dimensions using the perceptual weights.

The indices calculated above do confirm that the adjusted dimensions seem to make it easier to find reasonable clusters that correspond well to the prominence categories. However, how well is "well"? To give a second measure for the goodness of fit between clusters and categories, one that I find easier to interpret intuitively, an accuracy-based measure was used in Schweitzer (2011).

Table 5
Average silhouette widths for original vs. adjusted dimensions, for clustering 100 new data points from each category with k-means clustering and k = 5. The higher values after adjustment indicate a better separation-to-cohesion ratio for all three databases after adjusting the dimensions using the perceptual weights.

            SWMS     SWRK     BRN
Original    0.110    0.165    0.102
Adjusted    0.225    0.225    0.179

Table 6
Corrected Rand indices for original vs. adjusted dimensions, for clustering 100 new data points from each category with k-means clustering and k = 5. The values indicate a slightly better cluster-to-category correspondence in all three databases after adjusting the dimensions using the perceptual weights.

            SWMS     SWRK     BRN
Original    0.095    0.051    0.065
Adjusted    0.116    0.061    0.067
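A sketch of both evaluation measures, assuming X_adj and km_adj from above and a vector cats of manual prominence labels:

```r
library(cluster)  # silhouette
library(fpc)      # cluster.stats

d <- dist(X_adj)

## average silhouette width (cf. Table 5)
sil <- silhouette(km_adj$cluster, d)
mean(sil[, "sil_width"])

## Corrected Rand index between clusters and manual categories (cf. Table 6)
cluster.stats(d, km_adj$cluster,
              alt.clustering = as.integer(cats))$corrected.rand
```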

In exemplar-theoretic categorization, a listener would compare an incoming exemplar to the stored exemplars, and categorize the exemplar as belonging to the category that corresponds to the dominant label among the most similar stored exemplars. In other words, they would (unconsciously) find the cluster that the new exemplar belongs to, and assign the new exemplar the same label as the other exemplars from that cluster. In the same way, we can categorize each exemplar in the data as belonging to the majority class within its cluster. I suggest using the accuracy of this categorization as an easier-to-interpret measure of the goodness of fit between clusters and categories.

The formal definition of the classification accuracy is as follows. Let $K = \{K_1, K_2, \ldots, K_N\}$ be the clusters and $C = \{C_1, C_2, \ldots, C_M\}$ the prominence categories. To calculate a corresponding contingency table $a$, we determine for each pair of cluster and category how many instances are members of both cluster $i$ and category $j$, i.e., the cells $a_{ij}$ of the contingency table are calculated using

$$a_{ij} = |\{x \mid x \in K_i \wedge x \in C_j\}|, \quad 1 \le i \le N, \quad 1 \le j \le M$$

Then, the classification accuracy can be calculated from the contingency table:

$$\mathrm{class\_acc}(a) = \frac{\sum_{i=1}^{N} \max_j a_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{M} a_{ij}} \qquad (5)$$
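In R, Eq. (5) amounts to a few lines; this sketch assumes a vector of cluster assignments and a parallel vector of category labels:

```r
## accuracy when every point is classified as the majority category
## of its cluster (Eq. 5)
class_acc <- function(clusters, cats) {
  a <- table(clusters, cats)        # contingency table a_ij
  sum(apply(a, 1, max)) / sum(a)    # majority counts over all data points
}
```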

For the SWMS data and k-means clustering with k = 5, for instance, the accuracy according to this definition is 36.0% for the original data and 39.6% for the adjusted data. While this is clearly better than the chance baseline of 20%, the accuracies are disappointingly low: only around 40% of the data points in a cluster belong to one category; the remaining data points belong to other categories. However, it is probably naïve to expect a perfect 1-to-1 correspondence where each cluster represents exactly one category. Indeed, Pierrehumbert (2003) herself concedes that, while the distributions for phoneme categories may be quite distinct for phonemes in the same contexts (and thus we could hope to find a perfect match between clusters and categories in these contexts), there may be overlap between distributions for different phonemes in different contexts. She therefore suggests that "positional allophones appear to be a more viable level of abstraction for the phonetic encoding system than phonemes in the classic sense" (Pierrehumbert, 2003, p. 211). This means that while the underlying distributions for phoneme categories are expected to overlap in phonetic space, the underlying distributions of positional allophones should be more clearly distinct. Taking these considerations from the segmental domain to the prosodic domain, specifically to prominence, we would expect each prominence category to correspond to several clusters—where each cluster corresponds to a prominence category in some specific type of context, an "allotone", if you will. This would explain the only moderate Corrected Rand indices and silhouette widths found above when allowing only 5 clusters.

Thus I varied the number of clusters k in a series of experiments, with k > 5. However, when requiring data in which all prominence categories are represented in equal proportions, the problem is that for the relatively infrequent categories, we have only 100 new data points left. Consequently, the amount of data altogether is 500 when keeping equal proportions. Thus with, say, 10 clusters, we would end up with 50 data points on average in each cluster, and it does


not seem reasonable to aim for more, and therefore less populated, clusters. However, ultimately the clustering experiments presented here aim at simulating exemplar-theoretic acquisition, where we would be dealing with much more data than we are currently looking at. Also, even though it was important to have equal proportions of each category for finding the weights, there is no reason to require equal proportions for the clustering. After all, in exemplar-theoretic acquisition, listeners would certainly be exposed to very unbalanced proportions of categories—this imbalance is in fact pertinent to any linguistic category. So in the next section I give up the requirement that the categories be represented in equal proportions, allowing greater imbalance.

5.3. Clustering more data

In order to experiment with higher numbers of clusters and with more, and more imbalanced, data, I next ran a series of experiments for each database in which I took at least 100 and, if available, up to 2000 instances of each category for clustering. Again, these data points had not been used for finding the weights. What is new compared to the previous section is that I make use of both outcomes of the procedure for finding weights: the identification of relevant dimensions, plus the weights themselves. Thus I compare results when clustering the original data using all dimensions to results when clustering data that have been adjusted by the weights, excluding the dimensions found to be irrelevant above. In these experiments I varied the number of clusters from 5 to the number of data points divided by 50, i.e. allowing on average 50 instances per cluster. For each cluster, I determined the majority category in that cluster, classified each data point in that cluster as belonging to that category, and computed the accuracy of this classification; the loop is sketched below.

Figs. 7–9 show the results on the three databases. It can be seen that, depending on the database, accuracies of between 60% and 65% can be obtained, compared to chance baselines of between 32% and 38% when classifying each data point as belonging to the overall majority class. For all three databases, there is a clear benefit of adjusting the dimensions by the perceptual weights, with the most pronounced advantage obtained on the BRN data, followed by the SWRK data. This overall benefit of the perceptual adjustment confirms the effectiveness of the proposed method. As noted above, the increase in accuracy for the perceptual adjustment is highest for the BRN data. Recall that in the case of BRN, almost all dimensions proposed were found to be relevant for perception, and thus the clustering space had the highest dimensionality for BRN. I suggest that this is why the adjustment, which models the relative relevance of the dimensions, is most helpful in the case of BRN. In all three cases, the curves are steeper for moderate numbers of clusters and flatter for higher numbers—in the case of the SWMS and SWRK databases, the accuracy rates start to level off at around 30–40 clusters already, as can be seen from the elbows around that point. In the case of BRN, there is no pronounced elbow, but again there is little increase beyond, say, 50 clusters.
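A sketch of these runs, assuming the (optionally weight-adjusted) data matrix X, the label vector cats, and class_acc from the sketch above; the step size is an illustrative choice:

```r
## accuracy as a function of the number of clusters, 5 <= k <= n/50
ks   <- seq(5, floor(nrow(X) / 50), by = 5)
accs <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, nstart = 5, iter.max = 50)
  class_acc(km$cluster, cats)
})
plot(ks, accs, type = "b",
     xlab = "number of clusters", ylab = "classification accuracy")
```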

Fig. 7. Accuracy rates on the SWMS data when treating all data points in a cluster as belonging to the majority category in that cluster. Light gray circles indicate accuracies on the original data; black diamonds indicate those on adjusted data.

Fig. 8. Accuracy rates on the SWRK data when treating all data points in a cluster as belonging to the majority category in that cluster. Light gray circles indicate accuracies on the original data; black diamonds indicate those on adjusted data.

Given that we expect clusters that correspond to "positional allophones" (Pierrehumbert, 2003, p. 211), rather than clusters where each cluster would contain all instances of a category irrespective of context, these numbers may well be appropriate. With 5 prominence categories, at 50 clusters we would have an average of 10 positional allophones (or rather, "allotones") per category. These could possibly correspond to different implementations depending, for instance, on position in the phrase or on the voicing properties of syllable onset and coda, all aspects that are well known to affect accent shape.

5.4. Clustering vowel data for comparison

To gain a more objective estimate of what constitutes a good category-to-cluster correspondence, and of how many clusters to expect per category, I clustered vowel data using the same procedure as for clustering the prominence categories. After all, as mentioned in Section 1.2, it has been suggested by Pierrehumbert (2003) that clustering should be a viable way of detecting vowel categories in an exemplar-theoretic


fashion, and she had stated that cluster centers obtained on F1/F2 data by Kornai (1998) are "extremely close" to the mean vowel formants of the categories. Thus the category-to-cluster correspondence for vowels in the above data should be a good indicator of what can optimally be expected for prominence categories. I assume that the correspondence observed on vowel data should be an upper bound rather than a lower bound for what to expect for prominence categories, since vowel categories have long been accepted as valid categories, while this is probably more controversial for pitch accents. Also, prosodic categories are notorious for lower labeling consistency, while vowel categories seem to be much less problematic in that respect. For clustering the vowels, I extracted F1, F2 and F3 as well as durations, spectral balance, and spectral tilt for all full monophthongs in the three databases using a Praat script

Fig. 9. Accuracy rates on the BRN data when treating all data points in a cluster as belonging to the majority category in that cluster. Light gray circles indicate accuracies on the original data; black diamonds indicate those on adjusted data.

Fig. 10. F1/F2 plot of the vowel data used for clustering. Axes were flipped in a way that the plot matches the usual vowel diagram.

(Boersma & Weenink, 2017), and then determined perceptual weights for clustering as above. In the case of BRN, all six parameters were retained in that procedure, with the largest coefficient, and thus the greatest importance, for vowel duration, followed by F1 and F2. For removing outliers and for normalization I proceeded analogously to the procedure described in Section 3. The number of vowel categories in the resulting data used for clustering was nine. Fig. 10 shows a traditional F1/F2 plot of the vowels used for clustering below, along with their category labels. Axes were flipped so that the plot matches the usual vowel diagram. It can be seen that while

Fig. 11. 2-dimensional projection of the data points in clustering space using the BRN vowel data before (upper panel) and after (lower panel) adjusting the dimensions by the perceptual weights from Section 4. Numbers indicate the cluster number of the data point, while colors indicate its vowel category. Same-colored data points are crowded together much better after the adjustment.


Fig. 12. Accuracy rates for vowel classification on the BRN data when treating all data points in a cluster as belonging to the majority category in that cluster. Light gray circles indicate accuracies on the original data; black diamonds indicate those on adjusted data.

the vowel categories end up in the expected regions, there is also considerable overlap between vowel categories in the F1/F2 dimensions. Fig. 11 shows 2-dimensional representations of clusterings using k-means with k = 9 and only 100 data points per vowel category, plotted using the first two discriminant dimensions, as above. The plots show results before (upper panel) and after (lower panel) adjusting the dimensions. It can be seen that vowels of the same categories end up in more similar regions of the space after the adjustment, and that vowels of the same categories seem to end up in the same cluster more often.

I then clustered the vowel data using up to 2000 instances of each vowel, once in the original space and once in the adjusted space, varying the number of clusters as above. The accuracies obtained for vowel categorization based on these clusterings are given in Fig. 12 for BRN as an example; the graphs for SWRK and SWMS look very similar. In general, the similarity of each of these graphs to the graphs obtained for clustering prominence categories is striking. The majority baseline in the case of vowels is lower than that for the above experiments, but we obtain similar rates of slightly above 60%. Again, the results are better when allowing for more clusters than categories; around 40 to 50 clusters seems to be a good choice in this case. Comparing the results for vowel category detection and prominence category detection, we can do similarly well in both cases, in fact slightly better on prominence categories in terms of absolute accuracies. When taking into account the higher number of categories in the case of vowels, and the consequently lower majority baseline, however, it can be said that the clustering adds slightly more information for vowels.

6. Discussion

Compared to the Schweitzer (2011) study, the best accuracy rates in the present study are reached at far smaller numbers of clusters—in the case of the SWMS and SWRK databases at around 30–40, in the case of BRN at around 50

clusters. In the earlier study, the optimal number of clusters, around 1600, had been found using a different evaluation method: it was obtained on independent test data using 10-fold cross-validation, with 90% of the data used for clustering in each fold, at the expense of having much more imbalanced data in terms of categories for the clustering. Thus we cannot easily compare the numbers of clusters in the two studies.8 The more imbalanced data in the earlier study also affected the accuracy results: the majority baseline (the frequency of the NONE category) in that study was at almost 78%. Not surprisingly, given this strong baseline, the accuracy rates of around 85% were higher than those of around 65% in the present study, where the majority baselines were between 32% and 38%. However, the considerable differences between majority baseline and obtained accuracies in the present study demonstrate that the approach taken here is the more promising one.

Also, the present study relies solely on acoustic parameters, whereas the earlier study also used higher-linguistic parameters such as the location of word stress or part-of-speech information. Thus it included some dimensions that are known to be highly predictive of prominence—for instance, unstressed syllables always belong to the prominence category NONE, and nouns and adjectives are much more likely to be pitch-accented than function words—but these dimensions encode categories that, in an exemplar-theoretic account of speech acquisition, would have to be learned from the data in the same way as the prominence categories themselves. Probably, the two would be learned jointly, and thus the former would not be available when establishing the prominence categories. Thus the challenge in the present study is considerably higher than in the earlier study, and also a more realistic approximation of real human prominence acquisition.

The present study simulates, in a very simple way, the acquisition of prominence categories. It does so using read data of the kind that adults would be exposed to rather than spontaneous data of the kind that children would be exposed to. However, I believe that the distribution of prominence categories was shifted towards what children would hear: by selecting portions of the data that contained as many of the infrequent prominence categories as possible, these data were less imbalanced than the full data set; specifically, they favored pitch-accented categories over the most frequent category NONE. Given that child-directed speech has been shown to be prosodically more exaggerated than adult-directed speech (e.g. Fernald et al., 1989; Vosoughi & Roy, 2012), this should match the distributions that children are exposed to slightly better.

8 In order to make sure that the lower numbers of clusters in the present study are not due to the fact that there was no such external evaluation on independent data, I used up to another 2000 independent data points per remaining category for evaluating the clusters again. The accuracies when assigning new data points to the nearest cluster center were consistently higher than those for the data originally used for clustering. This is not too surprising, since the independent data are necessarily much more imbalanced: after the last instances of the less frequent prominence categories have been used for detecting the clusters, only very frequent categories, i.e. almost exclusively unaccented syllables, are left for the evaluation. However, the fact that the accuracies for independent test data are higher than those obtained on the clustering data themselves clearly indicates that the results on the clustering data do not suffer from overfitting. Also, the results on the independent data do not indicate that higher numbers of clusters are better; instead, the accuracies are largely independent of the number of clusters.


In any case, I would like to emphasize that this study is to be interpreted as an exploration of the claim put forth by Pierrehumbert (2003) that categories can be established, or at least initiated, by detecting clusters in perceptual space. It is not intended to provide a fully-fledged simulation of speech acquisition. The latter would require a substantial amount of real child-directed, and prosodically labeled, data, recorded at the age of acquisition of prosodic categories, and such data is currently, at least to my knowledge, not available for any language. A future full account would also require modeling exemplar production. As discussed in Section 1.1, production would rely on the category labels that are stored with the exemplars. For instance, when producing the word "ball", exemplars which are labeled as ball exemplars will unconsciously be activated and then contribute to establishing the production target. In extending exemplar-theoretic production to the prosodic domain, one would have to decide which abstract labels can be assumed to be stored. The prominence categories investigated here are maybe too abstract to be accessible to speakers as potential labels. Given the communicative function of prominence categories, I would expect labels such as "new information" or "corrective information" to serve as a proxy to the prosodic categories, at least at the beginning. Such "easy" cognitive concepts have been argued to be accessible to infants even before they are implemented with adult prosody (Höhle, Berger, & Sauermann, 2016). Furthermore, modeling production would require modeling the activation of individual exemplars, as well as potential consequences such as activation competition and resonance. Future work could thus (i) try to run similar cluster experiments using data for which such easy concepts have been annotated and (ii) model production including more complex aspects such as the activation of exemplars.

There is still much to be learned from the present study. First of all, it suggests a simple approach to establishing and scaling the perceptual importance of the acoustic dimensions, and it shows that this considerably increases the quality of the resulting clusters. The fact that the scaling is done using one uniform weight for each dimension, rather than a more complex non-linear warping of these dimensions, is again owed to exemplar-theoretic considerations: while perceptual warping of the phonetic space, as evidenced in the well-known perceptual magnet effect (PME; Kuhl, 1991), might be taken to suggest that different areas along each dimension have to be adjusted in different ways, Lacerda (1995) has already shown that the non-linearity in the PME can be modeled as a consequence of different densities of exemplars along the dimensions, without assuming any non-linear adjustment.9 Second, the experiments show that clusters corresponding to prominence categories can indeed be detected, as predicted by exemplar theory, and that they can be detected with similar accuracy as clusters corresponding to vowel categories.

9 It should be noted that I do not want to argue against the kind of non-linearity encountered in psycho-acoustic scales, which is different from the magnet effect: the latter arises only in the learning process and is not innate. I assume that non-linearities that follow from neurobiological circumstances do not need to be modeled by exemplar theory, while non-linearities that arise in the learning process should be explained.


As suspected by Pierrehumbert (2003), there is not a perfect 1-to-1 correspondence between clusters and categories, but both in the case of vowels and in the case of prominence categories we can identify just a few clusters on average for each category, and these clusters can be taken to correspond to allophones of vowel categories in the segmental domain, or to "allotones" of prominence categories in the prosodic domain.

7. Conclusion

I have illustrated the exemplar-theoretic integration of phonological form and phonetic substance in the domain of prominence. According to exemplar theory, phonological categories arise from abstracting over clusters of phonetically similar exemplars that are associated with the same meanings. To model the perceptual importance of potentially relevant dimensions, I have suggested a simple procedure for deriving perceptual weights for scaling the dimensions, and I have shown that it considerably facilitates the detection of clusters corresponding to positional variants of prominence categories. The procedure not only yields weights; it also makes it possible to identify which dimensions are relevant for perception. Thus, as a byproduct, the adjustment procedure can in general be used to confirm or reject hypotheses about which parameters play a role in distinguishing perceptual categories. The number of acoustic-prosodic dimensions that were found to be relevant in each database here demonstrates that speakers encode prominence jointly by a variety of parameters. The difference between SWRK and SWMS in this respect shows that the use of these dimensions can also be speaker-specific.

In contrast to an earlier study, the present study assumes only low-level phonetic features that should be available early in speech acquisition, and a more even distribution in terms of prominence categories, at the expense of altogether lower accuracy rates when evaluating the clusters. However, the increase in accuracy over the overall majority baseline (i.e. the baseline corresponding to an "educated guess") is much greater in the current study than in the earlier study, attesting to the greater effect of the dimensions used here. In addition, the number of detected clusters is reasonably lower in the current study, confirming the plausibility of an exemplar-theoretic approach to category detection in general and to prominence category detection in particular.

References

Aronov, G., & Schweitzer, A. (2016). In C. Draxler & F. Kleber (Eds.), Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum (pp. 12–15).
Arvaniti, A., Ladd, D. R., & Mennen, I. (1998). Stability of tonal alignment: The case of Greek prenuclear accents. Journal of Phonetics, 26, 3–25.
Barbisch, M., Dogil, G., Möbius, B., Säuberlich, B., & Schweitzer, A. (2007). Unit selection synthesis in the SmartWeb project. In Proceedings of the 6th ISCA Workshop on Speech Synthesis (SSW-6, Bonn) (pp. 304–309).
Barnes, J., Veilleux, N., Brugos, A., & Shattuck-Hufnagel, S. (2012). Tonal Center of Gravity: A global approach to tonal implementation in a level-based intonational phonology. Laboratory Phonology, 3, 337–383.
Baumann, S., & Röhr, C. (2015). The perceptual prominence of pitch accent types in German. In Proceedings of the 18th International Congress of Phonetic Sciences, Glasgow.
Baumann, S., & Winter, B. (2018). What makes a word prominent? Predicting untrained German listeners' perceptual judgments. Journal of Phonetics, 70, 20–38.
Beckman, M. E., & Ayers, G. M. (1994). Guidelines for ToBI labelling, version 2.0.
Boersma, P., & Weenink, D. (2017). Praat, a system for doing phonetics by computer [computer program]. http://www.praat.org/. Version 6.0.36, retrieved 20 Nov. 2017.
Bolinger, D. L. (1957). On intensity as a qualitative improvement of pitch accent. Lingua, 7, 175–182.
Bolinger, D. L. (1958). A theory of pitch accent in English. WORD, 14, 109–149.
Bruce, G. (1977). Swedish word accents in sentence perspective (Travaux de l'Institut de Phonétique XII). Lund: Gleerup.


Calhoun, S., & Schweitzer, A. (2012). In G. Elordieta Alcibar & P. Prieto (Eds.), Prosody and meaning (Trends in Linguistics) (pp. 271–327). Mouton De Gruyter.
Campbell, N., & Beckman, M. E. (1997). Stress, prominence, and spectral tilt. In A. Botinis, G. Kouroupetroglou, & G. Carayiannis (Eds.), Intonation: Theory, models and applications (Proceedings of an ESCA workshop, September 18–20, 1997, Athens, Greece) (pp. 67–70). ESCA and University of Athens Department of Informatics.
Cangemi, F., & Grice, M. (2016). The importance of a distributional approach to categoriality in autosegmental-metrical accounts of intonation. Laboratory Phonology: Journal of the Association for Laboratory Phonology, 7, 1–20.
Cole, J., Hualde, J. I., Smith, C. I., Eager, C., Mahrt, T., & de Souza, R. N. (2019). Sound, structure and meaning: The bases of prominence ratings in English, French and Spanish. Journal of Phonetics, 75, 113–147. https://doi.org/10.1016/j.wocn.2019.05.002.
Crosswhite, K. (2003). Spectral tilt as a cue to word stress in Polish, Macedonian, and Bulgarian. In Proceedings of ICPhS 2003 (Barcelona, Spain) (pp. 767–770).
Fernald, A., Taeschner, T., Dunn, J., Papousek, M., de Boysson-Bardies, B., & Fukui, I. (1989). A cross-language study of prosodic modifications in mothers' and fathers' speech to preverbal infants. Journal of Child Language, 16, 477–501. https://doi.org/10.1017/S0305000900010679.
Fry, D. B. (1955). Duration and intensity as physical correlates of linguistic stress. Journal of the Acoustical Society of America, 27, 765–768.
Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 1166–1183.
Goldinger, S. D. (1997). Words and voices—perception and production in an episodic lexicon. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 33–66). San Diego: Academic Press.
Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105, 251–279.
Gordon, A. D. (1999). Classification (2nd ed.). Chapman and Hall.
Grabe, E. (1998). Pitch accent realization in English and German. Journal of Phonetics, 26, 129–143.
Grice, M., Baumann, S., & Jagdfeld, N. (2009). Tonal association and derived nuclear accents—The case of downstepping contours in German. Lingua, 119, 881–905.
Grice, M., Ritter, S., Niemann, H., & Roettger, T. B. (2017). Integrating the discreteness and continuity of intonational categories. Journal of Phonetics, 64, 90–107.
Gussenhoven, C., & Rietveld, A. (1988). Fundamental frequency declination in Dutch: Testing three hypotheses. Journal of Phonetics, 16, 355–369.
Hennig, C. (2015). fpc: Flexible procedures for clustering. R package version 2.1-10.
Höhle, B., Berger, F., & Sauermann, A. (2016). Information structure in first language acquisition. In C. Féry & S. Ishihara (Eds.), The Oxford handbook of information structure (pp. 562–580). Oxford University Press.
Jilka, M., & Möbius, B. (2007). The influence of vowel quality features on peak alignment. In Proceedings of Interspeech 2007 (Antwerpen) (pp. 2621–2624).
Johnson, K. (1997). Speech perception without speaker normalization: An exemplar model. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 145–165). San Diego: Academic Press.
Kochanski, G., Grabe, E., Coleman, J., & Rosner, B. S. (2005). Loudness predicts prominence: Fundamental frequency lends little. The Journal of the Acoustical Society of America, 118, 1038–1054.
Kornai, A. (1998). Analytic models in phonology. In J. Durand & B. Laks (Eds.), The organization of phonology: Constraints, levels and representations (pp. 395–418). Oxford, U.K.: Oxford University Press.
Kügler, F., & Gollrad, A. (2015). Production and perception of contrast: The case of the fall-rise contour in German. Frontiers in Psychology, 6, 1254. https://doi.org/10.3389/fpsyg.2015.01254.
Kuhl, P. K. (1991). Human adults and human infants show a 'perceptual magnet effect' for the prototypes of speech categories, monkeys do not. Perception and Psychophysics, 50, 93–107.
Lacerda, F. (1995). The perceptual-magnet effect: An emergent consequence of exemplar-based phonetic memory. In Proceedings of the 13th International Congress of Phonetic Sciences (Stockholm) (pp. 140–147).
Ladd, D. R. (1996). Intonational phonology (Cambridge Studies in Linguistics 79). Cambridge, UK: Cambridge University Press.
Liberman, M., & Pierrehumbert, J. (1984). Intonational invariance under changes in pitch range and length. In M. Aronoff & R. T. Oehrle (Eds.), Language sound structure (pp. 157–230). Cambridge: MIT Press.
Lieberman, P. (1960). Some acoustic correlates of word stress in American English. The Journal of the Acoustical Society of America, 32, 451–454. https://doi.org/10.1121/1.1908095.
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., & Hornik, K. (2017). cluster: Cluster analysis basics and extensions. R package version 2.0.6.
Maye, J., & Gerken, L. (2000). Learning phonemes without minimal pairs. In Proceedings of the 24th Annual Boston University Conference on Language Development (pp. 522–533). Somerville, Mass.: Cascadilla Press.
Maye, J., Werker, J. F., & Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82, B101–B111.
Mayer, J. (1995). Transcription of German intonation—The Stuttgart system (Technical report). Institute of Natural Language Processing, University of Stuttgart.

Möhler, G. (2001). Improvements of the PaIntE model for F0 parametrization. Manuscript. http://www.ims.uni-stuttgart.de/institut/mitarbeiter/moehler/papers/gm_aims01.ps.gz.
Möhler, G., & Conkie, A. (1998). Parametric modeling of intonation using vector quantization. In Proceedings of the Third International Workshop on Speech Synthesis (Jenolan Caves, Australia) (pp. 311–316).
Naimi, B., Hamm, N. A. S., Groen, T. A., Skidmore, A. K., & Toxopeus, A. G. (2014). Where is positional uncertainty a problem for species distribution modelling. Ecography, 37, 191–203. https://doi.org/10.1111/j.1600-0587.2013.00205.x.
Niebuhr, O. (2013). The acoustic complexity of intonation. In E. L. Asu & P. Lippus (Eds.), Nordic Prosody XI (pp. 15–29). Frankfurt: Peter Lang.
Niebuhr, O., & Pfitzinger, H. R. (2010). On pitch-accent identification—The role of syllable duration and intensity. In Speech Prosody 2010 (pp. 100773:1–4).
Niebuhr, O., & Ward, N. G. (2018). Challenges in studying prosody and its pragmatic functions: Introduction to JIPA special issue. Journal of the International Phonetic Association, 48, 1–8.
Okobi, A. O. (2006). Acoustic correlates of word stress in American English (Ph.D. thesis). Massachusetts Institute of Technology.
Ostendorf, M., Price, P. J., & Hufnagel, S. S. (1995). The Boston University radio news corpus (Technical report). Linguistic Data Consortium.
Peters, J., Hanssen, J., & Gussenhoven, C. (2015). The timing of nuclear falls: Evidence from Dutch, West Frisian, Dutch Low Saxon, German Low Saxon, and High German. Laboratory Phonology, 6, 1–52.
Pierrehumbert, J. (1980). The phonology and phonetics of English intonation (Ph.D. thesis). Cambridge, MA: MIT.
Pierrehumbert, J. (2001). Exemplar dynamics: Word frequency, lenition and contrast. In J. Bybee & P. Hopper (Eds.), Frequency and the emergence of linguistic structure (pp. 137–157). Amsterdam: Benjamins.
Pierrehumbert, J. (2003). In R. Bod, J. Hay, & S. Jannedy (Eds.), Probability theory in linguistics (pp. 177–228). The MIT Press.
Pierrehumbert, J. B. (2016). Phonological representation: Beyond abstract versus episodic. Annual Review of Linguistics, 2, 33–52. https://doi.org/10.1146/annurev-linguistics-030514-125050.
Prieto, P., van Santen, J., & Hirschberg, J. (1995). Tonal alignment patterns in Spanish. Journal of Phonetics, 23, 429–451. https://doi.org/10.1006/jpho.1995.0032.
R Core Team (2017). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Rietveld, A., & Gussenhoven, C. (1985). On the relationship between pitch excursion size and prominence. Journal of Phonetics, 13, 299–308.
Rousseeuw, P. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65.
van Santen, J., & Hirschberg, J. (1994). Segmental effects on timing and height of pitch contours. In Proceedings of the 3rd International Conference on Spoken Language Processing (ICSLP 94) (pp. 719–722). Yokohama, Japan.
Schweitzer, A. (2011). Production and perception of prosodic events—Evidence from corpus-based experiments (Ph.D. thesis). Universität Stuttgart.
Schweitzer, A., Möhler, G., Dogil, G., & Möbius, B. (in preparation). The PaIntE model of intonation. In J. Barnes & S. Shattuck-Hufnagel (Eds.), Prosodic theory and practice. MIT Press.
Schweitzer, K., Walsh, M., Calhoun, S., Schütze, H., Möbius, B., Schweitzer, A., & Dogil, G. (2015). Exploring the relationship between intonation and the lexicon: Evidence for lexicalised storage of intonation. Speech Communication, 6, 65–81.
Silverman, K., & Pierrehumbert, J. (1990). The timing of prenuclear high accents in English. In J. Kingston & M. E. Beckman (Eds.), Papers in laboratory phonology I (pp. 72–106). Cambridge University Press.
Sluijter, A. M., & van Heuven, V. J. (1997). Spectral balance as an acoustic correlate of linguistic stress. Journal of the Acoustical Society of America, 2471–2486.
Taylor, P., Caley, R., Black, A. W., & King, S. (1999). Edinburgh Speech Tools Library [http://festvox.org/docs/speech_tools-1.2.0/]. System documentation, Edition 1.2, for 1.2.0, 15th June 1999.
Terken, J., & Hermes, D. (2000). The perception of prosodic prominence. In M. Horne (Ed.), Prosody: Theory and experiment (pp. 89–127). Kluwer Academic Publishers.
Turk, A. E., & White, L. (1999). Structural influences on accentual lengthening in English. Journal of Phonetics, 27, 171–206.
Vosoughi, S., & Roy, D. (2012). A longitudinal study of prosodic exaggeration in child-directed speech. In SP-2012 (pp. 194–197).
Wade, T., Dogil, G., Schütze, H., Walsh, M., & Möbius, B. (2010). Syllable frequency effects in a context-sensitive segment production model. Journal of Phonetics, 38, 227–239.
Wagner, P., Ćwiek, A., & Samlowski, B. (2019). Exploiting the speech-gesture link to capture fine-grained prosodic prominence impressions and listening strategies. Journal of Phonetics (in press).
Wahlster, W. (2004). SmartWeb: Mobile applications of the semantic web. In S. Biundo, T. Frühwirth, & G. Palm (Eds.), KI 2004: Advances in artificial intelligence (pp. 50–51). Berlin/Heidelberg: Springer.
Walsh, M., Möbius, B., Wade, T., & Schütze, H. (2010). Multilevel exemplar theory. Cognitive Science, 34, 537–582.
Ward, N. G., & Gallardo, P. (2017). Non-native differences in prosodic-construction use. Dialogue & Discourse, 8, 1–30.