
Journal of Phonetics xxx (2017) xxx–xxx

Contents lists available at ScienceDirect

Journal of Phonetics journal homepage: www.elsevier.com/locate/Phonetics

Special Issue: Mechanisms of regulation in speech, eds. Mücke, Hermes & Cho

Speech dynamics: Converging evidence from syllabification and categorization

Betty Tuller a,*, Leonardo Lancia b

a National Science Foundation, 4201 Wilson Boulevard, Arlington, VA 22230, USA
b Laboratoire de Phonétique et Phonologie (CNRS, Sorbonne Nouvelle), 19 rue des Bernardins, 75005 Paris, France

Article info

Article history: Received 11 March 2016; Received in revised form 7 February 2017; Accepted 9 February 2017; Available online xxxx

Keywords: Dynamical systems; Models of speech perception; Models of speech production; Syllabification; Categorization

Abstract

The present paper explores the dynamics of speech production and perception in the context of syllabification and categorization. The selective review includes empirical work and dynamical models that account for changes in the perception and production of syllable structure as transitions between attractors in a dynamical system and that highlight the role of instabilities as a mechanism for regulating flexibility and change. Different conceptual approaches to changes in perceptual categorization are reviewed, including a nonlinear dynamic model, a related Bayesian approach, and a hybrid approach. Of particular importance are recent models that incorporate cognitive factors (such as attention, expectation, and memory) and that change slowly or quickly relative to the changing acoustic input. These dynamical models allow phenomena such as self-organization, emergence, and other hallmarks of complex adaptive systems and may also suggest a mechanism linking speech production and perception, providing an alternative description to the internal models often invoked.

Published by Elsevier Ltd.

1. Introduction

Speech communication, from the movements of the articulators to taking turns in conversation, involves an intricate dance in time. Small differences in the timing of "velar" movement in articulatory synthesis, for example, can shift perception of the resulting acoustic signal among banana, bandana, bad#nana and bad#data (Rubin & Vatikiotis-Bateson, 1998), and alterations in the short intervals between turns of conversational participants can severely disrupt social interaction (Levinson & Torreira, 2015). In this paper, we consider as fundamental the notion that speech is a dynamic process that evolves in time, and that this temporal flow is inseparable from the processes that regulate when production and perception are stable and that allow stability and plasticity/flexibility to coexist. We will describe work that assesses and models how both stability and flexibility arise, coexist, and influence speech communication.

Over the last decades, we have explored the mechanisms involved in the stability and flexibility of speech communication, guided by concepts from the theory of nonlinear dynamical systems (e.g., Case, Tuller, Ding, & Kelso, 1995; Lancia, Nguyen, & Tuller, 2008; Nguyen, Lancia, Bergounioux, Wauquier-Gravelines, & Tuller, 2005; Nguyen, Wauquier-Gravelines, & Tuller, 2009; Tuller, 2003, 2004; Tuller, Case, Ding, & Kelso, 1994; Tuller & Kelso, 1990, 1991). These studies emphasize the role of interactions among the multiple heterogeneous processes that affect speech communication at different levels of organization and that unfold over different time scales. These time scales range from the fast time scale of milliseconds (characterizing neuronal firing) to the much slower time scales of years (characterizing learning, developmental processes, and language change). Although the interdependencies among the processes that underlie the observed behaviors are a source of complexity when considered from the point of view of symbolic computations, mutual interactions can be a source of order when considered from a dynamical view. The sensorimotor system can reorganize in a task-specific manner through changes in the parameters governing the interactions, thus producing stable, yet flexible, behavior. This idea is grounded in theories of self-organization and pattern formation in open systems far from equilibrium (particularly Haken's Synergetics, 1977/1983). In what follows, we will summarize and illustrate the main principles governing this approach,

http://dx.doi.org/10.1016/j.wocn.2017.02.001 0095-4470/Published by Elsevier Ltd.

Please cite this article in press as: Tuller, B., & Lancia, L. Speech dynamics: Converging evidence from syllabification and categorization. Journal of Phonetics (2017), http://dx.doi.org/10.1016/j.wocn.2017.02.001


using examples from the production and perception of syllabification and perceived categorization (including illusory changes in categorization). In addition, we will describe different approaches to nonlinear dynamic modeling of these phenomena.

2. Non-equilibrium phase transitions in sensorimotor processes

Processes that unfold over time can be modeled as dynamical systems. In a dynamical system, the present state of the system depends in some rule-governed way on previous states. Differential equations or maps of essential variables offer a mathematical description of how a behavior's essential parameters change as time passes and as contextual parameters change. Interactions among dynamical systems are modeled as coupling relations: systems are coupled when the current state of one system depends on its own past values as well as on the past values of another system. Coupling relations can be linear or nonlinear, but the presence of nonlinear coupling relations is a necessary condition for the observation of non-equilibrium phase transitions, as described below.

Dynamical models of particular relevance for speech are complex, open systems (e.g., living systems) that require interaction with their environment, exchanging matter or energy to maintain an organized structure. This organized structure is described by collective variables, which capture the macroscopic behavior of the many individual degrees of freedom; collective variables obey a lower-dimensional dynamics than that describing the behavior of the individual components. The behavior of complex systems is also influenced by control parameters, which may be fixed from the outside (i.e., they quantify the influences of the environment) or may be generated within the system under consideration. Across some range of values, changes in a control parameter might produce little or no observable change in system behavior despite ever-present fluctuations, i.e., the system remains stable. But when control parameters reach specific critical values, a system may become unstable and undergo qualitative changes in organization or behavior.
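The notion of coupling introduced above can be illustrated with a minimal sketch of our own (not from the paper): two discrete-time systems whose next states depend on their own past values and, nonlinearly, on each other's. The cosine coupling term and all coefficients are arbitrary choices made for the illustration.

```python
# Minimal sketch (illustrative only): two discrete-time systems are
# "coupled" when the next state of each depends on its own past value
# and on the other's past value. The coupling here is nonlinear (cos).
import math

def step(x, y, c):
    # Each system decays toward 0 on its own (the 0.5 * state term);
    # the coupling term c * cos(other) pushes it away from 0.
    return 0.5 * x + c * math.cos(y), 0.5 * y + c * math.cos(x)

def settle(x, y, c, n=200):
    # Iterate the coupled maps until they reach their joint equilibrium.
    for _ in range(n):
        x, y = step(x, y, c)
    return x, y

# Uncoupled (c = 0): each system relaxes to its own equilibrium at 0.
print(settle(1.0, -0.5, c=0.0))
# Coupled (c = 0.2): a joint equilibrium emerges in which each state
# is sustained by the other's -- neither system would reach it alone.
print(settle(1.0, -0.5, c=0.2))
```

With these coefficients the coupled map is a contraction, so the joint equilibrium is unique and independent of the starting point; the point of the sketch is only that the equilibrium of each system is determined by the state of the other.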
These qualitative changes, or transitions between macroscopic patterns, reveal a new state of the collective variables, which capture the new macroscopic behavior of the individual system components. This qualitative change in the behavior of a system driven by a change in a control parameter is termed a phase transition. An important herald of an approaching phase transition is the growth in instability of the current state (organization/behavior/attractor) until, when the threshold is crossed, the initially stable behavior vanishes and the system states shift until behavior adopts a new stable organization.

The evaluation of these principles in speech communication leads to a conception of speech production and perception as characterized by a limited number of stable states, or attractors, which allow the system to perform a discretization of articulatory gestures and perceptual space and which are associated with abstract speech categories. Changes in articulatory and/or perceptual state may occur as a pattern formation process resulting from a non-equilibrium phase transition. The control parameters governing phase transitions in speech can represent, for example, features of the input signals or top-down patterns of activation. In this view, coupling relations and speech-relevant control parameters act as constraints, harnessing the underlying dynamics so that their joint action produces stable patterns that are associated with a symbol or category. As a consequence, the same spatiotemporal pattern can be described as due to the interaction of many partially independent microscopic dynamical systems or as due to the action of a low-dimensional macroscopic system in the achievement of a symbolic goal (such as a speech utterance or perception of a speech sound). The two levels of description are not redundant, because the function served by the macroscopic system can be understood at the symbolic level but not at the microscopic level (Pattee, 1972). In other words, we will usually find no traces of the goal defined at the macroscopic level in the equations that describe the microscopic systems. The integration between the microscopic and the macroscopic levels of description permits grounding abstract representations in physical reality (Rączaszek-Leonardi, 2012). It reconciles the lack of detail implicit in abstract symbolic representations with the sensitivity of the sensorimotor processes to the details of their physical implementation (Nguyen et al., 2009).

3. Production dynamics and syllabification

The principles of dynamical systems, introduced above, allow an explanation of the shifts in perception of the utterances banana, bandana, bad#nana and bad#data with acoustic changes consequent to changes in the timing of model parameters akin to velar movement: the temporal structure of movements, in particular their relative phase, might act as a collective variable that indexes distinct production states with perceptual consequences. In flesh-and-blood speakers, small differences in the phasing of velar movement relative to the timing of alveolar closure may act as a control parameter, moving the speaker/listener through several macroscopic system states. In order to explore whether these different states emerge through a non-equilibrium phase transition, it is necessary to lead the speaker or listener through the states and assess how the switches in production and perception occur. However, there is usually little reduplication of syllables in English conversational speech, so it is unclear what might constitute a base cycle within which phase would be assessed.

A base cycle can be induced experimentally by repeating an event, such as a word, syllable, or phrase (e.g., Cummins, 2009; Cummins & Simko, 2009). Repeating the event at a variety of rates can induce a varying base cycle. One well-known example comes from Stetson (1951), who observed that when subjects repeat groups of syllables at a range of rates, syllables such as /at/ that are clearly distinct from /ta/ at slow rates become perceptually identical to /ta/ at fast rates. As a subject produces a VC syllable (such as /at/) repetitively, with gradually increasing speaking rate, the syllable affiliation of the consonant appears to change, producing a CV series (/ta, ta, .../). Stetson proposed that the need to simplify coordination caused the syllable-final consonant to become syllable-initial because the final consonant was "off phase, out of step with the syllable movement" (p. 96).
Another interpretation of this observation is that the phasing among component gestures is a collective variable capable of exhibiting multiple patterns and complex behaviors. In this interpretation, the role of speech rate is that of an external control parameter whose


manipulation can induce qualitative changes in the behavior of the collective variable when a critical speech rate is exceeded. The discontinuities created by rate scaling are at least consistent with the idea that a transition in this collective variable causes a change in consonant syllable affiliation. This phenomenon was later studied by Tuller and Kelso (1990) who asked speakers to produce the utterance /ip/ or /pi/ repetitively at a slow speaking rate. During each trial, the experimenter signaled the speaker several times to increase the rate of production. The acoustic speech signal was recorded simultaneously with glottal and oral movements. The reference cycle was defined as the interval between successive lip aperture minima (during lip closure for the bilabial stop consonant) and the phase at which the peak glottal opening occurred within each cycle was noted. This phase variable successfully distinguished eep from pee at slow rates and could be used to track the transition from eep to pee at fast rates, a transition that usually occurred within only one or two cycles. This abrupt change, together with the observation that beyond the transition only the CV pattern was stable (no speaker ever switched back to the VC), suggests that relative phase of the coordinated glottal and supraglottal gestures acted as a collective variable and that the dynamics of this collective variable are bistable in one parameter regime (two attractor states are observed at slow rates, when both VC and CV are possible productions) and monostable in another regime, where one of the attractors vanishes (at fast rates, only the CV occurs). One intriguing observation was that when subjects chose their speaking rate, they tended to produce a large step change in rate exactly when the switch from VC to CV production occurred. That is, they would jump over rates at which production of the VC syllable might be unstable, switching instead to a clearly monostable dynamic. 
This meant that we could not assess whether a growth in instability of the VC form was a mechanism leading to the transition to the now more stable CV. In subsequent work, Tuller and Kelso (1991) repeated their earlier 1990 experiment with one variation: they asked subjects to speak in time to a metronome instead of allowing self-selected rate changes. This led speakers into the unstable regime and, as predicted, increases in phase variability (indexing the growing instability of the VC form) were observed just prior to the transition into the CV form. After the transition to the CV form, phase variability quickly decreased. Of course, it is impossible to know whether the speaker still intended to produce the VC form at faster speaking rates; especially at the early part of the transition, the intention and the production may well differ.
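The phase measure used in these studies (the position of peak glottal opening within the cycle defined by successive lip-aperture minima) can be sketched in a few lines. The event times below are invented for illustration; real measurements would come from transduced articulatory signals.

```python
# Sketch of the relative-phase measure described above (event times are
# hypothetical): the reference cycle runs between successive lip-aperture
# minima, and the phase of peak glottal opening is its position within
# that cycle, expressed in degrees (0-360).

def relative_phase(cycle_onsets, event_times):
    """For each reference cycle [t_i, t_{i+1}), find the event(s) that
    fall inside it and return their phases in degrees."""
    phases = []
    for start, end in zip(cycle_onsets, cycle_onsets[1:]):
        for t in event_times:
            if start <= t < end:
                phases.append(360.0 * (t - start) / (end - start))
    return phases

# Hypothetical lip-aperture minima (s) and glottal-opening peaks (s):
lip_minima = [0.00, 0.40, 0.78, 1.14]
glottal_peaks = [0.30, 0.58, 0.86]

print(relative_phase(lip_minima, glottal_peaks))  # one phase per cycle
```

In a drifting sequence like this hypothetical one, the phase shrinks from cycle to cycle; in the real data, an abrupt jump in this variable (within one or two cycles) marked the VC-to-CV transition.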

4. Production dynamics, syllabification, and perception

If relative phasing of articulators can be considered a collective variable that is relevant for speech communication, its changes should affect not only the behavior of the articulatory system of the speaker but also that of the perceptual system of the listener. In practice, this leads to the expectation that a shift from the relative phase of eep to the relative phase of pee would be perceived as a change from eep to pee. Articulatory gestures from eep trials produced by one speaker in Tuller and Kelso (1990) were examined for instances of relative phase of peak glottal opening and minimum lip aperture that spanned the range from that observed in this speaker's eep trials at slow speaking rates and that observed in the same speaker's pee trials. When such items were identified, they were excised beginning at the onset of voicing for the vowel in eep to the onset of closure after the following vowel, as determined from the acoustic waveform. Thus, each excised item had the form /ip#i/, with the required syllable boundary after the consonant. It was not possible to find items whose relative phase values were separated by exactly equal steps, but we created a "continuum" from the closest approximation to such a continuum for this speaker (Tuller & Kelso, 1991). Subjects first performed a 4IAX discrimination task in which they indicated in writing whether the stimuli sounded exactly the same or were different in any way. After a short break, they completed a forced-choice identification task, indicating whether the middle consonant was the end of the first "word" (i.e., eep or eeb) or the beginning of the second "word" (i.e., pee or bee). Results shown in Fig. 1 describe a remarkably typical identification function (given the naturally produced stimuli), with an abrupt change from approximately 90% identification of /i#pi/ to 25% identification of /i#pi/ as relative phase of the stimuli increased; discrimination accuracy peaked at the observed phase transition (Tuller & Kelso, 1991). In other words, the shift in relative phase of glottal and oral movements from that of the VC to that of the CV entailed a corresponding perceptual change. Note that this analysis did not examine whether any specific acoustic consequences of relative phase or other cotemporaneous variables provided the information for perception.

Later work on the topic by de Jong, Lim, and Nagao (2002) also observed that a glottal-to-oral phasing measure shows transitions in values as speaking rate increases, corresponding to those reported by Tuller and Kelso (1990, 1991). However, their results suggest a more complicated picture, in that the rate-scaling task involved a more complex pattern of glottal behavior than originally suggested. In follow-up work, de Jong, Lim, and Nagao (2004) observed that increasing speaking rate induced a variety of changes in the articulation of both VC and CV forms: VCs did not resolve completely to CVs, and both forms were more perceptually ambiguous at faster rates. Nevertheless, changes to the VC form were still more perceptually dominant than were changes to the CV form. Despite this more complicated picture, implicating both articulatory conversion of VC productions to CV productions and acoustic

Fig. 1. Identification function showing the effect of relative articulatory phase (x-axis) on perceived syllable affiliation of an inter-vocalic /p/ (y-axis). Adapted from Tuller and Kelso (1991).


aspects of juncture marking, relative phasing of articulators and the consequent acoustic changes can be considered collective variables that describe the behavior of the speech processing system during the production and the perception of syllables. This conclusion has a number of implications. By identifying a parameter that acts as a collective variable in a process involving both speakers and listeners, we are modeling both of their sensorimotor systems as two components of a larger complex system. Essentially, we are postulating that the phase transition that brings the articulators of the speaker to show the hallmarks of the CV syllable organization also has the potential to drive the sensory system of the listener, who switches from one perceptual state to the other. Relative phase, behaving as a collective variable for the production system of the speaker, at the same time behaves as a control parameter for the perceptual system of the listener. Through this mechanism, a hierarchy of dependencies couples the sensory system of the listener to the motor system of the speaker.

A tight link between perception and production processes is suggested by a number of phenomena observed in the literature, such as phonetic convergence (Nguyen & Delvaux, 2015) or the after-effects of speech motor learning on speech perception (Lametti, Rochet-Capellan, Neufeld, Shiller, & Ostry, 2014). Some investigators conclude that speech movements are aimed at producing acoustic patterns, but that the perception of the acoustic pattern is affected by the motor constraints underlying its production (see, for example, the Perception-for-Action-Control theory of Schwartz, Basirat, Ménard, & Sato, 2012). Moulin-Frier, Diard, Schwartz, and Bessière (2015) recently simulated interactions between artificial agents and highlighted the importance of sensory-motor coupling in the emergence of a speech code.
In their simulation scenario, several conversational agents are involved in communicative interactions in which one agent refers to an external object and another agent has to identify that object. The agents represent knowledge about each external object using probability distributions of motor, sensory, and internal variables. Through Bayesian inference, this knowledge is manipulated to select actions during production and objects in the environment during perception. After conditioning the outcome of perception and production on different combinations of the sources of knowledge, the investigators conclude that "without a capacity to associate what they hear from others with their own motor gestures and to select adequate gestures in relationship with the sensory information they provide to their interlocutor, agents are unable to converge to a conventional code" (Moulin-Frier et al., 2015, p. 21).

It is also worth noting that the approach taken here dovetails nicely with Browman and Goldstein's Articulatory Phonology, an influential theory (e.g., see contributions in this volume) that seeks a principled account of how gestural units in speech can serve simultaneously as units of action and units of linguistic contrast. Articulatory Phonology posits a gestural score, a formal representation of articulatory movements and their relative timing, that is used to generate articulatory sequences. The theory has addressed a wide variety of coarticulatory phenomena as well as epenthesis, lenition, deletion, etc. (Browman & Goldstein, 1990, 1992) and has been extended to evaluating how prosodic units coordinate articulatory movements in time

and how rhythmic structure emerges from the interactions among different prosodic units (e.g., Barbosa, 2007; Goldstein, Nam, Saltzman, & Chitoran, 2009; Nam, Goldstein, & Saltzman, 2009; Tilsen, 2016).

Looking back, it is obvious that Tuller and Kelso (1990, 1991) used far from optimal procedures to explore perceptual dynamics. They followed the typical methodology of the day, randomizing stimulus presentations in order to reduce or eliminate presentation order effects, the interaction among stimuli over time, and changes in the listener over time. These "controls" in fact disrupt the "footprints" of the naturally occurring dynamics. In a dynamical system, the current value of a variable is not independent of preceding values, and the current state of a system can only be evaluated in the context of preceding states. Thus, to reveal the characteristics of perceptual dynamics (the processes that allow stability and flexibility to coexist in speech communication, the mechanisms that regulate when perception is stable, and when and how it can shift), one should manipulate the order of presentation of stimuli in a systematic fashion, taking into account the state of the perceptual system when each stimulus is presented.

5. Perceptual dynamics and categorization

A dynamical account of speech categorization (or of any behavioral phenomenon) involves several levels of description. First, at the task level, a gradual change of a control parameter will affect the perceptual processes and produce a qualitative change in perception. Second, the change in perception may be described and modeled as a phase transition between two stable states, or attractors, of the perceptual system. Finally, the macroscopic system must be understood as emerging from the interactions among its subsystems.

One initial attempt (Tuller et al., 1994) presented listeners with stimuli ranging along an acoustic continuum between the English words say and stay (created by manipulating the duration of the silent gap between the fricative and the diphthong). Stimuli were presented to the listener in either a randomized order or a sequential order. In the latter case, listeners heard the entire set of stimuli twice, going from one of the two endpoints (e.g., say) to the other (stay), and then back again to the first one (say). The listeners' task was simply to identify each stimulus as one of these two words. Note that with this ordering of the stimuli it is possible to control the state of the perceptual system before each stimulus is presented. With sequential presentation of stimuli, a listener starts by perceiving one word; then, at a given point in the continuum, he or she switches to the perception of the alternative word. In the second half of the sequence, when stimuli are presented in the reversed direction, a second switch is observed when the listener switches back to the perception of the initial word.
With sequential presentation, three response patterns are possible: (a) a critical boundary, where the switch between the two percepts is associated with the same stimulus regardless of whether the silent gap is successively increasing or decreasing; (b) hysteresis, defined as the tendency for the listener to hold on to the initial categorization so that their response persists as the stimuli move closer to the other endpoint; and (c) contrast, in which the listener quickly switches to the alternate percept and does not hold on to the initial categorization. The results showed that


a critical boundary was much less frequent than hysteresis or contrast, which occurred equally often. Even with randomized presentation of stimuli there was a strong conditional probability (context) effect for mid-range stimuli, suggesting that short-term dynamic effects were present. Results were virtually identical for monolingual French speakers' categorization of speech sounds, using a stimulus continuum that varied on the same acoustic dimension but that ranged from cèpe, a type of mushroom, to steppe, a plain without trees (Nguyen, Wauquier-Gravelines, Lancia, & Tuller, 2007; Nguyen et al., 2005).

A macroscopic model of the dynamics of categorization was developed based on these data. The model involved a single process that evolves over time and includes two complementary aspects: On the one hand, speech perception is assumed to be a highly context-dependent process sensitive to the detailed acoustic structure of the speech input. On the other hand, it is viewed as a nonlinear dynamical system emerging from the interactions between the underlying processes and characterized by a limited number of stable states, or attractors, which may be identified with abstract symbolic categories. As with speech production, the concept of stability is fundamental to this approach to understanding speech perception. Can changes in categorization in speech perception also be understood as a pattern formation process involving non-equilibrium phase transitions? In this section, after a schematic presentation of the Tuller et al. (1994; referred to as TCDK) model, we describe how it captured some observed behaviors and predicted others, but also left important aspects of behavior unspecified. These were addressed by later work aimed at understanding how the dynamics of the TCDK model emerge from the interactions observed in a classical neural network design in which competition, habituation, and learning dynamics are implemented (Lancia, 2009; Lancia & Winter, 2013).

6. An early model of perceptual dynamics

Is a nonlinear dynamical system required to account for the observed behavior of listeners? A static model that associates each perceptual category with a distinct range of acoustic parameters could account for the observation of hysteresis, contrast, and a critical boundary occurring under the same stimulus and task conditions if one assumes that perception is tuned for a critical boundary but that its outcomes are noisy near that boundary. A dynamical systems account might be more appropriate if it makes predictions about perceptual or categorization effects that could not be explained within a static model with noise. In our conceptualization, phonological categories are equivalent to attractors (stable behaviors of the system). After the presentation of a stimulus the listener’s categorization is defined when the system reaches a stable state. The perceived categorization switches when, as a consequence of a change in a control parameter, the attractor corresponding to the first categorization disappears and a new, alternative attractor dominates. Thus, there exist ranges of acoustic parameter variation within which the perceptual form remains relatively stable (i.e., resistant to change as a function of parameter variation or noise). In other ranges, however, even small variations in the acoustic parameter can cause large (nonlinear) changes in categorization of the input and


the likelihood of change is enhanced in the presence of noise. At critical values, which are sensitive to context, history, linguistic factors, etc., the existing attractor(s) lose stability and the observed behaviors may change gradually or abruptly as other attractors dominate.

A behavior like hysteresis indicates that in the first half of a sequential run, the most ambiguous stimuli of the continuum are perceived differently than in the second half of the run. This can be easily explained if the presentation of an ambiguous stimulus generates two (or more) perceptual attractors instead of only one. After the presentation of the first stimulus in a sequential run, the system lives in one attractor and will only leave it when that attractor is no longer a stable (viable) alternative given the input. Multistability (the coexistence of several stable attractors) is not a characteristic of linear systems; it requires nonlinearity. A first pass at modeling the switching between categories as the appearance and disappearance of attractive states in the underlying nonlinear dynamical system is presented as Eq. (1):

V(x) = kx − x²/2 + x⁴/4    (1)
In this equation, x represents the perceptual form (say or stay for Case et al. (1995) and Tuller et al. (1994); cèpe or steppe for Lancia (2009) and Nguyen et al. (2005, 2007)), k is a control parameter (here, gap duration after the fricative; speaking rate in the syllabification experiments described above), and V(x) is a potential function that may have up to two stable perceptual forms, indicated by minima in the potential, depending on the value of k. Fig. 2 shows the shape of the potential function for five values of k between −1 and 1. The potential function depends on the value of k, and its shape makes it possible to predict the future behavior of the variable x given its current value: x always changes in the direction that reduces V(x) and ultimately gets trapped in a local minimum, or attractor, of V(x). Each of the two possible responses in our categorization task corresponds to one attractor in the perceptual space. The potential function has only one minimum for extreme values of k, which correspond to stimuli unambiguously associated with one of the two categories, but two minima in the middle range of k, where both categories are possible. As k increases in a monotonic fashion (from left to right in Fig. 2), in the vicinity of a critical value kc the system's state, represented by the filled circle in Fig. 2, abruptly switches from the basin of attraction in which it was initially located to the second basin, which has gradually formed as the first one disappears.

If the change of the parameter k is small from one stimulus to the next, the tilt of the potential will also change slowly and there will be several ambiguous stimuli producing a bistable potential. When this happens, in each direction of change of the control parameter the system remains in the same attractor throughout the ambiguous region of the continuum, producing hysteresis.
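This hysteresis mechanism can be sketched numerically (our own illustration, not code from the cited papers): let the state x slide downhill on the potential of Eq. (1) while the tilt k is swept up and then back down. The step size, iteration count, and sweep grid are arbitrary choices.

```python
# Numerical sketch of Eq. (1): the perceptual state x slides downhill
# on V(x) = k*x - x^2/2 + x^4/4, i.e., dx/dt = -(k - x + x^3).
# Sweeping k up and then back down shows hysteresis: the switch between
# the two minima happens at different k values in the two directions.

def settle_x(x, k, steps=4000, dt=0.01):
    # Euler integration of the gradient descent on V(x).
    for _ in range(steps):
        x -= dt * (k - x + x**3)
    return x

ks = [i / 10.0 for i in range(-10, 11)]   # tilt k swept from -1.0 to 1.0
x = settle_x(0.5, ks[0])                  # start in the attractor favored at k = -1
up = []
for k in ks:                              # sweep k upward
    x = settle_x(x, k)
    up.append(x)
down = []
for k in reversed(ks):                    # sweep k back down
    x = settle_x(x, k)
    down.append(x)
down.reverse()                            # align with ascending ks

# In the bistable mid-range (e.g., k = 0) the two sweeps disagree:
# the state stays in whichever basin it already occupies.
print(round(up[10], 2), round(down[10], 2))
```

At the extreme values of k the two sweeps agree (only one attractor exists there); in the mid-range they disagree, which is exactly the hysteresis region described in the text.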
If the rate of change of k increases, the rate of change of the potential tilt will be high enough so that no stimulus is bistable and a critical boundary is observed. Although this system adequately accounts for the observation of hysteresis and a critical boundary, it cannot produce contrast. In order to do so, we needed to recognize the decades of research showing perceptual dependence on

Please cite this article in press as: Tuller, B., & Lancia, L. Speech dynamics: Converging evidence from syllabification and categorization. Journal of Phonetics (2017), http://dx.doi.org/10.1016/j.wocn.2017.02.001


B. Tuller, L. Lancia / Journal of Phonetics xxx (2017) xxx–xxx

Fig. 2. Shape of the potential function V(x) for the five values of k. Adapted from Tuller et al. (1994).

experience and learning and allow the temporal dynamics of k to depend not only on the acoustic characteristics of the stimulus but also on the time-dependent effects of learning, linguistic experience, and attention, as described by Eq. (2):

k(λ) = k0 + λ + ε/2 + ε h(n − nc)(λ − λf)        (2)

In Eq. (2), k0 refers to the initial tilt of the system's potential (the "tilt" corresponding to the percept at the beginning of a run), λ represents the acoustic parameter that changes across stimuli, ε characterizes the lumped effect of learning, linguistic experience, and attention (the division by 2 simply makes the function resolve more quickly), h is the discrete form of the Heaviside step function, n is the number of perceived stimulus repetitions in a given run, nc represents a critical number of accumulated repetitions, and the subscript f denotes the value of λ at the other extreme from its initial value (the final value). The Heaviside step function h equals 0 when n < nc (during the first half of each sequential run), so that k depends on the initial categorization, the gap duration, and the lumped effects of various cognitive factors represented by ε. For higher values of ε, the system shows a larger change in behavior for the same amount of acoustic change between stimuli, compatible with the idea that ε is positively correlated with experience with the stimuli, with the task, and with attention, and negatively correlated with fatigue. When n > nc (during the second half of each sequential run), then h = 1. This produces a larger change in the tilt k for each step change in gap duration λ than in the first half of the run. In this way, the same degree of acoustic difference between two consecutive stimuli produces a bigger change of the potential in the second part of the sequence than in the first. Since the outcome of the step function is multiplied by ε, the rate of change of the potential function increases even more in the second half of the sequence when the experience with the stimuli and the task has increased. The net result is to allow hysteresis (for sufficiently low values of ε), a critical boundary (for average values of ε), and contrast (for high values of ε) to occur within the same dynamical system.
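As a concrete illustration, the fragment below evaluates a tilt function of this form over one increasing run (all parameter values are invented for illustration, not fit to data). Before the critical repetition count the tilt tracks the acoustic parameter one-for-one; after it, the Heaviside term adds an extra contribution, so the same acoustic step moves the tilt further:

```python
import numpy as np

def tilt(lam, n, k0=0.0, eps=0.2, nc=10, lam_f=1.0):
    # Eq. (2): tilt k of the potential as a function of the acoustic
    # parameter lam and the repetition count n within a run.
    h = 1.0 if n > nc else 0.0          # discrete Heaviside step
    return k0 + lam + eps / 2 + eps * h * (lam - lam_f)

# One increasing run: lam sweeps from -1 to 1 in equal acoustic steps
lams = np.linspace(-1.0, 1.0, 21)
ks = [tilt(lam, n) for n, lam in enumerate(lams)]

# Change in tilt per acoustic step, away from the n = nc transition
deltas = np.diff(ks)
early = float(deltas[:9].mean())    # first half of the run (h = 0)
late = float(deltas[-9:].mean())    # second half of the run (h = 1)
```

With these values each acoustic step of 0.1 moves the tilt by 0.1 in the first half of the run but by 0.12 in the second half, reproducing the faster potential change that yields contrast for large ε.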
Signature properties of nonlinear dynamical systems (e.g., hysteresis) were observed for both French and English listeners in that the critical point for switching in any given trial depended on the direction of changing gap duration in the stimulus sequence and the initial percept. However, the TCDK model also makes specific predictions due to the role assigned to cognitive factors (whose joint effect is modeled through the parameter ε) in shaping the perceptual dynamics. In order to

appreciate how these predictions were supported by the data, some additional consideration of the notion of stability is required. Earlier, we noted that in this dynamical account, phonological categories are equivalent to attractors (stable behaviors of the system) and switching between phonological categories means changes in the relative stability of the attractors (a change in the control parameter makes one attractor unstable and the other progressively stable). But we didn't unpack what is meant by "stable." A behavior that is stable is rapidly displayed (i.e., the corresponding attractor is rapidly reached) and is resistant to change even in the presence of reasonable levels of noise (noise is always part of biological systems). The system takes longer to reach a less stable attractor state/behavior and perceptual changes are far more likely given the same level of system noise. One prediction of the model concerns these effects of noise on listeners' categorizations. Approaching a perceptual switch, the initially stable attractor is progressively weakened by the change of the control parameter. Therefore, fluctuations in both the listeners' responses (sequences of consecutive changes in the perceived category) and in response times are expected to increase as the switch point is approached (confirmed by Lancia (2009), Lancia, Nguyen, and Tuller (2008), and Lancia and Winter (2013)). Another prediction of the TCDK model, stemming from the role assigned to the cognitive factors affecting ε, is that the stimulus presented first in the sequence will be less stable than the acoustically identical stimulus presented at the end of the sequence (due to the larger change in tilt of the potential for each step change in the acoustic stimulus when h = 1). Case et al. (1995) confirmed this rather odd prediction, using judged goodness as an index of the stability of the attractor.
Finally, since the amounts of hysteresis and contrast are determined by the value of the parameter ε, the likelihood of hysteresis should decrease and the likelihood of contrast should increase as the subject gets more experienced with the task and stimuli. These predictions were also borne out in several experiments (e.g., Case et al., 1995; Lancia, 2009; Nguyen et al., 2005, 2009; Tuller, 2004). While these data provide strong support for the idea of speech perception as a process characterized by a rich variety of dynamical properties, the details of the model are unsatisfying in several respects. Especially unsatisfying are ε and the step function, h. The ε term includes the lumped effects of several cognitive factors (experience, learning, attention, etc.) over the long term, making no distinction among factors and


offering no explanation of how these effects might change over time. The step function was included to allow the model to produce contrast, but this was an operational shortcut that glosses over the dynamics of the rate of change of the potential. In the next section, we describe a more recent model by Lancia and Winter (2013) that explicitly takes into account the dynamics of these processes, which occur on radically different time scales.

7. A component-level model of perceptual dynamics

Lancia and Winter (2013; hereafter referred to as LW) proposed an explanation for the emergence of the dynamics captured at the macroscopic level by the TCDK model. The LW model accounts for the same patterns of categorization as does the TCDK model, but unpacks components such as experience, learning, and attention that unfold on different time scales. As shown in the preceding section, the core dynamics of the TCDK model are summarized by its potential function whose changing shape determines the time course of perceptual categorization and is affected both by the stimulus acoustics and by cognitive factors. The potential changes from one stimulus to the next, due to the influence of the acoustic parameter on fast perceptual processes. However, the potential also changes from one half of a sequential run to the other half (otherwise contrast would not be modeled) and from one trial to the next (accounting for longer term minimization of hysteresis and enhancement of contrast). These interactive effects suggest that fast perceptual processes are modulated by other processes that occur on (at least) two additional time scales. In LW (2013) the fastest timescale is that of competition (see also Amari, 1977; Grossberg, 1973, 1978; Usher & McClelland, 2001). Perceptual categories are each associated with the activity of a node in the competition; when the node exceeds an activation threshold, that percept is experienced. When a stimulus is presented, it will preferentially activate a node so that its activation grows faster than others. But nodes are mutually inhibitory (competition), each inhibiting the growth of the other. Each node also includes a self-recurrent activation that sustains its own activity more strongly as its activity increases. The self-recurrent signal has a damping term to limit activation over time; after stimulus offset, activations decay. 
But if the next stimulus occurs without too long a delay, the previously most active node is still somewhat active because its stronger recurrent signal has not yet completely decayed, leading to a bias for how the second stimulus will be categorized. Note that this differential delay in activation biases perceptual categorization toward the previously perceived category, resulting in hysteresis. Acting in parallel with the competition process is a habituation process, which occurs on a somewhat slower timescale (Grossberg, 1973; Kawamoto & Anderson, 1985; Schöner & Thelen, 2006; but also see Köhler, 1940; Köhler & Wallach, 1940). In the habituation process, sustained activation of a node during a long interval of time leads to a decrease in the sensitivity of that node to external stimuli, attenuating the effects of bottom-up activation. The net effect is to bias categorization away from the previously perceived category when the same stimulus is repeatedly presented (as in selective adaptation; e.g., Eimas & Corbit, 1973). In sequential presentation, habituation biases
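A deliberately minimal sketch of the competition layer (two leaky, mutually inhibitory nodes in the spirit of Usher & McClelland-style accumulators; all weights and inputs are invented, and the habituation and learning processes are omitted) shows how residual activation produces the hysteresis bias:

```python
def relu(x):
    return max(x, 0.0)

def run(a, b, input_a, input_b, steps, dt=0.05, leak=1.0, inhib=2.0):
    # Leaky competition: each node is driven by its input, decays, and is
    # inhibited by its rival's (rectified) activation.
    for _ in range(steps):
        da = input_a - leak * a - inhib * relu(b)
        db = input_b - leak * b - inhib * relu(a)
        a, b = a + dt * da, b + dt * db
    return a, b

# Unambiguous stimulus favoring category A: node A wins the competition
a, b = run(0.0, 0.0, 1.0, 0.2, 400)

# Stimulus offset: both activations decay, but A keeps a residual advantage
a, b = run(a, b, 0.0, 0.0, 40)

# An ambiguous stimulus then arrives: the residual activation tips the
# competition toward A again; this bias is the mechanism behind hysteresis
a, b = run(a, b, 0.6, 0.6, 400)
```

In this caricature, identical inputs to the two nodes still produce a decisive A win because node A enters the competition with leftover activation from the previous stimulus; adding the habituation process described below would gradually reverse that bias.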


categorization toward the alternative, hastening the perceptual switch so that the contrastive pattern occurs. Habituation is a reversible process: when a node starts losing the perceptual competition, its sensitivity to external stimuli gradually recovers. The slowest process changing connection strengths is that of associative learning (Grossberg, 1978; Hebb, 1949; McClelland, Mirman, & Holt, 2006). If a node has won the competition, connections from the input layer that contributed to its winning are slowly strengthened, gradually increasing the node's sensitivity to the current stimulus and to its neighbors. One effect of associative learning is that the input signal, as filtered by the connection weights, will be biased toward one of the subsystems, reducing the number of acoustic stimuli that contribute equally to the activation of the two subsystems. In other words, learning reduces the range of ambiguous stimuli and thus reduces the size of the bistable region. The LW model also nicely accounts for the dependence of learning on an individual's initial attractor configuration (Tuller, 2007; Tuller, Jantzen, & Jirsa, 2008), accords with interactive activation models incorporating perceptual biases (Vallabha & McClelland, 2007), and captures long-term adaptation effects (e.g., Eisner & McQueen, 2006; Kraljic & Samuel, 2005). More generally, it supports the idea that dynamical processes acting on different time scales are needed to reproduce speech categorization behavior.

8. Perceptual dynamics of illusory categorization

To recap, we have tried to underscore the adaptive ability of the speech perceptuomotor system across different domains, describing adaptive flexibility as a dynamic system that evolves, finding stable solutions (conceptualized as attractors) in the system dynamics. Nevertheless, we are left with several grounds for pause. In particular, both the production and perception work described include a very restricted range of phonetic forms in each experiment, albeit from two languages. Speech production and perception are of course far richer in phonemic content. This is evident not only from natural conversation but is exemplified by a phenomenon called the verbal transformation effect (VTE), first identified by Warren and Gregory (1958), who were exploring an acoustic analog of visual reversible figures. In vision, an ambiguous figure is displayed for some period of time and the viewer experiences reversals despite the unchanging display. Notable examples are Necker cubes and the face-vase illusion. Unlike in vision, auditory events do not occupy very long stretches of time. So Warren and Gregory decided to repeat an auditory stimulus rapidly and discovered that listeners undergo illusory transformations of the stimulus. In audition as in vision, illusions are often seen as windows into the processes that operate during veridical perception. Warren (1968) described illusions as temporary lesions of the perceptual system that reveal its inner workings. The VTE is at its strongest when the stimulus is repeated quickly (about 2/s). Initially, the veridical percept (the intended word) is heard, but at some point illusory changes begin to be perceived. For example, a subset of transformations reported in response to the word "ripe" includes "right," "ride," "rife," "life," "bright," "rape," and "wife" (Warren, 1961). Although some transformations can be small phonetic deviations from the veridical percept,


as in the examples above, others can be very far from the stimulus (e.g., reporting "seven" when listening to repeated /kI/). The VTE is one area of perceptual adaptation where there is the potential to explore whether the dynamical models described can be expanded to situations where there is switching among many alternatives. Most investigations of the VTE have focused on the phonemic, lexical, or semantic relationship between the stimulus and the perceived transformations (Clegg, 1971; Goldstein & Lackner, 1973; Lackner & Goldstein, 1975), while others (e.g., Pitt & Shoaf, 2002) concentrated on the causes of transformations, viewed, for example, as perceptual regrouping and auditory streaming. Ditzinger and colleagues took a different tack and explored the dynamics of switching among transformations over time (Ditzinger, Tuller, Kelso, & Haken, 1997b). They found that the stimulus presented to the listener acts as an anchor or attractor and is the most reported perceptual form for all listeners. But listeners also heard many perceptual changes over the course of each trial, sometimes perceiving as many as 27 distinct forms in a single trial. These perceptual changes can move fairly far from the original acoustics (e.g., the earlier mentioned "seven" for /kI/) but they tend to get there in small steps ("kih-keh-kev-sev-sevn"). They also observed that as perception gets further from the original acoustics, it does so via pairwise coupling between alternatives until one of the alternatives changes. Moreover, sustained oscillations between two transforms are far more frequent than expected on the basis of a random arrangement of alternatives. These two-form alternations show a faster and more stable dynamic than that observed in the far rarer cases of cycling among more than two perceptual forms.
This suggests a coupling mechanism underlying the VTE, much like that proposed for the alternation of visual reversible figures (Ditzinger & Haken, 1989, 1990), with local bistability prevailing despite global multistability. Ditzinger, Tuller, Haken, and Kelso (1997a) modeled the pattern of illusory changes in the VTE using a dynamical model that is an extension of Ditzinger and Haken's (1989, 1990) synergetic model of the perceptual oscillations of visual ambiguous figures. The main extension is connected to the large number of reported alternative phonemic forms (compared to the small number of alternative visual forms). The importance of the model in the current context lies in its three interacting processes, each having a different predominant time scale. These processes are (1) competition among states, with a bias from the acoustics; (2) a time-dependent saturation of attention (habituation); and (3) associative memory/learning between the input parameters (input nodes) and stored patterns. This should sound familiar. The LW model (2013) used the same three interacting processes with three time scales of influence to describe switches in perceptual categorization of speech stimuli.

9. Do we need multiple time scales to explain speech categorization?

There is a considerable amount of work showing that the three time scales identified by Ditzinger et al. (1997a) and by Lancia and Winter (2013) play a role in regulating human behavior. Competition between alternative behaviors on a fast time scale of fractions of a second has been shown to be a useful

model of decision making in perceptual choice tasks (e.g., Usher & McClelland, 2001) and motor preparation (Erlhagen & Schöner, 2002). Habituation, a process occurring on a slower time scale of seconds, leads to a decrease in response to stimuli (e.g., Fischer, Furmark, Wik, and Fredrikson (2000) for habituation to visual stimuli and Dycus and Powers (2000) for habituation to auditory and tactile stimuli) and is observed in both behavioral and brain (neuroimaging) responses (e.g., Pantev et al., 2004). Learning, occurring on an even slower time scale, is a widely acknowledged process by which we develop new skills and tune our behavior to the environment. Interestingly, these three time scales are also relevant to the classification of neurophysiological processes underlying the Free Energy theory of brain functioning proposed by Karl Friston and colleagues (see Friston, Kilner, & Harrison, 2006 and references therein). The conception of perceptual categorization as arising from the interactions between processes occurring on multiple time scales has recently been challenged by studies in which perceptual choice is formalized as Bayesian inference under uncertainty and learning is formalized as a belief update rule (Kleinschmidt & Jaeger, 2015, 2016; Moulin-Frier & Arbib, 2013). Those studies assume that a listener categorizes speech sounds by choosing the category with the highest probability given the current values of the cues and his or her a priori belief that the category will be produced. This posterior probability is computed by combining the listener's a priori expectations that the category will be produced with the likelihood that the cues' values are observed, given that the category has been produced. On this basis, the listener can predict which are the most probable values of a cue, given that a category has been produced.
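The decision rule just described (prior expectations combined with cue likelihoods) can be written out in a few lines; the category means, variances, and priors below are invented for illustration:

```python
import numpy as np

def posterior_choice(x, priors, means, variances):
    # p(category | cue) is proportional to p(category) * p(cue | category);
    # the listener picks the category with the highest posterior.
    likelihoods = (np.exp(-(x - means) ** 2 / (2 * variances))
                   / np.sqrt(2 * np.pi * variances))
    return int(np.argmax(priors * likelihoods))

means = np.array([0.0, 1.0])        # cue means for categories A (0) and B (1)
variances = np.array([0.04, 0.04])  # within-category cue variances
flat = np.array([0.5, 0.5])         # unbiased prior expectations

cat_low = posterior_choice(0.2, flat, means, variances)    # near A's mean
cat_high = posterior_choice(0.8, flat, means, variances)   # near B's mean

# With a strong prior expectation for A, a fully ambiguous cue (0.5),
# where the likelihoods are equal, is resolved by the prior alone
cat_biased = posterior_choice(0.5, np.array([0.9, 0.1]), means, variances)
```

Note that this rule by itself has no temporal dynamics: identical cues always yield identical choices unless the priors or the category distributions are updated, which is where the belief update rule discussed next comes in.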
In this class of models, learning can be implemented by updating the distribution of the cue values associated with each category. When new stimuli are identified as tokens of a category, the model updates its knowledge of the distribution of cue values for that category and its predictions change accordingly. In this approach, the temporal dimension of the processes underlying the perceptual choice is ignored and there is no habituation process. Nevertheless, with a simplified belief updating model Kleinschmidt and Jaeger (2015) reproduced both perceptual recalibration and selective adaptation, two perceptual phenomena that share similarities with hysteresis (in LW modeled as an aftereffect of perceptual competition) and contrast (in LW modeled as an aftereffect of habituation triggered by learning). Perceptual recalibration is observed after a listener has identified several ambiguous stimuli from an acoustic continuum as tokens of the same category. Subsequent ambiguous stimuli will also tend to be perceived as tokens of that category. If a belief updating mechanism is at work, the mean of the distribution for the identified category will shift toward the ambiguous stimuli and the category boundary will shift toward the other category. Like hysteresis, this is a positive perceptual aftereffect: after perceiving several instances of one category the likelihood of perceiving more instances of that category increases. Selective adaptation is observed after a listener has identified several unambiguous stimuli from an acoustic continuum as tokens of the same


category; subsequent ambiguous stimuli tend to be perceived as tokens of the alternative category. In this case, the belief updating mechanism narrows the distribution of values of the perceptual cue associated with that category (by reducing its variance). Like enhanced contrast, this is a negative aftereffect: after perceiving several instances of one category, the likelihood of perceiving more instances of that category decreases. Kleinschmidt and Jaeger (2015) were able to simulate the time course of these effects as reported by Vroomen, van Linden, De Gelder, and Bertelson (2007), who ran identification experiments with audiovisual stimuli. While the negative aftereffect increases asymptotically with exposure to the stimuli, the positive aftereffect increases toward a peak value and then slowly vanishes as exposure to stimuli increases. This evolution over time is not incompatible with that of hysteresis, which tends to diminish after one or a few ordered sequences of stimuli, and contrast, which strengthens with increasing exposure to the stimuli. The proposal that a single learning process can explain the behavioral complexity observed in speech categorization experiments is a tempting idea. Bayesian models can also account for the hierarchical organization of abstract knowledge while showing sensitivity to the details of speech patterns. It is therefore reasonable to ask whether a Bayesian approach allows us to abstract away from the dynamics of faster time scales and explain perceptual aftereffects as due to processes responsible for knowledge organization and storage. Attempting to provide a definitive answer is outside the scope of this paper. Moreover, the simulations produced in the two frameworks cannot be directly compared because the two implemented models are based on different simplifying assumptions (justified by the different constraints imposed on the perceptual system by the experimental designs).
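Both aftereffects can be sketched in this framework. In the fragment below (all numbers invented), recalibration is implemented as a conjugate normal update of the believed mean of category A after ambiguous tokens are labeled A, while selective adaptation is rendered simply by narrowing A's believed cue variance by hand rather than deriving it from a full update rule; the category boundary then shifts in opposite directions:

```python
import numpy as np

def boundary(mu_a, s2_a, mu_b, s2_b):
    # Category boundary: the cue value (between the two means) where the
    # Gaussian log-likelihoods of the two categories cross (equal priors),
    # located numerically on a dense grid.
    xs = np.linspace(mu_a, mu_b, 10001)
    ll_a = -(xs - mu_a) ** 2 / (2 * s2_a) - 0.5 * np.log(s2_a)
    ll_b = -(xs - mu_b) ** 2 / (2 * s2_b) - 0.5 * np.log(s2_b)
    return float(xs[np.argmin(np.abs(ll_a - ll_b))])

mu_a, mu_b, s2 = 0.0, 1.0, 0.04
b0 = boundary(mu_a, s2, mu_b, s2)        # symmetric case: boundary at 0.5

# Recalibration (positive aftereffect): ambiguous tokens (cue = 0.45) are all
# labeled A; a conjugate normal update pulls the believed mean of A toward
# them, so the boundary moves toward B.
mu, tau2 = mu_a, 0.1                     # prior belief about A's mean
for x in [0.45] * 8:
    mu = (mu / tau2 + x / s2) / (1 / tau2 + 1 / s2)
    tau2 = 1 / (1 / tau2 + 1 / s2)
b_recal = boundary(mu, s2, mu_b, s2)

# Selective adaptation (negative aftereffect): unambiguous A-tokens narrow
# the believed variance of A's cue distribution (imposed by hand here), so
# the boundary moves toward A and ambiguous tokens flip to B.
b_adapt = boundary(mu_a, 0.01, mu_b, s2)
```

With these values the boundary moves from 0.5 to roughly 0.71 after recalibration and to roughly 0.35 after variance narrowing, matching the positive and negative aftereffects described above.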
However, a number of considerations may help orient future research aimed at comparing the two approaches. First, accounting for both hysteresis and contrast may be challenging for a model based on a belief update mechanism. The occurrence of hysteresis or contrast does not depend on the distribution of the acoustic cues in the stimuli but on the order of presentation of stimuli (cf. Section 5). In contrast, the occurrence of positive and negative aftereffects, as modeled through belief update, depends also on the shape of the distribution of the perceptual cue values in the stimuli that condition the learning process. A second point concerns the dependency of the learning process on the initial configuration of the perceptual space. This may be a challenging behavior to simulate for models driven by a single time-varying process. Lastly, belief updating does not readily explain an important difference between perceptual recalibration and selective adaptation/habituation effects. The effect of perceptual recalibration lasts over time. Speakers asked to evaluate the same stimuli the day after their initial exposure maintain the same categorization bias (e.g., Eisner & McQueen, 2006; Kraljic & Samuel, 2005). In contrast, selective adaptation appears to be a far more transient phenomenon, vanishing after only half an hour if listeners are no longer exposed to the stimuli (Eimas & Corbit, 1973). It is unclear how this difference might be accounted for in models based on belief updating rules, whose parameters can change only after the perception of a stimulus.


10. Perceptuomotor integration

The models discussed above are largely concerned with perceptual categorization and change. The question arises, how do we ever hear the phonetic content of our own productions veridically? The explanation is typically couched in terms of an internal model, such as a forward model processing the efference copy of a motor command. The idea is that a motor command is sent to an actuator system and to a forward model enabling the central nervous system to anticipate the consequences of motor commands for forthcoming movements and for the sensory feedback resulting from those movements. Early forms of forward models were developed for the oculomotor system. For example, von Holst and Mittelstaedt (1973) attributed the perceptual stability of the visual world during voluntary eye movements to a central comparator relating the predictions of the forward model based on an efference copy of the neural command signal sent to the extraocular muscles with the sensory feedback (“reafference”) from the retina contingent on the movement. In the event the commanded movement and the resulting movement coincided in magnitude and expected sign, the visual world would seem stable; otherwise it would be seen to jump. This simple model explains why the visual world appears to move when the eye is passively displaced: the retinal displacement does not match the predictions of the forward model. It also explains why an afterimage appears to move in the direction of an eye movement: the prediction based on the efference copy indicates the direction and magnitude of the change in eye position while the unchanging retinal signal indicates that the stimulus is linked to eye position. Similar prediction mechanisms have been proposed for limb motor control (e.g., Kawato, 1999; Miall & Wolpert, 1996), stable visual perception during saccades (e.g., Sommer & Wurtz, 2008), cognitive motor awareness (e.g., Desmurget & Sirigu, 2009; Grush, 2004), and speech (Lackner, 1974; Tian & Poeppel, 2010). 
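The comparator at the heart of this account can be rendered as a toy function (signs, units, and the identity mapping from command to predicted reafference are all illustrative simplifications):

```python
def perceived_world_shift(commanded_shift, retinal_shift):
    # Toy comparator: the forward model predicts that a commanded eye
    # movement shifts the retinal image by the same amount; any residual
    # between actual and predicted reafference is attributed to motion
    # of the world.
    predicted_reafference = commanded_shift
    return retinal_shift - predicted_reafference

# Voluntary saccade: prediction matches reafference, world perceived as stable
saccade = perceived_world_shift(5.0, 5.0)
# Passive eye displacement: no efference copy, so the world appears to jump
passive = perceived_world_shift(0.0, 5.0)
# Afterimage: the retinal signal is fixed to the eye, so a nonzero residual
# signals apparent motion during the eye movement
afterimage = perceived_world_shift(5.0, 0.0)
```

A zero residual corresponds to perceived stability; any nonzero residual is experienced as motion of the world, reproducing the three cases discussed in the text.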
In the context of the VTE, if one repeats a speech sound aloud to oneself, there may be an efferent representation of the intended speech sound that is compared with the returning speech sound, keeping perception stable. By contrast, if one listens to a tape recording of the sounds one had actively spoken, perceptual changes are commonplace (Lackner, 1974). Recently, Tian and Poeppel (2014) provided what may be electrophysiological (MEG) evidence for auditory efference copies. In that work, they asked whether an internally generated representation, elicited by mental imagery, could act as an adequate adaptor for a subsequent overt probe stimulus. Adaptation, a form of neural habituation leading to the reduction in neural activity after repetitions of a stimulus, can be used to probe the functional specificity of neural assemblies because previous experience modulates the response properties of neural populations (Grill-Spector & Malach, 2001; Henson, 2003). This modulation depends on the relation between the two stimuli, so that a very similar adaptor and probe will result in suppression of the activation for the probe. Tian and Poeppel (2014) cleverly included imagined speech as an adaptor, in addition to overt speech, to avoid the issue of overlapping activity between auditory and motor speech areas. Two conditions were especially intriguing. In one, subjects mimed articulating a syllable over and over but without producing any sound. In another condition, they


imagined articulating the syllable without any sound. After some time, a probe stimulus, either the repeated syllable or a novel one, was presented. Tian and Poeppel reasoned that since articulation (both actual and imagined) during the adaptation time had no sound output, auditory estimation should be defined by the putative efference copy and internal forward models. This motor-based prediction should prime the subsequent neural processing of the actual sound, resulting in repetition enhancement of the M200. This is indeed what they observed. The top-down internally induced neural representation modulated neural responses to bottom-up (acoustically induced) perception. Tian and Poeppel (2014) interpret their results as neural evidence for an efference copy mechanism in speech. Although such internal comparators might allow adaptive flexibility by triggering recalibration of the internal forward models, they still leave many questions unanswered. Thanks to the predictions of the forward model, efference copies specify configurations of the motor system and of the perceptual input consequent to the execution of some motor command, but they do not specify the motor command itself. A configuration of the motor system and of the acoustic output can be generated through different motor commands. In order to choose the right motor command, the forward model must be complemented by an inverse model. The inverse model computes the right motor command given the current state of the environmental variables and the state of the motor system by minimizing some quantity defining a cost to be paid in the execution of the motor command. But what quantity is being minimized? How are the cost functions and their dependency on the contextual variables learned? In a multidimensional system, what are the dimensions of comparison? Given that the relevance of the physical dimensions depends on both the task and the contextual variables, which mechanisms permit their selection? 
And how does the system know how to adjust for an error in a multidimensional system? As the environmental context changes, how is contextual updating accomplished? Moreover, architectures based on internal models predict strongly stereotyped trajectories, whereas trial-to-trial variability is the norm in speech production (and in motor control in general). Some of these issues have been addressed within the framework of optimal feedback control theory (see Todorov and Jordan (2002) and Houde and Nagarajan (2011) for an application to speech), in which perceptual feedback, the sensory consequences of motor commands, is used to update the motor commands and correct for possible perturbations. In the execution of rapid tasks, whose temporal requirements are incompatible with the delays that affect perceptual feedback, internal forward models permit estimating the current state of the plant (the motor system plus the environment). The estimate is used as a surrogate for perceptual feedback in order to compute an updated motor command that can counterbalance the effects of noise and external perturbations. If the movement goal is defined only in terms of some task-relevant dimension (e.g., bringing the distance between the two lips to zero for production of a bilabial plosive), the system can compensate for error observed in the task-relevant dimension by activating several task-redundant ones. For example, the levels of muscle activity controlling the lips and jaw are

task-redundant dimensions for the goal of minimizing the distance between the lips, but they still affect the outcome. Such a control schema, in which speech production is coupled with a predictive surrogate of perceptual feedback, reduces the variability observed in task-relevant dimensions by spreading it over the degrees of freedom available via task-redundant dimensions. In this respect, optimal feedback control theory augments the predictive power of internal models with the self-organizing properties of coupled dynamical systems (Shim, Latash, & Zatsiorsky, 2003; Todorov & Jordan, 2002). Indeed, structured variability, displaying both compression in the task-relevant dimensions and dilation in task-redundant dimensions, is usually explained by assuming that available degrees of freedom (e.g., the positions of the speech articulators) can be mutually coupled in a task-specific fashion to behave as if they were governed by a low-dimensional system stabilizing the behavior of a few task-relevant dimensions at the expense of introducing variability in task-redundant ones (Kelso, Tuller, Vatikiotis-Bateson, & Fowler, 1984; Schöner, Martin, Reimann, & Scholz, 2008). A further step in the integration of predictive models with principles of dynamical systems is represented by the active inference theory (Friston, 2011). Friston (2011) proposed that the sensorimotor system is organized in a hierarchy of predictive models processing information at different levels of abstraction, with each level informing the levels below and above. The idea is that our knowledge is structured in such a way as to reflect the causal relations between the dynamics in our environment. A hierarchy of predictive models can be used both to produce and to perceive speech.
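The structured-variability idea above (compression along the task-relevant dimension, dilation along task-redundant ones) can be illustrated with a minimal simulation (invented gains and noise levels; a caricature of feedback control rather than a full optimal-feedback-control model). A controller corrects only the task-relevant sum of two articulator displacements, leaving their difference as an uncorrected random walk:

```python
import numpy as np

rng = np.random.default_rng(0)
target = 1.0   # desired lip aperture: sum of upper- and lower-lip displacements
gain = 0.8     # feedback gain applied only to the task-relevant error

finals = []
for _ in range(200):
    upper, lower = 0.5, 0.5
    for _ in range(50):
        # Independent motor noise on each articulator
        upper += rng.normal(0.0, 0.05)
        lower += rng.normal(0.0, 0.05)
        # Correct only the task-relevant dimension (the sum), splitting the
        # correction across both articulators; the difference is left alone
        error = (upper + lower) - target
        upper -= gain * error / 2
        lower -= gain * error / 2
    finals.append((upper, lower))

finals = np.array(finals)
task_var = float(np.var(finals[:, 0] + finals[:, 1]))       # aperture: compressed
redundant_var = float(np.var(finals[:, 0] - finals[:, 1]))  # split: free to drift
```

Across simulated trials, the variance of the aperture (task-relevant) is orders of magnitude smaller than the variance of how the two articulators share the closure (task-redundant), the signature pattern of structured variability.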
For example, in speech production the representation of a syllable active at one level of the hierarchy predicts the corresponding sequence of phonemes at the level below, which in turn allows prediction of the proprioceptive input (Kiebel, Von Kriegstein, Daunizeau, & Friston, 2009). The predicted signals are not translated into motor commands but are sent as input to an actuator system existing within a complex dynamical system that includes muscles and motor neurons connected in such a way as to form a closed loop of coupling relations. The elements of the actuator system self-organize their behavior in order to enact the prediction generated by the processing hierarchy. In perception, the system selects through Bayesian inference the set of activation levels across the hierarchy that best predicts the sensory input reaching the lowest level. The system implements perceptual learning by allowing the predictive functions at all levels to mutually adjust in order to minimize the prediction error. One would expect that a system based on tunable predictions of its own behavior would also facilitate imitation of one's own speech. Yet when speakers try to imitate their own vowel productions, they show biases that are significantly larger than those expected solely from articulatory or perceptual noise (Vallabha & Tuller, 2004). Regardless of whether Tian and Poeppel's observations constitute evidence for an internal comparator, they do imply a very tight production-perception link in speech, a notion that complements much behavioral evidence and that is also characteristic of optimal feedback control theory, in which production and perception are explicitly coupled, and of active inference theory (Friston, 2011),
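As a toy illustration of this Bayesian selection process (with an invented two-syllable inventory and assumed Gaussian cue distributions, not parameters from the cited models), perception can be cast as choosing the candidate whose predicted distribution best accounts for the incoming acoustic cue:

```python
import numpy as np

# Invented inventory: each candidate syllable predicts a Gaussian distribution
# over a single acoustic cue (mean, sd). Values are purely illustrative.
candidates = {"ba": (-1.0, 0.5), "pa": (1.0, 0.5)}
prior = {"ba": 0.5, "pa": 0.5}

def gauss(x, mu, sd):
    """Gaussian likelihood of cue value x under a candidate's prediction."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def posterior(cue):
    """Bayes' rule: normalized product of prior and predictive likelihood."""
    like = {c: prior[c] * gauss(cue, mu, sd) for c, (mu, sd) in candidates.items()}
    z = sum(like.values())
    return {c: v / z for c, v in like.items()}

print(posterior(-0.8))   # posterior dominated by "ba"
```

In the hierarchical schemes discussed above, the same computation is repeated at every level, with each level's "cue" being the activity arriving from the level below.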

Please cite this article in press as: Tuller, B., & Lancia, L. Speech dynamics: Converging evidence from syllabification and categorization. Journal of Phonetics (2017), http://dx.doi.org/10.1016/j.wocn.2017.02.001


which bases production and perception on identical processes. An alternative to a comparator-style internal model is the idea that the motor signal helps stabilize an otherwise changing perceptual dynamic (Lackner & Tuller, 2008). Take as a starting point the similarities between the idea of recalibration of perceptual and motor mechanisms and the dynamic models of categorization and of the VTE. In the VTE model, an associative memory enables the system to identify an input pattern by calculating the overlap of the input pattern with each of a set of previously stored prototype patterns, much like the comparison of actual and intended consequences in an internal model. Unlike internal models, which assume minimization of some cost function, in a dynamic model the overlap is a measure of the strength of possible patterns or attractors. As the perceptual process stabilizes over time, the quantity that is reduced is the energy of the system, which is determined in part by the distance between the current state of the system and the attractor. In reality, people encounter and produce a variety of speech patterns that continuously change in space and time, so that their attractor landscape changes as well. These important global effects on the attractor landscape are captured by the associative memory, which affects the pattern of coupled oscillations of transforms in the VTE and shifts sensitivity to a range of inputs over longer time scales, influencing categorization. Individuals also have speech perceptuomotor biases that affect the attractor landscape, seen not only in the biases in self-imitation but also in perceptual learning in speech, where the recalibration of acoustic signals into phonemic categories can also be considered a dynamical process. Native speakers of a language consider a range of acoustic objects as being phonemically identical even though in another language the same acoustic range might span two or more phonemic categories.
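The overlap computation at the heart of such an associative memory can be sketched in a few lines (a generic Hopfield-style overlap with made-up binary patterns, not the actual VTE implementation):

```python
import numpy as np

# Hopfield-style associative memory sketch: the overlap of an input with each
# stored prototype measures the "pull" of the corresponding attractor.
# The prototypes are made-up random +/-1 patterns.
rng = np.random.default_rng(1)
prototypes = rng.choice([-1, 1], size=(3, 64))   # three stored patterns

def overlaps(pattern):
    """Normalized overlap of the input with each prototype, in [-1, 1]."""
    return prototypes @ pattern / pattern.size

# Probe with a noisy version of prototype 0: flip 10 of its 64 units.
probe = prototypes[0].copy()
flip = rng.choice(64, size=10, replace=False)
probe[flip] *= -1

m = overlaps(probe)
print(m)   # overlap with prototype 0 is the largest
```

The largest overlap identifies the attractor toward which the perceptual state is pulled; in the dynamic reading given above, the system's energy decreases as its state approaches that prototype.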
When asked to learn a category that is close to the native one, listeners must become attuned to distinctions that are not phonologically meaningful in their native language. This can be very difficult, often impossible, for adult learners. For example, Case, Tuller, and Kelso (2003) report that a listener’s initial perceptual biases greatly affect the likelihood that an acoustic range will be perceived reliably as a new (non-native) speech sound. In their work, learning the new category also shifted the acoustic parameters of what was perceived as the best exemplar of the native category. Although this process might be described as recalibrating the acoustic signal and perceptual categories, those authors suggested an alternative approach, akin to that used to examine dynamics in verbal transforms, in which category learning is viewed as a dynamical process that modifies perceptual space over time. When listeners can initially perceive the non-native sound as “different” from the native one, the progressive stabilization of the sound to be learned is relatively fast. In other words, the rate of change of the perceptual landscape (the arrangement of attractors corresponding to phonologically meaningful distinctions), indeed whether the landscape can change sufficiently, depends on the initial conditions of the listener’s perceptual/linguistic system. Gafos and Kirov’s (2009) implementation of a nonlinear dynamical model of phonological change, using lenition as the model case, can be interpreted similarly, although the effects unfold over a longer time scale.
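Tuller, Case, Ding, and Kelso (1994) formalized this kind of perceptual landscape with the tilted quartic potential V(x) = kx - x^2/2 + x^4/4, where x indexes the perceptual state and k encodes the acoustic parameter (plus slower biases from learning and attention). The sketch below (grid and parameter values chosen only for illustration) shows how varying the tilt moves the system between bistable and monostable regimes:

```python
import numpy as np

# Tilted double-well potential of Tuller, Case, Ding, and Kelso (1994).
def V(x, k):
    return k * x - x**2 / 2 + x**4 / 4

def attractors(k, grid=np.linspace(-2, 2, 4001)):
    """Local minima of V on a grid: the currently available percepts."""
    v = V(grid, k)
    inner = (v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])
    return grid[1:-1][inner]

print(len(attractors(0.0)))   # 2: bistable regime, both categories available
print(len(attractors(0.6)))   # 1: monostable, only one percept survives
```

Learning a new category corresponds to deepening (or creating) one well, while the listener's initial biases determine how far the landscape can be deformed, which is one way to read the Case, Tuller, and Kelso (2003) results described above.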


11. Concluding remarks: flexibility, time scales and symbolic hierarchies

Communicative exchanges rely on the production and interpretation of repeatable spatiotemporal patterns. This implies a degree of stability in the functioning of the sensorimotor processes of both speakers and listeners. At the same time, behavior needs to be flexible and plastic in order to deal with a complex and constantly changing environment. Nonlinear dynamical systems provide one window into how the sensorimotor system might balance stability and flexibility of behavior. Stability is a property of complex dynamical systems, which converge from a range of initial conditions (the basin of attraction) to well-characterized behaviors (attractors). The different initial conditions yield context-dependent trajectories toward a context-invariant state. Note that this itself constitutes a source of flexibility, because the time-varying behavior adapts to the context. Another source of context-dependent behavior is multistability: when multiple attractors are simultaneously present, the system evolves to one of them as a function of the initial conditions. Flexibility is also enhanced by interactions between time scales; processes on slow time scales modulate processes on faster time scales, and vice versa. This principle is embodied in the perceptual models summarized in Sections 6 and 7, in which rich and flexible behavior emerges from the interactions of a small number of relatively simple processes unfolding over different time scales. The separation between time scales also reduces the complexity of the processes observed at each time scale. For example, the effect of the learning process on speech perception changes so slowly that it can be considered constant during the perception of a monosyllabic stimulus.
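The role of initial conditions in a multistable system can be made concrete with a minimal gradient-dynamics sketch (a generic double well, not a model of any specific speech task):

```python
import numpy as np

# Multistability sketch: gradient dynamics dx/dt = -dV/dx with the double well
# V(x) = -x**2/2 + x**4/4. Different initial conditions (basins of attraction)
# converge to different attractors (x = -1 or x = +1).
def settle(x0, dt=0.01, steps=2000):
    """Euler-integrate the gradient flow from initial condition x0."""
    x = x0
    for _ in range(steps):
        x += dt * (x - x**3)   # -V'(x) = x - x**3
    return x

print(round(settle(-0.2), 3), round(settle(0.2), 3))   # -1.0 and 1.0
```

Two initial conditions on opposite sides of the unstable point x = 0 follow context-dependent trajectories to different context-invariant states, exactly the combination of flexibility and stability described above.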
Investigations of the coordination of neuronal oscillatory activity also suggest that interactions among multiple time scales may play an important role in the processing of speech representational hierarchies. Neurophysiological considerations (Giraud & Poeppel, 2012), behavioral data (Ghitza & Greenberg, 2009), and modeling studies (Ghitza, 2011) support the idea that theta waves (modulations of cortical activity between 4 and 8 Hz) track the amplitude modulations of the speech signal. The activity of theta waves in turn modulates the activity of faster gamma waves (25–35 Hz), which induce an alternation between periods of strong and weak excitability in superficial cortical layers, from which information is sent to higher processing levels. While the frequency of theta waves is compatible with the average time scale of syllable production, the frequency of gamma waves is compatible with the average time scale of segments. Hence, periods of strong excitability tend to coincide with the time periods characterized by the highest amplitude in the speech signal (usually containing vowels and transitions). This cascaded modulation of activity may help the perceptual system segment the continuous acoustic speech signal. At the same time, this mechanism is an example of how the sensory system of a listener can be coupled to the motor system of a speaker and how interspeaker coupling propagates through levels of processing and time scales. Nested time scales also play a crucial role in other dynamical models of speech processing (e.g., Grossberg & Myers,

2000; Kiebel et al., 2009). In these hierarchical models, the activation of high-level categories depends on processes changing on slower time scales, while the activation of low-level categories depends on rapidly changing processes. The dynamics are shaped by mutual constraints among processes with similar time scales and across those with different time scales. When these models are used to simulate speech perception, the collective variables emerging from the interactions at one level determine the values of the control parameters that affect the dynamics of the level below, so that their outcomes depend on the representational content of the upper level. On the other hand, the activity of the lower level enters the upper level as rapidly changing bottom-up input that has the potential to destabilize the current state of the upper-level dynamics and induce a phase transition. Hierarchical dynamical models illustrate how constraints and control parameters mediate between time scales but at the same time can embody a symbolic function. Thanks to their mutual regulatory effects, multiple processes unfolding over different time scales become coordinated to produce and recognize complex patterns of activity arising from the functioning of an elaborated symbolic system in which constraints and control parameters continuously shape the underlying dynamics. Note that different models capture different features of the observed behavior and potentially different causal hierarchies. For example, the production of a syllable can be modeled as emerging from the interactions between sub-syllabic abstract segmental representations (e.g., Grossberg & Myers, 2000; Kiebel et al., 2009) or from the interactions among coupled articulators (Browman & Goldstein, 1990, 1992; Tuller & Kelso, 1990). Common to the models is the notion that adaptive behavior is a property of the system that includes the speaker, the listener, and the communicative goal.
The speech perceptuomotor system must continuously find stable solutions (conceptualized as attractors) in the system dynamics, allowing it to remain flexible and plastic as conditions and intentions evolve. A challenge for all of these models is to assess their behavior using a heterogeneous set of tasks in order to test generalizability and robustness.

Acknowledgements

Betty Tuller's effort on this project was supported by the National Science Foundation. Any opinions, findings, and conclusions expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation. The original work was supported by NIDCD grant DC-00411, NIMH grant MH42900, and NSF grants BCS-0414657 and BCS-0549983, to Florida Atlantic University. Leonardo Lancia's work was partially supported by the LabEx EFL (ANR/CGI).

References

Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Barbosa, P. N. A. (2007). From syntax to acoustic duration: A dynamical model of speech rhythm production. Speech Communication, 49, 725–742.
Browman, C. P., & Goldstein, L. (1990). Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics, 18, 299–320.
Browman, C. P., & Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49, 155–180.
Case, P., Tuller, B., Ding, M., & Kelso, J. A. S. (1995). Evaluation of a dynamical model of speech perception. Perception and Psychophysics, 57, 977–988.
Case, P., Tuller, B., & Kelso, J. A. S. (2003). The dynamics of learning to hear new speech sounds. Speech Pathology, 1, 1–23.

Clegg, J. M. (1971). Verbal transformations on repeated listening to some English consonants. British Journal of Psychology, 62, 303–309.
Cummins, F. (2009). Phase and coordination in speech production. Proceedings of the 20th Irish conference on artificial intelligence and cognitive science (pp. 13–22). Springer Lecture Notes in Computer Science.
Cummins, F., & Simko, J. (2009). Notes on phase and coordination, and their application to rhythm and timing in speech. Journal of the Phonetic Society of Japan, 13, 7–18.
de Jong, K., Lim, B., & Nagao, K. (2002). Phase transitions in a repetitive speech task as gestural recomposition. IULC Working Papers Online, 2. Abstract published in Journal of the Acoustical Society of America, 110, 2657.
de Jong, K., Lim, B., & Nagao, K. (2004). The perception of syllable affiliation of singleton stops in repetitive speech. Language and Speech, 47, 241–266.
Desmurget, M., & Sirigu, A. (2009). A parietal-premotor network for movement intention and motor awareness. Trends in Cognitive Sciences, 13, 411–419.
Ditzinger, T., & Haken, H. (1989). Oscillations in the perception of ambiguous patterns. Biological Cybernetics, 61, 279–287.
Ditzinger, T., & Haken, H. (1990). The impact of fluctuations on the recognition of ambiguous patterns. Biological Cybernetics, 63, 453–456.
Ditzinger, T., Tuller, B., Haken, H., & Kelso, J. A. S. (1997a). A synergetic model for the verbal transformation effect. Biological Cybernetics, 77, 31–40.
Ditzinger, T., Tuller, B., Kelso, J. A. S., & Haken, H. (1997b). Temporal patterning in an auditory illusion: The verbal transformation effect. Biological Cybernetics, 77, 23–30.
Dycus, W. A., & Powers, A. S. (2000). The effects of alternating tactile and acoustic stimuli on habituation of the human eyeblink reflex. Psychobiology, 28, 507–514.
Eimas, P., & Corbit, J. (1973). Selective adaptation of linguistic feature detectors. Cognitive Psychology, 4, 99–109.
Eisner, F., & McQueen, J. (2006). Perceptual learning in speech: Stability over time. Journal of the Acoustical Society of America, 119, 1950–1953.
Erlhagen, W., & Schöner, G. (2002). Dynamic field theory of movement preparation. Psychological Review, 109, 545.
Fischer, H., Furmark, T., Wik, G., & Fredrikson, M. (2000). Brain representation of habituation to repeated complex visual stimulation studied with PET. NeuroReport, 11, 123–126.
Friston, K. (2011). What is optimal about motor control? Neuron, 72, 488–498.
Friston, K., Kilner, J., & Harrison, L. (2006). A free energy principle for the brain. Journal of Physiology-Paris, 100, 70–87.
Gafos, A., & Kirov, C. (2009). A dynamical model of change in phonological representations. In F. Pellegrino, E. Marisco, I. Chitoran, & C. Coupé (Eds.), Approaches to phonological complexity (pp. 219–240). Berlin/New York: Mouton de Gruyter.
Ghitza, O. (2011). Linking speech perception and neurophysiology: Speech decoding guided by cascaded oscillators locked to the input rhythm. Frontiers in Psychology, 2, 130.
Ghitza, O., & Greenberg, S. (2009). On the possible role of brain rhythms in speech perception: Intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. Phonetica, 66, 113–126.
Giraud, A. L., & Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15, 511–517.
Goldstein, L. M., & Lackner, J. R. (1973). Alterations of the phonetic coding of speech sounds during repetition. Cognition, 2, 279–297.
Goldstein, L., Nam, H., Saltzman, E., & Chitoran, I. (2009). Coupled oscillator planning model of speech timing and syllable structure. In G. Fant, H. Fujisaki, & J. Shen (Eds.), Frontiers in phonetics and speech science, Festschrift for Wu Zongji (pp. 239–250). Beijing: Commercial Press.
Grill-Spector, K., & Malach, R. (2001). FMR-adaptation: A tool for studying the functional properties of human cortical neurons. Acta Psychologica, 107, 293–321.
Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 213–257.
Grossberg, S. (1978). A theory of human memory: Self-organization and performance of sensory-motor codes, maps, and plans. In R. Rosen & F. Snell (Eds.), Progress in theoretical biology (Vol. 5, pp. 233–374). New York, USA: Academic Press.
Grossberg, S., & Myers, C. W. (2000). The resonant dynamics of speech perception: Interword integration and duration-dependent backward effects. Psychological Review, 107, 735.
Grush, R. (2004). The emulation theory of representation: Motor control, imagery, and perception. Behavioral and Brain Sciences, 27, 377–396.
Haken, H. (1977/1983). Synergetics, an introduction: Nonequilibrium phase transitions and self-organization in physics, chemistry and biology. Berlin, Germany: Springer.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley and Sons.
Henson, R. N. (2003). Neuroimaging studies of priming. Progress in Neurobiology, 70, 53–81.
Houde, J. F., & Nagarajan, S. S. (2011). Speech production as state feedback control. Frontiers in Human Neuroscience, 5, 82.
Kawamoto, A., & Anderson, J. (1985). A neural network model of multistable perception. Acta Psychologica, 59, 35–65.
Kawato, M. (1999). Internal models for motor control and trajectory planning. Current Opinion in Neurobiology, 9, 718–727.
Kelso, J. A. S., Tuller, B., Vatikiotis-Bateson, E., & Fowler, C. A. (1984). Functionally specific articulatory cooperation following jaw perturbations during speech: Evidence for coordinative structures. Journal of Experimental Psychology: Human Perception and Performance, 10, 812.
Kiebel, S. J., Von Kriegstein, K., Daunizeau, J., & Friston, K. J. (2009). Recognizing sequences of sequences. PLoS Computational Biology, 5, e1000464.


Kleinschmidt, D. F., & Jaeger, T. F. (2015). Robust speech perception: Recognize the familiar, generalize to the similar, and adapt to the novel. Psychological Review, 122, 148–203.
Kleinschmidt, D. F., & Jaeger, T. F. (2016). Re-examining selective adaptation: Fatiguing feature detectors, or distributional learning? Psychonomic Bulletin & Review, 23, 1–14.
Köhler, W. (1940). Dynamics in psychology. New York: Liveright.
Köhler, W., & Wallach, H. (1940). Figural after-effects: An investigation of visual processes. Proceedings of the American Philosophical Society, 88, 269–357.
Kraljic, T., & Samuel, A. (2005). Perceptual learning for speech: Is there a return to normal? Cognitive Psychology, 51(2), 141–178.
Lackner, J. R. (1974). Speech production: Evidence for corollary-discharge stabilization of perceptual mechanisms. Perceptual and Motor Skills, 39, 899–902.
Lackner, J. R., & Goldstein, L. M. (1975). The psychological representation of speech sounds. Quarterly Journal of Experimental Psychology, 27, 173–185.
Lackner, J. R., & Tuller, B. (2008). Dynamical systems and internal models. In A. Fuchs & V. Jirsa (Eds.), Coordination: Neural, behavioral and social dynamics (pp. 93–103). Berlin, Germany: Springer.
Lametti, D. R., Rochet-Capellan, A., Neufeld, E., Shiller, D. M., & Ostry, D. J. (2014). Plasticity in the human speech motor system drives changes in speech perception. The Journal of Neuroscience, 34, 10339–10346.
Lancia, L. (2009). Dynamique non linéaire de la perception de la parole [Thèse de doctorat en Cognition, langage, éducation]. Université de Marseille.
Lancia, L., Nguyen, N., & Tuller, B. (2008). Nonlinear dynamics of speech categorization: Critical slowing down and critical fluctuations. Journal of the Acoustical Society of America, 123, 3077.
Lancia, L., & Winter, B. (2013). The interaction between competition, learning and habituation dynamics in speech perception. Laboratory Phonology, 4, 221–257.
Levinson, S. C., & Torreira, F. (2015). Timing in turn-taking and its implications for processing models of language. Frontiers in Psychology, 6, 731. http://dx.doi.org/10.3389/fpsyg.2015.00731. Last accessed March 10, 2016.
McClelland, J. L., Mirman, D., & Holt, L. L. (2006). Are there interactive processes in speech perception? Trends in Cognitive Sciences, 10, 363–369.
Miall, R. C., & Wolpert, D. M. (1996). Forward models for physiological motor control. Neural Networks, 9, 1265–1279.
Moulin-Frier, C., & Arbib, M. A. (2013). Recognizing speech in a novel accent: The motor theory of speech perception reframed. Biological Cybernetics, 107, 421–447.
Moulin-Frier, C., Diard, J., Schwartz, J. L., & Bessière, P. (2015). COSMO ("Communicating about Objects using Sensory-Motor Operations"): A Bayesian modeling framework for studying speech communication and the emergence of phonological systems. Journal of Phonetics, 53, 5–41.
Nam, H., Goldstein, L., & Saltzman, E. (2009). Self-organization of syllable structure: A coupled oscillator model. In F. Pellegrino, E. Marisco, I. Chitoran, & C. Coupé (Eds.), Approaches to phonological complexity (pp. 299–328). Berlin/New York: Mouton de Gruyter.
Nguyen, N., & Delvaux, V. (2015). Role of imitation in the emergence of phonological systems. Journal of Phonetics, 53, 46–54.
Nguyen, N., Lancia, L., Bergounioux, M., Wauquier-Gravelines, S., & Tuller, B. (2005). Role of training and short-term context effects in the perception of /s/ and /st/ in French. In V. Hazan & P. Iverson (Eds.), ISCA workshop on plasticity in speech perception (pp. A38–A39). London, UK.
Nguyen, N., Wauquier-Gravelines, S., & Tuller, B. (2009). The dynamical approach to speech perception: From fine phonetic detail to abstract phonological categories. In F. Pellegrino, E. Marisco, I. Chitoran, & C. Coupé (Eds.), Approaches to phonological complexity (pp. 193–218). Berlin/New York: Mouton de Gruyter.
Nguyen, N., Wauquier-Gravelines, S., Lancia, L., & Tuller, B. (2007). Detection of liaison consonants in speech processing in French: Experimental data and theoretical implications. In P. Prieto, J. Mascaro, & M.-J. Solé (Eds.), Current issues in linguistic theory: Segmental and prosodic issues in romance phonology (pp. 3–23). London: John Benjamins.
Pantev, C., Okamoto, H., Ross, B., Stoll, W., Ciurlia-Guy, E., Kakigi, R., & Kubo, T. (2004). Lateral inhibition and habituation of the human auditory cortex. European Journal of Neuroscience, 19, 2337–2344.
Pattee, H. H. (1972). Laws and constraints, symbols and languages. In C. H. Waddington (Ed.), Towards a theoretical biology 4, essays (pp. 248–258). Edinburgh: Edinburgh University Press.
Pitt, M. A., & Shoaf, L. (2002). Linking verbal transformations to their causes. Journal of Experimental Psychology: Human Perception and Performance, 28, 150–162.


Rączaszek-Leonardi, J. (2012). Language as a system of replicable constraints. In H. H. Pattee & J. Rączaszek-Leonardi (Eds.), Laws, language and life (pp. 295–333). Netherlands: Springer.
Rubin, P., & Vatikiotis-Bateson, E. (1998). Measuring and modeling speech production in humans. In S. L. Hopp & C. S. Evans (Eds.), Animal acoustic communication: Recent technical advances (pp. 251–290). New York: Springer-Verlag.
Schöner, G., & Thelen, E. (2006). Using dynamic field theory to rethink infant habituation. Psychological Review, 113, 273–299.
Schöner, G., Martin, V., Reimann, H., & Scholz, J. P. (2008). Motor equivalence and the uncontrolled manifold. In R. Sock, S. Fuchs, & Y. Laprie (Eds.), Proceedings of the international seminar on speech production (ISSP 2008) in Strasbourg (pp. 23–28). INRIA.
Schwartz, J. L., Basirat, A., Ménard, L., & Sato, M. (2012). The perception-for-action-control theory (PACT): A perceptuo-motor theory of speech perception. Journal of Neurolinguistics, 25, 336–354.
Shim, J. K., Latash, M. L., & Zatsiorsky, V. M. (2003). Prehension synergies: Trial-to-trial variability and hierarchical organization of stable performance. Experimental Brain Research, 152, 173–184.
Sommer, M. A., & Wurtz, R. H. (2008). Brain circuits for the internal monitoring of movements. Annual Review of Neuroscience, 31, 317–338.
Stetson, R. H. (1951). Motor phonetics: A study of speech movements in action. Amsterdam: North-Holland.
Tian, X., & Poeppel, D. (2010). Mental imagery of speech and movement implicates the dynamics of internal forward models. Frontiers in Psychology, 1, 1–23.
Tian, X., & Poeppel, D. (2014). Dynamics of self-monitoring and error detection in speech production: Evidence from mental imagery and MEG. Journal of Cognitive Neuroscience, 27, 352–364.
Tilsen, S. (2016). Selection and coordination: The articulatory basis for the emergence of phonological structure. Journal of Phonetics, 55, 53–77.
Todorov, E., & Jordan, M. I. (2002). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5, 1226–1235.
Tuller, B. (2003). Computational models in speech perception. Journal of Phonetics, 31, 503–507.
Tuller, B. (2004). Categorization and learning in speech perception as dynamical processes. In M. A. Riley & G. C. Van Orden (Eds.), Tutorials in contemporary nonlinear methods for the behavioral sciences. Last accessed March 10, 2016.
Tuller, B. (2007). Acoustic and phonological learning: Two different dynamics? Mathematics and Social Science, 180, 127–139.
Tuller, B., & Kelso, J. A. S. (1990). Phase transitions in speech production and their perceptual consequences. In M. Jeannerod (Ed.), Attention and performance XIII (pp. 429–452). Hillsdale, NJ: Erlbaum.
Tuller, B., & Kelso, J. A. S. (1991). The production and perception of syllable structure. Journal of Speech and Hearing Research, 34, 501–504.
Tuller, B., Case, P., Ding, M., & Kelso, J. A. S. (1994). The nonlinear dynamics of speech categorization. Journal of Experimental Psychology: Human Perception and Performance, 20, 1–14.
Tuller, B., Jantzen, M. G., & Jirsa, V. (2008). A dynamical approach to speech categorization: Two routes to learning. New Ideas in Psychology, 26, 208–226.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108, 550–592.
Vallabha, G., & McClelland, J. (2007). Success and failure of new speech category learning in adulthood: Consequences of learned Hebbian attractors in topographic maps. Cognitive, Affective, & Behavioral Neuroscience, 7, 53–73.
Vallabha, G., & Tuller, B. (2004). Perceptuomotor bias in the imitation of steady-state vowels. Journal of the Acoustical Society of America, 116, 1184–1197.
von Holst, E., & Mittelstaedt, H. (1973). The reafference principle. In R. Martin (Ed.), The behavioural physiology of animals and man: The collected papers of Erich von Holst (Vol. 1, pp. 139–173). Coral Gables: University of Miami Press.
Vroomen, J., van Linden, S., De Gelder, B., & Bertelson, P. (2007). Visual recalibration and selective adaptation in auditory–visual speech perception: Contrasting build-up courses. Neuropsychologia, 45, 572–577.
Warren, R. M. (1961). Illusory changes of distinct speech upon repetition–The verbal transformation effect. British Journal of Psychology, 52, 249–258.
Warren, R. M. (1968). Verbal transformation effect and auditory perceptual mechanisms. Psychological Bulletin, 70, 261–270.
Warren, R. M., & Gregory, R. L. (1958). An auditory analogue of the visual reversible figure. American Journal of Psychology, 71, 613–621.
