Cognition 198 (2020) 104196


Cognitive processes underlying spoken word recognition during soft speech

Kristi Hendrickson a,b,⁎, Jessica Spinelli a, Elizabeth Walker a

a Department of Communication Sciences & Disorders, University of Iowa, 250 Hawkins Drive, Iowa City, IA 52242, United States of America
b Department of Psychological & Brain Sciences, University of Iowa, 250 Hawkins Drive, Iowa City, IA 52242, United States of America

⁎ Corresponding author at: Department of Communication Sciences and Disorders, University of Iowa, 250 Hawkins Drive, Iowa City, IA 52242, United States of America.
E-mail addresses: [email protected] (K. Hendrickson), [email protected] (J. Spinelli), [email protected] (E. Walker).

https://doi.org/10.1016/j.cognition.2020.104196
Received 12 April 2019; Received in revised form 6 January 2020; Accepted 18 January 2020
0010-0277/© 2020 Published by Elsevier B.V.

Keywords: Speech perception; Spoken word recognition; Soft speech; Lexical competition; Adverse listening conditions

ABSTRACT

In two eye-tracking experiments using the Visual World Paradigm, we examined how listeners recognize words when faced with speech at lower intensities (40, 50, and 65 dBA). After hearing the target word, participants (n = 32) clicked the corresponding picture from a display of four images – a target (e.g., money), a cohort competitor (e.g., mother), a rhyme competitor (e.g., honey) and an unrelated item (e.g., whistle) – while their eye movements were tracked. For slightly soft speech (50 dBA), listeners demonstrated an increase in cohort activation, whereas for rhyme competitors, activation started later and was sustained longer in processing. For very soft speech (40 dBA), listeners waited until later in processing to activate potential words, as illustrated by a decrease in activation for cohorts, and an increase in activation for rhymes. Further, the extent to which words were considered depended on word length (mono- vs. bi-syllabic words), and speech-extrinsic factors such as the surrounding listening environment. These results advance current theories of spoken word recognition by considering a range of speech levels more typical of everyday listening environments. From an applied perspective, these results motivate models of how individuals who are hard of hearing approach the task of recognizing spoken words.

1. Introduction

Language comprehension requires fast and efficient word recognition. Spoken word recognition is a complex cognitive process that involves mapping the incoming speech signal to entries in the mental lexicon. This complexity arises because spoken words unfold sequentially and rapidly in time. As a result, at each moment listeners have only partial information for identifying the target word. Given that many words sound the same, such partial information creates moments of uncertainty (Marslen-Wilson, 1987). For example, after hearing the onset of the word candle, there is insufficient information to discriminate the target word from phonological competitors that share word-initial phonemes (e.g., camera).

A rich history in psycholinguistic research shows that listeners manage this temporary ambiguity by taking an immediate competition approach. That is, listeners immediately activate multiple lexical candidates that are consistent with the unfolding speech signal (Marslen-Wilson, 1987; McClelland & Elman, 1986; Norris & McQueen, 2008). These candidates compete for recognition as the word unfolds (Allopenna, Magnuson, & Tanenhaus, 1998; Marslen-Wilson & Zwitserlood, 1989). For instance, cohort competitors (word pairs that share phonemes at the beginning; e.g., candle and camera) compete for recognition early, whereas rhyme competitors (i.e., word pairs that share phonemes at the end; e.g., candle and handle) compete later. As listeners accumulate information that disambiguates the word, these initial interpretations are updated until competition has been suppressed, and the appropriate word is fully active (Dahan & Magnuson, 2006; Dahan, Magnuson, & Tanenhaus, 2001; Frauenfelder, Scholten, & Content, 2001; Luce & Pisoni, 1998; Weber & Scharenborg, 2012).

The degree of activation – how much a given word is considered – depends on where in the word phonological overlap is present. To illustrate, listeners heavily weigh word-initial overlap, and therefore, cohorts show more robust activation than do rhymes (Allopenna et al., 1998). Further, the degree of competition is influenced by word length (Simmons & Magnuson, 2018). For instance, rhyme effects are less likely to be observed in mono-syllabic words compared to bi-syllabic words. Given that rhymes are defined as words that overlap in the nucleus of the first syllable (i.e., the vowel) until the end of the word, bi-syllabic rhymes have a large portion of phonological overlap which likely drives greater competition (Simmons & Magnuson, 2018).

A limitation of the existing psycholinguistic research is that it mostly examines word recognition under ideal listening conditions (i.e., given clear acoustic input presented at a conversational level [~65–70 dBA] in quiet).


The breadth of research on speech recognition should
reflect the listening challenges that individuals face in natural communication settings. For instance, listeners are often confronted with adverse listening environments, in which several factors (e.g., background noise, decreased sound levels, distortion) lead to reduced speech intelligibility (Mattys, Davis, Bradlow, & Scott, 2012; Olsen, 1998). How listeners recognize words in-the-moment may be quite different under adverse listening conditions. This is because when listeners are confronted with suboptimal speech input, they encounter two sources of ambiguity: the temporary ambiguity that is present during optimal listening conditions as a result of the unfolding signal, plus the ambiguity that arises from the suboptimal speech signal. Research examining the dynamics of lexical competition in adverse listening conditions is limited and mixed. The presence of background noise appears to cause listeners to delay fixations to the target (van der Feest, Blanco, & Smiljanic, 2019), and delay activation for cohort competitors (Ben-David et al., 2011). However, more recent research has found that background noise actually heightens and sustains cohort and rhyme competition (Brouwer & Bradlow, 2016). Further, background noise not only changes the magnitude of activation for different types of competitors, but also increases the number of candidate words (Scharenborg, Coumans, & van Hout, 2018). When confronted with distorted speech, in which some phonemes are replaced with radio noise, listeners reduce activation for cohorts and increase activation for rhymes (McQueen & Huettig, 2012). A similar effect is observed for cochlear implant (CI) users who are prelingually deaf (those CI users who have little acoustic basis for developing phoneme categories) and for hearing adults with vocoded speech meant to simulate a CI (McMurray, Farris-Trimble, & Rigler, 2017). In addition to speech-intrinsic factors (i.e., factors involving the speech signal itself), speech-extrinsic factors (i.e., factors related to the listening environment, such as predictability) also affect the nature of speech recognition and lexical competition (Brouwer & Bradlow, 2014; Brouwer, Mitterer, & Huettig, 2012; Smith & McMurray, 2018). Brouwer et al. (2012) examined lexical competition for canonical forms (e.g., computer [kɔmpjutər]) and reduced forms [e.g., pjutər] across multiple eye-tracking experiments. The stimuli were identical across experiments, though the nature of the presentation differed; in one experiment canonical forms were interleaved with reduced forms, whereas in another experiment listeners only heard canonical forms. The uncertainty of the listening environment (caused by the interleaved presentation of canonical and reduced forms) affected the recognition of canonical words. Specifically, canonical phonological competitors were more strongly activated in the interleaved presentation compared to the canonical only presentation. Brouwer and colleagues suggest that these findings reflect that when listening to reduced speech interleaved with canonical speech, listeners become more tolerant of acoustic mismatches for canonical inputs. These results are crucial because they demonstrate that listeners not only adjust competition based on speechintrinsic factors (e.g., reduced vs. canonical), but also speech-extrinsic factors (i.e., the predictability of the listening environment). 
Though recent research has begun to investigate spoken word recognition in adverse listening conditions (Brouwer & Bradlow, 2016; McQueen & Huettig, 2012), listening environments with soft speech (i.e., speech at lower intensities) are noticeably absent from the literature. It is crucial to understand how soft speech at multiple intensities is recognized given that it is ever-present in a range of typical listening environments. Indeed, listeners encounter a variety of speakers, some speaking face-to-face at conversational levels (~65 dBA), others speaking softly (< 55 dBA) and/or at distances that are not optimal for speech perception (Pearsons, Bennett, & Fidell, 1977). Further, speech sound levels in everyday listening environments can reach as low as 25 dBA (Holden, Reeder, Firszt, & Finley, 2011). Therefore, listeners must recognize words over a range of sound levels as a basis for effective communication in many listening situations (Davidson, 2006).

There are reasons to believe that the dynamics of competition may be quite different for soft speech compared to other types of suboptimal input. For instance, vocoded speech (and speech faced by CI users) removes spectral differences, such as formant frequencies, which eliminates much of the information needed to discriminate phonemes. However, information about changes in amplitude over time (i.e., amplitude envelope) is preserved. As a result, CI users adopt whatever cues are most reliable in their input and primarily rely on temporal instead of spectral information to recognize words (McMurray, Farris-Trimble, Seedorff, & Rigler, 2016; Moberly, Bates, Harris, & Pisoni, 2016; Peng, Chatterjee, & Lu, 2012; Winn & Litovsky, 2015). Conversely, for softer levels of speech around 40–55 dBA, spectral differences are mostly accessible and the amplitude envelope is somewhat reduced (Olsen, 1998). Thus, both spectral and temporal cues exist in the input, but may be less accessible. As a result, the task of reweighting cues and focusing on the most reliable cues may not be the best approach to dealing with softer levels of speech input.

Research examining word recognition in the face of soft speech has almost entirely focused on populations with hearing loss by measuring accuracy using clinical speech perception tasks (e.g., consonant-nucleus-consonant [CNC] word scores). In these tasks, listeners are asked to simply repeat the word they hear. For CI users, single word recognition accuracy is quite similar at 70 dBA and 60 dBA, though there is a significant drop-off in accuracy at 50 dBA (Holden et al., 2013). Despite the fact that measuring access to a range of sound levels (including soft speech) is a part of the clinical assessment of populations with hearing loss, and soft speech recognition is a crucial component of everyday speech perception for all listeners, surprisingly little is known about how individuals with normal hearing recognize soft speech, both in terms of recognition accuracy and the dynamics of real-time competition that subserve recognition (Olsen, 1998; Skinner, Holden, Holden, Demorest, & Fourakis, 1997).

Such an investigation has both theoretical and clinical implications. From a theoretical perspective, understanding the influence of soft speech on the competition dynamics of spoken word recognition will inform more ecologically valid models. Indeed, any model of spoken word recognition should be able to account for the breadth of listening conditions listeners face in natural communication settings. From a clinical perspective, understanding how normal hearing listeners perceive soft speech is a useful model for determining how individuals with hearing loss approach the task of recognizing words in the face of reduced audibility.
1.1. The present study The goal of the present study was to investigate how listeners recognize soft speech. Specifically, this research had two primary aims. First, across two experiments, we examined how soft speech influences the competition dynamics of spoken word recognition (i.e., what word types compete for recognition and when competition occurs during processing). To this end we obtained precise information regarding the time course of lexical competition. Previous research has shown that both speech-intrinsic and speech-extrinsic factors can change the dynamics of lexical competition (Brouwer et al., 2012). When reduced and canonical word forms were presented in an interleaved fashion, the dynamics of lexical competition changed for the canonical forms. Specifically, canonical forms demonstrated more robust competitor activation when presented alongside non-canonical forms. Thus, our second aim was to determine how speech-extrinsic factors, such as overall uncertainty in the listening environment, affect how soft speech is recognized. For both Experiments 1 and 2 we use eye-tracking and the Visual World Paradigm (VWP; Cooper, 1974; Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). Spoken words are presented at multiple levels of intensity. After hearing the target word, participants click on the corresponding picture from a display of four images representing candidate interpretations of the stimulus – the target (e.g., money), a cohort competitor that shares word-initial phonemes (e.g., mother), a 2


rhyme competitor that shares word-final phonemes (e.g., honey) and a phonologically unrelated picture (e.g., whistle) – while their eye movements are monitored (e.g., Allopenna et al., 1998). Participants must make a fixation to plan their ultimate response, and can make multiple fixations over the course of the trial. As a result, eye movements in this paradigm are tightly time-locked to unfolding activation among lexical competitors. To examine whether and to what extent lexical competition for speech at different intensities is affected by speech-extrinsic factors (the uncertainty of the listening environment), we manipulated the nature of the sound level presentation: in Experiment 1 sound level was presented in an interleaved fashion, whereas in Experiment 2, sound level was blocked. We predict that listeners will be slower to recognize the target word as sound level decreases. Of particular interest to the current investigation is how the competition dynamics change with soft speech. We predict that lexical competition is likely to change along two dimensions: magnitude (the amount of activation) and/or timing (when activation occurs). In terms of magnitude, soft speech could cause listeners to reduce activation of competitors. Consistent with research on distorted speech input, listeners may decrease activation to cohorts, rhymes or both (McQueen & Huettig, 2012). Conversely, listeners could increase activation for phonological competitors in the face of soft speech similarly to what is seen in some CI users (McMurray et al., 2016). Further, activation could change in terms of timing. Most research on word recognition in adverse conditions shows some level of sustained competitor activation, in which consideration for competitors is maintained later in processing (Ben-David et al., 2011; Brouwer & Bradlow, 2016; McMurray et al., 2017). From this view, we may see either delayed activation for competitors, or activation may be immediate but maintained longer than in conversational speech. Finally, in accordance with previous research on speech-extrinsic factors (Brouwer et al., 2012; Smith & McMurray, 2018), we predict that the nature of the presentation (randomly interleaved vs. blocked) will influence lexical competition, particularly for optimal speech (conversational speech, 65 dBA). Specifically, we expect more phonological competition for conversational speech when interleaved with soft speech (Experiment 1) than when presented in a blocked fashion (Experiment 2). Finally, in Experiment 2 we examine the influence of sound level on lexical competition for mono- and bi-syllabic words separately. Recent research shows stark differences in the magnitude of rhyme effects for mono-syllabic vs. bi-syllabic words. Bi-syllabic rhymes (e.g., paddlesaddle) are longer and contain more phonological overlap with the target word than mono-syllabic rhymes (e.g., cat-bat). As a result, bisyllabic rhymes have been shown to elicit more competition (Simmons & Magnuson, 2018). If listeners sustain activation for competitors during soft speech, this might be particularly apparent for bi-syllabic rhymes. 2. Experiment 1 2.1. Methods 2.1.1. Participants Thirty-two undergraduate participants were recruited from the University of Iowa psychology participant pool. All participants were monolingual English speaking and had normal hearing and normal or corrected-to-normal vision. 
Participants ranged from 18 to 23-years-old (M = 18.5 years) and received academic credit for a university level psychology course. All participants signed an informed consent document before participating in the study.

2.1.2. Design

Items consisted of 136 words (72 mono-syllabic and 64 bi-syllabic) that were combined into sets of four, resulting in 34 total sets (see Appendix). This study had a target-cohort-rhyme-unrelated (TCRU) design. On each trial one of the four items was played as the auditory stimulus, and visual stimuli consisted of a target word (e.g., money), a cohort of the target (e.g., mother), a rhyme of the target (e.g., honey), and a phonologically unrelated word (e.g., whistle). Sets of four items were constructed to ensure that no concepts overlapped semantically and unrelated items had no phonological overlap with any other word in the set. All four items within a set always appeared together. Each word was presented as the target (i.e., the auditory stimulus) an equal number of times. Because each word within a set served as the auditory stimulus, there were four trial types. To illustrate, again consider the set money-mother-honey-whistle. When money was the auditory stimulus, the resulting trial type was a target-cohort-rhyme-unrelated or TCRU trial. However, the structure of the trial (i.e., the nature of the competitor set) changed depending on which auditory stimulus appeared as the target. For example, on trials in which the cohort (e.g., mother) appeared as the auditory stimulus, the trial was termed a target (mother)-cohort (money)-unrelated (honey)-unrelated (whistle) or TCUU trial. Thus, the four trial types were TCRU, TCUU, TRUU, and TUUU, where the letters refer to the relationship among the items on the screen depending on the auditory stimulus.

In order to determine which sound levels to use in the eye-tracking task, a modified version of the consonant-nucleus-consonant (CNC) speech perception test was conducted with a separate set of 10 participants. We used the CNC test for two main reasons. First, the CNC is one of the clinical gold standards for speech recognition. Second, we wanted to choose sound levels that were highly audible, so that listeners could recall the word and not simply recognize it given the referent. Consistent with the CNC protocol, participants repeated words they heard from a recording to the experimenter, who scored their responses for the percentage of phonemes and whole words correctly produced. We modified the traditional CNC protocol to include six different sound levels (30, 35, 40, 45, 50, and 65 dBA). A different CNC word list was used for each sound level, and word lists and sound levels were counterbalanced across participants. Given that trials with incorrect responses are removed from analyses in studies of spoken word recognition using the VWP, it was essential to choose sound levels at which recognition accuracy was high enough to ensure that participants were not simply guessing. Indeed, by including sound levels with low recognition accuracy within the eye-tracking task, we would not have sufficient trials to look at condition-specific differences. See Table 1 for results at the six different sound levels. Based on the results of this pretest and clinically related motivations, we chose 40 and 50 dBA as the soft speech sound levels for the eye-tracking task. We chose 50 dBA primarily for its clinical relevance; 50 dBA is used in clinical settings to verify audibility of the hearing aid for soft speech. We chose 40 dBA because it was the lowest sound level at which word recognition accuracy was high enough (> 90%) to ensure a sufficient number of usable trials during the eye-tracking task. Further, word recognition accuracy (logit transformed) dropped significantly from 40 dBA to the next lowest intensity level (35 dBA), t(9) = 8.73, p < .001.

Table 1. Mean (and standard deviation) word recognition accuracy by sound level.

Sound level (dBA) | % words correct | % phonemes correct
30 | 71.2 (7.8) | 88.7 (3.9)
35 | 80.1 (7.8) | 92.9 (3.1)
40 | 92.6 (3.8) | 97.4 (1.4)
45 | 95.0 (3.8) | 98.4 (1.2)
50 | 97.0 (4.2) | 98.9 (1.6)
65 | 97.6 (2.3) | 99.0 (1.1)

Each of the 136 words (34 sets with 4 items) occurred as the auditory stimulus twice at the three sound levels (40, 50, and 65 dBA), resulting in 816 total trials (136 words × 2 repetitions × 3 sound levels). The sound levels of the study were randomly interleaved for Experiment 1.
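To make the resulting design concrete, the trial list can be sketched in a few lines of R. This is an illustration with placeholder item names, not the actual experiment script (the task itself was programmed in Experiment Builder, as described below).

```r
# Sketch of the Experiment 1 trial list (placeholder item names). Every word
# serves as the auditory target twice at each of the three sound levels, and
# sound level is randomly interleaved across the 816 trials.
words  <- paste0("word", 1:136)                    # 34 sets x 4 items
trials <- expand.grid(word = words,
                      repetition = 1:2,
                      dBA = c(40, 50, 65),
                      stringsAsFactors = FALSE)
nrow(trials)                                       # 136 x 2 x 3 = 816
trials <- trials[sample(nrow(trials)), ]           # fresh random order per participant
```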

2.1.3. Stimuli

Auditory stimuli were 136 words recorded by a female speaker with a standard American accent in an anechoic chamber using a Marantz Professional Steady State Recorder at a sampling rate of 44100 Hz. The speaker wore a Shure head-mounted microphone to ensure the distance between her mouth and the microphone was consistent throughout the recording. Audio recordings were completed within a single session. Each word was recorded multiple times within the sentence frame (“He said ___.”) to ensure natural intonation. The best exemplar of each word was extracted from the sentence frame and edited to remove noise artifacts. Even though the audio recordings were made in an anechoic chamber, there was minor background noise due to the fan of the recording computer. We used a standard lab protocol to remove this background noise. For this, we used Audacity to measure a ~1000 ms sample of silence within the larger recording and calculated the acoustic profile of the background noise. We used this recording-specific background noise profile to remove the background noise in the audio recordings while preserving the acoustic properties of the speech stimuli. Finally, all words were amplitude normalized to the three sound levels of interest (40, 50, and 65 dBA), and 100 ms of silence was added to the beginning and end of the sound files using Praat (Boersma & Weenink, 2019).

Visual stimuli were clip art style images developed using a standard lab protocol (McMurray, Samelson, Lee, & Tomblin, 2010). For each image, several candidates were identified from a commercial clip art library. These images were then viewed by a focus group to identify the “best” exemplar of a given word and to identify alterations to ensure a more prototypical orientation. Each image was edited to implement these changes and remove any unnecessary features. Images were matched for style and visual saliency and approved by a senior member of the lab with extensive experience in the VWP. Most of the images used in this study were pulled from a database of images developed with this protocol and used across multiple studies (Apfelbaum, Blumstein, & McMurray, 2011; McMurray et al., 2010).

2.1.4. Procedure

For the eye-tracking task, participants were seated 24 in. away from a computer monitor. The task was programmed with Experiment Builder (SR Research, Ontario, Canada). Auditory stimuli were played over two front-mounted speakers, and the signal was amplified by a Rolls 35 watt stereo power amplifier. The speakers were calibrated by measuring the sound level of a concatenated file containing a representative sample of stimuli using a portable sound level meter. The sound level meter was centered in the speaker array at a distance corresponding to the location of the participant's head during testing. Testing was conducted in a sound-attenuated room to minimize background noise.

Before the task began, participants received both verbal and written study instructions. Further, practice trials were administered to ensure participants understood the task. For each trial, four pictures appeared on the screen: a target (money), cohort competitor (mother), rhyme competitor (honey), and an unrelated item (whistle). Pictures were 300 × 300 pixels and appeared 50 pixels vertically and horizontally from the edges of a 17″ computer monitor running at 1280 × 1024 pixels. Picture location was counterbalanced across trials, such that all pictures and word types (target, cohort, rhyme, and unrelated) appeared equally in each location. In the middle of the screen there was a blue dot. After 500 milliseconds the blue dot turned red, at which point participants were instructed to click the red dot to initiate the trial. Once clicked, the red dot disappeared, and an auditory stimulus identified the target. Thus, the preview of the visual display was at least 500 ms, though given that trials were participant initiated, participants could preview the images for longer. Participants were instructed to click the corresponding picture. Soft speech was presented randomly intermixed with conversational speech. Further, trials were randomized so that each participant received a different presentation order.

2.1.5. Eye tracking recording and data processing Eye movements were recorded with an Eyelink Portable Duo eyetracker in the chin rest configuration. Once the experimenter adjusted the chin rest to a comfortable position and a clear image of the pupil and corneal reflection were obtained, a 9-point calibration and validation procedure was completed. Drift correction was performed 24 times throughout the experiment (every 34 trials). No participants failed drift correction during the experiment. For analysis, saccades and successive fixation were combined into a single unit called a “look” (McMurray et al., 2010; McMurray, Tanenhaus, & Aslin, 2002). 2.2. Results We conducted two sets of analyses. The first set examined accuracy and reaction time of the mouse-click response as a function of sound level (40, 50, or 65 dB). Next, we analyzed the time course of target and competitor (cohort and rhyme) fixations. For these analyses we derived an estimate of the precise time window at which two timeseries differ to understand the timing and magnitude of activation for targets, cohorts, rhymes, and unrelated words across sound levels. Only correct trial, and saccades occurring > 200 ms post-word onset were included in the time series and reaction time analyses. 2.2.1. Accuracy and latency Accuracy and reaction time to click the target image (measured from word-onset) were calculated for each subject and sound level. Two separate two-way ANOVAs were conducted to compare the effect of sound level and trial type (TCRU, TCUU, TRUU, TUUU) on response accuracy (logit transformed) and reaction time, respectively. For accuracy, there was a main effect of sound level [F(2, 62) = 4.49, p = .015]. Participants were highly accurate at selecting the target image at all sound levels, though slightly less so for 40 dBA speech: 65 dBA (99.4%), 50 dBA (99.5%) and 40 dBA (99.1%). Further, there was a main effect of trial type [F(3, 93) = 8.20, p < .001], such that participants were more accurate at selecting the target on trials in which a competitor was not on the screen: TCUU (99.5%), TCRU (99.2%), TRUU (99.1%), and TUUU (99.7%). There was no significant sound level × trial type interaction [F(6, 186) = 1.22, p = .35]. For reaction time, there was a significant main effect of sound level [F(2, 62) = 8.21, p = .0007], in which participants were fastest at responding for 50 dBA speech (1197 ms), followed by 65 dBA speech (1207 ms) and finally 40 dBA speech (1217 ms). There was also a main effect of trial type [F(3, 93) = 29.75, p < .0001], such that participants were fastest at responding during trials with no competitor: TCUU (1229 ms), TCRU (1226 ms), TRUU (1194 ms), and TUUU (1179 ms). However, there was no significant sound level × trial type interaction [F(6, 186) = 1.64, p = .14].
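The accuracy analysis above can be illustrated with a short R sketch. The text states only that accuracy was logit transformed and entered into a sound level × trial type ANOVA, so the toy data frame, the column names, and the empirical-logit correction below are assumptions for illustration.

```r
# Toy accuracy table: one row per subject x sound level x trial type
# (values and cell sizes are illustrative only).
acc <- expand.grid(subject     = factor(1:32),
                   sound_level = factor(c(40, 50, 65)),
                   trial_type  = factor(c("TCRU", "TCUU", "TRUU", "TUUU")))
acc$n_trials  <- 68
acc$n_correct <- rbinom(nrow(acc), acc$n_trials, prob = 0.99)

# Empirical logit (the +0.5 correction is an assumption; the paper says only
# that accuracy was logit transformed before the ANOVA).
emp_logit <- function(k, n) log((k + 0.5) / (n - k + 0.5))
acc$logit_acc <- emp_logit(acc$n_correct, acc$n_trials)

# Within-subject sound level x trial type ANOVA on the transformed accuracy.
fit <- aov(logit_acc ~ sound_level * trial_type +
             Error(subject / (sound_level * trial_type)), data = acc)
summary(fit)
```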


2.2.2. Time course

Time series data were analyzed using Bootstrapped Differences of Timeseries (BDOTS; a statistical package within R, Version 1.1.456, R Core Team, 2014). BDOTS is a statistical tool that detects differences in two timeseries (as in the VWP) when the time window is unknown in advance and offers a precise characterization of the time window in which a difference occurs (for a summary see Seedorff, Oleson, & McMurray, 2018). BDOTS achieves this level of precision in four steps. First, for each participant, word type (target, cohort, rhyme, and unrelated), and sound level (40, 50, 65 dBA) specific fitted curves are applied to the raw time series data to capture the shape of the functions. Target fixations are best approximated by a logistic function because there is an initial period of low looks, followed by an exponential increase in looks, and ending with a constant period of high looks. Competitor and unrelated fixations are best approximated using a double Gaussian function because these fixations start with a period of low looks, followed by an increase in looking culminating at the peak, and finally a low period of looks (see Seedorff et al., 2018). Curve fitting smooths the data to minimize any idiosyncratic patterns of significance. To evaluate goodness of fits, we measured R2 values and visually examined the observed data compared to the estimated curve for each subject, word type, and sound level. Second, a bootstrapping procedure was used to estimate the standard error of the mean at each time point. Third, these standard errors of the function were used to conduct 2 sample t-tests at every time point. Fourth, family-wise error was controlled with a modified Bonferroni corrected significance level which takes advantage of the inherent autocorrelation of the test statistics to avoid being overly conservative. For any significant differences between sound levels, we report the time window of significance, the mean difference across the time window, the autocorrelation of the test statistic, and the adjusted alpha value.
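To make the two families of fitted curves concrete, the sketch below gives one plausible parameterization in R that is consistent with the parameter descriptions in the text (asymptotes, crossover, and slope for the logistic; mean, onset and offset variances, peak, and baselines for the double Gaussian). It is an illustration, not the BDOTS implementation.

```r
# Illustrative curve families (not the BDOTS source code).

# Four-parameter logistic for target fixations: 'cross' is the time at which
# fixations are halfway between the asymptotes; 'slope' is the derivative there.
logistic4 <- function(t, lower, upper, cross, slope) {
  lower + (upper - lower) / (1 + exp(4 * slope * (cross - t) / (upper - lower)))
}

# Double Gaussian for competitor and unrelated fixations: rise to a peak at mu
# with onset variance sig1sq and baseline b1, then fall with offset variance
# sig2sq toward baseline b2.
double_gauss <- function(t, mu, sig1sq, sig2sq, peak, b1, b2) {
  ifelse(t <= mu,
         b1 + (peak - b1) * exp(-(t - mu)^2 / (2 * sig1sq)),
         b2 + (peak - b2) * exp(-(t - mu)^2 / (2 * sig2sq)))
}

time_ms <- seq(0, 2000, by = 4)   # time relative to word onset
plot(time_ms, logistic4(time_ms, lower = 0.05, upper = 0.85, cross = 700, slope = 0.002),
     type = "l", ylim = c(0, 1), xlab = "Time (ms)", ylab = "Proportion fixations")
lines(time_ms, double_gauss(time_ms, mu = 600, sig1sq = 2e4, sig2sq = 8e4,
                            peak = 0.25, b1 = 0.05, b2 = 0.08), lty = 2)
```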

2.2.2.1. Target fixations

Though the primary purpose of the paper was to examine the competition dynamics during soft speech recognition, it is important to determine whether the presence of soft speech also influences how readily target words are recognized. Indeed, there may be differences in the dynamics of competitor activation that have little effect on how fast listeners recognize words. That is, listeners could recognize soft speech just as quickly as conversational speech, but get there via a different route (i.e., by adapting the nature of competition as a flexible and efficient means of recognizing soft speech). Thus, it is crucial to first analyze the timing and magnitude of target fixations by sound level.

For analyses related to the timeseries of target fixations, we collapsed across all four trial types (TCRU, TCUU, TRUU, and TUUU). All 32 subjects were fit using the logistic function in a within-subject study design (see Fig. S1 in Supplemental materials for single-subject raw fixation curves to the target). The function was a four-parameter logistic with time on the x-axis, and separate parameters for the upper and lower asymptotes, the crossover (the time at which fixations were halfway between asymptotes), and the slope (the derivative at the crossover). In the fitting stage, all 96 curves (one curve per sound level for 32 subjects) had good fits (R2 ≥ 0.95).

The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 2 for full results). We found a significant difference in target fixations from 400 to 776 ms post-word onset, in which proportion fixations to the 50 dBA speech were significantly higher than 65 dBA speech. Further, from 412 to 2000 ms there was a significant difference in target fixations for 40 dBA speech compared to 50 dBA speech, in which looks were significantly higher to the 50 dBA speech. Finally, proportion looking to the target was significantly higher for 65 dBA speech compared to 40 dBA speech from 752 to 2000 ms. In line with our predictions, participants fixated the target faster and to a greater extent for 65 dBA speech compared to 40 dBA speech. However, participants were faster to fixate the target for 50 dBA speech compared to 65 dBA speech, though there was no difference in the asymptote of fixations (maximum fixations) (see Fig. 1).
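The bootstrapping and pointwise testing stages referred to above (steps two through four) can be sketched schematically as follows. The data are simulated so that the example runs, and the fixed alpha near the end is only a placeholder for the autocorrelation-based correction that BDOTS actually computes (Seedorff et al., 2018).

```r
# Schematic of the bootstrap-and-test logic (not the BDOTS implementation).
# cond_a and cond_b stand in for subjects x timepoints matrices of fitted
# fixation curves for two sound levels; they are simulated here.
n_subj  <- 32
time_ms <- seq(0, 2000, by = 4)
cond_a  <- t(replicate(n_subj, plogis((time_ms - 700) / 150) + rnorm(length(time_ms), sd = 0.02)))
cond_b  <- t(replicate(n_subj, plogis((time_ms - 760) / 150) + rnorm(length(time_ms), sd = 0.02)))

boot_se <- function(m, n_boot = 1000) {      # SE of the mean curve at each timepoint
  boots <- replicate(n_boot, colMeans(m[sample(nrow(m), replace = TRUE), ]))
  apply(boots, 1, sd)
}

se_a   <- boot_se(cond_a)
se_b   <- boot_se(cond_b)
t_stat <- (colMeans(cond_a) - colMeans(cond_b)) / sqrt(se_a^2 + se_b^2)

rho       <- acf(t_stat, plot = FALSE)$acf[2]  # autocorrelation of the test statistic
alpha_adj <- 0.01                              # placeholder; BDOTS derives this from rho
sig_bins  <- abs(t_stat) > qnorm(1 - alpha_adj / 2)
range(time_ms[sig_bins])                       # approximate window of difference
```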

2.2.2.2. Competitor fixations

For competitor fixations (cohort and rhyme) we use unrelated fixations as a baseline to control for differential levels of looking between sound levels (see Fig. 2 for the average raw fixations to cohorts, rhymes, and unrelated items, and Fig. S2 in Supplementary materials for single-subject raw fixations to the competitors and unrelated items). For trials in which two unrelated items were on the screen (i.e., TCUU and TRUU trials), we took the average. We use the difference between competitor and unrelated fixations as an estimate of competitor activity and compare these competitor-unrelated differences between sound levels. For analyses related to competitor looks we fit separate double Gaussian functions to the components of the difference curves (competitor – unrelated baseline), and then evaluate these difference curves between sound levels. The following describes the parameters of the double Gaussian function: μ = mean for each of the individual normal distributions; σ1² = variance for the left-side normal distribution (essentially the onset slope); σ2² = variance for the right-side normal distribution (the offset slope); P = peak height; B1 = baseline for the left-side normal distribution; and B2 = baseline for the right-side normal distribution.

For cohort fixations, 32 subjects were fit using the double Gaussian function in a within-subject design. We compared each sound level (3 comparisons), resulting in 192 fits in all (32 participants × 2 word types [cohort and unrelated] × 3 sound levels [40, 50, 65 dB]). In the fitting stage, 120 curves were good fits (R2 ≥ 0.95), 58 curves had reasonable fits (R2 ≥ 0.8), and 14 curves had fits that were below R2 = 0.8 (though no fit was < 0.70). All 14 curves that had fits of R2 < 0.80 represented fixations to the unrelated item, which were by design meant to be low. To illustrate, listeners have no reason to maintain fixations to unrelated items as these items contain no phonological overlap with the target word. The result is curves with decreased kurtosis (i.e., relatively flat curves). After multiple refits, and upon visual inspection of the observed data compared to the estimated

Table 2. Results of timeseries analyses (BDOTS output) for Experiment 1 by word type (Target, Cohort, Rhyme) and sound level.

Comparison | Significant time window(s) | Direction of effect (mean difference) | Autocorrelation | Adjusted alpha
Target
65 v 50 | 400–776 | 50 > 65 (0.01) | 0.999 | 0.011
50 v 40 | 412–2000 | 50 > 40 (0.008) | 0.999 | 0.012
65 v 40 | 752–2000 | 65 > 40 (0.006) | 0.999 | 0.02
Cohort
65 v 50 | – | – | – | –
50 v 40 | 408–824; 1016–2000 | 50 > 40 (0.01); 40 > 50 (0.004) | 0.996 | 0.003
65 v 40 | 384–860; 1012–2000 | 65 > 40 (0.01); 40 > 65 (0.005) | 0.998 | 0.005
Rhyme
65 v 50 | 204–768; 1168–1240 | 65 > 50 (0.03); 50 > 65 (0.003) | 0.987 | 0.001
50 v 40 | 208–756; 868–1200 | 40 > 50 (0.02); 40 > 50 (0.004) | 0.993 | 0.002
65 v 40 | 256–640; 880–2000 | 65 > 40 (0.02); 40 > 65 (0.007) | 0.996 | 0.003

Fig. 1. Target fixations (Experiment 1: Randomized and Interleaved Presentation). A) Proportion fixations to the target image (after curve fitting) across the four trial types (TCRU, TCUU, TRUU, TUUU) by sound level (40, 50, 65 dBA). B) Average proportion fixations to the target by sound level for the three time windows of significance.

Fig. 2. Raw fixations (before curve fitting) to cohorts, rhymes, and unrelated items by sound level for the randomized and interleaved presentation. Shaded regions represent standard error.

curve it was determined that despite the lower R2 value for some unrelated fixations, we obtained reasonable fits for these data. The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 2 for full results). There were no significant differences in cohort fixations to 50 dBA compared to 65 dBA speech. For the two levels of soft speech (50 dBA and 40 dBA), we found two regions of significance. From 408 to 824 ms, cohort fixations were significantly higher for 50 dBA compared to 40 dBA speech; however, from 1016 to 2000 ms fixations to 40 dBA speech were higher. Finally, we compared cohort fixations for conversational speech (65 dBA) and very soft speech (40 dBA). Similar to the results of the 50 dBA and 40 dBA comparison, we found regions of significance from 384 to 860 ms and 1012–2000 ms, in which cohort activation for conversational speech was initially higher, though later in processing

activation was higher for 40 dBA speech (see Fig. 3). For time series analyses of rhyme fixations the same 32 subjects were fit using the double Gaussian function in a within-subject design. Again, we compared each sound level (3 comparisons in all), resulting in 192 fits (32 participants × 2 word types [cohort and unrelated] × 3 sounds levels [40, 50, 65 dB]). In the fitting stage, 109 curves were good fits (R2 ≥ 0.95), 69 curves had reasonable fits (R2 ≥ 0.8), 12 curves had fits that were below R2 = 0.8 (though no fits were < 0.70), and 1 subject was dropped due to poor fitting (R2 < 0.70) in at least one of their curves. The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 2). There was a significant difference in rhyme fixations between 65 dBA and 50 dBA speech from 204 to 768 ms, in which proportion fixations were higher for 65 dBA speech,

though from 1168 to 1240 ms proportion fixations were higher for 50 dBA speech. Further, rhyme fixations were significantly higher for 40 dBA compared to 50 dBA speech from 208 to 756 ms, and again from 868 to 2000 ms. Finally, from 256 to 640 ms rhyme fixations were significantly higher for 65 dBA speech compared to 40 dBA speech; however, from 880 to 2000 ms, 40 dBA speech demonstrated significantly higher rhyme fixations than did 65 dBA speech.

Fig. 3. Competitor fixations (Experiment 1: Randomized and Interleaved presentation). Difference in proportion fixations (after curve fitting) to cohorts-unrelated (A) and rhymes-unrelated (B) by sound level (40, 50, 65 dBA).
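For readers who want to see what the competitor-minus-unrelated baseline correction amounts to, a simplified sketch follows. Column names and values are hypothetical, and the difference is computed on raw fixation proportions, whereas the reported analysis fits the double Gaussians first and evaluates the difference curves within BDOTS.

```r
# Simplified sketch of the competitor-baseline difference (toy data; in TCUU
# and TRUU trials the two unrelated items would first be averaged).
fix <- expand.grid(subject = factor(1:32),
                   dBA     = c(40, 50, 65),
                   time_ms = seq(0, 2000, by = 4))
fix$p_cohort    <- runif(nrow(fix), 0, 0.30)   # proportion fixations to the cohort
fix$p_unrelated <- runif(nrow(fix), 0, 0.10)   # proportion fixations to unrelated items

fix$cohort_effect <- fix$p_cohort - fix$p_unrelated
effect_by_level   <- aggregate(cohort_effect ~ dBA + time_ms, data = fix, FUN = mean)
head(effect_by_level)
```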

2.3. Discussion The results of Experiment 1 replicate previous research that shows a preference for onset over offset overlap competitors. Indeed, across all three sound levels, listeners showed more activation for cohorts than rhyme competitors. However, there were condition specific differences in the pattern of activation. For soft speech, activation for both cohort and rhyme competitors was reduced, and this reduction in activation was particularly pronounced for cohorts. The cohort results are in line with previous research that suggests that the more degraded the speech signal; the more likely listeners are to reduce activation for cohort competitors (McMurray et al., 2017). The rhyme effects are somewhat unexpected because previous research has shown that listeners increase rhyme activation when presented with highly degraded input (McMurray et al., 2017). Thus, one might expect more rhyme activation for 40 dBA compared to 65 dBA speech. However, in Experiment 1 we obtained the opposite effect and rhyme activation for the 65 dBA condition was larger than expected. Indeed, previous research has shown smaller rhyme effects than those obtained in Experiment 1 for spoken words presented at a conversation level. For instance, Farris-Trimble, McMurray, Cigrand, and Tomblin (2014) found a difference in peak height (i.e., maximum fixations) to rhymes - unrelated items of 0.03; in the current experiment the difference in peak height for 65 dBA speech is nearly twice that, (0.055). However, it is crucial to note that previous research was concerned with word recognition in optimal listening conditions and as a result all stimuli were presented at a conversational level (between 65 dB and 70 dB). Thus, there was no uncertainty as to which speech sound level may come next. In Experiment 1, conversational speech and levels of soft speech were randomly interleaved, such that listeners could not predict sound level on any given trial. As previously mentioned, speech-extrinsic factors, such as presenting conditions in an interleaved fashion, have been shown to increase competitor activation for more canonical or typical inputs. Thus, the finding that 65 dBA speech demonstrated robust rhyme (and cohort) activation when interleaved with soft speech corroborates previous research that shows heightened activation for canonical phonological competitors when interleaved with reduced inputs (Brouwer et al., 2012). A complementary argument could also help explain the target fixation results (i.e., faster fixations to the target for 50 dBA compared to 65 dBA speech). Listeners may have had difficulty processing conversational speech when interleaved with soft speech, which slowed the rate of looking to the target. This interpretation is in line with the desirable difficulty effect (Bjork, 1994). Listeners may have been more engaged in the task when the sound level was interleaved because they needed to be prepared for hearing speech at lower intensities. Indeed, research has shown that if information challenges the listener, processing may be enhanced. From this view, soft speech may have engaged more cognitive resources, and listeners may have slightly disengaged in the task during conversational speech. To test the hypothesis that presenting soft speech and conversational speech in an interleaved fashion influenced the pattern of results for targets and competitors, we conducted a follow-up experiment. 
Experiment 2 was identical to Experiment 1, with one crucial exception: sound level was presented in a blocked fashion. This change allowed us to analyze the target fixations and competition dynamics of each sound level without interference from the other sound levels, thus creating a highly predictable listening environment.

If the pattern of results in Experiment 1 were partly due to the interleaved presentation, we expect the following results. First, consistent with prior research on the influence of speech-extrinsic factors in word recognition (Brouwer et al., 2012; Smith & McMurray, 2018), we predict a relative decrease in cohort and rhyme activation for optimal speech (65 dBA speech) in Experiment 2. Second, given that the task in the blocked design will be less challenging to the listener, we expect attention will not modulate effects of target fixations to the same degree as was observed in Experiment 1. Thus, for Experiment 2, we predict listeners will be fastest to recognize the target word for 65 dBA speech, followed by 50 dBA speech, and finally, 40 dBA speech. To help further tease apart the competition dynamics of conversational and soft speech, in Experiment 2 we also examined effects of word length (mono- vs. bi-syllabic words).

3. Experiment 2 3.1. Methods 3.1.1. Participants A different group of 27 undergraduate participants were recruited from the University of Iowa psychology participant pool. As in Experiment 1, all participants were monolingual English speaking and had normal hearing and normal or corrected-to-normal vision. The participants ranged from 18 to 21-years-old (M = 19.11 years). Participants either received academic credit for a university level psychology course or monetary compensation. All participants signed an informed consent document before participating in the study. 3.1.2. Design and stimuli Audio and visual stimuli were identical to Experiment 1. The study design was identical to Experiment 1, with one significant change. Instead of presenting sound level in an interleaved fashion, listeners were presented with each sound level in a blocked design. To minimize the potential of order effects, the order of the sound levels was counterbalanced across participants such that each sound level appeared equally in every presentation order. 3.2. Results 3.2.1. Accuracy and latency Accuracy and reaction time to click the target image (measured from word onset) were calculated for each subject and sound level. Two separate two-way ANOVAs were conducted to compare the effect of sound level and trial type (TCRU, TCUU, TRUU, TUUU) on response accuracy (logit transformed) and reaction time (for correct trials only), respectively. There was a main effect of sound level [F(2, 52) = 15.75, p < .001]. Just as in Experiment 1, participants were highly accurate at selecting the target image at all sound levels, though slightly less so for 40 dBA speech: 65 dBA (99.6%), 50 dBA (99.6%) and 40 dBA (98.8%). Further, there was a significant main effect of trial type [F(3, 78) = 5.81, p = .001]. Again, as in Experiment 1 participants were more accurate at selecting the target on trials in which a competitor was not on the screen: TCUU (99.5%), TCRU (98.9%), TRUU (99.1%), and TUUU (99.7%). Unlike Experiment 1, there was a significant sound level × trial type interaction [F(6, 156) = 2.61, p = .002]. This interaction appears to be mainly driven by the significantly lower accuracy for the 40 dBA condition compared to 50 and 65 dBA during TCRU and TRUU trials. For reaction time, there was a main effect of trial type [F (3, 78) = 18.77, p < .001], such that participants were faster at responding on trials without competitors: TCUU (1244 ms), TCRU (1236 ms), TRUU (1225 ms), and TUUU (1195 ms). However, there was no main effect of sound level [F(2, 752) = 2.26, p = .14], and no significant sound level × trial type interaction [F(6, 156) = 1.03, p = .41]. 7

Table 3. Results of timeseries analysis (BDOTS output) for Experiment 2 by word type (Target, Cohort, Rhyme) and sound level.

Comparison | Significant time window(s) | Direction of effect (mean difference) | Autocorrelation | Adjusted alpha
Target
65 v 50 | 520–2000 | 65 > 50 (0.02) | 0.999 | 0.01
50 v 40 | 500–1012 | 50 > 40 (0.02) | 0.999 | 0.01
65 v 40 | 444–2000 | 65 > 40 (0.02) | 0.998 | 0.007
Cohort
65 v 50 | 336–568; 792–2000 | 50 > 65 (0.006); 65 > 50 (0.004) | 0.991 | 0.002
50 v 40 | 236–668; 784–855; 1044–1280 | 50 > 40 (0.02); 40 > 50 (0.004); 40 > 50 (0.005) | 0.984 | 0.001
65 v 40 | 200–700 | 65 > 40 (0.02) | 0.985 | 0.001
Rhyme
65 v 50 | 344–648; 732–1284 | 65 > 50 (0.01); 50 > 65 (0.007) | 0.987 | 0.001
50 v 40 | 356–460; 552–2000 | 40 > 50 (0.003); 40 > 50 (0.006) | 0.986 | 0.001
65 v 40 | 428–580; 648–2000 | 65 > 40 (0.01); 40 > 65 (0.009) | 0.983 | 0.001


3.2.2. Time course

The procedures for analyzing the time series data were identical to those described in Experiment 1.

3.2.2.1. Target fixations

For analyses related to the timeseries of target fixations, we collapsed across all four trial types (TCRU, TCUU, TRUU, and TUUU) (see Fig. S3 in Supplemental materials for single-subject raw fixation curves to the target). All 27 subjects were fit using a four-parameter logistic function in a within-subject study design as described in Experiment 1. In the fitting stage, all 81 curves (one curve per sound level for 27 subjects) had good fits (R2 ≥ 0.95).

The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 3 for full results). First, we compared target curves for 65 dBA and 50 dBA speech. We found a significant difference from 520 to 2000 ms in which proportion fixations to 65 dBA speech were significantly higher. Further, proportion looking to the target was significantly higher for 50 dBA compared to 40 dBA speech from 500 to 1012 ms. Finally, there was a significant difference in proportion fixations to the target in 65 dBA speech compared to 40 dBA speech from 444 to 2000 ms, in which looking to the target was greater for 65 dBA speech. As predicted, we see a pattern in which participants looked to the target image faster and to a greater extent for conversational speech compared to speech at lower intensities (see Fig. 4).

3.2.2.2. Competitor fixations. Again, as in Experiment 1, for competitor fixations (cohort and rhyme) we use unrelated fixations as a baseline to control for differential levels of looking between sound levels (see Fig. 5 for raw fixations to cohorts, rhymes, and unrelated items, and Fig. S4 in Supplementary materials for single-subject raw fixations to the competitors and unrelated items). For cohort fixations, as in Experiment 1, 27 subjects were fit using the double Gaussian function in a within-subject design. We compared each sound level (3 comparisons), resulting in 162 fits in all (32 participants × 2 word types [cohort and unrelated] × 3 sound levels [40, 50, 65 dB]). In the fitting stage, 113 curves were good fits (R2 ≥0.95), 38 curves had reasonable fits (R2 ≥ 0.8), 10 curves had fits that were below R2 = 0.8 (no fit was < 0.70), and 1 fit was dropped from the 50 dBA condition. The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 3). Though the overall pattern of looking appeared similar for 65 dBA and 50 dBA speech, there were two time windows that showed condition specific differences. We found a significant difference from 336 to 568 ms (50 dBA > 65 dBA) and from 792 to 2000 ms (65 dBA > 50 dBA). Next, we compared cohort activation for the two levels of soft speech (50 dBA and 40 dBA). We found regions of significance from 236 to 668 ms (50 dBA > 40 dBA), 784 to

844 ms (40 dBA > 50 dBA), and 1044 to 1280 ms (40 dBA > 50 dBA). Finally, there was a significant difference between 65 dBA and 40 dBA speech from 200 to 700 ms (65 dBA > 40 dBA) (see Fig. 6).

For timeseries analyses of rhyme fixations the same 27 subjects were fit using the double Gaussian function in a within-subject design. Again, we compared each sound level (3 comparisons in all), resulting in 162 fits in all (27 participants × 2 word types [cohort and unrelated] × 3 sound levels [40, 50, 65 dB]). In the fitting stage, 89 curves were good fits (R2 ≥ 0.95), 65 curves had reasonable fits (R2 ≥ 0.8), 8 curves had fits that were below R2 = 0.8 (though none were < 0.70), and 1 subject was dropped due to poor fitting in at least one of their curves (R2 < 0.70) (see Fig. 6). The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 3). There was a significant difference in rhyme fixations between 65 dBA and 50 dBA speech from 344 to 648 ms (65 dBA > 50 dBA), and 732 to 1284 ms (50 dBA > 65 dBA). Further, we found a significant difference between the two soft speech levels (40 dBA and 50 dBA) from 356 to 460 ms and 552 to 2000 ms in which there were more fixations to 40 dBA speech. Finally, there were two time windows of significance for the 40 dBA and 65 dBA comparison: 428–580 ms (65 dBA > 40 dBA) and 648–2000 ms (40 dBA > 65 dBA).

To further tease apart the influence of soft speech on the dynamics of word recognition, we analyzed competitor effects separately for mono- and bi-syllabic words. Fig. 7 shows the raw fixations (before curve fitting) to mono- and bi-syllabic cohorts, rhymes, and unrelated items by sound level. For cohorts, activation for mono- and bi-syllabic words for 65 dBA speech was similar; however, for the two levels of soft speech, cohort activation was greater for bi-syllabic compared to mono-syllabic words. For rhymes, consistent with recent research (Simmons & Magnuson, 2018), bi-syllabic words elicited more activation than mono-syllabic words across sound level, though the precise effect of syllable length on rhyme competition is heavily influenced by sound level.

To statistically test the effect of syllable length and sound level on competition, we compared each sound level separately for mono- and bi-syllabic words, again using BDOTS (see Tables 4 & 5 for results). Mono-syllabic cohorts demonstrated a graded pattern of effects, such that cohort activation was highest for 65 dBA speech, followed by 50 dBA, and finally, 40 dBA speech (see Fig. 8). For mono-syllabic rhymes, the 65 dBA condition demonstrated the greatest levels of activation, though in a mid-latency time window, the two levels of soft speech exhibited more rhyme activation than did conversational speech. For bi-syllabic words a different pattern of effects emerged for both cohorts and rhymes. For cohorts, activation for the 50 dBA condition was greatest, and the 65 dBA and 40 dBA conditions displayed similar levels of activation. For rhymes, 40 dBA speech demonstrated greater activation than both 65 dBA and 50 dBA speech, and late in processing a graded pattern emerged (40 dBA > 50 dBA > 65 dBA).

Fig. 4. Target fixations (Experiment 2: Blocked presentation). A) Proportion fixations to the target image across the four trial types (TCRU, TCUU, TRUU, TUUU) by sound level (40, 50, 65 dBA). B) Average proportion fixations to the target by sound level for the three time windows of significance.

Fig. 5. Raw fixations to cohorts, rhymes, and unrelated items by sound level for blocked presentation. Shaded region represents the standard error.

Fig. 6. Competitor fixations (Experiment 2: Blocked presentation). Difference in proportion fixations (after curve fitting) to cohorts-unrelated (A) and rhymes-unrelated (B) by sound level (40, 50, 65 dBA).


3.3. Discussion

In Experiment 2, we presented sound level in a blocked fashion to determine how the predictability of the listening environment influenced the dynamics of lexical competition across sound level. As in Experiment 1, there was a significant decrease in cohort activation for 40 dBA speech compared to both 50 dBA and 65 dBA speech. However, unlike Experiment 1, 65 dBA speech demonstrated less cohort activation than 50 dBA speech. For rhymes, there was also a relative decrease in activation for the 65 dBA condition, and rhyme activation was greatest for 40 dBA speech. Further, target fixations displayed the expected pattern of effects: listeners were fastest at fixating the target for 65 dBA speech, followed by 50 dBA speech, and finally 40 dBA speech. Thus, when sound level was presented in a blocked fashion, listeners became faster at fixating the target as sound level increased, and this was accompanied by a relative decrease in phonological competition for conversational speech.

Fig. 7. Raw fixations to competitors (and unrelated items) by sound level (40, 50, 65 dBA), and word length (mono- vs. bi-syllabic).


Table 4 Fit quality for BDOTS analysis of word length (mono-and bi-syllabic), word type (Cohort, Rhyme), and sound level (40, 50, and 65dBA).

4. Cross experiment comparisons

BDOTS output Word type

MONO-syllabic Cohort 65 v 50 50 v 40 65 v 40 Rhyme 65 v 50 50 v 40 65 v 40 BI-syllabic Cohort 65 v 50 50 v 40 65 v 40 Rhyme 65 v 50 50 v 40 65 v 40

The pattern of results across Experiments 1 and 2 demonstrate that manipulating the predictability of a listening environment can change the dynamics of competition for conversational and soft speech. To further probe how the dynamics of competition change as a function of listening environment, we directly compared the time course of target, cohort, and rhyme fixations at each sound level across experiments. To obtain a detailed characterization of the time course of fixations across experiment we again used Bootstrapped Differences of Timeseries (BDOTS; a statistical package within R, Version 1.1.456, R Core Team, 2014). For these analyses, the same curve fits reported in Experiments 1 and 2 were used to compare fixations for the same sound level across experiment (see Section 2.2.2 for description of the bootstrapping and multiple comparison correction procedures). First, we compared target fixations across experiments within each sound level (see Fig. 9). For the 65 dBA condition target fixations for the blocked presentation were significantly higher than the interleaved presentation from 512 to 2000 ms. The pattern of effects for the two levels of soft speech were similar across experiments. In an early and short-lived time window fixations to 40 dBA speech (from 400 to 552 ms) and 50 dBA speech (from 352 to 560 ms) were higher in the interleaved presentation, though soon after, this effect flipped, as target


5. General discussion

Table 5
Results of timeseries analysis for Experiment 2 by word length (mono- and bi-syllabic), word type (Cohort, Rhyme), and sound level (40, 50, 65 dBA).

Word type       Comparison    Significant time window(s)      Direction of effect          Autocorrelation    Adjusted alpha
MONO-syllabic
  Cohort        65 v 50       400–764                         65 > 50                          0.994              0.003
  Cohort        50 v 40       284–300, 388–560, 1104–2000     50 > 40, 50 > 40, 40 > 50        0.998              0.008
  Cohort        65 v 40       368–888                         65 > 40                          0.994              0.003
  Rhyme         65 v 50       436–692, 860–956, 1496–2000     65 > 50, 50 > 65, 65 > 50        0.995              0.003
  Rhyme         50 v 40       660–812                         40 > 50                          0.998              0.006
  Rhyme         65 v 40       424–596, 648–984                65 > 40, 40 > 65                 0.995              0.003
BI-syllabic
  Cohort        65 v 50       388–640                         50 > 65                          0.995              0.003
  Cohort        50 v 40       292–624, 772–876                50 > 40, 40 > 50                 0.994              0.003
  Cohort        65 v 40       268–356, 1184–2000              65 > 40, 40 > 65                 0.998              0.008
  Rhyme         65 v 50       276–380, 1236–2000              65 > 50, 50 > 65                 0.997              0.005
  Rhyme         50 v 40       284–488, 592–1084               40 > 50, 40 > 50                 0.995              0.003
  Rhyme         65 v 40       368–464, 576–2000               40 > 65, 40 > 65                 0.997              0.004

The goal of the present study was to investigate the temporal dynamics underlying the recognition of soft speech. We had two primary aims. First, we used eye-tracking in the VWP to examine how soft speech influences lexical access (i.e., the speed of word recognition) and lexical competition (i.e., what word types compete for recognition and when in processing competition occurs). For this aim we obtained precise information regarding how fixations to the target (e.g., candle) and phonological competitors (cohorts, e.g., candy; rhymes, e.g., handle) unfold over time. Second, we examined whether and to what extent lexical access and competition for soft speech are affected by speech-extrinsic factors (the uncertainty in the listening environment). For this aim we manipulated sound level presentation across two experiments. In Experiment 1, we randomly interleaved different sound levels of speech, whereas for Experiment 2 we presented sound level in a blocked design. Overall, we found that both speech-intrinsic information (the sound level of the speech input) and speech-extrinsic information (whether there was uncertainty in the listening environment) influenced the dynamics of lexical access and competition.

In Experiment 1, listeners fixated the target fastest for 50 dBA speech, followed by 65 dBA speech, and finally 40 dBA speech. Based on the desirable difficulty effect (Bjork, 1994), listeners may have had difficulty processing conversational speech when interleaved with soft speech, which slowed the rate of looking to the target. Another, complementary interpretation is that when sound levels were interleaved, listeners considered phonological competitors to a greater extent for conversational speech, which slowed down the speed of target recognition (Brouwer et al., 2012). Indeed, in Experiment 1, when sound level was presented in an interleaved fashion, phonological activation was greatest for conversational speech. Specifically, activation for cohort competitors demonstrated a graded effect such that activation was greatest for conversational speech, followed by slightly soft speech, and finally very soft speech. For rhyme competitors there was also a graded effect, in which activation was again greatest for conversational speech; however, unlike cohort competitors, very soft speech showed more rhyme activation than did slightly soft speech. Thus, in Experiment 1, when presented with two levels of soft speech randomly interleaved with conversational speech, listeners showed the most cohort and rhyme activation during conversational speech, and reduced activation for both competitor types for soft speech (Fig. 2).

As previously mentioned, research has shown that introducing uncertainty in a listening environment by interleaving conditions affects the nature of speech recognition and lexical competition (Brouwer et al., 2012; Brouwer & Bradlow, 2014). In a VWP task, Brouwer and colleagues presented canonical forms (i.e., optimal speech) and reduced forms (i.e., forms with missing segments) in an interleaved and a blocked fashion. They found that in the interleaved presentation, canonical phonological competitors were more strongly activated than in the blocked presentation. A similar result was found for different levels of vocoded speech (Smith & McMurray, 2018).
In Experiment 1 of the current study, when sound level was interleaved, the activation for both cohorts and rhymes was greatest for conversational speech, and rhyme effects were particularly large (rhyme-unrelated peak height difference = 0.055). Previous studies using conversational speech presented in a blocked fashion have found much smaller rhyme effects (nearly half those reported in Experiment 1 of the current study; Farris-Trimble et al., 2014). Thus, the results from Experiment 1 begin to corroborate Brouwer and colleagues' finding that, when interleaved with reduced speech, optimal input shows more phonological competition.

To evaluate this claim, for Experiment 2 we presented sound level in a blocked fashion. Just as in Experiment 1, there was a significant decrease in cohort activation for 40 dBA speech compared to both 50 dBA and 65 dBA speech. However, unlike Experiment 1, cohort activation for 65 dBA and 50 dBA speech flipped.




Fig. 8. Difference in proportion fixations (after curve fitting) to cohorts-unrelated and rhymes-unrelated by sound level (40, 50, 65 dBA) and word length (mono- vs. bi-syllabic).

Thus, when sound level was blocked, 65 dBA speech demonstrated less cohort activation than did 50 dBA speech. For rhymes, there was also a relative decrease in activation for the 65 dBA condition, and rhyme activation was greatest for 40 dBA speech. Further, target fixations displayed the expected pattern of effects: listeners were fastest at fixating the target for 65 dBA speech, followed by 50 dBA speech, and finally 40 dBA speech. Thus, when sound level was presented in a blocked fashion, listeners were faster at fixating the target as sound level increased.

To further probe how the dynamics of competition change as a function of listening environment, we directly compared the time course of target, cohort, and rhyme fixations at each sound level across experiments. For target fixations, listeners were faster at recognizing the target word in the blocked compared to the interleaved presentation across all 3 sound levels. Further, consistent with Brouwer and colleagues' findings, we found that optimal speech input (65 dBA speech) demonstrated more cohort and rhyme activation in the interleaved compared to the blocked presentation. Interestingly, we see a similar pattern for 40 dBA speech (more competitor activation in the interleaved compared to the blocked design). For 50 dBA speech, by contrast, the pattern of results appears quite similar across experiments.

We consider two interpretations of the cross-experiment findings most likely. First, as discussed in detail previously, hearing conversational speech interleaved with reduced speech may have caused listeners to accept acoustic mismatches more readily (Brouwer et al., 2012). From this view, interleaving optimal with suboptimal speech caused phonological competitors for optimal input to become stronger because the acoustic mismatches were considered less egregious. Indeed, Brouwer and colleagues found heightened competitor activation for canonical inputs only when such inputs were interleaved with reduced input, as opposed to fully articulated but casual input.

Fig. 9. Cross study comparison of target fixations. Fixations to the target by Experiment ([EXP1] Interleaved vs. [EXP 2] Blocked) and sound level (40, 50, 65 dBA).


Fig. 10. Cross study comparison of competitors. Difference in competitor-unrelated fixations by Experiment ([EXP1] Interleaved vs. [EXP 2] Blocked) and sound level (40, 50, 65 dBA) for cohorts (a–c) and rhymes (d–f).

An alternative, though perhaps not mutually exclusive, interpretation is that the interleaved condition was simply more cognitively taxing. Several studies have examined how cognitive processes interact with speech recognition accuracy in adverse listening conditions, mainly background noise (see Mattys et al., 2012, for a review). Such studies suggest that listeners may have allocated more listening effort in the interleaved presentation because the task required more cognitive resources (Fraser, Gagné, Alepins, & Dubois, 2010; Gosselin & Gagné, 2011; Hicks & Tharpe, 2002; Zekveld, Kramer, & Festen, 2010). Thus, listeners may have been more engaged in the task when the sound level was interleaved because they needed to be prepared for hearing speech at lower intensities. If listeners were focused on processing the more difficult conditions, they may have been less sensitive to speech at a conversational level (Wu, Stangl, Zhang, Perkins, & Eilers, 2016).

Taken together, the results from Experiments 1 and 2 fit nicely with previous research on the importance of speech-extrinsic factors in lexical access and competition. These results support the notion that spoken word recognition is a highly adaptable and tunable cognitive process, as listeners adopt different approaches to managing temporary ambiguity depending on the nature of the listening environment, both in terms of the type of acoustic input (i.e., sound level) and the degree of overall uncertainty in what may come next (McMurray, Ellis, & Apfelbaum, 2019; Smith & McMurray, 2018).

Given that previous studies and the current study have found that uncertainty in the listening environment can modulate speech-intrinsic lexical access and competition, coupled with the fact that the majority of previous research using the VWP has used blocked designs, we further discuss the results of Experiment 2 with regard to how soft speech influences in-the-moment lexical competition. For slightly soft speech, lexical competition appears similar to conversational speech, with some notable exceptions. For cohort competitors, listeners demonstrate a slight increase in activation, whereas for rhyme competitors, activation starts later and is sustained longer in processing. This is consistent with previous research suggesting that, given mildly distorted or reduced input (e.g., accented speech, background noise), activation is quite similar to optimal input, although listeners may maintain partial consideration of phonological competitors (Brouwer & Bradlow, 2016; Farris-Trimble et al., 2014). Similar to the results of the 50 dBA condition, normal-hearing listeners presented with CI-simulated speech (meant to simulate the speech faced by adult post-lingually deaf CI users) sustain activation for rhyme but not cohort competitors (Farris-Trimble et al., 2014).


These results suggest that listeners were less certain when they heard 50 dBA speech compared to 65 dBA speech, and because this uncertainty lasted late in processing, there exists a persistent processing cost for slightly soft speech even on correctly identified trials (Brouwer & Bradlow, 2016). This profile of sustained competitor activation may be particularly useful in cases where a mistake was made and a recently rejected candidate must be reactivated as the target (Clopper, 2014; Luce & Cluff, 1998; McMurray, Tanenhaus, & Aslin, 2009).

Conversely, for very soft speech (40 dBA), the pattern of effects is different. Listeners demonstrated a decrease in cohort activation, and a corresponding increase in rhyme activation. These results are in line with a wait-and-see approach, in which listeners wait to activate lexical items until more input has accrued (McMurray et al., 2017). This wait-and-see approach minimizes lexical ambiguity by waiting to engage lexical candidates until a point later in the word, at which there are fewer possible candidates. To illustrate, a typical competition pattern (similar to that observed for 65 dBA speech) is to activate a range of lexical candidates immediately, and to the extent they match the incoming signal. Crucially, this typical pattern heavily weighs word-initial overlap. What results is robust cohort activation and short-lived, but significant, rhyme activation. Conversely, the wait-and-see approach heavily downweighs word-initial overlap and gives more importance than is typical to word-final overlap, because listeners are waiting until more lexical context is encountered. This pattern, in which listeners reduce cohort activation and increase rhyme activation, has been observed in more egregious adverse listening environments (e.g., when noise replaces phonemes; McQueen & Huettig, 2012) and for populations who have severely degraded input (e.g., prelingually deaf CI users; McMurray et al., 2017). It must be noted that in the current study, the degree to which listeners wait to activate competitors is less than what is seen in CI users. This may reflect that the chosen soft speech conditions were too easy for listeners (accuracy was very high in all conditions). With lower sound levels, it might be more clear-cut whether listeners indeed use a wait-and-see approach when listening to a suboptimal signal.

It is important to note that both levels of soft speech demonstrated some degree of sustained rhyme activation when compared to conversational speech. This suggests that rhymes are particularly susceptible to any degree of signal reduction. A similar finding has been reported for noise-vocoded speech (Farris-Trimble & McMurray, 2013). Farris-Trimble and colleagues explain this finding in terms of when disambiguating information is encountered. Cohort competitors eventually receive disambiguating information as to word identity (after the medial vowel). Conversely, once rhyme competitors are active, they are indistinguishable from the target word until after the word is heard. Therefore, during soft speech, activation builds up for rhymes and is sustained because there is no information in the speech signal to rule them out.

Finally, we conducted a follow-up analysis in Experiment 2 examining the influence of sound level on lexical competition separately for mono- and bi-syllabic words.

For mono-syllabic words, cohort and rhyme competition was weaker for soft speech than for conversational speech. Previous research on lexical competition in adverse listening conditions has shown increased or sustained activation for rhyme competitors (Brouwer & Bradlow, 2016; McQueen & Huettig, 2012). However, these previous studies mainly used bi-syllabic words, which have been shown to demonstrate more robust and consistent rhyme competition than mono-syllabic words (Simmons & Magnuson, 2018). Indeed, we replicate this effect in the current study; across sound level, rhyme competition was greater for bi- compared to mono-syllabic words. The overall pattern of results for bi-syllabic words was quite different from that for mono-syllabic words: competition (cohort and rhyme) for bi-syllabic words was either commensurate with conversational speech or significantly greater.

What explains these differential results for mono- and bi-syllabic words by sound level? The current results may be accounted for by changes in the short- and long-word biases seen in computational models of speech perception (i.e., TRACE; McClelland & Elman, 1986). Early in processing, there is a bias for short words to compete for recognition, whereas later in processing competition favors longer words. These biases are the result of lateral inhibition from overlapping words; bi-syllabic words overlap with more words, and as a result receive more inhibition early on. However, later in processing, mono-syllabic words have largely been ruled out, and thus there is a later-occurring bias for longer words. Consistent with a wait-and-see approach, for very soft speech (i.e., 40 dBA), listeners appear to relax this early short-word bias (i.e., there is a decrease in cohort activation for mono-syllabic words), and increase the later-occurring long-word bias (i.e., there is an increase in rhyme activation for bi-syllabic words).
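The short- and long-word biases described above can be illustrated with a toy interactive-activation loop (a simplified sketch in Python, not the TRACE implementation itself; the parameter values, update rule, and function name are our own assumptions). Each word's activation grows with its bottom-up input and is suppressed by the summed activation of its competitors, so a word with many overlapping competitors accrues activation more slowly early in processing.

```python
import numpy as np

def simulate_competition(inputs, inhibition=0.3, decay=0.1, steps=40):
    """Toy lateral-inhibition dynamics (illustrative only).
    inputs: dict mapping word -> array of bottom-up support per time step (length >= steps)."""
    words = list(inputs)
    act = {w: 0.0 for w in words}
    history = {w: [] for w in words}
    for step in range(steps):
        snapshot = sum(act.values())  # competitors' activation from the previous step
        for w in words:
            others = snapshot - act[w]
            act[w] = max(0.0, act[w] + inputs[w][step] - inhibition * others - decay * act[w])
            history[w].append(act[w])
    return history

# e.g., giving a hypothetical "bi-syllabic" target many onset-overlapping competitors slows its
# early rise, while a "mono-syllabic" target with few competitors dominates early in the simulation.
```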
In combination with previous research, the current results help refine models of lexical processing, as these models were not initially meant to account for a range of adverse listening conditions. Indeed, a major obstacle to traditional psycholinguistic theories of speech recognition is that they may underestimate speech-extrinsic factors (such as the nature of the listening environment), cognitive adaptability, and even the role of attention (Mattys et al., 2012).

From an applied perspective, the current results may serve as a model for how individuals who are hard of hearing recognize spoken words in real time. For instance, adapting the dynamics of competition from immediate competition to a wait-and-see approach could be adaptive for individuals who are hard of hearing. To illustrate, if the speech signal is less accessible, it may be detrimental to adopt an immediate competition approach: if listeners are not exactly sure what they heard, robustly activating multiple lexical candidates as soon as input arrives could cause activation to spread across too many lexical items. Because more lexical candidates are activated, there are more lexical items to inhibit, which may delay word recognition. A wait-and-see approach, on the other hand, allows listeners to accrue more lexical context, which reduces overall competition. Though determining whether the current results are indeed adaptive is beyond the scope of the present study, these results are in line with a growing body of research that portrays spoken word recognition as highly flexible and perhaps tunable.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.cognition.2020.104196.

Author statement

Kristi Hendrickson: Conceptualization, methodology, validation, formal analysis, investigation, resources, data curation, writing - original draft, writing - review & editing, visualization, supervision, project administration. Jessica Spinelli: Conceptualization, formal analysis, investigation, data curation, writing - original draft, project administration. Elizabeth Walker: Conceptualization, methodology, resources, writing - original draft, writing - review & editing, supervision, project administration.

Acknowledgements

This research was supported by the Iowa Center for Research by Undergraduates (ICRU) awarded to the 2nd author. We thank Kathryn Gabel, Lindsey Meyer, and Lyndi Roecker for assistance with data collection.

References

Allopenna, P., Magnuson, J., & Tanenhaus, M. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38(4), 419–439.


Apfelbaum, K. S., Blumstein, S. E., & McMurray, B. (2011). Semantic priming is affected by real-time phonological competition: Evidence for continuous cascading systems. Psychonomic Bulletin & Review, 18(1), 141–149.
Ben-David, B. M., Chambers, C. G., Daneman, M., Pichora-Fuller, M. K., Reingold, E. M., & Schneider, B. A. (2011). Effects of aging and noise on real-time spoken word recognition: Evidence from eye movements. Journal of Speech, Language, and Hearing Research, 54(1), 243–262.
Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe, & A. P. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185–205). Cambridge, MA: MIT Press.
Boersma, P., & Weenink, D. (2019). Praat: Doing phonetics by computer [Computer program]. Version 6.0.50. Retrieved from http://www.praat.org/.
Brouwer, S., & Bradlow, A. R. (2014). Contextual variability during speech-in-speech recognition. The Journal of the Acoustical Society of America, 136(1), EL26–EL32.
Brouwer, S., & Bradlow, A. R. (2016). The temporal dynamics of spoken word recognition in adverse listening conditions. Journal of Psycholinguistic Research, 45(5), 1151–1160.
Brouwer, S., Mitterer, H., & Huettig, F. (2012). Speech reductions change the dynamics of competition during spoken word recognition. Language & Cognitive Processes, 27(4), 539–571.
Clopper, C. G. (2014). Sound change in the individual: Effects of exposure on cross-dialect speech processing. Laboratory Phonology, 5(1), 69–90.
Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6(1), 84–107.
Dahan, D., & Magnuson, J. S. (2006). Spoken word recognition. Handbook of psycholinguistics (pp. 249–283). Academic Press.
Dahan, D., Magnuson, J. S., & Tanenhaus, M. K. (2001). Time course of frequency effects in spoken-word recognition: Evidence from eye movements. Cognitive Psychology, 42, 317–367.
Davidson, L. S. (2006). Effects of stimulus level on the speech perception abilities of children using cochlear implants or digital hearing aids. Ear and Hearing, 27(5), 493–507.
Farris-Trimble, A., & McMurray, B. (2013). Test–retest reliability of eye tracking in the visual world paradigm for the study of real-time spoken word recognition. Journal of Speech, Language, and Hearing Research, 56(4).
Farris-Trimble, A., McMurray, B., Cigrand, N., & Tomblin, J. B. (2014). The process of spoken word recognition in the face of signal degradation. Journal of Experimental Psychology: Human Perception and Performance, 40(1), 308.
Fraser, S., Gagné, J. P., Alepins, M., & Dubois, P. (2010). Evaluating the effort expended to understand speech in noise using a dual-task paradigm: The effects of providing visual speech cues. Journal of Speech, Language, and Hearing Research, 53, 18–33.
Frauenfelder, U. H., Scholten, M., & Content, A. (2001). Bottom-up inhibition in lexical selection: Phonological mismatch effects in spoken word recognition. Language & Cognitive Processes, 16(5–6), 583–607.
Gosselin, P. A., & Gagné, J. P. (2011). Older adults expend more listening effort than young adults recognizing speech in noise. Journal of Speech, Language, and Hearing Research, 54, 944–958.
Hicks, C. B., & Tharpe, A. M. (2002). Listening effort and fatigue in school-age children with and without hearing loss. Journal of Speech, Language, and Hearing Research, 45, 573–584.
Holden, L. K., Reeder, R. M., Firszt, J. B., & Finley, C. C. (2011). Optimizing the perception of soft speech and speech in noise with the Advanced Bionics cochlear implant system. International Journal of Audiology, 50(4), 255–269.
Holden, L. K., Finley, C. C., Firszt, J. B., Holden, T. A., Brenner, C., Potts, L. G., ... Skinner, M. W. (2013). Factors affecting open-set word recognition in adults with cochlear implants. Ear and Hearing, 34(3), 342.
Luce, P. A., & Cluff, M. S. (1998). Delayed commitment in spoken word recognition: Evidence from cross-modal priming. Perception & Psychophysics, 60(3), 484–490.
Luce, P. A., & Pisoni, D. B. (1998). Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19(1), 1.
Marslen-Wilson, W., & Zwitserlood, P. (1989). Accessing spoken words: The importance of word onsets. Journal of Experimental Psychology: Human Perception and Performance, 15(3), 576.
Marslen-Wilson, W. D. (1987). Functional parallelism in spoken word-recognition. Cognition, 25(1–2), 71–102.
Mattys, S. L., Davis, M. H., Bradlow, A. R., & Scott, S. K. (2012). Speech recognition in adverse conditions: A review. Language & Cognitive Processes, 27(7–8), 953–978.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.

McMurray, B., Ellis, T. P., & Apfelbaum, K. S. (2019). How do you deal with uncertainty? Cochlear implant users differ in the dynamics of lexical processing of noncanonical inputs. Ear and Hearing.
McMurray, B., Farris-Trimble, A., & Rigler, H. (2017). Waiting for lexical access: Cochlear implants or severely degraded input lead listeners to process speech less incrementally. Cognition, 169, 147–164.
McMurray, B., Farris-Trimble, A., Seedorff, M., & Rigler, H. (2016). The effect of residual acoustic hearing and adaptation to uncertainty on speech perception in cochlear implant users: Evidence from eye-tracking. Ear and Hearing, 37(1), e37.
McMurray, B., Samelson, V. M., Lee, S. H., & Tomblin, J. B. (2010). Individual differences in online spoken word recognition: Implications for SLI. Cognitive Psychology, 60(1), 1–39.
McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within-category phonetic variation on lexical access. Cognition, 86(2), 33–42.
McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2009). Within-category VOT affects recovery from "lexical" garden-paths: Evidence against phoneme-level inhibition. Journal of Memory and Language, 60(1), 65–91.
McQueen, J. M., & Huettig, F. (2012). Changing only the probability that spoken words will be distorted changes how they are recognized. The Journal of the Acoustical Society of America, 131(1), 509–517.
Moberly, A. C., Bates, C., Harris, M. S., & Pisoni, D. B. (2016). The enigma of poor performance by adults with cochlear implants. Otology & Neurotology, 37(10), 1522.
Norris, D., & McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115(2), 357.
Olsen, W. O. (1998). Average speech levels and spectra in various speaking/listening conditions. American Journal of Audiology, 7(2), 21–25.
Pearsons, K. S., Bennett, R. L., & Fidell, S. (1977). Speech levels in various noise environments. Office of Health and Ecological Effects, US EPA, Office of Research and Development.
Peng, S. C., Chatterjee, M., & Lu, N. (2012). Acoustic cue integration in speech intonation recognition with cochlear implants. Trends in Amplification, 16(2), 67–82.
R Core Team (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.
Scharenborg, O., Coumans, J. M., & van Hout, R. (2018). The effect of background noise on the word activation process in nonnative spoken-word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 44(2), 233.
Seedorff, M., Oleson, J., & McMurray, B. (2018). Detecting when timeseries differ: Using the Bootstrapped Differences of Timeseries (BDOTS) to analyze Visual World Paradigm data (and more). Journal of Memory and Language, 102, 55–67.
Simmons, E., & Magnuson, J. (2018). Word length, proportion of overlap, and phonological competition in spoken word recognition. Proceedings of the 40th Annual Conference of the Cognitive Science Society (pp. 1064–1069). Madison, WI.
Skinner, M. W., Holden, L. K., Holden, T. A., Demorest, M. E., & Fourakis, M. S. (1997). Speech recognition at simulated soft, conversational, and raised-to-loud vocal efforts by adults with cochlear implants. The Journal of the Acoustical Society of America, 101(6), 3766–3782.
Smith, F., & McMurray, B. (2018). Lexical access in the face of degraded speech: The effects of cognitive adaptation. Poster session presented at the Acoustical Society of America Annual Meeting, Victoria, BC, Canada.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632–1634.
van der Feest, S. V., Blanco, C. P., & Smiljanic, R. (2019). Influence of speaking style adaptations and semantic context on the time course of word recognition in quiet and in noise. Journal of Phonetics, 73, 158–177.
Weber, A., & Scharenborg, O. (2012). Models of spoken-word recognition. Wiley Interdisciplinary Reviews: Cognitive Science, 3(3), 387–401.
Winn, M. B., & Litovsky, R. Y. (2015). Using speech sounds to test functional spectral resolution in listeners with cochlear implants. The Journal of the Acoustical Society of America, 137(3), 1430–1442.
Wu, Y. H., Stangl, E., Zhang, X., Perkins, J., & Eilers, E. (2016). Psychometric functions of dual-task paradigms for measuring listening effort. Ear and Hearing, 37(6), 660.
Zekveld, A. A., Kramer, S. E., & Festen, J. M. (2010). Pupil response as an indication of effortful listening: The influence of sentence intelligibility. Ear and Hearing, 31(4), 480–490.
