Cognition 198 (2020) 104196
Cognitive processes underlying spoken word recognition during soft speech

Kristi Hendrickson a,b,⁎, Jessica Spinelli a, Elizabeth Walker a

a Department of Communication Sciences & Disorders, University of Iowa, 250 Hawkins Drive, Iowa City, IA 52242, United States of America
b Department of Psychological & Brain Sciences, University of Iowa, 250 Hawkins Drive, Iowa City, IA 52242, United States of America
Keywords: Speech perception; Spoken word recognition; Soft speech; Lexical competition; Adverse listening conditions

Abstract
In two eye-tracking experiments using the Visual World Paradigm, we examined how listeners recognize words when faced with speech at lower intensities (40, 50, and 65 dBA). After hearing the target word, participants (n = 32) clicked the corresponding picture from a display of four images – a target (e.g., money), a cohort competitor (e.g., mother), a rhyme competitor (e.g., honey) and an unrelated item (e.g., whistle) – while their eye movements were tracked. For slightly soft speech (50 dBA), listeners demonstrated an increase in cohort activation, whereas for rhyme competitors, activation started later and was sustained longer in processing. For very soft speech (40 dBA), listeners waited until later in processing to activate potential words, as illustrated by a decrease in activation for cohorts, and an increase in activation for rhymes. Further, the extent to which words were considered depended on word length (mono- vs. bi-syllabic words), and speech-extrinsic factors such as the surrounding listening environment. These results advance current theories of spoken word recognition by considering a range of speech levels more typical of everyday listening environments. From an applied perspective, these results motivate models of how individuals who are hard of hearing approach the task of recognizing spoken words.
1. Introduction Language comprehension requires fast and efficient word recognition. Spoken word recognition is a complex cognitive process that involves mapping the incoming speech signal to entries in the mental lexicon. This complexity arises because spoken words unfold sequentially and rapidly in time. As a result, at each moment listeners have only partial information for identifying the target word. Given that many words sound the same, such partial information creates moments of uncertainty (Marslen-Wilson, 1987). For example, after hearing the onset of the word candle, there is insufficient information to discriminate the target word from phonological competitors that share word-initial phonemes (e.g., camera). A rich history in psycholinguistic research shows that listeners manage this temporary ambiguity by taking an immediate competition approach. That is, listeners immediately activate multiple lexical candidates that are consistent with the unfolding speech signal (MarslenWilson, 1987; McClelland & Elman, 1986; Norris & McQueen, 2008). These candidates compete for recognition as the word unfolds (Allopenna, Magnuson, & Tanenhaus, 1998; Marslen-Wilson & Zwitserlood, 1989). For instance, cohort competitors (word pairs that share phonemes at the beginning; e.g., candle and camera) compete for
recognition early, whereas rhyme competitors (i.e., word pairs that share phonemes at the end; e.g., candle and handle) compete later. As listeners accumulate information that disambiguates the word, these initial interpretations are updated until competition has been suppressed, and the appropriate word is fully active (Dahan & Magnuson, 2006; Dahan, Magnuson, & Tanenhaus, 2001; Frauenfelder, Scholten, & Content, 2001; Luce & Pisoni, 1998; Weber & Scharenborg, 2012). The degree of activation – how much a given word is considered – depends on where in the word phonological overlap is present. To illustrate, listeners heavily weigh word-initial overlap, and therefore, cohorts show more robust activation than do rhymes (Allopenna et al., 1998). Further, the degree of competition is influenced by word length (Simmons & Magnuson, 2018). For instance, rhyme effects are less likely to be observed in mono-syllabic words compared to bi-syllabic words. Given that rhymes are defined as words that overlap in the nucleus of the first syllable (i.e., the vowel) until the end of the word, bi-syllabic rhymes have a large portion of phonological overlap which likely drives greater competition (Simmons & Magnuson, 2018). A limitation of the existing psycholinguistic research is that it mostly examines word recognition under ideal listening conditions (i.e., given clear acoustic input presented at a conversational level [~65–70 dBA] in quiet). The breadth of research on speech recognition should
⁎ Corresponding author at: Dept. of Communication Sciences and Disorders, University of Iowa, 250 Hawkins Drive, Iowa City, IA 52242, United States of America.
E-mail addresses: [email protected] (K. Hendrickson), [email protected] (J. Spinelli), [email protected] (E. Walker).
https://doi.org/10.1016/j.cognition.2020.104196
Received 12 April 2019; Received in revised form 6 January 2020; Accepted 18 January 2020. © 2020 Published by Elsevier B.V.
be quite different for soft speech compared to other types of suboptimal input. For instance, vocoded speech (and speech faced by CI users) removes spectral differences, such as formant frequencies, which eliminates much of the information needed to discriminate phonemes. However, information about changes in amplitude over time (i.e., amplitude envelope) is preserved. As a result, CI users adopt whatever cues are most reliable in their input and primarily rely on temporal instead of spectral information to recognize words (McMurray, FarrisTrimble, Seedorff, & Rigler, 2016; Moberly, Bates, Harris, & Pisoni, 2016; Peng, Chatterjee, & Lu, 2012; Winn & Litovsky, 2015). Conversely, for softer levels of speech around 40–55 dBA, spectral differences are mostly accessible and the amplitude envelope is somewhat reduced (Olsen, 1998). Thus, both spectral and temporal cues exist in the input, but may be less accessible. As a result, the task of reweighting cues and focusing on the most reliable cues may not be the best approach to dealing with softer levels of speech input. Research examining word recognition in the face of soft speech has almost entirely focused on populations with hearing loss by measuring accuracy using clinical speech perception tasks (e.g., consonant-nucleus-consonant [CNC] word scores). In these tasks, listeners are asked to simply repeat the word they hear. For CI users, single word recognition accuracy is quite similar at 70 dBA and 60 dBA, though there is a significant drop-off in accuracy at 50 dBA (Holden et al., 2013). Despite the fact that measuring access to a range of sound levels (including soft speech) is a part of the clinical assessment of populations with hearing loss, and soft speech recognition is a crucial component of everyday speech perception for all listeners, surprisingly little is known about how individuals with normal hearing recognize soft speech, both in terms of recognition accuracy and the dynamics of real-time competition that subserve recognition (Olsen, 1998; Skinner, Holden, Holden, Demorest, & Fourakis, 1997). Such an investigation has both theoretical and clinical implications. From a theoretical perspective, understanding the influence of soft speech on the competition dynamics of spoken word recognition will inform more ecologically valid models. Indeed, any model of spoken word recognition should be able to account for the breadth of listening conditions listeners face in natural communication settings. From a clinical perspective, understanding how normal hearing listeners perceive soft speech is a useful model for determining how individuals with hearing loss approach the task of recognizing words in the face of reduced audibility.
reflect the listening challenges that individuals face in natural communication settings. For instance, listeners are often confronted with adverse listening environments, in which several factors (e.g., background noise, decreased sound levels, distortion) lead to reduced speech intelligibility (Mattys, Davis, Bradlow, & Scott, 2012; Olsen, 1998). How listeners recognize words in-the-moment may be quite different under adverse listening conditions. This is because when listeners are confronted with suboptimal speech input, they encounter two sources of ambiguity: the temporary ambiguity that is present during optimal listening conditions as a result of the unfolding signal, plus the ambiguity that arises from the suboptimal speech signal. Research examining the dynamics of lexical competition in adverse listening conditions is limited and mixed. The presence of background noise appears to cause listeners to delay fixations to the target (van der Feest, Blanco, & Smiljanic, 2019), and delay activation for cohort competitors (Ben-David et al., 2011). However, more recent research has found that background noise actually heightens and sustains cohort and rhyme competition (Brouwer & Bradlow, 2016). Further, background noise not only changes the magnitude of activation for different types of competitors, but also increases the number of candidate words (Scharenborg, Coumans, & van Hout, 2018). When confronted with distorted speech, in which some phonemes are replaced with radio noise, listeners reduce activation for cohorts and increase activation for rhymes (McQueen & Huettig, 2012). A similar effect is observed for cochlear implant (CI) users who are prelingually deaf (those CI users who have little acoustic basis for developing phoneme categories) and for hearing adults with vocoded speech meant to simulate a CI (McMurray, Farris-Trimble, & Rigler, 2017). In addition to speech-intrinsic factors (i.e., factors involving the speech signal itself), speech-extrinsic factors (i.e., factors related to the listening environment, such as predictability) also affect the nature of speech recognition and lexical competition (Brouwer & Bradlow, 2014; Brouwer, Mitterer, & Huettig, 2012; Smith & McMurray, 2018). Brouwer et al. (2012) examined lexical competition for canonical forms (e.g., computer [kɔmpjutər]) and reduced forms [e.g., pjutər] across multiple eye-tracking experiments. The stimuli were identical across experiments, though the nature of the presentation differed; in one experiment canonical forms were interleaved with reduced forms, whereas in another experiment listeners only heard canonical forms. The uncertainty of the listening environment (caused by the interleaved presentation of canonical and reduced forms) affected the recognition of canonical words. Specifically, canonical phonological competitors were more strongly activated in the interleaved presentation compared to the canonical only presentation. Brouwer and colleagues suggest that these findings reflect that when listening to reduced speech interleaved with canonical speech, listeners become more tolerant of acoustic mismatches for canonical inputs. These results are crucial because they demonstrate that listeners not only adjust competition based on speechintrinsic factors (e.g., reduced vs. canonical), but also speech-extrinsic factors (i.e., the predictability of the listening environment). 
Though recent research has begun to investigate spoken word recognition in adverse listening conditions (Brouwer & Bradlow, 2016; McQueen & Huettig, 2012), listening environments with soft speech (i.e., speech at lower intensities) are noticeably absent from the literature. It is crucial to understand how soft speech at multiple intensities is recognized given that it is ever-present in a range of typical listening environments. Indeed, listeners encounter a variety of speakers, some speaking face-to-face at conversational levels (~65 dBA), others speaking softly (< 55 dBA) and/or at distances that are not optimal for speech perception (Pearsons, Bennett, & Fidell, 1977). Further, speech sound levels in everyday listening environments can reach as low as 25 dBA (Holden, Reeder, Firszt, & Finley, 2011). Therefore, listeners must recognize words over a range of sound levels as a basis for effective communication in many listening situations (Davidson, 2006). There are reasons to believe that the dynamics of competition may
1.1. The present study The goal of the present study was to investigate how listeners recognize soft speech. Specifically, this research had two primary aims. First, across two experiments, we examined how soft speech influences the competition dynamics of spoken word recognition (i.e., what word types compete for recognition and when competition occurs during processing). To this end we obtained precise information regarding the time course of lexical competition. Previous research has shown that both speech-intrinsic and speech-extrinsic factors can change the dynamics of lexical competition (Brouwer et al., 2012). When reduced and canonical word forms were presented in an interleaved fashion, the dynamics of lexical competition changed for the canonical forms. Specifically, canonical forms demonstrated more robust competitor activation when presented alongside non-canonical forms. Thus, our second aim was to determine how speech-extrinsic factors, such as overall uncertainty in the listening environment, affect how soft speech is recognized. For both Experiments 1 and 2 we use eye-tracking and the Visual World Paradigm (VWP; Cooper, 1974; Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). Spoken words are presented at multiple levels of intensity. After hearing the target word, participants click on the corresponding picture from a display of four images representing candidate interpretations of the stimulus – the target (e.g., money), a cohort competitor that shares word-initial phonemes (e.g., mother), a 2
design. On each trial one of the four items was played as the auditory stimulus, and visual stimuli consisted of a target word (e.g., money), a cohort of the target (e.g., mother), a rhyme of the target (e.g., honey), and a phonologically unrelated word (e.g., whistle). Sets of four items were constructed to ensure that no concepts overlapped semantically and unrelated items had no phonological overlap with any other word in the set. All four items within a set always appeared together. Each word was presented as the target (i.e., the auditory stimulus) an equal number of times. Because each word within a set served as the auditory stimulus, there were four trial types. To illustrate, again consider the set, money-mother-honey-whistle. When money was the auditory stimulus, the resulting trial type was a target-cohort-rhyme-unrelated or TCRU trial. However, the structure of the trial (i.e., the nature of the competitor set) changed depending on which auditory stimulus appeared as the target. For example, on trials in which the cohort (e.g., mother) appeared as the auditory stimulus, the trial was termed a target (mother)-cohort (money)-unrelated (honey)-unrelated (whistle) or TCUU trial. Thus, the four trial types were TRCU, TCUU, TRUU, and TUUU, where the letters refer to the relationship among the items on the screen depending on the auditory stimulus. In order to determine which sound levels to use in the eye-tracking task, a modified version of the consonant-nucleus-consonant (CNC) speech perception test was conducted with a separate set of 10 participants. We used the CNC test for two main reasons. First, the CNC is one of the clinical gold standards for speech recognition. Further, we wanted to choose sound levels that were highly audible and listeners could recall the word and not simply recognize it given the referent. Consistent with the CNC protocol, participants repeated words they heard from a recording to the experimenter who scored their responses for the percentage of phonemes and whole words correctly produced. We modified the traditional CNC protocol to include six different sound levels (30, 35, 40, 45, 50, and 65 dBA). A different CNC word list was used for each sound level and word lists and sound levels were counterbalanced across participants. Given that trials with incorrect responses are removed from analyses in studies of spoken word recognition using the VWP, it was essential to choose sound levels in which recognition accuracy was high enough to ensure that participants were not simply guessing. Indeed, by including sound levels with low recognition accuracy within the eye-tracking task, we would not have sufficient trials to look at condition specific differences. See Table 1 for results at the six different sound levels. Based on the results of this pretest and clinically related motivations, we chose 40 and 50 dBA as the soft speech sound levels for the eye-tracking task. We chose 50 dBA primarily for its clinical relevance; 50 dBA is used in clinical settings to verify audibility of the hearing aid for soft speech. We chose 40 dBA because it was the lowest sound level in which word recognition accuracy was high enough (> 90%) to ensure a sufficient number of usable trials during the eye-tracking task. Further, performance on the 40 dBA condition was associated with a significant decrease in word recognition accuracy (logit transformed) from the next lowest intensity level (35 dBA, t(9) = 8.73, p < .001). 
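To make the pretest analysis just described concrete, the sketch below shows one way to logit-transform proportion-correct scores and run the paired comparison between adjacent sound levels, in the spirit of the 40 vs. 35 dBA test reported above. This is an illustrative Python sketch, not the authors' analysis code, and the accuracy values are placeholders rather than the study's data.

```python
# Illustrative sketch only (not the authors' analysis script).
# Logit-transform word-recognition accuracy and compare two adjacent sound
# levels with a paired t-test, as in the 40 vs. 35 dBA comparison above.
import numpy as np
from scipy import stats

def empirical_logit(p, eps=0.005):
    """log(p / (1 - p)), with eps keeping proportions of 0 or 1 finite."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

# One proportion-correct score per pretest participant (n = 10) per level.
# These numbers are placeholders, not the data reported in Table 1.
acc_35 = np.array([0.78, 0.80, 0.82, 0.79, 0.81, 0.77, 0.83, 0.80, 0.79, 0.82])
acc_40 = np.array([0.93, 0.91, 0.94, 0.92, 0.90, 0.95, 0.93, 0.92, 0.94, 0.91])

t, p = stats.ttest_rel(empirical_logit(acc_40), empirical_logit(acc_35))
print(f"t({acc_35.size - 1}) = {t:.2f}, p = {p:.4g}")
```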
Each of the 136 words (34 sets with 4 items) occurred as the auditory stimulus twice at the three sound levels (40, 50, and 65 dBA), resulting in 816 total trials (136 words × 2 repetitions × 3 sound levels). The sound levels of the study were randomly interleaved for
rhyme competitor that shares word-final phonemes (e.g., honey) and a phonologically unrelated picture (e.g., whistle) – while their eye movements are monitored (e.g., Allopenna et al., 1998). Participants must make a fixation to plan their ultimate response, and can make multiple fixations over the course of the trial. As a result, eye movements in this paradigm are tightly time-locked to unfolding activation among lexical competitors. To examine whether and to what extent lexical competition for speech at different intensities is affected by speech-extrinsic factors (the uncertainty of the listening environment), we manipulated the nature of the sound level presentation: in Experiment 1 sound level was presented in an interleaved fashion, whereas in Experiment 2, sound level was blocked. We predict that listeners will be slower to recognize the target word as sound level decreases. Of particular interest to the current investigation is how the competition dynamics change with soft speech. We predict that lexical competition is likely to change along two dimensions: magnitude (the amount of activation) and/or timing (when activation occurs). In terms of magnitude, soft speech could cause listeners to reduce activation of competitors. Consistent with research on distorted speech input, listeners may decrease activation to cohorts, rhymes or both (McQueen & Huettig, 2012). Conversely, listeners could increase activation for phonological competitors in the face of soft speech similarly to what is seen in some CI users (McMurray et al., 2016). Further, activation could change in terms of timing. Most research on word recognition in adverse conditions shows some level of sustained competitor activation, in which consideration for competitors is maintained later in processing (Ben-David et al., 2011; Brouwer & Bradlow, 2016; McMurray et al., 2017). From this view, we may see either delayed activation for competitors, or activation may be immediate but maintained longer than in conversational speech. Finally, in accordance with previous research on speech-extrinsic factors (Brouwer et al., 2012; Smith & McMurray, 2018), we predict that the nature of the presentation (randomly interleaved vs. blocked) will influence lexical competition, particularly for optimal speech (conversational speech, 65 dBA). Specifically, we expect more phonological competition for conversational speech when interleaved with soft speech (Experiment 1) than when presented in a blocked fashion (Experiment 2). Finally, in Experiment 2 we examine the influence of sound level on lexical competition for mono- and bi-syllabic words separately. Recent research shows stark differences in the magnitude of rhyme effects for mono-syllabic vs. bi-syllabic words. Bi-syllabic rhymes (e.g., paddlesaddle) are longer and contain more phonological overlap with the target word than mono-syllabic rhymes (e.g., cat-bat). As a result, bisyllabic rhymes have been shown to elicit more competition (Simmons & Magnuson, 2018). If listeners sustain activation for competitors during soft speech, this might be particularly apparent for bi-syllabic rhymes. 2. Experiment 1 2.1. Methods 2.1.1. Participants Thirty-two undergraduate participants were recruited from the University of Iowa psychology participant pool. All participants were monolingual English speaking and had normal hearing and normal or corrected-to-normal vision. 
Participants ranged from 18 to 23-years-old (M = 18.5 years) and received academic credit for a university level psychology course. All participants signed an informed consent document before participating in the study.
Table 1
Mean (and standard deviation) word recognition accuracy by sound level.

Sound level (dBA)   % words correct   % phonemes correct
30                  71.2 (7.8)        88.7 (3.9)
35                  80.1 (7.8)        92.9 (3.1)
40                  92.6 (3.8)        97.4 (1.4)
45                  95.0 (3.8)        98.4 (1.2)
50                  97.0 (4.2)        98.9 (1.6)
65                  97.6 (2.3)        99.0 (1.1)

2.1.2. Design
Items consisted of 136 words (72 mono-syllabic and 64 bi-syllabic) that were combined into sets of four, resulting in 34 total sets (see Appendix). This study had a target-cohort-rhyme-unrelated (TCRU)
conversational speech. Further, trials were randomized so that each participant received a different presentation order.
Experiment 1. 2.1.3. Stimuli Auditory stimuli were 136 words recorded by a female speaker with a standard American accent in an anechoic chamber using a Marantz Professional Steady State Recorder at a sampling rate of 44100 Hz. The speaker wore a Shure head-mounted microphone to ensure the distance between her mouth and the microphone was consistent throughout the recording. Audio recordings were completed within a single session. Each word was recorded multiple times within the sentence frame (“He said ___.”) to ensure natural intonation. The best exemplar of each word was extracted from the sentence frame and edited to remove noise artifacts. Even though the audio recordings were made in an anechoic chamber, there was minor background noise due to the fan of the recording computer. We used a standard lab protocol to remove this background noise. For this, we used Audacity to measure a ~1000 msec sample of silence within the larger recording and calculated the acoustic profile of the background noise. We used this recording-specific background noise profile to remove the background noise in the audio recordings while preserving the acoustic properties of the speech stimuli. Finally, all words were amplitude normalized to the three sound levels of interest (40, 50, and 65 dBA), and 100 ms of silence was added to the beginning and end of the sound files using Praat (Boersma & Weenink, 2019). Visual stimuli were clip art style images developed using a standard lab protocol (McMurray, Samelson, Lee, & Tomblin, 2010). For each image, several candidates were identified from a commercial clip art library. These images were then viewed by a focus group to identify the “best” exemplar of a given word and to identify alterations to ensure a more prototypical orientation. Each image was edited to implement these changes and remove any unnecessary features. Images were matched for style and visual saliency and approved by a senior member of the lab with extensive experience in the VWP. Most of the images used in this study were pulled from a database of images developed with this protocol and used across multiple studies (Apfelbaum, Blumstein, & McMurray, 2011; McMurray et al., 2010).
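The relative attenuation implied by the amplitude-normalization step described above can be expressed as a simple gain applied to each waveform. The following is a minimal Python sketch of that idea under the assumption that a reference recording plays back at a calibrated level (here taken to be 65 dBA at the listener's position, as verified with the sound level meter described in the Procedure); it is not the authors' Praat script, and the file names are hypothetical.

```python
# Minimal sketch under stated assumptions (not the authors' Praat/Audacity pipeline).
# Attenuate a word calibrated for 65 dBA playback so that it plays at a softer
# target level. Absolute dBA levels still depend on speaker calibration.
import numpy as np
from scipy.io import wavfile

def attenuate_to_level(signal, reference_dba, target_dba):
    """Apply the relative gain that moves a calibrated reference playback level
    (e.g., 65 dBA) down to a softer target level (e.g., 40 dBA)."""
    gain = 10.0 ** ((target_dba - reference_dba) / 20.0)  # amplitude ratio
    return signal * gain

rate, word = wavfile.read("money_65dBA.wav")   # hypothetical file name
word = word.astype(np.float64)
word_40 = attenuate_to_level(word, reference_dba=65, target_dba=40)
wavfile.write("money_40dBA.wav", rate, np.round(word_40).astype(np.int16))
```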
2.1.5. Eye tracking recording and data processing Eye movements were recorded with an Eyelink Portable Duo eyetracker in the chin rest configuration. Once the experimenter adjusted the chin rest to a comfortable position and a clear image of the pupil and corneal reflection were obtained, a 9-point calibration and validation procedure was completed. Drift correction was performed 24 times throughout the experiment (every 34 trials). No participants failed drift correction during the experiment. For analysis, saccades and successive fixation were combined into a single unit called a “look” (McMurray et al., 2010; McMurray, Tanenhaus, & Aslin, 2002). 2.2. Results We conducted two sets of analyses. The first set examined accuracy and reaction time of the mouse-click response as a function of sound level (40, 50, or 65 dB). Next, we analyzed the time course of target and competitor (cohort and rhyme) fixations. For these analyses we derived an estimate of the precise time window at which two timeseries differ to understand the timing and magnitude of activation for targets, cohorts, rhymes, and unrelated words across sound levels. Only correct trial, and saccades occurring > 200 ms post-word onset were included in the time series and reaction time analyses. 2.2.1. Accuracy and latency Accuracy and reaction time to click the target image (measured from word-onset) were calculated for each subject and sound level. Two separate two-way ANOVAs were conducted to compare the effect of sound level and trial type (TCRU, TCUU, TRUU, TUUU) on response accuracy (logit transformed) and reaction time, respectively. For accuracy, there was a main effect of sound level [F(2, 62) = 4.49, p = .015]. Participants were highly accurate at selecting the target image at all sound levels, though slightly less so for 40 dBA speech: 65 dBA (99.4%), 50 dBA (99.5%) and 40 dBA (99.1%). Further, there was a main effect of trial type [F(3, 93) = 8.20, p < .001], such that participants were more accurate at selecting the target on trials in which a competitor was not on the screen: TCUU (99.5%), TCRU (99.2%), TRUU (99.1%), and TUUU (99.7%). There was no significant sound level × trial type interaction [F(6, 186) = 1.22, p = .35]. For reaction time, there was a significant main effect of sound level [F(2, 62) = 8.21, p = .0007], in which participants were fastest at responding for 50 dBA speech (1197 ms), followed by 65 dBA speech (1207 ms) and finally 40 dBA speech (1217 ms). There was also a main effect of trial type [F(3, 93) = 29.75, p < .0001], such that participants were fastest at responding during trials with no competitor: TCUU (1229 ms), TCRU (1226 ms), TRUU (1194 ms), and TUUU (1179 ms). However, there was no significant sound level × trial type interaction [F(6, 186) = 1.64, p = .14].
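As a concrete illustration of the accuracy analysis just reported, the sketch below runs a 3 (sound level) × 4 (trial type) repeated-measures ANOVA on logit-transformed accuracy. It is a stand-in with simulated placeholder data, not the authors' code; the column names and the use of statsmodels' AnovaRM are assumptions made for illustration.

```python
# Illustrative sketch with simulated data (not the authors' analysis script).
# 3 (sound level) x 4 (trial type) repeated-measures ANOVA on logit accuracy.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

def empirical_logit(p, eps=0.005):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

rng = np.random.default_rng(0)
rows = []
for subject in range(1, 33):                    # 32 participants, as in Experiment 1
    for level in (40, 50, 65):
        for trial_type in ("TCRU", "TCUU", "TRUU", "TUUU"):
            accuracy = rng.uniform(0.97, 1.0)   # placeholder per-cell accuracy
            rows.append((subject, level, trial_type, accuracy))
df = pd.DataFrame(rows, columns=["subject", "level", "trial_type", "accuracy"])
df["acc_logit"] = empirical_logit(df["accuracy"].to_numpy())

# Main effects of sound level and trial type, plus their interaction.
print(AnovaRM(df, depvar="acc_logit", subject="subject",
              within=["level", "trial_type"]).fit())
```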
2.1.4. Procedure For the eye-tracking task, participants were seated 24 in. away from a computer monitor. The task was programmed with Experiment Builder (SR Research, Ontario, Canada). Auditory stimuli were played over two front-mounted speakers, and the signal was amplified by a Rolls 35 watt stereo power amplifier. The speakers were calibrated by measuring the sound level to a concatenated file containing a representative sample of stimuli using a portable sound level meter. The sound level meter was centered in the speaker array at a distance corresponding to the location of the participant's head during testing. Testing was conducted in a sound-attenuated room to minimize background noise. Before the task began, participants received both verbal and written study instructions. Further, practice trials were administered to ensure participants understood the task. For each trial, four pictures appeared on the screen: a target (money), cohort competitor (mother), rhyme competitor (honey), and an unrelated item (whistle). Pictures were 300 × 300 pixels and appeared 50 pixels vertically and horizontally from the edges of a 17″ computer monitor running at 1280 × 1024 pixels. Picture location was counterbalanced across trials, such that all pictures and word types (target, cohort, rhyme, and unrelated) appeared equally in each location. In the middle of the screen there was a blue dot. After 500 milliseconds the blue dot turned red at which point, participants were instructed to click the red dot to initiate the trial. Once clicked, the red dot disappeared, and an auditory stimulus identified the target. Thus, the preview of the visual display was at least 500 ms, though given that the trials where participant initiated, participants could preview the images for longer. Participants were instructed to click the corresponding picture. Soft speech was presented randomly intermixed with
2.2.2. Time course Time series data were analyzed using Bootstrapped Differences of Timeseries (BDOTS; a statistical package within R, Version 1.1.456, R Core Team, 2014). BDOTS is a statistical tool that detects differences in two timeseries (as in the VWP) when the time window is unknown in advance and offers a precise characterization of the time window in which a difference occurs (for a summary see Seedorff, Oleson, & McMurray, 2018). BDOTS achieves this level of precision in four steps. First, for each participant, word type (target, cohort, rhyme, and unrelated), and sound level (40, 50, 65 dBA) specific fitted curves are applied to the raw time series data to capture the shape of the functions. Target fixations are best approximated by a logistic function because there is an initial period of low looks, followed by an exponential increase in looks, and ending with a constant period of high looks. Competitor and unrelated fixations are best approximated using a 4
there was a significant difference in target fixations for 40 dBA speech compared to 50 dBA speech, in which looks were significantly higher to the 50 dBA speech. Finally, proportion looking to the target was significantly higher for 65 dBA speech compared to 40 dBA speech from 752 to 2000 ms. In line with our predictions, participants fixated the target faster and to a greater extent for 65 dBA speech compared to 40 dBA speech. However, participants were faster to fixate the target for 50 dBA speech compared to 65 dBA speech, though there was no difference in the asymptote of fixations (maximum fixations) (see Fig. 1).
double Gaussian function because these fixations start with a period of low looks, followed by an increase in looking culminating at the peak, and finally a low period of looks (see Seedorff et al., 2018). Curve fitting smooths the data to minimize any idiosyncratic patterns of significance. To evaluate goodness of fits, we measured R2 values and visually examined the observed data compared to the estimated curve for each subject, word type, and sound level. Second, a bootstrapping procedure was used to estimate the standard error of the mean at each time point. Third, these standard errors of the function were used to conduct 2 sample t-tests at every time point. Fourth, family-wise error was controlled with a modified Bonferroni corrected significance level which takes advantage of the inherent autocorrelation of the test statistics to avoid being overly conservative. For any significant differences between sound levels, we report the time window of significance, the mean difference across the time window, the autocorrelation of the test statistic, and the adjusted alpha value.
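To make the four-step procedure more concrete, the sketch below implements a simplified version of the first two steps in Python: per-subject curve fits (a four-parameter logistic for target fixations and an asymmetric "double" Gaussian for competitor fixations), followed by a bootstrap over subjects to estimate the standard error of the mean curve at each time point. It is an illustration with simulated fixation curves, not the BDOTS R package itself; the pointwise t-tests and autocorrelation-based alpha adjustment (steps three and four) are only noted in comments.

```python
# Simplified stand-in for the curve-fitting and bootstrapping steps described
# above (illustration only; the actual analyses used the BDOTS package in R).
import numpy as np
from scipy.optimize import curve_fit

def logistic4(t, lower, upper, crossover, slope):
    """Four-parameter logistic: lower/upper asymptotes, crossover point, and a
    slope term governing how steeply fixations rise around the crossover."""
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (t - crossover)))

def double_gaussian(t, mu, sig1, sig2, peak, base1, base2):
    """Asymmetric Gaussian used for competitor/unrelated fixations: separate
    onset (left) and offset (right) variances and baselines around one peak."""
    left = base1 + (peak - base1) * np.exp(-((t - mu) ** 2) / (2 * sig1 ** 2))
    right = base2 + (peak - base2) * np.exp(-((t - mu) ** 2) / (2 * sig2 ** 2))
    return np.where(t <= mu, left, right)

time = np.arange(0, 2000, 4)                   # ms, spanning the analysis window
rng = np.random.default_rng(1)

# Step 1: fit each subject's (simulated) target-fixation curve and keep the fit.
fits = []
for _ in range(32):                            # 32 subjects, as in Experiment 1
    observed = logistic4(time, 0.05, 0.90, 700, 0.01) + rng.normal(0, 0.02, time.size)
    popt, _ = curve_fit(logistic4, time, observed,
                        p0=[0.05, 0.9, 700, 0.01], maxfev=20000)
    fits.append(logistic4(time, *popt))
fits = np.asarray(fits)

# Step 2: bootstrap subjects to estimate the SE of the mean curve per time point.
boot_means = np.array([fits[rng.integers(0, fits.shape[0], fits.shape[0])].mean(axis=0)
                       for _ in range(1000)])
mean_curve = boot_means.mean(axis=0)
se_curve = boot_means.std(axis=0, ddof=1)

# Steps 3-4 (not shown): pointwise t-tests between two conditions' curves at every
# time point, with the alpha level adjusted using the autocorrelation of the
# resulting test statistics to control family-wise error.
print(mean_curve[175], se_curve[175])          # mean and SE at t = 700 ms
```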
2.2.2.2. Competitor fixations. For competitor fixations (cohort and rhyme) we use unrelated fixations as a baseline to control for differential levels of looking between sound levels (see Fig. 2 for the average raw fixations to cohorts, rhymes, and unrelated items, and Fig. S2 in Supplementary materials for single-subject raw fixations to the competitors and unrelated items). For trials in which two unrelated items were on the screen (i.e., TCUU and TRUU trials), we took the average. We use the difference between competitors and unrelated fixations as an estimate of competitor activity and compare these competitor-unrelated differences between sound levels. For analyses related to competitor looks we fit separate double Gaussian functions to the components of the difference curves (competitor – unrelated baseline), and then evaluate these difference curves between sound levels. The following describes the five parameters for the double Gaussian function: μ = mean for each of the individual normal distributions; σ12 = variance for the left-side normal distribution (essentially the onset slope); σ22 = variance for the right-side normal distribution (the offset slope); Pi = peak height, B1 = baseline for the left-side normal distribution, and B2 = baseline for the rightside. For cohort fixations, 32 subjects were fit using the double Gaussian function in a within-subject design. We compared each sound level (3 comparisons), resulting in 192 fits in all (32 participants × 2 word types [cohort and unrelated] × 3 sound levels [40, 50, 65 dB]). In the fitting stage, 120 curves were good fits (R2 ≥ 0.95), 58 curves had reasonable fits (R2 ≥ 0.8), and 14 curves had fits that were below R2 = 0.8 (though no fit was < 0.70). All 14 curves that had fits of R2 < 0.80 represented fixations to the unrelated item, which were by design meant to be low. To illustrate, listeners have no reason to maintain fixations to unrelated items as these items contain no phonological overlap with the target word. What results are curves with decreased kurtosis (i.e., relatively flat curves). After multiple refits, and upon visual inspection of the observed data compared to the estimated
2.2.2.1. Target fixations. Though the primary purpose of the paper was to examine the competition dynamics during soft speech recognition, it is important to determine whether the presence of soft speech also influences how readily target words are recognized. Indeed, there may be differences in the dynamics of competitor activation that have little effect on how fast listeners recognize words. That is, listeners could recognize soft speech just as quickly as conversational speech, but get there via a different route (i.e., by adapting the nature of competition as a flexible and efficient means of recognizing soft speech). Thus, it is crucial to first analyze the timing and magnitude of target fixations by sound level. For analyses related to the timeseries of target fixations, we collapsed across all four trial types (TCRU, TCUU, TRUU, and TUUU). All 32 subjects were fit using the logistic function in a within-subject study design (see Fig. S1 in Supplemental materials for single-subject raw fixation curves to the target). The function was a four parameter logistic with time on the x-axis, and separate parameters for the upper and lower asymptotes, the crossover (the time at which fixations were halfway between asymptotes), and the slope (the derivative at the crossover). In the fitting stage, all 96 curves (one curve per sound level for 32 subjects) had good fits (R2 ≥0.95). The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 2 for full results). We found a significant difference in target fixations from 400 to 776 ms post-word onset, in which proportion fixations to the 50 dBA speech were significantly higher than 65 dBA speech. Further, from 412 to 2000 ms
Table 2
Results of timeseries analyses for Experiment 1 by word type (Target, Cohort, Rhyme) and sound level (BDOTS output).

Comparison    Significant time window(s)      Direction of effect (mean difference)    Autocorrelation    Adjusted alpha
Target
65 v 50       400–776 ms                      50 > 65 (0.01)                           0.999              0.011
50 v 40       412–2000 ms                     50 > 40 (0.008)                          0.999              0.012
65 v 40       752–2000 ms                     65 > 40 (0.006)                          0.999              0.02
Cohort
65 v 50       –                               –                                        –                  –
50 v 40       408–824 ms; 1016–2000 ms        50 > 40 (0.01); 40 > 50 (0.004)          0.996              0.003
65 v 40       384–860 ms; 1012–2000 ms        65 > 40 (0.01); 40 > 65 (0.005)          0.998              0.005
Rhyme
65 v 50       204–768 ms; 1168–1240 ms        65 > 50 (0.03); 50 > 65 (0.003)          0.987              0.001
50 v 40       208–756 ms; 868–1200 ms         40 > 50 (0.02); 40 > 50 (0.004)          0.993              0.002
65 v 40       256–640 ms; 880–2000 ms         65 > 40 (0.02); 40 > 65 (0.007)          0.996              0.003
Fig. 1. Target fixations (Experiment 1: Randomized and Interleaved Presentation). A) Proportion fixations to the target image (after curve fitting) across the four trial types (TCRU, TCUU, TRUU, TUUU) by sound level (40, 50, 65 dBA). B) Average proportion fixations to the target by sound level for the three time windows of significance.
Fig. 2. Raw fixations (before curve fitting) to cohorts, rhymes, and unrelated items by sound level for the randomized and interleaved presentation. Shaded regions represent standard error.
curve it was determined that despite the lower R2 value for some unrelated fixations, we obtained reasonable fits for these data. The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 2 for full results). There were no significant differences in cohort fixations to 50 dBA compared to 65 dBA speech. For the two levels of soft speech (50 dBA and 40 dBA), we found two regions of significance. From 408 to 824 ms, cohort fixations were significantly higher for 50 dBA compared to 40 dBA speech; however, from 1016 to 2000 ms fixations to 40 dBA speech were higher. Finally, we compared cohort fixations for conversational speech (65 dBA) and very soft speech (40 dBA). Similar to the results of the 50 dBA and 40 dBA comparison, we found regions of significance from 384 to 860 ms and 1012–2000 ms, in which cohort activation for conversational speech was initially higher, though later in processing
activation was higher for 40 dBA speech (see Fig. 3). For time series analyses of rhyme fixations the same 32 subjects were fit using the double Gaussian function in a within-subject design. Again, we compared each sound level (3 comparisons in all), resulting in 192 fits (32 participants × 2 word types [cohort and unrelated] × 3 sounds levels [40, 50, 65 dB]). In the fitting stage, 109 curves were good fits (R2 ≥ 0.95), 69 curves had reasonable fits (R2 ≥ 0.8), 12 curves had fits that were below R2 = 0.8 (though no fits were < 0.70), and 1 subject was dropped due to poor fitting (R2 < 0.70) in at least one of their curves. The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 2). There was a significant difference in rhyme fixations between 65 dBA and 50 dBA speech from 204 to 768 ms, in which proportion fixations were higher for 65 dBA speech,
Fig. 3. Competitor fixations (Experiment 1: Randomized and Interleaved presentation). Difference in proportion fixations (after curve fitting) to cohorts-unrelated (A) and rhymes-unrelated (B) by sound level (40, 50, 65 dBA).
though from 1168 to 1240 ms proportion fixation were higher for 50 dBA speech. Further, rhyme fixations were significantly higher for 40 dBA compared to 50 dBA speech from 208 to 756 ms, and again from 868 to 2000 ms. Finally, from 256 to 640 ms rhyme fixations were significantly higher for 65 dBA speech compared to 40 dBA speech, however, from 880 to 2000 ms, 40 dBA speech demonstrated significantly higher rhyme fixations than did 65 dBA speech.
If the pattern of results in Experiment 1 were partly due to the interleaved presentation, we expect the following results. First, consistent with prior research on the influence of speech-extrinsic factors in word recognition (Brouwer et al., 2012; Smith & McMurray, 2018), we predict a relative decrease in cohort and rhyme activation for optimal speech (65 dBA speech) in Experiment 2. Second, given that the task in the blocked design will be less challenging to the listener, we expect attention will not modulate effects of target fixations to the same degree as was observed in Experiment 1. Thus, for Experiment 2, we predict listeners will be fastest to recognize the target word for 65 dBA speech, followed by 50 dBA speech, and finally, 40 dBA speech. To help further tease apart the competition dynamics of conversational and soft speech, in Experiment 2, we also examined effects of word length (mono- vs. bi-syllabic words).
2.3. Discussion The results of Experiment 1 replicate previous research that shows a preference for onset over offset overlap competitors. Indeed, across all three sound levels, listeners showed more activation for cohorts than rhyme competitors. However, there were condition specific differences in the pattern of activation. For soft speech, activation for both cohort and rhyme competitors was reduced, and this reduction in activation was particularly pronounced for cohorts. The cohort results are in line with previous research that suggests that the more degraded the speech signal; the more likely listeners are to reduce activation for cohort competitors (McMurray et al., 2017). The rhyme effects are somewhat unexpected because previous research has shown that listeners increase rhyme activation when presented with highly degraded input (McMurray et al., 2017). Thus, one might expect more rhyme activation for 40 dBA compared to 65 dBA speech. However, in Experiment 1 we obtained the opposite effect and rhyme activation for the 65 dBA condition was larger than expected. Indeed, previous research has shown smaller rhyme effects than those obtained in Experiment 1 for spoken words presented at a conversation level. For instance, Farris-Trimble, McMurray, Cigrand, and Tomblin (2014) found a difference in peak height (i.e., maximum fixations) to rhymes - unrelated items of 0.03; in the current experiment the difference in peak height for 65 dBA speech is nearly twice that, (0.055). However, it is crucial to note that previous research was concerned with word recognition in optimal listening conditions and as a result all stimuli were presented at a conversational level (between 65 dB and 70 dB). Thus, there was no uncertainty as to which speech sound level may come next. In Experiment 1, conversational speech and levels of soft speech were randomly interleaved, such that listeners could not predict sound level on any given trial. As previously mentioned, speech-extrinsic factors, such as presenting conditions in an interleaved fashion, have been shown to increase competitor activation for more canonical or typical inputs. Thus, the finding that 65 dBA speech demonstrated robust rhyme (and cohort) activation when interleaved with soft speech corroborates previous research that shows heightened activation for canonical phonological competitors when interleaved with reduced inputs (Brouwer et al., 2012). A complementary argument could also help explain the target fixation results (i.e., faster fixations to the target for 50 dBA compared to 65 dBA speech). Listeners may have had difficulty processing conversational speech when interleaved with soft speech, which slowed the rate of looking to the target. This interpretation is in line with the desirable difficulty effect (Bjork, 1994). Listeners may have been more engaged in the task when the sound level was interleaved because they needed to be prepared for hearing speech at lower intensities. Indeed, research has shown that if information challenges the listener, processing may be enhanced. From this view, soft speech may have engaged more cognitive resources, and listeners may have slightly disengaged in the task during conversational speech. To test the hypothesis that presenting soft speech and conversational speech in an interleaved fashion influenced the pattern of results for targets and competitors, we conducted a follow-up experiment. 
Experiment 2 was identical to Experiment 1, with one crucial exception: sound level was presented in a blocked fashion. This change allowed us to analyze the target fixations and competition dynamics of each sound level without interference from the other sound levels, thus creating a highly predictable listening environment.
3. Experiment 2 3.1. Methods 3.1.1. Participants A different group of 27 undergraduate participants were recruited from the University of Iowa psychology participant pool. As in Experiment 1, all participants were monolingual English speaking and had normal hearing and normal or corrected-to-normal vision. The participants ranged from 18 to 21-years-old (M = 19.11 years). Participants either received academic credit for a university level psychology course or monetary compensation. All participants signed an informed consent document before participating in the study. 3.1.2. Design and stimuli Audio and visual stimuli were identical to Experiment 1. The study design was identical to Experiment 1, with one significant change. Instead of presenting sound level in an interleaved fashion, listeners were presented with each sound level in a blocked design. To minimize the potential of order effects, the order of the sound levels was counterbalanced across participants such that each sound level appeared equally in every presentation order. 3.2. Results 3.2.1. Accuracy and latency Accuracy and reaction time to click the target image (measured from word onset) were calculated for each subject and sound level. Two separate two-way ANOVAs were conducted to compare the effect of sound level and trial type (TCRU, TCUU, TRUU, TUUU) on response accuracy (logit transformed) and reaction time (for correct trials only), respectively. There was a main effect of sound level [F(2, 52) = 15.75, p < .001]. Just as in Experiment 1, participants were highly accurate at selecting the target image at all sound levels, though slightly less so for 40 dBA speech: 65 dBA (99.6%), 50 dBA (99.6%) and 40 dBA (98.8%). Further, there was a significant main effect of trial type [F(3, 78) = 5.81, p = .001]. Again, as in Experiment 1 participants were more accurate at selecting the target on trials in which a competitor was not on the screen: TCUU (99.5%), TCRU (98.9%), TRUU (99.1%), and TUUU (99.7%). Unlike Experiment 1, there was a significant sound level × trial type interaction [F(6, 156) = 2.61, p = .002]. This interaction appears to be mainly driven by the significantly lower accuracy for the 40 dBA condition compared to 50 and 65 dBA during TCRU and TRUU trials. For reaction time, there was a main effect of trial type [F (3, 78) = 18.77, p < .001], such that participants were faster at responding on trials without competitors: TCUU (1244 ms), TCRU (1236 ms), TRUU (1225 ms), and TUUU (1195 ms). However, there was no main effect of sound level [F(2, 752) = 2.26, p = .14], and no significant sound level × trial type interaction [F(6, 156) = 1.03, p = .41]. 7
Table 3
Results of timeseries analysis for Experiment 2 by word type (Target, Cohort, Rhyme) and sound level (BDOTS output).

Comparison    Significant time window(s)                  Direction of effect (mean difference)               Autocorrelation    Adjusted alpha
Target
65 v 50       520–2000 ms                                 65 > 50 (0.02)                                      0.999              0.01
50 v 40       500–1012 ms                                 50 > 40 (0.02)                                      0.999              0.01
65 v 40       444–2000 ms                                 65 > 40 (0.02)                                      0.998              0.007
Cohort
65 v 50       336–568 ms; 792–2000 ms                     50 > 65 (0.006); 65 > 50 (0.004)                    0.991              0.002
50 v 40       236–668 ms; 784–855 ms; 1044–1280 ms        50 > 40 (0.02); 40 > 50 (0.004); 40 > 50 (0.005)    0.984              0.001
65 v 40       200–700 ms                                  65 > 40 (0.02)                                      0.985              0.001
Rhyme
65 v 50       344–648 ms; 732–1284 ms                     65 > 50 (0.01); 50 > 65 (0.007)                     0.987              0.001
50 v 40       356–460 ms; 552–2000 ms                     40 > 50 (0.003); 40 > 50 (0.006)                    0.986              0.001
65 v 40       428–580 ms; 648–2000 ms                     65 > 40 (0.01); 40 > 65 (0.009)                     0.983              0.001
intensities (see Fig. 4).
3.2.2. Time course The procedures for analyzing the time series data were identical to those described in Experiment 1.
3.2.2.2. Competitor fixations. Again, as in Experiment 1, for competitor fixations (cohort and rhyme) we use unrelated fixations as a baseline to control for differential levels of looking between sound levels (see Fig. 5 for raw fixations to cohorts, rhymes, and unrelated items, and Fig. S4 in Supplementary materials for single-subject raw fixations to the competitors and unrelated items). For cohort fixations, as in Experiment 1, 27 subjects were fit using the double Gaussian function in a within-subject design. We compared each sound level (3 comparisons), resulting in 162 fits in all (27 participants × 2 word types [cohort and unrelated] × 3 sound levels [40, 50, 65 dB]). In the fitting stage, 113 curves were good fits (R2 ≥ 0.95), 38 curves had reasonable fits (R2 ≥ 0.8), 10 curves had fits that were below R2 = 0.8 (no fit was < 0.70), and 1 fit was dropped from the 50 dBA condition. The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 3). Though the overall pattern of looking appeared similar for 65 dBA and 50 dBA speech, there were two time windows that showed condition specific differences. We found a significant difference from 336 to 568 ms (50 dBA > 65 dBA) and from 792 to 2000 ms (65 dBA > 50 dBA). Next, we compared cohort activation for the two levels of soft speech (50 dBA and 40 dBA). We found regions of significance from 236 to 668 ms (50 dBA > 40 dBA), 784 to
3.2.2.1. Target fixations. For analyses related to the timeseries of target fixations, we collapsed across all four trial types (TCRU, TCUU, TRUU, and TUUU) (see Fig. S3 in Supplemental materials for single-subject raw fixation curves to the target). All 27 subjects were fit using a fourparameter logistic function in a within-subject study design as described in Experiment 1. In the fitting stage, all 81 curves (one curve per sound level for 27 subjects) had good fits (R2 ≥ 0.95). The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 3 for full results). First, we compared target curves for 65 dBA and 50 dBA speech. We found a significant difference from 520 to 2000 ms in which proportion fixations to 65 dBA speech were significantly higher. Further, proportion looking to the target was significantly higher for 50 dBA compared to 40 dBA speech from 500 to 1012 ms. Finally, there was a significant difference in proportion fixations to the target in 65 dBA speech compared to 40 dBA speech from 444 to 2000 ms, in which looking to the target was greater for 65 dBA speech. As predicted, we see a pattern in which participants looked to the target image faster and to a greater extent for conversational speech compared to speech at lower
Fig. 4. Target fixations (Experiment 2: Blocked presentation). A) Proportion fixations to the target image across the four trial types (TCRU, TCUU, TRUU, TUUU) by sound level (40, 50, 65 dBA). B) Average proportion fixations to the target by sound level for the three time windows of significance.
Fig. 5. Raw fixations to cohorts, rhymes, and unrelated items by sound level for blocked presentation. Shaded region represents the standard error.
Fig. 6. Competitor fixations (Experiment 2: Blocked presentation). Difference in proportion fixations (after curve fitting) to cohorts-unrelated (A) and rhymes-unrelated (B) by sound level (40, 50, 65 dBA).
level. To statistically test the effect of syllable length and sound level on competition, we compared each sound level separately for mono- and bi-syllabic words again using BDOTS (see Tables 4 & 5 for results). Mono-syllabic cohorts demonstrated a graded pattern of effects, such that cohort activation was highest for 65 dBA speech, followed by 50 dBA, and finally, 40 dBA speech (see Fig. 8). For mono-syllabic rhymes, the 65 dBA condition demonstrated the greatest levels of activation, though in a mid-latency time window, the two levels of soft speech exhibited more rhyme activation than did conversational speech. For bi-syllabic words a different pattern of effects emerged for both cohorts and rhyme. For cohorts, activation for the 50 dBA condition was greatest, and the 65 dBA and 40 dBA conditions displayed similar levels of activation. For rhymes, 40 dBA speech demonstrated greater activation than both 65 dBA and 50 dBA speech, and late in processing a graded pattern emerged (40 dBA > 50 dBA > 65 dBA).
844 ms (40 dBA > 50 dBA), and 1044 to 1280 ms (40 dBA > 50 dBA). Finally, there was a significant difference between 65 dBA and 40 dBA speech from 200 to 700 ms (65 dBA > 40 dBA) (see Fig. 6). For timeseries analyses of rhyme fixations the same 27 subjects were fit using the double Gaussian function in a within-subject design. Again, we compared each sound level (3 comparisons in all), resulting in 162 fits in all (27 participants × 2 word types [cohort and unrelated] × 3 sounds levels [40, 50, 65 dB]). In the fitting stage, 89 curves were good fits (R2 ≥ 0.95), 65 curves had reasonable fits (R2 ≥ 0.8), 8 curves had fits that were below R2 = 0.8 (though none were < 0.70), and 1 subject was dropped due to poor fitting in at least one of their curves (R2 < 0.70) (see Fig. 6). The bootstrapping stage was conducted three times, one for each sound level comparison (see Table 3). There was a significant difference in rhyme fixations between 65 dBA and 50 dBA speech from 344 to 648 ms (65 dBA > 50 dBA), and 732 to 1284 ms (50 dBA > 65 dBA). Further, we found a significant difference between the two soft speech levels (40 dBA and 50 dBA) from 356 to 460 ms and 552 to 2000 ms in which there were more fixations to 40 dBA speech. Finally, there were two time windows of significance for the 40 dBA and 65 dBA comparison: 428–580 (65 dBA > 40 dBA) and 648–2000 ms (40 dBA > 65 dBA). To further tease apart the influence of soft speech on the dynamics of word recognition, we analyzed competitor effects separately for mono- and bi-syllabic words. Fig. 7 shows the raw fixations (before curve fitting) to mono- and bi-syllabic cohorts, rhymes, and unrelated items by sound level. For cohorts, activation for mono- and bi-syllabic words for 65 dBA speech was similar; however, for the two levels of soft speech, cohort activation was greater for bi-syllabic compared to monosyllabic. For rhymes, consistent with recent research (Simmons & Magnuson, 2018), bi-syllabic words elicited more activation than mono-syllabic words across sound level, though the precise effect of syllable length on rhyme competition is heavily influenced by sound
3.3. Discussion In Experiment 2, we presented sound level in a blocked fashion to determine how the predictability in the listening environment influenced the dynamics of lexical competition across sound level. As in Experiment 1, there was a significant decrease in cohort activation for 40 dBA speech compared to both 50 dBA and 65 dBA speech. However, unlike Experiment 1, 65 dBA speech demonstrated less cohort activation than 50 dBA speech. For rhymes, there was also a relative decrease in activation for the 65 dBA condition, and rhyme activation was greatest for 40 dBA speech. Further, target fixations displayed the expected pattern of effects: listeners were fastest at fixating the target for 65 dBA speech, followed by 50 dBA speech, and finally 40 dBA speech. Thus, when sound level was presented in a blocked fashion, listeners became faster at fixating the target as sound level increased, and this was accompanied by a relative decrease in phonological competition 9
Fig. 7. Raw fixations to competitors (and unrelated items) by sound level (40, 50, 65 dBA), and word length (mono- vs. bi-syllabic).
for conversational speech.
4. Cross experiment comparisons
The pattern of results across Experiments 1 and 2 demonstrate that manipulating the predictability of a listening environment can change the dynamics of competition for conversational and soft speech. To further probe how the dynamics of competition change as a function of listening environment, we directly compared the time course of target, cohort, and rhyme fixations at each sound level across experiments. To obtain a detailed characterization of the time course of fixations across experiment we again used Bootstrapped Differences of Timeseries (BDOTS; a statistical package within R, Version 1.1.456, R Core Team, 2014). For these analyses, the same curve fits reported in Experiments 1 and 2 were used to compare fixations for the same sound level across experiment (see Section 2.2.2 for description of the bootstrapping and multiple comparison correction procedures). First, we compared target fixations across experiments within each sound level (see Fig. 9). For the 65 dBA condition target fixations for the blocked presentation were significantly higher than the interleaved presentation from 512 to 2000 ms. The pattern of effects for the two levels of soft speech were similar across experiments. In an early and short-lived time window fixations to 40 dBA speech (from 400 to 552 ms) and 50 dBA speech (from 352 to 560 ms) were higher in the interleaved presentation, though soon after, this effect flipped, as target
5. General discussion
Table 5
Results of timeseries analysis for Experiment 2 by word length (mono- and bi-syllabic), word type (Target, Cohort, Rhyme), and sound level (40, 50, 65 dBA).

Word type   Comparison   Significant time window(s), ms   Direction of effect          Autocorrelation   Adjusted alpha
Mono-syllabic
  Cohort    65 v 50      400–764                          65 > 50                      0.994             0.003
  Cohort    50 v 40      284–300; 388–560; 1104–2000      50 > 40; 50 > 40; 40 > 50    0.998             0.008
  Cohort    65 v 40      368–888                          65 > 40                      0.994             0.003
  Rhyme     65 v 50      436–692; 860–956; 1496–2000      65 > 50; 50 > 65; 65 > 50    0.995             0.003
  Rhyme     50 v 40      660–812                          40 > 50                      0.998             0.006
  Rhyme     65 v 40      424–596; 648–984                 65 > 40; 40 > 65             0.995             0.003
Bi-syllabic
  Cohort    65 v 50      388–640                          50 > 65                      0.995             0.003
  Cohort    50 v 40      292–624; 772–876                 50 > 40; 40 > 50             0.994             0.003
  Cohort    65 v 40      268–356; 1184–2000               65 > 40; 40 > 65             0.998             0.008
  Rhyme     65 v 50      276–380; 1236–2000               65 > 50; 50 > 65             0.997             0.005
  Rhyme     50 v 40      284–488; 592–1084                40 > 50; 40 > 50             0.995             0.003
  Rhyme     65 v 40      368–464; 576–2000                40 > 65; 40 > 65             0.997             0.004
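As a schematic illustration of how significant windows like those in Table 5 are obtained, the sketch below resamples simulated participants' fixation curves, bootstraps the condition difference at each time point, and reports the contiguous windows whose confidence interval excludes zero. It is illustrative only: the data are simulated, and the fixed alpha used here is a simple placeholder for the autocorrelation-adjusted alpha that BDOTS derives (Seedorff, Oleson, & McMurray, 2018); the "Autocorrelation" and "Adjusted alpha" columns in Table 5 report that correction for each comparison.

```r
# Schematic sketch of a pointwise bootstrapped timeseries comparison.
# Simulated data; the alpha value below is a placeholder for the
# autocorrelation-based correction implemented in the bdots package.

set.seed(2)
n_sub   <- 27
time_ms <- seq(0, 2000, by = 4)

# Simulated per-participant fixation curves for two conditions (e.g., 65 vs. 40 dBA).
cond_a <- replicate(n_sub, 0.15 * dnorm(time_ms, 700, 250) / dnorm(700, 700, 250) +
                            rnorm(length(time_ms), sd = 0.02))
cond_b <- replicate(n_sub, 0.10 * dnorm(time_ms, 900, 300) / dnorm(900, 900, 300) +
                            rnorm(length(time_ms), sd = 0.02))

# Bootstrap the mean difference curve by resampling participants.
n_boot <- 1000
boot_diff <- replicate(n_boot, {
  idx <- sample(n_sub, replace = TRUE)
  rowMeans(cond_a[, idx]) - rowMeans(cond_b[, idx])
})

# Placeholder alpha: BDOTS instead derives an adjusted alpha from the
# autocorrelation of the test-statistic series (Seedorff et al., 2018).
alpha <- 0.05 / 10

lo  <- apply(boot_diff, 1, quantile, probs = alpha / 2)
hi  <- apply(boot_diff, 1, quantile, probs = 1 - alpha / 2)
sig <- lo > 0 | hi < 0                   # time points where the CI excludes zero

# Report contiguous significant windows in ms.
runs    <- rle(sig)
ends    <- cumsum(runs$lengths)
starts  <- ends - runs$lengths + 1
windows <- cbind(start = time_ms[starts[runs$values]], end = time_ms[ends[runs$values]])
print(windows)
```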
The goal of the present study was to investigate the temporal dynamics underlying the recognition of soft speech. We had two primary aims. First, we used eye-tracking in the VWP to examine how soft speech influences lexical access (i.e., the speed of word recognition) and lexical competition (i.e., which word types compete for recognition and when in processing competition occurs). For this aim we obtained precise information regarding how fixations to the target (e.g., candle) and phonological competitors (cohorts, e.g., candy; rhymes, e.g., handle) unfold over time. Second, we examined whether and to what extent lexical access and competition for soft speech are affected by speech-extrinsic factors (the uncertainty in the listening environment). For this aim we manipulated sound level presentation across two experiments. In Experiment 1, we randomly interleaved different sound levels of speech, whereas for Experiment 2 we presented sound level in a blocked design.

Overall, we found that both speech-intrinsic information (the sound level of the speech input) and speech-extrinsic information (whether there was uncertainty in the listening environment) influenced the dynamics of lexical access and competition. In Experiment 1, listeners fixated the target fastest for 50 dBA speech, followed by 65 dBA speech, and finally 40 dBA speech. Based on the desirable difficulty effect (Bjork, 1994), listeners may have had difficulty processing conversational speech when it was interleaved with soft speech, which slowed the rate of looking to the target. Another, complementary interpretation is that when sound levels were interleaved, listeners considered phonological competitors to a greater extent for conversational speech, which slowed down the speed of target recognition (Brouwer et al., 2012). Indeed, in Experiment 1, when sound level was presented in an interleaved fashion, phonological activation was greatest for conversational speech. Specifically, activation for cohort competitors demonstrated a graded effect such that activation was greatest for conversational speech, followed by slightly soft speech, and finally very soft speech. For rhyme competitors there was also a graded effect, in which activation was again greatest for conversational speech; however, unlike cohort competitors, very soft speech showed more rhyme activation than did slightly soft speech. Thus, in Experiment 1, when presented with two levels of soft speech randomly interleaved with conversational speech, listeners showed the most cohort and rhyme activation during conversational speech, and reduced activation for both competitor types for soft speech (Fig. 2).

As previously mentioned, research has shown that introducing uncertainty in a listening environment by interleaving conditions affects the nature of speech recognition and lexical competition (Brouwer et al., 2012; Brouwer & Bradlow, 2014). In a VWP task, Brouwer and colleagues presented canonical forms (i.e., optimal speech) and reduced forms (i.e., forms with missing segments) in an interleaved and a blocked fashion. They found that in the interleaved presentation, canonical phonological competitors were more strongly activated than in the blocked presentation. A similar result was found for different levels of vocoded speech (Smith & McMurray, 2018).
In Experiment 1 of the current study, when sound level was interleaved, activation for both cohorts and rhymes was greatest for conversational speech, and rhyme effects were particularly large (rhyme – unrelated peak height difference = 0.055). Previous studies using conversational speech presented in a blocked fashion have found much smaller rhyme effects (nearly half those reported in Experiment 1 of the current study; Farris-Trimble et al., 2014). Thus, the results from Experiment 1 begin to corroborate Brouwer and colleagues' finding that when interleaved with reduced speech, optimal input shows more phonological competition. To evaluate this claim, for Experiment 2, we presented sound level in a blocked fashion. Just as in Experiment 1, there was a significant decrease in cohort activation for 40 dBA speech compared to both 50 dBA and 65 dBA speech. However, unlike Experiment 1, cohort
looks were greater for the blocked condition (40 dBA [796–2000 ms], 50 dBA [680–2000 ms]). For cohorts, both the 65 dBA and 40 dBA conditions showed significantly more competition in the interleaved than in the blocked presentation around the peak of activation (65 dBA [448–788 ms], 40 dBA [360–664 ms]). Further, for the 65 dBA condition there were slightly more fixations for the blocked presentation later in processing (1020–2000 ms). For 50 dBA speech the pattern of effects was quite similar across experiments. The only difference was due to an overall shift in the fixation curve, which caused an early difference from 292 to 452 ms (blocked > interleaved), and a late difference in which this relationship flipped (636–912 ms, blocked > interleaved) (see Fig. 10).

Finally, we compared rhyme activation across experiments. As predicted, there was a robust difference in rhyme fixations for the 65 dBA condition such that participants fixated the rhyme to a greater extent in the interleaved compared to the blocked condition from 448 to 788 ms. Later in processing (1020–2000 ms), participants fixated rhymes slightly more in the blocked compared to the interleaved condition. For the two levels of soft speech, fixations across presentation type mostly patterned together, with a few exceptions. For 50 dBA speech, from 292 to 452 ms and from 636 to 913 ms, there were more fixations in the blocked compared to the interleaved presentation. Further, for 40 dBA speech, there was an early (360–664 ms) and short-lived difference such that proportion fixations to the rhyme were higher in the blocked condition.

Finally, we ran a two-way ANOVA to compare the average participant accuracy (logit transformed) for each sound level and experiment. There was a main effect of sound level [F(2, 114) = 120.06, p < .001]; however, there was no main effect of experiment [F(1, 57) = 0.035, p = .85], nor was there a significant sound level × experiment interaction [F(2, 114) = 1.69, p = .19].
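The accuracy analysis can be reproduced in outline with base R: average accuracy per participant and sound level, logit-transform it, and submit the transformed values to a two-way ANOVA with sound level and experiment as factors. The sketch below uses simulated data and, for brevity, ignores the repeated-measures structure of the actual design; it is illustrative only, and the column names are ours.

```r
# Illustrative sketch of the accuracy analysis: logit-transform each
# participant's mean accuracy per sound level and run a two-way ANOVA
# with sound level and experiment (interleaved vs. blocked) as factors.
# Simulated data; not the actual accuracy values or design structure.

set.seed(3)
d <- expand.grid(subject = 1:30,
                 level   = factor(c("40 dBA", "50 dBA", "65 dBA")),
                 expt    = factor(c("interleaved", "blocked")))

# Simulate accuracies that improve with sound level, on the logit scale.
d$acc   <- plogis(qlogis(0.9) + 0.8 * (d$level == "65 dBA") -
                    0.8 * (d$level == "40 dBA") + rnorm(nrow(d), sd = 0.3))
d$logit <- qlogis(pmin(pmax(d$acc, 0.01), 0.99))   # clamp before the logit

summary(aov(logit ~ level * expt, data = d))       # main effects + interaction
```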
Fig. 8. Difference in proportion fixations (after curve fitting) to cohorts-unrelated and rhymes-unrelated by sound level (40, 50, 65 dBA) and word length (mono- vs. bi-syllabic).
activation for 65 dBA and 50 dBA speech flipped. Thus, when sound level was blocked, 65 dBA speech demonstrated less cohort activation than did 50 dBA speech. For rhymes, there was also a relative decrease in activation for the 65 dBA condition, and rhyme activation was greatest for 40 dBA speech. Further, target fixations displayed the expected pattern of effects: listeners were fastest at fixating the target for 65 dBA speech, followed by 50 dBA speech, and finally 40 dBA speech. Thus, when sound level was presented in a blocked fashion, listeners were faster at fixating the target as sound level increased. To further probe how the dynamics of competition change as a function of listening environment, we directly compared the time course of target, cohort, and rhyme fixations at each sound level across experiments. For target fixations, listeners were faster at recognizing the target word in the blocked compared to the interleaved presentation
across all 3 sound levels. Further, consistent with Brouwer and colleagues' findings, we found that optimal speech input (65 dBA speech) demonstrated more cohort and rhyme activation in the interleaved compared to the blocked presentation. Interestingly, we see a similar pattern for 40 dBA speech (more competitor activation in the interleaved compared to the blocked design). For 50 dBA speech, by contrast, the pattern of results appears quite similar across experiments. We consider two interpretations of the cross-experiment findings most likely. First, as discussed in detail previously, hearing conversational speech interleaved with reduced speech may have caused listeners to accept acoustic mismatches more readily (Brouwer et al., 2012). From this view, interleaving optimal with suboptimal speech caused phonological competitors for optimal input to become stronger because the acoustic mismatches were considered less egregious. Indeed, Brouwer
Fig. 9. Cross study comparison of target fixations. Fixations to the target by Experiment ([EXP1] Interleaved vs. [EXP 2] Blocked) and sound level (40, 50, 65 dBA).
Fig. 10. Cross study comparison of competitors. Difference in competitor-unrelated fixations by Experiment ([EXP1] Interleaved vs. [EXP 2] Blocked) and sound level (40, 50, 65 dBA) for cohorts (a–c) and rhymes (d–f).
process as listeners adopt different approaches to managing temporary ambiguity depending on the nature of the listening environment, both in terms of the type of acoustic input (i.e., sound level) and the degree of overall uncertainty in what may come next (McMurray, Ellis, & Apfelbaum, 2019; Smith & McMurray, 2018). Given that previous studies and the current study have found that uncertainty in the listening environment can modulate speech-intrinsic lexical access and competition, and given that the majority of previous research using the VWP has used blocked designs, we further discuss the results of Experiment 2 with regard to how soft speech influences in-the-moment lexical competition.

For slightly soft speech, lexical competition appears similar to conversational speech, with some notable exceptions. For cohort competitors, listeners demonstrate a slight increase in cohort activation, whereas for rhyme competitors, activation starts later and is sustained later in processing. This is consistent with previous research suggesting that, given mildly distorted or reduced input (e.g., accented speech, background noise), activation is quite similar to optimal input, although listeners may maintain partial consideration of phonological competitors (Brouwer & Bradlow, 2016; Farris-Trimble et al., 2014). Similar to the results of the 50 dBA condition, normal-hearing listeners presented with CI-simulated speech (meant to simulate the speech faced by adult post-lingually deaf CI users)
and colleagues found heightened competitor activation for canonical inputs only when such inputs were interleaved with reduced input, as opposed to fully articulated but casual input. An alternative, though perhaps not mutually exclusive, interpretation is that the interleaved condition was simply more cognitively taxing. Several studies have examined how cognitive processes interact with speech recognition accuracy in adverse listening conditions (mainly background noise; see Mattys et al., 2012, for a review). Such studies suggest that listeners may have allocated more listening effort in the interleaved presentation because the task required more cognitive resources (Fraser, Gagné, Alepins, & Dubois, 2010; Gosselin & Gagné, 2011; Hicks & Tharpe, 2002; Zekveld, Kramer, & Festen, 2010). Thus, listeners may have been more engaged in the task when the sound level was interleaved because they needed to be prepared for hearing speech at lower intensities. If listeners were focused on processing the more difficult conditions, they may have been less sensitive to speech at a conversational level (Wu, Stangl, Zhang, Perkins, & Eilers, 2016).

Taken together, the results from Experiments 1 and 2 fit nicely with previous research on the importance of speech-extrinsic factors in lexical access and competition. These results support the notion that spoken word recognition is a highly adaptable and tunable cognitive
or significantly greater. What explains these differential results for mono- and bi-syllabic words by sound level? The current results may be accounted for by changes in the short- and long-word biases seen in computational models of speech perception (i.e., TRACE; McClelland & Elman, 1986). Early in processing, there is a bias for short words to compete for recognition, whereas later in processing competition favors longer words. These biases are the result of lateral inhibition from overlapping words: bi-syllabic words overlap with more words, and as a result receive more inhibition early on. However, later in processing, mono-syllabic words have largely been ruled out, and thus there is a later-occurring bias for longer words (a toy sketch illustrating this dynamic is provided below). Consistent with a wait-and-see approach, for very soft speech (i.e., 40 dBA), listeners appear to relax this early short-word bias (i.e., there is a decrease in cohort activation for mono-syllabic words), and increase the later-occurring long-word bias (i.e., there is an increase in rhyme activation for bi-syllabic words).

In combination with previous research, the current results help refine models of lexical processing, as these models were not initially meant to account for a range of adverse listening conditions. Indeed, a major limitation of traditional psycholinguistic theories of speech recognition is that they may underestimate speech-extrinsic factors (such as the nature of the listening environment), cognitive adaptability, and even the role of attention (Mattys et al., 2012).

From an applied perspective, the current results may serve as a model for how individuals who are hard of hearing recognize spoken words in real time. For instance, adapting the dynamics of competition from immediate competition to a wait-and-see approach could be adaptive for individuals who are hard of hearing. To illustrate, if the speech signal is less accessible, it may be detrimental to adopt an immediate competition approach. In this case, if listeners are not exactly sure what they heard, robustly activating multiple lexical candidates as soon as input arrives could cause activation to spread across too many lexical items. Because more lexical candidates are activated, there are more lexical items to inhibit, which may delay word recognition. A wait-and-see approach, on the other hand, allows listeners to accrue more lexical context, which reduces overall competition. Though determining whether the current results are indeed adaptive is beyond the scope of the present study, these results are in line with a growing body of research that portrays spoken word recognition as highly flexible and perhaps tunable.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.cognition.2020.104196.
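To make the lateral-inhibition account above concrete, here is a toy sketch (not TRACE itself, and not the authors' model): two lexical candidates receive bottom-up input and inhibit each other in proportion to their activation. The shorter candidate is supported only early (it is eventually ruled out by later input), so it dominates at first and suppresses the longer candidate, which recovers once the short word drops out. All parameter values and function names are arbitrary and purely illustrative.

```r
# Toy lateral-inhibition sketch (not TRACE itself): two competing lexical
# candidates receive bottom-up input and inhibit one another in proportion
# to their activation. The short word dominates early and the long word
# catches up once the short competitor loses bottom-up support.

update_act <- function(act, input, inhibition = 0.6, decay = 0.1) {
  # Each word: grow toward its input, minus decay and inhibition from the other word.
  pmax(0, act + input - decay * act - inhibition * (sum(act) - act))
}

n_steps <- 40
act     <- c(short = 0, long = 0)
history <- matrix(NA, n_steps, 2, dimnames = list(NULL, names(act)))

for (i in seq_len(n_steps)) {
  # Bottom-up support: the short word matches early and then drops out;
  # the long word keeps matching as more of the input arrives.
  input <- c(short = if (i <= 15) 0.12 else 0, long = 0.10)
  act <- update_act(act, input)
  history[i, ] <- act
}

round(history[c(5, 15, 40), ], 3)   # short word peaks early; long word overtakes late
```

Under these arbitrary settings, the short candidate peaks early and the long candidate overtakes it late, mirroring the early short-word bias and later-occurring long-word bias described above.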
sustain activation for rhyme but not cohort competitors (Farris-Trimble et al., 2014). These results suggest that listeners were less certain when they heard 50 dBA speech compared to 65 dBA speech, and because this uncertainty lasted late into processing, there is a persistent processing cost for slightly soft speech even on correctly identified trials (Brouwer & Bradlow, 2016). This profile of sustained competitor activation may be particularly useful in cases where a mistake was made and a recently rejected candidate must be reactivated as the target (Clopper, 2014; Luce & Cluff, 1998; McMurray, Tanenhaus, & Aslin, 2009).

Conversely, for very soft speech (40 dBA), the pattern of effects is different. Listeners demonstrated a decrease in cohort activation and a corresponding increase in rhyme activation. These results are in line with a wait-and-see approach, in which listeners wait to activate lexical items until more input has accrued (McMurray et al., 2017). This wait-and-see approach minimizes lexical ambiguity by waiting to engage lexical candidates until a point later in the word, at which there are fewer possible candidates. To illustrate, a typical competition pattern (similar to that observed for 65 dBA speech) is to activate a range of lexical candidates immediately, to the extent that they match the incoming signal. Crucially, this typical pattern heavily weighs word-initial overlap. What results is robust cohort activation and short-lived, but significant, rhyme activation. Conversely, the wait-and-see approach heavily downweighs word-initial overlap and gives more importance than is typical to word-final overlap, because listeners are waiting until more lexical context is encountered. This pattern, in which listeners reduce cohort activation and increase rhyme activation, has been observed in more egregious adverse listening environments (e.g., when noise replaces phonemes; McQueen & Huettig, 2012) and for populations who have severely degraded input (e.g., prelingually deaf CI users; McMurray et al., 2017). It must be noted that in the current study, the degree to which listeners wait to activate competitors is less than what is seen in CI users. This may reflect the fact that the chosen soft speech conditions were relatively easy for listeners (accuracy was very high in all conditions). Perhaps at lower sound levels it would be clearer that listeners indeed use a wait-and-see approach when listening to a suboptimal signal.

It is important to note that both levels of soft speech demonstrated some degree of sustained rhyme activation when compared to conversational speech. This suggests that rhymes are particularly susceptible to any degree of signal reduction. A similar finding has been reported for noise-vocoded speech (Farris-Trimble & McMurray, 2013). Farris-Trimble and colleagues explain this finding in terms of when disambiguating information is encountered. Cohort competitors eventually receive disambiguating information as to word identity (after the medial vowel). Conversely, once rhyme competitors are active, they are indistinguishable from the target word until after the word is heard. Therefore, during soft speech, activation builds up for rhymes and is sustained because there is no information in the speech signal to rule them out.

Finally, we conducted a follow-up analysis in Experiment 2 examining the influence of sound level on lexical competition separately for mono- and bi-syllabic words.
For mono-syllabic words, cohort and rhyme competition was weaker for soft speech than for conversational speech. Previous research on lexical competition in adverse listening conditions has shown increased or sustained activation for rhyme competitors (Brouwer & Bradlow, 2016; McQueen & Huettig, 2012). However, these previous studies mainly used bi-syllabic words, which have been shown to demonstrate more robust and consistent rhyme competition than mono-syllabic words (Simmons & Magnuson, 2018). Indeed, we replicate this effect in the current study: across sound level, rhyme competition was greater for bi- compared to mono-syllabic words. The overall pattern of results for bi-syllabic words was quite different from that for mono-syllabic words. Competition (cohort and rhyme) for bi-syllabic words was either commensurate with conversational speech,
Author statement

Kristi Hendrickson: Conceptualization, methodology, validation, formal analysis, investigation, resources, data curation, writing – original draft, writing – review & editing, visualization, supervision, project administration. Jessica Spinelli: Conceptualization, formal analysis, investigation, data curation, writing – original draft, project administration. Elizabeth Walker: Conceptualization, methodology, resources, writing – original draft, writing – review & editing, supervision, project administration.

Acknowledgements

This research was supported by the Iowa Center for Research by Undergraduates (ICRU) awarded to the 2nd author. We thank Kathryn Gabel, Lindsey Meyer, and Lyndi Roecker for assistance with data collection.

References

Allopenna, P., Magnuson, J., & Tanenhaus, M. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38(4), 419–439.
Apfelbaum, K. S., Blumstein, S. E., & McMurray, B. (2011). Semantic priming is affected by real-time phonological competition: Evidence for continuous cascading systems. Psychonomic Bulletin & Review, 18(1), 141–149.
Ben-David, B. M., Chambers, C. G., Daneman, M., Pichora-Fuller, M. K., Reingold, E. M., & Schneider, B. A. (2011). Effects of aging and noise on real-time spoken word recognition: Evidence from eye movements. Journal of Speech, Language, and Hearing Research, 54(1), 243–262.
Bjork, R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe, & A. P. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 185–205). Cambridge, MA: MIT Press.
Boersma, P., & Weenink, D. (2019). Praat: Doing phonetics by computer [Computer program]. Version 6.0.50. Retrieved from http://www.praat.org/.
Brouwer, S., & Bradlow, A. R. (2014). Contextual variability during speech-in-speech recognition. The Journal of the Acoustical Society of America, 136(1), EL26–EL32.
Brouwer, S., & Bradlow, A. R. (2016). The temporal dynamics of spoken word recognition in adverse listening conditions. Journal of Psycholinguistic Research, 45(5), 1151–1160.
Brouwer, S., Mitterer, H., & Huettig, F. (2012). Speech reductions change the dynamics of competition during spoken word recognition. Language & Cognitive Processes, 27(4), 539–571.
Clopper, C. G. (2014). Sound change in the individual: Effects of exposure on cross-dialect speech processing. Laboratory Phonology, 5(1), 69–90.
Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing. Cognitive Psychology, 6(1), 84–107.
Dahan, D., & Magnuson, J. S. (2006). Spoken word recognition. Handbook of psycholinguistics (pp. 249–283). Academic Press.
Dahan, D., Magnuson, J. S., & Tanenhaus, M. K. (2001). Time course of frequency effects in spoken-word recognition: Evidence from eye movements. Cognitive Psychology, 42, 317–367.
Davidson, L. S. (2006). Effects of stimulus level on the speech perception abilities of children using cochlear implants or digital hearing aids. Ear and Hearing, 27(5), 493–507.
Farris-Trimble, A., & McMurray, B. (2013). Test–retest reliability of eye tracking in the visual world paradigm for the study of real-time spoken word recognition. Journal of Speech, Language, and Hearing Research, 56(4).
Farris-Trimble, A., McMurray, B., Cigrand, N., & Tomblin, J. B. (2014). The process of spoken word recognition in the face of signal degradation. Journal of Experimental Psychology: Human Perception and Performance, 40(1), 308.
Fraser, S., Gagné, J. P., Alepins, M., & Dubois, P. (2010). Evaluating the effort expended to understand speech in noise using a dual-task paradigm: The effects of providing visual speech cues. Journal of Speech, Language, and Hearing Research, 53, 18–33.
Frauenfelder, U. H., Scholten, M., & Content, A. (2001). Bottom-up inhibition in lexical selection: Phonological mismatch effects in spoken word recognition. Language & Cognitive Processes, 16(5–6), 583–607.
Gosselin, P. A., & Gagné, J. P. (2011). Older adults expend more listening effort than young adults recognizing speech in noise. Journal of Speech, Language, and Hearing Research, 54, 944–958.
Hicks, C. B., & Tharpe, A. M. (2002). Listening effort and fatigue in school-age children with and without hearing loss. Journal of Speech, Language, and Hearing Research, 45, 573–584.
Holden, L. K., Reeder, R. M., Firszt, J. B., & Finley, C. C. (2011). Optimizing the perception of soft speech and speech in noise with the Advanced Bionics cochlear implant system. International Journal of Audiology, 50(4), 255–269.
Holden, L. K., Finley, C. C., Firszt, J. B., Holden, T. A., Brenner, C., Potts, L. G., ... Skinner, M. W. (2013). Factors affecting open-set word recognition in adults with cochlear implants. Ear and Hearing, 34(3), 342.
Luce, P. A., & Cluff, M. S. (1998). Delayed commitment in spoken word recognition: Evidence from cross-modal priming. Perception & Psychophysics, 60(3), 484–490.
Luce, P. A., & Pisoni, D. B. (1998). Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19(1), 1.
Marslen-Wilson, W., & Zwitserlood, P. (1989). Accessing spoken words: The importance of word onsets. Journal of Experimental Psychology: Human Perception and Performance, 15(3), 576.
Marslen-Wilson, W. D. (1987). Functional parallelism in spoken word-recognition. Cognition, 25(1–2), 71–102.
Mattys, S. L., Davis, M. H., Bradlow, A. R., & Scott, S. K. (2012). Speech recognition in adverse conditions: A review. Language & Cognitive Processes, 27(7–8), 953–978.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86.
McMurray, B., Ellis, T. P., & Apfelbaum, K. S. (2019). How do you deal with uncertainty? Cochlear implant users differ in the dynamics of lexical processing of noncanonical inputs. Ear and Hearing.
McMurray, B., Farris-Trimble, A., & Rigler, H. (2017). Waiting for lexical access: Cochlear implants or severely degraded input lead listeners to process speech less incrementally. Cognition, 169, 147–164.
McMurray, B., Farris-Trimble, A., Seedorff, M., & Rigler, H. (2016). The effect of residual acoustic hearing and adaptation to uncertainty on speech perception in cochlear implant users: Evidence from eye-tracking. Ear and Hearing, 37(1), e37.
McMurray, B., Samelson, V. M., Lee, S. H., & Tomblin, J. B. (2010). Individual differences in online spoken word recognition: Implications for SLI. Cognitive Psychology, 60(1), 1–39.
McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within-category phonetic variation on lexical access. Cognition, 86(2), 33–42.
McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2009). Within-category VOT affects recovery from "lexical" garden-paths: Evidence against phoneme-level inhibition. Journal of Memory and Language, 60(1), 65–91.
McQueen, J. M., & Huettig, F. (2012). Changing only the probability that spoken words will be distorted changes how they are recognized. The Journal of the Acoustical Society of America, 131(1), 509–517.
Moberly, A. C., Bates, C., Harris, M. S., & Pisoni, D. B. (2016). The enigma of poor performance by adults with cochlear implants. Otology & Neurotology, 37(10), 1522.
Norris, D., & McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115(2), 357.
Olsen, W. O. (1998). Average speech levels and spectra in various speaking/listening conditions. American Journal of Audiology, 7(2), 21–25.
Pearsons, K. S., Bennett, R. L., & Fidell, S. (1977). Speech levels in various noise environments. Office of Health and Ecological Effects. US EPA: Office of Research and Development.
Peng, S. C., Chatterjee, M., & Lu, N. (2012). Acoustic cue integration in speech intonation recognition with cochlear implants. Trends in Amplification, 16(2), 67–82.
R Core Team (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.
Scharenborg, O., Coumans, J. M., & van Hout, R. (2018). The effect of background noise on the word activation process in nonnative spoken-word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 44(2), 233.
Seedorff, M., Oleson, J., & McMurray, B. (2018). Detecting when timeseries differ: Using the Bootstrapped Differences of Timeseries (BDOTS) to analyze Visual World Paradigm data (and more). Journal of Memory and Language, 102, 55–67.
Simmons, E., & Magnuson, J. (2018). Word length, proportion of overlap, and phonological competition in spoken word recognition. Proceedings of the 40th Annual Conference of the Cognitive Science Society (pp. 1064–1069). Madison, WI.
Skinner, M. W., Holden, L. K., Holden, T. A., Demorest, M. E., & Fourakis, M. S. (1997). Speech recognition at simulated soft, conversational, and raised-to-loud vocal efforts by adults with cochlear implants. The Journal of the Acoustical Society of America, 101(6), 3766–3782.
Smith, F., & McMurray, B. (2018). Lexical access in the face of degraded speech: The effects of cognitive adaptation. Poster session presented at the Acoustical Society of America Annual Meeting, Victoria, BC, Canada.
Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217), 1632–1634.
van der Feest, S. V., Blanco, C. P., & Smiljanic, R. (2019). Influence of speaking style adaptations and semantic context on the time course of word recognition in quiet and in noise. Journal of Phonetics, 73, 158–177.
Weber, A., & Scharenborg, O. (2012). Models of spoken-word recognition. Wiley Interdisciplinary Reviews: Cognitive Science, 3(3), 387–401.
Winn, M. B., & Litovsky, R. Y. (2015). Using speech sounds to test functional spectral resolution in listeners with cochlear implants. The Journal of the Acoustical Society of America, 137(3), 1430–1442.
Wu, Y. H., Stangl, E., Zhang, X., Perkins, J., & Eilers, E. (2016). Psychometric functions of dual-task paradigms for measuring listening effort. Ear and Hearing, 37(6), 660.
Zekveld, A. A., Kramer, S. E., & Festen, J. M. (2010). Pupil response as an indication of effortful listening: The influence of sentence intelligibility. Ear and Hearing, 31(4), 480–490.