Brain & Language 124 (2013) 165–173


Audiovisual perception of noise vocoded speech in dyslexic and non-dyslexic adults: The role of low-frequency visual modulations

Odette Megnin-Viggars, Usha Goswami
Centre for Neuroscience in Education, University of Cambridge, UK
http://dx.doi.org/10.1016/j.bandl.2012.12.002

Article history: Accepted 5 December 2012; available online 25 January 2013.
Keywords: Auditory–visual; Temporal; Prosody; Dyslexia; Amplitude envelope

Abstract

Visual speech inputs can enhance auditory speech information, particularly in noisy or degraded conditions. The natural statistics of audiovisual speech highlight the temporal correspondence between visual and auditory prosody, with lip, jaw, cheek and head movements conveying information about the speech envelope. Low-frequency spatial and temporal modulations in the 2–7 Hz range are of particular importance. Dyslexic individuals have specific problems in perceiving speech envelope cues. In the current study, we used an audiovisual noise-vocoded speech task to investigate the contribution of low-frequency visual information to intelligibility of 4-channel and 16-channel noise vocoded speech in participants with and without dyslexia. For the 4-channel speech, noise vocoding preserves amplitude information that is entirely congruent with dynamic visual information. All participants were significantly more accurate with 4-channel speech when visual information was present, even when this information was purely spatio-temporal (pixelated stimuli changing in luminance). Possible underlying mechanisms are discussed.

© 2013 Elsevier Inc. All rights reserved.

1. Introduction

Dyslexia can be defined as a specific difficulty in reading and spelling that cannot be accounted for by low intelligence, poor educational opportunity, or obvious sensory/neurological damage. The accepted core cognitive deficit involves specific problems with phonological representations and processing (e.g. Ziegler & Goswami, 2005). At the sensory level, it appears that these phonological difficulties might be related to impaired processing of the temporal modulation patterns associated with the syllable rate and the amplitude envelope, one of the critical acoustic properties underlying syllable rate (the temporal sampling hypothesis, Goswami, 2011; see also Abrams, Nicol, Zecker, & Kraus, 2009; Goswami et al., 2002). Children with dyslexia have particular difficulty in perceiving accurately the rate of onset of amplitude envelopes (Goswami, 2011, for a recent review). Children with dyslexia also show impairments in discriminating frequency modulation at slower rates (e.g., 2 Hz, English, Witton et al., 1998; Norwegian, Talcott et al., 2003), and adults with dyslexia show significantly reduced neural phase locking to 2 Hz amplitude-modulated white noise (Hämäläinen, Rupp, Soltész, Szücs, & Goswami, 2012). For speech processing, these auditory difficulties should be associated with cognitive difficulties with syllabic segmentation of the speech

stream and with difficulties in processing rhythmic and prosodic features associated with the speech envelope. These cognitive difficulties are indeed found in developmental dyslexia (Goswami, Gerson, & Astruc, 2010; Leong, Hämäläinen, Soltész, & Goswami, 2011; Thomson & Goswami, 2008; Thomson, Fryer, Maltby, & Goswami, 2006). Advances in our understanding of speech processing have highlighted the critical role played by low frequency modulations and the amplitude envelope in speech intelligibility (Drullman, Festen, & Plomp, 1994a, 1994b; Shannon, Zeng, Kamath, Wygonski, & Ekelid, 1995). The acoustic cues specifying the temporal patterning of larger speech units like syllables and of prosodic structure are primarily found in the amplitude envelope. As described by Giraud and Poeppel (2012), regular modulations of signal energy over time give rhythmic structure to the envelope. For speech, modulations at a rate of 4–6 Hz have the strongest power, the ‘‘syllable rate’’ (Greenberg et al., 2003). The importance of low frequency modulations for speech intelligibility has been demonstrated by extracting the amplitude envelope for different speech frequency bands and then selectively removing (filtering out) modulations at different frequencies from these sub-band envelopes. For example, experiments by Drullman and his colleagues showed that most of the important information for speech intelligibility lay in modulation frequencies between 1 and 16 Hz (Drullman et al., 1994a, 1994b). Meanwhile, it has long been known that the addition of visual speech information improves speech intelligibility over that of auditory speech information alone (e.g., Grant & Braida, 1991;


Grant & Seitz, 2000; Sumby & Pollack, 1954). Some aspects of speech that are difficult to hear (e.g., place of articulation) are often relatively easy to see (Walden, Prosek, Montgomery, Scherr, & Jones, 1977). However, visual speech incorporates more than lipreading cues to phonetic information, as shown by the extensive work of Munhall and colleagues on the role of rhythmic facial and head movement cues in conveying information about the speech envelope. Head movements and eyebrow movements as well as lip, jaw and cheek movements are known to be systematically related to speech amplitude and fundamental frequency (e.g., Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004; Munhall, Kross, & Vatikiotis-Bateson, 2002). When a person is speaking, their facial movements are systematically related to their speech production because the vocal tract is the source of the movement (Yehia, Kuratate, & Vatikiotis-Bateson, 2002; Yehia, Rubin, & VatikiotisBateson, 1998). Configuring the vocal tract to shape the acoustic signal simultaneously deforms the face. Yehia et al. (2002) showed experimentally that it is possible to estimate speech acoustics from face motion (and vice versa) with high reliability, using both English and Japanese speakers. Most visual speech information is recovered from the lower spatial and temporal frequencies (below 7 cycles per face, and 6–9 frames per second, respectively). Interestingly, there are data suggesting that individuals with dyslexia show enhanced visuo-spatial processing for low frequency information. Schneps, Brockmole, Sonnert, and Pomplun (2012) demonstrated that when detecting visual targets in natural scenes that had been low-pass filtered, dyslexic adults accrued greater advantage compared to controls when these blurred natural scenes were presented repeatedly (the contextual cueing effect). Schneps et al. suggested that visual perception of low frequency spatial components could be enhanced in developmental dyslexia, noting that dyslexics seem over-represented in fields requiring the use of spatial information in low spatial frequency scenes such as astronomy (Schneps, Rose, & Fischer, 2007). This raises the possibility that in dyslexia, visual sensitivity to low frequency information may be enhanced to compensate for poorer sensitivity to low frequency auditory information, an hypothesis we explore here. Recent investigations of the natural statistics of audiovisual (AV) speech have also highlighted the importance of low frequency modulations (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009). Chandrasekaran et al. investigated natural spoken sentences in English and French, analysing both the area of mouth opening as a function of time and the wideband and narrowband speech envelopes. The visual measure hence estimated the visual temporal content of audiovisual speech, while the auditory measures were indices of auditory temporal content. Temporal frequency modulations for these visual and auditory components were estimated using Fourier-based signal processing, and the coherence between the auditory and visual signals was estimated as a function of temporal modulation frequency. Chandrasekaran et al. (2009) reported a close temporal correspondence between the area of the mouth opening and the wideband acoustic envelope. Further, this temporal co-ordination had a distinct rhythm that was between 2 and 7 Hz. 
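As an illustration of the kind of analysis just described, the sketch below computes an audio–visual coherence spectrum from a mouth-opening time series and the wideband speech envelope. It is not the pipeline of Chandrasekaran et al. (2009): the sampling rates, envelope smoothing cut-off and window length are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): audio-visual coherence as a function
# of temporal modulation frequency, in the spirit of Chandrasekaran et al. (2009).
import numpy as np
from scipy.signal import coherence, hilbert, butter, filtfilt

FS_VIDEO = 30.0  # assumed video frame rate; both signals are compared at this rate

def wideband_envelope(speech, audio_fs, out_fs=FS_VIDEO):
    """Hilbert envelope of the waveform, low-pass filtered (assumed 10 Hz)
    and crudely resampled to the video frame rate."""
    env = np.abs(hilbert(speech))
    b, a = butter(4, 10.0 / (audio_fs / 2))
    env = filtfilt(b, a, env)
    idx = np.linspace(0, len(env) - 1, int(len(env) * out_fs / audio_fs)).astype(int)
    return env[idx]

def av_coherence(mouth_area, speech, audio_fs):
    """mouth_area: area of mouth opening per video frame; speech: waveform.
    Returns modulation frequencies and coherence; inspect the 2-7 Hz band."""
    env = wideband_envelope(speech, audio_fs)
    n = min(len(env), len(mouth_area))
    f, cxy = coherence(mouth_area[:n], env[:n], fs=FS_VIDEO, nperseg=128)
    return f, cxy
```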
Drawing on recent oscillatory frameworks for speech perception (Ghitza, 2011; Ghitza & Greenberg, 2009; Giraud & Poeppel, 2012; Hickok & Poeppel, 2007), Chandrasekaran et al. observed that the natural temporal features of AV speech signals are optimally structured for the neural (oscillatory) rhythms of the brains of their receivers. Given that visual articulatory cues can precede vocalisations by 200 ms or more, Schroeder, Lakatos, Kajikawa, Partan, and Puce (2008) have suggested that visual gestures reset auditory cortex to the "optimal state" for processing the succeeding vocalisations. Hence auditory and visual low frequency temporal modulations may play important complementary roles in the neural processing of speech.

The importance of low frequency auditory and visual temporal information for speech perception is supported by recent neuroimaging studies. For example, Luo, Liu, and Poeppel (2010) were able to measure low frequency auditory–visual integration directly in adults using MEG. They demonstrated that human auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation. Luo et al. showed adult participants movie clips during either congruent (same auditory and visual information) or incongruent (auditory track of one movie clip with visuals from another movie clip) viewing conditions. Using a phase coherence analysis technique, Luo et al. were able to measure how the phase of cortical responses was coupled to aspects of stimulus dynamics. They found that the auditory cortex reflected dynamic (low frequency) aspects of the visual signal as well as tracking auditory stimulus dynamics. The visual cortex reflected dynamic (low frequency) aspects of the auditory signal as well as tracking visual stimulus dynamics. The neural mechanism underpinning AV integration was cross-modal phase modulation. The data showed that the visual stream in a matched movie modulated the auditory cortical activity by aligning the phase to the optimal phase angle, so that the expected auditory input arrived during a high excitability state (inputs arriving during high excitability states evoke greater neuronal responding and are assumed to be better encoded). The auditory stream in a matched movie had a similar effect on visual cortical activity. Although not tested in Luo et al.’s study, the data suggest that cross-modal low frequency phase modulation may be one mechanism whereby the complementary information provided by the visual stream boosts neuronal auditory signal processing in noisy or degraded conditions. Cross-modal phase modulation has also been measured in EEG studies. To date, most studies have used rhythmic stimulus streams rather than speech, for example measuring anticipatory phase locking to visual flashes or auditory tones presented in a rhythmic stream (e.g., Besle et al., 2011; Gomez-Ramirez et al., 2011; Stefanics et al., 2010). These studies have shown that oscillatory entrainment enhances target detection in different attended paradigms. For example, Besle et al. (2011) studied adult epileptic patients, and were able to take intracranial recordings during a paradigm involving interleaved auditory and visual rhythmic stimulus streams (following the paradigm devised for animals by Lakatos, Karmos, Mehta, Ulbert, and Schroeder (2008)). They reported that the degree of entrainment depended on the predictability of the stream, and represented a reorganisation of ongoing neuronal activity so that excitability was phase-shifted to align with the stimulation rate, in their case in the delta band (1.5 Hz). Power, Mead, Barnes, and Goswami (2012) used a rhythmic entrainment paradigm with speech, and studied children. They presented either auditory-alone, visual-alone or auditory–visual (AV) speech in a paradigm involving rhythmic repetition of the syllable ‘‘ba’’. Children had to respond with a button press when a ‘‘ba’’ was out of rhythm (temporally delayed). Power et al. reported cross-modal phase modulation for the 13-year-old children that they studied. Comparison of entrainment to auditory-alone and (AVminusV) speech showed that preferred phase was altered in the theta band by predictive visual speech information. Power et al. 
argued that their data supported Schroeder et al.’s (2008) suggestion that visual rhythmic information modulates auditory oscillations to the optimal phase for auditory processing and audio-visual integration. Further, they found that individual differences in auditory theta entrainment were related to individual differences in reading development. These neural studies offer a potential mechanism for explaining the beneficial effects of visual information on auditory signal processing shown in behavioural cross-modal studies. It has long been known that the visual information provided by articulatory movements enhances the intelligibility of speech, with estimates of gains in intelligibility of around 11 dB (Macleod & Summerfield, 1987). In


addition to the quantitative gain conferred by the accompanying lip movements, visual speech can also produce a qualitatively different percept. For instance, the McGurk effect (McGurk & MacDonald, 1976) describes a bimodal illusion whereby an incongruent auditory and visual syllable (e.g. auditory ‘‘ba’’ and the lip movements ‘‘ga’’) produce an illusory percept (e.g. ‘‘da’’). These visually-induced auditory illusions provide evidence for an automatic and mandatory effect of visual speech on auditory speech processing, and are also observed in infants as young as 4-months-old (Burnham & Dodd, 2004). In the current study, we sought behavioural evidence for cross-modal low frequency phase modulation in AV speech processing in dyslexia. Given dyslexic impairment in processing low frequency amplitude modulations (Hämäläinen et al., 2012), and possible dyslexic enhancement in processing low frequency visuo-spatial information (Schneps et al., 2012), it is plausible to suggest that low frequency visual information might offer particular benefits to AV speech processing for individuals with developmental dyslexia (a compensation hypothesis). Here we attempted to isolate the benefit offered by cross-modal low frequency modulation coherence by devising two AV conditions. In the AV Face condition, participants were presented with videos of faces speaking, and both heard the target and saw the accompanying lip movements of the speaker. In a second AV condition (AV Pixelated condition), the auditory targets were paired with a dynamic visual pixelated stimulus. AV pixelated stimuli were created by inverting the video image of the face and applying a mosaic effect in order to remove the actors’ facial features, whilst maintaining an equivalent luminance level and retaining the temporal dynamic characteristics of the stimuli (see Fig. 1). This pixelation technique reduced high frequency spatial and temporal information (such as small, fast muscular movements) to a greater extent than low frequency spatial and temporal information (such as slower movements of larger muscles) via the utilisation of large, relatively coarse pixels to create the mosaic. To create these larger pixels, the luminance changes over the original (smaller) pixels were averaged over time. Hence faster temporal changes (such as tongue or lip movements) were damped to a greater extent than the slower temporal changes carried over larger spatial areas (such as jaw movements). Therefore pixelation largely preserved the slower luminance changes in the visual scene and preserved their temporal correspondence with the auditory stimulus, while fine-grained lip and tongue cues were lost (visemes were not available). If AV integration is governed by neural integration at lower frequencies (2–7 Hz), then having multi-modal (auditory and visual) information available might boost speech processing in dyslexia, even when facial cues are not present. If this were the case, then dyslexic participants might gain more than control participants from the AV Pixelated stimuli (in both accuracy and response time). Audiovisual speech perception for real word targets was studied. As word recognition should be an easy task, and as all participants were adult and high-functioning, the auditory signal was degraded in order to avoid ceiling effects. We used a noise vocoding technique for signal degradation. A vocoder divides the speech signal

Fig. 1. Example frames taken from the video stimuli used in the AV Face and AV Pixelated experimental conditions; these visual stimuli were accompanied by noise-vocoded monosyllabic spoken words.


into logarithmically spaced frequency bands, and then extracts the amplitude envelope (the slow-varying pattern of energy change) in each band. The residual fast-varying ‘‘fine structure’’, carrying information about pitch and formant changes, is discarded. These amplitude envelopes are used to modulate white noise (band-limited in the corresponding frequencies), and the resulting modulated bands of noise are recombined to create a noise-vocoded sentence. The vocoding technique simulates speech as heard through a cochlear implant. Untrained listeners find it difficult to recognise noise vocoded speech when the entire frequency spectrum is divided into 4 bands (4 channels), but most listeners can recognise sentences created from 10 bands or more. This is because the cochlea also divides the audio frequency spectrum into many bands, representing the envelope and fine structure for each frequency band separately. As more bands or channels are present, the spectral patterns of energy change in the vocoded stimuli become more similar to real speech, facilitating speech recognition. Conversely, when few channels are present, speech recognition relies more heavily on supra-segmental cues such as prosodic stress patterning, known to be impaired in developmental dyslexia (Goswami et al., 2010). Therefore, participants with dyslexia should be relatively more affected by fewer channels when attempting to interpret noise vocoded speech. Utilising the principle of inverse effectiveness (Meredith & Stein, 1986; Stein & Meredith, 1993), we expected that multisensory enhancement would be greatest when unisensory stimuli were weaker. Therefore, we predicted that visual temporal information would enhance auditory signal processing for the 4-channel speech in particular. As dyslexics should find it more difficult than control participants to recognise 4-channel speech, the provision of visual temporal information could potentially boost their performance to non-dyslexic levels. Two types of noise vocoded speech were used, 4-channel speech and 16-channel speech. Sixteen channel speech should be relatively easy to recognise for both dyslexic and non-dyslexic participants.
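For readers who have not encountered the technique, the following is a minimal sketch of a noise vocoder of the kind described above (after Shannon et al., 1995). The band edges, filter order and envelope smoothing cut-off shown here are illustrative assumptions rather than the exact parameters of the present study (those are given in Section 2.2).

```python
# Minimal noise-vocoder sketch: log-spaced analysis bands, slow amplitude
# envelope per band, envelope-modulated band-limited noise, recombination.
# Filter order and the 30 Hz envelope cut-off are assumptions for illustration.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def noise_vocode(speech, fs, n_channels=4, f_lo=200.0, f_hi=7000.0):
    # assumes fs is comfortably above 2 * f_hi (e.g. 22.05 or 44.1 kHz audio)
    edges = np.logspace(np.log10(f_lo), np.log10(f_hi), n_channels + 1)
    rng = np.random.default_rng(0)
    out = np.zeros_like(speech, dtype=float)
    b_env, a_env = butter(2, 30.0 / (fs / 2))          # smooth the envelope
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, speech)                   # analysis band
        env = filtfilt(b_env, a_env, np.abs(hilbert(band)))   # slow envelope
        noise = filtfilt(b, a, rng.standard_normal(len(speech)))  # band-limited noise
        out += env * noise                              # envelope-modulated noise band
    return out / np.max(np.abs(out))                    # recombine and normalise
```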

2. Method

2.1. Participants

Forty-five adults participated in this study, drawn from a university population (hence the dyslexic participants were relatively high-functioning). Twenty-three participants comprised the dyslexic group (20 females, mean age 23.1 years, SD 3.6 years) and 22 participants comprised the typically-developing control group (15 females, mean age 25.2 years, SD 4.5 years). The groups were predominantly female due to the greater number of female participants who volunteered for the study (see also Vandermosten et al., 2010). All participants gave their informed consent for the study, spoke English as a first language, were free of neurological disease, and had normal hearing and normal or corrected to normal vision. All participants in the dyslexic group had a formal educational statement of developmental dyslexia, indicating a childhood history and current concessions for study and exams. IQ was assessed with the Wechsler Abbreviated Scale of Intelligence (WASI; Wechsler, 1999). The groups did not differ significantly in either full scale IQ (dyslexics = 118.0, SD = 10.6; controls = 117.0, SD = 7.4), verbal IQ (dyslexics = 117.1, SD = 13.9; controls = 114.9, SD = 8.2), or non-verbal IQ (dyslexics = 115.3, SD = 9.8; controls = 114.7, SD = 9.6). Reading and spelling were also assessed for the current study with the Wide Range Achievement Test (WRAT-3; Wilkinson, 1993). Here the groups showed significant differences, as would be expected given the statements of dyslexia. Mean reading standard score was 108.0 (SD 8.4) for the dyslexic participants and 113.7 (SD 6.2) for the controls, t(1, 43) = 2.6, p < .05. Mean spelling standard score was 106.0 (SD 9.4) for the dyslexic participants and 115.7 (SD 6.7) for the controls, t(1, 43) = 4.0, p < .001. The groups did not differ statistically in age. For some participants we also had additional background information including a non-word reading measure (TOWRE [Test of Word Reading Efficiency, Torgesen, Wagner, & Rashotte, 1999] Phonemic Decoding Efficiency scale) and psychoacoustic threshold measures (for sound rise time, pitch, and intensity). For the TOWRE PDE and psychoacoustic threshold measures, non-parametric Mann–Whitney U tests were performed in view of the small group sizes. The groups differed significantly for the TOWRE PDE (U = 17.5, p < .01) and for sound rise time (U = 33, p < .05), but not for intensity (U = 56) or pitch perception (U = 56.5).

2.2. Stimuli

Participants were told that they would be performing an auditory word recognition task in difficult listening conditions. The total task lasted around 10 min. Noise-vocoded monosyllabic spoken words were presented. As described previously, noise-vocoding involves extracting the amplitude envelope in different frequency bands (channels) in speech and using the envelope to modulate white noise that has been band-limited to the same frequency band. It thereby forces greater perceptual reliance on amplitude envelope cues, as less spectral information is available. For the present study speech was vocoded to 4- and 16-channels, over the frequency range 200 Hz to 7 kHz, using the technique described by Shannon et al. (1995). The speech signal was divided into logarithmically-spaced frequency bands using 4th order Butterworth filters and following the cut-off frequencies used by Fu, Chinchilla, and Galvin (2004). The distribution of the analysis filters was according to Greenwood’s (1990) formula. There were 54 words for each type of degraded speech (4-channel, 16-channel). Although stimulus presentation was not blocked, three conditions were used to present the noise-vocoded monosyllabic words: an auditory-only (A) condition, an audiovisual with face (AV Face) condition, and an audiovisual ‘low frequency temporal dynamics’ condition without face information (AV Pixelated, see Fig. 1). The AV Pixelated stimuli were created by inverting the video image and applying a mosaic effect which divided the screen into squares and replaced the luminance in each square with the mean value of the small individual pixels in the square. This manipulation ensured equivalent total luminance and preserved the gross temporal modulations in the visual scene and their temporal correspondence with the auditory stimulus, but prevented the use of fine-grained (facial) information to aid speech intelligibility. Hence the AV Pixelated condition estimates the effects of low frequency visual temporal information on speech perception in the absence of facial movements. There were 36 words in each condition (18 of each channel type) spoken by four different actors. In the A condition participants heard the words while seeing a blank screen, and then saw the response alternatives. In the AV Face and AV Pixelated conditions, participants were presented with videos and both heard the word and saw the accompanying video before being presented with the response alternatives. The target words were matched for word length and frequency of occurrence in spoken language (derived from the London–Lund Corpus of English Conversation by Brown, 1984) across conditions. All words are provided as Appendix A.
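As a concrete illustration of the per-frame averaging just described, the sketch below replaces each coarse block of an inverted luminance frame with its mean value. The block size is an assumption, since the exact mosaic dimensions used for the stimuli are not reported here.

```python
# Minimal sketch of the mosaic/pixelation manipulation described above: invert
# the frame, then replace each coarse block by its mean luminance, damping fine
# spatial detail while keeping the gross luminance changes over time.
import numpy as np

def pixelate_frame(frame, block=24):
    """frame: 2-D array of luminance values; block size is an assumption."""
    inverted = frame.max() - frame                     # invert the video image
    h, w = inverted.shape
    out = np.empty_like(inverted, dtype=float)
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = inverted[y:y + block, x:x + block]
            out[y:y + block, x:x + block] = tile.mean()  # mean luminance per block
    return out
```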
The videos were edited using Adobe Premiere Pro 2.0 with a digitization rate of 30 frames per second. Video images were cropped to display head and shoulder views. The stimuli were temporally aligned as in natural speech.

2.3. Procedure

Participants were seated in a sound-attenuated booth and stimuli were presented using Presentation software (version 10.3, Neurobehavioural Systems). Each trial began with a fixation star. For AV stimuli, videos were then displayed centred on a black background. The size of the image was 17.7° × 12.8°. Sounds were presented binaurally through earphones at 70 dB SPL. In order to record reaction times consistently, the speaker face was always visible first, and the auditory onset began 500 ms later. Lip movements would begin prior to the onset of sound for each word as in natural speech production, with initial mouth movements beginning on average 332 ms before the word was heard. These facial movements would be expected to initiate the AV phase locking mechanism that is our focus of interest. The 108 trials were presented in the same pseudo-randomised presentation order for all participants, beginning at a different randomly-selected point in the ordering for each participant. Participants were required to perform a two-alternative forced choice task indicating which word they had just heard by pressing a button on their left or right. Although this was a reading measure of prior spoken word recognition, the words were highly overlearned, and both groups showed equivalent overall response speed (see below).

2.4. Data analysis

Repeated measures ANOVAs for accuracy and reaction time data in the two groups were conducted with within-subject factors of condition (3 levels: A, AV Face, AV Pixelated), and channel (2 levels: 4-channel, 16-channel). The between-subjects factor was group (2 levels: dyslexic, control). The accuracy measure was sensitivity as measured by d′. The reaction time was the length of time between presentation of the response screen and the participants’ key press. For the reaction time analysis, only correct responses were included, and reaction time outliers were excluded by pooling responses across conditions for each participant and excluding reaction times ±2 standard deviations from the individuals’ mean reaction time. Less than 5% of trials were excluded.

3. Results

3.1. Accuracy

Mean performance by group is shown in Fig. 2. The d′ and criterion values were also calculated as a measure of sensitivity whilst controlling for bias, and the d′ measures by group and condition are shown in Table 1. One-way ANOVAs comparing mean d′ (dyslexic 3.3, control 3.4) and c (dyslexic 0.18, control 0.20) values revealed no significant group effects for either measure (d′: F(1, 43) = 0.47; c: F(1, 43) = 0.10). Hence there was no overall difference between control and dyslexic participants in their sensitivity (d′) in the task or in bias towards giving a left-hand or right-hand response. A repeated-measures ANOVA taking d′ scores as the dependent variable revealed significant main effects of Condition, F(2, 42) = 82.9, p < .0001, ηp² = .798, and of Channel, F(1, 43) = 227.0, p < .0001, ηp² = .841, but no main effect of Group, F(1, 43) = 0.5, and no interactions involving Group. The Condition by Channel interaction was significant, F(2, 42) = 31.5, p < .0001, ηp² = .600. Inspection of Fig. 2 and Table 1 suggests that, as expected, performance was always superior with the 16-channel speech. To explore the source of the interaction further, one-way repeated measures ANOVAs were run for each channel, taking Condition as the repeated measure (3 levels). For the 4-channel ANOVA, all means were significantly different from each other, with strongest performance in the AV Face condition, and weakest performance in the Auditory condition (Tukey’s post-hoc tests, p’s < .0001).
Performance in the AV Pixelated condition was significantly better than in the Auditory condition, p < .0001. Hence the AV Pixelated stimuli, which provided no viseme or other facial


Fig. 2. Mean accuracy (number correct out of 18) for control and dyslexic participants in the auditory-only, audiovisual Face, and audiovisual Pixelated conditions with 4-channel and 16-channel noise-vocoded speech.

Table 1. Sensitivity (d′) by Group and Condition.

                       4-Channel                    16-Channel
Condition              Dyslexic      CA Controls    Dyslexic      CA Controls
AV Face (s.d.)         3.26 (0.53)   3.42 (0.32)    3.66 (0.26)   3.72 (0.25)
AV Pixelated (s.d.)    2.52 (0.53)   2.60 (0.56)    3.47 (0.32)   3.51 (0.27)
Auditory Only (s.d.)   2.24 (0.69)   2.16 (0.66)    3.38 (0.45)   3.44 (0.38)

information, enhanced the intelligibility of 4-channel speech significantly for both groups. As the pixelation technique was selectively preserving gross visual spatial and temporal changes, the visual dynamic information was congruent with the grosser auditory temporal information (amplitude envelope information) that was preserved in the 4-channel speech, thereby preserving cross-modal low frequency dynamics. For the 16-channel ANOVA, performance in the Auditory condition and in the AV Pixelated condition did not differ. Presumably here the slow visual dynamics preserved in the AV Pixelated condition were less congruent with the more detailed auditory dynamics, and hence did not support intelligibility. Performance in the AV Face condition was however significantly superior to performance in both of these conditions, AV Face versus AV Pixelated, p < .05, AV Face versus Auditory, p < .0001. Therefore, while AV Face information led to superior speech intelligibility when more spectral content was available (16-channel speech), dynamic visual information alone (the AV Pixelated condition) conferred no intelligibility benefit. Possibly, the AV Pixelated stimuli did not affect phase resetting for 16-channel speech, because there was no complementary narrowband visual information (which was present, of course, in the AV Face condition). In contrast, when low-frequency visual temporal information was present which was congruent with amplitude information (as in 4-channel speech), both AV Face and AV Pixelated information improved speech intelligibility. Low-frequency visual temporal dynamics hence appear to boost speech intelligibility for both groups when optimally congruent with the temporal dynamics of amplitude envelope information.

3.2. Response speed

Mean response time by group is shown in Fig. 3. Both groups were faster with the easier 16-channel speech, and were fastest of all with the AV Face stimuli. However, for the more degraded speech (4-channel condition), both visual conditions appeared to facilitate performance, enabling faster correct responses compared to the Auditory condition. Possible differences were investigated using a repeated-measures ANOVA with reaction time as the dependent variable. The ANOVA showed main effects of Condition, F(2, 42) = 62.8, p < .0001, ηp² = .749, and of Channel, F(1, 43) = 178.0, p < .0001, ηp² = .805, but again no main effect of Group, F(1, 43) = 2.4. However, there were significant interactions between Condition × Channel, F(2, 86) = 18.1, p < .0001, ηp² = .463, and Group × Condition × Channel, F(2, 86) = 3.5, p < .05, ηp² = .144. As by hypothesis the dyslexics should differ from controls in the Auditory condition, the source of the 3-way interaction was explored using separate Group × Channel (2 × 2) repeated measures ANOVAs, one for each condition. In each case these ANOVAs showed a significant main effect of Channel, and no significant main effect of Group. Only one ANOVA showed a significant interaction between Channel and Group, the ANOVA for the Auditory condition, F(1, 43) = 4.4, p < .05, ηp² = .093. Inspection of Fig. 3 suggests that the dyslexic participants were performing significantly more slowly than controls with the 4-channel speech, but not with the 16-channel speech. This was confirmed by a pair of independent samples t-tests, 4-channel t(43) = 1.83, p = .04; 16-channel t(43) = .86, p = .20 (one-tailed tests). Hence the groups differed in response speed with auditory information alone in the more difficult 4-channel condition. This is consistent with our hypothesis of an auditory processing deficit in dyslexia for speech envelope information, which would predict that the relative impairment in dyslexic performance for auditory information alone should be greater with the 4-channel speech. At the same time, however, the dyslexic participants showed statistically equivalent speed of performance to controls in the AV Face and AV Pixelated conditions for the 4-channel speech (no Group main effect and no Group × Channel interaction). Both groups showed improved accuracy in the 4-channel condition with AV Pixelated stimuli as well as with AV Face stimuli, and found the Auditory condition most difficult. The data suggest that both groups showed processing speed gains for 4-channel speech when visual information was available, even if the visual information was purely temporal (the AV Pixelated condition). The reaction time data provide further behavioural support for low frequency visual temporal modulations having a beneficial effect on auditory signal processing. Contrary to a compensation hypothesis, this beneficial effect was again found for both groups.

3.3. Multiple regression analysis

As the dyslexic participants were significantly slower than controls in perceiving 4-channel noise vocoded speech in the Auditory


Fig. 3. Mean reaction time in milliseconds for control and dyslexic participants in the auditory-only, audiovisual Face, and audiovisual Pixelated conditions with 4-channel and 16-channel noise-vocoded speech.

only condition, there is nevertheless indirect support for the hypothesis that dyslexics might benefit more from the presence of visual information during speech processing. The dyslexics were significantly slower than control participants with auditory information alone, yet did not differ from controls in response speed when low frequency visual temporal information was provided. Therefore, the low frequency visual dynamic information was boosting dyslexic response speeds to control levels. In order to investigate whether there was any evidence that low frequency visual information was more beneficial to participants with poorer auditory perception, multiple regression equations were run. By hypothesis, poorer rise time discrimination is a marker for impaired amplitude envelope processing in dyslexia. Therefore participants with higher rise time thresholds (reduced sensitivity to rise time) might show greater gains in processing accuracy when low frequency visual temporal information is present. On the other hand, if efficient cross-modal phase modulation depends on both visual and auditory modulation rates being perceived efficiently, then reduced auditory sensitivity to rise time would be associated with reduced benefit from visual dynamics (Luo et al., 2010). In either case, individual differences in rise time discrimination should be significantly related to gains in speech intelligibility (d′). If low frequency visual information is able to compensate to some extent for poor auditory perception of low frequency modulations, boosting auditory signal processing, then the relationship between rise time processing and intelligibility gains should be positive. Higher auditory thresholds should be related to greater accuracy gains. In contrast, if poor auditory perception of low frequency amplitude modulations also affects auditory–visual integration, then the relationship should be negative. Participants with poorer amplitude modulation perception (as indexed by higher rise time thresholds) would gain less benefit from low frequency visual dynamics, showing smaller sensitivity benefits. To explore whether there was a significant relationship between auditory sensory perception and the gains conferred by visual information, 3-step multiple regression equations were computed for all participants who contributed auditory data.¹ Two control outlier scores were removed for pitch, and one dyslexic outlier score was removed for intensity. Gain in d′ (either AV Pixelated – A; or AV Face – A) was the dependent variable in each equation. Six equations were run in total, each entering age at step 1, full scale IQ at step 2 (to control for effects of intelligence), and then entering an auditory measure at step 3 (rise time, frequency or intensity). The results are shown in Table 2. As can be seen, individual differences in rise time discrimination accounted for a significant 24% of unique variance in gains in speech intelligibility in the AV Pixelated condition (p = .016), and for 12% of unique variance in gains in speech intelligibility in the AV Face condition (not significant, p = .108). No other effects approached significance. As the standardised Beta coefficient was negative in each case, the data suggest that it is the participants who were more sensitive to rise time who gained more intelligibility benefit from the low frequency visual dynamic information (see scatterplot provided as Fig. 4).

¹ Auditory data had been collected for 25 participants, 9 of whom were controls. The two sub-groups were still matched for FSIQ (dyslexics = 121.5, SD = 8.9; controls = 117.7, SD = 6.5), and still differed significantly in spelling standard score (dyslexics = 106.5, SD = 7.5; controls = 113.6, SD = 7.2; p < .05) and nonword reading (dyslexics = 49.0, SD = 8.3; controls = 58.5, SD = 6.2; p < .05), but did not differ significantly in reading standard score (dyslexics = 108.8, SD = 6.5; controls = 112.7, SD = 7.0).
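To make the logic of these equations concrete, the sketch below shows one way a single 3-step equation could be computed, taking the increase in R² at step 3 as the unique variance attributable to the auditory measure. It is not the authors' analysis script, and the data-frame column names are hypothetical.

```python
# Minimal sketch of one 3-step hierarchical regression behind Table 2:
# enter age, then full-scale IQ, then one auditory measure, and take the
# increase in R-squared at step 3 as the unique variance explained.
# Column names (age, iq, rise_time, d_gain) are hypothetical.
import pandas as pd
import statsmodels.api as sm

def r2_change(df: pd.DataFrame, auditory: str = "rise_time", dv: str = "d_gain"):
    z = (df - df.mean()) / df.std()   # z-score all (numeric) columns so the
                                      # step-3 coefficient is a standardised Beta
    def fit(cols):
        return sm.OLS(z[dv], sm.add_constant(z[cols])).fit()
    step2 = fit(["age", "iq"])                  # steps 1-2: control variables
    step3 = fit(["age", "iq", auditory])        # step 3: add the auditory measure
    return step3.rsquared - step2.rsquared, step3.params[auditory]
```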

4. Discussion

Here we used an audiovisual noise-vocoded speech task to investigate the contribution of visual speech to auditory processing in dyslexia. Low-frequency spatial and temporal modulations in the 2–7 Hz range are of particular importance in AV speech integration, and individuals with dyslexia have specific problems in perceiving low-frequency auditory temporal modulations, indexed by rise time perception difficulties (Goswami, 2011; Hämäläinen et al., 2012). Individuals with dyslexia also show enhanced processing of visual spatial information in low-pass filtered natural visual scenes, which may suggest enhanced peripheral processing mechanisms (Schneps et al., 2007, 2012). To investigate whether cross-modal congruence between low-frequency visual and auditory speech rhythm cues plays a role in supporting compromised amplitude envelope perception in dyslexia, we studied the perception of noise vocoded speech in two AV conditions. In one condition, both visual prosody on the face and viseme information were present (AV Face condition). In a second condition, low frequency visual spatio-temporal modulations were dominant, as participants viewed mosaic-like stimuli varying in luminance (AV Pixelated condition). Two manipulations of the speech signal were compared, 16-channel noise vocoded speech, which contains spectral patterns of energy change that are closer to real speech, and 4-channel noise vocoded speech, which is quite challenging to perceive. For 4-channel noise vocoded speech, amplitude information is preserved that is congruent with the visual dynamic information in the AV Pixelated condition. We were interested in the possibility that individuals with dyslexia may use visual speech


Table 2. Unique variance (R² change) in Intelligibility Gains (d′) for (a) the AV Face and (b) the AV Pixelated conditions explained by auditory processing after controlling for age and IQ.

Step            a. Beta      a. R² change   b. Beta      b. R² change
                (AV Face)    (AV Face)      (AV Pix)     (AV Pix)
1. Age          .084         .007           .134         .018
2. IQ           .172         .030           .055         .003
3. Rise time    .344         .114           .502         .243*
3. Frequency    .185         .027           .234         .043
3. Intensity    .168         .028           .098         .010

Note. AV Face = AV Face condition, AV Pix = AV Pixelated condition. Each auditory measure was entered at step 3 in a separate equation.
* p < .05.

Fig. 4. Scatterplot of rise time threshold against speech sensitivity gain (d′ gain) from the AV Pixelated stimuli with the 4-channel noise vocoded speech.

information to compensate for their impairments in auditory speech perception when listening to real words. In fact, the data showed that both groups benefited to a similar extent from low level visual dynamics. Indeed, and contrary to our original hypothesis, regression analyses showed that those participants with more impaired auditory processing (as indexed by poorer rise time perception) gained less in terms of processing accuracy from the low frequency visual information than those participants with better rise time processing. Therefore, the low frequency temporal sampling deficit proposed for developmental dyslexia (Goswami, 2011) may extend to other modalities in addition to the auditory modality. Regarding speech intelligibility overall, we found that adults both with and without dyslexia showed equivalent and significant processing enhancements when visual speech information was present on the face (the AV Face condition). For both speech conditions (4-channel, 16-channel), and for both groups, performance was significantly more accurate in the AV Face condition. However, for the more degraded speech (4channel condition), significant gains in accuracy also accrued to

both groups when low frequency visual temporal modulations were provided by pixelated matrix displays (accuracy in the AV Pixelated condition was significantly greater than in the Auditory alone condition). As low-frequency visual temporal dynamics were perfectly matched between AV Face and AV Pixelated stimuli, this is suggestive of a relatively low-level perceptual effect on accuracy for 4-channel speech, such as cross-modal phase resetting. The response speed data showed similar patterns. Although the participants with dyslexia were significantly slower than controls to recognise the target words in the Auditory alone condition with the 4-channel stimuli, they were as fast as controls when low frequency visual dynamic information was present, showing equivalent performance to controls with both AV Pixelated stimuli and AV Face stimuli. Therefore, cross-modal congruence between low-frequency visual and auditory speech rhythm cues did support faster and more accurate responding. However, the multiple regression analyses of sensitivity (d0 ) suggested that the assumed mechanism of phase-resetting of auditory cortical responding by visual information was more efficient for participants with better auditory processing, across both dyslexics and controls. Contrary to expectation, therefore, dyslexic participants did not show greater processing gains than control participants from the presence of low frequency visual dynamic information. Overall, our data are congruent with some earlier studies of auditory–visual speech processing in developmental dyslexia. In one early study, de Gelder and Vroomen (1998) presented 11 year old poor readers with a syllable discrimination task (/ ba/ versus /da/) delivered in three conditions, auditory alone, visual alone, or auditory–visual (AV). de Gelder and Vroomen (1998) found that poor readers were worse than age-matched and reading level controls in processing both auditory and visual speech. The groups did not differ in the AV condition. However, the weak performance shown by the poor readers in the visual speech condition led the authors to conclude that poorer readers are worse at speech reading than typically-developing controls. In a more recent study, Ramirez and Mann (2005) asked adult participants with dyslexia to recognise target consonant–vowel (CV) syllables like /ba/, /da/ and /ma/ in three conditions, auditory alone, visual alone, and AV. The task was made more difficult by introducing different levels of noise. Ramirez and Mann reported that the dyslexic participants were significantly worse at recognising the syllables in the visual alone condition, compared to non-impaired adults. They were also helped relatively less by AV presentation when noise was present. Ramirez and Mann concluded that individuals with dyslexia showed less effective use of visual articulatory cues when noise was present and also showed less accurate perception of visual speech in the absence of acoustic speech. In these two earlier behavioural studies, therefore, neither adults nor children with dyslexia showed signs of compensating for impaired auditory processing by an enhanced use of visual speech. Therefore, the current data converge with earlier studies in suggesting that dyslexic participants utilise visual information


to a similar extent to non-dyslexic participants when experiencing audiovisual speech in challenging listening conditions. Interestingly, both groups tested here showed a significant intelligibility benefit from low frequency visual modulations in the absence of facial information (greater accuracy and speed of responding in the AV Pixelated condition for the 4-channel speech). The pixelated data suggest that this effect may occur at a very basic level of signal processing, namely at the level of temporal modulation rather than at the level of "reading" speech cues on the lips or cheeks. It is also possible that our effects depend on the kind of degradation that we imposed on the auditory signal. For example, it would be interesting to explore the effects of pixelated images on processing time-compressed speech, where envelope following is again made more difficult, or on speech-in-noise (where all auditory temporal modulations are equally affected). The current pixelation technique offers a means for such experimental manipulations. Finally, it would be interesting to explore directly the effects of cross-modal phase resetting in the dyslexic brain, by measuring the visual modulation of ongoing auditory oscillations in EEG by the different kinds of AV stimuli used here.

Acknowledgments

We would like to thank our participants, and we thank Alan Power and Vicky Leong for their helpful input. This research was supported by funding from the Medical Research Council, grants G0400574 and G0902375. The sponsor had no input into study design, data analysis, or report writing. Requests for reprints should be addressed to Usha Goswami, Centre for Neuroscience in Education, Downing St., Cambridge CB2 3EB, UK.

Appendix A

Target and distracter word pairs (target/distracter) by condition and channel type.

Auditory-only, 4-channel: Bell/Tell, Boat/Goat, Bridge/Fridge, Cage/Page, Coal/Goal, Cut/Hut, Face/Race, Fight/Might, Flame/Blame, Foot/Soot, Half/Calf, Lips/Hips, Note/Vote, Nut/But, Paint/Faint, Push/Bush, Sink/Pink, Win/Bin.

Auditory-only, 16-channel: Boot/Root, Cold/Hold, Feed/Seed, Fix/Mix, Fox/Box, Hate/Date, Head/Dead, Heat/Neat, Hit/Bit, Lunch/Punch, Man/Ran, Net/Wet, Run/Fun, Sound/Pound, Talk/Walk, Toy/Boy, Try/Dry, Year/Hear.

AV Pixelated, 4-channel: Blood/Flood, Blow/Flow, Book/Took, Door/Moor, Duck/Luck, Fall/Ball, Lamb/Jamb, Look/Cook, Moon/Soon, Pen/Ten, Pin/Fin, Pull/Full, Rock/Sock, Ship/Chip, Sun/Bun, Tap/Nap, Van/Ban, Wait/Bait.

AV Pixelated, 16-channel: Back/Pack, Bat/Cat, Bike/Hike, Coat/Moat, Gun/Nun, Hall/Wall, Mouse/House, Nose/Rose, Nurse/Purse, Pet/Set, Pick/Sick, Road/Toad, Roof/Hoof, Rope/Hope, Tear/Bear, Tent/Dent, Tin/Sin, Tree/Free.

AV Face, 4-channel: Band/Hand, Bath/Path, Bed/Red, Burn/Turn, Cry/Fry, Dress/Press, Eye/Bye, Fork/Cork, Jump/Bump, Lie/Tie, Line/Nine, Make/Bake, Pig/Big, Plate/Slate, Play/Clay, Rain/Pain, Sing/Ring, Sit/Pit.

AV Face, 16-channel: Belt/Melt, Bite/Kite, Block/Clock, Brick/Trick, Can/Tan, Car/Bar, Fat/Hat, Game/Name, Hair/Pair, Land/Sand, Lap/Cap, Pan/Fan, Park/Dark, Roar/Soar, Shop/Chop, Slide/Glide, Wish/Dish, Wood/Good.

References

Abrams, D. A., Nicol, T., Zecker, S., & Kraus, N. (2009). Abnormal cortical processing of the syllable rate of speech in poor readers. Journal of Neuroscience, 29, 7686–7693.
Besle, J., Schevon, C. A., Mehta, A. D., Lakatos, P., Goodman, R. R., McKhann, G. M., et al. (2011). Tuning of the human neocortex to the temporal dynamics of attended events. Journal of Neuroscience, 31, 3176–3185.
Brown, G. D. A. (1984). A frequency count of 190,000 words in the London–Lund Corpus of English Conversation. Behavior Research Methods, Instruments, & Computers, 16, 502–532.
Burnham, D., & Dodd, B. (2004). Auditory-visual speech integration by prelinguistic infants: Perception of an emergent consonant in the McGurk effect. Developmental Psychobiology, 45, 204–220.
Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., & Ghazanfar, A. A. (2009). The natural statistics of audiovisual speech. PLoS Computational Biology, 5(7), e1000436.
de Gelder, B., & Vroomen, J. (1998). Impaired speech perception in poor readers: Evidence from hearing and speech reading. Brain and Language, 64, 269–281.
Drullman, R., Festen, J. M., & Plomp, R. (1994a). Effect of temporal envelope smearing on speech reception. Journal of the Acoustical Society of America, 95, 1053–1064.
Drullman, R., Festen, J. M., & Plomp, R. (1994b). Effect of reducing slow temporal modulations on speech reception. Journal of the Acoustical Society of America, 95, 2670–2680.
Fu, Q.-J., Chinchilla, S., & Galvin, J. J. (2004). The role of spectral and temporal cues in voice gender discrimination by normal-hearing listeners and cochlear implant users. Journal of the Association for Research in Otolaryngology, 5, 253–260.
Ghitza, O. (2011). Linking speech perception and neurophysiology: Speech decoding guided by cascaded oscillators locked to the input rhythm. Frontiers in Psychology, 2, 130.
Ghitza, O., & Greenberg, S. (2009). On the possible role of brain rhythms in speech perception: Intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. Phonetica, 66, 113–126.
Giraud, A. L., & Poeppel, D. (2012). Speech perception from a neurophysiological perspective. In D. Poeppel, T. Overath, A. N. Popper, & R. R. Fay (Eds.), The human auditory cortex (pp. 225–260). New York, NY: Springer.
Gomez-Ramirez, M., Kelly, S. P., Molholm, S., Sehatpour, P., Schwartz, T. H., & Foxe, J. J. (2011). Oscillatory sensory selection mechanisms during intersensory attention to rhythmic auditory and visual inputs: A human electrocorticographic investigation. Journal of Neuroscience, 31, 18556–18567.
Goswami, U. (2011). A temporal sampling framework for developmental dyslexia. Trends in Cognitive Sciences, 15(1), 3–10.
Goswami, U., Gerson, D., & Astruc, L. (2010). Amplitude envelope perception, phonology and prosodic sensitivity in children with developmental dyslexia. Reading and Writing, 23, 995–1019.

Goswami, U., Thomson, J., Richardson, U., Stainthorp, R., Hughes, D., Rosen, S., et al. (2002). Amplitude envelope onsets and developmental dyslexia: A new hypothesis. Proceedings of the National Academy of Sciences of the United States of America, 99, 10911–10916.
Grant, K. W., & Braida, L. D. (1991). Evaluating the articulation index for auditory–visual input. Journal of the Acoustical Society of America, 89, 2952–2960.
Grant, K. W., & Seitz, P.-F. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America, 108, 1197–1208.
Greenberg, S., Carvey, H., Hitchcock, L., & Chang, S. (2003). Temporal properties of spontaneous speech – a syllable-centric perspective. Journal of Phonetics, 31, 465–485.
Greenwood, D. D. (1990). A cochlear frequency-position function for several species – 29 years later. Journal of the Acoustical Society of America, 87, 2592–2605.
Hämäläinen, J. A., Rupp, A., Soltész, F., Szücs, D., & Goswami, U. (2012). Reduced phase locking to slow amplitude modulation in adults with dyslexia: An MEG study. Neuroimage, 59(3), 2952–2961.
Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.
Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I., & Schroeder, C. E. (2008). Entrainment of neuronal oscillations as a mechanism of attentional selection. Science, 320, 110–113.
Leong, V., Hämäläinen, J., Soltész, F., & Goswami, U. (2011). Rise time perception and detection of syllable stress in adults with developmental dyslexia. Journal of Memory and Language, 64, 59–73.
Luo, H., Liu, Z., & Poeppel, D. (2010). Auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation. PLoS Biology, 8, e1000445.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Macleod, A., & Summerfield, Q. (1987). Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21, 131–141.
Meredith, M. A., & Stein, B. E. (1986). Visual, auditory, and somatosensory convergence on cells in superior colliculus results in multisensory integration. Journal of Neurophysiology, 56, 640–662.
Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15, 133–137.
Munhall, K. G., Kross, C., & Vatikiotis-Bateson, E. (2002). Audiovisual perception of band-pass filtered faces. Journal of the Acoustical Society of Japan, 21, 519–520.
Power, A. J., Mead, N., Barnes, L., & Goswami, U. (2012). Neural entrainment to rhythmically-presented auditory, visual and audio-visual speech in children. Frontiers in Psychology, 3, 216. http://dx.doi.org/10.3389/fpsyg.2012.00216.
Ramirez, J., & Mann, V. (2005). Using auditory-visual speech to probe the basis of noise-impaired consonant-vowel perception in dyslexia and auditory neuropathy. Journal of the Acoustical Society of America, 118, 1122–1133.
Schneps, M. H., Rose, L. T., & Fischer, K. W. (2007). Visual learning and the brain: Implications for dyslexia. Mind, Brain & Education, 1(3), 128–139.


Schneps, M. H., Brockmole, J. R., Sonnert, G., & Pomplun, M. (2012). History of reading struggles linked to enhanced learning of low spatial frequency scenes. PLoS ONE, 7(4), e35742.
Schroeder, C. E., Lakatos, P., Kajikawa, Y., Partan, S., & Puce, A. (2008). Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences, 12, 106–113.
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303–304.
Stefanics, G., Hangya, B., Hernádi, I., Winkler, I., Lakatos, P., & Ulbert, I. (2010). Phase entrainment of human delta oscillations can mediate the effects of expectation on reaction speed. Journal of Neuroscience, 30, 13578–13585.
Stein, B. E., & Meredith, M. A. (1993). The merging of the senses. London: MIT Press.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.
Talcott, J. B., Gram, A., Van Ingelghem, M., Witton, C., Stein, J. F., & Toennessen, F. E. (2003). Impaired sensitivity to dynamic stimuli in poor readers of a regular orthography. Brain and Language, 87, 259–266.
Thomson, J. M., Fryer, B., Maltby, J., & Goswami, U. (2006). Auditory and motor rhythm awareness in adults with dyslexia. Journal of Research in Reading, 29, 334–348.
Thomson, J. M., & Goswami, U. (2008). Rhythmic processing in children with developmental dyslexia: Auditory and motor rhythms link to reading and spelling. Journal of Physiology-Paris, 102, 120–129.
Torgesen, J., Wagner, R., & Rashotte, C. (1999). Test of Word Reading Efficiency (TOWRE). UK: The Psychological Corporation.
Vandermosten, M., Boets, B., Luts, H., Poelmans, H., Golestani, N., Wouters, J., et al. (2010). Adults with dyslexia are impaired at categorising speech and nonspeech sounds on the basis of temporal cues. Proceedings of the National Academy of Sciences, 107(23), 10389–10394.
Walden, B. E., Prosek, R. A., Montgomery, A. A., Scherr, C. K., & Jones, C. J. (1977). Effects of training on the visual recognition of consonants. Journal of Speech and Hearing Research, 20, 130–145.
Wechsler, D. (1999). Wechsler abbreviated scale of intelligence. San Antonio: Psychological Corporation.
Wilkinson, G. (1993). Wide Range Achievement Test-3 (WRAT-3). Wilmington, DE: Jastak Associates.
Witton, C., Talcott, J. B., Hansen, P. C., Richardson, A. J., Griffiths, T. D., & Rees, A. (1998). Sensitivity to dynamic auditory and visual stimuli predicts nonword reading ability in both dyslexic and normal readers. Current Biology, 8, 791–797.
Yehia, H., Kuratate, T., & Vatikiotis-Bateson, E. (2002). Linking facial animation, head motion and speech acoustics. Journal of Phonetics, 30, 555–568.
Yehia, H., Rubin, P., & Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behaviour. Speech Communication, 26, 23–43.
Ziegler, J. C., & Goswami, U. (2005). Reading acquisition, developmental dyslexia, and skilled reading across languages: A psycholinguistic grain size theory. Psychological Bulletin, 131, 3–29.