Hearing Research 245 (2008) 35–47
Research paper
Envelope and spectral frequency-following responses to vowel sounds

Steven J. Aiken a,*, Terence W. Picton b

a School of Human Communication Disorders, Dalhousie University, 5599 Fenwick Street, Halifax, Canada B3H 1R2
b Rotman Research Institute, Baycrest Centre for Geriatric Care, University of Toronto, Canada
Article info

Article history: Received 16 February 2008; received in revised form 15 July 2008; accepted 13 August 2008; available online 19 August 2008.

Keywords: Auditory evoked potentials; Frequency-following responses; Speech envelope; Vowel sounds; Fourier analyzer

Abbreviations: CM, cochlear microphonic; FFR, frequency-following response; ABR, auditory brainstem response; ASSR, auditory steady-state response
Abstract

Frequency-following responses (FFRs) were recorded to two naturally produced vowels (/a/ and /i/) in normal hearing subjects. A digitally implemented Fourier analyzer was used to measure response amplitude at the fundamental frequency and at 23 higher harmonics. Response components related to the stimulus envelope ("envelope FFR") were distinguished from components related to the stimulus spectrum ("spectral FFR") by adding or subtracting responses to opposite polarity stimuli. Significant envelope FFRs were detected at the fundamental frequency of both vowels, for all of the subjects. Significant spectral FFRs were detected at harmonics close to formant peaks, and at harmonics corresponding to cochlear intermodulation distortion products, but these were not significant in all subjects, and were not detected above 1500 Hz. These findings indicate that speech-evoked FFRs follow both the glottal pitch envelope as well as spectral stimulus components. © 2008 Elsevier B.V. All rights reserved.
1. Introduction

Infants with hearing impairment detected by neonatal hearing screening are referred for hearing aids within the first few months of age. Fitting is mainly based on thresholds obtained by electrophysiological measurements – generally the auditory brainstem response (ABR) to tone bursts (Stapells, 2000b, 2002) or the auditory steady-state response (ASSR) to amplitude modulated tones (Picton et al., 2003; Stueve and O'Rourke, 2003; Luts et al., 2004; Luts and Wouters, 2005). However, these measurements are not exact. Both ABR and ASSR threshold estimates predict behavioral thresholds with standard deviations that range from 5 to 15 dB across various studies (see Tables 1 and 2 in Tlumak et al., 2007, and Table 4 in Herdman and Stapells, 2003; Stapells et al., 1990; Stapells, 2000a). It would thus be helpful to have some way of assessing how well the amplified sound is received in the infant's brain (Picton et al., 2001; Stroebel et al., 2007). Speech stimuli would be optimal because the main intent of amplification is to provide the child with sufficient speech information to allow communication and language learning.

Speech sounds elicit both transient and sustained activity in the human brainstem and cortex. Transient brainstem responses can be recorded with a consonant–vowel diphone stimulus. The speech-evoked auditory brainstem response evoked by /da/ has been used to investigate the brainstem encoding of speech in children
with learning problems (Cunningham et al., 2001; King et al., 2002) and children with auditory processing problems (Johnson et al., 2007), but not in children or adults wearing hearing aids. Transient cortical responses have also been used in subjects with hearing impairment and with hearing aids (Billings et al., 2007; Golding et al., 2007; Korczak et al., 2005; Rance et al., 2002; Tremblay et al., 2006). However, these responses are more variable in morphology than the brainstem responses – especially in infants (Wunderlich et al., 2006) – and less clearly related to speech parameters other than major changes in intensity or frequency.

Most commercial hearing aids exhibit sharply non-linear behavior designed to preferentially amplify speech and attenuate other sounds. As a result, hearing aid gain and output characteristics are different for speech and non-speech stimuli, and different for transient and sustained stimuli. We have therefore been considering the use of sustained speech stimuli such as vowel sounds (Aiken and Picton, 2006) or even sentences (Aiken and Picton, 2008). Sustained speech stimuli can evoke a variety of potentials from the cochlea to the cortex. Since cortical potentials in infants are variable and change with maturation, a reasonable approach might be to measure frequency-specific brainstem responses to speech stimuli presented at conversational levels.

Brainstem responses to sustained speech and speech-like stimuli have been called envelope-following responses (Aiken and Picton, 2006), frequency-following responses (Krishnan et al., 2004), and auditory steady-state responses (Dimitrijevic et al., 2004).
Table 1. Frequencies (Hz) of formants and harmonics

Vowel | First formant peak | Closest harmonic | Second formant peak | Closest harmonic
/a/   | 937                | 960 (f9)         | 1408                | 1387 (f13)
/i/   | 229                | 244 (f2)         | 2613                | 2562 (f21)
Table 2. Average response nomenclature

Response | Derivation | Components
++ | Average together all responses to the original stimulus | Envelope FFR; spectral FFR; cochlear microphonic; stimulus artifact
+− | Average together an equal number of responses to the original stimulus and responses to the inverted stimulus | Envelope FFR
−− | Subtract responses to the inverted stimulus from an equal number of responses to the original stimulus and divide by the total number of responses | Spectral FFR; cochlear microphonic; stimulus artifact
Although frequency-following responses have sometimes been distinguished from envelope-following responses (e.g. Levi et al., 1995), the term 'frequency-following response' (FFR) has been used to describe responses to speech formants (Plyler and Ananthanarayan, 2001), intermodulation distortion arising from two-tone vowels (Krishnan, 1999), speech harmonics (Aiken and Picton, 2006; Krishnan, 2002), and the speech fundamental frequency (Krishnan et al., 2004), which presumably relates to the speech envelope. Thus the term 'frequency-following response' can be used in a general sense – denoting a response that follows either the spectral frequency of the stimulus or the frequency of its envelope. For the purposes of this paper, we shall distinguish between "spectral FFR" and "envelope FFR." A similar distinction was suggested by Krishnan (2007, Table 15.1), who proposed using alternating-phase stimuli to record responses locked to the envelope and fixed-phase stimuli to record responses phase-locked to spectral components. For simplicity, we shall restrict the term FFR to responses generated in the nervous system, and not include the CM or stimulus artifact, even though these do follow the spectral frequencies of the stimulus.

An important difference between spectral and envelope FFR is that the latter is largely insensitive to stimulus polarity, much like the transient auditory brainstem response (Krishnan, 2002; Small and Stapells, 2005). Spectral FFR can thus be teased apart from the transient response by recording responses to stimuli presented in alternate polarities, and averaging the difference between the responses (Huis in't Veld et al., 1977; Yamada et al., 1977). Other researchers have averaged the sum of responses to stimuli presented in alternate polarities, in order to separate the FFR from the cochlear microphonic (e.g. Cunningham et al., 2001; King et al., 2002), but this manipulation would eliminate (or severely distort) the spectral FFR, preserving only the envelope FFR (Chimento and Schreiner, 1990).

Speech FFRs may be ideal for evaluating the peripheral encoding of speech sounds, since they can be evoked by specific elements of speech (e.g. vowel harmonics; Aiken and Picton, 2006; Krishnan, 2002). FFRs may be evoked by several separate elements of speech. One is the speech fundamental – the rate of vocal fold vibration. The other is the harmonic structure of speech. Voiced speech has energy at the integer multiples of the fundamental frequency, which are selectively enhanced by formants (resonance peaks created by the shape of the vocal tract). Responses to harmonics may thus provide information about the audibility of the formant structure of speech.

1.1. Responses to the fundamental

Frequency-following responses to the speech fundamental frequency should be relatively easy to record, since speech is naturally amplitude modulated at this rate by the opening and closing of the vocal folds. Although the amplitude envelope does not have any spectral energy of its own, energy at the envelope frequency is introduced into the auditory system as a result of rectification during cochlear transduction. In an earlier study (Aiken and Picton, 2006), we recorded responses to the fundamental frequencies of naturally produced vowels with steady or changing fundamental frequencies. We used a Fourier analyzer to measure the energy in each response as the fundamental frequency changed over time (followed a 'trajectory'). When the frequency trajectory of a response can be predicted in advance, the Fourier analyzer can provide an optimal estimate of the response energy along that trajectory. This is in contrast to traditional windowed signal processing techniques (e.g. the short-term fast Fourier transform), which assume that a response does not change its frequency within each window (is 'stationary'). With the Fourier analyzer, significant responses were recorded in all of the subjects, and the average time required to elicit a significant response varied from 13 to 86 s.

Other techniques have also been used to evaluate the fundamental response. Krishnan et al. (2004) recorded frequency-following responses to Mandarin Chinese tones with changing fundamental frequencies, using a short-term autocorrelation algorithm. Dajani et al. (2005) used a filterbank-based algorithm inspired by cochlear physiology to analyze responses to speech segments with changing fundamental frequencies. Both techniques can measure the frequency trajectory well, but neither accurately estimates the response energy if the signal changes frequency within the window used for the filter or the autocorrelation.

1.2. Responses to harmonics

Although responses to the fundamental can be measured quickly and reliably, such responses provide limited information about the audibility of speech in different frequency ranges. Since all energy in voiced speech is amplitude modulated at the fundamental frequency, a response at the fundamental could be mediated by audible speech information at any frequency, and thus any place on the basilar membrane. In order to measure place-specific responses to speech, it might be best to record responses directly to the harmonics of the fundamental frequency.

Using the Fourier analyzer (Aiken and Picton, 2006), we recorded significant responses to the second and third harmonics of vowels with steady and changing fundamental frequencies. We did not measure responses to higher harmonics, due to the limited bandwidth of the electroencephalographic recording (1–300 Hz). Krishnan (2002) recorded wide-band responses to synthetic vowels with formant frequencies below 1500 Hz (i.e. back vowels) using a fast Fourier transform. Since the frequencies in the synthesized stimuli were stationary, the fast Fourier transform would have provided an optimal estimate of the response energy in each frequency bin. Responses were detected at harmonics close to formant peaks, and at several low-frequency harmonics distant from formant peaks, but not at the vowel fundamental frequency. In this study, half of the responses were recorded to a polarity-inverted stimulus, and the final result was derived by subtracting the responses obtained in one polarity from the responses obtained in the opposite polarity.
This subtractive approach (see also Greenberg et al., 1987; Huis in’t Veld et al., 1977) is analogous to the compound histogram technique used in neurophysiologic
studies (Anderson et al., 1971; Goblick and Pfeiffer, 1969). Its rationale stems from the effects of half-wave rectification involved in inner hair cell transduction (Brugge et al., 1969). Discharges only occur during the rarefaction phase of the stimulus. If the polarity of the stimulus is inverted, the discharges to the rarefaction phase of the inverted stimulus now occur during the condensation phase of the initial stimulus. Subtracting the period histogram of this inverted stimulus from the period histogram of the non-inverted stimulus cancels the rectification-related distortion, and the discharge pattern corresponds to the stimulating waveform. Scalp-recorded frequency-following responses reflect the activity of synchronized neuronal discharges, so it is reasonable to apply the compound histogram technique to these data.

This approach shows different results for envelope and spectral FFRs. By subtracting responses to alternate stimulus polarities, the alternate rectified responses to the stimulus are combined to produce non-rectified analogues of stimulus components (inasmuch as the neural system is able to phase-lock to those components). Subtracting responses to alternate polarities thus removes distortions associated with half-wave rectification (e.g. the energy at the envelope) that exist in the neural response. Using the subtractive procedure, Krishnan (2002) found responses at prominent stimulus harmonics, but not at the envelope frequency. In contrast, when this subtractive procedure has not been used, robust responses have been recorded at the envelope frequency (Aiken and Picton, 2006; Krishnan et al., 2004; Greenberg et al., 1987).

An alternate technique that has been used to analyze FFR is to add responses recorded to alternate polarity stimuli (e.g. Johnson et al., 2005; Small and Stapells, 2005). This technique is generally employed to eliminate the cochlear microphonic or residual artifact from the stimulus, and to preserve the envelope FFR. Summing alternate responses cancels the cochlear microphonic and artifact, leaving the envelope FFR. The downside of this approach is that it also cancels the spectral FFR.
1.3. Relationship between harmonics and formants

Formants and formant trajectories carry information that is essential for speech sound identification. The lowest two or three formants convey enough information to identify vowels, and to specify the consonant place of articulation (Liberman et al., 1954). Formants correspond to peaks in the spectral shape, and not to specific harmonics (for a review, see Rosner and Pickering, 1994). The vocal tract can be characterized as a filter that shapes the output from the glottal source (Fant, 1970), with the peaks of the spectrum (i.e. formants) corresponding to the poles of the filter. Voigt et al. (1982) recorded auditory nerve responses to noise-source vowels (i.e. vowels with no harmonics), and found that the Fourier transforms of interval histograms had large frequency components corresponding to the peaks of the formants. However, this temporal encoding would not likely produce measurable responses at the scalp, since the temporal intervals would have occurred at random phases in the absence of a synchronizing stimulus. The frequency-following response requires synchronized neural activity.

FFRs recorded to formant-related harmonics could be used to assess the audibility of formants and formant trajectories. For example, responses recorded at harmonics related to the first and second formant would indicate that the formant peaks had been neurally encoded, and that the information was likely available for the development of the phonemic inventory. Krishnan (2002) recorded frequency-following responses to synthetic vowels with steady frequencies, where the peaks of the first and second formants were not multiples of f0. In this study, responses were found at harmonics related to the first two formants. Plyler and Ananthanarayan (2001) found that the frequency-following response was able to represent second formant transitions in synthetic consonant–vowel pairs. FFRs to harmonics may thus provide useful information about speech encoding.

In the present study, we recorded wideband responses to several vowels in alternate polarities and analyzed both the additive and subtractive averages with the same dataset. We also recorded two responses in the same polarity, so that we could compare the added and subtracted alternate-polarity averages to a constant-polarity average calculated across the same number of responses. We hypothesized that the average response to the constant-polarity stimuli would have energy at prominent stimulus harmonics (i.e. near formant peaks) as well as energy corresponding to the stimulus envelope. We further hypothesized that the harmonic pattern in the stimulus would be displayed in the subtractive average, and that the stimulus envelope would be displayed in the additive average. We hypothesized that we would be able to obtain reliable individual subject responses to the fundamental and stimulus harmonics up to approximately 1500 Hz (the upper limit for recording frequency-following responses; Krishnan, 2002; Moushegian et al., 1973).

2. Materials and methods

2.1. Subjects

Seven women (ages 20–30) and three men (ages 23–30) were recruited internally at the Rotman Research Institute. All subjects were right-handed, and had hearing thresholds that were 15 dB HL or better at octave frequencies from 250 to 8000 Hz. Nine subjects (6f/3m) participated in the main experiment, and smaller numbers of subjects participated in subsidiary experiments to evaluate the recording montage (4f/1m) and to examine masking (2f/1m).

2.2. Stimuli
Two naturally-produced vowels, /a/ (as in 'father') and /i/ (as in 'she'), were recorded for the experiment. The /i/ was chosen because its second formant frequency is higher than that of other vowels (Hillenbrand et al., 1995), occurring where synchronized responses are most difficult to record. A response to harmonics near the second formant of /i/ would suggest that responses could be recorded to the second formants of all the other vowels. The /a/ was chosen because it has a low-frequency second formant, to maximize the chances of recording a second formant response for at least one of the stimuli.

The vowels were recorded from a male speaker in a double-walled sound-attenuating chamber. A Shure KSM-44 large-diaphragm cardioid microphone was placed approximately 3 in. from the mouth, with an analogue low-pass filter (6 dB/octave above 200 Hz) employed to mitigate the proximity effect. The signal was digitized at 32 kHz with 24 bits of resolution using a Sound Devices USBpre™ digitizer, and saved to hard disk using Adobe Audition™.

Two tokens of /a/ and two tokens of /i/ were selected from portions of the recordings where vocal amplitude was steady. Each token was manually trimmed to be close to 1.5 s long, with onsets and offsets placed at zero-crossings spaced by integer multiples of the pitch period. This made each token suitable for continuous play, with no audible discontinuities between successive stimulus iterations. Each token was then resampled at a rate slightly higher or lower than 32 kHz in order to give exactly 48,000 samples over 1.5 s. This resampling introduced a slight pitch shift, but this was less than ±1 Hz. Stimuli were then bandpass filtered between 20 and 6000 Hz with a 1000-point finite impulse response filter having no phase delay.
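These conditioning steps are straightforward to reproduce. The sketch below (Python with NumPy/SciPy; function and variable names are our own, and the zero-crossing trim is simplified relative to the pitch-period-aligned trimming described above) illustrates the trim, resample, and zero-phase bandpass operations:

```python
import numpy as np
from scipy.signal import firwin, filtfilt, resample

FS = 32000          # recording sample rate (Hz)
TARGET_LEN = 48000  # exactly 1.5 s at 32 kHz

def condition_token(token):
    """Trim at zero-crossings, force an exact length, bandpass filter."""
    # Start and end the token at upward zero-crossings so that continuous
    # looping produces no audible discontinuities. (The published tokens
    # were further constrained to whole pitch periods.)
    up = np.where((token[:-1] < 0) & (token[1:] >= 0))[0]
    token = token[up[0]:up[-1]]

    # Resample to exactly 48,000 samples; the implied pitch shift for a
    # ~1.5 s token is well under 1 Hz.
    token = resample(token, TARGET_LEN)

    # 20-6000 Hz FIR bandpass; filtfilt yields a zero-phase result,
    # standing in for the paper's phase-corrected 1000-point FIR.
    taps = firwin(1001, [20, 6000], fs=FS, pass_zero=False)
    return filtfilt(taps, 1.0, token)
```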
Stimuli of reversed polarity were obtained by multiplying the stimulus by −1. Thus, there were in total eight stimuli – two vowels, two tokens and two polarities. These were named a1+, a1−, a2+, a2−, i1+, i1−, i2+, and i2−.

An LPC (linear predictive coding) analysis was conducted in order to determine the formant structure of each vowel. Since formant structure can be estimated more easily after removing the low-pass characteristic of speech, the spectrum of each token (x) was pre-emphasized (or 'whitened') using the following equation:

y[n] = x[n] − a·x[n−1]

where a (0.90 for the /a/ tokens and 0.94 for the /i/ tokens) was calculated by conducting a first-order linear predictive coding (LPC) analysis on each of the tokens (the first-order LPC providing an estimate of spectral tilt).

Fig. 1 (left) shows the spectra of the /a/ and /i/ stimuli, as calculated using the Fourier analyzer (solid line), as well as the spectral shape of the vowels, as calculated via the 34th-order LPC analysis (dotted line). The locations of the formant peaks and the closest harmonics are given in Table 1, and the harmonics closest to the first two formants are indicated on the figure. The relative intensity of each harmonic in the cochlea was estimated by calculating its amplitude in the digital stimulus waveform with the Fourier analyzer, and then modifying this value to take into account the effects of the middle ear transfer function (see Fig. 2, Puria et al., 1997). These middle-ear-compensated spectra are shown in Fig. 1 (right).

Stimulus presentation was controlled by a version of the MASTER software (John and Picton, 2000) modified to present external stimuli. The digital stimuli were converted to analogue form at a rate of 32 kHz, routed through a GSI 16 audiometer, and presented monaurally with an EAR-Tone 3A insert earphone in the right ear. The left ear was occluded with a foam EAR earplug. All stimuli were scaled to produce a level of 60 dBA (RMS) in a 2-cm³ coupler.

2.3. Procedure

The first experiment examined the responses to natural vowels of the same or opposite polarity. Each 1.5-s stimulus was presented continuously for 75 s, corresponding to 50 iterations (with no time delay between successive presentations). This process was repeated 4 times per block, with results averaged offline to provide a single 5-min (200-sweep) average. Each of the /i/ and /a/ tokens was presented twice in the same polarity, and once in the opposite polarity.

The second experiment investigated three possible sources for the responses – electrical artifact, brainstem, and CM. In order to ensure that the responses were not contaminated by electrical artifact, responses were recorded to the first /a/ token routed to an insert earphone, which was not coupled to the ear. Since the subject's ears were occluded during the experiment, this rendered the stimulus inaudible. The transducer of this insert earphone was in the same location as when it was connected to the ear canal. Recording significant responses in this condition would indicate that the recordings made when the earphone was normally coupled to the ear were contaminated by artifact.

We then recorded responses to the first /a/ token between electrodes at the right and left mastoids, in order to increase the sensitivity of the recording to horizontally aligned dipoles (e.g. related to activity in the cochlea, auditory nerve or lower brainstem).

In a third condition, we attempted to determine whether any part of the response could reflect the cochlear microphonic (which may precede neural responses by as little as 3 ms; Huis in't Veld et al., 1977), by recording responses (using the vertical Cz to nape montage) to the first /a/ token in the presence of speech-shaped masking noise. Masking eliminates neural responses without eliminating the cochlear microphonic, so any response recorded in the presence of an effective masker would likely be cochlear microphonic. The minimum effective masking level was determined for each subject by testing whether the /a/ token (a1+) could be detected while speech-shaped noise was being played. The noise was first presented at 50 dB HL, with the level raised by 5 dB after two correct behavioral responses. This process was repeated until the subject could no longer detect the vowel. During the electroencephalographic recording, the noise was presented 5 dB above each subject's minimum effective masking level.
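For reference, the pre-emphasis and LPC formant analysis described in Section 2.2 can be sketched as follows. This is a minimal autocorrelation-method implementation with illustrative names, not the analysis code used in the study; a full formant tracker would also screen candidate poles by bandwidth:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

FS = 32000

def lpc(x, order):
    """Autocorrelation-method LPC coefficients a[1..order]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    return solve_toeplitz((r[:-1], r[:-1]), r[1:])

def formant_peaks(token, order=34):
    """Pre-emphasize with a first-order predictor, then locate LPC poles."""
    # First-order LPC estimates the spectral tilt; y[n] = x[n] - a*x[n-1]
    # whitens the token (a was ~0.90 for /a/ and ~0.94 for /i/).
    a1 = lpc(token, 1)[0]
    y = token[1:] - a1 * token[:-1]

    # Formant peaks correspond to complex roots of the prediction
    # polynomial that lie near the unit circle (vocal-tract poles).
    a = lpc(y, order)
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]        # one root per conjugate pair
    return np.sort(np.angle(roots) * FS / (2 * np.pi))
```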
Fig. 1. Left: Spectra of the /a/ and /i/ vowels, as calculated using the Fourier analyzer (solid line) as well as the spectral shape of the vowels, as calculated via a 34th-order LPC analysis (dotted line). The reference signals of the Fourier analyzer followed the f0 trajectory, and are plotted with respect to the average frequency in each reference trajectory. The harmonics closest to the first and second formants are indicated on the figure. Right: Estimated vowel spectra at the level of the cochlea (i.e. taking the middle ear transfer function into account). See text for details.
Fig. 2. Top left panels (A, B) show two periods of a 200 Hz tone, in opposite polarities. Top right panels (G, H) show a 2 kHz tone, amplitude modulated at 200 Hz, also in opposite polarities. The four centre panels show the expected scalp-recorded FFR for either the 200 Hz tone (C, D) or the 200 Hz envelope (I, J). Nerve fibers are unable to lock to the 2000 Hz carrier of the AM tone but can lock to its 200 Hz envelope, so the right column deals with only envelope FFRs. The average response to stimuli presented in a single polarity (++) contains both spectral and envelope FFRs, and there are responses in both left and right columns (C, I). However, when spectral FFRs to opposite polarities (C, D) are added (+−), the result (E) is small and twice the frequency of the actual responses. Conversely, when spectral FFRs to opposite polarities are subtracted (−−), the result (F) resembles the original spectral component in the stimulus (A).
These experiments were part of an ongoing research project on "evoked response audiometry" which was approved by the Research Ethics Committee of the Baycrest Centre for Geriatric Care.

2.4. Recordings

Electroencephalographic recordings were made while subjects relaxed in a reclining chair in a double-walled sound-attenuating chamber. Subjects were encouraged to sleep during the recording. Responses were recorded between gold disc electrodes at the vertex and the mid-posterior neck for all conditions except the horizontal condition in the second experiment. For this condition, responses were recorded between the right and left mastoids. A ground electrode was placed on the left clavicle. Inter-electrode impedances were maintained below 5 kΩ for all recordings. Responses were preamplified and filtered between 30 Hz and 3 kHz with a Grass LP511 AC amplifier and digitized at 8 kHz by a National Instruments E-Series data acquisition card.

Prior to analysis the recordings were averaged in the following way. For each subject and vowel, three different responses were calculated. A ++ average was obtained by averaging all four responses to the original stimulus (e.g. two presentations each of the a1+ and a2+ tokens). A +− average was obtained by averaging the first two responses to the original stimulus (e.g. one presentation each of the a1+ and a2+ tokens) together with two responses to the inverted stimulus (e.g. the a1− and a2− tokens). A −− average was then obtained by subtracting the two responses to the inverted stimulus from the two responses to the original stimulus. In this nomenclature, the first sign gives the operation and the second sign codes whether the second response is the response to the original or the inverted stimulus; we always start with a response to the original stimulus. Table 2 summarizes these procedures. For each type of average response, grand mean average responses were obtained by averaging the responses of all subjects together.
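The bookkeeping behind these three averages amounts to simple sums and differences. A minimal sketch (illustrative array names; not the actual averaging code) is:

```python
import numpy as np

def polarity_averages(orig, inv):
    """orig, inv: (n_responses, n_samples) arrays of sweep-averaged
    responses to the original and inverted stimuli, with equal numbers
    of each for the +- and -- combinations."""
    n = orig.shape[0] + inv.shape[0]
    avg_pp = orig.mean(axis=0)                         # ++ : original only
    avg_pm = (orig.sum(axis=0) + inv.sum(axis=0)) / n  # +- : add polarities
    avg_mm = (orig.sum(axis=0) - inv.sum(axis=0)) / n  # -- : original minus
    return avg_pp, avg_pm, avg_mm                      #      inverted
```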
Fig. 2 shows a simple model of these procedures – with the spectral FFR on the left and the envelope FFR on the right. The top left panels (A, B) of Fig. 2 show two periods of a 200 Hz tone, in opposite polarities. The top right panels (G, H) show a 2 kHz tone, amplitude modulated at 200 Hz, also in opposite polarities. Inversion of the modulated tone has no effect on the modulation envelope. One might simplistically consider these as the two parts of a vowel sound with a fundamental frequency of 200 Hz and a formant frequency at 2000 Hz that was amplitude modulated at the fundamental frequency. The four centre panels show the expected scalp-recorded FFR for either the 200 Hz tone (C, D) or the 200 Hz envelope (I, J). The model assumes that the nerve fibers are unable to lock to the 2000 Hz carrier of the AM tone but can lock to its 200 Hz envelope. This makes the right column deal with only envelope FFRs.

The combined effect of the hair cell transduction and the synaptic transmission between the hair cell and the afferent nerve fiber effectively rectifies the signal. Action potentials occur with the greatest probability during the rarefaction phase of the stimulus (plotted upward), so the magnitude of the neural population response is proportional to a half-wave rectified version of the stimulus. Synchronized neural activity gives rise to the voltage fluctuations in the FFR, so this similarly resembles the half-wave rectified stimulus. This is analogous to the period histogram technique used to study the temporal structure of neurophysiologic responses (Anderson et al., 1971; Brugge et al., 1969). In the figure, delays associated with stimulus presentation, cochlear transduction, synaptic transmission and neural conduction have been excluded in order to align the modeled responses with the stimulus period.

The different averaging procedures (Table 2) then give three different responses. In our nomenclature, the first sign denotes the procedure (addition or subtraction), and the second sign denotes the polarity of the second set of responses (the first is always the original). The average response to stimuli presented in a single polarity (++) contains both spectral and envelope FFRs, and there are responses in both left and right columns (C, I). However, when spectral FFRs to opposite polarities (C, D) are added (+−), the result (E) is small and twice the frequency of the actual responses. Conversely, when spectral FFRs to opposite polarities are subtracted (−−), the result (F) resembles the original spectral component in the stimulus (A). The subtraction artificially reconstitutes the pre-rectified stimulus components present in the response, thereby removing rectification-related distortion (including the envelope). Spectral FFR is therefore present in all of the averages, but clearest in the −− average. In the ++ and +− averages, it is mixed with envelope FFR and rectification-related distortion (and its frequency is doubled in the +− average).

When envelope FFRs to opposite polarities (I, J) are added (+−), the average (K) is the same as the actual response, since polarity inversion has little effect on envelope modulations (H). For the same reason, subtracting opposite-polarity envelope FFRs (−−) eliminates the envelope FFR in the average (L). Therefore, envelope FFR is present in the ++ and +− averages, but absent in the −− average.
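The predictions of this model are easy to verify numerically. The toy simulation below treats the neural population response as a half-wave rectified copy of the stimulus and forms the +− and −− combinations for a 200 Hz tone and a 200 Hz amplitude-modulated 2 kHz tone. It deliberately ignores the loss of phase-locking above about 1500 Hz, so it illustrates only how the envelope component survives addition and cancels under subtraction; all parameter values are illustrative:

```python
import numpy as np

fs, f0, fc = 32000, 200.0, 2000.0
t = np.arange(int(0.1 * fs)) / fs        # 100 ms of signal

def neural_response(stim):
    # Hair-cell transduction plus synaptic transmission approximate
    # half-wave rectification: discharges follow rarefaction only.
    return np.maximum(stim, 0.0)

def amp_at(x, freq):
    # Amplitude of the component at `freq` (freq falls on an exact FFT bin).
    k = int(round(freq * len(x) / fs))
    return 2 * np.abs(np.fft.rfft(x))[k] / len(x)

tone = np.sin(2 * np.pi * f0 * t)                                   # panels A/B
am = (1 + np.sin(2 * np.pi * f0 * t)) * np.sin(2 * np.pi * fc * t)  # panels G/H

for label, stim in [("200 Hz tone", tone), ("200 Hz AM of 2 kHz", am)]:
    r_plus = neural_response(stim)       # response to original polarity
    r_minus = neural_response(-stim)     # response to inverted polarity
    add = (r_plus + r_minus) / 2         # the "+-" combination
    sub = (r_plus - r_minus) / 2         # the "--" combination
    print(f"{label}: 200 Hz in +- = {amp_at(add, f0):.2f}, "
          f"200 Hz in -- = {amp_at(sub, f0):.2f}")

# Expected: the tone's 200 Hz component survives subtraction (spectral FFR)
# but nearly cancels under addition; the AM tone's 200 Hz envelope component
# appears under addition (envelope FFR) and cancels under subtraction.
```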
2.5. Analysis

2.5.1. Natural vowels

The energy in voiced speech is concentrated at the fundamental frequency (f0) – equal to the rate of vocal fold vibration – and at its harmonics, which are integer multiples of f0. The harmonics are labeled with a subscript that corresponds to the harmonic number: for example, when f0 is 100 Hz, f2 is 200 Hz. When it is present, f1 is equal to f0, although f0 can be perceived in the absence of any actual energy at f1 (a phenomenon known as the "missing fundamental"). The fundamental frequency and harmonics of natural speech vary across time. The rate of f0 variation in a steady naturally-produced vowel can be as high as 50 Hz/s (see Fig. 5c in Aiken and Picton, 2006). The response to the speech f0 precisely mirrors its frequency changes (Aiken and Picton, 2006; Krishnan et al., 2004), so responses to natural speech cannot be accurately analyzed with techniques that require a stationary signal. The stimuli and responses were therefore analyzed using a Fourier analyzer.

Unlike the fast Fourier transform (FFT), which calculates energy in static frequency bins, a Fourier analyzer calculates energy in relation to a set of reference signals, which need not be static. Fig. 3 shows the spectrum of the first /a/ stimulus as calculated using the FFT and as calculated using the Fourier analyzer. Both analyses were conducted with a resolution of 2 Hz, but the reference signals of the Fourier analyzer were constructed to follow the f0 trajectory of the speech. For the Fourier analyzer, data are plotted relative to the mean frequency in each reference trajectory. Note that harmonic amplitudes were much greater when the analysis was conducted with the Fourier analyzer, indicating that the FFT underestimated these amplitudes.

The Fourier analyzer was used to quantify the amplitude of the response along the trajectory of f0 and 23 of its harmonics (f2–f24). The same analyzer was used to quantify response amplitude along 16 frequency trajectories adjacent to each of the harmonics (i.e. 8 above and 8 below). Each trajectory was separated by 2 Hz, so the highest and lowest trajectories were 16 Hz above and below each harmonic, respectively. The 16 adjacent trajectories were used to quantify non-stimulus-locked electrophysiologic activity, considered to be electrophysiologic 'noise' for the purpose of statistical testing.

Reference frequency tracks at each trajectory were created in the following manner. Since the first harmonic was present in all of the vowel tokens, it was used to create the f0 reference track. Visual inspection of the stimulus spectrum indicated that the first harmonic was slightly higher than 100 Hz, so f1 was isolated by filtering the response between 50 and 200 Hz (with a 1000-point phase-corrected finite impulse response filter). The Hilbert transform provided the complex representation of f1 and the four-quadrant inverse tangent provided its instantaneous phase. The instantaneous frequency of f1 could then be calculated by finding the derivative of the instantaneous phase with respect to time. This frequency track was smoothed to remove any sharp changes introduced by the process of approximate differentiation, using a 50 ms boxcar moving average applied 3 times. Frequency tracks at each higher harmonic fi were created by multiplying the f0 frequency track by each integer between 2 and 24. Adjacent frequency tracks were created by transposing each track by the appropriate number of Hz.

A Fourier analyzer computes the amplitude of a signal along a particular frequency trajectory by scalar multiplication of the signal with a set of reference sinusoids (i.e. sine and cosine projections) that represent the trajectory.
Fig. 3. Spectra of the first /a/ token calculated with a fast Fourier transform (right) and with a Fourier analyzer (left). The resolution of each analysis was 2 Hz, but the reference signals of the Fourier analyzer were constructed to follow the f0 trajectory of the vowel. Fourier analyzer data are plotted with respect to the average frequency of each reference trajectory.
Reference sinusoids were created for each trajectory by calculating the sine and cosine of the instantaneous phase angle of the corresponding frequency track (i.e. the cumulative sum of the starting phase and the derivative of the instantaneous frequency). This produced pairs of orthogonal sinusoids with unity amplitude. Responses were resampled to 32 kHz (4 times the data acquisition rate) prior to multiplication with the reference sinusoids. The products were then integrated over the length of the sweep (1.5 s), producing two values (x and y). The amplitude (a) and phase (θ) of the response along each trajectory were then calculated by finding the vector magnitude and phase, using the following equations:

a = √(x² + y²)

θ = tan⁻¹(y/x)

Since the analysis requires scalar multiplication of the response with each of the reference signals, the result is sensitive to the temporal alignment of the stimulus and response. If there is a lag between the stimulus (the basis for the reference sinusoids) and the response, the trajectories will not be temporally aligned. A lag that is smaller than the period of the reference frequency will merely shift the phase of the measured response, but a greater lag may result in an underestimation of the response amplitude. This problem can be circumvented by delaying the reference sinusoids by an amount equal to the delay in the response, prior to the multiplication. Evoked potentials are delayed by the time required to transduce the stimulus, as well as the time required for activation to reach the place where the response is generated. We estimated the delay of the response to be approximately 10 ms (see Table 1 in Picton et al., 2003), and delayed the reference sinusoids by 10 ms prior to multiplication.

The significance of the response at each harmonic was evaluated by comparing the power of the response along the harmonic's trajectory with the power of the response along adjacent trajectories, using an F statistic (Zurek, 1992; Dobie and Wilson, 1996; Lins et al., 1996). An alpha criterion of 0.05 was selected for all analyses. A Bonferroni correction was applied to account for the 24 significance tests (1 per harmonic) involved in each analysis. Thus the F statistic was accepted as significant for the grand mean recordings at p < 0.0021. An additional Bonferroni correction (further dividing the alpha criterion by the number of subjects) was applied when individual subject responses were tested for significance.

We estimated the relative intensity of each harmonic in the cochlea by calculating its amplitude in the digital stimulus waveform with the Fourier analyzer, and then modifying this value to take into account the effects of the middle ear transfer function (the frequency response function of the ER-3A insert earphone in the outer ear is relatively flat – within ±4 dB – from 100 to 1500 Hz). The middle ear provides a gain of approximately 20 dB between 500 and 2000 Hz, but only about 10 dB at 300–400 Hz, 7 dB at 200 Hz and 2 dB at 100 Hz (see Fig. 2, Puria et al., 1997).
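For concreteness, the sketch below implements the analysis chain just described: an f0 track from the filtered first harmonic via the Hilbert transform, reference sinusoids delayed by 10 ms, projection onto each trajectory, and an F test against the 16 neighbouring trajectories. The names, the sampling-rate handling, and the F-test degrees of freedom (2 and 32, following the approach of Lins et al., 1996) are our assumptions about a reasonable implementation, not the authors' MASTER code:

```python
import numpy as np
from scipy.signal import firwin, filtfilt, hilbert
from scipy.stats import f as f_dist

FS = 8000  # acquisition rate; the paper resampled to 32 kHz before projection

def f0_track(waveform, fs):
    """Smoothed instantaneous frequency of the first harmonic (f1 ~ 100 Hz)."""
    taps = firwin(1001, [50, 200], fs=fs, pass_zero=False)   # isolate f1
    analytic = hilbert(filtfilt(taps, 1.0, waveform))
    inst_f = np.diff(np.unwrap(np.angle(analytic))) * fs / (2 * np.pi)
    inst_f = np.append(inst_f, inst_f[-1])
    box = np.ones(int(0.05 * fs)) / int(0.05 * fs)           # 50 ms boxcar
    for _ in range(3):                                       # applied 3 times
        inst_f = np.convolve(inst_f, box, mode="same")
    return inst_f

def fa_amplitude(resp, track, fs, delay_s=0.010):
    """Project resp onto sine/cosine references along one trajectory,
    with the references delayed by ~10 ms to match response latency."""
    phase = 2 * np.pi * np.cumsum(track) / fs                # instantaneous phase
    ref = np.exp(1j * phase)                                 # cos + j*sin pair
    d = int(delay_s * fs)
    ref = np.concatenate((np.zeros(d, dtype=complex), ref[:-d]))
    z = 2 * np.mean(resp * np.conj(ref))                     # x + j*y
    return np.abs(z), np.angle(z)                            # a, theta

def harmonic_f_test(resp, f0_trk, harmonic, fs, n_side=8, df_hz=2.0):
    """F ratio of power at the harmonic against 16 adjacent trajectories."""
    amp, _ = fa_amplitude(resp, harmonic * f0_trk, fs)
    offsets = [k * df_hz for k in range(-n_side, n_side + 1) if k != 0]
    noise_power = [fa_amplitude(resp, harmonic * f0_trk + o, fs)[0] ** 2
                   for o in offsets]
    F = amp ** 2 / np.mean(noise_power)
    return F, f_dist.sf(F, 2, 2 * len(offsets))              # p from F(2, 32)
```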
Fig. 4. Grand (vector) average responses to /a/. For the top three panels, black lines indicate the response amplitude at the fundamental or harmonic. Grey bars indicate the average amplitude at the 16 adjacent frequencies. Significant responses are marked with an asterisk. The top panel shows the ++ average, which was created by averaging responses to all four presentations of the /a/ stimulus (in the same polarity). The second panel shows the +− average, which was created by averaging responses to two presentations of the /a/ with responses to two presentations of the inverted-polarity /a/. The third panel shows the −− average, which was created by subtracting responses to two presentations of the inverted-polarity /a/ from responses to two presentations of the /a/. The stimulus is displayed in the lowest panel for reference (modified by the middle ear transfer function).
After taking this transfer function into account, the three most intense harmonics for /a/ were f8, f9, and f13, and the three least intense harmonics were f17, f18, and f20. The three least intense harmonics below 1500 Hz (the approximate frequency limit of FFR) were f0, f3, and f4. The most intense harmonics for /i/ were f2, f3, and f4, and the three least intense harmonics were f10, f12, and f15. Below 1500 Hz, the three least intense harmonics were f9, f10, and f12.

We expected ++ and +− responses to have envelope FFRs at f0. Responses at f0 were compared to the responses at other harmonics (in terms of amplitude and in terms of incidence of significant responses). For the /a/, the first harmonic was estimated to be one of the three least intense harmonics, so a response at f0 would not likely be a spectral FFR. Spectral FFRs would be present in the ++ and −− responses. This was tested by comparing measurements at the most and least intense harmonics.

A further goal of the study was to determine whether reliable responses could be detected (with the Fourier analyzer) at higher frequencies (up to approximately 1500 Hz) and in individual subjects. We therefore calculated the percentage of individual subject responses that were significant in each condition.

3. Results

3.1. Experiment 1: natural vowels

3.1.1. Responses to /a/

Fig. 4 shows the amplitude of the grand average response (coherently averaged across all subjects) to /a/. Although we were able to record up to 3000 Hz, there were no significant responses in the grand average or individual subject waveforms above f14 (1493 Hz). For simplicity, responses are only shown up to 2000 Hz. Response amplitudes at speech harmonics are represented by the narrow black bars, and mean response amplitudes at adjacent frequencies are represented by the wide gray bars. The stimulus spectrum (as calculated with the Fourier analyzer) is displayed in the bottom panel. An asterisk indicates those responses that were significant at p < 0.0021 (i.e. alpha of 0.05 with Bonferroni correction for 24 significance tests).

The three most intense harmonics in the cochlea were f8, f9 (the harmonics closest to the first formant) and f13 (the harmonic closest to the second formant). Significant responses were detected at two of these harmonics in the ++ average (f8, f9), and at all three of these harmonics in the −− average. The presence of these responses in the −− average indicates that they were spectral FFRs. In the −− average, significant responses were also recorded at other stimulus harmonics: f2, f5, f6, f7, and f14.

The three least intense harmonics in the cochlea were f17, f18, and f20 (or f0, f3, and f4 below 1500 Hz). There were no significant responses to f17, f18, or f20 in any of the averages. There were no significant responses to f0, f3 and f4 in the −− average, although there were significant responses to these harmonics in the ++ and +− averages, and these responses were higher in amplitude than most other responses.

The top panels of Fig. 5 show the percentage of individual subject responses that were significant at the fundamental and at the harmonic nearest to the first formant of /a/. There were no significant individual subject responses to the harmonics near the second formant (f13 and f14), even though these were significant in the grand mean average over all subjects (in the −− average response). All subjects displayed significant responses to the fundamental frequency, in both the ++ and +− averages, but only one displayed significant responses in the −− average. At the harmonic nearest to the first formant (f9), most of the subjects displayed significant responses in the ++ and −− averages, but none displayed significant responses in the +− average.

3.1.2. Responses to /i/

Fig. 6 shows the grand average responses to /i/. There were no significant responses in the grand average or individual subject waveforms above f9 (1098 Hz), so responses are only shown up to 2000 Hz. Data are presented as in Fig. 4.

The three most intense harmonics were f2 (the harmonic closest to the first formant), f3 and f4. Significant responses were detected at all of these harmonics in all averages, although response amplitude at f0 (the envelope) was greatest in the ++ and +− averages, and the response amplitude at f2 (the harmonic nearest the first formant peak) was greatest in the −− condition.
Fig. 5. Percentage of individual subject responses that were significant at the fundamental frequency (left) and at the harmonic nearest to the first formant (right). The top row corresponds to the /a/ stimulus, and the bottom row corresponds to the /i/ stimulus. Significance was determined by comparing the power of the response at the fundamental or harmonic with response power in adjacent frequency bins, using an F statistic. A Bonferroni correction was applied for repeated testing at 24 harmonics and 9 subjects (2160 significance tests).
Fig. 6. Grand (vector) average responses to /i/. Data are presented as in Fig. 4.
The three least intense harmonics were f10, f12, and f15 (or f9, f10, and f12 below 1500 Hz). No significant responses were recorded at these harmonics in any of the conditions, with the exception of a significant response to f9 in the ++ average.

The bottom panels of Fig. 5 show the percentage of individual subject responses that were significant at the fundamental and at the harmonic nearest to the first formant of /i/. All subjects displayed significant responses to the fundamental frequency, in both the ++ and +− averages, and half of the subjects displayed significant responses in the −− average. At the harmonic nearest to the first formant (f2), all of the subjects displayed significant responses in the ++ and −− averages, and most displayed significant responses in the +− average.

3.2. Experiment 2: investigating the sources of the harmonic responses

Fig. 7 shows the results from the three parts of Experiment 2. All of the averages in this experiment were ++ averages. When the stimulus was routed to the left earphone, which was not inserted into the ear, there were no significant responses (top line, "artifact"). This was true for the grand mean average as well as for the individual subjects. When responses were recorded between the right and left mastoids (see Fig. 7, second panel), the grand average response showed significant responses at f5 and f9. Only two individual subject responses were significant, and both occurred at f9.
When the stimulus was effectively masked using speech-shaped masking noise (see Fig. 7, third panel), only one significant response (at f9) was detected in the grand average, but none of the individual subject responses were significant.
4. Discussion

4.1. Envelope FFR and spectral FFR

The present study investigated the human FFR to naturally-produced vowels, and related these responses to the amplitude envelope (at the fundamental frequency) and to the formant frequencies in the vowel spectrum (at harmonics of the fundamental frequency, themselves modulated in amplitude at the frequency of the glottal waveform). By adding or subtracting responses to opposite-polarity stimuli, response components related to the envelope ("envelope FFR") can be distinguished from components related to the spectrum ("spectral FFR").

The results of the present study can be interpreted in light of the model presented in Fig. 2. Since only spectral FFR is present in the −− average, one would expect that this average would resemble the stimulus spectrum most closely, subject to the upper frequency limit for neural phase-locking. For the /a/, the harmonics that should have been most intense in the cochlea were f8, f9, and f13 (the harmonics closest to the formant peaks). In the −− average, significant responses were recorded at all of these harmonics. The harmonics that should have been least intense in the cochlea (below 1500 Hz) were f0, f3, and f4. No significant responses were detected at these harmonics in the −− average, even though all three were significant in the ++ and +− averages.
Fig. 7. Grand (vector) average responses to /a/, in Experiment 2. Data are presented as in Fig. 4. Responses in the top panel were collected with the stimulus routed to an earphone that was not inserted in the ear canal. Responses in the second panel were collected using a horizontal electrode montage (non-inverting electrode on ipsilateral mastoid; inverting electrode on contralateral mastoid). Responses in the third panel were collected in the presence of an effective speech-shaped masker (with the standard vertical electrode montage used elsewhere in the study).
For the /i/, the harmonics that should have been most intense in the cochlea were f2, f3, and f4. Significant responses were detected at these frequencies in the ++ and +− averages, but only at f2 and f3 in the −− average. Also, the largest response in the −− average occurred at f2 (the harmonic closest to the first formant peak), whereas the largest response in the ++ and +− averages occurred at f0 (the frequency of the glottal pitch envelope). Thus, the spectral FFR peaked near the first formant, while the envelope FFR peaked at the fundamental frequency.

No significant responses were recognized above 1493 Hz (in any of the averages). This indicates that the central nervous system does not follow frequencies higher than this limit in a time-locked way. Clearly the nervous system responds to these higher frequencies – but this is likely accomplished with a rate-place code rather than a synchronized temporal code (Rhode and Greenberg, 1994). A 1300 Hz limit for following temporal signals is also found in studies of sound localization wherein the timing of signals is compared between the ears (Zwislocki and Feldman, 1956). Interestingly, such localization is dependent on the carrier timing and independent of envelope timing (Schiano et al., 1986).

The response at f0 is mainly an envelope FFR. This was most clearly shown for the /a/, since it was present in the ++ and +− averages, but not in the −− average. For the /i/ it was present in all of the averages. However, the first formant was very close to f0, and there may have been a combination of an envelope FFR to the glottal pitch envelope and a spectral FFR to harmonics close to the formant peak.
The responses to harmonics near formant peaks are best examined in the /a/ response, since the first two formants were separate from the fundamental, but within the frequency range of observable FFR components. The response at the harmonic closest to the first formant was significant in all of the averages, and thus might have been a combination of envelope and spectral FFR, although it also could have been entirely spectral FFR (since spectral FFR may be present in the +− average – see Fig. 2, panel E). A significant response was detected at the harmonic closest to the second formant peak in the ++ and −− averages, but not in the +− average, indicating that this was a spectral FFR.

Two phenomena are not explained by the concept that f0 is detected by an envelope FFR and that formant harmonics are detected by spectral FFRs:

1. The occurrence of large envelope FFRs at the harmonics f2–f7 (in the ++ and +− averages for both /a/ and /i/). These responses could be introduced by non-linearities in the auditory nervous system beyond the auditory nerve, or by harmonic distortion resulting from the rectification of the envelope within the cochlea. Although the model in Fig. 2 presents the envelope FFR as a DC-shifted sinusoid (I, J), this is only because of the sinusoidal nature of the modeled stimulus envelopes (G, H). The actual envelope of the glottal waveform is triangular or sawtooth-shaped, with the glottis closing more quickly than it opens (Holmberg et al., 1988). This sawtooth characteristic is responsible for the rich harmonic structure of the vocal source, and this broadly harmonic signal is then accentuated by the resonance characteristics of the vocal tract to give the characteristic formant spectrum of a vowel. All of the harmonics present in the sound can give rise to spectral FFRs. Each harmonic varies in amplitude over time according to the glottal waveform. Envelope FFRs relate to the energy introduced by the rectification of this modulating envelope – mainly at its fundamental frequency, but probably also at its harmonics (cf. Cebulla et al., 2006). Unlike the acoustic harmonics present in the stimulus, these harmonics are produced by the rectification of the envelope during inner hair cell transduction, and are thus not spectrally shaped by the vocal tract. They relate only to the energy in the stimulus envelope.

2. The occurrence of spectral FFRs to the /a/ at f5 and f6, where the energy in the stimulus was low. This is probably related to distortion products in the cochlea. The cochlear nonlinearity produces multiple intermodulation distortion products in response to pairs of tones, including prominent components at the cubic difference frequency 2fa − fb, and at the difference frequency fb − fa, where fa and fb can refer to any pair of frequencies (fa < fb). These are called distortion product otoacoustic emissions when acoustically measured with a sensitive microphone in the ear canal, but they can also be electrically detected in the scalp-recorded brainstem responses (Chertoff et al., 1992; Krishnan, 2002; Pandya and Krishnan, 2004; Rickman et al., 1991). Because they are not related to half-wave rectification, they are not removed by subtracting responses to alternate polarities, and they are present in the −− average (Krishnan, 2002). The response at f5 could have followed the 2f9 − f13 cubic distortion product and/or intermodulation distortion at the f14 − f9 and f13 − f8 difference frequencies; since each harmonic is an integer multiple of f0, all of these fall at (18 − 13)f0 = (14 − 9)f0 = (13 − 8)f0 = 5f0, i.e. at f5. Similarly, the response at f6 could have followed intermodulation distortion at the f14 − f8 difference frequency (6f0). These were all prominent harmonics near formant peaks (at which spectral FFRs were detected), so we would expect these harmonics to give rise to cochlear distortion products.

4.2. Contributions from the cochlear microphonic

The second experiment was conducted to investigate the possibility of non-neurogenic (e.g. artifactual or cochlear) contributions to the measured response. No significant responses were detected when the stimulus was uncoupled from the ear while the electrical components (i.e. the EAR-3A transducer) were in place, indicating that the responses could not be attributed to electrical artifact. However, a significant response at the harmonic closest to the first formant of the /a/ was recorded with a horizontal electrode montage. This suggests a contribution of the CM to the recorded FFRs, although it could also have been earlier neurogenic activity (e.g. from the auditory nerve). The scalp-recorded FFR likely represents a combination of fields generated in the upper brainstem near the inferior colliculus and in the cochlea (Sohmer et al., 1977). The cochlear response is best recorded from electrodes near the cochlea, such as on the mastoid, whereas the brainstem generator is best recorded with a vertically oriented electrode montage (Galbraith et al., 2000).

The results with masking are more definite. The masking will prevent any synchronous activation of neurons by the stimulus and therefore remove any neurally dependent response. However, masking does not affect the CM, which readily reproduces the acoustic signal in the noise.
4.2. Contributions from the cochlear microphonic

The second experiment was conducted to investigate the possibility of non-neurogenic (e.g. artifactual or cochlear) contributions to the measured response. No significant responses were detected when the stimulus was uncoupled from the ear while the electrical components (i.e. the EAR-3A transducer) were left in place, indicating that the responses could not be attributed to electrical artifact. However, a significant response at the harmonic closest to the first formant of the /a/ was recorded with a horizontal electrode montage. This suggests a contribution of the CM to the recorded FFRs, although it could also have been earlier neurogenic activity (e.g. from the auditory nerve). The scalp-recorded FFR likely represents a combination of fields generated in the upper brainstem near the inferior colliculus and in the cochlea (Sohmer et al., 1977). The cochlear response is best recorded from electrodes near the cochlea, such as on the mastoid, whereas the brainstem generator is best recorded with a vertically oriented electrode montage (Galbraith et al., 2000).

The results with masking are more definite. Masking prevents any synchronous activation of neurons by the stimulus and therefore removes any neurally dependent response. However, masking does not affect the CM, which readily reproduces the acoustic signal in the noise. Averaging attenuates the microphonic response to the masking noise (which is random from trial to trial) and leaves the microphonic response to the stimulus. Various masking techniques have been proposed to distinguish the CM from the brainstem FFR (e.g. Chimento and Schreiner, 1990). Our procedure simply demonstrated that a significant component of our response was immune to noise masking and was therefore likely CM.
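The logic of this control is easy to demonstrate numerically. The sketch below is a toy simulation (all amplitudes, frequencies, and trial counts are hypothetical): averaging suppresses masking noise that is random from trial to trial, while a stimulus-locked microphonic survives.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, dur, n_trials = 8000, 0.1, 1000
t = np.arange(0, dur, 1/fs)

cm = 0.1 * np.sin(2*np.pi*700*t)                 # stimulus-locked "microphonic"
noise = rng.normal(0, 1.0, (n_trials, t.size))   # masking noise, new each trial
trials = cm + noise

avg = trials.mean(axis=0)
# residual noise shrinks roughly as 1/sqrt(n_trials), so the average
# converges on the stimulus-locked microphonic
print("rms of a single trial:", trials[0].std())  # ~1.0 (noise dominates)
print("rms of the average   :", avg.std())        # ~0.08 (close to the CM rms)
```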
A portion of what is recorded from the scalp as spectral FFR therefore originates as CM. Due to synaptic delays, CM activity occurs earlier than neural activity. At a constant frequency, these timing differences amount to differences in phase, and the contributions of each source are difficult to distinguish. However, with a changing-frequency stimulus and a Fourier analyzer, the maximum response amplitude should occur when the stimulus and response are temporally aligned. Thus, by repeating the analysis at different response lags, the CM component should be distinguishable from the neural component.

In the present study, the spectral content of the vowels was not variable enough to permit this type of analysis. The average rate of change of instantaneous frequency was about 12 Hz/s for the /a/ tokens and 13 Hz/s for the /i/ tokens. If this change occurred consistently in one direction, a 10 ms mismatch between the stimulus and response would result in an average frequency mismatch of only 0.12 Hz. Since the resolution of the Fourier analyzer was 0.67 Hz (the reciprocal of the integration time), the response would still fall within the same analysis bin, and the amplitude of the measured response would be unaffected. With such a small rate of change, the stimulus–response timing mismatch would have to be at least 55 ms for the average deviation between the stimulus and the response to be large enough to reduce the amplitude of the measured response. Moreover, the frequency changes were small fluctuations in the rate of vocal fold vibration that ultimately remained relatively steady, so any timing mismatch between the stimulus and response would have to be even greater than this to affect the measurement. In future studies it would be helpful to use stimuli with rapid frequency changes (e.g. with exaggerated intonation), since these would facilitate a precise analysis of the temporal delay of the response. The CM may precede the spectral FFR by as little as 3 ms (Huis in’t Veld et al., 1977). A stimulus with a 333 Hz/s frequency change would increase or decrease in frequency by 1 Hz in 3 ms. With a 0.67 Hz analysis resolution (1.5 s integration time), it would then be possible to tease these contributions apart; a short computation below makes this arithmetic concrete. The separation of neural FFR from CM could be further aided by the use of a near-field (i.e. meatal or middle ear) recording electrode, which would make it possible to determine the precise timing (and phase) of the CM.

One interesting aspect of our responses is that the CM that we recorded as part of the ++ or −− response was not recognizable at frequencies greater than 1500 Hz. The cochlear microphonic is generated by hair cells in all regions of the cochlea, and one would therefore not expect it to be limited by the frequency postulated as the limit of neural synchronization. One might therefore consider that part of what we are terming the CM is actually related to synchronous firing in the afferent auditory neurons (cf. Coats et al., 1979). Since this neurophonic would not survive the masking manipulation, however, we would still have to find an additional reason for the absence of CM responses at frequencies greater than 1500 Hz. The absent CM above 1500 Hz can perhaps be explained on the basis of the stimulus energy being greatest near the first formant. If the speech energy had been more evenly distributed across frequencies, CM might have been recognized at higher frequencies.
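To make the lag arithmetic concrete, the following sketch simulates the proposed analysis: a stimulus gliding at 333 Hz/s, a response delayed by a hypothetical 3 ms neural latency, and a Fourier-analyzer-style probe evaluated at several assumed lags over a 1.5 s integration time. The fa_amplitude helper is our own construction for illustration, not the analyzer used in this study.

```python
import numpy as np

fs, dur = 16000, 1.5                 # 1.5 s integration -> ~0.67 Hz resolution
t = np.arange(0, dur, 1/fs)
f_start, glide = 500.0, 333.0        # glide rate in Hz/s, as in the text

def phase(tt):
    """Instantaneous phase of the frequency glide."""
    return 2*np.pi*(f_start*tt + 0.5*glide*tt**2)

resp = np.sin(phase(t - 0.003))      # response delayed by a hypothetical 3 ms

def fa_amplitude(x, assumed_lag):
    """Correlate x with the stimulus phase shifted by an assumed lag."""
    ref = np.exp(-1j * phase(t - assumed_lag))
    return 2*np.abs(np.mean(x * ref))

for lag_ms in (0, 3, 6, 12):
    print(lag_ms, "ms:", round(fa_amplitude(resp, lag_ms/1000), 2))
# the measured amplitude peaks when the assumed lag matches the actual delay,
# so a short-latency CM and a longer-latency neural FFR would peak at
# different analysis lags
```

With these assumptions, a 3 ms lag mismatch produces a 1 Hz frequency mismatch, larger than the 0.67 Hz analysis resolution, so the measured amplitude drops sharply away from the true delay.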
4.3. Stimulus–response relationships

With a harmonic stimulus like speech, stimulus components and related distortions may overlap at harmonic frequencies. The glottis produces a sawtooth-shaped pulse, introducing acoustic energy at integer multiples of the glottal pulse rate (f0). This is registered in the cochlea, where the cochlear nonlinearity produces harmonic and intermodulation distortion components, all of which occur at integer multiples of f0. The rectification involved in the transduction of stimulus energy at speech harmonics (and cochlear distortion products) produces neural harmonic and intermodulation distortion products, all of which also occur at integer multiples of f0.
The rectification also introduces energy at the fundamental frequency of the amplitude envelope, which overlaps with energy at f0. Finally, harmonic distortion resulting from rectification of the envelope produces energy at integer multiples of f0. The neural response occurring at a particular harmonic frequency might thus relate to stimulus energy at that harmonic, to a cochlear distortion product related to stimulus energy at other harmonics, to a rectification-related distortion of stimulus components at other harmonics, to the stimulus envelope, or to a rectification-related distortion of the stimulus envelope. It might even relate to a distortion introduced in the transduction of the signal (e.g. an earphone non-linearity), to electrical artifact, or to the generation of the CM. This precludes any simple interpretation of the speech-evoked FFR: the response at a given speech harmonic may not indicate audibility of that harmonic.

Using both the +− and −− averages (Table 2), it is possible to tease apart some of the possible contributions to the scalp-recorded response. The −− average eliminates the envelope FFR. The +− average eliminates the CM and preserves the envelope FFR. However, we have seen that some distortion products of the spectral FFR may show up in the +− average. Interestingly, one of the early studies of the tone-evoked FFR demonstrated a neurally generated FFR (as distinct from the CM) by using a +− average, which showed a response at twice the frequency of the tone (Sohmer et al., 1977). Since frequency components in the +− average could relate to distorted spectral FFR or to envelope FFR, this derivation is not particularly helpful in distinguishing the two types of FFR. A simpler way to distinguish spectral and envelope FFR would be to compare the ++ average with the −− average: spectral FFR components should be present in both, while envelope FFR components should only be present in the former.

However, the +− average is unique in that it eliminates both the CM and the stimulus artifact. The +− average would thus be preferred for establishing that a neurogenic response has occurred. Recent work with the ABR (Johnson et al., 2007; King et al., 2002; Russo et al., 2004) has used a +− average to evaluate the transient FFR to brief speech stimuli. Stimulus artifact and CM are eliminated by averaging together responses to stimuli of opposite polarity, leaving an FFR that follows the fundamental frequency of the vowel. In our terminology this would be an envelope FFR. The downside of the +− average is that it eliminates most spectral FFR in addition to the CM. For instance, +− average responses (to the /a/) were not detected at prominent stimulus harmonics or at their second harmonics, apart from the significant response at f9. If the goal of the technique is to assess the neural encoding of particular speech features (e.g. harmonics), it would be best to use the −− average and control for the CM. The sketch below illustrates what survives in each of these derived averages.
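A minimal numerical sketch of this bookkeeping (not the analysis pipeline used in this study; the stimulus, sampling rate, and hair-cell model are all simplified assumptions): a two-harmonic "vowel" is half-wave rectified, and the responses to opposite polarities are then added and subtracted.

```python
import numpy as np

fs = 32000
t = np.arange(0, 1.0, 1/fs)
f0 = 100.0                              # hypothetical glottal fundamental (Hz)

# toy "vowel": two adjacent harmonics, whose beat recreates f0 in the envelope
x = np.sin(2*np.pi*8*f0*t) + 0.5*np.sin(2*np.pi*9*f0*t)

def transduce(s):
    """Crude inner-hair-cell model: half-wave rectification."""
    return np.maximum(s, 0.0)

resp_pos = transduce(x)                 # response to original polarity
resp_neg = transduce(-x)                # response to inverted polarity

env_avg  = (resp_pos + resp_neg) / 2    # like the +- average: spectral terms cancel
spec_avg = (resp_pos - resp_neg) / 2    # like the -- average: envelope terms cancel

def amp_at(s, freq):
    """Fourier amplitude of s at a single frequency."""
    return 2*np.abs(np.mean(s * np.exp(-2j*np.pi*freq*t)))

print("sum  @ f0  :", round(amp_at(env_avg, f0), 3))     # envelope energy at f0
print("diff @ f0  :", round(amp_at(spec_avg, f0), 3))    # ~0
print("sum  @ 8f0 :", round(amp_at(env_avg, 8*f0), 3))   # ~0
print("diff @ 8f0 :", round(amp_at(spec_avg, 8*f0), 3))  # stimulus harmonic survives
```

Note that resp_pos + resp_neg equals the full-wave rectified stimulus |x|, which is why the envelope-related components it creates are not spectrally shaped by the vocal tract.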
Cochlear microphonic can be identified by recording noise-masked responses: masking eliminates the FFR but not the cochlear microphonic. With a stimulus with rapid frequency changes, the cochlear microphonic and the neural FFR could also be distinguished by varying the stimulus–response delay in the Fourier analyzer. The short-latency cochlear microphonic would be expected to be present in the noise-masked recording, whereas the longer-latency spectral FFR would be eliminated by the masker. It should therefore be possible to separate the FFR from the cochlear microphonic, and to verify the effectiveness of this separation with masking. A simpler technique might be to subtract the noise-masked response from the −− average (Chimento and Schreiner, 1990).
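In idealized form this subtraction is straightforward. The sketch below uses hypothetical amplitudes and a noiseless recording, so it only illustrates the algebra; the practical procedure is described by Chimento and Schreiner (1990).

```python
import numpy as np

fs, dur = 8000, 0.5
t = np.arange(0, dur, 1/fs)

cm  = 0.10 * np.sin(2*np.pi*700*t)        # cochlear microphonic: survives masking
ffr = 0.05 * np.sin(2*np.pi*700*t - 1.0)  # neural spectral FFR: abolished by masking

unmasked = cm + ffr    # idealized unmasked (spectral) average
masked   = cm          # masked average: neural response removed, CM remains

neural = unmasked - masked                # the subtraction isolates the FFR
print(np.allclose(neural, ffr))           # True in this noiseless sketch
```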
4.4. Clinical implications

The speech-evoked FFR might be useful for the validation of hearing aid fittings in infants, by providing information about the neural encoding of speech. However, the relationship between the FFR and harmonic speech components is complicated by the spectral overlap of stimulus harmonics, distortion products, and the stimulus envelope.

The +− average provides a way to ensure that reliable speech-related information has reached the nervous system. However, it may not indicate that the nervous system is receiving sufficient information to discriminate between different vowels. The envelope FFR shows that energy is being modulated at the fundamental frequency of the vowel, but it does not indicate the frequencies of the modulated energy. Deriving envelope FFRs with high-pass noise masking techniques may be a way to determine which frequency regions contribute to the envelope FFR.

The spectral FFR obtained in the −− average may be the most useful measure clinically, since it can be used to assess the audibility of formants via the related harmonics. Future studies should incorporate techniques to eliminate the potential for cochlear microphonic contamination, so that these responses can be unequivocally related to neural activity. These responses should also be related to behavioral measures of speech understanding. The best approach for evaluating the neural encoding of speech has yet to be determined, but it will likely involve the measurement of brainstem FFRs along with other speech-evoked activity from the cortex. Future studies should focus on ways to relate recorded responses to the neural encoding of important speech components (e.g. formants), and to relate these responses to speech understanding.

Acknowledgments

This study was supported by a grant from the Canadian Institutes of Health Research and by funds donated by James Knowles. Patricia Van Roon provided technical assistance with the recordings and with the manuscript.

References

Aiken, S.J., Picton, T.W., 2006. Envelope following responses to natural vowels. Audiol. Neuro-otol. 11 (4), 213–232.
Aiken, S.J., Picton, T.W., 2008. Cortical responses to the speech envelope. Ear Hear. 29 (2), 139–157.
Anderson, D.J., Rose, J.E., Hind, J.E., Brugge, J.F., 1971. Temporal position of discharges in single auditory nerve fibers within the cycle of a sine-wave stimulus: frequency and intensity effects. J. Acoust. Soc. Am. 49 (4, Suppl. 2), 1131–1139.
Billings, C.J., Tremblay, K.L., Souza, P.E., Binns, M.A., 2007. Effects of hearing aid amplification and stimulus intensity on cortical auditory evoked potentials. Audiol. Neuro-otol. 12 (4), 234–246.
Brugge, J.F., Anderson, D.J., Hind, J.E., Rose, J.E., 1969. Time structure of discharges in single auditory nerve fibers of the squirrel monkey in response to complex periodic sounds. J. Neurophysiol. 32 (3), 1005–1024.
Cebulla, M., Stürzebecher, E., Elberling, C., 2006. Objective detection of auditory steady-state responses: comparison of one-sample and q-sample tests. J. Am. Acad. Audiol. 17 (2), 93–103.
Chertoff, M.E., Hecox, K.E., Goldstein, R., 1992. Auditory distortion products measured with averaged auditory evoked potentials. J. Speech Hear. Res. 35 (1), 157–166.
Chimento, T.C., Schreiner, C.E., 1990. Selectively eliminating cochlear microphonic contamination from the frequency-following response. Electroencephalogr. Clin. Neurophysiol. 75 (2), 88–96.
Coats, A.C., Martin, J.L., Kidder, H.R., 1979. Normal short-latency electrophysiological filtered click responses recorded from vertex and external auditory meatus. J. Acoust. Soc. Am. 65 (3), 747–758.
Cunningham, J., Nicol, T., Zecker, S., Bradlow, A., Kraus, N., 2001. Neurobiologic responses to speech in noise in children with learning problems: deficits and strategies for improvement. Clin. Neurophysiol. 112 (5), 758–767.
Dajani, H., Purcell, D., Wong, W., Kunov, H., Picton, T., 2005. Recording human evoked potentials that follow the pitch contour of a natural vowel. IEEE Trans. Biomed. Eng. 52 (9), 1614–1618.
Dimitrijevic, A., John, M.S., Picton, T.W., 2004. Auditory steady-state responses and word recognition scores in normal-hearing and hearing-impaired adults. Ear Hear. 25 (1), 68–84.
Dobie, R.A., Wilson, M.J., 1996. A comparison of t test, F test and coherence methods of detecting steady-state auditory evoked potentials, distortion-product otoacoustic emissions, or other sinusoids. J. Acoust. Soc. Am. 100 (4), 2236–2246.
Fant, G., 1970. Acoustical Theory of Speech Production: With Calculations Based on X-Ray Studies of Russian Articulations. Walter de Gruyter, The Hague.
Galbraith, G.C., Threadgill, M.R., Hemsley, J., Salour, K., Sondej, N., Ton, J., Cheung, L., 2000. Putative measure of peripheral and brainstem frequency-following in humans. Neurosci. Lett. 292 (2), 123–127.
Goblick, T.J., Pfeiffer, R.R., 1969. Time-domain measurements of cochlear nonlinearities using combination click stimuli. J. Acoust. Soc. Am. 46 (4), 924–938.
Golding, M., Pearce, W., Seymour, J., Cooper, A., Ching, T., Dillon, H., 2007. The relationship between obligatory cortical auditory evoked potentials (CAEPs) and functional measures in young infants. J. Am. Acad. Audiol. 18 (2), 117–125.
Greenberg, S., Marsh, J.T., Brown, W.S., Smith, J.C., 1987. Neural temporal coding of low pitch. I. Human frequency following responses to complex tones. Hear. Res. 25 (2–3), 91–114.
Herdman, A.T., Stapells, D.R., 2003. Auditory steady-state response thresholds of adults with sensorineural hearing impairments. Int. J. Audiol. 42 (5), 237–248.
Hillenbrand, J., Getty, L.A., Clark, M.J., Wheeler, K., 1995. Acoustic characteristics of American English vowels. J. Acoust. Soc. Am. 97 (5, Pt. 1), 3099–3111.
Holmberg, E.B., Hillman, R.E., Perkell, J.S., 1988. Glottal airflow and transglottal air pressure measurements for male and female speakers in soft, normal, and loud voice. J. Acoust. Soc. Am. 84 (2), 511–529 (Erratum in: J. Acoust. Soc. Am. 85 (4) (1989) 1787).
Huis in’t Veld, F., Osterhammel, P., Terkildsen, K., 1977. The frequency selectivity of the 500 Hz frequency following response. Scand. Audiol. 6 (1), 35–42.
John, M.S., Picton, T.W., 2000. MASTER: a Windows program for recording multiple auditory steady-state responses. Comput. Meth. Prog. Biomed. 61 (2), 125–150.
Johnson, K.L., Nicol, T.G., Kraus, N., 2005. Brain stem response to speech: a biological marker of auditory processing. Ear Hear. 26 (5), 424–434.
Johnson, K.L., Nicol, T.G., Zecker, S.G., Kraus, N., 2007. Auditory brainstem correlates of perceptual timing deficits. J. Cognit. Neurosci. 19 (3), 376–385.
King, C., Warrier, C.M., Hayes, E., Kraus, N., 2002. Deficits in auditory brainstem pathway encoding of speech sounds in children with learning problems. Neurosci. Lett. 319 (2), 111–115.
Korczak, P.A., Kurtzberg, D., Stapells, D.R., 2005. Effects of sensorineural hearing loss and personal hearing aids on cortical event-related potential and behavioral measures of speech-sound processing. Ear Hear. 26 (2), 165–185.
Krishnan, A., 1999. Human frequency-following responses to two-tone approximations of steady-state vowels. Audiol. Neuro-otol. 4 (2), 95–103.
Krishnan, A., 2002. Human frequency-following responses: representation of steady-state synthetic vowels. Hear. Res. 166 (1–2), 192–201.
Krishnan, A., 2007. Frequency-following response. In: Burkard, R.F., Eggermont, J.J., Don, M. (Eds.), Auditory Evoked Potentials: Basic Principles and Clinical Application. Lippincott Williams & Wilkins, New York, pp. 313–333.
Krishnan, A., Xu, Y., Gandour, J.T., Cariani, P.A., 2004. Human frequency-following response: representation of pitch contours in Chinese tones. Hear. Res. 189 (1–2), 1–12.
Levi, E.C., Folsom, R.C., Dobie, R.A., 1995. Coherence analysis of envelope-following responses (EFRs) and frequency-following responses (FFRs) in infants and adults. Hear. Res. 89 (1–2), 21–27.
Liberman, A.M., Delattre, P.C., Cooper, F.S., Gerstman, L.J., 1954. The role of consonant–vowel transitions in the perception of the stop and nasal consonants. Psychol. Monogr. 68, 1–13.
Lins, O.G., Picton, T.W., Boucher, B.L., Durieux-Smith, A., Champagne, S.C., Moran, L.M., Perez-Abalo, M.C., Martin, V., Savio, G., 1996. Frequency-specific audiometry using steady-state responses. Ear Hear. 17 (1), 81–96.
Luts, H., Desloovere, C., Kumar, A., Vandermeersch, E., Wouters, J., 2004. Objective assessment of frequency-specific hearing thresholds in babies. Int. J. Pediatr. Otorhinolaryngol. 68 (7), 915–926.
Luts, H., Wouters, J., 2005. Comparison of MASTER and AUDERA for measurement of auditory steady-state responses. Int. J. Audiol. 44 (4), 244–253.
Moushegian, G., Rupert, A.L., Stillman, R.D., 1973. Laboratory note. Scalp-recorded early responses in man to frequencies in the speech range. Electroencephalogr. Clin. Neurophysiol. 35 (6), 665–667.
Pandya, P.K., Krishnan, A., 2004. Human frequency-following response correlates of the distortion product at 2F1–F2. J. Am. Acad. Audiol. 15 (3), 184–197.
Picton, T.W., Dimitrijevic, A., van Roon, P., John, M.S., Reed, M., Finkelstein, H., 2001. Possible roles for the auditory steady-state responses in fitting hearing aids. In: Seewald, R.C., Gravel, J.S. (Eds.), A Sound Foundation through Early Amplification: Proceedings of the Second International Conference. Phonak AG, Basel, pp. 59–69.
Picton, T.W., John, M.S., Dimitrijevic, A., Purcell, D., 2003. Human auditory steady-state responses. Int. J. Audiol. 42 (4), 177–219.
Plyler, P., Ananthanarayan, A.K., 2001. Human frequency-following responses: representation of second formant transitions in normal and hearing-impaired listeners. J. Am. Acad. Audiol. 12 (10), 523–533.
Puria, S., Peake, W.T., Rosowski, J.J., 1997. Sound-pressure measurements in the cochlear vestibule of human-cadaver ears. J. Acoust. Soc. Am. 101 (5), 2754–2770.
Rance, G., Cone-Wesson, B., Wunderlich, J., Dowell, R., 2002. Speech perception and cortical event related potentials in children with auditory neuropathy. Ear Hear. 23 (3), 239–253.
Rhode, W.S., Greenberg, S., 1994. Encoding of amplitude modulation in the cochlear nucleus of the cat. J. Neurophysiol. 71 (5), 1797–1825.
Rickman, M.D., Chertoff, M.E., Hecox, K.E., 1991. Electrophysiological evidence of nonlinear distortion products to two-tone stimuli. J. Acoust. Soc. Am. 89 (6), 2818–2826.
Rosner, B.S., Pickering, J.B., 1994. Vowel Perception and Production. Oxford University Press, Toronto.
Russo, N., Nicol, T., Musacchia, G., Kraus, N., 2004. Brainstem responses to speech syllables. Clin. Neurophysiol. 115 (9), 2021–2030.
Schiano, J.L., Trahiotis, C., Bernstein, L.R., 1986. Lateralization of low-frequency tones and narrow bands of noise. J. Acoust. Soc. Am. 79 (5), 1563–1570.
Small, S.A., Stapells, D.R., 2005. Multiple auditory steady-state responses to bone-conduction stimuli in adults with normal hearing. J. Am. Acad. Audiol. 16 (3), 172–183.
Sohmer, H., Pratt, H., Kinarti, R., 1977. Sources of frequency following responses (FFR) in man. Electroencephalogr. Clin. Neurophysiol. 42 (5), 656–664.
Stapells, D.R., 2000a. Threshold estimation by the tone-evoked auditory brainstem response: a literature meta-analysis. J. Speech-Lang. Pathol. Audiol. 24 (2), 74–83.
Stapells, D.R., 2000b. Frequency-specific evoked potential audiometry in infants. In: Seewald, R.C. (Ed.), A Sound Foundation through Early Amplification: Proceedings of an International Conference. Phonak AG, Basel, pp. 13–31.
Stapells, D.R., 2002. The tone-evoked ABR: why it’s the measure of choice for young infants. Hear. J. 55, 14–18.
Stapells, D.R., Picton, T.W., Durieux-Smith, A., Edwards, C.G., Moran, L.M., 1990. Thresholds for short-latency auditory-evoked potentials to tones in notched noise in normal-hearing and hearing-impaired subjects. Audiology 29 (5), 262–274.
Stroebel, D., Swanepoel, W., Groenewald, E., 2007. Aided auditory steady-state responses in infants. Int. J. Audiol. 46 (6), 287–292.
Stueve, M.P., O’Rourke, C., 2003. Estimation of hearing loss in children: comparison of auditory steady-state response, auditory brainstem response, and behavioral test methods. Am. J. Audiol. 12 (2), 125–136.
Tlumak, A.I., Rubinstein, E., Durrant, J.D., 2007. Meta-analysis of variables that affect accuracy of threshold estimation via measurement of the auditory steady-state response (ASSR). Int. J. Audiol. 46 (11), 692–710.
Tremblay, K.L., Billings, C.J., Friesen, L.M., Souza, P.E., 2006. Neural representation of amplified speech sounds. Ear Hear. 27 (2), 93–103.
Voigt, H.F., Sachs, M.B., Young, E.D., 1982. Representation of whispered vowels in discharge patterns of auditory-nerve fibers. Hear. Res. 8 (1), 49–58.
Wunderlich, J.L., Cone-Wesson, B.K., Shepherd, R., 2006. Maturation of the cortical auditory evoked potential in infants and young children. Hear. Res. 212 (1–2), 185–202.
Yamada, O., Yamane, H., Kodera, K., 1977. Simultaneous recordings of the brain stem response and the frequency-following response to low-frequency tone. Electroencephalogr. Clin. Neurophysiol. 43 (3), 362–370.
Zurek, P.M., 1992. Detectability of transient and sinusoidal otoacoustic emissions. Ear Hear. 13 (5), 307–310.
Zwislocki, J., Feldman, R.S., 1956. Just noticeable difference in dichotic phase. J. Acoust. Soc. Am. 28 (5), 860–864.