
Available online at www.sciencedirect.com

ScienceDirect Speech Communication 68 (2015) 69–84 www.elsevier.com/locate/specom

Classification of speech-evoked brainstem responses to English vowels

Amir Sadeghian a, Hilmi R. Dajani b,*, Adrian D.C. Chan a

a Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada
b School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada

Received 10 November 2013; received in revised form 24 November 2014; accepted 12 January 2015 Available online 29 January 2015

Abstract

This study investigated whether speech-evoked auditory brainstem responses (speech ABRs) can be automatically separated into distinct classes. With five English synthetic vowels, the speech ABRs were classified using linear discriminant analysis based on features contained in the transient onset response, the sustained envelope following response (EFR), and the sustained frequency following response (FFR). The EFR contains components mainly at frequencies well below the first formant, while the FFR has more energy around the first formant. Accuracies of 83.33% were obtained for combined EFR and FFR features and 38.33% were obtained for transient response features. The EFR features performed relatively well with a classification accuracy of 70.83%, despite the belief that vowel discrimination is primarily dependent on the formants. The FFR features obtained a lower accuracy of 59.58%, possibly because the second formant is not well represented in all the responses. Moreover, the classification accuracy based on the transient features exceeded chance level, which indicates that the initial response transients contain vowel-specific information. The results of this study will be useful in a proposed application of speech ABR to objective hearing aid fitting, if the separation of the brain's responses to different vowels is found to be correlated with perceptual discrimination.

© 2015 Elsevier B.V. All rights reserved.

Keywords: Speech-evoked auditory brainstem response; Envelope following response; Frequency following response; Classification of evoked responses; Auditory processing of speech; Fitting hearing aids

1. Introduction

In recent years, there has been increasing interest in measuring brain signals in response to speech stimuli. The ultimate goal of this research is to understand brain processing of speech in order to develop better clinical tools for both the diagnosis and treatment of sensory and cognitive impairments. Currently, hearing assessment is limited

A preliminary version of this research was published in the Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society, Boston, USA, 2011 (Sadeghian et al., 2011).
* Corresponding author at: School of Electrical Engineering and Computer Science, 161 Louis Pasteur, Ottawa, Ontario K1N 6N5, Canada. Tel.: +1 613 562 5800x6217; fax: +1 613 562 5175. E-mail address: [email protected] (H.R. Dajani).

http://dx.doi.org/10.1016/j.specom.2015.01.003 0167-6393/© 2015 Elsevier B.V. All rights reserved.

by diagnostic tests which usually employ artificial signals like tones or clicks that do not allow a clear assessment of auditory function for speech communication. While there are tests of speech perception that rely on subjective responses, these are of no value for assessing the hearing of infants and uncooperative individuals. Speech-evoked auditory brainstem responses (speech ABRs) could thus fill the need to objectively assess auditory performance in these cases (Anderson and Kraus, 2013). Recent studies have also demonstrated that the speech ABR may help identify children with language and learning problems that derive from central auditory processing impairments (Russo et al., 2004; Johnson et al., 2008). On the treatment side, speech ABRs may prove to be very useful for the objective fitting of hearing aids (Aiken and Picton, 2008; Anderson and Kraus, 2013). Currently,


hearing aid fitting is often based on diagnostic tests that use simple stimuli, which do not allow for selective acoustic treatments (Johnson et al., 2005). Since the speech ABR is believed to mostly originate in the upper brainstem (inferior colliculus, lateral lemniscus), it provides a window into subcortical processing of speech (Banai et al., 2007; Chandrasekaran and Kraus, 2010). A number of studies have presented evidence that subcortical processing of speech at this level provides the substrate for speech perception in quiet and noise (e.g. Krishnan et al., 2005; Hornickel et al., 2009; Anderson et al., 2013a,b). The speech ABR could therefore objectively measure the effect of adjusting any of the multiple settings of modern hearing aids on the auditory system's response. Dajani et al. (2013) have proposed several ways in which the measurement of speech ABR could be used to improve this process. For example, it may be possible to use changes in the amplitudes of the response harmonics to tune the frequency dependent gain and compression levels of the hearing aid. The internal SNR of speech ABR, which was estimated in Prévost et al. (2013), may also be useful as an indicator of the quality of the internal neural representation of the speech sound after processing by the hearing aid. Moreover, since the amplitude and latency of the initial transient complex of the speech ABR depend on the initial consonant in a speech sound and are affected by noise (Russo et al., 2004; Johnson et al., 2008), they will likely be dependent on the compression time constants of the hearing aid. Anderson and Kraus (2013) have also suggested that hearing aid settings could be adjusted to maximize correlation between the spectra of the speech ABR and the stimulus. However, given that the speech ABR reflects signal transformations from the auditory periphery to the upper brainstem, it is currently unclear whether the similarity between the stimulus and the response would be the best measure of hearing aid performance. Much more systematic study of the speech ABR is needed to determine the best use of these responses in hearing aid fitting (Clinard and Tremblay, 2013).

Although the auditory brainstem response (ABR) to simple artificial stimuli is widely used in the clinic, it provides little understanding about the auditory processing of complex stimuli such as speech sounds. The speech ABR, on the other hand, reflects neural processing of the different components of speech. In an early study, Greenberg showed that components that follow speech formants are present in the evoked response (Greenberg, 1980). More recent work has led to a more detailed characterization of the auditory brainstem response to vowel stimuli. This response consists of two parts: (1) the transient response and (2) the sustained response. The transient response is short (≤20 ms) and is similar to the transient response to click stimuli (Skoe and Kraus, 2010). It usually contains prominent peaks which originate from the ascending auditory pathway between the cochlear nerve and midbrain, and the VA complex of the transient response in particular signifies auditory processing in the upper

brainstem (Banai et al., 2007; Chandrasekaran and Kraus, 2010). As such, the transient response may be thought to be a response to the "attack" characteristics of the stimulus onset (Skoe and Kraus, 2010). The transient response can thus differ depending on the initial consonant, and may contribute to the identification of specific speech sounds (Johnson et al., 2008; Skoe and Kraus, 2010). However, there has been no previous work on whether it is able to convey phonetic information when the stimulus is a pure vowel.

The sustained response (≥20 ms) of the speech ABR follows the periodic components of the speech stimulus. Depending on how the response signals are analyzed, it can correspond to the Envelope Following Response (EFR) or the Frequency Following Response (FFR) (Aiken and Picton, 2006). The EFR is calculated by averaging the responses to the stimulus in one polarity with an equal number of responses to the stimulus in inverted polarity, i.e. (Avg. speech ABRs + Avg. inverted-polarity speech ABRs)/2, while the FFR is calculated by averaging the responses to the stimulus in one polarity with the negative of an equal number of responses to the stimulus in inverted polarity, i.e. (Avg. speech ABRs - Avg. inverted-polarity speech ABRs)/2 (Aiken and Picton, 2008). In the notation of some recent studies, the EFR corresponds to the response to the temporal envelope in the stimulus (or ENV), and the FFR corresponds to the response to the temporal fine structure (or TFS) (e.g. Anderson et al., 2013a,b; Zhong et al., 2014). The EFR mainly reflects auditory neural activity that is phase-locked to the envelope of the speech stimuli, and so it has a fundamental frequency equal to the stimulus F0 (Aiken and Picton, 2008; Dajani et al., 2005). The response at F0 is probably introduced primarily through the rectification of the envelope during inner hair cell transduction (Cebulla et al., 2006; Aiken, 2008; Aiken and Picton, 2008). Energy would also appear at the early harmonics of F0 because the envelope is non-sinusoidal. Other nonlinearities within the cochlea and in higher neural pathways produce multiple intermodulation distortion products in response to pairs of stimulus harmonics which also appear in the brainstem response and contribute to the EFR. Since the synthetic vowel used in this study only contains energy at F0 and integer harmonics, these distortion products would also occur only at integer harmonics of F0 (Aiken and Picton, 2008). On the other hand, the FFR is formed as a result of auditory neural phase-locking that directly follows the harmonics of the speech stimulus, and in particular near the first formant F1, since these harmonics are typically the most intense in the stimulus and are usually well within the phase-locking frequency limit of neurons in the auditory brainstem (Krishnan, 2002; Skoe and Kraus, 2010). Intermodulation distortion products, however, may also contribute to the FFR (Aiken and Picton, 2008).
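To make the polarity arithmetic concrete, here is a minimal Python sketch that forms the two sustained responses from the two polarity averages; the array names are illustrative and not taken from the paper. Adding the averages reinforces envelope-locked activity, which does not invert with stimulus polarity, while subtracting them reinforces activity that follows the stimulus fine structure.

```python
import numpy as np

def efr_ffr_from_polarities(avg_pos: np.ndarray, avg_neg: np.ndarray):
    """Split a speech ABR into its envelope- and fine-structure-dominated parts.

    avg_pos: response averaged over sweeps with the original stimulus polarity
    avg_neg: response averaged over sweeps with the inverted stimulus polarity
    Both are 1-D arrays on the same time base.
    """
    efr = (avg_pos + avg_neg) / 2.0  # envelope following response
    ffr = (avg_pos - avg_neg) / 2.0  # frequency following response
    return efr, ffr
```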


In addition to the contribution of activity in the ascending auditory system to the speech ABR, top-down influences from higher neural centers have been shown to affect the responses. For example, auditory training and experience with a tonal language have been found to enhance responses at F0 (Krishnan et al., 2005; Bidelman et al., 2009; Song et al., 2012).

Currently, there is limited understanding of the neural responses evoked by various speech sounds. With brainstem responses, most of the studies have involved single stimuli such as the consonant vowel /da/ (e.g. Russo et al., 2004; Johnson et al., 2005) or a single vowel (e.g. Laroche et al., 2013; Prévost et al., 2013). On the other hand, Johnson et al. (2008) investigated the auditory brainstem responses to 3 consonant–vowel stimuli (/ba/, /da/, /ga/) in normal learning children and found that they can be distinguished based on differences in the spectrotemporal characteristics of the first 60 ms of the response. There has also been work on using cortical responses to classify speech. Engineer et al. (2008) and Centanni et al. (2014) were able to classify cortical neural activity patterns in the rat in response to English consonant-/a/-consonant stimuli. In humans, Pasley et al. (2012) reconstructed words and sentences based on cortical surface potentials in patients undergoing neurosurgical procedures. In addition, as part of the effort to develop brain–computer interfaces for speech communication, several groups have investigated the decoding of cortical activity associated with spoken and imagined speech. Two informative reviews of this work can be found in Brumberg et al. (2010) and Denby et al. (2010). None of the work described in the literature has investigated the classification of scalp recordings of brainstem responses to vowels in humans.

In this study, we investigated the speech ABR to five English vowels and how well they may be automatically separated into distinct classes using a basic classifier. One use of automatic classification of speech ABR could be to objectively assess how well a certain choice of hearing aid settings allows the separation of highly confusable speech samples into distinct classes. In the future, this approach could extend beyond the vowels investigated in this work, to consonant vowel combinations (Johnson et al., 2008), and perhaps eventually to words.


2. Methods

2.1. Subjects

Eight subjects (six males, 25–45 years old) with no known hearing problems participated in this experiment. The subjects had pure tone hearing thresholds of 15 dB HL or less at 500, 1000, 2000, and 4000 Hz. Participants provided informed consent according to the regulations of the University of Ottawa Research Ethics Board.

2.2. Stimuli

As stimuli, we used five synthetic English vowels (/a/, /O/, /ae/, /i/, /u/) generated using a simplified version of Klatt's cascade/parallel formant synthesizer (Klatt, 1980; Laroche et al., 2013). The glottal source was a unit impulse train with a periodicity of 100 Hz. For all the vowels, this resulted in a fundamental frequency of 100 Hz and harmonics at integer multiples of the fundamental frequency. The parameters of the stimuli, namely the first 3 formant frequencies, formant bandwidths, and relative formant amplitudes, are shown in Table 1. These parameters followed those for male speakers (Klatt, 1980; Peterson and Barney, 1952). All the vowel stimuli were 300 ms in duration, and were generated with a sampling frequency of 48 kHz at a 16 bit resolution. Fig. 1 shows a portion of the time domain signals of the five synthetic vowels.
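As a concrete illustration of this kind of stimulus, the sketch below generates a 300 ms /a/-like vowel by passing a 100 Hz unit impulse train through a cascade of three second-order resonators, using the resonator difference equation from Klatt (1980) and the /a/ formant frequencies and bandwidths from Table 1. It is a simplified stand-in for the synthesizer used by the authors and ignores the relative formant amplitudes, so it should not be treated as the exact stimulus-generation code.

```python
import numpy as np

def klatt_resonator(x, f, bw, fs):
    """Second-order digital resonator (Klatt, 1980): y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    t = 1.0 / fs
    c = -np.exp(-2.0 * np.pi * bw * t)
    b = 2.0 * np.exp(-np.pi * bw * t) * np.cos(2.0 * np.pi * f * t)
    a = 1.0 - b - c
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = a * x[n]
        if n >= 1:
            y[n] += b * y[n - 1]
        if n >= 2:
            y[n] += c * y[n - 2]
    return y

fs = 48000                       # sampling rate (Hz), as in the paper
f0 = 100                         # fundamental frequency (Hz)
n = int(0.3 * fs)                # 300 ms stimulus
source = np.zeros(n)
source[::fs // f0] = 1.0         # unit impulse train with 10 ms periodicity

vowel = source
for f, bw in [(700, 130), (1220, 70), (2600, 160)]:   # F1-F3 and BW1-BW3 for /a/ (Table 1)
    vowel = klatt_resonator(vowel, f, bw, fs)
vowel /= np.max(np.abs(vowel))   # normalize before scaling to the presentation level
```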

2.3. Experimental protocol

Subjects were seated comfortably in an acoustical booth. During a recording session, the subjects were asked to stay relaxed in order to minimize artefacts. They were also asked to keep their eyes open, and watch a muted movie with subtitles. A single recording session consisted of six trials, and in each trial, subjects were presented 500 repetitions of a single vowel at a repetition rate of 3.1/s, leading to a 22.6 ms silence interval between stimulus presentations. Responses were coherently averaged over the 500 repetitions to give one EFR and one FFR. A BioMARK v.7.0.2 system (Biological Marker of Auditory Processing, Bio-logic Systems Corporation) was used to present the stimuli and record the responses. The recording system includes a bandpass filter with low and high cut-off frequencies of 30 Hz and 1000 Hz, respectively. Each vowel was presented at a calibrated level of 80.5 dB SPL by adjusting an internal calibration factor in the BioMARK system, with the calibration performed by connecting the earphone to a 2 cc coupler attached to a Brüel & Kjær Artificial Ear type 4152, and a Sound Level Meter (SLM) Type 2230. Stimuli were presented using the Bio-logic insert-earphone of the BioMARK v.7.0.2 system. Three gold-plated Grass electrodes were used, with the recording electrode placed at the vertex (Cz), the reference electrode on the right earlobe, and the ground electrode on the left earlobe. Electrode impedances were kept below 5 kΩ during the recording by monitoring the impedance at the start and end of each trial, and discarding and repeating the trials with impedance higher than 5 kΩ. Vowels were presented in alternate polarities to both ears. Speech ABRs were recorded with a sampling frequency of 3202 Hz for a duration of 319.8 ms starting at stimulus onset. To minimize artefacts, epochs in which the response exceeded 23.8 µV were discarded. Moreover, trials which contained more than 20 discarded epochs were repeated.
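The averaging and artifact rejection described above can be sketched as follows; the function name, array layout, and return convention are assumptions for illustration, not part of the BioMARK software.

```python
import numpy as np

def coherent_average(epochs: np.ndarray, reject_uv: float = 23.8, max_rejected: int = 20):
    """Coherently average single-sweep epochs with simple amplitude-based artifact rejection.

    epochs: array of shape (n_sweeps, n_samples) in microvolts.
    Returns the averaged response, or None if more than `max_rejected` epochs
    were discarded (in the protocol above, such a trial was repeated).
    """
    keep = np.max(np.abs(epochs), axis=1) <= reject_uv
    if int(np.sum(~keep)) > max_rejected:
        return None
    return epochs[keep].mean(axis=0)
```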

2.4. Classification


2.4.1. Feature selection



Table 1
Formant frequencies, bandwidths and amplitudes of the five synthetic vowels used as stimuli.

Vowels   F1 (Hz)   F2 (Hz)   F3 (Hz)   BW1 (Hz)   BW2 (Hz)   BW3 (Hz)   A1 (dB)   A2 (dB)   A3 (dB)
/a/      700       1220      2600      130        70         160        -1        -5        -28
/ae/     660       1720      2410      70         150        320        -1        -12       -22
/O/      570       840       2410      100        60         110        0         -7        -34
/i/      270       2290      3010      50         100        140        -4        -24       -28
/u/      300       870       2240      65         110        140        -3        -19       -43

Fig. 1. First 100 ms of the time domain representation of five synthetic vowels as spoken by a male with a fundamental periodicity of 10 ms.

Features were obtained from the frequency domain representation of the sustained response (EFR and FFR) and from the time domain transient response. Note that the transient response (<19.8 ms) was removed for EFR and FFR calculations which were based on the final 300 ms of the response. For the first set of features, we examined both amplitude and phase of EFR and FFR spectra at F0 (100 Hz) and its harmonics, since the stimulus signal energy only exists at these frequencies. We further reduced the number of frequency points by considering the spectral values (i.e. amplitudes and phases) below and including 1000 Hz because neural phase-locking degrades above 1000 Hz. Therefore, the sustained feature vectors had 10 amplitude and 10 phase feature elements for the EFR or FFR. We assessed amplitude and phase features by comparing the classification results for different combinations of the features and found that using the phase features

did not improve the classification accuracy by much (only a 2–3% improvement). Thus, we decided to only consider the amplitude features to reduce the risk of over-fitting (Duda et al., 2001). This means that the sustained feature vectors were reduced to 10 amplitude feature elements for the EFR or FFR. Fig. 2 shows the amplitude spectra of the EFR for each vowel averaged across all trials and subjects. As can be seen, there are robust peaks at harmonics of F0. Fig. 3 illustrates the amplitude spectra of the FFR for each vowel averaged across all trials and subjects. In these figures, the DC offset was removed from the amplitude spectra of both EFR and FFR by subtracting the mean amplitude of the grand averaged waveform in the time domain.
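Each sustained feature vector can be pictured as the response magnitude spectrum sampled at the ten harmonics of F0 up to 1000 Hz. The sketch below shows one way to compute such a vector; the paper does not specify the exact spectral estimator, so the FFT details here are assumptions.

```python
import numpy as np

def harmonic_amplitudes(response: np.ndarray, fs: float, f0: float = 100.0, fmax: float = 1000.0):
    """Amplitude features at F0 and its harmonics up to fmax (10 values for f0=100, fmax=1000).

    response: EFR or FFR waveform with the transient portion already removed.
    """
    response = response - response.mean()                 # remove the DC offset
    spectrum = np.abs(np.fft.rfft(response)) / len(response)
    freqs = np.fft.rfftfreq(len(response), d=1.0 / fs)
    harmonics = np.arange(f0, fmax + f0, f0)
    idx = [int(np.argmin(np.abs(freqs - h))) for h in harmonics]  # nearest FFT bin to each harmonic
    return spectrum[idx]
```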


Fig. 2. Amplitude spectra (up to 1000 Hz) of the speech ABRs for all vowels averaged over all trials and subjects (grand-averages) for the Envelope Following Response (EFR).

Figs. 4 and 5 show the time domain EFR and FFR waveforms for each vowel averaged across all trials and subjects. For the second set of features, we focused on the significant transient peaks from the EFRs in the time domain (<19.8 ms). In the transient response, significant peaks correspond to neural responses along the ascending auditory pathway. We identified all the peaks by finding local maxima/minima around the time when the peaks are expected to occur. Of these, peaks V and A (a.k.a. the VA complex) are the most prominent and robust in speech ABR (Russo et al., 2004). Hence, we only assessed features of the VA complex, namely amplitude and latency of V and A, VA duration, VA amplitude, and VA slope. The classification accuracy obtained with all possible non-dependent combinations of the 7 features (i.e. highly linearly correlated features were not combined to avoid singularity in training data) showed that the 4 features of latency of V and A, and amplitude of V and A provided the highest classification accuracy. As such, we present the classification results with transient feature vectors containing these 4 elements. Table 2 shows means and Standard Deviation (SD) of the transient features for all vowels.
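A rough sketch of how the four selected transient features could be extracted from the onset portion of a response is given below. The latency search windows are assumptions chosen to bracket the mean values in Table 2; the paper does not state the exact windows that were used.

```python
import numpy as np

def va_features(onset: np.ndarray, fs: float):
    """Latencies (ms) and amplitudes of waves V and A from the onset portion of a speech ABR.

    Wave V is taken as the largest positive peak in an assumed 6-10 ms window and
    wave A as the deepest trough between V and an assumed 13 ms limit.
    """
    t = np.arange(len(onset)) / fs * 1000.0               # time axis in ms
    v_win = (t >= 6.0) & (t <= 10.0)
    v_idx = np.flatnonzero(v_win)[np.argmax(onset[v_win])]
    a_win = (t > t[v_idx]) & (t <= 13.0)
    a_idx = np.flatnonzero(a_win)[np.argmin(onset[a_win])]
    # Feature vector: V latency, A latency, V amplitude, A amplitude
    return np.array([t[v_idx], t[a_idx], onset[v_idx], onset[a_idx]])
```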

2.4.2. Classification method: Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) was employed for classification using Matlab (v. 7.9.0.529, The Mathworks, Natick, MA, U.S.) (Duda et al., 2001). We had five classes corresponding to the five different vowels, and each class had 48 sets of speech ABR trials (6 samples per subject, each corresponding to 500 stimulus repetitions, × 8 subjects). Leave-one-out cross-validation was used to train and test. That is, training was performed on all samples except one, which was used to test. The leave-one-out was repeated such that each of the 240 speech ABR samples (5 vowels × 6 trials/vowel × 8 subjects = 240) was tested. Leave-one-out cross-validation was used because the data set is small and this approach provides better training (i.e. reduces error rates), especially for small data sets (Duda et al., 2001). Generally, LDA separates classes by determining hyperplanes between them such that the ratio of between-class variance over within-class variance is maximized in order to optimize class separation (Duda et al., 2001). In other words, LDA tries to find the best linear combination of features that separates the classes.
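The paper performed this analysis in Matlab; an equivalent leave-one-out LDA loop in Python with scikit-learn is sketched below purely to illustrate the procedure.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut

def loo_lda_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """Leave-one-out LDA classification accuracy.

    X: feature matrix, e.g. 240 x 20 for the combined EFR + FFR amplitude features.
    y: vowel label of each sample (five classes).
    """
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        lda = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
        correct += int(lda.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)
```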


Fig. 3. Amplitude spectra (up to 1000 Hz) of the speech ABRs for all vowels averaged over all trials and subjects (grand-averages) for the Frequency Following Response (FFR).

One of the advantages of LDA over more complex classifiers like artificial neural networks is that it helps to avoid over-fitting. Over-fitting can occur when training produces a classifier that describes the noise within the training data, and as a result, the classifier does not generalize well to the test data. A classifier is more susceptible to over-fitting if it has too many parameters with respect to the number of training examples.

In order to support the LDA classification results, we measured the Mahalanobis distance between all possible pair-wise combinations of the five classes. To compute this distance for each pair of classes, we considered one class as observation samples and the other class as reference samples. The Mahalanobis distance measures the distance between the mean of the observation samples ($\mu_{sample}$) and the mean of the reference samples ($\mu_{reference}$), normalized by the covariance ($\Sigma$) of the reference cluster, as follows:

$$\mathrm{Mahal\ Dist} = \sqrt{(\mu_{sample} - \mu_{reference})^{T}\,\Sigma^{-1}\,(\mu_{sample} - \mu_{reference})}$$

A smaller Mahalanobis distance indicates a higher probability that the sample observation is part of the reference class and vice versa (Duda et al., 2001).
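A sketch of this computation for one pair of vowel classes is shown below (NumPy; the pseudo-inverse is an implementation choice to guard against an ill-conditioned reference covariance and is not taken from the paper).

```python
import numpy as np

def mahalanobis_between_classes(samples: np.ndarray, reference: np.ndarray) -> float:
    """Mahalanobis distance of the sample-class mean from the reference class.

    samples, reference: trial-by-feature matrices for the two vowel classes;
    the covariance of the reference class alone normalizes the distance.
    """
    diff = samples.mean(axis=0) - reference.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(reference, rowvar=False))
    return float(np.sqrt(diff @ cov_inv @ diff))
```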

3. Results

Classification was evaluated using an accuracy measure, defined as the number of correctly classified samples divided by the total number of samples (Duda et al., 2001). Because there are five classes, the chance level accuracy is 20%.

3.1. Classification of speech ABRs using sustained response features

Table 3 shows LDA classification results per subject for three different sets of amplitude features: (1) EFR (10 elements), (2) FFR (10 elements), and (3) EFR + FFR (combining both EFR and FFR features to generate a new feature set with 20 elements). The displayed results include classification accuracies when the LDA is applied to the training data, as well as to the leave-one-out test data. The last row of this table shows the overall classification accuracy (i.e. average accuracy over 8 subjects). As can be seen, the EFR + FFR amplitude features provided the highest accuracy of 83.33% with the test data, followed by the individual EFR and FFR amplitude features. A greedy backward elimination method was used to find which features


Fig. 4. Time domain waveforms (up to 300 ms) of the speech ABRs for all vowels averaged over all trials and subjects (grand-averages) for the Envelope Following Response (EFR).

contributed the most toward the LDA classification results. We used the sequentialfs function in Matlab to eliminate the least effective feature at each step until only one feature remained. Tables 4(a)–(c) show the LDA classification accuracies from the greedy backward elimination method. For instance, in Table 4(a), the least effective feature was feature 1 and the second least effective was feature 4. Tables 5(a)–(c) show confusion matrices for the three different amplitude feature sets, while Tables 5(d)–(f) show the Mahalanobis distances between all possible pair-wise combinations of the five vowels for the three different amplitude feature sets (note that all Mahalanobis distances are normalized by the Mahalanobis distance of each vowel from itself). Each value in the matrices represents the averaged Mahalanobis distances from every speech ABR sample of one class in "Sample Vowels" to a reference class in "Reference Vowels". The smallest Mahalanobis distance in all three matrices is 1, which corresponds to cases where the reference and the sample vowels are the same (i.e. the diagonal cells of the matrices). In order to simplify the comparison between confusion matrices and Mahalanobis distances, five different shades of grey are used that signify five different ranges for classification distribution and Mahalanobis distances. In general, on the non-diagonal cells, the lighter the grey, the fewer the misclassifications and the longer the distance.
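The greedy backward elimination used above (the sequentialfs function in Matlab) can be approximated by the following Python sketch; it is an illustration of the procedure rather than the authors' code.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

def backward_elimination(X: np.ndarray, y: np.ndarray):
    """Drop the least useful feature at each step; return (retained features, LOO accuracy) per step."""
    retained = list(range(X.shape[1]))
    history = []
    while retained:
        acc = cross_val_score(LinearDiscriminantAnalysis(), X[:, retained], y,
                              cv=LeaveOneOut()).mean()
        history.append((list(retained), acc))
        if len(retained) == 1:
            break
        # Score each candidate removal and drop the feature whose absence hurts least.
        scores = [cross_val_score(LinearDiscriminantAnalysis(),
                                  X[:, [f for f in retained if f != drop]], y,
                                  cv=LeaveOneOut()).mean()
                  for drop in retained]
        retained.pop(int(np.argmax(scores)))
    return history
```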

3.2. Classification of speech ABRs using transient response features

Table 6 shows classification accuracies per subject for the set of the four transient features derived from the latency and amplitude of the V and A waves. This table includes classification accuracies when the LDA is applied to the training data, as well as to the leave-one-out test data, with overall accuracies of 41.83% for the training data and 38.33% for the test data, respectively. Table 7 provides more details about the classification results for each class by illustrating confusion matrices, on the left side (a) and (b), and the normalized Mahalanobis distances, on the right side (c), for the four selected transient features. Since vowel /a/ provides the lowest classification accuracy among all vowels, we re-examined the trials without the vowel /a/ samples to see if the classification accuracies improve. The result of this analysis is shown in confusion matrix (b). In general, on the non-diagonal cells, the lighter the grey, the fewer the misclassifications and the longer the distance.


Fig. 5. Time domain waveforms (up to 300 ms) of the speech ABRs for all vowels averaged over all trials and subjects (grand-averages) for the Frequency Following Response (FFR).

Table 2
Mean and Standard Deviation (SD) of the four selected transient response features for all 48 trials in each vowel class.

Class labels   Stats   V latency (ms)   A latency (ms)   V height (µV)   A height (µV)
Vowel /a/      Mean    8.06             9.78             0.41            -0.05
               SD      0.78             1.00             0.17            0.15
Vowel /ae/     Mean    7.78             9.87             0.38            -0.10
               SD      0.73             1.25             0.15            0.25
Vowel /O/      Mean    8.25             10.04            0.46            -0.05
               SD      0.78             1.07             0.20            0.21
Vowel /i/      Mean    8.33             10.37            0.45            -0.01
               SD      1.04             1.47             0.20            0.20
Vowel /u/      Mean    8.79             10.34            0.37            -0.05
               SD      0.63             0.95             0.16            0.20

3.3. Classification of speech ABRs using sustained and transient response features

The combination of all sustained ("EFR + FFR") and the best 4 transient response features provides a classification accuracy of 84.16%, which is about 1% higher than the highest classification accuracy obtained using the "EFR + FFR" feature set. Table 8 shows the corresponding confusion matrix in four shades of grey.

4. Discussion

The results show that the sustained and transient components of non-invasively recorded speech ABRs to five


Table 3
LDA classification accuracies of the three different amplitude feature sets. Results include classification accuracies when the LDA is applied to the training data, as well as to the leave-one-out test data.

               Classification accuracy using test data       Classification accuracy using training data
Subjects       EFR + FFR (%)   EFR (%)   FFR (%)             EFR + FFR (%)   EFR (%)   FFR (%)
Sub1           80.00           70.00     50.00               86.26           76.02     63.71
Sub2           96.67           70.00     63.33               87.30           76.09     63.67
Sub3           80.00           76.67     53.33               87.21           75.88     62.34
Sub4           86.67           63.33     63.33               87.33           75.98     63.56
Sub5           80.00           76.67     56.67               87.26           76.12     63.33
Sub6           73.33           70.00     63.33               87.12           76.02     63.02
Sub7           83.33           73.33     53.33               87.34           75.97     64.17
Sub8           86.67           70.00     66.67               87.24           75.98     63.76
All subjects   83.33           70.83     59.58               86.89           75.69     63.44

English vowels can be discriminated through a basic classification method. This result indicates that the different components of the speech ABR carry useful information for discriminating speech stimuli.

4.1. Classification of speech ABR using sustained response features

Using the sustained response, the speech ABRs of the five different vowels could be classified with an accuracy of 83.33%, which is considerably higher than the chance level of 20%. Together, the combined EFR and FFR features provide the highest classification accuracy followed by the individual EFR and FFR features, respectively. The classification accuracies of all three sets of features, the combined EFR and FFR, EFR alone, and FFR alone, were statistically significantly different from each other (two-tailed paired t-test with Bonferroni correction: p < 0.01).

The classification accuracy of 70.83% with the EFR was relatively high. This is a novel finding since neural activity that corresponds to the envelope of speech and its harmonics is used here to distinguish the vowels, whereas vowels are usually thought to be perceptually discriminated based mainly on the formant frequencies, and in particular the relative frequencies of F1 and F2 (Peterson and Barney, 1952; Advendano et al., 2004; Assmann and Summerfield, 2004), although including F0 has been shown to improve automatic vowel classification (Hillenbrand and Gayvert, 1993). Moreover, this finding is different from a model of neural processing of speech that Kraus and Nicol (2005) have proposed in which the "source" of speech, reflected in components of the response in the region of F0, and "filter" of speech, reflected in components of the response in the region of F1, are processed in separate neural pathways. In our study, the differences in the envelope shapes of the vowels can only be due to differences in the formant content since the impulse train source signal used to synthesize all the vowels was the same. Therefore, differences among the responses to vowels in the region of F0 in the EFR cannot be said to reflect any differences in the "source". Instead, they correspond to neural activity that results from differences in the "filter".

This neural activity occurs at frequencies mostly well below the first formant and yet allows for vowel discrimination. Different EFR components are likely obtained for different vowels since the shape of the envelope is different for each vowel. Moreover, additional EFR components are probably introduced by non-linearities in the cochlea, including rectification, and in the different neural centers leading up to the upper brainstem during processing of the speech envelope (Aiken and Picton, 2008). These effects could result in EFR spectral content that differentiates well between the vowels.

The FFR amplitude features provided a classification accuracy of 59.58%. It can also be noted that some vowels like /i/ and /u/ had similar F1 values (270 Hz and 300 Hz respectively) while their speech ABRs do not look similar in Fig. 3, and they have the highest classification accuracies (77%) among all vowels. Such dissimilarities may again be due to distortion products which are caused by nonlinearities in the auditory system (Aiken and Picton, 2008). Nonetheless, the relative weakness of the FFR amplitude features could be due to three causes. First, for three of the vowels (/a/, /ae/, and /i/), only the FFR in the region of F1 was included in the analysis, as responses at higher formants are usually weak due to degradation in neural phase-locking above approximately 1000 Hz. Adding the responses of higher formants (especially F2), if they are available in the speech ABR, might improve the FFR features by providing additional distinct information specific to each vowel (Peterson and Barney, 1952). The second reason for the relative weakness of the FFR amplitude features is that the separation between F1 frequencies in different vowels is in some cases small (Table 1). This could generate overlapping response peaks at harmonics of F0 around F1 frequencies. We tested the effect of F1 separation by obtaining the classification accuracy for all 10 (5 choose 2) vowel pairs. The classification accuracy (in %) was positively correlated with F1 separation (in Hz) with a Pearson correlation coefficient equal to 0.76 (p = 0.011). The third reason for the weakness of the FFR amplitude features is that the harmonics of the FFR were found to be generally weaker (in amplitude) than those of the EFR


Table 4
LDA classification accuracies with the greedy backward elimination method for (a) EFR + FFR, (b) EFR, and (c) FFR amplitude feature sets, shown as a function of the number of features retained at each elimination step.

(a) EFR + FFR feature set

Features retained   Classification accuracy (%)
20                  83.33
19                  75.00
18                  71.25
17                  69.17
16                  66.67
15                  63.33
14                  58.75
13                  54.58
12                  49.58
11                  45.00
10                  43.33
9                   41.25
8                   40.83
7                   38.75
6                   34.58
5                   31.25
4                   27.92
3                   22.91
2                   20.83
1                   14.17

(b) EFR feature set

Features retained   Classification accuracy (%)
10                  70.83
9                   57.08
8                   47.50
7                   45.00
6                   43.75
5                   40.42
4                   37.08
3                   32.08
2                   25.00
1                   19.17

(c) FFR feature set

Features retained   Classification accuracy (%)
10                  59.58
9                   51.67
8                   45.00
7                   40.83
6                   36.67
5                   35.42
4                   30.42
3                   25.42
2                   20.82
1                   14.17


Table 5 Left column shows confusion matrices for (a) EFR + FFR, (b) EFR, and (c) FFR. Right column shows Mahalanobis distances between all possible pair-wise combinations of vowels for (d) EFR + FFR, (e) EFR, (f) FFR.

(Figs. 2 and 3), and so auditory phase-locking to F1 may not have been consistent across subjects and would have been more vulnerable to noise. As a result, the FFR spectra of different vowels may not have been strongly distinguishable by the classifier. This may be supported by comparing the variability of the individual classification accuracies for

the EFR and FFR features in Table 3. The difference between the highest and lowest individual accuracies is 30% (83.33–53.33%) for the EFR features and 53.33% (80–26.67%) for the FFR features. This indicates that the FFR features are more subject dependent compared to the EFR features.

Table 6
LDA classification accuracies per subject using 4 transient features (latencies and amplitudes of waves V and A). Results include classification accuracies when the LDA is applied to the training data, as well as to the leave-one-out test data.

Subjects       Classification accuracy using test data (%)   Classification accuracy using training data (%)
Sub1           33.33                                          41.84
Sub2           33.33                                          41.91
Sub3           46.67                                          41.78
Sub4           26.67                                          41.75
Sub5           50.00                                          41.74
Sub6           36.67                                          41.86
Sub7           30.00                                          41.78
Sub8           50.00                                          41.88
All subjects   38.33                                          41.83

Mahalanobis distances were calculated for the pairs of vowels. As can be seen in Table 5, shorter non-diagonal Mahalanobis distances generally correspond to higher misclassification rates and longer Mahalanobis distances generally correspond to lower misclassification rates. However, there are a few cases which do not follow this rule. Such discrepancies may occur because the main difference between applying LDA and Mahalanobis distance on two classes is that LDA uses the pooled covariance matrix to build a hyperplane between the two classes, whereas


Mahalanobis distance uses the covariance of a single class, depending on which class is taken as a reference, to calculate the distance. Moreover, we used different methods to test and train for each analysis approach. For LDA, we used leave-one-out, whereas for Mahalanobis distance we considered all samples of one class as "Reference Vowels" and all samples of the other class as "Sample Vowels".

Tables 4(a)–(c) show that higher frequency components were generally effective in improving classification accuracies of all three feature sets. This is an important finding, because from Figs. 2 and 3 it can be seen that higher frequency components generally have smaller amplitudes compared to lower frequency components.

4.2. Classification of speech ABR using transient response features

The fact that we were able to classify speech ABRs of the English vowels using the transient response features, albeit with low accuracy, indicates that the process of encoding of vowels begins even before processing the pitch and formants of the stimuli, although it is also possible that backward masking from the sustained response modified the transient response and contributed to the ability to classify. The LDA classification of the speech ABR of the five vowels obtained from the best combination of the transient response features (V, A latencies and amplitudes) had an

Table 7 (a) Confusion matrix for the group of 4 transient features which provide the highest classification accuracy among all possible combinations of the transient features (i.e. latencies and amplitudes of waves V and A). (b) Confusion matrix without vowel /a/. (c) Mahalanobis distances between all possible pair-wise combinations of vowels for the 4 transient features.


Table 8 Confusion matrix for the combination of sustained and transient response features.

accuracy of 38.33%, which is almost double the chance level of 20%. This accuracy is significantly different from chance level based on a one-tailed binomial test (p < 0.001), with total number of samples n = 240, number of correctly classified samples k = 91 (240 × 38.33%), and probability of chance occurrence = 0.2. This result also suggests that the neural responses in the upper brainstem (corresponding to the VA complex) carry vowel-specific information. Previous studies have found differences in the transient response when the stimulus was a consonant–vowel syllable with different initial consonants (Johnson et al., 2008; Skoe et al., 2011). However, to the best of our knowledge, no previous study has distinguished between separate vowels using the onset response of the ABR or suggested that it contains phonetic information.

In general, the transient response features have provided lower classification accuracies compared with the sustained response features. The weakness of the transient features may be because they only include limited phonetic information. It is also noted that only 4 transient features were used, compared to 10 EFR and 10 FFR features. If a higher number of repetitions were available to be averaged, this would reduce noise and could provide more robust transient peaks. As a result, it could be possible to obtain other transient peaks that provide a stronger and larger collection of features.

With the transient responses, we calculated Mahalanobis distances for the vowel pairs. As can be seen in Table 7(c), the small Mahalanobis distances on the non-diagonal cells generally correspond to a large number of misclassifications on the matching cells in Table 7(a) and (b). However, there are a few exceptions to this, which may be explained by the differences between LDA classification and Mahalanobis distance as discussed above.
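The binomial test reported above can be reproduced, for example, with SciPy; the paper does not state which software was used for this calculation.

```python
from scipy.stats import binomtest

# 91 of 240 samples correct with the transient features, chance level 0.2 (five classes).
result = binomtest(k=91, n=240, p=0.2, alternative='greater')
print(result.pvalue)  # far below 0.001, consistent with the reported p < 0.001
```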

4.3. Classification of speech ABRs using sustained & transient response features

The classification accuracy was increased by approximately 1% when the combination of the sustained and transient response features was used as a feature set. Comparing Tables 5(a) and 8, it can be seen that the classification accuracies in Table 8 have increased for the vowels with lower accuracies in Table 5(a) (i.e. vowels /ae/ and /u/) in contrast to the vowel with the highest classification accuracy (i.e. vowel /i/).

5. Conclusion

This study has demonstrated that the speech-evoked auditory brainstem response (speech ABR) of five English vowels can be classified with a fairly high accuracy of 83.33% with the sustained response features and 38.33% with the transient response features, using a Linear Discriminant Analysis (LDA) classifier. Given the limited data set in this study, a simple classifier like LDA probably helped to prevent over-fitting. Amplitudes of Envelope Following Response (EFR) and Frequency Following Response (FFR) spectral peaks were used as the sustained response features and properties of the VA complex as the transient features. The relatively high accuracy of 70.83% with the EFR (envelope) features is a novel finding since it has been thought that the formants of speech are the main contributors to perceptual discrimination of different vowels. The FFR features gave a lower accuracy of 59.58%, possibly because the second formant is not well represented in the responses. The EFR and FFR features were also shown to be not redundant since the combination of the two sets of features gave a higher accuracy than either feature set alone. The ability to classify the vowels with the transient response features is also a new finding because the transient response to a vowel is usually thought to carry general sound onset information and not vowel-specific information. However, it is important to stress that the leap from speech ABR coding to perception is not fully understood, and it is likely that there is no one-to-one mapping from subcortical speech differentiation to perceptual performance.

Speech ABRs are therefore shown to contain useful information which can be used to automatically classify


different vowels. The classification of the speech ABR could find application in tuning hearing aids so that evoked responses to confusable speech sounds would be clustered into maximally separated classes. However, there is first a need to better understand the role played by envelope versus temporal fine structure related features of the speech ABR in normal and hearing impaired listeners. In normal listeners, recent studies have found that responses at the fundamental frequency of the envelope are enhanced with auditory training for speech recognition in noise (Song et al., 2012) and with long term experience with music or a tonal language (Krishnan et al., 2005; Wong et al., 2007; Bidelman et al., 2009). Also, Swaminathan and Heinz (2012), using an approach based on a computational auditory nerve model combined with perceptual tests, determined that envelope encoding was a more important contributor to speech perception in quiet and noise in normal listeners than neural temporal fine structure. On the other hand, recent studies have shown that with hearing loss, there is an augmentation of envelope related features in the neural response (i.e. the EFR) and this strengthening contributes to difficulties in understanding speech (Anderson et al., 2013a,b; Zhong et al., 2014). The differences found between normal and hearing impaired listeners show that the relationship between brainstem processing and perceptual categorization is not yet well understood and further study is needed before speech ABR can be fully utilized as a tool to improve hearing aid performance.

References

Advendano, C., Deng, L., Hermansky, H., Gold, B., 2004. Analysis and representation of speech. In: Greenberg, S., Ainsworth, W.A., Popper, A.N., Fay, R.R. (Eds.), Speech Processing in the Auditory System. Springer, New York, pp. 63–101.
Aiken, S.J., Picton, T.W., 2006. Envelope following responses to natural vowels. Audiol. Neuro-Otol. 11, 213–232.
Aiken, S.J., Picton, T.W., 2008. Envelope and spectral frequency-following responses to vowel sounds. Hear. Res. 245, 35–47.
Aiken, S.J., 2008. Human brain responses to speech sounds. PhD thesis. Institute of Medical Science, University of Toronto.
Anderson, S., Kraus, N., 2013. The potential role of the cABR in the assessment and management of hearing impairment. Int. J. Otolaryngol. (Article ID 604729, 10 pages).
Anderson, S., Parbery-Clark, A., White-Schwoch, T., Drehobl, S., Kraus, N., 2013a. Effects of hearing loss on the subcortical representation of speech cues. J. Acoust. Soc. Am. 133 (5), 3030–3038.
Anderson, S., White-Schwoch, T., Choi, H.J., Kraus, N., 2013b. Training changes processing of speech cues in older adults with hearing loss. Front. Syst. Neurosci. 7. http://dx.doi.org/10.3389/fnsys.2013.00097.
Assmann, P., Summerfield, Q., 2004. Perception of speech under adverse conditions. In: Greenberg, S., Ainsworth, W.A., Popper, A.N., Fay, R.R. (Eds.), Speech Processing in the Auditory System. Springer, New York, pp. 231–309.
Banai, K., Abrams, D., Kraus, N., 2007. Sensory-based learning disability: insight from brainstem processing of speech sounds. Int. J. Audiol. 46 (9), 524–532.
Bidelman, G.M., Gandour, J.T., Krishnan, A., 2009. Cross-domain effects of music and language experience on the representation of pitch in the human auditory brainstem. J. Cognitive Neurosci. 23, 425–434.


Brumberg, J.S., Nieto-Castanon, A., Kennedy, P.R., Guenther, F.H., 2010. Brain–computer interfaces for speech communication. Speech Commun. 52 (4), 367–379.
Cebulla, M., Stürzebecher, E., Elberling, C., 2006. Objective detection of auditory steady-state responses: comparison of one-sample and q-sample tests. J. Am. Acad. Audiol. 17 (2), 93–103.
Centanni, T.M., Sloan, A.M., Reed, A.C., Engineer, C.T., Rennaker, R.L. II, Kilgard, M.P., 2014. Detection and identification of speech sounds using cortical activity patterns. Neuroscience 258, 292–306.
Chandrasekaran, B., Kraus, N., 2010. The scalp-recorded brainstem response to speech: neural origins and plasticity. Psychophysiology 47, 236–246.
Clinard, C.G., Tremblay, K.L., 2013. What brainstem recordings may or may not be able to tell us about hearing aid-amplified signals. Semin. Hearing 34, 270–277.
Dajani, H.R., Purcell, D., Wong, W., Kunov, H., Picton, T.W., 2005. Recording human evoked potentials that follow the pitch contour of a natural vowel. IEEE Trans. Biomed. Eng. 52, 1614–1618.
Dajani, H.R., Heffernan, B., Giguère, C., 2013. Improving hearing aid fitting using the speech-evoked auditory brainstem response. In: Proc. International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'13), Osaka, Japan.
Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J.M., Brumberg, J.S., 2010. Silent speech interfaces. Speech Commun. 52 (4), 270–287.
Duda, R.O., Hart, P.E., Stork, D.G., 2001. Pattern Classification, second ed. Wiley-Interscience, Toronto, Canada.
Engineer, C.T., Perez, C.A., Chen, Y.T.H., Carraway, R.S., Reed, A.C., Shetake, J.A., Jakkamsetti, V., Chang, K.Q., Kilgard, M.P., 2008. Cortical activity patterns predict speech discrimination ability. Nat. Neurosci. 11, 603–608.
Greenberg, S., 1980. Temporal Neural Coding of Pitch and Vowel Quality. UCLA Working Papers in Phonetics, vol. 52 (Ph.D. Thesis, UCLA).
Hillenbrand, J., Gayvert, R.T., 1993. Vowel classification based on fundamental frequency and formant frequencies. J. Speech Hear. Res. 36, 694–700.
Hornickel, J., Skoe, E., Nicol, T., Zecker, S., Kraus, N., 2009. Subcortical differentiation of stop consonants relates to reading and speech-in-noise perception. Proc. Natl. Acad. Sci. U.S.A. 106, 13022–13027.
Johnson, K.L., Nicol, G.T., Kraus, N., 2005. Brain stem response to speech: a biological marker of auditory processing. Ear Hear. 26, 424–434.
Johnson, K.L., Nicol, G.T., Zecker, S.G., Bradlow, A.R., Skoe, E., Kraus, N., 2008. Brainstem encoding of voiced consonant–vowel stop syllables. Clin. Neurophysiol. 119, 2623–2635.
Klatt, D.H., 1980. Software for a cascade/parallel formant synthesizer. J. Acoust. Soc. Am. 67 (3), 971–995.
Kraus, N., Nicol, G.T., 2005. Brainstem origins for cortical 'what' and 'where' pathways in the auditory system. Trends Neurosci. 28, 176–181.
Krishnan, A., 2002. Human frequency-following responses: representation of steady-state synthetic vowels. Hear. Res. 166, 192–201.
Krishnan, A., Xu, Y., Gandour, J., Cariani, P., 2005. Encoding of pitch in the human brainstem is sensitive to language experience. Brain Res. Cogn. Brain Res. 25, 161–168.
Laroche, M., Dajani, H.R., Prévost, F., Marcoux, M., 2013. Brainstem auditory responses to resolved and unresolved harmonics of a synthetic vowel in quiet and noise. Ear Hearing 34, 63–74.
Pasley, B.N., David, S.V., Mesgarani, N., Flinker, A., Shamma, S.A., Crone, N.E., Knight, R.T., Chang, E.F., 2012. Reconstructing speech from human auditory cortex. PLoS Biol. 10 (1).
Peterson, G.E., Barney, H.L., 1952. Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24, 175–184.
Prévost, F., Laroche, M., Marcoux, A.M., Dajani, H.R., 2013. Objective measurement of physiological signal-to-noise gain in the brainstem response to a synthetic vowel. Clin. Neurophysiol. 124, 52–60.
Russo, N., Nicol, T., Musacchia, G., Kraus, N., 2004. Brainstem responses to speech syllables. Clin. Neurophysiol. 115, 2021–2030.
Sadeghian, A., Dajani, H.R., Chan, A., 2011. Classification of English vowels using speech evoked potentials. In: Proceedings of the International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC'11), Boston, USA, August 30–September 3, 2011, pp. 5000–5003.


Skoe, E., Kraus, N., 2010. Auditory brain stem response to complex sounds: a tutorial. Ear Hearing 31, 302–324.
Skoe, E., Nicol, T., Kraus, N., 2011. Cross-phaseogram: objective neural index of speech sound differentiation. J. Neurosci. Methods 196 (2), 308–317.
Song, J.H., Skoe, E., Banai, K., Kraus, N., 2012. Training to improve hearing speech in noise: biological mechanisms. Cereb. Cortex 22, 1180–1190.

Swaminathan, J., Heinz, M.G., 2012. Psychophysiological analyses demonstrate the importance of neural envelope coding for speech perception in noise. Neuroscience 32, 1747–1756.
Wong, P.C.M., Skoe, E., Russo, N.M., Dees, T., Kraus, N., 2007. Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nat. Neurosci. 10, 420–422.
Zhong, Z., Henry, K.S., Heinz, M.G., 2014. Sensorineural hearing loss amplifies neural coding of envelope information in the central auditory system of chinchillas. Hear. Res. 309, 55–62.