Neural Networks 45 (2013) 62–69
2013 Special Issue
Nonlinear spectro-temporal features based on a cochlear model for automatic speech recognition in a noisy situation

Yong-Sun Choi, Soo-Young Lee*

Department of Electrical Engineering and Brain Science Research Center, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, Republic of Korea

* Corresponding author. Tel.: +82 42 350 3431; fax: +82 42 350 8490. E-mail: [email protected] (S.-Y. Lee).
Keywords: Nonlinear auditory features; Cochlear model; Nonlinear amplification; Noise-robust speech recognition; Adaptive gain control
Abstract

A nonlinear speech feature extraction algorithm was developed by modeling human cochlear functions, and demonstrated as a noise-robust front-end for speech recognition systems. The algorithm is based on a model of the Organ of Corti in the human cochlea, with such features as the basilar membrane (BM), outer hair cells (OHCs), and inner hair cells (IHCs). The frequency-dependent nonlinear compression and amplification of the OHCs were modeled by lateral inhibition to enhance spectral contrasts. In particular, the compression coefficients have a frequency dependency based on psychoacoustic evidence. Spectral subtraction and temporal adaptation were applied in the time-frame domain. With long-term and short-term adaptation characteristics, these factors remove stationary or slowly varying components and amplify temporal changes such as onsets or offsets. The proposed features were evaluated with a noisy speech database and showed better performance than baseline methods such as mel-frequency cepstral coefficients (MFCCs) and RASTA-PLP in unknown noisy conditions.
1. Introduction

Although new speech features based on deep belief networks (Dahl, Ranzato, Mohamed, & Hinton, 2010; Dahl, Yu, Deng, & Acero, 2012; Lee & Lee, 2011) have recently resulted in much better recognition performance for clean speech, humans still outperform automatic speech recognition (ASR) systems in noisy environments. Since noise-robust processing is expected to take place in human hearing, one possible approach to better speech recognition performance is to model the functions of the human auditory system. In humans, acoustic speech signals are collected in the outer ear and transmitted to the inner ear. In the inner ear, it is primarily the cochlea that extracts speech features, which are delivered to the brain through auditory pathways (Dallos, Popper, & Fay, 1996; Nobili, Mammano, & Ashmore, 1998). The noise robustness may come from both cochlear functions and brain functions. However, research on noise-robust ASR front-ends has mainly modeled cochlear functions, because the structures and functions of the auditory periphery have been relatively well investigated, while higher brain functions are not yet fully understood. There have been several investigations attempting to model cochlear functions in the front-end of speech recognition systems to enhance performance in noisy environments (Kim, Lee,
& Kil, 1999). One example is mel-frequency cepstral coefficients (MFCCs) (ETSI, 2000), which model mel-scale frequency decomposition (Stevens, Volkmann, & Newman, 1937) and logarithmic compression. The performance of MFCCs is quite good in a clean environment, but degrades in a noisy environment. To improve noise robustness, more detailed functional models of the cochlea are required, especially of the Organ of Corti, in which the basilar membrane (BM), outer hair cells (OHCs), and inner hair cells (IHCs) reside. Therefore, more complex nonlinear functions of the cochlea have been modeled, such as lateral inhibition (Cheng & O'Shaughnessy, 1991; Park & Lee, 2003; Raj, Turicchia, Schmidt-Nielsen, & Sarpeshkar, 2007; Strope & Alwan, 1997; Virag, 1999) and temporal adaptation (Haque, Togneri, & Zaknich, 2009; Holmberg, Gelbart, & Hemment, 2006; Seneff, 1988). Lateral inhibition models two-tone suppression phenomena (Rhode & Recio, 2001), in which large signals suppress small neighboring signals in the frequency domain so that frequency contrasts are enhanced. Temporal adaptation is motivated by the neurotransmitter reservoir model of IHCs (Seneff, 1988) and enhances temporal edges by amplifying changes in the time domain such as onsets or offsets. In general, nonlinear compression and amplification have been modeled by the same simple function for all signal frequencies, such as the log function (ETSI, 2000) or the cubic root (Hermansky, 1990; Raj et al., 2007). Recently, Plack and Oxenham (2000) reported that the amount of compression might be different for different frequency components, which we believe may be an important characteristic of nonlinear feature extraction.
In this study, we report a nonlinear speech feature extraction model motivated by complex cochlear functions incorporating biologically inspired nonlinearities. The model is based on frequency-dependent nonlinear amplification combined with lateral inhibition and temporal adaptation. It is proposed as a front-end of an ASR system for a real-world noisy environment. The performance of the model was evaluated with a noisy speech database, and the developed model demonstrated better recognition rates than conventional methods such as the standard MFCC and RASTA-PLP (Hermansky, 1990; Hermansky & Morgan, 1994) features for word recognition tasks in unknown noisy conditions.

The developed model is neuromorphic in a broad sense. Although the term ``neuromorphic'' in a limited sense has been used for hardware and algorithms based on neuronal information processing at molecular and cellular levels, functional models of neuronal systems are also included in neuromorphic engineering in a broad sense. These functional models may be either qualitative or quantitative. However, as the modeled system becomes bigger, it becomes more difficult to estimate the model parameters directly from measured or simulated data. Then, as presented in this paper, an exhaustive parameter search may be adopted.

2. Nonlinear spectro-temporal coefficients

Speech signals are collected in the outer ear and delivered to the cochlea through the middle ear. At the base of the cochlea, traveling waves are generated, and they start to travel along the basilar membrane (BM). The BM decomposes time signals into frequency components (Dallos et al., 1996; Nobili et al., 1998). Although the Fourier transform performs a similar function, the frequency-to-place map in human auditory systems has a log-like relationship (Greenwood, 1990) that can be modeled by mel-frequency components with a linear-to-mel-frequency transform (Stevens et al., 1937). In the conventional ASR front-end, it usually helps to apply spectral subtraction (Boll, 1979) to the frequency components to remove stationary background noise. Then, outer hair cells (OHCs) perform nonlinear amplification or compression depending on the signal amplitude (Nobili et al., 1998). Although this nonlinear amplification or compression has been modeled with one simple function such as the log function (ETSI, 2000) or the cubic root (Hermansky, 1990; Raj et al., 2007), the amount of compression was reported to be different for different frequency components (Plack & Oxenham, 2000). Furthermore, one frequency component may be affected by neighboring frequency components, as observed in two-tone suppression experiments (Rhode & Recio, 2001) and modeled by lateral inhibition (Park & Lee, 2003). As the traveling waves propagate, inner hair cells (IHCs) distributed along the BM receive the acoustic signals, convert them into neural signals, and deliver them to the auditory nerve (Nobili et al., 1998). IHCs are also known to perform temporal adaptation (Seneff, 1988), which can be modeled by simple high-pass filtering (Holmberg et al., 2006).

We propose a feature extraction algorithm that mimics these functions of the cochlea. The whole procedure of extracting nonlinear features by the proposed nonlinear spectro-temporal coefficients (NSTC) model is shown in Fig. 1, along with the standard MFCC procedure. In the model, three stages are inserted, for spectral subtraction (SS), spectral enhancement (SE), and temporal adaptation (TA), respectively.
The SS and TA stages are designed to reduce quasi-stationary noises and to emphasize temporal changes, while the SE stage emphasizes stronger signals and attenuates weaker ones. Unless specifically noted otherwise, the parameters are the same as those of the standard MFCC (ETSI, 2000).
Fig. 1. Block diagrams for the standard MFCC and NSTC model procedures.
The transfer function of the middle ear is modeled as a pre-emphasis high-pass filter (ETSI, 2000),

x_pe(k) = x_t(k) − c_pe x_t(k − 1),  (1)
where x_pe, x_t, c_pe, and k are the pre-emphasized time signal, the time-domain input signal, the filter coefficient, and the time index, respectively. In the experiments, c_pe is set to 0.7 so that the gain of the transfer function at 2 kHz is about 15 dB, matching the forward gain of the human middle ear (Puria, 2003). After the pre-emphasis, x_pe(k) values in 25 ms intervals are Hamming windowed and transformed into 129 linear frequency components X_ft(l, i) by calculating the absolute magnitudes of the fast Fourier transform (FFT). The variables l and i denote the linear frequency and time frame indices, respectively. The windows are shifted by 10 ms with 15 ms overlaps.
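As an illustration, a minimal NumPy sketch of this front end is given below: pre-emphasis, 25 ms Hamming windows with a 10 ms shift, and 129 magnitude bins per frame. The function name, the 256-point FFT (which yields 129 bins at the stated 8 kHz sampling rate), and the assumption that the input is at least one frame long are ours, not the authors' implementation.

```python
import numpy as np

def front_end(x, fs=8000, c_pe=0.7, n_fft=256):
    """Pre-emphasis (Eq. (1)), 25 ms Hamming windows shifted by 10 ms,
    and 129 linear-frequency magnitude bins |FFT| per frame.
    Assumes len(x) >= one frame (200 samples at 8 kHz)."""
    # Eq. (1): x_pe(k) = x_t(k) - c_pe * x_t(k - 1), modeling the middle ear.
    x_pe = np.append(x[0], x[1:] - c_pe * x[:-1])

    frame_len = int(0.025 * fs)    # 200 samples = 25 ms
    frame_shift = int(0.010 * fs)  # 80 samples = 10 ms (15 ms overlap)
    window = np.hamming(frame_len)

    n_frames = 1 + (len(x_pe) - frame_len) // frame_shift
    X_ft = np.empty((n_fft // 2 + 1, n_frames))  # 129 bins x frames
    for i in range(n_frames):
        frame = x_pe[i * frame_shift : i * frame_shift + frame_len] * window
        X_ft[:, i] = np.abs(np.fft.rfft(frame, n_fft))
    return X_ft
```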
2.1. Spectral subtraction with time-dependent noise level

For each frequency component, the spectral subtraction removes stationary noises as

X_ss(l, i) = max[X_ft(l, i) − n_f(l, i), β X_ft(l, i)],  (2)
where X_ss and X_ft are the linear-frequency-domain signals with and without spectral subtraction, respectively. The max operation provides a smooth transition for small X_ss values close to the noise level, and β is set to 0.3. Although the noise spectrum is assumed to be time-invariant in many spectral subtraction algorithms, we adopt a slowly varying noise spectrum defined as a weighted moving average,

n_f(l, i) = η_n Σ_{q=0}^{i} (1 − η_n)^{i−q} X_ft(l, q),  (3)

which is implemented recursively as

n_f(l, i) = n_f(l, i − 1) + η_n [X_ft(l, i) − n_f(l, i − 1)].  (4)
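A minimal sketch of Eqs. (2)–(4) follows. The zero initialization of the noise estimate is our reading of the q = 0 start of the sum in Eq. (3); note also that, for a first-order leaky integrator, η_n = 0.004 with a 10 ms frame shift corresponds to a time constant of roughly 0.01/0.004 = 2.5 s, consistent with the value quoted below.

```python
import numpy as np

def spectral_subtraction(X_ft, eta_n=0.004, beta=0.3):
    """Eqs. (2)-(4): subtract a slowly varying noise estimate per frequency
    bin. X_ft is a (bins x frames) magnitude spectrogram."""
    n_bins, n_frames = X_ft.shape
    X_ss = np.empty_like(X_ft)
    n_f = np.zeros(n_bins)  # noise estimate; n_f(l, -1) = 0 per Eq. (3)
    for i in range(n_frames):
        # Eq. (4): recursive weighted moving average of the spectrum.
        n_f += eta_n * (X_ft[:, i] - n_f)
        # Eq. (2): the max() keeps a smooth floor of beta * X_ft
        # for components close to the noise level.
        X_ss[:, i] = np.maximum(X_ft[:, i] - n_f, beta * X_ft[:, i])
    return X_ss
```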
The value η_n is related to the time constant τ_n of the long-term adaptation, which reflects the assumed temporal amplitude changes of the noises. Here, η_n = 0.004 is used to set τ_n = 2.5 s with a 10 ms frame shift. The values β and η_n were found to provide the best performance after all the other parameters of the model were fixed. The tuning ranges were 0–0.9 for β and 0.001–0.010 for η_n, with steps of 0.1 and 0.001, respectively.

2.2. Spectral enhancement with frequency-dependent compression

The frequency-to-distance relationship along the basilar membrane has a log-like nonlinearity (Greenwood, 1990). Therefore, it is common to convert the frequency spectrum from the linear frequency domain into a log-like mel-frequency domain (Stevens et al., 1937). In this study, the absolute magnitude of the FFT output X_ss(l, i) is transformed into the mel-filter output X_mf(m, i). Here, m is the index of the 35 mel-frequency filters, and the center frequencies of the first and the last mel-filters are 93.75 Hz and 3.78 kHz, respectively.

OHCs are distributed along the basilar membrane and work as adaptive gain controllers for each frequency component (Nobili et al., 1998). They amplify smaller signals more than bigger signals. However, the amplification factor of a frequency component depends not only on its own amplitude but also on those of nearby frequency components; this is observed indirectly in two-tone suppression phenomena (Rhode & Recio, 2001). The spectral enhancement stage in Fig. 1 models this nonlinear amplification and lateral inhibition.

Fig. 2. Block diagram for the spectral enhancement stage of the NSTC model.

A block diagram for the spectral enhancement stage of the proposed NSTC model is shown in Fig. 2, and the mathematical expressions are
a(m, i) = Σ_{m′} X_mf(m′, i) T(m − m′),  (5)
g(m, i) = [a(m, i)/v(m)]^{α(m)},  (6)

X_se(m, i) = g(m, i) X_mf(m, i),  (7)
where X_se(m, i) and X_mf(m, i) are the spectral-enhanced and mel-filtered amplitudes, respectively, and g(m, i) is the adaptive gain of the mth filter bank at the ith frame. The adaptive gain g(m, i) is a nonlinear function of the effective signal amplitude a(m, i) with a normalization constant v(m) and a compression exponent α(m). The effective amplitude a(m, i) is obtained as the weighted sum of neighboring frequency components, with weights defined by T(·). In this study, we use the weight function T as a localized triangular function, i.e.,

T(m) = max(1 − |m|/b, 0),  (8)
which covers 2b + 1 mel-frequency bands. For the normalization, we set the scaling factor v(m) equal to the average value of a(m, i) over time frames for each channel, so that the adaptive gain g(m, i) becomes 1 at the center of its dynamic range. The amount of compression is determined by α(m). With α(m) = 0, the gain becomes independent of the signal amplitude, while negative α(m) means amplitude compression, i.e., higher gain for smaller signals. As shown in Plack and Oxenham (2000), the compression parameter α(m) is assumed to vary along two straight lines in the mel-frequency domain as

α(m) = α_max − (α_max − α_min)(m − 1)/(m_c − 1)  for 1 ≤ m ≤ m_c,
α(m) = α_min  for m ≥ m_c.  (9)

Fig. 3. Adaptive gain values for frequency-dependent compression of the NSTC model.
The best parameters were estimated by an exhaustive search, resulting in b = 3, α_max = −0.3, α_min = −0.9, and m_c = 26. The search ranges were 1–10 for b, −1.0–0 for α, and 1–31 for m_c, with step sizes of 1, 0.1, and 5, respectively. Fig. 3 shows the adaptive gain values of the frequency-dependent compression at six mel-frequencies. The legend shows each filter index m and the corresponding center frequency. Since α(m) is held constant at frequencies above 2125 Hz, the compression gain curves there are the same as that of the 26th filter. These parameters were found to give the best performance in a noisy environment. The α values are more negative, i.e., the compression is stronger, at high frequencies. Due to the pre-emphasis, high-frequency components are amplified more, and the resulting signals are more subject to
noises and signal distortion. However, in the model, the high-frequency components are nonlinearly compressed with smaller (more negative) α to gain noise robustness. This may be a smart way to utilize the narrower dynamic ranges of speech signals at high frequencies while maintaining noise robustness.

The spectral enhancement stage also has lateral inhibition characteristics, with the T function widely spanned in the mel-frequency domain. In Park and Lee (2003), the spectral enhancement function was explicitly encoded by lateral inhibition networks with difference-of-Gaussian synaptic weights among nearby frequency components. In the NSTC model, the spectral enhancement is not explicitly encoded, but is included implicitly in the effective amplitude and adaptive compression functions. When larger (smaller) amplitude components exist in neighboring frequencies, the effective amplitude of the mel-frequency component becomes larger (smaller) and the adaptive gain becomes smaller (greater). As shown in Fig. 4, this implicit lateral inhibition results in sharper edges in the spectral domain.

Fig. 4. An example of spectral enhancement of the NSTC model.

A similar concept was reported in Raj et al. (2007). In their model, the companding is performed in the narrow linear frequency domain, and a fixed amplification is used with a cubic-root nonlinearity. In contrast, the NSTC model operates in the mel-frequency domain with a wideband neighborhood, and the compression is done naturally with a separately defined α(m) coefficient at each mel-frequency.
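A minimal sketch of the SE stage (Eqs. (5)–(9)) is given below, assuming a 35-band mel spectrogram X_mf and the searched parameters b = 3, α_max = −0.3, α_min = −0.9, m_c = 26. The zero treatment of missing neighbors at the band edges and the epsilon guard for silent bands are our assumptions; the paper does not specify them.

```python
import numpy as np

def spectral_enhancement(X_mf, b=3, alpha_max=-0.3, alpha_min=-0.9, m_c=26):
    """Eqs. (5)-(9): implicit lateral inhibition with frequency-dependent
    compression. X_mf is a (35 mel bands x frames) magnitude array."""
    n_mel, _ = X_mf.shape

    # Eq. (8): localized triangular weights T(m) = max(1 - |m|/b, 0),
    # spanning 2b + 1 mel bands.
    offsets = np.arange(-b, b + 1)
    T = np.maximum(1.0 - np.abs(offsets) / b, 0.0)

    # Eq. (5): effective amplitude as a weighted sum over neighboring bands.
    # Missing neighbors at the edges are treated as zero (assumption).
    a = np.zeros_like(X_mf)
    for k, off in enumerate(offsets):
        lo, hi = max(0, -off), min(n_mel, n_mel - off)
        a[lo:hi, :] += T[k] * X_mf[lo + off:hi + off, :]

    # Eq. (9): compression exponent along two straight lines in mel index m.
    m = np.arange(1, n_mel + 1)
    alpha = np.where(m <= m_c,
                     alpha_max - (alpha_max - alpha_min) * (m - 1) / (m_c - 1),
                     alpha_min)

    # Normalization v(m): time average of a(m, i) per band, so that the
    # adaptive gain is 1 at the center of the dynamic range.
    v = a.mean(axis=1, keepdims=True)

    # Eqs. (6)-(7): adaptive gain and enhanced amplitude. The epsilon is
    # a numerical guard for silent bands, not part of the model.
    eps = 1e-12
    g = ((a + eps) / (v + eps)) ** alpha[:, None]
    return g * X_mf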
2.3. Temporal adaptation

While the adaptive gain values are set by OHCs, IHCs convert mechanical vibration signals into neural signals (Nobili et al., 1998). The IHCs respond sensitively to changes of signal amplitude, and generate large outputs when signals start to appear (onset) or disappear (offset). This short-term adaptation can be modeled as a high-pass filter (Holmberg et al., 2006), and the transfer function of the IHCs is denoted as

H(z) = (1 − z^{−1}) / (1 − c_a z^{−1}).  (10)

The high-pass filter was implemented as

X_ta(m, i) = c_a X_ta(m, i − 1) + [X_se(m, i) − X_se(m, i − 1)],  (11)
where c_a is a coefficient related to the time constant τ_a of the short-term adaptation. The values X_se(m, i) and X_ta(m, i) are the signals before and after the high-pass filtering at the ith time frame of the mth mel-filter, respectively. Human psychoacoustic experiments showed that the time constant τ_a of short-term adaptation lies between 200 ms and 300 ms (Holmberg et al., 2006; Spoor, Eggermont, & Odenthal, 1976) and that the time constant effective in simultaneous masking is 200 ms (Zwicker & Fastl, 1990). Thus, in our experiments, τ_a is set to 200 ms and the corresponding c_a is 0.9512 with a 10 ms frame shift. The cut-off frequency of the high-pass filtering becomes 0.76 Hz.

Fig. 5. An example of temporal adaptation of the NSTC model.

The effects of temporal adaptation are shown in Fig. 5. The dashed line and the solid line show the input signal X_se and the output signal X_ta, respectively. Even though the input signal has a rectangular shape, the output signal has nonlinear transitions along the time frames. In particular, the onset and offset of the input signal are emphasized so that the temporal contrast is enhanced.
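A minimal sketch of Eqs. (10)–(11) follows. The relation c_a = exp(−T/τ_a) is our reading of the stated numbers, since exp(−0.01/0.2) ≈ 0.9512 reproduces the quoted coefficient for τ_a = 200 ms and a 10 ms frame shift; the first-frame initialization is also an assumption.

```python
import numpy as np

def temporal_adaptation(X_se, tau_a=0.2, frame_shift=0.01):
    """Eqs. (10)-(11): first-order high-pass filtering along time frames,
    emphasizing onsets and offsets. c_a = exp(-frame_shift / tau_a) gives
    0.9512 for tau_a = 200 ms and a 10 ms shift (our interpretation)."""
    c_a = np.exp(-frame_shift / tau_a)
    X_ta = np.zeros_like(X_se)
    X_ta[:, 0] = X_se[:, 0]  # first-frame initialization (assumption)
    for i in range(1, X_se.shape[1]):
        # Eq. (11): X_ta(m,i) = c_a X_ta(m,i-1) + [X_se(m,i) - X_se(m,i-1)]
        X_ta[:, i] = c_a * X_ta[:, i - 1] + (X_se[:, i] - X_se[:, i - 1])
    return X_ta
```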
3. Experimental results

3.1. Database and baseline system

The proposed model was tested on the automatic speech recognition task in noisy environments with the AURORA2 database (Hirsch & Pearce, 2000). The speech recognition rates were compared against popular methods such as the standard MFCC (ETSI, 2000) and RASTA-PLP (Hermansky, 1990; Hermansky & Morgan, 1994). The task is connected-word recognition with 11 digits, from 'zero' to 'nine' plus 'oh'. There are 8440 utterances from 55 males and 55 females for training, and the sampling frequency is 8 kHz. The AURORA2 database provides two training modes, clean training and multi-training. While the former uses only clean speech for training, the latter also includes noisy speech. In this study, our interest resides in speech recognition in a noisy environment when the noise is unknown; thus the results are mainly compared for the clean-training mode. There are 4004 utterances for the test, and 1001 utterances are tested for each noise type. A hidden Markov model (HMM) recognizer was used (Rabiner, 1989), trained with 16 states per word, simple left-to-right models, and three Gaussian mixtures per state.

3.2. Overall performance evaluation

The primary effects of the model are edge enhancements in the spectral and temporal domains. For example, the features of the two connected words 'nine' and 'four' in the clean condition were extracted by the standard MFCC and NSTC, and are shown in Fig. 6(a) and (b), respectively. Both features were obtained before the discrete cosine transform (DCT), and the number of MFCC mel-frequency bins was set to 35 for the comparison. The example shows that the spectro-temporal features of NSTC enhance the edge contrast in both the spectral and the temporal domain.
Fig. 6. Spectrogram examples of (a) MFCC and (b) NSTC in the clean condition. The speech includes connected words ‘nine’ and ‘four’.
Table 1
Recognition rates (%) with NSTC features for clean-training and multi-training cases for ten noise types. Set A: subway, babble, car, exhibition; Set B: restaurant, street, airport, train station; Set C: subway (MIRS), street (MIRS).

Clean training
SNR (dB)  Subway  Babble  Car   Exhibition  Restaurant  Street  Airport  Train st.  Subway (MIRS)  Street (MIRS)  Average
Clean     98.5    98.4    98.4  98.1        98.5        98.4    98.4     98.1       97.8           97.9           98.2
20        97.0    96.9    97.2  95.0        96.9        96.7    97.3     96.5       95.0           95.0           96.4
15        94.6    92.7    95.7  91.1        89.9        93.3    94.7     93.7       91.3           91.1           92.8
10        86.6    75.8    90.6  78.8        73.7        82.3    83.4     84.2       82.6           80.5           81.8
5         63.8    42.4    68.5  51.4        40.7        59.6    56.3     58.8       59.4           58.4           55.9
0         27.6    2.1     29.8  13.6        1.6         26.4    18.6     21.0       29.5           30.6           20.1
−5        −0.9    −20.2   4.2   −7.4        −24.8       2.4     −7.0     −2.1       8.7            8.9            −3.8

Multi-training
SNR (dB)  Subway  Babble  Car   Exhibition  Restaurant  Street  Airport  Train st.  Subway (MIRS)  Street (MIRS)  Average
Clean     98.2    97.3    97.3  97.3        98.2        97.3    97.3     97.3       96.5           96.6           97.3
20        96.8    97.1    96.9  95.1        96.6        96.6    96.8     96.4       95.3           95.5           96.3
15        96.0    96.2    95.8  93.9        94.4        95.3    95.7     94.7       93.6           94.2           95.0
10        93.2    92.4    94.0  90.4        88.1        92.6    92.8     91.0       89.9           90.4           91.5
5         85.5    80.5    88.5  82.0        68.8        84.7    83.6     81.4       81.1           81.2           81.7
0         64.7    43.6    72.8  59.6        32.7        65.1    56.3     60.0       61.5           60.1           57.6
−5        28.9    0.1     34.5  24.4        −8.0        32.6    16.2     25.3       31.9           31.7           21.8
In particular, NSTC features show more detailed peaks in the spectral domain, and the onset and offset positions are emphasized in the temporal domain compared to MFCC features.

The performance of the developed NSTC was evaluated with the AURORA2 database. The recognition rates of the clean-training and multi-training cases for ten noise types at seven signal-to-noise ratio (SNR) values are shown in Table 1, and the average recognition rates at each SNR value are shown in Fig. 7 together with those of the standard MFCC features. The ten noise types are subway, babble, car, exhibition, restaurant, street, airport, train
station, subway (MIRS), and street (MIRS). The notation MIRS stands for the filtered output of the telephone channel. Due to the penalty on false recognition adopted by the AURORA2 scoring scheme, the recognition rates may be negative.

For the clean-training cases, the NSTC model enhances the performance in noisy conditions at almost all SNR values. For clean speech, the NSTC performance is comparable to or slightly poorer than that of MFCC. However, between 10 and 20 dB SNR, the false recognition rates become much smaller, i.e., from 5.9% to 3.6%, from 15.0% to 7.2%, and from 34.5% to 18.2% at 20, 15, and 10 dB
Table 2
Recognition rates (%) with and without all three stages of NSTC. SS–MFCC: without SE and TA; SE–TA: NSTC without SS; SS–TA: NSTC without SE; SS–SE: NSTC without TA.

SNR (dB)  MFCC  NSTC (SS–SE–TA)  SS–MFCC  SE–TA  SS–TA  SS–SE  NSTC, constant α
Clean     99.0  98.2             99.0     98.2   99.0   98.5   98.3
20        94.1  96.4             94.9     95.8   96.1   96.1   95.3
15        85.0  92.8             86.7     90.9   88.4   91.4   89.3
10        65.5  81.8             68.1     75.9   68.0   76.6   74.1
5         38.6  55.9             40.8     46.0   38.0   45.7   46.1
0         17.1  20.1             15.5     12.1   12.2   11.8   15.1
−5        8.6   −3.8             5.3      −1.0   4.2    −1.1   −3.7
Fig. 7. Recognition rates of NSTC and MFCC versus SNR for clean-training and multi-training cases.
Fig. 8. Recognition rates of NSTC and substage models versus SNR.
SNR, respectively. Since many practical applications require a false rate of less than 10%, the NSTC model extends the applicable range of noisy speech down to below 15 dB SNR. However, the performance becomes much worse at −5 dB SNR. This can be explained by the function of the spectral enhancement stage, whose main operation is lateral inhibition with frequency-dependent compression. When signals and noises are mixed, nearby frequency components suppress each other and the larger one is emphasized. Therefore, noises are suppressed at 0 dB or higher SNR, while speech signals are suppressed and noises are emphasized below 0 dB SNR. Above 0 dB, the spectral enhancement stage amplifies the larger speech signals and the temporal adaptation stage amplifies the onset and offset signals; the enhanced contrast in both frequency and time thus introduces noise robustness.

For the multi-training case, the proposed model shows slightly poorer performance than the standard MFCC. Since the proposed NSTC model extracts noise-robust features, the additional noisy training data of the multi-training case do not contribute much to the performance. On the other hand, the noise-sensitive MFCC benefits more from the multi-training. Although the multi-training cases provide the upper limits of performance, noise characteristics are unknown in real applications and only the clean-training cases are practically applicable. Therefore, we focus on the clean-training cases.
3.3. Contribution of each additional stage

The NSTC model has three stages added to the standard MFCC speech features: (1) spectral subtraction (SS), (2) spectral enhancement (SE), and (3) temporal adaptation (TA). We tested the contribution of each stage to the whole system by removing them one by one, with five trials for evaluating the stage contributions. Because long-term adaptation is applied to the spectral subtraction (SS), the proposed SS stage is different from the conventional SS method; thus the SS–MFCC combination was tested first. Then, by removing one of the three stages at a time, the performances of SE–TA (without the SS stage), SS–TA (without the SE stage), and SS–SE (without the TA stage) were evaluated. When the SE stage was removed, a log function was inserted for compression. One of the novel properties of the SE stage is the different compression coefficient α for different frequencies. Therefore, the NSTC with a modified SE, i.e., with a constant α over the whole frequency range, was also tested to check the effect of the different α values. The constant coefficient α was selected from 10 candidates, −1.0–0 with a 0.1 step; α = −0.3 was found to provide the best performance and was used here.

The speech recognition rates of the NSTC and the five trials at each SNR value are shown in Fig. 8, and the corresponding recognition rates are summarized in Table 2. With all three additional stages, the average recognition rate for noisy speech with 0–20 dB SNRs, computed according to the AURORA2 official protocol, is 69.40%, while the averages are reduced to 61.20% (SS–MFCC), 64.14% (SE–TA), 60.72% (SS–TA), 64.32% (SS–SE), and 63.96% (NSTC with constant α). Adaptive spectral subtraction (SS) has the least contribution, while spectral enhancement (SE) provides the highest improvement in recognition rate. Assigning different α values to different frequency bands is also important. However, all three stages need to be integrated for the best performance.
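As a concrete check, the 0–20 dB averages quoted above follow from the Table 2 recognition rates under the AURORA2 official protocol (mean over 20, 15, 10, 5, and 0 dB); the snippet below reproduces them (two of the quoted figures differ slightly, presumably because they were computed from unrounded rates).

```python
# Recognition rates at 20, 15, 10, 5, 0 dB SNR, taken from Table 2.
rates = {
    "NSTC (SS-SE-TA)":      [96.4, 92.8, 81.8, 55.9, 20.1],
    "SS-MFCC":              [94.9, 86.7, 68.1, 40.8, 15.5],
    "SE-TA (no SS)":        [95.8, 90.9, 75.9, 46.0, 12.1],
    "SS-TA (no SE)":        [96.1, 88.4, 68.0, 38.0, 12.2],
    "SS-SE (no TA)":        [96.1, 91.4, 76.6, 45.7, 11.8],
    "NSTC, constant alpha": [95.3, 89.3, 74.1, 46.1, 15.1],
}
for name, r in rates.items():
    # AURORA2 official protocol: average over the 0-20 dB SNR conditions.
    print(f"{name}: {sum(r) / len(r):.2f}%")
# Prints 69.40, 61.20, 64.14, 60.54, 64.32, 63.98; the text quotes
# 60.72% (SS-TA) and 63.96% (constant alpha) for the last two cases.
```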
3.4. Comparison with other noise-robust features

The performance of NSTC was compared with those of other noise-robust features such as cepstral mean normalization (CMN) (Tufekci, 2007) and RASTA-PLP (Hermansky, 1990; Hermansky & Morgan, 1994). Fig. 9 shows the performances of five experiments: NSTC, SS–MFCC, MFCC–CMN, SS–MFCC–CMN, and RASTA-PLP. The MFCC is our primary baseline system, and its advanced form is SS–MFCC–CMN, which has a spectral subtraction stage before the MFCC and a cepstral mean normalization stage after the DCT. RASTA-PLP is another noise-robust feature extraction method based on human auditory perception.
Table 3
False recognition rates (%) of NSTC and other noise-robust features.

SNR (dB)        NSTC  MFCC  SS–MFCC–CMN  RASTA-PLP
Clean           1.8   1.0   1.0          1.3
20              3.6   5.9   4.5          4.6
15              7.2   15.0  12.6         9.0
10              18.2  34.5  32.7         21.4
5               44.1  61.4  62.9         47.7
0               79.9  82.9  86.0         74.9
Av. (0–20 dB)   30.6  39.9  39.7         31.5
Av. (5–20 dB)   18.3  29.2  28.2         20.7
Av. (10–20 dB)  9.7   18.5  16.6         11.7
Fig. 9. Recognition rates of NSTC and other noise-robust features versus SNR.
The misclassification rates of NSTC and the absolute error improvements of NSTC over these methods are summarized in Table 3. As shown in Fig. 9 and Table 3, the performance of NSTC is much better than those of SS–MFCC, MFCC–CMN, and SS–MFCC–CMN at almost all SNRs. The NSTC performance is also slightly better than that of RASTA-PLP. At the practical noise levels between 10 and 20 dB SNR, the NSTC features reduce the false recognition rates of RASTA-PLP by 15–22%, i.e., from 4.6% to 3.6%, from 9.0% to 7.2%, and from 21.4% to 18.2%, respectively. The average false recognition rate from 5 to 20 dB decreases by 11.6%, from 20.7% to 18.3%. Likewise, the average false recognition rate from 10 to 20 dB decreases by 17.1%, from 11.7% to 9.7%.

Recently, multi-microphone blind signal separation algorithms have successfully produced 15 dB or higher SNRs from mixtures at 0 dB or lower SNRs (Lee, Oh, & Lee, 2008; Park, Oh, & Lee, 2009). With such speech enhancement algorithms, the SNR range of practical interest is around 15 dB, where the developed NSTC outperforms all the other speech features.

In Table 4, the average recognition rates for each noise type are shown for five different SNR values, i.e., 0, 5, 10, 15, and
20 dB (AURORA2 official protocol). The ten noise types from the AURORA2 database are subway, babble, car, exhibition, restaurant, street, airport, train station, subway (MIRS), and street (MIRS), where MIRS stands for the filtered output of the telephone channel, and S–M–C and R-PLP represent SS–MFCC–CMN and RASTA-PLP, respectively. For all noise types, the proposed model demonstrates much better recognition rates than the MFCC-based methods. However, the NSTC works better for certain noise types, while RASTA-PLP works better for the others.

Compared to MFCC and SS–MFCC–CMN, the model is very robust to babble and car noises, which have high energy in the lower frequency bands. When noises are added in the low-frequency bands, the lateral inhibition results in cleaner signals, because the main speech energy is also in the low-frequency bands and the SNR is greater than 0 in the majority of the frequency bands. On the other hand, if high-frequency noises are added, some high-frequency bands may have negative SNRs, and the model may enhance the noise components in those bands. For example, subway noises have wide-band time-periodic components and result in only a small improvement. When the noise spectrum varies rapidly along time, the spectral subtraction does not work properly, and the sudden onsets and offsets of noises may be amplified by the temporal adaptation. The exhibition noises contain high-frequency spot noises, such as people laughing or children crying, and result in poorer performance. The same is actually true of human auditory systems.

Compared to RASTA-PLP, the model has better performance for subway and car noises but worse performance for babble and restaurant noises. For a detailed comparison, the absolute error improvements of NSTC relative to RASTA-PLP for all noise types at each SNR value are shown in Table 5, along with the average misclassification rates of the NSTC. Above 10 dB SNR, the NSTC shows better performance even for the babble and restaurant noises; however, it does not show such improvement below 10 dB SNR. This may be explained by the lateral inhibition and the energy distributions. The subway and car noises have non-human-voice components, and their energy peaks are located at 2469 Hz for subway noise and at 94 Hz (primary) and 781 Hz (secondary) for car noise. However, the babble and restaurant noises consist mainly of human-voice noises, and their peaks are located around 625 Hz. Since the SNRs are calculated over a wide temporal range, even though the SNR is above 0 dB, the speech-like-noise level can be higher than the speech signal level at certain frames. Thus, at those frames, the lateral inhibition of the NSTC suppresses the speech signals and amplifies the speech-like-noise signals. This may distort the original speech features and degrade the recognition performance.
Table 4
Average recognition rates (%) for noisy speech with five different SNRs from 0 to 20 dB (AURORA2 official protocol). S–M–C and R-PLP represent SS–MFCC–CMN and RASTA-PLP, respectively.

Noise type      NSTC  MFCC  S–M–C  R-PLP
Subway          73.9  69.5  62.4   68.7
Babble          62.0  49.9  53.5   69.0
Car             76.4  60.6  62.8   66.1
Exhibition      66.0  65.4  57.1   64.0
Restaurant      60.6  52.6  56.0   68.3
Street          71.7  61.5  61.7   69.2
Airport         70.0  53.3  60.9   72.9
Train station   70.8  55.6  60.2   68.3
Subway (MIRS)   71.6  66.2  62.8   69.5
Street (MIRS)   71.1  66.1  65.4   68.9
Average         69.4  60.1  60.3   68.5
Table 5
Absolute recognition rate improvements (%) of NSTC over RASTA-PLP for ten noise types, together with the average error rates of both features. Subw, Babb, Exhi, Rest, Stre, Airp, TrSt, SubM, and StrM denote subway, babble, exhibition, restaurant, street, airport, train station, subway (MIRS), and street (MIRS), respectively.

SNR (dB)  Subw   Babb   Car   Exhi   Rest   Stre  Airp   TrSt  SubM  StrM  NSTC error  RASTA-PLP error  Avg. impr. (abs.)  Avg. impr. (rel. %)
Clean     0.1    –      –     –      0.1    –     –      –     –     –     1.8         1.3              −0.5               −38.4
20        1.1    2.1    0.0   1.3    4.4    0.1   1.6    –     –     2.3   3.6         4.6              1.0                21.7
15        3.1    2.7    2.4   3.3    2.6    1.6   1.3    –     –     5.0   7.2         9.0              1.8                20.0
10        8.3    −3.3   13.0  4.1    −4.0   3.6   −0.2   –     –     4.4   18.2        21.4             3.2                15.0
5         11.0   −12.8  25.4  6.9    −14.4  6.6   −3.9   –     –     −3.1  44.1        47.7             3.6                7.5
0         2.8    −23.9  10.7  −5.7   −27.4  0.6   −13.2  –     –     2.5   79.9        74.9             −5.0               −6.7
−5        −13.8  −31.6  −6.2  −15.8  −38.3  −9.3  −21.6  –     –     –     103.8       88.1             −15.7              −17.8
components. Spectral subtraction and high-pass filtering in the time-frame domain are introduced for long-term and short-term temporal adaptation, respectively. It has been demonstrated that, without any knowledge of the noise, the proposed NSTC model has better recognition performance in noisy environments.

Acknowledgments

This research was supported by the Korean Brain Neuroinformatics Research Program at the earlier stage, and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2011-0018291 and 2012-0005793) at the latter stage.

References

Boll, S. F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, & Signal Processing, 27(2), 113–120.
Cheng, Y. M., & O'Shaughnessy, D. (1991). Speech enhancement based conceptually on auditory evidence. IEEE Transactions on Signal Processing, 39(9), 1943–1954.
Dahl, G., Ranzato, M., Mohamed, A., & Hinton, G. (2010). Phone recognition with the mean–covariance restricted Boltzmann machine. In Advances in neural information processing systems.
Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2012). Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing.
Dallos, P., Popper, A. N., & Fay, R. R. (1996). Springer handbook of auditory research: vol. 8. The cochlea. Springer.
ETSI (2000). Speech processing, transmission and quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms. ETSI ES 201 108 V1.1.2. http://www.etsi.org.
Greenwood, D. D. (1990). A cochlear frequency-position function for several species—29 years later. Journal of the Acoustical Society of America, 87, 2592–2605.
Haque, S., Togneri, R., & Zaknich, A. (2009). Perceptual features for automatic speech recognition in noisy environments. Speech Communication, 51, 58–75.
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4), 1738–1752.
Hermansky, H., & Morgan, N. (1994). RASTA processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4), 578–589.
Hirsch, H. G., & Pearce, D. (2000). The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In Proc. ISCA ITRW ASR2000 (pp. 181–188).
Holmberg, M., Gelbart, D., & Hemment, W. (2006). Automatic speech recognition with an adaptation model motivated by auditory processing. IEEE Transactions on Audio, Speech and Language Processing, 14(1), 43–49.
Kim, D.-S., Lee, S.-Y., & Kil, R. M. (1999). Auditory processing of speech signals for robust speech recognition in real-world noisy environments. IEEE Transactions on Speech and Audio Processing, 7, 55–69.
Lee, J., & Lee, S.-Y. (2011). Deep learning of speech features for improved phonetic recognition. In INTERSPEECH-2011 (pp. 1249–1252).
Lee, J. H., Oh, S. H., & Lee, S. Y. (2008). Binaural semi-blind dereverberation of noisy convoluted speech signals. Neurocomputing, 72(1–3), 636–642.
Nobili, R., Mammano, F., & Ashmore, J. (1998). How well do we understand the cochlea? Trends in Neurosciences, 21(4), 159–167.
Park, K.-Y., & Lee, S.-Y. (2003). An engineering model of the masking for the noise-robust speech recognition. Neurocomputing, 52–54, 615–620.
Park, H. M., Oh, S. H., & Lee, S. Y. (2009). A Bark-scale filter bank approach to independent component analysis for acoustic mixtures. Neurocomputing, 73(1–3), 304–314.
Plack, C. J., & Oxenham, A. J. (2000). Basilar-membrane nonlinearity estimated by pulsation threshold. Journal of the Acoustical Society of America, 107(1), 501–507.
Puria, S. (2003). Measurements of human middle ear forward and reverse acoustics: implications for otoacoustic emissions. Journal of the Acoustical Society of America, 113, 2773–2789.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Raj, B., Turicchia, L., Schmidt-Nielsen, B., & Sarpeshkar, R. (2007). An FFT-based companding front end for noise-robust automatic speech recognition. EURASIP Journal on Audio, Speech and Music Processing, 1–13.
Rhode, W. S., & Recio, A. (2001). Multicomponent stimulus interactions observed in basilar-membrane vibration in the basal region of the chinchilla cochlea. Journal of the Acoustical Society of America, 110(6), 3140–3154.
Seneff, S. (1988). A joint synchrony/mean rate model of auditory speech processing. Journal of Phonetics, 16, 55–76.
Spoor, A., Eggermont, J. J., & Odenthal, D. W. (1976). Comparison of human and animal data concerning adaptation and masking of eighth nerve compound action potential. In J. Ruber, C. Elberling, & G. Solomon (Eds.), Electrocochleography (pp. 183–198). Baltimore, MD: University Park.
Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). A scale for the measurement of the psychological magnitude pitch. Journal of the Acoustical Society of America, 8, 185–190.
Strope, B., & Alwan, A. (1997). A model of dynamic auditory perception and its application to robust word recognition. IEEE Transactions on Speech and Audio Processing, 5(5), 451–464.
Tufekci, Z. (2007). Convolutional bias removal based on normalizing the filterbank spectral magnitude. IEEE Signal Processing Letters, 14(7), 485–488.
Virag, N. (1999). Single channel speech enhancement based on masking properties of the human auditory system. IEEE Transactions on Speech and Audio Processing, 7(2), 126–137.
Zwicker, E., & Fastl, H. (1990). Psychoacoustics: facts and models (2nd ed.). New York: Springer-Verlag.