Int. J. Electron. Commun. (AEÜ) 72 (2017) 125–133
An immunological approach based on the negative selection algorithm for real noise classification in speech signals
Caio Cesar Enside de Abreu a,b, Marco Aparecido Queiroz Duarte c, Francisco Villarreal d
a State University of Mato Grosso - UNEMAT, Department of Computing, Santa Rita Street 128, Alto Araguaia, Brazil
b São Paulo State University - UNESP, Department of Electrical Engineering, Brasil Avenue 56, Ilha Solteira, Brazil
c State University of Mato Grosso do Sul - UEMS, Department of Mathematics, Rodovia MS 306 KM 06, Cassilândia, Brazil
d São Paulo State University - UNESP, Department of Mathematics, Brasil Avenue 56, Ilha Solteira, Brazil
Article info
Article history: Received 6 April 2016 Accepted 6 December 2016
Keywords: Noise classification Speech enhancement Artificial immune systems Negative selection algorithm Dual-tree complex wavelet transform
Abstract
This paper presents a new approach to detect and classify background noise in speech sentences based on the negative selection algorithm and the dual-tree complex wavelet transform. The energy of the complex wavelet coefficients across five wavelet scales is used as the input feature set. Afterward, the proposed algorithm identifies whether the speech sentence is, or is not, corrupted by noise. In the affirmative case, the system returns the type of the background noise amongst the real noise types considered. Comparisons with classical supervised learning methods are carried out. Simulation results show that the proposed artificial immune system outperforms classical classifiers in accuracy and generalization capacity. Future applications of this tool will help in the development of new speech enhancement or automatic speech recognition systems based on noise classification.
© 2016 Elsevier GmbH. All rights reserved.
1. Introduction

Speech processing systems are essential in many branches of telecommunications, as well as in the entertainment industry. Examples include applications in signal encoding, automatic speech recognition, mobile applications, communication in noisy environments and medicine. In the specific case of speech enhancement, the task consists of extracting the original signal from the corrupted speech signal [1]. Classical speech enhancement methods include spectral subtraction and power spectral subtraction, estimators based on the minimum mean square error (MMSE), Wiener filtering and wavelet thresholding. Regardless of the fact that these methods have their own theoretical basis with specific objectives, all of them require noise estimation in order to perform noise reduction. In fact, the better the noise estimation, the greater the quality of the denoised speech [2,3].
For mobile communication systems, background noise is the main drawback. Noise can damage speech communication quality and the performance of automatic speech recognition algorithms [4]. According to [5], the design of speech enhancement algorithms usually does not take into account the differences in statistical properties of different noise types. This can be the cause of failure
of some algorithms in specific noise environments, as can be seen in [6]. In this sense, the development of ever more powerful devices, such as smartphones, tablets and hearing aids, enables the design of algorithms which work efficiently in any noisy environment by incorporating noise classification. As an example of real-time implementation, in [7] an implementation for smartphones that includes frequency domain transformation, noise classification and suppression was presented.
One of the first speech enhancement techniques suitable for practical applications is spectral subtraction (SS), proposed by Boll in [1]. The basic idea of SS is to estimate the noise frequency spectrum and then subtract it from the noisy speech. The SS technique works well when the noise frequency spectrum is uniform or when the noise is stationary. Under the conditions of real noise environments, SS generates undesirable tones in the processed signal, known as "Musical Noise". However, musical noise is not a specific shortcoming of SS. Wiener filtering [8], statistical-model approaches such as MMSE estimators [2,9], log-spectral amplitude estimators [10] and MAP estimators [11] are also affected by this problem. Scalart and Vieira Filho, in [12], suggested that the a posteriori signal-to-noise ratio (postSNR) and a priori signal-to-noise ratio (prioSNR) concepts could avoid or attenuate the musical noise problem by means of a derived Wiener filter. Simulations in [12,3] showed improved results; however, for some noise environments the performance was not successful. This was due to
prioSNR estimation by the decision-directed method, which depends fundamentally on good noise estimation [2]. Wavelet-based methods are also susceptible to poor performance due to noise estimation. The wavelet thresholding approach treats as noise coefficients those with absolute value below a certain value (threshold), after which such coefficients are modified. Originally, threshold estimation, proposed by Donoho in [13], was based on the assumption that the background noise is white Gaussian. Therefore, the results are not satisfactory in real noise conditions. Several strategies to improve the original threshold estimation were proposed; examples are adaptive thresholding [14] and new statistical models for threshold computation [15–17].
As can be seen, it is natural that future research in speech enhancement focuses on the treatment and understanding of background noise. In this sense, the development of systems that are able to predict the noise type present in a noisy sentence is indispensable for statistical methods or for obtaining optimal parameters in algorithms. This tendency is confirmed in two recent works. In [5], the authors used noise classification to choose optimal smoothing parameters in noise and prioSNR estimation. Such optimal parameters were used in the log-spectral amplitude speech estimator algorithm. In [18], noise classification was employed to predict the background noise, and a specific weighted denoising auto-encoder model [19] was used in the estimation of the clean power spectrum, employed in the implementation of a Wiener filter. Besides these applications in speech enhancement, noise classification has been used in automatic speech recognition systems [20,21], acoustic scene classification [22,23] and hearing aid applications [24,25]. In [22], the authors introduced noise classification for context-aware applications and a hidden Markov model based noise classifier was proposed. Without any distinction between speech presence or absence, mel-frequency cepstral coefficients were computed and used as input features for the classifier. However, the authors highlighted that noise-only segments would be suitable for noise classification.
The noise classification method proposed in [5] is based on the traditional support vector machine (SVM) classifier. Features are acquired by mapping noise energy from the 256-point short-time Fourier transform to the Bark domain. In other words, the noise energy in each Bark band is calculated and used as an input feature for the classifier. In order to perform noise classification in a noisy speech signal, the first 15 frames are assumed to be noise-only segments. Classification is carried out only in these frames in the following way: within the 15 frames, noise is classified frame-by-frame and the noise type with the greatest number of votes is selected as the noise type of the whole sentence. In a similar way, in [18], the authors used the normalized subband power spectrum of the noisy speech as the input feature of the classifier. However, in that case, a Gaussian mixture model is used as the classifier [26]. As in [5], the first 10 frames of the noisy speech are assumed to be noise-only segments and the noise classification is carried out frame-by-frame. The noise type with the greatest number of votes is selected. Furthermore, in order to address possible changes of noise type, the authors used a voice activity detection (VAD) algorithm, and noise classification is performed every time a speech-absent frame is identified.
The motivation of this paper is the proposition of a methodology for real noise classification in speech signals in a frame-by-frame way, which will contribute to the scientific development of speech enhancement and other speech processing systems. Unlike the methods proposed in [5,18], only a single noise-only frame is required for the initialization of the proposed algorithm. Furthermore, it is proposed in this paper that the system identifies whether the speech sentence is clean, or whether the noise level is so low that no processing (e.g. enhancement) is required. Thus, the proposed algorithm is easily coupled to other speech processing systems. Besides the classifiers used in the above noise classification methods, classifiers commonly used in pattern recognition problems include neural networks (NNs) [27] and decision trees (DTs) [28]. Therefore, simulations encompassing both classifiers are presented. Another motivation of this work is the introduction of artificial immune systems (AIS) into this area. As outlined previously, noise classification has become a tendency in speech processing, making possible the development of algorithms that operate in a specific way for each real noise environment.
The main objective of this work is the development of an intelligent system for real noise classification based on an AIS [29,30]. AISs constitute a relatively recent approach in the artificial intelligence field. Researchers in the AIS field look to the biological immune system (BIS) for inspiration on how to solve problems in engineering and computer science [29]. Composed of a set of organs, cells and molecules, the BIS aims to protect an individual from infections by eliminating foreign substances [31]. In a general way, it is able to recognize common structures from different classes of microorganisms in order to generate an immune response. The exposure of the BIS to a foreign antigen increases its ability to respond more quickly to a new exposure to the same antigen, characterizing the immunological memory concept [31,32]. From such biological concepts, several algorithms were proposed. Forrest et al., in [33], proposed an approach based on the generation of T-cells in the immune system, which contributed to the broad dissemination of AISs [29]. In that paper, the authors proposed a method called the negative selection algorithm (NSA), which was then applied to the problem of computer virus detection. Although originally applied to computer security, AISs have been applied to several areas; some examples are pattern recognition [34–36], data mining [37–39], optimization [40–42], fault and anomaly detection [43,44], as well as machine learning [45].
This paper presents a real noise classifier for speech processing systems based on the NSA [33]. Among its characteristics, the NSA is attractive for its simplicity of implementation and high accuracy in pattern recognition [29]. The NSA is mainly based on simple comparisons between patterns by means of an affinity measure (e.g. a distance measure), unlike SVMs and NNs, whose implementations are based on optimization algorithms. In addition, unlike the works in [5,18], where features extracted in the frequency domain are used, the present work proposes a multiscale analysis based on the dual-tree complex wavelet transform (dual-tree CWT) [46]. The dual-tree CWT is a relatively recent architecture developed to implement a complex wavelet transform, first introduced in the speech enhancement context by Abreu et al. in [47]. From the speech signal, features are extracted by the dual-tree CWT and the NSA is applied in order to detect whether a speech sentence is, or is not, degraded by noise. In the affirmative case, the extracted features are stored in a database. This procedure is repeated as often as necessary, until a set of detectors (antibodies) is built for each type of real noise environment considered, which is accomplished in off-line mode. In on-line mode, the proposed system acts by identifying and classifying background noise in speech signals.
This procedure is inspired by the monitoring phase of the negative selection algorithm [33]. If the speech sentences do not show degradation, the monitoring phase will not be started.
The remainder of this paper is organized as follows: Section 2 presents the NSA in its original form; the proposed methodology is described in Section 3; Section 4 presents simulation details, including the feature extraction process and results; and, finally, concluding remarks are presented in Section 5.
2. Negative selection algorithm: analogies and definition

In order to function properly and thus prevent the development of autoimmune diseases, the BIS needs to be able to distinguish between molecules of our own cells and foreign molecules, generally referred to as self and non-self, respectively [31,48]. According to [32], antibody molecules and T-cell receptors have the ability to recognize any self or non-self molecule, including those artificially synthesized. The elimination of any cell with receptors able to recognize self molecules, named auto-reactive cells, characterizes the negative selection concept [32,29]. Negative selection allows the control of B and T lymphocytes, such that those in development that are potentially auto-reactive are eliminated [30,31]. The negative selection of T-cells occurs within the thymus, which is responsible for T-cell maturation. Therefore, only T-cells that do not recognize self molecules are allowed to survive [48].
The NSA was developed based on the negative selection of T-cells within the thymus. For this reason, it has intrinsic characteristics of pattern recognition [49,50]. Executed in two phases, the NSA is defined as follows [33]:

1. Define two different datasets P and C. P contains only self patterns and C contains both self and non-self patterns. Compute the affinity (match) between each element c_j ∈ C and each element in P. If an element in C is recognized by an element in P, in other words, if the affinity score between c_j and an element in P exceeds an established threshold, reject c_j. Otherwise, store c_j in a set of detectors R. This phase of the NSA is commonly called the Censoring phase.
2. After the generation of R, this phase consists of system monitoring in order to detect non-self patterns. A set P* of patterns to be protected is defined, and the affinity between each element in P* and each r_j ∈ R is evaluated. If such an affinity exceeds a predefined threshold, then a non-self element is identified. This phase is called the Monitoring phase.

According to [30], P* can be composed of a subset of P with the addition of new patterns, or of a completely new data set. The data set C can be used in both the censoring and monitoring phases. In this case, the data used to generate detectors must not exceed 30% of the data in C, as suggested in [33]. Figs. 1 and 2 show the flowcharts of the censoring and monitoring phases of the NSA, respectively. In the NSA, detectors correspond to mature T-cells that can recognize pathogenic agents; in other words, detectors are similar to antibodies [33].
Fig. 2. Flowchart of the monitoring phase of the NSA.
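To make the two phases above concrete, the following minimal sketch (illustrative only, not code from the paper) implements censoring and monitoring for real-valued patterns, assuming the Euclidean distance as the affinity measure; the function names and the toy data are hypothetical.

```python
import numpy as np

def affinity(ab, ag):
    # Euclidean distance between two real-valued patterns (one possible affinity measure)
    return np.linalg.norm(np.asarray(ab, dtype=float) - np.asarray(ag, dtype=float))

def censoring(self_set, candidates, threshold, n_detectors):
    # Censoring phase: keep only candidate patterns that do NOT match any self pattern
    detectors = []
    for c in candidates:
        if all(affinity(p, c) >= threshold for p in self_set):
            detectors.append(np.asarray(c, dtype=float))
        if len(detectors) == n_detectors:
            break
    return detectors

def monitoring(detectors, pattern, threshold):
    # Monitoring phase: a match with any detector identifies a non-self pattern
    return any(affinity(r, pattern) < threshold for r in detectors)

# Toy usage with random 5-dimensional patterns (illustrative values only)
rng = np.random.default_rng(0)
self_set = rng.normal(0.0, 1.0, size=(30, 5))      # "self" patterns
candidates = rng.normal(3.0, 1.0, size=(200, 5))   # candidate detectors
R = censoring(self_set, candidates, threshold=2.0, n_detectors=20)
print(monitoring(R, rng.normal(3.0, 1.0, size=5), threshold=2.0))
```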
Just as in the negative selection principle of the BIS, the auto-reactive antibodies are discarded in the censoring phase, which has as its sole purpose the generation of a set of detectors that can recognize non-self patterns (pathogenic agents). The censoring phase is performed in off-line mode. The monitoring phase occurs in real-time mode; the objective is to monitor and protect the system in order to identify non-self patterns. Identification is performed based on the detector set created in the censoring phase, through an affinity measurement [30,32,29].

2.1. Affinity measurements

The NSA was originally proposed for computational security tasks [33]. In such problems, patterns are strings or binary strings. In this case, there are affinity measures that assess the number of different characters between two strings, as in the Hamming distance, or measures that are based on the relative number of bits that match or differ between the strings [29]. However, the most common applications are those where patterns have real values. Table 1 presents the most common affinity measures used in the real-valued representation for the NSA [29]. Note that Ab and Ag represent a detector and a foreign molecule, respectively. In Table 1, the Manhattan distance is also known as the city block distance. Furthermore, the Minkowski distance, also known as the p-norm distance, becomes the Manhattan distance if p = 1 and the Euclidean distance if p = 2. It is noteworthy that Ab and Ag are real-valued vectors of characteristics, in the R^N space, that represent antibodies and antigens, respectively [49,29]:
Ab = [Ab_1, Ab_2, ..., Ab_{i−1}, Ab_i, ..., Ab_{N−1}, Ab_N],   (1)

Ag = [Ag_1, Ag_2, ..., Ag_{i−1}, Ag_i, ..., Ag_{N−1}, Ag_N].   (2)
In order to define a matching criterion and thus assess the similarity between two patterns, a suitable affinity measure should be chosen according to the characteristics of the problem studied [29].
Fig. 1. Flowchart of the censoring phase of the NSA.

Table 1
Main affinity measures for the real-valued negative selection algorithm.

Measure               Equation
Euclidean distance    d(Ab, Ag) = sqrt( Σ_i (Ab_i − Ag_i)² )
Manhattan distance    d(Ab, Ag) = Σ_i |Ab_i − Ag_i|
Minkowski distance    d(Ab, Ag) = ( Σ_i |Ab_i − Ag_i|^p )^(1/p)
Chebyshev distance    d(Ab, Ag) = max{ |Ab_i − Ag_i|, i = 1, ..., N }
Canberra distance     d(Ab, Ag) = Σ_i |Ab_i − Ag_i| / ( |Ab_i| + |Ag_i| )
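As an illustration only (not code from the paper), the measures in Table 1 can be written directly from their definitions; the function names below are arbitrary.

```python
import numpy as np

def euclidean(ab, ag):
    return np.sqrt(np.sum((ab - ag) ** 2))

def manhattan(ab, ag):
    return np.sum(np.abs(ab - ag))

def minkowski(ab, ag, p=3):
    return np.sum(np.abs(ab - ag) ** p) ** (1.0 / p)

def chebyshev(ab, ag):
    return np.max(np.abs(ab - ag))

def canberra(ab, ag):
    # Convention assumed here: terms with |Ab_i| + |Ag_i| = 0 contribute zero.
    num = np.abs(ab - ag)
    denom = np.abs(ab) + np.abs(ag)
    return np.sum(np.divide(num, denom, out=np.zeros_like(num), where=denom > 0))

# Small usage example with arbitrary vectors
ab = np.array([1.0, 2.0, 3.0])
ag = np.array([1.5, 1.0, 3.5])
print(euclidean(ab, ag), manhattan(ab, ag), canberra(ab, ag))
```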
The matching between two patterns can be perfect or partial [29]. In perfect matching, the two patterns are equal, i.e., d(Ab, Ag) = 0. However, in real-valued applications the concept of partial matching is often employed [29,51]. Partial matching is achieved by defining a matching threshold λ; therefore, if d(Ab, Ag) < λ, the match occurs. According to [51], the matching threshold can be considered a generalization parameter of the system.

3. Proposed methodology

In this work, an algorithm for the identification and classification of real noise in speech sentences, based on the negative selection algorithm, is proposed. The system operates in the silent intervals of speech sentences. The detection of silence and speech intervals is performed in the wavelet domain using the VAD algorithm proposed in [52]. Therefore, decision-making is performed every time a window of silence is identified. The classification, carried out frame by frame, is suitable for real-time processing. Normal operating conditions for the system occur when the speech signal being processed is clean, or the noise level is very low. In this way, self patterns are features extracted from clean (noise-free) signals when speech is absent. Features extracted from noisy speech sentences, acquired via the windowing technique in silent segments, are considered non-self patterns.
In this work, six types of realistic noise environments are considered. The noise types are babble, cafeteria, car, exhibition hall, traffic and train. The clean sentence database used is the NOIZEUS database [6], which is composed of thirty phonetically balanced IEEE sentences in the English language. The sentences in NOIZEUS were produced by three male and three female speakers. Noise signals were taken from the AURORA database and added to the clean sentences at four SNR levels: 0 dB, 5 dB, 10 dB and 15 dB. Regarding the corrupted files, some considerations are necessary: speech sentences corrupted by babble, cafeteria, exhibition hall and train noise were taken from the NOIZEUS database; car and traffic noise were taken from the authors' personal database and added to clean sentences from NOIZEUS; traffic is a very singular noise and this sound was acquired during a traffic jam, where car horns sound often; all signals have a sampling rate of 8 kHz. In order to ensure that no repeated pattern pairs were used in the simulations, the contamination process for the clean sentences was started randomly.
Furthermore, the data are split into training, development and test sets. Training data are used for detector generation and development data are used for matching threshold estimation. Finally, test data are used to measure accuracy. In this way, the test items are independent. Despite the fact that classification is carried out in non-speech segments, different speech sentences from different speakers were used for training, development and testing. Half of the data, including sentences from three speakers at all SNR levels, was used for testing purposes. The remaining data was split into training and development sets.
3.1. Censoring phase for the proposed algorithm

In the proposed algorithm, the censoring phase aims to build a set of noise detectors R. R should contain detectors for all noise types considered for classification. For this purpose, the following procedure is proposed:

1. Build a dataset containing noise samples C = {C_b, C_c, C_ca, C_e, C_t, C_tr}. In C, the indexes represent, respectively, the following noise types: babble, cafeteria, car, exhibition hall, traffic and train. Use the windowing technique;
2. Define a self dataset P and load a subset C* ⊂ C containing samples of the chosen noise type. Define the number of detectors to be stored for that noise type;
3. Until the desired number of detectors is obtained, choose a noise sample in C* randomly, extract its features and check the match with all self patterns in P. If a match does not occur (non-self identified), the pattern is stored in the set of detectors. Otherwise, reject the current noise sample and choose a new signal randomly in C*;
4. Repeat steps 2 and 3 until a dataset R is obtained, containing patterns for all noise types considered in step 1.

The self dataset P contains patterns with features extracted from clean sentences. In order to perform the self/non-self decision (matching criterion), a matching threshold λ is considered. Noise levels can vary over time; thus, the censoring phase ensures that noise patterns similar to self patterns will not be stored as detectors. This procedure ensures the quality of the detector set. The threshold λ can be defined according to baseline conditions considered by the designer. Fig. 3 shows the censoring phase flowchart. The output of the censoring phase is a matrix containing noise patterns for all noise types. This phase occurs in off-line mode and the number of detectors stored is determined by the designer. It is noteworthy that the censoring phase is accomplished on training data.

Fig. 3. Censoring phase flowchart for the proposed algorithm.

3.2. Monitoring phase for the proposed algorithm

After running the censoring phase, a non-self set of noise patterns R = {R_b, R_c, R_ca, R_e, R_t, R_tr} is generated. R will act as a set of detectors (antibodies) for each type of real noise environment considered. The monitoring phase should act in the following way:

1. Load the self and noise datasets P and R, respectively. Consider a speech signal being processed by the windowing technique. It is assumed that the first window is a silent segment;
2. Extract the feature vector Ag (foreign pattern) from the window content and check the match with P: if d(Ab, Ag) < λ for any Ab ∈ P (self pattern identified), go to the next window of silence; otherwise, go to step 3;
3. Assess the affinity between Ag and each noise pattern r_i ∈ R. Identify the noise detectors that produce a match: d(r_i, Ag) < λ;
4. If more than one noise detector r_i matches the pattern Ag to be classified, the subset R* ⊂ R with the greatest number of activated detectors is selected to represent the non-self pattern.

It is noteworthy that the self/non-self decision is performed in the same way for both the censoring and monitoring phases. If a self sample is identified, the algorithm searches for the next window of silence. Otherwise, a non-self sample is identified and the monitoring phase is activated in order to classify the background noise. Fig. 4 presents the flowchart of the monitoring phase. The monitoring phase of the proposed algorithm occurs in real-time mode and aims to protect the system by identifying non-self patterns. For a speech signal undergoing processing, every time a window of silence is detected, the monitoring phase is activated in order to verify whether the signal is corrupted by noise. In the affirmative case, the system returns the type of the background noise.
In the proposed system, the designer is free to decide how to extract the characteristics of the noise samples, as well as the length and type of the window. The designer is also free to choose the affinity measure and the matching threshold. In Section 4, results of several simulations are presented. Furthermore, details about feature extraction and how to assess affinity are given. Comparisons with classical classifiers are also carried out.
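As a hedged illustration (not the authors' code), the censoring and monitoring procedures of Sections 3.1 and 3.2 can be sketched as below. The Canberra distance is used here because it is the measure ultimately selected in Section 4.2; the feature extraction function and the noise-type names are placeholders, and classification by majority vote among activated detectors follows step 4.

```python
import numpy as np
from scipy.spatial.distance import canberra  # Canberra distance as the affinity measure

NOISE_TYPES = ["babble", "cafeteria", "car", "exhibition_hall", "traffic", "train"]

def build_detectors(self_set, noise_windows_by_type, extract_features, threshold, n_per_type):
    """Censoring phase (Section 3.1): store noise patterns that do not match any self pattern."""
    detectors = {t: [] for t in NOISE_TYPES}
    for t in NOISE_TYPES:
        for window in noise_windows_by_type[t]:
            ag = extract_features(window)
            if all(canberra(ab, ag) >= threshold for ab in self_set):
                detectors[t].append(ag)               # non-self: keep as a detector for noise type t
            if len(detectors[t]) == n_per_type:
                break
    return detectors

def classify_window(window, self_set, detectors, extract_features, threshold):
    """Monitoring phase (Section 3.2): return None for a self (clean) window, otherwise the noise
    type whose detector subset had the greatest number of activated detectors (step 4)."""
    ag = extract_features(window)
    if any(canberra(ab, ag) < threshold for ab in self_set):
        return None                                   # self pattern: no further processing needed
    votes = {t: sum(canberra(r, ag) < threshold for r in det) for t, det in detectors.items()}
    return max(votes, key=votes.get)
```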
Fig. 4. Monitoring phase flowchart for the proposed algorithm.

4. Implementation and results

In order to check the performance of the proposed methodology, several simulations are performed in this section. The self dataset P consists of 30 patterns extracted from clean signals acquired by rectangular windowing in windows of silence. The matching threshold λ used in both the censoring and monitoring phases was defined empirically. A feature extraction process based on complex wavelets is proposed in this paper. The next subsection provides more details about the dual-tree CWT and the input features.

4.1. Feature extraction

Upon the extraction of a noise sample y[n] = [y_1, y_2, ..., y_{i−1}, y_i, ..., y_{N−1}, y_N] from a 1024-point rectangular window, the following normalization is performed:

ŷ_i = y_i / max(y[n]).   (3)

After normalization, the dual-tree CWT is applied to ŷ_i with five decomposition levels. Thus, the vector of characteristics Ab = [Ab_1, Ab_2, ..., Ab_5] is built as follows:

Ab_k = Σ_{n=1}^{N/2^k} |d_c(k, n)|²,   (k = 1, 2, ..., 5),   (N = 1024),   (4)

where d_c(k, n) are the complex wavelet coefficients in the k-th scale provided by the dual-tree CWT, and N is the noise sample length. The complex coefficients d_c(k, n) = d_r(k, n) + j d_i(k, n), where j = sqrt(−1), are obtained from two filter banks, similar to those used in the discrete wavelet transform (DWT): one of the trees produces the real part and the other produces the imaginary part of the complex wavelet coefficients. Therefore, the absolute value in (4) is obtained in the following way:

|d_c(k, n)| = sqrt( [d_r(k, n)]² + [d_i(k, n)]² ).   (5)
For each decomposition tree in the dual-tree CWT, low and high frequencies are separated by successive applications of a low-pass/high-pass filter pair followed by dyadic downsampling. The outputs of the low-pass and high-pass filters are respectively termed approximation and detail coefficients [53]. The main advantage of the dual-tree CWT over the DWT, for one-dimensional signal processing, is the near shift-invariance of the complex wavelet coefficient magnitudes. The filter sets are designed in such a way that the whole transform is approximately analytic and nearly shift-invariant [46]. Details about filter design for the dual-tree CWT can be found in [46,54]. In addition to the dual-tree CWT properties mentioned in this section, its choice is also based on the good results presented in [47].
It is noteworthy that the proposed five-dimensional feature vector is based on the noise energy distribution across five octave bands. According to [55], the detail coefficients d_c(k, n) provided by the wavelet decomposition correspond approximately to the frequency band (2^{−(k+1)} f_s, 2^{−k} f_s), where f_s is the signal sampling rate; for example, with f_s = 8 kHz, scale k = 3 spans approximately the 500–1000 Hz band. Table 2 shows the frequency ranges for each wavelet scale k, considering the speech signal database used in this work.
Table 2
Frequency ranges for each wavelet scale k.

Wavelet scale   Frequency range (Hz)
1               2000–4000
2               1000–2000
3               500–1000
4               250–500
5               125–250
For a better understanding of the proposed characteristics, Fig. 5 shows the scatter plot of 180 detectors generated in the censoring phase of the proposed algorithm, divided equally among babble, train, traffic and car noise. Only the characteristics Ab_k (k = 1, 3 and 5) are used. Note, from Fig. 5, that the proposed features are consistent. Babble and train noise are fully separated from each other and from the other classes. In fact, babble noise has most of its energy concentrated in the lower frequencies, whereas train noise has a strong energy concentration in the high frequencies [56]. Traffic and car noise have some similarities between themselves, but it is still possible to distinguish them from small clusters. Finally, the 1024-point windowing was fixed due to the filter bank architecture employed in the dual-tree CWT computation, where N must be a power of two [46,53]. Furthermore, with 1024 points, a considerable number of wavelet coefficients in the fifth scale is obtained.
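As a sketch only (not the authors' implementation), the feature vector of Eqs. (3)–(5) can be computed with an off-the-shelf dual-tree CWT; the example below assumes the third-party Python package `dtcwt` (its `Transform1d` interface) and a 1024-sample window, both of which are assumptions rather than details given in the paper.

```python
import numpy as np
import dtcwt  # third-party dual-tree complex wavelet transform package (assumed available)

def wavelet_energy_features(window, levels=5):
    """Five-dimensional feature vector: energy of the complex detail coefficients per scale, Eq. (4)."""
    y = np.asarray(window, dtype=float)
    y_hat = y / np.max(y)                                  # normalization of Eq. (3)
    pyramid = dtcwt.Transform1d().forward(y_hat, nlevels=levels)
    # pyramid.highpasses[k] holds the complex coefficients d_c(k+1, n) of scale k+1
    return np.array([np.sum(np.abs(np.ravel(h)) ** 2) for h in pyramid.highpasses])

# Example with a synthetic 1024-sample frame (illustrative only)
frame = np.random.default_rng(0).normal(size=1024)
print(wavelet_energy_features(frame))
```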
Fig. 6. Noise classification accuracy in terms of the affinity measure and matching threshold λ: (a) Canberra distance (λ = 0.9, accuracy = 97.01%); (b) Euclidean (λ = 11, accuracy = 92.31%), Minkowski (p = 3, λ = 10, accuracy = 77.33%) and Manhattan (λ = 14, accuracy = 92.68%) distances; (c) Chebyshev distance (λ = 16, accuracy = 90.64%).
Fig. 5. Scatter plot of the characteristics Ab_1, Ab_3 and Ab_5.

4.2. Simulation results

In order to achieve a better configuration for the proposed system, several simulations encompassing the matching threshold and the affinity measures were carried out. Fig. 6 shows the noise classification accuracy of the proposed algorithm in terms of the affinity measure and the matching threshold λ, evaluated on the development data. Analyzing the accuracy in Fig. 6, it can be seen that the best performance was achieved with the Canberra distance when λ = 0.9. The Euclidean and Manhattan distances provide similar results. The worst performance was achieved by the Minkowski distance with p = 3. Based on the plots in Fig. 6, the Canberra distance and the matching threshold λ = 0.9 are selected for the remaining experiments.
In order to provide more details about the performance of the proposed method, Table 3 shows the confusion matrix for noise classification. In Table 3, rows correspond to the actual noise class, whereas columns correspond to the predicted class. The values on the diagonal represent the success rate of the noise classification. According to Table 3, the proposed algorithm achieved a success rate of at least 90% for all noise types. In general, the proposed real noise classifier achieved a hit rate of 96.29% over the six noise types. Note that the similarities between car and traffic patterns, highlighted in Fig. 5, directly influenced the system accuracy. Together, these two classes are responsible for most of the incorrect predictions. Regarding the self/non-self decision, no noise sample was classified as a self sample. Therefore, the system achieved a hit rate of 100% in self/non-self discrimination.
In real-time applications, the time employed in each classification should be as short as possible. This is due to the fact that a classification algorithm will act together with other algorithms, such as enhancement or automatic speech recognition approaches.
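For completeness, the matching threshold selection on the development data can be sketched as a simple grid search (illustrative only; the variables `dev_patterns`, `dev_labels` and `detectors` are placeholders for the data and detector sets described in Section 3).

```python
import numpy as np
from scipy.spatial.distance import canberra

def accuracy_for_threshold(lmbda, dev_patterns, dev_labels, detectors):
    """Fraction of development patterns whose voted noise type matches the true label."""
    correct = 0
    for ag, label in zip(dev_patterns, dev_labels):
        votes = {t: sum(canberra(r, ag) < lmbda for r in det) for t, det in detectors.items()}
        correct += (max(votes, key=votes.get) == label)
    return correct / len(dev_labels)

# Sweep candidate thresholds on the development set and keep the best one, e.g.:
# best = max(np.arange(0.5, 1.0, 0.05),
#            key=lambda l: accuracy_for_threshold(l, dev_patterns, dev_labels, detectors))
```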
Table 3
Confusion matrix for the noise classification using the Canberra distance with a matching threshold λ = 0.9.

Actual class (%)   Predicted class
                   Babble   Cafeteria   Car     Ex. hall   Traffic   Train
Babble             100      0           0       0          0         0
Cafeteria          1.66     98.33       0       0          0         0
Car                3.34     0           91.66   0          5         0
Ex. hall           0        0           0       100        0         0
Traffic            0        0           7.33    0          90        2.67
Train              0        0           1.66    0          0         97.77
Fig. 7. Histogram of the classification time employed in the simulations for 3450 noise classifications.
In order to check the classification time required by the proposed approach, Fig. 7 shows the histogram of the time employed in 3450 classifications. For each decision made, the process described in Fig. 4 was executed. In the histogram in Fig. 7, 30 bins have been used in order to obtain a more detailed analysis. Note that the runtime is less than eight milliseconds; in fact, the average runtime is 6.9 milliseconds per noise classification. Therefore, based on this very small runtime requirement, the proposed system is suitable for real-time processing. In Section 4.3 the proposed method is compared with classical classifiers commonly used in pattern recognition problems.
4.3. Comparisons with classical classifiers

For comparison purposes, the vector of characteristics proposed in Eq. (4) will be used for the training and evaluation of classical supervised learning methods. The classifiers used are the SVM [57], the Multilayer Perceptron neural network (MLPNN) and the DT [28]. A brief description of each approach is given in the following paragraphs.
The SVM implemented has a linear kernel and uses the one-against-one scheme. SVMs are inherently binary classifiers and discriminate between two possible classes. For n-class problems, the one-against-one scheme consists of employing n(n − 1)/2 binary classifiers, where each classifier discriminates between two classes [58]. In order to classify a noise pattern, the objective of a DT is to create a model that predicts the true class from a vector of characteristics by learning simple decision rules inferred from the training data. The decision tree algorithm used in this work is CART (Classification and Regression Tree) [28]. Finally, the MLPNN was trained with the backpropagation learning algorithm and sigmoid activation function [27]. In the training stage, the learning rate and the number of epochs were respectively defined as 0.3 and 500. Moreover, one hidden layer with five neurons was used.
The training and test data used for the SVM, MLPNN and DT are the same used in the proposed AIS. The results, shown in Table 4, were acquired using 360 corrupted sentences in all speech-plus-noise combinations. Implementations were made in Python 2.7 and the scikit-learn library [59] was used for the SVM and DT simulations. The MLPNN was implemented in Matlab® 2014 and the proposed AIS was implemented in both programming languages. In Table 4, the last column presents the average performance of the considered methods, computed by averaging the success rates over all noise types. It is noted from Table 4 that the proposed method performed the best for most noise types. Only in the traffic noise condition was the proposed approach outperformed by the MLPNN. The SVM and the proposed AIS performed equally well for the babble, car and exhibition hall noise conditions. Regarding the overall performance, the best algorithm was the proposed AIS, as can be seen in the last column.
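As a hedged illustration of this comparison setup (not the authors' scripts), the scikit-learn calls below reproduce the described configurations: a linear-kernel SVM with the one-against-one multiclass scheme and a CART decision tree trained on the five-dimensional feature vectors of Eq. (4). The variable names and the random placeholder data are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# X_train, X_test: arrays of 5-dimensional wavelet-energy feature vectors (Eq. (4));
# y_train, y_test: noise-type labels. Placeholder random data is used here for illustration.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(300, 5)), rng.integers(0, 6, 300)
X_test, y_test = rng.normal(size=(60, 5)), rng.integers(0, 6, 60)

svm = SVC(kernel="linear", decision_function_shape="ovo")  # one-against-one multiclass scheme
tree = DecisionTreeClassifier()                            # scikit-learn's CART implementation

for name, clf in [("SVM", svm), ("DT", tree)]:
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```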
5. Conclusion

This paper has presented a new method based on the negative selection algorithm for real noise classification in speech sentences. In order to perform realistic simulations, the NOIZEUS database was used. Six types of real noise were selected to validate the methodology. The noise types were chosen according to practical noise environments, in other words, according to people's everyday situations. The proposed classifier proved to be efficient as a means of background noise identification and classification.
Table 4
Performance comparisons considering classical classifiers and the proposed approach.

Classifier   Accuracy (%)
             Babble    Cafeteria   Car     Ex. hall   Traffic   Train    Avg
DT           95.00     93.33       90.00   100.00     88.33     91.11    92.96
MLPNN        95.00     94.16       85.00   99.16      91.66     96.66    93.60
SVM          100.00    91.66       91.66   100.00     81.66     97.66    93.77
Proposed     100.00    98.33       91.66   100.00     90.00     97.77    96.29
In order to implement the proposed classifier, a set of detectors must be generated in the censoring phase, which is executed in off-line mode and therefore does not interfere with real-time applications. The classification phase, called the monitoring phase, is performed in less than 8 milliseconds; the average time used to perform a classification was 6.9 milliseconds, which makes it well suited to real-time usage. The present work has also reported on feature extraction based on the energy of the complex wavelet coefficients across wavelet scales. Five scales have been used and a characteristic vector with only five elements was proposed. These features were also used in the training and assessment of classical supervised learning methods. Based on the simulations carried out in this paper, it was found that the proposed artificial immune system outperformed classical classifiers in accuracy and generalization capacity, achieving an average hit rate of 96.29%. Although a multiscale analysis based on the dual-tree complex wavelet transform has been used for feature extraction, any vector of characteristics can be used as input features for the proposed algorithm. Furthermore, the method is a powerful tool for the future development of new speech enhancement or automatic speech recognition systems based on noise classification.

Acknowledgments

The authors would like to thank the Coordination for the Improvement of Higher Education Personnel (CAPES) – Brazil and FAPESP (Grant 2011/17610-0).

References

[1] Boll S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process 1979;27(2):113–20. http://dx.doi.org/10.1109/TASSP.1979.1163209. [2] Ephraim Y, Malah D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 1984;32(6):1109–21. http://dx.doi.org/10.1109/TASSP.1984.1164453. [3] Hu Y, Loizou PC. Subjective comparison and evaluation of speech enhancement algorithms. Speech Commun 2007;49(7–8):588–601. http://dx.doi.org/10.1016/j.specom.2006.12.006. [4] Rabiner L, Juang B. Fundamentals of speech recognition. USA: Prentice Hall; 1993. [5] Yuan W, Xia B. A speech enhancement approach based on noise classification. Appl Acoust 2015;96:11–9. http://dx.doi.org/10.1016/j.apacoust.2015.03.005. [6] Hu Y, Loizou PC. Subjective comparison and evaluation of speech enhancement algorithms. Speech Commun 2007;49(7):588–601. http://dx.doi.org/10.1016/j.specom.2006.12.006. [7] Parris S, Torlak M, Kehtarnavaz N. Real-time implementation of cochlear implant speech processing pipeline on smartphones. In: 36th annual international conference of the IEEE engineering in medicine and biology society (EMBC, 2014). p. 886–9. http://dx.doi.org/10.1109/EMBC.2014.6943733. [8] Lim J, Oppenheim AV. Enhancement and bandwidth compression of noisy speech. Proc IEEE 1979;67(12):1586–604. http://dx.doi.org/10.1109/PROC.1979.11540. [9] Hu Y, Loizou PC. Estimators of the magnitude-squared spectrum and methods for incorporating snr uncertainty. IEEE Trans Audio Speech Lang Process 2011;19(5):1123–37. http://dx.doi.org/10.1109/TASL.2010.2082531. [10] Ephraim Y, Malah D. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 1985;33(2):443–5. http://dx.doi.org/10.1109/TASSP.1985.1164550. [11] Lotter T, Vary P. Speech enhancement by map spectral amplitude estimation using a super-gaussian speech model. EURASIP J Appl Signal Process 2005;2005:1110–26.
http://dx.doi.org/10.1155/ASP.2005.1110. [12] Scalart P, Filho JV. Speech enhancement based on a priori signal to noise estimation. In: IEEE international conference on acoustics, speech, and signal processing, 1996. ICASSP-96, vol. 2. p. 629–32. [13] Donoho DL. De-noising by soft-thresholding. IEEE Trans Inf Theory 1995;41 (3):613–27. http://dx.doi.org/10.1109/18.382009. [14] Ghanbari Y, Karami-Mollaei MR. A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets. Speech Commun 2006;48(1):927–40. http://dx.doi.org/10.1016/j.specom.2005.12.002. [15] Lallouani A, Gabrea M, Gargour CS. Wavelet based speech enhancement using two different threshold-based denoising algorithms. In: Canadian conference on electrical and computer engineering, 2004. p. 315–8. http://dx.doi.org/ 10.1109/CCECE.2004.1345019.
[16] Sheikhzadeh H, Abutalebi HR, An improved wavelet-based speech enhancement system. In: INTERSPEECH; 2001, p. 1855–58. [17] Tabibian S, Akbari A, Nasersharif B. Speech enhancement using a wavelet thresholding method based on symmetric kullback–leibler divergence. Signal Process 2015;106:184–94. http://dx.doi.org/10.1016/j. sigpro.2014.06.027. [18] Xia B, Bao C. Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification. Speech Commun 2014;60:13–29. http://dx.doi.org/10.1016/j.specom.2014.02.001. [19] Vicent P, Larochelle H, Bengio Y. Manzagol PA. Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning; 2008. p. 1096–103. doi:http:// dx.doi.org/10.1145/1390156.1390294. [20] Xu H, Tan ZH, Dalsgaard P, Lindberg B. Robust speech recognition based on noise and snr classification-a multiple-model framework. In: INTERSPEECH; 2005. p. 977–80. [21] Hoseinkhani F, Parcham E, Pournazary M, Borzue N. Speech recognition by classifying speech signals based on the fire fly and fuzzy. In: International conference on advanced computer science applications and technologies (ACSAT, 2012); 2012. p. 187–91. doi:http://dx.doi.org/10.1109/ACSAT.2012. 31. [22] Ma L, Smith D, Milner B. Environmental noise classification for context-aware applications. In: 14th International conference on database and expert systems applications (DEXA 2003); 2003. p. 360–70. doi:http://dx.doi.org/10. 1007/978-3-540-45227-036. [23] Rakotomamonjy A, Gasso G. Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Trans Audio Speech Lang Process 2015;23(1):142–53. http://dx.doi.org/10.1109/TASLP.2014. 2375575. [24] Kates JM. Classification of background noises for hearing-aid applications. J Acoust Soc Am 1995;97(1):461–70. http://dx.doi.org/10.1121/1.412274. [25] Saki F, Kehtarnavaz N, Background noise classification using random forest tree classifier for cochlear implant applications, in: IEEE International conference on acoustics, speech and signal processing (ICASSP, 2014); 2014. p. 3591–595. doi:http://dx.doi.org/10.1109/ICASSP.2014.6854270. [26] Reynold DA, Rose RC. Robust text-independent speaker identification using gaussian mixture speaker models. IEEE Trans. Speech Audio Process 1995;3 (1):72–83. http://dx.doi.org/10.1109/89.365379. [27] Rumelhart DE, McClelland JL. Parallel Distributed Processing: explorations in the microstructure of cognition, vol. 2. The MIT Press; 1986. [28] Breiman L, Friedman J, Olshen R, Stone C. Classification and regression trees. Wadsworth and Brooks; 1984. [29] Dasgupta D, Nino F. Immunological computation: theory and applications. CRC Press; 2008. [30] Castro LND, Timmis J. Artificial immune systems: a new computational intelligence approach. Springer Science & Business Media; 2002. [31] Abbas AK, Lichtman A, Pillai S. Cellular and molecular immunology: with STUDENT CONSULT online access. Elsevier Health Sciences; 2014. [32] de Castro LN. Development and application of computational tools inspired by artificial immune systems (in portuguese) Ph.d. thesis. Campinas – SP: Universidade Estadual de Campinas; 2001. [33] Forrest S, Perelson AS, Allen L, Cherukur R, Self-nonself discrimination in a computer. [34] Forrest S, Javornik B, Smith RE, Perelson AS. Using genetic algorithms to explore pattern recognition in the immune system. Evol Comput 1993;1 (3):191–211. http://dx.doi.org/10.1162/evco.1993.1.3.191. 
[35] Hunt J, Timmis J, Cooke E, Neal M, King C. Jisys: the envelopment of an artificial immune system for real world applications. In: Artificial immune systems and their applications. p. 157–86. http://dx.doi.org/10.1007/978-3-642-59901-99. [36] Yu X, Fu D, Yang T, Riha K. The application of negative selection algorithm in multi-angle infrared vehicle images recognition. In: 38th International conference on telecommunications and signal processing (TSP, 2015); 2015. p. 776–80. doi:http://dx.doi.org/10.1109/TSP.2015.7296371. [37] de Casto LN, Zuben FJV. An evolutionary immune network for data clustering, In: Proceedings of the Sixth Brazilian Symposium on Neural Networks, 2000, IEEE; 2000. p. 84–9. doi:http://dx.doi.org/10.1109/SBRN. 2000.889718. [38] Knight T, Timmis J. Aine: an immunological approach to data mining. In: Proceedings IEEE international conference on data mining, 2001. (ICDM 2001); 2001. p. 297–304. doi:http://dx.doi.org/10.1109/ICDM.2001.989532. [39] Puteh M, Hamdan AR, Omar K, Bakar AA, Flexible immune network recognition system for mining heterogeneous data. In: 7th, International conference on artificial immune systems, ICARIS 2008, Phuket, Thailand; 2008. [40] Fukuda T, Mori K, Tsukiama. Parallel search for multi-modal function optimization with diversity and learning of immune algorithm. In: Artificial immune systems and their applications. Springer; 1999. p. 210–20. http://dx. doi.org/10.1007/978-3-642-59901-911. [41] de Castro LN, Zuben FJV, The clonal selection algorithm with engineering applications. In: Proceedings of GECCO; 2000. p. 36–9. [42] Xiao X, Li T, Zhang R. An immune optimization based real-valued negative selection algorithm. Appl Intell 2015;42(2):289–302. http://dx.doi.org/ 10.1007/s10489-014-0599-9. [43] Lima FPA, Lotufo ADP, Minussi CR. Disturbance detection for optimal database storage in electrical distribution systems using artificial immune systems with negative selection. Electr Power Syst Res 2014;109:54–62. http://dx.doi.org/ 10.1016/j.epsr.2013.12.010.
[44] Li D, Liu S, Zhang H. Negative selection algorithm with constant detectors for anomaly detection. Appl Soft Comput 2015;36:618–32. http://dx.doi.org/10.1016/j.asoc.2015.08.011. [45] Hightower R, Forrest S, Perelson AS. The baldwin effect in the immune system: learning by somatic hypermutation. In: Adaptive individuals in evolving populations. Addison-Wesley Longman Publishing Co., Inc; 1996. p. 159–67. [46] Selesnick IW, Baraniuk RG, Kingsbury NG. The dual-tree complex wavelet transform – a coherent framework for multiscale signal and image processing. IEEE Signal Process Mag 2005;22(6):123–51. http://dx.doi.org/10.1109/MSP.2005.1550194. [47] Abreu CCE, Duarte MAQ, Villarreal F. Dual-tree complex wavelet transform in the problem of speech enhancement. In: Proceeding series of the brazilian society of applied and computational mathematics; 2015. p. 010467-1 – 010467-7. doi:http://dx.doi.org/10.5540/03.2015.003.01.0467. [48] Timmis J, Knight T, de Castro LN, Hart A. An overview of artificial immune systems. In: Computation in cells and tissues. Berlin Heidelberg: Springer; 2004. p. 51–91. http://dx.doi.org/10.1007/978-3-662-06369-94. [49] de Castro LN, Timmis J. Artificial immune systems: a novel paradigm to pattern recognition. Artif Neural Netw Pattern Recogn 2002;1:67–84. [50] Dasgupta D, Yu S, Nino F. Recent advances in artificial immune systems: models and applications. Appl Soft Comput 2011;11(2):1574–87. http://dx.doi.org/10.1016/j.asoc.2010.08.024. [51] Ji Z, Dasgupta D. Revisiting negative selection algorithms. Evol Comput 2007;15(2):223–51.
[52] Duarte MAQ, Filho JV, Alvarado FV. A simple and efficient voice activity detector using the wavelet transform (in portuguese). In: Brazilian congress of computational and applied mathematics (CNMAC, 2009). p. 1022–8. [53] Mallat S. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Trans Pattern Anal Mach Intell 1989;11(7):674–93. http://dx.doi.org/10.1109/34.192463. [54] Kingsbury N. Design of q-shift complex wavelets for image processing using frequency domain energy minimization. In: Proceedings of the IEEE international conference on image processing, Barcelona; 2003. p. 1013–6. doi:http://dx.doi.org/10.1109/ICIP.2003.1247137. [55] Daubechies I. Ten lectures on wavelets. Philadelphia: SIAM Books; 1992. [56] Hirsch HG, Pearce D. The aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: ASR2000-automatic speech recognition: challenges for the new millenium ISCA tutorial and research workshop (ITRW); 2000. [57] Vapnik VN. The nature of statistical learning theory. Springer-Verlag New York Inc; 1995. [58] Burges C. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discovery 1998;2(2):121–67. http://dx.doi.org/10.1023/A:1009715923555. [59] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.