Contents lists available at ScienceDirect
IRBM www.elsevier.com/locate/irbm
Original Article
Voice Pathologies Classification and Detection Using EMD-DWT Analysis Based on Higher Order Statistic Features

I. Hammami*, L. Salhi, S. Labidi

University of Tunis El Manar, Higher Institute of Medical Technologies of Tunis, Research Laboratory of Biophysics and Medical Technologies, 1006, Tunis, Tunisia
Highlights
• EMD-DWT two-stage analysis of voice samples.
• Temporal energy to select the robust IMF.
• Extraction of HOS features from the robust IMF.
• Classification accuracy reached 100% for "Funktionelle Dysphonia vs. Rekurrensparese".
Article info

Article history: Received 24 January 2019; Received in revised form 12 November 2019; Accepted 29 November 2019; Available online xxxx

Keywords: EMD-DWT analysis; High Order Statistics (HOS) features; Pathological voice detection and classification; Saarbrücken Voice Database (SVD); Support vector machine (SVM)
Abstract

Background: The voice is a prominent tool allowing people to communicate and to exchange information in their daily activities. However, any slight alteration of the voice production system may affect the voice quality. Over the last years, researchers in the biomedical engineering field have worked to develop robust automatic systems that may help clinicians perform a preventive diagnosis in order to detect voice pathologies at an early stage.

Method: In this context, a pathological voice detection and classification method based on EMD-DWT analysis and Higher Order Statistics (HOS) features is proposed; DWT coefficient features are also extracted and tested. To carry out our experiments, a wide subset of voice signals from normal subjects and from subjects suffering from the five most frequent pathologies in the Saarbrücken Voice Database (SVD) was selected. In the first step, we applied the Empirical Mode Decomposition (EMD) to the voice signal. Afterwards, among the obtained candidate Intrinsic Mode Functions (IMFs), we chose the robust one based on a temporal energy criterion. In the second step, the selected IMF was decomposed via the Discrete Wavelet Transform (DWT). As a result, a feature vector of six HOS parameters and a feature vector of six DWT features were formed from both approximation and detail coefficients. A support vector machine (SVM) was employed to classify the obtained data. After training the proposed system on the SVD database, the system was evaluated using voice signals of volunteer subjects from the Neurological department of the RABTA Hospital of Tunis.

Results: The proposed method gives promising results in pathological voice detection. The accuracies reached 99.26% using HOS features and 93.1% using DWT features on the SVD database. In classification, an accuracy of 100% was reached for "Funktionelle Dysphonia vs. Rekurrensparese" based on HOS features, whereas using DWT features the accuracy achieved was 90.32% for "Hyperfunktionelle Dysphonia vs. Rekurrensparese". Furthermore, in the validation, the accuracies reached were 94.82% and 91.37% for HOS and DWT features, respectively. In classification, the highest accuracies were reached for "Parkinson versus Paralysis": 94.44% and 88.87% based on HOS and DWT features, respectively.
* Corresponding author. Full postal address: 9 Rue Zouhair Essafi, 1006, Tunis, Tunisia.
E-mail addresses: [email protected] (I. Hammami), [email protected] (L. Salhi), [email protected] (S. Labidi).
https://doi.org/10.1016/j.irbm.2019.11.004 1959-0318/© 2019 AGBM. Published by Elsevier Masson SAS. All rights reserved.
Conclusion: HOS features show promising results in automatic voice pathology detection and classification compared to DWT features. Thus, they can reliably be used as a noninvasive tool to assist clinical evaluation in pathological voice identification.
1. Introduction

The voice occupies a prominent role in people's social and professional lives, allowing them to communicate and to exchange information. Due to voice abuse and unhealthy lifestyles, around 25% of people in the world suffer from voice problems [1]. People who use their voices excessively and whose jobs require them to speak loudly (e.g., teachers, lawyers, singers, actors) are especially at risk of being affected by several kinds of voice problems. To detect these disorders, physicians have usually relied on classic, invasive techniques that explore the laryngeal system: stroboscopy, endoscopy and laryngoscopy are the main techniques used to diagnose people with voice problems. In addition, non-invasive techniques are also applied, such as electroglottography (EGG) and subjective methods based on the perceptual assessment of the voice. However, these methods require experts and may cause discomfort to the patients. In order to avoid these issues, researchers have developed several automatic systems for pathological voice detection and classification. These systems are considered assistive tools that may be helpful to physicians in Ear, Nose and Throat (ENT) departments. They are mostly based on the analysis of the voice signal (e.g., sustained vowels, running speech, and digits) or of the EGG signal. In this context, Hossain et al. [2] implemented a smart healthcare monitoring framework using the voice signal and the EGG signal: a set of acoustic features was extracted from the voice signal, and four shape features were combined with six cepstral features to describe the EGG signal. The developed system achieved an accuracy of 92.8% using the voice signal.
In a previous work, a pathological voice detection system was built on the acoustic analysis of the voice signal to extract long-term features such as pitch, formants, shimmer, jitter, and the harmonics-to-noise ratio (HNR) [3]. In [4], a vector of 28 mixed parameters (e.g., long-term features, nonlinear features) was extracted from the voice signals of three sustained vowels; three selection methods were used to select the robust features, and an identification rate of 100% was obtained using a random forest classifier. Furthermore, short-term features like Mel Frequency Cepstral Coefficients (MFCC) are commonly used in the field of speech recognition, and several studies were conducted using MFCC parameters alone or combined with other parameters [5–7]. Ammami et al. [8] developed a novel policy based on MFCC and a modified density-based clustering algorithm named DBSCAN in order to discriminate between normophonic and pathologic voices. Moreover, Cordeiro et al. [9] used MFCC features combined with line spectral frequency features to discriminate between physiological and neuromuscular laryngeal pathologies. Three classifier systems were used: SVM, Gaussian Mixture Model (GMM) and Discriminant Analysis (DA). Nevertheless, no single system achieved the best results for all three classes; therefore, to improve on the single systems, the authors carried out hierarchical classification and system combination, and the best accuracy rate acquired was 84.4%. Most of the developed pathological voice detection systems are based on the voice signal of the sustained vowel /a/, which usually gives the best results. However, in Mesallam et al. [10] the best accuracies obtained were 81.5% using running speech and 79.5% using the sustained vowel /a/. Furthermore, the spoken Arabic digit voice signals taken
from the Arabic Voice Pathology Database (AVPD) [11] gave a good result, with an accuracy rate over 99%. In the literature, several studies used features extracted from the spectral analysis of the voice signal. Ali et al. [12] used two sets of features: the auditory-processed spectrum and all-pole-model-based cepstral coefficients (APCC); the best detection result, equal to 99.56%, was obtained using the APCC features. Moro-Velázquez et al. [13] used a group of new and previously proposed metrics applying the modulation spectrum method. These metrics were optimized over different time and frequency ranges, pursuing the maximization of efficiency for the detection of pathological voices. The implemented system was tested using three databases: the Massachusetts Eye and Ear Infirmary (MEEI) database, and PDA and HGM, which are two Spanish databases; the highest accuracy rate reached was 90.6% for the MEEI database. In addition, information about formant frequencies extracted from the spectral envelope also shows a good ability to discriminate between normophonic and pathologic subjects [14]. Orozco-Arroyave et al. [15] extracted several sets of long-term and nonlinear features; from the obtained results they deduced that it is not suitable to use all types of features to model a specific pathology. On the contrary, it is necessary to study the physiology of each disorder to choose the most appropriate set of features to describe it. The auto-correlation function is considered one of the most common methods for extracting various characteristics from speech signals [1,16]. In this way, statistical measures of similarity such as correntropy may be successfully used for the extraction of information and data characterization [17]. As a newly developed approach, in [18,19] the authors base their decision on the area under the voice contour and the estimated vocal tract area to determine whether the voice is normal or pathologic.
Furthermore, a spectro-temporal description of the glottal source excitation signal based on an interlaced derivative pattern to detect pathological voices was implemented in [20]. Recently, to improve the performance of their frameworks, researchers have used several methods for dimensionality reduction and feature selection. These methods allow the selection of the robust features that contribute the most to the classification [21,22]. Muhammad et al. [23] introduced MPEG-7, which provides a set of audio-visual descriptors, as a new feature set in the field of pathological voice detection and classification; the MPEG-7 features reached an accuracy rate equal to 99.994%. The wavelet transform has made a great contribution to the field of pathological voice identification and classification. Indeed, features extracted via wavelet analysis of the voice signal are relevant and allow the discrimination between normal and pathological voices. In two works [24,25], Saeedi et al. adapted a new wavelet technique for sorting and identifying pathological voices; a 100% correct classification rate was reached using the MEEI database. In our work, we propose a computationally less expensive method to detect and classify the five most frequent pathologies in the Saarbrücken Voice Database (SVD). For that, we used a set of features that properly describe the voice signal. The voice signal was analyzed in two stages. In the first stage, we performed the empirical mode decomposition and then applied a selection method to the obtained IMFs to pick out the relevant one. In the second stage, the selected IMF underwent a new decomposition using the wavelet transform. Subsequently, a set of Higher Order Statistics (HOS) features and DWT coefficient features were extracted to feed a support vector machine for classification.
Fig. 1. Flowchart of the proposed method (H.F. Dysphonia: Hyperfunktionelle dysphonia, F. Dysphonia: funktionelle dysphonia).
This paper is organized as follows. Section 2 describes the material and the methods used in this work. Section 3 presents the results and discussion. The last section concludes the work featured in this paper and gives perspectives.

2. Materials and methods

Our proposal for pathological voice detection and classification contains two main parts, as shown in Fig. 1. The first part focuses on feature extraction: the voice signal is turned into a set of HOS features and DWT coefficient features based on the EMD-DWT analysis. In the second part, these features, arranged in appropriate vectors, are used as input data for the SVM classifier.

2.1. Features extraction method

A new set of high order statistic features extracted in the wavelet domain was used to identify voice impairments. As mentioned in the previous section, most existing approaches are based on classical parameters such as MFCC, jitter, shimmer, pitch, formants, etc., which are difficult to extract given the aperiodicity of the voice signal. Furthermore, owing to their instability, these features can take common values for both pathological and normal voices, which makes detection more difficult; consequently, the identification rate obtained with them is not high. In contrast, HOS features obtained from the frequency domain [26] and from the time domain [19,27] give promising results in this field. Inspired by these two studies, we extract HOS parameters from wavelet coefficients to detect and classify voice impairments. Wavelet transform analysis is widely explored in the field of voice detection, and several features have been extracted from its coefficients. In addition to the HOS features, a set of DWT coefficient features was extracted, for instance the Mean Wavelet Value, the Mean Wavelet Energy and the Mean Wavelet Entropy.
In this article, we perform a two-stage analysis using the empirical mode decomposition and the discrete wavelet transform, which have proved their reliability for voice signal processing, to conceive a robust system able to classify five voice impairments using an SVM classifier.

2.1.1. Empirical mode decomposition (EMD)

The EMD is an efficient method for analyzing non-linear and non-stationary data and has been widely used in many fields, including speech signal processing [28,29]. It was introduced by Huang et al. in 1998 [28]. The EMD process decomposes a given signal x(t) into a series of oscillatory modes named Intrinsic Mode Functions (IMFs) and a residue, as indicated in equation (1):
x (t ) =
n i =1
h i (t ) + rn (t )
(1)
where h i (t ) was the IMF and rn (t ) is the residue. The EMD process algorithm proposed in [29] operates as follows: 1) Find all the extrema (minima and maxima) that exist in the input signal x(t ); 2) Interpoling the local maxima and minima by using cubic spline interpolation; to obtain the upper envelope U max and the lower envelope L min ; 3) Compute the average of the upper and lower envelopes: m(t ) = [U max (t ) + L min (t )]/2; 4) Extract the IMF (detail component): IMF(t ) = x(t ) − m(t ); 5) Iterate the four previous steps on the residual signal: r (t ) = x(t ) − IMF(t ); to extract a new IMF candidate after checking its proprieties. EMD analysis has the greatest advantage that the basis functions are extracted from the signal it self. Nevertheless, an IMF candidate should satisfy some criteria: i. The number of zero crossing and the number of extrema should not be differing by no more than one; ii. The mean value of the upper envelope and the lower envelope should be equal to zero. However, if the extracted signal does not satisfy the previous criteria, we repeated steps 1 to 4 until we find an IMF; this is what we call the sifting process. The stopping criterion which is the standard deviation for the sifting process is fixed to S D = 0.1 [28]. In our study we used EMD to pick out the entire oscillatory mode from voice signal. The IMFs stems from EMD analysis are not all relevant to be used to construct a feature vector aiming to discriminate between healthy and pathological subjects. Indeed, the effects of end points and filtering criteria may generate noise component for high-frequency and surplus invalid component for low-frequency within the extracted IMFs. A selection method was carried out to pick out the relevant one and also to reduce the number of IMFs that will be used, which is the temporal energy criterion. 2.1.2. 
IMF selection method based on temporal energy EMD has the major drawback of ensuring the reliability of the interpolation near the endpoint, that the endpoint will have the inevitable divergence phenomenon. In order to pick out the relevant IMF from the entire IMFs candidates stems from the decomposition process, we required to use a selection method. In the literature, researchers have developed several methods to select the relevant IMFs such as the nonlinear correlation coefficient, the temporal energy, the correlation coefficient, the K -L divergence, and the Hurst exponent. In our experience we adopted the computing of temporal energy as a selection criterion for this task. For all the
Fig. 2. (a). Tree structure of the DWT; (b). Daubechies 40 wavelet and its spectrum.
candidate IMFs obtained from the EMD analysis of the voice signal, we calculated the temporal energy using the following equation (2):
$$E_T = \sum_{n=1}^{N} \mathrm{IMF}[n]^2 \qquad (2)$$
where N is the length of the IMF and E_T stands for the temporal energy. Only the IMF with the highest energy value is chosen as the relevant IMF. However, the selected IMF alone may not be able to differentiate between the two classes. Therefore, we add another stage that decomposes the relevant IMF using the discrete wavelet transform.

2.1.3. Discrete wavelet transform (DWT)

Wavelet analysis has been extensively used in the field of signal processing because it retains the location of the changes in the speech signal. This method of analysis comes in multiple forms: the continuous wavelet transform, the discrete wavelet transform, and wavelet packets [24]. The wavelet process brings a given signal from the time domain to the wavelet domain using basis functions called mother wavelets. A wavelet function at a scale s and a spatial displacement u is defined as [31]:
$$\psi_{u,s}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t-u}{s}\right), \quad u \in \mathbb{R},\; s \in \mathbb{R}_{+}^{*} \qquad (3)$$

In our proposal, we applied the discrete wavelet transform (DWT), which decomposes a given signal into approximation and detail coefficients using successive low-pass and high-pass filters, as shown in Fig. 2.a. The DWT process is based on sub-band coding; it yields a fast computation of the wavelet transform and is easy to implement [31]. The DWT is defined by the following equation:

$$W_{j,k} = \sum_{n} x(n)\, 2^{-j/2}\, \psi\!\left(2^{-j} n - k\right) \qquad (4)$$

The DWT offers several wavelet families, for instance Coiflets, Haar, Symlets, Daubechies, etc. The choice of the wavelet family suitable for analyzing a given signal depends on the type of the signal. The Daubechies wavelet with 40 vanishing moments, shown with its spectrum in Fig. 2.b, is widely used to analyze speech signals. Indeed, this wavelet shows a sharper cut-off frequency compared with other wavelets, and hence the energy leakage between different resolution levels is reduced [30].

2.1.4. High order statistics (HOSs) features

Acoustic parameters such as pitch, formants, shimmer, jitter and MFCC are widely exploited in the field of pathological voice detection, and they have given satisfactory results in the detection and classification of pathological voices. However, the extraction of these features raises some issues owing to its sensitivity to the periodicity of the speech signal. Contrariwise, HOS features have the advantage of not requiring a periodic or quasi-periodic voice signal for reliable analysis [26]. The application of HOSs to speech processing has been primarily motivated by their inherent Gaussian suppression and phase preservation properties [27,32]. As shown in the previous sections, the computation of the HOS features is preceded by several steps. Among the HOS features developed in different studies [26,27,32,33], only variance, skewness and kurtosis were chosen to create a feature vector from the approximation and detail signals. The third and fourth standardized moments of a given signal, which correspond to skewness and kurtosis respectively, are defined in equations (5) and (6):

$$\gamma_3 = \frac{\sum_{n=1}^{N} (S_n - \mu)^3}{(N-1)\,\sigma^3} \qquad (5)$$

$$\gamma_4 = \frac{\sum_{n=1}^{N} (S_n - \mu)^4}{(N-1)\,\sigma^4} \qquad (6)$$

γ3 and γ4 represent skewness and kurtosis, respectively; N is the number of samples, while μ and σ stand for the mean and the standard deviation. The skewness measures the lack of symmetry of a given distribution: a symmetric distribution has a skewness equal to zero, negative values mean that the data are skewed to the left, and positive values that they are skewed to the right. Kurtosis expresses the degree of peakedness of a distribution and indicates whether the data are heavy-tailed or light-tailed. A lower kurtosis value indicates fewer extreme deviations of the DWT coefficients of the selected IMF within the same class, whereas a higher kurtosis value denotes more frequent extreme deviations. We added the variance of the DWT coefficients to the feature vector because it shows a compactness property within the same class.
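As an illustration, the two-stage chain described above (sifting, energy-based IMF selection of equation (2), one-level DWT, and the HOS features of equations (5) and (6)) can be sketched in plain numpy. This is a minimal sketch, not the authors' implementation: it uses linear-interpolation envelopes instead of cubic splines, a Haar wavelet instead of the db40 used in the paper, a fixed sift count in place of the SD criterion, and a synthetic tone instead of a voice recording; all function names are ours.

```python
import numpy as np

def sift_once(x):
    """One sifting pass: subtract the mean of the upper/lower envelopes.
    Linear-interpolated envelopes for brevity; the paper uses cubic splines."""
    n = np.arange(x.size)
    ma = np.where((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))[0] + 1  # local maxima
    mi = np.where((x[1:-1] < x[:-2]) & (x[1:-1] < x[2:]))[0] + 1  # local minima
    if ma.size < 2 or mi.size < 2:
        return None  # too few extrema: residue reached
    upper = np.interp(n, ma, x[ma])
    lower = np.interp(n, mi, x[mi])
    return x - (upper + lower) / 2.0

def emd(x, max_imfs=6, n_sift=10):
    """Crude EMD: repeatedly sift the residue to extract IMF candidates."""
    imfs, r = [], np.asarray(x, dtype=float)
    for _ in range(max_imfs):
        if sift_once(r) is None:      # residue has too few extrema: stop
            break
        h = r
        for _ in range(n_sift):       # fixed sift count stands in for SD < 0.1
            h_new = sift_once(h)
            if h_new is None:
                break
            h = h_new
        imfs.append(h)
        r = r - h
    return imfs, r

def haar_dwt(x):
    """One-level DWT (Haar for brevity; the paper uses Daubechies db40)."""
    x = x[: x.size // 2 * 2]
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # detail coefficients
    return a, d

def hos_features(c):
    """Skewness (eq. 5), kurtosis (eq. 6) and variance of coefficients c."""
    c = np.asarray(c, dtype=float)
    n, mu, sd = c.size, c.mean(), c.std()
    skew = np.sum((c - mu) ** 3) / ((n - 1) * sd ** 3)
    kurt = np.sum((c - mu) ** 4) / ((n - 1) * sd ** 4)
    return skew, kurt, c.var()

# Toy stand-in for a voiced frame: harmonic tone + low-frequency drift + noise
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 2000)
x = (np.sin(2 * np.pi * 110 * t) + 0.3 * np.sin(2 * np.pi * 7 * t)
     + 0.05 * rng.standard_normal(t.size))

imfs, _ = emd(x)
best = max(imfs, key=lambda m: np.sum(m ** 2))  # temporal energy, eq. (2)
a, d = haar_dwt(best)
F = [*hos_features(a), *hos_features(d)]        # {HSa, HKa, HVa, HSd, HKd, HVd}
```

The resulting six-element vector F corresponds to the feature vector fed to the classifier in section 2.2.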
Table 1. The selected subset distribution (SVD database).

Pathology | Number | Mean age (years) | Age range (years)
Normal | 261 / 300 | 28.6 / 24.0 | 16–66 / 16–63
Rekurrensparese | 70 / 127 | 57.5 / 56.2 | 14–81 / 21–79
Laryngitis | 50 / 32 | 50.4 / 47.9 | 16–71 / 12–73
Hyperfunktionelle Dysphonia | 32 / 112 | 46.15 / 39.73 | 22–73 / 13–73
Funktionelle Dysphonia | 24 / 51 | 46.5 / 46.4 | 20–71 / 16–74
Dysphonia | 29 / 42 | 48.2 / 46.0 | 11–77 / 18–73
2.1.5. DWT coefficients features

Furthermore, a feature vector of DWT coefficient features was also extracted from the voice signal after the EMD-DWT analysis. This vector is composed of the Mean Wavelet Value, the Mean Wavelet Energy and the Mean Wavelet Entropy. These features are used in several signal processing studies involving wavelet decomposition owing to their ability to preserve time information. In this study, these features were extracted from each set of wavelet coefficients, and their discrimination ability in the detection and classification steps was tested in combination with the SVM classifier.
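Assuming the three descriptors defined below (equations (7) to (9)), their computation for one coefficient vector can be sketched in plain numpy; the helper name and the toy coefficient values are ours:

```python
import numpy as np

def dwt_features(coeffs, total_energy):
    """Mean Wavelet Value, Energy and Entropy of one coefficient vector
    (equations (7)-(9)); a plain-numpy sketch."""
    X = np.asarray(coeffs, dtype=float)
    N = X.size
    mwv = X.sum() / N                             # eq. (7)
    mwe = np.sum(X ** 2) / total_energy           # eq. (8): E_i / E_T
    nz = X[X != 0.0]                              # skip zeros to avoid log(0)
    mwt = np.sum(nz ** 2 * np.log(nz ** 2)) / N   # eq. (9), Shannon-style
    return mwv, mwe, mwt

# six-feature vector Y from approximation (a) and detail (d) coefficients
a = np.array([0.9, 1.1, 1.0, 0.8])
d = np.array([0.1, -0.2, 0.05, 0.0])
E_T = np.sum(a ** 2) + np.sum(d ** 2)
Y = [*dwt_features(a, E_T), *dwt_features(d, E_T)]
```

Note that with this normalization the MWE values of the approximation and detail coefficients sum to one, since E_a + E_d = E_T for a one-level decomposition.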
• Mean Wavelet Value (MWV)

Fig. 3. Distribution of the ten most frequent impairments in the SVD database.
Defined as follows (7):

$$\mathrm{MWV} = \frac{1}{N} \sum_{i=1}^{N} X_i \qquad (7)$$

where N is the length of the coefficient vector X and i is the level of decomposition.
• Mean Wavelet Energy (MWE)

The energy of the wavelet coefficients at level i is normalized against the total energy of the signal, as defined in equation (8):

$$\mathrm{MWE} = \frac{E_i}{E_T} \qquad (8)$$
• Mean Wavelet Entropy (MWT)

The Shannon entropy is a measure of the disorder of a given signal; it also gives information about the concentration of energy in each set of wavelet coefficients. Equation (9) defines the Mean Wavelet Entropy as:

$$\mathrm{MWT} = \frac{1}{N} \sum_{i} X_i^2 \log X_i^2 \qquad (9)$$

2.2. Classification: support vector machine (SVM)

The support vector machine (SVM) was developed by Vapnik [34] in 1995 and was introduced as a binary classifier. It has been widely used in statistical learning problems and has had great success in almost all areas where it has been applied [35]. The main purpose of the SVM classifier is to look for an optimal hyperplane with a wide margin in order to separate two classes within a Hilbert space using a kernel function. Several studies interested in discriminating between normal and pathological voices based on SVM have therefore reached high accuracy rates [23,35,36]. The equation of the hyperplane is (10):

$$w^{T} x + b = 0 \qquad (10)$$

where w is the weight vector and b is the bias. In order to map a low-dimensional data set into a higher-dimensional feature space, which makes the separation problem easier, the SVM uses kernel functions such as the linear, sigmoid, polynomial and radial basis function (RBF) kernels. In the literature, the RBF kernel is the most used one, because it is more general than the other kernels and usually produces better accuracy. The RBF kernel function is defined as follows (11):

$$k(x_i, x_j) = \exp(-\gamma \, \|x_i - x_j\|^2), \quad \gamma > 0 \qquad (11)$$

where ||x_i − x_j|| is the Euclidean distance between two feature vectors [34]. The RBF kernel takes two important input parameters, C and γ: C is the optimization parameter and γ is the RBF kernel parameter.

3. Experimental results and discussion
3.1. Database

The voice database used in this study is the Saarbrücken Voice Database (SVD) [37], recorded by the Institute of Phonetics of Saarbrücken University and freely available online. The SVD contains more than 2000 files of three sustained vowels, /a/, /i/ and /u/, recorded at normal, high, low and low-high pitch. The length of the files ranges between 1 and 4 seconds, sampled at 50 kHz with a resolution of 16 bits. To carry out our experiments, we selected a subset of samples of the sustained vowel /a/ for both normal and pathological subjects. The set of pathological subjects contains files from the five most frequent pathologies in the SVD database, as shown in Fig. 3; files of subjects who suffer from several pathologies were excluded. Table 1 gives statistical information on the selected subset.

3.2. Database for clinical validation

In order to add concrete scientific value, a clinical validation of our proposed system was performed on 58 subjects from the Neurological Institute of the Rabta Hospital of Tunis. The pathological voice signals were recorded from 28 volunteer patients diagnosed by specialized ENT physicians from the Neurological department based on clinical examination. The impairments
Table 2. Distribution of the subjects used in the validation.

 | Pathological | Healthy
Number | 17 / 11 | 18 / 12
Total | 28 | 30
Mean age (years) | 53 | 47
Age range (years) | 32–75 | 28–82
found were: Parkinson, Alzheimer, chronic laryngitis, and paralysis. The normal voice signals were recorded from 30 volunteer subjects after treatment and a clinical evaluation. Table 2 gives the details of the subjects used in the validation step. Each subject was asked to pronounce the vowel /a/ at a comfortable level of loudness. The voice signals were recorded using a microphone attached to a headset and placed at a microphone-to-mouth distance of 20 cm. Voice recording was done using Wave Studio, a free Windows application. After choosing the file format (channel: mono, sampling rate: 22 kHz, encoding: 8 bits), the audio files were saved in "*.wav" format and the silence intervals at the beginning and end of each recording were eliminated.
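The final trimming step can be sketched with a simple amplitude gate; this is our illustration, not the authors' exact editing procedure in Wave Studio:

```python
import numpy as np

def trim_silence(x, thresh_ratio=0.02):
    """Drop leading/trailing silence: keep the span where |x| exceeds
    thresh_ratio * max|x|. A simple amplitude gate (illustrative only)."""
    x = np.asarray(x, dtype=float)
    idx = np.flatnonzero(np.abs(x) > thresh_ratio * np.max(np.abs(x)))
    return x[idx[0]: idx[-1] + 1] if idx.size else x

# 100 silent samples, a 400-sample tone, 100 silent samples
tone = np.sin(np.linspace(0.0, 20.0 * np.pi, 400))
x = np.concatenate([np.zeros(100), tone, np.zeros(100)])
trimmed = trim_silence(x)
```

In practice the threshold would be tuned so that breathy onsets and offsets of the vowel are not cut away.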
The results are expressed in terms of accuracy (Acc), sensitivity (Sn) and specificity (Sp). The accuracy measures the ratio of correctly detected samples to the total number of samples, equation (12). The sensitivity expresses the true positive rate, identifying the samples that are actually classified positive, and can be calculated using equation (13). The specificity, also named the true negative rate, measures the proportion of actual negative samples in a subset, equation (14).

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (12)$$

$$S_n = \frac{TP}{TP + FN} \qquad (13)$$

$$S_p = \frac{TN}{TN + FP} \qquad (14)$$

where TN is the number of true negative samples (the classifier detects a normal sample as normal), TP is the number of true positive samples (a pathological sample detected as pathological), FN refers to the false negative samples (a pathological sample detected as normal), and FP to the false positive samples (a normal sample detected as pathological).
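The three metrics can be computed directly from a confusion matrix; the counts below are made-up illustrative numbers, not results from the paper:

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity (equations (12)-(14))."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # eq. (12)
    sn = tp / (tp + fn)                     # eq. (13), true positive rate
    sp = tn / (tn + fp)                     # eq. (14), true negative rate
    return acc, sn, sp

# e.g. a hypothetical confusion matrix with 54 TP, 135 TN, 1 FP, 2 FN
acc, sn, sp = detection_metrics(tp=54, tn=135, fp=1, fn=2)
```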
3.3. Setup

The selected files of the SVD subset have a sampling frequency of 50 kHz. Some files of pathological subjects contain two or more pathologies; these files were removed from the subset. The voice signal was analyzed in two stages: the first was the empirical mode decomposition (Fig. 4 and Fig. 5) and the second the wavelet transform. The temporal energy was used as a dimensionality reduction method, so only the IMF with the highest energy value was selected: in Fig. 6.a the 7th IMF was selected and in Fig. 6.b the 4th. These IMFs were decomposed into wavelet coefficients using the Daubechies wavelet at level one (Fig. 7). Accordingly, from each set of coefficients we extracted three HOS features (skewness, kurtosis and variance), forming a vector of six features, F = {HSa; HKa; HVa; HSd; HKd; HVd}, where the subscripts S, K and V stand for skewness, kurtosis and variance, and a and d indicate the approximation and detail coefficients of the DWT, respectively. Likewise, from each set of coefficients a DWT feature vector was extracted, Y = {MVa; MEa; MTa; MVd; MEd; MTd}, where the subscripts V, E and T stand for value, energy and entropy. Two different experiments were performed using the SVM: the detection of pathological voices in the SVD subset, and the classification of five pathologies. The RBF kernel was used for the SVM classifier, as it is more general than other kernels and usually produces better accuracy with fewer restrictions; the optimal parameters were C = 1.97 and γ = 0.23. A ten-fold cross-validation approach was conducted to evaluate the performance of our system in the training and test steps using the SVD database. The pathological and normal data were randomly divided into ten nearly equal groups; in each iteration nine groups are used for training while the remaining group is used for testing, so that after ten iterations all the groups have been tested. In classification, the objective is to discriminate between two pathologies at a time; therefore, ten experiments were performed, pairing the five impairments: Rekurrensparese, laryngitis, hyperfunktionelle dysphonia, funktionelle dysphonia and dysphonia, which are the most frequent in the SVD database. A clinical validation was conducted to confirm the performance of the developed framework on patients from the Neurological Institute of the Rabta hospital. The results are expressed in terms of accuracy, sensitivity and specificity, as defined in equations (12)–(14).
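The training protocol (RBF-kernel SVM with C = 1.97 and γ = 0.23, ten-fold stratified cross-validation) can be sketched with scikit-learn; the feature matrix here is synthetic stand-in data, not features extracted from the SVD recordings:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# synthetic stand-ins for the six-dimensional HOS feature vectors
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (60, 6)),    # "normal" class
               rng.normal(1.5, 1.0, (60, 6))])   # "pathological" class
y = np.repeat([0, 1], 60)

clf = SVC(kernel="rbf", C=1.97, gamma=0.23)      # parameters from the paper
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)       # one accuracy per fold
```

Stratification keeps the normal/pathological proportion roughly constant across the ten folds, which matters when the two classes are unbalanced, as in the SVD subset.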
The results of the pathological voice detection step on the SVD database are shown in Table 3. The average accuracy, specificity and sensitivity reach 99.29%, 99.46% and 96.42%, respectively, using HOS features as the input vector of the SVM. Applying DWT features, the performance of the proposed framework was 93.1%, 93.33% and 92.85% for accuracy, specificity and sensitivity, respectively. From this we can see that the proposed HOS features extracted from the DWT outperformed the DWT features in pathological voice detection, which confirms the reliability of the HOS features against voice disruptions. Fig. 8 shows a comparison of our proposal with other methods using the same voice database (SVD). Muhammad et al. [19] estimated the vocal tract area from the voice signal of the sustained vowel /a/ using an LP filter, then computed a set of statistical features (e.g., variance, ratio of variance, skewness, kurtosis); the accuracy rate reached was 94.7%, using principal component analysis to select effective features. Al-Nasheri et al. [1,16] explored different frequency regions based on correlation functions; the conducted experiments showed good performance for pathology detection and reached accuracies of 92.79% in [1] and 90.779% in [16]. Our proposed method based on HOS features (variance, skewness and kurtosis) extracted in the wavelet domain reached an accuracy of 99.29%. Fang et al. [21] used a high-dimensional set of various features (e.g., 430 basic acoustic features, 84 Mel-transform cepstrum coefficients and 12 chaotic features) to develop a voice pathology detection system; the reported accuracy was 78.7%. Our proposal based on HOS features extracted from the EMD-DWT analysis thus shows good results in pathological voice detection. The p values obtained using the U-test for the six HOS features (α = 0.05) are shown in Table 4. The performed test rejects the null hypothesis for all cases.
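A feature-significance check of this kind can be sketched with scipy's mannwhitneyu (the Mann-Whitney U-test); the two samples below are synthetic stand-ins for one HOS feature in each group:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# synthetic stand-ins for one HOS feature in the two groups
rng = np.random.default_rng(2)
feat_normal = rng.normal(0.0, 1.0, 100)
feat_pathological = rng.normal(1.0, 1.0, 100)

# null hypothesis: both samples come from the same distribution
stat, p = mannwhitneyu(feat_normal, feat_pathological)
significant = p < 0.05   # alpha used in the paper
```

The U-test is appropriate here because the HOS feature distributions are not assumed to be Gaussian, so a rank-based test is safer than a t-test.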
The p-values are close to zero, as shown in Table 4; the HOS features are therefore statistically significant and have the ability to discriminate between normal and pathological voices. In the classification experiments we classified the five most frequent pathologies in the SVD database, dysphonia, laryngitis, hyperfunktionelle dysphonia, funktionelle dysphonia and Rekurrensparese, as mentioned in the previous section. Ten experiments were performed, as shown in Table 5, each classifying one impairment against one of the four remaining impairments. The highest accuracies reached were 100% for “F. Dysphonia
Fig. 4. IMFs candidates extracted from voice signal of sustained vowel /a/ of normal woman.
Fig. 5. IMFs candidates extracted from voice signal of sustained vowel /a/ of pathological woman that suffers from laryngitis disorder.
Fig. 6. (a). Temporal energy pattern of IMFs obtained from pathological voice; (b). Temporal energy pattern of IMFs obtained from normal voice.
Fig. 7. (a). The relevant IMF with the DWT coefficients of normal voice; (b). The relevant IMF with DWT coefficients of pathological voice.

Table 3. Performance of the proposed method on pathology detection.

Method     Specificity (%)   Sensitivity (%)   Accuracy (%)
HOS-SVM    99.46             96.42             99.29
DWT-SVM    93.33             92.85             93.1
vs. Rekurrensparese”, 99.66% for “Laryngitis vs. Rekurrensparese” and 99.54% for “F. Dysphonia vs. H.F. Dysphonia”, all using the HOS features. Applying the DWT features, the highest accuracies were 90.32% for “H.F. Dysphonia vs. Rekurrensparese”, 88.87% for “F. Dysphonia vs. Rekurrensparese” and 88.46% for “Laryngitis vs. Rekurrensparese”, as shown in Table 5. The DWT features proved reliable in discriminating the Rekurrensparese impairment from the others, although the accuracies achieved remain clearly lower than those obtained with the HOS features. Automatic detection and classification of pathological voices is regarded as an assistive tool for primary voice screening. In this work, the HOS features (skewness, kurtosis and variance) were extracted after analyzing the voice signal with the EMD-DWT process, and two experiments were performed. The first experiment aimed to discriminate between normophonic and pathological subjects, whereas the second aimed to classify one impairment against another. The HOS features combined with the SVM classifier give promising results in the detection step, where the accuracy reached 99.29%. In the classification step most of the accuracies exceeded 99%, and the accuracy for “F. Dysphonia vs. Rekurrensparese” reached 100%. Therefore, analyzing the voice signal with the EMD-DWT process proved to be a reliable tool for extracting relevant information about normal and pathological voices. The obtained results show that the proposed method outperforms some state-of-the-art approaches using the same database [1,16,19,21], although not the same SVD subset, in the detection task. The skewness and kurtosis, which are respectively the third- and fourth-order statistics introduced as higher order statistics features, combined with the variance (second order) and extracted from the wavelet coefficients, show a good ability to discriminate between pathological and normal subjects. In addition, the statistical test conducted confirms the relevance of the HOS features: the p-values were p < 0.05, p < 0.001 and p < 0.001 for skewness, kurtosis and variance, respectively. Furthermore, to validate the performance of the proposed framework, a subset of sustained vowel /a/ signals was collected from the Neurological Institute of the Rabta Hospital of Tunisia after the subjects had undergone clinical evaluation. Table 6 shows the detection results obtained on these subjects. The accuracies reached were 94.82% and 91.37% using the HOS features and the DWT
Fig. 8. Comparison of accuracies of some developed methods for pathological voices detection in the literature.

Table 4. p-values of the U-test for the HOS features.

HOS parameters    p-value
Skewness D        < 0.05
Kurtosis D        < 0.001
Variance D        < 0.001
Skewness App.     < 0.05
Kurtosis App.     < 0.001
Variance App.     < 0.001

D: detail, App.: approximation.

Table 5. Performance of the proposed method on pathology identification.

                                       HOS features                DWT features
Disease types                          Sp (%)   Sn (%)   Acc (%)   Sp (%)   Sn (%)   Acc (%)
Dysphonia vs. Laryngitis               96.38    97.32    96.74     75       71.42    73.33
Dysphonia vs. F. Dysphonia             95.71    100      97.95     75       57.14    66.67
Dysphonia vs. H.F. Dysphonia           98.61    100      99.06     83.33    71.42    78.94
Dysphonia vs. Rekurrensparese          97.14    99.47    98.84     84.21    85.71    84.61
Laryngitis vs. F. Dysphonia            98.75    97.63    98.12     87.5     75       81.25
Laryngitis vs. H.F. Dysphonia          99.28    100      99.54     91.66    87.5     85.71
Laryngitis vs. Rekurrensparese         99.5     100      99.66     89.47    75       88.46
F. Dysphonia vs. H.F. Dysphonia        100      98.75    99.54     75       62.5     70
F. Dysphonia vs. Rekurrensparese       100      100      100       89.47    87.5     88.87
H.F. Dysphonia vs. Rekurrensparese     99.5     99.33    99.41     91.67    89.47    90.32

F. Dysphonia: Funktionelle Dysphonia, H.F. Dysphonia: HyperFunktionelle Dysphonia. Sp: specificity, Sn: sensitivity, Acc: accuracy.
features, respectively. In the classification task, the highest accuracies were 94.44% for “Parkinson vs. Paralysis” using the HOS features and 88.87% for the same pair using the DWT features. Fig. 9 compares the performance (sensitivity, specificity and accuracy) of the proposed system, using the HOS features combined with the SVM classifier, between the training step on the SVD database and the validation step on the data from the Neurological Institute of the Rabta Hospital. The obtained results are promising, mainly in the identification part, in which new impairments (Alzheimer's disease, Parkinson's disease and paralysis) were classified. This makes the proposed system more general and able to classify several vocal impairments of different origins. The misclassified data in the validation step can be explained by the dissimilar recording conditions between the voice signals of the SVD database and our recordings. There is also a wide difference between the Tunisian and German dialects, which can influence the pronunciation of the vowel /a/. Nevertheless, the combination of the High Order Statistics (HOS) features stemming from the EMD-DWT two-stage analysis with the SVM classifier yields a robust system for the detection and classification of pathological voices, as confirmed by the ENT physicians of the Neurological department of the Rabta Hospital of Tunisia. However, the DWT features give lower detection and identification rates than the proposed HOS features, and than the results obtained in other works in which they were combined with another classifier (a multilayer perceptron). The proposed system outperforms existing methods in the literature in the classification task using the HOS features and can compete with recent frameworks based on deep learning
Table 6. Performance of the proposed method on pathology detection in the validation step.

Method     Specificity (%)   Sensitivity (%)   Accuracy (%)
HOS-SVM    96.66             92.85             94.82
DWT-SVM    90                92.85             91.37
approaches, which are more complicated and require much more time for training.

4. Conclusion

In this work, we evaluated the performance of the proposed high order statistic features, extracted from the wavelet domain, to discriminate between pathological and normal voices. Classical features such as the Mean Wavelet Value, Mean Wavelet Energy and Mean Wavelet Entropy were also tested. The voice signals were decomposed via a two-stage analysis process, EMD-DWT, before the feature vectors were extracted. The skewness and kurtosis combined with the variance show a clear ability to discriminate between normophonic and pathological subjects. These features combined with the SVM classifier reached the highest accuracies: 99.29% in the detection step and 100% when classifying “F. Dysphonia vs. Rekurrensparese”. Our proposal gives encouraging results in comparison with the state of the art. In order to add concrete scientific value, a clinical validation was performed on data collected from subjects of the Neurological Institute of the Rabta Hospital of Tunis. The obtained results were satisfactory, with accuracies of 94.82% and 94.44% for detection and classification, respectively. Deep learning methodologies have attracted great
Table 7. Performance of the proposed method on pathology identification in the validation step.

                                     HOS features                DWT features
Disease types                        Sp (%)   Sn (%)   Acc (%)   Sp (%)   Sn (%)   Acc (%)
Alzheimer vs. Parkinson              66.66    75       72.72     62.5     66.67    63.63
Alzheimer vs. Chronic laryngitis     85.71    100      90        71.42    66.67    70
Alzheimer vs. Paralysis              80       100      84.61     70       100      76.92
Chronic laryngitis vs. Parkinson     87.5     100      93.33     75       85.71    80
Chronic laryngitis vs. Paralysis     80       85.71    82.35     70       71.42    70.58
Parkinson vs. Paralysis              100      87.5     94.44     90       87.5     88.87

Sp: specificity, Sn: sensitivity, Acc: accuracy.
Fig. 9. Comparison of the performance of the developed framework in the test and validation steps.
interest in the last few years, although their application in the field of pathological voice remains limited. The few studies that combined deep learning methods with MFCC features have not yet achieved accuracies as high as classical machine learning systems. Therefore, as future work we will build on these results to improve the identification accuracy using deep learning methodologies.
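To summarize, the two-stage feature extraction used throughout this work (energy-based IMF selection, a DWT of the selected IMF, then six HOS features and six classical wavelet features from the approximation and detail coefficients) can be sketched as below. This is an illustrative reading using PyWavelets: the EMD step itself is not shown (the toy signals stand in for its IMF output), and the `db4` wavelet and single decomposition level are assumptions, not necessarily the paper's settings:

```python
import numpy as np
import pywt
from scipy.stats import skew, kurtosis

def select_imf(imfs):
    """Select the IMF with the highest temporal energy sum(x[n]^2)."""
    return imfs[int(np.argmax([np.sum(imf ** 2) for imf in imfs]))]

def hos_features(imf, wavelet="db4"):
    """Six HOS features: variance, skewness and kurtosis of the DWT
    approximation and detail coefficients of the selected IMF."""
    approx, detail = pywt.wavedec(imf, wavelet, level=1)
    return np.array([f(c) for c in (approx, detail)
                     for f in (np.var, skew, kurtosis)])

def classical_dwt_features(imf, wavelet="db4"):
    """Six classical features: mean wavelet value, mean wavelet energy and
    wavelet (Shannon) entropy of each sub-band."""
    feats = []
    for c in pywt.wavedec(imf, wavelet, level=1):
        p = c ** 2 / np.sum(c ** 2)  # normalized sub-band energy distribution
        feats += [np.mean(c), np.mean(c ** 2), -np.sum(p * np.log2(p + 1e-12))]
    return np.array(feats)

# Toy IMFs standing in for the EMD output of a sustained /a/ recording
t = np.linspace(0, 1, 1024)
imfs = [0.1 * np.sin(2 * np.pi * 200 * t), np.sin(2 * np.pi * 10 * t)]
best = select_imf(imfs)               # second component carries most energy
x_hos = hos_features(best)            # 6-dimensional HOS feature vector
x_dwt = classical_dwt_features(best)  # 6-dimensional classical feature vector
```

Either feature vector can then be fed to the SVM classifier described in the evaluation protocol.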
CRediT authorship contribution statement I. Hammami: Conceptualization, Data curation, Formal analysis, Writing - original draft, Validation. L. Salhi: Conceptualization, Data curation, Formal analysis, Supervision, Writing - original draft, Validation. S. Labidi: Conceptualization, Supervision, Writing - original draft, Validation.
Human and animal rights

The authors declare that the work described has been carried out in accordance with the Declaration of Helsinki of the World Medical Association, revised in 2013, for experiments involving humans, as well as in accordance with the EU Directive 2010/63/EU for animal experiments.

Informed consent and patient details

The authors declare that this report does not contain any personal information that could lead to the identification of the patient(s).

Funding

This work did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author contributions

All authors attest that they meet the current International Committee of Medical Journal Editors (ICMJE) criteria for Authorship.

Declaration of competing interest

The authors declare that they have no known competing financial or personal relationships that could be viewed as influencing the work reported in this paper.

References
[1] Al nasheri A, Muhammad G, Alsulaiman M, Ali Z, Malki KH, Mesallam TA, et al. Voice pathology detection and classification using auto-correlation and entropy features in different frequency regions. IEEE Access 2017. https://doi.org/10.1109/ACCESS.2017.2696056.
[2] Hossain MS, Muhammad G, Alamri A. Smart healthcare monitoring: a voice pathology detection paradigm for smart cities. Multimed Syst 2017;26:1–11.
[3] Panek D, Skalski A, Gajda J. Voice data mining for laryngeal pathology assessment. Comput Biol Med 2015;69:270–6. https://doi.org/10.1016/j.compbiomed.2015.07.026.
[4] Panek D, Skalski A, Gajda J, Tadeusiewicz R. Acoustic analysis assessment in speech pathology detection. Int J Appl Math Comput Sci 2015;25:631–43.
[5] Pravena D, Dhivya S, Durga Devi A. Pathological voice recognition for vocal fold disease. Int J Comput Appl 2012;47:31–7.
[6] Arias-Londono JD, Godino-Llorente JI, Saenz-Lechon N, Osma-Ruiz V, Castellanos-Dominguez G. Automatic detection of pathological voices using complexity measures, noise parameters, and mel-cepstral coefficients. IEEE Trans Biomed Eng 2011;58:370–9.
[7] Arias-Londono JD, Godino-Llorente JI, Markaki M, Stylianou Y. On combining information from modulation spectra and mel-frequency cepstral coefficients for automatic detection of pathological voices. Logop Phoniatr Vocol 2011;36:60–9.
[8] Amami R, Smiti A. An incremental method combining density clustering and support vector machines for voice pathology detection. Comput Electr Eng 2016;57:257–65. https://doi.org/10.1016/j.compeleceng.2016.08.021.
[9] Cordeiro H, Fonseca J, Guimarães I, Meneses C. Hierarchical classification and system combination for automatically identifying physiological and neuromuscular laryngeal pathologies. J Voice 2016;31:9–14. https://doi.org/10.1016/j.jvoice.2016.09.003.
[10] Mesallam TA, Farahat M, Malki KH, Alsulaiman M, Ali Z, Al nasheri A, et al. Development of the Arabic voice pathology database and its evaluation by using speech features and machine learning algorithms. J Healthc Eng 2017.
[11] Muhammad G, Mesallam TA, Malki KH, Farahat M, Mahmood A, Alsulaiman M. Multidirectional regression (MDR)-based features for automatic voice disorder detection. J Voice 2012;26:19–27.
[12] Ali Z, Elamvazuthi I, Alsulaiman M, Muhammad G. Automatic voice pathology detection with running speech by using estimation of auditory spectrum and cepstral coefficients based on the all-pole model. J Voice 2016;30:7–19.
[13] Moro-Velázquez L, Gómez-García JA, Godino-Llorente JI. Voice pathology detection using modulation spectrum-optimized metrics. Front Bioeng Biotechnol 2016. https://doi.org/10.3389/fbioe.2016.00001.
[14] Cordeiro HT, Fonseca JM, Ribeiro CM. LPC spectrum first peak analysis for voice pathology detection. Proc Technol 2013;9:1104–11. https://doi.org/10.1016/j.protcy.2013.12.123.
[15] Orozco-Arroyave JR, Belalcazar-Bolanos EA, Arias-Londono JD, Vargas-Bonilla JF, Skodda S, Rusz J, et al. Characterization methods for the detection of multiple voice disorders: neurological, functional, and laryngeal diseases. IEEE J Biomed Health Inform 2015;19:1820–8.
[16] Al nasheri A, Muhammad G, Alsulaiman M, Ali Z. Investigation of voice pathology detection and classification on different frequency regions using correlation functions. J Voice 2017;31:3–15. https://doi.org/10.1016/j.jvoice.2016.01.014.
[17] Fontes AIR, Souza PTV, Neto ADD, Martins AM, Silveira LFQ. Classification system of pathological voices using correntropy. Math Probl Eng 2014.
[18] Ali Z, Alsulaiman M, Elamvazuthi I, Muhammad G, Mesallam TA, Farahat M, et al. Voice pathology detection based on the modified voice contour and SVM. Biol Inspir Cognit Archit 2015;15:10–8. https://doi.org/10.1016/j.bica.2015.10.004.
[19] Muhammad G, Altuwaijri G, Alsulaiman M, Ali Z, Mesallam TA, Farahat M, et al. Automatic voice pathology detection and classification using vocal tract area irregularity. Biocybern Biomed Eng 2016;39:309–17.
[20] Muhammad G, Alsulaiman M, Ali Z, Mesallam TA, Farahat M, Malki KH, et al. Voice pathology detection using interlaced derivative pattern on glottal source excitation. Biomed Signal Process Control 2017;31:156–64.
[21] Fang C, Li H, Ma L, Zhang M. Intelligibility evaluation of pathological speech through multigranularity feature extraction and optimization. Comput Math Methods Med 2017.
[22] Al nasheri A, Muhammad G, Alsulaiman M, Ali Z, Mesallam TA, Farahat M, et al. Investigation of multidimensional voice program parameters in three different databases for voice pathology detection and classification. J Voice 2017;31:9–18. https://doi.org/10.1016/j.jvoice.2016.03.019.
[23] Muhammad G, Melhem M. Pathological voice detection and binary classification using MPEG-7 audio features. Biomed Signal Process Control 2014;11:1–9.
[24] Saeedi NE, Almasganj F. Wavelet adaptation for automatic voice disorders sorting. Comput Biol Med 2013;43:699–704.
[25] Saeedi NE, Almasganj F, Torabinejad F. Support vector wavelet adaptation for pathological voice assessment. Comput Biol Med 2011;41:822–8.
[26] Lee JL, Hahn M. Automatic assessment of pathological voice quality using higher-order statistics in the LPC residual domain. EURASIP J Adv Signal Process 2009. https://doi.org/10.1155/2009/748207.
[27] Alonso JB, De Leon J, Alonso I, Ferrer MA. Automatic detection of pathologies in the voice by HOS based parameters. EURASIP J Appl Signal Process 2001;4:247–84. https://doi.org/10.1155/S1110865701000336.
[28] Huang N, Shen Z, Long S, Wu M, Shih H, Zheng Q, et al. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc R Soc A, Math Phys Eng Sci 1998;454:903–95. https://doi.org/10.1098/rspa.1998.0193.
[29] Zao L, Coelho R, Flandrin P. Speech enhancement with EMD and Hurst-based mode selection. IEEE/ACM Trans Audio Speech Lang Process 2014;22:899–911. https://doi.org/10.1109/TASLP.2014.2312541.
[30] Agbinya JI. Discrete wavelet transform techniques in speech processing. IEEE Tencon Digit Signal Process Appl 1996;45:514–9. https://doi.org/10.1109/TENCON.1996.608394.
[31] Salhi L, Talbi M, Abid S, Cherif A. Performance of wavelet analysis and neural networks for pathological voices identification. Int J Electron 2011;98:1129–40.
[32] Nemer E, Goubran R, Mahmoud S. Robust voice activity detection using higher-order statistics in the LPC residual domain. IEEE Trans Speech Audio Process 2001;9:217–31.
[33] Dubuisson T, Dutoit T, Gosselin B, et al. On the use of the correlation between acoustic descriptors for the normal/pathological voices discrimination. EURASIP J Adv Signal Process 2009. https://doi.org/10.1155/2009/173967.
[34] Vapnik VN. The nature of statistical learning theory. New York: Springer Verlag; 2000.
[35] Samb ML, Camara F, Ndiaye S, Essghir MA. Approche de sélection d'attributs pour la classification basée sur l'algorithme RFE-SVM [An attribute-selection approach for classification based on the RFE-SVM algorithm]. ARIMA J 2014.
[36] Al Mojaly M, Muhammad G, Alsulaiman M. Detection and classification of voice pathology using feature selection. IEEE Comput Syst Appl 2014:571–7. https://doi.org/10.1109/AICCSA.2014.7073250.
[37] Barry WJ, Pützer M. Saarbruecken voice database. Available from: http://www.stimmdatenbank.coli.uni-saarland.de/.