Machine Learning Based Approach to Assess Denoised Speech


Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 159 (2019) 698–706

www.elsevier.com/locate/procedia

23rd International Conference on Knowledge-Based and Intelligent Information & Engineering Systems

Anis Ben Aicha(1,2)

(1) University of Carthage, Higher School of Communications of Tunis (SUPCOM), LR11TIC04 COSIM research Laboratory
(2) Faculty of Sciences of Bizerte (FSB), Tunisia

Abstract

Purpose: Many objective measures have been developed to assess speech in specific contexts such as speech coding, transmission, etc. Speech enhancement is one of the emerging speech technology applications. Yet, there is no objective standard to assess denoised speech. Indeed, the objective criteria were developed mainly to assess speech in specific contexts, i.e., speech coding, speech transmission, etc. They were not developed specifically to assess denoised speech. Some attempts based on a linear combination of existing measures have been proposed in the literature. Method: Even though such approaches offer promising performance, they reach their limits. To overcome these limits, we propose to address the problem from the point of view of classification instead of regression. The assessed speech utterance is divided into frames. Each frame is evaluated using conventional criteria. Hence, a feature vector is constructed. The objective assessment of denoised speech becomes a classical classification problem. Results: Unlike traditional criteria, the proposed method gives a meaningful objective score that is directly interpreted as an estimate of the real mean opinion score. Conclusion: It is shown in this study that it is possible to predict the subjective mean opinion score with acceptable sensitivity, specificity, precision and accuracy.

© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of KES International.

Keywords: Speech enhancement; Speech quality; Speech distortion; Background noise

E-mail address: [email protected]
1877-0509 © 2019 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2019.09.225

1. Introduction

Measuring speech quality is an important task for evaluating many recent speech applications such as telephony, telephony over IP, coding, watermarking and speech enhancement. The most reliable way to assess speech quality is the subjective test, which is standardized by the International Telecommunication Union (ITU). In such tests, listeners are asked to rate the speech quality on a five-point opinion scale ranging from ‘bad’ to ‘excellent’. To reduce the subjectivity of individual ratings, their mean value is computed; it is the well-known Mean Opinion Score (MOS). The MOS test is a heavy and time-consuming procedure. A faster and cheaper alternative is to estimate the speech quality with an algorithm. Many such algorithms have been developed in the literature. They can be classified into three groups according to the domain in which they operate: temporal, spectral or perceptual. Most objective criteria were developed to assess speech in a specific context. For example, the PESQ criterion was developed to evaluate speech over telecommunication systems [10]. For the specific case of speech enhancement, only a few attempts have been made in the literature, such as the composite criteria [8] or perceptual audible degradation [7]. The principle of these attempts is to combine existing objective measures linearly in order to obtain a new one that is better correlated with the subjective test. For conventional assessment criteria, the main difficulty is to obtain a subjective interpretation of the score, in other words the corresponding MOS. The errors related to the correlation between objective and subjective measures are not negligible, even for the latest criteria proposed in the area. This means that when we assess a denoised signal with an objective criterion, we cannot be sure of its corresponding subjective score. The techniques developed in [8, 7] rely mainly on a linear regression model to predict the MOS from objective scores. However, despite their promising performance, the composite criteria have reached their limits. The second intuitive approach is to combine conventional criteria nonlinearly. However, because the objective scores are widely scattered with respect to the subjective tests, nonlinear regression cannot resolve the problem either.
In this paper, we propose a novel method of denoised speech quality estimation. The process is done frame by frame: each frame of the denoised signal is assessed separately. A set of 10 objective criteria is used to compute 10 objective scores, which are considered as features characterizing the denoised frame. A Support Vector Machine (SVM) is trained on these features. To attribute a final MOS score, the mean value of all individual frame-level MOS estimates $\hat{mos}(n)$ is computed. Unlike traditional criteria, the proposed method gives a meaningful objective score that is directly interpreted as an estimate of the real MOS. The remainder of the paper is organized as follows. Section 2 presents an overview of speech enhancement methodology. In Section 3, we detail the state of the art of the conventional criteria for speech denoising assessment. In Section 4, we present the dataset used. Section 5 presents the limits of classical objective criteria. In Section 6, we detail our proposed approach based on frame-by-frame assessment. Experimental results are given in Section 7. Section 8 is reserved for conclusions and discussion.

2. Speech enhancement overview

Speech processing is one of the emerging technology applications. Many applications rely on speech processing techniques, such as speech transmission, human-machine command, biometric identification, speech recognition and speech coding [1, 2, 3]. However, speech recorded by a microphone usually contains various types of interference, such as ambient noise, reverberation and extraneous speakers’ utterances. Such interference seriously degrades the performance of speech applications. Hence, cleaning the speech is needed to improve the performance of speech-based applications. Without loss of generality, we treat in the current work the case of additive noise. The speech enhancement process consists of recovering the signal of interest, called the clean speech s(n), from the noisy observation y(n):

$$ y(n) = s(n) + b(n), \qquad (1) $$

where b(n) is the unwanted additive noise and n is the discrete time index.
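As an illustration of the additive model in equation (1), the following minimal sketch builds a noisy observation at a chosen input SNR. The function names `mix_at_snr` and `measured_snr_db`, the synthetic sine "speech" and the Gaussian noise are assumptions made for this example only; they are not taken from the paper or from the corpus it uses.

```python
import math
import random


def mix_at_snr(clean, noise, target_snr_db):
    """Return (noisy, scaled_noise) with noisy(n) = s(n) + g * b(n).

    The gain g is chosen so that the ratio of clean power to scaled-noise
    power equals the requested SNR in dB.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(b * b for b in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise must have non-zero power")
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    scaled_noise = [gain * b for b in noise]
    noisy = [s + b for s, b in zip(clean, scaled_noise)]
    return noisy, scaled_noise


def measured_snr_db(clean, noise):
    """Global SNR in dB between a clean signal and a noise signal."""
    return 10.0 * math.log10(sum(s * s for s in clean) / sum(b * b for b in noise))


# Synthetic stand-ins for s(n) and b(n) (illustrative only).
rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(800)]
noise = [rng.gauss(0.0, 1.0) for _ in range(800)]
noisy, scaled_noise = mix_at_snr(clean, noise, 5.0)
```

Returning the scaled noise alongside the mixture makes it easy to check that the requested SNR was actually reached.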

A generic enhancement block diagram is depicted in Figure 1. It is well known that the human speech signal is non-stationary. Hence, it cannot be studied with long-time approaches. Fortunately, speech can be considered quasi-stationary over a short duration, typically under 30 ms. Therefore, the observed speech is sliced into frames indexed by m. In the frequency domain, and assuming that clean speech and noise are uncorrelated, equation (1) can be written as

$$ Y(m, k) = S(m, k) + B(m, k), \qquad (2) $$

where k denotes the frequency index and Y(m, k), S(m, k) and B(m, k) are the Short-Time Fourier Transforms of y(m, n), s(m, n) and b(m, n), respectively. From the corrupted speech frame, the noise is estimated using noise estimation techniques [4, 5]. Generally, the denoising process can be seen as a filtering process:

$$ \hat{S}(m, k) = H(m, k)\, Y(m, k), \qquad (3) $$

where H(m, k) denotes the denoising filter. The noise reduction problem becomes one of finding an optimal filter that attenuates the noise as much as possible while keeping the clean speech from being dramatically distorted. Finally, an overlapping process is conducted to reconstruct the whole denoised signal $\hat{s}(n)$.

Fig. 1. Speech enhancement principle.
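The framing and reconstruction steps of this pipeline can be sketched as follows. This is a simplified illustration under stated assumptions: the frame length (20 ms at 8 kHz), the 50% overlap, the periodic Hann window and the helper names are choices made for the example, the overlapping process is assumed to be a standard overlap-add, and the denoising filter H(m, k) is replaced by an identity placeholder because the text does not fix a particular filter.

```python
import math


def hann_periodic(length):
    """Periodic Hann window; copies shifted by length/2 sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def split_frames(signal, frame_len, hop):
    """Slice the signal into windowed frames y(m, n)."""
    window = hann_periodic(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


def overlap_add(frames, hop, total_len):
    """Rebuild a signal by summing frames at their original positions."""
    out = [0.0] * total_len
    for m, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[m * hop + n] += value
    return out


fs = 8000                 # sampling rate in Hz (assumed)
frame_len = 160           # 20 ms, below the 30 ms quasi-stationarity bound
hop = frame_len // 2      # 50% overlap (assumed)
signal = [math.sin(2.0 * math.pi * 300.0 * n / fs) for n in range(1600)]

frames = split_frames(signal, frame_len, hop)
# Placeholder for the per-frame denoising of equation (3): identity filter.
processed = [list(frame) for frame in frames]
rebuilt = overlap_add(processed, hop, len(signal))
```

With the identity placeholder, every sample covered by two frames is recovered up to rounding error, which is a convenient sanity check before plugging in a real filter.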

3. Conventional assessment criteria

Many works in the literature assess speech quality for a specific purpose such as speech coding or speech transmission. Moreover, some standards have been developed by international organizations such as the International Telecommunication Union (ITU). However, only a few works address the specific case of denoised speech assessment [7, 8]. We present in the following the main relevant criteria for speech quality assessment.

3.1. Subjective tests

Evaluation of speech quality is a subjective process that depends on the listeners. Subjective evaluations are widely used in the context of speech communication to assess communication systems. To overcome such subjectivity, the ITU developed a standard, namely the mean opinion score (MOS) [6]. In this test, speech materials are played to a panel of a large number of listeners, who are asked to rate the global quality of the played signals, often using a five-point quality scale. The final score is obtained by averaging all registered ratings according to the rating scale given in Table 1.

Table 1. Description in the mean opinion score (MOS).

Rating   Speech quality   Level of degradation
5        Excellent        Imperceptible
4        Good             Just perceptible but not annoying
3        Fair             Perceptible and slightly annoying
2        Poor             Annoying but not objectionable
1        Unsatisfactory   Very annoying and objectionable
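The averaging step behind the MOS can be written in a few lines. The sketch below is illustrative only: the function name `mean_opinion_score` and the panel ratings are hypothetical, and the check on the rating values simply mirrors the five-point scale of Table 1.

```python
def mean_opinion_score(ratings):
    """Average a panel's ratings given on the five-point scale of Table 1."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        if rating not in (1, 2, 3, 4, 5):
            raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)


# Hypothetical ratings given by six listeners to one denoised utterance.
panel_ratings = [4, 3, 4, 5, 3, 4]
mos = mean_opinion_score(panel_ratings)
```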

3.2. Objective tests

Subjective assessment according to the MOS standard is a heavy and time-consuming process. Many objective criteria have been developed to evaluate speech quality algorithmically, without human intervention. Such criteria are widely used for the quality evaluation of communication systems. Among a long list, the following most relevant objective criteria are used in this paper [8, 7].

Temporal domain criterion: segmental SNR (SNRseg).
Spectral domain criteria: Log Likelihood Ratio (LLR), Log-Area Ratio (LAR), Itakura-Saito distortion measure (IS), Cepstral distance (CEP), Weighted-Slope Spectral distance (WSS) and frequency-weighted SNR (fwSNR).
Perceptual criteria: Modified Bark Spectral Distortion (MBSD), Perceptual Speech Quality Measurement (PSQM), Perceptual Evaluation of Speech Quality (PESQ), composite criterion Covl and Perceptual Signal to Audible Noise and Distortion Ratio (PSANDR).

3.3. Correlation between objective and subjective test

The most relevant way to assess speech quality is the subjective test, which we consider as the ground truth. All objective criteria seek to estimate the subjective MOS score as accurately as possible. To measure the relevance of an objective criterion, we compute the per-condition correlation R and the root mean square error (RMSE) between the score computed by the objective criterion and the one obtained by the subjective test. Estimating the true subjective MOS score from an objective score can be seen as a regression problem. To evaluate the precision of the regression fit, we compute the RMSE:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ f_O(i) - O(i) \right]^2}, \qquad (4) $$

where O (resp. $f_O$) is the objective criterion value (resp. the fitted function), i denotes the evaluation sequence index and N is the total number of sequences. We also calculate the per-condition correlation R between the objective criterion and its related fitted polynomial function. The coefficient R is computed according to Pearson’s correlation method [9]:

$$ R = \frac{\sum_{i=1}^{N} \left[ f_O(i) - \bar{f}_O \right] \left[ O(i) - \bar{O} \right]}{\sqrt{\sum_{i=1}^{N} \left[ f_O(i) - \bar{f}_O \right]^2}\, \sqrt{\sum_{i=1}^{N} \left[ O(i) - \bar{O} \right]^2}}, \qquad (5) $$


where $\bar{O}$ (resp. $\bar{f}_O$) is the average of O(i) (resp. $f_O(i)$). We recall that a criterion is good at predicting subjective tests if its RMSE is small and its per-condition correlation R is high.

4. Used material

Many speech corpora are available and designed for specific contexts such as speech recognition or speech coding. However, to the best of our knowledge, no dataset has been developed specifically for the assessment of denoising techniques. We use, in the current framework, the corpus delivered by Hu and Loizou [8]. The corpus was designed to evaluate speech enhancement algorithms. A total of 570 sentences is obtained from noisy signals corrupted by several kinds of noise (white, babble, car, factory and f16) with input SNR ranging from −5 dB to 25 dB. The denoising methods encompass four different classes of algorithms [8].

Spectral subtractive: multiband spectral subtraction, and spectral subtraction using reduced delay convolution and adaptive averaging.
Subspace: generalized subspace approach, and perceptually based subspace approach.
Statistical-model-based on minimum mean square error: mmse, log-mmse, and log-mmse under signal presence uncertainty.
Wiener-filtering type algorithms: the a priori SNR estimation based method, the audible-noise suppression method, and the method based on wavelet thresholding the multitaper spectrum.

All these files are subjectively evaluated using the standardized MOS methodology to get the corresponding subjective scores. Then, the files are classified into five classes according to their subjective MOS scores.

5. Limits of regression approach for estimating MOS measure

Since objective criteria are an alternative to the heavy MOS tests, the objective scores have to reflect the subjective MOS assessment. This task can be seen as a regression problem in which the estimated MOS, denoted $\hat{MOS}$, is obtained by a regression constructed from objective scores.
Without loss of generality, we consider the case of the PESQ measure, which is the ITU standard P.862 [10]. Figure 2 shows the scatter plot of MOS = f(PESQ) and the different regression models. Firstly, the dots are widely scattered without a specific trend. This observation shows that it is not easy to predict MOS values from PESQ scores. We performed linear, quadratic and cubic regressions. As shown in Figure 2, none of the regression models captures the relationship between MOS and PESQ scores. To quantify the inability of classical objective criteria to predict subjective tests, we have computed the root mean square error according to equation (4) for all tested criteria mentioned in Section 3.2. Table 2 reports the regression errors. As expected, high error levels are obtained for all tested criteria: the RMSE ranges from about 0.38 to 0.72 and exceeds 0.7 in some cases. Such values are not promising for modelling the subjective MOS test using objective criteria.

Fig. 2. Estimated MOS from PESQ measure using different regression model.
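The regression-based evaluation described above can be sketched for the linear case as follows: fit a line by ordinary least squares, then score the fit with the RMSE of equation (4) and the Pearson coefficient of equation (5). The sketch assumes that the fitted values are compared with the subjective scores, and the PESQ and MOS values are made-up numbers for illustration only, not data from the corpus.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares line y ~ a * x + b; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(fitted, target):
    """Root mean square error between fitted and target values (equation 4)."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(fitted, target)) / len(target))


def pearson_r(u, v):
    """Pearson correlation coefficient between two sequences (equation 5)."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mean_u) ** 2 for a in u)) * math.sqrt(
        sum((b - mean_v) ** 2 for b in v)
    )
    if den == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return num / den


# Made-up objective scores and subjective ratings (illustrative only).
pesq_scores = [1.2, 1.8, 2.1, 2.6, 3.0, 3.4]
mos_scores = [1.5, 2.4, 2.0, 3.1, 2.9, 3.8]

slope, intercept = fit_line(pesq_scores, mos_scores)
fitted_mos = [slope * p + intercept for p in pesq_scores]
fit_error = rmse(fitted_mos, mos_scores)
fit_corr = pearson_r(fitted_mos, mos_scores)
```

Quadratic and cubic fits follow the same pattern with a higher-order least-squares solver; only the linear case is shown to keep the sketch dependency-free.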

Table 2. Regression error of conventional objective criteria.

Criteria   Linear model   Quadratic model   Cubic model
PESQ       0.4724         0.4049            0.4025
SNR        0.5333         0.4625            0.4620
CEP        0.7154         0.7134            0.7127
FWSNR      0.3918         0.3823            0.3789
IS         0.7154         0.7153            0.7148
LAR        0.7026         0.6652            0.6329
LLR        0.6368         0.5553            0.5357
MBSD       0.6959         0.6879            0.6820
PSQM       0.6959         0.6086            0.6050
WSS        0.5468         0.5216            0.4896
COVL       0.4612         0.4520            0.4503
PSANDR     0.4456         0.4413            0.4391

These observations can be explained by the fact that the tested objective criteria were developed mainly to assess speech in specific contexts, i.e., speech coding, speech transmission, etc. They were not developed specifically to assess denoised speech. Even so, they are still used as objective criteria for denoised speech assessment. In the literature, some attempts have been proposed to combine them so as to construct a new criterion that is better correlated with the MOS test [8, 7]. A multiple linear regression is used to combine classical criteria. In [8], the COVL measure is proposed as follows:

$$ \mathrm{COVL} = 0.279 - 0.011 \cdot \mathrm{IS} + 1.137 \cdot \mathrm{PESQ} + 0.041 \cdot \mathrm{LLR} - 0.008 \cdot \mathrm{WSS}. \qquad (6) $$

In [7], a linear combination of the speech distortion criterion PSADR and the background noise criterion PSANR leads to the overall quality assessment measure PSANDR:

$$ \mathrm{PSANDR} = -0.0039 \cdot \mathrm{PSADR} + 0.1339 \cdot \mathrm{PSANR} + 2.4176. \qquad (7) $$
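Both composite measures are plain weighted sums, so they can be written directly from equations (6) and (7). The sketch below uses the coefficients exactly as printed in this section; the function names, argument names and the input values in the usage lines are hypothetical.

```python
def covl(is_dist, pesq, llr, wss):
    """Composite overall-quality measure with the coefficients of equation (6)."""
    return 0.279 - 0.011 * is_dist + 1.137 * pesq + 0.041 * llr - 0.008 * wss


def psandr(psadr, psanr):
    """Overall-quality measure with the coefficients of equation (7)."""
    return -0.0039 * psadr + 0.1339 * psanr + 2.4176


# Hypothetical objective scores for one denoised utterance.
covl_score = covl(is_dist=10.0, pesq=2.5, llr=0.5, wss=40.0)
psandr_score = psandr(psadr=20.0, psanr=5.0)
```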

These proposed criteria perform better than the isolated conventional criteria. The accuracy of predicting the subjective test is measured by the correlation coefficient. Using equation (5), we have computed the R coefficient for all tested criteria, including the composite ones. The best R coefficients are obtained for the composite criterion COVL, with R = 0.68, and the PSANDR criterion, with R = 0.81.

6. Frame by frame speech quality assessment approach

6.1. Idea and motivation

As detailed in the previous section, the main attempts to construct an objective criterion for the specific case of speech denoising are based on the combination of existing objective criteria. The idea behind such approaches is to profit from the individual correlation of the conventional criteria in order to build a new one that is better correlated with the subjective MOS test. As shown in Section 5, a simple regression can improve the accuracy of the new criteria, making them more suitable for the speech denoising context. We think that a linear combination of existing objective criteria is a strong assumption. In fact, even in the case of a single criterion (see Figure 2 as an example), it is hard to assume a simple regression to predict MOS from PESQ scores. We think that a nonlinear combination of existing objective criteria can take into account the specificity of each criterion, and that the problem can be seen as a classification task. The idea of the whole proposed approach is given in Figure 3.

6.2. Proposed approach

Since we approach speech assessment as a classification problem, we have transformed the available dataset. The subjective MOS, which constitutes the target of the classification process in our case, ranges continuously over the interval [1, 5]. This is suitable for regression but not for classification. Hence, we discretise the MOS values into 5 levels: the MOS score of an utterance “i” is reduced to the nearest integer value.
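The discretisation step can be sketched as follows. The text does not say how exact halves such as 3.5 are handled; this sketch assumes they round up, and the function name `discretise_mos` is hypothetical. Python's built-in `round` is avoided on purpose because it rounds halves to the nearest even integer.

```python
import math


def discretise_mos(mos):
    """Map a continuous MOS in [1, 5] to the nearest class in {1, ..., 5}.

    Assumption: exact halves (e.g. 3.5) are rounded up.
    """
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must lie in the interval [1, 5]")
    return min(5, max(1, math.floor(mos + 0.5)))


# Hypothetical continuous MOS values and their class labels.
continuous_scores = [1.0, 2.5, 3.4, 3.5, 4.49, 5.0]
class_labels = [discretise_mos(score) for score in continuous_scores]
```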

Fig. 3. Proposed approach to assess denoised speech signal.

The available dataset is divided into a learning set and a test set. From the learning set, we evaluate the denoised speech signals using conventional criteria. The objective evaluation can be seen as a comparison between the original clean speech and the denoised one. As mentioned before, and due to the non-stationarity of the speech signal, each speech utterance “i” is divided into frames “n”. Each frame “n” is assessed individually using conventional criteria. The obtained scores are considered as features characterizing frame “n” of utterance “i”; they are represented by the vector $F_i(n)$. The subjective score $MOS_i$ corresponding to the vector $F_i(n)$ is known and constitutes the target of the classification model. We have conducted the same procedure for the whole learning dataset. The obtained data are fed to a classical machine learning algorithm in order to build a classification model. Once the model is trained, we can evaluate any denoised speech according to the same methodology as in the learning phase. A feature vector is computed for each frame of the tested sequence “j” and a final feature matrix $F_j$ is built. In the current framework, we suppose that the scores obtained for the frames are independent. Hence, it is possible to get different MOS scores for the frames of the same tested denoised speech. To overcome this problem, we propose to post-process the classification output using a voting approach: the final score of the whole tested sequence is obtained as the mean of all $MOS_j(n)$.

7. Experimental results

7.1. Regression point of view

The proposed method yields an estimate $\hat{MOS}_j$ of the subjective score of a denoised speech sequence “j”. The real subjective $MOS_j$ of the sequence “j” constitutes the ground truth. $MOS_j$ can be obtained by a regression on $\hat{MOS}_j$. Without loss of generality, we have used the Support Vector Machine (SVM) method to perform the classification. We have performed several experiments in order to determine the best kernel function in terms of classification accuracy. The Gaussian radial kernel is adopted, and the remaining experiments are conducted with the SVM technique using this kernel. We evaluate the precision of the regression fit in terms of RMSE according to equation (4). Table 3 reports the RMSE results. The proposed method performs better than the two tested baselines. An improvement of about 38% in regression accuracy is obtained with the cubic model compared to the COVL measure.

Table 3. Regression error of the proposed approach and conventional objective criteria.

Criteria   Linear model   Quadratic model   Cubic model
COVL       0.4612         0.4520            0.4503
PSANDR     0.4456         0.4413            0.4391
Proposed   0.2956         0.2884            0.2790
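The data handling around the classifier described in Section 6.2 can be sketched as two small steps: every frame of a training utterance inherits the utterance's MOS class as its label, and at test time the frame-level predictions are averaged into one score. The classifier itself (an SVM with a Gaussian radial kernel in the paper) is deliberately left out; the function names, the two-feature vectors and the predicted classes below are hypothetical.

```python
from statistics import mean


def label_frames(frame_features, utterance_class):
    """Pair each frame feature vector F_i(n) with its utterance's MOS class."""
    return [(list(features), utterance_class) for features in frame_features]


def utterance_score(frame_predictions):
    """Final score of a sequence: mean of its frame-level predicted MOS."""
    if not frame_predictions:
        raise ValueError("at least one frame prediction is required")
    return mean(frame_predictions)


# Hypothetical utterance with three frames and two objective scores per frame
# (the paper uses 10 criteria per frame).
features_i = [[2.1, 0.8], [2.4, 0.7], [1.9, 1.1]]
training_rows = label_frames(features_i, utterance_class=3)

# Hypothetical frame-level classes returned by a trained classifier.
predicted_classes = [3, 4, 3, 3]
final_score = utterance_score(predicted_classes)
```

Averaging keeps the final score continuous even though each frame prediction is an integer class, which is what allows it to be compared with the subjective MOS by regression.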

Once the objective score over the test dataset has been calculated using the proposed method, the correlation coefficient R is computed according to equation (5). The closer R is to one, the more accurate the objective criterion is in terms of subjective score prediction. Table 4 reports the R coefficient for the two baselines and for the proposed method. The R coefficient is improved with the proposed approach compared to the two baselines. This means that the proposed criterion is more accurate in evaluating the overall quality of the denoised speech.

Table 4. Correlation coefficient R with the subjective MOS test.

     COVL   PSANDR   Proposed
R    0.68   0.78     0.87

7.2. Classification point of view

The ground truth MOS scores are adjusted so that the evaluation of a denoised speech utterance is an integer belonging to {1, 2, 3, 4, 5}. Hence, the results of the experiments are expressed in terms of sensitivity, specificity, precision and accuracy. Table 5 presents the experimental results. An acceptable accuracy of 0.91 is achieved using the proposed method.

Table 5. Classification performances in terms of sensitivity, specificity, precision and accuracy with the SVM technique.

           Sensitivity   Specificity   Precision   Accuracy
Proposed   0.86          0.88          0.92        0.91
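For a five-class problem, sensitivity, specificity and precision have to be computed per class and then combined. The text does not state how this was done; the sketch below assumes one-vs-rest counts with a macro average over classes, and all names and the small label lists are hypothetical.

```python
def one_vs_rest_counts(y_true, y_pred, label):
    """Return (tp, fp, fn, tn) when `label` is treated as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == label and pred == label:
            tp += 1
        elif truth != label and pred == label:
            fp += 1
        elif truth == label and pred != label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(num, den):
    """Ratio, or None when the denominator is zero (class metric undefined)."""
    return num / den if den else None


def _average(values):
    """Mean of the defined values; NaN if none is defined."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept) if kept else float("nan")


def macro_metrics(y_true, y_pred, labels=(1, 2, 3, 4, 5)):
    """Macro-averaged sensitivity, specificity, precision and overall accuracy."""
    sens, spec, prec = [], [], []
    for label in labels:
        tp, fp, fn, tn = one_vs_rest_counts(y_true, y_pred, label)
        sens.append(_ratio(tp, tp + fn))
        spec.append(_ratio(tn, tn + fp))
        prec.append(_ratio(tp, tp + fp))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "sensitivity": _average(sens),
        "specificity": _average(spec),
        "precision": _average(prec),
        "accuracy": accuracy,
    }


# Hypothetical two-class example to keep the arithmetic easy to follow.
report = macro_metrics([1, 1, 2, 2], [1, 2, 2, 2], labels=(1, 2))
```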

8. Conclusions

In the current framework, we have shown the limits of conventional criteria regarding denoised speech assessment. A new method of denoised speech assessment is presented. Two baselines are considered, both based on the linear regression approach. We have shown that the simple intuitive idea of a linear regression of existing objective criteria to predict the quality of denoised speech suffers from some limits. The obtained accuracy, although fair, is still insufficient to predict the mean opinion score accurately. In the current work, the problem of denoised speech assessment is addressed differently. Instead of the regression approach, the problem is formulated as a classification problem. The SVM technique is used as the machine learning tool. Without loss of generality, any other classical machine learning technique can be used. Experimental results show the validity of the proposed technique. The MOS scores predicted by the baseline techniques and by the proposed one are compared in terms of the correlation coefficient R. The best correlation coefficient, 0.87, is obtained with the proposed technique, and a gain of 11% is achieved compared to the baseline techniques.


References

[1] T. Ogunfunmi, R. Togneri and M. Narasimha, "Speech and audio processing for coding, enhancement and recognition," Springer, 2015.
[2] A. Ben Aicha and S. Ben Jebara, "Reduction of musical residual noise using perceptual tools with classic speech denoising techniques," Signal, Image and Video Processing, vol. 6, no. 1, pp. 85-97, 2012.
[3] D. S. Williamson and D. Wang, "Time-frequency masking in the complex domain for speech dereverberation and denoising," IEEE Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1492-1501, 2017.
[4] S. Rangachari and P. Loizou, "A noise-estimation algorithm for highly non-stationary environments," Speech Communication, vol. 48, pp. 220-231, 2006.
[5] A. Ben Aicha, "Noise estimation for speech enhancement algorithms with post-smoothness processor incorporating global posterior SNR," Multimedia Tools and Applications, vol. 76, no. 22, pp. 23661-23678, 2017.
[6] ITU-T P.835, Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm, ITU-T Recommendation P.835, 2003.
[7] A. Ben Aicha and S. Ben Jebara, "Perceptual speech quality measures separating speech distortion and additive noise degradations," Speech Communication, vol. 54, no. 4, pp. 517-528, 2012.
[8] Y. Hu and P. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 1, pp. 229-238, 2008.
[9] S. Dimolitsas, "Objective speech distortion measures and their relevance to speech quality assessments," Proceedings of the IEEE, vol. 136, no. 5, 1984.
[10] Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, ITU-T Recommendation P.862, 2000.