Available online at www.sciencedirect.com
Speech Communication 54 (2012) 517–528 www.elsevier.com/locate/specom
Perceptual speech quality measures separating speech distortion and additive noise degradations
Anis Ben Aicha ⇑, Sofia Ben Jebara
École Supérieure des Communications de Tunis, Research unit TECHTRA, University of Carthage, Route de Raoued 3.5 Km, Cité El Ghazala, Ariana 2083, Tunisia
Received 19 June 2009; received in revised form 24 November 2011; accepted 28 November 2011
Available online 13 December 2011
Abstract
In this paper, novel perceptual criteria measuring speech distortion, additive noise and overall quality are presented. Based on the masking concept, they are built to measure only the audible degradations perceived by the human ear. The class of perceptual equivalence (CPE) is introduced, which makes it possible to specify the nature of the degradations affecting denoised speech. The CPE is defined in the frequency domain using perceptual tools and is limited by two curves: the upper bound of perceptual equivalence (UBPE) and the lower bound of perceptual equivalence (LBPE). Denoised speech components belonging to this class are perceptually equivalent to the clean speech components; otherwise audible degradations are noticed. Based on this concept, new perceptual criteria are developed to assess denoised speech signals. After the criteria are introduced and explained, they are validated by examining their relationship, in terms of scatter plots and Pearson correlation, with ITU-T recommendation P.835, which specifies three subjective tests to evaluate independently the speech distortion (SIG), the residual background noise (BAK) and the overall quality (MOS). Moreover, the proposed criteria are compared with conventional criteria, indicating an improved ability to predict subjective tests.
© 2011 Elsevier B.V. All rights reserved.
Keywords: Upper bound of perceptual equivalence; Lower bound of perceptual equivalence; Class of perceptual equivalence; Objective criteria
Contents
1. Introduction
2. Speech quality assessment overview
   2.1. Overall perceptual measures
      2.1.1. Subjective measure
      2.1.2. Objective measures
   2.2. Measures separating noise and speech distortion
      2.2.1. Subjective measures
      2.2.2. Objective measures
3. Motivation of the proposed criteria
   3.1. Background
   3.2. Motivation
   3.3. Masking threshold overview
4. Perceptual characterization of degradations affecting denoised speech
   4.1. Perceptual characterization of audible noise
   4.2. Perceptual characterization of audible distortion
⇑ Corresponding author. Tel.: +216 72391260.
E-mail addresses:
[email protected] (A.B. Aicha), sofi
[email protected] (S.B. Jebara). 0167-6393/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2011.11.005
   4.3. Usefulness of UBPE and LBPE
5. New proposed criteria
   5.1. Perceptual signal to audible noise ratio (PSANR) and Perceptual signal to audible distortion ratio (PSADR)
   5.2. Perceptual signal to audible noise and distortion ratio (PSANDR)
   5.3. Combination of proposed measures with conventional criteria
6. Scatter plot and preliminary results
   6.1. Speech corpus
   6.2. Scatter plots
   6.3. Per-condition correlation and root mean square error
7. Correlation of objective measures with subjective tests
   7.1. Definition
   7.2. Results
8. Conclusion
References
A.B. Aicha, S.B. Jebara / Speech Communication 54 (2012) 517–528
1. Introduction
In recent speech communication applications such as wireless telephony, hands-free telecommunication, and car and mobile phones, it is often mandatory to reduce the environmental noise in order to improve speech quality. A speech denoising algorithm can be viewed as successful if it suppresses perceivable background noise and preserves the perceived signal quality without distortion. Generally, the most reliable way to measure denoising quality is through subjective listening tests. Although accurate, these tests are unsuitable for online applications since they are time consuming and expensive to perform. Objective speech quality measures have been introduced to predict the subjective speech quality (Quanckenbush et al., 1988). Most of them were developed to evaluate the overall quality, which includes speech distortion and residual background noise. Examples of such objective measures are the segmental signal to noise ratio (SNRseg) (Hansen and Pellom, 1998), the weighted-slope spectral distance (WSS) (Klatt, 1982), the perceptual evaluation of speech quality (PESQ) (Perceptual evaluation of speech quality (PESQ), 2000; Rix et al., 2001), LPC-based objective measures including the log-likelihood ratio (LLR), the Itakura–Saito distance measure (IS) and cepstrum distance measures (CEP) (Quanckenbush et al., 1988), and the frequency-weighted segmental SNR (fwSNRseg) (Tribolet et al., 1978).
Nowadays, quality evaluation of denoised speech signals is moving towards a more precise judgement of the type of perceived degradation. Indeed, during listening tests, some people clearly prefer a lowered background noise level, while others tolerate slight distortions caused by the denoising process.
For this purpose, the more recent ITU-T recommendation P.835 (Subjective test methodology, 2003) instructs the listeners to rate three different components: the speech signal alone, the background content alone and the overall speech plus noise content. At the moment, there is no standard method for objectively separating speech distortion and residual noise. Moreover, to the best of our knowledge, few works have
considered objective criteria based on degradation discrimination. The first attempts are based on linear combinations of well known existing criteria to build two criteria for evaluating speech distortion and residual noise separately (see for example Dreiseitel, 2001 and Hu and Loizou, 2006a). We propose in this work to build new criteria that assess denoised speech more accurately than existing measures.
In this paper, our recent advances in the objective separation of speech degradations are described. More precisely, two criteria for measuring audible noise and audible distortion are presented, and a composite criterion, derived from them, which measures the overall quality is introduced. They are respectively called the perceptual signal to audible noise ratio (PSANR), the perceptual signal to audible distortion ratio (PSADR) and the perceptual signal to audible noise and distortion ratio (PSANDR). To improve the correlation with subjective tests, we have combined the developed measures with conventional criteria to define new ones: composite PSADR (CPSADR), composite PSANR (CPSANR) and composite PSANDR (CPSANDR). These criteria, based on the auditory properties of the human ear, measure only audible degradations. Indeed, they use the masking concept to decide on the audibility of each kind of degradation. The proposed criteria take the form of a ratio between audible signal and audible degradation.
The paper is organized as follows. Section 2 presents an overview of speech evaluation criteria from three points of view: signal distortion, residual background noise and overall quality. In Section 3, we present our motivation and the tools used for the perceptual characterization of degradations affecting denoised speech. In Section 4, we detail the perceptual characterization of audible degradations. Section 5 presents the proposed criteria.
In Section 6, we study the relationship between the perceptual measures and the subjective measures specified by ITU-T recommendation P.835. In Section 7, we compare the correlation coefficients of objective measures with subjective tests, focusing on the ability of each proposed measure to predict
the subjective quality. Finally, Section 8 is devoted to the conclusion.
2. Speech quality assessment overview
In this section, recent advances in subjective and objective speech quality measures are reviewed. These criteria, which take human auditory perception into account, are classified into two categories. The first contains global measures, which are essentially built to estimate the overall speech quality, and the second contains criteria which separate speech distortion and background noise. Since we are interested in the evaluation of enhanced speech signals, we give an overview of recent criteria developed for this purpose.
2.1. Overall perceptual measures
2.1.1. Subjective measure
The commonly used subjective test for evaluating speech quality is the mean opinion score (MOS) (Methods for subjective determination of transmission quality, 1996). In this test, speech materials are played to a large panel of listeners, who are asked to rate the global quality of the played signals, often using a five-point quality scale. The final indicator is usually expressed as the average of all the rating scores given by the subjects. The rating scale employed in MOS testing is shown in Table 1, along with a general description of the level of speech quality typically associated with each numerical score. The MOS test procedure requires lengthy subjective testing; it is very expensive and time consuming (Hu and Loizou, 2006b, 2007). Hence, the automatic prediction of the MOS score directly from the enhanced speech signals, without human subjects, could be of great practical value. Many objective criteria, well correlated with MOS, have been proposed to estimate speech quality at low cost.
2.1.2. Objective measures
In this paper, we have selected two of the recent objective criteria developed to assess enhanced speech signals: the modified PESQ and the composite measures (Hu and Loizou, 2006a, 2008).

Table 1. Description of the mean opinion score (MOS) rating scale.
Rating   Speech quality   Level of degradation
5        Excellent        Imperceptible
4        Good             Just perceptible but not annoying
3        Fair             Perceptible and slightly annoying
2        Poor             Annoying but not objectionable
1        Unsatisfactory   Very annoying and objectionable

These criteria are found to correlate better with the subjective MOS test than existing objective criteria such as PESQ, LLR, WSS and CEP. This is mainly because the latter were developed to assess speech
signals in contexts different from speech enhancement (Rix et al., 2006). For example, the PESQ measure is optimized to evaluate speech processed through networks, whereas the modified Bark spectrum distortion (MBSD) measure is built to evaluate speech coders (Yang et al., 1998). We consider the modified PESQ and the composite criterion for testing and comparison with our proposed measures. They are outlined as follows.

PESQOVL: modified PESQ for evaluating overall quality
ITU-T recommendation P.862, known as the PESQ measure, was developed to evaluate the global quality of speech over handset telephony and narrowband speech coders (Perceptual evaluation of speech quality (PESQ), 2000; Rix et al., 2001). The PESQ score is computed as a linear combination of the average disturbance value $D_{ind}$ and the average asymmetrical disturbance value $A_{ind}$ as follows:

$$\mathrm{PESQ} = a_0 + a_1 D_{ind} + a_2 A_{ind}, \tag{1}$$
where $a_0 = 4.5$, $a_1 = -0.1$ and $a_2 = -0.0309$. These parameter values were optimized for speech processed through networks and not for evaluating enhanced speech signals. In Hu and Loizou (2008), the authors propose to optimize the linear parameters $a_0$, $a_1$ and $a_2$ for a specific application: overall quality assessment of speech enhanced by noise suppression algorithms. Multiple linear regression analysis was used to determine the parameters, with the values of $D_{ind}$ and $A_{ind}$ taken as independent variables. This yields the modified PESQ for overall quality assessment (PESQOVL):

$$\mathrm{PESQ}_{OVL} = 4.788 - 0.152\, D_{ind} - 0.016\, A_{ind}. \tag{2}$$
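As a minimal sketch of how the linear mappings of Eqs. (1) and (2) can be evaluated, the snippet below wraps the three weights in a small container. The class name `PesqWeights`, the function `pesq_score` and the example disturbance values are hypothetical, and the negative signs of the disturbance weights are an assumption (a larger disturbance is taken to lower the score).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PesqWeights:
    """Weights (a0, a1, a2) of the linear PESQ mapping in Eq. (1)."""
    a0: float
    a1: float
    a2: float


# Assumption: both disturbance weights are negative, so that more
# disturbance gives a lower predicted quality.
P862_WEIGHTS = PesqWeights(a0=4.5, a1=-0.1, a2=-0.0309)         # Eq. (1)
PESQ_OVL_WEIGHTS = PesqWeights(a0=4.788, a1=-0.152, a2=-0.016)  # Eq. (2)


def pesq_score(d_ind: float, a_ind: float, weights: PesqWeights) -> float:
    """Map the average (D_ind) and asymmetrical (A_ind) disturbances to a score."""
    return weights.a0 + weights.a1 * d_ind + weights.a2 * a_ind


# Hypothetical disturbance values, for illustration only.
example_ovl = pesq_score(d_ind=5.0, a_ind=20.0, weights=PESQ_OVL_WEIGHTS)
```

Keeping the weights separate from the mapping makes it straightforward to swap in another set of regression weights without touching the scoring function.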
COVL: composite criterion for overall quality assessment
Since there is not yet a standard for evaluating the quality of speech signals enhanced by noise suppression algorithms, many studies have combined existing criteria to obtain new ones better suited to the speech enhancement context (Quanckenbush et al., 1988; Dreiseitel, 2001; Hu and Loizou, 2006a, 2008). As mentioned earlier, composite measures are necessary because we cannot expect a high correlation of conventional objective measures (e.g., PESQ, LLR, WSS) with the overall quality of enhanced speech signals. The basic idea of a composite measure is to combine some known conventional criteria to obtain a new one that is more correlated with the subjective assessment of the overall quality. Hence, the first step consists in finding the criteria most correlated with the subjective test. Then, the composite criterion for overall quality assessment is obtained by a linear combination of the selected criteria. In this context, it is worth pointing out the recent work of Hu and Loizou (2006a, 2008). After a large study of conventional measures and their correlation with the MOS test, the IS, PESQ, LLR and WSS criteria are selected and combined in order to derive a new composite criterion for overall quality assessment, COVL. The composite measure COVL is derived using multiple linear regression analysis. It is given by Loizou (2007):

$$C_{OVL} = 0.279 - 0.011\,\mathrm{IS} + 1.137\,\mathrm{PESQ} + 0.041\,\mathrm{LLR} - 0.008\,\mathrm{WSS}. \tag{3}$$
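The multiple linear regression step used to derive such composite measures can be sketched as follows. This is only an illustration under stated assumptions: the helper names `fit_composite` and `predict_composite` are hypothetical, the data are synthetic, and ordinary least squares via `numpy.linalg.lstsq` is used as a stand-in for whatever regression tool the cited authors employed.

```python
import numpy as np


def fit_composite(measures: np.ndarray, mos: np.ndarray) -> np.ndarray:
    """Least-squares fit of mos ~ c0 + sum_i c_i * measures[:, i].

    measures: (n_signals, n_measures) conventional objective scores.
    mos:      (n_signals,) subjective ratings.
    Returns the intercept followed by one weight per measure.
    """
    design = np.column_stack([np.ones(len(mos)), measures])
    coeffs, *_ = np.linalg.lstsq(design, mos, rcond=None)
    return coeffs


def predict_composite(coeffs: np.ndarray, measures: np.ndarray) -> np.ndarray:
    """Evaluate the fitted composite criterion on new objective scores."""
    return coeffs[0] + np.asarray(measures) @ coeffs[1:]


# Synthetic illustration: four hypothetical objective scores per signal and
# ratings generated from known weights, which the fit should recover.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 5.0, size=(200, 4))
true_coeffs = np.array([1.0, -0.2, 0.8, -0.3, -0.01])
ratings = true_coeffs[0] + scores @ true_coeffs[1:]
fitted = fit_composite(scores, ratings)
```

With real listening-test data the ratings are noisy, so the fitted weights only approximate an underlying relationship and should be validated on held-out signals.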
2.2. Measures separating noise and speech distortion
2.2.1. Subjective measures
Speech enhancement techniques are generally used for noise suppression. However, these methods introduce unavoidable degradations (Chetouani et al., 2007; Benesty et al., 2008). Recall that noisy speech is composed of the clean speech and a background noise, and that speech enhancement techniques affect both components. Hence, the enhanced speech signal is composed of a distorted version of the clean speech and the residual background noise. These two components are perceptually different and are not perceived in the same manner (Subjective test methodology, 2003; Hu and Loizou, 2008). In fact, during listening tests, some people clearly prefer a lowered background noise level, while others tolerate slight speech distortion. ITU-T recommendation P.835 was designed to reduce the listener's uncertainty, in subjective listening tests, about the nature of the degradation (speech distortion, background noise or both). This method instructs the listeners to successively attend to and rate the enhanced speech signal on:
– the speech signal alone, using a five-point scale of signal distortion (SIG);
– the background alone, using a five-point scale of background intrusiveness (BAK);
– the overall quality, using the scale of the mean opinion score (OVL) described in Table 1.
The SIG and BAK scales are described in Table 2 (Subjective test methodology, 2003).

Table 2. Description of the SIG and BAK scales used in the subjective listening tests.
Rating   Description
SIG scale
5        Very natural, no degradation
4        Fairly natural, little degradation
3        Somewhat natural, somewhat degraded
2        Fairly unnatural, fairly degraded
1        Very unnatural, very degraded
BAK scale
5        Not noticeable
4        Somewhat noticeable
3        Noticeable but not intrusive
2        Fairly conspicuous, somewhat intrusive
1        Very conspicuous, very intrusive

2.2.2. Objective measures
To estimate the subjective SIG and BAK scores quickly and at low cost, objective measures are needed. To the best of our knowledge, only a few such measures have been formulated (Dreiseitel, 2001; Hu and Loizou, 2006a). In this paper, we have selected the recent composite criteria developed in Hu and Loizou (2008) for the test. The chosen criteria are outlined as follows.

PESQSIG: modified PESQ for estimating SIG test
As noted for Eq. (1), the linear parameters $a_0$, $a_1$ and $a_2$ are not optimized to assess speech distortion. Hence, Hu and Loizou propose to optimize them in order to build a new measure better suited to speech distortion assessment (Hu and Loizou, 2008). In the same manner as for PESQOVL, linear regression yields a criterion for speech distortion evaluation, PESQSIG:

$$\mathrm{PESQ}_{SIG} = 4.959 - 0.191\, D_{ind} - 0.006\, A_{ind}. \tag{4}$$
PESQBAK: modified PESQ for estimating BAK test
The same methodology used to build PESQSIG is adopted to construct the modified PESQ for residual background noise evaluation, PESQBAK (Hu and Loizou, 2008):

$$\mathrm{PESQ}_{BAK} = 5.336 - 0.082\, D_{ind} - 0.058\, A_{ind}. \tag{5}$$
CSIG: composite criterion for SIG estimation
The basic idea of the composite criterion for speech distortion estimation is to exploit the most correlated conventional measures to build a new one better suited to speech distortion evaluation. As for COVL, linear regression yields the composite criterion for speech distortion evaluation, CSIG (Hu and Loizou, 2006a, 2008):

$$C_{SIG} = 2.164 - 0.02\,\mathrm{IS} + 0.832\,\mathrm{PESQ} - 0.494\,\mathrm{CEP} + 0.352\,\mathrm{LLR}. \tag{6}$$

CBAK: composite criterion for BAK estimation
To build a new criterion for residual background noise evaluation, CBAK, Hu and Loizou combine the criteria PESQ, CEP, LLR and WSS (Hu and Loizou, 2006a, 2008):

$$C_{BAK} = 0.985 + 0.848\,\mathrm{PESQ} - 0.319\,\mathrm{CEP} + 0.295\,\mathrm{LLR} - 0.008\,\mathrm{WSS}. \tag{7}$$
3. Motivation of the proposed criteria
3.1. Background
Without loss of generality, in this study we focus on the speech denoising application to define the different kinds of degradation affecting the speech signal. We consider spectral denoising approaches, which can be viewed as a multiplication of the short time Fourier transform (STFT) of the noisy speech $Y(m,k)$ by a linear filter $H(m,k)$. The STFT of the denoised speech $\hat{S}(m,k)$ is written as follows:

$$\hat{S}(m,k) = H(m,k)\, Y(m,k), \tag{8}$$
where $m$ (resp. $k$) denotes the frame index (resp. the frequency index). Generally, speech and noise are assumed to be mutually uncorrelated. Thus, the power spectral density (PSD) of the error between clean and denoised speech, $\Gamma_n(m,k)$, is given by:

$$\Gamma_n(m,k) = [H(m,k) - 1]^2\, \Gamma_S(m,k) + H(m,k)^2\, \Gamma_N(m,k), \tag{9}$$

where $\Gamma_S(m,k)$ (resp. $\Gamma_N(m,k)$) denotes the speech PSD (resp. the noise PSD). Since the filter $H(m,k)$ is used to reduce the amount of noise in the observed signal, its amplitude is less than one: $0 < H(m,k) < 1$. Consequently, the first term of Eq. (9) expresses the 'attenuation' of the clean speech frequency components. Such degradation is perceptually heard as clean speech distortion. The second term reflects the residual noise, which is perceptually heard as background noise. Since it is additive, it can be regarded as an 'accentuation' of the speech frequency components.
3.2. Motivation
We aim to perceptually characterize the degradations affecting denoised speech. Hence, the auditory properties of the human ear are considered. More precisely, the masking concept is used: a masked signal is made inaudible by a masker if the masked signal magnitude is below the perceptual masking threshold (MT) (Zwicker and Fast, 1990). In our case, both degradations (speech distortion and residual background noise) can be audible or inaudible according to their position with respect to the masking threshold. We propose decision rules for the audibility of residual noise and speech distortion based on the masking threshold concept. There are many techniques to compute the masking threshold MT. In this paper we use Johnston's model, which is popular thanks to its simplicity (Johnston, 1988).
3.3. Masking threshold overview
The physical interpretation of Johnston's model and its different steps are outlined as follows (Johnston, 1988).
– Spectral analysis: after subdivision of the speech signal into frames and multiplication by the Hanning window, the power spectrum is computed.
– Critical band analysis: it is well known that frequency components, in each critical band (CB), are equally perceived by the human ear (Painter and Spanias, 2000).
Hence, the spectrum components of each critical band are added up to get a discrete Bark spectrum.
– Convolution with a spreading function: any acoustic stimulus reaching the human ear is transmitted as vibrations to the cochlea and excites specific regions of the basilar membrane. A sinusoidal stimulus mainly excites one region of the basilar membrane; however, neighboring regions are also excited. This phenomenon expresses masking across critical bands. To take it into account, the discrete Bark spectrum is convolved with the spreading function (Painter and Spanias, 2000). As a result, the spread critical band spectrum is obtained.
– Subtraction of a relative threshold offset: there are different kinds of masking according to the tone-like or noise-like nature of the masker. To determine how tone-like the signal is, the tonality coefficient is computed and used to derive the threshold offset. The threshold offset is then subtracted from the spread critical band spectrum to yield the spread threshold estimate.
– Renormalization and comparison with the absolute threshold of hearing: the spread threshold estimate is scaled by a correction factor to simulate the deconvolution of the spreading function. It is then checked against the absolute threshold of hearing and replaced by the maximum of the two thresholds.
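The five steps above can be sketched in Python as follows. This is a simplified illustration rather than a reference implementation of Johnston's model: the Bark mapping (Zwicker's approximation), the Schroeder-type spreading function, the tonality estimate based on the spectral flatness measure, Terhardt's approximation of the threshold in quiet, the power normalization and the calibration parameter `full_scale_db_spl` are all assumptions introduced here, as are the function names.

```python
import numpy as np


def hz_to_bark(f_hz):
    """Zwicker's approximation of the Bark scale (assumed mapping)."""
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)


def absolute_threshold_db(f_hz):
    """Terhardt's approximation of the threshold in quiet, in dB SPL."""
    f_khz = np.maximum(f_hz, 20.0) / 1000.0
    return (3.64 * f_khz ** -0.8
            - 6.5 * np.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


def masking_threshold(frame, fs, full_scale_db_spl=90.0):
    """Per-bin masking threshold of one frame (simplified sketch)."""
    n = len(frame)
    # 1. Spectral analysis: Hanning window, then power spectrum.
    power = np.abs(np.fft.rfft(frame * np.hanning(n))) ** 2 / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # 2. Critical band analysis: sum the power inside each Bark band.
    band = np.floor(hz_to_bark(freqs)).astype(int)
    n_bands = band.max() + 1
    bark_power = np.bincount(band, weights=power, minlength=n_bands)
    bins_per_band = np.bincount(band, minlength=n_bands)
    # 3. Spreading across bands (dz = maskee band minus masker band).
    dz = np.arange(n_bands)[:, None] - np.arange(n_bands)[None, :]
    spread_db = 15.81 + 7.5 * (dz + 0.474) - 17.5 * np.sqrt(1.0 + (dz + 0.474) ** 2)
    spread = 10.0 ** (spread_db / 10.0)
    spread_power = spread @ bark_power
    # 4. Threshold offset driven by a tonality coefficient in [0, 1].
    eps = 1e-12
    geometric_mean = np.exp(np.mean(np.log(power + eps)))
    sfm_db = 10.0 * np.log10(geometric_mean / (np.mean(power) + eps))
    alpha = min(sfm_db / -60.0, 1.0)
    offset_db = alpha * (14.5 + np.arange(n_bands)) + (1.0 - alpha) * 5.5
    threshold = spread_power * 10.0 ** (-offset_db / 10.0)
    # 5. Renormalization, then comparison with the threshold in quiet.
    threshold /= spread.sum(axis=1)
    # Assumption: the band threshold is shared equally among the band's bins.
    per_bin = threshold[band] / bins_per_band[band]
    # Assumption: unit power corresponds to full_scale_db_spl dB SPL.
    quiet = 10.0 ** ((absolute_threshold_db(freqs) - full_scale_db_spl) / 10.0)
    return np.maximum(per_bin, quiet)


# Hypothetical 32 ms frame of white noise sampled at 8 kHz.
example_mt = masking_threshold(np.random.default_rng(0).standard_normal(256), fs=8000)
```

The result has one value per FFT bin, so it can be compared directly with a power spectrum computed with the same frame length and window.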
4. Perceptual characterization of degradations affecting denoised speech
4.1. Perceptual characterization of audible noise
According to the definition of the MT, it is possible to add the MT curve to the PSD of the clean speech without modifying its audibility. In the temporal domain, the resulting signal (obtained by inverse FFT) has a different temporal shape but the same audible quality as the original one. The resulting spectrum is called the upper bound of perceptual equivalence (UBPE):

$$\mathrm{UBPE}(m,k) = \Gamma_S(m,k) + \mathrm{MT}(m,k), \tag{10}$$

where $\Gamma_S(m,k)$ denotes the PSD of the clean speech. Now, if we assume that the only kind of degradation contained in the denoised speech is the residual noise, we can write:

$$\Gamma_{\hat{S}}(m,k) = \Gamma_S(m,k) + \Gamma_R(m,k), \tag{11}$$

where $\Gamma_{\hat{S}}(m,k)$ is the PSD of the denoised speech and $\Gamma_R(m,k) = |H(m,k)|^2\, \Gamma_N(m,k)$ denotes the PSD of the residual noise. The residual noise becomes audible if its PSD exceeds the MT, i.e. $\Gamma_R(m,k) > \mathrm{MT}(m,k)$, and then

$$\Gamma_{\hat{S}}(m,k) > \Gamma_S(m,k) + \mathrm{MT}(m,k) = \mathrm{UBPE}(m,k). \tag{12}$$

This means that if some frequency components of the denoised speech are above UBPE(m, k), the resulting additive noise is audible. It is possible to extract the audible parts of the residual noise using a simple subtraction:

$$\Gamma_n^p(m,k) = \begin{cases} \Gamma_{\hat{S}}(m,k) - \mathrm{UBPE}(m,k) & \text{if } \Gamma_{\hat{S}}(m,k) > \mathrm{UBPE}(m,k), \\ 0 & \text{otherwise,} \end{cases} \tag{13}$$
where the superscript $p$ in $\Gamma_n^p(m,k)$ denotes the perceptual sense of the PSD. As an illustration, we have represented in Fig. 1 the power spectra of the clean and denoised signals and the UBPE calculated from the clean speech according to Eq. (10). The clean speech sequence is extracted from the TIMIT database and re-sampled at 8 kHz (Garofolo, 1988). It is corrupted by an additive white noise, extracted from the Noisex-92 database (Varga and Steeneken, 1993). Then, the denoised speech is obtained from the noisy one by Wiener filtering (Scalart and Filho, 1996). In Fig. 1, only the parts which exceed the UBPE are heard as residual background noise. Frequency components of the denoised speech which are under the UBPE can be either perceptually equivalent to the original clean speech or heard as speech distortion. At this stage, we cannot say anything about such components. Obviously, we need an additional criterion for the audibility of speech distortion. This tool is explained and detailed in the next section.
4.2. Perceptual characterization of audible distortion
By duality, some attenuations of frequency components can be heard as speech distortion. Our aim is to quantify them. We proceed in the same way as for the UBPE: we propose to derive a second curve which expresses the lower bound under which any attenuation of a frequency component is heard as a distortion. We call it the lower bound of perceptual equivalence (LBPE).
To compute it, we exploit the concept of audible spectrum, introduced by Tsoukalas et al. for audio signal enhancement (Tsoukalas et al., 1997). The audible spectrum is defined as the maximum of the clean speech PSD and its masking threshold:

$$A_S(m,k) = \begin{cases} \Gamma_S(m,k) & \text{if } \Gamma_S(m,k) \geq \mathrm{MT}(m,k), \\ \mathrm{MT}(m,k) & \text{otherwise.} \end{cases} \tag{14}$$

Eq. (14) reflects the fact that frequency components above the masking threshold are audible and are kept untouched. Frequency components under MT(m, k) are inaudible and can be modified without impairing the perceptual quality of the speech. Tsoukalas et al. have proposed to set them equal to the masking threshold, so that their PSD is amplified and accentuated. In our previous works (Chetouani et al., 2007), we have proposed to attenuate them for another reason. In fact, since we aim at detecting speech distortions which appear as a PSD attenuation, we seek a curve under which any attenuation of a frequency component is heard as a speech distortion. This lower bound, denoted $r(m,k)$, must obey the inaudibility condition $r(m,k) < \mathrm{MT}(m,k)$. According to this idea, the proposed LBPE can be defined as follows:

$$\mathrm{LBPE}(m,k) = \begin{cases} \Gamma_S(m,k) & \text{if } \Gamma_S(m,k) \geq \mathrm{MT}(m,k), \\ r(m,k) & \text{otherwise.} \end{cases} \tag{15}$$

There is a degree of freedom in choosing $r(m,k)$. In this work, we have chosen it equal to 0 dB. Once the LBPE is defined, it is possible to estimate just the audible power spectral density of speech distortion, $\Gamma_d^p(m,k)$, using a simple subtraction:

$$\Gamma_d^p(m,k) = \begin{cases} \mathrm{LBPE}(m,k) - \Gamma_{\hat{S}}(m,k) & \text{if } \Gamma_{\hat{S}}(m,k) < \mathrm{LBPE}(m,k), \\ 0 & \text{otherwise.} \end{cases} \tag{16}$$

Fig. 1. Audible noise detection using UBPE.
In Fig. 2, we have illustrated the detection of audible distortion. The LBPE defines the lower limit under which any denoised speech component will be heard as speech distortion. In some cases, the denoised speech spectrum is under the clean speech spectrum but not under the LBPE. This kind of distortion is not audible and should not be taken into account when constructing the proposed criterion.
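A minimal sketch of the bounds and of the audible parts of each degradation (Eqs. (10), (13), (15) and (16)) is given below for a single frame. The function names and the toy PSD values are hypothetical, and the lower bound r(m, k) = 0 dB is assumed to correspond to a linear power of 1.0, which depends on how the PSDs are normalized.

```python
import numpy as np


def perceptual_bounds(clean_psd, mt, floor=1.0):
    """Return (UBPE, LBPE) for one frame, following Eqs. (10) and (15).

    floor plays the role of r(m, k); 1.0 is the assumed linear value of 0 dB.
    """
    ubpe = clean_psd + mt
    lbpe = np.where(clean_psd >= mt, clean_psd, floor)
    return ubpe, lbpe


def audible_degradations(denoised_psd, ubpe, lbpe):
    """Audible noise (Eq. (13)) and audible distortion (Eq. (16)) PSDs."""
    audible_noise = np.maximum(denoised_psd - ubpe, 0.0)
    audible_distortion = np.maximum(lbpe - denoised_psd, 0.0)
    return audible_noise, audible_distortion


# Toy three-bin frame with made-up PSD values (linear power units).
clean = np.array([100.0, 50.0, 2.0])
mt = np.array([10.0, 60.0, 5.0])
denoised = np.array([120.0, 40.0, 0.5])
ubpe, lbpe = perceptual_bounds(clean, mt)
noise_p, dist_p = audible_degradations(denoised, ubpe, lbpe)
```

In this toy frame, the first bin exceeds the UBPE and contributes audible noise, the second lies between the two bounds and is perceptually equivalent to the clean speech, and the third falls under the LBPE and contributes audible distortion.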
4.3. Usefulness of UBPE and LBPE
Using UBPE and LBPE, we can define three regions characterizing the perceptual quality of denoised speech: frequency components between UBPE and LBPE are perceptually equivalent to the original speech components, frequency components above UBPE contain audible background noise, and frequency components under LBPE are characterized as speech distortion. This novel characterization makes it possible to identify and detect audible additive noise and audible distortion. As an illustration, we have presented in Fig. 3 an example of a speech frame
Fig. 2. Audible distortion detection using LBPE.
Fig. 3. An example of UBPE and LBPE of a clean speech frame.
power spectrum and its related curves UBPE (upper curve, bold line) and LBPE (bottom curve, dashed line). The clean speech power spectrum lies, for all frequency indices, between the two curves UBPE and LBPE. We note that the two curves coincide for most peaks. This means that, in these frequency intervals, any kind of degradation affecting speech will be audible. If it lies above UBPE, it will be heard as background noise. Otherwise, it will be perceived as speech distortion.
5. New proposed criteria
In this section, we justify and define the proposed criteria to quantify separately the residual noise and the speech distortion. By a linear combination of both criteria, we propose to build a new criterion to quantify the overall quality. Before defining the criteria, some important preliminary remarks are given. The definition of the proposed criteria is inspired by the well known signal to noise ratio (SNR), which is defined as the ratio between signal power and noise power. In our work, most of the processing is performed in the frequency domain. Hence, the proposed criteria are calculated in the frequency domain, thanks to Parseval's theorem of energy conservation. Since the UBPE and LBPE are perceptually equivalent to the original signal, and since we aim at measuring the characteristics of the audible signal parts and not the signal characteristics themselves, the proposed criteria involve the UBPE and LBPE powers instead of the clean speech power. The time domain signal related to UBPE is called the "upper effective signal", whereas the time domain signal related to LBPE is called the "lower effective signal". The proposed criteria are defined in the following subsections.
5.1. Perceptual signal to audible noise ratio (PSANR) and Perceptual signal to audible distortion ratio (PSADR)
The perceptual signal to audible noise ratio PSANR(m) of frame m is defined as the ratio between the upper effective signal power and the audible residual noise power.
It is defined in the frequency domain as follows:

$$\mathrm{PSANR}(m) = \frac{\sum_{k=1}^{N} \mathrm{UBPE}(m,k)}{\sum_{k=1}^{N} \Gamma_n^p(m,k)}, \tag{17}$$

where N is the number of frequency bins. In the same manner, we have defined the perceptual signal to audible distortion ratio PSADR(m) of frame m as a ratio between lower effective signal and audible distortion powers:

$$\mathrm{PSADR}(m) = \frac{\sum_{k=1}^{N} \mathrm{LBPE}(m,k)}{\sum_{k=1}^{N} \Gamma_d^p(m,k)}. \tag{18}$$
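As an illustration, Eqs. (17) and (18) can be sketched in Python as follows. This is a minimal sketch, assuming that the UBPE or LBPE and the audible degradation powers are already available as arrays of shape (frames, bins). The function name `perceptual_ratio`, the `eps` floor and the toy values are illustrative assumptions and are not part of the original method.

```python
import numpy as np

def perceptual_ratio(bound_power, audible_power, eps=1e-12):
    """Per-frame ratio of summed bound power to summed audible degradation power.

    bound_power:   array (frames, bins); UBPE(m, k) for PSANR, LBPE(m, k) for PSADR
    audible_power: array (frames, bins); audible noise or audible distortion power
    eps:           hypothetical floor avoiding a division by zero when no
                   degradation is audible in a frame (not part of Eqs. (17)-(18))
    """
    num = np.sum(np.asarray(bound_power, dtype=float), axis=1)
    den = np.sum(np.asarray(audible_power, dtype=float), axis=1)
    return num / np.maximum(den, eps)

# Toy data: 2 frames, 4 frequency bins (values chosen for illustration only)
ubpe = np.array([[4.0, 4.0, 2.0, 2.0],
                 [1.0, 1.0, 1.0, 1.0]])
audible_noise = np.array([[1.0, 1.0, 0.5, 0.5],
                          [0.5, 0.5, 0.5, 0.5]])

psanr = perceptual_ratio(ubpe, audible_noise)  # frame ratios 12/3 and 4/2
```

The same function yields PSADR(m) when it is called with the LBPE and the audible distortion power.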
To compute the global PSANR and PSADR of the total speech sequence, we consider the segmental SNR (SNRseg) thanks to its better correlation with subjective tests when compared to the traditional SNR (Hu and Loizou, 2008). The principle of segmental SNR consists in determining the SNR for each frame SNR(m) and then calculating their geometric mean over the total number of frames (Hansen and Pellom, 1998):

$$\mathrm{SNR}_{\mathrm{seg}} = \sqrt[M]{\prod_{m=1}^{M} \mathrm{SNR}(m)}. \tag{19}$$
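Numerically, the product in Eq. (19) is conveniently evaluated through logarithms, which avoids overflow for long sequences. The following sketch assumes strictly positive per-frame power ratios; the function names and the toy values are illustrative assumptions.

```python
import numpy as np

def segmental_ratio_db(frame_ratios):
    """Arithmetic mean of the per-frame ratios expressed in dB.

    Assumes strictly positive linear power ratios, one per frame.
    """
    frame_ratios = np.asarray(frame_ratios, dtype=float)
    return float(np.mean(10.0 * np.log10(frame_ratios)))

def geometric_mean(frame_ratios):
    """M-th root of the product of M ratios, computed in the log domain."""
    frame_ratios = np.asarray(frame_ratios, dtype=float)
    return float(np.exp(np.mean(np.log(frame_ratios))))

ratios = [10.0, 100.0, 1000.0]                     # 10 dB, 20 dB and 30 dB
seg_db = segmental_ratio_db(ratios)                # mean of the dB values
geo_db = 10.0 * np.log10(geometric_mean(ratios))   # Eq. (19), then converted to dB
```

Both routes give the same value for this toy input (20 dB), so the global PSANR and PSADR can be obtained by averaging the per-frame values in dB.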
Moreover, since the SNR and SNRseg are usually expressed in dB, the geometric mean is equivalent to the arithmetic mean in the log domain. Using this approach, we compute the global PSANR and PSADR for a given speech sequence.

5.2. Perceptual signal to audible noise and distortion ratio (PSANDR)

As mentioned in Dreiseitel (2001) and Subjective test methodology (2003), the overall quality depends on the speech distortion level and the background noise level. The human mind combines the two kinds of degradation to assess the overall signal quality. For the sake of simplicity, and inspired by similar works such as Dreiseitel (2001) and Hu and Loizou (2006a), we propose to linearly combine the PSADR and PSANR to obtain an overall measure of the speech quality. The resulting criterion is called the perceptual signal to audible noise and distortion ratio (PSANDR):

$$\mathrm{PSANDR} = a\,\mathrm{PSADR} + b\,\mathrm{PSANR} + c, \tag{20}$$

where a, b and c denote the linear weight parameters. They are determined using a large database of speech signals whose MOS scores are known. The mean square error between the objective evaluation using PSANDR and the subjective evaluation using MOS is minimized, yielding the optimal values of the linear parameters. After this training phase, the following criterion is obtained:

$$\mathrm{PSANDR} = 0.0039\,\mathrm{PSADR} + 0.1339\,\mathrm{PSANR} + 2.4176. \tag{21}$$

5.3. Combination of proposed measures with conventional criteria

We expect to improve the proposed criteria PSANR, PSADR and PSANDR by adding the advantages of other criteria. Inspired by the composite criteria developed by Hu and Loizou (2008), we propose to construct our own composite measures. Consequently, after parameter optimization and criterion selection, following the same concept as for the composite criteria CBAK, CSIG
and COVL (Hu and Loizou, 2008), we propose the composite PSANR, PSADR and PSANDR as follows:

$$C_{\mathrm{PSANR}} = 0.0685\,\mathrm{PSANR} + 0.1369\,\mathrm{PESQ} - 0.0026\,\mathrm{WSS} + 0.0821\,\mathrm{SNR}_{\mathrm{seg}} + 2.2781, \tag{22}$$

$$C_{\mathrm{PSADR}} = 0.1262\,\mathrm{PSADR} + 0.7967\,\mathrm{PESQ} + 0.4573\,\mathrm{LLR} + 0.0638\,\mathrm{SNR}_{\mathrm{seg}} - 0.8251, \tag{23}$$

and

$$C_{\mathrm{PSANDR}} = 0.1294\,\mathrm{PSANR} + 0.0049\,\mathrm{PSADR} - 0.1477\,\mathrm{PESQ} - 0.1149\,\mathrm{CEP} + 2.9395. \tag{24}$$

6. Scatter plot and preliminary results

Scatter plots are basic tools for validating criteria by considering their relationship with subjective criteria. In this section, after describing the database used to calculate the objective and subjective criteria, we present scatter plots and fitted polynomial functions for the conventional criteria and the proposed ones. Next, the errors due to polynomial fitting are evaluated.

6.1. Speech corpus

The database is constructed from 18 clean speech sequences extracted from the TIMIT database (Garofolo, 1988). They are corrupted by four types of noise (white, babble, f16 and factory) at different SNR levels (from 0 dB to 25 dB). Next, they are processed by four denoising techniques: power spectral subtraction (Beruoti et al., 1979), Wiener filtering (Benesty et al., 2005), minimum mean square error estimation as proposed by Benesty et al. (2005) and perceptual filtering as proposed by Gustafsson et al. (2002). A total of 90 processed sequences are included in the evaluation. For each sequence, and according to ITU-T Recommendation P.835, three subjective scores are determined: speech distortion (SIG), residual background noise (BAK) and overall quality (OVL). A total of 32 listeners were recruited for the listening tests. Each of them assessed, for each denoised sequence, the speech distortion, the residual background noise and the overall quality according to Tables 1 and 2. The final scores SIG, BAK and OVL are obtained by averaging the individual ones.

6.2. Scatter plots
The scatter plot visualizes the relationship between objective and subjective scores. According to Section 7 of ITU-T Recommendation P.862, we fitted third-order polynomial predictors to the scatter plots using a least squares linear regression approach. We have conducted experiments to produce the scatter plots and fitted polynomial functions of the mentioned criteria. Experimental results are summarized in Figs. 4–6.

Fig. 4 shows the subjective criterion SIG versus each objective criterion, namely PESQSIG, CSIG, PSADR and CPSADR. We notice that the scatter points are not close to the fitted polynomial functions. This means that it is difficult to predict the subjective criterion SIG from objective measures. Nevertheless, the CPSADR scatter plot is closer to the fitted function than those of the remaining criteria. Hence, we expect CPSADR to be more correlated with the subjective test SIG than the other tested criteria.

Fig. 5 deals with the BAK measure. We represent the BAK criterion versus the objective criteria PESQBAK, CBAK, PSANR and CPSANR. We notice that, for all criteria, the scatter points are close to the fitted functions. Thus, we can expect that classic objective criteria are more adequate for evaluating the background noise than for evaluating speech distortion.

Fig. 6 is devoted to the relationship between the objective criteria PESQOVL, COVL, PSANDR, CPSANDR and the subjective criterion MOS. We observe an improvement in the closeness of the objective scores to the fitted functions. Hence, we can expect a good correlation between the objective measures and the MOS criterion.

Fig. 4. Scatter plot: relationship between subjective criterion SIG and objective measures.

Fig. 5. Scatter plot: relationship between subjective criterion BAK and objective measures.

Fig. 6. Scatter plot: relationship between subjective criterion MOS and objective measures.

6.3. Per-condition correlation and root mean square error

In order to evaluate the precision of the third-order fitting, we compute the root mean square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[f_O(i) - O(i)\right]^2}, \tag{25}$$

where O(i) (resp. $f_O(i)$) is the value of the objective criterion (resp. of the fitted function) for the evaluation sequence of index i, and N is the total number of sequences.
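The fitting step and Eq. (25) can be sketched as follows. This is a minimal sketch, assuming that the third-order polynomial maps objective scores to subjective scores and that the RMSE is evaluated between the fitted values and the scores they are meant to predict; both choices are our reading of the procedure, and the function names and synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_cubic(objective, subjective):
    """Least-squares third-order polynomial predictor (objective -> subjective)."""
    coeffs = np.polyfit(objective, subjective, deg=3)
    return np.poly1d(coeffs)

def rmse(a, b):
    """Root mean square error between two equally long score sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic scores that follow an exact cubic law (illustration only)
objective = np.linspace(-5.0, 15.0, 30)
subjective = 1.0 + 0.2 * objective - 0.001 * objective ** 3

predictor = fit_cubic(objective, subjective)
err = rmse(predictor(objective), subjective)  # close to zero for exactly cubic data
```

With real listening-test data the error does not vanish, and a smaller RMSE indicates that the fitted predictor follows the scatter plot more tightly.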
We also calculate the per-condition correlation R between each objective criterion and its related fitted polynomial function. The coefficient R is computed according to Pearson's correlation method (Dimolitsas, 1984):

$$R = \frac{\sum_{i=1}^{N}\left[f_O(i) - \bar{f}_O\right]\left[O(i) - \bar{O}\right]}{\sqrt{\sum_{i=1}^{N}\left[f_O(i) - \bar{f}_O\right]^2}\,\sqrt{\sum_{i=1}^{N}\left[O(i) - \bar{O}\right]^2}}, \tag{26}$$

where $\bar{O}$ (resp. $\bar{f}_O$) is the average of O(i) (resp. $f_O(i)$). We recall that a criterion is good for predicting subjective tests if its RMSE is small and its per-condition correlation R is high. In Table 3, we summarize the RMSE and the per-condition correlation (R) obtained for the tested criteria. Subscripts "S", "B" and "M" denote respectively SIG, BAK and MOS. We notice that the best results in
Table 3
Per-condition correlation and root mean square error. Bold values are used to highlight performances of our proposed technique.
PESQSIG PESQBAK PESQOVL CSIG CBAK COVL PSADR PSANR PSANDR CPSADR CPSANR CPSANDR
RMSES
RS
0.52
0.74
0.51
RMSEB
RB
0.33
0.97
RMSEM
RM
0.21
0.98
0.96 0.29
0.99 0.21
0.54
0.97 0.26
0.98 0.15
0.39
0.99
0.99
0.97 0.24
0.99 0.13
0.99
terms of small RMSE and high R are obtained for the proposed criteria (see bold values in Table 3). This agrees with the scatter plot analysis. Hence, we expect a good correlation between the proposed criteria and their corresponding subjective measures.
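For completeness, the per-condition correlation of Eq. (26) can be written directly from its definition. The sketch below is illustrative only: the function name `pearson_r` and the toy sequences are assumptions, and constant inputs (zero variance) are not handled.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equally long sequences, following Eq. (26)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the mean of the first sequence
    dy = y - y.mean()   # deviations from the mean of the second sequence
    return float(np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2))))

fitted = [1.0, 2.0, 3.0, 4.0]
scores = [2.0, 4.0, 6.0, 8.0]
r = pearson_r(fitted, scores)  # perfectly linear relation, so r is 1 up to rounding
```

The result agrees with `np.corrcoef(x, y)[0, 1]`, which can be used instead when NumPy is available.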
7. Correlation of objective measures with subjective tests

7.1. Definition

Once an objective measure O(i) has been computed over the database and a subjective score S(i) has been obtained over the same database, a correlation coefficient ρ, also known as Pearson's correlation coefficient, indicating the goodness of fit can be calculated in the same manner as in Eq. (26) (Dimolitsas, 1984). The closer ρ is to one, the more accurately the objective criterion predicts the subjective scores. This correlation can be used to obtain an estimate of the standard deviation of the error σ when the objective measure is used in place of the subjective measure (Dimolitsas, 1984). The standard deviation σ is given as:

$$\sigma = \hat{\sigma}_S \sqrt{1 - \rho^2}, \tag{27}$$

where $\hat{\sigma}_S$ is the standard deviation of S(i). A smaller value of σ indicates that the objective measure is better for predicting subjective quality.
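Eq. (27) can be evaluated as in the following sketch. It assumes that the standard deviation of S(i) is the sample estimate (`ddof=1`), which the text does not specify; the function name and the toy scores are likewise illustrative assumptions.

```python
import numpy as np

def prediction_error_std(objective, subjective, ddof=1):
    """Return (|rho|, sigma) where sigma follows Eq. (27).

    ddof=1 (sample standard deviation) is an assumption; the normalization
    of the standard deviation of S(i) is not stated in the text.
    """
    rho = float(np.corrcoef(objective, subjective)[0, 1])
    sigma_s = float(np.std(subjective, ddof=ddof))
    # Guard against tiny negative values caused by rounding when |rho| is close to 1
    sigma = sigma_s * np.sqrt(max(0.0, 1.0 - rho ** 2))
    return abs(rho), float(sigma)

objective = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, -1.0, -1.0, 1.0]   # uncorrelated with the objective scores
abs_rho, sigma = prediction_error_std(objective, subjective)
```

For uncorrelated scores, as in this toy case, σ equals the standard deviation of the subjective scores; it shrinks toward zero as |ρ| approaches one.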
7.2. Results

We recall that we are interested in comparing our proposed criteria with those developed specifically to assess denoised speech, which have been found to be more correlated with the subjective tests SIG, BAK and MOS than conventional criteria (Hu and Loizou, 2006a, 2008). We compute the correlation coefficient ρ and the standard deviation of the error σ for all tested criteria and summarize the results in Table 4.
Table 4
Estimated correlation coefficients ρ and standard deviation of the error σ of objective measures with overall quality, signal distortion and residual background noise. Subscripts SIG, BAK and OVL denote respectively signal distortion, background noise and overall quality. Bold values are used to highlight performances of our proposed technique.
PESQSIG PESQBAK PESQOVL CSIG CBAK COVL PSADR PSANR PSANDR CPSADR CPSANR CPSANDR
|ρSIG|
σSIG
0.36
0.77
0.48
|ρBAK|
σBAK
0.65
0.59 0.67
0.47
0.68
0.46
0.78
0.39
0.81
37
0.54
0.73 0.74
0.62
σOVL
0.72 0.72
0.44
|ρOVL|
0.52
0.64 0.79
0.47
– Speech distortion analysis
From Table 4, we can deduce that predicting speech distortion quality with objective criteria is not an easy task. Indeed, the tested criteria are not well correlated with the subjective test SIG. For example, the correlation of PESQSIG with the subjective test SIG is about 0.36, which is not promising for the assessment of speech distortion in real applications. Nevertheless, the proposed criteria PSADR and CPSADR significantly improve the correlation with the SIG test, by about 0.3 compared to the coefficient obtained by PESQSIG (see bold values in Table 4). To explain these results, we recall the expression of the speech distortion given in Eq. (9), $[H(m,k) - 1]^2\,\Gamma_S(m,k)$. Indeed, speech distortion can be seen as an attenuation of speech components, since the denoising filter H(m, k) is always less than one. Similarly, the residual noise can be seen as an attenuation of background noise components, $H(m,k)^2\,\Gamma_N(m,k)$. Experiments show that even a small attenuation of speech components can dramatically degrade the speech quality. However, conventional criteria rely on energy computation in their definition. We think that a small decrease in energy may not be significant enough to modify the behavior of the objective criteria, although this small amount of energy can have dramatic consequences on speech distortion. Based on this analysis, we have constructed the PSADR criterion so as to give importance to any small attenuation of speech components (see Section 4.2).

– Residual noise analysis
Table 4 shows that the tested criteria are more suitable for evaluating residual background noise than speech distortion. In fact, the correlation coefficients of the objective criteria with the subjective test BAK are higher than those obtained for speech distortion evaluation. We note that the best correlation is obtained by the proposed criteria PSANR and CPSANR.
– Overall quality analysis
We remark that the tested criteria are well correlated with the subjective test. This means that the tested objective criteria are more accurate for evaluating the global quality than for assessing speech distortion or residual noise. We also notice that the best correlation is obtained by the proposed criteria PSANDR and CPSANDR.

8. Conclusion

In this work, we have proposed six perceptual measures to independently evaluate speech distortion, residual background noise and overall quality. These criteria are developed especially for speech denoising applications. They are built after a fine analysis of the degradations affecting denoised speech. Based on this analysis, the concept of the class of perceptual equivalence (CPE) is introduced, which makes it possible to specify and quantify only the audible degradations. The proposed measures use human auditory properties to perceptually characterize the degradations affecting denoised speech, in contrast to existing criteria, which are based on the combination of some measures without taking into account the audibility of the degradations. Experimental results have indicated that the proposed criteria CPSADR, CPSANR and CPSANDR are the criteria most correlated with the corresponding subjective measures SIG, BAK and MOS. It is important to notice that human listeners pay more attention to the speech distortion than to the background noise when assessing the overall quality. However, as shown in this paper, objective criteria do not perform well in predicting speech distortion quality, which confirms the difficulty of speech distortion assessment. In fact, even small attenuations of the clean speech components, especially harmonic components, introduce unavoidable distortion into the denoised speech. Hence, criteria which pay more attention to the assessment of speech distortion may be more correlated with the subjective criteria.
Many perspectives could be investigated, for instance using and adapting these new criteria to generalize their use to other speech processing applications, such as evaluating distortions introduced by speech codecs and communication channels.

References

Benesty, J., Makino, S., Chen, J., 2005. Speech Enhancement. Springer.
Benesty, J., Sondhi, M.M., Huang, Y., 2008. Handbook of Speech Processing. Springer, pp. 843–1015.
Beruoti, M., Schwartz, R., Makhoul, J., 1979. Enhancement of speech corrupted by acoustic noise. In: Proc. IEEE Internat. Conf. Acoustics, Speech, Signal Process, pp. 208–211.
Chetouani, M., Hussain, A., Gas, B., Milgram, M., Zarader, J.L., 2007. Advances in Nonlinear Speech Processing. Springer, pp. 230–245.
Dimolitsas, S., 1984. Objective speech distortion measures and their relevance to speech quality assessments. In: Proc. of the IEEE, vol. 136, no. 5.
Dreiseitel, P., 2001. Quality measures for single channel speech enhancement algorithms. In: Proc. Internat. Workshop on Acoustics Echo and Noise Control.
Garofolo, J.S., 1988. Getting started with the DARPA TIMIT CD-ROM: an acoustic phonetic continuous speech database. National Institute of Standards and Technology.
Gustafsson, S., Martin, R., Jax, P., Valery, P., 2002. A psychoacoustic approach to combined acoustic echo cancellation and noise reduction. IEEE Trans. Speech Audio Process. 10 (5), 245–256.
Hansen, J.H.L., Pellom, B.L., 1998. An effective quality evaluation protocol for speech enhancement algorithms. In: Proc. Internat. Conf. on Spoken Language Processing ICSLP, vol. 7, pp. 2819–2822.
Hu, Y., Loizou, P., 2006. Evaluation of objective measures for speech enhancement. In: Proc. Interspeech, pp. 1447–1450.
Hu, Y., Loizou, P., 2006. Subjective comparison of speech enhancement algorithms. In: Proc. IEEE Internat. Conf. Acoustics, Speech, Signal Process, vol. 1, pp. 153–156.
Hu, Y., Loizou, P., 2007. Subjective comparison and evaluation of speech enhancement algorithms. Speech Comm. 49, 588–601.
Hu, Y., Loizou, P., 2008. Evaluation of objective quality measures for speech enhancement. IEEE Trans. Audio, Speech, Lang. Process. 16 (1), 229–238.
Johnston, J.D., 1988. Transform coding of audio signal using perceptual noise criteria. IEEE J. Select. Areas Comm. 6, 314–323.
Klatt, D., 1982. Prediction of perceived phonetic distance from critical band spectra. In: Proc. IEEE Internat. Conf. Acoustics, Speech, Signal Process, vol. 7, pp. 1278–1281.
Loizou, P., 2007. Speech Enhancement: Theory and Practice. CRC, Boca Raton, FL.
Methods for subjective determination of transmission quality, 1996. ITU-T Recommendation P.800.
Painter, T., Spanias, A., 2000. Perceptual coding of digital audio. In: Proc. of the IEEE, vol. 88, no. 4, pp. 451–513.
Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, 2000. ITU-T Recommendation P.862.
Quanckenbush, S., Barnwell, T., Clements, M., 1988. Objective Measures of Speech Quality. Prentice Hall, Englewood Cliffs.
Rix, A., Beerends, J., Hollier, M., Hekstra, A., 2001. Perceptual evaluation of speech quality (PESQ)-A new method for speech quality assessment of telephone networks and codecs. In: Proc. IEEE Internat. Conf. Acoustics, Speech, Signal Process, vol. 2, pp. 749–752.
Rix, A.W., Beerends, J.G., Kim, D.S., Kroon, P., Ghitza, O., 2006. Objective assessment of speech and audio quality-technology and application. IEEE Trans. Audio, Speech, Lang. Process. 14 (6), 1890–1901.
Scalart, P., Filho, J.V., 1996. Speech enhancement based on a priori signal to noise estimation. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, Signal Process.
Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm, 2003. ITU-T Recommendation P.835.
Tribolet, J., Noll, P., McDermott, B., Crochiere, R.E., 1978. A study of complexity and quality of speech waveform coders. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, Signal Process, pp. 586–590.
Tsoukalas, D.E., Mourjopoulos, J., Kokkinakis, G., 1997. Speech enhancement based on audible noise suppression. IEEE Trans. Speech Audio Process. 5 (6), 497–514.
Varga, A., Steeneken, H.J.M., 1993. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Comm. 12, 247–251.
Yang, W., Benbouchta, M., Yantorno, R., 1998. Performance of the modified bark spectral distortion as an objective speech measure. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, Signal Process, vol. 1, pp. 541–544.
Zwicker, E., Fast, H., 1990. Psychoacoustics Facts and Models. Springer.