Speech Communication xxx (2016) xxx–xxx
Advances in phase-aware signal processing in speech communication

Pejman Mowlaee a,∗, Rahim Saeidi b, Yannis Stylianou c
a Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria
b Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
c Computer Science Department, University of Crete, Crete, Greece
Received 1 July 2015; received in revised form 8 April 2016; accepted 11 April 2016 Available online xxx
Abstract

During the past three decades, the processing of the spectral phase has been largely neglected in speech applications. There is no doubt, however, that the interest of the speech processing community in the use of phase information across a wide spectrum of speech technologies, from automatic speech and speaker recognition to speech synthesis, and from speech enhancement and source separation to speech coding, is constantly increasing. In this paper, we elaborate on why phase was believed to be unimportant in each application. We provide an overview of advancements in phase-aware signal processing with applications to speech, showing that phase-aware speech processing can be beneficial in many cases, while complementing the solutions that magnitude-only methods suggest. Our goal is to show that phase-aware signal processing is an important emerging field with high potential in current speech communication applications. The paper provides an extended and up-to-date bibliography on the topic of phase-aware speech processing, aiming to give interested readers the background necessary to follow the recent advancements in the area. Our review expands the step initiated by the special session we organized and exemplifies the usefulness of spectral phase information in a wide range of speech processing applications. Finally, the overview points out some directions for future work.

© 2016 Elsevier B.V. All rights reserved.

Keywords: Phase-aware speech processing; Phase-based features; Signal enhancement; Automatic speech recognition; Speaker recognition; Speech synthesis; Speech coding; Speech analysis.
1. Introduction
In many everyday-life applications, a reliable speech communication system is in demand whose performance is robust enough to deliver a certain high quality of service. For this purpose, it is highly important to verify the robustness of the underlying speech communication interface against impairments due to acoustic interference, background noise, or failures in the communication channel, which can be modeled, for example, as additive background noise or room reverberation. In a speech communication application, the design goal is to deliver enough flexibility and reliability such that a certain quality of service is provided, which may differ depending on the specified application. A full end-to-end speech communication chain, from the microphone to the receiver end, involves several blocks, including speech analysis, multi-channel processing (beamforming), and single-channel signal restoration entailing speech enhancement, source separation and artificial bandwidth extension. Depending on the desired application, one might be interested in playing back the reconstructed signal at the receiver end, in which case a speech synthesis block is required. Alternatively, in biometric and dictation applications, the goal is to recognize or verify the identity of the speaker (speaker identification/verification) or to recognize the phonemes and words spoken by the transmitting party (automatic speech recognition). Another application is speech watermarking, in which certain cover data for copyright purposes must be embedded in the speech signal without modifying the speech quality of the watermarked signal, while remaining robust against spoofing attacks.

∗ Corresponding author. Tel.: +43 316 873 4433; fax: +43 316 873 7941.
E-mail addresses: [email protected] (P. Mowlaee), [email protected] (R. Saeidi), [email protected] (Y. Stylianou).
Finally, in speech coding, the phase information is important for meeting the bit budget while delivering transparent speech quality at the receiver end. This is feasible by keeping the phase quantization errors below a perceptual threshold, determined by the just noticeable difference, so that they are not perceptible by the listener.

The common strategies in the aforementioned speech processing applications mainly focus on spectral amplitude modification and on signal processing methods that filter the spectral amplitude of the speech signal. More recently, research attention towards incorporating the spectral phase information has increased. In particular, individual steps have been taken by researchers in different applications from different sub-communities in speech signal processing. In order to unify the attempts made by researchers in such diverse communities and to gather them together to share their findings, a special session entitled Phase Importance in Speech Processing Applications (Mowlaee et al., 2014) was recently organized at INTERSPEECH 2014.

In this paper, we present an extended literature review of earlier viewpoints on the importance of phase information in speech signal processing. We exemplify the importance of phase information in several speech processing applications and further demonstrate the positive impact brought about by phase-aware signal processing in each of the aforementioned applications. Finally, recent and future directions towards applying phase-aware signal processing to different speech applications are pointed out. Our work extends the initial step made at the special session (Mowlaee et al., 2014); here we present a broader picture of phase-aware signal processing and provide a detailed overview of the advances made by researchers in the field. Following the literature review, we present several useful representations of the phase spectrum that illustrate certain structures in the phase domain. The common short-time Fourier transform (STFT) spectral phase representation exhibits no useful structure or distribution in the phase domain due to its cyclic wrapping nature; the new representations, however, provide the means to explore various properties and details of speech signals (e.g., continuity, formants, harmonicity, or statistical descriptions captured by the mean, variance and probability density function). We then describe the phase-derived features appearing in the literature for different speech applications. To highlight the importance of phase-aware signal processing, we give examples in several speech applications: single-channel and multi-channel speech enhancement, source separation, automatic speech recognition, speaker recognition, speech coding, speech analysis/synthesis, digital speech watermarking, and speech quality estimation. For each of these applications, we present the impact of phase information and describe the state-of-the-art methods that neglect the phase information, as well as more recent ones that incorporate phase information into their proposed solutions.

Several recent attempts have presented partial overviews of some aspects of phase processing for speech applications (Alsteris and Paliwal, 2007b; Gerkmann et al., 2015; Mowlaee and Kulmer, 2015a; 2015b):
• Alsteris and Paliwal (2007b) reviewed experimental results on the use of the short-time phase spectrum in speech processing, demonstrating the usefulness of the phase spectrum for automatic speech recognition, human listening and speech intelligibility. Paliwal et al. (2011) studied the importance of phase information for speech enhancement and proposed a phase spectrum compensation (PSC) method using the conjugate symmetry property of the discrete Fourier transform (DFT). These subjects are covered in Section 4.1.3.
• Gerkmann et al. (2015) reviewed phase processing for single-channel speech enhancement. In contrast to this narrow focus on phase-aware speech enhancement, the current review presents a unified approach to phase-aware speech communication, including other applications such as speech analysis, speech synthesis, speaker/speech recognition, speech coding, speech watermarking, source separation and multi-channel speech processing. These subjects are covered in Section 4.
• Mowlaee and Kulmer (2015a; 2015b) reviewed phase estimation in speech enhancement and demonstrated the potential and limits of phase estimation methods in a comparative study. Their main contribution was phase estimation in noise and its incorporation for improved signal reconstruction. Efficient techniques for estimating phase in noisy environments are reviewed in Section 4.1.3.
Throughout this paper, the extended up-to-date bibliography guides the reader through a summary of what has been done with regard to the phase processing of speech signals. For readers interested in a specific solution, we refer to the bibliography, where contributions from the earliest to the most recent can be found. Our goal is to give interested readers quick access to the key papers and to familiarize them with the earlier and current innovations and efforts made to explore phase-aware signal processing. We also elaborate on why phase was believed to be unimportant in each application and why phase processing has recently been attracting increasing interest in the speech processing community. The usefulness of spectral phase information is exemplified in a wide range of speech processing applications (speech analysis/synthesis, speech enhancement, source separation, speech coding, speech quality estimation, speech watermarking, automatic speech recognition, and speaker recognition), providing a reference for researchers who are just starting research related to phase-aware signal processing. The current paper is an extensive overview of the topic of phase-aware speech processing to date and provides a unique picture of the importance of phase in many aspects of speech technology.

The paper is organized as follows: Section 2 presents the controversial standpoints in the literature with regard to the importance of phase information in speech signal processing; Section 3 presents useful phase representations derived from the short-time Fourier transform phase and focuses on phase-based features used in speech processing applications; in Section 4 we exemplify the importance of phase-aware signal processing in speech applications; finally, Section 5 concludes the work and presents a discussion.
2. History of phase processing
In 1843 (Ohm, 1843; von Helmholtz, 1912), Helmholtz and Ohm concluded from their experiments that the human ear is insensitive to phase. Several later studies demonstrated that phase information can contribute to the intelligibility of speech signals, disproving the earlier observations. For example, Schroeder (1975) reviewed hearing models, studying the importance of the phase spectrum in human perception. Oppenheim and Lim (1981) explored the importance of phase in signals. In Oppenheim and Lim (1981) and Oppenheim et al. (1979; 1983), the authors also demonstrated the contribution of phase-only signal reconstruction to the perceived quality of the reconstructed signal, with examples from audio, image, crystallographic and speech processing applications. Liu et al. (1997) studied the perceptual importance of the phase spectrum specifically for intervocalic stop consonants and reported a strong dependency of the perception of intervocalic stops on phase information. As another example highlighting the importance of phase information, the authors in Ni and Huo (2007) presented a statistical interpretation of phase information in signal and image reconstruction. Oppenheim et al. (1983) demonstrated signal synthesis and proposed signal reconstruction techniques using only partial information in the Fourier domain. In more detail, they investigated the conditions for reconstructing a signal only from its available partial Fourier-domain information and reported results on exact signal reconstruction from partial STFT information, i.e., signed magnitude only or phase only. The same authors then studied the conditions under which a signal can be reconstructed from one-bit phase information only, and successfully applied them to image reconstruction (Hayes et al., 1980). Several later studies estimated a time-domain signal from the magnitude of its short-time Fourier transform, and also studied how to use phase-only information to achieve a unique reconstruction of a time-domain signal (Griffin and Lim, 1984; Hayes et al., 1980; Quatieri and Oppenheim, 1981). More specifically for speech signal processing, Paliwal et al. demonstrated the importance of phase information in human listening (Paliwal and Alsteris, 2003; 2005) and speech signal processing (Alsteris and Paliwal, 2007b), exemplifying its contribution in the context of automatic speech recognition (ASR) (Alsteris and Paliwal, 2004a), speech intelligibility (Alsteris and Paliwal, 2006), speech analysis (Stark and Paliwal, 2008; 2009a), and speech enhancement (Paliwal et al., 2011).

In multi-channel signal processing, the phase difference information between sensors has been used to estimate the direction of arrival (Knapp and Carter, 1976; Van Trees, 2002). An overview of microphone array speech processing was presented by Brandstein and Ward (2001), where the phase difference between channels was shown to be the main source
of information for time-delay estimation and source localization applications. Aarabi et al. (2006) reviewed phase-based speech processing methods focused on improved automatic speech recognition or binaural speech enhancement using the phase difference information between microphones.

The importance of phase information in noise reduction has been controversial. Wang and Lim (1982) concluded that phase information was negligible in speech enhancement. Vary (1985) derived a threshold of 6 decibels for the relationship between audible phase distortion and the local signal-to-noise ratio, above which the noisy phase can be considered a decent estimate of the clean phase spectrum and the roughness in the synthesized speech is not audible. Ephraim and Malah (1984) showed that the noisy spectral phase is the minimum mean square error (MMSE) estimate of the clean phase, assuming a uniform prior distribution on the phase. Similarly, Lotter and Vary showed that the noisy phase is the maximum a posteriori (MAP) estimate of the clean phase under the same assumptions (Lotter and Vary, 2005). Recently, a short review of phase importance in different speech processing applications was presented (Mowlaee et al., 2014). Several scattered papers describe phase-aware processing mechanisms in specific applications, including the history of and advances in phase-aware single-channel speech enhancement (Gerkmann et al., 2015), phase estimation in noise and its limits and potential in single-channel speech enhancement (Mowlaee and Kulmer, 2015a; 2015b), phase estimation for single-channel source separation (Mayer and Mowlaee, 2015), speech synthesis (Degottex and Erro, 2014a; Sanchez et al., 2015), and automatic speech recognition (Alsteris and Paliwal, 2007b).

One common observation in the literature with regard to the importance of phase in speech processing is that the positive contribution of the short-time Fourier phase spectrum becomes larger when a relatively long analysis window is chosen for the spectral Fourier analysis of speech (Gotoh et al., 2005; Kazama et al., 2010). The origin of the previously observed controversy lies mainly in the fact that, as reported in Liu et al. (1997), the phase holds negligible information about speech intelligibility when a short analysis window, e.g., 20 to 40 ms, is applied. This is in contrast to conventional magnitude processing, where common speech signal processing employs short window analysis, leading to meaningful features, e.g., the spectral amplitude for speech enhancement or Mel frequency cepstral coefficients (MFCC) for automatic speech/speaker recognition.
3. Phase analysis
When analyzing the speech phase, we first need to mention that the term phase has been used in different contexts with different meanings. Table 1 provides a list of the different phase terms (first column) used throughout the speech signal processing literature, together with the applications in which they appear (second column). We explain each specific phase representation in the following sections.
Table 1
The term phase has different meanings and usages in speech processing.

Phase representation | References | Related section
STFT phase | Signal enhancement (Gerkmann, 2014; Mowlaee and Saeidi, 2013a; Paliwal et al., 2011); signal reconstruction (Griffin and Lim, 1984; Krawczyk and Gerkmann, 2014; Le Roux and Vincent, 2013; Mayer and Mowlaee, 2015; Mowlaee and Kulmer, 2015a; 2015b) | Sections 4.1.7 and 4.1
Inter-microphone phase | Multi-channel noise reduction (Aarabi et al., 2006; Brandstein and Ward, 2001; Doclo et al., 2015; Obaid et al., 2012); time-delay estimation (Knapp and Carter, 1976) | Section 4.1.1
Phase of the vocal tract filter | Speech synthesis (Degottex and Erro, 2014a); glottal model estimation (Degottex et al., 2011a; 2011b; Tahon et al., 2012) | Section 4.6
Phase in harmonic domain | Speech synthesis (Degottex and Erro, 2014a; Saratxaga et al., 2009; 2012); phase estimation (Krawczyk and Gerkmann, 2014; Kulmer and Mowlaee, 2015a; 2015b) | Section 4.6
Phase in perceptual bands | Speech coding (Kleijn and Paliwal, 1995; Pobloth, 2004); speech watermarking (Hernaez et al., 2014; Kuo et al., 2002; Liew and Armand, 2007) | Sections 4.2 and 4.3

Fig. 1. Time-frequency representations: (left) magnitude spectrogram |X(k, l)| and (right) instantaneous phase spectrum ∠X(k, l), shown for a female utterance saying "bin blue at l four soon".
3.1. Fourier analysis of speech signals
Let x(n) represent a time-domain speech signal with n the time sample index. Having segmented the signal into short frames of length N, we define xl(n) as the short-time windowed segment with n ∈ [0, N − 1]. Taking the Fourier transform of xl(n), we obtain

X(k, l) = F(xl(n)),   (1)

where X(k, l) is the DFT coefficient for frequency bin index k and frame index l. The complex DFT value can be inspected through its magnitude and phase representations,

X(k, l) = |X(k, l)| e^{jφx(k, l)},   (2)

with |X(k, l)| and φx(k, l) = ∠X(k, l) representing the spectral amplitude and the instantaneous phase spectrum, respectively. While the STFT magnitude spectrum is known to follow a super-Gaussian distribution (see e.g., Vary and Martin, 2006), the spectral phase has often been reported to follow a uniform distribution.

Fig. 1 shows the STFT representation of the magnitude (left) and phase (right) spectra. Unlike the magnitude spectrum, the phase spectrum presents no useful structure. For proper access to phase information, a structure as clear as the formants and harmonicity presented by the spectrogram would be quite helpful¹. The plots are shown for a female utterance from the GRID corpus (Cooke et al., 2006). The utterances in this corpus are command-like, consisting of six units including a color, a letter, and a number. The utterance chosen for the experiment shown here is "bin blue at l four soon", containing a plosive [b], vowel [U:], and fricative [s]. In the following, we provide alternative phase representations that exhibit more useful structure and are of more help when performing phase-aware speech processing.

¹ Some implementations for the figures and audio samples are available at https://www.spsc.tugraz.at/PhaseLab.
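To make the notation concrete, the following minimal Python sketch computes the framewise DFT of Eq. (1) and splits it into the magnitude and instantaneous phase of Eq. (2). The synthetic signal, sampling rate and frame settings are illustrative assumptions, not the settings used for Fig. 1.

```python
import numpy as np

fs = 16000                                   # assumed sampling rate [Hz]
n = np.arange(fs)                            # one second of a synthetic "voiced" signal
x = sum(np.sin(2 * np.pi * 120 * h * n / fs) / h for h in range(1, 6))

N, hop = 512, 128                            # frame length and hop (75% overlap)
win = np.hanning(N)
frames = [win * x[i:i + N] for i in range(0, len(x) - N + 1, hop)]
X = np.fft.rfft(frames, axis=1)              # X(k, l), Eq. (1): one row per frame l

magnitude = np.abs(X)                        # |X(k, l)|: the spectrogram, Fig. 1 (left)
phase = np.angle(X)                          # phi_x(k, l), wrapped to [-pi, pi): Fig. 1 (right)
print(magnitude.shape, phase.min(), phase.max())
```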
3.2. Phase derivatives across time and frequency
Extracting useful features from a Fourier phase spectrum is not straightforward. This is due to the difficulties caused by phase wrapping, the dependency of the phase spectrum on the window starting sample, and rapid changes in the phase spectrum when the zeros of the complex spectrum X(z) lie near the unit circle in the z-plane. The wrapping issue in the STFT analysis leads to cyclic wrapping that masks any useful structure in the spectral phase. Phase unwrapping methods have been studied in the literature (Nico and Fortuny, 2003), with a recent overview presented in Drugman and Stylianou (2015). In phase-based feature extraction, two definitions play a major role in finding relevant features: the group delay and the instantaneous frequency. The group delay describes the local time behavior as a function of frequency; in a dual representation, the instantaneous frequency characterizes the local frequency behavior as a function of time. Taking the derivative of the spectral phase across frequency and time de-emphasizes the wrapping issue and reveals the harmonic structure (see, for example, the group delay plots shown in Figs. 2 and 3).

Fig. 2. Group delay representation of a voiced speech segment. Three group delay functions are presented along with the Fourier spectrum (FFT) and LP spectrum (LPC). The modified group delay (MGD) (Hegde et al., 2007b), chirp group delay (CGD) (Bozkurt et al., 2007) and LP group delay (LPGD) (Rajan et al., 2013) methods provide different properties of a group delay function, which can be utilized efficiently in specific recognition applications.

Fig. 3. Time-frequency representations, instantaneous phase plots, from left to right: STFT phase, phase variance, group delay, relative phase shift (RPS), and phase distortion standard deviation (PDD), shown for a female utterance saying "bin blue at l four soon". While the instantaneous phase presents no useful structure, both the phase variance (second panel) and the group delay (third panel) present a harmonic structure. At voiced segments, the shape of the glottal pulse changes smoothly, reflected in a close-to-zero PDD pattern (Degottex and Erro, 2014a; Koutsogiannaki et al., 2014), while PDD presents a randomized structure in non-harmonic or unvoiced segments (fourth panel). Finally, the RPS shows stable patterns during the vowels, showing the evolution along time of the RPS for each harmonic (fifth panel).
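The following sketch illustrates, under toy assumptions, how the wrapping issue is handled in practice: numpy's unwrap removes the 2π jumps along frequency, and the discrete frequency and time differences of the phase approximate the group delay and instantaneous frequency defined below.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.fft.rfft(np.hanning(512) * rng.standard_normal((8, 512)), axis=1)  # toy frames
phi = np.angle(X)                                  # wrapped STFT phase

# Eq. (3): group delay as the negative discrete derivative across frequency,
# computed on the unwrapped phase to avoid the 2*pi jumps.
tau = -np.diff(np.unwrap(phi, axis=1), axis=1)

# Eq. (4): instantaneous frequency as the principal value of the phase
# increment across time.
princ = lambda a: np.angle(np.exp(1j * a))         # maps any angle to [-pi, pi)
inst_freq = princ(np.diff(phi, axis=0))
```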
The group delay representation has long been considered as an alternative method for spectrum estimation (Murthy and Yegnanarayana, 2011; Yegnanarayana and Murthy, 1992). The calculation of the group delay requires a continuous, hence unwrapped, phase function. As a solution to the wrapping problem in calculating the group delay function, an averaging method was proposed in Yegnanarayana et al. (1985). The group delay (GD) is given by the negative first derivative of the instantaneous phase with respect to frequency:

τ(k, l) = −Δk(φ(k, l)),   (3)

where Δk is the discrete differentiator across frequency. The group delay representation has been reported as a useful analysis tool in sound classification (Kandia and Stylianou, 2008), spectrum estimation (Yegnanarayana and Murthy, 1992), speech recognition (Hegde et al., 2007a),
speaker identification (Hegde et al., 2004), signal reconstruction (Yegnanarayana et al., 1984), speech segmentation (Kamakshi Prasad et al., 2004), waveform estimation (Yegnanarayana et al., 1985), and formant extraction (Murthy and Yegnanarayana, 1991a). Stark and Paliwal proposed the group delay deviation (GDD) as a useful feature for speech analysis (Stark and Paliwal, 2009b). The concept was also employed to estimate the spectral phase of a desired signal in noise or in a mixture of speakers (Mowlaee and Saeidi, 2014; Mowlaee et al., 2012).

As an alternative phase representation in the STFT domain, one can consider the first time derivative, called the instantaneous frequency (IF) (Cohen, 1989), given by

IF(k, l) = princ{φ(k, l) − φ(k, l − 1)},   (4)

where princ{·} is the principal value operator that maps an input value to the range [−π, π]. The instantaneous frequency has been reported to be useful for formant detection (Kumaresan and Rao, 1999), speaker recognition (Grimaldi and Cummins, 2008), and source separation (Gu, 2010). Furthermore, the authors in Stark and Paliwal (2008) proposed the IF deviation (IFD), which has been shown to be a useful representation for speech analysis (Stark and Paliwal, 2008) as well as speech quality estimation (Gaich and Mowlaee, 2015a; 2015b). For more details on the instantaneous frequency and different methods for its estimation, we refer to Boashash (1992).

The speech signal is, in general, a mixed-phase signal. In terms of its analysis, the group delay of a mixed-phase signal is the sum of the group delays of its minimum-phase and maximum-phase components (Yegnanarayana et al., 1984). It is common to work with only the minimum-phase part of the speech signal (Kamakshi Prasad et al., 2004; Quatieri and Oppenheim, 1981). The definition of the group delay function as the derivative of the phase spectrum results in a tight connection to the amplitude spectrum, implying that the phase and amplitude spectra convey complementary information about the underlying signal. Dropping the frame index l, for a short-time signal x(n) of length N we have

X(z) = Σ_{n=0}^{N−1} x(n) z^{−n} = x(0) z^{−(N−1)} Π_{m=1}^{N−1} (z − zm),   (5)
X(ω) = X(z)|_{z=e^{jω}} = |X(ω)| e^{jφx(ω)},   (6)

log(X(ω)) = log|X(ω)| + jφx(ω),   (7)

where zm are the zeros of the signal. The group delay function in the continuous domain is then calculated as

τc(ω) = −dφx(ω)/dω = −Im{d(log(X(ω)))/dω} = (XR(ω)DR(ω) + DI(ω)XI(ω)) / |X(ω)|²,   (8)

where the R and I indices indicate the real and imaginary parts of a complex variable, respectively, and D(ω) denotes the first derivative of X(ω) with respect to ω. The zeros of X(ω) can cause instability in the group delay function. In order to deal with the spurious peaks in the resulting group delay function (dips of the amplitude spectrum), it was suggested to use a cepstrally smoothed version of |X(ω)|, called S(ω), in the denominator of τ(ω), leading to the modified group delay (MGD) (Hegde et al., 2007b):
τs(ω) = (XR(ω)DR(ω) + DI(ω)XI(ω)) / |S(ω)|^{2γ},   (9)

τm(ω) = (τs(ω) / |τs(ω)|) |τs(ω)|^{α},   (10)
where γ and α are tuning parameters. Casting the group delay function in the form of τm(ω) is an established way to extract phase-based features for ASR and speaker recognition. A conventional way of extracting meaningful features from the group delay is to convert it to a cepstral representation (Hegde et al., 2004) by applying the discrete cosine transform to the group delay. Several modifications of the group delay calculation were proposed in Bozkurt and Couvreur (2005) to deal with the zeros of the complex spectrum when preparing phase-based features for ASR. Alternatively, it is possible to reduce the effect of zeros by smoothing the phase spectrum of a mixed-phase speech signal before computing the group delay function and then performing cepstral smoothing (Alsteris and Paliwal, 2007b). Another method is to treat the group delay representation as a power spectrum and pass it through a Mel-scale filterbank before the cepstral transformation (Bozkurt and Couvreur, 2005). Unlike the common application of log compression to Mel filterbank energies, the significance of log (root) compression in processing the group delay representation is controversial. In the search for alternative ways to deal with dips in the amplitude spectrum, a product spectrum was used in Donglai and Paliwal (2004), whereby the power spectrum and the group delay were multiplied. It has been shown that windowing is crucial to obtain reliable group delay functions for speech analysis (Bozkurt and Couvreur, 2005; Bozkurt et al., 2004a). With a proper choice of window function (Alsteris and Paliwal, 2004b), i.e., efficient treatment of zeros in the amplitude spectrum, group delay functions reveal the formant structure more clearly. Zeros close to the unit circle in the z-plane result in spikes in the group delay function and, consequently, mask the vocal tract information present in it (Murthy and Yegnanarayana, 1991b). In order to obtain a spike-free group delay representation, the chirp group delay (CGD) was proposed in Bozkurt et al. (2007). The CGD was defined as the negative derivative of the phase spectrum computed from the chirp z-transform, that is, the z-transform computed on a circle or spiral other than the unit circle. Finally, in Rajan et al. (2013), the group delay function was calculated using the phase of a parametric all-pole model rather than the Fourier spectrum. To this end, the speech spectrum is decomposed into minimum-phase and all-pass components. The group delay is then extracted from the minimum-phase part, which has been reported to be suitable for speaker recognition. A demonstration of selected group delay representations is presented in Fig. 2. The time-frequency representation of the group delay shown in Fig. 3 (third column) reveals a harmonic structure in speech that follows the spectrogram shown in Fig. 1 (left).
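As a concrete illustration of Eqs. (9) and (10), the sketch below computes a modified group delay function in Python. It uses the standard identity that the group-delay numerator can be formed with Y the DFT of n·x(n); the values of gamma, alpha and the cepstral cutoff are illustrative assumptions rather than the settings of Hegde et al. (2007b).

```python
import numpy as np

def modified_group_delay(x, gamma=0.9, alpha=0.4, n_cep=30):
    n = np.arange(len(x))
    X = np.fft.rfft(x)
    Y = np.fft.rfft(n * x)                      # encodes dX/dw up to a factor of -j
    # cepstrally smoothed spectrum S(w), Eq. (9), to tame zeros near the unit circle
    log_mag = np.log(np.abs(X) + 1e-12)
    cep = np.fft.irfft(log_mag)
    cep[n_cep:-n_cep] = 0.0                     # keep low quefrencies only
    S = np.exp(np.fft.rfft(cep).real)
    tau_s = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-12)
    return np.sign(tau_s) * np.abs(tau_s) ** alpha   # Eq. (10)

mgd = modified_group_delay(np.hanning(400) * np.random.randn(400))
```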
3.3. Phase representations in harmonic domain
McAulay and Quatieri (1986a) proposed the sinusoidal model as an analysis/synthesis technique whereby a short-time segmented speech signal is represented as a sum of sinusoids. While in the sinusoidal model each sinusoid is represented by the parameter triple of amplitude, frequency and phase, in the case of harmonically related sinusoidal components a harmonic model, a special case of the more general sinusoidal model, is used. The speech waveform is approximated by harmonics as follows:

x̂l(n) = Σ_{h=1}^{Hl} A(h, l) cos φx(h, l),   (11)
h=1 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452
where h is the harmonic index, Hl is the number of harmonics at frame l, and A(h, l) is the real-valued amplitude and φ x (h, l) is the phase both sampled at the hth harmonic, respectively. The instantaneous phase at harmonic h, φ x (h, l) is wrapped between [−π , π [. Alternatively, a meaningful representation of phase is the unwrapped one where the phase changes continuously without jumps. Several different methods have been proposed to unwrap the phase of a signal (see, for example, Karam and Oppenheim, 2007; Nico and Fortuny, 2003; Tribolet, 1977). A recent study by Drugman and Stylianou (2015) showed a comparative evaluation of the existing phase unwrapping methods in terms of their accuracy and computational complexity. Furthermore, they proposed an efficient iterative phase unwrapping method that relied on the calculated number of zeros of the z-transform outside the unit circle. Given a fundamental frequency at the lth frame denoted by f0 (l), a harmonic model plus phase distortion signal analysis/synthesis was proposed in Degottex and Erro (2014b). The authors proposed a pitch-synchronous segmentation with time instant t (l ) = t (l − 1) + 4 f01(l ) whereby the instantaneous phase at harmonics was decomposed into a linear and unwrapped phase part as: φx (h, l ) = 2π h
l l =0
f0 (l ) t (l ) − t (l − 1 )
linear phase
+ ∠V (h, l ) + ψd (h, l ) .
(12)
Unwrapped Phase: (h,l )
453 454 455 456
The first term is the linear phase component, which depends only on the fundamental frequency at lth frame f0 (l). Degottex and Erro gave a detailed analysis on phase distortion analysis (Degottex and Erro, 2014b). The second term ࢬV(h,
7
l), is the phase response of the vocal tract filter denoted by V(h, l), known to have a minimum phase response, assuming an all-pole model. The phase response of an envelope filter can be estimated through the Hilbert transform that links the phase response and its magnitude for a minimum-phase signal through cepstral representation (see Oppenheim et al., 1999 for more details). The last term, called source shape ψ d (h, l), represents the pulse shape and captures the stochastic character of the phase spectrum of the speech signal. The term was also called phase dispersion used in speech coding using a vector quantizer (Pobloth and Kleijn, 2003) and was modeled using wrapped Gaussian distribution (Agiomyrgiannakis and Stylianou, 2009). It is important to note that the linear phase removal results in an unwrapped phase composed of the minimum phase and dispersion phase parts. The unwrapped phase has been also used for phase-based speech enhancement (Kulmer and Mowlaee, 2015b; Mowlaee and Kulmer, 2015a; 2015b). Comparing the plots for magnitude spectrogram versus the instantaneous phase spectrum shown in Fig. 1 reveals that the instantaneous STFT phase exhibits only a random pattern without a clear harmonic structure due to the wrapping issue of the STFT phase. In fact, phase wrapping has been the main reason that phase-based signal processing has been considered less often in the literature on speech signal processing. As a consequence, the phase-based signal processing is believed to be more troublesome than signal processing methods relying on spectral amplitude-only. As alternative useful representations, in the following, we highlight three recent representations, derived from harmonic phase φ x (h, l) which provide useful insights into phase interpretation and show a more useful structure, hence, help when performing phaseaware speech processing.
457
3.3.1. Phase Distortion (PD) Phase distortion (PD) was first proposed by Degottex and Obin (2014), who calculated it by removing the contributions of minimum phase and linear phase parts from the instantaneous phase as:
490
PD(h, l ) = φx (h + 1, l ) − φx (h, l ) − φx (1, l ).
459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489
491 492 493 494
(13)
Phase distortion represents the shape of the glottal signal (Degottex et al., 2012; Degottex and Obin, 2014) and the glottal model parameters (Degottex et al., 2011a). As reported in Degottex and Erro (2014b), phase distortion can be statistically modeled using a wrapped Gaussian distribution determined by circular mean and variance parameters. As the mean value of phase was shown to be perceptually negligible in Koutsogiannaki et al. (2014), Degottex et al. proposed to characterize the short-time phase features of phase distortion in terms of its standard deviation only Degottex and Erro (2014b) where the phase distortion is modeled as: PD(h, l ) = WG (0, PDD(h f0 (l ), l )).
458
495 496 497 498 499 500 501 502 503 504 505
(14)
In (14), PDD(hf0 (l), l) captures the variance of phase changes at the hth harmonic multiple (Degottex and Erro, 2014b) and
Please cite this article as: P. Mowlaee et al., Advances in phase-aware signal processing in speech communication, Speech Communication (2016), http://dx.doi.org/10.1016/j.specom.2016.04.002
506 507
JID: SPECOM
8 508 509 510 511 512
513 514 515 516
518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534
535 536 537 538 539 540
3.3.2. Relative Phase Shift (RPS)
Saratxaga et al. (2012) proposed the relative phase shift (RPS), which relates the phase of the hth harmonic to the phase at the fundamental frequency (h = 1):

RPS(h f0, l) = φx(h f0, l) − h φx(f0, l).   (15)
As shown in Degottex and Erro (2014b), the RPS representation discards the linear phase contribution: only the source shape and minimum-phase envelope contributions remain at each harmonic, relative to those of the first harmonic. Because both the envelope and the source shape are assumed to evolve smoothly over time, the RPS is known for its smooth trend across frames (Saratxaga et al., 2012), without requiring estimation of the pitch pulse onset. RPS representations and derived features have been used for spoofing detection in speaker recognition (De Leon et al., 2011a; Sanchez et al., 2015). More recently, the RPS representation was used for frequency smoothing during phase estimation in single-channel speech enhancement (Kulmer et al., 2014; Mowlaee and Kulmer, 2015a). In Fig. 3 (fourth column), the RPS uncovers the harmonic structure in the phase spectrum which is not visible in the instantaneous phase; for a clean signal, the RPS displays small changes with a smooth trend at the harmonics.
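Eq. (15) translates directly into a few lines of code; the sketch below assumes a (frames × harmonics) matrix of wrapped harmonic phases, as in the previous sketches.

```python
import numpy as np

def relative_phase_shift(phi):
    H = phi.shape[1]
    h = np.arange(1, H + 1)
    rps = phi - h * phi[:, :1]          # phi_x(h*f0, l) - h * phi_x(f0, l), Eq. (15)
    return np.angle(np.exp(1j * rps))   # wrap back to [-pi, pi)
```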
3.3.3. Unwrapped harmonic phase
Following the harmonic model phase decomposition given in (12), removing the linear phase component (defined as the integral of the instantaneous frequency) from the instantaneous harmonic phase yields the unwrapped harmonic phase

Ψ(h, l) = ∠V(h f0) + ψd(h, l),   (16)

known for its smooth change across time (Degottex and Erro, 2014b), with ψd(h, l) the phase dispersion already defined in (12). This property of the unwrapped phase has been used to derive several interpolation techniques for speech synthesis (McAulay and Quatieri, 1986b) based on the harmonic model of speech signals. Recently, Agiomyrgiannakis (2015) presented an overview of several interpolation techniques applied to the harmonic phase for speech synthesis. The interpolation techniques comprise quadratic phase splines and phase-locked pitch-synchronous interpolation, and were demonstrated to deliver high synthesized speech quality when used in a harmonic signal modeling framework. Furthermore, using the smoothness property of the unwrapped phase at the harmonics, temporal smoothing filters have been proposed to enhance noisy speech in a single-channel scenario; improvements in perceived quality and speech intelligibility have been reported by reducing the phase variance of the noisy signal (see e.g., Kulmer and Mowlaee, 2015b; Kulmer et al., 2014; Mowlaee and Kulmer, 2015b).

A non-uniform representation of the unwrapped phase at the harmonics has been employed to achieve a meaningful statistical representation of the underlying distribution of the harmonic phase features. To this end, the authors in Kulmer and Mowlaee (2015b); Kulmer et al. (2014) proposed to fit a von Mises distribution (Evans et al., 2000, p. 191) to the harmonic phase, given as

VM(μc(h, l), κ(h, l)) = e^{κ(h, l) cos(Ψ(h, l) − μc(h, l))} / (2π I0(κ(h, l))),   (17)
where Iη(·) is a modified Bessel function of order η, and μc(h, l) and κ(h, l) denote the mean direction (also called the circular mean) and the concentration parameter of the von Mises distribution, respectively. It is important to note that, as a phase prior, the von Mises distribution ranges between a uniform distribution (maximum uncertainty) for κ = 0 and a Dirac delta (fully deterministic) for κ → ∞. The concentration parameter κ plays the key role in conveying the uncertainty around a given mean direction and is estimated indirectly from the phase variance using a non-linear equation in terms of Bessel functions (Evans et al., 2000). Fig. 3 (second column) shows the phase variance in time-frequency for a clean speech signal, revealing a harmonic structure similar to the spectrogram shown in Fig. 1 (left).
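A sketch of fitting the von Mises model of Eq. (17) to phase samples. The circular mean gives μc, and scipy's maximum-likelihood fit (with fixed location and scale) stands in for the Bessel-equation-based estimation of κ mentioned above; this is an illustrative shortcut, not the estimator of Kulmer et al.

```python
import numpy as np
from scipy import stats

samples = np.random.vonmises(0.5, 4.0, size=1000)          # toy harmonic phases
mu_c = np.angle(np.mean(np.exp(1j * samples)))             # circular mean direction
kappa, loc, scale = stats.vonmises.fit(samples, floc=0, fscale=1)
print(mu_c, kappa)   # kappa -> 0: uniform prior; large kappa: near-deterministic
```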
4. Applications
4.1. Signal enhancement
Signal enhancement comprises two sub-groups: speech enhancement and source separation. While in the first category one is interested in recovering the desired signal only, by filtering out the undesired interfering noise, in the latter category one is interested in recovering all the signals in the mixture. Depending on the number of microphones used during the enhancement process, two scenarios are often considered: multi-channel and single-channel.
4.1.1. Multi-channel signal enhancement
Array processing was first used in sonar and radar systems designed to detect and analyze narrowband signals. The extension to speech signal processing (wideband signals), started by Brandstein and Ward, led to microphone array processing for many speech applications, including noise reduction, acoustic source localization, source separation and target tracking (Brandstein and Ward, 2001). With more than one microphone, spatial diversity can be exploited to improve the signal acquisition in the front-end unit. Microphone array processing allows the spatial and statistical characteristics of the signals to be extracted. This is made feasible by a combination of the microphone signals in the array, carried out with a beamformer. A beamformer is designed such that a desired spatial selectivity is met while undesired sources from other directions are suppressed.
The simplest beamformer adds up the microphone signals and is called the delay-and-sum beamformer (a special case of the more general filter-and-sum beamformer). The delays are approximated as phase shifts instead of delay lines, and the resulting beamformer, referred to as a phased array, is implemented using phase shifters (Van Trees, 2002, p. 35). Let the phases of a harmonic arriving at the first and second microphones be φ1 and φ2, and further define c as the speed of sound, d the spacing between the microphones, and θs the direction of arrival. The phase shift between the two microphones is then given by:

φd = φ1 − φ2 = 2πf τ = β̃ d sin(θs),   (18)

where τ = d sin(θs)/c is the time delay of arrival. The phase shift defined in (18) is often used instead of the time delay in sensor array processing (also called phased-array processing in radar (Haykin and Liu, 2009) and antenna array processing (Van Trees, 2002, p. 35)), where β̃ = 2π/λ is the wave number and λ the wavelength. The phase difference contains information about the direction of arrival, which is useful for source localization applications (Brandstein and Ward, 2001).

Several cross-correlation based methods have been proposed in the literature and are intensively used for estimating the time delay of arrival of a signal. These methods are referred to as generalized cross correlation (GCC) (Knapp and Carter, 1976); the time-delay estimate is given by the peak of the cross-correlation function. The cross-correlation based on the magnitude spectrum is the basis of the GCC method. In order to filter out the spurious outlier peaks that often occur in the cross correlation, a GCC weighting function (Knapp and Carter, 1976) or a cepstral prefiltering step (Omologo and Svaizer, 1997) has been recommended. In contrast, the cross-spectrum phase has been reported in Knapp and Carter (1976) as a successful approach to resolve the peak of the dominant delay in a reverberant environment.

The use of phase information in microphone array processing has mostly been introduced for time-delay estimation, relying on the calculation of the phase difference between the microphones, φd in (18). As an example, Aarabi et al. demonstrated robust speech enhancement (Shi et al., 2007) and improved automatic speech recognition (Guangji et al., 2006). Obaid et al. (2012) also derived phase error filters, attempting to minimize the noisy phase variance; they showed improved speech intelligibility in cochlear implant processing, with more acceptable performance for cochlear implant recipients. Aarabi and Shi (2004) proposed phase error filters that rely on prior knowledge of the time difference of arrival from the sources at the microphones. They reported improvements in digit recognition accuracy (with tests conducted by several speakers) over phase-insensitive benchmark methods. They proposed to use the difference between the Fourier transform phases at the microphones, the central idea being that the phase difference
for a clean non-reverberant speech signal is equal after delay compensation. The phase difference was calculated as (Aarabi and Shi, 2004):

θβ(ω, l) = φR(ω, l) − φL(ω, l) − ωβ,   (19)
where the sub-indices R and L refer to the right and left microphones in a binaural setup, and β refers to a selected time-delay estimate for which the phase difference is calculated. Aarabi et al. showed that the signal-to-noise ratio at each frame and frequency is bounded by a function of the phase error (Shi et al., 2007). This principle was also used to derive a perceptually motivated phase-error filter for speech enhancement (Aarabi and Shi, 2004; Shi et al., 2007). In order to obtain the optimal phase transform (PHAT), they proposed a search for the minimum phase error, used to find the time-delay estimate between the left and right microphones. Given the optimum, a phase mask was derived based on this error term. The mask was hence a phase-based time-varying filter, finally applied to either of the two microphone signals to obtain a signal with reduced noise. Later, Kim et al. (2009) proposed a two-microphone approach where additional smoothing of the phase estimates was performed over time and frequency by applying Gammatone channel weighting, which further improved the automatic speech recognition accuracy.

Deleforge and Kellermann (2015) proposed a blind source separation method for the under-determined case (more sources than microphones) in which phase information is taken into account. The phase difference information carries the spatial information comprised of the interaural time difference (ITD) and the interaural level difference (ILD). Such spatial cues are preserved owing to the correlation in the speech-dominated signal components. Deleforge and Kellermann (2015) generalized the K-singular value decomposition (K-SVD) in terms of its sparse activation matrix: they proposed a sparse factorization relying on the estimation of the instantaneous phase information and showed that their phase-aware method preserves the spatial cues contained in the phase difference between the channels. The atoms in the dictionary are learned using phase-optimized dictionary learning, where the source activations together with their corresponding instantaneous phase data are used to develop a phase-corrected dictionary. The obtained dictionaries are then used to recover the sources from a mixture. Their results showed that phase-optimized K-SVD outperforms the conventional NMF or K-SVD methods, where phase information is not taken into account.
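The following toy example ties Eqs. (18) and (19) to the GCC method: the inter-microphone phase, via the PHAT-weighted cross spectrum, yields a time-delay estimate, which Eq. (18) converts into a direction of arrival. The sampling rate, microphone spacing and synthetic signals are assumptions.

```python
import numpy as np

fs, c, d = 16000, 343.0, 0.1            # sample rate, speed of sound, spacing [m]
delay = 3                               # true inter-microphone delay in samples
s = np.random.randn(4096)
left, right = s, np.roll(s, delay)

L, R = np.fft.rfft(left), np.fft.rfft(right)
cross = L * np.conj(R)
gcc_phat = np.fft.irfft(cross / (np.abs(cross) + 1e-12))   # keep the phase only
tau = int(np.argmax(gcc_phat))                             # peak -> delay estimate
tau = tau if tau < len(s) // 2 else tau - len(s)           # handle circular wrap
theta = np.degrees(np.arcsin(np.clip(tau / fs * c / d, -1, 1)))  # Eq. (18)
print(tau, theta)
```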
4.1.2. Single-channel speech enhancement
In single-channel speech enhancement, the aim is to extract the clean speech signal from a single-channel recording of noise-corrupted speech. In particular, an optimal estimate of the spectral amplitude and phase is desired in order to recover the underlying speech signal from the background noise. Conventional methods used within the last three decades (see Hendriks et al., 2013 for an overview) have mostly focused
on filtering the spectral amplitude of the noisy speech, while reconstructing the time-domain signal by copying the noisy spectral phase. Recent advances demonstrate that phase information can be of certain importance and can extend the limited performance of conventional enhancement methods that rely only on amplitude information. The importance of phase information in speech enhancement has been studied extensively in Alsteris and Paliwal (2006); Paliwal et al. (2011), and was more recently emphasized in Mayer and Mowlaee (2015); Mowlaee and Kulmer (2015a; 2015b); Mowlaee et al. (2014). In this section, we provide an extended up-to-date bibliography to highlight the recent advances made towards phase-aware signal processing in single-channel speech enhancement.

Fig. 4. Block diagram of different configurations for phase-based single-channel speech enhancement. Conventional (top): amplitude spectrum estimation using the noisy phase in signal reconstruction (Breithaupt et al., 2008; Hendriks et al., 2013; Loizou, 2013). Phase enhancement (middle): conventional amplitude estimation using an enhanced phase for signal reconstruction (Krawczyk and Gerkmann, 2012; 2014; Kulmer and Mowlaee, 2015b; Mowlaee and Kulmer, 2015a; 2015b; Mowlaee et al., 2012). Phase-aware amplitude estimation (bottom): enhanced phase followed by phase-aware amplitude estimation (Gerkmann, 2014; Gerkmann and Krawczyk, 2013; Mowlaee and Saeidi, 2013a; 2013b). y, x and v refer to the noisy, clean and noise signals, φy and φ̂x are the noisy and estimated clean phase spectra, σ̂v² is the estimated noise variance, and |X̂| is the spectral amplitude estimated by the conventional amplitude estimator.

Fig. 4 (top row) shows a block diagram of conventional single-channel speech enhancement, which is composed of two stages: (1) amplitude estimation and (2) signal reconstruction. As phase has an impact on both, in the following we consider three scenarios: in the first two, we study the impact of phase on each stage individually, while in the third we consider the joint estimation of the amplitude and phase information. For the first stage, a variety of spectral amplitude estimators are used, which differ in their efficacy, the selected optimization criterion, and the assumptions they make about the speech and noise probability distributions (see e.g. Breithaupt et al., 2008; Hendriks et al., 2013). During the signal reconstruction stage, the noisy phase spectrum has been a common choice, as it has been shown to provide the minimum mean square error (MMSE) (Ephraim and Malah, 1984)
or maximum a posteriori (MAP) (Lotter and Vary, 2005) estimate of the clean phase at individual frequency bins. However, this is only true under the assumption that the Fourier spectral coefficients are independent across time and frequency, which is obviously not the case in practice (Alsteris and Paliwal, 2007b; Jensen and Taal, 2014).

Let x(n) and v(n) represent the speech and noise signals, respectively, and let y(n) = x(n) + v(n) represent the noisy observation in the discrete time domain, with n the time index. Let Y(k, l) = |Y(k, l)|e^{jφy(k, l)} be the complex Fourier representation of the noisy signal at the kth frequency bin, with |Y(k, l)| and φy(k, l) the noisy spectral amplitude and phase spectrum, respectively. Similarly, we define X(k, l) = |X(k, l)|e^{jφx(k, l)} and V(k, l) = |V(k, l)|e^{jφv(k, l)} as the complex spectra of speech and noise, with |X(k, l)| and |V(k, l)| the corresponding spectral amplitudes. For the observed noisy signal:

|Y(k, l)|e^{jφy(k, l)} = |X(k, l)|e^{jφx(k, l)} + |V(k, l)|e^{jφv(k, l)}.   (20)
The spectral amplitude of the noisy signal is the absolute value of the vector sum of the underlying components:
|Y(k, l)| = √( |X(k, l)|² + |V(k, l)|² + 2|X(k, l)||V(k, l)| cos Δφ(k, l) ),   (21)

where the phase difference is defined as Δφ(k, l) = φx(k, l) − φv(k, l). It is obvious that ±Δφ(k, l) are both valid solutions of (21), and this is the source of ambiguity
in phase estimation based on the speech and noise geometry (Mowlaee et al., 2012). The goal in spectral amplitude estimation is to find Xˆ (k, l ) in a noisy complex spectrum Y(k, l) by defining a cost function in the form of d = f (X (k, l ), Xˆ (k, l )), which should be minimized. The form of cost function specifies the type of the estimator and optimality criterion used. Examples of distance functions are the mean square error in amplitude and logamplitude spectral domain (Ephraim and Malah, 1984; 1985), weighted squared error, or Itakura–Saito measure (Loizou, 2013) (see Hendriks et al., 2013 for an overview). For the sake of compact representation, in the rest of this subsection we omit the frame index l unless needed. The optimal estimate for the spectral amplitude of speech |Xˆ (k)|, in the sense of a minimum mean square error (MMSE), is derived in Breithaupt et al. (2008) by optimizing the MMSE criterion formulated as: |Xˆ (k)| = argmin{E {(c(|X (k)| ) − c(|Xˆ (k)| ))
2
Xˆ (k)
||Y (k)|, σv2 (k), ξ (k)}}, 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805
806 807 808 809 810 811 812 813 814 815
(22)
where $\sigma_v^2(k) = E\{|V(k)|^2\}$ is the noise power at frequency bin $k$, $\xi(k) = E\{|X(k)|^2\}/\sigma_v^2(k)$, $c(x) = x^{\beta_a}$ defines a compression function parameterized by $\beta_a$, and $E\{\cdot\}$ is the expectation operator. The MMSE estimator for the spectral amplitude was defined as a parametric estimator in Breithaupt et al. (2008) and was expressed in the form of a softmask function $G$ multiplied by the observed signal, $|\hat X(k)| = G(\xi(k), \zeta(k), \beta_a, \mu_a)|Y(k)|$. The parameters $\beta_a$ and $\mu_a$ determine the type of estimator, where $\xi(k)$ and $\zeta(k) = |Y(k)|^2/\sigma_v^2(k)$ are the a priori and a posteriori signal-to-noise ratios (SNRs), respectively. Different choices of $\beta_a$ and $\mu_a$ lead to different amplitude estimators. For instance, $\beta_a = 1$, $\mu_a = 1$ gives the short-time spectral amplitude (MMSE-STSA) estimator (Ephraim and Malah, 1984), and $\beta_a \to 0$, $\mu_a = 1$ yields the log-spectral amplitude (MMSE-LSA) estimator (Ephraim and Malah, 1985). By incorporating psychoacoustic considerations into the error function, the estimation of enhanced speech has also been driven from a perceptual perspective (Loizou, 2013).

State-of-the-art speech enhancement techniques use the enhanced spectral amplitude together with the noisy phase to reconstruct the enhanced speech, denoted by $\hat x(n)$. This process comprises a return to the time domain by computing $F^{-1}\{|\hat X(k)|e^{j\phi_y(k)}\}$, where $F^{-1}\{\cdot\}$ is the inverse short-time Fourier transform.

4.1.3. Phase estimation for signal reconstruction

Although the noisy phase is a reasonable estimate (MMSE, Ephraim and Malah, 1984, or MAP, Lotter and Vary, 2005) for the clean phase spectrum, Vary showed that the phase distortion due to added noise is perceptible for signal components with a local SNR below 6 decibels (Vary, 1985). While the choice of the noisy phase at sufficiently high local signal-to-noise ratios is not critical, using the noisy phase spectrum for all signal components in signal reconstruction has been observed to introduce a certain degradation in the signal
reconstruction performance (Krawczyk and Gerkmann, 2014; Kulmer and Mowlaee, 2015b; Le Roux and Vincent, 2013; Mowlaee and Kulmer, 2015a; 2015b; Mowlaee and Saeidi, 2013a; 2013b; Mowlaee et al., 2012). Therefore, a proper phase estimation method has the potential to improve perceived speech quality. In the following, we explain recent advances made towards phase estimation for signal reconstruction in speech enhancement and source separation.

GL-based: GL methods rely on Griffin–Lim iterative signal reconstruction, described as follows. The problem of estimating the spectral phase information dates back to the 1980s, when researchers took the first steps to address the issue of recovering the time-domain signal from STFT magnitude information only. For example, Hayes et al. (1980) proposed techniques to reconstruct a signal from its magnitude or phase information only. Quatieri and Oppenheim (1981) derived solutions to estimate the time-domain signal for a minimum-phase sequence in an iterative manner. Alsteris and Paliwal (2007a) presented a review of iterative signal reconstruction methods from short-time Fourier spectra. The key idea behind exploring the possibility of reconstructing a signal from its partial information in the STFT (either magnitude or phase) is the redundancy in the short-time Fourier representation: with frames overlapped by at least half their length, the magnitude and phase information are not independent. Griffin and Lim (1984) derived an iterative minimum mean square error solution for signal reconstruction given the magnitude of the STFT. They reported improved performance for large overlaps (e.g., 87.5%) and for a specific choice of window, the square-root Hann, leading to the maximum improvement in signal reconstruction in the sense of the MSE optimality criterion. The goal was to exploit the correlation between the magnitude and phase spectra.

Inspired by the Griffin–Lim (GL) iterative method for signal reconstruction, several different variants have been proposed in the literature.4 The original proposal by Griffin and Lim had some fundamental drawbacks: (1) it demanded a large number of iterations with Fourier transformations (synthesis-analysis) for each frame in order to provide a high signal reconstruction quality, (2) the algorithm required the magnitude spectra of all frames in the audio data, and (3) the algorithm was not appropriate for real-time applications due to the large number of iterations. Therefore, the authors in Zhu et al. (2007) proposed a practical real-time iterative magnitude spectrogram inversion. Sturmel and Daudet (2012) extended the GL solution from a full-band to a partial phase reconstruction (PPR), where the GL iteration was restricted to a selected time-frequency region. The idea was to replace the noisy phase in a confidence domain where the Wiener filter amplitude of the desired source is above a certain threshold, chosen either statically (Sturmel and Daudet, 2012) or in a signal-dependent way (Mowlaee and Watanabe, 2013; Sturmel and Daudet, 2013). More recently, the authors in Gunawan and Sen (2010) extended the iterative GL signal reconstruction procedure to a blind source separation framework. The idea was to calculate a remixing error in time-frequency between the mixture and the estimated sources at each GL iteration and distribute it over the individual sources. Different ways of error re-distribution have been considered: uniformly (Gunawan and Sen, 2010), depending on the activation of sources in the STFT (Sturmel and Daudet, 2013), or via a sinusoidal model fit to each source (Mowlaee and Watanabe, 2013). As an alternative approach, the inconsistency between a valid STFT of a real signal and the STFT obtained from the Wiener filter gain function and the mixture phase has been taken into account: Le Roux and Vincent (2013) proposed a consistent Wiener filter solution where the inconsistency was incorporated as a constraint when deriving the mean square error solution.

4 For a comparative study on the GL-based methods, see Mowlaee and Watanabe (2013).
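To make the GL synthesis-analysis loop concrete, the following minimal Python sketch reconstructs a time-domain signal from an STFT magnitude using SciPy. The function name, iteration count, and window settings are our own illustrative choices; Griffin and Lim additionally recommend a square-root Hann window, which is omitted here for brevity.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(S_mag, n_iter=100, nperseg=512, noverlap=448, seed=0):
    """Iterative reconstruction from an STFT magnitude (Griffin and Lim, 1984).

    S_mag is assumed to have shape (nperseg//2 + 1, n_frames); the 87.5%
    overlap follows the large-overlap recommendation discussed in the text.
    """
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, size=S_mag.shape)
    for _ in range(n_iter):
        # Synthesis: back to the time domain with the current phase estimate.
        _, x = istft(S_mag * np.exp(1j * phase), nperseg=nperseg, noverlap=noverlap)
        # Analysis: re-estimate the phase from the now-consistent STFT.
        _, _, S = stft(x, nperseg=nperseg, noverlap=noverlap)
        n = min(S.shape[1], S_mag.shape[1])  # guard against frame-count drift
        phase[:, :n] = np.angle(S[:, :n])
    _, x = istft(S_mag * np.exp(1j * phase), nperseg=nperseg, noverlap=noverlap)
    return x
```

Each pass imposes the target magnitude while keeping the phase of a consistent STFT, which is exactly the alternation that reduces the MSE criterion across iterations.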
Geometry-based: Mowlaee et al. (2012) proposed a phase estimation method that relied on a geometry-derived analytic solution together with an additional constraint on the group delay deviation. The additional constraint helped to resolve the ambiguity faced when choosing between the two phase candidates that differ in the sign of $\Delta\phi(k,l)$. The phase-enhanced reconstructed signal was compared to benchmark methods and showed a considerable improvement in the blind source separation evaluation (BSS-EVAL) measures (Vincent et al., 2006), including the signal-to-distortion ratio (SDR) and the signal-to-interference ratio (SIR), as well as in perceived quality measured in terms of the perceptual evaluation of speech quality (PESQ) (ITU-T, 2001). The geometry-based phase estimation method in Mowlaee et al. (2012) relies on the fact that, given the amplitude spectra of speech and noise, there are two phase candidates at each time-frequency cell. In order to remove the ambiguity in selecting the most likely phase value, a group delay constraint at sinusoids was employed. The group delay deviation is defined as the variation in the group delay with respect to the fixed value due to the analysis window, denoted by $\tau_w = \frac{N-1}{2}$, with $N$ being the length of the window. The group delay deviation is given by:
$$\Delta\tau(k,l) = \tau_w - \tau(k,l). \tag{23}$$
The group delay deviation has been reported to present minima at sinusoidal components (spectral peaks) (Stark and Paliwal, 2009b; Yegnanarayana and Murthy, 1992). Furthermore, in Mowlaee and Saeidi (2013a) it was shown that the group delay deviation is related to the phase deviation:
$$\Delta\tau(k,l) = \Delta_k\big(\phi_{dev}(k,l)\big), \tag{24}$$
where $\Delta_k$ is the discrete differentiation operator along frequency and the phase deviation $\phi_{dev}(k,l)$ is defined as the difference between the noisy and clean phase:
$$\phi_{dev}(k,l) = \phi_y(k,l) - \phi_x(k,l), \tag{25}$$
first suggested by Vary (1985). For large enough SNRs (e.g., SNRs above 6 decibels), the group delay deviation becomes small and is primarily contributed by a single sinusoid. The optimal phase value at a sinusoidal peak frequency was calculated by searching for the phase candidate in the ambiguous set that minimized the group delay deviation (Mowlaee et al., 2012).
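As a rough illustration of this selection rule, the sketch below forms the two clean-phase candidates from the mixture geometry of Eq. (21) and then greedily picks, along frequency, the candidate whose phase deviation (Eq. (25)) varies least across bins, i.e., whose group delay deviation (Eqs. (23)-(24)) is smallest. This is our own simplified rendition; the published method applies the constraint at sinusoidal peaks only.

```python
import numpy as np

def wrap(p):
    """Wrap a phase value to (-pi, pi]."""
    return np.angle(np.exp(1j * p))

def geometry_phase_candidates(Y_mag, X_mag, V_mag, phi_y, eps=1e-12):
    """Two clean-phase candidates per bin from the triangle geometry of Eq. (21)."""
    cos_t = (Y_mag**2 + X_mag**2 - V_mag**2) / (2.0 * Y_mag * X_mag + eps)
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
    return phi_y + theta, phi_y - theta

def select_by_group_delay(phi_y, cand_a, cand_b):
    """Greedy per-bin sign selection along frequency (simplified sketch)."""
    phi_x = np.empty_like(phi_y)
    phi_x[0] = cand_a[0]
    prev_dev = wrap(phi_y[0] - phi_x[0])
    for k in range(1, len(phi_y)):
        dev_a = wrap(phi_y[k] - cand_a[k])
        dev_b = wrap(phi_y[k] - cand_b[k])
        # Smaller change of phase deviation across k means smaller |group delay deviation|.
        if abs(wrap(dev_a - prev_dev)) <= abs(wrap(dev_b - prev_dev)):
            phi_x[k], prev_dev = cand_a[k], dev_a
        else:
            phi_x[k], prev_dev = cand_b[k], dev_b
    return phi_x
```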
Mowlaee and Saeidi (2014) further extended the geometry-based phase estimation idea to single-channel speech enhancement, where other constraints on phase, such as the instantaneous frequency deviation (IFD) and the relative phase shift (RPS), were incorporated in order to reduce the phase ambiguity. Slight improvements in both perceived quality and speech intelligibility, as predicted by instrumental measures, were reported for phase-enhanced reconstructed signals. The geometry-based phase estimators were integrated into the iterative phase-aware closed-loop single-channel speech enhancement in Mowlaee and Saeidi (2013a).

Model-based: The authors in Mehmetcik and Ciloglu (2012) proposed reconstructing the spectral phase across the time axis only and reported insignificant improvements in the PESQ measure (ITU-T, 2001). The idea of a predictable phase evolving across time for the first few harmonics has often been used in phase vocoders (Dolson, 1986) and low-bit-rate speech coding: the phase of the current frame $\hat\phi_x(k,l)$ is given by adding an error term to the phase of the previous frame $\hat\phi_x(k,l-1)$,
$$\hat\phi_x(k,l) = \hat\phi_x(k,l-1) + \alpha_s \mathrm{IF}(k,l), \tag{26}$$
where $\mathrm{IF}(k,l)$ is the instantaneous frequency (IF) defined in (4) and $\alpha_s$ is a positive real-valued factor indicating an increasing or decreasing instantaneous frequency of the signal ($\alpha_s = 1$ in the phase vocoder case).
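A minimal sketch of this temporal phase prediction follows, assuming the instantaneous-frequency term is already expressed as a phase increment per frame hop (in radians); the function names are illustrative.

```python
import numpy as np

def predict_phase(phi_prev, inst_freq, alpha_s=1.0):
    # Eq. (26): advance the previous frame's phase by the scaled
    # instantaneous-frequency term, wrapping the result to (-pi, pi].
    return np.angle(np.exp(1j * (phi_prev + alpha_s * inst_freq)))

def propagate_phase(phi0, inst_freq_per_frame, alpha_s=1.0):
    # Roll Eq. (26) forward over a sequence of frames.
    phi = [phi0]
    for IF_l in inst_freq_per_frame:
        phi.append(predict_phase(phi[-1], IF_l, alpha_s))
    return np.stack(phi)
```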
Following this principle, the authors in Krawczyk and Gerkmann (2012) proposed an STFT phase reconstruction of the noisy speech signal called STFT phase improvement (STFTPI). The method imposes the harmonic structure up to the Nyquist frequency: a comb filter harmonizes the noisy speech by modifying the noisy phase spectrum across the frequency axis in order to compensate for the phase introduced by the window, and the spectral phase components between harmonics are set to the value of the closest harmonic. An improvement in PESQ was reported in Krawczyk and Gerkmann (2014); Patil and Gowdy (2014), but a buzzy speech quality was introduced, in particular at high frequencies. A similar concept was employed in Patil and Gowdy (2014) to estimate the noise in-between harmonics. An improvement in terms of PESQ (ITU-T, 2001) was reported when replacing the noisy STFT phase with the temporally reconstructed one using the phase prediction model given in (26). The discrepancy between the improvement predicted by PESQ and the informal listening results for STFTPI phase-only enhanced signals invites questions about the reliability of the existing instrumental measures, which were originally proposed for conventional amplitude-only speech enhancement without capturing phase distortion. In Section 4.4, we will expand on this topic and present some recent findings on speech quality estimation when phase-aware signal processing is utilized (Gaich and Mowlaee, 2015a; 2015b).

Time-frequency smoothing of unwrapped phase: With the recent advances made in speech synthesis using the harmonic model plus phase distortion (HMPD) (Degottex and Erro, 2014b), it is possible to obtain an unwrapped phase from an instantaneous STFT phase. Kulmer and Mowlaee (2015b) proposed applying a harmonic phase decomposition followed by a temporal smoothing filter in order to reduce the variance of the noisy phase. The phase variance has been shown to be a reliable metric for the quality assessment of synthesized speech (Koutsogiannaki et al., 2014), and was further reported to be the key to resolving the spectral phase information of two adjacent harmonics (Mowlaee and Kulmer, 2015b). The phase variance given by the PDD calculated in (14) was shown to be a function of the selected window and, in particular, to depend on two characteristics: i) the side-lobe level rejection and ii) the coherent gain.5 A study in Mowlaee and Kulmer (2015b) showed that the Blackman window provides a good balance for a joint improvement in quality and intelligibility, as predicted by the PESQ (ITU-T, 2001) and short-time objective intelligibility (STOI) (Taal et al., 2011) measures. In Kulmer et al. (2014), the idea of temporal smoothing of the unwrapped noisy phase was extended to a selective time-frequency smoothing using a probabilistic approach that relies on a pre-defined threshold on the circular variance of the unwrapped phase. The method led to a joint improvement in instrumental speech quality and intelligibility when used for phase-only enhancement or when combined with conventional amplitude-only enhancement. Finally, Maly and Mowlaee (2016) studied the importance of harmonic phase modification for signal reconstruction in speech enhancement, reporting that an unwrapped phase combined with a linear phase suffices for an improved signal reconstruction.

5 The coherent gain is defined as (Harris, 1978): $CG = \sum_{n=0}^{N-1} w(n)$, where $w(n)$ is the prototype window of length $N$ with $n \in [0, N-1]$.
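The temporal smoothing step can be illustrated with a moving circular mean, which averages unit phasors rather than raw angles and is therefore insensitive to $2\pi$ wrapping. This is a sketch in the spirit of the smoothing in Kulmer and Mowlaee (2015b); the window length and function name are illustrative.

```python
import numpy as np

def smooth_harmonic_phase(phi, win_len=3):
    """Moving circular mean of an unwrapped harmonic phase.

    phi: array of shape (n_harmonics, n_frames). Averaging the phasors
    along time and taking the angle reduces the variance of the noisy
    phase without introducing wrapping artifacts.
    """
    kernel = np.ones(win_len) / win_len
    z = np.exp(1j * phi)
    z_smooth = np.stack([np.convolve(row, kernel, mode="same") for row in z])
    return np.angle(z_smooth)
```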
Mismatched window: Wojcicki and Paliwal (2007) suggested employing windows of different dynamic range in the analysis and synthesis stages. The authors showed that employing a Hamming window for analysis, followed by a Chebyshev window for the synthesis stage, resulted in an improvement in PESQ.

Phase Spectrum Compensation (PSC): The DFT phase spectrum of the noisy speech signal is conjugate symmetric. In Stark et al. (2008); Wojcicki et al. (2008), the authors proposed a speech enhancement approach that controls the degree of conjugate symmetry enforced on the noise-corrupted spectrum. By modifying the noisy spectral phase, and hence the angular relationship between the speech and noise spectra, an anti-symmetry function was applied in order to compensate the noisy short-time phase spectrum before the signal reconstruction stage. The compensated phase spectrum was eventually used together with the noisy spectral amplitude to produce a modified, phase-enhanced time-domain signal. The method improved the perceptual quality of noisy speech and was reported to be well suited for mid- to high-SNR scenarios.

Maximum A Posteriori (MAP): Assuming a uniform distribution of the spectral phase and the independence of the spectral amplitude and phase, it has been shown that an optimal estimate for the clean phase is equal to the noisy phase in the minimum mean square error (MMSE) (Ephraim and Malah, 1984) or maximum a posteriori (MAP) sense (Lotter and Vary, 2005). In contrast, changing the prior distribution of the phase from a uniform to a non-uniform one results in a MAP phase estimator different from the observed noisy phase. For example, in Gerkmann (2014) it was shown that the phase posterior for a known speech amplitude follows a von Mises distribution. For a given clean amplitude, a MAP estimator of the clean phase was proposed, relying on a grid search over a set of candidate phases in the possible range, i.e., $\phi(k,l) \in [-\pi, \pi[$. As reported in Gerkmann (2014), this brute-force solution is difficult to apply in practice due to its increased computational complexity.

More recently, Kulmer and Mowlaee (2015a) proposed a closed-form solution for a MAP estimator of the clean phase. The MAP harmonic phase estimate was derived by replacing the uniform prior with a von Mises distribution and under the assumption of a white Gaussian distribution for the noise signal. The MAP phase estimate is a function of the von Mises parameters (mean and concentration), the SNR at the underlying harmonic,6 and the observation data length. For large SNRs, in the limiting scenario, the MAP estimator asymptotically approaches the ML estimate given by the noisy DFT phase sampled at the harmonic frequency (Kay, 1993, p. 168). In such a high-SNR scenario, the noisy phase is weighted more than the mean value; this is made possible by incorporating a low concentration parameter in the von Mises prior considered in the MAP phase estimator derived in Kulmer and Mowlaee (2015a). On the other hand, the additional dependency on the harmonic SNR serves as a reliability mechanism for the estimated phase, implying that at low SNRs the proposed MAP estimator relies only on the circular mean. The circular mean and variance parameters in the von Mises phase prior provide a flexible sweep between maximum certainty (a Dirac delta with infinite concentration parameter) and maximum uncertainty (a uniform prior with zero concentration parameter) and statistically take into account the uncertainty in the estimated phase.

6 Defined as the signal-to-noise ratio for each sinusoid sampled at each harmonic.
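The qualitative behavior of such a MAP estimator can be sketched as the mode of a von Mises posterior, i.e., the resultant of two phasors weighted by the prior concentration and a likelihood weight. The specific likelihood weight below (harmonic SNR times observation length) is our assumption, not the exact closed form of Kulmer and Mowlaee (2015a).

```python
import numpy as np

def map_harmonic_phase(phi_noisy, mu_prior, kappa_prior, snr, n_obs):
    """MAP-style harmonic phase estimate under a von Mises prior (sketch).

    The likelihood weight grows with the harmonic SNR and the observation
    length, so at high SNR the estimate follows the noisy phase (ML-like),
    while at low SNR it falls back to the prior circular mean.
    """
    kappa_lik = snr * n_obs  # assumed likelihood concentration, for illustration
    return np.angle(kappa_prior * np.exp(1j * mu_prior)
                    + kappa_lik * np.exp(1j * phi_noisy))
```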
Phase randomization: Miyahara and Sugiyama (2013; 2014); Sugiyama and Miyahara (2013) proposed phase randomization as a new paradigm for speech enhancement. They reported improved speech enhancement for auto-focusing noise, a mechanical noise in audio-visual recording using digital still cameras. In particular, when using the zoom function during movie recording, the auto-focusing noise can become stronger than the desired signals from the environment, masking speech (Miyahara and Sugiyama, 2014). The main principle in phase randomization is to take into account the underlying auto-focusing or clicking noise and its linear phase character: the structured patterns in the phase of the noise signal can be de-emphasized by randomizing the observed noisy phase spectrum. Phase randomization alters the noise-corrupted regions and restricts the constructive (in-phase) addition of noise-dominated spectral components. Results in Miyahara and Sugiyama (2014) demonstrated that, using overlap-add at signal reconstruction, consecutive segments with randomized phase brought improved noise reduction performance due to the removal of the noise phase pattern. The improvement was confirmed by subjective listening results using a comparison category rating (CCR) test (Miyahara and Sugiyama, 2013; 2014; Sugiyama and Miyahara, 2013).

Minimum variance phase estimation error: Mowlaee and Kulmer (2015b) presented a theoretical derivation of the phase variance in single-channel speech enhancement and demonstrated how to optimize the window to minimize this variance. It was shown that the selected window directly impacts the variance in phase estimation, and that the phase variance is a function of the window type and the signal-to-signal ratio of the adjacent harmonic to the current one. The choice of the Blackman window in phase estimation was shown to provide a good balance between a joint improvement in perceived quality and intelligibility, as predicted by instrumental measures. Throughout this analysis, the authors in Mowlaee and Kulmer (2015b) presented the potential and the limits of the phase estimation methods in single-channel speech enhancement. Furthermore, they made a comparative study between representative methods and quantified their performance. The performance evaluation was carried out using instrumental predictors of perceived quality and speech intelligibility, PESQ and STOI respectively, as reported in Mowlaee and Kulmer (2015a; 2015b). In a subjective evaluation, MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) (Gaich and Mowlaee, 2015b; ITU-R BS, 2001) and comparison category rating (CCR7) (Mowlaee and Kulmer, 2015a) tests were conducted to estimate the perceived speech quality achieved by enhancing the phase of a noisy speech signal. For a subjective evaluation of speech intelligibility, a listening test was conducted and results were reported in Gaich and Mowlaee (2015a); Mowlaee and Kulmer (2015a) (see also Section 4.4 for more details).

4.1.4. Phase-aware amplitude estimators

An estimate of the clean phase can be used to derive an improved spectral amplitude. In this regard, the authors in Gerkmann and Krawczyk (2013); Mowlaee and Saeidi (2013b) derived the MMSE estimator of the spectral amplitude given the clean spectral phase information. The derived estimator is called phase-sensitive and showed an improved performance in the sense of a reduction in noise outliers and more noise reduction, owing to the phase contribution. The derivation depends on the phase deviation: the phase difference was interpreted as an additional noise reduction mechanism, as it rotates the noisy spectral coefficients towards the coordinates of the clean spectral coefficients. This, as a consequence, leads to additional noise reduction apart from the MMSE phase-insensitive part. The phase deviation concept was first defined by Vary (1985), given in (25), whereby he studied the impact of phase modification in speech enhancement, concluding that a phase deviation below 0.679 radians is not perceptually relevant to humans. Vary also showed that the maximum tolerable phase deviation corresponds to a local signal-to-noise ratio above a 6-decibel threshold for a harmonic. The phase deviation concept was further used for speech quality estimation (Gaich and Mowlaee, 2015b), phase estimation, and iterative closed-loop phase-aware speech enhancement (Mowlaee and Saeidi, 2013a).

In Krawczyk et al. (2013), a combination of the phase-aware minimum mean square error (PAMMSE) estimator (Gerkmann and Krawczyk, 2013; Mowlaee and Saeidi, 2013b) with a phase-insensitive amplitude estimator (Breithaupt et al., 2008) was proposed. The combined method relied on the voicing probability provided by a pitch estimator robust to high levels of noise (PEFAC) (Gonzalez and Brookes, 2014). The enhanced spectral amplitude under voiced-unvoiced uncertainty was derived in Krawczyk et al. (2013). The method trades off the degree of confidence against the amount of phase information to incorporate in voiced frames, while the conventional phase-insensitive method is used in unvoiced frames.

7 The instructions provided in ITU-T (1996) were recommended for assessing speech processing methods that either degrade or improve the speech quality.
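A simplified way to see the extra noise-reduction term contributed by the cosine of the phase deviation is the following sketch: a Wiener-type gain applied to the component of the noisy coefficient that points along the estimated clean phase. The published MMSE derivations in Gerkmann and Krawczyk (2013); Mowlaee and Saeidi (2013b) are more involved; names and the gain choice here are illustrative.

```python
import numpy as np

def phase_sensitive_amplitude(Y_mag, phi_y, phi_x_hat, xi):
    """Phase-sensitive amplitude estimate (simplified sketch).

    The cos(phi_dev) factor shrinks bins whose noisy phase deviates from
    the estimated clean phase, acting as an additional noise-reduction
    mechanism on top of the phase-insensitive Wiener gain.
    """
    phi_dev = phi_y - phi_x_hat            # Eq. (25), with an estimated clean phase
    gain = xi / (1.0 + xi)                 # standard Wiener gain from the a priori SNR
    return gain * Y_mag * np.maximum(np.cos(phi_dev), 0.0)
```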
4.1.5. Joint amplitude and phase estimation

So far, we have considered individual estimators for the amplitude and phase spectra of the clean signal. In the following, we consider approaches addressing the issue of jointly estimating the amplitude and phase spectra of clean speech from the noisy observation.

Complex spectral speech coefficients given Uncertain Phase information (CUP): Given that the distribution of the clean speech phase around the prior phase information is characterized by the von Mises distribution in (17), a joint estimator called Complex spectral speech coefficients given Uncertain Phase information (CUP) was proposed in Gerkmann (2014). The idea was to take into account the uncertainty in the spectral phase information, given an uncertain clean phase estimate, in the derivation of a Bayesian estimator for the clean speech coefficients (Gerkmann, 2014).

Iterative closed-loop phase-aware speech enhancement: Mowlaee and Saeidi (2013a) proposed a closed-loop configuration that jointly improves the spectral amplitude and phase given a noisy speech observation. The block diagram of the method is shown in the bottom panel of Fig. 4. The central idea was to apply the phase estimator followed by the MMSE phase-aware amplitude estimator (Gerkmann and Krawczyk, 2013; Mowlaee and Saeidi, 2013b), and to iterate until either a stopping criterion was satisfied or a maximum number of iterations was reached. Improved PESQ was reported within a few iterations. As the stopping criterion, it was proposed to check the number of iterations required to achieve a reasonable reduction in the inconsistency measure, in the form of a relative error capturing the difference in the STFT inconsistency across iterations.
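Such an inconsistency-based criterion can be illustrated by measuring how far a modified STFT is from the STFT of its own inverse; the function names and the relative-error form below are illustrative, not the exact measure used in the cited work.

```python
import numpy as np
from scipy.signal import stft, istft

def stft_inconsistency(S, nperseg=512, noverlap=256):
    """Relative inconsistency of a (possibly modified) STFT S.

    A consistent STFT of a real signal survives the inverse/forward round
    trip unchanged; a modified magnitude/phase combination generally does
    not, and the residual can drive a stopping criterion.
    """
    _, x = istft(S, nperseg=nperseg, noverlap=noverlap)
    _, _, S2 = stft(x, nperseg=nperseg, noverlap=noverlap)
    n = min(S.shape[1], S2.shape[1])  # guard against frame-count drift
    return (np.linalg.norm(S[:, :n] - S2[:, :n])
            / (np.linalg.norm(S[:, :n]) + 1e-12))
```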
Phase enhancement combined with amplitude enhancement: The methods listed in Section 4.1.3 provide an estimate of the clean phase spectrum. They can be used as a post-processing step to improve the signal reconstruction performance of a conventional amplitude-only enhancement method. We refer to Mowlaee and Kulmer (2015b) for a comparative study that investigated how much each of the amplitude and phase spectra contributes to improved speech quality. The phase information recovers missing harmonics and allows additional noise reduction between harmonics, with a special focus on onsets (timing events). The phase enhancement further suppresses the musical noise outliers introduced in the amplitude enhancement stage. An improvement in performance of up to 0.2 in PESQ and 2% in STOI has been reported (Mowlaee and Kulmer, 2015b). Similar joint improvements were reported using the MAP phase estimator proposed in Kulmer and Mowlaee (2015a), with a lower sensitivity to errors in fundamental frequency estimation.

4.1.6. Proof-of-concept example

As a proof of concept, we report the importance of incorporating phase information in two stages of single-channel speech enhancement: (i) in the signal reconstruction stage, and (ii) in the phase-aware amplitude estimator. For phase estimation, we choose the MAP phase estimator (Kulmer and Mowlaee, 2015a), and for the joint amplitude and phase estimation we consider the iterative phase-aware closed-loop method proposed in Mowlaee and Saeidi (2013a). The conventional single-channel speech enhancement is set to MMSE-LSA (Ephraim and Malah, 1985) with noise estimation from minimum statistics (Martin, 2001). Fig. 5 shows the results obtained by phase-aware speech processing methods applied to enhance a noisy speech signal. The noisy speech file is created by combining a female clean speech utterance selected from the GRID corpus with babble noise at 0 dB SNR (the same sentence as used in the phase representation, see Fig. 1). The results are shown in terms of the spectrogram (top), phase variance (middle) and group delay (bottom). The achievable improvement8 in speech enhancement performance is quantified using instrumental predictors of perceived quality (PESQ, ITU-T, 2001) and speech intelligibility (STOI, Taal et al., 2011), shown at the top of each panel. Consistent improvement is achieved when phase estimation is applied for signal reconstruction only (fourth column), or also for phase-aware amplitude estimation (last column), as compared to when no phase information is applied in the conventional method (third column).

8 The audio samples are available at https://www.spsc.tugraz.at/PhaseLab.

4.1.7. Phase-aware single-channel source separation

Unlike speech enhancement, there are applications which demand recovering all the underlying sources in an
observed mixed signal. This scenario is categorized as source separation, which is carried out with either a single microphone or multiple microphones. While the advances regarding phase-aware signal processing in the multi-microphone scenario were discussed earlier in Section 4.1.1, we now summarize the advances made towards incorporating phase information to solve the source separation problem. The single-channel source separation (SCSS) problem is formulated as follows:
given a mixture of $P$ audio sources denoted by $y(n) = \sum_{p=1}^{P} x_p(n)$, with $x_p(n)$ referring to the $p$th underlying source in the mixture, the goal is to estimate all sources $\hat x_p(n)$. In the case of two sources, to simplify the interpretation, and applying the Fourier transformation, we have:
$$Y(k,l) = |X_1(k,l)|e^{j\phi_1(k,l)} + |X_2(k,l)|e^{j\phi_2(k,l)}. \tag{27}$$
For perfect recovery of the sources, both the spectral amplitude and the phase information need to be estimated from the observed signal. This implies that the superposition of the estimated source spectra should reproduce both the spectral amplitude $|Y(k,l)|$ and the spectral phase $\phi_y(k,l)$ of the observed signal $y(n)$. The single-channel source separation problem is ill-conditioned, as the number of unknowns (four) is larger than the number of observations (two). As a workaround, the conventional SCSS methods often focus on filtering the spectral amplitude of the mixed signal only, while they copy the mixture phase spectrum directly at the signal reconstruction stage (Mowlaee, 2010). The conventional methods are mainly categorized into time-frequency masking (TFM) and non-negative matrix factorization (NMF), both neglecting phase information in the spectral amplitude modification and the signal reconstruction. In the following, we provide a brief overview of each method.

The TFM source separation methods rely on estimating a time-frequency mask. The mask is eventually applied to the observed signal to estimate the underlying target or masker signal. The TFM methods are further categorized into binary mask and ratio mask groups, which differ in the selected mask type. Binary masking was introduced as the main goal in computational auditory scene analysis (CASA). Wang and Brown (2006) presented details of CASA-based methods, which rely on an estimated ideal binary mask required to separate individual sources from a given mixture. The ideal binary mask has been reported to achieve the maximum intelligibility attainable by a source-driven (CASA) method (Wang, 2005). A binary mask (taking values of either 0 or 1) is applied to the observed signal to recover both underlying signals in the mixture. Without loss of generality, the ideal binary mask for the first source is defined as follows:
$$\hat X_1^{IBM}(k,l) = \begin{cases} Y(k,l) & |X_1(k,l)| \geq |X_2(k,l)| \\ 0 & \text{otherwise.} \end{cases}$$
By swapping the roles of the underlying sources, we can also define the binary-mask separated signal for the second source, denoted by $\hat X_2^{IBM}(k,l)$.
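In code, the ideal binary mask amounts to a per-cell dominance decision; the following toy sketch assumes oracle access to both source spectra, as the "ideal" mask does by definition.

```python
import numpy as np

def ibm_separate(X1, X2):
    # Ideal binary mask: assign each time-frequency cell of the mixture
    # to whichever source dominates it.
    Y = X1 + X2                          # mixture spectrum, Eq. (27)
    mask1 = np.abs(X1) >= np.abs(X2)
    return np.where(mask1, Y, 0.0), np.where(mask1, 0.0, Y)
```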
[Fig. 5 shows five columns of panels (Clean; Noisy, PESQ = 1.17, STOI = 0.62; Conventional, PESQ = 1.47, STOI = 0.61; Conventional + PE, PESQ = 1.56, STOI = 0.62; Phase-Aware, PESQ = 1.67, STOI = 0.60), each plotted as Frequency (kHz) versus Time (s).]
Fig. 5. Proof-of-concept simulation results for phase-aware single-channel speech enhancement for a female utterance corrupted by babble noise at an SNR of 0 decibels. From left to right, results are shown for: clean, noisy, conventional phase-insensitive enhancement (MMSE-LSA) (Ephraim and Malah, 1985), conventional + phase enhancement using the MAP phase estimator (Kulmer and Mowlaee, 2015a), and the iterative phase-aware amplitude and phase estimator (Mowlaee and Saeidi, 2013a). Results are shown as (top) spectrograms, (middle) group delay (3), and (bottom) phase variance. Improvements achieved in quality and intelligibility are given by the PESQ (ITU-T, 2001) and STOI (Taal et al., 2011) scores shown on top of the panels.
An ideal ratio mask (IRM) is defined as:
$$\hat X_1^{IRM}(k,l) = \left(\frac{|\hat X_1(k,l)|^2}{|\hat X_1(k,l)|^2 + |\hat X_2(k,l)|^2}\right)^{\beta_s} Y(k,l), \tag{28}$$
where $\beta_s$ is the compression factor (also called the tuning parameter) used to scale the mask.9 The second category of conventional SCSS methods is NMF. The goal in NMF is to represent the spectrogram of the observed signal as a weighted sum of spectrograms originating from the different acoustic events of the underlying sources. The weights are estimated in the separation stage, while the basis spectrograms (called atoms) are learned in the training stage. Each source is then represented as (Virtanen, 2007):
$$|\hat X_p| = \sum_{m=1}^{M} w_{m,p} b_{m,p}, \tag{29}$$
where $M$ is the number of atoms, with $m \in [1, M]$ as the atom index, $b_{m,p}$ the $m$th atom for the $p$th source, and $w_{m,p}$ the non-negative weight quantifying the contribution of the corresponding basis vector.

9 For example, for $\beta_s = 0.5$, we get the square-root Wiener filter, which provides the best results as shown in Huang et al. (2014); Wang et al. (2014).
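A toy rendition of these two conventional SCSS building blocks, the IRM of Eq. (28) and the per-frame NMF magnitude model of Eq. (29), is sketched below; array shapes and names are illustrative.

```python
import numpy as np

def irm_separate(X1_mag, X2_mag, Y, beta_s=0.5, eps=1e-12):
    # Ideal ratio mask, Eq. (28); beta_s = 0.5 gives the square-root
    # Wiener filter of footnote 9. Both sources keep the mixture phase,
    # as in conventional SCSS.
    denom = X1_mag**2 + X2_mag**2 + eps
    return ((X1_mag**2 / denom) ** beta_s * Y,
            (X2_mag**2 / denom) ** beta_s * Y)

def nmf_source_magnitude(W, B, p):
    # Magnitude of source p under the NMF model of Eq. (29):
    # |X_p| = sum_m w[m, p] * b[:, m, p].
    # B: (n_bins, M, P) atoms, W: (M, P) non-negative weights.
    return B[:, :, p] @ W[:, p]
```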
Neither of the aforementioned conventional SCSS methods (the TFM and NMF categories) takes the phase information of the sources into account. The phase information, however, as displayed in the block diagram of Fig. 4, affects two individual steps in the single-channel source separation problem: i) signal interaction, and ii) signal reconstruction. For signal interaction, the spectral phase difference of the sources plays an important role in the interaction model used to account for the interactions between the underlying sources in the observed mixture. Mowlaee and Martin (2012) showed that a phase-sensitive interaction model outperforms the conventional ones, such as log-max or MMSE (in the power spectrum), which average out the phase information. Erdogan et al. proposed a phase-sensitive filter model that led to improved single-channel source separation (Erdogan et al., 2015) and automatic speech recognition performance (Weninger et al., 2015) compared to phase-insensitive ones (log-max or MMSE).
For the signal reconstruction stage, proper phase information about the source leads to a certain improvement in the synthesis quality (Mayer and Mowlaee, 2015; Mowlaee et al., 2012). As in the single-channel speech enhancement described in Section 4.1.2, replacing the mixture phase spectrum with a clean spectral phase estimate has a positive impact on the achievable signal reconstruction performance in single-channel source separation. The ideal binary mask is known to produce the upper-bound performance for speech intelligibility in the single-channel source separation literature, yet it suffers from poor perceptual quality. Several attempts have been made at improving the perceived quality of the separated sources when a time-frequency mask is applied. For example, Williamson et al. (2014b) proposed a two-stage approach in which a softmask was applied to the mixed signal. In the second stage, they employed NMF to impose a sparsity constraint while reconstructing each source. Their results showed improvement in both perceived quality and speech intelligibility. In Williamson et al. (2014a), the same authors presented a review of different signal reconstruction techniques that could be used to improve the perceptual quality of binary-masked speech; improved perceived quality without loss of speech intelligibility was reported. Boldt et al. (2010) applied hidden Markov models to correct errors in estimating the ideal binary mask and, hence, improved the signal reconstruction performance.

Several studies investigated the importance of the phase spectrum for improved signal reconstruction in single-channel source separation. Mowlaee et al. proposed a geometric approach to estimate the phase information for signal reconstruction (Mowlaee et al., 2012). The method included an auxiliary constraint on the source phase spectra to resolve the ambiguity between the phase candidates provided by the problem geometry. To demonstrate the effectiveness of phase estimation for signal reconstruction, they compared results with respect to two benchmark methods: the ideal binary mask (the SCSS upper bound) and multiple input spectrogram inversion (MISI) (Gunawan and Sen, 2010) (where the GL iteration is combined with a re-mixing error distribution). Throughout their experiments, it was shown that the geometry-based approach improved performance in terms of the PESQ (ITU-T, 2001) and signal-to-distortion ratio (Vincent et al., 2006) measures. More recently, Mayer and Mowlaee (2015) proposed a phase estimation method for single-channel source separation which relies on harmonic phase decomposition, first proposed by Degottex and Erro (2014a). As the phase decomposition relies on an estimated fundamental frequency for each source, a multi-pitch estimator was also proposed (Mayer and Mowlaee, 2015). The estimated pitch trajectories of each source were then used to estimate the unwrapped phase. Temporal smoothing filters were applied to the unwrapped phase spectrum to reduce the large variance caused by the interfering source in the mixture phase spectrum. It was demonstrated that the source separation performance of the ideal binary mask as well as of NMF is improved in terms of quality and intelligibility when phase recovery is applied. This is a crucial finding, as the ideal binary mask is associated with the highest achievable speech intelligibility performance in single-channel source separation. These findings altogether highlight the importance of phase information in single-channel source separation, since a joint improvement in quality and intelligibility was not achievable using the ideal binary mask or NMF where phase information is not taken into account (the conventional SCSS methods improve either quality or intelligibility at the expense of reducing the other).

Phase information has also been directly taken into account to modify the matrix factorization, which is commonly carried out by employing NMF. In NMF, the mixture spectrogram is approximated as the sum of the spectrograms of the underlying sources; hence, the cross term due to the phase difference of the sources is neglected. Parry and Essa (2007) investigated the sensitivity of this
conventional assumption in spectrogram factorization by considering a probabilistic phase representation. They showed that, given the source spectrograms, their probabilistic phase representation provides a better fit to the true distribution of the components in the mixture spectrogram compared to conventional NMF. Kameoka et al. (2009) proposed complex NMF by introducing an interaction model in the complex spectrum where additivity between the sources holds. This complex interaction model was also used to formulate an iterative approach to extract the magnitude spectra that fit the complex spectrum and the phase spectra of the underlying signals. King and Atlas (2011) proposed to incorporate phase estimates using complex matrix factorization to select the bases for the optimal separation of sources. They demonstrated that incorporating phase information in complex matrix factorization (CMF) improves source separation and automatic speech recognition performance compared to NMF. Bronson and Depalle (2014a) proposed using phase-constrained complex NMF to separate overlapping time-frequency regions in mixtures composed of harmonic music sources. Ewert et al. (2014) proposed a method to refine the dictionaries learned by the conventional NMF method by taking into account the interactions of sound sources in the spectrogram on the basis of their phase overlaps and canceling regions. Magron et al. (2015a) proposed a phase recovery approach in the NMF framework for audio source separation. They presented a comparison between two scenarios: blind separation without prior information and oracle separation with a supervised learned model. A comparative study on the GL-based methods in single-channel source separation was made in Mowlaee and Watanabe (2013). Magron et al. (2015b) presented a comparative study in which different phase reconstruction techniques, including the GL method, were used for NMF-based source separation. Their method relied on unwrapping the phase over time and frequency to meet the temporal and spectral coherence of the signal between partials. The techniques discussed in Bronson and Depalle (2014b); Ewert et al. (2014); Magron et al. (2015a); 2015b) attempt to resolve the linear phase mismatch at signal reconstruction and have been demonstrated to improve the separation performance of pitched musical sources from their mixture.
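The difference between NMF and complex NMF is easiest to see in the mixture model itself: in complex NMF, each component carries its own phase, so additivity holds in the complex spectrum rather than in the magnitude spectrogram. The following schematic sketch (after the model in Kameoka et al., 2009; it omits the estimation algorithm, and shapes/names are illustrative) evaluates such a model.

```python
import numpy as np

def complex_nmf_model(W, B, Phi):
    """Mixture spectrum under a complex-NMF-style model.

    W: (M, L) activations, B: (K, M) non-negative bases,
    Phi: (K, M, L) per-component phases. Each component m contributes
    magnitude B[:, m] * W[m] with its own phase, and the components add
    in the complex domain.
    """
    K, M = B.shape
    L = W.shape[1]
    Y = np.zeros((K, L), dtype=complex)
    for m in range(M):
        Y += (B[:, m:m + 1] * W[m]) * np.exp(1j * Phi[:, m, :])
    return Y
```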
4.2. Speech coding

In speech coding, the goal is to meet a certain bit-rate constraint while still providing a transparent speech quality.10 Most speech coders mainly focus on assigning bits to represent the spectral envelope or the harmonicity. However, phase has also been shown to be perceptually important (Agiomyrgiannakis and Stylianou, 2009; Kim, 2000; Pobloth, 2004; Pobloth and Kleijn, 1999; 2003; Skoglund et al., 1997).

10 Speech quality is referred to as transparent when the reconstructed speech at the receiver is perceptually close enough to the one at the transmitter.

Phase discontinuities between consecutive frames may occur due to changes in the fundamental frequency at frame
boundaries, resulting in perceptually audible artifacts. Transitions from a voiced segment without any connecting silent or unvoiced segments may cause such discontinuities. In hybrid coders, the initial phases of the harmonic excitation are taken from the past excitation vectors. This procedure can be problematic, in particular for rapidly changing speech onsets (Kondoz, 2006). Choosing small data blocks in waveform coders, such as code-excited linear prediction (CELP), is a way to get around this issue. However, it causes inaccurate coding of the phase spectrum for a pitch cycle in the speech signal, resulting in a linear phase mismatch at signal reconstruction (Kubin and Kleijn, 1999). In particular, any displacement of a fraction of a sample can introduce audible distortion at high frequencies, especially for frames shorter than the pitch period. The perceptual degradation due to the aforementioned displacements in speech coding can be circumvented by spending a sufficient number of bits, hence performing high-resolution coding whereby small changes of phase are taken into account. For example, in mixed-excitation linear prediction (MELP) or CELP speech coders, the continuity of the signal is preserved by assigning bits for the phase alignment of the encoded frames, which can be important for signal reconstruction due to the linear phase mismatch. In order to limit the artifacts introduced when switching between a MELP coder (amplitude-only) and a CELP coder (amplitude and phase), the zero-phase equalization concept has been proposed (Moriya and Honda, 1986; Stachurski and McCree, 2000) to reduce the phase discontinuities by excluding the phase component that is not encoded in the MELP coder. The method was applied to the LPC residual using a filter of finite duration (Moriya and Honda, 1986), where the coefficients were calculated using a smoothed pitch pulse waveform. The obtained zero-phase-equalized residual signal was eventually used for reconstructing the coded speech using the LPC synthesis filter.

The importance of phase quantization in terms of perceptual artifacts has been studied in Kim (2000). It was suggested to account for the fact that the human auditory system is only sensitive to relative phase changes within one critical band. Therefore, irrelevant phase information of speech can be eliminated in the quantization and harmonic coding procedure. Subjective test results demonstrated the effectiveness of this approach. More phase components were reported to be perceptually important for low-pitched signals (e.g., male speech) than for high-pitched signals (female speech), concluding that for high-quality CELP-coded male speech, more bits are required to encode the phase information. In Kim (2003), the just-noticeable difference (JND) (Zwicker and Fastl, 2007) was first measured for the phase of one harmonic versus different fundamental frequencies. A mathematical model of the JND was then proposed as a quantization method, making it possible to assign bits to the most perceptually important phase information. This procedure led to a quantized speech signal that was perceptually close to the uncoded speech.
In low-bit-rate harmonic speech coders, the phase information is often not transmitted but is produced at the decoder for speech reconstruction (Kondoz, 2006). Examples include phase estimation using the minimum-phase assumption (Brandstein et al., 1990), where the harmonic phase is predicted based on the frequencies at the previous and present frames (Kleijn and Paliwal, 1995). Pobloth and Kleijn (1999) demonstrated that the human perceptual capacity towards phase is considerable; therefore, existing speech coders take phase distortions into consideration, in particular for low-pitched voices. Pobloth and Kleijn (2003) also studied the relevance of the squared error between the original and phase-distorted signals: while the squared error correlates well with perceptual measures for small phase distortions, it does not follow the increase in the perceptual measures for large distortion values.

In other studies (Kim, 2000; 2003), Kim showed that phase information below a certain frequency in a harmonic signal is not perceptually relevant. This frequency is called the critical phase frequency, defined as:
$$f_{\kappa_c} = \kappa_c f_0, \tag{30}$$
where $\kappa_c$ is the index of the critical phase frequency. The phase modification for frequencies above $f_{\kappa_c}$ is a local phase change, which modifies the phase relationship within a critical band; this changes the timbre of the signal and is perceptually relevant. In contrast, phase changes at harmonic frequencies below the critical phase frequency are in fact glottal phase changes and only modify the phase relationship between channels, while the relative phase relationship within critical bands is preserved; they are hence not perceptually important. The parameter $\kappa_c$ is pitch-dependent and, using the equivalent rectangular bandwidth (ERB) representation, is given by (Kim, 2003; Yu and Chan, 2004):
$$\kappa_c = Q_{ear}\left(1 - \frac{B_{min} - 0.5}{f_0}\right), \tag{31}$$
with $Q_{ear} = 9.26449$ and $B_{min} = 24.7\,\mathrm{Hz}$, following Glasberg and Moore's recommendation (Glasberg and Moore, 1990). For example, for a fundamental frequency of $f_0 = 100\,\mathrm{Hz}$, a critical phase frequency index of $\kappa_c = 7$ is obtained. Further, Kim (2003) concluded that phase information below the critical phase frequency is irrelevant to perceived quality, as it results in a lower JND compared to high frequencies. Finally, in Yu and Chan (2004), it was shown that encoding the phase of harmonics at or above the $\kappa_c$th harmonic index suffices to preserve the perceptual quality of coded speech.
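A small numerical check of Eqs. (30) and (31), reproducing the $\kappa_c = 7$ example for $f_0 = 100$ Hz (the constants follow Glasberg and Moore, 1990; the function name is illustrative):

```python
QEAR, BMIN = 9.26449, 24.7  # ERB constants (Glasberg and Moore, 1990)

def critical_phase_harmonic(f0):
    """Critical phase frequency index, Eq. (31), and frequency, Eq. (30)."""
    kappa_c = QEAR * (1.0 - (BMIN - 0.5) / f0)
    return kappa_c, kappa_c * f0

print(critical_phase_harmonic(100.0))  # approx. (7.02, 702.3 Hz), rounding to kappa_c = 7
```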
In low-bit-rate speech coding, the predictability of low-band harmonics across frames has been taken into account to minimize the number of bits assigned to the harmonic phase representation. In particular, in CELP and MELP coders, only the phase information at peaks is transmitted, while at the decoder the phase information across time is synthesized using interpolation techniques. For example, McAulay and Quatieri (1986a) assumed a minimum phase for the phase contribution of the vocal tract filter and proposed cubic interpolation to provide smooth phase trajectories for speech synthesis. In order to maintain a transparent speech quality, better phase information than the minimum phase at harmonics is required. Phase dispersion has been used to train codebooks for speech coding and complement the minimum-phase assumption (Agiomyrgiannakis and Stylianou, 2009). Finally, it is important to note that the correct overlap alignment of sinusoidal trajectories with regard to their phase coherence has been frequently studied and reported to be of high importance for high-quality synthesized speech. Phase interpolation techniques are commonly used in speech synthesis with sinusoidal/harmonic models: the cubic phase model (McAulay and Quatieri, 1986a) and the quadratic polynomial phase (Ding and Qian, 1997). More recently, Agiomyrgiannakis (2015) presented a review of phase smoothing methods used for harmonic speech reconstruction and also proposed a novel phase interpolation called quadratic phase splines, where the phase-locked pitch-synchronous property and the smoothing property of phase were taken into account. The method showed improved speech quality in the text-to-speech synthesis application.
4.3. Digital speech watermarking
In the digital multimedia industry, it is important to develop robust ways to hide and secure multimedia (audio or video) content from unauthorized users and intentional manipulation during transmission. To guarantee proof of ownership for the manufacturers, watermarking is used: a watermark (guest) signal is added to the host signal, enabling the host signal to be identified. In the following, we present a review of those speech watermarking methods in which phase information is taken into account (for a review of multimedia watermarking, please refer to Spanias et al. (2005)).

A speech watermarking approach has been proposed that relies on manipulating the phase spectrum of unvoiced speech while keeping the power spectrum unchanged. The watermark data (guest signal) were embedded into the phase of unvoiced speech segments by replacing the excitation signal of an auto-regressive signal representation. In unvoiced speech, the auditory system does not distinguish between different realizations of the Gaussian excitation process as long as the temporal and spectral envelopes of speech are preserved (Kubin et al., 1993). This insensitivity to the fluctuations in unvoiced and whispered speech allows watermark data to be embedded in the phase spectrum of unvoiced speech at a rate of up to 540 bits/s (Hofbauer et al., 2009). The method was reported to be highly robust against various attacks, including nonlinear phase and bandpass filtering.

A speech watermarking method that relies on the harmonic model of the speech signal and a harmonic phase coding principle was suggested in Hernaez et al. (2014). The digital watermark data are embedded into the phase at harmonics while preserving the relative phase shift (RPS) as an important structure for high-quality speech synthesis (see e.g. De Leon et al., 2011a; Sanchez et al., 2015; Saratxaga et al., 2012). The method used each harmonic as a communication channel, conveying
In theory, all harmonics could be used to hide data; however, the variability of the number of harmonics across frames and speakers restricts the use of the method. Therefore, only a selection of channels was used to hide data. The method was reported to be robust to coding algorithms including MP3 and OPUS (a lossy audio coding format developed by the Internet Engineering Task Force (IETF) that is particularly suitable for interactive real-time applications over the Internet). In order to quantify the performance difference between the received and transmitted RPS features, a mean square error criterion in the RPS domain was defined.

Dong et al. (2004) proposed a data hiding technique based on the phase modulation of audio signals. Two phase encoding methods were proposed: i) re-assigning the relative phases of selected frequency components in the audio signal, and ii) quantization index modulation, where a variable set of phase quantization steps was employed to hide data. The capacity of the method in terms of embedded data was reported as 20 kbit of data per minute. A solution to compensate for the phase discontinuity of the audio signal was proposed by enforcing the phase shift to reach zero at the frame boundaries, where phase discontinuity is most pronounced due to linear phase mismatch.

The human auditory system is sensitive to the relative phase difference between frequency components but not to their absolute phase values. Liew and Armand (2007) exploited this property in a speech watermarking method that embeds the watermark data in the phase of randomly chosen low-amplitude frequency components. The watermark data should be embedded in such a way that the changes in the phase between frequencies remain insignificant and thus perceptually inaudible. A variable frame strategy was used to improve the robustness of the method against synchronization attacks such as time-scaling. An algorithm for covert digital audio watermarking has been proposed (Kuo et al., 2002) that relies on the perceptual insignificance of long-term multi-band phase modulation. The method achieved a data rate of 20–30 bits/s and was shown to be highly robust to coding methods such as Moving Picture Experts Group Advanced Audio Coding (MPEG-2/MPEG-4 AAC, 1998).
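To make the quantization index modulation idea of Dong et al. (2004) concrete, the following toy sketch embeds and detects one bit in a single phase value; the fixed quantization step is an arbitrary choice of ours and not a value from the cited work.

```python
import numpy as np

def embed_bit_qim(phase, bit, delta=np.pi / 8):
    """Embed one bit in a phase value via quantization index modulation.

    Bit 0 snaps the phase to the even grid {k*delta}; bit 1 snaps it to
    the odd grid {k*delta + delta/2}.
    """
    offset = 0.0 if bit == 0 else delta / 2
    return np.round((phase - offset) / delta) * delta + offset

def detect_bit_qim(phase, delta=np.pi / 8):
    """Decode by choosing the grid with the smaller quantization error."""
    e0 = abs(phase - embed_bit_qim(phase, 0, delta))
    e1 = abs(phase - embed_bit_qim(phase, 1, delta))
    return 0 if e0 <= e1 else 1

marked = embed_bit_qim(0.3, bit=1)
print(detect_bit_qim(marked))  # -> 1
```

A smaller step delta makes the watermark less audible but also less robust to phase noise, which mirrors the capacity/robustness trade-off discussed above.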
4.4. Speech quality estimation
In this section, we focus on the use of phase information in predicting the perception of speech quality. This is important in the design of high-quality speech communication systems. The perceived quality is often characterized by the naturalness, clarity, or brightness of the speech signal. In contrast, speech intelligibility refers to the number of correctly recognized words in a speech utterance. For recent reviews, we refer to Möller et al. (2011) (on speech quality) and Kleijn et al. (2015) (on speech intelligibility).

Conventional instrumental measures commonly used to evaluate speech quality fall into three groups: (1)
SNR-based methods, including global SNR (GSNR) (Deller et al., 2000), frequency-weighted segmental SNR (fwSNRseg) (Tribolet et al., 1978), and segmental SNR (SSNR) (Loizou, 2007); (2) speech coding-based measures, including perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), log-likelihood ratio (LLR) (Gray and Markel, 1976), cepstral distance (CEPS) (Kitawaki et al., 1988), and Itakura-Saito distance (ISa) (Itakura and Saito, 1970); and (3) source separation-based measures, including the blind source separation evaluation toolbox (BSS EVAL) (Vincent et al., 2006), comprising signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR) and signal-to-artifact ratio (SAR), and the perceptual evaluation methods for audio source separation (PEASS) (Emiya et al., 2011). Conventional instrumental measures used to evaluate speech intelligibility are: the speech intelligibility index (SII) (ANSI S3.5-1997, 1997), SNR loss (Ma and Loizou, 2011), coherence SII (CSII) (Kates and Arehart, 2004), the normalized covariance measure (NCM) (Holube and Kollmeier, 1996), mutual information k-nearest neighbor (MIKNN) (Taghia and Martin, 2014), speech intelligibility prediction based on mutual information (SIMI) (Jensen and Taal, 2014), the short-time objective intelligibility measure (STOI) (Taal et al., 2011), the DAU model (Dau et al., 1997), and the glimpsing model (Cooke, 2006). Most of these conventional measures either take no phase information into account, or their prediction performance on phase-modified speech signals has not been studied.

Although many instrumental measures have been defined and standardized for the performance evaluation of speech processing algorithms, their use is limited to the conditions in which they have been tested or recommended (see, e.g., Loizou, 2013, Ch. 11 or Quackenbush et al., 1988 for a detailed list of the conventional objective measures). Because many of these measures rely on a distortion metric defined between the spectral amplitude of a reference signal (the desired signal) and that of the estimated signal (the processed signal), it is not clear how reliable the conventional measures are when used to predict the performance of a phase-aware speech processing method. Therefore, in this section of the current overview, we present some highlights of the recent advances made towards speech quality estimation of enhanced speech when phase-aware signal processing is applied. The importance of phase information in speech quality evaluation can be studied by addressing the following two questions:

1. How reliable are the conventional instrumental measures in predicting the performance of signal enhancement methods when both spectral amplitude and phase are modified?
2. Is it possible to improve the reliability of quality/intelligibility prediction by considering new measures (termed phase-aware instrumental metrics) in which distortions due to spectral amplitude and phase are both taken into account?
Gaich and Mowlaee suggested a list of phase-aware instrumental measures to predict speech quality and studied their usefulness for phase-aware speech processing, in terms of perceived quality (Gaich and Mowlaee, 2015b) and speech intelligibility (Gaich and Mowlaee, 2015a). The candidate phase-aware measures included in the studies were: group delay deviation (GDD), instantaneous frequency deviation (IFD), the unwrapped mean square error (for small estimation errors it closely resembles the squared-error distortion measure, since 1 − cos(φ̂_x(k, l) − φ_x(k, l)) ≈ (φ̂_x(k, l) − φ_x(k, l))²/2 for |φ̂_x(k, l) − φ_x(k, l)| ≪ 1), and phase deviation (PD). They also proposed two other measures focused on quantifying the phase estimation error in the unwrapped domain: unwrapped harmonic phase SNR (UnHPSNR) and unwrapped root mean square estimation error (UnRMSE).

Gaich and Mowlaee (2015b) studied the correlation between the existing instrumental measures for speech quality and subjective listening results for phase-aware enhanced speech signals. They reported that the phase deviation metric shows the highest correlation (ρ = 0.92) on average, followed by PESQ (ρ = 0.9). Furthermore, the PD metric performed most reliably because it showed a more stable correlation coefficient across noise types. Gaich and Mowlaee repeated their investigation in Gaich and Mowlaee (2015a) for speech intelligibility, where they reported that the coherence speech intelligibility index (CSII) measure for the medium and low levels, followed by STOI, led to the highest correlation with the results of the subjective listening tests.

These observations indicate that more studies are needed to develop new instrumental metrics for quantifying the achievable performance of a noise reduction or source separation method that modifies both amplitude and phase spectra. The conventional measures of BSS-EVAL (Vincent et al., 2006), including signal-to-distortion ratio, signal-to-interference ratio and signal-to-artifact ratio, and their follow-up measures (Emiya et al., 2011) could be adapted into phase-sensitive versions in which both amplitude and phase distortions are taken into account. Finding a reliable predictor for quality and intelligibility evaluation will also contribute to revising the cost function used to optimize the signal processing stage so that it best matches the subjective listening results.
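As an illustration of the wrapping-insensitive error measure in the unwrapped-phase family above, the following sketch (our own naming) verifies numerically that the one-minus-cosine distortion approaches half the mean squared phase error for small errors.

```python
import numpy as np

def one_minus_cos_distortion(phi_ref, phi_est):
    """Phase distortion averaged over all bins, invariant to 2*pi wraps.

    For small errors e, 1 - cos(e) ~ e**2 / 2, so the measure behaves
    like half the mean squared error while ignoring phase wrapping.
    """
    return np.mean(1.0 - np.cos(np.asarray(phi_est) - np.asarray(phi_ref)))

rng = np.random.default_rng(0)
phi = rng.uniform(-np.pi, np.pi, 1024)   # reference phases
err = 0.05 * rng.standard_normal(1024)   # small estimation errors
print(one_minus_cos_distortion(phi, phi + err), np.mean(err**2) / 2)
# -> the two values are nearly identical for small errors
```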
4.5. Speaker and speech recognition

In the following, we consider two important applications in speech technology, addressing the advances made towards incorporating phase information in the feature extraction stage.

4.5.1. Automatic Speech Recognition (ASR)

It has been shown that phase information is important in human speech recognition (Guangji et al., 2006). When working with the phase spectrum, commonly computed from the short-time Fourier transform (STFT), several practical issues must be considered. These practicalities include compensating for the inter-frame time step, the effect of the window function, and the lack of a common temporal origin when comparing phase spectra
over different sequences. Furthermore, it is important to select processing parameters, such as frame size and window type, that are well suited to phase analysis, rather than naively applying parameters that work well for amplitude spectrum estimation (Alsteris and Paliwal, 2007b; McCowan et al., 2011).

The phase spectrum has long been used for other applications such as pitch and formant extraction (Charpentier, 1986; Gowda et al., 2013b; Murthy and Yegnanarayana, 1991a), and phase information has been found to be useful in epoch extraction and recognition (Vijayan and Murty, 2014). Delta-phase parameters, as group delay features, are treated as a long-term analysis of phase in Yamamoto et al. (2010), where experiments on the Japanese version of the AURORA-2 database report improved word recognition accuracy for phase-based features over magnitude-only features in high-SNR conditions. Spectral subtraction in its basic form ignores the phase difference between speech and noise frequency components; to obtain better word recognition rates, the correlation between speech and noise can be accounted for, as in Kitaoka and Nakagawa (2002). In Kleinschmidt et al. (2011), a complex spectral subtraction method was used to benefit from phase information in noise-robust automatic speech recognition. The robustness of group delay-based features against additive or convolutive noise, and solutions to circumvent their sensitivity to noise, have been studied for speaker recognition (Gowda et al., 2013a; Parthasarathi et al., 2011).

Feeding a raw speech signal to a deep neural network or a convolutional neural network is an implicit feature extraction procedure in which amplitude and phase information are both utilized (Golik et al., 2015; Tüske et al., 2014). Finally, features derived from the Hilbert transform have been considered as a way to utilize both amplitude and phase information in a unified way for speech recognition (Loweimi et al., 2013).

As phase-aware features, the modified group delay and the instantaneous frequency (IF), the frequency and time derivatives of the spectral phase, have been used for ASR (Alsteris, 2006; Alsteris and Paliwal, 2007b). With appropriate windowing, smoothed group delay functions can be computed, as shown in Bozkurt et al. (2004b). Accordingly, by employing glottal closure instants (GCIs), Bozkurt and Couvreur proposed the group delay of GCI-synchronously windowed speech (GDGCI) for ASR (Bozkurt and Couvreur, 2005). An alternative way to smooth the group delay is to use the chirp group delay of the zero-phase version of the signal (CGDZP), in which all zeros occur very close to the unit circle, so that well-resolved formant peaks are obtained. Saratxaga et al. (2010) used harmonic phase information in the form of relative phase shifts (RPS) and showed improved ASR performance compared to the classical MFCC parameterization.
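To make the group delay computation concrete, the following sketch computes a modified group delay function for one windowed frame, in the spirit of Hegde et al. (2007b); we substitute simple median smoothing for the cepstral smoothing used in the literature, and the parameter values are merely typical choices, not values from the cited works.

```python
import numpy as np
from scipy.signal import medfilt

def modified_group_delay(x, alpha=0.4, gamma=0.9, eps=1e-8):
    """Modified group delay of one windowed frame (a sketch).

    tau(w) = (X_R*Y_R + X_I*Y_I) / S(w)**(2*gamma), compressed by
    sign(tau) * |tau|**alpha, where Y is the DFT of n*x[n] and S is a
    smoothed magnitude spectrum.
    """
    n = np.arange(len(x))
    X = np.fft.rfft(x)
    Y = np.fft.rfft(n * x)
    num = X.real * Y.real + X.imag * Y.imag
    S = medfilt(np.abs(X), kernel_size=21) + eps  # smoothed spectrum
    tau = num / S**(2 * gamma)
    return np.sign(tau) * np.abs(tau)**alpha
```

The smoothing and the (alpha, gamma) compression are what make the function robust against the spikes caused by zeros close to the unit circle, which is the same difficulty that motivates the GCI-synchronous and chirp group delay variants discussed above.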
4.5.2. Speaker recognition

Most of the automatic speech and speaker recognition systems are built upon short-term spectro-temporal feature representations, typically calculated from the amplitude spectrum
(Davis and Mermelstein, 1980). The amplitude spectrum can be calculated as the magnitude of a complex Fourier transform or via other parametric and non-parametric spectrum estimation methods, including linear prediction, multi-tapering and their variants (Kinnunen et al., 2012; Makhoul, 1975; Saeidi et al., 2010). The Mel-frequency cepstral coefficients are among the most popular features, derived by applying a perceptually weighted filter bank to the amplitude spectrum.

In anti-spoofing for speaker recognition systems, phase consistency is used as a countermeasure for synthetic speech detection (De Leon et al., 2011b; Wu et al., 2015a). Several phase-aware techniques for anti-spoofing in speaker recognition were studied in Wu et al. (2015b). Wang et al. (2015) considered relative phase information to discriminate natural human speech from machine-generated spoofed speech. Xiao et al. (2015) studied five types of phase-based features for spoofed speech detection: group delay, modified group delay, instantaneous frequency deviation, baseband phase difference, and pitch-synchronous phase. Liu et al. (2015) proposed several spoofing countermeasures based on magnitude and phase information for speaker verification.

The robustness of features derived from the group delay against transmission channel and ambient noise has been investigated (Hegde et al., 2007b; Parthasarathi et al., 2011; Yegnanarayana and Murthy, 1992). The modified group delay (MGD) function followed by the discrete cosine transform (DCT) has been used (Hegde et al., 2004) for a comparative evaluation of speaker recognition on the TIMIT and noisy TIMIT corpora. When calculating the MGDF, it has been assumed that the speech signal is minimum phase.

Acoustic features based on the AM-FM representation of the speech signal have been included in different recognition applications. The AM-FM model is generally used to describe a band-pass speech signal by its envelope (instantaneous amplitude) and phase (instantaneous frequency). Hence, the speech signal is first passed through a filterbank (Gowda et al., 2015; Peddinti and Hermansky, 2013; Sadjadi and Hansen, 2015; Schädler et al., 2012; Tyagi and Wellekens, 2006) to decompose it into narrow-band signals that can be approximated locally by a single sinusoid. Features derived from the analytic phase of the speech signal have been used for robust speaker recognition (Vijayan et al., 2014b). By taking the derivative of the instantaneous phase, instantaneous frequency cepstral coefficients (IFCC) are derived from smoothed sub-band instantaneous frequencies. Similarly, amplitude-weighted instantaneous frequency (AWIF) features (Grimaldi and Cummins, 2008) have been built upon a characterization based on speech signal pyknograms (density plots of the frequencies present in the input signal) (Potamianos and Maragos, 1995). Instantaneous frequency (and its deviation), along with the delta-phase spectrum, has been used for feature extraction in order to account for the large variability of phase caused by the starting point of the analysis window (McCowan et al., 2011; Stark and Paliwal, 2008).
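As a minimal illustration of how a sub-band instantaneous frequency can be obtained via the analytic signal, consider the following sketch; the filter order and the use of a Butterworth band-pass are our assumptions, not choices from the cited works.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def subband_instantaneous_frequency(x, fs, band):
    """Instantaneous frequency (Hz) of one narrow band (a sketch).

    The band-pass signal is made analytic via the Hilbert transform;
    the IF is the time derivative of its unwrapped instantaneous phase.
    band is (low_hz, high_hz).
    """
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    analytic = hilbert(sosfiltfilt(sos, x))
    phase = np.unwrap(np.angle(analytic))
    return np.diff(phase) * fs / (2 * np.pi)
```

In an IFCC-style front end, such per-band IF trajectories would be smoothed and then decorrelated (e.g., with a DCT) to form the feature vector.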
All-pass modeling of the LP spectrum has been considered for speaker recognition (Murty et al., 2012). However, the residual phase feature alone does not characterize the complete phase information in the speech signal. A parametric model of the time-domain phase was proposed in Vijayan et al. (2014a), where all-pass cepstral coefficients (APCC) were calculated from the time-domain phase after normalizing out the amplitude spectrum. A common way to utilize phase-derived features is to directly combine the features derived from amplitude and phase individually (Schlüter and Ney, 2001). Amplitude and phase information have been combined at the feature or score level (Murty and Yegnanarayana, 2006; Nakagawa et al., 2007; 2012; Wang et al., 2010). Features derived from the Hilbert transform have also been considered as a way to utilize both amplitude and phase information in a unified way for speaker recognition (Murty and Yegnanarayana, 2006; Sadjadi and Hansen, 2011).
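A minimal sketch of the score-level combination mentioned above, assuming per-system z-normalization and a convex weight tuned on development data (both our assumptions), is the following.

```python
import numpy as np

def fuse_scores(score_mag, score_phase, w=0.7):
    """Score-level fusion of magnitude- and phase-based recognizers.

    Each system's scores are z-normalized before a convex combination;
    the weight w would be tuned on held-out development data.
    """
    def znorm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean()) / (s.std() + 1e-12)

    return w * znorm(score_mag) + (1 - w) * znorm(score_phase)
```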
4.6. Speech synthesis
As in other areas of speech technology where the target listener is human (e.g., speech enhancement), text-to-speech synthesis also benefits from phase-aware signal processing. Currently, there are three main approaches to speech synthesis: a) unit selection, b) statistical parametric speech synthesis, and c) hybrid approaches. Among these, unit selection makes the lowest demands on signal processing. The hybrid approach requires some signal processing, especially for feature extraction from the speech signal. During synthesis, hybrid approaches may not use these features to generate the final synthetic speech signal, since the approach is then similar to the concatenation of units selected using the same mechanism as in the unit selection approach. However, there are hybrid systems that use units generated from the acoustic model features to synthesize parts of the final synthetic speech. Finally, statistical parametric speech synthesis systems currently place heavy demands on signal processing, both for speech analysis (speech modeling) and during synthesis, where the parameters of the speech model (given a text) are generated by maximizing the likelihood of a probabilistic acoustic model learned during the training of the system.

Although unit selection-based approaches provide high-quality synthetic speech, they are not flexible in modifying the emotion of the generated speech or in creating a customized voice without requiring many new recordings. Statistical parametric speech synthesis systems, on the other hand, are quite flexible, but the synthetic speech quality they generate is currently not as good as that of unit selection. Recent advances in statistical parametric speech synthesis using deep learning for the acoustic model show, however, that the quality of synthetic speech from this approach can be improved and become competitive with the unit selection approach (Zen and Sak, 2015). Hybrid systems lie between these two approaches, both in terms of resulting speech quality and flexibility. We show below that, independent of the approach taken for speech synthesis, phase information is important either for increasing quality or for enabling flexibility.
Although unit selection does not use many speech models, it has been shown that pre-processing the units in the speech database can facilitate their concatenation during synthesis. Units are usually phones or even half-phones (Beutnagel et al., 1999). Time alignment of these units during concatenation is important to remove linear phase mismatches during synthesis, which are perceived as annoying clicks. Cross-correlation can be used during synthesis to alleviate the linear phase problem, but at the expense of an increased computational cost. Another solution is to mark the glottal closure instants (GCIs) in the units and then align the units based on these time instants. However, since GCI detection is prone to errors, especially for emotional data and for speakers with high pitch, and since the databases used in this approach are large, such an approach may still result in poor linear phase alignment. In Stylianou (2001), the audible clicks due to linear phase mismatches at the frame boundaries were studied, and a common reference point, referred to as the center of gravity, was defined for all speech frames (usually two pitch periods long). It was shown that the center of gravity of each frame can be simply approximated by the phase of the first harmonic component of the signal (Stylianou, 2001). A simpler model relying on linear phase subtraction from all harmonic phases was shown to work very well for pre-processing the frames in the speech database, synthesizing speech without linear phase mismatches and without the need to estimate GCIs or compute cross-correlation functions during synthesis (a sketch of this idea is given below). As a proof-of-concept example, Fig. 6 shows speech frames (two pitch periods long) from various instances in the database before and after alignment using the linear phase correction mechanism described above. The left column of the figure shows the different positions of the analysis window before linear phase correction, while the right column shows them after linear phase correction. For further details we refer to Stylianou (2001).

To increase the flexibility of unit selection-based text-to-speech synthesis systems, techniques from voice conversion could be of advantage (Stylianou et al., 1998). In that case, however, speech modeling is required. For a successfully perceived converted speaker, both in the sense of similarity to the target speaker and of quality of the converted signal, approaches using phase information, as mentioned in the previous sections on the relationship between amplitude and phase and on speaker identification, should be considered.
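A minimal sketch of the linear phase subtraction idea of Stylianou (2001), assuming the frame is described by its harmonic phases and that the first-harmonic phase approximates the center of gravity, is the following.

```python
import numpy as np

def remove_linear_phase(harmonic_phases):
    """Align a frame by removing a linear phase term (a sketch).

    A linear phase shift contributes k * theta_lin to the k-th harmonic.
    Using the first-harmonic phase as an estimate of that shift and
    subtracting k * theta_1 from each harmonic phase gives a common
    reference point across frames, so concatenated frames exhibit no
    linear phase mismatch (no audible clicks).
    """
    theta = np.asarray(harmonic_phases, dtype=float)
    k = np.arange(1, len(theta) + 1)       # harmonic indices 1..K
    aligned = theta - k * theta[0]
    return np.angle(np.exp(1j * aligned))  # wrap back to (-pi, pi]
```

Applied as a pre-processing step to every frame in the database, this yields the kind of alignment illustrated in Fig. 6.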
Hybrid systems require more signal processing than unit selection-based systems, especially during training. Spectral magnitude-based information is often used in that case (e.g., cepstral representations), and phase information is not used. During synthesis, spectral magnitude information is used to select units from a speech database, in a similar way as in unit selection systems. However, there are instances where the speech signal must be synthesized from the generated spectral magnitude and fundamental frequency information. The fundamental frequency is used to create the linear phase term (as shown in (12)), while the vocal tract phase in (12) is computed from the magnitude spectrum. This results in a phase that consists of a minimum phase term plus a linear phase term, with the dispersion phase part missing (following the harmonic model with phase decomposition presented in Eq. (12)), and hence in a synthesized signal with a buzzy perceived quality.
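The minimum phase term mentioned above can be obtained from the magnitude spectrum by the standard homomorphic (cepstral folding) construction; the following sketch shows one way to do this, assuming an even FFT length.

```python
import numpy as np

def minimum_phase_from_magnitude(mag, n_fft):
    """Minimum phase consistent with a given magnitude spectrum (a sketch).

    The real cepstrum of log|X| is folded (causal part doubled) and
    transformed back; the imaginary part of the result is the minimum
    phase. mag holds the rfft magnitude bins; n_fft must be even.
    """
    log_mag = np.log(np.maximum(mag, 1e-10))
    cep = np.fft.irfft(log_mag, n_fft)           # real cepstrum
    fold = np.zeros_like(cep)
    fold[0] = cep[0]
    fold[1:n_fft // 2] = 2 * cep[1:n_fft // 2]   # double the causal part
    fold[n_fft // 2] = cep[n_fft // 2]
    return np.imag(np.fft.rfft(fold))            # minimum phase (rad)
```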
Fig. 6. Phase correction based on the center of gravity method proposed in Stylianou (2001). The different positions of the analysis window are displayed for (left) before and (right) after phase correction. The frames after phase correction are aligned.
In terms of signal processing, this buzzy quality is due to a highly unnatural correlation between the generated speech samples. To overcome this problem, the band aperiodicity (BAP) information can be used (Zen et al., 2006). This is mainly an SNR measurement in certain frequency bands; BAP is used to insert noise in some frequency bands, which has the effect of reducing the correlation and therefore the buzziness. In essence, this is an ad-hoc remedy for not being able to model the dispersion phase in (12) properly. In order to increase the flexibility of hybrid systems, the same issues as for the unit selection-based systems, mentioned above, need to be addressed, and phase information will again play an important role in improving current results.

Finally, statistical parametric speech synthesis systems face the same problems as the hybrid ones. In this case, however, all the generated parameters are used during synthesis for synthesizing a speech signal. Therefore, the buzziness problems mentioned above are more frequent and even more pronounced than in the hybrid systems. These observations have encouraged many researchers in speech synthesis to look for new ways to model phase information. It has been widely accepted that better phase
models will boost the quality of synthetic speech, in particular in current statistical parametric speech synthesis systems. Towards that goal, Maia et al. (2012) proposed a complex cepstrum approach that models amplitude and phase information together. In more recent work on complex cepstra, Maia and Stylianou (2014) studied complex cepstrum-based speech factorization for acoustic modeling in statistical parametric synthesizers. Throughout the investigation of various complex cepstrum factorization approaches, it was found that the factorization into minimum phase and all-pass filters provided the best results in terms of the quality of the synthetic speech. More recently, Agiomyrgiannakis (2015) presented an AM-FM sinusoidal-based approach for statistical parametric speech synthesis, where manipulation of the phase spectrum was shown to be critical to boost the quality of the synthetic speech.

All the above observations show that, for speech synthesis, phase-aware processing is crucial for improving the quality of synthetic speech and for building highly flexible speech synthesis systems. However, optimal modeling of phase information in speech synthesis is still an active research topic.

5. Conclusions and future work
In this paper, we presented an overview of the earlier and more recent advances made towards incorporating phase information in the processing of speech signals. We exemplified
the phase-aware signal processing contributions made by researchers in diverse fields of speech processing applications, including speech analysis, speaker and speech recognition, speech enhancement, source separation, speech coding, watermarking, and speech synthesis. Throughout this review, it has been shown that phase awareness has a positive impact on the performance conventionally achieved by amplitude-only signal processing methods over the last decades. With the increasing interest of the community in phase-aware signal processing for speech communication, we expect synergies to arise among the researchers actively working on phase-aware signal processing in different applications of speech communication. This work aimed to unify the scattered evidence in the literature on phase-aware signal processing, where researchers have individually incorporated phase awareness into different applications. In the future, works following up on the current special issue could benefit from an improved overall understanding of the existing proposals for tackling the controversial topic of phase processing. From a practical viewpoint, the several examples of the positive impact of phase-aware signal processing in different speech applications presented here provide the big picture of how the current state of knowledge can serve as a starting reference for new proposals in the emerging field of phase-aware speech signal processing, with a high potential for pushing the limits of the conventional state-of-the-art methods.
Acknowledgment
The work of Pejman Mowlaee was supported by the Austrian Science Fund (project number P28070-N33). The work of Rahim Saeidi was supported by the Academy of Finland (project numbers 256961 and 284671).
References
Aarabi, P., Shi, G., 2004. Phase-based dual-microphone robust speech enhancement. IEEE Trans. Syst. Man Cybern. Part B Cybern. 34 (4), 1763–1773. Aarabi, P., Shi, G., Modir Shanechi, M., Rabi, S.A., 2006. Phase-Based Speech Processing. World Scientific Publishing. Agiomyrgiannakis, Y., 2015. VOCAINE the vocoder and applications in speech synthesis. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4230–4234. Agiomyrgiannakis, Y., Stylianou, Y., 2009. Wrapped Gaussian mixture models for modeling and high-rate quantization of phase data of speech. IEEE Trans. Audio Speech Lang. Process. 17 (4), 775–786. Alsteris, L., 2006. Short-Time Phase Spectrum in Human and Automatic Speech Recognition. Griffith University, Brisbane, Australia. Ph.D. thesis. Alsteris, L.D., Paliwal, K.K., 2004a. ASR on speech reconstructed from short-time Fourier phase spectra. In: Proceedings of the ISCA Interspeech, pp. 1–4. Alsteris, L.D., Paliwal, K.K., 2004b. Importance of window shape for phase-only reconstruction of speech. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 573–576.
Alsteris, L.D., Paliwal, K.K., 2006. Further intelligibility results from human listening tests using the short-time phase spectrum. Speech Commun. 48 (6), 727–736. Alsteris, L.D., Paliwal, K.K., 2007a. Iterative reconstruction of speech from short-time Fourier transform phase and magnitude spectra. Elsevier Comput. Speech Lang. 21 (1), 174–186. Alsteris, L.D., Paliwal, K.K., 2007b. Short-time phase spectrum in speech processing: A review and some experimental results. Digital Signal Process. 17 (3), 578–616. ANSI S3.5-1997, 1997. Methods for Calculation of the Speech Intelligibility Index. American National Standard. Beutnagel, M., Conkie, A., Schroeter, J., Stylianou, Y., Syrdal, A., 1999. The AT&T next-gen TTS system. In: Proceedings of the Joint Meeting of ASA, EAA, and DAGA, pp. 18–24. Boashash, B., 1992. Estimating and interpreting the instantaneous frequency of a signal part i: fundamentals. Proc. IEEE 80 (4), 520–538. Boldt, J.B., Pedersen, M.S., Kjems, U., Christensen, M.G., Jensen, S.H., 2010. Error-correction of binary masks using hidden Markov models. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4722–4725. Bozkurt, B., Couvreur, L., 2005. On the use of phase information for speech recognition. In: Proceedings of the European Signal Processing Conference, pp. 1–4. Bozkurt, B., Couvreur, L., Dutoit, T., 2007. Chirp group delay analysis of speech signals. Speech Commun. 49 (3), 159–176. Bozkurt, B., Doval, B., D'Alessandro, C., Dutoit, T., 2004a. Appropriate windowing for group delay analysis and roots of z-transform of speech signals. In: Proceedings of the European Signal Processing Conference, pp. 733–736. Bozkurt, B., Doval, B., D'Alessandro, C., Dutoit, T., 2004b. Appropriate windowing for group delay analysis and roots of z-transform of speech signals. In: Proceedings of the European Signal Processing Conference, pp. 733–736. Brandstein, M., Monta, P., Hardwick, J., Lim, J., 1990. A real-time implementation of the improved MBE speech coder. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5–8. Brandstein, M., Ward, D. (Eds.), 2001. Microphone Arrays: Signal Processing Techniques and Applications. Springer. Breithaupt, C., Krawczyk, M., Martin, R., 2008. Parameterized MMSE spectral magnitude estimation for the enhancement of noisy speech. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4037–4040. Bronson, J., Depalle, P., 2014a. Phase constrained complex NMF: separating overlapping partials in mixtures of harmonic musical sources. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7475–7479. Bronson, J., Depalle, P., 2014b. Phase constrained complex NMF: separating overlapping partials in mixtures of harmonic musical sources. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7475–7479. Charpentier, F., 1986. Pitch detection using the short-term phase spectrum. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 113–116. Cohen, L., 1989. Time-frequency distributions - a review. Proc. IEEE 77 (7), 941–981. Cooke, M., 2006. A glimpsing model of speech perception in noise. J. Acoust. Soc. Am. 119, 1562–1573. Cooke, M., Barker, J., Cunningham, S., Shao, X., 2006.
An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120, 2421–2424. Dau, T., Kollmeier, B., Kohlrausch, A., 1997. Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am. 102, 2892–2905. Davis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Audio Speech Lang. Process. 28 (4), 357–366.
De Leon, P.L., Hernaez, I., Saratxaga, I., Pucher, M., Yamagishi, J., 2011a. Detection of synthetic speech for the problem of imposture. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4844–4847. De Leon, P.L., Hernaez, I., Saratxaga, I., Pucher, M., Yamagishi, J., 2011b. Detection of synthetic speech for the problem of imposture. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4844–4847. Degottex, G., Erro, D., 2014a. A measure of phase randomness for the harmonic model in speech synthesis. In: Proceedings of the ISCA Interspeech. Singapore, pp. 1638–1642. Degottex, G., Erro, D., 2014b. A uniform phase representation for the harmonic model in speech synthesis applications. EURASIP J. Audio Speech Music Process. 2014 (1), 38. Degottex, G., Godoy, E., Stylianou, Y., 2012. Identifying tenseness of Lombard speech using phase distortion. In: Proceedings of the Listening Talker: An Interdisciplinary Workshop on Natural and Synthetic Modification of Speech in Response to Listening Conditions. Edinburgh, UK. Degottex, G., Obin, N., 2014. Phase distortion statistics as a representation of the glottal source: Application to the classification of voice qualities. In: Proceedings of the ISCA Interspeech. Singapore, pp. 1633–1637. Degottex, G., Roebel, A., Rodet, X., 2011a. Function of phase-distortion for glottal model estimation. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4608–4611. Degottex, G., Roebel, A., Rodet, X., 2011b. Phase minimization for glottal model estimation. IEEE Trans. Audio Speech Lang. Process. 19 (5), 1080–1090. Deleforge, A., Kellermann, W., 2015. Phase-optimized K-SVD for signal extraction from underdetermined multi-channel sparse mixtures. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 355–359. Deller, J.R.J., Hansen, J.H.L., Proakis, J.G., 2000. Discrete-Time Processing of Speech Signals. An IEEE Press classic reissue. Wiley. Ding, Y., Qian, X., 1997. Processing of musical tones using a combined quadratic polynomial-phase sinusoid and residual (quasar) signal model. J. Audio Eng. Soc. 45 (7), 571–584. Doclo, S., Kellermann, W., Makino, S., Nordholm, S., 2015. Multichannel signal enhancement algorithms for assisted listening devices: Exploiting spatial diversity using multiple microphones. IEEE Signal Process. Mag. 32 (2), 18–30. Dolson, M., 1986. The phase vocoder: a tutorial. Comput. Music J. 10 (4), 14–27. Dong, X., Bocko, M.F., Ignjatovic, Z., 2004. Data hiding via phase manipulation of audio signals. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 377–380. Donglai, Z., Paliwal, K.K., 2004. Product of power spectrum and group delay function for speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 125–128. Drugman, T., Stylianou, Y., 2015. Fast and accurate phase unwrapping. In: Proceedings of the ISCA Interspeech, pp. 1171–1175. Emiya, V., Vincent, E., Harlander, N., Hohmann, V., 2011. Subjective and objective quality assessment of audio source separation. IEEE Trans. Audio Speech Lang. Process. 19 (7), 2046–2057. Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans.
Audio Speech Lang. Process. 32 (6), 1109–1121. Ephraim, Y., Malah, D., 1985. Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Trans. Audio Speech Lang. Process. 33, 443–445. Erdogan, H., Hershey, J.R., Watanabe, S., Le Roux, J., 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Evans, M., Hastings, N., Peacock, B., 2000. von Mises Distribution. Wiley & Sons, New York. chapter 45.
Ewert, S., Plumbley, M.D., Sandler, M., 2014. Accounting for phase cancellations in non-negative matrix factorization using weighted distances. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 649–653. Gaich, A., Mowlaee, P., 2015a. On speech intelligibility estimation of phase-aware single-channel speech enhancement. In: Proceedings of the ISCA Interspeech, pp. 2553–2557. Gaich, A., Mowlaee, P., 2015b. On speech quality estimation of phase-aware single-channel speech enhancement. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 216–220. Gerkmann, T., 2014. Bayesian estimation of clean speech spectral coefficients given a priori knowledge of the phase. IEEE Trans. Signal Process. 62 (16), 4199–4208. Gerkmann, T., Krawczyk, M., 2013. MMSE-optimal spectral amplitude estimation given the STFT-phase. IEEE Signal Process. Lett. 20 (2), 129–132. Gerkmann, T., Krawczyk, M., Le Roux, J., 2015. Phase processing for single-channel speech enhancement: History and recent advances. IEEE Signal Process. Mag. 32 (2), 55–66. Glasberg, B.R., Moore, B.C.J., 1990. Derivation of auditory filter shapes from notched-noise data. Hearing Research 47 (1–2), 103–138. Golik, P., Tüske, Z., Schlüter, R., Ney, H., 2015. Convolutional neural networks for acoustic modeling of raw time signal in LVCSR. In: Proceedings of the ISCA Interspeech. Dresden, Germany. Gonzalez, S., Brookes, M., 2014. PEFAC - a pitch estimation algorithm robust to high levels of noise. IEEE Trans. Audio Speech Lang. Process. 22 (2), 518–530. Gotoh, S., Kazama, M., Tohyama, M., 2005. Speech intelligibility estimation based on phase correlation index. In: On-line Workshop on Measurement of Speech and Audio Quality in Networks (MESAQIN). Gowda, D., Pohjalainen, J., Alku, P., Kurimo, M., 2013a. Robust spectral representation using group delay function and stabilized weighted linear prediction for additive noise degradations. In: Conference on Speech Technology and Human-Computer Dialogue, pp. 1–7. Gowda, D., Pohjalainen, J., Kurimo, M., Alku, P., 2013b. Robust formant detection using group delay function and stabilized weighted linear prediction. In: Proceedings of the ISCA Interspeech, pp. 49–53. Gowda, D., Saeidi, R., Alku, P., 2015. AM-FM based filter bank analysis for estimation of spectro-temporal envelopes and its application for speaker recognition in noisy reverberant environments. In: Proceedings of the ISCA Interspeech, pp. 1166–1170. Gray, A., Markel, J., 1976. Distance measures for speech processing. IEEE Trans. Audio Speech Lang. Process. 24 (5), 380–391. Griffin, D., Lim, J., 1984. Signal estimation from modified short-time Fourier transform. IEEE Trans. Audio Speech Lang. Process. 32 (2), 236–243. Grimaldi, M., Cummins, F., 2008. Speaker identification using instantaneous frequencies. IEEE Trans. Audio Speech Lang. Process. 16 (6), 1097–1111. Gu, L., 2010. Single-Channel Speech Separation Based on Instantaneous Frequency. Carnegie Mellon University, USA. Ph.D. thesis. Guangji, S., Shanechi, M.M., Aarabi, P., 2006. On the importance of phase in human speech recognition. IEEE Trans. Audio Speech Lang. Process. 14 (5), 1867–1874. Gunawan, D., Sen, D., 2010. Iterative phase estimation for the synthesis of separated sources from single-channel mixtures. IEEE Signal Process. Lett. 17 (5), 421–424. Harris, F.J., 1978. On the use of windows for harmonic analysis with the discrete Fourier transform. Proc.
IEEE 66 (1), 51–83. Hayes, M., Lim, J., Oppenheim, A.V., 1980. Signal reconstruction from phase or magnitude. IEEE Trans. Audio Speech Lang. Process. 28 (6), 672–680. Haykin, S., Liu, K.J.R., 2009. Handbook on Array Processing and Sensor Networks. IEEE, Hoboken, N.J. Wiley, Piscataway, NJ. Hegde, R.M., Murthy, H.A., Gadde, V.R.R., 2007a. Significance of the Modified Group Delay Feature in Speech Recognition. IEEE Trans. Audio Speech Lang. Process. 15 (1), 190–202. Hegde, R.M., Murthy, H.A., Gadde, V.R.R., 2007b. Significance of the modified group delay feature in speech recognition. IEEE Trans. Audio Speech Lang. Process. 15 (1), 190–202.
Hegde, R.M., Murthy, H.A., Rao, G.V.R., 2004. Application of the modified group delay function to speaker identification and discrimination. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 517–520. von Helmholtz, H., 1912. On the Sensations of Tone as a Physiological Basis for the Theory of Music. Longmans, Green. Hendriks, R.C., Gerkmann, T., Jensen, J., 2013. DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement. In: Synthesis Lectures on Speech and Audio Processing. Morgan & Claypool Publishers. Hernaez, I., Saratxaga, I., Ye, J., Sanchez, J., Erro, D., Navas, E., 2014. Speech watermarking based on coding of the harmonic phase. In: Advances in Speech and Language Technologies for Iberian Languages. In: Lecture Notes in Computer Science, vol. 8854, pp. 259–268. Hofbauer, K., Kubin, G., Kleijn, W.B., 2009. Speech watermarking for analog flat-fading bandpass channels. IEEE Trans. Audio Speech Lang. Process. 17 (8), 1624–1637. Holube, I., Kollmeier, B., 1996. Speech intelligibility prediction in hearing-impaired listeners based on a psychoacoustically motivated perception model. J. Acoust. Soc. Am. 100 (3), 1703–1716. Huang, P.S., Kim, M., Hasegawa-Johnson, M., Smaragdis, P., 2014. Deep learning for monaural speech separation. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 1562–1566. Itakura, F., Saito, S., 1970. A statistical method for estimation of speech spectral density and formant frequencies. Electron. Commun. Jpn. 53 (1), 36–43. ITU-R BS 1534-1, 2001. Method for the subjective assessment of intermediate quality level of coding systems. ITU-T, 1996. P.800: Methods for subjective determination of transmission quality - Series P: telephone transmission quality; methods for objective and subjective assessment of quality. ITU-T, 2001. Perceptual Evaluation of Speech Quality (PESQ). Recommendation P.862. International Telecommunication Union. Jensen, J., Taal, C.H., 2014. Speech intelligibility prediction based on mutual information. IEEE Trans. Audio Speech Lang. Process. 22 (2), 430–440. Kamakshi Prasad, V., Nagarajan, T., Murthy, H.A., 2004. Automatic segmentation of continuous speech using minimum phase group delay functions. Speech Commun. 42 (3–4), 429–446. Kameoka, H., Ono, N., Kashino, K., Sagayama, S., 2009. Complex NMF: A new sparse representation for acoustic signals. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 3437–3440. Kandia, V., Stylianou, Y., 2008. Detection of clicks based on group delay. Can. Acoust. 36 (1), 48–54. Special Issue on Marine Mammals Detection and Classification. Karam, Z.N., Oppenheim, A.V., 2007. Computation of the one-dimensional unwrapped phase. In: Proceedings of the International Conference on Digital Signal Processing, pp. 304–307. Kates, J., Arehart, K., 2004. Coherence and the speech intelligibility index. J. Acoust. Soc. Am. 115 (5), 2604. Kay, S.M., 1993. Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory. Prentice Hall. Kazama, M., Gotoh, S., Tohyama, M., Houtgast, T., 2010. On the significance of phase in the short term Fourier spectrum for speech intelligibility. J. Acoust. Soc. Am. 127 (3), 1432–1439. Kim, C., Kumar, K., Raj, B., Stern, R.M., 2009. Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain. In: Proceedings of the ISCA Interspeech, pp.
2495–2498. Kim, D.S., 2000. Perceptual phase redundancy in speech. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, pp. 1383–1386. Kim, D.S., 2003. Perceptual phase quantization of speech. IEEE Trans. Audio Speech Lang. Process. 11 (4), 355–364. King, B.J., Atlas, L., 2011. Single-channel source separation using complex matrix factorization. IEEE Trans. Audio Speech Lang. Process. 19 (8), 2591–2597.
Kinnunen, T., Saeidi, R., Sedlak, F., Lee, K., Sandberg, J., Hansson-Sandsten, M., Li, H., 2012. Low-variance multitaper MFCC features: a case study in robust speaker verification. IEEE Trans. Audio Speech Lang. Process. 20 (7), 1990–2001. Kitaoka, N., Nakagawa, S., 2002. Evaluation of spectral subtraction with smoothing of time direction on the Aurora2 task. In: Proceedings of the ISCA Interspeech, pp. 477–480. Kitawaki, N., Nagabuchi, H., Itoh, K., 1988. Objective quality evaluation for low-bit-rate speech coding systems. IEEE J. Sel. Areas Commun. 6 (2), 242–248. Kleijn, W.B., Crespo, J.B., Hendriks, R.C., Petkov, P., Sauert, B., Vary, P., 2015. Optimizing speech intelligibility in a noisy environment: A unified view. IEEE Signal Process. Mag. 32 (2), 43–54. Kleijn, W.B., Paliwal, K.K., 1995. Speech Coding and Synthesis. Elsevier, Amsterdam, New York. Kleinschmidt, T., Sridharan, S., Mason, M., 2011. The use of phase in complex spectrum subtraction for robust speech recognition. Elsevier Comput. Speech Lang. 25 (3), 585–600. Knapp, C., Carter, G.C., 1976. The generalized correlation method for estimation of time delay. IEEE Trans. Audio Speech Lang. Process. 24 (4), 320–327. Kondoz, A.M., 2006. Digital Speech: Coding for Low Bit Rate Communication Systems, second ed. John Wiley & Sons, Ltd. Koutsogiannaki, M., Simantiraki, O., Degottex, G., Stylianou, Y., 2014. The importance of phase on voice quality assessment. In: Proceedings of the ISCA Interspeech, pp. 1653–1657. Krawczyk, M., Gerkmann, T., 2012. STFT phase improvement for single channel speech enhancement. In: Proceedings of the International Workshop on Acoustic Signal Enhancement, pp. 1–4. Krawczyk, M., Gerkmann, T., 2014. STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement. IEEE Trans. Audio Speech Lang. Process. 22 (12), 1931–1940. Krawczyk, M., Rehr, R., Gerkmann, T., 2013. Phase-sensitive real-time capable speech enhancement under voiced-unvoiced uncertainty. In: Proceedings of the European Signal Processing Conference, pp. 1–5. Kubin, G., Atal, B.S., Kleijn, W.B., 1993. Performance of noise excitation for unvoiced speech. In: Proceedings of the IEEE Workshop on Speech Coding for Telecommunications, pp. 35–36. Kubin, G., Kleijn, W.B., 1999. On speech coding in a perceptual domain. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 205–208. Kulmer, J., Mowlaee, P., 2015a. Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5063–5067. Kulmer, J., Mowlaee, P., 2015b. Phase estimation in single channel speech enhancement using phase decomposition. IEEE Signal Process. Lett. 22 (5), 598–602. Kulmer, J., Mowlaee, P., Watanabe, M., 2014. A probabilistic approach for phase estimation in single-channel speech enhancement using von Mises phase priors. In: Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, pp. 1–6. Kumaresan, R., Rao, A., 1999. Model-based approach to envelope and positive instantaneous frequency estimation of signals with speech applications. J. Acoust. Soc. Am. 105 (3), 1912–1924. Kuo, S.S., Johnston, J.D., William, T., Quackenbush, S.R., 2002. Covert audio watermarking using perceptually tuned signal independent multiband phase modulation.
In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, pp. 1753–1756. Le Roux, J., Vincent, E., 2013. Consistent Wiener filtering for audio source separation. IEEE Signal Process. Lett. 20 (3), 217–220. Liew, P.Y., Armand, M.A., 2007. Inaudible watermarking via phase manipulation of random frequencies. Multimed. Tools Appl. 35 (3), 357–377. Liu, L., He, J., Palm, G., 1997. Effects of phase on the perception of intervocalic stop consonants. Speech Commun. 22 (4), 403–417.
Liu, Y., Tian, Y., Liang, H., Jia, L., Johnson, M.T., 2015. Simultaneous utilization of spectral magnitude and phase information to extract supervectors for speaker verification anti-spoofing. In: Proceedings of the ISCA Interspeech, pp. 2082–2086. Loizou, P., 2007. Speech Enhancement: Theory and Practice. CRC Press, Boca Raton. Loizou, P., 2013. Speech Enhancement: Theory and Practice, second ed. CRC Press, Boca Raton. Lotter, T., Vary, P., 2005. Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model. EURASIP J. Adv. Signal Process. 2005 (7), 1110–1126. Loweimi, E., Ahadi, S.M., Drugman, T., 2013. A new phase-based feature representation for robust speech recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7155–7159. Ma, J., Loizou, P.C., 2011. SNR Loss: A new objective measure for predicting the intelligibility of noise-suppressed speech. Speech Commun. 53 (3), 340–354. Magron, P., Badeau, R., David, B., 2015a. Phase recovery in NMF for audio source separation: an insightful benchmark. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 81–85. Magron, P., Badeau, R., David, B., 2015b. Phase reconstruction of spectrograms with linear unwrapping: application to audio signal restoration. In: Proceedings of the European Signal Processing Conference. Maia, R., Akamine, M., Gales, M.J.F., 2012. Complex cepstrum as phase information in statistical parametric speech synthesis. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4581–4584. Maia, R., Stylianou, Y., 2014. Complex cepstrum factorization for statistical parametric synthesis. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 3839–3843. Makhoul, J., 1975. Linear prediction: a tutorial review. Proc. IEEE 63 (4), 561–580. Maly, A., Mowlaee, P., 2016. On the importance of harmonic phase modification for improved speech signal reconstruction. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 584–588. Martin, R., 2001. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Audio Speech Lang. Process. 9 (5), 504–512. Mayer, F., Mowlaee, P., 2015. Improved phase reconstruction in single-channel speech separation. In: Proceedings of the ISCA Interspeech, pp. 1795–1799. McAulay, R., Quatieri, T., 1986a. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Audio Speech Lang. Process. 34 (4), 744–754. McAulay, R., Quatieri, T., 1986b. Speech analysis/synthesis based on a sinusoidal representation. IEEE Trans. Acoust., Speech, Signal Process. 34 (4), 744–754. McCowan, I., Dean, D., McLaren, M., Vogt, R., Sridharan, S., 2011. The delta-phase spectrum with application to voice activity detection and speaker recognition. IEEE Trans. Audio Speech Lang. Process. 19 (7), 2026–2038. Mehmetcik, E., Ciloglu, T., 2012. Speech enhancement by maintaining phase continuity between consecutive analysis frames. J. Acoust. Soc. Am. 132 (3), 1972. Miyahara, R., Sugiyama, A., 2013. An auto-focusing noise suppressor for cellphone movies based on peak preservation and phase randomization. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 2785–2789. Miyahara, R., Sugiyama, A., 2014.
An auto-focusing noise suppressor for cellphone movies based on phase randomization and power compensation. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 2199–2203. Möller, S., Chan, W., Cote, N., Falk, T.H., Raake, A., Waltermann, M., 2011. Speech quality estimation: Models and trends. IEEE Signal Process. Mag. 28 (6), 18–28.
Moriya, T., Honda, M., 1986. Speech coder using phase equalization and vector quantization. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 1701–1704.
Mowlaee, P., 2010. New Strategies for Single-channel Speech Separation. Ph.D. thesis. Institute for Electronic Systems, Aalborg University, Aalborg, Denmark.
Mowlaee, P., Kulmer, J., 2015a. Harmonic phase estimation in single-channel speech enhancement using phase decomposition and SNR information. IEEE Trans. Audio Speech Lang. Process. 23 (9), 1521–1532.
Mowlaee, P., Kulmer, J., 2015b. Phase estimation in single-channel speech enhancement: Limits-potential. IEEE Trans. Audio Speech Lang. Process. 23 (8), 1283–1294.
Mowlaee, P., Martin, R., 2012. On phase importance in parameter estimation for single-channel source separation. In: Proceedings of the International Workshop on Acoustic Signal Enhancement, pp. 1–4.
Mowlaee, P., Saeidi, R., 2013a. Iterative closed-loop phase-aware single-channel speech enhancement. IEEE Signal Process. Lett. 20 (12), 1235–1239.
Mowlaee, P., Saeidi, R., 2013b. On phase importance in parameter estimation in single-channel speech enhancement. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7462–7466.
Mowlaee, P., Saeidi, R., 2014. Time-frequency constraint for phase estimation in single-channel speech enhancement. In: Proceedings of the International Workshop on Acoustic Signal Enhancement, pp. 338–342.
Mowlaee, P., Saeidi, R., Stylianou, Y., 2014. Phase importance in speech processing applications. In: Proceedings of the ISCA Interspeech, pp. 1623–1627.
Mowlaee, P., Saeidi, R., Martin, R., 2012. Phase estimation for signal reconstruction in single-channel speech separation. In: Proceedings of the ISCA Interspeech, pp. 1548–1551.
Mowlaee, P., Watanabe, M., 2013. Iterative sinusoidal-based partial phase reconstruction in single-channel source separation. In: Proceedings of the ISCA Interspeech, pp. 832–836.
MPEG-2/MPEG-4 AAC, 1998. http://www.mp3-tech.org/aac.html.
Murthy, H., Yegnanarayana, B., 2011. Group delay functions and its applications in speech technology. J. Indian Acad. Sci. (SADHANA) 36, 745–782.
Murthy, H.A., Yegnanarayana, B., 1991a. Formant extraction from group delay function. Speech Commun. 10 (3), 209–221.
Murthy, H.A., Yegnanarayana, B., 1991b. Speech processing using group delay functions. Elsevier Signal Process. 22 (3), 259–267.
Murty, K.S.R., Boominathan, V., Vijayan, K., 2012. Allpass modeling of LP residual for speaker recognition. In: Proceedings of the International Conference on Signal Processing and Communications (SPCOM), pp. 1–5.
Murty, K.S.R., Yegnanarayana, B., 2006. Combining evidence from residual phase and MFCC features for speaker recognition. IEEE Signal Process. Lett. 13 (1), 52–55.
Nakagawa, S., Asakawa, K., Wang, L., 2007. Speaker recognition by combining MFCC and phase information. In: Proceedings of the ISCA Interspeech, pp. 2005–2008.
Nakagawa, S., Wang, L., Ohtsuka, S., 2012. Speaker identification and verification by combining MFCC and phase information. IEEE Trans. Audio Speech Lang. Process. 20 (4), 1085–1095.
Ni, X.S., Huo, X., 2007. Statistical interpretation of the importance of phase information in signal and image reconstruction. Stat. Probab. Lett. 77 (4), 447–454.
Nico, G., Fortuny, J., 2003. Using the matrix pencil method to solve phase unwrapping. IEEE Trans. Signal Process. 51 (3), 886–888.
Obaid, R.Q., van Dijk, B., Moonen, M., Wouters, J., 2012. Speech understanding performance of cochlear implant subjects using time-frequency masking-based noise reduction. IEEE Trans. Biomed. Eng. 59 (5), 1364–1373.
Ohm, G.S., 1843. Über die Definition des Tones, nebst daran geknüpfter Theorie der Sirene und ähnlicher tonbildender Vorrichtungen. Ann. Phys. Chem. 59, 513–565.
Omologo, M., Svaizer, P., 1997. Use of the crosspower-spectrum phase in acoustic event location. IEEE Trans. Speech Audio Process. 5 (3), 288–292.
Oppenheim, A.V., Lim, J.S., 1981. The importance of phase in signals. Proc. IEEE 69 (5), 529–541.
Oppenheim, A.V., Lim, J.S., Curtis, S.R., 1983. Signal synthesis and reconstruction from partial Fourier-domain information. J. Opt. Soc. Am. 73 (11), 1413–1420.
Oppenheim, A.V., Lim, J.S., Kopec, G., Pohlig, S.C., 1979. Phase in speech and pictures. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 632–637.
Oppenheim, A.V., Schafer, R.W., Buck, J.R., 1999. Discrete-time Signal Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Paliwal, K.K., Alsteris, L.D., 2003. Usefulness of phase spectrum in human speech perception. In: Proceedings of the Eurospeech, pp. 2117–2120.
Paliwal, K.K., Alsteris, L.D., 2005. On the usefulness of STFT phase spectrum in human listening tests. Speech Commun. 45 (2), 153–170.
Paliwal, K.K., Wojcicki, K.K., Shannon, B.J., 2011. The importance of phase in speech enhancement. Speech Commun. 53 (4), 465–494.
Parry, R.M., Essa, I.A., 2007. Incorporating phase information for source separation via spectrogram factorization. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 661–664.
Parthasarathi, S.H., Padmanabhan, R., Murthy, H.A., 2011. Robustness of group delay representations for noisy speech signals. Int. J. Speech Technol. 14 (4), 361–368.
Patil, S.P., Gowdy, J.N., 2014. Exploiting the baseband phase structure of the voiced speech for speech enhancement. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 6133–6137.
Peddinti, V., Hermansky, H., 2013. Filter-bank optimization for frequency domain linear prediction. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7102–7106.
Pobloth, H., 2004. Perceptual and Squared Error Aspects in Speech and Audio Coding. Ph.D. thesis. Royal Institute of Technology (KTH), Stockholm, Sweden.
Pobloth, H., Kleijn, W.B., 1999. On phase perception in speech. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 29–32.
Pobloth, H., Kleijn, W.B., 2003. Squared error as a measure of perceived phase distortion. J. Acoust. Soc. Am. 114 (2), 1081–1094.
Potamianos, A., Maragos, P., 1995. Speech formant frequency and bandwidth tracking using multiband energy demodulation. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 784–787.
Quackenbush, S.R., Barnwell, T.P., Clements, M.A., 1988. Objective Measures of Speech Quality. Prentice-Hall, Englewood Cliffs, NJ.
Quatieri, T., Oppenheim, A.V., 1981. Iterative techniques for minimum phase signal reconstruction from phase or magnitude. IEEE Trans. Acoust., Speech, Signal Process. 29 (6), 1187–1193.
Rajan, P., Kinnunen, T., Hanilci, C., Pohjalainen, J., Alku, P., 2013. Using group delay functions from all-pole models for speaker recognition. In: Proceedings of the ISCA Interspeech, pp. 2489–2493.
Rix, A.W., Beerends, J.G., Hollier, M.P., Hekstra, A.P., 2001. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 749–752.
Sadjadi, S.O., Hansen, J.H.L., 2011. Hilbert envelope based features for robust speaker identification under reverberant mismatched conditions. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 5448–5451.
Sadjadi, S.O., Hansen, J.H.L., 2015. Mean Hilbert envelope coefficients (MHEC) for robust speaker and language identification. Speech Commun. 72, 138–148.
Saeidi, R., Pohjalainen, J., Kinnunen, T., Alku, P., 2010. Temporally weighted linear prediction features for tackling additive noise in speaker verification. IEEE Signal Process. Lett. 17 (6), 599–602.
Sanchez, J., Saratxaga, I., Hernaez, I., Navas, E., Erro, D., Raitio, T., 2015. Toward a universal synthetic speech spoofing detection using phase information. IEEE Trans. Inf. Forens. Secur. 10 (4), 810–820.
Saratxaga, I., Hernaez, I., Erro, D., Navas, E., Sanchez, J., 2009. Simple representation of signal phase for harmonic speech models. Electron. Lett. 45 (7), 381–383.
Saratxaga, I., Hernaez, I., Odriozola, I., Navas, E., Luengo, I., Erro, D., 2010. Using harmonic phase information to improve ASR rate. In: Proceedings of the ISCA Interspeech, pp. 1185–1188.
Saratxaga, I., Hernaez, I., Pucher, M., Sainz, I., 2012. Perceptual importance of the phase related information in speech. In: Proceedings of the ISCA Interspeech, pp. 1448–1451.
Schädler, M.R., Meyer, B.T., Kollmeier, B., 2012. Spectro-temporal modulation subspace-spanning filter bank features for robust automatic speech recognition. J. Acoust. Soc. Am. 131 (5), 4134–4151.
Schlüter, R., Ney, H., 2001. Using phase spectrum information for improved speech recognition performance. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 133–136.
Schroeder, M.R., 1975. Models of hearing. Proc. IEEE 63 (9), 1332–1350.
Shi, G., Aarabi, P., Hui, J., 2007. Phase-based dual-microphone speech enhancement using a prior speech model. IEEE Trans. Audio Speech Lang. Process. 15 (1), 109–118.
Skoglund, J., Kleijn, W.B., Hedelin, P., 1997. Audibility of pitch-synchronously modulated noise. In: Proceedings of the IEEE Workshop on Speech Coding for Telecommunications, pp. 51–52.
Spanias, A., Painter, T., Atti, V., 2005. Audio Signal Processing and Coding. John Wiley & Sons, Inc.
Stachurski, J., McCree, A., 2000. A 4 kb/s hybrid MELP/CELP coder with alignment phase encoding and zero-phase equalization. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, pp. 1379–1382.
Stark, A.P., Paliwal, K.K., 2008. Speech analysis using instantaneous frequency deviation. In: Proceedings of the ISCA Interspeech, pp. 22–26.
Stark, A.P., Paliwal, K.K., 2009a. Group-delay-deviation based spectral analysis of speech. In: Proceedings of the ISCA Interspeech, pp. 1083–1086.
Stark, A.P., Paliwal, K.K., 2009b. Group-delay-deviation based spectral analysis of speech. In: Proceedings of the ISCA Interspeech, pp. 1083–1086.
Stark, A.P., Wójcicki, K.K., Lyons, J.G., Paliwal, K.K., 2008. Noise driven short-time phase spectrum compensation procedure for speech enhancement. In: Proceedings of the ISCA Interspeech, pp. 549–552.
Sturmel, N., Daudet, L., 2012. Iterative phase reconstruction of Wiener filtered signals. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 101–104.
Sturmel, N., Daudet, L., 2013. Informed source separation using iterative reconstruction. IEEE Trans. Audio Speech Lang. Process. 21 (1), 178–185.
Stylianou, Y., 2001. Removing linear phase mismatches in concatenative speech synthesis. IEEE Trans. Speech Audio Process. 9 (3), 232–239.
Stylianou, Y., Cappe, O., Moulines, E., 1998. Continuous probabilistic transform for voice conversion. IEEE Trans. Speech Audio Process. 6 (2), 131–142.
Sugiyama, A., Miyahara, R., 2013. Phase randomization - a new paradigm for single-channel signal enhancement. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7487–7491.
Taal, C.H., Hendriks, R.C., Heusdens, R., Jensen, J., 2011. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 19 (7), 2125–2136.
Taghia, J., Martin, R., 2014. Objective intelligibility measures based on mutual information for speech subjected to speech enhancement processing. IEEE Trans. Audio Speech Lang. Process. 22 (1), 6–16.
Tahon, M., Degottex, G., Devillers, L., 2012. Usual voice quality features and glottal features for emotional valence detection. In: Proceedings of the International Conference on Speech Prosody. Shanghai, China, pp. 693–696.
Tribolet, J., 1977. A new phase unwrapping algorithm. IEEE Trans. Acoust., Speech, Signal Process. 25 (2), 170–177.
Tribolet, J., Noll, P., McDermott, B., Crochiere, R., 1978. A study of complexity and quality of speech waveform coders. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 3, pp. 586–590.
Tüske, Z., Golik, P., Schlüter, R., Ney, H., 2014. Acoustic modeling with deep neural networks using raw time signal for LVCSR. In: Proceedings of the ISCA Interspeech, Singapore, pp. 890–894.
Tyagi, V., Wellekens, C., 2006. Fepstrum and carrier signal decomposition of speech signals through homomorphic filtering. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
Van Trees, H.L., 2002. Optimum Array Processing: Detection, Estimation, and Modulation Theory, Part IV. Wiley-Interscience.
Vary, P., 1985. Noise suppression by spectral magnitude estimation - mechanism and theoretical limits. Elsevier Signal Process. 8 (4), 387–400.
Vary, P., Martin, R., 2006. Digital Speech Transmission: Enhancement, Coding and Error Concealment. John Wiley & Sons.
Vijayan, K., Kumar, V., Murty, K.S.R., 2014a. Allpass modelling of Fourier phase for speaker verification. In: Proceedings of the Speaker Odyssey Workshop, pp. 112–117.
Vijayan, K., Kumar, V., Murty, K.S.R., 2014b. Feature extraction from analytic phase of speech signals for speaker verification. In: Proceedings of the ISCA Interspeech, pp. 1658–1662.
Vijayan, K., Murty, K., 2014. Epoch extraction from allpass residual of speech signals. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 1493–1497.
Vincent, E., Gribonval, R., Fevotte, C., 2006. Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 14 (4), 1462–1469.
Virtanen, T., 2007. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 15 (3), 1066–1074.
Wang, D., Brown, G.J., 2006. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press.
Wang, D., Lim, J., 1982. The unimportance of phase in speech enhancement. IEEE Trans. Acoust., Speech, Signal Process. 30 (4), 679–681.
Wang, D.L., 2005. On ideal binary mask as the computational goal of auditory scene analysis. In: Speech Separation by Humans and Machines, pp. 181–197.
Wang, L., Minami, K., Yamamoto, K., Nakagawa, S., 2010. Speaker identification by combining MFCC and phase information in noisy environments. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4502–4505.
Wang, L., Yoshida, Y., Kawakami, Y., Nakagawa, S., 2015. Relative phase information for detecting human speech and spoofed speech. In: Proceedings of the ISCA Interspeech, pp. 2092–2096.
Wang, Y., Narayanan, A., Wang, D., 2014. On training targets for supervised speech separation. IEEE Trans. Audio Speech Lang. Process. 22 (12), 1849–1858.
Weninger, F., Erdogan, H., Watanabe, S., Vincent, E., Le Roux, J., Hershey, J.R., Schuller, B., 2015. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In: Proceedings of the International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA). Liberec, Czech Republic.
Williamson, D.S., Wang, Y., Wang, D., 2014a. Reconstruction techniques for improving the perceptual quality of binary masked speech. J. Acoust. Soc. Am. 136 (2), 892–902.
Williamson, D.S., Wang, Y., Wang, D., 2014b. A two-stage approach for improving the perceptual quality of separated speech. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 7034–7038.
Wojcicki, K., Milacic, M., Stark, A., Lyons, J., Paliwal, K., 2008. Exploiting conjugate symmetry of the short-time Fourier spectrum for speech enhancement. IEEE Signal Process. Lett. 15, 461–464.
Wojcicki, K.K., Paliwal, K.K., 2007. Importance of the dynamic range of an analysis window function for phase-only and magnitude-only reconstruction of speech. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 729–732.
Wu, Z., Evans, N., Kinnunen, T., Yamagishi, J., Alegre, F., Li, H., 2015a. Spoofing and countermeasures for speaker verification: A survey. Speech Commun. 66, 130–153.
Wu, Z., Kinnunen, T., Evans, N., Yamagishi, J., Hanilçi, C., Sahidullah, M., Sizov, A., 2015b. ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge. In: Proceedings of the ISCA Interspeech, pp. 2037–2041.
Xiao, X., Tian, X., Du, S., Xu, H., Chng, E.S., Li, H., 2015. Spoofing speech detection using high dimensional magnitude and phase features: the NTU approach for ASVspoof 2015 challenge. In: Proceedings of the ISCA Interspeech, pp. 2052–2056.
Yamamoto, K., Sueyoshi, E., Nakagawa, S., 2010. Speech recognition using long-term phase information. In: Proceedings of the ISCA Interspeech, pp. 1189–1192.
Yegnanarayana, B., Murthy, H., 1992. Significance of group delay functions in spectrum estimation. IEEE Trans. Signal Process. 40 (9), 2281–2289.
Yegnanarayana, B., Saikia, D., Krishnan, T., 1984. Significance of group delay functions in signal reconstruction from spectral magnitude or phase. IEEE Trans. Acoust., Speech, Signal Process. 32 (3), 610–623.
Yegnanarayana, B., Sreekanth, J., Rangarajan, A., 1985. Waveform estimation using group delay processing. IEEE Trans. Acoust., Speech, Signal Process. 33 (4), 832–836.
Yu, E.W.M., Chan, C.F., 2004. Harmonic + noise coding based on the characteristics of human phase perception. In: Proceedings of the International Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 310–313.
Zen, H., Nose, T., Yamagishi, J., Sako, S., Masuko, T., Black, A., Tokuda, K., 2006. The HMM-based speech synthesis system (HTS) version 2.0. In: Proceedings of the International Speech Communication Association Speech Synthesis Workshop (SSW6), pp. 294–299.
Zen, H., Sak, H., 2015. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4470–4474.
Zhu, X., Beauregard, G., Wyse, L.L., 2007. Real-time signal estimation from modified short-time Fourier transform magnitude spectra. IEEE Trans. Audio Speech Lang. Process. 15 (5), 1645–1653.
Zwicker, E., Fastl, H., 2007. Psychoacoustics - Facts and Models. Springer Series in Information Sciences. Springer Verlag, New York.