Combining evidences from magnitude and phase information using VTEO for person recognition using humming


Hemant A. Patil, Maulik C. Madhavi*
Speech Research Lab, Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), Gandhinagar, India
Received 10 October 2016; received in revised form 6 May 2017; accepted 12 June 2017; available online xxx

Abstract

Most state-of-the-art speaker recognition systems use natural speech signals (i.e., real, spontaneous, or contextual speech) from the subjects. In this paper, recognition of a person from his or her hum is attempted with the help of machines. This kind of application can be useful for designing a person-dependent Query-by-Humming (QBH) system and hence plays an important role in music information retrieval (MIR) systems. In addition, it can also be useful for other interesting speech technology applications, such as human-computer interaction, speech prosody analysis of disordered speech, and speaker forensics. This paper develops a new feature extraction technique to exploit perceptually meaningful phase spectrum information (due to mel frequency warping, which imitates the human hearing process) along with magnitude spectrum information from the hum signal. In particular, the structure of the state-of-the-art feature set, namely, Mel Frequency Cepstral Coefficients (MFCCs), is modified to capture the phase spectrum information. In addition, a new energy measure, namely, the Variable length Teager Energy Operator (VTEO), is employed to compute the subband energies of different time-domain subband signals (i.e., the outputs of the 24 triangular-shaped filters used in the mel filterbank). We refer to this proposed feature set as MFCC-VTMP (i.e., mel frequency cepstral coefficients that capture perceptually meaningful magnitude and phase information via VTEO). The polynomial classifier (which is, in principle, similar to other discriminatively-trained classifiers such as the support vector machine (SVM) with a polynomial kernel) is used as the basis for all the experiments. The effectiveness of the proposed feature set is evaluated and consistently found to be better than the MFCC feature set for several evaluation factors, such as comparison with other phase-based features, the order of the polynomial classifier, the person (speaker) modeling approach (such as GMM-UBM and i-vector), the dimension of the feature vector, robustness under signal degradation conditions, static vs. dynamic features, feature discrimination measures, and intersession variability.

© 2017 Elsevier Ltd. All rights reserved.

Keywords: Music information retrieval (MIR); Person recognition; Humming; Mel filterbank; Variable length Teager Energy Operator (VTEO); Polynomial classifier

1. Introduction

Nowadays, multimedia databases are frequently used to exchange information in interactions between humans and machines. A multimedia system deals with various types of data, such as speech, audio, video, text, still images, etc.



Fig. 1. A schematic block diagram of person-dependent QBH system.

Information retrieval based on multimedia databases has become a leading technology. Since a person's humming of a song carries the tune of that particular song, it can be considered a new kind of multimedia information. In the context of information retrieval, an image query is performed by supplying an image or its sketch as input to a system. Similarly, the humming of a person can be used in multimedia or Music Information Retrieval (MIR) to retrieve relevant items from an audio database. This retrieval technique is called a Query-by-Humming (QBH) system (Ghias et al., 1995; Jang and Lee, 2008).

The QBH system was initially developed by Ghias et al. (1995). A humming sound contains music and melody information and hence QBH can be an important part of MIR. A characteristic of a QBH system is that humming is more convenient and efficient than using a conventional text-based information retrieval system that searches keywords using various ranking-based concepts. QBH systems use various pitch or fundamental frequency (F0) extraction, melody representation, and matching methods (Ghias et al., 1995; Lu and Seide, 2008). As an independent capability, we may like a QBH system that is person-dependent. A person-dependent QBH system can be considered as shown in Fig. 1. This kind of system can find application in mobile ringtone search or download through a QBH system (Lu and Seide, 2008). Consider a scenario where a person wants to retrieve the ringtone of a particular song. If the person forgets the correct lyrics and words of this song, then a text-based information retrieval system cannot be used.

In such realistic scenarios, humming can be useful and, in fact, very convenient for retrieving the query song. Moreover, consider an artist who creates an album of different songs and develops a QBH system to retrieve these songs. In such scenarios, the artist may want an authentication mechanism on top of QBH, i.e., the artist may want to restrict access to the album. In this context, the artist can use a humming-based person detection system followed by the QBH system. For example, in Fig. 1, an input hum is fed to the system, and the person detection (or authentication) system decides whether a user can get access to the QBH system. In addition, many cell phones have speaker-dependent telephone dialing capability, which helps secure these (personal) devices from biometric attacks. In this context, person-dependent telephone dialing can be designed in which person authentication is done using humming rather than the speech signal. Here, the speaker (person) detection task is carried out by exploiting person-specific information in the humming pattern of a person. Hence, the approach is based on extracting features that characterize the pattern of humming, which is unique to a person. Furthermore, since a hum is a sound created by closing the mouth and forcing airflow through the nasal cavities, it is a nasalized sound produced primarily via the nasal cavities. Since humming is not the same as the production of normal speech, in this paper, the speaker recognition task is referred to as person recognition.

1.1. Other relevant applications

Besides person recognition using humming, several studies exploit humming for different applications.
For example, in a landmark study reported in Andersson and Schalen (1998), humming corresponding to a melodic tune was used (along with other nasal sounds such as /m/ and /n/ and the fricatives /v/ and /z/) as part of therapy content in order to regain voice and to make the subjects (who were suffering from Psychological Voice Disorder (PVD)) aware of the presence of the larynx (or voicebox) as a communicative and respiratory tool.


Fig. 2. Schematic of human-computer interaction system. After Suzuki et al. (2003).

In this context, the proposed humming-based voice biometrics can be used for subjects suffering from such voice disorders (Jin et al., 2009). Furthermore, humming sounds occur very naturally in nature; for example, the cooing of a human infant, the meowing of a cat, the mooing of a cow, the sounds of brass instruments, the humming of adults for a melodic tune, and the humming of the earth (Rhie and Romanowicz, 2004). We naturally interact with great pleasure (due to social bonds) with human infants and pets who produce hummed sounds. For example, parents or caretakers make an effort to elicit a response with cooing from newborns/infants in order to build empathic interaction between them (Masataka, 1992). In an interesting psychological experiment reported in Suzuki et al. (2003), it was observed that a human-computer interaction system (as shown in Fig. 2) could also be used to illustrate interpersonal relations. In particular, the interactive system in this experiment used an animated character that echoically mimicked the prosodic features (such as loudness via power calculation, duration via speech segmentation, and F0 dynamics for intonation) in a human's voice by synthesizing humming sounds via a simple sinusoidal model. Subjective evaluation of this experiment suggests that subjects consider computer-generated humming sounds (with a higher mimicry ratio) more favorably than constant-prosody responses from template sound generation (as shown in Fig. 2). In particular, humming sounds generated by a computer facilitate friendly human-computer interaction and also establish a scientific basis for the social bonds formed during interaction between caretakers/parents and infants producing hummed sounds. This finding may be useful for developing humming-based biometrics for infants who are not able to speak (Jin et al., 2009).

1.1.1. Motivation for person recognition using humming

Humming sounds can be very significant for person recognition due to the following reasons (Amino and Arai, 2009):

- The nasal cavities are almost fixed during the production of nasal sounds as compared to oral sounds and hence, variability in the use of a subject's articulators has less effect on humming. Hence, speaker-specific information will be conveyed through physiological features (such as the shape and size of the nasal cavities) that characterize persons from their nasal sounds.
- Nasal sounds occupy the lower frequency region of the spectrum (mostly below 2 kHz). Hence, features need to be extracted from this localized region only, rather than from a broad frequency region as in the case of normal speech.
- Nasal sounds have less intra-speaker variability and more inter-speaker variability.
- In terms of the universality criterion of a biometric pattern, the hum is applicable to a deaf person as well as an infant who has a disorder in the speech production mechanism (Andersson and Schalen, 1998).


- The deliberate control of velar movements (to control the amount of nasalization) is not easy, i.e., velar movements generally occur outside conscious control for some subjects. This particular feature of nasal sounds could also be useful for identifying an uncooperative person under forensic conditions.
- Recent studies based on acoustic and morphological data showed that there are significant acoustic contributions (in terms of antiresonances) of the different paranasal sinus cavities, namely, maxillary, sphenoidal, frontal, and ethmoidal, to the transmission characteristics of the nasal tract (Dang and Honda, 1996). Thus, it is difficult to deliberately manipulate nasal spectra, at least as compared to those of the oral cavity.
- As discussed in Section 1.1, it is interesting to note that humming sounds have been found to play a significant role in establishing social bonds with subjects during human-computer interaction and hence could be explored for the humming-based biometrics problem as well.

The pattern of a hum signal, and its corresponding pitch or fundamental frequency (F0) contour, formant contour, and overall spectral energy distribution in the spectrogram are distinct (Patil et al., 2009; 2008; Patil and Parhi, 2010b). Furthermore, the lower formants are relatively more dominant (a distinct characteristic of a nasal sound) in the hum. These observations motivated the authors to investigate whether the hum produced by persons can be used for the biometrics problem, i.e., person recognition from humming. To the best of the authors' knowledge, humming-based person recognition was proposed for the first time by Patil et al. (2009; 2008; Patil and Parhi, 2010b) and subsequently by Jin et al. (2009). Jin et al. found that pitch, which is conducive to humming-based music retrieval, is not conducive to human verification and identification (as the pitch in humming is highly dependent on the melody and not on the target speaker) (Jin et al., 2009). In this context, the authors have reported the use of spectral features for humming-based person recognition, where the performance of the state-of-the-art Mel Frequency Cepstral Coefficients (MFCCs) was found to be better than that of Linear Prediction (LP)-based features, such as the Linear Prediction Coefficients (LPCs) themselves and the Linear Prediction Cepstral Coefficients (LPCCs) (Patil et al., 2009). In addition, Patil and Parhi proposed a new feature set, namely, Variable length Teager Energy-based Mel Frequency Cepstral Coefficients (VTMFCCs), for this application (Patil and Parhi, 2010b). Furthermore, the effectiveness of VTMFCCs in capturing perceptually meaningful excitation source-like information from the hum signal was explored. In addition, a score-level fusion of VTMFCCs with MFCCs was found to give better person recognition performance than MFCCs alone for various evaluation factors, such as population size and dimension of the feature vector (Patil et al., 2011). For VTMFCC feature extraction, the newly developed Variable length Teager Energy Operator (VTEO) (Tomar and Patil, 2008) was applied to the pre-processed humming signal to find a running estimate of the signal's energy in the time domain, instead of the subband energy (i.e., l2-norm) in the frequency domain (as in the case of the state-of-the-art feature set, namely, MFCCs (Davis and Mermelstein, 1980)).

In this paper, we propose a new feature set to exploit perceptually meaningful magnitude and phase spectrum information via mel frequency warping, followed by the use of VTEO to compute subband energies in the time domain.
In traditional MFCC feature extraction, the magnitude spectrum is mel warped (while the phase spectrum is simply neglected), and then the subband energy is computed, followed by a logarithm and the Discrete Cosine Transform (DCT) (Davis and Mermelstein, 1980). In the present work, the proposed feature set computation inherently exploits (i.e., combines evidences from) both the phase and magnitude spectra. Moreover, the proposed scheme makes a few modifications: instead of applying VTEO to the pre-processed frame (as in the case of VTMFCCs (Patil and Parhi, 2010b)) or finding the subband energy in the frequency domain (as in the case of MFCCs (Davis and Mermelstein, 1980)), the VTEO of the time-domain filtered humming signals (corresponding to both magnitude and phase spectra after mel frequency warping) is taken. Hence, the novelty of the proposed approach lies in exploiting perceptually meaningful phase spectrum information and employing VTEO to capture subband energies. This work is an extension of our earlier work presented in Patil and Madhavi (2012) and Madhavi and Patil (2014). In the next section, a brief review of work related to the use of phase information is presented.

The organization of the rest of the paper is as follows. Section 2 discusses the prior work in phase-related speech signal processing. In Section 3, we discuss the details of the proposed feature extraction scheme to exploit phase information. Section 4 gives brief details of our in-house database and corpus design. The details of the existing phase-based features, the polynomial classifier, and the performance evaluation metrics are given in Section 5. In Section 6, various person recognition experiments are carried out in order to evaluate the effectiveness of the proposed feature set over MFCCs. Finally, Section 7 presents the summary and conclusions of the paper along with future research directions.


2. Importance of phase in speech

Due to the century-old experiments by Ohm, it was assumed that the human auditory system (HAS) is phase deaf (Ohm, 1843; Von Helmholtz, 1912; Paliwal and Alsteris, 2003b). It means that the human ear was considered simply insensitive to phase changes, and hence, only the magnitude spectrum of a sound signal was assumed to be sufficient to convey meaningful information to the HAS (Paliwal and Alsteris, 2003b). However, in recent years, many researchers have tried to exploit the phase information in the speech signal, which was previously discarded. Phase information can be exploited in terms of the Fourier transform phase or the instantaneous (analytic) phase derived using the analytic signal concept (e.g., via the Hilbert transform). Researchers have used various phase-related features either standalone or combined with magnitude spectrum information. As discussed earlier, the MFCC feature set is used as the state-of-the-art feature set for speech and speaker recognition applications. However, during the extraction of MFCCs, only the magnitude spectrum is considered, i.e., information related to the phase spectrum is simply neglected (due to the assumption that the human ear is phase deaf (Davis and Mermelstein, 1980)). However, Hayes et al. observed that under a variety of conditions, such as when a finite signal length is considered, phase information alone is sufficient to completely reconstruct a signal to within a scale factor; this problem is referred to as phase retrieval, which has roots in optical imaging research (Hayes et al., 1980). Recently, there has been a growing interest in the phase retrieval problem encountered in optical imaging, speech processing, etc. In particular, Shenoy et al. have proposed the idea of exact phase retrieval in principal shift-invariant spaces (Shenoy et al., 2016; Shenoy and Seelamantula, 2015). They have identified a class of continuous-time signals that are neither causal nor minimum phase and yet guarantee the retrieval of exact phase. In Seelamantula (2016), an interesting approach to encode the phase of a speech signal in its spectrogram (called a phase-encoded spectrogram) is proposed using the condition of causal delta dominance (CDD).

Kleinschmidt et al. exploited phase information in order to develop robust speech recognition under noisy or signal degradation conditions. They used complex cepstrum subtraction (CCS) with Phase Estimation via Delay Projection (PEDEP) and found encouraging results for noisy data at various SNR levels (Kleinschmidt et al., 2011). Leigh et al. explored speech signal reconstruction from magnitude-only stimuli and phase-only stimuli (Alsteris and Paliwal, 2007). They observed the Mean Square Error (MSE) between a signal reconstructed from the magnitude-only spectrum and from the phase-only spectrum for different numbers of iterations and found that the phase spectrum is sufficient to reconstruct the speech signal. Zhu et al. used the group delay function to capture the phase information in the signal (Zhu and Paliwal, 2004). The power spectrum captures only magnitude spectrum information; instead, the product of the group delay function and the power spectrum was taken, and the resulting spectrum was named the product spectrum. They extracted MFCCs from this product spectrum, which carries magnitude as well as phase information, and this new feature set was found to give the best performance (Zhu and Paliwal, 2004).
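As a small illustration of the product spectrum idea attributed above to Zhu and Paliwal (2004): since the group delay function of x(n) can be written as (X_R(k)Y_R(k) + X_I(k)Y_I(k))/|X(k)|², with y(n) = n·x(n), its product with the power spectrum |X(k)|² reduces to X_R(k)Y_R(k) + X_I(k)Y_I(k). The sketch below (our illustration in Python/NumPy, not code from the paper) computes this quantity for a single frame.

```python
import numpy as np

def product_spectrum(frame):
    """Power spectrum x group delay = X_R*Y_R + X_I*Y_I, with y(n) = n*x(n)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)          # DFT of x(n)
    Y = np.fft.rfft(n * frame)      # DFT of y(n) = n x(n)
    return X.real * Y.real + X.imag * Y.imag
```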
Paliwal et al. used different window types, namely, rectangular and Hamming windows, and different window durations, namely, 32 ms and 1024 ms, to investigate the importance of phase information. Their studies found that phase information captures more information from the signal as the window duration increases. Furthermore, it was found that for smaller window durations, the magnitude spectrum captures more intelligibility, while for larger window durations, the phase spectrum captures more intelligibility (Paliwal and Alsteris, 2005). In addition, Paliwal et al. observed that by changing the window from a tapered window (like the Hamming window) to a non-tapered one (the rectangular window), the phase-only stimuli attain intelligibility at smaller analysis durations than the magnitude-only stimuli (Paliwal and Alsteris, 2003b; 2003a; Alsteris and Paliwal, 2004). In a similar line, the study reported in Smith et al. (2002) showed that the decomposition of an audio signal into a slowly-varying envelope and a rapidly-varying fine time structure captures information pertaining to speech perception and pitch perception, respectively. Furthermore, Paliwal et al. used an analytic signal decomposition method to derive the instantaneous frequency (IF). This IF conveys the instantaneous phase information. They proposed the Mel Frequency Instantaneous Frequency (MFIF) feature set for the speech recognition task, which gave good results in terms of accuracy over conventional MFCCs and LPCCs (Paliwal and Atal, 2003). The analytic signal generated using the Hilbert transform was also proposed to derive the time-domain analytic instantaneous phase from the Linear Prediction (LP) residual phase (Murty and Yegnanarayana, 2006) and the Teager Energy Operator (TEO) phase (Patil and Parhi, 2010a) for the speaker recognition task.

Recently, Wang et al. used the phase derived from the STFT to get new phase information {θ}. They used a score-level fusion of this new phase information {cos θ, sin θ} with the models from conventional MFCCs. They also used the same or different classifiers (e.g., Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM)) for their data


fusion approach. They found encouraging results using this fusion strategy for the speaker recognition task (Wang et al., 2010; Nakagawa et al., 2012). However, in this approach, as in most other approaches, the phase information is not derived in a perceptually meaningful way, i.e., mel frequency warping is not applied to capture the phase information. A group-delay-based approach was introduced by Murthy et al. (Murthy and Yegnanarayana, 1991). Later on, this approach was modified in order to emphasize formant information over pitch (F0) harmonics and applied to speech and speaker recognition tasks (Murthy and Gadde, 2003; Hegde et al., 2004). In Text-To-Speech (TTS), a complex cepstrum approach was used to exploit phase information in HMM-based parametric speech synthesis (Maia et al., 2012). In speech coding, the instantaneous phase was used to control randomness across time and frequency to synthesize speech (Degottex and Erro, 2014). An excellent review of the importance of phase in different speech processing applications is presented in Mowlaee et al. (2014; 2016).

3. Proposed feature extraction

Traditional feature extraction for MFCCs involves pre-processing and filtering of the magnitude spectrum (not the phase spectrum) at the segmental level through a mel filterbank, followed by subband energy computation in the frequency domain. This is as if a pre-processed frame were passed through a triangular-shaped frequency-domain window and the subband energy computed. Then, the energy is compressed using a logarithm operation followed by DCT to get the MFCCs. The studies discussed in Section 2 motivated us to derive perceptually meaningful phase information during feature extraction for the person recognition task, which is discussed in the next section.

3.1. Exploitation of phase information

In this work, mel filtering is done in such a way that both the magnitude and phase spectrum information of the humming signal are captured in the frequency domain. Then, this mel-filtered signal is converted back into the time domain so that the phase information is preserved inherently. This newly derived feature set is referred to as MFCC-VTMP (i.e., Mel frequency cepstral coefficients to capture the perceptually meaningful magnitude and phase spectrum information via VTEO). Thus, the proposed feature set exploits the perceptually meaningful phase obtained via the mel scale filterbank. The preliminary work on this was published in Madhavi and Patil (2014) and is extended in this paper by performing more intensive experiments for several evaluation factors. The idea is to keep the same theoretical framework as MFCCs while extracting phase information via mel scale filtering. To understand the basis for the proposed approach, consider any real discrete-time signal x(n) and its DFT, denoted by X(k), which has conjugate symmetry, i.e., the magnitude spectrum of X(k) has even symmetry and the phase angle of X(k) has odd symmetry (Oppenheim et al., 1997). Conversely, if X(k) is any conjugate-symmetric sequence, then taking the inverse DFT of X(k) gives x(n) as a real sequence. From this fact, the time-domain mel filterbank output can be obtained. To use this result, the triangular-shaped mel filterbank structure is reversed (flipped) to mimic the negative-frequency side of the spectrum. This can be considered a magnitude spectrum that is even symmetric. Then, the frequency-domain signal is multiplied by this new filterbank structure, and the resulting frequency-domain filtered signal is conjugate symmetric. Then, the IDFT is taken to convert it back into the time domain (i.e., to get a real signal). By doing this, the phase spectrum information is also utilized implicitly. Figs. 3 and 4 describe the mathematical formulation designed for the proposed feature extraction scheme.
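The following minimal sketch (our illustration, not the authors' implementation) verifies this property numerically: a triangular mel-type filter defined on the positive-frequency bins is mirrored onto the negative-frequency bins, the two-sided DFT of a real frame is multiplied by it, and the IDFT of the conjugate-symmetric product is a real time-domain subband signal. The filter edge bins below are arbitrary placeholders.

```python
import numpy as np

def symmetric_mel_filter(n_fft, lo, center, hi):
    """Triangular filter on bins [lo, hi] of the positive half, mirrored (even symmetry)."""
    half = np.zeros(n_fft // 2 + 1)
    half[lo:center + 1] = np.linspace(0.0, 1.0, center - lo + 1)
    half[center:hi + 1] = np.linspace(1.0, 0.0, hi - center + 1)
    full = np.zeros(n_fft)
    full[:n_fft // 2 + 1] = half
    full[n_fft // 2 + 1:] = half[1:n_fft // 2][::-1]   # flipped negative-frequency side
    return full

frame = np.random.randn(512)               # stand-in for a pre-processed hum frame
Y = np.fft.fft(frame) * symmetric_mel_filter(512, 20, 30, 45)
y = np.fft.ifft(Y)                         # time-domain subband signal
print(np.max(np.abs(y.imag)))              # ~1e-16, i.e., the output is (numerically) real
```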
After this, in order to compute the subband energy (in the time domain) of the different subband signals (corresponding to the outputs of the different mel scale subband filters), the variable length Teager energy operator (VTEO) is used. This is discussed in the next section.

3.2. Variable length TEO (VTEO)

A nonlinear energy tracking operator, referred to as the Teager Energy Operator (TEO) (denoted here as ψ{·}), for a discrete-time signal x(n) is defined as (Kaiser, 1990)

$$ \mathrm{TEO}\{x(n)\} = \psi\{x(n)\} = x^{2}(n) - x(n+1)\,x(n-1). \qquad (1) $$

The TEO gives a good running estimate of a signal's energy when the signal has sharp transitions in the time domain (Kaiser, 1990; Teager, 1980). However, in situations where the amplitude difference between two consecutive samples of the signal is very small, the TEO gives a zero energy output, which indicates that the energy required to generate such a sequence of samples is zero.


Fig. 3. Basis for proposed feature extraction scheme.

However, this may not be the case in the actual physical signal (e.g., speech or hum). To alleviate this problem, the VTEO (denoted as ξ{·}) was proposed by the authors in Tomar and Patil (2008). The VTEO of a discrete-time signal x(n), for dependency index (DI) d, is defined as

$$ \mathrm{VTEO}_{d}\{x(n)\} = \xi\{x(n), d\} = x^{2}(n) - x(n+d)\,x(n-d), \qquad (2) $$

where ξ{x(n), d} is expected to give the running estimate of the signal's energy after considering the dth past and dth future samples, in order to track the dependency in the sequence of samples of the speech or hum signal. Hence, d in Eq. (2) is called the dependency index (DI). In this work, VTEO is used to compute the subband energy of the time-domain subband signals obtained from the mel scale filters. In particular, the humming signal is passed through a pre-processing block (i.e., frame blocking, Hamming windowing, and pre-emphasis) to get the pre-processed frame, and the discrete Fourier transform (DFT) of the pre-processed frame is taken. A mel filterbank with 24 subbands gives 24 corresponding time-domain subband signals, where the 24 subband filters span the entire bandwidth, i.e., Fs/2 = 11,025 Hz. Then, the running estimate of the energy of these subband signals is computed using VTEO for a particular value of DI. Finally, the normalized subband energy is computed, followed by logarithm and DCT operations, to get the proposed feature set, namely, MFCC-VTMP. The kth cepstral coefficient of MFCC-VTMP at the ith frame is computed as (Patil and Madhavi, 2012; Madhavi and Patil, 2014)

$$ \mathrm{MFCC\text{-}VTMP}_{i}(k) = \sum_{j=1}^{N_{F}} Sl_{i,j}\, \cos\!\left(\frac{k\,(j - 0.5)\,\pi}{N_{F}}\right), \qquad (3) $$

where k = 1, 2, ..., N_C; y_{i,j}(n) denotes the mel subband filtered signal in the time domain for the ith frame and jth filter in the mel filterbank; z_{i,j}(n) denotes the VTEO of y_{i,j}(n) for a given dependency index (DI); Sl_{i,j} = log{mean(z_{i,j}(n))} (for the ith frame and jth filter in the mel filterbank); N_F denotes the number of subband filters used in the mel filterbank; and N_C denotes the dimension of the feature vector. Eq. (3) appears to be real-valued; however, Sl_{i,j} in Eq. (3) inherently represents magnitude and phase information. Hence, we name the feature set MFCC-VTMP. Eq. (3) is the type-II version of the Discrete Cosine Transform (DCT) (Rao and Yip, 1989; Mallat, 2009). The constant multiplier associated with the DCT is $\sqrt{2/N_{F}}$ (as we took the DCT for k = 1, 2, ..., N_C and N_C < N_F (Rao and Yip, 1989)). This constant multiplier is the same for all the coefficients in a feature vector and hence does not contribute to the classifier decision; thus, we neglected this multiplying factor. Fig. 5 shows the block diagram of the feature extraction of three feature sets, namely, MFCC-VTMP, MFCC-VTP (i.e., Mel Frequency Cepstral Coefficients to capture perceptually meaningful phase spectrum information via VTEO), and MFCC-VTM (i.e., Mel Frequency Cepstral Coefficients to capture perceptually meaningful magnitude spectrum information via VTEO), respectively. Fig. 4 shows the output at various stages of Fig. 5. Pseudo code for the proposed feature extraction (i.e., for MFCC-VTMP) scheme is shown in Algorithm 1.


Fig. 4. The internal stage output during MFCC-VTMP feature extraction: (a) Pre-processed hum signal, (b) magnitude spectrum of (a), (c) phase spectrum of (a), (d) 3rd subband filter of mel filterbank, (e) time-domain subband filtered signal. After Patil and Madhavi (2012).


Fig. 5. A schematic block diagram of the proposed feature extraction: (a) pre-processed frame, (b) mel filterbank output in the time domain for a mel filterbank comprising 24 subband filters, (c) corresponding VTEO profile for the time-domain subband signal shown in (b), and (d) corresponding cepstral features. After Patil and Madhavi (2012).

Algorithm 1. A pseudo code for proposed MFCC-VTMP feature extraction. After Patil and Madhavi (2012).
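For readers who prefer code to pseudo code, the following Python sketch mirrors our reading of Algorithm 1 under the settings of Section 3.3 (24 mirrored filters, DI = 4, 12 cepstral coefficients); it is an illustrative reimplementation, not the authors' released code, and the `mel_filters` argument is assumed to hold 24 even-symmetric filters built as in the earlier sketch.

```python
import numpy as np
from scipy.fftpack import dct

def vteo(x, d=4):
    """Variable length Teager Energy Operator of Eq. (2) for dependency index d."""
    return x[d:-d] ** 2 - x[2 * d:] * x[:-2 * d]

def mfcc_vtmp_frame(frame, mel_filters, d=4, n_ceps=12):
    """One MFCC-VTMP vector from a pre-processed frame (Eq. (3))."""
    X = np.fft.fft(frame)                       # two-sided DFT of the frame
    log_energies = []
    for H in mel_filters:                       # mirrored (even-symmetric) mel filters
        y = np.fft.ifft(X * H).real             # time-domain subband signal
        z = vteo(y, d)                          # running VTEO energy estimate
        log_energies.append(np.log(np.maximum(np.mean(z), 1e-12)))  # Sl_{i,j}
    # DCT-II of the log subband energies; coefficients k = 1..n_ceps are kept.
    # SciPy's DCT-II differs from Eq. (3) only by a constant multiplier, which
    # the text neglects anyway.
    return dct(np.asarray(log_energies), type=2)[1:n_ceps + 1]
```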

3.3. Feature discrimination capability of MFCC-VTMP

MFCC and MFCC-VTMP features were computed using frames of 23.2 ms (512 samples at 22,050 Hz) duration with an overlap of 50% (i.e., 256 samples). Each frame was pre-emphasized with the highpass filter $1 - 0.97z^{-1}$, followed by Hamming windowing. We used 24 mel filters over the entire available frequency range, i.e., 0–11,025 Hz, for both the MFCC and MFCC-VTMP features. We perform feature extraction in chunks of 5 s to avoid loading too much data into the computer memory. There is a high correlation between the separation of classes and good recognition accuracy. In particular, if the patterns (features) from different classes are well separated in the feature space, then the recognition performance should be good. On the other hand, if they are very close to each other or overlapped, the discrimination among them will clearly become poor, and hence, the recognition performance is expected to degrade (Nicholson et al., 1997). In this context, we use the well-known Fisher's discriminant, i.e., the F-ratio, which measures only the separability of a single coefficient or dimension of the feature vector. However, to evaluate the discrimination of an entire feature set, an extension of the F-ratio is needed (i.e., the J-measures). To understand the computational details of these J-measures, let the covariance of the class means be the matrix B (i.e., the between-class covariance) and the average of the class covariances be the matrix W (i.e., the within-class covariance).


Table 1. Feature discrimination of MFCC and MFCC-VTMP features using the J1 and Js measures and KL-divergence. Ncb = number of centroids in the VQ codebook. The bold numbers (in the original table) indicate relatively better performance.

Feature set         | J1      | Js      | KL (Ncb = 256)
MFCC                | 13.8862 | 10.4495 | 52.4798
MFCC-VTMP (DI = 4)  | 15.2645 | 10.8550 | 53.6447

Each element of these matrices is denoted by the lowercase letters b and w, respectively. The J-measures for N-dimensional feature vectors are given by (Nicholson et al., 1997)

$$ J_{1} = \mathrm{trace}\left(W^{-1}B\right) \quad \text{and} \quad J_{s} = \sum_{k=1}^{N} \frac{b_{kk}}{w_{kk}}. \qquad (4) $$
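A compact NumPy sketch of Eq. (4) is given below, under the single-Gaussian-per-class assumption used in the text; `feats` is a hypothetical dict mapping a person ID to an (n_frames × dim) array of feature vectors.

```python
import numpy as np

def j_measures(feats):
    class_means = np.vstack([f.mean(axis=0) for f in feats.values()])
    B = np.cov(class_means.T, bias=True)                                   # between-class covariance
    W = np.mean([np.cov(f.T, bias=True) for f in feats.values()], axis=0)  # within-class covariance
    J1 = np.trace(np.linalg.solve(W, B))                                   # trace(W^-1 B)
    Js = np.sum(np.diag(B) / np.diag(W))                                   # sum_k b_kk / w_kk
    return J1, Js
```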

For a population size of 100 subjects, the class separation of the different 12-dimensional feature sets, namely, MFCCs and MFCC-VTMP, is examined. We compute the J-measures by computing the mean and covariance matrix assuming that the features have a single multivariate Gaussian distribution (so that the class is described by only the first two moments, namely, the mean and covariance, which in turn determine the class separation). In addition, humming sounds are nasalized sounds and can be considered as one group of phonemes. Earlier, features from the same phoneme were used to evaluate the discrimination (Umesh et al., 1999).

The results obtained for the J1 and Js measures on these feature sets are shown in Table 1, which shows that the proposed feature set, namely, MFCC-VTMP, has significantly higher J1 and Js measures than MFCCs. This indicates that MFCC-VTMP has better class discrimination power than MFCCs, which motivated the authors to use the MFCC-VTMP feature set for the humming-based person recognition task. These classical approaches assume that the features from a single class (a single person, in this paper) follow a Gaussian distribution (so that it is described by only two moments, namely, mean and variance). However, the distribution of the features may not be strictly Gaussian for MFCC and MFCC-VTMP in realistic scenarios. To that effect, we used an information-theoretic measure, namely, the Kullback–Leibler (KL) divergence. In particular, we estimate the joint density of the features using vector quantization (VQ). We prepare the 256 centroids in the VQ codebook by iteratively splitting the codebook into two using the Linde–Buzo–Gray (LBG) algorithm. For each person, we computed a probability density vector of dimension 256 based on the proximity w.r.t. the centroids. To compute these centroids, we used features from two classes (persons) and prepared two probability density function (pdf) vectors. To measure the discrimination between two classes, the KL-divergence between two pdf vectors p1 and p2 is given by (Reddy et al., 2014)

$$ D(p_{1} \parallel p_{2}) = \sum_{k=1}^{N_{cb}} p_{1}(k)\,\log\frac{p_{1}(k)}{p_{2}(k)} + \sum_{k=1}^{N_{cb}} p_{2}(k)\,\log\frac{p_{2}(k)}{p_{1}(k)}, \qquad (5) $$

where N_cb is the number of codebook vectors. The base of the logarithm in Eq. (5) does not affect the discrimination, since a change of logarithm base only introduces a constant factor. A higher KL distance corresponds to higher discrimination between the two classes. We considered all possible pairs of two classes from the 100 subjects' humming data, which leads to 4950 pairs, and averaged the total KL distance. It can be observed from Table 1 that, according to the KL-distance discrimination, the proposed MFCC-VTMP feature set gives better discrimination than MFCC. We also varied Ncb from 1 to 256 and computed the KL divergence, which is shown in Fig. 6. It can be observed that, for all values of Ncb, the proposed feature set has higher class separability. For Ncb = 1, the KL divergence is 0; this is because all the features are assigned to a single codebook vector and hence, the classes are not separable.
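The VQ-based symmetric KL divergence of Eq. (5) can be sketched as follows; the LBG codebook training itself is not shown, `codebook` is assumed to be an (Ncb × dim) array of centroids obtained beforehand, and a small floor keeps log(0) out of Eq. (5).

```python
import numpy as np

def vq_pdf(feats, codebook, eps=1e-10):
    """Relative frequencies of nearest-centroid assignments (a discrete pdf)."""
    dists = np.linalg.norm(feats[:, None, :] - codebook[None, :, :], axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(codebook)).astype(float)
    counts += eps
    return counts / counts.sum()

def symmetric_kl(feats_a, feats_b, codebook):
    p1, p2 = vq_pdf(feats_a, codebook), vq_pdf(feats_b, codebook)
    return np.sum(p1 * np.log(p1 / p2)) + np.sum(p2 * np.log(p2 / p1))   # Eq. (5)
```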
4. Data collection and corpus design

Person (speaker) recognition is a data-driven field. In particular, the factors affecting the performance include recording conditions, acoustic noise, transmission, speaker (person) variability, channel characteristics, etc. Thus, the performance figures in this study are meaningless if the recording conditions during data collection are not known (Patil, 2005). In this section, the methodology and the typical experimental setup used for data collection and corpus design are presented.


Fig. 6. KL divergence-based feature separability for various Ncb.

4.1. Ideal characteristics of the corpus

The following are important or desirable characteristics that should be considered during data collection (Patil, 2005; Chhayani and Patil, 2013).

- Depth and breadth of coverage: we should have enough humming data from a sufficient number of subjects in order to build a statistically meaningful corpus.
- Data should be collected in realistic noisy environments such as home, car, and babble noise, instead of only acoustically clean laboratory conditions.
- The corpus should be unbiased w.r.t. gender, i.e., it should have an almost equal number of male and female subjects.
- The corpus should be prepared from subjects having wide age variations. This helps in understanding the effect of age on person recognition performance and also the suitability of this technology for a wide range of subjects.
- The corpus should be designed to have sufficiently long training and testing segments in order to investigate performance under limited-data vs. sufficient-data conditions.
- Data should be collected in multiple training and testing sessions in order to study the effect of intersession variability.
- The corpus should be designed for both closed-set and open-set scenarios.
- Data should be collected with a variety of recording and transmission channel characteristics.

It should be noted that it is very difficult, if not impossible, to satisfy all of the above ideal characteristics; rather, one should try to consider as many of these desirable attributes as possible for a given application.

4.2. Database preparation

The humming database is prepared for the melodic tunes of Hindi (an Indian language) songs with the help of HP and SCHURE microphones from 100 subjects (many of them university students). All the subjects (having age variations from 12 to 59 years) were multilingual and had Hindi as their second (i.e., L2) language. Data were collected only from those subjects who voluntarily agreed to this task. As part of the text material for the recordings of humming, we used text corresponding to popular classical Hindi songs sung by legendary singers of Indian cinema, namely, Late Kishor Kumar, Late Mukesh Kumar, and Lata Mangeshkar. In this work, the database was prepared in a home environment, the DA-IICT radio room, and a laboratory environment. All the humming data were recorded at a 22.05 kHz sampling rate with 16-bit resolution. Humming corresponding to 15 Hindi songs was considered as training data and that for another 5 Hindi songs as testing data. Subjects were given the text material and allowed to choose the songs that they could hum with great comfort w.r.t. the choice of song and duration of humming. Thus, even though the recording may seem prompted, the recorded hums for the same song may have different patterns of acoustic pressure variation for two subjects. We kept different songs in training and testing (thus, something like a text-independent speaker verification mode).


Fig. 7. Panel I: (a) Time-domain waveform, (b) F0 contour, and (c) spectrogram with formant contour for a female subject (age: 23 years). Panel II: Similar analysis for a male subject (age: 26 years). The screenshot is taken from the wavesurfer software (Sjölander and Beskow, 2009).

To illustrate this, Fig. 7 shows the time-domain waveform, the corresponding F0 contour, and the spectrogram along with the formant contour for a humming sound w.r.t. the same Hindi song recorded from a male and a female subject. The song is 'Mere Dil Me Aaj Kya Hai Tu Kahe To Me Bataa Du' (which means, "What is in my heart? If you ask, I will tell you"). It can be observed that the hum content, the pattern of acoustic pressure variation, the F0 contour, and the overall spectral energy distribution are distinct for a particular subject (as was observed in our earlier studies (Patil et al., 2009; 2008; Patil and Parhi, 2010b)). Fig. 8 shows the experimental setup in the lab for data collection. A coauthor (on the right) is adjusting the settings in the Audacity software for the recording of humming from a subject (on the left). The recordings of the humming sounds are stored on a laptop computer through the sound card of the computer (Dell VOSTRO: Core 2 Duo 1.6 GHz, 2.5 GB RAM, Windows XP OS). The humming data are stored in *.wav format with the help of the Audacity open source sound editing software (Madhavi, 2011).

Fig. 8. Experimental setup during data collection. Interviewer (right) adjusting the recording setup for humming recording from a subject (left). After Madhavi (2011).


Table 2. Details of the in-house humming database used for this study (Madhavi, 2011).

Item                                  | Details
No. of subjects                       | 100 (80 male and 20 female)
Age variations                        | 12–59 years
No. of songs for humming per subject  | 20 (15 used for training and 5 used for testing)
No. of sessions                       | 1
Data type                             | Hum for Hindi (an Indian language) songs
Microphone type                       | HP dual earphone with mic, SCHURE mic
Acoustic environment                  | DA-IICT radio room, DA-IICT lab, and home environment
Recording software                    | Open source Audacity sound editing software
Sampling rate                         | 22,050 Hz
Sampling format                       | 16-bit resolution, mono-channel
Train segments                        | 30 s, 60 s, 90 s (3 overlapping training segments)
Test segments                         | 1 s, 2 s, ..., 30 s (30 overlapping testing segments)
Genuine trials                        | 100 × 3 (no. of train segments) × 30 (no. of test segments) = 9000
Impostor trials                       | 30 segments × (99 × 100 × 3) non-targets = 891,000

4.3. Corpus design

Using a suitable and statistically meaningful corpus for any person recognition experiment requires the development of an evaluation procedure or experimental protocol that specifies, among other things, the partitioning of the corpus into training and testing datasets (Patil, 2005). Investigation of system performance for any evaluation factor (such as the dimension of the feature vector, noise robustness, intersession variability, etc.) requires that the corpus have enough humming data from enough subjects for the condition of interest in order to construct a valid, statistically significant experiment. To that effect, our in-house humming corpus is organized into training segments of 30 s, 60 s, and 90 s (where the 60 s and 90 s segments overlap with the 30 s segment) and testing segments of 1 s, 2 s, ..., 30 s (again with overlapping segments, similar to the training segments). Table 2 gives a brief summary of the data collection procedure and corpus plan. The acoustic feature sets, MFCCs and MFCC-VTMPs, are trained on 30 s, 60 s, and 90 s. There are 30 testing segments of duration 1 s to 30 s. The final score is the average of the results over all testing segments. In Section 6, we evaluate the proposed MFCC-VTMP feature set for the person recognition task under various evaluation factors.

4.4. Efforts and experiences during data collection

The following are some of our efforts and experiences during the data collection phase, which may be of considerable importance while collecting data in realistic scenarios for a practical person recognition system using humming.

- The subjects were informed about the purpose of this study in order to avoid any apprehension about misuse of the humming data and the relevant metadata.
- Some subjects were observed to be initially shy and reluctant to record humming data for various songs.
- A subject information form was prepared to collect the relevant metadata of the subjects (such as name, age, gender, place of birth, native place, languages known, socio-economic status, profession, etc.).
- The presence of the observer (the observer being an experimenter, recording equipment, or any other tool of measurement) affects the acoustical characteristics of humming (which is referred to as the observer's paradox (Patil, 2005; Carter et al., 2000, pp. 409–413; Wolfram and Schilling, 1998)).
- Subjects occasionally became bored and distracted and lowered their voice intensity or turned their heads away from the microphone.
- There was stammering, laughter, throat clearing, tittering, and poor articulation. All of these cases were recorded in the normal way.
- The silence (i.e., no hum) to hum duration ratio was about 0.15 (which was measured from a male and a female subject). Many times, it was found that the silence-to-hum duration ratio was small between the hums of two different songs for a subject. This may be due to the different priorities associated with humming a particular song in a list (or text material).


5. Experimental setup

5.1. Existing phase-based features

In this section, we discuss the different existing phase-based features that are considered for comparing the performance of the proposed feature set in sub-Section 6.1.

5.1.1. Modified group delay cepstral coefficients (MGDCCs)

The negative derivative of the unwrapped Fourier transform phase is called the group delay function. The group delay has better formant resolving capability (due to its high resolution and additive property) than the short-time magnitude spectrum of the Fourier transform (Murthy and Gadde, 2003). The modified group delay (MGD) is defined as follows:

$$ \tau_{x}(k) = \mathrm{sign}\cdot\left|\frac{X_{R}(k)\,Y_{R}(k) + Y_{I}(k)\,X_{I}(k)}{|S(k)|^{2\gamma}}\right|^{\alpha}, \qquad (6) $$

where sign denotes the sign of $\frac{X_{R}(k)Y_{R}(k) + Y_{I}(k)X_{I}(k)}{|S(k)|^{2\gamma}}$; X_R(k) and X_I(k) are the real and imaginary parts of X(k) (i.e., the DFT of x(n)), respectively; Y_R(k) and Y_I(k) are the real and imaginary parts of Y(k) (i.e., the DFT of y(n) = n x(n)), respectively; |S(k)|² is the cepstrally smoothed spectrum; and α and γ are the smoothing parameters, which are kept as 0.4 and 0.9, respectively. The 12-dimensional modified group delay cepstral coefficients (MGDCCs) are taken as the DCT of the logarithm of the MGD. MGDCC features were computed using frames of 23.2 ms (512 samples for Fs = 22,050 Hz) duration with an overlap of 50% (i.e., 256 samples). Each frame was pre-emphasized with the highpass filter $1 - 0.97z^{-1}$, followed by Hamming windowing.
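A hedged sketch of MGDCC extraction following Eq. (6) is given below. The cepstral smoothing of |S(k)|² is implemented here as simple low-quefrency liftering and the lifter length (30) is our assumption, not a setting reported in the text; likewise, applying the logarithm to |τ| is one reading of "DCT of the logarithm of the MGD".

```python
import numpy as np
from scipy.fftpack import dct, idct

def mgdcc(frame, alpha=0.4, gamma=0.9, n_ceps=12, n_lifter=30, eps=1e-10):
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)                         # DFT of y(n) = n x(n)
    # Cepstrally smoothed power spectrum |S(k)|^2 via low-quefrency liftering
    ceps = dct(np.log(np.abs(X) + eps), type=2, norm='ortho')
    ceps[n_lifter:] = 0.0
    S2 = np.exp(2.0 * idct(ceps, type=2, norm='ortho'))
    num = X.real * Y.real + X.imag * Y.imag            # X_R Y_R + X_I Y_I
    tau = np.sign(num) * np.abs(num / (S2 ** gamma)) ** alpha   # Eq. (6)
    return dct(np.log(np.abs(tau) + eps), type=2, norm='ortho')[:n_ceps]
```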

5.1.2. Linear prediction (LP) residual phase

For the LP residual computation, feature analysis was performed using 12th-order linear prediction (LP) on 23.2 ms frames with an overlap of 50% (the frame duration may also be selected in the range of 10–20 ms). Each frame was multiplied by a Hamming window and pre-emphasized with a smooth highpass filter. LP analysis predicts a sample of a speech signal from its past few samples. The linearly predicted speech/hum sample at the nth instant, $\hat{s}(n)$, is given by

$$ \hat{s}(n) = -\sum_{k=1}^{p} a_{k}\, s(n-k), \qquad (7) $$

where p is the order of the linear prediction and $\{a_{k}\}_{k=1}^{p}$ are the linear prediction coefficients (LPCs). The difference between the actual sample s(n) and the predicted sample $\hat{s}(n)$ gives the LP residual or error signal, r(n), i.e., $r(n) = s(n) - \hat{s}(n)$. The value of the LP residual is high around the glottal closure instants (GCIs). The analytic signal is $r_{a}(n) = r(n) + j\,r_{h}(n)$, where $r_{h}(n)$ is the Hilbert transform of r(n). The magnitude of the analytic signal is the Hilbert envelope $r_{e}(n)$, given by $r_{e}(n) = |r_{a}(n)| = \sqrt{r^{2}(n) + r_{h}^{2}(n)}$. The LP residual phase is the cosine phase, i.e., $\cos(\theta_{LPR}(n))$, and is computed as follows (Murty and Yegnanarayana, 2006):

$$ \cos(\theta_{LPR}(n)) = \frac{\mathrm{Re}(r_{a}(n))}{|r_{a}(n)|} = \frac{r(n)}{r_{e}(n)}. \qquad (8) $$
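A minimal sketch of the LP residual phase of Eq. (8) for one frame is shown below, with 12th-order LP as stated above; librosa and SciPy are our implementation choices here, not the authors' toolchain.

```python
import numpy as np
from scipy.signal import hilbert, lfilter
import librosa

def lp_residual_phase(frame, order=12):
    a = librosa.lpc(frame, order=order)        # [1, a_1, ..., a_p]
    residual = lfilter(a, [1.0], frame)        # r(n) = s(n) - s_hat(n)
    analytic = hilbert(residual)               # r_a(n) = r(n) + j r_h(n)
    envelope = np.abs(analytic)                # Hilbert envelope r_e(n)
    return residual / (envelope + 1e-12)       # cos(theta_LPR(n)) = r(n)/r_e(n)
```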

5.1.3. TEO phase and VTEO phase

Similar to the LP residual, the TEO profile of a speech signal has higher energy around the GCIs. In order to extract the analytic phase information, the authors proposed the use of the TEO instead of the LP residual signal (Patil and Parhi, 2010a). This analytic phase information is referred to as the TEO phase. The TEO phase is computed using the analytic signal of the TEO profile, which is obtained via the Hilbert transform of the TEO. The TEO phase is the cosine phase $\cos(\theta_{TEO}(n))$ and is computed as follows (Patil and Parhi, 2010a):

$$ \cos(\theta_{TEO}(n)) = \frac{\mathrm{Re}\{\psi_{a}\{x(n)\}\}}{|\psi_{a}\{x(n)\}|}, \qquad (9) $$


where $\psi_{a}\{x(n)\}$ and $|\psi_{a}\{x(n)\}|$ are the analytic signal and the Hilbert envelope of the TEO profile of x(n), respectively. Similarly, the VTEO phase is obtained using Eq. (2) followed by the phase computation (keeping the same computational framework as Eq. (8)). Fixed-dimensional features are extracted from the LP residual phase, the TEO phase, and the VTEO phase. First, the GCIs are estimated from the LP residual, TEO, and VTEO profiles using singularity detection with the first-order derivative of a Gaussian, as suggested in Patil and Parhi (2010a). After GCI detection, ten blocks of 12 samples of the LP residual phase, TEO phase, and VTEO phase are selected around the GCIs to form the feature vector (since the Signal-to-Noise Ratio (SNR) is higher around the GCIs (Murty and Yegnanarayana, 2006)).
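A short sketch of the TEO/VTEO phase of Eq. (9) follows; d = 1 gives the classical TEO and d > 1 the VTEO of Eq. (2). The GCI-anchored selection of ten blocks of 12 samples described above is omitted, so this returns the sample-level cosine phase only.

```python
import numpy as np
from scipy.signal import hilbert

def vteo_phase(x, d=1):
    profile = x[d:-d] ** 2 - x[2 * d:] * x[:-2 * d]          # (V)TEO profile
    analytic = hilbert(profile)                              # analytic signal of the profile
    return np.real(analytic) / (np.abs(analytic) + 1e-12)    # cos(theta(n)), Eq. (9)
```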

5.2. Polynomial classifier

In this paper, a discriminatively-trained polynomial classifier is used as the basis for all the person recognition experiments. This classifier is a good approximation to the Bayes classifier. It has the capability of new class addition and an efficient multiply-and-add DSP structure. It uses out-of-class data to optimize performance (as opposed to other statistical methods such as HMM and GMM). The details of the classifier structure and the training algorithm are given in Campbell and Assaleh (1999) and Campbell et al. (2002). The feature vectors are processed by the polynomial discriminant function. During recognition, the score for each test segment is computed as the inner product between the polynomial expansion of the test segment feature vectors and the person (speaker) model for each hypothesized speaker. The testing feature vectors (which may be either from a speaker or an impostor) are processed by the polynomial discriminant function. Every speaker j has a speaker-specific vector $w_{j}$, to be identified during training, and the output of the discriminant function is averaged over time, resulting in a score for every $w_{j}$. The score for the jth speaker is then given by

$$ s_{j}(x) = \frac{1}{M} \sum_{i=1}^{M} w_{j}^{T}\, p(x_{i}), \qquad (10) $$

where $x_{i}$ is the ith input test feature vector, M is the total number of testing feature vectors, $w_{j}$ is the speaker (person) model, and p(x) is the vector of polynomial basis terms of the input test feature vector. In particular, for a 2-D feature vector $x = [x_{1}\; x_{2}]^{t}$ and a second-order polynomial approximation, p(x) is given by $p(x) = [1\; x_{1}\; x_{2}\; x_{1}^{2}\; x_{2}^{2}\; x_{1}x_{2}]^{t}$.

Training of the polynomial classifier is accomplished by obtaining the optimum speaker model for each person using a discriminatively-trained classifier with the mean-squared error (MSE) criterion (Campbell and Assaleh, 1999; Campbell et al., 2002), i.e., for a speaker's feature vector (for the present problem, the target or genuine speaker), an output of one is desired, whereas for impostor data, an output of zero is desired. For the two-class problem, let $w_{spk}$ be the optimum speaker model and $l(\omega)$ the class label; thus, the ideal outputs are $l(spk) = 1$ and $l(imp) = 0$. The resulting problem using MSE is

$$ w_{spk} = \arg\min_{w}\, E\!\left[\left(w^{T} p(x) - l(\omega)\right)^{2}\right], \qquad (11) $$

where E denotes the expectation operator over x and ω. This can be approximated using the training feature set as

$$ w_{spk} = \arg\min_{w}\left[\sum_{i=1}^{N_{spk}} \left|w^{T} p(x_{i}) - 1\right|^{2} + \sum_{i=1}^{N_{imp}} \left|w^{T} p(y_{i})\right|^{2}\right], \qquad (12) $$

where $x_{1}, \ldots, x_{N_{spk}}$ are the genuine training data and $y_{1}, \ldots, y_{N_{imp}}$ are the impostor data. Let $M_{spk}$ and $M_{imp}$ be the matrices whose rows contain the polynomial expansions of the genuine training data and the impostor data, respectively, i.e., $M_{spk} = [p(x_{1})^{t}\; p(x_{2})^{t}\; \ldots\; p(x_{N_{spk}})^{t}]^{t}$ and $M_{imp} = [p(y_{1})^{t}\; p(y_{2})^{t}\; \ldots\; p(y_{N_{imp}})^{t}]^{t}$. Let

$$ M = \begin{bmatrix} M_{spk} \\ M_{imp} \end{bmatrix}. \qquad (13) $$

From Eq. (11), we have

$$ w_{spk} = \arg\min_{w} \left\| M w - o \right\|^{2}, \qquad (14) $$


where o contains the ideal output (i.e., $N_{spk}$ ones followed by $N_{imp}$ zeros). Eq. (14) is solved using the set of normal equations

$$ M^{t} M\, w_{spk} = M^{t} o. $$

Expanding in terms of $M_{spk}$ and $M_{imp}$, we have

$$ \left(M_{spk}^{t} M_{spk} + M_{imp}^{t} M_{imp}\right) w_{spk} = M_{spk}^{t}\,\mathbf{1} \;\Rightarrow\; \left(R_{spk} + R_{imp}\right) w_{spk} = M_{spk}^{t}\,\mathbf{1} \;\Rightarrow\; w_{spk} = R^{-1}\left[M_{spk}^{t}\,\mathbf{1}\right], \qquad (15) $$

where $\mathbf{R} = \mathbf{R}_{spk} + \mathbf{R}_{imp}$, $\mathbf{R}_{spk} = \mathbf{M}_{spk}^{t}\mathbf{M}_{spk}$, and $\mathbf{R}_{imp} = \mathbf{M}_{imp}^{t}\mathbf{M}_{imp}$. The computation of $\mathbf{R}_{spk}$ and $\mathbf{R}_{imp}$ is performed offline for all the enrolled speakers (persons). If the polynomial expansion of a feature vector $\mathbf{x}$, i.e., $p(\mathbf{x})$, contains $M_{terms}$ terms, then the computational costs associated with $\mathbf{R}_{spk}$ and $\mathbf{R}_{imp}$ are $O(N_{spk} M_{terms}^{2})$ and $O(N_{imp} M_{terms}^{2})$, respectively (Campbell and Assaleh, 1999; Campbell et al., 2002). The space complexity becomes larger for a higher degree of polynomial expansion. However, the matrix $\mathbf{R}_{spk}$ (as does its impostor counterpart, $\mathbf{R}_{imp}$) contains a large number of redundant terms. Table 3 shows the number of terms in $\mathbf{R}_{spk}$ and the number of unique terms for a 12-dimensional feature vector. It can be observed that, as the degree of the polynomial increases, the redundancy increases; in particular, for the polynomial classifier of degree 4, the full matrix contains 26.3 times as many entries as the number of unique terms. Hence, to reduce the space complexity, the training algorithm stores only the vector of unique terms, $p_2(\mathbf{x})$. The redundancy in the matrix $\mathbf{R}$ is well structured and hence, $\mathbf{R}$ can be obtained by mapping $p_2(\mathbf{x})$ to $\mathbf{R}$ via a mapping algorithm, which is based on the semigroup property of monomials (Campbell and Assaleh, 1999; Campbell et al., 2002).

Interestingly, the polynomial classifier used in our paper is related to the Support Vector Machine (SVM) with a polynomial kernel (Campbell et al., 2006). In addition, i-vectors were originally proposed in the context of combining speaker recognition using SVM and the Gaussian Mixture Model (GMM) (Dehak et al., 2011; Dehak, 2009). In this context, the polynomial classifier is also a discriminatively-trained classifier that, in a way, has a polynomial kernel and a scoring mechanism via an inner product operation, similar to the cosine distance scoring used in state-of-the-art i-vector systems (Dehak et al., 2011; Dehak, 2009). In this study, we have not used any data other than the training set from 100 subjects. In addition, while training a particular person's model, the other persons' data are exploited for better discrimination. In this sense, the problem discussed in this paper is closed-set; however, it can easily be extended to open-set scenarios, as discussed in the next sub-section. The training procedure is split into two parts. The first part computes the speaker-specific polynomial correlation terms, i.e., $\mathbf{R}_{spk}$ from $p_2(\mathbf{x})$, for all the enrolled subjects. The second part computes the speaker model $\mathbf{w}_{spk}$ (as per Eq. (15)).
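For concreteness, the following minimal sketch (in NumPy, under the simplifying assumption that the full expansion p(x) is stored rather than only the unique terms p_2(x)) trains a speaker model via the normal equations of Eq. (15) and scores a test segment with Eq. (10); the second-order expansion matches the 2-D example given above.

```python
import numpy as np

def poly_expand_2d(x):
    """Second-order polynomial basis for a 2-D feature vector [x1, x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

def train_speaker_model(spk_feats, imp_feats):
    """Solve (R_spk + R_imp) w = M_spk^t 1 for the speaker model w (Eq. (15))."""
    M_spk = np.array([poly_expand_2d(x) for x in spk_feats])
    M_imp = np.array([poly_expand_2d(y) for y in imp_feats])
    R = M_spk.T @ M_spk + M_imp.T @ M_imp
    b = M_spk.T @ np.ones(len(M_spk))
    return np.linalg.solve(R, b)

def score_segment(w, test_feats):
    """Average inner product between the model and expanded test vectors (Eq. (10))."""
    P = np.array([poly_expand_2d(x) for x in test_feats])
    return float(np.mean(P @ w))

# Toy usage: genuine vectors near (1, 1), impostor vectors near (-1, 0).
rng = np.random.default_rng(0)
spk = rng.normal([1, 1], 0.1, size=(200, 2))
imp = rng.normal([-1, 0], 0.1, size=(800, 2))
w = train_speaker_model(spk, imp)
print(score_segment(w, spk[:50]), score_segment(w, imp[:50]))  # high vs. low score
```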

5.2.1. New class addition capability

The structure of the polynomial classifier is very effective w.r.t. new class addition. In particular, suppose we have $N_{spk}$ subjects, and assume that we have stored the vectors of unique entries for all the $N_{spk}$ subjects (Campbell et al., 2002), i.e.,
$$ \mathbf{r} = \sum_{i=1}^{N_{spk}} \mathbf{r}_i, \qquad (16) $$

where $\mathbf{r}_i = \sum_{j=1}^{N_i} p_2(\mathbf{x}_{i,j})$, $p_2(\mathbf{x}_{i,j})$ is the vector of unique entries for the $i$th subject and $j$th training feature vector, and $N_i$ is the number of feature vectors in the $i$th subject's data.

Table 3
Number of redundant terms for a 12-D feature vector in Rspk. After Campbell and Assaleh (1999); Campbell et al. (2002).

Degree    Terms in Rspk    Unique terms    Ratio
2         8281             1820            4.55
3         207,025          18,564          11.15
4         3,312,400        125,970         26.3
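The counts in Table 3 follow from the standard formula for the number of monomials of total degree at most K in n variables, C(n+K, K): for a 12-D vector, p(x) of degree K has C(12+K, K) terms, Rspk has that number squared, and the unique entries correspond to the distinct monomials of degree at most 2K. The short check below (an illustrative sketch, not part of the original training code) reproduces the table:

```python
from math import comb

dim = 12
for degree in (2, 3, 4):
    n_poly = comb(dim + degree, degree)            # terms in p(x)
    n_rspk = n_poly ** 2                           # entries of Rspk = sum of p(x) p(x)^t
    n_unique = comb(dim + 2 * degree, 2 * degree)  # distinct monomials up to degree 2K
    print(degree, n_rspk, n_unique, round(n_rspk / n_unique, 2))
# Output matches Table 3:
# 2 8281 1820 4.55
# 3 207025 18564 11.15
# 4 3312400 125970 26.3
```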


Then, as new class data is acquired into the system, we update the entries of $\mathbf{r}$ stored for the $N_{spk}$ subjects as (Campbell et al., 2002):
$$ \mathbf{r}_{new} = \mathbf{r} + \sum_{j=1}^{N_{new}} p_2(\mathbf{x}_{new,j}), \qquad (17) $$
where $\mathbf{x}_{new,j}$ is the $j$th feature vector from the newly added class, $p_2(\mathbf{x}_{new,j})$ is the corresponding vector of unique entries, and $N_{new}$ is the number of feature vectors for the new class. Retraining can be performed after Eq. (17) has been computed over all the features of the newly added class. Algorithm 2 presents the procedure for model vector computation after new class addition in the polynomial classifier framework (Campbell et al., 2002). For the new class addition, steps 11-13 of Algorithm 2 (updating $\mathbf{r}$) require far more addition operations than the model computation step, i.e., step 17 of Algorithm 2 (where $i = N_{subject} + 1$, which corresponds to the new class). For 90 s of training data, $N_{new} = 7722$, and the number of terms in the polynomial expansion $p_2(\mathbf{x}_{new,j})$ (for $j = 1$ to $N_{new}$) is $N_{term} = 91$ for a 2nd order polynomial expansion. The number of addition operations required to compute $\mathbf{r}_{new}$ from $\mathbf{r}$ is $N_{new} \cdot N_{term}^{2} \approx 6.39 \times 10^{7}$, which is done offline. To compute the model vector $\mathbf{w}_{new}$ for the new class, the number of operations required is $N_{term}^{3} \approx 7.53 \times 10^{5}$, which is performed online. Since $N_{term} \ll N_{new}$, the computational complexity of the model computation is much smaller than that of the $\mathbf{r}_{new}$ computation (around 85 times smaller in this case).

Algorithm 2. A pseudo code for new class addition for the polynomial classifier. After Campbell et al. (2002).
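A minimal sketch of this incremental enrolment is given below; for readability it accumulates the full per-speaker correlation matrix instead of the compressed vector of unique entries r used by the mapping-algorithm implementation, so the point here is the arithmetic (accumulate sufficient statistics offline, re-solve for the new model online), not the storage savings.

```python
import numpy as np

class PolynomialEnrollment:
    """Incremental enrolment for the polynomial classifier (simplified sketch)."""

    def __init__(self, expand):
        self.expand = expand      # basis expansion p(x)
        self.R_per_spk = []       # per-speaker correlation matrices M_spk^t M_spk
        self.b_per_spk = []       # per-speaker right-hand sides M_spk^t 1

    def add_class(self, feats):
        """Accumulate sufficient statistics for a newly added speaker (offline)."""
        P = np.array([self.expand(x) for x in feats])
        self.R_per_spk.append(P.T @ P)
        self.b_per_spk.append(P.T @ np.ones(len(P)))

    def model(self, spk_idx):
        """Re-solve the normal equations for one speaker against all others (online)."""
        R_total = sum(self.R_per_spk)        # R_spk + R_imp over all enrolled data
        return np.linalg.solve(R_total, self.b_per_spk[spk_idx])
```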

5.2.2. Limitations of polynomial classifier

The polynomial classifier has the following limitations:
1. The training algorithm of the polynomial classifier is independent of the size of the training data; however, the computational complexity of the classifier increases severely with the dimension of the feature vector.
2. The computational complexity further increases with the degree of the polynomial approximation. In particular, the mapping step (i.e., expanding the vector of unique entries r into the matrix R) requires a large "jump" in memory space (Campbell et al., 2002). Thus, as the degree of the polynomial increases, the memory (space) requirement increases tremendously (as shown in Table 3).


5.3. Performance evaluation metrics

The performance of the person recognition system is evaluated using Detection Error Trade-off (DET) curves (Martin et al., 1997). The Equal Error Rate (EER) is the point on the DET curve at which the false acceptance (FA) and false rejection (FR) rates are equal (Martin et al., 1997). Another commonly adopted metric is the minimum detection cost function (DCF), which takes into account the costs associated with both FA and FR, i.e.,
$$ C_{det} = C_{FA}\, P_{Nontarget}\, P_{FA} + C_{FR}\, P_{Target}\, P_{FR}, \qquad (18) $$

where CFR and CFA are the cost associated with false rejecting a target speaker and false accepting an impostor, respectively. PFR and PFA are the probabilities of false rejection and false acceptance, respectively, PTarget and PNontarget are the probabilities of target and impostor tests among all the tests, respectively, and PNontarget ¼ 1  PTarget . In this paper, PTarget ¼ 0:5; CFR ¼ 1 and CFA ¼ 1 are used for the experiments. 6. Experimental results TagedPThis Section presents experimental results for various evaluation conditions. Table 4 reports Equal Error Rate (EER) and minimum DCF (Detection Cost Function) measures (Martin et al., 1997) for person verification task using 12-dimensional state-of-the-art MFCCs and proposed feature set, namely, MFCC-VTMP (for DI D 4). From Table 4, it can be observed that the performance of proposed feature set is better than MFCCs alone in terms of reduction in EER by around 0.2 %. DET curves for these experiments are shown in Fig. 9, where it can be observed that EER for MFCC-VTP (with DI D 4) (i.e., phase only) is very high (i.e., around 20.75 %), whereas EER for MFCC-VTM (with DI D 4) (i.e., magnitude only) is less than that for MFCCs alone by 0.07 %. However, when magnitude and phase Table 4 % EER and min. DCF with proposed feature set (namely, MFCC-VTMP) (12-D features and duration of data as specified in Table 2). The bold numbers indicate relative better performance. Feature sets !

MFCC

MFCC-VTM (DI D 4)

MFCC-VTP (DI D 4)

MFCC-VTMP (DI D 4)

EER (%) Min DCF

12.20 0.1214

12.13 0.1204

20.75 0.2035

12.01 0.1196
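For reference, the EER and minimum DCF values reported here (e.g., in Table 4) can be obtained from the target and impostor verification scores by sweeping a decision threshold; a minimal sketch, assuming plain NumPy arrays of scores and the cost parameters given above, is:

```python
import numpy as np

def eer_and_min_dcf(target_scores, impostor_scores,
                    p_target=0.5, c_fr=1.0, c_fa=1.0):
    """Sweep a threshold over all scores; return (EER, minimum DCF) per Eq. (18)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_fr = np.array([(target_scores < t).mean() for t in thresholds])     # miss rate
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false alarm
    eer_idx = np.argmin(np.abs(p_fr - p_fa))
    eer = 0.5 * (p_fr[eer_idx] + p_fa[eer_idx])
    dcf = c_fa * (1 - p_target) * p_fa + c_fr * p_target * p_fr
    return eer, dcf.min()

# Toy usage with synthetic scores.
rng = np.random.default_rng(1)
eer, min_dcf = eer_and_min_dcf(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 5000))
print(f"EER = {100 * eer:.2f} %, min DCF = {min_dcf:.4f}")
```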

Fig. 9. DET curves for 12-D MFCC and 12-D proposed feature set, MFCC-VTMP.


Table 5
Performance comparison of MFCC-VTMP with existing phase-based features. The bold numbers indicate relatively better performance.

Feature sets    MFCC-VTMP    MGDCC     LP residual phase    TEO phase    VTEO phase
EER (%)         12.01        14.30     47.64                34.79        34.72
Min. DCF        0.1196       0.1424    0.4620               0.3334       0.3457

However, when the magnitude and phase spectra are combined via the proposed feature set, i.e., MFCC-VTMP, the EER is lower than that for MFCC, MFCC-VTP, and MFCC-VTM alone. This shows that MFCC-VTP and MFCC-VTM carry strong complementary information relative to the traditional state-of-the-art MFCC feature set. In addition, from Fig. 9, it can be observed that the proposed feature set, MFCC-VTMP, performs better than traditional MFCCs at all operating points of the DET curve.

6.1. Comparison with different phase-based features

In this Section, we compare the performance of the proposed feature set with the existing phase-based features briefly described in Section 5.1. Table 5 shows the performance of the various phase-based features. It can be observed that the proposed MFCC-VTMP feature set gives better performance than all the other existing phase-based features. Fig. 10 shows the DET curves for the various phase-based features, from which it can also be observed that the proposed feature set performs relatively better than all the phase-based features considered in this study.

6.2. Effect of dependency index (DI)

VTEO was proposed to capture the dependency of the current sample on the adjacent samples (Tomar and Patil, 2008). An experiment was conducted in which the dependency index (DI) of the VTEO used in the proposed feature extraction scheme, MFCC-VTMP, was varied from 1 to 18. The dimension of the feature vector was kept at 12. Fig. 11 shows the plot of EER vs. the DI used in MFCC-VTMP. From Fig. 11, it can be observed that the dependency index (DI) in MFCC-VTMP plays an important role in the performance of the person recognition system.

Fig. 10. DET curves for proposed vs. existing phase-based features.


Fig. 11. Effect of dependency index (DI) of VTEO used in MFCC-VTMP computation on EER.

In particular, the minimum EER (i.e., 12.01 %) is obtained for DI = 4. Thus, the performance is optimized for DI = 4 in this paper. Hence, the remaining experiments in this paper are reported with this choice of dependency index, i.e., DI = 4, for MFCC-VTMP feature extraction.

6.3. Effect of number of subband filters

In this study, we have kept the number of subband filters at 24 as an arbitrary choice, keeping in mind that they are sufficient to cover the audible frequency range. To investigate the effect of the number of subband filters, we conducted an experiment in which the number of subband filters was varied from 20 to 40 for a fixed 12-dimensional feature vector. Fig. 12 shows the performance of MFCCs and the proposed MFCC-VTMP for different numbers of subband filters. It can be observed that the performance of the proposed feature set is better than MFCCs in most cases. In addition, the performance does not deviate significantly w.r.t. the number of subband filters for the same feature set; however, it is relatively better for 24 subband filters, especially for the proposed feature set.

6.4. Effect of degree of polynomial classifier

This experiment was conducted to investigate the performance of the person recognition system for different orders (i.e., degrees) of the polynomial classifier. The experiments were executed on a general-purpose computer (laptop) with the following specifications: Dell Vostro, Core 2 Duo 1.6 GHz, 2.5 GB RAM, Windows XP OS.

Fig. 12. Effect of number of subband filters on EER.


Fig. 13. DET curves for 2nd and 3rd order polynomial classifier.

As discussed in Section 5.2 (Table 3), as the order of the polynomial classifier increases, the algorithm for generating the speaker model and for testing incurs a higher cost in terms of both time complexity and memory space (Campbell et al., 2002). In particular, a speaker model for the 2nd order polynomial classifier occupies around 14.7 kB and takes 1.038 s to compute, while for the same training duration, a speaker model for the 3rd order classifier requires around 144 kB and takes 12.38 s to compute. However, at this computational and storage cost, it gives better performance than its lower-order counterpart. This may be attributed to Cover's theorem, i.e., the high-dimensional polynomial kernel has the capability to separate features efficiently, whereas a lower-order polynomial kernel is not as capable (possibly due to the higher feature occupancy of the lower-dimensional space) (Cover, 1965). In this experiment, person recognition results for the 2nd order and 3rd order polynomial classifiers are plotted as DET curves for 12-dimensional MFCC-VTMP (DI = 4) and conventional MFCCs, as shown in Fig. 13, in which dashed lines show the DET curves for the 3rd order polynomial classifier and solid lines show the DET curves for the 2nd order polynomial classifier. It can be observed from Fig. 13 that the proposed feature set, MFCC-VTMP, gives a lower EER than MFCCs for both the 2nd order and 3rd order polynomial classifiers. In addition, the 3rd order polynomial classifier performs better than the 2nd order polynomial classifier. Furthermore, it is important to note that the proposed feature set, MFCC-VTMP, performs better than MFCCs at all the operating points of the DET curves. The better performance of the 3rd order polynomial classifier is attributed to the reduced feature occupancy in the higher-dimensional expanded feature space. Table 6 shows the EER and min. DCF values for the 2nd and 3rd order polynomial approximations.

Table 6
EER and min. DCF measures for the 2nd and 3rd order polynomial classifiers. The bold numbers indicate relatively better performance.

              2nd order polynomial classifier        3rd order polynomial classifier
              MFCC       MFCC-VTMP (DI = 4)          MFCC       MFCC-VTMP (DI = 4)
EER (%)       12.20      12.01                       8.72       8.65
Min. DCF      0.1214     0.1196                      0.8690     0.8600


Table 7
EER and min. DCF measures for different feature dimensions.

            MFCC                       MFCC-VTMP (DI = 4)
Dimension   EER (%)    Min. DCF        EER (%)    Min. DCF
6           18.17      0.1814          18.20      0.1811
8           15.29      0.1525          15.33      0.1523
10          13.36      0.1327          13.33      0.1329
12          12.20      0.1214          12.01      0.1196
14          11.28      0.1117          11.17      0.1107
16          10.56      0.1041          10.53      0.1039
18          9.93       0.0978          9.96       0.0979
20          9.52       0.0935          9.52       0.0936
22          9.10       0.0895          9.01       0.0900
24          8.28       0.0812          8.20       0.0811

It is evident from Table 6 that the 3rd order polynomial classifier reduces the EER by 3.48 % (absolute) as compared to its 2nd order counterpart.

6.5. Effect of feature dimension

Experiments were conducted by varying the dimension of the feature vector from 6 to 24. The EER obtained for different feature dimensions for MFCCs and the proposed feature set, MFCC-VTMP (DI = 4), is reported in Table 7. It can be observed from Table 7 that, as the dimension of the feature vector increases, the person recognition performance improves. However, high-dimensional feature vectors require higher computational cost and memory usage (Section 5.2.2). For example, for 90 s of training data, training with 12-dimensional feature vectors takes 50.88 s, while training with 24-dimensional feature vectors takes 52.98 s; the corresponding space requirements are 1.56 MB and 2.56 MB, respectively. The improved performance at higher feature dimensions may be due to the fact that, as the dimension of the feature vector increases, the feature (and hence, class) occupancy of the high-dimensional feature space decreases (as discussed in Section 6.4). Hence, the separability between the patterns increases, and thereby the recognition performance improves (as was also observed for the 3rd degree polynomial classifier in Section 6.4). Fig. 14 shows the DET curves for MFCCs and the proposed feature set, MFCC-VTMP, with feature dimensions of 12 and 24.

Fig. 14. DET curves for 12-dimensional and 24-dimensional MFCCs and the proposed feature set (MFCC-VTMP, with DI = 4).


It can be observed from Fig. 14 that the EER is significantly lower for the higher feature dimension. In addition, the proposed feature set, MFCC-VTMP, performs better at most of the operating points of the DET curve. It should be noted that, as the feature dimension increases, the features of a particular class are separated further apart from those of other classes in the feature space (in the sense of orthogonality, the features are uncorrelated). Hence, the feature discrimination power is higher in the higher-dimensional feature space, and the performance of the person recognition system improves. Due to the increased complexity and memory storage for the 24-D feature vector, experimental results are shown for the 12-D feature vector for most of the experiments.

6.6. Comparison with speaker modeling approaches

The Gaussian mixture model-universal background model (GMM-UBM) is a widely used speaker modeling approach. The UBM parameters are estimated on a larger amount of data using the Expectation-Maximization (EM) algorithm, and Maximum A Posteriori (MAP) adaptation is used to estimate the speaker-dependent parameters in the enrollment phase (Reynolds et al., 2000).

The most recent state-of-the-art approach in speaker verification is the i-vector-based system with cosine distance scoring (CDS). This approach captures both speaker and channel variability effectively in a low-dimensional subspace known as the total variability space. The i-vector effectively summarizes an utterance as a low-dimensional representation of the GMM supervector. The procedures for i-vector extraction and CDS are given in Dehak et al. (2011). The use of the cosine kernel as a decision score for speaker verification makes the process faster and less complex than other scoring methods. To train the GMM-UBM and i-vector-based speaker models, we used development data, i.e., hum data from 50 persons (discussed in Section 6.9), consisting of five hours of audio. The GMM is trained with 256 mixture components, and 400-dimensional i-vectors are used in the experiment. The rest of the experimental setup for data collection is the same as discussed in Table 2. DET curves for the different speaker modeling approaches are shown in Fig. 15, from which it can be seen that MFCC-VTMP gives better performance across all the speaker modeling approaches. The % EER values for the different speaker modeling approaches are reported in Table 8. In addition, it is interesting to note that the results are better for the 3rd order polynomial classifier than for GMM-UBM and even the i-vector system. This may be due to the use of relatively little data to build the UBM for both GMM-UBM and the i-vector system (Naik, 2017).

6.7. Effect of noisy or signal degradation conditions

Jabloun et al. observed that TEO has a noise suppression capability and that it performs better for speech recognition in signal-degraded (car noise) conditions (Jabloun et al., 1999). In addition, Bahoura et al. found that TEO is applicable to the speech enhancement problem as well; they evaluated results with real and artificially added noise data (Bahoura and Rouat, 2001). The proposed feature set, MFCC-VTMP, is based on VTEO (which is a generalization of TEO).

Fig. 15. DET curves for different speaker modeling approaches.


Table 8
EER measures (%) for different speaker modeling approaches. The bold numbers indicate relatively better performance.

Feature sets    GMM-UBM    i-vector CDS    Polynomial classifier (order-2)    Polynomial classifier (order-3)
MFCC            13.68      12.56           12.20                              8.72
MFCC-VTMP       12.94      10.43           12.01                              8.65
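For the i-vector column in Table 8, the decision score is the cosine distance between the test i-vector and the target speaker's i-vector; a minimal sketch (assuming the i-vectors themselves have already been extracted by a total-variability front-end, which is not shown here) is:

```python
import numpy as np

def cosine_distance_score(w_target, w_test):
    """Cosine distance scoring (CDS) between two i-vectors."""
    return float(np.dot(w_target, w_test) /
                 (np.linalg.norm(w_target) * np.linalg.norm(w_test) + 1e-12))

# Toy usage with hypothetical 400-dimensional i-vectors, as used in this experiment.
rng = np.random.default_rng(2)
w_spk, w_tst = rng.normal(size=400), rng.normal(size=400)
accept = cosine_distance_score(w_spk, w_tst) > 0.3  # threshold tuned on development data
```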

Hence, to investigate the possible noise suppression capability of VTEO, let us consider white noise taken from the NOISEX-92 database (Varga and Steeneken, 1993). Note that the original sampling rates in NOISEX are 11.98 kHz and 8 kHz, whereas the sampling rate of the humming data is 22.05 kHz. We used the 8 kHz sampled noise and resampled it with the Audacity sound editing tool, which uses the libresample library to take care of the sampling rate mismatch (Libresample; Resample). Fig. 16 shows the autocorrelation function of white noise before and after applying VTEO (for DI = 4). The suppression of the amplitude of the autocorrelation function of white noise after the VTEO operation is clearly evident in Fig. 16. In particular, the maximum amplitude of the autocorrelation function of the noise shown in Fig. 16 is 0.7164, and after applying VTEO with DI = 4, the maximum amplitude of the resulting signal's autocorrelation function is significantly reduced, to 0.0013. By the Wiener-Khinchin theorem, the power spectral density (PSD) is the Fourier transform of the autocorrelation function of a sample function of the random process. In this context, Fig. 17 shows the PSD of a white noise signal before and after applying VTEO (DI = 4). The PSD of white noise after applying VTEO is much lower than the PSD of the original white noise (i.e., noise is also suppressed by VTEO, similar to TEO (Jabloun et al., 1999)). This experimental observation can be analyzed by the following argument.

Let x(n) be the discrete-time humming signal and let noise b(n) be added to it to obtain the noise-corrupted signal y(n) = x(n) + b(n). Then, VTEO is applied to this noise-corrupted signal, i.e., from Eq. (2), we have
$$ \xi\{y(n),d\} = \{x^{2}(n) - x(n-d)x(n+d)\} + \{b^{2}(n) - b(n-d)b(n+d)\} + \{2x(n)b(n) - x(n-d)b(n+d) - x(n+d)b(n-d)\} = \xi\{x(n),d\} + \xi\{b(n),d\} + \xi\{x(n),b(n),d\}, \qquad (19) $$

Fig. 16. Autocorrelation function for white noise before VTEO and after VTEO (with DI = 4).


Fig. 17. Power spectral density (PSD) for white noise before applying VTEO and after applying VTEO (DI = 4).

where $\xi\{x(n), b(n), d\}$ denotes the cross-terms of the VTEO operation. Let us assume that the noise b(n) is wide-sense stationary (WSS). In addition, x(n) and b(n) are statistically independent of each other, i.e., for every dependency index d,
$$ E[x(n)b(n)] = 0,\qquad E[x(n-d)b(n+d)] = 0,\qquad E[x(n+d)b(n-d)] = 0. $$
Therefore, $E[\xi\{x(n), b(n), d\}] = 0$, where $E[\cdot]$ is the expectation operator. From Eq. (19), we have
$$ E[\xi\{y(n),d\}] = E[\xi\{x(n),d\}] + E[\xi\{b(n),d\}] = E[x^{2}(n)] - E[x(n-d)x(n+d)] + E[b^{2}(n)] - E[b(n-d)b(n+d)] = R_{xx}(0) - R_{xx}(2d) + R_{bb}(0) - R_{bb}(2d), \qquad (20) $$

where Rxx(.) and Rbb(.) are autocorrelation functions for signal x(n) and noise b(n), respectively. For DI D 4, autocorrelation values after VTEO operation at 0th instant and 2d D 8th instant are shown in Table 9, where it can be seen that Rbb(0)  Rbb(8), i.e., TagedP;E½ξfbðnÞ; dg ¼ 0 and hence, E½ξfyðnÞ; dg  E½ξfxðnÞ; dg: TagedPIt is observed from Table 9 that amplitude of autocorrelation function for noise at 0th and 8th instants (i.e., lag) cancel each other’s effect. Hence, noise-corrupted signal, y(n), has the only effect due to humming signal x(n). Thus, VTEO suppresses noise to a certain extent. Next, the person recognition experiment was conducted to compare the effectiveness of proposed feature set over state-of-the-art MFCC feature set. The feature dimension for both MFCC and MFCC-VTMP was kept as 12. Different types of noise signals, such as, white noise, HF channel noise, and babble noise are taken from CMU NOISEX’92 database and added to the original humming data Varga and Steeneken (1993). SNR is varied from 5 dB to +15 dB to generate noisy humming data at various SNR levels (i.e., additive noise). In this study, noise-related experiments are conducted in matching condition, i.e., noise is added to both train and test dataset. The resulting DET curves for person recognition performance under these degraded conditions are shown in Fig. 18. It is very important to observe from Fig. 18 that proposed feature set performs better than MFCCs Table 9 Autocorrelation values on various noise signals for dependency index (DI D 4). type of noise

Rbb(0)

Rbb(2d)

Rbb(0) - Rbb(2d)

White Babble HF channel

0.0001 0.0020 0.0910

0.00009 0.00061 0.02380

0.00001 0.00140 0.06720
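The VTEO operator used in this analysis, ξ{x(n), d} = x²(n) − x(n−d)x(n+d), and the autocorrelation comparison behind Fig. 16 and Table 9 can be reproduced with a short sketch; this is illustrative only, since the exact values depend on the noise realization, its amplitude scaling, and the autocorrelation normalization used by the authors (here the noise is scaled below unit amplitude, as for normalized audio).

```python
import numpy as np

def vteo(x, d=4):
    """Variable length Teager Energy Operator: x^2(n) - x(n-d) x(n+d)."""
    return x[d:-d] ** 2 - x[:-2 * d] * x[2 * d:]

def autocorr(x):
    """Biased autocorrelation estimate for non-negative lags."""
    x = x - x.mean()
    return np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)

rng = np.random.default_rng(3)
noise = 0.1 * rng.standard_normal(20_000)   # low-amplitude white noise (|x| < 1)

r_before = autocorr(noise)
r_after = autocorr(vteo(noise, d=4))
print("max |R_bb| before VTEO:", np.abs(r_before).max())
print("max |R_bb| after  VTEO:", np.abs(r_after).max())
print("R_bb(0) - R_bb(2d) of the noise:", r_before[0] - r_before[8])
```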


Fig. 18. DET curves at various SNR levels in presence of (a) white noise, (b) HF channel noise, and (c) babble noise.
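A minimal sketch of how the noisy humming data behind Fig. 18 can be generated, i.e., scaling a noise segment so that the mixture reaches a prescribed SNR (the file handling and the NOISEX-92 noise itself are assumed to be available elsewhere):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to achieve the requested SNR (in dB)."""
    noise = noise[:len(clean)]                      # trim/align the noise segment
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Usage: generate noisy versions at the SNRs used in this experiment.
# for snr in (-5, 0, 5, 10, 15):
#     noisy_hum = add_noise_at_snr(hum_signal, white_noise, snr)
```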

It is important to observe from Fig. 18 that the proposed feature set performs better than MFCCs at all operating points of the DET curves, for all three types of noise and at all SNR levels considered in this experiment (indicating the noise suppression capability of the proposed feature set, which exploits VTEO). In addition, it is also interesting to note that, as the SNR level increases, the person recognition performance improves (especially for the proposed feature set).

6.8. Effect of static and dynamic features

Different studies have reported the use of dynamic (or transitional) information contained in a speech signal. Extracting suitable dynamic, speaker-dependent features has a significant effect on the performance of speech and speaker recognition systems (Furui, 1981). The most popular approach consists of extracting time-derivatives of static feature sets (e.g., MFCCs). Moreover, the humming signal has a melody associated with it. Hence, dynamic features, such as the delta cepstrum (Δ-cepstrum) and the shifted delta cepstrum (SDC), are expected to convey more meaningful temporal (hidden) information in the sequence of feature frames of a person's humming signal. Hence, this experiment was conducted using such dynamic features, concatenated with the state-of-the-art MFCCs (i.e., the static feature set); a similar concatenation was done for the proposed feature set. In particular, 12-dimensional MFCCs were concatenated with 12-dimensional Δ-cepstrum features to form a 24-dimensional (static + dynamic) feature vector; this can be considered a feature-level fusion strategy. A similar concatenation was done to form a 24-dimensional feature vector for MFCC-VTMP.


As the delta-cepstrum captures dynamic variation along time frames, delta (Δ) cepstrum features can capture the tone/melody inside a humming signal (Furui, 1981). First, the static features c (e.g., MFCCs, MFCC-VTMP) are extracted, and then their delta features are computed, i.e.,
$$ \Delta c_i = \begin{cases} \dfrac{\sum_{d=-D}^{+D} d\, c_{i+d}}{\sum_{d=-D}^{+D} d^{2}}, & i = D, \ldots, N_f - 1 - D,\\ \Delta c_{D}, & i = 0, \ldots, D - 1,\\ \Delta c_{N_f - 1 - D}, & i = N_f - D, \ldots, N_f - 1, \end{cases} \qquad (21) $$
where $\Delta c_i$ indicates the $i$th frame delta-cepstrum derived from the static feature c, D is the number of frames considered in the calculation of $\Delta c_i$, and $N_f$ is the total number of frames. From Eq. (21), it is clear that the delta features are nothing but weighted differences of the feature vectors around the current frame, followed by normalization with a constant that depends on the shifting parameter d; in particular, this normalization factor is given by $\sum_{d=-D}^{+D} d^{2}$.

Shifted Δ features capture dynamic variation along the time frames of the humming signal over a longer span. The features derived from the shifted delta contain more dynamic variation, because one shifted delta feature vector accumulates many delta features over a longer time interval of the humming (Calvo et al., 2007). The delta cepstra used in the SDC features are computed using a recurrent expression given by
$$ \Delta c(t + iP) = \frac{\sum_{d=-D}^{+D} d\, c(t + iP + d)}{\sum_{d=-D}^{+D} d^{2}}, \qquad (22) $$
where N is the number of cepstral coefficients in each frame, d is the time advance and delay for the delta computation, P is the time shift between consecutive blocks, and K is the number of blocks whose delta coefficients are concatenated to form the final SDC feature vector, i.e., K is the number of Δ-cepstra used in the SDC calculation. The computation of the SDC feature vector is illustrated in Fig. 19; the SDC vector is formed by stacking these delta vectors for i = 0, ..., K-1.

For the delta cepstrum, D = 2 was selected, and for SDC, the parameters N, d, P, and K were taken as 12, 2, 2, and 5, respectively. Fig. 20 shows the DET plots for MFCC+Δ-cepstrum, MFCC-VTMP+Δ-cepstrum, MFCC+SDC, and MFCC-VTMP+SDC (for DI = 4). In Fig. 20, the DET curves for the SDC-based features are shown with dashed lines, whereas the DET curves for the delta-based features are shown with solid lines.

It is observed from the DET curves that, in both cases, the proposed feature set, MFCC-VTMP, performs better than MFCCs for the majority of the operating points of the DET curve. In addition, it can be observed from Fig. 20 that SDC performs significantly better than the delta-cepstral features. This may be due to the fact that humming of a particular song by a person is a mostly quasi-periodic signal, and the dynamic variation in such a signal is expected to be captured well over a larger time duration, i.e., as dynamic variation over a large number of successive frames of cepstral feature vectors.
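A minimal sketch of the delta-cepstrum of Eq. (21) and the SDC stacking described above (assuming a cepstral matrix C of shape (num_frames, N); edge frames are handled by replication, a close approximation of the boundary rule in Eq. (21)):

```python
import numpy as np

def delta_cepstrum(C, D=2):
    """Delta features per Eq. (21): weighted differences normalized by sum of d^2."""
    num_frames = len(C)
    norm = sum(d * d for d in range(-D, D + 1))
    padded = np.pad(C, ((D, D), (0, 0)), mode="edge")        # replicate edge frames
    delta = np.zeros_like(C, dtype=float)
    for d in range(-D, D + 1):
        delta += d * padded[D + d: D + d + num_frames]
    return delta / norm

def sdc(C, N=12, d=2, P=2, K=5):
    """Shifted delta cepstrum: stack K delta vectors taken every P frames (Eq. (22))."""
    delta = delta_cepstrum(C[:, :N], D=d)
    num_frames = len(C)
    feats = []
    for t in range(num_frames):
        idx = [min(t + i * P, num_frames - 1) for i in range(K)]  # clamp at the end
        feats.append(np.concatenate([delta[j] for j in idx]))
    return np.array(feats)                                    # shape (num_frames, N*K)

# Usage: 24-D static + dynamic vector as in the experiments (12 static + 12 delta).
# static_dynamic = np.hstack([C, delta_cepstrum(C, D=2)])
```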

Fig. 19. A functional block diagram for shifted delta cepstrum computation.


Fig. 20. Performance evaluation on DET curves for static vs. dynamic features.

This, in turn, may be due to the fact that, over a longer time duration (as in the case of SDC), the humming pattern may indicate the manner (which may be considered a person-specific prosodic attribute) in which a person hums a particular song. Furthermore, this dynamic variation in humming is also expected to reflect the pattern associated with the inhalation and exhalation of a particular person's respiratory system. The gap between the DET curves for the state-of-the-art MFCC feature set and the proposed feature set is larger in the case of delta-based features than SDC-based features. This may be due to the fact that the phase of the Fourier transform of a signal is closely associated with the relative time of occurrence of samples in a hum signal; thus, the delta cepstrum may capture relative phase information in humming better, due to its shorter-duration time derivatives. Table 10 shows the EER values obtained in this experiment. It is observed from Table 10 that the improvement in recognition performance of the SDC-based features over the delta-based features is around 0.5 % in EER for both feature sets, MFCC and MFCC-VTMP.

Table 10
EER and min. DCF measures of MFCC and MFCC-VTMP features for delta and shifted delta cepstrum (SDC). The bold numbers indicate relatively better performance.

Feature sets                               EER (%)    Min. DCF
Static features
  MFCC                                     12.20      0.1214
  MFCC-VTMP (DI = 4)                       12.01      0.1196
Static + dynamic features
  MFCC + Δ                                 11.92      0.1187
  MFCC-VTMP (DI = 4) + Δ                   11.77      0.1169
  MFCC + shifted Δ                         11.42      0.1134
  MFCC-VTMP (DI = 4) + shifted Δ           11.34      0.1129

6.9. Effect of intersession variability

Intersession variability should be considered while evaluating a person recognition system. To that effect, we used a training set together with six different sessions of test recordings from 50 subjects (Patel, 2012). These 50 subjects are different from the previous 100 subjects (whose details are reported in Section 4). This database has been prepared exclusively for analyzing the intersession variability effect, with an experimental setup for data collection similar to that discussed in Section 4. Out of the six test sessions, the first session of test recordings (i.e., test session 1) was recorded along with the training session; hence, this intra-session condition is expected to give the best performance. The performance w.r.t. intersession variability is shown in Fig. 21.


Fig. 21. Effect of intersession variability on % EER.

It can be observed that the proposed feature set performs better than MFCCs w.r.t. intersession variability as well, indicating the robustness of the proposed feature set to session variability.

7. Summary and conclusions

In this study, perceptually meaningful (due to mel warping) phase information, in addition to the magnitude spectrum, of the humming signal is exploited to derive a novel feature set, namely, MFCC-VTMP, for the person recognition task. The effectiveness of the proposed feature set is evaluated under various experimental evaluation factors, such as feature dimension, feature discrimination power, comparison with existing phase-based features, order of the polynomial classifier, noisy (degraded) conditions, use of static vs. dynamic features, different speaker modeling approaches, and intersession variability. The proposed feature set is found to consistently perform better than the state-of-the-art MFCC feature set for all the evaluation factors considered in this paper. It has been observed that phase information alone does not convey significant person-specific information for humming. However, when the magnitude and phase spectra are combined, the performance is better than with the magnitude spectrum alone, indicating that the phase spectrum carries information complementary to the magnitude spectrum.

One limitation of the present work is that the performance is optimized w.r.t. the dependency index (DI) in VTEO. Future work can be directed towards designing a framework to investigate the optimal DI for a given database. For example, a score-level fusion of various systems (each having a different value of DI) could be performed to find the optimum DI, or DI could be predicted by some optimality criterion for a given database. In addition, in this work, the phase information is based on the Fourier transform phase. Another approach to capture phase information could be to use the instantaneous analytic phase obtained from analytic signal generation; this could be one future research direction for exploiting instantaneous phase information to improve the performance of the person recognition system. We would also like to explore an allpass filter-based approach for feature extraction as a new feature representation. In this paper, we have not considered feature post-processing such as the Relative Spectral Transform (RASTA) and Cepstral Mean and Variance Normalization (CMVN) to improve the performance in noisy conditions; we plan to explore this in future work. In addition, we did not explore the selective use of the spectrum in a specific region (such as the lower, middle, or higher band) for possible feature discrimination, as opposed to the entire available bandwidth, i.e., $F_s/2$; this could be another interesting research direction to pursue further for this problem.

Acknowledgments

The authors would like to thank the authorities of DA-IICT, Gandhinagar, and the subjects for their kind support and co-operation in carrying out this research work.


References TagedPAlsteris, L.D., Paliwal, K.K., 2004. ASR on speech reconstructed from short-time Fourier phase spectra. In: Proceedings of 8th International Conference on Spoken Language Process INTERSPEECH 2004 - ICSLP. Jeju Island, Korea, pp. 565–568. TagedPAlsteris, L.D., Paliwal, K.K., 2007. Iterative reconstruction of speech from short-time Fourier transform phase and magnitude spectra. Comput. Speech Lang. 21 (1), 174–186. TagedPAmino, K., Arai, T., 2009. Speaker-dependent characteristics of the nasals. Forensic Sci. Int. 185 (1), 21–28. TagedPAndersson, K., Schalen, L., 1998. Etiology and treatment of psychogenic voice disorder: results of a follow-up study of thirty patients. J. Voice 12 (1), 96–106. TagedPLibresample. http://wiki.audacityteam.org/wiki/Libresample. (Last Accessed on April 5, 2017). TagedPBahoura, M., Rouat, J., 2001. Wavelet speech enhancement based on the Teager energy operator. IEEE Signal Process. Lett. 8 (1), 10–12. TagedPCalvo, J.R., Fernandez, R., Hernandez, G., 2007. Application of shifted delta cepstral features in speaker verification. In: Proceedings of INTERSPEECH. Antwerp, Belgium, pp. 734–737. TagedPCampbell, W.M., Assaleh, K.T., 1999. Polynomial classifier techniques for speaker verification. In: Proceedings of International Conference on Acoustics, Speech and Signal Process. (ICASSP). Phoenix, Arizona, USA, pp. 321–324. TagedPCampbell, W.M., Assaleh, K.T., Broun, C.C., 2002. Speaker recognition with polynomial classifiers. IEEE Trans. Speech Audio Process. 10 (4), 205–212. TagedPCampbell, W.M., Campbell, J.P., Reynolds, D.A., Singer, E., Torres-Carrasquillo, P.A., 2006. Support vector machines for speaker and language recognition. Comput. Speech Lang. 20 (23), 210–229. TagedPCarter, A.K., Dillon, C.M., Harnsberger, J.D., Herman, R., Clarke, C.M., Pisoni, D.B., Hern, L.R., 2000. A Multi-Talker Dialect Corpus of Spoken American English: An Initial Report. Technical Report. Speech Research Laboratory, Indiana University, Research on Spoken Language Processing, Progress Report. pp. 409–413. TagedPChhayani, N.H., Patil, H.A., 2013. Development of corpora for person recognition using humming, singing and speech. In: Proceedings of 2013 International Conference on Oriental COCOSDA. Gurgaon, India, pp. 1–6. TagedPCover, T.M., 1965. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14 (3), 326–334. TagedPDang, J., Honda, K., 1996. Acoustic characteristics of the human paranasal sinuses derived from transmission characteristic measurement and morphological observation. J. Acoust. Soc. Am. 100 (5), 3374–3383. TagedPDavis, S., Mermelstein, P., 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28 (4), 357–366. TagedPDegottex, G., Erro, D., 2014. A measure of phase randomness for the harmonic model in speech synthesis. In: Proceedings of INTERSPEECH. Singapore, pp. 1638–1642. TagedPDehak, N., 2009. Discriminative and Generative Approaches for Long-and Short-Term Speaker Characteristics Modeling: Application to Speaker  Verification. Ecole de technologie superieure, Montreal (Ph.D. Thesis). TagedPDehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech Lang. Process. 19 (4), 788–798. TagedPFurui, S., 1981. Cepstral analysis technique for automatic speaker verification. 
IEEE Trans. Acoust. Speech Signal Process. 29 (2), 254–272. TagedPGhias, A., Logan, J., Chamberlin, D., Smith, B.C., 1995. Query-by-humming: musical information retrieval in an audio database. In: Proceedings of 3rd ACM International Conference on Multimedia. San Francisco, CA, USA, pp. 231–236. TagedPHayes, M., Lim, J., Oppenheim, A., 1980. Signal reconstruction from phase or magnitude. IEEE Trans. Acoust. Speech Signal Process. 28 (6), 672–680. TagedPHegde, R.M., Murthy, H.A., Rao, G.R., 2004. Application of the modified group delay function to speaker identification and discrimination. In: Proceedings of International Conference on Acoustics, Speech and Signal Process. (ICASSP), Vol. 1. Montreal, Quebec, Canada, pp. I–517. TagedPJabloun, F., Cetin, ¸ A.E., Erzin, E., 1999. Teager energy based feature parameters for speech recognition in car noise. IEEE Signal Process. Lett. 6 (10), 259–261. TagedPJang, J.R., Lee, H., 2008. A general framework of progressive filtering and its application to query by singing/humming. IEEE Trans. Audio Speech Lang. Process. 16 (2), 350–358. TagedPJin, M., Kim, J., Yoo, C.D., 2009. Humming-based human verification and identification. In: Proceedings of International Conference on Acoustics, Speech and Signal Process. (ICASSP). Taipei, Taiwan, pp. 1453–1456. TagedPKaiser, J.F., 1990. On a simple algorithm to calculate the energy’ of a signal. In: Proceedings of International Conference on Acoustics, Speech and Signal Process. (ICASSP). Albuquerque, New Mexico, USA, pp. 381–384. TagedPKleinschmidt, T., Sridharan, S., Mason, M., 2011. The use of phase in complex spectrum subtraction for robust speech recognition. Comput. Speech Lang. 25 (3), 585–600. TagedPLu, L., Seide, F., 2008. Mobile ringtone search through query by humming. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP). Las Vegas, Nevada, USA, pp. 2157–2160. TagedPMadhavi, M.C., 2011. Person Recognition from Their Hum. Dhirubhai Ambani Institute of Information and Communication Technology (DAIICT) (M. Tech. Thesis). TagedPMadhavi, M.C., Patil, H.A., 2014. Exploiting variable length Teager energy operator in melcepstral features for person recognition from humming. In: Proceedings of the 9th International Symposium on Chinese Spoken Lang. Process. (ICSLP). Singapore, pp. 624–628. TagedPMaia, R., Akamine, M., Gales, M.J.F., 2012. Complex cepstrum as phase information in statistical parametric speech synthesis. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan, pp. 4581–4584.


Mallat, S., 2009. A Wavelet Tour of Signal Processing: The Sparse Way, 3rd edition. Academic Press.
Martin, A.F., Doddington, G.R., Kamm, T., Ordowski, M., Przybocki, M.A., 1997. The DET curve in assessment of detection task performance. In: Proceedings of the 5th European Conference on Speech Communication and Technology, EUROSPEECH. Rhodes, Greece, pp. 1895–1898.
Masataka, N., 1992. Pitch characteristics of Japanese maternal speech to infants. J. Child Lang. 19 (2), 213–223.
Mowlaee, P., Saeidi, R., Stylianou, Y., 2014. Phase importance in speech processing applications. In: Proceedings of INTERSPEECH 2014. Singapore, pp. 1623–1627.
Mowlaee, P., Saeidi, R., Stylianou, Y., 2016. Advances in phase-aware signal processing in speech communication. Speech Commun. 81, 1–29.
Murthy, H.A., Gadde, V., 2003. The modified group delay function and its application to phoneme recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP). Hong Kong, pp. 68–71.
Murthy, H.A., Yegnanarayana, B., 1991. Formant extraction from group delay function. Speech Commun. 10 (3), 209–221.
Murty, K.S.R., Yegnanarayana, B., 2006. Combining evidence from residual phase and MFCC features for speaker recognition. IEEE Signal Process. Lett. 13 (1), 52–55.
Naik, A., 2017. I-vector based speaker and person recognition. Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT) (M.Tech. Thesis). Last accessed on April 20, 2017.
Nakagawa, S., Wang, L., Ohtsuka, S., 2012. Speaker identification and verification by combining MFCC and phase information. IEEE Trans. Audio Speech Lang. Process. 20 (4), 1085–1095.
Nicholson, S., Milner, B.P., Cox, S.J., 1997. Evaluating feature set performance using the F-ratio and J-measures. In: Proceedings of the 5th European Conference on Speech Communication and Technology, EUROSPEECH. Rhodes, Greece, pp. 413–416.
Ohm, G.S., 1843. Über die Definition des Tones, nebst daran geknüpfter Theorie der Sirene und ähnlicher tonbildender Vorrichtungen. Ann. Phys. 135 (8), 513–565.
Oppenheim, A.V., Willsky, A.S., Nawab, S.H., 1997. Signals and Systems, 2nd edition. Prentice-Hall, Inc.
Paliwal, K.K., Alsteris, L., 2003a. Usefulness of phase in speech processing. In: Proceedings of IPSJ Spoken Language Processing Workshop. Gifu, Japan, pp. 1–6.
Paliwal, K.K., Alsteris, L.D., 2003b. Usefulness of phase spectrum in human speech perception. In: Proceedings of European Conference on Speech Communication and Technology, EUROSPEECH 2003. Geneva, Switzerland, pp. 2117–2120.
Paliwal, K.K., Alsteris, L.D., 2005. On the usefulness of STFT phase spectrum in human listening tests. Speech Commun. 45 (2), 153–170.
Paliwal, K.K., Atal, B.S., 2003. Frequency-related representation of speech. In: Proceedings of European Conference on Speech Communication and Technology, EUROSPEECH. Geneva, Switzerland, pp. 65–68.
Patel, C., 2012. Person recognition from humming with intersession variability. Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT) (M.Tech. Thesis).
Patil, H.A., 2005. Speaker Recognition in Indian Languages: A Feature Based Approach. Department of Electrical Engineering, Indian Institute of Technology (IIT) Kharagpur, India (Ph.D. Thesis).
Patil, H.A., Jain, P.K., Jain, R., 2009. A novel approach to identification of speakers from their hum. In: Proceedings of International Conference on Advances in Pattern Recognition (ICAPR). Kolkata, India, pp. 167–170.
Patil, H.A., Jain, R., Jain, P.K., 2008. Identification of speakers from their hum. In: Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD). Brno, Czech Republic, pp. 461–468.
Patil, H.A., Madhavi, M.C., 2012. Significance of magnitude and phase information via VTEO for humming based biometrics. In: Proceedings of the 5th IAPR International Conference on Biometrics (ICB). New Delhi, India, pp. 372–377.
Patil, H.A., Madhavi, M.C., Parhi, K.K., 2011. Combining evidence from spectral and source-like features for person recognition from humming. In: Proceedings of INTERSPEECH 2011. Florence, Italy, pp. 369–372.
Patil, H.A., Parhi, K.K., 2010a. Development of TEO phase for speaker recognition. In: Proceedings of International Conference on Signal Processing and Communications (SPCOM), 2010. Bangalore, India, pp. 1–5.
Patil, H.A., Parhi, K.K., 2010b. Novel variable length Teager energy based features for person recognition from their hum. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP). Dallas, Texas, USA, pp. 4526–4529.
Rao, K.R., Yip, P., 1989. Discrete Cosine Transform: Algorithms, Advantages and Applications, 1st edition. Academic Press.
Reddy, P.R., Rout, K., Murty, K.S.R., 2014. Query word retrieval from continuous speech using GMM posteriorgrams. In: Proceedings of International Conference on Signal Processing and Communication (SPCOM). Bangalore, India, pp. 1–6.
Reynolds, D.A., Quatieri, T.F., Dunn, R.B., 2000. Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 10 (1–3), 19–41.
Resample. http://wiki.audacityteam.org/wiki/Resample (last accessed on April 5, 2017).
Rhie, J., Romanowicz, B., 2004. Excitation of earth's continuous free oscillations by atmosphere-ocean-seafloor coupling. Nature 431 (7008), 552–556.
Seelamantula, C.S., 2016. Phase-encoded speech spectrograms. In: Proceedings of INTERSPEECH. San Francisco, CA, USA, pp. 1775–1779.
Shenoy, B.A., Mulleti, S., Seelamantula, C.S., 2016. Exact phase retrieval in principal shift-invariant spaces. IEEE Trans. Signal Process. 64 (2), 406–416.
Shenoy, B.A., Seelamantula, C.S., 2015. Exact phase retrieval for a class of 2-D parametric signals. IEEE Trans. Signal Process. 63 (1), 90–103.
Sjölander, K., Beskow, J., 2009. WaveSurfer [Computer program], Version 1.8.5. Last accessed on May 5, 2017.
Smith, Z.M., Delgutte, B., Oxenham, A.J., 2002. Chimaeric sounds reveal dichotomies in auditory perception. Nature 416 (6876), 87–90.
Suzuki, N., Takeuchi, Y., Ishii, K., Okada, M., 2003. Effects of echoic mimicry using hummed sounds on human-computer interaction. Speech Commun. 40 (4), 559–573.
Teager, H., 1980. Some observations on oral air flow during phonation. IEEE Trans. Acoust. Speech Signal Process. 28 (5), 599–601.


Tomar, V., Patil, H.A., 2008. On the development of variable length Teager energy operator (VTEO). In: Proceedings of INTERSPEECH. Brisbane, Australia, pp. 1056–1059.
Umesh, S., Cohen, L., Marinovic, N., Nelson, D.J., 1999. Scale transform in speech analysis. IEEE Trans. Audio Speech Lang. Process. 7 (1), 40–45.
Varga, A., Steeneken, H.J., 1993. Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun. 12 (3), 247–251.
Von Helmholtz, H., 1912. On the Sensations of Tone as a Physiological Basis for the Theory of Music. Longmans, Green.
Wang, L., Minami, K., Yamamoto, K., Nakagawa, S., 2010. Speaker identification by combining MFCC and phase information in noisy environments. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP). Dallas, Texas, USA, pp. 4502–4505.
Wolfram, W., Schilling, N., 1998. American English, 1st edition. Malden, MA: Blackwell.
Zhu, D., Paliwal, K.K., 2004. Product of power spectrum and group delay function for speech recognition. In: Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP). Montreal, Quebec, Canada, pp. 125–128.
