ARTICLE IN PRESS
Signal Processing 87 (2007) 1504–1522 www.elsevier.com/locate/sigpro
The fan-chirp transform for non-stationary harmonic signals Luis Weruagaa,, Ma´rian Ke´pesib a
Commission for Scientific Visualisation, Austrian Academy of Sciences, Donau-City Strasse 1, 1220 Vienna, Austria Signal Processing and Speech Communication Laboratory, Technical University of Graz, Inffeldgasse 12, 8010 Graz, Austria
b
Received 30 August 2006; received in revised form 4 December 2006; accepted 4 January 2007 Available online 16 January 2007
Abstract This paper presents a novel transform related to the framework of warping operators when the continuous time warping mapping is a second-order polynomial. This case is proven in the paper to be the only one from the aforementioned group that marginalizes the Wigner distribution along line paths, in particular, with a fan geometry. The properties and attributes of the fan-chirp transform (FChT) along with the analytical characterization of harmonically related Gaussian chirplets bear especial relevance in the paper. This analysis shows that for chirp-periodic signals the FChT can reach the limit of the time–frequency (TF) uncertainty principle, while simultaneously keeping the cross-terms at minimum level. The formulation of the fast digital computation of the FChT is also provided in the paper. Two practical scenarios—the analysis of speech with natural intonation and bat ultrasound—validate the theoretical developments and shows manifestly the eloquent competitive performance of the new transform. r 2007 Elsevier B.V. All rights reserved. Keywords: Fan-chirp transform (FChT); Time–frequency (TF); Fan marginalization; Harmonically related chirplets
1. Introduction Identifying frequency-modulated (FM) sinusoids or chirps in a signal is known to be a tough challenge for classical linear analysis. Most of the alternative solutions for this problem has come from the field of time–frequency (TF) analysis, mainly in the form of Cohen’s class bilinear time–frequency distributions (TFD) [1]. However, the long debate on the relevance and meaning of the cross-terms [2] or the dilemma on the need of positive TFDs [3,4] have moved the attention towards different approaches, such as matching Corresponding author. Tel.: +43 1515816707.
E-mail addresses:
[email protected],
[email protected] (L. Weruaga),
[email protected] (M. Ke´pesi).
pursuits [5] over redundant chirplet dictionaries [6–8], or chirp-based transforms that marginalize the Wigner–Ville distribution according to certain geometries [10–14]. Although the redundant dictionary remains the most popular method for chirplet decomposition so far, chirp-based transforms are especially interesting because they can provide a broader picture of the TF content of the signal. Several signal processing transforms are related to the term ‘‘chirp’’: the Chirp-Z [9] and the ‘‘Chirplet’’ transforms [10] contain explicitly the term, whilst the fractional Fourier transform [11,12] and the warped-time operators [13,14] are conceptually related to it. The Chirp-Z transform [9] is an efficient algorithm for calculating discrete Fourier transforms (DFTs) at frequencies not related to a
0165-1684/$ - see front matter r 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.sigpro.2007.01.006
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
power-of-two fraction of the sampling bandwidth. Although a discrete-time chirp signal is used in the mechanism, the usage of this algorithm is not actually related to the context of this paper. The first relevant chirp-based transform, the chirplet transform (CT) [10], is described by the inner product between the signal and a chirplet as Z 1 X bð f Þ ¼ xðtÞ gR;t;f ;b ðtÞ dt, (1)
a
1505
b f
f
t
t
1
where t is time, xðtÞ is the analysis signal, denotes complex conjugate, and gR;t;f ;b ðtÞ is a Gaussian chirplet of unit energy
c
d f
f
2
eð1=2ÞððttÞ=RÞ j2pð f ðttÞþð1=2ÞbðttÞ2 Þ p ffiffiffiffiffiffiffi gR;t;f ;b ðtÞ ¼ . e 4 pR2
(2)
Here n is the instantaneous frequency (IF) at t ¼ t; b the frequency variation rate and R the time spread. By disregarding the Gaussian term and setting t ¼ 0 for simplicity, it is easy to deduce that the squared magnitude of the Chirp(let) transform is Z 1 jX b ð f Þj2 ¼ WDx ðt; f btÞ dt, (3) 1
where WDx ðt; f Þ is the Wigner–Ville distribution [2] of xðtÞ (henceforth Wigner distribution, WD). Thus, the CT yields the ‘‘slanted’’ marginal of the WD. Another well-established chirp-based transform is the fractional Fourier transform (FrFT) [11] Z 1 X y ðuÞ ¼ xðvÞ K y ðv; uÞ dv, (4) 1
where K y ðv; uÞ is the transformation kernel [12]. The FrFT involves products with linear-FM chirps in such a way that it yields the marginalization of the WD along the angular direction y, i.e., Z 1 2 jX y ðuÞj ¼ WDx ðcu sv; su þ cvÞ dv, (5) 1
where c ¼ cos y and s ¼ sin y. The last chirp-based transform considered here is the warping operator [13–15], defined as Z 1 pffiffiffiffiffiffiffiffiffiffiffiffi X cðÞ ð f Þ ¼ xðcðtÞÞ jc0 ðtÞj ej2pft dt, (6) 1
where cðtÞ is a continuous differentiable time mapping and c0 ðtÞ its derivative. An equivalent formulation to this warped-time Fourier transform is Z 1 pffiffiffiffiffiffiffiffiffiffiffiffi X ð f ; fðÞÞ ¼ xðtÞ jf0 ðtÞjej2pf fðtÞ dt, (7) 1
t
t
Fig. 1. Marginalization of the Wigner distribution along straight line paths: (a) Fourier, (b) chirplet, (c) fractional Fourier and (d) fan-chirp transforms. The dark area represents the TF energy of a Gaussian chirplet.
where fðtÞ ¼ c1 ðtÞ.1 Eq. (7) represents the inner product of signal xðtÞ with a non-linear chirp, which is the basic mechanism in TF redundant dictionaries for chirp-based signal decomposition [8]. The warped-time framework has also given rise to new TF distributions, such as the generalized warped Cohen’s class (GWCC) [16], which introduces new ways of interpreting the TF content. This paper tackles the warping operator (7) constrained to the mapping fðtÞ being a secondorder polynomial. This case is proven here to marginalize the WD along straight line paths. Thus, it can be seamlessly compared against other linear chirp-based transforms, such as the CT and FrFT (see Fig. 1 for an introductory illustration), and studied with more detail apart from the general framework. The organization of the paper follows: Section 2 introduces the analysis and synthesis equations of the proposed fan-chirp transform (FChT); in Section 3 its main basic attributes, the marginalization geometry and representation of chirp-periodic signals are studied; Section 4 contains a discussion on previous works related to the FChT; Section 5 addresses the estimation of the only userdefined parameter of the FChT, the chirp rate, in 1 By doing the variable change t ¼ fðtÞ in (6), this integral pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi R1 becomes X cðÞ ð f Þ ¼ 1 xðtÞ jf0 ðfðtÞÞjej2pf fðtÞ dt, that is similar to (7).
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1506
order to better match the TF geometry of the analysis signal; Section 6 elaborates on the practical aspects of the digital implementation; Section 7 presents the performance evaluation of the FChT on synthetic and real scenarios, namely, the analysis of natural speech and sound of mammals; the conclusions close the paper.
xðtÞ can be recovered from its FChT as Z 1 qffiffiffiffiffiffiffiffiffiffiffiffiffi X ð f ; aÞ jf0a ðtÞj ej2pf fa ðtÞ df . xðtÞ ¼
(14)
1
Proof. By replacing (8) in (14), (14) becomes Z 1 qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi xðtÞ jf0a ðtÞf0a ðtÞj dðfa ðtÞ fa ðtÞÞ dt. zðtÞ ¼ 1
(15)
2. The FChT The analysis formula of the FChT of signal xðtÞ is defined as Z 1 qffiffiffiffiffiffiffiffiffiffiffiffiffi X ð f ; aÞ9 xðtÞ jf0a ðtÞj ej2pf fa ðtÞ dt, (8) 1
Eq. (15) can be simplified by using the time-scaling property of the Dirac delta X dðt ti Þ dðmðtÞÞ ¼ , (16) jm0 ðti Þj i
where t is time, f is frequency,2 and fa ðtÞ is the second-order polynomial controlled by the so-called chirp rate a
where mðtÞ is a continuous differentiable function and ti is such that mðti Þ ¼ 0. Since here mðtÞ ¼ fa ðtÞ fa ðtÞ is a second-order polynomial, two roots exist. After simplifications (15) becomes
fa ðtÞ9ð1 þ 12 atÞt.
zðtÞ ¼ xðtÞ þ xðt 2=aÞ.
(9)
The FChT involves the inner product between xðtÞ and the complex signals pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi (10) xðt; f ; aÞ ¼ j1 þ atjej2pf ð1þð1=2ÞatÞt which are chirps whose IF, defined as the time derivative of the exponent, varies linearly over time dfa ðtÞ ¼ ð1 þ atÞf . (11) dt According to (11), the sign of the IF of all basis components switches at the instant nðtÞ ¼ f
1 (12) a which is called the ‘‘focal point’’ instant, that is, all basis components meet at the point of the Wigner plane ht; f i ¼ h1=a; 0i.3 For the sake of simplicity and without loss of generality, henceforth the chirp rate a is considered positive.
t¼
2.1. Synthesis formula Lemma 2.1. If xðtÞ fulfils the time support property xðtÞ ¼ 0 2
1 for to . a
(17)
The synthesis formula (14) delivers the input signal overlayed with itself mirrored around the focal time instant. Thus, in order to achieve perfect synthesis, condition (13) has to be met. & Fig. 2 contains a toy example illustrating the nature of the FChT synthesis: the signal on top is the analysis signal,4 and the one at the bottom the reconstructed signal; the focal time instant is marked with an asterisk. Since the signal is nonzero beyond the focal point, that half of the time axis results overlayed over the half of interest (in the example t410) in the reconstruction. Thus, in order to prevent this time aliasing, the focal point must be found outside the time support of the signal. 2.2. FChT as warped-time Fourier By doing the variable change t ¼ fa ðtÞ in the analysis equation, (8) becomes Z 1 ðx þ ðtÞ þ x ðtÞÞ ej2pf t dt, (18) X ð f ; aÞ ¼ 1=2a
(13)
As will be shown, f has actually a connotation with IF at t ¼ 0. 3 The complex signal (10) is strictly non-analytic because of the pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi amplitude modulation with the normalization term j1 þ atj. Furthermore, since its IF crosses at the focal point (12), the chirp sweeps the entire frequency axis.
where signals x þ ðtÞ and x ðtÞ are given by5 1 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi xðc x ðtÞ ¼ p a ðtÞÞ. 4 j1 þ atj
(19)
4 In order to evaluate numerically the integral (8), (14), the signals in the example are oversampled discrete-time signals. 5 Hence, sub/superindex describes both cases þ and .
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1507
1
0
−1 −20
−15
−10
−5
0 t
5
10
15
20
−15
−10
−5
0 t
5
10
15
20
1
0
−1 −20
Fig. 2. Synthesis of a signal from its fan-chirp transform: (a) original signal, (b) reconstructed signal. The chirp rate is a ¼ 0:1. Focal point 1=a marked with asterisk.
Here cþ a ðtÞ and ca ðtÞ correspond to the double solution of the inverse mapping of fa ðtÞ, i.e.,
1 c a ðtÞ ¼ a
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 þ 2at . a
(20)
Eq. (18) describes the FChT as the Fourier transform of a warped-time version of the analysis signal. Since both components x þ ðtÞ and x ðtÞ overlap in time, the synthesis conditions (13) in Lemma 2.1 can be also stated as: xðtÞ can be obtained from its FChT if x ðtÞ ¼ 0. The most relevant aspect of (18) is the fact that the FChT can be evaluated with the Fourier transform mechanism. This fact clearly points out to its fast discrete-time evaluation: the signal is prewarped accordingly and the FFT algorithm applied thereafter. These aspects are addressed in detail in Section 6. The last paragraphs of the seminal work on warped wavelets [13] anticipates the high potential of that approach: ‘‘In the future, we can imagine the choice of a warping operator being made automatically to best fit a certain style of analysis to a given set of data.’’ In the FChT, that time-warping law is governed by chirp rate a.
3. Properties In this section we derive the most relevant property of the FChT regarding the marginalization of the WD; Parseval’s theorem along with other basic properties are also derived; the TF resolution over harmonically related chirplets covers the final part of the section. 3.1. TF marginalization geometry The analysis formula (8) can be arranged as Z 1 X ð f ; aÞ ¼ cðtÞ ej2pf fa ðtÞ dt, (21) 1
where cðtÞ is the normalized signal, pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi wðtÞ ¼ j1 þ atjxðtÞ.
(22)
Lemma 3.1. The squared magnitude of the FChT is equal to the marginalization of the WD of the normalized signal cðtÞ according to a fan geometry, that is, Z 1 jX ð f ; aÞj2 ¼ WDw ðt; ð1 þ atÞf Þ dt. (23) 1
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1508
Proof. With the variable change t ¼ fa ðtÞ, (21) can be written as Z 1 X ð f ; aÞ ¼ (24) w ðtÞ ej2pf t dt,
t v
1
where w ðtÞ ¼ w þ ðtÞ þ w ðtÞ, w ðtÞ ¼
wðc a ðtÞÞ uðt þ 1=ð2aÞÞ. f0a ðc a ðtÞÞ
1
1 −
1 −
0
t
Fig. 3. The FChT power spectrum is equal to the fan marginalization of the Wigner distribution from the focal point ðt; f Þ ¼ ð1=a; 0Þ.
graphically explainable by placing the focal point on the positive half of the time axis.
where WDw ðt; f Þ ¼
f
(25b)
Supported on the Fourier-based formulation (24), the squared magnitude of the FChT of xðtÞ is equal to the frequency marginal of the WD of the warped signal w ðtÞ, i.e., Z 1 2 jX ð f ; aÞj ¼ WDw ðt; f Þ dt, (26a) Z
f
(25a)
t t w t þ w t ej2pf t dt. 2 2 1 1
(26b) The following variable change in the double integral (26): 1 v 1 v þ fa u , (27a) t ¼ fa u þ 2 2 2 2 v v t ¼ fa u þ fa u 2 2 ¼ ð1 þ auÞ v ð27bÞ gives rise to the following Jacobian: qðt; tÞ v 0 v 0 qðu; vÞ ¼ fa u þ 2 fa u 2 .
(28)
Given that c a ðfa ðtÞÞ ¼ t, the double integral (26) after simplifications becomes equal to the proposition (23). & Fig. 3 illustrates the result of Lemma 3.1. The image corresponds to the WD of a signal composed of harmonically related chirplets, where the dark stripes represent the non-stationary TF energy (for the sake of simplicity, the cross-terms are not depicted). When the structure is ‘‘illuminated’’ from the TF point h1=a; 0i, the resulting projection gives rise to the FChT power spectrum for chirp rate a. The fan marginalization achieves its finest representation when the focal point is h1=g; 0i. The Fourier analysis corresponds to the ‘‘light’’ source located at the infinite. Negative chirp rates are
Corollary 3.2. The FChT is the only warped-time Fourier transform (7) that marginalizes the WD along line paths. Proof. Note first that the proof of Lemma 3.1 has been conducted without expanding fa ðtÞ, except for the final integral variable (27b). Thus, in order for the general warping operator to marginalize the WD along line paths, the general mapping fðtÞ is to fulfil in (27b) v v f uþ f u ¼ nðuÞ v; u; v 2 R, (29) 2 2 where nðuÞ is a u-dependent function representing the mentioned line paths. Since (29) must hold for any value of v (and u), by setting v ¼ 0, the line paths turn out to be nðuÞ ¼ f0 ðuÞ. Given that (29) must hold for any v, it is simple to deduce that the only function fðtÞ fulfilling that condition is a second-order polynomial, which can be univocally written as fðtÞ ¼ 1 þ 12 aðt tÞ ðt tÞ þ Z. (30) Here t and Z are constants playing simply the role of time and phase shifts, respectively. Then, the TF line ht; f0 ðtÞf i represents the marginalization path. & 3.2. Basic properties A main characteristic of the FChT is its variant nature respect to time shift and time scale, that is, the FChT of xðt t0 Þ and xðctÞ yield complex expressions that cannot be intuitively related to the
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522 Table 1 Basic properties of the fan-chirp transform
1 2 3 4 5 6 7 8
Signal
FChT
x ðtÞ xð2=a tÞ xðtÞ xðtÞ is real xðtÞ is imaginary axðtÞ þ byðtÞ xðtÞ ej2pnfa ðtÞ xðtÞ yðtÞ p ffiffiffiffiffiffiffiffiffi
X ðf ; aÞ X ð f ; aÞ X ðf ; aÞ X ð f ; aÞ ¼ X ðf ; aÞ X ð f ; aÞ ¼ X ðf ; aÞ aX ð f ; aÞ þ bY ð f ; aÞ X ð f n; aÞ X ð f ; aÞnY ð f ; aÞ
j1þatj
FChT of xðtÞ. The time-variant characteristic can be easily understood from the fan geometry depicted in Fig. 3. Additional basic properties of the FChT, directly deduced from the analysis formula, are presented in Table 1, such as the linearity (No. 6), the chirp modulation (No. 7) and the windowing theorem (No. 8). For a ¼ 0 all FChT properties turn out properties of the Fourier transform (except No. 2, which results meaningless). Given that perfect synthesis is constrained to finite-time support conditions, the FChT basis (10) is not strictly orthogonal within the whole time domain. Thus, Parseval’s Theorem does not strictly hold. This fact can be intuitively argued by the ‘‘sink’’ effect around the singular focal point. Furthermore, the ambiguity in the reconstruction, which implies that two different signals, with different energy values, can have the same FChT (see property No. 2 and Fig. 2), clearly stresses that limitation. However, if the signal has finite time support according to (13), Parseval’s theorem Z 1 Z 1 xðtÞ y ðtÞ dt ¼ X ð f ; aÞY ð f ; aÞ df (31) 1=a
1
does hold regardless of the chirp rate a. If yðtÞ ¼ xðtÞ, (31) holds the equivalence between the energy in the time and frequency domains.
energy of the kth harmonic. Assuming that the evolution of the IF of any harmonic follows a linear trajectory, the fundamental phase results in Z t 1 (33) f 0 ð1 þ gtÞ dt f 0 1 þ gt t, jðtÞ ¼ 2 1 where f 0 is the instantaneous fundamental frequency (or pitch) at t ¼ 0, and g is the so-called pitch rate. The TF energy of signal (32) is illustrated schematically in Fig. 3. The signal (32), which is referred here to as ‘‘chirp-periodic’’, describes several natural sounds, such as short segments of human speech [17] and the song of some mammals [18]. Based on the Gaussian chirplet definition (2), xðtÞ can be written as X xðtÞ ¼ ak gRk ;tk ;nk ;bk ðtÞ, (34) k
where nk ¼ kf 0 ð1 þ gtk Þ,
(35a)
bk ¼ kf 0 g.
(35b)
Different tk and Rk parameters allow independent energy distribution for each harmonic (required for instance in an accurate description of speech [17]). We are now interested in obtaining the analytical description of the chirp-periodic signal (34) under the FChT transform. Since the squared magnitude of the FChT is equivalent to the fan marginalization of the pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi WD of the normalized signal wðtÞ ¼ j1 þ atjxðtÞ, the WD of the multi-component signal xðtÞ is firstly required.6 This WD is composed of the positive contribution of the Gaussian chirplets and the crossterms among them X XX WDx ðt; f Þ ¼ C k;k ðt; f Þ þ C i;j ðt; f Þ, (36)
Here k is the harmonic index and tk , Rk and jak j2 are, respectively, the time location, time spread and
i
k
jai
where C i;j ðt; f Þ ¼ 2jai jjaj j cosð2pði jÞf 0 fg ðtÞÞ
!! 1 ðt ti Þ2 ðt tj Þ2 exp þ 2 R2i R2j 2 ! ði þ jÞ 0 2 2 fg ðtÞ exp 4p R f f 0 2
3.3. Analysis of harmonically related chirplets Let xðtÞ be a signal composed of Gaussian chirplets harmonically related, that is, whose phase is integer multiple of a common phase jðtÞ X 1 2 xðtÞ ¼ ak qffiffiffiffiffiffiffiffi eð1=2Þððttk Þ=Rk Þ ejkjðtÞ . (32) 4 k pR2k
1509
ð37Þ 2
and R ¼
2R2i R2j =ðR2i
þ
R2j Þ.7
6 Note that the WD of the normalized signal wðtÞ can be approximated by WDw ðt; f Þ ¼ j1 þ atj WDx ðt; f Þ because the input signal xðtÞ is assumed to fluctuate much faster than factor pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi j1 þ atj. 7 In case iaj, Eq. (37) is an accurate approximation.
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1510
The squared magnitude of the FChT of the multicomponent signal results thus in X jX ð f ; aÞj2 ¼ jG k ð f ; aÞj2 þ Cð f ; aÞ, (38) k
where Gk ð f ; aÞ is the FChT of the kth harmonic and Cð f ; aÞ contains the fan marginal of all cross-terms. Appendix A contains the analysis of the positive terms in (38): the fan marginalization of the kth harmonic has nearly Gaussian shape of mean and variance, respectively mk ¼ kf 0 sk2 ¼
1 þ gtk , 1 þ atk
2 R2 ðRk kf 0 Þ2 k =ð8p Þ þ ðg aÞ2 . ð1 þ atk Þ2 2ð1 þ atk Þ4
(39a)
(39b)
On the other hand, the fan marginalization of the cross-terms can be likewise addressed based on the previous result: the cross-terms have Gaussian envelope and follow the same fan TF geometry than the positive terms; the major difference is, however, the oscillating nature of the TF energy described by the first factor in (37). The uncertainty principle [1] provides a lower bound to the accuracy of the joint TF resolution 1 , (40) 4p where st and sf are the spread of the time and frequency marginal, respectively. In case of the kth harmonic of the multi-component signal xðtÞ, the spectral dispersion s2f ðaÞ is given by (39b), and its time dispersion (referred to the ‘‘warped’’ harmonic) is given by
st sf X
s2t ðaÞ ’ 12 Rk2 jf0a ðtk Þj2 .
(41)
Then, the product duration–bandwidth results in qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 1 þ kðg aÞ2 , (42) st ðaÞsf ðaÞ ’ 4p where k ¼ 4p2 R4k k2 f 20 =ð1þ atk Þ2 40. The minimum of (42) takes clearly place for a ¼ g, the duration– bandwidth product being close to the lower bound of the uncertainty principle. Two conclusions arise from the last results:
The CT (1) and the FrFT (4) can reach the lower bound of the uncertainty principle for a single Gaussian chirplet. On the contrary, none is capable to reach that resolution for all harmonics of a linear chirp-periodic signal on the same transform instance as the FChT does. Although
this statement may seem obvious at this point, previous works [19,20] suggest surprisingly the use of the FrFT for the analysis of chirp-periodic signals. Since the cross-terms (37) follow the same fan TF geometry as the positive terms, and the same TF geometry intrinsic in the FChT marginalization when a ¼ g, it is simple to deduce that the crossterm interference among the harmonics in the FChT is minimum and, given the oscillatory nature of the cross-terms TF energy, irrelevant.
4. Prior related contributions The FChT compares seamlessly against the Chirplet [10], the fractional Fourier [12], and the Fourier transforms, all yielding the marginalization of the TF plane along different straight line geometries (see Fig. 1 again for an illustrative comparison). Additionally, it is important to address here previous works [19–22] related to the FChT to a larger or lesser extent. The work [19] proposes the so-called Harmonic fractional Fourier transform (HFT) Z 1 HFTðoÞ9 xðtÞ ejoð1þAtÞt dt (43) 1
as analysis method for speech segments with nonstationary pitch. Here A is the so-called base tone, clearly related to the analysis chirp rate a. From the analysis formula (43) it results obvious that the HFT (43) and the FChT (8) are closely related to each other. One important different between both is pffiffiffiffiffiffiffiffiffiffiffiffiffi the normalization term jf0a ðtÞj, which makes only the FChT fulfil Parseval theorem. However, the major difference comes from the interpretation of Eq. (43) given in [19] by its authors: the FrFT (4) is suggested as ‘‘natural choice for fast evaluation’’ of the integral (43), arriving to the conclusion that the required rotation of the FrFT is tanðyÞ ¼ oA. This correct geometrical correspondence tells us that an FrFT instance delivers only one single HFT frequency bin, namely, for o ¼ tanðyÞ=A. Thus, the evaluation of the integral (43) requires as many FrFT instances as frequency bins wished to evaluate. This exhaustive processing mechanism, seconded by recent works [20], relate implicitly rotations on the TF plane with the fan TF geometry of chirp-periodic signals, a relation that is not conceptually appropriate. Another work closely related to the FChT indeed is found in [21]. In spite of addressing analytically the practical boundaries
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
of the chirp rate, the work [21] does not provide theoretical insights on the proposed time-warper. The work [22] proposes the so-called local polynomial Fourier transform (LPFT) Z 1 LPFTðxÞ ¼ xðtÞ ejfLP ðt;xÞ dt, (44a) 1
where the phase of the analysis basis is an m-order polynomial fLP ðt; xÞ9 o1 t þ
1 1 o 2 t2 þ þ om tm 2 m!
(44b)
and x ¼ ðo1 ; . . . ; om Þ are the polynomial coefficients. At first sight one might be tempted to enclose the FChT into the second-order LPFT. However, a more careful look reveals that the second-order LPTF corresponds actually to the Chirplet transform (1), where o1 ¼ 2pf is frequency and o2 ¼ 2pb is the frequency variation rate.8 Consequently, the second-order LPFT yields a marginalization geometry (Fig. 1b) different from the FChTs (Fig. 1d). Even in case of a general order m, the marginalization takes place (loosely analytically speaking) in curved ‘‘parallel’’ paths, but never with a fan geometry. As major consequence, neither the LPFT (nor the Chirplet transform) is indicated in the analysis of harmonically related chirplets (as the seminal [22] and applied works [23] confirm). It is important to remark again that harmonically related chirplets, the fan marginalization of the WD and the Fourier transform of the fa ðtÞ-based time-warping are conceptually related. The relation between the FChT and TFD may be suggested as required exercise. However, it is necessary to remark that the FChT itself is a transform, and not a TFD. On the other hand, the time-warping operation fa ðtÞ (9) can be linked to recent TFDs, such as the GWCC [16], which in its simplest case, the generalized warped Wigner distribution (GWD) [8], is defined by GWDx ðt; f Þ9WDxðf1 ðtÞÞ ðfa ðtÞ; f =f0a ðtÞÞ. a
(45)
This GWD yields different distribution of the crossterms with respect to those delivered by the plain WD. This can be easily understood by parsing the GWD mechanism: the signal xðtÞ is time-warped according to f1 a ðtÞ, thus resulting in ‘‘stationary’’ harmonics, then the WD of the result is taken, and finally the bidimensional representation is morphed 8
Likewise, one may argue that the LPFT is the extension of the Chirplet transform to polynomial laws higher than two.
1511
accordingly. If the adequate kernel in the GWCC is selected, the result can be a nearly cross-term-free TF representation. That is clearly achieved only if the signal xðtÞ has fan TF geometry and if the analysis chirp rate a matches that geometry. On the contrary, Cohen’s class TFDs are not suited to the fan geometry but to the slanted geometry, such as in [24]. Further comments on warped TFD are beyond the scope of this paper. 5. Chirp rate estimation The most important aspect of the FChT in practical scenarios regards the adequacy of the law fa ðtÞ to the actual TF characteristics of the signal. Signals with fan geometry are found in practice only in short segments, such as in case of speech [17] or the song of some mammals [18]. In that sense, two options are at hand: either to use a warping function fðtÞ with more degrees of freedom than fa ðtÞ for matching the possible non-linear geometry, or to parse the signal into short segments and analyze them independently with the FChT. The first option directs the attention back to the general warping operator (7) described in the Introduction. Due to its inherent complexity, this general option is beyond the scope of this paper. The second option refers to the use of the FChT as generalization of the spectrogram [1] (which is indeed a TFD), that is, Cx ðt; f Þ ¼ jFChTfwðtÞxðt þ tÞ; aðtÞgj2 ,
(46)
where wðtÞ is the analysis window and aðtÞ is the analysis chirp rate for the segment centered at time t. The analysis window must be a Gaussian-like window, such as Gaussian or Hamming, in order for the existing chirped components of the segment to become Gaussian chirplets, as suggested in Section 3.3. This segmentwise processing brings interesting advantages, namely, the spectrogram is known to be the TFD with the least cross-term interference,9 and the chirp rate is independent for each segment and can be set accordingly to obtain the best resolution for the segment. This segmentwise processing is common in on-line applications and the preferred methodology for the practitioner. The need to estimate the chirp rate aðtÞ that best matches the TF characteristics of the segment is 9
Although the spectrogram is positive, it is not free of crossterms interference. In fact, these correspond actually to the (in this case) fan-marginalization of the WD cross-terms of each single segment wðtÞxðt þ tÞ.
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1512
doubtlessly the decisive factor on using the FChTspectrogram instead of the plain spectrogram. This estimation can be approached with two different methodologies: inter-frame and intra-frame.
1 0 −1
−15
5.1. Inter-frame
f 0 ðtÞ aðtÞ ¼ 0 , f 0 ðtÞ
f 0 ½n þ 1 f 0 ½n 1 . 2Sf 0 ½n
(48)
Note that the estimation of the chirp rate for the nth segment as given in (48) will require the pitch of the next segment. This ‘‘non-causal’’ method implies to step one segment back to recompute the FChT of the nth segment once that the pitch of the n þ 1th is available. More details of a procedure inspired on this approach can be found in [17]. 5.2. Intra-frame While in the inter-frame approach pitch information from adjacent segments is available, the intraframe methodology relies only on the current segment. Computing a dense ða; f Þ plane turns out to be the most intuitive methodology. Fig. 4 contains an example of such a plane: the analysis signal corresponds to a real speech segment; the ða; f Þ plane reveals a detailed harmonic representation for positive chirp rate values; the vertical projection of that plane has a maximum at a ’ 0:3, which can be adopted as the estimation of the pitch
5
10
15
2 1
f 00 ðtÞ
a½n ¼
0 ms
3
(47)
where is time derivative of f 0 ðtÞ. Thus, the intuitive approach is to quantify the evolution of the pitch f 0 ðtÞ and then compute the chirp rate as (47). The estimation of the pitch f 0 is a popular problem. Given the broad spectrum of pitch estimation methodologies [25], it is not our aim to focus here on a particular pitch estimation algorithm and elaborate it further. On the other hand, since the short-based processing in (46) is commonly carried out only at certain instants t ¼ nS, based on a shift interval S, the estimated pitch on the neighboring segments around the nth can be used to obtain the pitch rate as
−5
4
kHz
Assuming that the signal presents a continuous evolution of its fundamental frequency f 0 ðtÞ, according to the IF of the fan geometry (11), the best estimation of the pitch rate is
−10
0 −0.5
0 α
−0.5
0 α
0.5
γ
0.5
Fig. 4. Example of dense ða; f Þ plane. From top to bottom: analysis signal, ða; f Þ plane, and its vertical marginalization. The ða; f Þ is in logarithmic scale. The chirp rate a has been scaled by the signal length.
rate g.10 The ða; f Þ plane reveals the ‘‘bowtie’’shaped spread, typical in chirp-based transforms and unequivocal sign of redundancy. In order to skip the large computational load required to obtain the redundant ða; f Þ plane, a different methodology, proposed in [18], is that of computing L FChT instances (associated to different chirp rates ai and estimating the chirplet parameters from the L FChT ‘‘views’’. We will show that three views (L ¼ 3Þ are sufficient for that purpose. According to the harmonic representation in (39), the kth harmonic in the ith FChT view is to be centered at frequency mk;i and have a width sk;i according to mk;i ¼ kf 0
s2k;i ¼ 10
1 þ gtk , 1 þ ai t k
2 R2 ðRk kf 0 Þ2 k =ð8p Þ þ ðai gÞ2 . ð1 þ ai tk Þ2 2ð1 þ ai tk Þ4
(49a)
(49b)
Note that this estimation method turns out much simpler than the FrFT-based approach in [19,20].
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
Assuming that the centers and spreads C ¼ fmk;i ; sk;i gi¼1;...;L are available, it is possible to estimate the kth harmonic characteristics as follows:
The central instant tk can be obtained as mk;i mk;j tk ¼ . mk;j aj mk;i ai
(50)
At least two views are required to estimate tk . By replacing (49a) into (49b), we can write ri 92s2k;i ð1 þ ai tk Þ2 ¼
R2k m2 1 þ ðai gÞ2 . 4p2 R2k ð1 þ gtk Þ2
ð51Þ
The pairs ðai ; ri Þ correspond to the samples of a second-order polynomial, r ¼ aa2 þ ba þ c. It is simple to see that the minimum of that parabola corresponds to the chirp rate g, i.e., b . (52) 2a Likewise, the spread Rk can be obtained from the parabola coefficients. In order to obtain the coefficients ða; b; cÞ, three points are required. The central frequency is estimated as g¼
kf 0 ¼
mk;i mk;j ðaj ai Þ . ð1 þ gtk Þðmk;j aj mk;i ai Þ
(53)
The requirement of three different FChT instances is equivalent to that in the fast refinement for chirplet decomposition based on redundant dictionaries proposed in [7], in which three projective chirp views are used to estimate the chirplet parameters. Despite the simplicity of the inverse problem in (50)–(53), the question that immediately arises regards the method to obtain the set C of centers and spreads for all harmonics. The answer has been addressed in [18]. That approach is based on decomposing each FChT view into Gaussians so that each triple Gaussians (each from each view) fulfil (39). This is achieved with a Gaussian-fitting algorithm driven by the expectation maximization (EM) algorithm, the procedure being constrained to (50)–(53). This iterative mechanism converges to the harmonically related chirplets present in the signal. Although the computational load of the algorithm is not low, mainly due to EM, the mechanism admits parallel implementation and resembles somewhat biological processing [26]. For the sake of clarity, the details are out of the scope of the paper, the interested reader being referred to that work.
1513
5.3. Inter versus Intra In this section we have tried to outline some inhouse methods for estimating the optimal chirp rate of a given segment based on two different methodologies, each one having its own pros and limitations. The inter-frame methodology is mainly conditioned by the accuracy of the pitch tracking. Interframe methods other than those supported on pitch estimation are not foreseen as promising or competitive, essentially because pitch and its change over time is the main descriptive parameter of quasiperiodic or chirp-periodic signals. Although the pitch tracking task is not especially costly, it should not be underestimated, especially if the signal segment contains disturbing components, such as background noise of any type. On the contrary, the inter-frame approach does not rely on temporal information, but only on the information within the very segment. Although, computing a dense ða; f Þ plane is the most intuitive methodology, the intrinsic redundancy and resulting computational overload suggests the need of alternative methods. The alternative method outlined in this section makes use of very few FChT instances, thus skipping the use of redundant information. However, the large computational load required to estimate the chirplets present in the signal does not suggest its use in on-line applications. At the risk of leaving this comparative without a clear conclusion, let us mention that our current research is mainly focused on inter-frame methodologies, and in particular on how to track the pitch in more general scenarios, such as the signal being corrupted with background noise, or even with another interfering harmonic signal. 6. Discrete-time formulation In analogy to the discrete-time Fourier transform, the formulation of the discrete-time FChT could be thought as the continuous-time transform of signal xðtÞ ¼
1 X
x½n dðt nT s Þ,
(54)
n¼1
where x½n is the discrete-time signal and T s is the sampling interval. This way of proceeding results in X ðO; a^ Þ ¼
1 X n¼1
x½n
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi j1 þ a^ njejOð1þð1=2Þ^anÞn ,
(55)
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1514
where n is discrete time, O is frequency, and the analysis chirp rate a^ is the discrete counterpart of the chirp rate a, that is, a^ ¼ aT s .
(56)
Likewise, frequency O is related to its continuoustime counterpart f as O ¼ 2pfT s . Although ‘‘good-looking’’, the major drawback of the suggested discrete-time formulation (55) comes from the undesired spectral overlapping of the chirp basis, a well-known problem in chirpbased transforms. Therefore, an aliasing-free formulation may result by defining xðtÞ as xðtÞ ¼
1 X
x½n pðt nT s Þ,
(57)
n¼1
where pðt) is the interpolation filter, which is to be set to the ideal case pðtÞ ¼
sinðpt=T s Þ . pt=T s
(58)
This yields the aliasing-free expression X ðO; a^ Þ ¼
1 X
x½n W a^ ðO; nÞ,
(59)
n¼1
where the basis correspond to the bulky expression Z 1 pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi pðt nÞ j1 þ a^ tj ejOfa^ ðtÞ dt. (60) W a^ ðO; nÞ ¼ 1
Unfortunately, the integral in (60) has no explicit solution and thus (59) is of little help. In the practice, signal x½n has finite time support. Furthermore, the focal point is to be placed outside the time support interval (as concluded in previous sections).11 These facts along with the use of the Fourier-based evaluation presented in Section 2.2 gives rise to compact processing mechanisms. 6.1. Digital evaluation of the FChT Let x½n , n ¼ f0; . . . ; N 1g have N samples. We associate the discrete time index n to continuous time instants as follows: N 1 tn ¼ n (61) T s. 2 These time instants are placed equidistant, the equivalent segment being centered on t ¼ 0 and 11 Note that this type of restrictions are common in other chirpbased transforms, such as the FrFT (4), in which the TF content of the signal is to be confined within a circle [27].
having a length of T ¼ NT s . Note that the instants tn are all within the interval ½T=2; T=2 . Let xðtÞ be the equivalent continuous signal xðtÞ ¼
N 1 X
x½n hðt tn Þ,
(62)
n¼0
where hðtÞ is an interpolation filter, such that hð0Þ ¼ 1 and hðkT s Þ ¼ 0 for ka0 integer. Since the focal point is to be outside that interval, the chirp rate a is to fulfil 2 2 ! j^ajo . (63) T N The Fourier-based evaluation of the FChT (Section 2.2) involves the resampling of the signal xðtÞ according to the warping law pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 1 þ 2at ca ðtÞ ¼ þ . (64) a a (19) is defined within the The resampled signal xðtÞ interval ½fa ðT=2Þ; fa ðT=2Þ , which has also length of T. The time instants tn where the samples of the are to be located are obtained as warped signal x½n 1 T tn ¼ fa ðT=2Þ þ n þ , (65) 2 M jajo
where M is the number of samples of x½n , and in (65) the index spans n ¼ f0; . . . ; M 1g. Note that the resampled signal x½n has different number of results in samples than x½n . Thus, the x½n ¼ x½n
N1 X ‘¼0
x½‘ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi hðca ðtn Þ t‘ Þ. j1 þ at‘ j
(66)
yields Finally, taking the Fourier transform of x½n the FChT of x½n . Two issues in the previous procedure deserve special attention. 6.1.1. Interpolation filter hðtÞ The requirement of hð0Þ ¼ 1 and hðkT s Þ ¼ 0 for ka0 makes the ideal filter (58) a possible candidate. However, that choice implies that the evaluation of (66) requires N operations, which each sample x½n is clearly an expensive option. Thus, despite sacrificing ideal interpolation, we suggest a shorter interpolation filter, such as a first-order filter, or the cubic Hermite spline, whose normalized definition is 83 3 5 2 0pjtjo1; > < 2 jtj 2 jtj þ 1; 3 2 1 5 hðtÞ ¼ 2 jtj þ 2 jtj 4jtj þ 2; 1pjtjo2; > : 0; jtjX2: (67)
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
In case of cubic Hermite interpolation, only four ^ samples of x½n are used to compute each of x½n . With linear interpolation only two. 6.1.2. Length M The warping rule (64) has a slope greater than one in one time half, that is, c0a ðtÞ41 for ato0. This implies that signal xðtÞ is ‘‘undersampled’’ on that region, this leading to undesired aliasing effects (equivalent to those mentioned on (55)). This undesired aliasing is reduced or even suppressed by setting the length M to a proper value. It is clear then that M4N. Fig. 5 illustrates the effects of the time warping on the TF contents of the signal xðtÞ. The dark stripes of Fig. 5a represent harmonically related chirplets (cross-terms are not depicted); since the signal xðtÞ results from the synthesis of a discrete signal x½n , its spectral content is limited to frequency B ¼ 1=ð2T s Þ. spans to a The TF content of the warped signal xðtÞ higher frequency, as Fig. 5b shows. In order to have N aliasing-free bins (out of M), this length M is to be set as MX
1 j^ajN=4 N. 1 j^ajN=2
(68)
The frequency that corresponds to the length (68) is marked with a solid line in Fig. 5b. This length gives rise to spectral overlapping (shaded on that figure) only in the remaining M N bins, which in many practical cases do not contain valuable information and can be thus disregarded in further processing.12 The inverse FChT process results from taking the inverse Fourier transform of the M samples, and dewarping signal x½n to obtain the original x½n . The inverse resampling (66) clearly exists, especially given that M4N. However, that inverse process involves a sparse matrix pseudo-inversion, which implies large computational costs. An approximate with the inverse warping is based on resampling x½n same interpolation filter hðtÞ, as follows: y½n ¼
X pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi M1 j1 þ atn j x½‘ hðf a ðtn Þ t‘ Þ.
(70)
‘¼0
12 On the other hand, in order to code the whole TF information, the discrete time must be resampled according to the length
MX
N . 1 j^ajN=2
(69)
1515
a
b B 1-| |T/2 B
−T 2
0
T 2
B
(−T/2)
0
(−T/2)
Fig. 5. Time–frequency plane of (a) xðtÞ and (b) warped signal spans to a higher frequency. The xðtÞ. The TF content of xðtÞ example corresponds to a chirp rate a ¼ 0:5=T.
Based on the approximated dewarping (70) and the use of length M in (68), the reconstructed signal y½n differs from the original x½n . We can then talk of a coding/decoding residual error, whose magnitude depends on the interpolation filter hðnÞ and chirp rate a^ . Note that for a^ ¼ 0, the discrete FChT reduces to the DFT. 6.2. Computational load Table 2 summarizes the digital computation of the FChT and its inverse. Both directions are composed of four main operations, which in case of the direct transform correspond to:
Normalization: The discrete signal x½n is weighted by a chirp-rate-dependent window. This task implies the window computation (N operations) and the product between window and signal (N operations). Warped index: The resulting weighted discrete signal is to be resampled according to the law ca ðtÞ; the new time instants need to be computed (M operations). Resampling: The discrete warped signal is obtained by interpolation. By using cubic Hermit spline interpolation (67), the computing load of this task results in 4M. DFT computation: The resampled signal is DFTed, resulting in the FChT. Assuming that the length M is power of two, the number of operations is in the order of M log M.
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1516
Table 2 Digital computation of the fan-chirp transform and its inverse Task Fan-chirp transform Normalization
Description
Operations
x½n z½n ¼ pffiffiffiffiffiffiffiffiffiffiffi
2N
tn ¼ cP a ðtn Þ z½n ¼ z½‘ hðt‘ tn Þ
M 4M
j1þatn j
Warped index Resampling
−1
−0.5
‘
DFT
X ½k; a ¼ DFTfz½n g
Inverse fan-chirp transform iDFT z½n ¼ iDFTfX ½k; a g Warped index t n ¼ fa ðtn Þ P Resampling z½‘ hðt‘ t n Þ z½n ¼ ‘ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi p Normalization y½n ¼ j1 þ atn j z½n
0
0.5
1
Fourier transform
M log M M log M N 4N 2N
The length M is assumed to be power of two.
The previous analysis on computing requirements is approximate and do not correspond to single operations of a processor. Some tasks will likely unfold in more processor instructions, such as the computation of the normalization window. Nonetheless, an exact computing load analysis is left as programming-related exercise. Since we can fairly assume that both lengths N and M are comparable, the number of operations required in the FChT computation is approximately ðlog N þ 7ÞN.
−1
−0.5
0 Chirplet transform
0.5
1
−1
−0.5
0
0.5
1
fractional Fourier transform
(71)
Therefore, the additional tasks would cost nearly as much effort as the DFT operation. This fact clearly opens the door the FChT in real-time applications. Finally, it is necessary to remark that the estimation of the best-fitting a is considered an external process to the FChT digital evaluation. Thus, its computational load is not addressed by Table 2.
−1
−0.5
0
0.5
1
fan-Chirp transform
7. Results
Fig. 6. Linear-path marginalization-based transforms on linear pitch-variant train of pulses. The horizontal axis correspond to normalized frequency and the vertical one to the magnitude value.
The first experiment provides a comparison among the CT, FrFT and FChT on a toy synthetic example. The synthetic signal corresponds to a train of pulses non-equidistantly spaced, in such way that the fundamental frequency changes in a linear fashion; for the sake of clarity, the spectral envelope delineated by the harmonics is not flat. The mentioned transforms were applied over that signal, in such a way that the resolution achieved around the fourth harmonic k ¼ þ4 were the highest. The results are shown in Fig. 6. As proven in Section 3.3
the FChT is able to reach the maximum resolution for any harmonic in the same transform instance, while CT and FrFT can achieve that resolution only for the selected harmonic. On the contrary to these last transforms, the FT and FChT deliver always a symmetric spectrum. A real case of chirp-periodic signals is found in human speech. A short segment of speech can be described by the so-called speech formation model [28], as the filtration of the vocal cord excitation pðtÞ
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1517
4
kHz
3 2 1 0 0
0.5
1
1.5
2
2.5
3
2
2.5
3
1.5
2.5
3
1.5
2.5
3
sec. spectrogram, T=24 ms 4
kHz
3 2 1 0 0
0.5
1
0
0.5
1
0
0.5
1
1.5 sec. spectrogram, T=48 ms
4
kHz
3 2 1 0 2 sec. FChT-based spectrogram, T=48 ms
4
kHz
3 2 1 0 2 sec. Pseudo smooth-Wigner–Villedistribution
Fig. 7. Time–frequency analysis of female speech with natural intonation. The horizontal and vertical axis in all representations are time (in seconds) and frequency (from 0 to 4 kHz), respectively. The intensity is in logarithmic scale and preemphasis of high frequencies has been applied.
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1518
by the vocal tract impulse response hðtÞ. The excitation pðtÞ is classically considered a train of pulses equidistant in time. However, with natural intonation, the fundamental frequency (or pitch) undergoes linear variation over time. In consequence, a speech signal is more accurately modeled as X sðtÞ ¼ Hðkf 0 f0 ðtÞÞ ej2pkf 0 fg ðtÞ , (72) k
where Hð f Þ is the Fourier transform of hðtÞ, f 0 is the pitch at the center of the interval, and g is the pitch
rate (the proof of model (72) can be found in [17]). Thus, a windowed segment of the natural speech can be written as X ak ðtÞ ej2pkf 0 fg ðtÞ , (73) xðtÞ ¼ sðtÞwðtÞ ¼ k
where wðtÞ is a Gaussian-like window, and the envelopes ak ðtÞ can be approximated also with a Gaussian shape. Supported on the previous facts, the speech can be analyzed with the FChT-based generalization of
4
kHz
3 2 1 0 0
0.5
1
1.5
2
2.5
3
3.5
2.5
3
3.5
2 2.5 sec. Pseudo smooth-Wigner–Ville distribution
3
3.5
sec. spectrogram,T =48 ms 4
kHz
3 2 1 0 0
0.5
1
0
0.5
1
1.5
2 sec. FChT-based spectrogram, T=48 ms
4
kHz
3 2 1 0 1.5
Fig. 8. Time–frequency analysis of male speech with natural intonation. The horizontal and vertical axis in all representations are time (in seconds) and frequency (from 0 to 4 kHz), respectively. The intensity is in logarithmic scale and preemphasis of high frequencies has been applied.
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
the spectrogram (46). Two examples are considered in this paper: a British female speaker (TIMIT database) and an Austrian male speaker (recorded from public radio broadcasting). Both speakers talk with natural intonation, which results in a continuous and very often fast variation of the pitch. Figs. 7 and 8 depict the TF representation of the female and male recordings, respectively, as result of the following analysis techniques: from top to bottom, spectrogram with different window lengths, FChT-based generalization of the spectrogram,13 and smoothed pseudo-Wigner distribution (SPWD).14 Hamming window was used in all methods. The first conclusion from the results in Figs. 7 and 8 is that the spectrogram, regardless of the segment length, fails to represent properly the harmonics of the voiced utterances within the whole bandwidth. This limitation, which is commonly believed to be caused by ‘‘stochastic’’ components of voiced utterances [28], can be argued here as fast frequency variation of the medium- and highfrequency harmonics. This explanation is validated by the FChT-based spectrogram, whose medium- and high-frequency areas contain detailed harmonic trajectories. Finally, the performance obtained by the SPWD (and other TFDs of Cohen’s class) does not especially encourage the use of kernel distributions on chirp-periodic signals. The reason of this poor performance lies on the large TF overlapping between adjacent harmonics, especially in those from the medium and high frequencies. Furthermore, kernel TFD can suit signals with slanted TF geometry [24], but perform poorly with fan TF geometries. The limitation of the classical spectrogram observed in this paper, overcome by the FChT-based generalization of the spectrogram, may bring important consequences in speech coding: in the harmonic or sinusoidal coding the global bandwidth is commonly split up into two bands [29], the lowest one being considered voiced and the highest one unvoiced; upon estimation of the split frequency each region is coded accordingly. This scenario is illustrated in Fig. 9 with some spectral lines delivered by the classical and the FChT-based spectrogram over the male record. The example (a) corresponds to a speech segment of mild pitch rate: the second half of 13 The chirp rate estimation was based on the inter-frame methodology, i.e., by tracking the pitch. 14 Other kernel distributions, such as the Choi–Williams (s ¼ 1) and the Zhao–Atlas–Marks distribution [1] were also considered in the analysis, both delivering similar results to the SPWD.
1519
0
1
2 kHz
3
4
0
1
2 kHz
3
4
0
1
2 kHz
3
4
0
1
2 kHz
3
4
Fig. 9. Fourier transform (top) and FChT (bottom) of the speech record considered in Fig. 8, at instants (a) t ¼ 2:25 s and (b) t ¼ 0:25 s.
the Fourier transform is blurry, but the position of the spread harmonics still correspond visually to that given by the more detailed FChT representation. In example (b), the pitch rate of the naturally intonated voiced segment causes the Fourier representation become unclear on the middle frequencies, whereas in the high frequencies the Fourier spectrum seems to contain harmonics again. However, these are false harmonics, caused by the marginalization of the cross-terms (38), that do not match those delivered by the FChT. Consequently, with a FChT-based harmonic the whole bandwidth can be coded entirely as an harmonic structure. Another example of chirp-periodic signals is found in the song of some mammals. The popular bat echolocation ultrasound15 was chosen 15 The authors wish to thank Curtis Condon, Ken White, and Al Feng of the Beckman Institute of the University of Illinois for the bat data and for permission to use it in this paper.
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
1520
a
b
c
kHz
50
25
0 0
2
4
0
ms
2
4
0
ms
2
4 ms
Fig. 10. TF analysis of ultrasound bat signal: (a) WD, (b) spectrogram ðT ¼ 0:9 msÞ, and (c) FChT spectrogram ðT ¼ 0:9 msÞ. The intensity scale follows a root square, that is, amplitude level (instead of energy level) is shown.
as final scenario. The results of the analysis are illustrated in Fig. 10: (a) the WD, (b) the spectrogram and (c) the FChT-spectrogram (both with 0.9-ms Hamming window) of that signal are shown. The FChT-spectrogram was obtained according to the inter-frame methodology, as in the previous speech examples. The segment length chosen (0.9 ms) assures the linear evolution of the pitch within the segment. Since very few harmonics are found in the bandwidth ð’ 71 kHzÞ, their localization in the WD is not difficult. On the other hand, the TF resolution of the FChT-spectrogram is excellent, following precisely the harmonic trajectory obtained by the WD. The only drawback of the FChT-based spectrogram refers to the rigid TF resolution (42), due to the constant window length. The PSWD on the contrary does not suffer from this limitation so much, but it is affected severely by the cross-terms, whereas the FChT-spectrogram shows content only in areas close to the positive TF energy. The TF resolution of the classical spectrogram (Fig. 10) is very poor, which supports the biological evidence [26] that the auditory system of small mammals should be doing more than Fourier spectral analysis.
8. Conclusions The fan-chirp transform (FChT) is an effective method for representing signals with fan time– frequency (TF) structure. This type of signals, denoted here as chirp-periodic signals, are common in nature, such as segments of the song of mammals and human speech in natural intonation. The FChT possesses the property of marginalizing the WD along straight line paths according to a fan geometry. This geometry is entirely described by the userdefined chirp rate a, or by its inverse, the focal point instant, which plays the role of ‘‘source’’ and ‘‘sink’’ of frequency. In practical terms, the signal should be zero beyond (and near to) that instant. The FChT efficacy in real scenarios depends on the choice of the chirp rate to best fit the TF characteristics of the signal. This strong signal dependency points out to the development of ad hoc automatic methods for optimal chirp rate estimation. The guidelines of such estimation mechanisms have been provided in the paper. The formulation in discrete time and its fast digital evaluation have been provided as well. The excellent TF resolution and suppression of crossterms delivered by the FChT has been illustrated with real examples of chirp-periodic signals.
ARTICLE IN PRESS L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522
of the Gaussian chirplet. On the other hand, the Gaussian approximation (A.3) describes with accuracy the central frequency and variance of the representation. In spite of the severe asymmetry for extreme values of the chirp rate a, the Gaussian approximation is a very good reference of the spectral shape.
Appendix A. Gaussian chirplet representation The Gaussian chirplet 2 2 1 ffiffiffiffiffiffiffi eð1=2ÞððttÞ=RÞ ej2pðnðttÞþð1=2ÞbðttÞ Þ gðtÞ ¼ p 4 2 pR
is the only signal with a positive WD: WDg ðt; f Þ ¼ 2eR
2 ðttÞ2 4p2 R2 ð f nbðttÞÞ2
e
.
References
Based on the fan-geometry marginalization (23), the squared magnitude of the FChT of the Gaussian chirplet results in j1 þ atð f Þ j ð1=2Þðð f ð1þatÞnÞ=sð f Þ Þ2 jGð f ; aÞj2 ¼ pffiffiffiffiffiffi e , 2psð f Þ
(A.1)
where the parameters sð f Þ and tð f Þ are governed by frequency f as follows: 1 2 1 R þ ðaf bÞ2 R2 , 8p2 2 8 an > ; if b ’ >
> otherwise: : b af
1521
s2ð f Þ ¼
(A.2a)
tð f Þ
(A.2b)
The power spectrum (A.1) has nearly Gaussian shape: the ‘‘variance’’ (A.2a) depends on variable f and thus (A.1) is not strictly a Gaussian function. The resulting shape of jGð f ; aÞj2 is asymmetric, the more the larger the chirp rate a and the frequency. In spite of this asymmetry, we are interested in characterizing jGð f ; aÞj2 in terms of its spread and central location, that is, to establish an approximation with the true Gaussian model. The mean and variance of the proposed Gaussian approximation result, respectively, in n m¼ , (A.3a) 1 þ at 1 1 2 1 2 2 ðam bÞ s2 ¼ R þ R . (A.3b) 2 ð1 þ atÞ2 8p2 This approximation does not correspond to the maximum likelihood criterion, since such an analysis faces integrals with no analytical solution. Instead, the exponent in (A.1) was expanded in a Taylor series around f ¼ m, and the parameters (A.3) were obtained from the second-order term. The validity of the previous analysis has been assessed extensively with simulation experiments. These confirmed the analytical expression (A.1) as accurate description of the FChT power spectrum
[1] L. Cohen, Time–frequency Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1995. [2] Mecklenbra¨uker, Hlawatsch (Eds.), The Wigner Distribution—Theory and Applications in Signal Processing, Elsevier, Amsterdam, NL, 1997. [3] L. Cohen, T. Posch, Positive time–frequency distribution functions, IEEE Trans. Acoust. Speech Signal Process. ASSP-33 (1985) 31–38. [4] P.J. Loughlin, J.W. Pitton, L.E. Atlas, Construction of positive time–frequency distributions, IEEE Trans. Signal Process. 42 (10) (October 1994) 2697–2705. [5] S. Mallat, Z. Zhang, Matching pursuit with time–frequency dictionaries, IEEE Trans. Signal Process. 41 (12) (December 1993) 3397–3415. [6] A. Bultan, A four-parameter atomic decomposition of chirplets, IEEE Trans. Signal Process. 47 (3) (March 1999) 731–745. [7] Q. Yin, S. Qian, A. Feng, A fast refinement for adaptive Gaussian chirplet decomposition, IEEE Trans. Signal Process. 50 (6) (June 2002) 1298–1306. [8] A. Papandreou-Suppapola, S.B. Suppapola, Analysis and classification of time-varying signals with multiple time– frequency structures, IEEE Signal Process. Lett. 9 (3) (March 2002) 92–95. [9] L.R. Rabiner, R.W. Schafer, C.M. Rader, The Chirp-Z transform algorithm and its application, Bell System Technical J. 48 (5) (May–June 1969) 1249–1292. [10] S. Mann, S. Haykin, The chirplet transform: physical considerations, IEEE Trans. Signal Process. 43 (11) (November 1995) 2745–2761. [11] H.M. Ozaktas, Z. Zalevsky, M.A. Kutay, The Fractional Fourier Transform with Applications in Optics and Signal Processing, Wiley, New York, NY, 2001. [12] L.B. Almeida, The fractional Fourier transform and time– frequency representations, IEEE Trans. Signal Process. 42 (11) (November 1994) 3084–3091. [13] R.G. Baraniuk, D.L. Jones, Warped wavelet bases: unitary equivalence and signal processing, in: Proceedings of IEEE ICASSP, 1993, pp. 320–323. [14] R. Baraniuk, Unitary equivalence: a new twist on signal processing, IEEE Trans. Signal Process. 43 (10) (October 1995) 2269–2282. [15] T. Twaroch, F. Hlawatsch, Modulation and warping operators in joint signal analysis, in: Proceedings of IEEE Symposium on Time–frequency and Time-scale Analysis, 1998, pp. 9–12. [16] A. Papandreou–Suppapoula, F. Hlawatsch, G.F. Boudreaux-Bartels, Quadratic time–frequency representations with scale covariance and generalized time-shift covariance: a unified framework for the affine, hyperbolic, and power
ARTICLE IN PRESS 1522
[17]
[18]
[19]
[20]
[21]
[22]
L. Weruaga, M. Ke´pesi / Signal Processing 87 (2007) 1504–1522 classes, Digital Signal Process.: Rev. J. 8 (1) (January 1998) 3–48. M. Kepesi, L. Weruaga, Adaptive chirp-based time–frequency analysis of speech signals, Speech Commun. 48 (5) (May 2006) 474–492. L. Weruaga, M. Ke´pesi, Self-organizing chirp-sensitive artificial auditory cortical model, in: Proceedings of Interspeech, 2005, pp. 705–708. F. Zhang, Y.Q. Chen, G. Bi, Adaptive harmonic fractional Fourier transform, IEEE Signal Process. Lett. 6 (11) (November 1999) 281–283. J.G. Vargas-Rubio, B. Santhanam, An improved spectrogram using the multiangle centered discrete fractional Fourier transform, in: Proceedings of IEEE ICASSP, 2005, pp. 505–508. R.J. Sluijter, A.E.J.M. Janssen, A time warper for speech signals, in: IEEE Workshop Speech Coding, 1999, pp. 150–152. V. Katkovnik, Discrete-time local polynomial approximation of the instantaneous frequency, IEEE Trans. Signal Process. 46 (10) (October 1998) 2626–2637.
[23] L. Stankovic, S. Djukanovic, Order adaptive local polynomial FT based interference rejection in spread spectrum communication systems, IEEE Trans. Instrum. Meas. 54 (6) (2005) 2156–2162. [24] D.L. Jones, R.G. Baraniuk, An adaptive optimal-kernel time–frequency representation, IEEE Trans. Signal Process. 43 (10) (October 1995) 2361–2371. [25] A.M. Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems, Wiley, Chichester, UK, 2004. [26] E. Mercado III, C. Myers, M.A. Gluck, Modelling auditory cortical processing as an adaptive chirplet transform, Neurocomputing 32–33 (2000) 913–919. [27] H.M. Ozaktas, O. Arikan, M.A. Kutay, G. Bozdag˘i, Digital computation of the fractional Fourier transform, IEEE Trans. Signal Process. 44 (9) (September 1996) 2141–2150. [28] T.F. Quatieri, Discrete-time Speech Signal Processing, Prentice-Hall, Upper Saddle River, NJ, 2002. [29] R.J. McAulay, T.F. Quatieri, Sinusoidal coding, in: Kleijn, Paliwal (Eds.), Speech Coding and Synthesis, Elsevier Science, Amsterdam, 1995, pp. 121–174.