Systematic errors in the formant analysis of steady-state vowels


Speech Communication 38 (2002) 141–160 www.elsevier.com/locate/specom

Systematic errors in the formant analysis of steady-state vowels Gautam K. Vallabha *, Betty Tuller Center for Complex Systems & Brain Sciences, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA Received 12 October 2000; received in revised form 9 January 2001; accepted 13 June 2001

Abstract

The locations of formants in a speech signal are usually estimated by computing the linear predictive coefficients (LPC) over a sliding window and finding the peaks in the spectrum of the resulting LP filter. The peak locations are estimated either by root-solving or by computing a coarse spectrum and finding its maxima. We discuss four sources of systematic error in this analysis: (1) quantization of the speech signal due to the fundamental frequency, (2) incorrect order for the LP filter, (3) exclusive reliance upon root-solving, and (4) the three-point parabolic interpolation used to compensate for the coarse spectrum. We show that the expected error due to F0 quantization is 10% of F0, and that the other three sources can independently skew the final formant estimates by 10–80 Hz. We also show that errors due to incorrect filter order are related to systematic differences between speakers and phonetic classes, and that root-solving is especially error-prone for low formants or when formants are close to each other. We discuss methods for avoiding these errors and improving the accuracy of formant estimation, and give a heuristic for estimating the optimal filter order of a steady-state signal. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Speech analysis; LPC; Vowels; Formant estimation

1. Introduction

One of the key aspects of a speech signal is its formant structure. Over the past 50 years, methods such as the short-term Fourier transform and linear predictive coding (LPC) have been used to identify the formants in a speech signal. LPC in particular has gained popularity as a formant estimation technique since it avoids the time–bandwidth problems of the Fourier transform and is straightforward to implement. However, these

* Corresponding author. Tel.: +1-561-297-2230; fax: +1-561-297-3634. E-mail address: [email protected] (G.K. Vallabha).

methods cannot always be assumed to give accurate formant estimates. Systematic errors in the estimates can be introduced by the signal itself, by particular analysis parameters, or by factors intrinsic to the analysis methods. These errors may not be an issue for speech technology but they can be crucial in psychoacoustic studies or studies of speech perception that depend on sensitive formant analyses (e.g. Repp and Williams, 1987; Johnson et al., 1993a). In Section 2, we give an overview of LPC and summarize two common methods used to estimate formant locations from LPC coefficients (Markel and Gray, 1976; Rabiner and Schafer, 1978). In Sections 3 and 4, we examine the accuracy of the formant estimation methods and in some cases suggest simple ways of avoiding problems.

0167-6393/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. PII: S0167-6393(01)00049-8


2. LPC analysis

In a typical LPC analysis, the acoustic speech signal is sampled at Fs points per second (after suitable lowpass filtering) and the resulting discrete signal is divided into overlapping analysis frames. The data in each frame are Hamming-windowed, preemphasized, and submitted to a pth-order LPC analysis. The analysis yields a set of p real coefficients [a_1, a_2, a_3, ..., a_p] which best predict the data value s(n) from the p previous values s(n−1), ..., s(n−p). Thus,

$$\tilde{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k), \tag{1}$$

$$\mathrm{error}(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k). \tag{2}$$

Eq. (2) corresponds to a filter whose output is the prediction error. The system function for the filter is a pth-order complex polynomial,

$$A(z) = 1 - \sum_{k=1}^{p} a_k\, z^{-k}, \tag{3}$$

where z = r·exp(i2πf/Fs) is a complex number with magnitude r and angle 2πf/Fs. Here it represents a damped sinusoid with frequency f and damping r. If r < 1, the sinusoid decays over time, with smaller r indicating faster decay. If r > 1, the sinusoid grows over time, with greater r indicating faster growth. The system function A(z) gives the response of the filter for an input sinusoid with arbitrary frequency and damping. Normally, however, the operation of a filter is characterized by how it modifies undamped sinusoids with an arbitrary frequency f, i.e., when r = 1. This is the frequency response of the filter, formally written as A(exp(i2πf/Fs)); for notational convenience, we shall simply write it as A(f). Since digital signals can only have frequencies between 0 and Fs/2, f varies from 0 to Fs/2 and, by extension, 2πf/Fs (≡ ω) varies from 0 to π (thus, z represents the polar coordinates (r, ω) of a point in the upper half of the complex plane). For concreteness, we henceforth assume that Fs = 10 kHz, so f goes from 0 to 5 kHz.
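The evaluation of the frequency response described above is easy to sketch in code. The following is a minimal NumPy illustration (the function name and the sample resonance are ours, not from the paper): it builds the polynomial of Eq. (3) from predictor coefficients and evaluates |H(f)| = 1/|A(f)| on a fine frequency grid via a zero-padded FFT.

```python
import numpy as np

FS = 10_000  # sampling rate (Hz), as assumed in the text

def lp_spectrum(a, n_fft=8192, fs=FS):
    """Magnitude response |H(f)| = 1/|A(f)| of an LP filter.

    `a` holds the predictor coefficients [a1, ..., ap] of Eq. (1), so the
    polynomial A(z) of Eq. (3) has coefficients [1, -a1, ..., -ap].
    Returns (freqs, |H|) for 0 <= f <= fs/2.
    """
    a_poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    A = np.fft.rfft(a_poly, n_fft)          # A(f) on a fine frequency grid
    freqs = np.arange(A.size) * fs / n_fft
    return freqs, 1.0 / np.abs(A)

# Example: a single resonance near 1000 Hz built from one conjugate pole pair:
# (1 - az^-1)(1 - conj(a)z^-1) gives a1 = 2r cos(w0), a2 = -r^2
r, w0 = 0.98, 2 * np.pi * 1000 / FS
freqs, H = lp_spectrum([2 * r * np.cos(w0), -r**2])
print(freqs[np.argmax(H)])                  # close to 1000 Hz
```

With 8192-point zero-padding the grid spacing is about 1.2 Hz, which is the "fine-resolution" evaluation discussed in Section 2.2.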

Following the source-filter theory of speech production, |H(f)| ≡ |1/A(f)| is the envelope of the spectrum of the speech signal, and its maxima correspond to the resonances of the vocal tract, viz. the formants. The principal problem then is to determine the locations of the maxima of H(f). It is not possible to derive the locations of the maxima analytically, so there are two common approximations: root-solving and peak-picking. There are other LPC-based methods for estimating formant locations (Snell and Milinazzo, 1993; Welling and Ney, 1998), but their emphasis is on computationally efficient ways to approximate the roots of the LP polynomial. An analysis of these methods is beyond the scope of this paper, but it should be noted that in principle they too are susceptible to the problems discussed in Sections 2.1 and 3.3.

2.1. Root solving

A pth-order polynomial is uniquely characterized by p roots. Each root contributes independently to the output of the polynomial. For example, if f(x) = x² − 4 = (x + 2)(x − 2), then log(f(x)) = log(x + 2) + log(x − 2). Notice that the contribution of a root to f(x) is proportional to its distance from x, so the zeroes of f(x) occur at the roots x = +2 and x = −2. This principle extends to complex polynomials also. If α_1, ..., α_p are the p roots of A(z), then

$$A(z) = 1 - \sum_{k=1}^{p} a_k\, z^{-k} = z^{-p} \prod_{k=1}^{p} (z - \alpha_k), \tag{4}$$

$$\log|A(z)| = -p \log|z| + \sum_{k=1}^{p} \log|z - \alpha_k|. \tag{5}$$

Each root α_k is a complex number r_k·exp(i2πf_k/Fs), and thus represents a sinusoid with frequency f_k and damping r_k. The contribution of a root α_k to |A(z)|, the magnitude of the system response, is proportional to the distance of α_k from the input z. Thus if the input sinusoid z matches both the frequency and damping of any of the roots, |A(z)| is guaranteed to be a minimum. Conversely, |H(z)| will be a maximum.
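The factorization in Eq. (4) can be verified numerically. The sketch below (plain NumPy; the 4th-order coefficients and the test frequency are arbitrary illustrative values) evaluates |A(z)| on the unit circle both directly from the coefficients and as a product of root distances.

```python
import numpy as np

# Arbitrary 4th-order predictor coefficients (illustrative values only)
a = np.array([1.2, -0.9, 0.3, -0.1])
a_poly = np.concatenate(([1.0], -a))      # coefficients of A(z), Eq. (3)

# z^p A(z) is an ordinary polynomial in z with these same coefficients,
# so its roots are the alpha_k of Eq. (4)
roots = np.roots(a_poly)

# Evaluate |A(z)| at an arbitrary point on the unit circle in two ways
z = np.exp(1j * 2 * np.pi * 0.137)
p = len(a)
direct = abs(np.polyval(a_poly, z) / z**p)               # z^-p * (z^p A(z))
via_roots = abs(z)**(-p) * np.prod(np.abs(z - roots))    # Eq. (4) factorization
print(direct, via_roots)                  # the two evaluations agree
```

On the unit circle |z| = 1, so the |z|^(−p) factor drops out, which is exactly the simplification that leads to Eq. (6) below.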


The roots of H(z), also called its poles, satisfy two constraints. First, they always occur in pairs: if α_k is a pole, then the complex conjugate of α_k is also a pole. This implies that the lower half of the complex plane is a mirror image of the upper half. Second, the poles are always damped to ensure stability of the filter (i.e., r_k < 1). Consequently, the poles always lie inside the unit circle. Since we are interested in the frequency response H(f), the input sinusoid is always undamped, so z always lies on the perimeter of the unit circle. Notice in Eq. (5) that if |z| = 1, then log|z| = 0 and

$$\log|H(f)| = \log\left[1/|A(f)|\right] = -\sum_{k=1}^{p} \log|z - \alpha_k|. \tag{6}$$

The term log|z − α_k| is the log-magnitude (LM) frequency response of a single pole, so LM[H(z)] is simply the linear superposition of the responses of the individual poles. If a pole α_k is weakly damped (r_k ≈ 1), then its frequency response will have a single sharp peak at f_k. If the pole is strongly damped (r_k ≪ 1), the peak will still occur at f_k but it will be lower and broader. However, the peaks of LM[H(f)] will not occur at precisely the pole frequencies f_k: the linear superposition "tilts" the peaks and slightly skews their locations. The higher the damping of a pole, the broader the peak in its response, the more susceptible the peak is to the skew, and the greater the deviation of the "true peak location" in LM[H(f)] from the pole location f_k. Fig. 1 illustrates this behavior. Fig. 1(a) shows the four poles for a sample two-formant signal. The LM response for a pole α_k is obtained by varying the frequency of the input sinusoid z and computing the distance from z to α_k. If α_k is weakly damped or isolated, then near f_k the contributions of the other poles to LM[H(f)] will be relatively flat and the peak in LM[H(f)] will occur very close to f_k. This is the case with the pole α_1 in Fig. 1(b). On the other hand, if α_k is strongly damped or there are other poles close by, then the contributions of the adjacent poles are likely to skew the location of the peak in LM[H(f)] away from f_k. This is the case with pole α_2; its peak occurs at

Fig. 1. (a) The roots (α_1, α_2, ᾱ_1, ᾱ_2) for a sample two-formant signal and their distances from a sample input sinusoid. (b) The log-magnitude response for each root across all frequencies; the cumulative response is the linear superposition of the four individual responses.

f_2 = 3000 Hz, but the corresponding peak in LM[H(f)] occurs at a slightly lower frequency. Once the roots of a complex polynomial are calculated, the frequency f_k and bandwidth b_k of a root α_k can be calculated as follows:

$$f_k = (F_s / 2\pi) \tan^{-1}\left[\mathrm{Im}(\alpha_k) / \mathrm{Re}(\alpha_k)\right], \tag{7}$$

$$b_k = (F_s / \pi) \ln(1 / r_k). \tag{8}$$

Only the roots in the upper half of the unit circle (0 < f < Fs/2) are considered for the above equations (the conjugate roots do not introduce any new resonances; they simply ensure that the output of the filter is real). Notice that it is easily possible to have more roots than formants, especially if p > 10. In such cases, a bandwidth criterion is applied and only roots with bandwidths less than the criterion are accepted as formants. The criterion can be either absolute (e.g., b_k < 300 Hz) or relative (e.g., b_k < f_k/10). For a more detailed description of this method, see (Markel and Gray, 1976, Section 7.2.2). It needs to be kept in mind that the formants are defined with respect to the spectrum of the speech signal and, by construction, of H(f). Determining the maxima of |H(f)| from knowledge of the roots is a difficult and computationally expensive procedure, so normally the root locations


are used as convenient approximations. This approximation is valid only to the extent that the roots are weakly damped and well-separated from each other. For example, it can be a problem when analyzing vowels whose formants are close to each other (e.g., F1 and F2 for back vowels, F2 and F3 for high front vowels). Root-solving is sometimes used specifically for its "ability" to resolve neighboring formants, but this use is debatable as it goes against the assumption of well-separatedness; moreover, in some cases there may be only one formant (Maurer and Landis, 1995). Section 3.3 discusses the errors which result from violating this assumption.
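As a concrete sketch of the root-solving procedure, the NumPy code below applies Eqs. (7) and (8) to the roots of the LP polynomial and filters them with an absolute bandwidth criterion. The function name, the 300 Hz cutoff, and the example resonances are our illustrative choices, not the authors' code.

```python
import numpy as np

FS = 10_000  # sampling rate (Hz)

def formants_from_lpc(a, fs=FS, max_bw=300.0):
    """Estimate formants from predictor coefficients [a1, ..., ap]
    via root-solving, using Eqs. (7) and (8) of the text."""
    a_poly = np.concatenate(([1.0], -np.asarray(a, dtype=float)))
    roots = np.roots(a_poly)
    roots = roots[roots.imag > 0]                  # upper half-plane only
    freqs = (fs / (2 * np.pi)) * np.angle(roots)   # Eq. (7), via atan2
    bws = -(fs / np.pi) * np.log(np.abs(roots))    # Eq. (8)
    keep = bws < max_bw                            # absolute bandwidth criterion
    order = np.argsort(freqs[keep])
    return freqs[keep][order], bws[keep][order]

# Example: build a filter from two known resonances, then recover them
def pole_pair(f, bw, fs=FS):
    r = np.exp(-np.pi * bw / fs)                   # inverse of Eq. (8)
    a1 = 2 * r * np.cos(2 * np.pi * f / fs)
    return np.array([1.0, -a1, r * r])             # (1 - az^-1)(1 - conj(a)z^-1)

A = np.convolve(pole_pair(500, 60), pole_pair(1500, 100))
a = -A[1:]                                         # back to predictor form
print(formants_from_lpc(a))                        # ~ (500, 1500) Hz
```

Note that this recovers the pole frequencies, not the peaks of |H(f)|; as the text emphasizes, the two coincide only when the poles are weakly damped and well-separated.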

2.2. Peak picking

The other method for finding the maxima of H(f) is admirably direct: calculate H(f) for regularly spaced frequencies and scan for maxima. The calculation can be done by evaluating Eq. (3) for each quantal frequency f_D (with z = exp(i2πf_D/Fs)) and inverting the result to get H(f_D). This polynomial evaluation is equivalent to the discrete Fourier transform (DFT) of the LP coefficient sequence [1, −a_1, −a_2, ..., −a_p]. For a sequence of N points, the DFT yields a quantized spectrum with a frequency quantum df of Fs/N Hz. For the LP coefficients, this spectrum is simply A(f_D), and it is inverted and squared to get the power spectrum H²(f_D). The locations and bandwidths of the peaks of H²(f_D) yield the formant estimates. Normally, the sequence of LP coefficients is padded with zeros such that N = 128 and df = Fs/128 ≈ 78 Hz.

A common way of dealing with this coarse resolution is three-point parabolic interpolation (Markel and Gray, 1976, Section 7.2.2), which works as follows. Assume that f_0, f_1, f_2, ..., f_{N/2} are the quantized frequencies and let f_j be the location of a spectral maximum. A parabola is fitted through the three points (f_{j−1}, H²(f_{j−1})), (f_j, H²(f_j)) and (f_{j+1}, H²(f_{j+1})). The location of the maximum of the parabola is taken to be the location of the formant peak, and its bandwidth is calculated by finding the half-power points of the parabola. To fit a parabola of the form y(k) = ak² + bk + c, let y(−1), y(0) and y(1) be equal to H²(f_{j−1}), H²(f_j) and H²(f_{j+1}), respectively. Then

$$\text{Height of parabola: } c = y(0), \tag{9}$$

$$\text{Tilt: } b = [y(1) - y(-1)]/2, \tag{10}$$

$$\text{Convexity: } a = [y(1) + y(-1)]/2 - y(0), \tag{11}$$

$$\text{Estimated location of formant peak: } f_{\mathrm{peak}} = f_j - \frac{b}{2a}\cdot\frac{F_s}{N}. \tag{12}$$

The interpolation scheme described above is only effective to the extent that it mimics the shape of the peaks in the power spectrum, an assumption which is examined in detail in Section 3.4. We shall refer to this scheme – a 128-point DFT followed by peak-picking and interpolation – as "coarse-resolution" peak-picking. Alternatively, one could avoid the need for interpolation by increasing the amount of zero-padding (with N = 8192, for example, df ≈ 1 Hz), at the cost of a heavier computational load. We shall refer to this scheme – an 8192-point DFT followed by peak-picking without interpolation – as "fine-resolution" peak-picking.
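The three-point fit of Eqs. (9)–(12) is compact in code. The sketch below (plain Python/NumPy; the function name and the synthetic data are ours) refines a spectral maximum, and the example exercises the best case the text assumes: a spectrum that really is a parabola with its apex between two bins.

```python
import numpy as np

def parabolic_peak(power, j, df):
    """Refine a spectral maximum at bin j via Eqs. (9)-(12) of the text.

    `power` is the sampled power spectrum H^2(f_D) and `df` is the
    frequency quantum Fs/N; returns the interpolated peak frequency."""
    ym1, y0, yp1 = power[j - 1], power[j], power[j + 1]
    b = (yp1 - ym1) / 2.0                 # tilt, Eq. (10)
    a = (yp1 + ym1) / 2.0 - y0            # convexity, Eq. (11)
    return (j - b / (2.0 * a)) * df       # Eq. (12)

# Best case: the spectrum really is a parabola, apex between bins 10 and 11
df = 78.125                               # Fs = 10 kHz, N = 128
true_peak = 10.3 * df
bins = np.arange(20)
power = -((bins * df - true_peak) ** 2)   # concave-down parabola
j = int(np.argmax(power))
print(parabolic_peak(power, j, df))       # recovers true_peak (up to float error)
```

For a true parabola the interpolation is exact; the errors discussed in Section 3.4 arise precisely because real LP spectral peaks are not parabolic.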

3. Error analysis

Our original goal was to study the systematic errors in estimating the locations of the peaks of |H(f)|, i.e., those introduced by using root-solving and peak-picking. However, evaluating the severity of these errors requires studying the nature of the speech signal and of the LPC parameters – errors due to F0 quantization may swamp the errors due to root-solving and peak-picking, and errors due to parabolic interpolation can sometimes be reduced dramatically by changing the order of the LP filter. Therefore our analysis of systematic errors includes those caused by F0 quantization and incorrect filter order. Admittedly, there are yet more sources of error, such as the degree of preemphasis, the length of the analysis frame, and the window used to taper it. To keep this study at a manageable level, we did not study the effects of these parameters (for a discussion, see Markel and Gray, 1976, Section 6.5). It is important to note that by "error", we refer to the formant estimation error, not the prediction


error. The notion of "estimation error" is clear in the case of root-solving and parabolic interpolation, since the reference standard (the true location of the peak) is unambiguous. However, the notion is methodologically thorny for filter order errors. All else being equal, a 20th-order filter will yield different formant estimates than a 12th-order filter, and it is not obvious which estimates are better and by what criteria. However, it is clear that filter orders can be too low (e.g., p = 4) or too high (e.g., p = 50), so we proceed from the assumption that there is a "just right" filter order (see Section 3.2).

We studied the systematic errors using two kinds of manipulations: synthesized speech sounds and direct root modification. The synthesized speech consisted of an impulse train filtered through a cascade of resonators (similar to the design of Klatt (1980) and KLSYN88, Sensimetrics, Cambridge, MA). The glottal source rolloff was approximated by lowpass filtering the impulse train with the cutoff at 200 Hz (in KLSYN88 this corresponds to an Open Quotient of 50%), and lip radiation was approximated as a first-difference filter. The resulting sounds were sampled at 10 kHz. The synthesizer was deliberately kept simple for two reasons. One, our goal was to study the analysis characteristics of "speechlike" sounds, not to produce realistic-sounding speech. Two, the synthesizer satisfied all the assumptions of LPC analysis – the source and filter were completely separated, there were no irregularities such as jittering or fluttering of the voicing source, and there were no anti-resonators in the cascade. Furthermore, the synthesized sounds were always steady-state vowel-like sounds, usually 2048 samples long. The fundamental frequency, the number of formants, the formant locations and bandwidths, and the relative formant amplitudes did not change during the sound.
The resulting speech-like sounds were therefore perfect for LPC modeling and constituted a best-case scenario for the analysis method. For the LPC analysis, a single 256-point pitch-asynchronous frame was chosen from the synthesized sound, Hamming-windowed, and preemphasized at 100% (for a justification of these preprocessing steps, see Markel and Gray, 1976, Section 6.5). The pth-order forward linear predictor was estimated by solving the autocorrelation equations using the Durbin recursion (Rabiner and Schafer, 1978, p. 411). The covariance method is more accurate in general, but when the frame length N is more than 2 pitch periods, the autocorrelation and covariance methods give approximately the same results (Chandra and Lin, 1974). In our case the choice of window width (25.6 ms) suffices for pitch periods less than 12.5 ms, i.e., for F0 > 80 Hz. The spectrum of the LP filter was finally evaluated using either root-solving or peak-picking. Since we were only interested in formant estimation, the gain of the LP filter was discarded.

A technique involving the direct modification of roots was used to study the relation between the roots of a filter and its frequency response. This technique allowed us to study root-solving and parabolic interpolation independently of the F0 quantization problem. The modification itself was quite straightforward: a root α_k was computed from a specified resonant frequency f_k and bandwidth b_k using the inverse of Eqs. (7) and (8) (see Note 1, Appendix A). Once the roots were computed, they were multiplied to give the system function A(z) and therefore H(f) (see Eq. (4)). The maxima of H(f) were then found using peak-picking.

The following sections describe four sources of systematic errors – F0 quantization, incorrect filter order, exclusive reliance upon root-solving, and the parabolic interpolation method. The sections are roughly ordered by the generality of the errors (from the most general to the least general), which also corresponds to their refractoriness.

3.1. Errors due to F0 quantization

The energy in the spectrum of the impulse train ("glottal source") is concentrated at harmonics of the fundamental frequency F0. Since the vocal tract resonances are assumed to be independent of the source, the formant frequencies need not line up with the harmonics.
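The worst-case consequence of this mismatch is easy to quantify with a toy calculation. The sketch below (our illustration; the F0 value and frequency sweep are arbitrary) measures the error that would result if a formant estimate simply snapped to the nearest harmonic, the extreme quantization scenario considered in this section.

```python
import numpy as np

F0 = 150.0                                   # an arbitrary fundamental (Hz)
formants = np.arange(300.0, 3000.0, 1.0)     # sweep of "true" formant locations

# Error if the estimate simply snapped to the nearest harmonic of F0
nearest = np.round(formants / F0) * F0
snap_error = formants - nearest

print(snap_error.min(), snap_error.max())    # spans roughly -F0/2 .. +F0/2
```

This reproduces the harmonics-only bound of ±F0/2 discussed below; with a full LPC analysis of real signals, energy between the harmonics shrinks the observed range.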
Intuitively, we would expect that if F0 is held constant and the formant frequency f is varied, then the formant estimation error (see Note 2, Appendix A) would oscillate – it would decrease as f approached a harmonic of F0, and increase as f moved away from it. Previous


research has shown this to be the case (Markel and Gray, 1976, p. 188; Atal and Schroder, 1974). In our study, we manipulated F0 and the locations of the formants (using the synthesized sounds) in order to characterize the estimation error in more detail. Our results are given below:

1. As F0 increases, the range of the estimation error increases linearly (Fig. 2(a)). If signal energy were present only at the harmonics of F0 (the worst-case quantization scenario), only the harmonic closest to the formant would be picked up, so the error range would be approximately −F0/2 to +F0/2. In actual speech, there is energy present between the harmonics also, so we would expect the error range to be smaller; on the basis of our simulations, this range seems to be −F0/4 to +F0/4 for LPC analysis.

2. It can be seen in the figure that the errors "oscillate" more rapidly for higher formants, i.e., estimates of higher formants are more sensitive to perturbations in F0. This is simply because

small changes in F0 accumulate and cause larger shifts in the higher harmonics of F0 (see Note 3, Appendix A).

3. The above errors cannot be avoided by using F2–F1 or F3–F2 as the measured dimensions (Fig. 2(b)). This follows partly from the above result that estimation errors behave differently in different frequency ranges. More importantly, if the locations of the formants are assumed to be independent of each other, then the errors can vary independently also, which implies that the error of the F2–F1 estimate may be as large as the sum of the magnitudes of the F1 and F2 estimation errors.

In summary, the error in estimating formant locations increases linearly with F0. If F0 is varying within a small range (e.g., F0 ± 10 Hz), the F1 estimate will vary slowly and, because F1 bandwidth is usually small, the error range will be quite large. For the same F0 fluctuation, the F2 estimate will fluctuate rapidly, but because of the large bandwidth, the error range will be smaller than for F1 (of course, the error relative to F2 is fairly small to begin with). If the formants are close together, such as F2 and F3 for high front vowels or F1 and F2 for low back vowels, then large bandwidths may present a serious problem. This issue is discussed in more detail in Section 3.3.

Fig. 2. (a) The effect of F0 on the estimation of F1, F2 and F3. Synthesized F1 = 500 Hz, F2 = 1500 Hz, F3 = 2500 Hz. (b) The effect of F0 on the estimation of F2–F1 and F3–F2.

3.2. Errors due to incorrect filter order

The crucial parameter in linear prediction is the order of the linear predictor, i.e., the number of LP coefficients. There are two usual rules of thumb for estimating it: (1) Twice the number of formants one expects to find, plus two: ideally, each formant corresponds to a damped sinusoid that can be captured by a pair of roots with the correct frequency and damping (one of the roots is the complex conjugate of the other). The two extra coefficients are there "just in case", to absorb any leftover energy in the signal. (2) The sampling frequency in kHz: if Fs = 10 kHz, for example, then one would use 10 LP coefficients. The rationale is that it takes sound approximately 1 ms to travel from the glottis to the lips, and so the statistical structure among 1 ms worth of samples is sufficient to capture the vocal tract resonances (Markel and Gray, 1976, p. 154). Both of these methods ignore systematic between-speaker or between-vowel differences. For example, back vowels usually require a higher-order filter than front vowels. In our analysis below, we shall try to show qualitatively the errors caused by incorrect filter order, and suggest a heuristic for determining the "correct" filter order for a signal.

The effects of using too few or too many filter coefficients can be best understood by the formulation of the Durbin recursion. Conceptually, this method has two stages. (1) Compute the partial correlation coefficients, also called the reflection coefficients, for all lags up to the predefined order p of the filter.
The lag-m coefficient k_m signifies the linear dependence between x(n) and x(n − m) after the influence of the intervening samples is factored out. Being a correlation coefficient, k_m can vary between +1 and −1. (2) Use k_1, k_2, ..., k_p recursively to estimate the LP coefficients for filters of orders 1, 2, ..., p. The LP coefficients for the mth-order filter are derived using k_m and the coefficients of the (m − 1)th-order filter. This step has a convenient spectral interpretation (Markel and Gray, 1976, Section 6.2.5). Let e_m(f) be the difference between the log power spectra of H_{m−1}(f) and H_m(f),

$$e_m(f) = \ln\left(\mathrm{gain}_m\,|H_m(f)|^2\right) - \ln\left(\mathrm{gain}_{m-1}\,|H_{m-1}(f)|^2\right), \tag{13}$$

where f varies from 0 to Fs/2. e_m(f) can be derived purely from H_{m−1}(f) and k_m, and oscillates between ln[(1 − k_m)/(1 + k_m)] and ln[(1 + k_m)/(1 − k_m)], reaching the extrema exactly m + 1 times (see Note 4, Appendix A). So as m increases, e_m(f) becomes more bumpy and potentially allows finer details of the power spectrum to be resolved. Fig. 3(a) shows e_8(f) for a sample two-formant signal for various values of k_8. Notice that as k_8 increases, the peaks in e_8(f), while retaining the same basic shape, become taller and sharper (since e_8(f) is in units of log power, the modest value of 1 is actually quite significant). Conversely, for k_8 near 0, e_8(f) approaches a flat line, which causes H_8(f) to be fairly similar to H_7(f). We can thus establish a necessary condition for an optimal filter order p_opt: for all m > p_opt, k_m has to be "insignificant". Using this condition, we can study the errors caused by a non-optimal filter order.

3.2.1. Estimation errors due to insufficient filter order (p < p_opt)

The LP procedure minimizes the difference (measured as mean square error) between the log power spectra of H(f) and the input signal. We can relate this to the partial correlation coefficients by observing that the power spectrum of a signal is the same as the spectrum of its autocorrelation sequence R(m). By restricting the filter order to p, we are capturing autocorrelations only up to lag p, that is, R(−p), R(−p + 1), ..., R(p − 1), R(p). This is equivalent to multiplying R(m) by a rectangular window, which is in turn equivalent to convolution of the two spectra. Since the spectrum of the rectangular window has a broad central peak with small side peaks, the convolution smooths the original power spectrum. Thus, the LP filter attempts to match the smoothed power spectrum of


Fig. 3. e_m(f) for a sample two-formant signal with F1 = 1000 Hz, F1 bandwidth = 60 Hz, F2 = 4000 Hz, F2 bandwidth = 200 Hz. (a) e_8(f) for various values of k_8; (b) e_16(f) for various values of k_16.

the original signal. This conclusion is supported by direct examination of the partial correlation coefficients: the coefficients for larger m specify finer-grain spectral information, so if p < p_opt, then the resulting filter captures only coarse spectral features (Deller et al., 1993).

3.2.2. Estimation errors due to excessive filter order (p > p_opt)

The reflection coefficients past the optimal order are either due to noise, or else they specify unnecessary fine-grain spectral information (such as the F0 harmonics). The inclusion of these coefficients is much more likely to distort rather than improve the formant estimate. We present below the results of a simulation to examine the nature of this distortion. The general idea of the simulation was to take a synthesized signal with a known p_opt, introduce noise into the signal for various combinations of p and k_p, and measure the resulting errors in the estimation of the formant peak (see Note 5, Appendix A for details of the simulation).
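The Durbin recursion that underlies this whole section can be sketched compactly. The implementation below is a standard textbook Levinson-Durbin form in NumPy (the function name and interface are ours, not the authors' code); it returns both the predictor coefficients of Eq. (1) and the reflection coefficients k_1, ..., k_p discussed above.

```python
import numpy as np

def durbin(signal, p):
    """Levinson-Durbin recursion on the autocorrelation of `signal`.

    Returns (a, k): predictor coefficients [a1, ..., ap] such that
    s~(n) = sum_k a_k s(n-k), and reflection coefficients [k1, ..., kp]."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    R = np.array([np.dot(x[: n - m], x[m:]) for m in range(p + 1)])
    a = np.zeros(p)          # predictor coefficients (sign convention of Eq. (1))
    k = np.zeros(p)          # reflection (partial correlation) coefficients
    err = R[0]               # prediction-error energy
    for m in range(1, p + 1):
        # k_m: lag-m correlation with the influence of lags 1..m-1 removed
        acc = R[m] - np.dot(a[: m - 1], R[m - 1:0:-1])
        k[m - 1] = acc / err
        # update a_1..a_m from the (m-1)th-order solution and k_m
        a_prev = a[: m - 1].copy()
        a[: m - 1] = a_prev - k[m - 1] * a_prev[::-1]
        a[m - 1] = k[m - 1]
        err *= 1.0 - k[m - 1] ** 2
    return a, k
```

With the reflection coefficients in hand, the "k < 0.15" heuristic proposed in Section 3.2.3 amounts to scanning k for the first order beyond which two consecutive |k_m| values fall below the cutoff.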

Fig. 4(a) shows the root-mean-square error for different combinations of p and k_p for a signal with a single low-bandwidth formant. Fig. 4(b) shows the errors for a signal with a single high-bandwidth formant. The single-formant signals were chosen in order to minimize interactions between the roots, so the errors reflect the influence of noise on the magnitude response of a single root. The figures are interpreted in the following manner: if a pth-order filter was used to analyze the signal, then p − p_opt extra reflection coefficients were used. The error introduced by each additional coefficient can be estimated from the figure. If p − p_opt > 1, the total error is the sum of the errors for each extra coefficient. For example, if p_opt = 6 and p = 8, then the vertical lines m = 7 and m = 8 give the errors introduced by various values of |k_7| and |k_8|, and these individual errors are simply added to get the overall expected error (see Note 6, Appendix A). There are several observations that can be made about the pattern of errors:


Fig. 4. Formant estimation error of an mth-order filter when p_opt = m − 1, for a single-formant signal. (a) Formant bandwidth = 60 Hz. (b) Formant bandwidth = 200 Hz. Note the different y-axis scales.

1. As |k_m| increases, the error increases. The reason for this can be seen in Fig. 3: k_m changes the amplitude of e_m(f), and therefore the steepness of its slopes. Since H_m(f) is derived by adding H_{m−1}(f) and e_m(f), steeper slopes in e_m(f) imply potentially greater skew in the locations of spectral peaks.

2. The error increases with filter order m. There are m + 1 extrema in e_m(f), regardless of the value of k_m or of H_{m−1}(f), so the steepness of the slopes increases with m. This is not usually a serious problem because |k_m| normally decreases with m, mitigating the effect of the steeper slopes.

3. The effect of noisy k_m's is exacerbated by the bandwidth of the formant. By definition, low-bandwidth formants have sharp spectral peaks, and high-bandwidth formants have broad peaks. All else being equal, broader peaks are more susceptible than sharp peaks to the skew introduced by e_m(f).

The error curves in Fig. 4 are based on simulations with single-formant signals, and should be regarded as order-of-magnitude estimates when applied to real speech signals, for the following two reasons. First, introducing additional formants to a signal tends to increase p_opt and the bandwidth of existing formants. Second, in the simulation the location of the single formant was always at 2500 Hz, in order to minimize the effect of the conjugate root (see Section 3.3). As the formant moves away from this "neutral" center frequency, the influence of the conjugate root will increase the bandwidth of the formant. As noted in points 2 and 3 above, both these factors will increase the skew introduced by an extra reflection coefficient. Thus, the error curves in Fig. 4 should be regarded as approximate lower bounds.

3.2.3. Estimating the optimal filter order p_opt

The two heuristics for determining p_opt (twice the number of formants + 2, or the sampling


frequency in kHz) assume a generic speaker producing a generic vowel. This assumption was tested using the following data: two native speakers of American English produced the vowels /ε/ (as in "head") and /o/ (as in "hoed") five times each in a list context. The vowels were processed using the method described at the beginning of Section 3 and, for each vowel, the reflection coefficients k_1, ..., k_20 were calculated. Fig. 5 shows the reflection coefficients of the five productions for different speaker/vowel combinations. Note that S1's productions of /ε/ have a bump at m = 10, whereas the productions of /o/ have a bump around m = 12. Similarly, S2's /ε/ and /o/ have bumps at m = 8 and m = 10, respectively. This variability suggests that it is not possible to use the

same filter order p to analyze all the productions at the same level of accuracy; p_opt may be systematically different for each vowel and each speaker. We therefore propose a heuristic for determining the optimal order p_opt of a given signal: (1) pick a representative analysis frame and compute k_1, ..., k_20. (2) Let p_min, the minimum filter order, be twice the expected number of formants. Then pick p_opt to be the smallest p > p_min such that |k_{p+1}| and |k_{p+2}| are both less than 0.15 (see Note 7, Appendix A). We used the above "k < 0.15" heuristic to estimate p_opt for each sound from Fig. 5. To get an idea of how the formant estimates changed with filter order, each sound was analyzed with seven different filter orders (p = 9, 10, ..., 15),

Fig. 5. Reflection coefficient curves for two speakers S1 and S2. Each plot shows five curves, one for each production of the vowel. The km = ±0.15 band is the cutoff region for the proposed heuristic (see text). The inset numbers are the popt's that were found using the heuristic for each of the 5 productions.


Table 1
Formant estimates for four sample vowels (one from each speaker/vowel combination in Fig. 5)

                  Speaker S1, /ɛ/             Speaker S2, /ɛ/
Filter order      F1      F2      F3          F1      F2      F3
9                 648     1776    2582        614     1863    2835
9→10              −45     −61     −114        −7      −20     179
10→11             −12     −22     −41         8       4       −338
11→12             −15     −37     −65         −6      8       60
12→13             1       0       24          0       0       17
13→14             −5      5       38          1       2       −3
14→15             3       −10     53          1       1       1

                  Speaker S1, /o/             Speaker S2, /o/
Filter order      F1      F2      F3          F1      F2      F3
9                 596     1855    2958        588     1229    2711
9→10              12      −2      2           −17     −22     −125
10→11             −82     −682    −1014       0       4       0
11→12             −37     −41     −28         4       −2      −16
12→13             −3      −12     −9          −3      −2      4
13→14             −4      −16     −9          8       −1      30
14→15             5       16      −2          −5      12      31

Rows with filter order a→b indicate the formant changes (in Hz) when the filter order is changed from a to b. Rows marked with an asterisk (*) indicate the popt identified by the "k < 0.15" heuristic.

and the peaks in the resulting LP filters were estimated using fine-resolution peak-picking. Table 1 shows the resulting changes in formant estimates for one sample vowel per speaker/vowel combination. Notice that the formants usually change in a consistent direction up to a certain filter order, after which the directions change erratically. Near the same filter order, the magnitude of the change decreases. We assume that this "critical" order is close to the true popt, and the heuristic is surprisingly effective in identifying it. A key pattern in Table 1 is the difference between the (estimated) popt's of speakers S1 and S2. Assuming that there are four measurable formants, the heuristic of "twice the expected number of formants + 2" would have been ideal for S2, but not for S1. For example, notice that the changes from p = 10 to p = 12 for S1's production of /ɛ/ are much larger than in S2's case. The difference between S1 and S2 is especially clear in the case of /o/. When the filter order is increased from p = 10 to p = 11, S1's F2 and F3 decrease by 682 and 1014 Hz, respectively. The reason is that the 10th-order filter averages F2 and F3 into a spurious F2 peak at 1853 Hz, while the 11th-order filter is able

to resolve an F2 peak at 1171 Hz and an F3 peak at 1946 Hz. In contrast, S2's formants stabilize at p = 10 and are insensitive to further increases in filter order. The above disparities suggest that different speakers have characteristically different popt's. These are probably due to physiological differences in vocal tract length and shape, but may also stem from differences in manners of production (Johnson et al., 1993b). When the reflection curves are converted to vocal tract area functions, for example, S1 consistently shows a larger front cavity (relative to the rest of the vocal tract) than S2 does for productions of the same vowel. The differences between S1's /ɛ/ and /o/ also suggest that even for a single speaker, popt may vary with the phonetic class of the production. In the face of this variability, popt is typically identified through trial and error: one performs the analysis with a default p and, if a formant location has an unreasonable value, varies p until the formant locations look "reasonable". The "k < 0.15" heuristic does not solve the basic methodological problem, but it does provide a principled way of bypassing the trial-and-error process.
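As a concrete illustration, the heuristic can be sketched in a few lines of numpy (the function names, the Hamming window, and the fallback return value are our own choices, not part of the original method): compute the reflection coefficients with the Levinson-Durbin recursion, then scan for the smallest order whose next two coefficients fall inside the 0.15 band.

```python
import numpy as np

def reflection_coefficients(frame, order=20):
    """Levinson-Durbin recursion on the frame's autocorrelation,
    returning the reflection (PARCOR) coefficients k_1, ..., k_order."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    ks = np.zeros(order)
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                      # reflection coefficient k_m
        ks[m - 1] = k
        a_prev = a.copy()
        a[1:m] = a_prev[1:m] + k * a_prev[m - 1:0:-1]
        a[m] = k
        err *= 1.0 - k * k                  # residual prediction error
    return ks

def optimal_order(ks, n_formants=4, cutoff=0.15):
    """Smallest p > 2*n_formants with |k_{p+1}| and |k_{p+2}| < cutoff."""
    p_min = 2 * n_formants
    for p in range(p_min + 1, len(ks) - 1):
        if abs(ks[p]) < cutoff and abs(ks[p + 1]) < cutoff:  # ks[p] is k_{p+1}
            return p
    return len(ks)  # no order qualified; the experimenter should inspect
```

For a decaying exponential 0.5**n (the impulse response of a one-pole filter), the recursion recovers k1 ≈ −0.5 with the later coefficients near zero, as expected for a signal whose true model order is 1.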


3.3. Errors due to root-solving

As discussed in Section 2.1, formant frequencies and bandwidths can be reasonably estimated from single roots only if the roots are well separated. If the roots are close to each other, the formant locations estimated from the roots may differ significantly from the corresponding spectral peaks. This deviance can be derived analytically for the single-formant case (see Note 8, Appendix A); Fig. 6 shows the resulting behavior for various formant bandwidths. The large positive error at low formant frequencies is due to the influence of the conjugate root. A single formant requires two roots, a1 and its conjugate a1*. As the formant frequency decreases, two things happen. First, the resonant frequency of a1, i.e. the location of the peak in its magnitude response, decreases. Second, the response of a1* becomes more prominent (see Fig. 1) and has a stronger influence on the overall spectrum. As a result, the peak of the overall magnitude response H(f) leans towards 0 Hz, and the root location therefore systematically overestimates the location of the formant peak. This deviance between root location and formant peak location is exacerbated by higher bandwidth: if the magnitude response of a1 has a broad peak, it is more susceptible to the tilt introduced by the magnitude response of a1*.

The deviances in the two-formant case are similar to those in the single-formant case. To study them, we used direct root modification with two pairs of roots, so that the magnitude spectrum had two peaks representing F1 and F2. The location and bandwidth of the "F2" root-pair were fixed at f2 = 1500 Hz and b2 = 110 Hz, respectively, and the location f1 and bandwidth b1 of the "F1" root-pair were varied. For each value of f1, the frequency response of the overall polynomial was calculated and its peaks were found using fine-resolution peak-picking.

Fig. 6. (a) The relation between root and formant peak location for a one-formant signal. The formant location was calculated using fine-resolution peak-picking. (b) The reason for the deviation: as the root (o) moves closer to 0 Hz, it comes closer to its complex conjugate (*).

Fig. 7 shows the resulting deviances. For low F1, the errors are due to the proximity of the conjugate root (they are the same as in Fig. 6). For high F1, the errors are due to the proximity of the F2 root. The location of the F2 peak would be similarly influenced by the proximity of the F1 root. In short, the spectral peaks of adjacent roots lean towards each other. Thus, if one uses root-solving to compute a measure such as F2−F1, the error is likely to be twice that shown in Fig. 7.

The errors described above are intrinsic to the root-solving method, and there is no simple way to fix the method to take adjacent roots into account. The method is sometimes used to "resolve" formants that are close to each other, such as F1 and F2 for low vowels, and F2 and F3 for high-front vowels. The results shown in Fig. 6 suggest that root-solving, though useful in detecting the existence of multiple formants, is error-prone in estimating their locations. For precise measurements of formants that are close to each other, it is therefore necessary to examine the finely resolved spectrum of the LP filter. The locations of the F1 and F2 roots can be used as lower and upper bounds, and their magnitudes offer a clue about how far the spectral peaks may have shifted towards each other.

3.4. Errors due to parabolic interpolation

Root-solving and fine-resolution peak-picking are both computationally intensive, so coarse-resolution peak-picking (a 128-point DFT followed by peak-picking and three-point parabolic interpolation) is frequently used instead. We tested the effectiveness of the interpolation by direct modification of a single pair of roots. The frequency response was analyzed using coarse-resolution peak-picking, and the location of the parabolic maximum (the "interpolated location") was compared to the location of the peak found with

Fig. 7. The relation between root and formant location in a two-formant signal. The F2 root was fixed at 1500 Hz (bandwidth = 110 Hz). Note the symmetry of the error as the F1 root approaches 0 Hz and 1500 Hz (the F1 conjugate root and the F2 root, respectively).


fine-resolution peak-picking (the "true location"). Fig. 8 plots the resulting errors. Notice that the estimate of the parabolic maximum is always biased towards the nearest harmonic, and that the error is higher with low-bandwidth formants. Fig. 9 illustrates the reason for the errors. It was produced by setting the formant location to 2480 Hz (one of the locations of maximum error in Fig. 8), and then varying the bandwidth of the formant. The crux of the problem is that the interpolation assumes that the shape of the spectral peak can be approximated by a parabola. This assumption is reasonable for high-bandwidth formants, but as the bandwidth decreases, the spectral peak becomes less parabolic and the interpolation produces worse results. If magnitude and log-magnitude spectra are used instead of power spectra, the resulting errors are very similar to Fig. 8, except that the error range is approximately −15 to 15 Hz.
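The interpolation step itself is tiny; here is a hypothetical helper (ours, not from the paper) that refines a power-spectrum peak at bin k, with df = Fs/N:

```python
import numpy as np

def parabolic_peak(power, k, df):
    """Fit a parabola through power[k-1], power[k], power[k+1] and
    return the frequency (Hz) of its vertex; df is the bin spacing Fs/N."""
    a, b, c = power[k - 1], power[k], power[k + 1]
    offset = 0.5 * (a - c) / (a - 2.0 * b + c)  # vertex offset in bins, |offset| < 0.5
    return (k + offset) * df
```

The fit is exact when the sampled peak really is a parabola; for the sharply peaked low-bandwidth spectra of Fig. 9, the vertex is pulled toward the maximum bin, which is precisely the bias plotted in Fig. 8.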

The easiest way to reduce these errors is to increase N, the length of the DFT. As the frequency spacing Fs/N decreases, the three points used in the interpolation have increasingly similar power spectral values (see Eqs. (9)–(11)), so the spectral peak around those three points can be more accurately described by a parabola. Fig. 10 shows how the maximum error of the parabolic interpolation decreases with N (the error curves flatten at 1 Hz because the granularity of the fine-resolution peak-picking, used to estimate the "true" location, is 1 Hz). The plot suggests that 512 is a good choice of DFT length for Fs = 10 kHz. In fact, the largest error shown for N = 512 (about 5 Hz) is a worst-case scenario, because it is the maximum error for a very low-bandwidth signal. It is also important to keep in mind that F0 quantization limits the overall accuracy. The expected F0 quantization error is approximately 10% of F0 (see

Fig. 8. The errors with coarse-resolution peak-picking (128-point DFT + three-point parabolic interpolation) for different formant locations and bandwidths. The "true location" was estimated using the fine-resolution method. The vertical grid lines are multiples of Fs/128 = 78.1 Hz. The formant location was restricted to the region near 2500 Hz in order to minimize the influence of the conjugate root.


Fig. 9. The power spectral peaks (dashed lines) and corresponding parabolic fits (solid lines) for different formant bandwidths. The symbol shape indicates the formant bandwidth. Open symbols mark the maxima of the power spectral peaks; the corresponding filled symbols mark the maxima of the parabolic fits. The vertical lines are multiples of Fs/128 = 78.1 Hz.

Note 9, Appendix A): if F0 is 80 Hz, then 8 Hz is the accuracy with which the locations of the formant peaks can be resolved, on average. Given this a priori error, the extra precision achieved by resolving the LP spectrum below 1 Hz (by having N = 16384, for example) is largely cosmetic. A resolution of ±2 Hz, achieved by a 512-point DFT with interpolation or a 2048-point DFT without interpolation, seems to provide a reasonable compromise between quantization and precision. Fig. 10 can also be used to estimate the error at sampling rates other than 10 kHz. To get the error at sampling rate Fs, multiply the vertical axis of the figure by (Fs / 10 kHz) and, based upon the rescaled axis, determine which N has a maximum error less than ±5 Hz. For example, if Fs = 20 kHz, the scale factor would be 20 kHz / 10 kHz = 2. A maximum error of 5 Hz or less would then require a 1024-point DFT with interpolation, or a 4096-point DFT without interpolation.

4. Quantization errors

The errors described in the above sections were analyzed in isolation from each other. To examine their combined effects, we synthesized 1014 two-formant sounds and analyzed them using the different methods. All sounds were synthesized with F0 = 120 Hz, F1 bandwidth = 60 Hz, and F2 bandwidth = 90 Hz. The F1 location was varied from 250 to 820 Hz in steps of 15 Hz, and the F2 location was varied from 920 to 1720 Hz in steps of 30 Hz. Each synthesized sound was analyzed using an 8th-order LPC filter, and the peaks in the LPC spectrum were estimated using coarse- and fine-resolution peak-picking (see Note 10, Appendix A). Fig. 11 shows the original synthesized grid


Fig. 10. The maximum error (interpolated location − true location) of three-point parabolic interpolation, for a single-formant signal. The dashed line is the maximum error of the uninterpolated location; the curve is the same at all formant bandwidths. All errors were calculated with Fs = 10 kHz.

and how it is quantized by the different analysis methods. The F0 quantization shown in Fig. 11(b) is unavoidable. However, F0 will vary widely in any corpus of natural utterances, so it is likely that the quantization will not distort the overall distribution of formants. The quantization in Fig. 11(c) is much more troublesome. Whereas the F0 quantization is relative to each utterance, the Fs/128 (DFT) quantization is absolute: it applies to all utterances and potentially distorts the overall distribution. Parabolic interpolation can mitigate this problem, but only if the formants have very high bandwidths or if the DFT length is 512 or higher. In addition, speakers can have characteristically different formant bandwidths, so the effectiveness of the interpolation varies for each speaker. Thus, errors due to DFT quantization or ineffective interpolation may be masked if an utterance corpus has many speakers with only a few utterances per

speaker (e.g., the vowel survey of Hillenbrand et al. (1995)).

5. Conclusions

The above results can be briefly summarized in the following recommendations:
1. The order of the LP filter should be matched to the utterance being analyzed whenever possible. If this is not feasible, then the order of the filter should at least be matched to each speaker (Vallabha and Tuller, submitted). In Fig. 5, for example, p = 12 and p = 10 are good matches for S1 and S2, respectively.
2. Root-solving should be used with caution for low formants or when formants are close to each other. In the latter situation, root-solving is best used to detect the existence of multiple

G.K. Vallabha, B. Tuller / Speech Communication 38 (2002) 141–160

157

Fig. 11. (a) The synthesized grid. (b) Fine-resolution analysis: 8th-order LPC followed by 8192-point DFT, peak-picking and parabolic interpolation. The gridlines are multiples of F0 = 120 Hz. (c) Coarse-resolution analysis: 8th-order LPC followed by 128-point DFT and peak-picking. The gridlines are multiples of Fs/128 = 10,000/128 ≈ 78 Hz. (d) An illustrative error field. The bases of the arrows are the synthesized formants and the tips are the estimated formants.

formants. The locations of the roots bracket the locations of the spectral peaks and can thus guide the peak-picking algorithm.
3. When estimating the locations of the spectral peaks, the length of the DFT should be at least 512 (with parabolic interpolation) or 2048 (without interpolation). If Fs is not 10 kHz, then Fig. 10 should be used to determine the DFT length.

The reduction in error from following these recommendations can be psychophysically significant. For trained listeners, the difference limen for isolated steady-state vowels is about 14 Hz for

formant frequencies less than 800 Hz and 1.5% for higher frequencies (Kewley-Port and Watson, 1994). If the results of the LPC analysis are used for resynthesis, then formant frequency changes of 15 to 60 Hz (the approximate magnitude of the errors discussed in this paper) may alter the perceptual quality of the resynthesized vowel. The errors described in this paper have been discussed in the context of steady-state (i.e. stationary) signals. An analysis of these errors in nonstationary signals is beyond the scope of this paper, but it is possible to make some general observations. First, a steady F0 causes certain frequencies (the F0 harmonics) to be preferred during formant


estimation. A fluctuating F0 spreads the bias so that there is no single set of preferred harmonics. In such situations, averaging the formant estimates over adjacent analysis frames can be effective in reducing the F0 quantization. Second, it is still possible to have errors due to incorrect filter order in signals with slowly changing formants. In such cases, however, speech psychologists are more interested in the onset, direction, and rate of change of the formants than in their precise values. Based on our experience with diphthong analyses, we have found that it is sufficient to use a single filter order for the entire utterance, as long as that order is matched to the speaker (see Recommendation 1 above). Finally, an LPC model is, by definition, a stationary approximation of a nonstationary signal. The evaluation of a model (i.e. a set of AR coefficients) therefore does not depend on the stationarity of the original signal. Hence, the two sources of error in model evaluation – root-solving and parabolic interpolation – are also an issue for nonstationary signals, and Recommendations 2 and 3 are still relevant. Although our discussion has been restricted to sources of errors in the estimation of formant location, the same sources (incorrect filter order, exclusive reliance upon root-solving, etc.) can also affect the estimation of formant bandwidth. In fact, the larger conclusion from our study is that a uniform analysis method will not result in a uniform level of error across all productions. The customization of the analysis method to the speaker and the phonetic class of the production (such as the heuristic for choosing the optimal filter order) is much more likely to achieve a uniform level of error. This can in turn lead to a more veridical distribution of sounds in the formant space, especially if the corpus includes a wide variety of speakers and utterances.
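The three recommendations can be strung together into a compact end-to-end sketch (all names, parameter values, and the ideal impulse-train excitation are our own simplifications, not a production analyzer): synthesize a two-formant "vowel", fit an order-matched LP filter, and pick peaks from a densely resolved LP spectrum.

```python
import numpy as np

FS = 10_000.0

def resonator(x, f, bw, fs=FS):
    """Run x through one two-pole resonance (a single formant)."""
    r = np.exp(-np.pi * bw / fs)
    c1, c2 = 2 * r * np.cos(2 * np.pi * f / fs), -r * r
    y = np.zeros(len(x) + 2)                      # two zero-initialized taps
    for n in range(len(x)):
        y[n + 2] = x[n] + c1 * y[n + 1] + c2 * y[n]
    return y[2:]

def lpc(frame, order):
    """Autocorrelation-method LPC: solve the Toeplitz normal equations."""
    w = frame * np.hamming(len(frame))
    r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
    R = r[np.abs(np.arange(order)[:, None] - np.arange(order))]
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1:])))

def formants(frame, order=4, n_fft=4096, fs=FS):
    """Peaks of the LP spectrum via a dense DFT + simple peak-picking."""
    a = lpc(frame, order)
    mag = 1.0 / np.abs(np.fft.rfft(a, n_fft))
    peaks = [k for k in range(1, len(mag) - 1)
             if mag[k] > mag[k - 1] and mag[k] > mag[k + 1]]
    return [k * fs / n_fft for k in peaks]

# Two-formant vowel-like signal: a 100 Hz impulse train through two resonators
x = np.zeros(4000)
x[::100] = 1.0
y = resonator(resonator(x, 530.0, 80.0), 1480.0, 120.0)
est = formants(y[1000:1512])
```

On this synthetic signal, the order-4 analysis recovers both formants to within a few tens of Hz, i.e. at roughly the F0-quantization limit discussed in Section 4.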

Acknowledgements We thank Nurgun Erdol for reviewing the paper, and James Hillenbrand for his advice during the initial phase of this study. This research was supported by NIMH Predoctoral Training Grant MH19116 and NIMH Grant MH-42900.

Appendix A. Notes

1. The root zk was computed from the resonant frequency fk and bandwidth bk; the other root zk* is simply the conjugate of zk:

   zk = exp(i 2π fk / Fs) exp(−π bk / Fs).   (A.1)

2. Estimation error = estimated formant frequency − synthesized frequency. All estimates were done using fine-resolution peak-picking.

3. Let f0 and f be fundamental and formant frequencies such that f = a f0 (a is a positive integer), so that the estimation error is 0. Now let f0 be perturbed by δf, so the new fundamental is f0′ = f0 + δf ⇒ f0 = f0′ − δf. Then

   f = a(f0′ − δf) = a f0′ − a δf.   (A.2)

As f0′ is varied, the estimation error will again be 0 when f = (a − 1) f0′ ⇒ f = a f0′ − f0′. Into this relation we substitute Eq. (A.2) and get

   a f0′ − a δf = a f0′ − f0′ ⇒ a δf = f0′ = f0 + δf ⇒ δf ≡ δf_c = f0/(a − 1) = f0²/(f − f0).   (A.3)

This equation for δf_c holds even when f is not initially lined up with a harmonic of f0. Note that δf_c, the perturbation needed for a complete error cycle, is proportional to f0 and inversely proportional to f.

4. em(f) can be derived purely from Am−1(z) and km, as follows (Markel and Gray, 1976, p. 137):

   ψm(ω) = mω + 2 Angle{Am−1(e^{iω})},   (A.4)

   em(f) = ln(1 − km²) − 2 ln|1 + km exp(iψm(2πf/Fs))|.   (A.5)

5. A signal was synthesized with a single formant at f = 2500 Hz (the frequency was chosen to minimize the effects of the conjugate root) and its reflection coefficients k1, …, kp were computed. Two coefficients were sufficient to model the signal, hence popt = 2. Only k1 and k2 were retained and all other coefficients were zeroed out. Then, k3 was varied between 0.2 and


0.05. For each value of k3, ψ3(ω) was perturbed [ψ3(ω) + δψ, −π < δψ < +π; this simulates the effect of extra roots and miscellaneous noise in the signal]. For each perturbation, the corresponding e3(f) and H3(f) (= H2(f) + e3(f)) were computed, and the error between the formant locations of H2(f) and H3(f) was measured. Finally, the errors across all the perturbations were collapsed into a single root-mean-square value. This entire procedure was repeated with k4, k5, up to k17.

6. All km for m > 2 were set to zero (see Note 5, Appendix A). When the errors with m = 5 (for example) were being examined, only then was k5 allowed to be nonzero (but k3 and k4 would still be zero). If k3 and k4 are 0, then e3(f) and e4(f) are also zero, so we can study the errors with m = 5 independently of the contributions of m = 3 and m = 4. Since the error at each filter order m was characterized independently of the others, the errors can be accumulated across different m.

7. To find an appropriate value for the cutoff, we analyzed a large corpus of vowel productions using filter orders ranging from p = 9 to p = 17. We plotted the resulting data (see Table 1 for an example) in F1/F2 space and manually estimated the popt's. A reflection coefficient cutoff (0.15) was then chosen so that the popt's identified by the "k < cutoff" heuristic would match as many of the manually estimated popt's as possible. In practice, the sharp cutoff tends to overestimate popt in about 10% of the sounds (the heuristic rejects p if |k_{p+1}| is even slightly greater than 0.15). In such cases, the experimenter needs to override the heuristic and specify a more appropriate filter order. Also, note that the first step of the heuristic is to compute k1, …, k20. This range is appropriate for Fs = 10 kHz, but since popt is generally proportional to Fs, higher sampling rates may require a larger range (for example, Fs = 16 kHz may require the range k1, …, k25).

8. The deviance between root and spectral peak location for a single pair of roots a and a*: we know a priori that there is only one extremum, so there is no need to compute the second derivative.

   |H(e^{iω})|² = |e^{iω} − a|⁻² |e^{iω} − a*|⁻²,   (A.6)

   d|H(e^{iω})|²/dω = 0 ⇒ ω = cos⁻¹[Re(a)(1 + r²)/(2r²)],   (A.7)

where r = |a|.

9. The "expected F0 quantization error" was estimated from the data in Fig. 2(a). For each value of F0, the quantization error relative to F0 was calculated. The root-mean-square relative error across all F0 values was 0.12 for F1, 0.10 for F2, and 0.17 for F3. The "10% of F0" value was adopted as a convenient rule of thumb.

10. We did not use the heuristic to find the right popt for each of the 1014 sounds. This was partly due to practicality, but also note that the expected error would be around 10–15 Hz (due to the low F1 bandwidth; see Fig. 4(a)), which would be unnoticeable at the scale of the plots in Fig. 11. The sounds were also analyzed using root-solving but, again because of the low bandwidths, the results were grossly similar to Fig. 11(b).
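The closed form in Note 8 can be checked numerically. The sketch below (ours, not from the paper) places a pole from (frequency, bandwidth) as in Eq. (A.1) and compares Eq. (A.7) against a brute-force search over a dense frequency grid:

```python
import numpy as np

def peak_vs_root(f_root, bw, fs=10_000.0, n_fft=1 << 17):
    """Spectral-peak location of a single pole pair: brute force vs Eq. (A.7)."""
    r = np.exp(-np.pi * bw / fs)
    a = r * np.exp(2j * np.pi * f_root / fs)
    # |H| peaks where the denominator |e^{iw} - a| |e^{iw} - a*| is smallest
    w = 2.0 * np.pi * np.arange(1, n_fft // 2) / n_fft
    denom = np.abs(np.exp(1j * w) - a) * np.abs(np.exp(1j * w) - np.conj(a))
    f_grid = w[np.argmin(denom)] * fs / (2.0 * np.pi)
    # Closed form from Eq. (A.7), with r = |a|
    f_closed = np.arccos(a.real * (1 + r * r) / (2 * r * r)) * fs / (2.0 * np.pi)
    return f_grid, f_closed
```

For a 200 Hz-bandwidth root placed at 300 Hz, both values agree and sit well below 300 Hz (the peak leans toward 0 Hz, as in Fig. 6), while a root at 2500 Hz, where Re(a) ≈ 0, shows almost no shift.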

References

Atal, B.S., Schroeder, M.R., 1974. Recent advances in predictive coding – applications to speech synthesis. In: Fant, C.G.M. (Ed.), Proc. 1974 Stockholm Speech Communications. Wiley, New York. Vol. 1, pp. 27–31. Chandra, S., Lin, W.C., 1974. Experimental comparison between stationary and non-stationary formulations of linear prediction applied to speech. IEEE Trans. Acoust. Speech Signal Process. 22, 403–415. Deller, J.R., Proakis, J.G., Hansen, J.H.L., 1993. Discrete-time Processing of Speech Signals. Macmillan, New York. Hillenbrand, J., Getty, L.A., Clark, M.J., Wheeler, K., 1995. Acoustic characteristics of American English vowels. J. Acoust. Soc. Am. 97, 3099–3111. Johnson, K., Flemming, E., Wright, R., 1993a. The hyperspace effect – Phonetic targets are hyperarticulated. Language 69, 505–528. Johnson, K., Ladefoged, P., Lindau, M., 1993b. Individual differences in vowel production. J. Acoust. Soc. Am. 94, 701–714. Kewley-Port, D., Watson, C.S., 1994. Formant-frequency discrimination for isolated English vowels. J. Acoust. Soc. Am. 95, 485–496. Klatt, D.H., 1980. Software for a cascade/parallel formant synthesizer. J. Acoust. Soc. Am. 67, 971–995. Markel, J.D., Gray Jr., A.H., 1976. Linear Prediction of Speech. Springer, Berlin.


Maurer, D., Landis, T., 1995. F0-dependence, number alteration, and non-systematic behavior of the formants in German vowels. Internat. J. Neuroscience 83, 25–44. Rabiner, L.R., Schafer, R.W., 1978. Digital Processing of Speech Signals. Prentice-Hall, New Jersey. Repp, B.H., Williams, D.R., 1987. Categorical tendencies in imitating self-produced isolated vowels. Speech Communication 6, 1–14.

Snell, R.C., Milinazzo, F., 1993. Formant location from LPC analysis data. IEEE Trans. Speech Audio Process. 1, 129–134. Vallabha, G.K., Tuller, B., submitted. Choice of filter order in LPC analysis of speech. Welling, L., Ney, H., 1998. Formant estimation for speech recognition. IEEE Trans. Speech Audio Process. 6, 36–48.