Computer Speech and Language (1992) 6, 15-36
Time-frequency spectral estimation of speech
D. Rainton* and S. J. Young†
*ATR, Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto 619-02, Japan and †Cambridge University Department of Engineering, Trumpington Street, Cambridge CB2 1PZ, U.K.
Abstract
In recent years there has been a growing interest within the speech research community in spectral estimators which circumvent the traditional quasi-stationary assumption and provide greater time-frequency (t-f) resolution than conventional spectral estimators, such as the short time Fourier power spectrum (STFPS). One distribution in particular, the Wigner distribution (WD), has attracted considerable interest. However, experimental studies have indicated that, despite its improved t-f resolution, employing the WD as the front end of a speech recognition system actually reduces recognition performance; only by explicitly re-introducing t-f smoothing into the WD are recognition rates improved. In this paper we provide an explanation for these findings. By treating the spectral estimation problem as one of optimizing a bias-variance trade-off, we show why additional t-f smoothing improves recognition rates, despite reducing the t-f resolution of the spectral estimator. A practical adaptive smoothing algorithm is presented, which attempts to match the degree of smoothing introduced into the WD to the time varying quasi-stationary regions within the speech waveform. The recognition performance of the resulting adaptively smoothed estimator is found to be comparable to that of conventional filterbank estimators, yet the average temporal sampling rate of the resulting spectral vectors is reduced by around a factor of 10.
1. Spectral estimation for speech recognition
1.1. Introduction
Most researchers would agree that the speech signal is best represented for recognition purposes as a joint function of both time and frequency, or a parameterization thereof. Such a joint representation is both intuitively and biologically plausible. Biological studies, for instance, have revealed that the ear acts essentially as a type of spectral analyser (Pickles, 1982). Thus, by implication, the spectral shape of the speech signal must contain important cues as to its information content. However, a spectral representation alone results in poor recognition performance, despite the fact that it may be a complete description of the signal (complete in the sense that the original acoustic waveform is uniquely recoverable). It is clear that the temporal ordering of speech events is also important for understanding the orthographic content of the acoustic waveform. Thus, any useful representation must have both time and frequency dimensions. However, despite its obvious intuitive appeal, the mathematical description of temporal frequency variation has proven surprisingly difficult. This difficulty has arisen from the fact that the properties we would ideally require of such distributions (Loynes, 1968) are mutually inconsistent. As a consequence of this fact, numerous time-frequency (t-f) descriptions (Lowe, 1986) have been proposed over the years, each satisfying only a nonconflicting subset of the ideal property set.
0885-2308/92/010015+22 $03.00/0 © 1992 Academic Press Limited
One of the first requirements of any t-f spectral estimation task is the selection of an appropriate estimator. In speech recognition this selection is largely based on previous empirical experience. One area of spectral analysis which has achieved good results is that of perceptual modelling, i.e. modifying a given estimator to imitate the way in which we believe that the human ear works. Yet, while such auditory modelling schemes have undoubtedly led to improved recognition performance, the question as to why this should be the case is left unanswered. Simply saying that one spectral estimator is better than another because it models the behaviour of the human auditory system is, in our view, not enough. After all, a hidden Markov model, for example, makes no attempt to model the human brain, so why should it perform well with a human auditory style front end? Clearly there is something fundamentally right about these perceptually-based spectral estimators, since they provide improved performance when coupled with almost any type of pattern recognition algorithm. If we could determine in an objective fashion exactly what they are optimizing, then perhaps we could hope to improve upon them.
In the first half of this paper we consider the spectral estimation problem from the perspective of one particular family of t-f spectral estimators, the t-f smoothed Wigner distributions (Classen & Mecklenbrauker, 1980a). This family of distributions was of particular interest as several authors have reported poor speech recognition performance when using the unsmoothed Wigner distribution (Velez & Absher, 1989; Wilbur & Taylor, 1988). To improve recognition performance, t-f smoothing had to be explicitly re-introduced into the distribution, the reasons for this being unclear. An explanation of these findings was one of the motivations for this paper.
1.2. The Wigner distribution: the bias and variance of the resulting spectral estimates
The mathematical properties of the WD have already been well documented elsewhere (Classen & Mecklenbrauker, 1980a,b); consequently, this section contains only a minimal introduction. The WD of the harmonizable stochastic process $S(t)$ is defined by the equation (Martin & Flandrin, 1985) (throughout this paper, integrals for which no limits are given should be interpreted over the range $-\infty$ to $+\infty$):

$$W_S(t,\omega) = \int R_S(t,\tau)\, e^{-j\omega\tau}\, d\tau \tag{1}$$

where $R_S(t,\tau)$ is the time-varying covariance kernel

$$R_S(t,\tau) = E[S(t+\tau/2)\, S^*(t-\tau/2)] \tag{2}$$

and $E[\cdot]$ denotes expectation over an ensemble of realizations. In practice ensemble averages are rarely available. For non-stationary random processes, such as speech, this necessitates replacing ensemble averages with local time averages. However, since speech is not a quasi correlation-ergodic process this assumption introduces a bias into the resulting spectral estimates. These biased spectral estimates are defined by the equation:

$$\hat{W}_S(t,\omega;\varphi) = \int \hat{R}_S(t,\tau;\psi)\, e^{-j\omega\tau}\, d\tau \tag{3}$$

where $\hat{R}_S(t,\tau;\psi)$ is in turn defined by the equation:

$$\hat{R}_S(t,\tau;\psi) = \int \psi(t_1,\tau)\, S(t-t_1+\tau/2)\, S^*(t-t_1-\tau/2)\, dt_1 \tag{4}$$
and $\psi(t_1,\tau)$ is a window function, usually of finite spread and monotonically decreasing away from the origin. In the window function $\psi(t_1,\tau)$ the first temporal variable, $t_1$, replaces ensemble averaging with time averaging, as discussed above. The second temporal variable, $\tau$, imposes the frequency resolution limits on the spectral estimator and is a consequence of having to use a finite length window in computing the signal correlation about any point in time; a finite window is required both because of the non-stationary nature of the speech signal and for obvious computational reasons. The window $\psi(t,\tau)$ is related to a corresponding t-f window $\varphi(t,\omega)$ by a 1-D Fourier transform, i.e.:

$$\varphi(t,\omega) = \int \psi(t,\tau)\, e^{-j\omega\tau}\, d\tau. \tag{5}$$
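To make the role of the lag window concrete, the following is a minimal discrete-time sketch of a pseudo-Wigner estimate at a single time index. It is an illustration under stated assumptions rather than the estimator used in this paper: the function name `pseudo_wigner_slice`, the Gaussian lag window and its spread `sigma_lag`, and the frequency grid are all hypothetical choices, and no time averaging is applied.

```python
import cmath
import math


def pseudo_wigner_slice(x, n, half_lag, num_freqs, sigma_lag):
    """Pseudo-Wigner estimate of the complex sequence x at time index n.

    Returns W[k] for normalized frequencies theta_k = pi * k / num_freqs
    (radians per sample). The Gaussian lag window (spread sigma_lag, in
    samples) is an assumed stand-in for the lag part of the smoothing window.
    """
    slice_values = []
    for k in range(num_freqs):
        theta = math.pi * k / num_freqs
        acc = 0j
        for m in range(-half_lag, half_lag + 1):
            if not (0 <= n + m < len(x) and 0 <= n - m < len(x)):
                continue  # skip lags that fall outside the record
            lag_window = math.exp(-0.5 * (m / sigma_lag) ** 2)
            local_corr = x[n + m] * x[n - m].conjugate()
            # The two samples are 2m apart, hence the factor exp(-j * theta * 2m).
            acc += lag_window * local_corr * cmath.exp(-2j * theta * m)
        slice_values.append(acc.real)  # imaginary parts cancel by lag symmetry
    return slice_values


# A complex tone at theta0 should give a slice that peaks at the matching bin.
theta0 = math.pi * 8 / 32
tone = [cmath.exp(1j * theta0 * n) for n in range(64)]
w = pseudo_wigner_slice(tone, n=32, half_lag=16, num_freqs=32, sigma_lag=6.0)
peak_bin = max(range(len(w)), key=lambda k: w[k])
print(peak_bin)
```

Shortening `sigma_lag` widens the peak in frequency, which is the frequency resolution limit imposed by the lag variable described above.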
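Although in theory $\varphi(t,\omega)$ can be any arbitrary function, for the purposes of this paper we assume that it is a non-orientated 2-D Gaussian window function, i.e.:

$$\varphi(t,\omega) = \frac{1}{\sigma_t\sigma_\omega} \exp\left(-\frac{t^2}{2\sigma_t^2}\right) \exp\left(-\frac{\omega^2}{2\sigma_\omega^2}\right). \tag{6}$$

We will furthermore require that our t-f smoothed spectral estimates be positive. This implies that $\sigma_t$ and $\sigma_\omega$ must obey the inequality:

$$\sigma_t\sigma_\omega \ge \frac{1}{2}. \tag{7}$$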
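Note that when the equality in Equation (7) is satisfied (i.e. $\sigma_t\sigma_\omega = 1/2$) then the t-f smoothed WD is identical to the short time Fourier power spectrum (STFPS) (Riley, 1987). The normalizing factor $1/\sigma_t\sigma_\omega$ was chosen such that:

$$\iint \varphi(t,\omega)\, dt\, d\omega = 2\pi \tag{8}$$

in order to ensure that the equality in Equation (9) is satisfied. Via Moyal's formula (Classen & Mecklenbrauker, 1980a) Equation (9) implies that the expected energy of the smoothed spectral estimates $\hat{W}_S(t,\omega;\varphi)$ is equal to that of $W_S(t,\omega)$, which is clearly desirable. Two important properties of the window function $\varphi(t,\omega)$ are its time spread $(\Delta t)^2$ and its frequency spread $(\Delta\omega)^2$, defined here respectively by:

$$(\Delta t)^2 = \frac{1}{E_\varphi} \iint t^2\, |\varphi(t,\omega)|^2\, dt\, d\omega \tag{10}$$

$$(\Delta\omega)^2 = \frac{1}{E_\varphi} \iint \omega^2\, |\varphi(t,\omega)|^2\, dt\, d\omega \tag{11}$$

where $E_\varphi$ is defined as:

$$E_\varphi = \iint |\varphi(t,\omega)|^2\, dt\, d\omega = \frac{\pi}{\sigma_t\sigma_\omega}. \tag{12}$$

Note that these definitions assume that the window function is symmetric about the origin. As stated earlier, the spectral estimate $\hat{W}_S(t,\omega;\varphi)$ is a biased estimate of the spectrum $W_S(t,\omega)$. Bias is defined by:

$$\mathrm{bias}[\hat{W}_S(t,\omega;\varphi)] = E[\hat{W}_S(t,\omega;\varphi)] - W_S(t,\omega) \tag{13}$$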
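and may be interpreted in this case as a measure of the t-f 'blurring' introduced into the spectrum by the window function. Appendix A shows that, for the case of a suitably compact t-f smoothing window $\varphi(t,\omega)$, the spectral bias can be approximated by the equation:

$$\mathrm{bias}[\hat{W}_S(t,\omega;\varphi)] \approx \frac{1}{4\pi}\left[\nabla_t^2 W_S(t,\omega) \iint t_1^2\, \varphi(t_1,\omega_1)\, dt_1\, d\omega_1 + \nabla_\omega^2 W_S(t,\omega) \iint \omega_1^2\, \varphi(t_1,\omega_1)\, dt_1\, d\omega_1\right] \tag{14}$$

where $\nabla_v^n$ represents the partial derivative $\partial^n/\partial v^n$. Separating $\varphi(t,\omega)$ [Equation (6)] into its time and frequency constituents, i.e.:

$$\varphi(t,\omega) = \theta_t(t)\, \theta_\omega(\omega) \tag{15}$$

where:

$$\theta_t(t) = \frac{1}{\sigma_t} \exp\left(-\frac{t^2}{2\sigma_t^2}\right) \tag{16}$$

$$\theta_\omega(\omega) = \frac{1}{\sigma_\omega} \exp\left(-\frac{\omega^2}{2\sigma_\omega^2}\right) \tag{17}$$

allows Equation (14) to be re-expressed in the form:

$$\mathrm{bias}[\hat{W}_S(t,\omega;\varphi)] \approx \frac{1}{2\sqrt{2\pi}}\left[\nabla_t^2 W_S(t,\omega) \int t_1^2\, \theta_t(t_1)\, dt_1 + \nabla_\omega^2 W_S(t,\omega) \int \omega_1^2\, \theta_\omega(\omega_1)\, d\omega_1\right]. \tag{18}$$

Since the integral terms in Equation (18) are equal to $\sqrt{2\pi}\,\sigma_t^2$ and $\sqrt{2\pi}\,\sigma_\omega^2$ we get the following simplification:

$$\mathrm{bias}[\hat{W}_S(t,\omega;\varphi)] \approx \frac{1}{2}\left[\sigma_t^2\, \nabla_t^2 W_S(t,\omega) + \sigma_\omega^2\, \nabla_\omega^2 W_S(t,\omega)\right]. \tag{19}$$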
Thus, it can be seen that the bias introduced into the spectral estimates depends both on the shape of the t-f window $\varphi(t,\omega)$ and the local curvature of $W_S(t,\omega)$. It should be clear that the bias is a monotonic increasing function of $\sigma_t$ and $\sigma_\omega$. Normally, when selecting an appropriate t-f smoothing window a great deal of importance is attached to the bias of the resulting estimates: too much bias, or t-f blurring, and important transitional information is lost; too little bias and unwanted voicing or pitch information appears in the spectrum. However, there is also another important, though often ignored, performance measure that can be attributed to the t-f smoothed estimates $\hat{W}_S(t,\omega;\varphi)$, namely their variance, defined as:

$$\mathrm{var}[\hat{W}_S(t,\omega;\varphi)] = E\left[\left(\hat{W}_S(t,\omega;\varphi) - E[\hat{W}_S(t,\omega;\varphi)]\right)^2\right]. \tag{20}$$

It can be shown (see Appendix B) that if $S(t)$ is a white zero mean complex analytic Gaussian process then the variance of the spectral estimates can be approximated by the equation:

$$\mathrm{var}[\hat{W}_S(t,\omega;\varphi)] \approx \frac{|S_t(\omega)|^4}{2\pi} \iint |\varphi(t',\omega')|^2\, dt'\, d\omega' \tag{21}$$

where $|S_t(\omega)|^2$ is the power spectral density of the tangential stationary process approximating $S(t)$ at $t$ (see Appendix B). Equation (21) is a straightforward generalization of the discrete time case treated by Martin and Flandrin (1985). It is important to note that Equation (21) implies $\mathrm{var}[\hat{W}_S(t,\omega;\varphi)]$ is a monotonic decreasing function of the smoothing window parameters $\sigma_t$ and $\sigma_\omega$, since evaluating the integral on the right hand side of Equation (21) gives:

$$\mathrm{var}[\hat{W}_S(t,\omega;\varphi)] \approx \frac{|S_t(\omega)|^4}{2\pi} \iint \frac{1}{\sigma_t^2\sigma_\omega^2} \exp\left(-\frac{t'^2}{\sigma_t^2}\right) \exp\left(-\frac{\omega'^2}{\sigma_\omega^2}\right) dt'\, d\omega' = \frac{|S_t(\omega)|^4}{2\sigma_t\sigma_\omega}. \tag{22}$$
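As a numerical sanity check on the window quantities used in this section, the sketch below integrates a separable 2-D Gaussian window and its square on a grid. It assumes the window is scaled by 1/(σ_t σ_ω), so that it integrates to 2π as required by the normalization in Appendix A; the function names and the midpoint-rule grid are hypothetical choices made for illustration. By standard Gaussian integrals the window energy is then π/(σ_t σ_ω), which shrinks as either spread grows.

```python
import math


def gaussian_tf_window(t, w, sigma_t, sigma_w):
    """Separable 2-D Gaussian window with the assumed 1/(sigma_t*sigma_w) scaling."""
    return (math.exp(-t * t / (2 * sigma_t ** 2))
            * math.exp(-w * w / (2 * sigma_w ** 2)) / (sigma_t * sigma_w))


def integrate_tf(func, sigma_t, sigma_w, steps=200, extent=8.0):
    """Midpoint-rule integral of func(t, w) over +/- extent standard deviations."""
    dt = 2 * extent * sigma_t / steps
    dw = 2 * extent * sigma_w / steps
    total = 0.0
    for i in range(steps):
        t = -extent * sigma_t + (i + 0.5) * dt
        for j in range(steps):
            w = -extent * sigma_w + (j + 0.5) * dw
            total += func(t, w)
    return total * dt * dw


def window_summary(sigma_t, sigma_w):
    """Return (mass, energy): the integral of the window and of its square."""
    mass = integrate_tf(
        lambda t, w: gaussian_tf_window(t, w, sigma_t, sigma_w), sigma_t, sigma_w)
    energy = integrate_tf(
        lambda t, w: gaussian_tf_window(t, w, sigma_t, sigma_w) ** 2, sigma_t, sigma_w)
    return mass, energy


# The mass stays at 2*pi while the energy falls as the t-f spread increases.
for sigma_t, sigma_w in [(1.0, 0.5), (2.0, 0.5), (2.0, 1.0)]:
    mass, energy = window_summary(sigma_t, sigma_w)
    print(sigma_t, sigma_w, round(mass, 4), round(energy, 4))
```

The falling energy is the quantity that drives the variance reduction discussed above, while the fixed mass keeps the overall level of the smoothed spectrum unchanged.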
In summary, two important performance measures that can be associated with the t-f spectral estimate $\hat{W}_S(t,\omega;\varphi)$ are its bias and variance. Bias has been shown to be a monotonic increasing function of the t-f smoothing window parameters $\sigma_t$ and $\sigma_\omega$, while variance has been shown to be a monotonic decreasing function of these same two parameters. In the next section we investigate the consequences of these two measures, i.e. bias and variance, for speech recognition performance.

1.3. The effect of spectral bias and variance in speech recognition

In this section we only consider the problem of isolated word recognition, although the ideas presented are easily extendible to include any fundamental speech unit (e.g. phonemes, diphones, etc.). We also assume here a simplified model of the word generation process, where each orthographic word is modelled by a single unique stochastic process, which produces all acoustic realizations of that word. Given two different stochastic processes, $U(t)$ and $V(t)$, corresponding to two different words in the recognizer vocabulary, we wish to choose $\varphi(t,\omega)$ so as to minimize the probability of recognizing estimates $\hat{W}_V(t,\omega;\varphi)$ as having been generated by the process $U(t)$, and vice versa. In other words, we wish to choose $\varphi(t,\omega)$ so as to maximize the word recognition rate. Viewing the problem from a geometric point of view, each t-f spectral estimate will occupy a single point in an infinite dimensional Euclidean space. We would reasonably expect the spectral estimates associated with different words to produce different, identifiable, perhaps overlapping clusters in this space. An obvious optimization criterion would be to choose $\varphi(t,\omega)$ so as to maximize the spacing between the cluster centroids, while simultaneously minimizing their spread. Assuming a Euclidean distance measure, then the spacing between cluster centroids is by definition:
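$$D(E[\hat{W}_U], E[\hat{W}_V]) = \iint \left|E[\hat{W}_U(t,\omega;\varphi)] - E[\hat{W}_V(t,\omega;\varphi)]\right|^2 dt\, d\omega. \tag{23}$$

Equation (23) can be re-expressed in the form (Appendix C):

$$D(E[\hat{W}_U], E[\hat{W}_V]) = \iint |\Phi(\varepsilon,\tau)|^2\, |A_U(\varepsilon,\tau) - A_V(\varepsilon,\tau)|^2\, d\varepsilon\, d\tau \tag{24}$$

where $A_U(\varepsilon,\tau)$ is the ambiguity function of $U(t)$ [Equation (73)] and $\Phi(\varepsilon,\tau)$ is obtained from $\varphi(t,\omega)$ via a 2-D Fourier transform [Equation (76)] (Rihaczek, 1969). Importantly, Appendix C shows that if $\varphi(t,\omega)$ is an appropriately normalized [i.e. satisfies Equation (8)] real separable 2-D Gaussian, then $|\Phi(\varepsilon,\tau)|^2$ must satisfy the inequality:

$$|\Phi(\varepsilon,\tau)|^2 \le 1 \quad \forall\, \varepsilon, \tau. \tag{25}$$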
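The bound in Equation (25) can be illustrated numerically. The sketch below assumes a separable Gaussian window scaled to integrate to 2π and a 2-D transform normalized by 1/(2π), so that Φ(0, 0) = 1; under those assumptions (the transform convention of Appendix C is not reproduced here) the squared magnitude has the closed form exp(-σ_t² ε²) exp(-σ_ω² τ²). The function names are hypothetical.

```python
import math


def phi_squared(eps, tau, sigma_t, sigma_w):
    """Closed-form |Phi(eps, tau)|^2 under the assumed normalization."""
    return math.exp(-(sigma_t * eps) ** 2) * math.exp(-(sigma_w * tau) ** 2)


def phi_time_factor_numeric(eps, sigma_t, steps=4000, extent=10.0):
    """Numerical 1-D transform of the time factor, scaled to equal 1 at eps = 0."""
    dt = 2 * extent * sigma_t / steps
    total = 0.0
    for i in range(steps):
        t = -extent * sigma_t + (i + 0.5) * dt
        # The window is even in t, so only the cosine part of the transform survives.
        total += math.exp(-t * t / (2 * sigma_t ** 2)) * math.cos(eps * t)
    return total * dt / (math.sqrt(2 * math.pi) * sigma_t)


# The weight never exceeds one and falls as the window spread increases.
for spread in (0.5, 1.0, 2.0):
    print(spread, phi_squared(0.5, 0.5, spread, spread))
```

Because the weight applied to the difference of ambiguity functions falls everywhere except at the origin as the window spreads, wider windows can only bring the cluster centroids closer together.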
Furthermore, since $|\Phi(\varepsilon,\tau)|^2$ is everywhere (apart from $\varepsilon = \tau = 0$) a monotonic decreasing function of the t-f spread of $\varphi(t,\omega)$, $D(E[\hat{W}_U], E[\hat{W}_V])$ must also be a monotonic decreasing function of the t-f spread of the smoothing window $\varphi(t,\omega)$. Hence, to increase the distance between cluster centroids we must reduce the spread of the smoothing window. From Equation (19) this can be seen to be equivalent to reducing the bias of the spectral estimates. However, this conflicts with the requirement of minimizing the variance of the different individual clusters, since Equation (22) shows that $\mathrm{var}[\hat{W}_S(t,\omega;\varphi)]$ is a decreasing function of the smoothing window spread. Hence, the best that we can hope for is to select compromise values for $\sigma_t$ and $\sigma_\omega$ which minimize some combination of the bias and variance of the spectral estimates. A mathematically convenient optimization criterion is to minimize the expected sum of the variance and squared bias, which is equivalent to minimizing the mean squared error of the estimates (Rainton, 1990). Another possible approach, instead of minimizing a residual error, would be to maximize some discriminant measure, thus making use of important class membership information. However, we do not concern ourselves with this approach here, instead leaving it open as a direction for possible future research.

Up to now, for the sake of mathematical tractability, we have assumed that the smoothing window $\varphi(t,\omega)$ is t-f shift invariant (i.e. $\sigma_t$ and $\sigma_\omega$ are a priori fixed constant values). Thus, the t-f spectrum $W_S(t,\omega)$ and the spectral estimates $\hat{W}_S(t,\omega;\varphi)$ are related by a straightforward convolution relationship, simplifying much of the resulting mathematics. However, the concept of a bias-variance trade-off has important implications for the t-f shift invariance assumption. Examining Equation (19) we can see that the bias is dependent both on the smoothing window parameters $\sigma_t$ and $\sigma_\omega$, and on the local shape of the t-f surface $W_S(t,\omega)$. That is, the bias depends on $\nabla_t^2 W_S(t,\omega)$, which is a measure of the time curvature of the t-f spectrum $W_S(t,\omega)$, in the sense that $\max_t |\nabla_t^2 W_S(t,\omega)|$ will be large if $W_S(t,\omega)$ has a sharp time peak, but will be small if $W_S(t,\omega)$ is fairly flat. Similarly, $\nabla_\omega^2 W_S(t,\omega)$ in Equation (19) is a measure of the frequency curvature of $W_S(t,\omega)$. Hence, the more sharply curved the t-f surfaces of the spectra, the greater the bias, for a given smoothing window. This suggests that a better solution to the bias-variance trade-off could be achieved if $\varphi(t,\omega)$ were allowed to adapt its t-f spread to match the local curvature of the t-f spectrum. In other words, the spread of the smoothing window should increase in the flatter regions and decrease in the more sharply curved regions of the t-f spectrum.

Thus, one important conclusion to be drawn from the ideas presented so far is that the spectral estimator used in the front end of a speech recognizer should employ an adaptive t-f smoothing strategy. In the next half of this paper we present an adaptive smoothing strategy which exploits the above ideas. It should be noted that although this strategy is based on bias-variance minimization, it also has elements in common with earlier work on optimal segmentation based on maximum likelihood estimation (Sedgwick & Bridle, 1977).
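The argument for adapting the window spread to the local curvature can be illustrated with a small numerical search. The sketch below is a toy model only: it assumes a bias of the form ½(σ_t² c_t + σ_ω² c_ω), where c_t and c_ω stand for the local second derivatives of the spectrum, and a variance of the form P²/(2 σ_t σ_ω), where P is the local power spectral density. These closed forms, the constants in them, and all names and values are assumptions made for illustration.

```python
def approx_mse(sigma_t, sigma_w, curv_t, curv_w, psd):
    """Squared bias plus variance under the assumed closed forms."""
    bias = 0.5 * (sigma_t ** 2 * curv_t + sigma_w ** 2 * curv_w)
    variance = psd ** 2 / (2.0 * sigma_t * sigma_w)
    return bias ** 2 + variance


def best_time_spread(sigma_w, curv_t, curv_w, psd, step=0.01, upper=5.0):
    """Grid search for the time spread minimizing approx_mse.

    The search starts at sigma_t = 1 / (2 * sigma_w), the smallest spread
    allowed by the positivity constraint sigma_t * sigma_w >= 1/2.
    """
    lower = 0.5 / sigma_w
    count = int((upper - lower) / step)
    best_sigma, best_value = lower, float("inf")
    for i in range(count + 1):
        sigma_t = lower + i * step
        value = approx_mse(sigma_t, sigma_w, curv_t, curv_w, psd)
        if value < best_value:
            best_sigma, best_value = sigma_t, value
    return best_sigma


# Hypothetical curvatures: a sharp transition versus a flat, vowel-like region.
sharp = best_time_spread(sigma_w=2.0, curv_t=4.0, curv_w=0.0, psd=1.0)
flat = best_time_spread(sigma_w=2.0, curv_t=0.25, curv_w=0.0, psd=1.0)
print(sharp, flat)
```

Under these assumptions the minimizing spread is shorter in the sharply curved region than in the flat one, which is the behaviour an adaptive smoothing strategy is meant to reproduce.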
2. A practical adaptive spectral smoothing algorithm
2.1. Introduction

In the first half of this paper we argued that a spectral smoothing strategy should attempt to optimize a bias-variance trade-off in the resulting spectral estimates. Furthermore, it was shown that such an approach leads naturally to the concept of an adaptive smoothing algorithm. In the rest of this paper we describe one possible practical approach to implementing such an adaptive smoothing strategy. In order to simplify the algorithm the spectral smoothing window is restricted to be of the form:

$$\varphi(t,\omega) = K \exp\left(-\frac{t^2}{2\sigma^2(t)}\right) \exp\left(-\frac{\omega^2}{2\sigma_\omega^2}\right) \tag{26}$$

where $K$ is a normalizing real scalar. The task is to choose $\sigma(t)$ and $\sigma_\omega$, where $\sigma_\omega$ is a constant value chosen on an a priori basis, but $\sigma(t)$ may be a function of both the signal and time. Thus, $\varphi(t,\omega)$ is not required to be t-f shift invariant, as is the case with conventional ASR spectral estimators, and neither is $\varphi(t,\omega)$ constrained to be the same across different word realizations, as is also usual with conventional estimators, where the window duration is usually fixed and selected in an a priori fashion, based on heuristic criteria and/or past empirical results. To constrain the optimization problem additionally, it is assumed that the glottal excitation signal contains no useful information about the orthographic content of an utterance; instead it simply adds to the variance of the spectral estimates. Accepting this assumption as valid fixes the value of $\sigma_\omega$ and introduces a lower bound on the value of $\sigma(t)$; i.e. $\varphi(t,\omega)$ must always have a sufficient t-f duration so as to remove unwanted excitation information from the resulting spectral estimates. Riley (1987) has shown that for the purpose of removing excitation effects from the spectrum, the minimum spread window is one whose time spread is equal to one pitch period and whose frequency spread is equal to the spacing between the spectral harmonics. Fig. 1 shows the WD of the phrase "away a year", smoothed using a t-f Gaussian window with sufficient t-f spread to remove excitation effects from the resulting spectrum. For comparative purposes Fig. 2 shows the equivalent conventional narrow band STFPS, with its characteristic voicing harmonics clearly evident.

2.2. The adaptive smoothing strategy

The adaptive spectral estimation algorithm proposed here is a two stage process. First, the spectral estimates $\hat{W}_S(t,\omega;\varphi)$ are computed using a fixed spread window $\varphi(t,\omega)$ with $\sigma_t$ and $\sigma_\omega$ determined by the criterion described in the last section. Note that $\hat{W}_S(t,\omega;\varphi)$ can actually be computed using a 1-D convolution of the STFPS, rather than by the more computationally expensive 2-D convolution of the WD (Rainton, 1990).
The second stage of the algorithm involves adaptively smoothing $\hat{W}_S(t,\omega;\varphi)$ in time to produce $\hat{W}_S(t,\omega;\tilde{\varphi})$, using a time variant Gaussian smoothing window $\tilde{\varphi}(t,t')$, i.e.:

$$\hat{W}_S(t,\omega;\tilde{\varphi}) = \int \hat{W}_S(t',\omega;\varphi)\, \tilde{\varphi}(t,t')\, dt' \tag{27}$$

where:

$$\tilde{\varphi}(t,t') = \frac{1}{\sigma(t)\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{t-t'}{\sigma(t)}\right)^2\right). \tag{28}$$

Figure 1. The WD of the phrase "away a year" smoothed with a 2-D Gaussian window, the t-f spread of the Gaussian being chosen so as to remove excitation effects from the resulting spectrum.

Figure 2. A narrow band STFPS of the phrase "away a year" computed using a 1-D Gaussian window and showing characteristic voicing harmonics.
Note that $\tilde{\varphi}(t,t')$ is not shift invariant since the variance of this smoothing window is a function of time. The idea is to select $\sigma(t)$ so as to match the changing quasi-stationary periods within the speech signal, thus introducing maximum variance reduction, while introducing minimal additional bias. Thus, for example, mid-way into a long vowel sound, $\sigma(t)$ will be large, but at a consonant-vowel boundary it will be small. Conventional spectral estimators, on the other hand, make the blanket assumption that the speech signal comprises fixed length quasi-stationary intervals, the assumed duration of these intervals being based on prior information about the expected signal statistics. In reality, it might be more reasonable to expect $\sigma(t)$ to be some continuously changing function of time. However, to simplify the optimization problem $\sigma(t)$ is constrained here to be of the form:

$$\sigma(t) = \sigma_{i-1} + \frac{t - t_{i-1}}{t_i - t_{i-1}}\,(\sigma_i - \sigma_{i-1}), \quad t_{i-1} \le t \le t_i, \qquad \sigma_i = \frac{t_i - t_{i-1}}{\pi}, \quad 1 \le i \le N, \qquad \sigma_0 = \sigma_1 \tag{29}$$

where $t_0$ marks the start of the utterance and $t_N$ marks its end (see Fig. 3 for an example). In other words, $\sigma(t)$ is approximated by a set of $N$ straight line segments. It is assumed that $N$ is selected on an a priori basis. The $\sigma_i$ values are defined as they are in Equation (29) so that the resulting spectrum can be reasonably sampled at the set of time points $t_i$, while satisfying what can be roughly thought of as a generalized Nyquist criterion (Rainton, 1990). Thus, instead of having to optimize an arbitrary function $\sigma(t)$, the problem is now reduced to one of finding an optimal set of $N-1$ breakpoints, i.e. the $t_i$ values, $1 \le i \le N-1$, which minimize the mean squared error (MSE):

$$\mathrm{MSE} = \iint \left[\hat{W}_S(t,\omega;\varphi) - \hat{W}_S(t,\omega;\tilde{\varphi})\right]^2 dt\, d\omega. \tag{30}$$

Figure 3. One possible form of $\sigma(t)$ satisfying the constraints given in Equation (29), with N = 6.
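The piecewise-linear constraint on the window spread can be sketched in a few lines. The code below assumes that the spread at each breakpoint is the length of the preceding segment divided by π, that the spread at the start of the utterance equals the spread at the first breakpoint, and that σ(t) is linearly interpolated in between; the function name and the example breakpoints are hypothetical.

```python
import math


def sigma_profile(breakpoints, t):
    """Piecewise-linear window spread sigma(t) for breakpoints [t_0, ..., t_N].

    Assumes sigma_i = (t_i - t_{i-1}) / pi at each breakpoint and
    sigma_0 = sigma_1 at the start of the utterance.
    """
    knots = [(breakpoints[i] - breakpoints[i - 1]) / math.pi
             for i in range(1, len(breakpoints))]
    knots = [knots[0]] + knots  # sigma_0 := sigma_1
    for i in range(1, len(breakpoints)):
        left, right = breakpoints[i - 1], breakpoints[i]
        if left <= t <= right:
            frac = (t - left) / (right - left)
            return knots[i - 1] + frac * (knots[i] - knots[i - 1])
    raise ValueError("t lies outside the utterance")


# Hypothetical utterance with a short segment followed by a long one.
points = [0.0, 10.0, 40.0]
print([round(sigma_profile(points, t), 3) for t in (0.0, 10.0, 25.0, 40.0)])
```

With this parameterization the whole profile is determined by the breakpoints alone, which is what reduces the search over functions σ(t) to a search over N-1 interior breakpoints.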
The next section describes a computationally efficient means of minimizing this MSE criterion based upon the philosophy of dynamic programming (DP) (Bellman & Roth, 1969).

2.3. Practical implementation of the adaptive smoothing strategy

For practical purposes we now switch to the discrete t-f domain. The vector $s_t$ is defined as the m-dimensional column vector obtained by sampling $\hat{W}_S(t\Delta_t, \omega; \varphi)$ at the frequency points $0, \Delta_\omega, \ldots, (m-1)\Delta_\omega$, i.e.:

$$s_t = [s_{t,1}, \ldots, s_{t,i}, \ldots, s_{t,m}]^T \quad \text{where} \quad s_{t,i} = \hat{W}_S(t\Delta_t, (i-1)\Delta_\omega; \varphi). \tag{31}$$
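The sampling intervals, $\Delta_t$ and $\Delta_\omega$ respectively, are chosen so as to satisfy the Nyquist criterion (Rainton, 1990). $T\Delta_t$ seconds of speech produces the vector sequence $s_0, s_1, \ldots, s_t, \ldots, s_T$. Let $F_n(t)$ be the minimum distortion introduced by smoothing the vector sequence $s_0, s_1, \ldots, s_t$ using a discrete time equivalent of Equation (29) with $n$ line segments. For $n > 1$, the principle of optimality (Bellman & Roth, 1969) yields the recursive relationship:

$$F_n(t) = \min_{n-1 \le t' < t} \left[F_{n-1}(t') + D(t',t)\right]. \tag{32}$$

The MSE distortion measure $D(t',t)$ is defined as:

$$D(t',t) = \sum_{k=t'+1}^{t} \Big\| s_k - \sum_{j=0}^{T} \tilde{\varphi}(k,j)\, s_j \Big\|^2 \tag{33}$$

where $\tilde{\varphi}(k,j)$ is the discrete time equivalent of Equation (28), with $\sigma(k)$ interpolated linearly between $\sigma_{t'}$ and $\sigma_t$. Here $\sigma_t = (t - t')/\pi$; if $t' = 0$ then $\sigma_{t'} = \sigma_t$, otherwise $\sigma_{t'}$ is the value assigned to the breakpoint $t'$ in computing $F_{n-1}(t')$. The starting point for the recursion is the computation of $F_1(t)$, which is defined as:

$$F_1(t) = D(0,t). \tag{34}$$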
This algorithm returns the optimal set of $N-1$ breakpoints, $t_1, \ldots, t_{N-1}$. The $N$ optimally smoothed spectral vectors, $\tilde{s}_i$, $1 \le i \le N$, are then given by:

$$\tilde{s}_i = \sum_{k=0}^{T} \tilde{\varphi}(t_i,k)\, s_k \tag{35}$$

where $\tilde{\varphi}(t_i,k)$ is defined in Equation (33). $N$ may be selected on an a priori basis, or may be chosen to ensure that the average MSE between corresponding vector sequences, obtained before and after adaptive smoothing, is below some predefined threshold. The rest of this paper describes the application of the above algorithm to a real speech recognition problem.
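The structure of the dynamic programming recursion can be sketched with a simplified distortion measure. To keep the example short and self-contained, the code below swaps in a different segment cost from the one described above: each segment is replaced by its mean vector and the cost is the squared error about that mean. The recursion and the backtracking are otherwise of the same form, F_n(t) = min over t' of F_{n-1}(t') + D(t', t). All names are hypothetical.

```python
def segment_cost(frames, start, end):
    """Squared error of replacing frames[start:end] by their mean vector."""
    length = end - start
    dim = len(frames[0])
    mean = [sum(frames[k][d] for k in range(start, end)) / length
            for d in range(dim)]
    return sum((frames[k][d] - mean[d]) ** 2
               for k in range(start, end) for d in range(dim))


def optimal_breakpoints(frames, num_segments):
    """Return (total cost, interior breakpoints) for the best segmentation."""
    total = len(frames)
    inf = float("inf")
    # best[n][t]: minimum cost of covering frames[0:t] with n segments.
    best = [[inf] * (total + 1) for _ in range(num_segments + 1)]
    back = [[0] * (total + 1) for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for n in range(1, num_segments + 1):
        for t in range(n, total + 1):
            for prev in range(n - 1, t):
                cost = best[n - 1][prev] + segment_cost(frames, prev, t)
                if cost < best[n][t]:
                    best[n][t] = cost
                    back[n][t] = prev
    # Backtrack from the end of the sequence to recover the segment starts.
    starts = []
    t = total
    for n in range(num_segments, 0, -1):
        t = back[n][t]
        starts.append(t)
    starts.reverse()
    return best[num_segments][total], starts[1:]  # drop the initial 0


# Hypothetical two-level sequence: the single breakpoint falls at the jump.
frames = [[0.0]] * 3 + [[5.0]] * 3
print(optimal_breakpoints(frames, 2))
```

Replacing `segment_cost` with a distortion based on time-varying Gaussian smoothing would bring the sketch closer to the algorithm described above, at the price of a cost that depends on the spread at both ends of each segment.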
3. Experimental results

This section describes a multiple speaker isolated digit recognition experiment, conducted in order to compare the performance of the adaptively smoothed spectral vectors against that of conventional filterbank vectors.

3.1. Speech data
The speech used for the following experiments comprised the isolated digits zero to nine. One thousand four hundred isolated digits were obtained from various sources, and included nine female and 26 male speakers. In all cases sampling was performed at 10 kHz, with 5 kHz anti-alias filtering, using 12-bit resolution and a noise cancelling head mounted microphone. Data collection took place in a working lab, employing no special soundproofing. For each word realization, the initial spectral analysis stage produced a sequence of STFPS vectors using a 25.6 ms Gaussian analysis window, shifted 10 ms at a time. This sequence of 256 dimensional vectors was then additionally smoothed in frequency using another Gaussian window to produce a sequence of 64 dimensional vectors, one vector every 10 ms. Finally, each vector was replaced with its log equivalent, thus producing what we will refer to as the original vector sequences. Figure 4 shows one such vector sequence, generated from a realization of the word "nine", plotted in the form of a spectrogram. Half of the resulting vector sequences from each speaker were used for training the recognizer and the other half were used for testing it. Thus, there were 700 training tokens and 700 unknown testing tokens. Tokens from each speaker appeared in both the testing and training data.

Figure 4. An example of the original vector sequence for one realization of the word "nine".

3.2. The speech recognizer

The pattern recognition algorithm used in these experiments was a standard, fixed endpoint, isolated word dynamic time warping (DTW) algorithm, very similar to that described in Sakoe and Chiba (1978). The main reason for choosing this particular algorithm was the availability of pre-existing software; in principle there is no reason why the adaptive spectral smoothing cannot be used with either hidden Markov model (HMM) or neural network (NN)-based algorithms. Indeed, it is envisaged that future work by the authors will be based upon an HMM algorithm. However, it was felt that for the purposes of this paper, the simpler DTW approach would be sufficient. For the specific DTW algorithm used in this paper, the slope constraint was between 0 and 2 and the spectral distance measure was Euclidean.

After initially generating a variable length sequence of 64 dimensional smoothed log spectra for each word, each vector sequence was then passed into the adaptive smoothing algorithm to produce an equivalent adaptively smoothed vector sequence, with a fixed length N. Before considering recognition performance, it is useful to determine the actual MSE incurred for a given value of N so that a reasonable range of experimental values can be deduced. To this end we computed the average mean squared error (MSE), over all digit realizations, between the original vector sequences and the corresponding adaptively smoothed sequences. Figure 5 shows a plot of this average MSE against the adaptively smoothed vector sequence length N. From these results it can be seen that the MSE decreases smoothly as N is increased and that values of N around 10 result in quite low values of MSE. Thus, choosing values of N around 10 may be expected to give minimal increases in t-f bias whilst at the same time giving appreciably decreased variance as compared to the original vectors.

Next, we investigated the word recognition performances of the adaptively smoothed vector sequences as N was varied either side of 10. The control experiment employed
exactly the same DTW recognition algorithm, but simply replaced the adaptively smoothed vectors with the original vector sequence. The results for these experiments are shown in Fig. 6. From Fig. 6 we can see that reducing the number of adaptively smoothed spectral vectors to as low as 7 per word produces no significant reduction in the word recognition accuracy. McNemar's test (Gillick & Cox, 1989) predicts that the observed differences between the N = 7 adaptively smoothed result and the control would occur by chance with a probability of 0.3, thus confirming that the observed differences are not statistically significant. Given that the average digit length was 580 ms, thus producing an original vector sequence length of about 60, it implies that the adaptive smoothing algorithm allows nearly an order of magnitude reduction in the temporal spectral sampling rate, without significant loss of recognition performance.

Finally, one property of the adaptively smoothed spectral vectors which has not yet been exploited is the implicit time normalization achieved by the estimation algorithm. That is, a fixed number of vectors can be produced per word, irrespective of the word's temporal duration. This allows the use of simple linear matching algorithms. Figure 7 shows the recognition results obtained by repeating the previous experiment, but constraining the DTW recognizer to follow a linear warping path (i.e. a slope constraint of 1). The results show, as expected, that preventing non-linear alignment in the DTW recognizer had no significant effect on the word recognition performance.

Figure 5. The average mean squared error between the original and adaptively smoothed vector sequences as a function of N, the number of smoothing breakpoints per word.
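For readers who wish to reproduce this kind of significance check, the following is a minimal sketch of an exact two-sided McNemar test. It uses only the tokens on which the two recognizers disagree; the function name is hypothetical, and the counts in the usage lines are made up for illustration and are not the counts from the experiments reported here.

```python
import math


def mcnemar_exact(only_a_correct, only_b_correct):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant token is equally likely to
    favour either recognizer, so the smaller count follows a
    Binomial(n, 1/2) distribution.
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the recognizers never disagree
    k = min(only_a_correct, only_b_correct)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(5, 5))
print(mcnemar_exact(1, 9))
```

A large p-value means that the observed disagreement between the two recognizers is easily explained by chance, which is the sense in which a difference is described as not statistically significant.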
Figure 6. Word recognition accuracy plotted against N, the number of adaptively smoothed spectral vectors per word. Also marked on the graph is the recognition accuracy obtained using the original spectral vector sequence, obtained prior to adaptive smoothing. The two curves show the adaptively smoothed estimator and the conventional estimator (for the latter, ignore the horizontal axis).
Figure 7. Word recognition accuracy plotted against N, the number of adaptively smoothed spectral vectors per word, using both linear and non-linear time alignment schemes. The two curves show the adaptively smoothed estimator with non-linear time warping and the adaptively smoothed estimator with no time alignment.
4. Conclusions

The idea of using variable frame rate spectral analysis techniques for speech recognition is not new. Over the years various different schemes have been employed with varying degrees of success. However, while the attraction of such schemes has been intuitively self-evident, they have generally lacked a theoretical foundation. By formulating the spectral estimation problem in terms of a bias-variance trade-off, we have shown why the degree of t-f smoothing introduced into the estimator should adapt to match the changing signal statistics. Furthermore, we have demonstrated that such adaptive smoothing schemes can actually tolerate increased average temporal smoothing, and a correspondingly reduced temporal sampling rate. The advantages of adaptive smoothing schemes are several, including implicit time normalization of the signal, the effective increased weighting given to transition portions of the spectrum, and a significantly reduced temporal sampling rate. We would also expect this reduced average temporal sampling rate to improve the signal to noise ratio in the resulting spectral vectors; however, this idea has yet to be tested. The major disadvantage of such schemes is their associated increased computational cost. However, this increase is at least partially offset by the reduced computational load incurred by the pattern recognition stage of the recognizer, given the reduced rate at which it need receive the adaptively smoothed vector sequence.
What this paper has failed to demonstrate is any positive increase in recognition performance obtained using the adaptive smoothing strategy. We have shown only that such a strategy enables data rate reduction without incurring a loss of recognition accuracy. However, taking an optimistic view, we would argue that the reason for this may well be the relative ease of the chosen digit recognition task. Consequently, the aim of future work will be to apply the same adaptive smoothing strategy to a speaker independent, large vocabulary recognition task, such as the TIMIT database.
References

Bellman, R. & Roth, R. (1969). Curve fitting by segmented straight lines. Journal of the American Statistical Association, 64, 1079-1084.
Classen, T. A. S. M. & Mecklenbrauker, W. F. G. (1980a). The Wigner distribution - a tool for time-frequency analysis part 1: continuous-time signals. Philips Journal of Research, 217-250.
Classen, T. A. S. M. & Mecklenbrauker, W. F. G. (1980b). The Wigner distribution - a tool for time-frequency analysis part 2: discrete-time signals. Philips Journal of Research, 276-300.
Gillick, L. & Cox, S. J. (1989). Some statistical issues in the comparison of speech recognition algorithms. In Proceedings of ICASSP. Glasgow, pp. 532-535. May.
Isserlis, L. (1918). On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika, 12, 134-139.
Lowe, D. (1986). Joint Representations in Quantum Mechanics and Signal Processing Theory: Why a Probability Function of Time and Frequency is Disallowed. Technical Report 4017. Royal Signals and Radar Establishment.
Loynes, R. M. (1968). On the concept of the spectrum for non-stationary processes. Journal of the Royal Statistical Society, B, 30, 1-20.
Martin, W. & Flandrin, P. (1985). Wigner-Ville spectral analysis of nonstationary processes. IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 1461-1470.
Martin, W. & Flandrin, P. (1988). Analysis of Non-Stationary Processes: Short Time Periodograms Versus a Pseudo Wigner Estimator, pp. 455-458.
Papoulis, A. (1962). The Fourier Integral and Its Applications. McGraw-Hill.
Pickles, J. O. (1982). An Introduction to the Physiology of Hearing. Academic Press.
Rainton, D. (1990). Time-Frequency Spectral Estimation of Speech. Technical Report CUED/F-INFENG/TR.39, Cambridge University Engineering Department.
Rihaczek, A. W. (1969). Principles of High Resolution Radar. McGraw-Hill.
Riley, M. D. (1987). Beyond quasi-stationarity: designing time-frequency representations for speech signals. In Proceedings of ICASSP. Dallas, pp. 657-660. April.
Riley, M. D. (1987). Time-Frequency Representations for Speech Signals. Technical Report AI-TR-974. MIT Artificial Intelligence Laboratory.
Sakoe, H. & Chiba, S. (1978). Dynamic programming algorithm optimisation for spoken word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 43-49.
Sedgwick, N. C. & Bridle, J. S. (1977). A method for segmenting acoustic speech patterns, with applications to automatic speech recognition. In Proceedings of ICASSP. Hartford, Connecticut. May.
Velez, E. F. & Absher, R. G. (1989). Transient analysis of speech signals using the Wigner time-frequency representation. In Proceedings of ICASSP. Glasgow, pp. 2242-2245. May.
Wilbur, J. & Taylor, F. J. (1988). Consistent speaker identification via Wigner smoothing techniques. In Proceedings of ICASSP. New York, pp. 591-594. April.
Appendix A. Bias of the time-frequency smoothed estimator
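This appendix derives an approximate expression for the bias of the t-f smoothed WD. Beginning with the definition of bias given in Equation (13), i.e.:

$$\mathrm{bias}[\hat{W}_S(t,\omega;\varphi)] = E[\hat{W}_S(t,\omega;\varphi)] - W_S(t,\omega), \tag{36}$$

substituting for $E[\hat{W}_S(t,\omega;\varphi)]$ using the relationship [2]:

$$E[\hat{W}_S(t,\omega;\varphi)] = \frac{1}{2\pi}\iint W_S(t-t_1,\omega-\omega_1)\,\varphi(t_1,\omega_1)\,d\omega_1\,dt_1 \tag{37}$$

gives:

$$\mathrm{bias}[\hat{W}_S(t,\omega;\varphi)] = \frac{1}{2\pi}\iint W_S(t-t_1,\omega-\omega_1)\,\varphi(t_1,\omega_1)\,d\omega_1\,dt_1 - W_S(t,\omega). \tag{38}$$

Multiplying $W_S(t,\omega)$ in Equation (38) by $\frac{1}{2\pi}\iint\varphi(t_1,\omega_1)\,dt_1\,d\omega_1$, and assuming that $\varphi(t,\omega)$ is normalized according to Equation (8), i.e.:

$$\iint\varphi(t,\omega)\,dt\,d\omega = 2\pi, \tag{39}$$

then Equation (38) can be re-written in the form:

$$\mathrm{bias}[\hat{W}_S(t,\omega;\varphi)] = \frac{1}{2\pi}\iint\left[W_S(t-t_1,\omega-\omega_1) - W_S(t,\omega)\right]\varphi(t_1,\omega_1)\,d\omega_1\,dt_1. \tag{40}$$

Assuming that $\varphi(t,\omega)$ is sufficiently compact and that within the non-zero region of the window function $W_S(t,\omega)$ is locally quadratic, i.e. is twice differentiable with a bounded second derivative, then for small $t_1,\omega_1$, by a Taylor series expansion:

$$W_S(t-t_1,\omega-\omega_1) \approx \left[1 - \frac{(t_1\nabla_t + \omega_1\nabla_\omega)}{1!} + \frac{(t_1\nabla_t + \omega_1\nabla_\omega)^2}{2!}\right]W_S(t,\omega), \tag{41}$$

where $\nabla_t = \partial/\partial t$ and $\nabla_\omega = \partial/\partial\omega$, so that $\nabla_t^2 = \partial^2/\partial t^2$ and $\nabla_\omega^2 = \partial^2/\partial\omega^2$. Substituting for $W_S(t-t_1,\omega-\omega_1)$ in Equation (40) using Equation (41), and assuming that $\varphi(t_1,\omega_1)$ is an even function of both time and frequency, and hence $\int t_1\varphi(t_1,\omega_1)\,dt_1 = 0$ and $\int\omega_1\varphi(t_1,\omega_1)\,d\omega_1 = 0$ (so that the first-order and cross terms vanish), gives the required result:

$$\mathrm{bias}[\hat{W}_S(t,\omega;\varphi)] \approx \frac{1}{4\pi}\left[\nabla_t^2 W_S(t,\omega)\iint t_1^2\,\varphi(t_1,\omega_1)\,d\omega_1\,dt_1 + \nabla_\omega^2 W_S(t,\omega)\iint\omega_1^2\,\varphi(t_1,\omega_1)\,d\omega_1\,dt_1\right]. \tag{42}$$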
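As a numerical illustration of the second-order bias approximation, the Python sketch below smooths a hypothetical quadratic surface (standing in for $W_S$) with a separable Gaussian window and compares the measured bias with the prediction $\frac{1}{4\pi}\left[\nabla_t^2 W_S\iint t_1^2\varphi + \nabla_\omega^2 W_S\iint\omega_1^2\varphi\right]$. The window form, the test surface, the spreads `sigma_t` and `sigma_w`, and all function names are assumptions made for this sketch only; they are not taken from the paper.

```python
import math


def window(t1, w1, sigma_t, sigma_w):
    """Assumed separable Gaussian window, scaled so its double integral is 2*pi."""
    return math.exp(-0.5 * (t1 / sigma_t) ** 2 - 0.5 * (w1 / sigma_w) ** 2) / (sigma_t * sigma_w)


def spectrum(t, w):
    """Hypothetical quadratic surface standing in for W_S(t, w)."""
    return 3.0 + 2.0 * t - w + 0.5 * t * t + 0.25 * w * w + 0.7 * t * w


def smoothed(t, w, sigma_t, sigma_w, n=161, span=8.0):
    """Riemann-sum estimate of (1/2pi) * iint W_S(t - t1, w - w1) * window(t1, w1)."""
    dt = 2.0 * span * sigma_t / (n - 1)
    dw = 2.0 * span * sigma_w / (n - 1)
    total = 0.0
    for i in range(n):
        t1 = -span * sigma_t + i * dt
        for k in range(n):
            w1 = -span * sigma_w + k * dw
            total += spectrum(t - t1, w - w1) * window(t1, w1, sigma_t, sigma_w)
    return total * dt * dw / (2.0 * math.pi)


sigma_t, sigma_w = 0.3, 0.5
t0, w0 = 1.0, -2.0

# Measured bias: smoothed value minus the unsmoothed surface.
bias_numeric = smoothed(t0, w0, sigma_t, sigma_w) - spectrum(t0, w0)

# Predicted bias: the test surface has d2/dt2 = 1.0 and d2/dw2 = 0.5, and the
# Gaussian window has second moments 2*pi*sigma_t**2 and 2*pi*sigma_w**2.
second_moment_t = 2.0 * math.pi * sigma_t ** 2
second_moment_w = 2.0 * math.pi * sigma_w ** 2
bias_predicted = (1.0 * second_moment_t + 0.5 * second_moment_w) / (4.0 * math.pi)

print(bias_numeric, bias_predicted)
```

Because the test surface is exactly quadratic, the two values should agree to within numerical integration error; for a general spectrum the prediction is only a small-window approximation.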
Appendix B. The variance of the t-f smoothed Wigner distribution - a derivation

This appendix derives the following approximate relationship for the variance of the t-f smoothed WD $\hat{W}_S(t,\omega;\varphi)$, i.e.:

$$\mathrm{var}[\hat{W}_S(t,\omega;\varphi)] \approx \frac{1}{2\pi}\,|S_t(\omega)|^2\iint|\varphi(t',\omega')|^2\,d\omega'\,dt', \tag{43}$$

when $S(t)$ is a white zero mean complex analytic Gaussian process. The proof of the above equation is a straightforward extension of the discrete time proof given by Martin and Flandrin (1988). The derivation of Equation (43) is as follows; for notational simplicity the shorthand notation $\hat{W}_S(t_1,\omega_1;\varphi) = W_1$ and $\hat{W}_S(t_2,\omega_2;\varphi) = W_2$ is used. Beginning with the expression for $\mathrm{cov}[W_1,W_2]$:

$$\mathrm{cov}[W_1,W_2] = E[(W_1 - E[W_1])(W_2 - E[W_2])^*]. \tag{44}$$

Substituting for $W_1$ and $W_2$ in Equation (44) using the definition given in Equation (3), i.e.:

$$W_i \stackrel{\mathrm{def}}{=} \iint\psi(\tau_1,\tau)\,S(t_i+\tau_1+\tau/2)\,S^*(t_i+\tau_1-\tau/2)\,e^{-j\omega_i\tau}\,d\tau\,d\tau_1 \tag{45}$$

gives:

$$\mathrm{cov}[W_1,W_2] = \iiiint\psi(\tau_3,\tau_1)\,\psi^*(\tau_4,\tau_2)\,e^{-j(\omega_1\tau_1-\omega_2\tau_2)}\,\mathrm{cov}[st^*,uv^*]\,d\tau_1\,d\tau_2\,d\tau_3\,d\tau_4 \tag{46}$$

where:

$$s = S(t_1+\tau_3+\tau_1/2),\quad t = S(t_1+\tau_3-\tau_1/2),\quad u = S(t_2+\tau_4+\tau_2/2),\quad v = S(t_2+\tau_4-\tau_2/2). \tag{47}$$

Using the relationship:

$$\mathrm{cov}[st^*,uv^*] = E[st^*u^*v] - E[st^*]E[u^*v] \tag{48}$$

then if $s, t, u, v$ are real joint Gaussian variates with zero averages [originally formulated by Isserlis (1918)]:

$$\mathrm{cov}[st^*,uv^*] = E[su^*]E[t^*v] + E[sv]E[t^*u^*]. \tag{49}$$

This result is easily derived from the characteristic function (Papoulis, 1962). If, however, $s, t, u, v$ are complex analytic zero mean Gaussian variates it is straightforward to show that Equation (49) reduces to:

$$\mathrm{cov}[st^*,uv^*] = E[su^*]E[t^*v]. \tag{50}$$

This is because $E[wx] = 0$ where $w, x$ are zero mean analytic Gaussian variables (Papoulis, 1962). Thus, for complex analytic zero mean Gaussian variables, $\mathrm{cov}[W_1,W_2]$ is simply:

$$\mathrm{cov}[W_1,W_2] = A, \tag{51}$$

where $A$ is the integral defined in Equation (52).
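The factorization $\mathrm{cov}[st^*,uv^*] = E[su^*]E[t^*v]$ for complex analytic variates can be checked by simulation. The Python sketch below is a Monte Carlo illustration only: it models analytic Gaussian variates as circular complex Gaussians (an assumption, chosen because they satisfy $E[wx]=0$), and the particular mixing of `z1`, `z2`, `z3` into `s`, `t`, `u`, `v` is hypothetical.

```python
import math
import random

random.seed(0)


def circular_gaussian():
    """Zero mean complex Gaussian sample with E[|z|^2] = 1 and E[z*z] = 0."""
    scale = math.sqrt(0.5)
    return complex(random.gauss(0.0, scale), random.gauss(0.0, scale))


def mean(values):
    return sum(values) / len(values)


n = 100_000
st, uv, su, tv, sv = [], [], [], [], []
for _ in range(n):
    z1, z2, z3 = circular_gaussian(), circular_gaussian(), circular_gaussian()
    # Hypothetical correlated variates: u shares z1 with s, v shares z2 with t.
    s, t = z1, z2
    u = (z1 + z3) / math.sqrt(2.0)
    v = (z2 + z3) / math.sqrt(2.0)
    st.append(s * t.conjugate())
    uv.append(u * v.conjugate())
    su.append(s * u.conjugate())
    tv.append(t.conjugate() * v)
    sv.append(s * v)

# Left-hand side: cov[st*, uv*] = E[st* (uv*)*] - E[st*] E[(uv*)*].
cov_lhs = mean([a * b.conjugate() for a, b in zip(st, uv)]) - mean(st) * mean(uv).conjugate()
# Right-hand side: E[su*] E[t*v].
product_rhs = mean(su) * mean(tv)
# E[sv] should vanish for circular variates, which removes the second Isserlis term.
pseudo_moment = mean(sv)

print(cov_lhs, product_rhs, pseudo_moment)
```

For this mixing both sides should be close to 0.5 and the pseudo-moment close to zero, up to Monte Carlo error of order $1/\sqrt{n}$.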
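The integral $A$ of Equation (51) is:

$$A = \iiiint\psi(\tau_3,\tau_1)\,\psi^*(\tau_4,\tau_2)\,e^{-j(\omega_1\tau_1-\omega_2\tau_2)}\,E[su^*]E[t^*v]\,d\tau_1\,d\tau_2\,d\tau_3\,d\tau_4. \tag{52}$$

To evaluate $A$ expand $E[su^*]E[t^*v]$:

$$E[su^*]E[t^*v] = E[S(t_1+\tau_3+\tau_1/2)\,S^*(t_2+\tau_4+\tau_2/2)]\;E[S^*(t_1+\tau_3-\tau_1/2)\,S(t_2+\tau_4-\tau_2/2)] \tag{53}$$

and temporarily substitute for $\tau_1, \tau_2, \tau_3, \tau_4$ using:

$$\tau' = a+b,\quad \tau'' = a-b,\quad t' = (c+d)/2,\quad t'' = (c-d)/2 \tag{54}$$

where:

$$a = t_1-t_2+\tau_3-\tau_4,\quad b = (\tau_1-\tau_2)/2,\quad c = t_1+t_2+\tau_3+\tau_4,\quad d = (\tau_1+\tau_2)/2. \tag{55}$$

Equation (53) can now be written in the form:

$$E[su^*]E[t^*v] = E[S(t'+\tau'/2)\,S^*(t'-\tau'/2)]\;E^*[S(t''+\tau''/2)\,S^*(t''-\tau''/2)]. \tag{56}$$

Making the quasi-stationary assumption:

$$E[S(t_0+\tau/2)\,S^*(t_0-\tau/2)] \approx E[S(t'+\tau/2)\,S^*(t'-\tau/2)] \approx E[S(t''+\tau/2)\,S^*(t''-\tau/2)] \tag{57}$$

where $t_0 = (t_1+t_2)/2$ gives:

$$E[su^*]E[t^*v] \approx E[S(t_0+\tau'/2)\,S^*(t_0-\tau'/2)]\;E^*[S(t_0+\tau''/2)\,S^*(t_0-\tau''/2)]. \tag{58}$$

Expressing the quasi-stationary covariance $E[S(t_0+\tau/2)\,S^*(t_0-\tau/2)]$ in terms of the Fourier transform of the stationary process $S_{t_0}(t)$, tangential to $S(t)$ in $t_0$, i.e.:

$$E[S(t_0+\tau/2)\,S^*(t_0-\tau/2)] \approx \frac{1}{2\pi}\int S_{t_0}(\Omega)\,e^{j\Omega\tau}\,d\Omega \tag{59}$$

gives:

$$E[su^*]E[t^*v] \approx \frac{1}{4\pi^2}\iint S_{t_0}(\Omega_1)\,S^*_{t_0}(\Omega_2)\,e^{j(\Omega_1\tau'-\Omega_2\tau'')}\,d\Omega_1\,d\Omega_2. \tag{60}$$

Substituting for $E[su^*]E[t^*v]$ in Equation (52) using Equation (60), rearranging factors and expanding $\tau', \tau''$ using Equations (54) and (55):

$$A \approx \frac{1}{4\pi^2}\iint\left\{\iint\psi(\tau_3,\tau_1)\,e^{-j(\omega_1-\frac{\Omega_1+\Omega_2}{2})\tau_1}\,e^{j(\Omega_1-\Omega_2)(t_1+\tau_3)}\,d\tau_1\,d\tau_3\right\}\left\{\iint\psi^*(\tau_4,\tau_2)\,e^{j(\omega_2-\frac{\Omega_1+\Omega_2}{2})\tau_2}\,e^{-j(\Omega_1-\Omega_2)(t_2+\tau_4)}\,d\tau_2\,d\tau_4\right\}S_{t_0}(\Omega_1)\,S^*_{t_0}(\Omega_2)\,d\Omega_1\,d\Omega_2 \tag{61}$$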
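then substituting for the braced integrals in Equation (61) using the relationship:

$$\iint\psi(t',\tau)\,e^{-j(\Omega t'+\omega\tau)}\,d\tau\,dt' = \Phi(\Omega,\omega) \tag{62}$$

and performing the co-ordinate transformation $\Omega' = (\Omega_2-\Omega_1)/2$, $\Omega'' = (\Omega_1+\Omega_2)/2$. The modulus of the Jacobian is 2 and hence $d\Omega_1\,d\Omega_2 \rightarrow 2\,d\Omega'\,d\Omega''$, giving:

$$A \approx \frac{1}{2\pi^2}\iint\left\{\Phi(2\Omega',\omega_1-\Omega'')\,e^{-j2\Omega' t_1}\right\}\left\{\Phi^*(2\Omega',\omega_2-\Omega'')\,e^{+j2\Omega' t_2}\right\}S_{t_0}(\Omega''-\Omega')\,S^*_{t_0}(\Omega''+\Omega')\,d\Omega'\,d\Omega''. \tag{63}$$

Defining $\Phi(\Omega,\omega)$ as a 1-D Fourier transform of $\varphi(t,\omega)$, i.e.:

$$\Phi(\Omega,\omega) \stackrel{\mathrm{def}}{=} \int\varphi(t,\omega)\,e^{-j\Omega t}\,dt \tag{64}$$

then Equation (63) can be re-written as:

$$A \approx \frac{1}{2\pi^2}\iiiint S_{t_0}(\Omega''-\Omega')\,S^*_{t_0}(\Omega''+\Omega')\,e^{-j2\Omega'(t'-t'')}\,\varphi(t'-t_1,\omega_1-\Omega'')\,\varphi^*(t''-t_2,\omega_2-\Omega'')\,d\Omega'\,d\Omega''\,dt'\,dt''. \tag{65}$$

Since $S(t)$ is a white noise process, $S_{t_0}(\omega)$ is flat and Equation (65) is equivalent to:

$$A \approx \frac{1}{2\pi^2}\iiint|S_{t_0}(\Omega'')|^2\left[\int e^{-j2\Omega'(t'-t'')}\,d\Omega'\right]\varphi(t'-t_1,\omega_1-\Omega'')\,\varphi^*(t''-t_2,\omega_2-\Omega'')\,d\Omega''\,dt'\,dt''. \tag{66}$$

Noting that $\int e^{-j2\Omega'(t'-t'')}\,d\Omega'$ is equal to $\pi\delta(t'-t'')$ then Equation (66) reduces to:

$$A \approx \frac{1}{2\pi}\iint|S_{t_0}(\Omega'')|^2\,\varphi(t'-t_1,\omega_1-\Omega'')\,\varphi^*(t'-t_2,\omega_2-\Omega'')\,d\Omega''\,dt'. \tag{67}$$

Using exactly the same set of arguments it can be shown that the second term of Equation (49) leads to:

$$\frac{1}{2\pi}\iint|S_{t_0}(\Omega'')|^2\,\varphi(t'-t_1,\omega_1-\Omega'')\,\varphi^*(t'-t_2,\omega_2+\Omega'')\,d\Omega''\,dt'. \tag{68}$$

Hence, when the random process $S(t)$ is a white zero mean complex analytic Gaussian process, the covariance of its smoothed WD is approximately equal to:

$$\mathrm{cov}[W_1,W_2] \approx \frac{1}{2\pi}\iint|S_{t_0}(\Omega'')|^2\,\varphi(t'-t_1,\omega_1-\Omega'')\,\varphi^*(t'-t_2,\omega_2-\Omega'')\,d\Omega''\,dt'. \tag{69}$$

Letting $t_1$ and $t_2$ equal $t$, and $\omega_1$ and $\omega_2$ equal $\omega$, we obtain the following approximate relationship for the variance of the WD, i.e.:

$$\mathrm{var}[\hat{W}_S(t,\omega;\varphi)] \approx \frac{1}{2\pi}\iint|S_t(\Omega'')|^2\,\varphi(t'-t,\omega-\Omega'')\,\varphi^*(t'-t,\omega-\Omega'')\,d\Omega''\,dt'. \tag{70}$$

Finally, using the fact that $S_t(\Omega)$ is flat in frequency, we obtain the desired result:

$$\mathrm{var}[\hat{W}_S(t,\omega;\varphi)] \approx \frac{1}{2\pi}\,|S_t(\omega)|^2\iint|\varphi(t',\omega')|^2\,d\omega'\,dt'. \tag{71}$$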
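The result of this appendix makes the variance proportional to the window energy $\iint|\varphi(t',\omega')|^2\,d\omega'\,dt'$. As a rough illustration, the Python sketch below evaluates that energy numerically for a separable Gaussian window and compares it with the closed form $\pi/(\sigma_t\sigma_\omega)$, which falls as the window spreads. The Gaussian form, the spread parameters `sigma_t` and `sigma_w`, and the function names are assumptions of this sketch rather than quantities taken from the paper.

```python
import math


def window(t, w, sigma_t, sigma_w):
    """Assumed separable Gaussian window, scaled so its double integral is 2*pi."""
    return math.exp(-0.5 * (t / sigma_t) ** 2 - 0.5 * (w / sigma_w) ** 2) / (sigma_t * sigma_w)


def integrate(f, sigma_t, sigma_w, n=161, span=8.0):
    """Riemann sum of f(t, w) over +/- span standard deviations in each variable."""
    dt = 2.0 * span * sigma_t / (n - 1)
    dw = 2.0 * span * sigma_w / (n - 1)
    total = 0.0
    for i in range(n):
        t = -span * sigma_t + i * dt
        for k in range(n):
            w = -span * sigma_w + k * dw
            total += f(t, w)
    return total * dt * dw


spreads = [(0.5, 0.5), (1.0, 1.0), (2.0, 2.0)]
areas, energies, closed_forms = [], [], []
for sigma_t, sigma_w in spreads:
    # Normalization check: the window should integrate to 2*pi.
    areas.append(integrate(lambda t, w: window(t, w, sigma_t, sigma_w), sigma_t, sigma_w))
    # Window energy: the factor that scales the variance of the smoothed estimate.
    energies.append(integrate(lambda t, w: window(t, w, sigma_t, sigma_w) ** 2, sigma_t, sigma_w))
    closed_forms.append(math.pi / (sigma_t * sigma_w))

for spread, energy, closed in zip(spreads, energies, closed_forms):
    print(spread, energy, closed)
```

Under these assumptions, doubling both spreads cuts the window energy, and hence the predicted variance, by a factor of four, while the bias terms of Appendix A grow with the square of each spread.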
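Appendix C. Changes in $D(E[\hat{W}_U], E[\hat{W}_V])$ as a function of the t-f spread of $\varphi(t,\omega)$

This appendix shows that if $\varphi(t,\omega)$ is a real separable Gaussian window, which satisfies the normalization criterion of Equation (8), then $D(E[\hat{W}_U], E[\hat{W}_V])$ must be a monotonic decreasing function of the t-f spread of $\varphi(t,\omega)$, irrespective of the choice of $U(t)$ and $V(t)$. Beginning with the definition of $D(E[\hat{W}_U], E[\hat{W}_V])$ given in Equation (23), i.e.:

$$D(E[\hat{W}_U], E[\hat{W}_V]) = \iint\left|E[\hat{W}_U(t,\omega;\varphi)] - E[\hat{W}_V(t,\omega;\varphi)]\right|^2\,dt\,d\omega \tag{72}$$

we re-formulate this equation in the ambiguity plane. The ambiguity function (AF), widely used in radar signal processing [14], is defined by the equation:

$$A_S(\varepsilon,\tau) = \mathcal{F}(W_S(t,\omega)) \tag{73}$$

where $\mathcal{F}(\cdot)$ denotes the 2-D Fourier transform from the $(t,\omega)$ plane to the $(\varepsilon,\tau)$ plane (74). Similarly:

$$\hat{A}_S(\varepsilon,\tau;\Phi) = \mathcal{F}(\hat{W}_S(t,\omega;\varphi)) \tag{75}$$

where:

$$\Phi(\varepsilon,\tau) = \mathcal{F}(\varphi(t,\omega)). \tag{76}$$

Making use of Parseval's relationship, and the equality:

$$A_S(\varepsilon,\tau)\,\Phi(\varepsilon,\tau) = \hat{A}_S(\varepsilon,\tau;\Phi) \tag{77}$$

Equation (72) can be re-expressed in the form:

$$D(E[\hat{W}_U], E[\hat{W}_V]) = \iint\left|A_U(\varepsilon,\tau) - A_V(\varepsilon,\tau)\right|^2\,|\Phi(\varepsilon,\tau)|^2\,d\varepsilon\,d\tau. \tag{78}$$

If $\varphi(t,\omega)$ is a real separable 2-D Gaussian smoothing window, defined by the equation: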
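$$\varphi(t,\omega) = \frac{1}{\sigma_t\sigma_\omega}\exp\left(-\frac{t^2}{2\sigma_t^2}\right)\exp\left(-\frac{\omega^2}{2\sigma_\omega^2}\right) \tag{79}$$

which can be shown to satisfy the required normalization of Equation (39), then it is straightforward to derive that:

$$\Phi(\varepsilon,\tau) = \exp\left(-\frac{\sigma_t^2\varepsilon^2}{2}\right)\exp\left(-\frac{\sigma_\omega^2\tau^2}{2}\right). \tag{80}$$

Thus, $\Phi(\varepsilon,\tau)$ will also be a real separable 2-D Gaussian which satisfies the inequalities:

$$\Phi(0,0) = 1 \quad\text{and}\quad \Phi(\varepsilon,\tau) \le 1 \quad \forall\,\varepsilon,\tau. \tag{81}$$

Furthermore, since $|\Phi(\varepsilon,\tau)|^2$ must also be less than or equal to 1 for all $\varepsilon, \tau$, then Equation (78) implies the following inequality, i.e.:

$$D(E[\hat{W}_U], E[\hat{W}_V]) \le D(W_U, W_V). \tag{82}$$

Or in other words, the distance between the expected values of the smoothed spectral estimates must be less than or equal to the distance between the unsmoothed estimates. Furthermore, $|\Phi(\varepsilon,\tau)|^2$ is a monotonic decreasing function of $\sigma_t$ and $\sigma_\omega$, for all $\varepsilon, \tau$. Hence, $D(E[\hat{W}_U], E[\hat{W}_V])$ must be a monotonic decreasing function of window spread, and our original claim is proven.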
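The argument above can be illustrated numerically. The Python sketch below weights a hypothetical non-negative ambiguity-plane difference $|A_U - A_V|^2$ by $|\Phi(\varepsilon,\tau)|^2$ and sums over a grid, for increasing window spread. It assumes the Gaussian kernel $\Phi(\varepsilon,\tau) = \exp(-\sigma_t^2\varepsilon^2/2)\exp(-\sigma_\omega^2\tau^2/2)$; the function `difference_energy`, the grid, and the list of spreads are invented purely for the example.

```python
import math


def kernel_gain(eps, tau, sigma_t, sigma_w):
    """|Phi(eps, tau)|^2 for the assumed Gaussian ambiguity-plane kernel."""
    phi = math.exp(-0.5 * (sigma_t * eps) ** 2) * math.exp(-0.5 * (sigma_w * tau) ** 2)
    return phi * phi


def difference_energy(eps, tau):
    """Hypothetical |A_U - A_V|^2; any non-negative function would serve."""
    ripple = 1.0 + math.cos(3.0 * eps) * math.sin(2.0 * tau)
    return ripple * math.exp(-0.1 * (eps * eps + tau * tau))


def distance(sigma_t, sigma_w, n=121, half_width=10.0):
    """Grid approximation of iint |A_U - A_V|^2 |Phi|^2 d(eps) d(tau)."""
    step = 2.0 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        eps = -half_width + i * step
        for k in range(n):
            tau = -half_width + k * step
            total += difference_energy(eps, tau) * kernel_gain(eps, tau, sigma_t, sigma_w)
    return total * step * step


# A spread of 0.0 gives |Phi|^2 = 1 everywhere, i.e. the unsmoothed distance.
spreads = [0.0, 0.25, 0.5, 1.0, 2.0]
distances = [distance(s, s) for s in spreads]

for s, d in zip(spreads, distances):
    print(s, d)
```

Each grid term is multiplied by a factor that never exceeds one and shrinks as the spread grows, so the printed distances should fall monotonically from the unsmoothed value, whatever non-negative difference function is chosen.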