Speech Communication 8 (1989) 363-369 North-Holland
SHORT COMMUNICATION
IMPROVING PERFORMANCE OF CODE EXCITED LPC-CODERS BY JOINT OPTIMIZATION
Jörg-Martin MÜLLER
A.N.T. Telecommunications, Advanced Development Department, E314, Gerberstraße, 7150 Backnang, Fed. Rep. Germany
Received 20 December 1988
Revised 15 June 1989
Abstract. This article describes a CELP speech coding algorithm in which the coder parameters are jointly optimized. To this end, the relation between pitch period, pitch predictor coefficient, codebook entry and scaling factor is derived. Different approaches to solving this optimization problem with low computational load are discussed. It is shown that the length of the excitation codebook can be reduced significantly compared to the sequential optimization of the coder parameters. The influence of the parameter quantization is examined and an estimate of the computational load is given.
Zusammenfassung. Dieser Beitrag beschreibt ein CELP-Sprachcodierverfahren, bei dem die Coderparameter gemeinsam optimiert werden. Dazu wird die Beziehung zwischen Pitchperiode, Pitchprädiktionskoeffizient, Codebuchadresse und Skalierungsfaktor abgeleitet. Verschiedene Ansätze, das Optimierungsproblem aufwandsgünstig zu lösen, werden diskutiert. Es wird gezeigt, daß, im Vergleich zur sequentiellen Optimierung, die Länge des Anregungscodebuches beträchtlich reduziert werden kann. Außerdem wird der Einfluß der Parameterquantisierung untersucht und der Realisierungsaufwand abgeschätzt.
Résumé. Cet article décrit un codeur prédictif linéaire excité par codes (CELP) où les paramètres sont optimisés ensemble. On déduit pour cela la relation entre la période fondamentale (pitch), le coefficient du prédicteur pitch, l'adresse du dictionnaire et le coefficient d'adaptation d'énergie. Après avoir examiné différentes méthodes permettant de résoudre ce problème d'optimisation avec des algorithmes rapides, il est démontré que la longueur du dictionnaire peut être considérablement réduite par rapport à une optimisation séparée. On examine l'influence de la quantification des paramètres du codeur et on donne une estimation du nombre d'opérations nécessaire à sa réalisation.
Keywords. Speech coding, joint optimization, fast algorithm.
1. Introduction
Code-excited linear prediction (CELP) coders (Schroeder and Atal, 1985) belong to the class of RELP (Residual Excited Linear Prediction) coders, in which an innovation sequence is filtered by all-pole filters that represent the speech production mechanism. In CELP coders, the innovation sequence is determined by means of a codebook, from which the best codebook vector is chosen in an "analysis by synthesis" procedure. The codebook is populated with Gaussian distributed random numbers.
The structure depicted in Fig. 1 is derived from the basic block diagram in Schroeder and Atal (1985). In the first stage, the contribution of the memory of the linear prediction (LP) filter, represented by $H_{0S}(z)$, is subtracted from the input speech samples $s(n)$ and the resulting signal is weighted by the filter $W(z)$. In the second stage, the contribution of the weighted memory of the pitch predictor filter (represented by $H_{0L}(z)$ and $H_w(z)$) is subtracted. Finally, the weighted error signal $e_w(n)$ is obtained by computing the difference between the filtered codebook vector and the signal $s'_w(n)$.
0167-6393/90/$3.50 © 1990, Elsevier Science Publishers B.V. (North-Holland)
J.-M. Müller / CELP coder with joint optimization
Fig. 1. Structure of the CELP coder.
The energy $E$ of the error signal is a function of all the coder parameters, i.e.,
$$E = f(a_i, M, b_i, j, Q),$$
where $a_i$, $i = 1, 2, \ldots, P_S$, denotes the coefficients of the LP filter, $M$ the pitch period, $b_i$, $i = 1, 2, \ldots, P_L$, the pitch predictor coefficients, $j = 1, 2, \ldots, K_S$ the codebook entries, and $Q$ the corresponding scaling factor. The best possible speech quality is obtained if all these parameters are jointly optimized. In the following, however, the computation of the LP parameters $a_i$ is not included in the joint optimization process, because this would lead to a non-realizable computational complexity. Therefore, a somewhat suboptimal approach to the joint optimization procedure is investigated here by minimizing the function
$$E = f(M, b_i, j, Q).$$
Solutions for an efficient implementation are given and the joint optimization approach is compared to the sequential procedure with regard to speech quality and computational complexity.
2. Coder description
The linear prediction synthesis filter
$$H_S(z) = \Bigl(1 - \sum_{i=1}^{P_S} a_i z^{-i}\Bigr)^{-1}$$
describes the formant structure of the speech spectrum. The weighting filter
$$W(z) = H_S(z/\gamma)\, H_S(z)^{-1}, \qquad 0 \le \gamma \le 1,$$
performs a spectral shaping of the noise that is due to the "incomplete" excitation. $H_w(z)$ is the concatenation of the LP and weighting filters,
$$H_w(z) = H_S(z)\, W(z).$$
The one-tap ($P_L = 1$) pitch prediction synthesis filter is described by the transfer function
$$H_L(z) = (1 - b z^{-M})^{-1}.$$
The memory locations of the filters $H_S(z)$, $H_L(z)$, and $W(z)$ in Fig. 1 are zero. The parameters of the pitch predictor are updated every $N_S$ samples (subframe) and those of the LP filter every $N$ samples. If $M \ge N_S$ is assumed, the pitch prediction filter can be removed from the excitation branch in Fig. 1, because it does not affect the input of the filter $H_w(z)$ for $n \le N_S$. To describe the effect of the pitch predictor memory, a more detailed representation is included in Fig. 1. The values in the memory locations are denoted by $l(k)$. Each pitch period candidate $M = k$ generates a different signal $d_k(n)$ at the delay line output. This filter can be interpreted as a signal generator which produces $K_L$ different signals $d_k(n)$. $K_L$ depends on the allowed range of the pitch period $M$. A good choice for $M$ is between 40 and 103; to cover this range, $K_L$ must be 64. This directly leads to the block diagram in Fig. 2. The $K_L$ different signals $d_k(n)$ can be thought of as being combined in a codebook. In this representation, there is no difference in the structure of the branch with the excitation codebook and the branch with the codebook created by the filter memory of the pitch predictor. Only the characteristics of the codebooks are different: the excitation codebook is fixed whereas the other is time-variant (adaptive), because the filter memory is
modified after each subframe. For optimizing the parameters, a huge number ($K_L K_S$) of different combinations has to be checked in order to find the minimum error energy $E$. All these combinations correspond to one codebook of length $K_L K_S$, whereas the sequential optimization corresponds to a two-stage vector quantization with two codebooks of length $K_L$ and $K_S$, respectively.
Fig. 2. Modified structure of the CELP coder.
3. Joint optimization procedure
With the block diagram in Fig. 2, the error energy $E$ as a function of the codebook entries $j$ and $k$ and the scaling factors $c_j$ and $b_k$ is given by
$$E(j, k, b_k, c_j) = \sum_{n=1}^{N_S} \bigl[ s_w(n) - (b_k d_k(n) + c_j r_j(n)) * h_w(n) \bigr]^2, \tag{1}$$
where $h_w(n)$ denotes the impulse response of the weighted LP filter and $*$ the convolution. For a minimum of $E$ with regard to the scaling factors, the following system of linear equations has to be fulfilled:
$$\begin{pmatrix} \langle p_k(n), p_k(n) \rangle & \langle p_k(n), q_j(n) \rangle \\ \langle p_k(n), q_j(n) \rangle & \langle q_j(n), q_j(n) \rangle \end{pmatrix} \begin{pmatrix} b_k \\ c_j \end{pmatrix} = \begin{pmatrix} \langle p_k(n), s_w(n) \rangle \\ \langle q_j(n), s_w(n) \rangle \end{pmatrix} \tag{2}$$
with
$$p_k(n) = d_k(n) * h_w(n),$$
$$q_j(n) = r_j(n) * h_w(n),$$
and
$$\langle a(n), b(n) \rangle = \sum_{n=1}^{N_S} a(n)\, b(n).$$
With (2) and (1), the minimum error energy is
$$E_{\min} = \langle s_w(n), s_w(n) \rangle - T(j, k, c_j, b_k). \tag{3}$$
Because the energy of $s_w(n)$ is constant during a subframe, the expression
$$T(j, k, c_j, b_k) = b_k \langle p_k(n), s_w(n) \rangle + c_j \langle q_j(n), s_w(n) \rangle \tag{4}$$
has to be maximized. The computation of $T$ consists of two steps:
- solution of (2), and
- computation of (4).
These steps have to be performed $K_L K_S$ times.
4. Complexity reduction
To reduce the computational load compared to the direct computation of (2) and (4), advantage can be taken of the special structure of the codebook generated by the pitch predictor memory. The filtered codebook vectors $p_k(n)$ can be defined recursively:
$$p_{k+1}(n) = \begin{cases} l(k+1)\, h_w(1) & \text{for } n = 1, \\ p_k(n-1) + l(k+1)\, h_w(n) & \text{for } n = 2, \ldots, N_S. \end{cases}$$
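The recursion for the filtered vectors $p_k(n)$ can be checked numerically. The following Python sketch is illustrative only: it assumes that the delay-line output is $d_k(n) = l(k - n + 1)$ (an indexing that is consistent with the recursion but is not spelled out in the text), it uses arbitrary test data in place of real filter memory, and the helper names `filtered_vector` and `next_filtered_vector` are hypothetical.

```python
import random

def filtered_vector(l, h, k, Ns):
    """Direct computation of p_k(n) = d_k(n) * h_w(n) for n = 1..Ns.

    Assumption (not stated explicitly in the text): d_k(n) = l(k - n + 1),
    where l maps a memory location to the stored sample.
    h[0] holds h_w(1), h[1] holds h_w(2), and so on.
    """
    d = [l[k - n + 1] for n in range(1, Ns + 1)]
    return [sum(d[i] * h[n - i] for i in range(n + 1)) for n in range(Ns)]

def next_filtered_vector(p_k, l_next, h):
    """Recursive update: p_{k+1} from p_k, the sample l(k+1) and h_w."""
    p_next = [l_next * h[0]]                      # case n = 1
    for n in range(1, len(p_k)):                  # cases n = 2..Ns
        p_next.append(p_k[n - 1] + l_next * h[n])
    return p_next

random.seed(0)
Ns = 40                                           # 5 ms subframe at 8 kHz
h = [0.9 ** n for n in range(Ns)]                 # toy impulse response
l = {k: random.gauss(0.0, 1.0) for k in range(1, 105)}   # toy filter memory

p = filtered_vector(l, h, 40, Ns)                 # first candidate, k = 40
max_dev = 0.0
for k in range(40, 103):                          # candidates k + 1 = 41..103
    p = next_filtered_vector(p, l[k + 1], h)
    direct = filtered_vector(l, h, k + 1, Ns)
    max_dev = max(max_dev, max(abs(a - b) for a, b in zip(p, direct)))
print(max_dev)                                    # rounding noise only
```

Each recursive update costs about $N_S$ multiply-add operations, whereas the direct convolution in `filtered_vector` costs on the order of $N_S^2/2$ per candidate.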
Additionally, a sparse codebook is used to produce the sequences $r_j(n)$. As shown in Davidson and Gersho (1986), speech quality is not affected if up to 90% of the vector elements are zero. It is assumed that for each $r_j(n)$ only $A$ locations are nonzero. Parts of equation (2) can now be computed very efficiently:
(a)
$$\langle q_j(n), s_w(n) \rangle = \langle r_j(n), R_{s_w h_w}(n) \rangle. \tag{5}$$
Vol. 8, No. 4, December 1989
$R_{ab}(n)$ denotes the correlation between the signals $a(n)$ and $b(n)$:
$$R_{ab}(n) = \sum_{m=1}^{N_S} a(m)\, b(m - n + 1), \qquad n = 1, 2, \ldots, N_S.$$
The computation of (5) requires $A$ operations. The correlation function can be calculated by time-inversed filtering (Müller and Scheuermann, 1988) ($N_S P_S$ operations) and has to be performed only once per subframe.
(b)
$$\langle p_k(n), q_j(n) \rangle = \langle r_j(n), R_{p_k h_w}(n) \rangle. \tag{6}$$
$R_{p_k h_w}(n)$ can be evaluated recursively:
$$R_{p_{k+1} h_w}(n) = \begin{cases} l(k+1)\, R_{h_w h_w}(1) + \sum_{i=1}^{N_S - 1} p_k(i)\, h_w(i+1) & \text{for } n = 1, \\ R_{p_k h_w}(n-1) + l(k+1)\, R_{h_w h_w}(n) - p_k(N_S)\, h_w(N_S + 2 - n) & \text{for } n = 2, \ldots, N_S. \end{cases} \tag{7}$$
Again, the evaluation of (6) requires $A$ operations. Eq. (7) can be computed with $2N_S$ operations, provided that the weighted impulse response $h_w(n)$ can be neglected for $n > 25$.
(c) For the computation of the energy of $q_j(n)$, the approximation (Trancoso and Atal, 1987)
$$\langle q_j(n), q_j(n) \rangle \approx R_{r_j r_j}(1)\, R_{h_w h_w}(1) + 2 \sum_{i=2}^{N_S} R_{r_j r_j}(i)\, R_{h_w h_w}(i)$$
is used. Because $r_j(n)$ has only $A$ nonzero elements, the function $R_{r_j r_j}(n)$ has at most $\tfrac{1}{2}A(A - 1) + 1$ nonzero elements. $R_{r_j r_j}(n)$ can be precomputed and stored together with the excitation codebook. $R_{h_w h_w}(n)$ is computed once per speech frame by time-inversed filtering. With the above results, the computational complexity is reduced significantly compared to the direct evaluation of equations (2) and (4). The procedure which examines all $K_L K_S$ combinations of the codebook vectors is referred to as the "full search" procedure P0.
A further reduction in operations can be achieved if only a subset of all possible combinations is examined. For speech signals, especially in voiced regions, there are only a few values of $k$ which are likely to be chosen as the final pitch period $M$. Hence, in a first step, eqs. (2) and (4) are solved for $r_j(n) = 0$, $n = 1, 2, \ldots, N_S$, which is equivalent to a "closed loop" determination of the pitch predictor parameters (Singhal and Atal, 1984). Only the $K'_L$ signals $d_k(n)$ which produce the lowest energy of $s'_w(n)$ are written into the codebook. In the second step, the joint optimization procedure described above is performed, but this time with a codebook length $K'_L$ which is much smaller than the maximum length $K_L$. This approach will be called procedure P1($K'_L$). The difference between the joint optimization procedure P1(1) and the sequential approach is that, in the former case, eqs. (2) and (4) are calculated for the $K_S$ excitation vectors whereas, in the latter case,
$$\langle q_j(n), q_j(n) \rangle\, c_j = \langle q_j(n), s'_w(n) \rangle$$
and
$$T_S(j, c_j) = c_j \langle q_j(n), s'_w(n) \rangle$$
are determined for each codebook vector. These equations result from (2) and (4) by setting $p_k(n)$ to zero and $s_w(n) = s'_w(n)$.
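The relation between the sequential search, P1($K'_L$) and the full search P0 can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the implementation used in the paper: the filtered vectors are random test data rather than real adaptive-codebook entries, all inner products are computed directly (without the shortcuts (5) to (7)), and the function names are hypothetical.

```python
import random

def dot(a, b):
    """Inner product <a, b> over one subframe."""
    return sum(x * y for x, y in zip(a, b))

def convolve(x, h):
    """First len(x) samples of x * h (filtering with zero initial state)."""
    return [sum(x[i] * h[n - i] for i in range(n + 1)) for n in range(len(x))]

def t_single(v, s):
    """T achievable with one vector and its optimal gain: <v,s>^2 / <v,v>."""
    return dot(v, s) ** 2 / dot(v, v)

def joint_gains(p, q, s):
    """Solve system (2) for (b_k, c_j); return (T, b_k, c_j) with T from (4)."""
    pp, pq, qq = dot(p, p), dot(p, q), dot(q, q)
    ps, qs = dot(p, s), dot(q, s)
    det = pp * qq - pq * pq
    if det <= 1e-12 * pp * qq:          # nearly collinear pair: skip it
        return None
    b = (ps * qq - qs * pq) / det
    c = (qs * pp - ps * pq) / det
    return b * ps + c * qs, b, c

def search_p1(P, Q, s, keep):
    """P1(keep): preselect `keep` pitch candidates, then optimize jointly."""
    ranked = sorted(range(len(P)), key=lambda k: t_single(P[k], s), reverse=True)
    best = None
    for k in ranked[:keep]:
        for j, q in enumerate(Q):
            res = joint_gains(P[k], q, s)
            if res is not None and (best is None or res[0] > best[0]):
                best = (res[0], k, j)
    return best

def search_sequential(P, Q, s):
    """Sequential search: pitch parameters first, excitation second."""
    k = max(range(len(P)), key=lambda i: t_single(P[i], s))
    b = dot(P[k], s) / dot(P[k], P[k])
    s1 = [x - b * y for x, y in zip(s, P[k])]          # target s'_w(n)
    j = max(range(len(Q)), key=lambda i: t_single(Q[i], s1))
    c = dot(Q[j], s1) / dot(Q[j], Q[j])
    return b * dot(P[k], s) + c * dot(Q[j], s1), k, j  # total energy reduction

random.seed(1)
NS, KL, KS, A = 40, 64, 32, 4                          # KS kept small for speed
h = [0.85 ** n for n in range(NS)]                     # toy weighted impulse response
s = [random.gauss(0.0, 1.0) for _ in range(NS)]        # toy target s_w(n)
P = [convolve([random.gauss(0.0, 1.0) for _ in range(NS)], h) for _ in range(KL)]
Q = []
for _ in range(KS):
    r = [0.0] * NS
    for pos in random.sample(range(NS), A):            # sparse vector, A nonzeros
        r[pos] = random.gauss(0.0, 1.0)
    Q.append(convolve(r, h))

t_seq = search_sequential(P, Q, s)[0]
t_p1_1 = search_p1(P, Q, s, 1)[0]
t_p1_5 = search_p1(P, Q, s, 5)[0]
t_p0 = search_p1(P, Q, s, KL)[0]                       # keep = K_L is the full search
print(t_seq, t_p1_1, t_p1_5, t_p0)
```

On any data the four values are non-decreasing in this order: P1(1) uses the same pitch candidate as the sequential search but re-optimizes both gains together, and each larger candidate set contains the smaller one.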
5. Quantization of the scaling factors
The quantization of the factors $b_k$ and $c_j$ can be performed during the search procedure or only once, at the end. In the first case, the quantization error of the scaling factors is included in the computation of the total error energy $E_{\min}$ in (3), which requires a relatively high computational effort. In the latter case, the quantization has to be performed only once, but the quantization errors cannot be partly compensated by a better fitting pair of codebook vectors. There is also the choice between a two-dimensional vector quantization and a scalar quantization. Here, only scalar quantization is examined, because this quantization procedure requires no additional operations (only table lookups) and is less sensitive to errors on the transmission channel. Finally, only the quantization during the search procedure is considered. Taking into account the quantization effects of the scaling factors, eq. (3) cannot be used any more, because this equation has been derived under the condition that (2) is exactly fulfilled. Now the minimum error energy must be computed directly by (1). Denoting by $Q(a)$ the quantized value of $a$, the minimum energy is computed as follows:
$$E^{Q}_{\min} = \langle s_w(n), s_w(n) \rangle - T_Q,$$
with
$$T_Q = 2\bigl[ Q(b_k) \langle p_k(n), s_w(n) \rangle + Q(c_j) \langle q_j(n), s_w(n) \rangle - Q(b_k) Q(c_j) \langle p_k(n), q_j(n) \rangle \bigr] - Q(b_k)^2 \langle p_k(n), p_k(n) \rangle - Q(c_j)^2 \langle q_j(n), q_j(n) \rangle. \tag{8}$$
The computation of $T_Q$ now consists of three steps:
- solution of (2),
- quantization of the scaling factors by means of scalar quantization, and
- computation of equation (8).
6. Simulation results
The above procedures have been simulated and compared to the sequential procedure with respect to speech quality (segmental signal-to-noise (S/N) ratio) and complexity (MAC operations per second). The following coder configuration was used:
• $P_S = 10$, adaptation every 20 ms (sampling frequency 8000 Hz), vector quantization with 18 bits per frame (Guth et al., 1986);
• weighting filter: $\gamma = 0.8$;
• long-term predictor: $P_L = 1$, adaptation every 5 ms, $40 \le k < 104$;
• stochastic codebook: $A = 4$ elements per vector are nonzero;
• the simulated speech sequence lasted 60 s and was spoken by six male and female speakers.
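The segmental S/N ratio used as the quality measure is not defined in the text. A common definition averages the per-segment S/N values in dB; the Python sketch below follows that definition and should be read as an assumption. In particular, the segment length of 160 samples (20 ms at 8 kHz) and the function name `segmental_snr_db` are illustrative choices, not values taken from the paper.

```python
import math

def segmental_snr_db(reference, decoded, segment_len=160):
    """Mean of the per-segment S/N values in dB.

    segment_len = 160 (20 ms at 8 kHz) is an assumption; the text gives no value.
    Segments with zero signal energy or zero noise energy are skipped.
    """
    values = []
    for start in range(0, len(reference) - segment_len + 1, segment_len):
        ref = reference[start:start + segment_len]
        dec = decoded[start:start + segment_len]
        signal = sum(x * x for x in ref)
        noise = sum((x - y) ** 2 for x, y in zip(ref, dec))
        if signal > 0.0 and noise > 0.0:
            values.append(10.0 * math.log10(signal / noise))
    return sum(values) / len(values) if values else float("nan")

# Toy check: a decoded signal that is 10% too small everywhere,
# so the noise amplitude is one tenth of the signal amplitude.
reference = [math.sin(0.05 * n) for n in range(800)]
decoded = [0.9 * x for x in reference]
snr = segmental_snr_db(reference, decoded)
print(snr)   # close to 20 dB
```

Because the average is taken over dB values, quiet segments weigh as much as loud ones, which is why this measure is often preferred to a single global S/N for speech.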
Fig. 3. Segmental S/N ratio for different joint optimization procedures, plotted against ld($K_S$) from 5 to 10. Curves: P0, P1(10), P1(5), P1(1), sequential, and the quantized variants of P1(5) and sequential.
In Fig. 3, the segmental S/N ratio for the different procedures is shown as a function of the excitation codebook length $K_S$. There is an improvement of more than 0.5 dB already for $K'_L = 1$. For $K'_L \ge 5$, there is no significant additional improvement in S/N. With $K'_L = 5$, the codebook for the sequential procedure has to be 3 to 8 times longer than for the joint optimization procedure to reach the same S/N ratio. This behaviour was confirmed in informal listening tests. The dashed curves include the quantization of the scaling factors. In both cases, each of the scaling factors is quantized with a 4-bit nonuniform quantizer. Evidently, the difference between the procedures does not change if the scaling factors are quantized. The computational complexity of the introduced procedures for determining the parameters of the pitch prediction filter and the stochastic codebook is depicted in Fig. 4, including the scalar quantization during the joint optimization procedure.
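The quantized curves rest on the three-step computation of Section 5. The Python sketch below walks through those steps for a single pair of filtered vectors and cross-checks the result against the error energy computed directly from (1). It is only an illustration: the 16-level table `LEVELS` is a hypothetical stand-in for the 4-bit nonuniform quantizer, whose actual levels are not given in the text, and the expression for $T_Q$ is written out in full, including the cross term between the two quantized gains that follows from expanding (1).

```python
import random

def dot(a, b):
    """Inner product <a, b> over one subframe."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 4-bit nonuniform table: 16 levels with geometric spacing.
LEVELS = sorted(sign * 0.05 * 1.5 ** i for i in range(8) for sign in (-1.0, 1.0))

def quantize(x, levels=LEVELS):
    """Scalar quantization: pick the nearest table entry."""
    return min(levels, key=lambda level: abs(level - x))

def t_quantized(p, q, s, bq, cq):
    """T_Q for already quantized gains bq = Q(b_k) and cq = Q(c_j)."""
    return (2.0 * (bq * dot(p, s) + cq * dot(q, s) - bq * cq * dot(p, q))
            - bq * bq * dot(p, p) - cq * cq * dot(q, q))

random.seed(2)
NS = 40
p = [random.gauss(0.0, 1.0) for _ in range(NS)]        # toy p_k(n)
q = [random.gauss(0.0, 1.0) for _ in range(NS)]        # toy q_j(n)
s = [0.6 * x - 0.3 * y + 0.1 * random.gauss(0.0, 1.0) for x, y in zip(p, q)]

# Step 1: solve (2) for the unquantized gains.
pp, pq, qq, ps, qs = dot(p, p), dot(p, q), dot(q, q), dot(p, s), dot(q, s)
det = pp * qq - pq * pq
b = (ps * qq - qs * pq) / det
c = (qs * pp - ps * pq) / det
t_exact = b * ps + c * qs                              # eq. (4)

# Step 2: scalar quantization. Step 3: quantized criterion.
bq, cq = quantize(b), quantize(c)
t_q = t_quantized(p, q, s, bq, cq)

# Cross-check against the error energy computed directly from (1).
e_direct = sum((x - bq * y - cq * z) ** 2 for x, y, z in zip(s, p, q))
e_from_tq = dot(s, s) - t_q
print(abs(e_direct - e_from_tq), t_q <= t_exact)
```

Since the unquantized gains maximize the criterion, `t_q` can never exceed `t_exact`; the gap between the two is the price of quantization for this pair of vectors.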
7. Discussion
For maintaining the same speech quality, the excitation codebook can be reduced by a factor of 3 to 8 with the joint optimization of pitch predictor and excitation parameters. Up to 600 bits/s in excitation bitrate can be saved. A constraint is the high computational complexity, especially for long codebooks. However, for $K_S \le 256$ and $K'_L = 5$ the joint optimization procedure is realizable on modern digital signal processors.
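One way to see where a saving of this size comes from is the following back-of-the-envelope calculation in Python. It assumes that one codebook index of $\log_2 K_S$ bits is transmitted per 5 ms subframe, which matches the subframe length given in Section 6 but is not stated explicitly as the basis of the 600 bits/s figure; the function name is hypothetical.

```python
import math

SUBFRAMES_PER_SECOND = 1000 // 5   # one codebook index per 5 ms subframe (assumed)

def saved_bits_per_second(reduction_factor):
    """Bit rate saved when the codebook length shrinks by `reduction_factor`."""
    return math.log2(reduction_factor) * SUBFRAMES_PER_SECOND

print(saved_bits_per_second(8))          # 600.0
print(round(saved_bits_per_second(3)))   # 317
```

Under this assumption a codebook that is 8 times shorter needs 3 bits less per index, which at 200 subframes per second gives the quoted 600 bits/s; a factor of 3 corresponds to roughly half of that.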
Fig. 4. Computational complexity of the joint optimization procedures, plotted against ld($K_S$) from 5 to 10. Curves are shown for P0 and for P1($K'_L$).
References
G. Davidson and A. Gersho (1986), "Complexity reduction methods for vector excitation coding", Proc. ICASSP, Tokyo, pp. 56.8.1-56.8.4.
P. Guth, H. Reininger and D. Wolf (1986), Personal communications, Institut für angewandte Physik, Universität Frankfurt.
J.-M. Müller and H. Scheuermann (1988), "RELP-Sprachcodierung mittels 'Analyse durch Synthese'", Nachrichtentechnische Berichte, ANT Telecommunications, pp. 93-105.
M.R. Schroeder and B.S. Atal (1985), "Code Excited Linear Prediction (CELP): High-quality speech at very low bit rates", Proc. ICASSP, Tampa, pp. 937-940.
S. Singhal and B.S. Atal (1984), "Improving performance of Multi-Pulse LPC coders at low bit rates", Proc. ICASSP, San Diego, pp. 1.3.1-1.3.4.
I.M. Trancoso and B.S. Atal (1987), "Efficient procedures for finding the optimum innovation in stochastic coders", Proc. ICASSP, Dallas, pp. 44.5.1-44.5.4.