Signal Processing 24 (1991) 253-269, Elsevier
Hardware implementation of partitioned-parallel algorithms in linear prediction

G. Carayannis, E. Koukoutsis and C.C. Halkias
National Technical University of Athens, Division of Computer Science, Zografou, GR-157 73 Athens, Greece

Received 8 February 1989; revised 21 February 1990 and 8 April 1991
Abstract. In this paper, a partitioned-parallel strategy for the solution of the Toeplitz equations appearing in the linear prediction case is introduced. Between the one extreme, i.e., the use of the Schur recursions for an order recursive implementation with O(1) processors which perform O(p^2) MADs (multiplications and divisions), and the other extreme, i.e., the use of the same recursions for a fully parallel implementation with O(p) processors which perform O(p) MADs, there exists a compromise: the hardware designer can 'cut' the computational scheme into suitable partitions, which are executed one after the other, with all the computations of each partition organized in a parallel manner. In this way, increased flexibility can be achieved, especially in relation to the model order, which can become totally independent of the available number of processors. Moreover, in this paper an abatement methodology is introduced which significantly reduces the number of multiplications of the above computational schemes, as well as the overall algorithm complexity in the case of the parallel design.
Keywords. Linear prediction, PARCOR coefficients, parallel algorithms, partitioned-parallel algorithms.
1. Introduction
Linear prediction techniques have become more and more attractive during the last two decades in many areas such as speech and image analysis, speech synthesis and recognition, neurophysics, geophysics and communications, to name but a few. Fast order recursive algorithms of the Levinson or the Schur type have become famous not only for
their reduced complexity, but also for the physical interpretations of the solution of the LPC problems that they permit, for the features of the new intermediate variables they use, etc. The Schur algorithm was first introduced in DSP by Le Roux and Gueguen in 1977 [8]. A deep theoretical study on related issues has been presented by Dewilde et al. [5]. Later, in 1983, Kung and Hu [7] used the same algorithm to develop a parallel scheme and a wavefront processor. Their technique requires that the number of processors depend on the order of the system to be solved. Therefore one cannot use this technique to process a signal produced by a system having an order independent of the number of available parallel processors. A new computational organization of the operations performed for the solution of the Toeplitz equations is proposed in [2, 6]. This computational scheme, called the superlattice structure, provides a complete view of all the operations needed to obtain the PARCOR coefficients. Furthermore, it gives the designer the possibility to organize the computations taking into account the given technological or cost-performance restrictions (e.g., the number of available processors, or the number of processors that can be accommodated in the area of a single chip).
2. Toeplitz solving and the superlattice

In this section, the superlattice structure will be discussed. Special emphasis will be given to the computation of the lattice filter coefficients, which can be done in a number of different ways, including a highly parallel structure, a partially parallel structure and an order recursive one. It is the symmetrical case that will be considered in this work (i.e., the case of a symmetrical Toeplitz system matrix), which corresponds to AR modelling. The Durbin algorithm is generally used for the solution of the autocorrelation equations:

$R_p a_p = -r_p,$  (1)

where p is the maximum permitted (or parsimonious) order of the model.

The Durbin algorithm provides an order recursive technique for the solution of the system (1). At each step m of the algorithm the solution of the system

$R_m a_m = -r_m$  (2)

is obtained, where

$a_m = [a_{1,m}\ a_{2,m}\ \cdots\ a_{m,m}]^T,$  (3)

$r_m = [r_1\ r_2\ \cdots\ r_m]^T,$  (4)

$R_m = \mathrm{ST}(r_0\ r_1\ \cdots\ r_{m-1}),$  (5)

ST( ) being the symmetrical Toeplitz matrix constructed from the vector inside the brackets. $R_m$ is also known as the autocorrelation matrix, $r_m$ is the autocorrelation vector and $a_m$ is the m-th order predictor. The Durbin algorithm solves the system (1) as follows:

(0) Initialization: $\alpha_0 = r_0$, $\beta_1 = r_1$, $k_1 = -r_1/r_0$, $a_{1,1} = k_1$, $m = 1$.
(1) New information: $r_{m+1}$.
(2.1) $\alpha_m = \alpha_{m-1} + \beta_m k_m$.
(2.2) If $\alpha_m \le 0$, STOP: "$R_m$ is not a positive definite matrix."
(2.3) $\beta_{m+1} = a_m^T J r_m + r_{m+1}$.
(2.4) $k_{m+1} = -\beta_{m+1}/\alpha_m$.
(3) $a_{m+1} = \begin{bmatrix} a_m \\ 0 \end{bmatrix} + k_{m+1}\begin{bmatrix} J a_m \\ 1 \end{bmatrix}$.
(4) $m = m + 1$.
(5) If $m < p$, go to step (1).

Here $J$ denotes the exchange matrix, with ones along the secondary diagonal and zeros elsewhere:

$J = \begin{bmatrix} 0 & \cdots & 0 & 1 \\ 0 & \cdots & 1 & 0 \\ \vdots & & & \vdots \\ 1 & \cdots & 0 & 0 \end{bmatrix}.$
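The recursion above is compact enough to state in a few lines of code. The following Python sketch (the function name and NumPy representation are ours, not the paper's) follows steps (0)-(5) literally, with one extra line after the loop so that the final prediction error $\alpha_p$ is also produced.

```python
import numpy as np

def durbin(r, p):
    """Levinson-Durbin recursion, steps (0)-(5) above.
    r: autocorrelation lags [r_0, ..., r_p]; returns (a_p, [k_1..k_p], alpha_p)."""
    r = np.asarray(r, dtype=float)
    a = np.array([-r[1] / r[0]])                # (0): a_{1,1} = k_1 = -r_1/r_0
    k = [a[0]]
    alpha, beta = r[0], r[1]                    # alpha_0 and beta_1
    for m in range(1, p):                       # produces orders 2..p
        alpha = alpha + beta * k[-1]            # (2.1)
        if alpha <= 0:
            raise ValueError("R_m is not a positive definite matrix")  # (2.2)
        beta = a @ r[m:0:-1] + r[m + 1]         # (2.3): a_m^T J r_m + r_{m+1}
        k_next = -beta / alpha                  # (2.4)
        # (3): predictor update [a_m; 0] + k_{m+1} [J a_m; 1]
        a = np.concatenate((a, [0.0])) + k_next * np.concatenate((a[::-1], [1.0]))
        k.append(k_next)
    alpha = alpha + beta * k[-1]                # final error alpha_p
    return a, k, alpha
```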
Next, the following quantities, introduced in [8], will be used:

$\zeta_i^m = r_i + a_m^T r_{i,m}, \quad 0 \le m \le p, \; -p+1 \le i \le p,$  (6)

where $r_{i,m} = [r_{i-1}\ r_{i-2}\ \cdots\ r_{i-m}]^T$. If one defines

$\zeta_i^0 = r_i, \quad -p \le i \le p,$  (7)

then one can see that there are three characteristic regions for $\zeta_i^m$, depending on the value of i:
(i) $i = 0$, where it is easy to prove that

$\zeta_0^m = \alpha_m,$  (8)

(ii) $1 \le i \le m$, where

$\zeta_i^m = 0$  (9)

(conventional Yule-Walker or normal equations);
(iii) $i = m + 1$, where it can be proved that

$\zeta_{m+1}^m = \beta_{m+1}.$  (10)

The lattice coefficients $k_{m+1}$ (or PARCORs) can be computed using the relationship

$k_{m+1} = -\zeta_{m+1}^m / \zeta_0^m.$  (11)

The most important property of the quantities $\zeta_i^m$ is that they can be computed recursively through the use of the following Schur-type formula [2, 6, 8]:

$\zeta_i^m = \zeta_i^{m-1} + k_m \zeta_{m-i}^{m-1}.$  (12)

Fig. 1. The superlattice for p = 8.

It has been shown in [2, 6] that these recursive computations can be organized to form a 'canonical' structure, called the superlattice, which is illustrated in Fig. 1. This structure is important for the development of algorithms for the solution of the Toeplitz systems appearing in the linear prediction problem, which is discussed here, as well as in the optimal FIR filtering and the optimum lag problems, which are studied in more detail in [6]. It is shown in [6] that, for each one of these problems, there exist three kinds of solutions: an order recursive solution, a fully parallel one, and a partitioned-parallel one. The purpose of this work is the study of the partitioned-parallel solution when a given number of processing elements (PEs), possibly small, but generally independent of the model order p, is available.
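Since (12) needs no predictor coefficients at all, the PARCORs can be produced directly from the $\zeta$ quantities, with one division per order as in (11). The sketch below is a minimal rendering of (7), (11) and (12); the dictionary-based indexing and the function name are our own illustrative choices, not the paper's hardware mapping.

```python
def schur_parcors(r, p):
    """PARCORs via the Schur-type recursion (12); r = [r_0, ..., r_p]."""
    # (7): zeta_i^0 = r_i for -p <= i <= p, using the symmetry r_{-i} = r_i
    zeta = {i: r[abs(i)] for i in range(-p, p + 1)}
    k = []
    for m in range(1, p + 1):
        k.append(-zeta[m] / zeta[0])       # (11): k_m = -zeta_m^{m-1} / zeta_0^{m-1}
        # (12): zeta_i^m = zeta_i^{m-1} + k_m * zeta_{m-i}^{m-1}
        zeta = {i: zeta[i] + k[-1] * zeta[m - i] for i in range(m - p, p + 1)}
    return k
```

On r = [1, 0.5, 0.2], for instance, this returns k_1 = -0.5 and k_2 = 1/15, in agreement with the Durbin sketch above.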
3. The partitioned-parallel hardware

Fig. 2. The butterfly basic cell (BBC).

The whole superlattice structure appearing in Fig. 1 can be produced by the repetition of a simple basic cell (BC), shown in Fig. 2. This cell will be
called hereon a butterfly basic cell, or BBC, because of its butterfly-like shape. Later in this work another BC will be introduced, the lattice basic cell, or LBC. It is easy to see that the upper half of the BBC is a direct implementation of the Schur recursion; on the other hand, the lower part of the BBC can be derived from the same relationship, if we substitute $m - i + 1$ for $i$. Figure 3 clarifies the method to produce the superlattice by a suitable repetition of BBCs. The first BBC is used to compute the pair $(\zeta_2^1, \zeta_0^1)$ needed for the calculation of $k_2$. For each subsequent order, a number of interconnected BBCs are added to the already existing scheme. For the third order, for example, two BBCs are needed for the production of the pairs $(\zeta_3^1, \zeta_{-1}^1)$ and $(\zeta_3^2, \zeta_0^2)$, and the calculation of $k_3$. In general, m new BBCs are necessary in order to proceed from order m to order m + 1. These BBCs form a slanted band in the upper left area of the superlattice and produce the pairs $(\zeta_{m+1}^1, \zeta_{1-m}^1), \ldots, (\zeta_{m+1}^m, \zeta_0^m)$. The last pair gives the PARCOR coefficient $k_{m+1}$.

Fig. 3. Production of the superlattice by the repetition of BCs.

In the following, a very convenient scheme for the implementation of the superlattice in a multiprocessor environment will be presented. This scheme is a compromise between the hardware constraints and the need for increased processing
speed. It can take full advantage of the presence of more than one PE which can work in parallel, even if the system order is much greater than the number of PEs. This can be achieved by cutting the superlattice structure into broad parallel bands, called partitions, which are computed one after the other, while in each partition the computations are performed in parallel. Figure 4 shows an example of the partitioning of the BBC-based superlattice, for a system of order p = 8, when the number of available processors is n = 3. Since the computations within each partition are performed in parallel, the partition width depends on the number of PEs: the larger the number of available PEs, the broader the partitions can be. On the other hand, the sequential computation of the partitions, i.e., the fact that the partitions are computed one after the other, introduces a bordering discontinuity, so that the hardware has to be organized accordingly: a bordering buffer (BB) must be introduced as a temporary storage area for the intermediate quantities of each partition which are necessary for the computation of the next partition. Figure 4 shows this clearly: the dotted lines that define the limits of the partitions cross some superlattice lines which carry intermediate quantities from one partition to the next one, hence the need for the bordering buffer.

Figure 5 shows a signal flow diagram for the various BBCs of the superlattice of Fig. 4. Each PE is set to execute the operations of a single BBC. So, it can be realized either as a single element which needs two cycles to complete the BBC operations, or as a double element which can compute the two halves of the BBC in a single cycle. In the following, the terms BBC, PE and processor will be frequently used without distinction. In Fig. 5, the partitions into which the superlattice is cut are clearly shown. The dark lines separating the partitions are crossed by the e and f lines: each crossing corresponds to a temporary storage cell of the BB. The BBCs (or PEs) on the same vertical belong to the same level. So, BBCs 1, 2, 3, 7, 8, 9 and 22 are in level 1, BBCs 4, 5, 10, 11, 12 and 23 are in level 2, and so on. The PEs of each level compute quantities of the superlattice with the same upper index: $\zeta_i^1$ in level 1, $\zeta_i^2$ in level 2, etc.
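The level-by-level numbering just quoted can be reproduced mechanically. The sketch below is inferred from that numbering alone (the tapering rule and the function name are our reconstruction, not taken from the original figure); for p = 8 and n = 3 it yields BBCs 1, 2, 3 at level 1 of partition 1, BBCs 7, 8, 9 at level 1 of partition 2, BBC 22 at level 1 of partition 3, and so on.

```python
def bbc_schedule(p, n):
    """Map (partition, level) -> list of BBC ids for the partitioned scheme."""
    a = (p - 1) % n or n                 # active PEs in the last partition
    n_p = -(-(p - 1) // n)               # ceil((p - 1) / n) partitions
    sched, bbc = {}, 1
    for q in range(1, n_p + 1):
        width = n if q < n_p else a      # PEs usable in partition q
        top = min(q * n, p - 1)          # deepest level reached by partition q
        for level in range(1, top + 1):
            active = min(width, top - level + 1)   # PEs go idle near the top
            sched[(q, level)] = list(range(bbc, bbc + active))
            bbc += active
    return sched

# bbc_schedule(8, 3)[(2, 1)] == [7, 8, 9];  bbc_schedule(8, 3)[(3, 1)] == [22]
```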
Fig. 4. The partitioned-parallel technique for the superlattice with p = 8, n = 3 ($n_p$ = 3), BBC case.

Although the partitioned-parallel scheme can be realized with any number of PEs, it requires only three slightly different kinds of processors, as far as the communication with the bordering buffer is concerned:
- the top processor, which feeds the BB (PEs 3, 5, 6, 9, 12, etc. in Fig. 5),
- the optional middle processor(s), which do not communicate with the BB (PEs 2, 4, 8, 11, 14, etc.), and
- the bottom processor, which receives data from the BB (PEs 7, 10, 13, 16, 22, etc.).
If one wishes to use more than three processors, one must insert more middle-type elements between the bottom and top processors. Three PEs are used in Fig. 5: one of the top-type (namely TP), one of the middle-type (MP) and one
of the bottom-type (BP). Suppose that we use the notation TP(j) (or MP(j) or BP(j)) to denote that the top (or middle or bottom) processor executes the operations of the j-th BBC. Then the order of the computations corresponding to the signal flow diagram of Fig. 5 is described by the following simplified phases of computation.

Fig. 5. Signal flow diagram for the superlattice (D: divider; BC: basic cell).

Initialization:
- Phase 0: Computation of $k_1$ (divider active).

Partition 1 computations:
- Phase 1.1: Level 1 computations: BP(1), MP(2), TP(3). (All PEs active.)
- Phase 1.1a: Computation of $k_2$ (divider active).
- Phase 1.2: Level 2 computations: MP(4), TP(5). (BP idle, MP and TP active.)
- Phase 1.2a: Computation of $k_3$ (divider active).
- Phase 1.3: Level 3 computations: TP(6). (BP and MP idle, TP active.)
- Phase 1.3a: Computation of $k_4$ (divider active).

Partition 2 computations:
- Phase 2.1: Level 1 computations: BP(7), MP(8), TP(9). (All PEs active.)
- Phase 2.2: Level 2 computations: BP(10), MP(11), TP(12). (All PEs active.)
- Phase 2.3: Level 3 computations: BP(13), MP(14), TP(15). (All PEs active.)
- Phase 2.4: Level 4 computations: BP(16), MP(17), TP(18). (All PEs active.)
- Phase 2.4a: Computation of $k_5$ (divider active).
- Phase 2.5: Level 5 computations: MP(19), TP(20). (BP idle, MP and TP active.)
- Phase 2.5a: Computation of $k_6$ (divider active).
- Phase 2.6: Level 6 computations: TP(21). (BP and MP idle, TP active.)
- Phase 2.6a: Computation of $k_7$ (divider active).

Partition 3 computations:
- Phase 3.1: Level 1 computations: BP(22). (BP active, MP and TP idle.)
  ...
- Phase 3.6: Level 6 computations: BP(27). (BP active, MP and TP idle.)
- Phase 3.7: Level 7 computations: BP(28). (BP active, MP and TP idle.)
- Phase 3.7a: Computation of $k_8$ (divider active).

Fig. 6. Functional diagram for the implementation of the superlattice with three processors.

From the above description it is clear that, in each partition, the PEs that work in parallel compute BBCs of the same level. If n is the number of the available PEs, then, at the last n levels of each partition, more and more of the processors remain idle. Moreover, in the computation of the last partition only some of the PEs may be active (e.g., only BP in Fig. 5). The number a of the active processors of the final partition, as well as the necessary number $n_p$ of partitions, can be computed from the system order p and the number n of the available processors as follows:

$a \leftarrow (p-1) \bmod n$,
$n_p \leftarrow [(p-1)/n]$,
if $a \ne 0$ then $n_p \leftarrow n_p + 1$, else $a \leftarrow n$,

where [ ] indicates the integer part of the argument in the brackets. It is evident that, if the number of PEs n is greater than or equal to the system order p, then only one partition is necessary and the computations are completed in a single parallel pass. In that case, of course, there are many idle processor states, but this cannot be avoided. It is the price
one has to pay for increased speed. As p increases compared to n, $n_p$ becomes greater than 1 and more partitions are needed. The algorithm, however, is still much faster than the conventional sequential schemes: one gets the best out of the available number of processors. It is worthwhile mentioning that, if p is much greater than n, the number of idle processor states is small compared to the corresponding active states. At its two extremes, the partitioned-parallel algorithm is equivalent to
- the sequential, order recursive algorithm described in [6], when n = 1, i.e., when there is only one PE, and
- the fully parallel algorithm, described in [6, 7], when n >= p, i.e., when the number of available PEs is greater than or equal to the system order.
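The a/$n_p$ bookkeeping above, and its behaviour at the two extremes, can be checked with a few lines. This is a sketch (the function name is ours); the LBC-based scheme presented in the next pages uses p in place of p - 1.

```python
def partition_plan(p, n):
    """Partitions n_p and active PEs a in the final partition (BBC case)."""
    a = (p - 1) % n
    n_p = (p - 1) // n                   # [.] : integer part
    if a != 0:
        n_p += 1
    else:
        a = n
    return n_p, a

print(partition_plan(8, 3))   # (3, 1): the p = 8, n = 3 example of Figs. 4-5
print(partition_plan(8, 8))   # (1, 7): n >= p - 1, a single parallel pass
```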
Figure 6 shows a possible functional diagram for the realization of the structure of Fig. 5 with the three PEs BP, MP and TP. The implementation philosophy discussed so far has been based on the BBC. However, the same philosophy stands if the 'lattice basic cell', or LBC, is selected as the elementary building block of the superlattice structure. The LBC is shown in Fig. 7, while Fig. 8 illustrates the relationship between the BBC and the LBC: within each pair of adjacent LBCs, a single BBC resides.

Fig. 7. The lattice basic cell (LBC).

Fig. 8. The relationship between the BBC and the LBC.

The realization of the LBC-based superlattice can be easily achieved if suitable extensions are made (Fig. 9). The upper extension computes quantities needed for the linear predictor of the next order, while the lower extension computes quantities which are equal to zero (as can easily be seen from the Yule-Walker or normal equations). These extensions may seem redundant, but they permit a uniform realization of the LBC-based superlattice.

Fig. 9. Realization of the superlattice through the use of LBCs.

In complete analogy to the BBC case, the LBC can be used for the implementation of the superlattice
- in a sequential (even order recursive) manner,
- in a fully parallel manner, and
- in a partitioned-parallel manner.
The discussion here will be restricted to the most interesting case, the partitioned-parallel implementation. Due to the LBC interconnections, the length of the necessary bordering buffer is only half of what is needed in the BBC case, as Fig. 10 clearly shows. The complete LBC partitioning scheme is depicted in Fig. 11 for a system of order p = 8 when the number of available processors is n = 3. The number of necessary partitions $n_p$ and the number a of active processors of the final partition can be computed as follows:

$a \leftarrow p \bmod n$,
$n_p \leftarrow [p/n]$,
if $a \ne 0$ then $n_p \leftarrow n_p + 1$, else $a \leftarrow n$.
Fig. 10. Relative sizes of the BB in the BBC and LBC cases.

Fig. 11. LBC partitioning scheme for p = 8 and n = 3.

A signal flow diagram for the various LBCs of the superlattice of Fig. 11 can be seen in Fig. 12. Again, the partitions into which the superlattice is cut are clearly shown. This time the dark lines are crossed only by e lines, which feed the half-length BB. Phases of computation, analogous to the ones described for the BBC case, can be easily conceived. Figure 13 shows a possible functional diagram for the realization of the LBC-based partitioned-parallel algorithm with the minimum number of three LBCs.

Fig. 12. A signal flow diagram for the LBC partitioning.

Fig. 13. A possible functional diagram for the LBC partitioning.

The partitioned-parallel organization of the superlattice structure, either BBC- or LBC-based, can be easily and efficiently implemented with cheap, commercially available signal processors, or even with general purpose processors. So, configurations with a small number of processors, i.e. three to six, can be used for many applications of practical interest, with very good results and
substantial increase in processing speed over that of prior, single-processor implementations. However, an important decision which has to be made, both for the LBC and the BBC case, concerns the divider circuitry. If a VLSI circuit is being designed, one divider is enough for the whole superlattice. Otherwise, when commercial processors are used, the division can be performed by any one of these processors. So the divider can either exist physically as a discrete hardware element, or it can be a simple routine for each processor.
4. Superlattice reduction and the sparse structures

The superlattice can be simplified. This issue will be discussed in this section.
It is possible to construct a new minimal structure, the sparse structure, with no lattice-type cells. This structure is derived from the split Levinson recursion [3, 4] in the same way the superlattice is derived from the Levinson recursion. So, this minimal structure for the solution of Toeplitz systems involves three consecutive orders m - 1, m and m + 1, i.e., one needs the values of the filters (and/or the predictors) of orders m - 1 and m to compute the filter of order m + 1 (and/or the corresponding predictor). The superlattice, on the other hand, involves only the two consecutive orders m and m - 1. It can be shown that the dynamic range of the intermediate quantities of the sparse structure is only twice that of the corresponding quantities of the conventional superlattice. So the sparse structure can be used for finite word length implementations, exactly like the superlattice, but with words which are one bit longer.

There are two alternative realizations of the sparse structure: the additive and the subtractive one. In the additive structure the various intermediate quantities are the sums of certain quantities of the superlattice, while in the subtractive structure they are the differences of the same quantities. Strictly speaking, the subtractive sparse structure is the minimal structure for Toeplitz solving. However, both realizations allow important physical interpretations, since it can be shown that the additive structure is associated with the linear phase problem and the subtractive one with the constant group delay problem.

It is useful at this point to consider a simplified version of the superlattice, which emphasizes the communication among the BCs. Consider the structure of Fig. 9. Suppose that a black box with two inputs and two outputs is substituted for each lattice basic cell (LBC), as is shown in Fig. 14. The resulting scheme shows the communication links among the various LBCs (boxes) only, without any consideration of what is happening inside each box. This communication structure, pertinent to Toeplitz solving, will be called hereon the superlattice communication network.

Fig. 14. The superlattice communication network.

In the following, an analytical formula will be derived which will be the basis of the sparse structure and its basic cell. Suppose that one defines

$\psi_{-2}^3 = \zeta_{-2}^2 + \zeta_5^2.$  (13)

Then, with the use of the lattice recursions

$\zeta_{-2}^2 = \zeta_{-2}^1 + k_2 \zeta_4^1,$  (14)

$\zeta_5^2 = \zeta_5^1 + k_2 \zeta_{-3}^1,$  (15)

it follows that

$\psi_{-2}^3 = \zeta_{-2}^1 + k_2\zeta_4^1 + \zeta_5^1 + k_2\zeta_{-3}^1 = (\zeta_5^1 + \zeta_{-3}^1) + (k_2 - 1)\zeta_{-3}^1 + (\zeta_{-2}^1 + \zeta_4^1) + (k_2 - 1)\zeta_4^1.$  (16)

If one defines

$\psi_{-2}^2 = \zeta_{-2}^1 + \zeta_4^1,$  (17)

$\psi_{-3}^2 = \zeta_{-3}^1 + \zeta_5^1,$  (18)

(16) can be written as

$\psi_{-2}^3 = \psi_{-2}^2 + \psi_{-3}^2 + (k_2 - 1)\zeta_{-3}^1 + (k_2 - 1)\zeta_4^1.$  (19)

Since

$\zeta_{-3}^1 = r_{-3} + k_1 r_4 = r_3 + k_1 r_4,$  (20)

$\zeta_4^1 = r_4 + k_1 r_{-3} = r_4 + k_1 r_3,$  (21)

(19) gives

$\psi_{-2}^3 = \psi_{-2}^2 + \psi_{-3}^2 + (k_2 - 1)(k_1 + 1)\psi_{-3}^1,$  (22)

where

$\psi_{-3}^1 = \zeta_4^0 + \zeta_{-3}^0 = r_3 + r_4.$  (23)

This result can be generalized for any order. It will be shown that, if one defines

$\psi_i^{m+1} = \zeta_i^m + \zeta_{m+1-i}^m,$  (24)

it stands that

$\psi_i^{m+1} = \psi_i^m + \psi_{i-1}^m + (k_m - 1)(k_{m-1} + 1)\psi_{i-1}^{m-1}.$  (25)

Using the basic lattice recursion (or the Schur recursive formula), the two quantities of the right-hand side of (24) can be written as

$\zeta_i^m = \zeta_i^{m-1} + k_m \zeta_{m-i}^{m-1},$  (26)

$\zeta_{m+1-i}^m = \zeta_{m+1-i}^{m-1} + k_m \zeta_{i-1}^{m-1}.$  (27)

Consequently,

$\psi_i^{m+1} = (\zeta_i^{m-1} + \zeta_{m-i}^{m-1}) + (k_m - 1)\zeta_{m-i}^{m-1} + (\zeta_{m+1-i}^{m-1} + \zeta_{i-1}^{m-1}) + (k_m - 1)\zeta_{i-1}^{m-1}$  (28)

or

$\psi_i^{m+1} = \psi_i^m + \psi_{i-1}^m + (k_m - 1)(\zeta_{m-i}^{m-1} + \zeta_{i-1}^{m-1}).$  (29)

But

$\zeta_{m-i}^{m-1} = \zeta_{m-i}^{m-2} + k_{m-1}\zeta_{i-1}^{m-2}$  (30)

and

$\zeta_{i-1}^{m-1} = \zeta_{i-1}^{m-2} + k_{m-1}\zeta_{m-i}^{m-2}.$  (31)

So, by substituting (30) and (31) into (29), one can get

$\psi_i^{m+1} = \psi_i^m + \psi_{i-1}^m + (k_m - 1)(k_{m-1} + 1)(\zeta_{m-i}^{m-2} + \zeta_{i-1}^{m-2})$  (32)

and, since it stands that

$\zeta_{m-i}^{m-2} + \zeta_{i-1}^{m-2} = \psi_{i-1}^{m-1},$  (33)

it follows that

$\psi_i^{m+1} = \psi_i^m + \psi_{i-1}^m + (k_m - 1)(k_{m-1} + 1)\psi_{i-1}^{m-1}.$  (34)

If one defines

$\lambda_m^+ = -(k_m - 1)(k_{m-1} + 1),$  (35)

then (34) can be written as

$\psi_i^{m+1} = \psi_i^m + \psi_{i-1}^m - \lambda_m^+ \psi_{i-1}^{m-1}.$  (36)

Fig. 15. The sparse basic cell (SBC).

Figure 15 illustrates the sparse BC (or SBC) which results from (36). The SBC uses a single adder and a single multiplier only. Based on the SBC, one can construct a whole new computational structure, shown in Fig. 16, which has already been called the sparse organization or sparse structure. In fact, this
is the additive sparse superlattice. The dividing circuitry appears at the lower part of the structure. However, it remains to be shown that the new structure can still produce the reflection coefficients. For that, we define
$\frac{\psi_0^4}{\psi_0^3} = \frac{\zeta_0^3 + \zeta_4^3}{\zeta_0^2 + \zeta_3^2} = \frac{\alpha_3(1 - k_4)}{\alpha_2(1 - k_3)} = (1 - k_4)(1 + k_3) = \lambda_4^+,$  (37)

where

$\alpha_3 = \alpha_2(1 - k_3)(1 + k_3),$  (38)

i.e., the well-known formula for the computation of the total squared error in the linear prediction case, has been used. After writing (37) for an arbitrary order m, one can solve for $k_m$:

$k_m = 1 - \lambda_m^+/(1 + k_{m-1}).$  (39)

Equation (39) is a recursive formula for the recovery of the reflection coefficients from the quantities $\lambda_m^+$, i.e., from the sums of certain intermediate superlattice quantities. Formula (37), written for an arbitrary order m, provides another definition for $\lambda_m^+$:

$\lambda_m^+ = \psi_0^m/\psi_0^{m-1}.$  (40)

So, if the quantities $\psi_0^m$ and $\psi_0^{m-1}$ are available, $\lambda_m^+$ can easily be obtained with a single division. The predictor coefficients, on the other hand, can be computed either from the reflection coefficients, or directly from the quantities $\lambda_m^+$.
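To show that (36), (39) and (40) indeed reproduce the PARCORs, here is a small numerical sketch. Two conventions in it are our reconstructions rather than statements of the paper: the order-0 row is seeded as $\psi_i^0 = r_i$ (which makes the m = 1 step of (36) agree with the worked example (44) below), and the starting parameter is taken as $\lambda_1^+ = 2(1 - k_1)$, cf. (45) below.

```python
def additive_sparse_parcors(r, p):
    """PARCORs via the additive sparse recursion (36) with recovery (39)-(40)."""
    R = lambda i: r[abs(i)]                                   # r_{-i} = r_i
    psi_prev = {i: R(i) for i in range(-(p - 1), 1)}          # psi_i^0 := r_i (assumed)
    psi = {i: R(i) + R(1 - i) for i in range(-(p - 1), 1)}    # psi_i^1 = r_i + r_{1-i}
    k = [-r[1] / r[0]]                                        # k_1, computed directly
    lam = 2.0 * (1.0 - k[0])                                  # lambda_1^+
    for m in range(1, p):
        # (36): one multiplication and two additions per sparse basic cell
        nxt = {i: psi[i] + psi[i - 1] - lam * psi_prev[i - 1]
               for i in range(-(p - 1 - m), 1)}
        lam = nxt[0] / psi[0]                                 # (40): lambda_{m+1}^+
        k.append(1.0 - lam / (1.0 + k[-1]))                   # (39): k_{m+1}
        psi_prev, psi = psi, nxt
    return k
```

On r = [1, 0.5, 0.2] this returns the same k_1 = -0.5 and k_2 = 1/15 as the Durbin and Schur sketches of Section 2.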
A comparison of the structure of Fig. 16 with the superlattice communication network of Fig. 14 shows that the sparse structure is a kind of simplified superlattice: instead of a lattice box, an adder is used, while the multiplication, normally done inside the lattice cell, is done outside, after creating a new horizontal communication path between any two successive sections.

Fig. 16. The additive sparse structure which solves the linear prediction problem.

The structure of Fig. 16 is the first of two possible sparse realizations. It has already been mentioned that this structure is called additive, since the quantities $\psi_i^m$ are the sums of the intermediate quantities of the original superlattice:

$\psi_i^{m+1} = \zeta_i^m + \zeta_{m+1-i}^m.$  (41)
Another organization, the subtractive sparse superlattice, involves differences instead of sums:

$\phi_i^{m+1} = \zeta_i^m - \zeta_{m+1-i}^m.$  (42)

Figure 17 shows this organization for a model of order 10.

Fig. 17. The subtractive sparse superlattice of a model order p = 10.

The subtractive scheme is somewhat simpler than the additive one. This is due to the fact that no $\lambda_1^-$ must be defined, since

$\phi_{-2}^2 = \zeta_{-2}^1 - \zeta_4^1 = \zeta_{-2}^0 + k_1\zeta_3^0 - \zeta_4^0 - k_1\zeta_{-3}^0 = \zeta_{-2}^0 - \zeta_4^0 = (r_2 - r_3) - (r_4 - r_3).$  (43)

No multiplier is needed for the computations of (43). On the contrary, the corresponding quantity for the additive structure is

$\psi_{-2}^2 = \zeta_{-2}^1 + \zeta_4^1 = \zeta_{-2}^0 + k_1\zeta_3^0 + \zeta_4^0 + k_1\zeta_{-3}^0 = r_2 + k_1 r_3 + r_4 + k_1 r_3,$  (44)

so that a $\lambda_1^+$ must be defined:

$\lambda_1^+ = -2(k_1 - 1).$  (45)

Table 1 summarizes the basic relationships governing both the superlattice and its sparse forms. New parameters, the $\lambda_m^+$ and $\lambda_m^-$, have been introduced for the sparse structures. These parameters are in one-to-one correspondence with the $k_i$ and $\alpha_i$, so that both the reflection and the predictor coefficients can be computed from them. Moreover, since the reflection coefficients are absolutely bounded by 1, the quantities $\lambda_m^+$ and $\lambda_m^-$ are also bounded:

$0 \le \lambda_m^+ \le 4,$  (46)

$0 \le \lambda_m^- \le 4.$  (47)

Table 1. Basic relationships

Definition of the superlattice intermediate quantities: $\zeta_i^m = r_i + a_m^T r_{i,m}$, with $r_{i,m} = [r_{i-1}\ r_{i-2}\ \cdots\ r_{i-m}]^T$
Basic lattice recursion: $\zeta_i^m = \zeta_i^{m-1} + k_m \zeta_{m-i}^{m-1}$
Additive sparse form, definition of the intermediate quantities: $\psi_i^{m+1} = \zeta_i^m + \zeta_{m+1-i}^m$
Basic additive sparse recursion: $\psi_i^{m+1} = \psi_i^m + \psi_{i-1}^m - \lambda_m^+ \psi_{i-1}^{m-1}$
Definition of the parameters of the additive sparse form: $\lambda_{m+1}^+ = (1 - k_{m+1})(1 + k_m) = \psi_0^{m+1}/\psi_0^m$
Subtractive sparse form, definition of the intermediate quantities: $\phi_i^{m+1} = \zeta_i^m - \zeta_{m+1-i}^m$
Basic subtractive sparse recursion: $\phi_i^{m+1} = \phi_i^m + \phi_{i-1}^m - \lambda_m^- \phi_{i-1}^{m-1}$
Definition of the parameters of the subtractive sparse form: $\lambda_{m+1}^- = (1 + k_{m+1})(1 - k_m) = \phi_0^{m+1}/\phi_0^m$
Other relationships between the superlattice and its two sparse forms: $\zeta_i^m = (\psi_i^{m+1} + \phi_i^{m+1})/2$, $\zeta_{m+1-i}^m = (\psi_i^{m+1} - \phi_i^{m+1})/2$
Fig. 18. The partitioned-parallel technique applied to the subtractive sparse superlattice.

Figure 18 shows the way to partition the sparse structure. The technique used for the conventional superlattice applies here as well: the partitions are executed one after the other, but in each partition
more than one adder and multiplier work concurrently. In Fig. 18, the partitions have been chosen in such a way as to use the three available adders and three multipliers. The signal flow diagram of Fig. 19 can be used both for the definition of the various phases of the sparse superlattice computations and for the conception of a dedicated hardware organization for the computation of the $\lambda_m^-$ coefficients. This
organization is shown in Fig. 20, where three adders, three multipliers and one divider are used. However, additional circuitry (or some more operational phases of the hardware of Fig. 20) must be used for the recovery of the PARCORs from the $\lambda_m^-$. Similar hardware can be obtained for the additive superlattice, but the details of such an implementation will not be presented here.
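The subtractive path can be exercised the same way. In the sketch below (with the same reconstructed conventions as the additive sketch of the previous section, and our own function name), the first step uses no multiplier at all, exactly as (43) showed, and the PARCORs return through $k_{m+1} = \lambda_{m+1}^-/(1 - k_m) - 1$, the subtractive counterpart of (39).

```python
def subtractive_sparse_parcors(r, p):
    """PARCORs via the subtractive sparse recursion of Table 1, for p >= 2."""
    R = lambda i: r[abs(i)]
    phi_prev = {i: R(i) - R(1 - i) for i in range(-(p - 1), 1)}   # phi_i^1
    # first step: no lambda_1^- is needed, cf. (43)
    phi = {i: phi_prev[i] + phi_prev[i - 1] for i in range(-(p - 2), 1)}
    k = [-r[1] / r[0]]                           # k_1, computed directly
    lam = phi[0] / phi_prev[0]                   # lambda_2^-
    k.append(lam / (1.0 - k[0]) - 1.0)           # k_2
    for m in range(2, p):
        nxt = {i: phi[i] + phi[i - 1] - lam * phi_prev[i - 1]
               for i in range(-(p - 1 - m), 1)}
        lam = nxt[0] / phi[0]                    # lambda_{m+1}^-
        k.append(lam / (1.0 - k[-1]) - 1.0)      # k_{m+1}
        phi_prev, phi = phi, nxt
    return k
```

On r = [1, 0.5, 0.2] it again yields k_1 = -0.5 and k_2 = 1/15, matching the additive sketch.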
Fig. 19. A signal flow diagram for the sparse superlattice structure.
Fig. 20. A possible functional diagram for the subtractive sparse structure.
It can easily be shown that sparse-type structures can be used to solve the following problems:
- the computation of the predictor coefficients from the PARCORs, and
- the computation of an optimal FIR filter, in a way similar to the one developed in [6].
5. Conclusion

In this paper it is shown that there exists a variety of partitioned-parallel computational organizations for the fast derivation of the PARCOR coefficients with the efficient use of more than one processing element. Three organizations of that kind have been studied in detail: the BBC-based and the LBC-based partitioned superlattices and the subtractive sparse structure (which is the minimal known scheme). For the latter case, a simple methodology for the derivation of the sparse structure from the superlattice has been presented. Further research is focused on the application of superlattice-type computational organizations to time and space recursive schemes, as well as to order recursive methods involving near-to-Toeplitz matrices.

References
[1] D. Alpay, P. Dewilde and H. Dym, "On the existence and construction of solutions to the partial lossless inverse scattering problem with applications to estimation theory", IEEE Trans. Informat. Theory, Vol. IT-35, No. 6, November 1989.
[2] G. Carayannis, E. Koukoutsis, D. Manolakis and C.C. Halkias, "A new look on the parallel implementation of the Schur algorithm for the solution of Toeplitz equations", Proc. IEEE Internat. Conf. Acoust. Speech Signal Process., Florida, USA, March 1985, pp. 1858-1861.
[3] P. Delsarte and Y. Genin, "The split Levinson algorithm", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-34, June 1986, pp. 470-478.
[4] P. Delsarte and Y. Genin, "On the splitting of classical algorithms in linear prediction theory", IEEE Trans. Acoust. Speech Signal Process., Vol. ASSP-35, May 1987, pp. 645-653.
[5] P. Dewilde, A.C. Vieira and T. Kailath, "On a generalized Szegö-Levinson realization algorithm for optimal linear predictors based on a network synthesis approach", IEEE Trans. Circuits and Systems, Vol. CAS-25, No. 9, September 1978, pp. 663-675.
[6] E. Koukoutsis, G. Carayannis and C.C. Halkias, "Superlattice/superladder computational organization for linear prediction and optimal FIR filtering", IEEE Trans. Acoust. Speech Signal Process., October 1991.
[7] S.Y. Kung and Y.H. Hu, "A highly concurrent algorithm and pipelined architecture for solving Toeplitz systems", IEEE Trans. Acoust. Speech Signal Process., Vol. 31, No. 1, February 1983, pp. 66-76.
[8] J. Le Roux and C.J. Gueguen, "A fixed point computation of partial correlation coefficients", IEEE Trans. Acoust. Speech Signal Process., Vol. 25, No. 3, June 1977, pp. 257-259.