Parallel Computing 17 (1991) 493-503 North-Holland
493
Systolic algorithm for polynomial interpolation and related problems H. S c h r o d e r
a, V.K. M u r t h y b a n d E.V. K r i s h n a m u r t h y c
Department of Electrical Engineering and Computer Science, University of Newcastle, Newcastle, N-S. W. 2308, Australia b Fujitsu Australia Ltd., Sydney, Australia Computer Science Laboratory, Research School of Physical Sciences, Australian National University, Canberra, A C T 2601. Australia
Received August 1989 Revised June 1990
Abstract
Schroder, H., V_K. Murthy and E_V. Krishnamurthy, Systolic algorithm for polynomial interpolation and related problems, Parallel Computing 17 (1991) 493-503. This paper describes a systolic algorithm for interpolation and evaluation of polynomials over any field using a linear array of processors. The periods of these algorithms are O(n) for interpolatin and O(1) for evaluation. This algorithm is readily adapted for Chinese remaindering, easily generahzed for the multivariable interpolation and can be extended for rational interpolation to produce Pade approximants. The instruction systolic array implementation of the algorithm is presented here. Keywords. Chinese remaindering; evaluation; interpolation; polynomials; instruction systolic arrays; Lagrange
interpolant; Newton interpolation; period of algorithm; quorum security locks; systolic algorithms; systolic archi tectures.
1. Introduction Systolic a l g o r i t h m s are highly s u i t a b l e for i m p l e m e n t a t i o n in a r e g u l a r n e t w o r k o f processors, w h e r e each p r o c e s s o r has a s i m p l e i n s t r u c t i o n set a n d has a p r o v i s i o n for local c o m m u n i c a t i o n a m o n g the n e i g h b o u r i n g processors. I n this p a p e r , we d e s c r i b e how the classical p o l y n o m i a l i n t e r p o l a t i o n / e v a l u a t i o n over a n y field a n d the r e l a t e d C h i n e s e r e m a i n d e r i n g p r o b l e m can be solved using the systolic a p p r o a c h . T h e c o n v e n t i o n a l systolic a r r a y a r c h i t e c t u r e (SA) lacks flexibility to execute a class of a b s t r a c t a l g o r i t h m s in which the i n p u t s a n d o u t p u t s b e l o n g to d i f f e r e n t d a t a d o m a i n s . T o i m p r o v e this s i t u a t i o n the c o n c e p t of I n s t r u c t i o n Systolic A r r a y s ( I S A ) d u e to L a n g [3] is very useful; the I S A retains all the a d v a n t a g e s of systolic a r c h i t e c t u r e s yet p e r m i t s the e x e c u t i o n of d i f f e r e n t closely related a l g o r i t h m s d e f i n e d u n d e r d i f f e r e n t d a t a d o m a i n s using the s a m e p r o c e s s o r array. In the ISA, a s e q u e n c e of i n s t r u c t i o n s called the T o p P r o g r a m (TP), a n d an o r t h o g o n a l s e q u e n c e of B o o l e a n selectors called the Left P r o g r a m (LP) are p u m p e d t h r o u g h the m e s h c o n n e c t e d a r r a y o f p r o c e s s o r s (unlike SA, where o n l y the d a t a is p u m p e d t h r o u g h the processors). If an i n s t r u c t i o n meets the selector bit '1' at the processor, then the i n s t r u c t i o n is e x e c u t e d in that p r o c e s s o r ; otherwise, (i.e. it meets a ' 0 ' bit) the i n s t r u c t i o n is n o t e x e c u t e d in 0167-8191/91/$03.50 © 1991 - Elsevier Science Publishers B_V. (North-Holland)
H. Schroder et al.
494
that processor leaving the contents of its registers unchanged. Thus an ISA program is a 2-tuple (TP, LP); data is supplied to ISA from external data queues, which can be read during the execution. Hence one can execute different programs on the same architecture simply by pumping in different set of instructions; also the flexibility of ISA is enhanced by the ability to inhibit instructions at certain processors using selectors. These two features make ISA a programmable systolic architecture. Also in contrast to the mesh-connected processor array which consists of independent processors each with its own memory and program storage, the ISA processors have no local memory for program storage; hence they can be realised in a smaller chip area.
2. Polynomial interpolation/Chinese remaindering The polynomial interpolation over a field (real or finite) consists in finding the (n + 1) coefficients of an nth degree polynomial in the specified field, given the functional values at (n + 1) distinct points; obviously the values of the polynomial evaluated at these points should coincide. A closely related problem is the computation of the solution to a system of linear congruences over the integers based on the Chinese remainder theorem, see Krishnamurthy [1]. Both these algorithms have tremendous applications in coding theory, design of q u o r u m security locks, see Shamir [7], McEliece and Sarvate [4] and exact computation [1]. From an algebraic point of view the Chinese remaindering and polynomial interpolation problems are equivalent. Given a set of remainders (residues) ( r 0, r 1. . . . . r } with respect to a set of moduli { P0, P~ . . . . . p,, } which are pairwise relatively prime, both problems reconstruct an element r such that ri = r rood Pi for i = 0, 1 . . . . . n in respective "domains; r is uniquely determined if n
size r < size F I p, = M, i=0 where size is suitably defined as follows: for integers ' t h e size' is the magnitude and for polynomials 'the size' is the degree. Element r is defined as r/
r = ~_, ( M / p i ) r . T i mod M, i=O
where T, is the solution ( M / p i ) T ~ I mod Pi. The equivalence between the Chinese remaindering algorithm and the Lagrange polynomial interpolation is readily seen from the following correspondence [1]: =
r ( x ) = n th degree polynomial in a field F r(x,) = r
( i = 0 , 1 . . . . . n)
p,(x) = ( x - x , )
r ( x ) mod p , ( x ) = r , n
M = 1-I ~=0
The Lagrange interpolant is obtained from
L.(x)=
°
~=0
i5
k=0
Systolic algorithm for polynomial interpolation
495
where Ti ~
/1
1
= [M/pi(x)]
-1 m o d ( x - x , )
FI (x,- xk)
k=0 (Mand
p i ( x ) are d e f i n e d a b o v e ) ,
k4:i.
Since in d o i n g p o l y n o m i a l i n t e r p o l a t i o n , we restrict ourselves to linear p o l y n o m i a l s ( x - x , ) (i = 0 . . . . . n), the r e m a i n d e r or r e s i d u e c o m p u t a t i o n for a p o l y n o m i a l r ( x ) is e q u i v a l e n t to r(x)
m o d ( x - x i ) = ri b y the r e m a i n d e r t h e o r e m .
Thus, the r e c o n s t r u c t i o n of an integer r in the r a n g e 0 ~< r ~< M - 1 from a set of r e s i d u e s {to, rl . . . . . 6,} with respect to a set of p r i m e s ( p 0 , P~ . . . . . p , } a n d the r e c o n s t r u c t i o n of a p o l y n o m i a l r ( x ) o f degree a t m o s t n f r o m a set of residues ( r o, r 1. . . . . r,, } with respect to a set of d i s t i n c t linear p o l y n o m i a l s ( p o ( X ) , p ~ ( x ) . . . . . p , ( x ) ) can be a c h i e v e d b y i d e n t i c a l alg o r i t h m s for s u i t a b l y d e f i n e d d a t a d o m a i n s . W e now f o r m a l l y state the s e q u e n t i a l a l g o r i t h m . 2.1. S e q u e n t i a l algorithm
T h e sequential a l g o r i t h m given b e l o w takes as i n p u t s residues { r o, r I . . . . . r,} a n d m o d u l i {Po, Pl . . . . . p,,}. F o r integers, r is r e c o n s t r u c t e d such that r rood p , = r i where r is over m o d u l o M , M = Fl~'=op ,. F o r p o l y n o m i a l s , r ( x ) is r e c o n s t r u c t e d such that r ( x ) m o d pi = r i where p, = x - x,; here, r ( x ) is a n t h degree p o l y n o m i a l over field F. Input: residues r i and moduli Pi (i=O, Output: r e c o n s t r u c t e d element r
1,...,
n).
M:=I; R:=r0; for k=l begin
to n do
M:=Mpk_I; u : = M -I mod Pk; d:=(rk-R)u R:=R+dM
rood Pk;
end; r:=R; This a l g o r i t h m expresses r in the form r = d o + d~ p o + - - - + d , P o . .. P , - 1 , where d o = r0. F o r integers, r is in m i x e d - r a d i x f o r m where d , ' s are the m i x e d - r a d i x digits a n d 0 ~< d, ~
3. Parallel/distributed algoritlm for Chinese remaindering/interpolation I n the d e s c r i p t i o n of this p a r a l l e l a l g o r i t h m , we d o n o t c o n s t r a i n ourselves to a p a r t i c u l a r p a r a l l e l c o m p u t a t i o n a l model. S u i t a b l e m o d i f i c a t i o n s , however, can b e easily i n c o r p o r a t e d to suit a p a r t i c u l a r c o m p u t a t i o n a l model. Its I S A i m p l e m e n t a t i o n will be d e s r i b e d later.
496
H. Schroder et al.
This algorithm takes as inputs residues {ro, r] . . . . . r, } and moduli { Po, P~ . . . . . integers, r is reconstructed such that r m o d Pi = r where r is over m o d u l o M,
p,}. F o r
t/
M=
I--[ Pi-
i=0
F o r polynomials, r(x) is reconstructed such that r(x) m o d p, = r i where pi(x)= x - x~; here, r(x) is an nth-degree polynomial over field F. We assume that there are n + 1 processors P R O ( i ) (i = 0, 1 . . . . . n), each with five registers R~, D~, S i, P~ and Mi. The operations subtraction, multiplication and inversion are carried out in the appropriaie field for each processor. Register R~ is initialized with r and register P~ with p~ (for polynomials, register Xi is initialized with x,.). At the end of the algorithm, register D r in each processor contains the coefficient d~, where d o = r 0. We can then reconstruct r using D,. I n p u t : r e s i d u e s r i and e o d u l i Output: reconstructed element For i = O t o n do begin
Pi(i=O, r
1,...,
n).
Ri:=ri; Pi:=Pi; end; For j=O begin
to n do
Dj :=Rj; For s = j + l begin
to n in parallel
do
Ss:=Rj; R s : = Rs - Ss; M s : = P ; 1 rood Ps; R s:=RsM s mod Ps;
end; end;
n
j- 1
Then r: = D O+ ~ Dj H P~" j=l
s=O
In the case of polynomials, we have Pj = ( x statements in the above algorithm then are:
Xj), Ps = ( x -
Xs). The last three assignment
Ms:=(X-Xj)-' ; R,:=R~M,; j--I
r=Do+
~ j=l
DjI-Ix-X,. s=O
Remarks 1. In the last iteration of j, when j is n, ' f o r s = j + 1 to n in parallel do begin' is not executed since j + 1 equals n + 1, and exceeds the limit n. 2. C o m p u t i n g r using the above formula can be deferred until one wants to evaluate the polynomial at some point.
3.1. Proof of algorithm We now provide a p r o o f of correctness of the algorithm.
Systolic algorithm for polynomial interpolation
497
3.1.1. Chinese remaindering In order to reconstruct the integer r we assume that it is expressed uniquely in the mixed-radix form r=do+dlpo+
"
+ d , P o . . . P , _ 1,
whereO<~dg<~p~-l(i=O,
1 ..... n)
and d~'s are the mixed-radix digits. This algorithm essentially determines the d,'s using r, and p,. F o r convenience, let R k = d o + d~p o + . . . + d k p o . . . Pk-a. We then have r mod p0=r0=do r rood P l = r l = d 0
+dip0
r rood P k = r k = R k r mod p =r,,=r. Thus, d 1 = (r I - d o )
p0-1 m o d Pl- Similarly,
d 2 = ( r 2 - ( d o + d l p o ) ) ( p o p a ) -1 m o d Pz or
d 2 = (r2- Rl)(popl)
-1 m o d P2,
and in general d~. = ( rk -
Rk_,)( p o... Pk-1)-'
m o d Pk-
N o t e that r < M = l-IT=0pi; thus, d~ = 0 for i >/n + 1. 3.1.2. Polynomial interpolation F o r polynomials, we replace r by r ( x ) . W e assume that r ( x ) is an n t h degree polynomial expressed uniquely by the N e w t o n ' s Interpolation formula, r(x)=do+d,(x-xo)+
... +d,(x-Xo)...(x-x,_,).
T h e above p r o o f can be directly extended i>~n+l.
to polynomial interpolation. Thus,
d~ = 0 for
3.2. Polynomial evaluation
T h e N e w t o n ' s interpolation formula is very convenient for polynomial evaluation. Since r ( x ) = d o + d l ( x - Xo) + . . . + d , ( x Xo)...(xx , _ l ) we can evaluate r ( x ) f r o m the Newton's coefficients di stored in the registers by using H o m e r ' s nested multiplication rule.
4. Examples Chinese remaindering
Let Pi = {5, 7, 11, 13}, r i = {1, 5, 9, 11}. Table I illustrates the process of reconstructing an in teger.
H. Schroder et al.
498 Table 1 Chinese remaindering
r,
Po
PI
/'2
/'3
rood 5
rood 7
m o d 11
r o o d 13
1
P
di
5
9
11
S,:= Rj
1
1
1
Subtract
4
8
10
M=5
3
9
8
5
6
2
Ss := g j Subtract
5 1
5 10
M=7
8
2
d,=8
8
7
-
-I
M/R s
-1
M~R~
s~:= Rj
8
Subtract
12
M =ll -I M.R~
d o =1
d 1=5 1+5.5+8-5.7 +7-5-7.11 = 3001
6 7
d3=7
4.2. Single variable interpolation - f i n i t e f i e M
L e t x 0 = 1 , r0 = 8 ; x~=2, rt=3; x z = 3 , r2 = 0 ; x 3 = 4 , r3 = 1 0 . r e c o n s t r u c t i o n o f the p o l y n o m i a l o v e r the p r i m e field m o d u l o 11.
Table 2 i l l u s t r a t e s t h e
4.3. Single variable interpolation - real f i e l d
L e t x 0 = 1, r o = 0.5; x t -- 2, r t = 2.5; x 2 = 3, r z = 6.5; x 3 = 0.5, r 3 = 0.25. Table 3 i l l u s t r a t e s the r e c o n s t r u c t i o n o f the p o l y n o m i a l o v e r the real field.
5. ISA implementation of polynomial interpolation/evaluation W e n o w c o n s i d e r the i m p l e m e n t a t i o n o f the a b o v e i n t e r p o l a t i o n a l g o r i t h m o n t h e I S A , u s i n g a s y s t e m a t i c t o p - d o w n d e s i g n t e c h n i q u e . T h e t i m e c o m p l e x i t i e s for the I S A p o l y n o m i a l i n t e r p o l a t i o n a n d e v a l u a t i o n p r o g r a m s are i n d i c a t e d .
Table 2 Single v a r i a b l e i n t e r p o l a t i o n
r, S~ := R j Subtract
( X - X o ) -1
P0 xo = 1
Pi x 1= 2
P1 x2
P3 x3= 4
8
3 8
0 8
10 8
6
3
2
1
6
4
6
7
8
S,:= Rj
6
6
Subtract
1 1 1
2 6 1
Multiply
(x - xl) -~ Multiply Ss:= Rj Subtract
di
do = 8
d 1= 6
8 + 6 ( x - 1) + l ( x - l ) ( x
=x2+3x+4 ( A r i t h m e t i c m o d u l o 11) d 2 =1
1 0
P
d3= 0
- 2)
Systolic algorithm for polynomial interpolation
499
Table 3 Single variable interpolation P0
Pt
P2
P3
x 0 =1
x] = 2
x2= 3
x 3 = 0.5
0.5
S~ : = R j
2.5 0.5
6.5 0.5
Subtract
2
6
- 0_25
(% - )to) -1 Multiply
1
0.5
-2
2
3
0_5
ss:= Rj
2
2
Subtract
1
- 1.5
(x.¢ - x I ) - 1 Multiply
1 1
- 0.666 1
r,
di
0.25 0.5
P
d o = 0.5
d1= 2
S~:=R
1
Subtract
0
0.5+2(x-1)+l(x-1)(x-2) =x2-x+0.5 d, = 1 d3= 0
The systematic top-down design process for an ISA is a generalized version of the design approach for systolic arrays, see Kung [2]. We start with a locally recursive version of the above algorithm and generate the dependence graph. The final design is based on this approach, but we shall only describe the the actual implementation and not the approach (see [5]). We have a 1 × (n + 1) array of n + 1 processing elements (PE) in the ISA, P0 . . . . . pn. Each processing element in the ISA has 6 data registers: X stores a point x, - XS shifts x, R stores a residue r RS shifts r, - M stores M s D stores D~ coefficients. We have five instructions in the instruction set, and each PE can execute all the instructions (subscripts L and T refer to the left and top neighbours respectively). The five instructions in the ISA program are: A. X : = X r, R : = R r B, R : = R - R S L , RS:=RSL C. M : = 1 / ( X - XSL), XS := XS c D. R : = R . M E. D : = R , R S : = R , X S : = X . The ISA interpolation program for four PE P0, P1, /'2 and /)3 n = 3) is shown in Fig_ 1. Note that the selectors (not shown) in a one-dimensional ISA are always '1'. The input data is the set of residues r i at points x i (i = 0 . . . . . n). At the end of the program, register D in each PE Pi contains value D , We observe that the pattern of instructions executed in each processing element Pv is A(BCD)iE, i.e. the sequence of instructions BCD is repeated i times in P~. We can generate all possible combinations of instructions that are executed at the same time (on different PE), viz. instructions on the same horizontal line, using the following expression: L = [(e, E, CB, B ) ( D C B ) * ( D , DC, A, e)] U C where ' e ' is the empty string, ' U ' is the union operation of the set of strings and ' * ' represents repetition of the string zero or more times. Remark. It is not necessary to have register D for the interpolation program since the Da coefficients are also stored in R. However, register D is necessary for the evaluation program below.
H. Schroder et al.
500
E
D C B
Program
E
D
D
C
C
B
B
D
E
D
C
D
C
B
C
B
A
B
A
E
A
x 0 , r0
x I , r1
x 2 , r2
I
x 3 , r3
P1
P2
J
P3
Data
1
P0
Fig. 1. ISA polynomial interpolation program.
5.1. Time complexity H e r e , we h a v e an ( m x k ) I S A , w h e r e m = 1 a n d k = 4. T h e p e r i o d o f t h e I S A p r o g r a m -- r = n u m b e r o f i n s t r u c t i o n s d i a g o n a l s = 3 ( k - 1) + 2 = 11. T h e e x e c u t i o n t i m e o f t h e I S A p r o g r a m = r + k + m - 2 = 14.
5.2. I S A program f o r polynomial evaluation W e n o w d e s c r i b e an I S A p o l y n o m i a l e v a l u a t i o n p r o g r a m t h a t r u n s o n t h e s a m e 1 x ( n + 1) p r o c e s s i n g a r r a y u s e d for p o l y n o m i a l i n t e r p o l a t i o n . T h e p o l y n o m i a l is e v a l u a t e d at s o m e p o i n t y as f o l l o w s ( X k d e n o t e s r e g i s t e r X in P E Pk): n r:=Do+
Y'~ j~l
j-1
DjHy-x~. k=0
Systolic algorithm for polynomial interpolation
EV I EV
501
Program
Data
I
r'ol ", XS L
I
ISA
Fig. 2. ]SA polynomial evaluation program_
Remark. This program can also be used to construct the n th degree polynomial interpolant r (in x) using the Dj coefficients: n
j--I
r : = D0 + E D j I - I x - X k . j=l
k=0
The evaluation program can be concatenated immediately after the interpolation program, as it uses the Dj coefficients (stored in the D registers) produced by the interpolation program. Also, register X in each PE Pk contains the point x kThe program consists of only one instruction EV. The input data is value y, the point where the polynomial is to be evaluated. This value is shifted through the processors using register XS. The program is initialized with the following values: R / = 0 , M L = I , XSI = y (point of evaluation); these values are used when EV is executed in PE P0-
EV.
XS:=XSL, M : = M L ( X S L - - X ) , R : = R L + D M L.
The ISA program for n = 3(4 PE) is shown in Fig. 2; the selectors are not shown. The execution of instruction EV does not affect registers D and X. Thus, the period of this algorithm is 1, and a new point can be input at each time unit. As mentioned above, register D is necessary for the evaluation program since D i coefficients stored in registers R are overwritten. It is possible to reduce the number of multiplications by about one half if Horner's nested multiplication rule is used during the evaluation. However, this would change the direction of the instruction diagonal to a perpendicular direction, viz. from north-west to south-east (in contrast, the instruction diagonal for interpolation is from south-west to north-east). This will cause delay if the evaluation program is concatenated immediately after the interpolation program.
5.3. Time complexity The period of the ISA program = r = 1 (number of instruction diagonals). The execution time for an (m × k) processing array = r + k + m - 2. Here, m = 1 and k = 4.
H. Schroder et aL
502
Table 4 Trace of ISA program
Time
for interpolation
Po
Pl
P2
Ps
A
X:=xO=l R : = r 0 = 0.5 A
E
D:= R = O . 5 = d O
X:=xl=2
R S : = 0_5,XS : = 1
R : = r l = 2_5 B
A
R ,= 2.5-.5=
2
R S : = 0.5
X:=x2=3 R : = r 2 = 6.5
C
B
M:=I
R := 6.5-.5
X S ,= 1
R S : = .5
D
R:=R
A
X:=x3 = 5
= 6
R : = R 3 = .25
C M=2
M:=
B
1/(3-
1) = . 5
R : = - .25
XS:=I
R S : = .5
E
D
C
D:=2=dl
R:=R.M=3
M:=
-2
RS ,= 2,XS := 2 6
B
D
R:=3-2=1
R:=R.M=0_5
RS := 2 C 7
M := 1/(3-
8
B 2) = 1
R :=.5-
2 = -1_5
XS:=2
RS:=2
D
C
R := R.M=I
M := 1/(.5-
2)
= - 2/3
9
E
D
D:=l=d2
R:=R.M=I
RS := I,XS := 3 B
R~=I-I
10
=0
RS := 1 C 11
M := 1/(.5=
-
3)
2/5
D 12
R,= R_M=O
13
D,=0=d3
E R S : = 0 , X S : = .5
Hence, execution time = 4. Remarks. For Chinese Remaindering (CRT) the subprogram C is modified thus:
M
:= 1/XSL,
XS
:= XS L
Remarks. For evaluating the result in Chinese Remaindering, the EV program is modified thus:
Initiate XSL := 1; R L := 0; M L:= 1; E V : X S := X ( P r i m e ) , M : = M L. X S L, R : = R L + D . M
L-
Tables 4 and 5 give the trace of ISA program for interpolation and evaluation of the polynomial given in Table 3. The evaluation is carried out for x = - 1 .
Systolic algorithm for polynomial interpolation
503
Table 5 Trace of ISA polynomial evaluation Time
Po
P1
/'2
PJ
D := dO = .5 XS:= - 1 X:=xO=l M:=I- -2= -2 R := 0 + _5 = -5
D,=dl= 2 XS:= -1 X:=xl=2 M:= -2--3=6 R :=_5-4 = -3.5 D := d 2 = 1 XS:= -1 X:= x2 = 3 M:= 6.-4 = -24 R:=-3.5+6=2.5
d:=d3~O XS:= - 1 X := x3 = .5 M: = - 2 4 - - 1 . 5 36 R := 2.5 =
Acknowledgements The authors thank Richard Brent for his interest in this work and the Australian National University, Canberra, for providing support during the time of this work. Also the authors thank the referees for providing suggestions to improve the presentation of this paper.
References E.V. Krishnamurthy, Error-free Polynomial Matrix Computations (Springer, New York, 1985). S.Y. Kung, V L S I Array Processors (Prentice Hall, Englewood Cliffs, N J, 1988). H.W. Lang, The instruction systolic array, a parallel architecture for VLSI, V L S I J. (1986) 65-74. R.J. McEliece and D.V. Sarvate, On sharing secrets and Reed-Solomon codes, Comm. A C M 24 (1981) 583-590. D.I. Moldavan, On the design of algorithms for VLSI systolic arrays, Proc. I E E E 71 (1983) 113-120. H. Schroder, The instruction systolic array - a trade-off between flexibility and speed, Technical Report CS-86-08, Department of Computer Science, Australian National University, Canberra_ [7] A_ Shamir, How to share a secret, Comm. A C M 22 (1979) 612-614. [1] [2] [3] [4] [5] [6]