Systolic algorithm for polynomial interpolation and related problems

Systolic algorithm for polynomial interpolation and related problems

Parallel Computing 17 (1991) 493-503 North-Holland 493 Systolic algorithm for polynomial interpolation and related problems H. S c h r o d e r a, V...

362KB Sizes 0 Downloads 58 Views

Parallel Computing 17 (1991) 493-503 North-Holland

493

Systolic algorithm for polynomial interpolation and related problems H. S c h r o d e r

a, V.K. M u r t h y b a n d E.V. K r i s h n a m u r t h y c

Department of Electrical Engineering and Computer Science, University of Newcastle, Newcastle, N-S. W. 2308, Australia b Fujitsu Australia Ltd., Sydney, Australia Computer Science Laboratory, Research School of Physical Sciences, Australian National University, Canberra, A C T 2601. Australia

Received August 1989 Revised June 1990

Abstract

Schroder, H., V_K. Murthy and E_V. Krishnamurthy, Systolic algorithm for polynomial interpolation and related problems, Parallel Computing 17 (1991) 493-503. This paper describes a systolic algorithm for interpolation and evaluation of polynomials over any field using a linear array of processors. The periods of these algorithms are O(n) for interpolatin and O(1) for evaluation. This algorithm is readily adapted for Chinese remaindering, easily generahzed for the multivariable interpolation and can be extended for rational interpolation to produce Pade approximants. The instruction systolic array implementation of the algorithm is presented here. Keywords. Chinese remaindering; evaluation; interpolation; polynomials; instruction systolic arrays; Lagrange

interpolant; Newton interpolation; period of algorithm; quorum security locks; systolic algorithms; systolic archi tectures.

1. Introduction Systolic a l g o r i t h m s are highly s u i t a b l e for i m p l e m e n t a t i o n in a r e g u l a r n e t w o r k o f processors, w h e r e each p r o c e s s o r has a s i m p l e i n s t r u c t i o n set a n d has a p r o v i s i o n for local c o m m u n i c a t i o n a m o n g the n e i g h b o u r i n g processors. I n this p a p e r , we d e s c r i b e how the classical p o l y n o m i a l i n t e r p o l a t i o n / e v a l u a t i o n over a n y field a n d the r e l a t e d C h i n e s e r e m a i n d e r i n g p r o b l e m can be solved using the systolic a p p r o a c h . T h e c o n v e n t i o n a l systolic a r r a y a r c h i t e c t u r e (SA) lacks flexibility to execute a class of a b s t r a c t a l g o r i t h m s in which the i n p u t s a n d o u t p u t s b e l o n g to d i f f e r e n t d a t a d o m a i n s . T o i m p r o v e this s i t u a t i o n the c o n c e p t of I n s t r u c t i o n Systolic A r r a y s ( I S A ) d u e to L a n g [3] is very useful; the I S A retains all the a d v a n t a g e s of systolic a r c h i t e c t u r e s yet p e r m i t s the e x e c u t i o n of d i f f e r e n t closely related a l g o r i t h m s d e f i n e d u n d e r d i f f e r e n t d a t a d o m a i n s using the s a m e p r o c e s s o r array. In the ISA, a s e q u e n c e of i n s t r u c t i o n s called the T o p P r o g r a m (TP), a n d an o r t h o g o n a l s e q u e n c e of B o o l e a n selectors called the Left P r o g r a m (LP) are p u m p e d t h r o u g h the m e s h c o n n e c t e d a r r a y o f p r o c e s s o r s (unlike SA, where o n l y the d a t a is p u m p e d t h r o u g h the processors). If an i n s t r u c t i o n meets the selector bit '1' at the processor, then the i n s t r u c t i o n is e x e c u t e d in that p r o c e s s o r ; otherwise, (i.e. it meets a ' 0 ' bit) the i n s t r u c t i o n is n o t e x e c u t e d in 0167-8191/91/$03.50 © 1991 - Elsevier Science Publishers B_V. (North-Holland)

H. Schroder et al.

494

that processor leaving the contents of its registers unchanged. Thus an ISA program is a 2-tuple (TP, LP); data is supplied to ISA from external data queues, which can be read during the execution. Hence one can execute different programs on the same architecture simply by pumping in different set of instructions; also the flexibility of ISA is enhanced by the ability to inhibit instructions at certain processors using selectors. These two features make ISA a programmable systolic architecture. Also in contrast to the mesh-connected processor array which consists of independent processors each with its own memory and program storage, the ISA processors have no local memory for program storage; hence they can be realised in a smaller chip area.

2. Polynomial interpolation/Chinese remaindering The polynomial interpolation over a field (real or finite) consists in finding the (n + 1) coefficients of an nth degree polynomial in the specified field, given the functional values at (n + 1) distinct points; obviously the values of the polynomial evaluated at these points should coincide. A closely related problem is the computation of the solution to a system of linear congruences over the integers based on the Chinese remainder theorem, see Krishnamurthy [1]. Both these algorithms have tremendous applications in coding theory, design of q u o r u m security locks, see Shamir [7], McEliece and Sarvate [4] and exact computation [1]. From an algebraic point of view the Chinese remaindering and polynomial interpolation problems are equivalent. Given a set of remainders (residues) ( r 0, r 1. . . . . r } with respect to a set of moduli { P0, P~ . . . . . p,, } which are pairwise relatively prime, both problems reconstruct an element r such that ri = r rood Pi for i = 0, 1 . . . . . n in respective "domains; r is uniquely determined if n

size r < size F I p, = M, i=0 where size is suitably defined as follows: for integers ' t h e size' is the magnitude and for polynomials 'the size' is the degree. Element r is defined as r/

r = ~_, ( M / p i ) r . T i mod M, i=O

where T, is the solution ( M / p i ) T ~ I mod Pi. The equivalence between the Chinese remaindering algorithm and the Lagrange polynomial interpolation is readily seen from the following correspondence [1]: =

r ( x ) = n th degree polynomial in a field F r(x,) = r

( i = 0 , 1 . . . . . n)

p,(x) = ( x - x , )

r ( x ) mod p , ( x ) = r , n

M = 1-I ~=0

The Lagrange interpolant is obtained from

L.(x)=

°

~=0

i5

k=0

Systolic algorithm for polynomial interpolation

495

where Ti ~

/1

1

= [M/pi(x)]

-1 m o d ( x - x , )

FI (x,- xk)

k=0 (Mand

p i ( x ) are d e f i n e d a b o v e ) ,

k4:i.

Since in d o i n g p o l y n o m i a l i n t e r p o l a t i o n , we restrict ourselves to linear p o l y n o m i a l s ( x - x , ) (i = 0 . . . . . n), the r e m a i n d e r or r e s i d u e c o m p u t a t i o n for a p o l y n o m i a l r ( x ) is e q u i v a l e n t to r(x)

m o d ( x - x i ) = ri b y the r e m a i n d e r t h e o r e m .

Thus, the r e c o n s t r u c t i o n of an integer r in the r a n g e 0 ~< r ~< M - 1 from a set of r e s i d u e s {to, rl . . . . . 6,} with respect to a set of p r i m e s ( p 0 , P~ . . . . . p , } a n d the r e c o n s t r u c t i o n of a p o l y n o m i a l r ( x ) o f degree a t m o s t n f r o m a set of residues ( r o, r 1. . . . . r,, } with respect to a set of d i s t i n c t linear p o l y n o m i a l s ( p o ( X ) , p ~ ( x ) . . . . . p , ( x ) ) can be a c h i e v e d b y i d e n t i c a l alg o r i t h m s for s u i t a b l y d e f i n e d d a t a d o m a i n s . W e now f o r m a l l y state the s e q u e n t i a l a l g o r i t h m . 2.1. S e q u e n t i a l algorithm

T h e sequential a l g o r i t h m given b e l o w takes as i n p u t s residues { r o, r I . . . . . r,} a n d m o d u l i {Po, Pl . . . . . p,,}. F o r integers, r is r e c o n s t r u c t e d such that r rood p , = r i where r is over m o d u l o M , M = Fl~'=op ,. F o r p o l y n o m i a l s , r ( x ) is r e c o n s t r u c t e d such that r ( x ) m o d pi = r i where p, = x - x,; here, r ( x ) is a n t h degree p o l y n o m i a l over field F. Input: residues r i and moduli Pi (i=O, Output: r e c o n s t r u c t e d element r

1,...,

n).

M:=I; R:=r0; for k=l begin

to n do

M:=Mpk_I; u : = M -I mod Pk; d:=(rk-R)u R:=R+dM

rood Pk;

end; r:=R; This a l g o r i t h m expresses r in the form r = d o + d~ p o + - - - + d , P o . .. P , - 1 , where d o = r0. F o r integers, r is in m i x e d - r a d i x f o r m where d , ' s are the m i x e d - r a d i x digits a n d 0 ~< d, ~
3. Parallel/distributed algoritlm for Chinese remaindering/interpolation I n the d e s c r i p t i o n of this p a r a l l e l a l g o r i t h m , we d o n o t c o n s t r a i n ourselves to a p a r t i c u l a r p a r a l l e l c o m p u t a t i o n a l model. S u i t a b l e m o d i f i c a t i o n s , however, can b e easily i n c o r p o r a t e d to suit a p a r t i c u l a r c o m p u t a t i o n a l model. Its I S A i m p l e m e n t a t i o n will be d e s r i b e d later.

496

H. Schroder et al.

This algorithm takes as inputs residues {ro, r] . . . . . r, } and moduli { Po, P~ . . . . . integers, r is reconstructed such that r m o d Pi = r where r is over m o d u l o M,

p,}. F o r

t/

M=

I--[ Pi-

i=0

F o r polynomials, r(x) is reconstructed such that r(x) m o d p, = r i where pi(x)= x - x~; here, r(x) is an nth-degree polynomial over field F. We assume that there are n + 1 processors P R O ( i ) (i = 0, 1 . . . . . n), each with five registers R~, D~, S i, P~ and Mi. The operations subtraction, multiplication and inversion are carried out in the appropriaie field for each processor. Register R~ is initialized with r and register P~ with p~ (for polynomials, register Xi is initialized with x,.). At the end of the algorithm, register D r in each processor contains the coefficient d~, where d o = r 0. We can then reconstruct r using D,. I n p u t : r e s i d u e s r i and e o d u l i Output: reconstructed element For i = O t o n do begin

Pi(i=O, r

1,...,

n).

Ri:=ri; Pi:=Pi; end; For j=O begin

to n do

Dj :=Rj; For s = j + l begin

to n in parallel

do

Ss:=Rj; R s : = Rs - Ss; M s : = P ; 1 rood Ps; R s:=RsM s mod Ps;

end; end;

n

j- 1

Then r: = D O+ ~ Dj H P~" j=l

s=O

In the case of polynomials, we have Pj = ( x statements in the above algorithm then are:

Xj), Ps = ( x -

Xs). The last three assignment

Ms:=(X-Xj)-' ; R,:=R~M,; j--I

r=Do+

~ j=l

DjI-Ix-X,. s=O

Remarks 1. In the last iteration of j, when j is n, ' f o r s = j + 1 to n in parallel do begin' is not executed since j + 1 equals n + 1, and exceeds the limit n. 2. C o m p u t i n g r using the above formula can be deferred until one wants to evaluate the polynomial at some point.

3.1. Proof of algorithm We now provide a p r o o f of correctness of the algorithm.

Systolic algorithm for polynomial interpolation

497

3.1.1. Chinese remaindering In order to reconstruct the integer r we assume that it is expressed uniquely in the mixed-radix form r=do+dlpo+

"

+ d , P o . . . P , _ 1,

whereO<~dg<~p~-l(i=O,

1 ..... n)

and d~'s are the mixed-radix digits. This algorithm essentially determines the d,'s using r, and p,. F o r convenience, let R k = d o + d~p o + . . . + d k p o . . . Pk-a. We then have r mod p0=r0=do r rood P l = r l = d 0

+dip0

r rood P k = r k = R k r mod p =r,,=r. Thus, d 1 = (r I - d o )

p0-1 m o d Pl- Similarly,

d 2 = ( r 2 - ( d o + d l p o ) ) ( p o p a ) -1 m o d Pz or

d 2 = (r2- Rl)(popl)

-1 m o d P2,

and in general d~. = ( rk -

Rk_,)( p o... Pk-1)-'

m o d Pk-

N o t e that r < M = l-IT=0pi; thus, d~ = 0 for i >/n + 1. 3.1.2. Polynomial interpolation F o r polynomials, we replace r by r ( x ) . W e assume that r ( x ) is an n t h degree polynomial expressed uniquely by the N e w t o n ' s Interpolation formula, r(x)=do+d,(x-xo)+

... +d,(x-Xo)...(x-x,_,).

T h e above p r o o f can be directly extended i>~n+l.

to polynomial interpolation. Thus,

d~ = 0 for

3.2. Polynomial evaluation

T h e N e w t o n ' s interpolation formula is very convenient for polynomial evaluation. Since r ( x ) = d o + d l ( x - Xo) + . . . + d , ( x Xo)...(xx , _ l ) we can evaluate r ( x ) f r o m the Newton's coefficients di stored in the registers by using H o m e r ' s nested multiplication rule.

4. Examples Chinese remaindering

Let Pi = {5, 7, 11, 13}, r i = {1, 5, 9, 11}. Table I illustrates the process of reconstructing an in teger.

H. Schroder et al.

498 Table 1 Chinese remaindering

r,

Po

PI

/'2

/'3

rood 5

rood 7

m o d 11

r o o d 13

1

P

di

5

9

11

S,:= Rj

1

1

1

Subtract

4

8

10

M=5

3

9

8

5

6

2

Ss := g j Subtract

5 1

5 10

M=7

8

2

d,=8

8

7

-

-I

M/R s

-1

M~R~

s~:= Rj

8

Subtract

12

M =ll -I M.R~

d o =1

d 1=5 1+5.5+8-5.7 +7-5-7.11 = 3001

6 7

d3=7

4.2. Single variable interpolation - f i n i t e f i e M

L e t x 0 = 1 , r0 = 8 ; x~=2, rt=3; x z = 3 , r2 = 0 ; x 3 = 4 , r3 = 1 0 . r e c o n s t r u c t i o n o f the p o l y n o m i a l o v e r the p r i m e field m o d u l o 11.

Table 2 i l l u s t r a t e s t h e

4.3. Single variable interpolation - real f i e l d

L e t x 0 = 1, r o = 0.5; x t -- 2, r t = 2.5; x 2 = 3, r z = 6.5; x 3 = 0.5, r 3 = 0.25. Table 3 i l l u s t r a t e s the r e c o n s t r u c t i o n o f the p o l y n o m i a l o v e r the real field.

5. ISA implementation of polynomial interpolation/evaluation W e n o w c o n s i d e r the i m p l e m e n t a t i o n o f the a b o v e i n t e r p o l a t i o n a l g o r i t h m o n t h e I S A , u s i n g a s y s t e m a t i c t o p - d o w n d e s i g n t e c h n i q u e . T h e t i m e c o m p l e x i t i e s for the I S A p o l y n o m i a l i n t e r p o l a t i o n a n d e v a l u a t i o n p r o g r a m s are i n d i c a t e d .

Table 2 Single v a r i a b l e i n t e r p o l a t i o n

r, S~ := R j Subtract

( X - X o ) -1

P0 xo = 1

Pi x 1= 2

P1 x2

P3 x3= 4

8

3 8

0 8

10 8

6

3

2

1

6

4

6

7

8

S,:= Rj

6

6

Subtract

1 1 1

2 6 1

Multiply

(x - xl) -~ Multiply Ss:= Rj Subtract

di

do = 8

d 1= 6

8 + 6 ( x - 1) + l ( x - l ) ( x

=x2+3x+4 ( A r i t h m e t i c m o d u l o 11) d 2 =1

1 0

P

d3= 0

- 2)

Systolic algorithm for polynomial interpolation

499

Table 3 Single variable interpolation P0

Pt

P2

P3

x 0 =1

x] = 2

x2= 3

x 3 = 0.5

0.5

S~ : = R j

2.5 0.5

6.5 0.5

Subtract

2

6

- 0_25

(% - )to) -1 Multiply

1

0.5

-2

2

3

0_5

ss:= Rj

2

2

Subtract

1

- 1.5

(x.¢ - x I ) - 1 Multiply

1 1

- 0.666 1

r,

di

0.25 0.5

P

d o = 0.5

d1= 2

S~:=R

1

Subtract

0

0.5+2(x-1)+l(x-1)(x-2) =x2-x+0.5 d, = 1 d3= 0

The systematic top-down design process for an ISA is a generalized version of the design approach for systolic arrays, see Kung [2]. We start with a locally recursive version of the above algorithm and generate the dependence graph. The final design is based on this approach, but we shall only describe the the actual implementation and not the approach (see [5]). We have a 1 × (n + 1) array of n + 1 processing elements (PE) in the ISA, P0 . . . . . pn. Each processing element in the ISA has 6 data registers: X stores a point x, - XS shifts x, R stores a residue r RS shifts r, - M stores M s D stores D~ coefficients. We have five instructions in the instruction set, and each PE can execute all the instructions (subscripts L and T refer to the left and top neighbours respectively). The five instructions in the ISA program are: A. X : = X r, R : = R r B, R : = R - R S L , RS:=RSL C. M : = 1 / ( X - XSL), XS := XS c D. R : = R . M E. D : = R , R S : = R , X S : = X . The ISA interpolation program for four PE P0, P1, /'2 and /)3 n = 3) is shown in Fig_ 1. Note that the selectors (not shown) in a one-dimensional ISA are always '1'. The input data is the set of residues r i at points x i (i = 0 . . . . . n). At the end of the program, register D in each PE Pi contains value D , We observe that the pattern of instructions executed in each processing element Pv is A(BCD)iE, i.e. the sequence of instructions BCD is repeated i times in P~. We can generate all possible combinations of instructions that are executed at the same time (on different PE), viz. instructions on the same horizontal line, using the following expression: L = [(e, E, CB, B ) ( D C B ) * ( D , DC, A, e)] U C where ' e ' is the empty string, ' U ' is the union operation of the set of strings and ' * ' represents repetition of the string zero or more times. Remark. It is not necessary to have register D for the interpolation program since the Da coefficients are also stored in R. However, register D is necessary for the evaluation program below.

H. Schroder et al.

500

E

D C B

Program

E

D

D

C

C

B

B

D

E

D

C

D

C

B

C

B

A

B

A

E

A

x 0 , r0

x I , r1

x 2 , r2

I

x 3 , r3

P1

P2

J

P3

Data

1

P0

Fig. 1. ISA polynomial interpolation program.

5.1. Time complexity H e r e , we h a v e an ( m x k ) I S A , w h e r e m = 1 a n d k = 4. T h e p e r i o d o f t h e I S A p r o g r a m -- r = n u m b e r o f i n s t r u c t i o n s d i a g o n a l s = 3 ( k - 1) + 2 = 11. T h e e x e c u t i o n t i m e o f t h e I S A p r o g r a m = r + k + m - 2 = 14.

5.2. I S A program f o r polynomial evaluation W e n o w d e s c r i b e an I S A p o l y n o m i a l e v a l u a t i o n p r o g r a m t h a t r u n s o n t h e s a m e 1 x ( n + 1) p r o c e s s i n g a r r a y u s e d for p o l y n o m i a l i n t e r p o l a t i o n . T h e p o l y n o m i a l is e v a l u a t e d at s o m e p o i n t y as f o l l o w s ( X k d e n o t e s r e g i s t e r X in P E Pk): n r:=Do+

Y'~ j~l

j-1

DjHy-x~. k=0

Systolic algorithm for polynomial interpolation

EV I EV

501

Program

Data

I

r'ol ", XS L

I

ISA

Fig. 2. ]SA polynomial evaluation program_

Remark. This program can also be used to construct the n th degree polynomial interpolant r (in x) using the Dj coefficients: n

j--I

r : = D0 + E D j I - I x - X k . j=l

k=0

The evaluation program can be concatenated immediately after the interpolation program, as it uses the Dj coefficients (stored in the D registers) produced by the interpolation program. Also, register X in each PE Pk contains the point x kThe program consists of only one instruction EV. The input data is value y, the point where the polynomial is to be evaluated. This value is shifted through the processors using register XS. The program is initialized with the following values: R / = 0 , M L = I , XSI = y (point of evaluation); these values are used when EV is executed in PE P0-

EV.

XS:=XSL, M : = M L ( X S L - - X ) , R : = R L + D M L.

The ISA program for n = 3(4 PE) is shown in Fig. 2; the selectors are not shown. The execution of instruction EV does not affect registers D and X. Thus, the period of this algorithm is 1, and a new point can be input at each time unit. As mentioned above, register D is necessary for the evaluation program since D i coefficients stored in registers R are overwritten. It is possible to reduce the number of multiplications by about one half if Horner's nested multiplication rule is used during the evaluation. However, this would change the direction of the instruction diagonal to a perpendicular direction, viz. from north-west to south-east (in contrast, the instruction diagonal for interpolation is from south-west to north-east). This will cause delay if the evaluation program is concatenated immediately after the interpolation program.

5.3. Time complexity The period of the ISA program = r = 1 (number of instruction diagonals). The execution time for an (m × k) processing array = r + k + m - 2. Here, m = 1 and k = 4.

H. Schroder et aL

502

Table 4 Trace of ISA program

Time

for interpolation

Po

Pl

P2

Ps

A

X:=xO=l R : = r 0 = 0.5 A

E

D:= R = O . 5 = d O

X:=xl=2

R S : = 0_5,XS : = 1

R : = r l = 2_5 B

A

R ,= 2.5-.5=

2

R S : = 0.5

X:=x2=3 R : = r 2 = 6.5

C

B

M:=I

R := 6.5-.5

X S ,= 1

R S : = .5

D

R:=R

A

X:=x3 = 5

= 6

R : = R 3 = .25

C M=2

M:=

B

1/(3-

1) = . 5

R : = - .25

XS:=I

R S : = .5

E

D

C

D:=2=dl

R:=R.M=3

M:=

-2

RS ,= 2,XS := 2 6

B

D

R:=3-2=1

R:=R.M=0_5

RS := 2 C 7

M := 1/(3-

8

B 2) = 1

R :=.5-

2 = -1_5

XS:=2

RS:=2

D

C

R := R.M=I

M := 1/(.5-

2)

= - 2/3

9

E

D

D:=l=d2

R:=R.M=I

RS := I,XS := 3 B

R~=I-I

10

=0

RS := 1 C 11

M := 1/(.5=

-

3)

2/5

D 12

R,= R_M=O

13

D,=0=d3

E R S : = 0 , X S : = .5

Hence, execution time = 4. Remarks. For Chinese Remaindering (CRT) the subprogram C is modified thus:

M

:= 1/XSL,

XS

:= XS L

Remarks. For evaluating the result in Chinese Remaindering, the EV program is modified thus:

Initiate XSL := 1; R L := 0; M L:= 1; E V : X S := X ( P r i m e ) , M : = M L. X S L, R : = R L + D . M

L-

Tables 4 and 5 give the trace of ISA program for interpolation and evaluation of the polynomial given in Table 3. The evaluation is carried out for x = - 1 .

Systolic algorithm for polynomial interpolation

503

Table 5 Trace of ISA polynomial evaluation Time

Po

P1

/'2

PJ

D := dO = .5 XS:= - 1 X:=xO=l M:=I- -2= -2 R := 0 + _5 = -5

D,=dl= 2 XS:= -1 X:=xl=2 M:= -2--3=6 R :=_5-4 = -3.5 D := d 2 = 1 XS:= -1 X:= x2 = 3 M:= 6.-4 = -24 R:=-3.5+6=2.5

d:=d3~O XS:= - 1 X := x3 = .5 M: = - 2 4 - - 1 . 5 36 R := 2.5 =

Acknowledgements The authors thank Richard Brent for his interest in this work and the Australian National University, Canberra, for providing support during the time of this work. Also the authors thank the referees for providing suggestions to improve the presentation of this paper.

References E.V. Krishnamurthy, Error-free Polynomial Matrix Computations (Springer, New York, 1985). S.Y. Kung, V L S I Array Processors (Prentice Hall, Englewood Cliffs, N J, 1988). H.W. Lang, The instruction systolic array, a parallel architecture for VLSI, V L S I J. (1986) 65-74. R.J. McEliece and D.V. Sarvate, On sharing secrets and Reed-Solomon codes, Comm. A C M 24 (1981) 583-590. D.I. Moldavan, On the design of algorithms for VLSI systolic arrays, Proc. I E E E 71 (1983) 113-120. H. Schroder, The instruction systolic array - a trade-off between flexibility and speed, Technical Report CS-86-08, Department of Computer Science, Australian National University, Canberra_ [7] A_ Shamir, How to share a secret, Comm. A C M 22 (1979) 612-614. [1] [2] [3] [4] [5] [6]