Parallel Computing 32 (2006) 301–312

Polynomial interpolation and polynomial root finding on OTIS-mesh

Prasanta K. Jana *

Department of Computer Science and Engineering, Indian School of Mines, Dhanbad 826004, India

Received 27 March 2005; received in revised form 25 September 2005; accepted 12 January 2006; available online 20 March 2006
Abstract

The OTIS-mesh is an efficient model of optoelectronic parallel computers. In this paper, we develop several parallel algorithms on this network for two important problems of numerical analysis, namely polynomial interpolation and polynomial root finding. The algorithms are based on two schemes of data mapping, row–column mapping and group mapping, in which the input data elements are arranged row/column wise and group wise, respectively, by suitably exploiting the links of the network. We use the Lagrange method for polynomial interpolation and the Durand–Kerner method [T.L. Freeman, Calculating polynomial zeros on a local memory parallel computer, Parallel Comput. 12 (1989) 351–358] for polynomial root finding. The algorithms are shown to be optimal or near optimal under the assumption that the data elements are initially stored following the above mentioned mappings. The scalability of the algorithms is also studied.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Optoelectronic parallel computer; OTIS-mesh; Lagrange interpolation; Durand–Kerner method
1. Introduction

The optical transpose interconnection system (OTIS), proposed by Marsden et al. [20], is a model of optoelectronic parallel computer in which the processors are divided into groups. The processors within each group are connected by electronic links, and the connections between different groups are realized by optical links. Optical interconnects have power, speed and crosstalk advantages over electronic interconnects when the connection distance is more than a few millimeters [8,19]. It is shown in [20] that when the number of groups is equal to the number of processors within each group, the bandwidth of the OTIS system is maximized and the power consumption is minimized. The OTIS-mesh [24,26] is an efficient member of the OTIS family of optoelectronic parallel computers. In an $N \times N$ OTIS-mesh, $N^2$ processors are organized into $N$ groups in the form of a $\sqrt{N} \times \sqrt{N}$ lattice, as shown in Fig. 1 for N = 4. Each group is basically a $\sqrt{N} \times \sqrt{N}$ 2D-mesh. Let (G, P) denote the Pth processor of the Gth group. The processors within a group are connected via electronic links following the 2D-mesh topology.
* Tel.: +91 326 2210025; fax: +91 326 2210028. E-mail address: [email protected]
[Fig. 1. OTIS-mesh of 16 processors: four groups G(1,1), G(1,2), G(2,1), G(2,2), each a 2 × 2 mesh; processor coordinates within a group are shown inside the small boxes.]
These links are shown by solid arrows in the figure. The processors of two different groups are connected via optical links following the OTIS rule, i.e., (G, P) is connected with (P, G), as shown by dashed arrows. We assume that all the links are bi-directional. Let (gx, gy) denote the coordinates of the group G. Then the processor placed in the px-th row and py-th column within the group G is denoted by P(gx, gy, px, py) for $1 \le g_x, g_y, p_x, p_y \le \sqrt{N}$. The coordinates of a processor within a group are shown inside the small box in the figure. The diameter of an $N \times N$ OTIS-mesh is $4\sqrt{N} - 3$ [26]. The simulation of a 4D-mesh by an OTIS-mesh has been shown in [31]. In recent years, a number of parallel algorithms for various computations have been developed on the OTIS-mesh, such as image processing [27], matrix multiplication [28], data sum, consecutive sum, concentrate [29], BPC permutation [30], k–k sorting [21], randomized routing, selection and sorting [22], etc.

Polynomial interpolation and polynomial root finding are two important problems of numerical analysis with a variety of real-time and real-life applications. For example, in automatic control and digital signal processing we often require fast extraction of all the roots of a high degree polynomial. Specific applications include the identification of the poles of a digital filter, in which polynomials with degree as high as 100 are sometimes encountered [23]. Polynomial interpolation can be very useful in geological mapping, petroleum exploration, cardiology (heart potential), etc. Several parallel algorithms for these problems have been studied on different models of parallel computers. Parallel algorithms for polynomial interpolation based on the Lagrange formula have been studied in [3,4,11,15–17,25]. Various parallel algorithms for polynomial root finding based on the Durand–Kerner method [6,18] have been reported in the literature [5,9,10,13,14].

In this paper, we present various parallel algorithms for Lagrange interpolation and the Durand–Kerner method on an $N^2$-processor OTIS-mesh. The algorithms result from two suitable ways of mapping the initial data elements over the whole network, namely row–column mapping and group mapping. In the row–column mapping, we distribute the data elements $x_1, x_2, \ldots, x_N$ row wise and column wise following suitable communication links. In the group mapping, the data element $x_{\sqrt{N}(i-1)+j}$ is stored in all the processors of the group G(i, j), $1 \le i, j \le \sqrt{N}$, and $x_1, x_2, \ldots, x_N$ are stored in row-major fashion in every group.

In Section 2, we first develop a parallel algorithm for N-point Lagrange interpolation in $8\sqrt{N} - 6$ electronic moves and 3 OTIS moves using the row–column mapping. Then we develop another algorithm that requires $6\sqrt{N} - 4$ electronic moves and 2 OTIS moves following the group mapping. In Section 3, we present a parallel algorithm for the Durand–Kerner method in $6\sqrt{N} - 6$ electronic moves and 2 OTIS moves that uses the row–column mapping, followed by another algorithm using the group mapping in $4\sqrt{N} - 4$ electronic moves and one OTIS move. We show that the algorithms can be made optimal or near optimal if we assume that the data elements are initially stored following the appropriate mapping as mentioned above. The scalability of the algorithms is also discussed.
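Before proceeding, note that the optical links realize the fixed permutation (G, P) → (P, G), so a single OTIS move amounts to exchanging the group and processor coordinates. The following minimal Python sketch is our own illustration, not the paper's notation: it models the $N^2$ registers as a 4D NumPy array indexed (gx, gy, px, py), and the hypothetical name otis_move implements the move as an axis transpose.

```python
import numpy as np

def otis_move(R):
    """One OTIS move: the content of processor (G, P) goes to (P, G).

    R is a 4D array indexed (gx, gy, px, py); swapping the group axes
    (gx, gy) with the processor axes (px, py) models the optical links.
    """
    return np.transpose(R, (2, 3, 0, 1)).copy()

# Example on the 16-processor OTIS-mesh of Fig. 1 (sqrt(N) = 2):
R = np.arange(16).reshape(2, 2, 2, 2)    # R[gx, gy, px, py]
S = otis_move(R)
assert S[0, 1, 1, 0] == R[1, 0, 0, 1]    # (G, P) = ((2,1),(1,2)) -> ((1,2),(2,1))
```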
2. Parallel algorithms for Lagrange interpolation

Given a set of tabulated values $y_1, y_2, \ldots, y_N$ of the function y = f(x) at some discrete points $x_1, x_2, \ldots, x_N$, the N-point Lagrange interpolation is defined as follows [12]:

$$f(x) = p(x) \sum_{i=1}^{N} \frac{y_i}{(x - x_i)\, p'(x_i)}$$

where

$$p(x) = \prod_{i=1}^{N} (x - x_i), \qquad p'(x_i) = \prod_{j=1,\, j \ne i}^{N} (x_i - x_j), \quad i = 1, 2, \ldots, N.$$
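As a point of reference for the parallel algorithms below, this formula can be fixed with a direct sequential evaluation. The following is a minimal sketch of ours, not from the paper, assuming distinct nodes and an evaluation point x that is not one of them; the name lagrange_eval is hypothetical.

```python
import numpy as np

def lagrange_eval(xs, ys, x):
    """Evaluate the N-point Lagrange interpolant at x via
    f(x) = p(x) * sum_i y_i / ((x - x_i) * p'(x_i))."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    p = np.prod(x - xs)                       # p(x)
    diff = xs[:, None] - xs[None, :]          # diff[i, j] = x_i - x_j
    np.fill_diagonal(diff, 1.0)               # skip the j = i factors
    dp = diff.prod(axis=1)                    # p'(x_i)
    return p * np.sum(ys / ((x - xs) * dp))

xs = [1.0, 2.0, 3.0]
ys = [1.0, 4.0, 9.0]                          # samples of f(t) = t^2
print(lagrange_eval(xs, ys, 2.5))             # 6.25, exact for a quadratic
```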
For the parallel algorithms presented in this paper we use a special symbol '*' in the coordinates to denote the set of processors/groups with all possible values. We also assume that each processor has some local registers as per the requirement of the algorithms. Further, we assume the SIMD mode of operation of the OTIS-mesh, in which all the active processors move data along the same dimension. We consider the row–column mapping of the input data in our first algorithm (L1) and the group mapping in the next (L2), which we describe step wise as follows.

Algorithm L1. /* We assume here that the data can be fed only through the boundary processors P(i, 1, k, 1), $1 \le i, k \le \sqrt{N}$, and P(1, j, 1, l), $1 \le j, l \le \sqrt{N}$. Steps 1 through 7 then perform the row–column mapping of the initial input data elements. */

Step 1. For all i, k, $1 \le i, k \le \sqrt{N}$ do in parallel
  Input $x_{\sqrt{N}(k-1)+i}$ to A(i, 1, k, 1)
  Input x to C(i, 1, k, 1)
Step 2. Broadcast locally the contents of the A and C registers stored in step 1 to all processors along the same row of the group.
Step 3. For all j, l, $1 \le j, l \le \sqrt{N}$ do in parallel
  Input $x_{\sqrt{N}(l-1)+j}$ to B(1, j, 1, l)
Step 4. Broadcast locally the contents of the B register stored in step 3 to all processors along the same column of the group.
Step 5. Perform an OTIS move on the A, B and C register contents stored in steps 2 and 4.
Step 6. Broadcast locally the contents of the A and C registers stored in step 5 to all the processors along the same row of the group.
Step 7. Broadcast locally the contents of the B registers stored in step 5 to all processors along the same column of the group.
Illustration 1. The contents of the A and B registers after this step are shown in Figs. 2 and 3, respectively, for N = 9, in which the input of the initial data is also indicated by arrows.

Step 8. For all i, j, k, l, $1 \le i, j, k, l \le \sqrt{N}$ do in parallel
  If A(i, j, k, l) ≠ B(i, j, k, l) then
    A(i, j, k, l) ← (C(i, j, k, l) − B(i, j, k, l)) / (A(i, j, k, l) − B(i, j, k, l))
  Else A(i, j, k, l) ← 1
[Fig. 2. Contents of A-registers and initial input (N = 9): every processor in the global row $r = \sqrt{N}(g_x - 1) + p_x$ holds $x_r$.]
[Fig. 3. Contents of B-registers and initial input (N = 9): every processor in the global column $c = \sqrt{N}(g_y - 1) + p_y$ holds $x_c$.]
Illustration 2. After this step A(i, j, k, l) holds

$$\frac{x - x_{\sqrt{N}(j-1)+l}}{x_{\sqrt{N}(i-1)+k} - x_{\sqrt{N}(j-1)+l}}$$

for A(i, j, k, l) ≠ B(i, j, k, l), or 1 otherwise, as shown in Fig. 4, where $x_{i\text{-}j}$ denotes $(x - x_j)/(x_i - x_j)$ for i ≠ j, or 1 for i = j.
Step 9. Form the local product with the contents of the A registers along the same row of each group and store it in A(i, j, k, 1), $1 \le i, j, k \le \sqrt{N}$.
[Fig. 4. Contents of the A-registers after step 8 (N = 9): entry $x_{i\text{-}j} = (x - x_j)/(x_i - x_j)$ off the main diagonal and 1 on it.]
Step 10. Perform an OTIS move on the contents of the A registers stored in step 9.
Step 11. Form the local product with the contents of the A registers stored in step 10 along the same row of each group G(*, 1) and store it in A(i, 1, k, 1), $1 \le i, k \le \sqrt{N}$.

Illustration 3. After step 11, A(i, 1, k, 1) holds, writing $m = \sqrt{N}(k-1)+i$,

$$\frac{(x - x_1)(x - x_2)\cdots(x - x_{m-1})(x - x_{m+1})\cdots(x - x_N)}{(x_m - x_1)\cdots(x_m - x_{m-1})(x_m - x_{m+1})\cdots(x_m - x_N)}.$$
Step 12. For all i, k, $1 \le i, k \le \sqrt{N}$ do in parallel
  B(i, 1, k, 1) ← $y_{\sqrt{N}(k-1)+i}$
Step 13. For all i, k, $1 \le i, k \le \sqrt{N}$ do in parallel
  A(i, 1, k, 1) ← A(i, 1, k, 1) * B(i, 1, k, 1)
Step 14. Form the local sum with the contents of A(i, 1, k, 1), $1 \le i, k \le \sqrt{N}$, along the same column of each group and store them in A(i, 1, 1, 1), $1 \le i \le \sqrt{N}$.
Step 15. Perform an OTIS move on the contents of A(i, 1, 1, 1), $1 \le i \le \sqrt{N}$.
Step 16. Form the final result by adding the contents of A(1, 1, k, 1) column wise for $1 \le k \le \sqrt{N}$ and store it in A(1, 1, 1, 1).

Time complexity: Since one piece of data can be moved at a time along an electronic link, the contents of the registers A and C in step 2 can be broadcast in a pipeline, i.e., the content of the C register trails that of the A register, as analyzed by Wang and Sahni for their matrix multiplication on the OTIS-mesh [28]. This requires $\sqrt{N}$ electronic moves. Step 6 also requires $\sqrt{N}$ electronic moves by similar pipelining. Each of the steps 4, 7, 9, 11, 14 and 16 requires $\sqrt{N} - 1$ electronic moves. Steps 5, 10 and 15 each require a single OTIS move. Therefore the above algorithm requires $8\sqrt{N} - 6$ electronic moves and 3 OTIS moves.
The correctness of the above algorithm can be readily seen from the illustrations and the figures as follows. After step 8, the contents of the A registers are $\frac{x - x_j}{x_i - x_j}$ for i ≠ j, or 1 for i = j, $1 \le i, j \le N$ in terms of the global row and column indices, as is clear from Illustration 2 and Fig. 4. After step 11, the A registers hold $\prod_{j=1,\, j \ne i}^{N} (x - x_j) \big/ \prod_{j=1,\, j \ne i}^{N} (x_i - x_j)$, $1 \le i \le N$, and after step 16 the final result

$$\sum_{i=1}^{N} \frac{\prod_{j=1,\, j \ne i}^{N} (x - x_j)}{\prod_{j=1,\, j \ne i}^{N} (x_i - x_j)}\, y_i$$

is obtained in the A register of the processor P(1, 1, 1, 1).
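To see this data flow concretely, here is a minimal NumPy sketch of ours, not the paper's: the hypothetical function lagrange_L1 models the $N^2$ registers as 4D arrays indexed (gx, gy, px, py), starts from the row–column mapping of Figs. 2 and 3 already in place (cf. Remark 1 below), performs steps 8 through 11 with the OTIS move of step 10 as an axis transpose, and collapses the sum steps 12 through 16 into one reduction.

```python
import numpy as np

def otis(R):
    """OTIS move: the content of register (G, P) moves to (P, G)."""
    return np.transpose(R, (2, 3, 0, 1)).copy()

def lagrange_L1(xs, ys, x):
    """Array-level simulation of steps 8-16 of Algorithm L1."""
    N = len(xs)
    n = int(round(N ** 0.5))                   # sqrt(N): side of the group lattice
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)

    # Row-column mapping of Figs. 2 and 3 (assumed pre-loaded): A is constant
    # along each global row (gx, px), B along each global column (gy, py),
    # and C holds the evaluation point x everywhere.
    row = np.arange(N).reshape(n, 1, n, 1)     # global row index at (gx, ., px, .)
    col = np.arange(N).reshape(1, n, 1, n)     # global column index at (., gy, ., py)
    A = np.broadcast_to(xs[row], (n, n, n, n)).copy()
    B = np.broadcast_to(xs[col], (n, n, n, n)).copy()
    C = np.full((n, n, n, n), float(x))

    # Step 8: A <- (C - B)/(A - B) where A != B, else 1.
    A = np.where(A != B, (C - B) / np.where(A != B, A - B, 1.0), 1.0)
    # Steps 9-10: local row products within each group, then one OTIS move.
    P = otis(A.prod(axis=3, keepdims=True))
    # Step 11: row products in the groups G(*, 1) complete the N-term products.
    L = P.prod(axis=3)[:, 0, :]                # L[k, i] belongs to global row n*i + k
    # Steps 12-16: multiply by the matching y values and sum everything up.
    return (L * ys.reshape(n, n).T).sum()

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 8.0, 27.0, 64.0]                    # samples of f(t) = t^3
print(lagrange_L1(xs, ys, 2.5))                # 15.625, exact for a cubic
```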
Remark 1. In the above algorithm, the data is fed only through the boundary processors P(i, 1, k, 1), $1 \le i, k \le \sqrt{N}$, and P(1, j, 1, l), $1 \le j, l \le \sqrt{N}$, and is then broadcast row wise and column wise. This requires a total of 2N I/O pins, similar to what is applied for the enumeration sort on the mesh of trees [1,2]. If, however, we assume that the data elements $x_1, x_2, \ldots, x_N$ are initially stored in the two local registers A and B as shown in Figs. 2 and 3, then the above algorithm can be performed in only $4\sqrt{N} - 4$ electronic moves and 2 OTIS moves (a total of $4\sqrt{N} - 2$ moves), following steps 8 through 16 and omitting steps 1 through 7. This is almost optimal, since the diameter of the OTIS-mesh is $4\sqrt{N} - 3$. It may be noted that a similar assumption has been made for the initial data allocation using group row mapping (GRM) and group submatrix mapping (GSM) for matrix multiplication on the OTIS-mesh [28]. Therefore we have the following theorem.

Theorem 1. N-point Lagrange interpolation can be performed by Algorithm L1 in near optimal time, i.e., in $4\sqrt{N} - 4$ electronic moves and 2 OTIS moves.

We now outline how the above algorithm can be modified when p² (for p < n) processors are available. For the sake of simplicity, we assume that N = kp, where k is an integer. It can be noted that Algorithm L1 first stores the input data points $x_1, x_2, \ldots, x_N$ in all the rows and all the columns (as indicated in Figs. 2 and 3, respectively), followed by a sum-of-products computation. Now, suppose the N input data points are grouped into k = N/p sets: $\{x_1, \ldots, x_p\}, \{x_{p+1}, \ldots, x_{2p}\}, \ldots, \{x_{(k-1)p+1}, \ldots, x_{kp}\}$. For a given input set to the p columns, the rows are successively fed with all possible input sets, and each time the required product terms are formed. Then the input sets to the columns are successively changed to update the partially computed results, and the same procedure is repeated until the final interpolated value is generated in processor P(1, 1, 1, 1). The time complexity of the modified algorithm will be $O(k\sqrt{N})$.

Algorithm L2. /* Here, we assume that the data can be fed only through the boundary processors P(i, j, 1, 1), $1 \le i, j \le \sqrt{N}$. Steps 1 through 4 then perform the group mapping of the initial input data elements. */

Step 1. For all i, j, $1 \le i, j \le \sqrt{N}$ do in parallel
  Input $x_{\sqrt{N}(i-1)+j}$ to A(i, j, 1, 1)
  Input $y_{\sqrt{N}(i-1)+j}$ to D(i, j, 1, 1)
  Input x to B(i, j, 1, 1)
Step 2. On each group, broadcast locally the contents of the A and B registers stored in step 1 to all the processors of the corresponding group. The contents of the A registers after this step are shown in Fig. 5.
Step 3. For all i, j, k, l, $1 \le i, j, k, l \le \sqrt{N}$ do in parallel
  C(i, j, k, l) ← A(i, j, k, l)
Step 4. Perform an OTIS move on the contents of the A registers of all groups. The contents of the A registers after this step are shown in Fig. 6.
Step 5. For all i, j, k, l, $1 \le i, j, k, l \le \sqrt{N}$ do in parallel
  If C(i, j, k, l) ≠ A(i, j, k, l) then
    A(i, j, k, l) ← (B(i, j, k, l) − A(i, j, k, l)) / (C(i, j, k, l) − A(i, j, k, l))
  Else A(i, j, k, l) ← 1
[Fig. 5. Contents of A-registers and initial inputs (N = 9): every processor of group G(i, j) holds $x_{\sqrt{N}(i-1)+j}$.]
[Fig. 6. Contents of the A-registers after step 4 (N = 9): every group holds $x_1, \ldots, x_9$ in row-major order.]
Illustration 4. After this step the contents of A(i, j, k, l) are shown in Fig. 7, where $x_{i\text{-}j}$ bears the same notation as before, i.e., $(x - x_j)/(x_i - x_j)$ for i ≠ j, or 1 for i = j.

Step 6. Form the local product with the contents of the A registers on each group and store it in A(i, j, 1, 1), $1 \le i, j \le \sqrt{N}$.
[Fig. 7. Contents of the A-registers after step 5 (N = 9).]
Step 7. For all i, j, $1 \le i, j \le \sqrt{N}$ do in parallel
  A(i, j, 1, 1) ← A(i, j, 1, 1) * D(i, j, 1, 1)
Step 8. Perform an OTIS move on the contents of the A registers stored in step 7.
Step 9. Add the contents of the A registers of the group G(1, 1) to form the final result.

Time complexity: Step 1 requires constant time. Step 2 can be performed by first sending the contents of the A and B registers of a group down column one of that group in pipeline fashion (i.e., the content of the B register trails the content of the A register by one column processor). This requires $\sqrt{N}$ electronic moves. Next, the contents of the A and B registers are broadcast along rows in another $\sqrt{N}$ electronic moves using similar pipelining. Thus step 2 requires $2\sqrt{N}$ electronic moves. Further, each of the steps 6 and 9 requires $2(\sqrt{N} - 1)$ electronic moves, and steps 4 and 8 each require a single OTIS move. Therefore, the above algorithm requires $6\sqrt{N} - 4$ electronic moves plus 2 OTIS moves, which is less than that of Algorithm L1. The correctness of the above algorithm can be readily seen from the illustrations and the corresponding figures.

Remark 2. In Algorithm L2, we have assumed that input/output operations can be performed only through the boundary processors P(i, j, 1, 1), $1 \le i, j \le \sqrt{N}$, so that there is a need of N I/O pins, which is less than that of Algorithm L1. If we assume that the data elements $x_1, x_2, \ldots, x_N$ are initially stored in the two local registers as shown in Figs. 5 and 6, then the above algorithm requires only $4\sqrt{N} - 4$ electronic moves and one OTIS move (a total of $4\sqrt{N} - 3$ moves), which is optimal. This leads to the following theorem.

Theorem 2. Algorithm L2 can perform N-point Lagrange interpolation in optimal time, i.e., in $4\sqrt{N} - 4$ electronic moves and one OTIS move.

We can also modify Algorithm L2, as discussed for Algorithm L1, to run in $O(k\sqrt{N})$ time for the case when p² (p < n) processors are available.

3. Parallel algorithms for finding polynomial roots

Consider the following Nth degree polynomial equation:

$$P_N(x) = a_0 x^N + a_1 x^{N-1} + a_2 x^{N-2} + \cdots + a_{N-1} x + a_N = 0$$
Let us assume that the coefficients $a_i$, i = 0, 1, 2, \ldots, N, are real. The Durand–Kerner iterative scheme for finding the roots of the above polynomial equation is the following [10]:

$$x_i^{(k+1)} = x_i^{(k)} - \frac{P_N\big(x_i^{(k)}\big)}{\prod_{j=1,\, j \ne i}^{N} \big(x_i^{(k)} - x_j^{(k)}\big)}, \quad i = 1, 2, \ldots, N$$

The Durand–Kerner method is locally convergent with a quadratic rate of convergence [10]. It can be noted that the principal computation of the above method is very similar to that of Lagrange interpolation. Therefore we develop parallel algorithms for the Durand–Kerner method along the lines of Algorithms L1 and L2 described for Lagrange interpolation in the previous section.

Algorithm D1.

Steps 1 through 7. Same as Algorithm L1, omitting the C registers.
Step 8. For all i, j, k, l = 1 to $\sqrt{N}$ do in parallel
  If A(i, j, k, l) ≠ B(i, j, k, l) then
    A(i, j, k, l) ← A(i, j, k, l) − B(i, j, k, l)
  Else A(i, j, k, l) ← 1
Step 9. Form the local product with the contents of the A registers along the same row of each group and store it in A(i, j, k, 1), $1 \le i, j, k \le \sqrt{N}$.
Step 10. Perform an OTIS move on the contents of the A registers stored in step 9.
Step 11. Form the local product with the contents of the A registers along the same row of each group G(*, 1) and store it in A(i, 1, k, 1), $1 \le i, k \le \sqrt{N}$.
Step 12. For all i, k = 1 to $\sqrt{N}$ do in parallel
  Input $y_{\sqrt{N}(k-1)+i}$ to D(i, 1, k, 1)
  Input $x_{\sqrt{N}(k-1)+i}$ to C(i, 1, k, 1)
Step 13. For all i, k = 1 to $\sqrt{N}$ do in parallel
  A(i, 1, k, 1) ← C(i, 1, k, 1) − D(i, 1, k, 1) / A(i, 1, k, 1)
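For reference, the scalar iteration that Algorithm D1 distributes over the mesh can be written down directly. The following is a minimal NumPy sketch of ours, under the usual assumptions (distinct complex starting values on a spiral; durand_kerner is a hypothetical name):

```python
import numpy as np

def durand_kerner(coeffs, iters=100):
    """All roots of P_N(x) = a_0 x^N + ... + a_N by the Durand-Kerner scheme:
    x_i <- x_i - P_N(x_i) / prod_{j != i} (x_i - x_j)."""
    coeffs = np.asarray(coeffs, complex)
    N = len(coeffs) - 1
    x = (0.4 + 0.9j) ** np.arange(N)           # distinct complex start values
    for _ in range(iters):
        p = np.polyval(coeffs, x) / coeffs[0]  # P_N(x_i), normalized to monic
        diff = x[:, None] - x[None, :]         # x_i - x_j
        np.fill_diagonal(diff, 1.0)            # skip the j = i factor
        x = x - p / diff.prod(axis=1)
    return x

print(np.sort_complex(durand_kerner([1, -6, 11, -6])))  # roots of (x-1)(x-2)(x-3)
```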
Time complexity: Each of the steps 2, 4, 6, 7, 9 and 11 requires $\sqrt{N} - 1$ electronic moves, and steps 5 and 10 each require a single OTIS move. Therefore, this version of the algorithm requires $6\sqrt{N} - 6$ electronic moves and 2 OTIS moves. As for Algorithm L1, if we assume that the data elements $x_1, x_2, \ldots, x_N$ initially exist in the local registers as shown in Figs. 2 and 3, it requires only $2\sqrt{N} - 2$ electronic moves and one OTIS move (steps 9 and 11 each requiring $\sqrt{N} - 1$ electronic moves and step 10 requiring one OTIS move). The correctness and the scalability of the algorithm can be established in the same way as for Algorithm L1.

Algorithm D2.

Steps 1 through 4. Same as Algorithm L2, omitting the B and D registers.
Step 5. For all i, j, k, l = 1 to $\sqrt{N}$ do in parallel
  If C(i, j, k, l) ≠ A(i, j, k, l) then
    A(i, j, k, l) ← C(i, j, k, l) − A(i, j, k, l)
  Else A(i, j, k, l) ← 1
Step 6. Form the local product with the contents of the A registers on each group and store it in A(i, j, 1, 1), $1 \le i, j \le \sqrt{N}$.
Step 7. For all i, j = 1 to $\sqrt{N}$ do in parallel
  Input $y_{\sqrt{N}(i-1)+j}$ to D(i, j, 1, 1)
  Input $x_{\sqrt{N}(i-1)+j}$ to C(i, j, 1, 1)
Step 8. For all i, j = 1 to $\sqrt{N}$ do in parallel
  A(i, j, 1, 1) ← C(i, j, 1, 1) − D(i, j, 1, 1) / A(i, j, 1, 1)
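Analogously to the row–column simulation of Algorithm L1 above, the group-mapped computation of Algorithm D2 can be checked at the array level. The sketch below is our own illustration under stated assumptions: dk_step_D2 is a hypothetical name, the group mapping of Figs. 5 and 6 is taken as pre-loaded, ps[g] plays the role of the y value $P_N(x_g)$ fed in step 7, and the OTIS move of step 8 is collapsed into the final reshape.

```python
import numpy as np

def dk_step_D2(xs, ps):
    """One Durand-Kerner update via the group mapping of Algorithm D2.

    xs[g] are the current (distinct) root estimates, ps[g] = P_N(xs[g]).
    Registers are modeled as 4D arrays indexed (gx, gy, px, py)."""
    N = len(xs)
    n = int(round(N ** 0.5))
    xs, ps = np.asarray(xs, complex), np.asarray(ps, complex)

    # Group mapping (Figs. 5 and 6): C holds the group's own x in every
    # processor; A holds x_1..x_N in row-major order within every group.
    C = xs.reshape(n, n, 1, 1) * np.ones((1, 1, n, n))
    A = np.ones((n, n, 1, 1)) * xs.reshape(1, 1, n, n)
    # Step 5: A <- C - A where C != A, else 1.
    A = np.where(C != A, C - A, 1.0)
    # Step 6: local product over each whole group -> prod_{j != g}(x_g - x_j).
    denom = A.prod(axis=(2, 3))
    # Steps 7-8: x_g <- x_g - P_N(x_g)/denominator, one result per group.
    return (xs.reshape(n, n) - ps.reshape(n, n) / denom).ravel()

coeffs = np.array([1, -10, 35, -50, 24], complex)  # (x-1)(x-2)(x-3)(x-4), N = 4
x = (0.4 + 0.9j) ** np.arange(4)
for _ in range(80):
    x = dk_step_D2(x, np.polyval(coeffs, x))
print(np.sort_complex(x))                          # approx [1, 2, 3, 4]
```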
Table 1
Comparison between algorithms

Algorithm                            Mapping      Electronic moves   OTIS moves
1st version of Lagrange (L1)         Row–column   $8\sqrt{N} - 6$    3
2nd version of Lagrange (L2)         Group        $6\sqrt{N} - 4$    2
1st version of Durand–Kerner (D1)    Row–column   $6\sqrt{N} - 6$    2
2nd version of Durand–Kerner (D2)    Group        $4\sqrt{N} - 4$    1
Time complexity: Each of the steps 2 and 6 requires $2(\sqrt{N} - 1)$ electronic moves, and step 4 requires a single OTIS move. Therefore, the above algorithm requires $4\sqrt{N} - 4$ electronic moves plus one OTIS move. Assuming the data elements are initially stored, this algorithm can be performed in $2\sqrt{N} - 2$ electronic moves and one OTIS move. The scalability of Algorithm D2 can be established in the same way as for Algorithm L2.

4. Conclusion

In this paper, we have presented several versions of parallel algorithms for Lagrange interpolation and the Durand–Kerner method on an $N^2$-processor OTIS-mesh. The algorithms are based on two ways of mapping the initial data elements, namely row–column mapping and group mapping. The results are summarized in Table 1. It can be seen that group mapping is superior to row–column mapping with respect to both time complexity and I/O pin requirements. We have also shown that if the data elements are initially stored in the corresponding mapping, then Algorithm L1 is near optimal with a time complexity of $4\sqrt{N} - 4$ electronic moves and 2 OTIS moves, and L2 is optimal with $4\sqrt{N} - 4$ electronic moves and one OTIS move. Under the same assumption, both Algorithms D1 and D2 can be shown to run in only $2\sqrt{N} - 2$ electronic moves and one OTIS move. The scalability of the algorithms has also been studied.

It is important to note that the parallel computation of the Hermite interpolation polynomial (as given in the Appendix), which provides more accuracy, is similar to that of the Lagrange polynomial. This is because the parallelization of both these methods mainly depends on p(x) and p'(x_i). Therefore, the parallel algorithms for Lagrange interpolation presented in this paper can easily be modified to implement Hermite interpolation with some constant number of additional communication moves. A similar remark can also be made for another polynomial root finding method, namely the Ehrlich method (see the Appendix), as it is computationally similar to the Durand–Kerner method. It is important to note that the Ehrlich method has a higher (cubic) rate of convergence than the Durand–Kerner method (quadratic).

Acknowledgements

The first version of this paper appeared in the proceedings of the 6th International Workshop on Distributed Computing (IWDC 2004), December 27–30, 2004, Indian Statistical Institute, Kolkata, India. The author thanks the anonymous reviewers of both IWDC 2004 and the Parallel Computing journal for their valuable comments and suggestions, which greatly helped in improving the results.

Appendix

Hermite formula [12]:

$$f(x) = \sum_{i=1}^{N} h_i(x) f(x_i) + \sum_{i=1}^{N} \bar{h}_i(x) f'(x_i)$$

where

$$h_i(x) = \big[1 - 2L_i'(x_i)(x - x_i)\big]\,[L_i(x)]^2, \qquad \bar{h}_i(x) = (x - x_i)\,[L_i(x)]^2, \qquad L_i(x) = \frac{p(x)}{(x - x_i)\, p'(x_i)}, \quad i = 1, 2, \ldots, N$$

and f'(x_i) and $L_i'(x_i)$ denote the values of the derivatives of f(x) and $L_i(x)$, respectively, at the point x = x_i.
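For reference, a direct sequential evaluation of this formula, in the same style as the Lagrange sketch in Section 2 (hermite_eval is our own illustrative name; it uses $L_i'(x_i) = \sum_{j \ne i} 1/(x_i - x_j)$, which follows from the definition of $L_i$):

```python
import numpy as np

def hermite_eval(xs, fs, dfs, x):
    """Evaluate the Hermite interpolant
    f(x) = sum_i h_i(x) f(x_i) + sum_i hbar_i(x) f'(x_i)."""
    xs, fs, dfs = (np.asarray(a, float) for a in (xs, fs, dfs))
    diff = xs[:, None] - xs[None, :]
    np.fill_diagonal(diff, 1.0)
    dp = diff.prod(axis=1)                     # p'(x_i)
    p = np.prod(x - xs)                        # p(x)
    L = p / ((x - xs) * dp)                    # L_i(x)
    Lp = (1.0 / diff).sum(axis=1) - 1.0        # L_i'(x_i); drop the j = i term
    h = (1.0 - 2.0 * Lp * (x - xs)) * L**2
    hbar = (x - xs) * L**2
    return np.sum(h * fs + hbar * dfs)

xs = [0.0, 1.0, 2.0]
fs = [0.0, 1.0, 16.0]                          # f(t) = t^4
dfs = [0.0, 4.0, 32.0]                         # f'(t) = 4 t^3
print(hermite_eval(xs, fs, dfs, 1.5))          # 5.0625, exact up to degree 2N-1
```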
Ehrlich iterative scheme [7]:

$$x_i^{(k+1)} = x_i^{(k)} - \frac{a_i^{(k)}}{1 - a_i^{(k)} b_i^{(k)}}, \quad i = 1, 2, \ldots, N$$

where

$$a_i^{(k)} = \frac{P_N\big(x_i^{(k)}\big)}{P_N'\big(x_i^{(k)}\big)}, \qquad b_i^{(k)} = \sum_{j=1,\, j \ne i}^{N} \frac{1}{x_i^{(k)} - x_j^{(k)}}$$

and

$$P_N'\big(x_i^{(k)}\big) = \frac{dP_N(x)}{dx} \ \text{at}\ x = x_i^{(k)}.$$
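The corresponding iteration, sketched in the same style as the Durand–Kerner example in Section 3 (again our own illustration with a hypothetical name, using the same standard starting values):

```python
import numpy as np

def ehrlich(coeffs, iters=50):
    """All roots of P_N by the Ehrlich scheme:
    x_i <- x_i - a_i / (1 - a_i * b_i)."""
    coeffs = np.asarray(coeffs, complex)
    dcoeffs = np.polyder(coeffs)               # coefficients of P_N'
    N = len(coeffs) - 1
    x = (0.4 + 0.9j) ** np.arange(N)           # distinct complex start values
    for _ in range(iters):
        a = np.polyval(coeffs, x) / np.polyval(dcoeffs, x)  # P_N(x_i)/P_N'(x_i)
        diff = x[:, None] - x[None, :]
        np.fill_diagonal(diff, 1.0)
        b = (1.0 / diff).sum(axis=1) - 1.0     # sum_{j != i} 1/(x_i - x_j)
        x = x - a / (1.0 - a * b)
    return x

print(np.sort_complex(ehrlich([1, -6, 11, -6])))  # roots of (x-1)(x-2)(x-3)
```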
References

[1] S.G. Akl, Parallel Sorting Algorithms, Academic Press, Orlando, FL, 1985.
[2] S.G. Akl, The Design and Analysis of Parallel Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1989.
[3] H.S. Azad et al., An efficient parallel algorithm for Lagrange interpolation and its performance, in: Proc. High Performance Computing in the Asia–Pacific Region, Beijing, China, 2000, pp. 593–598.
[4] H.S. Azad et al., Parallel Lagrange interpolation on the star graph, J. Parallel Distrib. Comput. 62 (4) (2002) 605–621.
[5] M. Cosnard, P. Fraigniaud, Finding the roots of a polynomial on an MIMD multicomputer, Parallel Comput. 15 (1990) 75–85.
[6] E. Durand, Solutions Numériques des Équations Algébriques, Tome I, Masson, Paris, 1960.
[7] L.W. Ehrlich, A modified Newton method for polynomials, Comm. ACM 10 (1967) 107–108.
[8] M. Feldman, S. Esener, C. Guest, S. Lee, Comparison between electrical and free space optical interconnects based on power and speed considerations, Appl. Opt. 27 (9) (1988).
[9] T.L. Freeman, M.K. Bane, Asynchronous polynomial zero-finding algorithms, Parallel Comput. 17 (1991) 673–681.
[10] T.L. Freeman, Calculating polynomial zeros on a local memory parallel computer, Parallel Comput. 12 (1989) 351–358.
[11] B. Goertzel, Lagrange interpolation on a processor tree with ring connections, J. Parallel Distrib. Comput. 22 (2) (1994) 321–323.
[12] F.B. Hildebrand, Introduction to Numerical Analysis, McGraw-Hill, New York, 1956.
[13] P.K. Jana, B.P. Sinha, R. Datta Gupta, Efficient parallel algorithms for finding polynomial zeroes, in: Proc. 6th Int. Conf. on Advanced Computing, CDAC, Pune University Campus, India, December 14–16, 1999, pp. 189–196.
[14] P.K. Jana, Finding polynomial zeroes on a multi-mesh of trees (MMT), in: Proc. 2nd Int. Conf. on Information Technology, Bhubaneswar, December 20–22, 1999, pp. 202–206.
[15] P.K. Jana, B.P. Sinha, Fast parallel algorithm for polynomial interpolation, Comput. Math. Appl. 29 (4) (1997) 85–92.
[16] P.K. Jana, B.P. Sinha, Efficient parallel algorithms for Lagrange and Hermite interpolation, Int. J. Appl. Sci. Comput. 4 (2) (1997) 118–136.
[17] J. JáJá, An Introduction to Parallel Algorithms, Addison-Wesley, Reading, MA, 1992.
[18] I.O. Kerner, Ein Gesamtschrittverfahren zur Berechnung der Nullstellen von Polynomen, Numer. Math. 8 (1966) 290–294.
[19] A. Kiamilev, P. Marchand, A. Krishnamoorthy, S. Esener, S. Lee, Performance comparison between optoelectronic and VLSI multistage interconnection networks, J. Lightwave Technol. 9 (12) (1991).
[20] G. Marsden, P. Marchand, P. Harvey, S. Esener, Optical transpose interconnection system architectures, Opt. Lett. 18 (13) (1993) 1083–1085.
[21] A. Osterloh, Sorting on the OTIS-mesh, in: Proc. 14th Int. Parallel and Distributed Processing Symp. (IPDPS 2000), 2000, pp. 269–274.
[22] S. Rajasekaran, S. Sahni, Randomized routing, selection, and sorting on the OTIS-mesh, IEEE Trans. Parallel Distrib. Syst. 9 (9) (1998) 833–840.
[23] T.A. Rice, L.H. Jamieson, A highly parallel algorithm for root extraction, IEEE Trans. Comput. 38 (3) (1989) 443–449.
[24] S. Sahni, Models and algorithms for optical and optoelectronic parallel computers, Int. J. Foundations Comput. Sci. 12 (3) (2001) 249–264.
[25] S. Sengupta, D. Das, B.P. Sinha, M. Ghosh, A fast parallel algorithm for polynomial interpolation using Lagrange's formula, in: Proc. Int. Conf. on High Performance Computing (HiPC), New Delhi, 1995, pp. 701–706.
[26] C.-F. Wang, S. Sahni, OTIS optoelectronic computers, in: K. Li, Y. Pan, S.Q. Zheng (Eds.), Parallel Computation Using Optical Interconnection, Kluwer Academic, 1998.
[27] C.-F. Wang, S. Sahni, Image processing on the OTIS-mesh optoelectronic computer, IEEE Trans. Parallel Distrib. Syst. 11 (2) (2000) 97–109.
[28] C.-F. Wang, S. Sahni, Matrix multiplication on the OTIS-mesh optoelectronic computer, IEEE Trans. Comput. 50 (7) (2001).
[29] C.-F. Wang, S. Sahni, Basic operations on the OTIS-mesh optoelectronic computer, IEEE Trans. Parallel Distrib. Syst. 9 (12) (1998).
[30] C.-F. Wang, S. Sahni, BPC permutations on the OTIS-mesh optoelectronic computer, in: Proc. 4th Int. Conf. on Massively Parallel Processing Using Optical Interconnections (MPPOI '97), 1997, pp. 130–135.
[31] F. Zane, P. Marchand, R. Paturi, S. Esener, Scalable network architectures using the optical transpose interconnection system (OTIS), J. Parallel Distrib. Comput. 60 (5) (2000) 521–538.