Parallel Computing 12 (1989) 351-358
North-Holland

Calculating polynomial zeros on a local memory parallel computer

T.L. FREEMAN
Department of Mathematics, University of Manchester, Manchester M13 9PL, UK

Received January 1989
Abstract. We consider the calculation, on a local memory parallel computer, of all the zeros of an $n$th degree polynomial $P_n(x)$ which has real coefficients. We describe a generic parallel algorithm which approximates all the zeros simultaneously and we give three specific examples of this algorithm which have orders of convergence two, three and four. We report extensive numerical tests of the algorithms; the fourth order algorithm is not robust, with many failures to converge, whereas the other two algorithms are reliable and display very respectable parallel speedups for the higher degree polynomials.
Keywords. Parallel algorithms, local memory parallel computers, transputers, polynomial zeros, simultaneous polynomial root-finders.
1. Introduction

In this paper we consider the problem of calculating all the zeros of the $n$th degree polynomial
$$P_n(x) = x^n + a_1 x^{n-1} + a_2 x^{n-2} + \cdots + a_{n-1} x + a_n, \tag{1.1}$$
where the coefficients $a_i$, $i = 1, 2, \ldots, n$, are assumed to be real. Note that, without loss of generality, $P_n(x)$ is assumed to be normalised so that the coefficient of $x^n$ is unity.

This form of the problem (calculating all, rather than a few of, the zeros) is the one usually considered by software library routines (see, for example, the N.A.G. library [16], the I.M.S.L. library [11] or the Harwell library [15]). A characteristic shared by these library routines is that they proceed by calculating the zeros one or two at a time and eliminating the calculated zeros using deflation. To maintain the stability of this deflation process the zeros are found in ascending order of magnitude (a small sketch of this deflation step is given at the end of this section). It is clear that parallelising these existing library routines will involve fine grain parallelism, since the only possibility is to exploit parallelism within the iteration for a single zero or pair of zeros, and the number of floating-point operations in this calculation is O(n); the iteration and the deflation step impose an essentially sequential character on the algorithms.

There exists an alternative class of algorithms which calculate approximations to all the polynomial zeros simultaneously. These algorithms thus have the potential for a coarser grain parallelism. In the next section we consider this class of algorithms and describe three members of it which are suitable for implementation on a local memory MIMD parallel computer. In Section 3 we summarise the performance of these algorithms on a set of test problems and on different numbers of processors.
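To make the deflation step concrete, the following minimal sketch (Python; our own illustration, not code from any of the cited libraries) divides out a single computed real zero by synthetic division. For a complex conjugate pair of zeros the library routines instead remove a real quadratic factor.

```python
def deflate(a, r):
    """Divide P_n(x), with coefficients a = [1, a1, ..., an], by (x - r).

    Returns the coefficients of the degree n-1 quotient; the remainder,
    P_n(r), is discarded (it is negligible when r is an accurate zero).
    """
    q = [a[0]]
    for c in a[1:-1]:
        q.append(c + r * q[-1])   # Horner-style synthetic division
    return q
```

Finding the zeros in ascending order of magnitude keeps the coefficient growth in this recurrence, and hence the accumulated rounding error, under control.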
2. Simultaneous polynomial root-finders

There exist several different algorithms for simultaneously approximating the zeros of $P_n(x)$. Often the algorithms differ in their order of convergence although they share the same basic structure
$$x_i^{(k+1)} = x_i^{(k)} - \frac{P_n(x_i^{(k)})}{\phi_i(x_1^{(k)}, x_2^{(k)}, \ldots, x_n^{(k)})}, \qquad i = 1, 2, \ldots, n, \tag{2.1}$$
where $x_i^{(k)}$ is the $k$th iterative approximation to the $i$th zero of $P_n(x)$ and $\phi_i$ is a function of all the current approximate zeros which becomes increasingly complicated as the order of the algorithm increases. A local memory parallel version of this generic algorithm can be implemented by performing the updating of the approximate zeros, that is the calculation of $P_n(x_i^{(k)})$ and $\phi_i(x_1^{(k)}, x_2^{(k)}, \ldots, x_n^{(k)})$, on separate processors. At the end of each iteration the processors need to communicate their current approximate zeros to each other, but there is no other communication required by the algorithm. We assume that there are $p \le n$ processors and that the $l$th processor handles $j_l$ approximate zeros. Thus the first processor handles the first $j_1$ approximate zeros, the second processor handles the next $j_2$ approximate zeros, and so on. Notice that $n = \sum_{l=1}^{p} j_l$. If we write $i_l = \sum_{m=1}^{l-1} j_m$, $l = 1, 2, \ldots, p$, then the algorithm can be summarized as follows:
Step 1. (i) Set $k = 1$; (ii) define initial approximations $x_i^{(1)}$, $i = 1, 2, \ldots, n$.
Step 2. In parallel, for $l = 1, 2, \ldots, p$, for $i = i_l + 1, i_l + 2, \ldots, i_l + j_l$:
  (i) calculate $p_i^{(k)} = P_n(x_i^{(k)})$,
  (ii) calculate $q_i^{(k)} = \phi_i(x_1^{(k)}, \ldots, x_{i_l}^{(k)}, x_{i_l+1}^{(k+1)}, \ldots, x_{i-1}^{(k+1)}, x_i^{(k)}, \ldots, x_n^{(k)})$,
  (iii) set $x_i^{(k+1)} = x_i^{(k)} - p_i^{(k)} / q_i^{(k)}$.
Step 3. For $i = 1, 2, \ldots, n$, communicate $x_i^{(k+1)}$ to all the other processors.
Step 4. (i) Check for convergence; (ii) set $k = k + 1$; (iii) return to Step 2.
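The following sketch makes this structure concrete (Python, with the $p$ processors simulated serially; `blocks` plays the role of the partition into groups of $j_l$ zeros, and the function name and the simple displacement-based convergence test are our own — the algorithms in this paper use the Adams test discussed at the end of this section):

```python
def simultaneous_iteration(pn, x, blocks, phi, tol=1e-12, max_iter=200):
    """Generic algorithm of Section 2, with the processors simulated serially.

    pn     -- callable evaluating P_n(z)
    x      -- list of n initial approximations (complex), Step 1
    blocks -- list of index ranges, one per simulated processor
    phi    -- phi(z, i): the denominator function of (2.1)
    """
    for k in range(max_iter):
        x_new = list(x)
        for block in blocks:                 # Step 2: in parallel over processors
            z = list(x)                      # each processor's private copy
            for i in block:
                z[i] = z[i] - pn(z[i]) / phi(z, i)   # Gauss-Seidel inside a block
                x_new[i] = z[i]
        # Step 3: all-to-all exchange of x_new; Step 4: convergence test
        if max(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new, k + 1
        x = x_new                            # Jacobi across blocks
    return x, max_iter
```

With a single block the sweep is fully Gauss-Seidel, which is why the one-slave runs reported in Section 3 usually need the fewest iterations.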
Note that, on the $l$th processor, the iteration in Step 2 is implemented in a Gauss-Seidel fashion with respect to those approximate zeros which are also calculated on the $l$th processor (the most recent values $x_{i_l+1}^{(k+1)}, x_{i_l+2}^{(k+1)}, \ldots, x_{i-1}^{(k+1)}$, $i_l + 1 \le i \le i_l + j_l$, are used to calculate $\phi_i$ in Step 2(ii)). However, the iteration is implemented in a Jacobi fashion with respect to the remaining approximate zeros. To implement the algorithm in a Gauss-Seidel fashion with respect to all the approximate zeros would require communication within each iteration and this would destroy the parallelism of Step 2 of the algorithm.

The best-known simultaneous polynomial root-finding algorithm is the Durand-Kerner iteration (see [1,6,14]), which has order of convergence two. This iteration is given by (2.1) with
$$\phi_i(x_1, x_2, \ldots, x_n) = \prod_{\substack{j=1 \\ j \ne i}}^{n} (x_i - x_j). \tag{2.2}$$
Thus a parallel version of the Durand-Kerner algorithm, subsequently referred to as Algorithm 1, results when we use this form for $\phi_i$ in Step 2(ii) above. The efficiency could be improved if this iteration were implemented in a Gauss-Seidel fashion with respect to all the approximate zeros (see [3]). However we have already noted that such an implementation would destroy the parallelism in Step 2 of the algorithm.

A third-order simultaneous polynomial root-finding algorithm which is again suitable for parallel implementation is given in [7] (see also [4,5]),
$$x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha_i^{(k)}}{1 - \alpha_i^{(k)} \beta_i^{(k)}}, \qquad i = 1, 2, \ldots, n, \tag{2.3}$$
where
$$\alpha_i^{(k)} = \frac{P_n(x_i^{(k)})}{P_n'(x_i^{(k)})} \tag{2.4}$$
and
$$\beta_i^{(k)} = \sum_{\substack{j=1 \\ j \ne i}}^{n} \frac{1}{x_i^{(k)} - x_j^{(k)}}. \tag{2.5}$$
Clearly this iteration can be written in the form (2.1), and a parallel version of the algorithm, Algorithm 2, results from incorporating the iteration (2.3) in Step 2 of the generic algorithm given earlier. The calculation of the corrections in (2.3) requires about twice as many arithmetic operations as the calculation of the corrections in the Durand-Kerner iteration. Thus Algorithm 2 has a coarser grain parallelism (it performs more computation between communications) than Algorithm 1 and we expect Algorithm 2 to display improved parallel speedups. Of course the actual time taken by either algorithm also depends on the number of iterations required to satisfy the convergence conditions.

We could further improve the granularity of the algorithm by using an even higher order iteration formula. For example the fourth-order formula of Farmer and Loizou [8] is
$$x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha_i^{(k)} \bigl(1 - \delta_i^{(k)} \alpha_i^{(k)}\bigr)}{1 - 2\delta_i^{(k)} \alpha_i^{(k)} + \bigl((\delta_i^{(k)})^2 - \gamma_i^{(k)}\bigr) (\alpha_i^{(k)})^2}, \tag{2.6}$$
where
$$\delta_i^{(k)} = \frac{P_n''(x_i^{(k)})}{2 P_n'(x_i^{(k)})} \tag{2.7}$$
and
$$\gamma_i^{(k)} = \sum_{\substack{j=1 \\ j \ne i}}^{n} \frac{1}{(x_i^{(k)} - x_j^{(k)})^2}, \tag{2.8}$$
and $\alpha_i^{(k)}$ is given by (2.4). A parallel fourth-order algorithm, Algorithm 3, is obtained when we incorporate (2.6) in Step 2 of the generic algorithm.

There exist several other simultaneous polynomial root-finding algorithms (see [18] for example). Some of these algorithms, such as that originally proposed by Nourein [17], require the transfer, during each iteration, of intermediate calculations and are thus not suitable for implementation on a local memory parallel computer. Others are Gauss-Seidel implementations, which we have already noted are not suitable.

There are two parts of Algorithms 1, 2 and 3 which we have not yet considered in detail. Firstly, Step 1(ii) requires the initial approximations to be defined; a suitable definition is given in Section 4 of [1]. Secondly, Step 4(i) requires a test for convergence. Adams [2] suggested a test based on rounding error bounds. The Adams test can be performed simultaneously with the evaluation of $P_n(x_i^{(k)})$ in Step 2(i) of the algorithms. Thus the convergence tests on the
individual approximate zeros are performed in parallel, and Step 3 of the algorithm communicates $conv_i$ (a boolean variable indicating the convergence or otherwise of $x_i^{(k)}$) in addition to $x_i^{(k+1)}$ to all the other processors. We consider the performance of the algorithms on a collection of test problems in the next section.
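As an illustration, the denominators $q_i^{(k)}$ of (2.1) for Algorithms 1 and 2 could be coded as below (a minimal Python sketch, not the paper's occam2 implementation; the function names are ours, and the sketch uses the elementary rewriting of (2.3) in the form (2.1) with denominator $P_n'(x_i^{(k)}) - P_n(x_i^{(k)})\,\beta_i^{(k)}$):

```python
def horner(a, z):
    """Evaluate P_n(z) and P_n'(z) for coefficients a = [1, a1, ..., an]."""
    p, dp = a[0], 0j
    for c in a[1:]:
        dp = dp * z + p
        p = p * z + c
    return p, dp

def q_algorithm1(a, z, i):
    """phi_i of (2.2): the Durand-Kerner (second-order) denominator."""
    q = 1 + 0j
    for j in range(len(z)):
        if j != i:
            q = q * (z[i] - z[j])
    return q

def q_algorithm2(a, z, i):
    """Denominator for the third-order step: (2.3) with alpha = P_n/P_n'
    and beta as in (2.5) equals x_i - P_n/(P_n' - P_n * beta)."""
    p, dp = horner(a, z[i])
    beta = sum(1 / (z[i] - z[j]) for j in range(len(z)) if j != i)
    return dp - p * beta
```

Either function can be passed as `phi` to the sketch given in Section 2 (e.g. `phi=lambda z, i: q_algorithm1(a, z, i)`); a corresponding function for Algorithm 3 would add the $P_n''$ and $\gamma_i^{(k)}$ terms of (2.6)-(2.8) in the same way.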
3. Computational performance

The numerical experiments have been performed on an ITEM (Inmos Transputer Evaluation Module) consisting of eight T800 slave transputers [12] connected to a B004 board with a single T800 master transputer, with inter-processor links running at 10 Mbits/s. The algorithms described in the previous section have been programmed in occam2 [13] using REAL64 variables (about 15 significant decimal digits). They have been implemented with the calculation of the initial approximations performed on the master transputer and the iterations performed on the slave transputers. The slave transputers are configured as a linear chain and the communications in each direction in Step 3 of the algorithms are implemented in parallel. In addition the transputers can perform floating-point calculations in parallel with communications across their links. Hence the algorithm can perform Step 2(i) of the next iteration (the calculation of $P_n(x_i^{(k+1)})$, which does not depend on $x_j^{(k+1)}$, $j \ne i$) in parallel with the communications of Step 3 of the current iteration.

The number of slave transputers used (as opposed to the number available) is chosen as follows. If the number of available slave transputers is $m$, then for any value of the degree $n$, at least one of the slave transputers must handle $n_z$ approximate polynomial zeros, where $n_z = ((n-1)/m) + 1$ and the division is an integer division. Thus the minimum parallel computing time of each iteration is proportional to $n_z$. The number of slaves used is $s$, where $s$ is the smallest integer such that $s \cdot n_z \ge n$. If $s \cdot n_z = n + k$, then $k$ of the slaves each handle $n_z - 1$ approximate zeros and $s - k$ of the slaves each handle $n_z$ approximate zeros. The slaves handling the larger number of approximate zeros are chosen to be those in the middle of the linear chain, since this is the most favourable place in terms of communication delays. This strategy for deciding how many of the available slaves to use is designed to spread the computation evenly across the slaves whilst keeping communication costs to a minimum. To use more slaves would increase the communication costs, since the same number of data items would need to be transferred across a greater number of links. It is possible to develop a strategy that uses all the available slaves, although this will increase the communication costs and will not necessarily reduce the computation time; we return to this point later. (A small sketch of this allocation rule is given below.)

In Tables 1 and 2 we list the observed speedups [(total time with one slave transputer)/(total time with 2, 4 or 8 slave transputers)] of Algorithms 1, 2 and 3 on the two tables of problems given in [10] (see also [19]). The number given in brackets after a speedup is the number of slaves used, if it differs from the number available. **** indicates that the algorithm failed with the given number of slaves and ** indicates that the algorithm failed with one slave, so that a speedup could not be calculated. These failures occur because the iteration count reaches its maximum value, which is set to 10 × degree for these results. The first observation from these tables is that, although Algorithm 3 displays good speedups for the higher degree polynomials, it is so unreliable that we have excluded it from any further consideration.
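A minimal sketch of the allocation rule described above — strategy (i) in the terminology introduced at the end of this section — follows (Python; the function name is ours):

```python
def strategy_i(n, m):
    """Choose the number of slaves s and the per-slave workloads for a
    degree-n polynomial with m available slaves, as described above."""
    n_z = (n - 1) // m + 1       # zeros on the busiest slave (integer division)
    s = -(-n // n_z)             # smallest s with s * n_z >= n
    k = s * n_z - n              # k slaves handle one zero fewer
    loads = [n_z] * s
    for t in range(k):           # lighter loads at the ends of the chain, so the
        loads[t // 2 if t % 2 == 0 else s - 1 - t // 2] = n_z - 1
    assert sum(loads) == n       # middle slaves carry the larger blocks
    return s, loads
```

For example, `strategy_i(19, 8)` gives $n_z = 3$ and $s = 7$, consistent with the '(7)' entries for the degree-19 polynomial in Table 1.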
The remaining algorithms (Algorithms 1 and 2) are much more robust and failed on only one occasion: Algorithm 1 failed to converge on problem 4 of Table 1 within the upper limit of 30 iterations; however, the algorithm did converge in 33 iterations when 2 slave transputers were used and in 34 iterations when 3 slave transputers were used. In contrast, the Algorithm 3 failures remain when the maximum iteration count is increased to 20 × degree.
Table 1
Speedups for the polynomials of Table 1 of [10]

Problem  Degree  Algorithm 1                 Algorithm 2                 Algorithm 3
                 2      4        8           2      4        8           2      4        8
1        3       0.92   0.99(3)  0.99(3)     1.13   1.26(3)  1.26(3)     1.15   1.23(3)  1.23(3)
2        3       1.0    1.01(3)  1.01(3)     1.06   1.17(3)  1.17(3)     1.14   1.44(3)  1.44(3)
3        3       0.84   0.89(3)  0.89(3)     0.99   1.10(3)  1.10(3)     1.15   1.37(3)  1.37(3)
4        3       ****   ****(3)  ****(3)     1.03   1.23(3)  1.23(3)     1.16   1.61(3)  1.61(3)
5        3       0.91   0.99(3)  0.99(3)     1.15   1.28(3)  1.28(3)     1.02   1.24(3)  1.24(3)
6        3       0.86   0.91(3)  0.91(3)     1.0    1.08(3)  1.08(3)     1.13   1.39(3)  1.39(3)
7        4       1.31   1.35     1.35(4)     1.25   1.37     1.37(4)     ****   ****     ****(4)
8        4       1.21   1.17     1.17(4)     1.26   1.26     1.26(4)     1.54   1.85     1.85(4)
9        4       1.15   1.12     1.12(4)     1.22   1.28     1.28(4)     ****   ****     ****(4)
10       4       1.07   1.03     1.03(4)     1.22   1.23     1.23(4)     1.47   1.73     1.73(4)
11       4       1.08   1.09     1.09(4)     1.22   1.35     1.35(4)     1.59   2.10     2.10(4)
12       4       1.15   1.17     1.17(4)     1.3    1.44     1.44(4)     ****   ****     ****(4)
13       4       1.08   1.02     1.02(4)     1.19   1.25     1.25(4)     ****   ****     ****(4)
14       4       1.04   0.99     0.99(4)     1.15   1.20     1.20(4)     1.39   1.84     1.84(4)
15       4       1.07   0.94     0.94(4)     1.22   1.30     1.30(4)     1.56   1.97     1.97(4)
16       4       1.38   1.28     1.28(4)     1.3    1.24     1.24(4)     ****   ****     ****(4)
17       4       1.14   1.07     1.07(4)     1.29   1.35     1.35(4)     ****   ****     ****(4)
18       5       1.15   1.22(3)  1.24(5)     1.17   1.33(3)  1.39(5)     1.30   1.52(3)  1.89(5)
19       6       1.24   1.35(3)  1.15(6)     1.4    1.60(3)  1.52(6)     1.59   1.86(3)  1.96(6)
20       6       1.37   1.52(3)  1.47(6)     1.42   1.56(3)  1.63(6)     1.75   2.14(3)  2.51(6)
21       6       1.34   1.47(3)  1.40(6)     1.46   1.62(3)  1.66(6)     2.27   3.02(3)  3.39(6)
22       6       1.24   1.48(3)  1.37(6)     1.47   1.63(3)  1.64(6)     1.84   2.01(3)  2.22(6)
23       6       1.29   1.39(3)  1.36(6)     1.38   1.55(3)  1.57(6)     1.59   1.96(3)  2.15(6)
24       6       1.54   1.77(3)  1.62(6)     1.42   1.57(3)  1.67(6)     ****   ****(3)  ****(6)
25       7       1.66   2.27     2.29(7)     1.41   1.75     1.89(7)     1.55   2.04     2.62(7)
26       7       1.45   1.88     1.89(7)     1.39   1.77     1.90(7)     **     **       **
27       10      1.46   1.88     1.97(5)     1.46   1.90     2.06(5)     1.69   2.51     2.09(5)
28       12      1.57   2.46     2.80(6)     1.64   2.48     2.89(6)     ****   2.58     ****(6)
29       13      1.53   2.17     3.01(7)     1.60   2.35     3.11(7)     ****   ****     ****(7)
30       15      1.55   2.41     3.03        1.68   2.68     3.49        ****   2.81
31       18      1.41   2.32     3.68(6)     1.71   2.54     3.57(6)     1.89   3.03     4.23(6)
32       19      1.24   2.60     3.84(7)     1.58   2.75     3.64(7)     1.76   3.09     4.53(7)
Table 2
Speedups for the polynomials of Table 2 of [10]

Problem  Degree  Algorithm 1                 Algorithm 2                 Algorithm 3
                 2      4        8           2      4        8           2      4        8
1        6       1.31   1.52(3)  1.51(6)     1.50   1.69(3)  1.76(6)     1.45   1.84(3)  2.36(6)
2        6       1.25   1.38(3)  1.27(6)     1.31   1.45(3)  1.51(6)     1.55   1.79(3)  2.05(6)
3        7       0.99   1.41     1.62(7)     1.27   1.55     1.48(7)     1.45   1.86     ****(7)
4        7       1.31   1.56     1.71(7)     1.42   1.76     1.80(7)     1.56   2.21     2.55(7)
5        8       1.31   1.77     1.65        1.47   1.82     1.82        ****   ****     2.95
6        8       0.93   1.65     1.68        1.53   1.86     1.87        1.64   2.28     2.51
7        8       1.28   1.84     1.91        1.45   2.03     2.25        1.55   ****     3.15
8        9       1.45   1.89(3)  2.10(5)     1.52   2.02(3)  2.32(5)     ****   ****(3)  2.89(5)
9        9       1.38   1.73(3)  1.99(5)     1.49   1.98(3)  2.22(5)     ****   ****(3)  ****(5)
10       9       1.32   1.92(3)  2.06(5)     1.38   1.84(3)  2.11(5)     1.54   2.20(3)  2.67(5)
11       15      1.54   2.54     3.49        1.72   2.67     3.59        ****   ****     ****
12       18      1.70   2.53     3.63(6)     1.78   2.83     3.54(6)     ****   ****     ****(6)
13       36      1.55   3.04     5.50        1.75   3.51     5.42        1.95   2.51     ****
Table 3
Number of iterations required by the Durand-Kerner algorithm (a blank entry indicates that the configuration of slaves used, and hence the iteration count, is unchanged from the previous column)

Table  Problem  Degree  Number of processors
                        1    2    3    4    5    6    7    8
2      1        6       23   27   27             28
2      2        6       22   27   27             31
2      3        7       15   24   17   16                 19
2      4        7       26   29   29   30                 31
2      5        8       15   19   18   19                      18
2      6        8       15   28   28   18                      19
2      7        8       48   61   58   59                      59
2      8        9       59   66   66        68
2      9        9       31   36   38        37
2      10       9       23   27   25        26
2      11       15      62   68   71   69   69                 68
2      12       18      54   59   59   60   57   57
2      13       36      34   42   38   39   39   35            33
The parallel performances of Algorithms 1 and 2 are as expected. For low degree polynomials there is insufficient computation to justify the communication costs which result from using multiple slave processors, and the algorithms display no real parallel speedup. As the degree of the polynomial increases the algorithms are better able to exploit the multiple processors and some satisfactory speedups are achieved.

Measuring speedups based on total execution times quantifies the observed parallelism of the algorithm, but it is not the most illuminating way of analysing the parallelism of the iteration parts of the algorithms. Firstly, the calculation of the initial approximations is performed on the master processor and there is no attempt to parallelise this calculation. Secondly, the algorithms require differing numbers of iterations depending on the number of slave processors used. Table 3 lists the number of iterations required by the Durand-Kerner iteration for the polynomials of Table 2 of [10]. We see that the algorithm usually requires the smallest number of iterations when one slave processor is used, since the algorithm is then implemented in a fully Gauss-Seidel fashion (see [3]). As the number of slave processors is increased the number of iterations required by the algorithm tends to increase, but it does not increase monotonically. This behaviour is displayed by all the algorithms. It is to be expected since, as the number of slave processors increases, the effect of the Gauss-Seidel implementation on the individual processors tends to be weakened and the effect of the Jacobi implementation across the processors tends to dominate. Those approximate zeros which are calculated on the same slave processor have a more direct influence on each other than those approximate zeros which are calculated on the other slave processors. Our computations suggest that, in general, the number of iterations depends more on which sets of approximate zeros are calculated on a given slave processor than on the number of slave processors.

A more meaningful measure of the speedup achieved by the iterations of Algorithms 1 and 2 is given by [(the time per iteration on one slave transputer)/(the time per iteration on 2, 4 or 8 slave transputers)], where, in each case, the time to calculate the initial approximations is excluded. Table 4 lists these revised speedups for those test polynomials with degree greater than or equal to eight. In general Algorithm 2 exhibits marginally better parallelism than Algorithm 1; this is to be expected because of its coarser grain parallelism. The parallel performances of both algorithms are best for the high degree polynomials, since this is the case when there is the most computation per iteration.
Table 4
Revised speedups for the higher degree polynomials

Table  Problem  Degree  Algorithm 1                 Algorithm 2
                        2      4        8           2      4        8
1      27       10      1.73   2.28     2.71(5)     1.81   2.44     2.97(5)
1      28       12      1.70   2.71     3.20(6)     1.80   2.86     3.44(6)
1      29       13      1.59   2.40     3.31(7)     1.73   2.76     3.60(7)
1      30       15      1.74   2.88     3.80        1.72   2.87     3.87
1      31       18      1.87   3.02     4.10(6)     1.89   3.12     4.31(6)
1      32       19      1.80   3.12     4.24(7)     1.87   3.23     4.51(7)
2      5        8       1.70   2.26     2.22        1.72   2.42     2.43
2      6        8       1.73   2.22     2.15        1.62   2.25     2.25
2      7        8       1.64   2.32     2.40        1.69   2.51     2.79
2      8        9       1.64   2.16(3)  2.47(5)     1.69   2.27(3)  2.69(5)
2      9        9       1.63   2.19(3)  2.46(5)     1.61   2.30(3)  2.62(5)
2      10       9       1.58   2.19(3)  2.45(5)     1.65   2.28(3)  2.66(5)
2      11       15      1.70   2.90     3.96        1.80   3.02     4.38
2      12       18      1.87   2.87     3.98(6)     1.88   3.18     4.18(6)
2      13       36      1.93   3.59     5.69        2.05   3.31     5.31
Finally we return to the question of how to distribute the calculation of the approximate zeros across the slave processors. The results quoted thus far have used the maximum number of available processors consistent with maintaining a balanced workload across the processors; below we refer to this as strategy (i). An alternative, strategy (ii), would be to use all the available processors and distribute the workload as evenly as possible. This is achieved as follows: with $m$ and $n_z$ as defined earlier, let $m \cdot n_z = n + k$, where we note that $k \le m - 1$. Then, as with strategy (i), $k$ of the slave transputers each handle $n_z - 1$ approximate zeros and the remaining $m - k$ slave transputers each handle $n_z$ approximate zeros, with the latter slaves chosen to be those in the middle of the linear chain (the only difference from strategy (i) is the choice of $k$).

In Table 5 we give the total times required to solve six of the Table 2 test problems with Algorithm 1 using each of the strategies described above. For each problem the first line gives
Table 5
Comparison of workload strategies

Problem  Degree         Number of processors
                        1           4           5           6           7           8
8        9       (i)    0.188(59)               0.090(68)
                 (ii)               0.105(67)               0.093(69)   0.092(67)   0.098(71)
9        9       (i)    0.102(31)   0.055(38)   0.054(37)
                 (ii)               0.051(36)                           0.051(37)   0.060(37)
10       9       (i)    0.078(23)   0.037(25)   0.039(26)
                 (ii)               0.045(31)               0.043(26)               0.037(26)
11       15      (i)    0.496(62)   0.194(69)   0.164(69)                           0.142(68)
                 (ii)                                       0.163(67)   0.159(74)
12       18      (i)    0.594(54)   0.236(60)   0.190(57)   0.163(57)
                 (ii)                                                   0.168(59)   0.180(64)
13       36      (i)    1.431(34)   0.470(39)   0.421(39)   0.312(35)               0.260(33)
                 (ii)                                                   0.296(34)
the times for strategy (i) and the second line gives the times for strategy (ii), in those cases where the two strategies differ. The numbers in brackets indicate the number of iterations required, and the times are given in seconds. (A blank entry in a strategy (i) line indicates that the configuration of slaves used is unchanged from the previous column.) There is little to choose between the two strategies. With the lower degree polynomials the load balance strategy (i) is marginally more efficient, presumably because the computation to communication ratio does not favour the strategy (ii) approach of using all the processors, whereas for the high degree polynomials there is some slight advantage in using the extra processing power made available by using all the processors (strategy (ii)).

Algorithms 1 and 2, using strategy (i) to decide how many of the available processors to use, form the routine Poly.solve.REAL, which is the polynomial zero finding routine of the occam Numerical Library (see [9]).
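For comparison with the strategy (i) sketch given earlier in this section, a strategy (ii) version (again an illustrative Python fragment, not the occam2 library code) differs only in fixing the number of slaves at $m$:

```python
def strategy_ii(n, m):
    """Use all m available slaves: k = m*n_z - n of them (at the ends of
    the chain) handle n_z - 1 zeros, the rest handle n_z."""
    n_z = (n - 1) // m + 1
    k = m * n_z - n              # note k <= m - 1
    loads = [n_z] * m
    for t in range(k):           # lighter loads at the ends of the chain
        loads[t // 2 if t % 2 == 0 else m - 1 - t // 2] = n_z - 1
    return m, loads

# Degree 9 on 8 available slaves:
#   strategy_i(9, 8)  -> 5 slaves, loads [1, 2, 2, 2, 2]
#   strategy_ii(9, 8) -> 8 slaves, loads [1, 1, 1, 1, 2, 1, 1, 1]
```

The same $n = 9 + k$ zeros are moved across more links under strategy (ii), which is why it only pays off when there is enough computation per iteration to hide the extra communication.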
Acknowledgment

This work was performed during a study leave visit to the Centre for Mathematical Software Research at the University of Liverpool. The author wishes to thank Professor L.M. Delves for making this visit possible and for his interest in this work. The author also wishes to thank Professor I. Gladwell for his helpful comments on an earlier draft of this manuscript.
References

[1] O. Aberth, Iteration methods for finding all zeros of a polynomial simultaneously, Math. Comp. 27 (1973) 339-344.
[2] D.A. Adams, A stopping criterion for polynomial root finding, Comm. ACM 10 (1967) 655-658.
[3] G. Alefeld and J. Herzberger, On the convergence speed of some algorithms for the simultaneous approximation of polynomial roots, SIAM J. Numer. Anal. 11 (1974) 237-243.
[4] W. Börsch-Supan, A posteriori error bounds for the zeros of polynomials, Numer. Math. 5 (1963) 380-398.
[5] K. Dochev and P. Byrnev, Certain modifications of Newton's method for the approximate solution of algebraic equations, Ž. Vyčisl. Mat. i Mat. Fiz. 4 (1964) 915-920.
[6] E. Durand, Solutions Numériques des Équations Algébriques, Tome I (Masson, Paris, 1960).
[7] L.W. Ehrlich, A modified Newton method for polynomials, Comm. ACM 10 (1967) 107-108.
[8] M.R. Farmer and G. Loizou, A class of iteration functions for improving, simultaneously, approximations to the zeros of a polynomial, BIT 15 (1975) 250-258.
[9] T.L. Freeman, Routine Poly.solve.REAL, Occam Library Manual, Mark 1, N.A.S. Ltd., Liverpool, 1989.
[10] P. Henrici and B.O. Watkins, Finding zeros of a polynomial by the Q-D algorithm, Comm. ACM 8 (1965) 570-574.
[11] I.M.S.L., I.M.S.L. User's Manual, I.M.S.L., Houston, 1985.
[12] Inmos, Transputer Reference Manual (Prentice-Hall, New York, 1988).
[13] Inmos, occam2 Reference Manual (Prentice-Hall, New York, 1988).
[14] I.O. Kerner, Ein Gesamtschrittverfahren zur Berechnung der Nullstellen von Polynomen, Numer. Math. 8 (1966) 290-294.
[15] K. Madsen and J.K. Reid, Fortran subroutines for finding polynomial zeros, Harwell Report AERE-R7986, Harwell, 1975.
[16] N.A.G., N.A.G. Fortran Library Manual, Mark 12, N.A.G. Ltd., Oxford, 1987.
[17] A.W.M. Nourein, An iteration formula for simultaneous determination of the zeros of a polynomial, J. Comput. Appl. Math. 1 (1975) 251-254.
[18] M.S. Petković and G.V. Milovanović, A note on some improvements of the simultaneous methods for determination of polynomial zeros, J. Comput. Appl. Math. 9 (1983) 65-69.
[19] R.F. Thomas, Corrections to numerical data on Q-D algorithm, Comm. ACM 9 (1966) 322-323.