Information Sciences 68, 113-122 (1993)

Dense Polynomial Multiplication with Reduced Array Manipulation Overhead

SHOVONLAL KUNDU
ABSTRACT

An algorithm has been developed for the ordered generation of the coefficients of the product of two dense univariate polynomials, free from any constraint on their degrees and on the domains in which the coefficients of the polynomials lie. The algorithm reduces the overhead of array manipulation by using skillful programming techniques and obtains a speed improvement over a straightforward intuitive algorithm. The space requirement is also greatly reduced when a sequential transmission of the coefficients of the product polynomial to some other medium is the only requirement. In the process of analyzing the algorithms, a rule is found for a certain class of nested-loop manipulation problems, such as rectangular matrix addition.

1. INTRODUCTION

Polynomial multiplication appears in the context of important areas of applied mathematics. Typical applications arise in algebraic manipulation systems, implementation of finite field multiplication in GF(p^n), encoding/decoding of cyclic codes [1], implementation of software packages for multiple precision arithmetic, and in some problems of control theory [2]. In order to realize fast algorithms, several authors have studied the problem of univariate polynomial multiplication by attributing various constraints to the form and degree of the polynomials, the nature of the coefficients of the polynomials, and the sparsity of the coefficients of the polynomials. An efficient multiplication algorithm was realized by applying the Fast Fourier Transform (FFT) in a case [3] in which the degrees of the two polynomials were the same and of the form 2^k - 1. A lower bound on polynomial multiplication has been obtained by Brown and Dobkin [4] under the assumption that the coefficients of the polynomials lie in the domain of integers. When the polynomials to be multiplied are sparse, Horowitz [5] and Klip [6] have shown that the actual overhead of ordered generation of


the coefficients of the product polynomial depends on the complexity of the associated two-dimensional sorting problem formed by the exponents of the various terms of the product polynomial. In none of these works has it been discussed how best two dense polynomials can be multiplied in the absence of the above restrictive assumptions about the degrees and coefficients of the polynomials.

Our purpose is to describe an efficient algorithm for ordered generation of the coefficients of the product of two polynomials that are dense and free from assumptions regarding their degrees and the domain in which their coefficients lie. In a general case like this, a direct reduction of the number of arithmetic operations between the coefficients of the input polynomials can hardly be realized by virtue of a mathematical theory; we have therefore proposed an algorithm in which the total time requirement is reduced by skillfully minimizing the software overhead involved in the manipulation of the polynomial coefficients, from the point of view of a PASCAL-like high-level language. Our aim is to demonstrate how advantages can be derived from judicious software implementation even for a rather straightforward and well-understood problem. In some cases, reduction of software overhead may be achieved with the help of a code optimizer module in a compiler, but code optimization is costly in terms of resources, whereas the programmer is the best person to detect the various redundancies arising out of a particular implementation of his or her specific problem. Hence, it is always better if the high-level language programmer can directly reduce the software overhead, independent of a compiler, for the implementation of his or her problem [7].

In order to evaluate the comparative efficiency of the proposed algorithm, we have selected, as the basis for comparison, a naive algorithm for polynomial multiplication that is a variant of the integer multiplication algorithm indicated in [8]. Both algorithms use a linear array structure and the same type of repetitive construct. A quantitative comparison of the relative performance of our algorithm with that of the intuitive one has been carried out with respect to their implementation in PASCAL, based on certain assumptions about the implementation of the repetitive construct, the for statement.

2. THE ALGORITHMS

Our object is to multiply the polynomials a(x) = a_0 + a_1 x + ... + a_n x^n and b(x) = b_0 + b_1 x + ... + b_m x^m to form the product polynomial c(x) = c_0 + c_1 x + ... + c_{m+n} x^{m+n}, where degree(a(x)) = n ≥ m = degree(b(x)). The first algorithm we present is the naive one and is a variant of the


integer multiplication algorithm in [8], whereas the second algorithm is the one developed by us. The input to both algorithms is the pair of arrays a = (a_0, a_1, ..., a_n) and b = (b_0, b_1, ..., b_m). The output is the array c = (c_0, c_1, ..., c_p), where p = m + n. The algorithms are described below in a PASCAL-like language.
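For a concrete illustration (the values are chosen here for exposition, not taken from the paper), let a(x) = 2 + 3x + x^2 (so n = 2) and b(x) = 1 + 4x (so m = 1). Then p = 3 and c(x) = 2 + 11x + 13x^2 + 4x^3, since, for example, c_1 = a_0 b_1 + a_1 b_0 = 2(4) + 3(1) = 11.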

2.1. ALGORITHM 1

(1)  for k := 0 to p do c[k] := 0;
(2)  for i := 0 to m do
(3)  begin
(4)    x := b[i];
(5)    for j := 0 to n do
(6)    begin
(7)      l := i + j;
(8)      c[l] := c[l] + a[j]*x;
(9)    end;
(10) end;

2.2. ALGORITHM 2

(1)  for k := 0 to m do
(2)  begin
(3)    s := 0;
(4)    for i := 0 to k do s := s + a[k - i]*b[i];
(5)    c[k] := s;
(6)  end;
(7)  for k := m + 1 to n do
(8)  begin
(9)    s := 0;
(10)   for i := 0 to m do s := s + a[k - i]*b[i];
(11)   c[k] := s;
(12) end;
(13) for k := n + 1 to p do
(14) begin
(15)   s := 0;
(16)   for i := k - n to m do s := s + a[k - i]*b[i];
(17)   c[k] := s;
(18) end;
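As a check on the listing above, the following is a minimal, self-contained PASCAL program embedding Algorithm 2; the degrees and coefficients are the illustrative values of the small example in Section 2, not data from the paper.

program polymul;
const
  n = 2;  { degree of a(x); illustrative value }
  m = 1;  { degree of b(x); illustrative value }
  p = 3;  { p = m + n }
var
  a: array[0..n] of integer;
  b: array[0..m] of integer;
  c: array[0..p] of integer;
  i, k, s: integer;
begin
  { a(x) = 2 + 3x + x^2, b(x) = 1 + 4x }
  a[0] := 2; a[1] := 3; a[2] := 1;
  b[0] := 1; b[1] := 4;
  { region I: k = 0..m }
  for k := 0 to m do
  begin
    s := 0;
    for i := 0 to k do s := s + a[k - i]*b[i];
    c[k] := s;
  end;
  { region II: k = m+1..n }
  for k := m + 1 to n do
  begin
    s := 0;
    for i := 0 to m do s := s + a[k - i]*b[i];
    c[k] := s;
  end;
  { region III: k = n+1..p }
  for k := n + 1 to p do
  begin
    s := 0;
    for i := k - n to m do s := s + a[k - i]*b[i];
    c[k] := s;
  end;
  for k := 0 to p do writeln('c[', k, '] = ', c[k]);
end.

The program prints the coefficients 2, 11, 13, and 4 of the worked example.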

To understand Algorithm 2, it is helpful to refer to Figure 1. Consider the set of (m + 1)(n + 1) lattice points enclosed in the closed rectangle


[0, n] × [0, m] in the first quadrant. Let us also imagine that within this region the point (i, j) contains the pair of coefficients (a_i, b_j). Thus, each coefficient of the product polynomial is the sum of the products of all such pairs within this rectangle that are found on some straight line of slope -1. We have divided this rectangle into three adjacent regions: two congruent isosceles right-angled triangles and one parallelogram. Region I corresponds to lines 1-6 of Algorithm 2, region II corresponds to lines 7-12 of Algorithm 2, and region III corresponds to lines 13-18 of Algorithm 2. The variable k indexes the coefficients of the product polynomial. In computing a coefficient of the product polynomial, the variation of the indices for the coefficients of a(x) and b(x) is symmetric in region I; in region II the index for the coefficient of b(x) always varies from 0 to m; and in region III the index for the coefficient of b(x) starts from the intersection of the lines x + y = k and x = n and varies up to m.
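As a small illustration (values chosen here), take n = 3 and m = 2, so that p = 5. Region I covers k = 0, 1, 2, where i runs from 0 to k; region II covers k = 3, where i runs from 0 to m = 2; and region III covers k = 4, 5, where i runs from k - n to m. For k = 4, for example, i takes the values 1 and 2, giving c_4 = a_3 b_1 + a_2 b_2, which corresponds exactly to the lattice points (3, 1) and (2, 2) of the line x + y = 4 inside the rectangle.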

3. COMPLEXITY OF THE ALGORITHMS

In order to compare the performances of the two algorithms in a high-level programming language environment, it is necessary to compare their respective complexities. There are two different approaches to calculating the complexity of an algorithm. One approach is to make a macroanalysis, where only the asymptotically dominant operations of the algorithm are accounted for. Complexity thus calculated is also known as

Fig. 1. Partitioning the lattice points in [0, n] × [0, m]: the kth coefficient of the product is the sum of the products of the coefficient pairs found on the lattice points of the line x + y = k.


asymptotic complexity and is expressed in O notation. The other approach is to make a microanalysis, where every operation necessary for the implementation of the algorithm is accounted for [9]. It has been amply demonstrated in [10] that the performance of an algorithm in practice cannot be estimated properly from its asymptotic complexity, i.e., the complexity derived from a macroanalysis. In [10], the complexity of an implementation of a matrix multiplication algorithm has been calculated by accounting for the time needed for assignments, array subscripting, comparisons, and the overhead of the repetitive constructs used in the programming language of implementation. It should be noted here that both polynomial multiplication algorithms discussed above have the same asymptotic complexity O((m + 1)(n + 1)), as each term of one polynomial is multiplied in turn by each term of the other polynomial in both algorithms. The difference in their speed of execution, arising out of their different access patterns to the coefficients of the polynomials to be multiplied and their product, cannot be accounted for through an asymptotic analysis. So, the complexities of the two algorithms will be calculated here based on a microanalysis in a vein similar to [10]. In particular, an exact calculation will be made of the number of additions, comparisons, and assignments involving loop control variables, the number of additions necessary for array addressing, and the number of assignments involving coefficients of the polynomials and auxiliary variables. In this calculation, the complexity of the arithmetic operations involving the coefficients of the two polynomials to be multiplied will be neglected, as this complexity is the same for both algorithms and our aim is to calculate the difference of the complexities of the two algorithms.

In Table 1, the complexities corresponding to the different statements of the two algorithms are shown, where we assume that:

i. The total number of operations necessary to access an element of an array x given by the expression x[f(i)] is 1 + the number of operations necessary to evaluate f(i).

ii. Each operation takes the same amount of time.

iii. During the execution of a PASCAL-like for loop r times, the index variable of the loop is compared r + 1 times against its bound, its value is incremented r times, and its value is assigned r + 1 times, according to the flowchart shown in Figure 2. This may be the case in a machine with a single-address instruction set and a single accumulator in which all arithmetic and logical operations are performed.
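To make assumption iii concrete, consider a loop for i := 0 to n do S, which executes r = n + 1 times. Under the scheme of Figure 2 it costs r + 1 = n + 2 comparisons of i against the bound, r = n + 1 incrementing additions, and r + 1 = n + 2 assignments to i. Applying this with r = m + n + 1 to the k loop of Algorithm 1 gives exactly entry [a] of Table 1: m + n + 2 comparisons and assignments, and m + n + 1 additions.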


TABLE 1
Complexities of the individual steps of the algorithms

Algorithm 1

[a] Loop control variable k [line 1]:
    Number of comparisons = number of assignments = m + n + 2.
    Number of additions = m + n + 1.

[b] Executing c[k] := 0 [line 1]:
    Number of assignments = m + n + 1.
    Number of additions for array referencing = m + n + 1.

[c] Loop control variables i and j [lines 2, 5]:
    Number of comparisons = number of assignments = (m + 1)(n + 2) + (m + 2).
    Number of additions = (m + 1)(n + 1) + (m + 1).

[d] Array referencing [lines 4, 7, 8]:
    Line 4 requires (m + 1) additions and (m + 1) assignments.
    Line 7 requires (m + 1)(n + 1) additions and (m + 1)(n + 1) assignments.
    Line 8 requires 3(m + 1)(n + 1) additions [particularly in the case of machines with single-address instruction sets].

Algorithm 2

[a] Loop control variable k [lines 1, 7, 13]:
    Number of comparisons = number of assignments = m + n + 4.
    Number of additions = m + n + 1 + 2 = m + n + 3, as in lines 7 and 13 we require two extra additions for the initialization of k.

[b] Executing s := 0 and c[k] := s [lines 3, 5, 9, 11, 15, 17]:
    Number of assignments = 2(m + n + 1).
    Number of additions for array referencing = m + n + 1.

[c] Loop control variable i [lines 4, 10, 16]:
    Number of comparisons = number of assignments = (m + 1)(n + 1) + (m + n + 1).
    Number of additions = (m + 1)(n + 1) + m.
    Note that for a given value of k, the number of times the i loop is executed equals the number of lattice points of the rectangle [0, n] × [0, m] lying on the line x + y = k. In line 16, we require m extra additions for the initialization of i.

[d] Array referencing [lines 4, 10, 16]:
    Each execution of the statement s := s + a[k - i]*b[i] requires three additions for address calculation: two for addressing the array a and one for addressing the array b. Hence, the number of additions = 3(m + 1)(n + 1).
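The entries in item [c] for Algorithm 2 can be checked by summation. The number of lattice points of [0, n] × [0, m] on the line x + y = k, summed over k = 0, ..., p, counts every lattice point exactly once, so the i loops execute (m + 1)(n + 1) times in total. Each of the m + n + 1 values of k contributes one extra comparison and one extra assignment of i beyond its iteration count, giving (m + 1)(n + 1) + (m + n + 1) comparisons and assignments; the increments number (m + 1)(n + 1), to which the m extra additions for initializing i in line 16 are added.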


Fig. 2. Realization of the for loop: the starting value is assigned to the loop index and compared against the bound; if the bound is not exceeded, the statements are executed and the index is incremented and reassigned before the next comparison; if the bound is exceeded, the loop terminates.

3.1. DIFFERENCE IN COMPLEXITY

The total complexity of Algorithm 1, which is the sum of the complexities of the different steps of the algorithm given in Table 1, is

T_1 = [2(m + n + 2) + (m + n + 1)] + 2(m + n + 1) + [2((m + 1)(n + 2) + (m + 2)) + (m + 1)(n + 1) + (m + 1)] + [2(m + 1) + 5(m + 1)(n + 1)]
    = 8mn + 20m + 13n + 24.

Similarly, the total complexity of Algorithm 2 is

T_2 = 2(m + n + 4) + (m + n + 3) + 3(m + n + 1) + 2(m + 1)(n + 1) + 2(m + n + 1) + (m + 1)(n + 1) + m + 3(m + 1)(n + 1)
    = 6mn + 15m + 14n + 22.


The difference of the complexities of Algorithms 1 and 2 is

T_1 - T_2 = 2mn + 5m + 2 - n,

which is always positive for positive m and n.
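For instance (a numerical check, with values chosen here), take m = n = 10: T_1 = 800 + 200 + 130 + 24 = 1154 and T_2 = 600 + 150 + 140 + 22 = 912, so Algorithm 2 saves T_1 - T_2 = 242 elementary operations, in agreement with 2mn + 5m + 2 - n = 200 + 50 + 2 - 10 = 242.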

Hence, Algorithm 2 is always expected to run faster than Algorithm 1. Both algorithms were programmed in PASCAL and run on a Burroughs B6738 computer. The results of running the algorithms for various values of m and n are shown in Table 2.

TABLE 2
Execution times of the algorithms

Degrees of the polynomials    Total execution time    Total execution time    Time difference
to be multiplied              t_1 of Algorithm 1      t_2 of Algorithm 2      (t_1 - t_2)
    m         n               in seconds              in seconds              in seconds

   10        10                0.047                   0.044                   0.003
   20        20                0.065                   0.061                   0.004
   30        30                0.098                   0.091                   0.007
   50        50                0.199                   0.177                   0.022
   25        75                0.161                   0.141                   0.021
  100       100                0.675                   0.595                   0.080
   50       150                0.505                   0.439                   0.183
  400       400               12.367                   8.162                   4.205
 1000      1000               87.972                  71.543                  16.429

4. DISCUSSION

The basic reason why the proposed algorithm runs faster is that the references to the array c for storing the coefficients of the product polynomial have been reduced to the minimum possible without appreciably increasing the overhead for manipulating the loop index variables. This has been made possible by finding a suitable way of dividing the set of lattice points in [0, n] × [0, m] into three specific segments, as shown in Figure 1. In Algorithm 1, most of the elements of the array c are referenced repeatedly, as evidenced by lines 7 and 8, in addition to being referenced by line 1, whereas in Algorithm 2 each element of the array c is referenced exactly once, as shown in lines 5, 11, and 17. It may be noted that the improvements in Algorithm 2 have been obtained without


unduly complicating the structure of the program or increasing the space requirement.

Another reason for the fast performance of Algorithm 2 is that its three loops may be viewed as obtained from partial unrolling [7] of a single loop. If the three loops were combined into a single loop, additional tests would have to be made in each iteration to determine the initial and final values of the loop control variable i. Here, the three loops correspond exactly to the three regions shown in Figure 1, so that within each region the initial and final values of the loop control variable i follow the same pattern.

Significant space saving can be obtained with this algorithm when the main objective is an ordered transmission of the coefficients of the product polynomial sequentially to some other location. An example of such a case is when we are interested only in printing the coefficients of the product polynomial. In such cases, storage space for the array c may be completely eliminated, and lines 5, 11, and 17 of our algorithm may be replaced by the step "Transmit the kth coefficient of the product polynomial from the location s."

The detailed analysis of the loop control variable manipulation overhead for Algorithm 1 also brings out another interesting fact for problems where nested loop manipulations with unequal numbers of iterations of the involved loop control variables are necessary. Algorithm 1 remains correct even if we interchange m with n and a with b. But this increases the overhead when m < n, as may be observed by interchanging m with n in item [c] for Algorithm 1 in Table 1. Hence, if the need for implementing a nested loop manipulation with unequal numbers of iterations of the involved loop control variables arises, it is always better to select as the outermost loop control variable the one that goes through the smaller number of iterations, whenever possible. One such problem occurs in the addition of two rectangular matrices whose number of rows is less than the number of columns. The above discussion shows that in such cases an implementation as addition by rows will always be faster than an implementation as addition by columns.

It may be pointed out that the algorithms referred to in [3-6] are not likely to be applicable as efficiently in our case. In the case of the sparse polynomial multiplication algorithms [5, 6], some kind of sorting and special data structures for space saving are necessary, which, in turn, need related manipulation algorithms. Neither of these is required in the dense case. On the other hand, the application of FFT-based algorithms [3] may need padding of the two dense polynomials to be multiplied with a large number of zero coefficients in order that their degrees are raised to appropriate values. Besides this, a further requirement is that the field of coefficients must contain at least as many elements as one plus the


degree of the product polynomial. Such padding and constraints on the size of the field of coefficients are not necessary for the algorithm proposed here.
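To make the partial-unrolling point concrete, the following sketch (a hypothetical reconstruction, not given in the paper) rolls Algorithm 2 back into a single loop. The two if tests must then be executed for every value of k, which is precisely the overhead that the three-loop form avoids; the coefficients reuse the illustrative example of Section 2.

program rolled;
const
  n = 2; m = 1; p = 3;  { illustrative degrees, as before }
var
  a: array[0..n] of integer;
  b: array[0..m] of integer;
  c: array[0..p] of integer;
  i, k, lo, hi, s: integer;
begin
  a[0] := 2; a[1] := 3; a[2] := 1;
  b[0] := 1; b[1] := 4;
  for k := 0 to p do
  begin
    s := 0;
    if k > n then lo := k - n else lo := 0;  { extra test on every iteration }
    if k < m then hi := k else hi := m;      { extra test on every iteration }
    for i := lo to hi do s := s + a[k - i]*b[i];
    c[k] := s  { or: transmit the kth coefficient directly from s }
  end;
  for k := 0 to p do writeln('c[', k, '] = ', c[k])
end.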

5. CONCLUSION

A simple algorithm for faster multiplication of any two dense polynomials is given here that improves upon a naive algorithm by reducing array manipulation overhead and, in certain situations, the storage space requirement. It is demonstrated how, by a skillful technique of software implementation, the efficiency of an algorithm can be increased even for a straightforward and well-understood problem.

REFERENCES

1. F. J. MacWilliams and N. J. A. Sloane, The Theory of Error Correcting Codes, Part I, North-Holland, New York, 1977.
2. A. P. Paplinski, Matrix multiplication and division of polynomials, IEE Proc. 132:95-99 (1985).
3. A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, Massachusetts, 1974.
4. M. R. Brown and D. P. Dobkin, An improved lower bound on polynomial multiplication, IEEE Trans. Comput. C-29:337-340 (1980).
5. E. Horowitz, A sorting algorithm for polynomial multiplication, J. ACM 22:450-462 (1975).
6. D. A. Klip, New algorithms for polynomial multiplication, SIAM J. Comput. 8:326-343 (1979).
7. J. L. Bentley, Writing Efficient Programs, Prentice-Hall, New Jersey, 1982.
8. D. E. Knuth, The Art of Computer Programming, Vol. 2, Addison-Wesley, Reading, Massachusetts, 1981.
9. J. Cohen, Computer assisted microanalysis of programs, Commun. ACM 25:724-733 (1982).
10. J. Cohen and M. Roth, On the implementation of Strassen's fast multiplication algorithm, Acta Inform. 6:341-355 (1976).

Received 10 March 1990; revised 6 May 1990, 12 March 1991, 11 December 1991