Faster polynomial multiplication via multipoint Kronecker substitution



Journal of Symbolic Computation 44 (2009) 1502–1510


David Harvey, Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012-1185, United States


Article history: Received 29 September 2008; Accepted 22 May 2009; Available online 6 June 2009.
Keywords: Polynomial multiplication; Kronecker substitution; Efficient algorithm

Abstract. We present several new algorithms for dense polynomial multiplication in Z[x] based on the Kronecker substitution method. Instead of reducing to a single integer multiplication, we reduce to several smaller multiplications. We describe an implementation of multiplication in (Z/nZ)[x] for a word-sized modulus n based on these methods, and compare its performance to that of NTL and Magma.

1. Introduction

The Kronecker substitution method is an algorithm for computing the product of two polynomials. The underlying substitution was conceived by Kronecker (1882) as a technique for reducing problems concerning multivariate polynomials to the case of univariate polynomials; later Schönhage (1982) suggested a similar idea to reduce multiplication in Z[x] to multiplication in Z. It is widely used in practice: for example, the Magma computer algebra system (Bosma et al., 1997) uses Kronecker substitution to multiply polynomials with high degree and small coefficients in Z[x] and (Z/nZ)[x] (Steel, 2008), and Victor Shoup's NTL library (Shoup, 2007) uses Kronecker substitution to reduce multiplication in GF(p^n)[x] to multiplication in GF(p)[x].

The reduction from Z[x] to Z is best explained by an example. Let

f(x) = 621x^3 + 887x^2 + 610x + 274,

g(x) = 790x^3 + 424x^2 + 298x + 553,

and h = fg. We compute the integer product h(10^7) = f(10^7) · g(10^7); that is,

621|0000887|0000610|0000274 · 790|0000424|0000298|0000553

= 490590|0964034|1043046|1082839|0788467|0418982|0151522.


The vertical bars indicate base-10^7 digit boundaries. Our choice of evaluation point ensures that the coefficients do not overlap, so we obtain h(x) = 490590x^6 + 964034x^5 + 1043046x^4 + 1082839x^3 + 788467x^2 + 418982x + 151522. The main advantage of this approach to multiplication in Z[x] is that it places the burden of computation on existing highly optimised software libraries for multiprecision integer arithmetic, such as the GMP library (Granlund, 2008). This point of view is discussed further in Fateman (2005).

In this paper we present several new variants of the above algorithm. The strategy is to evaluate not just at a single point, but at r carefully selected points, where r = 2 or r = 4. Instead of reducing to a single multiplication, we reduce to r multiplications, each about 1/r-th the size of the original multiplication.

The proposed sets of evaluation points may be summarised as follows. Suppose that we are given a multiplication problem for which the standard Kronecker scheme would evaluate at x = 2^b. In the first algorithm (Section 3), we evaluate at x = ±2^{b′}, where b′ ≈ b/2. We illustrate in base ten, continuing the example above:

f(10^4) = 621|0887|0610|0274,    f(−10^4) = −620|9113|0609|9726,

g(10^4) = 790|0424|0298|0553,    g(−10^4) = −789|9576|0297|9447.

At x = −10^4, the coefficients are 'packed' with alternating signs. We obtain h(10^4) = f(10^4) · g(10^4) = 490686413831542917850889971522 and h(−10^4) = f(−10^4) · g(−10^4) = 490493607029377239842510331522. The coefficients of h(x) overlap in both h(10^4) and h(−10^4). Nevertheless, as the signs alternate in h(−10^4), we may recover the desired information in linear time:

(h(10^4) + h(−10^4))/2 = 490590|01043046|00788467|00151522,
(h(10^4) − h(−10^4))/2 = 964034|01082839|00418982|0000.

The second algorithm (Section 4) evaluates at x = 2^{±b′}. In our running example,

f(10^4) = 621|0887|0610|0274,    10^{12} f(10^{−4}) = 274|0610|0887|0621,

g(10^4) = 790|0424|0298|0553,    10^{12} g(10^{−4}) = 553|0298|0424|0790.

At x = 10^{−4}, the coefficients are packed in reversed order. We have h(10^4) = f(10^4) · g(10^4) = 49|0686|4138|3154|2917|8508|8997|1522 and 10^{24} h(10^{−4}) = 10^{24} f(10^{−4}) · g(10^{−4}) = 15|1563|9060|8575|2943|3142|4083|0590. Once again, the coefficients of h(x) overlap in both products. Let h_i denote the coefficient of x^i in h(x). The bottom half of h_0 is equal to the lowest base-10^4 digit of h(10^4), namely 1522. The top half is recovered as the highest digit of 10^{24} h(10^{−4}), namely 15. In this way we completely recover h_0 = 151522. Subtracting h_0 from the appropriate position in both products reveals the two halves of h_1 = 418982:

h(10^4) − 151522 = 49|0686|4138|3154|2917|8508|8982|0000,
10^{24} h(10^{−4}) − 10^{24} · 151522 = 00|0041|9060|8575|2943|3142|4083|0590.

Continuing in this way, we recover h(x) completely. There are certain complications arising from carries, not shown in this example, that we will address in Section 4. The key ideas of these two algorithms are orthogonal, and we may combine them to obtain an algorithm (Section 5) that evaluates at the four points x = ±2^{±b″}, where b″ ≈ b/4.
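The two evaluation strategies just illustrated are easy to replay programmatically. The following minimal Python sketch is an illustration added here, not code from the paper; the helper names evaluate and unpack are hypothetical. It reproduces the negated-point part of the running example and checks the even/odd coefficient separation.

# Replay the base-10 running example (a sketch; the paper works in base 2^b,
# but base 10^4 makes the digit groups easy to read).
f = [274, 610, 887, 621]          # f(x) = 621x^3 + 887x^2 + 610x + 274
g = [553, 298, 424, 790]          # g(x) = 790x^3 + 424x^2 + 298x + 553

def evaluate(p, x):
    """Evaluate a coefficient list (low to high) at the point x."""
    return sum(c * x**i for i, c in enumerate(p))

# Negated evaluation points x = +-10^4.
pos = evaluate(f, 10**4) * evaluate(g, 10**4)       # h(10^4)
neg = evaluate(f, -10**4) * evaluate(g, -10**4)     # h(-10^4)

even_packed = (pos + neg) // 2                  # h0 + h2*10^8 + h4*10^16 + h6*10^24
odd_packed = (pos - neg) // (2 * 10**4)         # h1 + h3*10^8 + h5*10^16

def unpack(X, n, base):
    """Split X into n base-`base` digits, low to high."""
    return [(X // base**i) % base for i in range(n)]

h_even = unpack(even_packed, 4, 10**8)
h_odd = unpack(odd_packed, 3, 10**8)
h = [h_even[i // 2] if i % 2 == 0 else h_odd[i // 2] for i in range(7)]
print(h)   # [151522, 418982, 788467, 1082839, 1043046, 964034, 490590]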


The benefit of this multipoint evaluation strategy accrues as follows. Let M(n) denote the cost of multiplying n-bit integers. Suppose that the standard Kronecker scheme reduces the computation of h = fg to a multiplication of two k-bit integers. The cost is M(k), plus O(k) overhead for the packing and unpacking steps. For the new algorithms, the cost is instead rM(k/r) + O(k). If k is very large, then M(k) behaves roughly like k log k (using FFT-based algorithms), leaving little difference between the two strategies. However, if k is not too large, then M(k) will typically behave like k^α for some 1 < α ≤ 2 (for example, Karatsuba's algorithm has α ≈ 1.58), and then we expect the multipoint algorithm to win by a factor of r^{α−1}.

In reality the situation is more complicated. The implied constant in the O(k) term is relatively large for the new algorithms, due to the complexity of the packing and unpacking phases. When k is small enough to observe values of α substantially larger than 1, the O(k) overhead is not insignificant.

In Section 6 we discuss an implementation of multiplication in (Z/nZ)[x], where n is a word-sized modulus. We use the new Kronecker variants for low degree polynomials, and the Schönhage–Nussbaumer convolution algorithm (Schönhage, 1976–1977; Nussbaumer, 1980) to reduce large degree problems to smaller ones. We give timings indicating that it performs favourably compared to the well-known Magma and NTL software packages, and that the new algorithms indirectly benefit polynomial multiplications of arbitrarily high degree.

2. Notation and the standard Kronecker substitution

We assume that all integers are represented in base two using a sign-magnitude representation, and that k-bit integers may be added and subtracted in time O(k). The cost of multiplying integers with k_1 and k_2 bits will be denoted by M(k_1, k_2). We write lg x for log_2 x.

We regard a polynomial p ∈ Z[x] as a vector of known length n = len p, and write p = Σ_{i=0}^{n−1} p_i x^i. We define Reverse(p, n) := x^{n−1} p(1/x) = Σ_{i=0}^{n−1} p_{n−1−i} x^i. If 0 ≤ p_i < 2^b for all i, we define Pack(p, n, b) := p(2^b) = Σ_{i=0}^{n−1} p_i 2^{ib}. If 0 ≤ X < 2^{nb} is an integer, then Unpack(X, n, b) returns the unique p of length n satisfying 0 ≤ p_i < 2^b such that p(2^b) = X.

We also define signed versions: if −2^{b−1} ≤ q_i < 2^{b−1} for all i, then Signed-Pack(q, n, b) := q(2^b). This may be implemented in terms of Pack by noting that q(2^b) = p(2^b) − (2^{b−1} + 2^{2b−1} + · · · + 2^{nb−1}), where p_i = q_i + 2^{b−1}, since 0 ≤ p_i < 2^b. This also shows that the definition Signed-Unpack(q(2^b), n, b) := q is unambiguous if the coefficients q_i are known to satisfy the above inequalities. We assume that Pack, Unpack, Signed-Pack and Signed-Unpack have running time O(nb).

Algorithm 1: Standard Kronecker substitution
Input: nonconstant polynomials f, g in Z[x] with 0 ≤ f_i, g_i < 2^c for all i
Output: the product h = fg
  m ← len(f), n ← len(g), b ← 2c + ⌈lg(min(m, n))⌉
  X ← Pack(f, m, b), Y ← Pack(g, n, b)
  Z ← X · Y
  return Unpack(Z, m + n − 1, b)
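A minimal Python sketch of Algorithm 1 may help fix the notation. It is an illustration only (the paper's implementation is in C on top of GMP), and the names pack, unpack and ks1 are hypothetical stand-ins for Pack, Unpack and Algorithm 1.

def pack(p, b):
    """p(2^b): pack the coefficient list p (low to high, 0 <= p_i < 2^b) into an integer."""
    return sum(pi << (i * b) for i, pi in enumerate(p))

def unpack(X, n, b):
    """Split a nonnegative integer X into n base-2^b digits (inverse of pack on valid inputs)."""
    mask = (1 << b) - 1
    return [(X >> (i * b)) & mask for i in range(n)]

def ks1(f, g, c):
    """Algorithm 1: multiply f, g in Z[x], given 0 <= coefficients < 2^c."""
    m, n = len(f), len(g)
    b = 2 * c + (min(m, n) - 1).bit_length()    # b = 2c + ceil(lg min(m, n))
    Z = pack(f, b) * pack(g, b)                 # one big integer multiplication
    return unpack(Z, m + n - 1, b)              # coefficients of h = fg, low to high

f = [274, 610, 887, 621]                        # the example from the introduction
g = [553, 298, 424, 790]
print(ks1(f, g, c=10))                          # matches the schoolbook product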

Proposition 1. Let f, g ∈ Z[x] be polynomials of lengths m and n respectively, with 0 ≤ f_i, g_i < 2^c for all i. Algorithm 1 computes h = fg in time M(mb, nb) + O((m + n)b), where b = 2c + ⌈lg(min(m, n))⌉.

Proof. Correctness follows since Z = h(2^b) = f(2^b) g(2^b) = XY, and since 0 ≤ h_i < min(m, n)(2^c)^2 ≤ 2^b for all i. □

Remark. To adapt Algorithm 1 to the signed coefficient case, suppose that |f_i|, |g_i| < 2^c for all i. Replace Pack and Unpack by Signed-Pack and Signed-Unpack, and redefine b := 2c + ⌈lg(min(m, n))⌉ + 1. This ensures that |h_i| < 2^{b−1} for all i.

3. Negated evaluation points

Proposition 2. Let f, g ∈ Z[x] be polynomials of lengths m and n respectively, with 0 ≤ f_i, g_i < 2^c for all i. Algorithm 2 computes h = fg in time 2M(mb, nb) + O((m + n)b), where b = c + ⌈lg(min(m, n))/2⌉.


Algorithm 2: Two-point 'negated' Kronecker substitution
Input: nonconstant polynomials f, g in Z[x] with 0 ≤ f_i, g_i < 2^c for all i
Output: the product h = fg
  m ← len(f), n ← len(g), s ← m + n − 1, b ← c + ⌈lg(min(m, n))/2⌉
  f′ ← f_0 − f_1 x + · · · + (−1)^{m−1} f_{m−1} x^{m−1}
  g′ ← g_0 − g_1 x + · · · + (−1)^{n−1} g_{n−1} x^{n−1}
  X^+ ← Pack(f, m, b),  X^− ← Signed-Pack(f′, m, b)
  Y^+ ← Pack(g, n, b),  Y^− ← Signed-Pack(g′, n, b)
  Z^+ ← X^+ · Y^+,  Z^− ← X^− · Y^−
  Z^0 ← (Z^+ + Z^−)/2,  Z^1 ← (Z^+ − Z^−)/2^{b+1}
  h^0 ← Unpack(Z^0, ⌈s/2⌉, 2b),  h^1 ← Unpack(Z^1, ⌊s/2⌋, 2b)
  return h^0(x^2) + x h^1(x^2)

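As with Algorithm 1, a short Python sketch may clarify the bookkeeping. It is illustrative only, with hypothetical names (pack, ks2); Python's signed big integers take the place of the Signed-Pack and Signed-Unpack routines.

def pack(p, b):
    """p(2^b) for a nonnegative coefficient list p, low to high."""
    return sum(pi << (i * b) for i, pi in enumerate(p))

def ks2(f, g, c):
    """Algorithm 2 ('negated' points): multiply f, g with 0 <= coefficients < 2^c."""
    m, n = len(f), len(g)
    s = m + n - 1
    b = c + ((min(m, n) - 1).bit_length() + 1) // 2   # b = c + ceil(lg(min(m,n)) / 2)
    Xp, Yp = pack(f, b), pack(g, b)                   # f(2^b), g(2^b)
    Xm = sum((-1) ** i * fi << (i * b) for i, fi in enumerate(f))   # f(-2^b)
    Ym = sum((-1) ** i * gi << (i * b) for i, gi in enumerate(g))   # g(-2^b)
    Zp, Zm = Xp * Yp, Xm * Ym                         # two half-size multiplications
    Z0 = (Zp + Zm) // 2                   # even-index coefficients, packed at 2^(2b)
    Z1 = (Zp - Zm) >> (b + 1)             # odd-index coefficients, packed at 2^(2b)
    mask = (1 << (2 * b)) - 1
    h = [0] * s
    h[0::2] = [(Z0 >> (2 * b * i)) & mask for i in range((s + 1) // 2)]
    h[1::2] = [(Z1 >> (2 * b * i)) & mask for i in range(s // 2)]
    return h

f = [274, 610, 887, 621]
g = [553, 298, 424, 790]
print(ks2(f, g, c=10))   # same coefficients as the schoolbook product of f and g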
Proof. X^+ and X^− are assigned the values f(2^b) and f(−2^b) (= f′(2^b)) respectively; similarly for Y and g. The calls to Pack and Signed-Pack are legal since c ≤ b − 1. We obtain Z^± = f(±2^b) · g(±2^b) = h(±2^b), and then Z^0 = h_0 + h_2 2^{2b} + · · · + h_{2s_0−2} 2^{(2s_0−2)b} and Z^1 = h_1 + h_3 2^{2b} + · · · + h_{2s_1−1} 2^{(2s_1−2)b}, where s_0 = ⌈s/2⌉ and s_1 = ⌊s/2⌋. Since 0 ≤ h_i < min(m, n)(2^c)^2 ≤ 2^{2b} for all i, the calls to Unpack are valid and yield h^0 = h_0 + h_2 x + · · · + h_{2s_0−2} x^{s_0−1} and h^1 = h_1 + h_3 x + · · · + h_{2s_1−1} x^{s_1−1}. The last line interleaves these to obtain h. □

Remark. We obtain a signed version for |f_i|, |g_i| < 2^c by replacing the Pack and Unpack calls by Signed-Pack and Signed-Unpack (the existing Signed-Pack calls remain unchanged), and redefining b := c + ⌈(1 + lg(min(m, n)))/2⌉. Then |h_i| < min(m, n)(2^c)^2 ≤ 2^{2b−1} and the calls to Signed-Unpack are legal.

4. Reciprocal evaluation points

The following subroutine reconstructs a polynomial f, whose coefficients are about 2b bits long, from knowledge of f(2^b) and f(2^{−b}).

Algorithm 3: Recover(X, Y, n, b)
Input: integers X = f(2^b) and Y = 2^{(n−1)b} f(2^{−b}), where f ∈ Z[x] is a length-n polynomial such that 0 ≤ f_i < 2^b(2^b − 1) for all i
Output: the polynomial f
  p ← Unpack(X, n + 1, b), q ← Unpack(Y, n + 1, b)
  α_{−1}, β_{−1}, c_{−1} ← 0
  for i ← 0 to n − 1 do
    if p_i < β_{i−1} + c_{i−1} then c_i ← 1 else c_i ← 0
    α_i ← (p_i − β_{i−1} − c_{i−1}) mod 2^b
    if q_{n−i−1} < α_i then d_i ← 1 else d_i ← 0
    β_i ← (q_{n−i} − α_{i−1} − d_i) mod 2^b
    f_i ← α_i + β_i 2^b
  end
  return f_0 + f_1 x + · · · + f_{n−1} x^{n−1}
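The following Python sketch of Recover is purely illustrative (hypothetical names unpack and recover, and it assumes the coefficient bound from the input specification); the final assertion checks it against a directly computed pair (X, Y) for a random polynomial.

import random

def unpack(X, n, b):
    """Split a nonnegative integer X into n base-2^b digits, low to high."""
    mask = (1 << b) - 1
    return [(X >> (i * b)) & mask for i in range(n)]

def recover(X, Y, n, b):
    """Algorithm 3: reconstruct f (length n, 0 <= f_i < 2^b(2^b - 1))
    from X = f(2^b) and Y = 2^((n-1)b) * f(2^(-b))."""
    p = unpack(X, n + 1, b)
    q = unpack(Y, n + 1, b)
    f = [0] * n
    alpha_prev, beta_prev, c_prev = 0, 0, 0     # alpha_{i-1}, beta_{i-1}, c_{i-1}
    for i in range(n):
        c = 1 if p[i] < beta_prev + c_prev else 0
        alpha = (p[i] - beta_prev - c_prev) % (1 << b)
        d = 1 if q[n - 1 - i] < alpha else 0
        beta = (q[n - i] - alpha_prev - d) % (1 << b)
        f[i] = alpha + (beta << b)
        alpha_prev, beta_prev, c_prev = alpha, beta, c
    return f

# Check on a random example with coefficients in [0, 2^b(2^b - 1)).
b, n = 8, 6
f = [random.randrange((1 << b) * ((1 << b) - 1)) for _ in range(n)]
X = sum(fi << (i * b) for i, fi in enumerate(f))              # f(2^b)
Y = sum(fi << ((n - 1 - i) * b) for i, fi in enumerate(f))    # 2^((n-1)b) f(2^(-b))
assert recover(X, Y, n, b) == f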

Proposition 3. Let f ∈ Z[x], with 0 ≤ f_i < 2^b(2^b − 1) for all i. Let X = f(2^b) and Y = 2^{(n−1)b} f(2^{−b}), where n = len(f). Then Recover(X, Y, n, b) returns f, in time O(nb).

Proof. The running time bound is clear, as the Unpack calls cost O(nb), and the main loop has n iterations each performing linear-time operations on b-bit integers.


We now prove correctness. Note that X = Σ_{i=0}^{n−1} f_i 2^{ib} < Σ_{i=0}^{n−1} 2^b(2^b − 1) 2^{ib} = 2^b(2^{nb} − 1) < 2^{(n+1)b}, so the call to Unpack(X, n + 1, b) is legal; similarly for Unpack(Y, n + 1, b). We thus have

X = Σ_{i=0}^{n} p_i 2^{ib},    Y = Σ_{i=0}^{n} q_i 2^{ib},    0 ≤ p_i, q_i < 2^b.    (1)

Split each f_i into low and high parts as f_i = α_i + β_i 2^b, where 0 ≤ α_i < 2^b and 0 ≤ β_i < 2^b − 1 for 0 ≤ i < n. For convenience put also α_{−1} = β_{−1} = α_n = β_n = 0. Let c_{−1} = 0, and define sequences u_0, …, u_n and c_0, …, c_n by the relations

u_i + c_i 2^b = α_i + β_{i−1} + c_{i−1},    0 ≤ i ≤ n,    (2)

where 0 ≤ u_i < 2^b and c_i ∈ {0, 1}. Then X = Σ_{i=0}^{n−1} (α_i + β_i 2^b) 2^{ib} = Σ_{i=0}^{n} (α_i + β_{i−1}) 2^{ib} = c_n 2^{(n+1)b} + Σ_{i=0}^{n} u_i 2^{ib}. Comparison with (1) shows that p_i = u_i for all i and that c_n = 0. Similarly, let d_n = 0 and define v_0, …, v_n and d_{n−1}, …, d_{−1} by

v_i + d_{n−1−i} 2^b = α_{n−1−i} + β_{n−i} + d_{n−i},    0 ≤ i ≤ n,    (3)

where 0 ≤ v_i < 2^b and d_i ∈ {0, 1}. Then Y = Σ_{i=0}^{n−1} (α_{n−1−i} + β_{n−1−i} 2^b) 2^{ib} = Σ_{i=0}^{n} (α_{n−1−i} + β_{n−i}) 2^{ib} = d_{−1} 2^{(n+1)b} + Σ_{i=0}^{n} v_i 2^{ib}, so again from (1) we have q_i = v_i for all i and d_{−1} = 0.

The ith iteration of the main loop computes c_i, α_i, d_i and β_i in terms of previously computed values. Rewriting (2) as α_i − c_i 2^b = p_i − β_{i−1} − c_{i−1}, we see that α_i and c_i are computed correctly. Replacing i by n−1−i in (3) we obtain (β_{i+1} + d_{i+1}) − d_i 2^b = q_{n−1−i} − α_i. As 0 ≤ β_{i+1} + d_{i+1} ≤ (2^b − 2) + 1 = 2^b − 1, we see that d_i = 1 if and only if q_{n−1−i} < α_i, so d_i is computed correctly. Finally, β_i is computed by replacing i by n − i in (3) and reading the result modulo 2^b. □

Remark. We may also define a signed version. If |f_i| < 2^{b−1}(2^b − 1), let g = f + w where w = 2^{b−1}(2^b − 1)(1 + x + · · · + x^{n−1}). Then g(2^b) = X + W and 2^{(n−1)b} g(2^{−b}) = Y + W, where W = 2^{b−1}(2^b − 1)(1 + 2^b + · · · + 2^{(n−1)b}) = 2^{b−1}(2^{nb} − 1). Since 0 ≤ g_i < 2^b(2^b − 1), we may define Signed-Recover(X, Y, n, b) := Recover(X + W, Y + W, n, b) − w, which correctly returns f.

The bound 2^b(2^b − 1) in Algorithm 3 is stronger than the more natural 2^{2b} that one might expect to work; this is necessary to avoid ambiguities arising from carries. For example, if n = 4 and b = 2, one cannot distinguish between f(x) = 4x^3 + 4 and f(x) = 13x^2 + 13x, since for both polynomials we have f(4) = 2^6 f(1/4) = 260. The following lemma will be used to establish this bound in Algorithms 4 and 5.

Lemma 4. Let f, g ∈ Z[x]; put h = fg and L = min(len(f), len(g)). Let r ≥ 1 and c ≥ 1 be integers.
(a) If 0 ≤ f_i, g_i < 2^c, then 0 ≤ h_i < 2^{rb/2}(2^{rb/2} − 1), where b = ⌈(2c + lg L)/r⌉.
(b) If |f_i|, |g_i| < 2^c, then |h_i| < 2^{rb/2−1}(2^{rb/2} − 1), where b = ⌈(2c + 1 + lg L)/r⌉.

Proof. We first prove (a). We have h_i ≤ L(2^c − 1)^2, and 2^{rb} ≥ 2^{2c} L by the definition of b. We consider two cases. If L < 2^{rb/2} then L(2^c − 1)^2 ≤ L(2^{rb/2}/√L − 1)^2 = 2^{rb} − 2^{rb/2+1}√L + L < 2^{rb} − 2^{rb/2+1}√L + 2^{rb/2} = 2^{rb} − 2^{rb/2}(2√L − 1) ≤ 2^{rb} − 2^{rb/2}. If L ≥ 2^{rb/2} then L(2^c − 1)^2 = 2^{2c}L − 2^{c+1}L + L ≤ 2^{rb} − L(2^{c+1} − 1) < 2^{rb} − L ≤ 2^{rb} − 2^{rb/2}. In both cases the required inequality holds. For (b), we have |h_i| ≤ L(2^c − 1)^2 and 2^{rb} ≥ 2^{2c+1} L. The proof proceeds similarly, splitting into the cases L < 2^{rb/2−1} and L ≥ 2^{rb/2−1}. We omit the details. □
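The ambiguity example above (n = 4, b = 2) is easy to check numerically; the snippet below is an illustrative aside, not part of the paper, and the helper name values is hypothetical.

# Both polynomials have coefficients < 2^(2b) = 16 but not < 2^b(2^b - 1) = 12,
# and they cannot be told apart from the pair (f(2^b), 2^((n-1)b) f(2^(-b))).
def values(coeffs, n=4, b=2):
    X = sum(c * 2 ** (i * b) for i, c in enumerate(coeffs))            # f(2^b)
    Y = sum(c * 2 ** ((n - 1 - i) * b) for i, c in enumerate(coeffs))  # 2^((n-1)b) f(2^(-b))
    return X, Y

print(values([4, 0, 0, 4]))    # f(x) = 4x^3 + 4     -> (260, 260)
print(values([0, 13, 13, 0]))  # f(x) = 13x^2 + 13x  -> (260, 260)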

Proposition 5. Let f, g ∈ Z[x] be polynomials of lengths m and n respectively, with 0 ≤ f_i, g_i < 2^c for all i. Algorithm 4 computes h = fg in time 2M(mb, nb) + O((m + n)b), where b = c + ⌈lg(min(m, n))/2⌉.

Proof. X_fwd and X_rev are assigned the values f(2^b) and 2^{(m−1)b} f(2^{−b}); similarly for Y, g and n. Therefore Z_fwd = h(2^b) and Z_rev = 2^{(m+n−2)b} h(2^{−b}). From Lemma 4(a) with r = 2 we have h_i < 2^b(2^b − 1), so the Recover call is legal. □

Remark. A signed version for |f_i|, |g_i| < 2^c is obtained by replacing Pack and Recover by Signed-Pack and Signed-Recover, and by redefining b := c + ⌈(1 + lg(min(m, n)))/2⌉. Lemma 4(b) provides the needed bound |h_i| < 2^{b−1}(2^b − 1).


Algorithm 4: Two-point 'reciprocal' Kronecker substitution
Input: nonconstant polynomials f, g in Z[x] with 0 ≤ f_i, g_i < 2^c for all i
Output: the product h = fg
  m ← len(f), n ← len(g), b ← c + ⌈lg(min(m, n))/2⌉
  X_fwd ← Pack(f, m, b),  X_rev ← Pack(Reverse(f, m), m, b)
  Y_fwd ← Pack(g, n, b),  Y_rev ← Pack(Reverse(g, n), n, b)
  Z_fwd ← X_fwd · Y_fwd,  Z_rev ← X_rev · Y_rev
  return Recover(Z_fwd, Z_rev, m + n − 1, b)

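Here is a Python sketch of Algorithm 4, in the same illustrative spirit as the earlier sketches (hypothetical names pack, recover, ks3; the recover helper repeats the Algorithm 3 sketch so that the block is self-contained).

def pack(p, b):
    """p(2^b) for a nonnegative coefficient list p, low to high."""
    return sum(pi << (i * b) for i, pi in enumerate(p))

def recover(X, Y, n, b):
    """Algorithm 3 (same sketch as in Section 4), repeated for self-containment."""
    mask = (1 << b) - 1
    p = [(X >> (i * b)) & mask for i in range(n + 1)]
    q = [(Y >> (i * b)) & mask for i in range(n + 1)]
    f = [0] * n
    alpha_prev, beta_prev, c_prev = 0, 0, 0
    for i in range(n):
        c = 1 if p[i] < beta_prev + c_prev else 0
        alpha = (p[i] - beta_prev - c_prev) % (1 << b)
        d = 1 if q[n - 1 - i] < alpha else 0
        beta = (q[n - i] - alpha_prev - d) % (1 << b)
        f[i] = alpha + (beta << b)
        alpha_prev, beta_prev, c_prev = alpha, beta, c
    return f

def ks3(f, g, c):
    """Algorithm 4 ('reciprocal' points): multiply f, g with 0 <= coefficients < 2^c."""
    m, n = len(f), len(g)
    b = c + ((min(m, n) - 1).bit_length() + 1) // 2   # b = c + ceil(lg(min(m,n)) / 2)
    Zfwd = pack(f, b) * pack(g, b)                    # h(2^b)
    Zrev = pack(f[::-1], b) * pack(g[::-1], b)        # 2^((m+n-2)b) h(2^(-b))
    return recover(Zfwd, Zrev, m + n - 1, b)

f = [274, 610, 887, 621]
g = [553, 298, 424, 790]
print(ks3(f, g, c=10))   # same coefficients as the schoolbook product of f and g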
5. Four evaluation points

Algorithm 5 evaluates at the four points ±2^{±b}. The main new feature is that the coefficients overlap even in the evaluation phase, so some arithmetic is required in addition to the usual Pack calls.

Algorithm 5: Four-point Kronecker substitution
Input: nonconstant polynomials f, g in Z[x] with 0 ≤ f_i, g_i < 2^c for all i
Output: the product h = fg
  m ← len(f), n ← len(g), s ← m + n − 1, b ← ⌈(2c + lg(min(m, n)))/4⌉
  m_0 ← ⌈m/2⌉, n_0 ← ⌈n/2⌉, s_0 ← ⌈s/2⌉
  m_1 ← ⌊m/2⌋, n_1 ← ⌊n/2⌋, s_1 ← ⌊s/2⌋
  f^0 ← f_0 + f_2 x + · · · + f_{2m_0−2} x^{m_0−1},  f^1 ← f_1 + f_3 x + · · · + f_{2m_1−1} x^{m_1−1}
  X^0_fwd ← Pack(f^0, m_0, 2b),  X^0_rev ← Pack(Reverse(f^0, m_0), m_0, 2b)
  X^1_fwd ← Pack(f^1, m_1, 2b),  X^1_rev ← Pack(Reverse(f^1, m_1), m_1, 2b)
  if m ≡ 0 (mod 2) then Swap(X^0_rev, X^1_rev)
  X^+_fwd ← X^0_fwd + 2^b X^1_fwd,  X^+_rev ← X^0_rev + 2^b X^1_rev
  X^−_fwd ← X^0_fwd − 2^b X^1_fwd,  X^−_rev ← X^0_rev − 2^b X^1_rev
  Repeat the last six lines with f, m, X replaced by g, n, Y
  Z^+_fwd ← X^+_fwd · Y^+_fwd,  Z^+_rev ← X^+_rev · Y^+_rev
  Z^−_fwd ← X^−_fwd · Y^−_fwd,  Z^−_rev ← X^−_rev · Y^−_rev
  Z^0_fwd ← (Z^+_fwd + Z^−_fwd)/2,  Z^0_rev ← (Z^+_rev + Z^−_rev)/2
  Z^1_fwd ← (Z^+_fwd − Z^−_fwd)/2^{b+1},  Z^1_rev ← (Z^+_rev − Z^−_rev)/2^{b+1}
  if s ≡ 0 (mod 2) then Swap(Z^0_rev, Z^1_rev)
  h^0 ← Recover(Z^0_fwd, Z^0_rev, s_0, 2b),  h^1 ← Recover(Z^1_fwd, Z^1_rev, s_1, 2b)
  return h^0(x^2) + x h^1(x^2)

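The following Python sketch of Algorithm 5 is illustrative only; it follows the pseudocode above step by step, with hypothetical names (pack, recover, split_and_pack, ks4), and repeats the recover sketch from Section 4 for self-containment.

def pack(p, b):
    """p(2^b) for a nonnegative coefficient list p, low to high."""
    return sum(pi << (i * b) for i, pi in enumerate(p))

def recover(X, Y, n, b):
    """Algorithm 3 (same sketch as in Section 4), repeated for self-containment."""
    mask = (1 << b) - 1
    p = [(X >> (i * b)) & mask for i in range(n + 1)]
    q = [(Y >> (i * b)) & mask for i in range(n + 1)]
    f = [0] * n
    alpha_prev, beta_prev, c_prev = 0, 0, 0
    for i in range(n):
        c = 1 if p[i] < beta_prev + c_prev else 0
        alpha = (p[i] - beta_prev - c_prev) % (1 << b)
        d = 1 if q[n - 1 - i] < alpha else 0
        beta = (q[n - i] - alpha_prev - d) % (1 << b)
        f[i] = alpha + (beta << b)
        alpha_prev, beta_prev, c_prev = alpha, beta, c
    return f

def split_and_pack(p, length, b):
    """Evaluation half of Algorithm 5 for one input polynomial: returns
    (P+fwd, P-fwd, P+rev, P-rev), encoding p(+-2^b) and Reverse(p)(+-2^b)."""
    p0, p1 = p[0::2], p[1::2]                    # even- and odd-index coefficients
    fwd0, fwd1 = pack(p0, 2 * b), pack(p1, 2 * b)
    rev0, rev1 = pack(p0[::-1], 2 * b), pack(p1[::-1], 2 * b)
    if length % 2 == 0:                          # swap so rev0/rev1 encode the
        rev0, rev1 = rev1, rev0                  # even/odd terms of Reverse(p)
    return (fwd0 + (fwd1 << b), fwd0 - (fwd1 << b),
            rev0 + (rev1 << b), rev0 - (rev1 << b))

def ks4(f, g, c):
    """Algorithm 5 (four points +-2^(+-b)): multiply f, g with 0 <= coefficients < 2^c."""
    m, n = len(f), len(g)
    s = m + n - 1
    b = (2 * c + (min(m, n) - 1).bit_length() + 3) // 4   # b = ceil((2c + lg min(m,n)) / 4)
    Xpf, Xmf, Xpr, Xmr = split_and_pack(f, m, b)
    Ypf, Ymf, Ypr, Ymr = split_and_pack(g, n, b)
    Zpf, Zmf = Xpf * Ypf, Xmf * Ymf               # the four pointwise products
    Zpr, Zmr = Xpr * Ypr, Xmr * Ymr
    Z0f, Z1f = (Zpf + Zmf) // 2, (Zpf - Zmf) >> (b + 1)
    Z0r, Z1r = (Zpr + Zmr) // 2, (Zpr - Zmr) >> (b + 1)
    if s % 2 == 0:
        Z0r, Z1r = Z1r, Z0r
    h0 = recover(Z0f, Z0r, (s + 1) // 2, 2 * b)   # even-index coefficients of h
    h1 = recover(Z1f, Z1r, s // 2, 2 * b)         # odd-index coefficients of h
    h = [0] * s
    h[0::2], h[1::2] = h0, h1
    return h

f = [274, 610, 887, 621]
g = [553, 298, 424, 790]
print(ks4(f, g, c=10))   # same coefficients as the schoolbook product of f and g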
Proposition 6. Let f, g ∈ Z[x] be polynomials of lengths m and n respectively, with 0 ≤ f_i, g_i < 2^c for all i. Algorithm 5 computes h = fg in time 4M(mb + b + 1, nb + b + 1) + O((m + n)b), where b = ⌈(2c + lg(min(m, n)))/4⌉.

Proof. We first decompose f as f(x) = f^0(x^2) + x f^1(x^2). Then X^0_fwd = f^0(2^{2b}) and X^1_fwd = f^1(2^{2b}) (the Pack calls are legal since c ≤ 2b), so we obtain X^±_fwd = f^0(2^{2b}) ± 2^b f^1(2^{2b}) = f(±2^b). For the reversed versions, note that if m is even we must swap X^0_rev and X^1_rev so that X^0_rev encodes the even-index terms and X^1_rev the odd-index terms of Reverse(f, m). We thus obtain X^±_rev = 2^{(m−1)b} f(±2^{−b}). The evaluation block for g is similar. The pointwise multiplications compute Z^±_fwd = h(±2^b) and Z^±_rev = 2^{(s−1)b} h(±2^{−b}). Since c ≤ 2b we have f(2^b) ≤ Σ_{i=0}^{m−1} (2^c − 1) 2^{ib} = (2^c − 1)(2^{mb} − 1)/(2^b − 1) ≤ (2^b + 1)(2^{mb} − 1) ≤ 2^{mb+b+1}. This estimate, and analogous estimates for the other evaluation points, account for the M(mb + b + 1, nb + b + 1) terms in the running time bound.


We next obtain Z^0_fwd = h_0 + h_2 2^{2b} + · · · + h_{2s_0−2} 2^{(2s_0−2)b} and Z^1_fwd = h_1 + h_3 2^{2b} + · · · + h_{2s_1−1} 2^{(2s_1−2)b}. Taking into account the Swap call if s is even, we find that Z^0_rev = h_{2s_0−2} + h_{2s_0−4} 2^{2b} + · · · + h_0 2^{(2s_0−2)b} and Z^1_rev = h_{2s_1−1} + h_{2s_1−3} 2^{2b} + · · · + h_1 2^{(2s_1−2)b}. Lemma 4(a) with r = 4 implies that h_i < 2^{2b}(2^{2b} − 1), so the final Recover calls are valid. □

Remark. For the case of signed coefficients |f_i|, |g_i| < 2^c, we replace the Pack and Recover calls by Signed-Pack and Signed-Recover, and redefine b := ⌈(2c + 1 + lg(min(m, n)))/4⌉. The Signed-Pack calls are valid since c ≤ 2b − 1, and the Signed-Recover call is valid since |h_i| < 2^{2b−1}(2^{2b} − 1) by Lemma 4(b).

6. Empirical performance and applications

The author implemented the above algorithms in zn_poly (Harvey, 2008), a C library for arithmetic in (Z/nZ)[x] where n fits into a single machine word. The library has been used in several number-theoretic applications, including computations of zeta functions of hyperelliptic curves of genus three over prime fields (Harvey, 2007), computations of L-functions of hyperelliptic curves over Q (Kedlaya and Sutherland, 2008), and an ongoing project with Joe Buhler to extend the verification of Vandiver's conjecture and computation of irregular primes and cyclotomic invariants carried out in Buhler et al. (2001).

We compared the performance of several algorithms in zn_poly with the Magma computer algebra system (version 2.14-15) (Bosma et al., 1997) and the zz_pX class from Victor Shoup's NTL library (version 5.4.2) (Shoup, 2007), for polynomials in (Z/nZ)[x] where n = 194932887582161 (a random odd 48-bit modulus). The tests were performed using a single core of an otherwise idle dual-core 64-bit 2.8 GHz AMD Opteron (revision 254) machine with 20 GB RAM, running Linux (Fedora Core 8). Both our code and NTL were linked against GMP (the GNU Multiple Precision Arithmetic Library, version 4.2.3) (Granlund, 2008), and we applied an assembly patch written by Pierrick Gaudry that improves the performance of GMP on the Opteron. Magma links statically against GMP and also uses Gaudry's patch. Our code and NTL were compiled with gcc 4.3.1.

For timing our code and NTL we repeatedly called the multiplication routine directly from C/C++, using NTL's GetTime function. For Magma we used the Cputime function from the Magma interpreter; the interpreter loop overhead was measured and found to be insignificant (<1% of total time). In all cases, the multiplication routine is called several times before the timing begins, to ameliorate priming effects in the code and data cache, and we do not include time for initialisation, clean-up or generation of random polynomials. The test suite was run three times to check stability of the timings; we found no discrepancy beyond a few percent.

Figs. 1 and 2 show the performance of the various implementations. The horizontal axis is lg d, where d is the polynomial length; the largest polynomials considered are of length 2^27 = 134,217,728, which occupy about 1 GB of storage. The vertical axis is a normalised measure of the time required for a single multiplication, given by lg(t/(d log d)), where t is measured in seconds.

Magma uses ordinary Kronecker substitution for this class of polynomial multiplication problems, except when the degree is very small (Steel, 2008); therefore its speed depends mainly on its large integer multiplication performance. NTL uses several algorithms. For small degree problems, it uses Karatsuba with classical multiplication as a base case. This approach outperforms the other implementations for degree less than about 16 (in which case only the classical algorithm is being used). For degree larger than 1500, NTL lifts the problem to Z[x] and applies an FFT over Z/pZ (a number-theoretic transform) for several word-sized primes p, reconstructing the result via the Chinese Remainder Theorem. The change in algorithms at d = 1500 is clearly visible in Fig. 2. The sawtooth shape occurs because the inputs are zero-padded up to the next power-of-two transform length. After d = 2^16 the performance degrades by a factor of around two; this is due to the transform no longer fitting into the L2 cache.
Fig. 1 shows the performance of the Kronecker substitution variants as implemented in zn_poly, which are applied after lifting the problem to Z[x]. The labels KS1, KS2, KS3, KS4 refer respectively to Algorithms 1, 2, 4 and 5. The underlying integer arithmetic is performed by GMP.

D. Harvey / Journal of Symbolic Computation 44 (2009) 1502–1510

1509

[Fig. 1. Performance of four Kronecker substitution variants (KS1, KS2, KS3, KS4). Horizontal axis: polynomial length (log base 2); vertical axis: normalised running time.]

[Fig. 2. Performance of Magma, NTL, and zn_poly (KS4, FFT1, FFT2). Horizontal axis: polynomial length (log base 2); vertical axis: normalised running time.]

Several features are apparent from the graph. For very small degrees, KS1 is the fastest and KS4 the slowest; this is due to larger overhead in the more complicated algorithms. There is a small region, for 13 < d < 32, where KS2 is the fastest. For d > 32, KS4 is uniformly the fastest. KS2 is about 1.25–1.35 times faster than KS1 over the range 30 < d < 3500, and KS4 is about 1.6–1.9 times faster than KS1 over the range 100 < d < 4000. The running times begin to converge for larger d; as discussed in Section 1, this occurs because the running time of the large integer multiplication is asymptotically close to linear.

For large degree, zn_poly implements a variant of the Schönhage–Nussbaumer convolution algorithm (Schönhage, 1976–1977; Nussbaumer, 1980). This algorithm is only available when the modulus is odd. The strategy is to split the input polynomials into pieces of length M/2, and map the problem to a convolution in R[z]/(z^K − 1) for R = (Z/nZ)[y]/(y^M + 1), where K and M are powers of two with K dividing 2M, and where K is large enough to accommodate the product. The ring R contains a principal Kth root of unity (namely y^{2M/K}), so the convolution may be effected via FFTs over R.
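The claim that y^{2M/K} is a principal Kth root of unity in R can be checked directly on toy parameters. The sketch below is an illustrative aside with hypothetical names (mul_mod, power) and naive negacyclic arithmetic; it has nothing to do with the actual zn_poly code.

# Toy check that w = y^(2M/K) is a principal K-th root of unity in
# R = (Z/nZ)[y]/(y^M + 1): w^K = 1 and sum_{j=0}^{K-1} w^(j*l) = 0 for 0 < l < K.
# Elements of R are lists of M coefficients modulo n.
n, M, K = 97, 8, 8            # toy parameters: M and K powers of two, K | 2M

def mul_mod(a, b):
    """Multiply two elements of R = (Z/nZ)[y]/(y^M + 1) (negacyclic convolution)."""
    out = [0] * M
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < M:
                out[k] = (out[k] + ai * bj) % n
            else:
                out[k - M] = (out[k - M] - ai * bj) % n   # y^M = -1
    return out

def power(a, e):
    """a^e in R by repeated multiplication (fine for a toy check)."""
    result = one[:]
    for _ in range(e):
        result = mul_mod(result, a)
    return result

one = [1] + [0] * (M - 1)
w = [0] * M
w[2 * M // K] = 1             # w = y^(2M/K)

assert power(w, K) == one                       # w^K = 1
for ell in range(1, K):
    total = [0] * M
    for j in range(K):
        total = [(t + x) % n for t, x in zip(total, power(w, j * ell))]
    assert total == [0] * M                     # the sums vanish, so w is principal
print("y^(2M/K) is a principal K-th root of unity for", (n, M, K))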

The performance is smoothed by taking advantage of van der Hoeven's truncated FFTs (van der Hoeven, 2004, unpublished); the sawtooth shape is consequently not as pronounced as that observed for NTL. The pointwise multiplications (products in R) are handled by KS1, KS2 or KS4 for sufficiently small M, or by Nussbaumer's algorithm (Nussbaumer, 1980) for larger M, with optimal thresholds selected automatically.

The graph labelled FFT1 in Fig. 2 shows the performance of this algorithm. It is already the fastest at degree 2^13, and is at least twice as fast as any of the KS variants or the other systems after degree 2^20. The graph labelled FFT2 is for the same algorithm, but without allowing KS2 or KS4 to participate in the pointwise multiplications; only KS1 and Nussbaumer's algorithm are permitted. We observe that FFT1 is about 1.25–1.3 times faster than FFT2 for 2^19 < d < 2^23, widening to about 1.4 times faster for 2^23 < d < 2^27. Thus, while the Kronecker substitution variants presented in this paper do not directly apply to the large degree case, they can provide a significant indirect benefit.

Acknowledgements

Many thanks go to Paul Zimmermann, Andrew Sutherland, William Hart, and the referees for their comments on this paper. Thanks are also due to Joe Buhler for arranging access to the hardware on which the timings were performed.

References

Bosma, Wieb, Cannon, John, Playoust, Catherine, 1997. The Magma algebra system. I. The user language. J. Symbolic Comput. 24 (3–4), 235–265.
Buhler, Joe, Crandall, Richard, Ernvall, Reijo, Metsänkylä, Tauno, Shokrollahi, M. Amin, 2001. Irregular primes and cyclotomic invariants to 12 million. J. Symbolic Comput. 31 (1–2), 89–96. (Computational algebra and number theory, Milwaukee, WI, 1996.)
Fateman, Richard J., 2005. Can you save time in multiplying polynomials by encoding them as integers? http://www.cs.berkeley.edu/~fateman/papers/polysbyGMP.pdf.
Granlund, Torbjörn, 2008. The GNU Multiple Precision Arithmetic library. http://gmplib.org/.
Harvey, David, 2007. Kedlaya's algorithm in larger characteristic. Int. Math. Res. Not. 2007, Art. ID rnm095, 29 pp.
Harvey, David, 2008. The zn_poly library. http://www.cims.nyu.edu/~harvey/zn_poly/.
Kedlaya, Kiran S., Sutherland, Andrew, 2008. Computing L-series of hyperelliptic curves. In: ANTS VIII. Lecture Notes in Computer Science, vol. 5011. Springer, pp. 312–326.
Kronecker, Leopold, 1882. Grundzüge einer arithmetischen Theorie der algebraischen Grössen. Journal für die reine und angewandte Mathematik (92), 1–122.
Nussbaumer, Henri J., 1980. Fast polynomial transform algorithms for digital convolution. IEEE Trans. Acoust. Speech Signal Process. 28 (2), 205–215.
Schönhage, Arnold, 1976–1977. Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2. Acta Inform. 7 (4), 395–398.
Schönhage, Arnold, 1982. Asymptotically fast algorithms for the numerical multiplication and division of polynomials with complex coefficients. In: Computer Algebra, EUROCAM '82. LNCS, vol. 144, pp. 3–15.
Shoup, Victor, 2007. NTL: A library for doing number theory. http://www.shoup.net/ntl/.
Steel, Allan, 2008. Personal communication.
van der Hoeven, Joris, 2004. The truncated Fourier transform and applications. In: ISSAC 2004. ACM, New York, pp. 290–296.
van der Hoeven, Joris, 2005. Notes on the truncated Fourier transform (unpublished). Retrieved September 2008 from http://www.math.u-psud.fr/~vdhoeven/.