Applied Mathematics and Computation 189 (2007) 410–424 www.elsevier.com/locate/amc
Accurate floating-point summation: a new approach

A. Eisinberg, G. Fedele *

Dip. Elettronica Informatica e Sistemistica, Università degli Studi della Calabria, 87036 Rende (Cs), Italy

Abstract

The aim of this paper is to find an accurate and efficient algorithm for evaluating the summation of large sets of floating-point numbers. We present a new representation of the floating-point number system, in which a number is expressed as a linear combination of integers whose coefficients are powers of the base of the floating-point system. This representation allows us to build an accurate summation algorithm, based on the fact that no rounding error occurs when two integer numbers are summed or when a floating-point number is multiplied by a power of the base of the floating-point system. The proposed algorithm is competitive in terms of computational effort and, under some assumptions, the computed sum is highly accurate. Under such assumptions, which are not restrictive in practical applications, we prove that the relative error of the computed sum is bounded by the unit roundoff.

© 2006 Elsevier Inc. All rights reserved.

Keywords: Floating-point system; Summation algorithms
1. Introduction

Many problems, such as numerical integration, the numerical solution of differential equations, inner products, means, variances, norms, and all kinds of nonlinear functions, involve computing sums with many terms. Because of this ubiquity, and since summation is a basic task in numerical analysis, numerous summation algorithms exist; Higham [1] devotes an entire chapter to summation. Excellent reviews of applications in many different areas of numerical analysis can be found in [1-4], and two recent applications appear in [5,6]. In [5], Priest's distillation algorithm [4] is used in the problem of surface interrogation and intersection, which depends crucially on good root-finding algorithms; these in turn depend on accurate polynomial evaluation and, indirectly, on accurate summation. In [6], Demmel and Hida present several simple algorithms for accurately computing the sum of n floating-point numbers using a wider accumulator, and apply them to a robust geometric predicate which determines whether a point D lies to the left or to the right of the oriented plane defined by the points A, B and C.

Throughout the paper we assume a floating-point arithmetic adhering to the IEEE 754 floating-point standard [7]. We use only the working precision for floating-point computations. If this working precision is IEEE 754 double precision, it corresponds to 53 bits of precision, including an implicit bit.

* Corresponding author. E-mail address: [email protected] (G. Fedele).
The set of working-precision floating-point numbers is denoted by F; it is the subset of the real numbers defined as follows:

F = \{\, y \in \mathbb{R} \mid y = \pm\, b^{e} \sum_{k=1}^{t} d_k b^{-k} \,\}.    (1)
The system F is characterized by four integer parameters:

• the base b \ge 2,
• the precision t \ge 1,
• the exponent range e_{min} \le e \le e_{max}.

Each digit d_k satisfies 0 \le d_k \le b - 1, and d_1 \ne 0 for normalized numbers. An excellent tutorial on many aspects of floating-point arithmetic is given in [8]. We denote by fl(\cdot) the result of a floating-point computation, where all operations inside the parentheses are executed in working precision. Floating-point operations according to IEEE 754 satisfy the following standard model [1]:

fl(x \,\mathrm{op}\, y) = (x \,\mathrm{op}\, y)(1 + \delta), \quad |\delta| \le u, \quad \mathrm{op} \in \{+, -, \times, /\}, \quad x, y \in F,    (2)

where u = \frac{1}{2} b^{1-t} is called the unit roundoff.
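For IEEE double precision (b = 2, t = 53) this gives u = 2^{-53} \approx 1.11 \times 10^{-16}. As a minimal Matlab illustration of absorption at this threshold (our own snippet, not from the paper):

u = 2^(-53);      % unit roundoff in IEEE double precision
(1 + u) == 1      % true: fl(1 + u) = 1, the tie rounds to even
(1 + 2*u) == 1    % false: 2u = eps(1), the gap between 1 and the next double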
1.1. Previous summation algorithms

We collect some notes on previous work to put our results into a suitable perspective; our remarks are by no means complete. It is well known that, unless precautions are taken, the summation of large sets of numbers can be very inaccurate due to the accumulation of rounding errors. Many summation algorithms aim at a compromise between the accuracy of the computed sum and the computational effort. Here we briefly describe some of them, with pseudo-codes and error bounds. Let s be the sum of n floating-point numbers to be evaluated:

s = \sum_{i=1}^{n} x_i,    (3)
and let \hat{s} be the computed sum.

1.1.1. Recursive summation

The most commonly used algorithm is:

\hat{s} = 0
for i = 1:n
    \hat{s} = \hat{s} + x_i
end

It is well known that the ordering of the addends plays a crucial role in the performance of this algorithm [9].

1.1.2. Pairwise summation

The algorithm uses a queue as its data structure. At every step, the partial sum of the first two elements extracted is inserted into the queue. These elements are initially the x_i's and later the computed partial sums. The related pseudo-code is:

make queue of x
while queue is not empty do
    addend1 <- move front queue of x
    addend2 <- move front queue of x
    \hat{s} <- addend1 + addend2
    insert tail queue(x, \hat{s})
end while

The sum is obtained in \log_2 n structurally independent stages and so can be executed in parallel.

1.1.3. Insertion summation

The key idea of this approach is to encourage additions between numbers of similar magnitude. It is based on an ordered tree summation:

build heap(x)
while heap has more than 1 element do
    addend1 <- move minimum from heap x
    addend2 <- move minimum from heap x
    \hat{s} <- addend1 + addend2
    insertHeap(x, \hat{s})
    rebuildHeap(x)
end while

1.1.4. Error analysis

The three algorithms described in the previous subsections can be considered particular cases of the following pseudo-code:

Let S = \{x_1, x_2, \ldots, x_n\}
repeat while S contains more than 1 element
    remove two numbers x and y from S
    add their sum x + y to S
end
Assign the remaining element of S to \hat{s}.

Note that there are n numbers to be added and consequently n - 1 additions to be performed. If T_i = x_i + y_i is the exact sum formed at the ith execution of the repeat loop, then the computed sum satisfies

\hat{T}_i = \frac{x_i + y_i}{1 + \delta_i}, \quad |\delta_i| \le u, \quad i = 1, 2, \ldots, n-1.    (4)

The partial error is \delta_i \hat{T}_i. The overall error is the sum of the partial errors:

E_n = S_n - \hat{S}_n = \sum_{i=1}^{n-1} \delta_i \hat{T}_i.    (5)
The error bound is therefore

|E_n| \le u \sum_{i=1}^{n-1} |\hat{T}_i|.    (6)
Since

|\hat{T}_i| \le \sum_{j=1}^{n} |x_j| + O(u), \quad \forall i,    (7)
then (6) becomes

|E_n| \le u \sum_{i=1}^{n-1} \left( \sum_{j=1}^{n} |x_j| + O(u) \right),    (8)
that is,

|E_n| \le (n-1) \, u \sum_{i=1}^{n} |x_i| + O(u^2).    (9)
For pairwise summation, taking into account its particular structure and under the assumption that n = 2^r, the bound for the relative error is improved:

|E_n| \le \frac{\sum_{i=1}^{n} |x_i|}{\left|\sum_{i=1}^{n} x_i\right|} \cdot \frac{nu}{1 - nu}.    (10)
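As an illustration, a minimal recursive (halving) realization of pairwise summation in Matlab might read as follows; the function name pairwiseSum is our own, not from the paper:

function s = pairwiseSum(x)
% Pairwise (divide-and-conquer) summation of the vector x.
n = length(x);
if n <= 2
    s = sum(x);
else
    m = floor(n/2);
    s = pairwiseSum(x(1:m)) + pairwiseSum(x(m+1:end));
end
end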
1.1.5. Compensated summation

Floating-point summation is frequently improved by compensated summation, a recursive summation with a correction term designed to reduce the rounding errors [10-15]. The compensated summation due to Kahan [12] uses a correction term e at every step of the recursive scheme:

\hat{s} = 0
e = 0
for i = 1:n
    temp = \hat{s}
    y = x_i + e
    \hat{s} = temp + y
    e = (temp - \hat{s}) + y
end

The error of the computed sum is bounded as follows:

|E_n| \le (2u + O(nu^2)) \sum_{i=1}^{n} |x_i|.    (11)
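For reference, a direct Matlab transcription of this scheme (our own sketch; the function name kahanSum is ours):

function s = kahanSum(x)
% Compensated (Kahan) summation of the vector x.
s = 0; e = 0;
for i = 1:length(x)
    temp = s;
    y = x(i) + e;        % add the running correction to the next addend
    s = temp + y;
    e = (temp - s) + y;  % recover the rounding error of the addition
end
end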
Priest [4] derives an ingenious algorithm from Kahan's by introducing two correction terms:

Sort the x_i so that |x_1| \ge |x_2| \ge \cdots \ge |x_n|
s_1 = x_1
c_1 = 0
for k = 2:n
    y_k = c_{k-1} + x_k
    u_k = x_k - (y_k - c_{k-1})
    t_k = y_k + s_{k-1}
    v_k = y_k - (t_k - s_{k-1})
    z_k = u_k + v_k
    s_k = t_k + z_k
    c_k = z_k - (s_k - t_k)
end
\hat{s} = s_n

Priest shows that, under the assumption n \le b^{t-3}, the computed sum \hat{s} satisfies

|s - \hat{s}| \le 2u|s|.    (12)
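Since the algorithm in Table 2 below calls a function named priest, we sketch here a minimal Matlab realization of the above pseudo-code (our own transcription, not the paper's code):

function s = priest(x)
% Doubly compensated (Priest) summation; sorts x by decreasing magnitude first.
[~, idx] = sort(abs(x), 'descend');
x = x(idx);
s = x(1); c = 0;
for k = 2:length(x)
    y = c + x(k);
    u = x(k) - (y - c);
    t = y + s;
    v = y - (t - s);
    z = u + v;
    s = t + z;
    c = z - (s - t);
end
end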
1.2. Our goal

In this paper we present a fast and accurate summation algorithm that uses only the working precision to compute the result. This means that all operations are performed without wider accumulators and without extended precision. Moreover, no ordering of the floating-point numbers to be summed is needed. Our method consists of two phases:

• Each number is mapped into an equivalent, new representation of the floating-point number system, denoted \hat{F}. In \hat{F} a number is represented as a linear combination of integer numbers belonging to F, with coefficients that are powers of the base of the floating-point system.
• The numbers are summed, exploiting the fact that no error occurs when two integer numbers are summed or when a number is multiplied by a power of the base of the floating-point system.

Moreover, we give simple criteria to check the accuracy of the result, and we show that, under an assumption on the magnitude of the numbers involved, the computed sum is extremely accurate and its relative error is bounded by u. In our opinion this fact is important because it improves the theoretical bound of Priest's algorithm, which is twice the unit roundoff. If the hypotheses are violated, we show how to modify our approach.

1.3. Outline of the paper

The paper is organized as follows. In Section 2 we present the new representation of the floating-point system \hat{F} and show the equivalence between F and \hat{F}. In the next two sections we describe our summation algorithms and discuss their numerical properties, both by error analysis and by reporting numerical results, including timing, accuracy, comparisons with other algorithms and some remarks on practical applications.

2. A new representation of the floating-point number system

The algorithms introduced in the previous section achieve good relative-error results through arrangements and compensations that, however, do not modify the numbers to be summed: every element is added, as extracted, to an accumulator or to a correction term. An alternative approach is to modify the elements in advance, in such a way as to make the operations among the modified values as accurate as possible. The approach proposed in this paper computes the sum of n floating-point numbers as a linear combination of sums of integer values, multiplied by coefficients that are powers of the base of the floating-point system.

Every floating-point number y \in F can be represented as follows:

y = \mathrm{sgn}(y) \sum_{j=1}^{q} y_j \, b^{e - A(j)},    (13)
where:

y_j = \sum_{k=A(j-1)+1}^{A(j)} d_k \, b^{A(j)-k}, \quad j = 1, 2, \ldots, q,    (14)

A(j) = A(j-1) + a_j, \quad j = 1, 2, \ldots, q, \qquad A(0) = 0,    (15)

a_j \ \text{integer}, \quad a_j \ge 1, \quad j = 1, 2, \ldots, q, \qquad \sum_{j=1}^{q} a_j = t,    (16)

\mathrm{sgn}(y) = \begin{cases} +1, & y > 0 \\ 0, & y = 0 \\ -1, & y < 0 \end{cases}    (17)

Then \hat{F} is the set of floating-point numbers defined by (13).
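For instance, in IEEE double precision (b = 2, t = 53), choosing q = 2 with a_1 = 26 and a_2 = 27 gives A(1) = 26 and A(2) = 53, so that (13) reads y = \mathrm{sgn}(y)\,(y_1 2^{e-26} + y_2 2^{e-53}), where y_1 and y_2 are integers of at most 26 and 27 bits, respectively; this is the splitting used in Example 1 below.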
Formula (13) can easily be proved by inspection. Note that each y_j is an integer, as can be seen from (14). The next proposition shows how to compute the integer quantities y_j.

Proposition 1. Let

y = b^e \sum_{k=1}^{t} d_k \, b^{-k}    (18)

be a floating-point number; then y_j can be expressed as

y_j = \left\lfloor b^{A(j)} y_n - \sum_{i=1}^{j-1} b^{A(j)-A(i)} y_i \right\rfloor,    (19)
where

y_n = \frac{y}{b^e} = \sum_{k=1}^{t} d_k \, b^{-k}    (20)

and \lfloor x \rfloor is the greatest integer less than or equal to x.

Proof. It is sufficient to show that (19) and (14) are equivalent. Substituting (20) and (14) into (19) we have

y_j = \left\lfloor \sum_{k=1}^{t} d_k \, b^{A(j)-k} - \sum_{i=1}^{j-1} b^{A(j)-A(i)} \sum_{k=A(i-1)+1}^{A(i)} d_k \, b^{A(i)-k} \right\rfloor.    (21)
By simple algebraic manipulations [17] we obtain

y_j = \left\lfloor \sum_{k=1}^{t} d_k \, b^{A(j)-k} - \sum_{k=1}^{A(j-1)} d_k \, b^{A(j)-k} \right\rfloor = \left\lfloor \sum_{k=A(j-1)+1}^{t} d_k \, b^{A(j)-k} \right\rfloor = \sum_{k=A(j-1)+1}^{A(j)} d_k \, b^{A(j)-k},    (22)

where the last equality holds because the terms with k > A(j) contribute a nonnegative quantity smaller than 1, which the floor discards. □
Note that the value of e in (20) can be computed by using the result in the next proposition.

Proposition 2. Let y = b^e \sum_{k=1}^{t} d_k \, b^{-k}, d_1 \ne 0, be a floating-point number; then

e = \lfloor \log_b |y| \rfloor + 1.    (23)

Proof. One has

\log_b |y| = e + \log_b \left( \sum_{k=1}^{t} d_k \, b^{-k} \right).    (24)

Since

\log_b \frac{1}{b} \le \log_b \left( \sum_{k=1}^{t} d_k \, b^{-k} \right) \le \log_b (1 - b^{-t})    (25)

and 1 - b^{-t} < 1, then

-1 \le \log_b \left( \sum_{k=1}^{t} d_k \, b^{-k} \right) < 0.    (26)

By (24) one has \log_b |y| = e + c with -1 \le c < 0; therefore Eq. (23) holds. □
Example 1. Assume the IEEE double precision standard (t = 53, b = 2) and transform the floating-point number y = fl(\sqrt{2}) = 1.4142135623730951 into the new representation with q = 2. Without loss of generality we choose a_1 = 26 and a_2 = 27, so that A(0) = 0, A(1) = 26 and A(2) = 53.
By Eq. (23) it follows that e = \lfloor \log_2 |y| \rfloor + 1 = 1; therefore

y_n = \frac{y}{2^e} = 0.7071067811865476.

Finally we compute the integer quantities by Eq. (19):

y_1 = \lfloor 2^{A(1)} y_n \rfloor = 47453132,
y_2 = \lfloor 2^{A(2)} y_n - 2^{A(2)-A(1)} y_1 \rfloor = 109001677.

Therefore, by Eq. (13),

y = 47453132 \cdot 2^{-25} + 109001677 \cdot 2^{-52}.

From Eq. (19) it is easy to derive a recurrence for the y_j, j = 1, 2, \ldots, q. Let s_j be the quantity defined as

s_j = b^{A(j)} y_n - \sum_{i=1}^{j-1} b^{A(j)-A(i)} y_i, \quad j = 1, 2, \ldots, q;    (27)

then y_j = \lfloor s_j \rfloor, j = 1, 2, \ldots, q, and by simple manipulations

s_{j+1} = b^{a_{j+1}} (s_j - y_j), \quad j = 1, 2, \ldots, q-1.    (28)
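The example can be checked directly in Matlab; the following snippet is ours and simply replays the computation above:

% Replaying Example 1 (q = 2, a1 = 26, a2 = 27) in IEEE double precision.
y  = sqrt(2);                        % fl(sqrt(2)) = 1.4142135623730951
e  = floor(log2(abs(y))) + 1;        % e = 1, by Eq. (23)
yn = y / 2^e;                        % 0.7071067811865476
y1 = floor(2^26 * yn);               % 47453132
y2 = floor(2^53 * yn - 2^27 * y1);   % 109001677
y - (y1 * 2^(-25) + y2 * 2^(-52))    % 0: the representation is exact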
The following pseudo-code maps a generic floating-point number into the new representation (13), using the IEEE double precision standard (t = 53, b = 2):

function fp2nfp
Input:  x = floating-point number
        q \ge 2 = number of splittings
Output: y = integer vector containing the y_j in (14)
        esp = vector containing the quantities esp(i) = e - A(i)

a(i) = \lfloor t/q \rfloor, i = 1, 2, \ldots, q-1
a(q) = t - \sum_{k=1}^{q-1} a(k)
e = \lfloor \log_2(|x|) \rfloor + 1
f = |x| \cdot 2^{-e}
aux = f
for i = 1:q
    aux = aux \cdot 2^{a(i)}
    y(i) = \lfloor aux \rfloor
    aux = aux - y(i)
end
esp(i) = e - \sum_{k=1}^{i} a(k), i = 1, 2, \ldots, q
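A direct Matlab transcription of this pseudo-code might look as follows (our own sketch; it handles one scalar x, and the sign is carried separately by sgn(x)):

function [y, esp] = fp2nfp(x, q)
% Map |x| into the representation (13): |x| = sum_i y(i) * 2^esp(i).
t = 53;
a = floor(t/q) * ones(1, q);
a(q) = t - sum(a(1:q-1));
e = floor(log2(abs(x))) + 1;
aux = abs(x) * 2^(-e);          % normalized mantissa, 1/2 <= aux < 1
y = zeros(1, q);
for i = 1:q
    aux = aux * 2^a(i);
    y(i) = floor(aux);          % integer piece y_i of (14)
    aux = aux - y(i);           % remaining fractional part
end
esp = e - cumsum(a);            % esp(i) = e - A(i)
end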
3. The proposed summation algorithm

The representation proposed in the previous section can be used to compute the quantity

s = \sum_{i=1}^{n} x_i,    (29)
where each element is expressed as

x_i = \mathrm{sgn}(x_i) \, b^{e(i)} \sum_{j=1}^{q} y_j(i) \, b^{-A(j)}    (30)

with

y_j(i) = \sum_{k=A(j-1)+1}^{A(j)} d_k^{(i)} \, b^{A(j)-k}.    (31)
Let

esp_{min} = \min\{ e(i) \mid i = 1, 2, \ldots, n \}, \qquad esp_{max} = \max\{ e(i) \mid i = 1, 2, \ldots, n \}    (32)

be the minimum and the maximum, respectively, of the addend exponents; then each exponent can be expressed as

e(i) = esp_{min} + c(i), \quad i = 1, 2, \ldots, n,    (33)

where

c(i) \ge 0, \quad i = 1, 2, \ldots, n.    (34)

So

x_i = \mathrm{sgn}(x_i) \, b^{esp_{min} + c(i)} \sum_{j=1}^{q} y_j(i) \, b^{-A(j)}    (35)

and

s = \sum_{j=1}^{q} f_j,    (36)

where

f_j = b^{esp_{min} - A(j)} \sum_{i=1}^{n} \mathrm{sgn}(x_i) \, y_j(i) \, b^{c(i)}.    (37)
The Matlab code in Table 1 implements the summation algorithm in the IEEE double precision standard (t = 53, b = 2).

3.1. Error analysis

Note that the relative rounding error involved in (36) is bounded by

\frac{\sum_{i=1}^{q} |f_i|}{\left| \sum_{i=1}^{q} f_i \right|} \, (q-1) \, u,    (38)

because the sum of n addends is reduced to the sum of q floating-point numbers. Obviously, this theoretical bound is guaranteed only if the sums of integer quantities appearing in (36),

\sum_{i=1}^{n} \mathrm{sgn}(x_i) \, y_j(i) \, b^{c(i)},    (39)

do not exceed the maximum integer number which can be represented in a t-digit register:

int_{max} = b^t - 1.    (40)
This assumption limits the number of elements that can be summed. In fact, y_j is bounded by

|y_j|_{max} = b^{a_j} - 1,    (41)
Table 1
Matlab algorithm sumEF1 for accurate summation

function s = sumEF1(v, q)
n = length(v);
t = 53;
alfa = floor(t/q) * ones(1, q);
alfa(q) = t - (q-1) * alfa(1);
[f, e] = log2(v);
aux = zeros(q, n);
y = zeros(q, n);
aux(1,:) = pow2(v, alfa(1) - e);
y(1,:) = floor(aux(1,:));
for i = 1:q-1
    aux(i+1,:) = aux(i,:) - y(i,:);
    aux(i+1,:) = pow2(aux(i+1,:), alfa(i+1));
    y(i+1,:) = floor(aux(i+1,:));
end
espMin = min(e);
for i = 1:q
    y(i,:) = pow2(y(i,:), e - espMin);
end
s = 0;
esp = 0;
for i = 1:q
    esp = esp + alfa(i);
    s = s + pow2(sum(y(i,:)), espMin - esp);
end
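As a usage illustration (our own snippet), one may call the function of Table 1 on a vector whose exponents span a narrow range, as in the experiment of Fig. 1 below:

v = 0.5 + 0.5*rand(1, 1000);   % addends uniformly distributed in [0.5, 1]
s = sumEF1(v, 2);              % proposed summation with q = 2
r = sum(v);                    % recursive summation, for comparison
abs(s - r)                     % the two results typically differ by a few ulps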
while c(i), in the worst case, takes the value

c_{max} = esp_{max} - esp_{min}.    (42)

The following inequality must hold:

n \, b^{esp_{max} - esp_{min}} \, |y_j|_{max} \le int_{max}, \quad j = 1, 2, \ldots, q;    (43)

therefore

n \le \frac{1}{b^{esp_{max} - esp_{min}}} \, \min\left\{ \frac{b^t - 1}{b^{a_1} - 1}, \frac{b^t - 1}{b^{a_2} - 1}, \ldots, \frac{b^t - 1}{b^{a_q} - 1} \right\}.    (44)
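As a concrete instance of (44) (our own arithmetic, not from the paper): with b = 2, t = 53, q = 2, a_1 = 26, a_2 = 27 and an exponent spread esp_{max} - esp_{min} = 4, the bound gives n \le (2^{53} - 1)/(2^4 (2^{27} - 1)) \approx 4.2 \times 10^6 addends.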
3.2. Optimal choice q = 2

The choice of q = 2 in (36) allows us to build a summation algorithm with an optimal relative error bound. In this case (36) becomes

s = b^{esp_{min} - A(1)} \sum_{i=1}^{n} \mathrm{sgn}(x_i) \, y_1(i) \, b^{c(i)} + b^{esp_{min} - A(2)} \sum_{i=1}^{n} \mathrm{sgn}(x_i) \, y_2(i) \, b^{c(i)}.    (45)

Such an expression contains the sum of two floating-point numbers. If these numbers are correctly formed, then by Eq. (2) the relative error of their sum is bounded by u. Formula (45) is error-free only if

n \le \frac{1}{b^{esp_{max} - esp_{min}}} \min\left\{ \frac{b^t - 1}{b^{a_1} - 1}, \frac{b^t - 1}{b^{a_2} - 1} \right\}.    (46)

In the IEEE double precision standard (t = 53, b = 2), choosing for example a_1 = 26 and a_2 = 27, the bound (46) is satisfied if

n \le \frac{b^t - 1}{b^{esp_{max} - esp_{min}} \, (b^{a_2} - 1)};    (47)
therefore, in order to have n \ge 1, it must be that

esp_{max} - esp_{min} \le 26.    (48)

If (48) holds but (47) is not satisfied, it follows that the integer quantities involved in (45),

f_1(i) = \mathrm{sgn}(x_i) \, y_1(i) \, b^{c(i)}, \quad i = 1, 2, \ldots, n,    (49)

and

f_2(i) = \mathrm{sgn}(x_i) \, y_2(i) \, b^{c(i)}, \quad i = 1, 2, \ldots, n,    (50)

are computable exactly, but their sums may still exceed the maximum representable integer number (40). As the f_1 and f_2 are integer quantities, their sums can be expressed in terms of a multiple of 2^{53} and a remainder:

\sum_{i=1}^{n} f_1(i) = D_1 2^{53} + R_1,    (51)

\sum_{i=1}^{n} f_2(i) = D_2 2^{53} + R_2.    (52)

Therefore (45) becomes

s = D_1 2^{esp_{min} - A(1) + 53} + R_1 2^{esp_{min} - A(1)} + D_2 2^{esp_{min} - A(2) + 53} + R_2 2^{esp_{min} - A(2)}.    (53)
The four elements in (53) are generated without errors because (48) holds. Their sum can be computed by Priest's algorithm, which guarantees a theoretical bound of 2u; in this case there is a drastic reduction of the time required, since only four floating-point numbers must be summed. The algorithm is summarized by the following pseudo-code:

function cutSum
Input:  v = vector of dimension n \ge n_{max}, with n_{max} equal to the r.h.s. of (47)
Output: computed sum

Calculate f_1 and f_2 as in (49) and (50), respectively.
Build k_1 sub-vectors of f_1, \{w_{1,1}, w_{1,2}, \ldots, w_{1,k_1}\}, such that \dim(w_{1,j}) \le n_{max}, j = 1, 2, \ldots, k_1.
Build k_2 sub-vectors of f_2, \{w_{2,1}, w_{2,2}, \ldots, w_{2,k_2}\}, such that \dim(w_{2,j}) \le n_{max}, j = 1, 2, \ldots, k_2.
Build a vector p_1 \in \mathbb{N}^{k_1} such that p_1(i) = \sum_{j=1}^{\dim(w_{1,i})} w_{1,i}(j), i = 1, 2, \ldots, k_1.
Build a vector p_2 \in \mathbb{N}^{k_2} such that p_2(i) = \sum_{j=1}^{\dim(w_{2,i})} w_{2,i}(j), i = 1, 2, \ldots, k_2.
Calculate D_1, D_2, R_1 and R_2 such that
    \sum_{i=1}^{k_1} p_1(i) = D_1 2^{53} + R_1,
    \sum_{i=1}^{k_2} p_2(i) = D_2 2^{53} + R_2.
Calculate (53) by using Priest's algorithm.

Moreover, if (48) is not satisfied, it is possible to split the original vector into a set of vectors for each of which (48) holds. In the worst case, not common in practice, the number of such vectors equals the number of elements to be summed. Then, for each vector, the partial sum can be computed by the algorithm in Table 1, and the whole sum is obtained by summing the partial sums with a compensated algorithm. The Matlab code in Table 2 implements this idea. Note that the algorithm can be improved by sorting the elements so that |x_i| \le |x_{i+1}|, i = 1, 2, \ldots, n-1.

3.3. Numerical experiments and computational aspects

In this section we present timing and accuracy results. We compare our summation algorithms (EF1, EF2) with recursive summation (RI), Kahan's algorithm (KA) and Priest's algorithm (PR). All the above algorithms were
Table 2
Matlab algorithm sumEF2 for accurate summation

function s = sumEF2(v)
n = length(v);
[f, e] = log2(v);
eMax = max(e);
eMin = min(e);
if (eMax - eMin) < 26
    s = sumEF1(v, 2);
else
    aux = v;
    k = 1;
    while length(aux) > 0
        pos = find(abs(e - e(1)) < 26);
        sp(k) = sumEF1(aux(pos), 2);
        aux(pos) = [];
        e(pos) = [];
        k = k + 1;
    end
    s = priest(sp);
end
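As an illustration of Table 2 (our own usage sketch), sumEF2 handles data with a wide exponent spread, such as the Set 4 data described below:

v = [ones(1,2047), 1.0e-18*ones(1,2), -ones(1,2047)];
s = sumEF2(v);    % the exact sum is 2.0e-18
r = sum(v);       % recursive summation returns 0 on this data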
performed on a Pentium-based computer whose unit roundoff is u = 2^{-53} \approx 1.11 \times 10^{-16}, and they have been implemented in Matlab [18] and in Mathematica [19], which allows arbitrary-precision numbers, in order to compare the computed sums with the "exact" ones (obtained with 1024-digit precision) and to investigate the relative error.

We considered the algorithms RI, KA, PR and EF1 (with q = 2) from a computational point of view by measuring the mean time necessary to perform a fixed number of iterations. The mean time was computed using two Matlab built-in functions, tic and toc: the former resets the clock, the latter returns the elapsed time since the latest reset. The results are presented in Table 3.

If the assumptions (47) and (48) are satisfied, then the relative error bound of our algorithm is u. For this case we performed an experiment to estimate numerically the probability density function of the relative error (PDFRE) for the algorithm EF1 (with q = 2): we generated 1 500 000 vectors of 10 numbers uniformly distributed in the interval [0.5, 1]. Fig. 1 shows the PDFRE normalized to the unit roundoff; note that the distribution is bimodal.

In the other cases we compare summation methods through statistical estimates of the error, which may be more representative of the average case [15]. We consider the following sets of data for the algorithm EF1:

• Set 1: the x_i are random numbers uniformly distributed in the interval [-1, 1], for n = 4096.
• Set 2: the x_i are random numbers uniformly distributed in the interval [0, 1], for n = 4096.
Table 3
Mean times (s) in computing 50 000 sums

n      RI        KA        PR        EF1
10     2.70e-04  4.27e-04  1.44e-03  4.29e-04
20     3.50e-04  6.68e-04  2.56e-03  5.06e-04
30     4.09e-04  8.64e-04  3.64e-03  5.34e-04
40     4.76e-04  1.06e-03  4.72e-03  5.69e-04
50     5.81e-04  1.28e-03  5.80e-03  5.94e-04
60     6.27e-04  1.48e-03  7.11e-03  7.23e-04
70     7.17e-04  1.70e-03  7.98e-03  7.52e-04
80     7.82e-04  1.89e-03  9.08e-03  7.79e-04
90     8.44e-04  2.11e-03  1.02e-02  8.20e-04
100    9.50e-04  2.37e-02  1.13e-02  8.54e-04
[Figure 1 appears here: the estimated probability density function of the relative error, plotted against the error normalized to u.]

Fig. 1. PDFRE for the algorithm in Table 1; n = 10 random numbers in [0.5, 1], number of iterations = 1 500 000.
and, for the algorithm EF2:

• Set 3: x_i = 10^{p_i}, where the p_i are uniformly distributed in the interval [0, 35].
• Set 4: x_1 = x_2 = \cdots = x_{2047} = 1.0; x_{2048} = x_{2049} = 1.0e-18; x_{2050} = x_{2051} = \cdots = x_{4096} = -1.0.
• Set 5: x_i = 1/i^2, for n = 4096.

The summations were performed 10 000 times each (once for data sets 4 and 5). The results for data sets 1-3 are shown in Tables 4-6, respectively. For each method we report the maximum and mean value of the relative errors in terms of the unit roundoff. We also report the fraction of trials in which each algorithm gives a result at least as accurate as the others: in the matrix at the bottom of Tables 4-6, each row reports the probability of success of the corresponding algorithm in comparison with each of the other algorithms. In Table 7 we report the relative errors in terms of the unit roundoff for data sets 4-5.

Table 4
Set 1

       RI       KA       PR        EF1
Max    140023   937.42   0.9909    1.0923
Mean   72.0336  1.65733  0.360019  0.360156

       RI      KA      PR      EF1
RI     0       0.0350  0.0257  0.0257
KA     0.9912  0       0.6644  0.6645
PR     1       1       0       1
EF1    1       0.9998  0.9998  0
Table 5
Set 2

       RI       KA        PR        EF1
Max    56.0788  1.00823   0.998056  1.99072
Mean   11.1005  0.372495  0.372467  0.381584

       RI      KA      PR      EF1
RI     0       0.0445  0.0444  0.0449
KA     0.9999  0       0.9968  0.9969
PR     1       0.9999  0       1
EF1    0.9997  0.9874  0.9874  0
Table 6
Set 3

       RI       KA       PR        EF2
Max    140795   25467.5  0.986649  1.9575
Mean   60.5964  6.06749  0.362291  0.482191

       RI      KA      PR      EF2
RI     0       0.0664  0.0426  0.0525
KA     0.9761  0       0.5536  0.6206
PR     1       0.9999  0       0.9999
EF2    0.9883  0.8664  0.747   0
Table 7
Sets 4-5

        RI        KA        PR     EF2
Set 4   9.01e+15  9.01e+15  0      0
Set 5   51.77     0.52      0.52   0.52
4. Accurate interpolation on equidistant nodes in [0, 1]

We will not say much about the many practical applications of our algorithm, because the applications of accurate summation are widely known and there are excellent treatments in the recent literature. We only present an application of the new representation of the floating-point system to the problem of polynomial interpolation on equidistant nodes in [0, 1] [16]. It can be shown that the inverse of the Vandermonde matrix on equidistant nodes in [0, 1] can be factorized as a product of four matrices with integer entries that can be stored without rounding errors:

W_n = \frac{1}{(n-1)!} D_1 U D_2 L,    (54)

where:

D_1 = \mathrm{diag}\{ (n-1)^{i-1} \}_{i=1,2,\ldots,n},
U(i,j) = (-1)^{i+1} \binom{j-1}{i-1}, \quad i = 1, 2, \ldots, n, \quad j = i, i+1, \ldots, n,
D_2 = \mathrm{diag}\left\{ \frac{(n-1)!}{(i-1)!} \right\}_{i=1,2,\ldots,n},
L(i,j) = (-1)^{j+1} \binom{i-1}{j-1}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, i.

Therefore the polynomial which interpolates the function f(x) on the set of nodes X_n = \{ \frac{k-1}{n-1}, k = 1, 2, \ldots, n \} is

p(x) = \sum_{k=1}^{n} c_k x^{k-1},    (55)

where the coefficient vector c can be calculated as

c = \frac{1}{(n-1)!} D_1 U D_2 L f    (56)

and f_k = f\left( \frac{k-1}{n-1} \right), k = 1, 2, \ldots, n.

By using the integer properties of the matrices involved in (54), and the fact that each entry of the vector f can be represented as a linear combination of integers whose coefficients are powers of the base, we build an accurate algorithm for the interpolation problem on equidistant nodes in [0, 1].
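For orientation only — the following snippet is ours and is not the paper's accurate method — the coefficient vector (56) coincides with the solution of the Vandermonde system V c = f, which plain Matlab (R2016b or later, for the implicit expansion) would compute as:

n  = 8;
xk = ((1:n)' - 1) / (n - 1);   % equidistant nodes in [0, 1]
fk = exp(xk);                  % sample data f_k = f((k-1)/(n-1))
V  = xk .^ (0:n-1);            % Vandermonde matrix, V(i,j) = x_i^(j-1)
c  = V \ fk;                   % baseline solution; accuracy degrades as n grows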
Table 8
Dual problem: maximum and mean value of \gamma for (57) over 100 000 runs, f \in [-1, 1]

n      2     3     4     5     6     7     8     9     10
Max    1.00  1.00  2.16  2.00  2.88  2.91  2.76  2.60  4.02
Mean   0.07  0.32  1.05  1.14  1.04  1.25  1.25  1.16  1.84
We consider the quantity

c = \frac{1}{(n-1)!} D_1 U D_2 L f

and, by using (13) with q = 3, we write the vector f as

f = [f_1, f_2, f_3, \ldots, f_n]^T = Y e,

where

Y = \begin{bmatrix} y_1(1) & y_2(1) & y_3(1) \\ y_1(2) & y_2(2) & y_3(2) \\ y_1(3) & y_2(3) & y_3(3) \\ \vdots & \vdots & \vdots \\ y_1(n) & y_2(n) & y_3(n) \end{bmatrix}
\quad \text{and} \quad
e = \begin{bmatrix} 2^{esp(1)} \\ 2^{esp(2)} \\ 2^{esp(3)} \end{bmatrix}.

Note that the y_q(k), q = 1, 2, 3, k = 1, 2, \ldots, n, are integer quantities multiplied by powers of the base of the floating-point number system. Then

c = \frac{1}{(n-1)!} D_1 U D_2 L Y e.    (57)
The numerical experiments are aimed at showing the high accuracy of formula (57). We used the package Mathematica to compute the approximate solutions \hat{c}, the exact ones (using extended precision with 1024 significant digits) and the error

\gamma = \max_{1 \le i \le n} \frac{|\hat{c}_i - c_i|}{|c_i|};    (58)

100 000 experiments have been run for n = 2, 3, \ldots, 10. Table 8 reports the maximum and mean value of \gamma in terms of the unit roundoff u = 2^{-53}.

5. Conclusions

We have compared our method with Priest's algorithm. In our opinion, the integer-based approach is conceptually simpler because it avoids the error-compensation logic and uses only some elementary properties of the floating-point system [8]; this implies a drastic reduction in the number of floating-point operations.
Moreover, we have shown that, under hypotheses that are not restrictive in practice, the theoretical relative error is bounded by the unit roundoff.

References

[1] N.J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 1996.
[2] N.J. Higham, The accuracy of floating point summation, SIAM J. Sci. Comput. 14 (1993) 783-799.
[3] X. Li, J. Demmel, D. Bailey, G. Henry, Y. Hida, J. Iskandar, W. Kahan, S. Kang, A. Kapur, M. Martin, B. Thompson, T. Tung, D. Yoo, Design, implementation and testing of extended and mixed precision BLAS, ACM Trans. Math. Softw. 28 (2002) 152-205.
[4] D.M. Priest, On Properties of Floating Point Arithmetics: Numerical Stability and the Cost of Accurate Computations, Ph.D. Thesis, University of California, Berkeley, 1992.
[5] C.M. Hoffmann, G. Park, J.R. Simard, N.F. Stewart, Residual iteration and accurate polynomial evaluation for shape-interrogation applications, in: Proc. of ACM Symp. on Solid Modeling and Appl., Genova, Italy, June 9-11, 2004.
[6] J. Demmel, Y. Hida, Fast and accurate floating point summation with application to computational geometry, Numer. Algorithms 37 (2004) 101-112.
[7] ANSI/IEEE, IEEE Standard for Binary Floating Point Arithmetic, Std 754-1985, IEEE, New York, 1985.
[8] D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Comput. Surveys 23 (1) (1991) 5-48.
[9] E. Mizukami, The accuracy of floating point summations for CG-like methods, Tech. Report 486, Department of Computer Science, Indiana University, Bloomington, July 1997.
[10] J.M. Wolfe, Reducing truncation errors by programming, Commun. ACM 7 (1964) 355-356.
[11] M. Malcolm, An algorithm for floating-point accumulation of sums with small relative error, Technical Report STAN-CS-70463, Stanford University, June 1970.
[12] W. Kahan, Further remarks on reducing truncation errors, Commun. ACM 8 (1965) 40.
[13] S. Linnainmaa, Analysis of some known methods of improving the accuracy of floating-point sums, BIT 14 (1974) 167-202.
[14] T.O. Espelid, On floating-point summation, SIAM Rev. 37 (4) (1995) 603-607.
[15] J.M. McNamee, A comparison of methods for accurate summation, ACM SIGSAM Bull. 38 (1) (2004) 1-7.
[16] A. Eisinberg, G. Fedele, C. Imbrogno, Vandermonde systems on equidistant nodes in [0, 1]: accurate computation, Appl. Math. Comput. 172 (2) (2006) 971-984.
[17] D.E. Knuth, The Art of Computer Programming, vol. 2, second ed., Addison-Wesley, Reading, MA, 1981.
[18] The MathWorks Inc., MATLAB Reference Guide, 1994.
[19] S. Wolfram, Mathematica: A System for Doing Mathematics by Computer, second ed., Addison-Wesley, 1991.