Applied Mathematics and Computation 189 (2007) 410–424 www.elsevier.com/locate/amc
Accurate floating-point summation: a new approach

A. Eisinberg, G. Fedele *

Dip. Elettronica Informatica e Sistemistica, Università degli Studi della Calabria, 87036 Rende (Cs), Italy

Abstract

The aim of this paper is to find an accurate and efficient algorithm for evaluating the summation of large sets of floating-point numbers. We present a new representation of the floating-point number system, in which a number is expressed as a linear combination of integers whose coefficients are powers of the base of the floating-point system. This representation allows us to build an accurate summation algorithm, based on the fact that no rounding error occurs when two integer numbers are summed or when a floating-point number is multiplied by a power of the base of the floating-point system. The proposed algorithm is competitive in terms of computational effort and, under some assumptions, the computed sum is highly accurate. Under such assumptions, which are not restrictive in practical applications, we prove that the relative error of the computed sum is bounded by the unit roundoff.

© 2006 Elsevier Inc. All rights reserved.

Keywords: Floating-point system; Summation algorithms
1. Introduction

Many problems, such as numerical integration, the numerical solution of differential equations, inner products, means, variances, norms, and all kinds of nonlinear functions, involve computing sums with many terms. Because of this ubiquity, and since summation is a basic task in numerical analysis, numerous summation algorithms exist; Higham [1] devotes an entire chapter to summation. Excellent reviews of applications in many different areas of numerical analysis can be found in [1-4], and two recent applications appear in [5,6]. In [5], Priest's distillation algorithm [4] is used in the problem of surface interrogation and intersection, which depends crucially on good root-finding algorithms; these in turn depend on accurate polynomial evaluation and, indirectly, on accurate summation. In [6], Demmel and Hida present several simple algorithms for accurately computing the sum of n floating-point numbers using a wider accumulator, and apply them to a robust geometric predicate which determines whether a point D lies to the left or to the right of the oriented plane defined by the points A, B and C.

Throughout the paper we assume a floating-point arithmetic adhering to the IEEE 754 floating-point standard [7]. We use only the working precision for floating-point computations. If this working precision is IEEE 754 double precision, it corresponds to 53 bits of precision, including an implicit bit.

* Corresponding author. E-mail address: [email protected] (G. Fedele).
The set of working-precision floating-point numbers is denoted by F; it is the subset of the real numbers defined as follows:

F = \{\, y \in \mathbb{R} \mid y = \pm\, b^{e} \sum_{k=1}^{t} d_k b^{-k} \,\}.    (1)
The system F is characterized by four integer parameters:

• the base b \ge 2,
• the precision t \ge 1,
• the exponent range e_{min} \le e \le e_{max}.

Each digit d_k satisfies 0 \le d_k \le b - 1, and d_1 \ne 0 for normalized numbers. An excellent tutorial on many aspects of floating-point arithmetic is given in [8]. We denote by fl(\cdot) the result of a floating-point computation, where all operations inside the parentheses are executed in working precision. Floating-point operations according to IEEE 754 satisfy the following standard model [1]:

fl(x \,\mathrm{op}\, y) = (x \,\mathrm{op}\, y)(1 + \delta), \quad |\delta| \le u, \quad \mathrm{op} \in \{+, -, \times, /\}, \quad x, y \in F,    (2)

where u = \frac{1}{2} b^{1-t} is called the unit roundoff.
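For IEEE double precision (b = 2, t = 53) this gives u = 2^{-53} \approx 1.11 \times 10^{-16}. As a minimal Matlab illustration of absorption at this threshold (our own snippet, not from the paper):

u = 2^(-53);      % unit roundoff in IEEE double precision
(1 + u) == 1      % true: fl(1 + u) = 1, the tie rounds to even
(1 + 2*u) == 1    % false: 2u = eps(1), the gap between 1 and the next double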
1.1. Previous summation algorithms

We collect some notes on previous work to put our results into a suitable perspective; our remarks are by no means complete. It is well known that, unless precautions are taken, the summation of large sets of numbers can be very inaccurate due to the accumulation of rounding errors. Many summation algorithms aim at a compromise between the accuracy of the computed sum and the computational effort. Here we briefly describe some of them, with pseudo-codes and error bounds. Let s be the sum of n floating-point numbers to be evaluated:

s = \sum_{i=1}^{n} x_i,    (3)
and let \hat{s} be the computed sum.

1.1.1. Recursive summation

The most commonly used algorithm is:

\hat{s} = 0
for i = 1:n
    \hat{s} = \hat{s} + x_i
end

It is well known that the ordering of the addends plays a crucial role in the performance of this algorithm [9].

1.1.2. Pairwise summation

The algorithm uses a queue as its data structure. At every step, the partial sum of the first two elements extracted is inserted into the queue. These elements are initially the x_i's and later the computed partial sums. The related pseudo-code is:

make queue of x
while queue is not empty do
    addend1 <- move front queue of x
    addend2 <- move front queue of x
    \hat{s} <- addend1 + addend2
    insert tail queue(x, \hat{s})
end while

The sum is obtained in \log_2 n structurally independent stages and so can be executed in parallel.

1.1.3. Insertion summation

The key idea of this approach is to encourage additions between numbers of similar magnitude. It is based on an ordered tree summation:

build heap(x)
while heap has more than 1 element do
    addend1 <- move minimum from heap x
    addend2 <- move minimum from heap x
    \hat{s} <- addend1 + addend2
    insertHeap(x, \hat{s})
    rebuildHeap(x)
end while

1.1.4. Error analysis

The three algorithms described in the previous subsections can be considered particular cases of the following pseudo-code:

Let S = \{x_1, x_2, \ldots, x_n\}
repeat while S contains more than 1 element
    remove two numbers x and y from S
    add their sum x + y to S
end
Assign the remaining element of S to \hat{s}.

Note that there are n numbers to be added and consequently n - 1 additions to be performed. If T_i = x_i + y_i is the exact sum formed at the ith execution of the repeat loop, then the computed sum satisfies

\hat{T}_i = \frac{x_i + y_i}{1 + \delta_i}, \quad |\delta_i| \le u, \quad i = 1, 2, \ldots, n-1.    (4)

The partial error is \delta_i \hat{T}_i. The overall error is the sum of the partial errors:

E_n = S_n - \hat{S}_n = \sum_{i=1}^{n-1} \delta_i \hat{T}_i.    (5)
The error bound is therefore

|E_n| \le u \sum_{i=1}^{n-1} |\hat{T}_i|.    (6)
Since

|\hat{T}_i| \le \sum_{j=1}^{n} |x_j| + O(u), \quad \forall i,    (7)
then (6) becomes

|E_n| \le u \sum_{i=1}^{n-1} \left( \sum_{j=1}^{n} |x_j| + O(u) \right),    (8)
that is,

|E_n| \le (n-1) \, u \sum_{i=1}^{n} |x_i| + O(u^2).    (9)
For pairwise summation, taking into account its particular structure and under the assumption that n = 2^r, the bound for the relative error is improved:

|E_n| \le \frac{\sum_{i=1}^{n} |x_i|}{\left|\sum_{i=1}^{n} x_i\right|} \cdot \frac{nu}{1 - nu}.    (10)
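As an illustration, a minimal recursive (halving) realization of pairwise summation in Matlab might read as follows; the function name pairwiseSum is our own, not from the paper:

function s = pairwiseSum(x)
% Pairwise (divide-and-conquer) summation of the vector x.
n = length(x);
if n <= 2
    s = sum(x);
else
    m = floor(n/2);
    s = pairwiseSum(x(1:m)) + pairwiseSum(x(m+1:end));
end
end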
1.1.5. Compensated summation

Floating-point summation is frequently improved by compensated summation, a recursive summation with a correction term designed to reduce the rounding errors [10-15]. The compensated summation due to Kahan [12] uses a correction term e at every step of the recursive scheme:

\hat{s} = 0
e = 0
for i = 1:n
    temp = \hat{s}
    y = x_i + e
    \hat{s} = temp + y
    e = (temp - \hat{s}) + y
end

The error of the computed sum is bounded as follows:

|E_n| \le (2u + O(nu^2)) \sum_{i=1}^{n} |x_i|.    (11)
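For reference, a direct Matlab transcription of this scheme (our own sketch; the function name kahanSum is ours):

function s = kahanSum(x)
% Compensated (Kahan) summation of the vector x.
s = 0; e = 0;
for i = 1:length(x)
    temp = s;
    y = x(i) + e;        % add the running correction to the next addend
    s = temp + y;
    e = (temp - s) + y;  % recover the rounding error of the addition
end
end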
Priest [4] derives an ingenious algorithm from Kahan's by introducing two correction terms:

Sort the x_i so that |x_1| \ge |x_2| \ge \cdots \ge |x_n|
s_1 = x_1
c_1 = 0
for k = 2:n
    y_k = c_{k-1} + x_k
    u_k = x_k - (y_k - c_{k-1})
    t_k = y_k + s_{k-1}
    v_k = y_k - (t_k - s_{k-1})
    z_k = u_k + v_k
    s_k = t_k + z_k
    c_k = z_k - (s_k - t_k)
end
\hat{s} = s_n

Priest shows that, under the assumption n \le b^{t-3}, the computed sum \hat{s} satisfies

|s - \hat{s}| \le 2u|s|.    (12)
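Since the algorithm in Table 2 below calls a function named priest, we sketch here a minimal Matlab realization of the above pseudo-code (our own transcription, not the paper's code):

function s = priest(x)
% Doubly compensated (Priest) summation; sorts x by decreasing magnitude first.
[~, idx] = sort(abs(x), 'descend');
x = x(idx);
s = x(1); c = 0;
for k = 2:length(x)
    y = c + x(k);
    u = x(k) - (y - c);
    t = y + s;
    v = y - (t - s);
    z = u + v;
    s = t + z;
    c = z - (s - t);
end
end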
1.2. Our goal

In this paper we present a fast and accurate summation algorithm that uses only the working precision to compute the result. This means that all operations are performed without wider accumulators and without extended precision. Moreover, no ordering of the floating-point numbers to be summed is needed. Our method consists of two phases:

• Each number is mapped into an equivalent, new representation of the floating-point number system, denoted \hat{F}. In \hat{F} a number is represented as a linear combination of integer numbers belonging to F, with coefficients that are powers of the base of the floating-point system.
• The numbers are summed, exploiting the fact that no error occurs when two integer numbers are summed or when a number is multiplied by a power of the base of the floating-point system.

Moreover, we give simple criteria to check the accuracy of the result, and we show that, under an assumption on the magnitude of the numbers involved, the computed sum is extremely accurate and its relative error is bounded by u. In our opinion this fact is important because it improves the theoretical bound of Priest's algorithm, which is twice the unit roundoff. If the hypotheses are violated, we show how to modify our approach.

1.3. Outline of the paper

The paper is organized as follows. In Section 2 we present the new representation of the floating-point system \hat{F} and show the equivalence between F and \hat{F}. In the next two sections we describe our summation algorithms and discuss their numerical properties, both by error analysis and by reporting numerical results, including timing, accuracy, comparisons with other algorithms and some remarks on practical applications.

2. A new representation of the floating-point number system

The algorithms introduced in the previous section achieve good relative-error results through arrangements and compensations that, however, do not modify the numbers to be summed: every element is added, as extracted, to an accumulator or to a correction term. An alternative approach is to modify the elements in advance, in such a way as to make the operations among the modified values as accurate as possible. The approach proposed in this paper computes the sum of n floating-point numbers as a linear combination of sums of integer values, multiplied by coefficients that are powers of the base of the floating-point system.

Every floating-point number y \in F can be represented as follows:

y = \mathrm{sgn}(y) \sum_{j=1}^{q} y_j \, b^{e - A(j)},    (13)
where:

y_j = \sum_{k=A(j-1)+1}^{A(j)} d_k \, b^{A(j)-k}, \quad j = 1, 2, \ldots, q,    (14)

A(j) = A(j-1) + a_j, \quad j = 1, 2, \ldots, q, \qquad A(0) = 0,    (15)

a_j \ \text{integer}, \quad a_j \ge 1, \quad j = 1, 2, \ldots, q, \qquad \sum_{j=1}^{q} a_j = t,    (16)

\mathrm{sgn}(y) = \begin{cases} +1, & y > 0 \\ 0, & y = 0 \\ -1, & y < 0 \end{cases}    (17)

Then \hat{F} is the set of floating-point numbers defined by (13).
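For instance, in IEEE double precision (b = 2, t = 53), choosing q = 2 with a_1 = 26 and a_2 = 27 gives A(1) = 26 and A(2) = 53, so that (13) reads y = \mathrm{sgn}(y)\,(y_1 2^{e-26} + y_2 2^{e-53}), where y_1 and y_2 are integers of at most 26 and 27 bits, respectively; this is the splitting used in Example 1 below.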
Formula (13) can easily be proved by inspection. Note that each y_j is an integer, as can be seen from (14). The next proposition shows how to compute the integer quantities y_j.

Proposition 1. Let

y = b^e \sum_{k=1}^{t} d_k \, b^{-k}    (18)

be a floating-point number; then y_j can be expressed as

y_j = \left\lfloor b^{A(j)} y_n - \sum_{i=1}^{j-1} b^{A(j)-A(i)} y_i \right\rfloor,    (19)
where

y_n = \frac{y}{b^e} = \sum_{k=1}^{t} d_k \, b^{-k}    (20)

and \lfloor x \rfloor is the greatest integer less than or equal to x.

Proof. It is sufficient to show that (19) and (14) are equivalent. Substituting (20) and (14) into (19) we have

y_j = \left\lfloor \sum_{k=1}^{t} d_k \, b^{A(j)-k} - \sum_{i=1}^{j-1} b^{A(j)-A(i)} \sum_{k=A(i-1)+1}^{A(i)} d_k \, b^{A(i)-k} \right\rfloor.    (21)
By simple algebraic manipulations [17] we obtain

y_j = \left\lfloor \sum_{k=1}^{t} d_k \, b^{A(j)-k} - \sum_{k=1}^{A(j-1)} d_k \, b^{A(j)-k} \right\rfloor = \left\lfloor \sum_{k=A(j-1)+1}^{t} d_k \, b^{A(j)-k} \right\rfloor = \sum_{k=A(j-1)+1}^{A(j)} d_k \, b^{A(j)-k},    (22)

where the last equality holds because the terms with k > A(j) contribute a nonnegative quantity smaller than 1, which the floor discards. □
Note that the value of e in (20) can be computed by using the result in the next proposition.

Proposition 2. Let y = b^e \sum_{k=1}^{t} d_k \, b^{-k}, d_1 \ne 0, be a floating-point number; then

e = \lfloor \log_b |y| \rfloor + 1.    (23)

Proof. One has

\log_b |y| = e + \log_b \left( \sum_{k=1}^{t} d_k \, b^{-k} \right).    (24)

Since

\log_b \frac{1}{b} \le \log_b \left( \sum_{k=1}^{t} d_k \, b^{-k} \right) \le \log_b (1 - b^{-t})    (25)

and 1 - b^{-t} < 1, then

-1 \le \log_b \left( \sum_{k=1}^{t} d_k \, b^{-k} \right) < 0.    (26)

By (24) one has \log_b |y| = e + c with -1 \le c < 0; therefore Eq. (23) holds. □
Example 1. Assume the IEEE double precision standard (t = 53, b = 2) and transform the floating-point number y = fl(\sqrt{2}) = 1.4142135623730951 into the new representation with q = 2. Without loss of generality we choose a_1 = 26 and a_2 = 27, so that A(0) = 0, A(1) = 26 and A(2) = 53.
By Eq. (23) it follows that e = \lfloor \log_2 |y| \rfloor + 1 = 1; therefore

y_n = \frac{y}{2^e} = 0.7071067811865476.

Finally we compute the integer quantities by Eq. (19):

y_1 = \lfloor 2^{A(1)} y_n \rfloor = 47453132,
y_2 = \lfloor 2^{A(2)} y_n - 2^{A(2)-A(1)} y_1 \rfloor = 109001677.

Therefore, by Eq. (13),

y = 47453132 \cdot 2^{-25} + 109001677 \cdot 2^{-52}.

From Eq. (19) it is easy to derive a recurrence for the y_j, j = 1, 2, \ldots, q. Let s_j be the quantity defined as

s_j = b^{A(j)} y_n - \sum_{i=1}^{j-1} b^{A(j)-A(i)} y_i, \quad j = 1, 2, \ldots, q;    (27)

then y_j = \lfloor s_j \rfloor, j = 1, 2, \ldots, q, and by simple manipulations

s_{j+1} = b^{a_{j+1}} (s_j - y_j), \quad j = 1, 2, \ldots, q-1.    (28)
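The example can be checked directly in Matlab; the following snippet is ours and simply replays the computation above:

% Replaying Example 1 (q = 2, a1 = 26, a2 = 27) in IEEE double precision.
y  = sqrt(2);                        % fl(sqrt(2)) = 1.4142135623730951
e  = floor(log2(abs(y))) + 1;        % e = 1, by Eq. (23)
yn = y / 2^e;                        % 0.7071067811865476
y1 = floor(2^26 * yn);               % 47453132
y2 = floor(2^53 * yn - 2^27 * y1);   % 109001677
y - (y1 * 2^(-25) + y2 * 2^(-52))    % 0: the representation is exact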
The following pseudo-code maps a generic floating-point number into the new representation (13), using the IEEE double precision standard (t = 53, b = 2):

function fp2nfp
Input:  x = floating-point number
        q \ge 2 = number of splittings
Output: y = integer vector containing the y_j in (14)
        esp = vector containing the quantities esp(i) = e - A(i)

a(i) = \lfloor t/q \rfloor, i = 1, 2, \ldots, q-1
a(q) = t - \sum_{k=1}^{q-1} a(k)
e = \lfloor \log_2(|x|) \rfloor + 1
f = |x| \cdot 2^{-e}
aux = f
for i = 1:q
    aux = aux \cdot 2^{a(i)}
    y(i) = \lfloor aux \rfloor
    aux = aux - y(i)
end
esp(i) = e - \sum_{k=1}^{i} a(k), i = 1, 2, \ldots, q
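A direct Matlab transcription of this pseudo-code might look as follows (our own sketch; it handles one scalar x, and the sign is carried separately by sgn(x)):

function [y, esp] = fp2nfp(x, q)
% Map |x| into the representation (13): |x| = sum_i y(i) * 2^esp(i).
t = 53;
a = floor(t/q) * ones(1, q);
a(q) = t - sum(a(1:q-1));
e = floor(log2(abs(x))) + 1;
aux = abs(x) * 2^(-e);          % normalized mantissa, 1/2 <= aux < 1
y = zeros(1, q);
for i = 1:q
    aux = aux * 2^a(i);
    y(i) = floor(aux);          % integer piece y_i of (14)
    aux = aux - y(i);           % remaining fractional part
end
esp = e - cumsum(a);            % esp(i) = e - A(i)
end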
3. The proposed summation algorithm

The representation proposed in the previous section can be used to compute the quantity

s = \sum_{i=1}^{n} x_i,    (29)
where each element is expressed as

x_i = \mathrm{sgn}(x_i) \, b^{e(i)} \sum_{j=1}^{q} y_j(i) \, b^{-A(j)}    (30)

with

y_j(i) = \sum_{k=A(j-1)+1}^{A(j)} d_k^{(i)} \, b^{A(j)-k}.    (31)
Let

esp_{min} = \min\{ e(i) \mid i = 1, 2, \ldots, n \}, \qquad esp_{max} = \max\{ e(i) \mid i = 1, 2, \ldots, n \}    (32)

be the minimum and the maximum, respectively, of the addend exponents; then each exponent can be expressed as

e(i) = esp_{min} + c(i), \quad i = 1, 2, \ldots, n,    (33)

where

c(i) \ge 0, \quad i = 1, 2, \ldots, n.    (34)

So

x_i = \mathrm{sgn}(x_i) \, b^{esp_{min} + c(i)} \sum_{j=1}^{q} y_j(i) \, b^{-A(j)}    (35)

and

s = \sum_{j=1}^{q} f_j,    (36)

where

f_j = b^{esp_{min} - A(j)} \sum_{i=1}^{n} \mathrm{sgn}(x_i) \, y_j(i) \, b^{c(i)}.    (37)
The Matlab code in Table 1 implements the summation algorithm in the IEEE double precision standard (t = 53, b = 2).

3.1. Error analysis

Note that the relative rounding error involved in (36) is bounded by

\frac{\sum_{i=1}^{q} |f_i|}{\left| \sum_{i=1}^{q} f_i \right|} \, (q-1) \, u,    (38)

because the sum of n addends is reduced to the sum of q floating-point numbers. Obviously, this theoretical bound is guaranteed only if the sums of integer quantities appearing in (36),

\sum_{i=1}^{n} \mathrm{sgn}(x_i) \, y_j(i) \, b^{c(i)},    (39)

do not exceed the maximum integer number which can be represented in a t-digit register:

int_{max} = b^t - 1.    (40)
This assumption limits the number of elements that can be summed. In fact, y_j is bounded by

|y_j|_{max} = b^{a_j} - 1,    (41)
Table 1
Matlab algorithm sumEF1 for accurate summation

function s = sumEF1(v, q)
n = length(v);
t = 53;
alfa = floor(t/q) * ones(1, q);
alfa(q) = t - (q-1) * alfa(1);
[f, e] = log2(v);
aux = zeros(q, n);
y = zeros(q, n);
aux(1,:) = pow2(v, alfa(1) - e);
y(1,:) = floor(aux(1,:));
for i = 1:q-1
    aux(i+1,:) = aux(i,:) - y(i,:);
    aux(i+1,:) = pow2(aux(i+1,:), alfa(i+1));
    y(i+1,:) = floor(aux(i+1,:));
end
espMin = min(e);
for i = 1:q
    y(i,:) = pow2(y(i,:), e - espMin);
end
s = 0;
esp = 0;
for i = 1:q
    esp = esp + alfa(i);
    s = s + pow2(sum(y(i,:)), espMin - esp);
end
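As a usage illustration (our own snippet), one may call the function of Table 1 on a vector whose exponents span a narrow range, as in the experiment of Fig. 1 below:

v = 0.5 + 0.5*rand(1, 1000);   % addends uniformly distributed in [0.5, 1]
s = sumEF1(v, 2);              % proposed summation with q = 2
r = sum(v);                    % recursive summation, for comparison
abs(s - r)                     % the two results typically differ by a few ulps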
while c(i), in the worst case, takes the value

c_{max} = esp_{max} - esp_{min}.    (42)

The following inequality must hold:

n \, b^{esp_{max} - esp_{min}} \, |y_j|_{max} \le int_{max}, \quad j = 1, 2, \ldots, q;    (43)

therefore

n \le \frac{1}{b^{esp_{max} - esp_{min}}} \, \min\left\{ \frac{b^t - 1}{b^{a_1} - 1}, \frac{b^t - 1}{b^{a_2} - 1}, \ldots, \frac{b^t - 1}{b^{a_q} - 1} \right\}.    (44)
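As a concrete instance of (44) (our own arithmetic, not from the paper): with b = 2, t = 53, q = 2, a_1 = 26, a_2 = 27 and an exponent spread esp_{max} - esp_{min} = 4, the bound gives n \le (2^{53} - 1)/(2^4 (2^{27} - 1)) \approx 4.2 \times 10^6 addends.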
3.2. Optimal choice q = 2

The choice of q = 2 in (36) allows us to build a summation algorithm with an optimal relative error bound. In this case (36) becomes

s = b^{esp_{min} - A(1)} \sum_{i=1}^{n} \mathrm{sgn}(x_i) \, y_1(i) \, b^{c(i)} + b^{esp_{min} - A(2)} \sum_{i=1}^{n} \mathrm{sgn}(x_i) \, y_2(i) \, b^{c(i)}.    (45)

Such an expression contains the sum of two floating-point numbers. If these numbers are correctly formed, then by Eq. (2) the relative error of their sum is bounded by u. Formula (45) is error-free only if

n \le \frac{1}{b^{esp_{max} - esp_{min}}} \min\left\{ \frac{b^t - 1}{b^{a_1} - 1}, \frac{b^t - 1}{b^{a_2} - 1} \right\}.    (46)

In the IEEE double precision standard (t = 53, b = 2), choosing for example a_1 = 26 and a_2 = 27, the bound (46) is satisfied if

n \le \frac{b^t - 1}{b^{esp_{max} - esp_{min}} \, (b^{a_2} - 1)};    (47)
therefore, in order to have n \ge 1, it must be that

esp_{max} - esp_{min} \le 26.    (48)

If (48) holds but (47) is not satisfied, it follows that the integer quantities involved in (45),

f_1(i) = \mathrm{sgn}(x_i) \, y_1(i) \, b^{c(i)}, \quad i = 1, 2, \ldots, n,    (49)

and

f_2(i) = \mathrm{sgn}(x_i) \, y_2(i) \, b^{c(i)}, \quad i = 1, 2, \ldots, n,    (50)

are computable exactly, but their sums may still exceed the maximum representable integer number (40). As the f_1 and f_2 are integer quantities, their sums can be expressed in terms of a multiple of 2^{53} and a remainder:

\sum_{i=1}^{n} f_1(i) = D_1 2^{53} + R_1,    (51)

\sum_{i=1}^{n} f_2(i) = D_2 2^{53} + R_2.    (52)

Therefore (45) becomes

s = D_1 2^{esp_{min} - A(1) + 53} + R_1 2^{esp_{min} - A(1)} + D_2 2^{esp_{min} - A(2) + 53} + R_2 2^{esp_{min} - A(2)}.    (53)
The four elements in (53) are generated without errors because (48) holds. Their sum can be computed by Priest's algorithm, which guarantees a theoretical bound of 2u; in this case there is a drastic reduction of the time required, since only four floating-point numbers must be summed. The algorithm is summarized by the following pseudo-code:

function cutSum
Input:  v = vector of dimension n \ge n_{max}, with n_{max} equal to the r.h.s. of (47)
Output: computed sum

Calculate f_1 and f_2 as in (49) and (50), respectively.
Build k_1 sub-vectors of f_1, \{w_{1,1}, w_{1,2}, \ldots, w_{1,k_1}\}, such that \dim(w_{1,j}) \le n_{max}, j = 1, 2, \ldots, k_1.
Build k_2 sub-vectors of f_2, \{w_{2,1}, w_{2,2}, \ldots, w_{2,k_2}\}, such that \dim(w_{2,j}) \le n_{max}, j = 1, 2, \ldots, k_2.
Build a vector p_1 \in \mathbb{N}^{k_1} such that p_1(i) = \sum_{j=1}^{\dim(w_{1,i})} w_{1,i}(j), i = 1, 2, \ldots, k_1.
Build a vector p_2 \in \mathbb{N}^{k_2} such that p_2(i) = \sum_{j=1}^{\dim(w_{2,i})} w_{2,i}(j), i = 1, 2, \ldots, k_2.
Calculate D_1, D_2, R_1 and R_2 such that
    \sum_{i=1}^{k_1} p_1(i) = D_1 2^{53} + R_1,
    \sum_{i=1}^{k_2} p_2(i) = D_2 2^{53} + R_2.
Calculate (53) by using Priest's algorithm.

Moreover, if (48) is not satisfied, it is possible to split the original vector into a set of vectors for each of which (48) holds. In the worst case, not common in practice, the number of such vectors equals the number of elements to be summed. Then, for each vector, the partial sum can be computed by the algorithm in Table 1, and the whole sum is obtained by summing the partial sums with a compensated algorithm. The Matlab code in Table 2 implements this idea. Note that the algorithm can be improved by sorting the elements so that |x_i| \le |x_{i+1}|, i = 1, 2, \ldots, n-1.

3.3. Numerical experiments and computational aspects

In this section we present timing and accuracy results. We compare our summation algorithms (EF1, EF2) with recursive summation (RI), Kahan's algorithm (KA) and Priest's algorithm (PR). All the above algorithms were
Table 2
Matlab algorithm sumEF2 for accurate summation

function s = sumEF2(v)
n = length(v);
[f, e] = log2(v);
eMax = max(e);
eMin = min(e);
if (eMax - eMin) < 26
    s = sumEF1(v, 2);
else
    aux = v;
    k = 1;
    while length(aux) > 0
        pos = find(abs(e - e(1)) < 26);
        sp(k) = sumEF1(aux(pos), 2);
        aux(pos) = [];
        e(pos) = [];
        k = k + 1;
    end
    s = priest(sp);
end
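As an illustration of Table 2 (our own usage sketch), sumEF2 handles data with a wide exponent spread, such as the Set 4 data described below:

v = [ones(1,2047), 1.0e-18*ones(1,2), -ones(1,2047)];
s = sumEF2(v);    % the exact sum is 2.0e-18
r = sum(v);       % recursive summation returns 0 on this data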
performed on a Pentium-based computer whose unit roundoff is u = 2^{-53} \approx 1.11 \times 10^{-16}, and they have been implemented in Matlab [18] and in Mathematica [19], which allows arbitrary-precision numbers, in order to compare the computed sums with the "exact" ones (obtained with 1024-digit precision) and to investigate the relative error.

We considered the algorithms RI, KA, PR and EF1 (with q = 2) from a computational point of view by measuring the mean time necessary to perform a fixed number of iterations. The mean time was computed using two Matlab built-in functions, tic and toc: the former resets the clock, the latter returns the elapsed time since the latest reset. The results are presented in Table 3.

If the assumptions (47) and (48) are satisfied, then the relative error bound of our algorithm is u. For this case we performed an experiment to estimate numerically the probability density function of the relative error (PDFRE) for the algorithm EF1 (with q = 2): we generated 1 500 000 vectors of 10 numbers uniformly distributed in the interval [0.5, 1]. Fig. 1 shows the PDFRE normalized to the unit roundoff; note that the distribution is bimodal.

In the other cases we compare summation methods through statistical estimates of the error, which may be more representative of the average case [15]. We consider the following sets of data for the algorithm EF1:

• Set 1: the x_i are random numbers uniformly distributed in the interval [-1, 1], for n = 4096.
• Set 2: the x_i are random numbers uniformly distributed in the interval [0, 1], for n = 4096.
Table 3
Mean times (s) in computing 50 000 sums

n      RI        KA        PR        EF1
10     2.70e-04  4.27e-04  1.44e-03  4.29e-04
20     3.50e-04  6.68e-04  2.56e-03  5.06e-04
30     4.09e-04  8.64e-04  3.64e-03  5.34e-04
40     4.76e-04  1.06e-03  4.72e-03  5.69e-04
50     5.81e-04  1.28e-03  5.80e-03  5.94e-04
60     6.27e-04  1.48e-03  7.11e-03  7.23e-04
70     7.17e-04  1.70e-03  7.98e-03  7.52e-04
80     7.82e-04  1.89e-03  9.08e-03  7.79e-04
90     8.44e-04  2.11e-03  1.02e-02  8.20e-04
100    9.50e-04  2.37e-02  1.13e-02  8.54e-04
[Figure 1 appears here: the estimated probability density function of the relative error, plotted against the error normalized to u.]

Fig. 1. PDFRE for the algorithm in Table 1; n = 10 random numbers in [0.5, 1], number of iterations = 1 500 000.
and, for the algorithm EF2:

• Set 3: x_i = 10^{p_i}, where the p_i are uniformly distributed in the interval [0, 35].
• Set 4: x_1 = x_2 = \cdots = x_{2047} = 1.0; x_{2048} = x_{2049} = 1.0e-18; x_{2050} = x_{2051} = \cdots = x_{4096} = -1.0.
• Set 5: x_i = 1/i^2, for n = 4096.

The summations were performed 10 000 times each (once for data sets 4 and 5). The results for data sets 1-3 are shown in Tables 4-6, respectively. For each method we report the maximum and mean value of the relative errors in terms of the unit roundoff. We also report the fraction of trials in which each algorithm gives a result at least as accurate as the others: in the matrix at the bottom of Tables 4-6, each row reports the probability of success of the corresponding algorithm in comparison with each of the other algorithms. In Table 7 we report the relative errors in terms of the unit roundoff for data sets 4-5.

Table 4
Set 1

       RI       KA       PR        EF1
Max    140023   937.42   0.9909    1.0923
Mean   72.0336  1.65733  0.360019  0.360156

       RI      KA      PR      EF1
RI     0       0.0350  0.0257  0.0257
KA     0.9912  0       0.6644  0.6645
PR     1       1       0       1
EF1    1       0.9998  0.9998  0
Table 5
Set 2

       RI       KA        PR        EF1
Max    56.0788  1.00823   0.998056  1.99072
Mean   11.1005  0.372495  0.372467  0.381584

       RI      KA      PR      EF1
RI     0       0.0445  0.0444  0.0449
KA     0.9999  0       0.9968  0.9969
PR     1       0.9999  0       1
EF1    0.9997  0.9874  0.9874  0
Table 6
Set 3

       RI       KA       PR        EF2
Max    140795   25467.5  0.986649  1.9575
Mean   60.5964  6.06749  0.362291  0.482191

       RI      KA      PR      EF2
RI     0       0.0664  0.0426  0.0525
KA     0.9761  0       0.5536  0.6206
PR     1       0.9999  0       0.9999
EF2    0.9883  0.8664  0.747   0
Table 7
Sets 4-5

        RI        KA        PR     EF2
Set 4   9.01e+15  9.01e+15  0      0
Set 5   51.77     0.52      0.52   0.52
4. Accurate interpolation on equidistant nodes in [0, 1]

We will not say much about the many practical applications of our algorithm, because the applications of accurate summation are widely known and there are excellent treatments in the recent literature. We only present an application of the new representation of the floating-point system to the problem of polynomial interpolation on equidistant nodes in [0, 1] [16]. It can be shown that the inverse of the Vandermonde matrix on equidistant nodes in [0, 1] can be factorized as a product of four matrices with integer entries that can be stored without rounding errors:

W_n = \frac{1}{(n-1)!} D_1 U D_2 L,    (54)

where:

D_1 = \mathrm{diag}\{ (n-1)^{i-1} \}_{i=1,2,\ldots,n},
U(i,j) = (-1)^{i+1} \binom{j-1}{i-1}, \quad i = 1, 2, \ldots, n, \quad j = i, i+1, \ldots, n,
D_2 = \mathrm{diag}\left\{ \frac{(n-1)!}{(i-1)!} \right\}_{i=1,2,\ldots,n},
L(i,j) = (-1)^{j+1} \binom{i-1}{j-1}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, i.

Therefore the polynomial which interpolates the function f(x) on the set of nodes X_n = \{ \frac{k-1}{n-1}, k = 1, 2, \ldots, n \} is

p(x) = \sum_{k=1}^{n} c_k x^{k-1},    (55)

where the coefficient vector c can be calculated as

c = \frac{1}{(n-1)!} D_1 U D_2 L f    (56)

and f_k = f\left( \frac{k-1}{n-1} \right), k = 1, 2, \ldots, n.

By using the integer properties of the matrices involved in (54), and the fact that each entry of the vector f can be represented as a linear combination of integers whose coefficients are powers of the base, we build an accurate algorithm for the interpolation problem on equidistant nodes in [0, 1].
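For orientation only — the following snippet is ours and is not the paper's accurate method — the coefficient vector (56) coincides with the solution of the Vandermonde system V c = f, which plain Matlab (R2016b or later, for the implicit expansion) would compute as:

n  = 8;
xk = ((1:n)' - 1) / (n - 1);   % equidistant nodes in [0, 1]
fk = exp(xk);                  % sample data f_k = f((k-1)/(n-1))
V  = xk .^ (0:n-1);            % Vandermonde matrix, V(i,j) = x_i^(j-1)
c  = V \ fk;                   % baseline solution; accuracy degrades as n grows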
Table 8
Dual problem: maximum and mean value of \gamma for (57) over 100 000 runs, f \in [-1, 1]

n      2     3     4     5     6     7     8     9     10
Max    1.00  1.00  2.16  2.00  2.88  2.91  2.76  2.60  4.02
Mean   0.07  0.32  1.05  1.14  1.04  1.25  1.25  1.16  1.84
We consider the quantity

c = \frac{1}{(n-1)!} D_1 U D_2 L f

and, by using (13) with q = 3, we write the vector f as

f = [f_1, f_2, f_3, \ldots, f_n]^T = Y e,

where

Y = \begin{bmatrix} y_1(1) & y_2(1) & y_3(1) \\ y_1(2) & y_2(2) & y_3(2) \\ y_1(3) & y_2(3) & y_3(3) \\ \vdots & \vdots & \vdots \\ y_1(n) & y_2(n) & y_3(n) \end{bmatrix}
\quad \text{and} \quad
e = \begin{bmatrix} 2^{esp(1)} \\ 2^{esp(2)} \\ 2^{esp(3)} \end{bmatrix}.

Note that the y_q(k), q = 1, 2, 3, k = 1, 2, \ldots, n, are integer quantities multiplied by powers of the base of the floating-point number system. Then

c = \frac{1}{(n-1)!} D_1 U D_2 L Y e.    (57)
The numerical experiments are aimed at showing the high accuracy of formula (57). We used the package Mathematica to compute the approximate solutions \hat{c}, the exact ones (using extended precision with 1024 significant digits) and the error

\gamma = \max_{1 \le i \le n} \frac{|\hat{c}_i - c_i|}{|c_i|};    (58)

100 000 experiments have been run for n = 2, 3, \ldots, 10. Table 8 reports the maximum and mean value of \gamma in terms of the unit roundoff u = 2^{-53}.

5. Conclusions

We have compared our method with Priest's algorithm. In our opinion, the integer-based approach is conceptually simpler because it avoids the error-compensation logic and uses only some elementary properties of the floating-point system [8]; this implies a drastic reduction in the number of floating-point operations.
Moreover, we have shown that, under hypotheses that are not restrictive in practice, the theoretical relative error is bounded by the unit roundoff.

References

[1] N.J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 1996.
[2] N.J. Higham, The accuracy of floating point summation, SIAM J. Sci. Comput. 14 (1993) 783-799.
[3] X. Li, J. Demmel, D. Bailey, G. Henry, Y. Hida, J. Iskandar, W. Kahan, S. Kang, A. Kapur, M. Martin, B. Thompson, T. Tung, D. Yoo, Design, implementation and testing of extended and mixed precision BLAS, ACM Trans. Math. Softw. 28 (2002) 152-205.
[4] D.M. Priest, On Properties of Floating Point Arithmetics: Numerical Stability and the Cost of Accurate Computations, Ph.D. Thesis, University of California, Berkeley, 1992.
[5] C.M. Hoffmann, G. Park, J.R. Simard, N.F. Stewart, Residual iteration and accurate polynomial evaluation for shape-interrogation applications, in: Proc. of ACM Symp. on Solid Modeling and Appl., Genova, Italy, June 9-11, 2004.
[6] J. Demmel, Y. Hida, Fast and accurate floating point summation with application to computational geometry, Numer. Algorithms 37 (2004) 101-112.
[7] ANSI/IEEE, IEEE Standard for Binary Floating Point Arithmetic, Std 754-1985, IEEE, New York, 1985.
[8] D. Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Comput. Surveys 23 (1) (1991) 5-48.
[9] E. Mizukami, The accuracy of floating point summations for CG-like methods, Tech. Report 486, Department of Computer Science, Indiana University, Bloomington, July 1997.
[10] J.M. Wolfe, Reducing truncation errors by programming, Commun. ACM 7 (1964) 355-356.
[11] M. Malcolm, An algorithm for floating-point accumulation of sums with small relative error, Technical Report STAN-CS-70463, Stanford University, June 1970.
[12] W. Kahan, Further remarks on reducing truncation errors, Commun. ACM 8 (1965) 40.
[13] S. Linnainmaa, Analysis of some known methods of improving the accuracy of floating-point sums, BIT 14 (1974) 167-202.
[14] T.O. Espelid, On floating-point summation, SIAM Rev. 37 (4) (1995) 603-607.
[15] J.M. McNamee, A comparison of methods for accurate summation, ACM SIGSAM Bull. 38 (1) (2004) 1-7.
[16] A. Eisinberg, G. Fedele, C. Imbrogno, Vandermonde systems on equidistant nodes in [0, 1]: accurate computation, Appl. Math. Comput. 172 (2) (2006) 971-984.
[17] D.E. Knuth, The Art of Computer Programming, vol. 2, second ed., Addison-Wesley, Reading, MA, 1981.
[18] The MathWorks Inc., MATLAB Reference Guide, 1994.
[19] S. Wolfram, Mathematica: A System for Doing Mathematics by Computer, second ed., Addison-Wesley, 1991.