Mathematics and Computers North-Holland
in Simulation
30 (1988) 541-552
541
ALL POSSIBLE COMPUTED RESULTS IN CORRECT FLOATING-POINT SUMMATION M. PICHAT C.N.A.M.
-
C.N.I.A.M.,
860, Avenue
de Saint
Priest,
34100
Montpellier,
France
On a computer, any entry or elementary operation has two legitimate results, one by default and one by excess. Thus, a given algebraic algorithm with a single result is able, when processed on a computer, to generate a large set of floating-point results, all representative of the exact algebraic result. The aim of this paper is to characterize such sets in the particular case of summation.
1. Introduction On a computer, any real number is represented by a finite number of digits. A consequence is that a computing error is generated when performing any computer operation; entry errors and abbreviation errors are so introduced in algorithms. Limited precision arithmetic also implies the nonassociativity of computing operators. Thus, as pointed out by Vignes and La Porte [5,6], while in algebra a given algorithm provides a single result, the same algorithm processed on a computer will have a set 9? of possible floating-point results, all equally representative of the exact algebraic result. As shown by these authors, the elements of population 9 have two possible origins: (1) Perturbation origin: each entry or elementary operation admits two results, one by default and the other by excess. If k is the total number of operations, computation leads to 2k possible results. corresponding to all (2) Permutation origin: if Cop is the total number of combinations different possible permutations of the operators, computation taking account of nonassociativity of approached operators leads to Cop possible results. Here we study, for summation algorithms, the subset of 99 generated by perturbation; permutation will not be considered here. 2. Notation and preliminaries 2.1. Roundings
and correct operators
Let S be a finite, nonempty subset of R, representing on a computer the set R of real numbers. We denote by R, the real interval bounded by the minimum and maximum elements of s. The mapping v : R!s + S defined by vx=max{Z:ZESAZ
VxER.
rounding of R!s into S.
0 1988, Elsevier Science Publishers
B.V. (North-Holland)
542
M. Pichat / Correct floating-point
summation
The mapping A :R s + S defined by Ax=min{Z:ZESAZ>x}
VxERs
is called the R-directed rounding of R, into S. Any elementary real operator * E { +, -, on S. Operator 0 : S x S + S is called correct if X@Y=(v(X*Y) Operator v : S
x
WY=v(X*
orh(X*
Y))
X,
:}
V(X,
is ‘approached’ by an operator 0
defined
Y)ES’.
S + S defined by Y)
qx,
Y)ES2Ax*
YERs
is called correct and L-directed. Operator A : S x S + S defined by xnY=A(X*Y)
v(X,Y)ES2AX*YE&
is called correct and R-directed. With a slightly different terminology, left-directed respectively right-directed roundings and operators have been completely and axiomatically studied by Kulisch and Miranker [2]. 2.2. Accuracy intervals, nondecreasing-intervals
subsets of II3 and floating-point
numbers
We shall have to measure the local accuracy of the approached representation. For X E S we denote by PRED( X) and SUCC( X) respectively the predecessor and successor of X in S: PRED(X)=max{Y:YESAY
Succ(X)=min{Y:YESA
We assume that 0 is an element of S; we denote by S+ (respectively S-) (respectively negative) elements of S. Given a real x, the strictly positive number E(x) defined by
Y>X}. the set of positive
for x 4 S, forxES+, forxES_, is called the accuracy interval of x in S. For a zero element, the accuracy interval should eventually be considered right or left. S is called a nondecreasing-intervals subset of Iw if IX]<
]YJ
=3 &(X)<&(Y)
v(x,
Y)Cs2AXY>0.
On computers, S is a set F of normalized floating-point numbers. The base will be given by b and the number of mantissa digits in base b by t.
M. Pichat
Notation.
/ Correct floating-point
summation
543
For u E Iw- (0)) the integer p defined by P-i<
(UI
is called the exponent of u and is denoted by e(u). 2.3. Algebraic procedure and computing procedures Let V= { * i, i = 1, 2,. _. , N - l} be a sequence of operations in Iw. Applied to a given sequence of N operands x = (xi, x2, . . . , xN ) E RN, sequence %?defines a real r = rN by ri = x 1,
5+1
=r,
i= 1,2 ,...,
*;x;+~,
N-
1.
r is the result generated by an algebraic procedure. We now consider subset S. Associated with 9 there are 2N-’ sequences of correct computer operations: 0 = { 0, , i = 1, 2,. . . , N - 1)
, where 0, is either v, or A,.
Associated with x there are 2N sequences of operands: X= (x ,‘__., x,) Any
sequence
E P,
where X, = (vx, or AX;).
0, applied to a vector X, defines in S a number R = RN by
R,=X,,
R,+,=R,O,X,+l,
i=l,2
,...,
N-l.
0)
R is the result generated by a computing procedure. According to whether or not entry errors are taken into account, 2N-’ (respectively 22N-*) computing procedures are associated to one algebraic procedure. We are concerned with the elements defined in S by these computing procedures and we consider the two following sets. In the case where no entry errors are committed: given X E SN, the set E of results R defined by (1) for X and any sequence 0 of correct operators is given as E={R:Rdefinedby(l),
@,=(v,orA,)}.
And, in the general case: given x E IwN, the set E’ of results R defined by (1) for any sequence 0 of correct operators and any rounding of x is given as E’ = {R : R defined by (l),
3. Arithmetic
0, = (vi
or &),
X, = (vx; or AX,)}
summation
3.1. A first property For arithmetic summation, a first property can be established with weak assumptions. Entry errors are not considered in this section.
M. Pichat / Correct floating-point
544
3.1.
Theorem. Given X = (Xl,. . . , X,)
E={R:R=X,q&@$
E ( S+)N,
summation
let E be the following set of elements in S: @I =(VorA)}.
.o.@~~,X~,
If S is u nondecreasing-intervals subset of R, then: (i) Card(E)
j=l,2
,.,.,
N.
Proof. The proof follows by induction with respect to N. We first prove (i): Basis step. Property (i) is true for N = 2. Induction step. Now assume it is true at order k: the set E, defined
Ek={Rk:Rk=X1G3,
...@k_lXk,
consists of h consecutive elements denote by G the following set:
by
@I =(vorn)}
of S, Sf, . . . , S:,
G={Z:ZESAZE[S;+X~+,,
.**
and X < k. When adding
S,” to Xk+,,
let us
S;+X,+,]}.
We have Card(G)
= freq( S;)
= 1 = C,O= Ci.
Induction step. Now assume Card( Ek) = k and freq( S’) = C/1: (j = 1, 2,. . . , k). (It will first be noted that Card(E) = N implies Card( Ek) = k Vk, as a consequence of Card( E,,,) < Card( Ek + 1 obtained above.) Necessarily, Card( Ek+ ,) = k + 1. It th en implies that there exists one and only one element of S Sk+i in any interval: Ij_i = [S,k_, + Xk+i, S,‘+ X,+,[ for j = 2, 3,..., k, and that S,? + X,,, G S’for j = 1, 2,. . _, k. Indeed, let us consider I, for any j = 1, 2,. . _, k - 1. I, contains at most one element of S: More than one would be in contradiction with the property of S to be a nondecreasing-intervals subset of R. I, contains at least one element of S: As G = 1, U I2 U . . . U lk_ 1 U { Sk” + X,, i }, assuming that there exists no element of S in I, leads with the above assertion to Card( E,,,) < E S for any j > 1 leads, with the preceding results, to k-2+2
M. Pichat / Correct floating-point
545
summation
We then deduce
freq(~~+*j=freq(~~_,j+freq(S:)=~,/_f+~;/_:=C,/-‘, freq( S:+‘)
= freq( St)
= 1,
freq( S,“:,‘)
= freq( Sf)
= 1.
0
3.2. Remark. To compare with the estimation arithmetic, we define the absolute errors:
R I+1 =R,+X,+l+q,, For any correct operation, of R, + X+1 or, a fortiori, We have
i=l,2
yielded
by classical
error analysis
in floating-point
,.._, N-l.
the quantities 77, satisfy the exponent of R,, ,.
1q, ( < h&-l,
where p, denotes
the exponent
N-l
N-l
v=R-r=
j=2,3,.-.,k,
c qz i=l
and
1771 <
c
bPc-‘,
p,=e(R,+,).
r=l
Making use of the bound p,
elements
result and
E consists
of F.
Use can also be made of the following
relation
R=X,(1+&)+X2(1+&)+
...
established
by Wilkinson
[7]:
+X&t-&),
where (I _ ,,-,),Y+l--r under the assumption 3.2. Injluence
r=l,2
of correct
floating-point
,...,
N,
additions.
of the entry errors
In this section we restrict ourselves to the case where S is a set F of normalized numbers. A preliminary lemma can be given as follows. 3.3. Preliminary Lemma. Given two floating-point floating-point numbers:
H=
(
W: W=
U@ V, U= (Xor
numbers X, YE F, let H be the following
Succ(X)),
W can be computed by 23 computing procedures. X, YE F can be written in the following form: LbP-‘,
Y = M&-l,
set of
V= (Y or Succ(Y)),Cl3 = (v or nj).
Then, the elements of H are at most 3, and are consecutive elements of F.
X=
floating-point
b’-’ < L, M-C b’,
546
M. Pichat / Correct floating-point
summation
where p and q denote the exponents e( X) and e( Y). Then, Succ(Y)
succ( x) = (L + l)V’,
= (M+
l)bq-’
(even if L or M = b’ - 1). We give an outline of the proof for the case b = 2. Notation. For x E R +, let us denote by lx ] and [xl the integers defined by 1x1 =min{n:
lx] =max{n:nER4AAnx},
nEN
Anax}.
The following cases can arise. Case l:p=q. The fact that b=2 thenimplies e(W)=p+l (or e(W)=p+2if - 1). In any case, W= Nbp+‘-’ and integer N takes its value in
The set H has two elements if L + M is elements being consecutive in F. Case2:p>q.Then, e(JV)=(porp+l). (2a): e( X + Y) =p. Then we have W= L + 1 + [(M + l)bq-P]. [Mbq-p],..., If we set M = bJ’-qA4’ + r A r -C bP_4, then set H has three consecutive elements. (2b): e( X+ Y) =p + 1. Then we have W= X+
Y= (Lb-’
Succ(X)
L=M=b’
even, and three elements if L + A4 is odd, these
and integer N takes its value in L +
NbP-’
in any case (r = 0 or r # 0 or L + M’ = b’ - 1) the Nbpf’-‘.
We can write
+ Mbq-P-‘)bP+‘-‘,
+ Succ(Y)
= ((L+
l)b-‘+
(M+
l)bq-p-l)bp+l-l.
If we set Lb-’ + Mbq-J’-’ = Q + p with Q E N and 0 < p < 1, then integer N takes its value in Q,..., Q + [p + b-’ + bq-p-‘]. As p + b-’ + bq-p-’ -C 1 + b-’ + bP2 < 2, the set H has at most three consecutive elements. Case 3: p -C q. This case is dealt with analogously to Case 2. 3.4. Remark. To prove Lemma 3.3 in the case b # 2, one has to modify the statements for p = q:
b # 2 implies e(W) = ( p or p + 1) and two cases have to be distinguished. Now, taking into account entry errors, we can state the following theorem. 3.5. Theorem. Given x = (x1, . . . , xN) E (R ‘) N and computations arithmetic, let E’ be the following set of elements in F: E’= Then,
1
E’ consists
R: R=X,C+
X2@+ -0.
@N_l X,,
of at most N + 1 consecutive
Xi =
being performed
in floating-point
(vxi or Ax;), G$ = cv or A)}.
elements of F.
Proof. The proof is done by induction with respect to N. Basis step. The property is true for N = 2 as a consequence of Preliminary Lemma 3.3. Induction step. We assume that it is true at order k: if EL is the set of all computed values x1@+ **. @,,_ 1 X,, then EL consists of X consecutive elements of F, Sf, . . . , S:, and h < k + 1.
M. Pichat / Correct floating-point
summation
547
If xk+i E F, the desired result follows from Theorem 3.1, otherwise for X > 2 we denote u= vxk+i and I/= A\x~+*. We are concerned with the sums SF @ U and Sf @ V, i = 1, 2,. . . , A. Let us consider the two following subsets:
We have Card(G) < X - 1. By Preliminary Lemma 3.3, Card(H) A-1+2=X+l
< 2 (Sf_, v U tZ H).
3.6. Remark. The weak effect of entry errors in arithmetic comparing Theorems 3.1 and 3.5, should be noticed. Table
It then follows that Card( EL+ 1) <
summation,
as it appears when
1a
Computation
of r,, r2, and r, Computation
n
of r, Card(E)
RI_
RR
50
4.4992053382
4.4992053382
100
5.1873775174
5.1873775180
81
150
5.5911805882
5.5911805892
130
32
200
5.8780309475
5.8780309488
180
250
6.1006752487
6.1006752504
230
Computation
n
of rz Card( E)
R,.
RR
50
1.6251327336
1.6251327331
100
1.6349839001
1.634983900_3
150
1.6382895729
1.6382895732
143
200
1.6399465458
1.6399465462
193
250
1.6409420560
1.6409420564 -
243
Computation n
45 94
of r, Card(E)
RI_
RR
50
2.9476758385
2.9476758381
100
3.2893173135
3.2893173139
98
150
3.4903958449
3.4903958454
148
200
3.6334079824
3.6334079831
198
250
3.7444818129
3.7444818138 -
248
a Decimal binary
outputs digits,
In the computation for ri, l/i2 n -
in Turbo-Pascal
representing
are performed
12 decimal
to 11 digits.
Besides,
48
floating-point
numbers
are stored
with 40
figures.
of rl or r2, some elementary
for r2 when i = bk, k = 1, 2,.
additions
. . ,7. If j additions
are evidently
performed
are so performed
without
without
error:
error, Card(E)
the adding
of l/i
will be at most
j.
As to r2, it should hold.
also be noticed
that we have e(r2) = 1 for any n, implying
that the assumptions
of Theorem
5.1
548
M. Pichat / Correct floating-point
3.3. Numerical
summation
results
Computations were performed on an IBM-PC microcomputer. Programs were written in Turbo-Pascal code. We used a floating-point arithmetic with the following characteristics: b = 2, t = 40, truncation mode ( @ = v). We defined a function providing, for any given X E F, Succ( X) if X > 0 and PRED( X) if x< 0. From v we generated the R-directed correct arithmetic addition A: XLJY=
1
Succ( m my
Y)
if wY#X+
Y,
otherwise.
As simple illustrations, we carried out, for different values of n, the following summations: ” r1 =
rz =
i l/i, i=l
C l/i2, t=l
r3 =
i 1/(2i i=o
+ 1).
Computation of each sum r, making use of operators v and A respectively, provides two floating-point values: R, and R,. Performing binary outputs we derive Card( E ) = RR - R L + 1. Results of these tests are given in Table 1.
4. Binary summation In this section we consider binary summation, computations being performed in binary floating-point arithmetic. Assuming N = 2k for the sake of simplicity, binary summation of N reals xi, _. _, xN is defined by the following formulas: rjo = x,,
i=l,2
,...,
N,
rji=r. 2,_,,j_l+r2i,j_,,
i=l,2
,...,
N/2’,
j=l,2,...,k,
and generates the same algebraic result, r,, = r = E~zlxj, as does recursive summation. Computing procedures, associated with this scheme and obtained by the way of perturbation, are yet 22N-’ or 2N-‘, according to whether or not entry errors are taken into account, with N = 2k. We can characterize the set E’ of floating-point values generated by these computing procedures in the case where reals xi have the same exponent for any i. 4.1. Theorem. Assume that computations are performed in binary floating-point arithmetic and that N = 2k positive reals xi of the same exponent are given (i = 1, 2, _. _, N). Let E’ be the set of values R = R,, defined in F by Rjo=Xj R,j= Then, E’ consists
withXi=(~xiorAxj), RZ;_l,j-l
~;j Rll,j-1
i=l,2 with
qj
= (B
,.._,
N,
or A)>
of at most k + 2 consecutive elements
of F.
i=l,2
,...,
N/2’,
j=l,2
,...,
k
M. Pichat / Correct floating-point
summation
549
Proof. By assumption, b = 2. We denote by p the common exponent of xi, br-’
Card( E,‘,) < 3
Vi,
Vi,
as derived from Preliminary Lemma 3.3. Let us assume that the above property is true at level j - 1: Card( E,lj_ 1)
Setting =
lbP+j-1-t
R2i-1,,--1
= )
&p+j-1-t
b’-’ < 1, m < b’,
3
R2r,j-l
with I= L,..., L+A (X
i-l,2
Ri+l=Ri~jXj+l,
,...,
N-l,
the terms of the sum not necessarily being of the same sign. In the case where partial floating-point sums R, have the same exponent for any i, or more generally when S # F in the assumption that sums Ri have the same accuracy interval in S Vi, the following can be stated. 5.1. Theorem. Given X= (XI,. . . , X,) E={R:R=x,~x++
E SN, let E be the following set of elements in S:
-613_~x~, gl =(VorA)}.
If S is a nondecreasing-intervals subset of II3 and if partial sums Ri L = XIV . . . VX, and Ri, = X& . . . AX, satisfy E( R, L) = 6?(RiR) = constant, independent of i Vi, then E consists of consecutive elements of S and, denoting P = Card(E), freq(Sj)=Card{O:R=S,}=2N-pC{Z~,
thej’-th element S, in set E satisfies j=1,2,...,P.
M. Pichat / Correct floating-point summation
550
Proof. The proof follows by induction with respect to N. Basis step. The property is true for N = 2. Induction step. Let us assume that it is true at order k: on using notation set for the proof of Theorem 3.1 we assume that Ek consists of X consecutive elements of S and that we have freq(S/) = 2k-hC/1i,
j = 1, 2 ,...,
A.
By assumption, elements S,! + X,,, and elements S,! have the same accuracy interval in S. Only two possibilities can then arise: (a) Sj + xk+ 1 E S Vj = 1, 2, __. , A; then, Card( Ek+l) = A and freq( S’“) = 2freq( S’). (b) There exists one and only one element of S belonging to each interval ] S’_ 1 + X, + i, S’ + X,+,[; then, Card( Ek+l) = X + 1 and the property immediately follows from freq(SF+‘)=freq(SF_,)+freq(S;),
j=2,3,...,X.
0
5.2. Example. The sum r4 = 4 + C:= 11/(2i + 1) satisfies e( rq) = 3 for any n, and leads to the following values of Card(E) : 50 51
n
Card(E)
100 101
150 151
200 201
250 251
In that case, taking into account entry errors, we have exactly Card( E’) = Card(E) Vn. Such a result follows from Card(E) = n + 1 Vn and from the fact that the real 4 is entered without error. 5.1. Sums respectively computed by a sequence of correct additions and the conjugate sequence Given a sequence of correct operators O=(Oj,
i=l,2
,...,
N-l;
0, =(Vi
or Ai)),
0’, the conjugate sequence of 0, is defined by o’={@,!,i=l,2
,...,
if 0, = vi,
A;
N-l}with@l=
v, i
I
ifO
I
=A I
Computations being performed in floating-point arithmetic, let us consider the sums respectively computed by a sequence of correct additions and the conjugate sequence in the assumption that partial sums Rj have nonincreasing exponents. 5.3. Theorem. Given X= (Xl,. . . , X,) E FN and a sequence of N - 1 correct additions if partial sums RiL=Xlv---vXj and Ri,=X,~***~X, O={q, i=l,2 ,..., N-l}, satisfy e(RiL)=e(Ri,)
in F,
M. Pichat / Correct floating-point
summation
551
then the following property holds for R = X, @+ . . * eN_ 1 X,, and R’ = XI @$ . . . gzl_, X,: R + R’ = constant,
independent of 0.
We assume that the property is true at order k. Let 0, be any sequence of k - 1 correct additions and Proof.
Rk=X,
@+ ...
et,_,&,
R;,=X,%$
...
q-,X,.
We then have R, + RL = C, independent of 0,. The fulfilled conditions: - 8( Rk) = (Y independent of Ok, and - &(Rk+Xk+J=j3
=A~=XE”P.
All the numbers Ri + X,, 1 can be obtained from elements of F by a translation amplitude e satisfies 0 G e G p. If e # 0, we have
P-e Rk % xk+i = Rk + x,+1 +
i -e
whose
if GJ = A, if ek = v,
and it follows that R k+l+R~+l=C+2Xk+l+P-2e_
0
5.4. Remark. The assumptions of Theorem 5.3 do not ensure that E consists of consecutive elements of F. 5.5. Example. If we consider the algebraic procedure defining r = x + y - x for the operands x = 1000, y = 10P4, and r = 10e4, and in decimal floating-point arithmetic with seven digits, we have E = (0, 10P3}. 6. Conclusion
In this paper we have considered, for summation algorithms, the set of all computed results generated by perturbation in correct floating-point arithmetic. The main features of the results are the following. (1) In arithmetic summation, entry errors have a weak effect, almost negligible. As established by Theorems 3.1 and 3.5, or in Section 4 for binary summation, the maximum values of Card(E) differ from 1. (2) Binary summation is almost optimal for accuracy in the case where terms are of the same sign and of nearly equal magnitude. Indeed, by Theorem 4.1 the set E’ of floating-point values generated by the 22N-1 computing procedures associated with the algebraic binary scheme satisfies Card( E’) < log,( N + 2). It follows from this result that binary summation should be used much more frequently, particularly when computing mean values or standard deviations.
552
M. Pichat
/ Correct floating-point
summation
As to the ordinary arithmetic summation algorithm, the fact that the maximum value of Card( E’) is of order N often yields poor accuracy. The proof of Theorem 3.1 leads to Wilkinson’s [7] conclusion that, when adding terms of decreasing magnitude, the smallest terms are to be added first. Also, in a recent paper, Stummel [4, p. 4571, making use of forward analysis, underlined that the arithmetic summation algorithm is not well-conditioned in many cases. (3) Theorem 5.1, although making quite heavy assumptions, has its interest in the fact that the normal distribution of computed results (see [l], for example) clearly appears in this particular case. (4) Theorem 5.3 shows the necessity of rundomIy perturbing the result of each arithmetic operation, in order to evaluate the computing error. Given a sequence of operands, the availability of the sums R and R’, respectively computed by a sequence 0 of correct additions and the determined conjugate sequence 0’, does not lead to a significant result: i( R + R’) has a constant value when 0 varies, but no guaranteed connection with the exact algebraic sum r can be derived in that case. We can stress that random perturbation is one of the bases of the Permutation-Perturbation method. Finally, a point that should be underlined is the following. Of course, given an algorithm, the property of Card(E) to be small does not imply good accuracy, unless E consists of consecutive values of F and includes the exact result. This algebraic study does not provide any evaluation relative to the precision of the obtained results, except when the assumptions of Theorem 4.1 hold and binary summation is used. For all other summations, only the Permutation-Perturbation method is able to evaluate the exact number of significant decimal digits appearing in computed results and must be performed. As a consequence of the weak effect of entry errors in arithmetic summation, it will not be necessary to perturb the entries in that case, as far as no measure errors occur.
References [l] J.-P. Faye and J. Vignes, Stochastic approach of the permutation-perturbation method for round-off error analysis, Apphed
Numerical
Mathematics
I (4) (1985)
[2] U. Kulisch and W. Miranker, Computer
349-362.
in Theory and Practice (Academic Press, New York, 1981). [3] P. Linz, Accurate floating-point summation, Comm. ACM 13 (6) (1970) 361-362. [4] F. Stummel, Perturbation theory for evaluation algorithms of arithmetic expressions, Mathematics of Computation Arithmetic
37 (156) (1981). [5] J. Vignes, New methods for evaluating the validity of the results of mathematical computations, in Simulation
20 (4) (1978)
[6] J. Vignes and M. La Porte, Error analysis in computing, Proc.
[7] J.H. Wilkinson, Rounding
Math.
& Comput.
227-249. Errors
in Algebraic
Processes
ZFZP Congress, Stockholm (1974) 609-614. (Prentice-Hall, Englewood Cliffs, NJ, 1963).