All possible computed results in correct floating-point summation

All possible computed results in correct floating-point summation

Mathematics and Computers North-Holland in Simulation 30 (1988) 541-552 541 ALL POSSIBLE COMPUTED RESULTS IN CORRECT FLOATING-POINT SUMMATION M. P...

770KB Sizes 0 Downloads 42 Views

Mathematics and Computers North-Holland

in Simulation

30 (1988) 541-552

541

ALL POSSIBLE COMPUTED RESULTS IN CORRECT FLOATING-POINT SUMMATION M. PICHAT C.N.A.M.

-

C.N.I.A.M.,

860, Avenue

de Saint

Priest,

34100

Montpellier,

France

On a computer, any entry or elementary operation has two legitimate results, one by default and one by excess. Thus, a given algebraic algorithm with a single result is able, when processed on a computer, to generate a large set of floating-point results, all representative of the exact algebraic result. The aim of this paper is to characterize such sets in the particular case of summation.

1. Introduction On a computer, any real number is represented by a finite number of digits. A consequence is that a computing error is generated when performing any computer operation; entry errors and abbreviation errors are so introduced in algorithms. Limited precision arithmetic also implies the nonassociativity of computing operators. Thus, as pointed out by Vignes and La Porte [5,6], while in algebra a given algorithm provides a single result, the same algorithm processed on a computer will have a set 9? of possible floating-point results, all equally representative of the exact algebraic result. As shown by these authors, the elements of population 9 have two possible origins: (1) Perturbation origin: each entry or elementary operation admits two results, one by default and the other by excess. If k is the total number of operations, computation leads to 2k possible results. corresponding to all (2) Permutation origin: if Cop is the total number of combinations different possible permutations of the operators, computation taking account of nonassociativity of approached operators leads to Cop possible results. Here we study, for summation algorithms, the subset of 99 generated by perturbation; permutation will not be considered here. 2. Notation and preliminaries 2.1. Roundings

and correct operators

Let S be a finite, nonempty subset of R, representing on a computer the set R of real numbers. We denote by R, the real interval bounded by the minimum and maximum elements of s. The mapping v : R!s + S defined by vx=max{Z:ZESAZ
VxER.

rounding of R!s into S.

0 1988, Elsevier Science Publishers

B.V. (North-Holland)

542

M. Pichat / Correct floating-point

summation

The mapping A :R s + S defined by Ax=min{Z:ZESAZ>x}

VxERs

is called the R-directed rounding of R, into S. Any elementary real operator * E { +, -, on S. Operator 0 : S x S + S is called correct if X@Y=(v(X*Y) Operator v : S

x

WY=v(X*

orh(X*

Y))

X,

:}

V(X,

is ‘approached’ by an operator 0

defined

Y)ES’.

S + S defined by Y)

qx,

Y)ES2Ax*

YERs

is called correct and L-directed. Operator A : S x S + S defined by xnY=A(X*Y)

v(X,Y)ES2AX*YE&

is called correct and R-directed. With a slightly different terminology, left-directed respectively right-directed roundings and operators have been completely and axiomatically studied by Kulisch and Miranker [2]. 2.2. Accuracy intervals, nondecreasing-intervals

subsets of II3 and floating-point

numbers

We shall have to measure the local accuracy of the approached representation. For X E S we denote by PRED( X) and SUCC( X) respectively the predecessor and successor of X in S: PRED(X)=max{Y:YESAY
Succ(X)=min{Y:YESA

We assume that 0 is an element of S; we denote by S+ (respectively S-) (respectively negative) elements of S. Given a real x, the strictly positive number E(x) defined by

Y>X}. the set of positive

for x 4 S, forxES+, forxES_, is called the accuracy interval of x in S. For a zero element, the accuracy interval should eventually be considered right or left. S is called a nondecreasing-intervals subset of Iw if IX]<

]YJ

=3 &(X)<&(Y)

v(x,

Y)Cs2AXY>0.

On computers, S is a set F of normalized floating-point numbers. The base will be given by b and the number of mantissa digits in base b by t.

M. Pichat

Notation.

/ Correct floating-point

summation

543

For u E Iw- (0)) the integer p defined by P-i<

(UI
is called the exponent of u and is denoted by e(u). 2.3. Algebraic procedure and computing procedures Let V= { * i, i = 1, 2,. _. , N - l} be a sequence of operations in Iw. Applied to a given sequence of N operands x = (xi, x2, . . . , xN ) E RN, sequence %?defines a real r = rN by ri = x 1,

5+1

=r,

i= 1,2 ,...,

*;x;+~,

N-

1.

r is the result generated by an algebraic procedure. We now consider subset S. Associated with 9 there are 2N-’ sequences of correct computer operations: 0 = { 0, , i = 1, 2,. . . , N - 1)

, where 0, is either v, or A,.

Associated with x there are 2N sequences of operands: X= (x ,‘__., x,) Any

sequence

E P,

where X, = (vx, or AX;).

0, applied to a vector X, defines in S a number R = RN by

R,=X,,

R,+,=R,O,X,+l,

i=l,2

,...,

N-l.

0)

R is the result generated by a computing procedure. According to whether or not entry errors are taken into account, 2N-’ (respectively 22N-*) computing procedures are associated to one algebraic procedure. We are concerned with the elements defined in S by these computing procedures and we consider the two following sets. In the case where no entry errors are committed: given X E SN, the set E of results R defined by (1) for X and any sequence 0 of correct operators is given as E={R:Rdefinedby(l),

@,=(v,orA,)}.

And, in the general case: given x E IwN, the set E’ of results R defined by (1) for any sequence 0 of correct operators and any rounding of x is given as E’ = {R : R defined by (l),

3. Arithmetic

0, = (vi

or &),

X, = (vx; or AX,)}

summation

3.1. A first property For arithmetic summation, a first property can be established with weak assumptions. Entry errors are not considered in this section.

M. Pichat / Correct floating-point

544

3.1.

Theorem. Given X = (Xl,. . . , X,)

E={R:R=X,q&@$

E ( S+)N,

summation

let E be the following set of elements in S: @I =(VorA)}.

.o.@~~,X~,

If S is u nondecreasing-intervals subset of R, then: (i) Card(E)
j=l,2

,.,.,

N.

Proof. The proof follows by induction with respect to N. We first prove (i): Basis step. Property (i) is true for N = 2. Induction step. Now assume it is true at order k: the set E, defined

Ek={Rk:Rk=X1G3,

...@k_lXk,

consists of h consecutive elements denote by G the following set:

by

@I =(vorn)}

of S, Sf, . . . , S:,

G={Z:ZESAZE[S;+X~+,,

.**
and X < k. When adding

S,” to Xk+,,

let us

S;+X,+,]}.

We have Card(G)
= freq( S;)

= 1 = C,O= Ci.

Induction step. Now assume Card( Ek) = k and freq( S’) = C/1: (j = 1, 2,. . . , k). (It will first be noted that Card(E) = N implies Card( Ek) = k Vk, as a consequence of Card( E,,,) < Card( Ek + 1 obtained above.) Necessarily, Card( Ek+ ,) = k + 1. It th en implies that there exists one and only one element of S Sk+i in any interval: Ij_i = [S,k_, + Xk+i, S,‘+ X,+,[ for j = 2, 3,..., k, and that S,? + X,,, G S’for j = 1, 2,. . _, k. Indeed, let us consider I, for any j = 1, 2,. . _, k - 1. I, contains at most one element of S: More than one would be in contradiction with the property of S to be a nondecreasing-intervals subset of R. I, contains at least one element of S: As G = 1, U I2 U . . . U lk_ 1 U { Sk” + X,, i }, assuming that there exists no element of S in I, leads with the above assertion to Card( E,,,) < E S for any j > 1 leads, with the preceding results, to k-2+2
M. Pichat / Correct floating-point

545

summation

We then deduce

freq(~~+*j=freq(~~_,j+freq(S:)=~,/_f+~;/_:=C,/-‘, freq( S:+‘)

= freq( St)

= 1,

freq( S,“:,‘)

= freq( Sf)

= 1.

0

3.2. Remark. To compare with the estimation arithmetic, we define the absolute errors:

R I+1 =R,+X,+l+q,, For any correct operation, of R, + X+1 or, a fortiori, We have

i=l,2

yielded

by classical

error analysis

in floating-point

,.._, N-l.

the quantities 77, satisfy the exponent of R,, ,.

1q, ( < h&-l,

where p, denotes

the exponent

N-l

N-l

v=R-r=

j=2,3,.-.,k,

c qz i=l

and

1771 <

c

bPc-‘,

p,=e(R,+,).

r=l

Making use of the bound p,
elements

result and

E consists

of F.

Use can also be made of the following

relation

R=X,(1+&)+X2(1+&)+

...

established

by Wilkinson

[7]:

+X&t-&),

where (I _ ,,-,),Y+l--r under the assumption 3.2. Injluence

r=l,2


of correct

floating-point

,...,

N,

additions.

of the entry errors

In this section we restrict ourselves to the case where S is a set F of normalized numbers. A preliminary lemma can be given as follows. 3.3. Preliminary Lemma. Given two floating-point floating-point numbers:

H=

(

W: W=

U@ V, U= (Xor

numbers X, YE F, let H be the following

Succ(X)),

W can be computed by 23 computing procedures. X, YE F can be written in the following form: LbP-‘,

Y = M&-l,

set of

V= (Y or Succ(Y)),Cl3 = (v or nj).

Then, the elements of H are at most 3, and are consecutive elements of F.

X=

floating-point

b’-’ < L, M-C b’,

546

M. Pichat / Correct floating-point

summation

where p and q denote the exponents e( X) and e( Y). Then, Succ(Y)

succ( x) = (L + l)V’,

= (M+

l)bq-’

(even if L or M = b’ - 1). We give an outline of the proof for the case b = 2. Notation. For x E R +, let us denote by lx ] and [xl the integers defined by 1x1 =min{n:

lx] =max{n:nER4AAnx},

nEN

Anax}.

The following cases can arise. Case l:p=q. The fact that b=2 thenimplies e(W)=p+l (or e(W)=p+2if - 1). In any case, W= Nbp+‘-’ and integer N takes its value in

The set H has two elements if L + M is elements being consecutive in F. Case2:p>q.Then, e(JV)=(porp+l). (2a): e( X + Y) =p. Then we have W= L + 1 + [(M + l)bq-P]. [Mbq-p],..., If we set M = bJ’-qA4’ + r A r -C bP_4, then set H has three consecutive elements. (2b): e( X+ Y) =p + 1. Then we have W= X+

Y= (Lb-’

Succ(X)

L=M=b’

even, and three elements if L + A4 is odd, these

and integer N takes its value in L +

NbP-’

in any case (r = 0 or r # 0 or L + M’ = b’ - 1) the Nbpf’-‘.

We can write

+ Mbq-P-‘)bP+‘-‘,

+ Succ(Y)

= ((L+

l)b-‘+

(M+

l)bq-p-l)bp+l-l.

If we set Lb-’ + Mbq-J’-’ = Q + p with Q E N and 0 < p < 1, then integer N takes its value in Q,..., Q + [p + b-’ + bq-p-‘]. As p + b-’ + bq-p-’ -C 1 + b-’ + bP2 < 2, the set H has at most three consecutive elements. Case 3: p -C q. This case is dealt with analogously to Case 2. 3.4. Remark. To prove Lemma 3.3 in the case b # 2, one has to modify the statements for p = q:

b # 2 implies e(W) = ( p or p + 1) and two cases have to be distinguished. Now, taking into account entry errors, we can state the following theorem. 3.5. Theorem. Given x = (x1, . . . , xN) E (R ‘) N and computations arithmetic, let E’ be the following set of elements in F: E’= Then,

1

E’ consists

R: R=X,C+

X2@+ -0.

@N_l X,,

of at most N + 1 consecutive

Xi =

being performed

in floating-point

(vxi or Ax;), G$ = cv or A)}.

elements of F.

Proof. The proof is done by induction with respect to N. Basis step. The property is true for N = 2 as a consequence of Preliminary Lemma 3.3. Induction step. We assume that it is true at order k: if EL is the set of all computed values x1@+ **. @,,_ 1 X,, then EL consists of X consecutive elements of F, Sf, . . . , S:, and h < k + 1.

M. Pichat / Correct floating-point

summation

547

If xk+i E F, the desired result follows from Theorem 3.1, otherwise for X > 2 we denote u= vxk+i and I/= A\x~+*. We are concerned with the sums SF @ U and Sf @ V, i = 1, 2,. . . , A. Let us consider the two following subsets:

We have Card(G) < X - 1. By Preliminary Lemma 3.3, Card(H) A-1+2=X+l
< 2 (Sf_, v U tZ H).

3.6. Remark. The weak effect of entry errors in arithmetic comparing Theorems 3.1 and 3.5, should be noticed. Table

It then follows that Card( EL+ 1) <

summation,

as it appears when

1a

Computation

of r,, r2, and r, Computation

n

of r, Card(E)

RI_

RR

50

4.4992053382

4.4992053382

100

5.1873775174

5.1873775180

81

150

5.5911805882

5.5911805892

130

32

200

5.8780309475

5.8780309488

180

250

6.1006752487

6.1006752504

230

Computation

n

of rz Card( E)

R,.

RR

50

1.6251327336

1.6251327331

100

1.6349839001

1.634983900_3

150

1.6382895729

1.6382895732

143

200

1.6399465458

1.6399465462

193

250

1.6409420560

1.6409420564 -

243

Computation n

45 94

of r, Card(E)

RI_

RR

50

2.9476758385

2.9476758381

100

3.2893173135

3.2893173139

98

150

3.4903958449

3.4903958454

148

200

3.6334079824

3.6334079831

198

250

3.7444818129

3.7444818138 -

248

a Decimal binary

outputs digits,

In the computation for ri, l/i2 n -

in Turbo-Pascal

representing

are performed

12 decimal

to 11 digits.

Besides,

48

floating-point

numbers

are stored

with 40

figures.

of rl or r2, some elementary

for r2 when i = bk, k = 1, 2,.

additions

. . ,7. If j additions

are evidently

performed

are so performed

without

without

error:

error, Card(E)

the adding

of l/i

will be at most

j.

As to r2, it should hold.

also be noticed

that we have e(r2) = 1 for any n, implying

that the assumptions

of Theorem

5.1

548

M. Pichat / Correct floating-point

3.3. Numerical

summation

results

Computations were performed on an IBM-PC microcomputer. Programs were written in Turbo-Pascal code. We used a floating-point arithmetic with the following characteristics: b = 2, t = 40, truncation mode ( @ = v). We defined a function providing, for any given X E F, Succ( X) if X > 0 and PRED( X) if x< 0. From v we generated the R-directed correct arithmetic addition A: XLJY=

1

Succ( m my

Y)

if wY#X+

Y,

otherwise.

As simple illustrations, we carried out, for different values of n, the following summations: ” r1 =

rz =

i l/i, i=l

C l/i2, t=l

r3 =

i 1/(2i i=o

+ 1).

Computation of each sum r, making use of operators v and A respectively, provides two floating-point values: R, and R,. Performing binary outputs we derive Card( E ) = RR - R L + 1. Results of these tests are given in Table 1.

4. Binary summation In this section we consider binary summation, computations being performed in binary floating-point arithmetic. Assuming N = 2k for the sake of simplicity, binary summation of N reals xi, _. _, xN is defined by the following formulas: rjo = x,,

i=l,2

,...,

N,

rji=r. 2,_,,j_l+r2i,j_,,

i=l,2

,...,

N/2’,

j=l,2,...,k,

and generates the same algebraic result, r,, = r = E~zlxj, as does recursive summation. Computing procedures, associated with this scheme and obtained by the way of perturbation, are yet 22N-’ or 2N-‘, according to whether or not entry errors are taken into account, with N = 2k. We can characterize the set E’ of floating-point values generated by these computing procedures in the case where reals xi have the same exponent for any i. 4.1. Theorem. Assume that computations are performed in binary floating-point arithmetic and that N = 2k positive reals xi of the same exponent are given (i = 1, 2, _. _, N). Let E’ be the set of values R = R,, defined in F by Rjo=Xj R,j= Then, E’ consists

withXi=(~xiorAxj), RZ;_l,j-l

~;j Rll,j-1

i=l,2 with

qj

= (B

,.._,

N,

or A)>

of at most k + 2 consecutive elements

of F.

i=l,2

,...,

N/2’,

j=l,2

,...,

k

M. Pichat / Correct floating-point

summation

549

Proof. By assumption, b = 2. We denote by p the common exponent of xi, br-’
Card( E,‘,) < 3

Vi,

Vi,

as derived from Preliminary Lemma 3.3. Let us assume that the above property is true at level j - 1: Card( E,lj_ 1)
Setting =

lbP+j-1-t

R2i-1,,--1

= )

&p+j-1-t

b’-’ < 1, m < b’,

3

R2r,j-l

with I= L,..., L+A (X
i-l,2

Ri+l=Ri~jXj+l,

,...,

N-l,

the terms of the sum not necessarily being of the same sign. In the case where partial floating-point sums R, have the same exponent for any i, or more generally when S # F in the assumption that sums Ri have the same accuracy interval in S Vi, the following can be stated. 5.1. Theorem. Given X= (XI,. . . , X,) E={R:R=x,~x++

E SN, let E be the following set of elements in S:

-613_~x~, gl =(VorA)}.

If S is a nondecreasing-intervals subset of II3 and if partial sums Ri L = XIV . . . VX, and Ri, = X& . . . AX, satisfy E( R, L) = 6?(RiR) = constant, independent of i Vi, then E consists of consecutive elements of S and, denoting P = Card(E), freq(Sj)=Card{O:R=S,}=2N-pC{Z~,

thej’-th element S, in set E satisfies j=1,2,...,P.

M. Pichat / Correct floating-point summation

550

Proof. The proof follows by induction with respect to N. Basis step. The property is true for N = 2. Induction step. Let us assume that it is true at order k: on using notation set for the proof of Theorem 3.1 we assume that Ek consists of X consecutive elements of S and that we have freq(S/) = 2k-hC/1i,

j = 1, 2 ,...,

A.

By assumption, elements S,! + X,,, and elements S,! have the same accuracy interval in S. Only two possibilities can then arise: (a) Sj + xk+ 1 E S Vj = 1, 2, __. , A; then, Card( Ek+l) = A and freq( S’“) = 2freq( S’). (b) There exists one and only one element of S belonging to each interval ] S’_ 1 + X, + i, S’ + X,+,[; then, Card( Ek+l) = X + 1 and the property immediately follows from freq(SF+‘)=freq(SF_,)+freq(S;),

j=2,3,...,X.

0

5.2. Example. The sum r4 = 4 + C:= 11/(2i + 1) satisfies e( rq) = 3 for any n, and leads to the following values of Card(E) : 50 51

n

Card(E)

100 101

150 151

200 201

250 251

In that case, taking into account entry errors, we have exactly Card( E’) = Card(E) Vn. Such a result follows from Card(E) = n + 1 Vn and from the fact that the real 4 is entered without error. 5.1. Sums respectively computed by a sequence of correct additions and the conjugate sequence Given a sequence of correct operators O=(Oj,

i=l,2

,...,

N-l;

0, =(Vi

or Ai)),

0’, the conjugate sequence of 0, is defined by o’={@,!,i=l,2

,...,

if 0, = vi,

A;

N-l}with@l=

v, i

I

ifO

I

=A I

Computations being performed in floating-point arithmetic, let us consider the sums respectively computed by a sequence of correct additions and the conjugate sequence in the assumption that partial sums Rj have nonincreasing exponents. 5.3. Theorem. Given X= (Xl,. . . , X,) E FN and a sequence of N - 1 correct additions if partial sums RiL=Xlv---vXj and Ri,=X,~***~X, O={q, i=l,2 ,..., N-l}, satisfy e(RiL)=e(Ri,)


in F,

M. Pichat / Correct floating-point

summation

551

then the following property holds for R = X, @+ . . * eN_ 1 X,, and R’ = XI @$ . . . gzl_, X,: R + R’ = constant,

independent of 0.

We assume that the property is true at order k. Let 0, be any sequence of k - 1 correct additions and Proof.

Rk=X,

@+ ...

et,_,&,

R;,=X,%$

...

q-,X,.

We then have R, + RL = C, independent of 0,. The fulfilled conditions: - 8( Rk) = (Y independent of Ok, and - &(Rk+Xk+J=j3
=A~=XE”P.

All the numbers Ri + X,, 1 can be obtained from elements of F by a translation amplitude e satisfies 0 G e G p. If e # 0, we have

P-e Rk % xk+i = Rk + x,+1 +

i -e

whose

if GJ = A, if ek = v,

and it follows that R k+l+R~+l=C+2Xk+l+P-2e_

0

5.4. Remark. The assumptions of Theorem 5.3 do not ensure that E consists of consecutive elements of F. 5.5. Example. If we consider the algebraic procedure defining r = x + y - x for the operands x = 1000, y = 10P4, and r = 10e4, and in decimal floating-point arithmetic with seven digits, we have E = (0, 10P3}. 6. Conclusion

In this paper we have considered, for summation algorithms, the set of all computed results generated by perturbation in correct floating-point arithmetic. The main features of the results are the following. (1) In arithmetic summation, entry errors have a weak effect, almost negligible. As established by Theorems 3.1 and 3.5, or in Section 4 for binary summation, the maximum values of Card(E) differ from 1. (2) Binary summation is almost optimal for accuracy in the case where terms are of the same sign and of nearly equal magnitude. Indeed, by Theorem 4.1 the set E’ of floating-point values generated by the 22N-1 computing procedures associated with the algebraic binary scheme satisfies Card( E’) < log,( N + 2). It follows from this result that binary summation should be used much more frequently, particularly when computing mean values or standard deviations.

552

M. Pichat

/ Correct floating-point

summation

As to the ordinary arithmetic summation algorithm, the fact that the maximum value of Card( E’) is of order N often yields poor accuracy. The proof of Theorem 3.1 leads to Wilkinson’s [7] conclusion that, when adding terms of decreasing magnitude, the smallest terms are to be added first. Also, in a recent paper, Stummel [4, p. 4571, making use of forward analysis, underlined that the arithmetic summation algorithm is not well-conditioned in many cases. (3) Theorem 5.1, although making quite heavy assumptions, has its interest in the fact that the normal distribution of computed results (see [l], for example) clearly appears in this particular case. (4) Theorem 5.3 shows the necessity of rundomIy perturbing the result of each arithmetic operation, in order to evaluate the computing error. Given a sequence of operands, the availability of the sums R and R’, respectively computed by a sequence 0 of correct additions and the determined conjugate sequence 0’, does not lead to a significant result: i( R + R’) has a constant value when 0 varies, but no guaranteed connection with the exact algebraic sum r can be derived in that case. We can stress that random perturbation is one of the bases of the Permutation-Perturbation method. Finally, a point that should be underlined is the following. Of course, given an algorithm, the property of Card(E) to be small does not imply good accuracy, unless E consists of consecutive values of F and includes the exact result. This algebraic study does not provide any evaluation relative to the precision of the obtained results, except when the assumptions of Theorem 4.1 hold and binary summation is used. For all other summations, only the Permutation-Perturbation method is able to evaluate the exact number of significant decimal digits appearing in computed results and must be performed. As a consequence of the weak effect of entry errors in arithmetic summation, it will not be necessary to perturb the entries in that case, as far as no measure errors occur.

References [l] J.-P. Faye and J. Vignes, Stochastic approach of the permutation-perturbation method for round-off error analysis, Apphed

Numerical

Mathematics

I (4) (1985)

[2] U. Kulisch and W. Miranker, Computer

349-362.

in Theory and Practice (Academic Press, New York, 1981). [3] P. Linz, Accurate floating-point summation, Comm. ACM 13 (6) (1970) 361-362. [4] F. Stummel, Perturbation theory for evaluation algorithms of arithmetic expressions, Mathematics of Computation Arithmetic

37 (156) (1981). [5] J. Vignes, New methods for evaluating the validity of the results of mathematical computations, in Simulation

20 (4) (1978)

[6] J. Vignes and M. La Porte, Error analysis in computing, Proc.

[7] J.H. Wilkinson, Rounding

Math.

& Comput.

227-249. Errors

in Algebraic

Processes

ZFZP Congress, Stockholm (1974) 609-614. (Prentice-Hall, Englewood Cliffs, NJ, 1963).