New methods for evaluating the validity of the results of mathematical computations

New methods for evaluating the validity of the results of mathematical computations

Mathematics and Computers o North-Holland Publishing in Simulation Company XX (1978) NEW METHODS FOR EVALUATING MATHEMATICAL COMPUTATIONS 227-249 ...

2MB Sizes 0 Downloads 8 Views

Mathematics and Computers o North-Holland Publishing

in Simulation Company

XX (1978)

NEW METHODS FOR EVALUATING MATHEMATICAL COMPUTATIONS

227-249

THE VALIDITY OF THE RESULTS OF

J. VIGNES Institut de Programmation lJniversit4 Pierre et Marie Curie, Paris, France

From a mathematical standpoint, numerical methods can be divided in two classes: (i) Direct methods, which give an exact result after some finite number of computations, and (ii) approximate methods, which only give an approximate result after any finite number of computations. When these numerical methods are carried out on a computer, they are projected into discrete space and performed with a limited precision arithmetic. Consequently they yield approximate results. A result given by a direct method contains only a computing error while a result given by an approximate method contains both a computing error and a method error. It is absolutely necessary to evaluate these errors. In this paper, we present new methods for estimating these errors and for evaluating the exact number of significant decimal digits appearing in computed results. Also presented here is a general methodology for improving the precision of computed results.

both a method error and a computing error. These errors often make computed results meaningless. So it is necessary to check for their validity. We shall describe here new methods for evaluating errors caused by the computer arithmetic. First, let us review some recent results published in refs. [ 1, 6, 7, 14-16,23,24].

1. Introduction Numerical methods intended to solve scientific problems are generally supposed to be used in a continuous space with an infinitely precise arithmetic. This is the case of most methods found in Numerical Analysis, Simulation, Operations Research, etc. When performed with an infinitely precise arithmetic, and stopped after some finite number of computations, these methods can be divided in two categories: (i) Methods which give an exact result are called direct (numerical) methods. For example, the methods found in Linear Algebra are direct methods. (ii) Methods which only give an approximate result are called approximate (numerical) methods. Results given by these methods contain a method error. Most iterative methods are approximate methods. When these methods are run on a computer, they are projected into a discrete space and performed with a limited-precision arithmetic. This causes a computing error to be generated and propagated at the level of each elementary operation, affecting the final computed result. Hence, when carried out on a computer, (i) a direct method gives a result containing only a computing error, (ii) an approximate method gives a result containing

2. Errors generated by computer errors)

hardware (computing

Computing errors arise because numerical values are represented in a computer by a finite number of significant digits. As a result, any computer operation generates an error. 2. I. Errors generated by the assignment operator Any element x E R is computer represented by an element X E F, with F containing all the values that can be written in normalized floating point computer representation. Thus X is just an approximation of x. In fact the value x E R can be written in a normalized floating-point arithmetic as follows: x=m*be,

(1)

where m is an unlimited mantissa such that l/b
J. Vignes /New methods for evaluating mathematical computations

228

2.2. Error generated by arithmetic computer operators

e is an integer exponent: e E Z, and b is the base. The computer representation of x is expressed by X as follows: X=M.bE,

The following algebraic operation:

(2)

z=xoy;

where M is a mantissa limited top digits in base b, and E is the integer exponent, E E E with E containing all the integer numbers that can be represented in the exponent-part of X. The underflow and overflow num bers are not considered here. The relative computing error x generated by the assignment operator is defined by:

x-x

Y

ff=X=--M2

z=xs2

Y,

as

X,Y,ZEF,~~E@,

0,*,/I, (5)

where 0, 0, 0, / are the standard arithmetic puter operators. It can be shown that the absolute

com-

(6)

is given by the equation f, = (x + E,) wo, + EJ - xwy + cr(xaY)

)

a E P(a)

(7)

where e, and E,, are the absolute errors in the operands X and Y.

1 [lo] :

the probability

is performed on the computer

(4)

e,=Z-z

with r = m - M. Statistical studies of a, made in refs. [ 1,6,10,14], lead to consider QIas belonging to a statistical population P(o) whose mean value Crand standard deviation (I are given in table 1 under the following hypotheses: - hypothesis

x,y,zER,oE[t,--,X,:1

density P(r, M) is equal to p.

2.3. Errors generated by library functions

2 [6] [ 141:

- hypothesis the probability

The expression

density P(r, M) is given by

u = funct(x) , P(r, M) =pMlogb

,

(8)

where u, x E R, and funct is a mathematical is performed on the computer as

where p is a constant. A similar study has been done in [7] for the local round-off errors occurring in floating-point arithmetic with guard digits.

function,

U = FUNCT(X) ,

(9)

where U, X E F, and FUNCT is a function of the computer library. It can be shown that the absolute

Table 1 In this table, the subscript t means truncated arithmetic (chopping), while subscript r means rounded off arithmetic (rounding), and n is the number of bits of the mantissa (base 2), n = 4q, where q is the number of hexadecimal digits of the mantissa base 16), _~. __~ ~~~ Hypothesis 2 Hypothesis 1 ____ _ base 2 fft at F or

-0.693 x 2-n 0.431 x 2-n 0 0.408 x 2-”

base 16 -1.48 x 2-n 1.77 x 2-n 0 1.33 x 2-”

base 2 -0.718 x 2-n 0.454 x 2-n 0 0.425 x 2-”

base 16 -2.70 x 2-” 2.81 x 2-n 0 1.95 x 2-n

J. Vignes /New methods for evaluating mathematical computations

computing

error E, defined by

e,=U-u

(10)

is given by: e, = f, d’FUNC$(X)]

+ (YFUNCT(X) ,

results. For the sake of simplicity, we assume that the result r is unique. When performed on a computer, this procedure is represented in a syntactic form equivalent to its algebraic form, and can be written as:

o!E P(cY) PROCEDURE@,

(11) where E, is the absolute error in X. Thus it theoretically possible to estimate the error generated by the computer hardware for any arithmetic operation. As a result, computing errors can be evaluated by following numerical processes step by step and by estimating the absolute error committed at each arithmetic operation. Since Q naturally intervenes in the estimation formulas, efforts have been made to evaluate either and upper bound of (Y,or a mean value of (Y. Unfortunately, in the former case, the accumulation of upper bounds of elementary errors causes the estimated error to be much greater than the actual error; in the latter case, a mean estimation can only be achieved when errors are independent. This does not happen with algorithms, and so the evaluation is impossible. Using a condition number to evaluate the propagation of the computing error has been suggested in [24, 251. This idea has been applied in [ 151 to design some software for round-off analysis. But in practice this method cannot be used for real-life problems which need a large number of arithmetic operations. Here we present a practical method for roundoff analysis in computing which is easily applicable to real-life numerical problems.

3. Method for evaluating the precision of a computed result 3.1. Precision evaluation in direct numerical methods 3.1.1. Permutation method. Theoretical aspect Let us consider some exact numerical method (in the sense of section 1) defined by: procedure(d,r,

+, -,X,

:, fonct) ,

229

(12)

where d C R is the set of data, and r C R is the set of

R, 0,

0, *, /, FUNCT) ,

(13)

whereDCF,RCF. Now this data-processing procedure (13) cannot at all be considered as the computer image of the algebraic procedure (12). It is one of the possible images but not the unique image. Indeed, in data processing, sacred rules of algebra such as the associativity of algebraic addition, are not satisfied, and so each dataprocessing procedure corresponding to a possible permutation of permutable operators is equally representative of the given algebraic procedure. Therefore, if Cop is the total number of combinations corresponding to all different permutations of the operators, there are in fact C,, data-processing procedures that are images of the algebraic procedure. 3.1.2. Perturbation method. Theoretical aspect Let us take any data-processing procedure (13) that is an image of the algebraic procedure (12) and have it performed by the computer. Since each operator produces an assignment-error, the result obtained with this operator contains always an error (error by default when the computer is working with a truncated arithmetic). But, the result by excess is just as legitimate. This is why we must admit that any dataprocessing operation gives two results, one by default and the other by excess. Suppose that the data-processing procedure contains k elementary operations, (such as assignments, arithmetic operations, function ...). Then this procedure yields for R not a single result but actually 2” results that are all equally representative of the algebraic result r. 3.1.3. Permutation-perturbation methods. Theoretical aspect By applying the perturbation method to each data processing algorithm provided by the permutation method, we obtain the set 6? of data-processing results that are images of the unique result r of a given algebraic algorithm. The number of elements in the popula-

J. Vignes /New methods for evaluating mathematical computations

230

tion bi thus obtained is such that Card di = 2kC,, .

(14)

Therefore, Card 61 data-processing images correspond to the unique result r of the algebraic procedure. 3.1.4. Statistical evaluation of the exact number of significant decimal digits in the result Assuming that the computer representation of r is any element of 6? and that Re is the element of I$? corresponding to the algebraic algorithm translated without using permutation or perturbation), and calling R and 6 respectively the mean value and the standard deviation of the elements of @ then a classical statistical computation gives the exact number C of significant decimal digits appearing in the result. This number C is given by the equation: -= ,;

,

lo-

c

permutations and perturbations. Any program includes assignment statements (among them reading) and calculating sequences giving the results as well as control statements governing the logic of the algorithm. Any variable, whether it comes from an assignment or a calculating statement, will have its value randomly fixed by default or by excess. Furthermore, in the light of the preceding comment, all expressions included in the calculating sequences and having a linear form must both be executed in a random order and have their results fixed randomly by default or by excess. These operations can be performed by a single subroutine called PEPER (PErmutation-PERturbation), having a variable number n of arguments and whose role is as follows: for n = 1 PEPER performs the random perturbation of the argument, i.e. makes the random addition of a 0 or a 1 in the last bit of the mantissa.

,

For n = 2 PEPER performs the addition or subtraction of two arguments and perturbs the result.

0

with S=J(Ro-R)2+@

(15)

3.1.5. Practical aspects of the permutation-perturbation method In practice, it is impossible to generate all the elements of 61. Therefore, we must find a sub-population of 6? made of a restricted number of significant elements of 61. A practical method should consist of generating successive elements of the population until we obtain the stationnary state for the exact number C of significant decimal digits. Before going any further, let us make a fundamental comment. If all data-processing operators create errors, one particular operator causes the greatest errors. This operator is subtraction, in the arithmetic sense of the term, between two operands having very close values and both containing errors. The result then obtained is not at all significant because it cumulates the errors of the operands. Therefore, in practice, it suffices to apply the permutation method to the addition and subtraction operators appearing in each expression of the algorithm. The successive elements of the population 61 are obtained by running the program of the algorithm several times. This program must be organized in such a way that it can automatically perform the necessary

For n = 3 PEPER performs the random permutation of the n arguments and makes the additicns or subtractions of them as well as the perturbation of the result. This subroutine PEPER is very easy to write. It may be divided into two parts, the one which generates the permutations of additions and subtractions and the other which perturbs the results. This subroutine needs approximatively 30 Fortran statements. (The Fortran listing of the subroutine PEPER useable on a 7600 CDC is presented in the Appendix.) Each performance of the image program of the algorithm using the subroutine PEPER supplies an R result, that is an element of I%.Let RI and R2 be the results of two performances of the program. Given these two results we compute, by means of eq. (15) the exact number Cc2, of significant decimal digits (the subscript (2) means that Chas been evaluated with a sub-population of 2 elements). Of course we cannot affirm that this sub-population is representative of the whole-population. Consequently we are obliged to perform the program again. Let R3be the new result. With this new sub-population of 3 elements we compute (C’s). If C(s) and C’(a)are stationary, meaning that the difference between C’(2)and C(s) is less than 0.5 for example, we consider that the integer

J. Vignes /New

methods for evaluating mathematical computations

part of C(s, gives the exact number of significant decimal digits in the result. If the stationarity is not obtained we perform again the program until the sta. tionarity is obtained. This method has been used in a great number of real-life problems (such as linear algebra, Fast Fourier Transform etc...). It has never failed, and a stationary value of C has always been obtained with generally 3 elements of di. 3.1.6. Applications of the permutation-perturbation method 3. I. 6.1. Some elementary examples of functionals. The following examples are taken from the paper written by W. Miller [IS] : (1)

(2)

x=dXd

y =d+x

v=dXd

w =d+v

y=w+x

z3=y-v

21=y-x

(16)

z2=y-d

(17)

x =dXv (18)

The aim of the study is to evaluate the exact number of significant decimal digits in the computed results Zl a z2, z3.

As shown in [IS], the process for computing zr is numerically unstable when d is large, the process for computing z2 is numerically unstable around d = 0 and the process for computing z3 is numerically stable for any value of d. The Fortran code using the subroutine PEPER corresponding to these examples is given in the appendix. It is very easy to determine the exact numbers C2, Cr, Ca of significant decimal digits in zr, z2, z3. The results are presented in Table 2. Obviously the results presented in Table 2 are in perfect agreement with Miller’s results, and our method allows us to obtain very easily the exact number of significant decimal digits in the computed functionals. 3.1.6.2. Linear Algebra problems. Since linear algebra problems always occur in the form of matrix and vector processing, use of the permutation-perturbation method is highly simplified. It suffices in practice to apply perturbation only to the elements of the matrices and vectors, while permutations are applied to the columns or the rows of the matrices. Let us consider direct numerical methods for solving a linear-equations system. They give an exact solu-

231

Table 2 d

Cl

c2

c3

-1of20

0 9 14 14 14 14 14 5 0

14 14 14 7 2 0 8 14 14

14 14 14 14 14 14 14 14 14

-lo+6 -1 -10-8 ;;$62 10-b 1010 1030

tion for the system and, during calculation, enable the possible singularity of the matrix to be detected, but they do this if, and only if, they are performed with an infinite-precision arithmetic. When run on a computer, these methods produce errors which make results non-significant and prohibit the detection of the matrix singularity. To solve a linear system on a computer, we must: (i) check for the non-singularity of the matrix [ 131, (ii) check for the validity of the’solution given by the computer, [ 111 and (iii) evaluate the precision of this solution [ 121. Problems that had hitherto not been solved have been solved by the permutation-perturbation method. a. Detection of the matrix singulan’ty. When an analytically singular matrix is processed on a computer, it may happen to be considered as being numerically regular. Therefore, we are led to define singularity, in the numerical sense, in the following manner: Definition: A matrix is numerically singular when the computed value of its determinant is either null or non-significant. This means that the number C of significant digits in the determinant calculated by the permutation-perturbation method is null. Of course, this numerical singularity depends on the computer being used. Let us give here a few results concerning numerically regular and singular matrices. - Regular Hilbert matrix H, = aii ,

a.. =‘I

1

i+j-1’

i,iE [I, nl

Since the exact value of the determinant

(19) A* of H, is

J. Vignes /New methods for evaluating mathematical computations

232

given by ** =

the residuals reflects only the cumulative effect of errors appearing when these residuals are calculated. As a result, we have shown in ref. [ 1 l] that the solution is satisfactory from a data-processing standpoint if for each i E [ 1, n] the following equations are satisfied

i=n

G4(n - 1)

!m=t~

$(2n - 1) ’

(20)

i! 3

it

is easy to calculate the exact number C of significant decimal digits of the computed determinant, using the equation

c*=log

2%

I

I

-1,

I.

(21)

. 1I

I

2nlPi

pi” =

Table 3 shows an excellent agreement between the values of C, calculated by the permutation-perturbation method, and C* and proves that the matrix numerical singularity depends on the precision of the computer arithmetic. - Singular mathematical matrix. We have processed several thousand matrices of all types and orders deliberately made singular by replacing the last row with a combination of the others. The permutationperturbation method always determines the singularity, generally with only two determinants being calculated. b. Validity of the solution. Every solution given by the computer is substitued into the given equations to check if they are all satisfied [ 111. Naturally, the residuals are never found to be null. Therefore, from a data-processing standpoint, we consider the solution obtained to be satisfactory if and only if the value of

2

pi=Bi-

AiiXi.

(22)

j=1

with n being the number of bits in the mantissa and N the order of the system; m = 2 if the computer is working with truncated arithmetic while m = 1 if the computer is working with rounded-off arithmetic. When a solution is given by the computer, three cases may happen: 1.p;-1 for each i E [ 1, n] Then is satisfactory from a 2. << << 2” be improved in simple by the following + &k,

Table 3 n

A*

_~~~ 8.3 4.6 1.6 3.7 5.3 4.8 2.7 9.7 2.1 3.0 2.6 1.4 a)

is the

3600 CDC Computer (p = 36 bits)

CDC Computer @ = 48 bits) a)

7600

Number of calculated

C

4 3 3 3 3 4 4 3 3 3 2 2

13 12 11 9 8 7 5 4 3 1 0 0

c*

At larity S

C

C*

ofA

larity s

~_._~.__ ~x x x x x x x x x x x x

8.3 4.6 1.6 3.7 5.3 4.8 2.7 9.7 2.1 2.8 -4.5 -7.8 of bits

x x x x x x X x x x x x

the mantissa.

-_____

8.3 x 4.6 1O-4 X 1O-7 3.7 x 5.3 10-t* x 1O-25 2.5 x 1.3 10-42 x 1O-52 -9.6 x

12 10 8 7 4 2 0

s s

-__

3 3 3 3 3 3 3 2 2 2

10 8 7 6 4 3 1 OS OS OS

233

J. Vignes 1 New methods for evaluating mathematical computations

any discrete mathematical transform, such as the discrete Fourier Transform (FFT) belongs to the class of direct numerical methods. Several papers have been published on round-off error propagation in FFT, presenting lower or upper bounds [8] as well as the mean square [2] of the global round-off error propagation. Using the permutation-perturbation method, it is very easy to evaluate the local precision on each point of the FFT. In order to show the efficiency of this method, we present here a few results obtained on a theoretical problem. More results concerning real problems are presented in [3]. Let x = (xk) for k = 0, 1, . . .. N - 1 be an array in C?. The Fourier Transform of x is y = ($k) in fl such that:

where AX(k) are solutions of the linear system A .aX(k)=B-AX(k); 3.p; - 2n for some i. Then the system cannot be solved by the numerical method on this computer. c. Precision of the solution. Although the criterion of normalized residuals pi* leads to satisfactory solutions from a data-processing standpoint, we have shown in [12] that it does not allow us to state that the computed solution is a good computer image of the algebraic solution. To reach this conclusion, the exact number of significant digits in the solution must be determined by means of the permutation-perturbation method. This method has been applied to a great number of linear systems. It has always worked and has never failed. Moreover, it is the only method allowing for the detection of numerical instability of the system, proposed by J.H. Wilkinson, even if the system matrix W, is well-conditioned. When applied to this system, the permutation-perturbation method leads immediately to the conclusion that the first 8 unknowns are not significant, as shown in Table 4.

N-l

yp=

;

where w = exp(-2in/N). is defined by:

,

N=SO.

l--(y A*



A=2N-‘-1+~.

(23)

The exact number Ci of significant decimal digits provided by the permutation-perturbation method is presented in Table 4. 3.1.6.3. Fast Fourier transform. The computation of

xk = (0, 0)

for

k= (0, 1, .. . . 111))

xk=(l,o)

for

k=(ll2,

113, .. . . 143),

xk=(O.O)

for

k=(l44,

145, . . .. 255).

___.--

ci

1,

In our example the array x

Table 4 Variables

l,..., N-

To evaluate the precision on each computed element yp we have applied the permutation-perturbation method to the well known Cooley-Tuckey’s Fast Fourier Transform algorithm. A few results obtained on a 7600 CDC computer are presented in Table 5. C, and Ci are respectively the exact number of significant decimal digits appearing in the real and imaginary parts ofyp, as given by the permutation-perturbation method. As shown in Table 5, the permutation-perturbation method gives

The exact solution of this system is: xk = _2k-’

p=O,

N= 256,

1 cu=o.9,

for

(24)

1 B=

XkWkp c k=O

x1

x9

to

to

x12 to

X15 to

x19 to

x24 to

x26 to

x29 to

x34 to

X37 to

x40 to

x44 to

x48 to

X8

x11

x14

x18

x23

x25 x28 ~_____

x33

x36 _

x39

x43

x47

x49

0

1

2

3

4

5

I

9

10

11

12

6

8

x50

13

234

J. Vigmes /New methods for evaluating mathematical computations

the exact number of significant decimal digits appearing in the computed values of the FFT. Note too that this method gives correctly the null values of the FFT. 3.2. Precision evaluation in approximate numerical methods Within the class of iterative methods we shall consider exact iterative methods which yield, from an algebraic standpoint, an exact result after an infinite number of calculations, and approximate iterative methods, which, by nature, supply only an approximation of the exact mathematical solution. In fact, when performed on a computer both methods produce approximate results. 3.2.1. Exact iterative methods These methods, which give an exact mathematical result only after an infinite number of calculations, lead obviously to an approximated result if the iterative process is stopped by some terminal criterion. This is the case when they are performed on a computer. Thus computer results always contain a double error caused both by the termination of the iterative process and the propagation of calculating errors. We may divide exact iterative methods into two classes depending on which problems they are applied to. In the first class, we consider non-directly checkabZemethods, e.g. methods leading to results that cannot be checked. This class includes the calculation of numerical series or the optimization of non-differentiable multidimensional functions. The second class contains directly checkable methods, e.g., methods yielding results that can be checked, such as all iterative methods used for solving linear and non-linear equations or systems and for optimizing differentiable multidimensional functions. In both cases, we have to deal with the two following questions. How to choose the termination criterion and how to evaluate the precision of the computed result. 3.2.1.1. Non-directly checkable exact iterative methods 3.2.1.1.1. Termination criterion. Let us consider a convergent iterative method, such as: x,+x,

when n -+ 00

(x, exact mathematical solution). When performed on a computer, this method generates a sequence X, that must necessarily be stopped by means of a termina-

235

J. Vignes /New methods for evaluatingmathematical computations

tion criterion.

IG E

IX, - X,-r

Table 6

The criterion generally chosen is: or

IX, - X,-r

I G clX,l

(25)

with E chosen arbitrarily. This choice is always difficult, so E is often taken to be equal to the greatest precision of the computer, which leads us to write: x, =xn-r

.

(26)

3.2.1.1.2. Precision of the result. When the previous criterion is used to stop the iterative process, it is often thought that the precision of the computed result is the same as the precision of the computer. This would be true if the computer arithmetic was infinitely precise but is totally false with an approximated arithmetic. Thus, after obtaining the result, it is necessary to estimate the computing error. To do so, it is enough to apply the permutation-perturbation method in order to determine the exact number of significant digits in the result. 3.2.1.1.3. Application of the permutation-perturbation method to a serial computation. Let us consider the sequence Ui defined by ui

2

i (27)

i!

with O! = 1 and its sum s, defined by i=n Sn =

C

(28)

Ui

i=O

which is always convergent such that: s, + ex

when

n+m.

for any value of x, and (29)

The criterion to stop the iterative process is chosen so that: IS, - S,_, I < lo-r4

(30)

i) Results obtained when the termination criterion is defined by S, = S,_, . The results obtained with a 7600 CDC computer for XE [-5, -10, -1.5, -201 are given in Table 6. This table shows that the value of the exact number of significant digits in the computed exponential is very far from the precision of the computer (14 decimal significant digits). ii) Precision on the result. We have applied the permutation-perturbation method using Eq. (15). The exact number of significant digits of the computed result, as

x

exact ex _~_ __

-5 -10 -15 -20

6.131946 4,539 992 3.059 023 2.061 153 _

calculated ex 999 916 205 622

X x x x

1O-3 1O-5 1O-7 1O-8

6.131946 4.539 994 3.058 255 6.194 312

999 566 626 856

C x x x x

1O-3 10-s 1O-7 lo-*

10 6 3 0

given by the method, is shown in column C of table 6. One can see that the results are excellent. As is shown in [ 191, the permutation-perturbation method is extremely effective for determining the numerical conditioning and stability of serieals. Furthermore, it is shown in [ 181, that, for some numerical series, a well chosen calculation-procedure strategy can provide with a result whose precision corresponds to the maximal computer precision. Comment: The exponential by the library function EXP(X) is not done by calculating the series and, no matter what value of the argument may be, enables a value having the numerical precision of the computer. 3.2.1.2. Directly checkable exact iterative methods. The results given by these iterative methods are checked with the help of some mathematical relation, and are considered exact when this relation is satisfied. For example, when solving a system of linear or nonlinear equations by iterative methods, the computer results are checked by examining whether each equation is satisfied or not. Thus the satisfiability of a mathematical relation proves the validity of the solution provided by an iterative method. However, mathematical relations such as equalities or inequalities can never be ckecked exactly because of the limited-precision arithmetic of computers. 3.2.1.2.1. Optimal termination criterion. As proposed by J.H. Wilkinson [24], an optimal termination criterion is one which stops the iterative process when the mathematical relation is satisfied. From data processing standpoint, checking for this relation to be satisfied can be done by applying the permutationperturbation method. Thus, the value C of the residual of the mathematical relation is computed for each iteration using eq. (15). A value C < 1 means that the residual is non-significant and that it is only due to the cumulative computing error. We then say that C represents the mathematical value 0 and so the mathemat-

J. Vignes /New methods for evaluating mathematical computations

236

ical relation is satisfied. In this case, the permutationperturbation method makes it possible to stop the iterative process as soon as a satisfactory solution is found, independently of the problem being considered. 3.2.1.2.2. Precision of the result. It is now necessary to determine the precision of the solution obtained according to the previous method. Once again, the permutation-perturbation method can be used to determine, with the help of eq. (1 S), the exact number of significant decimal digits of the result. We have solved a great number of real-life problems using this method. In practice, only three computations are necessary to obtain the precision of the computed result. To illustrate this statement, we now consider a concrete problem, that is the resolution of an equation and of a non-linear system. 3.2.1.2.3. Application of the permutation-perturbation method. (i) Resolution of an equation by Newton’s method. Consider the algebraic equation

f(x>=O>

xER

-f

f(&)

(32)

yx,>

ccnverges to x,, no matter what the starting point xOED. We have x,+=x,

when

gX,-e”=o, O! = I i!

j=o

for k E [-5, -10, -15, -201 and q E [34, 50, 59, 651, whose exact solutions are respectively -5, -10, -15, -20 fork = -5, -10, -15, -20. Remark. Since the residual of algebraic equations has a specific structure, as in the case of linear systems, it is possibel to establish a normalized residual pi shown in [l] to be such that

2plP,I

PI: =

-1,

where p is the number of bits in the mantissa, and

Pn = ,kAjX;,

(31)

Assume that the root x, of eq. (3 1) is located in a finite interval D and that the sequence x,+~, recursively defined by Newton’s method as

xn+1 =xn

We have solved the following algebraic equations:

n-+=.

When solving eq. (3 1) on a computer, we look in fact for the root X, of F(X) = 0, where F is a data-processing image off, and X E F, and we must stop the sequence X,, as defined in 3.2.1.2.1 by means of an optimal termination criterion. Listed in Table 7 are the results obtained when using first the classical termination criterion defined by

i=O

is the residual associated with the nth iteration. It has been shown in [ 1 l] that pi < 1 is strictly equivalent to c,, < 1. The results of Tables 7 and 8 were achieved on a 7600 CDC computer. As shown in these tables, use of the classical termination criterion leads to a great number of useless iterations which do not improve the precision of the solution, while use of the optimal termination criterion makes it possible to stop the iterative process as soon as satisfactory solution is found. Also, the exact number of significant decimal digits, found by means of the permutation-perturbation method after only three computations, appears to be very precisely determined (in Table 8). ii) Resolution of a non-linear system. Consider the following non-linear system of order 3: fr=x:t2x2-eX2+xs+$-10

-12 = 0

,

f2 =x1x2 + eWx2+x1x3 - 1 - $ X IO-l2 = 0, where e is arbitrarily chosen, and then the optimal termination criterion such that the exact number of significant decimal digits in the residual pn is less than l,or cp, < I 7

in = F(X,) .

fa=x:tcosx2-xa-+lo-r2-1=o.

(33)

The exact solution of this system is: xi = 3, x; = 0, x; = lo-r2. The Raphson-Newton method has been used to solve this non-linear system on a 7600 CDC computer. As previously, we have used both the classical termination criterion and the optimal criterion

J. Vignes /New

methods for evaluating mathematical computations

231

Table 7 Classical criterion defined by IX, - Xn_l I i E. k

n

E

x0

Number of iterations

-5

34 50 59 65

10-12 10-V 10-7 10-S

-2 -7 -10 -10

20 56 41 20

-10 -13 -15

XPI -5.000 -9.999

000 000 0110 999 970 939 0 949 202 453

-12.999 -15.004 517 646 609

Table 8 Optimal criterion defined by CP, < 1. k

n

P

x0

Number of iterations

Xtl

-5 -10 -13 -15

34 50 59 65

5 x10-l 5 x10-l 10-l 5 x 10-2

-2 -7 -10 -10

8 8 7 8

-4.999 -10.000 -13.000 -15.002

to stop the iterative process. And we have evaluated the precision of the computed solution by means of the permutation-perturbation method. The results obtained are presented in Tables 9 and 10. Table 10 shows that the iterative process has been stopped just when the system was satisfied, and that the exact number of significant decimal digits evaluated with the permutation-perturbation method for each unknown X,(Vi E [ 1,3]), is excellent. Furthermore this example shows that the permutation-perturbation method allows for the separation of closed unknowns. 3.2.2. Approximate iterative methods This category includes methods which, by nature, provide only an approximation of the solution to the problem considered. Numerical methods used for calculating derivatives, definite integrals, and for solving differential-equation or partial-derivative-equations systems fall in this category. When some initial problem is transformed into an approximated problem so that it can be solved, an approximation error, called a method error, is introduced in the solution. Of course, this error has to be minimized. When the approximation is achieved through finite differences, a mathematical standpoint leads to the choice of a discretizing step as small as

Number of exact digits in X, 999 000 087 544

999 284 551 423

938 223 474 401

11 8 6 4

possible. But the smallest discretizing step available on a computer corresponds to its greatest relative precision, which causes the computing error also to be greatest. Thus, we have to deal with two types of errors acting in opposite ways: when the discretizing step decreases, the method error decreases, while the computing error increases. In fact, the discretizing step must be chosen in order to minimize the overall error. Since this one is specific of each approximate iterative method, it is impossible to establish a general methodology to determine the optimal step. To illustrate this statement, we consider now the calculation of derivatives. We shall determine the optimal step by applying some results from [4] and [S] . To address this problem, we shall apply the classical so-called centered approximation formula which, to the contrary of other formulas, allows for an analysis of the error committed and is not excessively computer-time consuming. Letf(x) be a real function of x E A CR, assumed continuous with a continuous first derivative. From a mathematical standpoint, the derivativef’(x) off(x) at point x0 E A is defined by f’(xo) = lim h-0

f(xo + h) - f(xo - h) 2h

(34)

238

.I. Vignes /New methods

for evaluating mathematical computations From a data-processing standpoint, it is only possible to obtain an approximation f&o) of the derivative at point xo such that: f;(xo)

=f(xo+ h) -f&o - h) 2h

> hfO,

(35)

since we cannot have a step h = 0 and consequently we cannot reach the previous limit. Therefore, we write: f’(x0)

=.C&O)

+ e,

(36)

,

where e, represents the method error. If the function

f accepts up to 3rd order definite and continuous derivatives on the finite interval [x0 - h, x0 + h] it can be shown that:

em= _&-“’

(l)

,

EE [x0-h,xo+h]

(37)

It is easily seen that e, decreases when h decreases. Thus, from a mathematical standpoint, the smallest possible step H yields the best approximation of F’(Xo). However, as stated previously, the smaller the step H the greater the computing error, since this error is caused by the subtraction of two close quantities F(X + H) and F(X - H) both containing errors. Therefore, the computed value of the derivative tends to be non-significant as the step H decreases. Using eq. (39, we have computed the value of the derivative FP off(x) = ex at point X0 = 0.5 for different steps. Fig. 1 shows the variation of the relative loo

ERFP

IFP-f’(m)1 f’(XOl

10

Fig. 1.

J. Vignes /New methods for evaluating mathematical computations

239

Table 10 Optimal criterion defined by Cf < 1, f E f 3 $4

0.3

3

10-15

10-10

number of iterations

x(5’

xi”’

Aj5)

Number of exact digits in the result

5

0.333 333 333 333 357

0

0.100 108 456 305 459 X 10-l’

CXI = 13 Cx2=15 cx3 = 3

error ERFP defined by:

ERFP = I FP - f’(xo) l/f’(xo)

(38)

as a function of the step H. In this calculation, the relative precision of the computer was 3 X IO-‘. We see that large and small steps make respectively the method error em and the computing error e, predominant. In both cases, the value of the derivative is nonsignificant. Like wise we see that the overall relative error es defined by: es = em + e,

(39)

shows a minimum for a step H = 8 X lob3 which yields the best approximation off’(xo). In such a method, the optimum step can be found by minimizing the overall error es. The expression of es follows from the expression of em, given by eq. (37) and the expression of e,, which we are now going to determine. First, we note that the computation of FP(X,) by eq. (35) is in fact achieved according to the following expression (in Fortran notation). FP(Xo) = (F(x,

0 H) o I’+-~ 0 H))/(2 * H) .

(40)

In this expression, F is a computer image off, X0 and H are computer representations of x0 and h, while 0, 0, *,/, are data-processing operators corresponding to the arithmetic operators +, -, X, :. It is shown in [4] that the absolute computing error e, appearing in the computation of f;(x) is given by cc =f,‘(X,) - FP(Xo).

(41) This error e, contains three errors, el, e2, ea, such that: (i) el is created by the data-processing operators 0, 0, *, /, which only approximate their corresponding arithmetic operators. In fact, er can be neglected.

(ii) e2 arises when x0 and h are respectively replaced by X0 and H. In a first approximation, e2 can also be neglected. (iii) e3, generated by F, is given by e3 = -

f(x, 0 H) 0 f Wo 0 HI 2*H F(X,

0 H) 0 F(Xo

0 H)

2*H

(42)

Thus e, = e3. Let e(X) be the relative error associated with the computation of F(X). We have E(X)

=

F(X) - f(x) f(x)

.

Assume that, VX E A, there is a strictly positive number P, small with respect to 1, such that (e(x) (
=z

ec

(Q 0 E2)

(44)

with e1 = e(XO o H), e2 = e(XO 0 H). As a result, the overall error es can be expressed in practice by es = e, + e, (45) Using a result given in [4], it is found that

H5* F”‘*(X,,) 3*H

36PI ~(X,,) I

H81F”‘(Xo) I3 o 648P2 * F2 * (X0) ’

(46)

Hence, cancelling out the derivative of le, I with respect

J. Vignes /New methods for evaluating mathematical computations

240

to H yields the optimal step 1.67 * P * IF(&)

H

F”‘(XO)

thermore, it has always detected functions whose derivatives cannot be computed, by assigning the value 1 to MRE.

1 1’3

(47)

I

here P is the relative precision obtained on the computation of F(X). P can be estimated by means of the permutation-perturbation method. Somewhat surprisingly, the cube root of the third derivative of F appears in the expression of Hp. However, a rather approximated computation of this cube root, easily achieved by the so-called deliberate error method [4], yields a good estimation of Hp. When the optimal step HP is used to compute the value of the derivative at Xo, the mean MAE of the absolute error committed is MAE

f’IF(Xo)I

=

o

4. Precision improvement

H5F”‘2(&)

3*H

36PIF(&) t

H81F”‘3(XO)I ’

(48)

648 * P2F2(XO)

while the mean MRE of the relative error is MRE = MAE/lF(Xel

.

(49)

The method just described has been applied successively to standard functions whose derivative is easily computed, to non standard functions, and to functions for which it is possible to compute the derivative. The results shown in Table 11 represent averages over 100 values of X0 belonging to the range [O.l , 12.51. In this table NF is the number of functions being computed. These results have been obtained on a 1130 IBM computer whose relative precision is approximately 1Oe6. This method has always given satisfaction for computing the derivatives of non standard functions. Fur-

Table 11

Function

Exact relative error e,

Estimated relative error MRE

ex log x J;; arctg x sinx

1.642 2.189 2.375 5.494 2.478

1.662 2.278 2.388 5.361 2.351

x X x x X

lo@ 1O-5 1O-5 1O-5 lo-’

x x x x x

1O-5 1O-5 1O-5 1O-5 1O-5

MRE - er _

0.012 0.041 0.005 -0.024 -0.051

NF

1.5 17 15 20 15

of computed

results

In his publication on concrete complexity [26], Winograd addresses the question of improving the precision of numerical methods. Noting that, when solving a problem, error propagation increases with the number of arithmetic operations to perform, he states that shorter numerical formulations of the solving-process yield more precise results. As an example, he points to the Fast Fourier Transform, as opposed to conventional Fourier Transform. However, there are cases, in which a formulation requiring more arithmetic operations gives better results than a short formulation because its numerical behaviour is better, as will be seen later on, when studying stable formulations for solving 3rd degree equations. Thus, as far as concrete complexity is concerned, the number of operations of an algorithm must always be associated with the numerical behaviour of its formulation. In the following, we shall attempt to work out a general methodology allowing for the improvement of the precision of computed results, no matter which numerical method is used. 4.1. Numerical behaviour of a formulation The concept ofequivalent formulation does not exist in data-processing. Usually two algebraically equivalent forms coding the same algebraic expression yield two different results, and one is better than the other. Therefore we can say that each computer-performed expression has a numerical behaviour which depends, of course, on the values of the variables it contains. Using the ideas of La Porte [9], we are now going to establish a numerically stable formulation for solving third degree equations. As we have already said, subtraction is the main factor of errors created by computer arithmetic. Basically, there are two categories of subtractions; (i) subtractions between a single variable and some expression independent of this variable, called irreducible subtractions,

.I. Vignes /New methods for evaluating mathematical computations

(ii) subtractions which can be transformed into irreducible subtractions called reducible subtractions. Subtractions whose results are always significant are well-conditioned. Otherwise they are ill-conditioned. A subtraction whose result is small with respect to the operands is considered to be ill-conditioned, and leads to a bad numerical behaviour. A numerical method is said to be numerically stable if it contains no reducible subtractions and IZUmerically unstable otherwise. We assume that only unstable formulations containing ill-conditioned reducible subtractions have a bad numerival behaviour. Therefore computing errors can be minimized by eliminating reducible subtractions by means of algebraic methods such of a common denominator, use of recognizable identities, factorization, . .. . However, this elimination process is sometimes difficult or tedious, and leads quite often to stable formulations being more complicated to run than the corresponding unstable formulas. Consequently, an unstable expression with well-conditioned reducible subtractions gives sometimes better results than the stable expression. This is not the case when the reducible subtractions are ill-conditioned. Then, unstable expressions always give very bad or even non-significant results, whereas stable formulations always yield acceptable results. Sor far, little research has been done in this way.. Thus, in the problem of determining the real roots of algebraic equations, only recently [9] has a stable formulation for solving 3rd degree equations been proposed, even if stable formulations have been known for some time in the case of 2nd degree equations.

241

Let P=b/a,

y=c/a,

6 =d/a.

We obtain the canonical form y3+py+q=o with x=y-$3, ,=2p3_pY+~ 27 3

p=-@+y, The discriminant



(51)

A is given by

A = ($p)” + (;q)* .

(52)

We know that: if A > 0 there is only one real root, if A < 0 there are three real roots. When A > 0, the real root x is given by Cardan’s classical formula y =vw+v-m

(53)

i x=y-$0

(54)

When A < 0 the three real roots x1, x2, x3 are given by the following classical trigonometrical relations xr = -$p + 2&$

cos e )

x2 = -;p •t 2a

cos(8 + $7) )

x3=

-$t

2&$Tcos(e

+$),

(55)

with 0 = arc cos

-’ i &qiP

I

4.2. A stable formuhtion for solving 3rd degree equations

These formulas are numerically unstable and give bad results when they contain an ill-conditioned subtraction. To obtain stable formulas we are going to eliminate the subtractions algebraically.

We present now a stable formulation giving the real roots of third degree equations, established by M. La Porte * (the detailed study appears in [9]). Consider the third degree equation

4.2.1. Stable formulation for canonical equations Consider again the canonical equation

a3+bx2+cx+d=0,

a#O.

(50)

y3+py+q=o where p and q are independent,

and

A = ($J)~ + (;q)* .

* M. La Porte died three years ago in a road accident.

1st case: When A > 0 eq. (53) contains always a reducible subtraction regardless of the value of q. To

J. Vignes /New methods for evaluating mathematical computations

242

eliminate this subtraction,

A=$(6

U3 = ;1q1 t@,

‘( ) P

ifp
Sl,

(57)

ifp>O.

2’

3

s2=

!_ ( 3u 1

2nd case. When A < 0 the formula (55) gives bad results for q around zero, because 0 and cos(8 + $r) are then ill determined. So it is wiser to use the following formulas which are numerically stable: Yr = 2&$

ya = -2~cos(~n

- fo)

~

ifp
$w,

(63) P 4 - 3 - u2 t ;p + (;p/u),

(58) u3=$/qI+fi, ,

0 = arc tg(iq/&iC)

-$r

< cd < $r .

(59)

together with the reduced equation =o.

s1=

tl

ifq>O

i -1 ifq
When /3q < 0 (meaning that fl and q are of opposite sign). Formulas (63) contain reducible subtractions and so are unstable. After eliminating these subtractions we obtain the following stable formulas

4.2.2. Stable formulation for general equations Let us consider the general equation ax3+bx2+cx+d=0,

ifp>O,

with

with

x3tpxZtyx+6

I+lifu20p

4.2.2.2. Roots computation 1st case: A > 0. In this case there is only one real root xr given by

cos<+ + $0) )

y3 = 2Gsin

3

-1 ifo
-4 &!+

- ,$P’>, cp= 2J-?53

u = $(r

We then obtain a stable formula given by

Y =’

(62)

-?r2)

(56)

( -1 ifq < 0 ’

U-G

-S&6

with

+I ifq20 sr =

-

detailed in [9]. Finally we obtain

operation

we set

P 4

xr =

I

(60) with

y3+py+q=o.

h=

p and q are not independent and consequently when p is negative, A contains a reducible subtraction which may be ill-conditioned. So we must eliminate it. 4.2.2.1. Computation of A 1st case p > 0. A may be computed using the clasical formula (52), since it does not contain a subtraction. 2nd case: p < 0. Formula (52) contains a reducible subtraction which can be eliminated by the non-trivial

3

-6

$12+h

We suppose that the coefficients 0, y, and 6 are almost independent. Now, when we reduce eq. (60) into the canonical form (61)

I _

ifflq>O

-3-h’

(64)

if/3q
- $04 ’

1U2 + ip + ($p/u)’ ,

if p 2 0

I

ifp
ulql u2 - $p ’

2nd case: A < 0. In this case there are three different real roots given by

I

xr = -&

I

+ 2&$

x2 = -1p - 24$ x3=-@+2&sin$w,

cos($

+ $0) ,

cos($r - J+%)), (65)

243

J. Vignes /New methods for evaluating mathematical computations

(iii) If /3q = 0, then there is a triple root given by

with ,

w = arc tg &

-$r

< w < $77.

xi =x2

Depending on the sign of fl and 4 these formulas may contain reducible subtractions. After eliminating these subtractions by the non trivial operation detailed in [9], we obtain the following stable formulas in each of the different cases: (i) If & < 0 then 1x1 = -;p

I I

- 2ss&$

X2 = -&I

t

cos<;7r - ss$w) ) ljw ,

2*sin

6

x3=--

x1x2

(66)



=x3

= -40.

(70)

These formulas are more complicated than the wellknown classical ones, but they are stable and they always give good results. To illustrate this statement, we have successively applied both sets of formulas to the resolution of the following 3rd degree equation 1O-3x3+4X1O4x2-6X1O3x+4X1O-2=O

(71)

using a 7600 CDC computer on which 14 significant decimal digits are available. The exact roots xl, x2, x3 of this equation are, to the extent of 10 significant decimal digits xi = -4.000

000 015 x lo7 )

with +1

if

p>O

i -1

if

fl
x2 = 1.499 933 324 X 10-l

s3 =

x3 =

cos($

-

4p2 (

x3=----,

9 6

3

x1 = -4.000

+m

m_q

a4

-;

x3 = -7.499 (67)

*1x2

I s4=[I:

x3 = 6.666 962 989 X 10-6. if

6 - P-y>0

if

S-fly<0

3rd case: A = 0. In this case there is either one single root together with a double root or a triple root. Depending on the sign of p and LJwe have: (i) If & < 0 then w = is171 there is a single and a double root given by x1

=x2=-$3ts,&gJ,

x3=--.

,

(74)

This example allows for an instructive comparison between the unstability of the classical formulas and the stability of the new ones. Thus, when studying the concrete complexity of an algorithm, one must take into account both the number of operations and the numerical behaviour of the algorithm’s formulation.

5. Conclusion

(ii) If /34 > 0 then w = $srn there is a single and a double root given by -Y

,

6 x1x2

x1 =x2 =p+,-

(73)

000 015 x 107 )

x2 = 1.499 933 324 X 10-l

- &s3w) ,

OS&

992 847 X 10-2.

While our stable formulas give the excellent results: x1 = -4.000

with y1 = -2~36 and

000 015 x 107,

x2 = 7.500 004 768 X 1O-2 ,

+s4

Yl 1

(72)

$s3w),

6 -or x2=2

6.666 962 989 X 10-6.

Classical formulas yield the following non-significant results:

(ii) If & > 0 then x1 = -&I - 2s3&&

,

x3=-_S~-22sl~. (69)

We have given a classification of different types of numerical methods used in practice for solving scientific problems arising in numerical analysis, simulation or operations research. Any numerical method projected into a discrete field and performed by an approximated computer arithmetic that does not satis-

244

J. Vignes /New methods for evaluating mathematical computations

fy the mathematical laws of associativity and distributivity always gives a result containing errors. The validity of this result must be checked. We have shown that the permutation-perturbation method can be used to check the validity of the result given by a direct method. The permutation-perturbation method has been applied to many real-life problems. It has never failed, and it has always given excellent results. It may appear expensive with regard to computer time since it requires the problem to be solved several times (in general 3 times). Indeed, it is no more expensive than the use of double precision, whereas double precision makes it impossible to ckeck the validity of the computed results. We have also shown that the permutation-perturbation method allows one to stop an iterative process just when a satisfactory solution has been found and to evaluate the precision of the computed result. When applied to iterative methods, the permutation-perturbation method is not computing-time consuming, since, once the result is obtained, it is not necessary to solve again the problem starting from the beginning. We have seen that the overall error generated by approximated methods must be minimized in order to obtain a satisfactory result. Even if in this case we cannot give a general methodology to check the validity of the computed results, the permutation-perturbation method is effective to evaluate part of the errors. As data-processing grows and computers become more and more powerful, it is becoming more and more possible to solve problems that require a large number of arithmetic operations to be performed. Simultaneously, the propagation of computing errors is obviously greater as the number of elementary operations is larger, causing greater errors in the computed results. Consequently, it is becoming more and more imperative to be able to estimate such error so as to be able to determine the validity of computed results. We have also presented here a general methodology to improve the precision of a computed result. In conclusion, we think that the permutation-perturbation method presented here is a very good tool for evaluating the validity of the results yield by mathematical computations and that it is very easy to apply to real-life problems.

References [l]

R. Alt, Etude statistique de l’erreur numerique d’affection sur un ordinateur, I.P. Report no. 76, 5 February 1976. [2] R. Alt, Error Propagation in Fourier Transform, Math. Comp. Sim. XX (1978) pp. 37-43. [3] P. Bois and .I. Vignes, Software for local round-off analysis in discrete Fast Fourier Transform (in press). [4] .I. Dumontet, Algorithme de derivation numerique. Etude thtorique et mise en oeuvre sur ordinateur. Th&e de 3eme cycle, Paris, 1973. and J. Vignes, Determination du pas opti15 1 J. Dumontet mal dans le calcul des d&i&es sur ordinateur. R.A.I.RO., Analyse NumCrique vol. 11 no. 1, 1977, pp. 13-25. of numbers, The ]6 R.W. Hamming, On the distribution Bull. System. Tech. Journal, pp. 1609-1625, October 1970. ]7 1 T. Kaneto and B. Liu, On local round-off errors in floating-point arithmetic, J. Ass. Comp. Mach., vol. 20, no. 3, July 1973, pp. 391-398. of round-off error I8 T. Kaneto and B. Liu, Accumulation in Fast Fourier Transform, J. of Ass. Comp. Mach. 17 (1970) pp. 637-654. numeriquement stable ]9 M. La Porte, Une formulation donnant les racines reelles de l’equation du 3eme degre. IFP Report no. 21516, October 1973. des [lo 1 M. La Porte and J. Vignes, Evaluation statistique erreurs numeriques dans les calculs sur ordinateur. Proceedings of Canadian Computer Conference, pp. 414201-4141213, June 1972. des erreurs 111 M. La Porte and J. Vignes, Etude statistique dans l’arithmetique des ordinateurs. Application au controle des resultats d’algorithmes numeriques. Numerische Mathematik, 23, pp. 63-72, 1974. sur 112 M. La Porte and J. Vignes, Evaluation de l’incertitude la solution d’un systeme lin&ire, Numerische Mathematik, 24 pp. 39-47, 1975. [13] ‘, La Porte and J. Vignes, Methode numerique de ddtection de la singularite d’une matrice. Numerische Mathematik, 23, pp. 73-81, 1974. [14] M. MaillC, Distribution des mantisses des nombres en virgule flottante normalisee. Report of Institut de Programmation, Paris, 1976. [ 151 W. Miller, Software for round-off analysis, ACM Transactions on Mathematical Software, vol. 1, no. 2, June 1975, pp. 108-128. [ 16 ] W. Miller, Computer search for numerical instability, I. A.C.M., vol. 22, no. 4, October 1975, pp. 512-521. 1171 R.E. Moore, Interval Analysis, Prentice Hall, 1966. [ 181 M. Pichat, Contribution a l’etude des erreurs d’arrondi en arithmktique a virgule flottante, These d’Etat, Grenoble, 1976. [19] J. Vergnes, Contribution a l’etude de generateurs de courants harmoniques en magnetohydrodynamique, These d’Etat, Paris 1976. [20] J. Vignes and M. La Porte, Error analysis in computing,

J. Vignes /New methods for evaluating mathematical computations Proceedings of IFIP Congress, Stockholm, pp. 610-614, August 1974. [ 211 J. Vignes and M. La Porte, Analyse des erreurs dans les calculs sur ordinateur, Revue de I’IFP pp. 3-16, January 1975. [22] J. Vignes, M. Pichat, R. Alt, Algorithmes, numeriques, analyse et mise en oeuvre, Editions TECHNIP, Paris (in press).

[23

] J.H. Wilkinson,

Error analysis of floating-point cahxlation, J. Ass. Comp. Mach. 8, pp. 281-330, 1961. [ 24 ] J.H. Wilkinson, Rounding errors in algebraic processes, Prentice Hall, 1963. [ 251 J.H. Wilkinson, The algebraic eigenvalue problem, Clerandon Press, Oxford, 1965. [ 26 ] S.Winograd, La complexite des calculs numeriques, La Recherche no. 83, November 1977, pp. 956-963.

Appendix PROORAW TEST (INPUT,OlJtPUT) OIWCNSION 21 (3) v22(3) ,23O)vPl3l COWMOW PRINT

X 204

204 FORYAtllnl,sX~RCSULTs~*~/~ 1 RCA0 lOOtO PRXNT 201,OD N=3 00 2 I=l*N

P(l)=00 09PtP~RfP111 P(ll=O*o

X=PEPLRfP*lb PtlJ=O p121=x YmPEPCR(P*Zl Ptl)=Y P(2)=-X Zl(Il~PCPERlP,2~ Ptl)=Y P(Z)=-0 Zt~x~~PEPER~P*21 Ptll=O*o V=PEPERtP*ll P(l)=0 PfL)W bPEPERfP*Z) P(l)*Ow X=PEPcRlP~l) PtI)=X PfZl=u Y=PEPLR(P*Zt

Pl1I.Y P(Z)=-V 2 23fI)~PLPtR~P,21 K=l R=oo PRINT 200,K,R,K~~KtI~Zl~I~,Illrl) CALL NCSEtZl,NrCl

K-2

R.D0*.2 PRINT 200~K,R,K~~KrI,Z2~IJ~I=l~3~ CALL NCSClZZqNeC) K=3 RrOo*O0.~3 PRIhT 2OO~K~R~K~iK,I,Z3~I~~I=l~31 CALL NCSftZ3gNsC) GO TO 1 loo FORMATlElO.6~ 200 TORMATl6Xe*TW! EXACT VALUE Of l,* CO~UTEO~r3f/~l6~~*Z*tIl,~~~rIlr~l=~rt22.lS~~ 201 FORMAT~l~O;~O-~rE22.lS~ END FUNCTION PCPERlP,Wl OIMLMSION PINI ICtN.LT.3tGO TO 4

Z*rIlr*

IS

245

•1E22.lS;/*llX1~Z~rIl

J. Vignes /New methods for evaluating mathematical computations PLRMUTATION OLS P(1) 00 2 I=lrN 12~HASARO~fLOAT~!~~fLOAT~N~*O,9999999999999999~ STOCKmP ( I) P(Il~P(I2) PlI2~-STOCR CALCUL DE LA SOYUC OES PII) PERTURGLS S-P(li 00 3 t=Z,N

CT

PERMUTES

S=S*PtI1

It=~ASARO~O.rl.99999999999999~ IflI2;CO.O, GO 10 3 S=XV~SrSteNtCLOAT~I2J~S~~ CONTINUC PCPLR-S RCTURN IF(N.CO,ZBGO TO 5 CAS OU N=l IPILC=~AsARo~O.*l.99999999999999 IflIPILt,EG,O~~GO TO 19

‘t

Id

itttt

~~I~=XV~P~I~,SIGN~~LOXT~~PILEl~P PCPCR=PtNb RETURN CAS OU N=2 S=Pfll*P(Z) 12=HASARO~O,,l,99999999999999~ If~I2,LO.O~ GO TO 6 S=XVIS~SIGN~fLOAT~I2),S)) PEPCR=S RETURN EN0

Xi=iIiNtPRE(0.t .Sl RETURN EN0 SUGROUTINL NCSLlReN,C) DIMENSION R(N) COMMON J RMWO. 00 2 I=ltN RM=RW*R II b RM=RW/fLOAT(N) V4R=O, 00 1 I=lsN VAR=VAR*(RII)-RIdI. VAR=VAR/fLOAT[N-1) ER=SORT l’/Alj, I~ILR,EO.O.~ GO TO 7 G=AGS(RW/LR) IffG~EO.O.~ANO~~R~NL~O~~ ~~~G.LQ.O,.ANO.~R.LO.O.~ C~ALOBlOtGt GO TO G Cal5. IflC.LT.1.I GO TO 3 IC=C*0.5 PRINT ~oO,J~IC*~,IC~R*~~O-IC,C RETURN PRINT Iol,J.4BSICl RETURN fORMIT(6X,*THE CO~PuTEO fORYATl6X,*THE COWPuTEO _ ___._

GO TO 6 00 70 7

VALUE

VALUE

Of Of

t**Itc.

IS

t*,Il,*

IS

l1E=,=,=X*.C***f4.0/) NOT SIGNIfICANT

J. Vignes /New methods for evaluating mathematical computations

RESULTS

D=

-.100000000000000E421 THE EXACT VALUE Of 21 COVPUTED

21

IS

-,10000000000000~E*21

Zl~ll= 0. THE TWE

THE THE

THE De

-,366P5626227668AE*26 21(z)= Z1(3)= -,3Ea~5h262276681C*26 COMPUTED VALUE Of 21 IS NOT

SIGNIFICANT

c=

0.

EXACT VALUt Of 22 JS .999999999999999E*IO 22 COtiPUTED .100000000000000E*41 22(l)= ZZLz)= .999999999999999E+LQ Z2l3)= .100000000000001E*41 COMPUTED VALUE Of Z2 IS .100000000000005*41

C*

14.

EXACT VALUE Of Z3 IS -.999999999999999E+60 Z3 COhPUTED 23(l)-.100000000000000E*61 Z3(2)= -.999999999999999E*60 z313j= -.100000000000001E*61 COMPUTED VALUE OF 23 IS -.10000000000000E*61

C*

14.

C=

9.

C=

14.

C=

14.

c.

14.

C=

14.

C-

14.

C=

14.

Cm

T.

C=

14.

-.10000000000000OE~07 THE

EXACT

VALUE OF Zl IS -.100006000000001E~07 Zl COYPUTED z1c11= -.100000000000F00E*07 Z1(2)8 -.100O00000390625E*07 z1131= -.100000000000000E*07

THE

COMPUTED

THE

THE

EXACT VALUE OF I2 IS .10000000000000CE.13 Z2 COMPUTED ZZ(l)* .1000000000n000lE~13 z2(21= .100000000000001E*13 Z?(J)= .100000000000000E*13 COMPUTED VALUE OF 22 15 .10000000000000E*13

THE

EXACT

23

VALUE

VALUE COMPUTED

Of

OF

Zl

Z3 IS

IS

-.100000000E*0?

-.10000000000010oE*19

23(l)= -.100000000000101E*19 Z)(2)= -.100000000000100E*19 z3t31= -,100000000000100E*19 THE COMPUTED VALUE OF 23 IS '.1CI000000000010E~19 Da -.10000000000000OE~O1 THE

EXACT VALUE Zl COMPUTED

Of

Z1

IS

-.100000000000000E~01

Zl(ll= -.100000000000001E*01 Zll219 -.99999999999999hEbOO Z1I3)m -.100000000000001E*01 THE COMPUTED VALUE Of Zl IS -.10000000000000E*01 THE

EXACT VALUE Of 12 IS .100000000000000E*01 Z2 COMPUTED z2c1>= .1000C00~0000003E*01 22(21= .10000000000C001E*01

THE

COYPUTED

THE

EXACT VALUE Of 23 IS -.200000000000000E*01 23 COMPUTED 23(l)= -.20O000000000003E*O1 Z3(2)= -.200000000000001E+01 z3(3)= -.2oooooooooooooiE.oi COMPUTED VALUE OF 23 IS '.20000000000000E*01

Z2(31=

THE

VALUE

,100000000000002E~01 OF Z2 IS .10000000000000E*01

Dm -,100000000000000E-07 THE

EXACT VALUE Zl COHPUTED

Of

Zl

1s

-,10000000000000OE-07

z1c11= -,100000000000000E-07 THE THE

zlIP)+ -.100000000000001E-O? Z1(3)8 -.10~000000000000E-07 COMPUTED VALUE OF Zl IS -.10000000000000E-OT EXACT ‘VALUE I2 C9*PUTED

ZZ(ll=

Of

THE

z2(2)= Z2(3l= COMPUTED VALUE

THE

EXACT

THE

z2

Is

,lOOOOOOOOOOOOOOE-15

.9999996859901426-16 .lOfl00002153G573E-15 .100000021538573E-15 Of ZP IS .1000000E-15

VALUE OF 23 IS -.100000000000000E-O? 23 COMPUTED z3111= -.100000000000001E-07 23(P)* -.100000000000001F-07 z3(3)= -.100000000000002E-O? COMPUTED VALUE Of 23 IS -.100000b0000000E-O?

247

248

J. Vignes /New methods for evaluating mathematical computations

D= -.100000000000000E-11 THE EXACT VALUE OF-21 15 -.1000oOoOooOoooDE-l1 21 CO*PuTEO z1111= -.100000000000001c-11 21171= -.100000000000000E-11 L1(31= -.100000000000001E-11 THE COMPUTED VALUE OF 21 IS '.10000000000000F-11 THE

THE THE

D=

22

IS

THE THE

c=

14,

c=

2.

C=

14.

C=

14.

c=

0,

C=

14.

.100000000000009E-23

.100166402301344E-23 . 100166402301343E-23 .995?016744?7F,67E-24 OF 22 IS .lOE-23

*

EXACT VALUE OF 23 I5 -.100000000000000E-11 23 C”“IPI,TFC, 23t1;= -.100000000000000E-11 13(2)= -.100F0000000~002E-11 23(311 -.100000600000002F-11 THE COMPUTFO VALUF OF 23 IS '*1000000000~00~C-11 .1000~0000000000E-15 THE FXACT VALUE OF Zl IS .100000000000000E-15 21 COMPUTED z1111= .10Qn00000000000E-15 zl(zl= .100000000000000E-15 z1(3)= .100000000000000E-15 THE COMPUTED VALUE OF Zl IS .10000000000000E-15 THE

09

EXACT VALUE OF 22 CO'IPUTEO Z.?ll)= z212j= 22131; COMPUTE0 VALUE

_~

EX4CT VALUE OF Z? IS .99999999999999?E-32 22 COI*PUTEO Z211);: .394430452610506E-30 zZ(Zl= 0. 2213)~ 0; COMPUTED VALUF OF Z2 IS NOT SIGNIFICANT

EXACT VALUE OF 23 CO~PUTEO z3111= Z3(21= 23(31X TIIE COHPIJTEO VALUE ,999999999999997E-06 THE EXACT VALUE OF Zl COMPUTED zl(ll= Zll2)= z113)= THE CO~PUTEO VALUE

23

IS

. 10000000000000nE-15

.100000000000001P-15 ,100000000000000E-15 . 100000000000001E-15 OF 23 IS .10000000000000E-15 21

IS

.999999999999997E-06

.100000000000000~-05 .999999999999997t-06 .999999999999997t-01 OF 21 IS .10000000000000L-05

c=

14.

.99999999999999OE-12 EXACT VALUE OF 22 IS 22 COf~PUTED ZZllJ= .100000000278046E-11 .99999999b004204E-12 Z2(2I= z2t31= .999999996004197E-12 THE COWWTEO VALUE OF 22 IS .10000000E-11

C=

0,

THE EXACT VALUE OF 23 IS .100000000000099E-05 23 COQPVIED .100000000000101E-05 23t11= 2312)~ .100000000000101E-05 2313). .100000000000101E-05 THE COMPUTED VALUE OF 23 IS .10000000000010F~05

C=

14.

.100000000000000E~11 .100000000000000E411 TYE EXlCl VALUE OF Zl IS ZI COMPUTED ,999974S02400000E410 Zlfll= z1121= .100002693120001E*11 r100002693120001E411 z1131= THE COMPUTED VALUE OF Zl IS .10000E*11

C=

5,

.100000000000000E*21 THE EXACT VALUE OF 22 IS I2 COwPUtEO ZP(llr .100000000000002E*21 ;iooooooooooooo2~42i zztii. .100000000000001E*21 z2131= .10000000000000E*21 THE COMPUTED VALVE OF 12 IS

c=

14.

.999999999999999LdO EXACT VALUE OF 23 IS ZS CO"PUTCD .100000000000001E*31 23111= 231218 .100000000000001E*31 .999999999999999El30 z3131= TtlE COWUTCD VALUE OF 23 IS .10000000000000E*31

c=

14.

THE

D=

THE

J. Vignes /New methods for evaluating mathematical computations

z1131= 0. THE COMPUTED VALUE OF THE

THE THE

THE

Zl

IS

NOT SIQNIFICAW

c.

0.

EXACT VILUE OF Zt IS .999999999P99994E*bo ZZ COMPUTED Zzlll= .100000000000000E*61 z2121= .100000000000002E*61 ZPlJl= .999999999999991E460 COMPUTED VALUE OF ZZ IS .10000000000000E+61

C=

14.

EX4CT VALUE OF 23 IS .999999999999992Ee90 23 COVPUTEO z3t11= .99999999999999bEb90 23(Z)= .10000@000000o02E+91 ,999999999999999E~9O z3131= COMPUTED VALUE OF Z3 IS ,10000000000000E~PI

C-

14.

249