Rigorous Error Analysis of Numerical Algorithms via Symbolic Computations

Rigorous Error Analysis of Numerical Algorithms via Symbolic Computations

J. Symbolic Computation (1996) 22, 435–458 Rigorous Error Analysis of Numerical Algorithms via Symbolic Computations MARIAN MROZEK§†‡ § Instytut Inf...

646KB Sizes 0 Downloads 24 Views

J. Symbolic Computation (1996) 22, 435–458

Rigorous Error Analysis of Numerical Algorithms via Symbolic Computations MARIAN MROZEK§†‡ §

Instytut Informatyki, Uniwersytet Jagiello´ nski, Krak´ ow, Poland (Received 25 August 1995)

We present a symbolic technique of error analysis of numerical algorithms, which acts as a preprocessor to the actual numerical computation. The technique provides rigorous, a priori bounds which are universal for inputs in a prescribed domain. The method is an alternative to interval arithmetic in applications where speed and rigor but not necessarily the tight bounds are the main concerns. The technique was invented to perform the computer-assisted proof of chaos in the Lorenz equations. c 1996 Academic Press Limited °

1. Introduction The aim of this paper is to describe a method of rigorous error analysis which meets the needs of a new technique of computer-assisted proofs based on topological tools (Mischaikow and Mrozek, 1995a, b; Mrozek, 1995; Zgliczy´ nski, 1995). Computer-assisted proofs in analysis were originated by a paper on the Feigenbaum conjectures by Lanford (1982). In the recent years they have become more and more common. Let us mention the work on the universality of area-preserving maps (Eckmann et al., 1984), the proof of the existence of a homoclinic trajectory in the Lorenz system (Hastings and Troy, 1992; Hassard et al., 1994), the proof of chaos in a molecular model (Rage et al., 1994) or a recent paper by Koch et al. (1995). In all these papers the role of a computer is to find a sufficiently good bound of a certain number, most often the Lipschitz constant in the Banach contraction principle. To make this work as a proof, a method to guarantee rigorous error estimates is needed. The usual tool is interval arithmetic (Moore, 1966; Alefeld and Herzberger, 1983; Rump, 1988; Keiper, 1993). It provides exact and tight bounds but has some disadvantages. Implementations of interval arithmetic are rather slow. This is partly due to the nature of the method (numbers replaced by intervals) but mainly because neither hardware nor the optimization routines of the available compilers are prepared to work efficiently with intervals. On some hardware platforms it is even necessary to emulate directed rounding needed to implement interval arithmetic, which additionally slows down interval computations (ANSI/IEEE, 1985; IEEE, 1987; Cody, 1988). Another disadvantage is the fact that interval arithmetic provides only a posteriori † ‡

Research supported by KBN, Grant 0449/P3/94/06. E-mail: [email protected]

0747–7171/96/100435 + 24 $25.00/0

c 1996 Academic Press Limited °

436

M. Mrozek

error bounds and as a consequence there is no way of predicting the expected outcome. This is expensive if the computations are heavy. In the mentioned topological technique of computer-assisted proofs the disadvantages of interval arithmetic became an obstacle. The technique rests on constructing an enclosure (or multivalued representation, Mrozek, 1995) of a single-valued Lipschitz continuous map f . Roughly speaking, this is achieved in two steps. First a grid {σ} covering the domain of f and a collection of arguments xσ ∈ σ are selected. Then an enclosure of f (xσ ) is constructed and extended to σ by applying the Lipschitz constant. The enclosure must be rigorous but need not be very tight and its requested size is usually known in advance. On the other hand, the construction of the enclosure requires applying the same numerical procedure to a large number of elements in the selected grid, which makes the efficiency of the procedure very important. The method of rigorous error analysis proposed in this paper acts as a symbolic preprocessor to the numerical algorithm. It provides error bounds which are not necessarily the best but are universal for inputs in a given domain. This makes possible the utilization of the existing numerical, nonsymbolic, non-interval libraries in a rigorous way even if the software is not designed to provide rigorous results. Moreover, since running the preprocessor is cheap when compared to the algorithm itself, one can experiment with error bounds of various settings of the algorithm and the algorithm is run only after a setting with satisfactory error bounds is found. The proposed method was successfully applied in a computer-assisted proof of chaos in the Lorenz equations (Mischaikow and Mrozek, 1995a, b). We present the main results of the paper in Section 2 and an example of an application to the Lorenz system in Section 3. In order to make these sections easier to read we do not go into all the details of definitions. The interested reader will find the precise statements of all definitions as well as all proofs in the following sections. In the sequel we use the word error interchangeably with the word deviation. Although the former is commonly used, we feel that the latter is more appropriate in the considered context. Also any use of the word estimate is in the sense of a rigorous bound. Throughout the paper R, R+ and Z denote respectively the set of real numbers, the set of non-negative real numbers and the set of integers. R and Z supplemented by positive ¯ and Z. ¯ If X, Y are sets then f : X → Y will and negative infinity are denoted by R denote a map f defined on X and with values in Y . If the map f is not defined on the whole X then we will write f : X−→ ◦ Y . In such a case dom f will stand for the domain of f . ¯ as a binary operWe will treat the minimum and maximum of two numbers x, y ∈ R ¯ denoted by x ∧ y := min(x, y) and x ∨ y := max(x, y) respectively. For x ∈ R ation on R the integer part of x given by max {n ∈ Z | n ≤ x} ∈ Z. will be denoted by floor(x). The following straightforward proposition is included for reference purposes. Proposition 1.1. Assume a, b, a0 , b0 ∈ R. Then ab ≥ 0 ⇒ |a − b| ≤ |a| ∨ |b| |(a ∧ b) − (a0 ∧ b0 )| ≤ |a − a0 | ∨ |b − b0 | |(a ∨ b) − (a0 ∨ b0 )| ≤ |a − a0 | ∨ |b − b0 |.

(1.1) (1.2) (1.3)

Rigorous Error Analysis

437

2. Main Results Let f : Rn −→ ◦ R be a function and let a1 , a2 , . . . , an be explicitly given real numbers such that (a1 , a2 , . . . , an ) ∈ dom f . In general, the construction of the exact numeric value of f (a1 , a2 , . . . , an ) is difficult. Thus one usually tries to find an easy to evaluate ◦ R which is close to f and computes Q(a1 , a2 , . . . , an ) instead of function Q : Rn −→ f (a1 , a2 , . . . , an ). Since the elementary operations available on most computers are just the elementary arithmetic and logic operations, Q is most often taken to be a rational function. Note that even if some elementary functions are implemented in hardware, the actual evaluation of such functions is via series expansions, hence via elementary arithmetic operations. A machine implementation of a rational function can work only with a finite set of real ˆ ⊂R ¯ called representable numbers (usually a certain set of binary fractions). numbers R As a consequence, any real number x on input, output or as an intermediate result, is replaced by a close representable number denoted by hxi and called the representation of x (see Sections 4 and 6 for detailed definitions). In particular, the arguments a1 , a2 , . . . , an are replaced by their representations ha1 i, ha2 i, . . . , han i and evaluating Q(a1 , a2 , . . . , an ) we compute hQi(ha1 i, ha2 i, . . . , han i), where hQi stands for the machine implementation of Q. To make any rigorous use of such a result we need an exact estimate of the total error etot := |f (a1 , a2 , . . . , an ) − hQi(ha1 i, ha2 i, . . . , han i)|. Obviously, etot ≤ etnc + ernd , where etnc := |f (a1 , a2 , . . . , an ) − Q(a1 , a2 , . . . , an )| is called the truncation error and ernd := |Q(a1 , a2 , . . . , an ) − hQi(ha1 i, ha2 i, . . . , han i)| is called the rounding error. Thus in order to estimate the total error we want to estimate the truncation error and the rounding error. Let us begin with the estimate of the rounding error. We make a simplifying assumption that Q is not a general rational function but a polynomial. This is sufficient for many purposes, in particular when f is analytic. It seems that there is an analog of the analysis presented in the sequel which extends to rational functions, but it is definitely much more complicated, because to get an estimate from above of a fraction one requires an estimate from below of the denominator. Thus it is reasonable to postpone the detailed analysis of this case until it becomes really necessary. Since the machine implementations of the arithmetic operations are not exact, the associativity rule fails in their case. As a consequence the order in which the operations are evaluated does matter. This makes the following concept necessary. By a polynomial expression or briefly an expression we mean a polynomial in the form which uniquely determines the way of its evaluation (see Section 8 for the precise definition). For instance (1 + y(2 + x)) + x(z 2 ) is a polynomial expression but 1 + z(2 + x) + xz 2 is not. By an evaluation of an expression we will mean its standard evaluation whereas the machine evaluation of an expression will stand for its evaluation in machine arithmetic. In Section 9 we will associate with every polynomial expression w of n variables two

438

M. Mrozek

¯n → R ¯ which will serve as the measure of the order recursively defined functions: Hw : R ¯n → R ¯ which will estimate the rounding of magnitude of w(a1 , a2 , . . . , an ) and ²w : R error |w(a1 , . . . , an )−hwi(a1 , . . . , an )| relative to Hw (a1 , . . . , an ). More precisely, we will prove in Section 9 the following theorem. ¯ are such that Theorem 2.1. Let w be a polynomial expression. If a1 , . . . , an ∈ R |ai | ≤ Ni |ai − hai i| ≤ ηi × Ni ,

for i = 1, . . . , n for i = 1, . . . , n

(2.1) (2.2)

for some N1 , . . . , Nn , η1 , . . . , ηn ∈ R+ , then |w(a1 , . . . , an )| ≤ Hw (N1 , . . . , Nn ), |w(a1 , . . . , an ) − hwi(a1 , . . . , an )| ≤ ²w (η1 , . . . , ηn ) × Hw (N1 , . . . , Nn ).

(2.3) (2.4)

Let us explain the practical meaning of Theorem 2.1 in more detail. Assume that in order to prove something we need to verify a conjecture. The conjecture says that the polynomial expression w evaluated at representable arguments in a very large set A := {(ai,1 , ai,2 , . . . , ai,n ) | i = 1, 2, . . . , k} and satisfying some concrete a priori bounds |ai,j | ≤ Nj

for i = 1, 2, . . . , k, j = 1, 2, . . . , n.

always gives a result less than some representable number p. Assume that we experimented with a small randomly chosen subset of A and the experiment led us to believe that all results are actually less than a number p1 < p. Thus if we are able to show that all machine evaluations of w for arguments in A are less than p1 and the rounding error is less than p − p1 , we are done. Notice that if p − p1 is relatively large, we do not need the rounding error bound to be very tight. Theorem 2.1 produces a bound on the rounding error valid for all arguments in A and given by ²w (0, 0, . . . , 0)Hw (N1 , N2 , . . . , Nn ) (we can take all ηj = 0, because we assumed that all our arguments are representable—see Remark 2.3 below). We will show in the Appendix that explicit formulas for ²w and Hw may be obtained by means of symbolic computations. Substituting the a priori bounds N1 , N2 , . . . , Nn we obtain a concrete rounding error bound. Since we get it before the lengthy evaluation of w at all arguments in the set A is performed, we have a chance to give up the evaluation if we decide that the bound is unsafely close to or larger than p − p1 . Instead, we can try to reformulate our conjecture or check what error bound could be obtained on a computer with a significantly larger set of representable numbers. On the other hand, if the error bound is significantly less than p − p1 , we can undertake the evaluation of w using the fastest hardware and standard, non-interval software available and still make rigorous claims about the results. In particular, if all computed results of the machine evaluation of w turn out to be less than p1 , the conjecture is proved. Remark 2.2. Formula (2.1) is itself an expression which must be evaluated rigorously to get rigorous error bounds. To avoid an infinite loop, another method must be used to guarantee rigorous evaluation. Since this evaluation is performed only once, we can apply the classical tools. For instance, we can use exact arithmetic on rationals. Or we can use classical interval arithmetic and take the upper end of the resulting interval as the error bound. Remark 2.3. In the considered example we can take all ηj = 0, because we assumed that

Rigorous Error Analysis

439

all inputs are representable so that ai,j = hai,j i. If we are interested in the evaluation of w for arguments which are not necessarily representable (non-binary fractions, roots or transcendental numbers like π) we can set ηj in such a way that (2.2) is satisfied. Another potential use of (2.2) is error estimation which takes into account some initial errors other than the rounding errors but this requires some changes in the statement and the proof of Theorem 2.1. We leave this to the interested reader. We now turn our attention to the estimation of the truncation error of a numerical algorithm. The estimation of the truncation error is method specific. What is usually common is the fact that even if explicit formulas for that error are available, they are often difficult to evaluate in concrete problems. A good example is the Runge–Kutta method used to construct solutions of initial value problems in ODE’s (see Hairer et al., 1987). It is well known that the error bound of the pth order Runge–Kutta method is a constant times the pth power of the step of the method. What is difficult is the estimation of the constant. In the sequel we will present truncation error bounds of one step of the Runge–Kutta method for an ordinary differential equation with polynomial right-hand side. However, we believe that the applicability of the method is much broader than this example. In particular, considering polynomial vector fields only is not a real restriction. If the vector field is not a polynomial, then in any actual computation it is replaced by a polynomial approximating it. Thus, in order to get the total error estimate one additionally needs a bound on the difference between the exact solution and the corresponding solution of the approximating polynomial vector field. Such a bound is provided by the theory of differential inequalities (see Hartman, 1964). To be more concrete let us consider the initial value problem of the form ∂ ϕ(t, x0 ) = V (ϕ(t, x0 )) ∂t ϕ(0, x0 ) = x0 ,

(2.5) (2.6)

where x0 ∈ R , V : R → R is a polynomial vector field and ϕ : R × R → R is the unknown solution. Given a sequence of points 0 = t0 , t1 , t2 , . . . , the goal of the algorithm is to construct a sequence of points {x1 , x2 , . . . } such that xn approximates ϕ(tn , x0 ) at tn := nh. In Runge–Kutta methods the goal is achieved by means of a recursive formula d

d

d

d

d

xn+1 = Φ(tn+1 − tn , xn ). In the most commonly used fourth order Runge–Kutta method hX κi ki (h, x), 6 i=0 3

Φ(h, x) := x +

(2.7)

where κ0 = κ3 = 1, κ1 = κ2 = 2 and k0 (h, x) := V (x) h k1 (h, x)) 2 h k2 (h, x) := V (x + k2 (h, x)) 2 k3 (h, x) := V (x + hk3 (h, x)). k1 (h, x) := V (x +

From Bieberbach (1923, 1951), (see also Hairer et al., 1987; Theorem 3.1) we have for

440

M. Mrozek

any x ∈ Rd and h > 0 kϕ(h, x) − Φ(h, x)ksup ≤ C(h, x)h5 , where k · ksup stands for the supremum norm in Rd , 1 1 X ¯ C(h, x) := κi max Ki (th, x) max G(th, x) + 120 t∈[0,1] 144 i=0 t∈[0,1] 3

and

(2.8)

° ° ∂5ϕ ° ¯ x) := ° G(h, ° 5 (h, x)° , ∂t sup ° ∂4k ° ° ° i Ki (h, x) := ° 4 (h, x)° . ∂t sup

Define recursively functions Vn : Rd → Rd by V1 (x) := V (x) Vn+1 := DVn (x)V (x) and put G(x) := kV5 (x)ksup . An easy induction argument shows that ¯ x) = G(ϕ(h, x)). G(h, Let Z ⊂ Rd be a compact subset. For s > 0 denote Zs := {x ∈ Z | ϕ([0, s], x) ⊂ Z}

(2.9)

Z and let xZ := (xZ i )i=1,... ,d , where xi := max{|xi | | x ∈ Z}. Define

etnc (Z, h) :=

3 h 1 i 1 X ∗ ∗ HG (xZ ) + κi H K (h, xZ ) h5 i 120 144 i=0

(2.10)

where H ∗ is a variant of the order of magnitude estimate H defined precisely in Section 8. Furthermore, define ernd (Z, h) := ²Φ (0)HΦ (h, xZ ), etot (Z, h) := etnc (Z, h) + ernd (Z, h).

(2.11) (2.12)

In Section 10 we will prove the following theorem. Theorem 2.4. Assume the flow ϕ is generated by (2.5), where V is a polynomial vector field. Then for every h > 0 and every vector x ∈ Zh with representable coordinates kϕ(h, x) − hΦi(h, x)ksup ≤ etot (Z, h).

(2.13)

Thus, etot (Z, h) gives an error bound for the difference between the h-translation operator along the flow of a vector x and the approximation obtained by performing one step of length h of the standard fourth order Runge–Kutta method. As we will see in Section 8, etot (Z, h) is a recursively defined polynomial, thus finding it is an elementary task but typically extremely arduous or even practically unrealizable by hand even for

Rigorous Error Analysis

441

very simple vector fields like the vector field in the Lorenz system described in the next section. In the Appendix, we present a program in MATHEMATICA which finds the rounding error bound for a polynomial expression and the truncation and total error bounds for the h-translation operator of a polynomial vector field approximated by the standard fourth order Runge–Kutta method. Let us emphasize that formula (2.13) provides the error estimate for one step only. This is all that is needed in applications to computer-assisted proofs based on topological methods, because the theory of differential inequalities (Hartman, 1964) may be used to get rigorous and satisfactory bounds for the propagation of errors after several steps (see Mischaikow and Mrozek, 1995b). Nevertheless, if necessary, analogous error bounds for several steps, and for methods with adaptive steps, may be obtained.

3. An Example: the Lorenz Equations As an example of application we consider the Lorenz equations   x˙ = s(y − x) y˙ = rx − y − xz   z˙ = xy − qz.

(3.1)

Here is the implementation of the Lorenz vector field in MATHEMATICA. Lorenz[{x_,y_,z_}]:={s(y-x),r x - y - x z, x y - q z};

First we read the package for the Runge–Kutta error (see Appendix) << RungeKuttaError‘

To compute the formulae providing bounds for the rounding, truncation and total error of one step of the Runge–Kutta method for the Lorenz equations we enter LorenzRoundingErrorBound=RungeKutta4RoundingErrorBound[Lorenz,h,{x,y,z}]; LorenzTruncationErrorBound=RungeKutta4TruncationErrorBound[Lorenz,h,{x,y,z}]; LorenzTotalErrorBound=LorenzRoundingErrorBound+LorenzTruncationErrorBound;

The resulting formulae take a lot of space: The standard MATHEMATICA object ByteCount returns the length of an expression in bytes. In the case of LorenzTotalErrorBound it returns 449680. The auxiliary function from the package PolynomialDegree[LorenzTotalErrorBound]

(see Appendix) shows that the total error bound is a polynomial of degree 23. Selecting concrete coefficients and concrete upper bounds for x, y, z, the deviation estimate becomes a polynomial of h only. The polynomial is still lengthy, because of many maximum functions. Providing upper and lower bounds for the step: MinEst[h]:=1/10000; MaxEst[h]:=1/1000;

442

M. Mrozek

the expanded form of the deviation becomes short enough to be printed (note that the only reason we use the expanded form is to simplify the formula—in actual applications there is no need to perform this step). Entering LorenzExampleRoundingErrorBound=MxExpand[LorenzRoundingErrorBound /. {x->50,y->50,z->50,r->54,s->45,q->10}] LorenzExampleTruncationErrorBound=MxExpand[LorenzTruncationErrorBound /. {x->50,y->50,z->50,r->54,s->45,q->10}]

we obtain the following formula for ernd (Z, h) 54874375 h ---------------9007199254740992

and the following formula for etnc (Z, h) 2 5 31132488375 3125302734375 h 145764404296875 h h Mx[----------- + --------------- + ------------------, 2 4 4 3 372703910775 21516416015625 h 2 744773630712890625 h Mx[------------ + ---------------- + 1332268857421875 h + --------------------- + 4 2 8 4 5 6 4051790720947265625 h + 79749667803955078125 h + 491954864501953125000 h , 2 3 409190660625 2619730128515625 h 347243452294921875 h ------------ + 10967653125000 h + ------------------- + --------------------- + 4 2 4 4 8538374234619140625 h 5 6 ---------------------- + 114496818310546875000 h + 860921012878417968750 h ]] 2

Using the intervals implemented in MATHEMATICA (Keiper, 1993): minh=Interval[0.0001]; Table[TopBound[LorenzExampleRoundingErrorBound /. h->minh i], {i,1,5,1}] Table[TopBound[LorenzExampleTruncationErrorBound /. h->minh i], {i,1,5,1}] Table[TopBound[LorenzExampleRoundingErrorBound+LorenzExampleTruncationErrorBound /. h->minh i], {i,1,5,1}]

we obtain the following lists of the rounding, truncation and total deviation bounds for steps h = 0.0001i, where i = 1, 2, 3, 4, 5. -13 -12 -12 -12 -12 {6.09228 10 , 1.21846 10 , 1.82768 10 , 2.43691 10 , 3.04614 10 } -9 -8 -7 -6 -6 {1.03408 10 , 3.34542 10 , 2.56871 10 , 1.09466 10 , 3.37875 10 }

Rigorous Error Analysis

443

-9 -8 -7 -6 -6 {1.03469 10 , 3.34554 10 , 2.56873 10 , 1.09466 10 , 3.37876 10 }

Analogous computations for the classical choice of parameters (R = 28, s = 10, q = 8/3) give even better error bounds: -13 -12 -12 -12 -12 {6.09228 10 , 1.21846 10 , 1.82768 10 , 2.43691 10 , 3.04614 10 } -11 -9 -8 -8 -7 {5.45261 10 , 1.75403 10 , 1.33903 10 , 5.67279 10 , 1.7405 10 } -11 -9 -8 -8 -7 {5.51353 10 , 1.75525 10 , 1.33921 10 , 5.67303 10 , 1.74053 10 }

The evaluation of the above examples in MATHEMATICA takes less than 1 min. In the proof of chaos in the Lorenz system (Mischaikow and Mrozek, 1995a, b) an enclosure of a Poincar´e map was constructed. Since the grid size used was of the order of 10−4 , the total error bound of the Runge–Kutta method did not contribute significantly to the size of the constructed enclosure. The numerical computations needed in the proof were implemented in C. The total computation time was 33 h on a 90MHz PENTIUM processor. The above error bounds were used hundreds of millions of times but the evaluation of the error bounds was performed only once. The a priori knowledge of error bounds was very helpful in finding the successful setting of the algorithm. It also helped to avoid unfruitful computations. Also, if interval arithmetic were used, the time of both fruitful and fruitless numerical computations would increase from tens to hundreds of hours.

4. Representable Numbers As we mentioned in Section 2, most implementations of numerical algorithms work on a certain finite subset of real numbers and all other real numbers are approximated (represented) by some numbers in this set. Since we want the developed machinery to work on a possibly large number of hardware and software platforms, we formalize these ideas in the possibly weakest setting. ˆ h·i) is a representation of R ¯ if R ˆ ⊂R ¯ is a finite subset and h·i : Definition 4.1. (R, ¯ →R ˆ is a function such that for any x, y ∈ R ¯ the following four conditions are satisfied R ˆ 0, −∞, +∞ ∈ R ˆ ⇒ −x ∈ R ˆ x∈R

(4.1)

ˆ ⇒ x = hxi x∈R

(4.3)

x ≤ y ⇒ hxi ≤ hyi.

(4.4)

(4.2)

The number hxi is called the representation of x. We define the lower and upper representations of x by ˆ | y ≤ x} ↓(x) := max {y ∈ R ˆ | y ≥ x}. ↑(x) := min {y ∈ R

444

M. Mrozek

The following two properties follow immediately from the above definitions and Property (4.4). ↓(x) ≤ x ≤ ↑(x) hxi ∈ {↓(x), ↑(x)}.

(4.5) (4.6)

¯ in hardware is by means of A common way of implementing a representation of R binary fractions. Define N : R \ {0} 3 x → floor(log2 |x|) ∈ Z, m : R \ {0} 3 x → |x|/2N (x) ∈ [1, 2). If x ∈ R \ {0}, then the numbers N (x) and m(x) are called respectively the binary exponent and the mantissa of x. For all x ∈ R \ {0} we have 2N (x) ≤ |x| < 2N (x)+1 N (−x) = N (x), m(−x) = m(x) x = sgn(x)m(x)2N (x) . For D ∈ N, an element of the set of binary fractions µD := {1 + k/2D | k = 0, 1, 2, . . . , 2D − 1} ⊂ [1, 2) may be represented by means of D binary digits. The machine mantissa of x ∈ R \ {0} is defined by m ¯ : R \ {0} 3 x → 2−D floor(2D m(x) + 1/2) ∈ µD . Now we define the sample representation by  x    0 hxi :=  sgn(x)∞    N (x) sgn(x)m(x)2 ¯

x ∈ {−∞, 0, +∞} N (x) < N− N (x) > N+ otherwise

(4.7)

and ˆ := {x ∈ R | hxi = x}, R

(4.8)

where N− , N+ , N− < N+ are fixed numbers which limit the minimum and maximum ˆ h·i) defined this way satisfies exponents. It is an easy exercise to verify that the pair (R, all properties required in Definition 4.1. We will call this representation the standard (D, N− , N+ )-floating point representation. ˆ h·i) is taken to be a fixed representation of R ¯ and the elements of R ˆ In the sequel (R, are referred to as representable numbers. ˆ Then Proposition 4.2. Assume a ∈ R and r1 , r2 ∈ R. r1 ≤ |a| ≤ r2 ⇒ r1 ≤ |hai| ≤ r2 . Proof. If a ≥ 0 then r1 ≤ a ≤ r2 , hai ≥ 0, and by (4.4) and (4.3) r1 ≤ hai ≤ r2 , i.e. r1 ≤ |hai| ≤ r2 . If a ≤ 0 then the proof is similar. 2

Rigorous Error Analysis

445

5. Binary Orders of Magnitude ¯ The set of A binary order of magnitude is a number of the form 2k , where k ∈ Z. binary orders of magnitude will be denoted by Rbom . The (upper) binary order of magnitude of x is given by dxe := min {y ∈ Rbom | y ≥ |x|}. The following proposition summarizes the properties of the upper binary order of magnitude. ¯ Then Proposition 5.1. Assume a, b ∈ R. |a| ≤ dae 0 ≤ |a| ≤ |b| ⇒ dae ≤ dbe ˆ ⇒ dhaie ≤ dae dae ∈ R

(5.1) (5.2)

da ± be ≤ 2(dae ∨ dbe)

(5.3) (5.4)

da × be ≤ dae × dbe da ∨ be ≤ dae ∨ dbe

(5.5) (5.6)

da ∧ be ≤ dae ∨ dbe.

(5.7)

Proof. Properties (5.1) and (5.2) are obvious. To see (5.3) consider first the case a ≥ 0. Then 0 ≤ a = |a| ≤ dae and by (4.4) and (4.3) 0 ≤ hai ≤ hdaei = dae, i.e. dhaie ≤ dae. If a ≤ 0, then a ≥ −dae, 0 ≥ hai ≥ h−daei = −dae and |hai| = −hai ≤ dae. Thus dhaie ≤ dae. This proves (5.3). To prove (5.4) observe that |a ± b| ≤ |a| + |b| ≤ dae + dbe ≤ 2(dae ∨ dbe) ∈ Rbom . This means that also da ± be ≤ 2(dae ∨ dbe). The proof of the remaining properties is similar. 2 6. Representable Arithmetic Let A be a set. A function ω : A × A−→ ◦ A is called an operation in A. Let x, y ∈ A. If (x, y) ∈ dom ω, then we say that xωy is defined and we write xωy := ω(x, y). The four basic arithmetic operations +, −, ×, / and the minimum and maximum operations ∧, ∨ ¯ Since the five relations =, <, >, ≤, ≥ may be considered are examples of operations in R. ¯ ×R ¯ → {0, 1} ⊂ R, we also treat them as operations in R. ¯ as functions R ¯ By a proper implementation of ¦ we mean the operation Assume ¦ is an operation in R. ˆ defined by h¦i in R ˆ × R) ˆ ∩ dom ¦ dom h¦i := (R

(6.1)

xh¦iy := hx ¦ yi,

(6.2)

and for (x, y) ∈ dom h¦i. ˆ × R) ˆ ⊂R ˆ then by (4.3) h¦i coincides with ¦ In general xh¦iy 6= x ¦ y. However, if ¦(R restricted to representable numbers. In particular, the proper implementations of the minimum, maximum and relation operators are just exact. Operations +, −, ×, /, ∧, ∨. ≤, <, =, >, ≥ will be called the standard arithmetic operations. Although in the following sections we will not use the division operation, we

446

M. Mrozek

¯ do include it here, because of possible future generalizations. A representation of R equipped with proper implementations of standard arithmetic operations will be called a representable arithmetic. ˆ h·i) over the If a, b ∈ R are such that a ≤ b and 0 6∈ [a, b], then the accuracy of (R, interval [a, b] is defined by ˆ h·i, [a, b]) := sup {|x − hxi|/dxe | x ∈ [a, b]}. acc(R, Observe that the (D, N− , N+ )-floating point representation given by (4.7) leads to the ˆ and [a, b] ∩ accuracy not greater than 2−(D+2) over any interval [a, b] such that a, b ∈ R {0, −∞, ∞} = ∅. ˆ 6= ∅ and the numbers ˆ h·i) is strong if Rbom ∩ R We will say that the representation (R, ˆ u− := min Rbom ∩ R, ˆ u+ := max Rbom ∩ R,

(6.3) (6.4)

˙ [u− , u+ ]), acc(R, ˙ [−u+ , −u− ])}e ˆ hi, ˆ hi, δ := dmax {acc(R,

(6.5)

satisfy the following conditions ˆ u ∈ [u− , u+ ] ∩ Rbom ⇒ u ∈ R u− ≤ δ ≤ 1 ≤ u+ .

(6.6) (6.7)

In particular, since by (6.7) u− ≤ u− /δ ≤ 1 and both δ and u− are binary orders of magnitude, we get from (6.6) ˆ u− /δ ∈ Rbom ∩ R.

(6.8)

A strong representation equipped with a representable arithmetic will be called a strong representable arithmetic. It is easy to check that the standard (D, N− , N+ )-floating point representation is a strong representable arithmetic with δ = 2(−D+2) , u− = 2N− , u += 2N+ . In particular, a hardware arithmetic complying with the 64-bit IEEE standard becomes a strong representable arithmetic with δ := 2−54 , u− := 2−1022 , u+ := 21023 . 7. Function Gamma Typically, a final error bound is given as a relative error with respect to the absolute value of the final result. However, since the machine accuracy drops rapidly close to zero and infinity, it is convenient to replace the absolute value by the following function. Given ˆ h·i) define a strong representation (R,   if u− /δ ≤ |x| ≤ u+ dxe Γ(x) := u− /δ (7.1) if |x| < u− /δ   ∞ if |x| > u+ , where u− , u+ , δ are given by (6.3), (6.4) and (6.5) respectively. Observe that by (6.6) and (6.8) we get ˆ ¯ Γ(x) ∈ Rbom ∩ R ∀x ∈ R and consequently

( ∀x ∈ Rbom Γ(x) =

∞ x ∨ (u− /δ)

(7.2) if |x| > u+ otherwise.

(7.3)

Rigorous Error Analysis

447

The following proposition summarizes the basic properties of the function Γ. ¯ Then Proposition 7.1. Assume a, b ∈ R. |a| ≤ |b| ⇒ Γ(a) ≤ Γ(b) Γ(hai) ≤ Γ(a)

(7.4) (7.5)

|a| ≤ dae ≤ Γ(a) |a − hai| ≤ δΓ(a)

(7.6) (7.7)

Γ(a ± b) ≤ Γ(2(Γ(a) ∨ Γ(b))) Γ(a × b) ≤ Γ(Γ(a) × Γ(b))

(7.8) (7.9)

Γ(a ∧ b) ≤ Γ(Γ(a) ∨ Γ(b)) Γ(a ∨ b) ≤ Γ(Γ(a) ∨ Γ(b)).

(7.10) (7.11)

Proof. To prove (7.4) first observe that the property is obvious when |b| > u+ or |a| < u− /δ. When u− /δ ≤ |a| ≤ |b| ≤ u+ then the property follows from (5.2). Property (7.5) is evident when |a| > u+ . If |a| < u− /δ then we get from (4.4) that also |hai| < u− /δ, hence Γ(hai) = u− /δ = Γ(a). If u− /δ ≤ |a| ≤ u+ then by Proposition 4.2 u− /δ ≤ |hai| ≤ u+ , i.e. Γ(hai) = dhaie ≤ dae = Γ(a). Property (7.6) is straightforward. To prove (7.7) first observe that the property is obvious when |a| > u+ . If u− ≤ |a| ≤ u+ then it follows from the definition of the strong representation that |a − hai| ≤ δdae ≤ δΓ(a). It remains to consider the case |a| < u− . Then |hai| ≤ u− and Γ(a) = u− /δ. Since ahai ≥ 0, it follows from property (1.1) that |a − hai| ≤ |a| ∨ |hai| ≤ u− = δ uδ− = δΓ(u). This proves property (7.7). To prove (7.8) first observe that the property is obvious if |a±b| < u− /δ. If |a±b| > u+ then either |a| > u+ /2 or |b| > u+ /2. In every case we have 2(Γ(a) ∨ Γ(b)) ≥ 2(|a| ∨ |b|) > u+ , which implies that Γ(2(Γ(a) ∨ Γ(b))) = ∞. Thus it remains to consider the case u− /δ ≤ |a + b| ≤ u+ . Then Γ(a + b) = da + be ≤ 2(dae ∨ dbe) ≤ 2(Γ(a) ∨ Γ(b)) ≤ Γ(2(Γ(a) ∨ Γ(b))). The proof of the remaining properties is similar. 2 8. Polynomial Expressions We want to study the evaluation of polynomials via considering a certain language. For this purpose we fix a countably infinite set of symbols V = {X1 , X2 , . . . }, called ˆ∪ variables, and define a word to be an arbitrary finite sequence of symbols in V ∪ R { ( , ) , +, −, ×, ∧, ∨}. If w is a word then |w| will stand for the length of w, i.e. the number of elements in w. The set of (polynomial) expressions W is defined (for the purpose of this paper) as the smallest set of words satisfying the following two properties ˆ ⇒x∈W x∈V ∪R w1 , w2 ∈ W ⇒ (w1 + w2 ), (w1 − w2 ), (w1 ∧ w2 ), (w1 ∨ w2 ), (w1 × w2 ) ∈ W. The following proposition is an easy exercise. Proposition 8.1. If w ∈ W then exactly one of the following possibilities holds

448

M. Mrozek

(i) w = Xi for some Xi ∈ V ˆ (ii) w = s for some s ∈ R (iii) there exist w1 , w2 ∈ W such that w = (w1 + w2 ) or w = (w1 − w2 ) or w = (w1 × w2 ) or w = (w1 ∧ w2 ) or w = (w1 ∨ w2 ). Assume w is an expression. Let n ∈ N be large enough so that all variables appearing in w are among X1 , . . . , Xn . Let x1 , . . . , xn be real numbers. We define recursively w(x1 , . . . , xn ), the evaluation of w at x1 , . . . , xn by  ˆ  if w = s, where s ∈ R s w(x1 , . . . , xn ) := xi if w = Xi , where Xi ∈ V   (8.1) if w = (w1 ¦ w2 ) w1 (x1 , . . . , xn ) ¦ w2 (x1 , . . . , xn ) where ¦ ∈ {+, −, ×, ∨, ∧}. Similarly we define recursively hwi(x1 , . . . , xn ), the machine evaluation of w at x1 , . . . , xn by  ˆ  if w = s, where s ∈ R s hwi(x1 , . . . , xn ) := hxi i if w = Xi , where Xi ∈ V   if w = (w1 ¦ w2 ). hw1 i(x1 , . . . , xn )h¦ihw2 i(x1 , . . . , xn )

9. Rounding Error Bounds n

¯ →R ¯ representing For a given expression w we define recursively functions Hw∗ , Hw : R bounds on absolute value by Hw∗ (x1 , . . . , xn ) :=  |s|      |xi | Hw∗ 1 (x1 , . . . , xn ) + Hw∗ 2 (x1 , . . . , xn )   H ∗ (x , . . . , xn ) ∨ Hw∗ 2 (x1 , . . . , xn )    w∗ 1 1 Hw1 (x1 , . . . , xn ) × Hw∗ 2 (x1 , . . . , xn )

if if if if if

w w w w w

ˆ = s, where s ∈ R = Xi , where Xi ∈ V = (w1 ± w2 ) = (w1 ∧ w2 ) or w = (w1 ∨ w2 ) = (w1 × w2 )

Hw (x1 , . . . , xn ) :=  Γ(s)      Γ(xi ) Γ(2(Hw1 (x1 , . . . , xn ) ∨ Hw2 (x1 , . . . , xn )))   H   w1 (x1 , . . . , xn ) ∨ Hw2 (x1 , . . . , xn )  Γ(Hw1 (x1 , . . . , xn ) × Hw2 (x1 , . . . , xn ))

if if if if if

w w w w w

ˆ = s, where s ∈ R = Xi , where Xi ∈ V = (w1 ± w2 ) = (w1 ∧ w2 ) or w = (w1 ∨ w2 ) = (w1 × w2 )

¯ n. for (x1 , . . . , xn ) ∈ R ¯n → R ¯ representing the relative evaluation error Similarly we define a function ²w : R by ²w (η1 , . . . , ηn ) :=

Rigorous Error Analysis

 0      ηi ²

(η ,... ,η )

²

(η ,... ,η )

w1 1 n + w2 12 n + δ 2      ²w1 (η1 , . . . , ηn ) ∨ ²w2 (η1 , . . . , ηn ) ²w1 (η1 , . . . , ηn ) + ²w2 (η1 , . . . , ηn ) + δ

if if if if if

w w w w w

449

ˆ = s, where s ∈ R = Xi , where Xi ∈ V = (w1 ± w2 ) = (w1 ∧ w2 ) or w = (w1 ∨ w2 ) = (w1 × w2 )

¯ n. for (η1 , . . . , ηn ) ∈ R ˆ for Remark 9.1. It follows immediately from (7.2) that Hw (x1 , . . . , xn ) ∈ Rbom ∩ R n ¯ (x1 , . . . , xn ) ∈ R . Thus the simplified formula (7.3) may be used when evaluating Hw if the length of w is greater than 1. We will say that a function ψ(x1 , . . . , xn ) is monotone if |xi | ≤ yi for i = 1, 2, . . . , n implies ψ(x1 , . . . , xn ) ≤ ψ(y1 , . . . , yn ). The following theorem is an extended version of Theorem 2.1 in Section 2. ¯ are such Theorem 9.2. The functions Hw∗ and Hw are monotone and if x1 , . . . , xn ∈ R that |xi | ≤ Ni

for i = 1, . . . , n for i = 1, . . . , n |xi − hxi i| ≤ ηi × Ni ,

(9.1) (9.2)

for some N1 , . . . , Nn , η1 , . . . , ηn ∈ R+ , then |w(x1 , . . . , xn )| ≤ Hw∗ (N1 , . . . , Nn ), |w(x1 , . . . , xn )| ≤ Γ(w(x1 , . . . , xn )) ≤ Hw (N1 , . . . , Nn ), |hwi(x1 , . . . , xn )| ≤ Γ(hwi(x1 , . . . , xn )) ≤ Hw (N1 , . . . , Nn ), |w(x1 , . . . , xn ) − hwi(x1 , . . . , xn )| ≤ ²w (η1 , . . . , ηn ) × Hw (N1 , . . . , Nn ),

(9.3) (9.4) (9.5) (9.6)

Proof. The monotonicity of Hw and Hw∗ is straightforward by an induction argument on the length of w. The first inequality both in (9.4) and in (9.5) follows immediately from (7.6). We will prove the second inequality in (9.4) and in (9.5) by induction on the length of w. Assume first that |w| = 1. Then either w = Xi for some Xi ∈ V or w = s ˆ In the first case we have by (7.5) for some s ∈ R. Γ(hwi(x1 , . . . , xn )) = Γ(hxi i) ≤ Γ(xi ) = Hw (x1 , . . . , xn ) ≤ Hw (N1 , . . . , Nn ) and Γ(w(x1 , . . . , xn )) = Γ(xi ) = Hw (x1 , . . . , xn ) ≤ Hw (N1 , . . . , Nn ). In the second case we have Γ(hwi(x1 , . . . , xn )) = Γ(w(x1 , . . . , xn )) = Γ(s) = Hw (N1 , . . . , Nn ). This completes the proof of (9.4) and (9.5) in case |w| = 1. Thus assume |w| > 1. By Proposition 8.1 w has one of the forms given in (iii) of that proposition. Consider these cases in turn. Assume first that w = (w1 ± w2 ) for some expressions w1 , w2 . Since |w1 |, |w2 | < |w|,

450

M. Mrozek

we have by (6.2), (7.5), (7.8), (7.4) and the induction assumption Γ(hwi(x1 , . . . , xn )) = Γ(hw1 i(x1 , . . . , xn )h±ihw2 i(x1 , . . . , xn )) ≤ Γ(hw1 i(x1 , . . . , xn ) ± hw2 i(x1 , . . . , xn )) ≤ Γ(2(Γ(hw1 i(x1 , . . . , xn )) ∨ Γ(hw2 i(x1 , . . . , xn )))) ≤ Γ(2(Hw1 (N1 , . . . , Nn ) ∨ Hw2 (N1 , . . . , Nn ))) = Hw (N1 , . . . , Nn ) and Γ(w(x1 , . . . , xn )) = Γ(w1 (x1 , . . . , xn ) ± w2 (x1 , . . . , xn )) ≤ Γ(2(Γ(w1 (x1 , . . . , xn )) ∨ Γ(w2 (x1 , . . . , xn )))) ≤ Γ(2(Hw1 (N1 , . . . , Nn ) ∨ Hw2 (N1 , . . . , Nn ))) = Hw (N1 , . . . , Nn ), which shows that (9.4) and (9.5) are satisfied in case w = (w1 ± w2 ). Assume in turn that w = (w1 ×w2 ) for some expressions w1 , w2 . Again we have by (6.2), (7.5), (7.9), (7.4) and the induction assumption Γ(hwi(x1 , . . . , xn )) = Γ(hw1 i(x1 , . . . , xn )h×ihw2 i(x1 , . . . , xn )) ≤ Γ(hw1 i(x1 , . . . , xn ) × hw2 i(x1 , . . . , xn )) ≤ Γ(Γ(hw1 i(x1 , . . . , xn )) × Γ(hw2 i(x1 , . . . , xn ))) ≤ Γ(Hw1 (N1 , . . . , Nn ) × Hw2 (N1 , . . . , Nn )) = Hw (N1 , . . . , Nn ) and Γ(w(x1 , . . . , xn )) = Γ(w1 (x1 , . . . , xn ) × w2 (x1 , . . . , xn )) ≤ Γ(Γ(w1 (x1 , . . . , xn )) × Γ(w2 (x1 , . . . , xn ))) ≤ Γ(Hw1 (N1 , . . . , Nn ) × Hw2 (N1 , . . . , Nn )) = Hw (N1 , . . . , Nn ), which shows that (9.4) and (9.5) are satisfied in case w = (w1 × w2 ). It remains to consider the case w = (w1 ¦ w2 ) with ¦ ∈ {∧, ∨}. We have then by (7.5), (7.10), (7.11), (7.4) and the induction assumption that Γ(hwi(x1 , . . . , xn )) = Γ(hw1 i(x1 , . . . , xn ) ¦ hw2 i(x1 , . . . , xn )) ≤ Γ(hw1 i(x1 , . . . , xn )) ∨ Γ(hw2 i(x1 , . . . , xn )) ≤ Hw1 (N1 , . . . , Nn ) ∨ Hw2 (N1 , . . . , Nn ) = Hw (N1 , . . . , Nn ) and Γ(w(x1 , . . . , xn )) = Γ(w1 (x1 , . . . , xn ) ¦ w2 (x1 , . . . , xn )) ≤ Γ(w1 (x1 , . . . , xn )) ∨ Γ(w2 (x1 , . . . , xn )) ≤ Hw1 (N1 , . . . , Nn ) ∨ Hw2 (N1 , . . . , Nn ) = Hw (N1 , . . . , Nn ). This completes the proof of (9.4) and (9.5). The proof of (9.3) goes along the same lines as the proof of (9.4), hence we leave it as an exercise. The proof of (9.6) is again by induction on the length of w. Assume that |w| = 1. If w = Xi for some Xi ∈ V then |w(x1 , . . . , xn ) − hwi(x1 , . . . , xn )| = |xi − hxi i| ≤ ηi Ni = ²w (η1 , . . . , ηn )Hw (N1 , . . . , Nn ),

Rigorous Error Analysis

451

ˆ then whereas if w = s for some s ∈ R |w(x1 , . . . , xn ) − hwi(x1 , . . . , xn )| = |s − hsi| = 0 = ²w (η1 , . . . , ηn )Hw (N1 , . . . , Nn ), which proves (9.6) in case |w| = 1. Assume |w| > 1. We again consider various forms of w. Consider first the case w = (w1 ± w2 ). It follows from (7.6), (7.7), (7.8), (7.4) and the induction assumption that |w(x1 , . . . , xn ) − hwi(x1 , . . . , xn )| ≤ |(w1 (x1 , . . . , xn ) ± w2 (x1 , . . . , xn )) − (hw1 i(x1 , . . . , xn ) ± hw2 i(x1 , . . . , xn )) +(hw1 i(x1 , . . . , xn ) ± hw2 i(x1 , . . . , xn )) − hw1 i(x1 , . . . , xn )h±ihw2 i(x1 , . . . , xn )| ≤ |w1 (x1 , . . . , xn ) − hw1 i(x1 , . . . , xn )| + |w2 (x1 , . . . , xn ) − hw2 i(x1 , . . . , xn )| +|(hw1 i(x1 , . . . , xn ) ± hw2 i(x1 , . . . , xn )) − hhw1 i(x1 , . . . , xn ) ± hw2 i(x1 , . . . , xn )i| ≤ ²w1 (η1 , . . . , ηn )Hw1 (N1 , . . . , Nn ) + ²w2 (η1 , . . . , ηn )Hw2 (N1 , . . . , Nn ) +δΓ(hw1 i(x1 , . . . , xn ) ± hw2 i(x1 , . . . , xn )) ≤ (²w1 (η1 , . . . , ηn ) + ²w2 (η1 , . . . , ηn )) × (Hw1 (N1 , . . . , Nn ) ∨ Hw2 (N1 , . . . , Nn )) +δΓ(2(Γ(hw1 i(x1 , . . . , xn )) ∨ Γ(hw2 i(x1 , . . . , xn )))) ²w (η1 , . . . , ηn ) ²w2 (η1 , . . . , ηn ) ≤( 1 + + δ)Γ(2(Hw1 (N1 , . . . , Nn ) ∨ Hw2 (N1 , . . . , Nn ))) 2 2 = ²w (η1 , . . . , ηn )Hw (N1 , . . . , Nn ). This proves (9.6) in case w = (w1 ± w2 ). Assume in turn that w = (w1 × w2 ). Then by (7.6), (7.7), (7.9), (7.4), Properties (9.4), (9.5) and the induction assumption we get |w(x1 , . . . , xn ) − hwi(x1 , . . . , xn )| ≤ |w1 (x1 , . . . , xn ) × w2 (x1 , . . . , xn ) − hw1 i(x1 , . . . , xn ) × w2 (x1 , . . . , xn )| +|hw1 i(x1 , . . . , xn ) × w2 (x1 , . . . , xn ) − hw1 i(x1 , . . . , xn ) × hw2 i(x1 , . . . , xn )| +|hw1 i(x1 , . . . , xn ) × w2 |(x1 , . . . , xn ) + hhw1 i(x1 , . . . , xn )h×ihw2 i(x1 , . . . , xn )i ≤ |w1 (x1 , . . . , xn ) − hw1 i(x1 , . . . , xn )| × |w2 (x1 , . . . , xn )| +|hw1 i(x1 , . . . , xn )| × |w2 (x1 , . . . , xn ) − hw2 i(x1 , . . . , xn )| +δ × Γ(hw1 i(x1 , . . . , xn ) × hw2 i(x1 , . . . , xn )) ≤ ²w1 (η1 , . . . , ηn ) × Hw1 (N1 , . . . , Nn ) × Hw2 (N1 , . . . , Nn ) +Hw1 (N1 , . . . , Nn ) × ²w2 (η1 , . . . , ηn ) × Hw2 (N1 , . . . , Nn ) +δ × Γ(Γ(hw1 i(x1 , . . . , xn )) × Γ(hw2 i(x1 , . . . , xn ))) ≤ (²w1 (η1 , . . . , ηn ) + ²w2 (η1 , . . . , ηn )) × Hw1 (N1 , . . . , Nn ) × Hw2 (N1 , . . . , Nn ) +δ × Γ(Γ(Hw1 (N1 , . . . , Nn ) × Hw2 (N1 , . . . , Nn )) ≤ (²w1 (η1 , . . . , ηn ) + ²w2 (η1 , . . . , ηn ) + δ) × Γ(Hw1 (N1 , . . . , Nn ) × Hw2 (N1 , . . . , Nn )) = ²w (η1 , . . . , ηn ) × Hw (N1 , . . . , Nn ). It remains to consider the case w = (w1 ¦ w2 ), where ¦ ∈ {∧, ∨}. We have by (1.2),

452

M. Mrozek

(1.3) and the induction assumption |w(x1 , . . . , xn ) − hwi(x1 , . . . , xn )| = |w1 (x1 , . . . , xn ) ¦ w2 (x1 , . . . , xn ) − hw1 i(x1 , . . . , xn ) ¦ hw2 i(x1 , . . . , xn )| ≤ |w1 (x1 , . . . , xn ) − hw1 i(x1 , . . . , xn )| ∨ |w2 (x1 , . . . , xn ) − hw2 i(x1 , . . . , xn )| ≤ ²w1 (η1 , . . . , ηn ) × Hw1 (N1 , . . . , Nn ) ∨ ²w2 (η1 , . . . , ηn ) × Hw2 (N1 , . . . , Nn ) ≤ (²w1 ∨ ²w2 ) × (Hw2 (N1 , . . . , Nn ) ∨ Hw2 (N1 , . . . , Nn )) = ²w (η1 , . . . , ηn ) × Hw (N1 , . . . , Nn ). This completes the proof of (9.6). 2 Proof of Theorem 2.1.

Theorem 2.1 is an immediate consequence of Theorem 9.2. 10. Truncation Error Bounds

Proof of Theorem 2.4. First observe that by Definition (2.9), Property (9.4) and the monotonicity of H we have for x ∈ Zh ∗ ∗ ¯ max G(th, (x) ≤ HG (xZ ). x) = max G(ϕ(th, x)) ≤ max G(x) ≤ max HG

t∈[0,1]

x∈Z

t∈[0,1]

x∈Z

Similarly ∗ ∗ max Ki (th, x) ≤ max HK (th, x) ≤ HK (h, xZ ). i i

t∈[0,1]

t∈[0,1]

Hence 1 1 X ∗ ∗ HG (xZ ) + κi HK (h, xZ ). i 120 144 i−0 3

C(h, x) ≤

(10.1)

Now from property (9.6) we get kϕ(h, x) − hΦi(h, x)ksup ≤ kϕ(h, x) − Φ(h, x)ksup + kΦ(h, x) − hΦi(h, x)ksup ≤ C(h, x)h5 + ²Φ (0)HΦ (h, x) 3 h 1 i 1 X ∗ ∗ Z HG ≤ (h, xZ ) + κi H K (h, x ) h5 + ²Φ (0)HΦ (h, xZ ) i 120 144 i−0 = EZ (h). 2 Acknowledgements The author would like to express his gratitude to the two anonymous referees for providing several very helpful comments which substantially improved the presentation of the paper. Special thanks are to one of the referees, who suggested many improvements to the code presented in the Appendix. References Alefeld, G., Herzberger, J. (1983). Introduction to Interval Computation, Academic Press. ANSI/IEEE (1985). The IEEE Standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std 754.

Rigorous Error Analysis

453

Cody, W.J. (1988). Floating Point Standards—Theory and practice, In: Moore, R.E. (ed), Reliability in Computing, Academic Press, 99–107. Bieberbach, L. (1923). Theorie der Differentialgleichungen, Grundlehren Bd.VI, Springer-Verlag. Bieberbach, L. (1951). On the remainder of the Runge–Kutta formula in the theory of ordinary differential equations, Zeitschr. angew. Math. Phys. (ZAMP), 2, 233–248. Eckmann, J.P, Koch, H., Wittwer, P. (1984). A computer-assisted proof of universality for area-preserving maps, Memoirs of the American Mathematical Society 47, 1–121. Hairer, E., Nørsett, S.P., Wanner, G. (1987). Solving Ordinary Differential Equations I, Nonstiff Problems, Berlin: Springer-Verlag. Hartman, Ph. (1964). Ordinary Differential Equations, New York: John Wiley & Sons Inc. Hassard, B., Hastings, S., Troy, W., Zhang, J. (1994). A computer proof that the Lorenz equations have “chaotic” solutions, Appl. Math. Letter, 7, 79–83. Hastings, S.P., Troy, W.C. (1992). A shooting approach to the Lorenz equations, Bulletin (New Series) of the American Mathematical Society 27, 298–303. IEEE (1987). A Radix-Independent Standard of Floating-Point Arithmetic, IEEE Std 854-1987, New York: IEEE. Keiper, J. (1993). Interval arithmetic in Mathematica, Interval Computations, 3, 76–87. Koch, H., Schenkel, A., Wittwer P. (1995). Computer assisted proofs in analysis and programming in logic: a case study, Universite de Geneve, preprint. Lanford, O.E. (1982). Computer assisted proof of the Feigenbaum Conjectures, Bull. AMS(New Series), 6, 427–434. Mischaikow, K., Mrozek, M. (1995a). Chaos in Lorenz equations: a computer assisted proof, Bull. AMS, 32, 66-72. Mischaikow K., Mrozek M. (1995b). Chaos in Lorenz equations: a computer assisted proof, Part II: Details, preprint. Mischaikow K., Mrozek M., Szymczak A. Chaos in Lorenz equations: a computer assisted proof. Part III: The classical case, in preparation. Moore, R.E. (1966). Interval Analysis, Englewood Cliffs, NJ: Prentice-Hall. Mrozek, M. (1996). Topological invariants, multivalued maps and computer assisted proofs, Computers & Mathematics, 32 82–104. Rage, T., Neumaier, A., Schlier, C. (1994). Rigorous verification of chaos in a molecular model, Physical Rev. E, 50 2682–2688. Rump, S.M. (1988). Algorithms for verified inclusions—Theory and practice, In: Moore, R.E. (ed), Reliability in Computing, Academic Press, 109–126. Zgliczy´ nski, P., (1995). A computer assisted proof of chaos in the R¨ ossler equations and the H´enon map, preprint.

Appendix: a Sample Implementation The contents of this appendix set with the typewriter font in the display style constitutes a program in MATHEMATICA that finds the formula for the total error bound etot (Z, h) given by the formula (2.13). The program can be easily implemented in many other languages for symbolic computations, for example MAPLE. The program consists of three packages. The first one is auxiliary: since the function Γ (see (7.1)) involves maximum and minimum operations, we would like to have them evaluated whenever possible. BeginPackage["MnMxEstimates‘"];

Note that if a lower bound for an expression a is known to be greater than or equal to an upper bound of an expression b then a ∨ b = a. For instance, if a variable x is assumed to satisfy the inequalities 1 ≤ x ≤ 2 then 2x − 1 ≤ 3 ≤ x + 2 and consequently (2x − 1) ∨ (x + 2) = x + 2. The following two functions, implemented recursively in the body of the package, serve such calculations. They are used externally only to introduce initial bounds for variables. MinEst::usage = "MinEst[a]=mina means: a is never smaller than mina"; MaxEst::usage = "MaxEst[a]=maxa means: a is never greater than maxa";

454

M. Mrozek

Here are our minimum and maximum functions Mn::usage = "Mn[a,b] returns minimum of a,b"; Mx::usage = "Mx[a,b] returns maximum of a,b";

The following two functions are auxiliary. They are used to extract the left (bottom) and right (top) endpoint of an interval. TopBound::usage = "TopBound[a_Interval] returns the upper bound of an interval a"; BotBound::usage = "BotBound[a_Interval] returns the lower bound of an interval a";

Here is the main body of the package. Begin["Private‘"]; BotBound[Interval[{a_,_}]]:=a; TopBound[Interval[{_,b_}]]:=b; (* Minimal and Maximal Estimates used by Mx,Mn (max,min) functions *) MinEst[a_ ? NumberQ]:=a; MinEst[a_Times]:=Map[ MinEst, a ]; MinEst[a_^k_]:=MinEst[a]^k; MinEst[Mx[a_,b_]]:=Max[MinEst[a],MinEst[b]]; MaxEst[a_ ? NumberQ]:=a; MaxEst[a_Times]:=Map[ MaxEst, a ]; MaxEst[a_^k_]:=MaxEst[a]^k; MaxEst[Mx[a_,b_]]:=Max[MaxEst[a],MaxEst[b]]; (* Mx (max) and Mn (min) functions *) SetAttributes[Mx,Orderless] Mx[a_ ? NumberQ,b_ ? NumberQ]:=If[a>b,a,b]; Mx[a_,b_]:=a /; MinEst[a]>=MaxEst[b]; Mx[x_ /; Head[x]=!=List]:=x; Mx[{a_}]:=a Mx[l_List] := Fold[ Mx, First[#], Rest[#] ]& @ Reverse[l]; Mx[a_^k_. x_.,a_^l_. y_.]:= With[{m=Min[k,l]}, a^m Mx[a^(k-m) x, a^(l-m) y] ]; Mx[a_,a_]:=a; Mx[a_,Mx[a_,b_]]:=Mx[a,b]; Mx[Mx[a_,b_],Mx[a_,c_]]:=Mx[a,Mx[b,c]]; Mx[a_Interval,b_Interval]:= Interval[{Mx[BotBound[a],BotBound[b]],Mx[TopBound[a],TopBound[b]]}]; SetAttributes[Mn,Orderless] Mn[a_ ? NumberQ,b_ ? NumberQ]:=If[a=MaxEst[a] Mn[x_ /; Head[x]=!=List]:=x; Mn[{a_}]:=a Mn[l_List] := Fold[ Mn, First[#], Rest[#] ]& @ Reverse[l]; Mn[a_,a_]:=a;

Rigorous Error Analysis

455

Mn[a_^k_. x_.,a_^l_. y_.]:= With[{m=Min[k,l]}, a^m Mn[a^(k-m) x, a^(l-m) y] ]; Mn[a_,Mn[a_,b_]]:=Mn[a,b]; Mn[Mn[a_,b_],Mn[a_,c_]]:=Mn[a,Mn[b,c]]; Mn[a_Interval,b_Interval]:= Interval[{Mn[BotBound[a],BotBound[b]],Mn[TopBound[a],TopBound[b]]}]; End[]; EndPackage[];

The next package evaluates the estimate of the rounding deviation given by formula (2.4). BeginPackage["RoundingError‘",{"MnMxEstimates‘"}]; RoundingErrorBound::usage = "RoundingErrorBound[a] returns an upper estimate of |a-|, where a is a polynomial expression and is its machine evaluation";

The following five constants are set to satisfy the 64-bit IEEE standards. They correspond to the constants D, u− , u+ , δ, u− /δ (see Sections 4 and 6). MantissaLength=52; MinReprBom=2^-1022; MaxReprBom=2^1023; Begin["Private‘"]; delta=2^-(MantissaLength+2); MinGamma=MinReprBom/delta;

Next come the other definitions introduced in Sections 4 and 5. (* binary integer logarithm *) BinaryLogarithm[0]=-Infinity; BinaryLogarithm[x_ ? NumberQ]:=Module[{n,y},n=0;y=Abs[x]; If[x>=1,While[y>=2,y=y/2;n=n+1;],While[y<1,y=2y;n=n-1]]; n]; (*MantissaLength-digit mantissa of x *) MachineMantissa[0]=0; MachineMantissa[x_ ? NumberQ]:= 2^(-MantissaLength) Floor[2^MantissaLength Abs[x]/(2^BinaryLogarithm[x])+1/2]; (* machine representation of x*) Representation[x_ ? NumberQ] := Sign[x] MachineMantissa[x] 2^BinaryLogarithm[x]; ExactNumberQ[x_ ? NumberQ] := x==Representation[x]; (* upper binary order of magnitude *) UpperBom[x_ ? NumberQ]:= 2^(1+BinaryLogarithm[x])

Now we define the function Γ. We assume the simplified Definition (7.3) (see Remark 9.1). This means that a suitable modification must be made in the definition of Hw for expressions w of the length 1.

456

M. Mrozek

(* Function Gamma *) Gam[x_]:=Infinity /. x>MaxReprBom; Gam[x_]:=Mx[MinGamma,x] /. x<=MaxReprBom;

Now come the definitions of Hw∗ , Hw , ²w . (* Hstar[x] produces an upper estimate of the absolute value of an expression x *) Hstar[n_ ? NumberQ]:=Abs[n]; Hstar[x_Symbol]:=x; Hstar[x:(_Plus|_Times|_Mx)]:=Map[Hstar,x]; Hstar[a_^n_]:=Hstar[a]^n; Hstar[x_Mn]:= Apply[Mx,Map[Hstar,x]]; (* recursive definitions of the functions H and Epsilon, which give an estimate of the difference between the exact value of an expression x and the result of its evaluation in floating point arithmetic x^ by |x-x^|<=Epsilon(x)H(x) *) H[n_ ? NumberQ]:=Gam[UpperBom[n]]; H[x_Symbol]:=x; H[x_Plus]:= With[{args = Map[ H, Reverse[Apply[List,x]] ]}, Fold[ Gam[2*Mx[#1,#2]]&, First[args], Rest[args] ] ]; H[x:(_Times|_Mx)]:=Gam[ Map[ H, x ] ]; H[x_^n_]:= Fold[ Gam[#1 #2]& , H[x], Table[H[x],{n-1}]]; H[x_Mn]:= Apply[Mx,Map[H,x]]; Epsilon[n_ ? NumberQ]:=If[ExactNumberQ[n],0,delta]; Epsilon[_Symbol]:=0; Epsilon[x_Plus]:= With[{args = Map[ Epsilon, Reverse[Apply[List,x]] ]}, Fold[ (#1+#2)/2+delta&,First[args], Rest[args] ] ]; Epsilon[x_Times]:= Apply[Plus,Map[Epsilon,Apply[List,x]]]+(Length[x]-1) delta; Epsilon[x_^n_]:=n Epsilon[x] + (n-1) delta; Epsilon[x_Mx]:=Map[Epsilon,x]; Epsilon[x_Mn]:=Apply[Mx,Map[Epsilon,x]]; RoundingErrorBound[a_]:=(Epsilon[#]H[#])& @ Mx[a]; End[]; EndPackage[];

The last package evaluates the estimate of one step of the standard fourth order Runge– Kutta method given by formula (2.13) BeginPackage["RungeKuttaError‘",{"MnMxEstimates‘","RoundingError‘"}]; RungeKutta4TotalErrorBound::usage = "RungeKutta4TotalErrorBound[f,h,x] returns an upper estimate of |phi[f,h,x]-[f,h,x]|, where where f is a polynomial vector field, h is a scalar, x is the argument of f, phi[f,h,x] denotes the time t translation of x along the flow generated by the vector field f and [f,x,h] stands for the machine evaluation of one step of

Rigorous Error Analysis

457

the corresponding standard fourth order Runge--Kutta method"; RungeKutta4RoundingErrorBound::usage = "RungeKutta4RoundingErrorBound[f,h,x] returns an upper estimate of |rk[f,h,x]-[f,h,x]|, where where f is a polynomial vector field, h is a scalar, x is the argument of f, rk[f,h,x] denotes the the corresponding standard fourth order Runge--Kutta method and [f,x,h] stands for machine evaluation of one step of rk[f,h,x]"; RungeKutta4TruncationErrorBound::usage = "RungeKutta4TruncationErrorBound[f,h,x] returns an upper estimate of |phi[f,h,x]-rk[f,h,x]|, where where f is a polynomial, h is a scalar, x is the argument of f, phi[f,h,x] denotes the time t translation of x along the flow generated by the vector field f and rk[f,x,h] stands for one step of the corresponding standard fourth order Runge--Kutta method";

The following two functions are auxiliary. PolynomialDegree::usage = "PolynomialDegree[f] returns the degree of a polynomial f"; MxExpand::usage = "MxExpand is analogous to the Expand function but works also with the Mx function"; Begin["Private‘"];

Here is the implementation of the fourth order Runge–Kutta method. (* the fourth order Runge--Kutta method *) RungeKutta4[f_,h_,x_]:= Module[{k1,k2,k3}, k1=f[x + 1/2 h f[x]]; k2=f[x + 1/2 h k1]; k3=f[x + h k2]; x + (1/6)h(f[x] + 2 k1 + 2 k2 + k3) ];

We also need separate formulae for the functions k0 , k1 , k2 , k3 . (* The k0,k1,k2,k3 components of the standard 4th order Runge--Kutta method; needed in the estimate of the error coefficient *) RungeKutta40[f_,h_,x_]:=f[x]; RungeKutta41[f_,h_,x_]:=f[x + 1/2 h f[x]]; RungeKutta42[f_,h_,x_]:=f[x + 1/2 h RungeKutta41[f,h,x]]; RungeKutta43[f_,h_,x_]:=f[x + h RungeKutta42[f,h,x]];

Next we implement the functions Vn (see Section 2). (* Find flow derivative in terms of the vector field *) FlowDerivative[1,f_,args_]:=f[args]; FlowDerivative[n_Integer, f_, x:(_Symbol|_List)]:= If[Head[x]===Symbol,D[f[x],x]f[x], Outer[D, FlowDerivative[n-1,f,x],x].f[x]];

458

M. Mrozek

Now we are ready to implement formulae (2.8), (2.10), (2.11) and (2.12). (* The error estimate of the fourth order Runge--Kutta method *) RungeKutta4ErrorCoefficient[f_,h_,x_]:= 1/24(1/3 D[RungeKutta42[f,h,x],{h,4}] + 1/6D[RungeKutta43[f,h,x],{h,4}])+ 1/120 FlowDerivative[5,f,x]; RungeKutta4ErrorCoefficientBound[f_,h_,x_]:= Hstar[Mx[RungeKutta4ErrorCoefficient[f,h,x]]]; RungeKutta4TruncationErrorBound[f_,h_,x_] := RungeKutta4ErrorCoefficientBound[f,h,x] h^5; (* The error estimate including rounding errors *) RungeKutta4RoundingErrorBound[f_,h_,x_] := RoundingErrorBound[RungeKutta4[f,h,x]]; RungeKutta4TotalErrorBound[f_,h_,x_] := RungeKutta4RoundingErrorBound[f,h,x]+ RungeKutta4TruncationErrorBound[f,h,x];

There remain two auxiliary functions. The first is used to find the degree of a polynomial. The other is used to simplify the form of results. (* An auxiliary function computing the degree of a polynomial *) SetAttributes[PolynomialDegree,Listable]; PolynomialDegree[_ ? NumberQ]:=0; PolynomialDegree[x_Symbol]:=1; PolynomialDegree[x:(_Plus|_Mx|_Mn)]:=Mx[PolynomialDegree[Apply[List,x]]]; PolynomialDegree[x_Times]:=Apply[ Plus, PolynomialDegree[ Apply[ List, x] ] ]; PolynomialDegree[x_^n_]:=n PolynomialDegree[x]; (* Another auxiliary function it is similar to Expand but works also with the Mx function *) GoDown[x_ ? NumberQ]:=x; GoDown[x_Symbol]:=x; GoDown[x_Mx]:=Map[MxExpand,x]; GoDown[x:(_Plus|_Times)]:=Map[GoDown,x]; GoDown[x_^n_]:=GoDown[x]^n; MxExpand[x_ /; Head[x]=!=Mx]:=GoDown[Expand[x]]; MxExpand[x_Mx]:=Map[MxExpand,x]; End[]; EndPackage[];