TRE ASYMPTOTIC DISTRIBUTION OF ROUNDING OFF ERRORS IN LINEAR TRANSFORMATIONS* V. V.
VOEVODIN ?~oscow 9 December
(Received
1966)
1. LET G be a simple-connected convex closed region in real n-dimensional sperce f?. The vectors z of G are assumed to be random quantities, the distribution density of which is a continuous function D(Z), where ?(z) > c > 0. Let a sequence of transformations on the vectors Z, the transformation
:]I, I/,, . . ., ‘!h, matrices being
k < ;I be
performed
r= 1,2,..., k.
U, = E - atblr,
Here a, and b, are n x 1 matrices (vector-columns). Most direct and iterative methods of solving problems of linear algebra are based on such transformations. Let zr denote the vector format ions: then
obtained
from z c
C after
the first
zr--i - (b,, z+i)a,.
Zr =
r trans-
(2)
We cannot in fact perform these transformations, since all the computations have to be realized to a finite number of places. Suppose the computations are performed to t binary places after the point, while intermediate results having more than t places are rounded off according to themselves will be dethe usual rule (see below). The transformations fined by the matrices (t)=
u
E
_
where the vectors *
Zh.
vychis
1.
a,
ct ),
b, ct
Mat.
mat.
Fiz.
b(‘)
d”
r
r
T
) are obtained 7,
5,
3
965
7
from a, and b,
- 976,
1967.
by rounding
4
V.V.
Voevodin
off their coordinates to t places. In many problems of linear algebra, e.g. when solving sets of linear algebraic equations, it is unimportant whether the transformations (2) be performed with the vectors a,, b, or c_x~(~), b,.(t), the only thing that matters is that the actual transformation be performed fairly accurately. It is therefore assumed in future that the vectors z,. are computed from
(3) But even in this case, only the vector zr (t) is actually where
computed,
and E,.(~) is the error introduced at the r-th step by the inaccurate realization of (3). We have
zr
= ...=
[J(l) u’ft T 7
1..
.
...
UP(2 + 2 + u,‘“’-’ +
. * *
j-p-
El
(0
LI;(‘r’. *, $r*
4
+
up SF> )
A
82
(0
+.
. .
,
Here su(t) is the error obtained from rounding off the components of z to t places after the point. From the last formula, zp ft ) can be regarded as the result of accurate tr~sfo~ation of the vector in the brackets. The latter differs from the given vector z by
which, following ElII will be called the equivalent perturbation. Major& inequalities for the equivalent perturbation are examined in [ll . In the present paper nr tt ) is investigated as a function of the random argument t. In addition, we differ from [tl in examining only the effect of the error on the transformation equivalence, so that errors in computing the vectors +(tf, b,(t) can be ignored, and as a result, rather better inequalities obtained. It is shown below that, as t + co, the principal term of nr (t) is the sum of independent uniformly distributed quantities. Bounds are obtained for the principal term and the terms of higher order.
Rounding
off
errors
in Einear
2. Suppose that, when realizing (3), to double the number of places: then
5
transformations
the scalar
product is computed
where 5, ct ) is the error from rounding off the scalar product (!I?!;, z:!.~), and (srft ) the error from rounding off the product of the rounded scalar product and the vector a, (t 1. Obviously,
‘Ibe second inequality means that none of the moduli of the components of cryct ) is greater than 52-t. From (5) and (6),
where
The error distribution for the main numerical. methods is shown below to be determined by the function vr tt ). We therefore examine v,(t) as a function of the random argument z, and obtain an upper bound for K, (t 1. With a, ft)
fixed.
prod&$ (b’“” cit r , z,._&
the error u, tt) depends only on the rounded scalar Since the vectors subjected to tr~sfo~ation
are
much more numerous than the possible rounded off values of (@,
z$!r),
a certain number of vectors z!?, will necessarily ur (t ) at the r -th transformat ion.
have the same error
Let G(t) denote the set of vectors into which vectors of G transform after rounding off their components to t places, Obviously, G(t) C G, except for points close to the boundary. Divide Jft) into non-intersecting subsets
G&...Rr,
so that each only contains vectors of Z(t)
which have the same rounded off scalar products, namely, k,, kz, . . . , k, during all transfo~ations from the 1st to the r-th. The subsets
G&.-A,
will be termed bundles of the r-th order. From
6
V.V.
Voevodin
what has been said, at least some of the bundles consist of more than all the vectors belonging to a given one vector. From the definition, r-th order bundle will yield the same error v,(*). In addition, clearly
Our immediate tasks are to examine the errors which a bundle carries, to investigate a bundle structure, and to evaluate the probability of coming within a bundle. We define
the deviation
of a number x from the nearest
wt
LrT= where Ix)
denotes
the fractional
number x rounded off {
1,
L
1,
(
(4 -
component of which is obtained of the initial vector. It is easily
(5) <
if
{x> b %,
If
a vector
l/2,
in a sign
is contained
of the operation by applying
by
By we denote the
part of x [21.
to t places.
> , the result
1,
if
integer
vector,
is another
the operation
each
to the component
shown that
(a> = a -
2-fL2’a7,
so that (0
csr = The deviation
-
from the nearest
of the fractional
part,
while
2-q__ 2t ( integer
1 be the rounded off
error
of.‘)
will
clearly
i.e.
all
The layer angles
the vectors thickness
to b,(t).
scalar
be contained
*/,2-t
l-
<
belonging is
2-‘~#‘~’
is a piecewise
linear
the number 2f((b(,‘ls$z1 )>
Investigation of the error carried ing the distribution of fractional runs over a sequence of integers. Let
(b!‘),&)> a?’1. function
is an integer.
by a bundle thus amounts to examinparts of the vectors ka,(t ), where k
product in all
(by’ , &)
for the
< I+
the vector &,
z,_x. The same ^(*)
for which
1/22-t,
to a layer with one boundary excluded. while
its
plane
faces
are at right
Rounding
off
errors
in
linear
transformations
7
whence the transformation of vectors belonging to a given bundle amounts merely to the operation of translation, since the vector -
&)
((b!“,
) af” + $)
This means that region lar
all
will
be constant
the vectors
for
bl’,
nesses
(0 ,. . . , by) bz
vectors
non-intersecting,
and
U
respectively
of two distinct
G,ka...kf-
r+i
ha
=
all
the vectors
of the region
transform
to t places
after
in the bundle
belonging
to the coordinateexes,
boundaries,
into
the point.
Gk$!*...k
perpendicu-
None of the regions
bundles,
all
u Gk, =
..k r)
to a
these
regions
are
G.
kl
Denote by E, the set of vectors parallel
belong
and having the thick-
.
kr+i
and sides
of the bundle.
of S with r layers
2-tII bi(” IIs-’ & 2-t)I bf’ IjE’, . . . ,2-‘11 br(‘)[ii
Gk,k2..:kr contains
vectors
G&..:k,
of the bundle
Gk,k,..:kTI which is the intersection
to the vectors
all
f
G n E,,
the vector
to the cube with centre equal to 2-t.
If zc
with the exception z after
The probability
rounding
z
GW,
of certain
their
components
p(z c G$z...kr)
of z being
is
P(z C
Gk:!~.:.kr) = 1 P(W, G(Jd A,Al...Ar
where (E)
U
Gk,k,.:.kr= z
c
G(‘) A,Ao...Ar
(GnEz).
3. Some propositions from the theory of numbers will what follows. Given [21 a finite number of vectors z (9), components Zj ‘9’, distributed in the unit cube, i.e.
Let a and P be n-dimensional
vectors
be needed for 1~ q ~2, with
with components oj and Sj,
8
V.V.
Voevodin
G < aj < Pj < 1. Denote by F(a, in the parall elepiped
p) the number of vectors
Zj”’
CZj <
of volume ll(Pj - oj).
is called
<
z(q),
lying
pj
Then,
the deviation
Given an infinite first (2 vectors. If
of the vectors
vector
sequence,
z(q). let DQ denote
the deviation
of
its
limDq=O, Q--
the sequence unit cube.
will
of z(g)
be said
to be uniformly
distributed
in the
From bl, a sequence of vectors {k(3), k = 1, 2, . . . , is uniformly distributed provided the components of the vector 8 are rationally independent together with the number 1. Take a sequence of vectors G,, p = 1, 2, . . . . convergent
to 8, and construct
vectors && + Y), { (kP + WP 0, the deviation of this system. ‘%eorem
for
each p a set of
+ ~1,. . . , { UG + Wb Then,
+ ~1.
Denote by
1
If the vector
8 has components rationally
independent
together
with
t e number 1, and lim I, =
lim 0 p = 0,
00,
P-+-J
P--J
then IimD, P+-
uniformly
with respect
to k,
regions
lemma establishes
‘%,k*.:.k, @es
0
and y.
We shall not dwell on the proof no particular immediate interest. The following
=
since
it
the ratio
is very laborious
and is of
between the volumes of the
&,k*...k ,,) and &!$*...k, (IUS ($2 ,...,+.
~ound~~~
off
errors
in
linear
9
transformations
For almost all vectors bl, bz, . . ., hr. mes Gk,k,.:.k_
the conversance
r =
being uniform with respect to almost all k,,
ks,
. . . , k,.
The underwing idea of the proof can be seen by taking the ease IZ= 2, 1, to which we shall confine ourselves for simplicity.
It can be assumed without loss of generality that the elements of t?(t) are the base-points of a mesh with unity interval. To prove the lemma, it only needs to be shown that the number of elements of -i(t) occurring in a layer is asymptotically equal to the volume of the layer. All the points of Zct) on a given straight line will be called a ray. The number of points in a layer is equal to the number of its projections (allowing for multiplicity~ on the vector blct 1, which enter into the projection of the layer itself on to this vector. The projection of a layer is a semi-interval
of length l]bj”’//il.
Project the points of a ray of Gctf on to the vector bl(t). These projections form on bl(t) a uniform mesh with interval cos cyct), where q(t) is the angle between the ray and bl(t ). Projection of the points of the adjacent rey is equivalent to a shift through sin act) of the first mesh leftwards or rightwards. If the deviation of sin q(t )/cos g(t ) is convergent to an irrational number (which is defined solely by the vector blct 11, then, by Theorem 1, the number of points belonging to any F adjacent rays and the projections of which on to b,(t) come within a fixed segment, is asymptotically pro~rtional to p, independently of the position of the segment. Independently of the position of the layer, its volume is asymptotitally proportional to the numberof reys intersecting it. Hence the lemmafollows from the fact that
kg
kt
The convergence (‘7) will be uniform for sll leyers except those cant inu ous to the points of intersection of the vectors b, tt ) with the boundary of ii.
10
V. V. Voevodin
Given the conditions
the convergence
being uniform
The function division
of Lemma 1, we have
o(z)
of C into
with respect
is uniformly
to almost
continuous
non-intersecting
regions
all
kl,
in G, so that, gi(e),
gs(s),
kz,. . .,k,. given E, a
. . . ,gN(e),
can
be found such that the deviation of P(z) fran the mean value on each K~(E) is not greater than E. Obviously, the gi(e) can be regarded as regions G of Lemma 1. Consequently,
mesgf(4 fl GM....k,.
lim
t-+m mes gi (e) Let the mean value mean value theorem
=
C-U
1
n G6fds...k P
*
> 0. Using the
of P(z) in the region gi(e) be pi&c for an integral and (9), we have
P(z c gi (8) II G~tkr...k~) = *- P (z c gi (e) n G,$.. .Q
lim
=
lim(~‘+O(e))mesgf(e)nGh,~...~,
= 1+
*-((~~+O(e))mesg,(e)nG~~~...h~ Hence, given
E,
a
to(~)
(idO(e))P(ZCgi(e)
can
be found such that,
for
all
fV%....g,
O(e) 2 0.
Further,
p(Z c Gitlk2.../z,.) =
t
G&...
k,.)
[ ~~~zcg,(e)nGk~k~...k~)] i=l N
whence, for all
t 2 to(~),
tato(~),
,I G W = gt(8) nG~,,k,.~.k ,) G
nGcm ~...k
G (1 +O(eHW=g4e)
p(Z
O(e).
X
Rounding
off
*-O(e)< Since
e is arbitrary,
Now fix
in
linear
(2 C Gk,i+../a,,>
p
P(z c G&,...kr ) fl
+0(e).
small number u > 0 and consider
from 4 by cutting
out the boundaries
Gk,kz...kr, for which the uniform convergence shall
11
transformations
the theorem follows.
an arbitrarily
Gg) , obtained
errors
(8)
the region
of the regions
is not satisfied.
We
assume here that
P(zcGtf')> 1-p. It can be assumed without loss of generality that. the error ur ct ) is distributed in the unit cube, having the same number of dimensions as the non-zero coefficients of the vector a,. Let S, be a parallelepiped inside this cube. Then, Lemma 2 We have p
lim t-co the convergence
(or(‘)c S,, 2 c
Gk:‘+k,_J
mes S,P (2 G Gk$, ,.. A,_~)
being
uniform
for
all
bundles
=
1
t
Gf).
of
We have [31
P(2)
c
sr,
zc
=
Gdfk, ...R
Notice
that,
all
bundles
s,/zc
Gk’fh,..
k,_i
) = mes S,
of GY).
as t - COeach bundle of order
generates an infinity . . ., b, are linearly
),
if we can show that
lim p (u!t) C
t-ww for
)=
P(~cG~~~~...k,_~)P(urcSr/z=G~fk,...k,_,
so that the lemma follows
uniformly
r_-i
r
-
1 in the region
Gr)
of r-th order bundles provided the vectors b 1, b2, If all the r-th order bundles were independent.
V.V.
12
equiprobable, this relation Theorem 2. Given
E
>
Voevodin
(IO) would follow trivially from Theorem 1. The proof of in the general case is very similar to the proof of
0, we find a division
of the region
Gf,f) into non-inter-
secting regions ci(e) bounded by planes perpendicular to the vectors (1: b (0 and the boundaries G$). It will be assumed that all b I , 2 ,**-, bQ)r the r-th order bundles in each qi(~) factor
(1+0(a))
for all
According to the division
are equiprobable apart from a
t 3 ti(s).
of
We get
L33
G$) into regions q i(E),
limP(a!“cS,/zcqi(e)nG~~~*...k,*)= t-t=
mesS,(l
We have t-Q(s))
uniformly for all bundles and qi (E). Since E iNarbitrary, follows. Theorem
(10) now
7
independent The errors a? , r = 1, 2, *a., n - 1, are ~ymptotical~ uniformly distributed entities for almost all vectors or and b,. Using Lemma2, lim P (PC t-w= =
&) = lim[P(lo!f)c: S,.,z c Gf’) +
P (as*)C S,, 2 E G$)l
t-too
lim z t-k03[ $1
P(o,“‘c
ST, 2 c Gf’n G&
kik-. . . . k,_*
+P(df’cS,,
z=G:l’)]=mesS,+O(p)
... k,_i ) +
=
13
a**~o$‘cE 8,) lim t+m
~P(~CG~~~l...k,_l)P(~jf)CSr/~CG~~~:...I,_ F 1) =
IfI p (2 c: qtJ, ._.x,_J
C mes S, lim P ( CT?, c Sr_*, . . . , ai
c
Si) =
mes S, me.9S,i . . . mes Si.
The summation in these latter sums is only over those bundles which give errors from S,, S,, . . ., S,_,. 4. Let v=ai+az+...+ UT be a random ~antity , equal to the sum of random independent quantities oi, each of which is uniformly distributed in a parallelepiped with centre zero and sides of length
oil,
GiZ,s**,
cfir.
The
mathematical expectation of llylt2 is given by Ml~llJ? =
‘/12
2
oij’.
iJ
This is easily verified directly. Now take some concrete transformations fitting into the scheme (1) and encountered in numerical methods of solving problems of linear algebra t41. Gauss’ s method (without division of a row by a principal element). It can be assumed without loss of generality that the matrix U, in this method is
14
V. V. Voevodin
Consequently , r-1
a lp’=
(n,
b,‘=(O
,...,
All the above assertions tion,
Sf)=O,
r=l,
0, 0,
%tl, r9 * . -
I, 0,
9
%r)7
. . ..O).
hold for the unit vectors b,,
and in addi-
1. Further, only the first
2,... *n-
are different
in the matrix Uf’-’ V.$f)“. . . Ul”-’
columns of the unit matrix, while the first
r
r
columns
from the corresponding components of the vector
u, ft ) are zero. Hence
Since n - 1 transformations are required to realize we have
Method
of reflect
Gauss s method,
In this method the matrices Ur are orthogonal,
ions.
and = 1. u p = E - 2W
rlx:~*II2 ,p+liv
lim
f-+m
‘Jethod
of
22f
lim JfltdLll2 4
ort~o~on~~~zut
*
22f
f-km
(without normali~tion
ion
this method 1
0
1
u, = %l,
0 Hence
n* Q.
-
1.
. .
%+I,
P
1
. 1
of the rows). In
Rounding
off
a,’
errors
in
=(O_,
b,’ = (a,+~,1,
. . . .
linear
15
transformations
I,0
O),
,...)
a,+~,r, 0, 0, . . . , 0).
It, follows that ur ct) = 0. Further, only the first
r columns of the matrix Uf’-‘Us’-‘.
differ from the columns of the unit matrix, while the first of the vector a,(t) are zero. Hence
uiw-1u;tr’
. . Uf.“‘-’
r components
. . . u,“’--i6,(t) a,(1) = 6:’ a:‘.
Finally,
An interesting than a sixfold
point is that,
if we consider
improvement in the inequality
M/I I&\$,
not more
is obtained.
5. Taking the example of Gaussian transformations, we now describe the method of estimating the most probable value of the principal term of the norm of the equivalent’ perturbation, We have
1
max Ilv!$Il~ < 2-t n3 ZCG
(12)
12’
Let oi’,“’ denote the i-th component of o,(? The error of o$) in each bundle of order r - 1 is none other than the error from multiplying two numbers, one of which is the i-th component of the vector o$) , and the other the rounded off value of the scalar product (brcf), z$). using IAt
lM(d?/zc M
Gk,kr.:.k,._,) I = O( (~,...k,_~2’)
&) (Ci$z C &$!..A
42)
Hence,
2-f,
= (1 + O( (%..:k _,2Y’~9)~.
in the direction where 2k,...k?_, is the nlength” of the bundle G/,!.k,l of the vector
6,
(t ).
Further, the function
‘&
is constant in each bundle of order r - 1,
16
V.V.
Voevodin
so that
i==r-+-i
i=r+l
G(t) kl...k,_l
i--,-i-1
Here, the constants of the terms O(2-tfi) depend only on the region I; and the distribution density Yz), and are independent of n. Without dwelling in detail on the truth of these assertions in these last relationships, we merely remark that it follows from the existence of the math~ati~al expectation of the terms ~((~2~)-“~~, where x is the “length” of the bundles. Consider ~11~~~~11~ in the light of these relationships:
G
Mll~~~*ll~ + G(1
+0(2-t/2))+
n~2-~~0(2-~/~)sg.,
.
2-2fn2 . . .
zg 7(1+
nq2-f/2)).
Rounding
off
This gives us asymptotic the
dispersion
of
The final equality
[31:
I’(ll&llE
Q Y $$I
’
solution
+
in
errors
formulae
for
\Iv~>,~/~;. In fact
of our
linear
is
+-0(2-@)))=
>
the
mathematical
17
expectation
and
Chebyshev’s
in-
[31,
problem
+ nOC-f’W
transformations
obtained
by using
I--P(IIY,-IlIE>
b 1 -P(
1IldflillE - MII&IIE~
(Y- 1)2-‘n(IfnO(2-t/Z)))>+
> (y&2.
-f2d
In particular,
P(llv:i& < These
results
are
in good agreement
n2-ya0.93.
with
both
(11)
and
Translated
(12). by
D.E. Brown
REFERENCES 1.
WILKINSON, J.H. The Press. 1965.
2.
CASSELS, J.W.S. Introduction Cambridge U.P., 1957.
3.
GNEDENKO, B.V. Course on probability theory nostei), Moscow, Gostekhizdat, 1954.
4.
VOEVODIN,
algebry) 5.
V.V.
algebraic eigenoalue problem, Oxford, Clarendon
Computational
, Moscow - Leningrad,
to Diophantine
methods
Nauka,
of
approximations.
(Kurs teorii veroyat-
algebra
(Chislennye
metody
1966.
KIM, G. and CHIBISOV, D.M. Rounding off error dlstrlbution when multiplying two numbers on a computer witi fixed memory. Mat. Zametki 1, 2, 225 - 234, 196'7.