Approximation of functions on a compact set by finite sums of a sigmoid function without scaling


Neural Networks, Vol. 4, pp. 817-826, 1991. 0893-6080/91 $3.00 + .00. Copyright © 1991 Pergamon Press plc. Printed in the USA. All rights reserved.



ORIGINAL CONTRIBUTION

Approximation of Functions on a Compact Set by Finite Sums of a Sigmoid Function Without Scaling

YOSHIFUSA ITO

Nagoya University College of Medical Technology

(Received 12 October 1990; revised and accepted 28 February 1991)

Abstract--This paper is concerned with three-layered feedforward neural networks which are capable of approximately representing continuous functions on a compact set. Of particular interest here is the use of a sigmoid function without scaling. First, we prove existentially that a linear combination of unscaled shifted rotations of any sigmoid function can approximate uniformly an arbitrary continuous function on a compact set in $R^d$. Second, a proposition is proved constructively using the fact that a homogeneous polynomial $P$ can be expressed as $P(x) = \sum_{i=1}^{n} a_i (\omega_i \cdot x)^r$. It states that the approximation of an arbitrary polynomial on an interval in $R$ can be extended to that of an arbitrary continuous function on a compact set in $R^d$. Then, four corollaries are derived. Though their statements are more or less restricted, the proofs provide algorithms for implementing the uniform approximation. In three of these corollaries, sigmoid functions are used without scaling.

Keywords--Heaviside function, Sigmoid function, Unscaled sigmoid function, Discriminatory, Strongly discriminatory, Linear combination, Uniform approximation, Compact set.

1. INTRODUCTION

Irie and Miyake (1988) proved that an $L^2$-function on $R^d$ can be expressed as an integration over scaled shifted rotations of an integrable function on $R$. Following this, several authors have proved that a linear combination of scaled shifted rotations of a sigmoid function can approximate a continuous function on a compact set in $R^d$. Among them, we mention Carroll and Dickinson (1989), Cybenko (1989), Funahashi (1989), and Hornik, Stinchcombe and White (1989). These results imply that a three-layered feedforward neural network with sigmoid units on the hidden layer can approximate a continuous function on a compact set. Of particular interest in our previous paper (Ito, 1991) were (a) the uniform approximation of an arbitrary function on the whole space $R^d$, and (b) the use of a sigmoid function without scaling; as one of the results in the paper, we have proved that an arbitrary rapidly decreasing function can be uniformly approximated on $R^d$ by a linear combination of unscaled shifted rotations of a sigmoid function of a certain class.

In this paper, it is first proved that a linear combination of unscaled shifted rotations of any strongly discriminatory continuous function can approximate uniformly an arbitrary continuous function on a compact set (Theorem 2.1). The method of the proof is a standard application of the Hahn-Banach theorem (see Rudin (1973)), which was first brought into the field of neural network theory by Cybenko (1989). Hence, the proof is existential. Applying Theorem 2.1, we prove the main theorem (Theorem 2.6), which states that an arbitrary continuous function on a compact set can be uniformly approximated by a linear combination of unscaled shifted rotations of any sigmoid function. Since the proof of this theorem is based on Theorem 2.1, it is also existential.

For convenience in applications, a proposition and four corollaries are provided with constructive proofs in Section 3. The polynomial approximation theorem (see, for example, Yoshida (1968)) is used as a basic tool throughout this section. Proposition 3.2 is derived from the simple eqn (3.1) and guarantees that if any polynomial in one variable is approximated uniformly on a compact set by a linear combination of shifts of a sigmoid function $h$, then a formula of the form (2.9) can be a universal approximator on a compact set in $R^d$. Corollaries 3.3, 3.5, 3.8, and 3.10 are derived from the proposition. Corollary 3.3 guarantees that a linear combination of shifted rotations of the Heaviside function can uniformly approximate any continuous function on a compact set. Corollary 3.5 is obtained as a kind of by-product. Its statement is the same as the well-known fact that a linear combination of scaled shifted rotations of any sigmoid function can uniformly approximate an arbitrary continuous function on a compact set. Different classes of sigmoid functions are used in Corollaries 3.8 and 3.10, respectively, without scaling. Examples are annexed to the respective corollaries. The proofs of the corollaries and examples in Section 3 are all constructive.

Acknowledgements: The author wishes to express his deep gratitude to Professors I. Kubo, T. Matsuzawa, and J. Gani for their helpful discussions, careful reading of the manuscript, and kind assistance in revising the text. Requests for reprints should be sent to Yoshifusa Ito, Toyohashi University of Technology, Tempaku-cho, Toyohashi, 440, Japan.

2. MAIN THEOREM

First, we define strongly discriminatory functions, which play an important role in our theory. Denote by $S^{d-1}$ the totality of unit vectors in the d-dimensional Euclidean space $R^d$. We denote by $x \cdot y$ the inner product of two vectors $x, y \in R^d$. Let $g$ be a function defined on $R$. For $c > 0$, $\omega \in S^{d-1}$ and $t \in R$, we call $g(c^{-1}(\omega \cdot x - t))$ a scaled shifted rotation of $g$, where $\omega$ is a rotator, $t$ a shifter, and $c$ is called a scalar as it scales $g$. Of course, $g(\omega \cdot x - t)$ is called an unscaled shifted rotation (or just a shifted rotation) of $g$.

DEFINITION. We call a function $g$ on $R$ strongly discriminatory if, for a measure $\nu$ on $R^d$ with compact support,

$$\int g(\omega \cdot x - t)\, d\nu(x) = 0 \quad \text{for all } \omega \in S^{d-1} \text{ and } t \in R \qquad (2.1)$$

implies that $\nu = 0$.

Remark. The term "strongly discriminatory" is used here following the definition of discriminatory functions given in Cybenko (1989). Our strongly discriminatory function is a discriminatory function under a stronger restriction. Although Cybenko's discriminatory functions are supposed to be scalable, strongly discriminatory functions can be used without scaling. In Cybenko's definition, test measures are restricted to those whose supports are confined to a fixed compact set $[0, 1]^d$. Even if the confinement is removed, the class of discriminatory functions remains as it is, as long as the test measures are all with compact support. Hence, it is obvious that a strongly discriminatory function is discriminatory in the sense of Cybenko. But the converse is not true. There is a discriminatory function that is not strongly discriminatory. Let the space be one-dimensional. In this case $\omega = \pm 1$. Set $g(x) = \sin x$. Then, $g(x)$ is obviously discriminatory but not strongly discriminatory. In fact, let $\nu = \delta_0 + \delta_\pi$, where $\delta_a$ is the delta function at $a$. For any $t$, $\int \sin(\pm x - t)\, d\nu(x) = 0$. But $\nu \neq 0$. However, we shall observe later that the class of strongly discriminatory functions is still sufficiently large as to include all sigmoid functions.

THEOREM 2.1. Let $g$ be a strongly discriminatory continuous function and $f$ be an arbitrary continuous function defined on a compact set $K$ in $R^d$. Then, for an arbitrary $\varepsilon > 0$, there are coefficients $A_i$, rotators $\omega_i$, and shifters $t_i$, $i = 1, \ldots, n$, such that a finite sum

$$\bar f(x) = \sum_{i=1}^{n} A_i\, g(\omega_i \cdot x - t_i) \qquad (2.2)$$

satisfies

$$|f(x) - \bar f(x)| < \varepsilon \quad \text{on } K. \qquad (2.3)$$

Proof. Let $C(K)$ be the space of continuous functions defined on $K$ and $G$ be the set of linear combinations of functions of the form $A g(\omega \cdot x - t)$. Denote by $\bar G$ the closure of $G$ in the uniform topology on $K$. To prove the theorem, it is sufficient to obtain $\bar G = C(K)$. Suppose that there is a function $\varphi_0 \in C(K)$ which does not belong to $\bar G$. Then, by the Hahn-Banach theorem (see Rudin (1973)), there exists a linear functional $\Lambda$ such that $\Lambda(\varphi_0) = 1$ and $\Lambda(\varphi) = 0$ for every $\varphi \in \bar G$. Then, by the Riesz representation theorem (see Rudin (1966)), there exists a measure $\nu$ on $K$ such that $\Lambda(\varphi) = \int \varphi(x)\, d\nu(x)$ for all $\varphi \in C(K)$. Since $g(\omega \cdot x - t)$ itself belongs to $G$, $\int g(\omega \cdot x - t)\, d\nu(x) = 0$ for all $\omega$ and $t$. Hence, by definition $\nu = 0$, which contradicts $\int \varphi_0(x)\, d\nu(x) = 1$. This concludes the proof. ∎

Except that $g$ is not scaled here, this proof is similar to Cybenko's (1989).

We denote by $\mathcal{F}_d g$ the Fourier transform $\int g(x) e^{-\sqrt{-1}\, x \cdot y}\, dx$ in $R^d$. We call a function $g$ on $R^d$ slowly increasing if there is an integer $k$ such that $g(x)(1 + |x|^2)^{-k}$ is bounded. Note that the Fourier transform $\mathcal{F}_d g$ of such a function is a tempered distribution (distribution tempérée, see Schwartz (1966)). For a function $g$ defined on $R$, we write $g_\omega(x) = g(\omega \cdot x)$, $x \in R^d$. If $g$ is slowly increasing then $g_\omega$ is also slowly increasing. For $\omega_0 = (1, 0, \ldots, 0)$, the Fourier transform of $g_{\omega_0}$ is a tensor product:

$$\mathcal{F}_d g_{\omega_0}(y) = (2\pi)^{d-1}\, \mathcal{F}_1 g(y_1)\, \delta_0(y_2) \cdots \delta_0(y_d).$$

For any $\omega$, $\mathcal{F}_d g_\omega$ is obtained by rotating this tensor product. Set $\check g_\omega(z) = g_\omega(-z)$.

LEMMA 2.2. Let $g$ be a slowly increasing function on $R$ and $\nu$ be a measure on $R^d$ with compact support.

If

$$\int g(\omega \cdot x - t)\, d\nu(x) = 0 \quad \text{for all } \omega \in S^{d-1} \text{ and } t \in R, \qquad (2.4)$$

then

$$\mathcal{F}_d \check g_\omega\, \mathcal{F}_d \nu(y) = 0 \quad \text{for all } \omega \in S^{d-1} \text{ and } y \in R^d. \qquad (2.5)$$

Proof. Denoting by $x_\omega$ an arbitrary point of $R^d$ such that $x_\omega \cdot \omega = 0$, we have that

$$\int g(\omega \cdot x - t)\, d\nu(x) = \int g(\omega \cdot (x - x_\omega) - t)\, d\nu(x) = \int g_\omega(x - x_\omega - t\omega)\, d\nu(x) = \check g_\omega * \nu(x_\omega + t\omega), \qquad (2.6)$$

where $*$ stands for a convolution. Since an arbitrary point of $R^d$ can be expressed as $x_\omega + t\omega$, we have $\check g_\omega * \nu \equiv 0$ on $R^d$. The Fourier transform $\mathcal{F}_d \nu$ is analytic, its derivatives are all bounded and the Fourier transform $\mathcal{F}_d \check g_\omega$ is a tempered distribution. Hence, the product $\mathcal{F}_d \check g_\omega\, \mathcal{F}_d \nu$ has a meaning as a distribution. Furthermore, the convolution $\check g_\omega * \nu$ is a slowly increasing function. We can prove that

$$\mathcal{F}_d(\check g_\omega * \nu) = \mathcal{F}_d \check g_\omega\, \mathcal{F}_d \nu \quad \text{for all } \omega \in S^{d-1} \qquad (2.7)$$

in the sense of distribution. Combining eqns (2.4), (2.6) and (2.7) concludes the proof. ∎

LEMMA 2.3. Every slowly increasing function $g$ on $R$ is strongly discriminatory, if $\mathrm{supp}(\mathcal{F}_1 g)$ is dense in a nonempty open set. (We denote by $\mathrm{supp}(g)$ the support of $g$, irrespective of $g$ being a function or a distribution.)

Proof. Note that

$$\bigcup_{\omega \in S^{d-1}} \mathrm{supp}(\mathcal{F}_d g_\omega) = \bigcup_{\omega \in S^{d-1}} \mathrm{supp}(\mathcal{F}_d \check g_\omega).$$

If the condition is satisfied, these sets are dense in a nonempty open set in $R^d$. Suppose that $\nu$ is a measure with compact support in $R^d$ and eqn (2.1) holds. By Lemma 2.2, eqn (2.1) implies that $\mathcal{F}_d \check g_\omega\, \mathcal{F}_d \nu = 0$ for all $\omega \in S^{d-1}$. Hence, $\mathcal{F}_d \nu$ is equal to 0 on $\bigcup_{\omega \in S^{d-1}} \mathrm{supp}(\mathcal{F}_d \check g_\omega)$, as $\mathcal{F}_d \nu$ is continuous. Furthermore, $\mathcal{F}_d \nu$ is analytic. Hence, by analytic continuation we obtain that $\mathcal{F}_d \nu \equiv 0$ on $R^d$, from which we obtain $\nu = 0$. ∎

Let $1_{[a,b]}$ be the indicator function of the interval $[a, b]$.

LEMMA 2.4. Suppose that a function $h$ defined on $R$ is monotonic and $|h(t - a) - h(t - b)|$ is bounded for fixed $a$, $b$. Then, $h_1 = h * 1_{[a,b]}$ is uniformly approximated arbitrarily well by a finite sum of the form

$$\sum_{i=1}^{n} a_i h(t - t_i),$$

where $a_i$ and $t_i$ are constants.

Proof. We may suppose that $a < b$ and $h$ is monotonic increasing. Set $D = \sup_t (h(t - a) - h(t - b))$, $a_i = (b - a)/n$ and $t_i = (b - a)i/n + a$. Then,

$$0 \le \sum_{i=1}^{n} a_i h(t - t_{i-1}) - h * 1_{[a,b]}(t) = \sum_{i=1}^{n} \int_{t_{i-1}}^{t_i} \big(h(t - t_{i-1}) - h(t - s)\big)\, ds \le \frac{b - a}{n} \sum_{i=1}^{n} \big(h(t - t_{i-1}) - h(t - t_i)\big) \le \frac{b - a}{n} D.$$

Hence, as $n \to \infty$,

$$\frac{b - a}{n} \sum_{i=1}^{n} h(t - t_{i-1}) \to \int_a^b h(t - s)\, ds \qquad (2.8)$$

uniformly. ∎

DEFINITION. We say that a function $h$ defined on $R$ is sigmoid if $h$ is monotonic increasing, $h(t) \to 0$ as $t \to -\infty$ and $h(t) \to 1$ as $t \to \infty$. A function $H$ defined on $R$ by

$$H(t) = \begin{cases} 1 & \text{for } t \ge 0, \\ 0 & \text{for } t < 0 \end{cases}$$

is the Heaviside function. It is an example of a sigmoid function.

LEMMA 2.5. Any sigmoid function is strongly discriminatory.

Proof. Let $h$ be a sigmoid function and $\sigma$ be the Lebesgue-Stieltjes measure defined by $\sigma((a, b]) = h(b) - h(a)$. Since $\sigma$ is a probability measure, the Fourier transform $\mathcal{F}_1 \sigma(s)$ is continuous and the support of $\mathcal{F}_1 \sigma$ contains a nonempty open interval. Since $h$ is slowly increasing, $\mathcal{F}_1 h$ is well defined. The support of $\mathcal{F}_1 h$ also contains a nonempty open interval, because $\mathcal{F}_1 \sigma(s) = \sqrt{-1}\, s\, \mathcal{F}_1 h(s)$. Hence, by Lemma 2.3, $h$ is strongly discriminatory. ∎

We are now ready to prove the main theorem.
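The Riemann-sum estimate in the proof of Lemma 2.4 can be checked numerically. The following is a minimal sketch, not part of the original paper: it assumes the logistic function as an illustrative continuous choice of $h$, takes $[a, b] = [0, 1]$, and uses hypothetical helper names (`smoothed_exact`, `smoothed_by_shifts`, `sup_error`). It compares the sum $\sum_i a_i h(t - t_{i-1})$ with the closed form of $h * 1_{[a,b]}$ on a grid.

```python
import math

def logistic(t):
    """Illustrative sigmoid h (an assumption; any monotone h with bounded D would do)."""
    return 1.0 / (1.0 + math.exp(-t))

def smoothed_exact(t, a, b):
    """h * 1_[a,b](t) = integral_a^b h(t - s) ds, in closed form for the logistic h."""
    # An antiderivative of the logistic function is softplus(u) = log(1 + e^u).
    softplus = lambda u: math.log1p(math.exp(u))
    return softplus(t - a) - softplus(t - b)

def smoothed_by_shifts(h, t, a, b, n):
    """Sum from the proof: sum_{i=1}^n a_i h(t - t_{i-1}) with a_i = (b - a)/n."""
    step = (b - a) / n
    return sum(step * h(t - (a + step * (i - 1))) for i in range(1, n + 1))

def sup_error(n, a=0.0, b=1.0):
    """Largest gap between the finite sum and h * 1_[a,b] on a grid over [-10, 10]."""
    grid = [-10.0 + 0.05 * k for k in range(401)]
    return max(abs(smoothed_by_shifts(logistic, t, a, b, n) - smoothed_exact(t, a, b))
               for t in grid)
```

Since $D \le 1$ for any sigmoid function, the proof's bound gives an error of at most $(b - a)/n$, so `sup_error(n)` should not exceed `1/n` in this setting, and it should shrink as `n` grows.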

THEOREM 2.6 (Main theorem). Let $h$ be a sigmoid function and $f$ be a continuous function defined on a compact set $K$ in $R^d$. Then, for an arbitrary $\varepsilon > 0$, there are coefficients $A_i$, rotators $\omega_i$ and shifters $t_i$, $i = 1, \ldots, n$, such that a finite sum

$$\bar f(x) = \sum_{i=1}^{n} A_i h(\omega_i \cdot x - t_i) \qquad (2.9)$$

satisfies

$$|f(x) - \bar f(x)| < \varepsilon \quad \text{on } K. \qquad (2.10)$$

Proof. By Lemma 2.5, $h$ is strongly discriminatory. First, suppose that $h$ is continuous. Then, by Theorem 2.1, the present theorem holds. Next, suppose that $h$ is not necessarily continuous. Set $h_1 = (b - a)^{-1} h * 1_{[a,b]}$ for arbitrarily chosen $a$ and $b$ ($a < b$). Then, $h_1$ is a continuous sigmoid function. Hence, there exists a finite sum of the form

$$f_1(x) = \sum_{i=1}^{n} A_i h_1(\omega_i \cdot x - t_i), \qquad (2.11)$$

which satisfies

$$|f(x) - f_1(x)| < \frac{\varepsilon}{2} \quad \text{on } K. \qquad (2.12)$$

By Lemma 2.4, $h_1$ can be uniformly approximated by a finite sum in a way

$$\Big| h_1(\omega \cdot x) - \sum_{j=1}^{m} a_j h(\omega \cdot x - t_j) \Big| < \frac{\varepsilon}{2nA} \quad \text{on } K, \qquad (2.13)$$

where $A = \max_{1 \le i \le n} |A_i|$. Combining eqns (2.11) and (2.13) concludes the proof. This theorem concludes this section. ∎

3. CONSTRUCTIVE PROOFS

Theorem 2.6 guarantees the universal capability of the three-layered feedforward neural network whose hidden layer units have the same sigmoid activation function fixed beforehand, even if the connection vectors $\omega_i$ are restricted to the unit vectors. However, it is unlikely that its proof leads to an algorithm for designing an actual neural network because it is existential. The constructive proofs of the corollaries in this section may compensate for this inconvenience. As has been shown by many authors, there are several constructive methods of proving the approximation theorem in the case where the sigmoid function is scalable. Though there could be some other methods even in the case of unscalable sigmoid functions, we derive all the results from eqn (3.1) in this paper, applying the polynomial approximation theorem.

LEMMA 3.1. For any homogeneous polynomial $P$ of degree $r$ in $x$, there exist coefficients $a_i$ and unit vectors $\omega_i$, $i = 1, \ldots, n \le \binom{r+d-1}{r}$, such that

$$P(x) = \sum_{i=1}^{n} a_i (\omega_i \cdot x)^r. \qquad (3.1)$$

Proof. The number of linearly independent homogeneous polynomials of degree $r$ in $x$ is $\binom{r+d-1}{r}$, and that of linearly independent powers $(\omega_i \cdot x)^r$ of the inner products is the same. Hence, we can express any homogeneous polynomial of degree $r$ in $x$ as a linear combination of the powers $(\omega_i \cdot x)^r$. ∎

Remark. Let us note that the polynomial approximation theorem can be proved constructively (see Yoshida (1968)). This fact is inherited by the proofs in this section. In applications, a polynomial approximation of a continuous function can often be obtained more easily than by following the proof of the theorem.

The proposition below converts the uniform approximation of powers $t^r$ on finite intervals in $R$ into that of arbitrary continuous functions on a compact set in $R^d$. All the corollaries in this section are derived from this proposition. For a compact set $K$ in $R^d$, set $m_K = \max_{x \in K} |x|$.

PROPOSITION 3.2. Let $h$ be a sigmoid function, $K \subset R^d$ a compact set and $c$ a scalar. Suppose that, for any non-negative integer $r$ and any $\varepsilon_1 > 0$, there are coefficients $a_i$ and shifters $s_i$, $i = 1, \ldots, m$, for which a finite sum

$$v_r(s) = \sum_{i=1}^{m} a_i h(c^{-1}(s - s_i))$$

satisfies $|s^r - v_r(s)| < \varepsilon_1$ on $[-m_K, m_K]$. Then, for an arbitrary continuous function $f$ defined on $K$ and any $\varepsilon > 0$, there are coefficients $A_i$, rotators $\omega_i$ and shifters $t_i$, $i = 1, \ldots, n$, for which

$$\bar f(x) = \sum_{i=1}^{n} A_i h(c^{-1}(\omega_i \cdot x - t_i)) \qquad (3.2)$$

satisfies

$$|f(x) - \bar f(x)| < \varepsilon \quad \text{on } K. \qquad (3.3)$$

Proof. By the polynomial approximation theorem, there is a polynomial $P$ for which $|f(x) - P(x)| < \varepsilon/2$ on $K$. By Lemma 3.1, each term of $P$ can be expressed in the form of the right hand side of eqn (3.1). By assumption we have that, for any $\varepsilon_1 > 0$,

$$\Big| x_1^r - \sum_{i=1}^{m} a_i h(c^{-1}(x_1 - s_i)) \Big| < \varepsilon_1$$

if $-m_K \le x_1 \le m_K$. This concludes the proof. ∎

We use this proposition with $c \neq 1$ only in Example 3.4 and Corollary 3.5. The first two corollaries are obtained straightforwardly by rephrasing this proposition. Though a scaling of the Heaviside function is the Heaviside function itself, it is somewhat meaningful that the scaling constant is unity in the approximate representation eqn (3.4) in the first corollary, as will be seen in Example 3.4.
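To make Lemma 3.1 concrete, here is a minimal sketch for the case $d = 2$; it is an illustration under stated assumptions, not the paper's own algorithm. It assumes $r + 1$ equally spaced directions $\omega_k = (\cos\theta_k, \sin\theta_k)$ with $\theta_k = \pi k/(r+1)$ (a hypothetical choice; any pairwise non-parallel directions would serve), expands each $(\omega_k \cdot x)^r$ by the binomial theorem, and solves the resulting linear system for the coefficients $a_k$ of eqn (3.1). The function names are hypothetical.

```python
import math

def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def ridge_decomposition(coeffs):
    """coeffs[j] multiplies x1**(r-j) * x2**j; returns a list of (a_k, omega_k) pairs."""
    r = len(coeffs) - 1
    # Assumed directions: r + 1 unit vectors with angles in [0, pi).
    omegas = [(math.cos(math.pi * k / (r + 1)), math.sin(math.pi * k / (r + 1)))
              for k in range(r + 1)]
    # Row j holds the coefficient of x1**(r-j) * x2**j in each power (omega_k . x)**r.
    matrix = [[math.comb(r, j) * c ** (r - j) * s ** j for (c, s) in omegas]
              for j in range(r + 1)]
    return list(zip(solve(matrix, coeffs), omegas))

def evaluate(decomposition, r, x1, x2):
    """Right hand side of eqn (3.1): sum_k a_k * (omega_k . x)**r."""
    return sum(a * (w1 * x1 + w2 * x2) ** r for a, (w1, w2) in decomposition)
```

For example, `ridge_decomposition([0.0, 1.0, 0.0])` represents the product $x_1 x_2$, and `evaluate` applied to the result should reproduce $x_1 x_2$ up to rounding error. The number of terms, $r + 1$, equals $\binom{r+d-1}{r}$ for $d = 2$.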

COROLLARY 3.3. Let $H$ be the Heaviside function and $f$ be an arbitrary continuous function defined on a compact set $K$ in $R^d$. Then, for any $\varepsilon > 0$, there are coefficients $A_i$, rotators $\omega_i$ and shifters $t_i$, $i = 1, \ldots, n$, such that a finite sum

$$\bar f(x) = \sum_{i=1}^{n} A_i H(\omega_i \cdot x - t_i) \qquad (3.4)$$

satisfies

$$|f(x) - \bar f(x)| < \varepsilon \quad \text{on } K. \qquad (3.5)$$

The shifters $t_i$ can be confined in $[-m_K, m_K]$.

Proof. For any $\varepsilon_1 > 0$, there is a polynomial $P$ such that $|f(x) - P(x)| < \varepsilon_1$ for $x \in K$ by the polynomial approximation theorem. Supposing that the degree of $P$ is $p$, take a sufficiently fine partition $\{t_i\}_{i=0}^{n}$ of the interval $[-m_K, m_K]$ such that

$$\sup_{t_{i-1} \le t \le t_i} |t^r - t_{i-1}^r| < \varepsilon_1 \quad \text{for } 1 \le r \le p,\ 1 \le i \le n,$$

and set

$$a_i = \begin{cases} t_0^r & \text{for } i = 0, \\ t_i^r - t_{i-1}^r & \text{for } 1 \le i \le n. \end{cases}$$

Then, for $1 \le r \le p$,

$$\Big| t^r - \sum_{i=0}^{n} a_i H(t - t_i) \Big| < \varepsilon_1 \quad \text{for } t \in [-m_K, m_K]. \qquad (3.6)$$

Hence, Proposition 3.2 concludes the proof. ∎

This elementary and constructive proof may be regarded as a convenient algorithm for designing a neural network, as seen in the following examples.

EXAMPLE 3.4. We illustrate here how to use Corollary 3.3. First, we select a multiplier, putting $P(x_1, x_2) = x_1 x_2$ for $(x_1, x_2) \in [-1, 1]^2$. We use an equality

$$x_1 x_2 = \tfrac{1}{4}\{(x_1 + x_2)^2 - (x_1 - x_2)^2\}.$$

With unit vectors $\omega_1 = (w_{11}, w_{12})$ and $\omega_2 = (w_{21}, w_{22})$, where $w_{11} = w_{12} = w_{21} = -w_{22} = 2^{-1/2}$, this equation is written

$$x_1 x_2 = \tfrac{1}{2}\{(w_{11} x_1 + w_{12} x_2)^2 - (w_{21} x_1 + w_{22} x_2)^2\}.$$

For $(x_1, x_2) \in [-1, 1]^2$, $|x| = (|x_1|^2 + |x_2|^2)^{1/2} \le \sqrt{2}$. Hence, $m_K = \sqrt{2}$. For any $\varepsilon > 0$, we can follow the procedure of the proof of Corollary 3.3 to obtain an approximate representation of the product $x_1 x_2$:

$$\bar f(x_1, x_2) = \tfrac{1}{2} \sum_{i} (t_i^2 - t_{i-1}^2)\{H(w_{11} x_1 + w_{12} x_2 - t_i) - H(w_{21} x_1 + w_{22} x_2 - t_i)\}, \qquad (3.7)$$

which satisfies $|x_1 x_2 - \bar f(x_1, x_2)| < \varepsilon$ on $[-1, 1]^2$, where $t_0 = 0$ and $\{t_i\}$ is a partition of the interval $[-\sqrt{2}, \sqrt{2}]$. In the case that the connection vectors are not restricted to the unit vectors, we can directly obtain a simpler formula

$$\bar f(x_1, x_2) = \tfrac{1}{4} \sum_{i} (t_i^2 - t_{i-1}^2)\{H(x_1 + x_2 - t_i) - H(x_1 - x_2 - t_i)\}. \qquad (3.8)$$

Second, using Corollary 3.3, we can derive the approximate representation formula with each familiar sigmoid function described below, if it is scalable. The result here is included in Corollary 3.5, but it offers a different algorithm. Let $\varphi$ be a spherically symmetric non-negative function such that $\int \varphi(x)\, dx = 1$ and denote by $\varphi_c$ a scaling of $\varphi$. Any continuous function defined on a compact set $K$ can be extended continuously to a greater compact set $K'$ whose interior includes $K$. Hence, we can obtain eqn (3.5), where $K$ is replaced by $K'$ and $\varepsilon$ by $\varepsilon/2$. Hence, there is a scalar $c$ for which both

$$|f * \varphi_c(x) - \bar f * \varphi_c(x)| < \frac{\varepsilon}{2} \quad \text{and} \quad |f(x) - f * \varphi_c(x)| < \frac{\varepsilon}{2} \quad \text{on } K$$

hold. In other words, we obtain that, for

$$\bar f * \varphi_c(x) = \sum_{i=1}^{n} A_i H_\varphi(c^{-1}(\omega_i \cdot x - t_i)), \qquad (3.9)$$

$$|f(x) - \bar f * \varphi_c(x)| < \varepsilon \quad \text{on } K, \qquad (3.10)$$

where $H_\varphi$ is a sigmoid function defined by

$$H_\varphi(t) = \int_{-\infty}^{t} ds_1 \int \cdots \int \varphi(s_1, \ldots, s_d)\, ds_2 \cdots ds_d.$$

Since the scalar is unity in Corollary 3.3, the scalars on the right hand side of eqn (3.9) can be the same among the respective terms. If $f$ is an infinitely continuously differentiable function, we call it a $C^\infty$-function.

Remark. There are functions $\varphi$ for which $H_\varphi$ are familiar sigmoid functions. For $\varphi(x) = (2\pi)^{-d/2} \exp(-|x|^2/2)$, $H_\varphi$ is the Gaussian distribution function:

$$H_\varphi(t) = (2\pi)^{-1/2} \int_{-\infty}^{t} \exp(-s^2/2)\, ds.$$

For $\varphi(x) = \pi^{-(d+1)/2}\, \Gamma((d+1)/2)\, (1 + |x|^2)^{-(d+1)/2}$, $H_\varphi$ is the arctangent sigmoid function:

$$H_\varphi(t) = \frac{1}{\pi} \tan^{-1} t + \frac{1}{2}.$$

Further, there is a function $\varphi$ satisfying the condition for which $H_\varphi$ is the logistic function: $H_\varphi(t) = (1 + e^{-t})^{-1}$. As in the case of the first two examples above, this $\varphi$ is also a rapidly decreasing $C^\infty$-function. If the sigmoid function $h$ is $(n + 1)$-times differentiable and the derivatives decrease with a certain rapidity, then the spherically symmetric function $\varphi$ such that $H_\varphi = h$ can be obtained as the inverse Radon transform of the derivative $h'(t)$, which is regarded as a function on $R \times S^{d-1}$ (see Ito (1991)).

The sigmoid function used in the corollary below must be scalable. In this sense, it is different from other corollaries in this paper.
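The construction in the proof of Corollary 3.3, applied to the multiplier of Example 3.4, can be sketched in code. This is a minimal illustration under stated assumptions rather than the paper's exact formula: it uses a uniform partition with $t_0 = -m$ and the coefficients $a_0 = t_0^r$, $a_i = t_i^r - t_{i-1}^r$ from the proof, and it takes $m = 1.5$, slightly larger than $m_K = \sqrt{2}$, only to avoid floating-point ties at the end points. The function names are hypothetical.

```python
def heaviside(t):
    """Heaviside function H: 1 for t >= 0, 0 for t < 0."""
    return 1.0 if t >= 0 else 0.0

def step_power(r, m, n):
    """Coefficients a_i and shifters t_i with sum_i a_i H(t - t_i) close to t**r on [-m, m]."""
    ts = [-m + 2.0 * m * i / n for i in range(n + 1)]
    coeffs = [ts[0] ** r] + [ts[i] ** r - ts[i - 1] ** r for i in range(1, n + 1)]
    return coeffs, ts

def eval_steps(coeffs, ts, t):
    """Evaluate the finite sum of shifted Heaviside functions at t."""
    return sum(a * heaviside(t - ti) for a, ti in zip(coeffs, ts))

def product_net(x1, x2, n=400, m=1.5):
    """Approximate x1 * x2 on [-1, 1]^2 with unit connection vectors and Heaviside units."""
    w = 2.0 ** -0.5                       # w11 = w12 = w21 = -w22 = 2**(-1/2)
    coeffs, ts = step_power(2, m, n)
    s_plus = w * x1 + w * x2              # omega_1 . x
    s_minus = w * x1 - w * x2             # omega_2 . x
    # x1 * x2 = (1/2) * {(omega_1 . x)**2 - (omega_2 . x)**2}
    return 0.5 * (eval_steps(coeffs, ts, s_plus) - eval_steps(coeffs, ts, s_minus))
```

With mesh width $2m/n$, each step approximation of $s^2$ is off by at most $2m \cdot 2m/n$, so the product is approximated to within $4m^2/n$ (that is, $9/n$ for $m = 1.5$); the constant term $a_0$ cancels in the difference.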

COROLLARY 3.5. Let $h$ be any sigmoid function, and $f$ be an arbitrary continuous function defined on a compact set $K$ in $R^d$. Then, for any $\varepsilon > 0$, there are coefficients $A_i$, rotators $\omega_i$, shifters $t_i$, $i = 1, \ldots, n$, and a scalar $c$ such that a finite sum

$$\bar f_c(x) = \sum_{i=1}^{n} A_i h(c^{-1}(\omega_i \cdot x - t_i)) \qquad (3.11)$$

satisfies

$$|f(x) - \bar f_c(x)| < \varepsilon \quad \text{on } K. \qquad (3.12)$$

The shifters $t_i$ can be confined in a neighbourhood of the interval $[-m_K, m_K]$.

Proof. Similarly to the proof of Corollary 3.3, we can show that any power $t^r$ can be uniformly approximated on the interval $[-m_K, m_K]$ by a finite sum of scaled shifts of $h$ with the shifters $t_i$ confined in any neighbourhood of $[-m_K, m_K]$, which can be fixed beforehand. Hence, Proposition 3.2 concludes the proof. ∎

Although it must be sufficiently small, the scalar $c$ is common to the terms of the right hand side of eqn (3.11). Since the sigmoid function $h$ must be scalable, this corollary does not fit the purpose of this paper. Nevertheless, it is described here because it can be directly obtained from eqn (3.1). If we set $W_i = \omega_i/c$ and $T_i = t_i/c$, the formula eqn (3.11) can be written

$$\bar f_c(x) = \sum_{i=1}^{n} A_i h(W_i \cdot x - T_i).$$

This is a familiar expression of the well-known approximate representation formula. Hence, the proof of this corollary may be meaningful as a different proof of the well-known fact. Since this proof is simple and concrete, it may be regarded as a useful algorithm. Further, it can be used to obtain the examples below, which are meaningful in this paper.

EXAMPLE 3.6. Using Corollary 3.5, we can prove constructively that any polygonal sigmoid function can be used without scaling for implementing the approximate representation.

1. Let $h$ be a simple polygonal sigmoid defined by

$$h(t) = h(t, \tau) = \begin{cases} 0 & \text{for } t \le 0, \\ t/\tau & \text{for } 0 < t \le \tau, \\ 1 & \text{for } t > \tau. \end{cases} \qquad (3.13)$$

Let $M$ be an arbitrary positive number. For $0 < c \le 1$, there is a positive integer $n$ for which

$$h(t/c) = \frac{1}{c} \sum_{i=0}^{n} \{h(t - i\tau) - h(t - i\tau - c\tau)\} \quad \text{on } (-\infty, M]. \qquad (3.14)$$

For $c > 1$, combining an equality $h(t/N) = \frac{1}{N} \sum_{i=0}^{N-1} h(t - i\tau)$, $N > c$, with eqn (3.14), we can obtain a linear sum that coincides with $h(\cdot/c)$ on $(-\infty, M]$. Thus, a linear combination which plays the role of $h(\cdot/c)$ on a compact set can be obtained.

2. A general polygonal sigmoid function is written:

$$h(t) = \sum_{i=1}^{n} c_i h(t - t_i, \tau_i), \qquad (3.15)$$

where $c_1 + \cdots + c_n = 1$, $c_i > 0$ and $\tau_1 \le \cdots \le \tau_n$. Without loss of generality, we may assume that $t_1 = 0$. Set $T_i = i\tau_1$. In the interval $[T_1, T_2)$, there are a finite number, say $j$, of slopes that start, respectively, at $t_{i_1} \le \cdots \le t_{i_j}$. Subtracting from $h$ a suitable linear combination of shifts of $h$, we obtain a function $h^{(1)}$ such that $h^{(1)}(t) = h(t)$ for $t < T_1$ and $h^{(1)}(t) = c_1$ for $T_1 \le t < T_2$. There are a finite number of slopes of $h^{(1)}$ in the interval $[T_2, T_3)$. These can be flattened similarly by subtracting a linear combination of shifts of $h$ from $h^{(1)}$. Denote by $h^{(2)}(t)$ the result. We can repeat this procedure. As $k$ increases, the number of slopes of $h^{(k)}$ in $[T_{k+1}, T_{k+2})$ could increase. However, it remains finite each time. Hence, we can finally flatten the slopes in the interval $(-\infty, M]$. Multiplying the result by a constant, we obtain a linear combination which plays the role of $h(t, \tau_1)$ on a compact set.

In Example 3.6.2, it is obvious that, using the first slope, we can flatten (a part of) the second slope of the sigmoid function and go on. However, since the number of the slopes to be flattened can increase quickly as we repeat the procedure, it is essential for us to show that the procedure can be finished within a finite number of repetitions.

Example 3.6 illustrates simple examples of sigmoid functions that can implement the uniform approximation without scaling. There are many such sigmoid functions. If a probability measure $\sigma$ satisfies the condition of the lemma below, a sigmoid function defined by $h(t) = \int_{-\infty}^{t} d\sigma(s)$ can be used for implementing an approximate representation. This will be proved in Corollary 3.8 constructively. The Lebesgue-Stieltjes measures defined by the polygonal sigmoid functions illustrated in Example 3.6 satisfy the condition of Lemma 3.7. We denote by $\mathcal{S}(R)$ the Schwartz space on $R$ which consists of rapidly decreasing $C^\infty$-functions.
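The two identities for the simple polygonal sigmoid in Example 3.6.1 are easy to verify numerically. The sketch below is an illustration only: the function names are hypothetical, and the compression identity is written in the form $h(t/c) = c^{-1} \sum_{i=0}^{n} \{h(t - i\tau) - h(t - i\tau - c\tau)\}$, which holds for $t \le (n+1)\tau$ when $0 < c \le 1$.

```python
def ramp(t, tau):
    """Simple polygonal sigmoid h(t, tau): 0 for t <= 0, t/tau on (0, tau], 1 afterwards."""
    if t <= 0:
        return 0.0
    if t <= tau:
        return t / tau
    return 1.0

def stretched_by_shifts(t, tau, big_n):
    """(1/N) * sum_{i=0}^{N-1} h(t - i*tau), which reproduces h(t/N) without scaling h."""
    return sum(ramp(t - i * tau, tau) for i in range(big_n)) / big_n

def compressed_by_shifts(t, tau, c, n):
    """(1/c) * sum_{i=0}^{n} {h(t - i*tau) - h(t - i*tau - c*tau)} for 0 < c <= 1.

    Coincides with h(t/c) for t <= (n + 1) * tau.
    """
    total = sum(ramp(t - i * tau, tau) - ramp(t - i * tau - c * tau, tau)
                for i in range(n + 1))
    return total / c
```

For instance, with `tau = 0.5`, `stretched_by_shifts(t, 0.5, 4)` should agree with `ramp(t / 4, 0.5)` for every `t`, and `compressed_by_shifts(t, 0.5, 0.3, 10)` should agree with `ramp(t / 0.3, 0.5)` for `t` up to `5.5`. Choosing `n` large enough that $(n+1)\tau \ge M$ covers the interval $(-\infty, M]$.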

LEMMA 3.7. Let $\sigma$ be a probability measure on $R$. If the inverse $1/\mathcal{F}_1 \sigma$ of the Fourier transform of $\sigma$ is a tempered distribution, then, for any $\varphi \in \mathcal{S}(R)$, there exists a slowly increasing $C^\infty$-function $v$ on $R$ such that

$$\varphi(t) = v * \sigma(t) \quad \text{on } R. \qquad (3.16)$$

Proof. Since the quotient $\mathcal{F}_1 \varphi / \mathcal{F}_1 \sigma$ is a rapidly decreasing distribution, $v = \mathcal{F}_1^{-1}(\mathcal{F}_1 \varphi / \mathcal{F}_1 \sigma)$ is well defined. Obviously eqn (3.16) holds for this $v$. This function is also written $v = \varphi * \mathcal{F}_1^{-1}(1/\mathcal{F}_1 \sigma)$. Hence, $v$ is a slowly increasing $C^\infty$-function. ∎

If $1/\mathcal{F}_1 \sigma$ has only discrete poles as its singular points, then it is a well-defined distribution. When it is further tempered, it satisfies the condition of Lemma 3.7. This is a matter of "Problème de la division" (Schwartz (1966)). We call a measure $\sigma$ rapidly decreasing, if $\int (1 + t^2)^k\, d|\sigma|(t) < \infty$ for any integer $k$.

COROLLARY 3.8. Let $h$ be a sigmoid function and $\sigma$ be a measure defined by $\sigma((s, t]) = h(t) - h(s)$, $s < t$, and suppose that the measure $\sigma$ is rapidly decreasing and satisfies the condition in Lemma 3.7. Then, for an arbitrary continuous function $f$ defined on a compact set $K$ in $R^d$ and for any $\varepsilon > 0$, there are coefficients $A_i$, rotators $\omega_i$ and shifters $t_i$, $i = 1, \ldots, N$, such that a finite sum

$$\bar f(x) = \sum_{i=1}^{N} A_i h(\omega_i \cdot x - t_i) \qquad (3.17)$$

satisfies

$$|f(x) - \bar f(x)| < \varepsilon \quad \text{on } K. \qquad (3.18)$$

If $\mathrm{supp}(\sigma) \subset [-q, q]$, then $t_i$ can be confined in the interval $[-m_K - q, m_K + q]$.

Proof. For each non-negative integer $r$, there is a $C^\infty$-function $\varphi_r$ with compact support that coincides with $t^r$ on $[-m_K, m_K]$. By Lemma 3.7, there is a slowly increasing $C^\infty$-function $v_r$ such that

$$\varphi_r = v_r * \sigma. \qquad (3.19)$$

Let $\varepsilon_1$ be an arbitrary positive number and set

$$l(t, q) = \int_{-\infty}^{-m_K - q} |v_r(s)|\, d\sigma(t - s) + \int_{m_K + q}^{\infty} \big(|v_r(s) - v_r(m_K + q)| + \varepsilon_1\big)\, d\sigma(t - s).$$

Since $v_r$ is slowly increasing and $\sigma$ is rapidly decreasing, $l(t, q) < \varepsilon_1$ for a sufficiently large $q$. There are sets of coefficients $a_{ri}$ and constants $t_{ri} \in [-m_K - q, m_K + q]$, $i = 1, \ldots, n_r$, such that

$$\Big| v_r(t) - \sum_{i=1}^{n_r} a_{ri} H(t - t_{ri}) \Big| < \varepsilon_1 \quad \text{for } t \in [-m_K - q, m_K + q]. \qquad (3.20)$$

Hence, we have that

$$\Big| v_r * \sigma(t) - \sum_{i=1}^{n_r} a_{ri} H * \sigma(t - t_{ri}) \Big| \le \int_{-m_K - q}^{m_K + q} \Big| v_r(s) - \sum_{i=1}^{n_r} a_{ri} H(s - t_{ri}) \Big|\, d\sigma(t - s) + l(t, q) < 2\varepsilon_1 \qquad (3.21)$$

on $[-m_K, m_K]$. Proposition 3.2, eqns (3.19) and (3.21) conclude the first half of the Corollary. If $\mathrm{supp}(\sigma) \subset [-q_0, q_0]$, $l(t, q) = 0$ on $[-m_K, m_K]$ for $q > q_0$. Hence, $t_i$ can be confined in $[-m_K - q, m_K + q]$. ∎

EXAMPLE 3.9. We can illustrate several measures that satisfy the condition of Corollary 3.8. The polygonal sigmoid functions used in Example 3.6 are simultaneously the examples for Corollary 3.8 as described below. The densities similar to $\psi_1$ and $\psi_2$ below have already been described in Ito (1991).

1. Set $\psi(t) = (a/2) e^{-a|t|}$, $a > 0$. Then, its Fourier transform is $\mathcal{F}_1 \psi(s) = (1 + |s|^2/a^2)^{-1}$. Hence, $\psi\, dt$ satisfies the condition.
2. A finite sum of the delta functions $\nu = \sum_{i=1}^{n} a_i \delta_{t_i}$, where $\sum a_i = 1$ and $a_i > 0$, satisfies the condition.
3. Let $\nu$ be a measure defined above and $\sigma$ be a rapidly decreasing probability measure. Then $\alpha\nu + \beta\sigma$ ($\alpha + \beta = 1$, $\alpha > \beta \ge 0$) satisfies the condition.
4. Let $\nu$ be a measure defined above and $\sigma$ be a probability measure with compact support. Then $\alpha\nu + \beta\sigma$ ($\alpha + \beta = 1$, $\alpha > 0$, $\beta \ge 0$) satisfies the condition.
5. Let $\psi_1(t) = 1 - |t|$ for $|t| \le 1$ and $\psi_1(t) = 0$ otherwise. Similarly, let $\psi_2(t) = \frac{3}{2}(1 - |t|)^2$ for $|t| \le 1$ and $\psi_2(t) = 0$ otherwise. Two absolutely continuous probability measures with densities $\psi_1$ and $\psi_2$ respectively satisfy the condition because their Fourier transforms are $\mathcal{F}_1 \psi_1(s) = 2s^{-2}(1 - \cos|s|)$ and $\mathcal{F}_1 \psi_2(s) = 6s^{-2}(1 - |s|^{-1} \sin|s|)$.

824

~/2 It(,

6. Let h be a sigmoid function which consists of a finite number of polynomial pieces and is not necessarily continuous. Define a probability measure ~r by a((s, t]) = h(t) - h(s). Then, a is a sum of delta functions and an absolutely continuous measure whose density is piecewise polynomial and whose support is confined in a finite interval. Hence, ~s~a(s) is a linear sum of terms of the forms

and (-1)* k={}

n! (X/-L-fs)_ ~ (n - k)r x (e~-%t', ' ~ - e ~--% ~t7 ~),

where t~ and t~_~ are both ends of a polynomial piece, This sum is analytic, bounded and does not converge to 0 as t - - , ---~. Further, its zeros are discrete. Hence, :f~a satisfies the condition of Corollary 3.8. The polygonal sigmoids illustrated in Example 3.6 and the sigmoid functions defined by ~ and ~_, above are piecewise polynomial sigmoid functions. The polynomials used in the last example can be replaced by some other functions. So we can obtain a number of measures that satisfy the condition of Corollary 3.8. R e m a r k . If a compact set K, a positive constant e. a function f to be approximated and the sigmoid function h satisfying the condition of Corollary 3.8 are all given, each step of its p r o o f can be approximately traced by numerical calculation. Moreover, if a table of constants a,~ and t,~ is prepared for the h in such a way that (3.20) holds for a sufficiently small positive e~ for each r, we can sum them up to approximate a continuous function as soon as a polynomial approximation of the function is obtained. We denote by Sh the set of all linear combinations of shifts of a sigmoid function h:

$$S_h = \Big\{ \sum_{i=1}^{n} a_i h(\cdot - t_i) \;\Big|\; n = 1, 2, \dots,\ a_i \in \mathbf{R},\ t_i \in \mathbf{R} \Big\}. \qquad (3.22)$$

Further, we denote by $P$ the set of all polynomials in $t$. For a compact set $K$ in $\mathbf{R}^d$, set $m_K = \max_{x \in K} |x|$. Recall that the notation $\overline{\,\cdot\,}$ stands for the closure of a set of functions in the uniform topology, and let $|_K$ denote the restriction of functions to a compact set $K$. Let $h \in C^\infty(\mathbf{R})$ be a sigmoid function. We denote by $D_h$ the set of all linear combinations of derivatives of $h$:

$$D_h = \Big\{ \sum_{i=0}^{n} a_i h^{(i)}(t) \;\Big|\; n = 1, 2, \dots,\ a_i \in \mathbf{R},\ t \in \mathbf{R} \Big\}. \qquad (3.23)$$

The final proposition is specific in that the range of shifters $t_i$ in the approximate representation can be restricted to any small neighbourhood of the origin.

COROLLARY 3.10. Let $K \subset \mathbf{R}^d$ be a compact set and $h$ be a sigmoid function. Suppose that

$$P|_{[-m_K, m_K]} \subset \overline{D_h|_{[-m_K, m_K]}}. \qquad (3.24)$$

Then, for an arbitrary continuous function $f$ defined on a compact set $K$ in $\mathbf{R}^d$ and for any $\varepsilon > 0$, there are coefficients $A_i$, rotators $\omega_i$ and shifters $t_i$, $i = 1, \dots, n$, for which a linear combination

$$\bar{f}(x) = \sum_{i=1}^{n} A_i h(\omega_i \cdot x - t_i) \qquad (3.25)$$

satisfies

$$|\bar{f}(x) - f(x)| < \varepsilon \quad \text{on } K. \qquad (3.26)$$

Further, the shifters $t_i$ can be confined in any neighbourhood of the origin.

Proof. Since $h \in C^\infty(\mathbf{R})$, $\frac{1}{\Delta t}\{h^{(n)}(t + \Delta t) - h^{(n)}(t)\}$ converges to $h^{(n+1)}(t)$ uniformly on any compact set as $\Delta t \to 0$. Hence, it is obvious that

$$\overline{D_h|_{[-m_K, m_K]}} \subset \overline{S_h|_{[-m_K, m_K]}}. \qquad (3.27)$$

By eqns (3.24) and (3.27) we obtain that

$$P|_{[-m_K, m_K]} \subset \overline{S_h|_{[-m_K, m_K]}}. \qquad (3.28)$$

This is the assumption of Proposition 3.2 with $c = 1$. Further, set

$$S_{h,\varepsilon} = \Big\{ \sum_{i=1}^{n} a_i h(\cdot - t_i) \;\Big|\; n = 1, 2, \dots,\ a_i \in \mathbf{R},\ |t_i| < \varepsilon \Big\}. \qquad (3.29)$$

For any $\varepsilon > 0$, $S_h$ can be replaced by $S_{h,\varepsilon}$ in (3.27) and (3.28). Hence, Proposition 3.2 concludes the proof. ∎

Remark. The inclusion (3.24) is equivalent to

$$C([-m_K, m_K]) \subset \overline{D_h|_{[-m_K, m_K]}}. \qquad (3.30)$$

If there is a continuous positive function $\gamma$ on $[-m_K, m_K]$ for which $\gamma D_h|_{[-m_K, m_K]}$ satisfies the conditions of the Stone-Weierstrass theorem on $[-m_K, m_K]$, then we have that

$$C([-m_K, m_K]) \subset \overline{\gamma D_h|_{[-m_K, m_K]}}. \qquad (3.31)$$

Since $\gamma$ is continuous and positive, we obtain eqn (3.30) from eqn (3.31). Further, (3.31) follows from

$$P|_{[-m_K, m_K]} \subset \gamma D_h|_{[-m_K, m_K]}. \qquad (3.32)$$
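The step from (3.31) to (3.30) is stated without detail. A short sketch of the argument, assuming only that $\gamma$ is continuous and positive on $[-m_K, m_K]$ (so that it attains a positive minimum there), might run as follows. The symbols $g$, $d$ and $\gamma_{\min}$ are introduced here for illustration and are not part of the original notation.

```latex
% Sketch: (3.31) implies (3.30), under the assumption that gamma is
% continuous and positive on the compact interval [-m_K, m_K].
\begin{align*}
&\text{Let } g \in C([-m_K, m_K]),\quad \varepsilon > 0,\quad
  \gamma_{\min} = \min_{|t| \le m_K} \gamma(t) > 0. \\
&\text{Since } \gamma g \in C([-m_K, m_K]), \text{ (3.31) gives } d \in D_h \text{ with }
  \sup_{|t| \le m_K} \bigl| \gamma(t) g(t) - \gamma(t) d(t) \bigr| < \varepsilon\, \gamma_{\min}. \\
&\text{Hence } |g(t) - d(t)|
  = \frac{\bigl| \gamma(t) g(t) - \gamma(t) d(t) \bigr|}{\gamma(t)}
  < \frac{\varepsilon\, \gamma_{\min}}{\gamma(t)} \le \varepsilon
  \quad \text{for all } |t| \le m_K, \\
&\text{so } g \in \overline{D_h|_{[-m_K, m_K]}}, \text{ which is (3.30).}
\end{align*}
```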

Using the inclusions (3.31) or (3.32) is often convenient for obtaining (3.24). The inclusion (3.31) may often be proved by the Stone-Weierstrass theorem and eqn (3.32) by the polynomial approximation theorem.

EXAMPLE 3.11. We illustrate here three examples, applying the method described above. The first two may be commonly encountered.

1. Set $\varphi(t) = (2\pi)^{-1/2} \exp(-t^2/2)$, $h(t) = \int_{-\infty}^{t} \varphi(s)\,ds$ and $\gamma(t) = (2\pi)^{1/2} \exp(t^2/2)$. Then, $\gamma D_h|_{[-m_K, m_K]}$ satisfies (3.32) because $\gamma D_h$ is the system of the Hermite polynomials.
2. Set $h(t) = (1 + e^{-t})^{-1}$ and $\gamma(t) = 1 + e^{-t}$. Then $\gamma D_h$ is a set of linear combinations of $e^{-nt}/(1 + e^{-t})^{n}$, $n = 0, 1, \dots$. Hence, $\gamma D_h$ satisfies (3.28).
3. Set $h(t) = \exp(-e^{-t})$ and $\gamma(t) = \exp(e^{-t})$. Then, $\gamma D_h$ contains all polynomials in $e^{-t}$ including the constant 1. Hence, the inclusion (3.31) holds.
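The first example, together with the difference-quotient argument in the proof of Corollary 3.10, can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses forward difference quotients (one possible choice), a step `delta = 1e-3` and the interval $[-1.5, 1.5]$, and every function name is hypothetical. It forms linear combinations of shifts of the Gaussian distribution function whose shifters lie in $[-n\delta, 0]$, multiplies them by $\gamma$, and compares the result with $1$, $-t$ and $t^2 - 1$, which are $\gamma h'$, $\gamma h''$ and $\gamma h'''$ for this $h$.

```python
import math


def gaussian_cdf(t):
    """h(t) of Example 3.11.1: the Gaussian distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def gamma(t):
    """gamma(t) of Example 3.11.1: (2*pi)^(1/2) * exp(t^2 / 2)."""
    return math.sqrt(2.0 * math.pi) * math.exp(t * t / 2.0)


def difference_quotient_terms(n, delta):
    """Pairs (a_k, t_k) such that sum_k a_k * h(t - t_k) is the n-th
    forward difference quotient of h; all shifters lie in [-n*delta, 0]."""
    return [((-1) ** (n - k) * math.comb(n, k) / delta ** n, -k * delta)
            for k in range(n + 1)]


def shifted_sum(h, terms, t):
    """Evaluate the linear combination of shifts of h at the point t."""
    return sum(a * h(t - shift) for a, shift in terms)


# gamma * h^(n) for n = 1, 2, 3 (Hermite polynomials up to sign).
TARGETS = {1: lambda t: 1.0, 2: lambda t: -t, 3: lambda t: t * t - 1.0}


def max_deviation(n, delta=1e-3, m=1.5, grid=61):
    """Largest gap on [-m, m] between gamma * (difference quotient) and the target."""
    terms = difference_quotient_terms(n, delta)
    points = [-m + 2.0 * m * j / (grid - 1) for j in range(grid)]
    return max(abs(gamma(t) * shifted_sum(gaussian_cdf, terms, t) - TARGETS[n](t))
               for t in points)


if __name__ == "__main__":
    for order in (1, 2, 3):
        print(order, max_deviation(order))
```

Shrinking `delta` reduces the truncation error but amplifies rounding error, since the coefficients grow like `delta ** -n`; the value used here is a compromise for double precision, not a recommendation.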

Let $h$ and $\gamma$ be any pair illustrated in Example 3.11. Even if both functions are shifted in parallel, the first half of the statement of Corollary 3.10 holds. Hence, the range of the shifters $t_i$ can be restricted to any nondegenerate interval. The first two examples in Example 3.11 motivated Corollary 3.10, and the last one was obtained using the corollary.

4. DISCUSSION

We discuss here mostly the relation of this paper to others. This paper inherits the idea of using a sigmoid function without scaling from Ito (1991). Another important ingredient of this paper is the idea of extending the polynomial approximation on $\mathbf{R}$ to $\mathbf{R}^d$ by using the powers $(\omega \cdot x)^r$ of inner products. It came from another paper by the present author, submitted at the same time as Ito (1991) but not published. That paper was written under the latter idea and included a formula similar to eqn (3.8) in Example 3.4 and others, but sigmoid functions were scaled in it. Immediately after the submission of both papers, an afterthought hit the author: useful is the fact that the closure of shifts of the Gaussian distribution function includes the Hermite polynomials. This led to Example 3.11.1. Given this example, it was difficult to leave the logistic function unchallenged. Thus, Example 3.11.2 was obtained. Then, Example 3.6.1 also became a natural target of the trial. These three examples convinced the author that a theorem of the form of Theorem 2.6 holds. The constructive proofs of Corollaries 3.8 and 3.10 were also thought out after these examples were obtained.

Hinted by Cybenko (1989), the author used the Hahn-Banach theorem, but applied it in a somewhat different way. We took the Fourier transform of both sides of eqn (2.1) and directly expressed the condition of the function $g$ being strongly discriminatory in terms of the Fourier transform.

Let us view the simple eqn (3.1) from the point of Hecht-Nielsen's approach (1987). He reinterpreted the equation which appeared in the improved version by Sprecher (1965) of Kolmogorov's theorem (1957) as a mathematical expression of a neural network. However, nonlinear functions are nested in nonlinear functions in the equation. Consequently, when Hecht-Nielsen's idea is implemented by linear and sigmoid units, the network may inevitably be four layered (Funahashi (1989)). The simple eqn (3.1) can take the place of the famous equation in dealing with the present approximation problem. Moreover, since only the power is nonlinear in eqn (3.1), a network based on this equation can be three layered. The simplicity of eqn (3.1) may be an advantage of using the polynomial approximation theorem.

Quite recently, one of the anonymous referees for our previous two papers pointed the author to Stinchcombe and White (1990). Hinted by their paper, we extended Example 3.6.2 to Example 3.9.6. Only polygonal sigmoid functions were treated there originally, though Corollary 3.8 was already obtained. Stinchcombe and White showed that if the hidden layer sigmoid function belongs to either a class of sufficiently kinky piecewise polynomial functions or that of superanalytic functions, the three-layered feedforward network can be a universal approximator even under a restriction on the range of connection weights and shifting. They stressed the importance of such a restriction in applications, stating that the connection weights and shifting cannot be too great in actual units. From this point of view, this paper may also be meaningful because it concludes that any sigmoid function can be used without scaling. It is interesting that the two classes of sigmoid functions mentioned by them are closely related to Corollaries 3.8 and 3.10 in this paper, respectively.
More remarkable is that some of the concrete examples they illustrated coincide with those we have obtained independently: the polygonal sigmoid function, the Gaussian distribution function and the logistic function. Since the connection weights are exactly restricted to the unit vectors in these examples, they fit this paper well. Of course, they are also good examples for their paper, where bounding the range of the connection weights and shifts is the main topic.

It may be of theoretical interest that any sigmoid function can be used without scaling when the domain of approximation is a compact set, but this is not always practically convenient. One usually needs more units in exchange for the restriction. However, the restriction can sometimes be convenient in theoretical discussion because $S^{d-1}$ is a compact set. The restriction of the range of the shifters $t_i$ stated in the respective corollaries could also be convenient.

In the case of approximation on the whole space $\mathbf{R}^d$, there are many sigmoid functions which cannot be used unless they are scaled. This paper is to be followed by another by the present author, in which we treat the uniform approximation on $\mathbf{R}^d$ with unscaled sigmoid functions and describe a necessary and sufficient condition imposed on a sigmoid function ensuring that it can be used without scaling for implementing the uniform approximation on $\mathbf{R}^d$.
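The discussion above leans on the fact, stated in the abstract, that a homogeneous polynomial can be written as a sum of powers of inner products, $P(x) = \sum_i a_i (\omega_i \cdot x)^r$. As a small illustration only, the sketch below checks one such identity for $P(x) = x_1 x_2$ in $\mathbf{R}^2$ with $r = 2$. The particular unit vectors and coefficients are chosen here for the example and are not taken from the paper, and the function names are hypothetical.

```python
import math

# Unit vectors (rotators) chosen for this illustration only.
OMEGA_1 = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
OMEGA_2 = (1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0))


def dot(u, v):
    """Inner product of two vectors given as tuples."""
    return sum(a * b for a, b in zip(u, v))


def product_via_inner_product_powers(x):
    """x1 * x2 written as (1/2) * [(omega_1 . x)^2 - (omega_2 . x)^2]."""
    return 0.5 * (dot(OMEGA_1, x) ** 2 - dot(OMEGA_2, x) ** 2)


def max_identity_error(m=1.0, grid=21):
    """Largest gap between both sides of the identity on a grid over [-m, m]^2."""
    points = [-m + 2.0 * m * j / (grid - 1) for j in range(grid)]
    return max(abs(product_via_inner_product_powers((x1, x2)) - x1 * x2)
               for x1 in points for x2 in points)


if __name__ == "__main__":
    print(max_identity_error())
```

Because only the one-variable power $s \mapsto s^2$ is nonlinear here, replacing it by a one-variable approximant built from shifts of a sigmoid function would give a three-layered expression with unit-vector weights, which is the point made about eqn (3.1) in the discussion.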

REFERENCES

Carroll, B. W., & Dickinson, B. D. (1989). Construction of neural nets using the Radon transform. '89 IJCNN Proceedings, I:607-611.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303-314.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183-192.
Hecht-Nielsen, R. (1987). Kolmogorov's mapping neural network existence theorem. IEEE First International Conference on Neural Networks, 3, 11-13.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.
Irie, B., & Miyake, S. (1988). Capability of three-layered perceptrons. IEEE International Conference on Neural Networks, 1, 641-648.
Ito, Y. (1991). Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks, 4, 385-394.
Kolmogorov, A. N. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk, 144, 679-681.
Rudin, W. (1966). Real and complex analysis. New York: McGraw-Hill Book Company.
Rudin, W. (1973). Functional analysis. New York: McGraw-Hill Book Company.
Schwartz, L. (1966). Théorie des distributions. Paris: Hermann.
Sprecher, D. A. (1965). On the structure of continuous functions of several variables. Transactions of the American Mathematical Society, 115, 340-355.
Stinchcombe, M., & White, H. (1990). Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights. '90 IJCNN Proceedings, III:7-16.
Yoshida, K. (1968). Functional analysis. New York: Springer-Verlag.