Journal of Network and Computer Applications (1996) 19, 111–118

On the design of a class of neural networks

V. David Sánchez A.
German Aerospace Research Establishment DLR Oberpfaffenhofen, Institute for Robotics and System Dynamics, D-82230 Wessling, Germany

The design of the class of RBF networks is described. These three-layer networks possess the universal and best approximation capability in the framework of supervised learning of real functions of real vectors from examples, and offer a wide range of applications. A learning method for finite network and data size is described, which is efficient for both noise-free and noisy data. Its robust and efficient implementation as well as its evaluation is described. © 1996 Academic Press Limited

1. Introduction

One of the capabilities that must be mastered to build intelligent systems is the design of learning machines. One of the most promising approaches is through neural network architectures. These networks continue to offer a wide range of innovative practical applications [1–4], while their theoretical foundations have advanced significantly in the last few years [5]. Other workers have studied RBF networks [6–10], the improvement of their generalization [11] and centre placement [12–14]. In order to give concrete guidance on how to design these networks, this paper covers the principles of this architecture class in Section 2; a learning method, which allows noise-free and noisy data to be processed within the same framework, together with its robust and efficient implementation, in Section 3; and its experimental evaluation in Section 4. Conclusions are drawn in Section 5.

2. A class of neural networks

Figure 1 shows the architecture of the RBF network class. The networks belonging to this class possess three layers: one input, one hidden, and one output layer. The hidden layer contains neurons realizing basis functions. The output layer contains one neuron for the approximation of functions f: R^n → R. The approximation function g_m realized by a network architecture with m hidden units has the form in eqn (1). The weights w_i, i = 1, ..., m, are parameters weighting the connections between each of the hidden neurons and the output neuron. The basis functions φ_i: R^n → R realized in the individual hidden units are parameterized scalar functions, which are built using a given nonlinearity φ: R → R. To parameterize this function the centres z_i and the widths r_i, i = 1, ..., m, are used; see eqn (2). Examples of nonlinearities for RBF networks are summarized in Table 1.



Figure 1. The RBF network: input x, hidden units realizing the basis functions φ_i with centres z_i, weighted connections w_i to the output summation node Σ, and output y.

Table 1. Examples of RBF nonlinearities (c = constant).

Function name        Function expression φ(r, c)
Linear               r
Cubic                r³
Thin plate spline    r²·log r
Gaussian             exp(−r²/c²)
Multiquadric         (r² + c²)^(±1/2)
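For reference, the nonlinearities of Table 1 can be written as one-line functions of the radial distance r and the constant c. The following is an illustrative sketch (the function names are my own choice, not code from the paper):

```python
import numpy as np

# Sketch of the Table 1 nonlinearities as vectorized functions of the radial
# distance r (a numpy array, r >= 0) and a constant shape parameter c.
RBF_NONLINEARITIES = {
    "linear":        lambda r, c: r,
    "cubic":         lambda r, c: r**3,
    # r^2 * log r, continued with the value 0 at r = 0
    "thin_plate":    lambda r, c: np.where(r > 0, r**2 * np.log(np.where(r > 0, r, 1.0)), 0.0),
    "gaussian":      lambda r, c: np.exp(-r**2 / c**2),
    "multiquadric":  lambda r, c: (r**2 + c**2)**0.5,    # exponent +1/2
    "inverse_mq":    lambda r, c: (r**2 + c**2)**-0.5,   # exponent -1/2
}
```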

g_m(x) = Σ_{i=1}^{m} w_i·φ_i(z_i, r_i, x)                                        (1)

φ_i(z_i, r_i, x) = φ(‖x − z_i‖, r_i),   1 ≤ i ≤ m                                (2)
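To make eqns (1) and (2) concrete, the sketch below (my own illustration; function and variable names are assumptions) evaluates g_m(x) for the Gaussian nonlinearity of Table 1, given centres z_i, widths r_i and weights w_i:

```python
import numpy as np

def rbf_output(x, centres, widths, weights):
    """Evaluate g_m(x) of eqn (1) with Gaussian basis functions, eqn (2).

    x        : input vector, shape (n,)
    centres  : z_i, shape (m, n)
    widths   : r_i, shape (m,)
    weights  : w_i, shape (m,)
    """
    dist = np.linalg.norm(centres - x, axis=1)   # ||x - z_i||, shape (m,)
    phi = np.exp(-dist**2 / widths**2)           # Gaussian nonlinearity from Table 1
    return float(weights @ phi)                  # sum_i w_i * phi_i
```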

3. Learning method

For the architecture class covered, a fixed value of m, and a given training data set Tr = {(x_i, y_i), i = 1, ..., N}, the learning method needs to solve the optimization problem in eqn (3), determining the weight vector w = (w_1, ..., w_m)^T. This leads to the solution of a linear equation system or, more specifically, a linear least-squares problem. Non-iterative methods to solve the resulting linear equation systems include the QR factorization and the singular value decomposition (SVD); iterative methods include the conjugate gradient method. For more details see Golub and van Loan [15], for least-squares methods see Björck [16], and for some implementation details see Press et al. [17].

min_w e_Tr,   e_Tr = (1/N)·Σ_{i=1}^{N} ( y_i − Σ_{j=1}^{m} w_j·φ_j(x_i) )²        (3)

For that purpose, the components of a matrix C ∈ R^(N×m) are defined in eqn (4) and the vector y ∈ R^N is y = (y_1, ..., y_N)^T. At the minimum the gradient of the function being minimized vanishes, which yields the normal equations (eqn (5)).

C_ij = φ_j(x_i),   i = 1, ..., N,   j = 1, ..., m                                 (4)

0 = Σ_{i=1}^{N} ( y_i − Σ_{j=1}^{m} w_j·φ_j(x_i) )·φ_k(x_i),   k = 1, ..., m      (5)

Defining the components of a matrix A = C^T·C and a vector b = C^T·y as in eqn (6), the normal equations can also be written in matrix form as in eqn (7).

A_kj = Σ_{i=1}^{N} φ_j(x_i)·φ_k(x_i),   b_k = Σ_{i=1}^{N} y_i·φ_k(x_i),   j, k = 1, ..., m      (6)

A·w = b                                                                           (7)
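As an illustration only (not the paper's implementation), forming A = C^T·C and b = C^T·y and solving eqn (7) directly looks as follows; as noted above, QR or SVD applied to C itself is usually preferred because A can be poorly conditioned:

```python
import numpy as np

def solve_normal_equations(C, y):
    """Solve A*w = b of eqns (6)-(7) for the weight vector w.

    C : design matrix, C[i, j] = phi_j(x_i), shape (N, m)
    y : targets (y_1, ..., y_N), shape (N,)
    """
    A = C.T @ C          # A_kj = sum_i phi_j(x_i) * phi_k(x_i)
    b = C.T @ y          # b_k  = sum_i y_i * phi_k(x_i)
    return np.linalg.solve(A, b)
```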

Using the singular value decomposition, the problem in eqn (7) can be posed as min_w ‖C·w − y‖², and the matrix C can be represented as in eqn (8). The solution is given in eqn (9). The k_i are the singular values of the matrix C, R = diag(k_1, ..., k_m), and U^(i) and V^(i) are vectors of length N and m, which represent the columns i = 1, ..., m of the matrices U and V.

C = U·R·V^T                                                                       (8)

w = Σ_{i=1}^{m} (U^(i)·y / k_i)·V^(i)                                             (9)
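A minimal numpy sketch of eqns (8) and (9), assuming all singular values k_i are nonzero (a robust implementation would discard terms with very small k_i):

```python
import numpy as np

def solve_weights_svd(C, y):
    """Weights via the SVD of the design matrix, eqns (8)-(9)."""
    U, k, Vt = np.linalg.svd(C, full_matrices=False)   # C = U * diag(k) * Vt
    # w = sum_i (U^(i) . y / k_i) * V^(i)
    return Vt.T @ ((U.T @ y) / k)
```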

The implementation of the learning method is summarized in Fig. 2.

4. Experimental evaluation

For the experimental evaluation a training and a test data set with N and M = N − 1 elements, respectively, are generated using a given one-dimensional function y = f(x).

Inputs
• Training data set Tr = {(x_i, y_i), i = 1, ..., N}, x_i ∈ R^n, y_i ∈ R.
• Test data set Te = {(x_k, y_k), k = 1, ..., M}, x_k ∈ R^n, y_k ∈ R.
• Number m and set of basis functions {φ_j: R^n → R, j = 1, ..., m}.

Algorithm
• Define the matrix C and the vector y:
  C_ij = φ_j(x_i), i = 1, ..., N, j = 1, ..., m;   y = (y_1, ..., y_N)^T
• Singular value decomposition: C = U·R·V^T
• Weight computation:
  w = Σ_{i=1}^{m} (U^(i)·y / k_i)·V^(i)
• Evaluation of the memorization and generalization capability:
  e_Tr = (1/N)·Σ_{i=1}^{N} ( y_i − Σ_{j=1}^{m} w_j·φ_j(x_i) )²
  e_Te = (1/M)·Σ_{k=1}^{M} ( y_k − Σ_{j=1}^{m} w_j·φ_j(x_k) )²

Figure 2. Learning method for RBF networks.
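The two error measures at the bottom of Fig. 2 are plain mean squared errors over the training and test sets. A minimal sketch (my own code, assuming the design matrices are built as in eqn (4)):

```python
import numpy as np

def mean_squared_error(C, y, w):
    """e = (1/K) * sum_k (y_k - sum_j w_j*phi_j(x_k))^2, with C[k, j] = phi_j(x_k)."""
    residual = y - C @ w
    return float(np.mean(residual ** 2))

# e_Tr uses the training design matrix and targets, e_Te the test ones:
# e_tr = mean_squared_error(C_train, y_train, w)
# e_te = mean_squared_error(C_test, y_test, w)
```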

The optimal network architecture belonging to the class defined in eqn (10) and its corresponding approximation function g*_m are determined. Optimality is judged using the generalization capability based on the test data set (Fig. 2), whereas training is done using the learning method of Section 3 and the training data set. The remaining error after training represents the memorization capability of the network architecture (Fig. 2).

U_N = { g_m | g_m(x) = Σ_{i=1}^{m} w_i·φ_i(z_i, r_i, x), 1 ≤ m ≤ N }              (10)

4.1 Noise-free data

The function f: [0, 1] → [−1, 1], f(x) = cos(2πx), and equidistant partitions of the interval [0, 1] are used to generate N = 21 training data points (x_i, y_i), y_i = f(x_i), where x_i = i·δ, i = 0, ..., N − 1 = 20, δ = 1/(N − 1) = 0·05, and M = 20 test data points (x_j, y_j), y_j = f(x_j), where x_j = (j + 0·5)·δ, j = 0, ..., M − 1 = 19. In addition to the training and test data, the number, the centres and the widths of the basis functions are needed as inputs for the learning method of Section 3 (Fig. 2). m varies between 2 and N = 21. Centres are chosen equidistantly in the interval [0, 1]: z_k = k·δ_Z, k = 0, ..., m − 1, δ_Z = 1/(m − 1). A Gaussian nonlinearity is used (Table 1). The widths are chosen as a constant, r_k = r = δ_Z = 1/(m − 1), k = 0, ..., m − 1. For an upper bound of O(1/m²) on the approximation error when choosing r ∝ 1/m, see Liu and Si [18].
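The noise-free setup can be reproduced in a few lines. The sketch below (my own code, following the choices stated above) builds the training and test abscissae and the Gaussian design matrix for m equidistant centres of constant width 1/(m − 1):

```python
import numpy as np

f = lambda x: np.cos(2 * np.pi * x)        # target function of Subsection 4.1

N, M = 21, 20
delta = 1.0 / (N - 1)                      # 0.05
x_tr = np.arange(N) * delta                # x_i = i * delta, i = 0, ..., N - 1
x_te = (np.arange(M) + 0.5) * delta        # test points halfway between training points
y_tr, y_te = f(x_tr), f(x_te)

def gaussian_design(x, m):
    """Design matrix C[i, k] = exp(-(x_i - z_k)^2 / r^2) for m equidistant centres."""
    z = np.linspace(0.0, 1.0, m)           # centres z_k = k / (m - 1), k = 0, ..., m - 1
    r = 1.0 / (m - 1)                      # constant width, r = delta_Z
    return np.exp(-((x[:, None] - z[None, :]) ** 2) / r ** 2)
```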

Figure 3. Approximation error e versus m for noise-free data (training and test data sets).

Figure 3 shows the approximation error e for the training and test data sets versus m, the number of hidden units. The curve of the approximation error for the training data set shows the results of 20 training runs with the learning method of Section 3 for m = 2, ..., 21. Relative minima of the approximation error for the training and test data sets are at m = 5 and m = 6, respectively. m = 6 is a global minimum for the test data set, whereas the global minimum for the training data set is at m = N = 21. The goal of learning is to maximize the generalization capability. This is the case for m = 6, and the optimal architecture in this sense is g*_6. Overtraining is present at m = 21, confirming that maximizing the memorization capability of an architecture, i.e. minimizing the approximation error for the training data set, is not consistent with the goal of learning. In Fig. 4A the approximation with the optimal architecture is shown as a solid line, together with the given training and test data. Figure 4B shows the remaining error for the training and the test data set.

Figure 4. Approximation with the optimal network for noise-free data. A, approximation; B, error (training and test data shown).

4.2 Noisy data

A noisy training data set is generated using the expression y_i = f(x_i) + e_i, i = 0, ..., N − 1, for the function f given in Subsection 4.1. Gaussian noise e with mean μ = 0 and standard deviation σ_n is used. Therefore e_i = σ_n·n_i, where the random variable n_i ∼ n(0, 1) (n denotes the normal distribution) and σ_n = 0·1 is chosen. In a similar way a test data set with M = N − 1 elements is generated. For the definition of the x-partitions for the training and the test data set, see Subsection 4.1; the only difference is the number of elements. For the noisy data, N = 51 and M = 50.
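For the noisy case, the following self-contained sketch (my own code; the random seed is an arbitrary choice) generates the data with σ_n = 0·1, sweeps m, and selects the architecture with the smallest test error, i.e. the largest generalization capability. The design-matrix helper repeats the earlier sketch, and np.linalg.lstsq solves the least-squares problem via an SVD-based LAPACK routine:

```python
import numpy as np

def gaussian_design(x, m):
    # Gaussian basis with m equidistant centres in [0, 1], constant width 1/(m - 1)
    z = np.linspace(0.0, 1.0, m)
    return np.exp(-((x[:, None] - z[None, :]) * (m - 1)) ** 2)

rng = np.random.default_rng(0)                             # arbitrary fixed seed
sigma_n, N, M = 0.1, 51, 50
delta = 1.0 / (N - 1)
x_tr, x_te = np.arange(N) * delta, (np.arange(M) + 0.5) * delta
y_tr = np.cos(2 * np.pi * x_tr) + sigma_n * rng.standard_normal(N)
y_te = np.cos(2 * np.pi * x_te) + sigma_n * rng.standard_normal(M)

results = []
for m in range(2, N + 1):                                  # m = 2, ..., N
    C_tr, C_te = gaussian_design(x_tr, m), gaussian_design(x_te, m)
    w = np.linalg.lstsq(C_tr, y_tr, rcond=None)[0]         # SVD-based least squares
    e_tr = float(np.mean((y_tr - C_tr @ w) ** 2))          # memorization error
    e_te = float(np.mean((y_te - C_te @ w) ** 2))          # generalization error
    results.append((m, e_tr, e_te))

m_opt = min(results, key=lambda t: t[2])[0]                # smallest test error
```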


Figure 5. Approximation error e versus m for noisy data (training and test data sets).

Figure 5 shows the approximation error for the training and the test data set versus the number of hidden units m. m = 3 is the global minimum of the approximation error for the test data; the optimal network architecture therefore has three hidden neurons. The approximation function g*_3 and the remaining error f − g*_3 are shown in Fig. 6A and B, respectively. The same functions are shown in Fig. 7 for the case of overtraining with m = N = 51. In this case noise is also learned by the network architecture; this corresponds to the global minimum of the approximation error for the training data set (Fig. 5), e_Tr = 0 for m = 51.

Figure 6. Approximation with the optimal network for noisy data. A, approximation; B, error (training and test data shown).

Figure 7. Overtraining. A, approximation; B, error (training and test data shown).

5. Conclusions

Key aspects of the design of the class of RBF networks have been presented. The basic architecture and its approximation function were described in detail. An efficient learning method to determine the weights between the hidden and the output layer for noise-free and noisy data was derived, based on optimizing the memorization capability of the network architecture. In the experiments, the optimal architectures were determined by maximizing the generalization capability of the solution architecture. For both noise-free and noisy data, the appropriate choice of the number of hidden units and their basis functions was demonstrated.

References

1. V. D. Sánchez A. and G. Hirzinger 1992. The state of the art of robot learning control using artificial neural networks—an overview. In The Robotics Review 2 (O. Khatib, J. J. Craig and T. Lozano-Pérez, eds), pp. 261–283. Cambridge, MA: The MIT Press.
2. S. Y. Foo, H. Szu and Y. Takefuji 1995. Special issue on optimization and combinatorics. Neurocomputing, 8.
3. T. Fukuda, K. S. Narendra and Y.-H. Pao 1995. Special issue on control and robotics. Neurocomputing, 9.
4. P. Treleaven, A. Ben-David and Y.-H. Pao 1996. Special issue on financial applications. Neurocomputing, 10.
5. V. D. Sánchez A. 1991. Neurocomputing—state of the art. In Mechanics Computing in 1990's and Beyond, Vol. 1 (H. Adeli and R. L. Sierakowski, eds), pp. 23–42. New York: ASCE.
6. D. S. Broomhead and D. Lowe 1988. Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–355.
7. J. Moody and C. J. Darken 1989. Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
8. T. Poggio and F. Girosi 1990. Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1497.
9. F. Girosi 1992. Some extensions of radial basis functions and their applications in artificial intelligence. Computers and Mathematical Applications, 24, 61–80.
10. L. Xu, A. Krzyzak and A. Yuille 1994. On radial basis function nets and kernel regression: statistical consistency, convergence rates, and receptive field size. Neural Networks, 7, 609–628.
11. C. Bishop 1991. Improving the generalization properties of radial basis function neural networks. Neural Computation, 3, 579–588.
12. V. D. Sánchez A. 1995. On the number and the distribution of RBF centers. Neurocomputing, 7, 197–202.
13. V. D. Sánchez A. 1995. Second derivative dependent placement of RBF centers. Neurocomputing, 7, 311–317.
14. P. A. Jokinen 1991. A nonlinear network model for continuous learning. Neurocomputing, 3, 157–176.
15. G. H. Golub and C. F. van Loan 1989. Matrix Computations. Baltimore: The Johns Hopkins University Press.
16. A. Björck 1990. Least squares methods. In Handbook of Numerical Analysis, Vol. I (P. G. Ciarlet and J. L. Lions, eds), pp. 465–647. Amsterdam: Elsevier Science Publishers.
17. W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling 1988. Numerical Recipes in C. Cambridge: Cambridge University Press.
18. B. Liu and J. Si 1994. The best approximation to C² functions and its error bounds using regular-center Gaussian networks. IEEE Transactions on Neural Networks, 5, 845–847.