Dynamically capacity allocating neural networks for continuous learning using sequential processing of data


Original Research Paper

Chemometrics and Intelligent Laboratory Systems, 12 (1991) 121-145
Elsevier Science Publishers B.V., Amsterdam
Petri A. Jokinen
NESTE Technology, P.O. Box 310, SF-06101 Porvoo (Finland)
(Received 12 February 1991; accepted 17 June 1991)

Jokinen, P.A., 1991. Dynamically capacity allocating neural networks for continuous learning using sequential processing of data. Chemometrics and Intelligent Laboratory Systems, 12: 121-145.

A nonlinear network model with continuous learning capability is described. The dynamically capacity allocating (DCA) network model is able to learn incrementally as more information becomes available and to avoid the spatially unselective forgetting of commonly used learning algorithms for artificial neural networks. These nonlinear network models are compared to other methods on some classification problems and on multivariate calibration of spectroscopic data. In the example cases studied, the DCA networks are able to achieve performances that are better than or at least equal to those of the linear and nonlinear methods tested. In addition to the good prediction performance on the test problems, the DCA networks are able to construct the model using only sequential processing of data. This means that the training data can be collected simultaneously while the network model is already in operation.

INTRODUCTION

The mathematical nonlinear network models, or neural networks, are based on a crude analogy to the biological neural networks. This analogy is vague for many network types presented in the literature and it is therefore not pursued any further in this work. Instead, the nonlinear networks are presented as a general method for solving estimation problems encountered in chemometrics and especially in the process industry. Such problems are, for example, accurate control of nonlinear plants, fault detection and diagnosis of complex processes, and quality control of products.


The most important feature of nonlinear networks for applications in the process industry is the ability to learn. Continuous learning, in particular, is needed to overcome the common knowledge-acquisition bottleneck. This bottleneck must be understood in a broad sense, covering acquisition of sensory information and notation about process conditions, faults, operational methods, etc. The acquisition of a proper data set for the training of a neural network, regression analysis or any other modelling method might be too expensive, time consuming or in some cases dangerous. All these facts together lead to the conclusion that the discovery of nonlinear modelling methods with a capability to learn continuously and autonomously is a valuable breakthrough in terms of application development in the process industry and many other domains.


Nonlinear network models motivated by a crude analogy to the biological neural networks were first presented by McCulloch and Pitts [1] in 1943. They formulated the models in a form of Boolean algebra and this was one motivation for the development of logic circuits, now common in all computers. In the beginning of the 1960s Rosenblatt invented the Perceptron, which was the first network model with adaptive learning behavior, although it was a linear model. The neurally inspired models were forgotten for a long time after the end of the 1960s due to the criticism presented by Minsky and Papert [2]. In the 1970s only a few people continued the research work, among them Grossberg [3], Kohonen [4] and Fukushima [5]. Most of the work related to neural networks was then done in pattern recognition. The attention of the research community was drawn back to neural networks by the presentations of Hopfield [6] at the beginning of the 1980s. A final breakthrough was perhaps the back-propagation algorithm presented by Rumelhart and Hinton in 1986 [7]. The back-propagation algorithm had been developed independently by Werbos [8] at the beginning of the 1970s, but it was not well known at that time. Nonlinear networks of the back-propagation type have been applied to problems in chemometrics by Long et al. [9], who created multivariate calibration models using spectroscopic data.

This work describes an approach to the continuous learning problem based on nonlinear network models. In particular, a nonlinear network model for continuous learning is presented along with two different learning algorithms. The Dynamically Capacity Allocating (DCA) networks [10] are proposed as continuous learning estimators of unknown functions. The DCA networks are an extension of kernel estimation methods, in which an unknown function is interpolated using a functional form

f(x) = \sum_{i=1}^{k} b_i K(x - w_i)    (1)


where K(·) is the kernel function and the b_i are scalar constants; x and w_i are n × 1 column vectors and k is the number of kernel functions. In the case of DCA networks the multivariate kernel function is a Gaussian kernel, which is shown to have many desirable properties in terms of practical applications and the interpretations that can be given to the model. The DCA networks differ from Eqn. 1 in the learning algorithms that are used for identification of the models: the number of Gaussian kernel functions is variable during learning and the function form used for interpolation is changed dynamically to achieve a small prediction error.

This paper is organized as follows. The following section presents the DCA networks in a general setting and describes the structure of the models. The learning algorithms and methods for selection of the network parameters are then explained, followed by a presentation of some examples of DCA networks for applications in classification problems and the interpretation of spectral information. An appendix presents a subset of the developed models and algorithms implemented as a collection of MATLAB [11] function files.

DCA NETWORK STRUCTURE

In this section a nonlinear type of network is presented for continuous learning. This type of network is related to radial basis function networks as well as to several function approximation methods. The novelty of this model lies in its ability to estimate simultaneously the functional form as well as the model parameters. These models therefore implement an open learning system. In particular, these networks can be considered as an extension of kernel estimation methods, where the scalar valued function f(x) is approximated using a linear combination of kernel functions K(·), see Eqn. 1.

The DCA networks have a three-layer, feed-forward connection structure. Fig. 1 shows a network with seven inputs, three hidden nodes and two outputs, that is, the approximated function is now a vector-valued function. In this network, all outputs from the input layer are connected to all hidden nodes.

Fig. 1. Connection structure of the proposed DCA network.

If the current input pattern vector is x, the weights of node i are w_i and A is a weighting matrix, then the output activation h_i of hidden node i is

h_i = \exp[-(x - w_i)^T A (x - w_i) / (2a)]    (2)

where a is a positive scalar constant, x and w_i are n × 1 column vectors and A is a positive definite n × n square matrix. The outputs of all hidden nodes h_i are used for calculating the jth output of the network, o_j:

o_j = \sum_{i=1}^{k} b_{ij} h_i    (3)

In all, this three-layer connection structure of DCA networks corresponds to the kernel estimate of Eqn. 1 with the multivariate Gaussian kernels given by Eqn. 2. Each hidden node h_i of the network model computes a quadratic function (x - w_i)^T A (x - w_i) of the input pattern x instead of the linear function w^T x commonly assumed for each node in neural network models. Eqn. 2 also means that each hidden node computes the square of the weighted distance between the stored pattern vector w_i and the current input pattern vector x. This squared distance is used to calculate the activation of each hidden node, Eqn. 2.

On the basis of the work of Stinchcombe and White [12], Girosi and Poggio [13] have shown that the model structure of DCA networks is able to approximate any continuous function with arbitrary accuracy. A similar result has also been derived independently by Hartman et al. [14].

If the weighting matrix A is an identity matrix I, then the distance metric is simply the Euclidean distance and the output activation function of each hidden node is symmetric with respect to all input pattern vector components. If the weighting matrix A is equal to the inverse of the input pattern covariance matrix, then it is a special case of the Mahalanobis distance that maximizes the separation between input pattern vectors x [15]. This selection of the distance function between stored and input pattern vectors is not the only possibility; some other commonly used measures such as the Hamming, Minkowski or Canberra metrics [15] could be used instead. The particular form of Eqn. 2 was selected because it provides the interpretation of each hidden node covering a part of a continuous valued input pattern vector space. As Franke [16] has pointed out, the Gaussian basis functions, Eqn. 2, cannot interpolate polynomials of any order with zero error, but they provide the possibility of changing the interpolating function locally, affecting only a small neighborhood of the interpolating function f(·). This makes it possible to incorporate new information selectively in the network without corrupting other previously learned parts of the function. This property makes spatially selective updating of the models possible and justifies the use of multivariate Gaussian kernels.

For the dynamic allocation of hidden nodes during learning, each hidden node is defined to respond to a different region in the input pattern vector space. If the Gaussian-shaped activation function is cut at some height, then the borderline of this cut defines an N-dimensional ellipsoid in the input pattern vector space. The equation of this N-dimensional ellipsoid is

(x - w_i)^T A (x - w_i) = c    (4)

If the weight vector w_i, the weighting matrix A and the positive scalar constant c are defined, then the constant a in Eqn. 2 defines the steepness of the activation function response.
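As a brief illustrative sketch (not part of the original Appendix library; the numerical values of x, w, A, a and c are arbitrary placeholders), the activation of Eqn. 2 and the ellipsoid test of Eqn. 4 for a single hidden node can be evaluated in MATLAB as follows:

% Sketch: activation (Eqn. 2) and ellipsoid membership test (Eqn. 4) for one node.
x = [1.0; 2.0];            % current input pattern vector (example values)
w = [0.8; 2.1];            % stored pattern vector of the hidden node
A = eye(2);                % weighting matrix (Euclidean distance)
a = 10; c = 1;             % steepness constant and ellipsoid size constant
d2 = (x - w)'*A*(x - w);   % squared weighted distance
h  = exp(-d2/(2*a));       % hidden node activation, Eqn. 2
inside = (d2 <= c);        % true if x lies inside the node's ellipsoid, Eqn. 4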


Fig. 2 shows some possible ellipsoid borders in a two-dimensional input pattern vector space.

SUPERVISED LEARNING USING DCA NETWORKS

Non-iterative algorithm

Iterative learning algorithms for neural networks, such as the back-propagation algorithm, are too slow for many applications of continuously learning networks. One input pattern vector may be important for the network operation but it may occur so rarely that iterative learning methods cannot memorize it within a reasonable amount of time. The learning algorithm should also implement selective forgetting of information learned earlier, without distortion of previously stored patterns. To overcome these problems the networks should have some kind of built-in attention system to notice what changes in the input pattern vectors are important and what can be ignored.

A simple supervised learning algorithm with an internal attentional system for DCA networks is presented here. This learning algorithm is a fast non-iterative method with predefined control over the maximum number of hidden nodes of the network. The learning Algorithm 1 [10] can be described by the following five steps:

1. Select a suitable weighting matrix A and constant c to fix the shape and size of the N-dimensional ellipsoid for all hidden nodes. Select constant a, Eqn. 2, to define the steepness of the responses of the hidden nodes. Fix the number of inputs and outputs of the network and set the number of hidden nodes k initially to zero.

2. If k = 0 then go to step 3b, else calculate the network outputs o_j for a new pattern vector x using the currently defined network and parameters (Eqns. 2 and 3).

Fig. 2. Ellipsoid bounds of five hidden nodes in the two-dimensional input pattern vector space.

3. Compare the calculated output o_j with the true desired output y_j. If the error is smaller than the predefined tolerance ε,

|o_j - y_j| < ε

then go back to step 2. If the error is bigger, then check whether the input pattern vector is within the N-dimensional ellipsoid of any hidden node:

(x - w_i)^T A (x - w_i) ≤ c

3a. If x is inside more than one currently allocated N-dimensional ellipsoid, then go back to step 2.

3b. If x is outside all currently allocated N-dimensional ellipsoid regions, allocate a new hidden node with w_{k+1} = x and set k = k + 1.

3c. If x is inside one currently allocated N-dimensional ellipsoid p, then change the center vector of this node to w_p = x.

4. Calculate the matrix b by solving the system of linear equations Hb = Y, where the k × k matrix H is obtained as

H_{ij} = \exp[-(w_i - w_j)^T A (w_i - w_j) / (2a)]

The matrix Y contains the desired output responses y_i of the network corresponding to the same stored pattern vectors.

5. Go back to step 2.


Notice that only one of the steps 3a, 3b and 3c is executed for each new pattern vector x. The learning algorithm as a whole is an infinite loop that processes incoming pattern vectors sequentially. The matrix H is symmetric and all elements on the diagonal are ones; this information can be used to reduce the number of floating point operations required to calculate H.

The supervised learning algorithm just presented guarantees that the network response is always accurate on the k stored input pattern vectors. Due to the noise in the pattern vectors, these stored vectors are not fixed; they are changed dynamically during the normal, continuous operation of the network. This means that the center vectors of the N-dimensional ellipsoids are allowed to move in the input pattern vector space, but the center of each ellipsoid is kept outside all other ellipsoids.

It is also possible to obtain an approximating solution between the x and y vectors. If the data are noisy, this may compensate the error that is introduced by fitting the model exactly to the stored pattern vectors. Using the regularization approach, the equation Hb = Y is modified: the matrix H is replaced by H + λI, where I is an identity matrix and λ is a 'small' positive parameter whose magnitude is proportional to the amount of noise in the input data vectors x. The matrix b is then given by

b = (H + \lambda I)^{-1} Y    (5)

The original exact solution is recovered by letting λ go to zero. Both learning methods guarantee that if the desired response to a particular input pattern vector changes, this new behavior is learned immediately. Step 3c of the learning algorithm corresponds to learning in the case of a time-variant function F(x, t) between x and y.
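A minimal sketch of the regularized solution of Eqn. 5 in MATLAB (H, Y and λ are represented here by small example variables; in the network they would be the k × k kernel matrix, the stored desired responses and the chosen noise parameter):

% Sketch: regularized solution of Eqn. 5, b = (H + lambda*I)^(-1)*Y.
H = [1.0 0.2; 0.2 1.0];          % example k-by-k kernel matrix
Y = [0.5; 1.5];                  % example desired responses
lambda = 0.01;                   % 'small' positive regularization parameter
k = size(H,1);
b = (H + lambda*eye(k)) \ Y;     % letting lambda -> 0 recovers Hb = Y exactly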

The basic learning algorithm can be modified to include dynamic allocation of network outputs during learning. The following step is added between steps 3 and 4:

3.5. If a new separate network output is needed for the input vector x, add a new column to the matrix Y that contains the desired output responses y_i of the network. The last row of this new column is made equal to the new desired response of this output and all other rows are filled with zeros.

A method like this can be used mainly for classification tasks that have simple binary desired responses. This kind of operation is useful in the construction of DCA network based expert systems that can learn more responses (classes) by allocating additional network outputs for them. It should be noticed that this part of the learning algorithm is also supervised, because the column of new output responses must be provided to the network.

If the training of the DCA network can be done with some amount of precollected data, then the matrix b can be calculated as the solution of a least squares estimation problem. Suppose that the l × n matrix of data objects is X, with n variables and l observations, and let x_i represent the transpose of the ith object in X. The coefficient matrix H is then computed as

H_{ij} = \exp[-(x_i - w_j)^T A (x_i - w_j) / (2a)]    (6)

H is an l × k matrix and the b matrix is obtained as the standard least squares solution

b = (H^T H)^{-1} H^T Y    (7)

The centers of the Gaussian functions w_i are first found using the learning algorithm with sequential processing; afterwards a new matrix b is computed using Eqns. 6 and 7. Notice that there is a difference in the dimensions of the matrix H: Eqn. 6 computes H for solving a least squares estimation problem and its dimensions are l × k, whereas in the sequential learning algorithm the dimensions of H are k × k, corresponding to exact interpolation.
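This batch recalculation is essentially what the library function recalcb1.m in the Appendix does; a self-contained sketch with synthetic placeholder data might look like this:

% Sketch: batch least squares recalculation of b using Eqns. 6 and 7.
X  = rand(13,5);           % l-by-n data matrix, one calibration object per row
Yd = rand(13,2);           % l-by-q matrix of desired outputs
W  = X(1:4,:)';            % n-by-k matrix of stored centers (here 4 of the objects)
A  = eye(5); a = 1;        % weighting matrix and smoothing constant
[l,n] = size(X); k = size(W,2);
H = zeros(l,k);
for i = 1:l
  for j = 1:k
    d = X(i,:)' - W(:,j);
    H(i,j) = exp(-0.5*d'*A*d/a);   % Eqn. 6
  end
end
b = (H'*H) \ (H'*Yd);              % Eqn. 7, standard least squares solution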

Selecting the network parameters

The weighting matrix A and the scalar constants c and a, together with the desired tolerance of the network response, are the design parameters of the DCA network. These parameters should be chosen in such a way that noise in the input pattern vector components is not able to allocate redundant hidden nodes.


The control over the N-dimensional ellipsoid shape and size allows the construction of networks that are highly tolerant to noise, and this is different from other network types. The constant a defines the steepness of the activation function responses of the hidden nodes: if a is made larger, then the overlapping is increased, and vice versa. The interpolation or 'generalization' properties of these networks depend mainly on the ratio of the parameters c and a. The optimal value of this ratio, however, depends on the form of the learned function and is difficult to determine analytically.

It should be noted that the network parameter a can be changed during learning without any problems. The coefficient matrix b must be recalculated (as in step 4), but no other changes are necessary and all stored information is retained. Actually, the matrix b will be calculated automatically during learning at step 4 and no additional computations result directly from the change in the value of a. This property makes it possible to experiment with different values of a after the operation of the network has started, and the amount of necessary a priori knowledge of the problem is reduced.

For practical implementations of the DCA network it is necessary to know how many hidden nodes could be needed for the solution of a given problem. After the shape and size of the N-dimensional ellipsoid and the size of the input pattern vector space are defined, an estimate of the maximum number of possibly allocated hidden nodes n_est in the network can be obtained from Eqn. 8:

n_{est} = \prod_{i=1}^{N} (\max x_i - \min x_i) / \sqrt{c \lambda_i}    (8)

where max x_i and min x_i are a priori estimates of the maximum and minimum values attainable by each input pattern vector component during the operation of the network, the constant c defines the border of the N-dimensional ellipsoid of each hidden node (Eqn. 4), the λ_i are the eigenvalues of the inverse of the weighting matrix, A^{-1}, and N is the dimension of the N × N matrix A. Eqn. 8 has been obtained by dividing the volume of the N-dimensional ellipsoid defined by the maximum and minimum components of x by the volume of the N-dimensional ellipsoid reserved for each node in the network. This expression is not accurate, because it assumes that the N-dimensional ellipsoids of the nodes overlap by 50% and that the input pattern vectors are uniformly distributed over the possible N-dimensional input pattern vector ellipsoid.
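A sketch of this estimate in MATLAB (the ranges, weighting matrix and constant c below are arbitrary example values):

% Sketch: estimated maximum number of hidden nodes from Eqn. 8.
xmax = [10; 5];                  % a priori maxima of the input components
xmin = [0; 1];                   % a priori minima of the input components
A = diag([1/4, 1]);              % weighting matrix
c = 1;                           % ellipsoid border constant
lam = eig(inv(A));               % eigenvalues of A^(-1)
n_est = prod((xmax - xmin) ./ sqrt(c*lam));   % estimated maximum number of nodes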

The weighting matrix A defines the shape of the N-dimensional ellipsoid and it can be any positive definite matrix. If all components of the input pattern vector x have equal variance, then A could be chosen simply as an identity matrix, which means that the shape is a hypersphere. More commonly, the variances of the vector components are not equal. In this case A could be chosen as a diagonal matrix with the inverses of the component variances σ_i^2 as diagonal terms. If the correlation between input pattern vector components is zero, then this method selects the inverse of the input pattern vector covariance matrix as the weighting matrix A. Eqn. 9 is an example of one such weighting matrix in a two-dimensional case with component variances σ_1^2 and σ_2^2:

A = \begin{bmatrix} 1/\sigma_1^2 & 0 \\ 0 & 1/\sigma_2^2 \end{bmatrix}    (9)

In practice the true variances are usually unknown, but reasonable estimates can be made and used in Eqn. 9. If a reliable estimate of the input covariance matrix can be made beforehand, then the inverse of this matrix can also be selected directly as A.

The shape of the N-dimensional ellipsoid is now fixed, but the initial size is set with the parameter c. This selection is the most difficult part of the design process and some experimental work is usually needed. There are at least two different ways to tackle this problem. The first is to use Eqn. 8 and select the value of c so as to limit the size of the network to a practical value in terms of computing resources.


The second possibility is to choose the smallest difference between input and stored vectors that must be stored separately in the network. This is done by setting all but one component of this difference vector to zero; the non-zero value is chosen to reflect our view of an 'important change of input':

x - w_i = \begin{bmatrix} d \\ 0 \\ 0 \end{bmatrix}    (10)

Eqn. 10 shows such a three-dimensional difference vector, with d representing the smallest change in input vector component 1 that should be stored separately by the network. Using this difference vector, the constant c is calculated using Eqn. 4.
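A sketch of this calculation (the values of d and A are arbitrary placeholders): for a difference vector with only the first component non-zero, Eqn. 4 gives c directly as the weighted squared length of that vector.

% Sketch: choosing c from the smallest 'important change of input' (Eqns. 10 and 4).
A = diag([1/0.5^2, 1/2^2, 1]);   % example weighting matrix (inverse variances)
d = 0.3;                         % smallest important change in input component 1
v = [d; 0; 0];                   % difference vector of Eqn. 10
c = v'*A*v;                      % border constant c from Eqn. 4 (here d^2*A(1,1))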

The design process has now been completed, but it should be noted that this design is based on heuristic assumptions about the parameters and, for a particular application, these parameters may have to be chosen in a different way.

Figs. 3 and 4 demonstrate the effect of the smoothing parameter a. Fig. 3 shows five two-dimensional activation functions of hidden nodes with a = 1; the response is clearly not smooth between individual nodes. Fig. 4 shows the same activation functions with a = 10, which means that the overall response is much smoother. The generalization behavior of the network is much better in Fig. 4, but the nodes in Fig. 3 are able to give a zero response to unknown input patterns.

Fig. 3. The response of five hidden nodes with a = 1.

Fig. 4. The response of five hidden nodes with a = 10.

The network is able to provide conservative answers if the parameter a is small compared to c; the answers then contain essentially the stored pattern vectors only. If the value of parameter a is increased, then the network is able to do more generalization. This smooth change of the operating mode makes it possible to obtain definitive answers or 'intelligent guesses' from the system, depending on the needs of the user. In other words, the system may normally be operated with little 'generalization', but in unexpected or difficult situations the user may ask it for a more generalized answer. Generalization is a term used commonly in the field of neural networks; here it can be defined as the ability to interpolate values of the learned function between stored pattern vectors. Even extrapolation is meaningful, because the response tends to give conservative estimates outside the normal operating region. Polynomial methods, for example, have a tendency to explode in similar circumstances.

Algorithm 1 for DCA networks learns time-dependent functions F(x, t) between pattern vectors. Due to the time-dependence, the centers of the Gaussian functions must be moved if the prediction error is larger than the predefined tolerance. This is accomplished in step 3c of the learning algorithm. At the same time, step 3c selectively updates the model spatially with the most recent information. The learning algorithm assumes that the most recent information is also the most accurate for the current function F that is estimated. Therefore the presentation order of the pattern vectors has an influence on the form of the model.


This is desirable for time series estimation, but if the network model is trained with pattern vectors that are not drawn from a time series, then the DCA networks may not be able to form satisfactory models with one scan through the learning data set.

Conceptually, the DCA network models fall between regression models and random access memory units. A random access memory implements a strictly spatial update of the model and it cannot interpolate between stored pattern vectors; the model of the function F is encoded in the form of a lookup table. Regression models, on the other hand, use the correlation between vector components for representation of dependences, and all updates of such models affect the whole input pattern vector space. The DCA network models have, with each stored pattern vector, a certain neighborhood that is affected by the update, but the model as a whole is not changed. The size of the neighborhood is defined by the steepness of the hidden node activation functions: small values of parameter a reduce the size of the neighborhood and vice versa. The DCA networks implement a global interpolation method, because all hidden node activations affect all computed model outputs. The approximate spatial locality of the models is due to the Gaussian hidden node activation functions, which effectively reduce the coupling between stored pattern vectors.

Algorithm 1 for DCA networks works adequately in many situations, but this method has many tuning parameters which have to be chosen by the designer. Unfortunately, the methods to select the parameters are based on heuristics which do not have a solid theoretical foundation, and quite a lot of experience is needed to select the parameters properly. Therefore an attempt is made in the following section to reduce the dependence of the learning algorithm on the initial values of the network parameters without affecting the desired continuous learning and selective forgetting aspects of the model.

Focusing attention


The learning algorithm for DCA networks separates the construction of the hidden layer and the output layer into two simultaneous and loosely coupled processes. In fact, the hidden nodes can be thought of as performing a semi-self-organized partitioning of the input pattern vector space, while the output layer converts the activations of the nodes in the hidden layer to some desired coding form. The attention of these networks is controlled by the parameter c, which defines the size of the ellipsoids. The presented learning algorithm assumes that all hidden nodes have equal c and a. This approach is adequate for most applications, but sometimes a part of an input-output mapping may be so irregular that more hidden nodes are needed than are predicted by the initial choice of c.

A bigger problem of Algorithm 1 for continuous learning is that if the input pattern vector x changes at a suitable rate (typical in time series prediction, for example), then only one hidden node is allocated and the center of this node is changed at each time step. As a result, the network has only one hidden node, and by changing the position of this node the network gradually forgets earlier learned information, although this should be avoided and additional nodes should be allocated for different parts of the input vector space. Automatic focusing of attention means that the learning algorithm is able to decrease and increase the parameter c according to the local prediction performance of the network, and this avoids the problem of the hidden node allocation.

The modified learning algorithm with automatic focusing of attention is presented below. Notice that the addition of network outputs is made as in Algorithm 1 and is therefore not repeated here. The steps related to the focusing of attention are 3b and 3c. This is learning Algorithm 2.

1. Select a suitable weighting matrix A and constant c to fix the shape and initial size of the N-dimensional ellipsoid for all hidden nodes. Select constant a, Eqn. 2, to define the steepness of the responses of the hidden nodes. Fix the number of inputs and outputs of the network and set the number of hidden nodes k initially to zero.


2. If k = 0 then go to step 3b, else calculate the network outputs o_j for a new pattern vector x using the currently defined network and parameters:

o_j = \sum_{i=1}^{k} b_{ij} \exp[-(x - w_i)^T A (x - w_i) / (2a_i)]

The parameters c_i are defined individually for each hidden node h_i.

3. Compare the calculated output o_j with the true desired output y_j. If the error is smaller than the predefined tolerance ε,

|o_j - y_j| < ε

then go back to step 2. If the error is larger, then check whether the input pattern vector is within the N-dimensional ellipsoid of any hidden node:

(x - w_i)^T A (x - w_i) ≤ c_i

3a. If x is inside more than one currently allocated N-dimensional ellipsoid, then go back to step 2.

3b. If x is outside all currently allocated N-dimensional ellipsoid regions, increase c_{k+1} and allocate a new hidden node with w_{k+1} = x and set k = k + 1.

3c. If x is inside one currently allocated N-dimensional ellipsoid p, then decrease c_p and change the center vector of this node to w_p = x.

4. Calculate the matrix b by solving the system of linear equations Hb = Y, where the k × k matrix H is obtained as

H_{ij} = \exp[-(w_i - w_j)^T A (w_i - w_j) / (2a_i)]

The matrix Y contains the desired output responses y_i of the network corresponding to the same stored pattern vectors.

5. Go back to step 2.

The addition of automatic focusing of attention affects only steps 3b and 3c of the basic algorithm. If a new hidden node is allocated, then the parameter c is increased; if the weights w_j of node j are changed, then c_j is first decreased and after that the weights are changed. As a result, each hidden node j now has its own parameter c_j, which may be different for each node.

TABLE 1
Some possible methods for focusing of attention by changing parameter c

Increase        Decrease         Notice
c = c + δ       c = c − δ        δ > 0
c = c·δ         c = c/δ          δ > 1
c = c·e^t       c = c·e^(−t)     t = t + 1

Notice that the parameter c is changed only when the prediction result of the network is worse than the predefined tolerance. One question still remains: how to change c so as to focus the attention quickly, but to avoid allocation of unnecessary hidden nodes. Unfortunately it seems to be difficult to obtain an optimal method for the adjustment of the parameters, but some methods are suggested in Table 1. In this table, δ is a constant and t is the number of successive allocations of new hidden nodes or changes of the weights w_i; initially t is always set to 0.

The automatic focusing of attention can also mean that each hidden node is defined to have an adjustable parameter a_j. The parameters a_j can be changed using methods similar to those for the parameters c_j. In practice it has been noticed that if the parameter a is also changed, the resulting network often has redundant hidden nodes. Therefore the learning algorithm should mostly be used with constant a and variable c. If c is decreased, the network is forced to pay attention to smaller changes in the input and the attention level during learning is increased. This makes it possible to learn accurately the features of some important part of the input pattern vector space, while the learning is normally done using a coarse allocation of hidden nodes (larger c). Regardless of the attention level of the network during learning, all stored information is available in the recall computation of the network. The network provides all the information it has on a given subject, but the learning might have been more refined on some parts of the input pattern vector space than on others.
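As a small sketch (the variable names are placeholders; the factor 2 is the same one used by dnet2.m in the Appendix), the multiplicative scheme of Table 1 with δ = 2 could be written as:

% Sketch: focusing of attention by changing c_j multiplicatively (Table 1, delta = 2).
c_nodes = [1 1 1];                       % current c_j of the allocated hidden nodes
delta = 2;
k_new = 4;                               % a new hidden node is allocated ...
c_nodes(k_new) = c_nodes(k_new-1)*delta; % ... so its c is increased
p = 2;                                   % the center of node p is to be moved ...
c_nodes(p) = c_nodes(p)/delta;           % ... so c_p is decreased first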


An obvious extension of Algorithm 2 would be an iterative solution of the parameters a_j. Such a method would adjust all parameters of the DCA network automatically, but unfortunately it is impossible. If the parameters a_i are iterated while no other adjustments can be done (step 3a of the algorithms), then the a_j tend to go to infinity, a_i → +∞, while t → +∞. This in turn makes the matrix H singular and the algorithm fails.

Theoretically, step 3a of the learning algorithms is not satisfactory, because there is a known error in the model but nothing can be done. This does not affect the performance of the learning algorithm much. It is, however, also possible to change this part of the learning algorithm to facilitate accurate interpolation behavior. Step 3a of Algorithm 2 is then changed to:

3a. If x is inside more than one currently allocated N-dimensional ellipsoid, then reduce the sizes of the overlapping ellipsoids by reducing the parameters c_j until the input vector x is within at most one ellipsoid region. Continue with step 3b or 3c of the algorithm.

This means that each input vector that causes too large a prediction error is transferred from the ambiguous region in the input pattern vector space to one of the two categories that can be handled. The center of one hidden node is then changed, or a new hidden node is allocated, accordingly.

EXAMPLES

Classification

This example is chosen to demonstrate the effect of parameter a on the prediction properties of the network. The problem is classification of Iris species based on the sepal and petal lengths and widths [15]. This classification problem is a classical example and, due to the properties of the data set, almost any linear or nonlinear classification method gives acceptable results. The DCA network for this problem has 4 inputs and 1 output that has been assigned the values 1, 2 and 3 to represent the different Iris species. The training data consisted of 150 training samples, 50 samples of each category. The weighting matrix A was chosen as a 4 × 4 identity matrix and the constant c = 1.

TABLE 2
Number of allocated hidden nodes as a function of parameter a in the case of one scan and multiple scans through the data set

a          No. of hidden nodes (1 scan)    Asymptotic
0.01       11                              14
0.1        11                              14
1.0        11                              13
10.0       7                               14
100.0      9                               16
1000.0     9                               14
10000.0    9                               13

The training was done using Algorithm 1 and only one scan through all training samples was made. The error tolerance ε was set to 10% during learning. The network with only one output has been selected to visualize the differences in the prediction ability of the model when different values are used for the parameters; the effect of pattern class labeling is also explained in this section.

Table 2 shows the number of allocated hidden nodes for different values of parameter a. There are two different values for the number of hidden nodes: the first is obtained by scanning the data set once and the second by scanning the data set until no additional hidden nodes are allocated. It can be seen that the asymptotic number of hidden nodes is nearly independent of a, as it should be according to Eqn. 8. Using Eqn. 8 the estimated maximum number of hidden nodes for this problem is 15. These results show that the model is dependent on the presentation order of the pattern vectors. The learning Algorithm 1 tends to disregard stored pattern vectors that are presented at the beginning of the learning phase and to replace them with more recent vectors. But spatially selective forgetting and learning of new information are always contradictory goals. The fixed size of the N-dimensional ellipsoids in Algorithm 1 is the main reason for this problem but, as can be seen, despite the problems the DCA networks are able to obtain good results even with Algorithm 1.
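For reference, such a run could be set up with the MATLAB library listed in the Appendix roughly as follows (a sketch only: the matrices X and y, holding the 150 × 4 Iris measurements and the 150 × 1 class labels, are assumed to exist in the workspace, and the global variables are the ones defined in defnet.m):

% Sketch: training and recalling a DCA network on the Iris data (Algorithm 1).
defnet                      % declare the global network structures
clnet                       % start from an empty network
A_dnet   = eye(4);          % weighting matrix A = I
a_dnet   = 10;              % smoothing parameter a
c_dnet   = 1;               % ellipsoid size constant c
tol_dnet = 10;              % error tolerance, in per cent
learn1(X, y);               % one sequential scan through the data
est = recall1(X);           % recall of the trained network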

Fig. 5. Classification of Iris data, using a = 0.1. The solid line is true classification and the dashed line is predicted by the DCA network model.

The data set is such that class 1 (Iris Setosa) should be easily separated, but classes 2 (Iris Versicolour) and 3 (Iris Virginica) are slightly overlapping. This fact is reflected in the results shown in Fig. 6. The continuous lines represent the true classification result and the dashed lines are predicted by the trained DCA network. Notice that these results are obtained with only one scan through the data set. If the overlapping of the activation function responses is too small, as in Fig. 5, then the network can only classify correctly the memorized pattern vectors and no generalization can be observed.


When the overlapping is increased, as in Fig. 6, a smaller number of hidden nodes is actually able to perform the classification task correctly. On the other hand, if the overlapping is too strong, a > 100, then the prediction accuracy decreases. The actual value of a is not critical in this problem, because values between 1 and 20 give nearly similar performance.

98% of this data set is classified correctly using an optimal Bayesian classifier. SIMCA, for example, achieved 93-96% depending on the number of principal components used. The DCA network has essentially the same performance (93% with Algorithm 1), but the crucial difference is that only sequential processing of data is needed. This means that the DCA network can learn continuously and improve the classification accuracy when more data become available. The construction of Bayesian classifiers, on the other hand, requires that all needed data are collected before the classifier can be designed.

If it is assumed that all data are available for identification of the model, then the DCA network can be identified differently: first the DCA network model is identified using Algorithm 1 and the resulting network is then modified by recalculating the matrix b using Eqns. 6 and 7. The performance of this modified DCA network model is shown in Fig. 7. This model achieved a prediction accuracy of 98%, which is better than SIMCA and equal to the optimal Bayesian classifier.

Fig. 6. Classification of Iris data, using a = 10. The solid line is true classification and the dashed line is predicted by the DCA network model.

Fig. 7. Classification of Iris data, using a = 10 and a modified b. The solid line is true classification and the dashed line is predicted by the DCA network model.


The prediction accuracy of the DCA network models was confirmed by using a leave-one-out procedure; the obtained results were 92% for sequential processing and 97.4% for the method with the recalculated matrix b.

It must be mentioned that the coding that is used also affects the prediction results of the network. If the labels of classes 1 and 2 are reversed, then the overlapping of the original classes 2 and 3 would cause some patterns to be classified incorrectly into the original class 1, which does not even overlap with the others. This is easily avoidable, because the classification can be done with a DCA network that has three binary valued outputs corresponding to each of the Iris species. Using this method, similar prediction accuracies are achieved without problems in the labeling of the classes.

Comparison of learning algorithms

The two learning algorithms for DCA networks are compared here using a nonlinear pattern classification problem. Input patterns are 8-bit binary strings and the number of contiguous clumps of '1's determines the class of each pattern string. Thus, for 8-bit pattern strings there are 256 distinct patterns that belong to one of five classes. Table 3 gives the number of patterns in each class. This problem has also been discussed by Denker et al. [17] and Webb and Lowe [18]. The linear Bayesian classifier and SIMCA are able to classify only 25% of the 256 patterns correctly but, for example, a nonlinear back-propagation neural network with eight inputs, five hidden nodes and one output classifies 100% of the patterns correctly, if the output is quantized properly.

TABLE 3
Number of 8-bit patterns in each class for groups of digit 1s

Class    Number of groups    Number of patterns in class
1        0                   1
2        1                   36
3        2                   126
4        3                   84
5        4                   9
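The class structure of Table 3 can be reproduced directly by counting the contiguous groups of 1s in all 256 patterns; a short sketch (independent of the DCA library) is given below.

% Sketch: count the 8-bit patterns in each class (number of clumps of 1s).
counts = zeros(1,5);
for p = 0:255
  bits = rem(floor(p ./ 2.^(7:-1:0)), 2);   % 8-bit binary representation of p
  clumps = sum(diff([0 bits]) == 1);        % number of contiguous groups of 1s
  counts(clumps+1) = counts(clumps+1) + 1;  % classes 1..5 correspond to 0..4 groups
end
disp(counts)                                % prints 1 36 126 84 9, as in Table 3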


TABLE 4
Classification results with a DCA network using Algorithm 1

c       a        Number of hidden nodes    Prediction accuracy (%)
1       1        15                        99.6
1       10       29                        99.6
1       100      16                        100.0
1       1000     16                        100.0
1       10000    16                        100.0
2       1000     32                        87.9
3       1000     15                        59.0
4       1000     6                         41.0
5       1000     3                         39.0
8       1000     2                         23.4
10      1000     1                         14.1
100     1000     1                         14.1
1000    1000     1                         14.1

Table 4 shows the classification results obtained with several combinations of the DCA network parameters c and a using learning Algorithm 1. The DCA networks were trained with only one scan through the 256 pattern vectors. For example, the back-propagation network needed several hundred scans through the pattern vectors to converge to a solution.

The prediction accuracy is good in the case of learning Algorithm 1 if the size of the N-dimensional ellipsoids (parameter c) is small but, as can be seen, the result is highly dependent on this parameter, and decreasing c increases the number of allocated hidden nodes quite rapidly. The smoothing parameter a has only a minor effect on the prediction performance in this nonlinear classification problem, although in the case of an almost linear classification problem (the Iris data) this parameter has a larger effect. A general result seems to be that if a is too high, the numerical problems of matrix inversion reduce the performance, while for small values of a the interpolation between stored vectors is not good, which is compensated by allocating more hidden nodes to the network.

The same problem was solved using Algorithm 2 for DCA networks. Table 5 shows the results using the same initial parameter values of c and a as for Algorithm 1 in Table 4. Algorithm 2 automatically changes the values of c_i for each hidden node and therefore the table can only give the initial values of the parameters.


The parameters c_i were changed by multiplying or dividing the parameter c_j by 2 at each appropriate step of the algorithm; the effect of using different methods for changing the values of c_i was not studied in this example. Algorithm 2 is more robust to changes in the initial value of parameter c, as can be seen from Table 5. One drawback is that more hidden nodes are usually needed for the solution than in the case of Algorithm 1. The prediction accuracy is also lower for some parameter values c and a than in the case of Algorithm 1, but if Algorithm 2 is allowed to learn the training data set twice, then the prediction accuracy is similar. For example, the case with c = 1 and a = 1000 results in 100% correctly classified patterns using Algorithm 1 and one scan through the training patterns; Algorithm 2 achieves 100% accuracy with two scans through the data and 78.5% with one scan. This means that the robustness with respect to the initial values of the parameters c and a has been partly obtained at the expense of a slightly worse usage of information during the learning phase. But the usage of information is still orders of magnitude more efficient than in the case of iterative learning algorithms.

Interpretation of spectral information

The DCA networks can be used for the interpretation of information from complex measuring instruments. Such instruments are typically spectrometers for the visible (VIS), near-infrared (NIR) and infrared (IR) wavelength regions. These instruments provide a complex spectrum that contains information about the structure and composition of the measured sample. For quality control purposes these spectral measurements should be related to some known quality parameter of the sample. Typically such models have been constructed using multiple linear regression (MLR), principal component regression (PCR) and partial least squares (PLS) [19-22]. Fig. 8 shows some measured spectra of a chemical product. This information is used in the following to obtain a model for the simultaneous estimation of the density of the product and the concentration of one particular chemical substance.


TABLE 5
Classification results with a DCA network using Algorithm 2

c       a        Number of hidden nodes    Prediction accuracy (%)
1       1        42                        91.0
1       10       25                        84.4
1       100      25                        81.3
1       1000     27                        78.5
1       10000    27                        78.5
2       1000     30                        94.1
3       1000     33                        100.0
4       1000     32                        99.6
5       1000     32                        100.0
8       1000     31                        95.7
10      1000     34                        100.0
100     1000     33                        98.4
1000    1000     29                        77.0

The whole spectrum of Fig. 8 was used as an input to the models and the density and concentration were both estimated simultaneously. This means that the models had 130 inputs and 2 outputs. The initial models were estimated with 13 calibration spectra and tested with 38 different spectra of the same product. This means that the model is used for explaining the changes in the quality parameters of produced material that is sold as one product. The calibration samples were selected in such a way that the variance of the whole sample set was covered by the calibration samples.

Fig. 8. Spectral information of a chemical product.


TABLE 6
Mean prediction error of different models for the interpretation of spectral information

Model          Prediction error (%)
               Density    Concentration
MLR model      0.96       2.25
PCR(4)         1.25       3.74
PCR(8)         1.10       2.61
PCR(10)        0.92       2.12
PLS model      0.81       1.71
BP network     0.64       0.95
DCA network    0.44       0.70

This means that the models were not tested by predicting samples that were totally different from the calibration samples. The number of calibration samples was kept very small, because this makes the test more realistic. It is typical that collecting enough calibration samples is very time-consuming and may take over one year in a production environment. This makes it critical to test how the different methods work with very small amounts of data.

Due to the small number of calibration samples, the MLR and PCR models were calculated using the SVD decomposition and the pseudo-inverse. The DCA network model was estimated using learning Algorithm 1 with parameters a = 1 and c = 0.01. The weighting matrix A was an identity matrix I and the tolerance was 0.1%. The estimated model had nine hidden nodes. The back-propagation algorithm (BP) was also tested with several different feed-forward networks; the smallest prediction error was obtained in this case with a network which had 130 inputs, eight hidden nodes and two outputs.

Table 6 shows the prediction results of several different linear models and the nonlinear network models. The numbers in parentheses in the PCR models are the numbers of principal components used for the model. The principal components were selected based on the amount of variance in the input that was explained; for example, the PCR model with eight principal components uses those principal components which correspond to the eight largest singular values.


It may be argued that the number of principal components in Table 6 is high, while the models were identified with only 13 spectra, but remember that practical constraints may limit us to this small number of calibration samples. So the options are: take the most out of the available data by testing several types of models, or do not make any models at all.

In this example the PLS model with eight factors was the best linear model, but the DCA network and the feed-forward back-propagation type networks clearly had lower prediction error levels for both output variables. This may be due to the ability of the nonlinear network models to compensate for the nonlinearity of the measuring instrument and to explain nonlinear dependences between variables. This result is in agreement with previous results of Long et al. [9], who obtained better predictions using nonlinear network models than with any linear models when there was some nonlinear effect present during the measuring process. This result is not general for other types of modelling problems, but the usefulness of the nonlinear network models (BP and DCA) is still evident.

The calibration models of spectroscopic process analyzers can be updated continuously if calibration information of known samples is available. In practice, continuous calibration can be achieved using laboratory information management systems (LIMS) [23] together with process analyzer data. If a quality control sample is taken from the process and analyzed in the laboratory, that information can be transmitted automatically back to the process analyzer. The analyzer compares its calculated result to the laboratory result and, based on this comparison, the model is updated or not. The DCA network models are particularly suitable for this kind of update procedure, because the model can be updated spatially locally without affecting the performance in other parts of the function. Common linear and nonlinear regression models tend to forget previously stored information if the model is updated with only one pattern vector pair; in DCA network models this kind of forgetting is minimal. In fact, the DCA networks implement an efficient interpolation method with continuous construction of a knowledge base for the model, and therefore the DCA network models can be regarded as simple embedded expert systems for modelling.
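A sketch of one such continuous-calibration step using the library of the Appendix (the variable names x_new and y_lab, for a new analyzer spectrum and the corresponding laboratory result, are hypothetical):

% Sketch: continuous calibration update when a laboratory reference arrives.
% x_new is the analyzer spectrum (column vector), y_lab the laboratory result
% (column vector); dnet1 itself compares its prediction with y_lab and updates
% the model only if the relative error exceeds tol_dnet.
est = compute1(x_new);      % prediction currently reported by the analyzer
dnet1(x_new, y_lab);        % spatially local update of the DCA model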


CONCLUSIONS

The purpose of this work has been the development of methods for continuous estimation of unknown, possibly time-variant, functions F(x_t, t) between input pattern vectors x_t and output pattern vectors y_t. As a result, a nonlinear network model, called a dynamically capacity allocating (DCA) network, has been developed, together with learning algorithms. Continuously learning systems must make a tradeoff between the ability to adapt to novel features and the ability to memorize old features of the learned function. This is actually the difference between a completely distributed representation of information and a local, random access memory type representation. The DCA networks implement a spatially local representation of information but, in addition to this, DCA networks interpolate the estimated function between stored pattern vectors in a well-defined manner. Several variations of the learning algorithms for DCA networks have been presented in this work. All learning algorithms are fast, non-iterative methods that require only sequential processing of the data set. This feature makes DCA networks highly attractive for continuous learning systems, such as the continuous calibration of process analyzer models.

APPENDIX

This appendix contains M-file listings of a subset of the presented algorithms and methods of DCA networks for MATLAB.


MATLAB is an interactive software system for tasks involving matrix operations, graphics and general numerical computation. These MATLAB function files are known to work on several machines using version 3.5 of PC-MATLAB or PRO-MATLAB. As published here, the library contains 19 functions. No attempt has been made to optimize the use of vector operations in this library.

savenet.m     Saves the data structures to dnet.mat-file
loadnet.m     Loads the data structures from dnet.mat-file
compute1.m, compute2.m, recalc1.m, recalcb1.m, etc.
              These files are used internally by the library
recalc2.m     Changes the parameter a and recalculates weight matrix b
recalcb2.m    Recalculates weight matrix b using the whole training data

%
% Defnet   Defines DCA network structures
%
global w_dnet
global b_dnet
global Y_dnet
global A_dnet
global a_dnet
global c_dnet
global tol_dnet

%
% Clnet    Clears DCA network structures
%
clear w_dnet
clear b_dnet
clear Y_dnet
clear A_dnet
clear a_dnet
clear c_dnet
clear tol_dnet

%
% Clwts    Clears DCA network weights from workspace
%
clear w_dnet
clear b_dnet
clear Y_dnet

%
% Savenet  Saves DCA network structure to dnet.mat-file
%
save dnet w_dnet b_dnet Y_dnet A_dnet a_dnet c_dnet tol_dnet

%
% Loadnet  Loads DCA network structure from dnet.mat-file
%
load dnet
defnet


function est = compute1(in)
%
% Compute1  Computes output of a DCA network
%
% est = compute1(in)
%
% est = estimated output of the network
% in  = input column pattern vector
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
if m == 0
  est = 0;
else
  for i = 1:m
    diff = in - w_dnet(:,i);
    h(i) = exp(-0.5*diff'*A_dnet*diff./a_dnet(1));
  end
  est = b_dnet'*h';
end

function est = compute2(in)
%
% Compute2  Computes output of a DCA network
%
% est = compute2(in)
%
% est = estimated output of the network
% in  = input column pattern vector
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
if m == 0
  est = 0;
else
  for i = 1:m
    diff = in - w_dnet(:,i);
    h(i) = exp(-0.5*diff'*A_dnet*diff./a_dnet(i));
  end
  est = b_dnet'*h';
end


function recalc1(x)
%
% Recalc1  Recalculates the network for new parameter value a
%
% recalc1(a)
%
% a = new value for parameter a_dnet
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
a_dnet = x;
for i = 1:m
  for j = 1:i
    diff = w_dnet(:,i) - w_dnet(:,j);
    h(i,j) = exp(-0.5*diff'*A_dnet*diff./a_dnet(1));
    h(j,i) = h(i,j);
  end
end
b_dnet = h\Y_dnet;

function recalc2(x)
%
% Recalc2  Recalculates the network for new parameter value a
%
% recalc2(a)
%
% a = new value for parameter a_dnet
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
a_dnet = x;
for i = 1:m
  for j = 1:i
    diff = w_dnet(:,i) - w_dnet(:,j);
    h(i,j) = exp(-0.5*diff'*A_dnet*diff./a_dnet(i));
    h(j,i) = h(i,j);
  end
end
b_dnet = h\Y_dnet;


function recalcb1(x,y)
%
% Recalcb1  Recalculates the matrix b
%
% recalcb1(in,out)
%
% in  = matrix of input data
% out = matrix of output data
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
[nn,mm] = size(x);
for i = 1:nn
  for j = 1:m
    diff = w_dnet(:,j) - x(i,:)';
    h(i,j) = exp(-0.5*diff'*A_dnet*diff./a_dnet(1));
  end
end
b_dnet = h\y;

function recalcb2(x,y)
%
% Recalcb2  Recalculates the matrix b
%
% recalcb2(in,out)
%
% in  = matrix of input data
% out = matrix of output data
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
[nn,mm] = size(x);
for i = 1:nn
  for j = 1:m
    diff = w_dnet(:,j) - x(i,:)';
    h(i,j) = exp(-0.5*diff'*A_dnet*diff./a_dnet(j));
  end
end
b_dnet = h\y;


function addout1(x,y)
%
% Addout1  Adds one output to type 1 DCA network
%
% addout1(in,out)
%
% in  = input row pattern vector
% out = output row pattern vector
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
[nl,ml] = size(Y_dnet);
Y_dnet(:,ml+1) = zeros(nl,1);
for i = 1:m
  for j = 1:i
    diff = w_dnet(:,i) - w_dnet(:,j);
    h(i,j) = exp(-0.5*diff'*A_dnet*diff./a_dnet(1));
    h(j,i) = h(i,j);
  end
end
b_dnet = h\Y_dnet;
dnet1(x',y');

function addout2(x,y)
%
% Addout2  Adds one output to type 2 DCA network
%
% addout2(in,out)
%
% in  = input row pattern vector
% out = output row pattern vector
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
[nl,ml] = size(Y_dnet);
Y_dnet(:,ml+1) = zeros(nl,1);
for i = 1:m
  for j = 1:i
    diff = w_dnet(:,i) - w_dnet(:,j);
    h(i,j) = exp(-0.5*diff'*A_dnet*diff./a_dnet(i));
    h(j,i) = h(i,j);
  end
end
b_dnet = h\Y_dnet;
dnet2(x',y');


function dnet1(x,y)
%
% Dnet1  Implementation of Algorithm 1 for DCA networks
%
% dnet1(in,out)
%
% in  = input column pattern vector
% out = output column pattern vector
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
if m == 0
  w_dnet(:,1) = x;
  b_dnet = y';
  Y_dnet = y';
else
  est = compute1(x);
  if max(abs((y-est)./y))*100 > tol_dnet
    k = 0; kk = 0;
    for i = 1:m
      diff = x - w_dnet(:,i);
      if diff'*A_dnet*diff < c_dnet
        k = i; kk = kk + 1;
      end
    end
    % Allocate new hidden node
    if kk == 0
      w_dnet(:,m+1) = x;
      Y_dnet(m+1,:) = y';
      m = m + 1;
    % Change weights of node k
    elseif kk == 1
      w_dnet(:,k) = x;
      Y_dnet(k,:) = y';
    % Disregard this data vector
    else
      return;
    end;
    for i = 1:m
      for j = 1:i
        diff = w_dnet(:,i) - w_dnet(:,j);
        h(i,j) = exp(-0.5*diff'*A_dnet*diff./a_dnet(1));
        h(j,i) = h(i,j);
      end
    end
    b_dnet = h\Y_dnet;
  end
end


function dnet2(x,y)
%
% Dnet2  Implementation of Algorithm 2 for DCA networks
%
% dnet2(in,out)
%
% in  = input column pattern vector
% out = output column pattern vector
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(w_dnet);
if m == 0
  w_dnet(:,1) = x;
  b_dnet = y';
  Y_dnet = y';
else
  est = compute2(x);
  if max(abs((y-est)./y))*100 > tol_dnet
    k = 0; kk = 0;
    for i = 1:m
      diff = x - w_dnet(:,i);
      if diff'*A_dnet*diff < c_dnet(i)
        k = i; kk = kk + 1;
      end
    end
    % Allocate new hidden node
    if kk == 0
      w_dnet(:,m+1) = x;
      Y_dnet(m+1,:) = y';
      a_dnet(m+1) = a_dnet(m);
      c_dnet(m+1) = c_dnet(m)*2;
      m = m + 1;
    % Change weights of node k
    elseif kk == 1
      w_dnet(:,k) = x;
      Y_dnet(k,:) = y';
      c_dnet(k) = c_dnet(k)/2;
    % Disregard this data vector
    else
      return;
    end;
    for i = 1:m
      for j = 1:i
        diff = w_dnet(:,i) - w_dnet(:,j);
        h(i,j) = exp(-0.5*diff'*A_dnet*diff./a_dnet(i));
        h(j,i) = h(i,j);
      end
    end
    b_dnet = h\Y_dnet;
  end
end


function learn1(in,out)
%
% Learn1  Learns the input-output data with a DCA network
%
% learn1(in,out)
%
% in  = input data matrix
% out = output data matrix
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(in);
for i = 1:n
  dnet1(in(i,:)',out(i,:)');
end

function learn2(in,out)
%
% Learn2  Learns the input-output data with a DCA network
%
% learn2(in,out)
%
% in  = input data matrix
% out = output data matrix
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(in);
for i = 1:n
  dnet2(in(i,:)',out(i,:)');
end


function est = recall1(in)
%
% Recall1  Computes output of a DCA network for matrix in
%
% est = recall1(in)
%
% est = estimated output of the network
% in  = input data matrix
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(in);
for i = 1:n
  est(i,:) = compute1(in(i,:)')';
end

function est = recall2(in)
%
% Recall2  Computes output of a DCA network for matrix in
%
% est = recall2(in)
%
% est = estimated output of the network
% in  = input data matrix
%
% uses global DCA network structure as defined
% in defnet.m -file
%
[n,m] = size(in);
for i = 1:n
  est(i,:) = compute2(in(i,:)')';
end

REFERENCES

1 W.S. McCulloch and W.H. Pitts, A logical calculus of the ideas immanent in nervous activity, Bulletin of Mathematical Biophysics, 5 (1943) 115-133.
2 M. Minsky and S. Papert, Perceptrons: An Introduction to Computational Geometry, MIT Press, Cambridge, MA, 1969.
3 S. Grossberg, Adaptive pattern classification and universal recoding. I. Parallel development and coding of neural feature detectors, Biological Cybernetics, 23 (1976) 121-134.
4 T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1987.
5 K. Fukushima, Neocognitron: A hierarchical neural network capable of visual pattern recognition, Neural Networks, 1 (1988) 119-130.
6 J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Sciences, USA, 79 (1982) 2554-2558.
7 D.E. Rumelhart and G.E. Hinton, in D.E. Rumelhart and J.L. McClelland (Editors), Parallel Distributed Processing. Explorations in the Microstructure of Cognition. Vol. 1, Foundations, MIT Press, Cambridge, MA, 1986, pp. 318-362.
8 P. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Harvard University, Cambridge, MA, August 1974.
9 J.R. Long, V.G. Gregoriou and P.J. Gemperline, Spectroscopic calibration and quantitation using artificial neural networks, Analytical Chemistry, 62 (1990) 1791-1797.
10 P.A. Jokinen, Neural networks with dynamic capacity allocation and quadratic function neurons, in Proceedings of EURO-Nimes 90, EC2, Nimes, 1990, pp. 351-362.
11 C.B. Moler, J.N. Little and S. Bangert, PC-MATLAB User's Guide, The MathWorks Inc., Sherborn, MA, 1987.
12 M. Stinchcombe and H. White, Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions, in Proceedings of the International Joint Conference on Neural Networks, Washington, DC, 1989, pp. 607-611.
13 F. Girosi and T. Poggio, Networks and the best approximation property, A.I. Memo No. 1164, MIT Artificial Intelligence Laboratory and Center for Biological Information Processing, October 1989, p. 21.
14 E. Hartman, J.D. Keeler and J.M. Kowalski, Layered neural networks with Gaussian hidden units as universal approximations, Neural Computation, (1990) 210-215.


15 K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate Analysis, Academic Press, London, 1979.
16 R. Franke, Scattered data interpolation: tests of some methods, Mathematics of Computation, 38 (1982) 181-199.
17 J. Denker, D. Schwartz, B. Wittner, S. Solla, R. Howard, L. Jackel and J. Hopfield, Large automatic learning, rule extraction, and generalization, Complex Systems, 1 (1987) 877-922.
18 A.R. Webb and D. Lowe, The optimised internal representation of multilayer classifier networks performs nonlinear discriminant analysis, Neural Networks, 3 (1990) 367-375.
19 M. Stone and R.J. Brooks, Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal component regression, Journal of the Royal Statistical Society, Series B, 52 (1990) 237-269.
20 P. Geladi and B.R. Kowalski, Partial least squares regression: a tutorial, Analytica Chimica Acta, 185 (1986) 1-17.
21 A. Lorber, L.E. Wangen and B.R. Kowalski, A theoretical foundation for the PLS algorithm, Journal of Chemometrics, 1 (1987) 19-31.
22 S. Wold, P. Geladi, K. Esbensen and J. Öhman, Multi-way principal components and PLS analysis, Journal of Chemometrics, 1 (1987) 41-56.
23 R.R. Stein, Improving efficiency and quality by coupling quality assurance/quality control testing and process control systems with a laboratory information management system, Process Control and Quality, 1 (1990) 3-14.