Training Dynamics and Neural Network Performance


Neural Networks, Pergamon

PII: S0893-6080(%)00119-0

CONTRIBUTED ARTICLE

CHARLES L. WILSON, JAMES L. BLUE AND OMID M. OMIDVAR

Abstract: [text illegible in source]

1. INTRODUCTION

It is now generally accepted that neural networks can be used successfully for pattern recognition and classification on data sets of realistic size. However, we lack a full understanding of the methods being used, even for the long-studied feedforward multilayer perceptrons (MLPs). In this paper we present some remaining difficulties with MLPs and use an analogy with dynamic systems to help gain understanding and to develop improved MLP methods for pattern classification.

We consider three-layer fully connected MLPs with $N_i$ input nodes, $N_h$ hidden nodes, and $N_o$ output nodes. For the data sets we will use, $N_i$ and $N_h$ need to be approximately 100, leading to approximately 10,000 weights in the network.

Determining the weights, or "training" the network, is essentially done by minimizing some function, $E$. A common choice for $E$ is a sum (over all the output nodes and all the examples in the training data set) of squares of the differences of actual and desired output node values. Other terms, such as a sum of squares of the weights themselves, are often added. The choice of the $E$ that is used is made partly because it is a desirable objective function and partly because minimizing $E$ is tractable. (In fact, it is impossible to find an objective function that would lead to the network we would really like, since we don't know what data sets will be classified using the network.) Even supposing that we have an optimal $E$, the training rarely succeeds in getting to the "best" minimum, for some of the reasons detailed in Section 1.1.

1.1. Some problems

There are several difficulties that hamper the training, or limit the usefulness, of MLPs. We summarize some of them briefly.

1. There are many local minima of $E$, and each of them occurs $N_h!$ times because of symmetry (all the permutations of the hidden layer nodes). Finding the global minimum is unlikely. (Though, since $E$ is nonoptimal, the global minimum is unlikely to be the best one.)
2. If the nodes of the network are sigmoidal, the Jacobian of $E$ with respect to the weights is ill-conditioned (Saarinen et al., 1993), causing problems with the convergence of the optimization process. Some of the weights can change by large amounts without changing $E$ much.
3. Near a local minimum of $E$, the local dimension is likely to be much smaller than $N_i$. (See Section 1.2.) A fully connected MLP is too large, but only locally. This also causes problems with the convergence of the optimization process. Since different regions of the network are active for different input samples, we can't just reduce $N_i$ or $N_h$.
4. The data are often inherently low precision. In our OCR example, the training images are binary; averages over an $N$-example training set can only take on $N$ distinct values, and thus are known only to $\log_2 N$ bits (about 13 for our OCR example).
5. The training data are sparse in the input space, which has dimension $N_i$. In our OCR example, we have approximately $2^{13}$ training samples, so that $N \ll 2^{N_i}$. Our input space is woefully undersampled.
6. The training data, and therefore the boundaries of the classes in the $N_i$-dimensional input space, are more fractal-like than ball-like. (We loosely refer to the data set as being fractal, though strictly speaking it is not fractal.)
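The permutation symmetry in item 1 can be checked numerically. The following is a minimal sketch, not the network used in this paper: the layer sizes, the random weights, and the helper `mlp_forward` are illustrative assumptions, and a linear output layer is assumed for brevity. Relabelling the hidden nodes (permuting the rows of the first weight matrix and the matching columns of the second) leaves the output unchanged, which is why every minimum of $E$ has $N_h!$ equivalent copies.

```python
import numpy as np

def mlp_forward(x, w1, b1, w2, b2):
    """Three-layer MLP: input -> sigmoidal hidden layer -> linear output (assumed)."""
    hidden = 1.0 / (1.0 + np.exp(-(w1 @ x + b1)))
    return w2 @ hidden + b2

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 3, 2          # toy sizes, far smaller than the ~100 in the text
w1 = rng.normal(size=(n_hid, n_in))   # input-to-hidden weights
b1 = rng.normal(size=n_hid)           # hidden biases
w2 = rng.normal(size=(n_out, n_hid))  # hidden-to-output weights
b2 = rng.normal(size=n_out)           # output biases
x = rng.normal(size=n_in)             # one arbitrary input sample

# Relabel the hidden nodes: permute rows of w1/b1 and the matching columns of w2.
perm = np.array([2, 0, 1])
y_original = mlp_forward(x, w1, b1, w2, b2)
y_permuted = mlp_forward(x, w1[perm], b1[perm], w2[:, perm], b2)

# The two weight sets are different points in weight space with identical outputs.
same_output = np.allclose(y_original, y_permuted)
```

With three hidden nodes there are $3! = 6$ such relabellings; with about 100 hidden nodes the count is $100!$.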

1.2. Discussion

One of the basic difficulties in recognizing images using pattern recognition methods is that the set of patterns of interest is a small subset of all the possible patterns that can be represented in the image space. As an example, if characters in an OCR system are represented by 32 × 32 pixel binary images, only a small fraction of the $2^{1024}$ possible images are characters. If the features used to represent these characters form a complete representation of the image down to some specified resolution, such as a discrete cosine transform, the feature set is a dense compact vector space representation of the image. The image training set is a fractal (in the loose sense) object in this feature space. This statement of the recognition problem poses it as the inverse of the fractal image compression problem (Barnsley & Hurd, 1993) and shows that the neural network recognition problem for images retains the "geometry of nature" (Mandelbrot, 1982) seen in these images. This paper explores methods which allow neural networks to accurately classify image-based objects which are sampled from fractal objects in the compact feature space used for recognition.

In previous work on character and fingerprint classification (Blue et al., 1994), PNN networks (Specht, 1990) were found to be superior to MLP networks in classification accuracy. This work, and the examples in the current paper, use global Karhunen–Loève (K–L) transforms as the feature set. In later work (Wilson et al., 1996), combinations of PNN and MLP networks were found to be equal to PNN in accuracy and to have superior error-reject performance. These results were achieved by using 45 PNN networks to make binary decisions between digit pairs and combining the 45 outputs with a single MLP. This procedure is much more expensive than conventional MLP training of a single network, and uses much more memory space.

Analysis of the results of the binary decision network (Wilson et al., 1996) for digit recognition showed that the feature space used in the recognition process had a topological structure whose local intrinsic dimension (Fukunaga, 1990) was 10.5, although the global K–L transform dimension was approximately 100. For binary decision machines, typical feature set sizes required for good accuracy were 20–28, significantly less than the number of features required by the global problem for MLPs, 48–52. The number of features, 20–28, needed to make binary decision machines discriminate between digits was larger than the intrinsic dimensionality, 10.5. Similar tests showed a comparable structure in the fingerprint feature data.

The low local dimensionality explains some of the difficulty of these problems; the MLP is being used to approximate a complex fractal object, the set of decision surfaces, which has a typical local dimension of 10.5, embedded in a space of much larger dimension. Since the domain of each prototype in the PNN network is local, PNN can more easily approximate surfaces with this topology. Figure 1 shows a typical PNN decision surface and Figure 2 shows a typical MLP decision surface. See Figure 8 of Blue et al. (1994) for additional examples of this type of local structure in PNN-based recognition.

[Figure 1. A typical PNN decision surface.]

[Figure 2. A typical MLP decision surface.]

The local nature of PNN decision surfaces also explains why MLPs have better error-reject performance. It was shown in Hansen et al. (1995) that the error-reject curve decreases most rapidly when binary choices are made between classes. The decision surfaces in Figure 1 are such that, as the radius of a test region expands, multiple class regions are intersected, flattening the error-reject curve. Simpler class decision surfaces result in better error-reject performance. Thus the shape of the error-reject curve can be used to assess the complexity of decision surfaces.

Neural networks have been proven to be general nonlinear function approximators (Hartman et al., 1990). In theory, they should be capable of approximating complex decision surfaces. The difficulty is in constructing the appropriate neural network, that is, in finding an appropriate set of weights.

1.3. A dynamic systems analogy

Although MLPs have no feedback, the training process does have feedback; the output node values are used to change the weights, and through the weights all the nodal values. The network undergoing training can be viewed as a nonlinear, recurrent, dynamic network, although the trained network is feedforward and static. Since the system is trying to learn a complex surface via a recurrent dynamic process, we conjectured that the complexity of the training process could provide an insight into the learning problem.

To obtain possible insights we study the dynamics of a simple weakly nonlinear recurrent network model. The model contains both linear and second-order rate-dependent terms which couple all of the nodal values. This model, analogous to, but far simpler than, the actual network training system, can be solved in closed form. Mathematical results on such systems indicate several interesting ways in which the dynamics of the feedback signals influence network behaviour. Although the MLP networks used in practical recognition problems are too complex to be subject to direct analysis, we find that the qualitative insights provided by the simpler model can be used to develop better training methods.

In this paper we will show that four modifications to the conjugate gradient method discussed in Blue & Grother (1992) will allow a three-layer MLP to approximate the required decision surface with zero-reject error similar to PNN and k-nearest-neighbour (KNN) methods. The error-reject performance is then better than the best binary decision method discussed in Wilson et al. (1996). This indicates that the resulting decision surfaces are both the most accurate and the simplest approximations to the character and fingerprint classification problem yet found. We have no reason to believe that these two classification problems are atypical of reasonably large realistic classification problems, or that our conclusions will not be generally valid.
In Section 2 a simple dynamical neural network model is presented, which is complex enough to show many interesting dynamic effects, but simple enough to be solved in closed form. In Section 3 the Boltzmann pruning method is discussed. In Section 4 the relationship between structural stability and optimization is used to develop qualitative insights into network training. In Section 5 we will discuss methods which the network dynamics suggest for changes in training. In Section 6 we will discuss the changes in training method used to improve network accuracy while simplifying network structure. In Section 7 we will discuss the results of these changes on classification of handprinted digits and fingerprints.

2. A SIMPLE DYNAMICAL NETWORK MODEL

To understand the dynamics behind training, it is helpful to analyse a system having feedback between the nodes, as the MLP has during training, but which is simple enough to be solved in closed form. This system is not intended to directly model the neural network; in particular the time variable, $t$, is not directly analogous to an iteration count or epoch number during network training.

Consider a number of linear neuron nodes with nodal voltages, $U$, which can be written either in matrix form:

$$U = \begin{pmatrix} U_1(t) & & 0 \\ & \ddots & \\ 0 & & U_n(t) \end{pmatrix} \tag{1}$$

or as a vector:

$$u = \begin{pmatrix} U_1(t) \\ \vdots \\ U_n(t) \end{pmatrix} \tag{2}$$

as is required by the linear or quadratic order of interaction. We assume that the neurons are linear and the network is fully recurrent; the interaction of the neurons can be described by a nonstationary set of first-order couplings:

$$G = \begin{pmatrix} g_{1,1}(t) & \cdots & g_{1,n}(t) \\ \vdots & \ddots & \vdots \\ g_{n,1}(t) & \cdots & g_{n,n}(t) \end{pmatrix} \tag{3}$$

and a set of higher-order couplings:

$$F = \begin{pmatrix} f_{1,1}(t) & \cdots & f_{1,n}(t) \\ \vdots & \ddots & \vdots \\ f_{n,1}(t) & \cdots & f_{n,n}(t) \end{pmatrix} \tag{4}$$

The system dynamics is described by:

$$\frac{dU}{dt} = \bigl(\,[\,\cdots\,] + G\bigr)\,U. \tag{5}$$

(This is not the most general quadratic system; the most general quadratic system cannot be solved analytically.) The solution of eqn (5) was postulated as eqn (6), and checked using a symbolic mathematics program to validate the solution terms. Terms generated in this checking process are given in detail in Appendix A. See p. 99 of Hirsch and Smale (1974), and p. 60 of Perko (1991), for similar linear problems which were used as clues to the proposed solution. The solution derived and checked in this way is:

$$U = (I - \ldots\, F \exp(\ldots G \ldots \qquad \text{[remainder illegible in source]} \tag{6}$$

Why can't we just integrate eqn (5), and use the answer for our model? First, for $n$ nodes there are $2n^2$ functions which must be evaluated and integrated to fully expand the solution. Second, eqn (6) is a generalization of the dynamic system which generates the Lorenz attractor (Guckenheimer & Holmes, 1983), which is the source of the "butterfly wing effect" in weather forecasting. Depending on $F$ and $G$, this system may have no attainable stable state and may have very high sensitivity to initial conditions. The theory of the solution to systems of this kind is treated in Hirsch et al. (1977), Guckenheimer and Holmes (1983), and Perko (1991). A simplified explanation is that small changes in the values of the initial conditions or in the time integration of the equation can result in large changes in the approximate solution of eqn (6) when the matrix exponential and inverse are expanded.

Suppose that $F$ and $G$ are locally stationary (independent of time). Then, according to p. 127 of Guckenheimer and Holmes (1983), the flow of the solution of the nonlinear system can be divided into three parts, based on the eigenvalue spectrum of the Jacobian of the right hand side of the equation. These three parts are the stable manifold associated with eigenvalues with negative real part, the unstable manifold associated with eigenvalues with positive real part, and the centre manifold associated with eigenvalues with zero real part. The stable and unstable manifolds are unique but the centre manifold need not be. The centre manifold is tangent to the stable and unstable manifolds. If any of the eigenvalues in eqn (5) has zero real part, the centre manifold theorem must be used to transform eqn (5) into stable, centre, and unstable manifold parts, and the solution expansion of eqn (6) must be revised accordingly (Carr, 1981).

In numerical calculations, rounding error and lack of precision in the input data can cause some solution components of the stable and unstable solutions to approach the centre manifold within the rounding error of the calculation. The complexity of even small models, $n = 3$ with nine terms in the matrices, each of which has 729 terms similar in form to eqn (A10) (see Appendix A), exceeds the level of complexity which can be locally analysed by direct linearization about equilibrium points to study weight stability during training.
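The three-way split of the flow by the sign of the real parts of the Jacobian eigenvalues can be illustrated with a short numerical sketch. This is only an illustration under stated assumptions: the function name `classify_manifold_dimensions`, the tolerance `tol`, and the diagonal example matrices are hypothetical and are not taken from the model of eqn (5). The tolerance stands in for the rounding error or data precision; eigenvalues whose real parts fall below it cannot be distinguished from zero and are counted with the centre manifold.

```python
import numpy as np

def classify_manifold_dimensions(jacobian, tol=1e-9):
    """Count eigenvalues with negative, positive, and (numerically) zero real part.

    Returns (n_stable, n_unstable, n_centre). `tol` is an assumed numerical
    threshold below which a real part is treated as zero.
    """
    real_parts = np.linalg.eigvals(jacobian).real
    n_stable = int(np.sum(real_parts < -tol))
    n_unstable = int(np.sum(real_parts > tol))
    n_centre = int(np.sum(np.abs(real_parts) <= tol))
    return n_stable, n_unstable, n_centre

# One stable, one unstable, and one exactly-zero direction.
dims_exact = classify_manifold_dimensions(np.diag([-2.0, 0.5, 0.0]))

# A weakly unstable direction (1e-5) is resolved at high precision ...
weak = np.diag([-2.0, 0.5, 1e-5])
dims_fine = classify_manifold_dimensions(weak, tol=1e-9)

# ... but is absorbed into the centre manifold when the precision is coarse.
dims_coarse = classify_manifold_dimensions(weak, tol=1e-4)
```

The last two calls differ only in the tolerance, which mirrors the point made above: with low-precision data, components that are formally stable or unstable become numerically indistinguishable from the centre manifold.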
We could return to eqn (5) and build a numerical model, but this would cause all of the noise present in the training data to be present in a system of equations; this would couple the noise directly to the sensitive initial conditions, and result in various numerical significance problems. The resulting terms are multiplied to form $n$ terms which are summed to form the solution. The noise seen in the resulting solution is not a numerical artifact but an inherent part of the system. The situation we face in analysing even this oversimplified model is one in which the mechanics of the process are clear and are soluble in the closed form given by eqns (A6)–(A9), but no direct comparison with real training data is feasible because the expanded nonlinear solution is too complex to allow the components to be individually analysed.

3. BOLTZMANN PRUNING

For completeness, before continuing with our dynamical systems analogy, we review the subject of Boltzmann pruning (Omidvar & Wilson, 1993).


Boltzmann methods have been used as a statistical method for combinatorial optimization and for the design of learning algorithms (Ackley et al., 1985; Alt, 1962). This method can be used in conjunction with a supervised learning method to dynamically reduce network size. The strategy used in this research is to remove the weights using Boltzmann criteria during the training process. Information content is used as a measure of network complexity for evaluation of the resulting network.

Four competing mechanisms are involved when the Boltzmann method is used in conjunction with scaled conjugate gradient (SCG) optimization (Blue & Grother, 1992). First, the Boltzmann method is self-organizing while the SCG method is a supervised learning method. Second, the Boltzmann method seeks to minimize the number of weights while maintaining the information content of the network, whereas the SCG method seeks to minimize an error function over the training set. Third, the important controlling parameter for the Boltzmann method is the information in the network as the annealing process approaches equilibrium; the controlling informational parameter for the SCG method is the information provided in the initial weights. Fourth, the algorithmic control in the Boltzmann method is the temperature during the iteration; the equivalent controlling parameter for the SCG method is the stopping criterion.

In the earlier work (Omidvar & Wilson, 1993), the SCG method was used to produce a starting network for the Boltzmann weight pruning algorithm. The initial network is fully connected. The pruning is carried out by selecting a normalized temperature, $T$, and removing weights based on a probability of removal:

$$P_i = \exp(-|w_i|/T). \tag{7}$$

The values of $P_i$ are compared to a set of uniformly distributed random numbers on the interval [0,1]. If the probability $P_i$ is greater than its random number, $w_i$ is set to zero. The process is carried out for each iteration of the SCG optimization process, and is dynamic. If a weight is removed it may subsequently be restored by the SCG algorithm; the restored weight may survive if it has sufficient magnitude in subsequent iterations.

The dynamic effect of this is shown in Figure 3 for five temperatures. The pruning, unlike the work in the current paper, started from a small fully converged and fully connected network (32-32-10). The number of weights, including bias weights, in the initial network was 1386. At all temperatures the initial iterations are effective in reducing the weights. The decrease in the rate of pruning is the result of a critical phenomenon characterized by a critical temperature, $T_c$, at which the new information added by the SCG training balances the information removed by pruning. At this critical point, networks trained on small training sets will achieve identical testing and training accuracy even when tested on large test sets.
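A single pruning pass of the kind described above can be sketched as follows. This is an illustrative sketch only, not the implementation of Omidvar & Wilson (1993): the function name `boltzmann_prune`, the example weights, and the temperature value are assumptions, and the removal probability is taken as $\exp(-|w_i|/T)$ so that it lies in [0,1] for weights of either sign.

```python
import numpy as np

def boltzmann_prune(weights, temperature, rng):
    """Zero each weight whose removal probability exceeds its uniform random draw.

    The removal probability exp(-|w| / T) is assumed here; small weights are
    removed with probability near one, large weights almost never.
    """
    removal_probability = np.exp(-np.abs(weights) / temperature)
    draws = rng.uniform(0.0, 1.0, size=weights.shape)  # one draw per weight
    return np.where(removal_probability > draws, 0.0, weights)

rng = np.random.default_rng(0)
weights = np.array([0.001, -0.002, 0.5, -3.0, 8.0])  # hypothetical weight values
pruned = boltzmann_prune(weights, temperature=0.1, rng=rng)
```

In training, a pass like this would be applied after each optimization iteration, so a weight set to zero here can be restored by the next SCG step and pruned again later if it stays small.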

[Figure 3. Dynamic effect of Boltzmann pruning during SCG training for five temperatures; horizontal axis: iteration.]

4. DYNAMICS OF OPTIMIZATION AND STRUCTURAL STABILITY

There are two time scales associated with the dynamics of neural network training. The shorter time scale is associated with the calculation of feedback signals within an iteration (the "inner" iterations) and the longer time scale with the sequence of optimization iterations (the "outer" iterations). The dynamics can be used to analyse the sequence of dynamic systems generated as the optimization iteration proceeds.

On the short time scale, in any single iteration the structure of the network is fixed. The dynamics involves the application of the feature vectors as driving inputs. Since the feature vectors provide a forcing function, this is not an equilibrium system, but a problem where the training data drives the network to different stationary states, with error-correcting feedback to alter the weights to bring the network to a minimum in the recognition error.

On the longer time scale is the optimization iteration process. At each time step of the outer iteration the connectivity of the network is changed by Boltzmann pruning. On this time scale structural stability is the dominant dynamic process. The goal of the outer iteration process is to approach a steady state where the effect of network changes minimizes the feedback error over the training set and thus an overall steady state is achieved.

The link between the study of time-dependent network performance and nonlinear network optimization is provided by the analysis of structural stability. The mathematical correspondence between the structural stability analysis and the linearization used in optimization is that both methods are based on a Taylor series expansion of the nonlinear system about the point of interest. This expansion yields a local linear approximation to the nonlinear system. The usual analysis of structural stability (Perko, 1991) for:

$$\frac{du}{dt} = f(u, w) \tag{8}$$

is performed by a local analysis of the eigenspectrum of $Df(u_0, w_0)$ at the point $(u_0, w_0)$. The optimization by gradient-based methods of the nonlinear network is (essentially) performed by linearization of the nonlinear system:

$$[\,\cdots\,] = f(u, w) - f(u_t, w) \tag{9}$$

where the $u_t$ are a set of training examples. The expansion of terms in the right hand side of eqn (9) is used to compute the dynamic change in the error over the training set. The optimization is performed over the sum of all training examples using the second-order term in the expansion, the Hessian matrix, because the Jacobian, the first-order term, goes to zero at the desired minimum point. Similar additional terms are also required for nonlinear stability analysis when the Jacobian is too small.

Both the analysis of local dynamic stability and the analysis of structural stability are based on the expansion of the system of equations in a Taylor series. When dynamic stability is involved, the series is used to provide linear approximations to the right hand side of equations such as eqn (5). When structural stability is involved, the solution is expanded in terms of the critical parameters. For eqn (6) this requires expansion in terms of the $F$ and $G$ matrix elements. Details of this type of analysis are given in chapter 5 of Guckenheimer and Holmes (1983). The expansion yields insights which can be applied to the structural stability of the learning process.

Three changes we have made in training neural networks are directly related to insights we obtained from dynamics. The first is regularization, in which a term, $R$ times the sum of the squares of all the weights, is added to the error function being minimized. Regularization has been shown to act as a smoother when the neural network training process is treated as an approximation problem (Girosi et al., 1995). We can qualitatively evaluate the effect of the quadratic regularization term on the dynamics; such a term acts to constrain the optimization to a region near the origin.
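The regularized error function just described can be written down in a few lines. The sketch below is only an illustration: the function names, the sample outputs, targets and weights, and the value of the regularization constant are assumptions, and the sum-of-squares data term follows the common choice of $E$ mentioned in the Introduction. The gradient of the penalty, $2Rw$, points away from the origin, so a descent step on it pulls every weight toward the origin.

```python
import numpy as np

def regularized_error(outputs, targets, weights, reg):
    """Sum-of-squares output error plus reg times the sum of squared weights."""
    data_error = np.sum((outputs - targets) ** 2)
    penalty = reg * np.sum(weights ** 2)
    return data_error + penalty

def penalty_gradient(weights, reg):
    """Gradient of the penalty term only: d/dw (reg * sum(w^2)) = 2 * reg * w."""
    return 2.0 * reg * weights

outputs = np.array([0.9, 0.1])   # hypothetical output node values
targets = np.array([1.0, 0.0])   # desired output node values
weights = np.array([3.0, -4.0])  # hypothetical weights
reg = 0.01                       # hypothetical regularization constant R

total = regularized_error(outputs, targets, weights, reg)
grad = penalty_gradient(weights, reg)
```

Here the data term is 0.02 and the penalty is 0.01 × 25 = 0.25, so for these made-up values the penalty dominates and the weight vector would be drawn inward.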
In our highly fractal feature spaces (Wilson et al., 1996), the smaller the dynamic volume, the more effective the features will be in spanning the training space, as is demonstrated from the approximation point of view in Girosi et al. (1995). For our OCR problem, as stated earlier, the local dimension of the features is about 10.5, and the large networks used here have 96 K–L features. That is, the training examples sample well a feature subspace of dimension 10.5. Full sampling of the feature space, with at least one point in every hyper-quadrant, would require $2^{96}$ samples. The undersampling increases roughly as the power of the difference in the global and the local dimension. For the OCR problem this is 96 − 10.5, so that the undersampling increases at least as fast as the volume calculated from the feature space radius. The large difference in local and global dimensionality indicates that adding a practical number of training examples (that might expand the feature set volume somewhat but would not alter the local or global dimensionality of the feature set) is unlikely to produce a significant improvement in network performance.

The second insight we can obtain from dynamics is that, as we expand the Jacobian of eqn (9), any very small nonlinear terms will be unimportant compared to the noise in real training data. The calculation of the K–L transform involves, for the OCR problem, calculation of a mean image value for every location in the image. Since the values used to calculate the mean are binary, the mean for an $N$-example training set can only take on $N$ values and is only known to $\log_2 N$ bits. Even with perfect training data the K–L features have this finite precision. Ambiguous characters, or characters that are well formed but which have noisy edge pixels, add additional uncertainty, giving uncertainty throughout the dynamic calculation. Small changes in outputs in network dynamics caused by training data noise are indistinguishable from small changes caused by real but small dynamic terms.

For MLPs with sigmoidal activation functions, these small terms occur for both large positive and large negative values of the sum of the weighted inputs, since all derivatives of the activation function approach zero exponentially. For the data used here, the K–L transform normalizes the input signals to a uniform dynamic range; the neural network is not needed to perform this task. Very large or very small signals to the nodes are the result of very large or very small weight values.
Using an alternative form of the activation function, a sine function as in eqn (11), solves the problem of insensitivity of derivatives by providing hidden nodal signals which have significant derivatives for all input patterns.

The third insight provided by dynamic stability considerations involves the dynamics of small weights, that is, weights near the origin in weight space. In weight space, the stability analysis of these variables involves small real eigenvalues which will be dominated by weight oscillations. The dynamics of these weights are associated with dynamic processes that have small real parts and are therefore near the centre manifold. For the OCR problem, for example, the dynamics near the centre manifold result partly from an attempt by the optimization process to make fine character distinctions on large signals well above the noise level, and partly from small weights interacting with the noise. Since these two effects are indistinguishable and create very complex dynamics, it is better to set the near-zero weights to zero, forcing these dynamic processes out of the training.
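The contrast between sigmoidal and sine activations drawn in the second insight can be seen numerically. The sketch below is an illustration under stated assumptions: the sample values of the summed weighted input are arbitrary, and the logistic function is taken as the sigmoid. The logistic derivative decays exponentially for large inputs of either sign, whereas the derivative of the sine, a cosine, does not decay with the size of the input (it does still pass through zero at isolated points).

```python
import numpy as np

def sigmoid_derivative(s):
    """Derivative of the logistic function y = 1 / (1 + exp(-s)), i.e. y * (1 - y)."""
    y = 1.0 / (1.0 + np.exp(-s))
    return y * (1.0 - y)

def sine_derivative(s):
    """Derivative of sin(s)."""
    return np.cos(s)

# Hypothetical values of the sum of weighted inputs to a hidden node.
net_inputs = np.array([0.0, 5.0, 20.0, -20.0])

sigmoid_slopes = sigmoid_derivative(net_inputs)  # saturates for large |s|
sine_slopes = sine_derivative(net_inputs)        # stays of order one in magnitude
```

At a summed input of 20 the logistic slope is of order $10^{-9}$, far below the roughly 12 to 13 bits of precision available in the data, while the sine slope there is about 0.41.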


The three ideas combine to yield a training method which is smoothed on the exterior, large weight, region by regularization. At the same time the weights are pruned in the interior, small weight, region by Boltzmann pruning. The combination of these two methods greatly restricts the active region of weight space. Weight removal simplifies the problem by reducing the number of degrees of freedom. Restricting the range of weight values to a spherical shell near to but excluding the origin matches the significance of the network to the significance of the training data. In the region between constraints, dynamic stability is enhanced by choosing activation functions with nonzero derivatives throughout the space.

After this extended discussion, one may wonder if this type of recognition problem is unique to OCR and fingerprint problems. The authors would argue that it is not. The characters used in this work are normalized to 32 × 32 pixel binary images with 1024 pixels each. Characters form a very small subset of all $2^{1024}$ possible 32 × 32 pixel images. The point of regularization is to confine the training dynamics to a region of weight space where the images are near the subset of images which are meaningful characters. At the other extreme, we argue that the subset of character images is not dense in feature space, and that many noisy images (that are near character images but are not easily recognizable characters) are found in large training sets. The complex dynamics associated with the noisy images are pruned away by the Boltzmann process. These properties are not unique to character recognition. They are even more pronounced for fingerprints because the set of meaningful images is a much smaller fraction of all 512 × 480 8-bit gray images.
The difficulty of matching images to incomplete feature sets should not be surprising, since patterns generated or identified by an image algebra (Grenander, 1993) are a small subset of the patterns which could be generated with arbitrary generators and links in any image algebra. 5. NEURODYNAMICS OF LEARNING The design and implementation of most neural network architectures are based on an analysis of the size and content of the network training data. The direct form of analysis is suitable when the size of the training data is small, the class distribution is uniform, and the local and global dimension of the feature set are approximately equal. In fingerprint and character classification applications, the training sets are large and the local and global ranks of the feature data are very different. The complex structure of the training data requires using large networks, in the order of 104 weights and 102 nodes. If the training process is treated as a dynamical system with the weights as the independent variables, the Jacobian would have at least 108terms. Previous pruning studies on our OCR problem have shown that these

.———.... ....——.. -.—.

—.—

r

y e

9

networks typically contain at least 50% redundant weights (Omidvar & Wilson, 1993), and have confirmed that no more than 12 bits of these weights are significant. Direct analysis of the Jacobian is numerically intractable. To avoid the difficulties in analysing this type of complex low accuracy system, we looked directly at the qualitative properties expected in systems of this kind (Carr, 1981; Hirsch & Smale, 1974; Guckenheimer & Holmes, 1983) and altered the training procedure to take the expected dynamic behaviour into account. This analysis used the dynamical systems approach to provide us with qualitative information about the phase portrait of the system during training, rather than a statistical representation of the weight space of the MLP network. For this approach we considered the training process as an n-dimensional dynamical system (Cohen & Grossberg, 1983), where for a given neuron: d

3

&=



+$(Uj)

+ Ii

(lo)

Ti

where 7i decay time for the unit ~d~j is the input– output transfer function, a sinusoidal function, =

~

~

WijUj),

(11)

j

driven through the $w_{ij}$ interconnection weights, and $I_i$ is the initial input. We effectively reduce the dimension of the problem using the centre manifold approach (Sijbrand, 1985). This approach is similar to the Lyapunov–Schmidt technique (Chow & Hale, 1982), which reduces the dimension of the system from n to the dimension of the centre manifold, which in numerical calculations is equal to the number of calculable non-zero eigenvalues of the Jacobian. Since the number of weights in the typical network is approximately $10^4$ and the number of bits in the feature data is approximately 12, direct numerical methods for calculation of the eigenvalues from the linearized dynamics are very poorly conditioned. The centre manifold method has the advantage over the Lyapunov method in that the reduced problem is still a dynamical system with the same dynamic properties as the original system. The reduction in dimension is implemented using the Boltzmann machine for a scaled conjugate gradient (SCG) learning algorithm. The reduced problem after application of the centre manifold method is still an SCG system. The SCG requires that, at any given point, the performance of the dynamical system be assessable through a certain error function, E. The system parameters are then iteratively adjusted to reduce the error. The reduction in the size of the error can be approximated as follows:

$$ \frac{dw_{ij}}{dt} = -\eta \frac{\partial E}{\partial w_{ij}} \tag{12} $$

where $\eta$ is the learning rate, $w_{ij}$ are the weights, and $E$ is the error. This approach can reduce the error independent of the


content of the particular sample distribution and the size of training data. The result is a saving in training time and improvement in performance without analysis of those network components which make minimal contributions to the learning process.

6. OPTIMIZATION CONSTRAINTS

The improvement in network performance achieved here requires four modifications in the optimization, each of which must be incorporated in the weight and error calculations of the SCG iteration (Blue & Grother, 1992). Each of these constraints alters the dynamics of the training process in a way that simplifies the form of the decision surfaces, which as stated earlier have a global dimension of about 100 with a local dimension of about 10.5. Understanding the topology of this space is useful for developing improved training methods based on dynamics. The four modifications all alter the error surface being optimized by changing the shape or dimension of the error function. All of the modifications take place in the inner loop of the optimization.

6.1. Regularization

Regularization decreases the volume of weight space used in the optimization process by adding an error term which is proportional to the sum of the squares of the weights. The effect is to create a parabolic term in the error function that is centred on the origin, reducing the average magnitude of the weights. A scheduled sequence of regularization values is used which starts with high regularization and decreases until no further change in the form of the error-reject curve is detected. At present this sequence is manually terminated, but automatic termination using a validation set is possible. Constraining the network weights reduces the number of bits in the weights and therefore the amount of information contained in the network, allowing a simplification in the network structure.

6.2. Sine activation

The usual form for the activation function for neural networks is a sigmoidal or logistic function.
This function has small derivatives of all orders for very large positive or negative values of the input signal. Therefore the Jacobian of the inner iteration is singular within computational error (Saarinen et al., 1993). The Jacobian has large numbers of near-zero eigenvalues; the optimization is dominated by centre manifold dynamics (Carr, 1981; Sijbrand, 1985). Changing the activation function to a sinusoidal function creates a significant change in the dynamics of the training since the activation function has significant derivatives for all possible input signals. Using sinusoidal activation functions improves network training dynamics and results in better error-reject performance and simpler networks.

6.3. Boltzmann pruning

Boltzmann pruning has two effects on the training process. First, it takes small dynamic components that have small real eigenvalues which are near the centre manifold, and places them on the centre manifold. This simplifies training dynamics by reducing weight space dimension. Second, Boltzmann pruning keeps the information content of the weights bounded at values which are equal to or less than the information content of the training set. When K–L features are derived from binary images, the significance of the feature is no greater than the number of significant bits in the mean image value. Boltzmann pruning forces this constraint on the weights (Omidvar & Wilson, 1993). When Boltzmann pruning was used alone, detailed annealing schedules were used to ensure convergence of the training process. When regularization is combined with pruning, an annealing schedule is not needed, and pruning can proceed concurrently with the regularization process, starting at the beginning of the training process. The cost of pruning is only a small extra computation associated with the weight removal.

6.4. Class-based error weights

In problems with widely variant prior class probabilities, such as fingerprint classification, it may be necessary to provide large samples of rare classes so that class statistics are accurately represented, but it is important to train the classifier with the correct prior class probabilities. This is discussed in chapter 7 of Kohonen (1988). In the SCG method used here, both the network errors and the error signals used in the control of the iteration must be calculated using class weights, thus:

~

W ~

+R

X W t

(
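As a hedged illustration of how class weights and a regularization term might enter the error, the sketch below computes a class-weighted squared error plus a weight-decay term. The function names (`prior_correcting_weights`, `class_weighted_error`), the choice of class weight (true prior divided by the class's share of the training sample), and the normalizations are assumptions made for illustration; they are not taken from the SCG code used in the paper.

```python
def prior_correcting_weights(sample_counts, priors):
    """Weight each class by (true prior) / (share of the training sample).

    Assumed form: a rare class that was over-sampled gets a weight below 1,
    so the weighted error reflects the true class priors.
    """
    total = sum(sample_counts.values())
    return {c: priors[c] / (sample_counts[c] / total) for c in sample_counts}


def class_weighted_error(outputs, targets, labels, class_weights, weights, R):
    """Class-weighted mean squared error plus R times the mean squared weight."""
    weighted_sum = 0.0
    weight_total = 0.0
    for out, tgt, label in zip(outputs, targets, labels):
        cw = class_weights[label]
        weighted_sum += cw * sum((o - t) ** 2 for o, t in zip(out, tgt))
        weight_total += cw
    data_term = weighted_sum / weight_total
    decay_term = R * sum(w * w for w in weights) / len(weights)
    return data_term + decay_term


# Two classes sampled equally, although class "A" is nine times more common.
class_w = prior_correcting_weights({"A": 50, "B": 50}, {"A": 0.9, "B": 0.1})
outputs = [[1.0, 0.0], [1.0, 0.0]]   # hypothetical network outputs
targets = [[1.0, 0.0], [0.0, 1.0]]   # desired outputs
labels = ["A", "B"]                  # true classes; only the "B" sample is wrong
error = class_weighted_error(outputs, targets, labels, class_w,
                             weights=[0.5, -0.5], R=0.0)
```

With equal class weights the same data would give an error of 1.0; down-weighting the over-sampled class "B" reduces the error to 0.2, so a mistake on a rare class costs in proportion to how often that class occurs in practice.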

This formulation ensures that the optimization is performed in a way that produces the best solution to the global problem but allows reasonable sampling of less common classes. In the digit classification used here, uniformly distributed classes were available. In OCR on full text or in fingerprint classification, the classes are not uniformly distributed and class-based error weights should be used.

7. RESULTS

The training method incorporating all four modifications was used on samples of handprinted digits and fingerprints. The digit sample contained 7480 training and 23,140 testing examples, equally distributed for classes “0” to “9”. The fingerprint data contained 2000 training and 2000 testing samples from NIST database SD4 (Watson & Wilson, 1992). These training and test samples are identical to those used in Blue et al. (1994). The MLPs used here were three-layer networks with 96-96-10 structure and 10,282 weights for the OCR problem, and 112-112-5 structure and 13,221 weights for fingerprints. Training was carried out by selecting a Boltzmann temperature and successively training the network at decreasing values of the regularization factor, R, defined as the constant used in the error function for the weight-related term. Typically the first training sequence used R = 2.0. After each training pass R was reduced in a 2, 1, 0.5, 0.2, ... sequence until the regularization had no significant effect on the error-reject performance of the test sample. As the regularization decreases, the number of weights in the network increases and so does the size of the average weight. The smallest net will occur with high temperature and regularization and the largest with low temperature and regularization. The goal of the study is to find the smallest network that gives the steepest fall in error for a given reject level.

7.1. Distribution of weights
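The training schedule described for Section 7 (a fixed Boltzmann temperature, with training repeated at successively smaller R) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the pruning rule (remove a weight with probability exp(-|w|/T)), the values of `R_SCHEDULE` after 0.2, the stopping tolerance `tol`, and the caller-supplied `train_pass` and `score` functions are all hypothetical stand-ins for the SCG training pass and the error-reject measure.

```python
import math
import random

# The text gives "2, 1, 0.5, 0.2, ..."; the entry after 0.2 is an assumption.
R_SCHEDULE = (2.0, 1.0, 0.5, 0.2, 0.1)


def boltzmann_prune(weights, temperature, rng):
    """Zero each weight with probability exp(-|w| / T) (assumed pruning rule)."""
    pruned = []
    for w in weights:
        p_remove = math.exp(-abs(w) / temperature)
        pruned.append(0.0 if rng.random() < p_remove else w)
    return pruned


def train_with_schedule(weights, train_pass, score, temperature,
                        schedule=R_SCHEDULE, tol=1e-3, seed=0):
    """Train at successively smaller R, pruning after each pass.

    train_pass(weights, R) -> weights and score(weights) -> float are
    caller-supplied. Stops early once the score changes by less than tol.
    """
    rng = random.Random(seed)
    previous = None
    used = []
    for R in schedule:
        weights = boltzmann_prune(train_pass(weights, R), temperature, rng)
        used.append(R)
        current = score(weights)
        if previous is not None and abs(current - previous) < tol:
            break
        previous = current
    return weights, used


# Toy problem: each weight has an unregularized optimum in `optima`.
optima = [1.5, -0.8, 0.0002, 0.0]


def toy_pass(weights, R):
    # Minimizer of 0.5*(w - t)^2 + 0.5*R*w^2 is t / (1 + R); pruned weights stay zero.
    return [0.0 if w == 0.0 else t / (1.0 + R) for w, t in zip(weights, optima)]


def toy_score(weights):
    return sum((w - t) ** 2 for w, t in zip(weights, optima))


final, used = train_with_schedule([1.0, 1.0, 1.0, 0.0], toy_pass, toy_score,
                                  temperature=0.001)
```

In this toy run the large weights survive pruning at every pass and grow as R falls, which mirrors the statement that the average weight size increases as the regularization decreases.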

We have developed various methods for evaluation of network dynamics based on the statistical properties of the weights. This is necessary both because the number of computationally identical fully connected nets is very large, 96! and 112!, and because there are more than 10,000 interconnecting weights. To simplify the discussion we will show the weight analysis for the OCR problem only. Identical arguments apply to the fingerprint case.

7.1.1. Network topology. Figure 4 and Figure 5 show the distribution of interconnecting weights for two OCR networks trained at a temperature of T = 0.001, one at high regularization, R = 2.0, and one at low regularization, R = 0.1. A fully-connected network has 10,282 weights (100%), the R = 2.0 network has 2365 weights (23%), and the R = 0.1 network has 3139 weights (31%). The vertical axis in Figures 4 and 5 is the input node number, and the horizontal axis is the hidden node number. Since K–L features are used, the input nodes with the smallest numbers have the greatest statistical significance. These nodes are densely connected; the degree of connectivity decreases with increasing input node number and decreasing statistical significance. In these experiments, hidden layer nodes are always fully connected to the output nodes. Their connectivity is not forced; the pruning process does not select these weights for removal. The significant change seen in the interconnection pattern in going from high to low regularization is the increase in connectivity of high-number K–L features.

The 96th input node (top line of Figures 4 and 5) is not connected at R = 2.0, and has two connections at R = 0.1. There are 12 connections to input node 30 at R = 2.0 and 37 at R = 0.1. The total number of nonzero weights increases by 35% and the number of connections (nonzero weights) to input node 30 increases by 61%. Nodes higher than number 30 get a disproportionately large part of the increased connectivity.
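The weight totals and surviving fractions quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the quoted totals count one bias per hidden node and per output node in addition to the interconnection weights; the function names are illustrative only.

```python
def mlp_weight_count(n_in, n_hidden, n_out):
    """Weights of a fully connected three-layer MLP, with one bias per
    hidden node and per output node (assumed accounting)."""
    return n_in * n_hidden + n_hidden + n_hidden * n_out + n_out


def connections_per_input(input_to_hidden):
    """Count nonzero weights leaving each input node (rows = input nodes)."""
    return [sum(1 for w in row if w != 0.0) for row in input_to_hidden]


ocr_total = mlp_weight_count(96, 96, 10)            # 10282
fingerprint_total = mlp_weight_count(112, 112, 5)   # 13221

# Fraction of the OCR network's weights that survive pruning at T = 0.001.
for R, kept in ((2.0, 2365), (0.1, 3139)):
    print(f"R = {R}: {kept} of {ocr_total} weights ({100 * kept / ocr_total:.0f}%)")
```

Under this accounting the totals match the 10,282 and 13,221 weights given for the two architectures, and the surviving fractions round to the 23% and 31% quoted in the text.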

7.1.2. W[...] p[...]. In addition to altering the topology of the network, the training process changes the distribution of the weights. The distribution of weights, with intervals of width 0.05, is shown in Figure 6 and Figure 7 for temperatures of $10^{-3}$ and $10^{-5}$. The change in the magnitude of the weights is more sensitive to regularization than to temperature. While the number of weights is reduced by a factor of 2.76, from 8653 to 3134, the peak height of the distribution is reduced by a factor of 6.0, from 3369 to 565. Reducing the regularization doubles the mean weight size at both temperatures. At the lower temperature, regularization created large numbers of small weights on low-order K–L features. These weights are no more effective in the classification process than the reduced numbers of larger weights produced at a higher temperature.

7.1.3. Error-reject performance. Figure 8 and Figure 9 show the error-reject performance of ten different networks, all combinations of two temperatures and five values of R. (A data point at p% rejection is found by eliminating the p% of the testing samples with the lowest maximum output activation, and calculating the classification errors of the remaining testing samples.) Our study involved eight temperatures and six regularization values, or 48 networks, for OCR; we present these ten networks as sufficient to illustrate the process. The significant factor here is that the best error-reject performances for the two temperatures are very similar. Both R = 0.1 networks reach 0.1% error after rejecting about 17% of the characters despite having different topology, mean weight sizes different by a factor of 2.17, and numbers of weights different by a factor of 2.76. Since the total information content of the network is proportional to the log base 2 of the product of the weight size and the number of weights (Omidvar & Wilson, 1992), the information content decreases by only 23%, which is fully compensated for by improved network topology.

7.2. Digit recognition

Figure 10 compares the results of digit recognition using MLP networks with sinusoidal and sigmoidal activation functions and a PNN network. Both MLPs were trained using successive regularization and Boltzmann pruning, as described above. The zero-reject error rates are 3.34% for the sigmoidal MLP, 2.54% for the PNN, and 2.45% for the sinusoidal MLP. The sinusoidal result is the best yet achieved on this data and is comparable to human performance (Geist et al., 1994). The slope of the curves is initially proportional to their accuracy, but at higher reject rates the sinusoidal MLP has substantially better performance, indicating simpler decision surfaces.

7.3. Fingerprint classification

Figure 11 compares the results of fingerprint classification using MLP networks with sinusoidal and sigmoidal activation functions and a PNN network. Both MLPs were trained using successive regularization and Boltzmann pruning, as described above. The zero-reject error rates are 9.2% for the sigmoidal MLP, 7.2% for the PNN, and 7.8% for the sinusoidal MLP. The sinusoidal result is not as low as PNN at zero reject, indicating that the decision surfaces required for this problem are more complex than those of the less difficult digit recognition problem. The slope of the reject-error curves is not proportional to the accuracy, not even at zero reject. However, at the higher reject rates the sinusoidal MLP has substantially better performance, indicating that simpler decision surfaces provide better confidence estimates for generating the error-reject curve. At 10% reject the error rates have changed to 5.45% for the sigmoidal MLP, 4.96% for the PNN, and 3.43% for the sinusoidal MLP. The better error-reject curve again demonstrates that the decision surfaces generated by the dynamically optimized MLP are simpler than those of the other networks.
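The error-reject curves compared in Sections 7.1.3 to 7.3 follow the procedure stated in Section 7.1.3: reject the testing samples with the lowest maximum output activation, then count errors among the rest. The sketch below illustrates that procedure; the function name, the data layout (one list of output activations per sample), and the rounding of the number of rejected samples are assumptions made for illustration.

```python
def error_reject_curve(activations, labels, reject_fractions):
    """Return (reject fraction, error rate on the remaining samples) pairs.

    activations: one list of output-node activations per testing sample.
    labels: index of the correct output node for each sample.
    """
    scored = []
    for acts, label in zip(activations, labels):
        confidence = max(acts)               # maximum output activation
        predicted = acts.index(confidence)   # winning output node
        scored.append((confidence, predicted != label))
    # Most confident first, so rejected samples are dropped from the end.
    scored.sort(key=lambda item: item[0], reverse=True)

    n = len(scored)
    curve = []
    for fraction in reject_fractions:
        kept = scored[: n - int(round(fraction * n))]
        errors = sum(1 for _, wrong in kept if wrong)
        curve.append((fraction, errors / len(kept) if kept else 0.0))
    return curve


# Four hypothetical testing samples with two output nodes each.
activations = [[0.9, 0.1], [0.6, 0.4], [0.55, 0.45], [0.2, 0.8]]
labels = [0, 1, 0, 1]   # the second sample is misclassified
curve = error_reject_curve(activations, labels, [0.0, 0.25, 0.5])
```

In this toy case the error rate is 1/4 at zero reject, 1/3 after rejecting the least confident sample (which happened to be correct), and 0 after rejecting the two least confident samples. A steeply falling curve therefore depends on the misclassified samples also being the low-confidence ones.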


8. CONCLUSIONS

In this paper we have shown that some relatively low-cost modifications to the MLP training process based on training dynamics can result in lower error and better error-reject performance on difficult classification problems. The changes in training strategy are motivated by an analysis of a simplified recurrent model which illustrates the complexity of a network with feedback signals. These improvements yield less complex decision surfaces. The digit recognition problem was solved with consistently better performance at all reject rates. The more difficult fingerprint classification problem was solved in a way which still showed some advantage for complex PNN decision surfaces at zero reject, but which yielded better performance than PNN after a small percentage of the low confidence classifications were rejected. In all cases the sigmoidal MLPs had more error at all levels of rejection than MLPs with sinusoidal hidden nodes.

REFERENCES

[Reference entries not recoverable from the source text.]

APPENDIX A

[The equations of Appendix A (for the n = 3 case, involving matrix elements $a_{ij}$, products of the form $s_{ik} s_{jk} e^{\lambda_k t}$, gains $g_i$ and inputs $u_i$) are not recoverable from the source text.]