Neural Networks for Pattern Recognition

S. C. KOTHARI AND HEEKUCK OH
Department of Computer Science
Iowa State University
Ames, Iowa

1. Introduction
2. Neural Network Architectures
   2.1 Single-Layer Networks
   2.2 Multilayer Feedforward Networks
   2.3 Feedback Networks
3. Gradient Descent Learning Algorithms
   3.1 Gradient Descent Method
   3.2 Delta Rule
   3.3 Generalized Delta Rule
   3.4 Cascade-Correlation Algorithm
4. Relaxation Learning Algorithms
   4.1 Pseudo-Relaxation Method
   4.2 Learning in Hopfield Networks
   4.3 Learning in BAM
   4.4 Delta Rule versus Relaxation Method
5. Hebbian Learning
6. Performance and Implementation
   6.1 Performance of Associative Memories
   6.2 Applications
7. Conclusions
References
1. Introduction

Fundamental brainwork is what makes the difference in all arts. -Dante
A long-standing dream of mankind came true when the Wright brothers built the first airplane. It was a great triumph: a machine was invented that made it possible to fly. Another grand dream is being pursued today by scientists and engineers all over the world. This time the object of pursuit is a machine that can imitate the brain, the epitome of biological creations.
Neural networks¹ are the present-day prototypes for imitating the intricate functioning of the brain. Neural networks offer a new computing paradigm based on the brain metaphor. A computer performs by executing a program. A program lays down a step-by-step procedure for solving a problem. A computer can repeat a sequence of steps many times until the task is completed, but the steps must be given through a program. A neural network does not need a program; instead, it is presented with examples that specify desired outputs for sample cases. The network adapts its computational behavior according to the given examples. This is referred to as learning from examples.

The network consists of several simple processing elements, called neurons (or nodes), and connections between them. Neurons go through a succession of activations. An activation changes the state of a neuron. The states are represented by suitable numerical values. The activation pattern of a neuron is affected by signals received from other neurons. The effect that a neuron A has on another neuron B depends on the activation state of neuron A and the strength of the connection between A and B. The connection strengths collectively contain the blueprint of the computational behavior of the network. Learning from examples is accomplished by modifying the connection strengths.

Interestingly, humans solve many pattern recognition problems almost effortlessly without having a step-by-step procedure for solving the specific problem. Without such a procedure we cannot effectively program computers to solve these problems. Conventional template matching techniques may be used, but they fall short in many ways. Humans seem to learn by studying examples of patterns to be recognized. The same approach is adopted by the "learning from examples" paradigm of neural networks. There is another aspect of neural networks that concurs with a key feature of human pattern recognition ability. The brain can deal effectively with noisy or incomplete images of patterns. Similar properties have been demonstrated for neural networks.

Hopfield (1982, 1984) introduced a neural network with many interesting properties. The Hopfield model can store a number of patterns and retrieve the correct pattern starting from a noisy or an incomplete image. Hopfield proposed the neural network as a model of biological memory. His network model relies on a distributed representation of memory in which multiple patterns are superimposed and stored together. A conventional computer memory uses different cells to store different bits of information. To retrieve information, the memory cell address is specified. In such an organization, contents of a memory cell are lost if a cell becomes defective.
¹ The term artificial neural network is sometimes used.
Alternatively, the Hopfield network can reconstruct missing bits of information from other available bits of information. In a sense, bits of a pattern are used as clues for retrieving the pattern. As long as sufficiently many clues exist, the pattern can be retrieved correctly.

Hopfield contributed another important idea by establishing a new connection between computation and physics. He introduced the concept of energy in his model and proved that computations can be viewed as transitions to lower energy states. This observation opens up possibilities for a new technology to build computers. Physical systems have a natural tendency to move to a lower energy state. For example, a stretched rubber band, if left alone, returns to its original shape, a lower energy state. In fact, Durbin and Willshaw (1987) have proposed a novel approach using an elastic net method to solve the traveling salesperson problem, considered to be an extremely difficult problem for digital computers. Hopfield suggested an analog implementation for his network. Presently, neural network chips are being developed using both analog and digital technology.

The modern history of neural networks goes back to the work of McCulloch and Pitts (1943). They proposed a model of neural networks to show how such networks could process information. Later, a rigorous concept of learning was introduced by Rosenblatt (1962). He provided a powerful learning algorithm for a class of networks that he called perceptrons. A perceptron can learn anything that it can compute, but unfortunately what it can compute is quite limited. The limitation of the perceptron turned out to be a major setback for further research in neural networks.

The interest in neural networks was rejuvenated in the 1980s as two new neural network architectures came on the scene. The first was the Hopfield model and the second was the backpropagation network. It was realized early on that if a perceptron is enhanced by incorporating multiple layers of processing elements, then its limitation goes away. However, no one knew how to use such a structure because of the lack of a learning algorithm. The perceptron learning algorithm works only for a single-layer network. The backpropagation algorithm introduced a learning technique for multilayer networks. The backpropagation technique was independently discovered by three different groups of researchers. However, it is mainly due to the efforts of the PDP research group (Rumelhart and McClelland, 1986) that the technique has gained wide acceptance. The group has carried out extensive research in neural networks in the context of a broad approach to cognition.

Neural networks are being applied to a wide variety of problems. A survey of several applications is provided in the report of the study conducted by DARPA (DARPA, 1988). The first application was pioneered by Bernard Widrow in the 1950s. He developed a device called ADALINE (Widrow et al., 1960; Widrow and Hoff, 1960) for adaptive noise filtering. Even today,
the device is widely used for eliminating echoes in long-distance telephone calls. More recently, Sejnowski and Rosenberg (1986) developed NETtalk, a program that teaches itself to read aloud. The program uses backpropagation for learning. After training, the system can achieve a high performance level, reading with few mistakes. NETtalk has become a landmark application of the neural network technology.

Along with success stories, one also needs to keep in mind that neural networks have limitations. Not all pattern recognition problems can be handled effectively by neural networks. We hope that a better understanding of the scope and limitations of neural networks will develop in the course of time as more studies are carried out.

This chapter provides an account of different neural network architectures for pattern recognition, and while no prior knowledge of neural networks is assumed, the account goes beyond the introductory material. The chapter brings together and presents a cohesive view of some of the material presently available only in scattered form, requiring a collection of technical articles. We discuss the gradient descent and the relaxation method as the two underlying mathematical themes for deriving learning algorithms. Neural network architectures are described in Section 2, followed by a discussion of learning algorithms in Sections 3, 4, and 5. Performance and implementation issues are discussed in Section 6.
2. Neural Network Architectures
A neural network, in general, consists of several simple processing elements called neurons. Each neuron is connected to some of the other neurons and possibly to the input nodes. A real-valued weight is associated with each connection to represent its strength. Each neuron has an activation function whose output determines the state of the neuron. The state information is communicated to the neighboring neurons. A neuron receives signals from other neurons and from input nodes that are directly connected to it. An activation function f produces an output based on the weighted sum of inputs and the threshold $\theta$. As shown in Fig. 1, the threshold can be modeled using an extra input node with constant value $-1$. Mathematically, given a signal vector $(X_1, X_2, \ldots, X_N)$, a weight vector $(W_1, W_2, \ldots, W_N)$, and a threshold $\theta$, the output Y is computed as $f(\text{net-input})$, where f is the activation function and the net-input is equal to $\sum_{i=1}^{N} W_i X_i - \theta$. It is common to assume that all neurons use the same activation function.

Four commonly used activation functions are shown in Fig. 2. All of these functions are monotonically increasing.
Fig. 1. A single-layer network for two-way classification.

Fig. 2. Four commonly used activation functions.
In this chapter we will not consider nonmonotonic activation functions such as radial basis functions. The activation functions are required to be nonlinear; otherwise, the computing power of the network is limited. Inputs and the output can be real, binary (0 and 1), or bipolar ($-1$ and 1). Typically, binary or bipolar values are used in associative memory applications.
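To make the computation of a single neuron concrete, the following Python sketch (our own illustration, not part of the original presentation; the sample values are arbitrary) computes the net-input and applies a step or a sigmoid activation:

import numpy as np

def step(net):
    # Bipolar step activation: +1 if net-input >= 0, -1 otherwise.
    return 1.0 if net >= 0 else -1.0

def sigmoid(net):
    # Sigmoid activation: a smooth, differentiable approximation of the step.
    return 1.0 / (1.0 + np.exp(-net))

def neuron_output(x, w, theta, f):
    # net-input = sum_i W_i X_i - theta; the output is f(net-input).
    net = np.dot(w, x) - theta
    return f(net)

x = np.array([0.5, -1.0, 0.25])   # signal vector (X1, X2, X3)
w = np.array([0.8, 0.2, -0.4])    # weight vector (W1, W2, W3)
print(neuron_output(x, w, 0.1, step))
print(neuron_output(x, w, 0.1, sigmoid))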
2.1 Single-Layer Networks

A single-layer network with only one output is shown in Fig. 1. The network consists of several input nodes with connections to the output neuron. Although the network consists of two layers of nodes, it has only one layer of processing elements, hence the terminology single-layer network.
2.1.1 Perceptron
A single-layer network using the step activation function is known as the perceptron. After McCulloch and Pitts (1943) introduced a neural network model of computing, the major problem was to understand how such networks could learn. Hebb (1949) is credited with proposing a biologically plausible explanation of learning, as the strengthening of synaptic connections when both nodes are simultaneously active. The essence of his ideas persists today as a well-known paradigm called the Hebbian rule. Later, Rosenblatt (1962) introduced perceptrons and stated the "perceptron learning theorem," which marked the real beginning of a rigorous study of learning in neural networks.

A perceptron with a single output node can be used to solve a two-way classification problem where the set of inputs $I \subset \Re^N$ is partitioned into two classes $S^+$ and $S^-$. Here $\Re^N$ denotes the N-dimensional space of reals. The two classes are commonly characterized by output values 1 and $-1$. Cutoff points $\alpha < 0$ and $\beta > 0$ are used in the case of a sigmoid activation function: output values greater than $\beta$ characterize the class $S^+$ and values less than $\alpha$ characterize the class $S^-$. A partition $(S^+, S^-)$ of the set I is called linearly separable if and only if the sets $S^+$ and $S^-$ can be separated by a hyperplane in $\Re^N$. A perceptron, irrespective of the activation function, can classify inputs correctly if and only if the two classes of inputs are linearly separable. This observation, made by Minsky and Papert (1969), points to a limitation of the perceptron as a two-way classifier.

Certain classification problems involving multiple classes can be solved by the perceptron architecture with multiple output nodes, as shown in Fig. 3. Multiple classes are represented by a suitable encoding of outputs. The limitations arising from the linear separability considerations carry over to multiway classification using a single-layer network. A general multiway classification problem where different classes of inputs are not separated by linear hyperplanes cannot be solved by a single-layer network.
Fig. 3. A single-layer network for multiway classification.
Next, we describe an approach to overcome this limitation while maintaining a single-layer architecture.
2.1.2 Functional-Link Net

A functional-link net (Pao, 1989) is a single-layer architecture like the perceptron but incorporating additional nodes along with the input nodes. These nodes provide additional inputs that are nonlinear functions of the original inputs. Since the single-layer structure is maintained, the perceptron learning algorithm can still be used for training. The functional-link net concept is simple, but it can work effectively for some problems. For example, the XOR problem cannot be solved by a single-layer network with two input nodes because the problem is not linearly separable. The problem can be solved by a functional-link net with an additional input representing the product of the original input variables. The functional-link net approach is based on the mathematical principle that linear separation becomes possible by embedding the given sets in a higher-dimensional space. The original input sets $\{(1, 0), (0, 1)\}$ and $\{(0, 0), (1, 1)\}$, which are not linearly separable in two-dimensional space, become separable in three-dimensional space by the embedding $(X_1, X_2) \mapsto (X_1, X_2, X_1 X_2)$. Two neural network architectures for the XOR problem, a functional-link net and a multilayer feedforward network, are shown in Fig. 4. Powerful single-layer training procedures like the perceptron algorithm can be used for functional-link nets after the embedding is selected. However, how to select a suitable embedding is an important issue that remains to be answered.

Fig. 4. Two architectures for solving the XOR problem: (a) functional-link net; (b) multilayer feedforward net.
Also, the dimensionality of the embedding can be very high, requiring a large number of additional nodes.
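To illustrate, the following Python sketch (our own; the learning rate, iteration limit, and bipolar encoding are arbitrary choices) augments each XOR input with the product $X_1 X_2$ and trains a perceptron on the embedded three-dimensional patterns:

import numpy as np

# XOR with bipolar targets: class -1 for {(1,0),(0,1)}, class +1 for {(0,0),(1,1)}
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([1, -1, -1, 1], dtype=float)

# Functional-link embedding: (X1, X2) -> (X1, X2, X1*X2)
embedded = np.column_stack([inputs, inputs[:, 0] * inputs[:, 1]])

w = np.zeros(3)      # weights for the three embedded inputs
theta = 0.0          # threshold
eta = 0.1            # learning rate

for epoch in range(100):
    errors = 0
    for x, t in zip(embedded, targets):
        y = 1.0 if np.dot(w, x) - theta >= 0 else -1.0
        if y != t:                  # perceptron rule: update on mistakes only
            w += eta * t * x
            theta -= eta * t
            errors += 1
    if errors == 0:                 # separable in three dimensions, so this is reached
        break

print("weights:", w, "threshold:", theta)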
2.2 Multilayer Feedforward Networks
Backpropagation (BP) is currently the most widely applied neural network paradigm. The backpropagation network was independently discovered by different groups (Parker, 1985; Le Cun, 1985; Rumelhart and McClelland, 1986) over the period 1969 to 1986. The architecture gained wide recognition due to the efforts of the PDP group (Rumelhart and McClelland, 1986). We will describe a commonly used version of the backpropagation architecture. The BP architecture consists of fully interconnected layers of processing nodes, as shown in Fig. 5. Each processing node receives the weighted sum of output signals from nodes of the previous layer. The first layer of the network consists of input nodes; it distributes an input vector to all of the nodes of the second layer. The final layer produces the output of the network. The in-between layers are called hidden layers because they are not directly connected to the outside environment. Besides the feedforward connections, error feedback connections connect each node to the hidden nodes in the layer below it. These connections are used to propagate errors from one layer to the previous layer. The backpropagation network functions in two stages:
Forward sweep: Output signals of one layer are transmitted to the next layer. Starting from the input layer, the forward sweep continues until the final output layer produces the network's estimate of the desired output.

Backward sweep: Each node in the output layer computes the error by comparing the network's output with the target output supplied from the environment. The error signal from each output node is transmitted to all nodes in the hidden layer immediately below the output layer. Successively, error signals are generated and transmitted to the previous layer by all nodes in each hidden layer until the first hidden layer is reached. The error terms generated by a layer are used to update the weights of the forward connections from the layer below it.

Fig. 5. Backpropagation architecture.

After all of the input/output training pairs are presented and the network completes the forward and the backward sweep for each training pair, one learning epoch is said to be completed. Several epochs may be needed until the network becomes "operational" and can emit outputs with the desired accuracy. Once the network is operational, it can be applied to novel inputs using only the forward sweep.

The BP architecture implements an algorithm for fitting data to a function. The addition of even a single hidden layer makes the backpropagation architecture very powerful as a tool for function approximation and classification problems. The BP architecture provides a generic way to approximate a multidimensional function as a linear combination of functions of a single variable. The mathematical basis for such representations of a function was investigated earlier. In 1957, the following classic result was established:
Theorem 1 (Kolmogorov). Given any n-dimensional real-valued continuous function f defined on the n-dimensional cube, there exist $n(2n + 1)$ continuous, monotonically increasing functions $h_{p,q}$ with the property that there is a continuous real-valued function g such that

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} g\left(\sum_{p=1}^{n} h_{p,q}(x_p)\right).$$

As pointed out by Hecht-Nielsen (1987, 1989), Kolmogorov's theorem can be readily adapted as a statement that any continuous mapping $f: [0, 1]^n \mapsto \Re^m$ is exactly implemented by a neural network with a single hidden layer with $(2n + 1)$ hidden nodes, with n nodes in the input layer and m nodes in the output layer. Kolmogorov's representation is exact but involves different functions $h_{p,q}$. BP architectures typically use the same nonlinear activation function at each processing node. Cybenko (1988, 1989) has proved that a BP architecture with the sigmoidal activation function and only one hidden layer can uniformly approximate any continuous function defined on $[0, 1]^n$.
Cybenko's result does not include a bound on the number of nodes in the hidden layer. The question of whether a reasonable limit exists on the number of hidden nodes is an important feasibility issue that remains to be addressed. Valiant (1984) has investigated related theoretical issues.

In backpropagation, weights are adjusted in a network of fixed topology. An a priori determination of an appropriate topology is often not easy. Generative feedforward architectures (Ash, 1989; Fahlman and Lebiere, 1990) have been proposed to determine the network topology adaptively, relying on incremental addition of hidden nodes. As an example of a generative feedforward architecture, we will describe Fahlman's cascade-correlation architecture (Fahlman and Lebiere, 1990). The architecture begins with no hidden nodes and eventually constructs a multilayer network with a cascade of hidden nodes. Hidden nodes are added one by one. Each new hidden node receives a connection from each of the network's original inputs and also from the pre-existing hidden nodes. The new node's input weights are frozen at the time the node is installed. Each new node adds a new one-node layer to the network. Once a new node is installed, weights on connections to output nodes are determined using a single-layer algorithm.
2.3 Feedback Networks

Feedback networks are capable of settling to a solution by gradually solving a complex set of constraints. As such, feedback mechanisms can help to filter noise as a pattern goes through successive updates of the state of the network. The problem is usually given to the network either by initial conditions or as a fixed external input representing a pattern to be recognized. The answer is given by the state of the network after it stabilizes.
2.3.1 Hopfield Networks

We describe a discrete version of the Hopfield network consisting of N two-state neurons. The state of the network is represented by the vector X, where $X \in \{-1, +1\}^N$. Let $W_{ij}$ be the connection strength between the ith and the jth neurons, and $\theta_i$ be the threshold for the ith neuron. The weight matrix is assumed to be symmetric, with diagonal terms equal to 0. Variations of the Hopfield model have been investigated using different conditions on the weight matrix (Bruck, 1990). Let $X = (X_1, X_2, \ldots, X_N)$ be a state vector. Information stored in the network is retrieved by repeated application of the following updating rule until the state of the network stabilizes:

$$X_i = \begin{cases} +1 & \text{if } \sum_{j=1}^{N} W_{ij} X_j - \theta_i \geq 0, \\ -1 & \text{if } \sum_{j=1}^{N} W_{ij} X_j - \theta_i < 0. \end{cases}$$

If none of the states $X_i$ change during an update, the network is said to be in a stable state. Applications of Hopfield networks seek to make the stable states correspond to solutions of the given problem. An associative memory application seeks to make the stable states correspond to the patterns being stored. Each successive state of the network is computed from the current state by applying the update rule to a set S of the neurons of the network. Different modes of operation are possible depending on the choice of the set S selected for each update. If only one neuron is selected at a time, then the network is said to operate in a serial mode. If all the neurons are updated at the same time, i.e., $|S| = N$, then the network is said to operate in the fully parallel mode. All the other modes of update, with $1 < |S| < N$, will be called parallel modes. The set S can be chosen at random or according to some deterministic rule. Convergence of the Hopfield model under various modes of operation has been investigated by several researchers. As stated in the following theorems, the symmetry of the weight matrix leads to important convergence properties of the Hopfield network.
Theorem 2 (Hopfield). A Hopfield network with a serial mode of operation always converges to a stable state.

Theorem 3 (Goles et al.). A Hopfield network with the fully parallel mode of operation always converges to a stable state or to a cycle of length two in the state space.

Example. The two-node network with the fully parallel mode of operation, shown in Fig. 6, results in a cycle of length two. We get two oscillating states $(1, -1)$ and $(-1, 1)$ from the initial state $(1, -1)$. Starting with the same initial state but using a serial mode of operation leads to a stable state, as shown in Fig. 7.

Fig. 6. An illustration of oscillating states with fully parallel update.

Fig. 7. A sequence of states resulting from serial update.
In a serial mode of operation, the final stable state depends not only on the initial state but also on the updating sequence of the neurons.

An energy function was introduced by Hopfield to facilitate the study of convergence and other properties of the network. Hopfield (1982) and Goles et al. (1985) have used the following energy function:

$$E(X) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij} X_i X_j + \sum_{i=1}^{N} \theta_i X_i.$$

The energy function is a quadratic mapping from the state space to $\Re$, the set of real numbers. Note that the energy function is uniquely defined by the set of weights and the threshold values. Conversely, an energy function uniquely defines the weights and the threshold values. A stable state of a Hopfield network corresponds to a local minimum of the energy function.
Lemma 1. The energy function either decreases or remains constant with each update in a serial mode of operation.

Proof. Suppose a given state $(X_1, X_2, \ldots, X_i, \ldots, X_N)$ changes to $(X_1, X_2, \ldots, Y_i, \ldots, X_N)$ as a result of updating the ith neuron. Let

$$\Delta E = E(X_1, X_2, \ldots, Y_i, \ldots, X_N) - E(X_1, X_2, \ldots, X_i, \ldots, X_N).$$

We want to show that $\Delta E \leq 0$. Using the expression for the energy function,

$$\Delta E = -(Y_i - X_i) \left(\sum_{j=1}^{N} W_{ij} X_j - \theta_i\right).$$

In view of the updating rule, $(Y_i - X_i) > 0$ implies $(\sum_{j=1}^{N} W_{ij} X_j - \theta_i) \geq 0$, and $(Y_i - X_i) < 0$ implies $(\sum_{j=1}^{N} W_{ij} X_j - \theta_i) < 0$. Thus we get $\Delta E \leq 0$.

The Hopfield model can be viewed as a neural network that performs a local search for a minimum of a quadratic optimization function defined over the state space. Thus, the Hopfield model can be applied to solve a class of optimization problems that can be represented by a quadratic function. Hopfield and Tank (1985) have illustrated the use of the network for solving
the traveling salesperson problem. Following Hopfield's work, several studies were done to investigate solutions of combinatorial optimization problems using neural networks. Interested readers can refer to Hopfield and Tank (1985), Hopfield (1989), Jeffrey and Rosner (1986), Tagliarini and Page (1987), Wilson and Pawley (1988), and Hegde et al. (1988) for further details.
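The serial update rule and the monotone behavior of the energy function are easy to demonstrate in code. The following Python sketch is our own illustration (the random symmetric weight matrix and the cyclic update order are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)

N = 8
A = rng.normal(size=(N, N))
W = (A + A.T) / 2                 # symmetric weight matrix ...
np.fill_diagonal(W, 0.0)          # ... with zero diagonal
theta = np.zeros(N)

def energy(x):
    # E(X) = -1/2 sum_ij W_ij X_i X_j + sum_i theta_i X_i
    return -0.5 * x @ W @ x + theta @ x

x = rng.choice([-1.0, 1.0], size=N)   # arbitrary initial state
stable = False
while not stable:                     # terminates by Theorem 2
    stable = True
    for i in range(N):                # serial mode: one neuron at a time
        new_xi = 1.0 if W[i] @ x - theta[i] >= 0 else -1.0
        if new_xi != x[i]:
            x[i] = new_xi
            stable = False            # a state changed; not yet stable
    print("energy:", energy(x))       # never increases (Lemma 1)

print("stable state:", x)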
2.3.2 Bidirectional Associative Memory

Kosko (1987, 1988a) proposed a neural network model of bidirectional associative memory (BAM). Kosko's model consists of two layers of neurons with feedback and symmetric synaptic connections between the neurons of the two layers. The pattern pairs are stored as bidirectionally stable states of the BAM. As an example, associations between faces and names could be stored and recalled using a BAM. The BAM allows the retrieval of stored data associations from incomplete or noisy patterns. Consider an N-M BAM with N neurons in the first layer and M neurons in the second layer. Let $W = [W_{ij}]$ be the weight matrix, where $W_{ij}$ is the connection strength between the ith neuron in the first layer and the jth neuron in the second layer. Let $\theta_{X_i}$ be the threshold for the ith neuron in the first layer and $\theta_{Y_j}$ be the threshold for the jth neuron in the second layer. The BAM behaves as a heteroassociative content addressable memory, storing and recalling a set of vector pairs $T = \{(X^k, Y^k) \mid k = 1, \ldots, P\}$, where $X^k \in \{-1, +1\}^N$ and $Y^k \in \{-1, +1\}^M$. The recalling procedure in the BAM is nonlinear and employs interlayer feedback. Given an initial vector X (or Y), the recalling process in the BAM reverberates between its two layers until a stable state is reached in finitely many steps, as proved in Kosko (1988a). Unlike the Hopfield model, the BAM does not have oscillating states. During the recall process, neurons update their states using a nonlinear activation function. A neuron examines its net-input of weighted signals from the neurons in the other layer, and the state of the neuron is changed to $+1$ if the net-input is bigger than zero and $-1$ if the net-input is less than zero. Figure 8 illustrates the recalling mechanism of a BAM that stores associations between small letters and capital letters. The illustration shows how, given a noisy input of "A" in one layer, the BAM recalls "a" in the second layer after a few iterations.

Fig. 8. Recognition of "(A, a)".
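The recall reverberation can be sketched in a few lines of Python. This is our own illustration, not Kosko's formulation; in particular, keeping a neuron's previous state when its net-input is exactly zero is one common way to resolve ties:

import numpy as np

def bam_recall(W, theta_x, theta_y, x):
    # Reverberate between the two layers until the pair (x, y) stops changing.
    y = np.where(W.T @ x - theta_y >= 0, 1.0, -1.0)   # first forward pass
    while True:
        net_x = W @ y - theta_x
        x_new = np.where(net_x > 0, 1.0, np.where(net_x < 0, -1.0, x))
        net_y = W.T @ x_new - theta_y
        y_new = np.where(net_y > 0, 1.0, np.where(net_y < 0, -1.0, y))
        if np.array_equal(x_new, x) and np.array_equal(y_new, y):
            return x_new, y_new       # bidirectionally stable state reached
        x, y = x_new, y_new

Here W is the $N \times M$ weight matrix; how to choose W and the thresholds so that the desired pairs are stable is the learning problem taken up in Section 4.3.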
2.3.3 Boltzmann Machine

The Boltzmann machine architecture, proposed by Ackley, Hinton, and Sejnowski (Ackley et al., 1985; Sejnowski et al., 1986; Hinton and Sejnowski, 1986), is similar to the Hopfield model but uses a stochastic rule called simulated annealing both during the updating and the training of the network. The stochastic rule is used to alleviate the problem of getting trapped in a local minimum of the network's energy function. Another new feature of the architecture is that it includes hidden nodes. However, unlike the multilayer feedforward networks, the hidden nodes are not sandwiched between the input and the output layer. Two variations of the Boltzmann machine architecture are shown in Fig. 9 and Fig. 10. In one architecture, called the Boltzmann completion network, all nodes are connected to each other with symmetric weights as in the Hopfield model. In the Boltzmann input-output network, the visible nodes are separated into input and output nodes. There are no connections among the input nodes, and the connections from input nodes to other nodes are unidirectional. All other connections are bidirectional. During the recall process, the input nodes are clamped permanently at given values; only the output nodes and the hidden nodes are updated using the simulated annealing procedure. Consider a network consisting of N nodes. The connection weight between neurons i and j is denoted $W_{ij}$. The state $X_i$ of each neuron is either 1 or 0
Fig. 9. Boltzmann machine completion architecture: the network is fully connected.
Fig. 10. Boltzmann machine input-output architecture: the visible nodes are separated into input and output nodes. Input nodes only have unidirectional connections going to other nodes.
for a binary network. During a recall, all known visible nodes are initially assigned the states specified by the input vector. All hidden nodes and visible nodes not specified by the initial input vector are assigned random binary states. The recalling procedure continues by repeating several processing cycles. A processing cycle uses serial updating of randomly chosen nodes. Nodes are updated until each node has had a preassigned probability p of being selected for update. Each selected node k, regardless of its current state, is assigned state $X_k = 1$ according to the probability distribution

$$P_k = \frac{1}{1 + e^{-net_k / T}},$$

called the Boltzmann distribution, where the variable T is called the temperature and $net_k = \sum_j W_{kj} X_j$. The recalling procedure proceeds according to a specified annealing schedule, which consists of a list of monotonically decreasing temperature values along with a specified number of processing cycles at each temperature.

The motivation for the Boltzmann architecture comes from statistical mechanics, where the purpose of an annealing process is to find the global energy minimum as an object is being cooled down to the ambient temperature. Ackley et al. (1985) employ the following intuitive argument to explain how simulated annealing can be used to escape from local minima of the network energy function. Consider the energy landscape shown in Fig. 11. Suppose that a ball starts at a randomly chosen point on the landscape. Initially the ball may end up in the local minimum A. Imagine the landscape to be enclosed in a box. If we shake the box, the ball may get over the hill and into the global minimum B. The harder we shake, the more likely it is that the ball will be given enough energy to get over the hill. However, violent shaking can also cause the ball to cross the barrier in the wrong direction (from B to A). A good compromise is to start by shaking vigorously and gradually shake more and more gently. High temperature implies large thermal energy and corresponds to vigorous shaking; low temperature corresponds to gentle shaking.

Fig. 11. An energy landscape with two minima.

In the annealing procedure, an object is heated and then gradually cooled down to the ambient temperature, allowing the object to reach thermal equilibrium at each stop along the way. A simulated annealing schedule tries to specify the number of processing cycles required to reach equilibrium at a given temperature. An accurate specification is hard, and often one has to guess the number of processing cycles required to reach equilibrium. Examples of annealing schedules used to solve a few small problems are given by Ackley et al. (1985).
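One processing cycle of the stochastic update, and a small annealing schedule, can be sketched as follows. This is our own illustration; the random network, the clamping pattern, and the schedule values are arbitrary assumptions, and visiting each unclamped node once per cycle in random order is one simple selection rule:

import numpy as np

rng = np.random.default_rng(1)

def processing_cycle(x, W, T, clamped):
    # Serial updating of unclamped nodes, in random order, at temperature T.
    for k in rng.permutation(len(x)):
        if clamped[k]:
            continue
        net_k = W[k] @ x                        # net_k = sum_j W_kj X_j
        p_k = 1.0 / (1.0 + np.exp(-net_k / T))  # Boltzmann distribution
        x[k] = 1.0 if rng.random() < p_k else 0.0
    return x

N = 10
A = rng.normal(size=(N, N))
W = (A + A.T) / 2
np.fill_diagonal(W, 0.0)

x = rng.integers(0, 2, size=N).astype(float)    # random binary start
clamped = np.zeros(N, dtype=bool)
clamped[:3] = True                              # clamp three "input" nodes

schedule = [(10.0, 5), (5.0, 5), (1.0, 10), (0.5, 20)]  # (temperature, cycles)
for T, cycles in schedule:
    for _ in range(cycles):
        x = processing_cycle(x, W, T, clamped)
print("final state:", x)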
3. Gradient Descent Learning Algorithms

Neural networks store information in the form of weights associated with connections between pairs of processing nodes. Instead of being driven by a program, the computing function implemented by a neural network is determined by its weights. Neural networks are said to learn from examples because the weights are determined from a given set of input-output pairs $\{(X^i, Y^i)\}_{i=1,\ldots,P}$. A training set specifies the expected result on a set of examples, and the objective of a learning procedure is to determine weights that construct an appropriate computing function based on the information provided through the training set. Specific learning algorithms can be derived as natural offspring of the underlying mathematical principles.
3.1 Gradient Descent Method

Several neural network learning algorithms are based on the gradient descent method. The classical gradient descent method can only be applied
to a differentiable function. In their original formulation, McCulloch and Pitts (1943) proposed a step function for the activation of neurons. A step function is not differentiable, but it can be approximated by a sigmoidal function that is differentiable. The use of a differentiable activation function makes it possible to apply the gradient descent method. Given a point $P_0$ in an N-dimensional domain and a differentiable function f of N variables, the gradient descent method can be used to reach a minimum of the function f near the point $P_0$. The method works as follows. Calculate the gradient of f at $P_0$ and move from $P_0$ to $P_1$ using a small step in the opposite direction of the gradient vector. The process is repeated by calculating the gradient at $P_i$ and moving from $P_i$ to $P_{i+1}$. Assuming that f has a minimum, the gradient descent principle states that it is possible to reach within any given distance $\epsilon$ of a minimum of f by a finite sequence of points $P_0, P_1, \ldots, P_q$ as defined previously. Because of its central role in several neural network algorithms, we will review the mathematical basis of the gradient descent method. To facilitate understanding, the case of a univariable function is discussed first.

Univariable Case: Let f be a function of a single variable. Let x be an arbitrary value in the domain of f. The following difference equation directly follows from the definition of the derivative:

$$\Delta f = f(x + \Delta x) - f(x) \approx f'(x) \Delta x.$$

For a univariable function the gradient is the derivative, and moving in the direction opposite to the gradient means decrementing (incrementing) x if the derivative is positive (negative). Setting $\Delta x = -\eta f'(x)$ accomplishes moving in the direction opposite to the gradient if $\eta > 0$. The difference equation implies that $f(x + \Delta x) < f(x)$, and thus a succession of changes to x will gradually lead to a minimum of the function f.

Multivariable Case: Let f be a function of N variables. Let $x = (x_1, x_2, \ldots, x_N)$ be an arbitrary point in the domain of f. For a multivariable function the difference equation is given by the inner product of $\Delta x$ and $\nabla f$:

$$\Delta f = \Delta x \cdot \nabla f(x),$$

where $\Delta x = (\Delta x_1, \Delta x_2, \ldots, \Delta x_N)$ and $\nabla f = (\partial f / \partial x_1, \partial f / \partial x_2, \ldots, \partial f / \partial x_N)$ is the gradient of f at x. Gradient descent is accomplished by choosing $\Delta x = -\eta \nabla f(x)$ with $\eta > 0$. The vector $-\nabla f(x)$ gives the direction opposite to the gradient. According to the difference equation, $\Delta f = -\eta \|\nabla f\|^2$,
which implies that f decreases from the point x to $x + \Delta x$. For a vector V, $\|V\|^2$ denotes the square of the norm, computed as the inner product of V with itself. The difference equation is a first-order approximation, and it holds if $\Delta x$ is sufficiently small. Thus, the parameter $\eta$ must be chosen appropriately small. We will not go into the proof of convergence of the gradient descent method, which involves mathematical technicalities and certain conditions on the function f and its domain. Given $x = (x_1, x_2, \ldots, x_N)$, the vector $-\nabla f(x)$ is the direction along which a change in x leads to the steepest descent from the point $[x, f(x)]$ on the multidimensional surface defined by f. If $\Delta x$, the step size, is not sufficiently small, then the gradient descent method may not converge. On the other hand, if the step size is too small, an excessively large number of iterations may be needed to reach a minimum. This presents a difficulty either way unless the step size is appropriately chosen. We will refer to this problem as the "step-size problem." There is another potential problem. The gradient descent method typically reaches a minimum that is near the starting point x. If the function has several local minima, one can get stuck at a local minimum and not find the global minimum using the gradient descent method.

A gradient descent learning algorithm begins with an error function E, which measures the discrepancy between desired outputs and observed outputs of a neural network. A learning algorithm incorporates the gradient descent technique to minimize the error. The delta rule and the generalized delta rule that are described later minimize the error defined as the sum of the squares of the errors over all the training vectors. Weights and thresholds are treated as the variables of the error function. A learning procedure starts with some initial values for the weights and thresholds. During an epoch, all training patterns are presented one at a time to the network, and the error is calculated by observing the output for each pattern. A gradient descent technique is used to modify the weight and threshold values, leading to a revised configuration of the neural network. The process is repeated until the error reaches an acceptable level. Several epochs may be needed to complete the training and get an operational neural network configuration.
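The iteration is compactly expressed in code. The following Python sketch (our own illustration; the quadratic test function, the value of $\eta$, and the stopping tolerance are arbitrary choices) moves opposite to the gradient until the gradient is small:

import numpy as np

def grad_descent(grad_f, p0, eta=0.1, tol=1e-6, max_iter=10000):
    # Repeatedly move a small step opposite to the gradient.
    p = np.asarray(p0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(p)
        if np.linalg.norm(g) < tol:    # close enough to a minimum
            break
        p = p - eta * g                # delta-x = -eta * grad f
    return p

# Example: f(x, y) = (x - 1)^2 + 4 y^2, gradient (2(x - 1), 8y), minimum at (1, 0)
grad_f = lambda p: np.array([2 * (p[0] - 1.0), 8.0 * p[1]])
print(grad_descent(grad_f, [5.0, 3.0]))    # approaches [1, 0]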
3.2 Delta Rule

The delta rule is an iterative learning algorithm for a single-layer network using a linear activation function at each neuron. Let $W = (W_{11}, \ldots, W_{ji}, \ldots, W_{MN})$ and $\Theta = (\theta_1, \ldots, \theta_j, \ldots, \theta_M)$, where M is the
number of output nodes and N is the number of input nodes. Let $Y^k = (Y_1^k, Y_2^k, \ldots, Y_M^k)$ be the desired output corresponding to the kth training pattern $X^k$. Let $O^k = (O_1^k, O_2^k, \ldots, O_M^k)$ be the observed output. Consider the error function $E(W, \Theta) = \sum_{k=1}^{P} E^k(W, \Theta)$, where P is the number of training patterns and $E^k(W, \Theta) = \frac{1}{2} \sum_{j=1}^{M} (Y_j^k - O_j^k)^2$ is the error term for the kth training pattern. The algorithm employs a heuristic procedure that does not calculate exactly the gradient of the error function $E(W, \Theta)$. Weights and thresholds are updated for each training pattern $X^k$ one at a time by calculating the gradient of $E^k(W, \Theta)$ instead of $E(W, \Theta)$. If the error for pattern $X^k$ is not acceptably low, weights and thresholds are changed according to the update rule:

$$(\Delta^k W, \Delta^k \Theta) = -\eta \nabla E^k \quad \text{with } \eta > 0, \tag{3}$$

where $\Delta^k W = (\Delta^k W_{ji})$ and $\Delta^k \Theta = (\Delta^k \theta_j)$ are vectors with components ranging over $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, M$. The changes to be applied to an individual weight $W_{ji}$ and a threshold $\theta_j$ are denoted by $\Delta^k W_{ji}$ and $\Delta^k \theta_j$, respectively. The name "delta rule" or "Widrow-Hoff rule" is commonly used for the scalar form of the previous equation. Thus, the delta rule specifies the change for each individual weight $W_{ji}$ following the presentation of the kth training pattern. Note that

$$\Delta^k W_{ji} = -\eta \frac{\partial E^k}{\partial W_{ji}}, \qquad \Delta^k \theta_j = -\eta \frac{\partial E^k}{\partial \theta_j}. \tag{4}$$

Since $O_j^k = f(\sum_{i=1}^{N} W_{ji} X_i^k - \theta_j)$, it follows that

$$\frac{\partial E^k}{\partial W_{ji}} = -(Y_j^k - O_j^k) f'(\cdot) X_i^k, \qquad \frac{\partial E^k}{\partial \theta_j} = (Y_j^k - O_j^k) f'(\cdot). \tag{5}$$

Since the activation function f is assumed to be linear and monotonically increasing, the derivative of f is a positive constant and need not be calculated explicitly. Denoting $(Y_j^k - O_j^k)$ by $\delta_j^k$, we get the delta rule:

$$\Delta^k W_{ji} = \eta \delta_j^k X_i^k, \qquad \Delta^k \theta_j = -\eta \delta_j^k.$$
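The per-pattern updates translate directly into code. The sketch below is our own illustration of the delta rule for a linear single-layer network (the synthetic training data and the learning rate are arbitrary choices):

import numpy as np

rng = np.random.default_rng(2)

N, M, P = 4, 2, 50
X = rng.normal(size=(P, N))                    # P training patterns
W_true = rng.normal(size=(M, N))
Y = X @ W_true.T                               # targets from a known linear map

W = np.zeros((M, N))                           # weights W_ji
theta = np.zeros(M)                            # thresholds theta_j
eta = 0.05

for epoch in range(200):                       # one epoch = one pass over patterns
    for x, y in zip(X, Y):
        o = W @ x - theta                      # linear activation: O_j = net-input
        delta = y - o                          # delta_j = Y_j - O_j
        W += eta * np.outer(delta, x)          # Delta W_ji = eta * delta_j * X_i
        theta -= eta * delta                   # Delta theta_j = -eta * delta_j
print("max error:", np.abs(Y - (X @ W.T - theta)).max())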
3.3 Generalized Delta Rule

The delta rule implements gradient descent in sum-squared error for a single-layer network. Mathematically, a multilayer feedforward network can be viewed as a composite function. For example, in the case of a single hidden layer, an output can be expressed as a composite function $f(f_1, f_2, \ldots, f_L)$, where f is the function implemented by the output node and $f_j$ is the function implemented by an individual hidden node. A composite function can be differentiated using the chain rule of derivatives. This fact is used in generalizing the delta rule. A calculation of derivatives using the chain rule requires the activation function to be differentiable. Also, hidden nodes must use nonlinear activation functions to provide any advantage over a single-layer network.
Notation

N = number of nodes in the input layer
M = number of nodes in the output layer
L = number of nodes in the hidden layer
P = number of training vectors
S = $\{(X^p, Y^p) \mid p = 1, 2, \ldots, P\}$ is a training set, where $X^p = (X_1^p, X_2^p, \ldots, X_N^p)$ is the pth input vector and $Y^p = (Y_1^p, Y_2^p, \ldots, Y_M^p)$ is the corresponding output vector
$W_{ji}^h$ = weight associated with the connection from the ith input node to the jth hidden node
$W_{kj}^o$ = weight associated with the connection from the jth hidden node to the kth output node
$\theta_j^h$ = threshold at the jth hidden node
$\theta_k^o$ = threshold at the kth output node
$net_j^h(p) = \sum_{i=1}^{N} W_{ji}^h X_i^p - \theta_j^h$ gives the net-input at the jth hidden node for the pth input vector $X^p$
$net_k^o(p) = \sum_{j=1}^{L} W_{kj}^o I_j(p) - \theta_k^o$ gives the net-input at the kth output node for the pth input vector $X^p$
$I_j(p)$ = the output of the jth hidden node for the pth input vector $X^p$
f = the activation function
$O^p = (O_1^p, O_2^p, \ldots, O_M^p)$ is the observed output
$E^p = \frac{1}{2} \sum_{k=1}^{M} (Y_k^p - O_k^p)^2$ is the sum-squared error for the pth training vector

Forward Pass
Step 1: An input vector $X^p$ is applied to the input layer of the network shown in Fig. 5. Subsequently, each ith input node distributes the value $X_i^p$ to all the hidden-layer nodes.

Step 2: The net-input to the jth hidden node is $net_j^h(p)$, which is the weighted sum of the values received from all the input nodes.

Step 3: For $j = 1, 2, \ldots, L$, the jth hidden node produces the output $I_j(p) = f[net_j^h(p)]$ by applying the activation function f to its net-input. The value $I_j(p)$ is distributed to all the nodes in the output layer.

Step 4: The net-input to the kth output node is $net_k^o(p)$, which is the weighted sum of the values received from all the hidden-layer nodes.

Step 5: Each kth output node produces the output $O_k^p = f[net_k^o(p)]$ by applying the activation function f to its net-input.
The forward pass can be easily extended to include multiple hidden layers by repeating Steps 2 and 3 for $h = 1, 2, \ldots, H$, where H is the number of hidden layers. At the end of Step 3, the values are distributed as inputs to the nodes in the next hidden layer.

Backward Pass
Step 1: Calculate the error $E^p$ using the target output $Y^p$ and the observed output $O^p$ resulting from the forward pass.

Step 2: (Updates of output-layer weights.) The weight-threshold vector is updated by $-\eta \nabla E^p$. The parameter $\eta$ is a positive constant. Specifically, $\Delta W_{kj}^o(p)$ is the change in $W_{kj}^o$ based on the pth training vector. Define

$$\delta_k^o(p) = (Y_k^p - O_k^p) f'[net_k^o(p)]. \tag{6}$$

We can then write the weight-update rule as

$$\Delta W_{kj}^o(p) = \eta \delta_k^o(p) I_j(p). \tag{7}$$

Similarly, the threshold-update rule is

$$\Delta \theta_k^o(p) = -\eta \delta_k^o(p). \tag{8}$$
Step 3: (Updates of hidden-layer weights.) The update rule is similar to the rule used in the previous step. Let $\Delta W_{ji}^h(p)$ be the change in $W_{ji}^h$ based on the pth training vector:

$$\Delta W_{ji}^h(p) = -\eta \frac{\partial E^p}{\partial W_{ji}^h}.$$

Since

$$O_k^p = f\left(\sum_{j=1}^{L} W_{kj}^o I_j(p) - \theta_k^o\right)$$

and

$$I_j(p) = f\left(\sum_{i=1}^{N} W_{ji}^h X_i^p - \theta_j^h\right),$$

using the chain rule, we get

$$\frac{\partial E^p}{\partial W_{ji}^h} = -\left[\sum_{k=1}^{M} \delta_k^o(p) W_{kj}^o\right] f'[net_j^h(p)] X_i^p.$$

The concept of "backpropagation" comes from the summation term in the last formula. The error $\delta_k^o(p)$ is propagated back from each kth output node to each node in the hidden layer. Each jth hidden node receives the net error $\sum_{k=1}^{M} \delta_k^o(p) W_{kj}^o$ from the output layer and calculates its error term as:

$$\delta_j^h(p) = f'[net_j^h(p)] \sum_{k=1}^{M} \delta_k^o(p) W_{kj}^o. \tag{9}$$

Thus, the weight update rule is:

$$\Delta W_{ji}^h(p) = \eta \delta_j^h(p) X_i^p \tag{10}$$

$$\Delta \theta_j^h(p) = -\eta \delta_j^h(p). \tag{11}$$
The backward pass can also be extended to include multiple hidden layers. Suppose that the hidden layers are numbered in ascending order going from the input to the output layer. The error values $\delta_j^h(p)$ are propagated back to nodes in the $(h-1)$th layer. Each lth node in the $(h-1)$th layer computes:

$$\delta_l^{h-1}(p) = f'[net_l^{h-1}(p)] \sum_j \delta_j^h(p) W_{jl}^h, \tag{12}$$

where the summation is over all the nodes in the hth hidden layer. The weight update rule used by the $(h-1)$th hidden layer is

$$\Delta W_{li}^{h-1}(p) = \eta \delta_l^{h-1}(p) Z_i^p \tag{13}$$

$$\Delta \theta_l^{h-1}(p) = -\eta \delta_l^{h-1}(p), \tag{14}$$

where $Z_i^p$ is the output of the ith node of the layer below the $(h-1)$th hidden layer. The process of backpropagation of errors is continued from one hidden layer to the previous hidden layer until the first hidden layer is reached.

Backpropagation Algorithm (BP)
Initialize all the weights and thresholds;
start: begin epoch
    for p = 1 to P (for each training vector $X^p$)
        Execute Forward Pass;
        Execute Backward Pass;
end epoch
Calculate the error $E = \sum_{p=1}^{P} E^p$;
if E is acceptably low then stop else goto start
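For concreteness, the forward and backward sweeps of equations (6)-(11) can be written out for one hidden layer as follows. This Python sketch is our own illustration (sigmoid activations throughout, so the error term of equation (16) below applies; the XOR training set, network size, and parameter values are arbitrary, and convergence depends on the random initial weights):

import numpy as np

rng = np.random.default_rng(3)

def f(net):                      # sigmoid activation
    return 1.0 / (1.0 + np.exp(-net))

# XOR training set with binary (0/1) targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

N, L, M, eta = 2, 4, 1, 0.5
Wh = rng.normal(scale=0.5, size=(L, N)); th_h = np.zeros(L)   # hidden layer
Wo = rng.normal(scale=0.5, size=(M, L)); th_o = np.zeros(M)   # output layer

for epoch in range(20000):
    E = 0.0
    for x, y in zip(X, Y):
        # Forward sweep
        I = f(Wh @ x - th_h)                  # hidden outputs I_j(p)
        O = f(Wo @ I - th_o)                  # network outputs O_k^p
        E += 0.5 * np.sum((y - O) ** 2)
        # Backward sweep; f'(net) = O(1 - O) for the sigmoid
        delta_o = (y - O) * O * (1 - O)            # Eq. (6)
        delta_h = I * (1 - I) * (Wo.T @ delta_o)   # Eq. (9)
        Wo += eta * np.outer(delta_o, I); th_o -= eta * delta_o   # Eqs. (7), (8)
        Wh += eta * np.outer(delta_h, x); th_h -= eta * delta_h   # Eqs. (10), (11)
    if E < 1e-3:
        break

print("epochs:", epoch + 1)
for x in X:
    print(x, f(Wo @ f(Wh @ x - th_h) - th_o))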
The sigmoidal activation function is commonly used at hidden nodes and also at the output nodes if bistate output is desired. For real-valued output, a linear activation function may be used at the output nodes. The use of a nonlinear activation function at hidden nodes is essential to recognize patterns that are not linearly separable. Note that if there is no hidden layer and a linear activation function is used, then the weight adaptation rule of BP is the same as the delta rule. Thus, the weight updating rule of BP is indeed a generalization of the delta rule.

By appropriately defining the error terms $\delta_k^o(p)$ and $\delta_j^h(p)$, BP achieves the same formalism for propagating signals in the forward pass and backpropagating errors in the backward pass. The connections between layers are used in one direction in the forward pass and in the opposite direction in the backward pass. In either case, each processing node calculates the weighted sum of values received through the interlayer connections. This nice formalism results from the use of linear neurons instead of higher-order neurons. For example, for quadratic neurons, in which inputs are squared before calculating the weighted sum, the formalism for backpropagating the errors would be different from the formalism for propagating the signals.

A momentum technique is sometimes used to improve the speed of convergence. The technique involves adding a fraction of the previous change when calculating the new weight change. The additional term tends to keep the weight change going in the same direction, hence the name momentum technique. Using the momentum technique, the weight modification rule is

$$\Delta W_{ji}^h(p, t+1) = (1 - \alpha) \eta \delta_j^h(p) X_i^p + \alpha \Delta W_{ji}^h(p, t). \tag{15}$$

The variable t is the iteration counter. A similar modification rule is used for the weights on the output connections. The technique introduces the momentum parameter $\alpha$, which is usually set to a positive value less than one. The following formula is sometimes used for the error term; it results from an algebraic simplification of the derivative of the sigmoid activation function:
$$\delta_k^o(p) = (Y_k^p - O_k^p) O_k^p (1 - O_k^p). \tag{16}$$

3.4 Cascade-Correlation Algorithm
The cascade-correlation algorithm includes a dynamic node generation strategy. The algorithm is described as follows:

Step 1: Begin with a single-layer network with as many input and output nodes as dictated by the problem. Figure 12a illustrates this step, where we begin with a network with three input and two output nodes. Initialize the weights and thresholds.

Step 2: Train the network for a certain number of epochs using a single-layer learning algorithm. The number of epochs is fixed by a control parameter set by the user.

Step 3: Run the network over the entire training set once. Calculate the error. If the error is acceptably low, then stop the learning process. Otherwise, note the change in error as a result of applying Step 2. If the error has been significantly reduced, go back to Step 2; the rationale is that the single-layer algorithm is doing its job and needs to be given more opportunity to improve the error. If the error has not been significantly reduced, then the conclusion is that the given number of nodes is not enough and at least one new hidden node must be added.
Fig. 12. Illustration of cascade-correlation architectures: (a) initial network; (b) addition of one hidden node; (c) addition of two hidden nodes.
Step 4: (Node Creation Algorithm)

1. Begin with a candidate node that receives trainable input connections from all the input nodes and from all pre-existing hidden nodes. The output of this candidate node is not yet connected to any other node. Initialize the threshold and all the weights on input connections for the candidate node.

2. Adjust the threshold and weights on input connections for the newly added candidate node. The goal of this adjustment is to maximize the magnitude of the correlations between the candidate node's output and the residual output errors observed at the output nodes. For the pth pattern, let $V_p$ be the candidate node's output and $E_k(p)$ be the residual output error at the kth output node. The correlation is defined as:

$$S = \sum_{k=1}^{M} \left| \sum_{p=1}^{P} (V_p - \bar{V})(E_k(p) - \bar{E}_k) \right|,$$

where the quantities $\bar{V}$ and $\bar{E}_k$ are the values of $V_p$ and $E_k(p)$ averaged over all patterns. By using gradient ascent to maximize S, we get the adaptation rule:

$$\Delta W_i = \eta \sum_{k=1}^{M} \sum_{p=1}^{P} \sigma_k (E_k(p) - \bar{E}_k) f'_p I_i(p),$$

where $\sigma_k$ is the sign of the correlation between the candidate's output $V_p$ and the kth output, $f'_p$ is the derivative of the candidate node's activation function evaluated for the pth pattern, and $I_i(p)$ is the input the candidate node receives from node i for the pth pattern. The candidate node's threshold is adjusted in a similar way by calculating $\partial S / \partial \theta$. The adaptation rule is applied repeatedly until S stops improving. The new candidate node is installed in the network by freezing its threshold and input weights and connecting it to the outputs. Figure 12b shows the network after the first hidden node is installed. The weights $W_{11}$, $W_{12}$, and $W_{13}$ are frozen, where $W_{ij}$ denotes the weight on the connection from input node j to the ith hidden node.

Step 5: At this point, the network is viewed as a single-layer network with the original input nodes and all hidden nodes on one side and the output nodes on the other side. Figure 12c shows the network after the second hidden node is installed. The weights $W_{2j}$ and the weight on the connection from the first hidden node are frozen at this stage. Reinitialize the thresholds for the output nodes and the weights on connections to output nodes, and go to Step 2 for training the new single-layer network.

One can consider several variations of the algorithm described here. Different strategies can be designed to decide when to invoke the node generation algorithm. Also, instead of a single candidate node, it is possible to use a pool of candidate nodes, each with a different set of random initial weights. These candidate nodes can be trained in parallel, and the one whose correlation score is best can be installed.
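The correlation score S and its gradient-ascent adjustment can be sketched as follows. This is our own illustration of item 2 of Step 4 (the random training data, the candidate's sigmoid activation, and the learning rate are assumptions; the $\bar{V}$ term drops out of the derivative because the centered errors sum to zero over patterns):

import numpy as np

rng = np.random.default_rng(4)

P, n_in, M = 30, 3, 2
I = rng.normal(size=(P, n_in))        # inputs the candidate receives, I_i(p)
E = rng.normal(size=(P, M))           # residual output errors E_k(p)

w = rng.normal(scale=0.1, size=n_in)  # candidate's trainable input weights
theta, eta = 0.0, 0.05

def f(net):
    return 1.0 / (1.0 + np.exp(-net))

for it in range(500):
    V = f(I @ w - theta)              # candidate outputs V_p
    Vc = V - V.mean()                 # V_p - V-bar
    Ec = E - E.mean(axis=0)           # E_k(p) - E-bar_k
    corr = Vc @ Ec                    # one correlation per output node
    S = np.abs(corr).sum()            # the score to maximize
    sigma = np.sign(corr)             # sign of each correlation
    fp = V * (1 - V)                  # f'_p for the sigmoid
    # dS/dw_i = sum_k sum_p sigma_k (E_k(p) - E-bar_k) f'_p I_i(p)
    grad_w = I.T @ (fp * (Ec @ sigma))
    grad_theta = -np.sum(fp * (Ec @ sigma))
    w += eta * grad_w                 # gradient ascent on S
    theta += eta * grad_theta

print("final correlation score S:", S)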
4. Relaxation Learning Algorithms
The relaxation method is a mathematical technique for solving systems of linear inequalities. The method was introduced by Agmon (1954) and Motzkin and Schoenberg (1954). After discussing the mathematical basis of the relaxation method, we will describe its applications to neural networks. The notation $\langle \cdot, \cdot \rangle$ is used to denote the inner product of two vectors and $\| \cdot \|$ to denote the Euclidean norm. Consider a consistent system of m linear inequalities

$$\langle a^i, x \rangle + b_i \geq 0 \quad \text{for } i = 1, \ldots, m, \tag{17}$$

where $a^i \in \Re^n$, $b_i \in \Re$, and $x \in \Re^n$ is a variable vector. Each inequality defines a halfspace $H^i$ in $\Re^n$:

$$H^i = \{x \in \Re^n \mid \langle a^i, x \rangle + b_i \geq 0\}.$$

The feasible solution set for (17) is a convex polyhedron given by:

$$C = \bigcap_{i=1}^{m} H^i.$$

To solve the system of inequalities (17), the relaxation method performs an iterative procedure as follows. Start with an arbitrary point in $\Re^n$. Let $x^q$ be the point at the qth iteration. Suppose $x^q \notin H^i$ for some i. Let $x_p^q$ be the orthogonal projection of $x^q$ on the hyperplane of $H^i$, as illustrated in Fig. 13. Choose the next point $x^{q+1}$ as follows:

$$x^{q+1} = x^q + \lambda (x_p^q - x^q), \tag{18}$$

Fig. 13. Geometric illustration of the relaxation procedure.
where the relaxation factor $\lambda$ is a constant between 0 and 2. Note that $(x_p^q - x^q)$ can be replaced by

$$\frac{|\langle a^i, x^q \rangle + b_i|}{\|a^i\|^2} \, a^i$$

and we get:

$$x^{q+1} = x^q + \lambda \frac{|\langle a^i, x^q \rangle + b_i|}{\|a^i\|^2} \, a^i. \tag{19}$$

We will call the sequence of points $\{x^q\}$ a relaxation sequence. Figure 14 shows an example of a relaxation sequence. The method is called under-relaxation if $0 < \lambda < 1$, over-relaxation if $1 < \lambda < 2$, or the projection method if $\lambda = 1$.

Fig. 14. An example of an over-relaxation sequence.

Let $d(x, H^i)$ denote the Euclidean distance between x and $H^i$. Define

$$d_{max}(x) = \max\{d(x, H^i) \mid i = 1, \ldots, m\}.$$

If the relaxation sequence is such that $d(x^q, H^i) = d_{max}(x^q)$, then the procedure (19) is called the maximal distance relaxation method. Agmon (1954) and Motzkin and Schoenberg (1954) have proven the convergence of the maximal distance relaxation method.
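The procedure is only a few lines of code. The following Python sketch is our own illustration; it cycles through the constraints in a fixed order rather than using the maximal distance rule, and the example system is arbitrary:

import numpy as np

def relaxation(A, b, lam=1.0, max_sweeps=1000):
    # Solve <a_i, x> + b_i >= 0 for all i, starting from the origin.
    x = np.zeros(A.shape[1])
    for _ in range(max_sweeps):
        violated = False
        for a_i, b_i in zip(A, b):
            r = a_i @ x + b_i
            if r < 0:                                   # x is outside halfspace H^i
                x = x - lam * (r / (a_i @ a_i)) * a_i   # the update (19)
                violated = True
        if not violated:
            return x                                    # all inequalities satisfied
    return x

# Example: x1 >= 1 and x2 >= 2, i.e. <(1,0),x> - 1 >= 0 and <(0,1),x> - 2 >= 0
A = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([-1.0, -2.0])
print(relaxation(A, b))       # a point with x1 >= 1, x2 >= 2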
4.1 Pseudo-Relaxation Method

The pseudo-relaxation method proposed in Oh and Kothari (1991b, 1992a, 1992b) is an adaptation of the relaxation method for neural networks. It changes weights based on local information provided by each training vector.
The method cycles through the sequence of halfspaces $\{H^i\}$ and performs the relaxation procedure (19) if $d(x^q, H^i) > \delta'$ for some predetermined $\delta' > 0$. The pseudo-relaxation method does not necessarily give a solution for (17). Instead, when it terminates, $x^q$ is in a $\delta'$-neighborhood of each $H^i$, i.e., $\forall i \; d(x^q, H^i) \leq \delta'$. The convergence of the pseudo-relaxation method is established by the following theorem, proved in Oh and Kothari (1991b).
Theorem 4 (Oh and Kothari). Let $\{x^q\}$ be a sequence generated by the pseudo-relaxation method using the relaxation procedure (19). If $0 < \lambda < 2$, then the sequence $\{x^q\}$ always terminates.

The pseudo-relaxation method can be used to find a solution to a system of linear inequalities if the solution set is full-dimensional, i.e., not contained in any hyperplane. Assuming the polyhedron C to be full-dimensional, the pseudo-relaxation method works as follows:

Step 1: Define $H_\xi^i$ to be the halfspace in $\Re^n$ such that

$$H_\xi^i = \{x \in \Re^n \mid \langle a^i, x \rangle + b_i \geq \xi_i\},$$

where $\xi_i > 0$. Let $d(H^i, H_\xi^i)$ be the perpendicular distance between the two hyperplanes defined by the halfspaces $H^i$ and $H_\xi^i$, and define $\delta_{max} = \max_i d(H^i, H_\xi^i)$. The solution set of the new system $\{H_\xi^i\}$ is

$$C_\xi = \bigcap_{i=1}^{m} H_\xi^i$$

and $C_\xi \subset C$. Note that $C_\xi \neq \emptyset$ as long as C is full-dimensional and the $\xi_i$ are chosen to be sufficiently small. In particular, if C is a convex polyhedral cone, $C_\xi$ is not empty regardless of the choice of $\xi_i$.

Step 2: Apply the pseudo-relaxation method to the system $\{H_\xi^i\}$ using $\delta' = d(H^i, H_\xi^i)$.

Step 3: As proved in Theorem 4, pseudo-relaxation terminates at $x^q$ such that $\forall i \; d(x^q, H_\xi^i) \leq \delta'$. Since $\delta' = d(H^i, H_\xi^i)$ and $C_\xi \subset C$, it follows that $x^q \in C$.

An application of the pseudo-relaxation method for a two-dimensional case is illustrated and compared with that of the maximal distance relaxation method in Fig. 15. In the illustration, both methods use under-relaxation with the same starting point $x^0$.
Fig. 15. Fast convergence of the pseudo-relaxation method: (a) maximal distance relaxation; (b) pseudo-relaxation.
4.2
Learning in Hopfield Networks
Let T = { X ' ) k : l , . . . , p be a set of training vectors that are to be stored as stable states. Let W,,be the connection strength between the ith and thejth neurons. Let 8 , be the threshold for the ith neuron. Note that every vector belonging to T is a stable state if the weight and threshold values satisfy the following system of linear inequalities :
$$\left( \sum_{j=1}^{N} W_{ij} X_j^k - \theta_i \right) X_i^k > 0 \qquad (20)$$
for $k = 1, \ldots, P$ and for $i = 1, \ldots, N$. The solution set $C$ for the linear inequalities (20) is a convex polyhedral cone with vertex at the origin. Thus, by applying the pseudo-relaxation method the following learning algorithm for Hopfield networks can be derived (Oh and Kothari, 1992b).

Learning Algorithm PRLAH
Given a training set $T = \{X^k\}$, where $X^k = (X_1^k, \ldots, X_N^k)$, the following adaptation rules are applied for each $X^k$:

$$\Delta W_{ij} = -\frac{\lambda}{N}\left[ S_i^k - \xi X_i^k \right] X_j^k \quad \text{if } S_i^k X_i^k \le 0 \qquad (21)$$

$$\Delta \theta_i = \frac{\lambda}{N}\left[ S_i^k - \xi X_i^k \right] \quad \text{if } S_i^k X_i^k \le 0 \qquad (22)$$

where $S_i^k = \sum_{j=1}^{N} W_{ij} X_j^k - \theta_i$, $W_{ij} = W_{ji}$, and $W_{ii} = 0$.
The relaxation factor $\lambda$, the initial weights, and the constant $\xi$ are the parameters that need to be set for an application of PRLAH.
4.3 Learning in BAM

Consider an N-M BAM with $N$ neurons in the first layer and $M$ neurons in the second layer. Let $W_{ij}$ be the connection strength between the $i$th neuron in the first layer and the $j$th neuron in the second layer. Let $\theta_{X_i}$ be the threshold for the $i$th neuron in the first layer and $\theta_{Y_j}$ be the threshold for the $j$th neuron in the second layer. The BAM behaves as a heteroassociative content-addressable memory, storing and recalling a set of vector pairs $T = \{(X^k, Y^k)\}_{k=1,\ldots,P}$, where $X^k \in \{-1, +1\}^N$ and $Y^k \in \{-1, +1\}^M$. The given pairs in $T$ are stored as stable states if the following system of linear inequalities is satisfied for all $k = 1, \ldots, P$:
$$\left( \sum_{i=1}^{N} W_{ij} X_i^k - \theta_{Y_j} \right) Y_j^k > 0 \quad \text{for } j = 1, \ldots, M \qquad (23)$$

$$\left( \sum_{j=1}^{M} W_{ij} Y_j^k - \theta_{X_i} \right) X_i^k > 0 \quad \text{for } i = 1, \ldots, N. \qquad (24)$$
In this case, $W_{ij}$, $\theta_{X_i}$, and $\theta_{Y_j}$ are the unknowns and the set of feasible solutions is a convex polyhedral cone $C$ with vertex at the origin. Starting with arbitrary initial values for the weights and thresholds, the algorithm determines $w = (W_{ij}, \theta_{X_i}, \theta_{Y_j})$, an $(MN + M + N)$-dimensional weight-threshold vector, which satisfies inequalities (23) and (24).

Learning Algorithm PRLAB
For each pair $(X^k, Y^k)$, the vector $(W_{ij}, \theta_{X_i}, \theta_{Y_j})$ is modified using the following adaptation rules. For the neurons in the first layer,

$$\Delta W_{ij} = -\frac{\lambda}{1+M}\left[ S_{X_i}^k - \xi X_i^k \right] Y_j^k, \quad \Delta \theta_{X_i} = \frac{\lambda}{1+M}\left[ S_{X_i}^k - \xi X_i^k \right] \quad \text{if } S_{X_i}^k X_i^k \le 0, \qquad (25)$$

and for the neurons in the second layer,

$$\Delta W_{ij} = -\frac{\lambda}{1+N}\left[ S_{Y_j}^k - \xi Y_j^k \right] X_i^k, \quad \Delta \theta_{Y_j} = \frac{\lambda}{1+N}\left[ S_{Y_j}^k - \xi Y_j^k \right] \quad \text{if } S_{Y_j}^k Y_j^k \le 0, \qquad (26)$$
where

$$S_{X_i}^k = \sum_{j=1}^{M} W_{ij} Y_j^k - \theta_{X_i}, \qquad S_{Y_j}^k = \sum_{i=1}^{N} W_{ij} X_i^k - \theta_{Y_j}.$$
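A compact sketch of PRLAB along the same lines follows; again, the function name, the parameter defaults, and the termination test are illustrative assumptions rather than details fixed by the original papers.

import numpy as np

def prlab(X, Y, lam=0.9, xi=1.0, max_epochs=1000):
    """Pseudo-relaxation learning (PRLAB) for an N-M BAM.
    X: (P, N) and Y: (P, M) bipolar training pairs."""
    (P, N), (_, M) = X.shape, Y.shape
    W = np.zeros((N, M))
    theta_x, theta_y = np.zeros(N), np.zeros(M)
    for _ in range(max_epochs):
        stable = True
        for x, y in zip(X, Y):
            Sx = W @ y - theta_x                   # S_Xi^k, first layer
            for i in np.where(Sx * x <= 0)[0]:     # violated conditions, rules (25)
                c = (lam / (1 + M)) * (Sx[i] - xi * x[i])
                W[i, :] -= c * y
                theta_x[i] += c
                stable = False
            Sy = W.T @ x - theta_y                 # S_Yj^k, second layer
            for j in np.where(Sy * y <= 0)[0]:     # violated conditions, rules (26)
                c = (lam / (1 + N)) * (Sy[j] - xi * y[j])
                W[:, j] -= c * x
                theta_y[j] += c
                stable = False
        if stable:                                 # every pair is a stable state
            break
    return W, theta_x, theta_y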
4.4 Delta Rule versus Relaxation Method

Consider a training set $\{(X^i, Y^i) \mid i = 1, 2, \ldots, P\}$, where $X^i = (X_1^i, X_2^i, \ldots, X_N^i)$ represents an input vector and $Y^i \in \{-1, 1\}$ is the corresponding output. Let $W = (W_1, W_2, \ldots, W_N)$ be the set of weights. The weight $W_N$ is used as the threshold by setting $X_N^i = -1$. Given an input $X^i$, the corresponding output $Y^i$ will be produced by a single-layer network if the following inequality holds:

$$Y^i \left( X_1^i W_1 + X_2^i W_2 + \cdots + X_N^i W_N \right) > 0.$$
Let $a_j^i = Y^i X_j^i$ and $a^i = (a_1^i, a_2^i, \ldots, a_N^i)$. Then the learning problem is to solve the following set of inequalities, treating the weights as the unknowns:

$$\sum_{j=1}^{N} a_j^i W_j > 0, \quad i = 1, \ldots, P.$$
Both the delta rule and the relaxation method have a common geometric interpretation, described as follows. Each training pair $(X^i, Y^i)$ corresponds to the hyperplane $H^i$:

$$\sum_{j=1}^{N} a_j^i W_j = 0.$$
Treat $W$ as a point in an $N$-dimensional space. Starting from an arbitrary point $W^0$, both algorithms use iterative procedures that find a point $W$ on the positive side of all the hyperplanes $H^i$. The hyperplanes are examined one at a time. Suppose $W^k$ is the point representing the current set of weights. If $W^k$ is not on the positive side of $H^i$, then $W^k$ is projected to a new point $W^{k+1}$ along the direction of the positive normal to the hyperplane $H^i$. The objective is to move toward the positive side of $H^i$. Let $O^i$ be the observed output corresponding to $X^i$. The critical difference between the delta rule and the relaxation method lies in the actual update rules.
Delta rule:

$$W^{k+1} = W^k + \eta\, (Y^i - O^i)\, X^i \quad \text{if } Y^i \neq O^i. \qquad (29)$$
Since $(Y^i - O^i) = 2Y^i$ whenever $Y^i \neq O^i$, the delta rule is

$$W^{k+1} = W^k + 2\eta\, Y^i X^i.$$

That is,

$$W^{k+1} = W^k + 2\eta\, a^i.$$

The delta rule is also called the perceptron learning rule when $\eta = 1$. The delta rule moves the point in the direction of the normal to the hyperplane $H^i$, but the distance $d$ by which the point moves is fixed at $d = 2\eta \|a^i\|$.

Relaxation method:
$$W^{k+1} = W^k - \lambda\, \frac{\langle a^i, W^k \rangle}{\|a^i\|^2}\, a^i. \qquad (30)$$
Unlike the delta rule (or the perceptron rule), the updates made by the relaxation method are not static. The distance $d$ by which a point $W^k$ is moved varies dynamically depending on the position of the point relative to the hyperplane $H^i$. Specifically,
$$d = -\lambda\, \frac{\langle a^i, W^k \rangle}{\|a^i\|}.$$
Note that $|\langle a^i, W^k \rangle| / \|a^i\|$ is the distance of $W^k$ from the hyperplane $H^i$. The difference between the two updating rules is illustrated in Fig. 16 for a two-dimensional example.
FIG. 16. The update $W_D$ using the delta rule and the update $W_R$ using the relaxation method are shown for two initial positions of the point $W$. The distance between $W_D$ and $W$ is fixed, whereas the distance between $W_R$ and $W$ varies depending on the position of $W$.
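The contrast can be seen in a few lines of Python; the step sizes eta and lam below are arbitrary illustrative values, not values from the chapter.

import numpy as np

def delta_update(W, a, eta=0.25):
    """Delta/perceptron step: a fixed move of length 2*eta*||a|| whenever
    W is on the wrong side of the hyperplane <a, W> = 0, as in (29)."""
    return W + 2 * eta * a if a @ W <= 0 else W

def relaxation_update(W, a, lam=1.0):
    """Relaxation step: a move proportional to the distance of W from the
    hyperplane, as in (30)."""
    s = a @ W
    return W - lam * (s / (a @ a)) * a if s <= 0 else W

a = np.array([1.0, 1.0])                          # a^i = Y^i X^i for one pair
for W0 in (np.array([-0.5, -0.5]), np.array([-5.0, -5.0])):
    print(delta_update(W0, a), relaxation_update(W0, a))
# The delta step has the same length from both starting points, while the
# relaxation step grows with the distance of W from the hyperplane.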
5. Hebbian Learning
The Hebbian learning strategy, attributed to D. O. Hebb (1949), suggests adjusting the connection strength between two nodes according to the correlation of the values of the two nodes. Hebb (1949) originally described the concept of learning as follows: When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.
The most widely accepted discrete approximation of the Hebbian strategy, called the correlation encoding scheme or the Hebbian rule, is given by

$$\Delta W_{ij} = \eta\, X_i X_j, \qquad (31)$$

where $\Delta W_{ij}$ is the change in the connection strength corresponding to a given training pattern $(X_1, X_2, \ldots, X_N)$. The changes are accumulated over all the training patterns to determine the final connection strengths. There are several variations of this rule. Sejnowski's covariance correlation learning rule (Sejnowski, 1977) is

$$\Delta W_{ij} = \eta\, (X_i - \bar{X}_i)(X_j - \bar{X}_j), \qquad (32)$$

where $\bar{X}$ indicates the mean value. Sutton and Barto's learning rule (1981) is

$$\Delta W_{ij} = \eta\, \bar{X}_i (X_j - \bar{X}_j), \qquad (33)$$

and Klopf's discrete time correlation equation (Klopf, 1986) is

$$\Delta W_{ij} = \eta\, \Delta X_i\, \Delta X_j. \qquad (34)$$
A continuous approximation of the Hebbian rule was proposed by Grossberg (1968, 1969) with the addition of a passive decay term:

$$\dot{W}_{ij} = -W_{ij} + X_i X_j, \qquad (35)$$

where the overdot denotes the time derivative $dW_{ij}/dt$. A variation of Grossberg's learning rule is the passive decay associative law (also known as the signal Hebb law)

$$\dot{W}_{ij} = -W_{ij} + S_i(X_i)\, S_j(X_j), \qquad (36)$$
where $S(\cdot)$ denotes a sigmoid activation function. A more general learning rule of this type is the differential Hebb law proposed by Kosko (1986) and Klopf (1986):
$$\dot{W}_{ij} = -W_{ij} + S_i(X_i)\, S_j(X_j) + \dot{S}_i(X_i)\, \dot{S}_j(X_j). \qquad (37)$$
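For illustration, the correlation encoding (31) and the covariance rule (32) can be written in a few lines of Python; the function names and the choice eta = 1 are illustrative, and the small demonstration at the end assumes orthogonal bipolar patterns, for which the Hebbian rule stores the patterns exactly.

import numpy as np

def hebbian_weights(patterns, eta=1.0):
    """Correlation encoding, rule (31): accumulate Delta W_ij = eta*X_i*X_j
    over all training patterns, with zero self-connections."""
    W = eta * patterns.T @ patterns          # sum over k of X_i^k X_j^k
    np.fill_diagonal(W, 0.0)
    return W

def covariance_weights(patterns, eta=1.0):
    """Sejnowski's covariance rule (32): correlate deviations from the mean."""
    D = patterns - patterns.mean(axis=0)
    W = eta * D.T @ D
    np.fill_diagonal(W, 0.0)
    return W

# Two orthogonal bipolar patterns are stored exactly by the Hebbian rule.
P = np.array([[1, 1, -1, -1], [1, -1, 1, -1]], dtype=float)
W = hebbian_weights(P)
print(np.sign(W @ P[0]))                     # recovers pattern 0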
6. Performance and Implementation
The learning speed is a metric that measures the time a learning algorithm takes to converge to a desired neural network configuration for solving a given problem. The learning speed is often measured in the number of epochs. Recall that during an epoch each training pattern is presented once. Fast learning is desirable; however, its importance as a performance criterion depends on the nature of the application. In certain applications, after training, the network is operated routinely for a long time without requiring further training. Fast learning may not be a critical issue for such applications, and one may be willing to trade off learning speed for another performance factor such as robustness of the solution. This is often the case if the patterns to be recognized remain static and adequate examples are available to train the network. On the other hand, an application may require repeated learning, with the network frequently changing from operational to learning mode. Moreover, the network may be expected to learn new patterns fairly quickly. In such applications, learning speed becomes a critical performance issue. Traditionally, algorithm performance is quantified by the order of the algorithm. The order is expressed by a function that measures the number of repeated operations required by the algorithm given the input size n of the problem. Different algorithms for the same task can be compared on the basis of the order of each algorithm. For example, the order of a sorting algorithm measures the number of swap operations given an input sequence of size n. Such a mathematical characterization of performance is difficult in the case of a learning algorithm. The number of weight adjustments required by a learning algorithm depends not only on the number of training patterns but also on appropriately chosen learning parameters. Depending on the learning parameters, significant variations in the performance of the same learning algorithm are possible. Figure 17 illustrates swings in the performance of the delta rule depending on a learning parameter. Empirical methods are often used to compare learning algorithms. In order to justify the results, such comparisons must be based on repeated experiments with varying learning parameter values and other important factors affecting the performance.
154
S. C. KOTHARI AND
HEEKUCK OH
600 500
2
B
400
4
.$
300
3
200
E
1
1
100
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Learning Raie
FIG. 17. Effect of varying learning rate q on learning speed of the delta rule.
In making such comparisons, often the practical difficulty is that the learning time is prohibitively high and it may not be possible to repeat the experiments sufficiently many times to get meaningful performance statistics. Another problem is caused by the error surface if it has several local minima that do not provide an acceptable solution for a given pattern recognition task. A learning algorithm using a gradient descent approach can get stuck at such a local minimum and fail to finish the learning phase satisfactorily. This problem may be fixed by repeating the learning phase with another set of initial weights, which changes the starting point on the error surface so that gradient descent can lead to a different and possibly acceptable minimum. Since the success of this heuristic approach can vary significantly, it is always advisable to undertake a preliminary feasibility study. The existence of local minima is well known in the case of multilayer networks. Interestingly, local minima can also occur in the case of single-layer networks (Sontag and Sussmann, 1989; Brady and Raghavan, 1989). In addition to the initial weights, there are other learning parameters that can affect the performance of a learning algorithm. One such parameter is the learning rate $\eta$. If $\eta$ is too small, then a gradient descent algorithm requires considerable learning time because of very small changes in the weights at each step. On the other hand, the algorithm may not even converge if $\eta$ is not small enough. In general, how to choose appropriate parameter values for fast learning is an important implementation issue. The choice of learning parameter values often requires trial and error, and it can be a troublesome task. The algorithms based on the pseudo-relaxation method are attractive
in this context because they are fairly insensitive to the learning parameters. The papers by Oh and Kothari (1991a, 1991b) include a study of sensitivity to learning parameters for both the pseudo-relaxation algorithm and the perceptron algorithm. Selection of an appropriate network topology is another potential source of difficulty in feedforward networks. One needs to determine the number of hidden nodes and how many layers to use. It is thought that the hidden nodes learn to recognize different features of the training patterns. After training, a hidden node is expected to respond with an active output if the feature learned by the node is present in an input pattern. The process of feature extraction by hidden nodes is not well understood. The existing learning algorithms cannot always effectively control the process of feature extraction by hidden nodes. In the cases where features for successful pattern recognition are not known a priori, determining an appropriate topology requires several trials. In practice one finds that certain patterns are more difficult to learn than others. However, at present there is no way to quantify the inherent difficulty of training patterns. Empirical studies aimed at such quantification can be misleading, because the difficulty in learning can also be due to a bad choice of learning parameters or an inappropriate network topology.
6.1 Performance of Associative Memories
An associative memory is used to store patterns for subsequent recall. Many applications require recall of a pattern starting from an incomplete or a noisy image of the pattern. Thus, it is desirable that an associative memory model not only be able to store a large number of patterns but also provide a recalling mechanism with a good noise filtering capability. The metric memory capacity is used to measure the number of distinct patterns that can be stored in an associative memory. The memory capacity of the Hopfield model has been extensively studied using randomly generated patterns and the Hebbian rule. Analytic techniques for estimating memory capacity are discussed in Amari (1972), Amari et al. (1977), Abu-Mostafa and St. Jacques (1985), McEliece et al. (1987), and Lee et al. (1992). These techniques rely on a probabilistic notion of memory capacity. The Hebbian rule does not guarantee storage of all patterns unless they are represented by orthogonal vectors. Based on a simulation study with a 100-neuron network, Hopfield observed that out of 15 random patterns about half of the patterns evolve to meaningful stable states with less than 5% errors. Moreover, the capacity deteriorates significantly if one tries to store more patterns or if the patterns are correlated. This deterioration occurs because the noise term in the Hebbian rule becomes increasingly dominant and causes more failures
in storing patterns. The memory capacity of the BAM shows similar behavior when the Hebbian rule is used. An associative memory may develop spurious states. These are stable states that do not correspond to any given pattern. The recalling process can result in a spurious state that is an invalid pattern. One can expect better performance if the number of spurious states is reduced. Another important concept is that of the basin of attraction. Given a stable state $T$, its basin of attraction is defined as the set $B(T) = \{X \mid X \mapsto T\}$. The notation $X \mapsto T$ means that $X$ evolves to the stable state $T$ through a sequence of updates. Note that a correct pattern $T$ will be recalled from a noisy image $X$, provided $X$ belongs to the basin of attraction of $T$. Thus, a superior noise filtering capability will follow if the basins of attraction are large for the given patterns stored as stable states. The totality of states is partitioned between the basins of attraction of stored patterns and spurious states. A reduction in the number of spurious states can help to increase the size of the basins of attraction for stored patterns. It is desirable to have learning algorithms that can guarantee storage of given patterns as stable states with as large basins of attraction as possible. Multiple training (Wang et al., 1990) is one technique that can improve the performance of the Hebbian rule by improving the memory capacity. The technique involves using a given pattern more than once in the Hebbian rule. Another technique is based on the concept of "unlearning." This technique is inspired by Crick and Mitchison's hypothesis on the function of dream sleep (Crick and Mitchison, 1983). They postulated a reverse learning mechanism that acts during dream sleep to remove certain undesirable modes of interaction in networks of cells in the cerebral cortex. Based on this hypothesis, researchers (Hopfield et al., 1983; Kleinfeld and Pendergraft, 1987) have done experiments using an unlearning technique to improve on the Hebbian rule. The unlearning technique modifies the Hebbian prescription for $W_{ij}$ by $-X_i X_j$ in order to suppress a spurious state $(X_1, X_2, \ldots, X_N)$. One practical approach is to use noisy images to check if they lead to spurious patterns; if so, the resulting spurious states are used for unlearning. Experimental studies have shown that it is possible to improve the performance of the Hebbian rule significantly by using these techniques. However, the reported results on the achievable memory capacity through these techniques fall short of what is possible by using a technique like the pseudo-relaxation algorithm. The memory capacity of a Hopfield network or a BAM can be greatly increased by using a perceptron algorithm or a pseudo-relaxation algorithm. The paper by Oh and Kothari (1991b) presents a comparative performance study of the Hebbian rule, the perceptron algorithm, and the pseudo-relaxation algorithm. For correlated patterns drawn from the IBM PC CGA fonts, the
paper by Oh and Kothari (1991a) reports successful storing of 97 patterns in a 49-neuron Hopfield network. These results represent a significant improvement over the results obtained by the Hebbian rule. Both the perceptron and the pseudo-relaxation algorithms offer guaranteed recall as long as there exists a set of weights to store all the given patterns. Unlike the Hebbian rule, these algorithms exploit the entire memory capacity of a Hopfield network or a BAM. The only limitations are the ones imposed by the network architecture itself. These limitations can be expressed in terms of linear separability constraints. Recall that a partition $S = P^+ \cup P^-$ of a set $S$ in $\Re^N$ is called linearly separable if there exists a hyperplane in $\Re^N$ which separates $P^+$ and $P^-$. We define partitions $\{P_i^+, P_i^-\}$ of the training set $T$, for each $i = 1, \ldots, N$, as follows:

$$P_i^+ = \{X^k \mid X^k \in T \text{ and } X_i^k = 1\}$$

$$P_i^- = \{X^k \mid X^k \in T \text{ and } X_i^k = -1\}.$$

All the vectors in $T$ can be stored as stable states if and only if every partition $\{P_i^+, P_i^-\}$ of $T$ is linearly separable for $i = 1, \ldots, N$.
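This condition can be tested directly: for each $i$, run the perceptron rule on the inequalities (20) restricted to the $i$th partition. The sketch below uses a fixed epoch budget as a stand-in for a true termination test, so a False answer only suggests, rather than proves, non-separability; the function names are illustrative.

import numpy as np

def partition_separable(T, i, max_epochs=1000):
    """Check whether the partition {P_i^+, P_i^-} of the training set T
    (rows in {-1, +1}^N) is linearly separable with W_ii = 0."""
    P, N = T.shape
    X = np.delete(T, i, axis=1)                        # exclude the ith coordinate
    A = np.hstack([X, -np.ones((P, 1))]) * T[:, [i]]   # rows X_i^k * (X_j^k ..., -1)
    w = np.zeros(N)                                    # N-1 weights plus a threshold
    for _ in range(max_epochs):
        wrong = [a for a in A if a @ w <= 0]
        if not wrong:
            return True                                # all inequalities satisfied
        for a in wrong:
            w += a                                     # perceptron step
    return False                                       # no separator found in budget

def storable(T):
    """All vectors in T can be stored iff every partition is separable."""
    return all(partition_separable(T, i) for i in range(T.shape[1]))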
6.2 Applications
6.2.1 NETtalk

NETtalk is a program that learns patterns of pronunciation. Pronunciations are full of exceptions that are difficult to handle through a rigid rule-based system. Instead of using rules, NETtalk teaches itself to read aloud. NETtalk takes English text as input and produces codes for phonemes and stresses. These codes are converted to sounds by using DECtalk, a voice synthesizer. A transcription of a person's reading of the text provides the targets for the outputs. NETtalk uses the backpropagation algorithm. A specific implementation consisted of a feedforward network with 309 nodes and 18,629 connections. The input layer had 7 groups, each with 29 nodes: one for each letter of the alphabet, plus space, comma, and period. The hidden layer had 80 nodes and the output layer had 26 nodes. NETtalk examines a window of seven characters at a time. The middle character is considered to be in the context of the surrounding triples. NETtalk was developed in 1986 by Terrence Sejnowski and Charles Rosenberg (1986). The first implementation was a serial simulation on a VAX machine. Demonstrations of NETtalk have shown remarkable success.
6.2.2 TDNN

Speech processing is one of the most difficult and challenging tasks, even with the most powerful computers. Speech is a dynamic signal that varies in both the time and frequency domains, and a speech-recognition system must be able to capture the time-varying properties of speech rather than simply taking static snapshots of the signal. The complexity of the speech recognition task lies in the fact that a given utterance can be represented by an infinite number of time-frequency patterns. Time-delay neural networks (TDNNs) were proposed by Waibel et al. (1988, 1989) and Lang and Hinton (1988) to perform high-accuracy shift-invariant phoneme recognition. The TDNN consists of several layers, including two or more hidden layers. The input layer of the network corresponds to many time slices of speech, depending on the complexity of the recognition task. Each of these time slices provides a frequency spectrum representing the speech waveform sampled at a fixed time interval. Each spectrum consists of 16 coefficients representing frequencies ranging from 20 Hz to over 5 kHz. These input spectra are fully connected, in groups of identical size, to the corresponding nodes in the first hidden layer. The weights between a group and its first-hidden-layer counterpart are identical among all groups. The grouping scheme in the first hidden layer and its connection to the next hidden layer are similar to the input layer's groupings and connections to the first hidden layer. Each of the groups serves as a temporal window of speech. By overlapping a series of such windows, the network can capture the local acoustic-phonetic events that act as identifying features of a particular phoneme. Waibel et al. (1989) used the TDNN to recognize the consonants b, d, and g. The input layer of the network was fed by 15 frequency spectra, representing the speech waveform sampled at 10 msec intervals. Using a 30 msec window of speech, the network could recognize the voiced consonants b, d, and g at a rate of more than 98% for all phonemes for a single speaker. Lang and Hinton (1988) did similar experiments on the four syllabic words "bee," "dee," "ee," and "vee," and achieved a recognition rate of more than 90%.
6.2.3 Neocognitron

The neocognitron proposed by Fukushima (1988; Fukushima and Wake, 1991) is a hierarchical neural network designed to perform deformation-invariant visual pattern recognition tasks. The neocognitron has several layers. Each layer abstracts features from the previous layer, eventually leading to the classification of the object being recognized.
The neocognitron consists of two types of layers: layers of S cells and layers of C cells. In the network, the two types of layers are arranged alternately to perform shift-invariant feature extraction. S cells are feature-extracting cells; C cells are used to correct positional errors in the features. Each layer of S cells or C cells is divided into subgroups, and each subgroup is associated with a single feature to which it responds. Each S cell in the same subgroup extracts the same feature, but in a different position. In other words, an S cell fires only when a particular feature appears at a certain position in the previous layer. Each C cell receives signals from a group of S cells in the previous layer. The C cell fires if at least one of these S cells fires. Thus, shift-invariant feature extraction can be achieved by cascading a group of S cells with a C cell. The ability of the neocognitron to correctly recognize deformed characters depends heavily on the choice of local features to be extracted in the intermediate stages of the network. The total number of stages of the network, the size of each subgroup, as well as the number of subgroups in a stage, are determined by the complexity of the patterns to be recognized. A skillful choice of training patterns is also an important factor in the performance of the system. Fukushima has successfully demonstrated his model for recognizing handwritten numerals (Fukushima, 1988) and alphanumeric characters (Fukushima and Wake, 1991).
7. Conclusions

Neural networks offer an attractive approach to the pattern recognition problem, where fuzzy data and multiple representations of the same pattern make the recognition process difficult. In many instances the recognition process cannot be defined by the rigid set of rules required for programming conventional computers. Neural networks address these problems by employing a distributed and less rigid representation of knowledge. Moreover, the knowledge is not programmed into the network; rather, it is inferred by the network from training examples. Neural networks thus provide a simple computing paradigm to perform complex recognition tasks in real time. Pattern recognition applications can greatly benefit from the neural network approach. In practice, however, some issues remain to be settled. We list here a set of issues for further study that can go a long way in enhancing the neural network approach.

Selection of a neural network architecture: Many time-consuming trials can be avoided if there is a proven technique to select an appropriate neural
network architecture given any specific application. The problem is complex because even within an architecture type many variations are possible. For example, within feedforward multilayer networks there are many choices for the network topology, the error function, and the activation function. These choices determine the error surface. A learning algorithm essentially searches the error surface for a suitable solution. The run time of the search can vary significantly depending on the local and global minima of the error surface. It is also possible that a suitable solution does not exist on the error surface for some choices of the network topology. This leaves open the possibility of many unsuccessful trials with one or more network topologies. This issue is particularly relevant for networks with hidden layers, because one usually does not know how many hidden nodes are necessary to solve a given problem. A small number of hidden nodes may not be enough to solve a problem; on the other hand, a large number of hidden nodes may make the problem unnecessarily complex, thus requiring a large number of trials to arrive at a solution. Since feedforward networks with hidden nodes are widely used in practice, a specific research problem of great interest is how to arrive at a suitable network topology that provides an efficient solution for a given pattern recognition problem.

Learning speed: Typically, a learning algorithm performs a heuristic search on the error surface. It often works very well, but in some cases the search can require an inordinate amount of time. While experimenting, for example to find a suitable network topology, it is impossible to do many trials unless the learning is fast. Learning algorithms, in general, require a selection of parameters such as the learning rate, the initial weights, and possibly many other parameters specific to a given algorithm. For many algorithms, the performance depends critically on these parameters. The search for good parameter values can itself become a hard problem. There are several sources of difficulty. One problem can be that the performance trend is not clear as a parameter is varied. Moreover, different parameters may interact with each other, making the problem even harder. Also, the behavior of a learning algorithm and an appropriate set of choices vary across applications. The combination of the unpredictable behavior of a learning algorithm and the lack of knowledge of whether a given neural network is capable of solving the given problem creates a confusing situation. It becomes unclear whether to search for a different set of learning parameter values or to search for a different network. Parameter sensitivity is an inherent problem with the gradient descent method, which is the basis for many learning algorithms.

A lot of research activity is centered around learning algorithms because of their fundamental importance in neural networks. In this chapter we
discussed two important directions of research to improve learning algorithms. One direction is dynamic node generation to improve learning; this is the approach used by the cascade-correlation algorithm. The other direction is to design learning algorithms where the choice of parameters is not an issue; the relaxation algorithms are representative of this direction of research. These algorithms can be used with functional link nets and radial basis function networks.

Data representation and preprocessing: A part of the pattern recognition task can be accomplished through preprocessing to reduce the complexity of the problem that the neural network has to address. This is a valid approach if a part of the problem can be easily handled through preprocessing and the remaining problem can be solved by a neural network. The preprocessing, not surprisingly, is application dependent. However, it is desirable to develop a better understanding of neural networks in order to decide how to modify the problem through preprocessing and appropriate data representation so that the new problem is more amenable to neural networks.

Accountability, capability and reliability: In pattern recognition applications it may not always be clear whether the neural network has solved the problem only superficially. A network could be differentiating the patterns according to superficial characteristics of the way the data is presented. It is dangerous to use a neural network as a magical black box that somehow produces answers. Ideally, a neural network should be accountable so that its answers can be traced back. Although some possibilities have been suggested, traceability in neural networks remains a difficult problem requiring new insight. Interestingly, even if accountability is not established, a neural network can still be claimed to be reliable while answering questions on new data. The reliability is established in practice by an empirical study using experimental test data. However, there is a lack of a theoretical foundation for the performance levels of neural networks. Such a foundation, if established, could be used to support the results of experimental studies. It could also provide effective and efficient ways of establishing the reliability of a neural network.

Neural networks provide interesting possibilities in terms of parallelization and hardware implementations. A neural network implementation of a technique often suggests possibilities for parallelizing the technique. The technology of massively parallel machines is rapidly advancing. Machines with tens of thousands of processors exist today, and the possibility of millions of processors may not be too far away in the future. Neural networks, with their natural parallelism, may be an ideal medium for making use of massively parallel machines.

It remains to be seen if neural networks provide unique capabilities that go beyond other methods such as statistics and traditional pattern
recognition techniques. Neural networks have evolved from cross-disciplinary research. Along with the resemblance to biological neural structures, it is important to pursue the mathematical ideas that provide a rigorous foundation for advancing the neural network approach. Neural networks offer alternatives for solving problems addressed by approximation and extrapolation methods in mathematics and regression techniques in statistics. A better understanding of the relationship between neural networks and other mathematical and statistical techniques will be mutually beneficial. The hope is that neural networks will provide some unique capabilities for pattern recognition.
REFERENCES

Abu-Mostafa, Y. S., and St. Jacques, J. M. (1985). Information capacity of the Hopfield model. IEEE Trans. Inf. Theory 31 (4), 461-464.
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Sci. 9, 147-169.
Agmon, S. (1954). The relaxation method for linear inequalities. Canadian J. Math. 6 (3), 382-392.
Amari, S. (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. Comp. C-21, 1197-1206.
Amari, S., Yoshida, K., and Kanatani, K. (1977). A mathematical foundation for statistical neurodynamics. SIAM J. Appl. Math. 33 (1), 95-126.
Ash, T. (1989). "Dynamic Node Creation in Backpropagation Networks." Institute for Cognitive Science, University of California, San Diego, Technical Report ICS-8901.
Brady, M. L., and Raghavan, R. (1989). Back propagation fails to separate where perceptrons succeed. IEEE Trans. Circuits Sys. 36 (5), 665-674.
Bruck, J. (1990). On the convergence properties of the Hopfield model. Proc. IEEE 78 (10), 1579-1585.
Carpenter, G. A. (1989). Neural network models for pattern recognition and associative memory. Neural Net. 2, 243-257.
Cernuschi, B. (1989). Partial simultaneous updating in Hopfield memories. IEEE Trans. Sys., Man, Cybernet. 19 (4), 887-888.
Cottrell, M. (1988). Stability and attractivity in associative memory networks. Biol. Cybernet. 58, 129-139.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Elec. Comp. EC-14, 326-334.
Crick, F., and Mitchison, G. (1983). The function of dream sleep. Nature 304, 111-114.
Cybenko, G. (1988). "Continuous Valued Neural Networks: Approximation Theoretical Results." Interface '88 Proc., Reston, Virginia.
Cybenko, G. (1989). "Approximations by Superpositions of a Sigmoidal Function." Center for Supercomputing Research and Development, University of Illinois, Urbana, CSRD Report No. 856, pp. 1-15.
DARPA (1988). "DARPA Neural Network Study." AFCEA International Press.
Durbin, R., and Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature 326 (16), 689-691.
Fahlman, S. E. (1988). "Faster-Learning Variations on Back-Propagation: An Empirical Study." Proceedings of the 1988 Connectionist Models Summer School, pp. 38-51.
Fahlman, S. E., and Lebiere, C. (1990). "The Cascade-Correlation Learning Architecture." School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, Technical Report CMU-CS-90-100.
Forrest, B. M. (1988). Content-addressability and learning in neural networks. J. Phys. A: Math. Gen. 21, 245-255.
Forrest, B. M., Roweth, D., et al. (1987). Implementing neural network models on parallel computers. Comp. J. 30 (5), 413-419.
Fukushima, K. (1988). Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural Net. 1 (2), 119-130.
Fukushima, K., and Wake, N. (1991). Handwritten alphanumeric character recognition by the neocognitron. IEEE Trans. Neural Net. 2 (3), 335-365.
Gardner, E., and Derrida, B. (1988). Optimal storage properties of neural network models. J. Phys. A: Math. Gen. 21, 271-284.
Golden, R. M. (1986). The 'brain-state-in-a-box' neural model is a gradient descent algorithm. J. Math. Psychol. 30, 73-80.
Goles, E. (1985). Dynamics of positive automata networks. Theoret. Comp. Sci. 41, 19-32.
Goles, E., Fogelman, F., and Pellegrin, D. (1985). Decreasing energy functions as a tool for studying threshold networks. Discrete Appl. Math. 12, 261-277.
Grossberg, S. (1968). Some nonlinear networks capable of learning a spatial pattern of arbitrary complexity. Proc. Nat. Acad. Sci. 59, 368-372.
Grossberg, S. (1969). On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. J. Stat. Phys. 1, 319-350.
Gulati, S., Barhen, J., and Iyengar, S. (1991). Neurocomputing formalisms for computational learning and machine intelligence. In "Advances in Computers," Vol. 33, Marshall C. Yovits, ed., Academic Press, Boston, pp. 173-245.
Haines, K., and Hecht-Nielsen, R. (1988). A BAM with increased information storage capacity. Proceedings of the 2nd International Conference on Neural Networks, pp. I-181-I-190.
Hassoun, M. H. (1989). Dynamic heteroassociative neural memories. Neural Net. 2, 275-287.
Hebb, D. O. (1949). "The Organization of Behavior." John Wiley & Sons, New York.
Hecht-Nielsen, R. (1987). "Kolmogorov's Mapping Neural Network Existence Theorem." IEEE First International Conference on Neural Networks, Vol. III, pp. 11-14.
Hecht-Nielsen, R. (1989). "Theory of the Backpropagation Neural Network." Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, Vol. I, pp. I-593-I-605.
Hedge, S., Sweet, J., and Levy, W. (1988). "Determination of Parameters in a Hopfield/Tank Computational Network." Proceedings of the IEEE International Conference on Neural Networks, Vol. II, pp. 291-298.
Hinton, G. E., and Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In "Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations." MIT Press, pp. 282-317.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl Acad. Sci. 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl Acad. Sci. 81, 3088-3092.
Hopfield, J. J. (1989). Collective computation, content-addressable memory, and optimization problems. In "Complexity in Information Theory."
Hopfield, J. J., and Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biol. Cybernet. 52, 141-152.
Hopfield, J. J., Feinstein, D. I., and Palmer, R. G. (1983). Unlearning has a stabilizing effect in collective memories. Nature 304, 158-159.
Jeffrey, W., and Rosner, R. (1986). "Neural Network Processing as a Tool for Function Optimization." AIP Conference Proceeding 151: Neural Networks for Computing, Snowbird, Utah, pp. 241-246.
Kanter, I., and Sompolinsky, H. (1987). Associative recall of memory without errors. Phys. Rev. A 35 (1), 380-392.
Kleinfeld, D., and Pendergraft, D. B. (1987). Unlearning increases the storage capacity of content addressable memories. Biophys. J. 51, 47-53.
Klopf, A. (1986). "Drive-Reinforcement Model of Single Neuron Function: An Alternative to the Hebbian Neuronal Model." AIP Conference Proceeding 151: Neural Networks for Computing, Snowbird, Utah, pp. 265-270.
Knight, K. (1989). "A Gentle Introduction to Subsymbolic Computation: Connectionism for the A.I. Researcher." School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, Technical Report CMU-CS-89-150, pp. 1-28.
Kohonen, T. (1977). "Associative Memory: A System-Theoretical Approach." Springer-Verlag, Berlin.
Kohonen, T., Oja, E., and Lohtio, P. (1981). Storage and processing of information in distributed associative memory systems. In "Parallel Models of Associative Memory," G. Hinton and J. A. Anderson, eds. Erlbaum, Hillsdale, New Jersey.
Kosko, B. (1986). "Differential Hebbian Learning." AIP Conference Proceeding 151: Neural Networks for Computing, Snowbird, Utah, pp. 277-282.
Kosko, B. (1987). Optical bidirectional associative memories. Proc. SPIE 758, 11.
Kosko, B. (1988a). Bidirectional associative memories. IEEE Trans. Sys., Man, Cybernet. 18 (1), 49-60.
Kosko, B. (1988b). "Feedback Stability and Unsupervised Learning." Proceedings of the 2nd International Conference on Neural Networks, pp. I-141-I-151.
Lang, K., and Hinton, G. E. (1988). "A Time-Delay Neural Network Architecture for Speech Recognition." Carnegie Mellon University, Technical Report CMU-CS-88-152.
Le Cun, Y. (1985). "Une procédure d'apprentissage pour réseau à seuil asymétrique." Proceedings of Cognitiva 85, Paris, pp. 599-604.
Lee, K., Kothari, S. C., and Shin, D. (1992). Probabilistic information capacity of Hopfield networks. Complex Sys. 6 (1), 31-46.
Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE ASSP Mag. 3 (4), 4-22.
McCulloch, W. S., and Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115-133.
McEliece, R. J., Posner, E. C., et al. (1987). The capacity of the Hopfield associative memory. IEEE Trans. Inf. Theory 33 (4), 461-482.
Minsky, M., and Papert, S. (1969). "Perceptrons." MIT Press, Cambridge, Massachusetts.
Motzkin, T. S., and Schoenberg, I. J. (1954). The relaxation method for linear inequalities. Canadian J. Math. 6 (3), 393-404.
Oh, H., and Kothari, S. C. (1991a). "A New Learning Approach to Enhance the Storage Capacity of the Hopfield Model." Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, Singapore, pp. 2056-2062.
Oh, H., and Kothari, S. C. (1991b). "Adaptation of the Relaxation Method for Learning in Bidirectional Associative Memory." Department of Computer Science, Iowa State University, Technical Report TR-91-25. (To appear in IEEE Transactions on Neural Networks.)
Oh, H., and Kothari, S. C. (1992a). "A Pseudo-Relaxation Learning Algorithm for Bidirectional Associative Memory." To appear in the Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, Baltimore, Maryland.
Oh, H., and Kothari, S. C. (1992b). "A New Approach in Learning and Its Application to the Hopfield Model." Department of Computer Science, Iowa State University, Technical Report TR-92-18.
Pao, Y. (1989). "Adaptive Pattern Recognition and Neural Networks." Addison-Wesley, Reading, Massachusetts.
Parker, D. B. (1985). "Learning Logic." Center for Computational Research in Economics and Management Science, MIT, Cambridge, Massachusetts, Technical Report TR-47.
Personnaz, L., Guyon, I., and Dreyfus, G. (1985). Information storage and retrieval in spin-glass like neural networks. J. Physique Lett. 46, L359-L365.
Rajavelu, A., Musavi, M. T., and Shirvaikar, M. V. (1989). A neural network approach to character recognition. Neural Net. 2, 387-393.
Rosenblatt, F. (1962). "Principles of Neurodynamics." Spartan Books, New York.
Rumelhart, D. E., and McClelland, J. L. (1986). "Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations." MIT Press, Cambridge, Massachusetts.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.
Sejnowski, T. J., and Rosenberg, C. (1986). "NETtalk: A Parallel Network That Learns to Read Aloud." Johns Hopkins University, Department of Electrical Engineering and Computer Science, Technical Report JHU/EECS-86/01.
Sejnowski, T. J., Kienker, P. K., and Hinton, G. E. (1986). Learning symmetry groups with hidden units: beyond the perceptron. Physica 22-D, 260-275.
Shavlik, J. W., Mooney, R. J., and Towell, G. G. (1990). "Symbolic and Neural Learning Algorithms: An Experimental Comparison (Revised)." Computer Sciences Department, University of Wisconsin-Madison, CS Technical Report #955, pp. 1-39.
Sontag, E. D., and Sussmann, H. J. (1989). Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Sys. 3, 91-106.
Sutton, R. S., and Barto, A. G. (1981). Toward a modern theory of adaptive networks: expectation and prediction. Psychol. Rev. 88, 135-171.
Tagliarini, G., and Page, E. (1987). "A Neural-Network Solution to the Concentrator Assignment Problem." Proceedings of the 1987 NIPS, pp. 775-782.
Tank, D. W., and Hopfield, J. J. (1986). Simple "neural" optimization networks: an A/D converter, signal decision circuit, and a linear programming circuit. IEEE Trans. Circ. Sys. 33 (5), 533-541.
Valiant, L. G. (1984). A theory of the learnable. Comm. ACM 27 (11), 1134-1142.
Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., and Lang, K. (1988). "Phoneme Recognition: Neural Networks versus Hidden Markov Models." Proceedings of the 1988 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 107-110.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech, Signal Proc. ASSP-37.
Wang, Y., Cruz, J. B., and Mulligan, J. H. (1990). Two coding strategies for bidirectional associative memory. IEEE Trans. Neural Net. 1 (1), 81-91.
Wang, Y., Cruz, J. B., and Mulligan, J. H. (1991). Guaranteed recall of all training pairs for bidirectional associative memory. IEEE Trans. Neural Net. 2 (6), 559-567.
Weisbuch, G., and Fogelman, F. (1985). Scaling laws for the attractors of Hopfield networks. J. Physique Lett. 46, L623-L630.
Widrow, B., and Hoff, M. (1960). "Adaptive Switching Circuits." 1960 IRE WESCON Convention Record: Part 4, New York, pp. 96-104.
Widrow, B., Williams, R. J., and Zipser, D. (1960). "An Adaptive 'Adaline' Neuron Using Chemical 'Memistors'." Stanford Electronics Laboratory Technical Report 1553-2, Stanford, California.
Wilson, G. V., and Pawley, G. S. (1988). On the stability of the traveling salesman problem algorithm of Hopfield and Tank. Biol. Cybernet. 58, 63-70.
Witbrock, M., and Zagha, M. (1989). "An Implementation of Back-Propagation Learning on GF11, a Large SIMD Parallel Computer." School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, Technical Report CMU-CS-89-208, pp. 1-27.
Wong, A. (1988). Recognition of general patterns using neural networks. Biol. Cybernet. 58, 361-372.