Two-Stage Habituation Based Neural Networks for Dynamic Signal Classification

Bryan W. Stiles and Joydeep Ghosh
Abstract
This article describes a novel neural network structure designed for the dynamic classification of spatio-temporal signals. The network is a two-stage structure consisting of a biologically motivated temporal encoding stage followed by a static neural classifier stage. The temporal encoding stage is based upon a simple biological learning mechanism known as habituation. This habituation based neural structure is capable of approximating arbitrarily well any continuous, causal, time-invariant mapping from one discrete time sequence to another. Such a structure is applicable to SONAR and speech signal classification problems, among others. Experiments on classification of high dimensional feature vectors obtained from Banzhaf sonograms demonstrate that the proposed network performs better than time delay neural networks while using a substantially simpler structure.
1 Introduction
Many tasks performed by humans and animals involve decision-making and behavioral responses to spatio-temporally patterned stimuli. Thus the recognition and processing of time-varying signals is fundamental to a wide range of cognitive processes. Classification of such signals is also
*This work was supported in part by an NSF grant ECS 9307632 and ONR contract N00014-92C-0232. Bryan Stiles was also supported by the Du Pont Graduate Fellowship in Electrical Engineering. We thank Prof. I. Sandberg for several fruitful discussions.
basic to many engineering applications such as speech recognition, seismic event detection, sonar classification and real-time control [Lip89], [Mar90]. A central issue in the processing of time-varying signals is how past inputs or "history" is represented or stored, and how this history affects the response to the current inputs. Past information can be used explicitly by creating a spatial, or static, representation of a temporal pattern. This is achieved by storing inputs from the recent past and presenting them for processing along with the current input. Alternatively, the past events can be indirectly represented by a suitable memory device such as a series of possibly dispersive time-delays, feedback or recurrent connections, or changes in the internal states of the processing "cells" or "neurons" [GW93], [Mar90]. The past few years have witnessed an explosion of research on neural networks for temporal processing, and surveys can be found in [GD95], [Mar90], and [Moz93] among others. Most of this research has centered on artificial neural network models such as time delay neural networks (TDNNs), the recurrent structures of Elman and Jordan, and Principe's gamma network, that can utilize some variant of the backpropagation algorithm [dVP92]. Another large class of networks, inspired by physics, are based on transitions between attractors in asymmetric variants of the Hopfield type networks [Bel88], [HKP91]. Some researchers have also studied spatio-temporal sequence recognition mechanisms based on neurophysiological evidence, especially from the olfactory, auditory and visual cortex. Representative of this body of work is the use of non-Hebbian learning proposed by Granger and Lynch, which, when used in networks with inhibitory as well as excitatory connections, can be used to learn very large temporal sequences [GAIL91]. Similar networks have been used to act as adaptive filters for speech recognition [Kur87], or provide competitive-cooperative mechanisms for sequence selection [SDN87]. At the neuronal level, irregularly firing cells have been proposed as a basic computational mechanism for transmitting temporal information through variable pulse rates [Day90]. All these efforts are concentrated on neurophysiological plausibility rather than being geared toward algorithmic classification of man-made signals. Issues such as computational efficiency and ease of implementation on computers are secondary. This article discusses a class of two-stage neural classifiers which is biologically motivated, but is adapted for efficiency in practical temporal processing applications. The first stage serves
as the memory that is affected by past inputs, while the second stage is a sufficiently powerful feedforward network. Before describing the details of the structure, we shall first elaborate on the biological background which inspired its design.
2 Habituation and Related Biological Mechanisms
What are the biological mechanisms of learning and association? Studies have revealed a variety of methods by which biological neural systems perform basic associative and nonassociative learning [She95]. In this article, we highlight experiments performed with the mollusk,
Aplysia. Because of
the small number and large size of the neurons in this organism, it is relatively easy to measure the strengths of neural connections and how these strengths vary depending on the stimuli presented to the sensory neurons. Additionally, experiments performed with
Aplysia in the last decade have
demonstrated the ability of the organism to perform basic types of associative and nonassociative learning. Because the biological neural network found in
Aplysia has the ability to perform these
basic forms of learning, and its physical characteristics make it particularly easy to reverse engineer, it is ideal for use in developing a mathematical model of these simple learning mechanisms. Such a model for a single neuron has been produced by Byrne and Gingrich [BG89]. This model explains the experimental data from
Aplysia for three basic types of learning:
classical conditioning, habituation, and sensitization. In order to understand these three types of learning, it is necessary to define a few basic terms. An unconditioned stimulus is any stimulus which provokes a reflex action or unconditioned response. This response is automatic and does not require any previous learning. Any other stimulus is referred to as a conditioned stimulus, and does not automatically elicit any particular response. The three learning mechanisms discussed determine how to respond to a conditioned stimulus in the presence or absence of an unconditioned stimulus. Figure 1 presents the neural basis for conditioned and unconditioned stimuli as discussed in [BK85]. This figure illustrates the relationship between the sensory neurons, the facilitator neuron, and the motor neuron, and how this relationship results in the three types of learning behavior. The important thing to notice is that the synaptic strength associated with the unconditioned stimulus
is constant, because the response of the motor neuron to the unconditioned stimulus is automatic. Figure 1 also includes basic definitions of the three forms of learning as they have been observed in
Aplysia.

Habituation: W2(t) degrades when Sensory Neuron 2 is repeatedly activated. This degradation reverses over time when Sensory Neuron 2 is inactive.

Sensitization: When Sensory Neuron 1 is repeatedly activated, the Facilitator Neuron causes an increase in W2(t), which degrades over time when Sensory Neuron 1 is not activated.

Classical Conditioning: Sensitization is enhanced when activations of Sensory Neuron 1 are temporally paired with activations of Sensory Neuron 2.

Figure 1: Neural Basis for Unconditioned and Conditioned Stimuli
2.1 Habituation
Habituation is perhaps the simplest form of learning. In the absence of an unconditioned stimulus, the response to any conditioned stimulus degrades with each repetition of that conditioned stimulus. In biological neural systems, it has been observed that neurons respond more strongly to stimuli which occur infrequently. If a stimulus occurs often and is not classically conditioned, the neuron loses its ability to respond to that stimulus. Conversely, if the stimulus is not observed for a long period of time, the neuron's ability to respond may return. Experimenting with Aplysia has clarified
the neural basis for habituation [BK85]. Bailey and Kandel showed that habituation was localized at a single synapse. Repeated firing of the presynaptic neuron depressed the strength of the synaptic connection. The postsynaptic neuron, therefore, responded less strongly to any stimulus which activated the presynaptic neuron. This behavior showed both short and long term effects. In the short term, synaptic strength could be caused to decrease quickly and then rebound. Conversely, if the synaptic strength was reduced for a long period of time it required a long period of time to reestablish itself. Short term activation of the presynaptic neuron was found to reduce the influx of Ca2+ which is necessary for neurotransmitter release. Long term habituation led to long periods of Ca2+ deprivation which in turn made the electrical connection between neurons immeasurable and
caused changes in the physical structure of the synapse. The Byrne and Gingrich model is based on the flow of neurotransmitter among the external environment and two pools internal to the neuron [BG89]. One of the pools, the releasable pool, contains all the neurotransmitter ready to be released when the neuron is activated. The other pool, the storage pool, contains a store of neurotransmitter for long term use. Short term habituation is explained as the depletion of the releasable pool due to frequent activation of the neuron. The increased level of Ca2+ which results from the occurrence of the conditioned stimulus increases both the flow from the storage pool to the releasable pool and the release of neurotransmitter. It is the increase in neurotransmitter release which leads to activation of the neuron. In this manner, both pools are depleted by the frequent occurrence of conditioned stimuli. The neurotransmitter in the releasable pool can be replenished by diffusion from the storage pool or by other neurotransmitter flows which are regulated through sensitization and classical conditioning. Long term habituation can be explained by depletion of the storage pool itself. The Byrne and Gingrich model assumes a single flow into the storage pool from the external environment. This flow is proportional to the difference between the current neurotransmitter concentration in the pool and the steady state value. Another model for both long term and short term habituation was presented by Wang and Arbib and is reproduced as follows with straightforward modifications [WA92]. This model was produced independently from the Byrne model, and is based on Wang and Arbib's own experimental observations. The modifications introduced here merely change the equations from continuous to
discrete time, which is necessary in order to use them later in an artificial neural network. Here I(t) is the current input vector from the neurons whose outputs are habituated, and W(t) is the vector of synaptic strengths. The dependence of W(t) on sensitization or classical conditioning effects is ignored. Since this habituation-only version of W(t) is dependent only on the activation of the presynaptic neuron, only the single subscript, i, is needed. Synapses attached to the same presynaptic neuron habituate in the same manner. Henceforth W(t) will be referred to as the habituation value to avoid confusion with either a more complete model of synaptic strength or artificial neural network parameters.
W_i(t+1) = W_i(t) + τ_i ( α z_i(t) (W_i(0) − W_i(t)) − W_i(t) I_i(t) )    (1)

z_i(t+1) = z_i(t) + γ z_i(t) (z_i(t) − 1) I_i(t)    (2)
In this model, τ_i is a constant used to vary the habituation rate and α is a constant used to vary the ratio between the rate of habituation and the rate of recovery from habituation. The function z_i(t) monotonically decreases with each activation of the presynaptic neuron. This function is used to model long term habituation. Due to the effect of z_i(t), after a large number of activations of the presynaptic neuron the synapse recovers from habituation more slowly. Some assumptions about the range of values for the constants are made in order to assure that W_i(t) and z_i(t) remain within the range [0,1]. Specifically, τ_i, γ, the product of τ_i and α, and I_i(t) must all be in the range [0,1]. For simplicity, W_i(0) will always be assumed to be unity unless otherwise stated. It is apparent in Wang and Arbib's model that, in the long term, if the presynaptic neuron is not completely inactive the synaptic strength will eventually decay to zero, because z_i(t) is monotonically decreasing. This was valid for Wang and Arbib's research because they were examining the response of animals to artificial stimuli which are of no importance to the animals in question. Sensitization and classical conditioning are ignored in this model. If these other two learning mechanisms were included in the model, then the synaptic strength would only decay to zero in the absence of any unconditioned stimuli.
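As a concrete illustration, the following minimal Python sketch simulates Equations 1 and 2 for a single synapse. The specific parameter values and the square-pulse input are arbitrary illustrative choices, not values taken from [WA92].

```python
import numpy as np

def wang_arbib_habituation(I, tau=0.3, alpha=0.5, gamma=0.01, W0=1.0):
    """Simulate Equations (1)-(2) for one synapse driven by the input sequence I."""
    T = len(I)
    W = np.empty(T + 1)
    z = np.empty(T + 1)
    W[0], z[0] = W0, 1.0
    for t in range(T):
        # Short term habituation: W decays while I(t) is active and
        # recovers toward W0 (scaled by z) otherwise.
        W[t + 1] = W[t] + tau * (alpha * z[t] * (W0 - W[t]) - W[t] * I[t])
        # Long term habituation: z decreases monotonically with each
        # activation, which slows the later recovery of W.
        z[t + 1] = z[t] + gamma * z[t] * (z[t] - 1.0) * I[t]
    return W[1:], z[1:]

# A burst of activity followed by silence: W habituates, then partially recovers.
I = np.concatenate([np.ones(30), np.zeros(30)])
W, z = wang_arbib_habituation(I)
print(W[29], W[-1])  # habituated value vs. value after the recovery period
```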
2.2 Sensitization
Sensitization is another nonassociative form of learning. It is a complement to habituation. Whenever an unconditioned stimulus occurs, the reaction to any conditioned stimulus that may occur is heightened. Sensitization was also observed in Aplysia by Bailey and Kandel [BK85]. According to them, when the postsynaptic neuron receives an unconditioned stimulus, another neuron referred to as a facilitator is activated. This neuron then elicits a response in the presynaptic neuron which causes sensitization. According to the Byrne and Gingrich model, activation of the facilitator neuron causes an increased production in the presynaptic neuron of a substance called cAMP. Increased cAMP levels in turn stimulate both an increased influx of Ca2+ into the neuron and an increased flow of neurotransmitter from the storage pool to the releasable pool. The increase in the Ca2+ level causes an additional increase in the flow from the storage to releasable pool as well as increasing the rate of neurotransmitter release.
2.3 Classical Conditioning
Classical conditioning is the simplest form of associative learning. It occurs when a conditioned stimulus is observed within a certain time period before the observation of an unconditioned stimulus. After such a temporal pairing has occurred repeatedly, the conditioned stimulus itself produces a response.
This learned response is known as the conditioned response (CR) and it can occur
even in the absence of the unconditioned stimulus. The original experiment that demonstrated this type of behavior was performed by Pavlov [Pav27]. Such behavior has long been observed in higher animals, but Bailey and Kandel were among the first to discover classical conditioning in simple organisms such as Aplysia [BK85]. Because of the relatively simple neural structure of Aplysia, Bailey and Kandel were able to qualitatively discuss the biological mechanisms which produced the behavior. Byrne and Gingrich then took Bailey and Kandel's qualitative discussion and used it to build a mathematical model [BG89]. The behavior of the model was then compared to Bailey and Kandel's experimental results. According to this model, cAMP levels are enhanced even more strongly by an unconditioned stimulus when Ca2+ levels are high as the result of a recently occurring conditioned stimulus. In this manner, the same mechanism responsible for sensitization results in even greater
enhancement of the neural response when the unconditioned stimulus is temporally paired with the conditioned stimulus.
2.4 Temporal Information Encoding
In all three types of learning previously mentioned, the response of the neuron to stimuli varies over time. One wonders if this time varying response can be used to encode information about temporal patterns in the stimulus. In fact, an experiment has been performed which suggests that the short term form of habituation may be useful for temporal pattern recognition [RAH90]. This study focused on the perception of a repeated unary sound pattern such as XX'X'XX'. Here (X) represents an interval of a specific sound followed by an interval of silence of equal duration, and (') represents an interval of silence equal to the total duration of (X). Previous psychological studies had shown that humans perceived specific beginning and ending points of such patterns.
The pattern was perceived to
begin on the longest string of (X)s and end on the longest string of (')s.
Patterns for which
this heuristic can produce one unambiguous result were found to be perceived in only that one way. In accordance with this theory the repetitions of the patterns XXX'X''' and XX'X'XX' were unambiguous and ambiguous respectively, with beginnings perceived at the underlined sites. The experiment in [RAH90] connected electrodes to neurons in the brains of four cats. The neurons associated with the "input" (auditory tectum) responded more strongly to the sounds which were perceived by humans to begin the sequences. A similar response was noticed for neurons further removed from the auditory nerve. The authors, Robin et al., concluded that the reason sequences were perceived in this way had to do with the habituation of the input neurons to stimuli. Due to habituation, neurons respond most strongly to new stimuli, while frequently received stimuli are filtered out. Long runs of silence allowed the neurons to forget the (X) stimulus and hence respond more strongly to it, whereas long runs of (X)s allowed the neurons to habituate to it and filter it out; therefore, the (X) after the longest run of (')s and before the longest run of (X)s would respond most strongly. The experimental evidence supported this conclusion. This experiment highlights that habituation can encode information about a temporal sequence.
Without any specific spatial representation of the sequence (i.e. varying delay times between neurons or specific neurons to save previous state information), the neurons of the auditory tectum in the subject cats encoded contextual information by magnitude of response. It further seems apparent that varying habituation rates could be used to encode contextual information for time windows of various sizes. Another experiment illustrates this hierarchy of habituation times [WA92]. In this study frogs were presented with dummy prey and their neural and behavioral responses to the dummies were recorded over time. Thalamic pretectal neurons, which are relatively close to the inputs (retina) of the neural network, were found to show no adaptation effects which lasted longer than 120 s, but the frog's behavior showed longer periods of habituation effects lasting 24 hours or more. The thalamic pretectal neurons could ignore dummy prey for only two minutes without retraining, whereas the frog could ignore these dummies for more than a day. The thalamic pretectal neurons served to encode recent context information. Other neurons encoded this same information for longer periods of time. It is important to note that this information is contextual in nature. The learned response of ignoring prey is only useful for the frog within the specific context of recent exposure to dummy prey.
3 A Habituation Based Neural Classifier
We shall now describe an artificial neural structure which uses habituation to encode temporal information.
This structure is mathematically powerful.
It consists of two stages, a temporal
encoding stage based on habituation, and a static neural classifier stage.
Such a structure can
be shown to be capable of approximating arbitrarily well any continuous, causal, time-invariant mapping from one discrete time sequence to another. Thus, the habituation based structure can realize any function realizable by a TDNN.
3.1 General Design Structure
The first stage of the spatio-temporal network is a "memory" stage comprised of short term habituation units based upon the Wang and Arbib model of habituation. This stage produces a set of habituated weights from the input
x(t). If the input is multi-dimensional, one set is extracted for
each component. These weights are affected by the past values of the input, and implicitly encode temporal information. Spatio-temporal classification can thus be achieved by using such habituated weights as inputs to a static classifier. For example, if a multilayered perceptron (MLP) (alt. radial basis function network) is used, the overall network is a habituated MLP (alt. habituated RBF) that can be applied for spatio-temporal classification. The model equation is shown as follows.
W_k(t+1) = W_k(t) + τ_k ( α_k (1 − W_k(t)) − W_k(t) x(t) )    (3)

This equation is derived from Equation 1 by setting z_i(t) = 1 to eliminate long term habituation effects, replacing the presynaptic neuron activation, I_i(t), with the input x(t), and letting W_k(t) rebound to 1 instead of W_k(0). Long term habituation is eliminated so that the ability of W_k(t) to recover from habituation does not vary over time. Otherwise the W_k(t) values would eventually decrease to zero for all but the most infrequent of inputs. The k index is used to indicate that multiple values W_k(t+1) are determined for an input signal x(t). It was found mathematically
that multiple habituation values are better able to encode temporal information. This fact may also have a biological basis, because it is known that a given pair of neurons often has multiple synapses between them. Dynamic classification is achieved by training a suitable nonlinear feedforward network, whose inputs are a set of m habituated values, W_k(t+1), 1 ≤ k ≤ m, that are extracted from the
raw input x(t). Figure 2 shows the generic structure of such a classifier. In [WA92]
Wk(t) represents a
synaptic strength, but because our designs use habituated values as inputs to a static classifier rather than weights, the variables are redefined accordingly. We do not mean to imply that this network construction is either the most biologically feasible or the only method in which habituation might be used. A more biologically inspired approach would be to reflect
Wk(t) as modulating weights
of the inputs. We found by experiment, however, that this approach, although more biologically feasible, does not encode temporal information as well for the classification problems which we
studied. Moreover, the structure of Figure 2 can be shown mathematically to be very powerful.
Figure 2: Structure of Habituated Neural Networks. The input x(t) drives a layer of habituation units (memory), and the resulting habituation values W_k(t) are fed to a nonlinear feed-forward network that produces the outputs.

The parameters τ_k and α_k affect the rate at which habituation occurs, thereby determining the temporal resolution and range of the information obtained. The issues and tradeoffs involved are akin to memory depth versus resolution in dispersive delay line based models [dVP92], [Moz93]. We set W_k(0) to zero for all k, employ positive values of α_k and τ_k such that α_k τ_k + τ_k < 1, and normalize the input so that x(t) ∈ [0,1]. With these specifications, we can guarantee that the habituation process is stable. In fact we can guarantee that W_k(t) ∈ [0,1] for all values of k and t.
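For concreteness, the following Python sketch (our own illustration, not code from the article) computes the habituated memory of Equation 3 for a multi-dimensional input sequence and hands the resulting values to a static classifier stage at each time step. The random draw of α_k and τ_k from [0, 0.5] mirrors the experiments described later; everything else is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_habituation_params(d, m):
    """One (alpha, tau) pair per habituation unit; alpha*tau + tau < 1 keeps W in [0,1]."""
    alpha = rng.uniform(0.0, 0.5, size=(d, m))
    tau = rng.uniform(0.0, 0.5, size=(d, m))
    assert np.all(alpha * tau + tau < 1.0)
    return alpha, tau

def habituate(x_seq, alpha, tau):
    """Equation (3): W_k(t+1) = W_k(t) + tau_k*(alpha_k*(1 - W_k(t)) - W_k(t)*x(t)).

    x_seq: (T, d) input sequence normalized to [0, 1].
    Returns a (T, d*m) array of habituated values, one row per time step."""
    T, d = x_seq.shape
    W = np.zeros_like(alpha)               # W_k(0) = 0 for all k
    out = np.empty((T, alpha.size))
    for t in range(T):
        W = W + tau * (alpha * (1.0 - W) - W * x_seq[t][:, None])
        out[t] = W.ravel()
    return out

# Usage: a 30-dimensional input with m = 2 habituation units per component.
alpha, tau = make_habituation_params(d=30, m=2)
x_seq = rng.uniform(0.0, 1.0, size=(45, 30))
features = habituate(x_seq, alpha, tau)    # inputs to an MLP or RBF at each step
```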
3.2 Theoretical Properties
In this section, a theorem is presented concerning the ability of a general category of neural networks, including the habituation based structure, to approximate arbitrarily well any continuous,
causal, time-invariant mapping f from one discrete sequence to another. Since all functions realized by TDNNs with arbitrarily large but finite input window size are continuous, causal, and time-invariant, the proofs of the theorems also imply that habituation based networks can realize any function which can be realized by a TDNN [San91]. The key to the proof is to show that the memory structure realized by the habituated weights is a complete memory. Then so long as the feedforward stage is capable of uniformly approximating continuous functions, the overall network will be capable of mapping one sequence to another. The proof is related to the previous work by Sandberg. In [San92a] he demonstrates a method for determining necessary and sufficient conditions for universal approximation of dynamic input-output mappings. Also in [San91] he provides a universal approximation proof for structures similar to that of Figure 2, with the exception that the temporal encoding is performed with linear mappings. Let X be the set of discrete time sequences for which x ∈ X implies x(t) ∈ [0,1] and x(t) = 0 for all t ≤ 0. Let R be the set of all discrete time sequences. We are attempting to approximate a continuous, time-invariant, causal function, f, from X to R. It is known that any TDNN can be represented by such a function. First we define the delay operator, T_Δ:

(T_Δ ∘ x)(t) = 0 if t ≤ Δ, and (T_Δ ∘ x)(t) = x(t − Δ) otherwise.

When dealing with operators such as T_Δ which operate on sequences, we shall adopt the notation (T_Δ ∘ x)(t), which should be read as T_Δ operating on x at time t. Next we define the concept of a complete memory. Let B be a set of mappings from X to R. B is a complete memory if it has the following four properties. First, there exist real numbers a and c such that (b ∘ x)(t) ∈ (a, c) for all t ∈ Z⁺, x ∈ X, and b ∈ B. Second, for any t ∈ Z⁺ and any t₀ such that 0 < t₀ < t, the following is true: if x and y are elements of X and x(t₀) ≠ y(t₀), then there exists some b ∈ B such that (b ∘ x)(t) ≠ (b ∘ y)(t). Third, if b ∈ B then (b ∘ T_Δ ∘ x)(t) = (b ∘ x)(t − Δ) for all t ∈ Z⁺, all x ∈ X, and any Δ such that 0 < Δ < t. Fourth, every b ∈ B is causal.
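Restated compactly (our own formatting of the four properties just given):

```latex
% B, a set of mappings from X to R, is a complete memory iff:
\begin{enumerate}
  \item $\exists\, a, c \in \mathbb{R}$ such that $(b \circ x)(t) \in (a, c)$
        for all $t \in \mathbb{Z}^{+}$, $x \in X$, and $b \in B$;
  \item for any $t \in \mathbb{Z}^{+}$ and $0 < t_0 < t$: if $x, y \in X$ and
        $x(t_0) \neq y(t_0)$, then $\exists\, b \in B$ with
        $(b \circ x)(t) \neq (b \circ y)(t)$;
  \item if $b \in B$, then $(b \circ T_{\Delta} \circ x)(t) = (b \circ x)(t - \Delta)$
        for all $t \in \mathbb{Z}^{+}$, $x \in X$, and $0 < \Delta < t$;
  \item every $b \in B$ is causal.
\end{enumerate}
```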
It can be shown that the set of habituation functions, with all acceptable values of the habituation parameters α and τ, is a complete memory. Let a temporal input be transformed by a set of m habituation units.
If m is sufficiently large, then after habituating, f can be approximated
arbitrarily well by an MLP or RBF which takes the habituated values, Wk(t + 1), as inputs. This is a consequence of Theorem 1 which states that a two layer neural network with an exponential activation function and a complete memory structure for processing the inputs can universally approximate f.
Theorem 1. Let f be a continuous, causal, time-invariant function from X to R. If B is a complete memory then the following is true. Given any ε > 0 and any arbitrarily large positive integer, t₀, there exist real numbers, a_j and c_jk, elements of B, b_jk, and natural numbers, p and m, such that the following inequality holds for all x ∈ X and all t such that 0 ≤ t < t₀:

| (f ∘ x)(t) − Σ_{j=1}^{p} a_j exp( Σ_{k=1}^{m} c_{jk} (b_{jk} ∘ x)(t) ) | < ε    (4)
For conciseness, the theorem is stated here without proof. For more details, see [SG95a] and [SG95b]. The results of Theorem 1 can be readily extended to include habituated MLP and RBF networks and to include multiple (d > 1) spatial input dimensions, x_h(t), 1 ≤ h ≤ d. In order to show that habituated MLPs and RBFs can perform the same approximations it is sufficient to show that the exponential function can be approximated arbitrarily well by a summation of sigmoids or gaussian functions. This is a special case of theorems which have already been proven for sigmoids by Cybenko [Cyb89] among others and for gaussian functions by Park and Sandberg [PS91]. The expansion of the result to multiple spatial dimensions follows directly from the proof of Theorem 1. It is important to notice that the input processing functions, b_jk, used in Theorem 1 depend on j and thus the habituation parameters used also depend on j. This means that different hidden units in the feedforward network may have different input values. This dependency is not present in the structure illustrated in Figure 2. However, one can show that for any approximation g of the form discussed in Theorem 1, there is an equivalent network without this dependency.
Corollary 1. Let g be an approximation function of the form Σ_{j=1}^{p} a_j exp( Σ_{k=1}^{m} c_{jk} (b_{jk} ∘ x) ). Given any such g one can find an h of the following form such that g(x) = h(x) for all x ∈ X, for real numbers w_{ji} (weights to the hidden units) and elements of B, s_i:

h(x) = Σ_{j=1}^{p} a_j exp( Σ_{i=1}^{M} w_{ji} (s_i ∘ x) )    (5)
Proof. Simply choose M to be the number of distinguishable functions b_jk used in g and let the sequence {s_i} be the list of these distinguishable functions. For a particular s_i and a particular hidden node j, set w_{ji} to zero if the original b_jk corresponding to s_i was not present at hidden node j; otherwise set w_{ji} to the appropriate c_jk. An approximation of the form given by h has the same structure given in Figure 2, so the structure illustrated in Figure 2 is adequate. From Theorem 1 and Corollary 1 we conclude that a two layer static neural network with an exponential activation function and inputs transformed by elements of a complete memory, B, can perform the same function as any TDNN. As stated earlier, a habituation based network is a special case of this type of generalized structure. Since habituated networks have the same approximation power as TDNNs, the question that remains is which are more efficient. The answer depends on the nature of the function that is being realized. The complexity of the TDNNs depends on n, the input window size. The number of weighted inputs to each hidden unit in a TDNN is nd. For functions which only depend on recent values of the inputs, TDNNs can be quite efficient; but for functions which depend on long term temporal information or variable amounts of temporal information, TDNNs are not efficient solutions. For habituated networks, the required memory depth and resolution affects the choice of α and τ in Equation 3, and the number of habituated weights. Different weights can have different values of α and τ, and the number of weights used can vary in different dimensions. Thus the memory structure can be optimized for a given mapping. The number of inputs to each hidden unit is Σ_{i=1}^{d} m_i, where m_i is the number of habituation values used to encode the ith component of x(t). A parallel can be drawn between finding a suitable m and finding a suitable number of hidden units in an MLP or RBF. In both cases there is no guarantee that the number required will not be inordinately large. However, in the case of the MLP, a large set of problems has been found
for which a small number of hidden units is suitable. The same is true for finding m. There may be many simple problems which are unsuitable for TDNNs because they require long term temporal information, but which can be solved with habituated networks with small values of m. Since the output of the short-term memory stage is different for TDNNs and habituated networks, the complexity (number of hidden units) of the feedforward network needed at the output stage may also differ. For certain problems, habituated networks require a smaller feedforward output stage as compared to TDNNs for a given level of approximation.
We have previously
performed experiments using habituated MLPs to classify real SONAR data and have found that small habituated networks outperformed larger TDNNs. In fact we found that even m = 1 networks dramatically outperformed TDNNs with time window length of 5 or more [Sti94]. Unfortunately, due to the proprietary nature of the real SONAR data sets, they cannot be made public. Therefore, in Section 4, we discuss experimental results on artificial Banzhaf sonograms, which can be easily generated and verified by other researchers.
3.3 Other Related Structures
The habituation based design is a neural network structure which consists of a temporal encoding stage followed by a nonlinear feedforward neural network. To put this new network in perspective let us examine previous work involving other two stage networks with different temporal encoding mechanisms. A large amount of mathematical theory in this area has been produced by I. W. Sandberg. In this section, Sandberg's work is summarized. Additionally, the gamma network, a particular two-stage architecture developed by J. Principe, is also discussed. The gamma network has been shown to perform well for a number of applications and represents the current state of the art.
this result. First he introduced the concept of a fundamental set. A fundamental set is a family of mappings {F_λ : λ ∈ Λ} associated with a given dynamic mapping G which satisfies certain properties with respect to G. In [San91] Sandberg demonstrated that one can use such a fundamental set as a temporal encoding mechanism in order to approximate G. In the same paper, he exhibited a structure which can approximate any G which is a continuous, causal, time-invariant, approximately finite memory mapping from one discrete time sequence to another. It was shown that such a G can be approximated arbitrarily well by a function F of the form
Here h_m and x are functions of time, k_l, η_lm, and ρ_l are real constants, and σ is a sigmoid function. The overall approximation structure is an MLP feedforward stage, with affine operators (sum of a linear operator and a constant) used as a temporal encoding mechanism. Sandberg proved that such an F can approximate G arbitrarily well by showing that the set of affine operators is a fundamental set for G. He has also demonstrated several different forms for the temporal encoding stage which are sufficiently general in order to have the same approximation power but more specific than the set of all affine operators [SX95]. One example of such a temporal encoding stage is the gamma memory structure developed by Principe [dVP92]. The gamma memory structure is a set of operators defined as follows.
x_k(t) = (1 − μ) x_k(t − 1) + μ x_{k−1}(t − 1)    (7)

Here x_0 is the original input signal, x_j for 1 ≤ j ≤ K are the output signals from the temporal encoding stage, and μ is a constant parameter. For μ = 1 the gamma memory degenerates to a K length tapped delay line. The gamma memory structure is typically used with an MLP as a feedforward stage. The entire two stage network is referred to as a focused gamma network. When μ = 1 it is identical to a TDNN. In [dVP92] it is demonstrated that the memory depth of a gamma network is D = K/μ, whereas the memory resolution is R = μ. Unlike a TDNN, a gamma network can adaptively optimize the tradeoff between memory depth and resolution by training the μ parameter.
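The following short Python sketch (an illustration with our own parameter choices, not code from [dVP92]) implements the gamma memory recursion of Equation 7 and shows the tapped delay line special case μ = 1.

```python
import numpy as np

def gamma_memory(x, K, mu):
    """Equation (7): x_k(t) = (1 - mu)*x_k(t-1) + mu*x_{k-1}(t-1), with x_0 = input.

    Returns an array of shape (len(x), K) holding x_1 .. x_K at each time step.
    Memory depth is roughly K/mu and memory resolution is mu."""
    taps = np.zeros(K + 1)              # taps[k] holds x_k at the previous step
    out = np.empty((len(x), K))
    for t in range(len(x)):
        prev = taps.copy()
        taps[0] = x[t]
        for k in range(1, K + 1):
            taps[k] = (1.0 - mu) * prev[k] + mu * prev[k - 1]
        out[t] = taps[1:]
    return out

x = np.zeros(12)
x[0] = 1.0                              # unit impulse
print(gamma_memory(x, K=4, mu=1.0))     # mu = 1: a pure tapped delay line
print(gamma_memory(x, K=4, mu=0.5))     # mu < 1: dispersive taps, memory depth K/mu = 8
```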
The primary difference between the habituation based design and any of the previous two-stage network designs is that the temporal encoding stage (habituation) is nonlinear and not comprised of affine operators. Although, as demonstrated by Sandberg, affine operators are sufficient to approximate a wide variety of mappings, it is possible that an impractically large number of operators may be required to approximate a particular problem to a given error tolerance. It is our conjecture that a nonlinear temporal encoding strategy may provide a more efficient solution for some problems.
4 Experimental Results
In this section, we shall discuss the results obtained when a habituation based network is applied to a spatio-temporal signal classification problem. Before describing the data sets used and providing comparative experimental results, some fundamental issues and terminology pertaining to classification of spatio-temporal signals are summarized in Section 4.1.
4.1 Basic Issues in Spatio-Temporal Classification
The signals of interest in speech or sonar processing exhibit time-varying spectral characteristics, and can also be of varying lengths or time durations.
Before classification can be attempted, a
signal has to be separated from background noise and clutter, and represented in a suitable form. In order to use a static neural network such as the MLP or RBF, the entire signal must first be described by a single feature vector, i.e., a vector of numerical attributes, such as signal energy in different frequency ranges or bands [GDB92]. The selection of an appropriate set of attributes is indeed critical to the overall system performance, since it fundamentally limits the performance of any classification technique that uses those attributes [GDB90]. The use of static classifiers is not ideal because representing each signal by a single feature vector results in a blurring of the spatio-temporal pattern characteristics and may lead to a loss of information that would be useful for discrimination.
Another, more detailed way to view a
time-varying nonstationary signal is to treat it as a concatenation of quasi-stationary segments
[HN89]. Each segment can be represented by a feature vector obtained from the corresponding time interval. Thus the entire signal is represented by a sequence of feature vectors that constitute a spatio-temporal pattern.
A popular way of representing such signals is by a two-dimensional
feature-time plot called a spectrogram which shows the sequence of feature vectors obtained by extracting attributes from consecutive segments. The number of segments representing a signal is not necessarily pre-determined. Segmentation of nonstationary signals is a difficult problem and has been extensively studied [DKBB92]. Any neural network applied to the signal detection task must also address the following issues. (i)
Cueing: Generally, any network that performs a recognition task must be informed of the
start and end of the pattern to be recognized within the input stream. If this is so, the classifier is called a cued classifier. An uncued classifier, on the other hand, is fed a continuous stream of data from which it must itself figure out when a pattern of interest begins and when it ends. (ii)
Spatio-temporal Warping: Often the original pattern gets distorted in temporal and
spatial domains as it travels through the environment.
A robust classifier must be insensitive
to these distortions. If the classifier is not sophisticated in this aspect, then the distorted signal received needs to be preprocessed to annul the effects of the environment. This is called equalization and is widely used in communication systems. The characteristics of the communication or signal transmission medium are "equalized" by the preprocessor, so that the output of this processor is similar to the original signal.
Alternatively, mechanisms such as the dynamic time warping
algorithm [Kun93] can be built into the classifier to make it less susceptible to the distortions. Since a general purpose cued classifier must detect the presence of a signal as well as classify it, measures of quality include not only classification accuracy, but false alarm rate, the number of times the system indicates presence of a signal when there is none, and missed detection rate, the number of real signals that are present but not detected. Also important are the computational power required and confidence in classification decisions when they are made [GDB92]. Another problem inherent in spatio-temporal classification, that is not present in static classifiers, is when to decide to classify a signal. With a static classifier, after each pattern is
presented some decision must be made, but in spatio-temporal classification the network is presented sequentially with some number of subpatterns. The network must determine at what point it has enough information to classify a signal. One method of making this decision is to classify a signal only when a particular class membership is indicated over several consecutive subpatterns.
For clarity we shall only use the terms signal or pattern to refer to a sequence of feature vectors. Each component feature vector shall be referred to as a sample or subpattern.
Banzhaf signals, consisting of 30-dimensional feature vectors, with sequence length typically between 30 and 45 samples, and including the effects of spatio-temporal warping, are used to evaluate the habituation networks. These signals are described in the next section. Also note that the proposed classifier is uncued and has to tackle both detection and classification. Therefore, it involves selection of thresholds to yield a range of missed detection rate vs. confidence in classification tradeoffs, as detailed in Section 4.3. False alarm rate is not reported, because the simplicity of our background noise model tends to produce artificially few false alarms.
4.2 Banzhaf Sonograms
This section describes Banzhaf sonograms [BK91], which were selected to obtain comparative experimental results on high dimensional spatio-temporal classification. Banzhaf sonograms were used for several reasons. First, they are easily reproduced by other researchers. Secondly, they can be used to vary the difficulty of the data sets, the dependence of performance on temporal information, and the amount of warping in both time and space. Finally, it is known that superpositions of gaussian functions can be used to model any continuous spectrogram arbitrarily well and several interesting spectrograms (sonar, radar) are well modeled using small numbers of gaussian components. A Banzhaf mother signal is generated by superposing two dimensional gaussian functions. The pth component gaussian, G_p(x,t), is completely described by the constant parameters x_0[p], t_0[p], θ[p], h_x[p], h_t[p], λ_x[p], and λ_t[p]. The pair (x_0[p], t_0[p]) is the coordinates in time and space of the center of the gaussian, θ[p] is the angle of axial tilt, h_x[p] and h_t[p] determine the height of the peak, and λ_x[p] and λ_t[p] determine the width of the gaussian in space and time respectively. Values for each component gaussian, G_p(x,t), are determined for integer values of x and t such that 0 ≤ x ≤ 29 and 0 ≤ t ≤ 40. For each value of x and t, the computation is performed in the following manner. First the following transformation is applied to rotate the gaussian by an angle of θ[p] about its center point, (x_0[p], t_0[p]):

x* = (x − x_0[p]) cos θ[p] + (t − t_0[p]) sin θ[p] + x_0[p]
t* = (x_0[p] − x) sin θ[p] + (t − t_0[p]) cos θ[p] + t_0[p]

Next the value, G_p(x,t), is calculated as follows:

G_p(x,t) = h_x[p] exp( −((x* − x_0[p])/λ_x[p])² ) h_t[p] exp( −((t* − t_0[p])/λ_t[p])² )    (8)

The Banzhaf mother signal, G(x,t), is then computed by summing the component gaussians:

G(x,t) = Σ_p G_p(x,t)    (9)
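A minimal Python sketch of Equations 8 and 9 as written above; the grid size matches the 30 spatial and 41 temporal samples, while the example component parameters are merely illustrative.

```python
import numpy as np

def banzhaf_component(x0, t0, theta, hx, ht, lx, lt, nx=30, nt=41):
    """One rotated gaussian component G_p(x,t) evaluated on an nx-by-nt integer grid."""
    x, t = np.meshgrid(np.arange(nx), np.arange(nt), indexing="ij")
    # Rotate the coordinates by theta about the center (x0, t0).
    xs = (x - x0) * np.cos(theta) + (t - t0) * np.sin(theta) + x0
    ts = (x0 - x) * np.sin(theta) + (t - t0) * np.cos(theta) + t0
    return hx * np.exp(-((xs - x0) / lx) ** 2) * ht * np.exp(-((ts - t0) / lt) ** 2)

def banzhaf_mother_signal(components):
    """Equation (9): sum the component gaussians to form G(x,t)."""
    return sum(banzhaf_component(*c) for c in components)

# Example: a two-component prototype; parameters are (x0, t0, theta, hx, ht, lx, lt).
G = banzhaf_mother_signal([
    (7.5, 7.5, -0.785, 1.0, 1.0, 2.0, 7.6),
    (15.0, 27.5, 0.0, 1.0, 1.0, 2.0, 9.2),
])
print(G.shape)  # (30, 41): 30 spatial samples by 41 time samples
```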
In our experiment we generated six different mother signals using six different sets of parameter values. Each mother signal serves as the prototypical member of a class. Figure 3 shows the set of mother signals for each class. Table 1 lists the parameter values for the mother signals. The values for θ[p] are given in radians. Classes B and E have two component gaussians. All other classes have three. Training and test examples of each class were generated by adding uniformly distributed noise in the range [-0.1,0.1] to G(x,t) and
also perturbing the parameters in order to
rotate, scale, and warp the mother signals to some extent. When the signals were time warped, the number of time samples calculated was varied accordingly, in order to create signals with a variable length in time. The number of spatial samples, however, was held constant at 30. A seventh "noise only" class was also constructed. The training and test sets were made up of 10 examples of each signal class and 9 examples from the "noise only" class. Once the data sets were compiled, they were normalized so that all data values were in the range [0,1]. Figure 4 gives typical examples of each signal class after rotating, scaling, and warping. The data set illustrated in Figures 3 and 4 is designated as data set 1 (DS1). Clearly classification of DS1 is a problem which requires relatively long term temporal information. It is impossible to uniquely classify any signal based on only a short temporal window of inputs. For example, consider the mother signals of classes A, B, and C as illustrated in Figure 3. The signals in classes A and B are identical for the first twenty time samples, while classes A and C are identical for the last twenty time samples. Additionally, there is no time window of less than
ten samples in any of the three signals that is not identical to a time window in one of the other two signals. This classification problem is obviously difficult for short window TDNNs.

Table 1: Parameters of Mother Signals of Data Set 1

Parameter   Class A   Class B   Class C   Class D   Class E   Class F
x_0[1]        7.5       7.5      15.0       7.5      15.0      15.0
t_0[1]        7.5       7.5       7.5       7.5       7.5       7.5
θ[1]        -0.785    -0.785     0.0      -0.785     0.0       0.0
h_x[1]        1.0       1.0       1.0       1.0       1.0       1.0
h_t[1]        1.0       1.0       1.0       1.0       1.0       1.0
λ_x[1]        2.0       2.0       2.0       2.0       2.0       2.0
λ_t[1]        7.6       7.6       6.6       7.6       6.6       6.6
x_0[2]       15.0      15.0      15.0      15.0      15.0      15.0
t_0[2]       27.5      27.5      27.5      27.5      27.5      27.5
θ[2]          0.0       0.0       0.0       0.0       0.0       0.0
h_x[2]        1.0       1.0       1.0       1.0       1.0       1.0
h_t[2]        1.0       1.0       1.0       1.0       1.0       1.0
λ_x[2]        2.0       2.0       2.0       2.0       2.0       2.0
λ_t[2]        9.2       9.2       9.2       9.2       9.2       9.2
x_0[3]        7.5        -        7.5      22.5        -       22.5
t_0[3]       27.5        -       27.5      27.5        -       27.5
θ[3]          0.0        -        0.0       0.0        -        0.0
h_x[3]        1.0        -        1.0       1.0        -        1.0
h_t[3]        1.0        -        1.0       1.0        -        1.0
λ_x[3]        2.0        -        2.0       2.0        -        2.0
λ_t[3]       5.94        -       5.94      5.94        -       5.94

Figure 3: Banzhaf Mother Signals for classes A-F (clockwise from the top left) of Data Set 1. The vertical and horizontal axes depict time (increasing top to bottom) and frequency (increasing left to right) respectively.

In order to demonstrate the effectiveness of habituation for problems with a range of difficulties we have constructed two other data sets which do not depend as severely on long term temporal information. Data set 2 (DS2) and data set 3 (DS3) were generated using the same parameters as DS1 except that the centers of the component gaussians were shifted to reduce the overlap among the classes. In DS2, the component gaussians of samples of a particular class were all shifted uniformly, whereas in DS3, individual component gaussians were shifted so that a particular gaussian component might act as a tag for identifying the class membership of the signal. For this reason DS3 is the most local temporal information rich of the three data sets, and one would expect TDNNs to perform relatively better on DS3 than on either of the other data sets.
Figure 4: Sample Signals from Data Set 1.

4.3 Classification of Uncorrelated Sequences
For our first experiment, we trained habituated MLPs and TDNNs on DS1 and DS2. The patterns in both data sets were randomly shuffled so that the classification of each pattern was uncorrelated with the classification of nearby patterns in the sequence. When the habituated MLP was tested, the habituation values, Whk(t + 1), were calculated at each instance in time and then fed into the feedforward portion of the network. At each instance in time an output vector was computed and then used in classification. During training, the habituation values were calculated in one pass through the training set, and then randomly shuffled before being used to train the MLP stage. This method is advantageous as long as the habituation parameters are not adapted, because it eliminates oscillatory behavior in training. As an additional optimization, only the habituation values computed for the last ten samples in each signal were used to train the habituated MLP. This reduced training set method (RTS) was used because habituation gradually builds up information about a signal as the signal is presented. During the first few samples of a signal, a habituated MLP does not have enough information to classify a signal. By the end of the signal, however, the network should have accumulated enough information to perform the classification. The reduced training set method was not used for TDNNs, however, because they exhibit no similar dichotomy
in the way they store information over time. For the first set of experiments we used habituated MLPs with random values of α_hk and τ_hk in the range [0, 0.5]. The α_hk and τ_hk parameters were not modified during training. The number of habituation units per input, m, was set to one. We found that for DS1 the habituated MLP (HMLP) greatly outperformed a 5 sample time window TDNN and an MLP. All three networks utilized 10 hidden units. Increasing the number of hidden units was not found to greatly affect the performance. Classification and detection of signals is accomplished using two thresholds, H and L. Detection occurs whenever a single output node has an output value, O_max, larger than all other output nodes, O_max > H, and all other output values are less than 1 − H, for L consecutive input presentations. Classification is considered to be correct for a given signal if the only class detected within the length of the signal is the desired class. The best values of H and L may vary from network to network. For a fair comparison, for each network one should select the L for which the network achieved its highest classification rate for some H. Figure 5 illustrates performance in terms of the classification rate, i.e. the percentage of signals detected as well as correctly classified. Labels such as "L10" are included in the figures to denote the particular value of L used. As mentioned earlier, the best value of L was chosen for each classifier in order to make a fair comparison. Results for DS2 similar to those found for DS1 are illustrated in Figure 6. For conciseness, this time the results are given in terms of the classification rate only. One notices in Figure 6 that although the HMLP has achieved a greater maximum result than the TDNN, it does not have a better result at every value of H. This is an artifact of the particular L values used. The greater the value of L the steeper the decrease in performance for increasing H. The HMLP was able to outperform the TDNN because, unlike the TDNN, the HMLP is capable of encoding long term temporal information. On DS1 this is particularly important, because classification is impossible without information from a large portion of the entire signal length.
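The detection and classification rule just described can be sketched as follows; the array layout, class indexing, and the random outputs in the usage example are our own assumptions for illustration.

```python
import numpy as np

def detections(outputs, H, L):
    """Set of classes detected anywhere within one signal.

    outputs: (T, C) array of network outputs.  A class is detected when its node
    exceeds H while every other node stays below 1 - H for L consecutive
    input presentations."""
    found, run_class, run_len = set(), None, 0
    for o in outputs:
        winner = int(np.argmax(o))
        others_low = bool(np.all(np.delete(o, winner) < 1.0 - H))
        if o[winner] > H and others_low:
            run_len = run_len + 1 if winner == run_class else 1
            run_class = winner
            if run_len >= L:
                found.add(winner)
        else:
            run_class, run_len = None, 0
    return found

def correctly_classified(outputs, true_class, H, L):
    """Correct iff the only class detected within the signal is the desired class."""
    return detections(outputs, H, L) == {true_class}

# Hypothetical usage: outputs of a 7-class network over a 40-step signal.
rng = np.random.default_rng(1)
outputs = rng.uniform(0.0, 1.0, size=(40, 7))
print(correctly_classified(outputs, true_class=3, H=0.8, L=10))
```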
Figure 5: Overall Classification Rate (CR) on Data Set 1

4.4 Comparison of Different Training Methods
As previously mentioned, one method for improving HMLP performance is to train only on the last few habituated feature vectors in each signal.
In the results discussed so far the last ten
samples of each signal were used to train the HMLP. We shall henceforth refer to this method as the reduced training set (RTS) approach. Figure 7 demonstrates how the number of training samples per signal affects the performance on data set 1 as measured by the mean square error at each subpattern. Because signals have variable length, some care must be taken to make the numbering of the subpatterns uniform. Subpattern 45 of any signal refers to the last subpattern. Subpattern k for k < 45 refers to the subpattern 45 − k subpatterns from the end of the signal. It is apparent from the figure that the reduction of the number of training samples is only useful up to a point. For too few training samples, generalization may suffer. Also, for few training samples, the performance peaks sharply at the end of the signals, and the performance on the rest of the signal is unpredictable. On the other hand, training on the entire signal is suboptimal, because too much effort is wasted trying to classify the first few samples of each signal, at which time there is not enough information to make a correct classification. It may be that by training on the first
few samples of the data, the network is forced to inappropriately use pattern ordering information as the only available means of reducing MSE on those samples. As previously discussed, this may inhibit generalization to the test set.

Figure 6: Overall Classification Rate on Data Set 2

An alternative training method to RTS is to apply ramping to the desired outputs as depicted in Figure 8. With ramping, the network is trained to gradually change from one signal classification to another. At the beginning of each pattern all the desired output values ramp toward a quiescent value, β, in R time steps. After S time steps, the desired output, O_new, associated with the new pattern begins to increase toward one, while all other desired outputs decrease toward zero. After R additional time steps O_new reaches one. The β value is the reciprocal of the number of classes, including the noise class. Because of this choice of β and since all the rampings are linear, the sum of the desired outputs is always one. This is required so that the outputs of a well-trained network can estimate the a posteriori probabilities of each class with respect to the input. In practice we set β = 0 and found that the performance of the networks was unaffected. The ramping method has an advantage over RTS in that the behavior of the network on the first few samples of a new pattern is more predictable. This is true because with ramping the network is trained to produce low output
values for the first few training samples, while if RTS is used the network is not trained on these samples at all. On the other hand, ramping requires two parameters, R and S, whereas RTS requires only the number of training samples per signal. Additionally, training on the first few samples may require the inappropriate use of pattern order information even if the desired output is zero. It may be undesirable for the network to try to determine whether or not it is at the beginning of a pattern.

Figure 7: Effect of the Number of Training Samples per Signal (TSPS) on Performance of HMLP

Figure 9 illustrates the effects of the ramping parameters R and S on the HMLP's performance. Ramping worked best for intermediate values of S and R. The results of comparisons between RTS and ramping were mixed. RTS seemed to do slightly better on DS2, while ramping performed slightly better on DS3. Apparently, habituation does not do as well on DS3 as on the other data sets. This result is reasonable because, of the three data sets, DS3 is the one which is least dependent on long term temporal information. DS3 can be fairly well classified with only local time information. For such a data set, TDNNs perform adequately well. The results, however, are still encouraging for habituation, because the HMLP is much simpler in structure than a TDNN of depth 5. The HMLP has far fewer trainable parameters. Even if the α and τ parameters are
considered, the HMLP has less than one fourth as many parameters. Additionally, the habituation parameters were randomly assigned and never trained in all the experiments illustrated so far. Despite the simplicity of the HMLP, it outperformed much more complex TDNNs on DS1 and DS2, and performed similarly to the TDNN on DS3.
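The ramped desired outputs described in the previous section can be generated as in the following sketch; the exact linear interpolation details and the treatment of the previous pattern's targets are our own reading of the description.

```python
import numpy as np

def ramped_targets(n_steps, new_class, n_classes, R, S, beta=0.0, prev_target=None):
    """Desired outputs (n_steps, n_classes) for one pattern (assumes S >= R).

    All outputs ramp linearly from the previous pattern's targets to the
    quiescent value beta over R steps; after S steps the output for new_class
    ramps up to one (and the others down to zero) over R further steps."""
    quiescent = np.full(n_classes, beta)
    if prev_target is None:
        prev_target = quiescent.copy()
    final = np.zeros(n_classes)
    final[new_class] = 1.0
    d = np.zeros((n_steps, n_classes))
    for t in range(n_steps):
        if t < R:                          # ramp toward the quiescent value
            frac = (t + 1) / R
            d[t] = (1.0 - frac) * prev_target + frac * quiescent
        elif t < S:                        # hold the quiescent value
            d[t] = quiescent
        elif t < S + R:                    # ramp toward the new pattern's target
            frac = (t + 1 - S) / R
            d[t] = (1.0 - frac) * quiescent + frac * final
        else:
            d[t] = final
    return d

# With beta = 1/(number of classes) the targets sum to one at every step.
targets = ramped_targets(45, new_class=2, n_classes=7, R=5, S=10, beta=1.0 / 7)
print(targets.sum(axis=1))  # all ones
```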
4.5 Using Multiple Habituation Units for Each Input
So far, all the experiments discussed have focused on HMLPs with m = 1. The mathematical proof presented in Section 3.2 required that m be variable, and values of m > 1 make the network more powerful and may allow improvements in performance. In order to examine this possibility, we now consider the effect of increasing m to construct multiply habituated MLPs (MHMLPs). This experiment was performed using data set 1. It was found that increasing m did lead to improved performance: HMLPs with m = 2, m = 3, and m = 5 achieved classification rates of 59.4, 56.5, and 58.0 percent, respectively, as compared to 55.1 percent for the m = 1 case. On DS3, an MHMLP with m = 2 performed as well as the 5 time sample TDNN; both networks achieved a classification rate of 76.8 percent.
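To make the role of m concrete, the sketch below builds a bank of m habituation units per input with randomly assigned parameters. The update rule is assumed to be the usual first-order habituation recursion (our reading of Equation 3, which appears earlier in the article); the parameter ranges, random assignment, and function names are illustrative only.

```python
import numpy as np

def habituate(x, tau, alpha):
    """Habituation encoding of a scalar input sequence x (illustrative sketch).

    Assumes the first-order update
        W(t+1) = W(t) + tau * (alpha * (1 - W(t)) - W(t) * x(t)),
    with the habituated value starting at its resting level of one.
    """
    W = np.empty(len(x))
    w = 1.0
    for t, xt in enumerate(x):
        w = w + tau * (alpha * (1.0 - w) - w * xt)
        W[t] = w
    return W

def multi_habituate(x, taus, alphas):
    """m habituation units per input: one column of output per (tau, alpha) pair."""
    return np.column_stack([habituate(x, t, a) for t, a in zip(taus, alphas)])

# Example: m = 3 units with randomly assigned parameters (ranges are our choice).
rng = np.random.default_rng(0)
x = rng.random(100)                           # input sequence scaled to [0, 1]
taus = rng.uniform(0.05, 0.5, size=3)
alphas = rng.uniform(0.05, 0.5, size=3)
features = multi_habituate(x, taus, alphas)   # shape (100, 3), fed to the MLP stage
```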
4.6 Hybrid Habituated TDNN Network
Another type of network that bears investigation is the hybrid habituated TDNN (HTDNN), in which the habituated values W_k(t + 1) are taken as inputs to a TDNN. This network combines the local temporal information of a TDNN with the long term temporal information available from an HMLP. We found that a 5 time sample HTDNN with m = 1 did outperform an HMLP with m = 1 on artificial data set 2: the classification rate was 84.1 percent for the HTDNN as compared to 81.2 percent for the HMLP. We also tried HTDNNs with 2 and 3 sample time windows, but the results were not as good as those obtained with the HMLP.
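A minimal sketch of the windowing step that turns habituated values into HTDNN inputs is given below. The function name and the flattening of each window into a single vector are our assumptions about how the habituated values would be presented to the time-delay stage.

```python
import numpy as np

def delay_window(H, D):
    """Stack D consecutive habituated vectors into one input per time step.

    H : array of shape (T, n); row t holds the habituated values W(t)
    D : window depth (the "5 time sample" HTDNN in the text corresponds to D = 5)
    Returns an array of shape (T - D + 1, n * D), one flattened window per step.
    """
    T, n = H.shape
    return np.stack([H[t - D + 1:t + 1].reshape(-1) for t in range(D - 1, T)])

# Example: 100 time steps of 4 habituated values, windowed with depth 5.
H = np.random.default_rng(1).random((100, 4))
X = delay_window(H, D=5)          # shape (96, 20); presented to the static stage
```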
4.7 Applying Principal Component Analysis
Although habituated networks tend to be less complex than TDNNs, there is still further room for complexity reduction, especially when the input has a large number of spatial components. One method for improving HMLP performance and reducing the complexity of the static classifier stage is to perform principal component analysis on the habituated values. The sequence of habituated values generated during a single pass through the training set is stored and the covariance matrix M is determined. Next, the eigenvalues and eigenvectors of M are computed. Finally, the set of eigenvectors h_i corresponding to the largest few eigenvalues is selected. Each vector of habituated values W(t) is then replaced by the sequence of dot products W(t)^T h_i, and these dot products are presented to the static classifier instead of the habituated values themselves. Applying principal component analysis decreases the correlation among the inputs to the static classifier as well as their number. For an HMLP with m = 1, the number of inputs to the static classifier was reduced by a factor of 3 while simultaneously improving the classification rate on DS1 from 55 to 68 percent. The fact that the HMLP performed better when less information was presented to the static classifier is counterintuitive. Reducing the number of inputs to the static classifier, however, leads to simplified training. Additionally, since the correlation among the inputs is also decreased, the parameter values in the static classifier have fewer interdependencies during training. Since the number of parameters in the static classifier is also reduced, generalization is improved. The simplified training and improved generalization result in a better classification rate.
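The compression step described above amounts to a standard principal component projection of the stored habituated values. The sketch below follows that description; mean-centering is added as a conventional preprocessing choice, and the number of retained components is left to the experimenter (the text specifies only the factor-of-3 reduction actually used).

```python
import numpy as np

def pca_project(H, num_components):
    """Project habituated values onto their leading principal components.

    H : array of shape (T, n); each row is the vector W(t) of habituated values
        collected over one pass through the training set.
    Returns (P, V): P, of shape (T, num_components), holds the dot products that
    replace the habituated values as inputs to the static classifier; V holds the
    selected eigenvectors so that validation and test data can be projected too.
    """
    Hc = H - H.mean(axis=0)                 # center before estimating covariance
    M = np.cov(Hc, rowvar=False)            # covariance matrix of habituated values
    eigvals, eigvecs = np.linalg.eigh(M)    # eigh: M is symmetric
    order = np.argsort(eigvals)[::-1]       # sort eigenvalues in decreasing order
    V = eigvecs[:, order[:num_components]]  # eigenvectors of the largest eigenvalues
    return Hc @ V, V
```

With the factor-of-3 reduction reported above, num_components would be roughly one third of the number of habituated values.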
4.8 Effect of Varying Habituation Parameters
To determine the dependence of HMLP performance on the α_k and τ_k parameters, HMLPs with m = 1 were trained and tested on DS2 with all α_k and τ_k parameters set to constant values A and T, respectively. Figure 10 illustrates the effect of varying T with A = 0.1, and Figure 11 demonstrates what happens when A is varied with T = 0.3. It is important to note that the networks in this experiment are all suboptimal; for one thing, they are all trained on the entire signals. Figure 7 illustrates that this is not the best method, but it better demonstrates the temporal effects of varying the habituation parameters. From Equation 3 we observe that T affects the rate at which the habituated values change with changes in the input: for large values of T, the W_k(t) values change quickly with time. The A parameter, on the other hand, controls the rate at which information is forgotten: the larger the value of A, the faster the W_k(t) values rebound to a constant unity value. From Figures 10 and 11, one notices that large values of T or A result in performance peaks closer to the beginning of the signal, whereas small values of either parameter lead to peaks toward the end of the signals. For large values of A, information is forgotten quickly and the latter parts of a signal are therefore classified more poorly. For small values of T, information is accumulated slowly and the earlier parts of the signal are therefore classified poorly. As both figures illustrate, intermediate values of A and T are required to optimize performance. If A is too large, information about the beginning of a signal is forgotten before the end of the signal; if A is too small, unnecessary information from before the current signal is retained at the expense of more useful recent information. A similar tradeoff exists for T: large values lead to information that is too localized, while small values lead to information that is too long term.
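The tradeoff between A and T can be seen directly by tracing the habituated value of a single input, again assuming the first-order recursion used in the earlier sketch of Equation 3; the input profile and parameter values below are arbitrary and chosen only to make the forgetting and accumulation effects visible.

```python
import numpy as np

def trace_habituation(A, T_rate, x):
    """Trace W(t) under the assumed habituation update with constant A and T."""
    w, trace = 1.0, []
    for xt in x:
        w = w + T_rate * (A * (1.0 - w) - w * xt)
        trace.append(w)
    return np.array(trace)

# A burst of activity followed by silence illustrates the tradeoff discussed above.
x = np.concatenate([np.ones(30), np.zeros(70)])
slow_forget = trace_habituation(A=0.01, T_rate=0.3, x=x)   # small A: W stays depressed
fast_forget = trace_habituation(A=0.5,  T_rate=0.3, x=x)   # large A: W rebounds quickly
slow_update = trace_habituation(A=0.1,  T_rate=0.05, x=x)  # small T: W changes slowly
```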
5 Conclusions
Two stage networks are a well known structure for approximating mappings from one discrete time sequence to another, and several such structures have been shown to be universal approximators. In order to use a two stage network to approximate a particular mapping f, two conditions must be met. First, the temporal encoding stage must be sufficiently powerful to encode all necessary information about the past history of the inputs. Second, the feedforward stage must be powerful enough to perform the required static mapping. In this article, we have discussed a network structure which is capable of approximating all continuous, causal, time-invariant mappings from one discrete time sequence to another. This structure is biologically motivated and differs significantly from previously developed two stage networks. Unlike TDNNs and the gamma network, the habituation based network has a nonlinear temporal encoding stage. While this nonlinearity does not affect the approximation power, which is the same as that of TDNNs and the gamma network, it may allow more efficient approximations for some mappings. Experimental results have also been obtained regarding the ability of habituation based networks to classify spatio-temporal signals. In the experiments performed so far, habituation based networks have consistently been more efficient than TDNNs. Additionally, it has been found that for complex data sets with a large number of inputs, it is often beneficial to compress the information passed between the temporal encoding stage and the feedforward stage. Such a compression can result in a streamlined feedforward stage which is more quickly trained and better able to generalize. Experimental results indicate that when principal component analysis is used to perform this compression, the performance of a habituation based network can be improved dramatically.
References

[Bel88] T. Bell. Sequential processing using attractor transitions. In Proceedings of the 1988 Connectionist Models Summer School, pages 93-102, June 1988.
[BG89] J. H. Byrne and K. J. Gingrich. Mathematical model of cellular and molecular processes contributing to associative and nonassociative learning in Aplysia. In J. H. Byrne and W. O. Berry, editors, Neural Models of Plasticity, pages 58-70. Academic Press, San Diego, 1989.
[BK85] C. Bailey and E. Kandel. Molecular approaches to the study of short-term and long-term memory. In Functions of the Brain, pages 98-129. Clarendon Press, Oxford, 1985.
[BK91] W. Banzhaf and K. Kyuma. The time-into-intensity-mapping network. Biological Cybernetics, 66:115-121, 1991.
[Cyb89] G. Cybenko. Approximations by superposition of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303-314, 1989.
[Day90] J. Dayhoff. Regularity properties in pulse transmission networks. In Proceedings of the Third International Joint Conference on Neural Networks, pages III:621-626, June 1990.
[DKBB92] P. M. Djuric, S. M. Kay, and G. F. Boudreaux-Bartels. Segmentation of nonstationary signals. In Proc. ICASSP, pages V:161-164, 1992.
[dVP92] B. de Vries and J. C. Principe. The gamma model - a new neural net model for temporal processing. Neural Networks, 5:565-576, 1992.
[GAIL91] R. Granger, J. Ambros-Ingerson, and G. Lynch. Derivation of encoding characteristics of layer II cerebral cortex. Jl. of Cognitive Neuroscience, pages 61-78, 1991.
[GD95] J. Ghosh and L. Deuser. Classification of spatio-temporal patterns with applications to recognition of sonar sequences. In T. McMullen, E. Covey, H. Hawkins, and R. Port, editors, Neural Representation of Temporal Patterns, 1995.
[GDB90] J. Ghosh, L. Deuser, and S. Beck. Impact of feature vector selection on static classification of acoustic transient signals. In Government Neural Network Applications Workshop, August 1990.
[GDB92] J. Ghosh, L. Deuser, and S. Beck. A neural network based hybrid system for detection, characterization and classification of short-duration oceanic signals. IEEE Jl. of Ocean Engineering, 17(4):351-363, October 1992.
[GW93] J. Ghosh and S. Wang. A temporal memory network with state-dependent thresholds. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco, pages I:359-364, March 1993.
[HKP91] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
[HN89] J.-P. Hermand and P. Nicolas. Adaptive classification of underwater transients. In Proc. ICASSP, pages 2712-2715, 1989.
[Kun93] S. Y. Kung. Digital Neural Networks. Prentice Hall, Englewood Cliffs, NJ, 1993.
[Kur87] S. Kurogi. A model of neural network for spatiotemporal pattern recognition. Biological Cybernetics, 57:103-114, 1987.
[Lip89] R. P. Lippmann. Review of neural networks for speech recognition. Neural Computation, 1(1):1-38, 1989.
[Mar90] A. Maren. Neural networks for spatio-temporal recognition. In A. Maren, C. Harston, and R. Pap, editors, Handbook of Neural Computing Applications, pages 295-308. Academic Press, 1990.
[Moz93] M. C. Mozer. Neural network architectures for temporal sequence processing. In A. S. Weigend and N. A. Gershenfeld, editors, Time Series Prediction, pages 243-264. Addison Wesley, 1993.
[Pav27] I. Pavlov. Conditioned Reflexes. Oxford University Press, London, 1927.
[PS91] J. Park and I. W. Sandberg. Universal approximation using radial basis function networks. Neural Computation, 3(2):246-257, Summer 1991.
[RAH90] D. Robin, P. Abbas, and L. Hug. Neural response to auditory patterns. Journal of the Acoustical Society of America, 87(4):1673-1682, 1990.
[San91] I. W. Sandberg. Structure theorems for nonlinear systems. Multidimensional Systems and Signal Processing, 2:267-286, 1991.
[San92a] I. W. Sandberg. Approximately-finite memory and the theory of representations. International Journal of Electronics and Communications, 46(4):191-199, 1992.
[San92b] I. W. Sandberg. Multidimensional nonlinear systems and structure theorems. Journal of Circuits, Systems, and Computers, 2(4):383-388, 1992.
[SDN87] S. Dehaene, J.-P. Changeux, and J.-P. Nadal. Neural networks that learn temporal sequences by selection. Proc. National Academy of Sciences, USA, 84:2727-2731, May 1987.
[SG95a] B. Stiles and J. Ghosh. A habituation based mechanism for encoding temporal information in artificial neural networks. In SPIE Conf. on Applications and Science of Artificial Neural Networks, volume 2492, pages 404-415, Orlando, FL, April 1995.
[SG95b] B. Stiles and J. Ghosh. A habituation based neural network for spatio-temporal classification. In Proceedings of NNSP-95, Cambridge, MA, September 1995.
[She95] G. M. Shepherd. Neurobiology. Oxford University Press, New York, 1995.
[Sti94] B. W. Stiles. Dynamic neural networks for classification of oceanographic data. Master's thesis, University of Texas, Austin, Texas, 1994.
[SX95] I. W. Sandberg and L. Xu. Network approximation of input-output maps and functionals. In Proceedings of the 34th IEEE Conference on Decision and Control, December 1995.
[WA92] D. Wang and M. A. Arbib. Modeling the dishabituation hierarchy: the role of the primordial hippocampus. Biological Cybernetics, 67:535-544, 1992.
Figure 8: Illustration of Ramping Method. (The original figure sketches the desired outputs for the old pattern, the new pattern, and the other classes as they ramp between zero, the quiescent value, and one around the start of a new pattern.)
Figure 9: Effect of Ramping on Performance of HMLP on Data Set 3
Figure 10: Effect of Varying T on HMLP Performance
Figure 11: Effect of Varying A on HMLP Performance