ADVANCES IN IMAGING AND ELECTRON PHYSICS, VOL. 127
Prefiltering for Pattern Recognition Using Wavelet Transform and Neural Networks

FAN YANG AND MICHEL PAINDAVOINE

Laboratoire Electronique, Informatique et Image--CNRS FRE 2309, Aile des Sciences de l'Ingénieur, Université de Bourgogne, BP 47870-21078 Dijon Cedex, France
I. Introduction
II. Introduction to Neural Networks
   A. Perceptron and Adaline Networks
      1. The Perceptron
      2. The Adaline Network
      3. An Example of Pattern Recognition
   B. Linear Associative Memories
      1. Principle and Architecture of Linear Autoassociative Memories
      2. Linear Autoassociative Memories Training with Widrow-Hoff Learning Rule
      3. Linear Heteroassociative Memories
   C. Multilayer Neural Networks and Back-Propagation Learning Rule
      1. Architecture and Notation of Multilayer Feed-Forward Networks
      2. The Generalized Widrow-Hoff Rule
   D. Radial Basis Function Networks
      1. Exact Interpolation
      2. Radial Basis Function Networks
      3. RBF Neural Network Training
III. Introduction to Wavelet Transform
   A. Continuous and Discrete Wavelet Transform
      1. Definition of Continuous Wavelet Transform
      2. Discrete Wavelet Transform
      3. Wavelet Packet Transform
   B. Wavelet Functions
      1. Haar Wavelet Function
      2. Daubechies Wavelet Function and Other Wavelet Functions
      3. Wavelet Analysis: Parallelism and Applications
IV. Pattern Recognition Using Wavelet and Neural Network for Signal and Image Processing Applications
   A. Different Techniques of Preprocessing for Pattern Recognition
      1. Principal Component Analysis and Fourier Descriptors
      2. Wavelet Transform and Feature Extraction
      3. Wavelet Packet Transform (WPT)
      4. Feature Extraction Using Invariance Properties
   B. Audio, Underwater, and Eddy Current Signal Processing Applications
      1. Audio Signals Identification
      2. Underwater Signals Classification
      3. Neural Network Classifier for Eddy Current Signals
   C. Classification of Electroencephalogram and Myoelectric Signals Using Wavelet Transform and Neural Networks
      1. Sleep Stage Classification
      2. Brain Disorder Classification
      3. Classification of the Myoelectric Signal Using the Wavelet Packet Transform
   D. Shape Recognition Applications and Image Classification by Content
      1. Recognition and Classification of Blemishes
      2. A New Multiscale-Based Shape Recognition Method Applied to Images of Industrial Tools
      3. Wavelet Index of Texture for Artificial Neural Network Classification of Landsat Images
V. Wavelet Prefiltering Technique Applied to Handwriting Movement Analysis and Face Recognition
   A. Filtering Techniques, Time-Frequency Representations, and Generalized Canny Deriche Filter
      1. Filtering Techniques and Time-Frequency Representations
      2. Wavelet Transform and Generalized Canny Deriche Filter
   B. Wavelet Filtering Technique and Movement Analysis in Handwriting and Drawing
      1. Handwriting and Drawing Signals
      2. Filtering and Signal Segmentation Using Multiresolution Analysis
      3. Discussion
   C. An Image Filtering Technique Combining a Wavelet Transform with an Autoassociative Memory: Application to Face Recognition
      1. Linear Autoassociators and Eigenvalue Decomposition
      2. Preprocessing Using Multiscale Edges
      3. Pattern Completion of Noisy Patterns
Appendix
References

Copyright 2003, Elsevier (USA). All rights reserved. ISSN 1076-5670/03
I. INTRODUCTION
Pattern recognition, from an engineering standpoint, is the study of automatic or semiautomatic systems capable of recognizing patterns. It concerns several disciplines, such as statistics, operational research, computer science, biology, and electronics. This chapter limits itself to signal and image processing, where neural networks are the main tools in the realization of such a system. Neural networks are built from simple units interlinked by a set of weighted connections. Generally, these units are organized in layers. Each unit of the first layer (input layer) corresponds to a feature of a pattern that we want to analyze. The units of the last layer (output layer) produce a decision after the propagation of information.
In general, before feeding the data to neural networks, the signal must undergo preprocessing in order to (1) define the initial transformation used to represent the measured signal, (2) retain the features important for class discrimination and discard those that are irrelevant, and (3) reduce the volume of data to be processed, e.g., by data compression. This preprocessing stage can be realized using many techniques: principal component analysis (PCA), the Fourier transform, and other algorithms that allow the selection of the best parameters. These techniques may or may not use knowledge about the analyzed signal. Often, the best representation for nonstationary signals, such as seismic tremors, human speech, engine vibrations, medical images, financial data, and music, remains the wavelet transform.

Pattern recognition using the wavelet transform and neural networks is the theme discussed in this chapter. Sections II and III present, respectively, neural networks and the wavelet transform. These correspond to the theoretical part of the chapter. We describe basic concepts and give small examples to illustrate some basic models. These two sections provide only an introduction to neural networks and the wavelet transform, presented from the point of view of a computer scientist. In signal processing, as in other disciplines, experimentation and practice grounded in real data are the only efficient means of fully grasping the concepts.

Sections IV and V focus on applications of pattern recognition using the wavelet transform and neural networks. We first discuss audio, underwater, and eddy current signal identification and classification in Section IV. We then analyze medical signal processing applications concerning the classification of EEG and myoelectric signals; one of these applications is the classification of sleep stages. After that, we illustrate applications such as shape recognition and image classification by content. In Section V, the wavelet transform is used as a prefiltering technique for movement analysis in handwriting and for face recognition. Through this presentation, detailed in theory and complemented by many illustrated applications, it is hoped that the scientific community will find more interest in pattern recognition using the wavelet transform and neural networks.

II. INTRODUCTION TO NEURAL NETWORKS
This section illustrates some basic neural network models. First we introduce single layer neural networks, including some of the classical approaches to the neural computing and learning problem. Two classical models are described: the perceptron, proposed by Rosenblatt in the late 1950s, and the adaline, presented in the early 1960s by Widrow and Hoff. These models are
very easy to analyze and allow a straightforward introduction of important concepts, such as the Widrow-Hoff learning rule, generalization, and linear separability. We then give an example of pattern recognition using a single-layer neural network. Section II.B focuses on linear autoassociative memory and linear heteroassociative memory, the latter being a generalization of the perceptron that corresponds to multiple linear regression.

A single-layer linear network has severe restrictions because it can only solve linearly separable problems. Minsky and Papert (1969) showed that two-layer feed-forward networks can overcome many restrictions, but did not present a solution to the problem of how to adjust the weights from input to hidden units. This problem was probably a major cause of the lack of interest in neural networks at the end of the 1960s. In 1985, Parker presented an answer to this question, as did Rumelhart and colleagues in 1986. The main idea behind the solution is that the error for the units of the hidden layer is determined using back-propagation of the error of the output layer units. These multilayer neural networks and the back-propagation learning rule are presented in Section II.C. Another nonlinear neural network model, the radial basis function (RBF) network, is also described.

Note that the area of neural networks is very vast, and the presentation in this section is relatively general. We limit our description to basic models in order to favor the comprehension of the pattern recognition applications using the wavelet transform (WT) and neural networks given in Sections IV and V. Readers are invited to refer to Fausett (1994), Gurney (1997), Haykin (1999), Ripley (1996), Rojas (1996), Abdi et al. (1999), Bishop (2000), and Wermter and Sun (2000) for more complete information.
A. Perceptron and Adaline Networks

1. The Perceptron
Constructed by Rosenblatt (1959) in the 1950s, the perceptron can be considered the first neural network capable of learning. As its name indicates, the perceptron was intended as a model of perceptual activity. The aim of the perceptron was to associate input patterns with responses. The main idea behind its design was that an object is first registered by the cells of the retina (input neurons) and is then recognized by cells located in the brain (output neurons). Figure 1 shows a biological neuron and an artificial neuron of the perceptron.

a. Building Block Networks with Threshold Activation Functions. A single-layer feed-forward network consists of one or more output neurons j, each of which is connected with a weighting factor w_ij to all of the inputs i.
FIGURE 1. Schematic of a biological neuron (left) (cf. http://vv.carleton.ca/neil/neural/ neuron.html) and an artificial neuron (right).
FIGURE 2. A basic threshold unit computes its activation a as the weighted sum of its inputs. The output o(a) is equal to 0 when a < 0 and to 1 when a > 0.
In the simplest case, the network has only two inputs and a single output, as sketched in Figure 2. The input of the neuron is the weighted sum of the inputs, and the output of the network is formed by the activation of the output neuron, which is a function of the inputs:

$$a = \sum_{i=1}^{2} w_i x_i. \qquad (1)$$
The activation function can be linear, so that we have a linear network, or nonlinear. Here, we consider the threshold (or sign) function:

$$\text{response} = o(a) = \begin{cases} 1 & a > 0 \\ 0 & a \le 0 \end{cases} \qquad (2)$$
The threshold can be seen as a weight. In this case, thresholding is implemented by keeping one input neuron always active (the 0th input
neuron x_0 = 1) such that the weight w_0 is equal to -θ. This is equivalent to rewriting Eqs. (1) and (2) as

$$o(a) = \begin{cases} 1 & \text{if } a = \sum_{i=0}^{2} w_i x_i > 0 \\[4pt] 0 & \text{if } a = \sum_{i=0}^{2} w_i x_i \le 0. \end{cases} \qquad (3)$$
Such a simple device can now decide whether an input pattern belongs to one of two classes. The separation between the two classes in this case is a straight line given by the equation

$$w_1 x_1 + w_2 x_2 + w_0 = 0. \qquad (4)$$
The single-layer network represents a linear discriminant function. We now show how perceptrons can use a systematic procedure to find (learn) the values of the weights appropriate for implementing a specific association between inputs and output.

b. Perceptron Learning Rule. Neural network learning consists of an adjustment of the weights of the connections between units according to some modification rule. Virtually all learning rules can be considered as variants of the Hebbian learning rule (Hebb, 1949). The basic idea is that if two units, i and j, are active simultaneously, their interconnection must be strengthened. If j receives input from i, the simplest version of Hebbian learning prescribes a modification of the weight w_ij, with

$$\Delta w_{ij} = \eta a_i a_j, \qquad (5)$$
where η is a positive constant of proportionality representing the learning rate. The perceptron learning rule is very simple and can be stated as follows.

1. Start with random weights for the connections.
2. Select an input vector x from the set of training samples.
3. If o ≠ t(x) (the perceptron gives an incorrect response), modify all connections w_i according to Δw_i = t(x) x_i.
4. Go back to step 2.

2. The Adaline Network

a. Adaptive Linear Element. A model called Adaline was developed independently, in a signal processing framework, by Widrow and Hoff (1960) around the same time as the perceptron. In a simple physical implementation (see Figure 3), this device comprises a set of controllable resistors connected to a circuit that can sum up the currents caused by the input voltage signals.
FIGURE 3. Adaline neural network architecture.
Although the adaptive process here is exemplified for a case with only one output, it should be clear that a system with many parallel outputs is directly implementable by multiple units of the same kind. In the case of Adaline, the activation function is linear and the output signal is continuous:

$$o_j = \sum_{i=0}^{I} w_{ij} x_i, \qquad (6)$$
where o_j is the output signal of the jth unit, j = 1, ..., J; x_i is the input signal, i = 0, 1, ..., I; and w_ij are the input conductances. The purpose of this device is to yield a given value, o_j, at its output when the set of values, x_i, is applied at the input. The problem is to determine the coefficients w_ij in such a way that the input-output behavior is correct for a large number of arbitrarily chosen signal sets.
b. Widrow-Hoff Learning Rule. For the Adaline neural network, Widrow and Hoff (1960) introduced the delta rule in order to adjust the weights:

$$\Delta w_{ij} = \eta (t_j - o_j) x_i, \qquad (7)$$

where η is a constant of proportionality and t_j is the target (desired) response of the jth unit. Suppose we want to train the network such that a set of training samples consisting of the input value x is associated with a target output value t. For every given input sample, the output of the network differs from the target value t by t − o, where o is the actual output for this pattern. The delta rule uses an error function based on these differences in order to adjust the weights.
The error function is the summed square error (the Widrow-Hoff rule is also called the least mean square learning procedure). The total error E_total is defined as

$$E_{\text{total}} = \sum_{k=1}^{K} E_k, \qquad E_k = \frac{1}{2}\sum_{j=1}^{J} (t_j - o_j)^2, \qquad (8)$$

where E_k represents the error on pattern k. The least mean square procedure finds the values of all the weights that minimize the error function by a method called gradient descent. The idea is to make a change in the weight proportional to the negative of the derivative of the error, measured on the current pattern, with respect to each weight:

$$\Delta w_{ij} = -\eta \frac{\partial E_k}{\partial w_{ij}}. \qquad (9)$$
The derivative is

$$\frac{\partial E_k}{\partial w_{ij}} = \frac{\partial E_k}{\partial o_j}\,\frac{\partial o_j}{\partial w_{ij}}. \qquad (10)$$
Because of the linear units [Eq. (6)],

$$\frac{\partial o_j}{\partial w_{ij}} = x_i \qquad (11)$$
and

$$\frac{\partial E_k}{\partial o_j} = -(t_j - o_j), \qquad (12)$$
such that Δw_ij = η(t_j − o_j)x_i. Learning is achieved by adding, usually iteratively, a small quantity Δw_ij to the present weights. The Widrow-Hoff learning rule specifies the correction to apply at time t to the weight connecting the ith input unit to the jth output unit as

$$w_{ij}^{(t+1)} = w_{ij}^{(t)} + \eta (t_j - o_j) x_i, \qquad (13)$$

where t is the iteration number (i.e., this is iteration t), w_ij^(t) is the weight of the connection at iteration t (the initial value w_ij^(0) is generally chosen randomly), and η is a positive constant, generally between 0 and 1. It corresponds to the small quantity added to or subtracted from the connection weights. Eq. (13) is very general and can be extended easily to more complex neural network architectures.
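To make the update concrete, here is a minimal Python sketch of the delta rule of Eq. (13) for a single-layer linear network; the learning rate, the toy data, and the number of epochs are illustrative assumptions, not values from the text.

```python
import numpy as np

def widrow_hoff_train(X, T, eta=0.1, epochs=100):
    """Train a single-layer linear network with the delta rule, Eq. (13).

    X : (K, I) array of input patterns (a bias column of ones may be appended).
    T : (K, J) array of target responses.
    Returns the (I, J) weight matrix.
    """
    K, I = X.shape
    J = T.shape[1]
    rng = np.random.default_rng(0)
    W = rng.uniform(-1, 1, size=(I, J))            # random initial weights
    for _ in range(epochs):
        for k in range(K):                          # one pattern at a time
            o = X[k] @ W                            # linear output, Eq. (6)
            W += eta * np.outer(X[k], T[k] - o)     # delta rule, Eq. (13)
    return W

# Illustrative use on a toy two-input problem (assumed data).
X = np.array([[1., 0., 1.], [1., 1., 0.], [1., 1., 1.]])   # bias + 2 inputs
T = np.array([[0.], [1.], [1.]])
W = widrow_hoff_train(X, T)
```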
3. An Example of Pattern Recognition

We want to recognize the following patterns 0 to 9 (Fig. 4). The purpose is to associate each pattern with its corresponding class (one output unit per digit).
We initialize the matrix of weights with values chosen randomly (rows correspond to the 15 input units i = 0, ..., 14, where unit 0 is the bias; columns correspond to the 10 output units j = 0, ..., 9):

$$W = \begin{bmatrix}
3 & -65 & -38 & 7 & 90 & -66 & 41 & -55 & -1 & -75 \\
-84 & -22 & -45 & -27 & 97 & 7 & 53 & 29 & 54 & 56 \\
65 & -70 & 25 & -37 & -31 & 84 & 4 & -20 & 21 & 57 \\
87 & 74 & 74 & 35 & 52 & 16 & -22 & -29 & -60 & 66 \\
-17 & -7 & 96 & -75 & -58 & 92 & 48 & -18 & 56 & 52 \\
92 & -95 & -36 & 52 & -52 & 18 & -92 & 92 & -36 & -89 \\
-12 & 83 & 15 & -77 & 14 & -50 & -1 & -53 & -5 & -19 \\
75 & -15 & -28 & -24 & -92 & -68 & 5 & 40 & -81 & -20 \\
55 & -51 & -32 & -54 & -41 & -39 & 78 & -93 & 30 & 62 \\
35 & 47 & 88 & -54 & 68 & 94 & 56 & -14 & 35 & -56 \\
-69 & -44 & -73 & 73 & 50 & -59 & -72 & -41 & 61 & 96 \\
13 & 43 & -61 & 98 & -50 & -14 & 51 & 73 & 79 & -9 \\
-21 & -14 & -75 & -8 & -53 & 98 & 31 & 21 & -52 & 60 \\
58 & -85 & -5 & -70 & -51 & 89 & 23 & 98 & -5 & 56 \\
49 & -24 & -4 & 5 & -81 & 19 & -31 & -72 & 56 & 42
\end{bmatrix}$$
The number patterns 0, 1, ..., 9 are presented to the input units, and the responses of the output units are calculated with

$$o_j = f(a_j) = f\!\left(\sum_{i=0}^{I} w_{ij} x_i\right). \qquad (14)$$

For the pattern of number 0, we calculate o_0, o_1, and o_5:

a_0 = 3 - 84 + 65 + 87 + 92 - 12 + 55 + 35 + 13 - 21 + 58 + 49 = 340
o_0 = 1 because a_0 > 0
a_1 = -65 - 22 - 70 + 74 - 95 + 83 - 51 + 47 + 43 - 14 - 85 - 24 = -179
o_1 = 0 because a_1 ≤ 0
a_5 = -66 + 7 + 84 + 16 + 18 - 50 - 39 + 94 - 14 + 98 + 89 + 19 = 256
o_5 = 1 because a_5 > 0

In the same manner, we obtain the matrix of responses, O, as follows for patterns 0-9 (each column gives the responses o_0, ..., o_9 to one input pattern):
Pattern:  0  1  2  3  4  5  6  7  8  9
o_0:      1  1  1  1  1  1  1  1  1  1
o_1:      0  0  0  0  0  0  0  0  0  0
o_2:      0  1  0  0  0  0  0  0  0  0
o_3:      0  0  0  0  1  0  0  0  0  0
o_4:      0  0  0  0  0  0  0  1  0  0
o_5:      1  1  1  1  0  1  1  1  1  1
o_6:      1  1  1  1  1  1  1  1  1  1
o_7:      0  1  0  1  0  0  0  0  1  1
o_8:      1  1  1  1  0  1  1  1  1  1
o_9:      1  1  1  1  0  1  1  1  1  1
The total error, E, of incorrect classification is 54%. The Widrow-Hoff learning rule is applied to modify the weights:

$$w_{ij}^{(t+1)} = w_{ij}^{(t)} + \eta (t_j - o_j) x_i.$$

With η = 0.9, we obtain

$$w_{00}^{(1)} = w_{00}^{(0)} + 0.9(1 - 1) \times 1 = 3$$
$$w_{01}^{(1)} = w_{01}^{(0)} + 0.9(0 - 0) \times 1 = -65$$
$$w_{05}^{(1)} = w_{05}^{(0)} + 0.9(0 - 1) \times 1 = -66.9.$$

After 51 iterations, a new matrix of weights is obtained (see Figure 5), and the matrix of responses, O, is shown as follows. Figure 6 gives some results of pattern recognition for the number 8.
$$O = \begin{bmatrix}
1&0&0&0&0&0&0&0&0&0 \\
0&1&0&0&0&0&0&0&0&0 \\
0&0&1&0&0&0&0&0&0&0 \\
0&0&0&1&0&0&0&0&0&0 \\
0&0&0&0&1&0&0&0&0&0 \\
0&0&0&0&0&1&0&0&0&0 \\
0&0&0&0&0&0&1&0&0&0 \\
0&0&0&0&0&0&0&1&0&0 \\
0&0&0&0&0&0&0&0&1&0 \\
0&0&0&0&0&0&0&0&0&1
\end{bmatrix}$$

Each column again corresponds to one pattern (0-9); the responses now match the targets.
FIGURE 5. Convergence of the network.
FIGURE 6. Decision of the network with some samples of number pattern recognition: a perfect pattern of number 8 is categorized correctly (top), a nearly perfect pattern of number 8 is categorized correctly (middle), and an imperfect pattern of number 8 is not categorized correctly (bottom).
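The digit example can be reproduced in a few lines of code. The sketch below follows the same recipe (threshold outputs, identity targets, Widrow-Hoff updates with η = 0.9); the binary pixel patterns are random placeholders, since Figure 4 is not reproduced here.

```python
import numpy as np

def train_digit_classifier(X, eta=0.9, max_iter=100):
    """Threshold units trained with the Widrow-Hoff rule on 10 digit patterns.

    X : (10, 15) binary patterns (column 0 is the bias input x0 = 1).
    The target for pattern k is the kth row of the 10x10 identity matrix.
    """
    T = np.eye(10)
    rng = np.random.default_rng(0)
    W = rng.integers(-99, 100, size=(15, 10)).astype(float)   # random start
    for it in range(max_iter):
        O = (X @ W > 0).astype(float)           # threshold responses, Eq. (14)
        if np.array_equal(O, T):                # all patterns classified correctly
            break
        for k in range(10):                     # Widrow-Hoff update per pattern
            o_k = (X[k] @ W > 0).astype(float)
            W += eta * np.outer(X[k], T[k] - o_k)
    return W, it

# Illustrative call with random binary patterns (placeholders for Figure 4);
# convergence depends on the patterns actually being separable.
X = np.hstack([np.ones((10, 1)),
               np.random.default_rng(1).integers(0, 2, (10, 14))]).astype(float)
W, n_iter = train_digit_classifier(X)
```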
B. Linear Associative Memories

The class of models presented in this section are known as linear associators (Abdi et al., 1999; Valentin et al., 1994). They come in two forms: hetero- and autoassociators. The heteroassociator can be used to learn arbitrary associations between input and output patterns. The autoassociator is a special case of the heteroassociator in which the association between an input pattern and itself is learned.
1. Principle and Architecture of Linear Autoassociative Memories

Autoassociative memories are single-layer networks made of fully interconnected linear units (the output is a linear function of the activation).
The goal of an autoassociative memory is to store a set of stimuli and to retrieve the stored stimuli when cued with partial or degraded versions of these stimuli. In contrast to a classical computer, which uses the exact memory address to retrieve stored information, an autoassociative memory is content addressable and is able to retrieve a whole pattern of information given one part of this information. Because of this property, autoassociative memories are widely used in many pattern recognition applications and for modeling human perceptual learning and memory.

Linear autoassociative memories are built from fully interconnected linear units. A linear unit computes its activation as the weighted sum of its inputs. A linear transfer function transforms its activation into its output. If a unit has I inputs noted x_1, ..., x_i, ..., x_I, each of them connected with a weight denoted w_i, the activation of the unit will be

$$a = \sum_{i=1}^{I} w_i x_i. \qquad (15)$$

The output of the unit will be γa with γ a constant. Usually, for convenience, γ is set to 1. Using vector notation, if x is the input vector and w the weight vector, the output of the unit is

$$\text{response} = \mathbf{w}^T \mathbf{x}. \qquad (16)$$
Formally, an autoassociative memory is a network of I linear units fully interconnected by modifiable connections (see Figure 7). The set of connections is represented by a square symmetrical matrix W:

$$W = \begin{bmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,I} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,I} \\
\vdots & \vdots & \ddots & \vdots \\
w_{I,1} & w_{I,2} & \cdots & w_{I,I}
\end{bmatrix} \qquad (17)$$

FIGURE 7. An autoassociative memory composed of five fully interconnected units.
The weight w_{i,l} represents the strength of the connection between units i and l. Each of the K patterns stored in the memory is represented by an I-dimensional vector denoted x_k in which a given element, x_{i,k}, corresponds to a feature describing the pattern k. Each element in these vectors is used as input to a unit of the autoassociative memory. The set of K input patterns is represented by an I × K matrix X whose kth column is x_k.

2. Linear Autoassociative Memories Training with Widrow-Hoff Learning Rule

Training results from modification of the value of the connections following the presentation of a set of patterns. The Widrow-Hoff learning rule seen previously iteratively adjusts the weights of the connection matrix in order to minimize the difference between the output and the target response. Eq. (13) describing Widrow-Hoff learning can be rewritten using matrix notation to specify the correction to apply to the complete set of weights after one iteration (after presentation of a given pattern k) (Valentin et al., 1994):

$$W_{(t+1)} = W_{(t)} + \eta\,(\mathbf{t}_k - \mathbf{o}_k)\,\mathbf{x}_k^T, \qquad (18)$$

where t represents the iteration number, x_k is the I × 1 kth input vector, t_k is the J × 1 kth target vector, o_k is the J × 1 kth response vector, and W is the I × J matrix of connection weights. In the case of autoassociative memories, we have t_k = x_k and W is a square symmetrical matrix,

$$W^T = W. \qquad (19)$$

Therefore,

$$W_{(t+1)} = W_{(t)} + \eta\,(\mathbf{x}_k - \mathbf{o}_k)\,\mathbf{x}_k^T = W_{(t)} + \eta\,(\mathbf{x}_k - W_{(t)}\mathbf{x}_k)\,\mathbf{x}_k^T, \qquad (20)$$

where k is a random integer (1 ≤ k ≤ K). This algorithm can be expressed in a more practical way by using the I × K stimulus matrix X. In this case, the difference x_k − o_k, k = 1, 2, ..., K, is computed for the complete set of patterns before implementing the correction:

$$W_{(t+1)} = W_{(t)} + \eta\,(X - W_{(t)}X)\,X^T. \qquad (21)$$
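A direct transcription of the matrix update of Eq. (21) might look as follows; the stored patterns, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def train_autoassociator(X, eta=0.01, iterations=200):
    """Widrow-Hoff training of a linear autoassociative memory, Eq. (21).

    X : (I, K) matrix whose columns are the K stored patterns.
    Returns the I x I connection matrix W.
    """
    I = X.shape[0]
    W = np.zeros((I, I))
    for _ in range(iterations):
        W = W + eta * (X - W @ X) @ X.T        # Eq. (21)
    return W

# Illustrative use: store three random patterns, then cue with a noisy version.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
W = train_autoassociator(X)
noisy = X[:, 0] + 0.1 * rng.standard_normal(20)
recalled = W @ noisy                            # content-addressable recall
```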
3. Linear Heteroassociative Memories

We have seen that the goal of autoassociative memories is to associate an input pattern to itself such that when part of the input is presented as a
memory key, the memory gives back the complete pattern. Here, we introduce a more general family of models called heteroassociative memories. They differ from autoassociative memories in that an input pattern is associated to an output pattern instead of being associated to itself. The goal of a heteroassociative memory is to learn the mapping between input-output pairs such that the memory produces the appropriate output in response to a given pattern. Heteroassociative memories are generally used to solve pattern identification and categorization problems. We rewrite the Widrow-Hoff learning rule for heteroassociative memories using matrix notation [see Eq. (21)]:

$$W_{(t+1)} = W_{(t)} + \eta\,(T - W_{(t)}X)\,X^T, \qquad (22)$$

where T is the J × K matrix of target responses.
C. Multilayer Neural Networks and Back-Propagation Learning Rule

The basic models of single-layer linear networks presented previously have severe restrictions because they can only solve linearly separable problems. Minsky and Papert (1969) showed that two-layer feed-forward networks can overcome many restrictions but did not present a solution to the problem of how to adjust the weights from input to hidden units. An answer to this question was presented by Rumelhart and colleagues in 1986. The main idea behind this solution is that the error for the hidden layer units is determined by back-propagation of the error of the output layer units. Back-propagation can be considered as a generalization of the Widrow-Hoff rule for nonlinear activation functions and multilayer networks. It adjusts the weights in any feed-forward network trained to associate pairs of patterns.
1. Architecture and Notation of Multilayer Feed-Forward Networks

Feed-forward networks are made of several layers of nonlinear units. Each unit in a layer is connected to all the units of the preceding and following layers: units receive their input from the layer directly below and send their output to units in the layer directly above. There are no connections within a layer. The inputs are fed into the first hidden layer. The output of the hidden units is distributed over the next hidden layer, up until the last hidden layer, where the outputs are fed into a layer of output units (see Figure 8).

Each nonlinear unit computes its activation by adding all the weighted inputs it receives. It transforms this activation into a response using a
FIGURE 8. A multilayer network with several layers of nonlinear units: each layer may have a different number of units and a different transfer function.
FIGURE 9. Sigmoid (logistic) transfer function and its derivative: this function maps the set of real numbers into the [0, 1] domain.
nonlinear transfer function. Several transfer functions can be used. The most popular one is the sigmoid (logistic) function (see Figure 9):

$$f(a) = \frac{1}{1 + e^{-a}}. \qquad (23)$$
This function has the interesting property of having an easy-to-compute derivative:
$$f'(a) = f(a)\,[1 - f(a)]. \qquad (24)$$
Although back-propagation can be applied to networks with any number of layers, it has been shown that one layer of hidden units suffices to approximate any function. Therefore, in most applications, a feed-forward network with a single layer of hidden units is used, with a sigmoid activation function for the units.

2. The Generalized Widrow-Hoff Rule
We generalize the Widrow-Hoff (delta) learning rule to the set of nonlinear activation functions. The response of the jth output unit is a differentiable function of the total input x_i (i = 0, 1, 2, ..., I), given by

$$o_j = f(a_j), \quad j = 1, 2, \ldots, J, \qquad a_j = \sum_{i=0}^{I} w_{ij} x_i. \qquad (25)$$
To get the correct generalization of the Widrow-Hoff rule as presented in the previous sections, we must set

$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}. \qquad (26)$$
The error measure E is defined as the total quadratic error for the current pattern at the output units,
$$E = \frac{1}{2}\sum_{j=1}^{J} (t_j - o_j)^2, \qquad (27)$$
where t_j is the target response of the jth unit. We can write

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial a_j}\,\frac{\partial a_j}{\partial w_{ij}}. \qquad (28)$$
By Eq. (25) we see that the second factor is

$$\frac{\partial a_j}{\partial w_{ij}} = x_i. \qquad (29)$$
When we define

$$\delta_j = -\frac{\partial E}{\partial a_j}, \qquad (30)$$
we get an update rule, equivalent to the Widrow-Hoff rule, that performs a gradient descent on the error surface if we make the weight changes according to

$$\Delta w_{ij} = \eta\,\delta_j\,x_i. \qquad (31)$$
To compute δ_j we apply the chain rule to write this partial derivative as the product of two factors,
$$\delta_j = -\frac{\partial E}{\partial a_j} = -\frac{\partial E}{\partial o_j}\,\frac{\partial o_j}{\partial a_j}. \qquad (32)$$
Let us compute the second factor. By Eq. (25) we see that
$$\frac{\partial o_j}{\partial a_j} = f'(a_j), \qquad (33)$$
which is simply the derivative of the function f for the jth unit. In order to compute the first factor of Eq. (32), we consider two cases. First, assume that unit j is an output unit of the network. In this case, it follows from the definition of E that

$$\frac{\partial E}{\partial o_j} = -(t_j - o_j). \qquad (34)$$
Substituting this and Eq. (33) in Eq. (32), we get
$$\delta_j = (t_j - o_j)\,f'(a_j) \quad \forall j. \qquad (35)$$
Second, if unit j is not an output unit (it is a hidden unit), we do not readily know the contribution of the unit to the output error of the network. However, the error measure can be written as a function of the inputs from hidden to output layer, and we use the chain rule to write
$$\frac{\partial E}{\partial h_l} = \sum_{j=1}^{J}\frac{\partial E}{\partial a_j}\,\frac{\partial a_j}{\partial h_l} = \sum_{j=1}^{J}\frac{\partial E}{\partial a_j}\,z_{lj} = -\sum_{j=1}^{J}\delta_j\,z_{lj}, \qquad (36)$$
where h_l is the response of the lth hidden unit of the network, l = 1, 2, ..., L; a_j is the activation of the jth output unit of the network, j = 1, 2, ..., J; and z_lj is the weight between the lth hidden unit and the jth output unit. Substituting this in Eq. (32) yields

$$\delta_l = f'(a_l)\sum_{j=1}^{J}\delta_j\,z_{lj}. \qquad (37)$$
Eqs. (35) and (37) give a recursive procedure for computing the δ for all units in the network, which are then used to compute the weight changes according to Eq. (31). This procedure constitutes the generalized Widrow-Hoff rule for a feed-forward network of nonlinear units.
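For concreteness, a minimal sketch of one step of this procedure for a network with a single hidden layer of sigmoid units is given below; the layer sizes, learning rate, and bias handling (appended to the input) are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))            # Eq. (23)

def backprop_step(x, t, Wih, Who, eta=0.5):
    """One generalized Widrow-Hoff (back-propagation) step, Eqs. (31), (35), (37).

    x : input vector (with bias already appended), t : target vector.
    Wih : input-to-hidden weights, Who : hidden-to-output weights.
    """
    # Forward pass
    a_h = x @ Wih
    h = sigmoid(a_h)                            # hidden responses
    a_o = h @ Who
    o = sigmoid(a_o)                            # output responses
    # Deltas
    delta_o = (t - o) * o * (1 - o)             # Eq. (35) with f'(a) = f(a)(1 - f(a))
    delta_h = (delta_o @ Who.T) * h * (1 - h)   # Eq. (37)
    # Weight changes, Eq. (31)
    Who += eta * np.outer(h, delta_o)
    Wih += eta * np.outer(x, delta_h)
    return Wih, Who
```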
D. Radial Basis Function Networks

The network models discussed in the previous sections are based on units that compute a function of the scalar product of the input vector and a weight vector. This section considers the other major class of neural network models, in which the activation of a hidden unit is determined by the distance between the input vector and a prototype vector.
1. Exact Interpolation

Radial basis function methods have their origins in techniques for performing the exact interpolation of a set of points in a multidimensional space (Powell, 1987). The exact interpolation problem requires that every input vector is mapped exactly onto the corresponding target vector. Consider a mapping from a d-dimensional input space x to a one-dimensional target output space t. The data set consists of K input vectors, x_k, together with corresponding targets t_k. The goal is to find a function f(x) such that

$$f(\mathbf{x}_k) = t_k, \quad k = 1, 2, \ldots, K. \qquad (38)$$
The radial basis function approach introduces a set of K basis functions, one for each data point, which take the form φ(|x − x_k|), where φ is some nonlinear function. Thus the kth such function depends on the distance |x − x_k|, usually taken to be Euclidean, between x and x_k. The output of the mapping is then taken to be a linear combination of the basis functions,

$$f(\mathbf{x}) = \sum_{k=1}^{K} w_k\,\phi(|\mathbf{x} - \mathbf{x}_k|), \qquad (39)$$

where x_k is the center of the basis function.
The interpolation conditions [Eq. (38)] can then be written in matrix form as

$$\Phi\,\mathbf{w} = \mathbf{t}. \qquad (40)$$

Provided the inverse matrix Φ⁻¹ exists, we can solve Eq. (40) to give

$$\mathbf{w} = \Phi^{-1}\,\mathbf{t}. \qquad (41)$$
When the weights in Eq. (39) are set to the values given by Eq. (41), the function f(x) represents a continuous differentiable surface that passes through each data point. Several forms of basis function have been considered, the most common being the Gaussian

$$\phi(x) = \exp\left(-\frac{x^2}{2\sigma^2}\right), \qquad (42)$$

where σ is a parameter whose value controls the smoothness properties of the interpolating function. The generalization to several output variables is straightforward: each input vector x_k must be mapped exactly onto an output vector t_k.
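A small numerical sketch of this exact interpolation scheme, Eqs. (38)-(42), is given below; the sample points and the width σ are illustrative assumptions.

```python
import numpy as np

def gaussian(r, sigma):
    return np.exp(-r**2 / (2.0 * sigma**2))     # Eq. (42)

def exact_interpolation(Xc, t, sigma=1.0):
    """Solve Phi w = t (Eq. 40) with one Gaussian basis per data point."""
    # Pairwise Euclidean distances between the K data points
    d = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=-1)
    Phi = gaussian(d, sigma)
    return np.linalg.solve(Phi, t)              # Eq. (41)

def interpolant(x, Xc, w, sigma=1.0):
    """Evaluate f(x) of Eq. (39) at a new point x."""
    d = np.linalg.norm(Xc - x, axis=-1)
    return gaussian(d, sigma) @ w

# Illustrative use: interpolate a 1D function through five samples.
Xc = np.linspace(0, 4, 5).reshape(-1, 1)
t = np.sin(Xc).ravel()
w = exact_interpolation(Xc, t)
y = interpolant(np.array([1.5]), Xc, w)
```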
2. Radial Basis Function Networks

The exact interpolating function for noisy data is typically a highly oscillatory function. In this case, the interpolating function that gives the best generalization is one that is typically much smoother and averages over the noise in the data. An additional limitation of the exact interpolation procedure is that the number of basis functions is equal to the number of patterns in the data set, and so for large data sets the mapping function can become very costly to evaluate.

The architecture of a typical radial basis function network (Broomhead and Lowe, 1988; Moody and Darken, 1989) is shown in Figure 10. The modifications required to the exact interpolation procedure are as follows.

1. The number M of basis functions need not equal the number K of data points and is typically much less than K.
2. The centers of the basis functions are not constrained to be given by input data vectors. Instead, the determination of the centers becomes part of the training process.

In brief, the main idea of radial basis function networks is to replace the input vectors by a nonlinear transformation such that each input is represented by a vector coding its similarity to several centers, and then to use a standard linear heteroassociative memory.

RBF neural networks can be used in pattern recognition applications as kernel classifiers. They use overlapping simple functions in order to cover
FIGURE 10. Architecture of a typical radial basis function network. The hidden layer computes the distance from the input to each of the centers (each center corresponds to a unit of the hidden layer). Units of the hidden layer transform their activation (i.e., the distance from the input) into a response using a nonlinear transformation (typically a Gaussian function). Units of the output layer behave like a standard linear heteroassociative memory.

complex decision regions that separate different categories of patterns (Musavi et al., 1992). The main advantages of these RBF classifiers are computational simplicity and robust generalization.

3. RBF Neural Network Training
Radial basis function network training proceeds in two phases. In the first phase, only the input data set x_k is used to determine the parameters of the basis functions; the number of centers, the center positions, and the parameters σ are determined by unsupervised training techniques (i.e., methods that use only input data and not target data). In fact, the choice of centers and σ determines in part the quality of the estimation given by the network. The set of centers is sometimes learned using techniques such as K-means (Moody and Darken, 1989; Musavi et al., 1992). The σ of the φ functions can be approximated similarly from the samples. Here we present an algorithm inspired by a clustering method proposed by Musavi et al. (1992). Initially, we have K training (input) points in a d-dimensional space and each training point corresponds to a cluster (c_k = x_k).

1. Take any point, c_k, and its associated parameter, σ_k (initially, σ_k = 0).
2. Find the nearest point, c_l, of the same category by using the Euclidean distance.
3. Compute the mean of these two points. We obtain a new point with its associated parameter σ = ‖c_k − c_l‖ + σ_k.
4. Compute the distance, D, from the new mean to the nearest point of all other categories.
5. If D > λσ, then accept the merge of c_k and c_l and start again from step 2. If the condition is not satisfied, reject the merge, recover the two original points and their parameters σ, and then restart from step 1.
6. Repeat steps 1 to 5 until all points of each category are used.

λ is the "clustering parameter," with 1 ≤ λ ≤ 3 (Musavi et al., 1992). Finally, we obtain the centers, c_m, and their parameters, σ_m (m = 1, ..., M ≤ K), of the hidden neurons of the RBF neural network. We can see (Figure 11) the result of using this clustering algorithm for two categories in a two-dimensional space. Note that we obtain two nonlinear decision regions. In fact, RBF can map spaces of any shape (e.g., nonlinear, convex, disjoint spaces).

The basis functions are kept fixed while the final-layer weights are found in the second phase of training, which requires the solution of a linear problem and is very fast. The procedures for training radial basis function networks can therefore be substantially faster than the back-propagation algorithm: because the parameters governing the basis functions are determined using relatively quick, unsupervised methods, the final-layer weights are subsequently found by fast linear supervised methods (see Section V.C.1).
FIGURE 11. Illustration of RBF neural network training: determination of hidden neurons (two categories, A and B).
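The two-phase training just described can be sketched as follows. The clustering step here is a simple k-means-style stand-in for the Musavi et al. procedure (not that exact algorithm), and the data, number of centers, and width heuristic are illustrative assumptions; the second phase is the fast linear least-squares fit of the output weights.

```python
import numpy as np

def rbf_features(X, centers, sigmas):
    """Gaussian hidden-layer responses for each input row of X."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return np.exp(-d**2 / (2.0 * sigmas[None, :]**2))

def train_rbf(X, T, n_centers=4, iters=20, seed=0):
    """Phase 1: unsupervised centers/widths; phase 2: linear output weights."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_centers, replace=False)]
    for _ in range(iters):                       # simple k-means updates
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        for m in range(n_centers):
            if np.any(labels == m):
                centers[m] = X[labels == m].mean(axis=0)
    # Width of each basis: mean distance of its members to the center (heuristic)
    sigmas = np.array([np.linalg.norm(X[labels == m] - centers[m], axis=-1).mean()
                       if np.any(labels == m) else 1.0 for m in range(n_centers)])
    sigmas = np.maximum(sigmas, 1e-3)
    H = rbf_features(X, centers, sigmas)
    W, *_ = np.linalg.lstsq(H, T, rcond=None)    # fast linear supervised phase
    return centers, sigmas, W

def predict_rbf(X, centers, sigmas, W):
    return rbf_features(X, centers, sigmas) @ W

# Illustrative two-class problem in 2D (assumed data).
rng = np.random.default_rng(1)
Xa = rng.normal([0, 0], 0.5, (20, 2)); Xb = rng.normal([2, 2], 0.5, (20, 2))
X = np.vstack([Xa, Xb])
T = np.vstack([np.tile([1., 0.], (20, 1)), np.tile([0., 1.], (20, 1))])
centers, sigmas, W = train_rbf(X, T)
pred = predict_rbf(X, centers, sigmas, W).argmax(axis=1)
```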
III. INTRODUCTION TO WAVELET TRANSFORM
Traditionally, Fourier transforms are utilized for performing signal representation and analysis. Although it is straightforward to reconstruct a signal from its Fourier transform, no local description of the signal is included in its Fourier representation. Therefore, Fourier transforms are powerful tools for analyzing the components of a stationary signal (a signal that repeats itself) but are less useful for analyzing nonstationary signals. The short-time Fourier transform (STFT) and, as a special case, the Gabor (1946) transform have been introduced for local signal description; because the signal is analyzed after filtering by a fixed window function, these transforms have the localization property that traditional Fourier transforms lack. However, because the window envelope is the same for all frequencies, they do not provide enough time detail at high frequencies. The representation best adapted to nonstationary signals is the wavelet transform, which gives better localization than the traditional Fourier transform and a better division of the time-frequency (or space-frequency) plane than the STFT.

The first known example of a wavelet function is the Haar wavelet, and the wavelet transform was introduced by Grossmann and Morlet (1984) in order to study seismic reflection signals in geophysics applications. Many wavelet functions were constructed in the 1980s, e.g., Meyer (1993), Lemarie and Meyer (1986), and Daubechies (1989). Mallat and Meyer developed multiresolution analysis in 1989, which is widely used for signal and image processing.

This section first presents briefly the wavelet theory: wavelet transform, scaling function, discrete wavelet transform (DWT), pyramidal algorithm, filter bank, and multiresolution analysis. Then, we expose in detail the Haar wavelet transform in order to illustrate basic concepts of wavelet analysis. We also describe some other frequently used wavelet functions: Daubechies, Mexican hat, Morlet, Meyer, and Coiflet. This introduction to wavelet transforms ends with wavelet analysis applications.
A. Continuous and Discrete Wavelet Transform

1. Definition of Continuous Wavelet Transform

The wavelet family can be obtained by dilating and shifting a basic function ψ, the "mother wavelet":

$$\psi_{a,b}(x) = a^{-1/2}\,\psi\!\left(\frac{x - b}{a}\right), \qquad (43)$$
where a, b ∈ ℝ and a > 0. If a = 1, then ψ_{a,b} represents versions of ψ shifted by b. If b = 0, then ψ_{a,b} represents dilated versions of ψ with different frequencies controlled by the parameter a. The scalar a^{-1/2} also determines the amplitude of the wavelet functions. Larger values are assigned to higher frequency functions (i.e., with a < 1), and smaller values are assigned to lower frequency functions (i.e., with a > 1). The continuous wavelet transform, W_{a,b}f, of a function f is defined by the scalar product:

$$W_{a,b}f = \langle f, \psi_{a,b}\rangle = a^{-1/2}\int_{-\infty}^{+\infty} f(x)\,\psi\!\left(\frac{x - b}{a}\right)dx. \qquad (44)$$
The admissibility condition, denoted C_ψ, ensures that the inverse wavelet transform is applicable (Daubechies, 1989). This condition is satisfied as

$$\int_{-\infty}^{+\infty}\psi(x)\,dx = 0. \qquad (45)$$
Figure 12 shows that the wavelet transform corresponds to an enlargement of the signal f(x) on the part centered in b with a dilation factor a
FIGURE 12. Wavelet transform illustration.
• If f(x) is constant around time b, both the integral of the product between f and ψ [Eq. (44)] and the wavelet coefficient W_{a,b} (case b = b_1) are very small.
• If f(x) presents some variations at time b that are in phase with ψ(x), both the integral of the product between f and ψ and the wavelet coefficient (case b = b_2) become important.

Although the parameters a and b can be chosen with any real values, for ease of computation one usually sets a = 2^{-j} and b = ka, with j, k ∈ ℤ. The simplest type of wavelet basis is then obtained as

$$\psi_{j,k}(x) = 2^{j/2}\,\psi(2^{j}x - k). \qquad (46)$$
This choice of parameters a,b naturally connects multiresolution analysis in signal processing with the world of wavelets, and sampling in the a-b plane introduces the discrete wavelet transform, which can be performed using a fast, pyramidal algorithm related to multiresolution analysis (Mallat, 1989; Strang and Nguyen, 1997).
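A direct numerical evaluation of Eq. (44) on a sampled signal can be sketched as follows; the Mexican hat mother wavelet, the sampling step, and the test signal are illustrative choices.

```python
import numpy as np

def mexican_hat(x):
    """Mexican hat wavelet (second derivative of a Gaussian), one possible mother wavelet."""
    return (1.0 - x**2) * np.exp(-x**2 / 2.0)

def cwt_coefficient(f, dx, a, b, psi=mexican_hat):
    """Discretized Eq. (44): W_{a,b} f = a^(-1/2) * integral of f(x) psi((x - b)/a) dx."""
    x = np.arange(len(f)) * dx
    return (a ** -0.5) * np.sum(f * psi((x - b) / a)) * dx

# Illustrative use: a chirp-like signal analyzed at a fine and a coarse scale.
dx = 0.01
x = np.arange(0.0, 10.0, dx)
f = np.sin(2.0 * np.pi * x**2 / 5.0)
w_fine = cwt_coefficient(f, dx, a=0.25, b=5.0)
w_coarse = cwt_coefficient(f, dx, a=2.0, b=5.0)
```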
2. Discrete Wavelet Transform

a. Orthogonal Wavelets. The orthogonal wavelet function (mother wavelet) is orthogonal to all functions obtained by shifting it right or left by an integer amount. Furthermore, the mother wavelet is orthogonal to all functions obtained by dilating it by a factor of 2^j and shifting by a multiple of 2^j units (Daubechies, 1989). The orthogonality property means that the inner product of the mother wavelet with itself is unity, and the inner product between the mother wavelet and the shifts and dilates of the mother is zero. The collection of these shifted and dilated wavelet functions is called a wavelet basis, which refers only to an orthogonal set of functions, whereas the term wavelet function is used generically to refer to either orthogonal or nonorthogonal wavelets. A biorthogonal wavelet system consists of two sets of wavelets generated by a mother wavelet ψ and a dual wavelet ψ̃ (see examples of biorthogonal wavelets implemented in the Matlab toolbox).

The use of an orthogonal basis implies the use of the DWT, which has a very important mathematical and engineering consequence: any continuous function may be uniquely projected onto the wavelet basis functions and expressed as a linear combination of the basis functions. Decomposition of a function in terms of orthogonal basis functions is, in fact, old news. What is new and exciting about the wavelet decomposition methodology is that wavelet basis functions have what is called compact support. This means that the basis functions are nonzero only on a finite interval, thus
allowing the DWT to efficiently represent signals that have localized features.
b. Scaling Function, Filter Bank, and Multiresolution Analysis. The DWT employs two sets of functions: the scaling functions, φ, and the wavelet functions, ψ. Often, prior to the construction of ψ, one constructs φ such that the functions φ(x − n), n ∈ ℤ, constitute an orthonormal system [see Daubechies (1989) for properties of orthonormal wavelet functions]. Two-scale relations, φ(2x − n), are the fundamental components with which to construct the scaling function φ and the wavelet function ψ. In fact, the different properties cause the functions φ and ψ to be linked via the following relations (Daubechies, 1989):

$$\phi(x) = \sqrt{2}\sum_{n=-\infty}^{+\infty} h_n\,\phi(2x - n), \qquad \psi(x) = \sqrt{2}\sum_{n=-\infty}^{+\infty} g_n\,\phi(2x - n), \qquad (47)$$
where the numbers h_n are called the filter coefficients of the function φ, and g_n are called the filter coefficients of the function ψ. In digital signal processing terms, the scaling function is a low-pass filter that suppresses the high-frequency components of a signal and passes the low-frequency components (the scaling function is also often called the smoothing function). The wavelet function is a high-pass filter that passes the high-frequency components while the low-frequency components are suppressed. Thus, the DWT analysis can be performed using a fast dyadic pyramidal algorithm related to filter banks. Mallat (1989) showed that the computation of the wavelet representation can be accomplished by successive convolutions of the time-domain signal with quadrature mirror filters. The original signal x[n] is first passed through a half-band low-pass filter h[n] and a high-pass filter g[n]. The analysis simplifies if the length (dimension) N of the signal is a power of two. After filtering, the signal is decomposed into a coarse approximation (y_low) and detailed information (y_high):

$$y_{\text{low}}[k] = \sum_{n} x[n]\,h[2k - n], \qquad y_{\text{high}}[k] = \sum_{n} x[n]\,g[2k - n], \qquad (48)$$
where y_low and y_high, of length N/2, are the outputs of the low-pass and high-pass filters, respectively, after subsampling by 2. This decomposition halves the
time resolution because the entire signal is now characterized by half the number of samples. However, it doubles the frequency resolution, as the frequency band of the signal now spans only half the previous frequency band. The result, Yhigh, constitutes wavelet coefficients at the scale one. The Ytow is then further decomposed using the same wavelet decomposition step. This procedure is repeated for further decomposition. At every scale, filtering and subsampling result in half the time resolution and double the frequency resolution. Each scale of decomposition analyzes the signal at different frequency ranges and at different resolutions, hence the name multiresolution analysis.
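One level of the decomposition of Eq. (48) can be coded directly as filtering followed by downsampling by two; the sketch below uses the Haar filter coefficients purely as an example and, for simplicity, no boundary extension (an assumption, since the text does not specify one).

```python
import numpy as np

def dwt_level(x, h, g):
    """One analysis step of Eq. (48): y[k] = sum_n x[n] * filter[2k - n]."""
    N, L = len(x), len(h)
    y_low = np.zeros(N // 2)
    y_high = np.zeros(N // 2)
    for k in range(N // 2):
        for n in range(N):
            m = 2 * k - n
            if 0 <= m < L:          # terms outside the filter support are zero
                y_low[k] += x[n] * h[m]
                y_high[k] += x[n] * g[m]
    return y_low, y_high

# Haar filter coefficients (example choice): h0 = h1 = 1/sqrt(2), g0 = -g1 = 1/sqrt(2).
h = np.array([1.0, 1.0]) / np.sqrt(2.0)
g = np.array([1.0, -1.0]) / np.sqrt(2.0)

x = np.sin(np.linspace(0.0, 4.0 * np.pi, 64))
approx, detail = dwt_level(x, h, g)          # coarse approximation and scale-1 details
approx2, detail2 = dwt_level(approx, h, g)   # iterate on the approximation (multiresolution)
```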
c. Image Processing by Wavelet Transform. The discrete wavelet transform can be easily developed for images, which are two-dimensional signals and thus can be treated as a two-dimensional array. Because every row (column) of the array corresponds to a one-dimensional signal, one can use the previously described method to process it and, hence, an image can be processed in two directions sequentially. Figure 13 shows an interpretation of the process given by Wu et al. (2000). Figure 14 illustrates the wavelet decomposition of an image.
FIGURE 13. Block diagram of the multiresolution wavelet decomposition of an image (right) and flow diagram of the filter bank used to decompose an image (left): L denotes a low-pass filtering operation and H denotes a high-pass filtering operation. The notation L-row (L-col) means that the L operation is applied to the rows (columns). The output from the L-row and L-col operations is the A sub-image. The H, V, and D sub-images, which represent the detail information of the decomposition, are similarly constructed. A, H, V, and D denote average, horizontal details, vertical details, and diagonal details, respectively.
FIGURE 14. Wavelet decomposition of an image: original image (left) and wavelet decomposition at scale 1 using the Daubechies-4 function (right). From the upper left corner of the image we have: low-pass decomposition, horizontal details, vertical details, and diagonal details.
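In practice the row-then-column procedure of Figure 13 is available in standard libraries; a minimal sketch assuming the PyWavelets package (an external dependency, not something used by the authors) is shown below. Note that PyWavelets names the four-coefficient Daubechies-4 filter 'db2'.

```python
import numpy as np
import pywt

# Illustrative image: a dark background with a bright square (stand-in for Figure 14).
image = np.zeros((64, 64))
image[16:48, 16:48] = 1.0

# One decomposition level: cA = approximation (A),
# cH / cV / cD = horizontal, vertical, diagonal details.
cA, (cH, cV, cD) = pywt.dwt2(image, 'db2')

# The inverse transform reconstructs the original image.
reconstructed = pywt.idwt2((cA, (cH, cV, cD)), 'db2')
```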
3. Wavelet Packet Transform

The wavelet packet transform (WPT), first introduced by Coifman et al. (1992), is a generalization of the dyadic wavelet transform. Wavelet packet functions offer more flexibility than wavelets in representing different types of signals. Dyadic wavelet decomposition is achieved by iterating the filter bank operation over the low-frequency band of each scale. Wavelet packet decomposition, however, is achieved when the filter bank operation is iterated over all frequency bands at each scale. The WPT therefore offers a rich set of decomposition structures. The WPT is often associated with a best basis selection algorithm, which decides on a decomposition structure among the library of possible bases by measuring a data-dependent cost function. The best basis algorithm automatically adapts the transform to best match the characteristics of the analyzed signal. More details concerning the WPT are given in Section IV (see Section IV.A.3 for the wavelet packet tree and Section IV.B.1 for an example of a WPT application).
B. Wavelet Functions

One criticism of wavelet analysis is the arbitrary choice of the wavelet function, one of the important decisions on the user's part. It is similar to the choice of instruments of observation. In choosing the wavelet function, there are several factors to consider (Chui, 1992; Farge, 1992): orthogonality, complex
or real function, width, and shape of the wavelet function. We describe the Haar wavelet function in order to illustrate some fundamental concepts and the multiresolution theory introduced previously.

1. Haar Wavelet Function
a. Definition. Figure 15 displays the Haar scaling and Haar wavelet functions. The scaling function is defined by

$$\phi(x) = \begin{cases} 1 & x \in (0, 1) \\ 0 & x \notin (0, 1) \end{cases} \qquad (49)$$

and the Haar wavelet function is defined by

$$\psi(x) = \begin{cases} 1 & x \in (0, \tfrac{1}{2}) \\ -1 & x \in (\tfrac{1}{2}, 1) \\ 0 & x \notin (0, 1). \end{cases} \qquad (50)$$
Functions φ and ψ of the Haar wavelet family can thus be constructed as

$$\phi(x) = \phi(2x) + \phi(2x - 1), \qquad \psi(x) = \phi(2x) - \phi(2x - 1). \qquad (51)$$

From these relations, the filter coefficients of the scaling and wavelet functions can be found as h_0 = h_1 = 1/√2 and g_0 = −g_1 = 1/√2.

Each step in a Haar transform calculates a set of wavelet coefficients and a set of averages. If a signal, s = s_0, s_1, ..., s_{N−1}, contains N elements, there will be N/2 average and N/2 coefficient values. The averages are stored in the lower half of the N-element array, and the coefficients are stored in the upper half. The averages become the input for the next iteration of the wavelet calculation, where for each iteration i + 1, N_{i+1} = N_i/2. The
FIGURE 15. Haar scaling function (left) and Haar wavelet function (right).
recursive iterations continue until a single average and a single coefficient are calculated. This replaces the original signal of N elements with an average, followed by sets of wavelet coefficients whose sizes are increasing powers of two (e.g., 2^0, 2^1, 2^2, ..., N/2). The Haar wavelet transform steps that calculate an average a_i and a wavelet coefficient c_i are

$$a_i = \frac{s_i + s_{i+1}}{\sqrt{2}}, \qquad c_i = \frac{s_i - s_{i+1}}{\sqrt{2}}. \qquad (52)$$
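A complete multi-level Haar transform following Eq. (52) and the storage scheme just described can be sketched as follows; the input signal is an illustrative example.

```python
import numpy as np

def haar_transform(signal):
    """Haar DWT of Eq. (52); len(signal) must be a power of two.

    At each pass the first half of the working segment receives the averages
    a_i = (s_i + s_{i+1}) / sqrt(2) and the second half the coefficients
    c_i = (s_i - s_{i+1}) / sqrt(2); the averages are then transformed again.
    """
    s = np.asarray(signal, dtype=float).copy()
    n = len(s)
    while n > 1:
        pairs = s[:n].reshape(n // 2, 2)
        averages = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
        details = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
        s[: n // 2] = averages
        s[n // 2 : n] = details
        n //= 2
    return s

# Illustrative use on an eight-element signal.
out = haar_transform([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
```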
b. Wavelet Vector and Wavelet Matrix. In the linear algebra view of the Haar transform, the first average is calculated by the inner product of the signal [s_0, s_1, ..., s_{N−1}] and the vector of the same size [1/√2, 1/√2, 0, 0, ..., 0]. This is the scaling vector. The first coefficient is calculated by the inner product of the signal and the vector [1/√2, −1/√2, 0, 0, ..., 0]. This is the wavelet vector. The next average and coefficient are calculated by shifting the scaling and wavelet vectors by two and calculating the inner product. The first step of the Haar wavelet transform for an eight-element signal can be described as the matrix-vector multiplication Ws of the wavelet matrix and the signal, where
$$W = \frac{1}{\sqrt{2}}\begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix}$$
The Haar wavelet transform has a number of advantages: it is fast and conceptually simple, and it is exactly reversible without the edge effects that are a problem for other wavelet transforms. Of course, the Haar transform also has limitations, which can be a problem for some applications. The Haar window is only two elements wide, and the transform algorithm shifts over by two values at each scale, so some changes in the signal are not reflected in the wavelet coefficients.

2. Daubechies Wavelet Function and Other Wavelet Functions

The Daubechies-4 (D4) is probably the most widely used function of the Daubechies family (Daubechies, 1989). As indicated by its name, this
wavelet set has four coefficients. Figure 16 displays the scaling and wavelet functions of D4. The scaling function, φ, and the wavelet function, ψ, can be expressed as

$$\phi(x) = \sqrt{2}\,[h_0\phi(2x) + h_1\phi(2x - 1) + h_2\phi(2x - 2) + h_3\phi(2x - 3)]$$
$$\psi(x) = \sqrt{2}\,[g_0\phi(2x) + g_1\phi(2x - 1) + g_2\phi(2x - 2) + g_3\phi(2x - 3)]. \qquad (53)$$

The scaling function coefficient values are

$$h_0 = \frac{1 + \sqrt{3}}{4\sqrt{2}}, \quad h_1 = \frac{3 + \sqrt{3}}{4\sqrt{2}}, \quad h_2 = \frac{3 - \sqrt{3}}{4\sqrt{2}}, \quad h_3 = \frac{1 - \sqrt{3}}{4\sqrt{2}}. \qquad (54)$$
The wavelet function coefficient values are g_0 = h_3, g_1 = −h_2, g_2 = h_1, and g_3 = −h_0. As with the Haar transform discussed earlier, the Daubechies scaling and wavelet function coefficients shift from left to right by two places in each iteration of a wavelet transform step. The result of the wavelet transform is a "down sampled" smoothed version of the signal and a "down sampled" version of the signal that reflects changes between signal elements. If we compare the Haar wavelet and the Daubechies-4 wavelet, we see that the Haar high-pass filter produces a result that reflects the difference between an even element and the odd element that follows it; a change between an odd element and its even successor will not be reflected in the coefficients calculated by a single step of the Haar high-pass filter. In contrast, there is overlap between successive Daubechies-4 high-pass filters, so a change between any two elements will be reflected in the result. The wavelet literature covers a wide variety of wavelet functions.
FIGURE 16. Scaling function of Daubechies-4 (left) and wavelet function of Daubechies-4 (right).
FIGURE 17. Scaling function of Coiflet-4 (left) and wavelet function of Coiflet-4 (right).
FIGURE 18. Wavelet function of the Mexican hat.
Figures 17-20 display, respectively, the Coiflet, Mexican hat, Meyer, and Morlet wavelet functions (Daubechies, 1994). These figures were obtained using the Matlab 6.0 Toolbox. Readers are invited to refer to the Matlab Toolbox for the properties of these wavelet functions.
3. Wavelet Analysis: Parallelism and Applications

As with many algorithms in computer vision and image processing, the two-dimensional DWT is computationally intensive and operates on a large data set. These factors, coupled with the demand for real-time operation in many image processing tasks, necessitate the use of parallel processing in order to
FIGURE 19. Scaling function of Meyer (left) and wavelet function of Meyer (right).
FIGURE 20. Wavelet function of Morlet.
provide high performance at a reasonable cost. A number of parallel solutions of the DWT have been proposed (Lu, 1993; Koc et al., 1994; Misra and Nichols, 1994). The intrinsic parallelism in the wavelet transform makes it attractive for hardware implementation (Chen et al., 2001).
The wavelet has been applied to a wide range of fields. For example, it can be used in a variety of statistical applications, including signal variance estimation, frequency analysis, and kernel regression. In digital signal processing, it provides powerful tools for analyzing, encoding, compressing, reconstructing, and modeling signals and images. Some important applications follow.
• Enhance edge detection in image processing.
• Achieve high rates of signal or image compression (e.g., the JPEG 2000 standard is based on the wavelet lifting scheme).
• Restore noisy signals and degraded images.
• Study the fractal properties of signals and images.
• Extract information-rich features for use in classification and pattern-recognition applications.
Section IV gives many applications using the wavelet transform technique for feature extraction, and in Section V the wavelet transform is used as a prefiltering technique for movement analysis in handwriting and face recognition.
IV. PATTERN RECOGNITION USING WAVELET AND NEURAL NETWORK FOR SIGNAL AND IMAGE PROCESSING APPLICATIONS
A system of pattern recognition for signal or image processing applications is generally composed of two modules: a preprocessing module and a classifier. The preprocessing task performed before recognition usually has two roles: defining an initial transformation in order to represent the measured signal, and retaining the information that is important for class discrimination while discarding that which is irrelevant. This preprocessing stage is sometimes called "feature extraction." The goal is to extract the relevant features that enable the recognition of different patterns. It is also desirable to keep the dimension of the feature vector as low as possible because this results in a simpler classifier. However, it sometimes happens that the dimension of the feature vector is greater than the initial dimension of the data set because complementary information, such as textures or contours, is integrated in order to better represent the measured signal. The structure of this section is as follows. We first briefly describe the different techniques of preprocessing commonly used for pattern recognition. Then we give some examples of audio, underwater, and eddy current signal processing for identification and classification purposes using the wavelet transform. We present the systems by describing the preprocessing
technique used, the pattern recognition model, and the obtained results. We also introduce some medical signal processing applications concerning the classification of electroencephalogram (EEG) and myoelectric signals (MES). One of these applications consists of the classification of sleep stages using the neural network and wavelet transform. After that, we illustrate applications of shape recognition and image classification by content. Aside from the examples described in this section, many other pattern recognition applications using the wavelet transform and neural networks for signal processing can be found, such as ground fault discrimination, flaw echo location, short-term electrical load forecasting, particle shape classification, and so on. Readers are invited to refer to Tungkanawanich et al., 2000; Liu et al., 1997; Yao et al., 2000; Drolon et al., 2000; and Szu et al., 1996 for more details.
A. Different Techniques of Preprocessing for Pattern Recognition

1. Principal Component Analysis and Fourier Descriptors

a. Principal Component Analysis. The PCA approach consists of constructing an orthogonal basis from a data set that yields the best compression. This results in the smallest least square error if we use only M of the N coefficients for representation of a signal, x, where M ≤ N. Thus, the basic idea behind the PCA approach for pattern recognition is that response data can be expressed as a superposition of a set of basis functions determined (learned) from examples. The learned basis functions are eigenvectors of the covariance matrix of the example data vector, regarded as a stochastic column vector. Collecting the basis functions (eigenvectors of the covariance matrix in descending order) as columns of a matrix, the signal vector denoted by x can also be expressed using the new basis as

y = Uᵀx,     (55)

where y denotes the data vector expressed in the new basis and U is the eigenvector matrix. Through the PCA approach, an "intelligent" system has been obtained, one that extracts a suitable orthogonal set of basis functions from data examples (Bishop, 2000). The PCA produces an uncorrelated feature set by projecting data onto the eigenvectors of the covariance matrix. The PCA provides a means of unsupervised dimensionality reduction because no class membership qualifies the data when specifying the eigenvectors of maximum variance.
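As an illustration of Eq. (55), the following sketch (assuming NumPy; the function names are ours) learns the eigenvector matrix U from example data and projects each example onto the M leading principal axes.

import numpy as np

def pca_basis(X, M):
    """Learn the first M principal axes from example data X (one example per row)
    and return the projection y = U^T x of Eq. (55) for every example."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)               # covariance matrix of the examples
    eigval, eigvec = np.linalg.eigh(cov)         # eigendecomposition (ascending order)
    order = np.argsort(eigval)[::-1]             # sort eigenvectors by descending variance
    U = eigvec[:, order[:M]]                     # keep the M leading eigenvectors as columns
    Y = Xc @ U                                   # coordinates in the new basis
    return U, Y

X = np.random.randn(200, 16)       # toy data: 200 examples of dimension N = 16
U, Y = pca_basis(X, M=4)           # compress to M = 4 coefficients per example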
Using a geometrical interpretation, the coordinate system is rotated such that the axes in the new system point in the directions of the largest variance of the analyzed data. The axes are referred to as principal axes. The idea behind compression is to represent data using only a limited number of axes for which the variance is sufficiently large (see Figure 21). Another method of extracting basis functions using knowledge about the analyzed data is independent component analysis (ICA). Readers will find a well-illustrated comparison between PCA and ICA by Bartlett (1998). PCA encodes dependencies in data by rotating the axes in order to correspond to directions of maximum covariance. ICA does not constrain the axes to be orthogonal and attempts to place them in the directions of statistical dependencies in the data.
b. Fourier Descriptors (FDs). The Fourier transform decomposes a function into a basis set of sinusoidal components that are only localized in the frequency domain. The Fourier descriptor is a standard technique used in image processing for the recognition of different contours in digital images. The idea is to expand the contour of an object in a Fourier series and use a limited number of Fourier coefficients, called FDs, as features for recognition of the object. The discrete Fourier transform (DFT) and fast Fourier transform (FFT) are widely used in real-time applications. The Fourier transform is a powerful tool for studying the frequency content of signals, but it does not provide any localization in time. In order to overcome this problem, the windowed Fourier transform, or short-time Fourier transform (STFT), was suggested:

STFT[f](w, s) = ∫ g*(t − s) f(t) e^(−iwt) dt,     (56)

where g is a given time window, which can be shifted in time. The STFT divides the time-frequency domain using a fixed tiling.
FIGURE 21. Example of two-dimensional data distribution and the corresponding principal component and independent component axes (cf. Bartlett, 1998).
Once a window function is specified, each basis function has an identical aspect ratio in the time-frequency plane. If g is a Gaussian function, then the STFT is called a Gabor transform (Gabor, 1946).
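The STFT of Eq. (56) can be sketched as follows (hypothetical helper names, NumPy assumed); with a Gaussian window it behaves as a Gabor transform.

import numpy as np

def stft(f, window, hop):
    """Windowed Fourier transform of Eq. (56): slide the window g over the signal
    and take an FFT of each windowed segment (a Gaussian g gives a Gabor transform)."""
    n = len(window)
    frames = []
    for start in range(0, len(f) - n + 1, hop):
        frames.append(np.fft.fft(f[start:start + n] * window))
    return np.array(frames)          # rows: time shifts s, columns: frequencies w

t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 120 * t)
gauss = np.exp(-0.5 * ((np.arange(128) - 63.5) / 16.0) ** 2)   # Gaussian window
coeffs = stft(signal, gauss, hop=32)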
2. Wavelet Transform and Feature Extraction
A more elegant version of time-frequency analysis is wavelet analysis. The wavelet transform uses a sequence of small, compactly supported waves as a basis for the representation of the signal under study. This basis consists of local functions with different positions and scales (Mallat, 1989). Different from the PCA and the ICA, the basis functions used for the WT are constructed without using any particular knowledge about the analyzed data. Unlike sines and cosines, wavelets can be constructed in many forms as long as certain requirements are satisfied. In pattern recognition, feature extraction denotes a procedure for mapping the original measurement into a relatively short vector representing features relevant for the classification. The multiscale aspect of the wavelet transform is particularly important for recognition and diagnostic purposes because if one can make some decision concerning the underlying time-frequency components of the signal, one may choose the appropriate scales in the wavelet transform while ignoring the contribution of the other scales. This allows one to achieve both relevant feature extraction and data compression (Mallet et al., 1997). The wavelet transform provides very general techniques that can be applied to many tasks in nonstationary signal processing. Like the FFT, the wavelet transform can be implemented with fast algorithms. The DWT allows a better study of short-time high-frequency structures. Low frequencies are sampled with large time steps, whereas high frequencies are sampled with small time steps. The tiling of the time-frequency domain is variable using the WT: the aspect ratio of the wavelets varies such that the frequency resolution is proportional to the center frequency. This tiling has been shown to be appropriate for many physical signals, but the partition is nonetheless still fixed.
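A minimal sketch of such scale selection for feature extraction follows, assuming the PyWavelets package (pywt) is available; the wavelet, the number of levels, and the retained scales are illustrative choices, not values prescribed by the authors.

import numpy as np
import pywt   # assumption: the PyWavelets package is available

def wavelet_features(x, wavelet="db4", level=5, keep_levels=(3, 4, 5)):
    """Decompose the signal and keep only the detail coefficients of the chosen
    scales; their energies form a short feature vector for classification."""
    coeffs = pywt.wavedec(x, wavelet, level=level)   # [cA_L, cD_L, ..., cD_1]
    details = coeffs[1:]                             # detail coefficients, coarse to fine
    feats = []
    for j, d in enumerate(details, start=1):
        lev = level - j + 1                          # detail level of this band
        if lev in keep_levels:
            feats.append(np.sum(d ** 2))             # energy of the retained scale
    return np.array(feats)

x = np.random.randn(1024)
print(wavelet_features(x))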
3. Wavelet Packet Transform (WPT)
We use the multiresolution analysis proposed by Mallat in order to illustrate the differences between the wavelet transform and the wavelet packet transform (Coifman and Wickerhauser, 1992; Coifman et al., 1992; Bachman et al., 2000; Benedetto and Frazier, 1994). In Mallat (1989), the wavelet transform is seen as generated by a pair of quadrature mirror filters (h, g), where h is a low-pass filter and g a high-pass filter. h is related to the scaling function φ with the scale equal to 2ʲ, j ∈ Z⁺, and g to the mother wavelet ψ by
φ(x) = Σₙ h(n) √2 φ(2x − n)
ψ(x) = Σₙ g(n) √2 φ(2x − n).     (57)

For a discrete signal, x, of finite length, we define the following operators H, G:

(Hx)ₖ = Σₙ h(n − 2k) x(n)
(Gx)ₖ = Σₙ g(n − 2k) x(n).     (58)
H and G acting on x are convolutions with the filters h and g followed by a down-sampling by two. This transforms the signal, x, into two subbands of equal length: (Hx) contains the low-pass band and (Gx) the high-pass band. Recursive application of H and G on the low-pass band defines the wavelet transform; recursive application of the operators H, G on both bands defines the "wavelet packet transform," as shown in Figure 22. The WPT provides an adaptive tiling of the time-frequency domain: an overcomplete set of time-frequency tilings is provided and the best can be selected for the particular signal under study.
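The operators of Eq. (58) and the recursion of Figure 22 can be sketched directly (NumPy assumed; the Haar pair is used only as a convenient example of (h, g), and periodic boundary handling is an implementation choice).

import numpy as np

def analysis_operators(x, h, g):
    """Eq. (58): (Hx)_k = sum_n h(n - 2k) x(n), (Gx)_k = sum_n g(n - 2k) x(n),
    i.e., filter with h and g and keep every second sample (periodic boundaries)."""
    n = len(x)
    Hx = np.zeros(n // 2)
    Gx = np.zeros(n // 2)
    for k in range(n // 2):
        for m in range(len(h)):
            Hx[k] += h[m] * x[(m + 2 * k) % n]
            Gx[k] += g[m] * x[(m + 2 * k) % n]
    return Hx, Gx

def wavelet_packet(x, h, g, depth):
    """Recursive application of H and G on *both* subbands (the WPT tree of Fig. 22)."""
    if depth == 0 or len(x) < len(h):
        return [x]
    Hx, Gx = analysis_operators(x, h, g)
    return wavelet_packet(Hx, h, g, depth - 1) + wavelet_packet(Gx, h, g, depth - 1)

# Haar filter pair as a simple example of (h, g)
h = np.array([1.0, 1.0]) / np.sqrt(2)
g = np.array([1.0, -1.0]) / np.sqrt(2)
bands = wavelet_packet(np.random.randn(64), h, g, depth=3)   # 2^3 = 8 subbands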
4. Feature Extraction Using Invariance Properties
The sinusoidal basis functions used by the Fourier transform are well localized in the frequency domain but not in the time domain. For image processing applications, this implies that it may be difficult to modify a particular region in the image by adding or deleting frequency components generated by the Fourier transform. However, the wavelet transform decomposes the image into a set of subimages, called shapes, with different resolutions
FIGURE 22. Wavelet tree (left) and wavelet packet tree (right). The top subband contains the signal, x, with Nyquist frequency Fn. The time resolution decreases by a factor of 2 in each iteration; the frequency resolution doubles as one goes from the top to the bottom of the WT and WPT trees.
corresponding to different frequency bands. In pattern recognition, the Fourier transform and the wavelet transform have been used for shape analysis, e.g., in the discrimination of closed curves so as to classify the different objects present in an image. The difficulty of shape recognition lies in classifying objects correctly independent of their position, orientation, and size. This property may be obtained in three basic ways (Chen and Hewit, 2000). The first is to train the classifier to be invariant. The training regime will explicitly tell the classifier that shapes that are to be classified are similar. Although this classification procedure is theoretically possible, in practice it may require an infeasibly large amount of training data and a correspondingly long training time. The second method is to build a degree of invariance into the classifier itself. For example, higher order neural networks (HONNs) (Reid et al., 1989) can have this type of inherent invariance, but the number of interconnections in a fully interconnected HONN may be too large to be practical for high-resolution image processing. This severely limits the application of HONNs in cases where the number of inputs is large and consequently increases the processing time. The third method is to extract invariant features from a given pattern during the preprocessing stage and then use a neural network as a secondary classifier. The extracted feature vectors must be invariant to any similarity transformation of a pattern, such as translation, rotation, and scale. Many algorithms have been developed for this (Sekita et al., 1992; Das et al., 1990) and are based on intrinsically describing the boundary of a pattern because the shape recognition is considered equivalent to the detection of the pattern boundary. One of these methods consists of representing the boundary using the polar coordinate system. By setting the center of the pattern as the origin, sampling of the boundary can be performed by angular sampling methods and repeated every 2π radians. Thus, the boundary shape can be represented by the distance between the sampling points and the pattern center, and so identification of a two-dimensional pattern can be transformed to that of the one-dimensional distance function, which is periodic. Another advantage of this transformation is that the distance sequence is already invariant to pattern translation in a two-dimensional domain. It should be noted that the pattern rotation has been transferred to a shift of the distance sequence. Thus, it only remains to derive algorithms to generate coefficients that are invariant to scaling and shifting of the distance sequence. Scaling invariance can be achieved by normalizing the distance sequence. The new distance sequence can be obtained by dividing the original distance sequence by its average value.
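A sketch of the boundary representation described above (illustrative names, NumPy assumed): the contour is sampled angularly around its center and the distance sequence is normalized by its mean, which gives translation and scaling invariance.

import numpy as np

def distance_sequence(boundary, n_samples=64):
    """Angular sampling of a closed contour: represent the shape by the distance
    from its center to the boundary at equally spaced angles, then normalize by
    the mean so that the sequence is invariant to translation and scale."""
    boundary = np.asarray(boundary, dtype=float)     # (n_points, 2) contour points
    center = boundary.mean(axis=0)                   # pattern center as origin
    rel = boundary - center
    angles = np.mod(np.arctan2(rel[:, 1], rel[:, 0]), 2 * np.pi)
    radii = np.hypot(rel[:, 0], rel[:, 1])
    order = np.argsort(angles)                       # order points by angle
    grid = np.linspace(0, 2 * np.pi, n_samples, endpoint=False)
    seq = np.interp(grid, angles[order], radii[order], period=2 * np.pi)
    return seq / seq.mean()                          # scaling invariance

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
ellipse = np.c_[3 * np.cos(theta), 2 * np.sin(theta)]
print(distance_sequence(ellipse)[:8])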
Rotation invariance of the shape is equivalent to shift invariance of the distance function. Shift invariance is easy to achieve using the Fourier transform but more difficult using the wavelet transform. This problem was tackled by Mallat (1991), who studied the zero crossings of a wavelet transform. Simoncelli et al. (1992) studied the invariance property of a wavelet further and proposed shiftable multiscale transforms. More recently, Unser (1995) used overcomplete wavelet transforms in order to derive shift invariance of a wavelet. A similar method has been developed and used by Chen and Hewit (2000).
B. Audio, Underwater, and Eddy Current Signal Processing Applications
Time-frequency representations have received considerable attention in such diverse fields as speech recognition (Tzanetakis et al., 2001), radar classification (Ayrulu and Barshan, 2001), underwater acoustics, and geoacoustical signals. In many real-time applications, linear time-frequency representations (STFT, DWT, and WPT) are preferable to quadratic time-frequency representations and the continuous wavelet transform because the former are accompanied by efficient discrete transforms. This section describes three examples of audio, underwater, and eddy current signal processing applications for identification and classification purposes using the wavelet transform.
1. Audio Signals Identification
In the last decade, there has been an increasing demand for systems able to classify different types of auditive signals, especially speech recognition systems. Sound classification systems can be designed to identify different types of audio signals from a set of training data with robust performance (Renals and Rohwer, 1989; Arslan and Hansen, 1999; Kermit et al., 1999). We describe a general audio classification system proposed by Kermit and Eide (2000) that uses the Haar wavelet transform (Chui, 1992) in the preprocessing stage and a neural network that compares small signal segments present in specific types of sound from candidate audio signals against a predefined template. The classification performances are tested with three different applications concerning steady-state vowels, speech signals, and instrument recognition.
a. Preprocessing Stage and Identification Procedure. A key to successful classification may be to isolate some kind of uniqueness that identifies the sound to be classified. Such uniqueness is sometimes present during a very short period of time in audio signals and is often repeated with some degree
of perturbation for multiple succeeding periods throughout a signal. The system proposed by Kermit and Eide (2000) selects small audio segments that represent characteristic regions as features of identification. In order to compensate for some perturbations of these segments, the Haar wavelets (Chui, 1992), known to have a dampening effect on rich amplitudal variations often present in audio signals, are performed in the preprocessing stage. The application of Haar wavelets to a characteristic segment of a audio signal produces a set of Haar wavelet coefficients that give a template for identification of a specific audio signal. The identification process is performed by a neural network known as the O algorithm (Eide and Lindblad, 1992; Lindbald et al., 1997), which performs a similarity match between two sets of data. The O algorithm is similar to an RBF neural network. Here, the Haar wavelet coefficients are used as a predefined characteristic template representing some class of audio signals. A candidate template captured from an audio signal can be classified by comparing its similarity to the predefined templates.
b. System Evaluation Using Three Different Data Sets. Three data sets are used in the evaluation of the proposed system. The first data set is a database consisting of 90 data files representing 10 separate trials of nine vowels in the Norwegian tongue. Each file contains 1024 mono samples at a 16-bit sampling rate of 8 kHZ. In this application of classification of steadystate vowels, windows with widths of N = 8, 16, and 32 samples were applied as a unique representation of vowel periods. About 90,000 tests were performed, and the results are presented in a confusion matrix table (Kermit and Eide, 2000). The maximum number of obtainable identifications for each vowel is about 100 because there are about 10 periods in each file from the vowel database. The second data set consists of 25 files representing five separate trials of five different words (cat, pit, fed, apple, and tall). The vowel search from five different words is the application to speech signals. If a word contains six vowel periods where four of these are identified correctly and two periods are identified incorrectly, the vowel is still considered to be classified correctly. The classification performance is presented in Kermit and Eide (2000). The last data set differs slightly from the two sets of speech sound just described. A prerecorded musical composition on a compact disc was chosen such that some musical instruments could be isolated and recorded without interference from other instruments. This application consists of instrument recognition from a complex musical composition. The system test with three different instruments (bass guitar, keyboard, and snare drum) present during the 5-s musical composition gives, respectively, a correct classification of 75%, 62.5%, and 87.5%.
2. Underwater Signals Classification
Automatic classification of underwater passive sonar signals radiated by ships is a complex problem because of the large variability in both the temporal and spectral characteristics of the signal, even when obtained from the same source. Each type of signal has its own characteristics and conventionally is identified by human experts either by listening to or by looking at the spectrograms of the processed sonar signals. Identification by human experts is usually not objective and is a very heavy workload. Chen et al. (1998) have proposed an approach using the wavelet transform and neural networks to classify the passive sonar signals of a ship in order to identify the class of ships.

a. Underwater Signal and Pattern Classification System. A passive sonar underwater system is an acoustic receiving system used to receive underwater signals radiated by various ships and other marine life. Thus the characteristics of radiated signals from ships are of particular importance in identifying the ships (Chen et al., 1998). Research on the models of the underwater signal sources radiated by ships shows that the signal sources can be divided into three headings: machinery signals, propeller signals, and hydrodynamic signals (Urick, 1983). Each signal produces its own component spectrum, and the tonal content of the radiated signals of a ship is very characteristic of that ship. Some of the frequency components and harmonics vary with the speed of the engine, but others stay fixed. The pattern classification system proposed by Chen et al. (1998) takes into account these particular properties of underwater signals (see Figure 23).

b. Preprocessing and Feature Extraction. Chen et al. (1998) obtained the spectral envelope by averaging over each spectrum in a spectrogram. Tonals, composed of high frequencies, are important features for the spectrum of a ship and are usually buried in the spectrum. The wavelet transform has been used to extract tonal features from the spectrum. In fact, tonals have the same property as edges or corners, which are often characterized by their local variation. Measurement of the local variation of a signal must be defined with respect to a reference scale. This scale characterizes the neighborhood size where the variations are computed. For signals including meaningful structures of different sizes, one needs to modify this scale parameter. By coupling the scale and spatial variables of a spectrum signal, the local maxima of the wavelet transform allow one to find tonal features easily. The first-order derivative of the Gaussian function θ(x) is chosen as the mother wavelet ψ(x). The wavelet transform of f(x) at scale 2ʲ (j is an integer) and position x is given by
FIGURE 23. Architecture of the pattern classification system (cf. Chen et al., 1998): sonar data → time-frequency distribution (spectrogram) → average power spectral density → feature extraction → neural network classifiers → output decision.
W₂ʲf(x) = f ∗ ψ₂ʲ(x) = f ∗ (2ʲ dθ₂ʲ/dx)(x) = 2ʲ d/dx (f ∗ θ₂ʲ)(x),     (59)

where θ₂ʲ(x) = (1/2ʲ) θ(x/2ʲ) and ψ₂ʲ(x) = (1/2ʲ) ψ(x/2ʲ). W₂ʲf(x) is proportional to the first derivative of f(x) smoothed by θ₂ʲ(x). The maxima of |W₂ʲf(x)| correspond to the maxima of the derivative of f ∗ θ₂ʲ(x), which are the sharp variation points at scale 2ʲ (Lee et al., 1993). Thus, tonal features can be extracted using the wavelet transform by choosing a small scale. The tonal features extracted are then used as inputs to the classifier.
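A rough sketch of the tonal extraction idea of Eq. (59), assuming NumPy; the Gaussian support, the scale value, and the peak-picking rule are our own illustrative choices rather than the exact procedure of Chen et al. (1998).

import numpy as np

def tonal_features(spectrum, scale=2):
    """Wavelet transform of Eq. (59) with the first derivative of a Gaussian as
    mother wavelet: smooth the spectrum by a Gaussian dilated to the chosen (small)
    scale, differentiate, and keep the positions of local maxima of |W f| as tonals."""
    s = float(scale)
    t = np.arange(-4 * s, 4 * s + 1)
    theta = np.exp(-t ** 2 / (2 * s ** 2))
    theta /= theta.sum()                               # dilated Gaussian theta_s
    smoothed = np.convolve(spectrum, theta, mode="same")
    w = s * np.gradient(smoothed)                      # s * d/dx (f * theta_s)
    mag = np.abs(w)
    peaks = [i for i in range(1, len(mag) - 1)
             if mag[i] > mag[i - 1] and mag[i] >= mag[i + 1]]
    return np.array(peaks), w

freqs = np.linspace(0, 1, 256)
spectrum = np.exp(-((freqs - 0.3) / 0.01) ** 2) + 0.5 * np.exp(-((freqs - 0.6) / 0.01) ** 2)
positions, w = tonal_features(spectrum, scale=2)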
c. Neural Network Classifiers and Experimental Results. Two neural networks classifiers have been employed: multilayer perceptron (MLP) and radial basis function (RBF). The MLP has been trained by the backpropagation algorithm and its variants. Two sigmoid functions, unipolar and bipolar, have been tested. The data set is collected from three fishing boats. After signal preprocessing, a total of 200 average power spectrum density (APSD) patterns are included in the data set: 60 patterns belonging to the first class, 60 patterns belonging to the second class, and the rest belonging to the third class. The input vector has 256 components. Experimental results obtained from MLP and RBF with different configurations and learned algorithms show that networks trained with tonal features extracted by wavelet transform have 96% or 94% correction rates, but training with original APSDs has only an 80% correction rate.
3. Neural Network Classifier for Eddy Current Signals
A modern civil aircraft has a large number of lap joints and a number of rivets on the order of 10⁴. The manual inspection of these rivets is not only tedious but time-consuming and therefore expensive. Moreover, a considerable risk of human error exists. Lingvall and Stepinski (2000) have proposed an approach for automatically detecting and classifying defects in riveted lap joints during the analysis of eddy current (EC) signals.
a. Eddy Current Signals Inspection. This method employs a probe (coil with current) placed close to the inspected material. The primary flux in Figure 24 induces eddy currents in the material, which gives rise to a secondary flux. The secondary flux is then coupled back to the coil, which affects its impedance. Therefore, the magnitude and phase of the eddy currents will affect the loading on the coil and thus its impedance. If a discontinuity (defect) is present in the material, the eddy current density will be changed, which can be observed as a change in coil impedance. The EC inspection of riveted lap joints was performed using a simple mechanical scanner guiding the probe along the rivet line. The resulting signals are complex valued vectors, which correspond to the probe impedance as a function of its location. Three preprocessing steps of this signal must be performed before feature extraction and classification: three-point median filtering, rotation of EC signals, and amplitude normalization.
FIGURE 24. Eddy current inspection (cf. Lingvall and Stepinski, 1999): the coil's primary flux induces eddy currents in the conductor, which in turn produce a secondary flux.
b. Feature Extraction and Classification. The goal here is to extract the relevant features that enable detecting different classes of defects. Examples of classes are large defects, defects on the left (or right) side of the rivet, and so on. Large defects are detected easily by thresholding, but features that preserve information about location must be used in order to distinguish between left and right defects. The feature extraction was applied in a window of 128 samples centered precisely around each rivet. Four types of feature extractors, applied to a windowed signal, were investigated: block mean values, FDs, DWT, and PCA.
Block Mean Values. These consist of dividing the window positioned around each rivet into a number of blocks; a mean value is calculated in each block.
Fourier Descriptors. In the case of the complex valued EC Lissajous patterns, the Fourier coefficients were computed in the analyzing window using the discrete Fourier transform.
Discrete Wavelet Transform. A time-frequency distribution is useful for analyzing nonstationary signals such as EC signals. One of the most important features of the DWT is the fact that the basis consists of local functions with different positions and scales. This feature permits the determination of the particular scales where the EC signal has significant energy. For instance, the energy at small scales is mostly due to noise; therefore, by removing small-scale coefficients, both noise reduction and data compression have been achieved. The mother wavelet chosen by Lingvall and Stepinski is the Coiflet-2 wavelet, which is a rather smooth wavelet suitable for modeling EC signals.
Principal Component Analysis. The basis functions used for the DWT (here the Coiflet-2 family) are constructed without using any particular knowledge about the analyzed data. Through the PCA approach, an "intelligent" system can be obtained, which extracts a suitable orthogonal set of basis functions from examples of EC data.
The classifier used is an MLP neural network with two outputs: one for defects on the right side of the rivets and one for defects on the left side. Only three neurons in the hidden layer have been implemented. EC signals were analyzed in windows of 128 samples after the three preprocessing steps. Among the four types of feature extractors, the block mean and FDs needed 12 coefficients in order to achieve a satisfactory classification performance; the DWT used the two largest scales, which resulted in 15 basis functions; and the best compression performance was obtained with the PCA, where only 6 coefficients were required. All the extractors resulted in two to three missed detections (out of 68) and five to eight false detections (out of 640).
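For illustration, the block mean extractor can be sketched in a few lines (NumPy assumed; the number of blocks follows the 12 coefficients mentioned above, and the toy signal is synthetic).

import numpy as np

def block_mean_features(window, n_blocks=12):
    """Split the 128-sample window centered on a rivet into n_blocks blocks and
    use the mean of each block as one feature (complex EC samples give complex means)."""
    blocks = np.array_split(np.asarray(window), n_blocks)
    return np.array([b.mean() for b in blocks])

ec_window = np.random.randn(128) + 1j * np.random.randn(128)   # toy complex EC signal
features = block_mean_features(ec_window, n_blocks=12)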
C. Classification of Electroencephalogram and Myoelectric Signals Using Wavelet Transform and Neural Networks
EEG analysis is a very useful technique for investigating the activity of the central nervous system. It provides information related to the brain activity based on electrical measurements taken from the scalp of a subject. Inference and studies about a subject's health and effective treatment of some diseases can be carried out by analyzing the information obtained from an EEG. Readers are invited to refer to Nunez (1981) for an exposition on the biophysics of the EEG and the enormous complexities involved. Processing of the recorded information of EEG signals consists mainly of spectral analysis. When the FFT is applied to successive segments of an EEG signal, the frequency spectrum is observed to vary over time as the Fourier coefficients vary (Zhu et al., 1994), indicating that the EEG signal is a nonstationary signal. If the feature extraction method could include modeling of possible nonstationary effects in the signal, better results may be obtained for the classification of EEG signals. This section presents two applications related to the classification of EEG signals: sleep stage classification and brain disorder classification (Polikar et al., 1997).
1. Sleep Stage Classification
Sleep scoring by a human expert is a very time-consuming task. In the last decade, several works introduced the use of neural networks as a tool for automated sleep scoring. Most of the systems used spectral information of the signal obtained by Fourier transformation (Robert et al., 1998). Jobert et al. (1994) illustrated some advantages of wavelet analysis over Fourier analysis in sleep research. Oropesa et al. (1999) designed a system based on a specific wavelet packet transform and a neural network structure for a classification task using time-frequency patterns.
a. EEG Sleep, Sleep Stage Structure, and Feature Generator. In humans, five sleep stages and an awake stage are defined. Each sleep stage is characterized by a specific pattern of frequency content. For example, stage 1 is defined by no presence of alpha activity (8-13 Hz), low beta1 activity (13-22 Hz), and theta activity (4-8 Hz) (Oropesa et al., 1999). The classification of EEG sleep is usually made by a visual scorer, who takes 30-s epochs and gives a classification according to the rules of Rechtschaffen and Kales (1968). Not every epoch has 100% of the properties of a specific stage. The decision is made according to which stage properties are present the most. Motivated by the adaptive time-frequency localization property of the wavelet transform and the fact that some structures in sleep recordings have
a well-defined time-frequency pattern, Oropesa et al. (1999) used the multiresolution analysis with wavelets proposed by Mallat (1989). The EEG was sampled at 128 Hz, and the db20 wavelet from the Daubechies family of orthogonal wavelets was used. A WPT of eight levels was designed. Out of the family of generated subbands, Oropesa et al. (1999) selected those containing frequency information of the seven bands that correspond to K complexes + delta, delta, theta, alpha, spindle, beta1, and beta2 (see Figure 25). For every 30-s epoch taken from the central EEG electrode C3, the mean quadratic value (called energy) of the wavelet packet coefficients for each of the seven bands was calculated. These seven numbers were used as features for the epoch. Additionally, six features based on total energy and the ratio of different energy values have been defined.
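A hedged sketch of such band-energy features, assuming the PyWavelets package; the use of db20 with eight levels follows the description above, but the band edges and the node-to-band mapping are our own assumptions for illustration.

import numpy as np
import pywt   # assumption: the PyWavelets package is available

BANDS = {          # approximate frequency bands (Hz) used as features (assumed edges)
    "delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
    "spindle": (12, 14), "beta1": (13, 22), "beta2": (22, 35),
}

def wpt_band_energies(epoch, fs=128, level=8, wavelet="db20"):
    """Mean quadratic value ("energy") of the wavelet packet coefficients falling
    inside each frequency band, computed from the terminal nodes of an 8-level WPT."""
    wp = pywt.WaveletPacket(data=epoch, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="freq")        # nodes sorted by frequency
    width = (fs / 2) / len(nodes)                    # bandwidth covered by one node
    feats = {}
    for name, (lo, hi) in BANDS.items():
        coeffs = [n.data for i, n in enumerate(nodes) if lo <= i * width < hi]
        feats[name] = np.mean(np.concatenate(coeffs) ** 2) if coeffs else 0.0
    return feats

epoch = np.random.randn(30 * 128)    # one 30-s epoch sampled at 128 Hz
print(wpt_band_energies(epoch))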
b. Classification and Results. An MLP network trained by the Levenberg-Marquardt method (Bishop, 2000) is used for the classification task with the following structure: 13 neurons in the input layer, each one getting 1 of the 13 features, and 10 neurons in the hidden layer fully connected to 6 neurons in the output layer. The data set consisted of 1630 30-s epochs, half of which were used as the training set and the rest to test the performance of the network. The classification results compared to those of a human expert reached 97.5% agreement on the learning set and 77.6% agreement on the test set.

2. Brain Disorder Classification
We present the application realized by Hazarika et al. (1997) for the classification of three classes of EEG signals: normal, schizophrenia (SCH), and obsessive compulsive disorder (OCD), which uses the wavelet transform as a feature extraction technique. Hazarika et al. (1997) decomposed a typical segment of EEG signals using an orthogonal wavelet basis: the
FIGURE 25. WPT and selected subbands (cf. Oropesa et al., 1999).
Lemarie wavelet (Lemarie and Meyer, 1986). The Fourier transform of the Lemarie mother wavelet ψ(x) has the shape of a band-pass filter (see Figure 26). Thus, the wavelet function can be interpreted as the impulse response of a band-pass filter. The "upper" levels of the wavelet transform are considered "noise" components and can be ignored for the decomposition. Each EEG signal was divided into segments, with each segment comprising 2⁷ = 128 samples, i.e., each segment was 1 s in duration. One hundred and twenty such segments were taken from each subject. A total of 41, 60, and 35 EEG files were obtained respectively from normal, SCH, and OCD subjects. Of these, 26, 39, and 24 EEG files, respectively, were used for training, and the rest of the files were used for testing purposes. To reduce the number of coefficients used as features describing each segment, only four decomposition levels of the Lemarie wavelet transform have been executed. At each level of decomposition, Hazarika et al. (1997) measured the absolute value of the detail signals and retained the two coefficients with the highest magnitude. Thus, 4 × 2 = 8 coefficients for each segment of the EEG signals were extracted as features. The classification of the three classes is realized by an MLP neural network using the eight signal feature parameters as input, with a single hidden layer containing 50 hidden units and 3 output units. After training, the network with wavelet coefficients was able to correctly classify, respectively, over 66% of the normal class and 71% of the schizophrenia class of EEG signals.
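The feature extraction described above can be sketched as follows, assuming PyWavelets; since the Lemarie wavelet is not available there, db4 is used as a stand-in, which is an assumption of this illustration only.

import numpy as np
import pywt   # assumption: PyWavelets; "db4" stands in for the Lemarie wavelet

def eeg_segment_features(segment, wavelet="db4", levels=4):
    """For each of four decomposition levels, keep the two detail coefficients with
    the largest absolute value, giving 4 x 2 = 8 features per 128-sample segment."""
    coeffs = pywt.wavedec(segment, wavelet, level=levels)
    feats = []
    for detail in coeffs[1:]:                        # detail signals, coarse to fine
        top2 = np.sort(np.abs(detail))[-2:][::-1]    # two largest magnitudes
        feats.extend(top2)
    return np.array(feats)                           # 8 feature values

segment = np.random.randn(128)                       # one 1-s EEG segment
print(eeg_segment_features(segment))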
FIGURE 26. (a) The Lemarie wavelet ψ(x) and (b) modulus of the Fourier transform of ψ(x).
The wavelet transform thus provides a potentially powerful technique for preprocessing EEG signals prior to classification.
3. Classification of the Myoelectric Signal Using the Wavelet Packet Transform
The MES, collected at the skin surface, has become an important tool in rehabilitation due to the ease with which it may be acquired. The MES provides information about the neuromuscular activity from which it originates, which has been fundamental as a source of control for assistive devices and schemes of functional electrical stimulation (Englehart et al., 1999). One of the MES applications consists of the control of powered prosthetic limbs, realized by pattern recognition. The control information extracted from the MES is based either on an estimate of the amplitude or the rate of change or on time-domain features (zero crossings, mean absolute value...) (Hudgins et al., 1993). This information is used to specify the function to be performed--the state of the device. Once the state is selected, it may be driven at a constant speed or its speed may be controlled in a manner proportional to the myoelectric activity. Englehart et al. (1999) proposed a means to improve the accuracy of classification in the form of a time-frequency-based signal representation. This classification task was decomposed into a multistage process illustrated in Figure 27. In the feature extraction stage, the Hamming window (width = 64 ms) was preferred to other windows for the STFT. The mother wavelet functions used are Coiflet-4 for the wavelet transform and Symmlet-5 for the wavelet packet transform (WPT). Feature selection methods attempt to determine the best subset of the original feature set. Feature selection has been performed using a Euclidean distance class separability (CS) criterion (Fukunage, 1990). Feature projection methods attempt to determine the best combination of the original features. Feature projection was performed using PCA. The performance of each form of signal representation has been evaluated in the context of a linear discriminant analysis (LDA) classifier and
FIGURE 27. A breakdown of the classification stages and the methods investigated by Englehart et al. (1999): measured signal → feature extraction (time domain, STFT, wavelet, wavelet packet) → dimensionality reduction (feature selection, feature projection) → classification (LDA, MLP) → class labels.
an MLP classifier. The LDA is implemented easily and is considered a statistical classifier (Fukunage, 1990). A roster of 16 healthy subjects participated in this study. Four classes of myoelectric signal patterns were collected from the biceps and triceps, corresponding to flexion and extension of the elbow and pronation and supination of the forearm. Each pattern consists of two channels of 256 points, sampled at 1000 Hz. The best performance is exhibited when using a WPT-PCA-LDA combination, yielding an average classification error of 6.25%. This represents a significant improvement in comparison with the method of Hudgins et al. (1993) (9.25% average error).
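A compact sketch of a WPT-PCA-LDA chain in the spirit of the best-performing combination, assuming PyWavelets and scikit-learn are available; the toy data, the Symlet-5/level-3 packet settings, and the number of retained components are illustrative, not the authors' exact configuration.

import numpy as np
import pywt
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def wpt_features(X, wavelet="sym5", level=3):
    """Concatenate the terminal wavelet packet coefficients of every pattern."""
    out = []
    for x in X:
        wp = pywt.WaveletPacket(data=x, wavelet=wavelet, maxlevel=level)
        out.append(np.concatenate([n.data for n in wp.get_level(level, order="freq")]))
    return np.array(out)

# toy data: 80 patterns of 256 samples, 4 movement classes
X = np.random.randn(80, 256)
y = np.repeat(np.arange(4), 20)

clf = make_pipeline(
    FunctionTransformer(wpt_features),   # WPT feature extraction
    PCA(n_components=10),                # feature projection
    LinearDiscriminantAnalysis(),        # LDA classifier
)
clf.fit(X, y)
print(clf.score(X, y))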
D. Shape Recognition Applications and Image Classification by Content

1. Recognition and Classification of Blemishes
A blemish is a stain or a damage mark on the surface of a product that renders the product unacceptable. Recognition and classification of blemishes permit analysis of the cause of the blemishes and the development of ways to prevent them in the future. The system proposed by Chen and Hewit (2000) is based on a new approach to the derivation of shift invariance by using an overcomplete wavelet transform. The patterns were digitized as binary images using a linear CCD scanner. The shape recognition of the blemish mark is equivalent to the detection of the boundary of the mark. By setting a polar coordinate system and angular sampling of the boundary, the two-dimensional patterns were converted into one-dimensional distance functions and then normalized as scaling-invariant sequences. The sequences were then processed by the FT, and then the WT, to generate vectors with shift invariance. A new method of generating shift invariance using an overcomplete wavelet transform is described in Chen and Hewit (2000). The complex orthogonal estimation (COE) (Tsang and Billings, 1992) algorithm was used along with the FT and WT for comparison. Two types of wavelet, Haar and Daubechies, were used. Recognition of C-shapes and S-shapes was implemented using an RBF neural network. The performances of the FT and WT are superior to the COE in terms of recognition correctness and execution time.
2. A New Multiscale-Based Shape Recognition Method Applied to Images of Industrial Tools
Lin et al. (1998) proposed a new shape recognition system based on three stages.
a. Feature Vector Extraction. Multiscale wavelet-transformed extremal evolution contains information about the contour primitives in a multiscale manner. Lin et al. (1998) have shown that wavelet-transformed extremal values at small scales contain information about the individual contour primitives (a corner or an arc: individual dominant points), whereas wavelet-transformed extremal positions and values at large scales contain information about the neighboring primitives (neighboring dominant points). For this reason, the multiscale wavelet-transformed extremal evolution of the contour orientation was used to form the feature vector for discriminating shape-dominant points. The first derivative of a normalized Gaussian function is chosen as the mother wavelet, namely

ψ(x) = −x exp(−x²/2).     (60)
b. Wavelet Function Normalization. This makes the method scale invariant and reduces the distortion resulting from normalization of the object contours.

c. Shape Recognition. Observing the transformed wavelet orientation in the test images, most of the significant structures persist in three consecutive scales, s1, s2, and s3. Thus, an extremum at s1 is regarded as a feature point if it also appears at s2 and s3. The feature position is decided at the first scale because the location resolution is better at a small scale. Hopfield neural networks are applied to images of industrial tools in order to test the performance. Experimental results have shown that this method can achieve satisfactory recognition under noisy, occluded, and affine conditions using only three scales.

3. Wavelet Index of Texture for Artificial Neural Network Classification of Landsat Images
In remote sensing imagery, very large amounts of data are available for research programs concerning land, atmospheric, oceanic, and climate studies, with the goal of developing new models and monitoring local and global changes, as well as predicting natural and unnatural phenomena, such as volcanic eruptions, earthquakes, and the effect of human activity on our planet (Szu et al., 1997). In order to analyze all of the available data and to optimize the amount of information, remote sensing imagery needs to be organized in ways that facilitate its mining, retrieval, and future analysis. Extracting image content appears to be one of the most powerful tools that can be used for such purposes, and the most promising ways to extract
image content start with image classification. It is in this context that Szu et al. (1997) have proposed a system of Landsat image classification integrating texture information obtained by the wavelet transform.
a. Integrating Texture Information. In fact, most of the classification errors occurring at the boundary between classes are observed when using a straightforward classification method. Szu and colleagues thought that these errors are due to the fact that in pixel-based classification methods, such as neural networks, no local spatial information is integrated in the classification decision. Thus, their study consists of an initial step in assessing the impact of integrating local information, particularly texture information, in the decision.

b. Co-occurrence Matrices and Wavelet Transform. In order to study aspects of texture concerned with the spatial distribution and spatial dependence among local gray tones, one often calculates several co-occurrence matrices with various distances, d, and angles, Q; the gray tone co-occurrence matrix can be defined as a matrix of relative frequencies of two gray tones separated by a given distance, d, and at a given angle, Q (Szu et al., 1997). By using the multiresolution aspect of a wavelet transform, computations of the co-occurrence matrix can be performed at different resolutions, d, while the information from different angles, Q, is integrated within the same filter. More generally, computing texture with wavelets allows one to exploit their localization properties and provides a spatial density function of the co-occurrence texture.

c. Wavelet Transform Used and Results. Wavelet information has to be computed by the composite edge/texture wavelet transform: the Mexican hat filter m(x) is chosen as an edge-oriented wavelet and the Morlet wavelet M(x) is used as a texture-oriented wavelet. The two-dimensional composite wavelet filter is defined by

W(x, y) = exp(−(x² + y²)/2)(1 − x²)(1 − y²) + exp(−x²/2) exp(−y²/2) cos(k₀x) cos(k₀y)
        = m(x)m(y) + M(x)M(y).     (61)

If f(x, y) is the image to be classified, its multiresolution wavelet decomposition is defined by

MW(f)(a, b) = (1/a) ∫∫ f(t, u) W((t − b₁)/a, (u − b₂)/a) dt du     (62)

with multiple values of a and b = (b₁, b₂).
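The composite filter of Eq. (61) and a coarse dyadic evaluation of Eq. (62) can be sketched as follows (NumPy and SciPy assumed; the value of k₀, the filter support, the 1/a normalization, and the dilation-by-resampling trick are our own assumptions, not the authors' implementation).

import numpy as np
from scipy.signal import fftconvolve   # assumption: SciPy is available

def composite_filter(size=33, k0=5.0):
    """Composite edge/texture filter of Eq. (61): a separable Mexican hat term
    m(x)m(y) plus a separable Morlet-like term M(x)M(y); k0 is an assumed value."""
    x = np.linspace(-4, 4, size)
    X, Y = np.meshgrid(x, x)
    mexican = (1 - X ** 2) * (1 - Y ** 2) * np.exp(-(X ** 2 + Y ** 2) / 2)
    morlet = np.exp(-X ** 2 / 2) * np.exp(-Y ** 2 / 2) * np.cos(k0 * X) * np.cos(k0 * Y)
    return mexican + morlet

def wavelet_texture(image, scales=(1, 2, 4)):
    """Eq. (62) on a dyadic set of scales: sampling the same support over more
    pixels dilates the filter in pixel units; the shifts b are handled by convolution."""
    responses = []
    for a in scales:
        w = composite_filter(size=int(16 * a) | 1) / a
        responses.append(fftconvolve(image, w, mode="same"))
    return responses

img = np.random.rand(64, 64)
maps = wavelet_texture(img)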
Wavelet transform (texture) values associated with the original values of the multiband image are then introduced into a probabilistic neural network (PNN) in order to perform the final classification of the image. Preliminary results show that this is an encouraging track for better discrimination of Landsat imagery.

V. WAVELET PREFILTERING TECHNIQUE APPLIED TO HANDWRITING MOVEMENT ANALYSIS AND FACE RECOGNITION
Wavelet transform theory has found many interesting applications in digital signal processing. It provides very general techniques that can be applied to many tasks in nonstationary signal processing. This section presents two applications using the wavelet transform as a prefiltering technique: handwriting movement analysis and face recognition. We applied the multiresolution analysis using the generalized Canny Deriche (GCD) filter as wavelet function, first to filter the original signal and second to detect the sharp variations in the filtered signal. Comparisons with other methods, which explored the kernel estimates technique, show that the GCD filter provided better filtering performance. We briefly discuss the main filtering methods currently used in the literature and raise the interest of a time-frequency representation of a signal. We describe the GCD filter. Then we detail how to apply this filter with a multiresolution manner in movement analysis of handwriting and drawing. An application of this wavelet prefiltering technique for face recognition concludes this section.
A. Filtering Techniques, Time-Frequency Representations, and Generalized Canny Deriche Filter

1. Filtering Techniques and Time-Frequency Representations
The existing standard signal processing methods for filtering are based primarily on two approaches: linear filtering (finite impulse response, FIR, filters) and statistical methods. The noise to be eliminated is present in high-frequency components, and the use of FIR filters such as second-order Butterworth filters leads to a noticeable reduction of this noise. Determination of parameters such as the cutoff frequency or the transition band frequencies requires a frequency analysis of the noisy signal. Statistical methods include nonparametric regression techniques such as estimates established from spline functions or kernel estimates (Marquardt and Mai, 1994). However, the signal frequency characteristics are
completely ignored in this method, which may turn out to be prejudicial in the case of complex nonstationary signals. The statistical approach neglects the frequency representation of the signal in favor of a temporal representation of the signal. Direct extraction of significant information (duration of phases, changes in rhythm, discontinuities, etc.) from a temporal representation of the signal is quite acceptable when a pseudodeterminist model of the signal can be applied, i.e., when the variations across time are not too random. If such a model cannot be assumed a priori, smoothing and differentiating the signal while considering its frequency components also become necessary. For nonperiodic signals, the Fourier frequency analysis (more precisely, the Fourier integral) represents the signal as a superposition of sinusoidal waves of all possible frequencies whose contributions are coded through their respective amplitudes. This method is powerful for stationary signals but is limited for nonstationary signals such as speech, music signals, and handwriting signals. In this case, the wavelet transform allows decomposition of the signal into functions of both time and frequency.

2. Wavelet Transform and Generalized Canny Deriche Filter
The wavelet transform can be applied in order to extract the characteristics of a signal presenting sharp variations. For this scope, a multiscale analysis first introduced by Mallat and Zhong (1992) is usually used. Most multiscale sharp variation detectors smooth the signal at various scales and detect sharp variation points from their first or second derivative. The extrema of the first derivative correspond to the inflection points of the smoothed signal. We call a smoothing function any function φ(t) whose integral is equal to 1 and which converges to 0 at infinity. We suppose that φ(t) is differentiable and define the wavelet function ψ(t) as

ψ(t) = dφ(t)/dt.     (63)
We denote ψₛ(t) = (1/s) ψ(t/s) as the dilation of the function ψ(t) by a scaling factor s. Thus the wavelet transform of the signal x(t) at scale s and time b can be defined as

Wₛx = X(b, s) = ∫_{−∞}^{+∞} x(t) ψₛ(t − b) dt = x(t) ∗ ψₛ(t),     (64)

from which it can be derived that

Wₛx = x ∗ (s dφₛ/dt)(t) = s d/dt (x ∗ φₛ)(t),     (65)

where ∗ represents the convolution operator.
Eq. (65) shows that the wavelet transform Wₛx is the first derivative of the signal smoothed at scale s. For most purposes, the wavelet function is not required to keep a continuous scale parameter s. In order to allow fast numerical implementations, we impose the condition that the scale varies along the dyadic sequence (2ʲ) with j ∈ Z. We denote ψ₂ʲ(t) the dilation of ψ(t) by a factor 2ʲ:

ψ₂ʲ(t) = (1/2ʲ) ψ(t/2ʲ).     (66)

Using Eq. (64), the discrete wavelet transform of the signal x(t) at scale 2ʲ is defined by the following convolution product:

W₂ʲx = x ∗ ψ₂ʲ(t).     (67)
In order to introduce the multiresolution analysis applied to signal processing, Mallat and Zhong (1992) defined ψ(t) as a quadratic spline function, which is the derivative of a cubic spline function θ(t). In our application of wavelet transforms to handwriting or drawing signals and face recognition, we used multiresolution analysis based on the "GCD filter" as wavelet function. The GCD filter, introduced by Bourennane and Paindavoine in 1993, possesses a good signal-to-noise ratio (SNR). The wavelet ψ and smoothing φ functions using the GCD filter are defined as follows:

ψₛ(x) = Cₛ(k₀sx e^(msx) + e^(msx) − e^(sx)),   x ≤ 0
ψₛ(x) = Cₛ(k₀sx e^(−msx) − e^(−msx) + e^(−sx)),   x ≥ 0     (68)

φₛ(x) = Aₛ[(k₀ms²|x| − k₀s + ms) e^(−ms|x|) − ms² e^(−s|x|)],     (69)

where k₀ and m are adjustable parameters for good detection and good localization of the sharp variations of the signal. These filters give an optimal SNR and localization for k₀ = −0.564 and m = 0.215 (Bourennane et al., 1993). The wavelet ψ and smoothing φ functions are represented in Figure 28. Eqs. (68) and (69) represent the continuous wavelet function and the smoothing function, respectively. The corresponding discrete wavelet transforms (dyadic wavelet function ψ₂ʲ and discrete smoothing function φ₂ʲ) are obtained using the Z transform applied to Eqs. (68) and (69). These Z transforms yield two third-order recursive filters moving in opposite directions. Details for the computation of these numerical filters are given in the Appendix. In order to illustrate the multiresolution analysis, a noisy signal presenting slow and fast variations can be used. Figure 29 illustrates the advantages of the multiscale decomposition: (1) sharp variations in the smoothed signal can be eliminated if only the scale j = 0 or j = 1 is considered and (2) a sharp variation can be precisely detected if the scale j = 3 is considered.
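A sketch that simply evaluates Eqs. (68) and (69) as reconstructed above (NumPy assumed); the normalization constants Cₛ and Aₛ are not specified here, so unit-maximum and unit-integral normalizations are used as assumptions, and the recursive implementation described in the Appendix is not reproduced.

import numpy as np

K0, M = -0.564, 0.215      # optimal parameters reported by Bourennane et al. (1993)

def gcd_wavelet(x, s=1.0, k0=K0, m=M):
    """Eq. (68) evaluated piecewise; C_s is replaced by a unit-maximum normalization (assumption)."""
    x = np.asarray(x, dtype=float)
    left = k0 * s * x * np.exp(m * s * x) + np.exp(m * s * x) - np.exp(s * x)
    right = k0 * s * x * np.exp(-m * s * x) - np.exp(-m * s * x) + np.exp(-s * x)
    psi = np.where(x <= 0, left, right)
    return psi / np.max(np.abs(psi))

def gcd_smoothing(x, s=1.0, k0=K0, m=M):
    """Eq. (69); A_s is replaced by a unit-integral normalization (assumption)."""
    x = np.asarray(x, dtype=float)
    ax = np.abs(x)
    phi = (k0 * m * s ** 2 * ax - k0 * s + m * s) * np.exp(-m * s * ax) - m * s ** 2 * np.exp(-s * ax)
    return phi / np.trapz(phi, x)

x = np.linspace(-10, 10, 401)
psi = gcd_wavelet(x)
phi = gcd_smoothing(x)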
FIGURE 28. Wavelet ψ (left) and smoothing φ (right) filters.

FIGURE 29. Illustration of the multiresolution analysis: the original noisy signal, the smoothed signal using scale j = 3, and the wavelet components at scales j = 0, 1, 2, and 3.
B. Wavelet Filtering Technique and Movement Analysis in Handwriting and Drawing
This section shows how the wavelet transform can be fruitfully adapted to movement analysis in drawing and handwriting. In order to facilitate the comparison of results in this domain, we adopted the same procedure as
used by Marquardt and Mai (1994), who explored the kernel estimates technique for smoothing handwriting signals. We applied multiresolution analysis using the GCD filter as a wavelet function (1) to filter the original signal (use of the φ₂ʲ filters) and (2) to precisely detect the sharp variations in the filtered signal (use of the ψ₂ʲ filters). An optimal calibration of the filter parameters was made for a synthetic signal, and the filtering performance was assessed by estimating the residual errors in the position, velocity, and acceleration signals. The GCD filter provided better performance than other currently used filtering techniques. We also present how this technique can be applied to the problem of signal segmentation in the case of drawing movements.
1. Handwriting and Drawing Signals As for language and perception, the study of handwriting and drawing performances has progressively become a stimulating object of research, attracting scientists from very different horizons. For instance, neurophysiologists, experimental psychologists, and neurologists are concerned with handwriting and drawing in the context of an understanding of human movement planning, programming, and execution, including their respective disturbances. The interests of electronic engineers and computer scientists in handwriting and drawing are related to image processing in general, such as automatic pattern recognition or signature authentification, as well as to signal processing, through, for instance, the improvement of the technical performances of digitizers or the refinement of valid techniques for the analysis of handwriting signals. Handwriting and drawing can be studied both within a task-oriented approach, which focuses on the analysis of the products, i.e., the static traces or images forming handwritten characters or drawings, and within a processoriented approach, which deals with the analysis of the actual movements being performed during the handwriting or drawing task, usually recorded by means of a digitizer. The analysis of pen-tip displacements over a digitizer provides very valuable information for the study of human motor control [e.g., discovery of motor invariant and laws (Viviani and Terzuolo, 1980)] and movement disturbances [e.g., significant progress in the understanding of Parkinson's disease (Teulings and Stelmach, 1991; Vinter et al., 1996)]. Globally, this approach requires computing different parameters from the first (velocity), second (acceleration), and sometimes third (jerk) derivatives of the pen-tip displacements as a function of time. Clearly, the validity of results obtained in handwriting movement analysis is partly dependent on the digitizer's technical performances in the spatial and temporal domains and on the precision of the filtering techniques used
to reduce the noise contained in the data. Different types of errors can be introduced by the very process of digitizing the pen-tip displacements. Ward and Phillips (1987) have distinguished several of them, such as missing coordinates and nonlinearity as examples of spatial errors or temporal asynchronisms in the sampling of x and y coordinates as an example of a temporal error. As pointed out by Marquardt and Mai (1994), when a kinematics movement analysis is desired, the greatest difficulty emerges from the random errors introducing stochastic noise in the signal because this noise will necessarily be largely amplified as successive derivatives of the signal are computed, with differentiation acting as a high-pass filter. In addition to the performances of the digitizer, noise can emerge from different sources when human movement is recorded, such as noise caused by stretch reflexes, a physiological tremor, or mechanical oscillations due to the spring-like characteristics of limbs (Van Galen et al., 1990). Having at one's disposal highly efficient filtering methods appears crucial, and several solutions are used in the literature. Most of them deal with standard linear filter methods for digital low-pass filtering. Teulings and Maarse (1984) used a filter cutoff frequency of 10 Hz, but other parameters have been adopted in the literature [a transition band from 8 to 24 Hz in Teulings et al. (1986) and from 10 to 30 Hz in Schomaker and Thomassen (1986), for instance]. Clearly no agreement for the determination of these filter parameters is established in the literature. The main problem here is related to the representativity of the handwriting samples used for the determination of cutoff frequencies. Some solutions consist of nonparametric regression methods, such as kernel estimates, performant methods for an automatic definition of the filter parameters having been proposed (Amico and Ferrigno, 1992). The kernel estimate approach has been tested directly for handwriting signals and provides better results than FIR filters or second-order Butterworth filters for the first and second derivatives (Marquardt and Mai, 1994). The present section suggests that the wavelet filtering method provides an optimal solution for the movement analysis of handwriting and drawing signals. We will show that the solution is optimal both for the filtering procedure and for the precise location of significant positions occurring in the course of movement trajectory, these allowing a segmentation of the movement into meaningful units.
2. Filtering and Signal Segmentation Using Multiresolution Analysis

a. Choice of Optimal Parameters for Filtering. In order to select the optimal scales for the smoothing of handwriting signals, synthetic data sets have to be studied. As mentioned previously, Marquardt and Mai (1994)
made a similar study for the adaptation of the kernel estimate approach to handwriting signals, providing comparison tests with other smoothing methods. We therefore decided to consider the same synthetic data sets as those studied by Marquardt and Mai in order to facilitate a further comparative assessment of our method. In order to simulate handwritten loops as realistically as possible, these authors used sine waves at 3, 5, and 7 Hz, computed with an amplitude of 10 mm and sampled at 165 Hz, adding a Gaussian noise of σ(x) = 0.058 mm to the sine waves in order to simulate positional errors (see justifications of the approach in Marquardt and Mai, 1994). The determination of optimal filtering parameters is done with position, velocity, and acceleration signals. In order to study position, the original noisy signal is filtered with the smoothing GCD filter and then the positional error is measured in the filtered signal. In order to study velocity and acceleration, the first and second derivatives of the filtered signal are computed and, subsequently, the velocity and acceleration errors are measured. These computations are done with a scale ranging from j = -5 to j = 5 (the scale factor is s = 2^j). The optimal smoothing scale can thus be obtained by minimization of the different mean square errors (MSE) over the N data points:

MSE = (1/N) Σ_{i=1}^{N} e_i²                (70)
where e_i is the difference between estimated and measured values. Figure 30 displays the results, showing the relation among MSE, scale, and signal frequency. Figure 30 reveals the existence of an optimal smoothing scale, with the presence of minima on the different curves. For each class of signal (position, velocity, and acceleration), the scale that gives the best filtering result is a function of the signal frequency. In order to improve the quality of filtering, the choice of the smoothing scale necessarily lies in a compromise among the three available values (at 3, 5, and 7 Hz), which can be achieved by determining the best minimization of the three errors. To this end, the following criterion was computed:

Error = Min[error²_{3Hz} + error²_{5Hz} + error²_{7Hz}]                (71)
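A minimal sketch of this scale-selection procedure is given below. Since the GCD smoothing filter itself is only specified later [Eq. (85) and the Appendix], a Gaussian smoother whose width grows with the scale factor 2^j stands in for it here; the sine frequencies and noise level follow the text, but the code illustrates the minimization of Eqs. (70) and (71) for the position signal only and is not a reproduction of the study's computations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

fs = 167.0
t = np.arange(0, 2.0, 1.0 / fs)
scales = np.arange(-5.0, 5.0 + 0.1, 0.5)      # candidate j values; scale factor s = 2**j

def mse(estimate, reference):
    """Mean square error, Eq. (70)."""
    e = estimate - reference
    return np.mean(e ** 2)

total_error = []
for j in scales:
    err_sq_sum = 0.0
    for f in (3.0, 5.0, 7.0):                 # the 3-, 5-, and 7-Hz synthetic loops
        clean = 10.0 * np.sin(2 * np.pi * f * t)
        noisy = clean + np.random.normal(0.0, 0.058, t.size)
        # Stand-in smoothing whose width grows with the scale factor 2**j.
        smoothed = gaussian_filter1d(noisy, sigma=2.0 ** j)
        err_sq_sum += mse(smoothed, clean) ** 2   # Eq. (71): sum of squared per-frequency errors
    total_error.append(err_sq_sum)

best_j = scales[int(np.argmin(total_error))]
print("scale minimizing the combined position error:", best_j)
```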
The final choice of the scales corresponding to signals sampled at 167 Hz is as follows: 1.1 for position, 0.3 for velocity, and -0.4 for acceleration.

b. Comparison Tests. We compared the results obtained with the GCD filter with those reported by Marquardt and Mai (1994).
FIGURE 30. Relation among MSE, scale, and frequency (curves for the 3-, 5-, and 7-Hz signals plotted against the smoothing scale, for the position, velocity, and acceleration signals).
The evaluation method used by these authors has been applied successively to three filters: a low-pass 13-Hz second-order Butterworth filter, the kernel estimation filtering described by Marquardt and Mai, and the GCD filter. We therefore computed the standard deviations of the residual errors in the filtered signals and their respective derivatives for a 1-Hz sine wave sampled at 167 Hz. Table 1 reports the obtained results.
TABLE 1
STANDARD DEVIATIONS OF RESIDUAL ERRORS FOR POSITION (POS.), VELOCITY (VELOC.), AND ACCELERATION (ACC.) AS A FUNCTION OF THE FILTER

                  Noisy signal    Butterworth 13 Hz    Kernel estimation    GCD opt.
Pos. (mm)         0.058           0.025                0.021                0.015
Veloc. (mm/s)     13.78           1.19                 0.89                 0.27
Acc. (mm/s²)      3850            84                   21.70                5.04
Results obtained with the GCD filter show a significant improvement of the smoothing performance in comparison to the other filters. The GCD filter offers a particularly good performance for the acceleration signal. However, these performances are available only after a calibration of the filter parameters. Optimal scale values depend on the sampling frequency of the noisy signal. For the sake of illustration, a 100-Hz sampled signal should be filtered using 2, 1.1, and 0.7 as scale values for position, velocity, and acceleration, respectively. A few technical details may be useful. The software discussed in this study is written in Matlab 5.0 and is PC compatible. The time for filtering 1000 equidistant data points at a sampling rate of 100 Hz is 4.5 ms on a Pentium 600 MHz, and it is the same for position, velocity, and acceleration. This time is independent of the scale because we use the same third-order filters for these three signals, with different coefficients adapted to the chosen scale, as described in the Appendix. This fast filtering time allows filtering of the original signal in almost real time.

c. Segmentation. Because the essence of multiscale resolution analysis lies in the detection of variations in the signal, it can also be fruitfully used for resolving one of the main problems in handwriting and drawing research, that of automatic segmentation. Several algorithms and methods have been developed for recovering meaningful pieces of information from the recorded signal, such as letters (handwriting) or geometrical drawing segments (see Wesolkowski, 1996; Bontempi and Marcelli, 1996). We now briefly report how we adapted the multiresolution analysis to this question. The method was applied to the segmentation of angular figures made of three segments (see Bontempi and Marcelli, 1996; Desbiez et al., 1996). The aim was to recover the three segments and the pauses that possibly occur at the angles (see Figure 31, which displays the tangential velocity profile (Figure 31B) obtained when drawing an obtuse figure (Figure 31A), the detected segmentation points being delimited by the circles on the trace (Figure 31E)). The primary interest of our method lies in combining both spatial detection of the relevant local curvature maxima (the angles) and detection of the relevant local minima of velocity. Detecting only the velocity minima appeared too restrictive because, as soon as the movement departs from ballistic-like movements, such as in children or in patients, several velocity minima are found between two relevant spatial segmentation points. The algorithm of spatial detection was aimed at providing candidate positions for a segmentation realized further in the velocity domain, eliminating minor accidents appearing along the trace. For our application, only positions associated with the angles had to be detected.
FIGURE 31. Segmentation of angular figures.
These angular positions were characterized by important changes occurring in the angular direction, which was defined as

θ(t) = arctan(dy/dx)                (72)
More precisely, the critical angular positions corresponded to the local maxima in dθ(t), the variation of the angular direction. In reference to Lee et al. (1995), we used the GCD smoothing filter at two different scales, a and b with a < b, in such a way that two objectives were attained: (1) a precise localization of the critical spatial positions (with dθ_a) and (2) a strong reduction of the minor accidents in the trace (with dθ_b). Then, for each maximum x_i detected with dθ_a, its value y_i with dθ_b was determined and the ratio y_i/x_i was computed. Only the x_i whose ratios y_i/x_i were above a given threshold were considered as candidates for segmentation points. Figures 31C and 31D show the dθ_a and dθ_b smoothing and the resulting spatial segmentation points for the drawing of an obtuse figure. Each maximal curvature position obtained from the previous spatial detection defined only the area where a segmentation key position was present, not its precise location, because the optimal values of the smoothing scales obtained for the spatial detection (parameters a and b, estimated at 0.8 and -0.2 in our application) did not correspond to the optimal value for the smoothing of the velocity signal (0.3). Furthermore, when a pause was present at an angle, the spatial procedure did not provide two segmentation points for, respectively, the beginning and the end of the pause. A refinement of the results issued from the spatial detection using the velocity signal was therefore necessary. This analysis was done with the tangential speed smoothed by the GCD filter at its optimal scale (0.3). Each area given by the spatial procedure was projected onto the smoothed velocity signal, and a search was conducted in the x < 0 and in the x > 0 directions until the half-height of the first peak located in the direction of the search was attained. A gradient descent search was then carried out until the first local velocity minimum. In the case of a pause, two successive minima were obtained with this procedure.
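The following sketch illustrates the two-scale idea on a toy three-segment trajectory. It is not the authors' implementation: Gaussian smoothing stands in for the GCD filter at the scales a and b, the threshold on the ratio y_i/x_i is arbitrary, and the velocity-domain refinement is simplified to snapping each candidate to the nearest local minimum of the smoothed tangential speed rather than the half-height/gradient-descent search described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.signal import argrelextrema

# Hypothetical toy trajectory: three straight segments meeting at two angles,
# sampled at 167 Hz with a little positional noise.
fs = 167.0
t = np.arange(0.0, 3.0, 1.0 / fs)
segs = [t < 1.0, (t >= 1.0) & (t < 2.0), t >= 2.0]
x = np.piecewise(t, segs, [lambda u: 10 * u,
                           lambda u: 10 + 3 * (u - 1.0),
                           lambda u: 13 - 8 * (u - 2.0)])
y = np.piecewise(t, segs, [lambda u: 2 * u,
                           lambda u: 2 + 9 * (u - 1.0),
                           lambda u: 11 + 2 * (u - 2.0)])
x += np.random.normal(0.0, 0.005, t.size)
y += np.random.normal(0.0, 0.005, t.size)

def dtheta(xs, ys, sigma):
    """|Variation| of the angular direction theta(t) = arctan(dy/dx), Eq. (72)."""
    dx = np.gradient(gaussian_filter1d(xs, sigma))
    dy = np.gradient(gaussian_filter1d(ys, sigma))
    return np.abs(np.gradient(np.unwrap(np.arctan2(dy, dx))))

d_a = dtheta(x, y, sigma=3.0)     # fine scale a: precise localization
d_b = dtheta(x, y, sigma=12.0)    # coarse scale b: minor accidents removed
candidates = argrelextrema(d_a, np.greater, order=15)[0]
ratio = d_b[candidates] / (d_a[candidates] + 1e-12)
keep = candidates[ratio > 0.15]   # keep only maxima confirmed at the coarse scale

# Simplified velocity-domain refinement: snap each candidate to the nearest
# local minimum of the smoothed tangential speed.
vx = np.gradient(gaussian_filter1d(x, 3.0), 1.0 / fs)
vy = np.gradient(gaussian_filter1d(y, 3.0), 1.0 / fs)
speed = np.hypot(vx, vy)
minima = argrelextrema(speed, np.less, order=5)[0]
points = [int(minima[np.argmin(np.abs(minima - c))]) for c in keep] if minima.size else []
print("candidate corner indices:", keep.tolist(), "-> segmentation points:", points)
```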
3. Discussion

Most of the criteria used for the assessment of handwriting and drawing movements are derived from velocity and acceleration signals. The degree of fluency of movements, for instance, is often considered as an indicator of the level of coordination in movement production (Van Galen, 1991) and requires the analysis of the velocity profile, if not the acceleration profile (Hogan and Flash, 1987), of the recorded signal. Smoothing the original recorded signal with a filter resistant to the differentiation process is crucial to the validity of further analyses. In this perspective, considerable improvements in available filtering techniques have been made (Woltring, 1985; Marquardt and Mai, 1994).
Here we explored a new alternative, the wavelet transform, and the reported study shows very good performance for this filtering technique. As indicated clearly by the residual error measurements, when compared to other currently used filters, the best performance is offered by the wavelet transform, both for velocity and acceleration signals. However, one must consider that this optimal performance requires a calibration of the filter parameters on the basis of the sampling frequency of the recorded signal; this limitation appears to be a minor drawback because the variety of sampling frequencies used by researchers in the domain is not large (essentially 100 to 200 Hz). The very good results obtained with wavelet analysis in different applications such as speech, music, and image analysis are confirmed in this application to handwriting and drawing signals (see also Lee et al., 1996; Sasi et al., 1997). A second part of our study was devoted to the question of segmentation of the recorded signals into meaningful segments for movement analysis. Again, the wavelet transform was used for this purpose and provided interesting results, as also obtained by Lee et al. (1995) for a similar problem. Of course, further analysis is still needed here for an assessment of the power of the method if a fully automatic algorithm is desired. To what extent such an approach can bring solutions for the recognition of drawn or written characters is also still an open question.

C. An Image Filtering Technique Combining a Wavelet Transform with an Autoassociative Memory: Application to Face Recognition
Linear autoassociative memories are one of the simplest and best-studied neural network models (Kohonen, 1977; Anderson et al., 1977) (see Section II.B). They are widely used as models for cognitive tasks, as well as for pattern recognition or digital signal processing, in part because they are formally equivalent to well-known techniques such as the Karhunen-Loève transform or principal component analysis (Valentin et al., 1994). Even though linear autoassociators are known to be quite robust when noise is added to the patterns to be recognized, their performance is rather poor when a lot of noise is added to the stimulus. One way to improve performance is to use some pre- and postprocessing on the patterns to be recognized. This section evaluates the performance of a preprocessing technique using the wavelet transform applied to face images. In order to improve the performance of a linear autoassociator, we examined the use of several preprocessing techniques. The gist of our approach is to represent each pattern by one or several preprocessed (i.e., filtered) versions of the original pattern (plus the original pattern). First, we
compared the performance of several preprocessing techniques (a plain vanilla version of the autoassociator as a control, a Sobel operator, a GCD operator, and a multiscale GCD operator) and a Wiener filter on a pattern completion task using noise-degraded versions of stored faces. We found that the multiscale Canny-Deriche operator gives the best performance of all models. Second, we compared the performance of the multiscale Canny-Deriche operator with the control condition on a pattern completion task of noise-degraded versions (with several levels of noise) of learned faces and of new faces of the same or another race than the learned faces. In all cases, the multiscale Canny-Deriche operator performs significantly better than the control. This section is organized as follows. First, we describe the linear autoassociator model applied to face images. Second, we compare the wavelet approach with other preprocessing or filtering techniques. Third, we look at the performance of the wavelet approach under various conditions of noise degradation.

1. Linear Autoassociators and Eigenvalue Decomposition

a. Linear Autoassociator Description. The advantage of linear associators in comparison with nonlinear models is that they allow the integration of a very large number of cells in the network. Their implementation is quite easy because they can be analyzed in terms of the singular value decomposition of a matrix (Valentin et al., 1994; Abdi, 1994). In addition, linear models constitute a first processing stage for numerous applications using more sophisticated approaches (for reviews, see Valentin et al., 1994). For example, Kohonen (1984) showed that an autoassociative memory could act as a content-addressable memory for face images. Figure 32 gives an illustration. When linear autoassociative memories are applied to images, the first step is to transform each digitized image into an (image) vector by concatenating the columns of the matrix of the pixel values of the image. Images are "stored" in a connection-weight matrix, which models neural synaptic connections between neural cells associated with the image pixels (see Figure 33). In our description, we follow closely the formulation detailed in Abdi (1994). The patterns to be learned are represented by L × 1 vectors a_k, where k is the stimulus number. The components of a_k specify the values of the pattern to be applied to the L cells of the input layer for the kth stimulus. The responses of the network are given by L × 1 vectors o_k. The complete set of K stimuli is represented by an L × K matrix denoted A (i.e., a_k is the kth column of A). The set of K responses is represented by an L × K matrix denoted O. The L × L synaptic weight connection matrix between the L input cells is
FIGURE 32. Illustration of a content-addressable memory for faces. Images of faces were stored using an autoassociative memory. (Top) Two stimuli given as a key to probe the memory. (Bottom) Responses of the memory. The memory is able to reconstruct a face from an incomplete input (cf. Abdi, 1994).
FIGURE 33. The linear autoassociator is applied to images: transform each digitized image into a vector by concatenating the columns of the matrix of the pixel values of the image. (Figure labels: Cellules = cells; Matrice de connexion W = connection matrix W.)
denoted W. Learning occurs by modifying the values of the connection weights between cells, as explained later. The response (or estimation) of the model to a pattern x (which may or may not have been learned) is obtained as

x̂ = Wx                (73)
Because autoassociators are generally interpreted as content-addressable memories, their performance is evaluated by comparing the output of the system with a test pattern that can be a copy or a degraded version of one of the patterns learned previously by the system. This is achieved by computing
similarity measures (usually a cosine) between input and output. The coefficient of correlation between x̂ and x is an index of the quality of the estimation; hence the larger the similarity between input and output, the better the performance. In order to achieve a high level of performance, several iterative learning rules have been proposed. The most popular one is clearly the Widrow-Hoff learning rule. This is an iterative procedure that corrects the connection matrix W using the difference between the target response and the actual response of the network. In matrix notation, the Widrow-Hoff rule is written as

W_{(t+1)} = W_{(t)} + η(a_k − W_{(t)} a_k) a_k^T                (74)

with W_{(t)} being the weight matrix at step t, η being a small positive constant, and the index k being chosen randomly.
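A minimal sketch of this iterative rule is shown below, on random toy patterns rather than face images; the dimensions and the learning constant η are illustrative assumptions chosen only so that the loop converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
L, K = 64, 10                     # pixels per (tiny) pattern, number of patterns
A = rng.normal(size=(L, K))       # each column a_k is one stored pattern
W = np.zeros((L, L))              # synaptic weight matrix
eta = 0.005                       # small positive learning constant

for step in range(5000):
    k = rng.integers(K)           # pattern index chosen at random
    a_k = A[:, k:k + 1]           # L x 1 column vector
    W += eta * (a_k - W @ a_k) @ a_k.T       # Eq. (74)

# After learning, the response to a stored pattern should approximate it.
x = A[:, 0:1]
x_hat = W @ x                     # Eq. (73)
cos = float(x_hat.T @ x / (np.linalg.norm(x_hat) * np.linalg.norm(x)))
print("cosine between response and stored pattern:", cos)
```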
b. Eigen and Singular Value Decomposition: The PCA Approach. Abdi and colleagues developed a simple method of implementing the Widrow-Hoff algorithm by using the eigen decomposition of W or the singular value decomposition of matrix A. These decompositions give rise to PCA. Eigenvectors of a matrix are vectors that have the property that, when multiplied by the matrix, their length is changed but their direction remains unchanged. More formally, if u_l is an eigenvector of W,
W u_l = λ_l u_l                (75)
where λ_l is a scalar called the eigenvalue associated with the eigenvector u_l. Traditionally, the set of eigenvectors of a given matrix is represented by a matrix U in which the first column represents the vector with the largest eigenvalue, the second column the eigenvector with the second largest eigenvalue, and so on. The corresponding eigenvalues are represented by a diagonal matrix Λ. Using this notation, matrix W can be expressed as a function of its eigenvalues and eigenvectors,

W = U Λ U^{-1}                (76)
In the particular case where W is symmetric and can be obtained as the cross-product of a matrix by its transpose (W is positive semidefinite), its eigenvalues are all positive or zero and its eigenvectors are orthogonal. In this particular case,

U^{-1} = U^T                (77)

and

W = U Λ U^T                (78)
If the eigenvectors are normalized, then U U^T = I. The notions of eigenvectors and eigenvalues of a positive semidefinite matrix can be used to define the singular value decomposition (SVD) of a rectangular matrix. Specifically, any rectangular matrix A can be expressed as

A = U Δ V^T                (79)

where U is the matrix of eigenvectors of A A^T (A A^T = U Λ U^T, with U U^T = I); V is the matrix of eigenvectors of A^T A (A^T A = V Λ V^T, with V V^T = I); and Δ is the diagonal matrix of singular values of A, Δ = Λ^{1/2}, with Λ the matrix of eigenvalues of A A^T and A^T A. Abdi et al. (1998) pointed out that the Widrow-Hoff equation [see also Eqs. (21) and (74)]

W_{(t+1)} = W_{(t)} + η(A − W_{(t)} A) A^T                (80)

can be expressed in terms of the eigen decomposition of W (or the SVD of A) as

W_{(t+1)} = U Φ_{(t+1)} U^T   with   Φ_{(t+1)} = [I − (I − ηΛ)^{t+1}]                (81)

When η is smaller than 2 λ_max^{-1},

lim_{t→∞} (I − ηΛ)^{t+1} = 0                (82)

Therefore, at convergence, W reduces to

W_∞ = U U^T                (83)

which is the value used in this section. The matrix U is an L × N matrix, with N being the number of nonzero eigenvalues. Typically, N is significantly smaller than L (i.e., N << L). As a consequence, using U directly instead of W leads to an important gain in processing speed and storage. For example, when dealing with a face recognition application, matrix W was a 33,975 × 33,975 matrix, whereas the eigenvector matrix U was only a 33,975 × 400 matrix. In terms of the eigenvector matrix U, the response of the model x̂ to a pattern x [Eq. (73)] is obtained as

x̂ = U U^T x                (84)
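The practical consequence is that the whole iterative training can be replaced by one economy-size SVD, as in the following sketch (random toy data stand in for the vectorized face images; the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
L, K = 4096, 80                       # e.g. 64 x 64 toy images, 80 faces
A = rng.random((L, K))                # columns = vectorized (toy) images

# Economy-size SVD: U has only K columns, so U (L x K) replaces the full
# L x L matrix W, as discussed in the text.
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)

def recall(x):
    """Response of the converged autoassociator, Eq. (84)."""
    return U @ (U.T @ x)

probe = A[:, 0] + rng.normal(0.0, 0.2, L)     # noisy version of a stored pattern
answer = recall(probe)
target = A[:, 0]
cosine = float(answer @ target / (np.linalg.norm(answer) * np.linalg.norm(target)))
print("cosine(answer, target) =", cosine)
```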
2. Preprocessing Using Multiscale Edges

The goal of learning is to find values for the connections between cells such that the response of the model approximates the input as well as possible. To assess the performance of the model, degraded versions of the previously
learned stimuli are presented to the model as a test. If learning has been successful, then the response pattern will be more similar to the original pattern than the degraded stimulus was (for an illustration, see Kohonen, 1977). In other words, autoassociators can act as pattern completion devices. Here, we explore different approaches for improving the performance of a linear autoassociator storing face images. The general strategy is to store, in addition to the original images, several filtered versions of the images (see Figure 34). We refer to this technique as preprocessing. Then the model is evaluated by its reconstruction performance when presented with probes that are versions of the original faces degraded by the addition of Gaussian random noise. Because we are interested in image patterns, we chose filtering techniques meaningful in this context. Because it is generally agreed that edges are essential for recognition (Jia and Nixon, 1995), we decided to increase their importance in the image. Quite a large number of algorithms have been proposed in the literature for performing edge extraction. We decided to implement three of them:

1. The Sobel operator (a differential operator), which is considered a standard procedure well suited for noiseless images. The Sobel operator was implemented with a convolution and a 3 × 3 mask.
2. The GCD operator, because it is known to be optimal for edge extraction in noisy images (Deriche, 1987; Bourennane et al., 1993).
3. The multiscale GCD edge detector (Bourennane et al., 1993), which is equivalent to finding local maxima of a wavelet transform as suggested in Mallat and Zhong (1992).

The GCD filter is a separable filter when applied to two-dimensional images. Its impulse response for a one-dimensional signal (because of its separability, the filter can be seen as two one-dimensional filters) is given by
c_s(x) = C_s (k_0 s x e^{m s x} + e^{m s x} − e^{s x}),      x ≤ 0
c_s(x) = C_s (k_0 s x e^{−m s x} − e^{−m s x} + e^{−s x}),      x ≥ 0                (85)

with k_0 = −0.564, m = 0.215, and where s = 2^j is the scale factor (with, in our case, j ∈ {0, 1, 2, 3}) and x is the pixel position. Figure 34 displays the impulse response of the generalized Canny-Deriche filter for different scales. This method is implemented as a wavelet transform using a convolution between the image and the edge detection filter for different scales (s = 2^j). As a result, this filter detects edges occurring at different scale resolutions in the image (Mallat and Zhong, 1992).
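The sketch below evaluates the impulse response of Eq. (85) on a discrete grid and convolves a one-dimensional signal with it at the four scales. The constants k_0 and m are those given above; the normalization is computed numerically rather than from the closed-form constant of the Appendix, and the step signal is only a toy example, so this illustrates the multiscale convolution rather than the exact filter implementation used in the study.

```python
import numpy as np

K0, M = -0.564, 0.215

def gcd_impulse_response(j, half_width=50):
    """Sampled c_s(x) of Eq. (85) for the scale factor s = 2**j."""
    s = 2.0 ** j
    x = np.arange(-half_width, half_width + 1, dtype=float)
    ax = np.abs(x)
    # Eq. (85) is antisymmetric about x = 0, so evaluate the x >= 0 branch
    # on |x| and restore the sign.
    pos = K0 * s * ax * np.exp(-M * s * ax) - np.exp(-M * s * ax) + np.exp(-s * ax)
    h = np.sign(x) * pos
    return h / np.sum(np.abs(h))          # numerical normalization (stand-in for C_s)

def multiscale_edges(signal, scales=(0, 1, 2, 3)):
    """Convolve one image row (or any 1-D signal) with the filter at each scale."""
    return [np.convolve(signal, gcd_impulse_response(j), mode="same") for j in scales]

# Toy 1-D "edge": a step with a little noise.
row = np.concatenate([np.zeros(100), np.ones(100)]) + np.random.normal(0, 0.05, 200)
for j, r in zip((0, 1, 2, 3), multiscale_edges(row)):
    print("scale 2^%d : extremum magnitude = %.3f at index %d"
          % (j, np.max(np.abs(r)), int(np.argmax(np.abs(r)))))
```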
FIGURE 34. Impulse response of the generalized Canny-Deriche filter for different scales (s = 1, 2, 4, and 8).
In order to compare these different techniques, we implemented the three corresponding models along with two controls: a standard autoassociator and, for denoising, a Wiener filter (a standard algorithm applied one image at a time). All the models were tested using the same procedure. The patterns stored were a set of 80 face images of size 225 × 151 (with 16 gray levels per pixel). The performance of each model was assessed as follows. Each face was degraded by adding to each pixel a random number [chosen such that the noise values belong to the interval (0 to 45)]. The degraded face was then recalled by the model (or filtered in the case of the Wiener filter model). The correlation between the original face and the model-reconstructed face reflects the performance of the model for this face: the higher the correlation, the better the performance. Specifically, the models were a Wiener filter applied directly to the noise-degraded stimulus and four autoassociators (see Figure 35):

1. A standard autoassociator storing the original 80 face images.
2. An autoassociator storing the original 80 face images plus, for each face, a Sobel-filtered image of the face (hence a total of 160 face images).
3. An autoassociator storing the original 80 face images plus, for each face, a GCD-filtered image of the face (hence a total of 160 face images).
4. An autoassociator storing the original 80 face images plus, for each face, four wavelet-transformed (by a multiscale GCD filter) face images (one face image per scale resolution, hence a total of 400 images).

For the last three models, the complete set of patterns to be learned (matrix A) is composed of the original and filtered images. The eigenvector matrix U and the synaptic connection matrix W are obtained using Eqs. (79) and (83).

FIGURE 35. Patterns to be learned by the four autoassociators (standard: 80 × 1 images; Sobel: 80 × 2 images; Deriche: 80 × 2 images; wavelet: 80 × 5 images). Filtered images have been obtained, respectively, with the Sobel operator, the optimized Canny-Deriche operator, and the wavelet transform.

Figure 36 displays an example of the responses of the models to a test face. The top panels present a face learned previously by the system and a stimulus with additive random noise used as a probe to evaluate the performance of the models. The bottom panels show the estimation of the original face by a standard autoassociator, a Wiener filter, an autoassociator plus Sobel preprocessing, an autoassociator plus a GCD filter, and an autoassociator plus a wavelet transform.
FIGURE 36. Response of the models. (Top) A target face (learned previously by the autoassociator) and the stimulus used to test the model (the stimulus is a noisy version of the target). (Bottom) Responses of the different models. The wavelet model performed best.
FIGURE 37. Mean correlation between the model responses and their targets (the higher the correlation, the better the performance). The wavelet model performed best.

The quality of recognition (estimation) can be measured by computing the cosine between the vector x̂ (i.e., the response of the model) and x (i.e., the original stimulus, which is also the desired response or target). Figure 37 gives the
average correlation between response and target for the five models used. Clearly, the standard method is the worst of all. Preprocessing the images improves the performance of the autoassociator, and the wavelet transform gives the best result. In conclusion, the multiscale resolution (i.e., wavelet preprocessing) approach leads to the best performance for the autoassociator. Therefore, we decided, in what follows, to consider only this approach.
3. Pattern Completion of Noisy Patterns

We have applied the multiscale edge preprocessing to store a set of 80 Caucasian faces (40 males and 40 females). In order to evaluate the effect due to preprocessing, we tested the models with different levels of Gaussian random noise added to the test stimulus. Learning was implemented as described previously. For simplicity, we decided to keep only two models: the standard autoassociator and the wavelet-enhanced autoassociator. Testing was implemented as described previously except that faces were tested under four different levels of noise. The noise intensity was chosen such that its range was, respectively, [0..15], [0..30], [0..45], and [0..60]. Figure 38 displays an example of the noisy test stimuli used along with the response of each model (standard and wavelet).
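The structure of this test protocol can be summarized by the following sketch. The arrays are random stand-ins for the face images (so the printed numbers are not meaningful); only the loop over the four noise magnitudes, the pattern-completion step of Eq. (84), and the cosine scoring mirror the procedure described here.

```python
import numpy as np

rng = np.random.default_rng(2)
L, K = 2000, 80
learned = rng.random((L, K))                 # stand-in for the 80 stored faces
new_faces = rng.random((L, 20))              # stand-in for unlearned faces

U, _, _ = np.linalg.svd(learned, full_matrices=False)

def mean_cosine(test_set, noise_max):
    scores = []
    for k in range(test_set.shape[1]):
        target = test_set[:, k]
        probe = target + rng.uniform(0.0, noise_max, L)   # noise drawn in [0, noise_max]
        answer = U @ (U.T @ probe)                        # pattern completion, Eq. (84)
        scores.append(answer @ target /
                      (np.linalg.norm(answer) * np.linalg.norm(target)))
    return float(np.mean(scores))

for noise_max in (15, 30, 45, 60):                        # the four noise levels
    print("noise [0..%d]  learned: %.3f   new: %.3f"
          % (noise_max, mean_cosine(learned, noise_max),
             mean_cosine(new_faces, noise_max)))
```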
FIGURE 38. (Top) Four stimuli obtained by adding to a target stimulus a noise component of magnitude equal to one, two, three, and four times the magnitude of the signal of the original stimulus. (Middle) Responses produced by the standard autoassociator. (Bottom) The response of the autoassociator when learning is "enhanced" with wavelet-filtered stimuli.
FIGURE 39. Stimuli and responses. (Top) Performance for a new Caucasian face and (bottom) for a new Japanese face. (Left to right) A noise-degraded stimulus (the magnitude of the noise is equal to twice the magnitude of the original signal); the response of the standard autoassociator; and the response of the wavelet-enhanced autoassociator.

We also decided to explore the performance of the model with three different types of face stimuli: (1) previously learned faces, (2) new faces similar to the learned faces, and (3) new faces coming from another race than the learned faces. This was done in order to evaluate the robustness of the models in terms of response generalization to new stimuli. Figure 39 displays, as an example, the responses of both models for two new faces (from top to bottom): (1) a new face similar to the set of learned faces (Caucasian face) and (2) a new face different from the set of learned faces (Japanese face). The autoassociator trained with the standard learning rule is not capable of producing distinguishable responses. As can be seen in Figure 39, better results are obtained with wavelet preprocessing. Figure 40 displays the mean correlation between the noiseless face images and the output of each model (1) for the 80 previously learned Caucasian faces, (2) for 80 new Caucasian faces, and (3) for 80 new Japanese faces. In all cases, preprocessing the images improves the performance of the autoassociator, with the improvement being more important when the added noise is larger. This section explored the effects of storing, in a linear autoassociator, filtered versions of face images in addition to the original images. Compared to the Sobel operator and the simple GCD operator, the multiscale GCD operator (i.e., a wavelet filter) gives the best performance for a pattern completion task involving degraded face images. The multiscale generalized Canny-Deriche operator produces better generalization performance than the control with or without noise added to the image.
FIGURE 40. Mean correlation between model response as a function of the magnitude of the noise for Caucasian faces learned previously, new Caucasian faces, and new Japanese faces. Filled lines correspond to the wavelet-enhanced model, and dotted lines correspond to the standard autoassociator. The wavelet-enhanced autoassociator always performed best.
The larger the amount of noise added, the larger the improvement in performance. This research confirms the better performance obtained using wavelet transform representations for face tracking and face recognition (Rangannath and Arun, 1997; Krüger et al., 1999; Krüger and Sommer, 2000; Huntsberger et al., 1998).
APPENDIX
Wavelet Filter

The wavelet filter ψ_{2^j} is defined as follows:

ψ_{2^j}(Z) = Σ_{n=0}^{+∞} ψ_{2^j}(−n) Z^n + Σ_{n=0}^{+∞} ψ_{2^j}(n+1) Z^{−(n+1)}

with

Σ_{n=0}^{+∞} ψ_{2^j}(−n) Z^n = (a_1 Z + a_2 Z^2) / [(1 − e^{−m2^j} Z)^2 (1 − e^{−2^j} Z)]

Σ_{n=0}^{+∞} ψ_{2^j}(n+1) Z^{−(n+1)} = −(a_1 Z^{−1} + a_2 Z^{−2}) / [(1 − e^{−m2^j} Z^{−1})^2 (1 − e^{−2^j} Z^{−1})]

where

a_1 = C_{2^j} [(1 − k_0 2^j) e^{−m2^j} − e^{−2^j}]
a_2 = C_{2^j} [(1 + k_0 2^j) e^{−m2^j − 2^j} − e^{−m2^{j+1}}]

and C_{2^j} is a normalization constant given by

C_{2^j} = (1 − e^{−m2^j})^2 (1 − e^{−2^j}) / [(1 − k_0 2^j) e^{−m2^j} + (1 + k_0 2^j) e^{−m2^j − 2^j} − e^{−m2^{j+1}} − e^{−2^j}]

The impulse response y(n) corresponding to the filter ψ is given by

y(n) = y^+(n) + y^−(n)
y^+(n) = −a_1 x(n−1) − a_2 x(n−2) + b_1 y^+(n−1) − b_2 y^+(n−2) + b_3 y^+(n−3)
y^−(n) = a_1 x(n+1) + a_2 x(n+2) + b_1 y^−(n+1) − b_2 y^−(n+2) + b_3 y^−(n+3)

with b_1 = 2 e^{−m2^j} + e^{−2^j}, b_2 = e^{−m2^{j+1}} + 2 e^{−m2^j − 2^j}, and b_3 = e^{−m2^{j+1} − 2^j}.

Smoothing Filter

The smoothing filter φ_{2^j} is defined as follows:

φ_{2^j}(Z) = Σ_{n=0}^{+∞} φ_{2^j}(−n) Z^n + Σ_{n=0}^{+∞} φ_{2^j}(n+1) Z^{−(n+1)}

with

Σ_{n=0}^{+∞} φ_{2^j}(−n) Z^n = (c_0 + c_1 Z + c_2 Z^2) / [(1 − e^{−m2^j} Z)^2 (1 − e^{−2^j} Z)]

Σ_{n=0}^{+∞} φ_{2^j}(n+1) Z^{−(n+1)} = (c_3 Z^{−1} + c_4 Z^{−2} + c_5 Z^{−3}) / [(1 − e^{−m2^j} Z^{−1})^2 (1 − e^{−2^j} Z^{−1})]

where

c_0 = A_{2^j} (m2^j − k_0 2^j − m^2 2^j)
c_1 = A_{2^j} [(−k_0 m^2 2^j + k_0 2^j − m2^j + m^2 2^{j+1}) e^{−m2^j} + (k_0 2^j − m2^j) e^{−2^j}]
c_2 = A_{2^j} [(k_0 m^2 2^j − k_0 2^j + m2^j) e^{−m2^j − 2^j} − m^2 2^j e^{−m2^{j+1}}]
c_3 = A_{2^j} [(−k_0 m^2 2^j − k_0 2^j + m2^j) e^{−m2^j} − m^2 2^j e^{−2^j}]
c_4 = A_{2^j} [(−k_0 m^2 2^j + k_0 2^j − m2^j + m^2 2^{j+1}) e^{−m2^j − 2^j} + (k_0 2^j − m2^j) e^{−m2^{j+1}}]
c_5 = A_{2^j} (−k_0 2^j + m2^j − m^2 2^j) e^{−m2^{j+1} − 2^j}

A_{2^j} is a normalization constant defined as

A_{2^j} = (1 − e^{−m2^j})^2 (1 − e^{−2^j}) / (d_0 + d_1 + d_2 + d_3 + d_4 + d_5)

with

d_0 = m2^j − k_0 2^j − m^2 2^j
d_1 = (−2 k_0 m^2 2^j + m^2 2^{j+1}) e^{−m2^j}
d_2 = (k_0 2^j − m2^j − m^2 2^j) e^{−2^j}
d_3 = −(2 k_0 m^2 2^j + m^2 2^{j+1}) e^{−m2^j − 2^j}
d_4 = (k_0 2^j − m2^j − m^2 2^j) e^{−m2^{j+1}}
d_5 = (−k_0 2^j + m2^j − m^2 2^j) e^{−m2^{j+1} − 2^j}

The impulse response y(n) corresponding to the filter φ is given by

y(n) = y^+(n) + y^−(n)
y^+(n) = c_3 x(n−1) + c_4 x(n−2) + c_5 x(n−3) + b_1 y^+(n−1) − b_2 y^+(n−2) + b_3 y^+(n−3)
y^−(n) = c_0 x(n) + c_1 x(n+1) + c_2 x(n+2) + b_1 y^−(n+1) − b_2 y^−(n+2) + b_3 y^−(n+3)
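For illustration, the sketch below turns the wavelet-filter recursion above into code, with the coefficient formulas transcribed directly and k_0 = −0.564, m = 0.215 as in the text; boundary samples outside the signal are simply treated as zero, which is an assumption of this sketch rather than part of the original implementation. The forward-plus-backward third-order recursion is what makes the filtering time independent of the scale, as noted in Section V.B.

```python
import numpy as np

K0, M = -0.564, 0.215

def gcd_wavelet_filter(x, j):
    """Recursive wavelet (edge-detection) filtering at scale s = 2**j."""
    s = 2.0 ** j
    ems, es, e2ms = np.exp(-M * s), np.exp(-s), np.exp(-2 * M * s)
    emss = np.exp(-M * s - s)                    # e^{-m 2^j - 2^j}
    c = ((1 - ems) ** 2 * (1 - es) /
         ((1 - K0 * s) * ems + (1 + K0 * s) * emss - e2ms - es))
    a1 = c * ((1 - K0 * s) * ems - es)
    a2 = c * ((1 + K0 * s) * emss - e2ms)
    b1, b2, b3 = 2 * ems + es, e2ms + 2 * emss, np.exp(-2 * M * s - s)

    n = x.size
    yp = np.zeros(n)                             # causal part y+
    ym = np.zeros(n)                             # anticausal part y-
    for i in range(n):                           # forward recursion
        yp[i] = (-a1 * (x[i - 1] if i >= 1 else 0.0)
                 - a2 * (x[i - 2] if i >= 2 else 0.0)
                 + b1 * (yp[i - 1] if i >= 1 else 0.0)
                 - b2 * (yp[i - 2] if i >= 2 else 0.0)
                 + b3 * (yp[i - 3] if i >= 3 else 0.0))
    for i in range(n - 1, -1, -1):               # backward recursion
        ym[i] = (a1 * (x[i + 1] if i + 1 < n else 0.0)
                 + a2 * (x[i + 2] if i + 2 < n else 0.0)
                 + b1 * (ym[i + 1] if i + 1 < n else 0.0)
                 - b2 * (ym[i + 2] if i + 2 < n else 0.0)
                 + b3 * (ym[i + 3] if i + 3 < n else 0.0))
    return yp + ym

# Quick check on a noisy step edge: the response should peak near the edge.
signal = np.concatenate([np.zeros(128), np.ones(128)]) + np.random.normal(0, 0.05, 256)
response = gcd_wavelet_filter(signal, j=1)
print("largest response at index", int(np.argmax(np.abs(response))))
```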
REFERENCES Abdi, H. (1994). Les R6seaux de neurones. Presses Universitaires de Grenoble, Grenoble. Abdi, H., Valentin, D., and Edelman, B. (1998). "Neural Networks," Sage, Thousand Oaks, CA. Abdi, H., Valentin, D., and Edelman, B. (1999). "Neural Networks." Sage, Thousand Oaks, CA. Amico, M. D., and Ferrigno, G. (1992). Comparison between the more recent techniques for smoothing and derivative assessment in biomechanics. Med. Biol. Engin. Comput. 30, 193-204. Anderson, J. A., Silverstein, J. W., et al. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychol. Rev. 84, 413-451. Arslan, L. M., and Hansen, J. L. (1999). Selective training for hidden markov models with application to speech classification. IEEE Trans. Speech Audio Process 7(1), 46-54. Ayrulu, B., and Barshan, B. (2001). Neural networks for improved target differentiation and localization with sonar. Neural Networks 14, 355-373. Bachman, G., Narici, L., and Beckentein, E. (2000). "Fourier and Wavelet Analysis." SpringerVerlag, New York. Bartlett, M. S. (1998). "Face Image Analysis by Unsupervised Learning and Redundancy Reduction." Ph.D. thesis of Cognitive science and psychology, University of California. Benedetto, J. J., and Frazier, M. (1994). "Wavelets: Mathematics and Applications." CRC Press, Boca Raton, FL. Bishop, C. M. (2000). "Neural Networks for Pattern Recognition," Oxford Univ. Press, London.
202
YANG AND PAINDAVOINE
Bontempi, B., and Marcelli, A. (1996). On-line segmentation of cursive script using an arclength representation. In "Handwriting and Drawing Research: Basic and Applied Issues" (M. L. Simmer, C. G. Leedham, and A. J. W. Thomassen, eds.), pp. 315-327. IOS Press, Amsterdam. Bourennane, E., Paindavoine, M., and Truchetet, F. (1993). Amrlioration du filtre de CannyDeriche pour la drtection des contours sous forme de rampe. Traitement du signal, Recherche, Vol. 10, pp. 297-310. Broomhead, D. S., and Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355. Chen, H., and Hewit, J. (2000). Application of wavelet transforms and neural networks to the recognition and classification of blemishes. Mechatronics 10, 699-711. Chen, C. Y., Yang, Z. L., et al. (2001). A programmable parallel VLSI architecture for 2-D discrete wavelet transform. J. VLSI Signal Process. 28, 151-163. Chen, C.H., Lee, J.D., and Lin, M.C. (1998). Classification of underwater signals using wavelet transform and neural networks, Math. Comput. Model 27(2), 47-60. Chui, C. (1992). "An Introduction to Wavelets." Academic Press, San Diego. Coifman, R., Meyer, Y., and Wickerhauser, V. (1992). Wavelet analysis and signal processing. In "Wavelet and Their Applications." Jones and Barlett, Boston. Coifman, R., and Wickerhauser, V. (1992). Entropy-based algorithms for best basis selection. IEEE Trans. Inform. Theory 38(2), 713-718. Das, M., Paulik, M. J., and Loh, N. K. (1990). A bivariate autoregressive modelling technique for analysis and classification of planar shapes. 1EEE Trans. Pattern Anal Mach. Intell. 12(1), 97-103. Daubechies, I. (1989). Orthonormal bases of compactly supported wavelets. Comm. Pure Appl. Math. 41(7), 909-996. Daubechies, I. (1994). Ten lectures on wavelets. CBMS, SIAM Vol. 61. Deriche, R. (1987). Using canny's criteria to derive a recursively implemented optimal edge detector, lnt. J. Comput. Vis. 1, 167-187. Desbiez, D., Vinter, A., and Meulenbroek, R. G. J. (1996). Biomechanical and perceptual determinants of drawing angles. Acta Psychol. 94, 253-271. Drolon, H., Druaux, F., and Faure, A. (2000). Particles shape analysis and classification using the wavelet transform. Pattern Recogn. Lett. 21, 473-482. Eide, A., and Lindblad, T. (1992). Artificial neural networks as measuring devices. Nuclear Instruments Methods physics Res. A317, 607-608. Englehart, K., Hudgins, B., et al. (1999). Classification of the myoelectric signal using timefrequency based representation. Med. Engineer. Phys. 21, 431-438. Farge, M. (1992). Wavelet transform and their applications to turbulance. Annu. Rev. Fluid Mech. 24, 395-457. Fausett, L. (1994). "Fundamentals of Neural Networks." Prentice-Hall, New York. Fukunage, K. (1990). "Introduction to Statistical Pattern Recognition," ed 2., Academic Press, San Diego. Gabor, D. (1946). Theory of communication. J. IEE 93, 429-457. Grosman, D., and Morlet, J. (1984) Decomposition of hardy functions into square integrable wavelets of constant shape. S l A M J. Math. 15, 723-736. Gurney, K. (1997). "An Introduction to Neural Networks." UCL Press. Haykin, S. (1999). "Neural Networks," ed 2., Prentice-Hall, New York. Hazarika, N., Zhu, J., et al. (1997). Classification of EEG signals using the wavelet transform. Signal Process. 59, 61-72. Hebb, D. O. (1949). "The Organization of Behaviour." Wiley, New York.
PREFILTERING FOR PATTERN RECOGNITION
203
Hogan, N., and Flash, T. (1987). Moving gracefully: Quantitative theories of motor coordination. TINS 4, 170-174. Hudgins, B., Parker, P. A., and Scott, R. N. (1993). A new strategy for multifunctions myoelectric control. IEEE Trans. Biomed. Engineer. 40(1), 82-94. Huntsberger, T. L., Rose, J., and Ramaka, A. (1998). Fuzzy-Face: A hybrid wavelet/fuzzy selforganizing feature map system for face processing. J. Biol. Syst. 6(3), 281-298. Jia, X., and Nixon, S. (1995). Extending the feature vector for automatic face recognition. IEEE-Trans. Patterns Anal Mach. Intelligence 17, No. 12, 1167-1176. Jobert, M., Timer, C., et al. (1994). Wavelet: A new tool in sleep biosignal analysis. J. Sleep Res. 3, 223-232. Kermit, M., Bodal, K. A., et al. (1999). Steady state vowel recognition of speech sound using O-algorithm. In "Artificial Intelligence and Soft Computing" (M. H. Hamza, ed.), pp. 362-367. Kermit, M., and Eide, A. J. (2000). Audio signal identification via pattern capture and template matching. Pattern Recogn. Lett. 21, 269-275. Koc, C. K., Chen, G., and Chui, C. K. (1994). Complexity analysis of wavelet signal decomposition and reconstruction. IEEE Trans. Aerospace Electr. syst. 30(3), 910-918. Kohonen, T. (1977). "Associative Memory: A System Theoretic Approach." Springer-Verlag, Berlin. Kohonen, T. (1984). "Self Organization and Associative Memory." Springer-Verlag, Berlin. Kriiger, V., Herpers, R., et al. (1999). Teleconferencing using an attentive camera system. In "Proc. Int. Conf. on Audio- and Video-Based Biometric Person Authentification," pp. 142-147. Barcelona, Spain. Kriiger, V., and Sommer, G. (2000). Affine real-time face tracking using gabor wavelet networks. In "Proc. Int. Conf. on Pattern Recognition." Barcelona, Spain. Lee, S. W., Kim, C. H, et al. (1996). Multiresolution recognition of unconstrained handwritten numerals with wavelet transform and multilayer cluster neural network. Pattern Recognition 29(12), 1953-1961. Lee, J. S., Sun, Y. N., and Chen, C. H. (1995). Multiscale comer detection by using wavelet transform. 1EEE Transacti. lmage Process 4, 100-104. Lee, J. S., Sun, Y. N., et al. (1993). Wavelet based comer detection. Pattern Recogn. 26(6), 853-865. Lemarie, P. G., and Meyer, Y. (1986). Ondelettes et bases Hibertiennes. Revista Mathematica Ibero Americana, Vol. 2, 1-18. Lin, W. H., Lee, J. S., et al. (1998). A new multiscale-based shape recognition method. Signal Process 65, 103-113. Lindbald, T., Lindsey, C. S., and Eide, A. (1997). Radial basis function neural networks. In "The Industrial Electronics Handbook" (J. D. Irvin, ed.), pp. 1014-1018. CRC Press, Boca Raton, FL. Lingvall, F., and Stepinski, T. (1999). Neural network classifier for eddy current signals. http://www.signal.uu.se/Research/CANDIA. Lingvall, F., and Stepinski, T. (2000). Automatic detecting and classifying defects during eddy current inspection of riveted lap-joints. N D T & E Int. 33, 47-55. Liu, Z. Q., Lu, M. D., and Wei, M. (1997). Structure noise reduction of ultrasonic signals using artificial neural network adaptive filtering. Ultrasonics 35, 325-328. Lu, J. (1993). Parallelizing Mallat algorithm for 2-D wavelet transforms. Inform. Process. Lett. 45(5), 255-259. Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transact. Pattern Anal. Machine Intell. 11(7), 674-693.
204
YANG AND PAINDAVOINE
Mallat, S. (1991). Zero-crossing of a wavelet transform. I E E E Transact. Inform. Theory 37, 1019-1033. Mallet, Y., Coomans, D., et al. (1997). Classification using adaptive wavelets for feature extraction. I E E E Trans. Pattern Analy. Mach. Intell. 19(10), 1058-1066. Mallat, S., and Zhong, Z. (1992). Characterization of signal from multiscale edges. I E E E - P A M I 14, 710-732. Marquardt, C., and Mai, N. (1994). A computational procedure for movement analysis in hand-writing. J. Neurosci. Methods 52, 39-45. Meyer, Y. (1993). Wavelet: Algorithms and applications. SIAM. Minsky, M., and Papert, S. (1969). Perceptron: An Introduction to Computational Geometry. MIT Press, Cambridge, MA. Misra, M., and Nichols, T. (1994). Computation of the 2-D wavelet transforms on the connection machine-2. In "Proc. the IFIP WG10.3 Working Conf. on Applications in Parallel and Distributed Computing," pp. 3-12. Caracas. Moody, J., and Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Comput. 1(2), 281-294. Musavi, M. T., Ahmed, W., et al. (1992). On the training of radial basis function classifiers. Neural Networks 5, 595-603. Nunez, P. L. (1981). "Electrical Fields of the Brain." Oxford Univ. Press, New York. Oropesa, E., Cycon, H. L., and Jobert, M. (1999). "Sleep Stage Classification Using Wavelet Transform and Neural Network." International Computer Science Institute, TR-99-008. Parker, D. B. (1985). "Leaming-Logic." Tech. Rep. TR-47, Massachusetts Institute of Technology, Center for Computational Research in Economics and Management Science, Cambridge, MA. Polikar, R., Greer, M. H., et al. (1997). Multiresolution wavelet analysis of ERPs for the detection of Alzheimer's disease. In "Proc. 19th Int. Conf. IEEE/EMBS," Chicago, IL. Powell, M. J. D. (1987). Radial basis functions for multivariable interpolation: A review. In "Algorithms for Approximation." (J. C. Mason and M. G. Gox, eds.), pp. 143-167. Clarendon Press, Oxford. Rangannath, S., and Arun, K. (1997). Face recognition using transform features and neural networks. Pattern Recogn. 30(10), 1615-1622. Rechtschaffen, A., and Kales, A. (1968). "A Manual of Standardized Terminology, Technique and Scoring System for Sleep Stages of Human Subjects." Public Health service, U.S. Government Printing Office. Reid, M. B., Spirlovsla, L., and Ochoa, E. (1989). Simultaneous position, scale and rotation invariant pattern classification using third-order neural networks. Int. J. Neural Networks 154-159. Renals, S., and Rohwer, R. (1989). Phoneme classification experiments using radial basis functions. Int. Joint Conf. Neural Network 1, 461-467. Ripley, B. D. (1996). "Pattern Recognition and Neural Networks." Cambridge Univ. Press, Cambridge, UK. Robert, C., Guilpin, C., and Limoge, A. (1998). Review of neural network application in sleep research. J. Neurosci. Methods 79, 187-193. Rojas, P. (1996). "Neural Networks: A Systematic Introduction." Springer-Verlag, New York. Rosenblatt, F. (1959). "Principles of Neurodynamics." Spartan Books, New York. Rumlhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representation by backpropagation errors. Nature 323, 533-536. Sasi, S., Schwiebert, L., and Bedi, J. S. (1997) Wavelet packet transform and fuzzy logic approach to handwritten character recognition. In "IEEE Workshop on Nonlinear Signal and Image Processing, NSIP-97." Mackinac Island.
PREFILTERING FOR PATTERN RECOGNITION
205
Schomaker, L. R. B., and Thomassen, A. J. W. (1986). On the use and limitations of averaging handwriting signals. In "Graphonomics: Contemporary Research in Handwriting," (H. S. R. Kao, G. P. Van Galen, and R. Hoosain, eds.), pp. 225-238. Elsevier, Amsterdam. Sekita, I., Kurita, T., and Otsu, N. (1992). Complex autoregressive model for shape recognition. I E E E Trans. Pattern Anal. Mach. Intell. 14(4), 489-496. Simoncelli, E. P., Freeman, W. T., et al. (1992). Shiftable multiscale transforms. I E E E Transact. Inform. Theory 38, 587-607. Strang, G., and Nguyen, T. (1997). "Wavelet and Filter banks." Wellesley-Cambridge Press. Szu, H., Le Moigne, J., et al. (1997). Wavelet index of texture for artificial neural network classification of Landsat images. In "Proc. SPIE 25th AIPR Workshop: Emerging Applications of Computer Vision," Vol. 2962, 36-44. Szu, H., Telfer, B., and Garcia, J. (1996). Wavelet transforms and neural networks for compression and recognition. Neural Networks 9(4), 695-708. Teulings, H.-L. H. M., and Maarse, F. J. (1984). Digital recording and processing of handwriting movements. Hum. Movement Sci. 3, 193-217. Teulings, H.-L. H. M., and Stelmach, G. E. (1991). Control of stroke size, peak acceleration, and stroke duration in parkinsonian handwriting. Hum. Movement Sci. 10, 315-333. Teulings, H.-L. H. M., Thomassen, A. J. W., and Van Galen, G. P. (1986). Invariants in handwriting: The information contained in a motor program. In "Graphonomics: Contemporary Research in Handwriting," (H. S. R. Kao, G. P. Van Galen, and R. Hoosain, eds.), pp. 305-315. Elsevier, Amsterdam. Tsang, K. M., and Billings, S. A. (1992). Orthogonal estimation algorithm for complex number system. Int. J. Syst. Sci. 23(6), 1011-1018. Tungkanawanich, A., Kawasaki, Z. I., et al. (2000). Ground fault discrimination based on wavelet transform using artificial neural networks. T.1EE Japon 120-B(10), 1263-1270. Tzanetakis, G., Essl, G., and Cook, P. (2001). Audio analysis using the discrete wavelet transform. In "Proc. WSES Int. Conf. Acoustics and Music: Theory and Applications," Skiathos, Greece. Unser, M. (1995). Texture classification and segmentation using Wavelet frames. I E E E Trans. Image Process 4(11), 269-281. Urick, R.J. (1983) "Principles of Underwater Sound." McGraw-Hill, New York. Valentin, D., Abdi, H., O'Toole, A. J., and Cottrell, G. W. (1994). Connectionist models of face processing: A survey. Pattern Recogn. 27, 1209-1230. Van Galen, G. P. (1991). Handwriting: Issues for a psychomotor theory. Hum. Movement Sci. 10, 165-191. Van Galen, G. P., Van Doom, R. R., and Schomaker, L. R. (1990). Effects of motor programming on the power spectral density function of finger and wirst movements. Hum. Percept. Perform 16, 755-765. Vinter, A., Desbiez, D., et al. (1996). The drawing of angular figures in Parkinson's disease patients: Preliminary report. In "Handwriting and Drawing Research: Basic and Applied Issues" (M. L. Simmer, C. G. Leedham, and A. J. W. Thomassen, eds.), pp. 251-264. IOS Press, Amsterdam. Viviani, P., and Terzuolo, C. A. (1980). Space-time invariance in learned motor skills. In "Tutorials in Motor Behavior," (G. E. Stelmach and J. Requin, eds.). North-Holland, Amsterdam. pp. 523-533. Ward, J. R., and Phillips, M. J. (1987). Digitizer technology: Performance, characteristics and the effects on the user surface. I E E E Comput. Graph. Appl. 4, 31-44. Wermter, S., and Sun, R. (2000). "Hybrid Neural Systems." Springer, Heidelberg.
206
YANG AND PAINDAVOINE
Wesolkowski, S. (1996). Cursive script recognition: A survey. In "Handwriting and Drawing Research: Basic and Applied Issues," (M. L. Simmer, C. G. Leedham, and A. J. W. Thomassen, eds.), pp. 267-284. IOS Press, Amsterdam. Widrow, B., and Hoff, M. E. (1960). "Adaptive Switching Circuits." IRE WESCON Convention Record, pp. 96-104. Woltring, H. J. (1985). On optimal smoothing and derivative estimation from noisy displacement data in biomechanics. Hum. Movement Sci. 4, 229-245. Wu, J. S., Amaratunga, K., and Lui, T. M. (2000). Design of an online GIS viewer by wavelet technology. In "Proc. the 8th Int. Conf. On Computing in Civil and Building Engineering," Stanford, CA. Yao, S. J., Song, Y. H., et al. (2000). Wavelet transform and neural networks for short-term electrical load forecasting. Energy Conver. Manage. 41, 1975-1988. Zhu, J., Hazarika, N., et al. (1994). Classification of EEG signals using Wavelet and an ANN. In "Workshop on Brain Electric and Magnetic Topography," Sydney, Australia.