Neural Networks, Vol. 8, No. 7/8, pp. 1131-1141, 1995. Copyright © 1995 Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0893-6080/95 $9.50 + .00
1995 SPECIAL ISSUE
A Nonlinear Extension of the MACE Filter

JOHN W. FISHER III AND JOSE C. PRINCIPE

University of Florida

(Received 15 November 1994; revised and accepted 3 May 1995)
Abstract--The minimum average correlation energy (MACE) filter, which is linear and shift invariant, has been used extensively in the area of automatic target detection and recognition (ATD/R). We present a nonlinear extension of the MACE filter, based on a statistical formulation of the optimization criterion, of which the linear MACE filter is a special case. A method by which nonlinear topologies can be incorporated into the filter design is presented and adaptation issues are discussed. In particular, we outline a method that avoids training exhaustively over the image plane and leads to much shorter adaptation times. Experimental results, using target chips from 35 GHz TABILS 24 inverse synthetic aperture radar (ISAR) data, are presented and performance comparisons are made between the MACE filter and this nonlinear extension.
Keywords--Correlation filters, MACE filter, ISAR, Automatic target recognition.

Acknowledgement: This work was partially supported by ARPA grant N60921-93-C-A335. Requests for reprints should be sent to John W. Fisher, 405 CSE, BLDG 42, University of Florida, Gainesville, FL 32611, USA; E-mail: [email protected].

¹ We refer here to the signal processing definition of shift invariance, that is, an operator is said to be shift invariant if a shift in the input results in a corresponding shift in the output.

1. INTRODUCTION

In the area of automatic target detection and recognition (ATD/R), it is not only desirable to recognize various targets, but to locate them with some degree of resolution. The minimum average correlation energy (MACE) filter (Mahalanobis et al., 1987) is of interest to the ATD/R problem due to its localization and discrimination properties. Correlation filters, of which the MACE is an example, have been widely used in optical pattern recognition in recent years. Our current interest is in the application of these types of filters to high-resolution synthetic aperture radar (SAR) imagery. Some recent articles have appeared showing experimental results using correlation filters on SAR data. Mahalanobis et al. (1994a) used MACE filters in combination with distance classifier methods to discriminate five classes of vehicles and reject natural clutter as well as a confusion vehicle class with fairly good results. Novak et al. (1994) present results comparing the performance of several classifiers, including the MACE filter, on SAR data.

The MACE filter is a member of a family of correlation filters derived from the synthetic discriminant function (SDF) (Hester & Casasent, 1980). Other generalizations of the SDF include the minimum variance synthetic discriminant function (MVSDF) (Kumar et al., 1988), the MACE filter, and more recently the gaussian minimum average correlation energy (G-MACE) (Casasent et al., 1991) and the minimum noise and correlation energy (MINACE) (Ravichandran & Casasent, 1992) filters. All of these filters are linear and shift-invariant¹ and can be formulated as a quadratic optimization subject to a set of linear constraints in either the sample or spectral domain. The solution to these problems is obtained using the method of Lagrange multipliers. Kumar (1992) gives an excellent review of these filters. The bulk of the research using these types of filters has concentrated on optical and infrared (IR) imagery and overcoming recognition problems in the presence of distortions associated with 3-D to 2-D mappings, i.e., scale and rotation. Usually, several exemplars from the recognition class are used to represent the range of distortions over which the filter is to be used. Although the distortions in SAR imagery do not occur in the same way, that is, a change in target aspect does not manifest exactly as a rotation in the SAR image, exemplars may still be
sufficient to model a single target class over a range of target aspects and relative depression angles.

Our focus is on the MACE filter and its variants because they are designed to produce a narrow, constrained-amplitude peak response when the filter mask is centered on a target in the recognition class while minimizing the energy in the rest of the output plane. The filter can be modified to produce a low-variance output for a designated rejection class as well. Another property of the MACE filter is that the constrained peak output is guaranteed, over the training exemplars, to be the maximum in the output image plane (Mahalanobis et al., 1987). Since the MACE filter is linear, it can only be used to realize linear discriminant functions. Along with its desirable properties, it has been shown to be limited in its ability to generalize to between-aspect exemplars that are in the recognition class (but not in the training set) while simultaneously rejecting out-of-class inputs (Casasent et al., 1991; Casasent & Ravichandran, 1992; Ravichandran & Casasent, 1992). The number of design exemplars can be increased in order to overcome generalization problems; however, the computation of the filter coefficients becomes prohibitive and numerically unstable as the number of design exemplars is increased (Kumar, 1992). The MINACE and G-MACE variations have improved generalization properties with a slight degradation in the average output plane variance and sharpness of the central peak, respectively.

In the sample domain, the SDF family of correlation filters is equivalent to a cascade of a linear pre-processor followed by a linear correlator (Mahalanobis et al., 1987; Kumar, 1992). Fisher and Principe (1994) showed that this is equivalent to a pre-processor followed by a linear associative memory (LAM), illustrated in Figure 1 with vector operations. The pre-processor, in the case of the MACE filter, is a pre-whitening filter computed on the basis of the average power spectrum of the recognition class training exemplars. Mahalanobis et al. (1987) use the term synthetic discriminant function (SDF) to refer to the LAM portion of the filter decomposition.
We use the associative memory viewpoint for investigating extensions to the MACE filter. It is well known that nonlinear associative memory structures can outperform their linear counterparts on the basis of generalization and dynamic range (Hinton & Anderson, 1981; Kohonen, 1988). In general, they are more difficult to design as their parameters cannot be computed in closed form. The parameters for a large class of nonlinear associative memories can, however, be determined by gradient search techniques. In this paper we discuss a nonlinear extension of the MACE filter that shows promise in overcoming some of the problems described. In our development we show that the performance of the linear MACE filter can be improved upon in terms of generalization while maintaining its desirable properties, i.e., a sharp, constrained peak at the center of the output plane.

In this paper we present experimental results using a simple nonlinear modification of the MACE filter. We replace the LAM portion of the filter with a nonlinear associative memory structure, specifically a feed-forward multi-layer perceptron (MLP), which retains the shift invariance properties but yields improved performance via a nonlinear discriminant function. In Section 2 we review the MACE filter formulation and its relationship to associative memories. In Section 3 we develop a generalized statistical filter structure of which the linear MACE filter is a special case. Section 4 details experimental results using TABILS 24 inverse synthetic aperture radar (ISAR) imagery. We compare performance of the linear MACE filter to a nonlinear extension. We draw our conclusions and observations in Section 5.

2. MACE FILTER AS AN ASSOCIATIVE MEMORY
FIGURE 1. Decomposition of an SDF-type filter in the space domain, assuming the image and filter coefficients have been re-ordered into vectors. The input image vector x is pre-processed by the linear transformation y = Ax. The resulting vector is processed by a linear associative memory (LAM), y_out = y^T h, whose weights are given in the figure by h = y(y^T y)^{-1} d.

In the original development, SDF-type filters were formulated using correlation operations, although a convolutional approach can easily be adopted. The output, g(n_1, n_2), of a correlation filter is determined by
g(n_1, n_2) = \sum_{m_1=0}^{N_1-1} \sum_{m_2=0}^{N_2-1} x^*(n_1 + m_1, n_2 + m_2)\, h(m_1, m_2) = x^*(n_1, n_2) \otimes\!\otimes\, h(n_1, n_2),

where x^*(n_1, n_2) is the complex conjugate of the input image with N_1 \times N_2 region of support and h(n_1, n_2) represents the filter coefficients. The MACE filter is formulated as follows (Mahalanobis et al., 1987). Given a set of image exemplars, \{x_i \in \Re^{N_1 \times N_2}; i = 1, \ldots, N_t\}, we wish to find filter coefficients, h \in \Re^{N_1 \times N_2}, such that the average correlation energy at the output of the filter

E = \frac{1}{N_t} \sum_{i=1}^{N_t} \left( \frac{1}{N_1 N_2} \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} |g_i(n_1, n_2)|^2 \right)  (1)

is minimized subject to the constraints

g_i(0, 0) = \sum_{m_2=0}^{N_2-1} \sum_{m_1=0}^{N_1-1} x_i^*(m_1, m_2)\, h(m_1, m_2) = d_i; \quad i = 1, \ldots, N_t.  (2)

Mahalanobis et al. (1987) reformulate this as a vector optimization in the spectral domain using Parseval's theorem. Let X \in C^{N_1 N_2 \times N_t} be a matrix whose columns contain the 2-D DFT coefficients of the exemplars \{x_1, \ldots, x_{N_t}\} reordered into column vectors. Let the matrix D_i \in \Re^{N_1 N_2 \times N_1 N_2} be a diagonal matrix whose diagonal elements contain the magnitude squared of the 2-D DFT coefficients of the ith exemplar. The diagonal elements of the matrix

D = \frac{1}{N_t} \sum_{i=1}^{N_t} D_i  (3)

are then the average power spectrum of the training exemplars. The solution to this optimization problem can be found using the method of Lagrange multipliers. In the spectral domain, the filter that satisfies the constraints of eqn (2) and minimizes the criterion of eqn (1) (Mahalanobis et al., 1987; Kumar, 1992) is

H = (N_1 N_2)\, D^{-1} X (X^{\dagger} D^{-1} X)^{-1} d,  (4)

where H \in C^{N_1 N_2 \times 1} contains the 2D-DFT coefficients of the filter, assuming the nonunitary 2-D DFT as defined in Oppenheim and Schafer (1989), re-ordered into a column vector, and d \in \Re^{N_t \times 1} contains the desired outputs, d_i, for each exemplar.

This formulation can be easily cast as an associative memory. In general, associative memories are mechanisms by which patterns can be related to one another, typically in an input/output pair-wise fashion. From a signal processing perspective we view associative memories as projections (Kung, 1992), linear and nonlinear. The input patterns exist in a vector space and the associative memory projects them onto a new space. Kohonen's linear associative memory (Kohonen, 1988) is formulated exactly in this way. A simple form of the linear associative memory (the hetero-associative memory) maps vectors to scalars; that is, given a set of input/output vector/scalar pairs \{x_i \in \Re^{N}, d_i \in \Re; i = 1, \ldots, N_t\}, find the linear projection, h, such that

h^T x = d^T  (5)

and, in the under-determined case, the product

h^T h  (6)

is minimized, while for the over-determined case h is found such that

(h^T x - d^T)(h^T x - d^T)^T  (7)

is minimized. The columns of the matrix x = [x_1, \ldots, x_{N_t}] contain the input vectors and the elements of the vector d = [d_1 \ldots d_{N_t}]^T contain the associated desired output scalars. The optimal solution for the under-determined case, using the pseudo-inverse of x, is (Kohonen, 1988)

h = x (x^T x)^{-1} d.  (8)
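As a concrete illustration of the closed-form construction in eqns (3) and (4), the NumPy sketch below builds the MACE filter in the spectral domain from a stack of training chips; it is not the authors' code, and the array layout, helper names, and DFT scaling (the nonunitary FFT convention) are assumptions that may need adjustment for a particular data set.

import numpy as np

def mace_filter(exemplars, d):
    """Illustrative spectral-domain MACE solution (eqns 3 and 4).

    exemplars : (Nt, N1, N2) real training images
    d         : (Nt,) desired outputs at the correlation origin
    Returns the space-domain filter h of shape (N1, N2).
    """
    Nt, N1, N2 = exemplars.shape
    # Columns of X hold the 2-D DFT of each exemplar, reordered into vectors.
    X = np.fft.fft2(exemplars).reshape(Nt, -1).T          # (N1*N2, Nt)
    # Diagonal of D: average power spectrum over the exemplars (eqn 3).
    Ddiag = np.mean(np.abs(X) ** 2, axis=1)               # (N1*N2,)
    # H = (N1 N2) D^{-1} X (X^H D^{-1} X)^{-1} d  (eqn 4).
    DinvX = X / Ddiag[:, None]
    H = (N1 * N2) * DinvX @ np.linalg.solve(X.conj().T @ DinvX, d.astype(complex))
    # Back to the space domain.
    return np.real(np.fft.ifft2(H.reshape(N1, N2)))

def correlate(x, h):
    """Circular correlation output plane (cf. the correlation sum above), via FFTs."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(h))))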
As was shown in Fisher and Principe (1994), if we modify this linear associative memory model slightly by adding a pre-processing linear transformation matrix, A, and find h such that the under-determined system of equations
h^T (A x) = d^T  (9)

is satisfied while h^T h is minimized, we get the result

h = A x (x^T A^T A x)^{-1} d.  (10)
If the pre-processing transformation, A, is the space-domain equivalent of the MACE filter's spectral pre-whitening filter, then eqn (10) combined with the pre-processing transformation yields exactly the space-domain coefficients of the MACE filter when the input vectors, x, are the re-ordered elements of the original images.
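The decomposition in eqn (10) can be sketched directly. In the fragment below the pre-whitener A is applied in the frequency domain (dividing by the square root of the average power spectrum, consistent with the MACE pre-processor described above), and the LAM weights are the minimum-norm solution of the complex analogue of eqn (9). Names and shapes are illustrative assumptions, not the paper's implementation.

import numpy as np

def whiten_and_lam(exemplars, d):
    """Sketch of the pre-whitener + LAM decomposition of eqn (10).

    exemplars : (Nt, N1, N2) training images; d : (Nt,) desired outputs.
    Returns (whiten, h): whiten() maps an image to its pre-whitened, vectorized
    form y = Ax, and h is the minimum-norm LAM weight vector.
    """
    Nt, N1, N2 = exemplars.shape
    X = np.fft.fft2(exemplars).reshape(Nt, -1).T             # spectra as columns
    Ddiag = np.mean(np.abs(X) ** 2, axis=1)                  # average power spectrum

    def whiten(img):
        # A applied in the frequency domain: divide by sqrt of avg power spectrum.
        return np.fft.fft2(img).ravel() / np.sqrt(Ddiag)

    Y = np.stack([whiten(im) for im in exemplars], axis=1)   # y_i = A x_i as columns
    # Minimum-norm solution of h^H Y = d^T  ->  h = Y (Y^H Y)^{-1} d  (cf. eqn 10).
    h = Y @ np.linalg.solve(Y.conj().T @ Y, d.astype(complex))
    return whiten, h

# Usage sketch: the scalar output for an input image is h^H (A x).
# out = np.real(np.vdot(h, whiten(test_image)))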
3. NONLINEAR EXTENSION OF THE MACE FILTER

The MACE filter is the best linear system that minimizes the energy in the output correlation plane subject to a peak constraint at the origin. One of the advantages of linear systems is that we have the mathematical tools to use them in optimal operating conditions. Such optimality conditions, however, should not be confused with the best possible performance. In the case of the MACE filter one drawback is poor generalization. A possible approach to designing a nonlinear extension of the MACE filter and improving on its generalization properties is to simply substitute the linear processing elements of the LAM with nonlinear elements. Since such a system can be trained with error backpropagation, the issue would be simply to report on performance comparisons with the MACE. Such a methodology does not, however, lead to an understanding of the role of the nonlinearity, and does not elucidate the trade-offs in the design and in training.

Here we approach the problem from a different perspective. We seek to extend the optimality condition of the MACE to a nonlinear system, i.e., the energy in the output space is minimized while maintaining the peak constraint at the origin. Hence we will impose these constraints directly in the formulation, even knowing a priori that an analytical solution is very difficult or impossible to obtain. We formulate the MACE filter from a statistical viewpoint and generalize it to arbitrary mapping functions, linear and nonlinear. We begin with a random vector, x \in \Re^{N_1 N_2 \times 1}, which is representative of the rejection class, and a set of N_t observations of the random vector, placed in the matrix x_0 \in \Re^{N_1 N_2 \times N_t}, which represent the target sub-class. We wish to find the parameters, \alpha, of a mapping, g(\alpha, x) : \Re^{N_1 N_2 \times 1} \rightarrow \Re, such that we may discriminate target vectors from vectors in the general rejection class. In this sense the mapping function, g, constrains the discriminator topology. Towards this goal, we wish to minimize the objective function J = E(g(\alpha, x)^2) over the mapping parameters, \alpha, subject to the system of constraints
g(\alpha, x_0) = d^T,  (11)

where d \in \Re^{N_t \times 1} is a column vector of desired outputs. It is assumed that the mapping function is applied to each column of x_0, and E(\cdot) is the expected value function.

Using the method of Lagrange multipliers, we can augment the objective function as

J = E(g(\alpha, x)^2) + (g(\alpha, x_0) - d^T)\,\lambda,  (12)

where the mapping is assumed to be applied to each column of x_0. Computing the gradient with respect to the mapping parameters yields

\frac{\partial J}{\partial \alpha} = 2\,E\!\left( g(\alpha, x)\,\frac{\partial g(\alpha, x)}{\partial \alpha} \right) + \frac{\partial g(\alpha, x_0)}{\partial \alpha}\,\lambda.  (13)
Equation (13) along with the constraints of eqn (11) can be used to solve for the optimal parameters, \alpha^{o}, assuming our constraints form a consistent set of equations. This is, of course, dependent on the network topology. For arbitrary nonlinear mappings it will, in general, be very difficult to solve for globally optimal parameters analytically. Our initial goal, instead, is to develop topologies and adaptive training algorithms which are practical and yield improved generalization over the linear mappings.

It is interesting to verify that this formulation yields the MACE filter as a special case. If, for example, we choose the mapping to be a linear projection of the input image, that is,

g(\alpha, x) = \alpha^T x; \quad \alpha = [h_1, \ldots, h_{N_1 N_2}]^T \in \Re^{N_1 N_2 \times 1},

then eqn (12) becomes, after simplification,

J = \alpha^T E(x x^T)\,\alpha + (\alpha^T x_0 - d^T)\,\lambda.  (14)

In order to solve for the mapping parameters, \alpha, we are still left with the task of computing the term E(x x^T) which, in general, we can only estimate from observations of the random vector, x. Assuming that we have a suitable estimator, the well-known solution to the minimum of eqn (14) over the mapping parameters subject to the constraints of eqn (11) is

\alpha = \hat{R}_x^{-1} x_0 (x_0^T \hat{R}_x^{-1} x_0)^{-1} d, \qquad \hat{R}_x = \mathrm{estimate}\{E(x x^T)\}.  (15)
Depending on the characterization of x, eqn (15) describes various SDF-type filters (i.e., MACE, MVSDF, etc.). In the case of the MACE filter, the random vector, x, is characterized by all 2-D circular shifts of target class images away from the origin. Solving for the MACE filter coefficients is therefore equivalent to using the average circular autocorrelation sequence (or, equivalently, the average power spectrum in the frequency domain) over images in the target class as the estimator of the elements of the matrix E(x x^T). Sudharsanan et al. (1991) suggest a very similar methodology for
improving the performance of the MACE filter. In that case the average linear autocorrelation sequence is estimated over the target class and this estimator of E(x x^T) is used to solve for the linear projection coefficients in the space domain. The resulting filter is referred to as the SMACE (space-domain MACE) filter.

As stated, our goal is to find mappings, defined by a topology and a parameter set, which improve upon the performance of the MACE filter in terms of generalization while maintaining a sharp constrained peak in the center of the output plane for images in the recognition class. One approach, which leads to an adaptive algorithm, is to approximate the original objective function of eqn (12) with the modified objective function

J = (1 - \beta)\, E(g(\alpha, x)^2) + \beta\, [g(\alpha, x_0) - d^T][g(\alpha, x_0) - d^T]^T.  (16)

The principal advantage gained by using eqn (16) over eqn (12) is that we can solve adaptively for the parameters of the mapping function (assuming it is differentiable). The constraint equations, however, are no longer satisfied with equality over the training set. Varying \beta in the range [0, 1] controls the degree to which the average response to the rejection class is emphasized versus the variance about the desired output over the recognition class. In Réfrégier and Figue (1991) an optimal criterion trade-off method is presented. The authors show that the convex combination over the set of criteria describes a performance bound for the linear mapping. Mahalanobis et al. (1994a) extend this idea to unconstrained linear correlation filters. Further investigation will be required in order to explore the relationship and performance of these linear filters relative to the nonlinear mappings we are currently studying.

As in the linear case, we can only estimate the expected variance of the output due to the random vector input and its associated gradient. If, as in the MACE (or SMACE) filter formulation, x is
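A minimal sketch of how the modified criterion of eqn (16) can be evaluated as a differentiable training loss is given below. The mapping g, the exemplar matrix, and the noise batch standing in for the rejection class are placeholders; the weighting beta = 0.93 follows the value used in Section 4, and the remaining names and shapes are assumptions for illustration only.

import numpy as np

def nl_mace_loss(g, params, targets, d, rejection_samples, beta=0.93):
    """Modified objective of eqn (16):
    J = (1 - beta) * E[g(params, x)^2] + beta * ||g(params, x0) - d||^2.

    g                 : callable mapping (params, image_vector) -> scalar
    targets           : (Nt, N) pre-whitened target exemplars (rows)
    d                 : (Nt,) desired outputs at the constraint location
    rejection_samples : (Nr, N) samples standing in for the rejection class
                        (e.g., white-noise images once the inputs are pre-whitened)
    """
    # Sample estimate of the output variance term E[g(.,x)^2] over the rejection class.
    rej_out = np.array([g(params, x) for x in rejection_samples])
    variance_term = np.mean(rej_out ** 2)

    # Quadratic penalty replacing the equality constraints g(., x0) = d^T.
    tgt_out = np.array([g(params, x) for x in targets])
    constraint_term = np.sum((tgt_out - d) ** 2)

    return (1.0 - beta) * variance_term + beta * constraint_term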
characterized by all 2-D circular (or linear) shifts of the recognition class away from the origin, then this term can be estimated with a sample average over the exemplars, x_0, for all such shifts. From an adaptive standpoint this leads to a gradient search method which trains exhaustively over the entire output plane. This becomes a computationally intensive problem for most nonlinear mappings. It is desirable, then, to find other equivalent characterizations of the rejection class which may alleviate the computational load without significantly impacting performance. This issue is addressed in later sections.

3.1. Architecture
A block diagram of the proposed nonlinear extension is shown in Figure 2. In the pre-processor/LAM decomposition of the MACE filter the LAM structure was replaced with a feed-forward multilayer perceptron (MLP). The pre-processor remains a linear, shift-invariant pre-whitening transformation, \Re^{N_1 N_2 \times 1} \rightarrow \Re^{N_1 N_2 \times 1}, yielding a pre-whitened space-domain image. The MLP has N_1 N_2 nodes on the input layer, corresponding to an input mask with N_1 \times N_2 support in the image domain, and two hidden layers. The first hidden layer has two nodes and can be implemented with two correlators followed by nonlinear elements. The outputs of these elements feed into four nodes on the second hidden layer, which nonlinearly combine the two features, followed by a single output node. The nonlinearity is the logistic function. Since the mapping is \Re^{N_1 N_2 \times 1} \rightarrow \Re, we must, of course, apply the filter input mask to each location in the original input image in order to obtain an output image.

FIGURE 2. Experimental nonlinear MACE structure.

The specific architecture was chosen for several reasons. The linear MACE filter extracts the optimal feature over the design exemplars for a linear discriminant function. Any linear combination of additional linear features will yield an equivalent linear feature. This means that the MACE filter is the best linear feature extractor for the target exemplars; only a nonlinear system can improve on this design. The MLP structure has the advantage of providing an efficient means of nonlinearly combining an optimal linear feature with others. It is well known that a single hidden layer MLP can realize any smooth discriminant function of its inputs. If we view the output of each node in the first hidden layer as an extracted feature of the input image, then the second layer gives the capability of realizing any smooth discriminant function of the first hidden layer output. This is illustrated in Figure 3, where the linear outputs plus bias terms, f_1 + \theta_1 and f_2 + \theta_1, of the first hidden layer are the features of interest, and f(\cdot) is the nonlinear logistic function. The division of Figure 3 will be useful in later analysis.

FIGURE 3. Division of the pre-processor/MLP into feature extraction and discriminant function.

If the performance of the linear MACE filter can be improved, the addition of a single feature should be sufficient to illustrate this improvement. It is for this reason that we set the number of nodes to two on the first hidden layer, although more hidden nodes may lead to even better performance. Finally, the MLP with backpropagation provides a simple means for adapting the NL-MACE, although a globally optimal solution is not guaranteed. The mapping function of the NL-MACE can be written
g(\alpha, x) = f(W_3 f(W_2 f(W_1 x + \theta_1) + \theta_2)),
\quad \alpha = \{ W_1 \in \Re^{2 \times N_1 N_2},\; W_2 \in \Re^{4 \times 2},\; W_3 \in \Re^{1 \times 4},\; \theta_1, \theta_2 \}.  (17)

Implicit in eqn (17) is that the terms \theta_1 and \theta_2 are constant bias matrices with the appropriate dimensionality. It is also assumed that if the argument to the nonlinear function f(\cdot) is a matrix then the nonlinearity is applied to each element of the matrix. We can rewrite the linear input term, W_1 x, which is the only term with dependency on the input image (reordered into a column vector), as

W_1 x = \begin{bmatrix} h_1^T x \\ \vdots \\ h_{N_{h1}}^T x \end{bmatrix} = \begin{bmatrix} f_1(x) \\ \vdots \\ f_{N_{h1}}(x) \end{bmatrix},  (18)
where N_{h1} is the number of hidden nodes in the first layer of the MLP (two in our specific case) and \{h_1, \ldots, h_{N_{h1}}\} \in \Re^{1 \times N_1 N_2} are the rows of the matrix W_1. The elements of the result, \{f_1(x), \ldots, f_{N_{h1}}(x)\} \subset \Re^{1 \times N_1 N_2}, are recognized as the outputs, in vector form (Mahalanobis et al., 1987; Kumar et al., 1988), of N_{h1} purely real linear correlation filters operating in parallel; therefore the elements of this term are shift-invariant. Rewriting eqn (17) as a function of its shift-invariant terms,
g(x) = f(W_3 f(W_2 f([f_1(x) \cdots f_{N_{h1}}(x)]^T + \theta_1) + \theta_2)),  (19)
we can see that the output is a static function of shift invariant input terms. Any shift in the input image will be reflected as a corresponding shift in the output image. The mapping is, therefore, shift invariant.
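For concreteness, a NumPy sketch of the forward mapping in eqns (17)-(19) follows. The layer sizes (two first-layer nodes, four second-layer nodes, one output) match the architecture described above, while the class name, weight initialization, and random-number handling are illustrative assumptions rather than the authors' implementation.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

class NLMace:
    """Forward mapping g(alpha, x) of eqn (17): a 2-4-1 MLP applied to a
    pre-whitened, vectorized N1*N2 input mask."""

    def __init__(self, n_inputs, rng=np.random.default_rng(0)):
        self.W1 = 0.01 * rng.standard_normal((2, n_inputs))   # two linear correlators
        self.th1 = np.zeros(2)
        self.W2 = 0.01 * rng.standard_normal((4, 2))           # combines the two features
        self.th2 = np.zeros(4)
        self.W3 = 0.01 * rng.standard_normal((1, 4))            # single output node

    def forward(self, x):
        """x : (N1*N2,) pre-whitened image vector -> scalar output."""
        z1 = logistic(self.W1 @ x + self.th1)        # f(W1 x + theta1)
        z2 = logistic(self.W2 @ z1 + self.th2)       # f(W2 (.) + theta2)
        return logistic((self.W3 @ z2).item())       # f(W3 (.))

# To obtain an output *image*, the mask is applied at every shift of the input,
# exactly as with the linear MACE filter (the W1 stage amounts to two correlations).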
3.2. Avoiding Exhaustive Training

Training becomes an issue once the associative memory structure takes a nonlinear form. The output variance of the linear MACE filter is minimized for the entire output plane over the training exemplars. Even when the coefficients of the MACE filter are computed iteratively, we need only consider the output point at the designated peak location (constraint) for each pre-whitened training exemplar (Fisher & Principe, 1994). This is due to the fact that, for the under-determined case, the linear projection which satisfies the system of constraints with equality and has minimum norm is also the linear projection which minimizes the response to images with a flat power spectrum. This solution is arrived at naturally via a gradient search only at the constraint location.

This is no longer the case when the mapping is nonlinear. Adapting the parameters via gradient search on pre-whitened exemplars only at the constraint location will not, in general, minimize the variance in the output image. In order to minimize the variance over the entire output plane we must
consider the response of the filter to each location in the input image, not just the constraint location. The brute-force approach would be to adapt the parameters over the entire output plane, which would require N_1 N_2 N_t image presentations per training epoch. If such exhaustive training is done, then the pre-whitening stage seems unnecessary: the pre-whitening stage and the input layer weights could be combined into a single equivalent linear transformation. Pre-whitening separately, however, enables us to greatly reduce the number of image presentations during training. This can be explained as follows. Due to the statistical formulation, we are only reducing the response of the NL-MACE filter to images with the second-order statistics of the rejection class. If the exemplars have been pre-whitened then the rejection class can be represented with random white images. Minimizing the response to these images, on average, minimizes the response to shifts of the exemplar images since they have the same second-order statistics. In this way we do not have to train over the entire output plane exhaustively, thereby reducing training times proportionally by the input image size, N_1 N_2. A sketch of this reduced presentation scheme is given below.

Experimentally, the difference in convergence time was approximately 2300 epochs of N_1 N_2 N_t image presentations for exhaustive training versus 1800 epochs of (N_t + 4) image presentations (training exemplars plus four white noise images) for noise training, with nearly the same performance in both cases. This is obviously a considerable speedup in training for even moderate image sizes. In both cases, the resulting filters exhibit improved performance over the linear MACE filter in terms of generalization and output variance.
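The noise-training shortcut described above can be sketched as a simple epoch loop: instead of presenting all N_1 N_2 N_t shifted images, each epoch presents the N_t pre-whitened target exemplars plus a handful of white-noise images standing in for the rejection class. The gradient-step callable, batch size, and helper names below are illustrative assumptions, not the authors' exact procedure.

import numpy as np

def noise_training_epoch(update, targets, d, n_noise=4, rng=np.random.default_rng()):
    """One epoch of the reduced training scheme of Section 3.2.

    update  : callable performing one gradient step for (input_vector, desired_output)
    targets : (Nt, N) pre-whitened target exemplars (rows); d : (Nt,) desired outputs
    n_noise : number of white-noise images per epoch standing in for the rejection
              class (the experiments reported here used four).
    """
    Nt, N = targets.shape
    # Present each target exemplar with its constrained (peak) output value.
    for x, di in zip(targets, d):
        update(x, di)
    # Present a few white-noise images with desired output 0; since the target
    # exemplars are pre-whitened, white noise shares the rejection class's
    # second-order statistics, so no shifted exemplars are needed.
    for _ in range(n_noise):
        update(rng.standard_normal(N), 0.0)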
3.3. Linear versus Nonlinear Discriminant Functions

Several observations were made during our experiments. It became apparent that linear solutions were a strong attractor. Examination of the input layer showed that the columns of W_1 were highly correlated. When this condition is true, although a nonlinear system is being used, the mapping of the image space to the feature space is confined to a narrow strip. The net result is that a mapping similar to the linear MACE filter could be achieved with a single node on the first hidden layer, and we have achieved a linear discriminant function with a complicated topology. Even if the resulting linear discriminant function yields better performance, there are much better and well-documented methods for finding linear discriminant functions (Réfrégier & Figue, 1991; Mahalanobis et al., 1994b).

In order to find, with the MLP, a nonlinear discriminant function of the image space, modifications were made to the adaptation procedure. The presumption here is that better performance (in terms of discrimination, localization, and generalization) can be achieved using a nonlinear discriminant function. It is certainly possible that in some input spaces the best discrimination can be achieved with a linear projection, but in a space as rich as the one in which we are working we believe that this will rarely be the case. The modification to the adaptation was to enforce orthogonality on the columns of the input layer weight matrix,
W_1 W_1^T = \begin{bmatrix} h_1^T h_1 & h_1^T h_2 \\ h_2^T h_1 & h_2^T h_2 \end{bmatrix} = \begin{bmatrix} \|h_1\|^2 & 0 \\ 0 & \|h_2\|^2 \end{bmatrix},

via Gram-Schmidt orthogonalization, where \{W_1, h_1, h_2\} are as in eqns (17) and (18). This has two consequences. First, it guarantees that the mapping to the feature space is not rank-deficient, although it does not ensure that the discriminant function derived through gradient search will utilize the additional feature. The second consequence is that, assuming we have pre-whitened input images over the rejection class, the extracted features will also be orthogonal, in the statistical sense, over the rejection class. Mathematically, this can be shown as follows:

E(W_1 x x^T W_1^T) = W_1 E(x x^T) W_1^T = \begin{bmatrix} h_1^T E(x x^T) h_1 & h_1^T E(x x^T) h_2 \\ h_2^T E(x x^T) h_1 & h_2^T E(x x^T) h_2 \end{bmatrix}.  (20)

As a consequence of the pre-whitening, the term E(x x^T) is of the form \sigma I_{N_1 N_2}, where \sigma is a scalar and I_{N_1 N_2} is the N_1 N_2 \times N_1 N_2 identity matrix. Substituting into eqn (20) gives

E(W_1 x x^T W_1^T) = \begin{bmatrix} h_1^T (\sigma I_{N_1 N_2}) h_1 & h_1^T (\sigma I_{N_1 N_2}) h_2 \\ h_2^T (\sigma I_{N_1 N_2}) h_1 & h_2^T (\sigma I_{N_1 N_2}) h_2 \end{bmatrix} = \begin{bmatrix} \sigma \|h_1\|^2 & 0 \\ 0 & \sigma \|h_2\|^2 \end{bmatrix}.  (21)

It is fairly straightforward to show that any affine transformation of these features will also be uncorrelated. Since the MLP is nonlinearly combining orthogonal features it will yield, in general, a nonlinear discriminant function.
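The orthogonality constraint on W_1 can be imposed with a single Gram-Schmidt step after each weight update. A minimal sketch follows; applying it once per training iteration, as described above, keeps the two first-layer correlators from collapsing onto the same linear feature. The function name and the choice to leave the row norms unchanged are illustrative assumptions.

import numpy as np

def gram_schmidt_rows(W1):
    """Return a copy of W1 (2 x N1*N2) whose second row is made orthogonal to
    the first, so that W1 @ W1.T is diagonal, as in the constraint above."""
    W = W1.copy()
    h1, h2 = W[0], W[1]
    # Remove the component of h2 that lies along h1.
    h2 -= (h1 @ h2) / (h1 @ h1) * h1
    return W

# Typical use during adaptation (sketch):
#   model.W1 = gram_schmidt_rows(model.W1)   # once per training iteration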
FIGURE 4. Examples of ISAR imagery. Down range is increasing from left to right. The target vehicle is shown at aspects of 5 (left), 45 (middle), and 85 (right) degrees.
4. EXPERIMENTAL RESULTS

For these experiments we used vehicle data from the TABILS 24 ISAR data set. The radar used for the data collection is a fully polarimetric, Ka-band radar. The ISAR imagery was processed with a polarimetric whitening filter (PWF) (Novak et al., 1993) and then logarithmically scaled to units of dBsm (dB square meters) prior to being used for our experiments. The data used were collected at a depression angle of 20°, that is, the radar antenna was directed 20° down from the horizon. ISAR images were extracted in the range 5-85° azimuth in increments of 0.8° (Figure 4). This resulted in 100 ISAR images (50 training, 50 testing). Images within both the training and testing sets were separated by 1.6°.
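The train/test split described above (50 images each, with images within a set 1.6° apart in aspect) can be expressed as a short sketch; the array names and the assumption that the 100 chips are ordered by increasing aspect and assigned alternately to the two sets are illustrative, not taken from the paper.

import numpy as np

# chips: (100, N1, N2) ISAR image chips ordered by aspect, 5 to 85 degrees in 0.8° steps.
aspects = 5.0 + 0.8 * np.arange(100)

# Alternate chips between the two sets: each set then spans the same aspect range
# with 1.6° spacing, 50 images apiece.
train_idx = np.arange(0, 100, 2)
test_idx = np.arange(1, 100, 2)
# train, test = chips[train_idx], chips[test_idx]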
4.1. Experiment 1

In the first experiment, straight backpropagation training was conducted with no modifications other than to weight the quadratic penalty term associated with the constraints in eqn (16) by \beta = 0.93 and the output variance term with (1 - \beta) = 0.07. The coefficients converged to a solution after approximately 1200 runs through the entire training set. Examination of the input layer (feature extracting layer) revealed that the coefficients associated with the first feature (first column of the matrix W_1) were highly correlated with the coefficients of the second feature. In effect the MLP converged to a linear discriminant function. At best, the MLP was equivalent to choosing a threshold for a linear filter. The resulting discriminant function is illustrated in Figure 5. In the figure a contour plot of the discriminant function with respect to the linear outputs of the first hidden layer, f_1 and f_2 of Figure 3, is plotted. Although the discriminant function implemented is nonlinear, the features are so highly correlated that all inputs are projected onto a single curve in the feature space. Further adaptation continued to increase the correlation of the features.

FIGURE 5. NL-MACE discriminant function with respect to the extracted feature mapping. The cluster in the lower left is the mapping of noise (asterisks) with the same second-order statistics as the rejection class. The cluster in the upper right is the mapping of testing (plus signs) and training (diamonds) exemplars. Since the features are highly correlated, inputs are mapped to a single curve in the feature space, and the overall filter is effectively a linear discriminant function of the input image.
4.2. Experiment 2

In light of the results of the first experiment (and several other experiments not described here for brevity), a modification was made to the training algorithm that yielded a nonlinear discriminant function. During training, orthogonality between the columns of the matrix W_1 was enforced via a Gram-Schmidt procedure at each training iteration. The approximate convergence time was nearly the same as in the first case, but the resulting discriminant function was no longer linear, indicating that the second feature was utilized by the filter. The new discriminant function is plotted in Figure 6. The features are no longer correlated, so the target exemplars and noise (rejection class) no longer lie on a single curve in the feature space. The resulting filter is utilizing the second feature and the discriminant is not equivalent to a linear discriminant function.

FIGURE 6. Comparison of the discriminant function with respect to the extracted feature mapping when orthogonal features were enforced. The cluster in the upper left is the mapping of noise (asterisks) with the same second-order statistics as the rejection class. The cluster in the lower right is the mapping of testing (plus signs) and training (diamonds) exemplars. The mapping is no longer confined to a single curve in the feature space and the discriminant function is a nonlinear function of both features.
4.3. Performance Comparison

At this point, we are satisfied that the nonlinear associative memory structure is doing more than applying a threshold to the linear discriminant function. We now compare the performance of the linear MACE filter to our nonlinear extension with orthogonal features. Sample responses of both filters, linear and nonlinear, are shown in Figure 7. One training set exemplar and one testing set exemplar are shown for both the linear MACE filter and the nonlinear filter. It is evident from the figure that the nonlinear filter appears to reduce the variance in the output plane (correlation energy for the linear case) as compared to the linear filter while still maintaining a sharp peak near the center point. Recall that at no time during training were shifted exemplars presented to the network, although, as in the MACE filter, the projection must be computed at all positions in the input image in order to compute the output image.

FIGURE 7. Sample responses of the linear MACE filter (left) as compared to the output of the nonlinear filter (right) given the same input. The samples shown include one training exemplar (top) and the adjacent testing exemplar (bottom).
FIGURE 8. Peak (center) response of the linear MACE filter (left) compared to the output of the nonlinear filter (right) over the entire training set (top) and testing set (bottom), plotted as a function of vehicle aspect angle.
This response was typical for all exemplars. Localized peak and low variance properties were retained. In Figure 8 we show the peak response for both the linear and nonlinear filter for both the training and testing sets. In the case of the training set for the linear filter the designed value is, of course, met exactly at the center point. The peak response over the training set always occurred at the center point for the nonlinear filter. In order to determine the peak response for the testing set, for both the linear and nonlinear filter, we simply chose the peak response in the output plane. In all cases this point occurred within a 5 x 5 pixel area centered in the output plane, but was not necessarily the center point for the test set. It can be seen in the plot that the nonlinear filter appears to have better generalization properties over the training set than the linear filter. In Figure 9 we show the probability distribution of the output plane response, estimated via the Parzen window method, from the testing exemplars. The linear MACE filter clearly exhibits a more significant tail in the distribution than does the nonlinear filter.

5. REMARKS AND CONCLUSIONS

We have presented a method by which the MACE filter can be extended to nonlinear processing. A necessary part of any extension to the MACE filter
must consider the entire output image plane. In the case of the nonlinear extensions to the MACE filter the output image plane can no longer be characterized by the average power spectrum over the recognition class, and any iterative method for computing its parameters might have to train exhaustively over the entire output plane. Using a statistical treatment, however, we were able to develop a training method that did not require exhaustive output plane training, which drastically reduced the convergence time of our training algorithm and gave improved performance. Our training algorithm requires the generation of a small number of random sequences with the same second-order statistics as our recognition class. Pre-whitening of the input exemplars played an important role in the training algorithm because the random sequences could then be any white noise sequences, which, as a practical matter, are less difficult to generate during training.

Our results also show that it is not enough to simply train a multi-layer perceptron using backpropagation (the black-box approach). Careful analysis of the final solution is necessary to confirm reasonable results. In particular, the linear solution is a strong attractor and must be avoided; otherwise the solution would be equivalent (at best) to the linear MACE filter followed by a threshold. We used Gram-Schmidt orthogonalization on the input layer, which did result in a nonlinear discriminant function and improved performance. We are currently exploring other methods by which independent features will adapt naturally.

In our experiments better generalization and reduced variance in the output plane were demonstrated. Our current interest is in the application of this filter structure to SAR imagery. We are in the process of testing with multiple targets in target-plus-clutter imagery and will be reporting our results in the future. Future investigations will also explore the performance and relationships to the class of unconstrained correlation filters of Mahalanobis et al. (1994b).

FIGURE 9. Filter output plane pdfs (excluding the 5 x 5 pixel center region), estimated over training exemplars, for the linear MACE (solid line) and the NL-MACE (dotted line).

REFERENCES
Amit, D. J. (1989). Modelling brain function: the world of attractor neural networks. Cambridge: Cambridge University Press.
Casasent, D., & Ravichandran, G. (1992). Advanced distortion-invariant minimum average correlation energy (MACE) filters. Applied Optics, 31(8), 1109-1116.
Casasent, D., Ravichandran, G., & Bollapragada, S. (1991). Gaussian minimum average correlation energy filters. Applied Optics, 30(35), 5176-5181.
Fisher, J., & Principe, J. C. (1994). Formulation of the MACE filter as a linear associative memory. Proceedings of the IEEE International Conference on Neural Networks, Vol. 5, p. 2934.
Hertz, J., et al. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hester, C. F., & Casasent, D. (1980). Multivariant technique for multiclass pattern recognition. Applied Optics, 19, 1758-1761.
Hinton, G. E., & Anderson, J. A. (Eds.) (1981). Parallel models of associative memory. Lawrence Erlbaum Associates.
Kohonen, T. (1988). Self-organization and associative memory (Vol. 8, Springer Series in Information Sciences). Berlin: Springer-Verlag.
Kumar, B. V. K. Vijaya (1986). Minimum variance synthetic discriminant functions. Journal of the Optical Society of America A, 3(10), 1579-1584.
Kumar, B. V. K. Vijaya (1992). Tutorial survey of composite filter designs for optical correlators. Applied Optics, 31(23), 4773-4801.
Kumar, B. V. K. Vijaya, Bahri, Z., & Mahalanobis, A. (1988). Constraint phase optimization in minimum variance synthetic discriminant functions. Applied Optics, 27(2), 409-413.
Kung, S. Y. (1992). Digital neural networks. Englewood Cliffs, NJ: Prentice-Hall.
Mahalanobis, A., Kumar, B. V. K. Vijaya, & Casasent, D. (1987). Minimum average correlation energy filters. Applied Optics, 26(17), 3633-3640.
Mahalanobis, A., Forman, A. V., Day, N., Bower, M., & Cherry, R. (1994a). Multi-class SAR ATR using shift-invariant correlation filters. Pattern Recognition, 27(4), 619-626.
Mahalanobis, A., Kumar, B. V. K. Vijaya, Song, S., Sims, S. R. F., & Epperson, J. F. (1994b). Unconstrained correlation filters. Applied Optics, 33(33), 3751-3759.
Novak, L. M., Burl, M. C., & Irving, W. W. (1993). Optimal polarimetric processing for enhanced target detection. IEEE Transactions on Aerospace and Electronic Systems, 29, 234.
Novak, L. M., Owirka, G., & Netishen, C. (1994). Radar target identification using spatial matched filters. Pattern Recognition, 27(4), 607-617.
Oppenheim, A. V., & Schafer, R. W. (1989). Discrete-time signal processing. Englewood Cliffs, NJ: Prentice Hall.
Ravichandran, G., & Casasent, D. (1992). Minimum noise and correlation energy filters. Applied Optics, 31(11), 1823-1833.
Réfrégier, Ph., & Figue, J. (1991). Optimal trade-off filters for pattern recognition and their comparison with the Wiener approach. Opt. Comp. Proc., 1, 3-10.
Sudharsanan, S. I., Mahalanobis, A., & Sundareshan, M. K. (1991). A unified framework for the synthesis of synthetic discriminant functions with reduced noise variance and sharp correlation structure. Applied Optics, 30(35), 5176-5181.