Neural Networks, Vol. 8, No. 2, pp. 179-201, 1995
Copyright © 1995 Elsevier Science Ltd. Printed in the USA. All rights reserved. 0893-6080/95 $9.50 + .00
0893-6080(94)00064-6
CONTRIBUTED ARTICLE
An Algorithm to Generate Radial Basis Function (RBF)-Like Nets for Classification Problems

ASIM ROY, SANDEEP GOVIL, AND RAYMOND MIRANDA
Arizona State University

(Received 3 August 1993; revised and accepted 6 June 1994)

Acknowledgement: This research was supported, in part, by National Science Foundation grant IRI-9113370 and College of Business Summer Grants. Requests for reprints should be sent to Asim Roy, Department of Decision and Information Systems, Arizona State University, Tempe, AZ 85287.
Abstract--This paper presents a new algorithm for generating radial basis function (RBF)-like nets for classification problems. The method uses linear programming (LP) models to train the RBF-like net. Polynomial time complexity of the method is proven and computational results are provided for many well-known problems. The method can also be implemented as an on-line adaptive algorithm.
Keywords--Radial basis function-like nets, Classification problems, Linear programming models.

1. INTRODUCTION--A ROBUST AND EFFICIENT LEARNING THEORY

The science of artificial neural networks needs a robust theory for generating neural networks and for adaptation. Lack of a robust learning theory has been a significant impediment to the successful application of neural networks. A good, rigorous theory for artificial neural networks should include learning methods that adhere to the following stringent performance criteria and tasks.
1. Perform network design task: A neural network learning method must be able to design an appropriate network for a given problem, because it is a task performed by the brain. A predesigned net should not be provided to the method as part of its external input, because it never is an external input to the brain.
2. Robustness in learning: The method must be robust so as not to have the local minima problem, the problems of oscillation, and catastrophic forgetting or similar learning difficulties. The brain does not exhibit such problems.
3. Quickness in learning: The method must be quick in its learning and learn rapidly from only a few examples, much as humans do. For example, a method that learns from only 10 examples (on-line) learns faster than one that needs 100 or 1000 examples (on-line) for the same level of error. "Told him a million times, and he still doesn't understand" is a typical remark often heard of a slow learner.
4. Efficiency in learning: The method must be computationally efficient in its learning when provided with a finite number of training examples. It must be able to both design and train an appropriate net in polynomial time. That is, given P examples, the learning time should be a polynomial function of P.
5. Generalization in learning: The method must be able to generalize reasonably well so that only a small amount of network resources is used. That is, it must try to design the smallest possible net. This characteristic must be an explicit part of the algorithm.

This theory defines algorithmic characteristics that are obviously much more brain-like than those of classical connectionist theory, which is characterized by predefined nets, local learning laws, and memoryless learning. Judging by these characteristics, classical connectionist learning is not very powerful or robust. First of all, it does not even address the issue of network design, a task that should be central to any neural network learning theory. It is also plagued by efficiency (lack of polynomial time complexity, need for an excessive number of teaching examples) and robustness problems (local minima, oscillation, catastrophic forgetting), problems that are partly acquired from its attempt to learn without using memory. Classical connectionist learning, therefore, is not very brain-like at all. Several algorithms have recently been developed that follow these learning principles (Roy & Mukhopadhyay, 1991; Roy, Kim, & Mukhopadhyay, 1993; Mukhopadhyay et al., 1993; Govil & Roy, 1993). The algorithm
presented here also satisfies the brain-like properties described above. Successful and reliable on-line self-learning machines can be developed only if learning algorithms adhere to these learning principles.

2. RADIAL BASIS FUNCTION NETS--BACKGROUND

Radial basis function (RBF) nets belong to the group of kernel function nets that utilize simple kernel functions, distributed in different neighborhoods of the input space, for which responses are essentially local in nature. The architecture consists of one hidden and one output layer. This shallow architecture has a great advantage in terms of computing speed compared to multiple hidden layer nets. Each hidden node in a RBF net represents one of the kernel functions. An output node generally computes the weighted sum of the hidden node outputs. A kernel function is a local function and the range of its effect is determined by its center and width. Its output is high when the input is close to the center and it decreases rapidly to zero as the input's distance from the center increases. The Gaussian function is a popular kernel function and will be used in this algorithm. The design and training of a RBF net consists of (1) determining how many kernel functions to use, (2) finding their centers and widths, and (3) finding the weights that connect them to an output node.

Several radial basis function algorithms have been proposed recently for both classification and real-valued function approximation. Significant contributions include those by Powell (1987), Moody and Darken (1988, 1989), Broomhead and Lowe (1988), Poggio and Girosi (1990), Baldi (1990), Musavi et al. (1992), Platt (1991), and others. In Moody and Darken's algorithm, an adaptive K-means clustering method and a "P-nearest neighbor" heuristic are used to create the Gaussian units (i.e., determine their centers and widths) and the LMS gradient descent rule (Widrow & Hoff, 1960) is used to determine their weights. Musavi et al. (1992) present an algorithm that forms clusters (Gaussian units) that contain patterns of the same class only. They attempt to minimize the number of Gaussian units used and thereby increase the generalization ability. Platt (1991) has developed an RBF algorithm for function approximation where a new RBF unit is allocated whenever an unusual pattern is presented. His network, using the standard LMS method, learns much faster than do those using back propagation. Vrckovnik, Carter, and Haykin (1990) have recently applied RBF nets to classify impulse radar waveforms and Renals and Rohwer (1989) to phoneme classification.

3. BASIC IDEAS AND THE ALGORITHM

The following notation is used. An input pattern is represented by the N-dimensional vector x, x = (x_1, x_2, ..., x_N); x_n denotes the nth element of vector x. The pattern space, which is the set of all possible values that x may assume, is represented by \Omega_x. K denotes the total number of classes and k is a class. The method is for supervised learning, where the training set x_1, x_2, ..., x_P is a set of P sample patterns with known classification; x_p denotes the pth vector. In this method, a particular subset of the hidden nodes, associated with class k, is connected to the kth output node, and those class k hidden nodes are not connected to the other output nodes. Therefore, mathematically, the input F_k(x) to the kth output node (i.e., the class k output node) is given by

F_k(x) = \sum_{q=1}^{Q_k} h_q^k G_q^k(x),   (1)

G_q^k(x) = R(\|x - C_q^k\| / w_q^k).   (2)

Here, Q_k is the number of hidden nodes associated with class k, q refers to the qth class k hidden node, G_q^k(x) is the response function of the qth hidden node for class k, R is a radially symmetric kernel function, C_q^k = (C_{q1}^k, ..., C_{qN}^k) and w_q^k are the center and width of the qth kernel function for class k, and h_q^k is the weight connecting the qth hidden node for class k to the kth output node. Generally, a Gaussian with unit normalization is chosen as the kernel function:

G_q^k(x) = \exp\left[-\tfrac{1}{2}\sum_{n=1}^{N} (C_{qn}^k - x_n)^2 / (w_q^k)^2\right].   (3)

The basic idea of the proposed method is to cover a class region with a set of Gaussians of varying widths and centers. The output function F_k(x) for class k, a linear combination of Gaussians, is said to cover the region of class k if it is slightly positive (F_k(x) \ge \epsilon) for patterns in that class and zero or negative for patterns outside that class. Suppose Q_k Gaussians are required to cover the region of class k in this fashion. The covering (masking) function F_k(x) for class k is given by eqn (1). An input pattern x, therefore, may be determined to be in class k if F_k(x) \ge \epsilon and not in class k if F_k(x) \le 0. This condition, however, is not sufficient, and stronger conditions are stated later.

When the effect of a Gaussian unit is small, it can be safely ignored. This idea of ignoring small Gaussian outputs leads to the definition of a truncated Gaussian unit as

\bar{G}_q^k(x) = G_q^k(x)  if G_q^k(x) > \phi,
             = 0          otherwise,   (4)

where \bar{G}_q^k(x) is the truncated Gaussian function and \phi a small constant. In computational experiments, \phi was set to 10^{-3}. Thus, the function F_k(x) is redefined in terms of \bar{G}_q^k(x) as

F_k(x) = \sum_{q=1}^{Q_k} h_q^k \bar{G}_q^k(x),   (5)
where F_k(x) now corresponds to the output of a RBF net with "truncated" RBF units. So, in general, Q_k is the number of Gaussians required to cover class k, k = 1, ..., K, F_k(x) is the covering function (mask) for class k, and G_1^k(x), ..., G_{Q_k}^k(x) are the corresponding Gaussians. Then an input pattern x' will belong to class k iff its mask F_k(x') is at least slightly positive, and the masks for all other classes are zero or negative. This is the necessary and sufficient condition for x' to belong to class k. Here, each mask F_k(x), k = 1, ..., K, will have its own threshold value \epsilon_k as determined during its construction. Expressed in mathematical notation, an input pattern x' is in class k iff F_k(x') \ge \epsilon_k and F_j(x') \le 0 for all j \ne k, j = 1, ..., K. If all masks have values equal to or below zero, the input cannot be classified. If masks from two or more classes have values above their \epsilon-thresholds, then also the input cannot be classified, unless the maximum of the mask values is used to determine class ("ambiguity rejection").

Let TR_k be the set of pattern vectors of any class k for which masking is desired and \bar{TR}_k be the corresponding set of nonclass k vectors, where TR = TR_k \cup \bar{TR}_k is the total training set. As before, suppose Q_k Gaussians of varying widths and centers are available to cover class k. The following linear program is solved to determine the Q_k weights h^k = (h_1^k, ..., h_{Q_k}^k) for the Q_k Gaussians that minimize the classification error:

minimize  \alpha \sum_{x_i \in TR_k} d(x_i) + \beta \sum_{x_i \in \bar{TR}_k} d(x_i)   (6)

subject to

F_k(x_i) + d(x_i) \ge \epsilon_k,  x_i \in TR_k,   (7)

F_k(x_i) - d(x_i) \le 0,  x_i \in \bar{TR}_k,   (8)

d(x_i) \ge 0,  x_i \in TR,   (9)

\epsilon_k \ge a small positive constant,   (10)

h^k in F_k(x) unrestricted in sign,   (11)

where the d(x_i) are external deviation variables and \alpha and \beta are the weights for the in-class and out-of-class deviations, respectively.
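The mask evaluation and the weight-fitting LP above can be made concrete with a small sketch. The following Python code is illustrative only and is not the authors' implementation; it assumes NumPy/SciPy, treats \epsilon_k as a fixed small constant rather than as the LP variable of eqn (10), and uses hypothetical function and parameter names.

```python
# Illustrative sketch only: truncated-Gaussian mask of eqns (3)-(5) and the
# weight-fitting LP of eqns (6)-(9), solved with scipy.optimize.linprog.
import numpy as np
from scipy.optimize import linprog

PHI = 1e-3  # truncation threshold phi; set to 10^-3 in the paper's experiments

def truncated_gaussians(X, centers, widths):
    """Truncated Gaussian responses, eqns (3)-(4).  X: (P, N) patterns."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-0.5 * d2 / widths[None, :] ** 2)   # Gaussian of eqn (3)
    G[G <= PHI] = 0.0                              # truncation of eqn (4)
    return G                                       # shape (P, Q_k)

def fit_mask_weights(X, in_class, centers, widths, eps=0.1, alpha=1.0, beta=1.0):
    """Solve the LP: minimize weighted deviations so that the mask is >= eps on
    class-k patterns and <= 0 elsewhere.  Here eps plays the role of eps_k but is
    treated as a fixed constant instead of an LP variable (a simplification)."""
    G = truncated_gaussians(X, centers, widths)
    P, Q = G.shape
    # variables z = [h_1..h_Q (free in sign, eqn 11), d_1..d_P (>= 0, eqn 9)]
    c = np.concatenate([np.zeros(Q), np.where(in_class, alpha, beta)])  # objective (6)
    # eqn (7):  F_k(x_i) + d_i >= eps   ->  -G_i h - d_i <= -eps   (in-class rows)
    # eqn (8):  F_k(x_i) - d_i <= 0     ->   G_i h - d_i <= 0      (out-of-class rows)
    sign = np.where(in_class, -1.0, 1.0)[:, None]
    A_ub = np.hstack([sign * G, -np.eye(P)])
    b_ub = np.where(in_class, -eps, 0.0)
    bounds = [(None, None)] * Q + [(0, None)] * P
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:Q]                               # the Q_k weights h^k

def mask(x, h, centers, widths):
    """F_k(x) of eqn (5): x is assigned to class k only if mask(x) >= eps_k
    and every other class's mask is <= 0."""
    return truncated_gaussians(np.atleast_2d(x), centers, widths)[0] @ h
```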
3.1. Generation of Gaussian Units

The network constructed by this algorithm deviates from a typical RBF net. For example, there is truncation at the hidden nodes and the output nodes use a hard limiting nonlinearity [for the kth output node, the output is 1 if F_k(x) \ge \epsilon_k, and 0 otherwise]. In addition, the Gaussians here are not viewed as purely local units, because a purely local view generally results in a very large net. An explicit attempt is made by the algorithm to obtain good generalization. For that purpose, a variety of overlapping Gaussians (different centers and widths) are created to act both as global and local feature detectors and to help map out the territory of each class
with the least number of Gaussians. Though both "fat" (i.e., ones with large widths) and "narrow" Gaussians can be created, the "fat" ones, which detect global features, are created and explored first to see how well the broad territorial features work. The Gaussians, therefore, are generated incrementally, and they become narrow local feature detectors in later stages. As new Gaussians are generated for a class at each stage, the LP model [eqns (6)-(11)] is solved using all of the Gaussians generated up to that stage and the resulting mask is evaluated. Whenever the incremental change in the error rate (training and testing) becomes small or overfitting occurs on the training set, masking of the class is determined to be complete and the appropriate solution for the weights is retrieved.

The Gaussians for a class k are generated incrementally (in stages) and several Gaussians can be generated in a stage. Let h (= 1, 2, 3, ...) denote a stage of this process. A stage is characterized by its majority criterion, a parameter that controls the nature of the Gaussians generated (fat or narrow). A majority criterion of 60% for a stage implies that a randomly generated pattern cluster at that stage, which is to be used to define a Gaussian for class k, must have at least 60% of its patterns belong to class k. Let \theta_h denote the majority criterion for stage h. In the algorithm, \theta_h starts at 50% (the stage 1 majority criterion) and can increase up to 100% in, say, increments of 10%. Thus, the method will have a maximum of six stages (\theta_h = 50%, 60%, ..., 100%) when the increment is 10%. A 50% majority criterion allows for the creation of "fatter" Gaussians compared to, say, a 90% majority criterion and thus can detect global features in the pattern set that might not otherwise be detected by narrow Gaussians of a higher majority criterion.

The Gaussians for a given class k at any stage h are randomly selected in the following way. Randomly pick a pattern vector x_i of class k from the training set and search for all pattern vectors in an expanding \delta-neighborhood of x_i. The \delta-neighborhood of x_i is expanded as long as class k patterns in the expanded neighborhood retain the minimum majority of \theta_h for stage h. The neighborhood expansion is stopped when class k loses its required majority or when a certain maximum neighborhood radius \delta_max is reached. When the expansion stops, the class k patterns in the last \delta-neighborhood are used to define a Gaussian and are then removed from the training set. To define the next Gaussian, another class k pattern x_i is randomly selected from the remaining training set and its \delta-neighborhood is similarly grown to its limits, as explained above. This process of randomly picking a pattern vector x_i of class k from the remaining training set and searching for pattern vectors in an expanding neighborhood of x_i to define the next Gaussian is then repeated until the remaining training set is empty of class k vectors.

The process of generating a Gaussian starts with an
initial neighborhood of radius \delta_0 and then enlarges the neighborhood in fixed increments of \Delta\delta (\delta_r = \delta_{r-1} + \Delta\delta). Here \delta_r is the neighborhood radius at the rth growth step. Let V_j^r be the set of pattern vectors within the \delta_r-neighborhood of starting vector x_i for the jth Gaussian being generated at stage h. A neighborhood size can be increased only if the current pattern set V_j^r from the \delta_r-neighborhood satisfies the majority criterion and if \delta_r < \delta_max. Otherwise, further expansion is stopped. At any growth step r, if the current pattern set V_j^r fails the majority criterion, the previous set V_j^{r-1} (if there is one) is used to create the Gaussian. When a Gaussian is created from either V_j^r or V_j^{r-1}, the centroid of the class k pattern vectors in the set becomes the center C_q^k and the standard deviation of their distances from C_q^k becomes w_q^k, assuming the Gaussian being defined is the qth Gaussian for class k, where q is the cumulative total number of Gaussians generated over all of the past and current stages. When the number of patterns in a set V_j^r or V_j^{r-1} is less than a certain minimum, no Gaussian is created; however, the class k patterns in the set are removed from the remaining training set.
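As an illustration of this neighborhood-growing step, the sketch below (again illustrative, not the authors' code; names and defaults are assumptions) grows one \delta-neighborhood around a class-k seed and returns the resulting center and width.

```python
# Illustrative sketch of creating one Gaussian by expanding a delta-neighborhood
# around a randomly chosen class-k seed pattern (Section 3.1).
import numpy as np

def grow_gaussian(X, y, k, seed_idx, theta, delta0, delta_max, n_steps, min_pts):
    """Return (center, width, class_k_members).  center/width are None when no
    Gaussian is created; class_k_members are the class-k points to drop from
    the remaining training set in either case."""
    dist = np.linalg.norm(X - X[seed_idx], axis=1)
    radii = np.linspace(delta0, delta_max, n_steps)     # delta_r grows in fixed steps
    last_good = None
    for delta in radii:
        members = np.where(dist <= delta)[0]            # candidate set V_j^r
        if np.mean(y[members] == k) < theta:            # majority criterion PC_j^r(k)
            break                                       # fall back to the previous set
        last_good = members
    if last_good is None:                               # majority lost at the first radius:
        members = np.where(dist <= radii[0])[0]         # remove its class-k points but
        return None, None, members[y[members] == k]     # create no Gaussian
    in_k = last_good[y[last_good] == k]
    if len(in_k) < min_pts:                             # too few points: no Gaussian
        return None, None, in_k
    center = X[in_k].mean(axis=0)                       # centroid of the class-k members
    d = np.linalg.norm(X[in_k] - center, axis=1)
    width = d.std()                                     # std of distances from the centroid
    return center, width, in_k
```

Repeating this with fresh class-k seeds, and removing the returned class-k members from the remaining training set after each call, produces the Gaussians of one stage.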
3.2. The Algorithm

The algorithm is stated below; a code sketch of its main loop follows the listing. The following notation is used. I and R denote the initial and remaining training sets, respectively. \delta_max is the maximum neighborhood radius, \delta_r is the neighborhood radius at the rth growth step, and \Delta\delta is the \delta_r increment at each growth step. V_j^r is the set of pattern vectors within the \delta_r-neighborhood of starting vector x_i for the jth Gaussian of any stage h. PC_j^r(k) denotes the percentage of class k members in V_j^r. N_j^r denotes the number of vectors in V_j^r. h is the stage counter, \theta_h is the minimum percentage of class k members in stage h, and \Delta\theta is the increment for \theta_h at each stage. S_k corresponds to the cumulative set of Gaussians created for class k. C_q^k and w_q^k are the center and width, respectively, of the qth Gaussian for class k. TRE_h and TSE_h are the training and testing set errors, respectively, at the hth stage for the class being masked. \beta is the minimum number of patterns required in V_j^r to form a Gaussian, and \rho is the maximum of the class standard deviations, which are the standard deviations of the distances from the centroid of the patterns of each class. \delta_max is set to some multiple of \rho. The fixed increment \Delta\delta is set to some fraction of \delta_max - \delta_0, \Delta\delta = (\delta_max - \delta_0)/s, where s is the desired number of growth steps; s was set to 25 and \delta_max was set to 10\rho for computational purposes.
The Gaussian Masking (GM) Algorithm
(0) Initialize constants: \delta_max = 10\rho, \Delta\theta = some constant (e.g., 10%), \delta_0 = some constant (e.g., 0 or 0.1\rho), \Delta\delta = (\delta_max - \delta_0)/s.
(1) Initialize class counter: k = 0.
(2) Increment class counter: k = k + 1. If k > K, stop. Else, initialize cumulative Gaussian counters: S_k = 0 (empty set), q = 0.
(3) Initialize stage counter: h = 0.
(4) Increment stage counter: h = h + 1. Increase majority criterion: if h > 1, \theta_h = \theta_{h-1} + \Delta\theta; otherwise \theta_h = 50%. If \theta_h > 100%, go to (2) to mask the next class.
(5) Select Gaussian units for the hth stage: j = 0, R = I.
  (a) Set j = j + 1, r = 1, \delta_r = \delta_0.
  (b) Select an input pattern vector x_i of class k at random from R, the remaining training set.
  (c) Search for all pattern vectors in R within a \delta_r radius of x_i. Let this set of vectors be V_j^r.
    (i) if PC_j^r(k) < \theta_h and r > 1, set r = r - 1, go to (e);
    (ii) if PC_j^r(k) \ge \theta_h and r > 1, go to (d) to expand the neighborhood;
    (iii) if PC_j^r(k) < \theta_h and r = 1, go to (h);
    (iv) if PC_j^r(k) \ge \theta_h and r = 1, go to (d) to expand the neighborhood.
  (d) Set r = r + 1, \delta_r = \delta_{r-1} + \Delta\delta. If \delta_r > \delta_max, set r = r - 1, go to (e). Else, go to (c).
  (e) Remove the class k patterns of the set V_j^r from R. If N_j^r < \beta, go to (g).
  (f) Set q = q + 1. Compute the center C_q^k and width w_q^k of the qth Gaussian for class k and add the qth Gaussian to the set S_k: C_q^k = centroid of the class k patterns in the set V_j^r, and w_q^k = standard deviation of the distances from the centroid C_q^k of the class k patterns in V_j^r.
  (g) If R is not empty of class k patterns, go to (a); else go to (6).
  (h) Remove the class k patterns of the set V_j^r from R. If R is not empty of class k patterns, go to (a); else go to (6).
(6) From the set S_k, eliminate similar Gaussians (i.e., those with very close centers and widths). Let Q_k be the number of Gaussians after this elimination.
(7) Solve LP (6)-(11) for the class k mask using the Q_k Gaussians.
(8) Compute TSE_h and TRE_h for class k. If h = 1, go to (4). Else:
  (a) If TSE_h < TSE_{h-1}, go to (4).
  (b) If TSE_h > TSE_{h-1} and TRE_h > TRE_{h-1}, go to (4).
  (c) Otherwise, overfitting has occurred. Use the mask generated in the previous stage as the class k mask. Go to (2) to mask the next class.
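The sketch below ties the pieces together for one class. It is a simplified, illustrative rendering of the loop, not the authors' code: it reuses the hypothetical helpers truncated_gaussians, fit_mask_weights, and grow_gaussian sketched earlier, omits the duplicate-Gaussian elimination of step (6), and simplifies the stopping test of step (8).

```python
# Illustrative sketch of the GM stage loop for one class k.
import numpy as np

def error_count(Xs, ys, k, h, C, W, eps):
    """Errors of the hard-limited class-k mask on a labelled set."""
    F = truncated_gaussians(Xs, C, W) @ h
    return int(np.sum((F >= eps) != (ys == k)))

def mask_class(X, y, k, X_ctrl, y_ctrl, delta0, delta_max, n_steps, min_pts,
               d_theta=0.10, eps=0.1):
    centers, widths = [], []                      # cumulative Gaussians over all stages
    best, prev_tre, prev_tse = None, np.inf, np.inf
    theta = 0.50                                  # stage-1 majority criterion
    while theta <= 1.0:
        remaining = np.arange(len(X))             # R = I at the start of each stage
        while np.any(y[remaining] == k):
            seed = np.random.choice(np.where(y[remaining] == k)[0])   # local index
            c, w, used = grow_gaussian(X[remaining], y[remaining], k, seed,
                                       theta, delta0, delta_max, n_steps, min_pts)
            if c is not None:
                centers.append(c); widths.append(w)
            remaining = np.delete(remaining, used)            # drop used class-k points
        C, W = np.array(centers), np.array(widths)
        h = fit_mask_weights(X, y == k, C, W, eps=eps)        # LP (6)-(11)
        tre = error_count(X, y, k, h, C, W, eps)              # TRE_h
        tse = error_count(X_ctrl, y_ctrl, k, h, C, W, eps)    # TSE_h (control test set)
        if tse > prev_tse and tre <= prev_tre:                # simplified overfitting test
            return best                                       # previous stage's mask
        best, prev_tre, prev_tse = (h, C, W), tre, tse
        theta += d_theta                                      # next, narrower stage
    return best
```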
Other stopping criteria, like maximum number of Gaussians used or incremental change in TSE, can also be used.

The GM algorithm needs a representative set of examples, other than the training set, to design and train an appropriate net. This set can be called the validation set or control test set. This control test set can sometimes be created by setting aside examples from the training set itself, when enough training examples are available. That has been done for the four overlapping Gaussian distribution test problems in Section 4, where independent control and test sets were created to test the error rate independently during and after training. For the other four test problems, because the number of training examples is limited, the test sets themselves were used as control test sets. If the training and control test sets are representative of the population, no overfitting to a particular (control) test set should occur. Furthermore, the algorithm does not try to minimize the error on the control test set; explicit error minimization is done only on the training set. The experimental results show that the GM algorithm does not always obtain the best test result compared to the other algorithms.

For multiclass problems, all classes are masked according to the algorithm. For a K class problem, one needs to test a pattern for only (K - 1) classes. If the pattern is not from one of the tested classes, it is assigned to the remaining class. Thus, for a K class problem, one needs to construct a net with the optimal set of Gaussians from the (K - 1) classes that produce the best error rate. The best (K - 1) class combination is selected by testing out all the different combinations. For a two-class problem, both classes are masked and the best class is selected to construct the net.

Polynomial time convergence of the GM algorithm is proved next.
3.3. Polynomial Time Convergence of the Algorithm

PROPOSITION. The Gaussian Masking (GM) algorithm terminates in polynomial time.

PROOF. Let M be the number of completed stages of the GM algorithm at termination for a class. The largest number of linear programs is solved when \theta_h reaches its terminal value of 100%. In this worst case, with a fixed increment of \Delta\theta for \theta_h, the number of completed stages (M) would be (100% - 50%)/\Delta\theta = 50%/\Delta\theta. At each stage, new Gaussian units are generated. Let \beta = 2, so that only a minimum of two points is required per Gaussian. Suppose, in a worst case scenario, only two-point Gaussians that satisfy the majority criteria are generated at each stage and, without any loss of generality, assume that P/K, the average number of patterns per class, is even. Thus, in this worst case, a total of P/2K 100% majority Gaussians are produced from the P/K examples of a class if \theta_h > 50%. (For \theta_h = 50%, P/K Gaussians can be produced, but that case is ignored to simplify this derivation.) At the hth stage, a total of hP/2K Gaussians are accumulated. Assume that these hP/2K Gaussians, which are randomly generated, have different centers and widths and, therefore, are unique. Further assume that to obtain each two-point Gaussian, the neighborhood radius \delta_r needs to grow, at most, s times. In this worst case, therefore, to generate the P/2K two-point Gaussians at each stage,

Z = s\left[P + (P - 2) + (P - 4) + \cdots + \{P - (P/K - 2)\}\right]

distances are computed and compared with \delta_r to find the points within the \delta_r-neighborhood; this sum has P/2K terms, so Z is a polynomial function of P. For M stages, MZ distances are computed and compared, which is again a polynomial function of P.

Because only Gaussian units are used for masking, the number of variables in the LP (6)-(11) is equal to (hP/2K + P) at the hth stage. The hP/2K LP variables represent the weights of the Gaussians and the other P variables correspond to the deviation variables. Let T_h = hP/2K + P = t_h P (where t_h = h/2K + 1) be the total number of variables and L_h be the binary encoding length of the input data for the LP in stage h. For the LP in eqns (6)-(11), the number of constraints is always equal to P. The binary encoding length of the input data for each constraint is proportional to T_h in stage h if Gaussian truncation is ignored. Hence, L_h \propto P T_h \propto t_h P^2, or L_h = \sigma t_h P^2, where \sigma is a proportionality constant. Khachian's method (1979) solves a linear program in O(L^2 T^4) arithmetic operations, where L is the binary encoding length of the input data and T is the number of variables in the LP. Karmarkar's method (1984) solves a linear program in O(L T^{3.5}) arithmetic operations. More recent algorithms (Todd & Ye, 1990; Monteiro & Adler, 1989) have a complexity of O(L T^3). Using the more recent results, the worst case total LP solution time for all M stages is proportional to

\sum_{h=1}^{M} O[L_h T_h^3] = \sum_{h=1}^{M} O[(\sigma t_h P^2)(t_h P)^3] = \sum_{h=1}^{M} O[t_h^4 P^5],

which again is a polynomial function of P. Thus, both Gaussian generation and LP solutions can be done in polynomial time. ∎
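Closing the sum makes the overall order explicit; the following is only a restatement of the bound just derived, using M = 50%/\Delta\theta and t_h = h/2K + 1 as defined above:

\sum_{h=1}^{M} O[t_h^4 P^5] \le M \cdot O[t_M^4 P^5] = O\!\left[M\left(\frac{M}{2K} + 1\right)^{4} P^5\right],

so, because M and K do not depend on the number of training examples P, the total worst case LP solution time is O(P^5).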
In practice, LP solution times have been found to grow at a much slower rate, perhaps about O(P^3), if not less.

4. COMPUTATIONAL RESULTS

This section presents computational results on a variety of problems that have appeared in the literature. All problems were solved on a SUN Sparc 2 workstation. Linear programs were solved using Roy Marsten's OB1 interior point code from the Georgia Institute of Technology. OB1 has a number of interior point methods implemented, and the dual log barrier penalty method was
used. The weights in the LP (6)-(11), \alpha and \beta, were set to 1 in all cases. The problems were also solved with the RCE method of Reilly, Cooper, and Elbaum (1982), the standard RBF method of Moody and Darken (1988, 1989), and the conjugate gradient method for multilayer perceptrons (MLPs) (Rumelhart, Hinton, & Williams, 1986), and the computational results are reported. One of the commercial versions of back propagation was tried on several of the test problems and the results were miserable. It was then decided to formulate the MLP weight training problem as a purely nonlinear unconstrained optimization problem, and the Polak-Ribiere conjugate gradient method was used to solve it (Luenberger, 1984). A two-layer, fully connected net (a single hidden layer net) and a standard number of hidden nodes (2, 5, 10, 15, and 20) were used on all problems. The starting weights were generated randomly in all cases except for the breast cancer problem, which, for some reason, worked only with zero starting weights; otherwise it got stuck close to the initial solution. The sonar and vowel recognition problems could not be solved even after trying various starting weights and different numbers of hidden units. Lee (1989) reports similar difficulties on the vowel recognition problem with the conjugate gradient method.

For standard RBF, K-means clustering was used to obtain the Gaussian centers and the LMS rule was used for weight training. The width or standard deviation of a Gaussian was set to some multiple (1, 2, and 3 were the multipliers used) of the distance of its center from the center of its nearest neighbor. All problems were solved with a standard set of RBF nodes (5, 10, ..., 100). The RCE algorithm was run with different levels of pruning; the pruning criterion was specified by the minimum number of points per hypersphere, which was set to some percentage of the in-class points. For all three algorithms, the runs were thus standardized. That is, the choices for the number of hidden nodes (RBF, MLP), the RBF Gaussian widths, and the pruning level for RCE were all standardized. The main reason for this is that, without standardized runs, one gets into a trial-and-error manual optimization process for these net parameters [number of hidden nodes (RBF, MLP), RBF node width, pruning level for RCE] to obtain the best error rate on a test set. That is not a desirable way to evaluate and compare algorithms. This also implies that the best possible results were not obtained for these algorithms. The paper, therefore, wherever possible, quotes better results obtained with these algorithms by other researchers that are known to the authors. All algorithms were run on a SUN Sparc 2 workstation.
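For reference, the standard-RBF baseline described above can be sketched as follows. The code is illustrative only (not the comparison code actually used); it assumes one-hot class targets, uses SciPy's K-means routine, and uses hypothetical parameter names.

```python
# Illustrative sketch of the standard-RBF baseline used for comparison:
# K-means centers, widths tied to the nearest-neighbour center distance,
# and LMS (Widrow-Hoff) training of the output weights.
import numpy as np
from scipy.cluster.vq import kmeans2

def train_standard_rbf(X, T, n_units, width_mult=2.0, lr=1e-4, epochs=50):
    """X: (P, N) patterns; T: (P, K) one-hot class targets."""
    centers, _ = kmeans2(X, n_units, minit="points")
    nn_dist = np.array([np.sort(np.linalg.norm(centers - c, axis=1))[1] for c in centers])
    widths = width_mult * np.maximum(nn_dist, 1e-12)   # multiple of nearest-neighbour distance
    G = np.exp(-((X[:, None, :] - centers[None]) ** 2).sum(-1) / (2 * widths ** 2))
    W = np.zeros((n_units, T.shape[1]))                # output weights, trained by LMS
    for _ in range(epochs):
        for g, t in zip(G, T):
            W += lr * np.outer(g, t - g @ W)           # Widrow-Hoff update
    return centers, widths, W
```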
4.1. Overlapping Gaussian Distributions

The algorithm was tested on a class of problems where the patterns of a class are normally distributed in the multiple input dimensions. They are two-class problems with overlapping regions. All problems were tried with randomly generated training and test sets. Because they are two-class problems, only one of the classes needs to be masked.

Problem 1: The I-I Problem. A simple two-class problem where the classes are described by Gaussian distributions with different means and identity covariance matrices. A four-dimensional problem with mean vectors [0 0 0 0] and [1 1 1 1] was tried. The Bayes error is about 15.86% in this case. Tables 1A and 9 show that an error rate of 17.90% was obtained by the GM algorithm using 18 Gaussians. Tables 1B-D and 9 show the results for the other algorithms. MLP had the best error rate of 16.51%.

Problem 2: The I-4I Problem. Another four-dimensional two-class problem where the classes are described by Gaussian distributions with zero mean vectors and covariance matrices equal to I and 4I. The optimal classifier here is quadratic, and the optimal Bayes error is 17.64%. Tables 2A and 9 show that an error rate of 17.97% was obtained by the GM algorithm using five Gaussians. Tables 2B-D and 9 show the results for the other algorithms.

Problem 3. This problem is similar to problem 2, except that it is eight-dimensional instead of four. Optimal
TABLE 1
Overlapping Gaussian Distributions: Problem 1

1A: GM Algorithm Results, Problem 1 (Mask Class 0)

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total/Incl./Outcl.) | Control Test Set Error (Total/Incl./Outcl.) | Independent Test Set Error, 10,000 Pts (Total/Incl./Outcl.) | Time (s), Ph. I/Ph. II
50 | 1 | 1 | 179/15/164 | 500/500/0 | 3471/348/3123 | 9/383
60 | 5 | 6 | 78/51/27 | 178/118/60 | 1891/1249/642 | 9/807
70 | 5 | 11 | 73/49/24 | 178/104/74 | 1813/1200/613 | 8/1050
80 | 7 | 18 | 71/48/23 | 176/117/59 | 1790/1169/621 | 11/1176
90 | 12 | 30 | 67/43/24 | 188/120/68 | 1830/1126/704 | 19/1275
TABLE 1 Continued

1B: RCE Results, Problem 1

Pruning | No. of Hyperspheres | Training Set Error (Total/Incl./Outcl.) | Independent Test Set Error, 10,000 Pts (Total/Incl./Outcl.) | Training Time (s)
No pruning (min. 1 point) | 109 | 62/23/39 | 2204/1054/1150 | 3.3
2% pruning (min. 5 points) | 53 | 79/62/17 | 1947/1327/620 | 5.4
4% pruning (min. 10 points) | 37 | 85/74/11 | 2013/1555/458 | 5.4
6% pruning (min. 15 points) | 23 | 94/86/8 | 2174/1790/384 | 5.5
1C: Standard RBF Results, Problem 1

The 11 value lines below give, in order: No. of Gaussians; Width Multiplier; Learning Rate; Training Set Error (Total, Incl., Outcl.); Test Set Error on 10,000 examples (Total, Incl., Outcl.); Time (s) (Ph. I, Ph. II). The ith value in each line belongs to the ith run; runs 1-11, 12-22, and 23-33 use width multipliers 1, 2, and 3, respectively, each with 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 Gaussians.
5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100
1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
180 170 151 136 148 148 129 144 109 132 250 249 202 98 98 92 99 89 94 90 86 95 251 254 111 96 88 74 81 221 230 195 161
51 49 44 47 44 37 41 39 37 40 0 9 94 56 55 50 52 40 49 49 47 49 10 12 65 54 49 42 44 191 226 194 7
129 121 107 89 104 111 88 105 72 92 250 240 108 42 43 42 47 49 45 41 39 46 241 242 46 42 39 32 37 30 4 1 154
3782 3544 3218 2750 3031 3175 2788 3039 2422 2896 5000 4969 4212 2053 2154 1987 2150 1868 2047 1898 1890 2081 4993 5035 2174 2057 1917 1743 1727 4241 4561 3838 2876
1008 964 886 941 847 803 849 791 833 854 0 120 1798 1108 1115 1046 1057 1013 1033 1023 1004 1079 127 121 1217 1117 1015 945 895 3663 4450 3784 95
2774 2580 2332 1809 2184 2372 1909 2248 1589 2042 5000 4849 2414 945 1039 941 1093 855 1014 875 886 1002 4866 4914 957 940 902 798 832 578 111 54 2781
1 2 4 7 12 9 11 17 16 14 14 1 2 4 7 12 9 11 17 16 14 22 2 2 4 7 12 9 11 17 16 14 22
7 9 19 30 29 38 60 57 108 85 22 3 6 22 18 23 22 30 28 34 36 28 3 3 24 21 22 33 29 12 13 19 46
1D: Multilayer Perceptron Results, Problem 1

No. of Hidden Nodes | Training Set Error (Total/Incl./Outcl.) | Test Set Error, 10,000 Examples (Total/Incl./Outcl.) | Training Time (s)
2 | 76/38/38 | 1665/890/775 | 51
5 | 66/39/27 | 1653/900/753 | 188
10 | 66/38/28 | 1651/903/748 | 637
15 | 76/38/38 | 1664/895/769 | 2190
20 | 288/38/250 | 5898/848/5000 | 2682
Description: two classes {0, 1 }, four dimensions, different centers. Training examples = 250 + 250 = 500; control test set = 500 + 500 = 1000; independent test set = 5000 + 5000 = 10,000. Minimum number of points per Gaussian = 4% of in-class points = 10. Nomenclature: Incl., number of in-class points classified as out-of-class; Outcl., number of out-of-class points classified as in-class; Total, total number of errors; Ph. I, generation of Gaussians (centers and widths) (GM and standard RBF algorithms); Ph. II, determination of weights (by LP or LMS method, as appropriate).
TABLE 2
Overlapping Gaussian Distributions: Problem 2

2A: GM Algorithm Results, Problem 2 (Mask Class 0)

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total/Incl./Outcl.) | Control Test Set Error (Total/Incl./Outcl.) | Independent Test Set Error, 10,000 Pts (Total/Incl./Outcl.) | Time (s), Ph. I/Ph. II
50 | 1 | 1 | 90/10/80 | 218/15/203 | 2151/159/1992 | 9/385
60 | 1 | 2 | 91/49/42 | 195/95/100 | 1855/889/966 | 4/578
70 | 1 | 3 | 74/27/47 | 193/63/130 | 1799/573/1226 | 4/524
80 | 2 | 5 | 76/29/47 | 190/61/129 | 1797/584/1213 | 5/676
90 | 5 | 10 | 74/28/46 | 204/75/129 | 1819/677/1142 | 14/771
2B: RCE Results, Problem 2

Pruning | No. of Hyperspheres | Training Set Error (Total/Incl./Outcl.) | Independent Test Set Error, 10,000 Pts (Total/Incl./Outcl.) | Training Time (s)
None, min. 1 point | 209 | 48/1/47 | 2329/623/1706 | 6
2%, min. 5 points | 121 | 65/36/29 | 2270/892/1378 | 10
4%, min. 10 points | 54 | 81/67/14 | 2446/1542/904 | 10
6%, min. 15 points | 34 | 112/102/10 | 2773/2147/626 | 10.6
2C: Standard RBF Results, Problem 2

The 11 value lines below give, in order: No. of Gaussians; Width Multiplier; Learning Rate; Training Set Error (Total, Incl., Outcl.); Test Set Error on 10,000 examples (Total, Incl., Outcl.); Time (s) (Ph. I, Ph. II). The ith value in each line belongs to the ith run; runs 1-11, 12-22, and 23-33 use width multipliers 1, 2, and 3, respectively, each with 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 Gaussians.
5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100
1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
250 249 287 248 208 176 193 194 187 220 199 250 250 106 116 92 102 90 100 86 101 103 250 250 250 270 250 250 272 282 259 260 250
0 0 6 0 22 25 20 18 21 11 17 0 0 17 18 24 25 29 36 31 33 31 0 0 0 22 0 0 27 250 250 250 0
250 249 231 248 186 151 173 176 166 209 182 250 250 89 98 68 77 61 64 55 68 72 250 250 250 248 250 250 245 32 9 10 250
5000 5015 4903 4997 4529 4109 4413 4505 4399 4793 4613 5000 5000 2621 2743 2429 2453 2130 2382 2148 2280 2315 5000 5000 5000 5427 5000 5000 5525 5649 5187 5186 5000
0 35 202 7 428 587 550 571 674 330 542 0 0 292 279 443 428 562 659 603 610 555 0 0 0 476 0 0 612 4999 5000 5000 0
5000 4980 4701 4990 4101 3522 3863 3934 3725 4463 4071 5000 5000 2329 2464 1986 2025 1568 1723 1545 1670 1760 5000 5000 5000 4951 5000 5000 4913 650 187 186 5000
1 3 4 4 8 15 10 14 17 16 15 2 2 4 4 8 15 10 14 18 16 16 2 3 4 4 8 14 10 14 18 16 16
22 48 62 53 97 133 138 175 198 192 253 4 3 54 58 44 50 42 44 55 49 51 3 3 3 3 5 5 10 9 6 7 5
TABLE 2 Continued

2D: Multilayer Perceptron Results, Problem 2

No. of Hidden Nodes | Training Set Error (Total/Incl./Outcl.) | Test Set Error, 10,000 Examples (Total/Incl./Outcl.) | Training Time (s)
2 | 137/40/97 | 3500/1036/2464 | 129
5 | 90/33/57 | 2454/775/1679 | 739
10 | 70/25/45 | 2077/734/1343 | 873
15 | 71/25/46 | 2031/731/1300 | 1314
20 | 61/25/36 | 2017/766/1251 | 2842
Description: two classes {0, 1}, four dimensions, same center. Training examples = 250 + 250 = 500; control test set = 500 + 500 = 1000; independent test set = 5000 + 5000 = 10,000. Minimum number of points per Gaussian = 4% of in-class points = 10.
Bayes error is 9% for this problem. Tables 3A and 9 show that an error rate of 9.19% was obtained by the GM algorithm using four Gaussians. Tables 3B-D and 9 show the results for the other algorithms. Musavi et al. (1992) solved this problem with their RBF method and achieved an error rate of 12% with 128 Gaussian units. The error rate was 13% with their MNN method (Musavi et al., 1993). They also report solving this problem with Specht's (1990) PNN method for an error rate of 25.69%, and that the back-propagation algorithm did not converge for over 40 training points. For converged BP nets, the error rate was higher than that of their MNN method. Hush and Salas (1990) report that the back-propagation algorithm obtained an error rate of 14.5% with 800 training examples and 30 hidden nodes (it also had the same error rate for 50 hidden nodes), and an error rate of 11% with 6400 training examples and 55-60 hidden nodes.

Problem 4. This is a two-class, two-dimensional problem where the first class has a zero mean vector with identity covariance matrix and the second class has a mean vector [1, 2] and a diagonal covariance matrix with entries of 0.01 and 4.0. The estimated optimal error rate for this problem is 6% (Musavi et al., 1992). Tables 4A and 9 show that an error rate of 7.70% was obtained by the GM algorithm using 16 Gaussians. Tables 4B-D and 9 show the results for the other algorithms. Musavi et al. (1992) solved this problem with their RBF method and achieved an error rate of 9.26% with 86 Gaussian units. Musavi et al. (1993) also solved this problem with their MNN method and Specht's (1990) PNN method, both of which achieved an error rate of about 8% when trained with 300 points. They again report that the back-propagation algorithm failed to converge, regardless of the number of layers and nodes, when provided with over 40 training samples.
4.2. Medical Diagnosis

Breast Cancer Detection. The breast cancer diagnosis problem is described in Mangasarian, Setiono, and
Wolberg (1990). The data is from the University of Wisconsin Hospitals and contains 608 cases. Each case has nine measurements made on a fine needle aspirate (fna) taken from a patient's breast. Each measurement is assigned an integer value between 1 and 10, with larger numbers indicating a greater likelihood of malignancy. Of the 608 cases, 379 were benign, and the rest malignant. Four hundred fifty of the cases were used for training and the rest were used for testing. Tables 5A and 9 show that an error rate of 3.94% was obtained by the GM algorithm using 11 Gaussians. Tables 5B-D and 9 show the results for the other algorithms. Bennett and Mangasarian (1992) report average error rates of 2.56% and 6.10% with their MSM1 and MSM methods, respectively.

Heart Disease Diagnosis. This problem is described in Detrano et al. (1989). This database contains 297 cases. Each case has 13 real-valued measurements and is classified either as positive or negative. One hundred ninety-eight of the cases were used for training and the rest were used for testing. Tables 6A and 9 show that an error rate of 18.18% was obtained by the GM algorithm using 24 Gaussians. Tables 6B-D and 9 show the results for the other algorithms. Bennett and Mangasarian (1992) report a test error of 16.53% with their MSM1 method, 25.92% with their MSM method, and about 25% error with back propagation.
4.3. Speech Classification

The vowel classification problem is described in Lippmann (1988). The data were generated from the spectrographic analysis of vowels in words formed by "h," followed by a vowel, followed by a "d," and consist of two-dimensional patterns. The words were spoken by 67 persons, including men, women, and children. The data on 10 vowels were split into two sets for training and testing. Lippmann (1988) tested four classifiers--KNN, Gaussian, two-layer perceptron, and feature map--on this data set. All classifiers had similar error
TABLE 3
Overlapping Gaussian Distributions: Problem 3

3A: GM Algorithm Results, Problem 3 (Mask Class 0)

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total/Incl./Outcl.) | Control Test Set Error (Total/Incl./Outcl.) | Independent Test Set Error, 10,000 Pts (Total/Incl./Outcl.) | Time (s), Ph. I/Ph. II
50 | 1 | 1 | 41/9/32 | 88/19/69 | 937/228/709 | 16/398
60 | 1 | 2 | 48/27/21 | 89/51/38 | 979/582/397 | 9/1199
70 | 1 | 3 | 43/22/21 | 89/44/45 | 937/468/469 | 9/745
80 | 1 | 4 | 42/19/23 | 88/36/52 | 919/375/544 | 7/999
90 | 2 | 6 | 42/19/23 | 84/33/51 | 927/384/543 | 19/1084
3B: RCE Results, Problem 3

Pruning | No. of Hyperspheres | Training Set Error (Total/Incl./Outcl.) | Independent Test Set Error, 10,000 Pts (Total/Incl./Outcl.) | Training Time (s)
None, min. 1 point | 176 | 37/2/35 | 1322/219/1103 | 9.2
2%, min. 5 points | 137 | 39/8/31 | 1306/255/1051 | 16.8
4%, min. 10 points | 96 | 42/17/25 | 1273/333/940 | 15.3
6%, min. 15 points | 69 | 40/20/20 | 1286/448/888 | 15.3
3C: Standard RBF Results, Problem 3

The 11 value lines below give, in order: No. of Gaussians; Width Multiplier; Learning Rate; Training Set Error (Total, Incl., Outcl.); Test Set Error on 10,000 examples (Total, Incl., Outcl.); Time (s) (Ph. I, Ph. II). The ith value in each line belongs to the ith run; runs 1-11, 12-22, and 23-33 use width multipliers 1, 2, and 3, respectively, each with 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 Gaussians.
5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100
1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
250 250 248 167 209 139 178 156 105 146 130 252 252 254 88 62 71 77 250 250 308 290 251 251 253 275 250 250 250 250 250 250 250
0 0 0 15 8 14 15 13 27 19 20 2 2 4 3 2 2 4 0 0 250 250 1 1 3 26 0 0 250 0 0 0 0
250 250 248 152 201 125 163 143 78 127 110 250 250 250 85 60 69 73 250 250 58 40 250 250 250 249 250 250 0 250 250 250 250
5000 5000 5002 3508 4431 3150 3653 3207 2283 3016 2963 5060 5089 5150 1872 1439 1411 1575 5000 5000 6090 5646 5026 5046 5108 5523 5000 5000 5000 5000 5000 5000 5000
0 0 10 474 181 501 431 395 621 482 572 60 90 151 90 111 90 71 0 0 5000 5000 26 46 109 528 0 0 5000 0 0 0 0
5000 5000 4992 3034 4250 2599 3222 2812 1662 2534 2391 5000 4999 4999 1782 1328 1321 1504 12 5000 1090 646 5000 5000 4999 4995 5000 5000 0 5000 5000 5000 5000
3 5 9 13 10 13 17 19 26 34 24 3 5 9 13 10 13 17 19 26 34 24 3 5 9 13 10 13 17 19 26 34 24
309 153 132 247 276 373 303 289 362 351 378 8 7 9 132 204 181 102 12 20 19 13 6 6 5 6 8 17 8 7 8 9 9
TABLE 3 Continued

3D: Multilayer Perceptron Results, Problem 3

No. of Hidden Nodes | Training Set Error (Total/Incl./Outcl.) | Test Set Error, 10,000 Examples (Total/Incl./Outcl.) | Training Time (s)
2 | 119/48/71 | 3404/1389/2015 | 73
5 | 59/24/35 | 2264/838/1426 | 6238
10 | 26/9/17 | 1683/570/1113 | 8325
15 | 16/2/14 | 1787/612/1175 | 41713
20 | 14/2/12 | 1555/580/975 | 39440

Description: two classes {0, 1}, eight dimensions, same center. Training examples = 250 + 250 = 500; control test set = 500 + 500 = 1000; independent test set = 5000 + 5000 = 10,000. Minimum number of points per Gaussian = 4% of in-class points = 10.

TABLE 4
Overlapping Gaussian Distributions: Problem 4
4A: GM Algorithm Results, Problem 4 (Mask Class 0)

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total/Incl./Outcl.) | Control Test Set Error (Total/Incl./Outcl.) | Independent Test Set Error, 10,000 Pts (Total/Incl./Outcl.) | Time (s), Ph. I/Ph. II
50 | 2 | 2 | 47/8/39 | 615/36/579 | 2151/397/1754 | 1/49
60 | 2 | 4 | 35/29/6 | 280/236/44 | 1359/1190/169 | 1/83
70 | 5 | 9 | 16/13/3 | 245/240/5 | 891/793/98 | 1/96
80 | 7 | 16 | 14/14/0 | 165/135/30 | 770/594/176 | 1/103
90 | 7 | 20 | 7/6/1 | 172/144/28 | 900/519/381 | 1/121
4B: RCE Results, Problem 4

Pruning | No. of Hyperspheres | Training Set Error (Total/Incl./Outcl.) | Independent Test Set Error, 10,000 Pts (Total/Incl./Outcl.) | Training Time (s)
None, min. 1 point | 15 | 16/10/6 | 1269/679/590 | 0.2
2%, min. 2 points | 15 | 16/10/6 | 1269/679/590 | 0.3
4%, min. 4 points | 11 | 20/14/6 | 1235/748/487 | 0.3
6%, min. 6 points | 11 | 20/14/6 | 1235/748/487 | 0.4
4C: Standard RBF Results, Problem 4

The 11 value lines below give, in order: No. of Gaussians; Width Multiplier; Learning Rate; Training Set Error (Total, Incl., Outcl.); Test Set Error on 10,000 examples (Total, Incl., Outcl.); Time (s) (Ph. I, Ph. II). The ith value in each line belongs to the ith run; this first block lists 16 runs: width multiplier 1 with 5-100 Gaussians and width multiplier 2 with 5-40 Gaussians.
5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40
1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
41 45 44 31 61 29 54 51 51 71 81 38 57 74 44 85
7 5 5 6 1 5 1 4 4 0 0 4 7 8 12 4
34 40 39 25 60 24 53 47 47 71 81 34 50 66 32 81
2417 2408 2352 1970 3114 1878 2810 2519 2519 3674 3958 2335 2889 3646 2402 4388
578 362 346 444 49 341 46 141 141 5 7 306 560 407 825 261
1839 2046 2006 1526 3065 1537 2764 2378 2378 3669 3951 2029 2329 3239 1577 4127
1 1 1 2 1 1 3 2 2 8 3 1 1 1 2 1
4 7 15 30 24 55 37 51 59 55 68 5 6 6 13 10
(Continued)
TABLE 4 Continued

4C: Standard RBF Results, Problem 4 (continued)

The 11 value lines below give, in order: No. of Gaussians; Width Multiplier; Learning Rate; Training Set Error (Total, Incl., Outcl.); Test Set Error on 10,000 examples (Total, Incl., Outcl.); Time (s) (Ph. I, Ph. II). This block lists the remaining 17 runs: width multiplier 2 with 50-100 Gaussians and width multiplier 3 with 5-100 Gaussians.
50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100
2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
71 53 64 70 45 46 43 96 51 70 76 72 50 68 73 35 59
9 8 10 8 7 9 2 0 8 6 11 10 14 12 11 16 14
62 45 54 62 38 37 41 96 43 64 65 62 36 56 62 19 45
3507 2665 2968 3442 2393 2505 2510 4877 2691 3389 3635 3550 2510 3198 3465 2072 2777
481 498 615 498 441 575 115 70 499 417 605 575 834 724 609 947 823
3026 2167 2353 2944 1952 1930 2395 4807 2192 2972 3030 2975 1676 2474 2856 1125 1954
2 3 2 2 8 3 1 1 1 2 1 2 3 2 2 8 3
16 19 28 28 30 49 8 5 9 7 10 14 20 22 23 32 38
4D: Multilayer Perceptron Results, Problem 4

No. of Hidden Nodes | Training Set Error (Total/Incl./Outcl.) | Test Set Error, 10,000 Examples (Total/Incl./Outcl.) | Training Time (s)
2 | 33/18/15 | 1737/987/750 | 46
5 | 31/18/13 | 1670/1016/654 | 98
10 | 31/20/11 | 1627/1033/594 | 571
15 | 33/18/15 | 1736/988/748 | 922
20 | 36/20/16 | 1875/1045/830 | 1236
Description: two classes {0, 1 }, two dimensions, different centers. Training examples = 100 + 100 = 200; control test set = 1000 + 1000 = 2000; independent test set = 5000 + 5000 = 10,000. Minimum number of points per Gaussian = 4% of in-class points = 4.
TABLE 5
Breast Cancer Problem

5A: GM Algorithm Results, Breast Cancer Problem (Mask Class 0)

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total/Incl./Outcl.) | Test Set Error (Total/Incl./Outcl.) | Time (s), Ph. I/Ph. II
50 | 1 | 1 | 15/12/3 | 10/7/3 | 16/118
60 | 1 | 1 | 15/12/3 | 10/7/3 | 15/118
70 | 1 | 2 | 14/10/4 | 10/7/3 | 15/151
80 | 1 | 3 | 14/10/4 | 10/7/3 | 12/168
90 | 1 | 4 | 14/10/4 | 9/6/3 | 11/222
100 | 7 | 11 | 10/9/1 | 8/7/1 | 18/236
5B: RCE Results, Breast Cancer Problem

Pruning | No. of Hyperspheres | Training Set Error (Total/Incl./Outcl.) | Test Set Error (Total/Incl./Outcl.) | Training Time (s)
None, min. 1 point | 27 | 13/3/10 | 10/4/6 | 1.3
2%, min. 4 points | 19 | 14/10/4 | 10/5/5 | 2.2
4%, min. 8 points | 17 | 16/14/2 | 9/7/2 | 2.2
6%, min. 12 points | 17 | 16/14/2 | 9/7/2 | 2.2
TABLE 5 Continued

5C: Standard RBF Results, Breast Cancer Problem

The 11 value lines below give, in order: No. of Gaussians; Width Multiplier; Learning Rate; Training Set Error (Total, Incl., Outcl.); Test Set Error (Total, Incl., Outcl.); Time (s) (Ph. I, Ph. II). The ith value in each line belongs to the ith run; runs 1-11, 12-22, and 23-33 use width multipliers 1, 2, and 3, respectively, each with 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 Gaussians.
5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100
1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
26 78 3 32 35 78 44 78 33 31 40 30 21 21 25 13 14 15 14 22 13 23 44 17 14 13 13 14 13 13 14 13 14
9 7 10 10 10 8 9 9 10 9 9 3 14 11 9 11 11 12 12 11 11 9 2 10 10 10 10 10 11 11 12 11 11
17 71 23 22 25 70 35 69 23 22 31 27 7 10 16 2 3 3 2 11 2 14 42 7 4 3 3 4 2 2 2 2 3
11 31 12 18 19 35 21 31 18 12 17 15 15 11 13 11 10 10 10 14 6 11 19 12 8 8 8 11 9 9 12 9 10
3 3 3 3 3 3 3 3 3 3 3 0 7 5 3 6 5 5 5 5 4 3 0 5 5 5 5 6 7 7 7 6 6
8 28 9 15 16 32 18 28 15 9 14 15 8 6 10 5 5 5 5 9 2 8 19 7 3 3 3 5 2 2 5 3 4
1 1 4 7 7 13 12 10 13 37 20 1 1 4 7 7 13 12 10 13 37 20 1 1 4 7 7 13 12 10 13 37 20
24 40 86 99 175 143 222 194 304 208 348 16 26 22 21 41 51 59 85 68 58 60 27 17 18 18 25 31 37 53 48 37 46
5D: Multilayer Perceptron Results, Breast Cancer Problem

No. of Hidden Nodes | Training Set Error (Total/Incl./Outcl.) | Test Set Error (Total/Incl./Outcl.) | Training Time (s)
2 | 13/10/3 | 6/4/2 | 563
5 | 13/10/3 | 6/4/2 | 2977
10 | 15/12/3 | 6/5/1 | 2324
15 | 12/10/2 | 6/4/2 | 4524
20 | 405/251/154 | 203/128/75 | 5125
Description: two classes {0, 1}, nine dimensions. Training examples = 405; testing examples = 203. Minimum number of points per Gaussian = 4% of in-class points = 8 points.
rates ranging from 18-22.8%. Tables 7A and 9 show that an error rate of 24.3% was obtained by the GM algorithm using 92 Gaussians. Tables 7B, 7C, and 9 show the results for the other algorithms. The conjugate gradient method could not solve this MLP training problem even with various starting points and hidden nodes. Lee (1989) also reports serious local minima problems with conjugate gradient training on MLPs
for the vowel problem. Lee (1989) reports obtaining an error rate of 21% with standard back-propagation training and 21.9% on an adaptive step size variation of BP on a single-layer net with 50 hidden nodes. Lee (1989) also reports that a hypersphere algorithm, where hyperspheres are allowed to both expand and contract, had an error rate of 23.1% with 55 hyperspheres after proper pruning. Lippmann (1988) reports solving the
TABLE 6
Heart Disease Problem

6A: GM Algorithm Results, Heart Disease Problem (Mask Class 0)

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total/Incl./Outcl.) | Test Set Error (Total/Incl./Outcl.) | Time (s), Ph. I/Ph. II
50 | 2 | 2 | 73/65/8 | 42/41/1 | 4/41
60 | 5 | 7 | 38/26/12 | 29/22/7 | 5/71
70 | 3 | 10 | 32/22/10 | 28/22/6 | 8/85
80 | 4 | 14 | 27/18/9 | 22/14/8 | 10/90
90 | 10 | 24 | 20/12/8 | 18/13/5 | 13/131
100 | 10 | 30 | 19/11/8 | 18/13/5 | 15/210
6B: RCE Results, Heart Disease Problem

Pruning | No. of Hyperspheres | Training Set Error (Total/Incl./Outcl.) | Test Set Error (Total/Incl./Outcl.) | Training Time (s)
None, min. 1 point | 53 | 19/4/15 | 29/18/11 | 2
2%, min. 2 points | 53 | 19/4/15 | 29/18/11 | 3.5
4%, min. 4 points | 32 | 25/19/6 | 29/21/8 | 3.0
6%, min. 6 points | 25 | 31/27/4 | 33/25/8 | 2.9
6C: Standard RBF Results, Heart Disease Problem

The 11 value lines below give, in order: No. of Gaussians; Width Multiplier; Learning Rate; Training Set Error (Total, Incl., Outcl.); Test Set Error (Total, Incl., Outcl.); Time (s) (Ph. I, Ph. II). The ith value in each line belongs to the ith run; runs 1-11, 12-22, and 23-33 use width multipliers 1, 2, and 3, respectively, each with 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100 Gaussians.
5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70 80 90 100
1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
77 79 79 77 62 57 75 75 76 74 74 70 73 61 73 46 66 78 68 65 49 71 72 61 51 47 51 50 44 75 52 50 47
0 0 0 0 1 1 0 0 0 0 0 1 0 2 0 3 0 0 1 2 5 1 0 0 1 3 1 1 4 0 2 5 7
77 79 79 77 61 56 75 75 76 74 74 69 73 59 73 43 66 78 67 63 44 70 72 61 50 44 50 49 40 75 50 45 40
52 54 52 52 49 47 52 52 51 51 51 50 56 46 50 41 49 52 98 48 43 50 50 45 41 39 41 42 36 51 44 42 37
0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 2 1 0 0 1 3 0 0 0 1 3 0 1 2 0 2 3 4
52 54 52 52 48 46 52 52 51 51 51 50 56 45 50 39 48 52 98 47 40 50 50 45 40 36 41 41 34 51 42 39 33
7 2 4 3 6 7 9 10 12 13 17 1 2 4 3 6 7 9 10 12 13 17 1 2 4 3 6 7 9 10 12 13 17
18 24 44 54 74 118 166 187 169 201 223 36 73 42 49 71 78 52 60 62 89 66 17 258 70 81 76 78 43 65 79 91 103
TABLE 6 Continued

6D: Multilayer Perceptron Results, Heart Disease Problem

No. of Hidden Nodes | Training Set Error (Total/Incl./Outcl.) | Test Set Error (Total/Incl./Outcl.) | Training Time (s)
2 | 29/12/17 | 18/6/12 | 53
5 | 81/0/81 | 56/0/56 | 116
10 | 29/11/18 | 19/7/12 | 627
15 | 28/12/16 | 20/7/13 | 1224
20 | 198/117/81 | 99/43/56 | 1976

Description: two classes {0, 1}, 13 dimensions. Training examples = 198; testing examples = 99. Minimum number of points per Gaussian = 4% of in-class points = 4.

TABLE 7
Vowel Recognition Problem
7A: GM Algorithm Results, Vowel Recognition Problem

Majority Criterion (%) | Cumulative No. of Gaussians | Training Set Error (Total/Percentage) | Test Set Error (Total/Percentage)
50 | 37 | 227 / 67.2 | 244 / 73.2
60 | 62 | 132 / 39.0 | 117 / 35.1
70 | 92 | 78 / 23.1 | 81 / 24.32
80 | 111 | 73 / 21.6 | 104 / 31.2
7B: RCE Results, Vowel Recognition Problem

Pruning | No. of Hyperspheres | Training Set Error (Total/Percentage) | Test Set Error (Total/Percentage)
None, min. 1 point | 250 | 116 / 34.32 | 191 / 57.36
2%, min. 1 point | 250 | 116 / 34.32 | 191 / 57.36
4%, min. 2 points | 250 | 116 / 34.32 | 191 / 57.36
6%, min. 3 points | 175 | 79 / 23.37 | 143 / 42.94
8%, min. 3 points | 175 | 79 / 23.37 | 143 / 42.94
10%, min. 4 points | 126 | 56 / 16.57 | 110 / 33.03
7C: Standard RBF Results, Vowel Recognition Problem

The 7 value lines below give, in order: No. of Gaussians; Width Multiplier; Learning Rate; Training Set Error (Total, Percentage); Test Set Error (Total, Percentage). The ith value in each line belongs to the ith run; this first block lists 19 runs: width multiplier 1 with 5-100 Gaussians and width multiplier 2 with 5-70 Gaussians.
5 10 20 30 40 50 60 70 80 90 100 5 10 20 30 40 50 60 70
1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
283 204 167 150 155 153 153 154 116 106 150 320 266 204 179 170 163 163 164
83.73 60.36 49.4 44.38 45.86 45.27 45.27 45.56 34.32 31.36 44.38 94.68 78.7 60.36 52.96 50.3 48.23 48.23 48.52
282 204 142 145 151 141 147 148 118 112 157 310 269 191 162 157 140 140 143
84.7 61.3 42.6 43.54 45.35 42.34 44.14 44.44 35.44 33.63 47.15 93.1 80.8 57.36 48.65 47.15 42 42 42.9
(Continued)
TABLE 7 Continued

7C: Standard RBF Results, Vowel Recognition Problem (continued)

The 7 value lines below give, in order: No. of Gaussians; Width Multiplier; Learning Rate; Training Set Error (Total, Percentage); Test Set Error (Total, Percentage). This block lists the remaining 14 runs: width multiplier 2 with 80-100 Gaussians and width multiplier 3 with 5-100 Gaussians.
80 90 100 5 10 20 30 40 50 60 70 80 90 100
2 2 2 3 3 3 3 3 3 3 3 3 3 3
0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001
137 150 173 336 306 239 206 191 181 177 175 155 147 126
40.53 44.38 51.18 99.41 90.53 70.71 60.95 56.51 53.55 52.36 51.78 45.86 43.49 37.28
120 140 154 332 298 236 192 186 174 168 164 152 138 121
36 42 46.25 99.7 89.5 70.9 57.7 55.86 52.3 50.5 49.25 45.65 41.4 36.34
Description: 10 classes {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; two dimensions. Training examples = 338; testing examples = 333. Minimum number of points per Gaussian = 4% of in-class = 2.
problem with BP with 50 hidden nodes and achieving an error rate of 19.8%. Moody and Darken ( 1989 ) report obtaining an error rate of 18% with 100 hidden nodes with their RBF method.
4.4. Target Recognition

The sonar target classification problem is described in Gorman and Sejnowski (1988). The task is to discriminate sonar signals bounced off a metal cylinder from those bounced off a roughly cylindrical rock. The patterns were obtained by bouncing sonar signals off the two cylinder types at various angles and under various conditions. Each pattern is a set of 60 numbers between 0.0 and 1.0. The training and test sets each have 104 members. Gorman and Sejnowski (1988) experimented with a no-hidden-layer perceptron and single hidden layer perceptrons with 2, 3, 6, 12, and 24 hidden units. Each network was trained 10 times by the back-propagation algorithm over 300 epochs. The error rate decreased from 26.9% for zero hidden units (with standard deviation of error = 4.8) to 9.6% for 12 hidden units (with standard deviation of error = 1.8). They also report that a KNN classifier had an error rate of 17.3%. Roy et al. (1993) had reported that the training set is actually linearly separable. Tables 8A and 9 show that an error rate of 21.15% was obtained
TABLE 8
Sonar Problem

8A: GM Algorithm Results, Sonar Problem (Mask Class 0)

Majority Criterion (%) | No. of Gaussians | Cumulative No. of Gaussians | Training Set Error (Total/Incl./Outcl.) | Test Set Error (Total/Incl./Outcl.) | Time (s), Ph. I/Ph. II
50 | 7 | 7 | 37/6/31 | 30/9/21 | 58/13
60 | 6 | 13 | 21/11/10 | 29/18/11 | 87/19
70 | 8 | 21 | 13/11/2 | 24/20/4 | 111/26
80 | 10 | 28 | 11/8/3 | 24/23/1 | 135/30
90 | 11 | 32 | 7/5/2 | 22/14/8 | 156/34
100 | 11 | 32 | 7/5/2 | 22/14/8 | 160/34
8B: RCE Results, Sonar Problem

Pruning | No. of Hyperspheres | Training Set Error (Total/Incl./Outcl.) | Test Set Error (Total/Incl./Outcl.) | Training Time (s)
None, min. 1 point | 20 | 21/10/11 | 28/17/11 | 3
2%, min. 1 point | 20 | 21/10/11 | 28/17/11 | 5.7
4%, min. 2 points | 20 | 21/10/11 | 28/17/11 | 4.3
6%, min. 3 points | 12 | 22/16/6 | 34/25/9 | 5.6
8C: Standard RBF Results, Sonar Problem

No. of      Width       Learning   Training Set Error       Test Set Error           Time (s)
Gaussians   Multiplier  Rate       Total   Incl.   Outcl.   Total   Incl.   Outcl.   Ph. I   Ph. II
5           1           0.0001     52      52      0        44      39      5        4       41
10          1           0.0001     55      55      0        45      41      4        5       163
20          1           0.0001     55      55      0        42      42      0        7       155
30          1           0.0001     55      55      0        41      41      0        10      204
40          1           0.0001     55      55      0        41      41      0        15      197
50          1           0.0001     55      55      0        41      41      0        21      195
60          1           0.0001     55      55      0        42      42      0        22      423
70          1           0.0001     55      55      0        42      42      0        26      407
80          1           0.0001     55      55      0        42      42      0        30      328
90          1           0.0001     55      55      0        42      42      0        29      535
100         1           0.0001     55      55      0        42      42      0        39      517
5           2           0.0001     54      54      0        39      38      1        4       7
10          2           0.0001     54      54      0        41      38      3        5       11
20          2           0.0001     55      55      0        41      41      0        7       12
30          2           0.0001     55      55      0        42      42      0        10      15
40          2           0.0001     55      55      0        42      42      0        15      17
50          2           0.0001     55      55      0        42      42      0        21      18
60          2           0.0001     55      55      0        42      42      0        22      21
70          2           0.0001     55      55      0        42      42      0        26      20
80          2           0.0001     55      55      0        42      42      0        30      23
90          2           0.0001     55      55      0        42      42      0        29      27
100         2           0.0001     55      55      0        42      42      0        39      29
5           3           0.0001     55      55      0        41      41      0        4       5
10          3           0.0001     55      55      0        41      41      0        5       7
20          3           0.0001     55      55      0        42      42      0        7       8
30          3           0.0001     55      55      0        42      42      0        10      10
40          3           0.0001     55      55      0        42      42      0        15      10
50          3           0.0001     55      55      0        42      42      0        21      13
60          3           0.0001     55      55      0        42      42      0        22      15
70          3           0.0001     55      55      0        42      42      0        26      17
80          3           0.0001     55      55      0        42      42      0        30      16
90          3           0.0001     55      55      0        42      42      0        29      17
100         3           0.0001     55      55      0        42      42      0        39      18
Description: two classes {0, 1}, 60 dimensions. Training examples = 104; testing examples = 104. Minimum number of points per Gaussian = 4% of in-class points = 2.
Tables 8A and 9 show that an error rate of 21.15% was obtained by the GM algorithm using 32 Gaussians. Tables 8B, 8C, and 9 show the results for the other algorithms. The conjugate gradient method ran into local minima problems and could not solve this problem, even with various starting points and numbers of hidden nodes.

Table 9 summarizes the results for all of the algorithms and test problems. Figures 1-4 show how the classification boundary develops as the algorithm progresses for the two-dimensional overlapping Gaussian distribution problem (problem 4).
TABLE 9
Summary of Results: All Four Algorithms

                        Test Error Rate (%)               Network Size (No. of Hidden Nodes or Hyperspheres)
Problem Type            GM      RBF     RCE     MLP       GM      RBF     RCE     MLP
Overlapping Gaussians
  Problem 1             17.90   17.27   19.47   16.51     18      60      53      10
  Problem 2             17.97   21.30   22.7    20.17     5       60      121     20
  Problem 3             9.19    14.11   12.73   15.55     4       50      96      20
  Problem 4             7.70    18.78   12.35   16.27     16      50      11      10
Breast cancer           3.94    2.96    4.43    2.96      11      90      17      2
Heart disease           18.18   36.36   29.29   18.18     24      60      32      2
Vowel recognition       24.32   33.63   33.03   N/A       92      90      126     N/A
Sonar                   21.15   37.51   26.92   N/A       32      5       20      N/A
FIGURE 1. Decision boundary for Gaussian problem 4 with 50% mask.
FIGURE 2. Decision boundary for Gaussian problem 4 with 60% mask.

FIGURE 3. Decision boundary for Gaussian problem 4 with 70% mask.

FIGURE 4. Decision boundary for Gaussian problem 4 with 80% mask.
4.5. GM Algorithm With LMS Training

Another algorithmic possibility is to generate the Gaussians with the GM algorithm and then use them to train a regular RBF net with the LMS algorithm or by matrix inversion. A regular RBF net in this case means one without the thresholding at the output nodes that is used in the GM algorithm. Tables 10-16 show the results of this modified algorithm for some of the test problems. Table 17 compares this modified algorithm with the GM algorithm. As can be observed, this modified algorithm is indeed quite fast, but the GM algorithm provides better quality solutions most of the time.
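As a concrete illustration of this variant, the Python sketch below fits only the output-layer weights of an RBF net whose Gaussians are assumed to have already been produced by the GM algorithm. The function names, array shapes, learning rate, and epoch count are illustrative assumptions of ours, not the paper's implementation; the last function shows the matrix-inversion (least-squares) alternative mentioned above.

    import numpy as np

    def gaussian_activations(X, centers, widths):
        # Hidden-layer outputs: one Gaussian per (center, width) pair.
        # X: (P, d) inputs; centers: (K, d); widths: (K,).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * widths ** 2))

    def lms_train(X, T, centers, widths, lr=0.0001, epochs=100):
        # Widrow-Hoff (LMS) training of the output weights only; the Gaussians
        # are taken as given (assumed to come from the GM algorithm).
        # T: (P, C) one-of-C target matrix.
        H = gaussian_activations(X, centers, widths)   # (P, K) hidden outputs
        W = np.zeros((H.shape[1], T.shape[1]))         # (K, C) output weights
        for _ in range(epochs):
            for h, t in zip(H, T):                     # per-example LMS update
                W += lr * np.outer(h, t - h @ W)
        return W

    def pinv_train(X, T, centers, widths):
        # Matrix-inversion (least-squares) alternative to the LMS iterations.
        H = gaussian_activations(X, centers, widths)
        return np.linalg.pinv(H) @ T

With the outputs left unthresholded, a test example can then be assigned to the class with the largest output, which is the usual decision rule for such a net.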
5. ADAPTIVE LEARNING WITH MEMORY--AN ALGORITHM

In the current neural network learning paradigm, on-line adaptive algorithms like back propagation, RCE, and RBF, that attempt to generate general characteristics of a problem from the training examples, are not supposed to store or remember any particular information or example. These algorithms can observe and learn whatever they can from an example, but must forget that example completely thereafter. This process of learning is obviously very memory efficient. However, the process does slow down learning because it does not allow for learning by comparison.
TABLE 10
GM Algorithm With LMS Weight Training, Problem 1, Overlapping Gaussians
(Test set error measured on an independent test set of 10,000 examples)

Majority        No. of      Cumulative No.   Training Set Error       Test Set Error            Time (s)
Criterion (%)   Gaussians   of Gaussians     Total   Incl.   Outcl.   Total   Incl.   Outcl.    Ph. I   Ph. II
50              1           1                171     146     25       3250    2671    569       9       1
60              5           6                111     73      38       2135    1326    809       9       34
70              5           11               87      40      47       1678    847     831       8       71
80              7           18               75      37      38       1724    886     838       11      167
90              12          30               80      43      37       1699    957     742       19      79
100             12          42               78      40      38       1787    912     875       31      51
TABLE 11
GM Algorithm With LMS Weight Training, Problem 2, Overlapping Gaussians
(Test set error measured on an independent test set of 10,000 examples)

Majority        No. of      Cumulative No.   Training Set Error       Test Set Error            Time (s)
Criterion (%)   Gaussians   of Gaussians     Total   Incl.   Outcl.   Total   Incl.   Outcl.    Ph. I   Ph. II
50              1           1                141     139     2        3060    3038    22        9       1
60              1           2                78      61      17       2012    1695    317       4       33
70              1           3                71      55      16       1901    1543    358       4       32
80              2           5                145     60      85       3058    1294    1764      5       87
90              5           10               114     62      52       2626    1479    1147      14      81
100             9           19               92      27      65       2300    909     1391      26      202
TABLE 12
GM Algorithm With LMS Weight Training, Problem 3, Overlapping Gaussians
(Test set error measured on an independent test set of 10,000 examples)

Majority        No. of      Cumulative No.   Training Set Error       Test Set Error            Time (s)
Criterion (%)   Gaussians   of Gaussians     Total   Incl.   Outcl.   Total   Incl.   Outcl.    Ph. I   Ph. II
50              1           1                38      28      10       931     669     262       16      2
60              1           2                67      13      54       1708    274     1434      9       2
70              1           3                80      71      9        1553    1321    232       9       2
80              1           4                90      79      11       1618    1408    210       7       2
90              2           6                87      54      33       2020    1045    975       19      3
100             25          31               89      65      24       2137    1356    781       91      9
TABLE 13
GM Algorithm With LMS Weight Training, Problem 4, Overlapping Gaussians
(Test set error measured on an independent test set of 10,000 examples)

Majority        No. of      Cumulative No.   Training Set Error       Test Set Error            Time (s)
Criterion (%)   Gaussians   of Gaussians     Total   Incl.   Outcl.   Total   Incl.   Outcl.    Ph. I   Ph. II
50              2           2                67      66      1        3304    3242    62        --      1
60              2           4                71      9       62       3253    271     2982      --      13
70              5           9                53      33      20       2134    1453    681       --      8
80              7           16               40      28      12       2065    1379    686       --      10
90              7           20               42      26      16       1948    1197    751       --      4
100             7           24               39      15      24       1401    477     924       --      19
TABLE 14
GM Algorithm With LMS Weight Training, Breast Cancer Problem

Majority        No. of      Cumulative No.   Training Set Error       Test Set Error           Time (s)
Criterion (%)   Gaussians   of Gaussians     Total   Incl.   Outcl.   Total   Incl.   Outcl.   Ph. I   Ph. II
50              1           1                16      3       13       10      3       7        16      1
60              1           1                16      3       13       10      3       7        15      1
70              1           2                15      5       10       10      3       7        15      1
80              1           3                32      29      3        17      16      1        12      1
90              1           4                19      4       15       10      2       8        11      1
100             7           11               19      9       10       9       4       5        18      120
TABLE 15
GM Algorithm With LMS Weight Training, Heart Disease Problem

Majority        No. of      Cumulative No.   Training Set Error       Test Set Error           Time (s)
Criterion (%)   Gaussians   of Gaussians     Total   Incl.   Outcl.   Total   Incl.   Outcl.   Ph. I   Ph. II
50              2           2                132     69      63       68      46      22       4       3
60              5           7                70      55      15       30      24      6        5       34
70              3           10               80      50      30       29      22      7        8       29
80              4           14               57      40      17       23      17      6        10      165
90              10          24               69      57      12       37      34      3        13      37
100             10          30               48      37      11       22      19      3        15      211
TABLE 16
GM Algorithm With LMS Weight Training, Sonar Problem

Majority        No. of      Cumulative No.   Training Set Error       Test Set Error           Time (s)
Criterion (%)   Gaussians   of Gaussians     Total   Incl.   Outcl.   Total   Incl.   Outcl.   Ph. I   Ph. II
50              7           7                44      17      27       58      12      46       58      40
60              6           13               42      18      24       47      13      34       87      67
70              8           21               36      14      22       50      12      38       111     153
80              10          28               37      15      22       48      12      36       135     227
90              11          32               38      16      22       47      12      35       156     268
100             11          32               38      16      22       47      12      35       160     268
TABLE 17
Summary of Results: GM Algorithm and GM Algorithm With LMS Weight Training

                        Test Error Rate (%)                    Network Size (No. of Hidden Nodes)
Problem Type            GM       GM With LMS Training          GM       GM With LMS Training
Overlapping Gaussians
  Problem 1             17.90    16.78                         18       11
  Problem 2             17.97    19.01                         5        3
  Problem 3             9.19     9.31                          4        1
  Problem 4             7.70     14.01                         16       24
Breast cancer           3.94     4.43                          11       11
Heart disease           18.18    22.22                         24       30
Sonar                   21.15    45.19                         32       13
Humans learn rapidly when allowed to compare the objects/concepts to be learned, as the comparison provides very useful extra information. If, for example, one is expected to learn to pronounce a thousand Chinese characters, it would definitely help to see very similar ones together, side by side, to properly discriminate between them. If one is denied this opportunity, and shown only one character at a time, with the others hidden away, the same task becomes much more difficult and the learning could take longer. In the case of learning new medical diagnoses and treatments, if medical researchers were not allowed to remember, recall, compare, and discuss different cases and their diagnostic procedures, treatments, and results, the learning of new medical treatments would be slow and in serious jeopardy. Remembering relevant facts and examples is very much a part of the
human learning process because it facilitates comparison of facts and information, which forms the basis for rapid learning.

To simulate on-line adaptive learning with no memory, a fixed set of training examples is generally cycled repeatedly through an algorithm. The supposition, however, is that a new example is being observed on-line each time. If an algorithm has to observe n examples p times for such simulated learning, it implies requiring np different training examples on-line. So, if a net is trained with 100 examples over 100 epochs, it implies that it was required to observe 10,000 examples on-line. On-line adaptive learning with no memory is an inefficient form of learning because it requires observing many more examples for the same level of error-free learning than methods that use memory. For example, Govil and Roy (1993) report that their RBF algorithm for function approximation learned to predict the logistic map function very accurately (0.129% error) with just 100 training examples. In comparison, Moody and Darken (1989) report training a back-propagation net on the same problem with 1000 training examples that took 200 iterations (line minimizations) of the conjugate gradient method and obtained a prediction error of 0.59%. This means that the back-propagation algorithm, in a real on-line adaptive mode, would have required at least 1000 x 200 = 200,000 on-line examples to learn this map, which is 199,900 examples more than required by the RBF algorithm of Govil and Roy (1993), an algorithm that uses memory for quick and efficient learning. If the examples were being generated by a very slow and costly process, which is often the case, this would have meant a long and costly wait before the back-propagation algorithm learned. On the other hand, in such a situation, the net generated by the memory-based RBF algorithm could have been operational after only 100 observations. This implies that in many critical applications, where training examples are in short supply and are costly and hard to generate, an on-line no-memory adaptive algorithm could be a potential disaster, because it cannot learn quickly from only a few examples. For example, it might be too risky to employ such a system on-line to detect credit card fraud. New fraudulent practices may not be properly detected by such a no-memory system for quite some time and can result in significant losses to a company. The "thieves" would be allowed to enjoy their new inventions for quite some time.

An on-line adaptive learning algorithm based on the GM method is proposed here. The basic idea is as follows. Suppose that some memory is available to the algorithm to store examples. It uses part of the memory to store some testing examples and the remaining part to store training examples. Assume it first collects and stores some test examples. Training examples are collected and stored incrementally, and the GM algorithm is
used on the available training set at each stage to generate (regenerate) an RBF-like net. Once the training and testing set errors converge and stabilize, on-line training is complete. During the operational phase, the system continues to monitor its error rate by collecting and testing batches of incoming examples. If the test error is found to have increased beyond a certain level, the system proceeds to retrain itself in the manner described above.

The following notation is used to describe the proposed algorithm. MA denotes the maximum number of examples that can be stored by the algorithm. N_TS is the number of testing examples stored and N_TR the number of training examples stored, where N_TR + N_TS ≤ MA. η is the incremental addition to the training set N_TR. tse_j and tre_j are the testing and training set errors, respectively, after the jth incremental addition to the training set. tse_old denotes the test set error after completion of training and tse_new the test error on a new batch of on-line examples. μ is the tolerance for the difference between tse_new and tse_old, and θ is the tolerance for the error rate difference during incremental learning or adaptation. The adaptive algorithm is summarized below.
On-Line Adaptive Learning with Fixed Memory
(1) Collect N_TS examples on-line for testing.
(2) Initialize counters and constants: j = 0, N_TR = 0; set the tolerances μ and θ.
(3) Increment the collection counter: j = j + 1.
(4) Collect η (additional) examples for training and add them to the training set: N_TR = N_TR + η. If N_TR + N_TS > MA, go to (7).
(5) Regenerate the RBF net with the GM algorithm using the N_TR training examples and the N_TS testing examples.
(6) Compute tse_j and tre_j. If j = 1, go to (3). If |tse_j - tse_(j-1)| ≤ θ and |tre_j - tre_(j-1)| ≤ θ, go to (7); else go to (3).
(7) Current adaptation is complete. Set tse_old = tse_j. Test the system continuously for any significant change in error rate.
(8) Collect N_TS new examples on-line for testing; test and compute tse_new.
(9) If |tse_new - tse_old| ≤ μ, go to (8). Otherwise, it is time to retrain; go to (2).
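To make the control flow concrete, here is a minimal Python sketch of steps (1)-(9). The functions train_gm_rbf and error_rate are hypothetical placeholders for the GM training step and the error-rate computation described earlier, stream stands for any source of labelled on-line examples, and the default values of N_TS, η, MA, θ, and μ are illustrative only (theta = 0.015 mirrors the 1.5% tolerance used for the Table 18 run).

    def adapt_online(stream, train_gm_rbf, error_rate,
                     n_ts=100, eta=100, ma=1000, theta=0.015):
        # Steps (1)-(7): incremental training within a fixed memory budget.
        test_set = [next(stream) for _ in range(n_ts)]       # (1) collect N_TS test examples
        train_set, j = [], 0                                 # (2) j = 0, N_TR = 0
        net, tse, tre = None, None, None
        while True:
            j += 1                                           # (3)
            if len(train_set) + eta + len(test_set) > ma:    # (4) stop if memory would overflow
                break
            train_set += [next(stream) for _ in range(eta)]
            net = train_gm_rbf(train_set)                    # (5) regenerate the RBF-like net
            tse_prev, tre_prev = tse, tre
            tse = error_rate(net, test_set)                  # (6) tse_j and tre_j
            tre = error_rate(net, train_set)
            if j > 1 and abs(tse - tse_prev) <= theta and abs(tre - tre_prev) <= theta:
                break                                        # errors have stabilized
        return net, tse                                      # (7) tse_old = tse_j

    def monitor_and_flag_retraining(stream, net, error_rate, tse_old, n_ts=100, mu=0.02):
        # Steps (8)-(9): flag retraining when the test error drifts by more than mu.
        while True:
            batch = [next(stream) for _ in range(n_ts)]      # (8) new on-line test batch
            tse_new = error_rate(net, batch)
            if abs(tse_new - tse_old) > mu:                  # (9) significant change detected
                return tse_new

When the monitor flags a significant change in the test error, retraining simply restarts the adapt_online loop, as in step (9) of the algorithm.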
TABLE 18
On-Line Adaptation of Overlapping Gaussian Distributions, Problem 4

Collection   Cumulative No. of   Cumulative No.   LP Solution   tre_j   tse_j   |tse_j - tse_(j-1)|   |tre_j - tre_(j-1)|   |tre_j - tse_j|
Counter j    Training Points     of Gaussians     Time (s)      (%)     (%)     (%)                   (%)                   (%)
1            100                 11               9             5       9.65    --                    --                    4.65
2            200                 13               69            6.5     7.15    2.5                   1.5                   0.65
3            300                 8                213           7.67    7.9     0.75                  1.17                  0.23
4            400                 17               555           6       6.15    1.75                  1.67                  0.15
5            500                 17               1129          7.4     6.75    0.5                   1.4                   0.65
Table 18 shows how the overlapping Gaussian distribution problem 4 is adapted on-line. RBF nets were generated with the GM algorithm at increments of 100 examples, and θ was set to 1.5%. With this θ, adaptation is complete within 300 examples. If θ is reduced further, adaptation would take longer (i.e., would need more examples). If the GM algorithm used a different stopping rule, such as |tre_j - tse_j| remaining within some bound, it could have stopped at 400 examples and obtained close to the optimal error rate of 6.15% on the test set.
6. CONCLUSION

The paper has defined a set of robust and computationally efficient learning principles for neural network algorithms. The algorithm presented here, along with some of the previous ones (Govil & Roy, 1993; Mukhopadhyay et al., 1993; Roy & Mukhopadhyay, 1991; Roy et al., 1993), has been based on these learning principles. These learning principles differ substantially from classical connectionist learning. Extensive computational experiments show that the algorithm presented here works quite well. Work is under way to improve these methods and extend them to other types of neural networks. The results will be reported in the future.
REFERENCES

Baldi, P. (1990). Computing with arrays of bell-shaped and sigmoid functions. Proceedings of IEEE Neural Information Processing Systems, 3, 728-734.
Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23-34.
Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117-127.
Broomhead, D., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321-355.
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304-310.
Gorman, R. P., & Sejnowski, T. J. (1988). Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1, 75-89.
Govil, S., & Roy, A. (1993). Generating a radial basis function net in polynomial time for function approximation. Working paper.
Hush, D. R., & Salas, J. M. (1990). Classification with neural networks: A comparison (UNM Tech. Rep. No. EECE 90-004). University of New Mexico.
Judd, J. S. (1990). Neural network design and the complexity of learning. Cambridge, MA: MIT Press.
Karmarkar, N. (1984). A new polynomial time algorithm for linear programming. Combinatorica, 4, 373-395.
Khachian, L. G. (1979). A polynomial algorithm in linear programming. Doklady Akademii Nauk SSSR, 244(5), 1093-1096.
Lee, Y. (1989). Classifiers: Adaptive modules in pattern recognition systems. Master's thesis, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA.
Lippmann, R. P. (1988). Neural network classifiers for speech recognition. The Lincoln Laboratory Journal, 1(1), 107-128.
Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
Mangasarian, O. L., Setiono, R., & Wolberg, W. H. (1990). Pattern recognition via linear programming: Theory and application to medical diagnosis. In T. F. Coleman & Y. Li (Eds.), Proceedings of the Workshop on Large-Scale Numerical Optimization (pp. 22-31). Philadelphia, PA: SIAM.
Monteiro, R. C., & Adler, I. (1989). Interior path following primal-dual algorithms. Part I: Linear programming. Mathematical Programming, 44, 27-41.
Moody, J., & Darken, C. (1988). Learning with localized receptive fields. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 133-143). San Mateo, CA: Morgan Kaufmann.
Moody, J., & Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2), 281-294.
Mukhopadhyay, S., Roy, A., Kim, L. S., & Govil, S. (1993). A polynomial time algorithm for generating neural networks for pattern classification: Its stability properties and some test results. Neural Computation, 5(2), 225-238.
Musavi, M. T., Ahmed, W., Chan, K. H., Faris, K. B., & Hummels, D. M. (1992). On the training of radial basis function classifiers. Neural Networks, 5(4), 595-603.
Musavi, M. T., Kalantri, K., Ahmed, W., & Chan, K. H. (1993). A minimum error neural network (MNN). Neural Networks, 6, 397-407.
Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3(2), 213-225.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978-982.
Powell, M. J. D. (1987). Radial basis functions for multivariable interpolation: A review. In J. C. Mason & M. G. Cox (Eds.), Algorithms for approximation. Oxford: Clarendon Press.
Reilly, D., Cooper, L., & Elbaum, C. (1982). A neural model for category learning. Biological Cybernetics, 45, 35-41.
Renals, S., & Rohwer, R. (1989). Phoneme classification experiments using radial basis functions. Proceedings of the International Joint Conference on Neural Networks, I, 461-467.
Roy, A., & Mukhopadhyay, S. (1991). Pattern classification using linear programming. ORSA Journal on Computing, 3(1), 66-80.
Roy, A., Kim, L. S., & Mukhopadhyay, S. (1993). A polynomial time algorithm for the construction and training of a class of multilayer perceptrons. Neural Networks, 6(4), 535-545.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1: Foundations (pp. 318-362). Cambridge, MA: MIT Press.
Specht, D. F. (1990). Probabilistic neural networks. Neural Networks, 3(1), 109-118.
Todd, M. J., & Ye, Y. (1990). A centered projective algorithm for linear programming. Mathematics of Operations Research, 15(3), 508-529.
Vrckovnik, G., Carter, C. R., & Haykin, S. (1990). Radial basis function classification of impulse radar waveforms. Proceedings of the International Joint Conference on Neural Networks, I, 45-50.
Widrow, B., & Hoff, M. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record (pp. 96-104). New York: IRE.