N. J. Redding, A. Kowalczyk, and T. Downs

Recall that the image space (denoted by J) is the space spanned by the hidden functions φ_i.

ALGORITHM 1
01  k = 0
02  Φ^0 = {1}
03  do
04      k := k + 1
05      determine Φ^≤k
06      find images J+ and J- of X+ and X- under Φ^≤k
07  until J+ and J- are LS

A HON that realizes the problem on X will now have been established.
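The LS test in line 07 of Algorithm 1 can be carried out as a linear-program feasibility check: the two image sets are LS exactly when some weight vector w satisfies w·z ≥ 1 for every z ∈ J+ and w·z ≤ -1 for every z ∈ J-. A minimal sketch (not the paper's implementation; the helper name and the use of scipy are assumptions):

```python
# Hypothetical helper (not from the paper): test linear separability (LS) of two
# image sets by checking feasibility of w.z >= 1 (z in J+) and w.z <= -1 (z in J-).
import numpy as np
from scipy.optimize import linprog

def linearly_separable(J_plus, J_minus):
    """Return True iff some weight vector w strictly separates the two sets."""
    A = np.vstack([-np.asarray(J_plus), np.asarray(J_minus)])  # A @ w <= -1
    b = -np.ones(len(A))
    res = linprog(c=np.zeros(A.shape[1]), A_ub=A, b_ub=b,
                  bounds=[(None, None)] * A.shape[1])
    return res.success

# Images of the two-variable XOR pattern sets under {1, x1, x2} ...
J1_plus, J1_minus = [(1, 0, 1), (1, 1, 0)], [(1, 0, 0), (1, 1, 1)]
# ... and under {1, x1, x2, x1*x2}.
J2_plus, J2_minus = [(1, 0, 1, 0), (1, 1, 0, 0)], [(1, 0, 0, 0), (1, 1, 1, 1)]

print(linearly_separable(J1_plus, J1_minus))  # False: not LS at first order
print(linearly_separable(J2_plus, J2_minus))  # True: LS at second order
```

Since strict separation of a finite set can always be rescaled to a margin of 1, feasibility of this LP is equivalent to linear separability.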
In the case of two-valued inputs (i.e., X ⊆ B^n, where B^n = {0, 1}^n or B^n = {-1, 1}^n), the algorithm must terminate because the degree of a monomial on X clearly need not be higher than n. This algorithm is demonstrated by the following simple example.

EXAMPLE 1. Determine the order of the two-variable XOR problem.
Given that x = (x1, x2) ∈ X, and X = B^2, the set of monomials of up to degree one is given by Φ^≤1 = {1, x1, x2}. The images of Φ^≤1 on X+ = {(0, 1), (1, 0)} and X- = {(0, 0), (1, 1)} are the sets J+ = {(1, 0, 1), (1, 1, 0)} and J- = {(1, 0, 0), (1, 1, 1)}. The next step, determining the linear separability of J+ and J-, will show that the images of the monomials Φ^≤1 on X are not LS. Therefore, we must test the next highest-order monomials, those of second order. Now, Φ^2 = {x1x2}, so Φ^≤2 = {1, x1, x2, x1x2}, and so the new image sets are J+ = {(1, 0, 1, 0), (1, 1, 0, 0)} and J- = {(1, 0, 0, 0), (1, 1, 1, 1)}. Applying a test for linear separability then shows that J+ and J- are LS (confirming that the two-variable XOR is second order). Therefore, a HON that realizes the two-variable XOR problem will require monomials from the HFF Φ^≤2 = {1, x1, x2, x1x2} to form its hidden units.

The algorithm just described has one significant difficulty that renders it unusable for large-dimensional problems without some additional controlling mechanisms. This is, of course, the combinatorial explosion of monomials that will result as the dimension of the input space n and the order k increase. In fact, the number of monomials of order k in an n-dimensional two-valued input space is given by the binomial coefficient C(n, k). So, for large n an unacceptably large number of dimensions in the image space will be needed during the execution of the algorithm. In addition, high-dimensional tests for linear separability will need to be performed.

This combinatorial explosion, daunting in its magnitude, is often seen as insurmountable and a good reason to avoid HONs (Minsky & Papert, 1988; Reid, Spirkovska, & Ochoa, 1989). There are, however, a number of techniques available for keeping this problem within manageable limits. Some techniques for dealing with the combinatorial explosion of monomial
hidden functions were suggested by Giles and Maxwell (1987). These techniques, however, are not suitable for our purposes because of our requirement of an exact mapping. An explanation of the approach we developed to deal with the combinatorial explosion follows. This approach results in a constructive technique for determining an HON solution to the classification of a pattern set X ⊂ B^n.

4.1. Universal Set of Monomials

The combinatorial explosion can be dealt with by recognizing that in the majority of cases many of the monomials are redundant because of their mutual linear dependence (as functions upon the set X). This linear dependence will occur in any problem that is incompletely specified, and all generalization problems are of this type. By restricting the monomial hidden units to only nonredundant, linearly independent monomials, the number of hidden units is kept to an acceptable level without reducing the representational ability of the network, as we will see in the following development. This representational ability is captured by what we term a universal set of monomials. The universal set is so named because it is constructed only from the set of input patterns X and therefore remains the same across all the possible classifications of the patterns in X. The following definition is fundamental to the development of the universal set.

Consider the space of real-valued functions that can be defined on a finite set of patterns X. A particular function on X can be identified by its real-valued outputs for each pattern in the set X. If |X| = p, say, then the function will be identified by a point in the p-dimensional real space R^p, which will be called the X-space. Each coordinate in this space will correspond to the value of the function for a particular pattern in X. Whenever we talk about monomials being linearly independent upon X, we mean that the X-space vectors of the monomials are linearly independent. Let us now introduce some notation.
First, let φ_T, where T = {i1, i2, ..., is}, represent a particular monomial x_{i1} x_{i2} ... x_{is}. Then, let the point in X-space that is given by the outputs of the monomial φ_T for the patterns in X be denoted by [φ_T]_X. Further, if M denotes a set of monomials, let M^k ⊆ M denote the subset of all monomials in M of degree k. Finally, [M]_X is used to denote the set of X-space vectors from the monomials in M on X.

The idea of the universal set is simple in nature: a universal set includes only monomials whose outputs on X cannot be written as a combination of the outputs of the other monomials in the set. If a universal set for a set of patterns X is denoted by M, then the monomials in M will form a basis for the X-space. A universal set
must have one further property: it must be order preserving (see Definition 5) with respect to any problem on the set of patterns X. The concept of a universal set is defined more formally as follows.
DEFINITION 6. Let M be a set of monomials on X. Then M is termed a universal set of monomials on X if and only if the following two conditions are met:
• All the monomials in M are linearly independent on X, such that M provides a basis for all real-valued functions on X; that is, the X-space is generated by the outputs of the monomials in the set M.
• Any polynomial of degree k ≤ n restricted to X can be written as a linear combination of monomials in M^≤k; that is, M is an order-preserving set of hidden functions for any problem defined on X.
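The X-space vectors and the linear-independence condition in the first part of the definition can be illustrated numerically. A sketch (the function names are assumptions, not the paper's code): each monomial φ_T is represented by its vector [φ_T]_X of outputs on the p patterns, and independence is checked by comparing matrix ranks.

```python
# Sketch (assumed names, not from the paper): represent each monomial phi_T by
# its X-space vector [phi_T]_X and test linear independence by comparing ranks.
import numpy as np

def xspace_vector(T, X):
    """[phi_T]_X: outputs of the monomial prod_{i in T} x_i on each pattern in X."""
    return np.array([np.prod([x[i] for i in T]) for x in X])

X = [(0, 0), (0, 1), (1, 0), (1, 1)]        # |X| = p = 4, so the X-space is R^4
basis = [xspace_vector(T, X) for T in [(), (0,), (1,)]]   # monomials 1, x1, x2

v = xspace_vector((0, 1), X)                 # candidate monomial x1*x2
M = np.array(basis)
independent = np.linalg.matrix_rank(np.vstack([M, v])) > np.linalg.matrix_rank(M)
print(independent)  # True: x1*x2 is linearly independent of {1, x1, x2} on X
```

Here {1, x1, x2, x1x2} gives four independent vectors in R^4, so it spans the X-space for X = B^2, as the definition requires.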
The following theorem highlights two important properties of a universal set.

THEOREM 4. Consider a universal set of monomials M on X.
• The cardinality of M is equal to the cardinality of X; that is, |M| = |X|.
• A universal set for X is not necessarily unique.

Proof. The first part of this theorem follows from the fact that there can only be |X| linearly independent vectors in an |X|-dimensional linear space. The second part follows from the fact that more than one set of vectors may span a linear space. ∎
As a consequence of this theorem, one need never consider more than p monomial hidden units, where p = |X|. (Incidentally, this places an upper bound of p on the number of hidden units required in an FFN.) An obvious first algorithm to compute a universal set of monomials M for the set X can then be expressed as follows.
ALGORITHM 2
01  k = 0
02  M^0 = {1}
03  do
04      k := k + 1
05      for each φ of degree k
06          if φ is linearly independent of M^≤k on X
07              M^k := M^k ∪ φ
08  until |M| = |X|

THEOREM 5. Algorithm 2 constructs a universal set for X.

Proof. By testing all monomials of lower degree first, and eventually testing all possible monomials, both properties in Definition 6 are satisfied, which ensures that M is a universal set. Further, because X is a set of two-valued patterns, the algorithm will terminate with a set M, |M| = |X|, such that the degree of any monomial in M is no more than n. ∎

This algorithm is faced with one difficulty: it is unclear at this stage how to generate each φ of degree k in line 05. We will solve this problem in the following development and produce a polynomial-time algorithm for determining a universal set.

If the two sets M^i and M^j are the ith and jth degree subsets of a universal set on the patterns X, then the set of ordinary products of the monomials in M^i and M^j, denoted by M^i · M^j, will usually contain a number of monomials of degree i + j. Most importantly, however, the number of monomials in M^i · M^j of degree i + j will usually be much less than the total number of monomials that exist with degree i + j. The following theorems indicate how a universal set can be constructed by considering only those monomials in the ordinary product M^i · M^j. We will use the notation Span( ) to denote a linear combination.

THEOREM 6. Given a universal set M on X with subsets M^i and M^j of degree i and j, respectively, then for any monomial φ of degree i + j, [φ]_X ∈ Span([M^i · M^j]_X ∪ [M^<(i+j)]_X).

Proof. Any monomial φ of degree i + j can be written as a product of two monomials φ_i and φ_j of degree i and j, respectively. The monomials φ_i and φ_j on X can be written as linear combinations of the monomials in M^≤i and M^≤j on X, respectively, because M is a universal set. Therefore,

    [φ]_X = [φ_i φ_j]_X,    (4)

where the only terms of degree i + j in eq. (4) will be those from the set M^i · M^j. A polynomial of degree i + j may also contain monomials with degree < i + j, but these terms on X must be linear combinations of the monomials in M^<(i+j). ∎

The following corollary follows directly from Theorem 6.

COROLLARY 7. Given a universal set M and the ith and jth degree subsets of M, M^i and M^j respectively, then the subset M^(i+j) can be replaced by a set of monomials from the ordinary product of monomials M^i · M^j to form a (not necessarily identical) universal set.

THEOREM 8. A universal set can always be constructed such that the monomials of degree i + 1 that it contains are elements of the ordinary product M^i · M^1 for all values of i ∈ N.
Proof. The proof of this theorem follows from repeated application of Corollary 7 for j = 1 and i = 1, i = 2, etc. ∎
Algorithm 3 below, for computing a universal set for X, makes use of Theorem 8 by considering as elements of M^(i+1) only those monomials that belong to the ordinary product M^i · M^1 of previously determined sets M^i and M^1. This new algorithm has the desired polynomial-time complexity, as is indicated by Theorem 9.

ALGORITHM 3
01  k = 0
02  M^0 = {1}
03  do
04      k := k + 1
05      for each φ ∈ {x1, x2, ..., xn} if k = 1, or φ ∈ M^1 · M^(k-1) if k ≥ 2
06          if φ is linearly independent of M^≤k on X
07              M^k := M^k ∪ φ
08  until |M| = |X|
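Algorithm 3 can be sketched directly in code. This is an illustrative reading of the algorithm, not the paper's implementation (names are assumptions): monomials are sets of input indices (squares collapse on two-valued inputs), degree-k candidates are drawn from the ordinary product M^1 · M^(k-1) rather than from all C(n, k) monomials, and independence on X is tested by a rank comparison.

```python
# Sketch of Algorithm 3 (assumed names): build a universal set M for a set of
# two-valued patterns X, drawing degree-k candidates from M^1 . M^(k-1).
import numpy as np

def universal_set(X):
    X = np.asarray(X)
    p, n = X.shape
    vec = lambda T: np.prod(X[:, sorted(T)], axis=1) if T else np.ones(p)
    M, rows = [frozenset()], [vec(frozenset())]          # M^0 = {1}
    degree = {frozenset(): 0}
    k = 0
    while len(M) < p:
        k += 1
        if k > n:                                        # degree never exceeds n
            break
        if k == 1:
            candidates = [frozenset([i]) for i in range(n)]
        else:                                            # ordinary product M^1 . M^(k-1)
            ones = [T for T in M if degree[T] == 1]
            prev = [T for T in M if degree[T] == k - 1]
            candidates = {a | b for a in ones for b in prev if len(a | b) == k}
        for T in candidates:
            trial = np.vstack(rows + [vec(T)])
            if np.linalg.matrix_rank(trial) > len(rows):  # independent on X
                M.append(T); rows.append(vec(T)); degree[T] = k
                if len(M) == p:
                    break
    return M

X = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
U = universal_set(X)
print(len(U) == len(X))  # True: |M| = |X| (Theorem 4)
```

On the full pattern set X = B^2 this recovers the four monomials {1, x1, x2, x1x2}, in agreement with Example 1.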
The following example demonstrates the main features of the above algorithm and indicates how using Theorem 8 radically reduces the number of monomials that have to be considered in constructing a universal set.

EXAMPLE 2. Let us assume that the pattern set X contains p patterns, where each pattern x ∈ X is a vector with five binary or bipolar entries, and that we wish to find a universal set for X using Algorithm 3. Initially, the unitary monomial is added to the universal set to form the subset M^0 = {1}. Next, the monomials x1, x2, ..., x5 need to be tested for linear independence over the set X. Suppose that out of these the monomials x1, x3, x5 are found to be linearly independent, so that M^1 = {x1, x3, x5}. Next, to find the subset M^2 only the monomials in the set of ordinary products¹ M^1 · M^1 = {[x1x1], x1x3, x1x5, [x3x3], x3x5, [x5x5]} are tested for linear independence. This is a clear saving over the exhaustive method of Algorithm 2 because no monomials involving the inputs x2 and x4 need be tested. Let us now assume that the monomials x1x3 and x1x5 are found to be linearly independent, so that M^2 = {x1x3, x1x5}. Then, the subset M^3 is found by testing for linear independence on the set of ordinary products² M^1 · M^2 = {x1x3x5}, and the procedure would now terminate because there are no higher-order products to consider. (Note the fact that the procedure terminates with a third-order monomial indicates that any problem on this pattern set will have order no greater than 3.)

¹ Terms of the form x_i x_i (those that appear in brackets) are ignored because x_i x_i = x_i when x_i is binary valued and x_i x_i = 1 when x_i is bipolar valued.
² Note that we have ignored the terms x1^2 x3, x1^2 x5, x1 x3^2, x1 x5^2 in this set of ordinary products.

The polynomial-time complexity of Algorithm 3 is demonstrated in the following theorem.

THEOREM 9. Algorithm 3 constructs a universal set for X in O(n|X|^4) time.

Proof. That the algorithm constructs a universal set follows simply from Algorithm 2, Theorem 5, and Theorem 8. The computational complexity is proven as follows. Because we have ≤ n first-order monomials and a total of |X| monomials in M, we need to perform not more than n|X| tests for linear independence, as the following indicates. Let the integer j_k denote the number of monomials of degree k, k = 0, 1, 2, ..., n, that are added to the universal set, so j_0 = 1 and j_1 ≤ n are special cases. Initially, n tests for linear independence are performed to determine the j_1 monomials of degree one, followed by j_1 j_1 tests to determine the j_2 monomials of degree two, until finally j_1 j_(n-1) tests are performed to determine the j_n monomials of degree n. Therefore, the number of tests required is given by

    n + Σ_{i=1}^{n-1} j_1 j_i = n + j_1(|X| - j_n - 1)
                              = n + j_1|X| - j_1 j_n - j_1
                              ≤ n|X|,  for |X| > n,

confirming that at most n|X| tests for linear independence need to be performed. Each test for linear independence requires examination of a rectangular matrix that has each dimension ≤ |X|, and as a result (see the appendix) the complexity of each test will be O(|X|^3). And because n|X| such tests have to be carried out, the computational complexity of Algorithm 3 is O(n|X|^4). ∎

In the appendix, we develop an improved algorithm by using a form of Gaussian elimination to speed the tests for linear independence. The algorithm complexity is given by the following theorem (for proof, see the appendix).

THEOREM 10. A universal set for the set of patterns X can be computed in O(n|X|^3) time.

4.2. Constructing a HON Using a Universal Set of Monomials

The following theorem indicates how the concept of a universal set of monomials may be incorporated into Algorithm 1 to reduce the computational overhead imposed by the need of the algorithm to consider all possible monomials. This theorem clearly indicates that a universal subset M^≤k can be used at any point in Algorithm 1 in place of the larger set Φ^≤k of monomials. The speed improvement is immediately apparent when one considers that |Φ| = 2^n whereas |M| = |X| and X ⊆ B^n.
THEOREM 11. The problem Ψ on X, Ψ: X → B, has order ≤ k if and only if it can be implemented with hidden functions from the subset M^≤k of a universal set of monomials M on X.

Proof. This theorem follows from the definition of a universal set (Definition 6) and Theorem 1. ∎

Once we have determined a universal set M on X that forms the set from which we will choose the hidden units of a HON, the next step is to determine the minimum k such that the monomials in the subset M^≤k, when used as hidden units, will correctly classify X according to Ψ. From Theorem 11, a problem Ψ on X is of order ≤ k if the following inequalities can be satisfied for some set of weights w_i, i = 1, ..., |M^≤k|:

    Σ_{φ_i ∈ M^≤k} w_i φ_i(x)  { > 0  if x ∈ X+,
                                { < 0  if x ∈ X-.      (5)
Let us introduce the notation t(x) = 1 if x ∈ X+ and t(x) = -1 if x ∈ X-. We can then frame the problem of eqn (5) as a Chebyshev solution to a set of linear equations (Cheney, 1982):

    min_w E(w) = min_w max_{x ∈ X} | Σ_{φ_i ∈ M^≤k} w_i φ_i(x) - t(x) |.      (6)
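The minimax problem of eqn (6) can be solved as a small linear program: minimize r subject to |Σ_i w_i φ_i(x) - t(x)| ≤ r for every pattern x. A sketch of this formulation (assumed setup, not the paper's code; scipy is an assumed dependency):

```python
# Sketch (assumed names): solve eqn (6) as the linear program
# "minimize r subject to |sum_i w_i phi_i(x) - t(x)| <= r for all x in X".
import numpy as np
from scipy.optimize import linprog

def chebyshev_fit(Phi, t):
    """Phi: p x m matrix of hidden-unit outputs; t: +/-1 targets. Returns (w, E)."""
    p, m = Phi.shape
    c = np.r_[np.zeros(m), 1.0]                  # variables (w_1..w_m, r); minimize r
    A = np.vstack([np.c_[Phi, -np.ones(p)],      #  Phi w - t - r <= 0
                   np.c_[-Phi, -np.ones(p)]])    # -Phi w + t - r <= 0
    b = np.r_[t, -t]
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * (m + 1))
    return res.x[:m], res.x[m]

# Two-variable XOR with hidden units {1, x1, x2, x1*x2}:
Phi = np.array([[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1]], float)
t = np.array([1.0, 1.0, -1.0, -1.0])
w, E = chebyshev_fit(Phi, t)
print(E < 1.0)  # True: E(w) < 1, so the inequalities of eqn (5) are satisfiable
```

Here the four monomial vectors form a basis of the X-space, so the fit is exact (E = 0) and the resulting w realizes XOR at second order.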
It can easily be shown that any solution of eqn (6) for which E(w) < 1 is also a solution to eqn (5), and that if a solution to eqn (5) exists then any solution to eqn (6) must satisfy E(w) < 1 (Kaplan & Winder, 1965). The inequalities in eqn (6) can be rewritten as the following linear programming problem:

    minimize r subject to
        Σ_{φ_i ∈ M^≤k} w_i φ_i(x) - t(x) - r ≤ 0,  for each x ∈ X,
        Σ_{φ_i ∈ M^≤k} w_i φ_i(x) - t(x) + r ≥ 0,  for each x ∈ X.

After translation into Karmarkar's form (Bazaraa, Jarvis, & Sherali, 1990), this linear programming problem can be solved in O(p^5.5) time (Karmarkar, 1984), where p = |X|, assuming arithmetic operations require unitary time. Because a universal set M is constructed for any problem on X, a problem of order k would usually not require all the monomials in M^≤k as hidden units of a HON realization. Using a linear programming algorithm in the manner described above generally eliminates at least some of these redundant hidden units.

We can now adapt our initial algorithm, Algorithm 1, to incorporate the concept of a universal set.

ALGORITHM 4
01  k = 0
02  M^0 = {1}
03  do
04      k := k + 1
05      determine M^k
06  until eqn (5) is satisfied

A HON that realizes the problem on X will now have been established.

THEOREM 12. Algorithm 4 will construct a minimal order HON to realize the problem on X ⊆ B^n in O(np^5.5) time.

Proof. When X ⊆ B^n, the order of a problem on X cannot be larger than n. Therefore, the do-loop of Algorithm 4 will never be executed more than n times. The cost of solving the linear programming problem in eqn (5) is O(p^5.5) time, much larger than that required to determine M^k. Therefore, the total algorithm will require at most O(np^5.5) time. ∎

In effect then, the algorithm described constructs a network with minimal maximum hidden unit fan-in (and this can be done in polynomial time using Karmarkar's algorithm); the network obtained is not necessarily, however, one that contains the minimum possible number of hidden units. An impediment to obtaining a minimal solution concerns the fact that the universal set is not unique. Finding a truly minimal solution from all possible universal sets for a given problem would be infeasible.

5. DISCUSSION
The strength of the constructive HON (CHON) algorithm we have presented lies in its ability to construct a minimal-order set of spanning monomials that are then used to determine the hidden units for an arbitrary problem on X. In problems of the training-by-example type, and in other problems where generalization is required, we normally have |X| ≪ 2^n (X ⊂ B^n). In such cases, our universal set algorithm generally eliminates the vast majority of the possible 2^n monomials from consideration as hidden units. Of course, if every pattern were present in the training set (exceedingly rare in practical learning problems) the selection of a universal set would be a pointless exercise because it would, of necessity, contain all 2^n monomials for the pattern set X = B^n. While the ability of the algorithm to construct a correctly classifying HON for an arbitrary set of patterns has been demonstrated mathematically, it is instructive
to examine the performance of the algorithm on a test problem. At the same time, we will investigate the generalization behavior of the resulting network. The particular problem that we will use for this purpose is the "two-or-more clumps" problem, following Denker et al. (1987), Mezard and Nadal (1989), and Frean (1990). An input pattern x is classified as belonging to X+ if x contains two or more clumps of 1s; otherwise, it belongs to X-. Note that cyclic boundary conditions apply to x: the element x1 of x is considered to be next to x_n. The two-or-more clumps problem is a second-order problem. A two-or-more clumps problem with a mean of 1.5 clumps and 25 inputs was tested to make possible a direct comparison with the results for the Upstart and Tiling constructive algorithms presented in Frean (1990). The training set in each of the trials contained up to 800 patterns,³ and the performance of the network was tested on a further 600 patterns. These generalization test results are presented in Figure 3.

Figure 4 indicates the growth of the number of weights in the resulting HON as the size of the training set was increased for the two-or-more clumps problem. We present the growth in the HON in terms of the number of weights rather than the more usual situation of quoting the number of hidden units (although they occur with the same frequency in a HON) to emphasize that the hidden units of a HON are more cost effective than in typical FFN structures (including the network constructed by the Upstart algorithm). A HON has only one weight for each monomial hidden unit. In typical FFNs, however, each hidden unit has in addition a weight for each input element, and a threshold (giving n + 2 weights per hidden unit in total for an n-dimensional input space). Further, the maximum fan-in of the monomials is kept as small as possible by the CHON algorithm, so the computation required to determine the output of each hidden unit of a HON is smaller than for typical FFNs.
Table 1 indicates the network order of the constructed HONs on the two-or-more clumps problem. From this table, it is possible to compute the total connectivity. The apparent variation in order of the two-or-more clumps problem, as evidenced by the different network orders obtained in the trials of Table 1, can be simply explained in the following way. The concept of a learning problem (e.g., the predicate "are there two or more
³ Although the size of the pattern sets is a minuscule fraction of the 2^25 possible patterns, the mean of 1.5 clumps ensures that there is a considerable bias toward the patterns that have less than two clumps, of which there are only 602. Therefore, as the pattern set sizes increase, the number of duplicate patterns will increase, making the generalization performance statistic less and less interesting (Frean, 1992). For the training sets used here, it was found that the 50, 100, 200, 400, 600, and 800 pattern training sets had a mean of 3, 8, 20, 57, 101, and 155 duplicate patterns, respectively.
clumps?") can be only partially captured by an incomplete set of patterns. As the number of patterns in the set decreases, it becomes increasingly more probable that the unspecified patterns can be assigned so that the order of the problem collapses to a lower value.

It is interesting to note the dramatic increases in performance on the test set as the training set increases from 400 to 600 patterns and from 600 to 800 patterns. These increases correspond with only a marginal increase in the number of monomials and weights in the constructed HONs (Fig. 4). Further, at 600 and 800 training patterns the variations in the number of weights over the 25 trials are too small to mark on the figure, having a standard deviation from the mean of only 0.9. This behavior is not observed in training sets of up to and including 600 patterns under the Upstart and Tiling algorithms (Frean, 1990). It seems that the CHON algorithm has learned considerably more of the structure of the two-or-more clumps problem from 600 and 800 patterns than the increase that occurred from 200 to 400 patterns would suggest was possible.

We can compare our results with those for the Tiling and Upstart algorithms presented in Frean (1990) for training sets of 600 or less patterns; these results have been reproduced in Figure 3 from data generously supplied to us by M. Frean. At 600 patterns, the CHON algorithm outperforms the Upstart and Tiling algorithms. At less than 600 training patterns, the Upstart algorithm is a slightly better performer and the Tiling algorithm performs roughly the same as the CHON algorithm we have presented.

5.1. Conclusions

By utilizing a constructive architecture, we developed an algorithm that solves a mapping problem in time polynomial in terms of the number of patterns. The scheme involves the selection of multiplicative nonlinearities as hidden units (the HON architecture) based upon their relevance to (i.e., linear independence upon) the particular pattern set.
Recently, Blum and Rivest (1992) suggested this as a possible approach to overcoming hardness results for the training problem. Further, using the concept of order, we have shown that the representational ability of an arbitrary FFN cannot be better than that of a HON of equivalent order. In addition, as long as the net structure is not set before training, we have shown that it is a simple matter to use a HON to construct a sigmoidal net of the same maximal fan-in in polynomial time. Finally, the algorithm does not have any parameters that require tuning for good performance, and it seems to perform reasonably well when compared with other constructive algorithms on a standard test problem. These properties make the CHON algorithm an attractive one. The universal set concept and algorithm could also be used with a least squares algorithm to obtain a least
FIGURE 3. Performance of the CHON, Upstart, and Tiling algorithms on the two-or-more clumps problem. Each point is computed from the mean of 25 trials on different training sets, and the error bars indicate one standard deviation from the mean. Error bars for the Upstart and Tiling algorithms are not displayed to reduce clutter, although the standard deviations range between 1.7 and 2.8 for the Upstart algorithm and between 2.1 and 2.9 for the Tiling algorithm. Data for the Upstart and Tiling algorithms was supplied by M. Frean. [Axes: number of training patterns (0-900) versus percentage of test set correct (50-100).]

FIGURE 4. Number of weights (equal to the number of monomial hidden units) in the networks constructed by the CHON algorithm for the two-or-more clumps problem. Each point is computed from the mean of 25 trials on different training sets, and the error bars indicate one standard deviation from the mean. The standard deviation for 600 and 800 training patterns is too small to plot at 0.9. [Axes: number of training patterns (0-900) versus number of weights (0-350).]
TABLE 1. Network order occurrences for 25 trials in the solutions to the two-or-more clumps problem as constructed by the CHON algorithm.

    Training Set    Network Order 2    Network Order 3    Mean Number of Connections
    50              23                 2                  29
    100             25                 -                  131
    200             25                 -                  287
    400             20                 5                  616
    600             -                  25                 627
    800             -                  25                 627

In each trial, for particular training set sizes the network order is recorded along with the mean number of connections between the inputs and hidden layer.
squares fit to the training patterns. However, a network obtained by such an approach would not have the property of being guaranteed to be a network of minimal order for the test patterns presented. An article dealing with the case of real-valued patterns (of limited precision) is forthcoming. An incremental version of the constructive HON algorithm could be developed to obtain an "online" version of the algorithm.
REFERENCES

Ash, T. (1989). Dynamic node creation in backpropagation networks (ICS Rep. 8901). San Diego: Institute for Cognitive Science, UCSD.
Baum, E. B. (1991). Review of Neural network design and the complexity of learning, by S. Judd. IEEE Transactions on Neural Networks, 2(1), 181-182.
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151-160.
Bazaraa, M. S., Jarvis, J. J., & Sherali, H. D. (1990). Linear programming and network flows (2nd ed.). New York: John Wiley & Sons.
Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5, 117-127.
Cheney, E. W. (1982). Introduction to approximation theory (2nd ed.). New York: Chelsea Publishing.
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications to pattern recognition. IEEE Transactions on Electronic Computers, EC-14, 326-334.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., & Jackel, L. (1987). Large automatic learning, rule extraction, and generalization. Complex Systems, 1, 877-922.
Fahlman, S. E., & Lebiere, C. L. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 524-532). San Mateo, CA: Morgan-Kaufmann.
Frean, M. (1990). The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2(2), 198-209.
Frean, M. (1992). Personal communication.
Giles, C. L., & Maxwell, T. (1987). Learning, invariance, and generalization in higher-order neural networks. Applied Optics, 26, 4972-4978.
Golea, M., & Marchand, M. (1990). A growth algorithm for neural network decision trees. Europhysics Letters, 12, 205-210.
Hanson, S. J. (1990). Meiosis networks. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 533-541). San Mateo, CA: Morgan-Kaufmann.
Hanson, S. J., & Pratt, L. Y. (1989). Comparing biases for minimal network construction with back-propagation. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1 (pp. 177-185). San Mateo, CA: Morgan-Kaufmann.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Ji, C., Snapp, R. R., & Psaltis, D. (1990). Generalizing smoothness constraints from discrete samples. Neural Computation, 2(2), 188-197.
Judd, J. S. (1990). Neural network design and the complexity of learning. Cambridge, MA: MIT Press.
Kaplan, K. R., & Winder, R. O. (1965). Chebyshev approximation and threshold functions. IEEE Transactions on Electronic Computers, EC-14, 250-252.
Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. Combinatorica, 4, 373-395.
Keeler, J. D. (1987). Information capacity of outer product neural networks. Physics Letters A, 124, 53-58.
Kohring, G. A. (1990). Neural networks with many neuron interactions. Journal de Physique, 51(2), 145-155.
Kramer, A. H., & Sangiovanni-Vincentelli, A. (1989). Efficient parallel learning algorithms for neural networks. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1 (pp. 40-48). San Mateo, CA: Morgan-Kaufmann.
Krishnan, T. (1966). On the threshold order of a Boolean function. IEEE Transactions on Electronic Computers, EC-15, 369-372.
Le Cun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 598-605). San Mateo, CA: Morgan-Kaufmann.
Lee, Y. C., Doolen, G., Chen, H. H., Sun, G. Z., Maxwell, T., Lee, H. Y., & Giles, C. L. (1986). Machine learning using a higher order correlation network. Physica, 22D, 276-306.
Lipschutz, S. (1968). Theory and problems of linear algebra. Schaum's outline series. New York: McGraw-Hill.
Marchand, M., Golea, M., & Rujan, P. (1990). Convergence theorem for sequential learning in two layer perceptrons. Europhysics Letters, 11, 481-492.
Mezard, M., & Nadal, J.-P. (1989). Learning in feedforward layered networks: The tiling algorithm. Journal of Physics A: Mathematical and General, 22(12), 2191-2203.
Minsky, M. L., & Papert, S. A. (1988). Perceptrons (2nd ed.). Cambridge, MA: MIT Press.
Mozer, M. C., & Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1 (pp. 107-115). San Mateo, CA: Morgan-Kaufmann.
Nadal, J.-P. (1989). Study of a growth algorithm for a feedforward network. International Journal of Neural Systems, 1(1), 55-59.
Nilsson, N. J. (1965). Learning machines. New York: McGraw-Hill.
Perantonis, S. J., & Lisboa, P. J. G. (1992). Translation, rotation, and scale invariant pattern recognition by high-order neural networks and moment classifiers. IEEE Transactions on Neural Networks, 3(2), 241-251.
Peretto, P., & Niez, J. J. (1986). Long term memory storage capacity of multiconnected neural networks. Biological Cybernetics, 54, 53-63.
Personnaz, L., Guyon, I., & Dreyfus, G. (1987). High-order neural networks: Information storage without errors. Europhysics Letters, 4, 863-867.
Poggio, T. (1975). On optimal nonlinear associative recall. Biological Cybernetics, 19, 201-209.
Psaltis, D., Park, C. H., & Hong, J. (1988). Higher order associative
memories and their optical implementation. Neural Networks, 1, 149-163.
Redding, N. J. (1991). Some aspects of representation and learning in artificial neural networks. Ph.D. thesis, University of Queensland.
Redding, N. J., Kowalczyk, A., & Downs, T. (1991). Higher order separability and minimal hidden unit fan-in. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks (vol. 1, pp. 25-30). North-Holland: Elsevier Science.
Refenes, A. N., & Vithlani, S. (1991). Constructive learning by specialisation. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks (vol. 2, pp. 923-929). North-Holland: Elsevier Science.
Reid, M. B., Spirkovska, L., & Ochoa, E. (1989). Rapid training of higher-order neural networks for invariant pattern recognition. In Proceedings of the International Joint Conference on Neural Networks (vol. 1, pp. 689-692). Washington, DC: IEEE.
Rujan, P., & Marchand, M. (1989). Learning by minimizing resources in neural networks. Complex Systems, 3, 229-241.
Shin, Y., & Ghosh, J. (1991). The pi-sigma network: An efficient higher-order neural network for pattern classification and function approximation. In Proceedings of the International Joint Conference on Neural Networks (vol. 1, pp. 13-18). Washington, DC: IEEE.
Sietsma, J., & Dow, R. J. F. (1991). Creating artificial neural networks that generalize. Neural Networks, 4, 67-80.
Sirat, J. A., & Jorand, D. (1990). Third-order Hopfield networks: Extensive calculations and simulations. Philips Journal of Research, 44, 501-519.
Sirat, J. A., & Nadal, J.-P. (1990). Neural trees: A new tool for classification. Network: Computation in Neural Systems, 1(4), 423-428.
Wang, C., & Williams, A. C. (1991). The threshold order of a Boolean function. Discrete Applied Mathematics, 31, 51-69.
Wynne-Jones, M. (1991). Constructive algorithms and pruning: Improving the multilayer perceptron. In R. Vichnevetsky & J. J. H. Miller (Eds.), Proceedings of the 13th IMACS World Congress on Computation and Applied Mathematics (pp. 747-750). Dublin: IMACS.
Wynne-Jones, M. (1992). Node splitting: A constructive algorithm for feed-forward neural networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems 4. San Mateo, CA: Morgan Kaufmann.

APPENDIX

In this appendix, we give the development of an algorithm for computing a universal set in O(np³) time. In constructing the universal set, it is necessary to make use of the x-space concept in performing the linear independence tests in Algorithm 3. A set of monomials M = {φ_1, φ_2, ..., φ_r} is linearly independent on a set of patterns X if the monomials' x-space points are linearly independent vectors. By letting each of these x-space vectors form the rows of a matrix, we can perform linear independence tests on this matrix in the manner outlined in the following development. First, we have to introduce the echelon form of a matrix.

DEFINITION 7. A matrix A = (a_ij) is said to be a matrix in echelon form, or an echelon matrix, if the number of zeros preceding the first nonzero entry of a row increases row by row until only zero rows remain.

The echelon form of a matrix can be used to determine linear independence, as the following theorem indicates (Lipschutz, 1968, p. 87).

THEOREM 13. The nonzero rows of a matrix in echelon form are linearly independent.

A matrix can be placed in echelon form using the Gaussian elimination algorithm. In the standard form of this algorithm, an entire matrix is converted to echelon form, but this does not completely fit our requirements for a linear independence test. For instance, consider what happens during the operation of Algorithm 3. In this algorithm, the universal set M is constructed incrementally, so that during the course of the algorithm a single new monomial is tested for linear independence from the existing monomials in M. It would be desirable to make use of the work done in previous linear independence tests in each new test, and if this is done a complete implementation of the Gaussian elimination algorithm is not necessary: the following algorithm suffices.

ALGORITHM 5 (incremental Gaussian elimination). The matrix A = (a_ij) is an r × c matrix in echelon form, with rows denoted by a_i. (For our purposes, the number of columns c = |X| and the number of rows r equals the number of monomials already in the universal set.) To this matrix we wish to add a new row, a_q, and place the new matrix in echelon form.

01 for i = 1, ..., r
02     let j indicate the first nonzero column of row a_i
03     a_q := a_q - (a_qj / a_ij) a_i
04 form a new (r + 1) × c matrix by placing a_q and the rows of A into echelon row order

This simple procedure of first placing a matrix in echelon form and then testing for nonzero rows will form the basis of an efficient algorithm for determining a universal set. The following example demonstrates how the linear independence test is applied to the x-space vectors for each monomial under consideration.

By incorporating the incremental Gaussian elimination procedure (Algorithm 5) into Algorithm 3, we obtain an efficient algorithm for determining a universal set. Note that in this algorithm (detailed below) the x-space vectors after each incremental Gaussian elimination are effectively placed in echelon row order using a linked list. Therefore, placing the row vectors in echelon row order (step 04 of Algorithm 5) is simply a matter of performing a linear search through the list to find the appropriate point at which to insert a pointer to the new row.

In this algorithm, the set of patterns X has cardinality |X| = p and X ⊆ 𝔹ⁿ. The x-space outputs are used to form the rows of a matrix A upon which the incremental Gaussian elimination algorithm is performed. During the operation of the algorithm, it is necessary to keep track of some internal processes using notation that we will now introduce. The x-space vector for the current monomial under consideration is denoted by z and its ith element by z_i. The variable s is used to denote the current number of linearly independent monomials. The integer l_k is used as an index to keep track of the monomials that have been added to the universal set: the value of l_k is the index of the final monomial of degree k found to be linearly independent and hence included in the universal set. If M denotes the universal set constructed, then the elements of the set are given by

    M = {φ_{T_1}, ..., φ_{T_{l_{k-1}+1}}, ..., φ_{T_{l_k}}, ..., φ_{T_{l_n}}},

where the monomials follow the order in which they are added to M by the algorithm. Further, the monomials φ_{T_{l_{k-1}+1}}, ..., φ_{T_{l_k}} are all the monomials of degree k in M, that is, the subset M_k of M.

ALGORITHM 6

% Initialization
01 T_1 = {0}
02 φ_{T_1} = 1
03 a_1 = (1, ..., 1)
04 l_0 = 1
05 s = 1
% Find linearly independent monomials of degree 1
06 for i = 1, ..., n
07     z = [φ_{{i}}]_X
08     for r = 1, ..., s
09         let q = index of first nonzero element of row a_r
10         z := z - (z_q / a_rq) a_r
%    If z is nonzero then φ_{{i}} is a linearly independent monomial
11     if z has nonzero elements
12         s := s + 1
13         a_s = z
14         T_s = {i}
15         reorder a_1, ..., a_s into echelon row order
16 l_1 = s
% Find linearly independent monomials of degree > 1
17 for k = 2, ..., n
18     for i = 1, ..., l_1
19         for j = l_{k-2} + 1, ..., l_{k-1}
20             z = [φ_{T_i}]_X · [φ_{T_j}]_X
21             for r = 1, ..., s
22                 let q = index of first nonzero element of row a_r
23                 z := z - (z_q / a_rq) a_r
%            If z is nonzero then φ_{T_i} φ_{T_j} is a linearly independent monomial
24             if z has nonzero elements
25                 s := s + 1
26                 a_s = z
27                 T_s = T_i ∪ T_j
28                 reorder a_1, ..., a_s into echelon row order
29     l_k = s

Finally, a universal set M = {φ_{T_i} | i = 1, ..., l_n} is obtained.

A more stable algorithm is possible if the largest element of row a_r is used rather than the first nonzero one in lines 09 and 22, but this will require some reordering of the columns of A (this is called pivoting). The complexity of Algorithm 6 is stated in the following theorem.

THEOREM 14. Algorithm 6 constructs a universal set for X in O(n|X|³) time.

Proof. Clearly, the computationally expensive piece of pseudocode is the block that is most deeply nested. In Algorithm 6, this is the block of code in lines 21-28. This block is divided into two sections of interest: lines 21-23 containing the for loop, which we will refer to as the for-block, and the if-statement of lines 24-28, referred to as the if-block. The computational complexity of the if-block is given by the cost of line 26, which will be a constant factor of p, because the reordering of the rows in line 28 is a simple list operation that will cost a constant factor of s in the worst case. The complexity of the for-block, however, is greater than p because the operation in line 23, executed s times, involves the p elements of the vectors z and a_r. As a result, the computational complexity of Algorithm 6 is determined by the total cost of executing the for-block during the algorithm's operation.

Assuming that the cost of arithmetic operations is constant, the cost that the for-block contributes to the overall algorithm is determined by summing over the nested loops within which this block lies. There are four loops to take into account in this calculation: the loops that occur on lines 17, 18, 19, and 21 of the algorithm. So, then, the contribution to the algorithm's total cost from the for-block is given by

    total_cost(for-block) = Σ_{k=2}^{n} Σ_{i=1}^{l_1} Σ_{j=l_{k-2}+1}^{l_{k-1}} Σ_{r=1}^{s} cost(for-block).    (A.1)

The cost of the for-block by itself is given by

    cost(for-block) ≤ cp,

where c is a constant and s is the current number of linearly independent monomials. We can determine the value of s in eq. (A.1) by noting that for each iteration of the outer loop on line 17 with loop variable k, the current number of linearly independent monomials will be less than or equal to l_k, so in eq. (A.1) s ≤ l_k. So, then, the total cost of the for-block in the algorithm is

    total_cost(for-block) ≤ Σ_{k=2}^{n} Σ_{i=1}^{l_1} Σ_{j=l_{k-2}+1}^{l_{k-1}} cp l_k = cp l_1 Σ_{k=2}^{n} (l_{k-1} - l_{k-2}) l_k,

which can be simplified in the following manner. By noting that

    Σ_{k=2}^{n} (l_{k-1} - l_{k-2}) l_k ≤ l_n Σ_{k=2}^{n} (l_{k-1} - l_{k-2}) = l_n (l_{n-1} - l_0) ≤ l_n²,

we obtain

    total_cost(for-block) ≤ cp l_1 l_n² ≤ cnp³,

since l_1 ≤ n + 1 and l_n ≤ p. Finally, then, as determined by the cost of the for-block, the complexity of the algorithm is O(np³).

NOMENCLATURE

𝔹          binary {0, 1} or bipolar {-1, +1} set
x          vector input pattern
x_i        ith input element of pattern vector x
X          input pattern set containing all inputs x
X⁺, X⁻     sets of positively and negatively classified patterns from X
f          function mapping X → 𝔹
k          integer denoting order
n          dimension of the input space
φ          hidden function of a hidden unit
Φ          hidden function family (HFF)
Φ_i        set of monomials of degree i
Φ_{≤k}     union of sets Φ_0, ..., Φ_k
𝒥          image space spanned by the hidden functions
𝒥⁺, 𝒥⁻     images of X⁺ and X⁻ in the image space
M          universal set of monomials
M_k        subset of monomials of degree k in M
[φ]_X      x-space vector of the monomial φ on the pattern set X
Span(·)    linear span of a set of vectors
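To make the mechanics of Algorithms 5 and 6 concrete, the following Python sketch (ours, not from the paper; names such as `reduce_row`, `universal_set`, and the tolerance `EPS` are our own choices) builds a universal set for a small binary pattern set. Each candidate monomial's x-space vector is reduced against the rows already held in echelon order (Algorithm 5), the monomial is kept only if a nonzero residual remains, and degree-k candidates are formed as products of accepted degree-1 and degree-(k-1) monomials, as in Algorithm 6.

```python
EPS = 1e-9  # numerical tolerance (our addition; the paper's pseudocode is exact)

def x_space_vector(T, X):
    """x-space vector of the monomial prod_{i in T} x_i on binary patterns X;
    the empty index set gives the constant monomial 1."""
    return [1.0 if all(x[i] for i in T) else 0.0 for x in X]

def pivot(row):
    """Column index of the first nonzero element of a row."""
    return next(i for i, v in enumerate(row) if abs(v) > EPS)

def reduce_row(z, rows):
    """Algorithm 5: subtract multiples of the echelon-ordered rows from z.
    The result is (near-)zero exactly when z depends linearly on the rows."""
    z = list(z)
    for a in rows:
        q = pivot(a)
        if abs(z[q]) > EPS:
            f = z[q] / a[q]
            z = [zi - f * ai for zi, ai in zip(z, a)]
    return z

def universal_set(X, n):
    """Algorithm 6: collect index sets T whose monomials have linearly
    independent x-space vectors on X."""
    rows = [[1.0] * len(X)]        # a_1 = (1, ..., 1): the constant monomial
    M = [frozenset()]              # its (empty) variable-index set
    l = {0: 1}                     # l_0 = 1
    # linearly independent monomials of degree 1
    for i in range(n):
        z = reduce_row(x_space_vector({i}, X), rows)
        if any(abs(v) > EPS for v in z):
            rows.append(z)
            M.append(frozenset({i}))
            rows.sort(key=pivot)   # restore echelon row order
    l[1] = len(M)
    # degree-k candidates: products of degree-1 and degree-(k-1) monomials
    for k in range(2, n + 1):
        for Ti in M[1:l[1]]:
            for Tj in M[l[k - 2]:l[k - 1]]:
                T = Ti | Tj
                if len(T) != k:    # x_i^2 = x_i on binary inputs: nothing new
                    continue
                z = reduce_row(x_space_vector(T, X), rows)
                if any(abs(v) > EPS for v in z):
                    rows.append(z)
                    M.append(T)
                    rows.sort(key=pivot)
        l[k] = len(M)
    return M

# Two-variable XOR patterns: the universal set holds 1, x1, x2, and x1*x2
print(universal_set([(0, 0), (0, 1), (1, 0), (1, 1)], 2))
```

Keeping the rows sorted by pivot column plays the role of the linked-list reordering in step 04 of Algorithm 5: a reduced row's leading entry never lands on an existing pivot column, so a single pass in pivot order suffices for each new test.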
binary {G, I} or bipolar {-I , +I} set vector input pattern ith input element of pattern vector x input pattern set containing all inputs x sets of positivelyand negativelyclassified patterns from X function mapping X ... 13 integer denoting order hidden function of hidden unit hidden function family (HFF) set of monomials of degree i union of sets epo,