Constructive higher-order network that is polynomial time


Neural Networks, Vol. 6, pp. 997-1010, 1993. Printed in the USA. All rights reserved. 0893-6080/93 $6.00 + .00. Copyright © 1993 Pergamon Press Ltd.

ORIGINAL CONTRIBUTION

Constructive Higher-Order Network Algorithm That Is Polynomial Time

NICHOLAS J. REDDING,1 ADAM KOWALCZYK,2 AND TOM DOWNS3

1DSTO Information Technology Division, 2Telecom Australia Research Laboratories, and 3University of Queensland

(Received 27 August 1992; accepted 1 December 1992)

Abstract-Constructive learning algorithms are important because they address two practical difficulties of learning in artificial neural networks. First, it is not always possible to determine the minimal network consistent with a particular problem. Second, algorithms like backpropagation can require networks that are larger than the minimal architecture for satisfactory convergence. Further, constructive algorithms have the advantage that polynomial-time learning is possible if network size is chosen by the learning algorithm so that the learning of the problem under consideration is simplified. This article considers the representational ability of feedforward networks (FFNs) in terms of the fan-in required by the hidden units of a network. We define network order to be the maximum fan-in of the hidden units of a network. We prove, in terms of the problems they may represent, that a higher-order network (HON) is at least as powerful as any other FFN architecture when the orders of the networks are the same. Next, we present a detailed theoretical development of a constructive, polynomial-time algorithm that will determine an exact HON realization with minimal order for an arbitrary binary or bipolar mapping problem. This algorithm does not have any parameters that need tuning for good performance. We show how an FFN with sigmoidal hidden units can be determined from the HON realization in polynomial time. Last, simulation results of the constructive HON algorithm are presented for the two-or-more clumps problem, demonstrating that the algorithm performs well when compared with the Tiling and Upstart algorithms.

Keywords-Constructive networks, Higher-order networks, Feedforward networks, Network order, Fan-in, Representation, Polynomial time, Two-or-more clumps problem.

Acknowledgements: The authors thank Garry Newsam, Peter Bartlett, and Andrew Back for their valuable suggestions and comments; Raymond Lister and especially Marcus Frean for their help on the two-or-more clumps problem; and Ewa Kowalczyk for help in the simulations. This work was partially supported by the Australian Telecommunications and Electronics Research Board. Requests for reprints should be sent to Nicholas Redding, DSTO Information Technology Division, P.O. Box 1500, Salisbury SA 5108, Australia.

1. INTRODUCTION

Judd (1990) has proven the NP-completeness of the loading problem, that is, "can a given neural network map a predetermined set of input patterns to a desired set of output patterns?" However, Baum in a review (1991) pointed out that by framing the learning issues more realistically an NP-completeness result need not arise. The key is that in Judd's loading problem the learning algorithm has no control over the network's size, which is predetermined. In practice, the network's size is chosen by the user, rather than by an "adversary," ensuring that the problem lies within the ability of the network and learning algorithm. If the networks available to the learning system are unrestricted, then polynomial-time complexity results are possible.

There are, however, two practical difficulties with this approach: First, it is not always possible to determine the minimal network consistent with a particular problem and, second, algorithms like backpropagation can require networks that are larger than the minimal architecture for satisfactory convergence. These two points provide the motivation for the development of constructive learning algorithms, which do not decide a priori upon the required network size but "grow" the network as needed. Most constructive algorithms, however, provide no assurance that the resulting network will be minimal in any sense. This is an important issue not only for cost reasons but also because there is convincing evidence that using a larger network than required can adversely affect the network's ability to generalize (Denker, Schwartz, Wittner et al., 1987; Baum & Haussler, 1989). Other approaches that have been used to try to achieve a minimal solution either "prune" the network after training has taken place (Mozer & Smolensky, 1989; Le Cun, Denker, & Solla, 1990; Sietsma & Dow, 1991) or use a bias term in the error function to inhibit network size (Denker et al., 1987; Hanson & Pratt, 1989; Ji, Snapp, & Psaltis, 1990).

Some of the main constructive algorithms employ a number of architectures that create a hierarchical partitioning of the input space (Mezard & Nadal, 1989; Nadal, 1989; Fahlman & Lebiere, 1990; Frean, 1990; Golea & Marchand, 1990; Marchand, Golea, & Rujan, 1990; Sirat & Nadal, 1990). [A summary of the salient features of these constructive algorithms can be found in Hertz, Krogh, and Palmer (1991) and in Wynne-Jones (1991).] Other algorithms (Ash, 1989; Refenes & Vithlani, 1991; Wynne-Jones, 1992) create additional nodes during backpropagation training, and the algorithm of Hanson (1990) behaves similarly but uses a stochastic search in place of backpropagation. Convergence proofs for some of these constructive algorithms are available, but most do not contain any statements regarding the time complexity of training. One exception is the regular partitioning algorithm of Rujan and Marchand (1989), which has polynomial-time complexity (and is also minimal in some sense), although this is in terms of $P = 2^n$, the total number of patterns in an input space of $n$ dimensions. In practice, however, as the input dimension $n$ increases the actual number of patterns $p$ may form a decreasing fraction $p/P$ of the possible patterns, so that the algorithm becomes prohibitively expensive.

Recently, Blum and Rivest (1992) presented material that extends Judd's results to show that training a simple feedforward network (FFN) with linear threshold functions is NP-complete. They also showed for a simple example that polynomial-time learning is possible if the network is enlarged to include extra inputs that are nonlinear combinations of the original inputs. Blum and Rivest, however, did not present a general algorithm that could be used. In this article, we present a general algorithm that will learn an arbitrary mapping problem in polynomial time. More precisely, this article extends the previous work of the authors (Redding, 1991; Redding, Kowalczyk, & Downs, 1991) on higher-order networks (HONs; Giles & Maxwell, 1987) to develop a constructive HON algorithm for an arbitrary problem that trains in time polynomial in the number of examples in the training set.

The constructive HON algorithm we present has a number of important properties. Apart from the polynomial-time complexity property that we have already mentioned, the constructed network has minimal maximum fan-in, where maximum fan-in is defined to be the largest number of network inputs that connect with any single hidden unit. This maximum fan-in of a network is defined to be the order of the network by Minsky and Papert (1988). Network order is largely neglected by the current literature but has a long history, dating back to the mid 1960s (Cover, 1965; Krishnan, 1966; Minsky & Papert, 1988). We will demonstrate that the minimum network order required to solve a given problem is independent of the kind of architecture under consideration, and in this sense order is a structure-free property of the problem. In addition, we will show how the algorithm can be used to construct a sigmoidal net in polynomial time from the HON.

In FFNs, full connectivity from the inputs to the hidden units (i.e., maximal order) is only required for a small percentage of learning problems. For many real-world problems, maximal order is infeasible because such problems often have many thousands of inputs. Researchers have tried to address this problem a posteriori using techniques like pruning (Sietsma & Dow, 1991) and weight decay (Kramer & Sangiovanni-Vincentelli, 1989). Our constructive approach automatically ensures that the network does not have a larger fan-in than necessary.

In Section 2, we present more precise definitions of the concepts that will be used in the remainder of this article. Section 3 deals with representational issues of order in FFNs. Next, Section 4 gives a development of the constructive HON algorithm. Finally, simulation results and a discussion are presented in Section 5.

2. TERMINOLOGY

2.1. FFNs

The learning problems that are of interest to us here can be expressed in the following manner. We are given a finite set of two-valued patterns $X \subset \mathbb{B}^n$, where $\mathbb{B}$ is either the binary set $\{0, 1\}$ or the bipolar set $\{-1, +1\}$. (In what follows, most of our definitions will be in terms of bipolar values, although sometimes we will find it convenient to use binary values.) The pattern set $X$ is formed from the union of two disjoint sets $X^+$ and $X^-$. We wish the network to learn a function $\Psi$ that exactly maps the set $X$ into a two-valued set that will be used to indicate whether the pattern $x \in X$ is a member of $X^+$ or $X^-$. The two-valued set of outputs can be either binary or bipolar as convenient, so that $\Psi$ performs the mapping $\Psi : X \to \mathbb{B}$.

The function $\Psi(x)$, where $x$ is the input, is computed in two stages. First, a set of functions $\Phi(x) = \{\varphi_1(x), \varphi_2(x), \ldots, \varphi_r(x)\}$ is computed, and then the results are combined by means of a combining function, say $\Omega$, of $r$ arguments to obtain $\Psi$. Because both $\Psi$ and the $\varphi_i$ are functions, it is often convenient to refer to the functions $\varphi_i$ as hidden functions to distinguish them from the overall function $\Psi$. This framework, called the perceptron scheme, is not limited to any particular variety of hidden functions and will allow us to derive results for a variety of different FFN architectures. (See Figure 1 for a depiction of the perceptron scheme.) A network within this framework will here be called a perceptron and includes the backpropagation network and most other FFNs. We are particularly interested in the HON architecture and shall discuss later how this fits within the perceptron scheme. Note that, for simplicity, the argument $x$ is often omitted from the functions $\Phi$ and $\Psi$.

Let $\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_r\}$ be a hidden function family (HFF) defined on some subset of $\mathbb{B}^n$. Let us introduce the following definition, stated for bipolar-valued $\Psi$.

DEFINITION 1. A function $\Psi$ is a linear threshold function with respect to $\Phi$ if there exists a number $\theta$ and a set of numbers $\{w_1, w_2, \ldots, w_r\}$ such that $\Psi(x) = 1$ if and only if

$$w_1\varphi_1(x) + \cdots + w_r\varphi_r(x) \ge \theta$$

and $\Psi(x) = -1$ if and only if

$$w_1\varphi_1(x) + \cdots + w_r\varphi_r(x) < \theta.$$

Note that $\theta$ is commonly termed the threshold. This is equivalent to saying that $\Psi$ is linearly separable (LS) with respect to $\Phi$, the space spanned by the hidden functions $\varphi_1, \varphi_2, \ldots, \varphi_r$. Nilsson (1965) termed this space the image space. In the perceptron scheme, the function $\Omega$ is the familiar linear combination followed by thresholding. In this scheme, the function $\Omega$ is written explicitly as

$$\Omega(\varphi_1, \ldots, \varphi_r) = \varsigma\Big(\sum_{i=1}^{r} w_i\varphi_i - \theta\Big),$$

where the decision function is

$$\varsigma(u) = \begin{cases} +1, & u \ge 0 \\ -1, & u < 0. \end{cases}$$

Thus, a perceptron defined upon the HFF $\Phi$ computes the function $\Psi(x)$ as follows:

$$\Psi(x) = \varsigma\Big(\sum_{\varphi_i \in \Phi} w_i\varphi_i - \theta\Big). \tag{1}$$
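As a concrete illustration of eqn (1), the following minimal Python sketch evaluates a perceptron defined on a small HFF. The names `decision` and `make_perceptron`, and the particular hidden functions, weights, and threshold (chosen here to realize a bipolar AND), are assumptions made for this example and do not come from the original development.

```python
from typing import Callable, Sequence

Pattern = Sequence[int]                 # bipolar pattern, entries in {-1, +1}
HiddenFn = Callable[[Pattern], float]   # one hidden function phi_i


def decision(u: float) -> int:
    """Decision function: +1 when u >= 0, otherwise -1."""
    return 1 if u >= 0 else -1


def make_perceptron(hidden: Sequence[HiddenFn],
                    weights: Sequence[float],
                    theta: float) -> Callable[[Pattern], int]:
    """Build Psi(x) = decision(sum_i w_i * phi_i(x) - theta), as in eqn (1)."""
    def psi(x: Pattern) -> int:
        u = sum(w * phi(x) for w, phi in zip(weights, hidden)) - theta
        return decision(u)
    return psi


# Hypothetical HFF for the example: the two first-order monomials x1 and x2.
hidden = [lambda x: x[0], lambda x: x[1]]

# Assumed weights and threshold: x1 + x2 >= 1.5 holds only when x1 = x2 = +1.
and_gate = make_perceptron(hidden, weights=[1.0, 1.0], theta=1.5)

patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
outputs = [and_gate(x) for x in patterns]
print(outputs)
```

Because each hidden function here reads a single input, this particular perceptron has order 1; swapping in other hidden functions changes the HFF without changing the combining step.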

2.2. Problem Order

Order of a given problem is defined here in terms of the order of the network architecture required to implement the problem. So, it is reasonable to talk of both function (or problem) order and network order, two related but different concepts. Consider a hidden function $\varphi(x)$ and its implementation $\varphi(i_1, \ldots, i_k)$ depending upon inputs $x_{i_1}, \ldots, x_{i_k}$.

DEFINITION 2. We define the fan-in of a hidden function implementation $\varphi(i_1, \ldots, i_k) : X \to \mathbb{B}$, denoted $\mathrm{fan\_in}(\varphi)$, to be $k$, the number of inputs to $\varphi$.

Much of the following development concerns HFFs of monomials (e.g., $x_{i_1}x_{i_2}\cdots x_{i_k}$). Products involving higher powers of input elements need not be considered in the binary or bipolar input case because, for instance, $x_i^n = x_i$, $n \in \mathbb{N}$, when $x_i \in \{0, 1\}$. The real case is not considered in this article. We now wish to consider the concept of problem order. To assist us, we first introduce a definition of network order.

DEFINITION 3. In a two-layer FFN, let the number of inputs to hidden unit $i$ be $k_i$. The order of the network is equal to $\max_i(k_i)$.

The order of a given function is then defined as being equal to the order of the lowest-order network necessary to realize that function. A more formal definition, which avoids reference to any specific network architecture, is as follows.

FIGURE 1. Important aspects of the perceptron scheme (diagram labels: input, hidden functions, combining function, output). These include the input elements $x_i$ of the input pattern $x$, the "hidden units" of the network that compute the partial functions $\varphi_i(x)$ from the set $\Phi(x)$, and the combining function $\Omega$, which gives the output $\Psi(x)$.

DEFINITION 4. The order of a function $\Psi$, $\mathrm{ord}(\Psi)$, is the smallest number $k$ for which we can find a set of hidden functions $\Phi = \{\varphi_i\}$ with $\mathrm{fan\_in}(\varphi_i) \le k$ for all $\varphi_i \in \Phi$, such that $\Psi$ is a linear threshold function with respect to $\Phi$.

Other equivalent definitions of order are possible in the case of two-valued inputs (Krishnan, 1966; Wang & Williams, 1991).


An interesting consequence of the definitions of problem order and fan-in is that it is possible to discuss how they relate to FFN structures with more than two layers. Figure 2 indicates how a multilayered FFN can be contracted down so that all the layers of hidden units can be considered as a single layer of composite hidden units. When this is done, a composite hidden unit's fan-in is simply determined by the number of network inputs that reach it, and the value of the largest such fan-in is the network order. This allows one to discuss the relationships between network order and problem order with complete generality. For example, if the order of a three-layer FFN is $k$, then it is only capable of realizing a problem of order $k$ or less; the fact that it is a three-layer network has no special significance.
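The contraction described above can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the function name `network_order` and the connectivity format (one list per hidden layer, each unit given as the indices of the previous-layer units, or of the network inputs for the first layer, that feed it) are hypothetical and not taken from the article.

```python
from typing import FrozenSet, List, Sequence


def network_order(layers: Sequence[Sequence[Sequence[int]]]) -> int:
    """Order of a layered FFN, found by contracting its hidden layers.

    layers[l][j] lists the indices feeding hidden unit j of hidden layer l.
    For l == 0 the indices refer to network inputs; for l > 0 they refer
    to units of hidden layer l - 1.
    """
    reach: List[FrozenSet[int]] = []
    for depth, layer in enumerate(layers):
        new_reach = []
        for connections in layer:
            if depth == 0:
                # First hidden layer: the unit sees exactly these inputs.
                inputs_seen = frozenset(connections)
            else:
                # Deeper layer: union of the inputs seen by each parent unit.
                inputs_seen = frozenset().union(*(reach[j] for j in connections))
            new_reach.append(inputs_seen)
        reach = new_reach
    # Composite fan-in of each unit in the last hidden layer; order is the max.
    return max(len(inputs_seen) for inputs_seen in reach)


# Hypothetical network with four inputs (0..3) and two hidden layers.
example = [
    [(0, 1), (1, 2), (3,)],   # first hidden layer
    [(0, 1), (2,)],           # second hidden layer, fed by first-layer units
]
print(network_order(example))
```

In the example, the first composite unit reaches inputs 0, 1, and 2 through its two parents, so its composite fan-in is 3 even though no individual unit has more than two incoming links.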

2.3. The HFF

We already mentioned how the use of an HFF determines the particular type of FFN modeled. This requires further elaboration. In the case of HONs, the HFF, denoted $\Phi_{\mathrm{HON}}$, is simply a subset of possible monomials on the input elements $x_1, x_2, \ldots, x_n$. In the case of binary-valued or bipolar-valued inputs, the HFF is

$$\{1, x_1, x_2, \ldots, x_n, x_1x_2, x_1x_3, \ldots, x_2x_3\cdots x_n, x_1x_2\cdots x_n\}.$$

When the inputs are bipolar valued, the monomials can be considered to compute XOR functions. In the case where the inputs are binary, the monomials compute masks (conjunctions). For simplicity, we will only concern ourselves with the case of a network with a single output; the results that follow can be easily extended to networks with multiple outputs. The equation for a single output of a HON, denoted by $y$, is often written in the following manner:

$$y = \varsigma\Big(\sum_{i=1}^{n} w_ix_i + w_{n+1}x_1x_2 + w_{n+2}x_1x_3 + \cdots + w_{2^n-1}x_1x_2\cdots x_n - \theta\Big). \tag{2}$$

In the above equation, the $w_r$, $r = 1, 2, \ldots, 2^n - 1$, are weightings on the outputs of the hidden functions, $\theta$ is the threshold, and $\varsigma$ denotes the decision function. It is often convenient to replace the threshold $\theta$ by the negative of a weight $w_0$ upon the augmenting input $x_0 = 1$, that is, $w_0x_0 = -\theta$. Network order in a HON is easily identified: When the polynomial in eqn (2) includes products of up to $k$ input terms, the network is termed a $k$th-order HON ($k \le n$).
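To make eqn (2) concrete, the sketch below evaluates the output of a HON whose weights are stored against the index set of each monomial. The helper names (`monomials`, `hon_output`, `hon_order`) and the dictionary representation are assumptions for this illustration only. The example weights realize bipolar XOR with a single second-order monomial, in line with the remark that bipolar monomials compute XOR functions.

```python
from itertools import combinations
from math import prod
from typing import Dict, List, Sequence, Tuple

Monomial = Tuple[int, ...]   # index set of the inputs multiplied together


def monomials(n: int, k: int) -> List[Monomial]:
    """All monomials of degree 1..k on n inputs, as tuples of input indices."""
    return [c for d in range(1, k + 1) for c in combinations(range(n), d)]


def hon_output(x: Sequence[int], weights: Dict[Monomial, float], theta: float) -> int:
    """Thresholded weighted sum of monomials, as in eqn (2)."""
    u = sum(w * prod(x[i] for i in idx) for idx, w in weights.items()) - theta
    return 1 if u >= 0 else -1


def hon_order(weights: Dict[Monomial, float]) -> int:
    """Order of the HON: the largest number of inputs in any weighted monomial."""
    return max(len(idx) for idx in weights)


# Assumed example: with bipolar inputs, x1*x2 is +1 when the inputs agree,
# so a weight of -1 on that monomial gives +1 exactly when they differ (XOR).
xor_weights = {(0, 1): -1.0}
patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
xor_outputs = [hon_output(x, xor_weights, theta=0.0) for x in patterns]

print(len(monomials(3, 2)), xor_outputs, hon_order(xor_weights))
```

Restricting `monomials(n, k)` to a small `k` is what keeps a low-order HON tractable: the number of candidate hidden functions grows with the number of index subsets of size at most `k`, not with $2^n$.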

A number of authors have considered HONs, but the list is small in comparison with the popular Hopfield and sigmoidal FFN architectures (see, e.g., Poggio, 1975; Lee, Doolen, Chen et al., 1986; Peretto & Niez, 1986; Giles & Maxwell, 1987; Keeler, 1987; Personnaz, Guyon, & Dreyfus, 1987; Psaltis, Park, & Hong, 1988; Kohring, 1990; Sirat & Jorand, 1990; Shin & Ghosh, 1991; Perantonis & Lisboa, 1992).

FIGURE 2. How multiple layers of hidden units in an FFN can be contracted down into a single composite layer for the purposes of computing network order and hidden-unit fan-in. Then, the order of a network with more than one layer of hidden units can be seen to be given by the largest number of inputs that connect to any one output link.

3. REPRESENTATIONAL ISSUES

The task we wish the HON to solve is to classify correctly the finite set of patterns $X \subset \mathbb{B}^n$ into the two subsets $X^+$ and $X^-$, corresponding to a unitary positive and negative network output, respectively. In addition, we require that the HON achieve this with a network of minimal order. These task requirements mean that, first, the HON architecture must be able to represent the classification of $X$. Second, it means that if the classification of $X$ is a problem of order $k$ then this representation must be possible in a network of order $k$. We address these issues in this section.

The most important property of HONs when compared with other FFNs is that if a HON of at least order $k$ is required to classify a set of patterns $X \subseteq \mathbb{B}^n$ then an FFN (with any architecture) with order $< k$ that will classify $X$ (with complete accuracy) does not exist. This means that, for example, choosing an HFF such as linear units with sigmoidal outputs over the monomial HFF $\Phi_{\mathrm{HON}}$ offers no advantage in terms of order. This rather remarkable property can be explained by means of the following theorem.

THEOREM 1. Any function on $X \subseteq \mathbb{B}^n$ that can be implemented by an FFN of order $k$ can be implemented by a HON of order $k$.

Proof. For a bipolar vector $y \in X \subseteq \{-1, 1\}^n$, we can construct a polynomial $\delta_y(x) = \prod_{i=1}^{n} (1 + y_ix_i)/2$ vanishing everywhere on $X$ but at $x = y$, where it equals 1. A similar polynomial exists for the binary case. Clearly, $\delta_y(x)$ is a linear combination of monomials of degree not more than $n$. Any hidden function $\varphi$ on $X$ satisfies $\varphi(x) = \sum_{y \in X} \varphi(y)\,\delta_y(x)$, and thus can be written as a linear combination of monomials $\{\varphi_i\}$ of degree (and so fan-in) not more than $n$. Similarly, if one of a network's hidden functions $\varphi$ has $\mathrm{fan\_in}(\varphi) = k$, then $\varphi$ can be written as a linear combination of monomials of degree not more than $k$. Therefore, it is always possible to transform the hidden units of an order $k$ FFN to a linear combination of monomials such that the resulting HON has order no larger than $k$ when $X \subseteq \mathbb{B}^n$. ∎

It is, however, possible to conceive of situations where a restriction on the type of hidden functions can lead to an increase in the required network order. Because of this, it is of some interest to consider whether the HFF is order preserving:

DEFINITION 5. The HFF $\Phi$ is order preserving if any problem of order $k$ can be realized as a network of order $k$ using only hidden functions from $\Phi$.
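The construction used in the proof of Theorem 1 can be checked numerically on a small case. The sketch below is an illustration only: the function names are hypothetical, the expansion is carried out over the full bipolar cube $\{-1, 1\}^k$ rather than a subset $X$, and the tanh unit is an arbitrary stand-in for a sigmoidal hidden function of fan-in 2. Expanding the product form of the indicator polynomial gives the coefficient of the monomial on index set $S$ as $2^{-k}\sum_y \varphi(y)\prod_{i \in S} y_i$.

```python
from itertools import combinations, product
from math import prod, tanh
from typing import Callable, Dict, Sequence, Tuple


def delta(y: Sequence[int], x: Sequence[int]) -> float:
    """Indicator polynomial from the proof: 1 when x == y, else 0 (bipolar)."""
    return prod((1 + yi * xi) / 2 for yi, xi in zip(y, x))


def monomial_coefficients(f: Callable[[Sequence[int]], float],
                          k: int) -> Dict[Tuple[int, ...], float]:
    """Coefficients c_S with f(x) = sum_S c_S * prod_{i in S} x_i on {-1,1}^k."""
    points = list(product((-1, 1), repeat=k))
    coeffs = {}
    for degree in range(k + 1):
        for S in combinations(range(k), degree):
            # Obtained by expanding sum_y f(y) * delta(y, x) into monomials.
            coeffs[S] = sum(f(y) * prod(y[i] for i in S) for y in points) / 2 ** k
    return coeffs


def evaluate(coeffs: Dict[Tuple[int, ...], float], x: Sequence[int]) -> float:
    """Evaluate the monomial expansion at x."""
    return sum(c * prod(x[i] for i in S) for S, c in coeffs.items())


# Assumed stand-in for a sigmoidal hidden unit with fan-in 2.
def hidden_unit(x: Sequence[int]) -> float:
    return tanh(0.5 * x[0] - 1.2 * x[1] + 0.3)


coeffs = monomial_coefficients(hidden_unit, k=2)
max_degree = max(len(S) for S in coeffs)
max_error = max(abs(evaluate(coeffs, x) - hidden_unit(x))
                for x in product((-1, 1), repeat=2))
print(max_degree, max_error)
```

The expansion uses no monomial of degree above the unit's fan-in, which is the point of the theorem: replacing each hidden unit by its monomial expansion cannot raise the order of the network.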
Clearly, the HFF