A general framework for supervised learning


Journal of Economic Dynamics and Control 18 (1994) 97-118. North-Holland

A general framework for supervised learning: Probably almost Bayesian algorithms

L. Bochereau, P. Bourgine, and G. Deffuant

CEMAGREF, 92185 Antony, France

Received February 1992, final version received October 1992

The paper proposes the concept of Probably Almost Bayesian (PAB) algorithms, generalizing a definition given by Valiant in his theory of the learnable. PAB algorithms are defined as algorithms that probably approximate the Bayesian optimum when the training set size tends to infinity, in polynomial time with respect to the training set size. We present this concept within the framework of decision theory and support this definition by giving examples of such algorithms, particularly in the field of artificial neural networks.

1. Introduction

The main application fields studied in this paper are classification and function approximation from examples. They are obviously closely related to some areas of economics such as model derivation from data or decision theory. These research domains have been extensively studied in classical statistics and data analysis frameworks. More recently, connectionist models have received a lot of attention and have proved to provide adequate solutions. The objectives of this paper are twofold. We first introduce a general framework for supervised learning which is appropriate for all classification or function approximation methods. This framework proposes a theoretical procedure for deriving a solution that consists of exploring a lattice of methods' spaces while allowing combinations of such methods. Computational constraints, however, are not taken into account in this procedure. Therefore, a second part of the paper is devoted to characterizing a class of algorithms, PAB algorithms, that lead to an adequate solution. The computational requirements included in the characterization of PAB algorithms state that such algorithms

Correspondence to: L. Bochereau, CEMAGREF, BP 121, 92185 Antony, France.

0165-1889/94/$06.00 © 1994 Elsevier Science Publishers B.V. All rights reserved


probably approximate the Bayesian optimum (when the training set size tends to infinity) in polynomial time (with respect to the training set size). These conditions lead to a simplification of the general theoretical procedure and a restriction of the search space. However, the space must remain rich enough to allow the approximation of interesting classes of Bayesian optima. This problem is rather difficult, but we explain the main steps to solve it in two different cases: function approximation by orthogonal polynomials and classification by perceptron membranes (algorithms based on a new connectionist architecture). We first present the general framework for supervised learning (section 2). This framework then allows us to define the concept of Probably Almost Bayesian (PAB) algorithms (section 3). The PAB property is then studied for two particular algorithms: decomposition with orthogonal polynomials (section 4) and perceptron membranes (section 5).

2. Supervised learning

2.1. Classical approach

Supervised learning can be viewed as constructing a model relating two vectors of variables, noted x and y, given a limited number of experiments. This approach is therefore related to statistical inference, which aims at estimating the parameters of a model whose analytical form is chosen beforehand. Such a model can be written as

y = f(x, θ) + ε,

where f(x, θ) represents a function from X × Θ to Y, Θ represents the space of unknown parameters, and ε corresponds to the model error vector. We will not make any assumptions about the analytical form of f, and we will replace f(x, θ) by f(x) to simplify the notation. Vectors x and y belong to finite-dimensional spaces X and Y, and X × Y follows a probability distribution μ(x, y). To each function f can be associated a criterion C(f) related to the model quality:

C(f) = ∫_X ∫_Y R[f(x), y] μ(x, y) dx dy,

where R is a function from Y × Y into R⁺. When R(y′, y″) = ‖y′ − y″‖², C(f) corresponds to the quadratic error often used in practice. In addition, it is often useful to introduce a risk or cost function related to the estimation quality. In this case, R(y′, y″) corresponds to the cost of producing a prediction y′ when y″ is the correct answer.


According to decision theory, C(f) can be called the Bayes risk associated with the function f. C(f) is also defined as the expectation of the risk associated with f:

C(f) = E[R(f(x), y)].

If F is the search space for the functions f, one can then state that supervised learning aims at finding a function f* minimizing the criterion C(f) on F, that is,

C(f*) = min_{f ∈ F} C(f).

In practice, it is impossible to calculate C(f) exactly. One is usually given a set of N observation pairs (xⁿ, yⁿ), n = 1, …, N, obtained according to the probability measure μ. In the following, these observation pairs will also be called examples. It will then be possible to estimate the value of C(f) with the available sample, by distinguishing the examples used for constructing the model from the examples used for validating such a model. The N examples will then be divided into two groups: the learning base K, used to estimate the values of the unknown parameters, and the generalisation base K′, used to estimate the model's prediction ability. We can then estimate C(f) by calculating C(f, K′):

C(f, K′) = (1/|K′|) Σ_{(xⁿ, yⁿ) ∈ K′} R[f(xⁿ), yⁿ].   (1)

Under reasonable conditions, when the number of examples, card(K′), increases, C(f, K′) converges towards C(f). However, when the number of examples is limited, several types of errors may occur and bias the choice of the function f. These errors are discussed in the next section.
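The split-and-estimate procedure can be sketched numerically. The sketch below, in Python, assumes a made-up data-generating process (y = sin x plus Gaussian noise) and the quadratic cost R; it constructs a model on the learning base K and estimates C(f) on the generalisation base K′:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: y = sin(x) + Gaussian noise.
N = 1000
x = rng.uniform(-3, 3, N)
y = np.sin(x) + rng.normal(0, 0.1, N)

# Split the N examples into a learning base K and a generalisation base K'.
xK, yK = x[:N // 2], y[:N // 2]
xKp, yKp = x[N // 2:], y[N // 2:]

def empirical_risk(f, xs, ys):
    """C(f, K') = (1/|K'|) * sum over K' of R[f(x^n), y^n], with quadratic R."""
    return float(np.mean((f(xs) - ys) ** 2))

# Construct the model on K only ...
coeffs = np.polyfit(xK, yK, deg=3)
f = lambda xs: np.polyval(coeffs, xs)

# ... and estimate its prediction ability on K'.
risk = empirical_risk(f, xKp, yKp)
print(risk)
```

The estimate on K′ is unbiased precisely because those examples played no role in fitting the coefficients.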

2.2. Errors encountered in supervised learning

All measurement errors made when observing x and y are taken into account by the error term ε. Such errors contribute to increasing the variance of the observed variables but do not introduce any important bias in the choice of the function f. Nevertheless, three other types of errors influence the choice of the solution.

2.2.1. Sampling error

This error is induced by the approximation of the integral over X × Y by a finite summation calculated on a finite number of observations. If we assume


that the terms R[f(xⁿ), yⁿ], for 1 ≤ n ≤ card(K′), are bounded and independent random variables, the Hoeffding (1963) inequality gives a bound for this sampling error: ∀ε > 0,

Pr{|C(f) − C(f, K′)| > ε} ≤ 2e^{−2ε² card(K′)}.

This allows us to conclude that, in order to make the sampling error negligible, the number of examples must grow at least as fast as ε⁻².
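To make the ε⁻² growth concrete, one can invert the bound. The one-liner below (a sketch; the function name is ours) assumes the costs R are bounded in [0, 1] and computes the smallest card(K′) guaranteeing the bound is at most a confidence level η:

```python
import math

def examples_needed(eps, eta):
    """Smallest card(K') such that 2*exp(-2*eps**2*card(K')) <= eta,
    assuming the costs R[f(x), y] are bounded in [0, 1]."""
    return math.ceil(math.log(2.0 / eta) / (2.0 * eps ** 2))

# Dividing eps by 10 multiplies the required number of examples by 100.
for eps in (0.1, 0.05, 0.01):
    print(eps, examples_needed(eps, 0.05))
```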

2.2.2. Error related to the choice of F

The limited number of examples leads to a restriction of the space F of admissible functions f. This can easily be understood when the vector y consists of a single variable y and when we only have N examples; in this case, it will not be possible to fit a polynomial of degree N, since there is then an infinite number of solutions. The space of admissible functions can be derived by restricting the initial functional space, keeping only the functional subspaces whose number of degrees of freedom is less than or equal to the number of available examples in the learning base. One can easily understand that this restriction of the initial functional space introduces another error, since the function to approximate may lie outside the search space.

2.2.3. Error related to the optimization procedure

For some classes of functions f(x, θ), there are procedures that allow one to calculate the exact values of the parameters optimizing the criterion C(f, K′). This is the case, for instance, with polynomial models, since their coefficients can be calculated by solving a system of linear equations. For other functions f, such an algorithm is not available. For example, connectionist multilayer networks are trained with learning algorithms, such as gradient backpropagation, that consist of a gradient descent toward a solution that is often only a local minimum of C(f, K′) in parameter space.

2.3. A general framework for supervised learning

The preceding remarks lead us to reformulate the supervised learning problem. The formulation proposed in this section is valid whatever the analytical form chosen for the function f [connectionist networks [Hinton (1989)], decision trees [Breiman (1984)], polynomial decomposition, …]. In all these learning methods, we can define the number of degrees of freedom of a function f(x, θ) as the number of free parameters θ of this function. Thus, the space of


[Fig. 1. Lattice of functional spaces (F). Neural network (1, 1, 1) represents the set of neural nets with one input unit, one hidden unit, and one output unit.]

polynomials with one variable of degree less than or equal to n has a number of degrees of freedom equal to n + 1; similarly, the space of connectionist networks completely connected between layers, with p input units, q hidden units, and r output units, has a number of degrees of freedom equal to q(p + r + 1) + r; the space of binary decision trees with n leaves has a number of degrees of freedom equal to n. Thus, with each learning method we can associate a denumerable lattice of functional spaces F^i, each with its characteristic number of degrees of freedom, noted n(F^i). The number of known methods is finite, and we can group these different lattices: we define a denumerable lattice of functional spaces F = {F^i}, such that all the links between nodes F^i correspond to inclusion relationships and a decreasing n(F^i). For example, one of the leaves of the lattice corresponds to the space of all constant functions from X into Y, and the root node corresponds to all continuous and differentiable functions from X into Y. This lattice is illustrated by fig. 1. Let us assume that we have N examples (xⁿ, yⁿ)_{1 ≤ n ≤ N}. The procedure consists of searching, for each space F^i, the function f_i* that minimizes the criterion C(f_i, K) on



[Fig. 2. Evolution of criteria C(f_i*, K) and C(f_i*, K′) with n(F^i).]

the learning base, and then to select, among these functions f_i*, the one that minimizes the criterion C(f_i*, K′) calculated on the generalization base. One therefore has to perform a double minimization, represented by the following expression:

min_{F^i ∈ F} { C(f_i*, K′) : f_i* = argmin_{f_i ∈ F^i} C(f_i, K) }.

We can easily notice that a finite ordered sequence of embedded functional spaces (F^i)_{1 ≤ i ≤ p} is associated with each branch of the lattice; for j > i, the solution f_i* in F^i also belongs to F^j. If we neglect the sampling and convergence errors, we can then conclude that C(f_j*, K) will be less than or equal to C(f_i*, K). However, we do not have any a priori information about the comparison between C(f_i*, K′) and C(f_j*, K′). Fig. 2 shows that the criterion C(f_i*, K) must decrease when n(F^i) increases; the opposite case can only be explained by an error of the optimization procedure. In practice, however, the quantity C(f_i*, K′) usually begins by decreasing, reaches a minimum value, and then increases again. Thus, for a given branch of the lattice, the minimal value of the criterion C(f_i*, K′) allows the selection of the most appropriate model. It is important to notice that the type, continuous or discrete, of the vector y leads to different types of problems. When y is a continuous vector, supervised learning becomes a problem of function approximation; when y is a discrete vector, supervised learning becomes a classification problem. The double minimization in the lattice of fig. 1 appears to be the ideal theoretical procedure; however, it does not take the computational constraints into account. In practice, this procedure must be restricted to approximation algorithms that are polynomial in the size of K. We focus on such algorithms in the next sections.
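Along a single branch of the lattice, the double minimization can be sketched with nested polynomial spaces of increasing degree. The data-generating process below is hypothetical (a noisy cubic):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical target: a degree-3 polynomial observed with noise.
x = rng.uniform(-1, 1, 400)
y = x**3 - 0.5 * x + rng.normal(0, 0.05, 400)

# Learning base K and generalisation base K'.
xK, yK, xKp, yKp = x[:200], y[:200], x[200:], y[200:]

def C(f, xs, ys):
    return float(np.mean((f(xs) - ys) ** 2))

# One branch of the lattice: polynomial spaces F^i of increasing degree.
results = []
for deg in range(0, 10):
    # Inner minimization: f_i* = argmin over F^i of C(f_i, K).
    coeffs = np.polyfit(xK, yK, deg)
    f = lambda xs, c=coeffs: np.polyval(c, xs)
    results.append((C(f, xKp, yKp), deg, f))

# Outer minimization: keep the space whose f_i* minimizes C(f_i*, K').
best_risk, best_deg, best_f = min(results)
print(best_deg, best_risk)
```

As in fig. 2, the validation criterion C(f_i*, K′) decreases, reaches a minimum near the true degree, and then rises again as the spaces start fitting noise.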


3. Definition of probably almost Bayesian algorithms

3.1. Probably almost correct algorithms

Valiant (1984) has developed a learning theory for boolean functions defined on a finite set of boolean variables. His work aims at determining classes of functions learnable in a polynomial number of steps relative to a query protocol, which uses the function as an oracle to which queries are made. Referring to this framework, Baum and Haussler (1989) have studied the sample complexity necessary for multilayer networks to guarantee good generalization; the study is founded on the work of Vapnik and Chervonenkis (1981). Kanaya (1990) proposed to extend Valiant's framework to statistical pattern recognition. In this context, the mapping associating each input vector x with an output vector y is not deterministic; the conditional probabilities P(y_i | x) are therefore not always equal to 0 or 1. This framework can be further extended to the case of real outputs. This leads to the Probably Almost Bayesian framework, described in the next paragraph.

3.2. Probably almost Bayesian algorithms

We now return to the setting of section 2. We defined earlier the Bayesian risk associated with a function f as

C(f) = E[R(f(x), y)] = ∫ R(f(x), y) μ(x, y) dx dy.

Let C̄ be the Bayesian optimum, i.e.,

C̄ = min_f C(f).   (2)

Definition (PAB algorithms). Let X be of dimension d and Y of dimension d′. Given a probability density μ on X × Y, an algorithm A is said to be PAB for μ if, for all scalars ε, η > 0, there exists m(ε, η) > 0 such that, for every random sample K of size ≥ m drawn from the probability distribution μ, it provides from K, in polynomial time with respect to m, a function f such that, for any sample K′ chosen with probability density μ,

Pr[(C(f) − C̄) < ε] > 1 − η.

Notice that this definition remains valid whether the variables to predict are discrete or continuous. It is important to have in the lattice a variety of PAB algorithms for large classes of probability distributions. In sections 4 and 5, the PAB property is


studied for two different models: approximation by orthogonal polynomials and perceptron membranes.

4. Approximation by orthogonal polynomials

In the following, the cost function R will be the squared norm:

R(y′, y″) = ‖y′ − y″‖².

C(f) is then the quadratic error often used in practice. In this case, the Bayesian optimum can be precisely characterized by introducing E[y | x], the conditional mean of y given x:

C(f) = E[‖f(x) − y‖²] = E[‖f(x) − E[y | x]‖²] + E[‖E[y | x] − y‖²].   (3)

The second term is independent of f and constant.¹ C(f) is thus minimal when the first term is nil, i.e., f = E[y | x]. This leads to the following theorem:

Theorem. If the cost function is the quadratic error,

(i) the Bayesian solution is f* = E[y | x],
(ii) the Bayesian error is C(f*) = E[‖E[y | x] − y‖²],
(iii) C(f) − C(f*) = E[‖f(x) − f*(x)‖²].   (4)

In this section, we consider connectionist networks based on orthogonalised bases of functions [Qian (1990)]. We first summarize the mathematical results on orthogonalised bases of functions. Then we define a POB algorithm (Projection on an Orthonormal Basis), which is PAB for a very large class of probability densities and of functions representing the expectation of y when x is known. It is important to notice that these functions are not necessarily continuous. In order to simplify the notation, we consider real functions from R to R, but what follows generalizes to real functions from Rᵐ to Rⁿ.

¹ Thus, the minimization of the quadratic error is the same operation as the minimization of the quadratic distance to the conditional mean, as recalled by Gish (1990).


4.1. Problem formulation

We consider a positive measure μ on a Borel subset Ω of R and the Hermitian product of two real functions:

(f, g) = ∫_Ω f(x) g(x) dμ(x).

We consider the space L²(μ) of functions f: Ω → R such that (f, f) < +∞ (i.e., f has bounded norm). With this Hermitian product, L²(μ) is a Hilbert space. If Ω is compact, or if μ decreases exponentially at ±∞,² the polynomials {xⁿ} constitute a Hilbertian basis for this space. By an orthonormalisation procedure, there is also a basis of orthonormal polynomials {Pₙ(x)}, constructible in polynomial time (n²) by the following recurrence formula:

Pₙ₊₁(x) = (aₙ x + bₙ) Pₙ(x) + cₙ Pₙ₋₁(x),

where aₙ, bₙ, cₙ are determined by

(Pₙ₊₁, Pₙ) = 0,  (Pₙ₊₁, Pₙ₋₁) = 0,  (Pₙ₊₁, Pₙ₊₁) = 1.

In this orthonormal basis, a function f of L²(μ) can be expressed as

f = Σ_{i=0}^{+∞} aᵢ Pᵢ,  where  aᵢ = (f, Pᵢ).

We now consider the problem of approximating a function from a large set of examples drawn from the probability density μ, when the number of examples tends to infinity.

4.2. Results

We define the POB (Projection on an Orthonormal Basis) algorithm before presenting our main result.

General POB algorithm. Let K be a set of couples {(x_k, y_k)}, where the x_k are drawn from the distribution μ in Ω.

² μ is dominated by a function x ↦ e^{−α|x|} with α > 0 when x → ±∞.


(i) Construct the orthonormal basis {Pᵢ}, for i from 0 to n < |K|, from the empirical moments

Mᵢ = Σ_{k=1}^{|K|} x_k^i / |K|.

(ii) Calculate, for i from 0 to n, the coefficients

bᵢ = Σ_{k=1}^{|K|} y_k Pᵢ(x_k) / |K|.

(iii) Define the approximation f by

f = Σ_{i=0}^{n} bᵢ Pᵢ(x).

The computational cost when |K| → +∞ is O(n·|K|) for (i) and O(n·|K|) for (ii).

Theorem. A POB algorithm is PAB if (i) μ has a compact support or is fast decreasing at ±∞ and (ii) E[y | x] belongs to L²(μ).
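A numerical sketch of POB follows. For brevity it builds the orthonormal basis by Gram–Schmidt with respect to the empirical scalar product rather than by the moment recurrence of the text (the two agree up to numerical details); the data-generating process is our own toy example, with E[y | x] = cos(2x):

```python
import numpy as np

rng = np.random.default_rng(2)

# Examples (x_k, y_k): x drawn from a compactly supported mu, y noisy.
K = 2000
x = rng.uniform(-1, 1, K)
y = np.cos(2 * x) + rng.normal(0, 0.1, K)   # E[y|x] = cos(2x), in L2(mu)

n = 8  # degree of the approximation space

# (i) Orthonormalise 1, x, ..., x^n w.r.t. the empirical scalar product
#     <f, g> = (1/|K|) * sum f(x_k) g(x_k).
V = np.vander(x, n + 1, increasing=True)    # columns: x^0 ... x^n at the x_k
P = np.empty_like(V)
for i in range(n + 1):
    v = V[:, i].copy()
    for j in range(i):
        v -= np.mean(v * P[:, j]) * P[:, j]
    P[:, i] = v / np.sqrt(np.mean(v ** 2))

# (ii) Empirical coefficients b_i = (1/|K|) * sum y_k P_i(x_k).
b = P.T @ y / K

# (iii) The POB approximation f = sum b_i P_i, evaluated at the sample.
f_hat = P @ b

# With |K| large, f_hat should be close to the conditional mean cos(2x).
mse = float(np.mean((f_hat - np.cos(2 * x)) ** 2))
print(mse)
```

The noise in y averages out in the coefficients bᵢ, so the approximation converges to the conditional mean rather than to the noisy observations, exactly as the theorem requires.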

Proof. Let {Pᵢ} be the orthonormal basis of the Hilbert space L²(μ) defined with the μ-scalar product, which exists under hypothesis (i). Let f* be the conditional mean E[y | x], which is a linear combination of the orthonormal basis under hypothesis (ii), and let fₙ* be the orthogonal projection of f* on (Pᵢ)₀ ≤ i ≤ n:

f* = Σ_{i=0}^{+∞} aᵢ Pᵢ,  where  aᵢ = (f*, Pᵢ),

fₙ* = Σ_{i=0}^{n} aᵢ Pᵢ.

Lemma. Let fₙ be the polynomial function of degree n minimizing C(f) on the base of examples K. The norm ‖fₙ − fₙ*‖ converges to zero in probability when |K| → +∞.

[See Bourgine et al. (1992) for the proof.]

Following (4), we can write

C(fₙ) − C(f*) = ‖f* − fₙ‖² = ‖f* − fₙ*‖² + ‖fₙ* − fₙ‖².


The second equality comes from the orthogonality between f* − fₙ* and fₙ* − fₙ: the first is a linear combination of {Pᵢ, i > n} and the second of {Pᵢ, i ≤ n}. Now:

∀ε > 0, ∃n such that ‖f* − fₙ*‖² < ε/2  (property of a Hilbertian space),

∀ε > 0, ∀η > 0, ∃m(ε, η) such that |K| > m ⇒ Prob(‖fₙ* − fₙ‖² < ε/2) > 1 − η  (cf. the Lemma).

We can then deduce:

∀ε > 0, ∀η > 0, ∃m(ε, η) such that |K| > m ⇒ Prob(C(fₙ) − C(f*) < ε) > 1 − η.

This is the desired PAB property, recalling that the computational cost is polynomial in |K|. The proof can be generalized to polynomials of p variables x₁, …, x_p and to functions from Rᵖ to Rⁿ.

5. Perceptron membranes

In the standard connectionist model, backpropagation in multilayer networks [Le Cun (1985), Rumelhart et al. (1986)] is obviously not a PAB algorithm for large classes of probability distributions. First of all, there is no convergence proof for this algorithm, and local minima are a well-known problem for it. Furthermore, one has to find the simplest architecture allowing convergence, in order to maximize the probability of good generalization [Baum and Haussler (1989), Baum (1990)]. In the backpropagation algorithm, this search for the best architecture is performed empirically by the user. In order to achieve the PAB property for large classes of probability distributions, a connectionist model must therefore optimize the network architecture (i.e., find the best compromise between the network's simplicity and its performance on the learning examples). The implementation of structural transformation procedures is therefore necessary in such a model. Perceptron membranes are a new connectionist model allowing the implementation of structural transformations during the learning process. However, perceptron membranes are deeply different from classical multilayer networks because they have to be considered as geometric objects: piecewise linear surfaces. In this model, perceptrons [Rosenblatt (1962), Minsky and Papert (1969)] are used as geometric components, which define adaptive linear separators. These components are organized in subsets defining convex polyhedrons. The union of several such polyhedrons allows us to define any piecewise linear surface separating the input space into two parts.


The piecewise linear surface (the membrane) is submitted to the influence of the learning examples, which tend to stabilize it at the frontier between the two classes. This is achieved thanks to a geometric credit assignment, described in detail below. The important point is that interactions between learning examples and facets are established through a geometric procedure. The geometric interpretation of the model also allows us to design procedures for structural transformations of the membrane: new convex polyhedrons and new facets can be added, facets can be removed, and convex polyhedrons can be linked. This allows us to optimize the membrane structure. First, the membrane structure and the perturbations coming from its environment are described. Then the membrane development algorithm is presented. Finally, the PAB property of the model is discussed.

5.1. Membrane description

We focus on a two-class classification problem. We therefore consider a training set of N d-dimensional vectors (xᵢ)_{1 ≤ i ≤ N}.
5.1.1. Mathematical definition of the membrane

The membrane M is defined by a union of n convex polyhedrons, each of them being defined by the intersection of several half-spaces. Thanks to these polyhedrons, the membrane divides the input space into two parts, internal and external. If a point of the space is strictly inside one of these polyhedrons, then it is in the internal part of the space, called I(M). More precisely, the mathematical definition of I(M) involves three levels:

(a) Half-space (or perceptron) level. At this level, we take into account sets H of the following type:

x ∈ H ⇔ w·x + b > 0,

where '·' is the scalar product, w a d-dimensional vector, and b a scalar; w will be the weights and b the bias of a perceptron unit.

(b) Convex level. We now consider sets C which are intersections of several half-spaces (convex polyhedrons of Rᵈ):

C = ∩_{i=1}^{k} Hᵢ,

where the sets Hᵢ are half-spaces.


]["1 ~. s

"1

"'

"

Fig. 3. The membrane level I(M) is definedwith three convex polyhedrons. (c) Membrane level. This is the last level, which allows to define completely I(M). I(M) is the union of the convex polyhedrons defined at the previous level. I(M) is defined by a list of perceptron lists; for instance, the membrane of fig. 3 is noted

I(M) = {(nx, n 2 , n 3 , H4, H5), (I1), (Jx, J2,

Ja)}



The membrane is defined as the boundary of I(M) (it is represented in bold lines on fig. 3). The part included in the membrane of each hyperplane is called the active part of the hyperplane (or of the perceptron). The membrane is therefore the union of all perceptron active parts. Such a model can be used for a two-class classification problem; for instance, I(M) must contain the 1-class examples and 0-class examples must remain in the extrenal part of the space. The membrane is then the separating surface between the two classes. Since the membrane is mathematically defined, the question of its adaptivity to the training examples can be addressed.
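The three-level definition translates directly into code. The sketch below (our own toy example, not from the paper) represents a membrane as a list of convexes, each convex as a list of (w, b) pairs, and tests membership in I(M); the two unit squares echo the "double square" configuration of fig. 5:

```python
import numpy as np

# (a) A perceptron/half-space H: x in H  <=>  w.x + b > 0.
def in_half_space(x, w, b):
    return np.dot(w, x) + b > 0

# (b) A convex C is an intersection of half-spaces.
def in_convex(x, convex):
    return all(in_half_space(x, w, b) for (w, b) in convex)

# (c) I(M) is a union of convexes: a list of perceptron lists.
def in_membrane_interior(x, membrane):
    return any(in_convex(x, convex) for convex in membrane)

# Hypothetical 2-D membrane: the union of two axis-aligned unit squares,
# [0,1]x[0,1] and [2,3]x[0,1], each written as four half-spaces.
square = lambda x0, y0: [
    (np.array([ 1.0,  0.0]), -x0),        #  x > x0
    (np.array([-1.0,  0.0]),  x0 + 1.0),  #  x < x0 + 1
    (np.array([ 0.0,  1.0]), -y0),        #  y > y0
    (np.array([ 0.0, -1.0]),  y0 + 1.0),  #  y < y0 + 1
]
M = [square(0.0, 0.0), square(2.0, 0.0)]

print(in_membrane_interior(np.array([0.5, 0.5]), M))   # inside first square
print(in_membrane_interior(np.array([1.5, 0.5]), M))   # between the squares
```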

5.1.2. Membrane adaptivity

The membrane includes several perceptrons, and the well-known 'credit assignment problem' therefore arises: how should the responsibility for an error be assigned among the different perceptrons? In a multilayer network, the derivative composition rule allows one to perform a gradient descent through the parameter space, which is the principle of the backpropagation algorithm [Le Cun (1985), Rumelhart et al. (1986), Lapedes and Farber (1987)]. The perceptron membrane gives a new solution to this problem. In this approach, the network is considered as a geometric surface (a membrane) in the input space. The membrane facets are continuously attracted or repulsed by the training examples thanks to the following procedure (cf. fig. 4):


[Fig. 4. The geometrical credit assignment. Example A perturbs perceptrons I₂ and H₁. Example B perturbs perceptrons H₁, H₂, H₃, and not I₁, because the perturbation intersects the membrane.]

(i) Choose a training example E at random.
(ii) Project this example orthogonally on each hyperplane defining the membrane.
(iii) If the orthogonal projection P is located in the active part of hyperplane H, and if the segment EP does not intersect the membrane, then hyperplane H is attracted or repulsed by E, so that the membrane tends to 'eat' 1-class examples or to 'reject' 0-class examples.

The repulsion or attraction on the hyperplanes is derived from the delta rule [Rumelhart et al. (1986)] applied to one perceptron cell alone. In the membrane model, each perceptron can therefore be considered independently from the others; it learns only the examples which 'strike' it. This method is much more direct than backpropagation through the layers of a network. However, this adaptivity is not sufficient for the PAB requirement, because the algorithm must perform a search in a lattice of functions with different numbers of degrees of freedom. Under the perturbations coming from its environment, the membrane performs this exploration by periodic structural transformations.

5.2. Structural transformations

5.2.1. Initialization of the membrane: creation of the first convex

First of all, a class (0 or 1) is chosen to be the inside class of the membrane. This choice can be made randomly or by taking, for instance, the minority class in the training set. Let the inside class of the membrane be class 1. Then, an example of class 1 which has more than half of its k nearest neighbors of class 1 is searched for in the training set. If none is found, the membrane development is not possible and k must be decreased in order to allow the development.
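Steps (i)–(iii) can be sketched for a single facet. The code below is a sketch with hypothetical names; the membrane-intersection test of step (iii) is omitted for brevity. It shows the orthogonal projection and a standard single-perceptron delta-rule attraction step:

```python
import numpy as np

def project_on_hyperplane(x, w, b):
    """Orthogonal projection of x on the hyperplane w.x + b = 0 (step (ii))."""
    return x - (np.dot(w, x) + b) / np.dot(w, w) * w

def delta_rule_step(w, b, x, label, lr=0.1):
    """Attract or repulse the hyperplane so that 1-class examples end up on
    the positive side (w.x + b > 0) and 0-class examples on the negative
    side: a single-perceptron delta-rule update (step (iii), sketch)."""
    target = 1.0 if label == 1 else -1.0
    if target * (np.dot(w, x) + b) <= 0:   # example currently misclassified
        w = w + lr * target * x
        b = b + lr * target
    return w, b

w, b = np.array([1.0, 0.0]), 0.0       # hyperplane x1 = 0
x = np.array([-0.2, 0.5])              # a 1-class example on the wrong side

p = project_on_hyperplane(x, w, b)     # where the example 'strikes' the facet
margin_before = np.dot(w, x) + b
w, b = delta_rule_step(w, b, x, label=1)
margin_after = np.dot(w, x) + b
print(p, margin_before, margin_after)
```

Each update moves the facet so that the striking example's margin increases, i.e., the membrane locally moves toward 'eating' the 1-class example.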


Suppose now that example x satisfies the previous condition. A polyhedral set (subsequently called a convex) C which only contains 1-class examples is built around x by the following method: initialize C with the median hyperplane between x and its nearest neighbor of class 0, such that x is on the positive side of the hyperplane; then iterate the construction by adding the median hyperplane between x and its nearest neighbor of class 0 inside C, such that x is on the positive side of the hyperplane. The construction ends when C only contains examples of class 1. The convex built according to this method initializes the membrane, which is then submitted to the perturbations of the environment.

5.2.2. Recruitment

New convex polyhedrons can be added to the membrane definition thanks to the previous polyhedron construction method. Moreover, convex holes can also be dug into the interior I(M) when the convex polyhedron is built around a 0-class example which lies in the inside part of the space.

5.2.3. Linkage of convex polyhedrons

This procedure simplifies the membrane by eliminating its useless 'bumps'. This is done by 'linking' pairs of convexes: two convexes are said to be linked when the same perceptron (hyperplane) is present in both their definitions. The procedure considers all pairs of convexes which intersect, and tests whether sharing a perceptron improves, or at least does not deteriorate, the membrane efficiency on the training set. If this test is positive, the perceptron is shared by both convexes. Furthermore, the removal of useless 'bumps' is very important for improving the generalization ability.

5.2.4. Perceptron duplication

Periodically, the perceptrons which receive more than a threshold number of perturbations corresponding to misclassified examples are duplicated. A copy of the duplicated perceptron is made and its parameters are slightly modified at random. This allows us to create a new concave or convex (the choice is made randomly) 'bump' in the membrane.

5.2.5. Elimination

As explained above, each perceptron of the membrane can be considered as operating independently on a training set determined by the geometric credit assignment. This training set is approximately constant during one training pass, because the modifications of the membrane are rather small during this period. Therefore, the relevance for generalization of a perceptron on this training set


can be evaluated according to the result of Ehrenfeucht et al. (1988) [this theorem is quoted in Baum and Haussler (1989)]. The result gives a lower bound on the number of examples necessary for valid generalization, using the Vapnik and Chervonenkis (1981) dimension of a class of functions. Using the Vapnik–Chervonenkis dimension of half-spaces, one can deduce that a perceptron submitted to fewer than d/4 perturbations (d being the space dimension) must be removed from the structure.

5.2.6. The development algorithm

The algorithm stops when the convex polyhedron recruitment has failed p times (p being an integer). The higher p is, the lower the probability of finding an example that has more than half of its k nearest neighbors in the same class and is also misclassified, but the longer the learning takes. Let Cs be this stopping criterion; the development algorithm is then the following (P_r, P_d, P_e, P_l being integers):

While (Cs false) do:
  every P_r training passes: recruitment,
  every P_d training passes: perceptron duplication,
  every P_e training passes: elimination,
  every P_l training passes: linkage of convex polyhedrons.

The efficiency of the exploration depends mainly on the periodicity parameters. Figs. 5, 6, and 7 illustrate membrane developments.
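The scheduling of the loop can be sketched as follows. The transformation procedures are replaced by hypothetical no-op stubs, and a random recruitment failure stands in for the criterion Cs; only the periodic firing logic is faithful to the text:

```python
import random

random.seed(0)

# Hypothetical stubs standing in for the procedures of section 5.2.
def training_pass(membrane, examples): pass          # geometric credit assignment
def recruit(membrane, examples):
    return random.random() < 0.5                     # may fail to find a seed example
def duplicate_perceptrons(membrane): pass            # split overloaded facets
def eliminate_perceptrons(membrane): pass            # drop facets with < d/4 hits
def link_convexes(membrane): pass                    # share facets, remove bumps

def develop(membrane, examples, Pr=10, Pd=15, Pe=20, Pl=25,
            p=3, max_passes=1000):
    """Fire each structural transformation with its own periodicity and
    stop once recruitment has failed p times (the stopping criterion Cs)."""
    failed, t = 0, 0
    while failed < p and t < max_passes:
        t += 1
        training_pass(membrane, examples)
        if t % Pr == 0 and not recruit(membrane, examples):
            failed += 1
        if t % Pd == 0:
            duplicate_perceptrons(membrane)
        if t % Pe == 0:
            eliminate_perceptrons(membrane)
        if t % Pl == 0:
            link_convexes(membrane)
    return t

passes = develop(membrane=[], examples=[])
print(passes)
```

Choosing P_r, P_d, P_e, P_l amounts to setting how fast the algorithm moves through the lattice of structures relative to how long each structure is trained.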

5.3. PAB property discussion

We argue that the direction given by perceptron membranes is of high interest for implementing PAB connectionist networks for large classes of probability distributions. First of all, such connectionist networks need to have very important structural adaptation capacities. This has been well understood by many authors, who propose growing networks [Mézard and Nadal (1989), Marchand and Rujan (1989), Fahlman and Lebiere (1990), Frean (1990), Deffuant (1990)] or algorithms for removing or decaying units [Scalettar and Zee (1988), Sietsma and Dow (1988)]. However, it is now well known that the network generalization ability is closely related to the quality of a compromise between the network's simplicity and its results on the learning examples [Baum (1990), Baum and Haussler (1989)]. None of these algorithms allows one to concurrently perform component recruitment and elimination in order to achieve this compromise, as perceptron membranes do. Moreover, the theoretical approximating power of perceptron membranes is very high.

L. Bochereau et al., Probably almost Bayesian algorithms

Fig. 5. Example of membrane development for the double square problem with noisy data (noise 10%) in dimension 2. The membrane is pictured every two training passes.


Fig. 6. Example of membrane development for the double spiral problem. The membrane is pictured every training pass. The final membrane includes 13 convex polyhedrons and 23 perceptrons.


Fig. 7. Example of membrane development for the double square problem with noise-free data in dimension 3. The membrane is pictured every four training passes.


Theorem. For any 0 < ε < 1 and any compact and continuously differentiable surface S separating a d-dimensional space into two parts, there exists a membrane M in an ε-neighborhood of S for the standard topology on sets, involving a number of perceptrons growing polynomially with 1/ε.

The detailed proof of this theorem is given in Deffuant (1992). It uses the compactness property to exhibit a minimum surface of the membrane's facets in an ε-neighborhood of S, which yields an upper bound on the number of perceptrons involved in the membrane. It is important to notice that this complexity property has not been demonstrated for classical multilayer networks.

This result allows us to study the learning base size necessary for good generalization. The Vapnik and Chervonenkis (1981) framework gives the following theorem:

Theorem. Let D be a probability distribution of two-class d-dimensional examples such that the Bayesian separating surface is compact, continuous, and differentiable, and let C* be the corresponding Bayesian risk. For any 0 < ε < 1, 0 < δ < 1, a perceptron membrane M with risk C(M) such that Prob[(C(M) - C*) < ε] > 1 - δ can be obtained by the previously described algorithm from a learning sample growing polynomially with 1/ε and 1/δ.

The detailed proof of this theorem is also given in Deffuant (1992). It mainly uses the fact that the Vapnik-Chervonenkis dimension of a perceptron membrane is at most the product of the number of perceptrons involved in the membrane and the space dimension. Combined with the previous result, the Vapnik and Chervonenkis (1981) theorem completes the proof. Note that this theorem establishes a theoretical possibility: the actual convergence of the algorithm has not been demonstrated yet. It can be hoped that improvements of the perceptron membranes' adaptivity properties will lead to this convergence.
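The sample-size statement can be made concrete with one standard Vapnik-Chervonenkis bound. The sketch below is ours, not the paper's: it combines the membrane bound VCdim <= (number of perceptrons) x (space dimension) quoted above with a commonly used sufficient-sample-size formula (the exact constants vary across the literature and are an assumption here):

```python
import math

def membrane_sample_bound(n_perceptrons, space_dim, eps, delta):
    """Sufficient sample size for (eps, delta)-learning, using the membrane
    bound vc <= n_perceptrons * space_dim and the standard form
    m = max((4/eps) * log2(2/delta), (8*vc/eps) * log2(13/eps))."""
    vc = n_perceptrons * space_dim
    return math.ceil(max((4.0 / eps) * math.log2(2.0 / delta),
                         (8.0 * vc / eps) * math.log2(13.0 / eps)))
```

For the final membrane of fig. 6 (23 perceptrons in dimension 2), eps = delta = 0.1 gives a bound on the order of tens of thousands of examples; the point of the theorem is only that this quantity is polynomial in 1/eps and 1/delta.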
In any case, perceptron membranes give new hope for the design of connectionist models which are PAB for large classes of probability distributions.

6. Conclusion

The framework for supervised learning introduced in this paper allows us to unify classical and connectionist approaches. This represents an important step towards a clarification of these methods for solving classification or function approximation problems. Moreover, this framework leads to a new characterization: the Probably Almost Bayesian property. In order to take into account computational costs, we introduce the definition of PAB algorithms, which calculate the solution in a polynomial number of steps with respect to the number of examples. We have shown that the design of new PAB algorithms for large classes of probability distributions is a major issue in the field of supervised learning. The study of the PAB property for two particular classes of algorithms, function approximation by orthogonal polynomials and classification by perceptron membranes, allowed us to illustrate our point in a more concrete manner.

References

Baum, E., 1990, When are the k-nearest neighbor and back-propagation accurate for feasible sized sets of examples?, in: Lecture notes on computer science (Springer-Verlag, New York, NY).
Baum, E. and D. Haussler, 1989, What size net gives valid generalization?, in: Neural computation 1 (MIT, Cambridge, MA) 151-160.
Bochereau, L., P. Bourgine, and G. Deffuant, 1990, Equivalence between connectionist classifiers and logical classifiers, in: Lecture notes in physics 368 (Springer-Verlag, New York, NY) 351-364.
Bourgine, P., E. Monneret, and P. Rivière, 1992, Probably almost Bayesian algorithms and orthonormal basis nets, in: IJCNN proceedings (Peking), forthcoming.
Deffuant, G., 1990, Neural units recruitment algorithms, in: IJCNN proceedings (San Diego, CA).
Deffuant, G., 1992, Self building connectionist networks, Ph.D. dissertation (Université de Paris VI, EHESS, Paris).
Ehrenfeucht, A., D. Haussler, M. Kearns, and L.G. Valiant, 1988, A general lower bound on the number of examples needed for learning, in: Proceedings of the annual workshop on computational learning theory 1988 (Morgan Kaufmann, San Mateo, CA).
Fahlman, S.E. and C. Lebiere, 1990, The cascade-correlation learning architecture, in: D.S. Touretzky, ed., Advances in neural information processing systems II (Denver, 1988) (Morgan Kaufmann, San Mateo, CA) 524-532.
Frean, M., 1990, The upstart algorithm: A method for constructing and training feedforward neural networks, Neural Computation 2, 198-209.
Gish, H., 1990, A probabilistic approach to the understanding and training of neural network classifiers, IEEE CH2847-2/90, 1361-1364.
Hinton, G.E., 1986, Learning distributed representations of concepts, in: Proceedings of the eighth annual conference of the Cognitive Science Society (Amherst) (Erlbaum, Hillsdale) 1-12.
Hinton, G., 1989, Connectionist learning procedures, Artificial Intelligence 40, 185-234.
Hoeffding, W., 1963, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association 58, 13-30.
Kanaya, F. and S. Miyake, 1990, Bayes statistical behavior and valid generalization of pattern classifying neural networks, in: Proceedings of Cognitiva 90 (AFCET) 13-19.
Lapedes, A. and R. Farber, 1987, How neural nets work, in: Proceedings of the IEEE conference on neural nets (Denver).
Le Cun, Y., 1985, A learning scheme for asymmetric threshold networks, in: Cognitiva 85 (CESTA-AFCET).
Marchand, M., M. Golea, and P. Rujan, 1990, A convergence theorem for sequential learning in two-layer perceptrons, Europhysics Letters 11, 487-492.
Mézard, M. and J.P. Nadal, 1989, Learning in feedforward layered networks: The tiling algorithm, Journal of Physics 21, 1087-1092.
Minsky, M. and S. Papert, 1969, Perceptrons (MIT Press, Cambridge, MA).
Morrison, D., 1990, Multivariate statistical methods (McGraw-Hill, New York, NY).


Nadal, J.P., 1989, Study of a growth algorithm for a feedforward network, International Journal of Neural Systems 1, no. 1.
Qian, S., Y.C. Lee, R.D. Jones, C.W. Barnes, and K. Lee, 1990, Function approximation with an orthogonal basis net, in: IJCNN proceedings, Vol. III, 605-619.
Rosenblatt, F., 1962, Principles of neurodynamics (Spartan Books, New York, NY).
Rumelhart, D.E., G. Hinton, and R. Williams, 1986, Learning internal representations by error propagation, in: J.L. McClelland, D.E. Rumelhart, and the PDP research group, eds., Parallel distributed processing: Explorations in the microstructure of cognition (MIT Press, Cambridge, MA).
Scalettar, R. and A. Zee, 1988, Emergence of grandmother memory in feedforward networks: Learning with noise and forgetfulness, in: D. Waltz and J.A. Feldman, eds., Connectionist models and their implications: Readings from cognitive science (Ablex, Norwood) 309-332.
Sietsma, J. and R.J.F. Dow, 1988, Neural net pruning - Why and how, in: IEEE international conference on neural networks (San Diego), Vol. I (IEEE, New York, NY) 325-333.
Valiant, L., 1984, A theory of the learnable, Communications of the ACM 27, no. 11, 1134-1142.
Vapnik, V.N. and Y. Chervonenkis, 1981, On the uniform convergence of relative frequencies of events to their probabilities, Theory of Probability and Its Applications 26, 532-553.