i, the solution f_i^* in F^i also belongs to F^j. If we neglect the sampling and convergence errors, we can then conclude that C(f_j^*, K) will be less than or equal to C(f_i^*, K). However, we have no a priori information about the comparison between C(f_i^*, K') and C(f_j^*, K'). Fig. 2 shows that the criterion C(f_i^*, K) must decrease when n(F^i) increases; the opposite case can only be explained by an error of the optimization procedure. In practice, however, the quantity C(f_i^*, K') usually begins by decreasing, reaches a minimum value, and then increases again. Thus, for a given branch of the lattice, the minimal value of the criterion C(f_i^*, K') allows the selection of the most appropriate model.

It is important to notice that the type, continuous or discrete, of the vector y leads to different types of problems. When y is a continuous vector, supervised learning becomes a problem of function approximation; when y is a discrete vector, it becomes a classification problem.

The double minimization in the lattice of fig. 1 appears to be the ideal theoretical procedure; however, it does not take the computational constraints into account. In practice, this procedure must be restricted to approximation algorithms which are polynomial in the size of K. We focus on such algorithms in the next sections.
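To make this selection step concrete, the following Python sketch runs the procedure on one branch of the lattice: models of increasing size are fitted on K, and the one minimizing the empirical criterion on the independent sample K' is retained. The quadratic cost of section 4 is used for illustration; the `fit` interface and the function names are assumptions introduced here, not part of the original text.

```python
import numpy as np

def select_model(train, valid, fit, max_size):
    """Sketch of the double minimization on one branch of the lattice.
    train, valid: (x, y) arrays playing the roles of K and K'.
    fit(x, y, n): returns a function f_n minimizing C(., K) within F^n.
    max_size: largest model size n(F^n) explored on this branch.
    (The `fit` interface and names are illustrative assumptions.)"""
    x_tr, y_tr = train
    x_va, y_va = valid
    best_n, best_err, best_f = None, np.inf, None
    for n in range(1, max_size + 1):
        f_n = fit(x_tr, y_tr, n)
        # Empirical criterion C(f_n, K') on the independent sample K'
        # (quadratic cost, as in section 4).
        err = np.mean((f_n(x_va) - y_va) ** 2)
        if err < best_err:            # keep the minimum of the U-shaped curve
            best_n, best_err, best_f = n, err, f_n
    return best_n, best_f
```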
3. Definition of probably almost Bayesian algorithms

3.1. Probably almost correct algorithms

Valiant (1984) has developed a learning theory for boolean functions defined on a finite set of boolean variables. His work aims at determining classes of functions learnable in a polynomial number of steps relative to a query protocol, which uses the function as an oracle to which queries are made. Within this framework, Baum and Haussler (1989) have studied the sample complexity necessary for multilayer networks in order to guarantee good generalization; their study is founded on the work of Vapnik and Chervonenkis (1981). Kanaya (1990) proposed to extend Valiant's framework to statistical pattern recognition. In this context, the mapping associating each input vector x with an output vector y is not deterministic; therefore, the conditional probabilities P(y_i | x) are not always equal to 0 or 1. This framework can be further extended to the case of real outputs, which leads to the Probably Almost Bayesian framework described in the next paragraph.
3.2. Probably almost Bayesian algorithms

We consider now the case of section 2. We defined earlier the Bayesian risk associated with a function f as
C(f) = E[ R(f(x), y) ] = ∫ R(f(x), y) μ(x, y) dx dy .    (1)
Let C be the Bayesian optimum, i.e.,

C = min_f C(f) .    (2)
Definition (PAB algorithms).
Let X be of dimension d and Y of dimension d'. Given μ a probability density on X × Y, an algorithm A is said to be PAB for μ if, for all scalars ε, η > 0, there exists m(ε, η) > 0 such that, for every random sample K of size greater than m drawn from the probability distribution μ, it provides from K, in a time polynomial with respect to m, a function f such that, for any sample K' drawn with the probability density μ,

Pr[ C(f) − C < ε ] > 1 − η .
Notice that this definition remains valid whether the variables to predict are discrete or continuous. It is important to have in the lattice a variety of PAB algorithms for large classes of probability distributions. In sections 4 and 5, the PAB property is
studied on two different models: approximation by orthogonal polynomials and perceptron membranes.
4. Approximation by orthogonal polynomials

In the following, the cost function R will be equal to the square of the norm:
R(y', y'') = ||y' − y''||² .

C(f) is then the quadratic error often used in practice. In this case, the Bayesian optimum can be precisely defined by introducing E[y | x], the conditional mean of y given x:
C(f) = E[ ||f(x) − y||² ] = E[ ||f(x) − E[y | x]||² ] + E[ ||E[y | x] − y||² ] .    (3)
Equation (3) holds because the cross term E[⟨f(x) − E[y | x], E[y | x] − y⟩] vanishes: conditionally on x, the first factor is fixed while the second has zero mean. The second term is independent of f and constant.¹ C(f) is thus minimum when the first term is nil, i.e., when f = E[y | x]. This leads to the following theorem:
Theorem. If the cost function is the quadratic error, then
(i) the Bayesian solution is f* = E[y | x],
(ii) the Bayesian error is C(f*) = E[ ||E[y | x] − y||² ],
(iii) C(f) − C(f*) = E[ ||f(x) − f*(x)||² ] .    (4)
In this section, we consider connectionist networks based on orthogonal bases of functions [Qian (1990)]. We first summarize the mathematical results on orthogonal bases of functions. Then we define a POB algorithm (Projection on Orthonormal Basis), which is PAB for a very large class of probability densities and of functions representing the expectation of y when x is known. It is important to notice that these functions are not necessarily continuous. In order to simplify the notations, we consider real functions from R to R, but the following can be generalized to real functions from R^m to R^n.
¹ Thus, the minimization of the quadratic error is the same operation as the minimization of the quadratic distance to the conditional mean, as recalled by Gish (1990).
4.1. Problem formulation

We consider a positive measure μ on a Borel subset Ω of R and the Hermitian product of two real functions:
(f, g) = ∫_Ω f(x) g(x) dμ(x) .

We consider the space L²(μ) of functions f: Ω → R such that (f, f) < +∞ (i.e., f has bounded norm). With this Hermitian product, L²(μ) is a Hilbert space. If Ω is compact or if μ decreases exponentially at ±∞,² the polynomials {x^n} constitute a Hilbertian basis of this space. By an orthonormalisation procedure, one also obtains a basis of orthonormal polynomials {P_n(x)}, constructable in polynomial time (n²) by the following recurrence formula:
P_{n+1}(x) = (a_n x + b_n) P_n(x) + c_n P_{n−1}(x) ,

where a_n, b_n, c_n are determined by

(P_{n+1}, P_n) = 0 ,   (P_{n+1}, P_{n−1}) = 0 ,   (P_{n+1}, P_{n+1}) = 1 .
In this orthonormal basis, a function f of L²(μ) can be expressed as

f = Σ_{i=0}^{+∞} a_i P_i ,   where   a_i = (f, P_i) .
We consider now the problem of approximating a function from a large set of examples drawn from the probability density μ, when the number of examples tends to infinity.
4.2. Results

We define the POB (Projection on Orthonormal Basis) algorithm before presenting our main result.
General POB Algorithm. Let K be a set of couples {(x_k, y_k)}, where the {x_k} are drawn from the distribution μ on Ω.
² That is, μ is dominated by a function x ↦ e^{−α|x|} with α > 0 when x → ±∞.
(i) Construct the orthonormal basis {P_i}, for i from 0 to n < |K|, from the empirical moments

M_i = Σ_{k=1}^{|K|} x_k^i / |K| .

(ii) Calculate, for i from 0 to n, the coefficients b_i of P_i:

b_i = Σ_{k=1}^{|K|} y_k P_i(x_k) / |K| .

(iii) Define the approximation f by

f = Σ_{i=0}^{n} b_i P_i(x) .
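The following Python sketch implements steps (i)-(iii) under the hypotheses of the theorem below. For numerical convenience it builds the empirical orthonormal polynomials with the classical monic three-term recurrence (an equivalent form of the recurrence of section 4.1) rather than directly from the raw moments M_i; it is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def pob_fit(x, y, n):
    """Illustrative sketch of the POB algorithm (not the authors' code).
    x, y: 1-D arrays of the samples (x_k, y_k) in K; n: maximal degree,
    assumed much smaller than |K|. Returns a callable approximating E[y | x]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # (i) Orthonormal polynomials for the empirical measure, built with the monic
    # recurrence pi_{k+1}(t) = (t - alpha_k) pi_k(t) - beta_k pi_{k-1}(t),
    # whose coefficients are empirical moments of the sample.
    alphas, betas, norms, coeffs = [], [], [], []
    prev = np.zeros_like(x)          # pi_{-1}
    cur = np.ones_like(x)            # pi_0
    for k in range(n + 1):
        sq = np.mean(cur * cur)      # ||pi_k||^2 for the empirical measure
        norms.append(np.sqrt(sq))
        # (ii) coefficient b_k = <y, P_k> = (1/|K|) sum_k y_k P_k(x_k)
        coeffs.append(np.mean(y * cur) / norms[-1])
        alphas.append(np.mean(x * cur * cur) / sq)
        betas.append(sq / np.mean(prev * prev) if k > 0 else 0.0)
        prev, cur = cur, (x - alphas[-1]) * cur - betas[-1] * prev

    # (iii) f(t) = sum_i b_i P_i(t), evaluated with the same recurrence so that
    # it can be applied to new points t.
    def f(t):
        t = np.asarray(t, dtype=float)
        p_prev, p_cur = np.zeros_like(t), np.ones_like(t)
        out = np.zeros_like(t)
        for k in range(n + 1):
            out += coeffs[k] * p_cur / norms[k]
            p_prev, p_cur = p_cur, (t - alphas[k]) * p_cur - betas[k] * p_prev
        return out

    return f
```

A routine of this kind can also play the role of `fit` in the model-selection sketch given at the end of section 2, where the degree n is chosen by the minimum of C(f_n, K').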
The computational cost when |K| → +∞ is O(n·|K|) for (i) and O(n·|K|) for (ii).

Theorem. A POB algorithm is PAB if (i) μ has a compact support or is fast decreasing at ±∞, and (ii) E[y | x] belongs to L²(μ).
Proof. Let {P_i} be the orthonormal basis of the Hilbert space L²(μ) defined with the μ-scalar product, which exists under hypothesis (i). Let f* be the conditional mean E[y | x], which is a linear combination of the orthonormal basis under hypothesis (ii). Let f*_n be the orthogonal projection of f* on (P_i)_{0 ≤ i ≤ n}:

f* = Σ_{i=0}^{+∞} a_i P_i ,   where   a_i = (f*, P_i) ,

f*_n = Σ_{i=0}^{n} a_i P_i .
Lemma. Let f_n be the polynomial function of degree n minimizing C(f) on the base of examples K. The norm ||f_n − f*_n|| converges to zero in probability when |K| → +∞.
[See Bourgine et al. (1992) for the proof.]
Following (4), we can write

C(f_n) − C(f*) = ||f* − f_n||² = ||f* − f*_n||² + ||f*_n − f_n||² .
The second equality comes from the orthogonality between f* − f*_n and f*_n − f_n: the first is a linear combination of {P_i, i > n} and the second of {P_i, i ≤ n}. Now:

∀ε > 0, ∃n such that ||f* − f*_n||² < ε/2   (property of a Hilbertian space),

∀ε > 0, ∀η > 0, ∃m(ε, η) such that |K| > m ⇒ Prob( ||f*_n − f_n||² < ε/2 ) > 1 − η   (cf. the Lemma).

We can then deduce:

∀ε > 0, ∀η > 0, ∃m(ε, η) such that |K| > m ⇒ Prob( C(f_n) − C(f*) < ε ) > 1 − η .

This is the desired PAB property, if we remember that the computational cost is polynomial in |K|. The proof can be generalized to polynomials of p variables x_1, …, x_p and to functions from R^p to R^n.
5. Perceptron membranes

In the standard connectionist model, backpropagation in multilayer networks [Le Cun (1985), Rumelhart et al. (1986)] is obviously not a PAB algorithm for large classes of probability distributions. First of all, there is no convergence proof for this algorithm, and local minima are a well-known problem for it. Furthermore, one has to find the simplest architecture allowing the convergence, in order to maximize the probability of good generalization [Baum and Haussler (1989), Baum (1990)]. In the backpropagation algorithm, this search for the best architecture is performed empirically by the user. In order to achieve the PAB property for large classes of probability distributions, a connectionist model must therefore optimize the network architecture (i.e., find the best compromise between the network's simplicity and its performance on the learning examples). The implementation of structural transformation procedures is therefore necessary in such a model.

Perceptron membranes are a new connectionist model allowing the implementation of structural transformations during the learning process. However, perceptron membranes are deeply different from classical multilayer networks because they have to be considered as geometric objects: piecewise linear surfaces. In this model, perceptrons [Rosenblatt (1962), Minsky and Papert (1969)] are used as geometric components which define adaptive linear separators. These components are organized in subsets defining convex polyhedrons. The union of several such polyhedrons allows us to define any piecewise linear surface separating the input space into two parts.
The piecewise linear surface (the membrane) is submitted to the influence of the learning examples, which tend to stabilize it at the frontier between the two classes. This is achieved thanks to a geometric credit assignment which is described in detail below. The important point is that interactions between learning examples and facets are established through a geometric procedure. The geometric interpretation of the model also allows us to design procedures for structural transformations of the membrane: new convex polyhedrons and new facets can be added, some facets can be removed, and convex polyhedrons can be linked. This allows us to optimize the membrane structure. Firstly, the membrane structure and the perturbations from its environment are described. Then the membrane development algorithm is presented. Finally, the PAB property of the model is discussed.
5.1. Membrane description

We focus on a two-class classification problem. We consider therefore a training set of N d-dimensional vectors (X_i)_{1 ≤ i ≤ N}, each labelled with one of the two classes (0 or 1).
5.1.1. Mathematical definition of the membrane

The membrane M is defined by a union of n convex polyhedrons, each of them being defined by the intersection of several half-spaces. Thanks to these polyhedrons, the membrane divides the input space into two parts, internal and external. If a point of the space is strictly inside one of these polyhedrons, then it is in the internal part of the space, called I(M). More precisely, the mathematical definition of I(M) involves three levels:

(a) Half-space (or perceptron) level. At this level, we take into account sets H of the following type:

x ∈ H ⇔ w·x + b > 0 ,

where '·' is the scalar product, w a d-dimensional vector, and b a scalar; w will be the weights and b the bias of a perceptron unit.

(b) Convex level. We consider now sets C which are intersections of several half-spaces (convex polyhedrons of R^d):

C = ∩_{i=1}^{k} H_i ,

where the sets H_i are half-spaces.
Fig. 3. The membrane level: I(M) is defined with three convex polyhedrons.

(c) Membrane level. This is the last level, which completely defines I(M). I(M) is the union of the convex polyhedrons defined at the previous level. I(M) is described by a list of perceptron lists; for instance, the membrane of fig. 3 is denoted
I(M) = {(H_1, H_2, H_3, H_4, H_5), (I_1), (J_1, J_2, J_3)} .
The membrane is defined as the boundary of I(M) (it is represented in bold lines in fig. 3). The part of each hyperplane included in the membrane is called the active part of the hyperplane (or of the perceptron). The membrane is therefore the union of all the perceptrons' active parts. Such a model can be used for a two-class classification problem: for instance, I(M) must contain the 1-class examples while the 0-class examples must remain in the external part of the space. The membrane is then the separating surface between the two classes. Now that the membrane is mathematically defined, the question of its adaptivity to the training examples can be addressed.
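As a concrete reading of this definition, I(M) can be encoded as a list of convexes, each convex being a list of (w, b) pairs. The following Python sketch (with illustrative names, not taken from the paper) tests whether a point belongs to the internal part of the space.

```python
import numpy as np

# A membrane interior I(M) encoded as a list of convexes, each convex being a
# list of (w, b) pairs defining the half-spaces w.x + b > 0.
# (Illustrative encoding, not taken from the paper.)

def in_half_space(x, w, b):
    return float(np.dot(w, x)) + b > 0.0

def in_convex(x, convex):
    # A convex C is the intersection of its half-spaces.
    return all(in_half_space(x, w, b) for (w, b) in convex)

def in_interior(x, membrane):
    # I(M) is the union of the convexes: x belongs to the internal part of the
    # space when it is strictly inside at least one convex polyhedron.
    return any(in_convex(x, c) for c in membrane)

# The membrane of fig. 3 would be encoded, with hypothetical (w, b) values, as
# membrane = [[H1, H2, H3, H4, H5], [I1], [J1, J2, J3]]
# where each entry is a (w, b) pair.
```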
5.1.2. Membrane adaptivity

The membrane includes several perceptrons, so the well-known 'credit assignment problem' arises: how should the responsibility for an error be assigned among the different perceptrons? The derivative composition rule in a multilayer network allows one to perform a gradient descent through the parameter space, which is the principle of the backpropagation algorithm [Le Cun (1985), Rumelhart et al. (1986), Lapedes and Farber (1987)]. The perceptron membrane gives a new solution to this problem. In this approach, the network is considered as a geometric surface (a membrane) in the input space. The membrane facets are continuously attracted or repulsed by the training examples thanks to the following procedure (cf. fig. 4):
Fig. 4. The geometrical credit assignment. Example A perturbs perceptrons I2 and H1. Example B perturbs perceptrons H1, H2, H3, and not I1 because the perturbation intersects the membrane.
(i) choose randomly a training example E;
(ii) project this example orthogonally on each hyperplane defining the membrane;
(iii) if the orthogonal projection P is located in the active part of hyperplane H and if the segment EP does not intersect the membrane, then hyperplane H is attracted or repulsed by E, so that the membrane tends to 'eat' 1-class examples or to 'reject' 0-class examples.

The repulsion or attraction of the hyperplanes is derived from the delta rule [Rumelhart et al. (1986)] applied to one perceptron cell alone. In the membrane model, each perceptron can therefore be considered independently from the others; it learns only the examples which 'strike' it. This method is much more direct than backpropagation through the layers of a network. However, this adaptivity is not sufficient for the PAB requirement, because the algorithm must perform a search in a lattice of functions with different numbers of degrees of freedom. Under the perturbations coming from its environment, the membrane performs this exploration by periodic structural transformations.

5.2. Structural transformations

5.2.1. Initialization of the membrane: creation of the first convex

First of all, a class (0 or 1) is chosen to be the inside class of the membrane. This choice can be made randomly or by taking, for instance, the minority class of the training set. Let the inside class of the membrane be class 1. Then, an example of class 1 which has more than half of its k nearest neighbors of class 1 is searched for in the training set. If none is found, the membrane development is not possible and k must be decreased in order to allow the development.
Suppose now that example x satisfies the previous condition. A polyhedral set (subsequently called a convex) C which only contains 1-class examples is built around x by the following method: initialize C with the median hyperplane between x and its nearest neighbor of class 0, such that x is on the positive side of the hyperplane; then iterate the construction by adding the median hyperplane between x and its nearest 0-class neighbor inside C, such that x is on the positive side of the hyperplane. The construction ends when C contains only examples of class 1. The convex built according to this method initializes the membrane, which is then submitted to the perturbations of the environment.
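A minimal sketch of this construction, assuming the 0-class examples are given as an array X0 and using the median-hyperplane formula w = x − z, b = −w·(x + z)/2 (so that x lies on the positive side); the helper names are illustrative, not the authors' code.

```python
import numpy as np

def median_hyperplane(x, z):
    """Hyperplane equidistant from x and z, oriented so that x lies on the
    positive side: w.t + b > 0 for t = x, and < 0 for t = z."""
    w = x - z
    b = -float(np.dot(w, (x + z) / 2.0))
    return w, b

def build_first_convex(x, X0):
    """Grow a convex around the 1-class seed example x by repeatedly cutting
    off the nearest 0-class example still inside it. X0: array of 0-class
    examples. Returns a list of (w, b) half-spaces whose intersection is C."""
    x = np.asarray(x, dtype=float)
    convex = []
    inside = [np.asarray(z, dtype=float) for z in X0]  # 0-class examples still in C
    while inside:
        nearest = min(inside, key=lambda z: np.linalg.norm(z - x))
        convex.append(median_hyperplane(x, nearest))
        inside = [z for z in inside
                  if all(float(np.dot(w, z)) + b > 0.0 for (w, b) in convex)]
    return convex
```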
5.2.2. Recruitment

New convex polyhedrons can be added to the membrane definition thanks to the previous polyhedron construction method. Besides, convex holes can also be dug into the inside part I(M) when the convex polyhedron is built around a 0-class example which lies in the inside part of the space.
5.2.3. Linkage of convex polyhedrons

This procedure simplifies the membrane by eliminating its useless 'bumps'. This is done by 'linking' pairs of convexes: two convexes are said to be linked when the same perceptron (hyperplane) is present in both their definitions. The procedure considers all pairs of intersecting convexes and tests whether sharing a perceptron improves, or at least does not deteriorate, the membrane's efficiency on the training set. If this test is positive, the perceptron is shared by both convexes. The removal of useless 'bumps' is, moreover, very important for improving the generalization ability.
5.2.4. Perceptron duplication

Periodically, the perceptrons which receive more than a threshold number of perturbations corresponding to misclassified examples are duplicated. A copy of the duplicated perceptron is made and its parameters are slightly modified at random. This allows us to create a new concave or convex (the choice is randomly made) 'bump' in the membrane.
5.2.5. Elimination

As explained above, each perceptron of the membrane can be considered as operating independently on a training set determined by the geometric credit assignment. This training set is approximately constant during one training pass, because the modifications of the membrane are rather small during this period. Therefore, the relevance for generalization of a perceptron on this training set
can be evaluated according to the result of Ehrenfeucht et al. (1988) [this theorem is quoted in Baum and Haussler (1989)]. The result gives a lower bound on the number of examples necessary for valid generalization, using the Vapnik and Chervonenkis (1981) dimension of a class of functions. Using the Vapnik-Chervonenkis dimension of half-spaces, one can deduce that a perceptron submitted to fewer than d/4 perturbations (d being the space dimension) must be removed from the structure.
5.2.6. The development algorithm

The algorithm stops when the convex polyhedron recruitment has failed p times (p being an integer). The higher p is, the lower the probability that a suitable seed example (one that is misclassified and has more than half of its k nearest neighbors in its own class) still exists, but the longer the learning takes. Let C_s be this stopping criterion; the development algorithm is the following (P_r, P_d, P_e, P_l being integers):
While (C_s false) do:
    every P_r training passes: recruitment,
    every P_d training passes: perceptron duplication,
    every P_e training passes: elimination,
    every P_l training passes: linkage of convex polyhedrons.

The exploration efficiency depends mainly on the periodicity parameters. Figs. 5, 6, and 7 illustrate membrane developments.
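A skeleton of this loop in Python, where the training pass, the structural operators, and the stopping test are passed in as callables; this interface is an assumption for illustration, not the authors' code.

```python
def develop_membrane(membrane, stop, training_pass, operators, periods):
    """Skeleton of the development loop. `training_pass` applies one pass of
    the geometric credit assignment and delta rule; `operators` and `periods`
    map 'recruitment', 'duplication', 'elimination', 'linkage' to a callable
    and to its period P_r, P_d, P_e, P_l. Illustrative interface only."""
    t = 0
    while not stop(membrane, t):       # C_s: e.g., recruitment failed p times
        training_pass(membrane)
        t += 1
        for name in ("recruitment", "duplication", "elimination", "linkage"):
            if t % periods[name] == 0:
                operators[name](membrane)
    return membrane
```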
5.3. PAB property discussion

We argue that the direction given by perceptron membranes is of high interest for implementing PAB connectionist networks for large classes of probability distributions. First of all, such connectionist networks need very important structural adaptation capacities. This has been well understood by many authors who propose growing networks [Mézard and Nadal (1989), Marchand and Rujan (1989), Fahlman and Lebiere (1990), Frean (1990), Deffuant (1990)] or algorithms allowing units to be removed or decayed [Scalettar and Zee (1988), Sietsma and Dow (1988)]. However, it is now well known that the network's generalization ability is closely related to the quality of a compromise between the network's simplicity and its results on the learning examples [Baum (1990), Baum and Haussler (1989)]. None of these algorithms concurrently performs component recruitment and elimination in order to achieve this compromise, as perceptron membranes do. Moreover, the theoretical approximating power of perceptron membranes is very high.
Fig. 5. Example of membrane development for the double square problem with noisy data (10% noise) in dimension 2. The membrane is pictured every two training passes.
Fig. 6. Example of membrane development for the double spiral problem. The membrane is pictured every training pass. The final membrane includes 13 convex polyhedrons and 23 perceptrons.
Fig. 7. Example of membrane development for the double square problem with non-noisy data in dimension 3. The membrane is pictured every four training passes.
Theorem. For any 0 < ε < 1 and any compact and continuously differentiable surface S separating a d-dimensional space into two parts, there exists a membrane M in an ε-neighborhood of S for the standard topology on sets, involving a number of perceptrons growing polynomially with ε⁻¹.

The detailed proof of this theorem is given in Deffuant (1992). It uses the compactness property in order to exhibit a minimum surface for the membrane's facets in an ε-neighborhood of S, which allows the evaluation of an upper bound on the number of perceptrons involved in the membrane. It is important to notice that this complexity property has not been demonstrated for classical multilayer networks.

This result allows us to study the learning base complexity necessary for good generalization. The introduction of the Vapnik and Chervonenkis (1981) framework gives the following theorem:

Theorem. Let D be a probability distribution of two-class d-dimensional examples such that the Bayesian separating surface is compact, continuous, and differentiable, and let C* be the corresponding Bayesian risk. For any 0 < ε < 1 and 0 < δ < 1, a perceptron membrane M with a corresponding risk C(M) such that Prob[ C(M) − C* < ε ] > 1 − δ can be obtained by the previously described algorithm from a learning sample growing polynomially with ε⁻¹ and δ⁻¹.

The detailed proof of this theorem is also given in Deffuant (1992). It mainly uses the fact that the Vapnik-Chervonenkis dimension of perceptron membranes is smaller than the product of the number of perceptrons involved in the membrane and the space dimension. Combined with the previous result, the Vapnik and Chervonenkis (1981) theorem allows us to complete the demonstration. It has to be noticed that this theorem gives a theoretical possibility, but the actual convergence has not been demonstrated yet. It can be hoped that improvements of the perceptron membranes' adaptivity properties will lead to this convergence. In any case, perceptron membranes give new hope for the design of connectionist models which are PAB for large classes of probability distributions.
6. Conclusion
The framework for supervised learning introduced in this paper allows us to unify classical and connectionist approaches. This represents an important step towards a clarification of these methods when solving classification or function approximation problems. Moreover, this framework leads to a new characterization: the Probably Almost Bayesian property. In order to take into account
computational costs, we introduce the definition of PAB algorithms that calculate the solution in a polynomial number of steps with respect to the number of examples. We have shown that the design of new PAB algorithms for large classes of probability distributions is a very important issue in the field of supervised learning. The study of the PAB property for two particular classes of algorithms, function approximation by orthogonal polynomials and classification by perceptron membranes, allowed us to illustrate our point in a more concrete manner.
References

Baum, E., 1990, When are the k-nearest neighbor and back-propagation accurate for feasible sized sets of examples?, in: Lecture notes on computer science (Springer-Verlag, New York, NY).
Baum, E. and D. Haussler, 1989, What size net gives valid generalization?, Neural Computation 1 (MIT, Cambridge, MA) 151-160.
Bochereau, L., P. Bourgine, and G. Deffuant, 1990, Equivalence between connectionist classifiers and logical classifiers, in: Lecture notes in physics 368 (Springer-Verlag, New York, NY) 351-364.
Bourgine, P., E. Monneret, and P. Rivière, 1992, Probably almost Bayesian algorithms and orthonormal basis nets, in: IJCNN proceedings (Peking), forthcoming.
Deffuant, G., 1990, Neural units recruitment algorithms, in: IJCNN proceedings (San Diego, CA).
Deffuant, G., 1992, Self building connectionist networks, Ph.D. dissertation (Université de Paris VI, EHESS, Paris).
Ehrenfeucht, A., D. Haussler, M. Kearns, and L.G. Valiant, 1988, A general lower bound on the number of examples needed for learning, in: Proceedings of the 1988 annual workshop on computational learning theory (Morgan Kaufmann, San Mateo, CA).
Fahlman, S.E. and C. Lebiere, 1990, The cascade-correlation learning architecture, in: D.S. Touretzky, ed., Advances in neural information processing systems II (Denver, 1988) (Morgan Kaufmann, San Mateo, CA) 524-532.
Frean, M., 1990, The upstart algorithm: A method for constructing and training feedforward neural networks, Neural Computation 2, 198-209.
Gish, H., 1990, A probabilistic approach to the understanding and training of neural network classifiers, IEEE CH2847-2/90, 1361-1364.
Hinton, G.E., 1986, Learning distributed representations of concepts, in: Proceedings of the eighth annual conference of the Cognitive Science Society (Amherst) (Erlbaum, Hillsdale) 1-12.
Hinton, G., 1989, Connectionist learning procedures, Artificial Intelligence 40, 185-234.
Hoeffding, W., 1963, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, 13-30.
Kanaya, F. and S. Miyake, 1990, Bayes statistical behavior and valid generalization of pattern classifying neural networks, in: Proceedings of Cognitiva 90 (AFCET) 13-19.
Lapedes, A. and R. Farber, 1987, How neural nets work, in: Proceedings of the IEEE conference on neural nets (Denver).
Le Cun, Y., 1985, A learning scheme for asymmetric threshold networks, in: Cognitiva (CESTA-AFCET).
Marchand, M., M. Golea, and P. Rujan, 1990, A convergence theorem for sequential learning in two-layer perceptrons, Europhysics Letters 11, 487-492.
Mézard, M. and J.P. Nadal, 1989, Learning in feedforward layered networks: The tiling algorithm, Journal of Physics 21, 1087-1092.
Minsky, M. and S. Papert, 1969, Perceptrons (MIT Press, Cambridge, MA).
Morrison, D., 1990, Multivariate statistical methods (McGraw-Hill, New York, NY).
Nadal, J.P., 1989, Study of a growth algorithm for a feedforward network, International Journal of Neural Systems 1, no. 1.
Qian, S., Y.C. Lee, R.D. Jones, C.W. Barnes, and K. Lee, 1990, Function approximation with an orthogonal basis net, in: IJCNN proceedings, Vol. III, 605-619.
Rosenblatt, F., 1962, Principles of neurodynamics (Spartan Books, New York, NY).
Rumelhart, D.E., G. Hinton, and R. Williams, 1986, Learning internal representations by error propagation, in: J.L. McClelland, D.E. Rumelhart, and the PDP research group, eds., Parallel distributed processing: Explorations in the microstructure of cognition.
Scalettar, R. and A. Zee, 1988, Emergence of grandmother memory in feed forward networks: Learning with noise and forgetfulness, in: D. Waltz and J.A. Feldman, eds., Connectionist models and their implications: Readings from cognitive science (Ablex, Norwood) 309-332.
Sietsma, J. and R.J.F. Dow, 1988, Neural net pruning - Why and how, in: IEEE international conference on neural networks (San Diego), Vol. I (IEEE, New York, NY) 325-333.
Valiant, L., 1984, A theory of the learnable, Communications of the ACM 27:11, 1134-1142.
Vapnik, V.N. and Y. Chervonenkis, 1981, On the uniform convergence of relative frequencies of events to their probabilities, Theory of Probability and Its Applications XXVI, 532-553.