Convergence Properties of High-order Boltzmann Machines


Neural Networks, Vol. 9, No. 9, pp. 1561–1567, 1996. Copyright © 1996 Elsevier Science Ltd. All rights reserved. Printed in Great Britain. 0893-6080/96 $15.00 + .00

CONTRIBUTED ARTICLE

F. XABIER ALBIZURI, ALICIA D'ANJOU, MANUEL GRAÑA AND J. ANTONIO LOZANO

University of the Basque Country

(Accepted 8 January 1996)

Abstract—The high-order Boltzmann machine (HOBM) approximates probability distributions defined on a set of binary variables, through a learning algorithm that uses Monte Carlo methods. The approximation distribution is a normalized exponential of a consensus function formed by high-degree terms, and the structure of the HOBM is given by the set of weighted connections. We prove the convexity of the Kullback–Leibler divergence between the distribution to learn and the approximation distribution of the HOBM. We prove the convergence of the learning algorithm to the strict global minimum of the divergence, which corresponds to the maximum likelihood estimate of the connection weights, establishing the uniqueness of the solution. These theoretical results do not hold in the conventional Boltzmann machine, where the consensus function has first- and second-degree terms and hidden units are used. Copyright © 1996 Elsevier Science Ltd.

Keywords—Boltzmann machine, Stochastic neural networks, High-order neural networks, Learning algorithms, Approximation of probability distributions, Statistical inference, Monte Carlo methods, Maximum likelihood estimates.

1. INTRODUCTION

The conventional Boltzmann machine (Ackley et al., 1985; Hinton & Sejnowski, 1986; Aarts & Korst, 1989), as well as the high-order Boltzmann machine (HOBM) (Sejnowski, 1986; Albizuri et al., 1995), is a technique whose purpose is, in its fundamental formulation, to describe and model probability distributions defined on a set of binary random variables. An essential feature is the use of Monte Carlo methods in the learning algorithm and, once learning is finished, for probabilistic inference. The Boltzmann machine (BM) approximates a distribution with a model where the probability function is defined as the normalized exponential of a consensus function. In the conventional BM we have hidden units, the consensus function is formed by first- and second-degree terms on the variables, i.e., connections up to order two between units, and the approximation distribution is the marginal

Acknowledgements: This work was supported in part by PGV9220 and P194-78 research grants from the Department of Education, Universities and Investigation of the Basque Government. The authors are grateful to a reviewer for his detailed comments. Requests for reprints should be sent to F. Xabier Albizuri, Informatika Fakultatea, P.O. Box 649, 20080 Donostia, Spain; Tel: +34 43 218000; Fax: +34 43 219306; e-mail: [email protected].

distribution on the visible units. The learning algorithm is a steepest descent of the Kullback–Leibler divergence between the distribution to learn and the approximation distribution.

When we consider the theoretical bases of the conventional BM, the absence of results on fundamental questions is observed. Specifically, given a distribution to learn, the uniqueness of the distribution obtained by the learning algorithm is not established. If the divergence has various relative minima, different solutions are obtained (Aarts & Korst, 1989): different connection weights depending on the starting point, i.e., the initial connection weights, and the parameters of the learning rule. Given a structure, defined by the number of hidden units and the connections used, there is no characterization of the learned distribution.

The objective of this paper is to show that with the HOBM, a variation of the conventional BM where we consider higher-order connections and do not use hidden units, these questions are solved satisfactorily. We prove the convergence of the learning algorithm of the HOBM and the uniqueness of the learned distribution, which corresponds to the maximum likelihood estimate of the weights of the connections of the model.

The HOBM (without hidden units) and the usual two-order BM with hidden units are two alternatives


to improve the learning capacity of the two-order BM without hidden units. The two-order BM without hidden units provides a probabilistic model that corresponds to a limited class of probability distributions; it cannot tackle problems where high-order correlations are significant. Consequently it is necessary to introduce hidden units or high-order connections, which is equivalent to some extent (Pinkas, 1990). In fact, any distribution can be written as the marginal distribution on the visible units of the distribution given by a two-order BM with hidden units (Fort & Pagès, 1993), and likewise any distribution can be written as the distribution given by a HOBM without hidden units, see, e.g., the theory of log-linear models (Lauritzen, 1989; Whittaker, 1990). In this paper we show that the HOBM (without hidden units) has convergence properties that do not hold in the BM with hidden units, particularly the usual two-order BM.

The paper is organized as follows. In Section 2 the HOBM is defined, the minimum of the divergence is characterized as the maximum likelihood estimate of the weights, and the learning algorithm is defined. In Section 3 we prove the convexity of the divergence in the HOBM. In Section 4 we prove the convergence of the learning algorithm of the HOBM to the global minimum of the divergence. In Section 5 we prove convergence for modified learning algorithms. Our conclusions are presented in Section 6.

2. THE HIGH-ORDER BOLTZMANN MACHINE

The HOBM is a variation of the conventional Boltzmann machine where we define high-order connections and do not use hidden units. The HOBM is a stochastic recurrent network and its purpose is to approximate a probability distribution P(x) on {0,1}^N with the stationary probability distribution P*(x) of the Markov chain defined by its dynamics.

The configuration or state of the network is given by x ∈ {0,1}^N and the state of a unit i by x_i ∈ {0,1}, for i = 1, ..., N. A connection λ of order m is given by its ends, λ = {i_1, ..., i_m} being a nonempty subset of [1, N] = {1, ..., N}. We define the consensus function (minus energy)

C(x) = \sum_{\lambda \in L} w_\lambda \, a(\lambda \mid x),    (1)

where L is the set of the weighted connections of the HOBM (a family of nonempty subsets of [1, N]), w_λ ∈ R is the weight of the connection λ, and a(λ|x) is a function on {0,1}^N,

a(\lambda \mid x) = \prod_{i \in \lambda} x_i.    (2)

We note that a(λ|x) = 1 if every end of λ takes the value 1, the connection is activated, and a(λ|x) = 0 otherwise.

The dynamics of the HOBM corresponds to a stochastic process, a Markov chain whose transition law is defined as follows: given a configuration x of the network we choose at random a unit j and we change its state x_j to x'_j = 1 - x_j with probability

p = \frac{1}{1 + \exp(-\Delta C(x))},

where the increment of the consensus function is given by

\Delta C(x) = (1 - 2x_j) \sum_{\lambda \in L_j} w_\lambda \, a(\lambda - \{j\} \mid x),

L_j = {λ ∈ L | j ∈ λ} being the set of connections that include the end j. The stationary probability distribution is the Boltzmann–Gibbs distribution

P^*(x) = \frac{1}{Z} \exp C(x),    (3)

where Z = Σ_x exp C(x) is the normalizing factor.

The objective of the learning algorithm is to obtain the weights {w_λ | λ ∈ L} that minimize the divergence

D = \sum_x P(x) \ln \frac{P(x)}{P^*(x)}.    (4)
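To make these definitions concrete, the following Python sketch (ours, not part of the original paper) implements the consensus function (1)–(2), the single-unit transition law, and a brute-force evaluation of the stationary distribution (3) and of the divergence (4) on a toy network; the connection set, the weights and the target distribution are arbitrary illustrative choices.

```python
import itertools
import math
import random

# Minimal sketch (ours): a tiny HOBM with N binary units and a set L of
# weighted high-order connections, each connection a frozenset of unit indices.
N = 3
weights = {
    frozenset({0}): 0.5,        # first-order connection
    frozenset({1, 2}): -0.3,    # second-order connection
    frozenset({0, 1, 2}): 1.0,  # third-order connection
}

def consensus(x, weights):
    """Consensus function C(x) = sum_lambda w_lambda a(lambda|x), eqs (1)-(2)."""
    return sum(w for lam, w in weights.items() if all(x[i] == 1 for i in lam))

def gibbs_step(x, weights, rng=random):
    """One transition of the Markov chain: pick a unit j at random and flip it
    with probability 1 / (1 + exp(-Delta C(x)))."""
    j = rng.randrange(len(x))
    delta = (1 - 2 * x[j]) * sum(
        w for lam, w in weights.items()
        if j in lam and all(x[i] == 1 for i in lam if i != j)  # a(lambda - {j} | x)
    )
    if rng.random() < 1.0 / (1.0 + math.exp(-delta)):
        x = x[:j] + (1 - x[j],) + x[j + 1:]
    return x

# Brute-force stationary distribution P*(x) = exp C(x) / Z, eq. (3), and the
# Kullback-Leibler divergence D(P || P*), eq. (4), for a toy uniform target P.
states = list(itertools.product([0, 1], repeat=N))
Z = sum(math.exp(consensus(x, weights)) for x in states)
p_star = {x: math.exp(consensus(x, weights)) / Z for x in states}
p_target = {x: 1.0 / len(states) for x in states}
D = sum(p * math.log(p / p_star[x]) for x, p in p_target.items())
print("D(P || P*) =", D)
```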

We assume that P(x) is a (strictly) positive distribution; P(x) is usually given by the frequency distribution of a set of samples (not necessarily different). If N is large P(x) possibly will not be positive; however, we can get a positive distribution by adding noise. The expression (4) defines the Kullback–Leibler information divergence, from information theory, between the distributions P(x) and P*(x). This is an improper distance which verifies that D ≥ 0, and D = 0 if and only if P(x) = P*(x) for all x (Kullback, 1959).

The information divergence is related to the notion of maximum likelihood estimate. Let Y = {x^1, ..., x^n} be a set of samples and P*(x) a function of the vector of weights w, the probabilistic model defined by the HOBM. Assuming that the samples come from independent observations with probability P*(x), the probability that we get Y given w is

P(\mathcal{Y} \mid w) = \prod_{i=1}^{n} P^*(x^i).
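The link between the divergence (4) and the likelihood, stated next in the text, can be made explicit by a short derivation (our summary, assuming P(x) is the frequency distribution of the n samples):

\ln P(\mathcal{Y} \mid w) = \sum_{i=1}^{n} \ln P^*(x^i)
= n \sum_{x} P(x) \ln P^*(x)
= n \sum_{x} P(x) \ln P(x) - n\,D .

Since the first term does not depend on w, maximizing the likelihood over w amounts to minimizing D.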


The maximum likelihood estimate of w is the value w_0 that maximizes P(Y|w) given Y. Intuitively it corresponds to the value of w that fits the sample set best. Let P(x) be the frequency distribution of the set of samples Y = {x^1, ..., x^n}. We have this property, see, e.g., Whittaker (1990): given the model P*(x) and the sample set Y, w_0 is a (global) minimum of the divergence (4) if and only if w_0 is a (global) maximum likelihood estimate of w. In this way the minimization of the divergence in the (high-order) Boltzmann machine is justified.

Therefore, given a distribution P(x), we want to minimize the divergence (4), where P*(x) is the probabilistic model given by (3) and corresponds to the stationary distribution of the stochastic process defined by the HOBM. The method of steepest descent is used to minimize the divergence. The first- and second-order derivatives of the divergence D, a function of the weights {w_λ | λ ∈ L}, are given by (Albizuri et al., 1995)

\frac{\partial D}{\partial w_\lambda} = p^*_\lambda - p_\lambda, \qquad
\frac{\partial^2 D}{\partial w_\lambda \, \partial w_\mu} = p^*_{\lambda \cup \mu} - p^*_\lambda \, p^*_\mu .    (5)

Here p*_λ and p_λ are the means of the random variable (2) under the distributions P*(x) and P(x),

p^*_\lambda = \sum_x P^*(x)\, a(\lambda \mid x), \qquad
p_\lambda = \sum_x P(x)\, a(\lambda \mid x).    (6)

They are called activation probabilities of the connection λ in the "free phase" and the "clamped phase", respectively. We note that the (first- and) second-order derivatives of D are continuous, D ∈ C². In a hierarchical model, where λ ∈ L and λ' ⊂ λ imply λ' ∈ L, it can be shown that there is a one-to-one map between the activation probabilities and the correlation coefficients associated with the connections of L (Albizuri et al., 1995).

The iterative learning algorithm is defined by the rule

w^{k+1} = w^k - \alpha \nabla D(w^k),

that is, for every connection λ ∈ L,

w^{k+1}_\lambda = w^k_\lambda - \alpha \, (p^*_\lambda - p_\lambda).    (7)

As we have shown above, ∂D/∂w_λ = p*_λ - p_λ, thus the learning rule is local. The learning algorithm begins with a "clamped phase" where p_λ is computed for every λ ∈ L. Afterwards, in each step of the learning algorithm we have a "free phase" where a stochastic simulation according to the dynamics described above is carried out, computing p*_λ for every λ ∈ L. The consensus function (1) at step k corresponds to the weights w^k. After the stochastic simulation the weights are modified according to (7).

We note the difference between the learning algorithm in the HOBM (without hidden units) and the learning algorithm in the conventional two-order BM with hidden units. In the HOBM the activation probabilities p_λ are computed at the beginning of the algorithm directly from the set of samples. In the conventional BM, the p_λ are computed at each step of the algorithm in the "clamped phase": a stochastic process is carried out where the hidden units are free and the visible units are clamped according to the distribution to learn, computing the p_λ.
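As an illustration of the two phases and of rule (7), here is a minimal Python sketch of one learning iteration (ours, reusing the toy model defined above; the sample set, the number of Monte Carlo sweeps and the value of α are arbitrary choices):

```python
def activation(lam, x):
    """a(lambda | x): 1 if every unit of the connection is on, else 0."""
    return 1.0 if all(x[i] == 1 for i in lam) else 0.0

def clamped_probabilities(connections, samples):
    """'Clamped phase': the p_lambda of eq. (6), computed once from the samples
    (the empirical distribution P)."""
    return {lam: sum(activation(lam, x) for x in samples) / len(samples)
            for lam in connections}

def free_probabilities(weights, n_steps=5000, burn_in=1000, rng=random):
    """'Free phase': Monte Carlo estimate of p*_lambda under P*(x), obtained by
    running the dynamics (gibbs_step) with the current weights."""
    x = tuple(rng.randint(0, 1) for _ in range(N))
    totals = {lam: 0.0 for lam in weights}
    for t in range(n_steps):
        x = gibbs_step(x, weights, rng)
        if t >= burn_in:
            for lam in weights:
                totals[lam] += activation(lam, x)
    return {lam: s / (n_steps - burn_in) for lam, s in totals.items()}

def learning_step(weights, p_clamped, alpha):
    """One iteration of rule (7): w_lambda <- w_lambda - alpha (p*_lambda - p_lambda)."""
    p_free = free_probabilities(weights)
    return {lam: w - alpha * (p_free[lam] - p_clamped[lam])
            for lam, w in weights.items()}

# Usage: the clamped probabilities are computed once; then the rule is iterated.
samples = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 0, 1)]   # toy sample set
p_clamped = clamped_probabilities(weights.keys(), samples)
for k in range(50):
    weights = learning_step(weights, p_clamped, alpha=0.05)
```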

3. CONVEXITY OF THE DIVERGENCE

In order to prove the convergence of the learning algorithm we begin by studying the convexity of the divergence (4). Let f be a function defined on R^n. We say that f is convex if for every x_1, x_2 and every real α, 0 < α < 1,

f(x_1 + \alpha (x_2 - x_1)) \le f(x_1) + \alpha \, (f(x_2) - f(x_1)).

Geometrically, a function is convex if the line segment joining two points on its graph lies nowhere below the graph. We have the following propositions.

PROPOSITION 1. Let f ∈ C². Given a point x_0, suppose that: (i) the gradient of f at x_0 is null, ∇f(x_0) = 0; (ii) the Hessian of f at x_0, H(x_0), is positive definite. Then x_0 is a strict relative minimum point of f.

PROPOSITION 2. Let f ∈ C². Then f is convex if and only if the Hessian matrix is positive semidefinite for all x.

These two propositions are proved in Luenberger (1984). We will prove that the Kullback–Leibler divergence in the HOBM has a positive definite Hessian matrix, establishing its convexity.

THEOREM 1. Given two positive distributions P*(x) and P(x) on {0,1}^N, if P*(x) has the form (3), then the divergence (4) has a positive definite Hessian matrix for all w.


Proof. From (5),

\sum_{\lambda, \mu \in L} \frac{\partial^2 D}{\partial w_\lambda \, \partial w_\mu} \, v_\lambda v_\mu
= \sum_{\lambda, \mu \in L} \left( p^*_{\lambda \cup \mu} - p^*_\lambda p^*_\mu \right) v_\lambda v_\mu

for all v, where p*_λ = E[a(λ|x)], p*_μ = E[a(μ|x)] and p*_{λ∪μ} = E[a(λ∪μ|x)] = E[a(λ|x) a(μ|x)], the means being taken for P*(x). Considering the covariance

\mathrm{Cov}(a(\lambda \mid x), a(\mu \mid x)) = E[(a(\lambda \mid x) - p^*_\lambda)(a(\mu \mid x) - p^*_\mu)]
= E[a(\lambda \mid x)\, a(\mu \mid x)] - E[a(\lambda \mid x)]\, E[a(\mu \mid x)]
= p^*_{\lambda \cup \mu} - p^*_\lambda p^*_\mu ,

we have

\sum_{\lambda, \mu \in L} (p^*_{\lambda \cup \mu} - p^*_\lambda p^*_\mu) v_\lambda v_\mu
= E\left[ \sum_{\lambda, \mu \in L} (a(\lambda \mid x) - p^*_\lambda) v_\lambda \, (a(\mu \mid x) - p^*_\mu) v_\mu \right]
= E\left[ \left( \sum_{\lambda \in L} (a(\lambda \mid x) - p^*_\lambda) v_\lambda \right)^{2} \right] \ge 0

for all v. Then the Hessian matrix is positive semidefinite.

Let us prove that it is positive definite. The last expression is null for the values v that satisfy the equation system

\left\{ \sum_{\lambda \in L} (a(\lambda \mid x) - p^*_\lambda) \, v_\lambda = 0 \;/\; x \in \{0,1\}^N \right\}.

In the equation corresponding to x = (0, ..., 0), a(λ|x) = 0 for all connections, therefore Σ_{λ∈L} -p*_λ v_λ = 0. So we have the equation system

\left\{ \sum_{\lambda \in L} a(\lambda \mid x) \, v_\lambda = 0 \;/\; x \ne 0 \right\}.

Let λ = {i} ∈ L and x the configuration with x_i = 1 and x_j = 0 for j ≠ i. In the equation corresponding to this configuration a(λ|x) = 1 for λ = {i} and a(λ|x) = 0 for the rest. Hence v_λ = 0 for λ = {i}. Let μ = {i, j} ∈ L and x the configuration with x_i = x_j = 1 and x_k = 0 for k ≠ i, j. In the corresponding equation v_λ = 0 for the one-order connections of L and therefore v_μ = 0. Taking successively connections of higher order we obtain finally that v = 0. □
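A quick numerical check of Theorem 1 (our own sketch, reusing the toy model and the stationary distribution P* computed in the earlier fragments): the Hessian (5) is the covariance matrix of the activation indicators a(λ|x) under P*(x), and its eigenvalues come out strictly positive.

```python
import numpy as np

# Hessian entries (5): p*_{lambda u mu} - p*_lambda p*_mu, i.e. the covariance
# of the activation indicators a(lambda|x) under P*(x).
connections = list(weights.keys())
p = np.array([p_star[x] for x in states])                     # P*(x) computed above
A = np.array([[activation(lam, x) for x in states] for lam in connections])
mean = A @ p                                                  # p*_lambda
H = (A * p) @ A.T - np.outer(mean, mean)                      # covariance matrix
print(np.linalg.eigvalsh(H))                                  # strictly positive, per Theorem 1
```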

COROLLARY 1. The divergence (4), a convex function of the weights w, has at most one extreme point. If this extreme point exists, it is a strict global minimum.

Proof. According to Theorem 1 and Propositions 1 and 2 the divergence is convex and every extreme point is a strict relative minimum point. If such an extreme point exists, convexity implies that it is a strict global minimum, which is unique. □

We have thus proved the convexity of the divergence in the HOBM. If the divergence has an extreme point, it will be the global minimum. This property does not hold in the conventional two-order BM with hidden units, and in general it would not hold if hidden units were introduced in the HOBM. When hidden units are introduced, the distribution P*(x) in the divergence (4) is the marginal distribution on the visible units of the Boltzmann–Gibbs distribution of all units, and the second-order derivatives of the divergence are not given by (5). Consequently the proof of Theorem 1 is not valid and convexity is not guaranteed. In fact, hidden units introduce local minima in the divergence of Boltzmann machines, particularly in the two-order BM. Obviously, convexity is highly desirable for reliable learning.

4. CONVERGENCE OF THE LEARNING ALGORITHM

We will prove the convergence of the learning algorithm of the HOBM to the global minimum of the divergence for any α < 2/|L|². Our proof will rest on the following theorem, which comes from the global convergence theorem (and its corollary) in Luenberger (1984), applied to the learning algorithm of the HOBM.

THEOREM 2. Let {w^k} be a sequence of weights generated by the learning algorithm (7), and suppose (i) the set of weights {w^k} is bounded, (ii) if w is not an extreme point of the divergence (4) then D(w') < D(w) for w' given from w by one step of the learning algorithm. Then the limit of any convergent subsequence of {w^k} is an extreme point. If there is a unique extreme point w_0, then the sequence {w^k} converges to w_0.
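To give a sense of scale for this bound (a numerical example of ours, not from the paper): for a HOBM with N = 10 units and all connections of order one and two,

|L| = \binom{10}{1} + \binom{10}{2} = 55, \qquad
\alpha < \frac{2}{|L|^2} = \frac{2}{3025} \approx 6.6 \times 10^{-4}.

In practice larger values of α are often used; Section 5 discusses such modifications.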

First we prove that for a small α, at each step of the learning algorithm the divergence (4) decreases (if an extreme point of the divergence is reached, the weights and the divergence do not change).

PROPOSITION 3. If 0 < α < 2/|L|², where |L| is the number of connections in L, and w is not an extreme point of the divergence, then D(w') < D(w) for w' = w - α∇D(w).

Proof. Since D ∈ C², from Taylor's theorem

D(w') = D(w) + \nabla D(w) \cdot \Delta w + \tfrac{1}{2} \, \Delta w^{T} H \, \Delta w,

where H, the Hessian of the divergence, is evaluated at a point of the line segment [w, w']. Besides, Δw = w' - w = -α∇D(w) and

\tfrac{1}{2} \, \Delta w^{T} H \, \Delta w
= \tfrac{1}{2} \sum_{\lambda \in L} \sum_{\mu \in L} \frac{\partial^2 D}{\partial w_\lambda \, \partial w_\mu} \, \Delta w_\lambda \, \Delta w_\mu .

The second-order derivatives of D are bounded, |∂²D/∂w_λ∂w_μ| ≤ 1, thus

\tfrac{1}{2} \, \Delta w^{T} H \, \Delta w \le \tfrac{1}{2} |L|^2 \|\Delta w\|^2 = \tfrac{1}{2} |L|^2 \alpha^2 \|\nabla D(w)\|^2 .

Since ∇D(w)·Δw = -α‖∇D(w)‖², we have finally

D(w) - D(w') \ge \alpha \|\nabla D(w)\|^2 - \tfrac{1}{2} |L|^2 \alpha^2 \|\nabla D(w)\|^2
= \alpha \left( 1 - \alpha |L|^2 / 2 \right) \|\nabla D(w)\|^2 .    (8)

As 0 < α|L|²/2 < 1, we get D(w') < D(w). □

Now we will prove that the sequence of weights is bounded in the learning algorithm.

PROPOSITION 4. If 0 < α < 2/|L|², the set of weights {w^k} generated by the learning algorithm (7) is bounded.

Proof. Suppose {w^k} is not bounded. There will exist at least a connection λ ∈ L such that {w^k_λ} is not bounded. Let λ ∈ L be a connection such that {w^k_λ} is not bounded and, if its order is greater than one, {w^k_{λ'}} is bounded for all λ' ∈ L such that λ' ⊂ λ. Then we can define a subsequence {w^j_λ}_{j∈J} such that |w^j_λ| → ∞, and therefore a subsequence {w^j_λ}_{j∈J'}, J' ⊆ J, such that w^j_λ → +∞ or w^j_λ → -∞.

Let x_λ be the configuration with x_l = 1 for all l ∈ λ and x_l = 0 if l ∉ λ. If w^j_λ → +∞, since P*(0) = Z^{-1}, we have

\lim_{j \to \infty} \frac{P^*(x_\lambda)}{P^*(0)} = \lim_{j \to \infty} \exp C(x_\lambda) = \infty .

As P*(x_λ) ≤ 1, it follows

\lim_{j \to \infty} P^*(0) = 0 .

As in the divergence D = Σ_x P(x) ln(P(x)/P*(x)) every term has a lower bound, P(x) ln(P(x)/P*(x)) ≥ P(x) ln P(x), we get

\lim_{j \to \infty} D(w^j) \ge \lim_{j \to \infty} P(0) \ln \frac{P(0)}{P^*(0)} + \sum_{x \ne 0} P(x) \ln P(x) = \infty ,

which contradicts the fact that, according to Proposition 3, D(w^k) is decreasing. (If an extreme point were reached {w^k} would be bounded.) If w^j_λ → -∞, we have lim_{j→∞} P*(x_λ)/P*(0) = lim_{j→∞} exp C(x_λ) = 0, hence lim_{j→∞} P*(x_λ) = 0, and we get the same result. So {w^k} is bounded. □

Therefore we can establish the convergence of the learning algorithm in the HOBM for any α smaller than a certain value determined by its structure.

COROLLARY 2. The divergence (4) has one extreme point, the strict global minimum. If 0 < α < 2/|L|², any sequence of weights {w^k} generated by the learning algorithm (7) converges to the global minimum point of the divergence. The sequence {∇D(w^k)} converges to 0, that is, p*_λ converges to p_λ for every connection λ ∈ L.

Proof. According to Theorem 2 and Propositions 3 and 4 the limit of any convergent subsequence of {w^k} is an extreme point. (Such a convergent subsequence exists by the Weierstrass theorem.) By Corollary 1 this extreme point is unique, the strict global minimum w_0. By Theorem 2 the sequence {w^k} converges to w_0. As ∇D(w) is a continuous function of the weights, it converges to 0. □

Thus we have proved the convergence of the iterative learning algorithm of the HOBM, for any α < 2/|L|², to the (strict) global minimum of the Kullback–Leibler divergence (4), which corresponds to the (strict global) maximum likelihood estimate of the parameters of the model, i.e., the connection weights of the HOBM. We note that the maximum of the divergence decrease guaranteed by (8) corresponds to α = 1/|L|².

Finally we indicate that if the model is saturated, i.e., L contains every nonempty subset of [1, N], the sequence {w^k} converges to a limit point w_0 such that P*(x) = P(x). Effectively, any distribution P(x) on {0,1}^N can be written as the normalized exponential of a consensus function like (1), see, e.g., Lauritzen (1989); therefore if we take the saturated L in the HOBM, the global minimum of the divergence (4) corresponds to the weights for which P*(x) = P(x).
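As an illustration of this last remark (our own sketch, not from the paper, reusing the helper code above), for a saturated model the weights with P*(x) = P(x) can be obtained in closed form by Möbius inversion of ln P over the subset lattice, as in log-linear model theory:

```python
def saturated_weights(p_target, n_units):
    """Weights of a saturated HOBM reproducing a positive distribution exactly:
    w_lambda = sum over subsets lambda' of lambda of
               (-1)^(|lambda| - |lambda'|) ln P(x_lambda'),
    where x_lambda' has ones exactly on lambda' (Moebius inversion of ln P)."""
    def config(subset):
        return tuple(1 if i in subset else 0 for i in range(n_units))
    w = {}
    for size in range(1, n_units + 1):
        for lam in itertools.combinations(range(n_units), size):
            lam = frozenset(lam)
            w[lam] = sum(
                (-1) ** (len(lam) - k) * math.log(p_target[config(sub)])
                for k in range(len(lam) + 1)
                for sub in itertools.combinations(sorted(lam), k)
            )
    return w

# Verify: with these weights the stationary distribution equals the target.
p_positive = {x: float(i + 1) for i, x in enumerate(states)}
total = sum(p_positive.values())
p_positive = {x: v / total for x, v in p_positive.items()}   # arbitrary positive P
w_sat = saturated_weights(p_positive, N)
Z_sat = sum(math.exp(consensus(x, w_sat)) for x in states)
p_model = {x: math.exp(consensus(x, w_sat)) / Z_sat for x in states}
assert all(abs(p_model[x] - p_positive[x]) < 1e-9 for x in states)
```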


We end this section considering the convergence properties of the BM when hidden units are introduced, particularly in the conventional two-order BM. The learning rule is the same, since (5) holds for the first-order derivatives of the divergence. Besides, it can be shown that the second-order derivatives of the divergence are bounded, |∂²D/∂w_λ∂w_μ| ≤ 2, therefore Proposition 3 holds for 0 < α < 1/|L|². However, the proof of Proposition 4 is not valid for a BM with hidden units, since the distribution P*(x) in the divergence is now the marginal distribution on the visible units of the Boltzmann–Gibbs distribution. So we cannot apply Theorem 2.

Nevertheless we can get some convergence result. In the proof of Proposition 3 we have, instead of (8),

D(w) - D(w') \ge \alpha \left( 1 - \alpha |L|^2 \right) \|\nabla D(w)\|^2 .    (9)

Let {w^k} be the weights generated by the learning algorithm of a BM with hidden units, with α < 1/|L|². According to (9) the divergence is decreasing, but the divergence is nonnegative (see Section 2), therefore the difference D(w^k) - D(w^{k+1}) converges to 0, then ∇D(w^k) converges to 0, that is, the activation probability p*_λ converges to p_λ for every connection λ ∈ L. We note that the convergence of the activation probabilities does not imply the convergence of the weights, i.e., the strict convergence of the learning algorithm.

Consequently, when hidden units are introduced the activation probabilities of the weighted connections converge to the desired values, but the convergence of the weights is not guaranteed. Moreover the divergence is not convex, so the weights can converge to some local minimum. The distribution provided by the learning algorithm is not determined; it depends on the initial weights, the values of the parameter α, etc., and likewise the minimization of the divergence depends on these factors. The existence of a global minimum of the divergence cannot be established. Finally, Theorem 3 (below) will not be valid for a BM with hidden units, since the proof is based on the convexity of the divergence.

5. LEARNING ALGORITHM IN PRACTICE

In practice the value of α and even the learning algorithm are modified in several ways to accelerate convergence. We can start with a value of the parameter α greater than the values for which convergence is guaranteed by the preceding results, decreasing gradually the value of α through the learning algorithm. On the other hand, the learning rule (7) is usually modified (Aarts & Korst, 1989), so w does not move in the direction of ∇D but according to the rule

w^{k+1}_\lambda = w^k_\lambda - \alpha \, \mathrm{sig}\left( \frac{\partial D}{\partial w_\lambda}(w^k) \right)
= w^k_\lambda - \alpha \, \mathrm{sig}(p^*_\lambda - p_\lambda),

where

\mathrm{sig}(x) = \begin{cases} 1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \\ 0 & \text{if } x = 0 \end{cases}

(a minimal code sketch of this modified rule is given below).
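A minimal sketch of this sign-based variant (ours, reusing the helper functions of the earlier sketches; the schedule for α is an arbitrary choice):

```python
def sign_learning_step(weights, p_clamped, alpha):
    """Modified rule: move each weight by a fixed amount alpha against the sign
    of the gradient component p*_lambda - p_lambda."""
    p_free = free_probabilities(weights)
    def sig(v):
        return (v > 0) - (v < 0)          # 1, -1 or 0
    return {lam: w - alpha * sig(p_free[lam] - p_clamped[lam])
            for lam, w in weights.items()}

# Usage: alpha may be decreased gradually during learning, e.g. geometrically.
alpha = 0.1
for k in range(100):
    weights = sign_learning_step(weights, p_clamped, alpha)
    alpha *= 0.97
```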
Likewise, the value of the parameter α can vary through the learning. This type of rule is used in order to avoid oscillations for great values of some components of ∇D, and at the same time to accelerate convergence for small values of other components of ∇D. All these aspects are determined empirically.

We will prove that whenever we have a sequence of weights such that the sequence {∇D(w^k)} converges to 0, that is, p*_λ converges to p_λ for every connection λ ∈ L, the sequence of weights converges to the global minimum of the divergence, independently of the learning rule or the parameter values. This result is interesting because the activation probabilities p*_λ are computed at each step of the learning algorithm and we can observe their convergence.

THEOREM 3. Given a sequence of weights {w^k} generated by some learning algorithm and which verifies that {∇D(w^k)} converges to 0, then {w^k} converges to the strict global minimum of the divergence.

Proof. We present here an abbreviated proof, see Albizuri (1995) for the detailed proof. It can be interpreted geometrically by imagining that the points w lie on the x–y plane and the z axis is for the divergence.

From Corollary 2 the divergence has a strict global minimum w_0. Let S_r be a spherical surface of radius r > 0 centered on w_0. Since D(w) is continuous it has a global minimum point on S_r, where the divergence value is D_r > D(w_0). Let w_1 be a point outside S_r, ‖w_1 - w_0‖ > r. The convexity of the divergence implies that

D(w_1) - D(w_0) \ge \frac{\|w_1 - w_0\|}{r} \left( D_r - D(w_0) \right)

and

\nabla D(w_1) \cdot (w_0 - w_1) \le D(w_0) - D(w_1).

Since D(w_1) > D(w_0),

\|\nabla D(w_1)\| \cdot \|w_0 - w_1\| \ge D(w_1) - D(w_0),

and from the first inequality

\|\nabla D(w_1)\| \ge \frac{D_r - D(w_0)}{r}    (10)

for every w_1 such that ‖w_0 - w_1‖ > r. Since {∇D(w^k)} converges to 0, given r > 0 there will exist a number M_r such that for k ≥ M_r we have ‖∇D(w^k)‖ < (D_r - D(w_0))/r, that is, ‖w_0 - w^k‖ ≤ r from (10). Therefore {w^k} converges to w_0. □

6. CONCLUSIONS

In this paper we have established basic results on the mathematical foundations of the high-order Boltzmann machine, which do not hold in the conventional Boltzmann machine. Given the structure of the HOBM, the set of weighted connections, we have proved the convexity of the Kullback–Leibler divergence between the probability distribution to learn and the approximation distribution, and we have proved the convergence of the iterative learning algorithm to the strict global minimum of the divergence, which corresponds to the maximum likelihood estimate of the parameters of the model, that is, the weights of the high-order connections. So we have proved the uniqueness of the weights to which the learning algorithm converges, i.e., the uniqueness of the learned distribution.

We have established the convergence of the learning algorithm for any α < 2/|L|² in the rule w^{k+1} = w^k - α∇D. (With α = 1/|L|² we obtain the maximum of the guaranteed decrease of the divergence.) Moreover, we have proved that when the learning rule is modified, the convergence to the global minimum of the divergence holds whenever the learned activation probabilities converge to the desired ones. Combining both results, modified learning rules can be considered to accelerate convergence.

Although the HOBM (without hidden units) and the usual two-order BM with hidden units are two ways to improve the limited learning capacity of the two-order BM without hidden units, the convexity of the divergence in the HOBM and the convergence of the learning algorithm to the maximum likelihood estimate of the weights give a mathematical basis to the HOBM that the conventional BM with hidden units does not have.

REFERENCES

Aarts, E. H. L., & Korst, J. H. M. (1989). Simulated annealing and Boltzmann machines: a stochastic approach to combinatorial optimization and neural computing. New York: John Wiley.

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.

Albizuri, F. X. (1995). Máquina de Boltzmann de alto orden: una red neuronal con técnicas de Monte Carlo para modelado de distribuciones de probabilidad. Caracterización y estructura. PhD thesis, Department of Computer Science and AI, University of the Basque Country, Donostia, Spain.

Albizuri, F. X., d'Anjou, A., Graña, M., Torrealdea, F. J., & Hernandez, M. C. (1995). The high-order Boltzmann machine: learned distribution and topology. IEEE Transactions on Neural Networks, 6(3), 767–770.

Fort, J. C., & Pagès, G. (1993). Réseaux de neurones: des méthodes connexionistes d'apprentissage. Technical report no. 24, Université Paris 1, SAMOS, Prépublication du SAMOS.

Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds), Parallel distributed processing: explorations in the microstructure of cognition (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press.

Kullback, S. (1959). Information theory and statistics. New York: John Wiley.

Lauritzen, S. L. (1989). Lectures on contingency tables. Technical Report R 89-24, The University of Aalborg, Institute for Electronic Systems, Department of Mathematics and Computer Science, Aalborg, Denmark.

Luenberger, D. G. (1984). Linear and nonlinear programming (2nd edn). Reading, MA: Addison-Wesley.

Pinkas, G. (1990). Energy minimization and the satisfiability of propositional logic. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, & G. E. Hinton (Eds), Proceedings of the 1990 Connectionist Summer School. San Mateo, CA.

Sejnowski, T. J. (1986). Higher-order Boltzmann machines. In J. S. Denker (Ed.), Neural Networks for Computing (AIP Conference Proceedings 151, pp. 398–403). Snowbird, UT.

Whittaker, J. (1990). Graphical models in applied multivariate statistics. New York: John Wiley.