Parallel Computing 18 (1992) 57-73 North-Holland
A dedicated massively parallel architecture for the Boltzmann machine
A. De Gloria, P. Faraboschi and S. Ridella
Department of Biophysical and Electronic Engineering, University of Genova, Via All'Opera Pia 11A, 16145 Genova, Italy
Received May 1991
Abstract
De Gloria, A., P. Faraboschi and S. Ridella, A dedicated massively parallel architecture for the Boltzmann machine, Parallel Computing 18 (1992) 57-73.
A key task for neural network research is the development of neurocomputers able to speed up the learning algorithms, allowing their application and testing in real cases. This paper presents a massively parallel architecture specifically designed to support the Boltzmann machine neural network. The strength of this architecture is its simplicity and reliability together with a low implementation cost. Despite the impressive speedup obtained by accelerating the standard BM algorithm, the architecture does not use particular techniques to expose parallelism in the simulated annealing task, such as the change of state of multiple neurons. Features of the architecture include: (1) speed: the architecture allows a speedup of N (N is the number of neurons constituting the BM) with respect to a standard implementation on sequential machines; (2) low cost: the architecture requires the same amount of memory as a sequential implementation, the only additional cost being due to the inclusion of an adder for each neuron; (3) WSI capabilities: the processor interconnection is limited to a single bus for any number of implemented processors, the architecture is scalable in terms of number of processors without any software or hardware modification, and the simplicity of the processors makes it possible to implement built-in self-test techniques; (4) high weight dynamics: the architecture performs computation using 32-bit integer values, therefore offering a wide range of variability of the weights.
Keywords. Neural networks; Boltzmann machine; massively parallel architectures; VLSI design.
1. Introduction
The advance of VLSI technology, along with WSI techniques, makes it possible to design massively parallel architectures; consequently, new computational models suitable for massive parallel computation become very attractive. Cost and performance figures lead to consider dedicated architectures, since the specialization of the hardware reduces design and implementation cost while also yielding better performance in exploiting the implemented computational model. The Boltzmann machine (BM) is an interesting neural network, since it can be applied to a wide range of problems such as combinatorial optimization, knowledge representation and learning. It is provided with rigorous mathematical foundations, with convergence properties that can be analyzed using techniques derived from physics [8,9,4]. The BM model also shows some interesting features from a hardware implementation point of view, as it requires integer weights and, owing to the digital nature of its neurons, no multiplication is required to compute synapse values.
This paper, after a brief presentation of the BM model and of some hardware implementations, shows how to modify the BM algorithm to exploit massive parallelism. The architecture is then presented, together with the structure of its main components.
2. The Boltzmann machine
2.1. The formal model

For the reader's convenience we present a summary of the formal model of the Boltzmann machine. We follow [1] with minor modifications. A Boltzmann machine consists of a number of logical neurons, N, that can be represented by an undirected graph G = (V, E), where V = {v_0, ..., v_{N-1}} denotes the set of vertices corresponding to the logical neurons, and E ⊆ V × V the set of edges corresponding to the connections between the units. An edge (v_i, v_j) ∈ E connects the vertices v_i and v_j. The set of edges includes all loops, i.e. {(v_i, v_i) | v_i ∈ V} ⊆ E. With each vertex v_i a number is associated denoting the state of the corresponding ith logical neuron, i.e. 0 or 1 corresponding to 'off' or 'on', respectively. A configuration k of the Boltzmann machine is uniquely defined by the states of all individual vertices. The state of vertex v_i in configuration k is denoted by s_i^(k). An edge (v_i, v_j) is defined to be activated in a given configuration k if s_i^(k) s_j^(k) = 1. With each edge (v_i, v_j) a connection strength, called weight w_ij, is associated, determining the strength of the connection between the vertices v_i and v_j. A consensus function C_k of a configuration k measures the desirability of all the activated edges in the configuration and can be defined as
C_k = \sum_{(v_i, v_j) \in E} w_{ij} s_i^{(k)} s_j^{(k)}.    (1)
From a given configuration k, a neighboring configuration k_i can be obtained by changing the state of the vertex v_i, so that:
s_j^{(k_i)} = \begin{cases} s_j^{(k)}, & j \neq i \\ 1 - s_j^{(k)}, & j = i. \end{cases}    (2)
The corresponding difference in consensus, \Delta C_{k k_i} = C_{k_i} - C_k, is given by:

\Delta C_{k k_i} = (1 - 2 s_i^{(k)}) h_i,    (3)

where

h_i = \sum_{j \neq i} w_{ij} s_j^{(k)} + w_{ii}    (4)

is the local field of neuron i and w_ii is the bias of neuron i. The difference of consensus is completely determined by the states of the neurons j connected to i and by their corresponding weights, so that it can be computed locally, thus allowing a parallel execution. The probability of accepting a transition (k, k_i) with cost \Delta C_{k k_i} is given by:
B_{k k_i} = \frac{1}{1 + e^{-\Delta C_{k k_i}/c}},    (5)
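As an illustration of Eqs. (2)-(5), the following C sketch performs one trial on a single neuron. The arrays w and s, the helper uniform() and the network size are illustrative assumptions and do not correspond to the authors' implementation.

#include <math.h>
#include <stdlib.h>

#define N 64                      /* illustrative network size            */

int    s[N];                      /* neuron states, 0 or 1                */
double w[N][N];                   /* connection weights, w[i][i] = bias   */

/* Uniform random number in [0,1): stand-in for a random number source.  */
static double uniform(void) { return rand() / (RAND_MAX + 1.0); }

/* One trial on neuron i at cooling parameter c: compute the local field
   h_i (Eq. (4)), the consensus difference for flipping i (Eq. (3)), and
   accept the transition with probability B = 1/(1 + exp(-dC/c)) (Eq. (5)). */
void trial(int i, double c)
{
    double h = w[i][i];                          /* bias term w_ii        */
    for (int j = 0; j < N; j++)
        if (j != i) h += w[i][j] * s[j];

    double dC = (1.0 - 2.0 * s[i]) * h;          /* Eq. (3)               */
    double B  = 1.0 / (1.0 + exp(-dC / c));      /* Eq. (5)               */

    if (uniform() < B)
        s[i] = 1 - s[i];                         /* Eq. (2): flip neuron i */
}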
c = TMAX;
do {
    for (i = 0; i < N; i++) {
        h_i = sum over j of (w_ij and s_j) + w_ii;
        r = random number;
        if (1 / (1 + e^(h_i / c)) < r)
            s_i = -1;
        else
            s_i = 0;
    }
    c = beta * c;
} while (the percentage of acceptance is not lower than a given constant);

Fig. 1. The C-like code of the algorithm of the Boltzmann machine.
Here c is a cooling parameter, real and positive. c_0 is initialized on the basis of the sum Σ_ij |w_ij| [3], and c_{j+1} = βc_j, with β < 1 and β close to 1. The decrement rule for calculating the next value c_{j+1} is applied each time the unit has completed K trials. The annealing process begins with c = c_0 and several consecutive trials of transitions (k, k_i) are randomly tossed according to the probability B_{k k_i}. As c approaches 0 the accepted transitions become less and less frequent and finally the Boltzmann machine stabilizes in a steady configuration. The whole process can be thought of as a sequence of Markov chains, and it can be demonstrated that for long enough chains the system converges to a unique equilibrium state [4]. The process stops when the number of transitions is less than a given quantity.
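A minimal C sketch of the cooling schedule described above; the use of the sum of the absolute weights for c_0 follows [3], while the exact scale factor, beta and K are left as illustrative parameters.

#include <math.h>

/* c_0 is chosen on the order of sum_ij |w_ij|; the decrement c <- beta*c
   (beta < 1, close to 1) is applied after every K trials.                */
double initial_c(const double *w, int n_weights)
{
    double sum = 0.0;
    for (int k = 0; k < n_weights; k++)
        sum += fabs(w[k]);
    return sum;
}

double next_c(double c, double beta)
{
    return beta * c;
}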
2.2. Performance of parallel implementation

A C-like code of the algorithm of the BM is shown in Fig. 1. The code refers to an implementation where the weights are represented with integer numbers; moreover, in order to speed up the computation of the local field, the product w_ij s_j has been replaced by an and operation, and the active state of a neuron is coded as the 2's complement value -1. As we can notice from the code, the computation time of the algorithm is given by the time for completing an iteration of the main loop (T_L) multiplied by the number of trials M and by the number of cooling schedules (C), resulting in the following quantity:

T = C M T_L.    (6)
The time for one iteration of the main loop is composed of the time for the computation of the local field (T_h) and of the time for miscellaneous operations (including the sum of the neuron bias) (T_misc) which, for the purpose of comparing different architectures, we can suppose to be much lower than T_h. We can express M as a function of the number of neurons of the network (i.e. M = KN). Thus we have:

T = \gamma N (T_h + T_{misc}), \qquad \gamma = C K.    (7)
Loop: 1: load s[j]         ; l1: load s_j
      2: load w[i][j]      ; l2: load w_ij
      3: inc i             ; i1: increment pointer to neurons
      4: inc j             ; i2: increment pointer to weights
      5: cmp i, N          ; c:  is it the last neuron?
      6: ble Loop          ; b:  no, it is not
      7: and w_ij, s_j, k  ; a1: compute connection strength
      8: add k, h, h       ; a2: update local field
      9: store h, h[i]     ; s:  store local field

Fig. 2. The loop for the computation of the local field in the basic machine. It is composed of 8 operations.
In order to have a notion of the time required by the algorithm to perform a run, and to have a metric with which to compare our approach, we define a basic machine with the following features:
- Each instruction is executed with a latency of one machine cycle, except memory accesses, which need three machine cycles and can be executed in pipeline.
- The branch instruction is delayed, with two latency cycles.
- The machine does not allow parallelism in the execution of the instructions, issuing one instruction per cycle.
In the following, the times will be indicated as the products of the number of machine cycles by the length t of a machine cycle. Every time will have a superscript that indicates the algorithm used for the BM, and a subscript that refers to the type of machine. The same notation will also be used for U, which represents the number of cycles needed for the miscellaneous operations and does not depend on the number of neurons (N). The assembler version of the program for the computation of the local field is shown in Fig. 2. The total time needed by the machine to execute the program is:
T_b^s = \gamma N (8N + 3 + U_b) t.    (8)
This result comes from the assumption that each neuron is connected with all the others. To have a notion of how the connectivity of the network affects the total time, we introduce a connectivity parameter a which measures the average connectivity of the network. Since a neuron has at least one connection with another neuron and a maximum number of connections equal to N - 1, we define a as the average number of connections a neuron has, divided by N; therefore it lies in the range [1/N, (N-1)/N]. Thus we have:
T_b^s = \gamma N (8aN + 3 + U_b) t.    (9)
The computation can be sped up by using a parallel machine (mp), but the performance is limited by the memory channel. In fact, if we assume that only one memory access can be performed in a cycle, the maximum performance achievable corresponds to a computation time (Fig. 3):
T_{mp}^s = \gamma N (2aN + 11 + U_{mp}) t    (10)
with a machine that can perform 4 instructions in a cycle and no latency. The memory operation latency does not heavily affect the performance; in fact, with a machine (mpl) with a latency of three cycles for memory accesses we obtain (Fig. 4):
T_{mpl}^s = \gamma \frac{N}{2} (4aN + 29 + U_{mpl}) t.    (11)
Fig. 3. The code for the computation of the local field in the parallel machine. The code has been generated by using the software-pipeline technique. The operation mnemonics are indicated in Fig. 2; the subscripts indicate the number of the iteration. The main loop is enclosed in a box and is composed of two cycles. M_ki refers to a load operation and the increment of the address.

A multi-port memory architecture (mmp1) can help to obtain a considerable speedup. By assuming a synchronous linear architecture with a memory with N independent ports and N processors, where the processors are connected by a global bus that can be subdivided into N sub-buses (Fig. 5), the 2N memory loads and the N logical and operations can be concurrently executed by the processors, the sum can be achieved in log_2(N) cycles, while the time needed to store the result is unchanged, resulting in a total time:

T_{mmp1}^s = \gamma N (4 + \lceil \log_2(aN) \rceil + U_{mmp1}) t.    (12)
From (10) and (12) we see that the order of convergence of the algorithm is O(N^2) and O(N log_2 N) respectively. A linear convergence may be obtained by changing the algorithm in order to overcome the sequential nature of the local field computation.
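To make the comparison concrete, the following sketch evaluates the per-trial cycle counts of Eqs. (9), (10) and (12) for a fully connected network (a = 1), ignoring the machine-dependent U terms; the common factor \gamma N t is omitted and the values of N are arbitrary.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0;                               /* full connectivity   */
    for (int n = 100; n <= 100000; n *= 10) {
        double basic = 8.0 * a * n + 3.0;         /* Eq. (9)             */
        double mp    = 2.0 * a * n + 11.0;        /* Eq. (10)            */
        double mmp1  = 4.0 + ceil(log2(a * n));   /* Eq. (12)            */
        printf("N=%6d  basic=%9.0f  mp=%9.0f  mmp1=%4.0f\n",
               n, basic, mp, mmp1);
    }
    return 0;
}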
3. Exposing parallelism in the Boltzmann machine

Two methods have been proposed to expose parallelism in the BM [2]. The former, the synchronous method, assumes that the network has a low connectivity (i.e. a << 1); therefore all the neurons not connected with each other can simultaneously change their state without affecting the computation of the consensus difference. The network can be partitioned into subsets, each of them composed of neurons not connected together; at each iteration a subset is chosen and its neurons are considered for state change, allowing the parallel computation of all the local fields. This method can be implemented in a parallel synchronous architecture. By assuming that a is equal for each neuron, we need 1/a processors, each containing a neuron of each subset. By taking into account the results shown in the previous section, we obtain three possible results, according to the three different architectures implementing a processor. With the basic processor we have:
T_b^p = \gamma N a (8aN + 3 + U_b^p) t,    (13)
while with the parallel processor the computation time becomes:
T_{mp}^p = \gamma N a (2aN + 11 + U_{mp}^p) t.    (14)
Finally, the massively parallel architecture reaches the following performance:

T_{mmp1}^p = \gamma N a (4 + \lceil \log_2(aN) \rceil + U_{mmp1}^p) t.    (15)
Fig. 4. The code for the computation of the local field for a machine with different operation latencies. The loop has been unrolled two times. The first subscript of the operation mnemonics refers to the unrolling number, the second one to the iteration number. The main loop is composed of four cycles.
In order to evaluate the architectural complexity, it is important to point out that this architecture has a two-level hierarchical organization, where each processor is in turn composed of aN processors plus a host, resulting in a total number of processors equivalent to N(a + 1). The latter method, the asynchronous one, is not based on any assumption about the connectivity of the network. However, it also assumes that some neurons can simultaneously change their state, no matter which other neurons they are connected to. This method can virtually reach any degree of parallelism and consequently any speedup.
Fig. 5. The mmp1 architecture. It is composed of N processors communicating on a global bus. The bus can be subdivided into N independent buses to allow the computation of the sum. The subdivision of the bus is performed by switches piloted by the host. Each processor is connected to a local memory where the connection weights are stored.
In fact, the estimation of the computation time shows the same results as the previous method, but with the important difference that a is not constrained by the network connectivity but only by the user. Unfortunately, no proof exists to support the validity of the asynchronous method: its good performance is confirmed only by simulation results [2]. Therefore we avoid this approach, preferring to retain the standard BM algorithm. Finally, we mention the VLSI approach suggested in [5]. It is based on analog devices and performs the stochastic decision-making by using noise. At present, this approach does not allow a high weight dynamics and therefore the range of its applications is limited. On the other hand, it reaches a considerable speedup: a BM composed of 2000 neurons may run about 10^6 times faster than a simulation on a VAX [1].
4. Massive parallelism in the Boltzmann machine

Even if a decrease in computation time can be obtained by looking at low-connectivity networks, an architecture for the BM has to be able to manage any connectivity degree while guaranteeing at the same time a considerable speedup. Furthermore, an analysis of representative problems solved with the BM reveals that the connectivity parameter a is close to 1. In [1] and [2] this analysis is reported; we limit ourselves to summarizing the results and show the a value for each problem (Table 1). As we have previously pointed out, the most time-consuming activity in the BM is the computation of the local field. In the algorithm shown in Fig. 1, at each iteration, the local field for the selected neuron is computed by examining the state of all the other neurons it is connected to. We propose to compute the local field in an incremental way, by updating the local field of each neuron when a change of state occurs in the selected neuron. Figure 6 shows the proposed algorithm. A new data structure (h[i]) has been introduced, containing the value of the local fields of all the neurons. Before beginning the cooling procedure, this data structure is initialized with the value of the bias of the neurons, which in turn are all set to the OFF state. In this way, at the first iteration, when a change of state occurs, the consensus difference is only due to the bias of the neuron. When a selected neuron changes its state, the local fields of all the other neurons are updated accordingly, by adding or subtracting the value of the strength of the connection to the considered neuron if it is switched to the ON or the OFF state respectively. The proposed algorithm brings no benefit on a sequential architecture; on the contrary, it causes a decrease in performance. On the other hand, it is very suitable for a massively parallel implementation (mmp2).
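The idea can be summarized by the following C sketch of the incremental update (the complete algorithm is given in Fig. 6); the array names and the exclusion of the selected neuron itself are illustrative choices, not the authors' code.

#define N 64                          /* illustrative network size        */

int x[N];                             /* neuron states                    */
int h[N];                             /* cached local fields (integer)    */
int w[N][N];                          /* integer connection weights       */

/* When the selected neuron r changes state, the cached local fields of
   all the other neurons are corrected by +/- w[r][j]; no local field is
   ever recomputed from scratch.                                          */
void on_state_change(int r, int new_state)
{
    x[r] = new_state;
    for (int j = 0; j < N; j++) {
        if (j == r) continue;
        if (new_state == 0) h[j] -= w[r][j];     /* neuron r switched off */
        else                h[j] += w[r][j];     /* neuron r switched on  */
    }
}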
Table 1
Number of neurons and number of connections needed to solve some optimization problems. The value of the parameter a is also reported. (Table partially derived from [3].)

Problem               Number of neurons    Number of connections             a
Max cut               O(|V|)               O(|V|+|E|)                        O(2(|V|+|E|)/|V|^2)
Independent set       O(|V|)               O(|V|+|E|)                        O(2(|V|+|E|)/|V|^2)
Vertex cover          O(|V|)               O(|V|+|E|)                        O(2(|V|+|E|)/|V|^2)
Clique                O(|V|)               O(|V|^2-|E|)                      O(2(|V|^2-|E|)/|V|^2)
Graph colouring       O((Δ+1)|V|)          O(Δ^2|V|+Δ|E|)                    O(2(Δ^2|V|+Δ|E|)/((Δ+1)|V|)^2)
Clique partitioning   O((Δ+1)|V|)          O(Δ^2|V|+Δ(|V|^2-|E|))            O(2(Δ^2|V|+Δ(|V|^2-|E|))/((Δ+1)|V|)^2)
Linear TSP            O(|V|)               O(|V|^2)                          1
Quadratic TSP         O(|V|)               O(|V|^2)                          1
Suppose we subdivide the memory into N independently accessible sub-banks, each of them associated with a neuron and containing the connection weights of that neuron with all the other ones. With this memory organization it is possible to compute in parallel the effects that a change of state of a neuron produces on the local fields of all the other neurons.
initialize the local field of all the neurons: h[i] = w[i][i];
initialize to zero all the neurons;
c = c0;
for (;;) {
    transition = 0;
    for (i = 0; i < M; i++) {
        r = randomly selected neuron;
        if (1 / (1 + exp(h[r] / c)) < random number)
            x_new = 1;
        else
            x_new = 0;
        if (x_new == x[r]) continue;
        transition = transition + 1;
        if (x_new - x[r] < 0) {
            x[r] = 0;
            for (j = 0; j < N; j++)
                if (j != r) h[j] = h[j] - w[r][j];
        } else {
            x[r] = 1;
            for (j = 0; j < N; j++)
                if (j != r) h[j] = h[j] + w[r][j];
        }
    }
    if (transition < delta) break;
    c = beta * c;
}

Fig. 6. The modified algorithm of the Boltzmann machine.
c = c0;
set all the neurons to the not active state;
set the local field of every neuron to the value of its bias;
do {
    for (i = 0; i < M; i++) {
        randomly select a neuron xi;
        if (1 / (1 + exp(hi / c)) < random number)
            si = 1;
        else
            si = 0;
        if (the state of neuron xi is changed) {
            broadcast the state of the neuron to all the others;
            each neuron xj updates its local field as:
                if (new state == 0)
                    hj = hj - wij;
                else
                    hj = hj + wij;
        }
    }
    c = beta * c;
} while (the percentage of acceptance is not lower than a given constant);

Fig. 7. The parallel version of the modified algorithm reported in Fig. 6.
In this way the computation of the local field can be done in parallel by all the neurons, once the change of state of a neuron is broadcast to the other ones. The modified algorithm is reported in Fig. 7. With the parallel updating of all the local fields we have the following computation time:
T_{mmp2}^{mp} = \gamma N (4 + U_{mmp2}^{mp} + Z) t,    (16)
where Z denotes the additional cycles introduced by the new algorithm. We stress that the parameter a has no effect on the performance, and therefore the network connectivity does not affect the time required to run the algorithm. On the other hand, as we have pointed out above, the majority of the problems require high-connectivity networks, which confirms the validity of the architecture for accelerating the Boltzmann machine. Supporting the modified algorithm requires a low amount of additional hardware with respect to a sequential machine; in fact the size of the memory remains the same. The architecture is composed of N processors, each one dedicated to a single neuron. A processor is composed of a memory bank that stores the strengths of the connections of the neuron to the others, an accumulator to store the local field and an adder to update the local field.
Figure 8 shows the performance of the different architectural organizations as a function of the number of neurons, supposing a equal to 1. As can be seen from Fig. 8, the two architectures mmp1 and mmp2 reach about the same performance for a large number of neurons. However, mmp2 shows some features that are more interesting from a hardware point of view. Even if the two architectures are composed of the same number of processors and need the same amount of memory, mmp1 requires a more sophisticated control than mmp2. This is due to the collection of the partial sums for the computation of the local field. In fact, while mmp2 can be organized on a SIMD basis, because all the processors execute the same instruction except the processor that stores the selected neuron, in mmp1 the tree algorithm for the computation of the sum requires additional hardware for the synchronization of the processors and for the transmission of the partial results. Therefore a SIMD organization cannot be successfully applied if we do not consider the sum operation as a whole, but this requires additional hardware.

Fig. 8(a). Graph showing the time performance of the four different architectures as a function of the number of neurons.

Fig. 8(b). Graph showing the time performance of the two massively parallel architectures, mmp1 and mmp2, as a function of the number of neurons.
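For comparison, a C sketch of the tree reduction that mmp1 must perform to collect the partial products over its subdividable bus; the pairing scheme and array layout are illustrative assumptions.

#define N 64                      /* illustrative number of mmp1 processors */

/* Pairwise tree reduction of the partial products held by the N
   processors: at each step, half of the remaining processors read a
   neighbour's partial sum over a bus segment and add it, so the sum
   takes about log2(N) bus steps plus the per-step synchronization and
   bus switching discussed above.                                         */
int tree_sum(int partial[N])
{
    for (int stride = 1; stride < N; stride *= 2)
        for (int i = 0; i + stride < N; i += 2 * stride)
            partial[i] += partial[i + stride];   /* one bus transfer       */
    return partial[0];
}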
for (l = 0; l < Number_of_Learning_Cycles; l++) {
    /* Clamped Phase */
    for (e = 0; e < Number_of_Examples; e++) {
        select_an_example(); force_inputs(); force_outputs();
        equilibrate();    // SEE FIG. 1
        for (s = 0; s < Number_of_Statistics; s++) {
            randomly select a neuron si;
            compute its local field (hi);
            determine the new state of the neuron according to the probability 1/(1 + e^(hi/c));
            for (j = 0; j < N; j++) {
                if (si sj == 1) pC_ij = pC_ij + 1; else pC_ij = pC_ij - 1;
            }
        }
    }
    /* Unclamped Phase */
    for (e = 0; e < Number_of_Examples; e++) {
        select_an_example(); force_inputs();
        equilibrate();    // SEE FIG. 1
        for (s = 0; s < Number_of_Statistics; s++) {
            randomly select a neuron si;
            compute its local field (hi);
            determine the new state of the neuron according to the probability 1/(1 + e^(hi/c));
            for (j = 0; j < N; j++) {
                if (si sj == 1) pU_ij = pU_ij + 1; else pU_ij = pU_ij - 1;
            }
        }
    }
    /* Weight Updating Phase */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            if (pC_ij - pU_ij < 0) wij = wij - delta; else wij = wij + delta;
        }
    }
    /* Verification Phase */
    for (nt = 0; nt < Number_of_Tests; nt++) {
        for (e = 0; e < Number_of_Examples; e++) {
            select_example(); force_input(); equilibrate();
            if (output neurons == desired outputs)
                number_of_successes = number_of_successes + 1;
        }
        if (number_of_successes / Number_of_Examples >= desired_percentage) break;
    }
}

Fig. 9. The C-like code of the learning algorithm of the Boltzmann machine.
for (l = 0; l < Number_of_Learning_Cycles; l++) {
    /* Clamped Phase */
    for (e = 0; e < Number_of_Examples; e++) {
        select_an_example();
        for (i = 0; i < Input_Neurons; i++) send value of input neuron i to processor i;
        for (j = 0; j < Output_Neurons; j++) send value of output neuron j to processor j;
        equilibrate();    // SEE FIG. 1
        for (s = 0; s < Number_of_Statistics; s++) {
            randomly select a neuron si;
            if (1/(1 + e^(hi/c)) < random number) si = 1; else si = 0;
            if (the state of neuron si is changed) {
                broadcast the state of the neuron to all the others;
                each neuron sj updates its local field as:
                    if (new state == 0) hj = hj - wij; else hj = hj + wij;
            }
            each neuron updates the statistics as:
                if (si sj == 1) p_ij = p_ij + 1; else p_ij = p_ij - 1;
        }
    }
    /* Unclamped Phase */
    for (e = 0; e < Number_of_Examples; e++) {
        select_an_example();
        for (i = 0; i < Input_Neurons; i++) send value of input neuron i to processor i;
        equilibrate();    // SEE FIG. 1
        for (s = 0; s < Number_of_Statistics; s++) {
            randomly select a neuron si;
            if (1/(1 + e^(hi/c)) < random number) si = 1; else si = 0;
            if (the state of neuron si is changed) {
                broadcast the state of the neuron to all the others;
                each neuron sj updates its local field as:
                    if (new state == 0) hj = hj - wij; else hj = hj + wij;
            }
            each neuron updates the statistics as:
                if (si sj == 1) p_ij = p_ij - 1; else p_ij = p_ij + 1;
        }
    }
    /* Weight Updating Phase */
    for (i = 0; i < N; i++)
        each processor j updates its weight wij as:
            if (p_ij < 0) wij = wij - delta; else wij = wij + delta;
    /* Verification Phase */
    for (nt = 0; nt < Number_of_Tests; nt++) {
        for (e = 0; e < Number_of_Examples; e++) {
            select_example(); force_input(); equilibrate();
            if (output neurons == desired outputs)
                number_of_successes = number_of_successes + 1;
        }
        if (number_of_successes / Number_of_Examples >= desired_percentage) break;
    }
}

Fig. 10. The parallel version of the learning algorithm of Fig. 9.
Moreover, the bus organization of mmp1 is more complex with respect to mmp2; in fact, in order to perform the sum, it needs a bus that can be subdivided into N sub-buses, resulting in N switches and N control bits.

4.1. Massive parallelism in learning with the Boltzmann machine

A learning algorithm for the Boltzmann machine has been proposed by Hinton et al. [7]; a short summary of that procedure is now reported. The BM set of neurons is divided into three subsets: input, hidden and output. Given a set of examples consisting of pairs of inputs and of the corresponding desired outputs, the learning procedure is composed of several learning cycles, each of them organized into four phases: (1) clamping, (2) unclamping, (3) weight updating, (4) verification. The goal of the procedure is to adjust the weights so as to obtain the desired state of the output neurons when the corresponding inputs are applied to the input neurons. The BM starts with a random configuration of weights. In the clamped phase the input is applied to the input neurons and the output to the output ones; then the annealing procedure is applied to equilibrate the network. Once the network is at the 'thermodynamic equilibrium', a statistic p^c is collected about the activation of the edges. Subsequently the network enters the unclamped phase, where only the inputs are applied; also at this time the network is equilibrated and a statistic p^u about the activation of the edges is collected. In the weight updating phase the weight of each edge, w_ij, is incremented or decremented according to the sign of the corresponding difference [p^c_ij - p^u_ij]. In the verification phase the network is tested by collecting the percentage of successes in obtaining the desired outputs from the corresponding inputs. Following [1]: "The objective of the learning algorithm can be formulated as follows: modify the connection weights of the Boltzmann machine such that p^u is close to p^c". Figure 9 shows a C-like code of the learning procedure. We can identify three main activities in Fig. 9:
- the annealing of the clamped, the unclamped and the verification phases,
- the collection of statistics about the activation of the connections, both in the clamped and in the unclamped phases,
- the updating of the weights.
The resulting total time is the following:

T_l = L(2\eta(T_{ann} + T_{stat}) + T_{up} + \eta T_{ann}),    (17)
where L is the number of learning cycles, \eta the number of training examples, T_ann is the time for a single annealing, T_stat the time for collecting the statistics and T_up the time for updating the weights.
In the following we show the detailed expressions for T_ann, T_stat and T_up when the learning algorithm is implemented on the mmp2 architecture. The mmp2 architecture is particularly convenient for learning: we have previously shown that it is very efficient when the connectivity parameter a is close to 1, and it has been shown [6] that a network with high connectivity shows better learning capabilities than one with low connectivity. The computation of the times involved in (17) follows from Fig. 10:

T_{ann} = T_{mmp2}^{mp},    (18)

T_{stat} = S(4 + F) t,    (19)

where S is the number of trials we consider for the statistics, which we can suppose to be S = \theta N, with \theta not depending on N. Moreover, F represents the number of cycles required by the host for randomly selecting a neuron, evaluating its change of state and broadcasting the resulting state value to the processors. Finally,

T_{up} = 3 N t,    (20)

since each neuron requires 3 cycles for updating one weight value. The resulting total time is the following:

T_{mmp2}^l = L(3\eta\gamma N(4 + U_{mmp2}^{mp} + Z) + 2\eta\theta N(4 + F) + 3N) t.    (21)
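As a check on (21), substituting (18)-(20) into (17) gives, with S = \theta N (a sketch of the algebra, not present in the original derivation):

T_l = L(2\eta(T_{ann} + T_{stat}) + T_{up} + \eta T_{ann})
    = L(3\eta T_{ann} + 2\eta T_{stat} + T_{up})
    = L(3\eta\gamma N(4 + U_{mmp2}^{mp} + Z) + 2\eta\theta N(4 + F) + 3N) t.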
The memory of each processor is organized into two areas of the same dimensions. In the former the connection weights are stored, while in the latter the statistics about edge activation are recorded. Also for learning we achieve a convergence of O(N), with respect to the O(N^2) and O(N log_2 N) of the best algorithms implemented on the other architectures.
5. The Boltzmann machine massively parallel architecture

The modified algorithm of the BM presents the proper features for an implementation with a dedicated massively parallel architecture. In fact the processors and the memory can be easily partitioned and replicated, and the interprocessor communication requires a constant amount of resources (buses) to exploit the algorithm parallelism as the number of processors increases. By associating a processor with a neuron, the architecture has a single instruction multiple data (SIMD) organization (Fig. 11). This organization leads to an incremental structure where the addition of an NPE does not require any modification in the software or hardware. Moreover, the architecture requires a single data bus and a single instruction bus shared by all the NPEs. In other words, the parallelism exploitation is not constrained by the communication buses. The main components of the architecture are the Master Control Unit (MCU), the Neural Processing Element (NPE) and the intercommunication bus.
5.1. The master control unit

Because all the NPEs execute the same instruction, their design can be simplified by eliminating the control unit, the decode logic and the control memory, resulting in a saving of silicon area which can be used for the data memory. The amount of data memory determines the dimension (in terms of neurons) of the largest neural network the architecture can support.
Fig. 11. The mmp2 architecture: (a) the NPE array, (b) the system configuration.
The MCU is a RISC-like processor, with dedicated instructions, that fetches and decodes instructions and issues control signals to the NPE array. The most important feature of the MCU instruction set is the RAN instruction, which computes a random number. It is used to select a neuron, whose local field has to be evaluated, and to decide the state of the selected neuron. A memory has been included to store the state of all the neurons of the network.
5.2. The neural processing element

The neural processing element (Fig. 12) has a simple architecture, whose main components are: the memory, the ALU, the accumulator and the processor identification register. The memory has a size of 10 k words of 32 bits. Because the memory size determines the size of the neural network, special care has been taken in its design in order to reach the maximum density with a standard CMOS technology. The resulting memory has a two-transistor-per-word dynamic structure. Although a DRAM requires refresh cycles, in this particular application it can be seen that no refresh is needed while the architecture runs, because the memory is completely read or written within a refresh period. The ALU performs standard arithmetic and logic instructions on 32-bit integer data; moreover, the instructions can be conditioned on the basis of a flag previously set by a compare instruction. Special logic has been added for the initialization phase: at startup each processor has to be tested for correct operation and has to be assigned a processor identification code that corresponds to the number of the neuron the processor models. To simplify the structure of the NPE as much as possible, its instruction format is horizontal: each unit of the NPE is represented by a field in the instruction. This operating mode may cause failures in the NPE if a wrong instruction is sent. In fact no control is performed on the correctness of the instruction, and the MCU is responsible for sending the proper instructions to the NPE.
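A simulation-level C sketch of what the NPE array does on one broadcast cycle, assuming each NPE holds the weights of its neuron in the local bank and the local field in the accumulator; the structure and function names are illustrative and do not describe the actual hardware interface.

#define N 64                          /* illustrative number of NPEs        */
#define BANK_WORDS 10240              /* 10 k words of 32 bits per NPE      */

typedef struct {
    int id;                           /* processor identification register  */
    int bank[BANK_WORDS];             /* local memory: weights, statistics  */
    int acc;                          /* accumulator holding the local field*/
} NPE;

NPE npe[N];

/* One broadcast cycle: the MCU puts the index r of the selected neuron
   and its new state on the data bus, and every NPE executes the same
   conditional add/subtract on its own accumulator.                        */
void broadcast_step(int r, int new_state)
{
    for (int j = 0; j < N; j++) {     /* conceptually executed in parallel  */
        if (j == r) continue;         /* the selected neuron's NPE is idle  */
        if (new_state == 0) npe[j].acc -= npe[j].bank[r];
        else                npe[j].acc += npe[j].bank[r];
    }
}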
Fig. 12. The micro-architecture of the NPE.
The NPE instructions are executed in one machine cycle. The NPE works in pipeline mode by overlapping the fetch and the execution phases. The data and the instruction buses are 32 bits wide.
6. Conclusions

We have shown the implementation of the Boltzmann machine algorithm on different architectures. Moreover, a modified version of the algorithm has been introduced, which allows a SIMD implementation with a performance of O(N), where N is the number of neurons. The architecture is simple and is based on a processing element that can be easily designed by using VLSI techniques. We have shown that by a combination of massively parallel architecture, design simplicity, cell replication, and CMOS VLSI, the implementation of neural network algorithms in hardware delivers very high performance at low cost.
References
[1] E.H.L. Aarts and J. Korst, Boltzmann machines and their applications, Lecture Notes in Comput. Sci. 258 (Springer, Berlin, 1987) 34.
[2] E.H.L. Aarts and J.H.M. Korst, Computations in massively parallel networks based on the Boltzmann machine: A review, Parallel Comput. 9 (1988/89) 129-145.
[3] E.H.L. Aarts and J.H.M. Korst, Combinatorial optimization on a Boltzmann machine, in: Proc. Europ. Seminar on Neural Computing, London, UK (Feb. 1988).
[4] D.H. Ackley, G.E. Hinton and T.J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Sci. 9 (1985) 147.
[5] J. Alspector and R.B. Allen, A neuromorphic VLSI learning system, in: Advanced Research in VLSI (MIT, Cambridge, MA, 1987).
[6] A. De Gloria and S. Ridella, The Boltzmann machine: Theory and analysis of the interconnection topology, in: Neural Networks: Concepts, Applications, and Implementations (Prentice Hall, Englewood Cliffs, NJ, 1990).
[7] G.E. Hinton and T.J. Sejnowski, Learning and relearning in Boltzmann machines, in: D.E. Rumelhart, J.L. McClelland and the PDP Research Group, eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Bradford Books, Cambridge, MA, 1986).
[8] J.J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci. USA 79 (1982) 2554-2558.
[9] T.J. Sejnowski, P.K. Kienker and G.E. Hinton, Learning symmetry groups with hidden units: Beyond the perceptron, Physica 22D (1986) 260-275.
[10] P.J.M. van Laarhoven and E.H.L. Aarts, Simulated Annealing: Theory and Applications (Kluwer, Dordrecht, 1987).