PARALLEL COMPUTING - ELSEVIER
Parallel Computing 22 (1996) 1345-1357
Mapping of neural network models onto two-dimensional processor arrays
Rabi N. Mahapatra a,*, Sudipta Mahapatra b
a Department of Computer Science, Texas A&M University, College Station, TX 77843, USA
b Department of Computer Science and Engineering, Regional Engineering College, Rourkela 769 008, India
Received 22 January 1996; revised 18 July 1996
Abstract

Proper implementation strategies should be followed to fully exploit the potential of artificial neural nets (ANNs). This in turn depends on the selection of an appropriate mapping scheme, especially while implementing neural nets in a parallel processing environment. In this paper, we discuss the mapping of ANNs onto two-dimensional processor arrays. It has been shown that by following a diagonal assignment of the neurons and by suitably distributing the weight values across the array, the computations involved in the operation of neural nets can be carried out with minimum communication overhead. The mapping has been illustrated on two popular neural net models: the Hopfield net and the multilayer perceptron (MLP) with back-propagation learning. One iteration of an n neuron Hopfield net takes 4(P − 1)·⌈n/P⌉ unit shifts on a P × P processor array. For the same target architecture, a single iteration of back-propagation learning on an MLP of n neurons takes a maximum of 8(P − 1)·⌈n/P⌉ shifts.

Keywords: Artificial neural networks; Mesh connected processors; Inter-processor communication; Mapping strategy; Performance analysis
1. Introduction

Contrary to conventional digital computers, artificial neural nets (ANNs) do not use algorithms to solve problems. Instead, they use training and learning. How fast a neural net can solve a particular problem depends on how fast it can be trained to perform the task. This in turn depends on the implementation strategy adopted.
* Corresponding author. Email: [email protected]. Presently on leave from the E&ECE Department, IIT Kharagpur, 721302, India.
0167-8191/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved. PII S0167-8191(96)00048-8
ANNs can be implemented in three different ways. The first two are software implementation on serial computers, which is flexible but rather slow, and special purpose hardware implementation, which is fast but totally inflexible. Speed is a very important factor in dealing with real world problems, and equally important is achieving sufficient flexibility to cope with the new models and algorithms that are being developed quite regularly. So the above two implementation strategies are not acceptable. A compromise between the two is the third way of implementing ANNs, on programmable parallel machines, which are flexible due to their programmability and also very fast as they involve parallel processing.

The power of an ANN stems from the parallel operation of a large number of simple processing elements, called neurons. The neurons are densely interconnected via weighted links. In each iteration, a neuron computes the weighted sum of the state values of all the neurons connected to its input and then evaluates its own output by applying an activation function to this sum. The output is then passed on to all or some of the other neurons. Although the operation of a single neuron is quite simple, the total computational load of a neural net may be quite large. While implementing neural nets on parallel computers, our primary goal is to distribute the computations of one complete network over two or more processors with a small interprocessor communication (IPC) overhead.

Recently, a lot of research has been directed towards devising efficient strategies for implementing neural nets on parallel computers [4-15]. In [4], Ghosh and Hwang have discussed the implementation of ANNs on message passing multicomputers. They have assumed that the neural nets can be partitioned into groups of neurons such that the connectivity within a group is much higher than the overall network connectivity. This is not true for many neural net models, such as the multilayer perceptron (MLP) network. Yoon et al. [5] have considered the implementation of the MLP network with back-propagation learning (henceforth referred to as the BP net) in a parallel processing environment, wherein the same weight is stored in two different processors and an expensive all-to-all broadcast is used as the principal mode of communication. In [8], Lin et al. have proposed a general mapping technique for implementing neural nets on parallel machines. They have discussed both partitioned and non-partitioned mappings of the BP net onto two-dimensional processor arrays. In the non-partitioned mapping, the data permutations involved in one recall iteration take 24(N − 1) unit shifts and those in a single iteration of backpropagation learning take 60(N − 1) shifts for mapping onto an N × N array, where N² ≥ n + e, n and e being the number of neurons and the number of weights, respectively. For the partitioned implementation on a P × P array (P² ≤ n + e), the optimal time for realizing the data permutations involved in a single recall or learning iteration is of the order O(N²/P).

Considering that in all realistic parallel processing environments the size of the available processor array is much smaller than the size of the neural net, we have proposed a partitioned mapping of n neuron ANNs onto P × P processor arrays. Here the data routing used is very simple, involving row or column transfers only. The mapping has been illustrated on two popular ANN models, the Hopfield net and the BP net.
It has been shown here that each iteration of an n neuron net can be realized with an IPC overhead of the order O(n), which indicates that the communication cost of the mapping is independent of the size of the processor array.
Section 2, following this introduction, briefly discusses the mapping of the Hopfield net. The mapping of the BP net is presented in greater detail in Section 3, where a performance model is also derived and used to estimate the speedup ratio. Section 4 discusses the efficacy of the proposed mapping.
2. Mapping of Hopfield net

Hopfield nets are mainly used as associative memories: they can produce the corresponding stored pattern for any pattern presented at their input. If a pattern is presented at the input of a Hopfield net, the net begins to iterate and finally converges to the target pattern associated with the applied input pattern. The structure of the Hopfield net is as follows: it consists of a single layer of neurons with a feedback connection from each neuron to all others, and no feedback to itself [1-3]. All the connections are weighted, with the weights satisfying the following criteria:
(i) w_ij = w_ji, for all values of i and j,
(ii) w_ii = 0, for all i,
where w_ij is the weight from neuron j to neuron i. A 5 neuron Hopfield net and its weight matrix are depicted in Fig. 1(a) and Fig. 1(b), respectively.

The weights in a Hopfield net are predetermined from a set of exemplar patterns. In the operation of the net, the neurons are made to assume states according to an input pattern, and the network is then left to iterate. In each iteration, a neuron computes a weighted sum of the current state values of all other neurons and then computes its new output by applying a nonlinear activation function, such as the threshold function, to this sum. This process continues till the net converges to a definite output pattern.

The weight matrix of the Hopfield net is symmetric, with all diagonal elements being zero. This particular feature has been fully exploited in our mapping. Initially, we assume that the n neuron Hopfield net is mapped onto a P × P array, where P = n. In the initial data assignment (IDA) phase, the weight matrix is directly mapped onto the P × P array in a one-to-one manner. The diagonal processors, which do not have any weights assigned to them, store the n neurons. The 5 neuron Hopfield net of Fig. 1(a) has been mapped onto a 5 × 5 array in Fig. 1(c).

The data routing involved in the operation of the Hopfield net is as follows. In each iteration, a diagonal processor sends its neuron value to all the output weights of this neuron, which are stored along the same column. Each weight processor then forms the product of the state value with the corresponding weight. Next, all the product terms along a row are collected, summed up and sent to the diagonal processor. Finally, each diagonal processor applies an activation function to this sum to find the new state value of its neuron. The entire operation takes 4(P − 1) unit shifts if we assume a simple mesh without any wrap-around connections. This is the communication overhead if P = n. If P < n, we have to assign more than one neuron to every diagonal processor and multiple weights to every processor in the array. Then, the maximum number of neurons in a processor is ⌈n/P⌉, and hence the required number of shifts will be
Fig. 1. (a) A 5 neuron Hopfield net. (b) Weight matrix. (c) Data assignment on a 5 × 5 array. (d) Mapping onto a triangular array.
4·⌈n/P⌉·(P − 1). Hence, the mapping scheme leads to a communication complexity of order O(n) for realizing the data transfers involved in a single iteration of the Hopfield net. As the weight matrix of the Hopfield net is symmetric, it is possible to map it onto a triangular array, as shown in Fig. 1(d). It is evident from the figure that the data transfer overhead is then double that in a full sized array. Thus, the mapping is found to be optimal.
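To make the diagonal assignment and the column/row routing concrete, the following minimal Python sketch simulates one recall iteration of the mapping and counts the unit shifts it would need on a mesh without wrap-around. The function name, the use of NumPy and the bipolar threshold activation are our illustrative choices, not part of the original formulation.

```python
import numpy as np

def hopfield_iteration(W, state, P):
    """One recall iteration of an n-neuron Hopfield net on a virtual P x P mesh.

    Neurons live on the diagonal processors; weight w[i, j] is held off the
    diagonal.  The routine mimics the column broadcast followed by the
    row-wise shift-and-add, and counts the unit shifts the two phases need.
    """
    n = len(state)
    blocks = -(-n // P)                      # ceil(n / P): neurons per diagonal PE

    # Phase 1: each diagonal processor broadcasts its neuron values up and
    # down its own column (at most P - 1 hops in either direction per value).
    shifts = 2 * (P - 1) * blocks

    # Local work: every processor multiplies its weights by the broadcast states.
    products = W * state[np.newaxis, :]      # w[i, j] * state[j]

    # Phase 2: partial sums travel along each row towards the diagonal
    # processor, again at most P - 1 hops in either direction per neuron.
    shifts += 2 * (P - 1) * blocks
    net = products.sum(axis=1)               # weighted sum gathered per row

    new_state = np.where(net >= 0, 1, -1)    # threshold activation
    return new_state, shifts                 # shifts == 4 * ceil(n/P) * (P - 1)

# Example: a 5-neuron net mapped onto a 5 x 5 array, as in Fig. 1(c).
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 5)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
state = np.array([1, -1, 1, 1, -1])
state, cost = hopfield_iteration(W, state, P=5)
print(state, cost)                           # cost = 4 * (5 - 1) = 16 unit shifts
```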
3. Mapping of BP net

Unlike that of the Hopfield net, the weight matrix of the BP net is sparse, and hence it is not efficient here to have P = n, because in such a case many processors would remain idle. Therefore, only the partitioned mapping with P ≪ n is considered here.
3.1. The BP net

This neural net consists of a number of neurons divided into an input layer, an output layer and one or more hidden layers. There is full connectivity between two adjacent layers. All these connections, known as synapses, are weighted. The BP net, if properly trained, can act as a pattern classifier. During the training period, the net executes two distinct phases, i.e. the forward execution or recall phase and the backward execution or learning phase. Once the net has learnt a particular task, only the recall phase is executed. In the learning phase, the net is presented with a set of input pattern-target pattern pairs. The backpropagation learning algorithm is then used to adjust the weights, so that the net produces the required output pattern for any pattern presented at the input. Below, we present a detailed discussion of the backpropagation algorithm.

3.2. Backpropagation algorithm

The invention of backpropagation as a learning algorithm for multilayer neural nets rekindled interest in neural net research. Previously, although it was possible to adjust the weights in a single layer neural net, it was not possible to adjust the inner layer weights in a multilayer neural net [1]. The backpropagation algorithm allowed the adjustment of these weights by backward propagation of the error values calculated for the outer layer neurons. Before proceeding further, we give the notations and conventions used throughout this paper.
• The BP net is assumed to have L layers, with n(l) neurons at layer l, where l = 1 denotes the input layer and l = L the output layer.
• a_i(l) denotes the state value of neuron i of layer l, w_i^j(l) the weight connecting neuron j of layer l − 1 to neuron i of layer l, δ_i(l) the error (delta) value of neuron i of layer l, and t_i the target value of output neuron i.
In the recall phase, neuron i of layer l computes
NET_i(l) = Σ_{j=1}^{n(l−1)} w_i^j(l)·a_j(l − 1),  and  a_i(l) = f_i(NET_i(l)).     (1)
Usually the activation function used is the sigmoid function, given by f(x) = 1/(1 + e^(−x)). This function has a very simple derivative, f'(x) = f(x)·(1 − f(x)). The learning phase again comprises two different parts: calculation of the error values of the lower layer neurons and modification of their input weights.
• The δ-values of the lower layer neurons are computed using the equation

δ_i(l) = (t_i − a_i(l))·a_i(l)·(1 − a_i(l)),  l = L,
δ_i(l) = a_i(l)·(1 − a_i(l))·Σ_j w_j^i(l + 1)·δ_j(l + 1),  l < L.     (2)
• Weights are modified using the equation

w_i^j(l) = w_i^j(l) + η·δ_i(l)·a_j(l − 1),     (3)

where η is the learning rate, which usually lies between 0.25 and 0.75.
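As a quick illustration of Eqs. (1)-(3), the sketch below expresses the recall, delta and update steps for a single layer of weights in NumPy. It assumes W(l) is stored as an n(l) × n(l−1) matrix; the function names (recall, deltas, update) are ours, not the paper's.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recall(a_prev, W):
    """Eq. (1): NET_i(l) = sum_j w_i^j(l) * a_j(l-1);  a_i(l) = f(NET_i(l))."""
    return sigmoid(W @ a_prev)

def deltas(a, target=None, W_next=None, delta_next=None):
    """Eq. (2): delta values for the output layer (target given) or for a
    lower layer (deltas and weights of the layer above given)."""
    if target is not None:                              # l = L
        return (target - a) * a * (1.0 - a)
    return (W_next.T @ delta_next) * a * (1.0 - a)      # l < L

def update(W, delta, a_prev, eta=0.5):
    """Eq. (3): w_i^j(l) <- w_i^j(l) + eta * delta_i(l) * a_j(l-1)."""
    return W + eta * np.outer(delta, a_prev)
```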
3.3. The mapping

3.3.1. Initial data assignment
Unlike the weight matrix of the Hopfield net, the weight matrix of the BP net cannot be mapped onto the processor array in a one-to-one manner. Here, the neurons are mapped onto consecutive diagonal processors, irrespective of their layer number. This can be seen from Fig. 2(b), where the 3 layer BP net of Fig. 2(a) is mapped onto a 3 × 3 array. The idea behind the weight assignment is the same as in the case of the Hopfield net, i.e., to reduce the communication overhead in both the recall and learning phases and also to simplify the data routing.
Fig. 2. (a) Three layer BP net. (b) Initial data assignment for mapping onto a 3 × 3 array.
The output weights of a neuron are stored in the column containing the neuron and its input weights are stored along its row. In particular, w_i^j(l) is stored in the column holding a_j(l − 1) and in the row containing a_i(l). For example, in Fig. 2(b), w_2^3(2) is assigned to the third column and the second row, which store a_3(1) and a_2(2), respectively.

3.3.2. Data routing
The IDA discussed above simplifies the data routing required in each of the recall and learning phases. The recall phase can be realized by data transfers along the various columns followed by a set of horizontal transfers along the rows. In the learning phase, row transfers precede the column transfers.

Routing in the recall phase. At the start of the recall phase, the state values of the lower layer neurons are sent to all their output weights along each column. This would take 2(P − 1) unit shifts if there were only one neuron per diagonal processor. Then, all the output weights are multiplied with the neuron values. Next, the product terms corresponding to the input weights of each higher layer neuron are collected, summed up and sent to the processor containing that neuron. This is done by a parallel shift and add operation along the rows of the array. At first, each processor forms a partial sum of all product terms corresponding to the input weights of a single higher layer neuron. Afterwards, this partial sum is sent towards the diagonal processor, collecting the corresponding partial sums in all the processors of the row. Finally, the weighted sum for the higher layer neuron is formed in the diagonal processor. This operation takes 2(P − 1) shifts for a single neuron per processor. So, the completion of the forward pass would take a total of 4(P − 1) unit shifts if only one neuron were assigned to a processor. If P < n, the maximum number of neurons in a diagonal processor is ⌈n/P⌉, and the number of shifts required to realize the forward pass is given by

C_r = 4·⌈n/P⌉·(P − 1).     (4)
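The sketch below illustrates, under our own naming, where the initial data assignment places each neuron and each weight on the P × P array, and evaluates the shift count of Eq. (4). Flattening all layers onto consecutive diagonal processors follows the description above, but the exact indexing convention is our assumption.

```python
import numpy as np

def placement(layer_sizes, P):
    """Diagonal IDA sketch: neurons of all layers are dealt onto consecutive
    diagonal processors of a P x P array; weight w_i^j(l) goes to the row of
    neuron i of layer l and the column of neuron j of layer l-1."""
    offset = np.cumsum([0] + list(layer_sizes))   # global index of first neuron of each layer
    neuron_pe = {}                                # (l, i) -> (row, col) of its diagonal PE
    for l, size in enumerate(layer_sizes):
        for i in range(size):
            d = (offset[l] + i) % P
            neuron_pe[(l, i)] = (d, d)

    weight_pe = {}                                # (l, i, j) -> (row, col) holding w_i^j(l)
    for l in range(1, len(layer_sizes)):
        for i in range(layer_sizes[l]):
            for j in range(layer_sizes[l - 1]):
                row = neuron_pe[(l, i)][0]        # row of a_i(l)
                col = neuron_pe[(l - 1, j)][1]    # column of a_j(l-1)
                weight_pe[(l, i, j)] = (row, col)
    return neuron_pe, weight_pe

def recall_shifts(n, P):
    """Eq. (4): 4 * ceil(n/P) * (P - 1) unit shifts for the forward pass."""
    return 4 * -(-n // P) * (P - 1)

n_pe, w_pe = placement([3, 2, 1], P=3)   # an arbitrary small 3-layer example
print(n_pe[(1, 0)], w_pe[(1, 0, 2)])     # PE of a layer-2 neuron and of one of its input weights
print(recall_shifts(100, 8))             # 4 * 13 * 7 = 364 unit shifts
```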
Afterwards, the new state values of the higher layer neurons are found by applying the sigmoid function to the weighted sums. We assume that the state values of the lower layer neurons sent in this phase are stored with their output weights. This reduces the communication overhead in the learning phase.

Routing in the learning phase. In this phase, the δ-values of the higher layer neurons are sent to and multiplied with their input weights along the rows of the array. Then, the product terms are collected in the diagonal processors by the same parallel shift and add operation, but this time along the columns. Both steps together would take a total of 4(P − 1) unit shifts for a single neuron per diagonal processor. For P ≪ n, the total communication overhead in computing the delta values is

C_l = 4·⌈n/P⌉·(P − 1).     (5)
We assume that the δ-values sent in this phase are stored with the input weights. As the state values of the lower layer neurons were stored with these weights in the forward pass, communication for updating the weights is totally eliminated. Hence Eq. (5) represents the communication overhead of the learning phase.
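Combining Eqs. (4) and (5), a small helper (name ours) gives the total number of unit shifts per complete backpropagation iteration quoted in the abstract; the weight update of Eq. (3) contributes nothing because both of its operands are already local to the processor holding the weight.

```python
from math import ceil

def shifts_per_iteration(n, P):
    """Eq. (4) + Eq. (5): recall pass plus delta pass; the weight update is
    communication-free since delta_i(l) and a_j(l-1) are stored with w_i^j(l)."""
    return 4 * ceil(n / P) * (P - 1) + 4 * ceil(n / P) * (P - 1)

print(shifts_per_iteration(200, 16))   # 8 * ceil(200/16) * 15 = 1560 unit shifts
```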
3.4. Performance analysis
The performance model has been derived for a single layer of weights connecting layers l − 1 and l. Eqs. (1)-(5) have been used to estimate the total number of additions and multiplications in each phase of the BP algorithm. In the derivation of the performance model we make the following assumptions.
• The maximum number of neurons of layer l − 1 per diagonal processor is n_1 = ⌈n(l − 1)/P⌉ and that of layer l is n_2 = ⌈n(l)/P⌉.
• The maximum number of output weights of a lower layer neuron per processor is w_1 = ⌈n_1·n(l)/P⌉ and the maximum number of input weights of a higher layer neuron per processor is w_2 = ⌈n_2·n(l − 1)/P⌉.
• The time for multiplying two floating point numbers is t_m and the addition time is t_a. We further assume that t_m = α·t_a.
• The communication time to transfer a data item between two adjacent processors is t_c = τ·t_a.
• The time for evaluating the sigmoid function is t_s = β·t_a.
In the above, α, τ and β are positive constants. With the above assumptions we proceed with the derivation of the performance model. Suppose t_r, t_δ and t_u are the uniprocessor times for executing the recall phase, calculating the lower layer δ-values and updating the weights, respectively. Let t_mr, t_mδ and t_mu be the corresponding multiprocessor times. Then, the total uniprocessor time for executing the backpropagation algorithm is
t_uni = t_r + t_δ + t_u,

where

t_r = t_a·n(l)·(n(l − 1)·(1 + α) + β),
t_δ = t_a·n(l − 1)·(n(l)·(1 + α) + (1 + 2α)),
t_u = t_a·n(l − 1)·n(l)·(1 + 2α).     (6)
The total multiprocessor time is

t_mul = t_mr + t_mδ + t_mu,

where

t_mr = t_a·(4n_1·(P − 1)·τ + (1 + α)·w_1 + n_2·β),
t_mδ = t_a·(4n_2·(P − 1)·τ + (1 + α)·w_2 + n_1·(1 + 2α)),
t_mu = t_a·w_2·(1 + 2α).     (7)
The multiprocessor speedup ratio is obtained as

S = t_uni / t_mul.     (8)
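The model of Eqs. (6)-(8) can be evaluated directly. The sketch below (names ours; the constants α = 10, β ≈ 500 and τ = 0.1 are taken from the discussion of Figs. 6 and 7) returns the speedup for one layer of weights. Since some symbols in the printed equations are hard to recover, treat the exact expressions here as our best reconstruction rather than a verbatim transcription.

```python
from math import ceil

def speedup(n_lo, n_hi, P, alpha=10.0, beta=500.0, tau=0.1, t_a=1.0):
    """Eqs. (6)-(8): speedup of the P x P mapping over a uniprocessor for one
    layer of weights between layers of n_lo and n_hi neurons."""
    # Uniprocessor times, Eq. (6)
    t_r = t_a * n_hi * (n_lo * (1 + alpha) + beta)
    t_d = t_a * n_lo * (n_hi * (1 + alpha) + (1 + 2 * alpha))
    t_u = t_a * n_lo * n_hi * (1 + 2 * alpha)
    t_uni = t_r + t_d + t_u

    # Multiprocessor times, Eq. (7)
    n1, n2 = ceil(n_lo / P), ceil(n_hi / P)
    w1, w2 = ceil(n1 * n_hi / P), ceil(n2 * n_lo / P)
    t_mr = t_a * (4 * n1 * (P - 1) * tau + (1 + alpha) * w1 + n2 * beta)
    t_md = t_a * (4 * n2 * (P - 1) * tau + (1 + alpha) * w2 + n1 * (1 + 2 * alpha))
    t_mu = t_a * w2 * (1 + 2 * alpha)
    t_mul = t_mr + t_md + t_mu

    return t_uni / t_mul                     # Eq. (8)

print(round(speedup(512, 512, P=8), 1))      # roughly 55 for n = 512 on an 8 x 8 array
```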
The communication overhead of the proposed mapping is found to be 8⌈n/P⌉·(P − 1). The communication overhead in the case of [8] is of the order O(N²/P). Assuming the constant term k to be 1, 2 and 4, respectively, the lower bound of the communication overhead in [8] has been estimated for a 3 layer BP net and compared with the exact overhead of the proposed mapping. This has been shown by plotting the total communication overhead C, in terms of unit shifts required to realize the backpropagation algorithm, versus the size of the processor array P. Fig. 3 and Fig. 4
Fig. 3. Plot of communication overhead versus processor array size P for n = 100 (proposed mapping and [8] for k = 1, 2, 4).
show the curves for n = 100 and n = 200, respectively. In each figure, the communication complexity of [8] has been plotted assuming three different values of k, i.e. k = 1, 2, and 4. It is observed that in the proposed mapping, the communication overhead
Fig. 4. Plot of communication overhead versus processor array size P for n = 200 (proposed mapping and [8] for k = 1, 2, 4).
Fig. 5. Plot of P̂ versus n (for k = 1, 2 and 4).
remains almost constant even when the processor array size is varied. This is because the diagonal assignment of neurons in the two-dimensional processor array makes the total number of communication steps independent of the array size. In the case of [8], the communication overhead, being inversely proportional to the processor array size P,
Fig. 6. Plot of speedup versus number of neurons per layer.
Fig. 7. Plot of speedup versus number of processors (n = 64, 256 and 512).
falls rapidly, as expected, with the increase in array size. In each of Fig. 3 and Fig. 4, there is a crossover between the communication overheads of the two approaches, which marks the performance bound between them in terms of processor array size. This crossover point shifts towards the right (larger P) as the value of k is increased. The crossover point indicates the processor array size which results in an equal amount of communication overhead for both mapping methods. Also, for a particular value of k, the crossover point shifts to the right as the number of neurons per layer increases. In order to estimate the largest processor array size P̂ for which the proposed mapping is more efficient than the mapping in [8], the plot of P̂ versus n has been depicted in Fig. 5.

In order to see the speedup performance of the proposed mapping, we assume that a neural net of n neurons per layer is mapped onto a P × P array. The multiplication time t_m is assumed to be 10 times the addition time t_a. The sigmoid function is assumed to be computed using a range reduction technique [9], which makes β almost equal to 500. With the above assumptions, the speedup ratio has been calculated for an 8 × 8 array for different values of n and is plotted in Fig. 6 for τ = 0.1. Furthermore, the plot of speedup versus the processor array size P is shown in Fig. 7 for n = 64, 256 and 512.
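For readers who wish to reproduce the comparison behind Figs. 3-5, the following sketch contrasts the two overhead expressions and searches for the crossover array size P̂. The way n, e and N² are counted for a three layer net is our assumption, so the numbers it prints are illustrative rather than a re-creation of the published curves.

```python
from math import ceil

def c_proposed(n, P):
    """Shifts per backpropagation iteration in the proposed mapping."""
    return 8 * ceil(n / P) * (P - 1)

def c_lin(N2, P, k=1):
    """Assumed lower bound k * N^2 / P for the partitioned mapping of [8]."""
    return k * N2 / P

def crossover(n, N2, k=1, P_max=4096):
    """Largest P for which the proposed mapping needs no more shifts than [8]."""
    P_hat = None
    for P in range(2, P_max + 1):
        if c_proposed(n, P) <= c_lin(N2, P, k):
            P_hat = P
    return P_hat

# Illustrative three layer net with 100 neurons per layer:
n = 3 * 100                      # total neurons (assumed counting convention)
e = 2 * 100 * 100                # weights of two fully connected stages
print(crossover(n, n + e, k=1))  # an estimate of P-hat for k = 1
```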
4. Discussion

A simple and efficient data assignment has been proposed for the partitioned implementation of neural nets on two-dimensional processor arrays. Using this, the communication
complexity of n-neuron ANNs on a two-dimensional array of size P × P is found to be of the order O(n). The Hopfield net and the BP net are used to show the efficacy of our mapping. The Hopfield implementation is found to be optimal. Both the recall and learning iterations of the BP net have been realized with a reduced communication overhead, as compared to the order O(N²/P) complexity of the mapping proposed in [8]. The comparison of the communication complexities is shown in Fig. 3 and Fig. 4, which indicate that for small sized processor arrays the proposed mapping is better than the one given in [8]. These figures also indicate that the communication overhead in the proposed mapping remains nearly constant even though the processor array size is increased. The largest size of the processor array, P̂, for which the proposed mapping has equal or less communication overhead than that of [8] has been given in Fig. 5. The linear characteristic of P̂ versus n is a guide for selecting the processor array size for a given problem. The speedup performance for various problem sizes, given in Fig. 6, indicates an increase in speedup with problem size, until it saturates to a value close to the number of processors. The effect of variation in the number of processors on the speedup ratio for a given problem size is indicated by a linearly dependent relationship (Fig. 7).

It is clear from the data distribution that the total computational load on a processor does not depend on the number of neurons assigned to it, as all the synaptic operations of a neuron are not performed by a single processor. Rather, they are distributed along the row containing the particular neuron. As the neurons are distributed along the diagonal, the total computational load is distributed uniformly over all the processors of the array.

Our algorithm has been shown to work well for a neural net having the same number of neurons in each layer (a uniform size ANN). The algorithm would work satisfactorily even if the neural net is not uniformly structured. In such a situation, we choose a processor array of size P × P, where P equals the number of neurons in the smallest layer. Then each processor would contain at least one neuron of every layer (hidden and output), the synaptic weights would be distributed over all the rows, and there would be proper load balancing.
References

[1] P.D. Wasserman, Neural Computing: Theory and Practice (Van Nostrand Reinhold, New York, 1989).
[2] J.E. Dayhoff, Neural Network Architectures: An Introduction (Van Nostrand Reinhold, New York, 1990).
[3] R.P. Lippmann, An introduction to computing with neural nets, IEEE ASSP Mag. (April 1987) 4-22.
[4] J. Ghosh and K. Hwang, Mapping neural networks onto message passing multicomputers, J. Parallel Distributed Comput. 6 (1989) 291-330.
[5] H. Yoon, J.H. Nang and S.R. Maeng, Parallel simulation of multilayer neural networks on distributed memory multiprocessors, Microprocessing and Microprogramming 29 (1990) 185-195.
[6] A. Singer, Implementation of artificial neural networks on the Connection Machine, Parallel Comput. 14 (1990) 305-315.
[7] X. Zhang, M. McKenna, J.P. Mesirov and D.L. Waltz, The backpropagation algorithm on grid and hypercube architectures, Parallel Comput. 14 (1990) 317-327.
[8] W.-M. Lin, V.K. Prasanna and K. Wojtek Przytula, Algorithmic mapping of neural network models onto parallel SIMD machines, IEEE Trans. Comput. 40 (12) (1991) 1390-1401.
[9] S. Shams and K. Wojtek Przytula, Implementation of multilayer neural networks on parallel programmable digital computers, in: M. Bayoumi, ed., Parallel Algorithms and Architectures for DSP Applications (Kluwer Academic Publishers, 1991) 266-279.
[10] T. Watanabe et al., Neural network simulation on a massively parallel cellular array processor: AAP-2, in: Proc. IJCNN'89, Washington, DC, Vol. 2, pp. 155-161.
[11] F. Blayo et al., Extension of cellular automata to neural computation, in: Proc. IJCNN'89, Washington, DC, Vol. 2, pp. 197-204.
[12] Y. Fujimoto, An enhanced parallel planar lattice architecture for large scale neural network simulation, in: Proc. IJCNN'90, San Diego, Vol. 2, pp. 581-586.
[13] C.F. Chang et al., Digital VLSI multiprocessor design for neurocomputers, in: Proc. IJCNN'92, Baltimore, Vol. 2, pp. 1-6.
[14] A.F. Arif, A. Iwata et al., A neural network accelerator using matrix memory with broadcast bus, in: Proc. IJCNN'93, Nagoya, pp. 3050-3053.
[15] M.A. Viredaz and P. Ienne, MANTRA I: A systolic neuro-computer, in: Proc. IJCNN'93, Nagoya, pp. 3054-3057.