Microprocessing and Microprogramming 35 (1992) 747-754, North-Holland
Neural Network Implementations and Speed-up on Massively Parallel Machines

M. E. Azema-Barac & A. N. Refenes
Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom
Abstract

This paper investigates large scale learning algorithms and their implementation on Massively Parallel machines. The system prototype described in this paper is part of an integrated environment for developing neural network applications, consisting of: i) a library of neural models and associated tools and ii) a mapping system responsible for providing generic and efficient implementations on a spectrum of parallel machines ranging from coarse grain MIMD to fine grain, Massively Parallel SIMD machines. We also describe the implementation of standard learning algorithms on the Distributed Array of Processors (DAP) and show that a speed-up of 50 is obtained for a typical pattern recognition application.

1. Introduction

Neural network architectures have drawn considerable attention in recent years because of their interesting learning abilities. Neural networks are generally believed to provide effective learning procedures when the mapping from input to output vectors contains both regularities and exceptions, and they are, in principle, capable of solving any non-linear classification problem. There are, however, three main problems associated with the use of neural networks in non-trivial applications.

Generalisation: the most important feature of a learning machine is its ability to generalise over the task domain. It is usually accepted that good generalisation performance on real-world problems is difficult to achieve unless some a priori knowledge about the task domain is built into the system. Thus it usually requires several training runs, each of which is compute bound.

Convergence speed: one of the major obstacles in using neural network technologies in large scale applications has been the slow speed at which they operate during training. Typically, for small and medium scale problems, learning times are measured in CPU hours and often days. The availability of Massively Parallel computers is seen as the enabling technology for developing realistic large scale neural networks.

This paper investigates large scale learning algorithms and their implementation on Massively Parallel machines. It analyses the implementation of standard learning algorithms on a representative of massively parallel machines, namely the Distributed Array of Processors (DAP). The paper comprises three main parts. The first part describes the overall environment: it gives an overview of the integrated environment for developing neural applications, made of the PYGMALION library and the mapping system. The second part concentrates on the mapping system and describes the two phases involved, namely the Abstract and Machine Dependent Decompositions. The third and last part analyses the implementation of standard learning algorithms on a representative of massively parallel machines, namely the Distributed Array of Processors (DAP).
2. Overview

One of the major obstacles in using neural network technologies in large scale applications has been the slow speed at which they operate during training. Typically, for small and medium scale problems, learning times are measured in CPU hours and often days. The availability of Massively Parallel computers is seen as the enabling technology for developing realistic large scale neural networks. Several researchers have described implementations of specific neural network models on the Connection Machine (CM) or Distributed Array Processor (DAP). However, there are no generic implementations of neural algorithms that can be ported between different Massively Parallel machines. The work described in this paper attempts to tackle this problem and is part of an integrated environment for developing neural network applications (see Figure 1).
Figure 1 - The Integrated Environment

The top level system (developed as part of the ESPRIT PYGMALION Project [1]) contains a library of neural algorithms including supervised, unsupervised, and associative reinforcement procedures. The mapping system is responsible for providing generic and efficient implementations on a spectrum of parallel machines ranging from coarse grain MIMD to fine grain, Massively Parallel SIMD machines. This paper presents strategies for implementing the mapping system on Massively Parallel architectures. As such, it describes a set of network decompositions and mapping strategies, presents their implementation, and evaluates their performance. The implementation of the system on MIMD machines is described elsewhere [2]. The mapping system consists of two phases: the abstract decomposition and the machine dependent decomposition phases.

• The abstract decomposition seeks the maximum level of parallelism inherent in the neural network model irrespective of machine restrictions.

• The machine dependent decomposition phase decides upon a particular distribution of the neural model in order to exploit the maximum parallelism for the particular model on the particular parallel machine.

The abstract decomposition and the machine dependent decomposition phases are now outlined.

3. The Mapping System
The different degrees of parallelism implemented by neural networks form the basis of the mapping system. Such a system adapts to the neural network concerned and to the targeted massively parallel machine; it can be seen as a system that receives information about a neural network and a massively parallel machine and provides guidelines or distribution schemes for an implementation [4]. The system consists of two components: an abstract decomposition of neural networks and a machine specific decomposition. The abstract decomposition is responsible for analysing the parallelism that the neural network exhibits and for providing the alternative distribution schemes according to the required exploitation of parallelism. The
machine specific decomposition considers the relevant machine criteria and integrates them with the abstract decomposition to form a decision system.

3.1. Abstract Decomposition

Abstract decomposition seeks the maximum level of parallelism inherent in the neural network model irrespective of machine restrictions. This parallelism is divided into two distinct classes:

1- network related parallelism, which is inherent in all networks irrespective of size or architecture, and is found at three distinct levels: an intra-neuron level that concerns the processing performed within a neuron, an intra-slab level (a slab being a group of neurons that generally perform the same role) that concerns the processing performed within a slab of neurons, and an inter-slab level that concerns the overall processing.

2- training data related parallelism, which is concerned with the overall control of the neural network training cycle; e.g. can the training of one particular pattern be expressed independently of the other patterns.

network related parallelism

(1) intra-neuron level. This type of parallelism is expressed within the individual role that each neuron (or connection) performs. Typically, when a neuron computes its new state of activation, it has to compute its net input (neuron i net input = Σ_j W_ij S_j). In this function each individual multiplication can be performed in parallel with all the others and the results then summed.

(2) intra-slab level. This type of parallelism is embedded in the fact that each neuron (or connection) generally performs its role in parallel with all the neurons (or connections) of the same slab. For example, in the homogeneous layered models, all the weights that
connect into the hidden layer can be updated in parallel.

(3) inter-slab level. This type of parallelism is found at the highest level of a neural network model, i.e. it addresses the overall behaviour of the model. The control specifies parallelism by stating that a particular slab A is to perform a role X in parallel with a slab B performing a role Y. Taking the Backpropagation model as an example, its control specifies that the weights that connect into the hidden layer and the weights that connect into the output layer are updated in
parallel, t r a i n i n g d a t a related parallelism This parallelism is neural model dependent. The Backpropagation neural network encompasses such a parallelism. In this model, a number of training patterns {composed of an input putte:n associated with an output pattern) are presented to the network which then adapts so as to retain the input-output associations. This adap~*ation by the network is done by adjusting the cc.m3~tions weight of the connected neurons; this process o f updating the weights is additive fox the different training patterns, defining thus a distinct form of parallelism over the training pattems. Other neural networks exhibit this kind of parallelism; it is particularly true for the heterogeneous neoral networks where different neural models are cooperating. The basic distribution schemes, as a function of the type of parallelism are summarised in Table I. Parallelism network related ino'a-nearon
Disttibuti~t ~ t decomposition connection distribution
intra-slab
neuron dislrilmtion
inwr-slab training re~md
slab diswibulioo net replication
Tabl e 1: P a r a l l e l i s m a r m Distribution S c h e m e s The two principal dist~ibudon strategies are
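To make the distribution schemes of Table 1 concrete, the sketch below (not part of the original system; a minimal Python/NumPy illustration with hypothetical sizes) shows how the state vector and weight matrix of a single slab could be partitioned under connection, neuron and slab distribution, or replicated under net replication.

```python
import numpy as np

n_procs = 4                      # hypothetical number of processors
n_in, n_out = 8, 8               # neurons feeding into / belonging to one slab
W = np.random.rand(n_out, n_in)  # weights connecting into the slab
s = np.random.rand(n_in)         # output states of the previous slab

# Connection distribution (intra-neuron parallelism): the products W_ij * s_j
# for one neuron are spread over processors and summed afterwards.
partial_products = np.array_split(W[0] * s, n_procs)
net_input_neuron0 = sum(chunk.sum() for chunk in partial_products)

# Neuron distribution (intra-slab parallelism): each processor holds a block of
# rows of W, i.e. a subset of the slab's neurons, and computes their net inputs.
row_blocks = np.array_split(W, n_procs, axis=0)
net_inputs = np.concatenate([block @ s for block in row_blocks])

# Slab distribution (inter-slab parallelism): whole slabs (here: whole weight
# matrices) are assigned to different processors.
slabs = {"hidden": W, "output": np.random.rand(4, n_out)}

# Net replication (training data related parallelism): every processor holds a
# full copy of the network and works on its own training pattern.
replicas = [W.copy() for _ in range(n_procs)]
```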
The two principal distribution strategies are not mutually exclusive, provided that the learning algorithm permits network replication. For example, network related parallelism is exploited by decomposing the network at any of the three levels and distributing the parts across the machine. Suitable synchronisation primitives are provided. The decision as to which grain of decomposition to stop at is machine dependent.

3.2. Machine Dependent Decomposition

A particular distribution has to be decided upon according to the characteristics of the massively parallel machine. The underlying idea is to have a set of criteria that guides the choice of one of the distribution schemes available. The characteristics of the neural model that are decisive in the choice of a distribution are mainly the size and the connectivity pattern. Indeed, the size of the training pattern set indicates the benefit of having a distribution over the training set (instead of a distribution over the network elements). The size and width (maximum number of neurons within a slab) of the network itself guides the granularity of the decomposition of the network; that is, according to the number of neurons within a slab it might be more beneficial not to allow different slabs to process in parallel, but instead to allow all neurons within the same slab to process in parallel. In terms of distribution, this implies that each processor will hold more than one neuron, each belonging to a different slab. Finally, the connectivity pattern influences the distribution and is related to the connectivity of the machine.

Training set size    Network width    Strategy
small                large            net decomposition
large                small            training decomposition
large                large            net & training decomposition

Table 2: Basic network characteristics & decomposition strategies
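As an illustration of how such criteria might be combined, the following sketch (my own, not code from the paper; the thresholds are purely hypothetical) selects a decomposition strategy according to the criteria of Table 2.

```python
def choose_strategy(training_set_size: int,
                    network_width: int,
                    machine_size: int) -> str:
    """Pick a decomposition strategy using the criteria of Table 2.

    The comparisons against machine_size stand in for the ratios
    (machine-size / network-size, communication / processing speed, ...)
    that the real mapping system would evaluate; the thresholds here
    are hypothetical.
    """
    large_training_set = training_set_size >= machine_size
    wide_network = network_width >= machine_size

    if not large_training_set:
        return "net decomposition"
    if not wide_network:
        return "training decomposition"
    return "net & training decomposition"


# Example: a 64x64 processor array (4096 PEs), a narrow network and a large
# training set favour a training-set decomposition.
print(choose_strategy(training_set_size=10000, network_width=128,
                      machine_size=4096))
```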
The model for machine independent decomposition uses a taxonomy of parallel machines in terms of the key parameters that characterise the performance of the machine. This includes parameters such as the ratio machine-size / network-size, the ratio of local to global communication overheads, the ratio of communication to processing speed, etc. Given a specific machine structure, each mapping strategy is characterised by a machine-specific cost. The mapping system is then responsible for the selection of a mapping strategy with the minimal cost [4].

4. Prototype on the Distributed Array of Processors

This section presents the implementation of the most commonly used learning procedure on the DAP machine (i.e. error backpropagation on fully connected multilayer perceptrons). It acts as a test-bed for the implementation of other neural networks on the DAP.

4.1. DAP architecture

The DAP is a Single Instruction Multiple Data (SIMD) machine that implements fine grain data parallelism. Typically, data is spread across the memory of a large number of simple (1 or 8-bit) processors and the same operation is performed simultaneously on data objects in all or a selection of the processors. It has been used in a variety of applications, ranging from image and signal processing to text searching [2].

4.1.1. DAP Hardware

The Processing Elements (PEs) within the DAP are organised in a square array, in our case 64x64. Each PE has connections to its four nearest neighbours, referred to as the north, east, south and west neighbours. Furthermore, fast data broadcasting and fetching is provided by a bus system that connects all the PEs in each row and all the PEs in each column. Each PE has its own local memory that ranges from 32 Kbits to 1 Mbit. A
Master Control Unit controls the processor array by broadcasting instructions to each PE, which then operates on its own local data, simultaneously with all other PEs. The PEs perform their basic operations, involving fetching or storing a memory bit, within a single DAP cycle (of the order of 100 nanoseconds).

4.1.2. DAP Software

The DAP is programmed using conventional languages that have been extended to support parallel data constructs, thus enabling programs to benefit from the massive parallelism. Currently, the DAP can be programmed in Fortran Plus (a trademark of AMT), an extension of the Fortran language that allows algorithms to include vector and matrix operations, consequently taking advantage of their inherent parallelism. An assembly language (APL) is also available; it provides facilities for programming the DAP at the bit level required by certain applications, e.g. when variable precision is a crucial factor.
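The data-parallel style this implies can be sketched in a language-neutral way. The fragment below is an illustrative Python/NumPy analogue (not Fortran Plus) of expressing an update as one whole-array operation, which is what allows an array compiler to map one matrix element to each PE.

```python
import numpy as np

EDGE = 64                                  # DAP edge size: a 64x64 PE array
a = np.random.rand(EDGE, EDGE)
b = np.random.rand(EDGE, EDGE)

# Scalar style: an explicit double loop fixes a sequential order of evaluation.
c_loop = np.empty_like(a)
for i in range(EDGE):
    for j in range(EDGE):
        c_loop[i, j] = a[i, j] * b[i, j] + 1.0

# Data-parallel style: a single whole-array expression with no ordering
# constraint, which a SIMD array machine can evaluate one element per PE.
c_parallel = a * b + 1.0

assert np.allclose(c_loop, c_parallel)
```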
Figure 2 - DAP System Overview

Figure 2 shows how the DAP system is used. Typically, a DAP application is composed of a host program that handles the host-based I/O and user interaction, and a DAP program that uses the parallel features of the DAP. The DAP program resides in the code memory of the Master Control Unit, shown in Figure 2, and is accessible through a host computer (VAX or Sun). Depending on the application, control is handed over wholly to the DAP, or the host remains in control and the DAP only runs the highly data parallel functions. Furthermore, the DAP edge size and the sizes of the matrices in a program do not have to match: the compiler is responsible for decomposing each matrix so as to fit the edge size. It performs a tiling algorithm that enables programs not to take the actual edge size into account.

4.2. Supervised learning by error backpropagation

The processing units (or neurons) in the Backpropagation model are arranged in layers - there are an input layer, an output layer and one or more hidden layers. The function of the backpropagation model is to retain a set of input-output mappings. It is trained by presenting a series of input-output mappings, the model adjusting itself (via its weighted connections) so as to retain each mapping. Each training cycle consists of three phases: a forward phase where each neuron computes its new state, a backward phase where each neuron computes its error, and an update phase where each connection adapts its weight. Each phase is formally described in the complete paper; we present here the overall processing occurring when training a neural network by error backpropagation.
• Overall processing
The processing occurring during the training of a backpropagation model corresponds to a series of cycles, each cycle being formed by a forward phase, a backward phase and finally a weight update phase. The number of cycles executed is controlled by a tolerance test; i.e. the learning stops when the overall error
reaches a certain tolerance tol. The complete sequential processing time T_s can then be formalised as:

T_s = Σ_patterns Σ_tol ( T_forward + T_backward + T_wupdate )    (1)

where

T_forward - the overall time spent in the forward phase. During this phase, and starting with the first hidden layer, each neuron applies a non-linear transform to its net input to yield the neuron's output state.

T_backward - the overall time spent in the backward phase. During this phase, and starting with the output layer, each neuron computes its error term. This computation differs for an output neuron and for a hidden neuron.

T_wupdate - the overall time spent in the weight update phase. During this phase, each weight is updated according to the learning rule.

The computations at the hidden neurons are different from those at the output neurons. This fact is used to further decompose (1), yielding the following expression:

T_s = Σ_patterns [ Σ_tol ( t_forward^hidden + t_forward^output + t_backward^output + t_backward^hidden + t_wupdate^hidden + t_wupdate^output ) ]    (2)
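A minimal sketch of the sequential process that T_s formalises is given below (my own illustration in Python, not code from the paper; the layer sizes, learning rate and tolerance are hypothetical and the phase routines are simplified, e.g. bias terms are omitted).

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 96, 96, 10            # hypothetical layer sizes
lr, tol = 0.1, 1e-3                        # hypothetical learning rate / tolerance
W1 = rng.normal(0, 0.1, (n_hid, n_in))     # weights into the hidden layer
W2 = rng.normal(0, 0.1, (n_out, n_hid))    # weights into the output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

patterns = [(rng.random(n_in), rng.random(n_out)) for _ in range(10)]
T = {"forward": 0.0, "backward": 0.0, "wupdate": 0.0}   # the terms of Eq. (1)

for x, target in patterns:                 # outer sum: over training patterns
    for _ in range(20_000):                # inner sum: cycles until tolerance
        t0 = time.perf_counter()           # --- forward phase ---
        h = sigmoid(W1 @ x)
        y = sigmoid(W2 @ h)
        t1 = time.perf_counter()           # --- backward phase ---
        d_out = (target - y) * y * (1 - y)
        d_hid = (W2.T @ d_out) * h * (1 - h)
        t2 = time.perf_counter()           # --- weight update phase ---
        W2 += lr * np.outer(d_out, h)
        W1 += lr * np.outer(d_hid, x)
        t3 = time.perf_counter()
        T["forward"] += t1 - t0
        T["backward"] += t2 - t1
        T["wupdate"] += t3 - t2
        if float(np.mean((target - y) ** 2)) <= tol:    # tolerance test
            break

T_s = sum(T.values())                      # total sequential time, as in Eq. (1)
print(T, T_s)
```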
The complete processing of an application means that each pattern in turn gets fully learnt: the three computational phases are applied until the error term meets a particular value referred to as the tolerance parameter.

4.3. Parallelism in the backpropagation model

Using the classification described in the previous sections, the parallelism exhibited by the backpropagation network is found at all levels. There exist:

network related parallelism
(1) intra-neuron. In this dot product (neuron i net input = Σ_j W_ij S_j) each individual multiplication is performed in parallel and the results are then summed.

(2) intra-layer. Neurons or connections that belong to the same layer perform their role in parallel. For example, all the weights that connect into the hidden layer can be updated in parallel.

(3) inter-layer. The hidden and output layers of weights implement their learning function simultaneously.
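As a short illustration of how the intra-layer and inter-layer levels collapse into whole-array operations (a sketch of mine, not code from the paper; the vector sizes are hypothetical), the update of every weight connecting into a layer can be written as a single outer product, and the two layer updates are independent of each other.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(8)              # input activations (hypothetical sizes)
h = rng.random(6)              # hidden activations
d_hid = rng.random(6)          # error terms of the hidden neurons
d_out = rng.random(3)          # error terms of the output neurons
lr = 0.1

# Intra-layer parallelism: all weights into a layer are updated by one
# outer-product expression; each multiply-add can run on a different PE.
dW1 = lr * np.outer(d_hid, x)  # every hidden-layer weight at once
dW2 = lr * np.outer(d_out, h)  # every output-layer weight at once

# Inter-layer parallelism: dW1 and dW2 depend only on already-computed error
# terms, so the two layer updates could proceed simultaneously on disjoint
# groups of processors.
print(dW1.shape, dW2.shape)
```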
training related parallelism

The backpropagation network can be trained by epoch. In this case, each training pattern is presented once to the network and the weights are adjusted after all the training patterns have been presented. Parallelism over the training set is thus present in the model.

4.4. Decomposition and Mapping onto the DAP

The optimal mapping of the Backpropagation model is the one that exploits all the degrees of parallelism, i.e. the neuron, layer, network and pattern parallelisms. The DAP is a machine well suited to vector and matrix operations. Thus, the most straightforward mapping makes use of this characteristic and specifies the backpropagation algorithm as a set of matrix-vector operations [5], where layers of neurons correspond to vectors, and where their interconnections correspond to matrices. A less obvious mapping makes use of the training parallelism which exists in the Backpropagation network. In this case, the machine holds copies of the neural network and each copy processes its own specific training pattern simultaneously with the others. Such parallelism is characteristic of the Backpropagation neural network, which allows training by epoch [6]. Other distributions can be devised by adding to the basic
distributions another level of parallelism. Table 3 gives a summary of the alternative distribution schemes and their main decision criteria.
Distribution                  Decision criteria
matrix-vector                 network size vs machine size
training set                  training set size vs machine size
matrix-matrix                 network size vs training set size vs machine size
distributed matrix-vector     not applicable to this model

Table 3 - Distributions & Decision Criteria

The matrix-vector decomposition takes the activity values of the neurons as vectors and the sets of connection weights as matrices. The network is thus represented as a series of matrix-vector multiplications and non-linear transforms. This distribution exploits the first two types of parallelism: intra-neuron and intra-layer. The training set decomposition exploits the training set parallelism by replicating the entire set of neurons (neurons that belong to the input, hidden and output layers), along with a single training pattern, in each processor. The matrix-matrix decomposition integrates the matrix-vector distribution and the training set distribution. The network of neurons is replicated within the machine processors, but each copy is also distributed over a column (or a row) of processors. Therefore this decomposition exploits both the intra-neuron and inter-layer parallelism during its forward, backward, and weight update phases. The distributed matrix-vector decomposition improves the matrix-vector decomposition by exploiting the inter-layer type of parallelism. It does so by distributing the neurons and connections that belong to different layers onto different processors.

4.5. Performance Assessment

The application chosen for evaluation is a standard one: the digit recognition application. The neural network is trained to recognise ten digits, each digit being represented by a matrix of 8x12 pixels. We remind the reader that our interest focuses on the training phase of the application, which is the most expensive phase and thus benefits most from a parallel execution. Moreover, the whole application requires the neural network to be trained for each pattern and this more than once (Equation 1). If one considers the time for the whole application, it depends on the learning rate, on the tolerance, and on the type of image the application uses. So, in the first instance only the processing of one training cycle is considered. That is, the performance is studied for a single forward, backward and weight update phase; the results are then generalised so as to apply to the whole processing. However, for the sake of brevity these intermediate phases are not described in this paper; instead, speed-up factors obtained for the whole application and for variations of the network size are given.
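Before turning to the measured figures, it may help to see the shape of the computation being timed. The sketch below (my own illustration in Python/NumPy, with hypothetical layer and batch sizes; it shows the structure of the computation, not the DAP code) expresses one training cycle first as the matrix-vector operations of the matrix-vector decomposition, and then over a whole epoch of patterns as matrix-matrix operations, the form that also exposes the training-set parallelism.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out, n_pat = 96, 64, 10, 32     # hypothetical sizes
lr = 0.1
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1 = rng.normal(0, 0.1, (n_hid, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hid))

# --- matrix-vector decomposition: one pattern, layers as vectors ------------
x, t = rng.random(n_in), rng.random(n_out)
h = sigmoid(W1 @ x)                            # forward phase
y = sigmoid(W2 @ h)
d_out = (t - y) * y * (1 - y)                  # backward phase
d_hid = (W2.T @ d_out) * h * (1 - h)
W2 += lr * np.outer(d_out, h)                  # weight update phase
W1 += lr * np.outer(d_hid, x)

# --- matrix-matrix decomposition: an epoch of patterns as matrix columns ----
X, T = rng.random((n_in, n_pat)), rng.random((n_out, n_pat))
H = sigmoid(W1 @ X)                            # all forward phases at once
Y = sigmoid(W2 @ H)
D_out = (T - Y) * Y * (1 - Y)                  # all backward phases at once
D_hid = (W2.T @ D_out) * H * (1 - H)
W2 += lr * (D_out @ H.T)                       # epoch update: the per-pattern
W1 += lr * (D_hid @ X.T)                       #   weight changes are additive
```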
The speed-up factor is expressed as the ratio of the sequential execution time to the parallel execution time: S = T_s / T_p. The speed-up obtained from executing the 10-digit recognition application on a Sparc station against running the same application on the DAP 600 machine is equal to 50 (i.e. T_s / T_p = 50).
Moreover, the execution time has been monitored according to the network size. That is, the same backpropagation network has been used while varying the actual size of the network, so as to quantify the effect of the network size on the execution time. The following figure presents the results obtained.
Figure 3 - Speed-up vs Neural Network Size

The execution time is monitored according to the network size. That is, the same backpropagation network has been used while varying the actual size of the network so as to quantify the effect of the network size on the execution time. This has been computed for a single pass through each phase occurring during the training of a neural network, i.e. the forward, backward and weight update phases. The x-axis corresponds to a network size (X x X x X); the y-axis plots the actual speed-up obtained for a complete training cycle of the backpropagation. The speed-up has been calculated for the following network sizes X = (32, 33, 40, 63, 64, 96, 128, 140). The best speed-up is obtained when the network size X corresponds to twice the edge size of the DAP (X = 128 = 2 * 64).

5. Conclusion

This paper has examined the implementation of neural networks on massively parallel machines, and analysed alternative decomposition strategies and their potential speed-up. Existing massively parallel machines provide a good vehicle for speeding up the learning phase of neural networks. With careful network decomposition, speed-up factors of 20-100 can be achieved.

REFERENCES:

[1] B. Angeniol and P. Treleaven, "Pygmalion: The Neurocomputing Project", in Esprit Conference 1991.
[2] Magali E. Azema-Barac, "A Generic Strategy for Mapping Neural Networks on Transputer-based Machines", in Applications on Transputers, pp. 768-773, IOS Press, 1991.
[3] AMT, "DAP Technical Overview", DAP Series.
[4] Magali E. Azema-Barac, "A Conceptual Framework for Implementing Neural Networks onto Massively Parallel Machines", 6th International Symposium on Parallel Processing (IPPS), IEEE Press, 1992.
[5] Charles R. Rosenberg and Guy Blelloch, "An Implementation of Network Learning on the Connection Machine", in Connectionist Models and their Implications, eds. D. Waltz and J. Feldman.
[6] Xiru Zhang, Michael McKenna, Jill P. Mesirov and David L. Waltz, "An Efficient Implementation of the Backpropagation Algorithm on the Connection Machine", in Neural Information Processing Systems, 1989.