North-Holland, Microprocessing and Microprogramming 30 (1990) 249-256
A CELLULAR ARCHITECTURE DEDICATED TO NEURAL NET EMULATION
Bernard FAURE and Guy MAZARE, IMAG-LGI Groupe Circuits, 46 avenue Félix Viallet, 38031 Grenoble cedex, France
This paper describes a cellular architecture for the fast parallel processing of feed-forward neural networks. This architecture is made of a bidimensional array of specific asynchronous processing elements which communicate among themselves through non-local message transfers. Each cell processes the algorithms of a neuron in a simplified but fast way. The simplifications of the back-propagation algorithms are discussed and the internal architecture of the cell implementing these features is presented. The mapping of the neural network is examined and the performances of the whole cellular array, evaluated on the basis of computer simulations, are shown and compared with other machines. Our implementation of the recall scheme is more than 4 times faster at evaluating the NETtalk text-to-speech network than the Warp, a 20-processor systolic array reported to be the fastest simulator in the literature, while the learning scheme is about twice as fast. The results presented in this paper indicate that two-dimensional arrays can be good candidates for the fast parallel processing of feed-forward neural networks.
1. INTRODUCTION
The application of neural networks to problems that involve some particular faculties of the brain which are not well understood requires a huge amount of data processing and thus considerable computing power. Neural networks typically learn to perform these tasks from the iterative presentation of a large set of examples. Since neural networks learn by minimizing the error detected at the output for each presented example, the total error-minimization time must be as short as possible. The usefulness of a neural network accelerator depends directly on its speed, as well as on its learning accuracy when learning is supported. One possibility for increasing the computing power is to use a parallel machine but, in such machines, the bottleneck is the interprocessor communications. In large grain architectures, the task is distributed over a few powerful processors. The neural network must be partitioned to balance the load, take the best advantage of the parallelism and minimize the interprocessor communications. This partitioning and mapping of the neural network is a difficult task. Moreover, most of the time, the resulting graph has nothing to do with the initial structure of the neural network, so its attractive features such as robustness and fault tolerance are traded away for a speed increase. However, the one-neuron-per-processor association, which first comes to mind and which tries to preserve the initial topology of the neural network with its
underlying features, might be used with fine grain architectures. Such an architecture can be made of a large number of processors, reduced to simple automata that compute the algorithms of a neuron. A global communication system distributed in each processor can avoid the handling of a huge number of physical connections. We propose a message transmission mechanism, distributed in each processor, allowing a particular processor to communicate with potentially any other without the need for global information. A message is passed from a processor to one of its neighbours until it reaches its destination; the path followed by the message is set dynamically in each passed-through processor. This communication system is implemented in hardware to favour the transmission speed and thus meet the high communication rates required by the emulation of neural networks. Because each processor can communicate with potentially any other, this communication system can take better advantage of a bidimensional grid than of the 3-dimensional space of a complex hypercube. The proposed architecture is derived from this communication mechanism: a bidimensional array of asynchronous processors, each of which is physically linked to its four immediate neighbours through buffers. The description of back-propagation feed-forward neural networks as sets of neurons arranged in layers of different sizes, with only interlayer connections, leads to a disorganized parallelism that can be efficiently handled by this array of asynchronous processors.
2. THE CELLULAR ARCHITECTURE
The proposed architecture consists of an N x M array of asynchronous processing units, called cells, all identical and dedicated to a particular application. Being specific, the basic cell can process a given task efficiently with a limited silicon surface. Each cell performs a simple local function and is physically connected to its four immediate neighbours through eight unidirectional buffers, one for each way of the four directions. When using four bidirectional buffers instead of eight unidirectional ones, the message transfers can cause the communication system to reach a deadlock. Figure 1 sketches a 5 x 5 cellular array and highlights the paths followed by the messages in a two-way communication.
Figure 2 shows the functional diagram of the cell, with the eight external buffers surrounding it and the two internal buffers for the communication between the routing and processing parts.

[Figure 1: a 5 x 5 cellular array, showing the paths followed by the messages in a two-way communication between cells a and b.]

[Figure 2: functional diagram of the cell, with its status flags, internal buses and buffers.]

We have included a flip-flop based mechanism in each buffer to prevent the two cells linked through it from accessing it concurrently and thereby destroying the message currently stored.

2.1. The cell
The cell performs two concurrent and independent tasks and thus is divided into two asynchronous parts:
- the routing part, dedicated to the transmission of the messages, reads a message from an input buffer, selects an output buffer according to the routing policy, updates the routing part of the message and stores the new message in the selected output buffer if it is empty. Otherwise, it holds the message in the input buffer and selects another input buffer.
- the processing part executes the algorithms of the application mapped on the array: it tests the flag of its input buffer for the presence of a message stored by the routing part, reads the message if the flag is set, processes the data and stores the result as a message in its output buffer when completed.
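For illustration, here is a minimal C sketch of such a flag-guarded one-message buffer; all names and types are ours, and the real mechanism is a flip-flop in hardware, not software:

```c
#include <stdbool.h>
#include <stdint.h>

/* One-message buffer shared by two asynchronous parts. The status
   flag plays the role of the flip-flop mechanism: the producer only
   writes when the flag is clear, the consumer only reads when it is
   set, so a stored message cannot be overwritten before it is read. */
typedef struct {
    volatile bool     full;   /* status flag */
    volatile uint32_t data;   /* message word */
} buffer_t;

/* Producer side (e.g. the routing part storing a message). */
bool buffer_put(buffer_t *b, uint32_t msg) {
    if (b->full)
        return false;         /* buffer busy: hold the message */
    b->data = msg;
    b->full = true;           /* set the flag of the output buffer */
    return true;
}

/* Consumer side (e.g. the processing part polling its input). */
bool buffer_get(buffer_t *b, uint32_t *msg) {
    if (!b->full)
        return false;         /* no message pending */
    *msg = b->data;
    b->full = false;          /* clear the flag: buffer reusable */
    return true;
}
```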
The cell surface is large enough to allow the use of parallel buffers and thus the parallel transmission of the messages. The building block of the cellular array is made of the cell and its four external input buffers, the output buffers of a cell being the input buffers of its four neighbours. This basic block is cascadable, so the array is easily extendable. While the processing part is dedicated to a given application and must be respecified for each new application, the routing part is general and its specifications are common to all the application specific arrays.
2.2. The message transmission
In this array, each cell can be logically connected to any other and to the outside world by sending and receiving messages through the communication system distributed among the cells. In order to keep the generality and cascadability of the basic cell, the piece of the communication system included in each cell can only pass a message from one of its input buffers to one of its output buffers, without global information. Therefore, the message must hold all the information required by the routing part to pass it in the proper direction: that is, information about the distance and direction of its destination cell relative to the cell currently transmitting it. The message is divided into two fields. The data field holds the information needed by the processing part for computing the application mapped on the array (this information generally consists of a numeric value and some control bits telling the destination cell what to do with this value). The routing field holds the information needed by the routing parts of the cells for the proper transmission of the message: the relative displacement dx and dy from the current
to the destination cell. Figure 3 shows the structure of the message, its two fields, the type of information carried, and the value of the relative displacement when the message is output from the cell a, the origin being at the top left corner of the array.

[Figure 3: structure of the message, with its routing field (the relative displacement dx, dy) and its data field; for a message sent from cell a to cell b, dx = xa - xb and dy = ya - yb.]

For an N x N array, the addressing range (precision of both dx and dy) must be +/- N/2 in order to reach the cells at the center of the array from the outside. Each displacement must then be coded on log2(N/2+1) bits, so the routing field needs 12 bits for a 65 x 65 array. For large size arrays, the buffers can be folded to fit in the perimeter of the cells.

The routing part has been studied and evaluated for a wide range of message transmission policies. It reads a message from one of its 5 input buffers (4 external and 1 from the processing part) and selects the appropriate output buffer according to the current value of the relative displacement dx and dy carried by its routing field; if this output buffer is empty, it updates dx if not null, or dy otherwise, and stores the message in the selected output buffer. The retained policy, which transfers the messages first along the X axis and then along the Y axis, allows a better use of the peripheral cells and drains the excess of messages off the cells at the center of the array. A message is passed from a cell to one of its neighbours westward or eastward in the way that decrements the absolute value of dx until it becomes null. When dx is null, it is passed northward or southward in the way that decrements the absolute value of dy until it becomes null. When both dx and dy are null, the message has reached its destination and its data field is passed to the processing part.

The routing part was designed in VLSI last year with a semi-custom CAD tool. It allows up to 4 message transfers in parallel with a 40 ns delay from buffer to buffer, at a clock speed up to 20 MHz. At 10 MHz, the latency of the transfer is doubled: 80 ns. This routing part is made of 4 buses and 4 input buffers augmented with 4 copies of a small circuitry that allocates the buses, updates the routing field of the message, clears the status flag of its input buffers and sets the one of its output buffers. The buffers between the two parts of the cell come with the processing part. The resulting circuit had less than 7000 transistors on a surface of 9 mm2; the buffers could store messages with 8 bits of data, 4 control bits and a routing field of 8 bits for both dx and dy (that is an addressing range of 16 x 16 cells) [4]. The flexibility of the chosen design allows us to parametrize this routing part easily and then rapidly generate a new layout for any size of any field of the buffer. The design of the cell is now limited to the design of the layout of the specific processing part with its i/o interface (two buffers), the whole cell being created by generating the layout of the routing part around it.
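The following C sketch illustrates this X-then-Y policy on a message holding its remaining displacement; the struct layout, field widths and names are ours, and the real routing part is wired logic rather than code:

```c
#include <stdint.h>

/* Message as in Figure 3: a routing field holding the remaining
   displacement (dx, dy) and an application data field.
   Field widths are illustrative. */
typedef struct {
    int8_t   dx, dy;   /* signed relative displacement */
    uint16_t data;     /* data field: numeric value + control bits */
} message_t;

typedef enum { EAST, WEST, SOUTH, NORTH, DELIVER } port_t;

/* One routing step: move first along the X axis, then along the
   Y axis, decrementing the absolute value of the displacement.
   The mapping of signs to compass directions depends on the axis
   orientation and is illustrative. */
port_t route_step(message_t *m) {
    if (m->dx > 0) { m->dx--; return EAST; }
    if (m->dx < 0) { m->dx++; return WEST; }
    if (m->dy > 0) { m->dy--; return SOUTH; }
    if (m->dy < 0) { m->dy++; return NORTH; }
    return DELIVER;  /* dx = dy = 0: give the data to the processing part */
}
```

With this update rule, |dx| decreases to zero before dy is touched, which is exactly the retained policy.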
3. THE PROCESSING PART
Since the computing power of such a fine grain architecture increases with the number of cells in the array, the compromise between surface and time is critical. To meet it, and because of the high number of computations that have to be done, we must simplify the functions used, limit the data path width and decrease the memory size as much as possible. The processing part performs the computations in a different order from the standard series devoted to the modeled neuron: it sums the weighted incoming inputs, computes its activation and outputs a weighted activation on each downstream path during the recall phase, while it updates the weights of the downstream links with the incoming gradient, weights and sums the incoming gradients, and computes and outputs the local gradient on each upstream path during the learning phase. This is easily done by associating the weights of a cell with the downstream links instead of the upstream links of the standard neuron model. In fact, a neuron then straddles two cells: the upstream one manages the weights (the synapses) while the downstream one emulates the decision taking of the neuron [1]. This new order in the series of computations done by the processing part allows the fast processing of any incoming message and absorbs the important delay necessary for the multiplication of the weight by the output activation. The input buffer is rapidly cleared while the sending of a new message waits for the release of the previous one into the buffer of another cell. This prevents the messages from overloading the communication system.
3.1. A first approach
Our primary goal was to rapidly produce a cell that performs the forward pass at a high speed. This cell must also include all the circuitry necessary for the backward pass. Since the processing part emulates a neuron with many connections, the first reduction was to include only one input and one output buffer in the processing part. The second was to limit the fan-in and fan-out of all the neurons to 8. An array of such cells can emulate small neural networks with no more than 8 weights, 8 input and 8 output links for each neuron. The next step in decreasing the size of the processing part is to limit the data path width and the complexity of the non-linearity and of its associated derivative function. A sketch of the reordered recall computation follows.
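This is a rough sketch of the reordering during the recall phase; all names, widths and the Q15 fixed-point scaling are our assumptions, and the transfer function is a placeholder for the one discussed in section 3.3:

```c
#include <stdint.h>

#define MAX_LINKS 8     /* connectivity limit of the basic cell */

/* Per-cell state: the weights are attached to the DOWNSTREAM links,
   so the cell can emit one weighted copy of its activation per
   downstream path as soon as it has fired. */
typedef struct {
    int16_t weight[MAX_LINKS];             /* downstream link weights */
    int8_t  dx[MAX_LINKS], dy[MAX_LINKS];  /* downstream cell addresses */
    int     fanout, fanin;
    int     received;      /* inputs seen for the current recall */
    int32_t accum;         /* running sum of weighted inputs */
} cell_t;

extern void send(int8_t dx, int8_t dy, int16_t value); /* routing part */

static int16_t transfer(int32_t x) {  /* placeholder saturated function */
    if (x >  32767) return  32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* Recall phase: accumulate the (already weighted) inputs; when all
   have arrived, compute the activation and emit one weighted
   activation per downstream link. */
void on_activation(cell_t *c, int16_t input) {
    c->accum += input;
    if (++c->received < c->fanin)
        return;
    int16_t act = transfer(c->accum);
    for (int i = 0; i < c->fanout; i++)
        send(c->dx[i], c->dy[i],
             (int16_t)(((int32_t)c->weight[i] * act) >> 15)); /* Q15 */
    c->accum = 0;
    c->received = 0;
}
```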
We have studied the behaviour of the cellular array in order to evaluate the suitability and feasibility of the VLSI implementation. The simplifications made to the functions (hyperbolic tangent and derivative) and the reduction of the width of the data path have been functionally simulated with Occam on a Transputer-based system for two simple neural networks:
The exclusive-or: a 3-layered network with 2 units in the input layer, 2 in the hidden layer and 1 in the output layer. This is the simplest hard-to-learn non-linear function that can be processed by the back-propagation learning rule in a multi-layered neural network. This network handles a total of 9 connections with associated weights, including the unit thresholds, that have to be updated during the learning phase.
The Little Red Riding Hood: a 3-layered network with 6 units in the input layer, 3 in the hidden layer and 7 in the output layer, that is 49 synaptic weights. This network outputs the actions that have to be taken by the Little Red Riding Hood when she encounters someone with physical characteristics such as big ears, big teeth... The network classifies the characteristics presented to its input layer into three classes: woodcutter, grandma and wolf, each of which leads to a set of actions [3].
The simulations have shown that a transfer function and its derivative, both tabulated with 7 values in the intervals [-1,+1] and ]0,+0.5] respectively, were sufficient to train the two networks, provided the final precision of the value emitted by the processing part (after all the computations) was at least 8 bits for the forward pass and 10 bits for the backward pass [1]. An extremely simplified version of the processing part was designed during the spring of 1989 with an automated CAD tool. All the data being in twos complement integer form, it was reduced to a twos complement adder (multiplications were performed by additions and shifts) and had two memories: one for the 16 (dx, dy) pairs and another for the 8 synaptic weights required by the 16 logical links it handled; the i/o interface to the routing part was reduced to two buffers. The circuit had about 9000 transistors for a surface of 16 mm2 and could compute a forward pass in 29 µs and a backward pass in 55 µs at a 10 MHz clock speed [6]. The complete cell, including 4 external buffers and the processing and routing parts, was 25 mm2 and had about 16000 transistors.
3.2. Effects of the limited connectivity
Since we want the cellular array to emulate medium-sized neural networks, whose neurons handle several hundreds of synapses, we must limit the connectivity (fan-in and fan-out) of the basic cell. Indeed, it is unrealistic to build an array with cells having at least 50000 transistors for every 100 synapses supported, for only storing the addresses of the cells they are
linked to. We can bypass the limited connectivity by introducing partially connected sub-layers into the initial neural network topology - that is, assign a neuron to a set of cells - and then map the increased network by using a simulated annealing method. The mapping of the neural network is done by sending one message "initialization begins" to every cell of the array, then sending a set of initialization messages to all the cells the neural units have to be assigned to, with information on their effective connectivity, their initial weights, the addresses of all the cells they are linked to... and at last sending one message "initialization ends" to every cell of the array, which requests the cells to send back one message "initialization completed" to the outside. When all those messages have been received by the host computer, the array is ready to use. A sketch of this dialogue is given after Figure 4. Figure 4 shows a possible map of the exclusive-or network on the cellular array, given a cell has a maximum allowed connectivity of 2. The white circles represent the bias neuron that handles the thresholds for the whole network. The arrows show the paths followed by the messages during the recall phase (forward pass). During the phase of weight update (backward pass), the messages follow the dual paths obtained by changing dx into -dx and dy into -dy to access the upstream-connected cell.

[Figure 4: a possible mapping of the exclusive-or network on the cellular array with a maximum cell connectivity of 2.]

The simulations have shown that there is no need for an efficient mapping. The performances of a good and a bad mapping differ only by a few percent: 3 % for small networks and less than 0.2 % for medium-sized ones. However, the mapping must create logical links that fit in the addressing range of each cell and assign the input and output layers to the peripheral cells of the array. The overhead time is due to the message retention along the most used paths, which are less loaded in a bad mapping, the neurons being more spread over the array. The cell connectivity must be at least 32 and the mapping very bad (say, less than 25 % of assigned cells, due to the mapping of the i/o cells only at the array periphery) to produce paths with high congestion rates.
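A possible host-side rendering of this initialization dialogue is sketched below; the message names come from the text, but the payload layout and the host/array interface functions are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical host-side driver for the initialization dialogue.
   send_to_cell() and wait_completions() stand for the host/array
   interface, whose concrete form the paper does not detail. */
enum { INIT_BEGINS, INIT_UNIT, INIT_ENDS };

typedef struct {                 /* per-unit payload (layout is ours) */
    int     fanin, fanout;       /* effective connectivity */
    int16_t weights[8];          /* initial synaptic weights */
    int8_t  dx[16], dy[16];      /* addresses of the linked cells */
} unit_init_t;

extern void send_to_cell(int x, int y, int type, const void *payload);
extern int  wait_completions(int expected); /* "initialization completed" */

void initialize_array(int n, int m, int nunits,
                      const int ux[], const int uy[],
                      const unit_init_t init[]) {
    for (int x = 0; x < n; x++)          /* "initialization begins" */
        for (int y = 0; y < m; y++)
            send_to_cell(x, y, INIT_BEGINS, NULL);
    for (int u = 0; u < nunits; u++)     /* configure the assigned cells */
        send_to_cell(ux[u], uy[u], INIT_UNIT, &init[u]);
    for (int x = 0; x < n; x++)          /* "initialization ends" */
        for (int y = 0; y < m; y++)
            send_to_cell(x, y, INIT_ENDS, NULL);
    if (wait_completions(n * m))         /* one reply per cell */
        puts("array ready");
}
```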
3.3. A new design
As seen before, the designed cell cannot emulate neurons with more than 8 weights, and its low precision creates high oscillation rates. That is why we have studied and evaluated another approach, which allows the spreading of a neuron over a set of cells. The cell uses a linear transfer function f, saturated at both ends of its interval of variation. The value f'(x) of the derivative of f is produced by the operative part from the value f(x): f'(x) = [1 - f(x)] . [1 + f(x)] / 2. When f saturates, the weight update step, whose value should be null, is given the lowest possible value the precision can handle. This results in modifying the weights at every trial but also reduces the oscillation. The data path is set to 16 bits for bypassing the round-off error problem. The final mapped neural network topology, still close to the initial one, may demand a wider data path for taking into account the round-off errors generated by the limited precision computations. For the tested examples, a 16-bit data path is sufficient with a cell connectivity of 8 or more. For instance, the character pattern classification, character pattern recognition and NETtalk require 17, 27 and 30 bits with a cell connectivity of 2. The connection weights and neuron activations are scaled between 1 and -1 and the gradients between 2 and -2. The absolute values of all the data handled by a cell range from 2^1 to 2^-(n-2), n being the resolution of a cell (n >= 16 bits). Figure 5 sketches the data path of the processing part necessary for implementing these features. It is made of:
- one input buffer which contains a status flag and stores the data field of the message,
- one output buffer which contains a status flag and stores the data and routing fields of the message,
- a memory of 8 16-bit (or more if necessary) words, storing the synaptic weights,
- a memory of 16 log2(N/2+1)-bit words for storing the values of dx and dy for the downstream and upstream connected cells in an N x N array,
- a twos complement 16-bit integer adder with one accumulator extended to 32 bits.

[Figure 5: data path of the processing part, with its data and address buses.]
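A fixed-point reading of this transfer function and derivative is sketched below, assuming a hypothetical Q1.14 scaling on the 16-bit data path (the paper does not specify the exact format):

```c
#include <stdint.h>

/* Saturated linear transfer function f and its derivative, in a
   Q1.14 fixed-point format we assume for illustration
   (1.0 == 1 << 14). */
#define ONE (1 << 14)   /* +1.0 */
#define EPS 1           /* smallest non-null step the precision handles */

static int16_t f(int32_t x) {            /* linear, saturated at +/-1 */
    if (x >  ONE) return  ONE;
    if (x < -ONE) return -ONE;
    return (int16_t)x;
}

/* f'(x) is produced from f(x): f' = [1 - f(x)] * [1 + f(x)] / 2.
   When f saturates, the true derivative would be null; the cell
   substitutes the lowest representable value so that the weights
   keep moving, at the price of a small residual oscillation. */
static int16_t f_prime(int16_t fx) {
    int32_t d = ((int32_t)(ONE - fx) * (ONE + fx)) >> 15; /* incl. /2 */
    return (d == 0) ? (int16_t)EPS : (int16_t)d;
}
```

At fx = 0 this yields f'(0) = 0.5, the upper bound of the ]0,+0.5] interval mentioned in section 3.1.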
The control part, which handles the multiplication algorithm, the memory loading during the array initialization and all the synchronization needs, uses about one third of the processing part surface. For a 65 x 65 array built from cells with a 16-bit data path, the number of transistors is around 25000 for a cell that includes the processing and routing parts and four external buffers and supports a connectivity of 8. We plan to produce a one-cell chip, with a 16-bit processing part, a connectivity of 8 and a routing part able to address 32 cells in each of the four directions, which we will assemble in a small array.
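Since the operative part offers only an adder, the multiplication algorithm run by the control part is presumably of the shift-and-add kind mentioned in section 3.1; a C sketch of such an algorithm (ours, the actual microprogram is not published) is:

```c
#include <stdint.h>

/* Shift-and-add multiplication: the kind of algorithm the control
   part can run on the cell's single twos complement adder, the
   processing part having no hardware multiplier. */
static int32_t shift_add_mul(int16_t a, int16_t b) {
    int32_t acc = 0;                  /* 32-bit accumulator */
    int32_t addend = a;               /* shifted copy of the multiplicand */
    int32_t m = b;
    int neg = (m < 0);
    if (neg) m = -m;                  /* work on the magnitude of b */
    while (m) {
        if (m & 1) acc += addend;     /* one addition per set bit */
        addend <<= 1;                 /* one shift per examined bit */
        m >>= 1;
    }
    return neg ? -acc : acc;
}
```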
4. PERFORMANCES
The performances of the cellular array were studied for a cell having a 16-bit twos complement adder and running at a 20 MHz clock speed. When necessary, the data path and memory used are widened, but the computations are still made with the same adder.
4.1. Learning performances
We have studied the implications of these choices with the exclusive-or, the Little-Red-Riding-Hood and a set of applications chosen for their relevance in the field of neural computing:
The numeral pattern classification: a 2-layered network with 45 units in the input layer and 10 in the output layer, that is 460 synaptic weights. This network is presented with a 5 by 9 pixel image that has to be changed to its decimal value (1 to 10).
The character pattern classification: a 2-layered network with 45 units in the input layer and 26 in the output layer, that is 1196 synaptic weights. This network is presented with a 5 by 9 pixel image that has to be changed to a character value (1 to 26).
The character pattern recognition: a 3-layered network with 25 units in the input layer, 60 in the hidden layer and 25 in the output layer, that is 3085 synaptic weights [10]. This network is presented with a 5 by 5 pixel image and produces a 5 by 5 pattern.
The initial neural networks, with randomly initialized weights, were preprocessed, that is augmented if necessary depending on the maximum allowed connectivity of a cell, and then mapped onto the array. The VLSI implementation restrictions were functionally simulated with Pascal on a Vax 6310 for the cellular array, given a cell has a maximum connectivity of 2, 4, 8, 16, 32, 64 or no limit. The simulations have pointed out that the augmented networks emulated by the cellular array can memorize all the presented patterns.
The simulated examples are: the exclusive-or (xor), the Little-Red-Riding-Hood (lrrh), the numeral (num) and character (char) pattern classification and the character pattern recognition (wang). Figure 6 shows, for each example, the minimum number of learning steps required to store all the presented patterns (the sign of each network output for each pattern being correct), in the case of an unchanged topology with unrestricted (unr.) and precision-limited (lim.) computations and in the case of augmented networks mapped on precision-limited cells with a maximum connectivity limited to 8, 4 and 2 (c. 8, c. 4 and c. 2 respectively), this number of learning steps being divided by the number of steps needed for the unchanged topology with precision-limited computations.

[Figure 6: relative number of learning steps (0.0 to 2.5, normalized to the precision-limited unchanged topology) for the unr., lim., c. 8, c. 4 and c. 2 configurations of the xor, lrrh, num, char and wang examples.]
Since the weights are updated in the backward pass of every learning trial, the back-propagation algorithm implemented on the cellular array cannot find a solution for about 50 % of the randomly generated initial sets of weights used to test the exclusive-or and wang networks. Oscillations, and thus incorrect training of the mapped neural network, were found in about 30 % of the cases for the Little Red Riding Hood, numeral and character pattern classification networks. These oscillation rates drop to about 25 % and 10 % respectively when the weight changes are accumulated in each cell over a set of prototype patterns (presented in a random order). A cell then accumulates both the activation and the weight change over the complete set of patterns. Each pattern is forwarded and backwarded and, once this is done, the host computer sends to all the cells a special message asking for the weight update; each cell sends back a completion message. A sketch of this accumulation is given below. This feature demands that the control of the learning process be transferred to the host computer, which does not seem satisfactory, because the communication between the array and the outside world is a bottleneck, even though the complexity of the processing part of the cell is not increased too much: the control part needs a few thousand transistors more, the operative part remaining the same.
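The accumulation scheme can be sketched as follows (hypothetical names; the learning-rate scaling and the cell's fixed-point formats are omitted for brevity):

```c
#include <stdint.h>

/* Accumulated weight update, as opposed to a per-trial update. */
typedef struct {
    int32_t weight;
    int32_t dw_accum;  /* weight change accumulated over the pattern set */
} synapse_t;

/* Backward pass for one pattern: do not touch the weight yet,
   only accumulate its change. */
void backward_contribution(synapse_t *s, int32_t gradient,
                           int32_t activation) {
    s->dw_accum += gradient * activation;
}

/* Run when the host's special "weight update" message arrives,
   once per set of prototype patterns; a completion message is
   then sent back to the host. */
void apply_weight_update(synapse_t *s) {
    s->weight  += s->dw_accum;
    s->dw_accum = 0;
}
```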
4.2. Speed performances
We have studied the performances of the cellular array for all the previous examples, xor, lrrh, num, char and wang, and also for:
NETtalk: a 3-layered network with 203 units in the input layer, 60 in the hidden layer and 26 units in the output layer, that is 13826 synaptic weights. This network learns to transform written English text into a phonetic representation: it "learns to read aloud" [9].
During the simulations, we have assumed that the messages are processed by the outside world at the array speed and that the routing part transmits a message in 40 ns from buffer to buffer. All the results presented herein were obtained from the simulation of 101 recalls of the same pattern (for measuring the recall performances), or 1 recall, then 100 learnings and then 1 last recall of the same pattern (for measuring the learning performances), for all the examples but NETtalk, for which we have only simulated 2 recalls, or 1 recall followed by 1 learning and 1 last recall of the same pattern. Table 1 shows the recall speed of the cellular array in millions of connections per second (MCPS) and its learning speed (one forward and one backward pass) in millions of connection updates per second (MCUPS) for the 6 tested examples and for a cell connectivity limited to 2, 4, 8, 16, 32 and 64.

Recall speed (MCPS):
connect.      2       4       8      16      32      64
xor           3.2     3.0     2.9     2.8     2.4     2.5
lrrh          9.8     8.2     4.9     5.0     4.4     3.2
num         102.2    92.0    43.0    38.3    27.9    25.7
char        173.3   170.9   108.7    68.3    38.1    32.2
wang        161.8   177.9   132.7    88.2    58.7    30.4
NETtalk        NA   151.6   335.6   273.8   164.8    72.5

Learning speed (MCUPS):
connect.      2       4       8      16      32      64
xor           0.2     0.4     0.4     0.4     0.4     0.3
lrrh          0.5     0.8     0.9     1.0     0.8     0.9
num           4.5     8.3     7.1     5.6     5.0     4.7
char          3.9    15.9    16.2    10.1     5.2     4.9
wang          3.4     5.5    16.0     9.3     7.5     5.0
NETtalk        NA     8.5    51.5    38.8    27.0     9.2

Table 1

To be run, NETtalk requires a 167 x 167, 97 x 97, 65 x 65 or 59 x 59 array with respectively 27800, 9317, 4175 and 2112 cells assigned to neurons, depending on the connectivity of the basic cell used to build the array: 2, 4, 8 and 16 respectively. For more details, see [2]. The asynchronism of the cells allows the pipe-lining of multiple recalls. This is achieved by internally resynchronizing the cellular array: for each message processed, the cell replies back to the sender with an acknowledgement message
"message received and processed, you can send the next message, I am ready". The cell sends the acknowledgement messages when its internal data has become obsolete, that is when it has completed its processing. With this internal resynchronization, one layer out of two processes the data of two successive recalls. The performances and the way they were obtained are presented in [2].
5. COMPARISON WITH OTHER MACHINES
The recall performances obtained for NETtalk can be compared with those of other machines [7]. The 65 x 65 cellular array, given a cell has a connectivity of 8 (4175 of the cells being neural units), rated at 335.6 MCPS, is 4 times faster than the most powerful non dedicated machine, the 20-node Warp (80.0 MCPS), 12 times faster than SAIC's Delta Board and HNC's Anza-Plus simulator (27.5 MCPS), 52 times faster than the 16K Connection Machine 1 (6.5 MCPS with one synapse per node) and more than 2600 times faster than a Vax 780 (0.125 MCPS). The back-propagation algorithm, implemented on the 64K CM-1, runs at 32.5 MCPS for larger networks, which is still 10 times slower than the cellular array. This array, rated at 51.5 MCUPS in learning mode, is nearly twice as fast as the Warp (32.0 MCUPS), 4 times faster than the 64K CM-1 (13.0 MCUPS), 20 times faster than the 16K CM-1 (2.6 MCUPS), 5 times faster than Delta and Anza-Plus (11 MCUPS) and more than 1000 times faster than a Vax 780. This learning speed must be handled with caution: the learning trials are not equivalent, and a complete learning with the cellular array may require more trials than with the other machines.
6. CONCLUSION
The use of fine grain architectures can help in keeping the initial topology with its underlying massive parallelism and robustness features, which are often lost with other kinds of parallel machines. If a few cells of the array miscompute, only a few links will be erroneous and the behaviour of the mapped neural network will remain acceptable. Our research team is concurrently developing a general purpose parallel accelerator based on the idea of this cellular architecture. The processing part of each cell is composed of one 8-bit processor and 256 bytes of memory [8]. Some simulations done for this accelerator show that it can simulate the neural network applications presented in this paper with an average activity more than twice the best activity obtained with the application specific accelerator, but at a speed 10 times slower. This indicates that the specificity of the cell is an important factor for the fine grain parallelism to be efficient in time critical applications such as neural networks, which have to deal with important amounts of data. The results presented herein indicate that this kind of bidimensional array of application specific asynchronous cells communicating by message transfers can be a good candidate for processing back-propagation neural networks.

REFERENCES
[1] Faure, B., Mazaré, G., A VLSI Asynchronous Cellular Architecture Dedicated to Multi-layered Neural Networks, in: Personnaz, L. and Dreyfus, G. (eds.), Neural Networks from Models to Applications (IDSET, Paris, 1988) pp. 710-719.
[2] Faure, B., Mazaré, G., A VLSI Asynchronous Cellular Architecture for Neural Computing: Functional Definition and Performance Evaluation, in: Proceedings of IEA/AIE-90, Charleston, SC, July 1990.
[3] Jones, W. P., Hoskins, J., Back-propagation: A Generalized Delta Learning Rule, BYTE, vol. 12, no. 10, October 1987, pp. 155-162.
[4] Karabernou, M., DEA de micro-électronique, Université de Grenoble, France, June 1989.
[5] Le Cun, Y., Modèles Connexionnistes de l'Apprentissage, Thèse d'informatique, Univ. de Paris 6, France, June 1987.
[6] Mhiri, M., DEA de micro-électronique, Université de Grenoble, France, June 1989.
[7] Pomerleau, D. A., Gusciora, G. L., Touretzky, D. S., Kung, H. T., Neural Network Simulation at Warp Speed: How We Got 17 Million Connections per Second, IEEE ICNN, San Diego, CA, July 1988.
[8] Rubini, P., Karabernou, M., Payan, E., Mazaré, G., A Network with Small General Processing Units for Fine Grain Parallelism, accepted at the International Workshop on Algorithms and Parallel VLSI Architectures, Pont-à-Mousson, France, June 1990.
[9] Sejnowski, T. J., Rosenberg, C. R., Parallel Networks that Learn to Pronounce English Text, Complex Systems, no. 1, 1987, pp. 145-168.
[10] Wang, S., Réseaux Multicouches de Neurones Artificiels : Algorithmes d'Apprentissage, Implantations sur Hypercube, Applications, Thèse d'informatique, INP Grenoble, France, September 1989.