MARBLE: an asynchronous on-chip macrocell bus



Microprocessors and Microsystems 24 (2000) 213–222 www.elsevier.nl/locate/micpro

MARBLE: an asynchronous on-chip macrocell bus W.J. Bainbridge*, S.B. Furber Department of Computer Science, The University of Manchester, Oxford Road, Manchester M13 9PL, UK Accepted 29 November 1999

Abstract

This paper presents MARBLE, the Manchester Asynchronous Bus for Low Energy, a two channel asynchronous micropipeline-style VLSI macrocell bus. In addition to basic bus functions, MARBLE supports bus-bridging and test access, demonstrating that all the functions of a high speed macrocell bus can be implemented efficiently in a practical, fully asynchronous design style. MARBLE is used in the AMULET3i asynchronous microprocessor system to connect the CPU core and DMA controller to RAM, ROM and peripherals. It exploits pipelining of the arbitration, address and data-cycles with a protocol based on split-transfers to meet the performance needs of such a system. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Chip-level; System-on-chip; Asynchronous; Macrocell-bus

1. Introduction

As technology advances, the number of transistors on a chip increases dramatically and re-use of system components becomes essential in order to meet short time-to-market requirements. The system-on-a-chip approach, where one microchip contains a number of subsystems (possibly from different vendors and design houses), is thus gaining popularity, especially in the area of embedded systems.

The system-on-a-chip approach requires a standard means of interconnecting the different macrocell components, and a bus is a common solution to this problem. Example solutions include AMBA [1,2] from ARM Limited, who provide their processor macrocells with interfaces for direct connection to such a bus, and the OMI PI-Bus [3]. Standardisation of system-on-chip issues is being addressed by the Virtual Socket Interface (VSI) Alliance [4].

Asynchronous VLSI design has had a major revival in recent years and offers promise in areas of key importance such as high performance [5], low power [6], improved robustness [7] and a better electromagnetic emissions profile [8]. The AMULET2e microprocessor [8,9] from Manchester University demonstrated that asynchronous design can provide similar power-efficiency to its equivalent synchronous ARM chip, but asynchronous design offers benefits in the simplicity of its power management (no clock gating or shutdown is required) and in its emissions profile, since the radiated energy is not concentrated in clock harmonics.

To use asynchronous macrocells in system-on-a-chip designs it is necessary to have a strategy for interconnecting them. Clearly they could be connected within the existing synchronous design framework, but this means losing much of the potential benefit offered by the asynchronous design style. A much better solution is the development of an asynchronous bus, as presented here. This allows the asynchronous macrocells to be connected fully asynchronously, with synchronisations performed when and where necessary to connect synchronous macrocells to such a system.

The scope of application of an asynchronous bus is not restricted to wholly asynchronous subsystems. In fact, a typical system of the future may well be constructed from both asynchronous and synchronous macrocells, interconnected seamlessly over an asynchronous backbone such as the bus presented in this paper, with each local macrocell (of which there may be many) running at its own independent frequency.

This paper presents an overview of a typical asynchronous macrocell-based embedded system in Section 2, showing the need for an asynchronous macrocell bus. Section 3 introduces the low-level communication channel used in asynchronous VLSI design, and looks at how it was extended for use in MARBLE to operate as a bidirectional tristate-bus channel connecting multiple senders and receivers (with at most one of each active upon the channel at any one time). The issues addressed include arbitration for bus

* Corresponding author. Tel.: +44-161-275-6844; fax: +44-161-275-6236.

0141-9331/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0141-9331(00)00075-2


W.J. Bainbridge, S.B. Furber / Microprocessors and Microsystems 24 (2000) 213–222

Fig. 1. The AMULET3i system.

access, the signalling protocol, data-validity, tristate drive handover and pipelining to improve performance. Section 4 presents the higher level constraints used in MARBLE, a two-channel split-transfer system-bus, to ensure correct operation whilst permitting advanced bus-features such as deferred-transfers (necessary for bus-bridging), atomic transfer sequences and error scenarios. Low-level implementation details are omitted as these are addressed more extensively elsewhere [10]. Section 5 presents performance results for a 0.35 µm implementation of MARBLE, as used in the AMULET3i telecommunications controller presented in Section 2.

2. Asynchronous VLSI design: a MARBLE based system

As an example system-on-chip design using the MARBLE asynchronous macrocell bus we look at the AMULET3i telecommunications controller system. This contains an AMULET3 processor core [11] (capable of executing the ARM v4T instruction set architecture [12]), 8 Kb of dual-port RAM, a 16 Kb ROM, a DMA controller and an external memory interface and test controller. These are all implemented asynchronously using a four-phase micropipeline-style design methodology [13] and are

connected together as shown in Fig. 1 through the MARBLE macrocell bus presented here. The system also has a bridge to a synchronous subsystem which contains more RAM and custom hardware for handling telecommunications operations. Fig. 1 shows the function of each of the devices attached to the bus. In total this system has four initiators (capable of starting, but not responding to, a transfer on the bus) and seven targets (capable of responding to a transfer that has already begun on the bus, but not capable of starting a transfer):

• Instruction Bridge: an initiator used by the Harvard-architecture processor core to fetch instructions;
• Data Bridge: acts as an initiator when the processor is performing data-accesses over the bus; also behaves as a target when other MARBLE initiators need to access the processor-local RAM. Requires an internal arbiter to resolve conflicts when accessing the processor-local RAM;
• DMA Controller: acts as an initiator when performing transfers between peripherals and memory, but must also function as a target when being programmed by the processor core;
• Off-Chip Interface: normally functions as a target allowing access to off-chip peripherals and memory, but can also operate as an initiator to provide external test-access to the other macrocells;
• Control/Test Registers: these allow processor optimisations and features such as the branch-target cache and low-power halt-modes to be controlled, in addition to providing test access to the processor's branch-target cache;
• ADC: an analog to digital converter;
• ROM: an on-chip ROM for storing program-code and self-test routines;
• Synchronous Peripheral Bridge: allows connection of the asynchronous island to a synchronous subsystem.

3. A multi-source, multi-destination bus channel

In asynchronous VLSI, all data is passed on channels either using a dual-rail encoding of each bit, or using a single-rail encoding (as for synchronous systems) with a request signal. Both techniques use an acknowledge signal in the backwards direction to indicate when the receiver has accepted the transmitted data. Some aspects of the implementation of a dual-rail bus channel have been considered previously [14], but although the dual-rail approach gives a simpler delay model, the single-rail design style is much more cost-effective. Consequently, the MARBLE implementation presented here is based upon a single-rail approach.

Fig. 2. The asynchronous VLSI channel.

3.1. Single rail signalling and data validity

In asynchronous VLSI design the channel is normally used to pass data in only one direction, with the communication always performed between the same two units. Such a channel would be arranged as shown in Fig. 2. The data is always transferred on the channel between two defined signalling events; which two events are used determines the protocol and data-validity of the channel. Some of the more common possibilities include:

• 2-phase signalling (Fig. 3a): the request event signals that data is valid, the acknowledge event indicates that data has been received.
• 4-phase early (Fig. 3b): data is transmitted between the request rising and the acknowledge rising.
• 4-phase broad (Fig. 3c): data is transmitted between the request rising and the acknowledge falling.

In all of the above, data is always transmitted in the same direction as the request event in a push action. Fig. 3d shows the 4-phase pull protocol in which data is transmitted in the same direction as the acknowledge event in a pull action. Each of these signalling schemes has its own merits [15] for everyday use, but for a bus-channel the issues are somewhat different because each transfer may be between different units, and in a different direction. This affects the choice of signalling protocol used in three major areas.

3.1.1. Signalling wire drive control

Consider the channel shown in Fig. 4, connecting three logical devices, A, B, and C (a logical device is an interface onto the channel). Each logical device may be either an initiator (capable of starting, but not responding to, a transfer on the channel) or a target (capable of responding to a transfer that has already begun on the channel, but not capable of starting a transfer). Further, each transfer will have a device acting as a data-sender supplying data onto

Fig. 3. Two- and four-phase signalling.


Fig. 4. Bus-channel wiring.

the channel, and another device acting as a data-receiver, accepting the data from the channel. There are thus two sensible alternative schemes for controlling which device will drive which signalling wire and when:

• sender drives request, receiver drives acknowledge;
• initiator drives request, target drives acknowledge.

The first option allows data always to be pushed by the sender, but requires that every device on the bus is capable of driving either request or acknowledge depending on which direction data is being transferred, giving a complex circuit for determining who will drive which signalling wire. The second option gives simpler circuits due to the isolation of the request and acknowledge drive-control functions, even though it requires that the channel allows the initiator to push data to the target, or pull data from the target, depending on the necessary transfer direction. This second approach is the one used in MARBLE.

3.1.2. Signalling handover

The 2-phase signalling and 4-phase signalling protocols described earlier are both suitable for use in a bus-channel, but an extra consideration is the complexity of the logic used to generate the channel request and acknowledge signals (the MERGE blocks in Fig. 4). This choice is affected by the protocol because:

• 2-phase signalling causes the signal lines to flip state after each cycle.
• 4-phase signalling always leaves the signal lines in the same state as before the cycle.

For 4-phase signalling the channel-request can be formed using an or-gate to merge the individual device-requests (and likewise for the acknowledge), whereas for the 2-phase situation, an xor-gate or a more complex circuit with internal state is required.

The wired-or approach used to combine handshaking signals in some off-chip buses [16,17] is not feasible for low-power CMOS implementations. An alternative to the gated-or approach is to use tristate drivers to couple the local device-request to the distributed channel-request (and likewise for acknowledge). This gives a simpler channel-routing requirement (req and ack run with the data wires with no gates in the path), but suffers from the problem that the signal lines would be undriven between transfers and during handovers, and so would require charge retention to maintain their state. They would thus be very susceptible to the effects of noise.

Fig. 5. Four-phase bus-channel protocol.

3.1.3. Data validity scheme

The general bus-channel protocol used in MARBLE is a combination of the early-push and the pull variants of the 4-phase signalling protocols shown in Fig. 3b and d. Data is thus transferred as shown in Fig. 5 using the four phases for:

• req-rising → ack-rising: push lines stable;
• ack-rising → req-falling: pull lines stable;
• req-falling → ack-falling: push lines not driven;
• ack-falling → req-rising: pull lines not driven.
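The four phases listed above can be illustrated with a small behavioural sketch. This is not the MARBLE circuit; it only assumes that the levels of the request and acknowledge wires between two events identify the phase, and the names `PHASES` and `four_phase_cycle` are hypothetical.

```python
# Minimal behavioural sketch of the four-phase bus-channel protocol.
# Assumption: the (req, ack) levels held between two signalling events
# identify the phase and therefore what is guaranteed about the data lines.
PHASES = {
    (1, 0): "push lines stable",      # req-rising  -> ack-rising
    (1, 1): "pull lines stable",      # ack-rising  -> req-falling
    (0, 1): "push lines not driven",  # req-falling -> ack-falling
    (0, 0): "pull lines not driven",  # ack-falling -> req-rising
}

def four_phase_cycle():
    """Walk one complete handshake starting from the idle state req=0, ack=0."""
    req, ack = 0, 0
    trace = []
    for event in ("req-rising", "ack-rising", "req-falling", "ack-falling"):
        if event.startswith("req"):
            req ^= 1  # the initiator toggles request
        else:
            ack ^= 1  # the target toggles acknowledge
        trace.append((event, (req, ack), PHASES[(req, ack)]))
    return trace

for event, levels, guarantee in four_phase_cycle():
    print(f"after {event:11s} req,ack={levels}: {guarantee}")
```

Because the handshake is return-to-zero, the trace ends with both wires low again, which is what allows the per-device requests to be merged with a simple or-gate.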

The data-lines of the channel may be used to transfer data in different directions in different cycles, as shown for the MARBLE data-channel in Fig. 5, but in any one cycle a data line can only be used for a transfer in one direction. The direction may be indicated by signals which are always pushed, e.g. the opcode-bit (indicating whether to read or write) passed on each channel in MARBLE, or by information passed in a previous cycle. This protocol guarantees that there is always one signalling-phase of the bus when the bus-lines are driven by

neither driver in a handover. Although this means that these lines are susceptible to the effects of noise as before, this does not matter here because the state of these signals is irrelevant during this part of the communication protocol and the lines will always be driven when they are being monitored.

3.2. The bundling constraint and crosstalk

Data is driven onto a bus-channel using a tristate-driver which is enabled at the correct point in the protocol (as indicated in the previous section). For correct operation, the data on the channel must be stable when the receiver is using the data, requiring a one-sided timing constraint to be met: the delay in the signalling line that is indicating data validity (request for push, acknowledge for pull) must be no less than the delay in the data-lines.

The value of the delay is found by simulation, which must also take account of the effects of crosstalk on the bus-wires. This effect can cause a significant slow-down of some signals in the bundle when wires are routed in close proximity (as with the hand-routed wires in the AMULET3i subsystem). This is often the case, since hand-routing is still much more area efficient than automated routing for long buses. SPICE simulations of MARBLE for the 0.35 µm technology used in AMULET3i showed that the crosstalk variation in the delay for a 10 mm bus with eight devices attached was ±1.5 ns for the most dense arrangement using all three metal layers. By changing the layout geometry of the wires this was reduced to ±0.15 ns at a cost of three times the silicon area by only using the upper and lower layers with no overlap of wires on these layers.

Fig. 6. Arbitrated call block.

Fig. 7. Arbiter trees.

3.3. Arbitration

A bus with more than one initiator is a multi-master bus. In order to avoid the corruption of data or signalling failure due to multiple initiators accessing the bus at the same time, such a bus must provide an arbitration mechanism to ensure that only one initiator uses the bus at a given point in time.

3.3.1. Centralised or distributed arbitration

Distributed arbitration of the form used in off-chip buses such as SCSI [17] or Ethernet is not suitable for low-power CMOS on-chip systems. This is because these systems use arbitration techniques based on detecting a drive-clash or a higher priority device, waiting a period and then retrying, both of which waste power. Consequently synchronous on-chip buses (e.g. AMBA [1]) and some off-chip buses such as PCI [18] use a centralised arbitration system with request and grant handshaking signals (typically operating using a 4-phase protocol) connecting the device to the central arbiter. The same argument gives good reason for using a centralised arbitration system in an asynchronous CMOS VLSI system. Such a system is built around instances of the MUTEX (mutual exclusion) structure as proposed by Seitz [19].

3.3.2. Tree-arbiter

Including the MUTEX in a wrapper as shown in Fig. 6 creates a tree-arbiter element (or arbitrated-call block) which provides one handshake on the output channel for each handshake on an input channel. (The asymmetric inverting Muller-C elements [19] used in this circuit have the behaviour that their outputs go low when the symmetric input and the + inputs are all high, and go high when the symmetric input is low. For any other input combination, the output retains its previous state.) The implementation shown, a technology mapped version of a design by Josephs and Yantchev [20], guarantees that one complete handshake is seen on the output for each handshake on the input, allowing these blocks to be combined into tree-structures as shown in Fig. 7.
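The behaviour of the asymmetric inverting Muller-C element described in the parenthesis above can be written as a small state-holding model. This is a behavioural sketch only, not the gate-level circuit of Fig. 6; the class name, the method name and the reset value of the output are assumptions made for illustration.

```python
class AsymmetricInvertingC:
    """Behavioural model of an asymmetric inverting Muller-C element.

    Assumption: the output resets high, matching a symmetric input that
    is low at start-up.
    """

    def __init__(self, initial_output=1):
        self.out = initial_output

    def update(self, symmetric, plus_inputs):
        """Apply new input levels and return the (possibly held) output."""
        if symmetric and all(plus_inputs):
            self.out = 0      # symmetric input and all '+' inputs high: output falls
        elif not symmetric:
            self.out = 1      # symmetric input low: output rises
        # any other combination: the output retains its previous state
        return self.out


c = AsymmetricInvertingC()
print(c.update(1, [1]))  # all inputs high, output goes low
print(c.update(1, [0]))  # '+' input drops, output is held
print(c.update(0, [0]))  # symmetric input low, output goes high
```

The '+' inputs therefore only take part in the falling transition of the output, which is what distinguishes the element from a symmetric C-element.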
Changing the shape of the tree, using different combinations of two-input arbitrated-call blocks, allows the latency and bandwidth allocated to each branch to be adjusted. Alternative arbitration schemes and tree-arbiter implementations have been presented elsewhere [20] but they are all based upon the same principle of enclosing a MUTEX structure in some form of handshake-wrapper to create a fair or prioritised arbitration network.

3.3.3. Hidden arbitration

In a busy system it is possible to hide the latency of the


Fig. 8. MARBLE bus information flows.

arbiter by performing the arbitration for the next channel access in parallel with the current cycle. The grant signal is thus an early-grant indicating which device will take ownership of the channel when it next becomes idle. This means that the arbitration network can be optimised to give a fast grant to the first contender for an empty bus since all subsequent arbitrations whilst the bus is in use are hidden by the current transfer.

3.4. Packet routing

Once an initiator has arbitrated for the channel, a cycle commences and the correct target has to respond. This selection of the target is performed by decoding part of the packet on the bus (e.g. the address). If performing a broadcast operation then multiple targets would be activated, but typically the decode operation would activate only one target. The action taken in the error situation

(where the decode reveals that the packet is not destined for any of the targets) is described later. The location of the decode-logic has an effect on the performance and complexity of the system:

• performing the decode centrally is efficient in terms of hardware;
• letting each target perform a decode operation to check if the packet is for itself gives a modular system and simplifies the wire-routing of the channel;
• performing a decode at the initiator and then transmitting the result (indicating which target should respond) as a further field within the packet allows the decode to be performed in parallel with the initiator's arbitration for access to the channel, potentially giving a better performance.

The implementation of MARBLE in the AMULET3i


subsystem uses the first approach to route address-packets, and the second approach to route data-packets.

3.5. Pipelining

The benefits of pipelining in microprocessor design are well known [21], and similar benefits have been observed in asynchronous processors [22] and synchronous buses [1,3]. Asynchronous buses present the same opportunity to pipeline the arbitration for the next access to a channel with the activity of the current cycle on the channel by adding a pipeline latch at every port where a packet is taken from the channel. In MARBLE these pipeline latches are normally-closed to minimise system power consumption, and they prevent slow edges and glitches from propagating in two ways:

• changes in the levels of bus-wires are only propagated into one device (the one taking the packet off the bus) instead of into every device connected to the bus;
• the slow edges on the bus signal-lines are not propagated to the devices; instead they see a fast, clean edge when the latch opens.

Note that the performance impact of using normally-closed latches instead of normally-open latches is of the order of 2% of the total cycle time of the channel for the AMULET3i system.

4. The 2-channel split-transfer system bus

A system bus could be built using just one channel, in a similar manner to many backplane system-buses. However, since on-chip macrocells typically have separate wires for address and data, and peripherals often require or provide these at different times, the bus-architecture presented here uses two of the channels described previously. Information flows around the bus-subsystem as shown in Fig. 8 and explained in the following sections of this paper.

4.1. Basic transfer support

The address and control channel is used to send a 32-bit address from an initiator to a target.
Other control information included in the packet specifies the size of the transfer (byte, half-word or word), the opcode (read or write) and a tag indicating which initiator the transfer originated from (used to route packets on the data-channel). Two bits of the address-control bundle are used to convey a privilege level. This code indicates if the operation is an instruction fetch or a data access, and if it is a user-mode or supervisor transfer. The transmission of this privilege information facilitates the use of a simple bus-protection unit. Three bits of the address bundle are used to transmit the relationship of the current address to the previous address or a prediction (and later confirmation) of the relationship


between the current address and the subsequent address. This gives an indication of whether the addresses are sequential, within the same (implementation specific) memory region, or unrelated. Such information is primarily of use to the DRAM controller in the system (which is part of the external memory interface in the AMULET3i chip) since it allows fast paged mode accesses to be used with minimal extra hardware cost.

The data and error channel carries two pieces of information per transfer: the data to be read or written, and an error-status used to indicate the success or failure of the transfer (see below). The system overhead on this channel consists of a copy of the tag and the opcode-bit sent on the address channel. These are used to route the data-transfer back to the correct initiator.

4.2. Bus-errors

The address is used to route the packets on the address-control channel, using a centralised address-decoder. Since the address-map is not fully occupied, any unmapped addresses are detected in this decoder. Such error situations, known as bus-errors, must be trapped and signalled (using a bus-error operation) to an initiator so that appropriate system-dependent recovery actions can be taken.

In MARBLE the bus-error operation consists of the central bus controller and address decoder acting as a target and acknowledging the address-cycle. It then also has to perform the corresponding data-cycle to signal the error-condition using the error field of the data/error channel. This scheme allows the support of precise exceptions (as opposed to using an interrupt to signal the error) and supports simple systems that do not have a full MMU. Other errors, such as errors from the bus-protection unit (which could be a part of the central control) or errors detected by a target device (e.g. attempting to write to a read-only peripheral), are signalled using the same error signal on the data/error channel.
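The bus-error mechanism can be sketched as a central decoder that falls back to answering the transfer itself when no target is mapped. The address map, the region sizes and all names below (`Region`, `ADDRESS_MAP`, `decode`, `address_cycle`) are hypothetical and chosen only for illustration; they are not the AMULET3i memory map.

```python
from typing import NamedTuple, Optional

class Region(NamedTuple):
    base: int
    size: int
    target: str

# Hypothetical, sparsely populated address map (not the AMULET3i map).
ADDRESS_MAP = [
    Region(0x0000_0000, 0x4000, "ROM"),
    Region(0x1000_0000, 0x2000, "RAM"),
    Region(0x2000_0000, 0x0100, "DMA"),
]

def decode(address: int) -> Optional[str]:
    """Centralised decode: return the target that owns the address, if any."""
    for region in ADDRESS_MAP:
        if region.base <= address < region.base + region.size:
            return region.target
    return None

def address_cycle(address: int, opcode: str, tag: int) -> dict:
    """Return a summary of the data/error-channel cycle that ends the transfer."""
    target = decode(address)
    if target is None:
        # Unmapped address: the central controller acts as the target,
        # acknowledges the address-cycle and performs the data-cycle
        # itself with the error field set.
        return {"responder": "bus-controller", "tag": tag,
                "opcode": opcode, "error": True}
    return {"responder": target, "tag": tag, "opcode": opcode, "error": False}

print(address_cycle(0x0000_0010, "read", tag=0))
print(address_cycle(0x3000_0000, "read", tag=1))
```

In both cases the tag and opcode are copied onto the data/error channel, so the initiator receives exactly one data-cycle per address-cycle whether or not the address was mapped.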
Activity on the data channel will never generate a bus-error since the data-cycle is always destined for an initiator which is known to exist (since it started the transfer).

4.3. Deferred-transfers

The inclusion of a hardware-retry mechanism within the protocol of a multi-master bus, where the transfer is not completed but is retried automatically by the initiator some time later, allows bridging between buses. To illustrate the necessity of this feature, consider the case where device A on bus 0 tries to communicate with device C on bus 1 at the same time as device D on bus 1 tries to communicate with device B on bus 0. Both transfers require the use of the bridge and the two buses, so the bridge must reply with a defer command to one initiator to allow the other transfer to complete, otherwise the system would simply deadlock. More generally, the defer response is for use when a


device that can act as both an initiator and a target is being accessed as a target on the bus, but is unable to perform in this role until it has first performed some action on the bus as an initiator. This use of defer does incur some degree of power wastage due to the retries, but provided that the arbiter is fair (as is the arbitrated-call tree presented earlier), the worst case number of retries can be predicted since the only reason for deferring is that another initiator wants to use the bus, so once that initiator has completed its activity the deferred initiator will eventually be serviced.

Other buses commonly use defer for additional purposes:

• defer is the basic primitive necessary for the support of split transfers by simpler buses, which also require sequence tagging information which can be conveyed externally to the bus;
• defer allows the bus to be released after starting a transfer to a device that takes a long time to complete, so other initiators can use the bus whilst deferring the completion of this transfer. This provides reduced latency to these devices and improves overall system performance;
• when a device is busy (possibly in a recovery cycle) and cannot accept another transaction, in a system without defer any attempt to access this device across the bus would result in the bus being stalled with the attempted transfer occupying it. Defer allows the initiator to be told to retry the command later, i.e. defer starting the transfer.

All of these features are supported directly by MARBLE through the use of the pipeline-latches and the split-transfer protocol, in a more power efficient manner than when using defer to achieve the same effects.

4.4. Atomic transfer sequences

The duration of the granted period of the request-grant handshake performed for every arbitration can be prolonged to extend over multiple bus-cycles, thus achieving atomicity of transfers with little extra complexity in the bus interface controllers.
The lock signal is included in the address channel bundle to indicate to the target when a transfer is atomic with the following transfer. Targets capable of deferral should not defer any part of an atomic sequence except for the first transfer of that sequence, since this would break the atomicity.

4.5. Flow-control/packet-reordering

Section 3.5 described the use of latches at all ports where packets are taken from the bus, as in Fig. 8, to allow arbitration to be pipelined with transfer cycles. Fig. 8 also shows the decoupling FIFO between the target address and data interfaces which provides full decoupling between corresponding address channel and data channel cycles such that they are only constrained by the causal relationship

that a data-cycle will commence after the start of an address cycle. These latches allow address packet n to be latched at the target, freeing the address channel to allow an initiator to send packet n + 1 (this may be the same initiator or a different one). At the same time data packet n can be transferred. Pipelining of the address and data-cycles of different transfers is thus allowed but not enforced.

The disadvantage of this lack of synchronisation between these cycles is that it introduces problems for the control of bus handover between initiators and allows the following three problem scenarios:

• if the initiator and/or the target has a FIFO connected to the pipeline latch then many address cycles could be performed, requiring the system to keep track of the outstanding data-cycles;
• the initiator could perform its first address transfer to a target, say A, and then go on to perform its next address transfer to a different target, say B. Now if target B is faster than target A, then the data-cycle for the transfer to target B may occur before the data-cycle to target A. If both transfers were reads, then the data would be returned in the wrong order;
• after the first initiator's address cycle, a different initiator can perform its address cycle (possibly to the same target). The data-cycles can happen in either of two orders. If both transfers are to the same target, then the data ordering is important. If the transfers are to different targets then it does not matter which order the transfers occur in.

Simple in-order buses generally cannot cope with any of these complex behaviours and therefore enforce a rigid address-data interlocking scheme. Those in-order buses that allow split-transfers can usually only do so by adding an extension to the underlying bus and passing other signals around the side of the bus, as opposed to having inherent support. Such a technique can be used to implement split transfers on AMBA [1] by using the retract command.
Table 1
Module sizes

Module             No. of gates
Initiator          570
Target             600
4-way arbiter      40
7-way arbiter      100
Address decoder    90

MARBLE takes an alternative approach, using a split-transfer as its only primitive transfer mode with one major restriction to simplify the control circuits: all data-cycles are started by the target. The target device thus always receives or delivers data in the same order as it received address packets, and so data-reordering is never necessary at the target.

MARBLE restricts the number of outstanding cycles permitted for each initiator using a token-flow around the loop including the throttle unit, the interface units and the FIFO between the target address and data-interface units shown in Fig. 8, where each token is a sequence-code used for reordering purposes if necessary. In the AMULET3i subsystem the number of tokens allowed in this loop (regulated by the throttle) is a maximum of one (hence no wires are necessary to convey the sequence code). This means that an initiator cannot begin its next address cycle until its current data-cycle has commenced, and prevents the need for data-reordering at the initiator. These features permit the third scenario above, but prevent the first and second. All that is required to permit the first two scenarios above is a reorder capability at the initiator (which can be optimised for the case where data is added/removed in order) and some additional bus-lines for the passing of the sequence-code which flows around the token-loop with each address/data packet.
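The token-flow throttle can be modelled, very roughly, as a pool of sequence-codes owned by one initiator. The sketch below is an abstraction under stated assumptions, not the handshake circuit of Fig. 8: the class and method names are hypothetical, and the token is assumed to be released as soon as the matching data-cycle commences.

```python
from collections import deque

class Throttle:
    """Limits the number of outstanding transfers for a single initiator."""

    def __init__(self, tokens=1):
        # Each token is a sequence-code; with one token no code needs wires.
        self.free = deque(range(tokens))
        self.outstanding = set()

    def start_address_cycle(self):
        """Take a token and return its sequence-code, or None if none is free."""
        if not self.free:
            return None  # the initiator must wait before its next address cycle
        code = self.free.popleft()
        self.outstanding.add(code)
        return code

    def data_cycle_commenced(self, code):
        """Return the token once the corresponding data-cycle has started."""
        self.outstanding.remove(code)
        self.free.append(code)


throttle = Throttle(tokens=1)           # the AMULET3i setting described above
first = throttle.start_address_cycle()  # address cycle n is issued
blocked = throttle.start_address_cycle()  # address cycle n + 1 has to wait
throttle.data_cycle_commenced(first)    # data-cycle n starts, token returns
second = throttle.start_address_cycle()
print(first, blocked, second)
```

With a single token the initiator never has two transfers in flight, so data can never come back out of order; allowing more tokens would need the sequence-code to travel with each packet so that the initiator could reorder the results.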

5. Implementation

SPICE and Timemill simulations of the MARBLE implementation used in the AMULET3i chip show the control circuits cycling at around 8 ns per cycle with minimal bus-wiring for the fastest inputs of asymmetric arbitration trees. The actual bus-wiring of the chip, and its associated bundling delays, introduce a 3 ns slow-down. Each arbitrated-call block passed through by the arbitration handshake introduces an additional slow-down of around 0.5 ns. The final system thus cycles at up to 80 MHz with a read-latency of around 14 ns (excluding the device access time). MARBLE thus offers a performance between the equivalent synchronous buses such as CoreConnect-OPB (at 50 MHz in the PowerPC405GP [23] on 0.25 µm CMOS) or AMBA-ASB and the higher performance, non-tristate, separate read/write datapath members of the CoreConnect and AMBA families which typically operate at around 100 MHz [23,24].

Table 1 shows the sizes of the MARBLE interfaces and bus-control elements used in the AMULET3i system (assuming 1 gate = 4 transistors). Approximately one-third of the initiator and the target size comes from the data-path pipeline latches and a further one-sixth of the size is from the data-path tristate buffers.

6. Conclusions

A solution to the problems of asynchronous macrocell interconnection has been presented. This solution shows how multi-point asynchronous communication channels can be implemented to give a good performance in submicron technology, and how standard asynchronous arbitration circuits can be employed to control access to such channels.

MARBLE shows the clear advantages of using a split-transfer scheme in an asynchronous environment, namely simple control units and no polling or wasted bus activity (as would occur in a synchronous or non-split-transfer solution). Further, support for advanced bus features such as precise exceptions and deferred transfers is shown to be feasible in such an environment. MARBLE will be used in the AMULET3i chip, tapeout of which is due in mid-1999.

In summary, MARBLE demonstrates that all the features of a high speed on-chip macrocell bus can be implemented efficiently in a fully asynchronous design style, thus adding the advantages of elastic pipelines and zero quiescent power to the modularity and support for testability already offered by existing clocked macrocell buses whilst avoiding the problems of clock distribution associated with long synchronous interconnections.

References

[1] AMBA, Advanced Microcontroller Bus Architecture Specification, Rev D, Advanced RISC Machines Ltd. (ARM), April 1997.
[2] AMBA, Advanced Microcontroller Bus Architecture Specification, Rev 2.0, Advanced RISC Machines Ltd. (ARM), May 1999.
[3] Draft Standard OMI 324: PI-Bus rev 0.3d, Open Microprocessor Systems Initiative (OMI), Siemens AG, Germany, 1994.
[4] The Virtual Socket Interface (VSI) Alliance, URL: http://www.vsi.org/.
[5] C.E. Molnar, I.W. Jones, W.S. Coates, J.K. Lexau, A FIFO ring performance experiment, Third International Symposium on Advanced Research in Asynchronous Circuits and Systems, ASYNC’97, Sun Microsystems Laboratories, April 1997.
[6] J. Kessels, P. Marston, Designing asynchronous standby circuits for a low-power pager, Proceedings of the IEEE 87 (2) (1999) 257–267.
[7] D.J. Kinniment, B. Gao, Towards asynchronous A–D conversion, Fourth International Symposium on Advanced Research in Asynchronous Circuits and Systems, ASYNC’98, University of Newcastle upon Tyne, UK, April 1998.
[8] S.B. Furber, J.D. Garside, P. Riocreux, S. Temple, P. Day, J. Liu, N.C. Paver, AMULET2e: an asynchronous embedded controller, Proceedings of the IEEE 87 (2) (1999) 243–256.
[9] S.B. Furber, J.D. Garside, S. Temple, J. Liu, AMULET2e: an asynchronous embedded controller, Third International Symposium on Advanced Research in Asynchronous Circuits and Systems, ASYNC’97, 1997.
[10] W.J. Bainbridge, Asynchronous system-on-chip interconnect, PhD thesis, Department of Computer Science, Manchester, UK, 2000.
[11] D.A. Gilbert, Dependency and exception handling in an asynchronous microprocessor, PhD thesis, Department of Computer Science, The University of Manchester, December 1997.
[12] D. Jaggar, Advanced RISC Machines Architecture Reference Manual, Prentice-Hall, Englewood Cliffs, NJ, 1996.
[13] I.E. Sutherland, Micropipelines, Communications of the ACM 32 (6) (1989) 720–738.
[14] P.A. Molina, The design of a delay-insensitive bus architecture using handshake circuits, PhD thesis, Imperial College of Science, Technology and Medicine, University of London, UK, 1997.
[15] P. Day, J.V. Woods, Investigations into micropipeline latch design styles, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 3 (2) (1995) 264–272.
[16] FUTUREBUS: Specifications for Advanced Microcomputer Backplane Buses, IEEE Computer Society Press, November 1983.
[17] Small Computer System Interface (SCSI), American National Standards Institute, 1986.
[18] T. Shanley, D. Anderson, PCI System Architecture, 3rd ed., Addison-Wesley, Reading, MA, 1995 (ISBN 0-201-40993-3).
[19] C. Mead, L. Conway, Introduction to VLSI Systems, 2nd ed., Addison-Wesley, Reading, MA, 1980.
[20] M.B. Josephs, J.T. Yantchev, CMOS design of the tree arbiter element, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 4 (4) (1996) 472–476.
[21] J.L. Hennessy, D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, Los Altos, CA, 1990.
[22] J.V. Woods, P. Day, S.B. Furber, J.D. Garside, N.C. Paver, S. Temple, AMULET1: an asynchronous ARM microprocessor, IEEE Transactions on Computers 46 (4) (1997) 385–398.
[23] PowerPC 405GP Has CoreConnect Bus, Microprocessor Report 13 (9) (1999).
[24] ARM10 Points to Set-Tops, Handhelds, Microprocessor Report 12 (15) (1998).

John Bainbridge received the M.Eng. degree in Electronic Systems Engineering from Aston University, UK in 1996 and has been with the AMULET group at the University of Manchester, UK since 1996. His research interests include asynchronous VLSI design, specialising in interconnect issues, and asynchronous neural computing. He has just submitted a PhD thesis on asynchronous system-on-chip interconnect.

Steve Furber is the ICL Professor of Computer Engineering in the Department of Computer Science at the University of Manchester. He received his BA degree in Mathematics in 1974 and his PhD in Aerodynamics in 1980 from the University of Cambridge, England. From 1980 to 1990 he worked in the hardware development group within the R&D department at Acorn Computers Ltd, and was a principal designer of the BBC Microcomputer and the ARM 32-bit RISC microprocessor, both of which earned Acorn Computers a Queen’s Award for Technology. Since moving to the University of Manchester in 1990 he has established the AMULET research group which has interests in asynchronous logic design and power-efficient computing. He is a Fellow of the Royal Academy of Engineering and the British Computer Society, a Chartered Engineer, and a Member of the IEEE.