Journal of Parallel and Distributed Computing 57, 236-256 (1999). Article ID jpdc.1998.1528, available online at http://www.idealibrary.com
Evaluating the Benefits of Communication Coprocessors 1

Chris J. Scheiman* and Klaus E. Schauser†

*Department of Computer Science, California Polytechnic State University, San Luis Obispo, California 93407; †Department of Computer Science, University of California, Santa Barbara, Santa Barbara, California 93106
E-mail: cscheima@calpoly.edu, schauser@cs.ucsb.edu

Received July 26, 1998; accepted December 9, 1998
Communication coprocessors (CCPs) have become commonplace in modern MPPs and networks of workstations. These coprocessors provide dedicated hardware support for fast communication. In this paper we study how to exploit the capabilities of CCPs for executing user-level message handlers. We show, in the context of Active Messages and Split-C, that we can move message-handling code to the coprocessor, thus freeing the main processor for computational work. We address the important issues that arise, such as synchronization, and the limited computational power and flexibility of CCPs. We have implemented coprocessor versions of both Active Messages and Split-C. These implementations, developed on the Meiko CS-2, provide us with an excellent experimental platform to evaluate the benefits of a communication coprocessor architecture. Our experimental results show that the coprocessor version of Split-C is more responsive than the version which executes all message handlers on the main processor. This is despite the fact that some communication primitives are slower, since the coprocessor on the Meiko CS-2 is computationally much slower than the main processor. Overall, the performance on large Split-C applications is better because the coprocessor reacts faster and offloads the main processor, which does not need to poll or process interrupts. © 1999 Academic Press

Key Words: parallel processing; cluster computing; communication coprocessors; Active Messages; Split-C.
1 This work was supported by the National Science Foundation NSF CAREER Award CCR-9502661 and NSF Postdoctoral Award ASC-9504291. Computational resources were provided by the NSF Infrastructure Grant CDA-9216202.

1. INTRODUCTION

Many modern parallel architectures, including networks of workstations, contain dedicated communication coprocessors (CCPs) to support fast communication. Examples of architectures with CCPs include the Intel Paragon [9], Manna [4],
Meiko CS-2 [11], Flash [15], Typhoon [21], and clusters of workstations connected via ATM networks [27]. These coprocessors support communication on both the sending and receiving nodes. They provide the protection, reliability, and protocol handling needed for communication, thus freeing the main processor for computational work. Protection is provided by trusted OS code running on the coprocessor, or by the use of specially designed hardware [9, 11]. Reliability is often provided via a retransmission protocol. Other tasks often executed on the coprocessor include the segmentation and reassembly of message packets [27]. Likewise, shared memory machines and some distributed shared memory machines use the coprocessor to implement consistency protocols. Other advantages of communication coprocessors include support for high-bandwidth DMA transfers and asynchronous message handling.

In many forms of communication, when a message arrives, some action must be performed to incorporate the data into the ongoing computation. This is the case with both Active Messages [28] and tagged send and receive [25]. For example, a message may match a tag, insert an item on a queue, increment a variable, or store something in memory. Executing message handlers on the main processor incurs a high overhead, as the main processor must either use expensive interrupts or poll frequently for incoming messages. Since some of the communication coprocessors mentioned above can execute arbitrary code, it may be desirable for these coprocessors to perform the message handling.

In this paper we study how to exploit the capabilities of CCPs for executing user-level message handlers. Our goal is to examine the benefits and drawbacks of this scheme. Executing message handlers on the coprocessor provides a number of benefits [22, 13]. The coprocessor can handle the message while the main processor continues to work uninterrupted. Since the coprocessor is fully dedicated to communication, it is likely to serve incoming messages faster than the main processor. Furthermore, if all messages are handled by the coprocessor, there is no need for the main processor to be interrupted or for polls to be inserted into the computation. While the computational overhead of a poll usually is not significant, the real cost of polling occurs when a request message is not processed promptly, causing the requester to wait. This might occur if the remote processor is in a computational loop without polls. Increasing the amount of polling to avoid such situations, however, is not always possible or desirable. If messages are received and processed by a coprocessor, this delay can be prevented. Thus, even if the communication coprocessor is slower in processing messages, it may still be beneficial to have the coprocessor service requests.

Executing message handlers on a communication coprocessor, however, raises a number of issues that must be addressed. Frequently, CCPs are special-purpose hardware, limited in computational power and flexibility. This constrains the kinds of computation that can be performed by the coprocessor. For example, some coprocessors do not support floating point operations, or they might not run arbitrary user code. Another important issue is synchronization between the coprocessor and the main processor. The problem is that sometimes code on the main processor must execute atomically with respect to certain message handlers.
When not taking advantage of a coprocessor, explicit polling on the main processor
provides an easy solution. When a coprocessor can execute message handlers independently of the main processor, however, synchronization primitives may be required. One frequently occurring example is a counter that is updated by both the main computation and the message handlers. Either this counter must be protected by expensive locks, or a more creative solution must be used, such as splitting the counter [13].

In this paper, we evaluate the handling of messages by the coprocessor in the context of Active Messages and Split-C. Active Messages provide a good example of how communication and computation can be executed on the coprocessor [22]. Active Messages associate with each message a handler function which is executed on the receiving processor upon arrival of the message [28]. Executing the message handlers on the coprocessor achieves all of the benefits, and exposes all of the complications, mentioned above.

We implemented Active Messages on the Meiko CS-2, a modern MPP with a user-programmable CCP. In order to evaluate our approach under large parallel applications, we also implemented Split-C, a simple parallel extension of the C programming language. Our Split-C implementation provides an excellent experimental platform for studying the benefits of having a dedicated communication coprocessor. Using Split-C gives us access to a large set of parallel programs. These programs serve as a useful benchmark set that allows us to evaluate the absolute performance of the implementation based on the CCP. We are also able to directly compare the performance of Split-C running on the proposed platform with other existing implementations of Split-C.

For each of the Split-C communication primitives, we produced specialized message handlers which run solely on the coprocessor. Using this implementation of Split-C, we benchmarked an extensive set of parallel applications. We found these programs to be more responsive than those using a version of Split-C that handles all messages on the main processor. This occurs in spite of higher latencies due to the slow coprocessor on the Meiko CS-2. The coprocessor reacts faster to requests and offloads the main processor, which does not need to poll or process interrupts.

The structure of the remainder of the paper is as follows. Section 2 gives an overview of Active Messages and Split-C. Section 3 presents the implementation strategy for running the Split-C library on a coprocessor architecture. Section 4 presents our hardware platform, the Meiko CS-2 architecture, in more detail, focusing in particular on the distinguishing characteristics of the communication coprocessor. Section 5 evaluates the performance of our Split-C library implementation. Section 6 discusses related work. Finally, Section 7 summarizes our experiences and concludes.
2. ACTIVE MESSAGES AND SPLIT-C

Active Messages provide a universal communication architecture which is frequently used by compiler and library writers [28]. Split-C is a C-based parallel language which makes use of Active Messages. Our goal is to execute much of the message handling on the coprocessor in order to free up the main processor.
2.1. Active Messages

Each Active Message has a handler function associated with it which is executed on the destination processor when the message arrives. Active Message handlers are intended to be short functions which execute quickly and are not allowed to suspend. Handlers are divided into two classes: requests and replies. A processor can send a request message to an arbitrary processor. When it arrives, the specified request handler is invoked. Request handlers may answer by sending a single reply message. However, to simplify deadlock considerations, reply handlers are prohibited from additional communication.

Under the Active Message model, messages travel from user space (the send instruction) directly to user space (the message handler), avoiding the buffer management and synchronization usually encountered in traditional send and receive. As a result, Active Messages achieve an order of magnitude performance improvement over more traditional communication mechanisms. Although they are quite primitive, Active Messages have become an important communication layer because of their efficiency. The small overhead and low latency facilitate building more complicated communication layers [25] and make them a desirable target for high-level language compilers [8, 6]. Over the past several years Active Messages have been implemented on many different hardware platforms, including the CM-5 and nCUBE/2 [28], Intel Paragon [16], Meiko CS-2 [22], T3D [1, 12], and IBM SP-2 [5], as well as clusters of workstations connected by FDDI [18] and ATM [26, 27]. Several of these architectures have a communication coprocessor, which is used to support reliable and protected communication. So far only the Meiko CS-2 and the Intel Paragon have been used to run user-level message handlers on the coprocessor [13, 22].
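To make the request/reply structure concrete, the following sketch shows a minimal round trip in C. The names am_request, am_reply, and am_poll are illustrative stand-ins for whatever send and poll primitives a particular Active Message layer provides; argument marshalling is simplified, and this is not the API of any specific implementation discussed in this paper.

/* Assumed primitives of a generic Active Message layer (illustrative only). */
extern void am_request(int dest, void (*handler)(int src, int arg), int arg);
extern void am_reply(int src, void (*handler)(int src, int arg), int arg);
extern void am_poll(void);

static volatile int got_reply = 0;

/* Reply handler: runs on the requester when the reply arrives.
 * Reply handlers may not communicate further. */
static void pong_handler(int src, int arg) {
    got_reply = arg;
}

/* Request handler: runs on the destination node when the request arrives.
 * It may answer with at most one reply message. */
static void ping_handler(int src, int arg) {
    am_reply(src, pong_handler, arg + 1);
}

/* Send a request and wait for the answer. */
static int ping(int dest) {
    am_request(dest, ping_handler, 41);
    while (!got_reply)
        am_poll();      /* without a coprocessor, the main CPU must poll */
    return got_reply;
}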
2.2. Split-C

Split-C is a simple parallel extension of the C programming language [6]. Split-C follows the SPMD model of computation; a single thread of computation is started on each processor. Both the parallelism and the data layout are explicit and are specified by the programmer. Split-C provides a global address space in the form of distributed arrays and global pointers, and it supports several efficient split-phase operations (including bulk transfers) to access remote data. Split-phase operations separate the initiation of a memory operation (the request) from the response. The split-phase operations in Split-C are put and store, for transferring data to another location, and get, for retrieving data from another location. (Put and get are acknowledged transactions, while store is a one-way operation.) Split-C implementations have been built on top of Active Messages on many computational platforms, including the Thinking Machines CM-5, Intel Paragon, Meiko CS-2, IBM SP-2, and networks of workstations.

The split-phase nature of Split-C requires that the basic memory operations (put, get, and store) keep track of outstanding memory transactions. Split-C keeps track of these outstanding data transfers with counters, using Active Messages to transfer the data and increment or decrement the counters. Every put or get request
FIG. 1. Steps required for a Split-C get operation.
increments a counter to indicate that an operation has begun but not yet completed. When the data has been successfully transferred, the counter is decremented. The user can wait for all outstanding gets and puts with the Split-C function sync( ), which essentially waits for the counters to equal zero. Unlike the get and put operations, a store operation is one-way and uses two counters: one is incremented when data is sent and the other is incremented when data is received. Since get, put, and store all work in a similar fashion, we examine only the get in more detail. As shown in Fig. 1, a get operation consists of three steps:

1. The requesting processor (1a) sends the ``get'' request to the remote processor and (1b) increments the counter tracking outstanding requests.

2. The remote processor receives the request message, (2a) reads the requested data, and (2b) sends it back.

3. The requesting processor receives the reply, (3a) stores the data, and (3b) decrements the counter.

Steps 2 and 3 correspond to Active Message handlers (servicing the request or reply message). As described in the next section, we implement Split-C by running the Active Message handlers directly on the communication coprocessor.
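As a concrete illustration, the three steps above can be expressed as C handler functions as sketched below. The am_request/am_reply/am_poll primitives and the single outstanding counter are simplifications for illustration, not the actual Split-C runtime code.

/* Assumed Active Message primitives (illustrative only). */
extern void am_request(int dest, void (*h)(int src, int *raddr, int *ldst),
                       int *raddr, int *ldst);
extern void am_reply(int src, void (*h)(int *ldst, int val), int *ldst, int val);
extern void am_poll(void);

static volatile int outstanding = 0;   /* gets/puts issued but not yet completed */

/* Step 3: runs on the requesting node when the reply arrives. */
static void get_reply_handler(int *local_dst, int value) {
    *local_dst = value;                /* (3a) store the data                */
    outstanding--;                     /* (3b) one fewer operation in flight */
}

/* Step 2: runs on the remote node when the request arrives. */
static void get_request_handler(int src, int *remote_addr, int *local_dst) {
    am_reply(src, get_reply_handler, local_dst, *remote_addr);   /* (2a)+(2b) */
}

/* Step 1: issued by the requesting node. */
static void split_get(int dest, int *remote_addr, int *local_dst) {
    outstanding++;                                                /* (1b) */
    am_request(dest, get_request_handler, remote_addr, local_dst); /* (1a) */
}

/* sync(): wait until every outstanding get and put has completed. */
static void sync_all(void) {
    while (outstanding != 0)
        am_poll();   /* in the main-processor version, replies are handled here;
                        in the coprocessor version the counter is updated
                        asynchronously by the CCP                              */
}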
3. IMPLEMENTATION OF SPLIT-C ON COPROCESSOR ARCHITECTURES

This section shows how we implement Split-C on a coprocessor architecture. Using the coprocessor to handle messages has the advantage that the main processor on the sending side only needs to initiate the communication, while the main processor on the receiving side can proceed uninterrupted. The coprocessors keep track of the outstanding requests, send the requested data, and move the incoming data into the specified memory locations.

For a coprocessor implementation of Split-C, the steps originally performed by message handlers on the main processor are now executed by the coprocessor. For the get operation, as shown in Fig. 2, the coprocessor reads and stores the data and decrements the counter. The steps are almost exactly the same as in Fig. 1, except that the coprocessors on both nodes are utilized. In Step 1, the message is sent to the receiving coprocessor, and the counter is incremented as before; in Step 2, the receiving coprocessor reads the data and sends the reply; and in Step 3, the requesting coprocessor stores this data in memory and decrements the counter. In most architectures, the coprocessor is also involved on the sending side to ensure
FIG. 2. Steps required for a Split-C get operation on a coprocessor architecture.
protection. But this is transparent to the application, which is simply provided with a user-level communication interface.

3.1. Implementation Issues

Using the coprocessor to handle messages raises new issues not normally found in a straightforward (single processor per node) implementation of a parallel language. These involve synchronization, atomic operations, and the relative power of the coprocessor compared to the main processor.

Synchronization. There are several synchronization issues that must be addressed. For example, a problem arises if the coprocessor decrements the counter just as the main processor issues another get request; the counter value could be corrupted. We have to ensure mutually exclusive access if we allow both the main processor and the coprocessor to update the same counter. Implementing this with locks can be quite expensive. A more efficient solution is to split the counter into two counters, using one counter for the main processor increments and the other for the coprocessor decrements [13]. This solves the synchronization problem: no race condition can occur, since each processor writes only to one location and reads the other. The sum of the two counters produces the desired counter value. The counter for the put operation is treated similarly. (The counter for store is already split into sending and receiving counters in the basic Split-C implementation.)

Atomic operations. Another important synchronization issue, when executing code on the coprocessor, involves atomic operations. Split-C allows the user to specify message handler functions which are atomic with respect to other message handlers, as well as the main computation. (These atomic functions are subject to the same constraints as Active Messages: they cannot block and they can only send a single reply.) These functions are intended only for very simple operations, for example, enqueuing or dequeuing an object. If the coprocessor is allowed to execute these atomic functions, some mechanism must be applied to ensure mutually exclusive execution of the function. The simplest solution is to allow only the main processor to execute such functions. The main processor then has to poll for atomic functions it must execute.

Coprocessor restrictions. A final issue is that frequently a communication coprocessor only provides limited functionality and cannot run arbitrary message handlers. For instance, many coprocessors cannot execute floating point operations. Additionally, most communication coprocessors have less design effort invested, compared to commercial main processors, and thus they are not as fast as their
main processor counterparts. It can be more efficient to run resource-intensive tasks on the main processor instead. Not only will the code run faster, but it will not burden the communication coprocessor or delay the processing of new messages. The main processor might also be utilized for functions that the user simply does not want to port to the coprocessor. For instance, Split-C has several infrequently used string functions, which we did not convert to coprocessor implementations.

In summary, communication in Split-C consists of a number of split-phase memory primitives, plus a collection of less frequently used operations such as atomic functions. To keep track of outstanding memory transfers, Split-C uses counters. The memory transfer and counter operations can be performed easily by the coprocessor, and if split counters are used, no synchronization is needed with the main processor. Atomic operations, on the other hand, have stringent synchronization requirements and may therefore execute more efficiently on the main processor. In the case of the Meiko CS-2, the coprocessor provides features which we can exploit to further optimize this basic implementation. This is described in the next two sections.
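Before turning to the Meiko specifics, here is a minimal sketch of the split-counter scheme from Section 3.1, with illustrative names. Each counter has exactly one writer, so no lock is required: the main processor only increments one counter, the coprocessor handler only decrements the other, and the sum of the two gives the number of outstanding operations.

/* Split counter: each location has exactly one writer, so no locking is needed. */
static volatile long issued  = 0;   /* written only by the main processor  */
static volatile long retired = 0;   /* written only by the coprocessor     */

/* Main processor: called when a get or put is started. */
static void note_issued(void)  { issued++; }

/* Coprocessor handler: called when the matching reply has been processed. */
static void note_retired(void) { retired--; }

/* sync(): all operations have completed when the sum of the counters is zero. */
static void wait_all_done(void) {
    while (issued + retired != 0)
        ;   /* spin; 'retired' advances asynchronously as replies are handled */
}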
4. THE MEIKO CS-2 ARCHITECTURE

The Meiko CS-2 consists of Sparc-based nodes connected via a fat tree communication network [11]. It runs a slightly enhanced version of the Solaris 2.3 operating system on every node, and thus closely resembles a cluster of workstations connected by a fast network. Our machine is a 64-node CS-2 without vector units. Each node contains a 40-MHz SuperSparc processor with a 1 MB external cache and 32 MB of main memory. The SuperSparc is a three-way superscalar processor and achieves a peak rate of 120 MIPS.

Each node in a CS-2 contains a special CCP, the Elan processor, which connects the node to the fat tree. The CCP enables direct user-level communication. For every message, the coprocessor ensures protection by a series of translations and checks at both the sending and receiving sides. This achieves protected user-level communication without the intervention of the operating system for every message. The Elan coprocessor consists of a number of functional units which include the DMA engine (for sending data), the thread engine (for running arbitrary code, including user code), and the input engines (for receiving messages). These units share the same microcoded engine, and execution is switched among them. The DMA engine performs DMA transfers of arbitrary size from a virtual address on the source node to a virtual address on the destination node. DMA transfers can even cross page boundaries, since the Elan coprocessor contains its own TLB and does a virtual-to-physical address translation on every memory reference. Long transfers are automatically split into smaller units, and the DMA engine is timesliced to allow other functional units to execute. The thread engine implements a Sparc-like instruction set, enhanced with special instructions for opening and closing connections, sending network transactions, and transferring control to other functional units. The user can either write code for the thread engine directly in
assembly language or generate it through a modified version of gcc, the ``thread compiler.'' Due to its microcode implementation, the thread engine achieves a peak performance of only 3 MIPS. The current gate-array implementation of the Elan has only a small eight-word instruction cache for the thread engine and no data cache, and all memory accesses go to the slow external DRAM.

The Meiko has a unique synchronization mechanism called an event. When an event is triggered, it can launch a DMA transfer, generate an interrupt, or, more commonly, cause a coprocessor thread to resume execution. Events allow for synchronization, since the main processor, the Elan processor, and remote processors can all set up and trigger events. For example, a thread can suspend itself and wait on an event. When the main processor (or some other processor) sets the event, the thread will resume execution. A thread waiting on an event uses no resources except memory, so there is no penalty for having many suspended threads.

5. PERFORMANCE OF THE COPROCESSOR IMPLEMENTATION

We now give the details and discuss the performance measurements of our coprocessor implementations of Active Messages and Split-C. We divide our discussion into four parts: First, we examine the raw communication performance of the Meiko CS-2. Next, we discuss our coprocessor implementation of Active Messages and compare it to the version that runs handlers on the main processor. In the third part, we discuss our optimized implementation of Split-C, which specializes the message handlers for execution on the coprocessor. Finally, we examine applications written in Split-C.

5.1. Basic Performance Measurements

For comparison with the experimental results in this section, we first discuss the raw performance measurements of a basic DMA transfer on the Meiko CS-2. This establishes the best performance an ideal implementation could achieve. In our measurements, we compute the one-way latency of a DMA transaction by halving the round-trip latency involved in transferring a flag between two processors. This round-trip operation is performed repeatedly so that the overhead incurred by starting and stopping the timer can be ignored. In measuring the individual components, the timings are derived experimentally by executing the appropriate code fragment within a loop.

For transferring four to 32 bytes of data, it takes 9.5 µs from the time a processor starts a DMA until the data appears on the remote processor. This time is divided into four components: the time the main processor spends launching the DMA (2.5 µs), the time the sending coprocessor spends processing the remote write (5.5 µs), the transmission time in the network (0.5 µs), and the time the remote coprocessor needs to store the value in memory (1 µs). (Homewood and McLaren [11] report a one-way latency of 9 µs, but the additional second-level cache in our machine adds another 0.5 µs of latency.) For bulk transfers, the Meiko can achieve 42 MB/s using the DMA engine and up to 50 MB/s using the thread engine.
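The round-trip timing methodology described above can be sketched as follows; dma_write_word, now_usec, and the flag layout are assumed helpers for illustration only, not calls from the Meiko libraries.

#define ITERS 10000

extern void   dma_write_word(int dest_node, volatile int *dest_addr, int value);
extern double now_usec(void);

/* Node 0 and node 1 bounce a flag back and forth; the averaged round-trip
 * time divided by two gives the one-way latency quoted in the text.
 * 'flag' is this node's copy, 'peer_flag' the corresponding remote address. */
double one_way_latency_usec(int me, volatile int *flag, volatile int *peer_flag) {
    double start = now_usec();
    for (int i = 1; i <= ITERS; i++) {
        if (me == 0) {
            dma_write_word(1, peer_flag, i);   /* send the flag ...           */
            while (*flag != i) ;               /* ... and wait for the echo   */
        } else {
            while (*flag != i) ;               /* wait for the flag ...       */
            dma_write_word(0, peer_flag, i);   /* ... and echo it back        */
        }
    }
    double round_trip = (now_usec() - start) / ITERS;
    return round_trip / 2.0;                   /* halve to get one-way latency */
}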
5.2. Performance of Active Messages
In this section we compare the performance of two implementations of Active Messages. The first implementation uses the CCP only on the sending side. The second implementation goes a step further and also uses the coprocessor on the receiving side to execute the message handlers.

The first implementation, described in more detail in [22], achieves a 12.5 µs one-way latency, as shown in Table 1. Sending an Active Message involves the following steps:

(1) The sending processor assembles the message in the outgoing buffer and sets a flag to indicate that the message is ready (1.3 µs).

(2) The sending CCP polls on this flag. Once it sees that the flag is set, it checks whether the receiving buffer on the remote processor is empty, and if so it deposits the message into the receiving buffer (9 µs).

(3) The main processor on the receiving node checks this buffer on a poll; when it encounters a message, it dispatches it to the message handler and frees the buffer (2.2 µs).

This scheme essentially implements a remote queue [3] of depth one.

In the second implementation of Active Messages, the CCP is used on both the sending and the receiving side. In this implementation, the main processor is no longer involved in the reception of messages. When a message arrives, it triggers a thread function (using Meiko's event mechanism). This function calls the appropriate Active Message handler. 2 The steps involved in sending an Active Message under this coprocessor implementation, along with the associated timings, are:

(1) The sending processor sends a message and a ``thread trigger'' (Meiko event) using the DMA mechanism (13 µs).

(2) The receiving coprocessor receives the message, stores the data in main memory, and triggers the thread (5 µs).

(3) This thread runs on the receiving coprocessor, reads the message from memory, and calls the Active Message handler function (9 µs total for a message that stores a value and increments a counter).

If the receiver needs to reply, the Active Message handler calls the reply function, taking 9 µs on the Elan coprocessor. The original sending coprocessor then receives the reply, which triggers a thread (5 µs), which invokes the reply handler (6 µs) and calls the message handler (3 µs). The transmission time adds roughly 1 µs. Thus, the time for a roundtrip message (request and reply) is 52 µs, while the time for a one-way Active Message is 28 µs.

As shown in Table 1, the one-way and roundtrip latencies are more than twice as large as in the main processor implementation. This is because the coprocessor is an order of magnitude slower (it runs at only 3 MIPS versus up to 120 MIPS for the main processor) and has only an eight-word instruction cache, no data cache, and no register windows.
2 Our implementation is optimized by using P-1 threads per processor for the sending and receiving of messages. Each thread is responsible for receiving messages from a single remote processor, and each is triggered by its own event. Note that multiple events/threads only take up space; threads do not run unless triggered by an event. This scheme ensures that two sending processors do not set the same event at the same time. Furthermore, in our implementation, a processor waits for an acknowledgment before sending a second message to the same destination; therefore, no buffer management is needed.
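The per-sender receive threads described in this footnote might look roughly like the following C-style pseudocode for the coprocessor's thread engine. The types event_t and msg_buf and the calls wait_event and send_ack are illustrative placeholders, not the actual Elan thread-code interface.

typedef void (*am_handler_t)(void *args);
typedef struct elan_event event_t;          /* opaque; illustrative only           */

extern void wait_event(event_t *ev);        /* suspend until the event fires       */
extern void send_ack(int sender);           /* tell the sender the buffer is free  */

struct msg_buf {                 /* one incoming-message buffer per remote sender */
    am_handler_t handler;        /* which Active Message handler to run           */
    void        *args;           /* handler arguments, deposited by the sender    */
};

/* One such thread exists for each of the P-1 possible senders. It sleeps on its
 * own event, so it costs nothing until that sender's DMA deposits a message and
 * triggers the event. */
void receive_thread(struct msg_buf *buf, event_t *ev, int sender) {
    for (;;) {
        wait_event(ev);               /* suspend until this sender's message lands */
        buf->handler(buf->args);      /* run the handler here, on the coprocessor  */
        send_ack(sender);             /* sender may now reuse the buffer           */
    }
}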
TABLE 1
Comparison of Active Message Implementations for Simple Ping-Pong Program

Active Message implementation     One-way time     Roundtrip time
Main processor                    12.5 µs          23 µs
Coprocessor                       28 µs            52 µs
Our implementation involves a dispatch thread on the receiving coprocessor, which takes incoming messages and calls the proper handler function. If a program or library uses only a few message handlers, one optimization is to avoid this dispatch thread and directly invoke the proper handler function. (On the Meiko, this can be done easily by creating a separate thread for each function and then triggering the appropriate thread.) This approach is ideal for the Split-C library, which contains only a few important message handlers. It forms the basis for the optimized Split-C implementation described next.

5.3. Performance of Split-C Primitives

As we saw in the previous section, a direct implementation of the Split-C primitives using Active Messages on the coprocessor is about twice as slow as the main processor implementation. To improve the Split-C primitives, we exploit the fact that Split-C requires only a small number of Active Message handlers. While the generic Active Message implementation requires a dispatch thread to call the appropriate message handler, the optimized version can avoid this dispatch and invoke the appropriate handler directly. Furthermore, many Split-C operations only require functionality that is already built into the Elan hardware. We can use special communication primitives such as DMAs and remote atomic adds to avoid running thread code on the receiver (the work is still performed by the receiving coprocessor, but it is done by the DMA engine or input engine, i.e., other parts of the Elan coprocessor).

Before addressing the optimizations, we present the timings for the basic Split-C primitives. Figure 3 gives an overview of the implementation results for the get, put, and store operations. It shows the latencies for the optimized coprocessor implementation and compares them to the versions of Split-C that run handler functions on the main processor and on the coprocessor. We see that get and store are still much slower in the coprocessor library, compared to the main processor version: 34 vs 24 µs for get and 22 vs 13 µs for store. However, they are much better than the times required under the Active Message coprocessor implementation (54 µs for get and 30 µs for store). Furthermore, put and bulk store are actually faster under the optimized coprocessor implementation: 22 vs 24 µs for put and 23 vs 27 µs for bulk store. This is because we were able to specialize these functions in the coprocessor implementation, while the main processor implementation relies on generic Active Messages. Figure 4 shows the bandwidth curves for the bulk store and bulk get operations. We see that the coprocessor implementation outperforms the main
FIG. 3. One-way latency for basic Split-C operations, for three versions of Split-C. Main AM is the version of Split-C which executes the Active Message handlers on the main processor; coproc AM is the version running Active Messages on the coprocessor; coproc optimized is the version optimized for the coprocessor. (The bulk store is for 32 bytes.)
processor implementation for long messages 3; for bulk stores it achieves 38 MB/s, compared to 32 MB/s.

We now discuss the optimization of the three Split-C operations (put, get, and store) in more detail:

The put operation increments a counter, transfers data to a remote processor, then decrements the counter when the transfer is complete. The optimized put operation is faster because it executes no code on the receiver; thus we can use a highly optimized DMA to transfer the data and avoid any sort of Active Message altogether. On the Meiko, the completion of a DMA can trigger a thread on the sending coprocessor, and this thread can decrement the counter.

The get operation increments a counter, requests data from a remote processor, then decrements the counter when the data has arrived. The get operation is also optimized by using DMAs. The requesting processor first sends a DMA descriptor to the server processor (using a second DMA); then it remotely triggers the DMA descriptor to run on the server. This server DMA transfers the required data and triggers thread code on the requesting coprocessor which decrements the counter.

The store operation increments a local counter, transfers data to a remote processor, and increments a second counter on the remote processor. Unlike get and put, the store involves a computation on the remote processor. The Meiko supports remote atomic adds. Thus, the sending processor can send the data and then make the receiving coprocessor perform an atomic add on the counter. This avoids the high cost of triggering and running thread code. Note that with the remote atomic add we are essentially running the Active Message handler on the sender instead of the receiver. A sketch of this store path appears below.

Even with these optimizations, the latencies for gets and stores are still higher than in the original main processor version. Of course, an advantage of the coprocessor implementation is that messages can be serviced while the main processor is busy with other tasks.
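A minimal sketch of the optimized store path follows; dma_put and remote_atomic_add stand in for the Meiko's DMA and remote-add network transactions, and the symmetric counter address is an illustrative simplification rather than the library's actual interface.

#include <stddef.h>

extern void dma_put(int dest, void *remote_dst, const void *src, size_t nbytes);
extern void remote_atomic_add(int dest, volatile long *remote_counter, long delta);

static volatile long stores_sent     = 0;  /* incremented locally by the sender    */
static volatile long stores_received = 0;  /* incremented remotely on the receiver */

void split_store(int dest, void *remote_dst, const void *src, size_t nbytes) {
    stores_sent++;                                   /* count the outgoing store   */
    dma_put(dest, remote_dst, src, nbytes);          /* data moves by DMA; no      */
                                                     /* handler thread runs        */
    remote_atomic_add(dest, &stores_received, 1);    /* receiver's input engine    */
                                                     /* bumps its own counter      */
}

Because the receiver's counter is updated by the input engine rather than by thread code, the store completes without waking the remote thread engine, which is what makes the optimized bulk store competitive with the main processor version.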
3 This occurs because the main processor version trades off bandwidth for latency. The coprocessor continuously schedules a thread which checks whether the main processor wants to send a message. While this reduces the latency, it constantly takes resources from the coprocessor and thus reduces the bandwidth for DMAs used in bulk transfers. The coprocessor implementation, on the other hand, uses the event mechanism, so that none of its many threads require resources until needed. Consequently, these threads do not affect the bandwidth.
FIG. 4. Bandwidth of Split-C operations for two versions of the Split-C library: One that uses the coprocessor and one that uses the main processor to handle messages. The left graph shows results for bulk get operations, while the right graph shows results for bulk store operations.
The main processor has a lower communication overhead and does not need to poll, thus overlapping communication and computation to a larger degree. To determine whether the coprocessor implementation is of value in real applications, we studied the overall performance of large Split-C applications.

5.4. Performance of Split-C Applications

In this section we compare the performance of full Split-C applications under the two implementations. As the measurements in the previous section show, gets and stores using the coprocessor are about 30-40% slower than those using the main processor. Nevertheless, executing message handlers on the slower coprocessor can be beneficial, because the communication coprocessor provides more responsive service for requests. (Of course, in our experiments, we expect some improvement for code that uses bulk operations, since the coprocessor version has a higher bandwidth.) Furthermore, because the main processor spends less time servicing messages, it has more time for the computation and can overlap communication and computation to a larger degree. To evaluate this scheme, we compare the performance of the two Split-C implementations for a number of benchmark programs and explain in detail where performance differences arise.

Table 2 lists our benchmark programs, along with the corresponding running times for both versions of Split-C. Figure 5 displays the relative execution times as a bar graph, sorted according to the relative runtime of the coprocessor version. All programs were run on a 16-node Meiko CS-2 partition.

Bitonic is a bitonic sort program, run on eight million integer keys (half a million per processor); it uses the bulk store operation for communication. Cannon codes two versions of Cannon's matrix multiply algorithm (which vary in data placement); both use bulk store operations and have an input matrix of size 1024 by 1024. FFT performs a fast Fourier transform computation involving two computation phases separated by a cyclic-to-blocked remap phase (done with store operations); our input size is eight million points. FFTB is the same FFT program, except communication uses bulk stores. MM codes two versions of a blocked matrix multiply algorithm (which differ with respect to data placement); both use bulk gets for data movement and
TABLE 2
Run Times for Various Split-C Programs

                                                               Time (in seconds)
Program   Description                                          Main proc   Coprocessor
Bitonic   Bitonic sort (8 million keys)                         17.0        16.0
Cannon    Cannon matrix multiply (1024 x 1024)
            global blocked algorithm                            13.9        12.4
            local blocked algorithm                              8.5         8.5
Fft       FFT using small transfers (8 million pts)
            phase 1 (computation)                                4.0         3.7
            remap (communication)                               12.5         8.8
            phase 2 (computation)                                1.0         0.9
Fftb      FFT using bulk transfers (8 million pts)
            phase 1 (computation)                                9.2         8.5
            remap (communication)                                2.4         2.2
            phase 2 (computation)                                2.0         1.8
Mm        Matrix multiply (128 x 128, block size = 8)
            global blocked algorithm                            29.0        23.2
            local blocked algorithm                             11.2        12.4
p-Ray     Ray-tracer (on 512 x 512 tea pot image)               37.5        38.0
Radix     Radix sort (8 million keys)                           35.2        55.4
Radixb    Radix sort, bulk transfers (16 million keys)          15.3        14.4
Sample    Sample sort (16 million keys)                         20.9        25.3
Sampleb   Sample sort, bulk transfers (16 million keys)          7.4         7.4
Shell     Shell sort (16 million keys)                          44.7        21.7
Wator     N-body simulation of fish in a current               139          34

Note. Run times on the left correspond to the main processor version of Split-C, while times on the right correspond to the version which exploits the coprocessor. Measurements are for 16 processors on a Meiko CS-2.
have as input a 128 by 128 matrix, with a block size of 8. P-ray is a parallel ray-tracer program which uses bulk put operations; our input is a 512 by 512 image of a tea pot. Radix is a radix sort algorithm, run on eight million integer keys; it uses store operations. Radixb is the same radix sort algorithm, except that it uses bulk store operations and the input array is of size 16 million. Sample is a sample sort program, run on 16 million integer keys; it uses get, put, and bulk get operations. Sampleb is the sample sort algorithm, optimized with bulk store operations. Shell is the shell sort algorithm, run on 16 million integer keys, using bulk reads. Wator is a simulation program that models fish moving in a current; it uses a number of operations, including reads. Our input size is 400 fish.

In examining Table 2 and Figure 5, we find that many of the benchmark programs produce similar timings for both the coprocessor and the main processor Split-C libraries. This occurs for two reasons: First, much of the communication is done with bulk operations, and there is only a 20% difference in bandwidth between the two Split-C implementations. Second, most of these programs do not overlap communication and computation to a great degree. Instead, they run in computation phases, separated by barriers. These programs do not provide the
FIG. 5. Run times for our Split-C benchmark programs, normalized to the running time of the main processor implementation. Bars on the left correspond to run times using the main processor implementation of Split-C, while bars on the right correspond to run times using the version of Split-C which exploits the coprocessor. Measurements are for 16 processors on a Meiko CS-2.
coprocessor implementation with an opportunity to exploit its ability to compute and handle communication at the same time.

Some programs do show a significant variation in run times; these are mm, radix, sample, and wator. Radix and sample run slower with the coprocessor library because of the raw performance of the communication operations (as shown in Fig. 3). Radix uses store operations, which are much faster in the main processor library (13 vs 22 µs), while sample uses a number of operations, including gets, which are faster in the main processor library. The global blocked matrix multiply program runs faster in the coprocessor version for two reasons: First, the main processor implementation does not queue more than one get request at a time, forcing the main processor to wait for the previous one to complete (this accounts for roughly half of the difference in run times). Second, mm uses bulk get operations, which have a slightly higher bandwidth in the coprocessor library.

The program with the largest improvement is wator. In this program each processor runs through a loop that reads a data point and then computes on that data. Because the main processor version only polls when it performs communication, any requests it receives while it is computing experience a long delay. The coprocessor implementation, in contrast, can process requests even if the main processor is computing. This accounts for the large difference in run times. Table 3 shows the range of times processors must wait while a read request is serviced. As the data show, the average waiting time for a read request under the main processor implementation is substantially larger than under the coprocessor implementation. To avoid at least some of this delay, the programmer would have to add polls to the compute routine. Unfortunately, this program contains expensive X-library and system calls which are not readily accessible for inserting polls.

To highlight the benefits of the coprocessor implementation we created a microbenchmark where each processor performs a simple compute/poll loop, and the time for each computation varies as an input parameter; after each computation the process may poll for messages. All processors take turns busy computing while
TABLE 3
Time Spent Waiting for a Read Request to Complete in the Wator Program

                     Wait times for read request
                     Main             Coproc.
4 processors         144-288 µs       23-24 µs
16 processors        600-800 µs       25-33 µs

Note. The ranges of times represent the wait times over all processors. Results are shown for 4 and 16 processors, under both the coprocessor and main processor implementations of Split-C.
the remaining processors each request a single data item from the busy processor. The requesting processors need this data item to make progress. If this request is not serviced immediately (i.e., if the compute granularity is large), all requesters are idle and only a single processor is busy computing at any given time. On the other hand, if the compute granularity is small, or if a CCP is used to service the request, the responses come back immediately, and all processors can work in parallel. Thus, for very coarse-grained computation, running the benchmark on the coprocessor implementation of Split-C can result in a near-linear speedup over running it on the main processor implementation. This is what we observe experimentally.

Figure 6 shows the impact of varying the granularity between polls on the overall run time on the Meiko. The curves show the run time of this benchmark using the CCP and main processor implementations of Split-C. For the main processor implementation, we run two versions of the program, one which polls every
FIG. 6. Run times for varying granularity between polls on a 16-processor Meiko CS-2.
x iterations (where x indicates the granularity between polls) and the other which does not poll at all during the compute phase. As expected, not polling or polling infrequently results in poor performance. The coprocessor implementation always performs well because the CCP immediately services requests as they arrive. There is a wide range of granularities over which polling on the main processor performs just as well as the coprocessor implementation. Only when polling occurs very frequently does the polling overhead become noticeable, as reflected in the difference in run time between the poll and coprocessor curves for small granularities. (The increase in run time at small granularities for the no-poll curve is due to loop overheads.)
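The microbenchmark's inner loop looks essentially like the sketch below; splitc_poll and do_work_unit are illustrative names for the library's poll primitive and one grain of computation, not identifiers from the actual benchmark source.

extern void splitc_poll(void);     /* assumed poll primitive (main-processor library) */
extern void do_work_unit(void);    /* placeholder for one grain of computation        */

/* Compute phase: 'poll_every' is the granularity x between polls;
 * poll_every == 0 corresponds to the no-poll curve in Fig. 6. */
void compute_phase(long grains, long poll_every) {
    for (long i = 1; i <= grains; i++) {
        do_work_unit();
        if (poll_every > 0 && i % poll_every == 0)
            splitc_poll();         /* service pending requests; unnecessary when the
                                      coprocessor library handles messages            */
    }
}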
6. RELATED WORK

The use of coprocessors for communication has been explored in many recent parallel architectures. Examples include the Meiko CS-2 [11], Manna [4], Intel Paragon [9], SP-2 [24], *T [19], Typhoon [21], and FLASH [15]. In all of these machines, the coprocessor is responsible for ensuring protected communication. However, their architectural details differ substantially. We can categorize them based on the coprocessor's hardware (i.e., whether the coprocessor is a smart controller, an embedded processor, or a general purpose processor), capabilities (e.g., able to run arbitrary user-level code, kernel code, or a fixed set of special operations), primary use (shared memory or message passing), and interface to the compute processor (via memory bus, I/O bus, or direct link). We first briefly compare various coprocessor architectures before discussing how some of them have been used to implement Active Messages and Split-C.

The coprocessor architecture has found its way into many commercial parallel machines. While the coprocessors on several of the above machines are freely programmable, only the Meiko CS-2 allows its coprocessor to execute user-level code while still ensuring protection. The Meiko CS-2 coprocessor can be classified as a powerful embedded controller which attaches directly to the Sparc memory bus. It supports DMA and remote memory transfers and is able to run user-level code. The Paragon contains two to five Intel i860 processors per node, one of which usually serves as the CCP. Since the coprocessor is exactly the same as the main processor(s), this machine would be an ideal target for running arbitrary message-handling code on the coprocessor. Unfortunately, to ensure protection only system code is allowed to access the network interface. In order to avoid expensive operating system calls, the most efficient solution is to run only system-level code on the CCPs. The IBM SP-2 also uses an Intel i860 as a CCP in its adaptor card [24]. It connects to a Micro Channel bus, controls the DMA engines, and runs protocol code. CCPs are also found in add-on boards for networks of workstations. One example is the FORE 200 ATM card, which contains an Intel i960 processor and firmware RAM; the OS can download arbitrary code for the i960 to run. Another example is the Myrinet adaptor card coprocessor (LANai). Like the FORE card, it allows arbitrary code and is connected to the main processor via an I/O bus. Still other architectures do not provide programmable coprocessors, but have some message-handling functionality built in. The T3D, for instance, has the ability to fetch and increment a
register (for building a queue). The AP1000+ has a controller that supports two operations, get and put (similar to the Split-C operations) [10].

In addition to commercial machines, there are a number of research machines that make use of CCPs. Most of these machines use the coprocessor to support shared memory protocols in addition to message passing. Some, such as SHRIMP [2], consist of simple controllers connected to the memory bus. The SHRIMP controller snoops for memory writes and can send the data to another processor; as it receives data, it transfers the data to memory via DMA. MANNA is designed to measure the usefulness of CCPs. Like the Paragon, its main processor and CCP are both Intel i860s. The MIT Alewife contains a custom CCP (the CMMU), which also has a built-in hardware queue for incoming messages. It stores incoming messages in this queue and quickly (e.g., within 1 µs) interrupts the main processor. The *T machine, also from MIT, allows user-level handlers as well. Typhoon uses a custom coprocessor; however, it incorporates a generic Sparc processor and is able to run user-level code. Like the Meiko CS-2, the coprocessor contains a TLB and allows (constrained) virtual addressing. The Stanford FLASH machine also uses a custom CCP, called MAGIC, which can handle arbitrary protocol routines stored in main memory (and cached for efficiency). For protection purposes, however, such code must exist in kernel space and users cannot write arbitrary handlers.

In exploiting the coprocessor for message handling, our study focused on Active Messages and Split-C. Active Messages have been ported to a wide variety of machines, including the nCUBE/2 and CM-5 [28], Paragon [16], Meiko CS-2 [22], Alewife [3], IBM SP-2 [5], a cluster of HP workstations connected by FDDI [18], and a cluster of SparcStations connected by ATM [27]. These last few implementations make use of a coprocessor. In most of these implementations, the coprocessor is used to implement the Remote Queue abstraction [3]. The main processors can enqueue an outgoing message or dequeue an incoming message at user level, while the coprocessor manages the queue and protocol handling. The FDDI implementation uses a Medusa interface card which enqueues and dequeues data but is not programmable. U-Net, a fast user-level communication layer on ATM networks, is implemented using FORE ATM cards containing i960 processors that run specialized firmware. Similarly, the Myrinet LANai coprocessor is used for both Active Messages [17] and Illinois Fast Messages [20]. Special support for atomic operations is used on the T3D to implement Active Messages [1] and fast messages [12]. An interesting Active Message implementation from our perspective is a version on the Paragon [16], which examines the use of the coprocessor. Bypassing the coprocessor allows a lower latency (of course), assuming one could also bypass the kernel trap needed for protection. The main advantage of the coprocessor is that it can receive messages, offloading the main processor.

Split-C shares its origin with Active Messages, and most implementations of the language are based on them. Split-C has been ported to a number of machines, including the CM-5, Paragon, SP-1, SP-2, and networks of workstations. The most interesting implementation from our perspective is a recent prototype version of Split-C implemented on the Paragon, using the coprocessor for message handling
[13]. By using the coprocessor on the receiving side to handle requests, they achieve a speedup of 25% for read/write operations. An interesting comparison of various coprocessor implementations of Split-C on the Paragon [13] and the Meiko CS-2 [23] has been presented in [14].

7. CONCLUSIONS

In this paper we have studied how to exploit the capabilities of the communication coprocessors present in modern parallel architectures. We have studied this in the context of Active Messages and Split-C and have implemented new versions of these libraries which execute the message handlers on the coprocessor of the Meiko CS-2. We identified and solved the synchronization issues which arise when running message handlers on the coprocessor in parallel with the code on the main processor. Ours is the first full implementation of Active Messages and Split-C running on a coprocessor.

As our experimental results show, the latencies of Active Messages increase about twofold when the handlers execute on the coprocessor. The reason is that the coprocessor is much slower than the main microprocessor. On the other hand, the coprocessor has important specialized functionality which we exploit to optimize the split-phase remote memory operations of Split-C. Under this optimized Split-C implementation, the latencies are reduced; put operations are as fast as in the main processor implementation, although get and store operations are still slower. However, the real benefit of the new implementation is that the main processor does not need to poll or process interrupts to handle messages, since this is done by the coprocessor. This frees the main processor for computation and allows it to overlap communication and computation to a larger degree. Since the coprocessor is not doing any other computation, it can handle messages more responsively. This benefit exhibits itself in large applications, such as the benchmark program wator, which achieves a factor of 3 performance improvement.

Our results show that not polling in inner loops can have a significant negative impact on performance. This was also observed in [7], where programs without sufficient polling in large computational regions showed a substantial load imbalance. Note that for the main processor versions, the user is not required to insert polls; this could be done by the compiler. For many applications, the compiler can insert polls such that the result is at least as efficient as the coprocessor versions [21]. On the other hand, in tight loops, library functions, or system code it is difficult or impossible to poll efficiently.

This work is applicable to other modern architectures which have a communication coprocessor, such as the Paragon, the T3D, and future machines. Many of these architectures will, like the Meiko, contain a custom-designed chip that may be slower than its main processor counterpart. However, as we have shown, this does not preclude its usefulness for handling messages. Even if the coprocessor can only execute kernel code, it is possible to support a fixed set of handlers required by a communication library.

Coprocessors exist in MPPs and networks of workstations in order to provide protection. We have demonstrated how to exploit this hardware for additional
efficiency. The T3D exemplifies the minimum hardware needed for protection; it has simple circuitry which verifies that remote addresses are accessible (a global address MMU). In order to exploit the coprocessor, it needs to be programmable. But if architecture designers are going to provide such an interface, ideally we would like to see something more: namely, a way to enqueue arriving messages quickly and efficiently. This would allow constructs such as Active Messages (or any other communication mechanism that allows many processors to send simultaneously to a single destination) to be implemented efficiently. While machines such as the Meiko and T3D have this capability, none can do so as efficiently as their remote read/write implementations. We propose that the additional enqueue functionality is a worthwhile investment for coprocessor architectures. While the additional efficiency might not be worth the cost of adding a programmable coprocessor, many machines already have one, and the additional cost to implement a queue would be minimal.

Our coprocessor implementation of Split-C is, overall, superior to the main processor implementation; however, the main processor version has advantages as well. Since our two implementations have different performance characteristics, future research could try to combine both and develop a dynamic scheme which, depending on the situation, adaptively applies the best implementation. For example, if we know that the destination processor is polling, we can use the main processor implementation; otherwise, it is better to use the coprocessor implementation.

REFERENCES

1. R. Arpaci, D. E. Culler, A. Krishnamurthy, S. C. Steinberg, and K. Yelick, Empirical evaluation of the CRAY-T3D: a compiler perspective, in ``Proceedings of the Annual International Symposium on Computer Architecture, May 1995.''

2. M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, and E. W. Felten, Virtual memory mapped network interface for the SHRIMP multicomputer, in ``Proc. of the 21st International Symposium on Computer Architecture, April 1994.''

3. E. A. Brewer, F. T. Chong, L. T. Liu, S. D. Sharma, and J. Kubiatowicz, Remote queues: exposing message queues for optimization and atomicity, in ``7th Annual Symposium on Parallel Algorithms and Architectures, July 1995.''

4. U. Bruening, W. K. Giloi, and W. Schroeder-Preikschat, Latency hiding in message-passing architectures, in ``Eighth International Parallel Processing Symposium, April 1994.''

5. C.-C. Chang, G. Czajkowski, and T. von Eicken, ``Performance of Active Messages on the SP-2,'' Cornell University, May 1995.

6. D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick, Parallel programming in Split-C, in ``Proc. of Supercomputing, November 1993.''

7. D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von Eicken, Empirical study of dataflow languages on the CM-5, in ``Proc. of the Dataflow Workshop, 19th International Symposium on Computer Architecture, Gold Coast, Australia, May 1992.''

8. D. E. Culler, S. C. Goldstein, K. E. Schauser, and T. von Eicken, TAM: a compiler controlled threaded abstract machine, J. Parallel Distrib. Comput. 18 (July 1993).

9. W. Groscup, The Intel Paragon XP/S supercomputer, in ``Proc. of Fifth ECMWF Workshop on the Use of Parallel Processors in Meteorology, November 1992.''

10. K. Hayashi, T. Doi, T. Horie, and Y. Koyanagi, AP1000+: architectural support of PUT/GET interface for parallelizing compiler, in ``ASPLOS-VI, October 1994.''
11. M. Homewood and M. McLaren, Meiko CS-2 interconnect Elan-Elite design, in ``Proc. of Hot Interconnects, August 1993.''

12. V. Karamcheti and A. Chien, A comparison of architectural support for messaging on the TMC CM-5 and the Cray T3D, in ``Proceedings of the Annual International Symposium on Computer Architecture, May 1995.''

13. A. Krishnamurthy, J. Neefe, and R. Wang, ``Towards Designing and Evaluating Network Interface Support: A Case Study,'' U.C. Berkeley. [Online: http://www.cs.berkeley.edu/~neefe/cs258/cs258.html, May 1995]

14. A. Krishnamurthy, K. E. Schauser, C. J. Scheiman, R. Y. Wang, D. E. Culler, and K. Yelick, Evaluation of architectural support for global address-based communication in large-scale parallel machines, in ``7th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1996.''

15. J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, The Stanford FLASH multiprocessor, in ``Proc. of the 21st International Symposium on Computer Architecture, April 1994.''

16. L. T. Liu and D. E. Culler, Evaluation of the Intel Paragon on Active Message communication, in ``Proceedings of the Intel Supercomputer Users Group Conference, April 1995.''

17. R. Martin, L. T. Liu, V. Makhija, and D. E. Culler, ``LANai Active Messages.'' [Online: http://now.cs.berkeley.edu/AM/lam_release.html, September 1995]

18. R. P. Martin, HPAM: an active message layer for a network of HP workstations, in ``Proc. of Hot Interconnects II, August 1994.''

19. R. S. Nikhil, G. M. Papadopoulos, and Arvind, *T: a multithreaded massively parallel architecture, in ``Proc. of the 19th Annual International Symp. on Computer Architecture, May 1992.''

20. S. Pakin, M. Lauria, and A. Chien, High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet, in ``Supercomputing, December 1995.''

21. S. K. Reinhardt, J. R. Larus, and D. A. Wood, Tempest and Typhoon: user-level shared memory, in ``Proceedings of the 21st Annual International Symposium on Computer Architecture, April 1994.''

22. K. E. Schauser and C. J. Scheiman, Experience with Active Messages on the Meiko CS-2, in ``9th International Parallel Processing Symposium, April 1995.''

23. K. E. Schauser, C. J. Scheiman, J. M. Ferguson, and P. Z. Kolano, Exploiting the capabilities of communication coprocessors, in ``10th International Parallel Processing Symposium, April 1996.''

24. C. B. Stunkel, D. G. Shea, B. Abali, M. Atkins, C. A. Bender, D. G. Grice, P. H. Hochschild, D. J. Joseph, B. J. Nathanson, R. A. Swetz, R. F. Stucke, M. Tsao, and P. R. Varker, The SP2 communication subsystem, technical report, IBM Thomas J. Watson Research Center, August 1994. [Online: http://ibm.tc.cornell.edu/ibm/pps/doc/css/css.ps]

25. L. W. Tucker and A. Mainwaring, CMMD: active messages on the CM-5, Parallel Comput. 20, No. 4 (April 1994).

26. T. von Eicken, V. Avula, A. Basu, and V. Buch, Low-latency communication over ATM networks using active messages, in ``Proc. of Hot Interconnects II, August 1994.''

27. T. von Eicken, A. Basu, V. Buch, and W. Vogels, U-Net: a user-level network interface for parallel and distributed computing, in ``Proc. of the Symposium on Operating Systems Principles, 1995.''

28. T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active Messages: a mechanism for integrated communication and computation, in ``Proc. of the 19th International Symposium on Computer Architecture, May 1992.''
KLAUS SCHAUSER is an associate professor in the Department of Computer Science at the University of California at Santa Barbara. He received his diploma in computer science with distinction from the University of Karlsruhe in 1989 and the M.S. and Ph.D. degrees, also in computer science, from the University of California, Berkeley, in 1991 and 1994, respectively. Professor Schauser's research interests are in efficient communication, cluster computing, parallel languages, systems for highly parallel
architectures, and Web-based global computing. He is the recipient of the NSF CAREER Award, the UC Regents' Junior Faculty Fellowship, the Distinguished Lectureship from the German American Academic Council, and various teaching awards. Dr. Schauser has served on many program committees and was the program chair for the ACM 1998 Workshop on Java for High-Performance Network Computing.

CHRIS SCHEIMAN is an assistant professor in the Department of Computer Science at California Polytechnic State University in San Luis Obispo, California. He received his B.S. in computer science from Rose-Hulman Institute of Technology in 1987 and the M.S. and Ph.D. degrees, also in computer science, from the University of California, Santa Barbara, in 1990 and 1995, respectively. Professor Scheiman's research interests are in efficient communication, cluster computing, and the performance analysis of systems.