Improving communication latency with the write-only architecture


Simon Spacey, Wayne Luk, Paul H.J. Kelly, Daniel Kuhn
Department of Computing, Imperial College London, London, UK

J. Parallel Distrib. Comput. 72 (2012) 1617–1627

Article history: Received 23 January 2012; Received in revised form 9 June 2012; Accepted 25 August 2012; Available online 11 September 2012.

Keywords: Execution paradigm; Communication latency reduction; High-performance computing; Heterogeneous computing

Abstract: This paper introduces a novel execution paradigm called the Write-Only Architecture (WOA) that reduces communication latency overheads by up to a factor of five over previous methods. The WOA writes data through distributed control flow logic rather than using a read–write paradigm or a centralised message hub, which allows tasks to be partitioned at a fine-grained level without suffering from excessive communication overheads on distributed systems. In this paper we provide formal assignment results for software benchmarks partitioned using the WOA and previous execution paradigms for distributed heterogeneous architectures, along with bounds and complexity information to demonstrate the robust performance improvements possible with the WOA.

© 2012 Published by Elsevier Inc.

1. Introduction

Execution paradigms embody contributions from Computer Science domains including hardware design, networking, software engineering and formal optimisation. With such a breadth of domain input, it is perhaps not surprising that execution paradigms are considered black boxes by many researchers. However, it is perhaps surprising to observe that this black box approach has resulted in some of the most cutting-edge distributed system research today [32,5,41,14,10,40,13,17,51] being based on an execution paradigm designed in the 1980s, when "distributed computing" often meant homogeneous architectures consisting of two active components communicating at a coarse-grained functional level [70,7]. As the meaning of distributed systems has evolved to include multi-node heterogeneous architectures [56,72], and with even the desktop trend moving towards heterogeneous systems providing acceleration through fine-grained computational specialism requiring ever greater cross-component call volumes [4,15,55], the coarse-grained two component systems for which the traditional paradigm is most efficient are fast becoming extinct.

This paper presents the Write-Only Architecture (WOA) execution paradigm and demonstrates the improvements in run time possible over previous approaches for software executing on modern multi-node heterogeneous architectures in both tightly and loosely coupled networks. In doing so, the paper contributes:




1. a novel execution paradigm with less communication overhead than previous approaches;
2. a formal timing model based on the new execution paradigm that can be used to find optimal assignments of program sections to heterogeneous components using standard solvers;
3. experimental results quantifying the speed-ups possible using the new paradigm for a set of benchmarks along with result generalisation and complexity information.

The paper begins with an overview of previous work in Section 2. In Section 3 the WOA execution paradigm is introduced with timing and sequence diagrams and then detailed through a formal assignment model. Section 4 provides comparative performance results for the WOA and previous paradigms for a set of software benchmarks and hardware architectures along with result generalisations. The paper concludes with a summary and areas for future work in Section 5.

2. Background

Table 1 lists several execution paradigms used by previous authors in systemised distributed execution work, all of which are based to some extent on the Remote Procedure Call (RPC) paradigm [70]. The reason for RPC's firm embedding in modern execution paradigms may lie in the history of computers, where loosely coupled two component Client–Server architectures were at one point considered advanced [7,81]. In such architectures clients do not incur extra communication latencies when receiving server responses, and server responses are going to be used by the client rather than forwarded on by the client to a third computational component.


Fig. 1. Communications for a Custom Instruction protocol versus the WOA on a two component tightly coupled architecture: (a) communications for a Custom Instruction call; (b) communications for the WOA.

The moment we move to tightly coupled buses or architectures with more than two computational components, the efficiency of RPC based paradigms breaks down. The issues inherent in RPC have been recognised recently by authors concerned with Networked Data Services running on homogeneous computing architectures. For example, in [61] a method called RPC Chains is introduced in which clients attach code to their RPC based requests for compilation and execution at service locations to determine where their service responses will be directed, and in [60] client specific continuation code is statically distributed to homogeneous coarse-grained service locations to filter and direct service responses along fixed data flow paths.

In this work we focus our attention on the RPC based paradigms used in systemised program acceleration approaches aimed at Distributed and High-Performance Computing (DHPC) environments [58]. In systemised DHPC acceleration an automatable approach is applied to partition an existing application for execution on an architecture that could include heterogeneous components such as CPUs, GPUs and FPGAs specialised for certain types of computation [56,72]. While the interconnection approaches used in bespoke application development are infinitely flexible [11], the approaches used in systemised DHPC acceleration work today are far less so and tend to fall into one of the categories of Table 1. Unlike Networked Data Services, the execution paths of DHPC accelerated applications do not depend on who is running the program, so we do not need client specific chaining code [61,60]; however, DHPC execution paths do face the problem of input data dependent high-volume path variability caused by fine-grained internal loops and branches, and the added problem of execution across heterogeneous computational environments [55,53,67,56,72]. We will demonstrate our solution to these problems in Section 3, but before we detail our novel approach, we

use the rest of this section to briefly review the systemised DHPC execution paradigms of previous work listed in Table 1.

In Custom Instruction paradigms like those of [32,5,41], a CPU passes data to a custom instruction fabric using a write and then issues a read to block the processor and wait for the custom instruction's response. This process requires up to three latencies per call and is illustrated in Fig. 1(a), which we discuss in detail in Section 3 when we contrast our alternate approach.

Shared Memory paradigms like those of [14,10,40] are similar to Custom Instruction approaches except that program sections read the data they need from shared memory after activation instead of having the data pushed to them as part of the activation signal. Shared Memory paradigms are useful where the data that will be required by a custom instruction cannot be easily predicted by the caller; however, the paradigm suffers from additional latencies as shown in Table 1.

Distributed Client–Server paradigms like those of [13,17,51] pass request inputs to a service function using a write through an underlying protocol such as TCP or UDP [68] and wait for the service response to be sent back. While Client–Server paradigms do not suffer from the explicit client read required by the Custom Instruction and Shared Memory approaches to obtain their service responses, distributed Client–Server paradigms, like all RPC based paradigms, suffer from the need for a central message broker/hub, redirection engine or nested call stack, which can increase latency costs beyond amortisable asynchronous limits [13] in multiple component architectures as illustrated in Fig. 2(a) and discussed in detail in the next section.

The execution paradigm introduced in this paper can be used to replace previous systemised DHPC paradigms in architectures composed of any number of heterogeneous components to reduce communication latencies and remove the need for a central message hub.


Table 1. Characteristics of previous systemised Distributed and High-Performance Computing (DHPC) execution paradigms. The Latencies column gives the minimum number of cross-component latencies required to activate and obtain a useable result from a partitioned (service or custom instruction) code section. The crosses in the Flexible Path column indicate that the services cannot redirect their responses based on, for example, a result of their calculations, and the Result Target column shows where service results are sent.

Execution Paradigm              Latencies       Flexible Path   Result Target
Custom Instructions [32,5,41]   3               ✗               Caller
Shared Memory [14,10,40]        5               ✗               Caller
Client–Server [13,17,51]        4/2 (TCP/UDP)   ✗               Caller

Fig. 2. Communications for a Client–Server network protocol versus the WOA on a three component loosely coupled architecture: (a) Client–Server protocol; (b) communications for the WOA.

Further, the paradigm is compatible with existing Models of Computing [24,19,39], software assignment approaches [67,46,31,76,80,43] and implementation technologies [25,79,45,48,8,11,6,72,56,74,20,35,81,13], as will become clear when we detail the WOA methodology in Section 3 and provide experimental results in Section 4.

3. Methodology

3.1. The Write-Only Architecture

The Write-Only Architecture (WOA) is an execution paradigm based on control flows [63]. In the WOA, an executing code section passes the result of its calculation directly along the task's control flow path to the next executing code section target, rather than passing computation results back through an execution scheduler or message hub as in RPC based paradigms. Thus the distinguishing feature of the basic WOA paradigm is simple: in the WOA there are no reads of data or control signals between program code sections; there are only data writes along the control path.

Fig. 1 compares the Custom Instruction paradigm introduced in Section 2 with the tightly coupled WOA alternative. In Fig. 1(a) the Custom Instruction method of [32,5,41] is employed, wherein input parameters from, say, a CPU are written to a custom fabric and a read request is then sent from the CPU to the fabric to obtain the computational result.

It should be observed that the result read request can only be sent after the CPU/fabric communication bus is free, which means the return of the computational response can be delayed in cases where the computation time is less than the heterogeneous communication time; this can limit the code section granularity at which Custom Instruction implementations can provide performance improvements.

Fig. 1(b) shows the WOA implementation in which the CPU issues a write of parameters but no read, instead waiting on a change of a memory location's value or an interrupt to be issued when the service response has been received. When the custom instruction fabric calculates its response, it immediately writes it on the (free) bus without having to wait to receive a read communication, saving:

$$\text{Time Saving} = \max\!\left(\lambda + \frac{\alpha}{\beta} - \mu,\ 0\right) - \frac{\alpha}{\beta} \tag{1}$$

in total execution time in the example, assuming similar latency λ, bandwidth β, packet header overheads α (i.e. memory addresses in the Custom Instruction approach's initial write and read and in the WOA's writes) and computation times µ for the two implementations.
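To make the saving concrete, the sketch below evaluates Eq. (1) for one call; the 89 ns latency and 2 GB/s bandwidth match the single node figures of Table 6, while the 4-byte header and 50 ns computation time are purely illustrative.

```python
# Sketch of the per-call time saving of Eq. (1) on a tightly coupled
# link; the numeric values below are illustrative, not results.

def woa_saving_vs_custom_instruction(lam, beta, alpha, mu):
    """Eq. (1): saving = max(lam + alpha/beta - mu, 0) - alpha/beta.

    lam   -- one-way link latency (s)
    beta  -- link bandwidth (bytes/s)
    alpha -- packet header overhead, e.g. a 4-byte address (bytes)
    mu    -- custom instruction computation time (s)
    """
    header_time = alpha / beta
    # Time the CI fabric idles waiting for the read request to arrive,
    # which the WOA avoids by writing its result immediately...
    read_wait = max(lam + header_time - mu, 0.0)
    # ...less the extra target ID the WOA response write must carry.
    return read_wait - header_time

# Example: 89 ns latency, 2 GB/s bus, 4-byte header, 50 ns computation.
print(woa_saving_vs_custom_instruction(89e-9, 2e9, 4, 50e-9))  # ~39 ns
```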


Fig. 3. A typical WOA activation packet. The packet width is either the bus width or a notional bit grouping for a serial link, and the data section could include parameters, contexts and other data as discussed in the text.

Fig. 4. Two WOA Controllers (labelled JMP and MUX) implementing an if then else distributed over a heterogeneous CPU (top) and data flow (bottom) architecture using WOA activation packets.

Fig. 2 compares the Client–Server paradigm [13,17,51] introduced in Section 2 with the loosely coupled (i.e. communicating over an arbitrarily buffered network as opposed to a tightly coupled circuit bus) WOA alternative. In Fig. 2(a) client A activates service B to obtain a service response using Client–Server UDP with constructed message ACKnowledgements sent off the main control path. When B completes its computation it returns the result back to A (its RPC caller), which then passes the result on to service component C for additional transformation. Fig. 2(b) shows the corresponding WOA implementation in which A calls B, B writes its response directly to C and C returns the final computation to A, saving:

$$\text{Time Saving} = \lambda + \frac{\eta}{\beta} \tag{2}$$

in total execution time in the example, assuming similar node to node latencies λ, bandwidths β and data packet sizes η (including header overheads α) maintained for the two implementations.

Unlike homogeneous RPC Chains [61], filtering fixed path continuations [60], Promise compositions [36,13] and traditional Data Flow Architectures [16,23,47], the WOA data flows may be circularly or dynamically routed through heterogeneous components without returning to their caller at any point to implement fine-grained program loops, result conditional branches and other standard internal program flow constructs across distributed components. To allow execution path flexibility the WOA computation steps use distributed heterogeneous logic to generate WOA activation packets sent along the control flow path.

A typical WOA activation packet is illustrated in Fig. 3. While it is recognised there are a variety of possible WOA activation packet implementation choices [28,67,10,11,53,75], typical WOA packets differ from simple call or response packets because they include a target identifier (Target ID) specifying the fine-grained program code section to execute next; they differ from Active Messages in that they represent control flows rather than producer–consumer data flows between parallel processes [75]; they are language [61,13,36] and transport mechanism independent to support heterogeneous generation [56,72] with receiver type casting; and they are lighter weight than general Network Service Packets [70,61,60] to support high-volume fine-grained program control-flow constructs [55,53,67] as illustrated.
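As an illustration of the packet format of Fig. 3, the sketch below packs and unpacks a WOA-style activation; the 4-byte little-endian Target ID and the field layout are assumptions of this sketch rather than a normative WOA wire format.

```python
import struct

# Illustrative WOA activation packet layout (not a normative format):
# a 4-byte Target ID naming the code section to execute next, followed
# by the data section (operands, context updates, return target IDs).
PACKET_HEADER = struct.Struct("<I")  # little-endian 4-byte Target ID

def pack_activation(target_id: int, data: bytes) -> bytes:
    """Build a WOA activation packet for the control flow successor."""
    return PACKET_HEADER.pack(target_id) + data

def unpack_activation(packet: bytes):
    """Split a received packet into (target_id, data)."""
    (target_id,) = PACKET_HEADER.unpack_from(packet)
    return target_id, packet[PACKET_HEADER.size:]

# Example: activate code section 17 with two 32-bit operands.
pkt = pack_activation(17, struct.pack("<ii", 40, 2))
print(unpack_activation(pkt))
```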


Fig. 4 shows an input data dependent if then else loop implemented on a heterogeneous architecture using WOA activation packets. In the example, program Section 3 represents a parallel computation and a logical if whose branch direction can only be determined at run-time, which are implemented with a data flow configuration on an FPGA [79], TRIPS EDGE [12] or WaveScalar [42] computational component. Unlike RPC, the run-time determination of the next program section to execute (the then or else body) is performed on the data flow component by populating and sending a selected WOA activation packet template to the CPU based on direct run-time comparison of the data flow computation result, rather than offloading the if branch comparison logic to a central message hub, which could require additional latencies if the message broker were implemented on a third component. When the WOA packet arrives at the CPU, the CPU's WOA Controller simply examines the Target ID of the WOA activation packet and chooses either software code Section 1 (the if body) or 2 (the else body) to execute next, allowing the architecture to achieve computational specialism benefits for code Sections 1 and 2, which may have, for example, low parallelism and so be better suited for execution on a high-performance CPU rather than a data flow component [79,12,42].

WOA Controllers can be thought of as simplified, low-overhead RPC message brokers [81] that remove the central hub communication restrictions of RPC derived paradigms by being distributed across the components of an architecture. WOA Controllers are similar in concept to WaveScalar φ⁻¹ instructions in that they direct data flows [71] as illustrated in Fig. 4; however, WOA Controllers differ in that they can have unrestricted branch factors and target dependent data packet sizes, which can be beneficial when implementing switch and other high-level program constructs. Additionally, WOA Controllers can be used to serialise fine-grained access to sequential components such as CPU cores shared by parallel tasks on Heterogeneous Multicore Systems [66,36] and to collate parallel thread computation results before activating a code section where WOA activations are sent to multiple locations [67,10,11,53] using standard resource locking mechanisms [63].
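A software-side WOA Controller can be pictured as a small dispatch table keyed on Target IDs. The sketch below, which reuses the hypothetical unpack_activation helper from the previous listing, is one possible shape for the CPU side of Fig. 4 and is not taken from the paper.

```python
# Sketch of a CPU-side WOA Controller: it examines the Target ID of each
# arriving activation packet and runs the matching code section, which in
# turn writes its own activation along the control flow path. The section
# registry and dispatch mechanism are illustrative assumptions.

class WOAController:
    def __init__(self):
        self.sections = {}  # Target ID -> callable code section

    def register(self, target_id):
        def bind(fn):
            self.sections[target_id] = fn
            return fn
        return bind

    def dispatch(self, packet: bytes):
        target_id, data = unpack_activation(packet)
        # No read-back to the sender: the section's result is written
        # directly to the next target on the control flow path.
        self.sections[target_id](data)

controller = WOAController()

@controller.register(1)
def if_body(data):    # the 'then' branch chosen by the data flow side
    ...

@controller.register(2)
def else_body(data):  # the 'else' branch
    ...
```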


Table 2. Characteristics of previous DHPC execution paradigms along with those of the WOA. The column headings are the same as Table 1. The ≤ in the WOA row is a consequence of the distribution of the RPC message hub responsibilities allowing communications to be passed directly between code sections at the same location rather than through a cross-partition centralised message broker call.

Execution Paradigm              Latencies       Flexible Path   Result Target
Custom Instructions [32,5,41]   3               ✗               Caller
Shared Memory [14,10,40]        5               ✗               Caller
Client–Server [13,17,51]        4/2 (TCP/UDP)   ✗               Caller
Write-Only Architecture         ≤1              ✓               Control path

The inclusion of target identifiers in WOA activation packets has the apparent drawback of increasing WOA packet transfer sizes (and so transfer times) over the alternatives. However, it should be recognised that the RPC based alternatives generally include a similar identifier in one or more of their communications. For example, the Custom Instruction paradigm of Fig. 1 includes a memory location in both its initial write and read communications, and the Client–Server paradigm of Fig. 2(a) would include a function ID in both its initial activation and response (because of the stateless nature of UDP [68] in the example).

The data portion of the WOA activation packet illustrated in Fig. 3 may include operands, global context updates, loop counts and shared code return target IDs, with local data statically distributed as in a traditional approach. Additionally, the WOA packets may be used to pass through dynamic data intended for later code sections in an execution path to remove the need for Shared Memory paradigm overheads where context information cannot be sent directly off the main control path, broadcast or accessed efficiently through a memory hierarchy in a cache coherent system [23]. Target data requirements can be readily identified in advance for inclusion in the WOA packet through data dependence analysis frameworks such as 3S [62,63] or the like [44,49,38,78] for most computationally intense programs, and for programs where data requirements are not reliably predictable in advance (such as LZ77 [82]) the data section of the WOA activation packet can be used to pass a full change context [26] or predicted data set [46] to remove the need to resort to a Shared Memory pull paradigm.

Including speculative data in WOA activations to support unpredictable control paths may increase the WOA packet size unnecessarily. However, the trend towards higher bandwidth connections means that passing potentially unnecessary data can be far more efficient than incurring additional latencies from Shared Memory data lookups. For example, in the Axel architecture [72] discussed in Section 4 of this work, the tightly coupled intra-node round-trip latency is 178 ns and the loosely coupled inter-node round-trip latency is 96 µs. In these latency times we could send 356 additional context bytes through the tightly coupled Axel connection or 12,000 additional bytes through the loosely coupled Axel connection and, as bandwidths increase with latencies approaching physical limits, the bandwidth/latency equivalence is likely to increase further in future architectures.

We will return to the subject of WOA packet data when we detail WOA communication costs in the next section, but before doing so let us conclude this section with Table 2, which compares the key characteristics of previous RPC based paradigms with those of the WOA approach. As explained earlier, the WOA avoids the need for a central hub so that calls to services on the same location can be sent directly, resulting in an average of less than one cross-component latency for an equivalent RPC service activation as shown in the table.

3.2. Timing objective

The WOA execution model associates an activation transfer along the control flow path with each code section computation. Thus the Expected Time to Compute (ETC) [2] a task assignment on a distributed architecture using the WOA will be:

$$t = \sum_{pl} x_{pl}\,\mu_{pl} + \sum_{pqlm} x_{pl}\,x_{qm}\,c_{pqlm} \tag{3}$$

Table 3. Hardware and software characteristic symbols used in the WOA timing equations presented in Section 3.2. For the results of this work, the hardware characteristics were obtained from published data sheets [79,25,73] and the software characteristics from 3S measurements [62,63].

Symbol   Hardware Characteristic      Symbol   Software Characteristic
τl       Cycle time                   µpl      Execution time
ωl       Parallel execution units     φpl      Parallel execution slots
ϵpl      Execution efficiency         ιp       Program code unit iterations
λlm      Communication latency        χpq      Control flows from p to q
βlm      Data bandwidth               ηpqlm    WOA data sent from p to q

where xpl is 1 if code section p is instantiated at location l and 0 otherwise, µpl is the total computation time of code section p on location l, and cpqlm is the total communication time for WOA activations sent from code section p on l to q on m.

The computation times µpl and communication times cpqlm used in Eq. (3) can be measured where suitable implementations exist for all hardware components in an architecture [18,54]. However, for general design-space exploration, where the potential benefits of a possible new hardware architecture, co-design distribution or execution paradigm are being explored, the code section execution and communication times often need to be estimated by other means [65,33]. For the paradigm comparison results of this paper, the µpl values were measured for the CPU and estimated for other components using published component characteristics [79] and 3S measurements [62,63] for representative program input data [30] with the equation:

$$\mu_{pl} = \frac{\iota_p\,\phi_{pl}\,\tau_l}{\epsilon_{pl}} \tag{4}$$

Referring to the symbols defined in Table 3, Eq. (4) is simply the total number of parallel execution slots a program section would take to execute at a location (iteration count ιp multiplied by the number of parallel slots per iteration φpl) multiplied by the time required for each parallel slot to execute (the cycle time of the component τl divided by the effective issue rate ϵpl). For the results of this work, the communication costs cpqlm were estimated using the equation:

$$c_{pqlm} = \chi_{pq}\,\lambda_{lm} + \frac{\eta_{pqlm}}{\beta_{lm}} \tag{5}$$

with the latency and bandwidths obtained from published figures [25,73] and control and data flows measured by 3S [62,63]. Referring back to Fig. 3 and again to Table 3, Eq. (5) is simply the number of control flows between code sections χpq multiplied by the communication latency λlm, plus the total number of bytes in the activation packets ηpqlm divided by the bandwidth βlm. Eq. (5) is general enough to cope with implementations that wrap intra-component control flows in WOA packets (l and m can be the same) and self loops (p and q can be the same), and it can be used to account for packet transmission overheads (λ can account for link layer and β for encoding overheads) and data packet overheads (ηpqlm can include packet headers α) if required.
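For readers who want to experiment with the timing model, the following sketch evaluates Eqs. (3)-(5) over dense NumPy arrays; the array shapes and names are our own assumptions, and representative data would come from 3S-style measurements.

```python
# Sketch of the WOA timing model of Eqs. (3)-(5) over dense matrices;
# indices p,q range over code sections and l,m over locations, and the
# inputs follow the measured/estimated characteristics of Table 3.
import numpy as np

def mu(iota, phi, tau, eps):
    """Eq. (4): iterations x slots per iteration x cycle time / efficiency.

    iota: (P,), phi: (P,L), tau: (L,), eps: (P,L) -> mu: (P,L)
    """
    return (iota[:, None] * phi) * (tau[None, :] / eps)

def comm_cost(chi, eta, lam, beta):
    """Eq. (5): c[p,q,l,m] = chi[p,q]*lam[l,m] + eta[p,q,l,m]/beta[l,m]."""
    return (chi[:, :, None, None] * lam[None, None, :, :]
            + eta / beta[None, None, :, :])

def expected_time_to_compute(x, mu_pl, c_pqlm):
    """Eq. (3): t = sum x[p,l]*mu[p,l] + sum x[p,l]*x[q,m]*c[p,q,l,m]."""
    return (x * mu_pl).sum() + np.einsum("pl,qm,pqlm->", x, x, c_pqlm)
```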

Table 4. Characteristics of the six software tasks partitioned in this work. The Category column corresponds to the MiBench category of the task and the other columns are 3S version 2.10 [62,63] measurements at the basic block level of granularity obtained by compiling to x86 CPU instructions with gcc -O3 and running with representative program inputs. The symbols correspond to the earlier definitions, with |P| the number of basic blocks in the task, δpf the total size of the task on an FPGA component of Fig. 5 in x86 Instruction Equivalents, max φpf the maximum number of parallel slots required to execute blocks of the task on an FPGA component, Σχpq the sum of the inter-block control flows and Σηpq the sum of the non-cached inter-block data flows (in bytes) for each task.

Task      Category [21]   |P|    δpf    max φpf   Σχpq        Σηpq
dijkstra  Network         117    295    7         9,864,826   26,750,704
fft       Telecom         130    438    18        3,670,310   11,470,084
ispell    Office          1033   3242   9         5,856,315   35,981,124
jpeg      Consumer        1776   7657   20        4,406,038   12,414,004
sha       Security        68     816    6         6,609,432   82,246,556
susan-e   Industrial      249    2491   72        4,719,519   28,121,940

3.3. Optimising the objective

In this section we use the execution timing equation of Section 3.2 to construct a Mixed Integer Quadratic Program (MIQP) that can be solved to identify the optimal assignment of program code sections for a multi-component architecture including communication costs [57] using the WOA paradigm. Our optimisation problem is to find the best (minimum) execution time from the set of feasible assignments X, where a feasible assignment x ∈ X is one that:

1. assigns every code section exactly once somewhere (see [67,29] for the alternative);
2. does not violate hardware size constraints;
3. assigns code sections that communicate with external code to hardware locations that can access/be accessed by that external code.

The first two constraints should be readily understood, with the third a practical implementation constraint to ensure the program entry, exit and system library calls are located on, for example, a CPU running an appropriate OS to allow static control transfer at execution time. The three feasibility constraints above can be expressed in linear programming form and joined with the quadratic objective of Eq. (3) to create the formal WOA Assignment Problem of Definition 1.

Definition 1 (WOA Assignment Problem).

$$\min_{x \in X}\ \sum_{pl} x_{pl}\,\mu_{pl} + \sum_{pqlm} x_{pl}\,x_{qm}\,c_{pqlm} \tag{6}$$

where x ∈ X is an assignment of code sections to locations satisfying the feasibility constraints:

$$\sum_{l} x_{pl} = 1 \quad \forall\ p \in P \tag{7}$$

$$\sum_{p} x_{pl}\,\delta_{pl} \le \Delta_l \quad \forall\ l \in L \tag{8}$$

$$\sum_{l \in L_e^p} x_{pl} = 1 \quad \forall\ p \in P_e \tag{9}$$

with:

p, q       code sections within the program P
l, m       locations from the set of possible computational locations L
xpl        1 if p is assigned to l and 0 otherwise
δpl        the size of code section p on location l
∆l         the maximum capacity of location l
µpl        the computation time for p on location l
cpqlm      the time for communications from p on l to q on m
Pe ⊆ P     code sections communicating with external code
Le^p ⊆ L   locations with access to p ∈ Pe
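As a concrete illustration of how Definition 1 can be handed to a standard solver, the sketch below builds the assignment problem with the open-source PuLP modelling library, using the usual product linearisation of the quadratic terms; the paper itself uses a linearised [64] model solved with CPLEX, and the input containers (dicts of sizes, capacities and costs) are assumptions of this sketch.

```python
# Sketch of Definition 1 with the quadratic terms linearised in the
# standard way (y[p,l,q,m] stands in for x[p,l]*x[q,m]); PuLP's default
# solver stands in for the CPLEX solver used in the paper.
import pulp

def assign(P, L, mu, c, delta, Delta, Pe, Le):
    prob = pulp.LpProblem("woa_assignment", pulp.LpMinimize)
    x = {(p, l): pulp.LpVariable(f"x_{p}_{l}", cat="Binary")
         for p in P for l in L}
    y = {(p, l, q, m): pulp.LpVariable(f"y_{p}_{l}_{q}_{m}", lowBound=0)
         for p in P for l in L for q in P for m in L}
    # Eq. (6): computation time plus WOA communication time.
    prob += (pulp.lpSum(x[p, l] * mu[p][l] for p in P for l in L)
             + pulp.lpSum(v * c[p][q][l][m]
                          for (p, l, q, m), v in y.items()))
    # Linearisation: y >= x_pl + x_qm - 1; tight at the optimum for c >= 0.
    for (p, l, q, m), v in y.items():
        prob += v >= x[p, l] + x[q, m] - 1
    # Eq. (7): every code section assigned exactly once.
    for p in P:
        prob += pulp.lpSum(x[p, l] for l in L) == 1
    # Eq. (8): hardware capacity limits.
    for l in L:
        prob += pulp.lpSum(x[p, l] * delta[p][l] for p in P) <= Delta[l]
    # Eq. (9): externally communicating sections on reachable locations.
    for p in Pe:
        prob += pulp.lpSum(x[p, l] for l in Le[p]) == 1
    prob.solve()
    return {k: v.value() for k, v in x.items()}
```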

Definition 1 can be readily recognised as a strongly NP-hard problem through its quadratic communication costs cpqlm and binary assignment variables xpl [63,22,52]. However, despite the theoretical complexity of the problem, thousand node practical instances were solved to optimality in just a few seconds in this work, as we shall show in Section 4.3 after providing comparative optimal objective results for specific problem instances of Definition 1.

4. Results

4.1. Experimental configuration

This section compares the execution times of a set of tasks partitioned for multi-component heterogeneous architectures using different execution paradigms. The tasks were selected to cover each of the six MiBench benchmark categories [21] with software measurements made using 3S version 2.10 [62,63] at the basic block [59] level as summarised in Table 4.

The tasks are partitioned for execution on the two architectures illustrated in Fig. 5(a) and (b), which are based on the Imperial Axel [72] heterogeneous computational cluster. We will be assigning to only the CPU and FPGA components of the Axel nodes for architectural clarity, which will be sufficient for us to demonstrate that the WOA has benefits over alternative paradigms for specific design-points as in previous distributed execution performance comparison work [32,5,41,14,10,40,13,17,51]; we will discuss the general significance of our specific experimental results in Section 4.3. In any case it should be clear that the formulation presented in Section 3 is problem instance independent and could be applied to alternate tasks, assignment granularities, cluster nodes and computational components (such as GPUs) without modification to allow detailed design-space performance exploration of reader specific architectures if required [65].

The computation and communication characteristics of the hardware are provided in Tables 5 and 6. The computation characteristics are based on the results of [72] and the communication characteristics correspond to a local PCIe bus [25] and a networked Gigabit Ethernet [73] for the single and two node architectures respectively.

We will be comparing the running times of the software benchmarks executing on the architectures using the WOA, Custom Instruction and Client–Server paradigms detailed in Section 3. To ensure our paradigm quality comparison is not biased by heuristic sub-optimality, we compare results from optimal CPLEX [27] assignments, which we label WOA, CI and CS for the three paradigms respectively. Our formal WOA model is a linearised [64] version of Definition 1, and our CI and CS models are based on the WOA model with asymmetric modifications to the communication timing objective of Eq. (5) to model the central hub and relative overheads of the different paradigms as shown in Table 7.


Fig. 5. The multi-component architectures considered in this work: (a) single node heterogeneous architecture; (b) two node heterogeneous architecture. See Tables 5 and 6 for the corresponding computation and communication characteristics.

Table 5. Computational characteristics for the hardware components of Fig. 5. The components are based on the CPU and FPGAs in the Axel architecture [72] with AMD Phenom CPU cores [1] operating at 2.3 GHz and Xilinx LX330T FPGA reconfigurable fabrics [79] operating at an effective frequency of 266.67 MHz with an x86 integer Instruction Equivalents (IEs) capacity of 2560 obtained through calibration against [72]. Note that Eq. (4) was used to model the FPGA computation times with the characteristics below, but actual 3S timing measurements were used for the CPU in this work to account for ϵpl variability [63].

Symbol   Characteristic                      CPU        FPGA Fabric
τl       Effective cycle time                0.435 ns   3.75 ns
ωl       Parallel execution units            3          2560
ϵpl      Execution efficiency (for all p)    100%       100%
∆l       x86 instruction capacity            –          2560

Table 6. Symmetric cross-component characteristics for the CPU/FPGA communications of Fig. 5. The one node costs are based on an 8 channel PCIe 2.0 bus [79] with latencies from the Cut-Through response figures of [25], and the two node costs are based on a Gigabit Ethernet with latencies from [73].

Symbol   Characteristic    One Node   Two Node
λlm      Bus latency       89 ns      48 µs
βlm      Data bandwidth    2 GB/s     125 MB/s

Table 7. Modifications made to the WOA communication timing equation (5) to simulate the Custom Instruction CI and Client–Server CS paradigms formally for the results of this section. λ, β, α, χ and c are the latency, bandwidth, packet header overhead, control flows and total WOA communication costs as described earlier in this paper. Here, µ^χ is a per activation computation time, Σ^χpq means sum over all p to q activations, and the c′ functions are directional communication costs for the RPC based execution paradigms. The p and q subscripts refer to general code sections and the u and f subscripts refer to the CPU and FPGA computational locations respectively.

Modified Cost   CI                                               CS
c′pquf          cpquf + Σ^χpq max(λuf + αCI/βuf − µ^χ_qf, 0)     cpquf
c′pqfu          cpqfu − χpq·αWOA/βfu                             cpqfu
c′pquu          cpquu                                            cpquu
c′pqff          cpqfu + cpquf                                    cpqfu + cpquf

Considering first the CS column of Table 7, we see that the Client–Server paradigm is equivalent to the WOA paradigm (i.e. the CS c′ functions are the same as the WOA c functions) except for communications between code sections located on the reconfigurable hardware of Fig. 5(b), where:

$$c'_{pqff} = c_{pqfu} + c_{pquf} \tag{10}$$

indicating that inter-service communications are sent via the CPU as a central hub. The same central hub restriction can be seen in the c′pqff cell of the Custom Instruction CI model, along with two further modifications: a cost for the additional CPU to co-processor read added to c′pquf (cpquf accounts for the initial CI parameter write), and a reduction in the c′pqfu transport time to account for the lack of target IDs in the Custom Instruction responses as discussed in Section 3.

The computation and communication costs for the objective of each of the models are calculated off-line using 3S measurements and hardware characteristics from Tables 5 and 6 to construct expanded problem instances of Definition 1, which we linearise and solve using CPLEX to obtain the results reported in Section 4.2. We assume each basic block code section has the same computation cost for all activations on the FPGA (i.e. c′pquf = cpquf + max(χpq λuf + αCI/βuf − µqf, 0) for CI, where µqf is the total computation time for q on f from Eq. (4)) [67,63], that target ID, RPC function ID and Custom Instruction memory addresses are 4 bytes each (i.e. all the α values are 4), that packet sizes are within transport thresholds, that communications are reliable, that WOA intra-component and broker overheads are negligible, and that indirect data flows can be sent off the main control flow path when generating each problem instance, so that we can examine the effect of the paradigm specific differences in isolation; we will discuss the significance of these implementation assumptions in Section 4.3.
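The following sketch shows one way the Table 7 modifications might be applied programmatically to the base WOA costs; the function names and input containers are illustrative, and the CI read cost uses the stabilised per-activation form quoted above.

```python
# Sketch of the Table 7 modifications that turn base WOA costs into CI
# and CS model costs for a CPU u and an FPGA f; c, chi, lam, beta, alpha
# and mu follow the symbols defined earlier in the paper.

def cs_costs(c, u, f, p, q):
    """Client-Server: FPGA-to-FPGA traffic relayed through the CPU hub."""
    return {
        (u, f): c[p][q][u][f],
        (f, u): c[p][q][f][u],
        (u, u): c[p][q][u][u],
        (f, f): c[p][q][f][u] + c[p][q][u][f],  # Eq. (10)
    }

def ci_costs(c, chi, lam, beta, alpha_ci, alpha_woa, mu_qf, u, f, p, q):
    """Custom Instruction: add the blocking read, drop response target IDs."""
    return {
        (u, f): c[p][q][u][f] + max(
            chi[p][q] * lam[u][f] + alpha_ci / beta[u][f] - mu_qf, 0.0),
        (f, u): c[p][q][f][u] - chi[p][q] * alpha_woa / beta[f][u],
        (u, u): c[p][q][u][u],
        (f, f): c[p][q][f][u] + c[p][q][u][f],
    }
```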

4.2. Paradigm performance results

Table 8 shows the program accelerations for the software, architectures and execution paradigm models described in Section 4.1. The program accelerations were calculated with:

$$\text{Acceleration} = \frac{t_{CPU}}{t^*} \tag{11}$$

where tCPU is the CPU only execution time for the tasks and t* is the execution timing objective of the optimal heterogeneous assignment for the corresponding formal model. The results of Table 8 show that the WOA is consistently better than the alternatives, with the WOA model providing up to 3.2 times program acceleration where previous methods fail to provide any significant acceleration whatsoever. The Amdahl's limit column gives the maximum theoretical acceleration achievable for the heterogeneous computational components ignoring all communication costs, which represents a limit for any execution paradigm [3]. In four cases (dijkstra, jpeg, sha and susan-e) the low latency single node WOA assignments are within 10% of Amdahl's limit. Further, the WOA model is still able to identify accelerations for all but one of the benchmarks even when working on the two node architecture with components communicating over a higher latency UDP network. The difference between the WOA accelerations and the Amdahl's limits for the fft and ispell benchmarks is explained by Table 9.


Table 8. Execution accelerations for the experimental configuration of Section 4.1. The results are for basic block level assignments using formal models for the WOA (WOA), the Custom Instruction (CI) and the Client–Server (CS) execution paradigms for the same software, hardware and implementation assumptions.

Task      Amdahl's Limit [3]   One Node WOA   One Node CI   Two Node WOA   Two Node CS
dijkstra  3.343x               3.129x         1.000x        1.000x         1.000x
fft       1.450x               1.081x         1.000x        1.058x         1.000x
ispell    4.548x               2.793x         1.000x        1.011x         1.000x
jpeg      3.264x               3.222x         1.001x        2.702x         1.000x
sha       2.200x               2.198x         1.000x        1.774x         1.000x
susan-e   2.708x               2.504x         1.031x        2.170x         1.030x

Table 9. CPU to FPGA calls for the assignments of Table 8.

Task      Amdahl's Limit [3]   One Node WOA   One Node CI   Two Node WOA   Two Node CS
dijkstra  40,365               30,022         0             0              0
fft       835,680              6              3             1              0
ispell    160,216              115,272        32            1              0
jpeg      7,125                899            143           46             0
sha       235                  233            0             229            0
susan-e   143,308              21             14            2              1

Table 9 shows the number of CPU to FPGA calls for the optimal assignments of the different execution models, defined as:

$$\text{CPU to FPGA Calls} = \sum_{pq} x_{pu}\,x_{qf}\,\chi_{pq} \tag{12}$$

with x an indicator from the optimal assignment results, u and f the CPU and FPGA locations in the architectures of Fig. 5 respectively, and χpq the number of control flows from code section p to q. It can be seen from the table that the Amdahl's assignments have a larger number of cross-component calls than any other model. This is because the Amdahl's limit ignores the cost of communications [3]. If we were to include the tightly coupled communication costs with the Amdahl assignments, the accelerations achievable for the fft and ispell benchmarks would drop to 0.688 and 2.659 times, and the WOA assignments of Table 8 would represent a 57% and 6% improvement in acceleration respectively.

Table 10 shows the size of the reconfigurable hardware (FPGA) partitions for the different execution models, defined as:

$$\text{FPGA Partition Size} = \sum_{p} x_{pf}\,\delta_{pf} \tag{13}$$

where xpf is an assignment indicator which is one if code section p is assigned to the FPGA and zero otherwise, and δpf is the size of code section p on the FPGA location f (measured in x86 Instruction Equivalents in this paper to correspond with the calibrated reconfigurable component size provided in Table 5). It can be seen that the WOA is able to benefit from almost all the available hardware space for the ispell and jpeg partitions on the single node (lower latency) architecture. If we remove the space constraint, the WOA accelerations remain the same for ispell and increase by only 1% for jpeg, with a size cost increase of almost two extra Axel FPGAs (4812 extra IEs).

4.3. Bounds and complexity

The results of Section 4.2 demonstrate that the WOA is capable of providing superior program accelerations to previous paradigms for a single set of software benchmarks, architectures and implementation assumptions. In this section we extend the results to the general case and examine the complexity of the WOA assignment problem.

Table 10. FPGA partition sizes in x86 Instruction Equivalents (IEs) for the assignments of Table 8.

Task      Amdahl's Limit [3]   One Node WOA   One Node CI   Two Node WOA   Two Node CS
dijkstra  255                  176            0             0              0
fft       342                  137            7             119            0
ispell    2560                 2557           94            51             0
jpeg      2560                 2560           348           2476           0
sha       788                  753            0             390            0
susan-e   1793                 1983           44            1864           7

Starting first with the Client–Server execution paradigm, we see from Table 7 that the CS model is constructed from the WOA model by adding hub routing costs for communications between code sections assigned to service locations. The modified costs, namely c′pqff = cpqfu + cpquf, are independent of the software and architecture characteristics which, while affecting the base WOA model's c figures, do not modify the relative sense of the CS and WOA objective timings. Thus, as the additional CS costs are positive and added in a minimising objective to problems with the same feasible solution space based on Definition 1, it is clear that the WOA execution timing objective will always be at least as good as the CS model for consistent software, architecture and implementation assumptions [50].

Expressed more formally, the c′pqff equation from Table 7 allows us to provide the following equation quantifying the WOA improvements for any valid CS assignment of general software to general hardware:

$$t_{WOA} = t_{CS} - \sum_{pqff'} x_{pf}\,x_{qf'}\,\max\!\left(c_{pqfu} + c_{pquf'} - c_{pqff'},\ 0\right) \tag{14}$$

where t is a model execution time objective value, u is the central hub location, f and f′ are any non-hub locations (i.e. f ≠ u, f′ ≠ u) in a general architecture, and the other symbols are as previously defined. The max above is a result of the reasonable architecture assumption [56,72] that we can route WOA communications through u if cpqfu + cpquf′ < cpqff′, and allows us to provide the bound:

$$t_{WOA} \le t_{CS} \quad \forall\ x \in X \tag{15}$$

for any feasible assignment x ∈ X, which formally indicates that the WOA execution time tWOA can never be more than the CS execution time tCS.

For the Custom Instruction model CI we see that there are software, architecture and implementation specific parameters and a negative in the modified costs of Table 7. Examining the c′pqfu and c′pquf equations of Table 7, we see that we can guarantee the WOA will always be at least as good as the CI model for every valid assignment only if:

$$\sum_{\chi_{pq},\,p} \max\!\left(\lambda_{uf} + \frac{\alpha_{CI}}{\beta_{uf}} - \mu^{\chi}_{qf},\ 0\right) \ \ge\ \sum_{q'} \chi_{qq'}\,\frac{\alpha_{WOA}}{\beta_{fu}} \quad \forall\ q,\ f \neq u \tag{16}$$

which is a limit due to the potential for hiding read request times with computations and the lack of target IDs in Custom Instruction responses as discussed in Section 3. Noting that control flows to and from the assignable code sections will be symmetric as we start and end on the CPU, as ensured by the shared model constraints (9) of Definition 1 [67], Eq. (16) reduces and generalises to the requirement that the granularity threshold, which we define as:

$$\Gamma = \max_{q,\,f \neq u} \frac{\bar{\mu}^{\chi}_{qf}}{\lambda_{uf}} \tag{17}$$


be less than or equal to one in a routable symmetric bandwidth architecture [72,56] using memory addresses for WOA target IDs and per activation computation times stabilised to µ̄^χ_qf through appropriate code section partitioning [67,63], which are reasonable implementation assumptions for most tightly coupled work [32,5,41,69,76,43].

Eq. (17) demonstrates that we cannot guarantee the sense of the WOA's performance in comparison to the Custom Instruction approach for all hardware and software cases given consistent implementation assumptions, as we could for the Client–Server paradigm. Referring to the tightly coupled (single node) problem instance of Section 4.1, we see that λuf = 89 ns, and so the cutoff for us to guarantee WOA will always be more efficient than CI for a particular benchmark running on this architecture is that all assignable code sections in the benchmark take no more than 23 parallel slots to execute on the FPGA (because each slot has an effective cycle time of 3.75 ns from Table 5). The maximum FPGA parallel execution slots for the basic blocks of the dijkstra, fft, ispell, jpeg, sha and susan-e benchmarks of Table 4 were 7, 18, 9, 20, 6 and 72 respectively as measured by 3S [62,63], meaning we were guaranteed that WOA would be at least as good as CI for every assignment of all tasks except the susan-e benchmark (for which WOA still performed significantly better than CI because of the removal of the central hub RPC restriction and the fact that only 4 of the 249 susan-e blocks exceeded the 23 instruction slot threshold).

However, despite the lack of a general guarantee for the execution timing of tightly coupled WOA assignments compared with the Custom Instruction alternative, the Γ ≤ 1 threshold of Eq. (17) allows us to draw the general conclusion that the WOA will be more efficient than its Custom Instruction RPC alternative for smaller computation times, which makes the WOA paradigm an attractive replacement for the Custom Instruction approaches used in recent fine-grained assignment work seeking to capitalise on heterogeneous specialism [32,5,41,14,40,69,43].
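The 23 slot cutoff quoted above can be re-derived from the published characteristics; the following minimal sketch uses the Table 4, 5 and 6 figures.

```python
# Sketch of the Eq. (17) granularity check for the single node instance:
# WOA is guaranteed at least as good as CI when every assignable block
# needs no more than lam_uf / slot_time parallel slots on the FPGA.
LAM_UF = 89e-9       # one-way PCIe latency from Table 6
SLOT_TIME = 3.75e-9  # effective FPGA cycle time from Table 5

cutoff = int(LAM_UF / SLOT_TIME)  # -> 23 slots
max_slots = {"dijkstra": 7, "fft": 18, "ispell": 9,
             "jpeg": 20, "sha": 6, "susan-e": 72}  # Table 4
guaranteed = [t for t, s in max_slots.items() if s <= cutoff]
print(cutoff, guaranteed)  # only susan-e exceeds the threshold
```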


Referring back to Table 7, we see that the relationship between the WOA and CI execution times for a general CI assignment is governed by:

$$t_{WOA} = t_{CI} - \sum_{pqf,\,\chi_{pq}} x_{pu}\,x_{qf}\,\max\!\left(\lambda_{uf} + \frac{\alpha_{CI}}{\beta_{uf}} - \mu^{\chi}_{qf},\ 0\right) + \sum_{pqf} x_{pf}\,x_{qu}\,\chi_{pq}\,\frac{\alpha_{WOA}}{\beta_{fu}} - \sum_{pqff'} x_{pf}\,x_{qf'}\,\max\!\left(c_{pqfu} + c_{pquf'} - c_{pqff'},\ 0\right) \tag{18}$$

with the same routing assumptions and symbol definitions as Eq. (14), which allows us to provide the bound:

$$t_{WOA} \le t_{CI} + \sum_{pqf} x_{pf}\,x_{qu}\,\chi_{pq}\,\frac{\alpha_{WOA}}{\beta_{fu}} \quad \forall\ x \in X \tag{19}$$

for any feasible assignment x ∈ X, indicating formally that the WOA execution timing objective tWOA could be up to Σpqf xpf xqu χpq αWOA/βfu above the CI objective tCI due to the additional WOA target ID overheads. However, as already mentioned, this overhead will only be seen for coarse granularity assignments where hub improvements are not significant, that is, if the target ID costs outweigh the gains in Eq. (18).

While Eqs. (15) and (19) bound the worst case WOA improvements over an RPC alternative, the equations do not bound the best possible WOA performance improvements, as the optimal WOA assignments could be at different points in the shared solution space than the optimal CI and CS assignments [50]. Thus, while knowing for example that the WOA optimal will always be at least as good as the CS alternative, it would be of additional value for us to know if finding WOA optimal assignments is any more complex than finding optimal CS assignments, so we can be sure that the potential for WOA partition execution performance improvements is not balanced by an increase in assignment identification times.

As already mentioned in Section 3.3, the formal WOA model is easily recognised as being strongly NP-hard through its quadratic communication costs cpqlm and binary assignment variables xpl [63,22,52]. Specifically though, the WOA solution space complexity is:

$$O\!\left(|L|^{|P|}\right) \tag{20}$$

where |L| is the number of assignment locations and |P| is the number of assignable code sections in a task. Applying this equation to the characteristics of Table 4, we see that the solution space complexity for the tasks considered here is between 2^68 and 2^1776. However, despite this vast solution space, Table 11 shows that optimal solutions could be found with CPLEX [27] in just a few seconds on a standard computer, demonstrating that it is not necessary to resort to heuristics for all problems in the software assignment domain [32,41,14,10,40,13,17,51,76,37,31,9,26,65].

Table 11. Practical solution times (in seconds) for the optimal assignments of Section 4.2. The optimal solution timings were measured on a 2.13 GHz Intel Core 2 Duo Apple Mac with 4 GB of memory using the default configuration of 64-bit CPLEX 12.2.

Task      Amdahl's Limit [3]   One Node WOA   One Node CI   Two Node WOA   Two Node CS
dijkstra  0.00                 0.02           0.02          0.01           0.03
fft       0.00                 0.01           0.02          0.02           0.02
ispell    0.04                 0.48           0.17          0.15           0.21
jpeg      0.05                 0.50           0.37          1.74           1.01
sha       0.00                 0.02           0.03          0.03           0.03
susan-e   0.00                 0.05           0.06          0.05           0.09

Looking at Eq. (20) we see that the WOA solution space is highly sensitive to the number of assignment locations |L|, and thus it is not unreasonable to expect that practical formal model solution times could become an issue for architectures with a larger number of potential computational locations [67,66], making the application of heuristic and other techniques necessary [65,64,54,34]. It should be noted, however, that as the RPC based models can be considered simply different constant terms applied to an underlying WOA model as demonstrated in Section 4.1, we know that the RPC based paradigms have the exact same solution space complexity as the WOA, and thus we can expect similar optimal solution time issues for large node problems using the traditional paradigms, as supported by Table 11.

5. Conclusion

This paper presented the Write-Only Architecture (WOA) execution paradigm and contributed:

1. a description of the WOA along with a comparison of its key characteristics against previous paradigms;
2. a formal timing model that can be used to find optimal assignments of program sections to multi-component architectures using standard solvers;
3. experimental results from a suite of benchmarks demonstrating speed-ups of up to 3.2 times for the WOA along with generalising result bounds and complexity analysis information.

The results of Section 4.2 demonstrated that the WOA paradigm is capable of providing better accelerations than the previous systemised RPC execution models considered for the same software benchmarks, hardware architectures, characterisation measurements, assignment granularities, implementation assumptions and assignment algorithm optimality. In many cases,


the RPC based models were unable to produce an acceleration for the software at the fine measurement granularity used in this work due to the additional communication costs associated with the previous paradigms.

In Section 4.3 general bounds, complexity and solution time information was provided. The bound analysis allowed us to show that the optimal WOA assignment can never be worse than the best RPC based Client–Server alternative for consistent implementations, and that the WOA is more efficient than Custom Instruction approaches for finer grained computations on current architectures. With the complexity analysis we showed that the WOA has the same solution space complexity as previous RPC based methods, and our actual solve times demonstrated the practical applicability of the formal WOA model.

While this paper is self contained, there remains a piece of work to implement libraries to support the WOA for different hardware components, without which RPC based paradigms are likely to remain the norm in systemised Distributed and High-Performance Computing implementations. This future work may also need to address the problem of data predictability as discussed in Section 3 and require additions to characterisation frameworks such as 3S [62,63], along with modifications to the timing equations of Section 3.2 to model other forms of computational specialism [77] and specific architecture characteristics and implementation decisions [28,30], so that optimal WOA assignments can be found for future heterogeneous architectures.

Acknowledgements

The support of the UK EPSRC is gratefully acknowledged. Additionally we would like to thank the reviewers whose insightful comments helped improve this paper greatly.

References

[1] Advanced Micro Devices Inc., Desktop Processor Solutions, Advanced Micro Devices Inc., 2010.
[2] S. Ali, H. Siegel, M. Maheswaran, D. Hensgen, S. Ali, Representing task and machine heterogeneities for heterogeneous computing systems, Journal of Science and Engineering 3 (2000) 195–207.
[3] G. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, in: Proc. of the AFIPS Spring Joint Computer Conference, 1967, pp. 483–485.
[4] Apple Inc., MacBook Pro — a notebook full of innovations, http://www.apple.com/macbookpro/features.html#graphics, 2012.
[5] K. Atasu, C. Özturan, G. Dündar, O. Mencer, W. Luk, CHIPS: custom hardware instruction processor synthesis, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27 (3) (2008) 528–541.
[6] R. Azimi, A. Bilas, miNI: reducing network interface memory requirements with dynamic handle lookup, in: Proc. of the 17th ACM International Conference on Supercomputing, 2003, pp. 261–272.
[7] J. Bacon, H. K.G, Distributed computing with RPC: the Cambridge approach, Tech. Rep., Cambridge University, 1987.
[8] A. Bilas, D. Jiang, J. Singh, Shared virtual memory clusters: bridging the cost-performance gap between SMPs and hardware DSM systems, Journal of Parallel and Distributed Computing 63 (12) (2003) 1257–1276.
[9] T. Braun, H. Siegel, N. Beck, L. Bölöni, M. Maheswaran, A. Reuther, J. Robertson, M. Theys, B. Yao, D. Hensgen, R. Freund, A comparison of eleven static heuristics for mapping a class of independent tasks onto heterogeneous distributed computing systems, Journal of Parallel and Distributed Computing 61 (6) (2001) 810–837.
[10] F. Broquedis, O. Aumage, B. Goglin, S. Thibault, P.-A. Wacrenier, R. Namyst, Structuring the execution of OpenMP applications for multicore architectures, in: Proc. of the IEEE International Symposium on Parallel Distributed Processing, 2010, pp. 1–10.
[11] D. Buntinas, G. Mercier, W. Gropp, Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem, in: Proc. of the 6th IEEE International Symposium on Cluster Computing and the Grid, 2006, pp. 521–530.
[12] D. Burger, S. Keckler, K. McKinley, M. Dahlin, L. John, C. Lin, C. Moore, J. Burrill, R. McDonald, W. Yoder, Scaling to the end of silicon with EDGE architectures, IEEE Computer 37 (7) (2004) 44–55.
[13] S. Bykov, A. Geller, G. Kliot, J. Larus, R. Pandya, J. Thelin, Orleans: cloud computing for everyone, in: Proc. of the 2nd ACM Symposium on Cloud Computing, SOCC, 2011, pp. 16:1–16:14.
[14] E. Chung, J. Hoe, K. Mai, CoRAM: an in-fabric memory architecture for FPGA-based computing, in: Proc. of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2011, pp. 97–106.
[15] Dell Inc., Dell Precision Towers — Dell desktop tower workstations, http://www.dell.com/us/soho/p/precision-desktops/product-compare, 2012.
[16] J. Dennis, Data flow supercomputers, IEEE Computer 13 (11) (1980) 48–56.
[17] J. Duato, A. Peña, F. Silla, R. Mayo, E. Quintana-Orti, rCUDA: reducing the number of GPU-based accelerators in high performance clusters, in: Proc. of the IEEE International Conference on High Performance Computing and Simulation, 2010, pp. 224–231.
[18] P. Eles, Z. Peng, A. Kuchcinski, A. Doboli, System level hardware/software partitioning based on simulated annealing and tabu search, Journal on Design Automation for Embedded Systems 2 (1997) 5–32.
[19] G. Estrin, R. Fenchel, R. Razouk, M. Vernon, SARA (System ARchitects Apprentice): modeling, analysis, and simulation support for design of concurrent systems, IEEE Transactions on Software Engineering 12 (1986) 293–311.
[20] T. Goodale, S. Jha, H. Kaiser, T. Kielmann, P. Kleijer, G. von Laszewski, C. Lee, A. Merzky, H. Rajic, J. Shalf, SAGA: a simple API for grid applications, high-level application programming on the grid, Computational Methods in Science and Technology 1 (12) (2006) 7–20.
[21] M. Guthaus, J. Ringenberg, D. Ernst, T. Austin, T. Mudge, R. Brown, MiBench: a free, commercially representative embedded benchmark suite, in: Proc. of the 2001 IEEE International Workshop on Workload Characterization, 2001, pp. 3–14.
[22] P. Hahn, B. Kim, M. Guignard, J. Smith, Y. Zhu, An algorithm for the generalized quadratic assignment problem, Computational Optimization and Applications 40 (3) (2008) 351–372.
[23] J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, third ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[24] C. Hoare, Communicating sequential processes, Communications of the ACM 21 (8) (1978) 666–677.
[25] B. Holden, Latency Comparison Between HyperTransport and PCI-Express in Communications Systems, HyperTransport Consortium, 2006.
[26] M. Huang, V. Narayana, H. Simmler, O. Serres, T. El-Ghazawi, Reconfiguration and communication-aware task scheduling for high-performance reconfigurable computing, ACM Transactions on Reconfigurable Technology and Systems 3 (4) (2010) 20:1–20:25.
[27] IBM Corp., CPLEX 12.1 Manuals, IBM Corp., 2008.
[28] J.-K. Kim, S. Shivle, H. Siegel, A. Maciejewski, T. Braun, M. Schneider, S. Tideman, R. Chitta, R. Dilmaghani, R. Joshi, A. Kaul, A. Sharma, S. Sripada, P. Vangari, S. Yellampalli, Dynamically mapping tasks with priorities and multiple deadlines in a heterogeneous environment, Journal of Parallel and Distributed Computing 67 (2) (2007) 154–169.
[29] Y.-K. Kwok, On exploiting heterogeneity for cluster based parallel multithreading using task duplication, Journal of Supercomputing 25 (1) (2003) 63–72.
[30] Y.-K. Kwok, A. Maciejewski, H. Siegel, I. Ahmad, A. Ghafoor, A semi-static approach to mapping dynamic iterative tasks onto heterogeneous computing systems, Journal of Parallel and Distributed Computing 66 (1) (2006) 77–98.
[31] Y. Lam, J. Coutinho, W. Luk, P. Leong, Optimising multi-loop programs for heterogeneous computing systems, in: Proc. of the IEEE Southern Programmable Logic Conference, 2009, pp. 129–134.
[32] S.-K. Lam, T. Srikanthan, Rapid design of area-efficient custom instructions for reconfigurable embedded processing, Journal of Systems Architecture 55 (1) (2009) 1–14.
[33] Y. Li, J. Antonio, H. Siegel, M. Tan, D. Watson, Determining the execution time distribution for a data parallel program in a heterogeneous computing environment, Journal of Parallel and Distributed Computing 44 (1) (1997) 35–52.
[34] L. Liberti, Reformulations in mathematical programming: automatic symmetry detection and exploitation, Mathematical Programming (2010) 1–32.
[35] X.-H. Lin, Y.-K. Kwok, V. Lau, A quantitative comparison of ad hoc routing protocols with and without channel adaptation, IEEE Transactions on Mobile Computing 4 (2) (2005) 111–128.
[36] B. Liskov, L. Shrira, Promises: linguistic support for efficient asynchronous procedure calls in distributed systems, in: Proc. of the 1988 ACM SIGPLAN Conference on Programming Language Design and Implementation, 1988, pp. 260–267.
[37] Q. Liu, G. Constantinides, K. Masselos, P. Cheung, Combining data reuse with data-level parallelization for FPGA-targeted hardware compilation: a geometric programming framework, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 28 (3) (2009) 305–315.
[38] C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. Reddi, K. Hazelwood, Pin: building customized program analysis tools with dynamic instrumentation, in: Proc. of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, vol. 40, 2005, pp. 190–200.
[39] N. Lynch, M. Tuttle, Hierarchical correctness proofs for distributed algorithms, in: Proc. of the 6th ACM Symposium on Principles of Distributed Computing, 1987, pp. 137–151.
[40] R. Lysecky, G. Stitt, F. Vahid, Warp processors, ACM Transactions on Design Automation of Electronic Systems 11 (3) (2006) 659–681.
[41] O. Mencer, D. Pearce, L. Howes, W. Luk, Design space exploration with a stream compiler, in: Proc. of the 2nd IEEE International Conference on Field-Programmable Technology, 2003, pp. 270–277.

[42] M. Mercaldi, S. Swanson, A. Petersen, A. Putnam, A. Schwerin, M. Oskin, S. Eggers, Instruction scheduling for tiled dataflow architectures, in: Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS, 2006, pp. 141–150.
[43] D. Naishlos, J. Nuzman, C.-W. Tseng, U. Vishkin, Towards a first vertical prototyping of an extremely fine-grained parallel programming approach, Theory of Computing Systems 36 (2003) 521–552.
[44] N. Nethercote, J. Seward, Valgrind: a program supervision framework, Electronic Notes in Theoretical Computer Science 89 (2) (2003) 44–66.
[45] nVidia Corp., nVidia CUDA Programming Guide, nVidia Corp., 2009.
[46] S. Oh, T. Kim, J. Cho, E. Bozorgzadeh, Speculative loop-pipelining in binary translation for hardware acceleration, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27 (3) (2008) 409–422.
[47] K. Ostrowski, K. Birman, D. Dolev, C. Sakoda, Implementing reliable event streams in large systems via distributed data flows and recursive delegation, in: Proc. of the 3rd ACM International Conference on Distributed Event-Based Systems, 2009, pp. 15:1–15:14.
[48] S. Passas, K. Magoutis, A. Bilas, Towards 100 Gbit/s Ethernet: multicore-based parallel communication protocol design, in: Proc. of the 23rd ACM International Conference on Supercomputing, 2009, pp. 214–224.
[49] D. Pearce, P. Kelly, T. Field, U. Harder, GILK: a dynamic instrumentation tool for the Linux kernel, in: Proc. of the 12th International Conference on Computer Performance Evaluation, Modelling Techniques and Tools, 2002, pp. 220–226.
[50] R. Rockafellar, Convex Analysis, Princeton University Press, 1996.
[51] S. Sadjadi, P. McKinley, Transparent autonomization in CORBA, Computer Networks 53 (10) (2009) 1570–1586.
[52] S. Sahni, T. Gonzalez, P-complete approximation problems, Journal of the ACM 23 (3) (1976) 555–565.
[53] S. Saiedian, G. Wishnie, A complex event routing infrastructure for distributed systems, Journal of Parallel and Distributed Computing 73 (3) (2012) 450–461.
[54] V. Shestak, E. Chong, H. Siegel, A. Maciejewski, L. Benmohamed, I.-J. Wang, R. Daley, A hybrid Branch-and-Bound and evolutionary approach for allocating strings of applications to heterogeneous distributed computing systems, Journal of Parallel and Distributed Computing 68 (4) (2008) 410–426.
[55] L. Shi, H. Chen, J. Sun, vCUDA: GPU accelerated high performance computing in virtual machines, in: Proc. of the IEEE International Symposium on Parallel Distributed Processing, 2009, pp. 1–11.
[56] M. Showerman, J. Enos, A. Pant, V. Kindratenko, C. Steffen, R. Pennington, W. Hwu, QP: a heterogeneous multi-accelerator cluster, in: Proc. of the 10th LCI International Conference on High-Performance Clustered Computing, 2009.
[57] H. Siegel, S. Ali, Techniques for mapping tasks to machines in heterogeneous computing systems, Journal of Systems Architecture 46 (8) (2000) 627–639.
[58] H. Siegel, H. Dietz, J. Antonio, Software support for heterogeneous computing, ACM Computing Surveys 28 (1) (1996) 237–239.
[59] G. Stitt, F. Vahid, A decompilation approach to partitioning software for microprocessor/FPGA platforms, in: Proc. of the IEEE Conference on Design, Automation and Test in Europe, vol. 3, 2005, pp. 396–397.
[60] K. Sivaramakrishnan, K. Nagaraj, L. Ziarek, P. Eugster, Efficient session type guided distributed interaction, in: Coordination Models and Languages, in: Lecture Notes in Computer Science, vol. 6116, Springer, Berlin, Heidelberg, 2010, pp. 152–167.
[61] Y. Song, M. Aguilera, R. Kotla, D. Malkhi, RPC Chains: efficient client–server communication in geodistributed systems, in: Proc. of the 6th USENIX Symposium on Networked Systems Design and Implementation, 2009, pp. 277–290.
[62] S. Spacey, 3S: program instrumentation and characterisation framework, Tech. Rep., Imperial College London, 2008.
[63] S. Spacey, Computational partitioning for heterogeneous systems, Ph.D. Thesis, Imperial College London, 2009.
[64] S. Spacey, Concise CPLEX, Tech. Rep., Imperial College London, 2009.
[65] S. Spacey, W. Luk, P. Kelly, D. Kuhn, Rapid design-space visualisation through hardware/software partitioning, in: Proc. of the IEEE Southern Programmable Logic Conference, 2009, pp. 159–164.
[66] S. Spacey, W. Luk, D. Kuhn, P. Kelly, Parallel partitioning for distributed systems using sequential assignment, Awaiting Publication, 2012.
[67] S. Spacey, W. Wiesemann, D. Kuhn, W. Luk, Robust software partitioning with multiple instantiation, INFORMS Journal on Computing 24 (3) (2012) 500–515. http://dx.doi.org/10.1287/ijoc.1110.0467.
[68] W. Stevens, TCP/IP Illustrated (Vol. 1): The Protocols, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1993.
[69] G. Stitt, F. Vahid, Hardware/software partitioning of software binaries, in: Proc. of the 2002 IEEE/ACM International Conference on Computer-Aided Design, 2002, pp. 164–170.
[70] Sun Microsystems Inc., RPC: Remote Procedure Call protocol specification, 1988.
[71] S. Swanson, K. Michelson, A. Schwerin, M. Oskin, Dataflow: the road less complex, in: Workshop on Complexity-Effective Design, WCED, 2003.
[72] K. Tsoi, W. Luk, Axel: a heterogeneous cluster with FPGAs and GPUs, in: Proc. of the 18th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2010, pp. 115–124.
[73] D. Turner, X. Chen, Protocol-dependent message-passing performance on Linux clusters, in: Proc. of the IEEE International Conference on Cluster Computing, 2002, pp. 187–194.

[74] R. van Nieuwpoort, J. Maassen, G. Wrzesińska, R. Hofman, C. Jacobs, T. Kielmann, H. Bal, Ibis: a flexible and efficient Java-based grid programming environment, Concurrency & Computation: Practice & Experience 17 (7–8) (2005) 1079–1107.
[75] T. von Eicken, D. Culler, S. Goldstein, K. Schauser, Active Messages: a mechanism for integrated communication and computation, in: Proc. of the 19th International Symposium on Computer Architecture, 1992, pp. 256–266.
[76] T. Wiangtong, P. Cheung, W. Luk, Hardware/software codesign: a systematic approach targeting data-intensive applications, IEEE Signal Processing Magazine 22 (3) (2005) 14–22.
[77] J. Williams, C. Massie, A. George, J. Richardson, K. Gosrani, H. Lam, Characterization of fixed and reconfigurable multi-core devices for application acceleration, ACM Transactions on Reconfigurable Technology and Systems 3 (4) (2010) 19:1–19:29.
[78] R. Wilson, R. French, C. Wilson, S. Amarasinghe, J. Anderson, S. Tjiang, S. Liao, C. Tseng, M. Hall, M. Lam, J. Hennessy, SUIF: an infrastructure for research on parallelizing and optimizing compilers, SIGPLAN Notices 29 (12) (1994) 31–37.
[79] Xilinx Inc., Virtex-5 Family Overview: Product Specification, Xilinx Inc., 2009.
[80] D. Zhu, X. Qi, D. Mossé, R. Melhem, An optimal boundary fair scheduling algorithm for multiprocessor real-time systems, Journal of Parallel and Distributed Computing 71 (10) (2011) 1411–1425.
[81] G. Zimmer, A. Chien, The impact of inexpensive communication on a commercial RPC system, Tech. Rep., University of Illinois at Urbana-Champaign, 1998.
[82] J. Ziv, A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory 23 (3) (1977) 337–343.

Simon Spacey received a B.Sc. in Physics from York University, an M.Sc. in Quantum Devices from Lancaster University, a D.CSC. in Computer Science from Cambridge University and a Ph.D. in Computer Science from Imperial College London. He is currently the head of research at SaSe Business Solutions. His research interests include distributed and high-performance computing systems, software characterisation, system modelling and formal optimisation.

Wayne Luk received the M.A., M.Sc., and D.Phil. degrees from the University of Oxford, Oxford, U.K., all in engineering and computing science. He is currently a Professor of computer engineering with the Department of Computing, Imperial College London, London, U.K., where he also leads the Custom Computing Group. His research interests include theory and practice of customising hardware and software for specific application domains, such as graphics and image processing, multimedia, and communications.

Paul H.J. Kelly received his Ph.D. in Computer Science from London University. He has chaired the Software Track at the International Parallel and Distributed Processing Symposium (IPDPS), and has served on the Program Committees of the International Conference on Supercomputing, Compiler Construction, Euro-Par, and many other conferences and workshops. He currently heads the Software Performance Optimisation group at Imperial College London.

Daniel Kuhn received an M.Sc. in Theoretical Physics from ETH Zurich and a Ph.D. in Operations Research and Computational Finance from the University of St. Gallen/HSG. He has co-chaired the International Conference on Computational Management Sciences, is a reviewer for several mathematical journals and is a senior lecturer in the Computational Finance and Operations Research division of the Computer Science Department at Imperial College London.