Computers and Electrical Engineering 37 (2011) 958–972
Design and evaluation of low latency interconnection networks for real-time many-core embedded systems

Fadi N. Sibai
R&D Center, Saudi Aramco, Dhahran 31311, Saudi Arabia

Article history: Received 21 September 2010; received in revised form 26 August 2011; accepted 26 August 2011; available online 29 September 2011.

Abstract

On-chip interconnection networks (OCINs) in many-core embedded systems consume large portions of the chip's area, cost, delay, and power. In addition to competing in area, cost, and power, OCINs must feature low diameters to meet real-time deadlines. To achieve these goals, designing low-latency networks and sharing network resources are essential. We explore 13 OCINs – some new, such as the Enhanced Kite and the Spidergon–Donut networks – in 64-core systems with various topologies and properties. We also derive and compare their worst-case delays, longest and average distances, critical link lengths, bisection bandwidths, total link and router costs, and total arbiter powers. Results indicate that the Enhanced Kite, Kite, Spidergon–Donut, and Spidergon–Donut4 stand out with the best worst-case delays, with the Spidergon–Donut4 additionally featuring lower link and router costs, lower total arbiter power, and better 2D implementation and scalability.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

To provide higher performance per Watt and a variety of functionality, embedded processors are taking the form of many-core multiprocessor systems-on-chip (MPSoCs) [12]. The Intel 80-core Teraflops prototype chip [14] demonstrated that many-core computing is on the verge of becoming a reality. The many-core embedded systems of the future will have several intellectual property (IP) cores, some identical to provide additional computation horsepower and others dedicated to specific tasks, with functionalities ranging from sophisticated codecs to encrypt/decrypt engines. Communication between the many integrated IP cores takes place over a network-on-chip (NoC) or on-chip interconnection network (OCIN). As the shared bus medium is incapable of scaling the system's performance with an increased number of cores, owing to fundamental operational and electrical limits, OCINs offer MPSoCs better communication scalability. In fact, OCINs can provide a system performance which scales with the number of cores while keeping the power consumption, cost, and area within acceptable limits. Commercially, NoCs are starting to emerge in products from Arteris and Sonics.

Today, multi-core processors in personal computers adopt point-to-point interconnects. Rings have been used in the DEC Alpha EV8 [27] and the Sony–Toshiba–IBM Cell [26]. Fat trees [11] have been used by SPIN [16] but have the disadvantage that traffic is concentrated at the root; duplicating the root to solve this problem is too costly. ST Micro introduced the Spidergon (SG) network [6]. Spidergon's average distance was found to be between the average distances of the ring and 2D mesh topologies [4]. Meshes have been used by Tilera's TILE64 and Intel's Teraflops chips. Meshes are among designers' favorite choices, but the diameter (i.e. maximum latency) of the mesh is too high. Tori have better diameters than 2D meshes but, like meshes, have high link costs. Precisely, for an N × N 2D mesh, the diameter is 2(N − 1), growing linearly with N.



Although 64-core chips are already available, chips with tens of cores will soon be common in embedded systems. Real-time applications usually produce bursty communication traffic. In real-time systems, QoS-based transport layer rules isolate the real-time traffic (aka guaranteed-service traffic) from the best-effort traffic. Flow control in real-time systems is based on message priorities, necessitating preemption in input buffers and the dropping of lower-priority packets (if tolerable by the application). Moreover, the use of OCINs in real-time systems must meet the execution predictability and communication requirements of these systems. This implies that the OCIN must have a low diameter to meet the time deadlines of real-time applications, in view of the relatively long core inter-distances in many-core chips. Dally [7] also stressed the need to design low-diameter OCINs with small area and power requirements.

In this paper, we explore low-diameter OCIN designs for real-time 64-core embedded systems. Some of these OCINs (e.g. Spidergon–Donut, Spidergon–Cylinder, and Enhanced Kite) are new. Furthermore, we compare these OCINs in terms of worst-case and average distance, longest delay, bisection bandwidth, total link and router costs, arbiter power, ease of mapping onto the 2D chip space, and scalability.

The paper is organized as follows. In Section 2, we review OCINs in multi- and many-cores. In Section 3, we present new and review existing on-chip interconnection networks for 64-core MPSoCs. Section 4 discusses concentrated versions of the prominent OCINs of Section 3. As the Spidergon–Donut (SD) features a low diameter compared to the 2D mesh and is one prominent low-diameter OCIN, some properties of the SD OCIN are derived in Section 5, and a deadlock-free routing algorithm for the SD is introduced in Section 6. Section 7 analyzes and compares all the considered networks. We conclude the paper in Section 8.

2. Background

Several interconnection networks [2,3,8,10] were proposed in the past. These networks fall under various types, including shared media (e.g. the bus), direct networks (e.g. the mesh), indirect networks such as multi-stage interconnection networks (MINs), and hybrid networks [5,15] mixing two or more topologies. A scalable OCIN cannot be a shared bus, which does not scale with an increasing number of IP cores. Although rings have shown their effectiveness in carrying collective communication traffic (broadcasts, inter-processor interrupts, etc.), as on the STI Cell, they also do not scale well to a large number of cores. Indirect networks such as MINs suffer from router box delays paid on each communication regardless of the location of the source and destination cores. Moreover, blocking MINs are not suitable for real-time applications. Non-blocking MINs have higher costs but still suffer from the O(log N) latency paid on each communication, N being the number of nodes. Hybrid networks are promising in bringing the cost and power down, but require common protocols to ensure end-to-end reliable message deliveries and are untested in meeting the real-time requirements of embedded systems. If not carefully designed with low diameters and average distances, these hybrid networks will not be suitable for real-time MPSoCs.

A hierarchical NoC was proposed with small local 2D meshes, each interconnecting 4 cores [5]. Unidirectional rings reduce the latency for global traffic and connect the meshes together. One ring is used for global traffic while another is used for local traffic.
Such hybrid networks usually lack uniformity. Although the traffic through the global interconnect increases as a quadratic function of the number of cores, hierarchical rings [19] were shown to achieve good performance scalability. The hyper-ring (HR) [20] is a hierarchical and scalable network which interconnects rings of cores in the first (horizontal) dimension via 2 rings in the second (vertical) dimension. While the HR cost is lower than the costs of 2D meshes and tori, the HR's diameter (maximum hop count) was shown to be superior to the mesh's, and comparable to the torus's, for about 100 cores. Both hybrid [5] and hierarchical [20] networks employ rings for global traffic; for local traffic, [5] employs meshes while [20] employs rings.

While the diameters of the 2D mesh and ring are acceptable for small N, OCINs can be designed with lower worst-case message latencies for large N. In addition to the diameter, power is critical. The power characteristics of OCINs are governed by the following laws [1,13]. With optimally placed and sized repeaters, the wire delay is a linear function of its distance. The repeated wire's power consumption is the sum of the leakage power and the dynamic power, both linear in the wire distance. The dynamic power is additionally linear in the bandwidth. The router's power consumption is linear in the total bandwidth of its inputs [17]. Router power is dominant over wire propagation and buffer powers, motivating topologies with fewer hops and longer wires [7]. Table 1 summarizes the wire and router power and delay recommendations.

Given these recommendations, it is required that the on-chip network bring the diameter, power, and cost requirements below those of the popular 2D mesh and below those of multi-stage interconnection networks [8,10] such as the Cube and Butterfly, whose diameters are O(log N). One solution can be reached by careful network topology design. Another (concurrent) solution is to use bristling, i.e. router concentration, which clusters cores and allows them to share network resources to reduce the areas and costs of the network. It should be noted that circuit switching has also been recommended to reduce the size of routers (and therefore the total power and cost) by eliminating buffers.


Table 1
Power and delay design guidelines in the submicron age.

- Wire delay. Characteristics: linear to distance; exceeds logic delay. Recommendation: favors short wires (related to network topology and floor planning).
- Wire power. Characteristics: linear to distance; one component is linear to bandwidth; less critical than router power. Recommendation: favors short (related to network topology and floor planning) and lower-bandwidth (fewer bits per link and/or lower frequency) repeated wires.
- Router delay. Characteristics: simple logic delay is lower than wire delay; buffers add too much delay. Recommendation: favors circuit switching and bufferless routers; favors bristling (which reduces the number of routers and the total end-to-end delay); favors simple routing (topology-related).
- Router power. Characteristics: exceeds wire and buffer powers. Recommendation: favors fewer routers and longer wires (related to topology and floor planning, bristling).
3. 64-Core OCINs for embedded systems

We consider 13 interconnection network designs for 64-core MPSoCs; some are new (e.g. the Enhanced Kite, Spidergon–Cylinder, Spidergon–Donut, bristled Kite4, and bristled Spidergon–Donut4) and some are borrowed from the literature. Our main design and selection criteria are reducing the diameter, average distance, router degree, and link cost. These OCINs can scale to connect higher numbers of cores. We derive their main characteristics below and compare them.

3.1. Mesh

Fig. 1 displays the 64-core mesh, which we also refer to as OCIN I. In Fig. 1, the circles represent the IP cores and the rectangular boxes represent the routers. Each core connects directly to a router to access the mesh network. These routers relay the cores' own traffic or in-transit traffic to other core destinations. The diameter of the Mesh OCIN is the distance between diagonally opposed corners (A and B). In this paper, the link between a core and its nearest router counts as one hop. Thus the longest distance, between the top left and bottom right cores, is 1 (source core to its router) + 7 (horizontal inter-router hops) + 7 (vertical inter-router hops) + 1 (destination router to its core) = 16 hops.

Reducing the diameter and average delay can easily be achieved by providing more routers, links, router ports (i.e. increasing the router degree), and other resources. However, the area, cost, and power budgets favor the OCIN designs which minimize area, cost, and power and ease the routing, while simultaneously meeting the worst-case and average delay requirements of real-time traffic. Therefore, the other 64-core OCINs which we consider and evaluate are: (1) the 2D Torus (OCIN II); (2) the Kite network (OCIN III); (3) Warped Mesh 1 (OCIN IV); (4) Warped Mesh 2 (OCIN V); (5) the Lantern network (OCIN VI); (6) the Spidergon–Cylinder network (OCIN VII); (7) the Spidergon–Donut network (OCIN VIII); (8) the Enhanced Kite network (OCIN IX); (9) the 2D Torus2 (OCIN X); (10) the bristled Spidergon–Donut4 (OCIN XI); (11) the bristled Kite4 (OCIN XII); and (12) the bristled Torus4 (OCIN XIII). These OCINs are covered in the next subsections.

3.2. Torus

Fig. 2 shows the 8 × 8 Torus OCIN, which adds wrap-around links to the Mesh network and reduces the diameter to 1 + 4 + 4 + 1 = 10 hops. The average distance is also improved compared to the mesh.
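To make the hop-count conventions concrete, the following Python sketch (ours, not part of the paper; all function and variable names are our own) builds the 8 × 8 mesh and torus router graphs, adds the core-router hop at each end, and reproduces the diameters and average distances that Table 3 reports later (16 and 9.11 hops for the Mesh, 10 and 6.06 for the Torus).

```python
from collections import deque

def grid_graph(n, wrap=False):
    """Adjacency lists of an n x n router grid (a torus when wrap=True)."""
    adj = {(x, y): [] for x in range(n) for y in range(n)}
    for x in range(n):
        for y in range(n):
            for dx, dy in ((1, 0), (0, 1)):
                if wrap or (x + dx < n and y + dy < n):
                    a, b = (x, y), ((x + dx) % n, (y + dy) % n)
                    adj[a].append(b)
                    adj[b].append(a)
    return adj

def hops_from(adj, src):
    """BFS router-hop counts from src to every router."""
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

for name, wrap in (("Mesh", False), ("Torus", True)):
    d = hops_from(grid_graph(8, wrap), (0, 0))   # top left core as source
    diameter = max(d.values()) + 2               # +2: core-router links at both ends
    average = sum(m + 2 for m in d.values() if m > 0) / 63
    print(name, diameter, round(average, 2))     # Mesh 16 9.11, Torus 10 6.06
```

The same breadth-first search applies unchanged to the other topologies once their adjacency lists are written down.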

Fig. 1. OCIN I: Mesh.


Fig. 2. OCIN II: Torus.

Fig. 3. OCIN III: Kite.

3.3. Kite

Fig. 3 shows the Kite OCIN. In Fig. 3 and subsequent figures, the circular shapes each contain an IP core and a router, and the links between the circles represent router-to-router links. This OCIN is named as such because it resembles a kite. The Kite network is built by forming four 16-core crossed rings, where the 2 cross links [9] halve the diameter of a 16-core ring from 8 down to 4. Pairs of adjacent crossed rings are then joined together by links connecting diagonally opposed corners. The first (top) crossed ring connects to the third and fourth (bottom) crossed rings via links originating from its second and third routers from its top left corner, respectively. The second (from top) crossed ring connects to the fourth and third crossed rings below it via links originating from its second and third cores from its top left corner, respectively. The twin links (symmetrically opposed, at the bottom of each 16-core ring) follow the same order but start from the bottom right to the left. The diameter of the Kite network falls on the path between the nodes labeled A and B. As the shortest distance between A and B is 9 router-to-router hops, the 64-core Kite OCIN has a diameter of 1 + 9 + 1 = 11. This is 1 hop higher than a 64-node torus but much lower than the 2D mesh diameter.

3.4. Warped meshes

Figs. 4 and 5 display two versions of warped meshes [unknown author]. Warped meshes are built by cutting an 8 × 8 mesh into four smaller 4 × 4 meshes, which are then joined together with the bold interconnect links of Figs. 4 and 5. Many other mesh-joining alternatives are also possible. The differences between the two warped meshes are small and involve only 4 links. In both warped meshes, the diameter is reflected by more than one possible path; one such path runs from the top left core to the bottom right one. As the warping in the middle shrinks the longest distances between the 4 × 4 meshes, the diameter of each warped mesh is 11 hops (down from the Mesh's 16).


Fig. 4. OCIN IV: Warped Mesh 1.

Fig. 5. OCIN V: Warped Mesh 2.

Fig. 6. OCIN VI: Lantern.


3.5. Lantern

Fig. 6 presents the new 64-core Lantern network. This OCIN also cuts an 8 × 8 mesh into four 4 × 4 meshes and joins them with links creating the shape of a lantern. These 4 × 4 meshes are not complete meshes, as some links in the middle are removed to reduce the link cost. Those reduced meshes can be seen as four 4-core rings joined in ring fashion, resembling the cube-connected cycles (CCC) network [18]. Whereas the CCC topology joins rings via the cube function, the Lantern OCIN joins rings together into a ring of rings. Initially, two 4 × 4 meshes are joined together via their corner nodes. Another pair of 4 × 4 meshes is joined in a similar fashion. The two pairs are then joined together via their corner nodes, as seen in Fig. 6. The diameter is represented by the shortest path between cores A and B, or 1 + 8 (router-to-router hops) + 1 = 10 hops. Another representation of the diameter is the shortest acyclic path connecting the top left and bottom right nodes of Fig. 6.

3.6. Spidergon–Cylinder (SC)

The 16-node Spidergon [6] network connects 16 cores in a ring fashion and then adds a link between all pairs of opposite cores. Fig. 7a introduces the new 64-core Spidergon–Cylinder (SC) OCIN. This OCIN is created by extending the Spidergon in the z direction, yielding a cylinder with four 16-core Spidergons stacked in a pile. Cores falling at the same z position are linked together. The longest distance is between one core in the top Spidergon slice and the opposite core in the bottom Spidergon slice, or 1 + 4 (inter-router hops in the x-plane) + 3 (inter-router hops in the y-plane) + 1 = 9 hops. The bisection bandwidth of an N-core Spidergon–Cylinder is impressive (= N). Fig. 7b shows the mapping of the Spidergon–Cylinder onto the 2D chip space.

3.7. Spidergon–Donut (SD)

The diameter of the SC OCIN can be further reduced by joining the two ends of the cylinder with new links in a donut fashion, thereby creating the Spidergon–Donut (SD) OCIN of Fig. 8a. By doing so, the diameter falls to 1 + 4 + 2 + 1 = 8 hops, outperforming the Torus and 2D Mesh. As importantly, the node (router) degree of the Spidergon–Donut becomes fixed at 6 (including router-core links), while on the SC it varies between 5 and 6. A uniform router degree simplifies the design (e.g. it allows copy-and-paste) and is generally a desirable property to have. Moreover, by joining both ends of an N-node SC network to form the Spidergon–Donut, the bisection bandwidth is doubled from N to 2N. Fig. 8b shows the mapping of the SD onto the 2D chip space. Folding techniques can be used to reduce the length of the wrap-around links.

3.8. Enhanced Kite (e-Kite)

The Kite OCIN of Fig. 3 has a variable router degree, with some routers having 3 ports and others 4 ports (including the link between the router and its associated core). Some of the 3-port routers can be replaced by 4-port routers which can be linked together to further reduce the diameter. Fig. 9 introduces an enhanced version of the Kite network (which we call e-Kite) with all routers uniformly supporting 4 ports. The newly added links, pictured in grey, join each 16-core crossed ring with the remaining three crossed rings. However, this time the joining of the rings starts from the top crossed ring, from the right side towards the left. The symmetrically opposed twin links originate from the bottom crossed ring, from the left side towards the right.

Fig. 7. OCIN VII: Spidergon–Cylinder (a) and its implementation (b).


Fig. 8. OCIN VIII: Spidergon–Donut (a) and its folded implementation (b).

Fig. 9. OCIN IX: Enhanced Kite (e-Kite).

These additional links reduce the diameter to 8 (one such longest path is between the top left and bottom right cores), also outperforming the Torus (10) and the 2D Mesh (16).

4. Bristling

In this section, we focus on the prominent and most promising low-diameter OCINs of Section 3 and consider versions of these OCINs with concentrated routers. Fig. 10 shows the 2× concentrated Torus2 OCIN, which consists of a 32-node 2D Torus where each node is a 2-core cluster connected to the torus network via a shared router. The Torus2 employs bristling [7,21] and higher-degree routers, thereby reducing the total number of routers in the network. The net effect is that the OCIN is reduced from a 64-router network to a 32-router network. The new bristled routers are larger than the routers in the original Torus, but the inter-router hops are drastically reduced. The drawback of bristling is that the sharing of the router by the 2 cores raises the number of router ports (aka router degree, in relation to the Mesh OCIN of Fig. 1) to 6 bidirectional ports (including the links to the associated cores). Additionally, reducing the number of routers reduces the network's reliability in case of a complete router failure.


Fig. 10. OCIN X: Torus2.

Fig. 11. OCIN XI: Bristled Spidergon–Donut4 with 4 nodes sharing a router.

In Fig. 10, the numbers of columns and rows are uneven in order to keep the number of nodes (routers) fixed at 32. The total number of links is 128, while the diameter is 8.

Applying router bristling to the other OCINs also reduces their diameters. Fig. 11 displays a bristled Spidergon–Donut4 network composed of 4 Spidergon instances, where each Spidergon instance is composed of 4 super nodes. Each super node consists of a router to which 4 IP cores are linked. The longest distance is reduced to 5 hops (from the SD's 8 hops), and the number of routers is reduced to 16 (from the SD's 64 routers).

Similarly, the 64-node Kite routers can be concentrated using the same approach, leading to the 4-way bristled Kite4 OCIN of Fig. 12. In the first dimension, 4 crossed rings are built with 4 super nodes of 4 cores each. In the second dimension, 4 such crossed rings are interconnected to form a Kite4 network, as shown in Fig. 12. The resulting Kite instances in the first dimension interconnect the 4 super nodes in a fully connected fashion, but the low number of super nodes cannot take advantage of the Enhanced Kite's rich connections. Consequently, the diameter of the Kite4, pictured in Fig. 12 as the shortest path between nodes A and B, is 7, a 22% improvement over the Kite OCIN. A concentrated e-Kite is not considered due to implementation difficulties.

Fig. 13 shows a 4× concentrated Torus4 network (see the sketch below). For the Torus4 OCIN, the diameter is represented by the distance between the top left corner core and the core located in the row above the bottom row, in the 2nd column from the right. Thus the Torus4 diameter is 1 (source core to its router) + 4 (inter-router hops) + 1 (destination router to its core) = 6 hops. Each of the above networks has its strengths and weaknesses, which we expose in Section 7.
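To make the effect of 4-way bristling concrete, here is a minimal Python sketch (ours, with assumed names) modeling the Torus4 as a 4 × 4 router torus with 4 cores concentrated on each router. It reproduces the Table 3 entries for OCIN XIII: 96 total links (32 router-router links plus 64 core-router links, since the link cost counts core links too) and a diameter of 1 + 4 + 1 = 6 hops.

```python
from collections import deque
from itertools import product

R = 4                                            # 4 x 4 routers, 4 cores each = 64 cores
adj = {rc: set() for rc in product(range(R), repeat=2)}
for x, y in product(range(R), repeat=2):
    for nbr in (((x + 1) % R, y), (x, (y + 1) % R)):
        adj[(x, y)].add(nbr)                     # torus wrap-around links included
        adj[nbr].add((x, y))

def eccentricity(adj, src):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

router_links = sum(len(v) for v in adj.values()) // 2    # 32 router-router links
core_links = 4 * R * R                                   # 64 core-router links
diameter = 1 + eccentricity(adj, (0, 0)) + 1             # 1 + 4 + 1 = 6 hops
print(router_links + core_links, diameter)               # 96 6
```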


Fig. 12. OCIN XII: Bristled Kite4 with 4 nodes sharing a router.

Fig. 13. OCIN XIII: Bristled Torus 4 network.

5. The Spidergon–Donut OCIN

Given that the Spidergon–Donut4 OCIN, which has the best diameter, is constructed from the Spidergon–Donut OCIN, we briefly discuss in this section the construction of the Spidergon–Donut OCIN and some of its properties. The Spidergon network SG(K) with K = 8 nodes is displayed in Fig. 14a, with nodes numbered in counter-clockwise order. Fig. 14b displays a Spidergon–Donut SD(2, 8) OCIN with 2 SG(8) instances interconnected in a donut fashion. The number of nodes in a Spidergon–Donut SD(N, K) OCIN, where K is the number of cores in the first dimension and N is the number of SG(K) instances in the second dimension, is given by

NumNodes(N, K) = N × K    (1)

Fig. 14. Spidergon–Donut (a) 1D SG(8) (b) 2D SD(2, 8).

Table 2
Comparison of Torus, Spidergon, Spidergon–Donut, and Spidergon–Donut4 networks.

Property              N × N Torus  SG(N²)      SD(N, N)    SD4(N/2, N/2)  SD4(N, N/4)
# Cores               N²           N²          N²          N²             N²
# Links               2N²          1.5N²       2.5N²       13N²/8         13N²/8
Diameter              N + 2        N²/4 + 2    0.75N + 2   (3/8)N + 2     (9/16)N + 2
Bisection bandwidth   2N           N²/4        2N          N              N/2
Node degree           4            3           5           9              9

The number of links, or link cost, in an SD(N, K) is given by

NumLinks(N, K) = N × (K + K/2) + K × N = 2.5 × K × N    (2)

as there are K ring links per SG instance or dimension, K/2 cross links per SG instance, and K × N inter-dimension links. The diameter of the SD(N, K) is given by

Diameter(N, K) = N/2 + K/4    (3)

as the longest distance within an SG dimension is K/4, while the longest inter-dimension distance is N/2. The bisection bandwidth of the SD(N, K) is given by

BB(N, K) = 2 × K    (4)

as there are two sets of K links which are cut when the donut is cut in half.

Table 2 compares the network attributes of the 4-way bristled Spidergon–Donut4 OCINs SD4(N/2, N/2) and SD4(N, N/4), the Spidergon SG(N²), and the N × N Torus. All of these OCINs have N² nodes. The SD4(N/2, N/2) has N/2 × N/2 = N²/4 supernodes, or 4 × N²/4 = N² nodes. Similarly, the SD4(N, N/4) OCIN has N × N/4 = N²/4 supernodes, or 4 × N²/4 = N² nodes. The SD4(N, N/4) has double the number of SG instances of the SD4(N/2, N/2) OCIN, with half the number of IP cores per dimension.

The diameters of the OCINs of Table 2 are plotted on a logarithmic scale in Fig. 15 vs. N, the square root of the total number of nodes (as all OCINs have N² nodes), for systems with 8 cores up to 2²² ≈ 4M cores. Fig. 15 reveals that the 4× bristled Spidergon–Donut SD4(N/2, N/2) OCIN has the lowest diameter, ahead of its SD4(N, N/4) counterpart. This indicates that designing the network with too many SG instances (i.e. large N in the second dimension) is not desirable. The Spidergon SG(N²) has by far the largest diameter of the networks in Table 2. Thus the Spidergon is not appealing for real-time many-core (≥64 cores) MPSoCs in comparison to its two-dimensional SD4 and SD counterparts.

Fig. 15. Diameter (logarithmic) vs. √(number of nodes) (= N).
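To cross-check Eqs. (1)–(4), the sketch below (ours, with our own names) builds the SD(N, K) router graph explicitly — per SG instance, a K-node ring plus K/2 cross links, plus N-node rings across instances — and measures the link count, the router-hop diameter, and the number of links cut by splitting the donut into equal halves. For the 64-core SD(4, 16) of Fig. 8 it returns 160 links, a router-hop diameter of 6 (8 hops with the two core-router links), and a bisection of 32, matching the formulas and Table 3.

```python
from collections import deque
from itertools import product

def sd_graph(N, K):
    """Router adjacency of the Spidergon-Donut SD(N, K)."""
    adj = {nk: set() for nk in product(range(N), range(K))}
    for n, k in product(range(N), range(K)):
        for nbr in ((n, (k + 1) % K),          # SG ring link
                    (n, (k + K // 2) % K),     # SG cross link
                    ((n + 1) % N, k)):         # inter-dimension ring link
            adj[(n, k)].add(nbr)
            adj[nbr].add((n, k))
    return adj

def eccentricity(adj, src):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

N, K = 4, 16
adj = sd_graph(N, K)
links = sum(len(v) for v in adj.values()) // 2                     # Eq. (2): 2.5*K*N = 160
diameter = eccentricity(adj, (0, 0))                               # Eq. (3): N/2 + K/4 = 6
half = {u for u in adj if u[0] < N // 2}                           # split the donut in two
bisection = sum(1 for u in half for v in adj[u] if v not in half)  # Eq. (4): 2*K = 32
print(len(adj), links, diameter, bisection)                        # 64 160 6 32
```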


For 64 cores, the SD4(4, 4) results in the best diameter, scaling well to a very large number of cores. Finally, the implementation of the SD4 (and SD) is feasible (see Fig. 8b), as it interconnects a number of SGs (already a commercial product) with multiple rings.

6. Routing in the SD

In this section, we describe a one-to-one routing algorithm and a broadcast algorithm for the SD(N1, N0) network. This algorithm can easily be extended to the SD4. The one-to-one routing algorithm is displayed in Fig. 16. S (= S1S0) refers to the label of the current node, O (= O1O0) to the label of the original source node, D (= D1D0) to the label of the destination node, and M to the message. We assume that N1 and N0 are multiples of 4.

To avoid deadlocks and break cycles in the channel dependency graph of the SD, two virtual channels VC1 and VC2 are used: channel VC1 is used if D0 > O0 when routing in the first dimension or if D1 > O1 when routing in the second dimension, and channel VC2 is used otherwise [22,23]. This condition guarantees that the routing algorithms are deadlock free, as they funnel all messages destined to higher-labeled cores (nodes) via VC1 and all messages destined to lower-labeled cores via VC2. Thus messages flowing in opposite directions do not use the same virtual channel and never collide nor use the network resources allocated to another virtual channel. However, the use of virtual channels increases the router area due to the additional resources (e.g. buffers, and larger crossbar and arbitration areas) allocated to virtual channels. The above routing algorithm is simple to implement and enhances the SD network's appeal and feasibility.

7. Network evaluation

In this section, we derive and analyze the longest and average distances, worst-case delays, node degree, bisection bandwidth, total link cost, maximum router frequency, and total arbiter power for all 13 64-core OCINs. We then compare these networks based on these attributes. We also rate the OCIN scalability and ease of mapping onto the 2D chip space. Table 3 displays the network attributes of OCINs I–XIII. These attributes are discussed in the next subsections.

7.1. Longest and average distances

For simplicity, in order to calculate the average distances (in hops), we select the top left core as the source of the communication. For each distance d (in hops, where 1 ≤ d ≤ diameter) separating the source core from n_d destination cores, n_d is determined and then multiplied by d. The average distance is then given by

Average distance = (Σ_{d=1}^{diameter} d × n_d) / 63    (5)

where diameter is the maximum distance d between the source core and any other core, and 63 is the total number of cores minus 1 (the source core).
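As a worked check of Eq. (5) (ours, not in the paper), take the 8 × 8 Mesh with the top left core as source. A destination whose router sits m router hops away lies at distance d = m + 2 once the two core-router links are counted, and the number of routers at grid distance m from the corner is m + 1 for 1 ≤ m ≤ 7 and 15 − m for 8 ≤ m ≤ 14, so

```latex
\bar{d} \;=\; \frac{1}{63}\sum_{d} d\,n_d
      \;=\; \frac{1}{63}\Bigl[\sum_{m=1}^{7}(m+2)(m+1) \;+\; \sum_{m=8}^{14}(m+2)(15-m)\Bigr]
      \;=\; \frac{238 + 336}{63} \;=\; \frac{574}{63} \;\approx\; 9.11 \text{ hops},
```

matching the Mesh entries in Table 3 (the 16-hop diameter corresponds to m = 14).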

Fig. 17 plots the longest and average distances (in hops) for all 13 OCINs. Compared to the 64-node 2D Mesh, the diameters of the Kite and Spidergon–Cylinder OCINs are reduced by 44%. The diameters of the SD, e-Kite, and Torus2 are 50% shorter. The diameters of the Kite4, Torus4, and SD4 are reduced by 56%, 63%, and 69%, respectively. Compared to the 64-node 2D Mesh, the average distance (in hops) improves by 35% for the Enhanced Kite, 40% for the SD, 45% for the Torus2, 53% for the Kite4, 56% for the Torus4, and 59% for the Spidergon–Donut4.

The SD and e-Kite have the best worst-case and average distances among unconcentrated OCINs. The e-Kite achieves this top rank with a 40% lower link cost and a 33% lower router cost than the SD. Thus from a cost perspective, and for 64 nodes, the e-Kite comes out ahead of the SD. The Warped Meshes and the Lantern OCIN greatly improve on the diameter and average delay of the Mesh but are not competitive with the SD and e-Kite OCINs.

When bristled OCINs are also considered, the SD4 ranks first in diameter and average distance among all 13 OCINs, although its link cost, router cost, and router degree are very close to the Torus4's. Its diameter is 17% shorter than the second-ranked Torus4's. Given its circular shape and modularity, it should be easily implementable. The Kite4 ranks third (first) in average (longest) distance. The Kite OCIN III has a better average delay than OCINs IV–VI at a much lower total link cost, and its enhanced version (OCIN IX) improves the diameter and average delay further at the expense of a small increase in link cost. Given its high cost (third after the 2D Torus and Mesh), the SD may prove expensive. Thus bristling greatly helps in reducing the high cost of the SD and strongly improves its implementation feasibility.

7.2. Link cost

The link cost is the total number of links in the network. The Kite OCIN has the lowest link cost, followed by the Kite4, Torus4, e-Kite, and Lantern. The SD4 follows next.


Fig. 16. One-to-one routing algorithm for the SD(N1, N0).
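The listing of Fig. 16 did not survive extraction, so the following Python sketch is our reconstruction of a next-hop function consistent with Section 6: route in the first (SG) dimension using ring and cross links, then along the second-dimension ring, choosing VC1 or VC2 per the D0 > O0 / D1 > O1 rule. The quarter-ring threshold for taking the cross link is the usual Spidergon shortest-path choice and is our assumption; the published algorithm may differ in its details.

```python
def sd_next_hop(S, O, D, N1, N0):
    """One routing step in SD(N1, N0); nodes are (n1, n0) labels.

    S: current node, O: original source, D: destination.
    Returns ((next1, next0), vc), or None once S == D.
    Assumes N1 and N0 are multiples of 4, as in Section 6.
    """
    s1, s0 = S
    o1, o0 = O
    d1, d0 = D
    if S == D:
        return None                          # message delivered
    if s0 != d0:                             # first dimension: SG ring + cross links
        vc = 1 if d0 > o0 else 2             # VC1 iff D0 > O0 (deadlock avoidance)
        delta = (d0 - s0) % N0
        if delta <= N0 // 4:
            n0 = (s0 + 1) % N0               # ring link, increasing labels
        elif delta >= 3 * N0 // 4:
            n0 = (s0 - 1) % N0               # ring link, decreasing labels
        else:
            n0 = (s0 + N0 // 2) % N0         # cross (diagonal) link
        return (s1, n0), vc
    vc = 1 if d1 > o1 else 2                 # second dimension: VC1 iff D1 > O1
    delta = (d1 - s1) % N1
    n1 = (s1 + 1) % N1 if delta <= N1 // 2 else (s1 - 1) % N1
    return (n1, s0), vc

# Example: (0, 0) -> (1, 7) in SD(4, 16): cross, ring, then one dim-2 hop (3 router hops).
pos, O, D = (0, 0), (0, 0), (1, 7)
while (step := sd_next_hop(pos, O, D, 4, 16)) is not None:
    pos, vc = step
    print(pos, "VC", vc)
```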

Table 3
OCIN properties.

Property                       I Mesh  II Trs  III Kite  IV WM1  V WM2  VI Lntrn  VII SpCy
Average distance               9.11    6.06    6.27      7.89    7.81   6.19      6.00
Longest distance               16      10      9         11      11     10        9
Link cost                      176     192     84        110     110    96        144
# of Routers                   64      64      64        64      64     64        64
Router degree (bidir.)         3,4,5   4       3,4       3,4,5   3,4,5  3,4,5     5,6
Bisection bandwidth            8       16      8         10      12     8         16
# of Router I/O ports (max)    10      8       8         10      10     10        12
Total router cost              320     256     256       320     320    320       384
Link length (max)              1C      2C      8C        6C      6C     4C        8C
Link delay (max, ps)           30      60      240       180     180    120       240
Arbiter power (mW, 10 trans.)  31.5    30      30        31.5    31.5   31.5      32.3
Total arbiter power (mW)       2016    1920    1920      2016    2016   2016      2067
Router eval. delay (ps)        1400    800     800       1400    1400   1400      1500
Router arbiter delay (ps)      3000    2900    2900      3000    3000   3000      3500
Router xbar delay (ps)         2500    2300    2300      2500    2500   2500      2500
Frequency (GHz, max)           0.333   0.345   0.345     0.333   0.333  0.333     0.286
Longest delay (ns)             48      29      26        33      33     30        31
2D Mapping (A = best)          A       A       C         C       C      C         B
Scalability (A = best)         C       A       B         C       C      C         B

Property                       VIII SpDo      IX e-Kite  X Trs2  XI SpDo4      XII Kite4  XIII Trs4
Average distance               5.49           5.95       5.05    3.78          4.29       4.03
Longest distance               8              8          8       5             5          6
Link cost                      160            96         128     104           94         96
# of Routers                   64             64         32      16            16         16
Router degree (bidir.)         6              4          6       9             7,8        8
Bisection bandwidth            32             16         8       8             2          8
# of Router I/O ports (max)    12             8          12      18            16         16
Total router cost              384            256        192     144           128        128
Link length (max)              4C(a)/8C(b)    8C         2C      2C(c)/4C(d)   4C         4C
Link delay (max, ps)           120(a)/240(b)  240        60      60(c)/120(d)  120        120
Arbiter power (mW, 10 trans.)  32.3           30         32.3    50            45         45
Total arbiter power (mW)       2067           1920       1034    800           720        720
Router eval. delay (ps)        1500           800        1500    2200          2000       2000
Router arbiter delay (ps)      3500           2900       3500    5800          5100       5100
Router xbar delay (ps)         2500           2300       2500    2700          2700       2700
Frequency (GHz, max)           0.286          0.345      0.286   0.172         0.196      0.196
Longest delay (ns)             28             23         28      29            36         31
2D Mapping (A = best)          B              C          A       B             B          A
Scalability (A = best)         A              B          A       A             A          A

Notes: (a) SD(8,8); (b) SD(16,16); (c) SD4(8,2); (d) SD4(4,4).


Fig. 17. Longest and average distances and longest delays vs. OCIN topology.

7.3. Reliability

Reliability-wise, a 9-port SD4, 8-port Torus4, or 7- or 8-port Kite4 router becomes a single point of failure, isolating 4 cores from the network if it completely fails. In comparison, a router failure in the non-bristled OCINs affects only one core. Transit traffic in either bristled (short of a complete router failure) or unbristled OCINs can still be redirected through other redundant paths at longer delay penalties.

7.4. Bisection bandwidth

The bisection bandwidth reports the minimum number of links which must be cut to divide the network into equal halves. The SD ranks first (32), followed by the SC, Torus, and e-Kite networks (16). The Warped Mesh 1 and 2 networks follow. Bristling reduces the bisection bandwidth of the SD4 in comparison with the SD network; the same effect takes place on the Torus4 and Torus2 networks in reference to the Torus network. The Kite4 suffers from a very low bisection bandwidth.

7.5. Router cost and frequency, and total delay

The router degree or radix reflects the number of communication ports at each core router. The Torus4, Torus2, e-Kite, SD, and SD4 are the only OCINs considered with a fixed node degree. Uniformity of node degree is desirable for laying out the chip floor plan and for ease of design. Router sharing also reduces the number of routers by a factor of 4 (2 for the Torus2) compared to the 2D Mesh OCIN I, a big advantage of bristling. The total router area (cost) in Table 3 is the product of the number of routers and the maximum number of output (or input) ports of any router in the network, assuming that the router's other rectangle side length is uniform across all OCINs.

To estimate the maximum router frequency, we assume that the router is a pipeline of 4 stages [25]: router evaluation (routing function), router arbitration (which ensures that one output port is connected to only one input port), router crossbar, and inter-router link. We borrow the first three router stage delays from [25], which were obtained using Synopsys tools and 45 nm technology libraries, assuming a 1 GHz frequency (not achievable). Specifically, router evaluation and arbitration delays were obtained for an 8-bit core address (64 cores) [25]. To estimate the worst-case inter-router link delays, we examine the 2D mappings of the OCINs and derive the longest link lengths (in terms of the square core side width C; other spacings are ignored). Wire speeds for a 45 nm process were obtained from [24] and assume a delay of 0.003 ps per 7 nm of wire. The square core width is assumed to be 70 µm; a wire crossing the core width thus incurs a delay of 30 ps. Multiplying the wire speed by the maximum link length (in multiples of C) yields the maximum link delay (in ps, see Table 3).

From Table 3, it is obvious that the router arbitration delay dominates the other router stage delays, including the longest inter-router link delays, and the maximum router frequency is therefore the inverse of the router arbitration delay. The longest source-to-destination core delay is the product of the diameter and the router cycle time (the inverse of the maximum router frequency). These longest delays are displayed in Table 3 and plotted (in ns, scaled by 1/4) in Fig. 17. Although the SD and SD4 networks are still competitive with the other 64-node networks, their relatively larger router arbitration delay weighs in and dethrones them from the top spot. Instead, the e-Kite and Kite OCINs take the delay crown owing to the combination of low router degree, short link distances, and competitive diameter.
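The following Python sketch (ours) reproduces the delay and power columns of Table 3 from the model just described: the dominant arbitration stage sets the cycle time, the maximum frequency is its inverse, the worst-case link delay is 30 ps per crossed core width C, the longest delay is the diameter times the cycle time, and (anticipating Section 7.6) the total arbiter power is the per-router arbiter power times the router count. The per-OCIN inputs are the [25]-derived values from Table 3.

```python
# Per-OCIN model inputs taken from Table 3:
# (arbitration delay ps, diameter hops, max link length in C, arbiter mW, # routers)
ocins = {
    "Mesh":   (3000, 16, 1, 31.5, 64),
    "Torus":  (2900, 10, 2, 30.0, 64),
    "e-Kite": (2900,  8, 8, 30.0, 64),
    "SD4":    (5800,  5, 4, 50.0, 16),   # SD4(4,4)
    "Torus4": (5100,  6, 4, 45.0, 16),
}
PS_PER_C = 30   # one 70-um core width at ~0.43 ps/um in 45 nm [24]

for name, (arb, diam, link_c, arb_mw, routers) in ocins.items():
    f_ghz = 1000.0 / arb               # arbitration stage sets the cycle time
    longest_ns = diam * arb / 1000.0   # diameter x cycle time (Section 7.5)
    total_mw = arb_mw * routers        # total arbiter power (Section 7.6)
    print(f"{name}: {f_ghz:.3f} GHz, link {PS_PER_C * link_c} ps, "
          f"longest {longest_ns:.0f} ns, arbiter {total_mw:.0f} mW")
# Mesh: 0.333 GHz, link 30 ps, longest 48 ns, arbiter 2016 mW  (cf. Table 3)
```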


When circuit switching is employed to reduce the router delay, the router arbitration must facilitate the end-to-end communication, which requires the availability of all routers on the way to the destination core. With a smaller total number of routers, the bristled OCINs have to check the availability of fewer routers, simplifying the arbitration process.

As to routing ease, which affects router power, the Torus4 routing is well understood and simple enough; simple logic circuits can be designed to perform NSWE routing. The routing for the e-Kite is less understood and may be more difficult owing to the Enhanced Kite's irregularities. The e-Kite's routing may be simplified by a proper core router numbering assignment. A routing table may be required in the OCIN's routers to guide routing decisions.

7.6. Router power

The router arbitration stage power is obtained for each OCIN from [25] (using accurate and detailed Synopsys simulations), based on the number of router ports and for 10 transactions. Table 3 displays the total arbiter power, which is the product of the router arbitration power and the total number of routers in the network. Clearly, the four concentrated OCINs (X, XI, XII, and XIII) consume less total router arbiter power. In addition, their relatively low worst-case link distances result in lower link power compared to the e-Kite, Kite, and Warped Mesh networks.

7.7. Router concentration

From different perspectives, router bristling or concentration has high appeal in many-core systems. In addition to reducing the number of routers, bristling also has a great effect on the diameter, average delay, total link cost, total router cost, and total arbiter power, as the bristled OCINs (X, XI, XII, and XIII) stand out. Each bristled router is obviously larger in area and consumes more power than the original router, but in totality the concentrated router benefits are greater.

7.8. Network scalability

The 2D mapping and scalability of each OCIN were also rated, with A representing the best rating. On mapping to the 2D chip space, the Warped Mesh 1 and 2 and Lantern OCINs were rated C because of their inter-mesh wires, and the Kite and e-Kite OCINs were rated C because of their inter-ring wirings. The SC, SD, and Kite4 OCINs were rated B due to their wire complexity. As to scalability to larger network sizes, the Mesh is rated C due to its poor diameter scalability, the Warped Meshes and Lantern were rated C due to their poor longest-link scalability, and the Kite and e-Kite OCINs were rated B due to their wire complexities.

Soon, chips with 1000+ cores will be available. For such numbers of cores, the e-Kite and Kite lose their appeal owing to their crossed-wire complexities. Although the 64-core Torus2 and SD4 OCINs are on par with the next-best delay, the SD4 has the edge over the Torus2 in total link and router costs. This SD4 advantage over the Torus2 and Torus4 holds for 1000 cores, solidifying the SD4's position.

8. Conclusion

We explored 13 OCINs – some new – to interconnect 64-core embedded MPSoCs. For the most prominent low-diameter OCINs, we also considered their bristled versions. We introduced the Kite, Enhanced Kite, Lantern, Spidergon–Cylinder, Spidergon–Donut, Spidergon–Donut4, and Kite4 OCINs, as well as a worst-case network delay computation methodology based on published wire and router stage delays.
The e-Kite and Kite networks featured the best worst-case delay and link cost but rated lower on 2D mapping and scalability. The Torus2 and Torus came next in the delay race, but at higher total link and router costs and arbiter powers than the SD4. The SD4 came out on top in terms of diameter (and average distance), 17% better than the Torus4 and way ahead of popular OCINs such as the 2D Mesh or Spidergon. Furthermore, core clustering and router bristling strongly improve the SD4's implementation feasibility in comparison with its non-bristled version. The 4× concentrated OCINs (XI, XII, and XIII) had better longest and average distances, total link and router costs, and total arbiter power than the remaining OCINs. Moreover, the SD4's longest delay is competitive with the e-Kite's and Kite's while being more implementation-feasible and more scalable. For the SD, the paper presented some of its key properties and a deadlock-free routing algorithm.

As embedded MPSoCs grow in size, reaching 128, 256, 512, 1024 cores and above, core clustering and router bristling will prove more appealing in shrinking the network real estate, the router-to-router link lengths, and the power needed to meet real-time deadlines. Determining the optimal cluster size or degree of bristling for various network sizes (i.e. total numbers of IP cores) is worthy of future investigation.

References

[1] Asanovic K, Bodik R, Catanzaro B, Gebis JJ, Husbands P, Keutzer K, et al. The landscape of parallel computing research: a view from Berkeley. University of California at Berkeley, Technical Report No. UCB/EECS-2006-183.
[2] Benini L, De Micheli G. Networks on chips. Morgan Kaufmann; 2006.


[3] Bijlsma B. Asynchronous network on chip architecture performance analysis. MS Thesis, Department of Electrical Engineering, TU Delft; 2005.
[4] Bononi L, Concer N. Simulation and analysis of network on chip architectures: ring, Spidergon and 2D mesh. In: Proceedings of Design, Automation and Test in Europe (DATE); 2006.
[5] Bourduas S, Zilic Z. A ring/mesh interconnect using hierarchical rings for global routing. In: First ACM/IEEE symposium on networks-on-chip; 2007.
[6] Coppola M, Locatelli R, Maruccia G, Pieralisi L, Scandurra A. Spidergon: a novel on chip communication network. In: Proceedings of the international symposium on system on chip. Tampere, Finland; 2004.
[7] Dally W. Enabling technology for on-chip interconnection networks. Keynote speech, first ACM/IEEE symposium on networks-on-chip; 2007.
[8] Dally W, Towles B. Principles and practices of interconnection networks. Morgan Kaufmann; 2003.
[9] Dally W. Express cubes: improving the performance of k-ary n-cube interconnection networks. IEEE Trans Comput 1991;40(9).
[10] Duato J, Yalamanchili S, Ni L. Interconnection networks: an engineering approach. IEEE CS Press; 1997.
[11] Gomez C, Gilabert F, Gomez M, Lopez P, Duato J. Beyond fat-tree: unidirectional load-balanced multistage interconnection network. IEEE Comput Arch Lett 2008;7(2):49–52.
[12] Hammond L, Nayfeh B, Olukotun K. A single-chip multiprocessor. IEEE Comput 1997;30(9).
[13] Ho R, Mai K, Kapadia H, Horowitz M. Interconnect scaling implications for CAD. In: Proceedings of ICCAD; 1999.
[14] Intel Teraflops Research Chip. http://www.intel.com/go/terascale.
[15] Muralimanohar N, Balasubramonian R. Interconnect design considerations for large NUCA caches. In: Proceedings of the international symposium on computer architecture (ISCA); 2007.
[16] Peh L, Dally W. A delay model and speculative architecture for pipelined routers. In: Proceedings of the seventh international symposium on high-performance computer architecture (HPCA); 2001.
[17] Pinto A, Carloni L, Sangiovanni-Vincentelli A. Synthesis of low power NoC topologies under bandwidth constraints. University of California at Berkeley, Technical Report No. UCB/EECS-2006-137.
[18] Preparata FP, Vuillemin J. The cube-connected-cycles: a versatile network for parallel computation. FOCS 1979:140–7.
[19] Ravindran G, Stumm M. A performance comparison of hierarchical ring- and mesh-connected multiprocessor networks. In: Proceedings of the third international symposium on high performance computer architecture. San Antonio, TX; 1997.
[20] Sibai F. The hyper-ring network: a cost-efficient topology for scalable multicomputers. In: Proceedings of the ACM symposium on applied computing. Atlanta; 1998. p. 607–12.
[21] Sibai F. Resource sharing in networks-on-chip of large many-core embedded systems. In: Proceedings of the 38th IEEE international conference on parallel processing workshops. Vienna; 2009.
[22] Coppola M. Spidergon: a NoC for future SMP architectures. In: Proceedings of the fourth forum on application-specific MPSoC. France; 2004.
[23] Coppola M, Grammatikakis M, Locatelli R, Maruccia G, Pieralisi L. Design of cost-efficient interconnect processing units: Spidergon STNoC. CRC Press; 2008.
[24] DeHon A. ESE680 lecture notes, Day 13. University of Pennsylvania; Spring 2007. http://www.seas.upenn.edu/~ese534/spring2007/lectures/Day13_6up.pdf.
[25] Mineo C, Davis W. Save your energy: a fast and accurate approach to NoC power estimation. In: Proceedings of the 15th international symposium on high-performance computer architecture. Raleigh, NC; 2009.
[26] Gschwind M, Hofstee H, Flachs B, Hopkins M, Watanabe Y, Yamazaki T. Synergistic processing in Cell's multicore architecture. IEEE Micro 2006;26(2).
[27] Emer J. EV8: the post-ultimate Alpha. http://research.ac.upc.edu/pact01/keynotes/emer.pdf.

Fadi N. Sibai is with the R&D Center, Saudi Aramco, Saudi Arabia. He previously worked for Intel Corporation, USA, and held academic positions in the USA and the Middle East. He has authored over 130 technical publications and reports, and served on the organizing or program committees of over 20 international conferences. He received a Ph.D. in Electrical Engineering. His research interests are in computer engineering. His biography is published in the 2011 editions of Who's Who in the World and IBC's 2000 Outstanding Intellectuals of the 21st Century. He is a member of Eta Kappa Nu.