Dual partitioning multicasting for high-performance on-chip networks

Jianhua Li a,∗, Liang Shi a, Chun Jason Xue b, Yinlong Xu a

a College of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui, PR China
b Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

∗ Corresponding author. E-mail addresses: [email protected] (J. Li), [email protected] (L. Shi), [email protected] (C.J. Xue), [email protected] (Y. Xu).
Highlights

• Multicast traffic threatens the scalability of on-chip unicasting mechanisms.
• We propose Dual Partitioning Multicasting (DPM) to balance the network link usage.
• DPM simultaneously yields high performance for unicast traffic.
• DPM effectively improves average packet latency and network power dissipation.
• DPM yields better scalability compared with previous work.
Article info

Article history: Received 17 December 2011; Received in revised form 23 July 2013; Accepted 26 July 2013; Available online xxxx.

Keywords: On-chip network; Multicast routing; Rectilinear Steiner tree; Latency-aware; Load-balance
Abstract

As the number of cores integrated onto a single chip increases, power dissipation and network latency constraints become ever more stringent. The on-chip network provides an efficient and scalable interconnection paradigm for chip multiprocessors (CMPs), and one-to-many (multicast) communication is common on such platforms. Without efficient multicasting support, traditional unicast-only on-chip networks handle such multicast communication inefficiently. In this paper, we propose Dual Partitioning Multicasting (DPM) to reduce packet latency and balance network resource utilization. Specifically, DPM adaptively makes routing decisions based on the network load-balance level as well as the link sharing patterns characterized by the distribution of the multicasting destinations. Extensive experimental results for synthetic traffic as well as real applications show that, compared with the recently proposed RPM scheme, DPM significantly reduces the average packet latency and mitigates network power consumption. More importantly, DPM is highly scalable for future on-chip networks with heavy traffic loads and a variety of traffic patterns.
1. Introduction

The continuous decrease in transistor size has led to a persistent increase in the number of cores that can be integrated into CMP systems [4,16,17,36]. The on-chip network [5] provides an efficient and scalable communication paradigm for CMP systems. Recent work [18] found that, without efficient multicasting support, many applications for CMPs, such as cache coherence protocols [25,31,32] and operand networks [4,29], suffer significant performance degradation. In an on-chip network without multicasting support, N unicast packets must be injected into the network in order to transmit each multicast packet with
N multicasting destinations. Such a routing mechanism is traffic-intensive and can cause network congestion as well as unacceptable packet latency. As presented in [18], in a 4 × 4 mesh network with a state-of-the-art packet-switched unicast router, the saturation point drops rapidly when only 1% of the injected packets are multicast packets. As a result, efficient multicasting support is imperative for such systems. This paper proposes a novel multicast routing scheme that can effectively route on-chip multicast traffic with low latency and low power dissipation.

How to choose the paths for transmitting multicast packets is the major challenge for efficient multicasting. Judicious routing can reduce power dissipation and on-chip network traffic. Several research efforts have addressed the multicasting problem for on-chip networks. Lu et al. [22] proposed path-based connection-oriented multicasting for wormhole-switched on-chip networks. The advantage of path-based multicasting is its simple implementation. However, as the constructed path may be long, it forces packets to be transmitted along
the long path, which increases packet latency even under low or medium network load. Abad et al. [1] proposed on-chip hardware support for multicasting based on a special router called the multicast rotary router (MRR). A fully adaptive tree is used to transmit the multicast traffic, and a considerably complex mechanism is proposed in [1] to avoid deadlock. Virtual Circuit Tree Multicasting [18] is a routing-table-based multicasting mechanism whose advantages and disadvantages have been analyzed in recent work [28,35]. bLBDR [28] is a logic-based multicasting scheme primarily proposed for irregular topologies; in addition, the logic-based bLBDR scheme removes the area-consuming routing table. RPM [35] is also not based on a routing table. The route calculation in RPM is based on the global distribution of the multicasting destinations in a mesh-based on-chip network. However, the transmission scheme in RPM can lead to ineffective routing decisions, as indicated in [21,23]. All the previous works are either based on routing table support or dedicated to multicast communication while neglecting unicast communication.

In this paper, we propose a novel multicasting scheme, Dual Partitioning Multicasting (DPM), to route multicast packets with low latency and low power consumption without routing table support. More importantly, DPM also achieves high performance for unicast communication. Under the DPM scheme, multicast packets are partitioned into two categories by exploiting the link sharing patterns derived from the global distribution of the multicasting destinations. For unicast packets, we propose an adaptive, multicast-sensitive packet type assignment mechanism which further enhances the scalability of DPM by balancing the network traffic. For each category of packets, we propose corresponding multicast routing algorithms for transmission. The proposed DPM scheme obtains near-optimal worst-case throughput for unicast traffic, which makes DPM adaptable to on-chip traffic with a variety of unicast traffic patterns and different percentages of unicast traffic. Extensive experimental results show that DPM can significantly reduce the average packet latency and mitigate the power consumption compared with RPM [35]. The main contributions of this paper are as follows:
• We propose to divide multicast packets into two categories by exploiting the link sharing patterns implied by the distribution of the multicasting destinations. Two methods, called RST and B-SLC respectively, are proposed to categorize the multicast packets.
• We propose dedicated routing algorithms to route the two categories of packets efficiently. Moreover, the dedicated routing algorithms obtain near-optimal worst-case performance for unicast traffic.
• We propose the E-SLC approach, which achieves more accurate multicast packet categorization with very low time complexity. Additionally, we propose multicast-sensitive unicast assignment to make DPM highly scalable for non-uniform traffic patterns.

The remainder of this paper is organized as follows. The background of network partitioning and the motivation of this work are presented in Section 2. The proposed DPM scheme is introduced in Section 3. The implementation of DPM is presented in Section 4. Simulation and results analysis are discussed in Section 5. Finally, Section 6 concludes this paper.

2. Background and motivation

Recent work [35] proposed a multicasting scheme called recursive partitioning multicasting (RPM) for mesh-based on-chip networks. In RPM, route calculation is based on the global distribution of the multicasting source and all the multicasting destinations. The whole mesh network is logically partitioned into 3, 5 or 8 parts according to the location of the multicasting source.
Fig. 1. Partitioning of on-chip networks.
For an on-chip network, if the multicasting source is located in a center tile, the whole mesh is partitioned into 8 parts,1 as depicted in Fig. 1. Based on this network partitioning, RPM defines several multicasting rules to transmit network packets. Specifically, RPM gives higher priority to routing packets along the south or north links first. As depicted in Fig. 2(a), if multicasting destinations are located in both the NE and NW parts, RPM first transmits the packet to the north output port of the source tile. In the same way, if multicasting destinations are located in both the SE and SW parts, RPM first transmits the packet to the south output port of the source tile, as shown in Fig. 2(b). Such shared routing reduces network traffic as well as link usage. However, west-sharing routing and east-sharing routing are restricted in RPM, as demonstrated in Fig. 2(c) and (d). The restrictions in Fig. 2(c) and (d) can cause unbalanced link usage, which degrades network performance. Fig. 3(a) shows RPM's routing solution for a sample multicast packet whose source is tile 14 and whose destinations are tiles 5 and 15. The optimal routing solution for this packet is shown in Fig. 3(b). RPM uses 10 network links, which is 1.67× the link usage of the optimal routing, as demonstrated in Fig. 3. The optimal routing is, however, forbidden by the routing rules in RPM, which leads to more network link usage. Unnecessary link usage and router traversals consume more power and generate more network traffic. Moreover, if either of the transmissions (the packet to the south destinations or the packet to the north destinations) is blocked at the multicasting source, the subsequent packets generated by the same source will be blocked. Therefore, a good multicasting scheme should balance the network link usage.

Fig. 1(b) presents the routing rules for unicast packets under RPM [35]. RPM utilizes shortest-path routing, wherein the unicast traffic to the NE and SW parts is routed by deterministic YX routing and the unicast traffic to the NW and SE parts is routed by deterministic XY routing. Seo et al. [30] showed that such static routing suffers from poor worst-case and average-case throughput. As a result, a multicasting scheme that is also unicast-aware is beneficial for on-chip network performance. In other words, a robust multicasting scheme should simultaneously yield high performance for unicast traffic.

Motivated by the above analysis, we propose in this work a novel multicasting scheme called DPM. Under the DPM routing scheme, multicast packets are divided into two categories. One category exhibits more vertical link sharing, like the cases in Fig. 2(a) and (b). The other category exhibits more horizontal link sharing, similar to the scenarios in Fig. 2(c) and (d). How to categorize the multicast packets is presented in Section 3.1. We then propose four multicast routing algorithms, which are illustrated in Section 3.2, to route these two categories of multicast traffic efficiently.
1 A multicasting source located in a corner tile will partition the network into 3 parts while a boundary tile source will partition the network into 5 parts. The capital letters in Fig. 1 (NW, DN, etc.) represent the eight orientations.
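For illustration, the eight-way classification of Fig. 1 amounts to a couple of coordinate comparisons per destination, which is essentially what the partitioning logic described later in Section 4.3.2 does with two comparators. The sketch below is an illustration under stated assumptions, not code from the paper; the zero-indexed (x, y) tile convention with y growing southward (so "north" means a smaller y) is an assumption of this example.

// Classify a destination tile into one of the eight parts of Fig. 1
// (NW, DN, NE, DW, DE, SW, DS, SE) relative to the multicasting source.
enum class Part { NW, DN, NE, DW, DE, SW, DS, SE, SRC };

Part classify(int sx, int sy, int dx, int dy) {
    if (dx == sx && dy == sy) return Part::SRC;        // the source tile itself
    if (dy < sy) return (dx < sx) ? Part::NW : (dx > sx) ? Part::NE : Part::DN;
    if (dy > sy) return (dx < sx) ? Part::SW : (dx > sx) ? Part::SE : Part::DS;
    return (dx < sx) ? Part::DW : Part::DE;             // same row as the source
}

For a corner or boundary source, some of these parts are simply empty, which corresponds to the 3-part and 5-part cases mentioned in the footnote above.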
Fig. 2. Unfair multicast routing rules in RPM: (a) North-sharing routing; (b) South-sharing routing; (c) West-sharing routing; (d) East-sharing routing.

Fig. 3. RPM routing and optimal routing for packet P (source: 14, destinations: 5 and 15): (a) RPM routing; (b) Optimal routing.
Categorizing packets by differentiating their link sharing patterns and applying dedicated routing algorithms to route the traffic can significantly reduce on-chip link usage. In addition, the proposed DPM scheme balances the network link usage through multicast-sensitive unicast traffic assignment, which makes DPM more scalable under heavy on-chip traffic loads with a variety of traffic patterns.

3. Dual partitioning multicasting

In the proposed DPM scheme, multicast packets are divided into two categories. In this section, we first present how to categorize the multicast packets. Then, the detailed routing algorithms are illustrated. Finally, we present the unicast-aware mechanism to balance the network load.

3.1. Multicast packet categorization

Under the DPM scheme, one bit in the header flit,2 denoted as the type bit, is used to indicate which category a multicast packet belongs to. Two approaches are proposed to calculate the packet type. In the following, we present the details of the two approaches.

3.1.1. Rectilinear Steiner tree approach

The Minimum Rectilinear Steiner Tree (MRST) problem is to find a rectilinear Steiner tree with minimum cost. In this approach, a rectilinear Steiner tree is first constructed to connect the multicasting source and all destinations. Routing the multicast packet along the constructed tree yields optimal link usage. However, such routing needs to reserve network resources and construct the rectilinear tree, which is time-consuming. In addition, every intermediate node in the tree needs to exchange information to implement such routing. This large overhead makes rectilinear-Steiner-tree-based routing infeasible. However, we can still utilize the tree to compute the type bit of the multicast packet rather than to route the packet. To calculate the packet type accurately, we define the HV-benefit of a rectilinear Steiner tree as the difference between the weight of the east and west branches of the tree and the weight of the south and north branches.

2 In this work, a wormhole switching technique [10] is used to transmit the network traffic, and packets are transmitted in the form of flits.
Definition 1. Given a rectilinear Steiner tree T with root r, the HV-benefit of T is defined as

HV_T = (W_east + W_west) − (W_south + W_north)

where W_east, W_west, W_south and W_north are the sums of the weights of the edges in the east, west, south and north branches of T rooted at node r, respectively.

On the basis of the definition above, the type bit of a multicast packet can be determined in two steps. The first step is to construct the rectilinear Steiner tree T that connects the multicasting source (regarded as the root of T) and all multicasting destinations of the multicast packet. Alexander et al. proposed an effective heuristic algorithm called IDOM [3] to calculate such an RST. The heuristic repeatedly finds a Steiner candidate node that reduces the overall minimum spanning tree cost and adds it to the growing set of Steiner nodes until no further improvement can be found. In this paper, the rectilinear Steiner tree construction is based on the IDOM [3] algorithm. In the second step, the HV-benefit of T is calculated. If HV_T is greater than zero, the packet's type bit in the head flit is set to 0. Otherwise, it is set to 1.

Fig. 4(a) shows the resulting rectilinear Steiner tree for the multicast packet used in Section 3.1.2. W_east is calculated by adding the weights of edges (12, 13), (13, 14) and (14, 19), giving 3. W_west, W_south and W_north are obtained in the same way: W_west is 3, W_south is 2 and W_north is 2. Therefore, HV_T = 2 and the type bit of this multicast packet is set to 0.
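As a small illustration (not the authors' implementation), once a rectilinear Steiner tree rooted at the multicast source has been built, for example with the IDOM heuristic, the RST type bit follows directly from Definition 1:

// Branch weights of a rectilinear Steiner tree rooted at the multicast source:
// the summed Manhattan edge weights of the east, west, south and north branches.
struct BranchWeights { int east, west, south, north; };

int hvBenefit(const BranchWeights& w) {
    return (w.east + w.west) - (w.south + w.north);    // HV_T of Definition 1
}

int rstTypeBit(const BranchWeights& w) {
    return hvBenefit(w) > 0 ? 0 : 1;                   // 0: horizontal-first category
}

For the tree of Fig. 4(a), the weights {3, 3, 2, 2} give HV_T = 2 and hence a type bit of 0, matching the calculation above.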
Fig. 4. Trees used to calculate the multicast packet type: (a) X-first (RST) tree; (b) Y-first tree.
Algorithm 1: E-SLC
Input: multicasting destinations S
Output: type_bit of the multicast packet
1  dor_cost = get_dor_cost(S);   /* get the cost of DOR routing. */
2  x_cost = get_x_cost(S);       /* get the weight of X-first tree. */
3  y_cost = get_y_cost(S);       /* get the weight of Y-first tree. */
4  if (dor_cost − x_cost) > (dor_cost − y_cost) then
5      type_bit = 0;
6  else if (dor_cost − x_cost) < (dor_cost − y_cost) then
7      type_bit = 1;
8  else
9      type_bit = rand()%2;
   return type_bit;
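The following self-contained C++ sketch mirrors Algorithm 1; it is an illustration under stated assumptions, not the authors' implementation. Tiles are addressed by zero-indexed (x, y) coordinates, the X-first (Y-first) tree weight is obtained by counting the distinct links used when every source-to-destination route turns after the horizontal (vertical) dimension, and dor_cost is the sum of Manhattan distances, matching the variable definitions given in Section 3.1.2 below. The helper structure and function names follow the pseudocode but are assumptions of this example.

#include <algorithm>
#include <cstdlib>
#include <set>
#include <utility>
#include <vector>

struct Tile { int x, y; };                       // zero-indexed mesh coordinates
using Node = std::pair<int, int>;
using Link = std::pair<Node, Node>;              // undirected link, endpoints sorted

// Insert every link of one dimension-ordered walk from 'from' to 'to'.
static void addPath(std::set<Link>& links, Tile from, Tile to, bool xFirst) {
    Tile cur = from;
    auto step = [&](int dx, int dy) {
        Node a{cur.x, cur.y}, b{cur.x + dx, cur.y + dy};
        links.insert({std::min(a, b), std::max(a, b)});   // shared links count once
        cur = {b.first, b.second};
    };
    auto walkX = [&] { while (cur.x != to.x) step(cur.x < to.x ? 1 : -1, 0); };
    auto walkY = [&] { while (cur.y != to.y) step(0, cur.y < to.y ? 1 : -1); };
    if (xFirst) { walkX(); walkY(); } else { walkY(); walkX(); }
}

// Weight of the X-first (or Y-first) tree: number of distinct links it uses.
static int treeCost(Tile src, const std::vector<Tile>& dsts, bool xFirst) {
    std::set<Link> links;
    for (const Tile& d : dsts) addPath(links, src, d, xFirst);
    return static_cast<int>(links.size());
}

// dor_cost: sum of Manhattan distances from the source to every destination.
static int dorCost(Tile src, const std::vector<Tile>& dsts) {
    int sum = 0;
    for (const Tile& d : dsts)
        sum += std::abs(d.x - src.x) + std::abs(d.y - src.y);
    return sum;
}

// E-SLC (Algorithm 1): keep the tree that saves more links over DOR routing;
// ties are broken randomly so both virtual networks stay balanced.
int eSlcTypeBit(Tile src, const std::vector<Tile>& dsts) {
    const int dor = dorCost(src, dsts);
    const int xSave = dor - treeCost(src, dsts, /*xFirst=*/true);
    const int ySave = dor - treeCost(src, dsts, /*xFirst=*/false);
    if (xSave > ySave) return 0;
    if (xSave < ySave) return 1;
    return std::rand() % 2;
}

For the packet of Fig. 7 on the assumed 5 × 5 mesh of the illustrations (tiles numbered row by row, source tile 12, destinations 2, 5, 10, 14, 19 and 22), this gives dor_cost = 14, x_cost = 10 and y_cost = 12, i.e. savings of 4 versus 2 and a type bit of 0, matching the worked example in the text.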
Fig. 5. Fault-rate of packet type calculation using B-SLC.

3.1.2. Shared links comparison approach

We propose two variants of the shared links comparison (SLC) method. The basic one, called B-SLC, was presented in recent work [21]. The other, called E-SLC (Enhanced SLC), yields more accurate packet type calculations than the basic approach.

B-SLC: In this approach, the type bit of a multicast packet is determined by comparing the weights of two trees constructed from the DOR routing [33] paths. Fig. 4(a) and (b) show two sample trees. The weight of each edge in a tree is the Manhattan distance between the two terminals of the edge. If the weight of the X-first tree is smaller than that of the Y-first tree, the type bit of the multicast packet is set to 0. Otherwise, the type bit is set to 1. For the case in Fig. 4, the weight of the X-first tree is 10, which is smaller than the weight of the Y-first tree, so the type bit is set to 0. The intuition behind the B-SLC approach is that, for a given multicast packet, if the weight of the spanned X-first tree is smaller than that of the Y-first tree, transmitting the multicast packet along the horizontal links first has the potential of saving more network link usage, and vice versa.

E-SLC: The B-SLC approach does not always produce the same result as the RST method. Fig. 5 presents the percentage of multicast packets that are mis-categorized by the B-SLC approach compared with the RST method. The characteristics of the multicast packets are described in Section 5. As shown in Fig. 5, on average more than 35% of the multicast packets, across different numbers of multicasting destinations, can be incorrectly categorized. Inaccurate packet categorization leads to suboptimal routing decisions that consume more network links as well as power. In this paper, we therefore propose an enhanced SLC approach, presented as follows.

Algorithm 1 presents the mechanism of the E-SLC approach. The meaning of the variables in Algorithm 1 is as follows:

• dor_cost represents the cost of DOR routing in terms of link utilization, i.e., the sum of the Manhattan distances from the multicasting source to each multicasting destination.
• x_cost indicates the weight of the X-first tree.
• y_cost indicates the weight of the Y-first tree.
• type_bit indicates the type of the multicast packet.

In the E-SLC approach, the cost of routing according to DOR, the X-first tree and the Y-first tree is first computed, as shown in Algorithm 1 (lines 1–3). Then, the link usage reductions obtained by routing the multicast packet according to the X-first tree and the Y-first tree, relative to DOR [33] routing, are compared. If applying the X-first tree to route the multicast packet obtains a larger link usage reduction, the type bit of the multicast packet is set to 0, as indicated by lines 4–5 in Algorithm 1. On the contrary, if routing according to the Y-first tree yields a larger link usage reduction, the type bit is set to 1 (lines 6–7). Finally, if the X-first and Y-first trees achieve the same link usage reduction compared with DOR routing, the type bit is randomly set to 0 or 1 with equal probability. For the packet shown in Fig. 4, routing according to the X-first tree obtains a link usage reduction of 4 compared with DOR routing, while the reduction obtained by the Y-first tree is 2. Therefore, the type bit is set to 0. E-SLC yields more accurate packet type calculation, which is evaluated in Section 5. In addition, for multicast packets with few destinations, the X-first and Y-first trees usually achieve the same link reduction. In such a scenario, the E-SLC approach categorizes the multicast packets more uniformly into the two types. In other words, the E-SLC approach balances the network load between the virtual networks used for the two types of multicast packets.

3.2. Dual partitioning multicasting

We propose four multicast routing algorithms, North-Last, South-Last, East-Last and West-Last routing, to route the multicast packets. Either the North-Last or the South-Last routing algorithm can
be used to transmit multicast packets whose type bit is 0. Either the East-Last or the West-Last routing algorithm can be applied to route multicast packets with type bit 1. There are in total four combinations of routing algorithms that DPM can apply to route the multicast packets. Each combination cooperatively balances the network link usage, so any of them can be used to route the packets of an on-chip network effectively. In this section, we first present the multicast routing algorithms. Then, the details of the DPM scheme are illustrated.

3.2.1. Routing algorithms in DPM

Here, we illustrate the routing rules of North-Last routing; the rules of West-Last, South-Last and East-Last routing follow analogously from Fig. 6(b), (c) and (d). We use the general case of on-chip network partitioning, where the multicasting source is located in a center tile, to present the North-Last routing scheme. Fig. 6(a) shows the North-Last routing rules, which are as follows (a code sketch of this partitioning is given after Algorithm 2 below):

• If there are multicasting destinations in the NE part, a replica carrying all destinations in the NE part is transmitted to the east output port.
• If there are multicasting destinations in the NW part, a replica carrying all destinations in the NW part is transmitted to the west output port.
• A packet to multicasting destinations in the SE part is transmitted through the south output port iff there are no destinations in the NE and DE parts and at least one destination is located in the SW or DS part. Otherwise, a packet to destinations in the SE part is transmitted through the east output port.
• A packet to multicasting destinations in the SW part is transmitted through the south output port iff there are no destinations in the NW and DW parts and at least one destination is located in the SE or DS part. Otherwise, a packet to destinations in the SW part is transmitted through the west output port.

Besides the routing rules above, in the proposed four algorithms, the traffic to the DN, DW, DS and DE parts is directly transmitted to the north, west, south and east output ports of the router, respectively. According to these routing rules, the length of each path from the multicasting source to each multicasting destination equals the Manhattan distance between the two nodes. In other words, DPM is shortest-path routing, which means there are no backward turns in the routing paths. Shortest-path routing guarantees low packet latency.

Algorithm 2: Dual Partitioning Multicasting
Input: multicast packet M
Output: routing decision for M
1  sc_bv<N> dst_E, dst_W, dst_S, dst_N;   /* N is the number of network tiles. */
2  if M.type_bit = 0 then
3      Using North-Last routing for route computation;
4  else
5      Using West-Last routing for route computation;
6  return dst_E, dst_W, dst_S, dst_N
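The sketch below illustrates, under assumptions and not as the authors' RTL or simulator code, how the North-Last rules above can split a destination bit string into the replica sets dst_E, dst_W, dst_S and dst_N of Algorithm 2. A std::bitset stands in for the sc_bv<N> destination encoding; the K × K row-major tile numbering and the coordinate convention (smaller y is further north) are assumptions of the example. The West-Last partition used for type-bit-1 packets is the symmetric, horizontal-sharing counterpart.

#include <bitset>

constexpr int K = 5;                 // assumed K x K mesh, tiles numbered row by row
constexpr int N = K * K;             // number of network tiles, as in Algorithm 2

struct DstSets { std::bitset<N> e, w, s, n; };   // replica destination sets

DstSets northLastPartition(int src, const std::bitset<N>& dsts) {
    DstSets out;
    const int sx = src % K, sy = src / K;

    // The SE/SW rules depend on which other parts are populated, so scan once first.
    bool neOrDe = false, nwOrDw = false, swOrDs = false, seOrDs = false;
    for (int d = 0; d < N; ++d) {
        if (!dsts[d] || d == src) continue;
        const int dx = d % K, dy = d / K;
        if (dy <= sy && dx > sx) neOrDe = true;
        if (dy <= sy && dx < sx) nwOrDw = true;
        if (dy >  sy && dx <= sx) swOrDs = true;
        if (dy >  sy && dx >= sx) seOrDs = true;
    }

    for (int d = 0; d < N; ++d) {
        if (!dsts[d] || d == src) continue;
        const int dx = d % K, dy = d / K;
        if (dx == sx)                        // DN / DS: straight up or down
            (dy < sy ? out.n : out.s).set(d);
        else if (dy <= sy)                   // NE, DE go east; NW, DW go west
            (dx > sx ? out.e : out.w).set(d);
        else if (dx > sx)                    // SE: south only when no NE/DE traffic
            (!neOrDe && swOrDs ? out.s : out.e).set(d);   // and the south link is shared
        else                                 // SW: symmetric condition
            (!nwOrDw && seOrDs ? out.s : out.w).set(d);
    }
    return out;
}

For the example packet of Fig. 7 (source 12, destinations 2, 5, 10, 14, 19, 22), this assigns destination 2 to dst_N, destinations 5 and 10 to dst_W, destinations 14 and 19 to dst_E and destination 22 to dst_S, as in the walkthrough of Section 3.2.2.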
3.2.2. Dual partitioning multicasting

In the DPM scheme, if IP core S tries to send a multicast packet M to n multicasting destinations (d1, d2, ..., dn), the first step is to check the type bit of M in the directory. The approaches presented in Section 3.1 can be utilized by the directory controller to calculate the packet type in advance. With the packet type information, the packet is sent to the router, and the route computation module applies the routing scheme illustrated in Algorithm 2 to partition the destinations of the packet. Finally, the routing algorithms are responsible for transmitting the packet to each part. The same routing process is repeated at each subsequent node that receives a replica of the multicast packet until the packet arrives at all its destinations.
Fig. 6. Routing algorithms in DPM.
Fig. 7. Dual partitioning multicasting example.
To clarify Algorithm 2, we use the multicast packet shown in Fig. 7 as an example. In Algorithm 2, the North-Last routing algorithm is applied to route packets whose type bit is 0, and the West-Last routing algorithm is used to route packets with type bit 1. Before the IP core in tile 12 sends the packet to the input channel of the router, the type bit of the packet needs to be calculated. As illustrated in Section 3.1, the type bit of the packet shown in Fig. 7 is 0. Therefore, the IP core in tile 12 sends a multicast packet M with type bit 0 and destination address ''0010010000100010000100100'' to the router. (In this work, we apply bit string encoding [8] to encode the multicasting destinations of the packet. In the evaluation, we use the sc_bv data type in SystemC to represent the destination bit string, and the index order in sc_bv is from right to left.) According to Algorithm 2 (line 2), the North-Last routing algorithm is used to route the packet. In Algorithm 2, dst_N, dst_S, dst_W and dst_E, declared in line 1, indicate the resulting destination sets whose traffic is first transmitted along the north, south, west and east output ports, respectively. Multicasting destination 2 is located in the DN part of multicasting source 12, so destination 2 is added to dst_N according to the North-Last routing rules. Because destinations 5 and 10 are in the NW and DW parts of the source, they are added to dst_W. Similarly, following the North-Last routing algorithm, destinations 14 and 19 are added to dst_E and destination 22 is added to dst_S. The route computation process for the multicast packet is shown in Fig. 7.

Initially, dst_N, dst_S, dst_W and dst_E are all empty, and the packet's destination address and packet type are sent to the
router. The router stores the packet in the corresponding buffer upon receiving it. When the router tries to route this packet from the buffer, the route computation module first checks the type bit of the packet and applies the corresponding routing algorithm. After the route computation phase, the original multicasting destinations are split into several parts, as shown in Fig. 7. The router then transmits replicas of the original packet with the new addresses returned from the route computation module. For instance, because the returned dst_N is not empty (it includes destination 2), the router transmits a replica of M with new destination set dst_N to the north output port.

3.3. Unicast-aware mechanism

Section 3.1 presented the multicast packet categorization. As stated in Section 2, a unicast-aware multicasting scheme is beneficial for on-chip network performance. In this work, we propose two unicast-aware mechanisms which make DPM adaptable to a variety of on-chip network traffic patterns.

3.3.1. Multicast-oblivious unicast mechanism

For a unicast packet with a single destination, the routing paths computed by the North-Last and South-Last routing algorithms are static. For example, a unicast packet is always transmitted along the west link first when its destination is in the NW part. Previous work [30] has shown that such a unicast routing scheme can constrain the worst-case throughput. In this paper, we apply the scheme in [21] for unicast packets, where the type bit of each unicast packet is set to 0 or 1 with equal probability. The objective of this equal-probability unicast traffic injection is to balance the traffic generated by unicast packets across the two virtual networks. The unicast traffic injection scheme in [21] is multicast-oblivious.

3.3.2. Multicast-sensitive unicast mechanism

Recent work [19] showed that the on-chip network load caused by real applications is not always uniform. Heterogeneous multicast traffic can lead to imbalanced network traffic between the two virtual networks in DPM, and equal-probability unicast injection cannot mitigate the load imbalance caused by multicast traffic. We propose to equalize the network load between the two virtual networks by balancing, at each source, the packets that are injected into the two virtual networks. We use an 8-bit counter per source to track the difference between the numbers of packets injected into the two virtual networks. Initially, the counter is set to 128 and the equal-probability unicast assignment scheme is applied. When a packet is injected into VN0 and the value of the counter is smaller than 255, the counter is increased by one; a packet injection into VN1 decreases the counter by one if its value is larger than zero. If the value of the counter is within the range [128 − θ, 128 + θ], DPM still applies the equal-probability unicast assignment. If the value of the counter is below 128 − θ, the unicast packet is injected into VN0. Otherwise, the unicast packet is injected into VN1. The unicast assignment here is dynamic and multicast-sensitive, as sketched below.
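A behavioural sketch of this counter (an illustration under assumptions, not the authors' hardware): one saturating 8-bit counter per source, updated on every packet injection, steers unicast packets as described above. A plain rand() call stands in for the LFSR-generated pseudo-random bit mentioned in Section 4.3.1.

#include <cstdint>
#include <cstdlib>

class UnicastVnSelector {
public:
    explicit UnicastVnSelector(int theta = 20) : theta_(theta) {}   // theta = 20 in Section 5

    // Record every packet (unicast or multicast) this source injects.
    void notifyInjected(int vn) {
        if (vn == 0 && counter_ < 255) ++counter_;      // saturating increment for VN0
        if (vn == 1 && counter_ > 0)   --counter_;      // saturating decrement for VN1
    }

    // Virtual network chosen for the next unicast packet.
    int pickUnicastVn() const {
        if (counter_ < 128 - theta_) return 0;          // VN1 has received more traffic
        if (counter_ > 128 + theta_) return 1;          // VN0 has received more traffic
        return std::rand() % 2;                         // balanced: equal probability
    }

private:
    int theta_;
    std::uint8_t counter_ = 128;                        // mid-point of the 8-bit range
};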
Fig. 8. Wormhole router architecture for DPM.

4. DPM implementation

In this section, we present the router architecture applied in the DPM scheme. Subsequently, the deadlock-freedom analysis of DPM routing is illustrated. Finally, the implementation overhead of the DPM scheme is analyzed.

4.1. DPM router architecture

Fig. 8 shows the wormhole-like router architecture for DPM. In this work, we use virtual channel flow control [9], and each input channel in the DPM router is implemented with 2n virtual channels (VCs). To prevent deadlock, two virtual networks are implemented in DPM, as explained in Section 4.2. For this reason, an even number of virtual channels is needed and the minimum value of n is 1. In the proposed DPM scheme, each packet with type bit 0 is stored in a virtual channel with an odd virtual channel identifier (VCID) and each packet with type bit 1 is stored in a virtual channel with an even VCID. The VCs in each router are served in a round-robin manner to guarantee fairness.

There are two major differences between the DPM router and a conventional unicast wormhole router. The first difference is the route computation module design. As shown in Fig. 8, the route computation module in DPM consists of two sub-modules. In accordance with Algorithm 2, the North-Last and West-Last routing algorithms are implemented in the router to demonstrate the route computation process of the proposed scheme. When a VC is served in turn, the multicasting destination address and the type bit of the packet at the head of the buffer are sent to the route computation module. The route computation module checks the type bit stored in the head flit. If the type bit is 0, which means the VCID of the currently served VC is odd, the North-Last route computation sub-module takes charge of the route computation for this packet. Otherwise, the West-Last route computation sub-module takes charge.

The second difference is the switching technique design. Different from conventional wormhole switching, under DPM a flit is deleted from a virtual channel iff the flit has been transmitted to all of its destination output ports. For example, each flit of the packet in Fig. 7 is deleted from the VC iff the flit has been transmitted to all four next-hop nodes. Moreover, the packet replicas are transmitted to the output ports asynchronously, which means that one busy output port cannot block the transmission to the free output ports. This asynchronous transmission design is crucial to prevent deadlock caused by concurrent multicast.
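As an illustration of this switching rule (assumptions only, not the authors' router code), the head flit of a multicast packet can carry a mask of the output ports it still has to reach; replicas leave through whichever of those ports are free in a given cycle, and the flit is retired from the virtual channel only once the mask is empty, so one busy port never blocks the others:

#include <cstdint>

enum Port : std::uint8_t { NORTH = 1 << 0, SOUTH = 1 << 1,
                           EAST  = 1 << 2, WEST  = 1 << 3, LOCAL = 1 << 4 };

struct HeadFlitState {
    std::uint8_t pendingPorts;   // set from dst_N/dst_S/dst_W/dst_E (and local ejection)
};

// Called each cycle with the output ports granted by switch allocation.
// Returns true when the flit may be deleted from its virtual channel.
bool forwardReplicas(HeadFlitState& f, std::uint8_t grantedPorts) {
    const std::uint8_t sendNow = f.pendingPorts & grantedPorts;
    // ... replicas of the flit traverse the crossbar on the ports in sendNow ...
    f.pendingPorts = static_cast<std::uint8_t>(f.pendingPorts & ~sendNow);
    return f.pendingPorts == 0;   // delete only after all destination ports are served
}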
4.2. Deadlock-free analysis

In DPM, packets with different types are routed with different routing algorithms through different virtual networks [7]. Taking the North-Last and West-Last routing combination as an example, packets of type 0 are injected into virtual network 0 and routed by the North-Last routing algorithm; likewise, the West-Last routing algorithm routes packets of type 1 in virtual network 1. Because the two networks are virtually separated, DPM is deadlock-free if and only if the two routing algorithms in the corresponding virtual networks are deadlock-free. The deadlock freedom of the other three routing combinations can be verified in the same way. In addition, we utilize a credit-based back-pressure mechanism to avoid end-to-end deadlock caused by buffer overflow.
Fig. 9. The turns in routing algorithms under DPM.
Fig. 9 shows the potential turns along the routing paths in the proposed four routing algorithms under DPM. The dashed turns are restricted by the corresponding routing algorithms. For the North-Last routing algorithm shown in Fig. 6, traffic to the NW and NE parts can only be first transmitted through the west and east output ports respectively. The north-to-west and north-to-east turns shown in Fig. 9 are therefore prohibited for the North-Last routing algorithm. The restricted turns in the North-Last routing algorithm are specifically designed according to the turn model [14] for deadlock-free routing. For unicast communication, the restricted turns in North-Last routing ensure that ''there are no cycles in the channel dependency graph.'' Based on the deadlock-free adaptive routing theorem [12], North-Last routing is deadlock-free for unicast communication. For multicast communication under DPM, one busy output port cannot block the transmission to the free output ports. Such an asynchronous switching mechanism ensures that ''there are no cycles in the multicast channel dependency graph'' [11]. Therefore, North-Last routing is also deadlock-free for multicast communication. (Note that additional virtual networks are required to avoid protocol-level deadlock; the number of virtual networks is protocol-specific.)

4.3. Implementation overhead analysis

In this section, we analyze the implementation cost of DPM, including the cost of the packet type calculation as well as the routing module.

4.3.1. Overhead of categorizing packets

Under DPM, one pseudo-random number is needed to categorize unicast packets in the multicast-oblivious mechanism illustrated in Section 3.3.1. The pseudo-random number can be generated using a linear feedback shift register [15]. In the multicast-sensitive scenario, one 8-bit counter is needed besides the pseudo-random number generator for categorizing unicast packets. For multicast packet categorization, the RST method is NP-complete [13], which prevents its practical adoption due to the time complexity. By comparison, the time complexity of the two SLC approaches is O(N²), where N is the dimension of the network. In this work, we propose to decouple the multicast packet categorization from the route computation in order to eliminate its impact on routing performance. The details of the decoupled approach are as follows.

Decoupled packet type calculation: For multicore systems, typically with directory coherence protocols, the multicast destinations are stored in the directory entry, which indicates the current sharers of the corresponding cache block. The type of the subsequent multicast packet, which is triggered by the invalidation operation upon a write, can be calculated in advance. The directory entry can be extended by adding one bit to store the type of the multicast packet with the current destinations. With this approach, the latency of calculating the multicast packet type is decoupled from the subsequent routing process. In other words, the subsequent multicasting operation can directly utilize the packet type information stored in the directory entry for routing without additional latency. However, when a write request arrives at the directory controller, the packet type calculation could still be under way. In this case, we propose to use the previous type information stored in the directory entry and terminate the ongoing calculation. Under decoupled packet type calculation, when a new sharer is added to a block, the packet type needs to be re-computed based on the latest destinations; the old type bit is updated with the latest type information when the calculation completes. For intensively shared blocks, the packet type calculation may consume considerable energy along with the frequent updates to the directory entry. One approach to mitigate this overhead is to reduce the update frequency, such as updating the type bit upon the addition of every two sharers.

Fig. 10. DPM routing hardware logic.

4.3.2. Overhead of DPM routing logic

DPM is also a logic-based routing scheme [28,35]. The implementation of DPM routing consists of partitioning logic and route computation logic. The partitioning logic determines the orientation of each multicasting destination with respect to the source by comparing the logical coordinates of the nodes; two comparators are sufficient to calculate the orientation of each destination. The output of the partitioning logic is sent to the route computation logic to determine how to route the packet to each partition. Fig. 10 shows a feasible design for DPM routing which utilizes the North-Last and West-Last routing combination. The implementation of the hardware logic includes 16 inverters, 4 three-input AND gates, 4 four-input AND gates, 6 three-input OR gates and some wires. The overhead of the DPM routing logic is comparable to the routing logic in RPM [35]. Compared with routing-table-based schemes, the overhead of logic-based routing is negligible, especially for future large-scale on-chip networks with big routing tables. Therefore, in terms of area requirement, logic-based routing schemes such as DPM [28,35] are more scalable than conventional routing-table-based schemes [1,18].

5. Experiments and analysis

In this section, we present a detailed evaluation of the DPM scheme with respect to average packet latency, power consumption and network load-balance. We compare the proposed
DPM scheme with RPM [35]. The simulation methodology is first introduced. Then, the basic evaluation results are presented and analyzed. Subsequently, the performance of the SLC schemes and the multicast-sensitive mechanism is analyzed. Finally, we show how our schemes scale with a variety of traffic patterns as well as different network sizes.

5.1. Simulation methodology

We perform both trace-driven and full-system evaluation of the proposed DPM scheme. The simulation methodology is as follows.

5.1.1. Trace-driven simulation methodology

The SystemC-based cycle-accurate simulator NIRGAM [27] is used for our experiments. We collect various statistics, including the number of reads and writes to the router buffers, the total activities at the virtual channels and switch arbiters, the total number of crossbar traversals and the total activities on the network links. With the collected statistics, the DSENT [34] power model is applied to calculate the dynamic and static power consumption of the network. The main parameters of the simulated system are shown in Table 1. We model an 8 × 8 mesh on-chip network with a 4 GHz frequency for the routers and links. We assume that the buffer, crossbar, links, VCs and switch arbiters all have 50% switching activity. We apply a two-stage router pipeline, and the per-hop latency is 3 cycles. The synthetic traffic is generated by mixing different proportions of multicast traffic with Uniform Random, Bit Complement and Transpose unicast traffic.

5.1.2. Full-system simulation methodology

We perform the full-system evaluation using the GEMS framework [26] on top of Simics [24]. Table 2 describes the simulation parameters of the system configuration. We model a 64-core system based on a tiled architecture [36] wherein each core has a private L1 I/D cache and a shared L2 slice. The tiles are connected by an 8 × 8 mesh network modeled with Garnet [2]. Moreover, the on-chip network power is calculated using DSENT [34]. For the evaluation, we use a set of PARSEC [6] workloads whose characteristics are shown in Table 3. The multicast traffic percentage and the average number of multicasting destinations shown in Table 3 are obtained by profiling the workloads during full-system execution. We run all the workloads with 64 threads and simmedium input sets on the simulated multicore system. We skip the initialization and sequential parts of all the workloads and evaluate the whole parallel section of each workload.

5.2. Basic results and analysis

5.2.1. Synthetic traffic

Average packet latency: Fig. 11 presents the average packet latency for the evaluated synthetic traffic. RST, B-SLC and E-SLC in the figures indicate that the packet type calculation in DPM is based on RST, B-SLC and E-SLC respectively. The North-Last and West-Last routing algorithm combination is applied by DPM to route packets in this evaluation. Several observations can be made. First, at low packet injection rates, the evaluated DPM variants have average packet latencies similar to RPM. As shown in Fig. 11, for the synthetic traffic with Uniform Random, Bit Complement and Transpose unicast, the average packet latencies under RPM and the three DPM variants are close to each other when the packet injection rate is below 0.07, 0.04 and 0.04 respectively.
Table 1
Trace-driven evaluation parameters.

Network topology    8 × 8 mesh network
Routing scheme      RPM and DPM
Virtual channel     4 VCs per input channel, VC depth of 8 flits
Packet length       One flit for synthetic traffic
Synthetic traffic   Multicast packets mixed with different types of unicast traffic
Table 2
Full-system evaluation parameters.

Core                 64 in-order cores, 1 GHz, running Solaris 10 OS, 1 thread per core.
L1 cache             32 kB private I/D cache, 4-way, 64 B block, write-back, LRU replacement, 2-cycle latency.
L2 cache             16 MB, 64 banks, 32-way, 64 B block, write-back, LRU replacement, 7-cycle bank access latency.
Memory               200-cycle round-trip memory latency.
Coherence protocol   MESI directory protocol.
Interconnect         8 × 8 mesh topology, 3-cycle per-hop latency (link and router).
Table 3
Real multicasting traffic characteristics.

Workload       Multicast percentage (%)   Average multicasting destinations
blackscholes   0.16                       16.8
bodytrack      0.36                       19.1
facesim        0.29                       21.9
ferret         0.15                       18.3
fluidanimate   0.59                       16.9
freqmine       0.11                       18.8
swaptions      1.60                       7.51
vips           0.42                       16.3
Second, when the packet injection rate becomes high, DPM achieves significantly better scalability in average packet latency than RPM. As shown in Fig. 11(a), when the injection rate rises above 0.07, the performance of RPM degrades significantly compared with the DPM variants. The average packet latency of RST remains around 40 cycles even when the injection rate increases to 0.15. The same trend can be observed for the other two synthetic traffic scenarios. Finally, the evaluated SLC schemes yield average packet latencies similar to RST when the injection rate is not too high. For example, the performance of the three DPM variants is comparable when the injection rate is below 0.12, 0.08 and 0.07 for the three evaluated synthetic traffic scenarios respectively. Moreover, E-SLC yields slightly better average packet latency than the B-SLC scheme, as indicated in Fig. 11.

Power: Fig. 12 presents the normalized power consumption for synthetic traffic. According to Fig. 11, we present the comparison between DPM and RPM for packet injection rates below 0.1. In accordance with the performance trend shown in Fig. 11, under low injection rates the total power of DPM and RPM is very close. As the injection rate increases, the DPM schemes yield increasingly larger power reductions compared with RPM. On average, RST, B-SLC and E-SLC reduce the power consumption by 11.4%, 7.8% and 9.9% respectively compared with RPM.

5.2.2. Real traffic

Average packet latency: Fig. 13 shows the average packet latency for the PARSEC applications.
Fig. 11. Average packet latency for synthetic traffic (10% multicast traffic, average 8 multicasting destinations, all multicasting destinations randomly distributed): (a) Uniform Random unicast; (b) Bit Complement unicast; (c) Transpose unicast.
Fig. 12. Power consumption for synthetic traffic (10% multicast traffic, average 8 multicasting destinations, all multicasting destinations randomly distributed): (a) Uniform Random unicast; (b) Bit Complement unicast; (c) Transpose unicast.
Fig. 13. Average packet latency for PARSEC applications.

Fig. 14. Power consumption for PARSEC applications.

Fig. 15. Execution time for PARSEC applications.
The decoupled packet type calculation is utilized in the full-system evaluation. For all the workloads, the DPM schemes effectively reduce the average packet latency compared with RPM, as depicted in Fig. 13. B-SLC reduces the average packet latency by 3.7–11.4 cycles for the evaluated workloads compared to RPM. By comparison, E-SLC reduces the average packet latency by 7.8–14.1 cycles, and RST reduces it by 10.4–18.8 cycles compared with RPM.

Power: Fig. 14 presents the normalized power consumption for the PARSEC applications. In addition to the power of the on-chip network, the power of the packet type calculation is also evaluated. The power of calculating the packet type is estimated from the number of ALU operations needed for each multicast packet, with the power of an ALU operation obtained using the McPAT power model [20] from HP. As shown in Fig. 14, the two SLC-based DPM schemes, B-SLC and E-SLC, also reduce the total power dissipation compared with RPM: B-SLC reduces the power dissipation by 1.6%–12.5% for the evaluated workloads, and E-SLC by 4.4%–15.7%. Different from the SLC approaches, the RST approach significantly increases the total power dissipation. As shown in Fig. 14, RST increases the total power consumption by 14.2%–54.8% for the evaluated workloads compared with RPM. Due to the complexity of the RST algorithm, the packet type calculation consumes significant power, occupying 37.1% of the total power dissipation on average. This intensive power consumption is the main roadblock to the adoption of the RST approach, even though it yields the best packet latency.

Execution time: Fig. 15 shows the normalized execution time for the PARSEC applications. B-SLC reduces the execution time by 5.9%–10.1% for the evaluated workloads compared to RPM. Moreover, E-SLC further reduces the execution time by 10.2%–14.4%. With the most accurate packet type calculation, RST reduces the execution time by 14%–19.5% compared to RPM.
Table 4
Variance of the average link throughput.

Traffic pattern    Scheme   Variance of link throughput
Uniform Random     RPM      138.5
                   DPM      67.6
Bit Complement     RPM      409.6
                   DPM      197.5
Transpose          RPM      244.4
                   DPM      141.7
As illustrated above, except for the high power consumption of the RST-based packet type calculation, the DPM schemes outperform RPM for all the evaluated workloads. In the following, we analyze the performance of the DPM schemes.

5.2.3. Performance analysis

In this section, we present the main contributors to the better performance and scalability of the DPM scheme compared with RPM. The first contributor is the more balanced link usage yielded by DPM. We calculated the average throughput of each horizontal and vertical link under both the RPM and DPM schemes for the Bit Complement synthetic traffic; DPM here is based on the B-SLC method. As shown in Fig. 16, the maximum link throughput in RPM is more than 60 Gbps while the maximum link throughput in DPM is below 60 Gbps. As indicated in Fig. 16(a), the average throughput of many links under RPM is below 10 Gbps. Compared with RPM, the network load under DPM is more balanced, as shown in Fig. 16(b). Table 4 presents the variance of the average link throughput for the evaluated schemes. The data in Table 4 quantitatively demonstrate that DPM obtains more balanced network resource usage, which translates into better on-chip network performance.

The second contributor is the reduced network link usage. We profiled the on-chip network link usage of the multicast packets in all the synthetic traffic and of the real multicast packets from the PARSEC [6] applications. Fig. 17 shows the link usage reduction of the evaluated DPM variants compared to RPM. For the synthetic multicast packets, the link usage is reduced by 31.4%, 22.7% and 26.5% under RST, B-SLC and E-SLC compared to RPM. The link usage reduction is 24%, 18.6% and 20.3% respectively for the real multicast packets, whose average numbers of multicasting destinations are presented in Table 3.

The last contributor to the performance of DPM is its unicast-aware mechanism. Unicast routing in RPM is static, as shown in Fig. 1. Previous work [30] has shown that static unicast routing like that of RPM suffers from poor throughput compared with adaptive schemes. By comparison, under DPM the unicast packets can utilize the available output ports with equal probability, which obtains good average-case throughput and near-optimal worst-case throughput.
Fig. 16. Average throughput of the network links in different dimensions and directions (packet injection rate: 0.05, 10% multicast traffic, average 8 multicasting destinations, all multicasting destinations randomly distributed): (a) RPM; (b) DPM.
Fig. 17. On-chip network link usage reduction compared to RPM.
In the evaluation above, the unicast scheme is multicast-oblivious. In the following, we show that the performance of DPM can be further improved through the cooperation of unicast and multicast routing. For real applications, such as group cache coherence [37], the multicast packets can be disproportionately categorized into the two categories and then injected into the corresponding virtual networks. Unbalanced traffic load between the two networks is detrimental to overall network performance. Fig. 18 shows the performance comparison of basic DPM and multicast-sensitive DPM under non-uniform synthetic traffic and real traffic. The non-uniform synthetic traffic is generated as follows: each multicast source repeatedly sends multicast packets to a set of randomly generated destinations, with the repeat count uniformly distributed in the range [1, 15]; when the repeat count is reached, the destinations and repeat count of the multicast packet are regenerated. The number of multicasting destinations is uniformly distributed within [4, 32] and the packet injection rate is uniformly distributed in the range [0.03, 0.07]. The E-SLC algorithm is applied by both DPM schemes here, and the parameter θ in multicast-sensitive DPM is set to 20. As shown in Fig. 18, E-SLC with the multicast-sensitive method significantly reduces the average packet latency compared with the basic multicast-oblivious E-SLC. For the non-uniform synthetic traffic, multicast-sensitive E-SLC reduces the packet latency by 31.7% by dynamically steering the injection of unicast traffic into the two networks according to the multicast traffic variations. For the evaluated real traffic, multicast-sensitive E-SLC yields a 5.5% average packet latency reduction on average compared with multicast-oblivious E-SLC. Therefore, the multicast-sensitive design is a scalable solution for general-purpose on-chip networks.

Fig. 18. Unicast-aware performance evaluation.

5.3. SLC performance analysis

As shown in the previous subsection, B-SLC achieves poorer performance and scalability than RST. This is because B-SLC does not always compute the packet type correctly, while the RST method categorizes the multicast packets correctly. Incorrect packet categorization causes more packets to be injected into the on-chip network, which occupies more network resources and consumes more energy. Specifically, as the injection rate increases, the number of packets that are incorrectly categorized by B-SLC in a given time interval increases correspondingly, which leads to the gradually widening performance gap between B-SLC and RST. In this work, we propose an enhanced SLC approach, called E-SLC, which achieves more accurate multicast packet categorization than B-SLC. Fig. 19 shows the fault rate of categorizing all the multicast packets in the synthetic traffic using B-SLC and E-SLC compared with RST. E-SLC reduces the fault rate by 38.3% on average compared with B-SLC. As indicated in Figs. 11–14 and Table 4, for all of the evaluated scenarios, E-SLC achieves better performance than B-SLC. Even though the performance of E-SLC may degrade at high packet injection rates, its low complexity still makes it a scalable scheme.
Fig. 19. Fault-rate of packet type calculation.
Fig. 20. Scalability on percentage of multicast traffic (packet injection rate: 0.05, average 8 multicasting destinations, all the multicasting destinations are randomly distributed).
5.4. Scalability analysis

In this section, we first analyze the scalability of DPM under different traffic patterns with varied percentages of multicast traffic and numbers of multicasting destinations. Subsequently, the scalability of the E-SLC scheme and the unicast-aware mechanism is considered. Finally, we present the scalability analysis of the DPM scheme under different network sizes.

5.4.1. Scalability on traffic patterns

Fig. 20 presents the average packet latency for the evaluated schemes with different proportions of multicast traffic. As shown in Fig. 20, the DPM schemes with the RST and E-SLC approaches exhibit better performance scalability than RPM. As the multicast traffic percentage increases, the biased link usage in RPM degrades its performance, and the degradation can be even sharper at higher packet injection rates [21]. By comparison, DPM utilizes the network links harmoniously for packet routing. Such balanced link usage leads to steady performance scalability for DPM as the network load increases. In addition, the RPM routing rules illustrated in Section 2 can exacerbate packet injection blocking at the source node, which directly increases packet transmission latency.

Fig. 21 shows the average packet latency for traffic with different numbers of multicasting destinations. As shown in Fig. 21, both the RPM and DPM schemes yield good scalability. DPM can effectively and uniformly utilize the network resources, which is the main cause of its performance scalability. For multicast packets with large numbers of multicasting destinations, many packets exhibit the west-sharing and east-sharing patterns demonstrated in Fig. 2. For this kind of packet, RPM will partition the destinations and inject more packets with fewer multicasting destinations, which quickly become unicast packets.
Fig. 21. Scalability on number of multicast destinations (packet injection rate: 0.05, 10% multicast traffic, all the multicasting destinations are randomly distributed).
Fig. 22. Scalability of different on-chip network size.
The transformed unicast packets further increase the total network load and the load imbalance. This is the key reason for the slightly worse scalability of RPM compared with DPM.

5.4.2. Scalability on network size

We conduct a set of experiments with different network size configurations using the E-SLC scheme and RPM. For each evaluated configuration, the percentage of multicast packets in the total traffic is 10%, and the average number of multicasting destinations is one quarter of the network size. For example, the average number of multicasting destinations is 16 for an 8 × 8 network, which is close to the statistics of real applications according to our profiled data shown in Table 3. Moreover, each IP core in each evaluated configuration injects 40 000 single-flit packets, and the packet injection rate is set to 0.05 flits/cycle/core in all configurations. Fig. 22 shows the evaluation results in terms of average packet latency. For small-scale networks, such as 4 × 4 and 6 × 6, E-SLC and RPM have similar performance. However, as the network size increases, E-SLC begins to gradually outperform RPM, as shown in Fig. 22. For a 16 × 16 network, the average packet latency of E-SLC is around 80 cycles while it is around 110 cycles for RPM. On the whole, both RPM and E-SLC yield good scalability when the packet injection rate is low. As analyzed in the previous sections, the DPM schemes, including E-SLC, significantly outperform RPM under high injection rates.

6. Conclusion

In this paper, we propose a novel multicasting mechanism called DPM for high-performance on-chip networks. The proposed
6. Conclusion

In this paper, we propose a novel multicasting mechanism called DPM for high-performance on-chip networks. The proposed DPM scheme substantially reduces the average packet latency of on-chip networks compared with RPM [35]. In addition, DPM reduces power consumption by minimizing link usage, and it adaptively balances the network load through a multicast-sensitive unicast routing mechanism. Extensive simulation results on both synthetic and real traffic show that DPM effectively reduces average packet latency and energy dissipation compared with previous work. These characteristics make DPM highly scalable for on-chip networks with heavy traffic load and a variety of traffic patterns.

Acknowledgments

This work was partially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 123811 and 123210] and the National Natural Science Foundation of China (No. 61073038). This is an expanded version of a paper presented at the International Conference on Parallel and Distributed Systems (IEEE ICPADS) 2010.

References

[1] P. Abad, V. Puente, J.-A. Gregorio, MRR: Enabling fully adaptive multicast routing for CMP interconnection networks, in: Proceedings of the 15th International Symposium on High Performance Computer Architecture, HPCA'09, 2009, pp. 355–366.
[2] N. Agarwal, T. Krishna, L.-S. Peh, N. Jha, GARNET: a detailed on-chip network model inside a full-system simulator, in: IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2009, 2009, pp. 33–42.
[3] M. Alexander, G. Robins, New performance-driven FPGA routing algorithms, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 15 (12) (1996) 562–567.
[4] M. Bedford Taylor, W. Lee, S. Amarasinghe, A. Agarwal, Scalar operand networks: on-chip interconnect for ILP in partitioned architectures, in: Proceedings of the 9th International Symposium on High-Performance Computer Architecture, HPCA'03, 2003, pp. 341–353.
[5] L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm, Computer 35 (1) (2002) 70–78.
[6] C. Bienia, S. Kumar, J.P. Singh, K. Li, The PARSEC benchmark suite: characterization and architectural implications, in: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT'08, 2008, pp. 72–81.
[7] M. Chaudhuri, M. Heinrich, Exploring virtual network selection algorithms in DSM cache coherence protocols, IEEE Trans. Parallel Distrib. Syst. 15 (8) (2004) 699–712.
[8] C.-M. Chiang, L.M. Ni, Multi-address encoding for multicast, in: Proceedings of the First International Workshop on Parallel Computer Routing and Communication, 1994, pp. 146–160.
[9] W. Dally, Virtual-channel flow control, IEEE Trans. Parallel Distrib. Syst. 3 (2) (1992) 194–205.
[10] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., 2003.
[11] J. Duato, A new theory of deadlock-free adaptive multicast routing in wormhole networks, in: Proceedings of the 5th IEEE Symposium on Parallel and Distributed Processing, 1993, pp. 64–71.
[12] J. Duato, A new theory of deadlock-free adaptive routing in wormhole networks, IEEE Trans. Parallel Distrib. Syst. 4 (1993) 1320–1331.
[13] M.R. Garey, D.S. Johnson, The rectilinear Steiner tree problem is NP-complete, SIAM J. Appl. Math. 32 (4) (1977) 826–834.
[14] C. Glass, L. Ni, The turn model for adaptive routing, in: Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA'92, 1992, pp. 278–287.
[15] S. Golomb, Shift Register Sequences, Aegean Park Press, 1982.
[16] Intel, From a few cores to many: a tera-scale computing research overview, 2006. http://download.intel.com/research/platform/terascale/terascale_overview_paper.pdf.
[17] Intel, Single-chip cloud computer, 2009. http://techresearch.intel.com/spaw2/uploads/files/SCC_Platform_Overview.pdf.
[18] N.E. Jerger, L.-S. Peh, M. Lipasti, Virtual circuit tree multicasting: a case for on-chip hardware multicast support, in: Proceedings of the 35th International Symposium on Computer Architecture, ISCA'08, 2008, pp. 229–240.
[19] A. Kahng, B. Lin, K. Samadi, R. Ramanujam, Trace-driven optimization of networks-on-chip configurations, in: 47th ACM/IEEE Design Automation Conference, DAC'10, 2010, pp. 437–442.
[20] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen, N.P. Jouppi, McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures, in: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, 2009, pp. 469–480.
[21] J. Li, C.J. Xue, Y. Xu, LADPM: latency-aware dual-partition multicast routing for mesh-based network-on-chips, in: 16th International Conference on Parallel and Distributed Systems, ICPADS'10, 2010, pp. 423–430.
[22] Z. Lu, B. Yin, A. Jantsch, Connection-oriented multicasting in wormhole-switched networks on chip, in: IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, 2006, pp. 205–210.
[23] S. Ma, N.E. Jerger, Z. Wang, Supporting efficient collective communication in NoCs, in: Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA'12, 2012, pp. 1–12.
[24] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, B. Werner, Simics: a full system simulation platform, Computer 35 (2) (2002) 50–58.
[25] M. Martin, M. Hill, D. Wood, Token coherence: decoupling performance and correctness, in: Proceedings of the 30th Annual International Symposium on Computer Architecture, ISCA'03, 2003, pp. 182–193.
[26] M.M.K. Martin, D.J. Sorin, B.M. Beckmann, M.R. Marty, M. Xu, A.R. Alameldeen, K.E. Moore, M.D. Hill, D.A. Wood, Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset, SIGARCH Comput. Archit. News 33 (2005) 92–99.
[27] NIRGAM, A simulator for NoC interconnect routing and application modeling, 2007. http://nirgam.ecs.soton.ac.uk/.
[28] S. Rodrigo, J. Flich, J. Duato, M. Hummel, Efficient unicast and multicast support for CMPs, in: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO'08, 2008, pp. 364–375.
[29] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S.W. Keckler, C.R. Moore, Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture, SIGARCH Comput. Archit. News 23 (6) (2003) 46–51.
[30] D. Seo, A. Ali, W.-T. Lim, N. Rafique, M. Thottethodi, Near-optimal worst-case throughput routing for two-dimensional mesh networks, in: Proceedings of the 32nd Annual International Symposium on Computer Architecture, ISCA'05, 2005, pp. 432–443.
[31] D. Sorin, M. Plakal, A. Condon, M. Hill, M. Martin, D. Wood, Specifying and verifying a broadcast and a multicast snooping cache coherence protocol, IEEE Trans. Parallel Distrib. Syst. 13 (6) (2002) 556–578.
[32] K. Strauss, X. Shen, J. Torrellas, Uncorq: unconstrained snoop request delivery in embedded-ring multiprocessors, in: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO'07, 2007, pp. 327–342.
[33] H. Sullivan, T.R. Bashkow, A large scale, homogeneous, fully distributed parallel machine, I, SIGARCH Comput. Archit. News 5 (1977) 105–117.
[34] C. Sun, C.-H. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L.-S. Peh, V. Stojanovic, DSENT: a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling, in: Sixth IEEE/ACM International Symposium on Networks on Chip, 2012, pp. 201–210.
[35] L. Wang, Y. Jin, H. Kim, E.J. Kim, Recursive partitioning multicast: a bandwidth-efficient routing for networks-on-chip, in: The 3rd ACM/IEEE International Symposium on Networks-on-Chip, NOCS'09, 2009, pp. 64–73.
[36] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C.-C. Miao, J. Brown, A. Agarwal, On-chip interconnection architecture of the tile processor, IEEE Micro 27 (5) (2007) 15–31.
[37] W. Zuo, S. Feng, Z. Qi, J. Weixing, L. Jiaxin, D. Ning, X. Licheng, T. Yuan, Q. Baojun, Group-caching for NoC based multicore cache coherent systems, in: Proceedings of the Conference on Design, Automation and Test in Europe, DATE'09, 2009, pp. 755–760.
Jianhua Li received the B.S. degree in computer science from Anqing Teachers' College, Anqing, Anhui, PR China, in 2007. He is currently pursuing the Ph.D. degree at the Department of Computer Science and Technology, University of Science and Technology of China, Hefei, PR China. His research interests include on-chip networks, memory systems and emerging non-volatile memory technology.
Liang Shi received the B.S. degree in computer science from Xi'an University of Post & Telecommunication, Xi'an, Shaanxi, China, in 2008, and the Ph.D. degree from the University of Science and Technology of China in 2013. He is currently a lecturer with the School of Computer Science at Chongqing University, PR China. His research interests include embedded systems and emerging non-volatile memory technology.
Chun Jason Xue received the B.S. degree in computer science and engineering from the University of Texas at Arlington in May 1997, and the M.S. and Ph.D. degrees in computer science from the University of Texas at Dallas, in December 2002 and May 2007, respectively. He is now an Assistant Professor in the Department of Computer Science at the City University of Hong Kong. His research interests include memory and parallelism optimization for embedded systems, software/hardware co-design, real-time systems and computer security.
Yinlong Xu received his B.S. in mathematics from Peking University in 1983, and the M.S. and Ph.D. degrees in computer science from the University of Science and Technology of China (USTC) in 1989 and 2004, respectively. He is currently a professor with the School of Computer Science and Technology at USTC. Prior to that, he served the Department of Computer Science and Technology at USTC as an assistant professor, a lecturer, and an associate professor. He currently leads a group of research students working on networking and high-performance storage research. His research interests include network coding, wireless networks, storage systems, etc. He received the Excellent Ph.D. Advisor Award of the Chinese Academy of Sciences in 2006.