BiLink: A high performance NoC router architecture using bi-directional link with double data rate


INTEGRATION, the VLSI journal

Contents lists available at ScienceDirect

INTEGRATION, the VLSI journal journal homepage: www.elsevier.com/locate/vlsi

Jingyang Zhu (a), Zhiliang Qian (b), Chi-Ying Tsui (a)

(a) Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong
(b) Department of Micro- and Nano- Electronics, Shanghai Jiao Tong University, Shanghai, China

Article info

Article history: Received 8 January 2015; Received in revised form 22 February 2016; Accepted 22 February 2016

Keywords: Network-on-Chip (NoC); Bi-directional link; Double data rate

Abstract

This paper presents a novel high performance Network-on-Chip (NoC) router architecture design using a bi-directional link with double data rate (BiLink). Ideally, it can provide up to a 2 times speed-up compared with the conventional NoC router. BiLink uses an extra link stage between routers and, when both routers need the link, transmits two flits over one link per cycle using phase pipelining. To further increase the effective bandwidth, the direction of each link can be configured in every clock cycle to cater for different traffic loads from each side. Therefore, the data rate can be as high as 4 times that of conventional NoC routers under uneven traffic. A centralized mode control scheme is implemented using a finite state machine (FSM) approach. Cycle-accurate simulations are carried out on both synthetic traffic patterns and real application benchmarks. Simulation results show that BiLink can provide up to 90% and 250% speedup compared with conventional NoC routers for even and uneven traffic, respectively. With virtual channel flow control, 2X and 3X gains in throughput are obtained under even and uneven traffic, respectively, when compared with the conventional NoC router. The BiLink router architecture is synthesized using TSMC 65 nm process technology. The results show an area overhead of 28% over a state-of-the-art bi-directional NoC, while the critical path is about 9% longer than that of the conventional routers. Despite the overhead in critical path and power consumption, BiLink achieves a 47.45% improvement in Energy-Delay-Product (EDP) under high injection rate traffic.

© 2016 Elsevier B.V. All rights reserved.

1. Introduction

Network-on-Chip (NoC) has become a promising approach to solve the communication bottleneck in modern many-core systems-on-chip. With the potential deployment of many-core systems in new applications such as big data, artificial intelligence and deep machine learning, the NoC router must transfer a larger amount of communication data among processors. For example, the Google Brain project [14,8] uses 1000 machines to train a deep neural network. Each machine contains 16 cores, and a subset of the neural network is mapped onto each of them [8]. The data bandwidth requirement is high and uneven due to the interleaving of the feed-forward and backpropagation training phases. To address the intensive bandwidth requirement of these applications, a higher throughput NoC router architecture is essential for the next generation of many-core systems.

As the traffic is usually unevenly distributed across the network [13], self-reconfigurable router architectures have been proposed [13,6,20,2] to improve the NoC performance by adapting the direction of the links to the run time traffic conditions. A bidirectional NoC (BiNoC) router architecture was introduced in [13,6] to cater for uneven traffic patterns. However, most of the existing work on reconfigurable NoC architectures has focused on optimizing the design of the router itself. The optimization of the interconnection between two neighboring routers is rarely addressed. On the other hand, in the domain of communications, the introduction of network coding [1] provides an optimized way to use the channel bandwidth and achieves a significant improvement in the system throughput. Borrowing the concept of network coding, in [5], an extra coding unit was inserted between each pair of routers to enable simultaneous data transmission from both ends over a single physical channel. In this work, to address the high bandwidth requirement of the next generation NoC architecture, we propose BiLink, a new NoC router architecture using bidirectional double data rate links.

http://dx.doi.org/10.1016/j.vlsi.2016.02.006
0167-9260/© 2016 Elsevier B.V. All rights reserved.

Please cite this article as: J. Zhu, et al., BiLink: A high performance NoC router architecture using bi-directional link with double data rate, INTEGRATION, the VLSI journal (2016), http://dx.doi.org/10.1016/j.vlsi.2016.02.006


More specifically, in BiLink, a customized link stage is designed to transmit two flits over one physical channel in each cycle in a phase pipelined fashion. To further increase the effective bandwidth, the direction of each link can be configured to cater for the uneven distribution of the traffic loads. A centralized controller is implemented using an FSM to dynamically determine the operation mode to support BiLink transmission. In this way, data are transmitted on both clock edges to maximize the potential throughput of the NoC router, leading to a better solution for future data-intensive applications. Cycle accurate simulations were run to verify the performance improvement of the proposed BiLink architecture. Simulation results show that the proposed BiLink architecture can achieve 90% and 60% improvements in the saturation injection rate compared to Bi-directional (BiNoC) router architectures [13] for even and uneven traffic distributions, respectively. Furthermore, BiLink also has a 250% improvement over the conventional NoC router for the uneven traffic distribution. In summary, this work makes the following contributions:

• We combine the idea of self-reconfigurable router structures with a double data rate link for NoC and achieve a significant performance improvement through this joint optimization.
• We implement the proposed BiLink structure to verify the performance as well as the hardware overhead.
• We propose three variants of the BiLink architecture and perform a thorough analysis of the performance and implementation tradeoffs of these structures.

The remainder of the paper is organized as follows. In Section 2, we discuss the basic idea of the normal double data rate bidirectional link (BiLink) and analyze its timing issues. In Section 3, a self-reconfigurable direction control scheme, namely the aggressive bidirectional link (A-BiLink), is proposed. In Section 4, the detailed hardware implementation of BiLink and A-BiLink is addressed. In addition, a new variant of A-BiLink which is more suitable for hardware implementation is presented in that section. Simulation and hardware synthesis results are shown in Section 5 and the related work is discussed in Section 6. Finally, Section 7 concludes this work.

2. Bidirectional link stage

To understand the basic principle behind the bidirectional link (BiLink), we first discuss the data flow in BiLink. The related timing issues are then analyzed to show that BiLink works properly under different timing constraints.

2.1. Motivation for exploring BiLink

In both uni-directional and bi-directional NoCs, a data transfer occupies the entire clock cycle. In this work, we propose to further improve the throughput by allowing the transmission of two flits over the channel in every cycle. More specifically, we use both phases of the clock to transmit two different flits. In the first phase of the clock cycle, the routers at both ends of the link send their flits to the link module in the middle of the link (shown in Fig. 1(a)). Then, in the next phase, the link module sends the two flits to the corresponding destination routers (shown in Fig. 1(b)). Compared to the conventional transfer mode, the transfer data rate is doubled using the proposed BiLink scheme, which can transfer up to four flits between routers R1 and R2 in every clock cycle. The main function of the intermediate link module is to isolate the flits coming from the routers at the two ends. In the link stage, two D flip-flops (DFFs) are required to store the data received from each side during the first half cycle. Moreover, two switches are used to control the direction of the data flow, in order to avoid overwriting the data already stored in the registers. Fig. 2(a) shows the hardware implementation of the link stage. When the clock phase is high, the switches S1 and S2 are open and the flits transmitted from both sides are stored into the two DFFs in the link stage, respectively. Then, in the second phase of the clock cycle, S1 and S2 are closed and the two DFFs transmit the stored flits to the corresponding destinations. On the router side, the output stage of each router has a similar structure to synchronize with the link stage. It sends flits in the first half clock cycle and receives flits in the second half, as shown in Fig. 2(b).

2.2. Analysis of the timing constraints for BiLink

With the insertion of the link module, we need to analyze the impact on the timing of the overall system under reasonable clock skew and jitter assumptions. First we investigate whether the insertion of the link stage affects the achievable clock frequency of the system. The datapath of a router consists of two parts: the inner pipeline stage and the link transfer stage.
As will be shown in the results in Section 5, the critical path of the inner pipeline stage of the router for the BiLink architecture is similar to that of the BiNoC. For the link transfer, the insertion of the link module does not cause extra delay. If the long wire delay of the link transfer is the critical path of the design, adding a link module in the middle breaks the long wire in half. Therefore the total delay of driving the long wire decreases, and the overall critical path, which includes the clock-to-Q delay and the setup time of the DFF inserted in the link module, is shortened. We designed and laid out the link stage and the router's output stage in a TSMC 65 nm process, and used them to drive wires of different lengths. We simulated the performance of the overall link transfer using HSPICE under a clock skew of 10% of the clock period [19]. The results show that the wire with a link stage always has a shorter critical path than the wire without a link stage. The hold time constraint of the link stage must also be satisfied. The hold time of the DFF in the link module for the datapath through the wire is easily satisfied because of the large delay of the long wire, even under 10% positive clock skew. For the hold time requirement of the inner loop within the link module, as shown in Fig. 2(a), the following timing condition needs to hold:

$$t_{clkQ} + t_{delay} \geq t_{hold} \qquad (1)$$

Fig. 1. Data transfer mode for BiLink.


Fig. 2. Hardware implementation of BiLink. (a) Link stage. (b) Router's output.

Fig. 3. Aggressive transmission mode for BiLink (A-BiLink).

where $t_{clkQ}$ and $t_{hold}$ are the clock-to-Q delay and the hold time of the DFF, respectively, and $t_{delay}$ is the delay of the switch. Since the two DFFs are placed close to each other, we can assume the clock skew between them is negligible. According to the cell library information of the TSMC 65 nm technology, the intrinsic delay of a DFF together with the delay of a switch is already much larger than the hold time of a DFF. Therefore Eq. (1) is easily met. The same result can be obtained for the inner loop within the router's output stage shown in Fig. 2(b).
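The two-phase data flow of Section 2.1 can be illustrated with a small behavioural model. This is only a sketch under simplifying assumptions: the class and function names are hypothetical, flits are plain strings, and switches, clocking and timing are abstracted into two method calls per cycle.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LinkStage:
    """Toy model of the link stage: one register per incoming direction."""
    reg_from_r1: Optional[str] = None  # flit latched from router R1
    reg_from_r2: Optional[str] = None  # flit latched from router R2

    def first_phase(self, flit_r1: Optional[str], flit_r2: Optional[str]) -> None:
        """First clock phase: both routers drive the link, the stage stores both flits."""
        self.reg_from_r1 = flit_r1
        self.reg_from_r2 = flit_r2

    def second_phase(self) -> Tuple[Optional[str], Optional[str]]:
        """Second clock phase: each stored flit is forwarded to the opposite router."""
        to_r1, to_r2 = self.reg_from_r2, self.reg_from_r1
        self.reg_from_r1 = self.reg_from_r2 = None
        return to_r1, to_r2


def transfer_one_cycle(link: LinkStage, flit_r1, flit_r2):
    """One full clock cycle over a single link: two flits cross, one in each direction."""
    link.first_phase(flit_r1, flit_r2)
    return link.second_phase()


link = LinkStage()
print(transfer_one_cycle(link, "A0", "B0"))  # ('B0', 'A0')
```

Each call to `transfer_one_cycle` moves one flit in each direction over a single physical link, which is where the doubled data rate of the normal BiLink mode comes from.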

3. Self-reconfigurable BiLink

As discussed in [13], most of the traffic patterns in real applications are strongly uneven. Hence, although BiLink offers at most a 2X data rate compared with the conventional router, such an improvement will not be achieved in practice since most of the time only one router has flits to send over the channels. Therefore, we further modify the link stage presented in Section 2 to add the flexibility of configuring the direction of each link at run time based on traffic conditions. The proposed self-configurable BiLink, named the Aggressive Bidirectional Link (A-BiLink), is shown in Fig. 3. In A-BiLink, we define two transmission modes between two neighboring routers, namely the normal mode and the aggressive mode. In the normal mode, the link stage is configured as a normal BiLink, and flits from both sides are transported. If only one side needs to access the link stage, the link stage is configured into the aggressive transmission mode, in which 2 flits are transmitted consecutively in one direction, one in the first and one in the second half of the clock cycle. In this way, under an uneven traffic pattern where the other link direction can be reversed most of the time, A-BiLink can transmit at most 4 flits from one router to the other in every clock cycle. Thus, 2X and 4X data rate improvements are achieved when compared to BiNoC and conventional NoC architectures, respectively. The direction controller is implemented by a finite state machine (FSM) for each pair of routers. At run time, each router sends a notification signal, i.e., a channel request, to the FSM. The FSM then configures the interconnect into different transmission modes according to the traffic requests from both ends. In the aggressive mode, the router has to pump out data at both the positive and negative edges of the clock. One implementation is to deploy a double edge triggered flip-flop (DET)

Fig. 4. Interconnection between each pair of routers in a 4x4 mesh topology.

similar to that described in [17]. However, DET will result in a large area and power overhead due to the large buffer size of the virtual channels. Therefore, in the next section, we present a Pseudo Aggressive Bidirectional Link (PA-BiLink), which implements the idea of A-BiLink with reasonable hardware cost.
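The peak one-way data rates quoted in this section can be checked with a few lines of arithmetic. The sketch below assumes two physical links between each pair of neighboring routers (one per direction in the conventional router), which is consistent with the flit counts given in the text; the table layout and names are illustrative only.

```python
# Peak flits per clock cycle from R1 to R2, per architecture.
# Each entry is (links usable toward R2, flits per link per cycle toward R2).
# Assumption: two physical links connect each pair of neighboring routers.
PEAK = {
    "conventional": (1, 1),         # one fixed link per direction, one flit per cycle
    "BiNoC": (2, 1),                # both links can be turned toward R2
    "BiLink normal": (2, 1),        # each link carries one flit each way per cycle
    "A-BiLink aggressive": (2, 2),  # both links, both clock phases toward R2
}


def peak_one_way(arch: str) -> int:
    """Maximum number of flits R1 can send to R2 in one clock cycle."""
    links, flits_per_link = PEAK[arch]
    return links * flits_per_link


baseline = peak_one_way("conventional")
for arch in PEAK:
    rate = peak_one_way(arch)
    print(f"{arch:20s} {rate} flits/cycle ({rate // baseline}X conventional)")
```

Under these assumptions the aggressive mode reaches 4 flits per cycle in one direction, that is, 2X the BiNoC peak and 4X the conventional peak, matching the figures stated above.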

4. BiLink hardware implementation

The BiLink architecture consists of two main parts, the router implementation and the link module design. A traditional virtual channel (VC) flow-control router consists of 5 pipeline stages: route computation (RC), virtual channel allocation (VA), switch allocation (SA), switch traversal (ST) and link traversal (LT) [7]. In the BiLink implementation, most of the stages in the router's pipeline are similar to those of the conventional router and only the SA, ST and LT stages are modified. The connection between two neighboring routers is shown in Fig. 4. The channel request is sent by a router to the direction control FSM to indicate that there are traffic flows that need to use the


current channel. Upon receiving the channel request signal, the control FSM sends the appropriate mode control signal to both the routers and the link module. Based on the mode control decision, the routers and the link module are configured into the corresponding directions to make full use of the channel bandwidth. In this section, we present the hardware implementation of each datapath component in BiLink.

4.1. Input and output buffer

Since A-BiLink receives or transmits at most four flits every cycle, double edge triggered flip-flops are required for its input and output stages. However, double edge triggered flip-flops such as [17] introduce additional area and power overhead. Since the input FIFOs contribute about 60–80% of the total area of a typical VC-based router [13], this additional overhead would increase the area and power considerably. Therefore, instead of using DETs, we make a slight modification to the A-BiLink scheme and use conventional single edge triggered flip-flops for the FIFOs in each virtual channel. First, we assume that a virtual channel based allocation scheme is used for the A-BiLink architecture. In each cycle, only flits from different virtual channels are switch allocated and sent out by the output buffer. As a result, although A-BiLink may receive at most four flits in each cycle, they come from different input virtual channels in the current router and are forwarded to different input virtual channels in the downstream receiver. Based on this observation, a pipeline stage is added to store the flits received at the positive edge and the negative edge temporarily in two DFFs at different phases, as shown in Fig. 5.

Fig. 5. Pseudo aggressive BiLink.

In Fig. 5, we show a new variant of A-BiLink, PA-BiLink. Compared with A-BiLink, it has an extra pipeline stage, which converts the double edge transfer to a single edge transfer (i.e., D2S). As shown in Fig. 5(b), it uses two DFFs to sample at both the positive and negative edges. Then, the data is forwarded to the corresponding VCs. Since each flit arriving in the same clock cycle goes to a different destination, it is safe to push them into the FIFOs without data clash. A small crossbar is used to connect the input DFFs with the VC buffers. The extra pipeline stage added in PA-BiLink results in some performance degradation. More precisely, the zero load latency for PA-BiLink, following the analysis in [7], is equal to:

$$T_0 = H_{min}\, t_r + t_w + \frac{L}{b} \qquad (2)$$

where $H_{min}$ is the average minimum hop count of the network, $t_r$ is the time delay through a single router, $t_w$ stands for the average time of flight and $L/b$ represents the serialization latency of a packet. In PA-BiLink, only $t_r$ differs from A-BiLink. It is increased from 4 to 5 and therefore the network latency is increased by $H_{min}$. In the D2S stage, a packet that is sampled at the negative clock edge has only half a clock cycle to reach the VC buffer and get sampled. This may pose a timing issue when the clock frequency is high. However, the delay between DFF2 and the VC buffer shown in Fig. 5 is mainly due to the delay of the DFF and the XBAR between the two buffers. The size of the XBAR is 4 × v, where v is the number of VCs. Its delay is that of one 4-to-1 MUX, which is smaller than 50% of that of the 10 × 10 XBAR in the ST stage, and hence it will not be the critical path of the router. A similar technique can be used for the output buffer implementation. Rather than using a DET to transmit flits at both edges of the clock, we use two DFFs to implement the output buffer as shown in Fig. 5(c).

4.2. Route computation and virtual channel allocator

The RC and VA modules have the same structures as those of the typical NoC router. In order to avoid potential deadlock or livelock problems, deterministic XY routing is used in this work. The VC allocator is implemented as a separable allocator which consists of two stages to resolve the input and output request contentions, respectively (shown in Fig. 6(a)). In order to dynamically change the channel direction based on the traffic load, a dedicated channel request counter is deployed for each direction. When a new packet enters the RC stage, it increases the downlink request counter by 1. When a packet is ready to leave the router, i.e., the tail flit finishes the SA stage, it decreases the downlink request counter by 1.
When the channel request counter is greater than 0, the router sends a request signal to the control FSM in Fig. 4, indicating that it requests to use the current channel.

4.3. Switch allocator

In this work, in order to support transferring up to four flits in a clock cycle under the aggressive transmission mode, the switch allocator needs to be modified. Specifically, two different types of requests, which correspond to transferring flits at the positive and negative edges, respectively, are identified in the SA. Normally, the request for positive edge transmission is always asserted except when the current channel is not in use. On the other hand, the request for negative edge transmission is only asserted when the current channel is in the aggressive


Fig. 6. Allocator structure.

transmission mode. A separable allocator as shown in Fig. 6(b) is used, with p arbiters of size v:4 in the input stage and 4p arbiters of size 4p:1 in the output stage.¹ In Fig. 6(b), a v:4 arbiter² can be implemented as four oblivious arbiters with different priorities. In order to obtain a maximal matching, we use round robin arbiters in the output stage.

4.4. Crossbar

The dimension of the crossbar is 5 × 5 for the conventional router and 10 × 10 for the BiNoC router. In BiLink, to support up to 4 flits for each input direction, a 20 × 20 crossbar would be needed. However, this would cause a large area overhead. In order to reduce the crossbar size, we split the 20 × 20 crossbar into two 10 × 10 crossbars, one responsible for the positive edge transmission and the other for the negative edge transmission, as shown in Fig. 7. Since the complexity of a crossbar is $O(n^2)$, where n is the crossbar dimension, the area of two 10 × 10 crossbars is approximately half of that of a single 20 × 20 crossbar.

4.5. Mode controller

In each router, the self-configurable BiLink architecture can transmit flits in three different modes, namely, the normal mode, the aggressive TX mode and the aggressive RX mode. The mode control logic is similar to that used in the BiNoC. However, the flits are transmitted at different clock phases, which complicates the design. In this work, instead of using distributed control logic inside each router, we propose an explicit control module located in the middle of the link as shown in Fig. 4. The timing diagram of the proposed FSM of the control module is shown in Fig. 8. The request signals are generated by the RC module of each router and transmitted to the centralized control stage in the first half cycle. Then, the FSM is triggered at the negative clock edge and sends the configuration signals back to each router as well as

¹ p is the number of ports and v is the number of VCs.
² Typically, when the number of VCs is 4, no input arbiter is required since at most 4 VCs will be requested.

Fig. 7. Crossbar stage for BiLink.

the link stage in the negative phase of the following cycle. In summary, the mode controller experiences a time delay of two clock cycles. The state transition diagram of the control FSM is shown in Fig. 9. It consists of five different states, which are categorized into two types: transmission states and waiting states. A transmission state indicates how each pair of neighboring routers is configured. The two routers are either in the normal transmission mode shown in Fig. 1(a) and (b), or in the aggressive transmission or receiving mode shown in Fig. 3. Each edge in the transition diagram is labeled by a 2-tuple (L, R), where L and R stand for requests from the left and the right side, respectively. In the beginning, the FSM is initialized as the


Fig. 8. Timing diagram for FSM.

Fig. 10. Data conflict hazard in state transition.

Fig. 9. Transition diagram for control FSM. The states are: AggL2R (aggressive transmission, left to right), AggR2L (aggressive transmission, right to left), L2RWait (wait state after left to right aggressive transmission), R2LWait (wait state after right to left aggressive transmission), Normal (normal transmission).

normal transmission mode. If the left router has a channel request and its right neighbor does not, the mode is directly changed to “aggressive transmission, left to right” (AggL2R) as shown in Fig. 9. In addition, extra waiting states are added to make sure there is no flit conflict during the mode transition. In A-BiLink, a data conflict may occur because the link traversal stage spans one and a half cycles. If we do not stall the state transition for one more cycle, the data coming from the last cycle will collide with the data transmitted in the current cycle, as shown in Fig. 10. In this example, the TX end is changing from aggressive TX mode back to normal mode. In order to avoid this situation, waiting states are introduced. During these states, the aggressive transmitting end does not participate in the switch allocation and drains the remaining flits that were transmitted at the negative edge. At the receiving end, the router waits for an extra cycle to receive the draining flits, i.e., the 2nd flit in Fig. 10.

4.6. Network interface

The network interface (NI) of the BiLink architecture is modified mainly in the flit signal and the credit control signal. In the BiLink architecture, data may be received or sent at both positive and negative edges. Therefore, the local processing element (PE) needs to support this feature, and it should have a structure similar to that shown in Fig. 5. The credit signal between the router and the PE should also be doubled since the maximum number of allocated flits is 4 rather than 2 as in BiNoC.

4.7. BiLink for routers with fewer pipeline stages

In some designs, speculation and lookahead techniques are used to reduce the number of pipeline stages of the NoC router [7]. In general, the pipeline depth does not have a serious impact on the saturation throughput of the NoC, since at a high packet injection rate the packet latency is mainly due to the contention delay. Therefore, the high throughput of the proposed BiLink architecture is preserved for routers with a shallow pipeline. At a low injection rate, it may seem that the additional D2S stage in PA-BiLink would noticeably affect the zero load latency of a router with fewer pipeline stages. Without loss of generality, the timing diagrams of a 2-cycle router and a 4-cycle router under a low packet injection rate (i.e., without contention) are shown in Fig. 11 (note that the link transfer stage is considered as a separate pipeline stage in addition to the router pipeline). In this example, the packet length is assumed to be 8 flits. The end-to-end packet latency for the conventional 4-cycle router increases from 12 to 13 cycles (an 8% increase) due to the additional D2S stage in PA-BiLink. On the other hand, the number of cycles for transmitting the 8-flit packet for a 2-cycle router increases from 10 to 11 (a 10% increase) due to the additional D2S stage. The difference in the zero load latency increase between these two routers is very small (within 2%). Therefore, the PA-BiLink performance for the 2-cycle and 4-cycle routers under a low packet injection rate is quite similar. This will be demonstrated by the simulation results presented in the next section.
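The cycle counts quoted above can be reproduced with a simple contention-free accounting. The sketch below assumes a single hop in which the head flit passes through the router stages, one link stage and, for PA-BiLink, one extra D2S stage, followed by one cycle per remaining flit. This breakdown is an assumption chosen to match the quoted numbers, not a transcription of Fig. 11, and the function name is hypothetical.

```python
def zero_load_cycles(router_stages: int, packet_flits: int,
                     link_stages: int = 1, d2s_stages: int = 0) -> int:
    """Contention-free latency in cycles for one packet over one hop.

    Head flit: router pipeline + link stage (+ optional D2S stage).
    Body flits: one additional cycle each (serialization).
    """
    return router_stages + link_stages + d2s_stages + (packet_flits - 1)


for stages in (4, 2):
    base = zero_load_cycles(stages, packet_flits=8)
    pa = zero_load_cycles(stages, packet_flits=8, d2s_stages=1)
    increase = 100 * (pa - base) / base
    print(f"{stages}-cycle router: {base} -> {pa} cycles (+{increase:.1f}%)")
```

With these assumptions the 4-cycle router goes from 12 to 13 cycles (about 8.3%) and the 2-cycle router from 10 to 11 cycles (10%), so the relative penalty of the D2S stage differs by less than two percentage points between the two pipelines.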


Fig. 11. Comparisons of the 4-cycle and 2-cycle router under the low packet injection rate.
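Returning to the mode controller of Section 4.5, the five-state control FSM can be sketched as a transition function. Only the states, the initial Normal state and the Normal to AggL2R transition on request (1, 0) are taken from the text. Every other transition below is an assumption made for illustration, in particular that an aggressive state is held while only its own side requests the channel, and that each wait state lasts one cycle and always returns to Normal.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "Normal"
    AGG_L2R = "AggL2R"
    AGG_R2L = "AggR2L"
    L2R_WAIT = "L2RWait"
    R2L_WAIT = "R2LWait"


def next_mode(mode: Mode, left_req: bool, right_req: bool) -> Mode:
    """One FSM step for the request tuple (L, R). Transitions are partly assumed."""
    only_left = left_req and not right_req
    only_right = right_req and not left_req
    if mode is Mode.NORMAL:
        if only_left:
            return Mode.AGG_L2R      # stated in the text
        if only_right:
            return Mode.AGG_R2L      # assumed by symmetry
        return Mode.NORMAL
    if mode is Mode.AGG_L2R:
        # Assumption: stay aggressive while only the left side requests,
        # otherwise drain through the wait state.
        return Mode.AGG_L2R if only_left else Mode.L2R_WAIT
    if mode is Mode.AGG_R2L:
        return Mode.AGG_R2L if only_right else Mode.R2L_WAIT
    # Assumption: wait states last one cycle and return to Normal.
    return Mode.NORMAL


def run(requests, mode: Mode = Mode.NORMAL):
    """Apply a sequence of (L, R) requests and return the visited state names."""
    trace = []
    for left_req, right_req in requests:
        mode = next_mode(mode, bool(left_req), bool(right_req))
        trace.append(mode.value)
    return trace


print(run([(1, 0), (1, 0), (1, 1), (1, 1)]))
# ['AggL2R', 'AggL2R', 'L2RWait', 'Normal']
```

The wait state between AggL2R and Normal models the extra cycle in which the transmitting side drains its negative-edge flits before the link direction changes.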

5. Results evaluation 5.1. Simulation setup We implemented and compared the proposed BiLink architectures with two existing architectures, i.e., the traditional unidirectional NoC and the bi-directional NoC (BiNoC) [13]. For BiLink architecture, all three variants proposed have been implemented and evaluated. In particular, normal BiLink only supports the normal transmission mode. A-BiLink can configure the channel direction based on the run time traffic conditions while PA-BiLink reduces the hardware cost by sacrificing the performance of ABiLink. The 8  8 and 16  16 mesh NoC architectures are used for evaluation to demonstrate the performance gain and also the scalability of the proposed architecture. For all architectures, credit-based virtual channel flow control is used. Each input direction has four virtual channels with a buffer depth of 16 flits. To evaluate the system latency and throughput performance, a cycle accurate NoC simulator is implemented which is an extension of Noxim [3,16,11]. Both synthetic traffic patterns and real application benchmarks have been used to verify the performance. For synthetic traffic, random, bit reversal, butterfly, shuffle and transpose traffics are used. For real applications, five different benchmarks are used. They are multimedia system (mms), MPEG4 decoder (mpeg), picture-in-picture application (pip), video object plane decoder (vopd) and dual video object plane decoder (dvopd) [10,4,18]. The benchmarks are characterized by the corresponding communication task graphs. Similar to [4,10,16], we first use the mapping algorithm described in [10] to map the tasks onto the PEs in the mesh NoC. Based on the mapping results, the communication volumes among the cores are determined from the communication task graph. Each PE will then generate the corresponding traffic with the desired packet injection rate (pir). 
Of note, the pir is computed from the communication data volumes and the packet injection factor, as done in [16]. The synthetic traffic patterns and real benchmarks can be further classified by whether their traffic load is even or not. Table 1 summarizes this property. Most of the traffic patterns are mixtures of even and uneven traffic loads. The traffic property of a real benchmark depends not only on the traffic attributes but also on the mapping algorithm used for the task graph. From the mapping results, mpeg is the only real benchmark with even traffic loads.

To compare the area and power overhead, we implemented all the router architectures in Verilog and synthesized them using Synopsys Design Compiler with TSMC 65 nm technology.

Table 1 Property of traffic patterns and real benchmarks.
Property                      Traffic patterns and benchmarks
Purely even                   Random, mpeg
Purely uneven                 Bit reversal
Mixture of even and uneven    Butterfly, shuffle, transpose, mms, pip, vopd, dvopd

5.2. Performance comparisons

Figs. 12–16 show the cycle-accurate simulation results for the packet latency. The simulations explore the impact of different packet lengths (8-flit and 16-flit), different network sizes (8 × 8 mesh and 16 × 16 mesh), and different router architectures (2-cycle router and 4-cycle router) on the packet latency. The latency is calculated as an equivalent time period, $\text{latency}_{eqv} = \text{clock cycles} \times T_{clk}$, instead of just the number of clock cycles. This is a fairer comparison, since the critical path differs between the conventional router and the BiLink router, as shown in the next sub-section. We compare the latency of the network under different synthetic traffic patterns for 6 architectures, including a typical router with double data width (64 bits per flit), denoted Typical (64).

As shown in Figs. 12–16, a significant performance improvement can be observed for all synthetic traffic patterns. Here, we define the performance as the injection rate near the saturation point of the network. More specifically, the network is assumed to reach the saturation point when the latency approaches 100 equivalent clock cycles. For evenly distributed traffic patterns such as random, BiLink will mostly be configured in the normal mode, as the traffic from both ends of the link is uniform. Thus it is expected that there is not much difference in the performance between the A-BiLink and normal BiLink

Please cite this article as: J. Zhu, et al., BiLink: A high performance NoC router architecture using bi-directional link with double data rate, INTEGRATION, the VLSI journal (2016), http://dx.doi.org/10.1016/j.vlsi.2016.02.006


J. Zhu et al. / INTEGRATION, the VLSI journal ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Fig. 12. Random traffic pattern.

Fig. 13. Bit reversal traffic pattern.

architecture. Figs. 12–16 show that A-BiLink and PA-BiLink achieve approximately 80% performance gain over the BiNoC and typical NoC architectures when the packet length is 8 flits, and 90% when the packet length is 16 flits. For uneven traffic patterns such as bit reversal, the gain for both 8-flit and 16-flit packets is still large, over 200%. The proposed architecture therefore works well for a wide range of packet lengths, because the speedup mainly depends on the amount of contention. In addition, Figs. 12–16 show that the performance gain is similar for both 2-cycle and 4-cycle routers, as discussed in Section 4.7. The typical architecture with double link width has a 100% performance gain when the traffic pattern is purely even. However, it cannot perform as well as BiLink when the traffic pattern becomes uneven, and it incurs a large hardware cost in terms of area and power.

When the size of the mesh network increases, the chance of contention between each pair of routers increases as well. Therefore, compared with the 8 × 8 mesh, the performance gain of PA-BiLink over the typical NoC router (in terms of the saturation point) under the random traffic pattern increases from 78% to 86% for the 16 × 16 mesh, as can be observed from Figs. 12–16. Similar gains are observed for the other traffic patterns.

For patterns that exhibit a strongly uneven traffic distribution (e.g., bit reversal), some of the links in A-BiLink and PA-BiLink are configured in the aggressive transmission mode most of the time. The simulation results show a 210% performance gain over the typical architecture when the packet length is 8 flits and a 250% gain when the packet length is 16 flits.
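The performance metric used in this comparison (the injection rate at which the equivalent latency reaches 100 cycles) can be sketched as follows. This is only an illustration under stated assumptions: the latency is normalized to the clock period of the typical router, the crossing point is found by linear interpolation between samples, and the two latency curves in the usage part are hypothetical values invented for the example, not simulation results. Only the two clock periods are taken from Table 4.

```python
from typing import Optional, Sequence

# Critical-path delays in ns, taken from Table 4.
T_CLK_NS = {"Typical": 0.95, "PA-BiLink": 1.04}


def equivalent_cycles(cycles: float, t_clk_ns: float, t_ref_ns: float) -> float:
    """Express a latency given in router cycles in units of a reference clock period."""
    return cycles * t_clk_ns / t_ref_ns


def saturation_rate(rates: Sequence[float], latencies: Sequence[float],
                    threshold: float = 100.0) -> Optional[float]:
    """Injection rate at which the latency curve first reaches `threshold`.

    Uses linear interpolation between adjacent samples and returns None
    if the curve never crosses the threshold.
    """
    for i in range(len(rates) - 1):
        l0, l1 = latencies[i], latencies[i + 1]
        if l0 < threshold <= l1:
            frac = (threshold - l0) / (l1 - l0)
            return rates[i] + frac * (rates[i + 1] - rates[i])
    return None


def gain_percent(sat_new: float, sat_base: float) -> float:
    """Relative improvement of the saturation injection rate, in percent."""
    return 100.0 * (sat_new / sat_base - 1.0)


if __name__ == "__main__":
    # Hypothetical latency curves in router cycles, for illustration only.
    rates = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
    typical_cycles = [20, 24, 35, 70, 180, 400, 800, 1500]
    bilink_cycles = [20, 21, 23, 26, 31, 40, 65, 150]

    t_ref = T_CLK_NS["Typical"]
    typ = [equivalent_cycles(c, T_CLK_NS["Typical"], t_ref) for c in typical_cycles]
    bil = [equivalent_cycles(c, T_CLK_NS["PA-BiLink"], t_ref) for c in bilink_cycles]

    sat_typ = saturation_rate(rates, typ)
    sat_bil = saturation_rate(rates, bil)
    print(f"saturation: typical={sat_typ:.3f}, PA-BiLink={sat_bil:.3f}, "
          f"gain={gain_percent(sat_bil, sat_typ):.0f}%")
```

Scaling by the clock period before locating the crossing point is what lets a router with a longer critical path be compared fairly against one with a shorter cycle time.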
In addition, compared with BiNoC, which can also adapt to the traffic load, the proposed structure still achieves a further performance gain of 57% and 60% when the packet length is 8 and 16 flits, respectively. We also observe that the typical architecture with double data width can only perform as well as BiLink and BiNoC. This is mainly due to the fixed channel direction of the typical router. For other traffic patterns, such as butterfly or shuffle, the proposed BiLink architecture also outperforms the typical router with single and double data width as well as BiNoC, because these patterns contain both even and uneven traffic. In addition, as discussed previously, PA-BiLink has only a small latency degradation compared to A-BiLink.

Fig. 17 presents the simulation results for the real application benchmarks. Each architecture is simulated under three different injection factors as defined in [16], which correspond to low, medium and high traffic loads, respectively. Specifically, the low injection factor is the injection rate at which all 5 architectures work in the low-congestion region (i.e., close to the zero-load latency of the network). Under the medium workload, the typical router enters the saturation region (i.e., the delay is larger than hundreds of cycles) while the other architectures still operate in the low-congestion region. The high injection factor is the workload at which even BiNoC operates in the saturation region. Under this workload, all the existing NoC architectures become saturated while the three variants of the proposed BiLink still operate in the low-latency region.

We first compare the BiLink architecture with the conventional NoC and BiNoC using a low injection factor. Then, to demonstrate the superiority of the BiLink architecture over BiNoC, we use a medium injection factor. Under this injection factor, the conventional NoC becomes saturated and its latency becomes very high, so we do not include the conventional NoC in Fig. 17(b). Finally, in order to further compare the three BiLink variants, a high injection factor is used which will make


Fig. 14. Butterfly traffic pattern.

Fig. 15. Shuffle traffic pattern.

both conventional NoC and BiNoC fall into saturation. Fig. 17 shows the normalized latency for the different architectures. From Fig. 17(a), we can observe that the 3 variants of BiLink achieve approximately 20–90% performance gain over BiNoC and 100–300% gain over the conventional NoC, depending on the traffic distribution of the applications. For mpeg, which is a completely even traffic pattern as listed in Table 1, BiNoC shows some performance degradation compared with the conventional NoC. This is due to the overhead caused by frequent mode transitions in BiNoC. The BiLink architectures mitigate this problem because they can transmit more flits in each cycle. In Fig. 17(b), the BiLink architectures always outperform their BiNoC counterpart by at least 100%. From Fig. 17(c), we can see that A-BiLink generally performs better than normal BiLink at high injection rate. For benchmarks such as mms, pip, vopd and dvopd, the latency of A-BiLink is reduced by 40–70% compared with that of BiLink. For an even traffic pattern such as mpeg, A-BiLink obtains no performance gain over BiLink. Furthermore, under all injection rates, the latency of PA-BiLink is always higher than that of A-BiLink because of the additional pipeline stage. Finally, to show the throughput gain of the PA-BiLink architecture, the throughputs of the different router architectures at the saturation injection rate under different traffic patterns were also estimated; the results are summarized in Table 2.

5.3. Area and power overhead

We synthesize 5 different architectures, i.e., typical, BiNoC, BiLink, PA-BiLink and typical with double data width, using the same TCL script. The basic parameters for each router are:

1) VC depth is 16 flits for each direction.
2) Flit data width is 32 bits.
3) 4 VCs per direction.
4) 5 directions for the router, i.e., north, east, south, west and local.
5) Credit-based flow control scheme.

The detailed area breakdown for each router is shown in Table 3. From Table 3, PA-BiLink has a 45% area overhead compared with the typical router. However, it still shows a large area reduction compared with the typical architecture with double data width. More importantly, PA-BiLink outperforms the typical router with double data width for most of the traffic patterns. Table 3 also shows that:

1) The area is dominated by the input VC buffers. PA-BiLink only adds some control logic and DFFs as shown in Fig. 5(b), which is much more scalable than the double data width architecture.
2) The VA stage is the same for each architecture as shown in Fig. 6(a), so it does not cause any hardware overhead.
3) The XBAR size in PA-BiLink is doubled compared with BiNoC, since two 10 × 10 XBARs instead of one 10 × 10 XBAR are used.
4) The area of the SA stage increases linearly from BiNoC to PA-BiLink, because the number of allocations changes from 2 to 4 as shown in Fig. 6(b), which duplicates the output stage of the arbiters.
5) The size of the output registers is roughly doubled for PA-BiLink compared with BiNoC, since additional negative-edge-triggered DFFs are needed to transmit the data at each phase of the clock.

The timing report of the different architectures is shown in Table 4. For all the architectures, the critical path of the intra-router stage is SA, which is also reported in [15]. It should be noted that although some works, such as [12], can achieve a much


Fig. 16. Transpose traffic pattern.

higher frequency, they use full-custom design techniques to shorten their critical path. BiNoC and PA-BiLink have the same critical path delay, which is 9% higher than that of the typical router. Even with this 9% increase in cycle time, the gain in overall latency is still very significant once the reduction in latency cycles is taken into account. The saturation point of PA-BiLink surpasses that of the typical router by at least 80% as shown in Figs. 12–16.

Finally, we need to take the intermediate control logic and the link stage into account. The final equivalent area and power for the BiLink structure are calculated as:

$$\text{Total} = \text{Router} + 5 \times \text{Link Stage} + 2.5 \times \text{Mode Control} \qquad (3)$$

Since there are 2 link stages and 1 mode controller between each pair of neighboring routers and each router has 5 neighbors, the equivalent numbers of link stages and mode controllers associated with each router are 5 and 2.5, respectively. The equivalent area and power are summarized in Tables 5 and 6. From Table 5, the area overhead of PA-BiLink is 28% compared with BiNoC. This tradeoff of area for performance is acceptable because the performance gain of PA-BiLink over BiNoC is large (about 80–90% under even traffic and 57–60% under uneven traffic, as reported in Section 5.2). In addition, the area of the router is typically small compared with that of the processing element in a tile (about 6% as reported in [21]). Therefore, a 28% area increase in the router incurs less than 2% area overhead for the tile.

The power consumption of the routers depends on the switching activity and hence on the traffic workload of the routers. To obtain an accurate power analysis of the different routers, we construct a testbench using a pair of routers to form a small network. Power consumption is evaluated under three different traffic scenarios: high, medium and low injection rate. For the high and medium injection rates, 4 and 2 packets are injected from each router going to the other, respectively. Under the low injection rate, only 1 packet is transmitted from one router to the other. The switching activities of the router modules are first extracted from the post-synthesis simulation and then back-annotated for the final power evaluation.

Table 6 summarizes the power consumption of the different router architectures under different injection rates. From Table 6, PA-BiLink shows a power overhead of approximately 40% and 31% under high and medium injection rate, respectively, when compared with the typical router. However, the overall energy consumption is reduced by 17% and 14%, respectively, because the throughput is about twice that of the typical router at high traffic load. We also simulate the power consumption at a lower traffic load. From the power simulation result, with proper clock gating, the power overhead over the typical router is about 18%. The overhead is smaller than that under the high and medium traffic loads since most of the time the extra XBAR and the D2S stage are not needed and clock gating is used to reduce the dynamic power consumption.

In conclusion, the proposed BiLink architecture obtains a better energy efficiency, since the higher throughput compensates for the power overhead. For instance, the calculated energy-delay-product (EDP) of PA-BiLink is reduced by 47.45% when compared with the typical router under high packet injection rate. Under low injection rate, since the throughput improvement is lower, PA-BiLink has a higher EDP. For target applications that require high throughput, the injection rate is high and the energy efficiency of PA-BiLink is higher.
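The equivalent-area calculation of Eq. (3) can be checked with a short script. The sketch below uses the per-block areas listed in Table 5 and the 6% router share of the tile quoted from [21]; the dictionary and function names are illustrative choices made for this example.

```python
# Per-block areas in um^2, taken from Table 5 (missing entries count as 0).
ROUTER_AREA = {"Typical": 190085, "BiNoC": 223532, "BiLink": 219470,
               "PA-BiLink": 276026, "Typical (64)": 330281}
LINK_STAGE_AREA = {"BiLink": 923, "PA-BiLink": 1906}
MODE_CTRL_AREA = {"PA-BiLink": 115}


def equivalent_area(arch: str) -> float:
    """Eq. (3): Router + 5 x Link Stage + 2.5 x Mode Control."""
    return (ROUTER_AREA[arch]
            + 5 * LINK_STAGE_AREA.get(arch, 0)
            + 2.5 * MODE_CTRL_AREA.get(arch, 0))


def overhead_percent(arch: str, base: str) -> float:
    """Equivalent-area overhead of `arch` relative to `base`, in percent."""
    return 100.0 * (equivalent_area(arch) / equivalent_area(base) - 1.0)


def tile_overhead_percent(router_overhead: float, router_share: float = 0.06) -> float:
    """Tile-level overhead when the router occupies `router_share` of the tile."""
    return router_overhead * router_share


if __name__ == "__main__":
    for arch in ROUTER_AREA:
        print(f"{arch}: {equivalent_area(arch):,.1f} um^2")
    router = overhead_percent("PA-BiLink", "BiNoC")
    print(f"PA-BiLink over BiNoC: {router:.1f}% of router area, "
          f"{tile_overhead_percent(router):.2f}% of tile area")
```

With these inputs the script reproduces the equivalent areas of Table 5 (224,085 μm² for BiLink and 285,843.5 μm² for PA-BiLink) and the roughly 28% overhead of PA-BiLink over BiNoC.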

6. Related work

Reconfigurable channel link direction: Conventionally, the transfer mode of the interconnect between a pair of routers is classified into two types, unidirectional and bidirectional, as shown in Fig. 18. In a unidirectional NoC, the direction of each link is fixed, one for transmitting data and the other for receiving data, as shown in Fig. 18(a). However, the link capacity is not fully utilized if the traffic from the two ends is not evenly distributed. Thus in [13,20], a bi-directional router, BiNoC, was introduced in which the direction of each link can be reconfigured to maximize the bandwidth utilization, as shown in Fig. 18(b). A dedicated Channel Direction Control (CDC) algorithm is used in each router to control the direction of the link [13]. An alternative approach is to control the link direction using a centralized bandwidth arbiter [6]. A flit-level speedup scheme was introduced to further increase the throughput of BiNoC by allowing 2 flits within 1 packet to be transmitted simultaneously [20]. An application mapping algorithm based on the Quality-of-Service (QoS) of reconfigurable NoC routers is discussed in [2]. However, routers in these works can transfer at most 2 flits in each clock cycle. In addition, the performance gain over the unidirectional architecture is small under random and even traffic patterns.

Fine-grained reconfigurable interconnect: In [9], the interconnection between a pair of routers is subdivided from the granularity of a flit into phits, the direction of each of which can be reconfigured independently. Because of the uneven distribution of traffic in NoCs, the reconfigurable channel direction together with this fine granularity achieves a more power- and area-efficient design without degrading the performance of the NoC. However, the additional serializer and deserializer cause an area overhead. Also, the communication latency is slightly increased to save channel resources.

Link utilization improvement using network coding: Network coding has been used in communication systems that employ an intermediate relay stage to improve the effective bandwidth.


Table 2 Comparison of throughput λ (flits/cycle) for different router architectures.
Traffic/Arch.    Typical    BiNoC    PA-BiLink
Random           0.20       0.21     0.39
Bitreversal      0.07       0.14     0.21
Transpose        0.075      0.138    0.24
Butterfly        0.123      0.243    0.35

Table 3 Area breakdown of different NoC architectures.
Unit: μm²          Typical    BiNoC      BiLink     PA-BiLink    Typical (64)
RC                 180        242        180        292          180
Input VC buffer    145,868    152,997    153,406    167,018      282,912
VA                 20,360     18,535     19,095     18,097       20,057
SA                 5,796      15,209     13,797     30,267       5,693
XBAR               2,987      11,951     11,951     23,902       6,083
Output register    4,706      6,942      6,677      12,280       5,885
Other              10,188     17,656     14,364     24,170       9,471
Total area         190,085    223,532    219,470    276,026      330,281
Normalized value   1.00       1.18       1.15       1.45         1.74

Table 4 Timing report of different NoC architectures.
Unit: ns                   Typical    BiNoC    BiLink    PA-BiLink    Typical (64)
Critical path (SA stage)   0.95       1.04     1.04      1.04         0.94
Normalized value           1.00       1.09     1.09      1.09         0.99

Table 5 Equivalent area of different NoC architectures.
Unit: μm²          Typical    BiNoC      BiLink     PA-BiLink    Typical (64)
Router             190,085    223,532    219,470    276,026      330,281
Link stage         0          0          923        1,906        0
Mode controller    0          0          0          115          0
Equivalent area    190,085    223,532    224,085    285,843.5    330,281
Normalized value   1.00       1.18       1.18       1.50         1.74

Table 6 Equivalent power and energy-delay-product (EDP) of different NoC architectures.
Power (mW)/Normalized EDP    Typical    BiNoC        PA-BiLink    Typical (64)
Low injection rate           17.8/1     20.8/1.27    21.1/1.29    34.5/1.91
Medium injection rate        19.5/1     22.7/1.26    25.7/0.60    37.3/0.68
High injection rate          20.4/1     23.7/1.26    28.9/0.52    39.7/0.59

Fig. 17. Simulation for real applications. Normalized latency of the mms, mpeg, pip, vopd and dvopd benchmarks under (a) the low packet injection rate factor (Typical, BiNoC, BiLink, A-BiLink, PA-BiLink), (b) the medium packet injection rate factor (BiNoC, BiLink, A-BiLink, PA-BiLink) and (c) the high packet injection rate factor (BiLink, A-BiLink, PA-BiLink).

A similar idea has been applied in the NoC domain. In [5], a novel link stage design based on network coding was proposed. The data transmission pattern, which mimics that of network coding [1], is shown in Fig. 19. More specifically, in Fig. 19, during the transmitting phase, R1 and R2 send the data p and q to the intermediate coding unit, respectively. The coding unit then encodes the two received data into a single packet (i.e., it performs the p XOR q operation). Finally, at the


Fig. 18. Data transfer mode for conventional NoC routers.

Fig. 19. Network Coding in Network-on-Chip.

receiving phase, R1 and R2 both receive the encoded packet (i.e., p XOR q). To decode the data, R1 and R2 each XOR the received data with the original data that they sent out, which yields q at R1 and p at R2. A coding unit is inserted in the middle of the link to act as a relay station similar to that in conventional network coding. However, unlike in network coding, the two signals entering the coding unit do not actually have to be coded, because the result does not need to be broadcast to both sides. Moreover, this architecture cannot adapt well to the uneven traffic patterns of real applications.

Compared with the existing works, the proposed PA-BiLink architecture has the highest throughput among all routers, as shown in Table 2. More specifically, under an even traffic pattern such as random traffic, the throughput of PA-BiLink exceeds those of the traditional router and BiNoC by approximately 100%. Furthermore, the throughput of PA-BiLink surpasses that of BiNoC by 45% to 73% under uneven traffic patterns such as bit reversal and transpose, owing to the data transfer on both clock edges.
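For reference, the XOR-based exchange of the network-coding link stage of [5], described earlier in this section, can be sketched in a few lines. The function names and the 32-bit example payloads are assumptions made for illustration; the sketch models only the data values, not the timing of the transmitting and receiving phases.

```python
def coding_unit(p: int, q: int) -> int:
    """Intermediate coding unit: combine the two incoming flits into one."""
    return p ^ q


def decode(received: int, own: int) -> int:
    """A router recovers the other side's flit by XORing with what it sent."""
    return received ^ own


def exchange(p: int, q: int) -> tuple:
    """One exchange: R1 sends p, R2 sends q, both receive p XOR q."""
    coded = coding_unit(p, q)
    at_r1 = decode(coded, p)  # R1 recovers q
    at_r2 = decode(coded, q)  # R2 recovers p
    return at_r1, at_r2


if __name__ == "__main__":
    p, q = 0xDEADBEEF, 0x12345678  # hypothetical 32-bit flit payloads
    at_r1, at_r2 = exchange(p, q)
    print(f"R1 received {at_r1:#010x}, R2 received {at_r2:#010x}")
```

Decoding works because XOR is its own inverse: (p XOR q) XOR p equals q. Each router must therefore keep a copy of the flit it sent until the coded flit arrives.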

7. Conclusion

In this paper, we have proposed a new NoC router architecture using a bi-directional link with double data rate. We insert an intermediate link stage and use phase pipelining to double the data rate. In addition, a reconfigurable structure has been designed to improve the latency as well as the throughput under different traffic conditions by changing the direction of the link at run time. We explored three different BiLink architectures to trade off performance against hardware overhead. Simulation results show that the proposed architecture achieves a 250% performance gain over the typical router, and 60% over BiNoC, under uneven traffic patterns. For even traffic patterns, it also outperforms the typical and BiNoC routers, with a performance gain of about 80–90%. BiLink works well under different packet lengths and scales well with larger network sizes as well as different pipelined router architectures. The area overhead of PA-BiLink over BiNoC is around 28%, and its power overhead over the typical router is about 40% under high injection rate. With clock gating, the power overhead is reduced to 18% under low injection rate. Despite the overhead, the EDP of the BiLink architecture is improved by 47.45% under high injection rate owing to the high throughput. In summary, BiLink provides a good performance/area/power tradeoff for high-throughput router design.

Acknowledgment

This work is supported by Hong Kong Research Grant Council (RGC) under Grant 619813.

References

[1] R. Ahlswede, Ning Cai, S.-Y.R. Li, R.W. Yeung, Network information flow, IEEE Trans. Inf. Theory 46 (4) (2000) 1204–1216.
[2] M.A. Al Faruque, T. Ebi, J. Henkel, Configurable links for runtime adaptive on-chip communication, in: 2009 Design, Automation Test in Europe Conference Exhibition, DATE '09, April 2009, pp. 256–261.
[3] G. Ascia, V. Catania, M. Palesi, D. Patti, Neighbors-on-path: a new selection strategy for on-chip networks, in: Proceedings of the 2006 IEEE/ACM/IFIP Workshop on Embedded Systems for Real Time Multimedia, 2006, pp. 79–84.
[4] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, G. De Micheli, NoC synthesis flow for customized domain specific multiprocessor systems-on-chip, IEEE Trans. Parallel Distrib. Syst. 16 (February (2)) (2005) 113–129.
[5] K.C. Bollapalli, R. Garg, K. Gulati, S.P. Khatri, On-chip bidirectional wiring for heavily pipelined systems using network coding, in: 2009 IEEE International Conference on Computer Design, ICCD 2009, 2009, pp. 131–136.
[6] Myong Hyon Cho, M. Lis, Keun Sup Shim, M. Kinsy, T. Wen, S. Devadas, Oblivious routing in on-chip bandwidth-adaptive networks, in: 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT '09, 2009, pp. 181–190.
[7] William James Dally, Brian Patrick Towles, Principles and Practices of Interconnection Networks, Access Online via Elsevier, 2004.
[8] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al., Large scale distributed deep networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
[9] R. Hesse, J. Nicholls, N.E. Jerger, Fine-grained bandwidth adaptivity in networks-on-chip using bidirectional channels, in: 2012 Sixth IEEE/ACM International Symposium on Networks on Chip (NoCS), May 2012, pp. 132–141.


[10] Jingcao Hu, R. Marculescu, Energy- and performance-aware mapping for regular NoC architectures, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 24 (April (4)) (2005) 551–562.
[11] Kai-Yuan Jheng, Chih-Hao Chao, Hao-Yu Wang, An-Yeu Wu, Traffic-thermal mutual-coupling co-simulation platform for three-dimensional network-on-chip, in: 2010 International Symposium on VLSI Design Automation and Test (VLSI-DAT), IEEE, 2010, pp. 135–138.
[12] Amit Kumar, Partha Kundu, Arvind P. Singh, Li-Shiuan Peh, Niraj K. Jha, A 4.6 Tbits/s 3.6 GHz single-cycle NoC router with a novel switch allocator in 65 nm CMOS, in: ICCD-2007, 2007.
[13] Ying-Cherng Lan, Hsiao-An Lin, Shih-Hsin Lo, Yu Hen Hu, Sao-Jie Chen, A bidirectional NoC (BiNoC) architecture with dynamic self-reconfigurable channel, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 30 (3) (2011) 427–440.
[14] Quoc V Le, Building high-level features using large scale unsupervised learning, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 8595–8598.
[15] Chrysostomos Nicopoulos, Vijaykrishnan Narayanan, Chita R Das, Network-on-Chip Architectures: A Holistic Design Exploration, vol. 45, Springer, 2009.
[16] M. Palesi, S. Kumar, V. Catania, Bandwidth-aware routing algorithms for networks-on-chip platforms, IET Comput. Digit. Tech. 3 (September (5)) (2009) 413–429.


[17] M. Pedram, Qing Wu, Xunwei Wu, A new design of double edge triggered flip-flops, in: Proceedings of the 1998 Asia and South Pacific Design Automation Conference, ASP-DAC '98, February 1998, pp. 417–421.
[18] A. Pullini, F. Angiolini, P. Meloni, D. Atienza, S. Murali, L. Raffo, G. De Micheli, L. Benini, NoC design and implementation in 65 nm technology, in: 2007 First International Symposium on Networks-on-Chip, NOCS 2007, May 2007, pp. 273–282.
[19] A. Pullini, F. Angiolini, S. Murali, D. Atienza, G. De Micheli, L. Benini, Bringing NoCs to 65 nm, IEEE Micro 27 (September (5)) (2007) 75–85.
[20] Zhiliang Qian, Ying-Fei Teh, Chi-Ying Tsui, A flit-level speedup scheme for network-on-chips using self-reconfigurable bi-directional channels, in: 2012 Design, Automation Test in Europe Conference Exhibition (DATE), March 2012, pp. 1295–1300.
[21] Praveen Salihundam, Shailendra Jain, Tiju Jacob, Shasi Kumar, Vasantha Erraguntla, Yatin Hoskote, Sriram Vangal, Gregory Ruhl, Nitin Borkar, A 2 Tb/s 6 × 4 mesh network for a single-chip cloud computer with DVFS in 45 nm CMOS, IEEE J. Solid-State Circuits 46 (4) (2011) 757–766.
