A practical low-latency router architecture with wing channel for on-chip network


Microprocessors and Microsystems 35 (2011) 98–109


Mingche Lai, Lei Gao *, Sheng Ma, Xiao Nong, Zhiying Wang
Department of Computer, National University of Defense Technology, Changsha, Hunan, China

Article history: Available online 24 September 2010
Keywords: On-chip network; Router; VLSI design

Abstract

With the increasing number of cores, the communication latency of a Network-on-Chip becomes a dominant problem due to the complex operations performed per node. In this paper, we reduce communication latency by proposing a single-cycle router architecture with a wing channel, which forwards incoming packets to free ports immediately by inspecting the switch allocation results. Incoming packets granted the wing channel fill the free time slots of the crossbar switch and reduce contention with subsequent packets, thereby improving throughput effectively. We design the proposed router in a 65 nm CMOS process, and the results show that it supports different routing schemes and outperforms the express virtual channel, prediction and Kumar's single-cycle routers in terms of latency and throughput. Compared to the speculative router, it provides 45.7% latency reduction and 14.0% throughput improvement. Moreover, we show that the proposed design incurs a modest area overhead of 8.1% while saving 7.8% in power consumption due to fewer arbitration activities. Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved.

1. Introduction

With the continual advancement of device technology toward the nanometer region, it will become possible to integrate thousands of cores on a single chip. There is wide consensus in both industry and academia that many-core designs are the only efficient way to utilize billions of transistors, and that they represent the trend of future processor architectures. Recently, several commercial or prototype many-core chips, such as TeraScale [1], Tile [2] and Kilocore [3], have been delivered. To connect so many cores, traditional bus-based or crossbar structures have proven increasingly incapable of meeting the challenges of intolerable wire delays and poor scalability in deep-submicron conditions. The Network-on-Chip (NoC), as an effective means of on-chip communication, introduces a packet-switched fabric that can address the challenges of increasing interconnection complexity [4]. Although the NoC provides a preferable solution to long wire delays compared with traditional structures, communication latency still becomes a dominant problem as the number of cores increases. For example, the average communication latencies of the 80-core TeraScale and the 64-core Tile are close to 41 and 31 cycles respectively, since their packets, forwarded via many cores, must perform complex operations at each node through 5-stage or 4-stage routers. In this way, the communication latency, tending

* Corresponding author. Tel.: +86 13787314163. E-mail addresses: [email protected] (M. Lai), [email protected] (L. Gao).

to grow with the increasing number of cores, will become the bottleneck of application performance on future many-core chips. There has been significant work on reducing the communication latency of NoCs through various approaches, such as designing new topologies and developing fast routers. Bourduas et al. [5] combined mesh and hierarchical rings into a hybrid topology that provides fewer transfer cycles. In theory, architects prefer high-radix networks to further reduce average hop counts; however, for complex structures such as the flattened butterfly [6], finding an efficient wiring layout during the back-end design flow is a challenge in its own right. Recently, many aggressive router architectures with single-cycle transfers have also been developed. Kumar et al. [7] propose the express virtual channel (EVC) to reduce communication latency by bypassing intermediate routers in a completely non-speculative fashion. This method is effective in closing the gap between speculative routers and ideal ones; however, it does not work well at non-intermediate nodes and suits only deterministic routing. Moreover, it sends a starvation token upstream every n cycles to stop the EVC flits and prevent the normal flits of a high-load node from being starved. As a result, many packets at the EVC source node must be forwarded via normal virtual channels (NVC), which increases average latencies. In [8,9], a predictive switching scheme is proposed, in which incoming packets are transferred without waiting for the routing computation and switch allocation if the prediction hits. Matsutani et al. [10] analyze the prediction rates of six algorithms and find that the average hit rate of the best one reaches only 70% under different traffic patterns. This means that many packets still require at least three cycles to go

0141-9331/$ - see front matter Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2010.09.002


through a router when the prediction misses or packets conflict. Kumar et al. [11] present a single-cycle router pipeline that uses advanced bundles to remove the control setup overhead. However, the design in [11] works well only at low traffic rates, because it requires that no flit be present in the input buffer when the advanced bundle arrives. Finally, the preferred path [12] is pre-specified to offer ideal latency, but it cannot adapt to different network environments.

Besides the single-cycle transfer exhibited by all the techniques listed above, we additionally emphasize three aspects in the proposed low-latency router. First, a technique that accelerates a specific traffic pattern should also work well for other patterns, and it should ideally suit different routing schemes, both deterministic and adaptive. Second, besides zero-load latency, high throughput and low latency under different loads are also important, since the traffic rate changes easily on a NoC. Finally, complex hardware mechanisms such as prediction, speculation, retransmission and abort-detection logic should be avoided to keep the design cost-efficient.

In this paper, the main contribution is a low-latency router architecture with a wing channel (Section 2). Regardless of the traffic rate, the proposed router inspects the switch allocation results, selects new packets without port conflicts to enter the wing channel and fill the free time slots of the crossbar ports, thereby bypassing the complex two-stage allocation and forwarding the incoming packets downstream in the next cycle. No matter what the traffic pattern or routing scheme is, once a new packet has no port conflict with others, it can be delivered within one cycle at the current router, which in our opinion is the optimal case.
Moreover, as the packets in the wing channel make full use of the crossbar time slots and reduce contention with subsequent packets, the network throughput is also improved effectively. We then modify the traditional router at little additional cost, and present the detailed micro-architecture and circuit schematics of the proposed router in Section 3. Section 4 estimates the timing and power consumption via commercial tools, and evaluates network performance using a cycle-accurate simulator considering different routing schemes under various traffic rates and patterns. Our results indicate that the proposed router outperforms the EVC, prediction and Kumar's single-cycle routers in terms of latency and throughput, and provides 45.7% latency reduction and 14.0% throughput improvement on average compared with a state-of-the-art speculative router. The evaluation results also show that although the router area increases by 8.1%, the average power consumption is reduced by 7.8% owing to fewer arbitration activities at low rates. Finally, Section 5 concludes the paper.

2. Proposed router design

In this section, we propose a single-cycle router architecture supporting different routing schemes, in which incoming packets forwarded to free ports are selected for immediate transfer through the wing channel, without waiting for their VA and SA operations, based on inspection of the switch allocation. Hence, it reduces communication latency and improves network throughput under various network environments. Through an analysis of the original router, Section 2.1 first presents the single-cycle router architecture and describes its pipeline structure; we then explain the details of the wing channel in Section 2.2.

2.1. Proposed router architecture

It is well known that wormhole flow control was first introduced to improve performance through fine-granularity buffering at


flit level. A router with a single channel per port, which plays a crucial role in architecting cost-efficient on-chip interconnects, offers low latency thanks to its low hardware complexity, but it is prone to head-of-line blocking, a significant performance-limiting factor. To remedy this predicament, virtual channels provide an alternative that improves performance, but they are not amenable to cost-efficient on-chip implementation: complex mechanisms such as virtual channel arbitration and two-phase switch allocation lengthen the normal pipeline. A detailed analysis of the virtual-channel router shows that switch allocation and switch traversal are both necessary during the transfer of each packet, but the pipeline delay of the former always exceeds that of the latter [13]. Hence, we believe the complex two-phase switch allocation may be preferable for arbitrating among multiple packets at high rates, but it increases communication latency at low rates, where the redundant arbitration that lengthens the pipeline delay is unnecessary because no contention occurs.

Given the aforementioned analysis, we introduce an alternative to optimize the single-cycle router. When an input port receives a packet, it computes the state of the forwarding path based on the output direction and the switch allocation results. If the forwarding path is free, the port grants the packet the wing channel and bypasses the two-phase switch allocation pipeline stage to forward the packet directly. For the purpose of illustration, we first introduce the original speculative router [13] and then describe the changes in detail. The main components of the original router include the input buffers, next routing computation (NRC), virtual channel arbitration (VA), switch allocation (SA) and crossbar units.
When a header flit enters the input channel, the router at the first pipeline stage parallelizes NRC, VA and SA using speculative allocation, which performs SA based on a prediction of the VA winner and cancels the SA operation, regardless of its result, when the VA operation fails due to conflicts with other packets. At the second pipeline stage, the winning packets are transferred through the crossbar to the output ports.

Fig. 1 illustrates our proposed low-latency router architecture, modified from the original one described above. First, each input port is configured with a single wing channel, which uses the same number of flit slots to replace one of the normal VCs, thus keeping the buffer overhead constant. In our proposed architecture, the wing channel, as a special type of VC, has its own simple mechanism: upon inspecting that the forwarding path of an arriving packet is free, the wing channel holds the packet and immediately asserts its request signal to the fast arbiter to implement the single-cycle transfer. Note that since a transfer from the wing channel is performed only when the switch allocation requests of the other, normal channels failed in the previous cycle, it has lower priority than the normal transfers. Second, we add fast arbitration logic of low complexity to handle the requests from the wing channels of all inputs. The extra one-stage fast arbitration logic, implemented by five 4:1 arbiters, incurs little hardware overhead, in contrast with the 25 arbiters of the two-stage virtual channel arbitration. Through the fast arbitration, the winner traverses the crossbar switch immediately. Third, each input introduces a channel dispenser unit. Unlike the original router, the proposed router allocates only the logical identifier of a free channel at the neighboring node; according to the logical identifier stored in the header flit, the dispenser unit grants the physical channel identifier to the new packet.
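The grant-and-bypass decision just described can be sketched as follows (a minimal behavioral sketch with hypothetical names; the paper provides no code): a new header flit takes the wing channel only when its input and its requested output were both left free by the previous cycle's switch allocation.

```python
# Sketch of the wing-channel grant check (hypothetical function names).
# A packet that fails the check falls back to a normal VC and the
# original two-stage speculative pipeline.

def grant_wing_channel(input_free: bool, output_free: bool,
                       wing_channel_empty: bool) -> bool:
    """Return True if the incoming packet may bypass the VA/SA stages."""
    return input_free and output_free and wing_channel_empty

assert grant_wing_channel(True, True, True)          # free path: bypass
assert not grant_wing_channel(True, False, True)     # output port busy
assert not grant_wing_channel(False, True, True)     # input port busy
```

The check is deliberately cheap: it reuses the previous cycle's allocation results rather than adding prediction or speculation logic.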
Besides, this unit is also responsible for selecting the proper packets to use the wing channel, as detailed in Section 2.2. Finally, we use the advanced request signal to perform the routing computation in advance. In the original router, the routing computation (or next-hop routing computation) must be completed before the switch traversal, and we find it difficult to


Fig. 1. Proposed low-latency router architecture.

Fig. 2. Proposed aggressive pipeline.

combine them [14] in a single cycle without harming the frequency. The advantage of the proposed advanced routing computation is to decouple the routing and switch functions by pre-computing the routing information before the packet arrives. As shown in Fig. 1, because the SIG manager has already received the advanced request signal from the SIG controller of the upstream neighbor, incoming packets can directly enter the ST stage without repeating the routing computation.

Fig. 2 depicts the original and proposed aggressive pipelines, on which a header flit is transferred from router 0 to router 2. The header flit in the original router consumes at least two consecutive cycles per hop to go through the VSA/NRC and ST pipeline stages. In our proposed router, since the routing information is prepared prior to the packet's arrival, the header flit can go through the crossbar switch within a single cycle once its input and output ports are both free. In the scenario shown in Fig. 2, the arriving flit at router 0 detects contention with other packets, so it selects a normal virtual channel and is forwarded to the right output through the original two-stage pipeline. Once the header flit succeeds in virtual channel arbitration, it sends the advanced request signal (SIG) downstream in the next cycle for routing pre-computation at the neighboring node. The header flit is then able to bypass the pipeline at router 1 by directly entering the wing channel as soon as it arrives, because its forwarding path is inspected to be free. In the previous LT pipeline stage, router 1 has generated the virtual channel mask to determine the candidates for each output port among all competing wing channels. In the FST stage, router 1 chooses the single winning header flit to be transferred through the crossbar switch according to the VC masks, and propagates its advanced SIG downstream to router 2. At router 2, the header flit transfer is likewise completed in a single cycle, similar to that at router 1.
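The pipeline comparison of Fig. 2 amounts to a simple per-hop cycle count, sketched below (a hypothetical helper, assuming the paper's two-cycle VSA/NRC + ST baseline and one cycle per bypassed hop; it ignores link traversal, which is identical in both cases).

```python
# Rough per-hop latency model for the Fig. 2 scenario (a sketch, not
# measured data): contended hops use the 2-stage speculative pipeline,
# bypassed hops use the single-cycle wing-channel path.

def path_latency(hops_bypassed: int, hops_contended: int,
                 baseline_stages: int = 2) -> int:
    """Router cycles for a header flit to traverse the given hops."""
    return hops_bypassed * 1 + hops_contended * baseline_stages

# Router 0 contends (2 cycles); routers 1 and 2 are bypassed (1 cycle each).
assert path_latency(hops_bypassed=2, hops_contended=1) == 4
# The same three hops through the original pipeline would need 6 cycles.
assert path_latency(hops_bypassed=0, hops_contended=3) == 6
```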

For the purpose of comparison, Fig. 3a and b illustrate the packet transfers at each input when deployed with the original and proposed routers respectively; the x axis denotes time, the y axis denotes the input port, yellow boxes are transfers from the wing channel and gray boxes are transfers from normal channels. Here, we capture the detailed router behavior for a certain period according to our cycle-based simulations. At the beginning, three different packets at inputs 1, 3 and 4 are applying for switch traversal toward outputs 4, 2 and 2, while three other incoming packets, to be delivered to outputs 2, 3 and 1, just arrive at inputs 1, 2 and 4 respectively. From this initial scenario, the two routers produce completely different transfer behaviors due to their own scheduling principles, even assuming the same packet injections in the following cycles. Collecting all the time slots of the crossbar switch, it can be seen that, compared with the original router (Fig. 3a), the proposed router provides up to 45.1% reduction in latency and up to 31.7% improvement in throughput, mainly due to the single-cycle transfers and decreased contention. Based on a detailed analysis of the router behavior, the proposed router clearly outperforms the original one, for the following reasons. First, by inspecting the switch allocation results of the previous cycle, the proposed router allocates the free time slots of the crossbar switch to single-cycle transfers of new packets in the wing channel. This reduces packet latencies and improves switch allocation efficiency, thereby raising throughput. Second, the single-cycle transfers of new packets in the wing channel advance all the packets at the same input and significantly reduce contention with subsequent packets from other inputs, which translates directly into higher throughput. Next, we provide two case studies.
The first case happens at cycle 4, where the incoming packets from inputs 2, 3 and 4 rush to output 4 simultaneously. In the original router, the switch


Fig. 3. Effect of packet transfer through the proposed router.

allocation chooses only a single winner, which leads to idle cycles at inputs 3 and 4, as shown in Fig. 3a. With the proposed router, more packets are scheduled in a shorter period. At cycle 5, the packet from the normal channel of input 4 is forwarded toward output 2 earlier, thanks to the fast transfer of the previous packets from input 4 to output 3; these fast transfers are directly attributable to the proposed wing channel. Besides, the new packet at input 2 performs a single-cycle transfer through the wing channel as soon as it detects that output 4 is idle, thereby improving switch utilization. For the second case, let us consider the frequent contention after cycle 7, as shown in Fig. 3a. In the original router, the packet at input 4 is delivered to output 3 one cycle later than in our proposed router at cycle 4, which directly suspends the transfers of subsequent packets forwarded to outputs 2 and 4. It can be seen that contention at outputs 2 and 4 becomes higher after cycle 7 compared with our proposed router, mainly owing to competition between these suspended packets and incoming packets from other inputs.

2.2. Wing channel

This section describes the principles of the wing channel for single-cycle transfer:

(1) Dispensation of the wing channel. When a header flit arrives at the input, it is granted the wing channel if both its input and output are inspected to be free. In this case, the header flit will transfer through the crossbar in a single cycle without competing for network resources.

(2) Allocation of virtual channels. To support single-cycle transfers in the proposed router, the wing channel is granted higher priority over the normal channels in winning the output VC. If multiple wing channels of different inputs compete for the same direction, the router uses a simple and practical VC mask mechanism to cancel most of the VC requests, producing only a single winner per output in each cycle.

(3) Arbitration of the crossbar switch. A new packet in the wing channel sends its request to the FSA unit only if its output VC has been granted or its VC mask is invalid. If it wins the fast arbitration, the header flit is delivered to the right output and its body flits follow in sequence. In this scenario, the transfer requests from the normal channels of the same input are canceled until the wing channel becomes empty or the tail of the packet leaves the router. However, if its fast arbitration fails, an idle cycle at the input port is inevitable, and the flits of the other normal channels go through the switch in the next cycle according to the current switch allocation results.

(4) Transmission of the advanced SIG. The SIG encodes the logical channel identifier, the NRC result and the destination node. For an 8 × 8 mesh, this wiring overhead is ten bits, around 7.8% of the forward wiring, assuming 16-byte-wide links. In the proposed router, if the VC mask of a new packet in the wing channel is invalid, its advanced SIG is given the highest priority and forwarded downstream immediately. For packets with valid VC masks, or packets in normal channels, the advanced SIG is instead sent to the SIG FIFO one cycle after the output VC is granted; the latter thus has lower priority than the former. With this principle, we ensure that the routing computation completes before the packet arrives, because the advanced SIG always arrives at least one cycle earlier than the header flit.

Next, Fig. 4 shows a timing diagram of a packet transfer from router 1 to router 5 following these principles. At cycle t2, the header flit arriving at router 2 enters the wing channel after inspecting the switch allocation results, transfers through the router, and forwards its advanced SIG in the next cycle. When the header reaches router 3, its routing computation is already completed. But


Fig. 4. Packet transfer with principles of wing channel.

Fig. 5. Effect of fast arbitration for wing channel.

at cycle t4, this flit cannot be delivered immediately due to contention with the packets of other wing channels. The router sets the virtual channel mask and repeats the request for the output VC during [t4, t5]. Upon winning the VC arbitration at cycle t5, the router immediately sends the advanced SIG to the FIFO and, at the same time, continues the flit's trip through the FST stage. Note that although the header flit stays at router 3 for more than one cycle, it allows the packets of other channels to be transferred in the meantime without affecting network performance. Finally, due to the contention inspected at cycle t7, this header flit selects a normal channel and transfers through the original pipeline stages, leaving the wing channel to subsequent new packets.

Based on the above principles, the FA unit allocates the free time slots of the crossbar switch to new packets upon arrival, improving network performance and reducing power consumption. Fig. 5 illustrates the effect of fast arbitration for the wing channel. At cycle 4, the completion of the wing-channel transfer prompts the normal channels at the same input to request their outputs immediately. The winning packets at the switch port can then traverse the crossbar, leaving no idle cycle on the input link of the switch, which translates directly into higher throughput. Next, at cycle 8, the empty slot caused by the SA failure of cycle 7 is filled by the single-cycle transfer of the FA, which also yields a significant throughput improvement. However, an idle cycle appears on the input link at cycle 10: the transfer from the wing channel is interrupted because its following flits have not yet arrived, and our proposed router then begins to

restart the output requests of the other normal channels at the same port, which results in one idle cycle at the switch input. This scenario is unavoidable but causes no serious performance degradation owing to its low frequency. On the other hand, our proposed router cancels the output requests of the other channels during a wing-channel transfer, which decreases the arbitration power consumption in most cases.

3. Micro-architecture design

Based on the original router, this section presents the main micro-architecture changes, including the channel dispenser, fast arbitration, SIG controller and SIG manager units.

3.1. Channel dispenser

Fig. 6a shows the micro-architecture of the channel dispenser, which is mainly composed of the VC assigner, VC tracker and VC table. The VC table forms the core of the dispenser logic: it is a compact table, indexed by logical channel id, that holds the physical channel id and tail pointer in each entry. With the VC table, the VC tracker simply provides the next available channels by keeping track of all the normal ones. On receiving the information from the VC tracker, the VC assigner decides whether to grant a wing channel or a normal one to the new packet based on the generated wing


Fig. 6. Channel dispenser (a), SIG manager and controller (b).

flag. Once any channel is granted to an incoming packet, the VC assigner provides the dispensed channel id to the VC table to update the status of available channels. The VC assigner is also responsible for improving throughput by fully utilizing all virtual channels at high rates: when all the normal VCs at the local input are exhausted, the VC assigner is forced to allocate the wing channel to the new packet. In this way, when the normal VCs are occupied and the generated wing flag is false, the wing channel serves as a normal channel, and a packet allocated to the wing channel applies for the channel and switch just like packets in the normal channels. Our experiments in Section 4.3 validate the effectiveness of this technique under different routing schemes.

The channel dispenser also guarantees protocol-level and network-level deadlock freedom by relatively low-cost methods. To that end, we use different VC sets for the request and response message types to avoid protocol-level and network-level deadlock. The proposed router includes two VC sets per port: the first VC set, composed of channel 0 (i.e. the wing channel), channel 1 and channel 2, serves request packets; the second VC set, comprising channel 0, channel 1 and channel 3, serves response packets. Since channel 2 and channel 3 are used exclusively by request and response packets respectively, deadlock freedom is provided in two respects. First, to provide deadlock recovery under adaptive routing, channel 2 and channel 3 employ a deterministic routing algorithm to break network deadlock. For any packet in the wing channel or channel 1, once it has been checked for a possible deadlock situation, the channel dispenser unit of the neighboring node grants channel 2 or channel 3 to this packet with higher priority, thereby using deterministic routing to break the network deadlock. Second, the separate channels 2 and 3 for the two message types break the dependencies between request and response messages: when both the shared channel 0 and channel 1 at the neighboring node are held by messages of the other type, the current message can still be transferred through the exclusive VC of its own type, breaking the dependency cycle between message classes.

To satisfy the tight timing constraint, the real-time generation of the wing flag is very important, as shown in Fig. 6a, where we adopt a favorable tradeoff between limited timing overhead and fast packet transfer. In the dispenser unit, the inputs of the second-phase switch arbitration are used to inspect the output state of the next cycle; however, the inspection of the input state using the results

of the original SA pipeline would prove intolerable, because it seriously affects the critical-path timing. Instead, we take the previous outputs of the switch allocation as the default input state, thereby shortening the critical path. In this scenario, a winning request from the switch allocation of the current cycle affects the single-cycle transfer of the wing channel, but does not harm network performance under high load, because the winning packet is transferred in the next cycle anyway. As illustrated in Fig. 6a, the NRC information from the header flit controls the MUX to select the inputs of the second-phase switch arbitration according to its output direction. Only when both the MUX output and the previous output of the switch allocation are false is the wing channel granted to the new packet. Using the model in [13], we compare the delay of channel dispensation, indicated by the thick line, with the delay between the input of the second-phase switch arbitration and the VSA stage results, as shown in Eqs. (1) and (2) respectively. In the typical case of 5 ports and 4 channels, the former equals 9.3 FO4s, close to the 9.9 FO4s of the latter. In addition, some timing overhead is introduced to generate the buffer pointer based on the link traversal. With the same model, a delay of 3 FO4s is added compared with the v-input MUX of the original router. Here, the delay of a 1 mm wire after placement and routing in the 65 nm process is extracted to be 8.2 FO4s, far less than the 18 FO4s of the original critical path. Hence, this overhead does not harm the network frequency.

t1(n) = log4(2n) + (11/15)n + 4        (1)

t2(n, v) = (33/10)log4(n) + n/3 + (5/12)log4(v) + 3        (2)
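As a quick numeric check of the delay model's headline figure, the sketch below evaluates the channel-dispensation delay of Eq. (1), read as t1(n) = log4(2n) + (11/15)n + 4 with n the number of ports (our reading of the garbled typesetting; the function name is ours).

```python
from math import log

def t1(n: int) -> float:
    """Channel-dispensation delay in FO4s for an n-port router (Eq. (1))."""
    return log(2 * n, 4) + 11 * n / 15 + 4

# Typical case cited in the text: 5 ports.
print(round(t1(5), 1))  # 9.3, matching the 9.3 FO4s quoted above
```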

3.2. Fast arbiter component

Taking the east port as an example, Fig. 7 shows the micro-architecture of the fast arbiter, which includes the virtual channel mask logic at the upper left and the transfer arbitration logic at the lower right. First, the grant bit for the wing channel of a certain input is set when the incoming header intends to perform a single-cycle transfer from this input to the east output, as discussed in Section 2.1. To do so, the incoming header always needs to check the states of the wing channel and its forwarding path, as shown in Fig. 7. Then, based on these grant bits for the wing channels, the VC mask logic shields multiple requests competing for the same output. Here the state indicating that one input request has priority over


Fig. 7. Fast arbiter and virtual channel arbitration components.

another is stored in a matrix, which is updated each cycle in round-robin fashion to reflect the new priorities. With the current matrix state, the mask bit of an input is set when any input request of higher priority is asserted. Finally, this logic asserts the new request of a wing channel to compete for the output if the conditions on the grant bit and the VC mask bit are both satisfied. In our design, any valid bit of the wing grant register also generates a request mask signal that shields the input requests of all the first-phase VC arbiters for the particular output, thereby reducing the energy overhead of VC arbitration. The packets in the wing channel perform single-cycle transfers based on the fast arbitration results, and thus the delay of the entire path through the VC mask register, fast arbiter and crossbar traversal is calculated in Eq. (3), where the result of 14.3 FO4s, in the case of five ports and 16-byte-wide links (w = 128), is lower than the 18 FO4s of the original critical path. In addition, with the same model, the timing overhead of the wing grant and VC mask based on the link traversal is calculated to be about 4.0 FO4s, which also has no influence on the critical path.

t3(n, w) = (67/10)log4(n) + 3 + (1/60)log4(w)        (3)
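The round-robin priority matrix described above can be sketched behaviorally as follows (a minimal sketch with hypothetical names; the hardware is combinational, whereas this model is sequential): prio[i][j] means input i currently beats input j, and after each grant the winner drops to lowest priority.

```python
# Behavioral sketch of a matrix arbiter with round-robin priority update
# (hypothetical structure; not the paper's circuit).

def pick_winner(requests, prio):
    """Return the requesting input that beats every other requester, or None."""
    for i, req in enumerate(requests):
        if req and all(prio[i][j] for j, r in enumerate(requests)
                       if r and j != i):
            return i
    return None

def update_priority(prio, winner):
    """Round-robin update: the winner yields to every other input."""
    for j in range(len(prio)):
        prio[winner][j] = False
        prio[j][winner] = True

# Initially input 0 has the highest priority (prio[i][j] is True when i < j).
prio = [[i < j for j in range(4)] for i in range(4)]
assert pick_winner([True, False, True, False], prio) == 0
update_priority(prio, 0)
# After the update, input 2 beats input 0 on the rematch.
assert pick_winner([True, False, True, False], prio) == 2
```

The matrix update is what keeps the arbitration fair: a just-served input cannot starve the others, matching the round-robin behavior stated in the text.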

3.3. SIG manager and SIG controller

Fig. 6b shows the SIG manager of an input and the SIG controller of an output respectively. The SIG manager completes the routing computation in the cycle after the advanced SIG arrives, and records the result in a compact SIG table, which holds the transfer destination and NRC information. For an incoming packet, the header flit uses its logical channel identifier to select and latch its SIG information from the SIG table of the input. Receiving the SIGs of multiple inputs, the SIG controller decides the SIG winner according to the new request from the wing channel in the next cycle. On the other hand, for the other packets in normal channels, or blocked packets in the wing channel, the two-phase VC arbitration logic provides their advanced SIGs instead. For low complexity, our proposed router adopts another two-phase VC arbitration scheme [15], in which the first-phase arbiters at each input produce winning requests for the different directions, and the second-phase logic at each

output generates a final winner at each cycle. Using the winners of the 1st phase, SIG selection is performed by a MUX in parallel with the 2nd-phase arbitration. Then, according to the 2nd-phase result, the SIG controller selects the SIG of the final winner and forwards it to the SIG FIFO in the next cycle.

4. Experiment results

4.1. Simulation infrastructure

In this section, we present a simulation-based evaluation of the proposed router in terms of latency, throughput, pipeline delay, area and power consumption under different traffic patterns. To model network performance, we developed a cycle-accurate simulator that models all major components of the router at clock granularity, including VCs, switch allocation and flow control. The simulator is configurable, allowing us to specify parameters such as network size, routing algorithm, channel width and the number of VCs, and it supports various traffic patterns, such as random, hotspot and nearest-neighbor. All simulations are performed on an 8 × 8 mesh network by simulating 8 × 10^5 cycles at each injection rate after a warm-up phase of 2 × 10^5 cycles. Each router is configured with five physical channels, including the local channel, and each port has a set of four virtual channels with a depth of four 128-bit flits. To carry the estimation further for the original and proposed routers, we first synthesize their RTL designs with a 65 nm standard cell library using Synopsys Design Compiler, and then derive the pipeline delay through static timing analysis with Synopsys Timing Analyzer. For power consumption, we complete the physical design of the above on-chip network and back-annotate the parasitic parameters extracted from the circuit layout into the netlist to evaluate the total power overhead. In this way, the switching activities of both routers and wires were captured through gate-level simulation at a core voltage of 1.2 V and a frequency of 500 MHz.
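The simulation setup above can be summarized in a small configuration record. This is our own illustration of the stated parameters, not the authors' simulator API; the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class SimConfig:
    """Parameters of the evaluation described above (8 x 8 mesh)."""
    mesh_width: int = 8
    ports: int = 5            # four mesh directions plus the local port
    vcs_per_port: int = 4
    vc_depth_flits: int = 4
    flit_bits: int = 128      # 16-byte flits
    warmup_cycles: int = 200_000
    measure_cycles: int = 800_000

    def buffer_bits_per_port(self) -> int:
        # Total buffering behind one input port
        return self.vcs_per_port * self.vc_depth_flits * self.flit_bits
```

With the default values, each input port holds 4 VCs × 4 flits × 128 bits = 2048 bits of buffering, which is consistent with the 16-buffer-per-port configuration used in the area comparison later.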
4.2. Pipeline delay analysis

Fig. 8 displays the distribution of pipeline delay for the two routers as the numbers of ports and channels vary. It is evident that

[Fig. 8 appears here: stage delay in FO4s of the VSA (VSA/FST), ST and LT stages for the original and proposed routers, plotted for n = 5 with v = 2, 4, 8 and for v = 4 with n = 5, 7, 9, 11.]

Fig. 8. Effect of two parameters on pipeline delay.

the wing channel incurs a modest timing overhead in the switch traversal and link traversal stages, but does not appreciably affect the delay of the VSA stage, which constitutes the critical path in our design. As the numbers of ports and channels grow, the difference in critical path delay becomes quite small. As shown in Fig. 8, the critical path delay of the proposed router is 8.9% higher in the case of five ports and two channels, but this gap shrinks to 3.2% and 1.3% when the channel count reaches 4 and 8, respectively. This is because the delay between the input of the 2nd-phase switch arbitration and the VSA-stage result grows as the channel count v increases, while the delay of channel dispensation is constant. With four channels, the critical path delay of our router tracks that of the original one well, because the change of t2(n, v) is close to that of t1(n) as n increases. As expected, the path through the fast arbiter and crossbar traversal never becomes the critical one as n grows in our timing analysis, which shows that the proposed router suits different port counts without obvious frequency degradation. Overall, the increase in critical path delay is around 3.7% on average across the different configurations.

4.3. Performance of latency and throughput

This section first compares three routers in terms of zero-load latency, as shown in Fig. 9. With increasing network size, the proposed router is clearly more efficient at reducing latency. The results show that the proposed router reduces zero-load latency by 34.3% compared to the original one when the network size reaches 16-ary. It also achieves 22.6% and 16.1% average reductions in zero-load latency over the prediction [10] and EVC [7] routers across different network sizes. Next, we evaluate the network performance of the proposed router as a function of injection rate and traffic pattern under different routing

Fig. 9. Comparison of zero load latencies.

schemes, and compare it with the original, prediction [10], EVC [7] and Kumar's single-cycle [11] routers. For simplicity, we refer to Kumar's single-cycle router as the SNR router; the prediction, EVC and SNR structures are all implemented and incorporated into our cycle-accurate simulator. To study the performance of EVCs, we configure one EVC within the set of four VCs and use the dynamic EVC scheme with a maximum EVC length of three hops. For better performance of the prediction router, we use 4-bit binary counters to build a history table recording the number of references to each output, so that the router achieves a higher hit rate with the finite-context prediction method. Finally, for a fair comparison, we use statically managed buffers in SNR and assign the same number of buffers to each VC per port when modeling the behavior of the major components.

Fig. 10 displays the network performance of the routers under XY deterministic routing. Under the random traffic pattern (Fig. 10a), the latency of the proposed router at a 0.1 flit/node/cycle injection rate is 13.1 cycles, closest to the ideal of 11.8 cycles, compared with 20.3, 18.6, 16.5 and 13.8 cycles for the others. Our proposed router performs single-cycle transfers at each node at low rates, whereas the prediction router takes three cycles whenever a prediction misses, and the EVC router still needs two cycles at the non-intermediate nodes along an express path. As the injection rate increases, our proposed router uses the single-cycle transfers of the wing channel to fill the free time-slots of the crossbar switch and to reduce contention with subsequent packets from other inputs, leading to a significant throughput improvement. With increasing network traffic, however, the performance of SNR, which is almost identical to that of our proposed router at low rates, begins to degrade markedly.
This degradation is attributable to SNR's strict precondition that no flit may reside in the input buffer when the advanced bundle arrives. Under higher traffic rates, SNR can rarely forward incoming packets within one cycle because of the many residual flits at the router input ports. Our proposed router shows 44.5%, 34.2%, 21.7% and 36.1% reductions in average latency compared with the other four routers, while improving throughput by 17.1%, 9.3% and 9.0% over the original, prediction and SNR routers. Under the hotspot pattern (Fig. 10b), the wing channel effectively reduces packet latency at the many low-load nodes, while improving switch throughput by shortening service time and reducing residual packets at the few high-load nodes. The simulation results show that our proposed router reduces average latency by 46.8%, 38.8% and 25.1% and improves throughput by 10.8%, 8.2% and 4.1% compared to the original, prediction and SNR routers, respectively. We also observe that the express channels at some low-load nodes are underutilized: because the EVC router periodically blocks EVC flits to avoid starving normal flits, many packets at low-load nodes cannot be forwarded through express channels occupied by others, which increases their latencies. As shown in Fig. 10b, our proposed router reduces average latency by 27.8% compared to the EVC router.
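The key contrast drawn above can be made concrete. SNR requires an empty input buffer for a single-cycle transfer, while the wing channel only requires that the requested output port was left unmatched by the current switch allocation (a free crossbar time-slot) and that a wing channel is available. The following sketch is our own illustration of that eligibility check; the signal names are assumptions, not the authors' RTL.

```python
def wing_channel_eligible(requested_output, sa_granted_outputs, wing_free):
    """True when an incoming header flit may take the wing channel
    for a single-cycle transfer: its output port is absent from the
    current switch-allocation grants (so a crossbar time-slot is
    free) and a wing channel is available at this input.  Note that
    buffer occupancy at the input does not appear in the condition,
    unlike SNR's empty-buffer precondition.
    """
    return wing_free and requested_output not in sa_granted_outputs
```

For instance, with outputs 0 and 1 already matched by the allocator, a header destined for output 2 is eligible, while one destined for output 1 is not and falls back to the normal pipeline.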


Fig. 10. Network performance with deterministic routing.
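The three synthetic patterns used in these comparisons can be sketched as simple destination generators on the 8 × 8 mesh. This is a minimal illustration under stated assumptions; the hotspot nodes and the hotspot probability below are our own choices, and the authors' simulator may generate the patterns differently.

```python
import random

def destination(src, pattern, k=8, hotspots=(27, 36), hotspot_prob=0.2):
    """Pick a destination node for one packet on a k x k mesh.

    Illustrative versions of the random, hotspot and
    nearest-neighbor traffic patterns used in the evaluation.
    """
    n = k * k
    if pattern == "random":
        # Uniformly random destination other than the source
        return random.choice([d for d in range(n) if d != src])
    if pattern == "hotspot":
        # With some probability, target one of a few hotspot nodes
        if random.random() < hotspot_prob:
            return random.choice(hotspots)
        return random.choice([d for d in range(n) if d != src])
    if pattern == "nearest-neighbor":
        # Destination is a randomly chosen mesh neighbor of the source
        x, y = src % k, src // k
        neigh = [(x + dx, y + dy)
                 for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < k and 0 <= y + dy < k]
        nx, ny = random.choice(neigh)
        return ny * k + nx
    raise ValueError(pattern)
```

The nearest-neighbor generator makes the single-hop behavior discussed below easy to see: every packet travels exactly one hop, so per-node pipeline depth dominates end-to-end latency.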

Under the nearest-neighbor pattern, the EVC router does not work well, taking two cycles at each node for communication between neighbors. The prediction router performs single-cycle transfers at the neighbor node but still needs three cycles at the source node when a prediction misses. By contrast, our proposed router and the SNR router perform better than the others, as shown in Fig. 10c, because their packets can traverse either the source or the destination node in a single cycle. The latency of our router is reduced by 27.0% and 10.2% compared to the EVC and prediction routers, respectively. In addition, Fig. 11 compares network performance under regional-congestion-aware adaptive routing for all routers except the EVC one, which is restricted to deterministic routing. Here, we use the combination of free VCs and crossbar demand as the congestion metric [16]. First, the proposed router outperforms the original and SNR routers in terms of packet latency and saturation throughput in all cases. For the SNR router, the adaptive routing algorithm forwards packets toward low-load regions, which reduces the contention probability and helps clear the input buffers by delivering incoming flits in time. As seen in Fig. 11a and b, at low injection rates the performance of SNR tracks that of the proposed router well. However, as network traffic increases, the growing number of residual flits per port begins to cancel the single-cycle pipeline operation at SNR routers. Note that this issue is avoided entirely in our proposed design, which guarantees the fast transfer of the wing channel. Our design outperforms the SNR router with a latency reduction near saturation of nearly 44.2%. Second, the prediction router introduces the latest-port-matching strategy to predict the output, which is intended to

provide high hit rates in most cases, but only approximately 68% of predictions hit in our experiments. On a prediction miss or a packet conflict, it requires at least three cycles to forward a header flit, in contrast with two cycles for our proposed router, which increases the number of residual packets and saturates network throughput earlier. Compared to the prediction router, our results in Fig. 11a and b show that average latency is reduced by 28.1% and 34.0% while network throughput is improved by 10.2% and 5.6% under the random and hotspot patterns, respectively.

Finally, we investigate the performance of the proposed router under realistic network traces collected from a 64-processor multiprocessor simulator [17] running the raytrace and barnes applications from the SPLASH-2 benchmarks [18]. Fig. 11c illustrates the simulation results of the three routers relative to the original one under adaptive routing. For raytrace, the communication pattern among nodes corresponds to low traffic load with large hop counts. In this case, switch contention is low and the router ports are mostly empty, so single-cycle transfers are realized at most SNR and proposed routers, whereas the prediction router consumes three cycles whenever its history table mispredicts for the given traffic pattern. The SNR and proposed routers provide up to 32.1% and 32.3% reductions in average latency, versus 19.4% for the prediction router. For barnes, which represents a moderate traffic rate with smaller hop counts, the multiple flits held in the input buffers seriously impede the SNR router's one-cycle transfer process. By contrast, our proposed router still enables single-cycle pipeline operation whenever its input and output ports are inspected to be free, regardless of how many flits are stored in the input buffers. Hence, the evaluation results in Fig. 11c show that with the barnes trace, our proposed router achieves a 28.6% latency reduction, outperforming the SNR router's reduction of around 19.3%.

Fig. 11. Network performance with adaptive routing.

4.4. Area/power consumption

This section estimates the areas of the original and proposed routers with a 65 nm standard cell library, as shown in Fig. 12a. Both are configured with 16 buffers per input port, so the wing channel incurs no extra area overhead for buffer resources. In our proposed router, we add the channel dispenser, the SIG manager and the fast arbiter, and modify the virtual channel

arbitration component. However, the area overhead of these components contributes little to the total area, which is dominated by the buffers; the total area of our router thus increases by roughly 8.1% compared with the original one. For power consumption, we investigate the 8 × 8 mesh network with adaptive routing at different injection rates in Fig. 12b. Because the increased number of wires incurs some extra link power, as reported in Section 2.2, both routers and forward wires are considered in the power evaluation, based on the physical design of the on-chip network. Although the additional switching activity of channel dispensation, SIG forwarding and transmission leads to a modest increase in power, the average power of the proposed router is still 7.8% lower than that of the original one. This is because

Fig. 12. Router area and network power consumption.


most packets, which perform single-cycle transfers at low rates, avoid the complex two-stage switch and VC arbitration activities. For the prediction router, the high miss rate results in frequent switching of some complex logic, e.g. abort detection, copy retransmission and the flit killer, which is power-hungry and completely avoided in our design. Hence, the power of the proposed router is lower than that of the prediction one at all rates, with an average reduction of 15.1%.

5. Conclusions

As continued technology scaling into the nanometer era drives up the number of cores, Network-on-Chip, as an effective communication fabric, will face tough challenges of low latency and high throughput. In this paper, we propose a practical router architecture with a wing channel, which is granted to incoming packets upon inspection of the switch allocation results to enable single-cycle transfers. Transfers through the wing channel also improve switch allocation efficiency and reduce contention with subsequent packets from other ports, thereby raising throughput effectively. The proposed router is designed in a 65 nm CMOS process and incurs little critical path delay overhead while exhibiting excellent scalability with increasing channels and ports. Simulation results from a cycle-accurate simulator indicate that our design outperforms the EVC, prediction and Kumar's single-cycle routers in terms of latency and throughput under various traffic patterns, and it provides a 45.7% latency reduction and a 14.0% throughput improvement compared to the speculative router. Moreover, although the router area increases by 8.1%, the average power consumption is reduced by 7.8% owing to fewer arbitration activities at low rates.

Acknowledgements

The authors would like to thank the anonymous reviewers for their suggestions and comments.
This work is supported by the National Basic Research Program (2007CB310901), the National Science Foundation of China (60903039, 60873015, 60736013, and 61070037), the Education Foundation of China (20094307120012) and the NUDT Research Program.

References

[1] Y. Hoskote et al., A 5-GHz mesh interconnect for a teraflops processor, IEEE Micro 27 (5) (2007) 51–61.
[2] S. Bell et al., TILE64 processor: a 64-core SoC with mesh interconnect, in: Proc. of ACM/IEEE ISSCC, 2008, pp. 88–90.
[3] B. Levine et al., Kilocore: scalable, high performance and power efficient coarse grained reconfigurable fabrics, in: Proc. of International Symp. on Advanced Reconfigurable Systems, 2005, pp. 129–158.
[4] T. Bjerregaard et al., A survey of research and practices of network-on-chip, ACM Computing Surveys 38 (2006) 1–51.
[5] S. Bourduas et al., A hybrid ring/mesh interconnect for network-on-chip using hierarchical rings for global routing, in: Proc. of ACM/IEEE NOCS, 2007, pp. 195–204.
[6] J. Kim et al., Flattened butterfly: a cost-efficient topology for high-radix networks, in: Proc. of ACM/IEEE ISCA, 2007, pp. 126–137.
[7] A. Kumar et al., Express virtual channels: towards the ideal interconnection fabric, in: Proc. of ACM/IEEE ISCA, 2007, pp. 150–161.
[8] D. Park et al., Design of a dynamic priority-based fast path architecture for on-chip interconnects, in: Proc. of IEEE HOTI, 2007, pp. 15–20.
[9] R. Mullins et al., The design and implementation of a low-latency on-chip network, in: Proc. of ASP-DAC, 2006, pp. 164–169.
[10] H. Matsutani et al., Prediction router: yet another low latency on-chip router architecture, in: Proc. of IEEE HPCA, 2009, pp. 367–378.
[11] A. Kumar et al., A 4.6 Tbits/s 3.6 GHz single-cycle NoC router with a novel switch allocator in 65 nm CMOS, in: Proc. of ICCD, 2007, pp. 63–70.
[12] G. Michelogiannakis et al., Approaching ideal NoC latency with preconfigured routes, in: Proc. of ACM/IEEE NOCS, 2007, pp. 153–162.

[13] L.S. Peh, Flow Control and Micro-architectural Mechanisms for Extending the Performance of Interconnection Networks, Ph.D. Thesis, Stanford University, 2001.
[14] R. Mullins et al., Low-latency virtual-channel routers for on-chip networks, in: Proc. of ACM/IEEE ISCA, 2004, pp. 188–197.
[15] C.A. Nicopoulos et al., ViChaR: a dynamic virtual channel regulator for network-on-chip routers, in: Proc. of IEEE MICRO, 2006, pp. 333–346.
[16] P. Gratz et al., Regional congestion awareness for load balance in network-on-chip, in: Proc. of IEEE HPCA, 2008, pp. 203–214.
[17] H. Chafi et al., A scalable, non-blocking approach to transactional memory, in: Proc. of IEEE HPCA, 2007, pp. 97–108.
[18] S. Woo et al., The SPLASH-2 programs: characterization and methodological considerations, in: Proc. of ACM/IEEE ISCA, 1995, pp. 24–36.

Mingche Lai (S'04–M'07) received the B.S., M.S. and Ph.D. degrees in computer engineering from the National University of Defense Technology, China, in 2003, 2005 and 2008, respectively. From 2008 to 2010, he was a Lecturer with the Department of Computer Science. Since 2007, he has also been a faculty member with the National Key Laboratory for Parallel and Distributed Processing of China. His research interests include computer architecture, hardware/software co-design, VLSI design, on-chip communication and optical communication. He became a Member of the IEEE and the ACM in 2007, and has served as a technical reviewer for several conferences and journals, e.g. DAC 2009 and ICCAD 2009. Since 2004, he has authored more than 35 papers in internationally recognized journals and conferences. Contact him at [email protected].

Lei Gao (M'09) received the Ph.D. degree from the National University of Defense Technology, Hunan, China, in 2009. She is currently an assistant researcher at the National Key Laboratory for Parallel and Distributed Processing of China, and she also works with Intel as a System Design Engineer. Her research interests include computer-aided design, VLSI design, multi-thread programming and hardware/software co-design. Contact her at [email protected].

Sheng Ma (S'07) received the B.S. degree in computer science and technology from the National University of Defense Technology, P.R. China, in 2007. He was admitted as a Ph.D. student ahead of schedule in February 2009, owing to his excellent performance during his master's study. He is now a Ph.D. candidate in the School of Computer, National University of Defense Technology. His research interests include computer architecture, multimedia-oriented computer systems and on-chip network design. He served as a technical reviewer for the 16th ICECS international conference. Since 2009, he has authored several papers in international conferences and journals. Contact him at [email protected].

Nong Xiao (M'04) received the Ph.D. degree in computer science and technology from the National University of Defense Technology, P.R. China, in 1996. Since 2004, he has been a Professor with the Department of Computer Science, and he is also a senior member of the National Key Laboratory for Parallel and Distributed Processing of China. His research interests include computer architecture, embedded systems, grid computing and large-scale storage. He became a Member of the IEEE and the ACM in 2004. Prof. Xiao has contributed four invited chapters to book volumes and published 100 papers in archival journals and refereed conference proceedings. He has also developed many grid computing products, e.g. the monitor system of China Grid. Contact him at [email protected].

Zhiying Wang (M'02) received the Ph.D. degree in computer science and technology from the National University of Defense Technology, P.R. China, in 1989. He is currently the Deputy Dean and a Professor of computer engineering with the Department of Computer, National University of Defense Technology, Hunan, China. He has contributed 10 invited chapters to book volumes, published 140 papers in archival journals and refereed conference proceedings, and delivered over 30 keynotes. His current research projects include asynchronous microprocessor design, nanotechnology circuits and systems based on optoelectronic technology, and virtual computer systems. Prof. Wang became a Member of the IEEE and the ACM in 2002 and 2003, respectively. His main


research fields include computer architecture, computer security, VLSI design, reliable architecture, multi-core memory systems and asynchronous circuits. Contact him at [email protected].