QBNoC: QoS-aware bufferless NoC architecture




Microelectronics Journal 45 (2014) 751–758



Na Zhang a, Huaxi Gu a,d,*, Yintang Yang b, Dongrui Fan c

a State Key Laboratory of Integrated Service Network, Xidian University, China
b Institute of Microelectronics, Xidian University, China
c Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China
d Science and Technology on Information Transmission and Dissemination in Communication Networks Laboratory, The 54th Institute of CETC, China


Article history: Received 12 February 2013; Received in revised form 1 April 2014; Accepted 8 April 2014; Available online 13 May 2014

Abstract

Chip area and power consumption are the main restrictions for Network-on-Chip (NoC), and a high proportion of both is consumed by the buffers in the routers. Therefore, bufferless NoC, which completely eliminates in-router buffers, has been proposed. However, the existing bufferless NoC designs do not provide Quality-of-Service (QoS) guarantees. In this paper, we propose a QoS-aware bufferless NoC, named QBNoC. QBNoC employs a hybrid switching mechanism: circuit switching for the real-time application and wormhole switching for the other applications. In addition, in order to decrease the deflection probability and thus improve the performance of the network, we propose a new output port allocation policy, named Two-Stage Allocation (TSA). Furthermore, a new router architecture with a shorter critical path is designed for QBNoC. The evaluation results show that, by efficiently exploiting resources, our proposal significantly improves the performance of the whole network while satisfying the QoS requirements of different applications. © 2014 Elsevier Ltd. All rights reserved.

Keywords: Bufferless NoC; Quality-of-Service; QBNoC; TSA; Router architecture

1. Introduction

Buffering is a core issue in network systems, and numerous research groups have worked to determine a reasonable buffer size in modern networks [19,21]. The problem is even more serious in Network-on-Chip (NoC). For example, in the MIT RAW chip the on-chip interconnect occupies 30% of system power [6], and in the Intel Terascale chip it takes up 40% of system power [1]; buffers in the routers consume most of this power. Recent studies show that in-router buffers can be completely eliminated, reducing power consumption by 20–40% and router chip area by 75% [2,5]. The resulting simplification of the router architecture is also considerable: in general, a bufferless router only needs pipeline registers, a crossbar, and arbitration logic. In a bufferless NoC, packets that fail in the contention for an output port are either deflected to other available ports [2,5] or discarded and retransmitted by the source node [3,4,20]. Frequent deflections or retransmissions may degrade network performance. However, under low or moderate network load, packet dropping and deflection rarely occur and thus have minimal impact on performance.

* Corresponding author. E-mail addresses: [email protected] (N. Zhang), [email protected] (H. Gu). http://dx.doi.org/10.1016/j.mejo.2014.04.015

For applications with strict latency limits, the NoC needs to provide satisfactory service within deterministic latency and throughput bounds [12]. It is obvious that Quality-of-Service (QoS), which refers to the capacity of a network to control traffic constraints so as to meet the communication requirements of an application or of some of its specific modules, is also a key issue in NoC design [7,15,22]. It is expected that NoC will support a variety of applications with different time constraints [9,14]. In general, the traffic is classified into four service levels: signaling (for control signals between nodes), real-time (representing latency-limited bit streams), RD/WR (modeling short data accesses), and block-transfer (handling large data bursts) [17]. Under this situation, the NoC should provide different levels of support for these applications. Therefore, routers are required to exploit the limited bandwidth resources effectively and to meet the different QoS requirements of different applications as well. A NoC with QoS guarantees was first proposed in [16], providing two types of service. QNoC is another design addressing the QoS problem [17], in which packets of different service levels are transferred in an interleaved manner. On the basis of QNoC, router area is reduced via a modified buffering method [18]. To date, however, the QoS problem has not been studied for bufferless NoC. The priority policies that current bufferless NoC designs adopt are merely used to avoid livelock, neglecting the different QoS requirements of traffic with different service levels. For instance, BLESS prioritizes the packets



arriving at the same router simultaneously according to their "age" in the network [2]. Besides, in CHIPPER the Golden Packet has a higher priority than all other packets, and every packet in the network eventually becomes the Golden Packet as the packet IDs are traversed [5]. It is clear that, although packets are assigned different priorities, what existing bufferless NoCs provide is a livelock-free Best Effort (BE) service, which only guarantees the delivery of all packets from a source node to a destination node but provides no bounds on throughput or latency. In order to improve the performance of the whole network and guarantee the QoS requirements of different applications, a bufferless NoC needs additional mechanisms. This paper proposes a QoS-aware bufferless NoC (QBNoC) based on the mesh topology. Firstly, QBNoC makes use of a new priority policy which prioritizes packets mainly in accordance with their service levels. Secondly, QBNoC combines two switching mechanisms: circuit switching for the real-time application, to ensure its low-latency requirement, and wormhole switching for the other applications, to make the most of link resources. Thirdly, QBNoC uses a new two-stage port allocation strategy, called Two-Stage Allocation (TSA), to reduce the deflection probability of the network. Finally, a new router architecture is designed for QBNoC to shorten the critical path. To our knowledge, this is the first work that considers the QoS issue of bufferless NoC.

The remainder of this paper is organized as follows. Section 2 introduces the critical techniques of the QoS-aware bufferless NoC. We show the QBNoC router architecture in Section 3. The simulation results and discussions are given in Section 4. Finally, we present the conclusions in Section 5.

2. QoS-aware bufferless NoC

2.1. Hybrid switching mechanism

With the increasing number of applications executing simultaneously, a principal objective of NoC is to serve all communication requirements of the heterogeneous IP cores within the chip. In descending order of latency requirements, QBNoC supports three kinds of applications: real-time, read/write, and block-transfer [17]. Different applications request different QoS guarantees, so a single switching mechanism cannot fully meet the demands of all applications [9]. Thus, most studies adopt a hybrid switching mechanism, combining the characteristics of two or more basic switching mechanisms, to provide different QoS guarantees for different applications [9,23,24]. The main idea of hybrid switching is to adopt different switching mechanisms for different applications. Circuit switching, which is connection-oriented, can bound the delay; however, frequent path establishment incurs a large overhead and wastes network resources. Compared with circuit switching, wormhole switching achieves much higher link utilization. Therefore, the low latency of the real-time application and the full utilization of link resources can both be achieved by combining these two mechanisms. As in general circuit switching, real-time communication in QBNoC is divided into three stages: path establishment, data transmission, and path release (a minimal code sketch of this lifecycle follows the list below).

• Path establishment: the source node must send a Setup Packet to the destination node before the Real-time Packet transmission starts. The Setup Packet carries the address information of the destination node and reserves the intermediate nodes for the Real-time Packet transmission hop by hop. An ACK Packet is generated by the destination node and sent back to the source node along the reserved path after the Setup Packet is received. The source node confirms that the path has been established successfully once it receives the ACK Packet.
• Data transmission: the Real-time Packet is transmitted along the reserved path.
• Path release: path release is executed in parallel with the Real-time Packet transmission, so that the link resources can be reused by other real-time communications.
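As a concrete illustration of this three-stage lifecycle, the following Python sketch models one real-time transaction; the function names, the representation of a path as a list of unidirectional links, and the event ordering are our own simplifications for illustration, not part of the QBNoC specification.

# Minimal sketch of one real-time transaction in QBNoC (our own abstraction:
# a path is a list of unidirectional links, each identified as (src, dst)).
def xy_path(src, dst):
    """XY route from src to dst on a mesh, returned as unidirectional links."""
    (x, y), hops = src, []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        hops.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        hops.append(((x, y), (x, ny)))
        y = ny
    return hops

def real_time_transaction(src, dst, reserved):
    # 1. Path establishment: the Setup Packet reserves the forward links hop by
    #    hop; the ACK Packet travels back over the reverse links.
    forward = xy_path(src, dst)
    reverse = [(b, a) for (a, b) in reversed(forward)]
    if any(link in reserved for link in forward + reverse):
        return False                     # another real-time flow holds the path
    reserved.update(forward + reverse)
    # 2. Data transmission: the Real-time Packet follows the reserved forward
    #    path (path release proceeds in parallel in the real router).
    # 3. Path release: links become reusable by other real-time communications.
    for link in forward + reverse:
        reserved.discard(link)
    return True

reserved_links = set()
print(real_time_transaction((0, 0), (2, 1), reserved_links))   # True

Note that the sketch holds both directions of every traversed link for the duration of the transaction, mirroring the rule, explained next, that a reserved path blocks other real-time communications in both directions.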

Note that ACK Packets are transmitted in the same network as Setup Packets and Real-time Packets in QBNoC. Therefore, after a path is reserved by a real-time communication, none of the bi-directional links along this path can be used by other real-time communications. We apply wormhole switching to the read/write and block-transfer applications, which have comparatively relaxed latency restrictions, as well as to Setup Packets [8]. Since there are no in-router buffers, we add a packet-truncation mechanism to wormhole switching [2]. Thus, there are five different packet types: Setup Packet, ACK Packet, and Real-time Packet in real-time communication, RD/WR Packet in the read/write application, and Block Packet in the block-transfer application. Setup Packets are control signals, since they are used for path establishment in the real-time application, whereas Real-time Packets carry the true real-time data. Therefore, as mentioned in Section 1, the traffic within QBNoC can also be divided into four classes: signaling, real-time, RD/WR, and block-transfer.

It is worth noticing that link resources can be fully and rationally utilized in QBNoC. Once a path has been occupied by a real-time communication, the other real-time communications cannot use the bi-directional links along this path until it is released. Nevertheless, RD/WR Packets and Block Packets can still exploit the idle unidirectional links along this path. For instance, as shown in Fig. 1, when a Setup Packet or Real-time Packet is being transmitted, only the output links of R1's east port, R2's east port, and R3's north port and the input links of R2's west port, R3's west port, and R4's south port are occupied, while the input links of R1's east port, R2's east port, and R3's north port and the output links of R2's west port, R3's west port, and R4's south port remain available. When the ACK Packet is being transmitted, the availability of the link resources is reversed. As a result, these remaining available link resources can be leveraged by the transmission of RD/WR Packets and Block Packets.

2.2. Routing algorithm

On the one hand, to minimize the resources occupied by real-time traffic, we employ the simple XY algorithm to route Setup Packets. Hence, the path reserved for a real-time communication is always a minimal path. On the other hand, as can be seen in Fig. 1, the intermediate routers on the path reserved by a real-time communication still have three pairs of available input and output ports, fulfilling the necessary condition for deflection [2]. Therefore, we apply the PMDR deflection algorithm to route RD/WR Packets and Block Packets [10]. To provide each packet with more choices, we slightly modify the PMDR algorithm: the productive output port belonging to the dimension with the most remaining hops to the destination is selected as the first routing computation result, and the other productive output port, if any, is selected as the second routing computation result. Both RD/WR Packets and Block Packets carry two fields, result_st and result_nd, to store the first and the second routing computation results, respectively.

2.3. Priority policy

Existing bufferless NoCs prioritize packets according to various metrics such as "age", distance to the destination, deflection count, etc.



Fig. 1. Example of link utilization of QBNoC (R1 is the source and R4 the destination; the figure marks the routers reserved by the Setup Packet, the Setup/Real-time Packet transmission path, and the ACK Packet transmission path).

However, the QoS requirements of different applications are ignored [10]. In our priority policy, four priority levels, from high to low 0, 1, 2, and 3, are defined according to the QoS requirements of the different applications. In order to guarantee the transmission of real-time traffic, Setup Packets are assigned Priority 0. Since RD/WR traffic has a stricter latency requirement than block-transfer traffic, RD/WR Packets and Block Packets are assigned Priority 2 and Priority 3, respectively. Furthermore, packets with lower priority may be deflected too frequently, so a maximum hop value is set for the whole network. When flits of RD/WR Packets or Block Packets reach the maximum hop value, their priority is raised to Priority 1, higher than that of common RD/WR Packet and Block Packet flits but still lower than that of Setup Packets, so that the latency of link establishment is not affected. Besides, to completely avoid livelock, the flit with the largest hop count is prioritized among the flits with Priority 1. In addition, the transmission of ACK Packets and Real-time Packets occupies dedicated link resources; hence, there is no need to assign a priority level to them.
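A compact way to express this policy is sketched below in Python; the constant MAX_HOPS and the flit representation are our own illustrative choices (the paper only states that a maximum hop value exists, not its value).

MAX_HOPS = 16   # illustrative threshold, not a value from the paper

def assign_priority(flit):
    """Priority 0 is highest, 3 is lowest; ACK and Real-time Packets use
    dedicated reserved links and therefore never take part in arbitration."""
    if flit["type"] == "setup":
        return 0
    if flit["hops"] >= MAX_HOPS:        # promoted flits: above RD/WR and Block,
        return 1                        # still below Setup Packets
    if flit["type"] == "rd_wr":
        return 2
    if flit["type"] == "block":
        return 3
    raise ValueError("ACK/Real-time Packets are not prioritized")

def break_tie(priority1_flits):
    """Among Priority 1 flits, the largest hop count wins (livelock freedom)."""
    return max(priority1_flits, key=lambda f: f["hops"])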

2.4. Re-injection mechanism

On the one hand, under the TSA strategy there is a possibility that more than one Setup Packet contends for the same output port in the first stage. The failing Setup Packets do not take part in the second-stage competition, since they are not allowed to be deflected, and therefore cannot acquire any output port. On the other hand, as illustrated in Fig. 1, R1 has only three available output ports during the Setup Packet transfer, which implies that if four flits simultaneously arrive at an edge router of the path reserved by a real-time communication, the flit with the lowest priority is unable to acquire any output port. Since there are no in-router buffers to temporarily store the failing flits in the two cases above, these failing flits are sent back to the current IP core. If the IP core detects the reception of a flit whose destination is not the current node, it re-injects this flit into the network, in preference to newly generated flits, as soon as an input link is idle [11].

3. QBNoC router architecture

The QBNoC router architecture mainly consists of the Routing Computation Unit, the Scheduler Unit, the Ejection/Injection Unit, the First-Stage Allocation Unit, and the Second-Stage Allocation Unit, as shown in Fig. 2. Accordingly, the pipeline of QBNoC contains a routing computation stage, an ejection stage, an injection stage, and the two-stage allocation. In this section, each unit is introduced in the order of the pipeline.

3.1. Routing computation unit

As described in Section 2.2, the routing computation result of each Setup Packet is obtained via the XY algorithm and stored in the result_st field. The routing computation result(s) of each RD/WR Packet or Block Packet, obtained by the modified PMDR algorithm shown in Algorithm 1, are stored in the result_st field if there is only one productive output port, or in the result_st and result_nd fields if there are two productive output ports.

Algorithm 1. Deflection routing
1: Given: (Δ1, Δ2): the distance between the source and the destination
2: Given: 0–4 represent the east, west, north, south, and local ports, respectively
Begin
3: if Δ1 × Δ2 ≠ 0 then
4:   // The destination node is not in the same dimension as the current node.
5:   if |Δ1| > |Δ2| then
6:     if Δ1 > 0 then result_st ← 0
7:     else result_st ← 1
8:     end if
9:     if Δ2 > 0 then result_nd ← 2
10:    else result_nd ← 3
11:    end if
12:  else
13:    if Δ2 > 0 then result_st ← 2
14:    else result_st ← 3
15:    end if
16:    if Δ1 > 0 then result_nd ← 0
17:    else result_nd ← 1
18:    end if
19:  end if
20: else
21:   // The destination node is in the same dimension as the current node, or is exactly the current node.
22:   if Δ1 > 0 then result_st ← 0
23:   else if Δ1 < 0 then result_st ← 1
24:   else if Δ2 > 0 then result_st ← 2
25:   else if Δ2 < 0 then result_st ← 3
26:   else result_st ← 4
27:   end if
28: end if
29: End
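For readers who prefer an executable form, the following Python sketch is a direct transcription of Algorithm 1; the function name and the use of None for an absent second result are our own conventions, while the port encoding 0–4 = east, west, north, south, local follows the algorithm.

def route_pmdr(dx, dy):
    """Return (result_st, result_nd) for offsets (dx, dy) = (Δ1, Δ2);
    result_nd is None when only one productive port exists."""
    if dx != 0 and dy != 0:
        # Two productive ports exist; prefer the dimension with more hops left.
        x_port = 0 if dx > 0 else 1   # east / west
        y_port = 2 if dy > 0 else 3   # north / south
        if abs(dx) > abs(dy):
            return x_port, y_port
        return y_port, x_port
    # Destination in the same dimension as the current node (or the node itself).
    if dx > 0:
        return 0, None
    if dx < 0:
        return 1, None
    if dy > 0:
        return 2, None
    if dy < 0:
        return 3, None
    return 4, None                    # local port: eject at this node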



Fig. 2. QBNoC router architecture (per-direction input units feed the routing computation logic; each output port has a first-stage and a second-stage arbitration module containing per-priority queues; a scheduler unit selects the winning flits, and eject/inject request and grant signals connect the router to the local IP core).

Fig. 3. Port allocation situations.

3.2. Ejection/injection unit

A flit enters the ejection unit if its destination is the current node; otherwise it proceeds. In order to avoid flits being deflected due to failure in the contention for the local output port [13], the QBNoC router is designed with four local output ports, enough to satisfy the worst case in which four flits arriving at the router simultaneously all intend to eject. When there are flits in the local IP core waiting for injection into the network, the injection unit sends an Inject Request signal every cycle. Once some input link is idle and has received an Inject Request signal, an Inject Grant signal is sent to the injection unit, informing it that the first flit of the waiting queue can be delivered to that input link [5].

3.3. Two-stage allocation unit

In existing bufferless NoCs, the port allocation is executed exactly once. To decrease the deflection rate and the average hop count of the entire network, we designed Two-Stage Allocation (TSA). As shown in Fig. 2, the four modules from

top to bottom in each stage are used to resolve contention for the east, west, south, and north output ports. Each port arbitration module contains different priority queues, which can be implemented with registers. The inner structure of the east port arbitration modules in the two stages is also given in Fig. 2. After the injection stage, each flit is sent to the first-stage arbitration module of the port recorded in its result_st field and, in accordance with its priority level, is placed in the corresponding queue. For instance, if a Setup Packet requests the east output port, it is sent to Queue 0 of the east port arbitration module. In each module, if the first non-empty queue in priority order is Queue 1, the flit with the largest hop count in this queue obtains the corresponding port, as illustrated in Fig. 3(a). Note that the dashed box marks the winning flit, and the number inside the flits with Priority 1 represents their hop count. Otherwise, the first flit of the first non-empty queue wins the corresponding output port, as shown in Fig. 3(b) and (c). To ensure fairness between different input ports, the order in which flits from different input ports enter the arbitration modules is determined by polling. Besides, the router will reserve the input/output port acquired by the winning Setup Packet in the


competition. Meanwhile, the Setup Packets that fail in the first stage are sent directly to the current IP core (see the re-injection mechanism in Section 2.4). The second stage gives those RD/WR Packets and Block Packets that own two productive ports and fail in the first-stage contention another chance to compete for their productive ports. The flits winning in the first stage pass through and are sent directly to the obtained output ports, instead of taking part in the second stage. If the result_nd field of a failing flit is valid and the recorded output port is available, the flit is sent to the second-stage arbitration module of the recorded port and, according to its priority level, is placed in the corresponding queue. Which flit in each module obtains the corresponding port is then determined in the same way as in the first stage. The remaining failing flits are sent to any available output ports at random. Note that the transmission of flits is executed in parallel with the second-stage allocation.
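The following Python sketch (our own simplification, not the router logic itself; the flit fields mirror result_st, result_nd, and the priority levels defined above, everything else is illustrative) shows how one TSA allocation cycle can be organized:

import random
from collections import defaultdict

def pick_winner(flits):
    """Scheduler rule: the lowest priority number wins; within Priority 1 the
    flit with the largest hop count wins; otherwise arrival order is kept."""
    best = min(f["priority"] for f in flits)
    candidates = [f for f in flits if f["priority"] == best]
    if best == 1:
        return max(candidates, key=lambda f: f["hops"])
    return candidates[0]

def tsa_allocate(flits, free_ports):
    """Return ({flit_id: port}, flits_returned_to_ip) for one router cycle.
    Assumes at most four flits contend for the four output ports."""
    grants, returned = {}, []
    # --- first stage: every flit requests the port in its result_st field ---
    requests = defaultdict(list)
    for f in flits:
        requests[f["result_st"]].append(f)
    losers = []
    for port, contenders in requests.items():
        winner = pick_winner(contenders)
        grants[winner["id"]] = port
        free_ports.discard(port)
        losers += [f for f in contenders if f is not winner]
    # --- second stage: only flits with a valid, still-free result_nd retry ---
    second = defaultdict(list)
    for f in losers:
        if f["priority"] == 0:          # failing Setup Packets return to the IP core
            returned.append(f)
        elif f.get("result_nd") in free_ports:
            second[f["result_nd"]].append(f)
        else:
            second[None].append(f)      # will be deflected below
    deflected = list(second.pop(None, []))
    for port, contenders in second.items():
        winner = pick_winner(contenders)
        grants[winner["id"]] = port
        free_ports.discard(port)
        deflected += [f for f in contenders if f is not winner]
    # --- remaining losers are deflected to any still-free output port ---
    for f in deflected:
        port = random.choice(sorted(free_ports))
        grants[f["id"]] = port
        free_ports.discard(port)
    return grants, returned

A flit that loses in both stages is deflected to a random free port, which is exactly the event that the second stage is designed to make rarer.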

3.4. Scheduler unit

The function of the scheduler unit is to select the highest-priority flit among multiple queues, as presented in Algorithm 2. By polling the queues from high to low priority within each arbitration module, the algorithm finds the first non-empty queue. If the first non-empty queue is Queue 1, it selects the flit with the largest hop count in this queue as the winning flit; otherwise, it selects the first flit of this queue. In our QBNoC router, four local output ports can meet the ejection requests in the worst case, thus avoiding unwanted deflections. Besides, compared to the BLESS router [2], the port allocation of the QBNoC router is parallel, which significantly shortens the critical path. Furthermore, the second-stage allocation gives flits of RD/WR Packets and Block Packets another opportunity to acquire a productive output port, which also reduces the deflection rate of the network to some extent.

Algorithm 2. Scheduling for selecting winning flits
1: // At each cycle, in each port arbitration module:
2: Begin
3: if located in first stage then
4:   for i ← 0 to 3 do
5:     if Queue i is not empty then
6:       if i = 1 then
7:         select the flit with the largest hop count
8:       else
9:         select the first flit to acquire the port
10:      end if
11:      break
12:    end if
13:  end for
14: else if located in second stage then
15:  for i ← 0 to 2 do
16:    if Queue i is not empty then
17:      if i = 1 then
18:        select the flit with the largest hop count
19:      else
20:        select the first flit to acquire the port
21:      end if
22:      break
23:    end if
24:  end for
25: end if
26: End

4. Evaluation

In this section, we present the results of the performance evaluation of QBNoC in terms of average latency and throughput, and compare QBNoC with BLESS using wormhole switching [2]. In addition, we test the advantage of the TSA strategy over one-stage allocation in reducing the deflection probability.

Table 1. Packet generation model.

Packet type | Packet size (flits) | Generation rate (normalization)
Real-time Packet | 40 | 1/80
Setup Packet/ACK Packet (only exists in QBNoC) | 1 | –
RD/WR Packet | 4 | 1
Block-Transfer Packet | 2000 | 1/400

4.1. Simulation environment

In order to evaluate the performance improvement of QBNoC compared with BLESS, and to observe the influence of the TSA strategy on network performance, we build three cycle-accurate simulation platforms modeling an 8 × 8 mesh network with point-to-point 64-bit bidirectional links. One of them implements BLESS, while the other two implement QBNoC with TSA and QBNoC without TSA. In all the simulation platforms, we assume that the packet size and generation rate of each application are constant and the same for all nodes, as shown in Table 1. Once a packet is generated, it is stored in an infinite queue at the source node and waits to be injected into the network. The traffic pattern is uniformly distributed random traffic, which means the destination node of a packet is chosen randomly with the same probability among all nodes. No matter how many flits simultaneously arrive at the same destination node, all of them sink immediately. Routing computation together with ejection/injection is assumed to take 1 cycle, which equals 5 ns. Switch transmission and port allocation are also assumed to take 1 cycle. The link traversal time of packets transmitted with circuit switching depends on their size: 1 cycle for an ACK Packet and 40 cycles for a Real-time Packet. The per-hop link traversal time of packets transmitted with wormhole switching is 1 cycle.

Fig. 4. Average latency and throughput as a function of injection rate for QBNoC and BLESS.
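As a concrete illustration of this setup, the sketch below generates packets with uniformly random destinations on the 8 × 8 mesh; the class-like dictionary layout, the `load` scaling knob, and the function names are our own simplifications, while the sizes and normalized rates follow Table 1 (Setup/ACK Packets are generated as part of each real-time communication and are therefore not listed separately).

import random

MESH = 8
# (packet size in flits, normalized generation rate), following Table 1
TRAFFIC = {
    "real_time": (40, 1 / 80),
    "rd_wr":     (4, 1.0),
    "block":     (2000, 1 / 400),
}

def generate_packets(src, cycle, load=0.01):
    """Return the packets injected by node `src` in one cycle.
    `load` scales the normalized rates to an absolute injection rate."""
    packets = []
    for kind, (size, rate) in TRAFFIC.items():
        if random.random() < rate * load:
            dst = src
            while dst == src:            # uniform random destination, excluding self
                dst = (random.randrange(MESH), random.randrange(MESH))
            packets.append({"type": kind, "size": size,
                            "src": src, "dst": dst, "born": cycle})
    return packets

# example: traffic generated by node (0, 0) during the first 10 cycles
for t in range(10):
    for p in generate_packets((0, 0), t):
        print(p)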

4.2. Simulation results

In Fig. 4, we show the average latency and throughput of QBNoC and BLESS. As can be seen in Fig. 4(a), the average latency of QBNoC is considerably lower than that of BLESS at any injection rate. Besides, the latency of QBNoC with TSA is better than that of QBNoC without TSA. As shown in Fig. 4(b), when the injection rate reaches 0.33 flit/IP core/cycle, the throughput of QBNoC with TSA is approximately 1.5 times that of BLESS, while the throughput of QBNoC without TSA is only about 1.26 times as high as that of BLESS. In addition, the maximum sustainable injection rate of QBNoC with TSA is roughly 0.25 flit/IP core/cycle, much higher than that of BLESS.

In Fig. 5, we contrast the performance of the RD/WR application and the block-transfer application in QBNoC and BLESS. Compared to BLESS, both kinds of applications in QBNoC gain an improvement in latency. Besides, the saturation point and the trend of the curves remain nearly the same as in Fig. 4(a). In addition, the performance of each application in QBNoC with TSA is better than in QBNoC without TSA.

Fig. 5. Performance of the RD/WR application and the block-transfer application in QBNoC and BLESS.

The latency of each real-time communication in QBNoC is composed of two parts: the latency of the true real-time data transmission, which is a fixed value of 40 cycles, and the latency of path establishment. Hence, the performance of the real-time application in QBNoC depends on the transmission of Setup Packets. As shown in Fig. 6(a), the maximum sustainable injection rate of Setup Packet transmission is only about 0.12 flit/IP core/cycle, because the same resources cannot be shared between different real-time communications. Besides, the Setup Packet latency of QBNoC with TSA is only slightly better than that of QBNoC without TSA, because the TSA strategy mainly acts on RD/WR Packets and Block Packets rather than Setup Packets. Fig. 6(b) shows that the performance of the real-time application in QBNoC is still considerably better than in BLESS.

Fig. 6. Performance of real-time application in QBNoC and BLESS.

We can conclude that, in comparison with BLESS, not only the entire network's performance but also each single application's performance in our proposal gains a significant improvement, owing to two advantages of QBNoC. On the one hand, the circuit switching mechanism and the XY algorithm applied to the real-time application benefit performance: fewer resources are occupied by the real-time application for a shorter time in QBNoC, leading not only to better performance of the real-time application itself but also to less impact on the other applications. Fig. 7 shows the link utilization comparison between QBNoC and BLESS, assessed by the average frequency of occupancy per link per cycle. We can see that the link resources of QBNoC are more fully utilized than those of BLESS.

Fig. 7. Link utilization in QBNoC and BLESS.

On the other hand, the TSA policy further enhances the performance of the network by decreasing the deflection probability. As shown in Fig. 8, the average deflection hop counts of RD/WR Packets and Block Packets in QBNoC with TSA are lower than those in QBNoC without TSA, which explains why the performance of QBNoC with TSA is more satisfactory than that of QBNoC without TSA, as shown in Figs. 4–6. The link utilization of QBNoC without TSA is higher than that of QBNoC with TSA because more deflections occupy more resources, as shown in Fig. 7. Besides, due to the smaller size of RD/WR Packets, the reduction in their average deflection hop count is not as obvious as that of Block Packets.

Fig. 8. Average deflection hop counts of RD/WR Packet and Block Packet in QBNoC and BLESS.

In light of the two aspects above, the link resources of QBNoC are utilized more fully than those of BLESS, which is the critical factor contributing to the performance improvement.
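For clarity, the link-utilization metric used in Fig. 7 can be expressed as in the short sketch below; the helper function and the `occupied` data structure (a per-link record of the cycles in which the link carried a flit) are our own assumptions about how a simulator might log occupancy, not part of the paper.

def link_utilization(occupied, num_cycles):
    """Average frequency of occupancy per link per cycle:
    `occupied` maps each link to the set of cycles in which it carried a flit."""
    if not occupied or num_cycles == 0:
        return 0.0
    per_link = [len(cycles) / num_cycles for cycles in occupied.values()]
    return sum(per_link) / len(per_link)

# example: two links observed over 100 cycles
print(link_utilization({"R1->R2": set(range(40)),
                        "R2->R3": set(range(10))}, 100))   # 0.25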

5. Conclusions

We have proposed QBNoC to meet the QoS requirements of different applications on a bufferless NoC. The key features of QBNoC are that a hybrid switching mechanism is applied and that a packet's priority level is determined by its service level. Furthermore, a new port allocation strategy is proposed to decrease the deflection probability of the whole network. As a result, resources are utilized reasonably and the performance of the network is improved. Therefore, it is safe to conclude that, through rational utilization of resources, QBNoC satisfies the QoS requirements of different applications while improving the performance of the network.

Acknowledgment

This work is supported partly by the National Natural Science Foundation of China under Grants no. 61070046 and no. 61334003, the 111 Project under Grant no. B08038, and the fund from the Science and Technology on Information Transmission and Dissemination in Communication Networks Laboratory under Grant no. ITD-U12002.

References

[1] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, S. Borkar, A 5-GHz mesh interconnect for a teraflops processor, IEEE Micro 27 (5) (2007) 51–61.
[2] T. Moscibroda, O. Mutlu, A case for bufferless routing in on-chip networks, in: Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009, pp. 196–207.
[3] C. Gomez, M. Gomez, P. Lopez, J. Duato, BPS: a bufferless switching technique for NoCs, in: Proceedings of the 3rd International Conference on High-Performance Embedded Architectures and Compilers (HiPEAC), 2008, pp. 43–50.
[4] C. Gomez, M. Gomez, P. Lopez, J. Duato, How to reduce packet dropping in a bufferless NoC, Concurr. Comput.: Pract. Exp. 23 (1) (2011) 86–99.
[5] C. Fallin, C. Craik, O. Mutlu, CHIPPER: a low-complexity bufferless deflection router, in: Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2011, pp. 144–155.
[6] M. Taylor, J. Kim, J. Miller, D. Wentzlaff, The Raw microprocessor: a computational fabric for software circuits and general-purpose programs, IEEE Micro 22 (2) (2002) 25–35.
[7] A. Mello, N. Calazans, F. Moraes, QoS in networks-on-chip – beyond priority and circuit switching techniques, VLSI-SoC: Adv. Topics Syst. Chip 291 (2007) 1–22.
[8] W.J. Dally, B. Towles, Principles and Practices of Interconnection Networks, 2004.
[9] N. Stol, M. Savi, C. Raffaelli, 3-level integrated hybrid optical network (3LIHON) to meet future QoS requirements, in: Proceedings of the Global Telecommunications Conference, 2001, pp. 1–6.



[10] M. George, S. Daniel, J.D. William, K. Christos, Evaluating bufferless flow control for on-chip networks, in: Proceedings of the Fourth ACM/IEEE International Symposium on Networks-on-Chip, 2010, pp. 9–16.
[11] J.M. Martinez, P. Lopez, J. Duato, T.M. Pinkston, Software-based deadlock recovery technique for true fully adaptive routing in wormhole networks, in: Proceedings of the International Conference on Parallel Processing, 2007, pp. 182–189.
[12] B. Li, R. Iyer, M. Leddige, M. Espig, S.E. Lee, D. Newell, et al., CoQoS: coordinating QoS-aware shared resources in NoC-based SoCs, J. Parallel Distrib. Comput. 71 (5) (2011) 700–713.
[13] C. Fallin, G. Nazario, X. Yu, K. Chang, R. Ausavarungnirun, O. Mutlu, MinBD: minimally-buffered deflection routing for energy-efficient interconnect, in: Proceedings of the IEEE/ACM Sixth International Symposium on Networks-on-Chip, May 2012, pp. 1–10.
[14] E. Carara, G.M. Almeida, G. Sassatelli, F.G. Moraes, Achieving composability in NoC-based MPSoCs through QoS management at software level, in: Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), March 2011, pp. 1–6.
[15] C. Wang, N. Bagherzadeh, Design and evaluation of a high throughput QoS-aware and congestion-aware router architecture for network-on-chip, in: Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2012, pp. 457–464.
[16] K. Goossens, A. Peeters, R. Wielage, A. Peeters, Networks on silicon: combining best-effort and guaranteed services, in: Proceedings of the Design Automation and Test Conference in Europe, March 2002, pp. 423–425.

[17] E. Bolotin, I. Cidon, R. Ginosar, A. Kolodny, QNoC: QoS architecture and design process for network on chip, J. Syst. Archit. (2004) 105–128.
[18] M. Shariat, N. Azizibabani, Reducing router's area in NoC by changing buffering method while providing QoS, in: Proceedings of Intelligent Solutions in Embedded Systems (WISES), 2010, pp. 1–5.
[19] C. Gomez, M.E. Gomez, P. Lopez, J. Duato, An efficient switching technique for NoCs with reduced buffer requirements, in: Proceedings of the 14th IEEE International Conference on Parallel and Distributed Systems, 2008, pp. 713–720.
[20] M. Hayenga, N.E. Jerger, M. Lipasti, SCARAB: a single cycle adaptive routing and bufferless network, in: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009, pp. 244–254.
[21] J. Wang, H. Gu, Y. Yang, K. Wang, An energy- and buffer-aware fully adaptive routing algorithm for Network-on-Chip, Microelectron. J. 44 (2) (2013) 17–144.
[22] A. Jahanian, M.S. Zamani, Using metro-on-chip in physical design flow for congestion and routability improvement, Microelectron. J. 39 (2) (2008) 261–274.
[23] K.G. Shin, S.W. Daniel, Analysis and implementation of hybrid switching, IEEE Trans. Comput. 45 (6) (1996) 684–692.
[24] M. Modarressi, H.S. Azad, M. Arjomand, A hybrid packet-circuit switched on-chip network based on SDM, in: Proceedings of the Conference on Design, Automation and Test in Europe, 2009, pp. 566–569.