A low latency energy efficient BFT based 3D NoC design with zone based routing strategy





Avik Bose, Prasun Ghosal∗

Indian Institute of Engineering Science and Technology, Shibpur, Howrah 711103, West Bengal, India

Keywords: Network on chip; 3D NoC; Uniform hopping distance; Zone-based routing; Low network latency

Abstract: NoC, along with 3D IC technology, successfully addresses communication needs in complex many-core systems today. The major challenges are scalability, network efficiency, power consumption, and energy dissipation. Topology, along with an efficient routing strategy, can mitigate such issues. This work proposes a scalable 3D BFT based design along with a zone-based routing policy. A 12–76% minimum latency improvement has been observed compared to the state of the art across eight different traffic patterns under heavy traffic. An average gain of 22–88% in router power consumption prevents traffic hot spots, achieving a 4–32% throughput improvement.

1. Introduction and motivation

1.1. Introduction

For the last few decades, CMOS based systems have been facing severe challenges due to rapid technology scaling [1,2]. The increasing demand for a higher degree of parallelism out of the processing engine gradually led processor architectures to the super-scalar regime [1]. Modern microprocessors demand support for higher instruction-level parallelism because of their greater architectural complexity, as they have sophisticated micro-architectural features such as multiple instruction issue, dynamic branch prediction, out-of-order and speculative execution, and dynamic scheduling [3]. Instead of a large monolithic super-scalar processor, several smaller CPU cores working in tandem became an attractive choice for architects. Hence, many-core systems came into the picture. Eventually, this multi-core paradigm shifted the SoC designers' approach from a computation-centric to a more communication-centric model [1]. Integrating billions of transistors on a single die is possible today with technologies in the deep sub-micron range [4]. The on-chip bus interconnect failed to cope with the demand for high concurrency in System-on-Chip (SoC) designs, as it suffers from poor scalability with an increasing number of processing elements [1]. Network-on-Chip is the only viable alternative advocated for managing these communication necessities in Chip Multi-Processors (CMPs) and heterogeneous MPSoCs [1,5]. The ultra-deep sub-micron process has ushered in the integration of tens or even hundreds of processing elements on a single chip. The ever-increasing demand for thread-level parallelism has thrust the scalability of modern many-core devices in such a way that the emergence of future SoCs consisting of a large number (a few hundred) of processing cores is imminent [1,6,7]. The shrinking geometry of the silicon layer introduces various interconnect challenges, especially in many-core systems, as global wires are needed to connect components over substantial distances. As a result, with growing scalability, the design of the on-chip interconnection network as the communication backbone faces challenges regarding interconnect lengths and network diameter that affect communication latency. The design of the underlying network topology plays a vital role in balancing this trade-off.

1.2. Related research

More than a decade back, researchers pioneered the idea of an interconnection network for future SoCs [5]. Investigation [8] of the Shared Bus System, Point-to-Point (P2P) connection and Crossbar network validates that the choice of on-chip interconnect can influence the overall performance of a chip significantly. Dally et al. have shown how a network-on-chip eliminates the overhead of ad hoc global wiring by making the design more modular [9]. The design methodology of an on-chip packet-switched mesh network is presented in both [9,10]. Several interconnection topologies like mesh, torus, butterfly, etc. are thoroughly analyzed on different network parameters such as latency, throughput, packaging cost, etc. in [11]. Experimentation with inserting diagonal links in a mesh topology to decrease communication latency by reducing hop count is done in [12,13]. The inclusion of application-specific long-range links in the mesh network is examined in [14] for achieving a balanced traffic workload and relatively higher throughput with sustainable latency. The Flattened Butterfly topology is presented in [15] as a scalable interconnect fabric that shows improvements over the mesh and concentrated

∗ Corresponding author. E-mail addresses: [email protected] (A. Bose), [email protected] (P. Ghosal).

https://doi.org/10.1016/j.sysarc.2020.101738 Received 23 July 2019; Received in revised form 20 November 2019; Accepted 26 January 2020 1383-7621/© 2020 Elsevier B.V. All rights reserved.


mesh topologies in terms of latency and power consumption. Improving NoC performance has been attempted in [16] by incorporating additional diagonal links in a mesh. Designing fast routers [17–20] and devising network topologies [14,15,21] have both been primary focuses of researchers for decades to achieve low latency and high bandwidth solutions. Feero et al. have investigated how, by exploiting very low inter-layer distances, a 3D NoC improves performance in terms of network latency and throughput with a considerable gain in die area and power consumption, besides reducing the flat footprint of network designs as opposed to a 2D environment [22]. Efficient three dimensional router design for vertical hop reduction [23] and for achieving low power consumption [24] has also been investigated. At the application level, how mode changes can be accomplished by designing a reconfigurable Network Interface Unit, achieving guaranteed-service end-to-end VCs, is demonstrated in [25]. A segment-based routing strategy to prevent deadlock in a mesh, devised for irregular regions on an NoC layer, is found in [26]. The design of an on-chip Cyber Physical System (CPSoC) based on a mesh network is shown in [27]. Paper [28] presents deadlock-free routing amongst the tasks creating subnets in a mesh NoC. All the above developments [25,26,28] are done using the mesh topology. Moreover, the recent works [29–32] at the application level on NoC also account for mesh-based networks. Some of the latest works on NoC topologies can be found in [33,34]. Das et al. demonstrate performance improvements by inserting diagonal links and creating sub-networks in a mesh NoC in [33], whereas Wang et al. [34] propose a 3D octagonal topology. Though both topologies exhibit performance improvements in terms of latency, throughput and power consumption, several designs (like the flattened butterfly [15] and the Butterfly Fat Tree [7]) exist that integrate many more IPs within a relatively small network diameter. The performance of a Dynamic Time Division Multiple Access (DTDMA) bus as a fast vertical way of communication, as opposed to the conventional Through Silicon Via (TSV), is examined in [1]. An application of the DTDMA bus to a 3D non-uniform cache architecture that mitigates L2 cache coherence problems can also be found in [1]. All the above works overlooked the impact of scaling the network on interconnect lengths. Both the delay and routing aspects of interconnects are considerably more challenging in the global scenario than at module level [2]. As predicted in [35], with aggressive technology scaling, modern NoC design faces serious problems regarding global interconnect while connecting a large number of IP cores on a single die. The low (and fixed) router radix and uniform hop distance amongst communicating node pairs make the Butterfly Fat Tree an interesting choice. But, with increasing scalability, it suffers from some major drawbacks, especially long interconnect lengths and increasing upward traffic load. The present work investigates the above facts and tries to come up with a solution addressing all possible design challenges.

1.3. Organization of the paper

The overall organization of the rest of the paper is as follows. Section 2 summarizes the novel contributions of this paper. In Section 3, a detailed formulation of the present problem is presented. The proposed solution and necessary details of the proposed algorithms are presented in Section 4.
Simulation framework, experiments and results are discussed and analyzed in Section 5. Finally, Section 6 concludes the paper.

2. Novel contributions of the paper

The novel contributions of the work can be summarized as follows.

1. To solve the upward traffic load problem, a new structure has been proposed by incorporating the design of the root and border routers. The design brings the following structural benefits.
(a) The design successfully achieves a constant diameter that does not grow with increasing network size; thus, the design is capable of integrating a large number of IPs without affecting the communication latency (discussed in Sections 4.1 and 4.3).
(b) The constant diameter of the network gives flexibility to the placement of the communicating nodes. The source and the destination nodes can be placed anywhere in the network with no communication overhead, no matter how big the network is (explained in Section 4.2.6).
(c) The design is made in such a way that the number of long interconnects becomes much smaller, and it does not scale with the network size (illustrated elaborately in Section 4.3).
(d) The fixed number of wires makes the interconnect routing much easier compared to the generic BFT (depicted in Fig. 5 of Section 4 and Fig. 19 of Section 5.1).
2. A new floorplan, unlike the state-of-the-art BFT, is proposed with optimized interconnect lengths that comes with the following advantage.
(a) The delays incurred by the wires remain under tolerable limits. Experimentation on the above wire length issue proves that building the proposed ZBFT network using modern, sophisticated IP cores is possible across various technology nodes (reported in Table 6 of Section 5.1.2).
3. A novel zone-based simple routing strategy is devised which is lightweight and efficient in the following ways.
(a) The NoC is conceptualized as a hierarchy of zones. Forming such zones distributes the responsibilities across the zonal routers. The complexity of the routing algorithm becomes much lower than that of a classical BFT (demonstrated in Section 4.2.1 with Fig. 6 and in Section 4.2.8).
(b) Unlike in a generic BFT, the addressing mechanism for data communication becomes much simplified (shown in Section 4.2.2).
(c) The routing infrastructure separates the generated traffic into 2D and 3D, which are then routed through the root and the border routers, respectively. The traffic separation policy balances the network load efficiently in all kinds of traffic injection modes (a detailed analysis is given in Section 5.2.1).
4. Exploiting the constant diameter of the ZBFT and the traffic separation policy, a novel traffic mapping technique is employed which helps to alleviate the upward traffic load problem (present in the classical BFT structure) of the zonal routers very efficiently. The strategy works even at heavy injection loads across all the traffic patterns (a detailed analysis is given in Section 5.2.1 and reported in Table 13).

3. Problem description

From the design viewpoint, the NoC research domain may broadly be categorized into two types, viz. micro-architectural and macro-architectural innovations, respectively [1]. Developing hardware components (router, NIC) has been the focus of NoC research at micro-architectural granularity.
On the other hand, macro-architectural development (e.g., topology, routing algorithm and flow control policy) of an NoC is carried out from a relatively higher abstraction level, taking the interconnection network into account as a whole, which is the context of the proposed work. Fig. 1 is a concise depiction of the various design challenges and the related



Fig. 1. NoC macro-design challenges and constraints in a nutshell.

constraints of a network from a macro-development perspective. Communication latency and throughput of a system are the most important parameters to evaluate the performance of an NoC. Hence, decreased latency and increased throughput, as the two prime objectives of a network design, face several challenges, of which the dominant ones are as follows.

1. Network scalability.
2. Path diversity.
3. Traffic compatibility.
4. Traffic injection tolerance.

Each of the above design challenges entangles in an elaborate interplay with a set of related constraints, which may affect the network performance significantly if not taken care of properly. In Fig. 1, the respective restrictions (placed in rectangular boxes) of the NoC design challenges are connected by arrows to bring better clarity to the illustrations given below.

3.1. Network scalability and path diversity

The underlying communication network needs to be scaled as per the requirements to satisfy the ever-growing application demands of the system. The network diameter, the length of the interconnect wires and the radix (number of ports) of the routers are the three significant constraints that eventually come into play as the network size grows. The equation below explains how the above constraints affect communication latency with the increasing volume of a network:

$$T_{latency} = m \times \sum_{i=1}^{h_{max}} \left( \eta_i \sigma + \delta_{l_{i,i+1}} \right) \tag{1}$$

Eq. (1) is a measure of the latency of a packet of size m (in flits), where a flit experiences a delay of $\eta_i$ cycles at the ith stage of the network, with $\sigma$ the applied clock period of the router. After a flit completes its ith hop, an additional $\delta_{l_{i,i+1}}$ delay overhead is incurred due to the propagation through the interconnect of length l which connects the routers of the ith and (i+1)th stages. With the increasing diameter due to scaling up of the network, the maximum hop count $h_{max}$ that can occur in a packet transmission also gets exacerbated. To reduce the average hop count, if the length of the wires is increased, then after a certain point the interconnect delay starts dominating the transmission latency and eventually it ($\delta_{l_{i,i+1}}$) goes beyond the critical path delay threshold of the routers. Hence, an efficient network design must balance this diameter vs. interconnect length trade-off that results as a direct consequence of scaling the underlying topology.

Path diversity, on the other hand, denotes the existence of more than one path between a pair of source and destination nodes, and it profoundly influences the average router radix, the distribution of the network load and even the diameter of a network. The edge connectivity¹ [36] of the corresponding graph of a network is a good measure of path diversity. For an NoC having N routers connected by a total of L links, its edge connectivity can be defined as follows:

$$E_{connect} = \left\{ e \;\middle|\; e \le \left\lfloor \frac{2L}{N} \right\rfloor \right\} \tag{2}$$

Eq. (2) tells us that the edge connectivity value of a network is always an element e of the set $E_{connect}$. In other words, theoretically, it can be said that from an input VC at some router's input port, a header flit can choose at most $\lfloor 2L/N \rfloor$ possible output ports to go to the next stage of the network to reach the destination. In practice, for a considerably large network, it is not a wise design choice to connect a router to all the other routers. Hence, the path diversity of a network may increase the average radix of its routers. With increasing router radix, the communication latency gets affected, as a higher radix aggravates the delay of the various arbitration processes inside a router. The equation below captures this fact:

$$T = \frac{1}{p} \left( 1 - p \right)^{\alpha} \times k\sigma, \qquad p = \frac{1}{k}, \quad 0 \le k \le v \times P \tag{3}$$

Eq. (3) represents the delay of an arbitration² phase of a P-port router that operates on a clock duration $\sigma$ and employs v VCs at every input port. If there are k such flits opting for a particular output port (or an input VC at the input port of the next-stage router), then p is the selection probability of a flit, and it may get rejected $\alpha$ times before getting selected. The more flits compete for a common resource (VC, crossbar passage, etc.) inside a router, the more the corresponding arbitration latency increases, which eventually affects the overall transmission delay of a packet. On the other hand, a lack of path diversity also causes contention among flits for the same output ports within routers and, at its worst, can lead to traffic congestion in the network because of poor traffic load distribution. This situation compels a congested router to drop packets, which affects the overall throughput of a network.

¹ The number of edges to be removed to make the graph disconnected.
² A header flit goes through both the VC and switch arbitration processes, whereas a body or tail flit faces only the crossbar arbitration.
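To make the two models concrete, the short sketch below evaluates Eq. (1) and Eq. (3) numerically. It is purely illustrative: the clock period, hop count, wire delays and flit counts are assumed values, not measurements from the paper.

def packet_latency(m, per_hop_cycles, sigma, wire_delays):
    # Eq. (1): T_latency = m * sum_i (eta_i * sigma + delta_l(i, i+1))
    return m * sum(eta * sigma + delta
                   for eta, delta in zip(per_hop_cycles, wire_delays))

def arbitration_delay(k, sigma, alpha):
    # Eq. (3): T = (1/p) * (1 - p)**alpha * k * sigma, with p = 1/k
    p = 1.0 / k
    return (1.0 / p) * (1.0 - p) ** alpha * k * sigma

# A 4-flit packet crossing 6 hops (the maximum intra-layer hop count of
# the proposed ZBFT), a 1 ns router clock, 1 cycle per stage and 0.2 ns
# of wire delay per link -- all assumed figures.
print(packet_latency(4, [1] * 6, 1.0, [0.2] * 6))   # 28.8 ns
# 8 flits competing for one output port, 2 rejections before selection.
print(arbitration_delay(8, 1.0, 2))                  # 49.0 ns

The example shows the two levers the section describes: latency grows with every extra hop and every extra nanosecond of wire delay (Eq. (1)), while arbitration delay inflates quickly as more flits contend for the same resource (Eq. (3)).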


3.2. Traffic compatibility and traffic injection tolerance

The structure of the topology contributes the most to meeting these design challenges. In a particular traffic pattern, the placement of the communicating IP node pairs determines the overall traffic load in various regions of the network. For example, uniform traffic (as its name suggests) distributes the whole traffic load across the network evenly, whereas bit-permutation traffic, like bit-complement, imposes uneven loads on the routers of a network. The above traffic load is very much subject to the topological architecture of the underlying network which, if designed carefully, can handle even a pessimistic traffic pattern from the injection load viewpoint. A network with poor traffic compatibility exhibits an execrable distribution of traffic load across the network. In a heavily loaded router, the race condition between flits for a particular output port penalizes the router by making it expend extra clock cycles to route those flits.

$$\frac{\sum_{i=1}^{h_{max}} \eta_i \times p_{bit} \cdot b \cdot m}{P_{packet}} > \frac{\sum_{i=1}^{h_{max}} \left( p^{i}_{rc} + p^{i}_{vca} \right) + m \times \sum_{i=1}^{h_{max}} \left( p^{i}_{sa} + p^{i}_{xbar} \right)}{P_{packet}} \tag{4}$$

Eq. (4) illustrates the above situation regarding energy consumption, where $\eta_i$ is the measure of the additional cycles that a flit (of b bits) of an m-flit packet wastes in the VC buffer, incurring an energy overhead of $p_{bit}$ to store a single bit per cycle. The right-hand side of the inequality shown in Eq. (4) is the cumulative energy required to transmit a packet from its source to the destination, which is comprised of the per-hop power required for routing computation ($p^{i}_{rc}$), VC arbitration ($p^{i}_{vca}$), switch arbitration ($p^{i}_{sa}$) and crossbar propagation ($p^{i}_{xbar}$) for an m-flit packet. The inequality in Eq. (4) states that inefficient load balancing in a network makes the VC buffer energy required to store a packet dominate the overall power consumption incurred by the transmission of that packet. As a result, the average router power consumption increases. Escalated power densities in such bottleneck regions aggravate thermal dissipation, especially in a 3D NoC where the layers are stacked one upon another. Taking a diversified path to avoid congestion often comes at the price of a higher number of router hops than usual. The above diversion also eventually increases the power requirement and communication latency to transmit each flit between a pair of source and destination. Hence, a designer needs to find a balance among these parameters to make the NoC efficient from the energy consumption perspective without affecting the network performance. On successfully finding the balance amongst all the above constraints shown in Fig. 1, the traffic injection tolerance of a network increases significantly, which is found in the experimental phase of the proposed work. The existing prevalent topologies do not address all the above issues as a whole. Cube networks (like mesh, torus, etc.) [11,37] suffer from a large network diameter resulting from increasing scalability, whereas tree-based topologies are affected by poor path diversity.
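A quick numeric reading of the inequality in Eq. (4) may help; every energy figure below is an assumed placeholder chosen only to illustrate how buffering energy overtakes transmission energy under congestion.

# Illustrative check of Eq. (4); all per-bit and per-hop energies assumed.
h_max, m, b = 6, 4, 32            # hops, flits per packet, bits per flit
p_bit = 0.05e-12                  # J to hold one bit for one cycle (assumed)
eta = [12] * h_max                # stall cycles per hop under congestion
p_rc, p_vca = 1.0e-12, 1.5e-12    # per-hop header energies (assumed)
p_sa, p_xbar = 0.8e-12, 2.0e-12   # per-flit, per-hop energies (assumed)

buffer_energy = sum(eta) * p_bit * b * m
transmit_energy = h_max * (p_rc + p_vca) + m * h_max * (p_sa + p_xbar)
print(buffer_energy > transmit_energy)   # True: buffering dominates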
3.3. Mesh and Torus

Below are the advantages and drawbacks of mesh and torus networks in light of the above-discussed design challenges and the related constraints.


3.3.1. Pros
1. Between every pair of source and destination nodes there exists a sufficient number of diversified paths.
2. The radix of the routers is moderately low.
3. The length of the interconnect wires is comparatively low and does not get affected by scalability (only for mesh).

3.3.2. Cons
1. With increasing scalability, the network diameter rises significantly, which escalates the average communication hop count and affects overall transmission latency.
2. In a large network, the path diversity does not leave any optimistic effect on the communication, especially at considerably higher injection rates.
3. A large diameter diminishes the effect of load balancing at the routers' output ports.
4. Not at all compatible with all kinds of traffic patterns, especially bit-permutation traffic where a source sends all of its packets to a single destination IP.
5. Increasing traffic load on the network aggravates the average power consumption of the routers, which degrades the thermal profile of the network.
6. The network crashes at a very low data injection load, thus exhibiting very poor traffic injection tolerance.

3.4. Butterfly

3.4.1. Pros
1. Low router radix.

3.4.2. Cons
Lack of path diversity makes the topology inefficient in the following ways:
1. High communication latency.
2. Low traffic injection tolerance.
3. At high injection rates, not compatible with even uniform traffic.
4. Poor load balancing.
5. Wire length increases significantly with increased scalability.
6. High energy consumption overhead.
7. With increasing injection rate, the thermal profile degrades.

3.5. Flattened Butterfly

3.5.1. Pros
1. Scalability does not degrade network diameter or wire length.
2. A sufficient degree of path diversity exists.
3. Compatible with all kinds of traffic.
4. Theoretically balances network load efficiently (energy issues remain with increasing injection load).

3.5.2. Cons
1. The average router radix is very high.
2. High radix routers yield high power consumption.
3. The thermal profile gets worse even at moderate traffic injection rates. Hence, traffic injection tolerance is very low due to the energy (both in terms of consumption and dissipation) overhead.

The aim of the proposed work has always been to work on the above challenges taking their effects into account. So, an effort has been made to bring all the above challenges under a single roof and to give a solution where each of the design challenges can be met while mitigating the repercussions on the related constraints. As cube networks (like mesh, torus, etc.) [11] suffer from a large network diameter resulting from increasing scalability [11,37], designs from the butterfly family [11] are an attractive choice, as they can integrate a large number of IPs within a relatively lower network diameter.



Fig. 2. Logical diagram of four connected Butterfly Fat Trees.

Fig. 3. Connectivity between border and root routers. Each of (a), (b), (c) and (d) shows four root-level routers, each pertaining to a particular BFT. (e) Border routers of a floor, each of which is dedicated to a particular BFT.

Though the low (and fixed) router radix and the uniform hop distance amongst communicating node pairs make the Butterfly Fat Tree interesting, with increasing scalability it suffers from the major drawbacks given below.

3.6. Butterfly Fat Tree (BFT)

3.6.1. Cons
1. The network diameter increases with increased scalability, which significantly affects the communication latency.
2. Performance degrades with increased scalability, as the increased number of levels of the tree escalates the upward traffic load on the routers.
3. The network saturates at a relatively very low injection load.
4. Routing computation complexity increases with increasing network size.
5. For substantially large network sizes, practical implementation on modern real IP cores is not possible using a BFT, because floorplan complexity arises due to a large number of long interconnects whose propagation delays lie beyond the communication threshold in the corresponding technology node.
6. Average router power consumption is high due to poor distribution of the injection load.
7. Thermally intolerable at comparatively much lower injection rates.

The proposed ZBFT network deals efficiently with all the above challenges by properly addressing the effects on the constraints without affecting the performance of the NoC.

4. Proposed solution

4.1. Network architecture

The proposed 3D NoC design has several identical layers. Each such layer consists of 4 BFTs connected in a specific manner. Fig. 2 depicts the logical connectivity in a single NoC layer amongst the 4 BFTs according to the design. Root level routers are coloured for clarity of understanding. As shown in Fig. 3, each root node is connected with all the other root nodes having the same colour. An extra stage of routers (which are called border routers) is added to each layer of this design to accomplish inter-layer communication amongst IP blocks. Fig. 2 shows these pink coloured border routers. Each border router is dedicated to a particular BFT, through which only inter-layer traffic can pass to or from that BFT. No intra-layer traffic can be routed through these border routers. This means that if an IP block of a BFT on a layer wants to send data to another IP block belonging to a different BFT on the same layer, then the routing policy must exploit the root routers' connectivity, as shown in Fig. 3. All the root routers of a BFT are connected to the border router dedicated to it. Border routers of a layer are fully connected, as shown in Fig. 3. The floor plan of a single BFT is shown in Fig. 4. The placement of routers is modified here compared with the classical 2D BFT floor plan shown in [7,22]. This modification was needed to route wires between routers, which is very evident in Fig. 5, where the complete floor plan of a layer comprising 4 BFTs is given. Both of the floor plans in Figs. 4 and 5 depict the connectivity of a border router (pink coloured) to a circular DTDMA pillar node. The DTDMA (Dynamic Time-Division Multiple-Access) bus is specifically designed to facilitate easy integration in multi-core systems [38].

Fig. 4. Floor plan of a single BFT as a locality of a pillar node.

4.2. Routing

The routing strategy for the design has been devised by characterizing the network into various routing zones. Routing responsibilities are distributed and vary across the zonal routers, which makes the routing logic simple.

4.2.1. Different routing zones

Four zones are formed for routing in the network, namely the local, regional, tree and layer zones. Fig. 6 depicts the distribution of the different zones across the network.



a. Local zone: Leaf nodes (where the IP blocks are placed) in a Butterfly Fat Tree topology are distributed across sixteen groups, each of which comprises a distinct set of four leaf nodes connected by a unique level 1 router. Each such group is called a locality; hence, with each locality a local zone is formed. A level 1 router is therefore called a local router, as it belongs to a distinct locality. Localities in Fig. 6 are shown by grey boundaries.
b. Regional zone: A region is made up of four localities whose local routers are connected to two distinct level 2 routers (labelled as R). There exist eight such routers at level 2 of a BFT, each distinct pair of which belongs to a particular region. Hence, these routers are called regional routers. Regions are enclosed in Fig. 6 by yellow borders.
c. Tree zone: Four such regions form a tree (Butterfly Fat Tree). Root (level 3) routers (labelled as T) of a BFT connect its regions through the regional routers. Each root router has a connection to every region through a particular regional router that pertains to the respective region. Trees are shown with green boundaries in Fig. 6.
d. Layer zone: A 3D NoC chip consists of several layers, each of which comprises four BFTs. As described in Section 4.1, border routers (labelled as B) handle inter-layer traffic. Each border router on a layer is dedicated to channelling 3D packets to and from a particular tree. The blue boundary depicts the layer in Fig. 6.

In Fig. 6, only the localities that belong to the upper left region of a tree are enclosed with grey boxes, and the yellow border is omitted for those regions in the layer, for better clarity.

Fig. 5. Floor plan of a complete NoC layer comprising four BFTs as the localities of their respective pillar nodes.

4.2.2. Addressing mechanism

A communicating node (an IP block at the leaf level of a BFT) in the network is identified by a router based on the 5-tuple

$$\tau \Rightarrow \{L, T, R, l, n\}, \qquad 0 \le L \le x, \quad 0 \le T, R, l, n \le 3 \tag{5}$$

where L, T, R, l and n represent the layer, tree, region, locality and node number of that IP block, respectively. Each address for communication is comprised of these distinct elements, shown in Eq. (5), in order to provide complete zonal information to the routers. Numbering of the various zones in the network follows the convention shown in Fig. 7. The four boxes in this figure may belong to any of the four zone types (locality, region, tree or layer) discussed in Section 4.2.1. If it is a layer, then the four BFTs of that layer are numbered in a clockwise manner ranging from 0 to 3, according to Fig. 7. The same convention is followed when numbering the four regions of a BFT, the four localities of a region and the four nodes of a locality. Based on its position, a particular source or destination is specified by populating each element of τ with its respective value (as per the numbering convention). As per the design, a network can have several layers. That is why the value of L in Eq. (5) starts from 0 (the lowermost layer) and does not exceed a value x (the uppermost layer), where x + 1 is the maximum number of layers that can be manufactured in an implementation, which is constrained by the fabrication process. It is evident from the numbering convention shown in Fig. 7 why the rest of the elements of τ range from 0 to 3.

Fig. 6. Formation of routing zones in a layer.

Fig. 7. Numbering convention of different routing zones.

Hence, the header flit of a packet contains its respective source and destination addresses in the format shown in Fig. 8. It is a total of (m + 8) bits per address. The first m bits specify the Layer Number to which the respective node (source or destination) belongs. The next two bits indicate the Tree Number out of the four BFTs in that layer. The next two bits contain the Region Number out of the four regions of that BFT. The four localities of that region are identified by a two-bit Locality Number field. Lastly, which of the four IP blocks in that locality is the corresponding node (destination or source) in context is denoted by a two-bit field called the Node Number. For example, in Fig. 6 the address of node A, situated at the lower-left corner of the network, can be given as

$$\tau_A = \underbrace{11}_{L}\;\underbrace{11}_{T}\;\underbrace{11}_{R}\;\underbrace{11}_{l}\;\underbrace{11}_{n} \tag{6}$$

In Eq. (6) a 10-bit address is shown, out of which 2 bits represent the layer number of node A, so the number of layers in the network should be 4.

Fig. 8. Address format of a header flit.

4.2.3. Load balancing mechanism

Balancing of the traffic load can be done by exploiting the multipath characteristic (the existence of multiple paths between a pair of source and destination) of the network. Fig. 9 depicts the scenarios where load balancing is possible. How a local router distributes traffic while sending packets upward is depicted in Fig. 9(a). The channels among which traffic loads are distributed are shown by green arrows. Fig. 9(a) does not show the full connectivity amongst local and regional routers. Like local routers, regional routers balance loads in the same way, as shown in Fig. 9(b). As depicted in Fig. 3 and discussed in Section 4.1, a root router belonging to a particular BFT has three connections to the other same coloured root routers (each of which belongs to a distinct BFT) on a layer. These root routers are used to route 2D traffic. Hence, traffic generated from a BFT but destined for the other BFTs of the same layer is distributed across three upward channels after reaching each root router of that BFT. The left part of Fig. 9(c) shows the 2D traffic load distribution of a root router (named 'A') in a BFT of a layer. The right-hand part of the figure explains the different routes that can be taken by a header flit of a virtual channel based wormhole switched [39] packet from router 'A' to reach router 'B', while load balancing is applied at node 'A'. Fig. 9(c) also depicts that a maximum of one intermediate hop is allowed to reach a root router from a root router on the same layer; otherwise, deadlock or livelock conditions may occur. All root routers follow the same principle as discussed above. Unlike the other routers, a border router can distribute traffic across various channels for both the cases of incoming (from the associated BFT) and outgoing (to the associated BFT) traffic. Downward (incoming) traffic distribution is possible for a border router because, when a packet needs to go down to reach a certain region, it can be routed through any of the root routers, as every root router connects to every region in a BFT. The load distribution of a border router 'A' is shown in Fig. 9(d).
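Before moving on, here is a concrete illustration of the addressing scheme of Section 4.2.2: the sketch below packs and unpacks the 5-tuple of Eq. (5) into the (m + 8)-bit header layout of Fig. 8, taking m = 2 layer bits (a 4-layer network) as in the Eq. (6) example. The function names are ours, not part of the design.

M = 2  # layer-number field width; matches the 10-bit example of Eq. (6)

def pack(L, T, R, l, n):
    # Fields in Fig. 8 order: Layer | Tree | Region | Locality | Node.
    assert 0 <= L < (1 << M) and all(0 <= v <= 3 for v in (T, R, l, n))
    return (((((L << 2 | T) << 2 | R) << 2) | l) << 2) | n

def unpack(addr):
    n, addr = addr & 3, addr >> 2
    l, addr = addr & 3, addr >> 2
    R, addr = addr & 3, addr >> 2
    T, L = addr & 3, addr >> 2
    return L, T, R, l, n

addr_A = pack(3, 3, 3, 3, 3)          # node A of Eq. (6)
print(bin(addr_A), unpack(addr_A))    # 0b1111111111 (3, 3, 3, 3, 3)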

4.2.4. Deadlock and livelock avoidance policy at the root and border level

Fig. 10 elaborates on the deadlock or livelock situation that may occur at the routers of the root and border levels. A deadlock scenario is depicted in Fig. 10(a), where A, B, C and D are the four neighbouring root (or border) routers (refer Fig. 3). The deadlock condition is illustrated by colouring the nodes and the corresponding links. The header of a packet that is stored in the input VC0 of router A wants to go into the input VC1 of router B but cannot, because that VC is in use by some other packet at router B. The reason for the blocking of VC1 at B is that the packet header in that VC wants to get into the input VC2 at router C but cannot acquire it, as the packet at that VC wants access to the input VC3 of router D, which is in use. A cycle exists from node A to D where, at each of the routers, some packet in some VC wants access to a blocked VC at the neighbouring router. With a single VC at each input port of the routers, the above deadlock chain cannot be broken. A standard measure [11] taken to alleviate the situation is shown in Fig. 10(b): a packet always gets the same VC at each stage of the network while going to the destination. For example, packets from router A always traverse through VC0, whereas traffic generated from B always uses VC1 at each router hop, and so on. This policy has a major drawback: a header of a packet inside an input VC of a router always opts for a fixed VC in the VC allocation stage, which increases the probability of rejection in the arbitration process, especially at a significantly high injection rate. Eventually, the above situation affects the overall communication latency of the packets. Moreover, it may bring a livelock situation, where a packet visits the same set of routers again and again through a cycle (e.g., the cycle A → B → C → D → A in Fig. 10(b)) wherever one exists on the path to the destination. To remedy the deadlock problem, a better solution, which is generally taken by the routers in XY-YX mesh routing [11], is to restrict some directional moves (at least one move) while routing a packet. Each root and border router follows the same principle to avoid deadlock and livelock situations. Fig. 10(c) shows the packet forwarding policy of the root router A. The paths of communication to the neighbouring routers are shown by coloured arrows for better clarity. As can be seen in Fig. 10(c), the root router A takes no more than two hops to reach its neighbours. For example, a packet from A which is destined for D can take either the path A → C → D or it can go directly to D, as A → D. The path A → B → C → D is not allowed for any traffic from router A. Therefore, a packet from a root (or border) router does not take more than two hops to reach the destination tree, thus breaking the deadlock chain that may occur without such a restriction. This policy is reflected in the root (Algorithm 3) and border (Algorithm 4) routing algorithms.
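The restriction can be expressed compactly: among four fully connected root (or border) routers, a packet either goes to its target directly or through exactly one intermediate router, so a cyclic wait across all four can never form. A toy sketch of the permitted paths, using the A/B/C/D naming of Fig. 10 (the path enumeration is our illustration of the rule, not the routers' actual table):

ROUTERS = ["A", "B", "C", "D"]   # four fully connected root routers

def allowed_paths(src, dst):
    # Direct hop, or at most one intermediate (the Fig. 10(c) policy).
    direct = [(src, dst)]
    via_one = [(src, mid, dst) for mid in ROUTERS if mid not in (src, dst)]
    return direct + via_one

for path in allowed_paths("A", "D"):
    print(" -> ".join(path))
# Prints A -> D, A -> B -> D and A -> C -> D; a three-hop chain such as
# A -> B -> C -> D is never generated, which breaks the deadlock cycle.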



Fig. 9. Load balancing at various stages of the network.

Fig. 10. Avoiding deadlock and livelock at the root and border level of the ZBFT.

4.2.5. Routing strategies of various zonal routers

The routing algorithm is distributive in nature. Different zonal routers have their respective zones of interest to identify the source and destination of a header flit and forward it accordingly. A canonical virtual channel (VC) based routing environment with a wormhole switching strategy [1,11,39,40] has been adopted for the routers to devise algorithms in the various zonal contexts. To employ the load balancing mechanism discussed in Section 4.2.3 (shown in Fig. 9) at every zonal router in the proposed network, the links of a router need to be classified into two categories: up and down. From the perspective of a particular local or regional router, its parent links are accounted as up-links (or upward links) and the child links are taken as down-links (or downward links). For a root router, its up-links are the connections it has to the other three same coloured root routers of the corresponding layer, including the border router connection. The links by which a border router is connected to the rest of the border routers and to its pillar node in a layer are its up-links. The down-links of a border router are those four connections, each of which connects a distinct root router of the associated BFT. In a generic routing process [41], after receiving a header flit under the supervision of the input controller, the type of the flit is detected by a decoding unit of the respective input port. Every flit carries this type information [41]. Finding the arrived one to be a header flit, the input controller stores the flit into a VC and forwards the destination address to the Routing Computation (RC) unit [41].

Fig. 11. Working principle of the routing computation unit.

The RC unit returns the output port through which the header flit is to be passed. How zone-specific routing computation happens in the proposed scenario is shown in Fig. 11. Unlike in the canonical case, along with the destination address an extra 1-bit piece of information (the input port type) is passed to the RC unit, which helps to identify whether the input port at which a header flit arrives is connected to an upward or a downward channel. Depending upon the input port type, the RC logic acts accordingly and does load balancing where possible. A zonal router maintains its port addresses like every NoC router [1,11]. The RC logic in Fig. 11 returns a legitimate output port address after finishing the routing computation. Depending on the radix of a router, the bit length of the port address is determined. For every root and border router this bit length is 3, as each of them has eight input/output ports. For a local or regional router, the number of input/output ports is six.



Routing algorithms: The algorithms executed by the local, regional, root and border routers are presented in Algorithms 1–4, respectively. Each zonal router maintains counters to accomplish load balancing. The values of these counters resemble the channel numbers shown in Fig. 9, based on which a router decides through which particular output port a packet is to be routed. According to the load balancing mechanism (refer Fig. 9(a) and (b)), a local or a regional router needs to maintain a counter (named count in Algorithms 1 and 2) to balance the traffic load across its two upward channels. When a header flit of a packet comes to a local (or regional) router from a downward channel, then if the router finds that the packet is destined outside the locality (or region), it checks the value of count and the proper upward channel is selected. The associated output port address is generated by the RC unit. As depicted in Fig. 9(a) and (b), the left upward channel is selected when count holds a zero and, for the right upward channel, this value is one. A router applies its own internal convention while mapping this channel number to an output port address.

Local Routing Logic
Input: destination address, input port type.
Output: output port.
Initialize: count = 0; // done when the router gets up
begin
    if input_port_type = down then
        if L_curr = L_dest then
            if T_curr = T_dest then
                if R_curr = R_dest then
                    if l_curr = l_dest then
                        output_port = node_number;
                    else
                        output_port = count; count = (count + 1) mod 2;
                    end
                else
                    output_port = count; count = (count + 1) mod 2;
                end
            else
                output_port = count; count = (count + 1) mod 2;
            end
        else
            output_port = count; count = (count + 1) mod 2;
        end
    else if input_port_type = up then
        output_port = node_number;
    end
end
Algorithm 1: Routing Technique of a Local Router.

Regional Routing Logic
Input: destination address, input port type.
Output: output port.
Initialize: count = 0; // done when the router gets up
begin
    if input_port_type = down then
        if L_curr = L_dest then
            if T_curr = T_dest then
                if R_curr = R_dest then
                    output_port = l_dest;
                else
                    output_port = count; count = (count + 1) mod 2;
                end
            else
                output_port = count; count = (count + 1) mod 2;
            end
        else
            output_port = count; count = (count + 1) mod 2;
        end
    else if input_port_type = up then
        output_port = l_dest;
    end
end
Algorithm 2: Routing Technique of a Regional Router.
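For an executable rendering, the snippet below transcribes Algorithm 1 into Python; the tuple fields follow Eq. (5), ports 0 and 1 denote the two upward channels, and the stateful count mirrors the round-robin counter of the listing (class and field names are ours).

class LocalRouterRC:
    def __init__(self, curr):
        self.curr = curr     # this router's own (L, T, R, l) position
        self.count = 0       # round-robin counter, reset when router is up

    def route(self, dest, input_port_type):
        L, T, R, l, n = dest
        if input_port_type == "down":
            if (L, T, R, l) == self.curr:
                return n                       # deliver to the local IP
            port = self.count                  # balance over the 2 up-links
            self.count = (self.count + 1) % 2
            return port
        return n                               # arrived from up: go to IP

rc = LocalRouterRC(curr=(0, 0, 0, 0))
print(rc.route((0, 1, 2, 3, 1), "down"))   # 0: left upward channel
print(rc.route((0, 1, 2, 3, 1), "down"))   # 1: alternates round robin
print(rc.route((0, 0, 0, 0, 2), "down"))   # 2: destination is local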

In the same way, the root routing algorithm (refer Algorithm 3) uses a counter to balance 2D traffic loads across three upward channels, as shown in Fig. 9(c). Unlike every other zonal router, a border router maintains two counters (count_up and count_down) to balance both upward and downward traffic loads, as depicted in Fig. 9(d).

Root Routing Logic
Input: destination address, input port type.
Output: output port.
Initialize: count = 0; // done when the router gets up
begin
    if input_port_type = down then
        if L_curr = L_dest then
            if T_curr = T_dest then
                output_port = R_dest;
            else
                output_port = count; count = (count + 1) mod 3;
            end
        else
            output_port = border_link;
        end
    else if input_port_type = up then
        if T_curr = T_dest then
            output_port = R_dest;
        else
            output_port = T_dest;
        end
    end
end
Algorithm 3: Routing Technique of a Root Router.

Border Routing Logic
Input: destination address, input port type.
Output: output port.
Initialize: count_up = 0; count_down = 0; // done when the router gets up
begin
    if input_port_type = down then
        if T_curr = T_dest then
            output_port = pillar_node_link;
        else
            output_port = count_up; count_up = (count_up + 1) mod 4;
        end
    else if input_port_type = up then
        if T_curr = T_dest then
            output_port = count_down; count_down = (count_down + 1) mod 4;
        else
            output_port = T_dest;
        end
    end
end
Algorithm 4: Routing Technique of a Border Router.

4.2.6. Communication hop count estimation

Fig. 13 depicts a scenario where nodes A and E individually send a header flit to node B in the same layer. Moreover, from node C a header flit also needs to be communicated to node D, which is located on another layer. For clarity of understanding, the position of node D on its respective layer is shown as if in the same layer where C is situated. The maximum intra-layer (2D) and inter-layer (3D) hop counts of the network can be estimated from the situation in Fig. 13.

Intra-layer hopping estimation (case: A → B and E → B):
i. One hop is required from the local router of A (or E) to one of the two regional (yellow coloured) routers.
ii. Each regional router is distinctly connected to either the red and violet or the orange and green coloured root routers. So one hop from the regional router to one of those two associated root routers (red/violet or orange/green) is required. In the case of A, the header flit is routed to the red coloured root router, and for E the associated regional router selects the green coloured root router.
iii. According to the root routing algorithm (Algorithm 3), a root router balances the 2D traffic load coming from its down-channels across the three upward channels that connect the rest of the same coloured root routers on the same layer. Hence, two cases are possible here:
   a. A root router forwards a header to the same coloured root router on the same layer that is associated exactly with the BFT where the destination IP resides (as occurs in case A → B).
   b. Otherwise, the header can be forwarded to a same coloured root router that does not belong to the destination BFT. Here one extra hop is required beyond (iii.a) to reach the root level of the destination BFT (as occurs in case E → B). The reason is explained in Section 4.2.3 and depicted in Fig. 9(c).
   As a root router does its load balancing depending on the value of the counter named count (refer Algorithm 3), either of the possibilities (iii.a) and (iii.b) can occur. Hence, inter-BFT (same layer) routing from any root router is accounted as a matter of two hops.
iv. It takes two hops to reach the local level (level 1) from the root level (level 3) of a BFT. Therefore, after (iii.a) the destination IP block (B) can be reached from the corresponding root level in two hops (as happens for both A → B and E → B).

Hence, the maximum intra-layer hop count that follows from the above discussion is six.

Inter-layer hopping estimation (case: C → D):
i. Two hops from the local router of C to reach the root level.
ii. One hop from the root level to the border router of the BFT where source node C is located.
iii. As Algorithm 4 states, a border router balances load across four upward channels (including the pillar node link), so after (ii) two cases are possible:
   a. The border router can forward the header flit to one of the border routers on the same layer. One hop is required here.
   b. Otherwise, the border router can route the header to the pillar node. An inter-layer jump through the DTDMA pillar is equivalent to a single router hop [38].
   Depending on the Tree Number (as per the convention explained in Section 4.2.2) in the address of the destination node in its layer, two possibilities arise in inter-layer border routing:
   a. If a match is found between the source and destination nodes' Tree Numbers in their respective layers, then:
      1. after (iii.a) it takes two hops to reach the border router associated with the destination BFT in the destination layer, because after the inter-layer jump through the DTDMA pillar one extra router hop is needed, as the header flit must traverse an intermediate border router in the destination layer;
      2. otherwise, after (iii.b) one hop suffices to reach the border router associated with the BFT of the destination layer where the destination node resides (only the inter-layer jump through the DTDMA pillar is needed).
   b. If no such match is found, then:
      1. after (iii.a) it takes one hop to reach the destination border router in the destination layer; this happens when the Tree Number of the border router in the source layer matches the destination node's Tree Number in the destination layer; otherwise, routing takes place as in (iii.a.1);
      2. otherwise, after (iii.b) one hop is needed to reach the border router associated with the BFT of the destination layer where the destination node resides; this extra hop is incurred in order to reach the destination border router in the destination layer.
iv. Three hops to reach the destination node from the corresponding border router.

In the case C → D shown in Fig. 13, the routing flow followed is (i) → (ii) → (iii.b) → (iii.b.2) → (iv), which amounts to eight router hops. The above discussion, however, gives the insight that the inter-layer hop count never exceeds nine. Hence, nine is the highest possible value that, according to the routing technique, a communication occurring in the proposed network may incur. These estimations are made under the following assumptions:
i. No flits are dropped or detoured through a path longer than usual due to network congestion.
ii. A flit is consumed immediately as soon as it reaches its destination, and no delay comparable to a router hop is incurred while a source node injects flits into its local router. Hence, hopping starts from the local router of a source node and ends on reaching the destination node's local router.
iii. A flit's jump from one layer to another through the DTDMA pillar is considered equivalent to one router hop [38].
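The bounds above can be tallied mechanically. The sketch below simply adds up the per-stage hops enumerated in this section; it encodes the counts from the text rather than modelling the full network.

# Worst-case hop tallies from the Section 4.2.6 enumeration.
intra_layer_max = (1      # local -> regional
                   + 1    # regional -> root
                   + 2    # root -> root, worst case (iii.b)
                   + 2)   # root level down to the destination IP
assert intra_layer_max == 6

inter_layer_max = (2      # source local router up to the root level
                   + 1    # root -> border of the source BFT
                   + 1    # border -> border on the same layer (iii.a)
                   + 2    # pillar jump + intermediate border (a.1)
                   + 3)   # destination border down to the node
assert inter_layer_max == 9

c_to_d = 2 + 1 + 1 + 1 + 3   # the (i)(ii)(iii.b)(iii.b.2)(iv) flow
assert c_to_d == 8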

4.2.7. Inclusion of bridge routers

As discussed in Section 1, with increasing network diameter the length of the wires that connect the IPs and routers starts dominating the network performance. In the experimental phase (Section 5.1) of the work, the optimal lengths of the various interconnect wires of the network are determined across various technology nodes based on:
1. practical IP core instances and,
2. IP cores comprising a standard 100K logic gates each as IP blocks (leaf level nodes).

In the case of real processing cores (1), it is found that the lengths of some interconnects exceed the desired clock cycle limit in spite of inserting repeaters. Those wires need to be segmented further to reduce their length. Therefore, a special bridge router is introduced in the design. Fig. 14 shows the interconnect links connecting the orange root routers (refer Fig. 3(b)) whose lengths are reduced using bridge routers. The complete connections among same coloured root and border routers (refer Fig. 3) yield the longest wire lengths in the design. Hence, all wires that keep the root and the border routers connected need to be segmented employing bridge routers. As can be seen in Fig. 14, the diagonal links yield the maximum length; therefore, two bridge routers are used to segment them. These are the longest wire segments (L1) that can be found in the proposed network. One bridge router suffices for the rest of the links, whose segments become the second-longest (L2) ones across the network. Another three classes of wires exist in the design, but they need not be segmented further. The benefit of using bridge routers in the case of real modern processing cores as IPs is analyzed in detail in Section 5.1.

Fig. 15 demonstrates the routing policy of a two-radix bridge router, which is quite straightforward. A bridge router classifies its two links as up and down. Each link has its respective input and output ports, shown in the figure. Each input port has its virtual channels. As the name suggests, a bridge router bridges the time delay gap between two routers. The bridge routing logic only checks the input port type and decides what the output port should be. Algorithm 5 describes the logic on which the RC unit of a bridge router acts.

Bridge Routing Logic
Input: input port type.
Output: output port.
begin
    if input_port_type = down then
        output_port = up;
    else if input_port_type = up then
        output_port = down;
    end
end
Algorithm 5: Routing Technique of a Bridge Router.

If a header arrives at the up input port, the routing logic routes it to the down output port, and vice versa. Port addressing inside a bridge router follows its internal convention, like in any other NoC router. As the traffic movement is one way (i.e., either up input port → down output port or down input port → up output port), at any input port of a bridge router the VC and switch arbitration take much less time than at any other router of the network. Still, in the experimental phase, bridge hopping is included in the network simulation.

4.2.8. Comparative complexity analysis of the routing algorithms

In the state-of-the-art BFT architecture, a switch (or router) S(l, a) is identified by its level l and position a at that level. Each switch has two

parents at level l + 1 and four children at level l − 1, as shown in Fig. 16. The addresses of the parents can be determined as

$$Parent_0 = \left( l + 1,\; \left\lfloor \frac{a}{2^{l+1}} \right\rfloor \cdot 2^{l} + a \bmod 2^{l} \right)$$
$$Parent_1 = \left( l + 1,\; \left\lfloor \frac{a}{2^{l+1}} \right\rfloor \cdot 2^{l} + \left( a + 2^{l-1} \right) \bmod 2^{l} \right) \tag{7}$$

A parent link of a router of level l becomes the child link of some router in level l + 1. In other words, Parent_0 of S(l, a) is connected to Child_i of S(l + 1, p_0) and Parent_1 is connected to Child_i of S(l + 1, p_1), where

$$i = \frac{a \bmod 2^{l+1}}{2^{l-1}} \tag{8}$$

and p_0 and p_1 are the positions in level l + 1 of the two parent nodes of S(l, a), whose values are shown in Eq. (7). Accordingly, the addresses of the child nodes can also be determined.
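The classical computation of Eqs. (7) and (8) can be written out directly, which also makes the exponentially growing 2^l terms visible; the sketch follows the formulas as reconstructed above.

def parents(l, a):
    # Eq. (7): positions of the two parents of switch S(l, a).
    base = (a // 2 ** (l + 1)) * 2 ** l
    p0 = base + a % 2 ** l
    p1 = base + (a + 2 ** (l - 1)) % 2 ** l
    return (l + 1, p0), (l + 1, p1)

def child_index(l, a):
    # Eq. (8): which child port of its parents S(l, a) occupies.
    return (a % 2 ** (l + 1)) // 2 ** (l - 1)

# Level-1 switch at position 5 of a 64-node BFT (levels 1..3):
print(parents(1, 5), child_index(1, 5))   # ((2, 3), (2, 2)) 1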

When a router, knowing its own position and level, receives a header flit carrying some address, it needs to calculate the addresses of its two parents and four children to determine whether or not the header is destined for the subtree to which the router is connected. From Eq. (7) it is evident that the complexity of this calculation is Θ(2^{2l}). For a significantly large value of l this complexity becomes O(2^l), where a is a constant. This indicates that, for a particular router S(l, a), the routing computation time grows exponentially with the level l. On the other hand, the proposed zone-based routing algorithms carried out by the various zonal routers are of constant time complexity. The local routing algorithm takes a maximum of nine operations (refer Algorithm 1) if a received header flit comes from a down-link and is destined for a different locality in the same region. For the regional, root, border and bridge routing logic, the worst-case scenarios result in 8, 7, 6 and 2 operations, respectively. Hence, the routing algorithms of the proposed design are of constant time complexity O(1), which does not change with network scaling.

4.3. Scalability

According to the earlier literature [7,42], the classical BFT is constructed following the rule shown below:

$$N = 2^{L+3}, \quad L \ge 0 \tag{9}$$

where L is the required number of levels³ in the tree to integrate N IP blocks. As the communicating nodes are placed at the bottom of a BFT, it takes L hops for a packet to reach the highest level of the tree, and the same hop count is incurred when the packet goes down from the root level to its destination IP node. So, the diameter of a classical BFT is always 2L.

³ The leaf level where the IP blocks reside is taken as zero.
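A small computation contrasts this growth rule with the constant ZBFT diameter derived later in Eq. (11); the loop range is arbitrary, chosen only to show the trend.

# Classical BFT (Eq. (9): N = 2**(L+3), diameter 2L) vs. the constant
# ZBFT diameter of 9 claimed in Eq. (11).
for L in range(3, 8):                       # 64 up to 1024 IP blocks
    N = 2 ** (L + 3)
    print(N, 2 * L, 9)   # network size, classical diameter, ZBFT diameter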



Fig. 12. Three dimensional view of the proposed design.

The basic topology originally proposed in [7,42] has a 4-ary 3-fly structure comprising 64 IP blocks. The generic scaling of a BFT works as a multiple of 2 on this basic 64-node structure (call it N_0), which in turn is related to the required number of levels in the tree. The scalability of a BFT can be expressed as given below:

$$N_{cbft} = \{2^{i} N_0 \mid 0 \le i \le n,\; N_0 = 2^{6}\}$$
$$D_{cbft} = \{2(L + j) \mid 0 \le j \le n\} \tag{10}$$

The network size of every BFT is an element of the set N_cbft, whereas the corresponding diameter value belongs to the set D_cbft, as shown in Eq. (10). Therefore, Eq. (10) illustrates that with the exponential increase (by a power of 2) in the number of nodes of a generic BFT, the required number of router levels gets an additive increment of one, which in turn raises the network diameter by 2. The repercussion of this generic scaling of a BFT is a profound influence on some important design constraints (e.g., interconnect length and router power consumption) that causes substantial performance degradation when the network size becomes significantly large. Investigation of this fact finds that scaling the topology beyond 64 IP blocks in the classical manner leaves the interconnect routing very complicated; the network ends up having many long interconnects, thus incurring more delay in data communication. Rigorous experimentation on this wire delay issue shows that for a considerably large NoC size such as 16 × 16, no practical solution exists for a BFT if it is scaled generically. Section 5.1 gives a detailed report on the interconnect problem for both the classical BFT and the proposed ZBFT design. Moreover, with an increasing number of levels, the accumulated upward traffic load on each router also increases, which affects the overall communication latency and causes the network to reach its saturation point early. The escalated traffic load on the routers, in turn, increases the average buffer waiting time of packets inside a router's VCs, which elevates the energy consumption of the routers and worsens the thermal profile of the NoC layer. This network load issue and its effects are discussed in detail in Section 5.2.1, whereas detailed reports on the comparative network performance are given in Sections 5.2.3 and 5.4. The proposed ZBFT is designed in such a way that the intra-layer and inter-layer hop counts remain constant (discussed in Sections 4.2.6.1 and 4.2.6.2) with increasing scalability. The number of levels and the diameter of a ZBFT are expressed as

$$L_{zbft} = \log_2 N_0 - 1, \quad N_0 = 2^{6}$$
$$D_{zbft} = 2(\log_2 N_0 - 3) + 3 \tag{11}$$
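As a quick illustration of Eqs. (9)–(11), the sketch below (our own, for illustration only) contrasts the diameter growth of a generically scaled BFT with the constant ZBFT diameter.

```python
# Classical BFT diameter grows with network size (Eqs. (9)-(10)), while
# the ZBFT diameter of Eq. (11) is a constant 9 for the base tree size
# N0 = 64. A sketch of the stated equations, not simulator code.
import math

N0 = 64  # one 4-ary 3-fly BFT

def d_classical(n_nodes: int) -> int:
    # Eq. (9): N = 2^(L+3)  =>  L = log2(N) - 3; diameter = 2L
    return 2 * (int(math.log2(n_nodes)) - 3)

D_ZBFT = 2 * (int(math.log2(N0)) - 3) + 3  # Eq. (11): always 9

for n in (64, 128, 256, 512, 1024):
    print(f"N = {n:4d}: classical diameter = {d_classical(n):2d}, ZBFT diameter = {D_ZBFT}")
```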

Eq. (11) shows that the proposed design achieves a constant diameter (D_zbft) of 9 that does not grow with increasing network size. Fig. 17 depicts the effects of scaling on the network diameter and on the number of wires between two levels for both the generic BFT and the ZBFT. As shown in Fig. 17(a), classical scaling of a BFT increases its diameter drastically. In contrast, the proposed ZBFT maintains a constant diameter of 9, which remains unchanged after the network size crosses 384 IP blocks (i.e., one layer of 4 trees and 2 trees on another layer). A ZBFT network has 5 levels (leaf, local, regional, root, and border), which do not scale with the number of nodes in the network. An L-level classical BFT has 4^L/2^l links between any level l and l + 1. According to the generic floorplan of the BFT [7], the links that connect the highest level routers to those situated below that level yield the maximum wire length. Therefore, the delay incurred by the highest level interconnects is the matter of concern that may affect the communication latency. Fig. 17(b) shows how the number of highest level links increases radically with the growing scale of a BFT: with the inclusion of each additional level, the number of highest level links doubles. For a network comprising 512 IPs, there are 128 highest level links, which is substantially high for interconnect routing and makes the floorplan of a generic BFT very complex. On the contrary, the ZBFT has only 16 links at the highest level (i.e., the upward links that go from the root or border level routers), and this number does not change regardless of the network size. Interconnect routing becomes much simpler with the proposed design; Section 5.1 analyzes the effect of this design on the interconnects in detail. Fig. 18 shows a layer of the design where Tree 1 has all 64 IP cores present. An IP core from the upper-right locality of Tree 2 is absent, as is the whole locality in its lower-left corner; Tree 3 has two regions missing; and, lastly, all 64 IP blocks of Tree 4 are missing. IP cores are integrated on a layer as needed, and the absence of all IP blocks in a zone makes the corresponding zonal router absent, which is the notion of 2D scalability. Keeping in mind the interconnect length and floorplan complexity issues, along with the upward traffic load on the zonal routers, a layer is designed to accommodate at most four BFTs, beyond which an entirely new layer is added to the design; 3D scalability then comes into the picture. Hence, with this design, an even larger number of IP cores can be integrated without scaling the network diameter.


Fig. 13. Intra and inter layer routing hop estimation.

Fig. 12 depicts how the design looks in its three-dimensional form with two layers.

5. Experimental results and validation

The experimentation goes through two major phases. The first phase deals with the wire delay issue of the proposed design and determines the optimal number of bridge routers required to reduce the segment lengths of long interconnects. In the second phase, taking the interconnect lengths found in the first phase into account, simulations are performed to quantify the effects on performance parameters such as network latency, throughput, and router power consumption. The last stage of experimentation evaluates the thermal profile of the design, to find whether any abnormal temperature rise occurs in the layers due to high traffic injection into the network.

5.1. Interconnect delays

The macro-architectural design of an NoC focuses on the delays of the global interconnects used to connect components such as network routers and IP blocks placed on the chip [1]. More than 50% of the path delay occurs due to long interconnects [43,44]. As an integral part of floorplanning, global interconnect routing has a significant impact on the overall performance of a System-on-Chip and its power-area budget. Generic on-chip network topology design does not take into account exact geometric descriptions of the routing⁴, channels⁵, and switchboxes⁶, as it is the phase prior to the detailed layout design of SoCs. Hence, in the current context of the work, floorplanning is done considering the aspect ratio of IP blocks and network routers as 1 (square-shaped). Global routing is carried out to determine the lengths of the walks yielded by the interconnecting wires; the exact geometric details of each wire and pin are disregarded here [44,45]. The intrinsic RC delay of an interconnect is governed by the following equation:

Fig. 14. Location of the bridge routers in the network.

$$D_{unbuffered} = 0.38\, R_{int} C_{int} L^{2} \tag{12}$$

where R_int and C_int are the per-unit-length intrinsic resistance and capacitance of the interconnect material, and L is the length of the wire [43]. The distribution of global wires over the chip is strongly correlated with the network topology of an NoC and its floorplan. With increasing wire length, the quadratic nature of the delay shown in Eq. (12) starts dominating the communication latency. The situation becomes critical when the delay of a wire exceeds the pipeline stage delay of the routers it connects. Inserting inverters (as repeaters or buffers) periodically along the wire arrests this quadratic increase to some extent and makes the growth closer to linear. Using too many repeaters, however, adversely affects the delay, as the delay induced by the repeaters themselves begins to dominate. Hence, the effort to make the network scale-free beyond certain limits (from both 2D and 3D perspectives) needs a comprehensive optimization of various interconnect parameters, most importantly wire delay and power consumption. The buffered delay [43] of a wire after inserting repeaters is expressed as:

⁴ Refers to interconnect routing.
⁵ Spaces not occupied by logic blocks on a layer.
⁶ Routing channels partitioned into rectangular regions.


Fig. 15. Working principle of a bridge router.

Fig. 16. A canonical BFT router.

Fig. 17. Change in network diameter and the number of highest level links in classical BFT and ZBFT topology.

$$D_{buffered} = N(C_G + C_J)R_{eqn}(1+\beta) + \left( C_G(1+\beta)R_{int}M + \frac{C_{int}R_{eqn}}{M} \right) L + \frac{C_{int}R_{int}}{2N}L^{2} \tag{13}$$

where C_G and C_J are the gate and junction capacitances, respectively, of the NMOS inverter, R_eqn is the equivalent resistance of the MOSFET in its operating region, and β is the ratio of the PMOS to NMOS device size. M indicates the size of the inverter, which can be determined by

$$M = \sqrt{\frac{R_{eqn} C_{int}}{R_{int} C_G (1+\beta)}} \tag{14}$$

and N, the optimal number of repeaters required for a wire of length L, is calculated as

$$N = \sqrt{\frac{R_{int} C_{int} L^{2}/2}{R_{eqn}(C_G + C_J)(1+\beta)}} \tag{15}$$
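The following numeric sketch evaluates Eqs. (13)–(15) for the longest 90 nm wire. The aluminum-wire values come from Table 1, while the device parameters (R_eqn, C_G, C_J, β) are placeholder assumptions chosen only to make the example runnable, not values from the paper.

```python
# Optimal repeater count N (Eq. (15)), repeater size M (Eq. (14)) and the
# resulting buffered delay (Eq. (13)). Device values below are assumed.
import math

R_int = 55e-3 / 1e-6      # 55 mOhm/um -> Ohm per meter (Table 1, 90 nm)
C_int = 0.26e-15 / 1e-6   # 0.26 fF/um -> F per meter (Table 1, 90 nm)
L = 11801.2e-6            # the 90 nm L1 wire of Table 6, in meters
R_eqn, C_G, C_J, beta = 4e3, 1e-15, 1e-15, 2.0   # assumed device values

N = math.sqrt((R_int * C_int * L**2 / 2) / (R_eqn * (C_G + C_J) * (1 + beta)))
M = math.sqrt((R_eqn * C_int) / (R_int * C_G * (1 + beta)))

D_buf = (N * (C_G + C_J) * R_eqn * (1 + beta)
         + (C_G * (1 + beta) * R_int * M + C_int * R_eqn / M) * L
         + (C_int * R_int) / (2 * N) * L**2)

print(f"N ~ {N:.1f} repeaters, M ~ {M:.0f}x, buffered delay ~ {D_buf*1e12:.0f} ps")
```

With these assumed device values the result lands in the same range as the 558.56 ps buffered L1 delay reported in Table 6(a).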

Fig. 18. Pictorial description of scalability of the proposed design.

A. Bose and P. Ghosal

Journal of Systems Architecture 108 (2020) 101738

Fig. 19. Floorplan of a layer of the proposed design including bridge routers.

Table 1
Parameters of aluminum wire and FO4 delay across different technology nodes.

Technology node (nm)   Rint (mΩ/μm)   Cint (fF/μm)   FO4 delay (ps)
130                    45             0.26           49.30
90                     55             0.26           37.97
65                     61             0.22           27.90
45                     48             0.17           19.04
32                     42             0.15           13.90
22                     35             0.12           11.20
14                     27             0.09            8.93

5.1.1. Experiment details

Experimentation on the different interconnect delays of the proposed design has been done across the technology nodes of 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm, and 14 nm. Instances for the IP block nodes of the network are adopted from real modern processing core architectures. The same process is repeated for hypothetical IP cores having 100 K gates⁷ on the various technology nodes mentioned above. There is plenty of evidence in support of IP blocks having 100 K logic gates: an MPEG2 decoder, a DSP processor, and a general-purpose RISC processor can all be manufactured within 100 K gates [46]. Architecture details of the processing cores are given in Table 2. Processing cores of TileGx-64 [47] from Tilera™, a variant of the Tile processor [48], are used for the 90 nm technology node. For the 65 nm technology node, processing cores of Intel's Teraflop research architecture [49] have been used. The ARM® Cortex®-A9 [50] on 45 nm is a popular general-purpose choice of low-power, cost-effective processing cores. UltraSPARC®-I and -II (also known as Niagara and Niagara II), multi-threaded SPARC architecture [51] based processing cores, are used for the 32 nm and 22 nm technology nodes, respectively; details of these can be found in [52]. Lastly, the Intel® Xeon® Scalable architecture [53] is chosen as the 14 nm IP core instance to be placed on a layer of the proposed network. McPAT 1.1, an integrated power, area, and timing modeling framework [54], has been used to model the above real processor architectures; the architectural parameters given in Table 2 are supplied to the simulator as input, and the area of each of those processing cores (given in Table 2) is obtained from the simulation results. In the case of placing IP blocks comprising 100 K gates each as network nodes, the area approximation of each of those IPs is made based on the transistor counts that can be fabricated across the different technology nodes.

⁷ A two-input minimum-sized NAND gate is considered here for reference.

The approximations are made consulting [55]. The approximated areas of processing cores across the various technology nodes are shown in Table 3.

5.1.2. Results and discussions

The delay through an interconnect must be constrained within the applied clock period; otherwise, synchronization issues arise in the routers, which eventually affect the communication latency. Being focused on the network macro-architecture, it is common practice to consider the minimum conceivable clock cycle time of a highly pipelined design as equal to 15 FO4, where FO4 (fan-out-of-four) is the delay of an inverter driving four identical ones [35,56,57]. With an input capacitance identical to that of an inverter, all other logic gates are less able to deliver the same output current, because an inverter comprises the minimal number of transistors compared to other logic gates; that is why the inverter is the standard benchmark for comparative measurements of delays through logic blocks [58]. Fig. 19 is the floorplan after including bridge routers in the design, as discussed in Section 4.2.7. The colour of a bridge router is the same as that of the routers whose links it is applied on; e.g., red coloured bridge routers are used to segment the links among the red root routers (refer to Fig. 3(a)). A bridge router occupies less than half the area that any other router in the proposed network requires (refer to Table 5); that is why, in Fig. 19, the bridge routers' dimensions are much smaller than the rest. The network possesses five different categories (based on length) of interconnect wires, namely L1, L2, L3, L4, and L5. A set of L1 segments for a diagonal link between two red root routers (refer to Fig. 3(a)) is highlighted in red in Fig. 19, and one L2 segment set is indicated in yellow. For clarity, the rest of the wires are not shown, as they do not need to be divided by applying bridge routers. According to the floorplan, the lengths of the different wires are determined from the span of each wire on the layer, taking the sizes of the IP blocks and routers into account. ORION 3.0, an integrated power-area simulator that explicitly accounts for the control and data path resources of NoC routers [59], is used to estimate the area requirements of the different router types of the design. Configuration details for each type of router are given in Table 4, and the resulting area consumption of the routers across technology nodes (rounded off to two decimal points, except for the bridge router) is given in Table 5. Local and regional routers have six ports; root and border routers have eight ports. Bridge routers (consisting of two ports) occupy the smallest area (rounded off to three decimal points) among all the routers, as they are dedicated solely to reducing the segment lengths of long interconnects.
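A brief sketch (our own, using only Eq. (12) and the Table 1 values) of the acceptance test implied above: a wire passes without segmentation if its intrinsic delay stays within the 15 FO4 budget. The delays in Table 6 also reflect effects beyond this first-order model, so the numbers are approximate.

```python
# First-order check of a wire against the 15*FO4 clock budget using
# Eq. (12) and the per-node parameters of Table 1.

TABLE1 = {  # node (nm): (Rint in mOhm/um, Cint in fF/um, FO4 in ps)
    130: (45, 0.26, 49.30), 90: (55, 0.26, 37.97), 65: (61, 0.22, 27.90),
    45: (48, 0.17, 19.04), 32: (42, 0.15, 13.90), 22: (35, 0.12, 11.20),
    14: (27, 0.09, 8.93),
}

def unbuffered_delay_ps(node_nm: int, length_um: float) -> float:
    r_mohm, c_ff, _ = TABLE1[node_nm]
    # Eq. (12): 0.38 * Rint * Cint * L^2; Ohm * fF gives fs, so divide by 1000
    return 0.38 * (r_mohm * 1e-3) * c_ff * length_um**2 * 1e-3

def within_clock_budget(node_nm: int, length_um: float) -> bool:
    return unbuffered_delay_ps(node_nm, length_um) <= 15 * TABLE1[node_nm][2]

print(unbuffered_delay_ps(90, 11801.2))   # ~757 ps vs 796.62 ps in Table 6(a)
print(within_clock_budget(90, 11801.2))   # False: the L1 wire needs segmentation
```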


Table 2
Architecture details of experimented IP cores.

(a) TileGx-64 on 90 nm technology
  Core size: 4.24 mm²; L1 I-cache: 8 KB; L1 D-cache: 8 KB; L2 cache: 64 KB;
  I-TLB: 8-entry; D-TLB: 16-entry; register count: 64 (64 bit); bus width: 32 bit;
  operating frequency: 750 MHz to 1 GHz (max.); ISA: 64 bit VLIW; DMA engine: yes; embedding option: yes.

(b) Intel TeraFlop on 65 nm technology
  Core size: 2.66 mm²; L1 I-cache: 3 KB; L1 D-cache: 2 KB; L2 cache: no;
  I-TLB: 32-entry; D-TLB: 32-entry; register count: 10 (32 bit); bus width: 32 bit;
  operating frequency: 3.27 GHz to 5 GHz (max.); ISA: 96 bit VLIW; DMA engine: no; embedding option: no.

(c) ARM Cortex-A9 on 45 nm technology
  Core size: 2.43 mm²; L1 I-cache: 16 KB; L1 D-cache: 32 KB to 64 KB (max.); L2 cache: 128 KB to 8 MB (max.);
  I-TLB: 64-entry; D-TLB: 64-entry; register count: 37 (32 bit); bus width: 32 bit;
  operating frequency: 800 MHz to 2 GHz (max.); ISA: 32 bit RISC; DMA engine: no; embedding option: yes.

(d) Sun Niagara-I on 32 nm technology
  Core size: 4.67 mm²; L1 I-cache: 16 KB; L1 D-cache: 8 KB; L2 cache: 3 MB;
  I-TLB: 64-entry; D-TLB: 64-entry; register count: 640 (64 bit); bus width: 64 bit;
  operating frequency: 1 GHz to 1.4 GHz (max.); ISA: VIS1 SIMD; DMA engine: no; embedding option: no.

(e) Sun Niagara-II on 22 nm technology
  Core size: 4.67 mm²; L1 I-cache: 16 KB; L1 D-cache: 8 KB; L2 cache: 4 MB;
  I-TLB: 64-entry; D-TLB: 64-entry; register count: 640 (64 bit); bus width: 64 bit;
  operating frequency: 1.2 GHz to 1.6 GHz (max.); ISA: VIS2 SIMD; DMA engine: no; embedding option: no.

(f) Intel Xeon on 14 nm technology
  Core size: 7.04 mm²; L1 I-cache: 128 KB; L1 D-cache: 16 KB; L2 cache: 512 KB to 1 MB (max.);
  I-TLB: 128-entry; D-TLB: 128-entry; register count: 48 (256 bit); bus width: 64 bit;
  operating frequency: 3.4 GHz; ISA: 64 bit X86-64; DMA engine: no; embedding option: yes.

Area consumption of the other routers scales accordingly. Based on the number of interconnects routed through them, channel and switchbox widths (see the footnotes in Section 5.1) are also included while calculating the interconnect lengths. Considering the global interconnect wires to be of the metal-5 category, the wire width and pitch [35] are both taken as 8λ [4,44], where 2λ is the technology node [4]. Table 1 summarizes the FO4 values as well as the per-unit-length intrinsic resistance (Rint) and capacitance (Cint) of metal-5 aluminum (Al) wire. The values of Rint and Cint are practical in the context of device fabrication and are collected from [60]; the FO4 delay values across technology nodes are empirical and can be calculated using the Elmore delay model [4,43]. Table 6 summarizes the delays yielded by the various interconnects when tested on practical IP block instances as the communicating nodes of the network. Across all technology nodes, only the wires of types L1 and L2 need to be segmented using bridge routers, and buffering is required only for L1 segments (except at the 22 nm technology node). As shown in Fig. 19 and discussed in Section 4.2.7, some wires connecting inter-tree root and border routers on a layer fall into this L1 category; all other interconnect wires are exempted from using repeaters. As can be seen in Table 6(a)–(d) and (f), after segmentation of the L1 wires using two bridge routers, a certain number of repeaters is used to confine the delay within 15 FO4. In the case of the 22 nm UltraSPARC®-II architecture (shown in Table 6(e)), no repeater is required for any wire type, because the unbuffered delays remain under 15 FO4. The power consumption overhead of using buffers ranges from a minimum of 0.11% to a maximum of 0.28%. On the other hand, Table 7 shows that the proposed network design is capable of limiting the interconnect delays within 15 FO4 without employing any bridge router or buffer when tested on IP cores of 100 K gates as the source and destination node instances. It is not necessarily the case that all IP cores of an SoC reside on a single layer: in many design scenarios L2 cache memory blocks also reside on a layer, and a cache memory block consumes much less space than a processing core. In spite of that, the experimentation on interconnect delays and their power consumption is done pessimistically, to set an upper limit on the above parameters for each wire. The network design comes out to be an efficient one in terms of interconnect overhead, as most of the wires do not require repeaters. The interconnect delays shown in Tables 6 and 7 can be used to estimate the maximum allowable frequency of the routers of the proposed network for data transmission. Though the router configurations shown in Table 4 use the same frequency as the processing cores shown in Table 2, the actual operating frequency of the routers across technology nodes should be estimated from the maximum wire delay yielded by the design. Table 8 shows the maximum frequency at which a router of the proposed design can operate when tested on real processor instances as IP blocks; according to the floorplan shown in Fig. 19, the frequency values are determined based on the delay of the longest wire type, L1. Most of these frequencies are beyond those at which the corresponding processors (shown in Table 2) run across the various technology nodes.


Table 3
Approximated core size of hypothetical IP cores comprising 100 K gates across different technology nodes.

Technology node (nm)   130    90       65       45     32       22      14
Core area (mm²)        0.09   0.0441   0.0324   0.04   0.0225   0.046   0.035

Table 4
Router configuration details.

Root and border routers: 8 input ports, 8 output ports, 4 input VCs, 0 output VCs, virtual channel depth 16, flit width 32 bits, clock frequency same as the IP cores.

Local and regional routers: 6 input ports, 6 output ports, 4 input VCs, 0 output VCs, virtual channel depth 16, flit width 32 bits, clock frequency same as the IP cores.

Bridge routers: 2 input ports, 2 output ports, 2 input VCs, 0 output VCs, virtual channel depth 16, flit width 32 bits, clock frequency same as the IP cores.

So the design can support high-frequency routers with sophisticated micro-architectures. Table 9 shows the maximum router frequencies when IP blocks of 100 K gates are used as processing nodes. Currently, the maximum operating frequency found achievable is 8 GHz, on AMD FX series processors [61]. Though commercial models of the FX processor run at a maximum of 4.7 GHz, the architecture is capable of operating at 8 GHz with a thermal cooling mechanism using liquid nitrogen/helium. Therefore, the frequency values reported in Table 8 are realistic and implementable. The router frequencies across technology nodes presented in Table 9 are mostly hypothetical; these values are given to justify that the interconnect lengths of the proposed design do not create any channel bottleneck in the proposed network, as these wires can withstand data rates much higher than those in practice. As the design uses a DTDMA bus [1] and the three-dimensional inter-layer distance is very small (around 20 μm in 90 nm technology) [1,22], the inter-layer delay does not affect network performance.

5.1.3. Scalability revisited

One of the major drawbacks of scaling a BFT in the classical way [7,42], from the wire length perspective, is discussed in Section 4.3. The repercussions of this long-interconnect issue are investigated from the experimental viewpoint, which concludes that no practical solution exists for such theoretic scalability when the network is built on practical IP core instances. The network performance test is done using two layers of a ZBFT topology totalling 512 IP nodes. Tables 10 and 11 report the effect of generic scalability on the wires of the different levels of a BFT having 256 and 512 nodes, respectively, on practical IP core instances across various technology nodes (refer to Table 2). For an NoC layer of 16 × 16 nodes (i.e., equivalent to a single ZBFT layer), a classical BFT possesses 5 levels. It is found that the delay of the 64 highest level links (i.e., W5,4, those connecting the routers between levels 4 and 5) exceeds the desired clock cycle limit (15 FO4), as reported in Table 10. When the network comprises 512 communicating nodes, the number of highest level links (W6,5, between levels 5 and 6) becomes 128, and these cross the interconnect delay threshold. Moreover, there are also 256 interconnects (W5,4) between levels 4 and 5 (except at the 45 nm technology node) whose latency is large enough to affect the network performance severely. The delays are so high that they cannot be confined within the limit even by using buffers; in some cases (refer to Tables 11(b) and 11(d)), the buffered delay even exceeds the unbuffered one, owing to the large number of inverters used. Clearly, this large number of slow interconnects dominates the network latency, which eventually creates a performance bottleneck with increasing injection rate, making the actual implementation of a classical BFT impossible for considerably large network sizes. Conversely, the design and floorplan of the ZBFT not only reduce the number of highest level interconnects (to 16) but also confine both the number of buffers applied to the long interconnects and the associated buffered delay within the desired limit. Thus, the proposed ZBFT network is capable of integrating a large number of IP cores without affecting the network performance.

5.2. Network performance

The structure of a generic Butterfly Fat Tree network is governed by the equation

$$n = \log_2 N - 3 \tag{16}$$

where n is the highest (root) level (counting the leaf level as level 0), which depends on the number of leaf nodes (N) of the tree, i.e., the communicating IP blocks in an SoC [42]. These nodes act as the source, the destination, or both in data communication across the network. Message arrival and departure rates to and from a router significantly impact the communication latency of a network [11,42]. In a BFT network of N = 4^n processing nodes, there exist 4^n − 1 possible destinations for a source. Therefore, a flit arriving at a router of level l may go up if its destination falls into the group of 4^n − 4^l nodes; the reason is that 4^l − 1 nodes can be reached without going up from a router at level l.

Table 5
Area (in mm²) of different routers in the network across various technology nodes.

Router type                   130 nm   90 nm   65 nm   45 nm   32 nm   22 nm   14 nm
8 port (root and border)      0.21     0.15    0.11    0.07    0.06    0.04    0.03
6 port (local and regional)   0.14     0.10    0.07    0.05    0.04    0.03    0.02
2 port (bridge)               0.009    0.008   0.007   0.006   0.005   0.004   0.003


Table 6
Summary of interconnect delays and their power consumption for the proposed NoC on practical IP core instances across different technology nodes.

(a) TileGX-64 on 90 nm technology (15 FO4 = 569.55 ps)
Wire   Length (μm)   Unbuf. delay (ps)   Buf. delay (ps)   Bridges   Buffers   Power unbuf. (mW)   Power buf. (mW)   Overhead (%)
L1     11801.2       796.62              558.56            2         4         3.313               3.318             0.15
L2     9039.2        467.37              NA                1         0         2.538               NA                NA
L3     9780.5        547.18              NA                0         0         2.746               NA                NA
L4     5266.3        158.65              NA                0         0         1.478               NA                NA
L5     6621.8        250.82              NA                0         0         1.859               NA                NA

(b) Intel TeraFlop on 65 nm technology (15 FO4 = 418.50 ps; columns as in (a))
L1     9250.4        459.34              382.95            2         4         9.849               9.865             0.17
L2     7057.9        267.41              NA                1         0         2.538               NA                NA
L3     7697.3        318.04              NA                0         0         2.746               NA                NA
L4     4164.4        93.10               NA                0         0         1.478               NA                NA
L5     5160          142.93              NA                0         0         1.859               NA                NA

(c) ARM Cortex-A9 on 45 nm technology (15 FO4 = 285.60 ps; columns as in (a))
L1     10904.3       338.11              280.44            2         4         3.707               3.711             0.11
L2     8449.3        233.02              NA                1         0         2.872               NA                NA
L3     8959.9        262.04              NA                0         0         3.046               NA                NA
L4     4732.6        73.11               NA                0         0         1.609               NA                NA
L5     6230.2        126.70              NA                0         0         2.118               NA                NA

(d) Sun Niagara on 32 nm technology (15 FO4 = 208.50 ps; columns as in (a))
L1     10134.5       258.83              201.40            2         7         4.309               4.321             0.28
L2     8309.9        174.02              NA                1         0         2.538               NA                NA
L3     8758.4        193.31              NA                0         0         2.746               NA                NA
L4     4601.5        53.36               NA                0         0         1.478               NA                NA
L5     6178.7        96.21               NA                0         0         1.859               NA                NA

(e) Sun Niagara-II on 22 nm technology (15 FO4 = 168 ps; columns as in (a))
L1     8330.9        131.02              123.04            2         0         2.373               NA                NA
L2     6456.9        70.05               NA                1         0         1.735               NA                NA
L3     7840.5        103.28              NA                0         0         2.107               NA                NA
L4     3594.7        21.71               NA                0         0         0.966               NA                NA
L5     3656.2        22.46               NA                0         0         0.982               NA                NA

(f) Intel Xeon on 14 nm technology (15 FO4 = 133.95 ps; columns as in (a))
L1     13764.4       184.16              130.4             2         6         2.063               2.067             0.19
L2     10653.5       110.32              NA                1         0         2.538               NA                NA
L3     11146.8       120.78              NA                0         0         2.746               NA                NA
L4     5572.6        30.59               NA                0         0         1.478               NA                NA
L5     8079.14       63.45               NA                0         0         1.859               NA                NA

Assuming the network is not in a saturated state, the probability that a flit goes up from a router of level l is

$$P_l^{up} = \frac{4^{n} - 4^{l}}{4^{n} - 1} \tag{17}$$

Thus, for downward movement, the chance for a flit is

$$P_l^{down} = 1 - P_l^{up} \tag{18}$$

The total flit rate going up from level l to l + 1 is then $P_l^{up} 4^{n} \lambda_0$, where λ0 is the traffic injection rate of each processor. A level l has 4^n/2^l links going up to level l + 1. Therefore, the load (λ_{l,l+1}) on a single up-going channel [42] can be expressed as

$$\lambda_{l,l+1} = \lambda_0 \frac{4^{n} - 4^{l}}{4^{n} - 1} 2^{l} \tag{19}$$
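The sketch below (our own, for illustration) evaluates Eqs. (17) and (19) for a 64-node BFT under uniform random traffic:

```python
# Up-going probability (Eq. (17)) and per-channel up-going load
# (Eq. (19)) at each level of a classical BFT with N = 4**n leaf IPs,
# in units of lambda0.

def p_up(n: int, l: int) -> float:
    return (4**n - 4**l) / (4**n - 1)          # Eq. (17)

def load_up(n: int, l: int) -> float:
    return p_up(n, l) * 2**l                   # Eq. (19), one channel

n = 3  # 64-node, 4-ary 3-fly BFT
for l in range(1, n + 1):
    print(f"level {l}: P_up = {p_up(n, l):.3f}, channel load = {load_up(n, l):.2f} * lambda0")
```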


Table 7
Summary of interconnect delays and their power consumption for the proposed NoC on IP cores of 100 K gates across different technology nodes. No bridge routers or buffers are required for any wire type at any node.

(a) 130 nm technology (15 FO4 = 739.5 ps)
Wire   Length (μm)   Unbuffered delay (ps)   Power consumption (mW)
L1     9158.6        392.56                  4.03
L2     3920.4        71.93                   1.73
L3     2864.6        38.41                   1.26
L4     1876.7        16.49                   0.82
L5     1280.1        7.67                    0.57

(b) 90 nm technology (15 FO4 = 569.55 ps)
L1     7021.76       282.03                  5.26
L2     2940.8        49.47                   1.11
L3     2222.56       28.26                   0.84
L4     1481.96       12.57                   0.56
L5     945.2         5.12                    0.36

(c) 65 nm technology (15 FO4 = 418.50 ps)
L1     6286.8        58                      5.48
L2     2668.4        38.23                   2.33
L3     1695.6        20.74                   1.72
L4     1302.4        9.11                    1.12
L5     867.1         4.04                    0.76

(d) 45 nm technology (15 FO4 = 285.60 ps)
L1     5329.4        92.71                   1.82
L2     2294          17.19                   0.78
L3     1657.6        8.97                    0.57
L4     1081.4        3.82                    0.37
L5     753.5         1.86                    0.26

(e) 32 nm technology (15 FO4 = 208.50 ps)
L1     4553.08       52.25                   0.78
L2     1946.48       9.55                    0.34
L3     1421.64       5.1                     0.25
L4     933.1         2.2                     0.16
L5     636.62        1.03                    0.11

(f) 22 nm technology (15 FO4 = 168 ps)
L1     5107.1        43.82                   1.38
L2     2327.9        9.11                    0.63
L3     3386.6        19.27                   0.92
L4     1875.8        5.92                    0.51
L5     796.5         1.07                    0.21

(g) 14 nm technology (15 FO4 = 133.95 ps)
L1     4134.1        18.10                   0.65
L2     1984.5        3.83                    0.30
L3     1285.7        1.61                    0.20
L4     788.7         0.61                    0.12
L5     683.2         0.46                    0.11


Table 8
Maximum operating frequency of the routers across different technology nodes for practical IP core instances.

Technology node (nm)        90     65     45     32     22     14
Operating frequency (GHz)   1.80   2.62   3.57   4.97   7.64   7.67

Each router of a BFT has 6 input/output ports to connect two up and four down channels (as discussed in Section 4.2.5). Hence, from Eq. (19), the upward traffic injection that a router of level l + 1 experiences, starting from the leaf level (level 0), is given by

$$\lambda^{R}_{l+1,up} = \lambda_{l,l+1} \times 4 = \lambda_0 \frac{4^{n} - 4^{l}}{4^{n} - 1} 2^{l+2} \tag{20}$$

If each router distributes its upward traffic (traffic that goes to a higher level from a certain level of the BFT) across its two up-going channels, then each channel between levels l + 1 and l + 2 carries a traffic load of

$$\lambda_{l+1,l+2} = \frac{\lambda^{R}_{l+1,up}}{2} = \lambda_0 \frac{4^{n} - 4^{l}}{4^{n} - 1} 2^{l+1} \tag{21}$$

From Eqs. (19) & (21) it is evident that the traffic load experienced by a channel between levels l + 1 and l + 2 is twice the load experienced by a channel between levels l and l + 1, if proper load balancing is done for up-going traffic. Recursively, the bandwidth requirement of the up-going channels can be expressed as

$$\frac{\lambda_{1,2}}{\lambda_{0,1}} = \cdots = \frac{\lambda_{l,l+1}}{\lambda_{l-1,l}} = 2 \tag{22}$$

For this reason, in the network traffic simulation, the bandwidth of a channel between any level l and l + 1 is accounted as double that of a channel between levels l and l − 1. The downward traffic (traffic that goes to a lower level from a certain level of the BFT) load on a channel going down from level l to l − 1 can be expressed as

$$\lambda_{l,l-1} = \lambda_0 P_l^{down} 2^{l} \tag{23}$$

This load becomes maximum when all the traffic from the processing nodes at level 0 happens to be destined upward beyond level l, with probability $P_l^{up} = \frac{4^{n}-4^{l}}{4^{n}-1}$, because in a BFT an upward packet that leaves a certain level must come down below that level to reach its destination. Thus, for this upward and downward traffic symmetry [42], Eq. (23) can be rewritten as

$$\lambda_{l,l-1} = \lambda_0 \frac{4^{n} - 4^{l}}{4^{n} - 1} 2^{l} \tag{24}$$

Eq. (24) is true for any down-going channel and can be written recursively as

$$\lambda_{l+1,l} = \lambda_{l,l-1} = \lambda_{l-1,l-2} = \cdots = \lambda_{1,0} \tag{25}$$

A router at any level has to route traffic downward from two of its up (parent) channels. Therefore, the downward traffic load that any router at a certain level l faces is

$$\lambda^{R}_{l,down} = \lambda_{l+1,l} \times 2 = \lambda_0 \frac{4^{n} - 4^{l}}{4^{n} - 1} 2^{l+1} \tag{26}$$

As per the design of a BFT, for routing a downward packet a router has only one eligible down channel, because any destination resides in a specific locality of a specific region of the tree. Thus, for downward traffic, the blocking probability of flits increases because of the absence of load balancing. That is why, if a BFT network is scaled in the classical manner [7,42], the increasing scale of the network starts affecting the performance. Experimentation on this fact proves that the proposed design achieves a high performance gain in terms of latency and throughput even at high injection rates. Firstly, successive doubling of the up-going channel bandwidth has been done up to the root level routers. Secondly, as per the proposed design, traffic from the root routers gets divided into 2D and 3D portions (discussed in Section 4.2). As the numbers of up and down links of the root and border routers are the same (namely 4), no bandwidth doubling is required there. The amalgamation of these two techniques makes the network efficient across various traffic patterns even at high injection rates. For the proposed design, to estimate the intra-layer (on root routers) and inter-layer (on border routers) up-going traffic pressure, the up-going load of each channel between layer 2 (the layer of regional routers) and layer 3 (the layer of root routers) of each tree must be determined. Taking the level of the root layer as l, from Eq. (19) the injection load on each up-going channel from the regional to the root level of a tree can be expressed as

$$\lambda_{l-1,l} = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l-1} \tag{27}$$

So, if all the regional routers employ load balancing, the up-going traffic load on each root router is

$$\lambda^{R}_{Root,up} = \lambda_{l-1,l} \times 4 = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l+1} \tag{28}$$

Assuming symmetry between 2D and 3D traffic, this load is further divided into two portions. Hence, each border router dedicated to a particular tree faces an injection load (considering the pillar node connectivity discussed in Section 4.1) from its tree of

$$\lambda^{R}_{Border,up} = \frac{\lambda^{R}_{Root,up} \times 4}{2} = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l+2} \tag{29}$$

because all four root routers pass inter-layer traffic to the border router. As per the load balancing routing strategy discussed in Section 4.2, each of the three channels connecting the other border routers on the same layer, and the channel that connects the pillar node, individually experience a traffic rate of

$$\lambda_{Border,j} = \frac{\lambda^{R}_{Border,up}}{4} = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l} \quad \forall j,\ j \in C_{border_{up}} \tag{30}$$

where l is the root level and $C_{border_{up}}$ is the set of all outgoing links of a border router that carry traffic out of the tree. Clearly, from Eqs. (28) & (30), bandwidth doubling is not required for the outgoing channels of a border router, as

$$\lambda_{Root,Border} = \frac{\lambda^{R}_{Root,up}}{2} = \lambda_{Border,j} \tag{31}$$

Eq. (31) clearly says that the amount of outgoing 3D traffic (λ_{Root,Border}) injected by every root router of a tree into the border router equals the amount of traffic that every outgoing channel of the border router needs to carry.
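To make the load-balancing algebra concrete, the sketch below (ours) evaluates Eqs. (27)–(31) for the 64-node tree (n = 3, root level l = 2); the identifiers are illustrative.

```python
# Per-channel and per-router up-going loads at the root/border
# hierarchy, Eqs. (27)-(31), in units of lambda0.

def lam_regional_to_root(n: int, l: int) -> float:
    # Eq. (27): one up-going channel from the regional to the root level
    return (4**n - 4**(l - 1)) / (4**n - 1) * 2**(l - 1)

def lam_root_up(n: int, l: int) -> float:
    # Eq. (28): total up-going load on one root router (four down links)
    return 4 * lam_regional_to_root(n, l)

def lam_border_channel(n: int, l: int) -> float:
    # Eqs. (29)-(30): the 3D half of all four root routers, spread over
    # the border router's four outgoing channels
    return (lam_root_up(n, l) * 4 / 2) / 4

n, l = 3, 2
print(lam_regional_to_root(n, l))   # ~1.90, matching Eq. (35)
print(lam_root_up(n, l) / 2)        # root-to-border channel, Eq. (31)
print(lam_border_channel(n, l))     # equal to the line above
```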


Table 9
Maximum operating frequency of the routers across different technology nodes for IP core instances consisting of 100 K gates.

Technology node (nm)        130    90     65      45      32      22      14
Operating frequency (GHz)   2.55   3.55   10.79   17.25   19.14   22.83   55.25

Table 10
Summary of the highest level interconnect delays of a 16 × 16 classical BFT on practical IP core instances across different technology nodes. In every case the interconnect category is W5,4 with 64 instances.

Technology node (IP core)   Length (μm)   Unbuffered delay (ps)   Buffered delay (ps)   15 FO4 (ps)   Buffers
90 nm (TileGX-64)           17223.01      1696.74                 710.36                569.55        13
65 nm (Intel TeraFlop)      13572.61      988.88                  535.30                418.50        10
45 nm (ARM Cortex-A9)       12845.77      885.8                   505.88                285.60        9
32 nm (Sun Niagara)         17588.14      779.55                  533.91                208.50        8
22 nm (Sun Niagara-II)      17513.15      515.28                  270.35                168           11
14 nm (Intel Xeon)          21376.4       444.16                  199.44                133.95        13

This is possible because the number of channels going out from a border router equals the number of incoming channels from the root level of the corresponding tree. From every root router, the amount of intra-layer 2D traffic generated is determined from Eq. (28) as

$$\lambda^{R}_{Root,up_{2D}} = \frac{\lambda^{R}_{Root,up}}{2} = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l} \tag{32}$$

Hence, when a root router distributes its 2D traffic across the three outgoing channels through which it is connected to the other same coloured root routers of the same layer, each of those channels experiences a traffic load of

$$\lambda_{Root_{up,2D}} = \frac{\lambda^{R}_{Root,up_{2D}}}{3} = \frac{1}{3}\,\lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l} \tag{33}$$

As l is the level of the root routers and n is the number of router levels, substituting l with 2 and n with 3 in Eq. (33) we get

$$\lambda_{Root_{up,2D}} = 1.27\lambda_0 \tag{34}$$

Applying the same process to Eq. (27), the bandwidth requirement of each channel between the regional and root levels, in terms of λ0, is given by

$$\lambda_{l-1,l} = 1.9\lambda_0 \tag{35}$$

λ_{l−1,l} is almost 1.5 times λ_{Root_{up,2D}}, which means each outgoing channel of a root router requires only about 66% of the bandwidth of a channel between the regional and root levels for intra-layer 2D traffic routing. The downward traffic load on any router of level l (shown in Eq. (26)) and the bandwidth of its down-going channels both scale accordingly, as discussed above. The above estimations are made assuming uniform random traffic; the assumption is that the traffic that leaves a zone (locality, region, tree, or layer) also comes back to the zone, considering non-static source and destination nodes. The distribution of channel bandwidth at every level of the proposed design changes according to the traffic pattern being considered. The proposed design has m layers, each of which comprises four 4-ary 3-fly BFTs. For example, assume a hypothetical scenario where a particular tree on a specific layer receives all the 3D traffic across the layers. Then, from Eq. (29), the 3D


Table 11
Summary of the interconnect delays of a classical BFT for an NoC comprising 512 IPs, on practical IP core instances across different technology nodes.

Technology node (IP core)   Wire   Instances   Length (μm)   Unbuffered delay (ps)   Buffered delay (ps)   15 FO4 (ps)   Buffers
90 nm (TileGX-64)           W6,5   128         23700.39      3212.98                 977.6                 569.55        18
                            W5,4   256         11850.20      803.24                  582.11                569.55        9
65 nm (Intel TeraFlop)      W6,5   128         18675.40      1872.21                 990.19                418.50        11
                            W5,4   256         9337.70       468.06                  497.47                418.50        6
45 nm (ARM Cortex-A9)       W6,5   128         17672.31      1019.39                 399.7                 285.60        14
32 nm (Sun Niagara)         W6,5   128         23813.21      1429.03                 733                   208.50        11
                            W5,4   256         11906.61      357.26                  362.61                208.50        6
22 nm (Sun Niagara-II)      W6,5   128         24086.21      974.65                  371.73                168           15
                            W5,4   256         12043.11      243.67                  186.33                168           11
14 nm (Intel Xeon)          W6,5   128         29396.30      839.95                  274                   133.95        17
                            W5,4   256         14698.15      209.99                  137.16                133.95        9

incoming traffic load on the border router in context is

$$\lambda^{R}_{Border,down} = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l+4} \times m \tag{36}$$

where m is the number of layers and l is the level of the root routers in the corresponding BFT to which the border router is dedicated. If the border router balances this load across all four root routers of its tree, then each down-going link between the border and root levels of the tree carries

$$\lambda_{Border,Root} = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l+2} \times m \tag{37}$$

amount of traffic from the border router to a certain root router of the corresponding tree. Furthermore, it is assumed that all the intra-layer traffic has also been destined for this tree. So, according to Eqs. (32) & (37), the total amount of downward traffic that each root router faces is

$$\lambda^{R}_{Root,down} = \lambda_{Border,Root} + \lambda^{R}_{Root,up_{2D}} \tag{38}$$

Taking this load to be locally uniform (i.e., distributed among all the processing cores of the tree in context), each downward channel between the root and regional levels of the tree experiences a traffic load of

$$\lambda_{Root,Regional} = \frac{\lambda^{R}_{Root,down}}{4} = \frac{\lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l+2} m + \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} 2^{l}}{4} = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} (1 + 4m) 2^{l-2} = \lambda_{l,l-1} \tag{39}$$

where l is the root level. Hence, the downward load on a regional router can be expressed as

$$\lambda^{R}_{Root,Regional} = \lambda_{Root,Regional} \times 2 = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} (1 + 4m) 2^{l-1} \tag{40}$$
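A sketch (ours, under the equations above) of the resulting downward bandwidth chain for m layers, with n = 3 and root level l = 2, in units of λ0:

```python
# Worst-case downward loads following Eqs. (37) and (39)-(41).

def down_loads(m: int, n: int = 3, l: int = 2):
    frac = (4**n - 4**(l - 1)) / (4**n - 1)
    border_to_root = frac * 2**(l + 2) * m              # Eq. (37)
    root_to_regional = frac * (1 + 4 * m) * 2**(l - 2)  # Eq. (39)
    regional_router = 2 * root_to_regional              # Eq. (40)
    regional_to_local = regional_router / 4             # Eq. (41)
    return border_to_root, root_to_regional, regional_router, regional_to_local

for m in (1, 2, 4):
    print(m, [round(x, 2) for x in down_loads(m)])
```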


Fig. 20. Communication patterns of the IPs on a 16 × 16 layer in different traffic injection modes.

and

$$\lambda_{Regional,Local} = \frac{\lambda^{R}_{Root,Regional}}{4} = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} (1 + 4m) 2^{l-3} \tag{41}$$

Lastly, every channel by which a local router connects a processing core faces a downward traffic load of

$$\lambda_{Local,Leaf} = \frac{\lambda_{Regional,Local} \times 2}{4} = \lambda_0 \frac{4^{n} - 4^{l-1}}{4^{n} - 1} (1 + 4m) 2^{l-2} \tag{42}$$

Clearly, for the (unrealistic) traffic pattern considered here, Eqs. (37), (39), (41) and (42) give the downward bandwidth requirements of each link between two levels at the various hierarchies of the design, taking the root level as l. The bandwidth requirements for upward and downward traffic discussed above are the highest that can possibly be incurred by the proposed network. In practice, the actual bandwidths of the channels are much lower than those yielded by the above analysis, because of the discrepancy between the traffic arrival and departure rates in a router. Let λ be the mean arrival rate at an input port of a router. Then the departure rate γ of flits from an output port can be expressed as

$$\gamma = \frac{1}{S + w} = k\lambda \tag{43}$$

where S is the average processing time of a flit and w is the average waiting time that a flit spends in a VC. The input injection rate thus decreases by a factor of k at the next routing stage of the network. The radix (number of input and output ports) of a router and its traffic load together influence the various arbitration processes inside it (as discussed in Section 3); this impacts S, which in turn affects the waiting time w of a flit. After a router reaches its peak throughput (the number of flits it can process in unit time), the output channels' bandwidth starts playing an important role and, if not taken care of, can affect the communication latency significantly.

This situation occurs because of constant injection at the input ports of the router, which prevents the input VCs from ever being empty. This aspect, inferred from Eq. (43), drives the simulation framework to treat the channel bandwidths relative to each other and scale them with the hop delay of the routers.

5.2.1. Network load analysis across various traffic injection patterns

Based on the locations of all the source and destination pairs, the traffic load on the routers of a network varies considerably. The experimentation on network performance is carried out applying eight different traffic patterns. In the context of the proposed network, the employed traffic patterns can be categorized into three classes:

i. Global: an IP sends all of its data outside of its BFT.
ii. Semi-global: a portion of the traffic generated by an IP remains confined within the BFT to which the source belongs.
iii. Local: both the source and the destination reside in the same BFT.

Fig. 20 portrays the mapping patterns of communicating node pairs on a 16 × 16 NoC layer under diverse traffic injection loads. Under a particular injection pattern, the zone-based routing balances the overall communication load by splitting it into 2D and 3D traffic, exploiting the structure of the proposed network. How this strategy improves network performance in each of the injection modes is illustrated below.

Bit-complement traffic. Fig. 20 shows the locations of a set of source-destination pairs (blue coloured) of IP cores under bit-complement traffic injection. Every node in the 0th row sends packets to a certain node in the 15th row of the layer. Following the mathematics behind the mapping for bit-complement traffic [11], it can be concluded that in the proposed network the BFTs on a layer communicate in a diagonal fashion. Therefore, a node in the lower-left tree transmits data to a particular node belonging to the upper-right tree, and vice versa. The association among these nodes is clear from Fig. 20. As the bit-complement traffic is completely global in nature, where all the traffic from a BFT crosses its root routers, it is essential to analyse the load imposed on each root router.


Fig. 21. 2D bit-complement traffic load distribution by a root router.

From Eq. (19), the bit-complement traffic load on a root router can be determined as

$$\lambda_{l,l+1} = \lambda_0 \frac{4^{n} - 4^{l}}{4^{n} - 1} 2^{l} \times 4 = \lambda_0 \cdot 1 \cdot 2^{l+2} = 2^{l+2}\lambda_0 \tag{44}$$

As each source communicates only with a particular destination that resides on a different BFT, the probability $\frac{4^{n}-4^{l}}{4^{n}-1}$ of a packet going upward from the root level always remains 1. Exploiting the structural benefit of the proposed topology, this amount of traffic is divided into two classes, one for intra-layer (2D) and another for inter-layer (3D) communication. Conserving the diagonal nature of communication in the bit-complement pattern, each IP of a layer sends half of its traffic to the diagonally opposite BFT on the same layer, and the other half to the same destination BFT on the other layer. For example, the IPs of BFT 3 (refer to Fig. 7 for the numbering convention) route half of their packets to the IPs belonging to BFT 1 on layer 2 (refer to Fig. 12). As a result of this policy, the traffic load on each root router of a BFT is scaled down by half compared to the case where all the packets from a root router are routed to the (diagonally placed) destination BFT in the same layer. A root router distributes its 2D traffic load across the 3 output channels that connect the three other same coloured root routers in the layer. Hence, each of those output channels faces a traffic load⁸ of

$$\lambda_{1,2} = \frac{2^{l+2}\lambda_0 / 2}{3} = 2.67\lambda_0 \tag{45}$$

All the other three neighbouring (same coloured) root routers in the layer also balance their 2D traffic load in the same way. Fig. 21 depicts how a root router balances its 2D traffic across the neighbouring routers. The output channel of router B that connects router C has to carry 2.67λ0 of self traffic (sky-blue coloured) coming from the BFT associated with router B, plus 2.67λ0 of traffic (green coloured) injected by router A into the southern input port of router B. Therefore, an accumulated traffic of 5.34λ0 is injected into the western input port of router C, which holds for every root router when bit-complement injection is applied. Fig. 22 is a detailed illustration of the picture shown in Fig. 21, in the context of the 2D load distribution done by router A. The BFT associated with router A generates traffic totalling 64λ0.

⁸ As per the structural mathematics [42] of a BFT, the lowest level of switches (where the local routers reside) is always taken as 1 while calculating the number of channels between two levels, which is essential for estimating the traffic load on the routers.
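A two-line sketch (ours) of the arithmetic in Eqs. (44) and (45):

```python
# Bit-complement load on a root router (Eq. (44)) and the per-channel 2D
# load after halving into 2D/3D portions and spreading over three
# inter-tree links (Eq. (45)), in units of lambda0, with root level l = 2.

l = 2
root_load = 2 ** (l + 2)               # Eq. (44): 16 * lambda0
per_channel_2d = (root_load / 2) / 3   # Eq. (45): ~2.67 * lambda0
print(root_load, round(per_channel_2d, 2))
```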

With proper load balancing at every zonal router, this imposes a traffic load of 16λ0 (refer to Eq. (44)) on each root router of the BFT. Therefore, each down-link of router A transfers an upward injection of 4λ0. The simulation environment is configured with 4 VCs per input port of a router, so it is reasonable to take every input VC at a port accepting traffic from the regional level to face a data injection of λ0. The whole upward traffic is divided into 2D and 3D portions as discussed above. With the help of colouring the VCs, it is described how the self traffic load is distributed across the three neighbouring root routers: green VCs at the input ports of A carry 2D traffic, whereas grey VCs are assigned to channel the 3D load. The diagonal nature of the bit-complement communication makes it convenient to assign 4 VCs of C to the VCs of A, as shown in Fig. 22. The remaining 4 VCs of A are distributed across routers B and D. The VCs at the north input port of router D carry two kinds of traffic: the black ones carry the self traffic from router C destined for A, and the yellow ones carry the self traffic of B destined for router D. The orange VCs at the western input port of D carry self traffic from router B that reaches the destination D by traversing through router A. In light of the pictorial descriptions in Figs. 21 and 22, together with the associated discussion, the maximum accumulated injection that any VC of a root router experiences is at the rate of 1.335λ0. So, in a steady-state situation where all the VCs face constant traffic injection, at every inter-tree output port of A that carries bit-complement traffic there are 4 VCs competing for the same crossbar passage; it thus takes 20 cycles to create space in a VC for a packet of size 5 flits, and congestion inside a root router takes place only if a packet arrives in less than 20 cycles. In practice, the actual amount of injection becomes much lower because of the service time incurred at every hop, as shown in Eq. (43). The inequality shown below expresses the non-conformant (lossless) traffic flow at the nth stage of a communication:

$$\gamma_n = \frac{1}{\lambda} + \sum_{i=1}^{n} S_i \;\geq\; \left(1 - \frac{m\sigma}{p}\right)\left(\frac{v}{p}\right)^{\alpha} \tag{46}$$

The L.H.S. of the inequality is the spacing between two consecutive packet arrivals at some input port of the nth stage router, where λi is the injection rate at the input port of the router at the ith hop and Si is the single-hop delay (in cycles) for a packet that each of those v VCs experiences. The R.H.S. of the inequality represents the time that the nth router takes to transmit an m-flit packet while randomized switch arbitration is applied (refer to Eq. (3)); each router employs v VCs at an input port, and σ is the critical path delay of a router. Under continuous injection at each input port of a local or regional router accepting upward traffic, it takes a maximum of 40 cycles to transmit a five-flit packet, as for each up-going channel at the local or regional level there exist eight competitors for the corresponding output port. In practice, half of the up-going packets in a local or regional router experience a delay of 20 cycles, and the other half face a 40-cycle hop latency when all the VCs are full.
Hence, with an injection of 1.335λ0, the minimum spacing between two consecutive packet arrivals at the VCs of the corresponding input port of a root router becomes

$$\gamma_{root} = \frac{1}{1.335\lambda_0} + 20 \times 3 \tag{47}$$

The injection tolerance of the proposed network becomes 0.71 under bit-complement traffic when the above traffic separation is employed; so, from Eq. (47), the minimum spacing works out to 61.05 cycles. Therefore, with efficient mapping of the communicating IPs across the layers, along with the proposed zone-based routing algorithm, heavy traffic like bit-complement does not create any performance bottleneck at the root routers. Fig. 23 demonstrates how the 3D traffic load from a root router is distributed across the VCs of the associated border router. As mentioned above, an input port of a router has four virtual channels; all the virtual channels of the four root routers of the BFT in context are shown. The green coloured VCs at each input port of a root router carry 2D traffic, whereas the grey and yellow ones are assigned to inter-layer traffic.
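The spacing bound of Eq. (47) (and of Eqs. (48)–(49) below, which reuse the same form for the border VCs) can be checked directly; the sketch is ours:

```python
# Minimum spacing between consecutive packet arrivals at a VC:
# 1 / (per-VC injection * lambda0) plus three 20-cycle hop services.

def min_spacing(vc_injection: float, lam0: float = 0.71) -> float:
    return 1.0 / (vc_injection * lam0) + 20 * 3

print(round(min_spacing(1.335), 2))  # root router, Eq. (47): 61.05 cycles
print(round(min_spacing(6.0), 2))    # border yellow VCs, Eq. (48): ~60.23
print(round(min_spacing(2.0), 2))    # border grey VCs, Eq. (49): ~60.70
```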


Fig. 22. A closer look at the VC level for the 2D bit-complement traffic load distribution by a root router.

Fig. 23. 3D bit-complement traffic load distribution by a border router.

The associated border router has 4 input channels, each of which connects a particular root router of the corresponding BFT. The VCs from the root level are mapped in a certain way to the VCs of the border router to balance the traffic load. For example, the red root router has 8 VCs that carry inter-layer traffic, of which the 4 grey ones are mapped to the 4 VCs at one input port of the corresponding border router, while the remaining 4 yellow VCs are mapped to the rightmost input port of the border router, to be transmitted through the DTDMA bus. The same policy applies to the orange and violet root routers of the BFT. All the VCs of the green root router are mapped to the rightmost input port of the border router. Therefore, each VC at the rightmost input port of the border router carries a self traffic of 4λ0. As DTDMA is bus communication, a whole packet is injected through the pillar node interface (shown in black). The simulation uses 32 bits as the flit size, so for a five-flit packet the DTDMA bus is required to transmit 160 bits at a time. The literature [1] says that a DTDMA bus of 512 bits operating at a 500 MHz frequency


Table 12
Configuration details of simulation infrastructure for various topologies.

| Topologies | No. of layers | Nodes per layer | Routing algorithm | No. of VCs per input port | VC depth (in flits) | Flit width (in bits) |
|---|---|---|---|---|---|---|
| 2D Mesh | 1 | 512 | Adaptive DOR | 4 | 16 | 32 |
| 3D Mesh | 2 | 256 | Adaptive DOR | 4 | 16 | 32 |
| 2D Torus | 1 | 512 | Adaptive DOR | 4 | 16 | 32 |
| 3D Torus | 2 | 256 | Adaptive DOR | 4 | 16 | 32 |
| 3D Butterfly | 2 | 256 | Destination Tag | 4 | 16 | 32 |
| 3D Flattened Butterfly | 2 | 256 | FBFLY-BYP | 4 | 16 | 32 |
| Classical BFT | 1 | 512 | As proposed in [42,63] | 4 | 16 | 32 |

yields an ideal throughput of 256 Gbps, which makes a bus transaction very short relative to a routing hop. Hence, the transmission, propagation, and reception of a packet do not take more than a routing cycle to reach the border router of the destination layer. Therefore, in Fig. 23, a yellow VC of the border router in context takes at most 5 cycles to create a space for a packet. Hence, at all the yellow VCs of the root router, the hop time for a packet becomes at most 20 cycles, which is no more than in the case of 2D load balancing by the root router. Clearly, the upward 3D traffic does not create any bottleneck in the border routers. For the downward traffic (packets that come to the BFT from its border router and root routers), a 5-flit packet takes 20 cycles to complete its hop, as 4 packets compete for a particular downward output port. As the network simulation is done with a VC depth (in flits) of 16 (refer Table 12), the 4 VCs at a downward input port can together accommodate at most 12 packets coming from the DTDMA bus. As each VC from the border to the root level faces a 20-cycle hop delay, it may seem that, with the current depth of the virtual channels, 4 VCs are not sufficient to cope with the ultra-fast propagation speed of the DTDMA bus. But in practice, as in Eq. (47), the minimum arrival-time spacing between two packets at the VCs of the border router A (in Fig. 23) can be determined as

$$\gamma_{border_{yellow}} = \frac{1}{6\lambda_0} + 20 \times 3 = \frac{1}{6 \times 0.71} + 60 = 60.24 \tag{48}$$

$$\gamma_{border_{grey}} = \frac{1}{2\lambda_0} + 20 \times 3 = \frac{1}{2 \times 0.71} + 60 = 60.71 \tag{49}$$
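As a quick numerical cross-check of Eqs. (47)–(49) and the DTDMA capacity argument, the figures above can be reproduced in a few lines; this is an illustrative sketch, not part of the original evaluation flow:

```python
# Minimum packet inter-arrival spacings (in cycles) with the observed
# bit-complement injection tolerance lambda_0 = 0.71.
lam0 = 0.71
hops = 20 * 3                               # three 20-cycle stages
gamma_root   = 1 / (1.335 * lam0) + hops    # Eq. (47): about 61 cycles (61.05 in the text)
gamma_yellow = 1 / (6 * lam0) + hops        # Eq. (48): about 60.2 cycles (60.24 in the text)
gamma_grey   = 1 / (2 * lam0) + hops        # Eq. (49): about 60.7 cycles (60.71 in the text)

# A 5-flit x 32-bit packet (160 bits) fits in one transaction of the
# 512-bit DTDMA bus, so an inter-layer transfer stays within one cycle.
assert 5 * 32 <= 512
print(f"{gamma_root:.2f} {gamma_yellow:.2f} {gamma_grey:.2f}")
```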

Packets from A under bit-complement traffic have to reach border router C of the other layer. As discussed in Section 4.2.3, in a load-balanced scenario, traffic from a border router is distributed across the neighbouring border routers in the source layer to avoid congestion in the destination layer. Traffic in the grey VCs of border router A is routed in this manner. At each of the border routers A, B, and C, half of the traffic is transmitted through their respective DTDMA pillars, and the other half goes to the appropriate neighbouring diagonal border router. Accumulating the self as well as the neighbouring traffic, the resultant injection load at each VC of a border router becomes as shown in Eq. (48) (6𝜆0 ) and Eq. (49) (2𝜆0 ). From Eq. (48) it is clear that the inter-arrival time of packets into the yellow VCs of router A guarantees that, within this window, the receiving border router at the destination layer, with 4 VCs, is capable of transmitting 12 packets (3 packets from each VC) to the root level. The above discussion on border- and root-level load balancing under bit-complement traffic supports the fact that the traffic separation policy taken by the routing algorithm, exploiting the design of the proposed network, alleviates the heavy traffic load coming from each BFT and avoids network congestion. As discussed above, packets going upward may take a maximum of 40 cycles under a significantly high injection at the local and regional levels when all the VCs are full. On the other hand,

from the root level, the incurred single-hop delay for a packet at every stage of the network is at most 20 cycles, regardless of whether it is an intra-layer or inter-layer communication. Following the structural mathematics, if a BFT is scaled classically [7], this upward hop latency eventually becomes significantly large, which affects the network performance. Experimentation on this point shows that, in comparison with the proposed ZBFT topology, the classical BFT network exhibits a drastic reduction in injection tolerance with considerable degradation of performance in terms of latency, throughput, and average router power consumption.

Bit-reverse traffic. The yellow IPs in Fig. 20 fall under the bit-reverse traffic pattern. From the association of the IPs in Fig. 20, it can be said that under bit-reverse injection the destination nodes of the 64 IP blocks of a BFT of the ZBFT network are distributed evenly across the four BFTs of a layer. For example, out of the 64 IPs of the 3rd BFT (refer to Fig. 7 for the numbering convention) on a layer, 32 IPs' destinations are evenly divided across BFTs 1 and 2. The destinations of the other 32 IPs are distributed between the 0th BFT and the 3rd BFT itself. Hence, following the same source-destination pair mapping discussed for the bit-complement pattern (Section 5.2.1.1), each IP sends half of its traffic to the destined IP in the same layer and routes the other half of its data to the IP that belongs to the same BFT but on the other layer. IP blocks whose destinations fall into the same BFT do not follow the above principle. In this way, 48𝜆0 traffic crosses the root routers, so each root router faces an upward traffic load of 12𝜆0 , which is less than under bit-complement injection as discussed in Section 5.2.1.1. Thus, no further load analysis is required.

Random permutation traffic. Here, the source address bits are given a randomized permutation to get the corresponding destination address. Hence, the effect of the above technique can make the source-destination mapping fall into any of the eight mentioned injection modes, either fully or partially. In any case, the traffic load on the root and border routers cannot exceed what is found in the bit-complement pattern, as that pattern imposes the heaviest traffic load on the zonal routers of a ZBFT. The effect of random permutation traffic on the performance of the ZBFT is reported in Section 5.2.3.

Shuffle traffic. As per the mathematics [11] behind the source-destination mapping, in this kind of traffic the address bits of a source are given a circular left shift to find the destination IP address. Though the effect on the overall traffic load of a BFT under this technique is similar to that of bit-reverse injection, the locations of the communicating node pairs are different. The green coloured IPs in Fig. 20 represent the node-pair associations in the shuffle traffic pattern.

Transpose traffic. The layer of IPs in Fig. 20 is treated as a 16 × 16 matrix in the transpose traffic pattern. As the word "Transpose" suggests, the row and column information of a source are interchanged to get the desired destination IP. The mapping function [11] is

$$d_i = s_{(i + \frac{b}{2}) \bmod b} \tag{50}$$

Eq. (50) governs the mapping of the ith address bit of the source s to the corresponding address bit of the destination d, where the bit width of the address is b. The above mapping function has exactly the same effect as interchanging the row and column of an IP on a layer, as sketched below.
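For reference, the address mappings behind these synthetic patterns, following the standard definitions in [11] and Eq. (50), can be written compactly; this is an illustrative sketch (function names are ours), with b the address width in bits and s the source address as an integer:

```python
def bit_complement(s, b):
    # destination is the bitwise complement of the source address
    return s ^ ((1 << b) - 1)

def bit_reverse(s, b):
    # d_i = s_(b-1-i): reverse the order of the address bits
    return int(format(s, f"0{b}b")[::-1], 2)

def shuffle(s, b):
    # circular left shift of the address bits by one position
    return ((s << 1) | (s >> (b - 1))) & ((1 << b) - 1)

def transpose(s, b):
    # Eq. (50): d_i = s_((i + b/2) mod b), i.e. swap the row half and
    # the column half of the address (b is assumed even)
    h = b // 2
    return ((s >> h) | (s << h)) & ((1 << b) - 1)
```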


Table 13
Summary of comparative improvements (%) of the proposed NoC over other topologies for different performance metrics across different network traffic patterns.

(a) Improvements (%) on network latency

| Topologies | Bit complement | Bit reversal | Neighbour | Random permutation | Shuffle | Transpose | Tornado | Uniform |
|---|---|---|---|---|---|---|---|---|
| 2D-Mesh | 76–88 | 73–84 | 44 | 70–89 | 67–81 | 73–80 | 76–90 | 72–90 |
| 3D-Mesh | 62–78 | 59–87 | 42–43 | 58–77 | 53–74 | 59–87 | 60–76 | 58–77 |
| 2D-Torus | 59–68 | 65–87 | 0–69 | 60–75 | 63–77 | 65–70 | 75–81 | 69–84 |
| 3D-Torus | 28–51 | 39–48 | 0–12 | 36–38 | 37–43 | 37–44 | 48–55 | 36–47 |
| 3D Butterfly | 18–42 | 29–30 | 18–42 | 25–43 | 27–29 | 21–42 | 18–42 | 21–28 |
| 3D Flattened Butterfly | 22–47 | 13–54 | NIL | 6–64 | 4–35 | 11–33 | 19–51 | 8–11 |
| Classical BFT | 12–43 | 18–45 | 3–8 | 14–42 | 14–39 | 11–38 | 18–49 | 8–32 |

(b) Improvements (%) on packet latency

| Topologies | Bit complement | Bit reversal | Neighbour | Random permutation | Shuffle | Transpose | Tornado | Uniform |
|---|---|---|---|---|---|---|---|---|
| 2D-Mesh | 76–88 | 73–85 | 44–93 | 70–91 | 67–81 | 73–80 | 76–98 | 72–91 |
| 3D-Mesh | 62–85 | 60–89 | 43–48 | 58–78 | 54–75 | 59–94 | 61–77 | 58–79 |
| 2D-Torus | 59–68 | 65–88 | 0–93 | 63–88 | 64–74 | 65–70 | 75–81 | 64–92 |
| 3D-Torus | 28–51 | 39–48 | 0–12 | 36–38 | 37–43 | 37–44 | 48–55 | 36–47 |
| 3D Butterfly | 18–95 | 29–30 | 18–95 | 25–87 | 27–29 | 21–86 | 18–95 | 27–32 |
| 3D Flattened Butterfly | 22–50 | 13–70 | NIL | 6–94 | 0–16 | 10–24 | 0–35 | 4–10 |
| Classical BFT | 12–43 | 18–45 | 3–8 | 14–42 | 14–39 | 11–38 | 18–49 | 8–32 |

(c) Improvements (%) on flit latency

| Topologies | Bit complement | Bit reversal | Neighbour | Random permutation | Shuffle | Transpose | Tornado | Uniform |
|---|---|---|---|---|---|---|---|---|
| 2D-Mesh | 69–82 | 62–77 | NIL | 58–86 | 50–69 | 62–70 | 68–86 | 61–85 |
| 3D-Mesh | 42–74 | 30–80 | NIL | 30–59 | 14–51 | 29–81 | 39–60 | 28–58 |
| 2D-Torus | 59–68 | 65–88 | 0–69 | 63–73 | 64–74 | 65–70 | 75–81 | 64–85 |
| 3D-Torus | 28–51 | 39–48 | 0–12 | 36–38 | 37–43 | 37–44 | 48–55 | 36–47 |
| 3D Butterfly | 18–42 | 29–30 | 18–42 | 25–42 | 27–29 | 21–42 | 18–42 | 27–28 |
| 3D Flattened Butterfly | 22–47 | 13–53 | NIL | 6–63 | 4–35 | 13–33 | 19–51 | 3–10 |
| Classical BFT | 12–41 | 18–41 | 4–11 | 14–39 | 14–32 | 11–35 | 18–38 | 8–32 |

(d) Improvements (%) on network throughput

| Topologies | Bit complement | Bit reversal | Neighbour | Random permutation | Shuffle | Transpose | Tornado | Uniform |
|---|---|---|---|---|---|---|---|---|
| 2D-Mesh | NIL | NIL | 0–28 | 0–84 | NIL | 0–33 | NIL | NIL |
| 3D-Mesh | 4–5 | 3–4 | 3–5 | 1–5 | 1–5 | 4–5 | 1–5 | 2–5 |
| 2D-Torus | NIL | 0–49 | 0–7 | NIL | 0–32 | 0–31 | 0–31 | NIL |
| 3D-Torus | 3–5 | 3–5 | 3–6 | 2–5 | 2–5 | 2–6 | 2–6 | 2–6 |
| 3D Butterfly | 3–6 | 0–3 | 3–6 | 0–3 | 0–3 | 2–6 | 3–6 | 0–3 |
| 3D Flattened Butterfly | 1–78 | 0.03–1.64 | 0.04–89 | 0.3–4.34 | 1.53–2.28 | 0.1–1.63 | 0.03–2.1 | 1.68–2.39 |
| Classical BFT | 4–11 | 3–16 | 3–18 | 4–22 | 7–22 | 4–13 | 3–8 | 3–14 |

(e) Improvements (%) on communication hop count

| Topologies | Bit complement | Bit reversal | Neighbour | Random permutation | Shuffle | Transpose | Tornado | Uniform |
|---|---|---|---|---|---|---|---|---|
| 2D-Mesh | 70 | 56–70 | NIL | 55–56 | 44 | 56–57 | 70 | 57 |
| 3D-Mesh | 43–44 | 31–32 | NIL | 30–31 | 11–14 | 29–31 | 40–41 | 29–31 |
| 2D-Torus | 43–44 | 52 | NIL | 50 | 43–44 | 52 | 66 | 50 |
| 3D-Torus | NIL | 14–15 | NIL | 11–12 | 12–13 | 12–15 | 28 | 11–12 |
| 3D Butterfly | NIL | 14–15 | NIL | 9–10 | 12 | NIL | NIL | 12 |
| 3D Flattened Butterfly | NIL | NIL | NIL | NIL | NIL | NIL | NIL | NIL |
| Classical BFT | 12–28 | 12–25 | NIL | 12–26 | 11–21 | 11–28 | 30 | 25 |

(f) Traffic injection tolerance

| Topologies | Bit complement | Bit reversal | Neighbour | Random permutation | Shuffle | Transpose | Tornado | Uniform |
|---|---|---|---|---|---|---|---|---|
| 2D-Mesh | 0.005 | 0.003 | 0.05 | 0.007 | 0.055 | 0.003 | 0.006 | 0.01 |
| 3D-Mesh | 0.01 | 0.008 | 0.015 | 0.01 | 0.01 | 0.008 | 0.01 | 0.015 |
| 2D-Torus | 0.07 | 0.04 | 0.35 | 0.07 | 0.04 | 0.04 | 0.04 | 0.04 |
| 3D-Torus | 0.14 | 0.07 | 0.29 | 0.07 | 0.07 | 0.07 | 0.07 | 0.22 |
| 3D Butterfly | 0.43 | 0.14 | 0.43 | 0.29 | 0.14 | 0.36 | 0.43 | 0.58 |
| 3D Flattened Butterfly | 0.22 | 0.43 | 0.43 | 0.51 | 0.43 | 0.22 | 0.43 | 0.85 |
| Classical BFT | 0.3 | 0.33 | 0.38 | 0.29 | 0.32 | 0.38 | 0.26 | 0.39 |
| 3D ZBFT | 0.72 | 0.72 | 0.71 | 0.72 | 0.71 | 0.73 | 0.71 | 0.73 |


The above mechanism, when applied on a layer of the ZBFT, yields the locations of the communicating node pairs in such a way that an IP of BFT 3 transmits packets to a certain IP in BFT 0 and vice versa, while traffic from the other two BFTs of the layer is confined within the respective trees. Hence, for the two diagonally opposite BFTs that generate inter-tree traffic, each IP sends half of its packets to its respective BFT on the same layer and transmits the other half of its data to the same BFT on the other layer. The same principle is followed by the IPs of BFT 0 and BFT 2 to alleviate the downward traffic load on the local and regional routers. As a result of the above technique, each root router of BFTs 0 and 2 faces an upward traffic load of 8𝜆0 because of the self-contained traffic. On the other hand, each root router of BFTs 1 and 3 channels an average traffic load of 16𝜆0 , which is the same as under the bit-complement pattern.

Neighbour traffic. The orange IPs in Fig. 20 represent the node-pair associations under neighbour traffic. Each IP on a layer communicates with an IP located on the row just above it. In the context of the ZBFT design, in each BFT only the IPs that fall on the border of the tree (according to the NoC floor shown in Fig. 20) send packets outside of the tree. So it is a 28𝜆0 traffic that crosses the root routers, which becomes a 7𝜆0 load on each root router. Exploiting the same traffic separation mechanism applied for the injection patterns discussed above, the actual upward 2D traffic load on each root router becomes 3.5𝜆0 , which is much less than in the traffic modes discussed up to this point.

Tornado traffic. The overall effect of this traffic injection is similar to that found in shuffle or bit-reverse mode; only the source-destination associations of the IPs (sky coloured) are different, as shown in Fig. 20.

Uniform traffic. This is the most benign traffic, which inherently distributes the traffic load of an IP block to all the other nodes in the network. The resultant effect in the context of the ZBFT leaves each root router carrying an upward injection load of 12𝜆0 , which becomes 6𝜆0 when the traffic separation is applied.

The above traffic load analysis of the proposed ZBFT design helps to draw conclusions on various performance parameters under the diverse injection patterns applied. In spite of heavily loaded injections, the structure of the ZBFT facilitates the routing algorithm to balance loads evenly across the layers by employing the traffic separation mechanism. Improvements in latency and throughput are reported in Section 5.2.2. As the overall effects of the bit-reverse, shuffle, and tornado traffic on the network load are quite similar, the performance improvements for them do not vary considerably. Improvements in the case of neighbour traffic are relatively smaller than the others because of the self-confined nature of the traffic in the context of the proposed ZBFT. With increasing load on a router, the average buffer waiting time of the packets during a router hop also increases, which in turn increases the power consumption of the routers. Being the most heavily loaded injection, bit-complement makes the routers of the ZBFT consume a relatively higher amount of energy than the rest. The peak average power consumption of the routers at the highest possible injection load across various traffic patterns is reported in Section 5.4.
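Summarizing the load analysis above, the per-root upward loads (in units of 𝜆0 ) and their values after the 2D/3D traffic separation can be recapped in a few lines; the uniform halving applied to every pattern below is an assumption carried over from the neighbour and uniform cases discussed in the text:

```python
# Upward traffic load per root router, in units of lambda_0, as derived above.
upward_load = {
    "bit-complement": 16, "bit-reverse": 12, "transpose (BFT 1, 3)": 16,
    "transpose (BFT 0, 2)": 8, "neighbour": 7, "uniform": 12,
}
# The 2D/3D traffic separation halves the upward 2D load on each root router,
# e.g. neighbour: 7 -> 3.5, uniform: 12 -> 6.
after_separation = {pattern: load / 2 for pattern, load in upward_load.items()}
```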
Based on the peak energy consumption, the thermal evaluation of the ZBFT is carried out.

5.2.2. Experiment details

The latest version of BookSim, a detailed cycle-accurate simulator, has been used in this work [62]. With detailed modeling of the router microarchitecture, inter-router channel delay, and flexible traffic models, the simulator efficiently reflects a state-of-the-art on-chip environment. Features like router stage pipelining and dynamic input buffer management support the implementation of the proposed routing architecture. The experiment is done with two consecutive layers of the proposed 3D design, as shown in Fig. 12. Each layer consists of four trees (as shown in Fig. 5) and comprises 256 IP blocks. Evaluation of network performance has been carried out on five parameters, namely network latency, packet latency, flit latency, throughput, and communication hop count. The obtained results are compared with mesh [11], torus [11], butterfly [11], and flattened butterfly [15] networks. The state-of-


the-art butterfly fat-tree [7,42] topology has also been compared with the proposed design in this context. Table 12 details the configurations of the different topologies used for the network simulation. As with the proposed design, simulations for the rest of the networks are done at the same network sizes. Both the 2D and 3D counterparts of mesh and torus are used for the experimentation, where Adaptive Dimension Ordered Routing is used. At this significantly large network size, the 2D butterfly network exhibits diminishing performance in comparison with its 3D counterpart because of its large number of long interconnects and the absence of path diversity. Therefore, the 2D butterfly is kept out of these performance comparisons. The three-dimensional convention [15] is adopted to scale the flattened butterfly to such a large extent, and a load-balanced routing algorithm over the bypass channels has been used for this topology [15]. The classical 2D BFT uses the routing technique [42] described in Section 4.2.8. Routers of the different networks do not have VCs at their output ports (adopted from [1]). Each router uses four VCs at each of its input ports; beyond four VCs at an input port of a router, the hop time increases because of the increase in various arbitration times, which in turn affects latency and throughput [37]. Comparative results on network latency and throughput are shown in Figs. 24 and 25, respectively. Detailed comparisons can be found in Table 13 across eight different traffic patterns.

5.2.3. Results and discussions

The most common traffic pattern used to evaluate a network is uniform random traffic. This traffic pattern is very benign because, for each source, it distributes the traffic uniformly over all the destinations, balancing the load inherently over all of the output channels of a router. A topology with poor path diversity and a routing algorithm with a poor load-balancing technique often turn out to look very efficient when tested under a uniform traffic load [11]. To stress a network and test the efficiency of its routing algorithm, permutation traffic is the best choice, where a source sends all of its packets to a single destination [11]. The proposed design is tested on some popular permutation traffic patterns like bit complement, bit reverse, shuffle, transpose, and random permutation. It is observed from the experimental results that the proposed design outperforms the rest of the topologies to which it is compared in each of those cases. Table 13(a), (b), and (c) show improvements in network latency, packet latency, and flit latency, respectively. Table 13(d) reports improvements in the overall throughput of the networks. Table 13(f) shows the traffic load-bearing capacities of the different topologies along with the proposed one. The proposed design has shown significant improvements in latency over the other topologies across almost all traffic patterns. Neighbour traffic is generated by exploiting the localities of a node, whose destination usually sits only a few hops away. Though flit latency appears not to improve in the case of the 2D and 3D mesh networks, both of them crash at comparatively very low injection loads (refer to Table 13(f)). Because of the bypass channel connectivity [15] in the flattened butterfly topology, a significant number of IP blocks fall within a single-hop locality of a particular source node.
That is why, in the case of neighbour traffic, no latency improvement is found over the flattened butterfly network. On the other hand, average network throughput improves by around 89% over the flattened butterfly in the case of neighbour traffic. The possible reason is that, in the proposed design, the density of neighbouring nodes is much lower than in the flattened butterfly, as a node's (IP block's) locality contains three other nodes at a single hop and four nodes at a two-hop distance. Because of this same neighbouring-node density, neighbour traffic injection shows no very significant improvement in latency over the classical BFT network, though throughput improves quite a bit because of the bandwidth doubling in the proposed design (described in Section 5.2). Evidently, across all the topologies, neighbour traffic shows no improvement in communication hop count. Routing of packets exploiting the bypass channels makes the average


Fig. 24. Average network latency comparison.

[Fig. 25 plots average throughput versus traffic injection rate in six panels: (a) Bit Complement, (b) Bit Reverse, (c) Neighbour, (d) Random Permutation, (e) Shuffle, and (f) Transpose; each panel compares 3D-mesh, 3D-torus, 3D-butterfly, 3D-flatfly, and 3D-ZBFT.]

Fig. 25. Average throughput comparison.

hop count in a flattened butterfly network lower than in the proposed design, so no improvements are found in this context. On the contrary, the 3D-ZBFT outperforms the flattened butterfly in latency and throughput across various traffic patterns because of the latter's high-radix routers and long channel lengths. The network diameter in terms of hop count is 12 for a classically (two-dimensionally) scaled BFT of 512 nodes, whereas, including bridge hopping (refer to Section 4.2.7),

the proposed design yields a diameter of 9 hops. In this context, the 3D-ZBFT network manages a performance gain of around 11% to 30%. With the 2D scaling of a BFT, as the number of levels increases, the downward blocking probability of flits increases as well. Moreover, the absence of proper bandwidth optimization leads to channel bottlenecks, especially for downward traffic. Because of the compartmentalization of 2D and 3D traffic along with zone-based load


Fig. 26. Average router power consumption comparison.

Table 14
Summary of comparative improvements (%) for power consumption across different network traffic patterns.

| Topologies | Bit complement | Bit reversal | Neighbour | Random permutation | Shuffle | Transpose | Tornado | Uniform |
|---|---|---|---|---|---|---|---|---|
| 2D-Mesh | 93–98 | 91–93 | 85–97 | 91–97 | 90–96 | 91–95 | 93–98 | 91–95 |
| 3D-Mesh | 58–91 | 49–86 | 39–87 | 50–89 | 43–86 | 49–86 | 57–91 | 49–91 |
| 2D-Torus | 77–88 | 77–83 | 68–75 | 77–89 | 77–82 | 77–83 | 87–89 | 77–89 |
| 3D-Torus | 4–25 | 5–31 | 0–1 | 5–29 | 5–30 | 4–31 | 7–42 | 5–33 |
| 3D Butterfly | NIL | NIL | NIL | NIL | NIL | NIL | NIL | NIL |
| 3D Flattened Butterfly | 77–92 | 88–92 | 74–92 | 87–92 | 86–92 | 88–92 | 90–92 | 88–92 |
| Classical BFT | 22–56 | 22–48 | 18–24 | 24–49 | 22–51 | 24–54 | 24–56 | 24–58 |

Table 15
Summary of comparative peak average power consumption (in mW) of the routers of the ZBFT and the classical BFT across different network traffic patterns.

| Topology | Bit Complement | Bit Reversal | Neighbour | Random Permutation | Shuffle | Transpose | Tornado | Uniform |
|---|---|---|---|---|---|---|---|---|
| ZBFT | 410.196 | 362.821 | 360.488 | 379.487 | 366.491 | 365.495 | 365.117 | 376.372 |
| Classical BFT | 418.196 | 401.225 | 410.429 | 396.401 | 395.491 | 398.495 | 410.416 | 389.272 |

balancing and bandwidth optimization, the proposed design achieves high traffic injection tolerance with sustainable latency and throughput across various traffic patterns. Though in the case of uniform traffic the flattened butterfly bears more injection load than the 3D-ZBFT, the actual traffic tolerance is established in Section 5.4, where the networks are analyzed in the context of thermal dissipation.

5.3. Power consumption

Suppose a router is up for a time period t. The power consumption of the router over this time t can be expressed as in Eq. (51):

$$P_{Router_t} = P_{Buffer_t} + P_{RC_t} + P_{VCA_t} + P_{SA_t} + P_{XBAR_t} \tag{51}$$

Here, $P_{Buffer_t}$ is the total energy spent to store flits in the various input VC buffers over time t. $P_{RC_t}$, $P_{VCA_t}$, $P_{SA_t}$, and $P_{XBAR_t}$ are the total power consumption of the Routing Computation, VC Arbitration, Switch Arbitration, and Crossbar Traversal processes, respectively, over the time period t. This is called the dynamic power consumption of an NoC router, measured over a period of time. It largely depends on the structure of the topology, the load distribution across the network, and the radix of the routers. Moreover, it also varies across the traffic patterns applied to the network. These factors increase the switching activity in the routers, which in turn increases their overall power consumption. The dynamic power consumption of a logic block is determined from Eq. (52):

$$P_{dynamic} = \alpha C V^2 f \tag{52}$$

where C is the cumulative capacitance of the critical path of the logic block, V is the driving voltage swing of the CMOS logic, f is the applied clock frequency, and 𝛼 is the switching factor over the time t as discussed above.
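As a small illustration of Eq. (52), the block-level dynamic power can be evaluated directly; the numbers below are hypothetical placeholders, not values from this evaluation:

```python
# Eq. (52): P_dynamic = alpha * C * V^2 * f, with assumed example values.
alpha = 0.3        # switching factor (assumed)
C = 1.2e-12        # critical-path capacitance in farads (assumed)
V = 0.9            # driving voltage swing in volts (assumed)
f = 1.0e9          # clock frequency in hertz (assumed)
P_dynamic = alpha * C * V ** 2 * f   # ~2.9e-4 W, i.e. about 0.29 mW
```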


Fig. 27. 3D-ZBFT lowest layer thermal map across various traffic patterns.

In the same way, the power consumption of the various interconnect wires (refer to Section 5.1) of the design has been measured after repeater insertion. Table 14 summarizes the average dynamic power consumption of the network routers for the different topologies under the tested traffic conditions. The average dynamic router power consumption is computed based on 32 nm technology. The simulator calculates the energy of a router including the energy of the links that connect to it. The low network diameter of the proposed design results in comparatively less router hopping, and the zone-based routing function with proper load balancing yields lower congestion in the network. High tolerance for the traffic injection rate indicates deferred saturation throughput in comparison with the various topologies. Improvements are observed over the mesh, torus, flattened butterfly, and classical BFT networks. The butterfly network shows improvements in power consumption, but it crashes at comparatively very low injection rates due to network congestion. Hence, though the routers of the 3D butterfly topology initially manage to consume less energy, with increasing injection rate the network quickly reaches its saturation point, where the average router power consumption becomes very high.

5.4. Energy dissipation

As discussed in the previous section, a high injection load increases switching activities inside a router, which in turn increases the average energy consumption of the routers in a network. The more energy the routers consume, the more heat they dissipate. There is always a tolerable thermal limit that a silicon layer can dissipate, based on the technology node that the chip manufacturer uses. The Intel® Itanium® Processor 9560, built on 32 nm technology, allows a maximum of 95 °C at the processor die [64], whereas the Intel® Core™ i7+8700 Processor, on 14 nm, can tolerate up to 100 °C [65]. Hence, to cope with a heavy traffic injection load, it is necessary to analyze the


Table 16
Summary of traffic injection tolerance (IR) of different networks across different network traffic patterns based on thermal dissipation.

| Traffic Patterns | 3D Mesh IR | Temp. (°C) | 3D Torus IR | Temp. (°C) | 3D Butterfly IR | Temp. (°C) | 3D Flattened Butterfly IR | Temp. (°C) | Classical BFT IR | Temp. (°C) | 3D ZBFT IR | Temp. (°C) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bit Complement | 0.01 | 45.2 | 0.14 | 45.13 | 0.29 | 46.03 | 0.07 | 46.02 | 0.21 | 48.81 | 0.71 | 46.03 |
| Bit Reverse | 0.008 | 45.11 | 0.07 | 45.07 | 0.14 | 45.22 | 0.07 | 45.91 | 0.25 | 47.02 | 0.73 | 45.91 |
| Neighbour | 0.015 | 45.16 | 0.29 | 45.16 | 0.29 | 46.93 | 0.14 | 46.5 | 0.2 | 47.03 | 0.71 | 46.03 |
| Random Permutation | 0.01 | 45.15 | 0.29 | 45.07 | 0.005 | 45.13 | 0.07 | 45.96 | 0.25 | 47.33 | 0.72 | 45.95 |
| Shuffle | 0.01 | 45.12 | 0.07 | 45.07 | 0.07 | 45.93 | 0.07 | 45.91 | 0.23 | 46.03 | 0.71 | 45.92 |
| Transpose | 0.008 | 45.12 | 0.07 | 45.07 | 0.22 | 45.91 | 0.07 | 45.91 | 0.26 | 47.14 | 0.73 | 45.91 |
| Tornado | 0.01 | 45.19 | 0.07 | 45.09 | 0.29 | 46.99 | 0.015 | 47.2 | 0.22 | 47.21 | 0.71 | 46.03 |
| Uniform | 0.015 | 45.22 | 0.22 | 45.2 | 0.29 | 45.96 | 0.07 | 46.02 | 0.31 | 46.51 | 0.73 | 45.94 |

thermal profile of the proposed design to ensure that no abnormal rise in temperature occurs that might lead to a thermal hot-spot condition [66]. Thermal simulation has been carried out using the HotSpot temperature modeling tool [66] to obtain the thermal profiles of the topologies, based on their peak values of average router power consumption. As discussed in Section 5.2.1, bit-complement traffic incurs the maximum energy overhead, whereas localized traffic like neighbour makes the routers consume the minimum energy because of the traffic separation applied to the ZBFT. For a significantly large network like 16 × 16, the classical BFT shows a relatively much higher energy overhead even in the case of neighbour traffic at a considerably low injection rate (refer to Table 13(f)) because of the cumulative increase of the traffic load on the routers at each stage of the tree. As the other topologies crash at very low injection rates in terms of energy dissipation (reported in Table 16), their peak router power consumptions do not carry much relevance in Table 15. In a three-dimensional environment, the stacking of NoC layers one upon another raises their temperatures. Therefore, besides the proposed design, all the 3D counterparts of the rest of the topologies are evaluated thermally. The lower-layer thermal profile is reported in each case, as the upper layer remains close to the heat sink and heat spreader [66]. Fig. 27 shows the thermal map of the lowest layer (refer to Fig. 12) resulting from router activities under the different traffic conditions of the proposed network. A layer is partitioned into a 64 × 64 grid, and the maps in Fig. 27 show the temperature variations across the grid. Table 16 summarizes the actual traffic injection tolerance of the various networks together with their lower layers' peak temperatures. The thermal condition of any region in the layer starts to enter the hot-spot zone at temperatures beyond 60 °C [66]. The 3D-ZBFT design manages to confine its peak temperature below 46.04 °C while tolerating a significantly high traffic injection rate (IR). The floorplan of the design, along with the 2D and 3D traffic load distribution, has contributed successfully in this context. The three-dimensional mesh and torus manifest no abnormal rise in their lower-layer temperatures. On the other hand, the 3D butterfly and flattened butterfly cannot withstand injection loads beyond a maximum of 0.29 and 0.07, respectively; beyond these injection limits, both networks reach a temperature of around 200 °C. Though the power consumption across various traffic patterns shown in Fig. 26 supports this fact for the flattened butterfly topology, it may seem conflicting for the butterfly because of the comparatively low energy consumption of its routers. The possible reason is the placement of the butterfly routers close to one another on a layer [11]. Hence, in spite of its lower power consumption, the butterfly network exhibits high thermal dissipation. Scaling a BFT in the classical manner increases the upward traffic load in such a way that its injection tolerance becomes significantly low, with considerably high power consumption by the routers, as shown in Table 16, thus bringing a much earlier saturation in comparison with the ZBFT. Though the thermal aspect is not the focus of this work, the observations in Table 16 confirm that no hot-spot situation occurs in the proposed network even under high traffic injection.
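The hot-spot criterion above reduces to a one-line check; a sketch against the 3D-ZBFT peak temperatures of Table 16:

```python
# Peak lower-layer temperatures (degC) of the 3D ZBFT, from Table 16.
zbft_peak = [46.03, 45.91, 46.03, 45.95, 45.92, 45.91, 46.03, 45.94]
HOTSPOT_THRESHOLD = 60.0                    # hot-spot zone begins here [66]
assert max(zbft_peak) < HOTSPOT_THRESHOLD   # no hot-spot at peak injection
```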

6. Conclusion and future directions

Degradation of network performance with increasing diameter has always been a major limiting factor in on-chip interconnection networks. Higher-order scaling of a topology often results in long interconnect wires, which significantly dominate the communication latency as well as the throughput. The proposed design bridges the gap in this performance-versus-scalability trade-off. A constant diameter with optimized wire lengths makes the design suitable for scaling as needed. The zone-based routing strategy simplifies the routing tasks for the different routers across the network. Moreover, 2D and 3D traffic partitioning distributes the loads over the various zonal routers, which facilitates maintaining network performance with sustainable energy consumption. The proposed ZBFT design alleviates the upward traffic load problem that occurs in the classical BFT, especially at significantly large network sizes. Experimentation on interconnect length using practical IP core instances and the thermal evaluation both confirm that the implementation of this highly scaled network is feasible, whereas no such solution exists for a generic BFT. The ZBFT is compatible with all kinds of tested traffic patterns and exhibits high injection tolerance. Exploiting the constant inter- and intra-layer communication hop property of the design, future work will focus on testing the design's fault-tolerance capability, with necessary enhancements to the design if required. Future experimentation will also involve mapping application threads onto the IP cores of different layers and testing their performance.

Declaration of Competing Interest

NA

References

[1] C. Nicopoulos, V. Narayanan, C. Das, Network-on-Chip Architectures: A Holistic Design Exploration, Lecture Notes in Electrical Engineering, 45, Springer, the Netherlands, 2009, doi:10.1007/978-90-481-3031-3. [2] D. Sylvester, K. Keutzer, A global wiring paradigm for deep submicron design, IEEE Trans. Comput.-Aid. Des. Integr. Circuits Syst. (TCAD) 19 (2) (2000) 242–252, doi:10.1109/43.828553. [3] K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, K. Chang, The case for a single-chip multiprocessor, SIGOPS Oper. Syst. Rev. 30 (5) (1996) 2–11, doi:10.1145/248208.237140. [4] J. Rabaey, A. Chandrakasan, B. Nikolić, Digital Integrated Circuits: A Design Perspective, Pearson Education India, 2017. [5] L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm, Computer 35 (1) (2002) 70–78, doi:10.1109/2.976921. [6] P. Magarshack, P.G. Paulin, System-on-chip beyond the nanometer wall, in: Proceedings of the Design Automation Conference (IEEE Cat. No.03CH37451), 2003, pp. 419–424, doi:10.1145/775832.775943. [7] C. Grecu, P.P. Pande, A. Ivanov, R. Saleh, A scalable communication-centric SoC interconnect architecture, in: Proceedings of the International Symposium on Signals, Circuits and Systems. SCS 2003. (Cat. No.03EX720), 2004, pp. 343–348, doi:10.1109/ISQED.2004.1283698. [8] R. Kumar, V. Zyuban, D. Tullsen, Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling, in: Proceedings of the 32nd International Symposium on Computer Architecture (ISCA '05), 2005, pp. 408–419, doi:10.1109/ISCA.2005.34.

[9] W.J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in: Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232), 2001, pp. 684–689, doi:10.1109/DAC.2001.156225. [10] S. Kumar, A. Jantsch, J. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, A. Hemani, A network on chip architecture and design methodology, in: Proceedings of the IEEE Computer Society Annual Symposium on VLSI. New Paradigms for VLSI Systems Design. ISVLSI 2002, 2002, pp. 117–124, doi:10.1109/ISVLSI.2002.1016885. [11] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003. [12] P. Ghosal, T.S. Das, SD2D: a novel routing architecture for network-on-chip, in: Proceedings of the Electronic System Design (ISED), 2012 International Symposium on, 2012, pp. 221–225, doi:10.1109/ISED.2012.68. [13] P. Ghosal, T.S. Das, L2STAR: a star type level-2 2D mesh architecture for NoC, in: Proceedings of the 2012 Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics, 2012, pp. 155–159, doi:10.1109/PrimeAsia.2012.6458645. [14] U. Ogras, R. Marculescu, "It's a small world after all": NoC performance optimization via long-range link insertion, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 14 (7) (2006) 693–706, doi:10.1109/TVLSI.2006.878263. [15] J. Kim, J. Balfour, W. Dally, Flattened butterfly topology for on-chip networks, in: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), 2007, pp. 172–182, doi:10.1109/MICRO.2007.29. [16] P. Ghosal, T.S. Das, Advances in Computing and Information Technology: Proceedings of the Second International Conference on Advances in Computing and Information Technology (ACITY) July 13-15, 2012, Chennai, India - Volume 3, Springer, Berlin, Heidelberg, pp. 667–676, doi:10.1007/978-3-642-31600-5_65. [17] J. Kim, D. Park, T. Theocharides, N. Vijaykrishnan, C.R. Das, A low latency router supporting adaptivity for on-chip interconnects, in: Proceedings of the 42nd Annual Design Automation Conference (DAC), in: DAC '05, ACM, New York, NY, USA, 2005, pp. 559–564, doi:10.1145/1065579.1065726. [18] J. Kim, C. Nicopoulos, D. Park, V. Narayanan, M. Yousif, C. Das, A gracefully degrading and energy-efficient modular router architecture for on-chip networks, in: Proceedings of the 33rd International Symposium on Computer Architecture (ISCA '06), 2006, pp. 4–15, doi:10.1109/ISCA.2006.6. [19] A. Kumar, L.-S. Peh, P. Kundu, N.K. Jha, Express virtual channels: towards the ideal interconnection fabric, in: Proceedings of the 34th Annual International Symposium on Computer Architecture, in: ISCA '07, ACM, New York, NY, USA, 2007, pp. 150–161, doi:10.1145/1250662.1250681. [20] R. Mullins, A. West, S. Moore, Low-latency virtual-channel routers for on-chip networks, in: Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA '04), 2004, pp. 188–197, doi:10.1109/ISCA.2004.1310774. [21] W. Dally, Express cubes: improving the performance of k-ary n-cube interconnection networks, IEEE Trans. Comput. 40 (9) (1991) 1016–1023, doi:10.1109/12.83652. [22] B. Feero, P. Pande, Networks-on-chip in a three-dimensional environment: a performance evaluation, IEEE Trans. Comput. 58 (1) (2009) 32–45, doi:10.1109/TC.2008.142. [23] J. Kim, C. Nicopoulos, D. Park, R. Das, Y. Xie, V. Narayanan, M.S. Yousif, C.R.
Das, A novel dimensionally-decomposed router for on-chip communication in 3D architectures, in: Proceedings of the 34th Annual International Symposium on Computer Architecture, in: ISCA '07, ACM, New York, NY, USA, 2007, pp. 138–149, doi:10.1145/1250662.1250680. [24] D. Park, S. Eachempati, R. Das, A. Mishra, Y. Xie, N. Vijaykrishnan, C. Das, MIRA: a multi-layered on-chip interconnect router architecture, in: Proceedings of the 35th International Symposium on Computer Architecture (ISCA '08), 2008, pp. 251–261, doi:10.1109/ISCA.2008.13. [25] R.B. Sørensen, L. Pezzarossa, M. Schoeberl, J. Sparsø, A resource-efficient network interface supporting low latency reconfiguration of virtual circuits in time-division multiplexing networks-on-chip, J. Syst. Archit. 74 (2017) 1–13, doi:10.1016/j.sysarc.2017.02.001. [26] R. Milfont, P. Cortez, A. Pinheiro, J. Ferreira, J. Silveira, R. Mota, C. Marcon, Analysis of routing algorithms generation for irregular NoC topologies, in: Proceedings of the 2017 18th IEEE Latin American Test Symposium (LATS), 2017, pp. 1–5, doi:10.1109/LATW.2017.7906768. [27] D.P. Sametriya, N.M. Vasavada, HC-CPSoC: hybrid cluster NoC topology for CPSoC, in: Proceedings of the 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), 2016, pp. 240–243, doi:10.1109/WiSPNET.2016.7566128. [28] X. Han, Y. Fu, J. Jiang, C. Wang, A subnetting mechanism with low cost deadlock-free design for irregular topologies in NoC-based manycore processors, in: Proceedings of the 2016 3rd International Conference on Information Science and Control Engineering (ICISCE), 2016, pp. 110–114, doi:10.1109/ICISCE.2016.34. [29] M.O. Agyeman, A. Ahmadinia, N. Bagherzadeh, Energy and performance-aware application mapping for inhomogeneous 3D networks-on-chip, J. Syst. Archit. 89 (2018) 103–117, doi:10.1016/j.sysarc.2018.08.002. [30] N. Kadri, M. Koudil, A survey on fault-tolerant application mapping techniques for network-on-chip, J. Syst. Archit. 92 (2019) 39–52, doi:10.1016/j.sysarc.2018.10.001. [31] N. Chatterjee, S. Paul, S. Chattopadhyay, Task mapping and scheduling for network-on-chip based multi-core platform with transient faults, J. Syst. Archit. 83 (2018) 34–56, doi:10.1016/j.sysarc.2018.01.002. [32] N. Chatterjee, S. Paul, P. Mukherjee, S. Chattopadhyay, Deadline and energy aware dynamic task mapping and scheduling for network-on-chip based multi-core platform, J. Syst. Archit. 74 (2017) 61–77, doi:10.1016/j.sysarc.2017.01.008.

[33] T.S. Das, P. Ghosal, Performance centric design of subnetwork-based diagonal mesh NoC, Int. J. Electron. 106 (7) (2019) 1008–1028, doi:10.1080/00207217.2019.1576231. [34] Z. Wang, H. Gu, Y. Chen, Y. Yang, K. Wang, 3D network-on-chip design for embedded ubiquitous computing systems, J. Syst. Archit. 76 (2017) 39–46, doi:10.1016/j.sysarc.2016.10.002. [35] R. Ho, K.W. Mai, M.A. Horowitz, The future of wires, Proc. IEEE 89 (4) (2001) 490–504, doi:10.1109/5.920580. [36] N. Deo, Graph Theory with Applications to Engineering and Computer Science (Prentice Hall Series in Automatic Computation), Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1974. [37] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach, The Morgan Kaufmann Series in Computer Architecture and Design, Morgan Kaufmann, 2003. [38] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, M. Kandemir, Design and management of 3D chip multiprocessors using network-in-memory, in: Proceedings of the 33rd International Symposium on Computer Architecture (ISCA 2006), 2006, pp. 130–141. [39] L.M. Ni, P.K. McKinley, A survey of wormhole routing techniques in direct networks, Computer 26 (2) (1993) 62–76, doi:10.1109/2.191995. [40] W.J. Dally, Virtual-channel flow control, IEEE Trans. Parallel Distrib. Syst. 3 (2) (1992) 194–205, doi:10.1109/71.127260. [41] L. Peh, W.J. Dally, A delay model and speculative architecture for pipelined routers, in: Proceedings of the HPCA Seventh International Symposium on High-Performance Computer Architecture, 2001, pp. 255–266, doi:10.1109/HPCA.2001.903268. [42] R.I. Greenberg, L. Guan, An improved analytical model for wormhole routed networks with application to butterfly fat-trees, in: Proceedings of the 1997 International Conference on Parallel Processing (Cat. No.97TB100162), 1997, pp. 44–48, doi:10.1109/ICPP.1997.622554. [43] D. Hodges, Analysis and Design of Digital Integrated Circuits, in Deep Submicron Technology (Special Indian Edition), McGraw-Hill Education (India) Pvt Limited, 2005. [44] W. Wolf, Modern VLSI Design: System-on-Chip Design, Pearson Education, 2002. [45] N.A. Sherwani, Algorithms for VLSI Physical Design Automation, third edition, Kluwer Academic Publishers, 2002, doi:10.1007/B116436. [46] Design and reuse. Available at: http://www.us.designreuse.com/sip/. [47] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, J. Zook, TILE64 processor: a 64-core SoC with mesh interconnect, in: Proceedings of the 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers, 2008, pp. 88–598, doi:10.1109/ISSCC.2008.4523070. [48] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina, C. Miao, J.F. Brown III, A. Agarwal, On-chip interconnection architecture of the tile processor, IEEE Micro 27 (5) (2007) 15–31, doi:10.1109/MM.2007.4378780. [49] S.R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, S. Borkar, An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS, IEEE J. Solid-State Circuits 43 (1) (2008) 29–41, doi:10.1109/JSSC.2007.910957. [50] Cortex-A9 Revision: r4p1 Technical Reference Manual, ARM DDI 0388I (ID091612), Issue I, June 15, 2012.
Available at: https://developer.arm.com/docs/ddi0388/i/preface. [51] P. Kongetira, K. Aingaran, K. Olukotun, Niagara: a 32-way multithreaded SPARC processor, IEEE Micro 25 (2) (2005) 21–29, doi:10.1109/MM.2005.35. [52] www.oracle.com/technetwork/server-storage/sun-sparc-enterprise/documentation/sparc-usersmanual-2516676.pdf. [53] https://ark.intel.com/products/series/125191/intel-xeon-scalable-processors. [54] S. Li, J.H. Ahn, R.D. Strong, J.B. Brockman, D.M. Tullsen, N.P. Jouppi, McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures, in: Proceedings of the 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009, pp. 469–480. [55] Transistor counts for various processors across technology nodes. Available at: https://en.wikipedia.org/wiki/Transistor_count. [56] B. Feero, P. Pande, Networks-on-chip in a three-dimensional environment: a performance evaluation, IEEE Trans. Comput. 58 (1) (2009) 32–45, doi:10.1109/TC.2008.142. [57] C. Grecu, P. Pande, A. Ivanov, R. Saleh, Timing analysis of network on chip architectures for MP-SoC platforms, Microelectron. J. 36 (2005) 833–845, doi:10.1016/j.mejo.2005.03.006. [58] I. Sutherland, B. Sproull, D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999. [59] A.B. Kahng, B. Lin, S. Nath, Explicit modeling of control and data for improved NoC router estimation, in: Proceedings of the DAC Design Automation Conference 2012, 2012, pp. 392–397, doi:10.1145/2228360.2228430. [60] Predictive technology model (PTM). Available at: http://ptm.asu.edu/. [61] AMD's overclock processors. Available at: https://www.amd.com. [62] N. Jiang, J. Balfour, D.U. Becker, B. Towles, W.J. Dally, G. Michelogiannakis, J. Kim, A detailed and flexible cycle-accurate network-on-chip simulator, in: Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2013, pp. 86–96, doi:10.1109/ISPASS.2013.6557149. [63] P.P. Pande, C. Grecu, A. Ivanov, R. Saleh, High-throughput switch-based interconnect for future SoCs, in: Proceedings of the 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications, 2003, pp. 304–310, doi:10.1109/IWSOC.2003.1213053.


[64] Intel Itanium Processor 9560 specifications. Available at: https://ark.intel.com/content/www/us/en/ark/products/71699/intel-itanium-processor-9560-32m-cache-2-53-ghz.html. [65] Intel Core i7+8700 Processor specifications. Available at: https://ark.intel.com/content/www/us/en/ark/products/140642/intel-core-i7-8700-processor-12m-cache-up-to-4-60-ghz-includes-intel-optane-memory-16gb.html. [66] R. Zhang, M.R. Stan, K. Skadron, HotSpot 6.0: validation, acceleration and extension, Technical Report CS-2015-04, 2015.

Avik Bose is currently associated with Indian Institute of Engineering Science and Technology, Shibpur, India as a Research Fellow, working on energy- and performance-centric design of 3D Networks-on-Chip. He received his Master's degree from BESU, Shibpur, India and his Bachelor's degree from WBUT, India.

Prasun Ghosal is currently an Associate Professor at Indian Institute of Engineering Science and Technology, Shibpur, India. He is a Raman Post Doctoral Fellow (Indo-US) and a Heidelberg Laureate Post Doctoral Fellow (Germany). His research is in performance-centric, power-aware nanoscale electronic system design and computing. He has contributed more than 105 research articles and 14 book chapters. He is a Young Scientist Research Awardee from ISCA, and a Best Paper Awardee at IEEE iNIS 2016, ICAEE 2014, ADCONS 2011, etc. He has served as VC, Executive Committee, IEEE CS TCVLSI; VC, Steering Committee, IEEE iNIS, etc.