Reconfigurable on-chip interconnection networks for high performance embedded SoC design




Journal Pre-proof

Reconfigurable On-Chip Interconnection Networks for High Performance Embedded SoC Design
Masoud Oveis-Gharan, Gul N. Khan

PII: S1383-7621(20)30005-9
DOI: https://doi.org/10.1016/j.sysarc.2020.101711
Reference: SYSARC 101711

To appear in: Journal of Systems Architecture

Received date: 6 March 2019
Revised date: 17 November 2019
Accepted date: 1 January 2020

Please cite this article as: Masoud Oveis-Gharan , Gul N. Khan , Reconfigurable On-Chip Interconnection Networks for High Performance Embedded SoC Design, Journal of Systems Architecture (2020), doi: https://doi.org/10.1016/j.sysarc.2020.101711

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier B.V.

Reconfigurable On-Chip Interconnection Networks for High Performance Embedded SoC Design

Masoud Oveis-Gharan and Gul N. Khan
Department of Electrical, Computer and Biomedical Engineering, Ryerson University, 350 Victoria Street, Toronto ON M5B2K3 Canada.

Author’s email address: [email protected]

ABSTRACT

System on Chip (SoC) based embedded devices provide key solutions to meet the demands of current and future high-performance embedded applications. These solutions become critical as SoC IC designs reach the limits of sub-nanometer technologies that cannot be shrunk further. Network on Chip (NoC) is a scalable communication system that can provide efficient solutions to the on-chip interconnection problems of SoCs, such as re-configurability for multiple embedded applications. Most of the reconfigurable NoCs presented in the past improve the performance of the SoC at the expense of higher power and additional hardware. In this paper, we present a novel high-performance reconfigurable NoC architecture that can improve performance with similar or improved power requirements of the system for different SoC applications. The proposed NoC architecture can also be considered as a hard IP for future partially configurable FPGA devices. Simulation and experimental results of our approach are compared with recent on-chip interconnection approaches, supporting our claims.

Keywords: Reconfigurable on-chip networks, NoC router, Router and switch layers, High-performance NoCs, Network architecture and design.


Introduction

The performance of embedded and other on-chip systems is continuously improving. These improvements have been made possible partly by the shrinking of fabrication technology, which allows more transistors to be accommodated in the same chip area. The trend has shifted towards integrating more (in number) but simpler processing cores on a chip to alleviate the power consumption of complex processors [1]. Multi- and many-core architectures have become part of the future generation of high-performance embedded computing platforms, and they have emerged as the main research and design focus of many semiconductor research centers and organizations. For example, 1000-core single-chip systems are now available and are expected to reach store shelves soon [2]. In parallel with these developments, the performance of on-chip interconnection structures is also improving, and high-performance scalable interconnection systems such as Networks-on-Chip (NoCs) have emerged as the backbone of on-chip communication for high-performance multi- and many-core systems. NoCs are being put forward as hard or mixed IP (Intellectual Property) cores for large FPGAs, requiring less than 1% of the FPGA area and executing 5-6 times faster than a soft NoC IP [3].

An important feature of NoC systems is the provision of scalability, which is suitable for complex SoCs. Traditionally, scalability is achieved through the architectural homogeneity of most conventional NoCs (CNoC). A CNoC is a generic NoC architecture consisting of identical nodes that are connected via links according to the NoC topology. For example, a conventional 4×4 mesh-topology NoC is shown in Figure 1(a). Due to the symmetric and homogeneous nature of such NoCs, they are considered scalable and suitable for most many-core SoCs and generic shared-memory multi-core CPU architectures [4]. The NoC nodes include the switching hardware that receives data packets from the source cores and forwards them to their destination cores. In a CNoC, NoC nodes consist of routers and communication links connecting these routers to form the NoC topology. An NoC router architecture with 5 inputs and 5 outputs is illustrated in Figure 1(b). Messages pass through the NoC in the form of packets. A packet consists of multiple flits, where a flit is the smallest unit of data that passes through NoC nodes in a pipelined style, one per clock cycle. The routing protocol is usually wormhole routing, where the header flit of a packet passes through a path to reserve a route for the packet. The route is kept reserved until all the flits of the packet pass through it and reach their destination core [5]. The conventional NoC architecture described above is not always adequate for application-oriented high-performance SoC systems. This inadequacy is due to problems related to the structure and routing organization of a CNoC.
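As a concrete illustration of the packet/flit terminology above, the following minimal Python sketch (the names and field layout are our own, not taken from the paper) models a packet as a header flit carrying route information, followed by body flits and a tail flit:

```python
from dataclasses import dataclass
from typing import List

# Flit kinds in wormhole routing: the header reserves the route,
# body flits carry payload, and the tail releases the route.
HEADER, BODY, TAIL = 0, 1, 2

@dataclass
class Flit:
    kind: int
    payload: object = None  # destination for HEADER, data for BODY/TAIL

def make_packet(dest: int, data: List[int]) -> List[Flit]:
    """Split a message into flits: one header, body flits, one tail."""
    flits = [Flit(HEADER, dest)]
    flits += [Flit(BODY, word) for word in data[:-1]]
    flits.append(Flit(TAIL, data[-1]))
    return flits

pkt = make_packet(dest=15, data=[10, 20, 30])
assert [f.kind for f in pkt] == [HEADER, BODY, BODY, TAIL]
```

In a real router the flits would also carry VC identifiers; this sketch only captures the header/body/tail framing used throughout the paper.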

Fig. 1. (a) Conventional 4×4 Mesh NoC with FPGA, I/O, Memory, RF-IP, CPU, GPU, DSP and DDR IP cores. (b) NoC node router with five input-output ports, VC buffers, arbiters and a crossbar.

The first problem with a CNoC is the lack of adequate structural flexibility: communication between neighboring nodes is always faster than communication between distant nodes for low-traffic applications. This is due to the pipelined nature of NoC communication, where distant nodes require more pipeline stages (i.e. clock cycles) than neighboring nodes. Application-oriented systems are designed to meet the needs of one or a few applications. For multi-application SoCs, the placement of cores is fixed; however, their communication targets differ according to the application being executed by the SoC. In other words, a core of a multi-application SoC must be able to communicate with various target cores according to the task or application being executed. Consider the NoC system given in Figure 1(a). If the CPU core is required to communicate directly with the Memory core in one application, it may also have to communicate with the RF-IP core for another application. Moreover, in a typical application-oriented SoC, not all cores need to provide high-speed communication. For example, the CPU core of Figure 1(a) may communicate ordinarily with the Memory core in one application but may have to perform high-speed communication with the same Memory core in another application. Therefore, for high-performance SoCs, the NoC must provide adequate flexibility so that its distant nodes can communicate, under varying traffic situations, with latency comparable to that of its neighboring nodes.


The second problem with the use of a CNoC for an application-based SoC is the most commonly used wormhole routing mechanism. When a packet passes through a route, the route is reserved until the packet's transmission is fully completed. Moreover, wormhole routing can also starve some cores. An ideal NoC may utilize a starvation-free mechanism involving round-robin based packet scheduling; however, it may still suffer from the NoC operating-speed problem. Consider a CNoC with round-robin based wormhole routing where two cores share the same route: the packet from one core must wait for the passage of the current packet, which can cause a large delay. To alleviate such delays in the wormhole routing mechanism, several methods such as adaptive routing [6], flow regulation, priority-based arbitration, and reconfigurable paths have been introduced. One of the commonly used methods for this problem is Virtual Channels (VCs). In VC-based wormhole routing, packets sharing a route are not delayed for a long time: the flits of different packets can share a route and pass one by one through different VCs. However, the latency problem of routing is not completely alleviated; passing more flits over a route increases the number of pipeline stages of the communication. Assume a 4-VC round-robin wormhole-routing CNoC where four packets request to pass through a shared route. In this case, each flit must wait for three flits of the other packets to pass, which is a delay. Another problem of VC-based routing is the high hardware cost of such an NoC. In this paper, we illustrate that by removing or scaling down the VC organization in exchange for reconfiguration, by adding a few switches to each node of the NoC, the NoC becomes an efficient reconfigurable platform for multiple-application-oriented high-performance SoCs.
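The round-robin VC interleaving delay described above can be sketched with a short simulation (a simplified model assuming one flit crosses the shared link per cycle and one VC per packet; the function name and parameters are illustrative, not from the paper):

```python
def roundrobin_link_schedule(num_packets: int, flits_per_packet: int):
    """Interleave the flits of num_packets packets over one shared link,
    one flit per cycle in round-robin order (one VC per packet).
    Returns, per packet, the cycle at which its last flit crosses."""
    finish = {}
    cycle = 0
    remaining = [flits_per_packet] * num_packets
    while any(remaining):
        for p in range(num_packets):
            if remaining[p]:
                remaining[p] -= 1
                if remaining[p] == 0:
                    finish[p] = cycle
                cycle += 1
    return finish

# Four 8-flit packets sharing a route through 4 VCs: every flit of a
# packet waits for three flits of the other packets in between.
done = roundrobin_link_schedule(4, 8)
assert done[0] == 28   # vs. cycle 7 if the packet had the link alone
```

The last flit of the slowest packet completes at cycle 31 instead of 7, showing how VC interleaving trades per-packet latency for fairness.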

Various NoC research groups have proposed reconfigurable NoCs as optimal and suitable interconnection systems for application-specific high-performance SoCs [7, 8, 9, 10, 11]. Most of them have added extra hardware, including hardwired connections, to a generic NoC to create reconfigurable NoC architectures; in effect, the hardware cost is increased to provide re-configurability. Our NoC design approach does add some extra components to the CNoC, but these components also make it possible to remove some of the routers, lowering the overall hardware cost of the NoC. Essentially, our reconfigurable NoC provides high flexibility while balancing hardware cost and communication speed. There are different models of FPGAs in terms of their underlying structures. Many FPGAs accommodate useful DSP and other IPs (hard or soft) as well as CPUs to extend the range of FPGA applications. Our proposed NoC architecture may inspire FPGA developers to accommodate reconfigurable NoCs in some future FPGA models; such FPGAs could then also implement multi- and many-core applications with faster communication. We believe that our proposed reconfigurable NoC architecture is the best candidate for a hard IP to be included in future partially configurable FPGAs [12].

The rest of the paper is organized as follows. Some significant past NoC architectures related to NoC reconfiguration are reviewed in Section 2. Our novel reconfigurable NoC architecture is described in Section 3. Then the reconfigurable NoC is evaluated in terms of hardware cost and performance, and results are compared with the other NoC architectures in Section 4. Finally, the conclusions are drawn in Section 5.

Overview and Related Research

In the NoC domain, there are two categories of reconfigurable NoC architectures, commonly known as homogeneous and heterogeneous. In this section, we discuss homogeneous reconfigurable architectures along with some significant past works, for two reasons. First, heterogeneous NoC architectures are not scalable and are only suitable for specific applications. Such NoC architectures vary according to the design constraints, protocols, or designer style; therefore, it is hard, and in some cases impossible, to evaluate and compare the pros and cons of heterogeneous NoCs proposed by one researcher or group against the others. The second reason is that our proposed reconfigurable NoC architecture is also of the homogeneous type.

One of the key past works on reconfigurable NoCs was proposed by Stuart, Stensgaard and Sparsø [7, 13]. They presented a reconfigurable NoC architecture (ReNoC) that enables the NoC topology to be reconfigured, and the architecture is not restricted to a simple 2D mesh. In their homogeneous NoC topology approach, an NoC node consists of a buffered router wrapped by an asymmetric Reconfigurable Switch (RS). By configuring these RSs appropriately, a wide range of different NoC topologies can be created. However, homogeneous ReNoC has some drawbacks. A 2D-mesh ReNoC is expensive compared to a CNoC whose nodes consist only of routers; in other words, their approach consumes more hardware and power in exchange for re-configurability. We will later show that our proposed NoC architecture is flexible enough to provide re-configurability without sacrificing power and hardware. Another drawback of the ReNoC architecture is that its RSs are statically configured and must be pre-set to enable static routes before the application is executed; ideally, a reconfigurable NoC should be dynamically reconfigurable at run time. The third drawback of ReNoC is the utilization of a switch along with a router for each ReNoC node. This limits ReNoC to a small set of applications and cannot satisfy some high-speed and large-scale applications. In contrast, the RSs and routers per node can be adjusted in our proposed NoC configuration to reduce the power and chip area of the SoC.

The ReNoC architecture has been utilized and developed by many researchers. Chen et al. proposed the Single-cycle Multi-hop Asynchronous Repeated Traversal (SMART) NoC architecture, which reconfigures a generic mesh topology for SoC applications at runtime [8]. They use additional combinational logic in the router that allows a packet to bypass routers all the way from the source to the destination core in a single clock cycle. It is claimed that their architecture is better than ReNoC, as it also avoids latching the flits at intermediate routers. However, the SMART scheme provides a bypass path for the input buffers, as compared to directly bypassing the router in the ReNoC approach. Moreover, SMART focuses on pushing latency down further by traversing multiple hops in a single cycle at high frequency. The SMART architecture was further developed and utilized in a fast NoC layer for implementing a large non-uniform cache architecture [9], where the NoC layer has a hierarchical network employing SMART links. The number of routers that can be bypassed in one clock cycle depends on the underlying IC technology and the clock cycle duration. It is observed that the integration of the control mechanism used in SMART restricts its clock frequency and adds router overhead. Chen and Jha have introduced a dedicated switch-link based NoC layer to transmit SMART-hop setup requests, which can save long wires but at the cost of an additional control NoC [10]. Our approach does not need any dedicated control structure, and it is simpler and more straightforward to employ during SoC design and application mapping. We have also observed that some of the drawbacks of ReNoC mentioned earlier also exist in the SMART approaches [8, 10].

Modarressi et al. have also built on the ReNoC architecture and presented a reconfigurable NoC architecture that can be configured into various application-specific topologies, as illustrated in Figure 2 [11]. The NoC nodes are composed of routers that are not directly connected to each other. Reconfigurable switches (RSs) are used to interconnect the NoC nodes, where an RS has no buffers, arbitration or routing logic. Their configurable NoC can implement regular and arbitrary NoC topologies if the RSs are set properly. They also introduce a reconfigurable structure that provides more flexibility by


increasing the number of RSs between two adjacent routers. The main drawback of their approach is that it uses extra hardware components, such as RSs, compared to a CNoC in exchange for flexible reconfiguration. Another drawback is the higher delay between two cores, equivalent to two routers and one RS, as compared to the minimum delay of two routers in a CNoC. Recently, the same ReNoC-based architecture was adapted to implement a 3D neural-network accelerator [14]. Sarbazi-Azad and Zomaya have generalized the architecture proposed by Modarressi and others [11, 15]. They explored different reconfigurable structures by altering the placement of routers and RSs [15]. Due to the direct connection of the cores to the routers, their scheme has the same drawbacks as those of the NoC architecture presented by Modarressi et al. [11]. Our novel reconfigurable NoC architecture avoids most of these drawbacks, as the SoC cores are connected to the RSs rather than the routers. This allows us to reduce the number of routers to balance the hardware cost of the additional RSs. Another advantage of our approach is the lower communication delay between two neighbouring IP cores: our approach results in a smaller delay (e.g. in AHRNoC2, shown in Fig. 6(a)) of two RSs and one router, as compared to the two routers and one RS used to interconnect two neighboring cores in the NoC architectures presented in the past [11, 15]. Moreover, if we relax the criterion of assigning one active port (router port) to communicate with an IP core, neighbouring cores can communicate via two reconfigurable switches, resulting in a delay of just two RSs. Any single-hop communication involving two neighbouring cores will not cause deadlock.

Fig. 2. ReNoC based reconfigurable NoC architecture: SoC cores, routers and reconfigurable switches (RSs).

Suvorov et al. utilized a newer version of the NoC architecture proposed by Modarressi and others [11] to improve fault tolerance in NoCs [16]. In their architecture, each RS as well as each router is connected to eight neighboring routers and RSs. In this way, their proposed NoC structure is dynamically reconfigurable, and its different parts can be configured independently. The interconnection graph formed by configuration may have a regular structure (such as a torus or binary tree) as well as an irregular or hybrid structure with regular and irregular zones. However, their NoC routers and RSs must have at least nine inputs and outputs, and there is an equal number of RSs and routers, which makes their NoC structure very expensive in terms of hardware cost. Most recently, a reconfigurable NoC architecture called RSPmesh (Router-Shared-Pair mesh) was proposed to realize fault-tolerant SoCs [17]. In the RSPmesh configurable NoC topology, neighbouring pairs of PE cores share their router pairs by using MUXs. The main role of NoC reconfiguration is to bypass any faulty routers or links and achieve a (smaller) mesh-based NoC that serves all the PE cores. In this paper, we discuss a new interconnection structure whose SoC cores are attached to the RSs, which allows fewer routers than RSs in a reconfigurable NoC. A multi-layer mesh NoC approach is presented by Möller et al. to improve the quality of service of SoCs [18]. While one mesh layer of routers is fixed in the system for control purposes, other data layers (networks of RSs) are configured at runtime to provide the data throughput required by the application. In their reconfigurable NoC approach, the route information of packets is dynamically sent through the router layer to a configuration controller unit. The configuration controller is also aware of the routing protocol, and it calculates the route as well as which input should be

connected to which output for each RS layer on the path between the source and destination cores. The configuration controller must also be connected to each RS node to avoid long critical-path delays for large NoCs. The main drawback of this approach is that its NoC architecture does not follow the Globally Asynchronous Locally Synchronous (GALS) design style [19], which is one of the critical design parameters of NoCs. In contrast, our proposed NoC architecture completely follows the GALS design style of NoCs.

Application Oriented Reconfigurable NoC

3.1 High-performance Reconfigurable NoC Architecture

The main objective of our NoC architecture is to provide a reconfigurable NoC with high flexibility in terms of topology and speed, with little increase in hardware cost. Our proposed reconfigurable NoC architecture can be easily scaled to any SoC; more importantly, it can scale up (also in three dimensions) to satisfy the constraints on hardware overhead and communication speed of the system. An abstract-level view of our Application-oriented High-performance Reconfigurable NoC (AHRNoC) architecture is shown in Figure 3. Our AHRNoC consists of a network of reconfigurable switches (S) and a shrinkable network of routers (R). The reconfigurable switch (re-switch) based RS-network receives messages from the SoC cores and passes them through its re-switches, and then through the router network, to implement the application currently being executed. Figure 4 illustrates a 4×4 2D-mesh example of our reconfigurable NoC architecture, consisting of two communication layers, i.e. an RS-network and a full router-network. These layers are inter-connected in such a way that the IP cores are connected to the router layer through the RS layer. The RS layer consists of a network of re-switches that can have different topologies, as shown in Figure 4 for a 4×4 2D-mesh NoC.

The router layer is a conventional router-based NoC interconnection that can be of different topologies and sizes. A 4×4 mesh router-network based interconnection is depicted in Figure 4. The routers are active components with infrastructure such as buffers and an arbiter, communicating in a pipelined manner between the source and sink cores [20]. The re-switches, on the other hand, are passive components that only connect their input ports to their output ports. They have no flit buffers, arbitration or routing logic; however, they have a control register to latch the routing information of a packet being transmitted. One of the main features of our NoC design is that each IP core has exclusive access to a router port through one or more re-switches. This feature ensures that our AHRNoC architecture remains a subset of the router-based NoCs.
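The contrast between passive re-switches and buffered routers can be sketched as follows (an illustrative Python model, not the authors' implementation; the port count and method names are assumptions):

```python
class ReSwitch:
    """Passive 5x5 reconfigurable switch: no flit buffers or arbitration,
    only a control register mapping each output port to an input port."""
    def __init__(self, ports: int = 5):
        self.ports = ports
        self.out_src = [None] * ports   # control register: output -> input

    def configure(self, in_port: int, out_port: int):
        self.out_src[out_port] = in_port

    def release(self, out_port: int):
        self.out_src[out_port] = None

    def propagate(self, inputs):
        """Combinational pass-through: outputs mirror the configured
        inputs within the same cycle (no pipeline stage, no buffering)."""
        return [inputs[src] if src is not None else 0
                for src in self.out_src]

sw = ReSwitch()
sw.configure(in_port=1, out_port=3)
outs = sw.propagate([0, 0xAB, 0, 0, 0])
assert outs[3] == 0xAB and outs[0] == 0
```

A router, in contrast, would buffer the incoming flit and arbitrate before forwarding, costing pipeline stages; the re-switch above forwards combinationally.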

Fig. 3. Our high-level reconfigurable SoC: SoC IP cores connected through reconfigurable switch networks to a reconfigurable router network.


Figure 5 shows a section of an NoC architecture where the IP5 core does not have dedicated access to a router port; such a configuration cannot be part of our NoC. Consider a scenario where IP5 intends to send a packet to IP0. If the switch-based route to IP0 is closed through the RS-network, and the S0 to S4 re-switches only connect their IPs to their respective routers, then IP5 cannot communicate with IP0, as it is unable to connect to an active component (router), which violates our AHRNoC approach. Our approach guarantees that if there is no route from a source to a destination core in the RS-layer, then there will be a route in the router layer via a router R5 (connected to S5, not shown in Figure 5). When a packet reaches the router layer, it benefits from all the protocols and facilities associated with a conventional NoC. Cores that require high-speed communication try to communicate through the RS-layer.


Fig. 4. AHRNoC1: IP, S and R represent cores, reconfigurable switches and routers.

A typical NoC router consists of input and output ports, an arbiter and a crossbar switch [5]. In the router micro-architecture presented in this paper, the input ports utilize buffers for the VC (Virtual Channel) organization, and the output ports are just data busses. The router layer employs wormhole routing-based communication, where VCs alleviate the congestion and deadlock problems of wormhole routing. VCs can be utilized in the structure of the routers to improve the performance of router-layer communication.

3.2 Reconfigurable Communication Protocol and Details

Multi-core SoCs designed for executing multiple applications may require configurable interconnections. An efficient way of routing in our proposed reconfigurable NoC is to place look-up tables in the switch and router modules. In this way, the switch and router modules can interconnect different links according to the application to be executed by the SoC. Routing in our proposed AHRNoC is like that of a conventional NoC, where the IP cores send data in the form of packets that consist of multiple flits. The header (first) flit contains the route information of a packet, and the tail (last) flit closes the routing of that packet. When an IP core sends the header flit of a packet to a switch module, the switch module connects the input port from the IP core to the relevant output port according to the destination information provided by the header flit. Assume a packet moves only in the RS-layer to reach its destination, which can be a router or an IP core. When the header flit enters each re-switch on its route, it is switched towards its destination. Communication in the RS-layer does not

need pipeline stages; it takes only one clock cycle for the header flit to reach its destination. As a result, communication in the RS-layer is faster, since no buffering or pipeline stages are involved in this layer.
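The single-cycle RS-layer traversal above implies a timing constraint: the combinational delay of the whole re-switch path must fit within one source-clock period. This can be expressed as a simple check (all delay figures below are placeholders for illustration, not measured values from the paper):

```python
def max_hops_in_one_cycle(clock_period_ps: int,
                          switch_delay_ps: int,
                          wire_delay_ps: int) -> int:
    """How many re-switches a header flit can traverse combinationally
    within one clock cycle, given per-switch and per-link delays.
    Integer picoseconds are used to avoid floating-point rounding."""
    per_hop = switch_delay_ps + wire_delay_ps
    return clock_period_ps // per_hop

# e.g. a 1 GHz core clock (1000 ps) with hypothetical 100 ps per
# re-switch and 150 ps per link segment:
hops = max_hops_in_one_cycle(1000, 100, 150)
assert hops == 4
```

If the configured RS-layer route exceeds this hop budget, either the clock must be slowed or the route must pass through the (pipelined) router layer.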

Fig. 5. IP core access to a port or a virtual port through the RS network.

However, a packet may also have to pass through the router layer. When a header flit reaches the input port of a router, it is stored in the input-port buffer. Then the input port issues a request signal to the arbiter (see Figure 6(b)). The arbiter performs arbitration among the flits of the input ports that are requesting access to the crossbar and other shared resources [20]. When a flit wins the arbitration, it is routed through the crossbar switch and exits the router. This form of communication in a router needs several pipeline stages [5]. For routers utilizing VCs, the structure of the router input port becomes more complex. The crossbar switch can be configured to connect any input buffer of the router to any output port (channel), with the constraint that an input port is only connected to one output port. The micro-architecture of the crossbar switch is illustrated as a multiplexer-based unit in Figure 6(b). One can notice that the structure of a crossbar is similar to a switch structure. The organization of a router arbiter is straightforward, but it becomes complex for a VC-based router.
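The arbitration step described above can be sketched as a starvation-free round-robin arbiter (an illustrative behavioral model; the paper's arbiter microarchitecture may differ):

```python
class RoundRobinArbiter:
    """Round-robin arbiter: grants one requesting input port per cycle,
    starting the search just after the last winner, so no port starves."""
    def __init__(self, n: int):
        self.n = n
        self.last = n - 1  # so the very first search starts at port 0

    def arbitrate(self, requests):
        """requests: list of booleans, one per input port.
        Returns the granted port index, or None if no port requests."""
        for offset in range(1, self.n + 1):
            port = (self.last + offset) % self.n
            if requests[port]:
                self.last = port
                return port
        return None

arb = RoundRobinArbiter(5)
assert arb.arbitrate([True, False, True, False, False]) == 0
assert arb.arbitrate([True, False, True, False, False]) == 2
assert arb.arbitrate([True, False, True, False, False]) == 0
```

With persistent requests on ports 0 and 2, the grant alternates between them, illustrating the fairness property; a VC-based router would run one such arbiter per shared resource stage.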


Fig. 6. 4×4 Mesh AHRNoC2 with optimal number of routers: (a) AHRNoC2; (b) VC-based router.

Multiple-application execution and the level of configurability of our AHRNoC architecture are described here. In AHRNoC, data communication is in the form of packets having multiple flits. The header flit carries the routing information, and the tail flit carries an end-of-packet flag. When a header flit reaches a re-switch, the route information is latched in its control buffers. The re-switch connects its inputs to its output ports according to the routing information to pass the header flit. After the passage of a packet's header flit through the re-switches on the route to its destination (an IP core or a router), the re-switches keep their input-output connection states until the packet's tail flit passes through them. The switching of all the re-switches on the route of the packet takes place in one clock cycle when the maximum delay of the RS-layer is less than the clock cycle of the IP cores. As the tail flit of the packet passes through the series of re-switches, it resets and prepares the re-switches to service the next packet.

The timing flow of passing a packet through the RS-layer is explained here. A common way of communicating across different clock domains is to transfer data through a FIFO buffer, where the write to the FIFO happens in one clock domain and the read from the FIFO occurs in another clock domain.

- Consider Figure 4 and assume Source-0 intends to send a packet to Destination-15 through the RS-layer.
- Source-0 sends the header flit to the RS-layer based on its own clock domain at clk0.
- The control unit of each re-switch compares the routing information in the header flit with the routing information already stored in its look-up table.
- The result of this comparison sets the direction of the packet inside the re-switch.
- Through this mechanism, the re-switches convey the header flit to the input FIFO of Destination-15 during clk0.
- At the second clock (clk1), the header flit is stored in the FIFO. It is assumed that the delay from Source-0 to Destination-15 is less than the period of the Source-0 clock.

Destination-15 reads the incoming flits in its own clock domain, when its FIFO signals that data is available. With a deep FIFO (depth > 2) buffer and a destination-clock frequency less than or equal to the source-clock frequency, Source-0 can send its flits at its own clock rate and Destination-15 can read the incoming flits at its own clock rate.

We now explain and discuss some important packet routing processes in the RS layer. Consider Figure 7, which is part of the RS layer of Figure 4, and assume Source-0 wants to send a packet (of 32 flits) to Destination-6 and then a new packet (of 64 flits) to Destination-12. The Latches modules in Figure 7 contain the latches and some logic to manage them. Each module has five Header&Tail inputs (HT1, …, HT5) and five Switch-State inputs (Lin1, …, Lin5) that are associated with the five RS inputs (In1, …, In5) respectively. It also has five outputs (Lout1, …, Lout5) that drive the multiplexers connected to the five RS outputs (Out1, …, Out5) respectively. The following steps describe the processes for the Re-Switches (RSs) illustrated in Figure 7.

Step 1: At Time0, all RSs are in the reset state, with zero values at their outputs.

Step 2: At Time1 (a clock later), Source-0 starts to send a packet (of 32 flits) to Core-6 by issuing the header flit. In RS0, In1 at this time carries an address (Address1) and Header&Tail = 1. HT1 of the Latches module is connected to Header&Tail and takes the value 1. Address1 passes through the look-up table and causes Lin1 to become 3. Since HT1 = 1 and Lin1 = 3, Lout3 becomes 1. In this way, the header flit passes through RS0 from In1 to Out3 and goes to In5 of RS1. The same process occurs in RS1, RS5 and RS6. In fact, the header flit causes HT5, HT2 and HT5 to become 1 in RS1, RS5 and RS6 respectively, and the values 5, 2 and 5 appear on Lout4, Lout3 and Lout1 of RS1, RS5 and RS6 respectively. Now Source-0 is connected to Core-6 through RS0, RS1, RS5 and RS6.

Step 3: From Time2 to Time31, H&T is 2, which leaves Lout3, Lout4, Lout3 and Lout1 of RS0, RS1, RS5 and RS6 unchanged. All 30 data flits reach Core-6.

Step 4: At Time32, the tail flit arrives, i.e. Header&Tail = 3. In this condition, Lout3, Lout4, Lout3 and Lout1 of RS0, RS1, RS5 and RS6 do not change. The tail flit therefore reaches Core-6 and terminates the reception of the packet flits by Core-6.

Step 5: At Time33, if Source-0 does not want to send a new packet to Core-6, it resets RS0, RS1, RS5 and RS6 by sending zero to In1. For reset, the value of Lout selects the H&T input that can reset it. In RS0, if Lout3 = 1 and HT1 = 0, then Lout3 is reset. In fact, the value 1 of Lout3 selects HT1 for reset, and only the value 0 of HT1 resets Lout3. As illustrated in Figure 7, the first inputs of all the output multiplexers are grounded. Therefore, when Lout3 = 0, Out3 of RS0 becomes zero. By a similar process, Out4, Out3 and Out1 of RS1, RS5 and RS6 are reset respectively.

Step 6: Suppose at Time40, Source-0 starts to send a packet (of 64 flits) to Core-12, i.e. the header flit is issued. In RS0, In1 at this time carries an address (Address2) and Header&Tail = 1. Since HT1 = 1 and Lin1 = 4, Lout4 becomes 1. The header flit passes through RS0 from In1 to Out4 and goes to In1 of RS4. The same process occurs in RS4, RS8 and RS12. Now Source-0 is connected to Core-12 through RS0, RS4, RS8 and RS12.

Step 7: From Time41 to Time102, H&T is 2, which leaves Lout4, Lout4, Lout4 and Lout1 of RS0, RS4, RS8 and RS12 unchanged. All 62 data flits reach Core-12.

Step 8: At Time103, the tail flit arrives, i.e. Header&Tail = 3. Lout4, Lout4, Lout4 and Lout1 of RS0, RS4, RS8 and RS12 do not change. The tail flit therefore reaches Core-12 and terminates its reception of flits.

Step 9: At Time104, if Source-0 does not want to send a new packet to Core-12, it resets RS0, RS4, RS8 and RS12 by sending zero to In1. In RS0, if Lout4 = 1 and HT1 = 0, then Lout4 becomes zero. When Lout4 = 0, Out4 becomes zero. By a similar process, RS4, RS8 and RS12 are reset.

As discussed before, the re-switch control unit consists of latches, a look-up table and some logic gates. The logic gates detect the header and tail flits, compare the flit routing information with the look-up table, issue the re-switch state and reset the re-switch. The latches keep the re-switch state during the passage of a packet. The look-up table can be storage in the form of flash memory (the same as used in FPGAs), EEPROM or simple latches. The routing paths of an application, or a group of applications, should be stored in these memories in advance; these applications can then run on the AHRNoC. For multiple applications, different solutions can be considered for an AHRNoC based SoC. For example, assume a scenario in which application App1 is running over an AHRNoC, and application App2 wants to start execution along with App1. There are three solutions for such a scenario. Firstly, App2 starts execution in the router layer


immediately. Secondly, if there are no shared routes or destinations, the routing path of App2 is sent to the re-switches and only then can App2 start execution. Lastly, App1 stops sending new packets and the AHRNoC waits for all the packets associated with App1 to reach their destinations. Then the routing information of both App1 and App2 is passed on to the re-switches, after which both applications (App1 and App2) can start execution.
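The three options can be summarized as a tiny dispatcher. This is a hypothetical helper for illustration, not part of the AHRNoC design; the two boolean inputs are assumptions abstracting the conditions described above:

```python
# Choose how App2 may start while App1 is running (per the three solutions).
def start_strategy(app2_uses_rs_layer: bool, routes_shared_with_app1: bool) -> str:
    if not app2_uses_rs_layer:
        # Solution 1: App2 only needs the router layer, nothing to reconfigure.
        return "start App2 in the router layer immediately"
    if not routes_shared_with_app1:
        # Solution 2: disjoint routes/destinations, load App2's paths first.
        return "load App2 routes into the re-switches, then start App2"
    # Solution 3: drain App1's in-flight packets, then reload both route sets.
    return "drain App1 packets, reload routes for App1+App2, then start both"
```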

Fig. 7. Block diagram of the routes between Source-0 and Core-6 and Core-12 in the RS layer.

At the start of a new application, all the re-switches are in their reset state, ready to be set by the new routing information. To show the level of re-configurability of an AHRNoC based SoC, assume multiple applications are ready to be loaded for execution at different time intervals. Any one of the IP cores (e.g. Core-n) can be made responsible for controlling the switching of multiple applications. Core-n sends a command (packet) to all the source IP cores through the router layer. After receiving this command packet, all the IP cores stop sending packets related to the current application. To ensure that there are no more packets associated with the current application, the NoC can wait for a while. At the end of this waiting period, the AHRNoC is assured that all the re-switches are reset and ready for packet communication for the next application. The next application can then be loaded on the SoC IP cores for execution.

3.3. High-Speed, Low-Cost and Flexible AHRNoC

We highlight two advantages of our proposed AHRNoC architecture. Firstly, the structure of a re-switch is much smaller than that of a router. Therefore, NoC designers can create a NoC that is optimal in terms of hardware by employing more switch modules and fewer routers. For instance, in the 4×4 2D-mesh AHRNoC2 shown in Figure 6, only four routers are used. The input-ports of these routers are adequate for accommodating 16 IP cores in case they need to communicate through the router layer. In other words, our AHRNoC architecture is flexible enough to be implemented at a much lower cost. The second feature is the faster communication in the RS-layer as compared to the router layer. This feature makes our AHRNoC an ideal communication engine for high-speed SoCs. Moreover, we can increase the number of RS-layers to provide faster interconnection routes for IP cores. Figure 8 shows a reconfigurable NoC with two RS-layers and one router layer. More IP cores can communicate through multiple RS-layers. In fact, more cores will have access through RS-layers, which are much faster than the router layer. A number of advantages associated with our AHRNoC architectures are highlighted below.

 The main features of the AHRNoC architecture, homogeneity and scalability, make it suitable for a variety of applications.

 AHRNoC provides high-speed interconnection for high-performance SoCs. SoC cores can communicate via the fast RS-network, which can be further enhanced by adding more RS-layers to satisfy speed related constraints of the application.

 The flexibility of AHRNoC allows a trade-off between performance and hardware overhead. The expensive router-layer is scaled down to fewer routers, and it can be further shrunk for complex routers: the shrinkage can take the form of fewer VCs per input-port as well as a smaller buffer size per VC. We will illustrate in the experimental results that the hardware required by the RS-layer is a small fraction of that of the router-layer. Therefore, lowering the router-layer size in exchange for an increased number of RS-layers satisfies both the performance and hardware requirements of the NoC.

 The AHRNoC architecture is scalable, and it is also separable in terms of hardware cost and speed. The RS-layer has more impact on the NoC communication speed, whereas the router-layer drives the hardware cost. Therefore, a designer can easily identify and balance higher speed against lower silicon cost.

 The AHRNoC can provide a minimum delay of two re-switches for communication if we relax the condition of communication via an active port (router) for neighbouring IP cores. For example, assume IP0 and IP1 of Figure 5 are required to communicate in an application, and their routing is mapped via the RS layer; then a minimum of two re-switches, S0 and S1, are involved in the communication. However, the routing path may be different for another application, and IP0 and IP1 may not be able to communicate through the RS layer alone (e.g. IP1 also receives packets from some other IP cores). Therefore, IP0 and IP1 may have to use both the switch and router layers for communication. In such a case, S0, R0, R1 and S1 are involved in the communication. In this way, the minimum delay of the AHRNoC differs with the application being executed.

 The AHRNoC architecture is also deadlock-free. Communication in the RS-layer is based on the routing paths that are associated with each application and developed by the designer (or software embedded in the SoC) during setup. In this way, multiple (including newer) applications can be loaded onto an AHRNoC based SoC. The designer is responsible for


taking care of a deadlock-free RS-layer. However, any communication in the router layer is based on its routers along with the routing mechanism and algorithm. Therefore, when the routing algorithm ensures a deadlock-free router-layer, AHRNoC communication involving the router-layer is also deadlock-free. In our experimental results, we have employed the XY routing algorithm, which is a deadlock-free mechanism.

 The main drawback of our AHRNoC is its higher interconnection wiring cost.

Fig. 8. AHRNoC3 with two RS-layers and one router-layer.

4. AHRNoC Evaluation and Experimental Results

SoCs are growing fast in terms of accommodating many IP cores and consequently implementing many applications on a single chip. To satisfy the communication needs of many-core SoCs, designers prefer to employ NoCs. The most suitable platforms in terms of area, power and speed for many-core SoC architectures are FPGAs and ASICs. FPGAs are advancing quickly and compete with ASICs, especially when price or time-to-market is considered. Moreover, current FPGAs consume less area and power along with comparable speed to ASICs. Our proposed AHRNoC architecture can be implemented on recent FPGAs, and will benefit even more from future (partially configurable) devices. In this section, our AHRNoC architecture is analyzed and evaluated in terms of its hardware characteristics and performance metrics. Various versions of the AHRNoC architecture are evaluated and compared with Conventional NoCs (CNoCs) and a past reconfigurable NoC architecture [11]. The results are presented for selected 2D-mesh based CNoC architectures, where each node consists of a router as illustrated in Figure 1(a) for a mesh topology. Identical routers are used for all CNoCs in terms of input-ports, VC buffer organization, arbiters and crossbar switch. We demonstrate the micro-architecture of our proposed NoC by modeling it in Verilog with ModelSim, and the hardware characteristics are obtained by using Synopsys Design Compiler. This indicates that our Verilog-coded NoC can also be easily implemented on any FPGA platform.

4.1. Hardware Characteristics

To investigate the hardware characteristics of the AHRNoC architecture and compare it with CNoC and other reconfigurable approaches, we consider the NoCs presented in Figures 1, 2, 4, 6, 8 and 9. These NoCs are synthesized in terms of power consumption, chip area and critical path delay. Their characteristics are determined by implementing their node components (switches and routers) in System-Verilog and then estimating their hardware parameters with Synopsys Design Compiler. A recent 15nm Nangate ASIC library is used for this evaluation [21]. A global operating voltage of 0.8V and a time-period of 1 ns (1 GHz) are applied for all the components evaluated. Table 1 provides the hardware characteristics of the different node components of the NoCs given in Figures 1, 2, 4, 6, 8 and 9. The chip area, power and delay parameters presented within brackets relate to active re-switches. The only difference among the re-switch components is the number of ports; for example, a 5-port switch node has 5 input and 5 output ports as shown earlier in Figure 6(a). The structure of the routers varies in the number of ports or the number of VCs utilized in their input-ports. For instance, Figure 1(b) shows a 5-port 3-VC router, and the structure of a 6-port router without any VC (no-VC) is illustrated in Figure 6(b). The input-ports of the routers are set up to utilize zero, 2, 3 or 4 VCs and 8 buffer slots, where a buffer slot accommodates a 16-bit flit (Figure 6b). For the sake of simplicity, the last four rows of Table 1 represent the average ratios of router characteristics against the average of the re-switch components; 'Average 4-VC routers/re-switches' means the average characteristic of a 4-VC router as compared to a reconfigurable switch.


Fig. 9. AHRNoC4 with two RS-layers and an optimally shrunk router layer.

It can be observed that the ratios of the average characteristics (area, power and critical path delay) of a no-VC router versus a re-switch component are almost 7, 8 and 9 respectively. Architectural and other details of active re-switches are discussed in Sections 4.2 and 4.3. A re-switch consumes about 1/7th the hardware of a no-VC router because a re-switch consists of only a few multiplexers and a control circuit, as shown in Figure 6(a). Its control unit needs only a few logic gates and registers to keep the routing look-up table and to switch the inputs to the outputs according to the routing information of a packet. Our System-Verilog designs indicate that the chip areas of the multiplexers and the control unit of a re-switch are almost equal. In the case of a NoC router, the chip area of the crossbar switch is much lower than that of its other components (such as the arbiter and input-ports). For example, the area of the crossbar switch of a 6-port router is around 1/13th of the total silicon area of the router.

TABLE 1. NoC Components Hardware Characteristics (ASIC design, 15nm NanGate Library)

| Node Component | Area (µm²) | Power* (mW) | Critical path (ps) |
|---|---|---|---|
| 6-port re-switch (active) | 712 (857) | 0.21 (0.22) | 17 (28) |
| 5-port re-switch (active) | 515 (636) | 0.15 (0.16) | 16 (27) |
| 4-port re-switch (active) | 342 (439) | 0.12 (0.13) | 16 (27) |
| 3-port re-switch (active) | 172 (244) | 0.08 (0.09) | 16 (27) |
| 5-port 4-VC router | 4660 | 1.89 | 585 |
| 4-port 4-VC router | 3599 | 1.49 | 496 |
| 3-port 4-VC router | 2413 | 1.11 | 491 |
| 5-port 3-VC router | 3904 | 1.49 | 491 |
| 4-port 3-VC router | 3012 | 1.18 | 486 |
| 3-port 3-VC router | 2239 | 0.89 | 427 |
| 5-port 2-VC router | 3402 | 1.26 | 485 |
| 4-port 2-VC router | 2696 | 1.01 | 439 |
| 3-port 2-VC router | 2001 | 0.77 | 344 |
| 6-port no-VC router | 3832 | 1.34 | 183 |
| 5-port no-VC router | 3161 | 1.12 | 168 |
| 4-port no-VC router | 2492 | 0.90 | 139 |
| 3-port no-VC router | 1848 | 0.68 | 104 |
| Average 4-VC routers/re-switches (active) | 10 (8) | 13 (12) | 33 (19) |
| Average 3-VC routers/re-switches (active) | 9 (7) | 10 (9) | 29 (17) |
| Average 2-VC routers/re-switches (active) | 8 (6) | 9 (8) | 26 (16) |
| Average no-VC routers/re-switches (active) | 7 (5) | 8 (7) | 9 (5) |

* Total of Dynamic and Static Power. Values in brackets are for active re-switches.


Another important point observed from the synthesis results in Table 1 is the impact of utilizing VCs in the routers. The average ratios of area, power and critical path delay of a 4-VC router versus a re-switch are almost 10, 13 and 33 respectively. We have observed that the average area of 4-VC routers is 10 times larger, and their average speed 33 times lower, than those of a re-switch. A 4-VC router is thus equivalent to a no-VC router plus three re-switches in terms of area, a no-VC router plus five re-switches in terms of power, and a no-VC router plus 24 re-switches in terms of speed. The extra hardware of a 4-VC router is due to the VC mechanism, which affects the structures of the arbiter and the input-ports shown in Figures 1(b) and 6(b).
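A quick arithmetic check of these equivalences, using the 5-port rows of Table 1 (the choice of the 5-port figures is ours; the equivalences hold only approximately):

```python
# "4-VC router ~ no-VC router + k re-switches" sanity check, 5-port figures
# from Table 1: passive re-switch (515 um^2, 0.15 mW, 16 ps),
# no-VC router (3161 um^2, 1.12 mW, 168 ps), 4-VC router (4660 um^2, 1.89 mW, 585 ps).
resw_area, resw_pwr, resw_delay = 515, 0.15, 16
novc_area, novc_pwr, novc_delay = 3161, 1.12, 168
vc4_area, vc4_pwr, vc4_delay = 4660, 1.89, 585

area_equiv = novc_area + 3 * resw_area       # 3161 + 1545 = 4706, ~4660
pwr_equiv = novc_pwr + 5 * resw_pwr          # 1.12 + 0.75 = 1.87, ~1.89
delay_equiv = novc_delay + 24 * resw_delay   # 168 + 384 = 552, ~585
```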

Table 2 provides the hardware characteristics of various size CNoCs and a past reconfigurable NoC in addition to our AHRNoCs. As mentioned before, the CNoCs have a mesh topology as shown in Figure 1 and are listed as CNoC1, CNoC2, CNoC3 and CNoC4 based on the number of VCs. For example, the nodes of CNoC1 are simple routers without a VC mechanism, and the routers of CNoC4 have 4 VCs per input-port. The past NoC architecture presented by Modarressi et al. [11], shown in Figure 2, is also synthesized and named ReNoC in Table 2. The AHRNoC architectures evaluated are called AHRNoC1, AHRNoC2, AHRNoC3 and AHRNoC4 and are illustrated in Figures 4, 6, 8 and 9 respectively. For the sake of simplicity, Table 2 has two columns presenting the power and area ratios of the different NoCs against the 4×4 RS-network (the RS-layer illustrated in Figure 4). The communication link characteristics are discussed in the following section. All AHRNoC versions perform communication faster than the other NoCs, as the nodes can communicate via the RS-layer. In ReNoC, there are at least two no-VC routers and a switch between two nodes, where a no-VC router works 9 times slower (on average) than a reconfigurable switch (see Table 1). Therefore, the minimum delay in the case of ReNoC is 19 times the re-switch delay. Similarly, the minimum delay in a CNoC is 18 times the re-switch delay. In addition to the higher communication potential of AHRNoCs, they can also be implemented cheaply with lower hardware. One can observe from the synthesis results given in Table 2 that AHRNoC2 and AHRNoC4 consume much lower power as compared to the other NoCs. Firstly, only four routers are used in AHRNoC2 and AHRNoC4 as compared to 16 routers in the other NoCs. Secondly, a router is costly, consuming on average 8 times more power than a reconfigurable switch.
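The minimum-delay multiples quoted above follow directly from the "one no-VC router ≈ 9 re-switch delays" average of Table 1:

```python
# Minimum end-to-end delay on the shortest route, in multiples of one
# re-switch delay (per Table 1, a no-VC router is ~9x a re-switch).
ROUTER = 9   # one no-VC router, in re-switch delay units
SWITCH = 1   # one re-switch

cnoc_min = 2 * ROUTER             # two routers between neighbouring nodes -> 18
renoc_min = 2 * ROUTER + SWITCH   # two routers plus one switch -> 19
ahrnoc_min = 2 * SWITCH           # two re-switches when mapped on the RS layer -> 2
```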

TABLE 2. Hardware Characteristics of NoCs (ASIC design, 15nm NanGate Library)

| NoC | Area (µm²) | Power* (mW) | Link Area (µm²) | Link Power* (mW) | Area ratio† | Power ratio† |
|---|---|---|---|---|---|---|
| 4×4 RS-Network | 8336 | 2.5 | 43200 | 0.86 | 1.0 | 1.0 |
| 2×2 no-VC 6-port Router-Network | 18640 | 7.5 | 43200 | 0.86 | 2.2 | 2.9 |
| CNoC1 (Fig. 1) 4×4 no-VC 5-port Router-Network | 39972 | 14.4 | 43200 | 0.86 | 4.8 | 5.6 |
| CNoC2 (Fig. 1) 4×4 2-VC 5-port Router-Network | 43180 | 16.3 | 46656 | 0.94 | 5.2 | 6.3 |
| CNoC3 (Fig. 1) 4×4 3-VC 5-port Router-Network | 48668 | 19.0 | 48960 | 0.98 | 5.8 | 7.4 |
| CNoC4 (Fig. 1) 4×4 4-VC 5-port Router-Network | 57084 | 23.9 | 50112 | 1.0 | 6.8 | 9.3 |
| ReNoC (Fig. 2) | 49218 | 18.0 | 86400 | 1.3 | 5.9 | 7.0 |
| AHRNoC1 (Fig. 4) | 48308 | 17.0 | 86400 | 1.3 | 5.8 | 6.6 |
| AHRNoC2 (Fig. 6) | 26976 | 10.1 | 86400 | 1.3 | 3.2 | 3.9 |
| AHRNoC3 (Fig. 8) | 56644 | 19.6 | 129600 | 1.7 | 6.8 | 7.6 |
| AHRNoC4 (Fig. 9) | 35312 | 12.7 | 129600 | 1.7 | 4.2 | 4.9 |

* Total Dynamic and Static Power. † Characteristic ratio against the 4×4 RS-Network.

4.2. RS-Layer Interconnection Wires and Wiring Characteristics

The RS-layer of the AHRNoC utilizes additional wires and switches to achieve re-configurability, flexibility and high-speed communication. Most of the power consumed in wires is due to the switching activity associated with the passing of messages (flits or packets) [22]. We have not modeled the wiring delay for the RS-layer, as the wire delay depends on various parameters including the total area of the SoC, which is not available. We have employed two approaches to estimate the link power and chip-area characteristics of the NoCs. The first approach is based on the wire results of two well-known past research efforts and is discussed in Section 4.2.1. The second approach is based on Copper on-chip wire characteristics and is discussed in Section 4.2.2.

4.2.1. On-Chip Wiring Characteristics

The power consumed in a wire is composed of dynamic and static power. In geometries smaller than 90nm, static power becomes a dominant component [23, 24]. The main source of dynamic power for links is the charging and discharging of their capacitances, which is proportional to the switching activity of the wires. For example, if no data passes through the wires, the wire dynamic-power is zero. We use the following assumptions to estimate the link power associated with the NoCs listed in Table 2.

i. The links are assumed to be available only in-between the nodes of CNoC, ReNoC and AHRNoC. This means the links among the components inside a node are internal, i.e. their specifications are included in the component parameters


given in Table 1. For example, a node of AHRNoC3 includes four components (an IP core, two re-switches and a router) and their interconnection is considered as part of the node parameters.

ii. The link dynamic-power is proportional to the number of wires of a link. For example, consider a packet passing in CNoC1, CNoC4 and AHRNoC3 from Source-0 (upper-left corner node) to Destination-15 (lower-right corner node). It passes 6 links in all three NoCs and creates the same switching activity in the links. Therefore, CNoC4, which has more wires in its links due to the extra control bits for VC implementation, consumes more link-based dynamic-power.

iii. The link static-power is proportional to the total number of wires in the NoC.

By considering the above assumptions, the following estimates for the on-chip link power can be made.

 ReNoC, AHRNoC1 and AHRNoC2 links consume almost twice the static power of CNoC1 links. This is due to their additional link wires, which are two times those of CNoC1.

 ReNoC, AHRNoC1 and AHRNoC2 links consume the same dynamic power as CNoC1 links. This is due to the same data transfer assumed in all the NoCs, and each individual link of ReNoC, AHRNoC1 and AHRNoC2 has the same number of wires as a CNoC1 link.

 AHRNoC3 and AHRNoC4 links consume almost three times more static power than CNoC1 links. However, their links consume the same dynamic power as those of CNoC1.

The power breakdown results of a 5-GHz mesh interconnect for the Teraflops Processor [25] are considered for link-power estimation. The Teraflops Processor is chosen as the details of its NoC power are available. Moreover, its NoC architecture is similar to the CNoC1 structure. Some specifications of the Teraflops Processor architecture are listed below.

 The simulation is done at 4 GHz, 1.2 V supply, in a 65 nm CMOS process technology.
 The Teraflops Processor architecture contains 80 tiles arranged in a 2D array and connected by a mesh network.
 A tile consists of a processing engine connected to a router to facilitate packet communication among the tiles.
 Each flit contains six control bits and 32 data bits.
 Each router has five ports with two-lane pipelined packet switching.
 Each link connected to a port has two 39-bit unidirectional point-to-point wires.
 The NoC, including routers and links, consumes 28 percent of the SoC tile power (the chip is divided into 80 equal tiles).
 The links consume 11 percent of the NoC power.

The above specification indicates that the Teraflops Processor [25] has almost the same architecture as a CNoC without VCs (i.e. CNoC1). By considering the link-to-router power ratio of the Teraflops Processor, the link power consumption of the NoCs listed in Table 2 is determined as given below.

 If the links consume 11 percent of the NoC power, the CNoC1 links consume almost 12.4 percent of the CNoC1 routers' power.
 Each link connected to a port in CNoC1 has two 16-bit unidirectional wires and 6 control wires [5]. Therefore, the number of wires of a link in CNoC1 is ((2×16) + 6) = 38.
 By considering the number of wires of a link in the Teraflops Processor architecture (i.e. 2×39 = 78), the CNoC1 links consume almost 6% ((38/78)×12.4) of the routers' power. Therefore, the CNoC1 links consume 0.43 mW of static power as well as the same 0.43 mW of dynamic power (dynamic power is assumed to be equal to the static power [24]).
 CNoC2 links consume almost 6.5% ((41/78)×12.4) of the CNoC1 routers' power. CNoC3 and CNoC4 links consume around 6.8% ((43/78)×12.4) and 7% ((44/78)×12.4) of the CNoC1 routers' power respectively. Their dynamic and static power consumption is proportional to the number of wires in a link.
 ReNoC, AHRNoC1 and AHRNoC2 links consume around 9% of the CNoC1 routers' power. AHRNoC3 and AHRNoC4 links consume around 12% of the CNoC1 routers' power (following assumption iii).
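The percentage scaling above can be checked in a few lines (the 12.4% link-to-router ratio and the 78-wire Teraflops link come from the discussion; the wire counts 38, 41, 43 and 44 are those of CNoC1 to CNoC4):

```python
# Link power as a fraction of CNoC1 router power, scaled from the Teraflops
# Processor figure of 12.4% for 78-wire links (2 x 39 wires per link).
TERAFLOPS_RATIO = 12.4      # link power as % of router power in [25]
TERAFLOPS_WIRES = 2 * 39    # wires per Teraflops link

def link_power_pct(wires_per_link: int) -> float:
    """Scale the Teraflops link/router power ratio by the wire count."""
    return round(wires_per_link / TERAFLOPS_WIRES * TERAFLOPS_RATIO, 1)

cnoc1 = link_power_pct(2 * 16 + 6)   # 38 wires -> ~6.0 %
cnoc2 = link_power_pct(41)           # -> ~6.5 %
cnoc3 = link_power_pct(43)           # -> ~6.8 %
cnoc4 = link_power_pct(44)           # -> ~7.0 %
```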

In order to estimate the chip area of the links, we need to determine the length of each link. These parameters are determined by considering the floor-planning process. We employ the many-core chip results presented for KiloCore, a 32-nm 1000-core processor [2]. We have chosen this example as its SoC area details are available for a 32nm technology. The KiloCore chip is fabricated on a die of 64 mm², and each processor (router and IP core) occupies 239µm×232µm. Therefore, we can consider the average link length to be around 240µm. Copper wire (resistivity of 2.8 mΩ·m) is chosen for the interconnection links, where the width of the wires is taken to be 0.1µm. The spacing between two adjacent wires is kept at 0.1µm. In CNoC1, there are 24 links between the nodes, and each link has 38 wires. Assuming repeater-less links, the average CNoC1 link area becomes (24×(38×0.1+37×0.1)×240) = 43200 µm². Similarly, the CNoC2, CNoC3 and CNoC4 link areas are (24×(41×0.1+40×0.1)×240) µm², (24×(43×0.1+42×0.1)×240) µm² and (24×(44×0.1+43×0.1)×240) µm² respectively. Each ReNoC, AHRNoC1 or AHRNoC2 link has two times the CNoC1 link area. Similarly, AHRNoC3 and AHRNoC4 links have three times the link area of CNoC1.
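The link-area arithmetic can be reproduced directly (constants follow the text: 24 inter-node links per 4×4 mesh, 0.1 µm wire width and spacing, 240 µm average link length):

```python
# Link chip-area estimate: `wires` wire tracks plus (wires-1) gaps per link.
LINKS, WIDTH, SPACING, LENGTH = 24, 0.1, 0.1, 240   # counts and um

def link_area(wires: int) -> int:
    """Total link area in um^2 for a 4x4 mesh with repeater-less links."""
    return round(LINKS * (wires * WIDTH + (wires - 1) * SPACING) * LENGTH)

cnoc1 = link_area(38)   # -> 43200 um^2, as listed in Table 2
cnoc2 = link_area(41)   # -> 46656 um^2
cnoc3 = link_area(43)   # -> 48960 um^2
cnoc4 = link_area(44)   # -> 50112 um^2
```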

In terms of timing delay, we propose to employ active re-switches in the AHRNoC to cater for long-distance delays as compared to CNoC and ReNoC. In this way, the AHRNoC scalability problem of long-distance delay in many-core SoCs can also be solved. A re-switch is converted to an active switch by adding registers at its output-ports. For estimating the effect of active re-switches, we assume the following:

 As before, the global links are available between the nodes in CNoC, ReNoC and AHRNoC.
 The link delay is less than the clock cycle for CNoC, ReNoC and AHRNoC based SoCs.

A register has 11ps delay for Nangate 15nm technology. Therefore, the average time delays of active re-switches become almost 40% more than those of passive re-switches. Each output port of an active re-switch has a 16-bit register for


flit storage and a 3-bit register to cater for the control signals. Therefore, the average power and area of active re-switches are almost 7% and 20% more than those of passive re-switches respectively. The maximum delay in a 4×4 RS-layer for communication among distant (corner) IP cores (e.g. IP0 and IP15 of Figure 4) is based on a route with seven active re-switches and six link wires. For the same Nangate 15nm ASIC technology, the delay of seven active re-switches equals 189 ps. We can ignore the hop-wire delay in the RS-layer as it is equal to that of the router layer, which is also ignored in the CNoC and ReNoC NoCs. This means an IP core can run with a 378 ps (189×2) clock cycle, latch its messages in the FIFO located at the farthest IP core, and get its credit back in one clock cycle. One can observe from Table 1 that only no-VC routers have an average delay of less than 378 ps.

4.2.2 Wiring Characteristics of AHRNoC

For global wiring in 15nm technology, two examples are considered based on the topology and area characteristics of two fabricated SoCs, i.e. KiloCore and Teraflops [2, 25]. The wire specification is estimated as follows:

i. Consider the Teraflops SoC (TFS) example that has a 275 mm² area and a 10×8-node mesh topology [25].
ii. Assume the KiloCore SoC (KCS) as the 2nd example, having a 64 mm² chip area and a 32×32 mesh topology [2].
iii. We also assume that the IP nodes have the same chip area, which is constant for all the NoCs.
iv. Figure 10 illustrates a schematic of the scaled-down floorplan that we consider for our estimation.

Fig. 10. Schematic of a scaled down floorplan (2×4) of SoCs being considered.

For the sake of simplicity and fair comparison, we ignore some layout factors and consider the following formulations for estimating global wiring in 15nm technology with a global operating voltage of 0.8V and a time-period of 1 ns (for a 1 GHz clock).

Wire resistance is calculated by R = (ρ/H)×L/W, where W, L and H are the width, length and height of the wire, and ρ = 1.7×10⁻⁸ Ωm for Copper. (ρ/H) is known as the Sheet Resistance, and if we consider H = 100nm in our estimation, (ρ/H) = 0.17Ω [30]. Wire capacitance is calculated by C = (ε/t)×(L×W), where t is the distance between two layers, and ε/t is fixed by the technology. For ε = 0.26 pF/mm and t = 1µm, (ε/t) = 2.60×10⁻⁴ pF/µm², so C = C_per-square-micron×W×L [30]. The following two equations are employed for wire delay and power estimation [31].

Wire delay = ½ R×C    (1)

Wire Dynamic Power = α×C×V²×f    (2)

where V, C, f and α are the global operating voltage, wire capacitance, frequency and activity factor respectively. α is assumed to be 1/16 (¼ × ½ × ½) in our estimation, as 25% of the wires always carry the data (for CNoC1), the frequency of data change is half of the clock cycle, and the wires are not always populated with data (assumed 50% populated).

For estimating the distance between two nodes (see Figure 10), the following equation is considered.

Distance between two Nodes, L = √(Asoc/(n×m)) − √(Anoc/(n×m))    (3)

where Asoc and Anoc are the SoC and NoC areas, and n and m are the number of rows and columns of the mesh SoC.

In Equation 3, Anoc can be derived from Table 2. Asoc is considered to be 64 mm² and 275 mm² for the KCS and TFS chips respectively. The width of the wires is taken to be 0.1µm, and the spacing between two adjacent wires is also kept at 0.1µm. We consider the following equation for the total wire area, Awire.

Total wire area = (2×(n−1)×m) × (i×0.1 + (i−1)×0.1)×L    (4)

where there are 2×(n−1)×m links between the nodes, and each link has i wires.

The number of wires, i, in a link is 38, 41, 43, 44, 2×38, 2×38, 2×38, 3×38 and 3×38 for CNoC1, CNoC2, CNoC3, CNoC4, ReNoC, AHRNoC1, AHRNoC2, AHRNoC3 and AHRNoC4 respectively. By considering the above formulation and assumptions, the wire area ratio, power and delay are estimated for both SoC examples and listed in Table 3.
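As a concrete check of Equations (1) to (3), the sketch below reproduces the CNoC1 wire entries for the KCS example (the 1 GHz frequency and the 2× factor for total power follow the earlier assumptions that the clock is 1 GHz and that static wire power equals dynamic wire power):

```python
import math

# Wire delay and power for CNoC1 on the KCS example, Eqs. (1)-(3).
A_soc, A_noc, n, m = 64e6, 39972 * 64, 32, 32   # um^2; 4x4 NoC scaled to 1024 nodes
L = math.sqrt(A_soc / (n * m)) - math.sqrt(A_noc / (n * m))   # ~200 um (Eq. 3)

R = (L / 0.1) * 0.17                  # ohms: L/W squares x 0.17 ohm sheet resistance
C = 2.60e-4 * 0.1 * L                 # pF per wire: (eps/t) x W x L
delay_ps = 0.5 * R * C                # Eq. (1); ps, since C is in pF -> ~0.88 ps

links, wires, V, f, alpha = 2 * (n - 1) * m, 38, 0.8, 1e9, 1 / 16
C_total = links * wires * C * 1e-12   # farads, all link wires of the NoC
dyn_W = alpha * C_total * V**2 * f    # Eq. (2) -> ~15.7 mW dynamic
total_mW = 2 * dyn_W * 1e3            # static assumed equal -> ~31.4 mW
```

These reproduce the 0.88 ps wire delay and 31.4 mW wire power listed for CNoC1 under KCS in Table 3.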


TABLE 3. SoCs Hardware Characteristics (ASIC design, 15nm)
Wire assumptions: W = 0.1 µm, V = 0.8 V, f = 1 GHz, R = (L/W)×0.17 Ω, C = W×L×(2.60×10⁻⁴ pF/µm²)

KCS: Asoc = 64 mm², 1024 nodes

| NoC | Anoc/Asoc (%) | NoC Power* (mW) | NoC Delay (ps) | Awire/Asoc (%) | Wire Power* (mW) | Wire Delay (ps) |
|---|---|---|---|---|---|---|
| CNoC1 | 4.0 | 921.6 | 168 | 4.7 | 31.4 | 0.88 |
| CNoC2 | 4.3 | 1043.2 | 485 | 5.0 | 33.5 | 0.87 |
| CNoC3 | 4.9 | 1216.0 | 491 | 5.1 | 34.6 | 0.84 |
| CNoC4 | 5.7 | 1529.6 | 585 | 5.1 | 34.6 | 0.80 |
| ReNoC | 4.9 | 1152.0 | 168 | 9.0 | 61.2 | 0.84 |
| AHRNoC1 | 4.8 | 1088.0 | 183 | 9.1 | 61.2 | 0.84 |
| AHRNoC2 | 2.7 | 646.4 | 183 | 9.7 | 65.5 | 0.96 |
| AHRNoC3 | 5.7 | 1254.4 | 183 | 13.3 | 89.6 | 0.80 |
| AHRNoC4 | 3.5 | 812.8 | 183 | 14.2 | 95.5 | 0.91 |

TFS: Asoc = 275 mm², 80 nodes

| NoC | Anoc/Asoc (%) | NoC Power* (mW) | NoC Delay (ps) | Awire/Asoc (%) | Wire Power* (mW) | Wire Delay (ps) |
|---|---|---|---|---|---|---|
| CNoC1 | 0.07 | 144.0 | 168 | 0.71 | 20.5 | 72 |
| CNoC2 | 0.08 | 163.0 | 485 | 0.76 | 22.1 | 72 |
| CNoC3 | 0.09 | 190.0 | 491 | 0.80 | 23.2 | 72 |
| CNoC4 | 0.10 | 239.0 | 585 | 0.82 | 23.6 | 71 |
| ReNoC | 0.09 | 180.0 | 168 | 1.41 | 40.9 | 71 |
| AHRNoC1 | 0.09 | 170.0 | 183 | 1.41 | 41.0 | 72 |
| AHRNoC2 | 0.05 | 101.0 | 183 | 1.42 | 41.3 | 73 |
| AHRNoC3 | 0.10 | 196.0 | 183 | 2.11 | 61.3 | 71 |
| AHRNoC4 | 0.06 | 127.0 | 183 | 2.13 | 61.7 | 72 |

* Total Dynamic and Static Power

The NoC area and power listed in Table 3 are first derived from Table 2 and then scaled up to 1024 and 80 nodes for the KCS and TFS SoCs respectively. As discussed before, the NoC delays are based on their slowest components and derived from Table 1. It can be observed from the values in Table 3 that the wire power in the TFS example is almost 14% of the CNoC1 power, which is close to the 12.4% measured by Wang et al. [24]. This supports some of our estimations in this section. Moreover, some of the technology wire characteristics (e.g. ρ/H and ε/t) are coordinated with the results reported by Cheng et al. and Ho [30, 31].

By considering the number of wires in the NoCs, one can expect the wire areas and power consumption of AHRNoC1 and AHRNoC2 to be almost two times those of CNoC1. Similarly, the wire area and power consumption of AHRNoC3 and AHRNoC4 are three times those of CNoC1 for both the KCS and TFS examples. However, we can utilize re-configurability in AHRNoC1 by expending on average 5.2% (9.1−4.7+4.8−4.0) and 0.72% more SoC area, and 20% and 28% more NoC and wire power, as compared to CNoC1 in the case of KCS and TFS respectively. In terms of delay, it can be observed from the values presented in Table 3 that the wire delays within each of the KCS and TFS examples are almost in the same range across NoCs. This is because the delay is proportional to the square of L, i.e. to the difference of √Asoc and √Anoc; since Anoc is very small compared to Asoc, the wire delays in both examples become almost proportional to Asoc. A switch in the RS layer has a maximum delay of 17 ps (see Table 1). Therefore, if a router has a two-clock-cycle pipeline, a flit can pass through 18 ((2×168+0.88)/(17+0.84)) switches in the RS layer of AHRNoC1 in the time of a flit's passage through a CNoC1 router for the KCS example. Similarly, a flit can pass through 4 ((2×168+72)/(17+72)) re-switches in AHRNoC1 in the time of a CNoC1 router passage for the TFS example.


If we consider a 1 MHz clock, the clock period becomes 1 μs (10^6 ps). In the KCS based AHRNoC1, 35868 (1000000/(27+0.88)) re-switches, and in the TFS based AHRNoC1, 10101 (1000000/(27+72)) re-switches, can deliver a flit within one clock cycle. In conclusion, at the cost of a small (around 5%) increase in SoC area and 20% more NoC and wire power, the SoC can utilize the re-configurability feature with 9 times faster communication as compared to a conventional NoC.
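The re-switch counts quoted above can be reproduced from the 1 μs period and the per-re-switch delays used in the text (in picoseconds); a quick check:

```python
period_ps = 1_000_000                          # 1 MHz clock -> 1 us = 10^6 ps
kcs_reswitches = int(period_ps / (27 + 0.88))  # KCS estimate -> 35868
tfs_reswitches = int(period_ps / (27 + 72))    # TFS estimate -> 10101
```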

4.3. Communication in RS Layer with Long Distance Link

The goal in this paper is to introduce re-configurability in the NoC by utilizing both the router and RS layers. The RS layer in our approach includes switches that are level sensitive, at least some of which can deliver a message per clock cycle. We have estimated that, for a 1 MHz clock, the RS layer of AHRNoC1 can deliver a message per clock cycle in both the KCS and TFS based estimations. We have also estimated that if the clock cycle is determined by the critical path delay of the routers, 9 or 2 re-switches can deliver flits per clock cycle in the KCS or TFS based NoCs respectively. We have also discussed that the router layer can help the RS-layer in routing messages. For example, in the KCS based AHRNoC1, the routing path can be defined such that the flits pass through 9 re-switches, enter and exit a router, pass through another set of 9 re-switches, and repeat this movement until they reach their destination.

However, there can be cases where the wire delay between nodes is large, which forces each re-switch to operate at the rate of one clock cycle. As mentioned before, one solution to this problem is to employ active re-switches. In this case, the source cores cannot send their flits every clock cycle; instead, the flits reach the destinations after a delay determined by the number of re-switches on the route of their packets. Moreover, the existing handshaking mechanism for each flit between source and destination in the RS-layer decreases communication throughput: the same delay that a flit takes to reach the destination is needed for its credit to come back to the source. This may not be an optimum solution, and it needs further research and investigation. To improve the communication for active re-switches, we can propose a different communication mechanism between the sources and destinations, in which the source sends flits without handshaking in the RS layer and the communication among re-switches behaves like a shift register. This mechanism is possible for communication between various sources and destinations in the RS-layer, and the destinations are not stalled during communication. However, communication in the RS-layer that involves one or more routers still suffers from the drawback of handshaking, as a router encounters contention due to shared resources.

4.4. Performance Evaluation

Latency and throughput parameters are selected for evaluating the performance of AHRNoC using the ModelSim platform. We have explored and compared four NoCs, namely CNoC1, CNoC2, CNoC3 and AHRNoC1 (given in Figures 1 to 4 respectively), for some commonly used applications such as the MPEG-4 decoder, an Audio/Video Benchmark application (AV) [26] and the Double Video Object Plane Decoder (DVOPD) with the capability to decode two streams in parallel [27]. The evaluation results presented here are based on the AHRNoC1 architecture; however, they support the efficiency of the overall AHRNoC approach. The MPEG-4, AV Benchmark and DVOPD applications are mapped to 3×4, 4×4 and 4×7 2D mesh topology NoCs respectively. Two versions of DVOPD (28- and 32-core graphs) are available in the literature, and we have used the 28-core version of DVOPD put forward by Concer et al. [27]. The MPEG-4 and AV Benchmark core graphs with bandwidths are shown in Figure 11. Each data communication among the cores of these application core-graphs has a communication rate and a path.

4.4.1. Latency and Throughput Formulation

For performance evaluation, the source cores generate packets that are transferred to their destinations in the NoCs under test. The sink cores receive the packets, and the sink and source cores keep a record of the sending and receiving times of every flit of the packets. The number of clock cycles from the time a flit is injected to the time it ejects from the NoC is measured as the latency of that flit in clock cycles [28]. One can also calculate the latency by subtracting the flit injection time from the flit ejection time and dividing the result by the clock period. For the sake of meaningful measurements, a specific number of packets are sent to each NoC under test; one must then wait for these packets to reach their respective destinations to calculate the average latency using equation (5).

Average Latency = (L1 + L2 + … + LN) / N        (5)

where Li is the latency of the i-th flit in clock cycles and N is the total number of flits received.
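The measurement procedure above can be sketched as a small helper; the timestamps and clock period in the usage line are hypothetical inputs, not values from the experiments:

```python
def average_latency(inject_times, eject_times, clock_period):
    """Average flit latency in clock cycles, per equation (5)."""
    lats = [(e - i) / clock_period for i, e in zip(inject_times, eject_times)]
    return sum(lats) / len(lats)

# Three flits injected every 10 ns, each ejected 40 ns later -> 4 cycles each.
avg = average_latency([0, 10, 20], [40, 50, 60], 10)  # -> 4.0
```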

Fig. 11. (a) MPEG-4 mapped to a 3×4 mesh CNoC. (b) AV Benchmark mapped to a 4×4 mesh CNoC.

The latency can reveal stalls in communication: NoCs with more stall conditions in their communications exhibit higher latency. In our pipelined NoC communication, we inject multiple flits before the first flit is even received, as illustrated for the two scenarios S1 and S2 in Figure 12. The first ejected flit appears after a delay (latency) from the first injected flit; afterwards, the successive injected and ejected flits continue every clock cycle. Therefore, in a pipelined NoC, the latency alone does not provide all the information we need about the performance of an NoC. One can see that the average delays of the three flits F1, F2 and F3 in the S1 and S2 scenarios are the same and equal to 4 cycles. Throughput provides insight into how fast data is injected into the data path; in other words, the throughput is a useful measure of how fast flits can be pushed into the NoC. In our performance evaluation experiments, the throughput is measured in flits per clock cycle. In the S1 scenario of Figure 12, the injected and ejected flits have the same throughput because they change at the same rate. Therefore, we do not consider the ejected flits in our estimation, and the throughput calculation is based on how the injected flits are changing. Assuming n flits are injected into the NoC during a specific time slot, the throughput is estimated according to equation (6).

Throughput = n / C        (6)

where C is the length of the time slot in clock cycles.

Fig. 12. Timing diagram for two scenarios: S1 and S2 with the same delay.

For example, consider scenario S1 where five flits are injected into the NoC in 60 ns; the throughput at this time becomes almost one flit per clock cycle. For scenario S2, three flits are injected into the NoC in 60 ns, so the average throughput becomes 0.5 flit per clock cycle (or one flit per two cycles). Another point is that the throughput never exceeds one, because more than one flit per cycle cannot be injected in either scenario of Figure 12. Overall, latency describes the behaviour of an NoC in terms of packet delivery time, i.e. the time an injected packet takes to reach its destination, no matter when the packet is injected into the NoC. Throughput describes the behaviour of an NoC in terms of the rate at which it accepts new packets, no matter how the packets are delivered through the NoC.
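Equation (6) for scenario S2 can be checked in a few lines, assuming the 10 ns clock implied by the timing diagram of Figure 12:

```python
def throughput(n_flits, slot_ns, clock_ns=10):
    """Flits injected per clock cycle during a time slot, per equation (6)."""
    return n_flits / (slot_ns / clock_ns)

s2 = throughput(3, 60)  # S2: 3 flits in 60 ns -> 0.5 flit per cycle
```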


4.4.2. Application Mapping for AHRNoC and CNoCs

The mappings to the past NoCs such as CNoC1, CNoC2 and CNoC3 follow an XY-routing methodology, i.e. a packet travels in the X direction until it reaches the Y dimension of its destination, and then travels in the Y direction to the destination. Figure 13 depicts such a mapping of the AV Benchmark application for CNoC2; consider the communication path between nodes 3 and 5, which passes via nodes 2 and 1. The arrow lines indicate the direction of packet travel from the source core to the sink core. These arrows can also be used to find the maximum number of VCs that are useful for each channel. Considering the MPEG-4 core-graph of Figure 11(a), there are three arrows pointing to the north input-channel of node 5, which indicates that three packets may pass through this channel concurrently. These three packets require three VCs to service them without any blockage. No other link-channel in Figure 11(a) carries more than three packets concurrently, which illustrates that MPEG-4 performance cannot be improved by having more than three VCs. Similarly, Figures 11 and 13 illustrate that the performance of the AV benchmark application cannot be improved further by having more than two VCs.

Fig. 13. AV benchmark mapping to CNoC2.

For the sake of a fair comparison, the mapping in AHRNoC1 follows the methodology listed below; the AV Benchmark application mapping is demonstrated in Figure 14.

• The communication mappings follow XY routing in the router layer and a specific mapping in the RS layers. This is because routers have an arbiter and buffer memory to implement the routing algorithm, whereas re-switches have no arbiter or buffer and are set up in advance for a routing methodology.

• In a switch node, an output-port can receive packets from only one input-port, as a re-switch has no buffer or arbiter to share the output port among multiple input-ports. For example, consider the communications from source cores 0 and 5 to destination core-4. The switch S4 cannot deliver two packets to its output port connected to core-4, so these communications can only be mapped via the router layer, as depicted for the AV benchmark in Figure 14.

• For mapping communications on the RS-layer, higher-rate and longer-route communications are given priority over lower-rate and shorter-route ones. Considering the different communications to sink core-10, the communication from source core-15 has a higher rate and a longer route than the communication from core-8. Therefore, the communication from core-15 to sink core-10 is mapped over the RS-layer due to its higher priority.

• A communication that cannot be mapped entirely over the RS-layer is mapped via both the RS and router layers. Therefore, the communication from source core-8 to sink core-13 with a bandwidth of 3672 is mapped on both the RS-layer and the router layer: as the path from source core-8 to switch S14 is free, the route from core-8 to node-14 is mapped on the RS-layer, and from node-14 to core-13 via the router layer.

• When a destination node receives packets from more than one source, the mapping to that destination must go through the router layer. For example, destinations 0, 1, 2, 4, 5, 7, 9, 10 and 13 receive packets through the router layer, as shown in Figure 14.

• A destination that receives packets from only one source can be mapped over the RS-layer. This can be observed in Figure 14 for the communications that terminate at destination cores 3, 8, 11, 12, 14 and 15.

• When a node has no source core, or its source sends no packets through the RS-layer, its output channel from the RS-layer to the router-layer can be used for other communications. This rule is followed for nodes 7, 10, 11, 13 and 14 in Figure 14.
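The RS-layer priority rule above (higher rate first, then longer route) can be sketched as a simple sort. The bandwidths and hop counts below are illustrative placeholders, not values taken from the core graphs:

```python
# Candidate communications: (source, sink, bandwidth, hop_count). Hypothetical.
comms = [
    (8, 10, 640, 1),
    (15, 10, 3672, 3),
    (0, 4, 500, 2),
]

# Higher bandwidth wins the RS-layer; ties are broken by longer route.
rs_order = sorted(comms, key=lambda c: (c[2], c[3]), reverse=True)
# rs_order[0] is the core-15 -> core-10 communication.
```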

Fig. 14. AV Benchmark mapping to AHRNoC1.


An initial version of the communication mapping mechanism has been implemented for the AHRNoC1 architecture, and it will be extended to all the AHRNoCs presented in this paper.

4.4.3. Performance Evaluation Results

Generally, NoCs follow a Globally Asynchronous Locally Synchronous (GALS) design, and the critical path of each component can be considered as its clock period. However, for the sake of fairness and simplicity, all experimental results presented in this section use the same clock frequency, and the results are reported in clock cycles. The throughput and latency are measured for various flit injection rates, where the injection rate is defined as the number of flits per time-unit per node injected into the NoC. We have set the time-unit to 9 clock cycles, which bounds the injection rate at 9, as shown in the latency results of Figure 15: an injection rate of 9 means that nine flits per time-unit per node are injected into the NoC, since a source core/node cannot inject more than one flit per clock cycle. To measure the latency, a specific number of packets are injected; in the case of the MPEG-4, AV Benchmark and DVOPD applications, 221888, 1520512 and 221888 packets are sent to each NoC respectively. NoC communication is based on wormhole switching, where the channel width equals the flit size of 16 bits. A packet consists of 16 flits, and each input-port has an 8-slot central buffer to implement VCs. The routers in CNoC1 and AHRNoC1 do not support VCs, whereas the routers of CNoC2 and CNoC3 have 2 and 3 VCs respectively in their input-ports. The flit arrival/departure for all the routers takes two clock cycles. It should be noted that the AV benchmark requires only two VCs, due to which the CNoC2 and CNoC3 based results are identical.

The performance results for these applications (MPEG-4, AV-benchmark and DVOPD) are presented in Figures 15 and 16. It can be observed from the results that the average latency of AHRNoC1 is less than those of CNoC1, CNoC2 and CNoC3 for all the injection rates. The average latencies of MPEG-4, AV-benchmark and DVOPD on AHRNoC1 are 27%, 58% and 37% less than those of CNoC3 respectively. It is also observed that the AV-benchmark performance results for CNoC3 and CNoC2 are the same, as mentioned earlier. The average throughput of AHRNoC1 is higher than those of CNoC1, CNoC2 and CNoC3; for all the applications, the average throughputs of AHRNoC1 are 1-5% higher than those of CNoC3. The advantage of AHRNoC1 becomes more interesting when we consider the hardware characteristics. As observed from the synthesis data of Table 2, a 4×4 AHRNoC1 consumes 0.7% less chip-area and 11% less power than a 4×4 CNoC3. The critical path delays of the NoC components limit the maximum operating frequency of the NoC. The critical path delay of AHRNoC1 is determined by its slowest component, i.e. a 5-port no-VC router, while for CNoC3 it is determined by the 5-port 3-VC router. Therefore, according to Table 1, AHRNoC1 can operate three times faster than CNoC3.

Fig. 15. Latency for AHRNoC1 and CNoC based SoCs: (a) MPEG-4, (b) AV Benchmark, (c) DVOPD.

The success of the AHRNoC approach can be guaranteed in most SoC applications. This is due to the fact that most NoCs employ a regular grid topology, e.g. mesh or torus, to provide an efficient interconnect infrastructure for different application-specific SoCs. The regular topology of AHRNoCs and the irregular communication of application-specific SoCs always lead to three kinds of communications, i.e. communications that map to the RS-layer, the router-layer, or both. For instance, 38% of the AV-benchmark mapping shown in Figure 14 is in the RS-layer, while 49% and 35% of the DVOPD and MPEG-4 mappings are in the RS-layer respectively. However, communication mapping on the RS-layer alone does not determine the performance improvement; it is also observed from Figures 15 and 16 that the 38% RS-layer mapping of the AV-benchmark delivers more communication than the DVOPD mapping. The AHRNoC1 performance improvement comes from mapping communications onto the RS-layer, where data transfer happens in one clock cycle, whereas communications over the router-layer are pipelined and prone to contention due to shared resources [28]. The behaviour of the results in Figures 15 and 16 depends on the following factors.


Fig. 16. Throughput for AHRNoC1 and CNoC based SoCs: (a) MPEG-4, (b) AV Benchmark, (c) DVOPD.

• The latency graphs of Figure 15 do not have a linear trend with the flit injection rate. This is due to the irregular communication among the nodes; for example, most of the MPEG-4 communications are concentrated on core-5, whereas the AV-benchmark communication is rather scattered.

• Contention in a CNoC depends on the number of utilized VCs: higher VC utilization reduces contention and improves performance. It can be verified from the data shown in Figure 15 that the average latencies of the applications on CNoC2 and CNoC3 are 27%, 58% and 37% less than those of CNoC1. However, the maximum number of useful VCs is application dependent, and increasing the VCs beyond that maximum does not improve the performance (latency and throughput) and wastes resources. As mentioned earlier, the maximum numbers of VCs requested by MPEG-4, AV-benchmark and DVOPD are 3, 2 and 3 respectively.

• AHRNoC1 shows higher performance at high flit-injection rates (7 to 9). At lower injection rates, there are time intervals without flit injection, which lowers the contention in CNoCs. When one flit per 3 clock cycles is injected from each node for an injection rate of 3, there are two clock cycles in which the flit can move ahead in the NoC without contention. However, at the highest injection rate of 9, one flit is injected per clock cycle, and AHRNoC1 delivers the flits via both the RS-layer and the router-layer, resulting in lower contention.

4.5. Analytical Performance Evaluation

In this section, we discuss the efficiency of AHRNoC in boundary performance situations. Considering Figure 4, assume a communication scenario in which every IP core communicates with its neighbor and no IP core receives packets from more than one core, e.g., 0→1, 1→2, 2→3, 3→7, 7→6, … 4→8, … 11→15, … 13→12, 12→0 (0→1 means that IP0 sends packets to IP1). In this scenario, all the communication paths can be mapped in the RS layer, and AHRNoC provides its maximum performance. In other words, the neighboring-node scenario illustrates the advantage of AHRNoC over other approaches: in CNoC and ReNoC based systems, each flit must pass through the pipeline stages of a router. The worst-case scenario occurs when all the IP cores receive data from at least two other IP cores; then all the communication paths must be mapped to the router layer, as the re-switches have no data buffers or arbiter to store and manage different packets (see Section 4.2). However, the communication in most SoC applications is not as crowded as in this worst-case scenario. Most of the communication paths are in a pipelined form, i.e. data moves from one IP core to another, which can be considered the best scenario. We can observe such communication patterns in the MWD, PIP, MP3 ENC DEC, 263 ENC MP3 DEC and 263 DEC MP3 DEC core graphs of Figure 17, used by Sahu et al. [29]. To support the efficiency of our approach, we analytically investigate the performance of these applications.

The investigation results given in Table 4 confirm the fairness and robustness of our AHRNoC architecture. These applications have a similar but simpler communication structure compared to the MPEG-4, AV Benchmark and DVOPD applications implemented in the previous section. Table 4 lists the approximate performance improvements of AHRNoC over CNoC for these applications. For the sake of a fair comparison, we make the following assumptions for our investigation.

• The routing paths of each application map well onto the RS-layer; they are mapped using the methodology presented in Section 4.4.2.
• The maximum delay of the RS layer is less than the IP clock period, i.e. flit communication in the RS-layer takes one clock cycle.
• Flit arrival/departure at a router takes two clock cycles due to the two pipeline stages of a router.
• Each flit communication in the router layer involves two routers on average; therefore, flit communication in the router layer takes four clock cycles on average.
• There is no traffic contention in the router layer.

The columns of Table 4 are explained below.

Cores Mapped to Router Layer: the 2nd column lists the cores that are mapped to the router layer. For example, in the MWD core graph, cores 5 and 10 receive packets from more than one IP core. The re-switch of core-5 has no buffers and arbiter to handle two different packets; however, its router can handle them.

Data Rate in Router Layer: the 3rd column lists the percentage of the communication that is mapped to the router layer. For example, core-5 of MWD receives 96 and 128 flits, and core-10 receives 96 and 96 flits, out of the 1024 (6×96+3×64+2×128) flits that traverse the NoC for this application.
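The MWD router-layer share follows directly from the core-graph bandwidths quoted above; a quick recomputation of that arithmetic:

```python
# MWD core graph: total flits vs. flits forced onto the router layer
# because cores 5 and 10 receive packets from more than one source.
total_flits = 6 * 96 + 3 * 64 + 2 * 128    # all flits in the graph -> 1024
router_flits = 96 + 128 + 96 + 96          # flows into cores 5 and 10 -> 416
router_share = router_flits / total_flits  # 0.40625, quoted as ~40% in Table 4
```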


Fig. 17. Core graphs where the communication shown is in flits: (a) MWD, (b) PIP, (c) MP3 ENC DEC, (d) 263 ENC MP3 DEC, (e) 263 DEC MP3 DEC.

AHRNoC Performance Improvement: the last column gives the approximate performance improvement of AHRNoC over CNoC. It is calculated from the values of the previous column and the above assumptions (i.e. a flit communication takes one clock cycle in the RS-layer and four clock cycles in the router layer). For example, if 100 flits are transferred in MWD mapped to AHRNoC, 60 flits pass through the RS-layer and 40 flits pass through the router layer, whereas in the case of CNoC all 100 flits pass through routers. Therefore, the approximate performance improvement of AHRNoC over CNoC is almost 45%, as given by (100×4−(60+40×4))/(100×4).
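The last-column figures all reduce to one expression; a small helper reproduces the MWD entry under the stated one-cycle/four-cycle assumptions:

```python
def ahrnoc_improvement(router_share, rs_cycles=1, router_cycles=4):
    """Fractional speedup of AHRNoC over CNoC for a given router-layer share."""
    rs = (1 - router_share) * 100      # flits served by the RS layer
    rl = router_share * 100            # flits served by the router layer
    cnoc = 100 * router_cycles         # CNoC: every flit crosses routers
    ahr = rs * rs_cycles + rl * router_cycles
    return (cnoc - ahr) / cnoc

mwd = ahrnoc_improvement(0.40)  # MWD: 40% router-layer traffic -> ~0.45
```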

TABLE 4. Approximate AHRNoC Performance Improvement

Application     | Cores Mapped to Router Layer | Data Rate in the Router Layer | AHRNoC Performance Improvement
MWD             | 5, 10       | (3×96+128)/(6×96+3×64+2×128) ≈ 40% | (100×4−(60+40×4))/(100×4) ≈ 45%
PIP             | 6           | (2×64)/(7×64+128) ≈ 22% | (100×4−(78+22×4))/(100×4) ≈ 59%
MP3 ENC DEC     | 5, 13       | (1000×2+500+10)/(25+4060+2083×2+1000×2+10+870+500×2+180+150+4060) ≈ 15% | (100×4−(85+15×4))/(100×4) ≈ 64%
263 ENC MP3 DEC | 1, 3, 12    | (38016+24634+193+46733+37958+10+500)/(193+25+38016+38001×2+46733+37958+24634+38001+2083+110+500+4060) ≈ 55% | (100×4−(45+55×4))/(100×4) ≈ 34%
263 DEC MP3 DEC | 4, 5, 14, 7 | (500+187+100+3672+500+10+380+3672)/(187+250+25×2+500×3+100+3672×3+2083+10+4060+380) ≈ 46% | (100×4−(54+46×4))/(100×4) ≈ 41%

As one can observe from the analytical data given in Table 4, there are significant performance improvements for these applications, some lower and some higher. However, our AHRNoC approach can handle multiple applications, and its average improvement remains promising: the average performance improvement for these applications is almost 57%. There are also some standard performance simulation scenarios such as Uniform, Tornado, Poisson and Complement packet distributions. The packet routes in these distributions are unknown before simulation and change during execution. Therefore, such distribution scenarios may not be suitable for evaluating our approach, whose specific feature requires the routing information to be stored at the IP core when an application is set up on the SoC. The success of our AHRNoC approach stems from the fact that most SoC applications are designed for pipelined SoC architectures. Such applications allow most of the communication to be mapped onto the RS-layer, improving the performance of an AHRNoC based system.


5. Conclusions

We have presented an application-oriented high-performance reconfigurable NoC (AHRNoC) architecture in this paper. The AHRNoC architecture consists of one or multiple networks of reconfigurable switches and a network of routers. Useful features associated with AHRNoC architectures are homogeneity, scalability, high interconnection speed, flexibility in performance and easier hardware design. At the design stage, AHRNoCs have a potential for higher communication performance, and they can be implemented at lower hardware cost; for example, AHRNoC2 (Figure 6) consumes less hardware than some conventional or past NoCs. The success of the AHRNoC architecture can be guaranteed in most SoC applications. This is due to the fact that AHRNoCs have a regular architecture (e.g. mesh) while applications have irregular traffic, so there will always be unused resources that AHRNoC can utilize to improve performance. Our approach presents two novel ideas: NoC re-configuration at the application level and flexibility of the architectural structure. AHRNoC is re-configurable in terms of routing paths, which creates a platform for many SoC applications with different communication speeds. It can be set up for a variety of applications, and the router and re-switch layers cater for the packet communication of each application at runtime. The NoC structure is flexible, as the separate layers provide the possibility of scaling down the router layer in terms of topology. In this way, AHRNoC provides reconfigurability without a considerable increase in hardware and power consumption.

The results presented in this paper illustrate the efficiency of our approach. For example, the average latencies of some well-known application-specific SoCs such as MPEG-4, AV-benchmark and DVOPD on AHRNoC are 27%, 58% and 37% less than those on CNoC1, CNoC2 and/or CNoC3 respectively. The average throughput of AHRNoC1 is also higher than those of other conventional NoCs. The advantage of AHRNoC1 becomes more interesting when we also compare the hardware overhead: as one can determine from Table 2, a 4×4 AHRNoC1 consumes 0.7% less area and 11% less power, and can run three times faster, than a 4×4 CNoC3. Our estimation results in Table 3 show that an NoC can utilize the re-configurability feature with 9 times faster communication than a conventional NoC at the cost of a small (almost 5%) increase in SoC area and 20% more NoC and wire power.

Acknowledgments This research is partly supported by a grant from NSERC Canada, an equipment grant from CMC Canada, and the Ryerson University Faculty of Engineering and Architectural Science Dean’s Research Fund (DRF).

References

[1] J. Joven, A. Bagdia, F. Angiolini, P. Strid, D. Castells-Rufas, E. Fernandez-Alonso, J. Carrabina, and G. De Micheli, "QoS-Driven Reconfigurable Parallel Computing for NoC-Based Clustered MPSoCs", IEEE Trans. on Industrial Informatics, vol. 9, no. 3, Aug. 2013, pp. 1613–1624.

[2] B. Bohnenstiehl, A. Stillmaker, J. J. Pimentel, T. Andreas, B. Liu, A. T. Tran, E. Adeagbo, and B. M. Baas, "KiloCore: A 32-nm 1000-Processor Computational Array", IEEE Journal of Solid-State Circuits, vol. 52, no. 4, April 2017, pp. 891–902.

[3] M. S. Abdelfattah and B. Betz, "Design tradeoffs for hard and soft FPGA-based Networks-on-Chip", In Proc. Int. Conf. Field-Programmable Technology, Seoul, South Korea, Dec. 2012, pp. 95–103.

[4] Z. Zhang, D. Refauvelet, A. Greiner, M. Benabdenbi, and F. Pecheux, "On-the-Field Test and Configuration Infrastructure for 2-D-Mesh NoCs in Shared-Memory Many-Core Architectures", IEEE Trans. on VLSI Systems, vol. 22, no. 6, June 2014, pp. 1364–1376.

[5] M. Oveis-Gharan and G. N. Khan, "Efficient Dynamic Virtual Channel Organization and Architecture for NoC Systems", IEEE Trans. on VLSI Systems, vol. 24, no. 2, Feb. 2016, pp. 465–478.

[6] M. Oveis-Gharan and G. N. Khan, "Packet-based Adaptive Virtual Channel Configuration for NoC Systems", Int. Workshop on the Design and Performance of Network on Chip, in Procedia Computer Science, vol. 34, 2014, pp. 552–558.

[7] M. Stensgaard and J. Sparsø, "ReNoC: A network-on-chip architecture with reconfigurable topology", In Proc. 2nd ACM/IEEE Int. Symp. Networks-on-Chip, Newcastle upon Tyne, UK, April 2008, pp. 55–64.

[8] C. H. O. Chen, S. Park, T. Krishna, S. Subramanian, A. P. Chandrakasan, and L. S. Peh, "SMART: A single-cycle reconfigurable NoC for SoC applications", In Proc. Design, Automation and Test in Europe Conference and Exhibition (DATE), Grenoble, France, March 2013, pp. 338–343.

[9] A. Arora, M. Harne, H. Sultan, A. Bagaria, and S. R. Sarangi, "FP-NUCA: A Fast NOC Layer for Implementing Large NUCA Caches", IEEE Trans. on Parallel and Distributed Systems, vol. 26, no. 9, Sept. 2015, pp. 2465–2478.

[10] X. Chen and N. K. Jha, "Reducing Wire and Energy Overheads of the SMART NoC Using a Setup Request Network", IEEE Trans. on VLSI Systems, vol. 24, no. 10, Oct. 2016, pp. 3013–3026.

[11] M. Modarressi, A. Tavakkol, and H. Sarbazi-Azad, "Application-Aware Topology Reconfiguration for On-Chip Networks", IEEE Trans. on VLSI Systems, vol. 19, no. 11, Nov. 2011, pp. 2010–2022.

[12] M. S. Abdelfattah and B. Betz, "The Power of Communication: Energy-Efficient NoCs for FPGAs", In Proc. 23rd Int. Conf. Field Programmable Logic and Applications, Porto, Portugal, Sept. 2013, pp. 1–8.

[13] M. Stuart, M. B. Stensgaard, and J. Sparsø, "The ReNoC Reconfigurable Network-on-Chip: Architecture, Configuration Algorithms, and Evaluation", ACM Trans. on Embedded Computing Systems, vol. 10, no. 4, Nov. 2011, pp. 45:1–45:26.

[14] A. Firuzan, M. Modarressi, M. Daneshtalab, and M. Reshadi, "Reconfigurable Network-on-Chip for 3D Neural Network Accelerators", In Proc. 12th IEEE/ACM Int. Symp. on Networks-on-Chip, Torino, Italy, Oct. 2018, pp. 1–8.


[15] H. Sarbazi-Azad and A. Y. Zomaya, "A Reconfigurable On-Chip Interconnection Network for Large Multicore Systems", in Large Scale Network-Centric Distributed Systems, H. Sarbazi-Azad and A. Y. Zomaya, Eds., Wiley-IEEE Press, 2013.

[16] E. Suvorova, Y. Sheynin, and N. Matveeva, "Reconfigurable NoC development with fault mitigation", In Proc. 18th Conf. of Open Innovations Association and Seminar on Information Security and Protection of Information Technology (FRUCT-ISPIT), St. Petersburg, Russia, April 2016, pp. 335–344.

[17] Y. Chen, K. Ren, and N. Gu, "Router-Shared-Pair-Mesh: A Reconfigurable Fault-Tolerant Network-on-Chip Architecture", Int. J. Embedded Systems, vol. 10, no. 6, 2018, pp. 526–536.


[18] L. Moller, P. Fischer, F. Moraes, L. S. Indrusiak, and M. Glesner, "Improving QoS of Multi-layer Networks-on-Chip with Partial and Dynamic Reconfiguration of Routers", In Proc. Int. Conf. Field Programmable Logic and Applications, Milano, Italy, Aug.–Sept. 2010, pp. 229–233.
[19] M. Fattah, A. Manian, A. Rahimi, and S. Mohammadi, "A High Throughput Low Power FIFO Used for GALS NoC Buffers", In Proc. IEEE CS Annual Symp. on VLSI (ISVLSI), Lixouri, Kefalonia, Greece, July 2010, pp. 333–338.
[20] M. Oveis-Gharan and G. N. Khan, "Dynamic Virtual Channel and Index-based Arbitration based Network on Chip Router Architecture", In Proc. Int. Conf. on High Performance Computing and Simulation, Innsbruck, Austria, July 2016, pp. 96–103.
[21] J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, J. Michelsen, and M. Martins, "Open cell library in 15nm FreePDK technology", In Proc. Int. Symp. on Physical Design, Monterey, California, USA, March–April 2015, pp. 171–178.
[22] M. Said, H. Hassan, H. Kim, and M. Khamis, "A novel power reduction technique using wire multiplexing", In Proc. 30th IEEE Int. System-on-Chip Conf. (SOCC), Munich, Germany, Sept. 2017, pp. 149–152.
[23] H. Wang, L. Peh, and S. Malik, "A technology-aware and energy-oriented topology exploration for on-chip networks", In Proc. Design, Automation and Test in Europe, Munich, Germany, March 2005, pp. 1238–1243.
[24] N. S. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan, "Leakage Current: Moore's Law Meets Static Power", IEEE Computer, vol. 36, no. 12, Dec. 2003, pp. 68–75.
[25] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar, "An 80-tile sub-100-W teraFLOPS processor in 65-nm CMOS", IEEE Journal of Solid-State Circuits, vol. 43, no. 1, Jan. 2008, pp. 29–41.
[26] V. Dumitriu and G. N. Khan, "Throughput-oriented NoC topology generation and analysis for high performance SoCs", IEEE Trans. on VLSI Systems, vol. 17, no. 10, Oct. 2009, pp. 1433–1446.
[27] N. Concer, L. Bononi, M. Soulié, R. Locatelli, and L. P. Carloni, "The Connection-Then-Credit Flow Control Protocol for Heterogeneous Multicore Systems-on-Chip", IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 6, June 2010, pp. 869–882.
[28] W. J. Dally and B. Towles, "Performance Analysis", In Principles and Practices of Interconnection Networks, San Francisco, California, USA: Morgan Kaufmann, 2003.
[29] P. K. Sahu, K. Manna, N. Shah, and S. Chattopadhyay, "Extending Kernighan–Lin partitioning heuristic for application mapping onto Network-on-Chip", Journal of Systems Architecture, vol. 60, no. 7, Aug. 2014, pp. 562–578.
[30] Y. Cheng, C. Lee, and Y. Huang, "Copper Metal for Semiconductor Interconnects", In Noble and Precious Metals: Properties, Nanoscale Effects and Applications, M. S. Seehra and A. D. Bristow, Eds., IntechOpen, 2018.
[31] R. Ho, "On-Chip Wires: Scaling and Efficiency", PhD thesis, Stanford University, California, USA, 2003.

Dr. Masoud Oveis-Gharan received his Bachelor of Engineering in Electrical Engineering (Electronics) from Isfahan University of Technology, Isfahan, Iran, in 1991. He completed his Master of Science in embedded system design and simulation at Ryerson University, Toronto, in 2011, and his PhD, also at Ryerson University, in the area of NoC systems. His research interests include embedded system design and modeling, computer architectures, Systems-on-Chip, NoC system design, and power and performance optimization for Network-on-Chip architectures.

Dr. Gul N. Khan graduated in Electrical Engineering from the University of Engineering and Technology, Lahore, in 1979 and received his M.Sc. in Computer Engineering from Syracuse University in 1982. After working as a research associate at Arizona State University, Tempe, Arizona, he joined Imperial College London, where he completed his Ph.D. in 1989. He joined RMIT University, Melbourne, in 1993 and, in 1997, the computer engineering faculty at Nanyang Technological University, Singapore. He moved to Canada in 2000 and worked as an Associate Professor of computer engineering at the University of Saskatchewan before joining Ryerson University, where he is currently a Professor and program director of computer engineering. His research interests include embedded systems, hardware/software codesign, MPSoC, NoC, fault-tolerant systems, high performance computing, and CPU-GPU based heterogeneous systems. He has published more than 100 articles and papers in peer-reviewed journals and conferences and has edited the book Embedded and Networking Systems, published by Taylor and Francis.
