Scalable and energy efficient wireless inter chip interconnection fabrics using THz-band antennas


Journal Pre-proof Scalable and energy efficient wireless inter chip interconnection fabrics using THz-band antennas Sagar Saxena, Deekshith Shenoy Manur, Naseef Mansoor, Amlan Ganguly

PII: S0743-7315(18)30817-7
DOI: https://doi.org/10.1016/j.jpdc.2020.02.002
Reference: YJPDC 4186

To appear in: J. Parallel Distrib. Comput.

Received date: 6 November 2018
Revised date: 31 October 2019
Accepted date: 3 February 2020

Please cite this article as: S. Saxena, D.S. Manur, N. Mansoor et al., Scalable and energy efficient wireless inter chip interconnection fabrics using THz-band antennas, Journal of Parallel and Distributed Computing (2020), doi: https://doi.org/10.1016/j.jpdc.2020.02.002.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Elsevier Inc. All rights reserved.

Scalable and Energy Efficient Wireless Inter Chip Interconnection Fabrics Using THz-band Antennas

Sagar Saxena a, Deekshith Shenoy Manur a, Naseef Mansoor b and Amlan Ganguly b

a Department of Computer Engineering at Rochester Institute of Technology, Rochester, NY 14623. E-mail: {ss6010, ds7194}@rit.edu
b Golisano College of Computing and Information Science, Rochester Institute of Technology, Rochester, NY 14623. E-mail: {nxm4026, axgeec}@rit.edu

Abstract

Computing platforms ranging from embedded systems to server blades comprise multiple Systems-on-Chip (SoCs). Conventionally, communication between chips in these multichip platforms is realized using high-speed I/O modules over metal traces on a substrate. Due to the high power consumption of I/O modules and the non-scalable pitch of pins or solder bumps, their bandwidth density and power consumption become bottlenecks for multichip systems. Wireless chip-to-chip communication is emerging as an alternative solution to the traditional interconnection challenges of multichip systems. Novel devices based on graphene structures capable of establishing wireless links are explored in recent literature to provide high-bandwidth THz links. In this work, we propose to utilize graphene-based wireless links to enable an energy-efficient, multi-modal chip-to-chip communication protocol and to create toroidal folding based interconnection architectures for multichip systems. With cycle-accurate simulations we demonstrate that such designs can outperform state-of-the-art wireline multichip systems.

Keywords: Multichip System, Wireless Interconnects, Graphene-based Antenna, THz Wireless

1. Introduction

Platform-based computing systems consisting of multiple Systems-on-Chip (SoCs) or multicore processors are needed to support the complexity of modern server or embedded systems. With the increase in computational demands, the number of SoCs or multicore chips in a platform has increased, making modern computing systems more complex [1]. This makes the interconnection fabric in these systems grow in both size and complexity. Therefore, the overall performance of the system depends on the efficiency of the interconnect architectures for both inter-chip and intra-chip communication. Advances in intra-chip communication have addressed the scalability and bandwidth issues by making a transition from bus-based systems to Network-on-Chip (NoC) architectures [2]. However, the performance of inter-chip communication is limited by the traditional interconnect architectures.

Traditional inter-chip interconnections are realized using solder bumps or C4 interconnects, placing individual chips on a substrate or Printed Circuit Board (PCB). Peripheral Component Interconnect (PCI) is one of the most common standard local I/O bus technologies used to interconnect board-level multichip systems. Recently, PCI Express (PCIe) has been presented as a next-generation I/O technology [3]. However, recent trends according to the International Technology Roadmap for Semiconductors (ITRS) predict that the pitch of the I/O interconnects in ICs is not scaling as fast as the gate lengths or pitch of on-chip interconnects [4][5]. This implies a gap in density and performance of traditional I/O systems relative to on-chip interconnections. Moreover, longer and bulkier substrate traces for inter-chip communication, a result of the wiring complexity, further aggravate crosstalk and signal integrity issues. Typically, inter-chip communication involves multihop paths over intra-chip global wires in both source and destination chips, I/O blocks and substrate traces. Often the intra-chip and inter-chip communication protocols also differ to offer design flexibility, but this results in a loss of speed and energy efficiency.

* Corresponding author: Amlan Ganguly ([email protected])

While metallic inter-chip interconnects are not scaling well, research in recent years has brought to light many alternative interconnect solutions, such as inter-chip photonics [6], vertically integrated monolithic 3D ICs [7], silicon interposers [8], inductive and capacitive coupling [9] and inter-chip wireless interconnects [10][11], as solutions to the off-chip interconnection challenges. Recent research envisions wireless communication in the Terahertz band (0.1-10 THz) as a key technology to satisfy the increasing demand for high-speed communication [12]. Wireless data communication links up to several centimeters in length with graphene-based antenna arrays have been demonstrated [12]. Novel transmitting and receiving devices based on micro-scale graphene structures have been investigated in [12]. These wireless interconnections are shown to improve the energy efficiency and bandwidth of on-chip data communication in multicore chips over state-of-the-art counterparts [13]. While many conventional architectures have been proposed in recent times, our study of an antenna array architecture that leverages the properties of graphene-based plasmonic devices is based on [14].

In this work, we propose to use such graphene-based THz-band wireless interconnects to establish a seamless communication backbone which enables data exchange between cores in a single chip as well as in a multichip system. THz-band interconnects can support wider bandwidths and higher data rates than other wireless interconnects in the Ultra-Wide Band (UWB) or millimetre-wave (mm-wave) bands [12]. Moreover, smaller antenna sizes compared to mm-wave antennas can enable on-chip implementations of antenna arrays providing beam-steering capability. Such beam-steering will in turn support novel architectures. We propose and evaluate two interconnection architectures for a multichip system with several multicore processors, based on a network folding approach with graphene antennas. We propose a novel multi-modal medium access mechanism to establish the wireless links between the communicating cores in the multicore system. The same switching protocol is used for both on-chip and inter-chip data communication to provide seamless data exchange. We perform a thorough performance evaluation of the interconnection fabric based on system-level simulations and compare the proposed architectures with multiple traditional wired fabrics based on mesh or concentrated mesh topologies. We show that the use of graphene-based folded architectures can reduce the energy consumption of data transfer over multichip systems compared to their wireline counterparts. Finally, we also compare the graphene-based interconnection fabric with other emerging technologies like inter-chip photonic links and millimeter-wave wireless interconnects.

The rest of the paper is organized as follows. In section 2 we summarize the related works from the focused perspective of research in wireless inter-chip interconnection fabrics and advances in graphene-based on-chip THz links. In section 3 we present the two proposed folded graphene-based wireless multichip interconnections and the rationale behind them. In section 4 we present thorough performance evaluations and comparison with other architectures and technologies, with conclusions in section 5.

2. Related Works

2.1. Wireless multichip Integration Technology

Intra-chip wireless NoCs and chip-to-chip wireless links in the context of sensor networks, Internet-of-Things (IoT) and mobile computing have been studied for a long time, resulting in a rich body of literature over the past decade. However, here we focus our attention on recent research on using wireless interconnects in multichip computing platforms such as blade servers and embedded systems consisting of multiple multicore processors. In [15] transceivers for wireless multichip systems are proposed using high-speed design methodologies. The use of wireless interconnects for processor-memory communication has been investigated in [10][11]. A High-Performance Computing (HPC) environment with mm-wave wireless interconnections for multichip systems has been proposed in [16]. In [17] transceivers for 60GHz inter-chip and intra-chip communications are designed. However, mm-wave wireless interconnects are limited in their bandwidth. Therefore, to improve the bandwidth, novel devices working as transceivers and antennas, such as graphene-based nanostructures, need to be utilized for inter-chip communications in multichip platforms.

2.2. Graphene-based Wireless Interconnection

In this subsection we focus on the recent literature that has used or developed various aspects of graphene-based THz interconnections such as antennas, architectures, Medium Access Control (MAC) protocols and transceivers. On-chip antennas with graphene-based structures are predicted to provide high-bandwidth wireless communication channels [14][18]. Wireless NoCs using graphene-based antennas in the THz frequency channels have been proposed and evaluated [19]. In [20] the authors evaluated MAC protocols for a Wireless NoC (WiNoC) architecture with graphene-based omnidirectional antennas. However, the performance gain for such wireless intra-chip NoCs is limited, as only a single wireless link can be active at any given point in time due to the omnidirectional nature of the antennas.

Much recent literature has discussed different components of THz communication, including graphene-based antennas and wireless channel modelling. In [21] the authors present performance and trade-off results based on accurate 3D models that should be considered according to amplifier gain and output power. In [22] the authors present a holistic view of Terahertz nanocircuits realized with multiple-terminal gated graphene plasmonic channels. They shed light on the key components including graphene interconnects, oscillators, multipliers, phase shifters, filters and on-chip antennas. In [23] the idea of a hybrid wired-wireless architecture is presented, which aims to make the most of the inherent broadcast capabilities of the wireless side by proposing two independent network planes. In [24] the authors highlight the importance of package optimization to ensure the feasibility of the WiNoC approach, as it is capable of reducing path losses by several tens of dB. In [25] the authors study channel modelling, traffic modelling and scalability study methodologies for THz-assisted multicore CPUs, allowing estimation of the number of supported cores in terms of both throughput for a given traffic volume and tolerable access delay under a given MAC protocol.

Recently, directional arrays of graphene antennas have been explored in [14]. The electrical tunability of graphene can be utilized to create frequency-tunable devices that can generate, modulate and transmit THz waves [26][27]. Furthermore, graphene supports the propagation of Surface Plasmon Polariton (SPP) waves, which are highly confined EM waves at the interface of a conductor and a dielectric [28][29]. The authors in [30] and [31] designed and studied a plasmonic phase modulator by exploiting the electrical tunability of graphene to modulate the SPP wave without the need for subharmonic mixers. In [32] and [33], the authors proposed a graphene-based nano-antenna which acts as a lossy waveguide to effectively launch SPP waves into free space. In [14], the authors studied the beamforming capabilities of a novel plasmonic array architecture. They presented and investigated the performance of a single front end, composed of a port acting as the power source, a plasmonic modulator and the plasmonic antenna. This will enable novel architectures and communication mechanisms using THz graphene antennas that utilize space division multiplexing of links, potentially improving interconnect performance. In this work, we propose a hybrid interconnection fabric for multichip systems using both on-chip wired links for intra-chip communication and low-power, high-bandwidth THz-band directional arrays of graphene antennas for inter-chip communication.

3. Graphene-based Wireless Interconnection Framework

In this section, we describe the topology, physical layer and communication protocols of the proposed graphene-based folded multichip interconnection fabric.

3.1. Topology

We will discuss two approaches for the proposed graphene-based wireless interconnection topologies. Adopting a folding approach for the topology design, such as a toroidal structure, generally improves its throughput and latency compared to a regular mesh-based structure [2][34]. Therefore, in this proposed topology, we explore the possibility of a folded topology for the multichip system consisting of several multicore chips.

3.1.1. TF Wireless Approach

The proposed wireless interconnection fabric is shown in Fig. 1. Cores within each individual chip are interconnected by an intra-chip NoC. The intra-chip NoC can be of any architecture, such as a regular tile-based mesh or an irregular custom design, depending upon applications and design trade-offs. However, in this work the topology of the intra-chip NoCs is chosen to be a mesh, as it is used in several multicore-based products [2] and is relatively easy to design, verify and manufacture. To utilize the benefit of a regular NoC structure while alleviating the issues of wireline inter-chip links, we equip certain NoC switches in the multicore chips with wireless transceivers to create toroidal links folding the whole fabric. To create these inter-chip wireless interconnects enabling a folding approach, we propose the use of high-bandwidth, directional point-to-point graphene antenna arrays that can directly connect distant switches in the multichip system with a single hop. These arrays are shown to be highly directional, as discussed later in section 3.2. This avoids multi-hop data transfer over long wired paths through intra-chip NoCs and over substrate traces or silicon interposer links, which lead to high latencies and energy dissipation.

The folding strategy to form the multichip system interconnection fabric can be understood as follows. First, the architecture is folded along the diagonals, which results in 2 different diagonal modes of communication, D1 and D2, between switches at the diagonally opposite corners of the chips as shown in Fig. 1(a). D1 is the mode between top-left and bottom-right. D2 is the mode between top-right and bottom-left. After folding the diagonally opposite corners, the opposite edges are folded by equipping switches at the edges in each chip with graphene array-based transceivers to form the vertical and horizontal modes, shown as V and H respectively in Fig. 1(a). Next, we also enable the folding of adjacent edges by equipping selected switches along the edges of the fabric with graphene antenna arrays that are directed towards each other to augment the diagonal modes of communication D1 and D2. In this way, folding along all 4 edges results in 4 modes of communication, namely H, V, D1 and D2. In general, this folding approach can be extended to an NxN array of chips, where N is any positive integer.

Only a few switches in the multichip system, those that help in folding the interconnection architecture, are selected and equipped with graphene transceivers. This deployment is done in such a way that multiple links for a single mode of communication can be established without interfering with each other. It also allows concurrent communication in the same THz frequency band among graphene antenna arrays operating in the same mode. Therefore, transceivers that operate in the same mode are not deployed in adjacent switches. This prevents multiple transceivers of the same mode from falling within the main lobe of the radiation pattern of a particular antenna array. This is discussed in detail in section 3.2. Thus, all links of the same mode are able to operate concurrently. The communication protocol for this Toroidal Folding (TF) approach in different modes is discussed in section 3.4.1.

Fig. 1. 4-chip wireless link deployment using (a) TF approach and (b) AW approach.

3.1.2. All Wireless Approach

In this subsection, we propose another topology for the graphene-based wireless multichip system. While the above-mentioned TF approach drastically reduces the network diameter and the average distance between nodes, most switches do not have direct access to the wireless links. Therefore, as an alternative to the TF fabric, in the All Wireless (AW) topology we propose to deploy graphene-based wireless transceivers in all switches of the multichip interconnection fabric. Although all switches in this topology are equipped with wireless antennas, it is not an all-to-all network because of the directional graphene antenna arrays used and the adopted communication protocol discussed in the next subsections. To create the AW fabric, we adopt the same mesh topology for the intra-chip NoCs as in section 3.1.1. Each intra-chip mesh is then divided into clusters of 4 switches. Each of these switches operates in only one mode (D1, D2, V or H) as discussed in section 3.1.1. A switch in any mode communicates only with another switch in the same mode using directional point-to-point graphene arrays. In order to avoid interference between switches operating in the same mode we adopt Frequency Division Multiple Access (FDMA). To enable FDMA, adjacent clusters within a chip are equipped with graphene arrays tuned to different frequencies, as opposed to spatially separating adjacent wireless transceivers as in the previous topology. To enable chip-to-chip wireless communication we map clusters in a chip to the corresponding clusters in other chips with the same frequency, as shown in Fig. 1(b). As can be seen, the AW topology resembles a folded-torus topology where the folding is realized with point-to-point wireless links. However, unlike a traditional folded torus, this architecture implements folding even in the diagonal directions.
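As an illustration only, the choice of wireless mode for a pair of chips in a 4-chip system could be sketched as follows. The 2x2 grid coordinates, the function name `mode_between`, and the assumption that chips in the same row use H while chips in the same column use V are hypothetical choices made for this sketch; only the definitions of D1 (top-left to bottom-right) and D2 (top-right to bottom-left) are taken from section 3.1.1.

```python
# Hypothetical 2x2 chip grid: a chip is (row, col), with (0, 0) at the top-left.
VALID_CHIPS = {(0, 0), (0, 1), (1, 0), (1, 1)}


def mode_between(src_chip, dst_chip):
    """Return the wireless mode assumed to link two chips, or None if intra-chip."""
    if src_chip not in VALID_CHIPS or dst_chip not in VALID_CHIPS:
        raise ValueError("chips must lie on the 2x2 grid")
    if src_chip == dst_chip:
        return None  # same chip: traffic stays on the wired intra-chip mesh
    (r1, c1), (r2, c2) = src_chip, dst_chip
    if r1 == r2:
        return "H"  # assumption: same row uses the horizontal mode
    if c1 == c2:
        return "V"  # assumption: same column uses the vertical mode
    # Diagonal neighbours: D1 joins top-left and bottom-right,
    # D2 joins top-right and bottom-left (section 3.1.1).
    return "D1" if (r1 - r2) == (c1 - c2) else "D2"


if __name__ == "__main__":
    for dst in sorted(VALID_CHIPS):
        print((0, 0), "->", dst, mode_between((0, 0), dst))
```

In the AW fabric this lookup would select which switch of the source cluster a packet is steered to; the frequency assigned to the cluster is a separate, orthogonal choice that the sketch does not model.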

3.2 Physical Layer Choices

Intra-chip communication happens over the wireline NoCs. On-chip wireline links are realized with traditional global-wire based interconnects depending on the adopted mesh topology as discussed in section 3.1.1. The physical layer of the TF and AW WiNoCs is generally considered to consist of the on-chip embedded antennas and the transmitter and receiver (transceiver) circuitry. It is a common misconception that wireless interconnects may not be able to sustain the high bandwidth required for on-chip data transfer. The bandwidth of a wireless link is a fraction of the carrier frequency, generally varying between one-tenth and one-fourth. The bandwidth of on-chip wireless links needs to be on the order of multiple gigabits per second to provide the data rates required for core-to-core data transfer in NoCs. Therefore, the carrier frequency needs to be in the tens to hundreds of GHz, taking us into the THz bands. The physical layer designs correspond to this principle.

3.2.1 On-chip Antennas Adopted for the Proposed Architectures

On-chip antennas are a deciding factor in link budget management. In case of a low power budget, directional antennas or phased arrays are recommended because of the high path loss in the mm-wave spectrum. Fixed-beam log-periodic antennas have also been investigated in the context of on-chip wireless links [35]. Many recent works explore THz antennas using graphene structures [21][22][23][24]. However, to enable the proposed architectures an antenna array capable of beam-steering is necessary. Therefore, in this work, we adopt the beam-steering graphene antenna array from [14]. The proposed plasmonic array architecture differs from conventional array architectures in several ways. It leverages the unique plasmonic properties of the devices to greatly simplify the array design. Each radiating element of the array consists of the source, the modulator and the antenna. The plasmonic modulator can directly modulate THz frequencies without the need for sub-harmonic mixers. The nano-antenna is designed for resonance with the SPP wavelength and thus is much smaller than a regular patch antenna for the same frequency. Since the mutual coupling of the radiating elements in this plasmonic array depends on the SPP wavelength and not the free-space wavelength [14], these small nano-antennas can be packed into a very dense array. Thus, with a very small area footprint, the power output and beamforming capabilities of the plasmonic array can outperform a similarly designed conventional array. The gain or directivity of an antenna is the ratio of the radiation intensity in a given direction to the radiation intensity averaged over all directions, and the beam width is normally measured at the half-power or -3 dB point of the main lobe. The width of the radiation cone, D, is given by

$$D = 2R \tan\theta_{1/2} \tag{1}$$

where $R$ is the distance between the transmitter and receiver and $\theta_{1/2}$ is the half angle of the main lobe, which is roughly 15.2° in this case. Based on (1) the width of the main lobe is around 25.6mm for the longest link in the D1 mode in the TF architecture as shown in Fig. 1(a). The cone width for the D2 mode is the same, whereas the maximum cone width for the H and V modes is 8.5mm. This limits the number of receivers operating in the same mode (H or V) along the same edge within any single chip to 3, as in that case they can be 10mm apart (one at one of the centre switches and two at the two corners). Similarly, in case of the diagonal modes (D1 and D2), the maximum number of receivers avoiding interference in a single chip is also 3. In the AW wireless fabric, multiple transceivers of a single color and mode are not present in the same chip. The transceivers of the same color and mode in different chips are farther apart than the main lobe width even for the longest link, as shown in Fig. 1(b), thus avoiding interference in communication. The maximum number of graphene transceivers on a single chip can change if the die sizes are different. Moreover, sharper main lobes will enable denser link deployment with higher performance gains, but that might require larger antenna arrays.
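The cone-width relation in (1) can be checked numerically with a minimal sketch such as the one below. The function names are hypothetical, and the link distances returned by `link_distance_mm` are back-computed from the quoted cone widths rather than taken from the paper. The spacing check follows the criterion stated above, that same-mode transceivers should be farther apart than the main lobe width.

```python
import math

HALF_ANGLE_DEG = 15.2  # half angle of the main lobe quoted in the text


def cone_width_mm(distance_mm, half_angle_deg=HALF_ANGLE_DEG):
    """Width D of the radiation cone at link distance R, following Eq. (1)."""
    return 2.0 * distance_mm * math.tan(math.radians(half_angle_deg))


def link_distance_mm(width_mm, half_angle_deg=HALF_ANGLE_DEG):
    """Invert Eq. (1): link distance R at which the cone is width_mm wide."""
    return width_mm / (2.0 * math.tan(math.radians(half_angle_deg)))


def clear_of_main_lobe(spacing_mm, distance_mm, half_angle_deg=HALF_ANGLE_DEG):
    """True if two same-mode receivers are spaced farther apart than the cone width."""
    return spacing_mm > cone_width_mm(distance_mm, half_angle_deg)


if __name__ == "__main__":
    # Back-computed (not quoted) link lengths for the two cone widths in the text.
    for width in (25.6, 8.5):
        print(f"D = {width} mm  ->  R = {link_distance_mm(width):.1f} mm")
    # Receivers 10 mm apart on an H/V link whose cone is 8.5 mm wide.
    print(clear_of_main_lobe(10.0, link_distance_mm(8.5)))
```

Under these assumptions, the 10mm receiver spacing mentioned above exceeds the 8.5mm cone width of the H and V modes, which is consistent with the claim that three receivers per edge can coexist.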

It is observed that the path loss at certain frequencies, such as 1.21 THz, 1.28 THz and 1.45 THz, is very high due to the molecular absorption attenuation caused by the isotopologues of gases with different absorption coefficients at various frequencies [36]. However, it is known that the loss due to molecular absorption is almost negligible for distances below 10cm for frequencies below 10THz [36]. Therefore, these graphene-based antenna arrays could be deployed for designing a flexible and scalable multichip interconnection fabric where the dimensions of the system are less than one meter.

We envision the use of a quilt packaging system in which the package cover over each chip can be patterned to create a cavity over each antenna array. Such packaging systems allow wireless communication between chips with low insertion loss and hence provide higher performance compared to other conventional packaging systems [37]. This will enable the antennas to communicate through air. The propagation of the THz-band wireless channel is better understood and analysed in free space or air than in any other medium such as silicon [37]. Therefore, we envision quilt packaging to enable propagation through air for THz communication between chips. This lets us use the channel model for propagation through air to estimate the required link budget and power consumption in section 4.2.
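To give a feel for the magnitude of propagation loss through air, the sketch below evaluates the standard free-space (Friis) spreading loss. This is an assumption made for illustration and not necessarily the channel model used in section 4.2; it ignores molecular absorption, which the text above describes as almost negligible below 10cm. The function names, the 1 THz carrier and the example distances are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def free_space_path_loss_db(distance_m, frequency_hz):
    """Friis free-space spreading loss in dB: 20*log10(4*pi*d*f/c)."""
    ratio = 4.0 * math.pi * distance_m * frequency_hz / SPEED_OF_LIGHT_M_PER_S
    return 20.0 * math.log10(ratio)


def required_tx_power_dbm(rx_sensitivity_dbm, distance_m, frequency_hz,
                          tx_gain_dbi=0.0, rx_gain_dbi=0.0):
    """Transmit power that just meets a receiver sensitivity (all values hypothetical)."""
    loss_db = free_space_path_loss_db(distance_m, frequency_hz)
    return rx_sensitivity_dbm + loss_db - tx_gain_dbi - rx_gain_dbi


if __name__ == "__main__":
    carrier_hz = 1e12  # assumed 1 THz carrier
    for distance_mm in (10, 50, 100):
        loss = free_space_path_loss_db(distance_mm / 1000.0, carrier_hz)
        print(f"{distance_mm:>4} mm: {loss:.1f} dB")
```

The sketch shows why directional arrays matter at these frequencies: any antenna gain at the transmitter or receiver subtracts directly from the transmit power needed to close the link.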

3.3 Deadlock Free Flow Control and Routing

The routing protocol for the proposed multichip system is a seamless intra-chip and inter-chip data communication mechanism. Wormhole switching has been adopted for both the wired and the wireless links in the multichip system, where data packets are broken down into flow control units or flits [42] over Virtual Channel (VC) based switches. The main advantage of this kind of switching is that it reduces the buffer requirements at the switches because, unlike packet switching, the whole packet is not forwarded as a unit, so the switches consume less power with lower area overheads. All these switches have their own unique addresses and bidirectional ports for all the links that are attached to them. As the directional wireless links are point-to-point, the integrity of wormhole switching is maintained even if partial packets are being transmitted.

As the overall system is not a regular network, we adopt shortest path routing to optimize network performance. We use forwarding-table based routing over pre-computed shortest paths determined by Dijkstra’s algorithm. Dijkstra’s algorithm extracts a Minimum Spanning Tree (MST), which provides the shortest path between any pair of nodes in a graph. The exact MST depends on the chosen start node for the algorithm, but the length of the path between any particular pair along the tree does not depend on the start node. Hence, the start node is chosen randomly from among all the switches in the system. However, for a specific start node the shortest path along the extracted tree is always unique, as the minimum spanning tree is inherently free of loops. A sufficient condition for deadlock freedom according to Duato’s theorem [42] is that routing paths do not have cyclic dependencies among the routing channels. Consequently, deadlock is avoided by transferring flits over the shortest path along the MST extracted by Dijkstra’s algorithm, as it is inherently free of cyclic dependencies.
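A minimal sketch of how such forwarding tables could be pre-computed with Dijkstra's algorithm is given below. It is an illustration under stated assumptions rather than the authors' implementation: the sketch runs the algorithm once per destination and stores, at every switch, only the next hop along the tree rooted at that destination, whereas the text describes a single tree extracted from one start node. The five-switch topology, the unit link costs and all function names are hypothetical, and the sketch does not model virtual channels or the deadlock argument.

```python
import heapq


def shortest_path_parents(graph, root):
    """Dijkstra from root; parent[v] is the next node from v on a shortest path to root."""
    dist = {root: 0}
    parent = {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, weight in graph[u].items():
            nd = d + weight
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent


def build_forwarding_tables(graph):
    """tables[switch][destination] -> neighbouring switch to forward the header flit to."""
    tables = {switch: {} for switch in graph}
    for destination in graph:
        parent = shortest_path_parents(graph, destination)
        for switch, next_hop in parent.items():
            if next_hop is not None:
                tables[switch][destination] = next_hop
    return tables


def route(tables, source, destination):
    """Follow the local next-hop entries switch by switch, as a header flit would."""
    path = [source]
    while path[-1] != destination:
        path.append(tables[path[-1]][destination])
    return path


# Hypothetical fabric: a wired chain s0-s1-s2-s3-s4 plus one wireless link s0-s4.
# The graph is undirected with unit cost per hop.
EXAMPLE_GRAPH = {
    "s0": {"s1": 1, "s4": 1},
    "s1": {"s0": 1, "s2": 1},
    "s2": {"s1": 1, "s3": 1},
    "s3": {"s2": 1, "s4": 1},
    "s4": {"s3": 1, "s0": 1},
}

if __name__ == "__main__":
    tables = build_forwarding_tables(EXAMPLE_GRAPH)
    print(route(tables, "s0", "s3"))
```

Because each switch holds only a next-hop entry per destination, the per-switch state grows with the number of destinations rather than with the number of paths, which matches the local forwarding information described in this section.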

As Dijkstra’s algorithm is compute-intensive, the MST is pre-computed at design time. The adopted wormhole switching further supports shortest path routing over the MST, as only the header flit is routed to the next switch on the path to the final destination using the pre-computed routing table. The body flits simply follow the path laid by the header flit into the reserved VC. So, each switch has local forwarding information rather than global routing information, making the routing logic scalable with size. For the TF architecture the shortest path between cores in different chips involves traveling to the nearest graphene-enabled wireless node, then reaching the corresponding core in the same mode at the destination chip, and finally being routed over the intra-chip mesh to the final destination. In case of the AW architecture each cluster in a chip has at least one wireless mode connected to the corresponding cluster in another chip. Therefore, inter-chip data needs to be routed to the core operating in the specific mode depending upon the destination chip, from where it is routed over the wireless link. On reaching the destination cluster in the destination chip, the packet is routed to the final destination core over the intra-chip mesh links. On the other hand, if the source and destination are in the same chip, packets are routed using the intra-chip wireline mesh for both TF and AW architectures.

3.4 Wireless Communication Protocol and Transceiver

In this section we discuss two different communication protocols for the two different topologies. Several wireless channel access mechanisms tailored for wireless interconnections in NoC environments are known [20][43]. In mm-wave interconnects wireless bandwidth is limited by the state-of-the-art transceiver design and on-chip antenna technology. To improve performance, multiple wireless transceivers need to access the wireless medium to communicate via the energy-efficient high bandwidth wireless interconnects. Consequently, for both proposed approaches we adopt a channel access mechanism that is suitable for the 4 modes of communication in the multichip system.

3.4.1 Multi-Modal Communication Protocol

As discussed in section 3.1.1, switches are equipped with directional graphene antennas in such a way that they enable four modes of wireless communication, namely Horizontal (H), Vertical (V), Diagonal1 (D1) and Diagonal2 (D2). As all the graphene antennas operate in the same frequency band and transceivers operating in different modes are located very close together in adjacent tiles (closer than the minimum distance for acceptably low interference), only one of the four modes is enabled at a time to avoid interference among nearby transceivers operating in different modes. So, communication happens in four phases, H, V, D1 and D2, as shown in Fig. 2. Each phase is further divided into 2 sub-phases that enable half-duplex communication between any pair using the same physical wireless channel. This is denoted by opposing arrows in Fig. 3. The duration of each phase also plays an important role in the overall performance of the whole system, so the phase duration has been optimized for best performance. The results of this optimization are presented in section 4.1. Wake signals created by a simple counter are used to enable the transceivers in their respective phases. Therefore, a combination of separation in both space (Space Division Multiple Access) and time (Time Division Multiple Access) enables the multi-modal communication in the proposed graphene-enabled multichip system.

Fig. 2. Multi-modal communication protocol.

3.4.2 FDMA based Communication Protocol

The FDMA-based communication protocol is designed to enable simultaneous transmission among cores across different clusters in the AW wireless architecture. In this work, we adopt a multi-modal protocol as discussed in section 3.4.1 with four modes of communication. In addition to the four modes for the switches in each cluster, different clusters in the same chip are tuned to different frequencies. So, clusters at different frequencies communicate concurrently in different THz bands. Only clusters at the same frequency are connected directly across chips using at least one of the 4 modes.

4. Simulation Results

In this section, we evaluate the performance and energy efficiency of the wireless intra- and inter-chip interconnection fabric using a cycle-accurate simulator. We compare the wireless-interconnect-based multichip system with its wireline counterparts using both synthetic and application-specific traffic patterns. The simulator characterizes the multichip architecture and models the progress of the flits over the switches and links per cycle, accounting for flits that reach the destination as well as those that are stalled. Ten thousand iterations were performed for the synthetic traffic patterns, eliminating transients in the first thousand iterations. For application-specific traffic, each kernel is run to completion.

In our experiments, we consider each core to be connected to a three-stage pipelined network switch adopted from [34]. The switches are connected to other switches according to the proposed architecture. We consider each input and output port of a switch to have 8 VCs with a buffer depth of 4 flits for all the architectures considered in this paper. We consider a representative maximum packet size of 16 flits with a flit size of 32 bits in our experiments unless otherwise mentioned. The configuration of the components used for the simulation is presented in Table 1. The channel capacity of THz bands is shown to be more than 4Tbps for distances of 0.1mm [44]. However, THz transceivers such as in [45] are able to exploit data rates of around 100Gbps. Also, the maximum data rate on the wireless links is conservatively assumed to be 1/10th of the carrier frequency in on-chip wireless communications [46]. Based on these studies, a conservative wireless bandwidth of 100Gbps is assumed for this work. The power consumption of the wireless links is estimated from a link budget analysis as discussed in Section 4.2. The network switches are synthesized from an RTL-level design using 65nm standard cell libraries from Chip MultiProjects (http://cmp.imag.fr), using Synopsys. The delay and energy dissipation on the intra-chip wireline links, on the other hand, are obtained through Cadence simulations considering the specific length of each link based on the established mesh topology in the individual chips, assuming 20mm x 20mm dies. The delay and power dissipation, including both dynamic and static power consumption, of all these components of the interconnection fabric are then incorporated in the cycle-accurate simulator as shown in Fig. 3 to evaluate the performance and energy efficiency of the different interconnect systems. Fig. 3 shows examples of tools that may be used in a simulation framework such as the one used here. Industry-standard tool suites are used in characterizing component-level designs while reputed methodologies are used for system-level simulations in our adopted flow. All the digital components are driven by a 2.5GHz clock and a 1V power supply, which are the nominal frequency and voltage in the 65nm technology node. In the mesh-based intra-chip NoCs, all wired links are considered to be single-cycle links. First, we discuss the optimization of the phase duration using this simulation platform, followed by the performance evaluation in the subsequent subsections.

TABLE 1 Component Configuration for Simulation

Component          Configuration
NoC Router         3 stage pipelined, 5 ports (except wireless), 0.078pJ/bit/port
Total VC           8, each 4 flits deep
Flit width         32 bits
Wired NoC links    64-bit flits, single cycle latency, 0.2pJ/bit/mm
Graphene links     100Gbps, 1.176fJ/bit
Technology node    65nm, 1V supply, 2.5GHz system clock

Fig. 3. Representative simulation workflow.

4.1. Optimization of Phase Duration

As discussed in Section 3.4.1, the overall communication in the system occurs in phases, and only one mode is enabled at a time to avoid interference. Here we optimize the total duration of communication in both directions in a particular phase to obtain the best performance in terms of data bandwidth of the multichip system. Through system-level simulations, the performance has been analyzed in terms of peak achievable bandwidth per core at network saturation as a function of phase duration, using uniform random traffic for systems with a single chip, 4 chips and 9 chips. The peak achievable bandwidth per core is measured as the maximum sustainable data rate, in bits successfully routed per core per second, at network saturation. Longer phase durations give each tile equipped with graphene antennas longer access to the wireless channel, potentially improving performance. However, increasing the phase duration also increases the interval between two consecutive channel accesses by a particular wireless node in a specific mode, which eventually degrades performance. Fig. 4 shows the peak achievable bandwidth per core for different phase durations and for systems with different numbers of chips. Each chip is considered to have 64 cores. It can be seen from Fig. 4(a) that the peak achievable bandwidth of the wireless system using the TF approach is maximum for a phase duration of 20 cycles for all system sizes considered here. Fig. 4(b) shows the peak achievable bandwidth per core of the wireless system using the AW approach. The bandwidth is maximum for a phase duration of 30 cycles for a 1-chip system and 35 cycles for a 4-chip system. These optimal values are used in the simulations to evaluate the system performance in the following sections.

Fig. 4. Peak achievable bandwidth per core as a function of phase duration in (a) TF Wireless approach and (b) AW Wireless approach.
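To make the phase-based protocol whose duration is tuned here more concrete, the following minimal Python sketch maps a clock cycle to the active mode and half-duplex sub-phase. It is an illustration under stated assumptions, not the simulator used in this work: the names `active_slot` and `revisit_interval` are hypothetical, the phases are assumed to have equal length and to repeat in the order H, V, D1, D2, and each phase is assumed to split into two equal sub-phases.

```python
from dataclasses import dataclass

MODES = ("H", "V", "D1", "D2")  # the four communication modes named in the text


@dataclass(frozen=True)
class Slot:
    mode: str       # antenna mode that is awake during this cycle
    direction: int  # 0 or 1: which half-duplex sub-phase is active


def active_slot(cycle: int, phase_cycles: int) -> Slot:
    """Return the mode and sub-phase enabled at a given clock cycle.

    Assumes a free-running counter, equal-length phases, and two
    equal-length sub-phases per phase (illustrative assumptions).
    """
    if phase_cycles < 2 or phase_cycles % 2:
        raise ValueError("phase_cycles must be an even number >= 2")
    frame = phase_cycles * len(MODES)       # one full H, V, D1, D2 round
    offset = cycle % frame                  # position inside the round
    mode = MODES[offset // phase_cycles]
    direction = (offset % phase_cycles) // (phase_cycles // 2)
    return Slot(mode, direction)


def revisit_interval(phase_cycles: int) -> int:
    """Cycles between two consecutive starts of the same mode."""
    return phase_cycles * len(MODES)
```

Under these assumptions, a longer phase gives each mode more contiguous channel time, while `revisit_interval` grows in proportion, which is the trade-off explored in Fig. 4.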

4.2 Energy Estimation for proposed Graphene-based inter-chip Wireless Links

In this subsection, we estimate the energy consumption of the graphene-based inter-chip wireless links used in our proposed multichip architecture. To reduce the transmitted energy, we consider using a directional array of graphene antennas. For a distance of 10 cm at 1 THz, the path loss is shown to be around 65dB [36]. A gain of 10dB can be achieved while operating at 1.05 THz. This is in contrast with other types of antenna arrays, which require phase shifters; the graphene array is therefore easier to operate. The high gain makes the antenna array highly directional, which supports our architectures, since they require directional wireless links to enable the folding-based interconnection fabrics. The S11 parameter, which is the reflection coefficient of the antenna, should be less than -10dB for acceptable performance [14]. Based on modelling estimates of the antenna design in [14], the reflection coefficient is -21.3dB. We develop our estimate of the energy consumption per bit over the THz channel using the graphene-based transmitters and receivers, based on the path loss model developed in [12] and considering the reflection coefficient [14]. We calculate Ptx, the power consumed by the transmitter, from the following condition:

$$\left(P_{tx} + (1 - S_{11})^{2} - \text{path loss}\right) - \text{noise power (dBm)} > 20\,\text{dB} \qquad (2)$$

Ptx is approximately -27dBm, which is about 2 microwatts. A signal-to-noise ratio (SNR) of 20dB is assumed for our calculations as it provides a BER of less than $10^{-9}$ with the non-coherent OOK modulation adopted in the graphene-based transmitters. The noise power considered is primarily due to the thermal noise in the channel and can be calculated by

$$N_T = kTB \qquad (3)$$

where k is the Boltzmann constant, T is the absolute temperature and B is the bandwidth.
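As a small numerical aid for the quantities above, the sketch below converts between dBm and watts and evaluates the thermal noise expression in (3). The temperature (300 K) and noise bandwidth (100 GHz) are placeholder assumptions chosen only for illustration; they are not stated in this work, and the sketch does not attempt to reproduce the full link budget.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def dbm_to_watts(p_dbm: float) -> float:
    """Convert a power level in dBm to watts."""
    return 10 ** (p_dbm / 10) * 1e-3


def watts_to_dbm(p_watts: float) -> float:
    """Convert a power level in watts to dBm."""
    return 10 * math.log10(p_watts / 1e-3)


def thermal_noise_watts(temperature_k: float, bandwidth_hz: float) -> float:
    """Thermal noise power N_T = k * T * B, as in equation (3)."""
    return BOLTZMANN * temperature_k * bandwidth_hz


# Transmit power quoted in the text: -27 dBm is close to 2 microwatts.
p_tx = dbm_to_watts(-27.0)
print(f"P_tx = {p_tx * 1e6:.2f} uW")

# Assumed (hypothetical) temperature and bandwidth, for illustration only.
noise = thermal_noise_watts(temperature_k=300.0, bandwidth_hz=100e9)
print(f"N_T  = {watts_to_dbm(noise):.1f} dBm")
```

The first conversion confirms the stated equivalence between -27dBm and roughly 2 microwatts; the noise figure depends entirely on the assumed temperature and bandwidth.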

The energy required to transmit a bit from a transmitter to a receiver over a wireless link is defined as the energy per bit, $E_{bit}$, and is given by

$$E_{bit} = P_{tx} T \qquad (4)$$

where T is the bit duration times the antenna efficiency. Using (2)-(4), the $E_{bit}$ for a graphene-based wireless link is found to be 1.176fJ/bit for a system with a physical bandwidth, or data rate, of 100 Gbps. In addition, the laser source and the transmitter and receiver will consume power. This energy consumption per bit is incorporated in the simulator to estimate the average packet energy of the wireless inter- and intra-chip fabric proposed in this paper.

4.3 Performance Evaluation with Synthetic Traffic

In this subsection, the performance of the proposed multichip system at different sizes, in terms of peak achievable bandwidth per core, average packet energy consumption and average packet latency, is compared with that of the wired architectures. Average packet energy is the energy consumed, on average, to transfer an entire packet from source to destination in the multichip system. In our experiments, the 20mm x 20mm chips are considered either alone or arranged in a 2x2 or 3x3 array for the 1-, 4- and 9-chip cases respectively. The TF architecture for the 1-chip system is designed by following the same principles as in the case of the multichip system, but the folding is done along the corners and edges of the same chip. In the 1-chip AW architecture, the chip is divided into 4 quadrants and the quadrants are interconnected like the chips in the 4-chip system. For the purpose of comparison, two wired counterparts for the multichip system are considered. In the first wired system, an overall mesh topology is adopted. The intra-chip NoC in each chip is a regular mesh. For the wireline multichip interconnections, the intra-chip NoCs are extended through a silicon interposer by connecting switches along the boundaries of neighbouring chips. We have considered an interposer-based wired system as it is shown to be suitable for extending the NoC across multiple chips and to outperform traditional substrate-based wired systems [8]. It can be argued that the different topologies considered in this paper can have different bandwidths and hence different results. Therefore, the underlying wireline mesh topology of the intra-chip NoC is kept the same in both the TF and AW architectures. This makes the comparison with the conventional wireline mesh a direct comparison showing the benefits of the THz interconnects only. The distance between neighbouring cores in adjacent chips is considered to be the same as the inter-core distance within each chip.
Therefore, due to the use of on-die metal wires in the interposer, the delay and power consumption characteristics are assumed to be like those of the on-chip wired NoC links. The second wireline architecture chosen is a Folded Torus topology, which is an extension of the Torus architecture. In the Folded Torus architecture, each switch is connected to every alternate switch in both the horizontal and vertical directions. Hence, the links are essentially arranged in a folded manner to yield equal link lengths. As in the case of the mesh, in the folded torus wired multichip system the links are extended via the interposer.

Fig. 5 shows the peak achievable bandwidth per core and average packet energy for the wireless multichip fabric using the TF and AW approaches, as well as for the completely wired interconnections forming a mesh and a folded torus, at network saturation using uniform random traffic for systems of different sizes. Peak bandwidth per core for the wired folded torus architecture is better than for the wired mesh system for all system sizes. This is because the links are arranged in a folded manner to yield better connectivity, resulting in lower latency, higher throughput and lower energy than the wired mesh architecture. It can be seen from the figure that the peak bandwidth per core for the wired systems decreases steadily with increasing system size, along with a large increase in average packet energy (shown in log scale). This is because, as the system grows, the average path length between source and destination cores increases, resulting in longer multi-hop communication as seen later in Table 2. In terms of peak achievable bandwidth per core, the AW wireless architecture performs better than both wired counterparts. This is because in the AW topology switches are connected in an alternating manner, which resembles the folded torus topology. However, the high-speed graphene-based links result in faster data transfer, giving the AW wireless architecture higher bandwidth than the folded torus wired architecture. For the single-chip case, AW wireless outperforms the wireline mesh and folded torus by around 1.40x and 1.25x respectively. For a 4-chip system, AW wireless is 1.33x better than the mesh-based wireline system and 1.10x better than the wireline folded torus based multichip system.

The TF wireless architecture outperforms all the architectures considered in this work. This is because of the edge-to-edge long-range graphene-based wireless links that fold the network, thereby effectively reducing the network diameter by about half. For the single-chip case, the TF wireless architecture outperforms the wireline mesh by 1.55x and the wireline folded torus by 1.38x in terms of bandwidth. For a 4-chip system, the TF wireless system is 1.60x better than the wireline mesh based multichip system and 1.30x better than the wireline folded torus based multichip system. For a 9-chip system, the TF wireless system is 4.74x and 2.27x better than the wireline mesh and wireline folded torus based multichip systems respectively.

Fig. 5. Performance evaluation of systems with different numbers of chips.

The average packet energy dissipation for all system sizes, including the single-chip case, is lower for the wireless systems than for all the wired multichip systems. The average packet energy in the wirelessly connected system does not increase as drastically as in the wired system as the number of chips increases. This is due to the direct, energy-efficient, low-power one-hop wireless links between distant cores in the multicore chip. The advantage is more significant in multichip systems, between cores embedded in different chips, due to the folding topology made possible by THz wireless links. In contrast, multi-hop communication over metal interconnects restricts the potential performance of systems with wired interconnects. Moreover, folding a large network with wired links would not improve the performance as much, because the latency of long wired links increases significantly. In addition, in both the proposed TF and AW wireless architectures, multiple graphene-based links can operate simultaneously without interference due to the directional nature of the antennas and the multi-modal communication protocol, which results in higher performance than that of the wired architectures.

TABLE 2 AVERAGE HOP COUNT FOR DIFFERENT ARCHITECTURES

System size   Wired Mesh   Folded Torus   AW Wireless   TF Wireless
1-chip        5.33         4.06           3.70          3.32
4-chip        10.66        8.03           6.32          5.03
9-chip        16.32        12.27          NA            8.14
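The wired entries of Table 2 for the 1-chip and 4-chip systems can be cross-checked with a short brute-force calculation. The sketch below is an independent illustration, not the method used to produce the table. It assumes an 8x8 grid of switches per 64-core chip (16x16 for four chips), shortest-path routing, a uniform average over all ordered pairs of distinct switches, and that a folded torus has the same logical connectivity as an ordinary torus, so only link lengths differ. The function name `mean_hops` is hypothetical.

```python
from itertools import product


def mean_hops(k: int, wraparound: bool) -> float:
    """Average minimal hop count over all ordered pairs of distinct
    switches in a k x k grid (a mesh, or a torus if wraparound is True)."""

    def axis_dist(a: int, b: int) -> int:
        d = abs(a - b)
        # With wraparound links, the shorter way around the ring is taken.
        return min(d, k - d) if wraparound else d

    nodes = list(product(range(k), repeat=2))
    total = sum(
        axis_dist(sx, dx) + axis_dist(sy, dy)
        for (sx, sy), (dx, dy) in product(nodes, repeat=2)
    )
    # Pairs with source == destination add 0 hops and are excluded
    # from the count in the denominator.
    return total / (len(nodes) * (len(nodes) - 1))


for k in (8, 16):
    print(k, round(mean_hops(k, False), 3), round(mean_hops(k, True), 3))
```

Under these assumptions the results agree with the 1-chip and 4-chip wired mesh and folded torus entries of Table 2 to within rounding in the last digit. The sketch is not checked against the 9-chip or wireless entries, which depend on details of the architectures that it does not model.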

The advantages of the proposed THz graphene antenna enabled folded multichip system are more evident in Fig. 6, which shows the average packet latency at different injection loads and system sizes for the architectures considered in this paper under uniform random traffic. The packet latency is the average number of clock cycles required to transmit a packet to the destination core successfully. Due to the different average distances between cores in the different interconnection architectures, the latency characteristics are different. This is demonstrated by the average latencies at low injection loads. It can be observed that AW wireless has a better latency characteristic than the wireline architectures, as its average hop count between cores is lower than theirs, as shown in Table 2. The TF architecture, due to its edge-to-edge folding, reduces the network diameter further. Hence, the improvement in average hop count correlates with the latency characteristics. However, the hop count is merely an indication of the latency trends. The latency reduction is higher than 2x because at every additional hop or switch there is an unpredictable queueing delay that packets may encounter. The probability and magnitude of this delay at each additional hop can increase unpredictably, especially under high load. Moreover, the switch architecture adds delay that is not captured in the hop-count calculations. Therefore, in general, the reduction in latency observed through simulations is higher than that indicated by hop-count comparisons alone. In our simulations, the TF wireless architecture has the lowest latency compared to the systems with wireline interconnects and the AW wireless architecture for all sizes. Moreover, the relative gain in latency also increases with the number of chips in the system. This is because the impact of the folding is higher for larger system sizes. In summary, the TF approach performs better than the AW approach as it results in a lower hop count and better folding of the fabric with the wireless links. As the TF wireless architecture performs better than the AW wireless architecture, we evaluate and compare the TF architecture for the rest of the paper.

Fig. 6. Average packet latency (cycles) as a function of injection load (flits/core/cycle) for (a) 1-chip, (b) 4-chip and (c) 9-chip systems.

Next, we compare the TF architecture with a wired architecture that has better performance than the mesh and folded torus. A concentrated mesh NoC architecture with 64 cores is simulated for this comparison using the same methodology as discussed above. The concentrated mesh has a peak bandwidth of 70Gbps/core and a packet energy of 200nJ/packet [47]. In comparison, the 64-core TF wireless NoC achieves a bandwidth of 34.39Gbps/core and a packet energy of 2.8nJ/packet. While the bandwidth of the TF architecture is lower than that of the concentrated mesh, owing to the better connectivity of the latter, its packet energy is much lower due to the use of ultra-low-power THz wireless links.

4.4 Effect of Increase in Flit Width on overall system

In this section, we analyse the effect of increasing flit width for the TF wireless multichip system with uniform random traffic patterns and compare it with the interposer-based completely wireline architecture. For this experiment, we used four different flit sizes of 32, 64, 128, and 256 bits. This is because, as noted in [48], flit widths beyond 128 bits are shown to provide only marginal gains in system performance. In the case of wireline intra-chip interconnections and interposer-based wired inter-chip links, widening the physical channel to accommodate a larger flit width increases the data rate on the wireline links.

Fig. 7. Relative gain in bandwidth and average packet energy with different flit widths for (a) 1-chip and (b) 4-chip systems.

On the other hand, the data rate of the wireless links is governed by the speed of the transceiver and the bandwidth of the antennas, which do not change with flit size. Hence, while wireline communication becomes faster with increasing flit size, the wireless communication speed remains constant. This results in a reduction in the relative gains of the wireless multichip communication architecture with respect to the interposer-based system, as shown in Fig. 7. However, even with a flit width of 256 bits we see a relative improvement of 1.12x in data bandwidth and 1.17x in average packet energy for a 1-chip system. For the larger 4-chip system, the relative gain in bandwidth is 1.17x and that in packet energy is 3.85x, implying a larger advantage of wireless interconnections for multichip systems as the system size increases, even with ultra-wide flits.

TABLE 3. CORE CONFIGURATION FOR SIMULATION

Component         Configuration
Cores (1-chip)    16 equal-size clusters, each containing 1 shared LLC core and 3 OoO cores with private L1 cache. 16 Out-of-Order, 16 memory cores/chip, 2.5 GHz
Cores (4-chip)    16 equal-size clusters, each containing 8 shared LLC cores and 8 OoO cores with private L1 cache. 16 Out-of-Order, 16 memory cores/chip, 2.5 GHz
Cores (9-chip)    16 equal-size clusters, each containing 12 shared LLC cores and 24 OoO cores with private L1 cache. 16 Out-of-Order, 16 memory cores/chip, 2.5 GHz
L1 Cache          32KB, 4-way, LRU policy, private
LLC (L2) Cache    512KB, 8-way, LRU policy, shared
Cache Coherency   Directory-based MOESI
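To illustrate the flit-width trend discussed in this subsection, the following sketch estimates how many clock cycles a single flit occupies a wireless link. It uses the 100Gbps graphene link rate and 2.5GHz system clock from Table 1, and assumes a simple serialization model that ignores phase scheduling, packetization overhead and transceiver latency. The function name `wireless_flit_cycles` is hypothetical and the model is not the one implemented in the simulator.

```python
LINK_RATE_BPS = 100e9  # graphene link data rate (Table 1)
CLOCK_HZ = 2.5e9       # system clock frequency (Table 1)


def wireless_flit_cycles(flit_bits: int) -> float:
    """Clock cycles needed to serialize one flit over the wireless link,
    assuming a fixed link rate and no protocol overhead."""
    return flit_bits / LINK_RATE_BPS * CLOCK_HZ


for width in (32, 64, 128, 256):
    print(f"{width:>3}-bit flit: {wireless_flit_cycles(width):.1f} cycles")
```

Under this simplified model, a wired single-cycle link that is widened along with the flit still moves one flit per cycle, whereas the time on the wireless link grows linearly with flit width, which is consistent with the shrinking relative gains in Fig. 7.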


4.5 Performance Evaluation with Application specific traffic

In this subsection, we evaluate the performance of the proposed graphene-enabled TF wireless system with application-specific traffic patterns from the PARSEC [49] and SPLASH-2 [50] benchmark suites. To generate the application-specific traffic patterns, we consider a multicore chip with 16 memory cores and 16 out-of-order (OoO) processing cores. Each core has 32KB of L1 cache and 512KB of L2 cache and runs a directory-based MOESI cache coherency protocol. These core configurations, presented in Table 3, are then used to extract the core-to-memory and memory-to-memory cache coherency traffic for the PARSEC and SPLASH-2 benchmark applications when they are executed to completion using SynFull [51]. The traffic patterns are generated by mapping the cores depending on the number of cores in the system. For example, to map these traffic patterns to the 1-chip system (64-core) we consider 16 equal-sized clusters where each cluster contains 1 shared Last Level Cache (LLC) core and 3 OoO cores with private L1 cache.

For the 4-chip system (256-core) we considered 16 equal-sized clusters where each cluster contains 8 shared Last Level Cache (LLC) cores and 8 OoO cores with private L1 cache. Similarly, for the 9-chip system (576-core), 16 equal-sized clusters where each cluster contains 12 shared Last Level Cache (LLC) cores and 24 OoO cores with private L1 cache were considered. The reduction in average packet latency for the graphene-enabled folded 1-chip, 4-chip and 9-chip wireless systems with respect to the 1-chip, 4-chip and 9-chip wired architectures for different application-specific traffic patterns is shown in Fig. 8. The latency best represents the performance in these cases as the interconnection network is not saturated in the steady state. The reduction in average packet latency for the wireless architectures varies between applications due to the variation in traffic patterns.
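The cluster mapping described above can be sanity-checked with a few lines of Python. The sketch below simply confirms that 16 clusters with the stated per-cluster mix of LLC and OoO cores account for every core in the 64-, 256- and 576-core systems; the names `CLUSTER_MIX` and `cores_in_system` are hypothetical and used only for this illustration.

```python
CORES_PER_CHIP = 64
CLUSTERS = 16

# Per-cluster composition taken from the text:
# number of chips -> (shared LLC cores, OoO cores)
CLUSTER_MIX = {1: (1, 3), 4: (8, 8), 9: (12, 24)}


def cores_in_system(chips: int) -> int:
    """Total cores implied by the cluster mapping for a given system size."""
    llc, ooo = CLUSTER_MIX[chips]
    return CLUSTERS * (llc + ooo)


for chips in CLUSTER_MIX:
    total = cores_in_system(chips)
    print(f"{chips}-chip: {total} cores (expected {chips * CORES_PER_CHIP})")
```

Note that the ratio of LLC to OoO cores differs between system sizes (1:3, 1:1 and 1:2), so the traffic mix per cluster is not identical across the three configurations.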

On average, the latency of a 1-chip wireless system is better than that of the wired mesh and wired folded torus configurations by about 1.22x and 1.06x respectively. The latency of a 4-chip wireless system is better than that of the wired mesh and wired folded torus configurations by about 3.20x and 2.17x respectively. Similarly, the latency of a 9-chip wireless system is better than that of the wired mesh and wired folded torus configurations by about 6.31x and 3.85x respectively. This is due to the presence of single-hop wireless interconnects folding the multichip interconnect architecture. This is aided by the proposed multi-modal communication protocol, which enables concurrent communication links realized by high-bandwidth directional graphene-based wireless interconnects to fold the network efficiently. This reduction in latency correlates with the lower average hop counts for the TF wireless architecture as noted in Table 2. The relative reduction of latency in the wireless multichip systems compared to the wired multichip systems increases with system size. This is because in larger systems the wireline communications happen over longer multi-hop physical paths, creating worse bottlenecks than in smaller systems. However, with the wireless links, these multi-hop paths can be avoided regardless of the distance between the communicating cores.

Fig. 8. Average packet latency for application specific traffic for (a) 1-chip, (b) 4-chip and (c) 9-chip systems.

Fig. 9 shows the average packet energy for the folded 1-chip, 4-chip and 9-chip wireless systems with respect to the 1-chip, 4-chip and 9-chip wireline systems in the presence of application-specific traffic. On average, the average packet energy for a 1-chip wireless system is about 1.67x and 1.51x lower than that of the wired mesh and wired folded torus architectures respectively. The average packet energy for a 4-chip wireless system is about 4.18x and 2.81x lower than that of the wired mesh and wired folded torus architectures respectively. Similarly, the average packet energy for a 9-chip wireless system is about 1.89x and 1.25x lower than that of the wired mesh and wired folded torus architectures respectively. These improvements in energy efficiency are due to the presence of multiple concurrent wireless links based on extremely low-power, directional, point-to-point, high-bandwidth graphene antennas.

Fig. 9. Average packet energy for application specific traffic for (a) 1-chip, (b) 4-chip and (c) 9-chip systems.

4.6 Area Overheads

These advantages in performance and energy are achievable for a relatively low area overhead of about 0.36mm² per 4x4 graphene antenna array. The area overheads of the transceiver circuits are negligible in comparison to the antenna array. In the TF architecture, 9 such antenna arrays are required in a single chip with 64 cores, which amounts to 3.24mm² of area per chip in the multichip system. This means 0.81% of the area of a typical chip of size 400mm² is required to enable the antenna arrays. For the AW architecture, all switches in the system need to be equipped with the antenna array. Therefore, the area overhead will be 23.04mm² per chip in the system, or 5.76% of the die size. The TF architecture, however, achieves its performance benefits at the cost of only 0.81% of die area.
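The area figures quoted above follow directly from the per-array area and the die size, as the short calculation below shows. It takes 0.36mm² per array and a 400mm² (20mm x 20mm) die from the text, with 9 arrays per chip for TF and one array per switch (64 per chip) for AW; the function name `antenna_overhead` is hypothetical.

```python
ARRAY_AREA_MM2 = 0.36   # area of one 4x4 graphene antenna array
DIE_AREA_MM2 = 400.0    # 20mm x 20mm die


def antenna_overhead(num_arrays: int) -> tuple[float, float]:
    """Return (total antenna area in mm^2, percentage of die area)."""
    area = num_arrays * ARRAY_AREA_MM2
    return area, 100.0 * area / DIE_AREA_MM2


for name, arrays in (("TF", 9), ("AW", 64)):
    area, percent = antenna_overhead(arrays)
    print(f"{name}: {area:.2f} mm^2 per chip, {percent:.2f}% of the die")
```

The AW overhead is roughly seven times that of TF because every one of the 64 switches carries an array, whereas TF equips only 9 locations per chip.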

TABLE 4: ENERGY PER BIT FOR A SINGLE POINT-TO-POINT LINK AND POSSIBLE AGGREGATE BANDWIDTH FOR DIFFERENT INTERCONNECT TECHNOLOGIES.

Technology                                Energy         Aggregate Physical Bandwidth
mm-wave Wireless Interconnects [10]       2.3pJ/bit      16Gbps
Inter-chip Photonic Interconnects [53]    0.43pJ/bit     160Gbps
Inter-chip Graphene Interconnects         1.176fJ/bit    100Gbps

4.7 Comparative Evaluation with Alternative Technologies

In this section, the performance of a 4-chip system that uses graphene-based antennas is compared with a few other emerging alternative multichip integration technologies. We have considered mm-wave inter-chip wireless interconnections and inter-chip photonic interconnections for comparison with the proposed graphene-based multichip interconnection architecture. We have considered mm-wave transceivers and antennas to be deployed at the same locations as the graphene transceivers to create the same architecture for a fair comparison. However, all mm-wave transceivers were assumed to work in the 60GHz channel, requiring a token-passing based mechanism for contention-free channel access [16].

Off-chip photonic interconnects have emerged as another enabling technology for chip-to-chip communication [52]. In the photonic multichip system, the inter-chip communication happens through high-bandwidth photonic interfaces with intra-chip NoCs within each chip. For a topology comparable to the TF architecture, we have considered eight switches from each chip to be connected to the inter-chip waveguide. An all-to-all Single-Write-Multiple-Read (SWMR) architecture is considered [53]. For the 4-chip system we study here, this requires a total of 768 wavelengths for each of the 32 switches (8 in each of the 4 chips) to communicate with each of the other 24 switches (8 in each of the other 3 chips in the system). These 768 wavelengths can be sustained with 12 waveguides by Dense Wavelength Division Multiplexing (DWDM). A U-shaped waveguide bundle can be used for such an architecture with 4 chips attached to it through the edge switches [53]. Communication among cores or switches within the same chip utilizes wired mesh links over the intra-chip NoCs. Use of photonic waveguides within the chips is possible but requires layout of more waveguides with even higher area overheads.

Fig. 10. Performance evaluation for a 4-chip system with alternative interconnect technologies.
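The wavelength budget quoted for the SWMR photonic baseline can be reproduced with a short calculation. The sketch below assumes, as the count in the text implies, one dedicated wavelength per ordered pair of a writing switch and a reading switch on a different chip, and an even spread of wavelengths over the waveguides. The function name `swmr_wavelengths` is hypothetical.

```python
def swmr_wavelengths(chips: int, switches_per_chip: int) -> int:
    """Wavelengths needed when every photonic switch writes to every
    photonic switch located on a different chip (one wavelength per pair)."""
    writers = chips * switches_per_chip
    readers_per_writer = (chips - 1) * switches_per_chip
    return writers * readers_per_writer


total = swmr_wavelengths(chips=4, switches_per_chip=8)
waveguides = 12
print(f"{total} wavelengths, {total // waveguides} per waveguide")
```

Because the count grows with the square of the number of photonic switches, extending the same all-to-all scheme to more chips or more switches per chip would raise the wavelength and waveguide requirements quickly.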

rna

The energy/bit for a single point-to-point link and possible aggregate physical bandwidth provided by each of these technologies in these configurations are summarized in Table 4. Fig. 10 shows the peak bandwidth per core and overall system average packet energy for 4-chip systems with these different interconnect technologies. Mm-wave system has the lowest bandwidth per core and highest average packet energy among all the configurations considered here. This is because only a single transmitter can access the wireless channel at any given instant of time and it also has the highest energy consumption per bit.

The photonic inter-chip architecture outperforms the mm-wave architecture because of its concurrent, high-bandwidth wavelength division multiplexing (WDM) links. However, the performance of the photonic multichip system is lower than that of the graphene-based wireless system. This is due to the folding effect of the graphene links, whereas in the photonic system data packets must first reach the photonic interfaces at the periphery of the chip. Fig. 10 shows that the graphene-based multichip system achieves the highest bandwidth and the lowest average packet energy compared with the mm-wave wireless and photonic interconnect architectures. These improvements in packet energy are due to the extremely low-power graphene-based wireless links, making graphene antennas a promising solution for future multichip integration.

5. Conclusion and Future Work

In this paper, we present the design of a hybrid graphene-based wireless inter-chip and intra-chip interconnection fabric with a folding strategy. We propose two wireless architectures with different deployment strategies for the graphene-based wireless interconnects. Using low-power, high-bandwidth graphene-based wireless links, the performance and energy efficiency of systems can be significantly improved compared to wired counterparts under both synthetic and application-based traffic scenarios. These results will encourage further research on the integration, fabrication and characterization of graphene antennas in CMOS multicore chips.

6. Acknowledgement

This work was supported in part by the US National Science Foundation (NSF) CAREER grant CNS-1553264.

References

[1] J. D. Owens, W. J. Dally, R. Ho, D. N. J. Jayasimha, S. W. Keckler, and L.-S. Peh, "Research challenges for on-chip interconnection networks," in IEEE Micro, vol. 27, no. 5, pp. 96-108, 2007.

[2] L. Benini and G. De Micheli, "Networks on chips: a new SoC paradigm," in Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002.

[3] PCI-SIG, "PCI Express Architecture," URL: http://www.pcisig.com.

[4] International Technology Roadmap for Semiconductors, 2012 Edition, URL: http://www.itrs.net/.

[5] J. Kye, Y. Woo, J. Zeng, H. Levinson, A. Wehbi, P. Hang, V. Ton-That, V. Kanagala, D. Yu, D. Blackwell, A. Beece, S. Gao, S. Thangaraju, R. Alapati and S. Samavedam, "Challenges of analog and I/O scaling in 10nm SoC technology and beyond," 2014 IEEE International Electron Devices Meeting, San Francisco, CA, 2014, pp. 18.3.1-18.3.4.

[6] Hendry et al., "Modeling and Evaluation of Chip-to-Chip Scale Silicon Photonic Networks," IEEE Annual Symposium on High-Performance Interconnects (HOTI), pp. 1-8, 26-28 Aug. 2014.

[7] A. W. Topol et al., "Three-dimensional integrated circuit," in IBM Journal of Research and Development, vol. 50, no. 4.5, pp. 491-506, 2006.

[8] A. Kannan, N. E. Jerger and G. H. Loh, "Enabling interposer-based disintegration of multicore processors," in 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Waikiki, HI, pp. 546-558, 2015.

[9] H. Nakahara, R. Yasudo, H. Matsutani, H. Amano and M. Koibuchi, "3D Layout of Spidergon, Flattened Butterfly and Dragonfly on a Chip Stack with Inductive Coupling Through Chip Interface," 2017 14th International Symposium on Pervasive Systems, Algorithms and Networks & 2017 11th International Conference on Frontier of Computer Science and Technology & 2017 Third International Symposium of Creative Computing (ISPAN-FCST-ISCC), Exeter, 2017, pp. 52-59.

[10] M. S. Shamim, J. Muralidharan, and A. Ganguly, "An Interconnection Architecture for Seamless Inter and Intra-Chip Communication Using Wireless Links," in Proceedings of the 9th International Symposium on Networks-on-Chip - NOCS 15, 2015.

[11] D. DiTomaso, A. Kodi, S. Kaya, and D. Matolak, "iWISE: Inter-router Wireless Scalable Express Channels for Network-on-Chips (NoCs) Architecture," in 2011 IEEE 19th Annual Symposium on High Performance Interconnects (HOTI), 2011, pp. 11-18.

[12] I. F. Akyildiz, J. M. Jornet, and C. Han, "Terahertz band: next frontier for wireless communications," in Physical Commun. J., vol. 12, Sep. 2014.

[13] S. Saxena, D. S. Manur, M. S. Shamim, and A. Ganguly, "A folded wireless network-on-chip using graphene based THz-band antennas," in Proceedings of the 4th ACM International Conference on Nanoscale Computing and Communication (NanoCom '17), ACM, New York, NY, USA, Article 29, 6 pages.

[14] M. Andrello, A. Singh, N. Thawdar and J. M. Jornet, "Dynamic Beamforming Algorithms for Ultra-directional Terahertz Communication Systems Based on Graphene-based Plasmonic Nano-antenna Arrays," 2018 52nd Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 2018, pp. 1558-1563.

[15] S. Laha, S. Kaya, D. W. Matolak, W. Rayess, D. DiTomaso and A. Kodi, "A New Frontier in Ultralow Power Wireless Links: Network-on-Chip and Chip-to-Chip Interconnects," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 2, pp. 186-198, Feb. 2015.

[16] M. S. Shamim, N. Mansoor, R. S. Narde, V. Kothandapani, A. Ganguly and J. Venkataraman, "A Wireless Interconnection Framework for Seamless Inter and Intra-Chip Communication in Multichip Systems," in IEEE Transactions on Computers, vol. 66, no. 3, pp. 389-402, 2017.

[17] H. H. Yeh and K. L. Melde, "60 GHz multi-antenna design in Multi-Core system," in Proceedings of the 2012 IEEE International Symposium on Antennas and Propagation, Chicago, IL, 2012, pp. 1-2.

[18] I. Llatser et al., "Characterization of graphene-based nano-antennas in the terahertz band," in Proc. of 6th EUCAP, 2012, pp. 194-198.

[19] S. Abadal, E. Alarcón, A. Cabellos-Aparicio, M. C. Lemme and M. Nemirovsky, "Graphene-enabled wireless communication for massive multicore architectures," in IEEE Communications Magazine, vol. 51, no. 11, pp. 137-143, November 2013.

[20] G. Piro et al., "Initial MAC Exploration for Graphene-enabled Wireless Networks-on-Chip," in Proc. NANOCOM '14, 2014.

[21] I. E. Masri et al., "Accurate Channel Models for Realistic Design Space Exploration of Future Wireless NoCs," 2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Turin, 2018, pp. 1-8. doi: 10.1109/NOCS.2018.8512171

[22] P. Chen and A. Alù, "All-graphene terahertz analog nanodevices and nanocircuits," 2013 7th European Conference on Antennas and Propagation (EuCAP), Gothenburg, 2013, pp. 697-698.

[23] S. Abadal, J. Torrellas, E. Alarcón and A. Cabellos-Aparicio, "OrthoNoC: A Broadcast-Oriented Dual-Plane Wireless Network-on-Chip Architecture," in IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 3, pp. 628-641, 1 March 2018. doi: 10.1109/TPDS.2017.2764901

[24] X. Timoneda, A. Cabellos-Aparicio, D. Manessis, E. Alarcón and S. Abadal, "Channel Characterization for Chip-scale Wireless Communications within Computing Packages," 2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), Turin, 2018, pp. 1-8.

[25] V. Petrov et al., "Terahertz Band Intra-Chip Communications: Can Wireless Links Scale Modern x86 CPUs?" in IEEE Access, vol. 5, pp. 6095-6109, 2017.

[26] L. Falkovsky and S. Pershoguba, "Optical far-infrared properties of a graphene monolayer and multilayer," Physical Review B, vol. 76, pp. 1-4, 2007.

[27] V. P. Gusynin and S. G. Sharapov, "Transport of Dirac quasiparticles in graphene: Hall and optical conductivities," Physical Review B, vol. 73, p. 245411, Jun. 2006.

[28] L. Ju, B. Geng, J. Horng, C. Girit, M. Martin, Z. Hao, H. Bechtel, X. Liang, A. Zettl, Y. R. Shen, and F. Wang, "Graphene plasmonics for tunable terahertz metamaterials," Nature Nanotechnology, vol. 6, pp. 630-634, Sep. 2011.

[29] F. H. L. Koppens, D. E. Chang, and F. J. Garcia de Abajo, "Graphene plasmonics: a platform for strong light matter interactions," Nano Letters, vol. 11, no. 8, pp. 3370-3377, Aug. 2011.

[30] P. K. Singh, G. Aizin, N. Thawdar, M. Medley, and J. M. Jornet, "Graphene-based plasmonic phase modulator for terahertz-band communication," in Proc. of the European Conference on Antennas and Propagation (EuCAP), 2016.

[31] S. H. Lee, H.-D. Kim, H. J. Choi, B. Kang, Y. R. Cho, and B. Min, "Broadband modulation of terahertz waves with non-resonant graphene meta-devices," IEEE Transactions on Terahertz Science and Technology, vol. 3, no. 6, pp. 764-771, 2013.

[32] J. M. Jornet and I. F. Akyildiz, "Graphene-based plasmonic nanoantenna for terahertz band communication in nanonetworks," IEEE JSAC, Special Issue on Emerging Technologies for Communications, vol. 12, no. 12, pp. 685-694, Dec. 2013.

[33] J. E. Burke, "Analytical study of tunable bilayered-graphene dipole antenna," Army Armament Research, Development and Engineering Center, Dover, NJ, USA, Tech. Rep., Mar. 2011.

[34] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Performance evaluation and design trade-offs for network-on-chip interconnect architectures," IEEE Trans. Comput., vol. 54, no. 8, pp. 1025-1040, Aug. 2005.

[35] H. K. Mondal, S. H. Gade, M. S. Shamim, S. Deb and A. Ganguly, "Interference-Aware Wireless Network-on-Chip Architecture Using Directional Antennas," in IEEE Transactions on Multi-Scale Computing Systems, vol. 3, no. 3, pp. 193-205, 1 July-Sept. 2017.

[36] I. Akyildiz, J. Jornet and C. Han, "TeraNets: ultra-broadband communication networks in the terahertz band," in IEEE Wireless Communications, vol. 21, no. 4, pp. 130-135, 2014.

[37] G. H. Bernstein et al., "Quilt packaging: a new paradigm for interchip communication," in Electronic Packaging Technology Conference, 2005. EPTC 2005. Proceedings of 7th, 2nd ed. vol. 3, pp. 6. J. Peters, Ed. New York: McGraw-Hill, Dec. 2005.

[38] W. Knap, J. Lusakowski, T. Parenty, S. Bollaert, A. Cappy, V. Popov, and M. Shur, "Terahertz emission by plasma waves in 60 nm gate high electron mobility transistors," Applied Physics Letters, vol. 84, no. 13, pp. 2331-2333, 2004.

[39] T. Otsuji, T. Watanabe, S. Boubanga Tombet, A. Satou, W. Knap, V. Popov, M. Ryzhii, and V. Ryzhii, "Emission and detection of terahertz radiation using two-dimensional electrons in III-V semiconductors and graphene," IEEE Transactions on Terahertz Science and Technology, vol. 3, no. 1, pp. 63-71, 2013.

[40] P. K. Singh, G. Aizin, N. Thawdar, M. Medley, and J. M. Jornet, "Graphene-based plasmonic phase modulator for terahertz-band communication," in Proc. of the European Conference on Antennas and Propagation (EuCAP), 2016.

[41] N. Thawdar, J. M. Jornet, and I. Michael Andrello, "Modeling and performance analysis of a reconfigurable plasmonic nano-antenna array architecture for terahertz communications," NanoCom'18: ACM The Fifth Annual International Conference on Nanoscale Computing and Communication.

[42] J. Duato, S. Yalamanchili, and L. Ni, "Interconnection Networks-An Engineering Approach," Morgan Kaufmann, 2002.

[43] S. Deb et al., "Design of an Energy-Efficient CMOS-Compatible NoC Architecture with Millimeter-Wave Wireless Interconnects," in IEEE TC, v. 62, no. 12, pp. 2382-2396, Dec. 2013.

[44] Q.-T. Vien, M. O. Agyeman, T. A. Le, and T. Mak, "On the Nanocommunications at THz Band in Graphene-Enabled Wireless Network-on-Chip," Mathematical Problems in Engineering, vol. 2017, Article ID 9768604, 13 pages, 2017.

[45] M. Fujishima, "Terahertz wireless communication using 300GHz CMOS transmitter," 2016 13th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), Hangzhou, 2016, pp. 411-414.

[46] S. B. Lee et al., "A Scalable Micro Wireless Interconnect Structure for CMPs," Proc. 15th Annual Int'l. Conf. Mobile Computing and Net., Beijing, China, pp. 217-28, 2009.

[47] M. M. Ahmed, M. S. Shamim, N. Mansoor, S. A. Mamun and A. Ganguly, "Increasing interposer utilization: A scalable, energy efficient and high bandwidth multicore-multichip integration solution," 2017 Eighth International Green and Sustainable Computing Conference (IGSC), Orlando, FL, 2017, pp. 1-6.

[48] J. Lee, C. Nicopoulos, S. J. Park, M. Swaminathan, and J. Kim, "Do we need wide flits in networks-on-chip?" in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, pp. 2-7, 2013.

[49] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT '08), ACM, New York, NY, USA, pp. 72-81, 2008.

[50] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 Programs: Characterization and Methodological Considerations," in Proceedings of the 22nd International Symposium on Computer Architecture, pp. 24-36, June 1995.

[51] M. Badr and N. E. Jerger, "SynFull: Synthetic traffic models capturing cache coherent behaviour," in ACM/IEEE 41st ISCA, 2014.

[52] X. Wu et al., "UNION: A unified inter/intrachip optical network for chip multiprocessors," in IEEE Trans. Very Large Scale Integr. Syst., vol. 22, no. 5, pp. 1082-1095, May 2014.

[53] A. Joshi et al., "Silicon-photonic clos networks for global on-chip communication," 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip, San Diego, CA, 2009, pp. 124-133.

Sagar Saxena is currently pursuing his Master of Science in Computer Engineering at Rochester Institute of Technology, Rochester, NY, USA. He received his Bachelor of Technology (BTech) in Electronics and Communication Engineering from Jaypee Institute of Information Technology, India in 2015. His research interests lie mainly in advanced and high-performance computing architectures, with a particular focus on the design of modern Networks-on-Chip (NoCs) with novel and emerging interconnect technologies.

Deekshith Shenoy Manur received the B.E. degree from The National Institute of Engineering, Mysore, India in 2013 and the MS degree from Rochester Institute of Technology, Rochester in 2017. His research interests include designing wireless network-on-chip architectures using novel devices for intra-chip and inter-chip wireless communication.

Naseef Mansoor is currently an Assistant Professor in the Department of Electrical and Computer Engineering and Technology at Minnesota State University, Mankato, MN. He received his PhD in Computing and Information Sciences from Rochester Institute of Technology, Rochester, NY, USA in 2017. Prior to his PhD, he received his BSc in Computer Science and Engineering from Bangladesh University of Engineering and Technology, Dhaka, Bangladesh in 2009. His research interests are in wireless and photonic Network-on-Chip architectures and heterogeneous computing systems.

Amlan Ganguly is currently an Associate Professor in the Department of Computer Engineering at Rochester Institute of Technology, Rochester, NY, USA. He received his PhD and MS degrees from Washington State University, USA and BTech from Indian Institute of Technology, Kharagpur, India in 2010, 2008 and 2005 respectively. His research interests are in robust and scalable intra-chip and inter-chip interconnection architectures and novel datacenter networks with emerging technologies such as wireless interconnects. He is a member of IEEE.

Highlights:

The first version was reviewed, and major revisions were suggested on April 22, 2019. We submitted the revised version on June 20, 2019. After receiving the second round of comments, we are resubmitting the second revised version on October 31, 2019. Best regards.

--- Authors.

Conflict of Interest:

We report no conflict of interest for this paper.