Computers and Electrical Engineering 83 (2020) 106578
Contents lists available at ScienceDirect
Computers and Electrical Engineering journal homepage: www.elsevier.com/locate/compeleceng
ChangeSUB: A power efficient multiple network-on-chip architecture
Mohammad Baharloo a,∗, Rashid Aligholipour b, Meisam Abdollahi c, Ahmad Khonsari a,c
a School of computer science in the Institute for Research in Fundamental Sciences, Tehran
b Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran
c Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
Article info
Article history: Received 13 July 2019; Revised 9 February 2020; Accepted 10 February 2020
Keywords: Multiple network-on-chip; Router architecture; Low power; Power gating; Static power; Energy proportionality
Abstract
Applying power gating to a network-on-chip (NoC) is an effective technique for reducing static power, but it can significantly degrade on-chip network performance. Since NoC performance has a considerable impact on overall chip performance, providing a tradeoff between chip power and performance is crucial. To this end, applying power gating in a multiple network-on-chip (multi-NoC) instead of a traditional NoC is a promising solution. However, in a multi-NoC, waking up a chain of routers in a switched-off sub-network (subnet) incurs a performance penalty. In this paper, we introduce an architecture, namely ChangeSUB, which provides an opportunity to change the subnet of packets in a multi-NoC architecture. In the proposed architecture, packets avoid encountering switched-off routers by changing their subnet. Experimental results indicate that, compared to the traditional multi-NoC design, the proposed architecture decreases the network latency, execution time, and NoC static power consumption by 10.5%, 4.5%, and 17.6%, respectively, while imposing only 1.9% hardware overhead.
© 2020 Elsevier Ltd. All rights reserved.
1. Introduction
Nowadays, in the field of chip design, researchers and designers face several major challenges related to high performance computing, energy/power consumption, temperature management, process variation, and area overhead [1,2]. In recent years, with the advent of many-core chip processors, the power consumption of the on-chip network has become a large portion of the chip's total power budget. For example, in the Intel Teraflop [3] and 36-core SCORPIO processors [4], up to 28% and 19% of the total chip power belong to their on-chip networks, respectively. The power consumption of an NoC consists of the power consumed by communication links and routers, and routers account for a large portion of the total NoC power budget [5]. Among the router components, buffers have the largest share, about 64% of the NoC's leakage power [6]. Therefore, by reducing the buffer size or eliminating buffers altogether, the NoC power consumption can be dramatically decreased. On the other hand, with the increase of input buffer size, the on-chip network performance, and consequently,
This paper is for regular issues of CAEE. Reviews processed and recommended for publication to the Editor-in-Chief by Associate Editor Dr. L. Bittencourt.
∗ Corresponding author.
E-mail addresses: [email protected] (M. Baharloo), [email protected] (R. Aligholipour), [email protected] (M. Abdollahi), [email protected] (A. Khonsari).
https://doi.org/10.1016/j.compeleceng.2020.106578 0045-7906/© 2020 Elsevier Ltd. All rights reserved.
M. Baharloo, R. Aligholipour and M. Abdollahi et al. / Computers and Electrical Engineering 83 (2020) 106578
Fig. 1. NoC’s static power portion for different technology nodes [8].
the overall performance of the chip will increase to a certain extent [1]. As a result, in the process of on-chip network design, power consumption and performance must be traded off against each other. The power consumption of the on-chip network is the sum of the switching, short-circuit, and static power, as summarized in (1).
$$P = P_{switching} + P_{short\text{-}circuit} + P_{static} = \alpha C_L V_{dd}^2 f + I_{sc} V_{dd} + I_{leakage} V_{dd} \tag{1}$$
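As a rough numerical illustration of Eq. (1), the following sketch evaluates the three power terms for one operating point. The function name `noc_power` and every numeric value are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from this paper.

```python
def noc_power(alpha, c_load, v_dd, freq, i_sc, i_leak):
    """Return (switching, short_circuit, static) power in watts, following Eq. (1)."""
    p_switching = alpha * c_load * v_dd ** 2 * freq   # alpha * C_L * V_dd^2 * f
    p_short_circuit = i_sc * v_dd                      # I_sc * V_dd
    p_static = i_leak * v_dd                           # I_leakage * V_dd
    return p_switching, p_short_circuit, p_static


# Hypothetical operating point (illustrative values only, not taken from the paper).
sw, sc, st = noc_power(alpha=0.1, c_load=1e-9, v_dd=1.0, freq=1e9,
                       i_sc=0.01, i_leak=0.09)
static_share = st / (sw + sc + st)   # fraction of total power that is static

# Halving V_dd at a fixed frequency quarters the switching term (V_dd squared),
# while the static term in Eq. (1) only scales linearly with V_dd.
sw_half, _, st_half = noc_power(alpha=0.1, c_load=1e-9, v_dd=0.5, freq=1e9,
                                i_sc=0.01, i_leak=0.09)
```

Under these assumed values the static term is a little under half of the total, and only the switching term benefits quadratically from a lower supply voltage.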
The switching and short-circuit power are the components of dynamic power consumption. The switching power depends on the operating voltage ($V_{dd}$), the clock frequency ($f$), the load capacitance ($C_L$), and the transition activity factor ($\alpha$). The short-circuit power is a function of the short-circuit current ($I_{sc}$), which arises when the pull-up and pull-down networks in a CMOS circuit are simultaneously active. The static power (a.k.a. leakage power) is dissipated due to the leakage current ($I_{leakage}$), which comprises various components such as sub-threshold leakage and gate-induced drain leakage (GIDL). In the nanometer regime, static power constitutes a substantial share of power dissipation, mainly due to recent technology advances such as reduced feature size and reduced transistor operating voltage [7,8]. Fig. 1 depicts the static power portion of the total NoC power in three different CMOS technology nodes. The trend is evident: the static power share has risen from 43% to 65% (an increase of 22 percentage points) with technology scaling from 45 nm down to 22 nm.
Given that static power is dissipated even when the NoC is unloaded or idle, one of the most effective ways to reduce it is the power gating technique. Power gating is a practical approach for reducing leakage power and has recently been widely used by system designers [5,7,9,10]. This technique was initially used to reduce the static power of idle execution units [11] and also to reduce the dynamic power consumption of other parts such as register files [12]. By power gating the idle units, their leakage current is cut off, and ultimately, the overall power consumption is reduced. The same idea is also utilized to reduce the power consumption of the on-chip network [5,7,10], where power gating is applied to idle routers to prevent leakage power. A simple schematic of the power gating technique in NoC routers [5] is shown in Fig. 2.
When the router is idle, activating the power gating signal switches off the transistor, and thus the leakage current is cut off. The power gating technique can be applied with a negligible performance penalty in NoCs under light traffic load. In this case, most of the on-chip routers are idle, so they are good candidates for power gating. On the other hand, under heavy traffic load, applying power gating to on-chip components not only has an insignificant effect on chip power reduction but also results in a substantial performance penalty. This issue arises from the fact that under heavy workloads, packets encounter switched-off routers more frequently. In this case, packets must wait for the routers to be turned on, which increases network latency and imposes a power overhead (the power dissipated to turn on the switched-off routers). The performance of the on-chip network plays a crucial role in the overall performance of the chip. Therefore, the loss of efficiency due to applying power gating should be compensated as much as possible to achieve an appropriate trade-off between performance and power consumption.
For implementing the power gating technique, the multi-NoC is more suitable than the traditional single network-on-chip (single-NoC). In a multi-NoC, each router is divided into a number of small-sized routers (as depicted in Fig. 3). As a result, the on-chip network is divided into several small-sized subnets. The architecture of components in a multi-NoC is similar to that of a single-NoC, but each multi-NoC component, such as the crossbar, data-path, and buffer, is smaller than its single-NoC counterpart. We keep the aggregate buffer size constant across single- and multi-NoC designs so that the total buffer space of the single-NoC is divided equally between the subnets of the multi-NoC. Since the link width in multi-NoC is smaller than in
Fig. 2. Power gating circuitry for NoC’s router.
Fig. 3. single-NoC vs. multi-NoC structure [9].
single-NoC, and consequently, this ratio is valid for the flit size in both structures, the router buffer depth in terms of flits is the same in both designs. As depicted in Fig. 3, in a multi-NoC structure, cores are connected to the subnet selection module through the network interface (NI). In this way, a packet generated in a core can be injected into any subnet through the subnet selection module. Indeed, based on the congestion state of the network, a proper subnet is selected and activated to transmit the packet. The subnet selection policy and the congestion criterion exploited in this work are the same as those adopted in [10]. To balance performance against power consumption and to provide minimal on-chip communication, the zero subnet (the lowest-order subnet) is always active. In the absence of congestion, the packet is injected into the zero subnet. Otherwise, it is injected into one of the higher-order subnets based on the congestion status of the subnets. The congestion in a subnet is determined based on the buffer occupancy of its routers, which was investigated in [10] as an effective criterion for congestion detection in the multi-NoC structure.
The authors in [9,10] showed that the multi-NoC is a more suitable structure for applying power gating techniques than the traditional single-NoC. However, the main problem of the methods proposed in these studies is the lack of the ability to change the subnet of packets. A packet injected into a subnet is routed to the destination in the same subnet, and there is no possibility of changing the subnet of the packet. In this paper, an architecture called ChangeSUB is presented, which provides the hardware circuitry to change the subnet of a packet. In the proposed architecture, the subnet into which a packet was initially injected can be changed.
Furthermore, when the zero subnet has spare capacity, packets that were injected into another subnet are transferred to the zero subnet through the NI. This approach prevents multiple routers from being turned on in higher-order subnets, which are generally switched off. The simulation results indicate a 5.5% improvement in the execution time of real-world application benchmarks compared to the architecture proposed in [10], which is called Catnap. Besides, the proposed architecture reduces the power consumption of non-zero subnets by up to 17.6%.
The remainder of this paper is organized as follows: Section 2 describes the most recent related works. In Section 3, the proposed ChangeSUB architecture and its details are presented. Section 4 discusses the experimental results of the performance evaluation of the proposed architecture. Finally, Section 5 concludes the paper.
2. Related works
In recent years, many studies have been conducted to reduce the power consumption of NoCs [5,9,10,13,14]. For example, the authors in [13,14] applied the dynamic voltage and frequency scaling (DVFS) technique to a hybrid NoC architecture to reduce the dynamic power consumption of communication links. Over time, however, the contribution of dynamic power consumption has declined. Therefore, dynamic power reduction methods are no longer considered as widely as before for reducing on-chip network power. In contrast, the static power contribution has increased with technology scaling. As a result, static power reduction receives considerable attention in today's modern chips. One of the most effective approaches to reducing static power consumption is the power gating technique. Power gating can be applied to an NoC in both fine-grained and coarse-grained forms. In the fine-grained form, power gating is applied only to a part of the router's components [15–17]. Most of these methods try to reduce the static power of input buffers, which account for the largest portion of the power consumption of a router. The fine-grained techniques are highly complex and do not reduce the static power consumption as well as the coarse-grained methods do. Hemanta et al. [18] proposed a performance- and power-aware hybrid wired/wireless NoC structure (called P2 NoC), which utilizes the power gating strategy for router elements as well as wireless interfaces in the case of inactivity.
The authors provided a hybrid two-level pre-computed (coarse-grained) and runtime (fine-grained) utilization estimation to reduce power consumption while minimizing hardware overhead and performance degradation. They also proposed a deadlock-free bypass routing mechanism to compensate for the negative impacts of power gating. P2 NoC reduces the total packet transmission energy consumption by about 49%, at the cost of 7% area overhead and an insignificant performance loss. Kim et al. [2] concentrated on the buffer element of the NoC router, which has large area overhead and power consumption. The authors proposed a router microarchitecture called FlexiBuffer, in which the active buffer size can be adaptively adjusted through fine-grained power gating to reduce the leakage power of the buffer. The contribution of FlexiBuffer was twofold. First, a credit-based flow control (called early credit) was proposed, which appropriately determines the active buffer window size. Second, a novel split queue mechanism was introduced to overcome the limitation of circular buffer management. The authors claimed that, with minimal performance loss, the total router power consumption decreases by about 39%. In [5], a power gating technique called Power Punch was introduced. Power Punch uses wake-up control signals to turn on switched-off routers before the packet arrives. This is done by exploiting the slack time at injection nodes and the slack in hop count to forward the wake-up signal ahead of packets, hiding the wake-up latency of powered-off routers. The experimental results show that Power Punch decreases the router static power by about 83%. In [19], a power gating technique based on the XY routing protocol was devised. Under XY routing, a packet in a router is forwarded in the same direction as its previous hop with a probability of about 55% [19].
To put it another way, under the XY routing policy, only about 44% of packets that pass a router change their directions. In this method, the microarchitecture of routers has been changed in such a way that a router is turned on only in two cases: either its associated core injects a new packet, or a packet has to turn in it. In other cases, when a packet is forwarded straight through or is ejected, there is no need to turn on the router. Accordingly, the idle time of routers is increased, which reduces the switching overhead of the power gating mechanism and consequently reduces the leakage power of the on-chip network. The authors in [7] stated that the conventional power gating approach leads to more frequent state transitions and hence imposes notable performance and power overhead. Therefore, they proposed a power-aware NoC approach called Node-Router Decoupling (NoRD), which decouples a router's capability of packet transmission from its powered-on/off status in order to prolong the router's idle periods. Specifically, they devised an internal bypass path in each router, which is connected to the NI, to forward packets from an input port to an output port without waking the router up. For this purpose, some changes were made to the NI of routers. This mechanism can reduce the static energy of the router by 29.9% and improve the average packet latency by about 26.3%, with an insignificant hardware overhead of 3% in comparison to an optimized conventional power-gated router. The main limitation of this method is that it is designed for the k × k mesh, where k is an even number. In [10], an architecture called Catnap was introduced. In Catnap, which is a multi-NoC architecture, each router is divided into a number of small-sized routers. As a result, the on-chip network is divided into several small-sized networks, called subnets. Each subnet has a number that identifies its priority in carrying packets.
Specifically, numbers start from zero, and lower-order subnets have a higher priority in carrying packets. To provide network connectivity and a minimum quality of service, the zero-subnet (the subnet with the highest priority) is always active. Under light network traffic loads, packets are always injected into the zero-subnet. If network traffic increases so that packets encounter congestion in the zero-subnet, the network attempts to inject packets into higher-order subnets with respect to their priority. In fact, with increasing traffic load, subnets are selected to carry packets one after another, in the order of their priority. When packets are injected into a switched-off higher-order subnet, the routers of that subnet turn on one after another as packets pass through them. The main problem with Catnap is that it relies only on the temporal distribution of traffic to detect congestion. Thus, if a packet encounters a congested router and is injected into a higher-order subnet, it is assumed that all of its lower-order
Fig. 4. single-NoC vs. multi-NoC design under power gating.
subnets are congested, while the congestion may have occurred at only one of the routers in that subnet. As a result, a packet injected into a subnet is forwarded in the same subnet until it reaches its destination. To consider the spatial as well as the temporal traffic distribution when handling the congestion problem of the on-chip network, the authors in [20] proposed a method called ShuttleNoC. In this approach, with the aid of extra hardware devised for each on-chip link, packets can shuttle between subnets. However, this method imposes a large hardware overhead, which leads to excessive power consumption and area overhead. In the following section, the proposed architecture called ChangeSUB is explained in detail, in which packets can change their subnet through the NI with little hardware overhead to provide an efficient on-chip power gating approach.
3. Proposed architecture
3.1. Overview
As mentioned in Section 2, the multi-NoC design is considered to be one of the most suitable structures for applying power gating, where each router is divided into multiple smaller-scale sub-routers and each node (i.e., core) is connected to all sub-routers. Fig. 4 highlights the differences and advantages of the multi-NoC compared to the traditional NoC design. Clearly, in conventional NoCs, the majority of routers are active even under light traffic load, while the multi-NoC utilizes a few active small-scale subnets to carry the traffic. To ensure that all the cores are connected all the time, the multi-NoC always keeps the zero subnet active, while the higher-order subnets are activated on demand based on the congestion status of the network. As previously mentioned, the subnet selection policy exploited in this work is the one adopted in [10]. According to this policy and based on our experiments, whenever the buffer occupancy of a router in a subnet exceeds a specified limit, namely 80% in our study, the subnet is considered to be approaching congestion. In such circumstances, a router from the immediately higher-order subnet is activated, and the NI injects packets into the network through the activated router until the congestion is resolved. When the congestion on a given subnet (e.g., Subnet 3) is entirely resolved, the power gating mechanism adopted in this paper starts to switch off the routers in the immediately higher-order subnet (e.g., Subnet 4). The congestion state in a subnet vanishes whenever the buffer occupancy of its routers drops below a specified threshold, namely 50% in our study, and the buffers in the immediately lower-order subnet are left empty for a predetermined number of cycles (four in our study).
The process is checked separately for each router in each subnet, so that the router $R_k^j$ (the jth router of subnet k) is powered off if congestion in the router $R_{k-1}^j$ is entirely resolved. Activation of a subnet is done hop by hop, which incurs some performance penalty due to the wake-up latency of routers. According to our previous study [9], the wake-up latency of a 4-stage pipeline on-chip router is about ten cycles, three of which can be hidden by utilizing the look-ahead routing scheme. In order to determine the number of subnets in a multi-NoC, the network throughput must be evaluated. Our previous study [9] on the throughput evaluation of different multi-NoC configurations revealed that partitioning a 128-bit single-NoC into more than four subnets leads to a significant throughput loss. Accordingly, to apply a fine-grained power gating mechanism with a negligible throughput loss, the NoC architecture should be configured as a multi-NoC with four 32-bit subnets.
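The activation and deactivation thresholds described above form a simple hysteresis, which can be sketched per router as follows. This is a minimal illustration, not the authors' implementation: the class name `RouterCongestionMonitor`, its fields, and the `watched_buffers_empty` input (standing for the buffers that the text requires to be empty) are hypothetical; only the 80%, 50%, and four-cycle values come from the text.

```python
from dataclasses import dataclass

ACTIVATE_THRESHOLD = 0.80     # occupancy above which congestion is assumed (from the text)
RESOLVE_THRESHOLD = 0.50      # occupancy below which congestion may be resolved (from the text)
EMPTY_CYCLES_REQUIRED = 4     # consecutive empty cycles required (from the text)


@dataclass
class RouterCongestionMonitor:
    """Hypothetical per-router congestion tracker with hysteresis."""
    congested: bool = False
    empty_streak: int = 0

    def step(self, occupancy: float, watched_buffers_empty: bool) -> bool:
        """Advance one cycle; return True while the higher-order router should stay on."""
        # Count consecutive cycles in which the watched buffers were empty.
        self.empty_streak = self.empty_streak + 1 if watched_buffers_empty else 0
        if not self.congested:
            if occupancy > ACTIVATE_THRESHOLD:
                self.congested = True
        elif (occupancy < RESOLVE_THRESHOLD
              and self.empty_streak >= EMPTY_CYCLES_REQUIRED):
            self.congested = False
        return self.congested


# Illustrative trace: (buffer occupancy, watched buffers empty?) per cycle.
trace = [(0.85, False), (0.60, True), (0.40, True), (0.40, True), (0.40, True), (0.70, True)]
monitor = RouterCongestionMonitor()
states = [monitor.step(occ, empty) for occ, empty in trace]
```

The gap between the two thresholds keeps a router from toggling on and off when occupancy hovers around a single limit, which matters because each wake-up costs several cycles.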
Fig. 5. The main Catnap problem where the subnet of packets can not be changed.
Splitting the on-chip network into several smaller subnets imposes some overheads due to layout complexity and duplicated control logic. According to previous studies [9,10], the power and area overheads imposed by the replication of the control logic are relatively insignificant, amounting to less than 4% of the whole power and area of a router. The power overhead can be compensated by the total power reduction achieved through the decreased complexity of smaller network components and clock distribution. In addition, the narrower components of a multi-NoC, such as links, crossbars, and buffers, can be driven at a lower voltage than those of the wider single-NoC to achieve the same frequency. Thus, given the quadratic relation between power and voltage, the total power consumption of multiple narrower subnets is less than the power consumption of the wide single-NoC. Also, according to the investigation in [7], a significant portion of the NoC's static power is dissipated by the routers' buffers. As mentioned earlier, the total buffer space in the single-NoC is evenly divided among the subnets of the multi-NoC. Consequently, by applying the power gating mechanism in a multi-NoC, a large number of network components, including the buffers, are powered off, which leads to a significant reduction in the network power consumption. However, the significant shortcoming of the previous studies [9,10], which utilize the multi-NoC design for applying power gating, is that they rely only on the temporal distribution of the traffic to measure congestion. Consequently, when a packet experiences congestion in one of the routers in a specific subnet, it is assumed that all the routers in this subnet are congested or are close to being congested. However, in many practical scenarios, only one router is congested, and the other routers of the same subnet are far from the congestion state.
This assumption means that when a packet is injected into a higher-order subnet, it is routed in that subnet until it reaches its destination, which implies that the packet will encounter several power-gated routers along its path. Thus, the packet unnecessarily experiences a high latency penalty, which could easily be avoided by sending it back to the lower, uncongested subnet. Moreover, many routers have to be powered on in this process, which imposes a significant power overhead. Fig. 5 shows the problem that arises from the lack of the ability to change the subnet of a packet in the multi-NoC design. The figure shows a multi-NoC with two subnets, where lightly loaded routers are colored white and congested routers are colored black. This condition happens when a core generates many packets during a short time interval, which can be observed when tasks are not mapped in a balanced fashion or when the cache coherency process generates many invalidation packets. For example, assume that the kth core is running a task that requires a cache block that is stored in the private caches of all other cores. The core will then send an invalidation packet to all the other cores on the chip. The number of these packets is higher than the number of virtual channels of the router. Therefore, the invalidation packets will be injected into all the existing higher-order subnets. However, it is possible that all the other routers of the higher-order subnets are empty, and if the packets can be sent back to the lower-order subnets somewhere along their path, performance and power consumption will be improved. Given the above, the traffic distribution of the on-chip network is not fixed, and congestion can occur anywhere in the network. This is very likely, especially when the NoC handles a mixture of several different workloads. Under this circumstance, it is not possible to assume a predictable spatial traffic distribution.
However, even given the high probability of congestion at the center of the network, ignoring less likely transient congestion can impose a high overhead on the power gating mechanism. Evaluation results in [20] show that simultaneously considering both the spatial and temporal traffic distributions in an on-chip network to which a power gating mechanism has been applied can lead to a significant improvement in the NoC's power dissipation and performance. In [20], through extra hardware that enables packets to shuttle between subnets of a
Fig. 6. Packet type breakdown in PARSEC applications.
Fig. 7. Overview of the proposed ChangeSUB & NI architectures.
multi-NoC in order to bypass local congestion, the power consumption and performance have been improved by about 9.3% and 14.7%, respectively. The area overhead of this work is relatively high, about 7% compared to the traditional multi-NoC design, which makes it a non-scalable design. Besides, its extra hardware circuitry incurs a 3.3% dynamic power overhead. Consequently, it is advantageous to devise an architecture with minimal extra hardware, and therefore less area and power overhead, that captures traffic heterogeneity in both its spatial and temporal aspects. In the proposed method, we only allow the control packets to use the NI and change their subnet (from a non-zero subnet to the zero subnet), with just a little extra hardware circuitry. Our rationales behind this decision are as follows:
• In many applications, the number of control packets dominates the number of all other packet types. For example, in eight applications of the PARSEC benchmark suite (see Fig. 6), more than 71% of all packets are control packets (based on the MESI two-level cache coherency model). Control packets are significantly smaller than data packets and can change their subnet with minimum power and latency overhead.
• According to our observations in different applications, more than half of the packets that are injected into the non-zero subnets are control packets. By sending this large number of control packets back to the zero subnet, we can avoid powering on the power-gated routers in higher-order subnets. Consequently, a remarkable saving in terms of power consumption and network latency can be achieved.
• Data packets are typically large. For example, with a 32-bit flit size and a 64-byte cache block size, the size of a data packet is 18 flits. Therefore, if the subnet of a data packet is changed, the output port of the router that is connected to the NI would be occupied for 18 cycles.
Also, this packet occupies the NI for 18 cycles, which leads to an excess delay for this packet and for other packets that have to be injected into or ejected from the subnet.
In the ChangeSUB architecture, however, a change subnet request for a control packet can be sent to the ChangeSUB controller and subsequently to the NI. If the request is granted, the packet can use a proposed secondary buffer, called the bypass buffer, to be injected into the zero subnet. The ChangeSUB architecture and the microarchitecture of routers in
Fig. 8. Microarchitecture of router in the non-zero subnets of ChangeSUB architecture.
ChangeSUB are presented in Fig. 7 and Fig. 8, respectively. As can be seen in Fig. 7, a bypass is designed in the NI architecture through which a packet can be ejected and then re-injected into the network.
3.2. Router architecture
To implement our proposed architecture, we have to change the microarchitecture of the routers in the non-zero subnets, as presented in Fig. 8 (for the sake of simplicity, we only show the west ports). It should be noted that the routers in the zero subnet have the same architecture as routers in the traditional single-NoC, except that they are one quarter the size of the router in the traditional NoC. In this architecture, for each input port, we need 1) a multiplexer, 2) a secondary buffer, 3) a multiplexer with four inputs for the secondary buffer, and 4) a multiplexer at the output of the router that is connected to the NI. The multiplexers are controlled by another component called the ChangeSUB Controller. Each packet can go through different paths, depending on the subnet number. In what follows, we describe the procedure of handling a packet in detail:
1. When a packet enters a router in the zero subnet: Because the microarchitecture of the router is unchanged in comparison with traditional routers, the packet traverses the normal path through the pipeline. Therefore, in this case, there is no difference between control and data packets. Again, recall that the routers in the zero subnet are always active.
2. When a packet enters a router in a non-zero subnet: In this situation, a control packet has the ability to go back to the zero subnet. Depending on the type of packet, its flits traverse different paths in the router to reach the output port. In what follows, we elaborate on the process based on the type of packet:
(a) Data Packets: As explained before, data packets are prone to occupying the NI for a very long time. Thus, we avoid changing the subnet of data packets.
Consequently, as in traditional routers, data packets traverse a router in four cycles if there is no contention in any stage of the router pipeline. In the first cycle, the head flit is stored in the input buffer, and the routing unit computes the output port of the packet based on the address in the head flit. In the second cycle, a virtual channel (VC) is allocated to the packet based on the request of the head flit. The VC allocation process is only done for the head flit of a packet. In the third cycle, namely switch allocation, the switch allocator unit connects the input port and the output port of the crossbar to construct a dedicated path for the flit. In the last cycle, the flit uses this path to reach the corresponding output port through the crossbar.
(b) Control Packets: When a control packet enters a router in a non-zero subnet, it sends a request signal to the ChangeSUB controller unit for changing its subnet. When the ChangeSUB controller unit receives the request, it looks at the previous two cycles to decide whether to send the request signal to the NI. Specifically, if it has not sent a subnet change request to the NI in the last two cycles, it records the input buffer number and virtual channel number and sends the request signal to the NI. The last two cycles are checked because of the latency of receiving the grant signal from the NI. In the proposed architecture, it takes two cycles to send a request and receive the corresponding grant signal. Consequently, the time interval between two consecutive requests to the NI should not be less than two cycles. Otherwise, the controller could receive a grant signal that was intended for the request sent in the previous cycle. To avoid such a malfunction, we employ this two-cycle wait mechanism. It can be implemented with a one-bit counter that is enabled by the Request to NI signal.
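The request throttling performed by the ChangeSUB controller can be sketched in software as below. This is a behavioural illustration under stated assumptions rather than the hardware design: the class and method names are hypothetical, cycles are modelled as plain integers, and "the last two cycles" is interpreted as a minimum gap of two cycles between consecutive requests to the NI.

```python
class ChangeSubController:
    """Hypothetical behavioural model of the request throttling described above."""

    REQUEST_GAP = 2  # assumed minimum number of cycles between two requests to the NI

    def __init__(self):
        self.last_request_cycle = None   # cycle of the most recent request to the NI
        self.pending = None              # (input_port, vc) recorded for that request

    def on_control_packet(self, cycle, input_port, vc):
        """Return True if a subnet change request is forwarded to the NI this cycle."""
        too_soon = (self.last_request_cycle is not None
                    and cycle - self.last_request_cycle < self.REQUEST_GAP)
        if too_soon:
            # A grant for the earlier request may still be in flight; forwarding a
            # new request now could cause that grant to be matched to the wrong packet.
            return False
        self.last_request_cycle = cycle
        self.pending = (input_port, vc)  # remembered so the grant can be matched later
        return True


ctrl = ChangeSubController()
first = ctrl.on_control_packet(cycle=0, input_port="west", vc=1)    # forwarded
second = ctrl.on_control_packet(cycle=1, input_port="north", vc=0)  # suppressed
third = ctrl.on_control_packet(cycle=2, input_port="north", vc=0)   # forwarded
```

A packet whose request is suppressed simply continues along the regular router pipeline, so the throttle never stalls traffic.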
M. Baharloo, R. Aligholipour and M. Abdollahi et al. / Computers and Electrical Engineering 83 (2020) 106578
Fig. 9. Regular routine vs. subnet changing routine for a one flit control packet in an on-chip router of the proposed ChangeSUB architecture.
In the proposed architecture, a control packet in a router of a non-zero subnet can experience two different scenarios, which are depicted in Fig. 9. According to this figure, while a packet is written into the input buffer and the route computation unit determines its output port, a subnet change request is sent to the ChangeSUB controller for this packet (first phase). In the second phase, during the virtual channel allocation process, if there have been no subnet change requests in the last two cycles, the request is delivered to the NI. In the third phase, if there is a free virtual channel in the NI’s output port that leads to the router in the zero subnet, the NI reserves this virtual channel and sends the grant signal back to the ChangeSUB controller. Otherwise, if there is no free virtual channel, the grant signal is not sent. Meanwhile, the switch allocation process is done, and the packet is ready to go through the crossbar. In the fourth phase, if the subnet change request was not granted in the previous phase (i.e., third phase), the packet proceeds through the crossbar. Otherwise, it goes forward through the Bypass buffer, and eventually, it passes through the output multiplexer to the link that leads to the NI. It is worth noting that if the subnet change request for a packet is granted, the work of the three regular computation phases of the router (i.e., route computation, virtual channel allocation, and switch allocation) is discarded, which leads to some performance and power overheads for the proposed architecture. But due to the
Fig. 10. Modification of source and destination fields of a control flit during subnet changing mechanism.
small size of the control packets and the small amount of space they occupy in the router’s buffer, they leave the buffer queue in one cycle through the bypass buffer if their subnet change request is granted. Thus, the performance overhead imposed by the control packets during the subnet changing process is negligible. Our evaluations show that, on average, for three benchmark applications (i.e., blackscholes, dedup, and swaptions), the performance overhead is about 1.6%. On the other hand, the power overhead imposed by the wasted computing phases during the subnet changing process only affects dynamic power and does not affect the static power of the network. On average, the imposed power overhead on the total power consumption of the on-chip network is about 0.02% for the three benchmark applications mentioned above.

It should be noted that when a control packet changes its subnet and reaches the NI, some fields of the packet, such as the source and destination addresses, are changed. Specifically, the two most significant bits of the source and destination address fields, which represent the subnet number, are modified. As an example, depicted in Fig. 10, in a multi-NoC with four subnets and an 8 × 8 mesh topology, the source address of a packet that is injected into the second subnet from Core 0 is 128. If Core 63 is the destination of the packet, the address of the destination router for the packet is 191. As such, whenever the packet changes its subnet to the zero subnet, its source and destination addresses are adjusted to 0 and 63, respectively. Then, the NI injects the packet into the zero subnet.

3.3. Alleviating the challenges of ChangeSUB architecture

In what follows, we describe the challenges of our proposed architecture and their solutions.
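Before turning to these challenges, the address adjustment of Fig. 10 can be made concrete with a short sketch. It assumes an 8-bit address whose two most significant bits hold the subnet number and whose six low-order bits hold the node index of the 8 × 8 mesh; this layout is inferred from the values 128, 191, 0, and 63 quoted above, and the function names are hypothetical.

```python
NODES_PER_SUBNET = 64            # 8 x 8 mesh
NODE_BITS = 6                    # low-order bits that select the router/core
NODE_MASK = NODES_PER_SUBNET - 1


def encode(subnet: int, node: int) -> int:
    """Pack a subnet number (top two bits) and a node index (low six bits)."""
    assert 0 <= subnet < 4 and 0 <= node < NODES_PER_SUBNET
    return (subnet << NODE_BITS) | node


def to_zero_subnet(address: int) -> int:
    """Clear the subnet bits and keep the node index unchanged."""
    return address & NODE_MASK


src, dst = encode(2, 0), encode(2, 63)
print(src, dst)                                  # 128 191
print(to_zero_subnet(src), to_zero_subnet(dst))  # 0 63
```

Under this assumed layout, the subnet change amounts to masking off the two subnet bits, which reproduces the source and destination values given in the example.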
a) Handling multiple requests in a router; It is possible that several packets enter a router in a non-zero subnet through the north, west, south, and east ports and request a subnet change simultaneously. In this situation, it is necessary to prioritize the packets. The prioritization mechanism adopted in the proposed architecture ranks packets by the number of remaining hops to their destinations, so that a packet with more remaining hops has a higher priority to change its subnet. The reason for this decision is that packets that are far away from their destination are more likely to encounter power-gated routers in their paths. Consequently, if their subnet cannot be changed, the latency and power overhead of the multi-NoC increase dramatically because a large number of power-gated routers must be activated.

b) Handling multiple requests in the NI; Routers in different subnets that are associated with a single core work independently. Consequently, they might send subnet change requests to the NI simultaneously. In this situation, the NI has to select and grant only one of the requests. Note that, in the proposed multi-NoC design, packets are injected into the subnets in ascending order, i.e., a packet is injected into a higher-order subnet if its current subnet is congested. This means that, in comparison with lower-order subnets, the probability of congestion in higher-order subnets is lower. Thus, it is more probable that the number of power-gated routers is higher in higher-order subnets. As a result, to avoid powering on a large number of power-gated routers in higher-order subnets, the routers in these subnets have a higher priority to change the subnet of their packets. This decision leads to lower latency and higher power efficiency.
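The two priority rules in a) and b) can be summarized in a few lines of code. The sketch below is illustrative only: the type and function names are hypothetical, and the tie-breaking behavior (the first maximal request wins) is an assumption, since the text does not specify how ties are resolved.

```python
from typing import NamedTuple, Optional, Sequence


class RouterRequest(NamedTuple):
    port: str            # input port the control packet arrived on
    remaining_hops: int  # hops left to the packet's destination


def pick_in_router(requests: Sequence[RouterRequest]) -> Optional[RouterRequest]:
    """Router-level rule: the packet with the most remaining hops wins."""
    return max(requests, key=lambda r: r.remaining_hops, default=None)


def pick_in_ni(requesting_subnets: Sequence[int]) -> Optional[int]:
    """NI-level rule: the highest-order requesting subnet wins."""
    return max(requesting_subnets, default=None)


reqs = [RouterRequest("north", 3), RouterRequest("west", 9), RouterRequest("east", 5)]
print(pick_in_router(reqs).port)  # west
print(pick_in_ni([1, 3, 2]))      # 3
```

Both rules favor the requests that are most likely to meet power-gated routers: packets with long remaining paths inside a router, and packets from higher-order subnets at the NI.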
Fig. 11. Evaluation environment.
c) Buffer backpressure mechanism; In the traditional on-chip router architecture, in each cycle, only one flit can win the arbitration process and leave the input buffer of a router through the crossbar. Therefore, for each flit, one credit signal is sent to the previous router to inform it about the current state of the input buffer. However, in the proposed architecture, two flits may leave the input buffer in the same cycle: 1) a flit that changes its subnet, which proceeds through the Bypass buffer, and 2) a flit that leaves the router through the crossbar. Consequently, the traditional credit-based buffer backpressure mechanism, which returns a single credit per cycle, is not suitable for the proposed architecture. To overcome this challenge, for a buffer with depth N, we utilize log N reverse signaling wires to encode the credit count.

d) Avoiding frequent dynamic switching between subnets; In the ChangeSUB architecture, as mentioned in Section 1, an appropriate subnet is selected at the moment of packet injection according to the congestion status of the network. After injection, if the injected packet is a data packet, its subnet is not changed, for the reasons described in Section 3.1. A control packet in a non-zero subnet can change its subnet to the zero subnet only once, and only if there is free buffer space in the zero subnet. Also, control packets are not allowed to move to higher-order subnets during transmission to the destination. Thus, given that subnet selection occurs only at the moment of packet injection, and the subnet of a packet can be changed to the zero subnet only once, there is no possibility of frequent dynamic switching between the non-zero subnets and the zero subnet.

4. Comparative performance evaluation

For the evaluation process, we study the performance of the proposed ChangeSUB architecture by running eight applications from the PARSEC [21] benchmark suite.
Furthermore, in order to evaluate the ChangeSUB architecture across a full range of network workloads, two synthetic traffic patterns (bit-complement and shuffle) are utilized. Finally, we evaluate the area overhead of the ChangeSUB architecture.

4.1. Evaluation methodology

We use the full-system simulator named Gem5 to evaluate our proposed architecture. In our evaluations, we employ the Ruby memory model and the Garnet2.0 network model [22]. Since Gem5 does not provide a power consumption model, the DSENT simulator [23] was integrated into Gem5 to measure latency and power consumption simultaneously. The static and dynamic power consumption are calculated based on the 22 nm technology node, which is common in today’s microprocessors. For further evaluation, we also report the area overhead of our proposed architecture through a VHDL implementation of on-chip routers with power gating and subnet changing functionality, synthesized for the 45 nm VLSI technology node with the Synopsys Design Compiler synthesis tool. The adopted topology is an 8 × 8 mesh, where each router is connected to the core and an L1 cache through appropriate output ports. Moreover, each router is connected to the L2 cache, which is shared among all the on-chip cores. For simulating and evaluating the effects of the power gating scheme, the Agate simulator [24] was employed, which can be integrated into Gem5. As the Agate simulator is compatible with the Garnet1.0 network model, it was upgraded to support the Garnet2.0 model, which was utilized in this work. The overall evaluation environment is presented in Fig. 11, and the parameters are summarized in Table 1.

4.2. Baseline structures

To demonstrate the efficiency of the proposed architecture, it is compared against three different structures:

• No-PG: In this structure, the on-chip network is a traditional single-NoC without any subnets, and power gating is not applied. The flit size and the link bandwidth are assumed to be 128 bits.
Table 1
Simulation parameters.

Parameter                   Value
Technology node             22 nm
Number of cores             64, x86 ISA
Frequency                   2 GHz
Network topology & size     8 × 8 mesh
Routing algorithm           XY
Router                      4-stage
Private I/D L1 cache        32KB, 4-way, LRU, 2-cycle latency
Coherence protocol          MESI-Two Level
Shared L2 cache             512KB, 8-way, LRU, 10-cycle latency
Virtual channel             2 VCs/virtual network (VN), 3 VNs, 4 flits/VC
Memory size & latency       1GB, 128-cycle access time
Fig. 12. Average packet latency for PARSEC applications.
• ConvOpt-PG: In this structure, like No-PG, the on-chip network is a traditional single-NoC, but a power gating technique is applied. The ConvOpt-PG structure utilizes three optimization techniques to reduce the latency overhead of the power gating technique [24]. The flit size and the link bandwidth are the same as in the No-PG structure.

• Catnap: In this approach, each 128-bit router is divided into four individual 32-bit routers (i.e., a 4-subnet multi-NoC structure). The power gating scheme adopted in this approach is explained in Section 2. In this structure, unlike the proposed architecture, the subnet of a packet cannot be changed after it is injected into the network [10].

4.3. Average packet latency

Fig. 12 shows the average packet latency of running real benchmark applications on the architectures mentioned above. As can be seen in the figure, the ChangeSUB architecture consistently achieves a lower latency than its counterparts, except for the No-PG approach. In No-PG, all on-chip components are always active, so the performance overhead that power gating imposes on the other architectures does not exist. In the ConvOpt-PG and Catnap approaches, on average, the network latency is increased by 56.9% and 27%, respectively, relative to No-PG. This high overhead in the ConvOpt-PG approach arises from the low injection rate of PARSEC applications: a large number of on-chip routers are powered off, which increases the number of wake-ups a packet encounters along its path through the network and, consequently, the accumulated wake-up latency. In Catnap, due to the use of multi-NoC, which is a more appropriate structure for applying power gating, there is less latency overhead than in ConvOpt-PG.
On the other hand, the ChangeSUB architecture has the best performance among the power-gated approaches (i.e., compared to ConvOpt-PG and Catnap), with an average overhead of only 13.4% in network latency compared to No-PG. In ChangeSUB, the best average packet latency results are obtained for the canneal and streamcluster applications, where increases of only about 3.7% and 5.5%, respectively, are observed compared to No-PG. The reason is that these applications have heavier loads than the others and exhibit bursty communication. These characteristics make the zero subnet congested, and consequently, the NI injects packets into higher-order subnets. The ChangeSUB architecture can then return packets from a higher-order subnet to the zero subnet to avoid encountering powered-off routers. In this way, packets that change their
Fig. 13. Execution time.
subnets do not need to wait for switched-off routers to wake up, and consequently, the performance of the on-chip network is improved. Thus, applications with heavy workloads and bursty communication benefit more from the ChangeSUB architecture. In contrast, under light network utilization, the zero subnet can handle the traffic, and fewer packets are injected into higher-order subnets. So, there is less chance for the proposed architecture to change the subnets of packets in higher-order subnets, and no significant improvement is achieved. Overall, the proposed architecture outperforms ConvOpt-PG and Catnap in terms of average packet latency: the average network latency of ChangeSUB is lower by 27.2% and 10.5%, respectively, in comparison with these approaches.
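The relation between the overheads quoted against No-PG and the reductions quoted against ConvOpt-PG and Catnap can be checked with a back-of-the-envelope calculation. The helper below is a hypothetical sketch that works only from the average overheads stated in this section (56.9%, 27%, and 13.4%).

```python
def relative_reduction(overhead_new: float, overhead_ref: float) -> float:
    """Percentage by which a design with `overhead_new` improves on one with
    `overhead_ref`, both given as fractional increases over the same baseline."""
    return 100.0 * (1.0 - (1.0 + overhead_new) / (1.0 + overhead_ref))


# Average latency increases over No-PG quoted in this section.
CONVOPT_PG, CATNAP, CHANGESUB = 0.569, 0.27, 0.134

print(round(relative_reduction(CHANGESUB, CONVOPT_PG), 1))  # 27.7
print(round(relative_reduction(CHANGESUB, CATNAP), 1))      # 10.7
```

The results, roughly 27.7% and 10.7%, are close to the reported 27.2% and 10.5%. A small gap of this kind would be expected if the reported reductions are averaged per application rather than computed from the averaged overheads, although that is an assumption about the methodology and not something stated in the text.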
4.4. Execution time

Fig. 13 shows the execution time of real-world applications, where the results are normalized to the No-PG approach. It can be observed that different applications have different levels of sensitivity to network latency. In our experiment, dedup shows the highest sensitivity to network latency, while swaptions shows the least. The execution time of the dedup application on the ChangeSUB architecture is 12.3% and 5.6% lower than on ConvOpt-PG and Catnap, respectively. The corresponding reductions for the swaptions application are 1.7% and 0.9%. In ChangeSUB, the average execution time over the eight PARSEC applications is decreased by 4.5% compared to the Catnap architecture. Overall, ConvOpt-PG, Catnap, and ChangeSUB increase the execution time by 18%, 8.7%, and 4.4%, respectively, compared to the No-PG approach.
4.5. Number of encountered powered-off routers

This metric is considered to highlight the efficiency of the ChangeSUB architecture. When a packet encounters a powered-off router, it has to wait until the router wakes up, which incurs considerable latency and power overheads. The problem worsens if the number of powered-off routers that the packet encounters along its path is large. Fig. 14 shows the average number of such routers for two approaches, i.e., Catnap and ChangeSUB. The results for ConvOpt-PG are not presented because, in this approach, the average number of wake-ups a packet experiences along its path is much higher than in the other two approaches (3.5 on average). The proposed approach encounters a significantly lower number of powered-off routers, since it returns packets to the zero subnet, where all the on-chip routers are always active. On average, the proposed approach reduces the number of encountered powered-off routers by 54.3% compared to Catnap. In a multi-NoC architecture, a large number of packets reach their destination through the always-active subnet (i.e., the zero subnet) without encountering any powered-off router. In other words, the number of packets that encounter switched-off routers is very low since there is always an active subnet. Hence, in both Catnap and ChangeSUB, which utilize the multi-NoC architecture, the average number of encountered powered-off routers is fractional and less than one.

To further illustrate how the ChangeSUB architecture improves power gating performance, we evaluate the average number of cycles that a packet waits along its path for routers to wake up. Fig. 15 shows that, on average, the waiting time for waking up routers in the ChangeSUB architecture is 51% less than in the Catnap approach.
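The reason the per-packet averages in Fig. 14 fall below one can be seen from a toy calculation. The trace below is purely hypothetical and is not taken from the evaluation: it assumes that nine out of ten packets travel entirely through the always-active zero subnet and that the remaining packet meets three powered-off routers.

```python
def average_wakeups(wakeups_per_packet):
    """Mean number of powered-off routers encountered per packet."""
    return sum(wakeups_per_packet) / len(wakeups_per_packet)


# Hypothetical trace: nine packets stay in the zero subnet (no wake-ups),
# one packet in a higher-order subnet meets three powered-off routers.
trace = [0] * 9 + [3]
print(average_wakeups(trace))  # 0.3
```

Because most packets contribute zero to the sum, the average stays fractional even when an individual packet encounters several powered-off routers.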
Fig. 14. Average number of encountered powered-off routers.
Fig. 15. Average number of cycles a packet waits for waking up powered-off routers.
4.6. Static power

Fig. 16 shows the static power consumption and power gating overhead of the non-zero subnets. The power consumption of the zero subnet is not reported because all the network components in this subnet are always active and power gating is not applied to them. Moreover, since the power gating overhead of No-PG and ConvOpt-PG is much higher than that of the Catnap and ChangeSUB approaches, the results for No-PG and ConvOpt-PG are omitted for better presentation. Experimental results show that, on average, the proposed architecture reduces static power consumption by 56.9% compared to Catnap, where the streamcluster application achieves the highest static power reduction of about 81.5%, and blackscholes shows the lowest static power reduction of 24.4%. The reason is that streamcluster has a heavy workload with bursty communication phases, which cause packets to be injected into higher-order subnets. In this way, ChangeSUB has more chances to return packets to the zero subnet and save a considerable amount of power by avoiding switching on powered-off routers. In contrast, blackscholes has a low packet injection rate. Under a light workload, the majority of packets are injected into the zero subnet, and the ChangeSUB architecture has very little chance to change the subnets of packets to save power. On average, compared to Catnap, ChangeSUB reduces the static power consumption and power gating overhead by 56.9% and 59.7%, respectively.

4.7. Full range network utilization analysis

Here, we use synthetic traffic to analyze the behavior of the approaches across a full range of network utilization in more detail. Specifically, we use two synthetic traffic patterns, 1) Uniform and 2) Shuffle, where the injection rate increases from
Fig. 16. Static power and power gating overhead of non-zero subnets.
0.005 packets/node/cycle until the network becomes completely congested. Fig. 17 shows the behavior of the different approaches in terms of average packet latency and static power consumption under Uniform and Shuffle traffic. As can be seen in the figure, the average packet latency and the static power consumption of the on-chip network follow similar trends under both workloads. At light traffic load, most of the on-chip network traffic in the ChangeSUB and Catnap architectures is transferred through the zero subnet. Thus, given that the width of the subnet in these two architectures is a quarter of the width of the on-chip network of the No-PG architecture, one would expect the network latency in these two architectures to be about four times greater than in the No-PG approach. However, as can be concluded from Fig. 17c and Fig. 17d, at light traffic load (i.e., 0.005), the average network latency of ChangeSUB increases by only about 26% and 31% in the Uniform and Shuffle traffic patterns, respectively, compared to No-PG. This apparent contradiction originates from the fact that the performance overhead of network partitioning in the multi-NoC architecture is compensated for the following reasons. First, in the multi-NoC architecture, due to the higher number of subnets compared to the traditional NoC, head-of-line blocking, which is one of the main causes of network inefficiency, is greatly reduced, and consequently, the on-chip network performance is improved. Second, in traditional single-NoCs with wider links, link utilization is poor due to the small size of the control packets. For example, in a single-NoC with a 128-bit link width, the size of a typical control packet (32/64 bits) is much smaller than the size of the flit, causing internal fragmentation and consequently low network utilization.
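The fragmentation argument above can be quantified with a simple flit count. The sketch below is a simplified model under stated assumptions: the function names are hypothetical, a packet occupies a whole number of flits, each flit is as wide as the link, and header or flow-control bits are ignored.

```python
import math


def flits_needed(packet_bits: int, link_width_bits: int) -> int:
    """Number of link-wide flits required to carry the packet."""
    return math.ceil(packet_bits / link_width_bits)


def link_utilization(packet_bits: int, link_width_bits: int) -> float:
    """Fraction of the transferred bits that carry useful payload."""
    flits = flits_needed(packet_bits, link_width_bits)
    return packet_bits / (flits * link_width_bits)


for width in (128, 32):
    for packet in (32, 64):
        print(width, packet, flits_needed(packet, width),
              link_utilization(packet, width))
# 128 32 1 0.25
# 128 64 1 0.5
# 32 32 1 1.0
# 32 64 2 1.0
```

In this simplified model, a 32-bit control packet needs one flit on either link width and a 64-bit control packet needs at most one extra flit on the 32-bit subnet, so narrowing the links does not translate into a fourfold delay for control traffic.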
It is noteworthy that in our simulations, control packets constitute more than 60% of the network load, while the transmission delay of these packets is not affected by shrinking the network width. Under light traffic load, in the ConvOpt-PG, Catnap, and ChangeSUB approaches, the majority of on-chip routers can stay powered off, resulting in substantial savings in static power consumption. Notably, in the Catnap and ChangeSUB approaches, which use the multi-NoC structure, the static power consumption is close to 25% of what is consumed in the No-PG approach. The reason for this reduction is that, under light traffic load, the zero subnet can carry the major volume of network traffic, and therefore the higher-order subnets can be powered off. In addition, the average packet latency of Catnap and ChangeSUB is lower than that of the ConvOpt-PG approach, because in the multi-NoC structure, there is always an active subnet (i.e., the zero subnet) which can handle light traffic loads alone. In this way, packets encounter fewer wake-ups along their path and, consequently, suffer less latency overhead. As the workload increases from low to moderate, the routers gradually wake up, and the number of powered-off routers is reduced. In such a situation, the static power of ConvOpt-PG increases with a steeper slope than that of the Catnap and ChangeSUB approaches. The reason is that, in Catnap and ChangeSUB, thanks to the use of multi-NoC design,
Fig. 17. Static power and average packet latency of synthetic workloads.
power gating can be applied in a fine-grained manner. Under moderate workload, the number of packets in higher-order subnets increases, and ChangeSUB has more chances to return packets from higher-order subnets to the zero subnet, reducing the number of encountered powered-off routers. So, ChangeSUB has lower average packet latency and static power than Catnap. When the network gets close to the congestion state, the results of No-PG and ConvOpt-PG for average packet latency and static power consumption overlap completely: under heavy traffic load, the majority of routers are active, and ConvOpt-PG acts like the No-PG approach. On the other hand, the behavior of ChangeSUB becomes very similar to that of Catnap because the zero subnet is full, and no packet can be returned to that subnet. When the network is saturated, the static power of Catnap and ChangeSUB exceeds the power of the No-PG approach. This increase is due to the overhead imposed by the physical division of the network into multiple networks. To put it another way, in a multi-NoC design with four subnets, there exist, for each router, four virtual channel allocation units, four crossbar allocation units, and four clock generators, while the traditional NoC design (i.e., single-NoC) has only one of each of these units.

4.8. Area overhead

Since our approach uses extra hardware, we evaluate the area overhead of the power gating controller in the architectures mentioned above at the 45 nm technology node using the Synopsys Design Compiler synthesis tool. In our evaluation, we model all the extra components of the ChangeSUB architecture and find that our approach has only 1.9% and 5% area overhead compared to Catnap and the traditional NoC, respectively. The area overhead of the power gating transistors and their associated signal distribution is about 6.4% [24]. It is worth mentioning that the implementation of a well-designed power gating mechanism imposes an area overhead of 4–10% [25].

5. Conclusion

We investigated the problem of inflexibility in changing the subnets of packets along their paths in the multiple network-on-chip structure. In an on-chip network, congestion is very likely to occur locally due to the unbalanced spatial distribution of the network traffic. As a result, packets are injected into a powered-off subnet, which leads to many wake-ups and imposes significant power and performance overheads. To address this, an architecture called ChangeSUB is proposed, which obviates the aforementioned inefficiency of the traditional multiple network-on-chip structure. In our proposed architecture, packets that were injected into a subnet to detour around the congested area can return to the zero subnet along their
path. Our evaluations reveal that, thanks to our proposed architecture, the average packet latency and execution time of different benchmarks decrease by 10.5% and 4.5%, respectively, compared to the traditional multiple network-on-chip structure, while the incurred area overhead is only about 1.9%. Furthermore, the static power consumption of the on-chip network (except the zero subnet) is decreased by about 17.6%.

Declaration of Competing Interest

None.

References

[1] Kundu S, Chattopadhyay S. Network-on-chip: the next generation of system-on-chip integration. CRC press; 2018.
[2] Rohbani N, Shirmohammadi Z, Zare M, Miremadi S-G. Laxy: a location-based aging-resilient xy-yx routing algorithm for network on chip. IEEE Trans Comput Aided Des Integr Circuits Syst 2017;36(10):1725–38.
[3] Hoskote Y, Vangal S, Singh A, Borkar N, Borkar S. A 5-ghz mesh interconnect for a teraflops processor. IEEE Micro 2007;27(5):51–61.
[4] Daya BK, Chen CO, Subramanian S, Kwon W, Park S, Krishna T, et al. Scorpio: a 36-core research chip demonstrating snoopy coherence on a scalable mesh noc with in-network ordering. In: 2014 ACM/IEEE 41st international symposium on computer architecture (ISCA); 2014. p. 25–36.
[5] Chen L, Zhu D, Pedram M, Pinkston TM. Power punch: towards non-blocking power-gating of noc routers. In: High performance computer architecture (HPCA), 2015 IEEE 21st international symposium on. IEEE; 2015. p. 378–89.
[6] Chen X, Peh L-S. Leakage power modeling and optimization in interconnection networks. In: Proceedings of the 2003 international symposium on low power electronics and design. ISLPED ’03. New York, NY, USA: ACM; 2003. p. 90–5. ISBN 1-58113-682-X.
[7] Chen L, Pinkston TM. Nord: node-router decoupling for effective power-gating of on-chip routers. In: Proceedings of the 2012 45th annual IEEE/ACM international symposium on microarchitecture. IEEE Computer Society; 2012. p. 270–81.
[8] Chen L, Zhao L, Wang R, Pinkston TM. Mp3: minimizing performance penalty for power-gating of clos network-on-chip. In: 2014 IEEE 20th international symposium on high performance computer architecture (HPCA); 2014. p. 296–307.
[9] Baharloo M, Khonsari A. A low-power wireless-assisted multiple network-on-chip. Microprocess Microsyst 2018;63:104–15.
[10] Das R, Narayanasamy S, Satpathy SK, Dreslinski RG. Catnap: energy proportional multiple network-on-chip. In: ACM SIGARCH computer architecture news, 41. ACM; 2013. p. 320–31.
[11] Hu Z, Buyuktosunoglu A, Srinivasan V, Zyuban V, Jacobson H, Bose P. Microarchitectural techniques for power gating of execution units. In: Proceedings of the 2004 international symposium on low power electronics and design. ISLPED ’04. New York, NY, USA: ACM; 2004. p. 32–7. ISBN 1-58113-929-2.
[12] Namazi A, Abdollahi M. Pcg: partially clock-gating approach to reduce the power consumption of fault-tolerant register files. In: 2017 euromicro conference on digital system design (DSD); 2017. p. 323–8.
[13] Murray J, Pande PP, Shirazi B. Dvfs-enabled sustainable wireless noc architecture. In: SOC conference (SOCC), 2012 IEEE international. IEEE; 2012. p. 301–6.
[14] Shamim MS, Mhatre A, Mansoor N, Ganguly A, Tsouri G. Temperature-aware wireless network-on-chip architecture. In: 2014 international green computing conference (IGCC). IEEE; 2014. p. 1–10.
[15] Wang P, Niknam S, Wang Z, Stefanov T. A novel approach to reduce packet latency increase caused by power gating in network-on-chip. In: Proceedings of the eleventh IEEE/ACM international symposium on networks-on-chip. NOCS ’17. New York, NY, USA: ACM; 2017. 3:1–3:8. ISBN 978-1-4503-4984-0.
[16] Kim G, Kim J, Yoo S. Flexibuffer: reducing leakage power in on-chip network routers. In: 2011 48th ACM/EDAC/IEEE design automation conference (DAC); 2011. p. 936–41.
[17] Matsutani H, Koibuchi M, Ikebuchi D, Usami K, Nakamura H, Amano H. Ultra fine-grained run-time power gating of on-chip routers for cmps. In: Proceedings of the 2010 fourth ACM/IEEE international symposium on networks-on-chip. NOCS ’10. Washington, DC, USA: IEEE Computer Society; 2010. p. 61–8. ISBN 978-0-7695-4053-5.
[18] Mondal HK, Gade SH, Kishore R, Deb S. P2noc: power- and performance-aware noc architectures for sustainable computing. Sustain Comput 2017;16:25–37.
[19] Farrokhbakht H, Taram M, Khaleghi B, Hessabi S. Toot: an efficient and scalable power-gating method for noc routers. In: NOCS; 2016. p. 1–8.
[20] Lu H, Yan G, Han Y, Wang Y, Li X. Shuttlenoc: Boosting on-chip communication efficiency by enabling localized power adaptation. In: Design automation conference (ASP-DAC), 2015 20th Asia and South Pacific. IEEE; 2015. p. 142–7.
[21] Bienia C, Kumar S, Singh JP, Li K. The parsec benchmark suite: characterization and architectural implications. In: Proceedings of the 17th international conference on parallel architectures and compilation techniques. PACT ’08. New York, NY, USA: ACM; 2008. p. 72–81. ISBN 978-1-60558-282-5.
[22] Binkert N, Beckmann B, Black G, Reinhardt SK, Saidi A, Basu A, et al. The gem5 simulator. ACM SIGARCH Comput Archit News 2011;39(2):1–7.
[23] Sun C, Chen CO, Kurian G, Wei L, Miller J, Agarwal A, et al. Dsent - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In: 2012 IEEE/ACM Sixth international symposium on networks-on-chip; 2012. p. 201–10.
[24] Chen L, Zhu D, Pedram M, Pinkston TM. Simulation of noc power-gating: Requirements, optimizations, and the agate simulator. Journal of Parallel and Distributed Computing 2016;95:69–78. Special Issue on Energy Efficient Multi-Core and Many-Core Systems, Part I.
[25] Hu Z, Buyuktosunoglu A, Srinivasan V, Zyuban V, Jacobson H, Bose P. Microarchitectural techniques for power gating of execution units. In: Proceedings of the 2004 international symposium on low power electronics and design. ACM; 2004. p. 32–7.
Mohammad Baharloo is a postdoctoral fellow in the School of Computer Science at the Institute for Research in Fundamental Sciences (IPM). He received his Ph.D., M.S., and B.S. degrees in computer engineering from University of Tehran, Sharif University of Technology, and Bahonar University of Kerman, respectively. His research interests are hybrid-NoC, MPSoCs and GPUs.

Rashid Aligholipour received his B.S. degree with distinction in computer hardware from Islamic Azad University, Khoy branch, Iran, and his M.S. degree from Isfahan University of Technology, Iran, in 2014 and 2018, respectively. His research interests are in the area of computer architecture, many-core processors, GPUs, high performance computing systems and parallel programming.

Meisam Abdollahi is a PhD student who joined the Dependable System Design (DSD) Laboratory in the School of Electrical and Computer Engineering at the University of Tehran, Iran, in Fall 2012. He received his BSc and MSc degrees in computer hardware engineering (major computer architecture). He is currently working on reliable task mapping on optoelectrical network-on-chip platforms.

Ahmad Khonsari received the BS, MS, and PhD degrees in electrical and computer engineering from Shahid-Beheshti University, Iran University of Science and Technology, and University of Glasgow, respectively. He is currently an associate professor in the Department of ECE, University of Tehran, Iran. He has been with the School of Computer Science, Institute for Research in Fundamental Sciences (IPM).