Microprocessors and Microsystems 35 (2011) 766–778
Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro
A reliable and power efficient flow-control method to eliminate crosstalk faults in network-on-chips ☆
Ahmad Patooghy, Seyed Ghassem Miremadi ⇑, Hamed Tabkhi
Dependable Systems Laboratory, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Article info
Article history: Available online 26 August 2011
Keywords: Network-on-chip; Crosstalk; Reliability; Power consumption
Abstract
This paper proposes a power-efficient flow-control method to tackle the problem of crosstalk faults in Network-on-Chips (NoCs). The method, called FRR (Flit Reordering/Rotation), combines three coding mechanisms to entirely eliminate opposite direction transitions (OD transitions), the source of crosstalk faults in NoC communication channels. The first mechanism, called flit-reordering, reorders the flits of every packet to find a flit sequence which produces the lowest number of OD transitions on NoC channels. The second mechanism, called flit-rotation, logically rotates the content of every flit of the packet with respect to the previously sent flit to further reduce the number of OD transitions. Finally, the third mechanism, called flit-insertion, examines the flits of the packet to find the OD transitions which are not removed by the first and second mechanisms, and inserts null-flits between the affected flits to completely eliminate OD transitions from NoC channels. The FRR method is evaluated in two ways: (1) VHDL-based simulations are carried out for 16- and 32-bit channels with the maximum number of reorderings and rotations in the first and second mechanisms limited to 2, 4, and 8. (2) An analytical model is developed to calculate and compare the expected number of OD transitions in an unprotected NoC and in an FRR-enabled NoC. Both simulation and analytical results confirm that the FRR method completely removes crosstalk faults from NoC channels. In addition, VHDL simulations show that the FRR method provides a remarkable power saving, since it reduces the number of transitions in NoC channels by at least 32.8%.
Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved.
☆ This paper is an extension of the work presented in [34,40]. The extension includes: (i) The methods presented in [34,40] are crosstalk mitigation methods. In this paper, the following two mechanisms are added to what is proposed in [34,40] to reach crosstalk elimination rather than crosstalk mitigation. (a) A combination of the flit reordering [34] and flit rotation [40] mechanisms is added to achieve higher crosstalk mitigation in the proposed method. (b) Null-flit insertion is incorporated to raise the capability of the proposed method from crosstalk mitigation to crosstalk elimination. (ii) An analytical crosstalk model is proposed to estimate the reliability of an unprotected NoC as well as an FR2-enabled NoC. (iii) The proposed method is evaluated in terms of power consumption, area overhead and timing overhead using a wide range of HDL simulations. This work was partially supported by a grant from Iran Telecommunication Research Center (ITRC).
⇑ Corresponding author. E-mail addresses: [email protected] (A. Patooghy), [email protected] (S.G. Miremadi), [email protected] (H. Tabkhi).
0141-9331/$ - see front matter Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2011.08.004
A. Patooghy et al. / Microprocessors and Microsystems 35 (2011) 766–778

1. Introduction
Recent advances in VLSI technologies have enabled designers to accommodate tens of IP blocks such as processing cores, memory modules, and I/O interfaces in a single chip [38]. Communication between these blocks is a key feature which seriously affects the chip performance. Traditional communication architectures, e.g., point-to-point, shared-bus, and segmented-bus architectures, are not efficient solutions due to their high cost [41], performance bottleneck [42], or lack of scalability [2]. Network-on-Chip (NoC) has been proposed [1,2] as a scalable, cost-efficient communication architecture for such chips. In the NoC context, a core sends packetized data to other cores through on-chip switches which connect the cores according to a predefined structure called a topology, e.g., mesh or torus. It has been shown that NoCs are highly sensitive to transient faults due to the use of nano-scale VLSI technologies in their fabrication process [7,8]. Crosstalk [10,11], particle strikes [9], electro-magnetic interference [31], and power supply disturbances [31] are the most important sources of transient faults which affect the correct functionality of NoCs. Among these, crosstalk makes the major contribution to errors in NoCs [14,22]. Crosstalk happens because of coupling capacitances formed between adjacent wires of communication channels in NoCs. The coupling capacitances may result in an undesired transition on a victim wire when desired transitions appear on the neighboring wires of the victim wire [9,15]. Such coupling capacitances have negative impacts on the delay, power consumption, and signal integrity of data transmission in NoCs [16]. The high sensitivity of NoCs to crosstalk faults makes reliability one of the main concerns in the design of these products. In this
regard, several methods have been proposed in the literature to mitigate the effects of crosstalk faults. These methods can be divided into the following four categories based on the level of abstraction at which they work.
(1) Methods at the lowest level of design abstraction, i.e., the layout level, reduce the rate of crosstalk faults through a crosstalk-aware fabrication process. As an example, specialized routing strategies [16] route on-chip wires in a way that minimizes the coupling capacitances between adjacent wires. The shielding method is another example at this abstraction level; it inserts shield wires between the wires of communication channels. Insertion of the shield wires reduces the rate of adjacent transitions in opposite directions, which in turn lessens the probability of crosstalk occurrence [18,19]. Although layout level methods reduce the rate of crosstalk faults, they impose modifications on the fabrication process of the chip, which has negative impacts on the design time and production cost of chips.
(2) At a higher level of design abstraction, i.e., the transistor level, intentionally skewing signal transition timings on adjacent wires has been proposed to reduce the delay effects of crosstalk. However, this method is only applicable to repeater-enabled communication channels [17].
(3) At the next higher level, the RTL level, data coding is widely used to reduce the rate of crosstalk faults in on-chip channels. In a coding technique, n bits of data are mapped to k bits of code such that the probability of crosstalk occurrence for the coded data is lower than for the original data. Delay reduction codes [20,21] and crosstalk-avoidance codes [7,22,33] are examples of coding techniques that try to suppress the effects of crosstalk faults. Fibonacci-based coding methods are also proposed [36,37] to remove the harmful bit sequences ‘101’ and ‘010’ from all packets traversing the NoC to reach crosstalk prevention.
In this regard, a recursive algorithm is proposed to generate a Fibonacci codebook without any ‘101’ or ‘010’ patterns. However, Fibonacci-based methods suffer from the high complexity of their coding algorithm, especially when the width of the communication channel grows [23]. Data coding has also been utilized in flow-control methods to detect and correct crosstalk faults in either an end-to-end or a switch-to-switch manner [3,4]. In an end-to-end flow-control method, the source node adds error detection codes, e.g., parity or cyclic redundancy checks, to each packet and the destination node checks the integrity of the packet. If an error is detected, the source node is requested to resend the packet. In a switch-to-switch method, data correctness checking is performed whenever a flit (a packet is divided into fixed-size units called flits) reaches the next node. If an erroneous flit is detected, a NACK signal is sent to the sender node to indicate that the flit should be retransmitted. In this situation, the sender node stops sending the next flit and resends the requested flit.
(4) At the highest level of design abstraction, flood-based routing algorithms are proposed to tolerate transient faults (including crosstalk) using packet redundancy [12,13]. In these algorithms, whenever a new packet is received, the receiver node chooses a subset of its adjacent nodes and sends the packet to them. In the next round, the selected nodes which have already received the packet spread the packet in the same manner. The destination node is then able to use multiple copies of the same packet to overcome data errors that may have occurred during transmission.
Although several methods have been proposed to enhance the reliability of NoCs against crosstalk faults, these methods inadvertently affect other NoC parameters including performance, power consumption, area occupation, or production cost. For example: (1) layout level methods complicate the fabrication process of the chip, which increases its production cost; (2) in real-world applications where the width of channels is at least 64 bits, Fibonacci-based coding methods impose a noticeable performance overhead due to their encoder modules [36]; and (3) flood-based methods impose high power overheads on the NoC because of their aggressive packet redundancy [39].
This paper proposes a power efficient flow-control method to overcome the problem of crosstalk faults in NoC channels. The main advantage of this method is that it simultaneously provides reliability enhancement in packet transmission and power reduction for packet delivery (see Section 5). The method, called FRR (Flit Reordering/Rotation), utilizes three mechanisms to prevent crosstalk occurrences in NoC channels. In the first mechanism, called flit-reordering, the flits of each newly generated packet are reordered to minimize the number of opposite direction transitions appearing between consecutive flits of the packet. To do this, the packet is divided into non-overlapping windows of flits, and for each window a sequence of flits is sought which produces the lowest number of opposite direction transitions. The second mechanism, called flit-rotation, logically rotates the content of flits to achieve a further reduction in the number of opposite direction transitions between consecutive flits. In the third mechanism, called flit-insertion, the flits of the packet are examined to find the opposite direction transitions which are not removed by the flit-reordering and flit-rotation mechanisms. The third mechanism inserts null-flits between the affected flits to completely eliminate opposite direction transitions from NoC channels.
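The effect of the third mechanism can be illustrated with a minimal Python sketch. It is not the authors' hardware design: it assumes, purely for illustration, that a null-flit is an all-zero word, and the helper names `od_transitions` and `insert_null_flits` are hypothetical. Under that assumption, any flit followed by the null-flit produces only falling transitions and the null-flit followed by any flit produces only rising transitions, so no pair of adjacent wires ever switches in opposite directions. How the receiver tells null-flits apart from data flits is not modeled here.

```python
def od_transitions(prev_flit: int, cur_flit: int, width: int) -> int:
    """Count adjacent wire pairs that switch in opposite directions."""
    mask = (1 << width) - 1
    rising = ~prev_flit & cur_flit & mask    # wires going 0 -> 1
    falling = prev_flit & ~cur_flit & mask   # wires going 1 -> 0
    # Wire i rises while wire i+1 falls, or wire i falls while wire i+1 rises.
    pairs = (rising & (falling >> 1)) | (falling & (rising >> 1))
    return bin(pairs).count("1")


def insert_null_flits(flits, width, null_flit=0):
    """Insert an all-zero null-flit (an assumption of this sketch) wherever
    two consecutive flits would still cause an OD transition."""
    out = []
    prev = null_flit  # assume the channel idles at the null value
    for flit in flits:
        if od_transitions(prev, flit, width) > 0:
            out.append(null_flit)
        out.append(flit)
        prev = flit
    return out


packet = [0b0101, 0b1010, 0b1010, 0b1111]
protected = insert_null_flits(packet, width=4)
print([format(f, "04b") for f in protected])
# -> ['0101', '0000', '1010', '1010', '1111']
```

Only one null-flit is needed in this example, because the other consecutive flit pairs are already free of OD transitions; the fewer OD transitions survive the first two mechanisms, the fewer null-flits the third one has to insert.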
VHDL-based simulation as well as analytical modeling is carried out over a wide range of working conditions to evaluate the FRR method. Both simulation and analytical results confirm that the FRR method completely removes crosstalk faults from NoC channels. In addition, VHDL simulations show that the FRR method provides a remarkable power saving, because the method reduces the number of transitions in NoC channels by at least 32.8%. The rest of the paper is organized as follows. Section 2 discusses how flow-control methods improve the reliability of packet transmission in NoCs. An analytical discussion of the probability of crosstalk occurrence in a typical NoC channel is presented in Section 3. The proposed method is introduced and evaluated in Sections 4 and 5 respectively, and finally Section 6 concludes the paper.
2. Flow-control methods
Flow-control methods are widely exploited to improve the reliability of packet transmission in network-on-chips [6,26]. These methods add information redundancy to the packets traversing the network and use the redundancy to check the integrity of packets. Based on the frequency of correctness checking, flow-control methods are divided into Switch-to-Switch and End-to-End categories [3,4,6,26]. In a switch-to-switch flow-control method, the sender node adds information redundancy, e.g., parity or CRC bits, to the flits of the packet and the receiver node checks the integrity of each received flit separately. After sending a flit, the sender switch keeps a copy of the flit in a retransmission buffer until the receiving switch activates an ACK (or NACK) signal. If the ACK signal is activated, the sender switch sends the next flit of the packet; otherwise, the flit is retransmitted from the corresponding retransmission buffer. The receiver switch checks the correctness of each newly received flit and sends an ACK/NACK signal to the sender switch based on the result of the check. Due to the frequent flit
correctness checking, switch-to-switch methods have rather low latency in detecting errors. In other words, these methods do not allow erroneous flits to propagate through the network. This is achieved by adding encoder/decoder modules to all output/input channels of the NoC. Fig. 1 shows the foundation of a typical switch-to-switch flow-control method. As shown in Fig. 1, retransmission buffers R1 to Rn, an encoder module, and a communication controller module are added to each output channel of the NoC, and a decoder module and a communication controller module are added to each input channel.

Fig. 1. Foundation of a typical switch-to-switch flow-control method. VC0 to VCk and R1 to Rn are flit buffers and retransmission buffers respectively.

End-to-end methods are the other option for controlling the flow of packets in NoCs. In these methods the source node adds error detection codes to the packet (e.g., the last flit of the packet contains the error detection code), and data integrity checking is performed for the whole packet when its last flit reaches the destination node, instead of checking flits separately in each intermediate node. If the destination node detects an erroneous packet, a control packet (containing the sequence number of the erroneous packet) is sent to the source node requesting it to resend the packet. Although end-to-end methods need hardware support as well, their hardware requirements are much lower than those of switch-to-switch methods. End-to-end methods require one encoder and one decoder module per switch of the NoC, instead of one per channel as in switch-to-switch methods. In addition, retransmission buffers are not required in end-to-end methods, because it is reasonable to assume that the packet can be regenerated by the corresponding application in the source node. Fig. 2 shows the hardware support needed in a typical end-to-end flow-control method. Lower hardware requirements reduce the power overhead of end-to-end methods as compared to that of switch-to-switch methods [5].

Fig. 2. Foundation of a typical end-to-end flow-control method.

A comprehensive study on the efficiency of flow-control methods is presented in [39], which confirms that flow-control methods can effectively improve the reliability of packet transmission in NoCs. This paper proposes a power efficient flow-control method to improve the reliability of packet transmission in NoCs against crosstalk faults.
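The switch-to-switch handshake described in this section can be sketched in a few lines of Python. This is a behavioral illustration only, not the hardware of Fig. 1: the single even-parity bit, the fault model (at most one flipped wire per transfer), and the names `send_flit`, `receiver_check`, and `transmit_packet` are assumptions made for the example.

```python
import random


def parity(bits: int) -> int:
    """Even-parity bit of an integer word."""
    return bin(bits).count("1") & 1


def send_flit(flit, width, fault_prob, rng):
    """Hypothetical channel: appends a parity bit and, with probability
    fault_prob, flips exactly one wire (data or parity)."""
    word = (flit << 1) | parity(flit)
    if rng.random() < fault_prob:
        word ^= 1 << rng.randrange(width + 1)
    return word


def receiver_check(word):
    """Return (ack, flit); ack is False (NACK) when the parity check fails."""
    flit, received_parity = word >> 1, word & 1
    return parity(flit) == received_parity, flit


def transmit_packet(flits, width=8, fault_prob=0.2, seed=0):
    rng = random.Random(seed)
    delivered, attempts = [], 0
    for flit in flits:
        retransmission_buffer = flit  # copy kept until an ACK arrives
        while True:
            attempts += 1
            word = send_flit(retransmission_buffer, width, fault_prob, rng)
            ack, received = receiver_check(word)
            if ack:
                delivered.append(received)
                break  # ACK: the sender moves on to the next flit
            # NACK: loop again and resend from the retransmission buffer
    return delivered, attempts


flits = [0x3C, 0xA5, 0x00, 0xFF]
delivered, attempts = transmit_packet(flits, fault_prob=0.3, seed=1)
print(delivered == flits, attempts)
```

Because a single flipped wire always changes the parity, every injected fault in this model is detected and repaired by a retransmission; the price is the extra channel activity counted in `attempts`, which is the kind of overhead the proposed method aims to avoid by preventing the faults in the first place.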
3. Crosstalk analysis
This section presents (1) the related work on crosstalk faults in on-chip communication channels, and (2) an analysis of the probability of crosstalk occurrence in NoC communication channels. We use the following notation to represent the different types of transitions which may appear on a single-bit communication channel: symbols ↑ and ↓ represent the transitions 0 → 1 and 1 → 0 respectively, and symbols × and – represent a don't-care transition and no transition.
Generally, crosstalk faults happen because of coupling capacitances formed between adjacent wires of communication channels in NoCs. The presence of coupling capacitances causes unwanted correlation between the wires of NoC channels, i.e., signal transitions on some wires of an NoC channel may inadvertently affect other wires [10,11]. The affected wires, which are called victim wires, encounter either delay in their rising/falling edges or unwanted transitions [15,16]. The effects of crosstalk faults directly depend on the signal transitions appearing on the wires of NoC channels. Investigations show [15,21,23] that the ↓↑ and ↑↓ transition patterns happening on two adjacent wires are the main source of crosstalk faults in NoC channels. Although the patterns ↓↑↓ and ↑↓↑ can also produce crosstalk faults, they can be considered as overlapped sequences of ↓↑ and ↑↓ transition patterns. In other words, preventing the ↓↑ and ↑↓ transition patterns directly prevents the ↓↑↓ and ↑↓↑ transition patterns. Consequently, researchers try to prevent or decrease the rate of ↓↑ and ↑↓ transition patterns to increase the immunity of NoC channels to crosstalk faults. Crosstalk avoidance codes (CACs) [20–22] reduce the crosstalk probability by avoiding some specific patterns of transitions. The authors of [15] have shown that the delay effects of crosstalk faults can be mitigated by preventing ↓↑ and ↑↓ transitions on two adjacent wires of a communication channel.
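As an illustration of the notation, the following Python sketch counts the ↑↓ and ↓↑ patterns that appear on adjacent wires when one flit follows another. Flits are modeled as integers whose bit i is the value on wire i; the function name `od_transitions` and this bit-level representation are assumptions of the sketch, not part of the cited work.

```python
def od_transitions(prev_flit: int, cur_flit: int, width: int) -> int:
    """Count adjacent wire pairs that switch in opposite directions
    when cur_flit follows prev_flit on a width-bit channel."""
    mask = (1 << width) - 1
    rising = ~prev_flit & cur_flit & mask    # wires with a 0 -> 1 transition
    falling = prev_flit & ~cur_flit & mask   # wires with a 1 -> 0 transition
    # Bit i of up_down is set when wire i rises and wire i+1 falls.
    up_down = rising & (falling >> 1)
    # Bit i of down_up is set when wire i falls and wire i+1 rises.
    down_up = falling & (rising >> 1)
    return bin(up_down | down_up).count("1")


print(od_transitions(0b0101, 0b1010, 4))  # every adjacent pair is opposite -> 3
print(od_transitions(0b0000, 0b1111, 4))  # all wires rise together -> 0
```

The three-wire patterns ↓↑↓ and ↑↓↑ need no separate handling here: each of them contains two overlapping opposite-direction pairs, so a count of zero for the two-wire patterns implies that the three-wire patterns are absent as well.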
Based on [23], prevention of the patterns ↓↑, –↑–, ↑–↑, and their complements minimizes the delay effects of crosstalk faults. Duplicate-add-parity [24,25] is proposed to reduce the probability of crosstalk occurrence by duplicating the bits of each flit and adding a parity bit to the duplicated data. Duplicate-add-parity requires expanding the communication channel to twice the flit width plus one, e.g., for flits of K bits, duplicate-add-parity needs a channel of 2K + 1 bits. As shown in Fig. 3, the encoder of duplicate-add-parity makes a redundant copy of each flit and adds a one-bit parity to the redundant flit. The original flit, the redundant flit, and the parity bit are sent through the communication channel. On the decoder side, the parity bit is regenerated and compared with the received one. The comparison determines which part of the received data should be stored and which part should be dropped. A close scrutiny of the duplicate-add-parity method reveals that although the ↑↓↑, –↑–, and ↑–↑ transition patterns are prevented, duplicate-add-parity is not able to eliminate or reduce the ↑↓ and ↓↑ patterns. That means the method is still vulnerable to crosstalk faults [15]. The boundary shift code scheme proposed in [15] attempts to reduce crosstalk-induced delay by avoiding a shared boundary between successive flits. This method is very similar to the duplicate-add-parity method since it uses flit duplication and one parity bit to achieve crosstalk avoidance and single-error correction. However, the boundary shift code method places the parity bit on the opposite side of the double-width flit at each clock cycle. This is done to avoid dependent boundaries in subsequent flits.

Fig. 3. Encoder (A) and decoder (B) of the duplicate-add-parity method for an NoC with channel width of 9 bits.

Pairs of opposite direction transitions, called OD transitions hereafter, can be eliminated by avoiding the bit sequences ‘010’ and ‘101’ in all flits traversing NoC channels [20]. However, prevention of the bit sequences ‘010’ and ‘101’ requires complex encoders, especially when the width of the communication channel grows [23]. Partial coding is proposed to tackle this problem; in this approach, the communication channel is broken into several sub-channels with smaller widths. Each sub-channel is encoded separately and then the sub-channels are combined such that the probability of crosstalk occurrence at the boundaries of sub-channels is minimized [20,21].
To discuss the probability of crosstalk occurrence in an unprotected communication channel, we calculate the expected number of OD transitions, i.e., ↓↑ and ↑↓, in a communication channel with a width of K bits. Obviously, the probabilities of the other harmful transition patterns mentioned above, i.e., ↑↓↑ and ↓↑↓, are directly related to those of the ↑↓ and ↓↑ patterns. Fig. 4 shows all possible transitions appearing on a 2-bit communication channel when two consecutive flits f0 and f1 pass through the channel. Flits f0 and f1 have a width of 2 bits as well. As an example, in Fig. 4A, flit f0 is assumed to be ‘00’ while flit f1 can be any of its possible combinations. The transitions appearing on the channel in this situation are depicted on the right-hand side of flit f1 in Fig. 4A. Considering the high rate of data transmission in the communication channels of an NoC, we can ignore the correlation between the flits f0 and f1 [15,32]. This means we can assume that each bit of a flit gets its value independently of other bits and flits. Let the probability of a single bit in a flit being ‘0’ be P0 and the probability of being ‘1’ be P1 = 1 − P0. Table 1 represents the frequency and
probability of occurrence for the transition pairs of Fig. 4. In the last column of Table 1, the probabilities of occurrence are calculated under the assumption P0 = P1 = 1/2.

Table 1. Transition pairs appearing on a 2-bit communication channel, their frequencies and their probabilities.
Symbol   Transition pair   Frequency of occurrence   Probability of occurrence          Probability of occurrence (assuming P0 = P1 = 1/2)
I0       ––                4                         [P0^2 + (1 − P0)^2]^2              1/4
I1       –↑                2                         P0(1 − P0)[P0^2 + (1 − P0)^2]      1/8
I2       ↑–                2                         P0(1 − P0)[P0^2 + (1 − P0)^2]      1/8
I3       ↑↑                1                         P0^2 (1 − P0)^2                    1/16
I4       –↓                2                         P0(1 − P0)[P0^2 + (1 − P0)^2]      1/8
I5       ↑↓                1                         P0^2 (1 − P0)^2                    1/16
I6       ↓–                2                         P0(1 − P0)[P0^2 + (1 − P0)^2]      1/8
I7       ↓↑                1                         P0^2 (1 − P0)^2                    1/16
I8       ↓↓                1                         P0^2 (1 − P0)^2                    1/16

As shown in Table 1, some of the transition pairs, i.e., I5 and I7, can produce crosstalk faults since they lead to an OD transition on two adjacent wires of the communication channel. However, widening the communication channel reveals further possibilities which may cause crosstalk faults. For the sake of clarity, suppose transition pairs I1 and I8 occur next to each other in a 4-bit channel. In this situation, the channel experiences the transition sequence ‘–↑↓↓’, which has an OD transition at the boundary between the two transition pairs. Generally, if one of the transition pairs {I1, I3, I7} appears as the left neighbor of one of the transition pairs {I6, I7, I8}, the resulting transition sequence includes the OD pair ‘↑↓’ at the boundary of the transition pairs. Similarly, the sequence ‘↓↑’ appears at the boundary if one of the transition pairs {I4, I5, I8} is the left neighbor of one of the transition pairs {I2, I3, I5}. The boundary sequences ‘↑↓’ and ‘↓↑’, which are referred to as S1 and S2, have the following probabilities of occurrence:
\[ P_{S1} = P(\{I_1, I_3, I_7\}) \cdot P(\{I_6, I_7, I_8\}) = P_0^2 (1 - P_0)^2 \tag{1} \]
\[ P_{S2} = P(\{I_4, I_5, I_8\}) \cdot P(\{I_2, I_3, I_5\}) = P_0^2 (1 - P_0)^2 \tag{2} \]
where, for example, P({I1, I3, I7}) = P_I1 + P_I3 + P_I7, and each term can be calculated according to Table 1. In a communication channel with a width of K bits, there are a total of K/2 transition pairs and K/2 − 1 boundary transition pairs (see Fig. 5). For no pair of OD transitions to appear on a K-bit channel, both of the following conditions must be satisfied:
(1) None of the transition pairs appearing on the communication channel may be I5 or I7. Fig. 5 shows the transition pairs that appear on the communication channel when a K-bit flit f1 follows a K-bit flit f0. According to Fig. 5, transition pair TP1 appears when the first two bits of f1, Y1Y2, follow the first two bits of f0, X1X2.
(2) The transition pairs appearing on the communication channel must produce neither S1 nor S2, so that no OD transition occurs in a boundary transition pair. In Fig. 5, boundary transition pair bi is composed of the right transition of transition pair TPi and the left transition of transition pair TPi+1.

Fig. 4. All possible transitions appearing on a 2-bit channel when flits f0 and f1 pass through the channel.
Fig. 5. Transition pairs and boundary transitions appearing on a K-bit communication channel.

Let the random variable L represent the number of OD transition pairs appearing on an unprotected communication channel. Based on the abovementioned conditions, the probability that no pair of OD transitions appears on a K-bit channel, i.e., P^K_Unprotected(L = 0), can be written as:
\[ P^{K}_{\mathrm{Unprotected}}(L = 0) = \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2} (1 - P_{S1})^{K/2 - 1} \tag{3} \]
which considers that P_S1 = P_S2 (see Eqs. (1) and (2)). The probability that exactly one pair of OD transitions appears on the channel is:
\[
\begin{aligned}
P^{K}_{\mathrm{Unprotected}}(L = 1) ={}& \binom{K/2}{1} (P_{I_5} + P_{I_7}) \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2 - 1} (1 - P_{S1})^{K/2 - 1} \\
&+ \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2} \binom{K/2 - 1}{1} P_{S1} (1 - P_{S1})^{K/2 - 2}
\end{aligned}
\tag{4}
\]
The above equation considers that exactly one OD transition pair appears on the communication channel in each of the following situations: (1) Exactly one of the transition pairs TP1 to TPK/2 is I5 or I7, and none of the boundary transition pairs produces S1 or S2. (2) Exactly one of the boundary transition pairs b1 to b(K/2)−1 produces S1 or S2, and none of the transition pairs TP1 to TPK/2 is I5 or I7. Extending the above equation, we can calculate the probability that exactly i pairs of OD transitions appear on the channel by the use of:
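\[
P^{K}_{\mathrm{Unprotected}}(L = i) = \sum_{m=0}^{i} \binom{K/2}{m} (P_{I_5} + P_{I_7})^{m} \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2 - m} \binom{K/2 - 1}{i - m} P_{S1}^{\,i - m} (1 - P_{S1})^{K/2 - i - 1 + m}
\tag{5}
\]
where i ≤ K/2. Finally, the expected number of OD transition pairs appearing on a K-bit communication channel is: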
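\[
\begin{aligned}
E^{K}_{\mathrm{Unprotected}}(OD) &= \sum_{i=0}^{K/2} i \cdot P^{K}_{\mathrm{Unprotected}}(L = i) \\
&= \sum_{i=0}^{K/2} i \sum_{m=0}^{i} \binom{K/2}{m} (P_{I_5} + P_{I_7})^{m} \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2 - m} \binom{K/2 - 1}{i - m} P_{S1}^{\,i - m} (1 - P_{S1})^{K/2 - i - 1 + m}
\end{aligned}
\tag{6}
\]
Now, we can use the above equation to calculate the average number of OD transitions in a K-bit channel when Q bits of data are transmitted through the channel by: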
\[ \mathrm{Average}^{K}_{\mathrm{Unprotected}}(OD) = \frac{Q}{K} \cdot E^{K}_{\mathrm{Unprotected}}(OD) \tag{7} \]
which helps us to compare the expected number of OD transitions in an unprotected channel with that of a channel exploiting the proposed method (see Section 5).

4. The proposed method
As discussed earlier, the rate of crosstalk faults in NoC channels can be effectively decreased by preventing specific transition patterns, i.e., ↓↑ and ↑↓ [15,21,23]. This section presents the proposed flow-control method, called FRR (Flit Reordering/Rotation), which totally eliminates the appearance of OD transitions on NoC channels. To reach this aim, the FRR method exploits three mechanisms, namely flit-reordering, flit-rotation and flit-insertion. The first two mechanisms, i.e., flit-reordering and flit-rotation, reduce the rate of OD transitions on the channel with respect to an unprotected channel, while the third mechanism, i.e., flit-insertion, eliminates the remaining OD transitions from NoC channels. In the first mechanism, the flits of every newly generated packet are examined and reordered to find a sequence of flits which produces the lowest number of OD transitions between consecutive flits of the packet. The second mechanism, i.e., flit-rotation, logically rotates the content of each flit with respect to the flit that previously passed through the channel to further reduce the number of OD transitions on the channel. Once a packet has been reordered and rotated by the first and second mechanisms, the third mechanism examines the flits of the packet to find those OD transitions which are not removed by the first and second mechanisms. The third mechanism inserts null-flits between the affected flits to completely remove opposite direction transitions from NoC channels. In the rest of this section we describe the hardware aspects of the three mechanisms.

4.1. Flit reordering
The first mechanism of the FRR method reorders the flits of each newly generated packet to reduce the rate of OD transitions appearing on the communication channels of the NoC.
Flit-reordering is done at the time of injecting the packet into the network, with at most a few cycles of delay. The encoder of the flit-reordering mechanism divides the flits of the packet into non-overlapping windows. The flits of each window are then examined separately to find a sequence of flits with the minimum number of OD transitions. Fig. 6A shows how the flits of a packet are divided into h non-overlapping windows and Fig. 6B shows how flits may be reordered by the flit-reordering encoder. For instance, the first flit of the first window, flit f1,W1, is moved by the encoder to the last position of the first window (compare Fig. 6A with B). Although the packet shown in Fig. 6A is divided into h windows of n flits, the flit-reordering encoder supports the case in which the last window contains fewer than n flits. The flit-reordering encoder adds tag bits to the flits to enable the flit-reordering decoder to restore the original order of flits at the destination node.

Fig. 6. Flits of the packet are divided into h non-overlapping windows (A), then flits in each window are reordered separately (B).

When the flits of all windows have been reordered by the encoder, the reordered packet is passed through the output port of the encoder. From the NoC point of view, this is an ordinary packet which will be delivered to its corresponding destination. At the destination node, the decoder of the flit-reordering mechanism uses the tag bits and rearranges the flits to recover the original packet. Obviously, the same window size is used on both the encoder and decoder sides. The flit-reordering encoder considers the last flit of window i when it is reordering the flits of window i + 1. This is done by means of an extra flit buffer, namely the Previously Sent Flit buffer (PSFreordering), which is added to the architecture of the flit-reordering encoder. Fig. 7 presents the block diagram of the flit-reordering encoder. The flit-reordering encoder is composed of n flit buffers, n dirty bit flip-flops, n OD transition detector modules, a PSFreordering buffer, and a Minimum Detector module, where n is the size of the reordering window.

Fig. 7. Block diagram of the flit-reordering encoder which mitigates the rate of opposite direction transitions on NoC channels.

For every window of n flits, n rounds of competition are needed to find the best sequence of flits. In each round of competition, which lasts one cycle, the flits with dirty bits of ‘0’ are examined to find the flit which produces the lowest number of OD transitions with respect to the previously sent flit. This flit, the so-called winner flit, is chosen by the Minimum Detector module to pass the encoder in the next cycle. Note that OD transitions which are produced by tag bits are also considered in the winner flit selection. After selecting the winner flit, the following tasks are done to initialize the encoder for the next round of competition: (1) the winner flit is sent through the output port of the encoder, (2) a copy of the winner flit is sent to the PSFreordering buffer, and (3) the dirty bit of the winner flit is set to ‘1’ to stop the winner flit from taking part in the next competition.
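The round-by-round competition can be mimicked in software as follows. This Python sketch is a behavioral illustration under simplifying assumptions, not the encoder of Fig. 7: the tag is the flit's original index within the window, the OD transitions caused by the tag wires are ignored in the cost, ties are broken in favor of the lowest index, and the function names are hypothetical.

```python
def od_transitions(prev_flit: int, cur_flit: int, width: int) -> int:
    """Count adjacent wire pairs that switch in opposite directions."""
    mask = (1 << width) - 1
    rising = ~prev_flit & cur_flit & mask
    falling = prev_flit & ~cur_flit & mask
    pairs = (rising & (falling >> 1)) | (falling & (rising >> 1))
    return bin(pairs).count("1")


def reorder_window(window, psf, width):
    """Run one competition round per flit; return tagged flits and new PSF."""
    dirty = [False] * len(window)   # models the dirty bit flip-flops
    tagged = []
    for _ in range(len(window)):
        candidates = [i for i in range(len(window)) if not dirty[i]]
        # Minimum Detector: fewest OD transitions w.r.t. previously sent flit.
        winner = min(candidates,
                     key=lambda i: od_transitions(psf, window[i], width))
        tagged.append((winner, window[winner]))  # (1) send winner with its tag
        psf = window[winner]                     # (2) update the PSF buffer
        dirty[winner] = True                     # (3) exclude it from now on
    return tagged, psf


def restore_window(tagged):
    """Decoder side: use the tags to recover the original flit order."""
    return [flit for _, flit in sorted(tagged)]


def total_od(flits, psf, width):
    total = 0
    for flit in flits:
        total += od_transitions(psf, flit, width)
        psf = flit
    return total


window = [0b1010, 0b0101, 0b1111, 0b0000]
tagged, _ = reorder_window(window, psf=0b0101, width=4)
sent = [flit for _, flit in tagged]
print(total_od(window, 0b0101, 4), "->", total_od(sent, 0b0101, 4))  # 6 -> 0
```

Each round commits to the locally best flit, so the selection is greedy: it needs only n rounds per window, but it does not necessarily find the ordering with the globally smallest number of OD transitions among all n! permutations.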
To minimize the performance overhead of the flit-reordering encoder, the flits of the packet are loaded into the buffers of the encoder in a pipelined fashion. In this way, the lowest possible performance overhead, a delay of n cycles for an encoder with a window size of n flits, is achieved. However, to keep the flits of the next window out of the competitions of the current window, the flits of the next window are loaded into the flit-reordering encoder with dirty bits of ‘1’. When all n flits of the next window have been loaded into the flit-reordering buffers, the Next Window Detector module, referred to in Fig. 7 as the NW detector, resets all n dirty bits at the same time. At this point the flit-reordering encoder starts processing the next window. Reordering of flits is done only by the source and destination nodes, i.e., the intermediate nodes do not take part in the reordering, so the flit-reordering mechanism can be considered an end-to-end flow-control method. As we discussed earlier, in end-to-end
flow-control methods one code module is needed per node of the NoC. This means that the hardware overhead of the flit-reordering mechanism is very low (for more details see Section 5). It should be noted that the flit-reordering mechanism requires some additional wires in the communication channel to send the tag bits of the flits. The number of these additional wires is log2(n), where n is the size of the reordering window.

4.2. Flit rotation

After the flits of a packet have been reordered by the flit-reordering encoder, the second mechanism, i.e., flit-rotation, is applied to the flits of the packet to further reduce the number of OD transitions. The flit-rotation encoder creates m rotated versions of every flit that is going to be sent through the channel. These versions are examined to select the one that produces the minimum number of OD transitions on the channel. For the sake of clarity, let f be the flit that is currently going to be sent through the channel. fi is defined as the i-bit left-rotated version of flit f, where 0 ≤ i < m. For every i with 0 ≤ i < m, the flit-rotation encoder computes the number of OD transitions that would appear on the channel if flit fi were sent. The flit fj which produces the lowest number of OD transitions is selected as the winner flit and sent through the channel. Fig. 8 shows the block diagram of the encoder of the flit-rotation mechanism. The encoder module is designed to minimize the timing overhead of the flit-rotation mechanism. To do this, in each clock cycle one flit of the packet, referred to in Fig. 8 as the current flit to send (CFS), is examined and encoded. The CFS is encoded with respect to the flit that previously passed the encoder, i.e., the previously sent flit (PSFrotation). To recover the original flit at the destination node, a tag field is added to the winner flit to specify the number of rotations which should be applied to the winner flit at the destination node. Since m versions of the flit are examined, log2(m) bits are required for the tag field. Note that each of the first and second mechanisms has its own previously sent flit buffer as well as its own tag field to enable the destination node to recover the original packet. According to Fig. 8, Extractor module #i is a combinational logic block which counts the number of OD transitions between flit fi−1 and the previously sent flit. The Minimum Detector module selects the winner flit and allows the corresponding Extractor module #i to write flit fi−1 to the output port of the encoder. The other extractors are disconnected from the output port at this time. To take the tag bits into account, Extractor module #i appends the appropriate tag bits to flit fi−1 before it computes the number of OD transitions for flit fi−1. Although there are X − 1 rotated versions for an X-bit channel, to minimize the area and power consumption overheads of the flit-rotation encoder we used m versions of each flit, where m = 2, 4, 8, and 16. The effects of different rotation sizes on the effectiveness and overheads of the flit-rotation encoder are studied in Section 5. Flits which leave the flit-rotation encoder are fed into the third encoder, which is described in the following subsection.

4.3. Flit insertion

The third mechanism used in the FRR method, named flit-insertion, eliminates the OD transitions on NoC communication channels. This mechanism examines the flits of the packet to find those OD transitions between consecutive flits which are not removed by the first and second mechanisms. Null-flits are then inserted between the affected flits of the packet to prevent OD transitions from appearing on the channel. A null-flit is a flit whose bits are all zero. As an example, suppose flits f1 = ‘001100’, f2 = ‘111000’, and f3 = ‘100101’ are received by the flit-insertion encoder at times t1, t2, and t3 respectively. Since the two other mechanisms, i.e., flit-reordering
Fig. 8. Block diagram of the flit-rotation encoder used to mitigate the rate of opposite direction transitions on NoC channels.
and flit-rotation, deliver one flit per cycle on their outputs, the flit-insertion mechanism should be able to receive one flit per cycle. This means that flits f1, f2, and f3 are received by the flit-insertion encoder at times t1, t1 + 1, and t1 + 2 respectively. The transition sequence ‘↑↑–↓––’ appears on the channel when flit f2 follows flit f1, and flit f3 produces the transition sequence ‘–↓↓↑–↑’ on the channel when it follows flit f2. Since there is an unresolved OD transition between flits f2 and f3, the flit-insertion encoder performs the following tasks: (1) It provides a one-cycle stall between flits f2 and f3 by means of the temporary flit buffers embedded in this encoder. To do this, flit f3 is held for one cycle in the flit-insertion encoder and then continues on its way and passes the encoder. In other words, flit f3 leaves the flit-insertion encoder two cycles after flit f2. (2) It sends a null-flit through the output port of the encoder in the delay cycle to discharge all wires of the channel. In this way, the transition sequence appearing between flit f2 and the inserted null-flit is ‘↓↓↓–––’ and the transition sequence appearing between the inserted null-flit and flit f3 is ‘↑––↑–↑’. As can be seen, the flit-insertion encoder eliminates all of the remaining OD transitions on the NoC communication channel. However, due to the performance overhead of delay insertion, we used the flit-insertion encoder as the last mechanism of the FRR method. Fig. 9 shows the block diagram of the flit-insertion encoder. As shown in Fig. 9, temporary buffers are needed in the flit-insertion encoder to provide a one-cycle stall whenever a null-flit should be inserted between flits of the packet. Fig. 10 shows how the FRR method uses the three encoders in a pipelined manner to remove OD transitions from NoC channels. This architecture minimizes the performance and power consumption overheads of the FRR method.
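The worked example with f1, f2, and f3 can be reproduced with a short Python sketch. It is a behavioural illustration only: the function names are hypothetical, flits are represented as bit strings, and the stall logic and temporary buffers of the hardware encoder are reduced to appending an all-zero flit to the output list.

```python
ARROW = {("0", "1"): "↑", ("1", "0"): "↓"}


def transition_sequence(prev, cur):
    """Per-wire transitions when bit string `cur` follows `prev`."""
    return "".join(ARROW.get((p, c), "–") for p, c in zip(prev, cur))


def has_od(prev, cur):
    """True if two adjacent wires switch in opposite directions."""
    seq = transition_sequence(prev, cur)
    return "↑↓" in seq or "↓↑" in seq


def insert_null_flits(flits):
    """Insert an all-zero flit wherever two consecutive flits would
    still produce an OD transition."""
    out = []
    for flit in flits:
        if out and has_od(out[-1], flit):
            out.append("0" * len(flit))    # null-flit sent in the stall cycle
        out.append(flit)
    return out


f1, f2, f3 = "001100", "111000", "100101"
print(transition_sequence(f1, f2))         # ↑↑–↓––
print(transition_sequence(f2, f3))         # –↓↓↑–↑
print(insert_null_flits([f1, f2, f3]))
# ['001100', '111000', '000000', '100101']
```

A null-flit can only cause falling transitions when it follows a flit and only rising transitions when a flit follows it, which is why a single inserted flit is always sufficient.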
The performance overheads of the first and second encoders are constant delays of n cycles and one cycle respectively, where n is the size of the reordering window in the first encoder. Simulation and analytical results show that the performance overhead of the third mechanism is negligible as well (see Section 5). Altogether, the FRR method imposes a few nanoseconds of delay on the critical path of the switch architecture, which is a negligible delay for end-to-end flow-control methods. The evaluations in Section 5 confirm that the FRR method can be used as a cost-efficient method to overcome the problem of crosstalk faults in NoC channels.
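The rotation step of Section 4.2 can be modelled in the same spirit. The sketch below is an assumption-laden illustration rather than the hardware of Fig. 8: `rotl`, `rotate_encode`, and `rotate_decode` are hypothetical names, the tag is assumed to be the number of left rotations applied, ties go to the smallest rotation, and the OD transitions caused by the tag bits are again ignored.

```python
def od_pairs(prev, cur, width):
    """Count adjacent wire pairs switching in opposite directions."""
    trans = [((cur >> b) & 1) - ((prev >> b) & 1)
             for b in range(width - 1, -1, -1)]
    return sum(1 for a, b in zip(trans, trans[1:]) if a * b == -1)


def rotl(flit, i, width):
    """Rotate a `width`-bit flit left by `i` bits."""
    i %= width
    mask = (1 << width) - 1
    return ((flit << i) | (flit >> (width - i))) & mask


def rotate_encode(flit, psf, width, m):
    """Pick the rotation 0 <= i < m with the fewest OD pairs
    with respect to the previously sent flit `psf`."""
    tag = min(range(m),
              key=lambda i: od_pairs(psf, rotl(flit, i, width), width))
    return tag, rotl(flit, tag, width)


def rotate_decode(tag, coded, width):
    """Undo the encoder's left rotation (rotate right by `tag`)."""
    return rotl(coded, width - tag, width)


tag, coded = rotate_encode(0b100101, psf=0b111000, width=6, m=4)
print(tag, format(coded, "06b"))           # 1 001011
```

Here the unrotated flit would produce one OD pair after ‘111000’, while the version rotated left by one bit produces none, and the decoder recovers ‘100101’ from the tag.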
5. Evaluation of the proposed method

5.1. Analytical evaluation

In this section we calculate the average number of OD transition pairs in an NoC channel that uses the FRR method. It is then compared with the number calculated for an unprotected channel in Section 3. For the sake of fairness, this section uses the same assumptions as Section 3, i.e., flits are assumed to be uncorrelated [15,32] before and after the tag bits are added. Eq. (5) (see Section 3) gives the probability of having L = l OD transition pairs in an unprotected channel with a width of K bits:
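\[
P^{K}_{\mathrm{unprotected}}(L = l) = \sum_{m=0}^{l} \binom{K/2}{m} (P_{I5} + P_{I7})^{m} \left[1 - (P_{I5} + P_{I7})\right]^{K/2 - m} \binom{K/2 - 1}{l - m} P_{S1}^{\,l-m} (1 - P_{S1})^{K/2 - l - 1 + m}. \tag{5}
\]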
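Read this way, Eq. (5) is the distribution of the sum of two independent binomial counts, one over K/2 trials with success probability P_I5 + P_I7 and one over K/2 − 1 trials with success probability P_S1. The following Python sketch evaluates it under that reading. The function name is hypothetical, and the probability values are arbitrary placeholders chosen only to exercise the code; the actual values of P_I5, P_I7, and P_S1 come from Section 3.

```python
from math import comb


def p_unprotected(l, K, p_i, p_s1):
    """Probability of exactly l OD transition pairs, following Eq. (5).

    p_i  stands for P_I5 + P_I7 and p_s1 for P_S1 (both from Section 3).
    """
    half = K // 2
    total = 0.0
    for m in range(l + 1):
        k = l - m
        if m > half or k > half - 1:
            continue                       # binomial coefficient is zero
        total += (comb(half, m) * p_i**m * (1 - p_i)**(half - m)
                  * comb(half - 1, k) * p_s1**k * (1 - p_s1)**(half - 1 - k))
    return total


# Placeholder inputs, NOT values from the paper.
K, p_i, p_s1 = 8, 0.10, 0.05
pmf = [p_unprotected(l, K, p_i, p_s1) for l in range(K)]
print(round(sum(pmf), 12))                 # 1.0
```

The check that the probabilities sum to one over l = 0, ..., K − 1 is a quick way to catch transcription errors in the exponents.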
Eq. (5) is thus the probability of having L = l OD transition pairs at the input of the FRR encoder, which is referred to in Fig. 10 as point (a). This section recalculates this probability for an FRR-enabled communication channel. To this end, we calculate the probability at the output ports of the first and second encoders, i.e., points (b) and (c). Note that, because of the third mechanism used in the FRR method, no OD transition appears at the output port of the flit-insertion encoder, i.e., point (d) of Fig. 10. Such an analysis helps us to: (1) study the efficiency of the mechanisms used in the FRR method, and (2) estimate the average timing overhead of the third mechanism of the FRR method. Let us now calculate the expected number of OD transition pairs at the output port of the flit-reordering encoder. In the flit-reordering mechanism, there are n rounds of competition for a window size of n flits. In the first round of competition, the numbers of OD transition pairs between the previously sent flit and each of the n loaded flits are n random variables $L_1, L_2, \ldots, L_n$, which all have the same distribution $P^{K}(L = l)$. Since the flit with the lowest number of OD transition pairs is selected as the winner flit, the number of OD transition pairs at the output port of the flit-reordering mechanism, i.e., point (b), in the first round of competition is a random variable $L^{1}_{\min}$ which is defined as:
Fig. 9. Block diagram of the flit-insertion encoder used to eliminate opposite direction transitions on NoC channels.
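[Figure: flits from the local core enter the FRR encoder at point (a) and pass through the flit-reordering encoder (output at point (b)), the flit-rotation encoder (output at point (c)), and the flit-insertion encoder (output at point (d)) on their way to the network.]
Fig. 10. Block diagram of the FRR encoder.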
\[
L^{1}_{\min} = \mathrm{Min}(L_1, L_2, \ldots, L_n) \tag{8}
\]

In the second round of competition, the winner flit has already left the flit-reordering encoder, so the n − 1 remaining flits of the window take part in the competition. By similar reasoning, the number of OD transition pairs appearing at the output port of the flit-reordering mechanism in the second round of competition is a random variable $L^{2}_{\min}$ which is defined as:

\[
L^{2}_{\min} = \mathrm{Min}(L_1, L_2, \ldots, L_{n-1}) \tag{9}
\]

and, in general, the number of OD transition pairs appearing at the output port of the flit-reordering encoder in the ith round of competition is a random variable $L^{i}_{\min}$:

\[
L^{i}_{\min} = \mathrm{Min}(L_1, L_2, \ldots, L_{n-i+1}) \tag{10}
\]

where 1 ≤ i ≤ n. The probability of having fewer than l OD transition pairs in the ith round of competition can be calculated as:
\[
P^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min} < l\right) = P\left(\mathrm{Min}(L_1, L_2, \ldots, L_{n-i+1}) < l\right) \tag{11}
\]

\[
P\left(\mathrm{Min}(L_1, L_2, \ldots, L_{n-i+1}) < l\right) = 1 - P^{K}(L_1 \ge l)\, P^{K}(L_2 \ge l) \cdots P^{K}(L_{n-i+1} \ge l) \tag{12}
\]

Since we assumed that there is no correlation between the flits of a window, $P^{K}(L_q \ge l) = P^{K}(L \ge l)$ for every q, so Eq. (12) can be simplified to:

\[
P^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min} < l\right) = 1 - \left[1 - P^{K}_{\mathrm{unprotected}}(L_1 < l)\right]^{n-i+1}
\]

Using the above cumulative probability function, the probability of having exactly l OD transition pairs at the output port of the flit-reordering encoder in the ith round of competition can be calculated by:

\[
P^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min} = l\right) = P^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min} \le l\right) - P^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min} < l\right). \tag{13}
\]

Subsequently, the expected number of OD transition pairs in the ith round of competition is as follows:

\[
E^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min}\right) = \sum_{j=0}^{K/2} j \cdot P^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min} = j\right). \tag{14}
\]

The expected number of OD transition pairs appearing at the output port of the flit-reordering encoder when all n flits of a window pass the flit-reordering encoder is:

\[
E^{K}_{\mathrm{FlitReordering}}(OD) = \frac{1}{n} \sum_{i=1}^{n} E^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min}\right). \tag{15}
\]

Finally, the average number of OD transition pairs appearing at the output port of the flit-reordering encoder when Q bits of data pass this encoder is:

\[
\mathrm{Average}(OD)^{K}_{\mathrm{FlitReordering}} = \frac{Q'}{K} \cdot \frac{1}{n} \sum_{i=1}^{n} E^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min}\right) \tag{16}
\]

where $Q' = Q + \lceil Q/K \rceil \log_2 n$ accounts for the tag bits added to the flits by the flit-reordering encoder. In order to calculate the average number of OD transition pairs at the output port of the flit-rotation encoder, i.e., point (c), we should first calculate the probability of having L = l OD transition pairs at the input port of the flit-rotation encoder, i.e., point (b). Note that Eq. (13) gives this probability for a given competition round. This equation should be modified to give the probability of having L = l OD transition pairs at point (b) regardless of the competition round in the first encoder. To this end, we assume that at each instant of time the flit-reordering encoder is equally likely to be in any of its rounds of competition, i.e., each round has probability 1/n. Consequently, the probability of having L = l OD transition pairs at the output port of the flit-reordering encoder regardless of the competition round can be calculated by:

\[
P^{K}_{\mathrm{FlitReordering}}(L = l) = \frac{1}{n} \sum_{i=1}^{n} P^{K,i}_{\mathrm{FlitReordering}}\left(L^{i}_{\min} = l\right). \tag{17}
\]

Since the flit-rotation encoder selects its winner flit among m rotated versions of the incoming flit, the probability of having fewer than l OD transition pairs at the output port of the flit-rotation encoder, i.e., point (c), is:

\[
P^{K}_{\mathrm{FlitRotation}}(L < l) = P\left(\mathrm{Min}(L_1, L_2, \ldots, L_m) < l\right) \tag{18}
\]
where the random variables $L_1, L_2, \ldots, L_m$ have the distribution function $P^{K}_{\mathrm{FlitReordering}}(L = l)$. Accordingly, the probability of having L = l OD transition pairs at the output port of the flit-rotation encoder is:

\[
P^{K}_{\mathrm{FlitRotation}}(L = l) = P^{K}_{\mathrm{FlitRotation}}(L \le l) - P^{K}_{\mathrm{FlitRotation}}(L < l). \tag{19}
\]

The expected number of OD transition pairs appearing at the output port of the flit-rotation encoder is:

\[
E^{K}_{\mathrm{FlitRotation}}(OD) = \sum_{j=0}^{K/2} j \cdot P^{K}_{\mathrm{FlitRotation}}(L = j). \tag{20}
\]
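The chain from Eq. (11) to Eq. (20) can be evaluated numerically once the distribution for a single pair of flits is known. The Python sketch below shows one way to do this under the independence assumption used above. The function names are hypothetical, and `base` is an arbitrary three-point distribution used only for illustration; it is not the output of Eq. (5) for any real channel.

```python
def min_pmf(pmf, k):
    """Distribution of the minimum of k i.i.d. draws from `pmf`
    (a list indexed by the number of OD transition pairs)."""
    out = []
    below = 0.0                            # P(L < l)
    for p in pmf:
        upto = below + p                   # P(L <= l)
        # P(min = l) = P(min <= l) - P(min < l), cf. Eqs. (13) and (19)
        out.append((1 - (1 - upto) ** k) - (1 - (1 - below) ** k))
        below = upto
    return out


def expectation(pmf):
    """Expected number of OD transition pairs, cf. Eqs. (14) and (20)."""
    return sum(j * p for j, p in enumerate(pmf))


def reordering_pmf(pmf, n):
    """Eq. (17): average over the n rounds; round i competes n-i+1 flits."""
    rounds = [min_pmf(pmf, n - i + 1) for i in range(1, n + 1)]
    return [sum(r[l] for r in rounds) / n for l in range(len(pmf))]


def rotation_pmf(pmf, n, m):
    """Eqs. (18)-(19): minimum over m rotated versions at point (c)."""
    return min_pmf(reordering_pmf(pmf, n), m)


base = [0.5, 0.3, 0.2]                     # illustrative point (a) distribution
point_b = reordering_pmf(base, n=2)
point_c = rotation_pmf(base, n=2, m=2)
print(round(expectation(base), 6),
      round(expectation(point_b), 6),
      round(expectation(point_c), 6))      # 0.7 0.495 0.155025
```

Even for this toy distribution the expected number of OD transition pairs falls at each stage, which mirrors the trend from point (a) to point (c) discussed in this section.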
Finally, the average number of OD transition pairs appearing at the output port of the flit-rotation encoder when Q bits of data pass this encoder is:

\[
\mathrm{Average}(OD)^{K}_{\mathrm{FlitRotation}} = \frac{Q''}{K} \cdot E^{K}_{\mathrm{FlitRotation}}(OD) \tag{21}
\]
where $Q'' = Q' + \lceil Q'/K \rceil \log_2 m$ accounts for the tag bits added to the flits by the flit-rotation encoder. Since the third mechanism of the FRR method eliminates all the remaining OD transitions, no OD transition appears at its output port, i.e., on the injection channel of the source node. In other words, the number of OD transitions at the output port of the flit-insertion encoder, i.e., point (d), is zero. As mentioned, the flit-insertion encoder imposes a one-cycle delay between two consecutive flits if there is at least one OD transition between them. The number of delay cycles imposed by the flit-insertion encoder is $M \left(1 - P^{K}_{\mathrm{FlitRotation}}(L = 0)\right)$, where M is the number of flits in the packet at the input port of the flit-insertion encoder. Using this value, the length of the packet at the output port of the flit-insertion encoder can be calculated by:

\[
M' = M + M \left(1 - P^{K}_{\mathrm{FlitRotation}}(L = 0)\right). \tag{22}
\]

Consequently, on average (M′ − M) temporary buffers are needed in the flit-insertion encoder. Using the above discussion, an unprotected channel is compared with a channel equipped with the FRR method in terms of the average number of OD transitions. Table 2 shows the average number of OD transitions for 8-, 16-, and 32-bit channels when 600 KB of random data is transmitted through the channels. The average numbers of OD transitions at points (a), (b), (c), and (d) are obtained from the proposed analytical model. Points (a) and (d) can be considered as an unprotected channel and an FRR-enabled channel respectively, whereas points (b) and (c) help us to study the behavior of the mechanisms used in the FRR encoder. The parameters n and m in Table 2 refer to the size of the reordering window used in the flit-reordering encoder and the maximum rotation size used in the flit-rotation encoder, respectively. As can be seen in Table 2, the mechanisms used in the FRR method efficiently eliminate OD transitions in NoC channels. Because of the third mechanism used in the FRR method, the average number of temporary buffers required in the flit-insertion encoder depends on the average number of OD transition pairs appearing at point (c). This value is also calculated using the proposed model to estimate the buffer requirement of the flit-insertion encoder. The next subsection presents experimental results to validate the results of the analytical modeling. Considering the mechanisms used in the first encoder of the FRR method, the performance overhead of the first encoder is $m + \lceil M \log_2 m / K \rceil$ cycles, which is due to the m reordering buffers and the added tag bits. Similarly, the performance overhead of the second encoder is $1 + \lceil Z \log_2 m / K \rceil$, where $Z = M + \lceil M \log_2 m / K \rceil$. Table 2 shows the total performance overhead imposed on a packet with a length of 32 flits, i.e., M = 32, when the packet leaves the flit-rotation encoder, i.e., at point (c).

5.2. Experimental evaluation

In order to experimentally evaluate the proposed flow-control method, a VHDL-based simulator was developed. The simulator is composed of an FRR encoder module and a random flit generator module. The FRR encoder receives flits from the random flit generator module and encodes them using the three mechanisms, i.e., flit-reordering, flit-rotation, and flit-insertion. The number of OD transition pairs as well as the number of transitions is counted at points (a), (b), (c), and (d) (see Fig. 10) by monitor hardware added to the FRR encoder. A synthesizable version of the simulator is used to investigate the power consumption, area, and timing overheads of the FRR method. To do this, the Design Compiler tool is used to extract the overheads of the FRR method, which is synthesized in a 65 nm technology. The power consumption, area overhead, and critical path delay of the monitor hardware are ignored in our report since this hardware does not exist under real working conditions. Simulation experiments are done for different reordering sizes (n), rotation sizes (m), and channel widths (K). In each simulation experiment, 5 MB of random data is generated by the flit generator module and delivered to the FRR encoder module. Since the amount of power saving depends on the number of channels per node of the NoC as well as on the length of the NoC channels, in our evaluations:
(1) We studied a single NoC channel, which makes our evaluations independent of the NoC architecture. (2) We did not log the power consumption of the channel. Instead, we logged the number of opposite direction transitions as well as the number of transitions at points (a), (b), (c), and (d) by means of monitor hardware added to the FRR encoder. Based on these two points, we can claim that our power simulations hold for all lengths of NoC channels and all NoC architectures. However, simulation experiments are done for
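Table 2
Improvements and overheads of the FRR method extracted by analytical modeling.
Columns: channel width; size of reordering window (n); maximum rotation (m); average number of OD transition pairs at points a, b, and c; reduction (%) with respect to point (a) at points b and c; performance overhead (cycles).

Width     n    m    Point a    Point b    Point c    Red. b   Red. c   Overhead
8 bits    2    2    429,255    171,477    127,158    60.1     70.4     12
8 bits    2    4    429,255    171,477    20,825     60.1     95.1     18
8 bits    2    8    429,255    171,477    540        60.1     99.9     27
8 bits    4    2    429,255    58,037     55,862     86.5     87.0     16
8 bits    4    4    429,255    58,037     3725       86.5     99.1     23
8 bits    4    8    429,255    58,037     15         86.5     100      32
8 bits    8    2    429,255    16,914     12,570     96.1     97.1     21
8 bits    8    4    429,255    16,914     3337       96.1     99.2     28
8 bits    16   2    429,255    4633       739        98.9     99.8     25
8 bits    16   4    429,255    4633       23         98.9     100      33
16 bits   4    2    449,214    70,987     50,597     84.2     88.7     10
16 bits   4    4    449,214    70,987     39,415     84.2     91.2     14
16 bits   4    8    449,214    70,987     4056       84.2     99.1     20
16 bits   8    2    449,214    23,719     12,046     94.7     97.3     12
16 bits   8    4    449,214    23,719     9161       94.7     98.0     16
16 bits   8    8    449,214    23,719     210        94.7     100      23
16 bits   16   2    449,214    6895       2886       98.5     99.4     14
16 bits   16   4    449,214    6895       1012       98.5     99.8     18
32 bits   4    8    458,984    82,913     76,504     81.9     83.3     15
32 bits   4    16   458,984    82,913     30,496     81.9     93.4     24
32 bits   8    4    458,984    32,388     27,642     92.9     94.0     11
32 bits   8    8    458,984    32,388     17,674     92.9     96.1     16
32 bits   8    16   458,984    32,388     7720       92.9     98.3     25
32 bits   16   4    458,984    11,881     10,693     97.4     97.7     12
32 bits   16   8    458,984    11,881     8149       97.4     98.2     17
32 bits   16   16   458,984    11,881     550        97.4     99.9     26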
different reordering sizes (n), rotation sizes (m), and channel widths (K). The simulations are done under the constraint of fixed flit width, i.e., flits are generated with widths of 16 or 32 bits, and consequently the NoC channels must be 16 + log2(mn) or 32 + log2(mn) bits wide respectively to carry the flit as well as the tag bits added to the flit by the FRR encoder. On the other hand, the fixed channel width constraint, which is used in our analytical modeling, assumes NoC channels with widths of 16 or 32 bits, so flits should have widths of 16 − log2(mn) or 32 − log2(mn) bits respectively to leave enough space for the tag bits. Under the first constraint, the performance overhead of the FRR encoder is a delay of n + 1 cycles, which is the minimum, since the tag bits have their own wires on the NoC channels. However, the area overhead is proportional to log2(mn), which is the maximum area overhead of the FRR encoder. In contrast, under the fixed channel width constraint, the performance overhead of the FRR method is maximized, i.e., $1 + \left\lceil \left( M + \lceil M \log_2 m / K \rceil \right) \log_2 n / K \right\rceil$, and the area overhead is minimized.
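As a numerical cross-check of the fixed channel width case, the sketch below evaluates two stage overheads for a 32-flit packet and adds them. It is an illustration under stated assumptions only: the function name is hypothetical, the first term m + ⌈M log2 m / K⌉ is the first-encoder overhead given in Section 5.1, the second term is the expression given above for the fixed channel width constraint, and the idea that the "Performance overhead" column of Table 2 is the sum of the two is an assumption of this sketch. Under that assumption the computed values agree with Table 2 for the configurations tried.

```python
from math import ceil, log2


def stage_overheads(M, K, n, m):
    """Cycle overheads of the two stages for an M-flit packet on a
    K-bit channel (assumed reading of the formulas in Section 5)."""
    tag_flits = ceil(M * log2(m) / K)      # extra flits caused by tag bits
    first = m + tag_flits
    Z = M + tag_flits                      # packet length after the first stage
    second = 1 + ceil(Z * log2(n) / K)
    return first, second


M = 32
for K, n, m in [(8, 2, 2), (8, 2, 4), (8, 16, 4), (16, 4, 2), (32, 8, 4)]:
    print(K, n, m, sum(stage_overheads(M, K, n, m)))
# 8 2 2 12
# 8 2 4 18
# 8 16 4 33
# 16 4 2 10
# 32 8 4 11
```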
Table 3 presents the results obtained under the fixed flit width constraint. It shows the number of OD transition pairs and the number of transitions at points (a), (b), and (c) of the FRR encoder. The results at point (a) can be considered as the results of an unprotected NoC communication channel. For example, consider the FRR encoder when n = 2, m = 8, and the flit width is 16 bits. In this case, the flit-reordering encoder reduces the number of OD transition pairs from 8,281,237 to 4,921,862, i.e., a 40.6% reduction. In addition, the number of transitions is reduced from 21,113,314 to 18,261,772, i.e., a 13.5% reduction. After flit-reordering, the flit-rotation encoder reduces these values to 93,210 and 1,193,401 respectively, which correspond to reductions of 99.1% and 94.3% respectively. According to Table 3, the higher n and m are, the greater the reduction in OD transition pairs and regular transitions. Studies show that the communication channels in most NoCs consume 20–36% of the total power [24]. Since the total power in digital systems is proportional to the number of signal transitions, reducing transitions in NoC communication channels directly reduces the power consumption of the NoC. Based on Table 3, more power saving is achieved by the FRR method with larger n and m and/or wider communication channels. Table 4 presents the power and area overheads of the FRR encoder for some working conditions. The overheads of the FRR method are extracted using a synthesizable version of the simulator, which is synthesized in a 65 nm technology. As can be inferred from Tables 3 and 4, the power overhead of the FRR method is negligible compared to its power saving. In the next experiment, the FRR method is compared with the duplicate-add-parity method [24,25], which is designed to prevent the transition patterns ↑↓↑, –↑–, and ↑–↑ (and their complements). To do this, the duplicate-add-parity method is also simulated by a VHDL
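Table 3
Improvements and overheads of the FRR method extracted under the fixed flit width constraint.

Number of OD transition pairs, and reduction (%) with respect to point (a):

Flit width   n, m           Point a      Point b      Point c      Red. b   Red. c
16 bits      n = 2, m = 2   8,281,237    4,921,862    275,786      40.6     96.6
16 bits      n = 2, m = 4   8,281,237    4,921,862    88,596       40.6     98.9
16 bits      n = 2, m = 8   8,281,237    4,921,862    93,210       40.6     99.1
16 bits      n = 4, m = 2   8,281,237    3,076,158    263,683      62.9     96.8
16 bits      n = 4, m = 4   8,281,237    3,076,158    67,450       62.9     99.2
16 bits      n = 4, m = 8   8,281,237    3,076,158    68,367       62.9     99.4
32 bits      n = 2, m = 2   9,101,424    4,980,364    1,305,162    45.3     85.6
32 bits      n = 2, m = 4   9,101,424    4,980,364    800,015      45.3     91.2
32 bits      n = 2, m = 8   9,101,424    4,980,364    1,004,537    45.3     88.9
32 bits      n = 4, m = 2   9,101,424    2,805,099    305,162      69.2     96.6
32 bits      n = 4, m = 4   9,101,424    2,805,099    216,011      69.2     97.6
32 bits      n = 4, m = 8   9,101,424    2,805,099    277,031      69.2     96.9

Number of transitions, and reduction (%) with respect to point (a):

Flit width   n, m           Point a       Point b       Point c       Red. b   Red. c
16 bits      n = 2, m = 2   21,113,314    18,261,772    13,183,650    13.5     37.5
16 bits      n = 2, m = 4   21,113,314    18,261,772    12,250,077    13.5     41.9
16 bits      n = 2, m = 8   21,113,314    18,261,772    1,193,401     13.5     94.3
16 bits      n = 4, m = 2   21,113,314    16,855,570    11,328,168    20.1     46.3
16 bits      n = 4, m = 4   21,113,314    16,855,570    11,621,142    20.1     44.9
16 bits      n = 4, m = 8   21,113,314    16,855,570    10,722,688    20.1     49.2
32 bits      n = 2, m = 2   20,588,526    16,750,706    13,823,093    18.6     32.8
32 bits      n = 2, m = 4   20,588,526    16,750,706    12,893,401    18.6     37.3
32 bits      n = 2, m = 8   20,588,526    16,750,706    12,617,166    18.6     38.7
32 bits      n = 4, m = 2   20,588,526    14,848,956    8,285,199     27.9     59.7
32 bits      n = 4, m = 4   20,588,526    14,848,956    7,180,195     27.9     65.1
32 bits      n = 4, m = 8   20,588,526    14,848,956    7,365,731     27.9     64.2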
Table 4
Power consumption, area overhead and critical path timing of the FRR encoder in different working conditions.

                     Channel width = 16 bits                  Channel width = 32 bits
Encoder parameters   Power (µW)   Area (µm²)   Timing (ns)    Power (µW)   Area (µm²)   Timing (ns)
m = 2, n = 2         102          1824         15             353          3137         16
m = 2, n = 4         241          4262.4       16             623          13,124       18
Table 5
The FRR method in comparison with the duplicate-add-parity (DAP) method.

Number of OD transition patterns, and reduction of OD transitions (%) with respect to point (a):

Flit size   n, m           Point a      Point b      Point c      DAP          Red. b   Red. c   Red. DAP
16          n = 2, m = 2   8,281,237    4,921,862    275,786      8,281,237    40.6     96.6     0
32          n = 2, m = 2   9,101,424    4,980,364    1,305,162    9,101,424    45.3     85.6     0

Number of transition patterns ↑↓↑, –↑–, ↑–↑, and their reduction (%) with respect to point (a):

Flit size   n, m           Point a      Point b      Point c      DAP    Red. b   Red. c   Red. DAP
16          n = 2, m = 2   6,881,512    3,221,862    2,053,161    0      53.2     70.2     100
32          n = 2, m = 2   8,250,457    3,692,553    1,807,615    0      55.2     78.1     100
model. Table 5 compares the reductions achieved by the FRR and duplicate-add-parity methods with respect to both sets of transition patterns, {↓↑, ↑↓} and {↑↓↑, ↓↑↓, –↑–, ↑–↑, –↓–, ↓–↓}. Note that the duplicate-add-parity method can also correct single-bit errors in the flits; however, in this section we compare the duplicate-add-parity and FRR methods from the crosstalk prevention and power consumption points of view. As can be seen in Table 5, the duplicate-add-parity method does not reduce the number of OD transition patterns, i.e., ↓↑ and ↑↓. In contrast, the FRR method achieves a noticeable reduction in the number of transition patterns ↑↓↑, –↑–, ↑–↑ (and their complements).

6. Conclusions

This paper proposed an efficient flow-control method that simultaneously enhances the reliability of packet transmission and reduces the power consumption of packet delivery in NoCs. The method, called FRR, uses three mechanisms to entirely eliminate opposite direction transitions, the source of crosstalk faults in NoC communication channels. The first and second mechanisms, i.e., flit-reordering and flit-rotation, reduce the rate of opposite direction transitions on NoC channels, whereas the third mechanism, i.e., flit-insertion, reduces it to zero. As the simulation results show, the main advantage of the proposed method is that it provides crosstalk elimination and power reduction at the same time. Crosstalk elimination is achieved because the proposed method eliminates OD transitions in NoC channels, and power reduction is achieved through the reduction in the number of regular transitions on NoC channels. An analytical model was also proposed to calculate and compare the expected number of OD transitions in an unprotected NoC and in an FRR-enabled NoC.

References

[1] S. Kumar, A. Jantsch, J.P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, A. Hemani, A network on chip architecture and design methodology, in: Proceedings of ISVLSI, April 2002, pp. 117–122.
[2] L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm, IEEE Computers 35 (1) (2002) 70–78.
[3] S. Murali, T. Theocharides, N. Vijaykrishnan, M.J. Irwin, L. Benini, G. De Micheli, Analysis of error recovery schemes for networks-on-chips, IEEE Design and Test of Computers 22 (5) (2005) 434–442.
[4] D. Bertozzi, D.L. Benini, G. De Micheli, Low power error-resilient encoding for on-chip data buses, in: Proceedings of DATE, March 2002, pp. 102–109.
[5] A.M. Fazeli, S.G. Miremadi, A low-power and SEU-tolerant switch architecture for network on chips, in: Proceedings of the IEEE/IFIP Pacific Rim International Symposium on Dependable Computing (PRDC 2007), Melbourne, Victoria, Australia, December 2007.
[6] D. Park, C. Nicopoulos, J. Kim, N. Vijaykrishnan, C.R. Das, Exploring fault-tolerant network-on-chip architectures, in: International Conference on Dependable Systems and Networks (DSN), 2006, p. 93.
[7] R. Hegde, N.R. Shanbhag, Towards achieving energy efficiency in presence of deep submicron noise, IEEE Transactions on VLSI Systems 8 (4) (2000) 379–391.
[8] S. Murali, D. Atienza, L. Benini, G. De Micheli, A multipath routing strategy with guaranteed in-order packet delivery and fault-tolerance for networks on chip, in: Proceedings of the 43rd ACM/IEEE Design Automation Conference (DAC ’06), San Francisco, Calif, USA, July 2006, pp. 845–848.
[9] A.P. Frantz, L. Carro, É.F. Cota, F.L. Kastensmidt, Evaluating SEU and crosstalk effects in network-on-chip routers, IOLTS, 2006, pp. 191–192.
[10] M.H. Tehranipour, N. Ahmed, M. Nourani, Testing SoC interconnects for signal integrity using boundary scan, VTS, 2003, pp. 158–172.
[11] H. Zimmer, A. Jantsch, A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip, in: Proceedings of ISSS/CODES, September 2003, pp. 188–193.
[12] T. Dumitras, S. Kerner, R. Marculescu, Towards on-chip fault-tolerant communication, in: Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), 2003, pp. 225–232.
[13] M. Pirretti, G.M. Link, R.R. Brooks, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, Fault tolerant algorithms for network-on-chip interconnect, in: Proceedings of the ISVLSI, 2004.
[14] M. Kuhlmann, S.S. Sapatnekar, Exact and efficient crosstalk estimation, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 20 (7) (2001) 858–866.
[15] K.N. Patel, I.L. Markov, Error-correction and crosstalk avoidance in DSM busses, IEEE Transactions on Very Large Scale Integration (VLSI) 12 (2004) 1076–1080.
[16] T. Gao, C.L. Liu, Minimum crosstalk channel routing, in: Proceedings of International Conference on Computer-Aided Design (ICCAD), November 1999, pp. 692–696.
[17] K. Hirose, H. Yasuura, A bus delay reduction technique considering crosstalk, in: Proceedings of Design, Automation and Test Europe (DATE), 2000, pp. 441–445.
[18] H. Kaul, D. Sylvester, D. Blaauw, Active shields: a new approach to shielding global wires, in: Proceedings of Great Lakes Symposium on Very Large Scale Integration (GLS-VLSI), April 2002, pp. 112–117.
[19] K.M. Lepak, I. Luwandi, L. He, Simultaneous shield insertion and net ordering under explicit RLC noise constraint, in: Proceedings of Design Automation Conference (DAC), June 2001, pp. 199–202.
[20] C. Duan, A. Tirumala, S.P. Khatri, Analysis and avoidance of cross-talk in on-chip buses, Hot Interconnects 9 (2001) 133–138.
[21] B. Victor, K. Keutzer, Bus encoding to prevent crosstalk delay, in: Proceedings of International Conference on Computer-Aided Design (ICCAD), 2001, pp. 57–69.
[22] D. Bertozzi, L. Benini, G.D. Micheli, Low power error resilient encoding for on-chip data buses, in: Proceedings of DATE, 2002, pp. 102–109.
[23] S.R. Sridhara, N.R. Shanbhag, Coding for reliable on-chip buses: a class of fundamental bounds and practical codes, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26 (5) (2007) 977–982.
[24] S.R. Sridhara, N.R. Shanbhag, Coding for system-on-chip networks: a unified framework, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 13 (6) (2005) 655–667.
[25] D. Rossi, A.K. Nieuwland, A. Katoch, C. Metra, New ECC for crosstalk impact minimization, IEEE Design and Test of Computers 22 (4) (2005) 340–348.
[26] P.P. Pande, A. Ganguly, B. Feero, B. Belzer, C. Grecu, Design of low power & reliable networks on chip through joint crosstalk avoidance and forward error correction coding, IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT’06), 2006, pp. 466–476.
[31] V. Raghunathan, M.B. Srivastava, R.K. Gupta, Energy-aware system design: a survey of techniques for energy efficient on-chip communication, Design Automation Conference (DAC), 2003, pp. 900–905.
[32] M.R. Stan, W.P. Burleson, Bus-invert coding for low-power I/O, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 3 (1) (1995) 49–58.
[33] A. Ganguly, P.P. Pande, B. Belzer, Crosstalk-aware channel coding schemes for energy efficient and reliable NOC interconnects, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17 (11) (2009) 1626–1639.
[34] A. Patooghy, H. Tabkhi, S.G. Miremadi, An efficient method to reliable data transmission in network-on-chip, in: 13th Euromicro Conference on Digital System Design (DSD 2010), Lille, France, September 2010, accepted for publication.
[36] X. Wu, Z. Yan, Efficient CODEC designs for crosstalk avoidance codes based on numeral systems, IEEE Transactions on Very Large Scale Integration (TVLSI) Systems PP (99), pp. 1–11.
[37] C. Duan, V. Cordero, S.P. Khatri, Efficient on-chip crosstalk avoidance CODEC design, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17 (4) (2009).
[38] J. Hu, R. Marculescu, DyAD – smart routing for networks-on-chip, in: DAC ’04: Proceedings of the 41st Annual Conference on Design Automation, 2004.
[39] A. Patooghy, S.G. Miremadi, M. Fazeli, A reliable switch architecture for network on chips, Elsevier Journal of Integration: The VLSI.
[40] A. Patooghy, S.G. Miremadi, M. Shafaei, FiRot: an efficient crosstalk mitigation method for network-on-chips, in: Proceedings of 16th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2010).
[41] S. Manolache, P. Eles, Z. Peng, Fault and energy-aware communication mapping with guaranteed latency for applications implemented on NoC, in: Proc. of DAC, 2005.
[42] L. Benini, D. Bertozzi, Xpipes: a network-on-chip architecture for gigascale systems-on-chip, IEEE Circuits and Systems Magazine 4 (2) (2004) 18–31.
Ahmad Patooghy received his B.S. in Computer Engineering from Azad University of Arak, Iran, and his M.Sc. in Computer Engineering from Sharif University of Technology, Tehran, Iran, in 2003 and 2005, respectively. He is currently a PhD student in the Department of Computer Engineering, Sharif University of Technology. His research interests include network-on-chip, dependability evaluation of VLSI circuits, fault injection, and analytical modeling.
Seyed Ghassem Miremadi received his M.Sc. in Applied Physics and Electrical Engineering from Linköping Institute of Technology and his PhD in Computer Engineering from Chalmers University of Technology, Sweden, in 1984 and 1995, respectively. He is an Associate Professor of Computer Engineering at Sharif University of Technology. As fault-tolerant computing is his specialty, he initiated the "Dependable Systems Laboratory" at Sharif University in 1996 and has chaired the Laboratory since then. The laboratory has participated in several research projects, which have led to scientific articles, conference papers, and technical reports. Dr. Miremadi and his group have done research in physical, simulation-based, and software-implemented fault injection, dependability evaluation using HDL models, fault-tolerant embedded systems, and fault tree analysis. Dr. Miremadi was the Education Director (1997–1998) and the Head (1998–2002) of the Computer Engineering Department at Sharif University, and he has been the Research Director of the department since 2002. He is a member of the IEEE Computer Society, the IEEE Reliability Society, and the Computer Society of Iran.
Hamed Tabkhi received his M.Sc. in Computer Engineering from Sharif University of Technology, Tehran, Iran, in 2008. He is currently a PhD candidate in the Department of Electrical and Computer Engineering, Northeastern University, Boston, USA. His research interests include embedded systems design and modeling, dependable systems, and computer architecture.