A reliable and power efficient flow-control method to eliminate crosstalk faults in network-on-chips

Microprocessors and Microsystems 35 (2011) 766–778 Contents lists available at SciVerse ScienceDirect Microprocessors and Microsystems journal homep...


Microprocessors and Microsystems 35 (2011) 766–778

Contents lists available at SciVerse ScienceDirect

Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro

A reliable and power efficient flow-control method to eliminate crosstalk faults in network-on-chips ☆

Ahmad Patooghy, Seyed Ghassem Miremadi ⇑, Hamed Tabkhi

Dependable Systems Laboratory, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

Article info

Article history: Available online 26 August 2011

Keywords: Network-on-chip; Crosstalk; Reliability; Power consumption

Abstract

This paper proposes a power-efficient flow-control method to tackle the problem of crosstalk faults in Network-on-Chips (NoCs). The method, called FRR (Flit Reordering/Rotation), combines three coding mechanisms to entirely eliminate opposite direction transitions (OD transitions), the source of crosstalk faults in NoC communication channels. The first mechanism, called flit-reordering, reorders the flits of every packet to find a flit sequence which produces the lowest number of OD transitions on NoC channels. The second mechanism, called flit-rotation, logically rotates the content of every flit of the packet with respect to the previously sent flit to reduce the number of OD transitions even further. Finally, the third mechanism, called flit-insertion, examines the flits of the packet to find the OD transitions which are not removed by the first and second mechanisms, and inserts null-flits between the affected flits to completely eliminate OD transitions on NoC channels. The FRR method is evaluated in two ways: (1) VHDL-based simulations are carried out for 16- and 32-bit channels with the maximum number of reorderings and rotations in the first and second mechanisms limited to 2, 4, and 8. (2) An analytical model is developed to calculate and compare the expected number of OD transitions in an unprotected NoC and in an FRR-enabled NoC. Both simulation and analytical results confirm that the FRR method completely removes crosstalk faults from NoC channels. In addition, VHDL simulations show that the FRR method provides a remarkable power saving, since it reduces the number of transitions in NoC channels by at least 32.8%.

Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved.

☆ This paper is an extension of the work presented in [34,40]. The extension includes: (i) The methods presented in [34,40] are crosstalk mitigation methods. In this paper, the following two mechanisms are added to those proposed in [34,40] to reach crosstalk elimination rather than crosstalk mitigation: (a) a combination of the flit reordering [34] and flit rotation [40] mechanisms is added to achieve higher crosstalk mitigation in the proposed method; (b) null-flit insertion is incorporated to improve the capability of the proposed method from crosstalk mitigation to crosstalk elimination. (ii) An analytical crosstalk model is proposed to estimate the reliability of an unprotected NoC as well as an FR2-enabled NoC. (iii) The proposed method is evaluated in terms of power consumption, area overhead, and timing overhead using a wide range of HDL simulations. (iv) This work was partially supported by a grant from Iran Telecommunication Research Center (ITRC).

⇑ Corresponding author. E-mail addresses: [email protected] (A. Patooghy), [email protected] (S.G. Miremadi), [email protected] (H. Tabkhi).

0141-9331/$ - see front matter Crown Copyright © 2011 Published by Elsevier B.V. All rights reserved. doi:10.1016/j.micpro.2011.08.004

1. Introduction

Recent advances in VLSI technologies have enabled designers to accommodate tens of IP blocks such as processing cores, memory modules, and I/O interfaces in a single chip [38]. Communication between these blocks is a key feature which seriously affects the chip performance. Traditional communication architectures, e.g., point-to-point, shared-bus, and segmented-bus architectures, are not efficient solutions due to their high cost [41], performance bottleneck [42], or lack of scalability [2]. Network-on-Chip (NoC) has been proposed [1,2] as a scalable, cost-efficient communication architecture for such chips. In the NoC context, a core sends packetized data to other cores through on-chip switches which connect the cores according to a predefined structure called a topology, e.g., mesh or torus.

It has been shown that NoCs are highly sensitive to transient faults due to the use of nano-scale VLSI technologies in their fabrication process [7,8]. Crosstalk [10,11], particle strikes [9], electromagnetic interference [31], and power supply disturbances [31] are the most important sources of transient faults which affect the correct functionality of NoCs. Among these, crosstalk is the major contributor to errors in NoCs [14,22]. Crosstalk happens because of coupling capacitances formed between adjacent wires of communication channels in NoCs. The coupling capacitances may result in an undesired transition on a victim wire when desired transitions appear on its neighboring wires [9,15]. Such coupling capacitances have negative impacts on the delay, power consumption, and signal integrity of data transmission in NoCs [16].

The high sensitivity of NoCs to crosstalk faults makes reliability one of the main concerns in the design of these products. In this regard, several methods have been proposed in the literature to mitigate the effects of crosstalk faults. These methods can be divided into the following four categories based on the level of abstraction at which they work.

(1) Methods at the lowest level of design abstraction, i.e., the layout level, mitigate the rate of crosstalk faults through crosstalk-aware fabrication. As an example, specialized routing strategies [16] route on-chip wires in a way that minimizes the coupling capacitances between adjacent wires. Shielding is another example at this abstraction level; it inserts shield wires between the wires of communication channels. Insertion of the shield wires reduces the rate of adjacent transitions in opposite directions, which in turn lessens the probability of crosstalk occurrence [18,19]. Although layout-level methods reduce the rate of crosstalk faults, they impose modifications on the fabrication process of the chip, which has negative impacts on the design time and production cost of chips.

(2) At a higher level of design abstraction, i.e., the transistor level, intentionally skewing signal transition timings on adjacent wires has been proposed to reduce the delay effects of crosstalk. However, this method is only applicable to repeater-enabled communication channels [17].

(3) At the next higher level, the RTL level, data coding is widely used to reduce the rate of crosstalk faults in on-chip channels. In a coding technique, n bits of data are mapped to k bits of code such that the probability of crosstalk occurrence for the coded data is lower than for the original data. Delay reduction codes [20,21] and crosstalk-avoidance codes [7,22,33] are examples of coding techniques that try to suppress the effects of crosstalk faults. Fibonacci-based coding methods have also been proposed [36,37] to remove the harmful bit sequences '101' and '010' from all packets traversing the NoC to reach crosstalk prevention.
In this regard, a recursive algorithm has been proposed to generate a Fibonacci codebook without any '101' or '010' patterns. However, Fibonacci-based methods suffer from the high complexity of their coding algorithm, especially when the width of the communication channel grows [23]. Data coding has also been utilized in flow-control methods to detect and correct crosstalk faults in either an end-to-end or a switch-to-switch manner [3,4]. In an end-to-end flow-control method, the source node adds error detection codes, e.g., parity or cyclic redundancy checks, to each packet and the destination node checks the integrity of the packet. If an error is detected, the source node is requested to resend the packet. In a switch-to-switch method, data correctness checking is performed whenever a flit (a packet is divided into fixed-size units called flits) reaches the next node. If an erroneous flit is detected, a NACK signal is sent to the sender node to indicate that the flit should be retransmitted. In this situation, the sender node stops sending the next flit and resends the requested flit.

(4) At the highest level of design abstraction, flood-based routing algorithms have been proposed to tolerate transient faults (including crosstalk) using packet redundancy [12,13]. In these algorithms, whenever a new packet is received, the receiver node chooses a subset of its adjacent nodes and sends the packet to them. In the next round, the selected nodes which have received the packet spread it in the same manner. The destination node is then able to use multiple copies of the same packet to overcome data errors that may have occurred during transmission.

Although several methods have been proposed to enhance the reliability of NoCs against crosstalk faults, these methods inadvertently affect other NoC parameters including performance, power consumption, area occupation, or production cost. For example: (1) layout-level methods complicate the fabrication process of the chip, which increases its production cost; (2) in real-world applications where the width of channels is at least 64 bits, Fibonacci-based coding methods impose a noticeable performance overhead due to their encoder modules [36]; and (3) flood-based methods impose high power overheads on the NoC because of their aggressive packet redundancy [39].

This paper proposes a power-efficient flow-control method to overcome the problem of crosstalk faults in NoC channels. The main advantage of this method is that it simultaneously provides reliability enhancement in packet transmission and power reduction in packet delivery (see Section 5). The method, called FRR (Flit Reordering/Rotation), utilizes three mechanisms to prevent crosstalk occurrences in NoC channels. In the first mechanism, called flit-reordering, the flits of each newly generated packet are reordered to minimize the number of opposite direction transitions appearing between consecutive flits of the packet. To do this, the packet is divided into non-overlapping windows of flits, and for each window a sequence of flits is sought which produces the lowest number of opposite direction transitions. The second mechanism, called flit-rotation, logically rotates the content of flits to achieve a higher reduction in the number of opposite direction transitions between consecutive flits. In the third mechanism, called flit-insertion, the flits of the packet are examined to find the opposite direction transitions which are not removed by the flit-reordering and flit-rotation mechanisms. The third mechanism inserts null-flits between the affected flits to completely eliminate the appearance of opposite direction transitions on NoC channels.
VHDL-based simulation as well as analytical modeling is carried out over a wide range of working conditions to evaluate the FRR method. Both simulation and analytical results confirm that the FRR method completely removes crosstalk faults from NoC channels. In addition, VHDL simulations show that the FRR method provides a remarkable power saving, because the method reduces the number of transitions in NoC channels by at least 32.8%.

The rest of the paper is organized as follows. Section 2 discusses how flow-control methods improve the reliability of packet transmission in NoCs. An analytical discussion of the probability of crosstalk occurrence in a typical NoC channel is presented in Section 3. The proposed method is introduced and evaluated in Sections 4 and 5, respectively, and finally Section 6 concludes the paper.

2. Flow-control methods

Flow-control methods are widely exploited to improve the reliability of packet transmission in network-on-chips [6,26]. These methods add information redundancy to the packets traversing the network and use the redundancy to check the integrity of packets. Based on the frequency of correctness checking, flow-control methods are divided into switch-to-switch and end-to-end categories [3,4,6,26].

In a switch-to-switch flow-control method, the sender node adds information redundancy, e.g., parity or CRC bits, to the flits of the packet and the receiver node checks the integrity of each received flit separately. After sending a flit, the sender switch keeps a copy of the flit in a retransmission buffer until the receiving switch activates an ACK (or NACK) signal. If the ACK signal is activated, the sender switch sends the next flit of the packet; otherwise, the flit is retransmitted from the corresponding retransmission buffer. The receiver switch checks the correctness of each newly received flit and sends an ACK/NACK signal to the sender switch based on the result of the check. Due to the frequent flit correctness checking, switch-to-switch methods have rather low latency in detecting errors. In other words, these methods do not allow erroneous flits to propagate through the network. This is achieved by adding encoder/decoder modules to all output/input channels of the NoC. Fig. 1 shows the foundation of a typical switch-to-switch flow-control method. As shown in Fig. 1, retransmission buffers R1 to Rn, an encoder module, and a communication controller module are added to each output channel of the NoC, and a decoder module and a communication controller module are added to each input channel of the NoC.

Fig. 1. Foundation of a typical switch-to-switch flow-control method. VC0 to VCk and R1 to Rn are flit buffers and retransmission buffers, respectively.

End-to-end methods are the other option for controlling the flow of packets in NoCs. In these methods, the source node adds error detection codes to the packet (e.g., the last flit of the packet contains the error detection code) and data integrity checking is performed for the whole packet when its last flit reaches the destination node, instead of checking flits separately in each intermediate node. If the destination node detects an erroneous packet, a control packet (containing the sequence number of the erroneous packet) is sent to the source node requesting it to resend the packet. Although end-to-end methods need hardware support as well, their hardware requirements are much lower than those of switch-to-switch methods. End-to-end methods require one encoder and one decoder module per switch of the NoC, instead of one per channel as in switch-to-switch methods. In addition, retransmission buffers are not required in end-to-end methods, because it is reasonable to assume that the packet can be regenerated by the corresponding application in the source node. Fig. 2 shows the hardware support needed in a typical end-to-end flow-control method. Lower hardware requirements reduce the power overhead of end-to-end methods as compared to the power overhead of switch-to-switch methods [5].

Fig. 2. Foundation of a typical end-to-end flow-control method.

A comprehensive study on the efficiency of flow-control methods is presented in [39], which confirms that flow-control methods can effectively improve the reliability of packet transmission in NoCs. This paper proposes a power-efficient flow-control method to improve the reliability of packet transmission in NoCs against crosstalk faults.
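The switch-to-switch handshake described in this section can be illustrated with a small software model. The following is only a minimal sketch under stated assumptions: a single even-parity bit stands in for the error detection code, the retransmission buffer holds one flit, and the names `add_parity`, `parity_ok`, `send_packet`, and `make_faulty_channel` are hypothetical helpers introduced for illustration, not part of the paper.

```python
def add_parity(payload):
    """Append one even-parity bit as the least significant bit (assumed code)."""
    parity = bin(payload).count("1") % 2
    return payload * 2 + parity


def parity_ok(word):
    """The receiver accepts a word whose total number of ones is even."""
    return bin(word).count("1") % 2 == 0


def send_packet(flits, channel, max_attempts=8):
    """Send flits one by one; a flit is resent from the retransmission
    buffer until the receiver answers ACK. Returns (received, attempts)."""
    received = []
    attempts = 0
    for flit in flits:
        retransmission_buffer = add_parity(flit)  # copy kept by the sender
        for _ in range(max_attempts):
            attempts += 1
            arrived = channel(retransmission_buffer)
            if parity_ok(arrived):               # receiver activates ACK
                received.append(arrived // 2)    # strip the parity bit
                break
            # receiver activates NACK: the loop resends the buffered copy
        else:
            raise RuntimeError("flit could not be delivered")
    return received, attempts


def make_faulty_channel(faulty_transfers):
    """Hypothetical channel that flips one bit on the listed transfer indices."""
    state = {"count": 0}

    def channel(word):
        faulty = state["count"] in faulty_transfers
        state["count"] += 1
        return word ^ 0b10 if faulty else word

    return channel
```

With `send_packet([0b1010, 0b0111, 0b0001], make_faulty_channel({1}))`, the second transfer is corrupted, the receiver answers NACK, and the flit is resent once, so three flits need four transfers. An end-to-end scheme would instead check a single code at the destination and request the whole packet again.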

3. Crosstalk analysis

This section presents (1) the related work on crosstalk faults in on-chip communication channels, and (2) an analysis of the probability of crosstalk occurrence in NoC communication channels. We use the following notation to represent the different types of transitions which may appear on a single wire of a communication channel: the symbols ↑ and ↓ represent the transitions 0 → 1 and 1 → 0, respectively, and the symbol – represents no transition (a separate symbol is used for a don't-care transition).

Generally, crosstalk faults happen because of coupling capacitances formed between adjacent wires of communication channels in NoCs. The presence of coupling capacitances causes unwanted correlation between the wires of NoC channels, i.e., signal transitions on some wires of a channel may inadvertently affect other wires [10,11]. The affected wires, which are called victim wires, encounter either delay in their rising/falling edges or unwanted transitions [15,16]. The effects of crosstalk faults directly depend on the signal transitions appearing on the wires of NoC channels. Investigations show [15,21,23] that the ↓↑ and ↑↓ transition patterns happening on two adjacent wires are the main source of crosstalk faults in NoC channels. Although the patterns ↓↑↓ and ↑↓↑ can also produce crosstalk faults, they can be considered as overlapped sequences of the ↓↑ and ↑↓ patterns. In other words, preventing the ↓↑ and ↑↓ patterns directly prevents the ↓↑↓ and ↑↓↑ patterns. Consequently, researchers try to prevent or decrease the rate of ↓↑ and ↑↓ transition patterns to increase the immunity of NoC channels to crosstalk faults.

Crosstalk avoidance codes (CACs) [20–22] reduce the crosstalk probability by avoiding some specific patterns of transitions. The authors of [15] have shown that the delay effects of crosstalk faults can be mitigated by preventing ↓↑ and ↑↓ transitions on two adjacent wires of a communication channel.
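To make the notion of opposite direction transitions on adjacent wires concrete, the sketch below derives the per-wire transition symbols for two consecutive flits and counts the adjacent wire pairs that switch in opposite directions. It is an illustrative helper only: flits are assumed to be non-negative integers of a given bit width, the ASCII symbols `u`, `d`, and `-` stand in for ↑, ↓, and –, and the function names are hypothetical.

```python
def transition_symbols(prev, curr, width):
    """Return one symbol per wire, most significant wire first:
    'u' for 0 -> 1, 'd' for 1 -> 0, '-' for no transition."""
    symbols = []
    for wire in range(width - 1, -1, -1):
        before = (prev >> wire) & 1
        after = (curr >> wire) & 1
        if before == after:
            symbols.append("-")
        elif after == 1:
            symbols.append("u")
        else:
            symbols.append("d")
    return "".join(symbols)


def count_od_transitions(prev, curr, width):
    """Count adjacent wire pairs that switch in opposite directions."""
    symbols = transition_symbols(prev, curr, width)
    return sum(
        1
        for left, right in zip(symbols, symbols[1:])
        if {left, right} == {"u", "d"}
    )
```

For example, sending `0b1010` after `0b0101` on a 4-bit channel gives the symbols `udud`, which contain three adjacent opposite-direction pairs, whereas sending `0b0110` after `0b0011` gives `-u-d`, in which the rising and falling wires are not neighbors and no such pair occurs.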
Based on [23], preventing the patterns ↓↑, –↑–, ↑–↑, and their complements minimizes the delay effects of crosstalk faults. Duplicate-add-parity [24,25] has been proposed to reduce the probability of crosstalk occurrence by duplicating the bits of each flit and adding a parity bit to the duplicated data. Duplicate-add-parity requires expanding the communication channel to about double the flit width, e.g., for flits of width K bits, duplicate-add-parity needs a channel of width 2K + 1 bits. As shown in Fig. 3, the encoder of duplicate-add-parity makes a redundant copy of each flit and adds a one-bit parity to the redundant flit. The original flit, the redundant flit, and the parity bit are sent through the communication channel. On the decoder side, the parity bit is regenerated and compared with the received one. The comparison determines which part of the received data should be stored and which part should be dropped. A close scrutiny of the duplicate-add-parity method reveals that although the ↑↓↑, –↑–, and ↑–↑ transition patterns are prevented, duplicate-add-parity is not able to eliminate or reduce the ↑↓ and ↓↑ patterns. That means the method is still vulnerable to crosstalk faults [15]. The boundary shift code scheme proposed in [15] attempts to reduce crosstalk-induced delay by avoiding a shared boundary between successive flits. This method is very similar to the duplicate-add-parity method, since it uses flit duplication and one parity bit to achieve crosstalk avoidance and single-error correction. However, the boundary shift code method places the parity bit on the opposite side of the double-width flit at each clock cycle. This is done to avoid dependent boundaries in subsequent flits.

Fig. 3. Encoder (A) and decoder (B) of the duplicate-add-parity method for an NoC with channel width of 9 bits.

A pair of opposite direction transitions, called an OD transition hereafter, can be eliminated by avoiding the bit sequences '010' and '101' in all flits traversing NoC channels [20]. However, preventing the bit sequences '010' and '101' requires complex encoders, especially when the width of the communication channel grows [23]. Partial coding has been proposed to tackle this problem; here, the communication channel is broken into several sub-channels with smaller widths. Each sub-channel is encoded separately and then the sub-channels are combined such that the probability of crosstalk occurrence at the boundaries of the sub-channels is minimized [20,21].

To discuss the probability of crosstalk occurrence in an unprotected communication channel, we calculate the expected number of OD transitions, i.e., ↓↑ and ↑↓, in a communication channel of width K bits. Obviously, the probabilities of the other harmful transition patterns mentioned above, i.e., ↓↑↓ and ↑↓↑, are directly related to those of the ↑↓ and ↓↑ patterns. Fig. 4 shows all possible transitions appearing on a 2-bit communication channel when two consecutive flits f0 and f1 pass through the channel. Flits f0 and f1 have a width of 2 bits as well. As an example, in Fig. 4A flit f0 is assumed to be '00' while flit f1 can be any of its possible combinations. The transitions appearing on the channel in this situation are depicted on the right-hand side of flit f1 in Fig. 4A. Considering the high rate of data transmission in the communication channels of an NoC, we can ignore the correlation between the flits f0 and f1 [15,32]. This means we can assume that each bit of a flit gets its value independently of other bits/flits. Let the probability of a single bit in a flit being '0' be P0 and the probability of it being '1' be P1 = 1 − P0. Table 1 represents the frequency and

probability of occurrence for the transition pairs of Fig. 4. In the last column of Table 1, the probabilities of occurrence are calculated under the assumption P0 = P1 = 1/2.

Table 1. Transition pairs appearing on a 2-bit communication channel, their frequencies, and their probabilities.

| Symbol | Transition pair | Frequency of occurrence | Probability of occurrence | Probability of occurrence (assuming P0 = P1 = 1/2) |
|---|---|---|---|---|
| I0 | –– | 4 | [P0^2 + (1 − P0)^2]^2 | 1/4 |
| I1 | –↑ | 2 | P0 (1 − P0) [P0^2 + (1 − P0)^2] | 1/8 |
| I2 | ↑– | 2 | P0 (1 − P0) [P0^2 + (1 − P0)^2] | 1/8 |
| I3 | ↑↑ | 1 | P0^2 (1 − P0)^2 | 1/16 |
| I4 | –↓ | 2 | P0 (1 − P0) [P0^2 + (1 − P0)^2] | 1/8 |
| I5 | ↑↓ | 1 | P0^2 (1 − P0)^2 | 1/16 |
| I6 | ↓– | 2 | P0 (1 − P0) [P0^2 + (1 − P0)^2] | 1/8 |
| I7 | ↓↑ | 1 | P0^2 (1 − P0)^2 | 1/16 |
| I8 | ↓↓ | 1 | P0^2 (1 − P0)^2 | 1/16 |

As shown in Table 1, some of the transition pairs, i.e., I5 and I7, can produce crosstalk faults since they lead to an OD transition on two adjacent wires of the communication channel. However, expanding the width of the communication channel reveals that there are still other possibilities which may cause crosstalk faults. For the sake of clarity, suppose transition pairs I1 and I8 occur side by side in a 4-bit channel. In this situation, the channel experiences the transition sequence '–↑↓↓', which has an OD transition at the boundary of the two transition pairs. Generally, if one of the transition pairs {I1, I3, I7} appears as the left neighbor of one of the transition pairs {I6, I7, I8}, the resulting transition sequence includes '↑↓', i.e., a pair of OD transitions at the boundary of the transition pairs. Similarly, the transition sequence '↓↑' is seen if one of the transition pairs {I4, I5, I8} appears as the left neighbor of one of the transition pairs {I2, I3, I5}. The transition sequences '↑↓' and '↓↑', which are referred to as S1 and S2, have the following probabilities of occurrence:

$$P_{S1} = P(\{I_1, I_3, I_7\})\, P(\{I_6, I_7, I_8\}) = P_0^2 (1 - P_0)^2 \tag{1}$$

$$P_{S2} = P(\{I_4, I_5, I_8\})\, P(\{I_2, I_3, I_5\}) = P_0^2 (1 - P_0)^2 \tag{2}$$

where each set probability, e.g., P({I1, I3, I7}) = P_I1 + P_I3 + P_I7, can be calculated according to Table 1. In a communication channel with a width of K bits, there are a total of K/2 transition pairs and K/2 − 1 boundary transition pairs (see Fig. 5). For no pair of OD transitions to appear on a K-bit channel, both of the following conditions should be satisfied:

(1) The transition pairs appearing on the communication channel are not allowed to be I5 or I7. Fig. 5 shows the transition pairs appearing on the communication channel when a K-bit flit f1 follows a K-bit flit f0. According to Fig. 5, transition pair TP1 appears when the first two bits of f1, Y1Y2, follow the first two bits of f0, X1X2.

(2) The transition pairs appearing on the communication channel are allowed to produce neither S1 nor S2, to prevent OD transitions in boundary transition pairs. In Fig. 5, boundary transition pair b_i is composed of the right transition of the transition pair TP_i and the left transition of the transition pair TP_(i+1).

Fig. 4. All possible transitions appearing on a 2-bit channel when flits f0 and f1 pass through the channel.

Fig. 5. Transition pairs and boundary transitions appearing on a K-bit communication channel.

Let the random variable L represent the number of OD transition pairs appearing on an unprotected communication channel. Based on the above-mentioned conditions, the probability of no pair of OD transitions appearing on a K-bit channel, i.e., P^K_unprotected(L = 0), can be written as:

$$P^{K}_{\text{unprotected}}(L = 0) = \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2} (1 - P_{S1})^{K/2 - 1} \tag{3}$$

which considers that P_S1 = P_S2 (see Table 1). The probability of exactly one pair of OD transitions appearing on the channel is:

$$P^{K}_{\text{unprotected}}(L = 1) = \binom{K/2}{1} (P_{I_5} + P_{I_7}) \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2 - 1} (1 - P_{S1})^{K/2 - 1} + \binom{K/2 - 1}{1} \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2} P_{S1} (1 - P_{S1})^{K/2 - 2} \tag{4}$$

The above equation considers that exactly one OD transition pair appears on the communication channel in each of the following situations: (1) Only one of the transition pairs TP1 to TP(K/2) is I5 or I7, and none of the boundary transition pairs produces S1 or S2. (2) Only one of the boundary transition pairs b1 to b((K/2)−1) produces S1 or S2, and none of the transition pairs TP1 to TP(K/2) is I5 or I7. Extending the above equation, we can calculate the probability of exactly i pairs of OD transitions appearing on the channel by:

$$P^{K}_{\text{unprotected}}(L = i) = \sum_{m=0}^{i} \binom{K/2}{m} (P_{I_5} + P_{I_7})^{m} \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2 - m} \binom{K/2 - 1}{i - m} P_{S1}^{\,i - m} (1 - P_{S1})^{K/2 - i - 1 + m} \tag{5}$$

where i ≤ K/2. Finally, the expected number of OD transition pairs appearing on a K-bit communication channel is:

$$E^{K}_{\text{unprotected}}(OD) = \sum_{i=0}^{K/2} i \cdot P^{K}_{\text{unprotected}}(L = i) = \sum_{i=0}^{K/2} \sum_{m=0}^{i} i \binom{K/2}{m} (P_{I_5} + P_{I_7})^{m} \left[1 - (P_{I_5} + P_{I_7})\right]^{K/2 - m} \binom{K/2 - 1}{i - m} P_{S1}^{\,i - m} (1 - P_{S1})^{K/2 - i - 1 + m} \tag{6}$$

Now, we can use the above equation to calculate the average number of OD transitions in a K-bit channel when Q bits of data are transmitted through the channel by:

$$\text{Average}^{K}_{\text{unprotected}}(OD) = \frac{Q}{K} \cdot E^{K}_{\text{unprotected}}(OD) \tag{7}$$

which helps us to compare the expected number of OD transitions in an unprotected channel with that of a channel exploiting the proposed method (see Section 5).

4. The proposed method

As discussed earlier, the rate of crosstalk faults in NoC channels can be effectively decreased by preventing specific transition patterns, i.e., ↓↑ and ↑↓ [15,21,23]. This section presents the proposed flow-control method, called FRR (Flit Reordering/Rotation), which totally eliminates the appearance of OD transitions on NoC channels. To reach this aim, the FRR method exploits three mechanisms, namely flit-reordering, flit-rotation, and flit-insertion. The first two mechanisms, i.e., flit-reordering and flit-rotation, reduce the rate of OD transitions on the channel with respect to an unprotected channel, while the third mechanism, i.e., flit-insertion, eliminates the remaining OD transitions from NoC channels. In the first mechanism, the flits of every newly generated packet are examined and reordered to find a sequence of flits which produces the lowest number of OD transitions between consecutive flits of the packet. The second mechanism, i.e., flit-rotation, logically rotates the content of each flit with respect to the flit that previously passed through the channel to reduce the number of OD transitions on the channel even further. When a packet has been reordered and rotated by the first and second mechanisms, the third mechanism examines the flits of the packet to find those OD transitions which are not removed by the first and second mechanisms. The third mechanism inserts null-flits between the affected flits to completely remove opposite direction transitions from NoC channels. In the rest of this section we describe the hardware aspects of the three mechanisms.

4.1. Flit reordering

The first mechanism of the FRR method reorders the flits of each newly generated packet to reduce the rate of OD transitions appearing on the communication channels of the NoC.
Flit-reordering is done at the time of injecting the packet into the network with at most a few cycles of delay. Encoder of the ﬂit-reordering mechanism divides ﬂits of the packet into some non-overlapping windows. Flits of each window are then separately examined to ﬁnd a sequence of ﬂits with minimum number of OD transitions. Fig. 6A shows how ﬂits of a packet are divided into h non-overlapping windows and Fig. 6B shows how ﬂits may be reordered by the ﬂit-reordering encoder. For instance, the ﬁrst ﬂit of the ﬁrst window, ﬂit f1,w1, is reordered by the encoder as a last ﬂit of the ﬁrst window (compare Fig. 6A with B). Although the shown packet in Fig. 6A is divided into h windows of n ﬂits, the ﬂit-reordering encoder supports the case that the last window contains less than n ﬂits. The ﬂit-reordering encoder adds tag bits to the ﬂits to enable the ﬂit-reordering decoder to restore the original order of ﬂits at the destination node. When ﬂits of all windows are reordered by the encoder, the reordered packet should be passed through the output port of the encoder. From the NoC point of view, this is an ordinary packet which will be delivered to its corresponding destination. At the destination node, the decoder of ﬂit-reordering mechanism uses the tag bits and rearranges the ﬂits to recover the original packet. Obviously the same window sizes are used in both the encoder and decoder sides. Flit-reordering encoder considers the last ﬂit of the window i when it is reordering ﬂits of the window i + 1. This is done by the means of an extra ﬂit buffer namely Previously Sent Flit buffer (PSFreordering) which is added into the architecture of the ﬂit-reordering encoder. Fig. 7 represents the block diagram of ﬂit-reordering encoder. Flit-reordering encoder is composed of n ﬂit buffers, n

771

A. Patooghy et al. / Microprocessors and Microsystems 35 (2011) 766–778

A f1,W1

Window 2

f2,W1

fn,W1

f1,W2

f2,W2

f2,W2

f1,Wh

f2,Wh

fn,Wh

Window h

Window 1

Flits of the packet separated in h windows

B

Tag bits

fn,W1

Window 2

f2,W1

f1,W1

f2,W2

fn,W2

f1,W2

f2,Wh

f1,Wh

fn,Wh

Window h

\

Window 1

Fig. 6. Flits of the packet are divided into h non-overlapping window (A), then ﬂits in each window are reordered separately (B).

Input Sequense of Flits Sent Flit Detector

PSF Buffer

Flit Buffer 1

OD Trans. Extractor #1

Flit Buffer 2

OD Trans. Extractor #2

Flit Buffer 3

OD Trans. Extractor #3

Flit Buffer n

OD Trans. Extractor #n

Window of Flits NW Detector

Flit-Reorddering Encoder

1-bit width k-bits width

Minimum Detector

Output Sequence of Flits Dirty Bit Flip-Flop

Fig. 7. Block diagram of the ﬂit-reordering encoder which mitigates the rate of opposite direction transitions on NoC channels.

dirty bit flip-flops, n OD transition detector modules, a PSFreordering buffer, and a Minimum Detector module, where n is the size of the reordering window. For every window of n flits, n rounds of competition are needed to find the best sequence of flits. In each round of competition, which lasts one cycle, the flits whose dirty bits are '0' are examined to find the flit that produces the lowest number of OD transitions with respect to the previously sent flit. This flit, called the winner flit, is chosen by the Minimum Detector module to pass the encoder in the next cycle. Note that OD transitions produced by the tag bits are also considered in the winner flit selection. After the winner flit is selected, the following tasks are performed to initialize the encoder for the next round of competition: (1) the winner flit is sent through the output port of the encoder, (2) a copy of the winner flit is written to the PSFreordering buffer, and (3) the dirty bit of the winner flit is set to '1' to keep the winner flit out of the subsequent competitions.
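The round-by-round competition can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' VHDL design: the function names (od_pairs, reorder_window, restore_window) are hypothetical, flits are modeled as integers, an OD transition is counted as a pair of adjacent wires switching in opposite directions, and, unlike the encoder described in the text, the sketch does not count OD transitions contributed by the tag wires.

```python
def od_pairs(prev: int, cur: int, width: int) -> int:
    """Count adjacent wire pairs that switch in opposite directions."""
    mask = (1 << width) - 1
    rising = ~prev & cur & mask    # wires going 0 -> 1
    falling = prev & ~cur & mask   # wires going 1 -> 0
    # A pair is counted when one wire rises while its neighbour falls.
    pairs = (rising & (falling >> 1)) | (falling & (rising >> 1))
    return bin(pairs).count("1")


def reorder_window(window, prev, width):
    """Greedy competition: one winner per round, n rounds per window.

    Returns the sent flits as (tag, flit) tuples plus the last sent flit,
    which plays the role of the PSF buffer for the next window.
    """
    dirty = [False] * len(window)          # dirty bit per flit buffer
    sent = []
    for _ in window:
        contenders = [i for i, d in enumerate(dirty) if not d]
        winner = min(contenders,
                     key=lambda i: od_pairs(prev, window[i], width))
        dirty[winner] = True               # exclude winner from later rounds
        sent.append((winner, window[winner]))
        prev = window[winner]              # copy winner into the PSF buffer
    return sent, prev


def restore_window(tagged):
    """Decoder side: use the tags to put the flits back in original order."""
    restored = [None] * len(tagged)
    for tag, flit in tagged:
        restored[tag] = flit
    return restored


if __name__ == "__main__":
    tagged, last = reorder_window([0b100101, 0b001100], prev=0b111000, width=6)
    print(tagged, restore_window(tagged))
```

In this example the window [100101, 001100] follows the flit 111000. Sending the flits in their original order would produce one OD pair, whereas the greedy competition sends 001100 first and produces none; the tags let the decoder restore the original order.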

To minimize the performance overhead of the flit-reordering encoder, the flits of the packet are loaded into the buffers of the encoder in a pipelined fashion. In this way, the lowest possible performance overhead, a delay of n cycles for an encoder with a window size of n flits, is achieved. However, to keep the flits of the next window out of the competitions of the current window, the flits of the next window are loaded into the flit-reordering encoder with dirty bits of '1'. When all n flits of the next window have been loaded into the flit-reordering buffers, the Next Window Detector module, labeled NW detector in Fig. 7, resets all n dirty bits at the same time. The flit-reordering encoder starts processing the next window at this point. Flits are reordered only at the source and destination nodes, i.e., the intermediate nodes do not take part in the reordering, so the flit-reordering mechanism can be considered an end-to-end flow-control method. As discussed earlier, in end-to-end


flow-control methods one coding module is needed per node of the NoC. This means that the hardware overhead of the flit-reordering mechanism is very low (for more details see Section 5). It should be noted that the flit-reordering mechanism requires some additional wires in the communication channel to send the tag bits of the flits. The number of these additional wires is log2(n), where n is the size of the reordering window.

4.2. Flit rotation

After the flits of a packet have been reordered by the flit-reordering encoder, the second mechanism, i.e., flit-rotation, is applied to the flits of the packet to further reduce the number of OD transitions. The flit-rotation encoder creates m rotated versions of every flit that is about to be sent through the channel. These versions are examined to select the one that produces the minimum number of OD transitions on the channel. For the sake of clarity, let f be the flit that is currently about to be sent through the channel, and let f_i denote the i-bit left-rotated version of flit f, where 0 ≤ i < m. The flit-rotation encoder computes, for all i with 0 ≤ i < m, the number of OD transitions that would appear on the channel if flit f_i were sent. The flit f_j that produces the lowest number of OD transitions is selected as the winner flit and sent through the channel. Fig. 8 shows the block diagram of the encoder of the flit-rotation mechanism. The encoder module is designed to minimize the timing overhead of the flit-rotation mechanism. To this end, in each clock cycle one flit of the packet, referred to in Fig. 8 as the current flit to send (CFS), is examined and encoded. The CFS is encoded with respect to the flit that previously passed the encoder, i.e., the previously sent flit (PSFrotation). To recover the original flit at the destination node, a tag field is added to the winner flit to specify the number of rotations that must be applied to the winner flit at the destination node. Since m versions of the flit are examined, log2(m) bits are required for the tag field. Note that each of the first and second mechanisms has its own previously sent flit buffer as well as its own tag field to enable the destination node to recover the original packet. According to Fig. 8, Extractor module #i is a combinational logic block that counts the number of OD transitions between flit f_{i-1} and the previously sent flit. The Minimum Detector module selects the winner flit and allows the corresponding Extractor module #i to write flit f_{i-1} to the output port of the encoder. The other extractors are disconnected from the output port at this time. To take the tag bits into account, Extractor module #i appends the appropriate tag bits to flit f_{i-1} before it computes the number of OD transitions for flit f_{i-1}. Although there are X − 1 rotated versions for an X-bit channel, to limit the area and power consumption overheads of the flit-rotation encoder we used m versions for each flit, where m = 2, 4, 8, and 16. The effects of different rotation sizes on the effectiveness and overheads of the flit-rotation encoder are studied in Section 5. Flits that leave the flit-rotation encoder are fed into the third encoder, which is described in the following subsection.

Fig. 8. Block diagram of the flit-rotation encoder used to mitigate the rate of opposite direction transitions on NoC channels.

4.3. Flit insertion

The third mechanism used in the FRR method, named flit-insertion, eliminates the remaining OD transitions on NoC communication channels. This mechanism examines the flits of the packet to find those OD transitions between consecutive flits that are not removed by the first and second mechanisms. Null-flits are then inserted between the affected flits of the packet to prevent OD transitions from appearing on the channel. A null-flit is a flit whose bits are all zero. As an example, suppose flits f1 = '001100', f2 = '111000', and f3 = '100101' are to be received by the flit-insertion encoder at times t1, t2, and t3, respectively. Since the two other mechanisms, i.e., flit-reordering and flit-rotation, deliver one flit per cycle at their outputs, the flit-insertion mechanism must be able to receive one flit per cycle. This means that flits f1, f2, and f3 are received by the flit-insertion encoder at times t1, t1 + 1, and t1 + 2, respectively. The transition sequence '↑↑–↓––' appears on the channel when flit f2 follows flit f1, and flit f3 produces the transition sequence '–↓↓↑–↑' on the channel when it follows flit f2, where ↑, ↓, and – denote a rising transition, a falling transition, and no transition, respectively. Since there is an unresolved OD transition between flits f2 and f3 (a falling wire next to a rising wire), the flit-insertion encoder performs the following tasks: (1) It provides a one-cycle stall between flits f2 and f3 by using the temporary flit buffers embedded in this encoder. To do this, flit f3 is held for one cycle in the flit-insertion encoder and then continues on its way and passes the encoder. In other words, flit f3 leaves the flit-insertion encoder two cycles after flit f2. (2) It sends a null-flit through the output port of the encoder during the stall cycle to discharge all wires of the channel. In this way, the transition sequence appearing between flit f2 and the inserted null-flit is '↓↓↓–––', and the transition sequence appearing between the inserted null-flit and flit f3 is '↑––↑–↑'. As can be seen, the flit-insertion encoder eliminates all of the remaining OD transitions on the NoC communication channel. However, because of the performance overhead of delay insertion, we use the flit-insertion encoder as the last mechanism of the FRR method. Fig. 9 shows the block diagram of the flit-insertion encoder. As shown in Fig. 9, temporary buffers are needed in the flit-insertion encoder to provide a one-cycle stall whenever a null-flit must be inserted between flits of the packet. Fig. 10 shows how the FRR method combines the three encoders in a pipelined manner to remove OD transitions from NoC channels. This architecture minimizes the performance and power consumption overheads of the FRR method. The performance overheads of the first and second encoders are constant delays of n cycles and one cycle, respectively, where n is the size of the reordering window in the first encoder. Simulation and analytical results show that the performance overhead of the third mechanism is negligible as well (see Section 5). Altogether, the FRR method imposes a few nanoseconds of delay on the critical path of the switch architecture, which is negligible for end-to-end flow-control methods. The evaluations performed in Section 5 confirm that the FRR method can be used as a cost-efficient method to overcome the problem of crosstalk faults in NoC channels.
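The second and third mechanisms can be illustrated with a short Python sketch that reuses the flits of the example in Section 4.3. It is a simplified model under stated assumptions rather than the encoder hardware of Figs. 8 and 9: the function names are hypothetical, flits are integers, an OD transition is counted as a pair of adjacent wires switching in opposite directions, and the tag wires are left out of the OD count.

```python
def od_pairs(prev: int, cur: int, width: int) -> int:
    """Count adjacent wire pairs that switch in opposite directions."""
    mask = (1 << width) - 1
    rising = ~prev & cur & mask
    falling = prev & ~cur & mask
    pairs = (rising & (falling >> 1)) | (falling & (rising >> 1))
    return bin(pairs).count("1")


def rotate_left(flit: int, i: int, width: int) -> int:
    """Rotate a width-bit flit left by i bit positions."""
    i %= width
    mask = (1 << width) - 1
    return ((flit << i) | (flit >> (width - i))) & mask


def best_rotation(prev: int, flit: int, m: int, width: int):
    """Pick, among the rotations by 0..m-1 bits, the one with fewest OD pairs.

    Returns (tag, rotated_flit); the tag is the rotation amount that the
    decoder has to undo.
    """
    versions = ((i, rotate_left(flit, i, width)) for i in range(m))
    return min(versions, key=lambda v: od_pairs(prev, v[1], width))


def insert_null_flits(flits, prev: int, width: int):
    """Insert an all-zero flit wherever an OD pair would otherwise appear."""
    out = []
    for flit in flits:
        if od_pairs(prev, flit, width) > 0:
            out.append(0)      # null-flit: every wire is discharged
            prev = 0           # after a null-flit only rising edges can occur
        out.append(flit)
        prev = flit
    return out


if __name__ == "__main__":
    f1, f2, f3 = 0b001100, 0b111000, 0b100101
    print(best_rotation(f2, f3, m=2, width=6))
    print(insert_null_flits([f1, f2, f3], prev=0, width=6))
```

With m = 2, rotating f3 by one bit already removes its OD pair with respect to f2, and the decoder recovers f3 by rotating the received flit left by width minus the tag. Without rotation, the insertion stage places one null-flit between f2 and f3, as in the example of Section 4.3.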

5. Evaluation of the proposed method

5.1. Analytical evaluation

In this section we calculate the average number of OD transition pairs in an NoC channel that uses the FRR method. The result is then compared with the value calculated for an unprotected channel in Section 3. For the sake of fairness, this section uses the same assumptions as Section 3, i.e., flits are assumed to be uncorrelated [15,32] before and after the tag bits are added. Eq. (5) (see Section 3) gives the probability of having L = l OD transition pairs in an unprotected channel with a width of K bits:

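$$P^{K}_{\mathrm{unprotected}}(L = l) = \sum_{m=0}^{l} \binom{K/2}{m} (P_{I5} + P_{I7})^{m} \left[1 - (P_{I5} + P_{I7})\right]^{\frac{K}{2} - m} \binom{\frac{K}{2} - 1}{l - m} P_{S1}^{\,l-m} (1 - P_{S1})^{\frac{K}{2} - l - 1 + m} \tag{5}$$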

Eq. (5) is thus the probability of having L = l OD transition pairs at the input of the FRR encoder, which is labeled point (a) in Fig. 10. This section recalculates this probability for an FRR-enabled communication channel. To this end, we calculate the probability at the output ports of the first and second encoders, i.e., points (b) and (c). Note that, because of the third mechanism used in the FRR method, no OD transition appears at the output port of the flit-insertion encoder, i.e., point (d) of Fig. 10. Such an analysis helps us to (1) study the efficiency of the mechanisms used in the FRR method, and (2) estimate the average timing overhead of the third mechanism of the FRR method. Let us now calculate the expected number of OD transition pairs at the output port of the flit-reordering encoder. In the flit-reordering mechanism, there are n rounds of competition for a window of n flits. In the first round of competition, the numbers of OD transition pairs between the previously sent flit and each of the n loaded flits are n random variables $L_1, L_2, \ldots, L_n$, all of which have the distribution $P^{K}(L = l)$. Since the flit with the lowest number of OD transition pairs is selected as the winner flit, the number of OD transition pairs at the output port of the flit-reordering mechanism, i.e., point (b), in the first round of competition is a random variable $L^{1}_{\min}$, which is defined as:

Fig. 9. Block diagram of the ﬂit-insertion encoder used to eliminate opposite direction transitions on NoC channels.


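[Figure 10: the FRR encoder pipeline. Flits from the local core enter at point (a), pass the flit-reordering encoder to point (b), the flit-rotation encoder to point (c), and the flit-insertion encoder to point (d), and then leave to the network.]

Fig. 10. Block diagram of the FRR encoder.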

$$L^{1}_{\min} = \mathrm{Min}(L_1, L_2, \ldots, L_n) \tag{8}$$

In the second round of competition, the winner flit has already passed the flit-reordering encoder, so the n − 1 remaining flits of the window take part in the competition. By similar reasoning, the number of OD transition pairs appearing at the output port of the flit-reordering mechanism in the second round of competition is a random variable $L^{2}_{\min}$, which is defined as:

$$L^{2}_{\min} = \mathrm{Min}(L_1, L_2, \ldots, L_{n-1}) \tag{9}$$

and, in general, the number of OD transition pairs appearing at the output port of the flit-reordering encoder in the ith round of competition is a random variable $L^{i}_{\min}$:

$$L^{i}_{\min} = \mathrm{Min}(L_1, L_2, \ldots, L_{n-i+1}) \tag{10}$$

where 1 ≤ i ≤ n. The probability of having fewer than l OD transition pairs in the ith round of competition can be calculated as:

$$P^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min} < l) = P^{K}(\mathrm{Min}(L_1, L_2, \ldots, L_{n-i+1}) < l) \tag{11}$$

$$P(\mathrm{Min}(L_1, L_2, \ldots, L_{n-i+1}) < l) = 1 - P^{K}(L_1 \ge l)\, P^{K}(L_2 \ge l) \cdots P^{K}(L_{n-i+1} \ge l) \tag{12}$$

Since we assumed that there is no correlation between the flits of a window, $P^{K}(L_q \ge l) = P^{K}(L \ge l)$ for every q, so $P^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min} < l)$ can be simplified to:

$$P^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min} < l) = 1 - \left[1 - P^{K}_{\mathrm{unprotected}}(L_1 < l)\right]^{n-i+1}$$

Using the above cumulative probability function, the probability of having exactly l OD transition pairs at the output port of the flit-reordering encoder in the ith round of competition can be calculated by:

$$P^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min} = l) = P^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min} \le l) - P^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min} < l) \tag{13}$$

Subsequently, the expected number of OD transition pairs in the ith round of competition is as follows:

$$E^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min}) = \sum_{j=0}^{K/2} j \, P^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min} = j) \tag{14}$$

The expected number of OD transition pairs appearing at the output port of the flit-reordering encoder when all n flits of a window pass the encoder is:

$$E^{K}_{\mathrm{FlitReordering}}(OD) = \frac{1}{n} \sum_{i=1}^{n} E^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min}) \tag{15}$$

Finally, the average number of OD transition pairs appearing at the output port of the flit-reordering encoder when Q bits of data pass this encoder is:

$$\mathrm{Average}(OD)^{K}_{\mathrm{FlitReordering}} = \frac{Q'}{K} \left( \frac{1}{n} \sum_{i=1}^{n} E^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min}) \right) \tag{16}$$

where $Q' = Q + \frac{Q}{K} \log_2 n$ accounts for the tag bits added to the flits by the flit-reordering encoder. In order to calculate the average number of OD transition pairs at the output port of the flit-rotation encoder, i.e., point (c), we should first calculate the probability of having L = l OD transition pairs at the input port of the flit-rotation encoder, i.e., point (b). Note that Eq. (12) calculates this probability for a given competition round. This equation should be modified to give the probability of having L = l OD transition pairs at point (b) regardless of the competition round in the first encoder. To this end, we assume that at each instant of time the flit-reordering encoder is in each round of competition with probability 1/n. Consequently, the probability of having L = l OD transition pairs at the output port of the flit-reordering encoder regardless of the competition round can be calculated by:

$$P^{K}_{\mathrm{FlitReordering}}(L = l) = \frac{1}{n} \sum_{i=1}^{n} P^{K,i}_{\mathrm{FlitReordering}}(L^{i}_{\min} = l) \tag{17}$$

Since the flit-rotation encoder selects its winner flit among m rotated versions of the incoming flit, the probability of having fewer than l OD transition pairs at the output port of the flit-rotation encoder, i.e., point (c), is:

$$P^{K}_{\mathrm{FlitRotation}}(L < l) = P(\mathrm{Min}(L_1, L_2, \ldots, L_m) < l) \tag{18}$$

where the random variables $L_1, L_2, \ldots, L_m$ have the distribution function $P^{K}_{\mathrm{FlitReordering}}(L = l)$. Accordingly, the probability of having L = l OD transition pairs at the output port of the flit-rotation encoder is:

$$P^{K}_{\mathrm{FlitRotation}}(L = l) = P^{K}_{\mathrm{FlitRotation}}(L \le l) - P^{K}_{\mathrm{FlitRotation}}(L < l) \tag{19}$$

The expected number of OD transition pairs appearing at the output port of the flit-rotation encoder is:

$$E^{K}_{\mathrm{FlitRotation}}(OD) = \sum_{j=0}^{K/2} j \, P^{K}_{\mathrm{FlitRotation}}(L = j) \tag{20}$$
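Under the independence assumption, the derivation for points (b) and (c) reduces to the distribution of the minimum of several independent, identically distributed discrete random variables. The following Python sketch shows that computation numerically. It is an illustration only: the function names are hypothetical, and `toy_pmf` is a placeholder binomial distribution chosen for the example, not the distribution of Eq. (5), whose parameters are defined in Section 3.

```python
from math import comb


def min_cdf_strict(pmf, count, l):
    """P(min of `count` i.i.d. draws < l) = 1 - P(L >= l) ** count."""
    tail = sum(pmf[l:])                      # P(L >= l)
    return 1.0 - tail ** count


def min_pmf(pmf, count):
    """P(min = l) = P(min <= l) - P(min < l), for l = 0 .. len(pmf) - 1."""
    return [min_cdf_strict(pmf, count, l + 1) - min_cdf_strict(pmf, count, l)
            for l in range(len(pmf))]


def expectation(pmf):
    """Expected value of a pmf given as a list indexed by the outcome."""
    return sum(j * p for j, p in enumerate(pmf))


def reordering_pmf(pmf, n):
    """Point (b): round i has n - i + 1 contenders; rounds are weighted 1/n."""
    rounds = [min_pmf(pmf, n - i + 1) for i in range(1, n + 1)]
    return [sum(r[l] for r in rounds) / n for l in range(len(pmf))]


def rotation_pmf(pmf_b, m):
    """Point (c): minimum over m rotated versions drawn from the point (b) pmf."""
    return min_pmf(pmf_b, m)


# Placeholder distribution over 0..4 OD pairs (an assumption, not Eq. (5)).
toy_pmf = [comb(4, j) * 0.25 ** j * 0.75 ** (4 - j) for j in range(5)]

if __name__ == "__main__":
    point_b = reordering_pmf(toy_pmf, n=4)
    point_c = rotation_pmf(point_b, m=4)
    print(expectation(toy_pmf), expectation(point_b), expectation(point_c))
```

With any non-degenerate input distribution, the expected number of OD pairs drops from point (a) to point (b) and again from point (b) to point (c), which is the qualitative behavior the analytical model is meant to capture.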

Finally, the average number of OD transition pairs appearing at the output port of the flit-rotation encoder when Q bits of data pass this encoder is:

$$\mathrm{Average}(OD)^{K}_{\mathrm{FlitRotation}} = \frac{Q''}{K} \, E^{K}_{\mathrm{FlitRotation}}(OD) \tag{21}$$

where $Q'' = Q' + \left\lceil \frac{Q'}{K} \right\rceil \log_2 m$ accounts for the tag bits added to the flits by the flit-rotation encoder. Since the third mechanism of the FRR method eliminates all the remaining OD transitions, no OD transition appears at its output port, i.e., on the injection channel of the source node. In other words, the number of OD transitions at the output port of the flit-insertion encoder, i.e., point (d), is zero. As mentioned, the flit-insertion encoder imposes a one-cycle delay between two consecutive flits if there is at least one OD transition between them. The number of delay cycles imposed by the flit-insertion encoder is $M \left(1 - P^{K}_{\mathrm{FlitRotation}}(L = 0)\right)$, where M is the number of flits in the packet at the input port of the flit-insertion encoder. Using this value,

the length of the packet at the output port of the flit-insertion encoder can be calculated by:

$$M' = M + M \left(1 - P^{K}_{\mathrm{FlitRotation}}(L = 0)\right) \tag{22}$$

Consequently, on average (M′ − M) temporary buffers are needed in the flit-insertion encoder. Using the above discussion, an unprotected channel is compared with a channel equipped with the FRR method in terms of the average number of OD transitions. Table 2 lists the average number of OD transitions for 8-, 16- and 32-bit channels when 600 KB of random data is transmitted through the channels. The average numbers of OD transitions at points (a), (b), (c), and (d) are extracted by means of the proposed analytical model. Points (a) and (d) can be considered as an unprotected channel and an FRR-enabled channel, respectively, whereas points (b) and (c) help us to study the behavior of the mechanisms used in the FRR encoder. Parameters n and m in Table 2 refer to the size of the reordering window used in the flit-reordering encoder and the maximum rotation used in the flit-rotation encoder, respectively. As can be seen in Table 2, the mechanisms used in the FRR method efficiently eliminate OD transitions in NoC channels. Because of the third mechanism used in the FRR method, the average number of temporary buffers required in the flit-insertion encoder depends on the average number of OD transition pairs appearing at point (c). This value is also calculated using the proposed model to estimate the buffer requirement of the flit-insertion encoder. The next subsection presents experimental results to validate the results extracted from analytical modeling. Considering the mechanisms used in the first encoder of the FRR method, the performance overhead of the first encoder is $n + \left\lceil \frac{M \log_2 n}{K} \right\rceil$ cycles, which is due to the n reordering buffers and the added tag bits. Similarly, the performance overhead of the second encoder is $1 + \left\lceil \frac{Z \log_2 m}{K} \right\rceil$, where $Z = M + \left\lceil \frac{M \log_2 n}{K} \right\rceil$. Table 2 lists the total performance overhead imposed on a packet with a length of 32 flits, i.e., M = 32, when the packet leaves the flit-rotation encoder, i.e., at point (c).

5.2. Experimental evaluation

In order to experimentally evaluate the proposed flow-control method, a VHDL-based simulator was developed. The simulator is composed of an FRR encoder module and a random flit generator module. The FRR encoder receives flits from the random flit generator module and encodes them using the three mechanisms, i.e., flit-reordering, flit-rotation, and flit-insertion. The number of OD transition pairs as well as the number of transitions is counted at points (a), (b), (c), and (d) (see Fig. 10) by monitor hardware added to the FRR encoder. A synthesizable version of the simulator is used to investigate the power consumption, area, and timing overheads of the FRR method. To this end, the Design Compiler tool is used to extract the overheads of the FRR method synthesized in a 65 nm technology. The power consumption, area overhead, and critical path delay of the monitor hardware are excluded from our results since this hardware does not exist under real working conditions. Simulation experiments are carried out for different reordering sizes (n), rotation sizes (m), and channel widths (K). In each simulation experiment, 5 MB of random data is generated by the flit generator module and delivered to the FRR encoder module. Since the amount of power saving depends on the number of channels per NoC node as well as on the length of the NoC channels, in our evaluations:

(1) We studied a single NoC channel, which makes our evaluations independent of the NoC architecture. (2) We did not log the power consumption of the channel. Instead, we logged the number of opposite direction transitions as well as the total number of transitions at points (a), (b), (c), and (d) using monitor hardware added to the FRR encoder. Given these two points, we can claim that our power simulations hold for all lengths of NoC channels and all NoC architectures. However, simulation experiments are done for

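Table 2. Improvements and overheads of the FRR method extracted by analytical modeling. OD columns give the average number of OD transition pairs; reductions are with respect to point (a).

| Channel width | Size of reordering window (n) | Maximum rotation (m) | OD pairs, point a | OD pairs, point b | OD pairs, point c | Reduction, point b (%) | Reduction, point c (%) | Performance overhead (cycles) |
|---|---|---|---|---|---|---|---|---|
| 8 bits | 2 | 2 | 429,255 | 171,477 | 127,158 | 60.1 | 70.4 | 12 |
| 8 bits | 2 | 4 | 429,255 | 171,477 | 20,825 | 60.1 | 95.1 | 18 |
| 8 bits | 2 | 8 | 429,255 | 171,477 | 540 | 60.1 | 99.9 | 27 |
| 8 bits | 4 | 2 | 429,255 | 58,037 | 55,862 | 86.5 | 87.0 | 16 |
| 8 bits | 4 | 4 | 429,255 | 58,037 | 3,725 | 86.5 | 99.1 | 23 |
| 8 bits | 4 | 8 | 429,255 | 58,037 | 15 | 86.5 | 100 | 32 |
| 8 bits | 8 | 2 | 429,255 | 16,914 | 12,570 | 96.1 | 97.1 | 21 |
| 8 bits | 8 | 4 | 429,255 | 16,914 | 3,337 | 96.1 | 99.2 | 28 |
| 8 bits | 16 | 2 | 429,255 | 4,633 | 739 | 98.9 | 99.8 | 25 |
| 8 bits | 16 | 4 | 429,255 | 4,633 | 23 | 98.9 | 100 | 33 |
| 16 bits | 4 | 2 | 449,214 | 70,987 | 50,597 | 84.2 | 88.7 | 10 |
| 16 bits | 4 | 4 | 449,214 | 70,987 | 39,415 | 84.2 | 91.2 | 14 |
| 16 bits | 4 | 8 | 449,214 | 70,987 | 4,056 | 84.2 | 99.1 | 20 |
| 16 bits | 8 | 2 | 449,214 | 23,719 | 12,046 | 94.7 | 97.3 | 12 |
| 16 bits | 8 | 4 | 449,214 | 23,719 | 9,161 | 94.7 | 98.0 | 16 |
| 16 bits | 8 | 8 | 449,214 | 23,719 | 210 | 94.7 | 100 | 23 |
| 16 bits | 16 | 2 | 449,214 | 6,895 | 2,886 | 98.5 | 99.4 | 14 |
| 16 bits | 16 | 4 | 449,214 | 6,895 | 1,012 | 98.5 | 99.8 | 18 |
| 32 bits | 4 | 8 | 458,984 | 82,913 | 76,504 | 81.9 | 83.3 | 15 |
| 32 bits | 4 | 16 | 458,984 | 82,913 | 30,496 | 81.9 | 93.4 | 24 |
| 32 bits | 8 | 4 | 458,984 | 32,388 | 27,642 | 92.9 | 94.0 | 11 |
| 32 bits | 8 | 8 | 458,984 | 32,388 | 17,674 | 92.9 | 96.1 | 16 |
| 32 bits | 8 | 16 | 458,984 | 32,388 | 7,720 | 92.9 | 98.3 | 25 |
| 32 bits | 16 | 4 | 458,984 | 11,881 | 10,693 | 97.4 | 97.7 | 12 |
| 32 bits | 16 | 8 | 458,984 | 11,881 | 8,149 | 97.4 | 98.2 | 17 |
| 32 bits | 16 | 16 | 458,984 | 11,881 | 550 | 97.4 | 99.9 | 26 |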


different reordering sizes (n), rotation sizes (m), and channel widths (K). Simulations are done under the constraint of fixed flit width, i.e., flits are generated with widths of 16 or 32 bits, and consequently the NoC channels require widths of 16 + log2(mn) or 32 + log2(mn) bits, respectively, to carry the flit together with the tag bits added to it by the FRR encoder. On the other hand, the fixed channel width constraint used in our analytical modeling assumes widths of 16 or 32 bits for the NoC channels, so the flits must have widths of 16 − log2(mn) or 32 − log2(mn) bits, respectively, to leave enough space for the tag bits. Under the first constraint, the performance overhead of the FRR encoder is a delay of n + 1 cycles, which is the minimum since the tag bits have their own wires on the NoC channels. However, the area overhead is proportional to log2(mn), which is the maximum area overhead of the FRR encoder. In contrast, under the fixed channel width constraint, the performance overhead of the FRR method is maximized, i.e., $1 + \left\lceil \frac{\left(M + \left\lceil \frac{M \log_2 n}{K} \right\rceil\right) \log_2 m}{K} \right\rceil$, and the area overhead is minimized.
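The difference between the two constraints comes down to simple bit accounting, which the following Python sketch makes explicit. The function names are hypothetical, and the sketch assumes that n and m are powers of two so that log2(mn) is an integer number of tag bits.

```python
def tag_bits(n: int, m: int) -> int:
    """Tag bits per flit: log2(n) for reordering plus log2(m) for rotation."""
    product = n * m
    if product < 1 or product & (product - 1):
        raise ValueError("n * m is assumed to be a power of two")
    return product.bit_length() - 1


def channel_width_fixed_flit(flit_width: int, n: int, m: int) -> int:
    """Fixed flit width: the channel grows by the tag wires."""
    return flit_width + tag_bits(n, m)


def payload_width_fixed_channel(channel_width: int, n: int, m: int) -> int:
    """Fixed channel width: the payload shrinks to make room for the tags."""
    return channel_width - tag_bits(n, m)


def latency_fixed_flit(n: int) -> int:
    """Fixed flit width: n cycles for reordering plus one for rotation."""
    return n + 1


if __name__ == "__main__":
    print(channel_width_fixed_flit(16, n=2, m=8))
    print(payload_width_fixed_channel(16, n=2, m=8))
    print(latency_fixed_flit(2))
```

For n = 2 and m = 8, for instance, each flit carries four tag bits, so a 16-bit flit needs a 20-bit channel under the fixed flit width constraint, while a 16-bit channel leaves 12 payload bits under the fixed channel width constraint.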

Table 3 presents the results obtained under the fixed flit width constraint. It shows the number of OD transition pairs and the number of transitions at points (a), (b), and (c) of the FRR encoder. The results at point (a) can be regarded as the results of an unprotected NoC communication channel. For example, consider the FRR encoder with n = 2, m = 8, and a flit width of 16 bits. In this case, the flit-reordering encoder reduces the number of OD transition pairs from 8,281,237 to 4,921,862, i.e., a 40.6% reduction. In addition, the number of transitions is reduced from 21,113,314 to 18,261,772, i.e., a 13.5% reduction. After flit-reordering, the flit-rotation encoder reduces these values to 93,210 and 1,193,401, respectively, which corresponds to reductions of 99.1% and 94.3% with respect to point (a). According to Table 3, higher values of n and m generally yield larger reductions in OD transition pairs and regular transitions. Studies show that the communication channels in most NoCs consume 20–36% of the total power [24]. Since power consumption in digital systems is proportional to the number of signal transitions, reducing transitions in NoC communication channels directly reduces the power consumption of the NoC. Based on Table 3, more power is saved by the FRR method with larger n and m and/or wider communication channels. Table 4 presents the power and area overheads of the FRR encoder for some working conditions. The overheads of the FRR method are extracted using a synthesizable version of the simulator, synthesized in a 65 nm technology. As can be inferred from Tables 3 and 4, the power overhead of the FRR method is negligible compared to its power saving. In the next experiment, the FRR method is compared with the duplicate-add-parity method [24,25], which is designed to prevent the transition patterns ↑↓↑, –↑–, and ↑–↑ (and their complements). To do this, the duplicate-add-parity method is also simulated by a VHDL

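Table 3. Improvements and overheads of the FRR method extracted under the fixed flit width constraint. Reductions are with respect to point (a).

| Flit width | n | m | OD pairs, point a | OD pairs, point b | OD pairs, point c | OD reduction, point b (%) | OD reduction, point c (%) | Transitions, point a | Transitions, point b | Transitions, point c | Transition reduction, point b (%) | Transition reduction, point c (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 bits | 2 | 2 | 8,281,237 | 4,921,862 | 275,786 | 40.6 | 96.6 | 21,113,314 | 18,261,772 | 13,183,650 | 13.5 | 37.5 |
| 16 bits | 2 | 4 | 8,281,237 | 4,921,862 | 88,596 | 40.6 | 98.9 | 21,113,314 | 18,261,772 | 12,250,077 | 13.5 | 41.9 |
| 16 bits | 2 | 8 | 8,281,237 | 4,921,862 | 93,210 | 40.6 | 99.1 | 21,113,314 | 18,261,772 | 1,193,401 | 13.5 | 94.3 |
| 16 bits | 4 | 2 | 8,281,237 | 3,076,158 | 263,683 | 62.9 | 96.8 | 21,113,314 | 16,855,570 | 11,328,168 | 20.1 | 46.3 |
| 16 bits | 4 | 4 | 8,281,237 | 3,076,158 | 67,450 | 62.9 | 99.2 | 21,113,314 | 16,855,570 | 11,621,142 | 20.1 | 44.9 |
| 16 bits | 4 | 8 | 8,281,237 | 3,076,158 | 68,367 | 62.9 | 99.4 | 21,113,314 | 16,855,570 | 10,722,688 | 20.1 | 49.2 |
| 32 bits | 2 | 2 | 9,101,424 | 4,980,364 | 1,305,162 | 45.3 | 85.6 | 20,588,526 | 16,750,706 | 13,823,093 | 18.6 | 32.8 |
| 32 bits | 2 | 4 | 9,101,424 | 4,980,364 | 800,015 | 45.3 | 91.2 | 20,588,526 | 16,750,706 | 12,893,401 | 18.6 | 37.3 |
| 32 bits | 2 | 8 | 9,101,424 | 4,980,364 | 1,004,537 | 45.3 | 88.9 | 20,588,526 | 16,750,706 | 12,617,166 | 18.6 | 38.7 |
| 32 bits | 4 | 2 | 9,101,424 | 2,805,099 | 305,162 | 69.2 | 96.6 | 20,588,526 | 14,848,956 | 8,285,199 | 27.9 | 59.7 |
| 32 bits | 4 | 4 | 9,101,424 | 2,805,099 | 216,011 | 69.2 | 97.6 | 20,588,526 | 14,848,956 | 7,180,195 | 27.9 | 65.1 |
| 32 bits | 4 | 8 | 9,101,424 | 2,805,099 | 277,031 | 69.2 | 96.9 | 20,588,526 | 14,848,956 | 7,365,731 | 27.9 | 64.2 |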
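The "Reduction (%) with respect to point (a)" columns follow from the raw counts by a one-line calculation. The small Python helper below (a hypothetical name, shown only to make the arithmetic explicit) reproduces the two reordering figures quoted in the text for 16-bit flits and n = 2.

```python
def reduction_percent(baseline: int, value: int) -> float:
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    # OD transition pairs, point (a) versus point (b), 16-bit flits, n = 2
    print(round(reduction_percent(8_281_237, 4_921_862), 1))
    # All transitions, point (a) versus point (b), 16-bit flits, n = 2
    print(round(reduction_percent(21_113_314, 18_261_772), 1))
```

Because the channel-related share of the power is taken to scale with the number of transitions, the second figure is the one that relates to the power saving discussed in the text.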

Table 4. Power consumption, area overhead and critical path timing of the FRR encoder in different working conditions.

| Encoder parameters | Power consumption (µW), 16-bit channel | Area occupation (µm²), 16-bit channel | Critical path timing (ns), 16-bit channel | Power consumption (µW), 32-bit channel | Area occupation (µm²), 32-bit channel | Critical path timing (ns), 32-bit channel |
|---|---|---|---|---|---|---|
| m = 2, n = 2 | 102 | 1824 | 15 | 353 | 3137 | 16 |
| m = 2, n = 4 | 241 | 4262.4 | 16 | 623 | 13,124 | 18 |

Table 5. The FRR method in comparison with the duplicate-add-parity (DAP) method. Reductions are with respect to point (a).

Number of OD transition patterns:

| Flit size | Size of reordering, rotation (n), (m) | Point a | Point b | Point c | DAP | Reduction, point b (%) | Reduction, point c (%) | Reduction, DAP (%) |
|---|---|---|---|---|---|---|---|---|
| 16 | n = 2, m = 2 | 8,281,237 | 4,921,862 | 275,786 | 8,281,237 | 40.6 | 96.6 | 0 |
| 32 | n = 2, m = 2 | 9,101,424 | 4,980,364 | 1,305,162 | 9,101,424 | 45.3 | 85.6 | 0 |

Number of transition patterns ↑↓↑, –↑–, ↑–↑:

| Flit size | Size of reordering, rotation (n), (m) | Point a | Point b | Point c | DAP | Reduction, point b (%) | Reduction, point c (%) | Reduction, DAP (%) |
|---|---|---|---|---|---|---|---|---|
| 16 | n = 2, m = 2 | 6,881,512 | 3,221,862 | 2,053,161 | 0 | 53.2 | 70.2 | 100 |
| 32 | n = 2, m = 2 | 8,250,457 | 3,692,553 | 1,807,615 | 0 | 55.2 | 78.1 | 100 |


model. Table 5 compares the reductions of the FRR and duplicateadd-parity methods with respect to the both set of transition patterns {;", ";} and {";", ;";, –"–, "–", –;–, ;–;}. Note that the duplicate-add-parity method can correct single bit errors which happened in the ﬂits, however, in this section we compare the duplicate-add-parity and the FRR methods from the crosstalk prevention and power consumption points of view. As it can be seen in Table 5, the duplicate-add-parity method does not reduce the number of OD transition patterns, i.e., ;", ";. In contrast, the FRR method has a noticeable reduction in the number of transition patterns ";", –"–, "–" (and their complement). 6. Conclusions This paper proposed an efﬁcient ﬂow-control method to simultaneously enhance the reliability of packet transmission and reduce power consumption for packet delivery in NoCs. The method, called FRR, exploits three mechanisms to entirely eliminate opposite direction transitions as the source of crosstalk faults in NoC communication channels. The ﬁrst and second mechanisms, i.e., ﬂit-reordering and ﬂit-rotation reduce the rate of opposite direction transitions on NoC channels whereas the third mechanism, i.e., ﬂit-insertion makes it zero. As simulation results show, the main advantage of the proposed method is that it simultaneously provides crosstalk elimination as well as power reduction. The crosstalk elimination is achieved since the proposed method eliminates OD transitions in NoC channels, and the power reduction is achieved due to the reduction in the number of regular transitions on NoC channel. An analytical model was proposed to calculate and compare the expected number of OD transitions in an unprotected NoC as well as an FRR-enabled NoC. References [1] S. Kumar, A. Jantsch, J.P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, A. Hemani, A network on chip architecture and design methodology, in: Proceedings of ISVLSI, April 2002, pp. 117–122. [2] L. 
Benini, G. De Micheli, Networks on chips: a new SoC paradigm, IEEE Computers 35 (1) (2002) 70–78. [3] S. Murali, T. Theocharides, N. Vijaykrishnan, M.J. Irwin, L. Benini, G. De Micheli, Analysis of error recovery schemes for networks-on-chips, IEEE Design and Test of Computers 22 (5) (2005) 434–442. [4] D. Bertozzi, D.L. Benini, G. De Micheli, Low power error-resilient encoding for on-chip data buses, in: Proceedings of DATE, March 2002, pp. 102–109. [5] A.M. Fazeli, S.G. Miremadi, A low-power and SEU-tolerant switch architecture for network on chips, in: Proceedings of the IEEE/IFIP Paciﬁc Rim International Symposium on Dependable Computing (PRDC 2007), Melbourne, Victoria, Australia, December 2007. [6] D. Park, C. Nicopoulos, J. Kim, N. Vijaykrishnan, C.R. Das, Exploring faulttolerant network-on-chip architectures, in: International Conference on Dependable Systems and Networks (DSN), 2006, p. 93. [7] R. Hegde, N.R. Shanbhag, Towards achieving energy efﬁciency in presence of deep submicron noise, IEEE Transactions on VLSI Systems 8 (4) (2000) 379– 391. [8] S. Murali, D. Atienza, L. Benini, G. De Micheli, A multipath routing strategy with guaranteed in-order packet delivery and fault-tolerance for networks on chip, in: Proceedings of the 43rd ACM/IEEE Design Automation Conference (DAC ’06), San Francisco, Calif, USA, July 2006, pp. 845–848. [9] A.P. Frantz, L. Carro, É.F. Cota, F.L. Kastensmidt, Evaluating SEU and crosstalk effects in network-on-chip routers, IOLTS, 2006, pp. 191–192. [10] M.H. Tehranipour, N. Ahmed, M. Nourani, Testing SoC interconnects for signal integrity using boundary scan, VTS, 2003, pp. 158–172. [11] H. Zimmer, A. Jantsch, A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip, in: Proceedings of ISSS/CODES, September 2003, pp. 188–193. [12] T. Dumitras, S. Kerner, R. 
Marculescu, Towards on-chip fault-tolerant communication, in: Proceedings of the Asia and South Paciﬁc Design Automation Conference (ASP-DAC), 2003, pp. 225–232. [13] M. Pirretti, G.M. Link, R.R. Brooks, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, Fault tolerant algorithms for network-on-chip interconnect, in: Proceedings of the ISVLSI, 2004. [14] M. Kuhlmann, S.S. Sapatnekar, Exact and efﬁcient crosstalk estimation, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 20 (7) (2001) 858–866.

777

[15] K.N. Patel, I.L. Markov, Error-correction and crosstalk avoidance in DSM busses, IEEE Transactions on Very Large Scale Integration (VLSI) 12 (2004) 1076–1080.
[16] T. Gao, C.L. Liu, Minimum crosstalk channel routing, in: Proceedings of International Conference on Computer-Aided Design (ICCAD), November 1999, pp. 692–696.
[17] K. Hirose, H. Yasuura, A bus delay reduction technique considering crosstalk, in: Proceedings of Design, Automation and Test Europe (DATE), 2000, pp. 441–445.
[18] H. Kaul, D. Sylvester, D. Blaauw, Active shields: a new approach to shielding global wires, in: Proceedings of Great Lakes Symposium on Very Large Scale Integration (GLS-VLSI), April 2002, pp. 112–117.
[19] K.M. Lepak, I. Luwandi, L. He, Simultaneous shield insertion and net ordering under explicit RLC noise constraint, in: Proceedings of Design Automation Conference (DAC), June 2001, pp. 199–202.
[20] C. Duan, A. Tirumala, S.P. Khatri, Analysis and avoidance of cross-talk in on-chip buses, Hot Interconnects 9 (2001) 133–138.
[21] B. Victor, K. Keutzer, Bus encoding to prevent crosstalk delay, in: Proceedings of International Conference on Computer-Aided Design (ICCAD), 2001, pp. 57–69.
[22] D. Bertozzi, L. Benini, G.D. Micheli, Low power error resilient encoding for on-chip data buses, in: Proceedings of DATE, 2002, pp. 102–109.
[23] S.R. Sridhara, N.R. Shanbhag, Coding for reliable on-chip buses: a class of fundamental bounds and practical codes, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26 (5) (2007) 977–982.
[24] S.R. Sridhara, N.R. Shanbhag, Coding for system-on-chip networks: a unified framework, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 13 (6) (2005) 655–667.
[25] D. Rossi, A.K. Nieuwland, A. Katoch, C. Metra, New ECC for crosstalk impact minimization, IEEE Design and Test of Computers 22 (4) (2005) 340–348.
[26] P.P. Pande, A. Ganguly, B. Feero, B. Belzer, C. Grecu, Design of low power & reliable networks on chip through joint crosstalk avoidance and forward error correction coding, IEEE International Symposium on Defect and Fault-Tolerance in VLSI Systems (DFT’06), 2006, pp. 466–476.
[31] V. Raghunathan, M.B. Srivastava, R.K. Gupta, Energy-aware system design: a survey of techniques for energy efficient on-chip communication, Design Automation Conference (DAC), 2003, pp. 900–905.
[32] M.R. Stan, W.P. Burleson, Bus-invert coding for low-power I/O, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 3 (1) (1995) 49–58.
[33] A. Ganguly, P.P. Pande, B. Belzer, Crosstalk-aware channel coding schemes for energy efficient and reliable NOC interconnects, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17 (11) (2009) 1626–1639.
[34] A. Patooghy, H. Tabkhi, S.G. Miremadi, An efficient method to reliable data transmission in network-on-chip, in: 13th Euromicro Conference on Digital System Design (DSD 2010), Lille, France, September 2010, accepted for publication.
[36] X. Wu, Z. Yan, Efficient CODEC Designs for Crosstalk Avoidance Codes Based on Numeral Systems, IEEE Transactions on Very Large Scale Integration (TVLSI) Systems PP(99), pp. 1–11.
[37] Chunjie Duan, Victor Cordero, Sunil P. Khatri, Efficient on-chip crosstalk avoidance CODEC design, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17 (4) (2009).
[38] J. Hu, R. Marculescu, DyAD – smart routing for networks-on-chip, in: DAC ’04: Proceedings of the 41st Annual Conference on Design Automation, 2004.
[39] A. Patooghy, S.G. Miremadi, M. Fazeli, A reliable switch architecture for network on chips, Elsevier Journal of Integration: The VLSI.
[40] A. Patooghy, S.G. Miremadi, M. Shafaei, FiRot: an efficient crosstalk mitigation method for network-on-chips, in: Proceedings of 16th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2010).
[41] S. Manolache, P. Eles, Z. Peng, Fault and energy-aware communication mapping with guaranteed latency for applications implemented on NoC, in: Proc. of DAC, 2005.
[42] L. Benini, D. Bertozzi, Xpipes: a network-on-chip architecture for gigascale systems-on-chip, IEEE Circuits and Systems Magazine 4 (2) (2004) 18–31.

Ahmad Patooghy received his B.S. in Computer Engineering from Azad University of Arak, Iran, and his M.Sc. in Computer Engineering from Sharif University of Technology, Tehran, Iran, in 2003 and 2005, respectively. He is currently a Ph.D. student at the Department of Computer Engineering, Sharif University of Technology. His research interests include network-on-chip, dependability evaluation of VLSI circuits, fault injection, and analytical modeling.



Seyed Ghassem Miremadi received his M.Sc. in Applied Physics and Electrical Engineering from Linköping Institute of Technology and his Ph.D. in Computer Engineering from Chalmers University of Technology, Sweden, in 1984 and 1995, respectively. He is an Associate Professor of Computer Engineering at Sharif University of Technology. Fault-tolerant computing is his specialty; he initiated the "Dependable Systems Laboratory" at Sharif University in 1996 and has chaired the laboratory since then. The laboratory has participated in several research projects, which have led to several scientific articles, conference papers, and technical reports. Dr. Miremadi and his group have done research in physical, simulation-based, and software-implemented fault injection, dependability evaluation using HDL models, fault-tolerant embedded systems, and fault tree analysis. Dr. Miremadi was the Education Director (1997–1998) and the Head (1998–2002) of the Computer Engineering Department at Sharif University, and he has been the Research Director of the department since 2002. He is a member of the IEEE Computer Society, the IEEE Reliability Society, and the Computer Society of Iran.

Hamed Tabkhi received his M.Sc. in Computer Engineering from Sharif University of Technology, Tehran, Iran, in 2008. He is currently a Ph.D. candidate at the Department of Electrical and Computer Engineering, Northeastern University, Boston, USA. His research interests include embedded systems design and modeling, dependable systems, and computer architecture.
