Adaptive instruction codec architecture design for network-on-chip

ARTICLE IN PRESS JID: CAEE [m3Gsc;April 22, 2016;16:53] Computers and Electrical Engineering 0 0 0 (2016) 1–18 Contents lists available at Science...


ARTICLE IN PRESS

JID: CAEE

[m3Gsc;April 22, 2016;16:53]

Computers and Electrical Engineering 0 0 0 (2016) 1–18

Contents lists available at ScienceDirect

Computers and Electrical Engineering journal homepage: www.elsevier.com/locate/compeleceng

Adaptive instruction codec architecture design for network-on-chip

Trong-Yen Lee∗, Chi-Han Huang, Min-Jea Liu, Jhen-Syuan Chen

Department of Electronic Engineering, National Taipei University of Technology, 1, Sec. 3, Chung-Hsiao E. Rd., Taipei 10608, Taiwan, ROC

Article info

Article history: Received 16 June 2015; Revised 22 January 2016; Accepted 27 February 2016; Available online xxx

Keywords: Adaptive instruction codec architecture (AICA); Network on chip (NoC); FPGA; Power consumption; Throughput

Abstract

This work proposes a novel Adaptive Instruction Codec Architecture (AICA) for network-on-chip (NoC) that improves channel utilization when transferring packets and flits, in order to address power consumption and throughput. The proposed architecture allows multiple packets to be stuffed into a single packet, and can therefore transfer more packets per time unit than other network interfaces (NIs). Reducing the number of packets transmitted frees the channel to carry additional messages, which improves channel throughput and indirectly alleviates the deadlock problem. Many repeating and similar instructions are frequently transferred in NoC. The proposed AICA reduces this transmission redundancy, and supports process elements (PEs) with a 16-bit or 64-bit core CPU. Experimental results show that the proposed architecture and algorithms deliver improvements of up to 48.1% in power consumption and 46.3% in throughput. © 2016 Published by Elsevier Ltd.

1. Introduction

Multiprocessor architecture is a popular research topic in system-on-chip (SoC). However, the bus architecture of SoC cannot meet the high performance and throughput requirements for packet transfer in multiprocessor systems. Therefore, network-on-chip (NoC) was proposed to solve transmission problems in multiprocessor architectures [1], but it raises new issues of power, throughput and deadlock [2,3].

Considering the issue of power consumption, Rosa et al. [4] proposed distributed dynamic frequency scaling (DFS) to reduce overhead in the NoC execution time and frequency. The DFS controls local first-in-first-out (FIFO) buffers through the globally asynchronous locally synchronous (GALS) scheme, and depends on process element (PE) information to switch the frequency dynamically. Jafarzadeh et al. [5] proposed data encoding schemes in the network interface (NI), based on the end-to-end flit, to minimize the link-to-switch times when accessing flits. Swaminathan et al. [6] proposed a low-power flexible NI, which includes packing, unpacking, a master/slave wrapper, a FIFO and a configuration controller. The configuration controller enables and disables components of the NI at run time. Huaxi et al. [7] proposed a fat-tree-based optical NoC, which includes topology, planning and protocol, and an optical turnaround router based on an optimal algorithm to minimize network control data. Nicopoulos et al. [8] proposed the IntelliBuffer system to save power. IntelliBuffer uses clock gating to disable the clock when the

Reviews processed and recommended for publication to the Editor-in-Chief by Associate Editor Dr. T. H. Meen.

∗ Corresponding author. Tel.: +886-2-2771-2171 Ext. 2251; Fax: +886-2-2731-7120. E-mail addresses: [email protected] (T.-Y. Lee), [email protected] (C.-H. Huang), [email protected] (M.-J. Liu), [email protected] (J.-S. Chen).

http://dx.doi.org/10.1016/j.compeleceng.2016.02.021 0045-7906/© 2016 Published by Elsevier Ltd.

Please cite this article as: T.-Y. Lee et al., Adaptive instruction codec architecture design for network-on-chip, Computers and Electrical Engineering (2016), http://dx.doi.org/10.1016/j.compeleceng.2016.02.021


Fig. 1. Typical mesh topology of the NoC architecture [26].

slot is empty, accesses the slots with the lowest power according to the leakage-classification-register (LCR) table, and modifies the arbitration units to avoid switching overhead. Lee et al. [9] proposed smart power saving using a dynamic control clock for slots. This control clock minimizes the power in each slot according to different network parameters.

Considering the issue of throughput, Choudhary et al. [10] proposed a genetic algorithm to improve the energy distribution and communication load. The genetic algorithm senses congested nodes and avoids deadlocked nodes to improve the throughput. Fattah et al. [11] proposed a one-hot memory asynchronous FIFO for NoC, which reduces the decoding time and thus improves throughput when the address is one-hot. Fu et al. [12] proposed a novel router with neighbor-flow-regulation (NFR) to improve throughput, avoiding starvation and congestion by regulating information flow among neighboring routers. Tassori et al. [13] proposed data compression to improve throughput. Their model determines whether to use Huffman coding to reduce the number of packet bits according to the transmission distance. However, most of the traffic from PEs to the NI consists of instructions.

Some works also combine circuit and packet switching to improve throughput. Wu et al. [14] proposed a dual switching mode NoC router that includes circuit switching and packet switching. The circuit switching confirms the packet transmission path and avoids packet congestion. The packet switching avoids channel occupancy when packets are transmitted. Lusala et al. [15] proposed spatial division multiplexing and time division multiplexing to improve resource usage in circuit switching, and to process the best traffic flow in packet switching. Onizawa et al. [16] proposed three steps to improve throughput. First, an asynchronous router is implemented based on level-encoded dual rail (LEDR). Second, a simple transmission scheme based on an even number of flits is used. Finally, congestion is solved by reducing the number of packets.
Tran et al. [17] designed a router architecture with shared queues (RoShaQ), which maximizes buffer utilization by allowing multiple buffer queues to be shared among input ports.

Considering fault tolerance, Jiang et al. [18] proposed a routing algorithm based on X–Y routing. This method improves performance by setting routes that avoid faulty paths and error nodes. Collet et al. [19] proposed a novel scheme to improve the fault tolerance of NoC by diagnosing, repairing and detecting faults and errors automatically at runtime. Liu et al. [20] proposed built-in-self-repair (BISR) for a NoC buffer, which improves repair efficiency by repairing the faulty buffer to avoid routing problems in communication links. Kariniemi et al. [21] proposed a novel fault-tolerant scheme with a 2D mesh NoC in a multi-core SoC. Their model detects dynamic and static faults using fault-diagnosis-and-repair (FDAR) to repair faulty switches, and uses the fault-tolerant-dimension-order-routing (FTDOR) algorithm to route packets adaptively in faulty networks. Bakhouya et al. [22] proposed a bio-inspired concept that adaptively changes links when an area of the NoC is congested. It has a distributed immune system, which includes self-healing, self-configuration and self-optimization, to improve the fault tolerance in NoC. Oxman et al. [23] proposed a simple congestion reduction method for NoC that measures congestion based on the average link utilization per unit of time to perform routing in a buffer-less NoC.

The rest of this paper is organized as follows. Section 2 analyzes the throughput and power issues in NoC. Section 3 presents the transmission flit formats with the adaptive instruction codec architecture (AICA). Section 4 presents the design of the adaptive instruction codec architecture. The experimental results are shown in Section 5. Section 6 briefly draws conclusions.

2. Analysis of throughput and power issue

Multi-core systems transmit many instruction flits in network-on-chip (NoC).
Therefore, reducing frequently repeated and similar instruction flits can improve router efficiency. For traditional communication architectures, Geetha et al. [24] and Kadayif et al. [25] proposed instruction encoding to improve the throughput by reducing redundancy. Fig. 1 illustrates the typical mesh topology of the NoC architecture [26]. The process elements (PEs) transfer flits through the network interface (NI) to the router. Each router has four direction ports to connect to neighboring routers, and one


[Fig. 2 diagram: network interface between the Local_PE and the Router, containing the PE-side FIFOs (FIFO_PE_1 to FIFO_PE_N), the Packing and Unpacking modules and a router-side FIFO, connected by signals including Data_in, Data_out, Wr, Rd, Full, Empty, Start_pack, Val_pack, Ack_pack, Val_unpacking, Ack_unpacking, Val_noc, Ack_noc, New_flit, Data_ok, Rd_block, Ready, Start and Idle.]

Fig. 2. Network interface architecture [28].

direction port to the connecting PE. The router components are routing computation (RC), switch arbiter (SA), virtual channel arbiter (VA), transmission channel and crossbar (XBAR). The router uses the routing algorithm [27] to transfer flits to the destination PE. A packet is divided into a header flit, body flits and a tail flit. Transmission is performed by flit switching. In the worst case, nine PEs transmit flits at the same time, which increases the probability of channel blocking and deadlock and reduces transmission performance. The adaptive instruction codec architecture (AICA) reduces the transmission time between router and channel, as well as the latency caused by blocking and deadlock, by more than the coding time it adds in the NI.

Fig. 2 illustrates the NI used during NoC transmission, which provides the flit communication interface between the process elements (PEs) and the router [28]. The NI provides two directions of transmission between the PE and the router. The packing module adds routing information to the flit when the local PE transmits the flit to the router. Whenever the NI completes packing, it checks the channel buffer to avoid overflow at the router. The unpacking module restructures each flit sent from the router to the PE when the module buffer has remaining space.

Eq. (1) shows the power consumption for one port, where TranUntreated represents the transmission time of each untreated flit, and TranTreated represents the transmission time of each flit treated by AICA. If all flits in the architecture are untreated, then TranTreated = TranUntreated. Thus, early completion of flit transfer means that the channel can go to sleep early. In the un-coded router, if the port sends n flits to the next router in one second, then the throughput for one direction port is as shown in Eq. (2). The maximum throughput is n flits/second. The throughput of AICA with the 64-bit router is given by Eq. (3).
The port sends n flits to the next router in one second; when the router uses AICA, each coded flit carries k flits, and the maximum throughput is kn flits/second. Here k represents the effective number of segments analyzed in the next section. Low power and high throughput are important issues in NoC. This work proposes AICA for NoC to address the throughput and power issues between the packing/unpacking modules and the router. The difference value (D) between the present flit and the previous flits stored in the uncoded flit database is calculated. To save coding time in the NoC, AICA differs from progressive coding, which includes sampling, quantization and entropy coding.

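\[ \mathrm{Sleep}_{\mathrm{oneport}} = (\mathrm{Tran}_{\mathrm{Untreated}} - \mathrm{Tran}_{\mathrm{Treated}}) \times \mathrm{clock} \tag{1} \]

\[ \mathrm{Throughput}_{\mathrm{oneport}} = \sum_{n=1}^{\infty} n \times \mathrm{flit/sec} \tag{2} \]

\[ \mathrm{Throughput}_{\mathrm{oneport}} = \sum_{n=1}^{\infty} k \times n \times \mathrm{flit/sec}, \quad 1 \le k \le 4 \tag{3} \]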
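As a rough illustration of Eqs. (1) to (3), the following Python sketch evaluates the per-port quantities for given inputs. The function names and the sample numbers are hypothetical and chosen only for illustration. The sketch follows the prose reading of the equations, namely that an un-coded port delivers n flits per second and an AICA port delivers up to k × n.

```python
def sleep_term(tran_untreated, tran_treated, clock):
    """Eq. (1): transmission time saved by AICA, scaled by the clock.

    With times in seconds and clock in Hz, the result is a cycle count.
    """
    return (tran_untreated - tran_treated) * clock


def port_throughput(n_flits_per_second, k=1):
    """Eqs. (2) and (3): maximum flits per second for one direction port.

    k = 1 models the un-coded router; 1 <= k <= 4 models AICA.
    """
    if not 1 <= k <= 4:
        raise ValueError("Eq. (3) restricts k to the range 1..4")
    return k * n_flits_per_second


# Hypothetical numbers: a port that moves 1000 flits per second.
print(port_throughput(1000))       # un-coded router
print(port_throughput(1000, k=4))  # AICA, four flits per coded flit
print(sleep_term(2.0, 1.5, 100))   # 0.5 s saved at a 100 Hz clock
```

When every flit is untreated, the two transmission times are equal and the sleep term is zero, which matches the remark that TranTreated = TranUntreated in that case.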

3. The number of segments and flit format analysis

3.1. Effective number of segments analysis

Fig. 3 illustrates the adaptive instruction codec architecture (AICA) coded flit format, which includes control and data regions. The control region includes the flit type (2 bits), which identifies the type of flit, the source address, which identifies the source process element (PE), and the destination address, which identifies the destination PE. The bit number of the source


Fig. 3. Coded flit format.

Table 1. Flit type format.

Flit type     Bits   Functional description
Single flit   00     One flit
Header flit   01     Routing information
Body flit     10     Payload
Tail flit     11     Last data

and destination address is determined from the network topology. The header flit (01) contains routing information, which is used to route the flit to the destination address until the channel registration is complete. The body flit (10) and the tail flit (11) are transferred through the channel registered by the header flit (01), rather than using the destination address bits. Thus, the AICA architecture reuses the destination address bits as encoding control bits (E). The data region is 2^d bits wide, where d represents the address line size of the PE. In this work, the PE is a 64-bit core, thus d = 6 and the data region is 64 bits. The number of data frames depends on the value of 2^d and the topology. Each data frame includes index (I) bits, which point to the stored previous flit, and difference (D) bits, which are used to recover the current flit, with I + D < 2^d; otherwise the current flit is regarded as uncoded. The encoding equation is written as follows:

\[ 2^{d} > (I + D). \tag{4} \]

To avoid redundant bits in the data frames, the numbers of bits of I and D are both 2^n; therefore, (4) can be rewritten as

\[ d > n + 1. \tag{5} \]

Since the 2^d-bit data region is divided into 2^k data frames, the equation is

\[ 2^{d} = (I + D) \times 2^{k}. \tag{6} \]

By Eqs. (5) and (6), k is given by

\[ k = d - (n + 1). \tag{7} \]

However, if the threshold n is equal to 0, 1 or 2, then I and D are 1, 2 or 4 bits wide, respectively; the encoding range decreases, which increases the uncoding rate and degrades performance. Thus, the n range is

\[ 2 < n \le d - 1. \tag{8} \]

Excluding the un-coding case of n = d − 1, the k range is given by

\[ 0 < k \le d - 4. \tag{9} \]

By Eq. (9), the effective number of segments is

\[ 2^{0} < 2^{k} \le 2^{d-4}. \tag{10} \]
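The relations in Eqs. (5) to (7) can be checked numerically. The Python sketch below is only an illustration; the function name `frame_layout` is hypothetical. It takes d and n, applies Eq. (7), and confirms that the resulting frames exactly fill the 2^d-bit data region as Eq. (6) requires. The two calls use the parameter pairs the paper itself works with: d = 6 with n = 3 for the 64-bit core, and d = 4 with n = 2 for the 16-bit core.

```python
def frame_layout(d, n):
    """Return (k, frame_bits, frame_count) for a 2**d-bit data region.

    I and D are each 2**n bits wide, so one data frame is 2**(n + 1) bits.
    """
    if not d > n + 1:
        raise ValueError("Eq. (5) requires d > n + 1")
    k = d - (n + 1)                # Eq. (7)
    frame_bits = 2 ** (n + 1)      # I + D
    frame_count = 2 ** k           # number of data frames
    # Eq. (6): the frames exactly fill the data region.
    assert frame_bits * frame_count == 2 ** d
    return k, frame_bits, frame_count


print(frame_layout(6, 3))  # 64-bit core
print(frame_layout(4, 2))  # 16-bit core
```

For d = 6 and n = 3 this gives k = 2, that is, four 16-bit data frames in a 64-bit region. For d = 4 and n = 2 it gives k = 1, that is, two 8-bit frames in a 16-bit region.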

If the PE is a 16-bit core, then d = 4. In this case, let k = 1 to meet the base coding. This work sets d = 4 and d = 6 for the AICA in NoC.

3.2. Flit formats of protocol in AICA

Table 1 shows the flit type format, which includes single, header, body and tail flits. Fig. 4 illustrates the un-coded flit formats of the 2D mesh. If the message can be transmitted to the destination in one flit, then the type is single flit (00). The header flit (01) provides routing information to register the channel, and lets flits with the same source address use this channel. The body flit (10) uses the registered channel to convey data to the next router or PE. Finally, the tail flit (11) unregisters the channel. This work designs 16-bit and 64-bit AICA codec architectures for 16-bit and 64-bit core PEs, respectively, to transmit information.


Fig. 4. 2-D mesh un-coding ﬂits format.

Fig. 5. AICA 16-bit 2-D mesh flits format.

Table 2. AICA 16-bit encoded control bit function description.

Encoding control bits   Functional description
00                      Not encoding
01                      Un-defined
10                      Encoding
11                      Un-defined
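Tables 1 and 2 together determine whether a 16-bit flit needs the decoder. The following Python sketch is a hedged illustration of that decision only. The names `FlitType` and `is_encoded_16bit` are hypothetical, and the sketch assumes the two flit type bits and the two control bits have already been extracted from the flit, because the exact bit positions are given in the figures rather than in the text.

```python
from enum import IntEnum


class FlitType(IntEnum):
    """Flit type codes from Table 1."""
    SINGLE = 0b00
    HEADER = 0b01
    BODY = 0b10
    TAIL = 0b11


def is_encoded_16bit(flit_type, control_bits):
    """Return True if a 16-bit flit carries AICA-encoded data (Table 2).

    For single and header flits the two bits are an ordinary destination
    address, so the flit is never treated as encoded.
    """
    if flit_type in (FlitType.SINGLE, FlitType.HEADER):
        return False
    if control_bits not in (0b00, 0b10):
        raise ValueError("encoding control bits 01 and 11 are undefined")
    # The MSB of the encoding control bits selects the decoder.
    return bool(control_bits >> 1)


print(is_encoded_16bit(FlitType.BODY, 0b10))    # True
print(is_encoded_16bit(FlitType.TAIL, 0b00))    # False
print(is_encoded_16bit(FlitType.HEADER, 0b10))  # False
```

Raising an error on the undefined codes is a design choice of this sketch; the source only marks them as undefined.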

3.3. Flit formats of protocol in AICA with 16-bit

This section provides one solution for the case d ≤ 4, in which n = 2 in Fig. 3. In order to maximize the codec efficiency, I and D cannot both equal 2^n bits. The sign bit of D is recorded in the LSB of I to increase the coding range; thus I is 3 bits and D is 5 bits. Fig. 5 illustrates the AICA 16-bit 2D mesh flit formats, in which the destination address bits are replaced by coding bits in the control region to represent the coding status when the flit type is body (coding) or tail (non-coding). When the flit type is single or header, the transmission principle is unaffected. Table 2 shows the function description for body or tail flits. The first 2-bit case (00) represents a non-coding case and the second 2-bit case (10) represents an


Fig. 6. AICA 64-bit 2-D mesh ﬂits formats.

Fig. 7. 64-bit AICA coded registers.

Fig. 8. 64-bit AICA ﬂit format in Case 1.

encoding case. To increase the coding range (k = 1) to D = 5 and I = 3, the sign bit of D is stored in the LSB of I. The 16-bit AICA reduces the maximum number of transmitted flits from two to one, and improves channel utilization by 50%. The destination router transmits the current flit to the PE directly when the MSB of the encoding control bits is set to 0. Otherwise, the router uses the AICA decoder described in the next section.

3.4. Flit formats of protocol in AICA with 64-bit

This section presents the flit formats for Eq. (10) when d > 4. The maximum codec efficiency can be ensured when n = 3 and the number of data frames is four. The transmission principle of single or header flits does not change. A body or tail flit that cannot be encoded is conveyed to the PE, or used to restructure other flits, in the format of Fig. 6(a). Fig. 6(b) illustrates the flit encoding format, which comprises two to four flits. The protocol has two encoding bits. If the encoding bits are “01”, then two flits can be encoded, and these can be sorted with the next flits (which are non-encoded flits). If the encoding bits are “10”, then three flits can be encoded, and these can be sorted with the next flit (a non-encoded flit). Finally, if the encoding bits are “11”, then four flits can be encoded. The current flit may contain one, two, three or four flits to ensure maximum efficiency of the codec in AICA. Fig. 7 illustrates how to calculate the possibility of encoding flits to improve transmission performance in one flit. NO_reg records non-encoded flits, and YES_reg records the encoded flits. The division of the current flit into different flits has four possible cases. Fig. 8 illustrates Case 1, in which the encoding control bits are set to “00”. In Case 1, only NO_reg holds one flit, which means that an un-coded flit is sent to the local router. Fig. 9 and Fig. 10 illustrate Cases 2, 3 and 4. In Case 2, the encoding control bits are set to “01”, in which YES_reg contains one or two flits, and NO_reg is full.
Case 2 comprises four subcases in the leaf node. In subcases 2.1 to 2.4, the control bits take the values 0000–0011. The channel utilization for these subcases is calculated here. Subcase 2.2 cannot reduce the number of transmissions, because it only encodes one flit. The other subcases reduce the number of transmission flits from five to four, and thus improve channel utilization by 20%. In Case 3, the encoding control bits are set to “10”, which means that YES_reg contains three flits. This case has six subcases in the leaf node. In subcases 3.1 to 3.6, the control bits take the values 0000–0101. The channel utilization for these subcases is calculated in the same way. Subcases 3.1 and 3.4 reduce the number of transmission flits from four to two, and thus improve channel utilization by 50%. The other subcases reduce the number of transmission flits from five to three, and thus improve channel utilization by 40%.
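The utilization figures quoted for these cases all follow from the same ratio: the number of flits saved divided by the number of flits that would have been sent without coding. The short Python sketch below reproduces that arithmetic; the function name `utilization_gain` is hypothetical.

```python
def utilization_gain(original_flits, transmitted_flits):
    """Percentage of flit transmissions saved by encoding."""
    if not 0 < transmitted_flits <= original_flits:
        raise ValueError("expected 0 < transmitted_flits <= original_flits")
    return 100 * (original_flits - transmitted_flits) / original_flits


print(utilization_gain(2, 1))  # 16-bit AICA: two flits sent as one -> 50.0
print(utilization_gain(5, 4))  # Case 2 subcases -> 20.0
print(utilization_gain(4, 2))  # Subcases 3.1 and 3.4 -> 50.0
print(utilization_gain(5, 3))  # other Case 3 subcases -> 40.0
```

The same formula gives 75% when four flits are sent as one, which is the best case reported for the 64-bit format.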


Fig. 9. 64-bit AICA ﬂit format in Cases 2–4 (second ﬂit is un-coded).

Fig. 10. 64-bit AICA ﬂit format in Cases 2–4 (second ﬂit is coded).

In Case 4, the encoding control bits are set to “11”, which means that all flits can be coded in YES_reg. This case reduces the number of transmission flits from four to one, and improves the channel utilization by 75%. In a nonterminal-node case, a subcase that has not reached a leaf node is sent to the transmission buffer when the tail flit arrives and terminates the encoding stage. Finally, the decoder reconstructs the sequence and restores the data from each flit according to the encoding control bits and the control bits.

The 64-bit AICA sends the encoded flit to the local router when it completes encoding. The encoder packs YES_reg and NO_reg according to the case, and sends the data to the local router. The decoder accesses the flits from the router. Table 3 shows the different recombining and sorting of flits according to the case number. Case 1 is a non-encoded flit, which is transmitted directly to the unpacking module. Subcase 2.1 has two encoded flits and three non-encoded flits, with the first non-encoded flit recombined between the first and second data frames, and the other un-coded flits placed after the second data frame. Subcase 2.2 has one encoded flit and three non-encoded flits, with the non-coded flits following the first data frame. Subcase 2.3 has two encoded flits and three non-encoded flits, with the first and second non-encoded flits recombined between the first and second data frames, and the other un-coded flit recombined after the second data frame. Subcase 2.4 includes two encoded


Table 3. AICA 64-bit control table.

Encoding control bits   Number of flits             Control bits   Case number
00                      Un-coding one flit          Null           1
01                      Encoding one or two flits   16’h0000       2.1
                                                    16’h0001       2.2
                                                    16’h0010       2.3
                                                    16’h0011       2.4
                                                    16’h0100       2.5
                                                    16’h0101       2.6
                                                    16’h0110       2.7
                                                    16’h0111       2.8
                                                    16’h1000       2.9
                                                    16’h1001       2.10
                                                    16’h1010       2.11
                                                    16’h1011       2.12
                                                    16’h1100       2.13
10                      Encoding three flits        16’h0000       3.1
                                                    16’h0001       3.2
                                                    16’h0010       3.3
                                                    16’h0011       3.4
                                                    16’h0100       3.5
                                                    16’h0101       3.6
                                                    16’h0110       3.7
11                      Encoding four flits         Null           4

flits and three non-encoded flits, with the un-coded flits recombined after the second data frame. Subcases 2.5, 2.11 and 3.7 include many encoded flits, with each flit decoded according to address I from the non-encoded flit database. In subcases 2.6 and 2.8, one flit is encoded, while all others are non-coded and are recombined after the first data frame. Subcase 2.7 has two encoded flits and one non-coded flit, which is recombined between the first and second data frames. Subcase 2.9 has two coded flits and two uncoded flits, with the first uncoded flit recombined between the first and second data frames, and the other uncoded flit recombined after the second data frame. Subcase 2.10 includes two coded flits and two uncoded flits, with the uncoded flits recombined between the first and second data frames. Subcases 2.12 and 2.13 include two coded flits and many uncoded flits, with the uncoded flits recombined after the second data frame. Subcase 3.1 includes three coded flits and one uncoded flit, with the uncoded flit recombined after the first data frame. Subcase 3.2 includes three coded flits and two uncoded flits, with the first uncoded flit recombined after the first data frame, and the other uncoded flit recombined between the second and third data frames. Subcase 3.3 has three coded flits and two uncoded flits, with the uncoded flits recombined between the first and second data frames. Subcase 3.4 includes three coded flits and one uncoded flit, with the uncoded flit recombined between the second and third data frames. Subcase 3.5 includes three coded flits and two un-coded flits, with the un-coded flits recombined between the second and third data frames. Subcase 3.6 includes three coded flits and one uncoded flit, with the uncoded flit recombined after the third data frame. In Case 4, every flit is encoded in YES_reg (all four flits). The data are restructured by decoding the coded flits according to address I from the uncoded flit database.
Finally, the 64-bit AICA sends YES_reg and NO_reg to the transmission buffer of each node when the tail flit arrives or the buffer is full.
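To make the 64-bit format of Section 3.4 more concrete, the following Python sketch packs up to four (I, D) data frames into a 64-bit data region and maps the control fields of Table 3 to a case number. It is a simplified sketch under stated assumptions, not the authors' implementation. The function names are hypothetical; the 8-bit widths of I and D follow from n = 3; placing the first frame in the most significant bits and treating D as an unsigned 8-bit field are assumptions, since the exact layout is defined in the figures.

```python
FRAME_BITS = 16  # I (8 bits) + D (8 bits) when n = 3
MAX_FRAMES = 4   # 2**k data frames with k = 2 in a 64-bit region


def pack_data_frames(frames):
    """Pack 1..4 (index, diff) pairs into a 64-bit integer."""
    if not 1 <= len(frames) <= MAX_FRAMES:
        raise ValueError("a data region holds one to four data frames")
    region = 0
    for slot, (index, diff) in enumerate(frames):
        if not (0 <= index < 256 and 0 <= diff < 256):
            raise ValueError("I and D must each fit in 8 bits")
        frame = (index << 8) | diff
        # Assumed layout: slot 0 occupies the most significant 16 bits.
        region |= frame << (FRAME_BITS * (MAX_FRAMES - 1 - slot))
    return region


def unpack_data_frames(region, count):
    """Recover the first `count` (index, diff) pairs from a data region."""
    frames = []
    for slot in range(count):
        frame = (region >> (FRAME_BITS * (MAX_FRAMES - 1 - slot))) & 0xFFFF
        frames.append((frame >> 8, frame & 0xFF))
    return frames


def case_number(encoding_control_bits, control_bits=None):
    """Map the fields of Table 3 to a case label such as '2.4'."""
    if encoding_control_bits == 0b00:
        return "1"
    if encoding_control_bits == 0b11:
        return "4"
    subcase_counts = {0b01: 13, 0b10: 7}  # rows listed in Table 3
    if encoding_control_bits not in subcase_counts:
        raise ValueError("encoding control bits must be a 2-bit value")
    limit = subcase_counts[encoding_control_bits]
    if control_bits is None or not 0 <= control_bits < limit:
        raise ValueError("control bits outside the range listed in Table 3")
    major = 2 if encoding_control_bits == 0b01 else 3
    return f"{major}.{control_bits + 1}"


frames = [(1, 5), (0, 200), (3, 0), (2, 17)]
region = pack_data_frames(frames)
print(hex(region))                    # 0x10500c803000211
print(unpack_data_frames(region, 4))  # the original four pairs
print(case_number(0b01, 0b0011))      # 2.4
```

The round trip through `pack_data_frames` and `unpack_data_frames` corresponds to Case 4, where all four frames are encoded.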

4. Design of adaptive instruction codec architecture

4.1. Adaptive instruction codec architecture

Many repeating and similar instructions are transferred frequently in network-on-chip (NoC). Therefore, this work proposes an adaptive instruction codec architecture (AICA) to reduce transmission redundancy. Fig. 11 illustrates the AICA with NoC architecture, which can use a 16-bit or 64-bit core CPU. The AICA includes a 16-bit or 64-bit encoder and decoder between the router and the packing/unpacking modules in the network interface (NI). The router connects to other routers for data transmission. The 16-bit and 64-bit AICA are proposed to verify the efficacy of transmission for the worst and ideal cases. Fig. 12 illustrates the AICA encoder and decoder architecture. Encoding starts when the process element (PE) sends a flit and the NI completes packing. The encoder comprises a compare register, an encoded register and an encoding control bit. The compare register provides the comparison values, then the encoder calculates the difference between the comparison value and the current flit. If the difference value (D) is greater than a threshold d, then the current flit is recorded in the compare register and sent to the router. Otherwise, the index (I) and difference value (D) are recorded in the encoded register.


Fig. 11. AICA with NoC architecture.

Fig. 12. AICA encoding and decoding architecture.

The encoded register saves the encoded flits and groups them together. In the 16-bit AICA, the encoded register only records encoded flits. However, the encoded register in the 64-bit AICA records not only encoded flits, but also uncoded flits, following the encoding cases described in the previous section. The encoding control bit is set according to the case type. A flit is encapsulated and sent to the local router once its encoding is complete. The AICA decoder begins decoding when the router sends a flit to the NI. The decoder consists of a compare register, a recorder register, a decode register and an encoding control bit. An uncoded flit is recorded in the compare register and sent to the unpacking module. The decoder sorts the flits according to the AICA case indicated by the encoding control bit, and saves them in the decode register. When the sorting task is complete, the decoder uses the index to read the comparison value from the compare register, then restores the flits from the comparison and difference values. When the decoding task is complete, the flits are transmitted from the recorder register to the unpacking module.


Table 4. AICA encoding algorithm with 16-bit.

AICA Encoding Algorithm with 16-bit
Input: Flits Fs from the packing module to AICA
Output: Transmission buffers Tb ⊃ {UF, EF}, where UF are un-coding flits, EF are encoding flits
1.  Initialize encoded and un-coded registers Er = 0 and Ur = 0, where Er ⊃ {Er1, Er2} and Ur ⊃ {Ur1, Ur2}
2.  Initialize index I = 0, where I bits = (3 bits − 1 sign bit)
3.  Initialize compare registers comparereg = 0, where comparereg capacity = max(size of 2^(I bits))
4.  Initialize encoding control bits Ecb ⊃ {Ub, Eb}, where Ub are un-coding bits, Eb are encoding bits
5.  While (Fs) do
6.    If (single flit or header flit arrival)
7.      {Tb = Fs} // Fs bypass to Tb
      /* Starts encoding */
8.    Else if (body flit or tail flit arrival)
        /* Code flit2 and store Tb when flit1 can be encoded */
9.      If (Er1 full)
10.       {Ur2 = Fs payload} // Stage1: flit reading
          /* Stage2 and Stage3: flit coding and flit store */
11.       If (Ur2 cannot be encoded)
12.         If (comparereg are full)
13.           {Tb = Eb+Er1, next Tb = Ub+Ur2}
14.         Else
15.           {comparereg[I] = Ur2, I++, Tb = Eb+Er1, next Tb = Ub+Ur2}
16.       Else
17.         {Er2 = I+D, Tb = Eb+Er1+Er2}, where D = Ur2 − comparereg[I], D bits = 5
        /* Code flit1 or store Tb */
18.     Else
19.       {Ur1 = Fs} // Stage1: flit reading
          /* Stage2 and Stage3: flit coding and flit store */
20.       If (Ur1 cannot be encoded)
21.         If (comparereg are full)
22.           {Tb = Ub+Ur1}
23.         Else
24.           {comparereg[I] = Ur1, I++, Tb = Ub+Ur1}
25.       Else
26.         {Er1 = I+D}, where D = Ur1 − comparereg[I], D bits = 5
27.     Repeat step 8 till the Tb are full or tail flit arrive AICA
      /* Send to router */
28.   If (tail flit arrival)
29.     {Send flits from Tb to router, clear comparereg, return step 1}
30.   Else if (Tb full)
31.     {Send flits from Tb to router, return step 8}
32. End while
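The control flow of Table 4 can be approximated in software. The Python sketch below is a simplified model written for illustration, not the hardware design. All names are hypothetical. It assumes a compare register with four entries (two index bits), an 8-bit frame laid out as index (2 bits), sign (1 bit) and magnitude (5 bits), and it searches every compare register entry for a close match, which is one possible reading of "can be encoded". The transmission buffer and the single/header bypass are omitted.

```python
MAX_DIFF = 31      # largest magnitude that fits in the 5-bit D field
REG_CAPACITY = 4   # 2 index bits (3-bit I minus the sign bit)


def encode_frame(payload, compare_reg):
    """Return an 8-bit frame (index | sign | magnitude), or None."""
    for index, reference in enumerate(compare_reg):
        diff = payload - reference
        if abs(diff) <= MAX_DIFF:
            sign = 1 if diff < 0 else 0
            return (index << 6) | (sign << 5) | abs(diff)
    return None


def encode_stream(payloads):
    """Encode body/tail payloads into ('U', payload) or ('E', frames) items."""
    compare_reg = []   # uncoded payloads used as references
    pending = []       # encoded frames waiting to be paired (Er1, Er2)
    out = []
    for payload in payloads:
        frame = encode_frame(payload, compare_reg)
        if frame is None:
            if pending:                      # flush Er1 before the uncoded flit
                out.append(("E", tuple(pending)))
                pending = []
            if len(compare_reg) < REG_CAPACITY:
                compare_reg.append(payload)  # record as a future reference
            out.append(("U", payload))
        else:
            pending.append(frame)
            if len(pending) == 2:            # Er1 and Er2 share one flit
                out.append(("E", tuple(pending)))
                pending = []
    if pending:                              # tail reached with Er1 only
        out.append(("E", tuple(pending)))
    return out


stream = [0x1200, 0x1203, 0x11FE, 0x4000, 0x4001]
print(encode_stream(stream))
# [('U', 4608), ('E', (3, 34)), ('U', 16384), ('E', (65,))]
```

In this example five payloads leave the encoder as four flits, because the second and third payloads differ from the first by less than 32 and share one encoded flit.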

4.2. Design of AICA encoder with 16-bit Table 4 shoes the design for optimizing the transmission performance of the proposed AICA encoder algorithm in 16-bit in the worst case where threshold d < 4 is shown in. This algorithm has three stages: ﬂit reading, coding and storage. The encoder determines the ﬂit type, and sends a single or header ﬂit to the transmission buffer at lines 6 and 7. For other ﬂit types, the algorithm starts encoding at lines 8 to 27 (starts coding). The encoding stage reads ﬂit from packing module (stage1) and compares the D with the current ﬂit (ﬂit1 or ﬂit2) from comparereg at lines 10, 11, 16, 19, 20 and 25. If the current ﬂit (ﬂit1) is uncoded, then the encoder records in comparereg until the buffer is full; adds the encoding sign bit with ﬂit1, and sends the ﬂit1 to the transmission buffer at lines 20 to 24. If the ﬂit1 is coded, then the encoder records the ﬂit1 I from comparereg address and the D from the current ﬂit in Er1 , and waits for the next ﬂit at lines 25 and 26 (stag2 and stage3). When the ﬂit1 encoding is complete at line 9, if the current ﬂit (ﬂit2) is uncoded, then the encoder records in comparereg until the buffer is full; adds the encoding sign bit with ﬂit2, and sends the Er1 ﬂit and ﬂit2 to the transmission buffer at lines 11 to 15. Otherwise, it records the ﬂit2 I from comparereg address and the D from the current ﬂit in Er2 , then encapsulates the Er1 and Er2 to one ﬂit, and sends this to the transmission buffer at lines 16 and 17 (stag2 and stage3). The algorithm encodes ﬂits until the transmission buffer is full or the last ﬂits are reached. The encoder transmits the buffer data to the router, and continues encoding when the buffer is full. Otherwise, it transmits the buffer data to the router, and clears comparereg at lines 28 to 31. Fig. 13 illustrates a simpliﬁed instructional computer (SIC) encoded in 16-bit AICA. This work encodes paragraph Opcodes from PE in three stages. 
In the first stage, the encoder records flits in the un-coded registers (Ur1 or Ur2). In the second stage, the encoder checks the flit against the coding range of comparereg, and stores this flit in comparereg (if this is not full) or encoded


Fig. 13. SIC encode with AICA 16-bit.

Table 5
AICA decoding algorithm with 16-bit.

AICA Decoding Algorithm with 16-bit
Input: Flits Fs from the destination router to AICA
Output: Unpacking buffers Ub⊃{UF, DF}, where UF are un-coding flits, DF are decoding flits
1. Initialize decoded registers Dr=0, where Dr⊃{Dr1, Dr2}, Dr1⊃{I1, D1}, Dr2⊃{I2, D2}, I1=Fs[15:14], D1=Fs[12:8], I2=Fs[7:6], D2=Fs[4:0]
2. Initialize index I=0, where I bits=(3 bits−1 sign bit)
3. Initialize decoder compare registers de_comparereg=0, where de_comparereg=max(size of 2^I)
4. Initialize temporarily register Tr=0
5. While (Fs) do
6.   If (single flit or header flit arrival)
7.     {Ub=Fs} // Fs bypass to Ub
     /* Starts decoding */
8.   Else if (body flit or tail flit arrival)
9.     If (Fs is encoding flit)
         /* Stage1 and Stage2: recovery and store */
10.      For (j=1 to 2)
11.        If (Drj is positive)
12.          {Tr=de_comparereg[Ij]+Dj, Ub={Fs[MSB to MSB-5], Tr}}
13.        Else
14.          {Tr=de_comparereg[Ij][4:0]+Dj, Ub={Fs[MSB to MSB-5], de_comparereg[Ij][15:5], Tr[4:0]}}
       /* Stage2: store */
15.    Else
16.      If (de_comparereg are full)
17.        {Ub=Fs}
18.      Else
19.        {de_comparereg[I]=Dr, I++, Ub=Fs}
20.  Repeat step 8 till Ub are full or tail flit arrive AICA
     /* Send to unpacking module */
21.  If (tail flit arrival)
22.    {Send flits from Ub to unpacking module, clear de_comparereg, return step 1}
23.  Else if (Ub full)
24.    {Send flits from Ub to unpacking module, return step 8}
25. End while

registers (Er1 or Er2). In the last stage, the encoder sends the flits, which contain the encoding control bits, from the un-coded or encoded registers to the transmission buffers (Tb).

4.3. Design of AICA decoder with 16-bit

Table 5 shows the 16-bit decoding algorithm. This algorithm has two stages: recovery and storage. If the flit type is single or header, the flit is sent directly to the unpacking buffer at lines 6 and 7. Otherwise, the decoding stage starts at lines 8 to 20. The decoder reads the encoding control bits of the flit. If the flit is un-coded, it is recorded in de_comparereg unless de_comparereg is full, and is sent to the unpacking buffer at lines 15 to 19. Otherwise, the decoder determines the next step according to the sign bit of the first and second data fields. If the sign bit is positive, the decoder adds the D of the current body flit to the de_comparereg entry selected by the index I and stores the sum in the temporary register (Tr); it then combines the header bits of Fs with Tr into one flit, and sends this flit to the unpacking buffer at lines 11 and 12. Otherwise, the decoder adds the D of the current body flit to the lowest five bits ([4:0]) of the de_comparereg entry selected by I and stores the result in Tr; it then combines the header bits of Fs, the upper bits of the de_comparereg entry and Tr[4:0] into one flit, and sends this flit to the unpacking buffer at lines 13 and 14.
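The field layout given in Table 5 (I1=Fs[15:14], D1=Fs[12:8], I2=Fs[7:6], D2=Fs[4:0]) can be exercised with a short Python sketch of the recovery stage. Several points are assumptions rather than statements from the paper: the sign bits are taken to sit in the unassigned positions Fs[13] and Fs[5], the encoding control bits are modelled as the hypothetical tags 'U', 'E1' and 'E2', and recovery is simplified to a signed addition on the whole reference value instead of the bit-slice concatenation of lines 12 and 14.

```python
from typing import List, Tuple

TABLE_SIZE = 4  # assumed capacity of de_comparereg (2-bit index)

Field = Tuple[int, int, int]  # (index, sign, magnitude)


def split_fields(word: int) -> Tuple[Field, Field]:
    """Split a 16-bit encoded flit into its two (index, sign, magnitude) fields."""
    high = ((word >> 14) & 0x3, (word >> 13) & 0x1, (word >> 8) & 0x1F)
    low = ((word >> 6) & 0x3, (word >> 5) & 0x1, word & 0x1F)
    return high, low


def recover(field: Field, table: List[int]) -> int:
    """Rebuild the original payload from a reference entry and a signed difference."""
    index, sign, magnitude = field
    return table[index] - magnitude if sign else table[index] + magnitude


def decode_body(flits: List[Tuple[str, int]]) -> List[int]:
    """Decode tagged flits back into the original sequence of 16-bit payloads."""
    table: List[int] = []   # stands in for de_comparereg
    out: List[int] = []     # stands in for Ub
    for tag, word in flits:
        if tag == "U":
            if len(table) < TABLE_SIZE:
                table.append(word)   # mirror the encoder's reference table
            out.append(word)
        else:
            high, low = split_fields(word)
            out.append(recover(high, table))
            if tag == "E2":          # second field is present only in packed flits
                out.append(recover(low, table))
    return out


received = [("U", 0x1200), ("E2", 0x0322), ("U", 0x4000), ("E1", 0x1000)]
payloads = decode_body(received)
print([hex(p) for p in payloads])
```

Because the decoder fills its table with the same un-coded flits, in the same order, as the encoder, no reference values need to be transmitted separately.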


Fig. 14. SIC encode with AICA 64-bit.

The decoding continues until the unpacking buffer is full or the tail flit is reached. When the buffer is full, the decoder transmits the buffer data to the unpacking module and continues decoding. When the tail flit arrives, it transmits the buffer data to the unpacking module and clears de_comparereg (lines 21 to 24, transmission stage).

4.4. Design of AICA encoder with 64-bit

This subsection describes the 64-bit AICA encoding method; the decoding method follows in the next subsection. The encoding task is active when the threshold d > 4. This algorithm has three stages: coding and classification; case status detection; and restructuring and storage. Table 6 shows the 64-bit AICA encoding algorithm. If the current flit is of single or header type, it is sent directly to the transmission buffer at lines 11 and 12. Otherwise, the flit number is stored in Fc and the encoding stage starts at lines 13 to 25. If the body or tail flit cannot be encoded, the encoder records the flit in comparereg unless comparereg is full, records it in the un-coded registers (Ur), and records the flit counter (Fc) in the un-coded sequence (US) at lines 15 to 19. Otherwise, the encoder encapsulates I and D in one field, records it in the encoded registers (Er), and records Fc in the encoded sequence (ES) at lines 20 and 21 (stage 1). Stage 1 keeps encoding flits until US and ES meet the case conditions of Fig. 8, Fig. 9 and Fig. 10, or the tail flit arrives, at lines 22 to 25 (stage 2). When the encoder leaves stage 1, it encapsulates the fields in Er into one flit, adds the encoding control bits to the encapsulated flit and the un-coded flits according to the case condition, and stores the result in the transmission buffer at lines 27 to 42 (stage 3). The algorithm encodes continuously until the transmission buffer is full or the tail flit is reached. It then sends the data from the transmission buffer to the router, and clears all registers, at lines 43 to 46. The decoder reorders the flits according to the different cases, as described in the next subsection. Fig. 14 shows the simplified instructional computer (SIC) encoded with 64-bit AICA. This study encodes a paragraph of opcodes from the PE in three stages. In the first stage, the encoder records the number of flits in Fc, stores the flits in comparereg until it is full, and records each flit in Er or Ur. In the second stage, the encoder detects the case status. In the final stage (stage 3), the encoder adds the encoding control bits to the flits from Er or Ur, as described in Table 6, and transmits Er or Ur to the transmission buffers (Tb).

4.5. Design of AICA decoder with 64-bit

The 64-bit decoding algorithm is shown in Table 7. This algorithm has two stages: recovery and storage. When the flit type is single or header, the algorithm transmits the flit to the unpacking module at lines 11 and 12. Otherwise, the decoding stage starts at lines 13 to 40. In Case 1, the current body flit is stored in de_comparereg unless de_comparereg is full, and the flit is then sent to the unpacking module or to Rreg, depending on whether a preceding flit requested reordering, in lines 15 to 25. In Case 2 or Case 3, the flits are reordered with Rreg. The first stage of Case 2 and Case 3 is decoding, in which the D of the current body flit is added to the de_comparereg entry selected by I. In the second stage, the decoded flits of stage 1 and the un-coded flits (from Rreg) are put in order in lines 26 to 32. In Case 4, the algorithm adds the D of the current body flit to the de_comparereg entry selected by I in lines 33 to 39.


Table 6
AICA encoding algorithm with 64-bit.

AICA Encoding Algorithm with 64-bit
Input: Flits Fs from the packing module to AICA
Output: Transmission buffers Tb⊃{UF, EF}, where UF are un-coding flits, EF are encoding flits
1. Initialize encoded registers and sequence Er=0 and ES=0, where Er and ES capacity=4
2. Initialize un-coded registers and sequence Ur=0 and US=0, where Ur and US capacity=3
3. Initialize encoded and un-coded registers index EI=0 and UI=0
4. Initialize index I=0 and flits counter Fc=0, where I bits=8
5. Initialize compare registers comparereg=0, where comparereg capacity=max(size of 2^(I bits))
6. The Case⊃{Case 1, Case 2, Case 3, Case 4}, where Case 1 are UF, Case 2⊃{Case 2.1, Case 2.2, …, Case 2.13} are encoding one or two flits, Case 3⊃{Case 3.1, Case 3.2, …, Case 3.7} are encoding three flits and Case 4 are encoding four flits. Each case is determined by US and ES
7. The encoding control bits Ecb⊃{Ec1, Ec2, Ec3, Ec4}, where Ec1 to Ec4 are Case 1 to Case 4 encoding control bits
8. The sub encoding control bits for Case 2, SEcb2⊃{SEcb2Nn, SEcb2Ln}, where SEcb2Ln⊃{Case 2.1, Case 2.2, …, Case 2.4} is leaf nodes case and SEcb2Nn⊃{Case 2.5, Case 2.6, …, Case 2.13} is nonterminal nodes case
9. The sub encoding control bits for Case 3, SEcb3⊃{SEcb3Nn, SEcb3Ln}, where SEcb3Ln⊃{Case 3.1, Case 3.2, …, Case 3.6} is leaf nodes case and SEcb3Nn=Case 3.7 is nonterminal nodes case
10. While (Fs) do
11.   If (single flit or header flit arrival)
12.     {Tb=Fs} // Fs bypass to Tb
      /* Starts encoding */
13.   Else if (body flit or tail flit arrival)
        /* Coding and classification */
14.     Fc=Fc+1
15.     If (Fs cannot be encoded)
16.       If (comparereg are full)
17.         {Ur[UI]=Fs, US[UI]=Fc, UI++}
18.       Else
19.         {Ur[UI]=Fs, US[UI]=Fc, UI++, comparereg[I]=Fs, I++}
20.     Else
21.       {Er[EI]=I+D, ES[EI]=Fc, EI++}, where D=Fs−comparereg[I], D bits=8
        /* Cases status detection */
22.     If (US and ES = Case conditions or tail flit arrival)
23.       {break}
24.     Else
25.       {Repeat step 13}
26. End while
    /* Restructuring and store */
27. While (Case) do
28.   If (Case 1)
29.     {Tb=Ec1+Ur[0], UI=0, Fc=0}
30.   Else if (Case 2)
31.     If (tail flit arrival)
32.       {Tb=Ec2+SEcb2+Er, EI=0, next Tb=Ec1+Ur, UI=0, Fc=0}
33.     Else
34.       {Tb=Ec2+SEcb2Ln+Er, EI=0, next Tb=Ec1+Ur, UI=0, Fc=0}
35.   Else if (Case 3)
36.     If (tail flit arrival)
37.       {Tb=Ec3+SEcb3+Er, EI=0, next Tb=Ec1+Ur, UI=0, Fc=0}
38.     Else
39.       {Tb=Ec3+SEcb3Ln+Er, EI=0, next Tb=Ec1+Ur, UI=0, Fc=0}
40.   Else if (Case 4)
41.     {Tb=Ec4+Er, EI=0, Fc=0}
42.   Repeat step 13 till the Tb are full or tail flit arrive AICA
      /* Send to router */
43.   If (tail flit arrival)
44.     {Send flits from Tb to router, clear comparereg, return step 1}
45.   Else if (Tb full)
46.     {Send flits from Tb to router, return step 13}
47. End while
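Step 21 of Table 6 (Er[EI]=I+D) and Case 4 (four encoded flits in one transmitted flit) can be illustrated with a small Python sketch. It assumes that D is an 8-bit two's-complement difference and that the four {I, D} fields are packed from the most significant end, matching the layout I1=Fs[63:56], D1=Fs[55:48], …, D4=Fs[7:0] of Table 7. The helper names `find_delta` and `pack_case4` are hypothetical, and the case classification of Fig. 8 to Fig. 10 is not modelled.

```python
from typing import List, Optional, Tuple

INDEX_BITS = 8   # "I bits=8" in Table 6
DELTA_BITS = 8   # "D bits=8" in Table 6


def find_delta(flit: int, table: List[int]) -> Optional[Tuple[int, int]]:
    """Return (index, delta) for the first reference within signed 8-bit range, else None."""
    low, high = -(1 << (DELTA_BITS - 1)), (1 << (DELTA_BITS - 1)) - 1
    for index, reference in enumerate(table[: 1 << INDEX_BITS]):
        delta = flit - reference
        if low <= delta <= high:
            return index, delta
    return None


def pack_case4(pairs: List[Tuple[int, int]]) -> int:
    """Pack four (index, delta) pairs into one 64-bit word, first pair in bits 63:48."""
    if len(pairs) != 4:
        raise ValueError("Case 4 packs exactly four encoded flits")
    word = 0
    for index, delta in pairs:
        if not (0 <= index < 256 and -128 <= delta <= 127):
            raise ValueError("field out of range")
        # delta & 0xFF stores the difference as 8-bit two's complement
        word = (word << 16) | (index << 8) | (delta & 0xFF)
    return word


table = [1000, 2000]
print(find_delta(1003, table))   # close to entry 0
print(find_delta(1990, table))   # close to entry 1, negative difference
print(find_delta(5000, table))   # no entry in range, flit stays un-coded
print(hex(pack_case4([(0, 1), (1, -1), (2, 5), (3, 0)])))
```

Under these assumptions, four 64-bit body flits that each lie close to a stored reference are carried by a single 64-bit flit, which corresponds to the reduction of up to 75% quoted for β.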

The algorithm decodes continuously until the last flit is reached, at which point the flits are sent from Rreg to the unpacking module and all registers are cleared (lines 41 to 44).

5. Experimental results

This work proposes an adaptive instruction codec architecture to reduce repeated instructions. The design tool was Xilinx ISE 14.7, and the simulation tool was Modelsim. The emulation platform was a Xilinx Virtex-6 XC6VLX240T-1ff1136 FPGA, and the chip measurements were performed using Xilinx Chipscope and Xpower.


Table 7
AICA decoding algorithm with 64-bit.

AICA Decoding Algorithm with 64-bit
Input: Flits Fs from the destination router to AICA
Output: Unpacking buffers Ub⊃{UF, DF}, where UF are un-coding flits, DF are decoding flits
1. Initialize decoded registers Dr=0, where Dr⊃{Dr1, Dr2, Dr3, Dr4}, Dr1⊃{I1, D1}, Dr2⊃{I2, D2}, Dr3⊃{I3, D3}, Dr4⊃{I4, D4}, I1=Fs[63:56], D1=Fs[55:48], I2=Fs[47:40], D2=Fs[39:32], I3=Fs[31:24], D3=Fs[23:16], I4=Fs[15:8], D4=Fs[7:0]
2. Initialize index I=0, where I bits=8
3. Initialize decoder compare registers de_comparereg=0, where de_comparereg=max(size of 2^I)
4. Initialize reorder register and index Rreg=0 and RI=0, where Rreg and RI capacity=7
5. Initialize temporarily register Tr=0
6. Initialize reorder flag and times Rf=0 and Rt=0, when Rf=1 to reorder with next Fs
7. The coding types Ct⊃{Ct1, Ct2, Ct3, Ct4}, where Ct=Fs[65:64]
8. The sub coding types SCt2⊃{SCt2.1, SCt2.2, …, SCt2.13} and SCt3⊃{SCt3.1, SCt3.2, …, SCt3.7}, where SCt2 and SCt3=Dr1, Ct2 total flits Ct2tf=number of flits (Ct1+SCt2) and Ct3 total flits Ct3tf=number of flits (Ct1+SCt3)
9. Initialize insertion sequence registers ISR by SCt2 and SCt3, where capacity=ISR[20][7], which stored each sub-case insertion sequence
10. While (Fs) do
11.   If (single flit or header flit arrival)
12.     {Ub=Fs} // Fs bypass to Ub
      /* Stage 1 and Stage 2: Recovery and store */
13.   Else if (body flit or tail flit arrival)
14.     If (Fs is encoding flit)
15.       If (Ct1)
16.         If (de_comparereg not full)
17.           {de_comparereg[I]=Fs, I++}
18.         If (Rf)
19.           {Rreg[RI[Rt]]=Fs, Rt−−}
20.           If (Rt=0)
21.             {Rf=0, Ub=Rreg}
22.           Else
23.             {Repeat step 13}
24.         Else
25.           {Ub=Fs}
26.       Else if (Ct2 or Ct3)
27.         {RI=ISR[SCt2 or SCt3], Rt=Ct2tf or Ct3tf, Rf=1}
28.         For j=2 to number of flits with SCt2 or SCt3
29.           If (Drj is positive)
30.             {Tr=de_comparereg[Ij]+Dj, Rreg[RI[Rt]]=Tr, Rt−−}
31.           Else
32.             {Tr=de_comparereg[Ij]+Dj, Rreg[RI[Rt]]={de_comparereg[Ij][63:8], Tr[7:0]}, Rt−−}
33.       Else if (Ct4)
34.         For j=1 to 4
35.           If (Drj is positive)
36.             {Tr=de_comparereg[Ij]+Dj, Rreg[j]={de_comparereg[Ij][63:16], Tr}}
37.           Else
38.             {Tr=de_comparereg[Ij]−Dj, Rreg[j]={de_comparereg[Ij][63:16], Tr}}
39.         {Ub=Rreg}
40.   Repeat step 13 till Ub are full or tail flit arrive AICA
41.   If (tail flit arrival)
42.     {Send flits from Ub to unpacking module, clear de_comparereg, return step 1}
43.   Else if (Ub full)
44.     {Send flits from Ub to unpacking module, return step 13}
45. End while
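The Case 4 branch of Table 7 (steps 33 to 39) can be sketched in Python as follows. The field positions are those of step 1 of Table 7. Two points are assumptions: D is read as an 8-bit two's-complement value, so the separate positive and negative branches collapse into one signed addition, and the whole reference value is used instead of the bit-slice concatenation shown in steps 36 and 38. The names `unpack_case4` and `recover_case4` are illustrative.

```python
from typing import List, Tuple


def unpack_case4(word: int) -> List[Tuple[int, int]]:
    """Split a 64-bit flit into four (index, signed delta) pairs, Dr1 first."""
    pairs = []
    for shift in (48, 32, 16, 0):          # Dr1..Dr4 occupy 16 bits each
        field = (word >> shift) & 0xFFFF
        index = field >> 8                 # Ij: upper byte of the field
        delta = field & 0xFF               # Dj: lower byte of the field
        if delta >= 0x80:                  # reinterpret as two's complement
            delta -= 0x100
        pairs.append((index, delta))
    return pairs


def recover_case4(word: int, table: List[int]) -> List[int]:
    """Rebuild four original flits from one Case 4 flit and the reference table."""
    return [table[index] + delta for index, delta in unpack_case4(word)]


references = [100, 200, 300, 400]          # hypothetical de_comparereg contents
encoded = 0x000101FF02050300
print(unpack_case4(encoded))
print(recover_case4(encoded, references))
```

One received flit expands into four flits in Rreg, which are then forwarded to the unpacking buffer as in step 39.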

The power consumption of the network-on-chip (NoC) is evaluated by Eq. (11). The term Pav is the total average power consumption of the NoC, which includes the power consumed by the routers, the network interface (NI) and the transmission channels. The parameter β is the number of flits transmitted by the adaptive instruction codec architecture (AICA) through the routers and channels, which was reduced by up to 75%. Prt and Pli are the total average power consumed when the flits pass each router and each transmission line, respectively. The terms Pcoder and Pdecoder refer to the power consumed when encoding and decoding, respectively. These two terms can be disregarded because the power is dominated by the transfer of flits through the routers (β Prt) and the channels ((β − 1) Pli). Throughput is analyzed following [29]. A smaller β leads to improved power consumption and throughput.

$$P_{av} = \beta P_{rt} + (\beta - 1) P_{li} + P_{coder} + P_{decoder} \qquad (11)$$
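Eq. (11) can be evaluated numerically to see how the flit count β drives the average power. The following Python sketch is only an illustration: the per-router and per-link power values are hypothetical placeholders, not measurements from this work, and the codec terms default to zero because the text treats them as negligible.

```python
def average_power(beta: int, p_rt: float, p_li: float,
                  p_coder: float = 0.0, p_decoder: float = 0.0) -> float:
    """Evaluate Eq. (11): Pav = beta*Prt + (beta - 1)*Pli + Pcoder + Pdecoder."""
    return beta * p_rt + (beta - 1) * p_li + p_coder + p_decoder


# Hypothetical values in mW, chosen only to make the arithmetic easy to follow.
P_RT = 2.0
P_LI = 1.0

uncoded = average_power(beta=8, p_rt=P_RT, p_li=P_LI)  # eight flits sent as-is
coded = average_power(beta=2, p_rt=P_RT, p_li=P_LI)    # 75% fewer flits
print(uncoded, coded)
```

With these placeholder values, reducing β from 8 to 2 lowers the modelled power from 23 mW to 5 mW, which shows why a smaller β improves power consumption as long as the codec terms stay small.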

The experimental setup is described as follows. Fig. 15 and Fig. 16 illustrate the 16-bit and 64-bit AICA codecs, respectively. Each flit emulates a Reduced Instruction Set Computing (RISC) instruction that is transferred to another router. DataPort connects the CPU and the encoder, as illustrated in Fig. 15 (a) and Fig. 16 (a): the CPU sends the packed data through DataPort to the NI, where the encoder reads it from the packing module. DataPort_1 connects the encoder and the router. The encoder uses the 64-bit or 16-bit encoding algorithm to reduce redundancy in the instruction set and repackages the coded flits onto DataPort_1, as


Fig. 15. 16-bit AICA measurement scheme.

Fig. 16. 64-bit AICA measurement scheme.

Table 8
16-bit and 64-bit of AICA power consumption and throughput table.

| Number of flits | Power consumption (mW), AICA 16-bit | Power consumption (mW), AICA 64-bit | Throughput (M Flits/s), AICA 16-bit | Throughput (M Flits/s), AICA 64-bit |
|---|---|---|---|---|
| 100 | 38 | 52.1 | 305 | 360 |
| 200 | 39.2 | 54.1 | 316 | 371 |
| 300 | 40.7 | 56 | 325 | 383 |
| 400 | 41 | 56.6 | 331 | 389 |
| 500 | 41.2 | 57.3 | 343 | 396 |
| 600 | 41.8 | 57.6 | 350 | 401 |
| 700 | 42.2 | 57.8 | 355 | 410 |
| 800 | 43.9 | 58 | 358 | 414 |
| 900 | 44.7 | 59.1 | 359 | 420 |
| 1000 | 44.9 | 61 | 359 | 422 |
| 2000 | 46 | 62.1 | 369 | 429 |
| 3000 | 46.7 | 63 | 370 | 432 |
| 4000 | 47 | 63.2 | 371 | 432 |
| 5000 | 47.2 | 63.3 | 371 | 434 |
| Average | 43.1 | 58.6 | 348.7 | 406.6 |

shown in Fig. 15 (b) and Fig. 16 (b). DataPort_2 connects the router to the decoder, which reads flits from the router. DataPort_3 connects the decoder to the CPU. The decoder recovers the coded flits and restores them to the instruction set, as shown in Fig. 15 (c) and Fig. 16 (c). The content of DataPort_3 is the same as that of DataPort, which verifies the correctness of the transmission. Table 8 shows the average power and throughput for 100 to 5000 flits. Table 9 compares throughput. The proposed 64-bit and 16-bit AICA methods improve throughput by up to 46.3% and 25.4%, respectively. Table 10 compares power consumption. The proposed 64-bit and 16-bit AICA methods reduce power consumption by up to 29.3% and 48.1%, respectively. Table 9 and Table 10 also compare the features and coding of the proposed AICA algorithm with other technologies. The comparison indicates that AICA targets instruction coding and is simpler than the method of Tassori et al. [13], as it does not require calculating a probability mass function (PMF) or building Huffman trees. The method of Mishra et al. [30] improves performance by classifying networks as bandwidth-sensitive or latency-sensitive, and adapting the method of transferring flits according to the network type. The method of Tran et al. [17] improves performance by using shared queues to maximize buffer utilization. The proposed AICA improves bandwidth utilization and router throughput by increasing the flit capacity. The improved throughput reduces the power consumption according to Eq. (1), since completing the flit transfer early allows the channel to sleep early.


Table 9
Throughput comparison.

| Constraints / Methods | AICA 16-bit | AICA 64-bit | Tassori [13] | Mishra [30] | Tran [17] |
|---|---|---|---|---|---|
| Technologies | AICA | AICA | Huffman coding | Heterogeneous NoC | RoShaQ |
| Throughput (M Flits/s) | 348.7 | 406.6 | 310 | 278 | 335 |
| 16-bit improved (%) | – | – | 12.5% | 25.4% | 4.1% |
| 64-bit improved (%) | – | – | 31.2% | 46.3% | 21.4% |
| Average improved (%) | – | – | 23.5% | | |

The average improvement of 23.5% appears once in the table and applies to the three compared methods together.

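Table 10
Power consumption comparison.

| Constraints / Methods | AICA 16-bit | AICA 64-bit | Tassori [13] | Mishra [30] | Tran [17] |
|---|---|---|---|---|---|
| Technologies | AICA | AICA | Huffman coding | Heterogeneous NoC | RoShaQ |
| Power consumption (mW) | 43.1 | 58.7 | 83 | 80.7 | 60 |
| 16-bit reduced (%) | – | – | 48.1% | 46.6% | 28.2% |
| 64-bit reduced (%) | – | – | 29.3% | 27.3% | 2.2% |
| Average reduced (%) | – | – | 30.3% | | |

The average reduction of 30.3% appears once in the table and applies to the three compared methods together.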
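The percentages in Table 9 and Table 10 can be reproduced from the reported throughput and power figures. The Python sketch below assumes that the throughput improvement is computed as (AICA / other − 1) and the power reduction as (1 − AICA / other), and that each average is the mean of the six individual percentages. These formulas are inferred from the tabulated numbers and are not stated explicitly in the text.

```python
def improvement(aica: float, other: float) -> float:
    """Relative throughput gain of AICA over another method, in percent."""
    return round((aica / other - 1) * 100, 1)


def reduction(aica: float, other: float) -> float:
    """Relative power saving of AICA over another method, in percent."""
    return round((1 - aica / other) * 100, 1)


# Values taken from Table 9 (M Flits/s) and Table 10 (mW).
throughput = {"AICA 16-bit": 348.7, "AICA 64-bit": 406.6}
power = {"AICA 16-bit": 43.1, "AICA 64-bit": 58.7}
others_throughput = {"Tassori [13]": 310, "Mishra [30]": 278, "Tran [17]": 335}
others_power = {"Tassori [13]": 83, "Mishra [30]": 80.7, "Tran [17]": 60}

gains = [improvement(t, o)
         for o in others_throughput.values() for t in throughput.values()]
savings = [reduction(p, o)
           for o in others_power.values() for p in power.values()]

print(gains, round(sum(gains) / len(gains), 1))
print(savings, round(sum(savings) / len(savings), 1))
```

The recomputed values match the tabulated entries, including the best cases of 46.3% for throughput (64-bit AICA against [30]) and 48.1% for power (16-bit AICA against [13]).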

6. Conclusion

This work proposes a novel solution for instruction codec transmission, aiming for high throughput and low power consumption in the network-on-chip (NoC) architecture. The proposed method increases the proportion of encoded flits by reducing redundant flits, so that the transmission channel can accommodate more flits than with other network interfaces (NIs), which improves the bandwidth utilization and throughput of the router. The proposed Adaptive Instruction Codec Architecture (AICA) reduces power consumption through bit and transition reduction in the routers and transmission channels; because the router completes the flit transfer early, the channel can go to sleep early. The proposed AICA reduces transmission redundancy, and supports process elements (PEs) with a 16-bit or 64-bit core CPU. The AICA encoder and decoder have both 16-bit and 64-bit designs, and include packing/un-packing modules between the NI and the router. The router is connected to other routers for transmission of data or instructions. The encoder and decoder algorithms of AICA are proposed and implemented on a Xilinx Virtex-6 FPGA device. In the experimental environment, the design tool is Xilinx ISE 14.7, and the simulation tool is Modelsim 10.2. The emulation platform is a Xilinx Virtex-6 XC6VLX240T-1ff1136. The function verification and chip power measurement use the Xilinx Chipscope and Xpower tools, respectively. The average power and throughput were analyzed for 100 to 5000 flits. Experimental results indicate that the proposed architecture improves throughput by up to 46.3%, and reduces power consumption by up to 48.1%.

Acknowledgments

The authors would like to thank the Ministry of Science and Technology of the Republic of China, Taiwan, for supporting this research under Contract No. MOST 103-2221-E-027-125.

References

[1] Nicopoulos C, Srinivasan S, Yanamandra A, Park D, Narayanan V, Das C, Irwin M. On the effects of process variation in network-on-chip architectures. IEEE Trans Depend Secure Comput 2010;7(3):240–54.
[2] Marculescu R, Ogras UY, Peh LS, Jerger NE, Hoskote Y. Outstanding research problems in NoC design: system, microarchitecture, and circuit perspectives. IEEE Trans Comput 2009;28(1):3–21.
[3] Liu Y, Tan Y, Hang J. Key problems on network-on-chip. In: Proc. of 10th IEEE international conference on computer-aided design and computer graphics; 2007. p. 549–52.
[4] da Rosa TR, Larrea V, Calazans N, Moraes FG. Power consumption reduction in MPSoCs through DFS. In: Proc. of 2012 25th symposium on integrated circuits and systems design (SBCCI); 2012. p. 1–6.
[5] Jafarzadeh N, Palesi M, Khademzadeh A, Afzali-Kusha A. Data encoding techniques for reducing energy consumption in network-on-chip. IEEE Trans Very Large Scale Integ (VLSI) Syst 2014;22(3):675–85.
[6] Swaminathan K, Lakshminarayanan G, Lang F, Fahmi M, Ko S-B. Design of a low power network interface for network on chip. In: Proc. of 2013 26th annual IEEE Canadian conference on electrical and computer engineering (CCECE); 2013. p. 1–4.
[7] Gu H, Xu J, Zhang W. A low-power fat tree-based optical network-on-chip for multiprocessor system-on-chip. In: Proc. of design, automation & test in Europe conference & exhibition (DATE); 2009. p. 3–8.
[8] Nicopoulos C, Srinivasan S, Yanamandra A, Park D, Narayanan V, Das C, Irwin M. On the effects of process variation in network-on-chip architectures. IEEE Trans Depend Secure Comput 2010;7(3):240–54.
[9] Lee T-Y, Huang C-H. Design of smart power-saving architecture for network on chip. VLSI Des 2014:10.
[10] Choudhary N, Gaur MS, Laxmi V, Singh V. GA based congestion aware topology generation for application specific NoC. In: Proc. of 2011 sixth IEEE international symposium on electronic design, test and application (DELTA); 2011. p. 93–8.
[11] Fattah M, Manian A, Rahimi A, Mohammadi S. A high throughput low power FIFO used for GALS NoC buffers. In: Proc. of IEEE annual symposium on VLSI (ISVLSI); 2010. p. 333–8.


[12] Fu W, Shao J, Xie B, Chen T, Liu L. Design of a high-throughput NoC router with neighbor flow regulation. In: Proc. of 2012 IEEE 14th international conference on high performance computing and communication & 2012 IEEE 9th international conference on embedded software and systems (HPCC-ICESS); 2012. p. 493–500.
[13] Tassori M, Tassori M, Mossavi M. Adaptive data compression in NoC architectures for power optimization. J Int Rev Comput Softw 2010;5(5):540–7.
[14] Wu C, Chai S, Li Y-B, Yang Z-M. Design of a dual-switching mode NOC router microarchitecture. In: Proc. of 2010 international conference on electrical and control engineering (ICECE); 2010. p. 2733–6.
[15] Lusala AK, Legat J. A hybrid NoC combining SDM-TDM based circuit-switching with packet-switching for real-time applications. In: Proc. of 2012 IEEE 10th international on new circuits and systems conference (NEWCAS); 2012. p. 17–20.
[16] Onizawa N, Matsumoto A, Funazaki T, Hanyu T. High-throughput compact delay-insensitive asynchronous NoC router. IEEE Trans Comput 2014;63(3):637–49.
[17] Tran AT, Baas BM. Achieving high-performance on-chip networks with shared-buffer routers. IEEE Trans Circuits Syst Soc 2013;22(6):1063–8210.
[18] Jiang SY, Luo G, Liu Y, Jiang SS, Li XT. Fault-tolerant routing algorithm simulation and hardware verification of NoC. IEEE Trans Appl Supercond 2014;24(5):1,5.
[19] Collet JH. A brief overview of the challenges of the multicore roadmap. In: IEEE Trans. mixed design of integrated circuits & systems (MIXDES), 2014 proceedings of the 21st international conference; 2014. p. 22–9.
[20] Liu H-N, Huang Y-J, Li J-F. A built-in self-repair method for RAMs in mesh-based NoCs. In: Proc. of international symposium on VLSI design, automation and test; 2009. p. 259–62.
[21] Kariniemi H, Nurmi J. Fault-tolerant 2-D mesh network-on-chip for multiprocessor systems-on-chip. In: Proc. of international conference on design and diagnostics of electronic circuits and systems; 2006. p. 184–9.
[22] Bakhouya M. Towards a bio-inspired architecture for autonomic network-on-chip. In: Proc. of international conference on high performance computing and simulation (HPCS); 2010. p. 491–7.
[23] Oxman G, Weiss S. Simple method to reduce congestion in bufferless network-on-chip. Electron Lett 2014:581–3.
[24] Geetha K, Ammasai Gounden N. Compressed instruction set coding (CISC) for performance optimization of hand held devices. In: Proc. of IEEE international conference on advanced computing and communications; 2008. p. 241–7.
[25] Kadayif I, Kandemir MT. Instruction compression and encoding for low-power systems. In: Proc. of 15th IEEE international conference on ASIC and SOC; 2002. p. 301–5.
[26] Tsai WC, Lan YC, Hu YH, Chen SJ. Networks on chips: structure and design methodologies. J Elect Comput Eng 2012;2012:1–15.
[27] Mak T, Cheung PYK, Lam K-P, Luk W. Adaptive routing in network-on-chips using a dynamic-programming network. IEEE Trans Ind Electron 2011;58(8):3701–16.
[28] Matos D, Costa M, Carro L, Susin A. Network interface to synchronize multiple packets on NoC-based systems-on-chip. In: Proc. of 18th IEEE/IFIP VLSI system on chip; 2010. p. 31–6.
[29] Pande PP, Grecu C, Jones M, Ivanov A, Saleh R. Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Trans Comput 2005;54(8):1025–40.
[30] Mishra AK, Mutlu O, Das CR. A heterogeneous multiple network-on-chip design: an application-aware approach. In: Proc. of IEEE design automation conference; 2013. p. 1–10.


Trong-Yen Lee received his Ph.D. degree in Electrical Engineering from National Taiwan University, Taipei, Taiwan, ROC in 2001. Since 2002, he has been a member of the faculty in the Department of Electronic Engineering, National Taipei University of Technology, where he is currently a professor. His research interests include hardware-software co-design of embedded systems, FPGA systems design, and VLSI design.

Chi-Han Huang received his M.S. degree in Electronic Engineering from National Taipei University of Technology, Taipei, Taiwan, ROC in 2011. He is currently working toward the Ph.D. degree in the Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan. His current research interests include VLSI design and Network-on-Chip design.

Min-Jea Liu received his M.S. degree in the Graduate Institute of Communication Engineering from Tatung University, Taipei, Taiwan, ROC in 2010. He is currently working toward the Ph.D. degree in the Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan, ROC. His current research interests include VLSI design and multiplier design.

Jhen-Syuan Chen received his M.S. degree in the Graduate Institute of Computer and Communication Engineering from National Taipei University of Technology, Taipei, Taiwan, ROC in 2014. His current research interests include VLSI design and Network-on-Chip design.

Please cite this article as: T.-Y. Lee et al., Adaptive instruction codec architecture design for network-on-chip, Computers and Electrical Engineering (2016), http://dx.doi.org/10.1016/j.compeleceng.2016.02.021
