Adaptive instruction codec architecture design for network-on-chip


ARTICLE IN PRESS

JID: CAEE

[m3Gsc;April 22, 2016;16:53]

Computers and Electrical Engineering 0 0 0 (2016) 1–18

Contents lists available at ScienceDirect

Computers and Electrical Engineering journal homepage: www.elsevier.com/locate/compeleceng

Adaptive instruction codec architecture design for network-on-chip R

Trong-Yen Lee∗, Chi-Han Huang, Min-Jea Liu, Jhen-Syuan Chen

Department of Electronic Engineering, National Taipei University of Technology, 1, Sec. 3, Chung-Hsiao E. Rd., Taipei 10608, Taiwan, ROC

Article info

Article history: Received 16 June 2015; Revised 22 January 2016; Accepted 27 February 2016; Available online xxx

Keywords: Adaptive instruction codec architecture (AICA); Network on chip (NoC); FPGA; Power consumption and throughput

Abstract

This work proposes a novel Adaptive Instruction Codec Architecture (AICA) for network-on-chip (NoC) that improves channel utilization when transferring packets and flits, in order to address power consumption and throughput. The proposed architecture allows multiple packets to be stuffed into a single packet, and can thus transfer more packets per time unit than other network interfaces (NI). Reducing the number of packets for transmission allows the channel to be reused to transfer additional messages, thus improving channel throughput. Transmitting fewer packets also indirectly alleviates the deadlock problem. Many repeating and similar instructions are frequently transferred in NoC. The proposed AICA reduces this transmission redundancy, and supports process elements (PE) with a 16-bit or 64-bit core CPU. Experimental results show that the proposed architecture and algorithms deliver improvements of up to 48.1% in power consumption and 46.3% in throughput. © 2016 Published by Elsevier Ltd.

1. Introduction

Multiprocessor architecture is a popular research topic in system-on-chip (SoC). However, the bus architecture of SoC cannot meet the high performance and throughput requirements for packet transfer in multiprocessor systems. Therefore, network-on-chip (NoC) was proposed to solve transmission problems in multiprocessor architectures [1], but it raises new issues of power, throughput and deadlock [2,3].

Considering the issue of power consumption, Rosa et al. [4] proposed distributed dynamic frequency scaling (DFS) to reduce overhead in the NoC execution time and frequency. The DFS controls the local first-in-first-out (FIFO) buffers through the globally asynchronous locally synchronous (GALS) scheme, and depends on process element (PE) information to switch the frequency dynamically. Jafarzadeh et al. [5] proposed data encoding schemes in the network interface (NI), based on the end-to-end flit, to minimize the link-to-switch times when accessing flits. Swaminathan et al. [6] proposed a low-power flexible NI, which includes packing, unpacking, a master/slave wrapper, FIFO and a configuration controller. The configuration controller enables and disables components in the NI at run time. Huaxi et al. [7] proposed a fat-tree-based optical NoC, which includes topology, planning and protocol, and an optical turnaround router based on an optimal algorithm to minimize network control data. Nicopoulos et al. [8] proposed the IntelliBuffer system to save power. IntelliBuffer uses clock gating to disable the clock when the

R Reviews processed and recommended for publication to the Editor-in-Chief by Associate Editor Dr. T. H. Meen.

∗ Corresponding author. Tel.: +886-2-2771-2171 Ext. 2251; Fax: +886-2-2731-7120. E-mail addresses: [email protected] (T.-Y. Lee), [email protected] (C.-H. Huang), [email protected] (M.-J. Liu), [email protected] (J.-S. Chen).

http://dx.doi.org/10.1016/j.compeleceng.2016.02.021 0045-7906/© 2016 Published by Elsevier Ltd.

Please cite this article as: T.-Y. Lee et al., Adaptive instruction codec architecture design for network-on-chip, Computers and Electrical Engineering (2016), http://dx.doi.org/10.1016/j.compeleceng.2016.02.021


Fig. 1. Typical mesh topology of the NoC architecture [26].

slot is empty, accesses the slots with the lowest power according to the leakage-classification-register (LCR) table, and modifies the arbitration units to avoid switch overhead. Lee et al. [9] proposed smart power saving using a dynamic control clock for slots. This control clock minimizes the power in each slot according to different network parameters.

Considering the issue of throughput, Choudhary et al. [10] proposed a genetic algorithm to improve the energy distribution and communication load. The genetic algorithm can sense congested nodes and avoid deadlocked nodes to improve the throughput. Fattah et al. [11] proposed a one-hot memory asynchronous FIFO for NoC, which can reduce the decoding time to improve throughput when the address is one-hot. Fu et al. [12] proposed a novel router with neighbor-flow-regulation (NFR) to improve throughput, avoiding starvation and congestion by regulating information flow among neighboring routers. Tassori et al. [13] proposed data compression to improve throughput. Their model determines whether to use Huffman coding to reduce the packet bits according to the transmission distance. However, most of the traffic from PEs to the NI consists of instructions.

Some works also combine circuit and packet switching to improve throughput. Wu et al. [14] proposed a dual switching mode NoC router that includes circuit switching and packet switching. The circuit switching confirms the packet transmission path and avoids packet congestion. The packet switching avoids channel occupancy when packets are transmitted. Lusala et al. [15] proposed spatial division multiplexing and time division multiplexing to improve resource usage in circuit switching, and to process the best traffic flow in packet switching. Onizawa et al. [16] proposed three steps to improve throughput. First, an asynchronous router is implemented based on level-encoded dual rail (LEDR). Second, a simple transmission scheme based on an even number of flits is used. Finally, congestion is solved by reducing the number of packets.
Tran et al. [17] designed a router architecture with shared queues (RoShaQ), which maximizes buffer utilization by allowing multiple buffer queues to be shared among input ports.

Considering fault tolerance, Jiang et al. [18] proposed a routing algorithm based on X–Y routing. This method improves performance by setting routes that avoid fault paths and error nodes. Collet et al. [19] proposed a novel scheme to improve the fault tolerance of NoC by automatically detecting, diagnosing and repairing faults and errors at runtime. Liu et al. [20] proposed built-in-self-repair (BISR) for a NoC buffer, which improves repair efficiency by repairing the error buffer to avoid routing problems in communication links. Kariniemi et al. [21] proposed a novel fault-tolerant scheme with 2D mesh NoC in a multi-core SoC. Their model detects dynamic and static faults using fault-diagnosis-and-repair (FDAR) to repair faulty switches, and uses the fault-tolerant-dimension-order-routing (FTDOR) algorithm to route packets adaptively in faulty networks. Bakhouya et al. [22] proposed a bio-inspired concept that adaptively changes links when an area of the NoC is congested. It has a distributed immune system which includes self-healing, self-configuration and self-optimization to improve the fault tolerance in NoC. Oxman et al. [23] proposed a simple congestion reduction method for NoC that measures congestion based on the average link utilization per unit of time to perform routing in a buffer-less NoC.

The rest of this paper is organized as follows. Section 2 analyzes the throughput and power issues in NoC. Section 3 presents the transmission flit formats with adaptive instruction codec architecture (AICA). Section 4 presents the design of the adaptive instruction codec architecture. The experimental results are shown in Section 5. Section 6 briefly draws conclusions.

2. Analysis of throughput and power issues

Multi-core systems transmit many instruction flits in network-on-chip (NoC). Therefore, reducing frequently repeated and similar instruction flits can improve router efficiency. For traditional communication architectures, Geetha et al. [24] and Kadayif et al. [25] proposed instruction encoding to improve the throughput by reducing redundancy. Fig. 1 illustrates the typical mesh topology of the NoC architecture [26]. The process elements (PEs) transfer flits through the network interface (NI) to the router. Each router has four direction ports to connect to neighboring routers, and one


Fig. 2. Network interface architecture [28]. (Block diagram; labelled blocks: Network Interface, FIFO_PE_1 to FIFO_PE_N, Packing, Unpacking, FIFO, Local_PE and Router.)

direction port to the connecting PE. The router components are routing computation (RC), switch arbiter (SA), virtual channel arbiter (VA), transmission channel and crossbar (XBAR). The router uses the routing algorithm [27] to transfer flits to the destination PE. A packet is divided into a header flit, body flits and a tail flit. Transmission is performed by flit switching. In the worst case, nine PEs transmit flits at the same time, which increases the probability of channel blocking and deadlock and reduces transmission performance. The adaptive instruction codec architecture (AICA) reduces the transmission time between router and channel, and the latency caused by blocking and deadlock, by more than the coding time it adds in the NI.

Fig. 2 illustrates the NI used during NoC transmission, which provides the flit communication interface between the process element (PE) and the router [28]. The NI provides two directions of transmission between the PE and the router. The packing module needs to add routing information to the flit when the local PE transmits the flit to the router. Whenever the NI completes packing, it checks the channel buffer to avoid overflow caused by the router. The unpacking module restructures each flit sent from the router to the PE when the module buffer has remaining space.

Eq. (1) models the power saving for one port as the time the port can sleep, where TranUntreated represents the transmission time of each untreated flit, and TranTreated represents the transmission time of each flit treated by AICA. If all flits in the architecture are untreated, then TranTreated = TranUntreated. Thus, early completion of flit transfer means that the channel can go to sleep early. In the un-codec router, if the port sends n flits to the next router in one second, then the throughput for one direction port is as shown in Eq. (2). The maximum throughput is n flits/second. The throughput of AICA with the 64-bit router is given by Eq. (3). The port sends n flits to the next router in one second; when the router uses AICA, each flit encodes k flits, so the maximum throughput is kn flits/second. Here k is the effective number of segments derived in the next section.

Low power and high throughput are important issues in NoC. This work proposes AICA for NoC to solve the throughput and power issues between the packing/unpacking modules and the router. The difference value (D) between the present flit and previous flits from the un-coded flit database is calculated. To save coding time for NoC, AICA differs from progressive coding, which includes sampling, quantization and entropy coding.

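$$\mathit{Sleep}_{\mathit{oneport}} = (\mathit{Tran}_{\mathit{Untreated}} - \mathit{Tran}_{\mathit{Treated}}) \times \mathit{clock} \tag{1}$$

$$\mathit{Throughput}_{\mathit{oneport}} = \sum_{n=1}^{\infty} n \times \mathit{flit}/\mathrm{sec} \tag{2}$$

$$\mathit{Throughput}_{\mathit{oneport}} = \sum_{n=1}^{\infty} k \times n \times \mathit{flit}/\mathrm{sec}; \quad 1 \le k \le 4 \tag{3}$$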
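As a rough numerical illustration of Eqs. (1) and (3), the sketch below computes the sleep time gained by one port and the peak throughput with and without the codec. The function names and the sample values are hypothetical and are not taken from the paper; `clock` is used exactly as it appears in Eq. (1).

```python
def sleep_time(tran_untreated: float, tran_treated: float, clock: float) -> float:
    """Eq. (1): time one port can sleep because treated flits finish earlier."""
    return (tran_untreated - tran_treated) * clock


def max_throughput(n: int, k: int = 1) -> int:
    """Peak flits per second for one port: n without a codec, k * n with AICA."""
    if not 1 <= k <= 4:
        raise ValueError("Eq. (3) restricts k to the range 1..4")
    return k * n


# Hypothetical values: untreated flits give no sleep time, k = 4 quadruples the peak rate.
no_gain = sleep_time(tran_untreated=10.0, tran_treated=10.0, clock=2.0)
peak_with_codec = max_throughput(n=100, k=4)
```

With TranTreated equal to TranUntreated the sleep time is zero, which matches the statement that untreated flits bring no saving.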

3. Analysis of the number of segments and flit formats

3.1. Effective number of segments analysis

Fig. 3 illustrates the adaptive instruction codec architecture (AICA) coded flit format, which includes control and data regions. The control region includes the flit type (2 bits), which identifies the type of the flit; the source address, which identifies the source process element (PE); and the destination address, which identifies the destination PE. The bit number of the source


Fig. 3. Coded flit format.

Table 1 Flit type format.
Flit type     Bits   Functional description
Single flit   00     One flit
Header flit   01     Routing information
Body flit     10     Payload
Tail flit     11     Last data

and destination address is determined from the network topology. The header flit (01) contains routing information, which is used to route the flit to the destination address until the channel registration is complete. The body flit (10) and the tail flit (11) are transferred through the registered channel according to the header flit (01), rather than using the destination address bits. Thus, the AICA architecture uses the destination address bits as encoding control bits (E). The data region is 2^d bits, where d represents the address line size of the PE. In this work, the PE is a 64-bit core, thus d = 6 and the data region is 64 bits. The total data frame depends on the value of 2^d and the topology. Each data frame includes index (I) bits that refer to the previously stored flits, and difference (D) bits to recover the current flit, with I + D < 2^d; otherwise the current flit is regarded as un-coded. The encoding equation is written as follows:

$$2^d > (I + D). \tag{4}$$

To avoid redundant bit data frames, the numbers of bits I and D are 2^n; therefore, (4) can be rewritten as

$$d > n + 1. \tag{5}$$

Since the data region of 2^d bits contains 2^k data frames, the equation is

$$2^d = (I + D) \times 2^k. \tag{6}$$

By Eqs. (5) and (6), k is given by

$$k = d - (n + 1). \tag{7}$$

However, if the threshold n is equal to 0, 1 or 2, then I and D equal 1, 2 or 4 bits, respectively, and the encoding range decreases, increasing the un-coding rate and leading to performance degradation. Thus, the n range is

$$2 < n \le d - 1. \tag{8}$$

Without the un-coding case of n = d − 1, the k range is given by

$$0 < k \le d - 4. \tag{9}$$

By Eq. (9), the effective number of segments is

$$2^0 < 2^k \le 2^{d-4}. \tag{10}$$
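The relation between d, n and k can be checked numerically. The following minimal sketch enumerates the thresholds n that the text allows (n greater than 2) and keeps those for which k = d − (n + 1) from Eq. (7) stays above zero, as Eq. (10) requires. The function name `effective_segments` is hypothetical.

```python
from typing import List, Tuple


def effective_segments(d: int) -> List[Tuple[int, int, int]]:
    """Return (n, k, number of data frames) for each usable threshold n.

    k follows Eq. (7); n = 0, 1, 2 are excluded as in the text, and the
    un-coding case k = 0 (n = d - 1) is dropped.
    """
    rows = []
    for n in range(3, d):
        k = d - (n + 1)
        if k > 0:
            rows.append((n, k, 2 ** k))
    return rows


# A 64-bit core has d = 6, which leaves n = 3 (four frames) and n = 4 (two frames).
segments_64bit = effective_segments(6)
```

For d = 6 the largest frame count is four at n = 3, in line with the 64-bit format described later. For d = 4 the list is empty, which is why the 16-bit core needs the separate treatment with k = 1.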

If the PE is a 16-bit core, then d = 4. In this case, let k = 1 to meet the base coding. This work proposes to set d = 4 and 6 for the AICA in NoC.

3.2. Flit formats of protocol in AICA

Table 1 shows the flit type format, which includes single, header, body and tail flits. Fig. 4 illustrates the un-coding flit formats of the 2D mesh. If the data can be transmitted to the destination in one flit, then the type is single flit (00). The header flit (01) provides routing information to register the channel, and lets flits with the same source address use this channel. The body flit (10) uses the registered channel to convey data to the next router or PE. Finally, the tail flit (11) unregisters the channel. This work designs 16-bit and 64-bit AICA codec architectures for 16-bit and 64-bit core PEs, respectively, to transmit information.
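The channel registration described above can be sketched as a small state tracker. The flit type codes come from Table 1; the `Channel` class, its method names and the returned action strings are hypothetical, and rejecting a flit from a source that does not own the channel is an assumption made for illustration only.

```python
from enum import IntEnum


class FlitType(IntEnum):
    SINGLE = 0b00  # whole message fits in one flit
    HEADER = 0b01  # carries routing information, registers the channel
    BODY = 0b10    # payload, travels on the registered channel
    TAIL = 0b11    # last data, unregisters the channel


class Channel:
    """Tracks which source address currently owns the channel (illustrative only)."""

    def __init__(self):
        self.owner = None

    def accept(self, flit_type: FlitType, source: int) -> str:
        if flit_type == FlitType.SINGLE:
            return "forward"
        if flit_type == FlitType.HEADER:
            self.owner = source
            return "register"
        if self.owner != source:
            return "reject"  # assumption: only the registering source may use the channel
        if flit_type == FlitType.TAIL:
            self.owner = None
            return "unregister"
        return "forward"


channel = Channel()
actions = [channel.accept(t, source=3)
           for t in (FlitType.HEADER, FlitType.BODY, FlitType.TAIL)]
```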


Fig. 4. 2-D mesh un-coding flits format.

Fig. 5. AICA 16-bit 2-D mesh flits format.

Table 2 AICA 16-bit encoded control bit function description.
Encoding control bits   Functional description
00                      Not encoding
01                      Un-defined
10                      Encoding
11                      Un-defined

3.3. Flit formats of protocol in AICA with 16-bit

This section provides a solution for the case d ≤ 4, where n = 2 in Fig. 3. In order to maximize the codec efficiency, I and D cannot both equal 2^n bits. The sign bit of D is recorded in the LSB of I to increase the coding range; thus I is 3 bits and D is 5 bits. Fig. 5 illustrates the AICA 16-bit 2D mesh flit formats, in which the destination address bit is replaced by a coding bit in the control region to represent coding status when the flit type is body (coding) or tail (non-coding). The transmission principle is unchanged for single and header flits. Table 2 shows the function description for body or tail flits. The first case (00) represents the non-coding case and the second case (10) represents the


Fig. 6. AICA 64-bit 2-D mesh flits formats.

Fig. 7. 64-bit AICA coded registers.

Fig. 8. 64-bit AICA flit format in Case 1.

encoding case. To increase the coding range (k = 1) to D = 5 and I = 3, the sign bit of D is stored in the LSB of I. The 16-bit AICA reduces the maximum number of transmitted flits from two to one, and improves channel utilization by 50%. The destination router transmits the current flit to the PE directly when the MSB of the encoding control bits is set to 0. Otherwise, the router uses the AICA decoder described in the next section.

3.4. Flit formats of protocol in AICA with 64-bit

This section presents the flit formats for Eq. (10) when d > 4. The maximum codec efficiency can be ensured when n = 3 and the number of data frames is four. The transmission principle of single or header flits does not change. A body or tail flit that cannot be encoded is conveyed to the PE, or used to restructure other flits, in the format in Fig. 6(a). Fig. 6(b) illustrates the flit encoding format, which comprises two to four flits. The protocol has two encoding bits. If the encoding bits are “01”, then two flits can be encoded, and these can be sorted with the next flits (which are non-encoded flits). If the encoding bits are “10”, then three flits can be encoded, and these can be sorted with the next flit (a non-encoded flit). Finally, if the encoding bits are “11”, then four flits can be encoded. The current flit may contain one, two, three or four flits to ensure maximum efficiency of the codec in AICA. Fig. 7 illustrates how the possibility of encoding flits is evaluated to improve the transmission performance of one flit. NO_reg records the non-encoded flits, and YES_reg records the encoded flits. The division of the current flit into different flits has four possible cases. Fig. 8 illustrates Case 1, in which the encoding control bits are set to “00”. In Case 1, only NO_reg holds one flit, which means that an un-coded flit is sent to the local router. Fig. 9 and Fig. 10 illustrate Cases 2, 3 and 4.

In Case 2, the encoding control bits are set to “01”, in which YES_reg contains one or two flits, and NO_reg is full. Case 2 comprises four subcases in the leaf node. In subcases 2.1 to 2.4, the control bits are set to a four-digit value (0000–0011). The channel utilization for these subcases is calculated here. Subcase 2.2 cannot reduce transmission times, because it only encodes one flit. The other subcases reduce the number of transmission flits from five to four, and thus improve channel utilization by 20%. In Case 3, the encoding control bits are set to “10”, which means that YES_reg contains three flits. This case has six subcases in the leaf node. In subcases 3.1 to 3.6, the control bits are set to a four-digit value (0000–0101). The channel utilization for these subcases is calculated. Subcases 3.1 and 3.4 reduce the number of transmission flits from four to two, and thus improve channel utilization by 50%. The other subcases reduce the number of transmission flits from five to three, and thus improve channel utilization by 40%.
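The utilization percentages quoted for these subcases follow from the fraction of flit transmissions saved. A minimal sketch, with a hypothetical function name:

```python
def utilization_gain(flits_before: int, flits_after: int) -> float:
    """Fraction of channel transmissions saved when the flit count shrinks."""
    if flits_before <= 0 or not 0 <= flits_after <= flits_before:
        raise ValueError("expected 0 <= flits_after <= flits_before and flits_before > 0")
    return (flits_before - flits_after) / flits_before


# Flit counts before and after encoding, as stated in the text.
gains = {
    "16-bit AICA": utilization_gain(2, 1),
    "Case 2 (except subcase 2.2)": utilization_gain(5, 4),
    "Subcases 3.1 and 3.4": utilization_gain(4, 2),
    "Other Case 3 subcases": utilization_gain(5, 3),
}
for label, gain in gains.items():
    print(f"{label}: {gain:.0%}")
```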


Fig. 9. 64-bit AICA flit format in Cases 2–4 (second flit is un-coded).

Fig. 10. 64-bit AICA flit format in Cases 2–4 (second flit is coded).

In Case 4, the encoding control bits are set to “11”, which means that all flits can be coded in YES_reg. This case reduces the number of transmission flits from four to one, and improves the channel utilization by 75%. In a nonterminal-node case, a subcase that has not reached a leaf node is sent to the transmission buffer when the tail flit arrives and terminates the encoding stage. Finally, the decoder reconstructs the sequence, and restores the data from each flit according to the encoding control bits and control bits. The 64-bit AICA sends the encoded flit to the local router when it completes encoding. The encoder packs YES_reg and NO_reg according to the case, and sends the data to the local router. The decoder accesses the flits from the router.

Table 3 shows the different recombining and sorting of flits according to the case number. Case 1 is a non-encoded flit, and the current flit is transmitted directly to the unpacking module. Subcase 2.1 has two encoded flits and three non-encoded flits, with the first non-encoded flit recombined between the first and second data frames, and the other un-coded flits placed after the second data frame. Subcase 2.2 has one encoded flit and three non-encoded flits, with the non-coded flits following the first data frame. Subcase 2.3 has two encoded flits and three non-encoded flits, with the first and second non-encoded flits recombined between the first and second data frames, and the other un-coded flits recombined after the second data frame. Subcase 2.4 includes two encoded


Table 3 AICA 64-bit control table.
Encoding control bits   Number of flits             Control bits   Case number
00                      Un-coding one flit          Null           1
01                      Encoding one or two flits   16'h0000       2.1
                                                    16'h0001       2.2
                                                    16'h0010       2.3
                                                    16'h0011       2.4
                                                    16'h0100       2.5
                                                    16'h0101       2.6
                                                    16'h0110       2.7
                                                    16'h0111       2.8
                                                    16'h1000       2.9
                                                    16'h1001       2.10
                                                    16'h1010       2.11
                                                    16'h1011       2.12
                                                    16'h1100       2.13
10                      Encoding three flits        16'h0000       3.1
                                                    16'h0001       3.2
                                                    16'h0010       3.3
                                                    16'h0011       3.4
                                                    16'h0100       3.5
                                                    16'h0101       3.6
                                                    16'h0110       3.7
11                      Encoding four flits         Null           4

flits and three non-encoded flits, with the un-coded flits recombined after the second data frame. Subcases 2.5, 2.11 and 3.7 include many encoded flits, with each flit decoded according to address I from the non-encoded flit database. In subcases 2.6 and 2.8, one flit is encoded, while all others are non-coded and are recombined after the first data frame. Subcase 2.7 has two encoded flits and one non-coded flit, which is recombined between the first and second data frames. Subcase 2.9 has two coded flits and two uncoded flits, with the first uncoded flit recombined between the first and second data frames, and the other uncoded flit recombined after the second data frame. Subcase 2.10 includes two coded flits and two uncoded flits, with the uncoded flits recombined between the first and second data frames. Subcases 2.12 and 2.13 include two coded flits and many uncoded flits, with the uncoded flits recombined after the second data frame.

Subcase 3.1 includes three coded flits and one uncoded flit, with the uncoded flit recombined after the first data frame. Subcase 3.2 includes three coded flits and two uncoded flits, with the first uncoded flit recombined after the first data frame, and the other uncoded flit recombined between the second and third data frames. Subcase 3.3 has three coded flits and two uncoded flits, with the uncoded flits recombined between the first and second data frames. Subcase 3.4 includes three coded flits and one uncoded flit, with the uncoded flit recombined between the second and third data frames. Subcase 3.5 includes three coded flits and two uncoded flits, with the uncoded flits recombined between the second and third data frames. Subcase 3.6 includes three coded flits and one uncoded flit, with the uncoded flit recombined after the third data frame. In Case 4, every flit is encoded in YES_reg (all four flits).

The data are restored by decoding the coded flits according to address I from the un-coded flit database. Finally, the 64-bit AICA sends YES_reg and NO_reg to the transmission buffer of each node when the tail flit arrives or the buffer is full.
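The mapping in Table 3 from encoding control bits and control bits to a case number can be expressed as a small lookup. This is only a sketch of the table as printed; the names `ENCODED_FLITS` and `case_label` are hypothetical, and the control bits are read as a plain binary number (0000 is 0, 1100 is 12), which is an assumption about the notation.

```python
# Encoding control bits -> (minimum, maximum) number of encoded flits, per Table 3.
ENCODED_FLITS = {
    0b00: (0, 0),  # Case 1: one un-coded flit
    0b01: (1, 2),  # Case 2: one or two encoded flits
    0b10: (3, 3),  # Case 3: three encoded flits
    0b11: (4, 4),  # Case 4: four encoded flits
}

# Number of subcases listed in Table 3 for Cases 2 and 3.
SUBCASE_COUNT = {0b01: 13, 0b10: 7}


def case_label(encoding_control_bits: int, control_bits: int = 0) -> str:
    """Return the case number from Table 3, for example '2.13' or '4'."""
    if encoding_control_bits not in ENCODED_FLITS:
        raise ValueError("encoding control bits must be a 2-bit value")
    if encoding_control_bits == 0b00:
        return "1"
    if encoding_control_bits == 0b11:
        return "4"
    if not 0 <= control_bits < SUBCASE_COUNT[encoding_control_bits]:
        raise ValueError("control bits outside the range listed in Table 3")
    return f"{encoding_control_bits + 1}.{control_bits + 1}"
```

For example, `case_label(0b01, 0b1100)` gives the last Case 2 row and `case_label(0b10, 0b0110)` the last Case 3 row.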

4. Design of adaptive instruction codec architecture

4.1. Adaptive instruction codec architecture

Many repeating and similar instructions are transferred frequently in network-on-chip (NoC). Therefore, this work proposes an adaptive instruction codec architecture (AICA) to reduce transmission redundancy. Fig. 11 illustrates the AICA with NoC architecture, which can use a 16-bit or 64-bit core CPU. The AICA includes a 16-bit or 64-bit encoder and decoder between the router and the packing/unpacking modules in the network interface (NI). Each router connects to other routers for data transmission. The 16-bit and 64-bit AICA variants are proposed to verify the efficacy of transmission for the worst and ideal cases. Fig. 12 illustrates the AICA encoder and decoder architecture. Encoding starts when the process element (PE) sends a flit and packing in the NI is complete. The encoder comprises a compare register, an encoded register and an encoding control bit. The compare register provides the comparison values, then the encoder calculates the difference between the comparison value and the current flit. If the difference value (D) is greater than a threshold d, then the current flit is recorded in the compare register and sent to the router. Otherwise, the index (I) and difference value (D) are recorded in the encoded register.


Fig. 11. AICA with NoC architecture.

Fig. 12. AICA encoding and decoding architecture.

The encoded register saves encoded flits and groups them together. In the 16-bit AICA, the encoded register only records encoded flits. However, the encoded register in the 64-bit AICA records not only encoded flits, but also uncoded flits, following the encoding cases described in the previous section. The encoding control bit is set according to the case type. A flit is encapsulated and sent to the local router once its encoding is complete.

The AICA decoder begins decoding when the router sends a flit to the NI. The decoder consists of a compare register, a recorder register, a decode register and an encoding control bit. An uncoded flit is recorded in the compare register and sent to the unpacking module. The decoder sorts coded flits according to the AICA case indicated by the encoding control bit, and saves them in the decode register. When the sorting task is complete, the decoder uses the index to read the comparison value from the compare register, then restores the flits from the comparison and difference values. When the decoding task is complete, the flits are transmitted from the recorder register to the unpacking module.
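The encoder and decoder behaviour described in this subsection can be illustrated with a minimal difference-coding sketch. It is not the authors' hardware design: the class names and message tuples are hypothetical, the capacity of four entries and the 5-bit difference are borrowed from the 16-bit variant, and searching every compare-register entry for a close match is an assumption.

```python
D_BITS = 5       # assumed width of the difference magnitude (16-bit variant)
MAX_ENTRIES = 4  # assumed compare-register capacity (2-bit index)


class Encoder:
    def __init__(self):
        self.compare = []  # un-coded flits sent earlier, used as comparison values

    def encode(self, flit: int):
        """Return ('coded', index, difference) or ('raw', flit)."""
        for index, reference in enumerate(self.compare):
            difference = flit - reference
            if abs(difference) < (1 << D_BITS):
                return ("coded", index, difference)
        if len(self.compare) < MAX_ENTRIES:
            self.compare.append(flit)  # remember the un-coded flit for later comparisons
        return ("raw", flit)


class Decoder:
    def __init__(self):
        self.compare = []  # mirrors the encoder's compare register

    def decode(self, message) -> int:
        if message[0] == "raw":
            flit = message[1]
            if len(self.compare) < MAX_ENTRIES:
                self.compare.append(flit)
            return flit
        _, index, difference = message
        return self.compare[index] + difference


encoder, decoder = Encoder(), Decoder()
flits = [0x1234, 0x1236, 0x4000, 0x3FFF, 0x1230]
messages = [encoder.encode(f) for f in flits]
restored = [decoder.decode(m) for m in messages]
```

Because both sides update their compare registers only on un-coded flits and in the same order, the decoder can resolve every index the encoder emits.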


Table 4 AICA encoding algorithm with 16-bit.

AICA Encoding Algorithm with 16-bit
Input: Flits Fs from the packing module to AICA
Output: Transmission buffers Tb⊃{UF, EF}, where UF are un-coding flits, EF are encoding flits
1.  Initialize encoded and un-coded registers Er=0 and Ur=0, where Er⊃{Er1, Er2} and Ur⊃{Ur1, Ur2}
2.  Initialize index I=0, where I bits=(3 bits−1 sign bit)
3.  Initialize compare registers comparereg=0, where comparereg capacity=max(size of 2^(I bits))
4.  Initialize encoding control bits Ecb⊃{Ub, Eb}, where Ub are un-coding bits, Eb are encoding bits
5.  While (Fs) do
6.    If (single flit or header flit arrival)
7.      {Tb=Fs} // Fs bypass to Tb
    /* Starts encoding */
8.    Else if (body flit or tail flit arrival)
      /* Code flit2 and store Tb when flit1 can be encoded */
9.      If (Er1 full)
10.       {Ur2=Fs payload} // Stage1: flit reading
          /* Stage2 and Stage3: flit coding and flit store */
11.       If (Ur2 cannot be encoded)
12.         If (comparereg are full)
13.           {Tb=Eb+Er1, next Tb=Ub+Ur2}
14.         Else
15.           {comparereg[I]=Ur2, I++, Tb=Eb+Er1, next Tb=Ub+Ur2}
16.       Else
17.         {Er2=I+D, Tb=Eb+Er1+Er2}, where D=Ur2−comparereg[I], D bits=5
      /* Code flit1 or store Tb */
18.     Else
19.       {Ur1=Fs} // Stage1: flit reading
          /* Stage2 and Stage3: flit coding and flit store */
20.       If (Ur1 cannot be encoded)
21.         If (comparereg are full)
22.           {Tb=Ub+Ur1}
23.         Else
24.           {comparereg[I]=Ur1, I++, Tb=Ub+Ur1}
25.       Else
26.         {Er1=I+D}, where D=Ur1−comparereg[I], D bits=5
27.   Repeat step 8 till the Tb are full or tail flit arrive AICA
    /* Send to router */
28.   If (tail flit arrival)
29.     {Send flits from Tb to router, clear comparereg, return step 1}
30.   Else if (Tb full)
31.     {Send flits from Tb to router, return step 8}
32. End while

4.2. Design of AICA encoder with 16-bit

Table 4 shows the proposed 16-bit AICA encoder algorithm, which is designed to optimize transmission performance in the worst case, where threshold d ≤ 4. This algorithm has three stages: flit reading, coding and storage. The encoder determines the flit type, and sends a single or header flit to the transmission buffer at lines 6 and 7. For other flit types, the algorithm performs encoding at lines 8 to 27. The encoding stage reads a flit from the packing module (stage 1) and compares the current flit (flit1 or flit2) against comparereg to obtain D at lines 10, 11, 16, 19, 20 and 25. If the current flit (flit1) is uncoded, then the encoder records it in comparereg until the buffer is full, adds the encoding sign bit to flit1, and sends flit1 to the transmission buffer at lines 20 to 24. If flit1 is coded, then the encoder records the index I of the comparereg address and the D of the current flit in Er1, and waits for the next flit at lines 25 and 26 (stage 2 and stage 3). When the flit1 encoding is complete at line 9, if the current flit (flit2) is uncoded, then the encoder records it in comparereg until the buffer is full, adds the encoding sign bit to flit2, and sends the Er1 flit and flit2 to the transmission buffer at lines 11 to 15. Otherwise, it records the index I of the comparereg address and the D of the current flit in Er2, then encapsulates Er1 and Er2 into one flit, and sends this to the transmission buffer at lines 16 and 17 (stage 2 and stage 3). The algorithm encodes flits until the transmission buffer is full or the last flit is reached. When the buffer is full, the encoder transmits the buffer data to the router and continues encoding. Otherwise, it transmits the buffer data to the router and clears comparereg at lines 28 to 31.

Fig. 13 illustrates a simplified instructional computer (SIC) program encoded with the 16-bit AICA. This work encodes a block of opcodes from the PE in three stages. In the first stage, the encoder records flits in the un-coded registers (Ur1 or Ur2). In the second stage, the encoder compares the coding range from comparereg, and stores this flit in comparereg (if this is not full) or encoded


Fig. 13. SIC encode with AICA 16-bit.

Table 5
AICA decoding algorithm with 16-bit.

AICA Decoding Algorithm with 16-bit
Input: Flits Fs from the destination router to AICA
Output: Unpacking buffers Ub⊃{UF, DF}, where UF are un-coding flits, DF are decoding flits
1. Initialize decoded registers Dr=0, where Dr⊃{Dr1, Dr2}, Dr1⊃{I1, D1}, Dr2⊃{I2, D2}, I1=Fs[15:14], D1=Fs[12:8], I2=Fs[7:6], D2=Fs[4:0]
2. Initialize index I=0, where I bits=(3 bits−1 sign bit)
3. Initialize decoder compare registers de_comparereg=0, where de_comparereg=max(size of 2^I)
4. Initialize temporary register Tr=0
5. While (Fs) do
6.   If (single flit or header flit arrival)
7.     {Ub=Fs} // Fs bypass to Ub
     /* Starts decoding */
8.   Else if (body flit or tail flit arrival)
9.     If (Fs is encoding flit)
         /* Stage1 and Stage2: recovery and store */
10.      For (j=1 to 2)
11.        If (Drj is positive)
12.          {Tr=de_comparereg[Ij]+Dj, Ub={Fs[MSB to MSB-5], Tr}}
13.        Else
14.          {Tr=de_comparereg[Ij][4:0]+Dj, Ub={Fs[MSB to MSB-5], de_comparereg[Ij][15:5], Tr[4:0]}}
         /* Stage2: store */
15.    Else
16.      If (de_comparereg are full)
17.        {Ub=Fs}
18.      Else
19.        {de_comparereg[I]=Dr, I++, Ub=Fs}
20.  Repeat step 8 till Ub are full or tail flit arrive AICA
     /* Send to unpacking module */
21.  If (tail flit arrival)
22.    {Send flits from Ub to unpacking module, clear de_comparereg, return step 1}
23.  Else if (Ub full)
24.    {Send flits from Ub to unpacking module, return step 8}
25. End while

registers (Er1 or Er2). In the last stage, the encoder sends the flits, which now contain the encoding control bits, from the un-coded or encoded registers to the transmission buffers (Tb).

4.3. Design of AICA decoder with 16-bit

Table 5 shows the 16-bit decoding algorithm. This algorithm has two stages: recovery and storage. If the flit type is single or header, then the flit is sent to the unpacking buffer at lines 6 and 7. Otherwise, the decoding stage starts at lines 8 to 20. The decoder reads the encoding control bits from the flit. If the flit is un-coded, then it is recorded in de_comparereg unless that buffer is full, and sent to the unpacking buffer at lines 15 to 19. Otherwise, the decoder determines the next step according to the sign bit of the first and second data frames. If the sign bit is positive, then the decoder adds the D of the current body flit to the de_comparereg entry at index I and stores the sum in the temporary register (Tr), combines the result into one flit, and sends this flit to the unpacking buffer at lines 11 and 12. Otherwise, the decoder adds D to the low-order bits ([4:0]) of the de_comparereg entry at index I and stores the result in Tr, combines the fields from Fs, Tr and de_comparereg into one flit, and sends this flit to the unpacking buffer at lines 13 and 14.
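The recovery step can be sketched in Python as follows. The field positions follow step 1 of Table 5 (I1=Fs[15:14], D1=Fs[12:8], I2=Fs[7:6], D2=Fs[4:0]); everything else is an assumption made for illustration: the function names are hypothetical, bits 13 and 5 are taken to be the sign bits, a set sign bit is taken to mean that D is subtracted from the reference (as in Case 4 of Table 7), and the flit header bits are left out.

```python
def split_encoded_flit(flit):
    """Split a 16-bit encoded flit body into two (index, sign, delta) fields."""
    fields = []
    for shift in (8, 0):                 # high byte is Dr1, low byte is Dr2
        byte = (flit >> shift) & 0xFF
        index = byte >> 6                # Fs[15:14] or Fs[7:6]
        sign = (byte >> 5) & 1           # Fs[13] or Fs[5] (assumed sign bit)
        delta = byte & 0x1F              # Fs[12:8] or Fs[4:0]
        fields.append((index, sign, delta))
    return fields


def decode_flit(flit, de_compare_reg):
    """Recover two 16-bit payloads from one encoded flit body."""
    payloads = []
    for index, sign, delta in split_encoded_flit(flit):
        reference = de_compare_reg[index]
        value = reference - delta if sign else reference + delta
        payloads.append(value & 0xFFFF)
    return payloads


# The decoder's reference list is filled from un-coded flits (lines 15-19).
de_compare_reg = [0x1800, 0x4C00]
recovered = decode_flit(0x0530, de_compare_reg)
```

Because the decoder fills de_comparereg from the same un-coded flits, in the same order, as the encoder fills comparereg, both sides resolve an index to the same reference value without transmitting the reference again.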


Fig. 14. SIC encode with AICA 64-bit.

The decoding continues until the unpacking buffer is full or the last flit is reached. When the buffer is full, the decoder transmits the buffer data to the unpacking module and continues decoding. When the tail flit arrives, it transmits the buffer data to the unpacking module and clears de_comparereg, at lines 21 to 24 (transmission stage).

4.4. Design of AICA encoder with 64-bit

This subsection describes the 64-bit AICA encoding and decoding method. The encoding task is active when the threshold d > 4. This algorithm has three stages: coding and classification; case status detection; and restructuring and storage. Table 6 shows the 64-bit AICA encoding algorithm. If the current flit is of single or header type, then it is sent to the transmission buffer at lines 11 and 12. Otherwise, the flit number is stored in Fc and the encoding stage starts at lines 13 to 25. If the body or tail flit cannot be encoded, then the encoder records the flits in comparereg until that buffer is full, combines them into one flit, records it in the un-coded registers (Ur), and records the flit counter (Fc) in uncoded_sequences (US) at lines 15 to 19. Otherwise, the encoder encapsulates I and D in one flit, records this flit in the encoded registers (Er), and records Fc in encoded_sequences (ES) at lines 20 and 21 (stage 1). Stage 1 encodes flits until US and ES meet one of the case conditions of Fig. 8, Fig. 9 and Fig. 10 at lines 22 to 25 (stage 2). When the encoder leaves stage 1, it encapsulates the flits in Er into one flit, adds the encoding control bits to the encapsulated flit and the un-coded flit according to the case condition, and stores the result in the transmission buffer at lines 27 to 42 (stage 3). The algorithm encodes continuously until the transmission buffer is full or the last flit is reached. It then sends the data from the transmission buffer to the router, and clears all registers, at lines 43 to 46. The decoder reorders the flits according to the different cases, as described in the next subsection. Fig. 14 shows simplified instructional computer (SIC) code encoded with 64-bit AICA. A block of opcodes from the PE is encoded in three stages. In the first stage, the encoder records the number of flits in Fc, stores the combined flits in comparereg until it is full, and records the combined flit in Er or Ur. In the second stage, the encoder detects the case status. In the final stage (stage 3), the encoder adds the encoding control bits to the flits from Er or Ur, as described in Table 6, and transmits Er or Ur to the transmission buffers (Tb).

4.5. Design of AICA decoder with 64-bit

The 64-bit decoding algorithm is shown in Table 7. This algorithm has two stages: recovery and storage. When the flit type is single or header, the algorithm transmits the flit to the unpacking module at lines 11 and 12. Otherwise, the decoding stage starts at lines 13 to 40. In Case 1, the current body flit is stored in de_comparereg unless that buffer is full, and the flit is then sent to the unpacking module or to Rreg, depending on whether a preceding flit requires reordering, in lines 15 to 25. In Case 2 or Case 3, the flits are reordered with Rreg. The first stage of Case 2 and Case 3 is decoding, in which the D of the current body flit is added to the de_comparereg value at index I. In the second stage, the decoded flits from stage 1 and the un-coded flits (from Rreg) are put in order, in lines 26 to 32. In Case 4, the algorithm adds the D of the current body flits to the de_comparereg value at index I in lines 33 to 39.


Table 6
AICA encoding algorithm with 64-bit.

AICA Encoding Algorithm with 64-bit
Input: Flits Fs from the packing module to AICA
Output: Transmission buffers Tb⊃{UF, EF}, where UF are un-coding flits, EF are encoding flits
1. Initialize encoded registers and sequence Er=0 and ES=0, where Er and ES capacity=4
2. Initialize un-coded registers and sequence Ur=0 and US=0, where Ur and US capacity=3
3. Initialize encoded and un-coded registers index EI=0 and UI=0
4. Initialize index I=0 and flits counter Fc=0, where I bits=8
5. Initialize compare registers comparereg=0, where comparereg capacity=max(size of 2^I bits)
6. The Case⊃{Case 1, Case 2, Case 3, Case 4}, where Case 1 are UF, Case 2⊃{Case 2.1, Case 2.2, …, Case 2.13} are encoding one or two flits, Case 3⊃{Case 3.1, Case 3.2, …, Case 3.7} are encoding three flits and Case 4 are encoding four flits. Each case is determined by US and ES
7. The encoding control bits Ecb⊃{Ec1, Ec2, Ec3, Ec4}, where Ec1 to Ec4 are Case 1 to Case 4 encoding control bits
8. The sub encoding control bits for Case 2, SEcb2⊃{SEcb2Nn, SEcb2Ln}, where SEcb2Ln⊃{Case 2.1, Case 2.2, …, Case 2.4} is leaf nodes case and SEcb2Nn⊃{Case 2.5, Case 2.6, …, Case 2.13} is nonterminal nodes case
9. The sub encoding control bits for Case 3, SEcb3⊃{SEcb3Nn, SEcb3Ln}, where SEcb3Ln⊃{Case 3.1, Case 3.2, …, Case 3.6} is leaf nodes case and SEcb3Nn=Case 3.7 is nonterminal nodes case
10. While (Fs) do
11.   If (single flit or header flit arrival)
12.     {Tb=Fs} // Fs bypass to Tb
      /* Starts encoding */
13.   Else if (body flit or tail flit arrival)
        /* Coding and classification */
14.     Fc=Fc+1
15.     If (Fs cannot be encoded)
16.       If (comparereg are full)
17.         {Ur[UI]=Fs, US[UI]=Fc, UI++}
18.       Else
19.         {Ur[UI]=Fs, US[UI]=Fc, UI++, comparereg[I]=Fs, I++}
20.     Else
21.       {Er[EI]=I+D, ES[EI]=Fc, EI++}, where D=Fs−comparereg[I], D bits=8
        /* Cases status detection */
22.     If (US and ES = Case conditions or tail flit arrival)
23.       {break}
24.     Else
25.       {Repeat step 13}
26. End while
    /* Restructuring and store */
27. While (Case) do
28.   If (Case 1)
29.     {Tb=Ec1+Ur[0], UI=0, Fc=0}
30.   Else if (Case 2)
31.     If (tail flit arrival)
32.       {Tb=Ec2+SEcb2+Er, EI=0, next Tb=Ec1+Ur, UI=0, Fc=0}
33.     Else
34.       {Tb=Ec2+SEcb2Ln+Er, EI=0, next Tb=Ec1+Ur, UI=0, Fc=0}
35.   Else if (Case 3)
36.     If (tail flit arrival)
37.       {Tb=Ec3+SEcb3+Er, EI=0, next Tb=Ec1+Ur, UI=0, Fc=0}
38.     Else
39.       {Tb=Ec3+SEcb3Ln+Er, EI=0, next Tb=Ec1+Ur, UI=0, Fc=0}
40.   Else if (Case 4)
41.     {Tb=Ec4+Er, EI=0, Fc=0}
42.   Repeat step 13 till the Tb are full or tail flit arrive AICA
      /* Send to router */
43.   If (tail flit arrival)
44.     {Send flits from Tb to router, clear comparereg, return step 1}
45.   Else if (Tb full)
46.     {Send flits from Tb to router, return step 13}
47. End while

The algorithm decodes continuously until the last flit is reached, at which point the flits are sent from Rreg to the unpacking module and all registers are cleared, at lines 41 to 44.
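The field layout that the 64-bit decoder relies on (step 1 of Table 7) can be sketched in Python: a fully encoded flit body carries four index/difference pairs of 8 bits each, from I1 in Fs[63:56] down to D4 in Fs[7:0]. The helper names below are hypothetical, and the 2-bit coding type Ct=Fs[65:64], the sub-case bits and the sign handling are deliberately left out.

```python
def pack_fields(pairs):
    """Pack four (index, delta) pairs into one 64-bit flit body."""
    if len(pairs) != 4:
        raise ValueError("a fully encoded 64-bit flit carries four pairs")
    body = 0
    for index, delta in pairs:           # first pair ends up in bits 63:48
        if not (0 <= index <= 0xFF and 0 <= delta <= 0xFF):
            raise ValueError("index and delta must each fit in 8 bits")
        body = (body << 16) | (index << 8) | delta
    return body


def unpack_fields(body):
    """Return [(I1, D1), (I2, D2), (I3, D3), (I4, D4)] from a 64-bit body."""
    pairs = []
    for shift in (48, 32, 16, 0):
        field = (body >> shift) & 0xFFFF
        pairs.append((field >> 8, field & 0xFF))
    return pairs


pairs = [(0, 5), (0, 16), (1, 0), (2, 255)]
body = pack_fields(pairs)
```

This corresponds to Case 4, in which four original flits are represented by a single encoded flit.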

5. Experimental results

This work proposes an adaptive instruction codec architecture to reduce repeated instructions. The design tool was Xilinx ISE 14.7, and the simulation tool was Modelsim. The emulation platform was a Xilinx Virtex-6 FPGA (XC6VLX240T-1ff1136), and the chip measurement was performed using Xilinx Chipscope and Xpower.


Table 7
AICA decoding algorithm with 64-bit.

AICA Decoding Algorithm with 64-bit
Input: Flits Fs from the destination router to AICA
Output: Unpacking buffers Ub⊃{UF, DF}, where UF are un-coding flits, DF are decoding flits
1. Initialize decoded registers Dr=0, where Dr⊃{Dr1, Dr2, Dr3, Dr4}, Dr1⊃{I1, D1}, Dr2⊃{I2, D2}, Dr3⊃{I3, D3}, Dr4⊃{I4, D4}, I1=Fs[63:56], D1=Fs[55:48], I2=Fs[47:40], D2=Fs[39:32], I3=Fs[31:24], D3=Fs[23:16], I4=Fs[15:8], D4=Fs[7:0]
2. Initialize index I=0, where I bits=8
3. Initialize decoder compare registers de_comparereg=0, where de_comparereg=max(size of 2^I)
4. Initialize reorder register and index Rreg=0 and RI=0, where Rreg and RI capacity=7
5. Initialize temporary register Tr=0
6. Initialize reorder flag and times Rf=0 and Rt=0, when Rf=1 to reorder with next Fs
7. The coding types Ct⊃{Ct1, Ct2, Ct3, Ct4}, where Ct=Fs[65:64]
8. The sub coding types SCt2⊃{SCt2.1, SCt2.2, …, SCt2.13} and SCt3⊃{SCt3.1, SCt3.2, …, SCt3.7}, where SCt2 and SCt3=Dr1, Ct2 total flits Ct2tf=number of flits (Ct1+SCt2) and Ct3 total flits Ct3tf=number of flits (Ct1+SCt3)
9. Initialize insertion sequence registers ISR by SCt2 and SCt3, where capacity=ISR[20][7], which stored each sub-case insertion sequence
10. While (Fs) do
11.   If (single flit or header flit arrival)
12.     {Ub=Fs} // Fs bypass to Ub
      /* Stage 1 and Stage 2: Recovery and store */
13.   Else if (body flit or tail flit arrival)
14.     If (Fs is encoding flit)
15.       If (Ct1)
16.         If (de_comparereg not full)
17.           {de_comparereg[I]=Fs, I++}
18.         If (Rf)
19.           {Rreg[RI[Rt]]=Fs, Rt−−}
20.           If (Rt=0)
21.             {Rf=0, Ub=Rreg}
22.           Else
23.             {Repeat step 13}
24.         Else
25.           {Ub=Fs}
26.       Else if (Ct2 or Ct3)
27.         {RI=ISR[SCt2 or SCt3], Rt=Ct2tf or Ct3tf, Rf=1}
28.         For j=2 to number of flits with SCt2 or SCt3
29.           If (Drj is positive)
30.             {Tr=de_comparereg[Ij]+Dj, Rreg[RI[Rt]]=Tr, Rt−−}
31.           Else
32.             {Tr=de_comparereg[Ij]+Dj, Rreg[RI[Rt]]={de_comparereg[Ij][63:8], Tr[7:0]}, Rt−−}
33.       Else if (Ct4)
34.         For j=1 to 4
35.           If (Drj is positive)
36.             {Tr=de_comparereg[Ij]+Dj, Rreg[j]={de_comparereg[Ij][63:16], Tr}}
37.           Else
38.             {Tr=de_comparereg[Ij]−Dj, Rreg[j]={de_comparereg[Ij][63:16], Tr}}
39.         {Ub=Rreg}
40.   Repeat step 13 till Ub are full or tail flit arrive AICA
41.   If (tail flit arrival)
42.     {Send flits from Ub to unpacking module, clear de_comparereg, return step 1}
43.   Else if (Ub full)
44.     {Send flits from Ub to unpacking module, return step 13}
45. End while

The power consumption of the network-on-chip (NoC) is evaluated by Eq. (11). The term Pav is the total average power consumption of the NoC, which includes the power consumed by the router, the network interface (NI) and the transmission channel. The parameter β is the number of flits transmitted by the adaptive instruction codec architecture (AICA) through the router and channel, which was reduced by up to 75%. Prt and Pli are the total average power consumption when the flits pass each router and each transmission line, respectively. The terms Pcoder and Pdecoder refer to the power consumed when encoding and decoding, respectively. These terms can be disregarded, because almost all flit transmission takes place in the routers (β Prt) and the channels ((β − 1) Pli). Throughput is analyzed using the equation from [29]. A smaller β leads to improved power consumption and throughput.

$$P_{av} = \beta P_{rt} + (\beta - 1) P_{li} + P_{coder} + P_{decoder} \tag{11}$$
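Eq. (11) can be evaluated directly. The Python sketch below uses made-up per-router and per-link power values purely to show how the total scales with β; none of these numbers come from the measurements in this work, and the function name `average_power` is hypothetical.

```python
def average_power(beta, p_rt, p_li, p_coder=0.0, p_decoder=0.0):
    """Evaluate Eq. (11): Pav = beta*Prt + (beta - 1)*Pli + Pcoder + Pdecoder."""
    return beta * p_rt + (beta - 1) * p_li + p_coder + p_decoder


# Hypothetical values in mW, chosen only for illustration.
P_RT, P_LI = 2.0, 0.5
baseline = average_power(beta=8, p_rt=P_RT, p_li=P_LI)
reduced = average_power(beta=2, p_rt=P_RT, p_li=P_LI,
                        p_coder=0.25, p_decoder=0.25)   # beta cut by 75%
saving = 1 - reduced / baseline
```

With these illustrative numbers, cutting β from 8 to 2 lowers Pav from 19.5 to 5.0 even after a small codec overhead is added, which is why the β-dependent terms dominate the model.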

The experimental setup is described as follows. Fig. 15 and Fig. 16 illustrate the 16-bit and 64-bit AICA codecs, respectively. Each flit carries a simulated Reduced Instruction Set Computing (RISC) instruction to be transferred to the other router. DataPort connects the CPU and the encoder, as illustrated in Fig. 15 (a) and Fig. 16 (a): the CPU sends the packed data through DataPort to the NI, where the encoder reads it from the packing module. DataPort_1 connects the encoder and the router. The encoder uses the 64-bit or 16-bit encoding algorithm to reduce redundancy in the instruction stream and repackages the coded flits to DataPort_1, as


Fig. 15. 16-bit AICA measurement scheme.

Fig. 16. 64-bit AICA measurement scheme.

Table 8
Power consumption and throughput of the 16-bit and 64-bit AICA.

                   Power consumption (mW)        Throughput (M Flits/s)
Number of flits    AICA 16-bit   AICA 64-bit     AICA 16-bit   AICA 64-bit
100                38            52.1            305           360
200                39.2          54.1            316           371
300                40.7          56              325           383
400                41            56.6            331           389
500                41.2          57.3            343           396
600                41.8          57.6            350           401
700                42.2          57.8            355           410
800                43.9          58              358           414
900                44.7          59.1            359           420
1000               44.9          61              359           422
2000               46            62.1            369           429
3000               46.7          63              370           432
4000               47            63.2            371           432
5000               47.2          63.3            371           434
Average            43.1          58.6            348.7         406.6

shown in Fig. 15 (b) and Fig. 16 (b). DataPort_2 connects the router to the decoder, which reads flits from the router. DataPort_3 connects the decoder to the CPU. The decoder recovers the coded flits and restores them to the original instructions, as shown in Fig. 15 (c) and Fig. 16 (c). The content of DataPort_3 is the same as that of DataPort, which verifies the correctness of the transmission. Table 8 shows the analysis of the average power and throughput for 100 to 5000 flits. Table 9 shows the comparison of throughput. The proposed 64-bit and 16-bit AICA methods improved throughput by up to 46.3% and 25.4%, respectively. Table 10 shows the comparison of power consumption. The proposed 64-bit and 16-bit AICA methods reduced power consumption by up to 29.3% and 48.1%, respectively. Table 9 and Table 10 also compare the features and coding of the proposed AICA algorithm with other technologies. The comparison indicates that AICA targets instruction coding and is simpler than the method of Tassori et al. [13], because it does not require calculating a probability mass function (PMF) or building Huffman trees. The method of Mishra et al. [30] improves performance by classifying networks as bandwidth-sensitive or latency-sensitive, and adapting the method of transferring flits according to the network type. The method of Tran et al. [17] improves performance by using shared queues to maximize buffer utilization. The proposed AICA improves bandwidth utilization and router throughput by increasing flit capacity. The improved throughput reduces the power consumption according to Eq. (1), since completing the flit transfer early allows the channel to sleep early.
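As a consistency check on Table 8, the "Average" row can be recomputed from the fourteen per-size rows. The short Python sketch below does this; the dictionary keys are labels chosen here for illustration, and the 0.1 tolerance is an assumption that allows for rounding or truncation in the printed averages.

```python
# Rows of Table 8 for 100, 200, ..., 900, 1000, 2000, ..., 5000 flits.
table8 = {
    "power_16bit_mW": [38, 39.2, 40.7, 41, 41.2, 41.8, 42.2, 43.9,
                       44.7, 44.9, 46, 46.7, 47, 47.2],
    "power_64bit_mW": [52.1, 54.1, 56, 56.6, 57.3, 57.6, 57.8, 58,
                       59.1, 61, 62.1, 63, 63.2, 63.3],
    "throughput_16bit": [305, 316, 325, 331, 343, 350, 355, 358,
                         359, 359, 369, 370, 371, 371],
    "throughput_64bit": [360, 371, 383, 389, 396, 401, 410, 414,
                         420, 422, 429, 432, 432, 434],
}
# "Average" row as printed in Table 8.
reported = {"power_16bit_mW": 43.1, "power_64bit_mW": 58.6,
            "throughput_16bit": 348.7, "throughput_64bit": 406.6}

computed = {name: sum(values) / len(values)
            for name, values in table8.items()}
max_gap = max(abs(computed[name] - reported[name]) for name in table8)
```

The recomputed means agree with the printed averages to within 0.1 in every column.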

Table 9
Throughput comparison.

Constraints \ Methods     AICA 16-bit   AICA 64-bit   Tassori [13]     Mishra [30]         Tran [17]
Technologies              AICA          AICA          Huffman coding   Heterogeneous NoC   RoShaQ
Throughput (M Flits/s)    348.7         406.6         310              278                 335
16-bit improved (%)       –             –             12.5%            25.4%               4.1%
64-bit improved (%)       –             –             31.2%            46.3%               21.4%
Average improved (%)      –             –             23.5% (single value for the three compared methods)

Table 10
Power consumption comparison.

Constraints \ Methods     AICA 16-bit   AICA 64-bit   Tassori [13]     Mishra [30]         Tran [17]
Technologies              AICA          AICA          Huffman coding   Heterogeneous NoC   RoShaQ
Power Consumption (mW)    43.1          58.7          83               80.7                60
16-bit reduced (%)        –             –             48.1%            46.6%               28.2%
64-bit reduced (%)        –             –             29.3%            27.3%               2.2%
Average reduced (%)       –             –             30.3% (single value for the three compared methods)
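The percentages in Table 9 and Table 10 are consistent with simple relative differences against each compared method. The Python sketch below recomputes them; the two formulas are inferred here from the tabulated values rather than stated in the text, and the function names are hypothetical.

```python
def improvement(ours, other):
    """Throughput gain of `ours` over `other`, in percent."""
    return (ours / other - 1) * 100


def reduction(ours, other):
    """Power saving of `ours` relative to `other`, in percent."""
    return (1 - ours / other) * 100


# Values copied from Table 9 (M Flits/s) and Table 10 (mW).
throughput = {"Tassori [13]": 310, "Mishra [30]": 278, "Tran [17]": 335}
power = {"Tassori [13]": 83, "Mishra [30]": 80.7, "Tran [17]": 60}
aica_throughput = {"16-bit": 348.7, "64-bit": 406.6}
aica_power = {"16-bit": 43.1, "64-bit": 58.7}

# Order: for each compared method, the 16-bit figure and then the 64-bit one.
gains = [improvement(a, t) for t in throughput.values()
         for a in aica_throughput.values()]
savings = [reduction(a, p) for p in power.values()
           for a in aica_power.values()]
best_gain, mean_gain = max(gains), sum(gains) / len(gains)
best_saving, mean_saving = max(savings), sum(savings) / len(savings)
```

Rounded to one decimal place, the six gains and six savings reproduce the individual table entries, their maxima give the headline figures of 46.3% and 48.1%, and their means give the 23.5% and 30.3% averages.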

6. Conclusion

This work proposes a novel solution for instruction codec transmission, aiming for high throughput and low power consumption in the network-on-chip (NoC) architecture. By reducing redundant flits, the proposed method increases the proportion of encoded flits, so that the transmission channel can accommodate more flits than with other network interfaces (NIs) and the router improves bandwidth utilization and throughput. The proposed Adaptive Instruction Codec Architecture (AICA) reduces power consumption through bit and transition reduction in routers and transmission channels; because the router completes flit transfers early, the channel can go to sleep early. The proposed AICA reduces transmission redundancy, and supports process elements (PEs) with a 16-bit or 64-bit core CPU. The AICA encoder and decoder have both 16-bit and 64-bit designs, and include packing/un-packing modules between the NI and the router. The router is connected to other routers for transmission of data or instructions. The encoder and decoder algorithms of AICA are proposed and implemented on a Xilinx Virtex-6 FPGA device. In the experimental environment, the design tool was Xilinx ISE 14.7, and the simulation tool was Modelsim 10.2. The emulation platform was a Xilinx Virtex-6 XC6VLX240T-1ff1136. Function verification and chip power measurement used the Xilinx Chipscope and Xpower tools, respectively. The average power and throughput were analyzed for 100 to 5000 flits. Experimental results indicate that the proposed architecture improves throughput by up to 46.3%, and reduces power consumption by up to 48.1%.

Acknowledgments

The authors would like to thank the Ministry of Science and Technology of the Republic of China, Taiwan, for supporting this research under Contract No. MOST 103-2221-E-027-125.

References

[1] Nicopoulos C, Srinivasan S, Yanamandra A, Park D, Narayanan V, Das C, Irwin M. On the effects of process variation in network-on-chip architectures. IEEE Trans Depend Secure Comput 2010;7(3):240–54.
[2] Marculescu R, Ogras UY, Peh LS, Jerger NE, Hoskote Y. Outstanding research problems in NoC design: system, microarchitecture, and circuit perspectives. IEEE Trans Comput 2009;28(1):3–21.
[3] Liu Y, Tan Y, Hang J. Key problems on network-on-chip. In: Proc. of 10th IEEE international conference on computer-aided design and computer graphics; 2007. p. 549–52.
[4] da Rosa TR, Larrea V, Calazans N, Moraes FG. Power consumption reduction in MPSoCs through DFS. In: Proc. of 2012 25th symposium on integrated circuits and systems design (SBCCI); 2012. p. 1–6.
[5] Jafarzadeh N, Palesi M, Khademzadeh A, Afzali-Kusha A. Data encoding techniques for reducing energy consumption in network-on-chip. IEEE Trans Very Large Scale Integ (VLSI) Syst 2014;22(3):675–85.
[6] Swaminathan K, Lakshminarayanan G, Lang F, Fahmi M, Ko S-B. Design of a low power network interface for network on chip. In: Proc. of 2013 26th annual IEEE Canadian conference on electrical and computer engineering (CCECE); 2013. p. 1–4.
[7] Gu H, Xu J, Zhang W. A low-power fat tree-based optical network-on-chip for multiprocessor system-on-chip. In: Proc. of design, automation & test in Europe conference & exhibition (DATE); 2009. p. 3–8.
[8] Nicopoulos C, Srinivasan S, Yanamandra A, Park D, Narayanan V, Das C, Irwin M. On the effects of process variation in network-on-chip architectures. IEEE Trans Depend Secure Comput 2010;7(3):240–54.
[9] Lee T-Y, Huang C-H. Design of smart power-saving architecture for network on chip. VLSI Des 2014:10.
[10] Choudhary N, Gaur MS, Laxmi V, Singh V. GA based congestion aware topology generation for application specific NoC. In: Proc. of 2011 sixth IEEE international symposium on electronic design, test and application (DELTA); 2011. p. 93–8.
[11] Fattah M, Manian A, Rahimi A, Mohammadi S. A high throughput low power FIFO used for GALS NoC buffers. In: Proc. of IEEE annual symposium on VLSI (ISVLSI); 2010. p. 333–8.


[12] Fu W, Shao J, Xie B, Chen T, Liu L. Design of a high-throughput NoC router with neighbor flow regulation. In: Proc. of 2012 IEEE 14th international conference on high performance computing and communication & 2012 IEEE 9th international conference on embedded software and systems (HPCC-ICESS); 2012. p. 493–500.
[13] Tassori M, Tassori M, Mossavi M. Adaptive data compression in NoC architectures for power optimization. J Int Rev Comput Softw 2010;5(5):540–7.
[14] Wu C, Chai S, Li Y-B, Yang Z-M. Design of a dual-switching mode NOC router microarchitecture. In: Proc. of 2010 international conference on electrical and control engineering (ICECE); 2010. p. 2733–6.
[15] Lusala AK, Legat J. A hybrid NoC combining SDM-TDM based circuit-switching with packet-switching for real-time applications. In: Proc. of 2012 IEEE 10th international on new circuits and systems conference (NEWCAS); 2012. p. 17–20.
[16] Onizawa N, Matsumoto A, Funazaki T, Hanyu T. High-throughput compact delay-insensitive asynchronous NoC router. IEEE Trans Comput 2014;63(3):637–49.
[17] Tran AT, Baas BM. Achieving high-performance on-chip networks with shared-buffer routers. IEEE Trans Circuits Syst Soc 2013;22(6):1063–8210.
[18] Jiang SY, Luo G, Liu Y, Jiang SS, Li XT. Fault-tolerant routing algorithm simulation and hardware verification of NoC. IEEE Trans Appl Supercond 2014;24(5):1,5.
[19] Collet JH. A brief overview of the challenges of the multicore roadmap. In: IEEE Trans. mixed design of integrated circuits & systems (MIXDES), 2014 proceedings of the 21st international conference; 2014. p. 22–9.
[20] Liu H-N, Huang Y-J, Li J-F. A built-in self-repair method for RAMs in mesh-based NoCs. In: Proc. of international symposium on VLSI design, automation and test; 2009. p. 259–62.
[21] Kariniemi H, Nurmi J. Fault-tolerant 2-D mesh network-on-chip for multiprocessor systems-on-chip. In: Proc. of international conference on design and diagnostics of electronic circuits and systems; 2006. p. 184–9.
[22] Bakhouya M. Towards a bio-inspired architecture for autonomic network-on-chip. In: Proc. of international conference on high performance computing and simulation (HPCS); 2010. p. 491–7.
[23] Oxman G, Weiss S. Simple method to reduce congestion in bufferless network-on-chip. Electron Lett 2014:581–3.
[24] Geetha K, Ammasai Gounden N. Compressed instruction set coding (CISC) for performance optimization of hand held devices. In: Proc. of IEEE international conference on advanced computing and communications; 2008. p. 241–7.
[25] Kadayif I, Kandemir MT. Instruction compression and encoding for low-power systems. In: Proc. of 15th IEEE international conference on ASIC and SOC; 2002. p. 301–5.
[26] Tsai WC, Lan YC, Hu YH, Chen SJ. Networks on chips: structure and design methodologies. J Elect Comput Eng 2012;2012:1–15.
[27] Mak T, Cheung PYK, Lam K-P, Luk W. Adaptive routing in network-on-chips using a dynamic-programming network. IEEE Trans Ind Electron 2011;58(8):3701–16.
[28] Matos D, Costa M, Carro L, Susin A. Network interface to synchronize multiple packets on NoC-based systems-on-chip. In: Proc. of 18th IEEE/IFIP VLSI system on chip; 2010. p. 31–6.
[29] Pande PP, Grecu C, Jones M, Ivanov A, Saleh R. Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Trans Comput 2005;54(8):1025–40.
[30] Mishra AK, Mutlu O, Das CR. A heterogeneous multiple network-on-chip design: an application-aware approach. In: Proc. of IEEE design automation conference; 2013. p. 1–10.


Trong-Yen Lee received his Ph.D. degree in Electrical Engineering from National Taiwan University, Taipei, Taiwan, ROC in 2001. Since 2002, he has been a member of the faculty in the Department of Electronic Engineering, National Taipei University of Technology, where he is currently a professor. His research interests include hardware-software co-design of embedded systems, FPGA systems design, and VLSI design.

Chi-Han Huang received his M.S. degree in Electronic Engineering from National Taipei University of Technology, Taipei, Taiwan, ROC in 2011. He is currently working toward the Ph.D. degree in the Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan. His current research interests include VLSI design and Network-on-Chip design.

Min-Jea Liu received his M.S. degree in the Graduate Institute of Communication Engineering from Tatung University, Taipei, Taiwan, ROC in 2010. He is currently working toward the Ph.D. degree in the Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan, ROC. His current research interests include VLSI design and multiplier design.

Jhen-Syuan Chen received his M.S. degree in the Graduate Institute of Computer and Communication Engineering from National Taipei University of Technology, Taipei, Taiwan, ROC in 2014. His current research interests include VLSI design and Network-on-Chip design.
