High-throughput bit processor for cryptography, error correction, and error detection


Accepted Manuscript

PII: S0141-9331(17)30169-2
DOI: 10.1016/j.micpro.2018.06.013
Reference: MICPRO 2713

To appear in: Microprocessors and Microsystems

Received date: 19 March 2017
Revised date: 26 February 2018
Accepted date: 14 June 2018

Please cite this article as: Yuanhong Huo, Dake Liu, High-throughput Bit Processor for Cryptography, Error Correction, and Error Detection, Microprocessors and Microsystems (2018), doi: 10.1016/j.micpro.2018.06.013

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Yuanhong Huo, Dake Liu

Beijing Institute of Technology, Beijing 100081, China


High-throughput Bit Processor for Cryptography, Error Correction, and Error Detection

ABSTRACT


The product lifetime (time-in-market) of a high-end embedded SoC (System-on-Chip) can be rather short due to possible design changes, which lead to highly expensive SoC redesigns. Most SoC redesigns are induced by required function changes in non-programmable ASIC modules, and many of these modules implement bit-wise algorithms. It is thus necessary to offer programmable/flexible VLSI designs for bit-wise algorithms. In this paper, we propose a programmable ASIP design for four types of bit-wise algorithms: block ciphers, stream ciphers, Reed-Solomon (RS) codes, and Cyclic Redundancy Check (CRC). We achieve this by identifying the algorithmic similarities and the optimal parallel degree (128-bit) among the four types of bit-wise algorithms. The flexibility of our design enlarges the range of supported applications and extends the time-in-market of a SoC. Moreover, our design achieves ASIC-like performance, such as 25.6 Gb/s for AES encryption, 17.6 Gb/s for RS(255,239) decoding, and 281.6 Gb/s for CRC calculation, within a 0.19 mm² (28 nm) silicon area. Finally, we show that the performance of our design is sufficient for high-speed communication protocols like IEEE 802.11ad when running real-time AES, RS, and CRC simultaneously.


Keywords: Application Specific Instruction Set Processor; Software Defined Radio; Cryptographic Processor; VLSI (Very Large-Scale Integration)


© 2017 Elsevier Ltd. All rights reserved.

1. Introduction


In wireless communication systems, SoC product lifetime can be rather short because of possible design changes [1], leading to a redesign of the SoC. When advanced silicon technology is adopted and the SoC scale is large, the NRE (Non-Recurring Engineering) cost of such a redesign can reach tens of millions of dollars [2]. Most SoC redesigns are induced by the need for function changes in the non-programmable ASIC modules of a SoC. We recognized that a large share of these non-programmable ASIC modules implement various bit-wise algorithms for cryptography, error correction, and error detection. It is thus necessary to propose flexible/programmable VLSI designs for the bit-wise algorithms, so that the SoC can adapt to the changes, extending the SoC product lifetime. Generally, VLSI circuits for the bit-wise algorithms need to provide high performance and low computing latency under limited chip area and power consumption, because the circuits usually need to support the high data rates of communication networks [3][4]. Besides, due to stringent area constraints in embedded systems and high manufacturing costs, re-usable designs and flexible intellectual property (IP) cores are continuously sought [5].

* Corresponding author. E-mail addresses: [email protected] (Y. Huo), [email protected] (D. Liu).

In this case, our motivation is to

offer a high-performance, flexible/programmable solution that supports multiple bit-wise algorithms in one IP. However, it is highly challenging to meet the mentioned stringent, and often conflicting, design requirements (e.g., high flexibility, low computing latency, multi-standard compatibility, and small area) at the same time. ASIC (Application Specific Integrated Circuit) solutions [6][7] are currently adopted to achieve high throughput at low power consumption, meeting the low computing latency requirements of bit-wise algorithms. However, ASIC designs have low flexibility: any change to an ASIC design may require a new tapeout of the SoC at a cost of tens of millions of dollars [2]. To extend the product lifetime of a SoC, FPGAs (Field Programmable Gate Arrays) [8] can be adopted to implement bit-wise algorithms and can keep the product lifetime longer than an ASIC can. However, the cost and power consumption of FPGAs are too high for volume products. In some cases, the GPP (General Purpose Processor) [9], the most flexible solution, is utilized; a GPP can support all bit-wise algorithms with high flexibility, but it suffers from high energy dissipation and poor performance. GPUs (Graphics Processing Units) and DSPs (Digital Signal Processors)


[10] are also adopted to implement bit-wise algorithms. Software implementations on DSPs/GPUs are flexible, but their power efficiency is low [10].

Table 1 Selected Typical Algorithms Implemented.

Algorithm             Polynomial   Typical Applications
CRC-8-CCITT           0x07         I.432.1, ATM HEC, ISDN HEC and cell delineation
CRC-12-CDMA2000       0xF13        mobile networks
CRC-16-IBM            0x8005       Bisync, Modbus, USB, ANSI X3.28, SIA DC-07
CRC-16-CCITT          0x1021       X.25, V.41, HDLC FCS, XMODEM, Bluetooth
CRC-24                0x5D6DCB     FlexRay
CRC-32                0x04C11DB7   HDLC, MPEG-2, Ethernet, ITU-T V.42, Serial ATA
CRC-32C (Castagnoli)  0x1EDC6F41   iSCSI, SCTP, SSE4.2, Btrfs
RS(204,188)           0x11D        DVB
RS(224,208)           0x11D        IEEE 802.11ad, IEEE 802.11e
RS(224,216)           0x11D        WirelessHD, ECMA TC48
RS(255,223)           0x187        CCSDS
RS(255,239)           0x11D        ECMA 387, ETSI-BRAN, IEEE 802.16d, ITU-T G.975, IEEE 802.15.3c, G.709, ITU-T G.984.3
SNOW 3G               NA           3GPP LTE
ZUC                   NA           3GPP LTE
AES                   NA           3GPP LTE, WiGig, TLS/SSL
ARIA                  NA           TLS/SSL
Camellia              NA           IPSec, TLS/SSL
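All the CRC variants in Table 1 realize the same polynomial-division recurrence, differing only in width and generator polynomial. A scalar, unreflected sketch in C (an illustration, not the paper's parallel hardware; the function name and parameter set are ours):

```c
#include <stdint.h>
#include <stddef.h>

/* Generic unreflected CRC: 'poly' is a generator polynomial from Table 1
 * without the leading x^width term, 'init' the initial register value. */
static uint32_t crc_bitwise(const uint8_t *data, size_t len,
                            int width, uint32_t poly, uint32_t init) {
    uint32_t topbit = 1u << (width - 1);
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : (1u << width) - 1;
    uint32_t crc = init;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << (width - 8);   /* feed in next byte */
        for (int b = 0; b < 8; b++)                /* divide bit by bit */
            crc = (crc & topbit) ? (crc << 1) ^ poly : crc << 1;
        crc &= mask;
    }
    return crc;
}
```

For example, width 16 with polynomial 0x1021 and initial value 0xFFFF reproduces the common unreflected CRC-16-CCITT configuration; reflected variants such as CRC-32C additionally reverse the bit order on input and output.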

To offer more flexibility without sacrificing performance and power efficiency, specialized processors [11][12] have attracted attention. Previous ASIP designs for bit-wise algorithms have successfully fulfilled the requirements of smart-card applications [11] and high-performance cryptographic processing [12]. However, previous ASIP designs for bit-wise algorithms either aim at a small predefined set of algorithms (e.g., CRC [5] or cryptography algorithms), providing limited flexibility and/or throughput, or focus primarily on high performance and flexibility, resulting in an unacceptable chip area for mobile devices.

So far, flexible (programmable) designs including multi-standard FEC decoding [18][19], interleaving [20], and symbol processing [21][22] have been presented for SDR (Software Defined Radio). In 2015, Liu [23] reviewed a baseband ASIP for programmable SDR. The main missing part of that baseband ASIP [23] is a flexible/programmable solution for bit-wise algorithms such as CRC and RS. BP-ASIP can act as the bit processing core mentioned in [23] and support high-speed cryptography at the same time.

In this paper, we propose a high-performance bit processor (BP-ASIP) for four types of bit-wise algorithms: block ciphers (AES [13], ARIA [14], and Camellia [15]), stream ciphers (SNOW 3G [16] and ZUC [17]), Reed-Solomon codes, and CRC. Table 1 lists selected typical algorithms supported by BP-ASIP, together with their generator polynomials and typical applications. Generally speaking, the major contributions of this paper are threefold. First, we make a thorough investigation of the bit-wise algorithms adopted in current protocols and discover that there are similarities among the mentioned four types of bit-wise algorithms. We then find out the optimal parallel degree among AES, RS (based on GF(2^8)), and CRC; to the best of our knowledge, we are the first to research the optimal parallel degree among AES, RS, and CRC. Second, we propose a high-performance ASIP design (with the optimal parallel degree) for the four types of bit-wise algorithms. Our configurable table-based design offers high flexibility via configuring the contents of lookup tables, and its programmability accommodates algorithm changes via software programming. At the same time, our design achieves ASIC-like throughput, such as 25.6 Gb/s for AES encryption, 70.4 Gb/s for SNOW 3G, 17.6 Gb/s for RS(255,239) decoding, and 281.6 Gb/s for CRC calculation, occupying 0.19 mm² in 28 nm CMOS. Third, we show that our design can run multiple algorithms/standards simultaneously with low scheduling overhead via a simple modification. We further show that our design (with this simple modification) can provide sufficient throughput for IEEE 802.11ad [4] when running real-time AES, CRC, and RS decoding simultaneously.

The rest of this paper is organized as follows. Section 2 explains the design of BP-ASIP in detail. Section 3 illustrates the top-level architecture, the datapath, the instruction set, the memory subsystem, and the pipeline of BP-ASIP. Section 4 illustrates the area and power consumption of BP-ASIP. Section 5 discusses the performance of BP-ASIP. Section 6 evaluates BP-ASIP. In Section 7, concluding remarks are provided.

2. Design of BP-ASIP

In this section, we design BP-ASIP according to application requirements. Firstly, we specify the algorithm scope of the design. Secondly, we find out the optimal parallel degree among the targeted algorithms and specify the throughput of the targeted algorithms. Thirdly, we design and fine-tune BP-ASIP for the specified algorithm scope with the optimal parallel degree. Finally, we explain the method of hardware sharing among the targeted algorithms.

2.1. Analysis of Algorithm Similarities and Specification of Algorithm Scope

To support multiple bit-wise algorithms in one IP, we first analyzed the bit-wise algorithms adopted in current protocols and discovered that there are similarities among cryptography, CRC, and RS. For example, block ciphers, stream ciphers, RS, and CRC can all be implemented via a table-based architecture [13][24][25]. If we implement these similar functions with common configurable functional blocks, the flexibility can be greatly increased without sacrificing performance, and the total hardware cost can be greatly decreased via hardware sharing.

Table 2 Bit-wise Algorithms of Selected Typical Protocols.

Protocol            Algorithms
IEEE Std 802.11n    AES, CRC-8, CRC-32, etc.
IEEE Std 802.11ac   AES, CRC-8, etc.
IEEE Std 802.11ad   AES, RS(224,208), CRC-16-CCITT, CRC-8, etc.
IEEE Std 802.15.3c  RS(255,239), CRC-16, CRC-8, etc.
IEEE Std 802.16.1   AES, CRC-16-CCITT, CRC-16-ANSI, CRC-5, etc.
IEEE Std 802.16     AES, RS(255,255-R), CRC-32, CRC-24, CRC-16-CCITT, CRC-8, etc.
IEEE Std 802.20     AES, CRC-24, CRC-16, CRC-6, etc.
4G LTE              AES, SNOW 3G, ZUC, CRC-24A, CRC-24B, CRC-16, CRC-8, etc.
DVB-T2              CRC-32, CRC-8, etc.
DVB                 RS(204,188), CRC-8, etc.
CCSDS               AES, RS(255,223), etc.
IPSec               AES, Camellia, etc.
TLS/SSL             AES, Camellia, ARIA, etc.

Then, we specify the algorithm scope of this design according to application requirements. As shown in Table 2, CRC algorithms with generator polynomials of no more than 32 bits and RS algorithms based on GF(2^8) arithmetic are widely adopted in embedded devices. The stream ciphers ZUC [17] and SNOW 3G [16] are adopted in current wireless communication networks such as 4G LTE. The block cipher AES [13] is adopted in almost all security protocols. These algorithms should, in general, be implemented with high speed [4][6][26] and high flexibility [5][27]; they are therefore accelerated by BP-ASIP. Two further block ciphers, ARIA [14] and Camellia [15], adopted in TLS/SSL like AES, are also accelerated for their high security and typicality [28].

2.2. Finding Out the Optimal Parallel Degree

To propose a processor for the specified algorithm scope, the bit-width (parallel degree) should be specified first. This paper obtains the bit-width of BP-ASIP from the algorithms CRC-32, AES-128-CBC, and RS(255,239) decoding with no errors/erasures, because CRC-32, AES-128-CBC [6], and RS(255,239) decoding with no errors/erasures are typical cases of CRC, block and stream ciphers, and RS, respectively. We analyze the normalized throughput (Fig. 1(a)) and area (Fig. 1(b)) of VLSI designs for the three algorithms at bit-widths of 8, 16, 32, 64, 128, and 256. As shown in Fig. 1(b), the area increases linearly with the bit-width. The throughput of CRC calculation also increases linearly with the bit-width, as shown in Fig. 1(a). When one payload is processed, the throughput of AES-128-CBC and RS(255,239) decoding increases linearly with the bit-width as long as the bit-width is no more than 128 bits; beyond 128 bits, their throughput remains unchanged, as shown in Fig. 1(a). This is because AES-128-CBC and RS(255,239) decoding process 128 bits of data in the innermost loop each time and the procedures are iterative. We thus conclude that 128 bits is the upper limit on the useful parallel degree for AES-128-CBC and RS(255,239) decoding.

Fig. 1. (a) Normalized Throughput versus Bit Width. (b) Normalized Area versus Bit Width.

On the other hand, according to the application requirements, the throughput of the algorithms targeted in this paper must be high enough. Advanced networks such as fifth generation (5G) mobile networks adopt CRC, and their peak data rate is likely to be tens of Gb/s [3]. Considering the stringent latency requirements of 5G, the higher the CRC throughput, the shorter the CRC computing time, and the more time can be spared for other computing-intensive tasks. We thus specify the CRC throughput of this design as > 100 Gb/s to fulfill the required high throughput and low computing latency. To fulfill practical requirements, the throughput of AES is specified as 10 Gb/s in all encryption modes, including the cipher block chaining (CBC) mode. RS(224,208), a shortened version of RS(255,239), is utilized for the Low-Power Single Carrier PHY (LPSC PHY) of IEEE 802.11ad [4], whose raw bit rate ranges from 625.6 Mb/s to 2503 Mb/s.

To fulfill the requirement of IEEE 802.11ad, we specify the throughput of RS(224,208) as > 2.6 Gb/s. Through our research, we conclude that a bit-width of 128 bits is sufficient for the specified throughput at an anticipated clock frequency (about 1.0 GHz). Besides, compared with adopting a smaller bit-width (e.g., 64 bits) at a higher clock frequency (e.g., about 2.0 GHz) to achieve the same throughput, adopting a 128-bit bit-width consumes fewer clock cycles when processing a payload, so that the voltage and frequency can be scaled down, reducing the power and energy dissipation in a given time period. The bit-width of BP-ASIP is thus finally designed as 128 bits. When a higher throughput is required, more than one BP-ASIP can be adopted to process two or more payloads in parallel.

2.3. Designing and Fine-tuning BP-ASIP With the Specified Parallel Degree

AES is implemented first among all the algorithms targeted in this paper because AES is the most complex algorithm in the targeted scope, and the most complex algorithm of a scope greatly impacts the design of an ASIP for that scope. The metrics of complexity include the size and type of operands, the number of different operators, and the complexity of the necessary operations. To achieve the specified throughput for AES, we propose a datapath that accelerates the intermediate round of AES encryption [13], implementing the four steps of the intermediate round: SubBytes, ShiftRows, MixColumns, and AddRoundKey. After implementing AES, we obtain a 128-bit preliminary datapath.

To achieve the specified CRC throughput (> 100 Gb/s), we have proposed a high-performance table-based method [24] for parallel CRC calculation. The LUTs (lookup tables) utilized for the parallel CRC calculation are implemented with SRAMs, so they can be configured and reused for the RS and cryptography algorithms targeted in this paper. The configurable LUTs also ensure that our design can support new algorithms within the scope of the design. We thus adopt the table-based method [24] to implement CRC on BP-ASIP (to be discussed in Section 3.2). After implementing AES and CRC, we obtain a 128-bit table-based datapath.

RS algorithms are based on Galois field arithmetic. To support multiple RS algorithms through software programming, this work focuses on the acceleration of Galois field arithmetic (GF(2^8)). Taking Galois field multiplication as an example, we first parallelize the traditional table-based Galois field multiplication [25] to a parallel degree of 128 bits. Then, we specify the parallel Galois field multiplication as an acceleration instruction of BP-ASIP. Afterwards, we map the function of the parallel Galois field multiplication onto the 128-bit table-based datapath incrementally (to be discussed in Section 3.2); the parallel Galois field multiplication can thus be accelerated in hardware. Finally, we implement the parallel Galois field multiplications of the targeted RS algorithms with the acceleration instruction, so that the RS algorithms targeted in this paper can be accelerated. Similarly, we map the rest of the targeted algorithms onto the datapath incrementally. Common parts among the algorithms are abstracted and implemented by shared functional blocks to achieve low silicon cost.

The critical paths of the datapath differ among the algorithms targeted in this paper, and the critical path for the parallel CRC calculation is the shortest. To improve the CRC throughput under the parallel degree of 128 bits, we need to perform the parallel CRC calculation at a high clock frequency. We thus adopt pipelining to fine-tune the 128-bit table-based datapath. According to the requirements of the targeted algorithms, the 128-bit table-based datapath is fine-tuned to contain four pipeline stages, so that the critical path of the 128-bit table-based datapath equals the critical path of the datapath for the parallel CRC calculation. The 128-bit table-based datapath and the implementation of CRC, RS, and AES on the datapath will be explained in detail in Section 3.2.
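The traditional table-based GF(2^8) multiplication [25] that BP-ASIP parallelizes can be sketched in scalar C. This is an illustrative sketch built on the 0x11D field polynomial used by the targeted RS codes; the table construction and names are ours, not the paper's RTL:

```c
#include <stdint.h>

static uint8_t gf_log[256];
static uint8_t gf_exp[512];   /* doubled so log sums need no modulo */

/* Build log/antilog tables for GF(2^8) with polynomial 0x11D;
 * alpha = 2 is a primitive element of this field. */
static void gf_init(void) {
    int x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11D;   /* reduce modulo the field polynomial */
    }
    for (int i = 255; i < 512; i++) gf_exp[i] = gf_exp[i - 255];
}

/* The table-based multiplication: zero test, two log lookups,
 * an add, and an antilog lookup. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;        /* test for zero */
    return gf_exp[gf_log[a] + gf_log[b]];  /* log, add, antilog */
}
```

The RSGMUL-style acceleration instruction performs 16 such 8-bit multiplications per cycle, with the log and antilog tables held in the configurable SRAM LUTs.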

2.4. Method of Hardware Sharing


After implementing all the algorithms targeted in this paper on BP-ASIP, the hardware costs of the algorithms are exposed. In our table-based datapath, the LUTs account for most of the hardware cost; therefore, this section mainly explains how to share the LUTs among the targeted algorithms. Table 3 lists the LUT cost (in columns 2 and 3) of selected typical algorithms in the unified datapath. Column 4 shows the number of bits processed per clock cycle under the specified hardware constraints. The LUTs utilized for the parallel CRC calculation contain 32-bit keywords, while the cryptography and RS algorithms targeted in this paper need LUTs with 8-bit keywords; the LUTs have the same number of keywords, however. In this work, four 256 × 8b LUTs are utilized to act as one 256 × 32b LUT, and 64 SRAMs (256 × 8b) are utilized for the LUTs of CRC, Galois field arithmetic (GF(2^8)), and the cryptography algorithms. The contents of the LUTs can be obtained either from the corresponding algorithm specifications or from previous works such as the table-based CRC calculation [24] and AES [13], and this paper will not explain them in detail.

Table 3 LUT Cost of Selected Typical Algorithms.

Algorithm   Number of LUTs   Size of LUT (bit)   Bits Per Cycle
CRC         16               256 × 32            16 × 8
RS          48               256 × 8             16 × 8
AES         32               256 × 8             16 × 8
SNOW 3G     16               256 × 8             32
ZUC         16               256 × 8             16 × 8
Camellia    8                256 × 8             8
ARIA        16               256 × 8             4 × 32
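The sharing scheme, in which four byte-wide SRAMs serve together as one 32-bit-wide CRC table, can be modeled in software as follows (an illustrative model with hypothetical names, not the RTL):

```c
#include <stdint.h>

/* Four physical 256-entry byte tables, as in the 256 x 8b SRAMs. */
static uint8_t sram[4][256];

/* Fill the four byte tables from one logical 256 x 32b CRC LUT:
 * byte lane j of entry i holds bits [8j+7:8j] of the 32-bit word. */
static void split_lut(const uint32_t lut32[256]) {
    for (int i = 0; i < 256; i++)
        for (int j = 0; j < 4; j++)
            sram[j][i] = (uint8_t)(lut32[i] >> (8 * j));
}

/* A 32-bit lookup is then four parallel byte lookups glued together. */
static uint32_t lookup32(uint8_t addr) {
    return (uint32_t)sram[0][addr]
         | ((uint32_t)sram[1][addr] << 8)
         | ((uint32_t)sram[2][addr] << 16)
         | ((uint32_t)sram[3][addr] << 24);
}
```

The same physical SRAMs thus serve either as one wide CRC table or as four independent byte tables for the GF(2^8) and cipher lookups, depending only on how their contents are configured.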

After implementing all the targeted algorithms on the datapath, we fix the datapath. Then, the control signals for the processing routines in the datapath are extracted and represented by a group of control indications, and a specific instruction set (to be discussed in Section 3.3) is thus designed. Afterwards, algorithm pseudocodes of the targeted algorithms are developed with the instruction set. Meanwhile, addressing and control information is extracted from the algorithm pseudocodes, so that the specific memory subsystem, the control path, and the top-level architecture of BP-ASIP can be designed. Based on the specified functional blocks and the instruction set, the RTL (Register Transfer Level) description is developed; the programming toolchain (e.g., simulator and assembler) is developed at the same time. Based on the algorithm pseudocodes, the assembly codes of the targeted algorithms are developed. Correctness and performance are verified for both the functional design and the silicon layout. Meanwhile, the assembly codes offer support for all algorithms covered by the previously defined scope. Finally, the software (assembly codes) and hardware (assembly instruction set) are integrated, and BP-ASIP is thus designed via a hardware/software (HW/SW) codesign methodology.

3. Implementation of BP-ASIP

This section details BP-ASIP and the architecture of the whole system with a host for BP-ASIP. The whole system includes a master core (GPP), a main memory, an interconnection network, and BP-ASIP. The master core prepares the code and the payload data to be used by BP-ASIP, schedules the execution of BP-ASIP, and handles the outputs of BP-ASIP through the interconnection network; it gives BP-ASIP the data to be processed and loads the algorithms to run on BP-ASIP. BP-ASIP includes a control path, a datapath, a memory subsystem, a direct memory access (DMA) controller, and a network interface. BP-ASIP executes all the algorithms targeted in this paper with no extra hardware cost to the system. Thanks to its programmable and configurable table-based architecture, BP-ASIP can offer the flexibility to adapt to emerging changes, which can remarkably extend the time-in-market of embedded SoCs.

3.1. Top-level Architecture

Fig. 2. Top-level architecture of BP-ASIP.

BP-ASIP adopts a SIMD (Single Instruction Multiple Data) architecture to meet the computational-complexity requirements. The top-level architecture (Fig. 2) of BP-ASIP consists of three parts: control logic, memory subsystem, and datapath. The control logic includes the program flow controller, program memory, instruction decoder, DMA controller, and special registers. Each clock cycle, the control logic reads an instruction from the program memory, a 256 × 80b SRAM, and decodes the machine code (i.e., the fetched instruction) into control signals. The memory subsystem is composed of four parts: the AGU (address generation unit), the data memory, the read-permutation-network, and the write-permutation-network. The AGU generates addresses for operands according to the machine code, and the generated addresses are passed to the data memory. The data memory contains four memory blocks; each memory block contains 16 SRAMs (256 × 8b) for caching intermediate results of the targeted algorithms and provides 16 bytes of data per clock cycle. The outputs of these memory blocks are passed to the read-permutation-network and then to the datapath. The datapath handles the bit-wise algorithms implemented in this paper in parallel, and its outputs are passed to the write-permutation-network. The read-permutation-network and write-permutation-network perform data shuffling to ensure that vector data can be accessed in parallel without access conflicts. The outputs of the write-permutation-network are written to a memory block or the register file.

3.2. Datapath

According to the computational needs and constraints explained in Section 2.3, the datapath (Fig. 3) of BP-ASIP is designed with 4 pipeline stages and 12 blocks. Among these blocks, RAM0-RAM3 hold the previously mentioned LUTs. Each of RAM0-RAM3 contains sixteen 256 × 8b SRAMs and takes sixteen 8-bit data as inputs; the inputs act as addresses for the 16 SRAMs, and sixteen 8-bit data are obtained from the SRAMs. RAM2 can be utilized in either the first or the third pipeline stage of the datapath to satisfy the different requirements of the targeted algorithms. Crossbar0-Crossbar2 reorder the input vector data according to the requirements of different instructions; Crossbar1 is also utilized to process the outputs of RAM0-RAM3 when performing CRC calculation (to be discussed). Logic layer0-Logic layer3 fulfill the computing functions of the proposed instructions in the first, second, third, and fourth pipeline stages of the datapath, respectively. Taking Logic layer3 as an example, it is adopted for the instruction RSGTMAC (to be discussed in Section 3.3) and XORs the 16 output bytes of Logic layer2 with each other to obtain a 1-byte result. Fundamental operations (e.g., ADD and XOR) are shared by the computing functions inside these logic layers as much as possible to achieve low silicon cost. The block selDataOut chooses results from the four pipeline stages of the datapath. Reconfigurability of the datapath, and thus high flexibility, is achieved by changing the contents of the control registers and the LUTs.

Fig. 3. Datapath of BP-ASIP.

Seven blocks (the grey blocks) of the datapath (Fig. 4(a)) are adopted for the parallel CRC calculation. Firstly, the first four bytes of the input data are XORed with a CRC value chosen by Crossbar0; Crossbar0 chooses the CRC initial value in the first round and the outputs of Crossbar1 otherwise. Four new bytes are thus obtained. Then, the four new bytes and the remaining 12 bytes of input data act as addresses for RAM0-RAM3. Afterwards, 4 × 4 32-bit words from RAM0-RAM3 are passed to Crossbar1, where the sixteen 32-bit words are XORed with each other and a new CRC value is thus obtained. All the operations are completed within one clock cycle. When the calculation is completed, selDataOut chooses the outputs of Crossbar1.

Different from the parallel CRC calculation, the parallel Galois field arithmetic consumes more than one pipeline stage of the datapath. Taking the parallel Galois field multiplication as an example, three pipeline stages of the datapath (Fig. 4(b)) are adopted to fulfill the four sub-functions of Galois field multiplication: a test for zero, two log table lookups, a modulo add, and an antilog table lookup [25]. In the first pipeline stage, RAM0 and RAM1 perform the log table lookups of src0 and src1, respectively. The outputs of RAM0 and RAM1 are passed to Logic layer1 through Crossbar1. In the second pipeline stage, Logic layer1 fulfills the modulo add. The outputs of Logic layer1 are reordered in Crossbar2 and act as addresses for RAM2. In the third pipeline stage, RAM2 performs the antilog table lookup and Logic layer2 performs the test for zero. At last, selDataOut chooses the outputs of Logic layer2. The rest of the targeted Galois field functions can be fulfilled in a similar way.

Fig. 4. (a) Blocks utilized by CRC. (b) Blocks utilized by Galois field multiplication. (c) Blocks utilized by AES encryption. (d) Blocks utilized by AES decryption.
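In scalar form, the table-lookup recurrence that the CRC blocks of Fig. 4(a) evaluate can be sketched as follows (a one-byte-per-step C model using the common reflected CRC-32 formulation; the paper's datapath instead consumes 16 bytes per cycle and XORs the partial words in Crossbar1):

```c
#include <stdint.h>
#include <stddef.h>

/* One 256-entry LUT for reflected CRC-32 (0x04C11DB7, reflected 0xEDB88320). */
static uint32_t crc_table[256];

static void crc_init(void) {
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
        crc_table[n] = c;
    }
}

static uint32_t crc32(const uint8_t *p, size_t len) {
    uint32_t c = 0xFFFFFFFFu;                        /* CRC initial value */
    for (size_t i = 0; i < len; i++)
        c = crc_table[(c ^ p[i]) & 0xFF] ^ (c >> 8); /* XOR in, look up, shift */
    return c ^ 0xFFFFFFFFu;
}
```

Each step XORs the incoming byte into the running CRC and replaces one table lookup per byte; the hardware unrolls 16 of these steps into a single cycle.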


AES can also be accelerated on the datapath. Two pipeline stages (Fig. 4(c)) are utilized for AES encryption. In the first pipeline stage, RAM0 and Crossbar1 fulfill the SubBytes and ShiftRows functions of AES encryption, respectively; the round keys are passed through Crossbar0, Logic layer0, and Crossbar1 to Logic layer1. In the second pipeline stage, Logic layer1 fulfills the MixColumns and AddRoundKey functions of AES encryption. In this design, we did not implement a key scheduler for AES encryption, assuming that the round keys are stored in the data memory before encryption is performed; on-the-fly generation of round keys is nevertheless possible in our design. Three pipeline stages of the datapath (Fig. 4(d)) are consumed by AES decryption. In the first pipeline stage, Logic layer0 and Crossbar1 fulfill the AddRoundKey and InvShiftRows functions of AES decryption, respectively. In the second pipeline stage, Logic layer1 performs the InvMixColumns function of AES decryption, and the outputs of Logic layer1 are passed through Crossbar2 to RAM2. In the third pipeline stage, RAM2 fulfills the InvSubBytes function of AES decryption, and the outputs of RAM2 are passed through Logic layer2 to selDataOut. The other block ciphers and stream ciphers targeted in this paper can also be accelerated on the datapath.
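Of the round steps above, MixColumns carries the GF(2^8) arithmetic. A scalar sketch of one column combined with AddRoundKey (an illustration of the mathematics, not the 128-bit datapath; function names are ours):

```c
#include <stdint.h>

/* GF(2^8) doubling modulo the AES polynomial 0x11B. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* MixColumns on one 4-byte column, followed by AddRoundKey:
 * out[i] = 2*c[i] ^ 3*c[i+1] ^ c[i+2] ^ c[i+3] ^ rk[i] over GF(2^8). */
static void mix_column(const uint8_t c[4], const uint8_t rk[4], uint8_t out[4]) {
    for (int i = 0; i < 4; i++) {
        uint8_t a = c[i], b = c[(i + 1) & 3];
        out[i] = (uint8_t)(xtime(a)            /* 2*a            */
                 ^ (uint8_t)(xtime(b) ^ b)     /* 3*b = 2*b ^ b  */
                 ^ c[(i + 2) & 3]
                 ^ c[(i + 3) & 3]
                 ^ rk[i]);                     /* AddRoundKey    */
    }
}
```

The full AES state applies this transform to all four columns, which is why a 128-bit lane width processes one complete round per pass through the pipeline.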

RSGDIV RSGMAC

3 3

128 144

RSGTMAC

4

143

AESENC

2

176

AESENCF

1

32

AESENCL AESDEC

2 3

48 592

AESDECF

3

48

AESDECL

2

32

first round of CRC-32 remained rounds of CRC-32 16 8-bit Galois field multiplication 16 8-bit Galois field division 16 8-bit Galois field multiplication and addition 16 8-bit Galois field multiplication, XOR the results to 8-bit output intermediate round of AES encryption first round of AES encryption last round of AES encryption intermediate round of AES decryption first round of AES decryption last round of AES decryption

ED

38 38 128

PT

1 1 3

CE

CRC32F CRC32B RSGMUL

Function

M

Table 4 Selected Typical Instructions of BP-ASIP. Instruction Pipeline Scaled Mnemonic Stages Speedup

3.3. Instruction Set

The instruction set of BP-ASIP contains 39 SIMD instructions: 8 for CRC, 8 for RS, and 23 for cryptography algorithms, including AES, SNOW 3G, ZUC, Camellia, and ARIA. Table 4 lists selected typical instructions. Column 1 shows the instruction mnemonics in assembly. Column 2 gives the number of datapath pipeline stages consumed by each instruction, which is needed for hazard control during programming. Column 3 gives the scaled speedup of each instruction, i.e., the number of basic operations (ADD, SUB, XOR, LOAD, etc.) it implements: on a GPP, each such operation costs one instruction, whereas BP-ASIP merges all of them into a single instruction. Column 4 describes the function of each instruction. The SIMD instructions support algorithm modifications via configuration of the datapath, e.g., of its LUTs. With this instruction set, all the algorithms targeted in this paper, including CRC-8, CRC-16, CRC-16-CCITT, CRC-32, RS(255,223), RS(255,239), RS(224,208), AES, ZUC, and SNOW 3G, can be efficiently accelerated.

As loop control under software flow consumes considerable resources, an efficient branch-cost-free loop acceleration in hardware is introduced: all loops in this design execute with no branch cost. We achieve this through two mechanisms. First, the microcode of each BP-ASIP instruction contains a special field indicating how many times the instruction is to be repeated. This field is configured by appending an option "-i Imm" to the instruction in assembly code; e.g., to repeat an instruction 16 times, the option "-i 16" is used. Second, we introduce an instruction REPEAT that repeats a block of instructions a given number of times.

...
AddRoundKey(State0, RoundKey);
AddRoundKey(State1, RoundKey);
for (i = 1; i < Nr; i++) {


...

(a)

...
AESENCF dr0 dm0[ar0+=s] dm1[ar1]
AESENCF dr1 dm0[ar0+=s] dm1[ar1+=s%]
REPEAT 9 {
    AESENC dr0 dr0 dm1[ar1]
    AESENC dr1 dr1 dm1[ar1+=s%]
}
AESENCL dm2[ar2+=s] dr0 dm1[ar1]
AESENCL dm2[ar2+=s] dr1 dm1[ar1+=s%]

...

(b)

Fig. 5. Slices of (a) C code and (b) assembly code for AES-128 encryption.

Fig. 5 shows how AES-128 encryption is performed on BP-ASIP with the proposed instructions. Fig. 5(a) and Fig. 5(b) present the C code and the corresponding assembly code, respectively. The assembly code is written to avoid data hazards, which enhances performance. The REPEAT instruction repeats the following two instructions 9 times; for AES-192 and AES-256, the two instructions are repeated 11 and 13 times, respectively. The assembly code shown consumes 23 clock cycles. When there is no pipeline mismatch, the encryption of 128-bit data consumes 11 clock cycles on BP-ASIP.
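The 23-cycle figure can be reproduced by counting the instructions issued in the assembly slice. This is a back-of-the-envelope model on our part; it assumes each issued instruction occupies one clock cycle, which the paper does not state explicitly.

```python
# Instruction counts in the Fig. 5(b) slice (two interleaved 128-bit streams).
aesencf = 2       # first round, once per stream
repeat = 1        # the REPEAT instruction itself
aesenc = 9 * 2    # nine intermediate rounds per stream
aesencl = 2       # last round, once per stream

cycles_two_blocks = aesencf + repeat + aesenc + aesencl   # 23 cycles in total
cycles_per_block = (cycles_two_blocks - repeat) / 2       # 11 cycles per 128-bit block
```

The per-block count of 11 cycles matches the AES row of Table 6 in Section 5.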

3.4. Memory Subsystem

The operands of BP-ASIP are all vector data. To ensure that BP-ASIP works efficiently, a 16-byte vector must be obtained within one clock cycle. Without proper addressing patterns, the required 16 bytes cannot always be fetched simultaneously. It is thus necessary to design specific addressing patterns for BP-ASIP.

Table 5 Addressing Patterns of BP-ASIP.

Addressing Pattern | Assembly | Description
1  | imm (e.g., 18) | point to the address given by an immediate, e.g., 18
2  | ar       | point to the address in register ar; next clock cycle, ar remains unchanged
3  | ar++     | point to the address in register ar; next clock cycle, ar = ar+1
4  | ar--     | point to the address in register ar; next clock cycle, ar = ar-1
5  | ar+=s    | point to the address in register ar; next clock cycle, ar = ar+s, where s is the step
6  | ar-=s    | point to the address in register ar; next clock cycle, ar = ar-s, where s is the step
7  | ar+=s%   | point to the address in register ar; next clock cycle, ar = ar+s; if (ar >= AddrEnd) ar = AddrStart
8  | ar-=s%   | point to the address in register ar; next clock cycle, ar = ar-s; if (ar <= AddrEnd) ar = AddrStart
9  | ar+=s%++ | point to the address in register ar; next clock cycle, ar = ar+s; if (ar >= AddrEnd) { AddrStart++; AddrEnd++; ar = AddrStart; }
10 | ar+=S[]% | point to the address in register ar; next clock cycle, ar = ar+S[i], i = (i+1)%32, where S[] is a LUT of various steps; if (ar >= AddrEnd) ar = AddrStart

Table 5 lists the ten addressing patterns supported by BP-ASIP. Taking addressing pattern 5 as an example, 16-byte data is accessed starting from the address pointed to by a special register ar (0 ≤ ar ≤ 2^12-1); ar then increases by a step s (1 ≤ s ≤ 2^12-1). The values of ar and s are stored in a configuration register, which can be initialized in assembly code. All the algorithms targeted in this paper can thus be supported with the proposed addressing patterns.

3.5. Pipeline Scheduling

To approach the efficiency limits of the datapath and the memory bandwidth, the instructions of BP-ASIP are implemented with pipelining. BP-ASIP contains at most nine pipeline stages (Fig. 6). During IF (instruction fetch), an instruction is read from the program memory. During ID (instruction decoding), the instruction is decoded into control signals and the operand addresses are generated. The generated addresses are passed to the data memory, and the source operands are obtained during Mem (memory access). The operands are passed to the read-permutation-network and permuted if necessary during Perm (permutation). The outputs of the read-permutation-network are passed to the datapath, which uses one to four pipeline stages (Exe-1 to Exe-4) according to the requirements of each instruction. A block called Output Control chooses which pipeline stage of the datapath outputs the results. Finally, the results are stored during WB (write back).

Fig. 6. Pipeline scheduling of BP-ASIP (stages IF, ID, Mem, Perm, Exe-1 to Exe-4, and WB, with Output Control and the address, control, and data lines).

4. Area and Power Consumption

The proposed design is synthesized with Synopsys Design Compiler using the TSMC (Taiwan Semiconductor Manufacturing Company Limited) 28 nm 0.9 V high-performance standard cell library, then placed and routed with Cadence Encounter. The power estimation is provided by Synopsys Design Compiler. The overall area cost of BP-ASIP is 0.19 mm2, of which memory blocks consume 0.15 mm2. The equivalent gate count is 188.8 kgates for the whole design and 50.8 kgates for the logic part. The total peak power consumption is 47 mW (27.8 mW cell internal power, 18.7 mW net switching power, and 219.1 μW cell leakage power) at a clock frequency of 2.2 GHz. A comparison of these results with state-of-the-art works is presented in Section 6.

Among all the modules of BP-ASIP, the datapath and the data memory consume most of the area. The datapath with 16 KB SRAM costs 56.1% of the area (66.3% for the LUTs and 33.7% for the logic part). The data memory with 4 memory blocks consumes 37.2%. The program memory and the AGU cost 2.9% and 1.1%, respectively. The permutation networks, including the read-/write-permutation-networks, cost 2.5%. The processor top level and the control path consume negligible area. The core layout of the processor is shown in Fig. 7, with a floorplan density of 0.7.

Fig. 7. Core layout of BP-ASIP.
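Returning to the addressing patterns of Table 5, pattern 7 (ar+=s with modulo wrap) can be modeled in a few lines. This is an illustrative sketch of the AGU update rule; the use of >= for the wrap comparison follows our reading of the (partly garbled) table, not a stated hardware detail.

```python
def agu_next(ar, s, addr_start, addr_end):
    """One update of addressing pattern 7: ar += s, wrapping back to
    AddrStart when the pointer reaches or passes AddrEnd."""
    ar += s
    if ar >= addr_end:
        ar = addr_start
    return ar
```

For example, stepping through a 12-entry circular buffer with step 4 visits addresses 4, 8, and then wraps back to 0.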

5. Performance Analysis

5.1. Running a Single Algorithm

Table 6 summarizes the performance of selected typical algorithms on BP-ASIP. Both encryption and decryption of the three block ciphers (AES, ARIA, and Camellia) are accelerated, with the standard key sizes of 128, 192, and 256 bits; the listed results are obtained when BP-ASIP performs encryption with a 128-bit key. The results of the RS algorithms are obtained when BP-ASIP performs RS decoding with no erasures or errors; with the maximum number of errors, the throughput is scaled down by three. For RS(255,223) decoding, two instructions are needed to complete the innermost loop, because the innermost loop of RS(255,223) decoding processes 32 bytes each time while the SIMD instructions of BP-ASIP process 16 bytes each time. For the remaining RS algorithms listed in Table 6, BP-ASIP consumes one instruction for the innermost loop during RS decoding.

Table 6 Performance of BP-ASIP.

Algorithm | Block Size (bit) | Cycles Per Block | Throughput (Gb/s)
AES          | 128     | 11  | 25.6
SNOW 3G      | 32      | 1   | 70.4
ZUC          | 32      | 2   | 35.2
ARIA         | 128     | 13  | 21.6
Camellia     | 128     | 22  | 12.8
CRC-32       | 128     | 1   | 281.6
CRC-24       | 128     | 1   | 281.6
CRC-16-CCITT | 128     | 1   | 281.6
CRC-8-CCITT  | 128     | 1   | 281.6
RS(255,223)  | 223 × 8 | 510 | 8.8
RS(255,239)  | 239 × 8 | 255 | 17.6
RS(224,208)  | 208 × 8 | 224 | 17.6
RS(204,188)  | 188 × 8 | 204 | 17.6
RS(224,216)  | 216 × 8 | 224 | 17.6
RS(255,251)  | 251 × 8 | 255 | 17.6

Note that when performing AES encryption in the feedback cipher modes (CBC, CFB, and OFB), the throughput of our two-stage pipelined design (for AES encryption) can be maintained if two independent data streams are interleaved; otherwise, the throughput is scaled down by two. Similarly, RS decoding needs three pipeline stages of the datapath; as the procedure is iterative, the throughput can be maintained if three independent payloads are interleaved, and otherwise it is scaled down by three.

5.2. Running Cryptography, CRC, and RS Simultaneously

Bit-wise algorithms are usually adopted in baseband processing, a hard real-time application with stringent latency requirements. Special attention must therefore be paid to efficient multi-standard/multi-algorithm support with low scheduling overhead. BP-ASIP achieves high throughput when performing cryptography, error correction, and error detection simultaneously under proper scheduling, with an extra 20 KB of SRAM for LUTs (4 × 16 × 256 × 8 bit for CRC, 2 × 16 × 256 × 8 bit for AES encryption/decryption, and 3 × 16 × 256 × 8 bit for RS). Next, we analyze the performance of BP-ASIP when it performs multiple algorithms (e.g., AES, CRC, and RS) at the same time, as required by current communication protocols [4].

Taking IEEE 802.11ad at the 2503 Mb/s bit rate (the highest bit rate of LPSC-PHY [4]) as an example, we show the performance of BP-ASIP running real-time CRC, RS, and AES on the payload simultaneously. First, as shown in Fig. 8(a), the program that processes CRC, RS, and AES is loaded into the program memory, and the payloads to be processed are buffered in the data memory. To achieve a short computing context-switching time, the extra 20 KB of LUT SRAM mentioned above is introduced, which increases the overall area cost of BP-ASIP from 0.19 mm2 to 0.28 mm2. When processing one payload (8640 bits [4]) of IEEE 802.11ad at 2503 Mb/s, BP-ASIP fulfills as much work as possible in the priority order of CRC, RS encoding/decoding, and AES encryption/decryption. Through our research, we recognize that 100 clock cycles are sufficient for computing context switching between the different tasks under proper scheduling. In this case, we schedule BP-ASIP for IEEE 802.11ad (in mobile/portable devices) within a time period of 3.45 μs (8640 bits at 2503 Mb/s), which corresponds to more than 7500 clock cycles at the 2.2 GHz system clock frequency. As shown in Fig. 8(b), two time slots (each equal to 1% of the time period) are adopted for CRC calculation in data reception and transmission, respectively; 44% of the time period is adopted for RS decoding and 15% for RS encoding. A further 8% of the time period (600 clock cycles) is spent on computing context switching between the different tasks, leaving 31% of the time period for AES encryption/decryption. Completing full-rate payload AES encryption and decryption on BP-ASIP requires 0.68 μs, or 20% of the time period. The remaining 31% of the time period is thus sufficient for real-time AES encryption and decryption.

Fig. 8. (a) Configuring BP-ASIP for IEEE 802.11ad: the program memory holds the CRC, RS, and AES programs; the SIMD datapath LUTs hold the CRC, RS, and AES tables; and the data memory buffers the CRC, RS, and AES payloads. (b) Scheduling BP-ASIP for IEEE 802.11ad: within one time period, time slots are allocated to CRC (Rx), RS decoding, AES decryption, AES encryption (security), RS encoding, and CRC (Tx).
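The scheduling budget above can be checked with straightforward arithmetic; the values are taken from the text, and the slot fractions are those of Fig. 8(b).

```python
payload_bits = 8640   # one IEEE 802.11ad payload [4]
bit_rate = 2503e6     # highest LPSC-PHY bit rate, 2503 Mb/s [4]
clock_hz = 2.2e9      # BP-ASIP system clock

period_s = payload_bits / bit_rate     # about 3.45 us per payload
cycle_budget = period_s * clock_hz     # more than 7500 clock cycles

# Fractions of the time period per Fig. 8(b)
crc_rx, crc_tx = 0.01, 0.01
rs_dec, rs_enc = 0.44, 0.15
switching = 0.08                       # about 600 clock cycles
aes_available = 1.0 - (crc_rx + crc_tx + rs_dec + rs_enc + switching)

aes_needed = 0.68e-6 / period_s        # AES enc+dec takes 0.68 us, about 20%
```

Since the AES workload needs only about 20% of the period while 31% remains, real-time operation of all three tasks fits within the budget.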

6. Evaluation

The synthesis results of BP-ASIP are compared with state-of-the-art VLSI circuits for bit-wise algorithms collected up to February 2018. As no existing design supports all the algorithms implemented in this paper, our results are compared with previous ASICs/ASIPs for CRC-32, RS(255,239) decoding, and AES-128 encryption (in both a feedback mode, CBC, and a non-feedback mode, CTR). Table 7 summarizes the comparison. The scaled throughput (column 5) is obtained by dividing the throughput (column 3) by the corresponding clock frequency. The power consumption (column 7) is scaled to 28 nm (0.9 V). Column 9 shows the targeted algorithms of each selected design. Three high-throughput ASICs/ASIPs for CRC calculation, three successful RS(255,239) decoders (without FIFO memory), and three successful ASICs/ASIPs for AES encryption are chosen for comparison.

Among the listed designs for AES encryption, Sayilar's design [12] and Liu's design [10] achieve higher throughput than BP-ASIP in CTR mode, because they process all rounds of AES encryption [13] in parallel while BP-ASIP processes one round at a time. Due to their very deep pipelines, however, these designs cannot provide the same high performance in feedback modes as in non-feedback modes. BP-ASIP achieves higher throughput than these designs in CBC mode, because it is programmable and can thus process two independent data streams simultaneously in feedback modes, fully utilizing its two-stage pipelined datapath (for AES encryption) and achieving the same throughput as in non-feedback modes. Wang's design [6] and Chang's design [27] consume less power than BP-ASIP; however, BP-ASIP achieves ASIC-comparable energy efficiency (Scaled-Power-Cons./Throughput) and offers programmability, so that it can extend the time-in-market of an SoC. The comparison results show that BP-ASIP achieves ASIC-like performance for CRC, RS(255,239) decoding, and AES encryption with high flexibility. Note that there is a trade-off between power and flexibility in BP-ASIP; whether to sacrifice power for flexibility depends on the major constraint of a given design/product.

Among the listed designs, Sun's design [26], Lin's design [7], and Wang's design [6] are the ASIC designs most comparable to BP-ASIP for CRC-32, RS(255,239) decoding, and AES encryption, respectively. Implementing the three designs separately would cost 200.5 + 66.0 + 45.3 = 311.8 kgates (not including the FIFO memory for RS(255,239) decoding), whereas BP-ASIP costs 188.8 kgates. Moreover, BP-ASIP is programmable and can thus support additional bit-wise algorithms, such as ZUC, SNOW 3G, and Camellia, with high flexibility.

BP-ASIP supports a wide variety of algorithms thanks to its configurable table-based datapath, programmable architecture, and application-specific instruction set. All CRC algorithms with generator polynomials of no more than 32 bits and all RS algorithms based on GF(2^8) arithmetic can be accelerated with high flexibility. The three cryptography algorithms adopted in LTE (AES, SNOW 3G, and ZUC) and two further typical block ciphers (Camellia and ARIA) [28] are supported. The instructions proposed in this paper can also be adopted for other bit-wise algorithms (e.g., other algorithms based on GF(2^8) arithmetic).

BP-ASIP obtains its high flexibility by offering post-fabrication programmability/reconfigurability. First, in our table-based datapath, the LUTs (implemented with SRAMs) can be configured. When the implemented algorithms need to be modified, e.g., when the generator polynomial of CRC/RS changes, our design still works properly after changing the contents of the LUTs, with no hardware modification. Second, BP-ASIP can accommodate changes to the algorithms implemented in this paper through programming with the proposed accelerated instructions. For example, if one of the implemented cryptography algorithms is cracked, BP-ASIP can still work properly via a software update (e.g., the key length can be extended with little overhead). The product lifetime of BP-ASIP can thus be extended.

The computing latency of BP-ASIP is acceptable for most current communication scenarios, such as the LTE/LTE-Advanced standards, where the maximum code block size is 6144 bits. If BP-ASIP performs CRC calculation for such a code block, the latency is less than 23 ns, which satisfies the latency constraints of current 4G (fourth generation) and upcoming 5G [3] wireless communication networks. As BP-ASIP achieves low computing latency for the algorithms targeted in this paper, tasks adopting these algorithms complete within a very short time, so BP-ASIP can be switched off when idle to save power. Taking LTE as an example, the required CRC throughput is 150 Mb/s; BP-ASIP is then active only 1/1877 of the time, and the effective (duty-cycled) power consumption is less than 30 μW.
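The LTE figures quoted above follow from the throughput and power numbers already given; the following back-of-the-envelope check is ours.

```python
code_block_bits = 6144    # maximum LTE/LTE-Advanced code block size
crc_throughput = 281.6e9  # BP-ASIP CRC throughput, b/s (Table 6)
peak_power_w = 47e-3      # BP-ASIP peak power (Section 4)
lte_rate = 150e6          # required LTE CRC throughput, b/s

latency_s = code_block_bits / crc_throughput  # about 21.8 ns, under the 23 ns bound
duty_cycle = lte_rate / crc_throughput        # active about 1/1877 of the time
avg_power_w = peak_power_w * duty_cycle       # about 25 uW, under 30 uW
```

In other words, the 1/1877 activity ratio is simply the ratio of the required LTE rate to the peak CRC throughput, and scaling the 47 mW peak power by it gives the sub-30 μW figure.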

Table 7 Comparison between BP-ASIP and Existing Methods for Bit-wise Algorithms.

Method | Algorithm | Throughput (Gb/s) | Latency (cycles) | Scaled-Thr. (bit/cycle) | Gate Count (kgates) | Scaled-Power-Cons. (mW) | Programmability | Algorithm Coverage
Toal [5]     | CRC-32      | 4.92   | 1   | 31.9  | 9.0    | NA         | Y | CRC Codes
Walma [29]   | CRC-32      | 35.28  | 1   | 62.9  | 33.1   | NA         | N | CRC-32
Sun [26]     | CRC-32      | 112.42 | 1   | 128.0 | 66.0   | NA         | N | CRC-32
Yuan [30]    | RS(255,239) | 5.10   | 515 | 8.0   | 18.4   | NA         | N | RS(255,239)
Lin [7]      | RS(255,239) | 2.56   | 777 | 8.0   | 45.3   | NA         | N | RS(255,239)
Chang [27]   | RS(255,239) | 1.28   | NA  | 8.0   | 46.4   | 1.05 (a)   | N | RS Codes
Wang [6]     | AES-128-CBC | 0.84   | 10  | 12.8  | 200.5  | 1.49 (a)   | N | AES
             | AES-128-CTR | 0.84   | 10  | 12.8  |        |            |   |
Liu [10]     | AES-128-CBC | 1.02   | 152 | 0.8   | 1776.8 | 129.40 (a) | Y | AES, etc.
             | AES-128-CTR | 153.70 | 152 | 127.0 |        |            |   |
Sayilar [12] | AES-128-CBC | 6.40   | 20  | 6.4   | 3533.9 | 633.11 (a) | Y | Block Ciphers, Stream Ciphers, Hash Functions
             | AES-128-CTR | 128.00 | 20  | 128.0 |        |            |   |
This work    | CRC-32      | 281.60 | 1   | 128.0 | 188.8  | 47.00      | Y | CRC Codes, RS Codes, Block Ciphers, Stream Ciphers
             | RS(255,239) | 17.60  | 255 | 8.0   |        |            |   |
             | AES-128-CBC | 25.60  | 11  | 11.6  |        |            |   |
             | AES-128-CTR | 25.60  | 11  | 11.6  |        |            |   |

(a) Scaled-Power-Cons. = Power-Consumption × (65/Technology) × (1.2/Voltage)^2 × (P_BP-ASIP-28nm / P_BP-ASIP-65nm) (according to [22]), where P_BP-ASIP-28nm and P_BP-ASIP-65nm are the power of BP-ASIP obtained in 28 nm (0.9 V) and 65 nm (1.2 V), respectively.
* Compared with previous VLSI designs, our design obtains sufficient programmability/flexibility, supports a large set of bit-wise algorithms, and achieves ASIC-like performance and ASIC-comparable silicon cost.

7. Conclusion

This paper proposes a high-throughput application-specific instruction-set processor for block ciphers (AES, ARIA, and Camellia), stream ciphers (SNOW 3G and ZUC), RS codes, and CRC. To achieve this goal, the algorithm similarities and the optimal parallel degree among block ciphers, stream ciphers, RS, and CRC are analyzed. Our design offers high flexibility by introducing a configurable table-based architecture and supporting software programming, so that an SoC can adapt to changes in the targeted algorithms, extending the SoC product lifetime. Our design also achieves ASIC-like performance, e.g., 25.6 Gb/s for AES encryption, 70.4 Gb/s for SNOW 3G, 17.6 Gb/s for RS(255,239) decoding, and 281.6 Gb/s for CRC calculation. The performance of our design is shown to be sufficient for high-speed communication protocols like IEEE 802.11ad when running real-time AES, RS, and CRC simultaneously (via a simple modification). Through hardware sharing, our design obtains low (ASIC-comparable) silicon cost and low power consumption, so that it can be a valuable option for client-end systems. In future work, we will extend this design to accelerate more bit-wise algorithms.

8. Acknowledgement

The financial support from the National High Technology Research and Development Program of China (863 Program), Grant 2014AA01A705, is sincerely acknowledged by the authors. Synopsys ASIP Designer is also acknowledged.

References

[1] K. Masselos, S. Blionas, and T. Rautio, Reconfigurability requirements of wireless communication systems, in: Proceedings of the IEEE Workshop on Heterogeneous Reconfigurable Systems on Chip, 2002.
[2] S. Sutardja, 1.2 The future of IC design innovation, in: 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, IEEE, 2015, pp. 1–6.
[3] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. Soong, and J. C. Zhang, What will 5G be?, IEEE Journal on Selected Areas in Communications 32 (6) (2014) 1065–1082.
[4] http://standards.ieee.org/getieee802/download/802.11ad-2012.pdf.
[5] C. Toal, K. McLaughlin, S. Sezer, and X. Yang, Design and implementation of a field programmable CRC circuit architecture, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17 (8) (2009) 1142–1147.
[6] M. Y. Wang, C. P. Su, C. L. Horng, C. W. Wu, and C. T. Huang, Single- and multi-core configurable AES architectures for flexible security, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18 (4) (2010) 541–552.
[7] Y.-M. Lin, C.-H. Hsu, H.-C. Chang, and C.-Y. Lee, A 2.56 Gb/s soft RS (255, 239) decoder chip for optical communication systems, IEEE Transactions on Circuits and Systems I: Regular Papers 61 (7) (2014) 2110–2118.
[8] V. Hampel, P. Sobe, and E. Maehle, Experiences with a FPGA-based Reed/Solomon-encoding coprocessor, Microprocessors and Microsystems 32 (5) (2008) 313–320.
[9] M. E. Kounavis and F. L. Berry, Novel table lookup-based algorithms for high-performance CRC generation, IEEE Transactions on Computers 57 (11) (2008) 1550–1560.
[10] B. Liu and B. M. Baas, Parallel AES encryption engines for many-core processor arrays, IEEE Transactions on Computers 62 (3) (2013) 536–547.
[11] Y. Eslami, A. Sheikholeslami, P. G. Gulak, S. Masui, and K. Mukaida, An area-efficient universal cryptography processor for smart cards, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 14 (1) (2006) 43–56.
[12] G. Sayilar and D. Chiou, Cryptoraptor: High throughput reconfigurable cryptographic processor, in: 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), IEEE, 2014, pp. 155–161.
[13] S. Heron, Advanced Encryption Standard (AES), Network Security 2009 (12) (2009) 8–12.
[14] D. Kwon, J. Kim, S. Park, S. H. Sung, Y. Sohn, J. H. Song, Y. Yeom, E.-J. Yoon, S. Lee, J. Lee et al., New block cipher: ARIA, in: Information Security and Cryptology, ICISC 2003, Springer, 2004, pp. 432–445.
[15] K. Aoki, T. Ichikawa, M. Kanda, M. Matsui, S. Moriai, J. Nakajima, and T. Tokita, Specification of Camellia, a 128-bit block cipher, 2000.
[16] Specification of the 3GPP Confidentiality and Integrity Algorithms UEA2 & UIA2. Document 2: SNOW 3G Specification. Version 1.1, ETSI, 2006.
[17] Specification of the 3GPP Confidentiality and Integrity Algorithms 128-EEA3 & 128-EIA3. Document 2: ZUC Specification. Version 1.5, 2011.
[18] Z. Wu and D. Liu, High-throughput trellis processor for multistandard FEC decoding, IEEE Transactions on Very Large Scale Integration (VLSI) Systems 23 (12) (2015) 2757–2767.
[19] F. Naessens, P. Raghavan, V. D. P. Liesbet, and A. Dejonghe, Unified C-programmable ASIP architecture for multi-standard Viterbi, Turbo and LDPC decoding, 2011.
[20] R. Asghar, Flexible Interleaving Sub-systems for FEC in Baseband Processors, 2010.
[21] A. Nilsson, E. Tell, and D. Liu, An 11 mm2, 70 mW fully programmable baseband processor for mobile WiMAX and DVB-T/H in 0.12 μm CMOS, IEEE Journal of Solid-State Circuits 44 (1) (2009) 90–97.
[22] C. Zhang, L. Liu, D. Marković, and V. Öwall, A heterogeneous reconfigurable cell array for MIMO signal processing, IEEE Transactions on Circuits and Systems I: Regular Papers 62 (3) (2015) 733–742.
[23] D. Liu, Baseband ASIP design for SDR, China Communications 12 (7) (2015) 60–72.
[24] Y. Huo, X. Li, W. Wang, and D. Liu, High performance table-based architecture for parallel CRC calculation, in: 2015 IEEE International Workshop on Local and Metropolitan Area Networks (LANMAN), IEEE, 2015, pp. 1–6.
[25] J. S. Plank et al., A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems, Software: Practice and Experience 27 (9) (1997) 995–1012.
[26] Y. Sun and M. S. Kim, A table-based algorithm for pipelined CRC calculation, in: 2010 IEEE International Conference on Communications (ICC), IEEE, 2010, pp. 1–5.
[27] H.-C. Chang, C.-C. Lin, F.-K. Chang, and C.-Y. Lee, A universal VLSI architecture for Reed-Solomon error-and-erasure decoders, IEEE Transactions on Circuits and Systems I: Regular Papers 56 (9) (2009) 1960–1967.
[28] J. H. Kong, L. M. Ang, and K. P. Seng, A comprehensive survey of modern symmetric cryptographic solutions for resource constrained environments, Journal of Network and Computer Applications 49 (2014) 15–50.
[29] M. Walma, Pipelined cyclic redundancy check (CRC) calculation, in: Proceedings of the 16th International Conference on Computer Communications and Networks (ICCCN 2007), IEEE, 2007, pp. 365–370.
[30] B. Yuan, Z. Wang, L. Li, M. Gao, J. Sha, and C. Zhang, Area-efficient Reed-Solomon decoder design for optical communications, IEEE Transactions on Circuits and Systems II: Express Briefs 56 (6) (2009) 469–473.


Yuanhong Huo received the B.Sc. degree from Zhengzhou University, Zhengzhou, China, in 2011. He is currently pursuing the Ph.D. degree in computer science and technology at the Beijing Institute of Technology, Beijing, China. His current research interests include application-specific instruction-set processor (ASIP) design, software defined radio (SDR), software-hardware co-design, and VLSI implementation.

Dake Liu (SM’08) received the D.Tech. degree from Linkoping University, Linkoping, Sweden, in 1995. He has experience in the design of communication systems and radio-frequency CMOS integrated circuits. He is currently a Professor and the Head of the Institute of Application Specific Instruction Set Processors (ASIP), Beijing Institute of Technology, Beijing, China, and also a Professor with the Computer Engineering Division, Department of Electrical Engineering, Linkoping University. He is the Co-Founder and Chief Technology Officer of Freehand DSP AB Ltd., Stockholm, Sweden, and the Co-Founder of Coresonic AB, Linkoping, which was acquired by MediaTek, Hsinchu, Taiwan. He has authored over 150 papers in journals and at international conferences and holds five U.S. patents. His current research interests include high-performance low-power ASIP and the integration of on-chip multiprocessors for communications and media digital signal processing. Dr. Liu is enrolled in the China Recruitment Program of Global Experts.