A processor for IoT applications: An assessment of design space and trade-offs


Sergio F. Johann*, Matheus T. Moreira, Leandro S. Heck, Ney L.V. Calazans, Fabiano P. Hessel
Pontifícia Universidade Católica do Rio Grande do Sul, Faculdade de Informática, Brazil

Article history: Received 1 July 2015; Revised 21 November 2015; Accepted 2 February 2016; Available online xxx

Keywords: Configurable processor cores; Low power; ASIP; Design space exploration; IoT

Abstract

Contemporary embedded systems require low-power solutions while still keeping a minimum performance level, and this is even more acute in the Internet of Things (IoT) domain, with its vast design space. This work proposes a configurable RISC processor associated with a design flow that includes a hardware synthesis flow and a software toolchain. This design flow is useful to explore the design space and trade-offs of processor cores for IoT applications, by enabling multiple hardware configurations with variable degrees of complexity, while maintaining compatibility with the chosen instruction set architecture, which is itself configurable. Results rely on example designs targeting a 65 nm technology and post-mapped hardware simulations of two benchmark sets, the CoreMark and Mälardalen suites. These results indicate that substantial power savings can be obtained by tailoring the architecture to a given application class, while reducing hardware complexity and maintaining performance figures. Findings show that the proposed processor provides an interesting resource to target low-end and middle-sized IoT applications, while demonstrating that reducing hardware complexity usually leads to the best trade-off between performance and power.

1. Introduction

Recent studies have presented the Internet of Things (IoT) as a trend in the contemporary embedded systems market [1,2]. Due to their ubiquitous nature, IoT components usually need to be very low cost, lightweight, wirelessly connectable, capable of complex operations and very energy-efficient. In this context, specific applications can impose different additional constraints, such as environment invisibility (embeddable devices), ruggedness (to heat, moisture, vibration, etc.) and timing properties (for real-time systems). In fact, IoT applications vary greatly in their degree of interaction with the environment, as well as in computation and communication needs. However, a common requirement of major importance is energy efficiency, given that such applications usually run on batteries and/or on energy harvesting devices.



* Corresponding author. Tel.: +55-51-3353-8718. E-mail addresses: sergio.fi[email protected], [email protected] (S.F. Johann), [email protected], [email protected] (M.T. Moreira), [email protected] (L.S. Heck), [email protected] (N.L.V. Calazans), [email protected] (F.P. Hessel).

As applications become more complex and smarter, they usually require more processing power, and achieving energy efficiency consequently becomes a challenging task. The classical approach to implement low-power embedded devices is to use 8- or 16-bit processor cores. However, due to the nature of the applications and the need to integrate with other computer systems through network protocols such as IPv6 [2], 32-bit processor cores end up providing better performance and energy trade-offs for these applications. Compared to 8- or 16-bit devices, 32-bit cores spend more energy per cycle but can finish tasks in fewer cycles, so they stay idle for longer periods of time [3]. Also, to reduce power, cores can typically execute at modest clock frequencies, and the number of pipeline stages, along with the logic needed to implement forwarding units, pipeline control and branch prediction, can be simplified or avoided, as in the ARM Cortex-M0+ processor [4]. Other approaches to improve efficiency rely on heterogeneous platforms, which combine processor cores with hardware accelerators [5,6]. Although efficient, the design of such accelerators for individual applications can be expensive in terms of hardware resources and can lead to high non-recurring engineering (NRE) costs [7]. In fact, not only is design time increased due to the integration of heterogeneous blocks, but the entire design flow becomes more complex, and the same occurs with the associated software tools.


Furthermore, adding accelerators makes the platform less general. Another way to tackle these limitations is to reduce power through the use of simpler hardware elements and compiler-assisted improvements, as suggested in [8]. Current solutions proposed in industry for the IoT are adapted to cover only a subset of applications. As in many embedded systems, in such applications a single, fixed piece of software is typically executed during most of the device lifetime [2]. Based on this, to cope with the exposed problems, this work argues that subtle processor hardware optimizations can improve performance and power trade-offs for specific applications. Yet, such optimizations still allow the use of a uniform programming flow. They avoid increasing programming complexity to levels that lead to under-utilization of additional accelerators, as occurs e.g. with the Cell processor [9].

This article proposes a generic processing core for IoT applications that has design-time configurable hardware and yet employs the same basic instruction set architecture (ISA). Coupled to an equally configurable software toolchain, this enables hardware optimizations to be completely transparent to the application developer, generating code correctly targeted to each different core configuration. Thus, designers are free to explore application models, performance and power. To evaluate the proposition, we designed a configurable RISC processor core with two pipeline stages that supports a subset of the MIPS-I ISA. The core can be configured for different register file (RF) sizes and to provide hardware support, or not, for multiplication and/or division. Along with the hardware comes an encompassing software toolchain, comprising a compiler, a linker, an assembler, binary image tools and runtime libraries. The configurable processor and the toolchain constitute the flow employed in this article. Using this flow, the paper evaluates eight different core versions, with varying complexity, based on the following configuration choices: (i) 16- or 32-register RF; (ii) no hardware support for multiplication/division, serial hardware multiplication only, serial hardware multiplication and division, or high-performance hardware multiplication only. Designs were synthesized targeting a 65 nm bulk CMOS technology. Experiments account for performance and power using benchmarks from the CoreMark [10] and the Mälardalen [11] suites. On average, a simple core (16-register RF without hardware support for multiplication/division) is often the most performance- and power-efficient option. For specific benchmarks, substantial improvements in terms of performance and power can be obtained by choosing a proper hardware implementation for a given application. This shows that finer-grain optimizations can be explored with the proposed flow, keeping the generality of the overall design process.

2. Related work

Several works discuss the optimization of processors for specific applications. Others propose frameworks to optimize different aspects of a given application. This section reviews works more closely related to what is proposed here, putting the related literature in perspective with the proposed approach.

Chen et al. [12] highlight the need for optimizations in 3D graphics on embedded platforms and present a design space exploration (DSE) approach using high-level models in UML and SystemC for fast performance evaluation. Although fast, their approach is not suitable for energy-sensitive applications, as energy consumption is not evaluated.
Kim and Kim [13] present a framework for efficient design space exploration of digital systems. Their approach relies on formally defined representations of requirements and architectural constraints, using a method to select a good candidate solution. The authors use their framework to evaluate cache memory design, but do not evaluate more irregular circuits such as CPUs.

Chiang et al. [14] describe a SystemC-QEMU framework and use it to build a cycle-accurate ISS which is several times faster than RTL simulations. Being cycle accurate, their approach is adequate for performance evaluation but is not well suited for energy evaluation, which is a major drawback for the selection of architectural features. Gauthier et al. [8] explore software optimization techniques to reduce power in processors. A drawback here is assuming a fixed microarchitecture, avoiding hardware optimizations, which limits design space exploration. Lee et al. [15], on the other hand, explore optimizations in the ISA of application-specific processors, taking into account both software and hardware optimization opportunities. Unfortunately, the authors limit hardware optimization flexibility by allowing optimizations only in instruction encoding. The works of Banz et al. [5] and Constantin et al. [16] also explore ISA optimizations. The former addresses optimizing an ISA for an image processing application and the latter focuses on ISA optimizations for cryptography. Both address hardware and software optimizations. However, by targeting a specific application, they lose generality. Labrecque et al. [17] conduct a more general evaluation, presenting an in-depth exploration of the effects of compiler optimizations on the ISA and on hardware design. Their results provide two important guidelines followed here: (i) hazard detection logic makes it harder to optimize processor area and operating frequency; and (ii) it is possible to prevent the compiler from using a significant fraction of the RF for many benchmarks without degrading performance. Yiannacouras et al. [18] present a comprehensive set of architectural modifications on the Altera Nios II processor, including pipeline depth, multiply/divide units and shifter implementation. Unfortunately, FPGAs are seldom an option for IoT, as these devices are very cost-sensitive and energy efficiency is mandatory. Besides, it is unclear whether the analysis applies to modern fabrication processes. Azizi et al. [19] evaluate several processor design choices using a statistical model framework. The authors target performance and energy efficiency and use the SPEC suite as benchmark. Several evaluated configurations include complex processors (e.g. out-of-order superscalar) more adequate for generic workloads than for IoT.

Compared to these works, our approach stands out by providing a less constrained path to hardware optimization while still maintaining compatibility with a basic ISA. This characteristic enables exploring optimizations and tailoring a processor core without compromising the generality of a common flow and toolchain to program it. This is different from approaches such as application-specific instruction set processors (ASIPs) [5,15]. These optimize an ISA for a specific application, modifying the datapath and adding hardware acceleration blocks, which requires deep architectural knowledge along with new programming tools and libraries. Table 1 provides an overview and comparison of the related work, along with the characteristics of the approach proposed herein.

3. The proposed processor: HF-RISC

The processor proposed here is called HF-RISC and implements a variation of the MIPS-I ISA introduced in [20]. More specifically, HF-RISC uses a subset of the MIPS-I ISA (for compatibility with existing tools and optimizing compilers) to simplify the core design. Naturally, the core has an organization distinct from most fully MIPS-I-compatible processors, and uses fewer pipeline stages. Reducing the number of pipeline stages increases the number of instructions per cycle, simplifies the design [17] and reduces energy consumption [19]. The approach is useful in lower-speed applications, where energy/MHz trade-offs need to be explored, rather than increasing clock frequency to a maximum. Industry employs the same principle.


As an example, embedded applications abound that use 32-bit processors like the ARM Cortex-M [4] family, with only two or three pipeline stages [3], as a substitute for current 16-bit and 8-bit microcontrollers. Such design choices aim to improve both performance and energy efficiency.

Table 1. Summary of the related work.

Work                     | Technique/approach                                                                       | Goal/optimization
Chen et al. [12]         | DSE framework using high-level models (UML and SystemC)                                 | Optimizations in 3D graphics, performance evaluation
Kim and Kim [13]         | DSE framework using formally defined representations                                    | Optimal architecture selection, cache memory design
Chiang et al. [14]       | DSE using a SystemC-QEMU framework, cycle-accurate ISS                                   | Performance evaluation
Gauthier et al. [8]      | Software optimization techniques on a fixed microarchitecture                           | Power reduction
Lee et al. [15]          | Optimizations in the ISA encoding of application-specific processors                    | Application performance, energy reduction
Banz et al. [5]          | ISA optimizations for image processing                                                  | Application performance evaluation
Constantin et al. [16]   | ISA optimizations for cryptography                                                      | Application performance evaluation
Labrecque et al. [17]    | DSE based on effects of compiler and ISA optimizations                                  | Energy consumption metrics
Yiannacouras et al. [18] | Architecture DSE using pipeline, multiply/divide, and shifter implementation variations | Application and architecture performance evaluation
Azizi et al. [19]        | DSE framework using statistical models                                                  | Performance and energy efficiency evaluation
This work                | Hardware optimization methodology using a single ISA and a common software toolchain    | Best application performance-power ratio

The most relevant differences between the organization proposed here and the classic 5-stage MIPS are:

• Short, 2-stage pipeline. The goal is to simplify the core design and to reduce chip area and energy consumption.
• No hazard detection or forwarding units. These are not needed due to the short pipeline.
• Shared instruction and data memory organization, i.e. a von Neumann organization. Data accesses take 2 cycles, due to bus multiplexing.
• No unaligned loads/stores, no MMU and no exceptions.
• No co-processor; all peripherals (VECTOR, CAUSE, MASK, STATUS and EPC registers) are memory mapped (a minimal access sketch appears in Section 3.1).
• Support for different configurations, including RF size, optional multiply/divide units and related hardware.
• A set of MCU-like peripherals, including an optional UART, an IRQ controller with 8 internal and 8 external interrupts, one running counter, two programmable counters, compare registers and a debug interface.

In terms of throughput, most HF-RISC instructions take just one clock cycle, but load and store instructions take two clock cycles each, due to the memory bus multiplexing. Also, multiply and divide instructions take several cycles, depending on the chosen hardware configuration. Section 3.2 presents more details about the latency of these instructions. A side effect of the simple pipeline is the absence of the load delay slots of conventional MIPS organizations. One branch delay slot arises due to the pipeline design, and the compiler can schedule instructions in this slot.

3.1. General description

Fig. 1 shows the HF-RISC block diagram. The pipeline is split in two stages by means of an Instruction Register (IR) and a Data Register (DR). IR holds the instruction fetched from memory at the Fetch stage, and IR feeds the control unit, which sends signals to all processor datapath components. DR holds the shared memory data bus value during data access cycles. MUXes, adders, a program counter (PC) register and a memory interface compose the Fetch stage. This stage gets an instruction from memory, computes the address of jumps and branches and then updates the PC accordingly. The second stage (Execute) includes the RF, MUXes, an ALU, sign extension units, optional multiply/divide units and associated registers (HI and LO), and logic for memory access during the data cycle. This stage accesses the RF and feeds operands to the ALU inputs. Next, it writes back the result to the RF or to memory. This stage also computes the addresses used for data accesses.
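Because the control registers listed above are memory mapped rather than held in a MIPS coprocessor 0 (so no mfc0/mtc0 instructions are needed), application code can reach them with ordinary volatile loads and stores. The sketch below is illustrative only: the base address, register offsets and bit layout are assumptions for the example, since the paper does not give the actual memory map.

```c
/* Illustrative only: base address, offsets and bit positions are hypothetical.
 * The point is that VECTOR, CAUSE, MASK, STATUS and EPC are reached with plain
 * lw/sw instead of MIPS coprocessor-0 instructions. */
#include <stdint.h>

#define PERIPH_BASE  0xF0000000u                                   /* assumed */
#define IRQ_VECTOR   (*(volatile uint32_t *)(PERIPH_BASE + 0x00))
#define IRQ_CAUSE    (*(volatile uint32_t *)(PERIPH_BASE + 0x10))
#define IRQ_MASK     (*(volatile uint32_t *)(PERIPH_BASE + 0x20))
#define IRQ_STATUS   (*(volatile uint32_t *)(PERIPH_BASE + 0x30))
#define IRQ_EPC      (*(volatile uint32_t *)(PERIPH_BASE + 0x40))

void irq_enable_line(unsigned line)
{
    IRQ_MASK |= (1u << line);   /* unmask one of the 16 interrupt sources */
    IRQ_STATUS |= 1u;           /* global interrupt enable bit (assumed)  */
}

uint32_t irq_pending(void)
{
    return IRQ_CAUSE & IRQ_MASK; /* pending and unmasked sources */
}
```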

The two shaded sections in the second stage depict processor areas that can be tailored for a given application. The RF can be cut in half, affecting the source MUXes (for rs, rt, rd and write_data). The multiply/divide unit can have different configurations or be omitted.

Fig. 1. The HF-RISC block diagram. Shaded areas can be tailored to the application.

Table 2 shows the 49-opcode subset of MIPS-I supported by HF-RISC. Some configurations have even fewer instructions, omitting divide and/or multiply and HI/LO register management opcodes. A minimal HF-RISC ISA subset comprises 41 opcodes.

Table 2. The 49 HF-RISC instruction set mnemonics.

Arith      | addiu addu subu
Logic      | and andi nor or ori xor xori
Shift      | sll sra srl sllv srav srlv
Compare    | slt sltu slti sltiu
Load/store | lui lb lbu lh lhu lw sb sh sw
Branch     | beq bne bgez bgezal bgtz blez bltzal bltz
Jump       | j jal jr jalr
Mul/div    | mthi mfhi mtlo mflo mult multu div divu

3.2. Configurations

Based on preliminary experiments, we derived eight configurations, varying the size of the RF and the choices for the multiply/divide unit. A set of 35 benchmarks from the Mälardalen WCET suite [11] allowed defining good candidate configurations. Experiments showed that some applications depend less on the number of available registers, so the RF can be reduced, while others depend less on the latency of multiplications or divisions. As the RF and the multiply/divide units are relatively large compared to the rest of the core, the processor can be tailored for specific applications to maximize the performance/energy ratio. The RF plays a significant role in energy savings [21]. It is important to highlight that all applications can execute on any of the cores, because the compiler can be instructed to generate code that is compatible with each configuration.

Table 3 shows the eight proposed processor configurations. The RF was defined in two different configurations: 16 or 32 general purpose registers (GPRs). The omitted registers in the smaller RF are in the range $8 to $23, which avoids removing essential registers like $0 (the zero constant) and $31 (the routine return address register). Table 4 presents the relevant latencies of each configuration. Multiplications have three different modes: software (around 250 cycles per operation, worst case), serial hardware (between 11 and 35 cycles) and parallel hardware (3 cycles). Divisions have two modes: software (around 350 cycles per division, worst case) and serial hardware (38 cycles). The hardware versions of multiply and divide can deal with signed or unsigned operands.

Table 3. The set of all possible core configurations. Suffix legend: (S) implemented in software, (H) in hardware, (F) with fast hardware.

Core | Name  | RF (GPRs) | Multiply | Divide
A    | A16SS | 16        | S        | S
B    | B16HS | 16        | H        | S
C    | C16HH | 16        | H        | H
D    | D16FS | 16        | F        | S
E    | E32SS | 32        | S        | S
F    | F32HS | 32        | H        | S
G    | G32HH | 32        | H        | H
H    | H32FS | 32        | F        | S

Table 4. HF-RISC configuration performance figures: RF size (in GPRs) and multiply/divide latency (in clock cycles).

Component | A16SS | B16HS | C16HH | D16FS | E32SS | F32HS | G32HH | H32FS
RF        | 16    | 16    | 16    | 16    | 32    | 32    | 32    | 32
Multiply  | ~250  | 11–35 | 11–35 | 3     | ~250  | 11–35 | 11–35 | 3
Divide    | ~350  | ~350  | 38    | ~350  | ~350  | ~350  | 38    | ~350

4. Optimization strategies

The first step to assess the trade-offs between the different configurations of HF-RISC was to validate the eight cores listed in Table 3. This process employs the software flow depicted in Fig. 2 to execute a comprehensive set of benchmarks in different behavioral simulation scenarios, starting with the EEMBC CoreMark suite, followed by benchmarks from the Mälardalen suite. This choice is due to the fact that these benchmarks are used in industry for evaluating processors for IoT applications, such as the ARM Cortex-M7. The flow consists in using a modified GCC toolchain to compile binary code for the benchmarks and in using the Cadence Incisive simulator to execute it on the different RTL versions of the processor. The toolchain used for machine code generation relies on the GNU GCC 4.6.1 compiler and Binutils 2.23.1.
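Since every benchmark is built and simulated once per core, it is convenient to capture Table 3 as a small configuration descriptor that host-side scripts can iterate over. The sketch below is our own arrangement; the struct, enum and field names are not part of the HF-RISC distribution.

```c
/* Minimal sketch: the eight HF-RISC configurations of Table 3 captured as data,
 * so that a host-side driver can iterate over them when building and simulating
 * the benchmarks. Names and types are illustrative. */
#include <stdio.h>

enum muldiv { SOFT, SERIAL_HW, FAST_HW };   /* (S), (H), (F) in Table 3 */

struct hf_risc_cfg {
    const char *name;      /* configuration name        */
    int gprs;              /* register file size (GPRs) */
    enum muldiv multiply;  /* multiplier implementation */
    enum muldiv divide;    /* divider implementation    */
};

static const struct hf_risc_cfg cfgs[8] = {
    { "A16SS", 16, SOFT,      SOFT      },
    { "B16HS", 16, SERIAL_HW, SOFT      },
    { "C16HH", 16, SERIAL_HW, SERIAL_HW },
    { "D16FS", 16, FAST_HW,   SOFT      },
    { "E32SS", 32, SOFT,      SOFT      },
    { "F32HS", 32, SERIAL_HW, SOFT      },
    { "G32HH", 32, SERIAL_HW, SERIAL_HW },
    { "H32FS", 32, FAST_HW,   SOFT      },
};

int main(void)
{
    for (int i = 0; i < 8; i++)
        printf("%s: %d GPRs, mul=%d, div=%d\n",
               cfgs[i].name, cfgs[i].gprs, cfgs[i].multiply, cfgs[i].divide);
    return 0;
}
```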

The compiler was modified to support three new flags (options) to generate binaries compatible with each version of the HF-RISC core. The first flag (-mpatfree) avoids the generation of unaligned memory accesses. The second (-mnohwmul) and third (-mnohwdiv) flags avoid the generation of multiplication and division instructions, respectively. Software multiply and divide routines were implemented in C, and the compiler generates calls to such routines on configurations that lack hardware support for them. Along with such routines, a minimal C library was written to support the compilation of all benchmarks with minimal effort. Moreover, additional flags (-ffixed-lo and -ffixed-hi) for the HI and LO registers are specified for the configurations using software multiplication (cores A16SS and E32SS). These are needed to prevent the compiler from referencing these registers. Also, code generation compatible with the 16-GPR RF is accomplished by telling the compiler that the removed registers are not to be used, through the -ffixed-reg flag. In these configurations, the removed registers are $t0 to $t7 and $s0 to $s7.
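As an illustration of why the software mode in Table 4 costs on the order of 250 cycles per operation, the helper the compiler falls back on when -mnohwmul suppresses mult/multu (GCC conventionally uses a libgcc routine such as __mulsi3 for this) can be a plain shift-and-add loop. The sketch below is ours, not the routine shipped with the HF-RISC toolchain.

```c
/* Illustrative shift-and-add 32-bit multiply, similar in spirit to the software
 * routine called on A16SS/E32SS builds. Up to 32 loop iterations of a few
 * instructions each explain the ~250-cycle worst case in Table 4; the serial
 * hardware unit (11-35 cycles) and the parallel multiplier (3 cycles) trade
 * area and power for that latency. */
#include <stdint.h>

uint32_t soft_mul32(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    while (b != 0) {
        if (b & 1u)        /* add the shifted multiplicand when this bit is set */
            result += a;
        a <<= 1;
        b >>= 1;
    }
    return result;
}
```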


Fig. 2. Evaluation flow for performing synthesis, simulation and power analysis activities on the eight HF-RISC cores.

Once the HF-RISC cores were validated, we synthesized them targeting the 65 nm bulk CMOS technology from STM and analyzed performance parameters using the Cadence framework tools, as the hardware flow of Fig. 2 details. The cell library employed for technology mapping was also provided by the vendor and included core and clock cells. Syntheses were performed using Cadence RTL Compiler, targeting a frequency of 200 MHz. Note that this frequency was defined as the most relaxed operating frequency achievable for the simplest core configuration. In this way we avoided underusing simple cores that could operate at higher frequencies with minimally sized logic gates. This allows a fair comparison, as more complex cores (with longer critical paths) employ bigger logic gates to achieve design timing closure. After synthesis, timing and area reports are generated and the mapped cores are exported to a Verilog netlist. Next, the synthesized cores had their gate and wire delays annotated in a standard delay format (SDF) file, with statistical wire load models. For synthesis and delay annotation, all circuits employ delay and power models for typical fabrication and operating conditions. The area reports are the basis to compute area trade-offs. The Verilog netlist and the SDF file are the source for performing timing simulation of the cores. To do so, we use the same environment as in behavioral simulation, which allows executing the benchmarks from both the CoreMark and Mälardalen suites. These simulation scenarios allow extracting performance and power figures for each core. For the power analysis, the activity of the internal nodes is annotated during simulation and exported to a toggle count file (TCF). The generated TCFs are the basis for conducting power analysis in the RTL Compiler tool. Next, we measure dynamic and leakage power for the cores while executing the CoreMark suite and use these figures as a basis for assessing power trade-offs. This choice reduces the complexity of analyzing the cores for each benchmark of the Mälardalen suite, which would be time consuming. Also, using the figures measured for CoreMark enables a fair comparison, due to the fact that this is an industry standard that performs a comprehensive evaluation of the cores, albeit it does not include division operations. The approach provides a fair comparison of trade-offs without jeopardizing generality.

5. Results and discussion

Table 5 shows results of area, leakage power (LP), dynamic power (DP) and total power (TP) collected for all cores. As explained above, power results rely on switching activity annotated during the simulation of the CoreMark benchmarks. We use these results as a baseline for discussing the trade-offs among cores. The smallest core (A16SS) has an area of 19,546 μm² and a total power of 0.627 mW, of which 0.230 mW is leakage power and 0.396 mW dynamic power. As functionalities are added, area and power figures increase. E.g., the largest core (H32FS) has an area of 41,284 μm² and dissipates 1.219 mW of total power, of which 0.478 mW is leakage and 0.740 mW dynamic power.


Table 5. Area and power for HF-RISC core configurations running the CoreMark benchmark @200 MHz (GCC with default optimizations, -O2). Legend: (LP) leakage power, (DP) dynamic power, (TP) total power, (CPM) CoreMarks per MHz.

Metric       | A16SS  | B16HS   | C16HH   | D16FS   | E32SS   | F32HS   | G32HH   | H32FS
Time (s)     | 24.994 | 16.982  | 16.982  | 13.300  | 21.329  | 16.118  | 16.118  | 12.436
Iterations/s | 192.04 | 282.656 | 282.656 | 360.896 | 225.048 | 297.808 | 297.808 | 385.976
Total gates  | 2908   | 3534    | 4136    | 5403    | 3922    | 4535    | 5158    | 6474
Area (μm²)   | 19,546 | 23,519  | 26,233  | 32,408  | 28,201  | 32,182  | 34,933  | 41,284
LP (mW)      | 0.230  | 0.276   | 0.307   | 0.373   | 0.330   | 0.376   | 0.407   | 0.478
DP (mW)      | 0.396  | 0.391   | 0.410   | 0.613   | 0.526   | 0.503   | 0.522   | 0.740
TP (mW)      | 0.627  | 0.667   | 0.717   | 0.986   | 0.856   | 0.879   | 0.929   | 1.219
CPM          | 1.000  | 1.472   | 1.472   | 1.880   | 1.172   | 1.551   | 1.551   | 2.010
CPM/mW       | 1.595  | 2.207   | 2.053   | 1.906   | 1.369   | 1.765   | 1.670   | 1.649
CPM/mm²      | 51.173 | 62.595  | 56.119  | 58.000  | 41.563  | 48.197  | 44.402  | 48.694

Table 6. Mälardalen execution time (in thousands of clock cycles) for different configurations of HF-RISC (-O2, default GCC optimizations) and power, area and performance trade-offs.

Benchmark | A16SS | B16HS | C16HH | D16FS | E32SS | F32HS | G32HH | H32FS
adpcm | 3900.2 | 2756.2 | 639.1 | 2460.7 | 3824.3 | 2692.9 | 638.5 | 2397.4
bs | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1
bsort100 | 90.8 | 90.8 | 90.8 | 90.8 | 90.7 | 90.7 | 90.7 | 90.7
cnt | 40.3 | 40.2 | 6.3 | 40.2 | 39.2 | 39.2 | 6.3 | 39.2
compress | 25.4 | 25.5 | 8.7 | 25.5 | 24.0 | 24.0 | 7.5 | 24.0
cover | 2.2 | 2.2 | 2.2 | 2.2 | 2.2 | 2.2 | 2.2 | 2.2
crc | 23.0 | 23.0 | 23.0 | 23.0 | 22.7 | 22.7 | 22.7 | 22.7
duff | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8 | 0.8
edn | 889.1 | 230.4 | 230.4 | 84,749 | 818.0 | 223.1 | 223.1 | 77.4
expint | 75.1 | 75.2 | 11.1 | 73.9 | 72.0 | 72.8 | 10.9 | 71.5
fac | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1
fdct | 8.9 | 9.5 | 9.5 | 9.0 | 6.1 | 5.0 | 5.0 | 4.4
fft1 | 334.7 | 184.9 | 184.9 | 177.1 | 260.4 | 115.4 | 115.4 | 107.6
fibcall | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2
fir | 1586.8 | 960.9 | 720.0 | 604.5 | 1232.7 | 939.2 | 708.1 | 582.8
insertsort | 0.9 | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 0.8 | 0.8
janne_complex | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2
jfdctint | 33.7 | 34.8 | 13.3 | 34.2 | 30.7 | 30.9 | 9.8 | 30.3
lcdnum | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1
lms | 16731.1 | 13157.5 | 13157.5 | 12208.8 | 12389.1 | 8907.6 | 8907.6 | 7958.3
ludcmp | 197.6 | 82.9 | 82.9 | 77.1 | 166.2 | 54.1 | 54.1 | 48.4
matmult | 1079.3 | 598.5 | 327.7 | 436.9 | 961.1 | 589.0 | 326.3 | 427.4
minver | 51.2 | 39.8 | 39.8 | 39.1 | 35.4 | 23.1 | 23.1 | 22.4
ndes | 47.8 | 47.6 | 47.6 | 47.6 | 44.5 | 44.5 | 44.5 | 44.5
ns | 7.1 | 7.2 | 7.2 | 7.2 | 6.5 | 6.5 | 6.5 | 6.5
nsichneu | 8.5 | 9.1 | 9.1 | 9.1 | 8.5 | 8.5 | 8.5 | 8.5
prime | 187.3 | 173.8 | 27.83 | 166.2 | 183.9 | 170.3 | 27.8 | 162.7
qsort-exam | 4.6 | 4.6 | 4.6 | 4.6 | 3.4 | 3.4 | 3.4 | 3.4
qurt | 44.2 | 40.0 | 40.0 | 38.3 | 29.3 | 25.2 | 25.2 | 23.5
recursion | 2.6 | 2.6 | 2.6 | 2.6 | 2.9 | 2.9 | 2.9 | 2.9
select | 4.1 | 4.1 | 4.1 | 4.1 | 2.9 | 3.0 | 3.0 | 3.0
sqrt | 21.0 | 19.1 | 19.1 | 18.4 | 13.7 | 11.9 | 11.9 | 11.1
statemate | 1.8 | 1.8 | 1.8 | 1.8 | 1.6 | 1.6 | 1.6 | 1.6
st | 15444.4 | 8271.1 | 7600.3 | 7865.0 | 12441.7 | 5426.4 | 4775.4 | 5020.2
ud | 15.2 | 13.2 | 6.2 | 12.0 | 12.6 | 12.0 | 5.4 | 10.9

Fig. 3. Area, leakage and dynamic power partitioning in all HF-RISC cores.

Another perspective on these results provides further understanding of the costs related to more complex cores. Fig. 3 presents the distributions of area, leakage power and dynamic power, respectively, for the different cores, split into the RF, the multiply and divide hardware and the basic core circuitry (control signals, pipe registers and ALU). As Fig. 3(a) shows, using a 32-GPR RF can cause an area overhead of up to 44%, comparing cores A16SS and E32SS, as in core A16SS the RF represents almost 50% of the core area. In addition, the increase in area has a direct impact on the leakage power shown in Fig. 3(b). Moreover, dedicated circuitry for multiplication and division also has a big impact on area and leakage power. In fact, these units may be completely absent, as in the baseline core A16SS, and represent an increasing area overhead as they grow in complexity. Compared to core A16SS, these blocks present an overhead of 20%, 34% and 66% for cores B16HS (with a serial multiplier), C16HH (with a serial multiplier and divider) and D16FS (with a parallel multiplier), respectively. Dynamic power results are similar to those for area and leakage power, as Fig. 3(c) shows.

The consequences are overheads of up to 32% for increasing the RF and 55% for having more complex multipliers and dividers. This is because more complex circuitry increases switching activity. These results demonstrate the basic costs of increasing core complexity to achieve better performance.

For performance evaluation, the first step was to analyze the simulation results given by CoreMark. Table 5 presents the performance of each core using default compiler optimizations. Performance figures are given in terms of total time, iterations per second and CoreMarks per MHz (CPM). These figures allow understanding which core is more suited for general applications. As expected, the more complex the core, the higher the CPM.
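The overhead percentages quoted above for Fig. 3(a) can be recomputed from the total-area column of Table 5, since cores B16HS, C16HH and D16FS differ from A16SS essentially in the multiply/divide hardware, and E32SS in the RF. A minimal check, with the areas hard-coded from the table:

```c
/* Recomputes the area overheads discussed for Fig. 3(a) from the Table 5 areas.
 * The numbers are copied from Table 5; the helper itself is only illustrative. */
#include <stdio.h>

static double overhead_pct(double base, double other)
{
    return 100.0 * (other - base) / base;
}

int main(void)
{
    const double a16ss = 19546.0, b16hs = 23519.0, c16hh = 26233.0,
                 d16fs = 32408.0, e32ss = 28201.0;   /* areas in um^2 */

    printf("E32SS vs A16SS (32-GPR RF):      %.0f%%\n", overhead_pct(a16ss, e32ss)); /* ~44% */
    printf("B16HS vs A16SS (serial mul):     %.0f%%\n", overhead_pct(a16ss, b16hs)); /* ~20% */
    printf("C16HH vs A16SS (serial mul+div): %.0f%%\n", overhead_pct(a16ss, c16hh)); /* ~34% */
    printf("D16FS vs A16SS (parallel mul):   %.0f%%\n", overhead_pct(a16ss, d16fs)); /* ~66% */
    return 0;
}
```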


Table 7. Mälardalen power, area and performance trade-offs for HF-RISC configurations: best core for each metric. Core names are shortened to their first letter.

Benchmark | EDP | ED²P | E²DP | ADP
adpcm | C | C | C | C
bs | A | A | A | A
bsort100 | A | A | A | A
cnt | C | C | C | C
compress | C | G | C | C
cover | A | A | A | A
crc | A | A | A | A
duff | A | A | A | A
edn | D | D | D | D
expint | C | C | C | C
fac | A | A | A | A
fdct | F | F | A | F
fft1 | F | F | B | F
fibcall | A | A | A | A
fir | C | D | C | C
insertsort | A | A | A | A
janne_complex | A | A | A | A
jfdctint | G | G | C | G
lcdnum | A | A | A | A
lms | F | F | B | F
ludcmp | F | F | B | F
matmult | C | C | C | C
minver | F | F | B | F
ndes | A | A | A | A
ns | A | A | A | A
nsichneu | A | A | A | A
prime | C | C | C | C
qsort-exam | E | E | A | A
qurt | F | F | A | F
recursion | A | A | A | A
select | E | E | A | A
sqrt | F | F | A | F
statemate | A | A | A | A
st | G | G | B | G
ud | C | G | C | C

Note that our most complex core, the H32FS, has a CPM of 2.010, which is comparable to an ARM1176 processor, with a CPM of 2.078. However, an important aspect shown by this analysis is the impact of the RF size. If we compare the pairs of cores that differ only in this aspect, it is clear that having a bigger RF provides only modest improvements in performance, around 5%, when dedicated multiplication/division units are used (B16HS/F32HS, C16HH/G32HH and D16FS/H32FS). However, when multiply/divide hardware is not present, as in cores A16SS and E32SS, the size of the RF has a greater impact, in this case 17%. This is justified by the need to perform multiplication/division operations in software, which makes heavy use of the register file.

Another important metric that adds value to the analysis is the trade-off between performance, power and area. Accordingly, we present performance, power and area efficiency using the CPM/mW and CPM/mm² metrics in the last two rows of Table 5. As this table shows, albeit bigger RFs in isolation produce performance gains, they compromise power and area efficiency in all cases. For instance, core B16HS presents 23% higher CPM/mW and 30% higher CPM/mm² when compared to core F32HS. This is justified by the fact that the RF represents a big portion of the total area of the cores, as Fig. 3(a) shows. In other words, the improvements provided by bigger RFs have a very high cost in terms of power and area. For different levels of support for multiplication/division operations, the best power and area efficiency is obtained when employing a basic serial multiplier only, if using a small RF. This appears in Table 5, where core B16HS provides better trade-offs than A16SS, C16HH and D16FS.
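For reference, the efficiency rows of Table 5 are simple ratios of the CPM, total power and area columns; a minimal sketch for core B16HS (values copied from the table):

```c
/* Recomputes the efficiency metrics of Table 5 for one core (B16HS).
 * Expected results: ~2.21 CPM/mW and ~62.6 CPM/mm^2, matching the table
 * up to rounding of the inputs. */
#include <stdio.h>

int main(void)
{
    const double cpm = 1.472;            /* CoreMarks per MHz (Table 5) */
    const double total_power_mw = 0.667; /* TP (mW), Table 5            */
    const double area_um2 = 23519.0;     /* area (um^2), Table 5        */
    const double area_mm2 = area_um2 * 1e-6;

    printf("CPM/mW  = %.3f\n", cpm / total_power_mw); /* 2.207 in Table 5  */
    printf("CPM/mm2 = %.3f\n", cpm / area_mm2);       /* 62.595 in Table 5 */
    return 0;
}
```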

7

Similar results are observed when employing a big RF; see the results for cores E32SS to H32FS in the table. However, in this case, a parallel multiplier provides slightly better area efficiency, albeit it is not as good in power efficiency. These results show that a generic processor core can be tailored for high performance, power efficiency or area efficiency. The first of these is achieved by employing a large RF coupled to a very fast multiplier and should target the biggest CPM possible, as is the case for core H32FS. The drawbacks are higher power and area. For power and area efficiency, a small RF coupled to a simple hardware multiplier (e.g. a serial multiplier) is the best option. For such designs, CPM/mW and CPM/mm² must be the target, as achieved by core B16HS. In short, core B16HS is the ideal candidate for generic applications. Compare this to a state-of-the-art industry solution, the ARM Cortex-M0, which presents a CPM/mW of 2.284. Assuming a similar operating frequency in its most advanced technology node, a 45 nm technology, core B16HS provides similar energy efficiency. Note that these results consider a more advanced technology node for the Cortex-M0, which indicates that further improvements can be obtained for the proposed cores at more advanced technologies.

Albeit CoreMark is an industrial standard for processor evaluation, it does not include division operations and is too general to enable the understanding of the specific requirements of some applications. Hence, to provide a deeper analysis of the trade-offs with a focus on application-tailored cores, we also use another set of benchmarks, a comprehensive part of the Mälardalen suite. Table 6 summarizes the results obtained for each benchmark on each core. The table shows the number of clock cycles (in thousands) required to execute each benchmark. Based on the obtained results, Table 7 scrutinizes area and performance trade-offs by identifying the most efficient core for each benchmark using three commonly adopted metrics: energy delay product (EDP), energy delay squared product (ED²P) and area delay product (ADP). We also added an extra metric: energy squared delay product (E²DP). This metric is commonly used by circuit designers and allows assessing trade-offs with a focus on low power, which is frequently a requirement for IoT applications. A summary of the frequency distribution of the best results presented in Table 7 is shown in Fig. 4. For assessing area trade-offs we rely on the post-synthesis area and power characterization described in Section 4.

As Fig. 4 shows, core A16SS is the best in terms of EDP, ED²P, E²DP and ADP in most cases. In fact, it presents the best results in 43% of the benchmarks for EDP and ED²P and in 48% of the benchmarks for ADP. These results suggest that for general applications, having a simple core with a small RF is the best option. For applications that present a larger number of arithmetic operations, like FIR filters, having dedicated blocks for multiplication and division in a core with a small RF was the best option, and core C16HH was the most suited for these benchmarks. This is different from CoreMark, where core B16HS was the most energy efficient. However, it is important to highlight that the benchmarks evaluated from the Mälardalen suite either do not rely on multiplication operations or, when they do, they also rely on division operations. In this way, the distribution of best efficiency between cores A16SS and C16HH is justified mostly by the fact that these cores are very similar to core B16HS, but without a dedicated multiplier and with a multiplier and divider unit, respectively.
A big RF was more efficient only for specific applications that make aggressive use of multiplication and division operations, such as fft1 or sqrt. In fact, it is important to highlight that even when delay plays a major role, as in the ED²P figures, having a small RF was not only sufficient but more energy efficient. Moreover, when energy is more relevant, as in the E²DP results, a small RF is more efficient for all benchmarks. These results indicate that, in the IoT era, having small RFs is the best option. Also, application-specific improvements can be obtained by optimizing dedicated computation blocks, such as the multiplier and divider units.
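The metrics in Table 7 combine the per-benchmark execution time with the power and area characterization of Section 4. The sketch below illustrates how such figures can be computed; consistently with Section 4, the power value is the CoreMark-annotated total power of Table 5, and the specific numbers (matmult on A16SS) are used purely as an example of the calculation, not as figures reported by the authors.

```c
/* Illustrative computation of the Table 7 metrics for one benchmark on one core.
 * Delay is derived from the cycle count at the 200 MHz target frequency; energy
 * is average power times delay. Example inputs: matmult on A16SS, 1079.3 kcycles
 * (Table 6), 0.627 mW and 19,546 um^2 (Table 5). */
#include <stdio.h>

int main(void)
{
    const double f_hz     = 200e6;          /* target clock frequency              */
    const double cycles   = 1079.3e3;       /* Table 6 entry (thousands of cycles) */
    const double power_w  = 0.627e-3;       /* total power, Table 5                */
    const double area_mm2 = 19546.0 * 1e-6; /* area, Table 5                       */

    const double delay  = cycles / f_hz;    /* execution time in seconds */
    const double energy = power_w * delay;  /* energy in joules          */

    printf("EDP  = %.3e J*s\n",   energy * delay);
    printf("ED2P = %.3e J*s^2\n", energy * delay * delay);
    printf("E2DP = %.3e J^2*s\n", energy * energy * delay);
    printf("ADP  = %.3e mm2*s\n", area_mm2 * delay);
    return 0;
}
```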

analysis is that a core with reduced hardware complexity enables the best power efficiency for generic applications. Moreover, the design of a complete method for programming such processors is a complex task. In this way, maintaining compatibility, with a single method for all versions of several application specific tailored cores is an important aspect for exploring their usage in IoT applications. Ongoing work includes the use of the proposed cores on multi-processor systems-on-chip, allowing the exploration of hybrid solutions, where multiple versions of the core are integrated in a single system. Finally, we have recently taped out a microcontroller using a modified version of the E32SS core, with a set of peripherals and programmable memory. The chip was tested and we verified its correct operation, validating HF-RISC on silicon. References

Fig. 4. Frequency distribution of the best values related to Table 7.

6. Conclusions Optimizing processors for IoT applications where power is increasingly constraining is a challenging task. This article explored how to allow such optimizations while maintaining a generic flow for synthesis and programming of a chosen 32-bit RISC processor. As expected, the obtained results show that optimizations can be obtained by improving specific hardware blocks, depending on the application. However, an important insight obtained from the

[1] L. Atzori, A. Iera, G. Morabito, The Internet of Things: a survey, Comput. Netw. 54 (15) (2010) 2787–2805.
[2] J. Gubbi, R. Buyya, S. Marusic, M. Palaniswami, Internet of Things (IoT): a vision, architectural elements, and future directions, Futur. Gener. Comput. Syst. 29 (7) (2013) 1645–1660.
[3] J. Yiu, The Definitive Guide to the ARM Cortex-M0, Newnes, 2011.
[4] ARM, Cortex-M0+ Technical Reference Manual, Technical Report, 2012. URL: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0484b/DDI0484B_cortex_m0p_r0p0_trm.pdf
[5] C. Banz, C. Dolar, F. Cholewa, H. Blume, Instruction set extension for high throughput disparity estimation in stereo image processing, in: Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2011, pp. 169–175.
[6] L. Bauer, M. Shafique, J. Henkel, Efficient resource utilization for an extensible processor through dynamic instruction set adaptation, IEEE Trans. Very Large Scale Integr. Syst. 16 (10) (2008) 1295–1308.
[7] H. Tabkhi, R. Bushey, G. Schirner, Function-level processor (FLP): a high performance, minimal bandwidth, low power architecture for market-oriented MPSoCs, IEEE Embed. Syst. Lett. 6 (4) (2014) 65–68.
[8] L. Gauthier, T. Ishihara, Processor energy characterization for compiler-assisted software energy reduction, J. Electr. Comput. Eng. 2012 (2012) 8:8–8:8.
[9] M. Schellmann, S. Gorlatch, D. Meiländer, T. Kösters, K. Schäfers, F. Wübbeling, M. Burger, Parallel medical image reconstruction: from graphics processing units (GPU) to Grids, J. Supercomput. 57 (2) (2011) 151–160.
[10] The Embedded Microprocessor Benchmark Consortium (EEMBC), http://www.eembc.org. Accessed: 2014-12-20.
[11] J. Gustafsson, A. Betts, A. Ermedahl, B. Lisper, The Mälardalen WCET benchmarks: past, present and future, in: Proceedings of the International Workshop on Worst-Case Execution Time Analysis (WCET), 2010, pp. 136–146.
[12] L.-B. Chen, C.-T. Yeh, H.-Y. Chen, I.-J. Huang, A system-level model of design space exploration for a tile-based 3D graphics SoC refinement, IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E92.A (12) (2009) 3193–3202.
[13] J.K. Kim, T.G. Kim, A plan-generation-evaluation framework for design space exploration of digital systems design, IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E89-A (3) (2006).
[14] M.-C. Chiang, T.-C. Yeh, G.-F. Tseng, A QEMU and SystemC-based cycle-accurate ISS for performance estimation on SoC development, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 30 (4) (2011).
[15] J. Lee, K. Choi, N.D. Dutt, Energy-efficient instruction set synthesis for application-specific processors, in: Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003, pp. 330–333.
[16] J. Constantin, A.P. Burg, F.K. Gurkaynak, Instruction set extensions for cryptographic hash functions on a microcontroller architecture, in: Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2012, pp. 117–124.
[17] M. Labrecque, P. Yiannacouras, J.G. Steffan, Custom code generation for soft processors, SIGARCH Comput. Archit. News 35 (3) (2007) 9–19.
[18] P. Yiannacouras, J. Rose, J.G. Steffan, The microarchitecture of FPGA-based soft processors, in: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), 2005, pp. 202–212.
[19] O. Azizi, A. Mahesri, B.C. Lee, S.J. Patel, M. Horowitz, Energy-performance tradeoffs in processor architecture and circuit design: a marginal cost analysis, in: Proceedings of the Annual International Symposium on Computer Architecture (ISCA), 2010, pp. 26–36.
[20] J. Hennessy, N. Jouppi, F. Baskett, J. Gill, MIPS: a VLSI processor architecture, in: VLSI Systems and Computations, 1981, pp. 337–346.
[21] X. Guan, Y. Fei, Register file partitioning and compiler support for reducing embedded processor power consumption, IEEE Trans. Very Large Scale Integr. Syst. 18 (8) (2010) 1248–1252.

Sergio F. Johann received a B.S. degree in Computer Science from the University of Santa Cruz do Sul (UNISC, Brazil) in 2004 and M.Sc. and Ph.D. degrees in Computer Science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS, Brazil) in 2008 and 2012, respectively. He is currently an Assistant Professor at PUCRS and an embedded systems researcher and developer. He has experience in Computer Science with a focus on the following main topics: real-time operating systems, real-time systems and applications, embedded system design, computer architecture, multiprocessor systems-on-chip, networks-on-chip, application modeling and mapping, control systems and RFID systems.

Matheus T. Moreira received his bachelor's degree in Computer Engineering from the Pontifical Catholic University of Rio Grande do Sul (PUCRS), Brazil, in 2011 and his M.Sc. degree in Computer Science in 2012, and is currently a Ph.D. candidate at the same university. He is also currently an Assistant Professor at PUCRS. His research interests include asynchronous circuit design, networks-on-chip and multi-processor systems-on-chip. He is a student member of the IEEE.

Ney L. V. Calazans received the Ph.D. degree in Microelectronics from the Université Catholique de Louvain (UCL), Belgium, in 1993, and the M.Sc. degree in Computer Science from the Federal University of Rio Grande do Sul (UFRGS), Brazil, in 1988. He is currently a Professor at the Pontifical Catholic University of Rio Grande do Sul (PUCRS). His research interests include non-synchronous circuits, intrachip communication networks, and computer-aided design techniques and tools. Professor Calazans is a Senior Member of the IEEE and a member of the Brazilian Computer Society (SBC) and the Brazilian Society of Microelectronics (SBMicro).

Fabiano P. Hessel is a Professor of Computer Science at the Pontifical Catholic University of Rio Grande do Sul (PUCRS), Brazil. He received his Ph.D. in Computer Science from Université Joseph Fourier, TIMA Laboratory, France. He is the head of the Embedded Systems Group. He was Associate Editor of the ACM Transactions on Embedded Computing Systems – Special Issue on Rapid System Prototyping, and General Chair and Program Chair of RSP (2007, 2008, 2011). He has several publications in prestigious conferences and journals, as well as book chapters and books. His research interests are embedded real-time systems, real-time operating systems and MPSoCs.

Leandro S. Heck received his bachelor's degree in Computer Engineering from the Pontifical Catholic University of Rio Grande do Sul (PUCRS), Brazil, in 2012 and his M.Sc. degree in Computer Science in 2012, and is currently a Ph.D. candidate at the same university. His research interests include GALS designs, metastability and asynchronous circuit design. He is a student member of the IEEE.
