Highlights
This paper is original in that it serves as a directive, a reference and a motivation for researchers in the area of neural networks.
This paper analyzes the design requirements and characteristics of existing topologies to finally propose development strategies and implementation architectures for better use of FPGA-based deep learning topologies.
The literature reviewed in this paper is very recent (many references are from 2018).
Taxonomy of FPGA-Based Topologies for Future Deep Learning Architectures

Ahmed Ghazi BLAIECH, Khaled BEN KHALIFA, Carlos Alberto VALDERRAMA SAKUYAMA, Marcelo A. C. FERNANDES, Mohamed Hedi BEDOUI
Abstract
Deep learning, the fastest growing segment of Artificial Neural Networks (ANNs), has led to the emergence of many machine learning applications and their implementation across multiple platforms such as CPUs, GPUs and reconfigurable hardware (Field-Programmable Gate Arrays or FPGAs). However, large-scale deep learning topologies, inspired by the structure and function of ANNs, require a considerable amount of parallel processing, memory resources, high throughput and significant processing power. Consequently, in the context of real-time hardware systems, it is crucial to find the right trade-off between performance, energy efficiency, fast development and cost. Although limited in size and resources, several approaches have shown that FPGAs provide a good starting point for the development of future deep learning implementation architectures. In this paper, we briefly review recent work related to the implementation of deep learning algorithms on FPGAs. We analyze and compare the design requirements and features of existing topologies to finally propose development strategies and implementation architectures for better use of FPGA-based deep learning topologies. In this context, we examine the frameworks used in these studies, which allow testing many topologies in order to arrive at the best implementation alternatives in terms of performance and energy efficiency.
Keywords: Deep Learning, Framework, Optimized implementation, FPGA.
1. Introduction
In recent years, research on Artificial Neural Networks (ANNs) has shown promising results in speech recognition [1], machine translation [2] and scene analysis [3]. Deep learning algorithms are becoming an increasingly popular way to learn from sequences of data [4] on which decisions can be made. Thus, deep learning is now an excellent tool for object detection [5, 6], image segmentation [7, 8], data classification [9] and data prediction [10] in several application fields.
The architecture of deep learning algorithms is similar to that of classical neural networks, but it works with many hidden layers. During training, these models begin with a feature extraction phase followed by a full connection one to predict network outputs. The topology is a stack of many convolutional layers, in addition to pooling, non-linearity and classifier layers. Usually, the classification layer uses traditional neural networks, such as the Multi-Layer Perceptron, Support Vector Machine and others. Despite the great success and popularity of deep learning algorithms, their use and implementation still pose many problems due to the large amount of data used, mainly during the training phase. Indeed, successful training of a network composed of several layers involves the use of large datasets. For example, the ImageNet dataset [11], frequently utilized for object recognition, contains about 14 million tagged images. With such datasets, each layer will be involved in many neural calculations. This increases the complexity of the designer's tasks to
train, test and generate the final implementation. Therefore, development frameworks have been designed to simplify and automate these tasks, and many high-performance platforms are now available.
CPU-GPU deep learning platforms have become a common alternative to custom hardware in signal processing because they are comparatively inexpensive, powerful and easily programmable, especially for training purposes [12, 13]. Such platforms allow the fast processing of hundreds of signals simultaneously. However, the network implementation on CPUs is too slow, while it consumes a lot of energy on GPUs. Thus, custom architectures and development methods adapted to deep learning algorithms can give better performance, especially during the prediction phase, for embedded systems with strong energy consumption restrictions [14]. Commercial ASIC-based alternatives are available today. For instance, Google services use the so-called Tensor Processing Unit (TPU), an Application-Specific Integrated Circuit (ASIC), to accelerate machine learning workloads with TensorFlow. Indeed, a TPU (with a 28nm process that runs at 700MHz, a 12.5GB/s bandwidth, and 40W consumption) is an architecture dedicated to neural network computation that can deliver, within its context, higher performance and greater performance-per-Watt than a CPU, a GPU or an FPGA. As defined by Norman Jouppi, “The TPU is programmable like a CPU or GPU. It is not designed for just one neural network model; it executes Complex Instruction Set Computer (CISC) instructions on many networks (convolutional, Long Short-Term Memory (LSTM) models, and large, fully connected models). So it is still programmable, but uses a matrix as a primitive instead of a vector or scalar.” This dedicated architecture uses quantization (integer approximation) to reduce the amount of memory and computing resources (weight matrix multiplication and activation function) needed for predictions with neural network models. Its matrix processor (instead of GPU vector processing), a systolic array optimized for operation density, applies fixed-pattern operations to a single input. It is a programmable CISC focusing on the major mathematical operations required to run many different kinds of neural network models. A compiler and software stack translates API calls from TensorFlow graphs into TPU instructions. This is an interesting approach considering that the characteristics and performance requirements of neural networks are quite specific compared to generic processors such as CPUs, GPUs and FPGAs. Targeting embedded and edge computing, Intel proposed the Movidius Myriad VPU (Vision Processing Unit), a programmable chip with a dedicated hardware accelerator for deep neural network inference (the neural compute engine) in computer vision [15]. Movidius includes an ultra-high-throughput intelligent memory fabric (PIM) with extensive support for sparse data structures. In the academic world, a significant contribution was the hardware accelerator called EIE (Energy-efficient Inference Engine), suggested in [16], which directly works on a compressed neural network model in order to obtain substantial accelerations and energy savings.
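To make the quantization idea concrete, the following minimal Python sketch (an illustration under our own assumptions, not the TPU's actual scheme) maps floating-point weights to 8-bit integers with a single per-tensor scale, as is commonly done for inference:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of floating-point weights to int8.

    Returns the integer weights and the scale needed to recover
    approximate real values (w ~ scale * q)."""
    scale = np.max(np.abs(weights)) / 127.0   # map the largest magnitude onto 127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate floating-point values from the int8 representation."""
    return q.astype(np.float32) * scale

# Example: quantize a small weight matrix and check the approximation error
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))
```

Multiply-accumulate operations can then be carried out on the integer values, with the scale folded back in only at the output, which is what reduces both the memory footprint and the arithmetic cost.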
To better understand the challenges that have motivated this evolution, we consider in this work recent research targeting FPGAs for the hardware acceleration of deep learning algorithms. Indeed, like software approaches, an FPGA is a reconfigurable medium whose components, processing elements, memory units and interconnections can change functions, prior to or at runtime, while performing a particular program or part of a program. Taking advantage of this reconfigurability, FPGAs offer the possibility of finding the best compromise between consumption and performance by analyzing the solution space, thanks to the number of degrees of freedom available on a network, and by testing the network directly on the device. Furthermore, in the particular context of embedded systems, FPGAs perform better than GPUs, for which power consumption is the main problem despite their high performance. We present the
deep learning implementation works on FPGAs, not to replace GPU implementations but to highlight the advantages of FPGAs over GPUs and CPUs. We will summarize the elements and technical constraints adopted in the literature focusing on the hardware acceleration of Convolutional Neural Networks (CNNs), considered as a class of Deep Neural Networks (DNNs). We will begin by exploring the different strategies that can be adopted before the implementation of a CNN. This critical phase depends mainly on the requirements and type of application. We will also examine CNN-based hardware acceleration design flows. During this phase, the designers choose, in addition to the design flow, the language and development tools, the performance evaluation criteria and, in particular, the granularity of the hardware solution. Towards the end, we will explore the various recent integration technologies for high-performance and/or energy-efficient neural networks. This phase will enable us to compare the different CNN hardware accelerator concepts and the methods used to estimate the performance of these FPGA-based accelerators.
While similar surveys can be found in [17,18], we focus in this paper on recent techniques that were not covered in previous work. Besides, recent reviews of CNN implementations have been proposed in [19,20], but they focus only on the acceleration of CNNs, whereas our work covers all aspects, such as context, topology, implementation strategy, development flow and performance gain, exposing the solutions best suited to these criteria and deriving the adopted future strategies after analyzing them.
This paper is organized as follows: Section 2 describes the use context, which exposes the importance of deep learning in several domains, the popular topologies of deep learning, and their implementation on many supports such as CPUs, GPUs and FPGAs. Section 3 presents the state of the art of deep-learning implementation strategies on FPGAs, focused on the many parameters and design steps adapted to the application. Section 4 describes the development flows specified by the designer to generate the best implementation performance. Section 5 derives the adopted future strategies by selecting and analyzing the interesting solutions and proposing the best directives for researchers interested in the FPGA-based deep-learning field. Section 6 concludes and presents the future perspectives of this work.

2. Context
Deep learning is now playing an important role in several domains, having already been used in many contexts. Indeed, it is used for the recognition of patterns from the Iris [21, 22], ImageNet [11] and CIFAR-10 [23,24] datasets for common images, and from the MNIST [25] and SVHN [26] datasets for handwritten digits. Deep learning is also utilized for character-level language models, which predict the next character given the previous one [27].
The challenges and opportunities for deep learning are particularly important in the field of medical applications. In this context, deep learning can automate multiple tasks [28], such as making better estimates and decisions for detection, localization and segmentation (lung cancer, detection of pulmonary nodules, diagnosis of colonic polyps, or breast cancer screening) [7, 8], making predictions (trying to derive new medical information that is not currently available from existing clinical data) and gathering information to detect more than one anomaly [29]. Nevertheless, the training and testing of medical datasets require the use of large networks, such as CNNs and DNNs.
2.1 Different topologies of deep learning
CNNs and DNNs are presented as the most popular deep learning families [17]. They consist of methods allowing computers to learn from massive datasets with exceptional accuracy. Both are powerful tools that use many layers and filters to hierarchically extract, for instance, meaningful features from objects in images. They usually consist of stacks of multiple convolution layers combined with pooling, normalization and classifier layers. Neurons in a DNN convolutional layer have private synaptic weights, while neurons in CNNs share a common kernel in the same output feature map.
The convolution layer extracts simple features like edges and curves from images by using a number of filters. Each filter consists of $K_x \times K_y$ coefficients as a subset of the synaptic weights of the layer. Many filters cover different areas of the input feature maps by strides of $S_x$ and $S_y$, and produce output neurons forming output feature maps (i.e. the input feature maps of the next layer). An output neuron at position $(x,y)$ of the output feature map $f_o$ is formally computed as follows:

$$ n^{f_o}_{x,y} = f\left( b^{f_o} + \sum_{f_i \in I_{f_o}} \sum_{k_x=0}^{K_x-1} \sum_{k_y=0}^{K_y-1} w^{f_o,f_i}_{x,y,k_x,k_y} \, n^{f_i}_{x S_x + k_x,\; y S_y + k_y} \right) $$

where $f(\cdot)$ denotes a non-linear activation function, such as Sigmoid and Tanh, or more generally a ReLU layer (described by $y = \max(0, x)$) [30], $I_{f_o}$ indicates the input feature maps used to produce the output feature map $f_o$, and $b^{f_o}$ and $w^{f_o,f_i}_{x,y,k_x,k_y}$ respectively represent the bias and kernel values between the input feature map $f_i$ and the output feature map $f_o$. This formula applies to neurons in a DNN convolutional layer with private synaptic weights. In a CNN, neurons of the same output feature map share a common kernel, so that the synaptic weight $w^{f_o,f_i}_{x,y,k_x,k_y}$ in the formula must be replaced by a simpler one: $w^{f_o,f_i}_{k_x,k_y}$. Due to this difference, the synaptic weights (i.e. parameters) in a DNN convolutional layer are considerably more numerous (about $X_o \times Y_o$ times more, with $X_o \times Y_o$ the size of an output feature map) than those in a CNN layer having the same size.
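The following Python sketch (purely illustrative; variable names are ours) computes one output feature map of a CNN convolutional layer with shared kernels, following the formula above:

```python
import numpy as np

def conv_output_map(in_maps, kernels, bias, Sx, Sy):
    """Compute one output feature map of a CNN convolution layer.

    in_maps : array (Fi, H, W)   -- input feature maps
    kernels : array (Fi, Kx, Ky) -- one shared kernel per input feature map
    bias    : scalar bias of this output feature map
    Sx, Sy  : strides along x and y
    """
    Fi, H, W = in_maps.shape
    _, Kx, Ky = kernels.shape
    Xo, Yo = (H - Kx) // Sx + 1, (W - Ky) // Sy + 1
    out = np.zeros((Xo, Yo))
    for x in range(Xo):
        for y in range(Yo):
            acc = bias
            for fi in range(Fi):            # sum over input feature maps
                for kx in range(Kx):        # sum over the kernel window
                    for ky in range(Ky):
                        acc += kernels[fi, kx, ky] * in_maps[fi, x * Sx + kx, y * Sy + ky]
            out[x, y] = max(0.0, acc)       # ReLU as the non-linear function f(.)
    return out
```

In a DNN convolutional layer, `kernels` would additionally be indexed by the output position (x, y) (private weights), which is exactly what multiplies the parameter count by the output feature-map size.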
The pooling layer, following the convolutional layer, provides scale invariance, makes the network more tolerant to image distortion and reduces feature map dimensions. The pooling consists in selecting the maximum or average of some input neurons partitioned into non-overlapping pooling windows of size $K_x \times K_y$. The formula of the maximum pooling output is:

$$ n^{f_o}_{x,y} = \max_{0 \le k_x < K_x,\; 0 \le k_y < K_y} \; n^{f_i}_{x K_x + k_x,\; y K_y + k_y} $$

where $f_o$ is equal to $f_i$ due to the one-to-one mapping relationship between input and output feature maps. If the average pooling is calculated, the maximum operator indicated above must be replaced by an average operator. After pooling, the normalization layer attempts to simulate the lateral inhibition of biological neurons by performing a competition among neurons occupying the same location in different input feature maps. There are two normalization layer types, i.e. Local Contrast Normalization (LCN) [31] and Local Response Normalization (LRN) [30]. We only introduce LRN, as follows:

$$ n^{f_o}_{x,y} = n^{f_i}_{x,y} \Big/ \left( k + \alpha \sum_{f_j} \big( n^{f_j}_{x,y} \big)^2 \right)^{\beta} $$

where $\alpha$, $\beta$ and $k$ are specified parameters, the sum runs over the feature maps $f_j$ adjacent to $f_i$, and $M$ is the number of neighbours next to the input feature map $f_i$.
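A minimal sketch of the maximum pooling described above (our own illustrative code) over non-overlapping Kx × Ky windows is:

```python
import numpy as np

def max_pool(feature_map, Kx, Ky):
    """Non-overlapping max pooling of a single feature map of shape (H, W)."""
    H, W = feature_map.shape
    Xo, Yo = H // Kx, W // Ky
    out = np.empty((Xo, Yo))
    for x in range(Xo):
        for y in range(Yo):
            window = feature_map[x * Kx:(x + 1) * Kx, y * Ky:(y + 1) * Ky]
            out[x, y] = window.max()    # use window.mean() for average pooling
    return out
```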
Finally, the Fully Connected (FC) layers consist of dense connections between neurons and typically contain the bulk of the network weights. The computation of FC layers corresponds to a matrix multiplication followed by the addition of an optional bias parameter to each output. Formally, the output neuron $n_o$ can be computed with:

$$ n^{n_o} = f\left( b^{n_o} + \sum_{n_i} w^{n_o,n_i} \, n^{n_i} \right) $$

where $f(\cdot)$ is a ReLU or another activation function, $w^{n_o,n_i}$ is the synaptic weight between input neuron $n_i$ and output neuron $n_o$, and $b^{n_o}$ is the bias value owned by the output neuron $n_o$.
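Since an FC layer reduces to a matrix multiplication plus a bias, a short NumPy sketch (illustrative only) is:

```python
import numpy as np

def fc_layer(x, W, b):
    """Fully connected layer: y = f(W x + b), here with ReLU as f."""
    return np.maximum(0.0, W @ x + b)

# Example: 256 input neurons, 10 output neurons
x = np.random.randn(256)
W = np.random.randn(10, 256)   # one row of synaptic weights per output neuron
b = np.zeros(10)
y = fc_layer(x, W, b)
```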
The AlexNet network, one of the first CNNs successfully trained on the ImageNet dataset, can be used as a typical network example. AlexNet consists of the following layers: convolution, max pooling, FC, ReLU and LRN.

2.2 Deep learning implementation on CPU/GPU
CNNs and DNNs have become particularly useful for embedded applications in autonomous robots, mobile phones and automobiles, which require real-time execution and high accuracy. However, these algorithms are computationally very expensive because much time is spent on convolution operations [32].
The CNN/DNN implementation is a difficult task, especially for embedded systems, because implementations on CPUs are too slow, which motivates optimization strategies in terms of computational speed [25]. Other software-based CNNs use GPUs [12,13]. Unfortunately, as GPUs consume a lot of energy, they are not suitable for all embedded systems [9]. Thus, implementation alternatives based on FPGAs have been proposed for low-power real-time embedded systems.
2.3 Deep learning implementation on FPGA
Reconfigurable hardware (FPGAs) provides a significant amount of logic resources suitable for accelerating compute-intensive applications like CNNs. The programmability and reconfigurability of FPGAs make it possible to evaluate a custom design faster than with ASICs, and therefore to explore many more implementation alternatives. These features allow us to find the optimal model to solve a given problem with multiple trade-offs, requiring several attempts. FPGAs can provide better performance with significantly less power consumption, which is essential for mobile and embedded platforms. For instance, the authors in [14] reported that, in terms of performance per power, an FPGA-based CNN was about 10 times more efficient than a GPU-based one. Indeed, FPGAs can use binary precision, which reduces hardware resources for a higher clock rate, while CPUs and GPUs cannot.
Increasing the number of layers raises the accuracy of CNN/DNN classification. Thus, size and processing power are essential factors that can be managed with FPGAs, unlike CPUs. However, the network topology and size can become a problem for FPGAs in terms of resources and interconnectivity.

3. Implementation strategies

In order to better understand the development steps of deep learning on FPGAs, it is important to set the strategy to adopt according to the application requirements.
As we will see in detail, there are many parameters and design choices to consider, such as flexibility versus optimization (which brings us to configurability instead of customization), a generic or adapted topology, data types such as binarized, fixed-point or floating-point weights, a training or deployment network, and optimization parameters such as memory, processing power and execution speed.

3.1 Flexibility versus optimization
The development process of FPGA-based deep learning can start from high-level description languages such as OpenCL or C/C++, supported by commercial ESL (Electronic System Level) tools [33,34] and dedicated development frameworks [35, 36]. These software-oriented approaches support fast design exploration, bringing flexibility and hardware abstraction. Traditional low-level design tools, using VHDL or Verilog, are still available to specialists to provide optimized solutions.
Flexibility and optimization can be combined in terms of design steps. For instance, at the software and algorithmic level, the Amazon Elastic Compute Cloud (Amazon Web Services EC2) F1 instances are Xilinx FPGAs reconfigured to accelerate data workloads supporting machine learning inference, video processing and genomics (among other services) [37]. Considering optimization, the design methodology of these instances puts the Xilinx development flow at the service of cloud users and hardware IP developers. This strategy allows, starting from OpenCL or C++ source code, a quick deployment of custom machine learning hardware acceleration providing 90x higher performance than CPUs [38]. This strategy has also promoted the advent of Mipsology, a class-leading FPGA-based acceleration solution for deep learning and neural network inference [39]. At a lower scale, scientific research [35,36] is still behind that momentum, so tools and architectural strategies must be offered to provide configurability, scalability and optimization strategies.

3.2 Configurable versus customized implementation
To provide the best prediction with deep learning, it is worth evaluating a lot of network topologies, preferably targeting the same FPGA support. On the other hand, a custom implementation will force the designer to iterate over a single network topology. In both cases, it becomes essential to know how changes will affect performance, in addition to understanding the fit of the network topology. To tackle this problem, various studies have been conducted to propose automatic mapping solutions that take into account the complexity of the neural network architecture and the type of predefined elements contained in an FPGA. In a global sense, mapping algorithms associate functionalities with predefined library components, as with FPGAs at a lower level when allocating available hardware resources, whilst an iterative incremental design gradually unveils the underlying parameters. Thus, two basic strategies are put forward, which we designate as hardware and software design automation.
Hardware design automation includes different hardware design approaches that take into account a neural model to accelerate. These approaches have been widely adopted in FPGA-based accelerators due to their reconfigurability property [40, 41, 42, 43, 44, 45]. Most of the suggested approaches rely on the automatic generation of Hardware Description Language (HDL) representations based on neural network parameters. The difference between these methods lies in the choice of the intermediate-level representation of the network used to bridge the gap between the high- and low-level designs. Accordingly, hardware design automation directly modifies the use of hardware resources to support the network topology up to the physical limits given by the target architecture constraints. In this way, the network can get the best performance on the target platform. FPGAs are suitable for this type of application because of their reconfigurability. For example, in [45], the authors proposed a framework composed of several algorithms optimizing the computation complexity and reducing the number of interconnections. They also suggested a modelling methodology for better resource utilization and a tool to automatically generate a customized Verilog hardware representation of a desired CNN.
Some software design strategies consist of executing different CNNs on the same hardware support simply by using instruction sequences to modify the configuration parameters. The proposed solutions are based on a generic but already implemented network, with a certain granularity supporting its configuration. These approaches have the advantage of providing a quick transition between network choice and hardware deployment. The main difference between the various approaches lies essentially in the degree of granularity of the instructions, which greatly determines the flexibility of the result. Starting from this principle, the authors in [46] proposed a set with only three types of instructions. Their solution allowed carrying out a static configuration during compilation while adopting a dynamic strategy of data reuse on each layer of the network. Unlike [46], the authors in [47] used a set of instructions for each layer. Each layer relied on configurable Finite State Machines (FSMs) to schedule the execution of the instructions specific to that layer. This solution permitted reducing accesses to the instruction memory and reusing resources, while requiring a certain resource overhead due to the need for configurable FSMs. These instruction-based software methods are usually independent of the hardware support, hence enabling the accelerator to switch from one network to another at runtime.
To find a compromise between hardware and software solutions, the authors in [48] put forward a design method combining both approaches. Their idea was to perform an optimization phase combined with the compilation of instructions. The hardware architecture was first described using predefined HDL models according to the network settings. The data control flow of the network process was controlled by code fragments compiled according to the network description. With this approach, it was still possible to reuse the hardware support with a new network by simply modifying the software code fragments. This method made it possible to gain in hardware resources, because it used a preselected network, and to gain in overall (deployment and execution) temporal performance.

3.3 Training versus final implementation

The deep learning development flow is carried out through a training phase to compute the network weights. During this process, the network is in a state where several parameters are simultaneously updated. After training, these parameters become fixed configuration values, and the network is ready for decision making.
The training process, although interesting to accelerate because it is the most expensive activity in terms of development time, remains very difficult for FPGAs since many operations must be performed massively and simultaneously. Thus, until now, data-stream parallel platforms have been preferred, supported by multiple and fast memory accesses. However, it is more convenient to transfer the updated network to an FPGA so as to implement the prediction phase.

3.4 Runtime programmable versus preconfigured topology

The topology adopted for implementing deep learning on FPGAs can be runtime programmable or preconfigured. Many studies [24,49,50] have focused on implementing a generic architecture with several hidden layers, allowing architectural design exploration before choosing the best alternative in terms of performance.
Nevertheless, to avoid the long training phase of each topological alternative, other implementations have been proposed using well-known topologies, such as GGNet-E [36], AlexNet [30], GoogleNet [51], VGG [52], Overfeat [53], SFC, LFC and CNV [54]. These topologies are characterized by convolutional, fully connected, max-pooling and SoftMax (normalized exponential function) layers. By using preconfigured topologies, in addition to quickly improving performance during the training process, the architecture can also be optimized. For instance, an improved version of the VGG architecture was presented in [55], which consisted in pruning neurons of the fully connected layers.

3.5 Fixed-point versus floating-point coding
Traditionally, floating-point numbers are a natural choice for handling data during neural network training. Data flow from the input to the output for the feed-forward execution and from the output to the input for back-propagation.
Data size and accuracy are important parameters to manage so as to optimize the network in terms of complexity, size and speed. Lin et al. [56] determined the optimal data accuracy for all network layers based on a fixed-point quantization strategy. In [57], the authors demonstrated that the requirements for optimal accuracy vary not only from one network to another, but also from one layer to another within the same network. Other approaches have suggested hybrid representations. Indeed, in [58], the authors proposed a hybrid approach based on a floating-point representation for CNN weights and a fixed-point representation for activation functions. The suggested approach reduced storage units by up to 36% and energy consumption by up to 50%.
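The per-layer fixed-point idea can be sketched as follows (a simplified illustration inspired by [56, 57], under our own assumptions): each layer is assigned its own number of integer and fractional bits, and values are rounded and saturated to that grid.

```python
import numpy as np

def to_fixed_point(x, int_bits, frac_bits):
    """Quantize values to a signed fixed-point format with the given integer/fractional bits."""
    step = 2.0 ** -frac_bits                  # resolution of the fractional part
    lo = -(2.0 ** int_bits)                   # saturation bounds of the format
    hi = 2.0 ** int_bits - step
    return np.clip(np.round(x / step) * step, lo, hi)

# Hypothetical per-layer formats: more fractional bits where the dynamic range is small
layer_formats = {"conv1": (2, 13), "conv2": (3, 12), "fc": (5, 10)}
weights = {name: np.random.randn(64) for name in layer_formats}
quantized = {name: to_fixed_point(w, *layer_formats[name]) for name, w in weights.items()}
```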
Besides, some work [54,59] has adopted binarized numbers for the inputs, weights and output activations, with which operators are constrained to single-bit values. This binarization, besides using simpler operators, decreases the redundant information in the resulting parameters [60], thus optimizing the implementation cost. However, as with the fixed-point representation, it is important to target applications where the loss of precision is less important than computational speed or size.

3.6 Memory/power versus execution time

CNN topologies are computationally intensive; most of these computations are performed in convolution layers, reaching in extreme cases more than 30 GFLOPs (Giga Floating-Point Operations) [52]. Therefore, the available processing power affects the execution time. Several other studies have also shown that memory management operations are responsible for much of the energy consumption. Consequently, many techniques are used to
reduce multiply-accumulate operations, hence minimizing memory and energy consumption. In order to decrease this energy, some approaches have minimized memory cell size, optimized the memory hierarchy, provided memory access and data storage models, and managed DRAM refresh rates and other properties related to the memory subsystem [61].

4. Development flows
The deep learning development goes through two main steps: training and prediction. The training step is performed by software developers on GPUs supported by training datasets, languages and related development tools. The already trained network with its associated configuration constants can then be used in the prediction step, which will be performed by the target software and hardware platform. During that step, hardware designers can specify the design flow, the language, the development tools and the granularity of implementation.

4.1 Used libraries
The use of deep learning libraries simplifies development and implementation. Therefore, many existing frameworks are now extended to use FPGAs and FPGAs/GPUs, such as TensorFlow [62], Caffe [63], CNNLab [49] and IBM Blue [64]. Moreover, execution can be supported by clusters, giving developers the opportunity to perform benchmarking assessments of their implementation across different supports.

4.2 Hardware design flow
The implementation of deep learning on FPGAs can be done using high-level languages such as C++ [65] and tools like Vivado HLS (High-Level Synthesis) for the Zynq, Virtex or Kintex families from Xilinx. At a lower level, other achievements have used the VHDL language and the Vivado (Xilinx) or Quartus (Intel/Altera) development environments. On the other hand, designers use environments with languages supported by cloud platforms and GPUs. Indeed, the OpenCL language, utilized to program GPUs, can also be used as an entry language for implementation on Intel-Altera and Xilinx (Vivado SDAccel) FPGAs [35].

4.3 Implementation granularity
Due to the large number of required computations, the hardware integration of real-time CNNs becomes a challenge, especially if we target low-power embedded applications. However, before describing a classic or specific CNN architecture, it is important to note several features of most learning models and applications that generally make them well suited for parallelization using hardware accelerators. We can imagine several methods to parallelize and/or distribute the calculation on different machines and multi-cores. We distinguish three types of competing solutions that we can formalize according to the kind of parallelism granularity: data parallelism, model parallelism and pipeline parallelism.

4.3.1 Data parallelism

Data parallelism consists in distributing data over multiple threads, while using the same model, running on several machines (or modules). This approach can be used if the data are too large to be stored (and processed) on a single machine or to speed up learning.

4.3.2 Model parallelism
When the model is too large to fit on one machine to support the learning process, one option is to distribute it over several machines. For example, a single layer can be inserted into the memory of a single machine, followed by forward and backward propagations from one machine to another in a serial manner. Model parallelism is a distributed process needing rapid inter-process communication.

4.3.3 Pipeline parallelism

In addition to data parallelism, pipeline parallelism can be applied to concatenated layers (or steps) by running them on simultaneously executing threads. As a result, the output of one layer is streamed to the next, hence overlapping the execution of the threads, as sketched below.
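The following Python sketch gives a software analogy of this pipeline (purely illustrative; on an FPGA the queues would be on-chip FIFOs and the stages dedicated hardware blocks): each layer runs in its own thread and streams results to the next one.

```python
import threading, queue

def layer_stage(fn, in_q, out_q):
    """Run one layer as a pipeline stage: consume inputs, stream outputs."""
    while True:
        item = in_q.get()
        if item is None:          # end-of-stream marker
            out_q.put(None)
            break
        out_q.put(fn(item))

# Three toy "layers" working concurrently on a stream of inputs
q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
stages = [
    threading.Thread(target=layer_stage, args=(lambda x: x * 2, q0, q1)),   # "conv"
    threading.Thread(target=layer_stage, args=(lambda x: x + 1, q1, q2)),   # "pool"
    threading.Thread(target=layer_stage, args=(lambda x: x ** 2, q2, q3)),  # "fc"
]
for s in stages:
    s.start()
for x in range(5):
    q0.put(x)                     # new inputs enter while earlier ones are still in flight
q0.put(None)
results = []
while (item := q3.get()) is not None:
    results.append(item)
print(results)
```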
The feed-forward computation of the network is well adapted to pipeline parallelism. Consequently, a platform that can exploit deep pipeline parallelism, such as FPGAs, can be an advantage.

5. Performance
After implementation, performance can be evaluated using several metrics. Many metrics are available to highlight the strengths and weaknesses of the results produced by CNN design tools, and can be used as quality factors. Measurements should include performance figures such as throughput and latency, resource consumption, energy efficiency and application accuracy. All of these criteria play a significant role in strategically defining trade-offs to drive design flows towards the appropriate target implementation.
The throughput is the most relevant performance metric for speed-based applications, such as streaming, high-speed image recognition and large-scale multi-user applications covering large amounts of data. The throughput is measured in GOPS (Giga Operations Per Second) and is usually obtained by processing large batches of input data. Latency, or reaction time, is the most critical factor in latency-sensitive applications, such as autonomous driving cars or stand-alone systems, and in real-time cloud applications. Latency, measured in seconds, represents the elapsed time between entering a datum into the system and retrieving its output response. Resource consumption (in hardware terms: the area occupied by DSPs, LUTs, RAMs and FFs) is an indicator that permits evaluating the optimization effectiveness of development tools on a target platform. Accuracy is an indispensable metric, especially when approximation techniques are applied, such as normalization and fixed-point or binarized weights or operators. Note that potential trade-offs regarding performance and accuracy need to be quantified in terms of the quality degradation of the results. However, these performance indicators are generally limited by factors external to the target circuit. In fact, the throughput only reflects the quality of the design in relation to the number of computational resources used by the neural networks. Yet, it does not consider the available bandwidth of the associated bus or external memory, which could lead to performance degradation. Moreover, the occupied area expressed in the number of resources specific to an FPGA family, such as LUTs (Xilinx) or ALMs (Intel), does not allow a fair comparison between different vendors, or even between FPGA families of the same vendor. Energy efficiency is a better performance measure for a given energy cost, i.e. the average number of operations that can be performed within an energy limit. This energy cost comes from two elements: calculation and memory accesses. Energy efficiency is measured in “Operations/Second” per Watt (OPS/W), or Operations/Joule (OP/J), since 1 Watt = 1 Joule/Second. The workload of the target network is fixed. Thus, increasing the energy efficiency of a neural network hardware accelerator means decreasing the total cost of the energy required to process each neural network input.
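As a simple worked example (all figures are hypothetical and only illustrate how the metrics relate), throughput, energy per input and energy efficiency can be derived as follows:

```python
# Hypothetical accelerator figures, for illustration only
ops_per_inference = 1.33e9      # operations required by the network (fixed workload)
latency_s = 0.010               # seconds per inference
power_w = 9.6                   # measured board power in Watts

throughput_gops = ops_per_inference / latency_s / 1e9      # GOPS
energy_per_input_j = power_w * latency_s                    # Joules per processed input
efficiency_gops_per_w = throughput_gops / power_w           # GOPS/W, i.e. GOP/J

print(f"{throughput_gops:.1f} GOPS, {efficiency_gops_per_w:.1f} GOPS/W, "
      f"{energy_per_input_j * 1000:.0f} mJ per input")
```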
6. Adopted future strategies

6.1 Interesting solutions

To exploit the advantages of FPGAs for CNN/DNN implementation, such as reduced evaluation, exploration and development time, there has been extensive interest in FPGA-based DNN implementation frameworks supporting forward and backward data processing. This type of framework offers the opportunity to explore architectural alternatives. For that reason, many frameworks are now available for FPGA-based implementations. To better understand the scope and advantages of the proposed solutions, we will first review the details of existing approaches.
In [49], the authors presented a new flexible programming framework called CNNLab, which provided a uniform programming model for users, so that the hardware implementation and the scheduling would be invisible to programmers. The training and inference phases were done on the FPGA device. To describe the CNN model, this work associated each layer with a tuple of parameters (convolution layer, normalization layer, etc.). In this environment, several topologies and implementation alternatives were tested. The implementations were tested on a GPU using CUDA and on the Intel-Altera DE5 FPGA with OpenCL. This work showed that the GPU implementation had better performance on all layers in terms of execution time and throughput compared to the FPGA. It also demonstrated that the speedup could reach up to 1000x for FC layers. However, the FPGA had comparatively better energy performance due to its limited hardware resources and lower operating frequency.
In [40], the authors presented a modified version of the CNN framework Caffe supporting FPGAs. This framework allowed specific FPGA implementations with the flexibility of reconfiguring the device when necessary, seamless memory transactions between the host (software) and the device (hardware), simple-to-use test benches, and the ability to implement pipelined layers by permitting multiple layers to execute concurrently. Furthermore, coarse-grained data parallelism was exploited by handling two input images at the same time, each one on a compute unit in the FPGA. Only the classification was done on-chip. To validate the framework, the Xilinx SDAccel development environment was used to implement a Winograd convolution engine on a Xilinx Virtex 7 XC7VX690T-2 FPGA. The evaluation showed that the FPGA layer can be used alongside other layers running on a host processor, thus carrying out several popular CNNs such as AlexNet [30], GoogleNet [51], VGG [52] and Overfeat [53]. The Soumith Chintala convnet-benchmarks [66] were utilized as an experimental database for classification using CNN models. The highest post-placement-and-routing resource utilization was 83.2% of the available LUTs on the FPGA.
In [50], the authors suggested a platform for users to speed up and simplify the design and implementation of CNNs on FPGAs. This work allowed the designer to have a clear idea about the resources utilized by an implementation without having to synthesize it, which effectively reduced the development time. It used interpolation to estimate resources before synthesis. It was also possible to parallelize and speed up the overall computation by dividing the images to be processed among the different cores. Several topologies were utilized for this network with different parameters: the input image size, the number of convolutional layers, the input and output feature maps, the kernel dimension and the number of linear layers were all varied, but the classification layer always used 10 classes. The training phase was carried out outside the FPGA support, and the trained weights were required as input to the development flow. The implementation
was done from C++ code utilizing Vivado HLS. The resource estimation results showed an average relative prediction error of 15% with a low variance. Given the large number of flip-flops available on current FPGAs, the authors considered that this prediction error, corresponding to an average of 1,264 flip-flops, was still a good result.
In [24], the authors presented a high-performance implementation of deep learning neural networks using the DLAU accelerator. The latter was integrated as a standalone unit, flexible and adaptive enough to accommodate various applications through configuration. It also utilized three pipelined processing units to improve the throughput. Performance profiling showed that Matrix Multiplication (MM) took the longest time, so an MM accelerator was designed to improve the overall speed of the system. A tile technique was also used for inputs and weights, as sketched below. Floating-point coding was used for the addition and multiplication operations. These techniques make large-scale deep network implementations feasible. Many topologies were employed for training on the MNIST and CIFAR-10 datasets. The training phase was executed by Matlab software. The validation of this architecture was done on a Xilinx Zynq ZedBoard xc7Z020 with Xilinx Vivado from C code. DLAU could achieve up to 36.1x speedup at a 256×256 network size. It also utilized 167 DSP blocks due to the use of floating-point operations.
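The tile technique mentioned above can be illustrated by the following sketch (our own simplified view, not DLAU's actual hardware): the large weight matrix is processed tile by tile, so that only one tile needs to reside in on-chip buffers at a time.

```python
import numpy as np

def tiled_matvec(W, x, tile=256):
    """Matrix-vector product computed tile by tile.

    Only one (tile x tile) block of W and the matching slice of x are
    needed at a time, mimicking limited on-chip buffer capacity."""
    rows, cols = W.shape
    y = np.zeros(rows)
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            w_tile = W[r:r + tile, c:c + tile]   # weight tile streamed from external memory
            x_tile = x[c:c + tile]               # matching input tile
            y[r:r + tile] += w_tile @ x_tile     # partial accumulation
    return y

W = np.random.randn(512, 512)
x = np.random.randn(512)
assert np.allclose(tiled_matvec(W, x), W @ x)    # tiling does not change the result
```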
In [67], the authors opted for an optimized streaming method for the hardware acceleration of DCNNs on an embedded platform. The streaming method acted as a compiler, transforming a high-level representation of a DCNN into operation codes to be executed by the hardware accelerator. This method utilized the maximum available computational resources, supported by a novel scheduled routing topology that combined data reuse and data concatenation. Different DCNN architectures, with any number of filters and layers, could be run using the implemented control method. The system fully explored weight-level and node-level DCNN parallelization. The validation of this method was done with a network of three channels and eight layers. A single layer consisted of 3×128 convolution operations with 6×6 filters, a max-pooling operation with a 4×4 window size and a non-linearity operation. This network was trained with 20 classes, which was the size of the network output, utilizing the Torch environment [68] on an NVIDIA GPU. The implementation on a Xilinx Kintex-7 XC7K325T FPGA provided a throughput of 247 GOPS at four Watts of power. Note that two data words were encapsulated into a 32-bit stream (each word is 16-bit). This implementation was used for video/image recognition.
In [27], the authors presented a hardware implementation of an LSTM recurrent network on the programmable logic of a Zynq 7020 FPGA. The LSTM, used to model language at the character level, predicted the next character given the previous one. The adopted topology was a two-layer LSTM neural network with a hidden layer of size 128 (weight matrix height). The input and output characters were one-hot encoded vectors of size 65. The training of the character-level language models was written in Torch7, while the FPGA implementation used C code. The implemented design was scaled by replicating the number of LSTM modules running in parallel. It also used four 32-bit DMA ports; each one could transmit two 16-bit streams, given that the operations were done in 16 bits. The average error percentage for the output vector was 2.8%, while the bandwidth between the FPGA and the external full-duplex DDR3 could reach 3.8 GB/s. In [69], the authors introduced a layer multiplexing scheme for the on-chip training of deep feed-forward neural architectures using the back-propagation algorithm. This architecture contained only one physical layer of neurons, but this layer could be reused iteratively in order to accelerate architectures with several hidden layers. The implementation was configurable given that the user could set the parameters to specify the neural network architecture, including the number of hidden layers, the number of neurons in each of these layers, the learning
parameters, etc. The training phase was run with 50 patterns while the validation phase used 20 patterns, both from the Iris plant dataset. The implementation was done in VHDL on a Virtex-5 XC5VLX110T FPGA. The experimental results showed that for a large number of hidden layers, the FPGA implementation was approximately 20 to 30 times faster (learning phase execution time) than a PC CPU. However, the performance of the standard back-propagation algorithm started to degrade with 10 or more hidden layers.
In [36], the authors presented a framework that explores heterogeneous algorithms for accelerating CNNs on FPGAs. Their framework employed an architecture where, according to the optimization algorithm, some layers were merged to save memory transfers while effectively preserving computing resources. The specification of the target FPGA included block RAMs (BRAMs), DSPs and off-chip bandwidth, among other parameters. Furthermore, this framework provided an automated tool-chain to ease the mapping from a Caffe model to FPGA bitstream generation using Vivado HLS. The convolution algorithm was faster because Winograd's minimal filtering theory was used for efficient FPGA implementation. Thanks to this framework, the authors could provide a comprehensive solution for mapping a wide variety of CNNs on FPGAs with the AlexNet and GGNet-E topologies (containing 16 convolutional layers, three fully connected layers, three max-pooling layers and one softmax layer). The training was carried out outside the FPGA device.
Based on the fusion architecture, the authors employed a two-level pipeline design: an intra-layer pipeline and an inter-layer one. To evaluate this framework, they used an off-the-shelf device, the Zynq ZC706 with 16-bit fixed-point data types, as an experimental platform. With the VGG topology, the platform could achieve on average a 1.99x performance speedup compared to the work in [70] for various transfer constraints. The fusion strategy helped to decrease the feature-map transfers, which translated into average energy savings of 68.2% on memory transfers for different transfer constraints. With the AlexNet topology, their heterogeneous algorithm improved performance by 99% on average, leading to an additional 50% energy savings on computing resources. Compared to [70], this result offered an acceleration gain of 1.24x, restricted only by the small solution exploration space.
In [54], the authors proposed the new FINN framework, consisting of fast and flexible FPGA-based accelerators using a heterogeneous streaming architecture. Utilizing a new set of optimizations for the efficient mapping of binarized neural networks onto hardware, the authors could implement fully connected, convolutional and pooling layers. This work used per-layer computing resources, tailored to the user-provided throughput requirements. The framework provided automated pipelining to meet the clock frequency target. A Tiny-CNN predefined library was used to assign 1000 images of the MNIST, CIFAR-10 and SVHN datasets for neural network training. Only the prediction was done on the FPGA device. Three neural architectures were adopted: SFC (three layers with 256 neurons per layer), LFC (three layers with 1,024 neurons per layer) and VGG16 (two convolution layers, one max-pool layer, with 64-128-256 input channels and two fully connected layers of 512 neurons in the output layer). Test instances, described in C++, were implemented on the SoC-FPGA Xilinx Zynq-7000 Series with Vivado HLS version 2016.3. The authors argued that high classification accuracy could be achieved even when weights and activations were reduced from floating-point to binary values. In the first place, this assumption stemmed from the fact that the computational performance of the FPGA was 66 Tera-Operations Per Second (TOPS) for binary operations, which was about 16x and 53x higher than for 8-bit and 16-bit fixed-point operations, respectively. Second, the authors also demonstrated that the difference in accuracy between low-precision networks and floating-point
networks decreases as the network size increases. Finally, the authors presented a roofline model for ZC706, assuming a 90% LUT utilization, a 200 MHz clock frequency and 1.6 GB/s of DRAM bandwidth. Prototypes exhibited good runtime efficiency, with ∼80% for SFC and ∼90% for LFC.
In [55], the authors presented a neuron pruning technique, based on the Chainer framework [71], which eliminated part of the weights. In this work, a sequential-input parallel-output architecture was put forward for the fully connected layers. To obtain fast memory access, the weight memory was an on-chip memory built into the FPGA. The neuron pruning of fully connected layers was applied to the VGG-11 CNN on the CIFAR-10 dataset. Only the prediction was done on the FPGA device. The implementation was done in C++ using Vivado 2016.2 on a Xilinx Kintex-7 XC7K325T FPGA. The input of the neural network was a normalized 32 × 32 image consisting of 8-bit RGB color data. The evaluation results showed that by applying neuron pruning, the number of neurons was reduced by 89.3% while maintaining 99% accuracy. The FPGA implementation was 219 times faster than the CPU and 12.5 times faster than the GPU. Moreover, the energy efficiency was 125.28 times better than the CPU and 17.88 times better than the GPU.
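A generic magnitude-based view of neuron pruning for an FC layer (a sketch under our own assumptions, not necessarily the exact criterion used in [55]) keeps only the output neurons whose weight rows contribute most:

```python
import numpy as np

def prune_neurons(W, b, keep_ratio=0.107):
    """Keep only the output neurons with the largest weight magnitudes.

    W: (n_out, n_in) weight matrix, b: (n_out,) biases.
    keep_ratio=0.107 mirrors removing roughly 89.3% of the neurons."""
    scores = np.linalg.norm(W, axis=1)                  # importance score of each output neuron
    n_keep = max(1, int(round(keep_ratio * W.shape[0])))
    keep = np.sort(np.argsort(scores)[-n_keep:])        # indices of the neurons to keep
    return W[keep], b[keep], keep                       # the next layer must drop matching inputs

W = np.random.randn(1024, 512)
b = np.random.randn(1024)
W_pruned, b_pruned, kept = prune_neurons(W, b)
print(W_pruned.shape)   # (110, 512): far fewer weights to store in on-chip memory
```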
In [72], the authors presented FP-BNN, a Binarized Neural Network (BNN) for FPGAs, which drastically decreased hardware resource utilization while maintaining acceptable accuracy. The Resource-Aware Model Analysis (RAMA) was used to assess the resource cost and to help the on-chip system architecture design. RAMA removed the bottleneck involving multipliers by using bit-level XNOR and shift operations, and optimized the access to parameters with data quantization and on-chip storage. The evaluation of the FP-BNN accelerator was done for the MNIST Multi-Layer Perceptron (MLP), the CIFAR-10 ConvNet and AlexNet on a Stratix-V FPGA. The interconnection was reconfigured by the controller according to the type of layer. The training phase was carried out by an IBM x3650 M4 server equipped with an NVIDIA Tesla K40. The inference performance of FP-BNN reached, with acceptable accuracy loss, a TOPS-level throughput, which was significantly faster than previous CNN designs. The energy efficiency was 10 times better than that of other computing platforms. Indeed, the utilization of DSP blocks was not high, because only a small part of the arithmetic operations needed floating-point multipliers.
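The multiplier-free trick behind binarized networks can be sketched as follows (illustrative only): with weights and activations constrained to ±1, a dot product becomes an XNOR followed by a popcount.

```python
import numpy as np

def binary_dot(a_bits, w_bits):
    """Dot product of two {-1,+1} vectors encoded as {0,1} bits.

    XNOR marks the positions where the two vectors agree; the +-1 dot
    product is then 2*matches - length, so no multipliers are needed."""
    xnor = ~(a_bits ^ w_bits) & 1        # 1 where the bits agree
    matches = int(xnor.sum())            # popcount
    return 2 * matches - len(a_bits)

a = np.random.randint(0, 2, 64)          # activations as bits (0 -> -1, 1 -> +1)
w = np.random.randint(0, 2, 64)          # binarized weights
reference = np.dot(2 * a - 1, 2 * w - 1) # conventional +-1 dot product
assert binary_dot(a, w) == reference
```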
In [73], the authors proposed an end-to-end approach: their TVM compiler could take a high-level deep learning specification from other existing frameworks and generate optimized low-level code for a diverse set of hardware back-ends. TVM permitted applying many optimization phases such as high-level operator fusion, mapping to arbitrary hardware primitives and memory latency hiding. It also provided automated optimization of low-level programs to target hardware characteristics by employing a novel learning-based cost modeling method for the rapid exploration of code optimizations. Regarding the FPGA implementation, they suggested the Vanilla Deep Learning Accelerator (VDLA), which extracted characteristics from previous accelerator proposals to produce a minimum-size hardware architecture. For instance, the authors used TVM to generate ResNet inference kernels and offload as many layers as possible to VDLA. The result could be implemented on a low-power PYNQ board, characterized by its modest FPGA resources. However, the on-chip buffers were not large enough to provide on-chip storage for a single ResNet layer. Thus, a driver library was built for VDLA with a C runtime API that issued instructions to the target accelerator for execution. After that, the code generation algorithm converted an accelerator program into a series of API calls. Most of the computation was spent on the convolution layers, which could be offloaded to VDLA. For those convolution layers, the implementation was done with a 16 × 16 matrix-vector unit clocked at
200 MHz that performed products of 8-bit values and accumulated them into a 32-bit register at each cycle, achieving a 40× speedup.
The authors in [41] described an RTL compiler called ALAMO, which analyzed the algorithm structure and parameters and automatically integrated a set of modular and scalable primitives to speed up the generation of learning algorithms on FPGAs. The basic idea of this solution was that the compiler quantitatively analyzed the design strategy in order to optimize the throughput of a given CNN model, taking into account the constraints on FPGA resources. The data paths in the router were pipelined, and the number of pipeline stages was parameterized. Weights had to be stored in off-chip memory (e.g. DRAM) and transferred to the FPGA during computation, as the approximately 250 MB of memory required to store all the weights exceeded the on-chip memory capacity of FPGAs. Multiple data word-lengths (integer and fractional bits) were used for the implementation (10-bit unsigned numbers for the feature values after ReLU, 8-bit signed numbers for the kernel weights, 18 bits for the multiplier outputs and 25 bits for the adder). The data word-length could be dynamically adjusted by the compiler for each layer. Only the inference was done on the FPGA device. ALAMO was evaluated on an Altera Stratix-V GXA7 FPGA for the inference tasks of the AlexNet and NiN [74] CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. The power consumption of the Stratix-V FPGA chip when running AlexNet and NiN was measured as 19.5 Watts and 19.1 Watts, respectively, with an initial board power of 16.5 Watts. The proposed RTL compiler achieved a 1.9x improvement in throughput compared to an OpenCL-based environment.
The authors in [75] proposed an algorithm-hardware co-optimization framework for different DNN types, sizes and application scenarios. The algorithmic strategy adopted general block-circulant matrices to achieve a fine-grained trade-off between accuracy and compression ratio. It applied to both fully connected and convolutional layers, supported by a mathematically rigorous proof of the effectiveness of the method. All operations were pipelined on the basic computing block. The training and inference phases were done on the FPGA device. The hardware strategy was to achieve ultra-high energy efficiency and performance for the FPGA-based implementation by using effective reconfiguration, batch processing, deep pipelining, resource reuse and hierarchical control. The authors evaluated the suggested framework on both a low-power Intel (Altera) Cyclone V 5CEA9 FPGA and a higher-performance Xilinx Kintex-7 XC7K325T FPGA, for the implementation of small-to-medium-scale DNNs, using the MNIST, SVHN and CIFAR-10 benchmarks. The accuracy degradations were constrained to 1% to 2% between the original and block-circulant matrix-based models. Experimental results demonstrated that the proposed framework achieved at least 152x speedup and 71x energy efficiency gain compared with the IBM TrueNorth processor at the same accuracy. It achieved at least a 31x energy efficiency gain compared with the reference FPGA-based work [54].
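The block-circulant compression can be illustrated with the following sketch (our own simplified view of the technique, not the implementation of [75]): each weight block is a circulant matrix fully defined by its first column, so a block matrix-vector product reduces to element-wise products in the FFT domain, cutting both storage and multiplications.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix defined by its first column c with vector x via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Verify against an explicitly built 4x4 circulant block: C[i, j] = c[(i - j) mod 4]
c = np.random.randn(4)
x = np.random.randn(4)
C = np.array([[c[(i - j) % 4] for j in range(4)] for i in range(4)])
assert np.allclose(C @ x, circulant_matvec(c, x))

# A block-circulant weight matrix only stores one column per block:
# here 2x2 blocks of size 4 need 4 values each instead of 16.
first_cols = [[np.random.randn(4) for _ in range(2)] for _ in range(2)]
x_full = np.random.randn(8)
y = np.zeros(8)
for i in range(2):
    for j in range(2):
        y[i * 4:(i + 1) * 4] += circulant_matvec(first_cols[i][j], x_full[j * 4:(j + 1) * 4])
```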
In [76], the authors described the FlexiGAN end-to-end solution, which generated an optimized synthesizable FPGA accelerator from a high-level Generative Adversarial Network (GAN) specification. GANs are architectures composed of two networks pitted against each other, generally used in unsupervised machine learning. FlexiGAN was coupled with a novel template architecture that aimed to harness the benefits of both MIMD (Multiple Instruction, Multiple Data) and SIMD (Single Instruction, Multiple Data) execution models to avoid ineffectual operations. FlexiGAN came with a compilation workflow that reordered computation, optimized dataflow, separated data retrieval and generated a two-level instruction hierarchy in order to accelerate a given GAN specification. The end-to-end solution was evaluated by generating FPGA accelerators for a variety of GANs, like 3D-GAN or ArtGAN, using the Xilinx XCVU13P FPGA chip. All
arithmetic operations were performed with 16-bit precision. Only the inference phase was done on the FPGA device. Vivado Design Suite v2017.2 was utilized to synthesize the FlexiGAN architecture. The global and local instruction buffers were implemented with BRAMs and UltraRAMs. FlexiGAN-FPGA operated at a frequency of about 190 MHz. The generated accelerator provided a 2.2x speedup compared to an optimized conventional convolution design. In addition, FlexiGAN provided on average 2.6x (up to 3.7x) improvements in performance-per-Watt over a Titan X GPU, while its clock rate was about 5.0x lower.
In [77], the authors described a new toolkit, called Kibo, for training and inference in FPGA-based deep learning. The goal was to provide an optimized implementation with the ability to customize the bit width of fixed-point weights and activations. The toolkit, written in Python, allowed the user to quickly evaluate the impact of alternative numerical representations by comparing functions commonly utilized in deep learning structures. Two examples were proposed for evaluating the Kibo toolkit in both inference and training: an MLP used for inference, and a recurrent neural network containing a Long Short-Term Memory (LSTM) architecture for learning time-series data. The MLP implementation using fixed-point logic with five integer bits and five fractional bits achieved 97.9% accuracy, which was comparable to state-of-the-art architectures using a single hidden layer. The LSTM implementation showed that the loss quickly decreased from the first to the second training epoch, reaching lower loss values for networks using larger fixed-point representations. Moreover, for a fixed-point representation with 16 bits in the integer part and 16 bits in the fractional part, the results showed a behavior similar to that of a network trained using floating-point arithmetic. The implementation was carried out on the Xilinx Zynq ZC706 board.
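The effect of sweeping the integer and fractional bit widths, as Kibo allows, can be approximated in pure Python with a simple quantize helper; the function below is a hypothetical illustration (including the assumption that the sign bit is counted in the integer part), not Kibo's actual API.

```python
import numpy as np

def to_fixed(x, int_bits, frac_bits):
    """Simulate a signed fixed-point format with int_bits integer bits
    (sign bit included) and frac_bits fractional bits, by rounding and
    saturating to the representable range."""
    scale = 2.0 ** frac_bits
    limit = 2.0 ** (int_bits - 1)
    return np.clip(np.round(x * scale) / scale, -limit, limit - 1.0 / scale)

# Compare the two configurations mentioned above on a small dot product
rng = np.random.default_rng(1)
w, a = rng.standard_normal(64), rng.standard_normal(64)
exact = float(w @ a)

for int_bits, frac_bits in [(5, 5), (16, 16)]:
    approx = float(to_fixed(w, int_bits, frac_bits) @ to_fixed(a, int_bits, frac_bits))
    print(f"Q{int_bits}.{frac_bits}: error = {abs(exact - approx):.6f}")
```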
In [78], the authors presented an efficient accelerator for Depthwise Separable CNNs (DS-CNNs) on FPGA for mobile edge computing. This accelerator permitted all layers to work concurrently in a pipelined fashion to improve the system throughput and performance. To implement the accelerator, double-buffering-based memory channels were used to handle the dataflow between adjacent layers. The authors implemented and evaluated the accelerator for the DS-CNN on an Intel Arria 10 FPGA with Intel Quartus Prime 16.0 for synthesis and ModelSim for simulation. Parallelization was exploited for computing multiple output and input feature maps. In this work, the data wordlength utilized for the MAC operations was 16 bits in fixed-point coding. For the 2D convolution, the multiplications were performed by parallel multipliers, and an adder tree summed all the results from the multipliers in parallel each cycle.
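A depthwise separable convolution factors a standard convolution into a per-channel (depthwise) filter followed by a 1x1 (pointwise) convolution that mixes channels, which sharply reduces the number of MAC operations per layer. The NumPy sketch below, with assumed shapes and a naive loop order rather than the accelerator's pipelined datapath, also reports the MAC reduction with respect to a standard convolution.

```python
import numpy as np

def depthwise_separable_conv(x, dw_k, pw_k):
    """x: (C, H, W) input; dw_k: (C, k, k) depthwise kernels;
    pw_k: (M, C) pointwise (1x1) kernels. 'Valid' padding, stride 1."""
    C, H, W = x.shape
    k = dw_k.shape[-1]
    Ho, Wo = H - k + 1, W - k + 1
    dw_out = np.zeros((C, Ho, Wo))
    for c in range(C):                       # depthwise: one filter per channel
        for i in range(Ho):
            for j in range(Wo):
                dw_out[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * dw_k[c])
    # pointwise: a 1x1 convolution that combines the channels
    return np.einsum('mc,chw->mhw', pw_k, dw_out)

C, M, H, W, k = 8, 16, 12, 12, 3
x = np.random.rand(C, H, W)
y = depthwise_separable_conv(x, np.random.rand(C, k, k), np.random.rand(M, C))

Ho = Wo = H - k + 1
macs_separable = C * k * k * Ho * Wo + M * C * Ho * Wo
macs_standard = M * C * k * k * Ho * Wo
print(y.shape, f"MAC reduction: {macs_standard / macs_separable:.1f}x")
```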
The experimental results indicated that the proposed DS-CNN accelerator had a performance of 98.9 GOP/s and achieved up to a 17.6x speedup and 29.4x lower power consumption than CPU and GPU implementations, respectively. The accelerator proposed in this paper was trained and tested on the MNIST and CIFAR-10 datasets.

6.2 Analysis of interesting solutions

Table 1 regroups the most representative proposals according to the application domain, the strategies and flow of development, and the performance parameters for the optimization and implementation of FPGA-based deep learning networks. This suggested classification first allows clarifying the implementation choices for the designer according to the application and related constraints.
From Table 1, we note that the works target either a general-use domain or a specific one. Comparison between works is easier within a specific domain, given that most of these implementations have adopted common databases (MNIST, CIFAR-10, etc.). Implementations strongly depend on the characteristics of the FPGA target, the topologies of the CNNs (preconfigured or runtime programmable), the coding (fixed and floating point, binary, etc.), the training type (on-chip, off-chip), the optimization techniques and the architectures (flexibility, configurability), which helps to identify the challenges and the design effort.
Reducing the size of networks through fixed-point and binary coding are the optimizations that best exploit the advantages of FPGAs in terms of computational resources and energy consumption, as illustrated by the Binarized Neural Network (BNN) for FPGAs [72], which provides good energy efficiency. In the same vein, data-size, memory and pruning optimizations are the best choices to overcome the limited memory capacity and routing resources of FPGAs.
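Binary coding is attractive on FPGAs because a dot product between {-1, +1} vectors reduces to an XNOR followed by a population count, which maps well onto LUT logic. The sketch below, a minimal illustration with our own bit-packing choices rather than the implementation of [72], checks this identity against a floating-point dot product.

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1} and to packed bit words (1 encodes +1)."""
    signs = np.where(x >= 0, 1, -1).astype(np.int8)
    bits = np.packbits((signs > 0).astype(np.uint8))
    return signs, bits

def xnor_popcount_dot(bits_a, bits_b, n):
    """Dot product of two {-1, +1} vectors of length n from their packed bits:
    matches = popcount(XNOR), so dot = 2 * matches - n."""
    xnor = np.bitwise_not(np.bitwise_xor(bits_a, bits_b))
    # unpack and keep only the first n bits (packbits pads the last byte)
    matches = int(np.unpackbits(xnor)[:n].sum())
    return 2 * matches - n

n = 37
rng = np.random.default_rng(2)
a, b = rng.standard_normal(n), rng.standard_normal(n)
(sa, ba), (sb, bb) = binarize(a), binarize(b)
assert xnor_popcount_dot(ba, bb, n) == int(np.dot(sa, sb))
```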
Design frameworks also have an important place in this development process, as they give software developers access to FPGAs while reducing development time and bringing more flexibility and configurability to the application. As a consequence, most works have integrated an optimization phase into the framework flow, such as the TVM compiler in [73], the ALAMO RTL compiler in [41], the algorithm-hardware co-optimization framework in [75], the FlexiGAN end-to-end solution in [76], and the Kibo toolkit in [77]. The coupling between the optimization phase and the framework-based design flow provides higher performance in terms of speedup, energy efficiency and throughput. As noted, most of the proposals have been built using a high-level language such as C++ or OpenCL, supported by hardware development tools like Vivado, Vivado HLS, SDAccel, and Quartus.
In the context of FPGA implementation, the preferred performance parameters are, in addition to speedup, throughput, FPGA resources, power consumption and energy efficiency, the accuracy and the prediction error used for prediction applications.
Table 1: Overview of interesting solutions. For each of the analyzed works ([24], [27], [36], [40], [41], [49], [50], [54], [55], [67], [69], [72], [73], [75], [76], [77], [78]), the table summarizes the application domain, the FPGA support, the network topology (preconfigured or runtime programmable), the data coding (binary, fixed point or floating point), the development flow and strategies (frameworks, languages, synthesis tools, parallelism), the training type (on-chip or off-chip), the architecture features (optimization, flexibility and configurability) and the performance gains.
After analysis, we have clarified and simplified the representation of the implementation parameters by proposing three trees (Figure 1, Figure 2 and Figure 3) that reorder Table 1. These figures regroup the interesting solutions according to the choice of development strategies and flow, as well as the performance parameters for the optimization and implementation of FPGA-based deep learning networks.
Figure 1 shows that most works adopted a preconfigured topology, off-chip training, an optimization phase, and a configurable and flexible architecture. The figure also clearly illustrates the fixed-point coding in 8 bits, 12 bits, 16 bits and multiple data wordlengths.
Figure 1: Development strategies of interesting solutions (tree grouping the works by topology: preconfigured or runtime programmable; training: on-chip or off-chip; coding: binary, 8-bit, 12-bit or 16-bit fixed point, multiple data wordlength, or floating point; and architecture: optimization, flexibility and configurability)
Figure 2 depicts the importance of using a framework together with a high-level language, as well as the application of pipeline parallelism. Indeed, most works implemented deep learning with Xilinx hardware development tools such as Vivado, Vivado HLS and SDAccel.
Figure 2: Development flow of interesting solutions (tree grouping the works by framework use; language: high-level or low-level; synthesis tools: Vivado/Vivado HLS/SDAccel (Xilinx) or Quartus/OpenCL (Intel); and parallelism: pipeline, data or model)
For the evaluation of deep learning implementation performance on FPGA, the authors were interested firstly in speedup, and secondly in energy efficiency and throughput, which highlights the advantage of the FPGA support (Figure 3).
Figure 3: Performance gain of interesting solutions (tree grouping the works by the reported gains: speedup, energy efficiency, throughput, power consumption, error rate and FPGA resources)
From Table 1 and the previous figures, we can explore other results concerning the relation between the used topology, the domain type, the development strategies and flow, and the performance gains.
Figure 4 shows the relation between the domain and the topology, illustrated by the percentage of papers that have used preconfigured and runtime-programmable topologies in both general and specific domains. These percentages inform the reader that the preconfigured topology is always the most utilized for deep learning implementation on FPGA, regardless of the domain type.
Figure 4: Percentage of papers that have used preconfigured and runtime-programmable topologies according to domain type
Given that the interesting papers provide performance gains, we illustrate in Figure 5 and Figure 6 the percentage of papers that report a gain in power, FPGA resources, prediction error, speedup, throughput or energy efficiency relative to the topology, training, architecture, coding, framework, language, parallelism and support types. These percentages inform the reader about the development strategies and flow to adopt in order to obtain the desired gain. We note that most of the performance gains in FPGA resources, speedup, throughput and energy efficiency are reached with pipeline parallelism, an optimized, flexible and configurable architecture, off-chip training and a preconfigured topology. Binary and multiple-wordlength coding is adopted for FPGA-resource and speedup gains, whereas for energy-efficiency and throughput gains, 16-bit fixed-point coding is frequently used.
Moreover, a gain in power consumption can be achieved by exploiting data and model parallelism. To ensure a good error-rate performance, a runtime-programmable topology is favored, with 16-bit or multiple-wordlength fixed-point coding.
Figure 5: Percentage of papers that have performance gains according to development strategies
Figure 6: Percentage of papers that have performance gains according to development flow
6.3 Proposals based on results
Figures 1-6 provide a better view of the choice of implementation parameters and give directions for researchers to decide on their future implementations.
As a result, the best way to implement deep learning on FPGAs is to choose the most appropriate architecture based on the application domain, such as classification, segmentation, prediction, and so on. After that, it is preferable to perform the training phase on a GPU, not only to minimize the development time, but also to cope with the large size that can be reached by the best-performing network. The FPGA offers excellent support for inference because it allows energy savings through a low operating frequency while keeping the prediction error minimal. The available tools, like Vivado HLS, simplify the implementation on FPGA by supporting high-level languages like C or OpenCL (Vivado SDAccel) and synthesizable data types. The use of frameworks shortens the implementation time, and integrating an optimization phase into the framework flow is desirable. The arithmetic can use multiple wordlengths in fixed point, or even binary coding, which allows optimizing performance and resources. By using FPGAs, we can optimize the required amount of resources (LUTs, flip-flops, etc.), the speedup, the energy, the throughput, and the prediction error rate at the same time.
7. Conclusion
In this paper, we have presented a taxonomy of FPGA-based topologies for future deep learning architectures. The programmability and reconfigurability of FPGAs allow evaluating a custom design quickly and saving development time and expenses. FPGAs can also offer better performance with significantly less power consumption, which is especially needed for mobile embedded platforms. Before beginning a deep learning implementation on FPGA, it is essential to set the strategy to adopt, based on the application requirements. Hardware designers will start by specifying the design flow, the development language and tools, and the granularity of the implementation (particularly necessary with library support). Throughout this document, we have analyzed the design requirements and characteristics of existing topologies to finally propose development strategies and implementation architectures for better use of FPGA-based deep learning topologies. This review can serve as a reference for researchers in the area of neural networks. As a perspective, we expect the suggested frameworks and topologies to evolve towards a generation of programmable deep learning architectures that exploit the advantages of FPGAs to satisfy, in particular, the needs of application fields such as automotive and smart wearable embedded devices.
References
[1] A. Graves, A.-r. Mohamed, and G. Hinton, Speech recognition with deep recurrent neural networks, in IEEE Int. Conf. on ICASSP, 2013, pp. 6645–6649. [2] I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, in Advances in neural information processing systems, 2014, pp. 3104–3112.
[3] W. Byeon, M. Liwicki, and T. M. Breuel, Scene analysis by mid-level attribute learning using 2d LSTM networks and an application to web-image tagging, Pattern Recognition Letters, vol. 63, pp. 23– 29, 2015. [4] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Ng, Convolutional-recursive deep learning for 3d object classification, in Advances in Neural Information Processing Systems, 2012, pp. 665–673. [5] Zhennan Yan, Yiqiang Zhan, Shaoting Zhang, Dimitris Metaxas, Xiang Sean Zhou, Multi-Instance MultiStage Deep Learning for medical image recognition, book chapter from Deep Learning for Medical Image Analysis, 2016.
[6] Jun Xu, Chao Zhou, Bing Lang, Qingshan Liu, Deep Learning for Histopathological Image Analysis: Towards Computerized Diagnosis on Cancers, Deep Learning and Convolutional Neural Networks for Medical Image Computing - Precision Medicine, High Performance and Large-Scale Datasets, Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR), 2017.
[7] Brahim Ait Skourt, Abdelhamid El Hassani, Aicha Majda , Lung CT Image Segmentation Using Deep Neural Networks, Procedia Computer Science. Volume 127, 2018, Pages 109-113. https://doi.org/10.1016/j.procs.2018.01.104
[8] Alexander Kalinovsky,Vassili Kovalev, Lung Image Segmentation Using Deep Learning Methods and Convolutional Neural Networks, Conference: XIII International Conference on Pattern Recognition and Information Processing (PRIP-2016) At: Minsk, Belarus
[9] Gustavo Carneiro, Jacinto Nascimento, Andrew P. Bradley, Deep Learning Models for Classifying Mammogram Exams Containing Unregistered Multi-View Images and Segmentation Maps of Lesions, book chapter from Deep Learning for Medical Image Analysis, 2016.
[10] Wei Zhang, Zhihui Lu, Ziyan Wu, Jie Wu, Huanying Zou, Shalin Huang, Toy-IoT-Oriented data-driven CDN performance evaluation model with deep learning. Journal of Systems Architecture, Volume 88, Pages 13-22, 2018. https://doi.org/10.1016/j.sysarc.2018.05.005.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009 [12] A. Krizhevsky, One weird trick for parallelizing convolutional neural networks, arXiv preprint arXiv:1404.5997, 2014. [13] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, cudnn: Efficient primitives for deep learning, vol. abs/1410.0759, 2014 [14] Dundar, A., Jin, J., Gokhale, V., Martini, B., Culurciello, E.: Memory access optimized routing scheme for deep networks on a mobile coprocessor. In: HPEC 2014, pp. 1–6 (2014). [15]
Intel® Movidius™ Myriad™ X VPU: Ultimate Performance at Ultra-Low Power, https://www.movidius.com/myriadx
[16] Han, Song, et al. "EIE: efficient inference engine on compressed deep neural network." Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on. IEEE, 2016. [17] Zhen Li, Yuqing Wang, Tian Zhi, Tianshi Chen, A survey of neural network accelerators, Frontiers of Computer Science, October 2017, Volume 11, Issue 5, pp 746–761. DOI 10.1007/s11704-016-6159-1 [18] Griffin Lacey, Graham Taylor, Shawki Areibi, Deep Learning on FPGAs: Past, Present, and Future, in arXiv, 2016. https://arxiv.org/pdf/1602.04283.pdf [19] Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang and Huazhong Yang, A Survey of FPGA Based Neural Network Accelerator, in arXiv, 2018. https://arxiv.org/pdf/1712.08934.pdf [20] Kamel Abdelouahab, Maxime Pelcat, Jocelyn Sérot, and François Berry, Accelerating CNN inference on FPGAs: A Survey, in arXiv, 2018. https://arxiv.org/pdf/1806.01683.pdf
[21] Daugman, J. How iris recognition works. IEEE Trans. Circuits Syst. Video Technol. 2004, 14, 21–30. [22] J. Bigun, F. Alonso-Fernandez, H. Hofbauer, and A. Uhl, Experimental analysis regarding the influence of iris segmentation on the recognition rate, IET Biometrics, vol. 5, no. 3, pp. 200–211, 2016. [23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[24] Chao Wang, Qi Yu, Lei Gong, Xi Li, Yuan Xie and Xuehai Zhou, DLAU: A Scalable Deep Learning Accelerator Unit on FPGA, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems ( Volume: 36, Issue: 3, March 2017 ). DOI: 10.1109/TCAD.2016.2587683 [25] LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[26] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[27] Andre Xian Ming Chang, Berin Martini, Eugenio Culurciello, Recurrent Neural Networks Hardware Implementation on FPGA, International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering. Vol. 5, Issue 1, January 2016. https://pdfs.semanticscholar.org/1a27/a30724ed0409791db99d96551d341807faed.pdf.
[28] Afonso Menegola, Michel Fornaciali, Ramon Pires, Sandra Avila, Eduardo Valle; Towards Automated Melanoma Screening: Exploring Transfer Learning Schemes, Published 2016 in ArXiv.
[29] Muhammad Imran Razzak, Saeeda NazAhmad Zaib, Deep Learning for Medical Image Processing: Overview, Challenges and the Future, Classification in BioApps pp 323-350, Part of the Lecture Notes in Computational Vision and Biomechanics book series (LNCVB, volume 26), 2017.
[30] Krizhevsky A, Sutskever I, Hinton G E, Imagenet classification with deep convolutional neural networks, In: Proceedings of the Neural Information Processing Systems Conference. 2012, 1097–1105 [31] Jarrett K, Kavukcuoglu K, Ranzato M A, LeCun Y, What is the best multi-stage architecture for object recognition?, In: Proceedings of the 12th IEEE International Conference on Computer Vision. 2009, 2146–2153 [32] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello, "Hardware accelerated convolutional neural networks for synthetic vision systems," in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 2010, pp. 257–260. [33] Xilinx Inc. Vivado Design Suite. http://www.xilinx.com/products/design-tools/vivado.html
[34] Intel Quartus Prime Software Suite, https://www.intel.com/content/www/us/en/software/programmable/quartus-prime/overview.html [35] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks, in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2016. [36] Qingcheng Xiao, Yun Liang, Liqiang Lu, Shengen Yan and Yu-Wing Tai, Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs, DAC '17 Proceedings of the 54th Annual Design Automation Conference 2017. DOI: http://dx.doi.org/10.1145/3061639.3062244 [37] AWS Cloud: Xilinx FPGAs in the World's Largest Cloud, https://www.xilinx.com/products/design-tools/acceleration-zone/aws.html [38] Adaptive Machine Learning Acceleration, https://www.xilinx.com/applications/megatrends/machine-learning.html [39] Mipsology: The Future of FPGA-Based Machine Learning, https://www.xilinx.com/support/documentation/product-briefs/mipsology-aws-f1.pdf
[40] Roberto DiCecco, Griffin Lacey, Jasmina Vasiljevic, Paul Chow, Graham Taylor and Shawki Areibi, Caffeinated FPGAs: FPGA Framework For Convolutional Neural Networks, 2016 International Conference on Field-Programmable Technology (FPT), 7-9 Dec. 2016, Xi'an, China. DOI: 10.1109/FPT.2016.7929549
[41] Yufei Ma, Naveen Suda, Yu Cao, Sarma Vrudhula, and Jae Sun Seo, ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler, Integration, the VLSI Journal, 2018. DOI:10.1016/j.vlsi.2017.12.009
[42] Raghid Morcel, Haitham Akkary, Hazem Hajj, Mazen Saghir, Anil Keshavamurthy, Rahul Khanna, and Hassan Artail. 2017, Minimalist Design for Accelerating Convolutional Neural Networks for LowEnd FPGA Platforms. In Field-Programmable Custom Computing Machines (FCCM), 2017 IEEE 25th Annual International Symposium on. IEEE, 196–196.
[43] Thorbjörn Posewsky, Daniel Ziener: A Flexible FPGA-Based Inference Architecture for Pruned Deep Neural Networks. ARCS 2018: 311-323
[44] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1– 12.
[45] Hanqing Zeng, Ren Chen, Chi Zhang, and Viktor Prasanna, A Framework for Generating High Throughput CNN Implementations on FPGAs, In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '18). ACM, New York, NY, USA, 117-126. DOI: https://doi.org/10.1145/3174243.3174265 [46] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (2017). arXiv:1704.04861 http://arxiv.org/abs/1704.04861
[47] Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. 2016, Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks, In Computer-Aided Design (ICCAD), 2016 IEEE/ACM International Conference on. IEEE, 1–8. [48] Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li, DeepBurning: automatic generation of FPGA-based learning accelerators for the neural network family, In Design Automation Conference (DAC), 2016 53nd ACM/EDAC/IEEE.IEEE, 1–6. [49] Maohua Zhu, Liu Liu, Chao Wang, Yuan Xie, CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA- a Practical Study with Trade-off Analysis, Distributed, Parallel, and Cluster Computing (cs.DC). 2016. https://arxiv.org/pdf/1606.06234.pdf
[50] Andrea Solazzo, Emanuele Del Sozzo, Irene De Rose, Matteo De Silvestri, Gianluca C. Durelli, Marco D. Santambrogio, Hardware Design Automation of Convolutional Neural Networks, 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 224-229, 2016. Doi: https://doi.org/10.1109/isvlsi.2016.101 [51] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, Going Deeper with Convolutions. CoRR, abs/1409.4842, 2014.
[52] K. Simonyan, A. Zisserman, Very deep convolutional networks for large scale visual recognition, in: Proceedings of ICLR, 2015. [53] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun, OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. CoRR, abs/1312.6229, 2013.
[54] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Leong, Magnus Jahre and Kees Vissers, FINN: A Framework for Fast, Scalable Binarized Neural Network Inference, FPGA '17 Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Pages 65-74, Monterey, California, USA, February 22-24, 2017. Doi:10.1145/3020078.3021744
[55] Tomoya Fujii, Simpei Sato, Hiroki Nakahara, and Masato Motomura, An FPGA Realization of a Deep Convolutional Neural Network Using a Threshold Neuron Pruning, ARC 2017, LNCS 10216, pp. 268–280, 2017. DOI: 10.1007/978-3-319-56258-2 23.
[56] D.D. Lin, S.S. Talathi, V.S. Annapureddy, Fixed point quantization of deep convolutional networks, arXiv:1511.06393, 2015.
[57] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, Raquel Urtasun, Andreas Moshovos, Proteus: Exploiting precision variability in deep neural networks, Parallel Computing, Volume 73, 2018, Pages 40-51, ISSN 0167-8191, https://doi.org/10.1016/j.parco.2017.05.003.
[58] L. Lai, N. Suda, V. Chandra. Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations, arXiv:1703.03073, 2017. [59] Yixing Li, Zichuan Liu, Kai Xu, Hao Yu, A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks, ACM Journal on Emerging Technologies in Computing Systems, Vol. 14, No. 2, Article 18, 2018. Doi: https://doi.org/10.1145/3154839 [60] S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
[61] M. Shafique, F. Sampaio, B. Zatt, S. Bampi, J. Henkel, Resilience-Driven STT-RAM Cache Architecture for Approximate Computing, Workshop on Approximate Computing (AC), Paderborn, Germany, Oct. 15-16, 2015.
[62] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org [63] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. [64] High, R., The Era of Cognitive Systems: An Inside Look at IBM Watson and How it Works. IBM Corporation, Redbooks, Armonk (2012) Google Scholar.
[65] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Tomasz Czajkowski, Stephen D. Brown, and Jason H. Anderson. 2013. LegUp: An Open-source High level Synthesis Tool for FPGA-based Processor/Accelerator Systems. ACM Trans. Embed. Comput. Syst. 13, 2, Article 24 (Sept. 2013), 27 pages [66] Soumith Chintala. Convnet-benchmarks. [Online]. Available: https://github.com/soumith/convnetbenchmarks.
[67] Aysegul Dundar, Jonghoon Jin, Berin Martini and Eugenio Culurciello, Embedded Streaming Deep Neural Networks Accelerator with Applications, IEEE Transactions on Neural Networks and Learning Systems, Volume: 28, Issue: 7, 2016. DOI: 10.1109/TNNLS.2016.2545298.
[68] R. Collobert, C. Farabet, and K. Kavukcuoglu, Torch7: A matlab-like environment for machine learning, in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.
[69] Francisco Ortega-Zamorano, José M. Jerez, Iván Gómez and Leonardo Franco, Deep Neural Network Architecture Implementation on FPGAs Using a Layer Multiplexing Scheme, Distributed Computing and Artificial Intelligence, 13th International Conference, pp. 79-86, Advances in Intelligent Systems and Computing 474. DOI: 10.1007/978-3-319-40162-1_9.
[70] M. Alwani, H. Chen, M. Ferdman, and P. Milder. Fused-layer CNN accelerators. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 15-19 Oct. 2016. DOI: 10.1109/MICRO.2016.7783725 [71] Chainer: a powerful, flexible, and intuitive framework of neural networks. http://chainer.org/ [72] Shuang Liang, Shouyi Yin, Leibo Liu, Wayne Luk, Shaojun Wei, FP-BNN: Binarized Neural Network on FPGA, Neurocomputing (2017), doi: 10.1016/j.neucom.2017.09.046 [73] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy, TVM: An Automated End-to-End Optimizing Compiler for Deep Learning, arXiv:1802.04799v2 [cs.LG] 20 May 2018. https://arxiv.org/pdf/1802.04799.pdf
[74] M. Lin, Q. Chen, S. Yan, Network in network, in: Int. Conf. on Learning Representations (ICLR), 2014. [75] Yanzhi Wang, Caiwen Ding, Zhe Li, Geng Yuan, Siyu Liao, Xiaolong Ma, Bo Yuan, Xuehai Qian, Jian Tang, Qinru Qiu, Xue Lin, Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework, arXiv:1802.06402v1 [cs.LG] 18 Feb 2018. https://arxiv.org/pdf/1802.06402.pdf [76] Amir Yazdanbakhsh, Michael Brzozowski, Behnam Khaleghi, Soroush Ghodrati, Kambiz Samadi, Nam Sung Kim, Hadi Esmaeilzadeh, FlexiGAN: An End-to-End Solution for FPGA Acceleration of Generative Adversarial Networks, in Proceedings of the 26th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018. https://www.cc.gatech.edu/~hadi/doc/paper/2018-fccm-flexigan.pdf
[77] Daniel Holanda Noronha, Philip H.W. Leong, Steven J.E. Wilton, Kibo: An Open-Source Fixed-Point Tool-kit for Training and Inference in FPGA-Based Deep Learning Networks, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 21-25 May 2018. DOI: 10.1109/IPDPSW.2018.00034 [78] Wei Ding, Zeyu Huang, Zunkai Huang, Li Tian, Hui Wang, Songlin Feng, Designing Efficient Accelerator of Depthwise Separable Convolutional Neural Network on FPGA, accepted manuscript in Journal of Systems Architecture, 2018. doi: https://doi.org/10.1016/j.sysarc.2018.12.008.
[79] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [80] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity Mappings in Deep Residual Networks. arXiv preprint arXiv:1603.05027 (2016). [81] J. Wu et al., "Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling," in NIPS, 2016.
[82] W. R. Tan et al., ArtGAN: Artwork Synthesis with Conditional Categorical GANs, in arXiv, 2017.
AUTHOR BIOGRAPHY
Dr. Ahmed Ghazi BLAIECH received his PhD degree in Computer Science from Sfax University in 2015. He is currently an Assistant Professor at the Higher Institute of Applied Sciences and Technology of Sousse (ISSATSo), University of Sousse, Tunisia. He is a researcher at the Technology and Medical Imaging Laboratory LTIM (LR12ES06) at the Faculty of Medicine of Monastir, University of Monastir, Tunisia. His research activities are related to image/signal processing, machine learning, hardware/software design and programming, real-time and parallel systems, and the Internet of Things.
Dr. Khaled BEN KHALIFA received his MSc in Physics-Microelectronics, his DEA in Matériaux et Dispositifs pour l'Électronique, and his PhD in Physics-Electronics from the University of Monastir, Tunisia, in 1999, 2001 and 2006, respectively. Currently, he is an assistant professor (electrical engineering) at the Higher Institute of Applied Sciences and Technology, University of Sousse, Tunisia, and a senior researcher at the "Technologie et Imagerie Médicale" Laboratory (LR12ES06) at the Faculty of Medicine, University of Monastir, Tunisia. His research interests are related to real-time embedded systems, FPGA-based systems, systems-on-chip, neural networks and heterogeneous multiprocessor architectures.
Pr. Carlos Alberto VALDERRAMA SAKUYAMA is, since 2004, Director of the Electronics and Microelectronics Department at the Polytechnic Faculty of the University of Mons in Belgium and former member of the Numediart and InforTech institutes. He obtained the Ph.D. degree at the INPG (1998, Grenoble, France), the M.Sc. diploma at the UFRN/COPPE (1993, Rio de Janeiro, Brazil), and the electrical-electronics engineering diploma at the UNC (1989, Cordoba, Argentina). He was invited professor at the UCC (2011, Cordoba, Argentina), at the UFPE (2004, Pernambuco, Brazil) and at the UFRN (1998 & 2017, Rio Grande do Norte, Brazil). He was team manager at CoWare, today Synopsys (1999-2004, Belgium). In 2009, he was responsible for the creation of the spinoff Nsilition from a project funded by the Walloon Region. He has participated in more than 18 national and international research projects, from 4G chips development to next-generation tracking devices and architectures for IoT, HPC and space. He serves as technical reviewer and committee member of multiple journals and international conferences. His research activity is supported by more than 150 publications on international conferences, more than 10 books chapters, and more than 30 scientific journals. He is IEEE senior member since 2006.
Pr.Marcelo A. C. Fernandes was born in Natal, Brazil. He received the BS degree in Electrical Engineering in 1997, the MS degree in Electrical Engineering in 1999, from the Federal University of Rio Grande do Norte, Natal, Brazil and the Ph.D. degree in Electrical Engineering in 2010, from the University of Campinas, Campinas, SP, Brazil. Currently, he is an Associate Professor in the Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal, Brazil. His research interests include artificial intelligence, digital signal processing, embedded systems, reconfigurable hardware, and tactile internet.
Pr.Mohamed Hedi Bedoui received the Ph.D. degree in biomedical engineering from Lille University, Villeneuve-d’Ascq, France, in 1992. He is currently a Professor of Biophysics with the Faculty of Medicine, Monastir University, Monastir, Tunisia. He is the Director of the “Technologie et Imagerie Médicale Laboratory (LR12ES06)” with the Faculty of Medicine, Monastir University. He has many published papers in international journals. His current research interests include biophysics, medical imaging processing, embedded system, and codesign HW/SW.