Area and Power Efficient Pipelined Hybrid Merged Adders for Customized Deep Learning Framework for FPGA Implementation

Dr. T. Kowsalya, Associate Professor, Department of Electronics and Communication Engineering, Muthayammal Engineering College, Rasipuram, Namakkal, Tamilnadu, India. E-mail: [email protected]

Abstract: With the rapid growth of deep learning and neural network algorithms, fields such as communication, industrial automation, computer vision and medical applications have seen drastic improvements in recent years. However, deep learning and neural network models keep growing, and so does the number of parameters used to represent them. Although existing models rely on efficient GPUs to accommodate this growth, their implementation on dedicated embedded devices needs further optimization, which remains a real challenge for researchers. This paper therefore investigates deep learning frameworks, and in particular reviews the adders implemented within them. A new pipelined hybrid merged adder (PHMAC), optimized for FPGA architectures and more efficient in terms of area and power, is presented. The proposed adder integrates the carry-select and carry-look-ahead principles, re-using each LUT for different inputs, which consumes less power and utilizes area effectively. The proposed adders were investigated on different FPGA architectures, and their power and area were analyzed. Compared with carry select adders (CSA), carry look-ahead adders (CLA), carry skip adders and Kogge-Stone adders, the proposed adders achieve roughly a 50% reduction in area and a 45% reduction in power.

Keywords: Deep Learning Framework, PHMAC, GPU, Neural Networks, Optimization.
1. Introduction

In recent years, there has been a dramatic rise of neural network and deep learning frameworks over traditional learning models in fields such as image processing, vision, medical systems and communication. Deep learning algorithms such as Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) have been proposed for different application research areas. With the advent of these algorithms, detection accuracy has increased from 78% to 95%, and classification ratios have improved further through the extraction of several features. In short, the capabilities of deep learning algorithms have made them highly suitable for artificial intelligence (AI) applications.
However, Deep Neural Network (DNN) structures face complexities in terms of storage and power. Since the size of a deep learning model grows with its input size, processing demands more floating point operations (FLOPs), power and area. It is therefore important to optimize the computational model of deep neural networks. CPUs can perform 10-100 GFLOP/s, but their power efficiency is less than 1 GOP/J [1], which makes it very hard to achieve performance improvement together with low power and area. This becomes a real obstacle when implementing such models on embedded devices, where power and area play a mandatory role. CPUs, GPUs and FPGAs are gradually becoming the platforms of choice for implementing area-efficient deep learning neural networks [2]. FPGAs can provide high parallelism and simplified logic for the computation of neural networks, with dedicated hardware for any particular model, and several researchers have demonstrated the possibility of simplifying a neural network model in hardware without affecting its accuracy; thus, an FPGA can achieve more efficiency than an embedded CPU. A deep learning framework consists of five different layers, namely the convolutional layer, the fully connected layer, the pooling layer, the non-linear layer and the element-wise layer. Of these, the convolutional layer (CONV) performs a two-dimensional neuron process in which adders and multipliers are integrated to carry out the operation. As noted above, the number of adders and multipliers grows rapidly as the input size increases. Hence this paper proposes a novel collection of pipelined hybrid merged adders which integrate the carry-select and carry-look-ahead principles, re-using each LUT for different inputs so that less power is consumed within the effective area. The paper discusses the implementation of the proposed adders in the CONV layer of deep learning neural networks. The integration is based on the s3 principle, in which a LUT is shared among operands consisting of both signed and unsigned bits. The paper also discusses LUT optimization for packing the proposed adders into adder-tree structures suitable for a deep learning framework, yielding nearly 50% and 45% reductions in area and power respectively. The paper is organized as follows:
(i) Related works are detailed in Section 2.
(ii) Background preliminaries on deep learning algorithms are presented in Section 3.
(iii) The architecture, workflow and implementation mechanism of the proposed pipelined hybrid merged adders are discussed in Section 4.
(iv) The experimental setup, simulation results and comparative analysis are detailed in Section 5.
(v) The conclusion, with an indication of future scope, is presented in Section 6.
2. Related works

Teng Wang et al. surveyed neural network accelerators based on FPGAs. In particular, the work separately reviews accelerators designed for specific problems, specific algorithms, algorithm features, and general templates; it compares the structure and implementation of FPGA-based accelerators across different devices and network models, contrasts them with CPU and GPU versions, and finally summarizes the advantages and limitations of FPGA platforms along with directions open for future research [1]. Chen Zhang et al. used the roofline model to identify the solution with the best performance and lowest FPGA resource requirement, implemented a CNN accelerator on a VC707 FPGA board and compared it with previous approaches, achieving a peak performance of 61.62 GFLOPS under a 100 MHz working frequency, which significantly outperforms earlier designs [2]. Matthieu Courbariaux et al. presented BinaryConnect, a method that binarizes the weights during both forward and backward propagation. The work demonstrates its feasibility by training with BinaryConnect on the CIFAR-10 and SVHN datasets. The effect of such a strategy on dedicated hardware implementations of a deep network can be major, chiefly by removing the need for roughly two-thirds of the multiplications, thereby potentially allowing a speed-up by a factor of about 3 at training time. At test time, the impact of the deterministic version of BinaryConnect could be considerably more significant, reducing the memory requirement of a deep network by a factor of up to 16, which affects both the memory bandwidth needed for computation and the size of the models that can be run [3]. Chao Wang and Lei Gong used FPGAs for large-scale deep learning networks with the objective of maintaining low power and cost. Their DLAU uses a carry-save adder in the computation process, and to further increase computation speed a parallel self-timed adder is used in the design; the reported results, however, show higher power consumption of the DLAU accelerator when compared to other devices [4]. Stylianos I. Venieris et al. presented a survey of the existing CNN-to-FPGA tool flows, with a comparative analysis of their key characteristics, including the supported applications, architectural choices, design-space exploration techniques and achieved performance. In addition, significant challenges and objectives introduced by the latest trends in CNN algorithmic research are identified and presented, and a uniform evaluation methodology is proposed, aiming at a thorough, complete and in-depth evaluation of CNN-to-FPGA tool flows [5]. Javier Duarte et al. conducted a case study of neural network inference on FPGAs, focusing on a classifier for jet substructure which would enable, among many other physics scenarios, searches for new dark-sector particles and novel measurements of the Higgs boson. While the work centres on a particular model, the lessons are broad. A companion compiler package based on High-Level Synthesis (HLS), called hls4ml, has been developed to build machine-learning models in FPGAs.
The use of HLS increases accessibility across a broad user community and allows an exceptional reduction in firmware development time. The work maps FPGA resource usage and latency against neural network hyper-parameters to identify the particle-physics problems that would benefit from performing neural network inference on FPGAs. A jet-substructure model was proposed for this purpose, but the implementation of the required arithmetic units still needs research [6]. Jason Luu et al. considered the broad design space of hard adders and carry chains within a soft logic block. Their work shows that distinct hard adder and carry chain designs yield very similar area and delay values on real applications, despite significant differences on micro-benchmarks; hardened adders provide a speed-up of roughly 15% for an area penalty of around 5%, giving an area-delay product reduction of roughly 10% [7]. Yufei Ma et al. provided a novel, adaptable solution for RTL implementation: a compiler that coordinates a set of scalable modules to accelerate the operations of different deep learning algorithms on an FPGA. Integrating the modules for CNN implementation, the work provides a compiler strategy for optimizing throughput. The proposed RTL compiler, named ALAMO, is demonstrated on an Altera Stratix-V GXA7 FPGA for the inference tasks of the AlexNet and NiN CNN models, achieving 114.5 GOPS and 117.3 GOPS respectively. This represents a 1.9X improvement in throughput compared with an OpenCL-based design. The results show the promise of an automatic compiler solution for modularized, scalable hardware acceleration of deep learning; however, the authors do not explain the detailed implementation of deep learning algorithms in terms of low-power FPGA designs [8]. Adrien Prost-Boucle et al. introduced a new architecture fully dedicated to inference using ternary weights. The architecture provides the best throughput, and its advantage lies in its configurability during the design process: it utilizes data drawn from the target technology in order to make the best use of FPGA resources. The architecture achieves 5.2k frames/sec/Watt on a VC709 board for the classification process, using approximately half of the FPGA resources [9]. Atul Rahman et al. proposed a novel framework for accurate evaluation and exploration of different design choices for CNN accelerators on FPGA resources. The work is very extensive in terms of design space and is useful for increasing system performance and for sizing on-chip and off-chip memories. Its layers show the viability of the framework, as well as the need for such high-level architectural investigation to find the best designs of a CNN model. This high-level architecture, however, consumes a large area and more power, which makes the implementation of the deep learning algorithm unsuitable for smart-device designs [10]. Li Yang et al. carried out an investigation from software to hardware and from circuit level to system level, giving a complete analysis of FPGA-based neural network inference accelerator design that serves as a guide for future work [11].
Ying Wang, Jie Xu et al. proposed DeepBurning, which implements FPGA-based NN accelerators or helps generate chip designs at an early design stage. DeepBurning supports a large family of NN models and greatly shortens the design flow of NN accelerators for machine learning application developers. The evaluation shows that the generated learning accelerators, burnt onto an FPGA board, exhibit great power efficiency compared with state-of-the-art FPGA-based solutions [12].

3. Background preliminary view: deep learning framework

Deep learning consolidates low-level features to represent high-level attributes in order to discover distributed feature representations. The idea was proposed by Hinton in 2006 [1], together with an unsupervised greedy layer-by-layer training algorithm intended to settle the optimization issues related to deep structures; the deep structure of the multi-layer auto-encoder was then proposed. In addition, the convolutional neural network proposed by LeCun et al. was the first true multi-layer structure learning algorithm, which uses relative spatial relationships to reduce the number of parameters and improve training performance. In deep learning, a neural network is a bio-inspired model that normally incorporates numerous layers of neurons and a combination of operations between the layers. Every layer receives the neurons of the previous layer as input [2], and each neuron computes a weighted sum of all the input neurons connected to it; the parameter of a connection is known as its weight. Various neural network models are composed of various sorts of layers, which are introduced one by one below.

3.1) Fully connected layer

A fully connected (FC) layer implements a connection between each input neuron and each output neuron. A weight and a bias [3] act on every element of the input and output:
Fout = Fin ∗ Weight + bias        (1)
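For clarity, Equation (1) can be written with explicit dimensions; the notation with N input neurons and M output neurons is ours, added here for illustration:

$$F_{out} = W F_{in} + b, \qquad F_{in} \in \mathbb{R}^{N}, \; W \in \mathbb{R}^{M \times N}, \; b, F_{out} \in \mathbb{R}^{M}, \qquad F_{out,j} = \sum_{i=0}^{N-1} W_{ji} F_{in,i} + b_{j},$$

so one FC layer costs M·N multiply-accumulate operations per input vector.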
3.2) Convolutional layer

The convolutional (CONV) layer operates on two-dimensional neuron forms. Its input and output neurons are described as two-dimensional feature-map sets, Fin and Fout, and each feature map is called a channel [4]. The CONV layer applies a two-dimensional convolution kernel Kij to each input-output channel pair and adds an offset (bias) bj to each output channel. The computation of a CONV layer with N input channels and M output channels is given by Equation (2):

$$F_{out}(j) = \sum_{i=0}^{N-1} F_{in}(i) \ast K_{ij} + b_{j}, \qquad j = 0, 1, \dots, M-1 \qquad (2)$$
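Equation (2) is where most of the arithmetic of a deep learning framework lives. As a rough cost model (our addition, consistent with the earlier remark that the adders and multipliers grow with input size), a CONV layer with K × K kernels and an H × W output feature map performs

$$\#\mathrm{MAC} = N \cdot M \cdot K^{2} \cdot H \cdot W$$

multiply-accumulate operations. For example, N = 64, M = 128, K = 3 and H = W = 56 already gives about 2.3 × 10^8 MACs per input image, which is why the area and power of the underlying adders dominate the hardware budget.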
3.3) Non-linear layer

A non-linear function, also called the activation function, is generally applied at the output of every neuron. Commonly used activation functions in this layer are sigmoid, tanh, ReLU and so on.

3.4) Pooling layer

The pooling layer is mostly used for two-dimensional neuron forms. Pooling down-samples each input channel independently [5], which reduces the feature dimension. There are two common down-sampling techniques: average pooling and max pooling.

3.5) Element-wise layer

This layer is generally used in RNNs, and sometimes in certain CNN models [6], [7]. It receives two neuron tensors of identical dimensions and performs a binary operation on the corresponding neurons of the two tensors.

4. PROPOSED ADDERS

The proposed adders are pipelined hybrid adders which integrate unsigned radix-2 input numbers and two's-complement numbers. Consider an unsigned-digit (USD) number whose digits are drawn from the digit set D = {−β, ..., −1, 0, 1, ..., β}, where r is the radix and β is the largest digit in the digit set. Let X and Y be redundant radix-2 USD numbers; every digit Xi of X comprises two bits, Xi+ and Xi−, such that Xi = Xi+ − Xi−. The sum digit Z likewise comprises two binary bits, t+ and u−, whose decimal value is computed as 2t+ − u− [8], [9]. The inputs X1+, X1− and Y1 are fed to the hybrid adder, which works on the s3 (sum-sum-subtractor) principle and produces t+ and u− as outputs [5], as shown in the truth table of Table I.

Table I: Functional truth table of the proposed adder

| Y1 | X1 | X1+ | X1− | Output (decimal) |
|----|----|-----|-----|------------------|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 |
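As a worked instance of the digit encoding (our illustration, not taken from the paper): with X1+ = 1, X1− = 0 and Y1 = 1, the operand digit is X1 = X1+ − X1− = 1, so the digit sum is X1 + Y1 = 2, which the cell encodes at its output as t+ = 1, u− = 0, since 2t+ − u− = 2·1 − 0 = 2.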
Fig. 1 shows the corresponding adder. The Boolean functions governing the proposed s3 cell, written consistently with the encoding Z = 2t+ − u−, are

$$t_1^+ = Y_1 X_1^+ + X_1^+ \overline{X_1^-} + Y_1 \overline{X_1^-}, \qquad u_1^- = X_1^+ \oplus X_1^- \oplus Y_1 .$$

A unified addition mechanism is controlled by the control signal 'C', which selects between unsigned bits and two's-complement numbers. If C = 0, unsigned numbers are selected for addition and the modified architecture shown in Fig. 2 is adopted.
Fig. 1: Hybrid full adder for radix-2 unsigned numbers and two's-complement numbers.

Fig. 2: Hybrid full adder for radix-2 signed numbers and two's-complement numbers.
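To make the cell concrete, a minimal Verilog sketch of one digit of the hybrid adder is given below. It is a sketch under stated assumptions, not the paper's netlist: the module and port names are ours, the carry expression is derived here from the Z = 2t+ − u− encoding, and masking X1− with C is one plausible reading of the unified unsigned/two's-complement control.

```verilog
// Illustrative single-digit cell of the hybrid s3 adder (assumed names).
// x_p and x_m encode the signed digit X = x_p - x_m; y is the plain bit.
// c selects the mode: 0 = unsigned (x_m ignored), 1 = signed operand.
// The outputs encode the sum digit Z = 2*t_p - u_m.
module s3_digit_cell (
    input  wire x_p,  // positive-weight bit of the signed digit
    input  wire x_m,  // negative-weight bit of the signed digit
    input  wire y,    // unsigned operand bit
    input  wire c,    // mode control (assumption: masks x_m when 0)
    output wire t_p,  // positive output bit (weight +2)
    output wire u_m   // negative output bit (weight -1)
);
    wire xm = x_m & c;                        // effective negative bit

    assign u_m = x_p ^ xm ^ y;                // parity of the digit sum
    // t_p = 1 exactly when x_p - xm + y >= 1:
    assign t_p = (x_p & y) | (~xm & (x_p | y));
endmodule
```

In unsigned mode (c = 0) this reduces to u_m = x_p ^ y and t_p = x_p | y, so Z = 2(x_p | y) − (x_p ^ y) correctly yields 0, 1 or 2 for the four input combinations.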
The proposed hybrid full adders can be packed into single LUTs as adder trees in dual-output mode, in which a LUT accepts three or more inputs and produces two outputs. They can be implemented in slice logic; however, a naive LUT-based implementation leaves about 50% of each LUT unused, which is ineffective area utilization. To overcome this problem, the unused LUT capacity is optimized for the proposed adders based on the netlists [10], [11], so that it can be used in the adder-tree structure of the deep learning framework. The principle of the optimization is shown in Fig. 3, and the merged structure is reused by consecutive adders.
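To show how such cells would be arranged, the sketch below gives a small pipelined adder tree in Verilog. It is only a structural illustration under our own assumptions: plain '+' operators stand in for the merged hybrid LUT cells, and the module name, widths and two-level depth are arbitrary.

```verilog
// Illustrative pipelined adder tree: four operands reduced to one sum,
// with a register stage after each tree level (two-stage pipeline).
// In the proposed design each '+' would be realized by the merged
// hybrid LUT cells rather than a plain ripple adder.
module pipelined_adder_tree #(
    parameter W = 64                 // operand width
) (
    input  wire          clk,
    input  wire [W-1:0]  a, b, c, d,
    output reg  [W+1:0]  sum
);
    reg [W:0] s0, s1;                // level-1 partial sums

    always @(posedge clk) begin
        s0  <= a + b;                // level 1: two adds in parallel
        s1  <= c + d;
        sum <= s0 + s1;              // level 2: one cycle later
    end
endmodule
```

Because the level-1 results are registered, a new set of four operands can be accepted every clock cycle, with a latency of two cycles.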
Fig. 3: Different optimized LUTs for the proposed adders using different inputs.

The basic functionality of the optimized merged LUTs is illustrated in Fig. 3. For an n-input adder, only n−1 LUTs are used, whereas n LUTs would normally be generated for n inputs during RTL synthesis [12]. Assume n = 3: the addition of the two unsigned numbers uses one LUT, whose result is partially integrated with the remaining two's-complement input using the other LUT to produce the sum. These merged structures contribute to the highly effective area utilization that suits the deep learning framework, and a high degree of parallelism is applied to the proposed structure to achieve high-speed computation.

5. EXPERIMENTAL SETUP

The proposed pipelined hybrid merged adders were developed in the Verilog hardware description language, then synthesized and simulated using the Xilinx VIVADO tools to evaluate the parameters of interest, namely area and power. Different Xilinx devices were used for the experimentation; their specifications are shown in Table II.

Table II: Hardware specifications used for the experimentation setup

| S.No | FPGA Features | Hardware Description_1 | Hardware Description_2 |
|------|--------------------------|----------------|-------------|
| 01 | FPGA used | xc7s25csga324 | xcE6vlx760 |
| 02 | No. of Slice LUTs | 14600 | 118560 |
| 03 | No. of Slice Registers | 29200 | 758784 |
| 04 | IOBs | 150 | 240 |
| 05 | Buffer Control Registers | 32 | 48 |
| 06 | No. of DSPs | 10 | 07 |
The proposed adders were then simulated in VIVADO with 64-bit inputs suitable for image processing applications. The outputs of the proposed adders are shown in Fig. 4.

Fig. 4: Simulated outputs of the proposed adders for 64-bit inputs.

The RTL schematics of the proposed 64-bit adders were generated with the same tool and are shown in Fig. 5. The synthesis report for the proposed adders obtained from Xilinx VIVADO is shown in Fig. 6, and the power analysis output obtained with the Xilinx VIVADO XPower Analyzer is shown in Fig. 7. As the next step of evaluation, the area in terms of LUTs and the dynamic power were calculated; the details are given in Table III.
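For reference, a minimal self-checking testbench of the kind that could drive such a 64-bit simulation is sketched below. The paper does not list its testbench, so everything here is an assumption: the top-level name phmac64, its ports, and the two-stage behavioural stand-in that merely lets the bench run on its own.

```verilog
`timescale 1ns / 1ps

// Behavioural stand-in for the adder under test (assumed interface).
// The real design would be the synthesized PHMAC netlist; this
// placeholder only models a two-stage pipelined 64-bit addition.
module phmac64 (
    input  wire        clk,
    input  wire [63:0] x, y,
    output reg  [64:0] z
);
    reg [64:0] s1;
    always @(posedge clk) begin
        s1 <= x + y;   // pipeline stage 1
        z  <= s1;      // pipeline stage 2
    end
endmodule

// Self-checking bench: random 64-bit operands, compared with x + y.
module tb_phmac64;
    reg  [63:0] x, y;
    reg         clk = 1'b0;
    wire [64:0] z;

    phmac64 dut (.clk(clk), .x(x), .y(y), .z(z));

    always #5 clk = ~clk;              // 100 MHz clock

    integer i;
    initial begin
        for (i = 0; i < 100; i = i + 1) begin
            x = {$random, $random};
            y = {$random, $random};
            repeat (2) @(posedge clk); // cover the pipeline latency
            #1;                        // let the outputs settle
            if (z !== x + y)
                $display("Mismatch: %h + %h -> %h", x, y, z);
        end
        $finish;
    end
endmodule
```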
Fig 5 RTL Schematics Generated for the Proposed 64-bit Pipelined Hybrid Merged Adders
Fig 6 Synthesis report for the Proposed Adders obtained from the Xilinx VIVADO
Fig. 7: Power analysis output for the proposed adders using the Xilinx VIVADO XPower Analyzer.

Table III: Hardware utilization and power for the proposed 64-bit hybrid adders

| Sl.no | Structure details | Slice LUTs | Slice Registers | Bonded I/O Buffers | Buffer Control Registers | DSPs | Power (mW) |
|-------|------------------------|-----|-----|-----|-----|-----|-------|
| 01 | Proposed Hybrid Adders | 233 | 456 | 12% | 12% | Nil | 45.56 |
5.1 COMPARATIVE ANALYSIS

The proposed adder was implemented on the different families of Xilinx devices listed in Table II. Carry select adders (CSA), carry look-ahead adders (CLA), carry skip adders (CSKA) and Kogge-Stone adders (KSA) were implemented and compared with the proposed hybrid adders. Different input sizes were considered for the comparison, and different implementation cases were considered for experimentation and evaluation.

CASE (1): Table IV shows a comparative analysis between the existing adders and the proposed adder on the xc7s25csga324.

Table IV: Comparative analysis between the existing adders and the proposed hybrid adders (16-bit input)
| Sl.no | Adder Details | Input bits | Slice LUTs | Slice Registers | Bonded I/O Buffers | Buffer Control Registers | DSPs | Power (mW) |
|-------|-------|--------|----|----|----|----|---|------|
| 01 | CSA   | 16-bit | 16 | 44 | 1% | 1% | 0 | 2.88 |
| 02 | CLA   | 16-bit | 19 | 46 | 1% | 1% | 0 | 2.56 |
| 03 | CSKA  | 16-bit | 18 | 46 | 1% | 1% | 0 | 2.67 |
| 04 | KSA   | 16-bit | 17 | 55 | 1% | 1% | 0 | 3.46 |
| 05 | PHMAC | 16-bit | 15 | 30 | 1% | 1% | 0 | 2.00 |
Fig. 8 shows that PHMAC has the lowest power consumption of the compared adders, at 2 mW, while KSA consumes the most.
Fig. 8: Comparison chart of the existing adders and the proposed hybrid adders (16-bit input).

Table V shows a comparative analysis between the existing adders and the proposed hybrid adder for 32-bit inputs.
Table V: Comparative analysis between the existing adders and the proposed hybrid adders (32-bit input)

| Sl.no | Adder Details | Input bits | Slice LUTs | Slice Registers | Bonded I/O Buffers | Buffer Control Registers | DSPs | Power (mW) |
|-------|-------|--------|----|-----|----|----|---|-------|
| 01 | CSA   | 32-bit | 68 | 143 | 1% | 1% | 0 | 22.88 |
| 02 | CLA   | 32-bit | 67 | 156 | 1% | 1% | 0 | 28.56 |
| 03 | CSKA  | 32-bit | 66 | 176 | 1% | 1% | 0 | 34.67 |
| 04 | KSA   | 32-bit | 69 | 185 | 1% | 1% | 0 | 33.46 |
| 05 | PHMAC | 32-bit | 25 | 90  | 1% | 1% | 0 | 11.70 |
Fig. 9 shows that PHMAC consumes 11.70 mW, less than the other adders, while CSKA consumes the most.
Fig. 9: Comparison chart of the existing adders and the proposed hybrid adders (32-bit input).

Table VI: Comparative analysis between the existing adders and the proposed hybrid adders (64-bit input)

| Sl.no | Adder Details | Input bits | Slice LUTs | Slice Registers | Bonded I/O Buffers | Buffer Control Registers | DSPs | Power (mW) |
|-------|-------|--------|-----|------|-----|-----|---|-------|
| 01 | CSA   | 64-bit | 545 | 987  | 12% | 12% | 0 | 70.48 |
| 02 | CLA   | 64-bit | 534 | 990  | 12% | 12% | 0 | 71.56 |
| 03 | CSKA  | 64-bit | 566 | 1201 | 12% | 12% | 0 | 75.67 |
| 04 | KSA   | 64-bit | 669 | 1321 | 12% | 12% | 0 | 87.46 |
| 05 | PHMAC | 64-bit | 233 | 456  | 12% | 12% | 0 | 45.56 |
Fig. 10 shows that PHMAC consumes the least power, 45.56 mW, compared with the other adders.
Fig. 10: Comparison chart of the existing adders and the proposed hybrid adders (64-bit input).

CASE (II): Table VII shows a comparative analysis between the existing adders and the proposed adder on the xcE6vlx760.

Table VII: Comparative analysis between the existing adders and the proposed hybrid adders (16-bit input)

| Sl.no | Adder Details | Input bits | Slice LUTs | Slice Registers | Bonded I/O Buffers | Buffer Control Registers | DSPs | Power (mW) |
|-------|-------|--------|----|----|-----|----|---|------|
| 01 | CSA   | 16-bit | 29 | 40 | <1% | 1% | 0 | 3.48 |
| 02 | CLA   | 16-bit | 25 | 48 | <1% | 1% | 0 | 3.66 |
| 03 | CSKA  | 16-bit | 27 | 48 | <1% | 1% | 0 | 3.69 |
| 04 | KSA   | 16-bit | 26 | 75 | <1% | 1% | 0 | 3.78 |
| 05 | PHMAC | 16-bit | 15 | 30 | <1% | 1% | 0 | 2.00 |
Fig. 12 shows that PHMAC has the lowest power consumption, at 2 mW, while KSA consumes the most.
Fig. 12: Comparison chart of the existing adders and the proposed hybrid adders (16-bit input).

Table VIII: Comparative analysis between the existing adders and the proposed hybrid adders (32-bit input)

| Sl.no | Adder Details | Input bits | Slice LUTs | Slice Registers | Bonded I/O Buffers | Buffer Control Registers | DSPs | Power (mW) |
|-------|-------|--------|----|-----|----|----|---|-------|
| 01 | CSA   | 32-bit | 78 | 163 | 1% | 1% | 0 | 27.88 |
| 02 | CLA   | 32-bit | 79 | 166 | 1% | 1% | 0 | 26.56 |
| 03 | CSKA  | 32-bit | 89 | 196 | 1% | 1% | 0 | 44.89 |
| 04 | KSA   | 32-bit | 89 | 234 | 1% | 1% | 0 | 53.96 |
| 05 | PHMAC | 32-bit | 25 | 90  | 1% | 1% | 0 | 11.70 |
Fig. 13 shows that PHMAC has the lowest power consumption, at 11.70 mW, while KSA consumes the most.
Fig. 13: Comparison chart of the existing adders and the proposed hybrid adders (32-bit input).

Table IX: Comparative analysis between the existing adders and the proposed hybrid adders (64-bit input)

| Sl.no | Adder Details | Input bits | Slice LUTs | Slice Registers | Bonded I/O Buffers | Buffer Control Registers | DSPs | Power (mW) |
|-------|-------|--------|-----|------|----|----|---|-------|
| 01 | CSA   | 64-bit | 678 | 1197 | 9% | 9% | 0 | 70.48 |
| 02 | CLA   | 64-bit | 634 | 1290 | 9% | 9% | 0 | 71.56 |
| 03 | CSKA  | 64-bit | 754 | 1501 | 9% | 9% | 0 | 75.67 |
| 04 | KSA   | 64-bit | 769 | 1632 | 9% | 9% | 0 | 87.46 |
| 05 | PHMAC | 64-bit | 233 | 456  | 9% | 9% | 0 | 45.56 |
Fig. 14 shows that PHMAC has the lowest power consumption, at 45.56 mW, while KSA consumes the most.
Fig. 14: Comparison chart of the existing adders and the proposed hybrid adders (64-bit input).

The tables show that the proposed hybrid adders use roughly 50% less area and 45% less power than the existing adders (for instance, in the 64-bit case of Table VI the LUT count falls from 534-669 for the existing adders to 233 for PHMAC, and the power from 70.48-87.46 mW to 45.56 mW). Hence the proposed adders are well placed for implementing the deep learning framework on a field programmable gate array.

6. Conclusion

In this investigation, pipelined hybrid merged adders suitable for a deep learning framework were designed and implemented on the Xilinx xc7s25csga324 and xcE6vlx760 FPGAs. The major parameters, namely area utilization and dynamic power consumption, were computed and analyzed. The existing adder structures were also implemented on the same Xilinx FPGAs and compared with the proposed hybrid adders. Nearly a 50% reduction in area utilization and 45% lower power consumption were achieved with the proposed hybrid merged adders. These adders were customized to form adder trees in the deep learning algorithm, particularly in the convolution (CONV) layers. The possibility of reducing area utilization and power through an intelligent technique has been demonstrated, showing that deep learning techniques are readily realizable on embedded devices.

Conflict of interest: There is no conflict of interest.
REFERENCES:

1. Teng Wang, Chao Wang, Xuehai Zhou, Huaping Chen, "A Survey of FPGA Based Deep Learning Accelerators: Challenges and Opportunities," arXiv:1901.04988 [cs.DC], 2018.
2. Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, Jason Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161-170, 2015.
3. Matthieu Courbariaux, Yoshua Bengio, "BinaryConnect: Training Deep Neural Networks with binary weights during propagations," Neural Information Processing Systems Conference, pp. 1117-1125, 2015.
4. Chao Wang, Lei Gong, Qi Yu, Xi Li, Yuan Xie, Xuehai Zhou, "DLAU: A Scalable Deep Learning Accelerator Unit on FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 36, No. 3, pp. 513-517, 2017.
5. Stylianos I. Venieris, Alexandros Kouris, Christos-Savvas Bouganis, "Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions," ACM Computing Surveys, 36 pages, March 2018, https://doi.org/10.1145.
6. Javier Duarte, Song Han, Philip Harris, Sergo Jindariani, Edward Kreinar, Benjamin Kreis, Jennifer Ngadiuba, Maurizio Pierini, Ryan Rivera, Nhan Tran, Zhenbin Wu, "Fast inference of deep neural networks in FPGAs for particle physics," doi:10.1088/1748-0221/13/07/P07027, 2018.
7. Jason Luu, Conor McCullough, Sen Wang, Safeen Huda, Bo Yan, Charles Chiasson, Kenneth B. Kent, Jason Anderson, Jonathan Rose, Vaughn Betz, "On Hard Adders and Carry Chains in FPGAs," IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 52-59, 2014.
8. Yufei Ma, Naveen Suda, Yu Cao, Sarma Vrudhula, Jae-sun Seo, "ALAMO: FPGA acceleration of deep learning algorithms with a modularized RTL compiler," Integration, the VLSI Journal, pp. 1-10, 2017.
9. Adrien Prost-Boucle, Alban Bourge, Frederic Petrot, "High-Efficiency Convolutional Ternary Neural Networks with Custom Adder Trees and Weight Compression," ACM Transactions on Reconfigurable Technology and Systems, Vol. 11, No. 3, Article 15, 24 pages, December 2018, https://doi.org/10.1145/3270764.
10. Atul Rahman, Sangyun Oh, Jongeun Lee, Kiyoung Choi, "Design space exploration of FPGA accelerators for convolutional neural networks," Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1147-1152, 2017.
11. Li Yang, Zhezhi He, Deliang Fan, "A Fully Onchip Binarized Convolutional Neural Network FPGA Implementation with Accurate Inference," International Symposium on Low Power Electronics and Design (ISLPED '18), Seattle, WA, USA, July 23-25, 2018.
12. Ying Wang, Jie Xu, Yinhe Han, Huawei Li, Xiaowei Li, "DeepBurning: Automatic generation of FPGA-based learning accelerators for the Neural Network family," 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1-6, 2016.
Author Biography:
Dr. T. Kowsalya received the B.E. degree in Electronics and Communication Engineering from VLB Janaki Ammal College of Engineering and Technology, Coimbatore, in 1996, and the M.E. degree in Communication Systems from Government College of Technology, Coimbatore, in 2005. She completed the Ph.D. degree at Anna University, Chennai, in 2018. She has worked in various institutions as an Associate Professor and has 22 years of teaching experience. She is currently an Associate Professor at Muthayammal Engineering College, Rasipuram, Namakkal (Dt), Tamilnadu, India. Her research interests are low-power VLSI and signal processing.