Power-efficient ReRAM-aware CNN model generation

Maedeh Hemmat, Azadeh Davoodi

Department of Electrical and Computer Engineering, University of Wisconsin - Madison, Madison, WI, USA
ARTICLE INFO

Keywords: Convolutional Neural Network implementation; ReRAM crossbars; Analog-to-digital/digital-to-analog converters; Low-power design

ABSTRACT
This is the first work to propose a co-design approach for generating the network model of a convolutional neural network (CNN) and its ReRAM-based implementation to achieve power efficiency. The state of the art in this area starts from an already fixed CNN model and uses parallel crossbar structures to reach a desired precision for the edge weights given the limited precision provided by individual ReRAM devices. In contrast, in this work we keep the ReRAM crossbar structure in mind during model generation and alter a base CNN model such that the resulting implementation is more power efficient. This is achieved by eliminating the parallel crossbars as much as possible, thereby reducing the number of required analog-to-digital and digital-to-analog converters, which are dominant sources of power consumption. We study four architectural techniques, applicable to convolutional or fully-connected layers in a CNN, which alter the network model so that it can work with lower weight precision with negligible loss in accuracy. In our experiments, we achieve at least 50% power savings with almost no loss in accuracy for popular CNNs compared to the ReRAM implementation of the base model. Further, we propose two additional power reduction techniques and investigate the impact on power of replacing the conventional analog Sigmoid activation function with a simpler digital design. The results show that the power consumption of the network can be further decreased to at least 30% of the base model.
1. Introduction

Convolutional Neural Networks (CNNs) have gained significant attention in recent years and are widely used in many applications, especially computer vision. An important challenge for efficient hardware implementation of CNNs is their high demand in terms of memory and computation [1,2]. Metal-oxide Resistive Random-Access Memory (ReRAM) devices have emerged as a promising substrate for implementing the computation-intensive layers of a CNN, namely the convolutional and fully-connected layers [3]. ReRAM devices allow in-memory processing and are able to store and process large amounts of data in a small footprint. Furthermore, resistive cross-point structures, known as ReRAM crossbars, naturally transfer a weighted combination of input voltages to output currents, thus realizing the matrix-vector multiplications that are heavily used in CNNs with remarkable power efficiency [4]. However, the interface between ReRAMs and peripheral circuits is a challenge; while ReRAM is an analog computing device, the large amount of intermediate data generated in CNNs is usually stored as digital values [5,6]. In addition, some functions in CNNs, such as
'pooling', operate in digital mode. Therefore, ReRAM crossbars need additional interfaces, including analog-to-digital (ADC) and digital-to-analog (DAC) converters, to communicate with the rest of the system. Frequent use of ADCs and DACs contributes significantly to the on-chip area and power consumption [5–7].
In this work, we propose a co-design framework for generating the CNN model and implementing it on ReRAM crossbars to achieve power efficiency. In the proposed approach, we rely on an architecture-level design space exploration technique to achieve a power-efficient ReRAM-aware implementation of a CNN; the design exploration is done by changing the high-level network model while keeping its ReRAM-based implementation in mind. Given a 'base' CNN model, our techniques focus on changing the base model by various means so that the new model works with a reduced precision of the edge weights in one or more layers. These means include increasing the number of network layers in the base CNN model, changing the kernel size in a convolutional layer, etc., such that the overall classification accuracy remains the same. Typically, parallel ReRAM crossbars are used to achieve a desired precision per edge weight (e.g., 16 bits) given the limited precision of
individual ReRAM devices (e.g., 3 bits per device). However, an increase in the number of parallel crossbars directly translates into an increase in the number of ADCs and DACs, and thus in power consumption. Given this observation, our techniques aim to obtain a CNN model that needs the fewest parallel ReRAM crossbars, thereby minimizing the usage of ADCs and DACs and yielding a power-efficient implementation. In the end, an implementation is found which is guaranteed to have a similar accuracy as the base model with the same or better power consumption.
Compared to prior work on changing the CNN model for lower quantization [1], our CNN model generation approach is specifically designed around the precision constraints imposed by individual ReRAM devices as well as the practical limitations on the size of each ReRAM crossbar. Our techniques allow us to decide the edge-weight bit precision at the granularity of a CNN layer, meaning that different layers of a network can have different weight bit precision. Our techniques are also compatible with other edge-weight quantization techniques [4,8,9]; those techniques are not ReRAM-aware and can be applied beforehand to generate a 'generic' power-efficient base model. The base model is then the starting point to which our techniques are applied to obtain a ReRAM-aware power-efficient model. Our framework is also suitable to take as input a standard, trained model because (1) our proposed techniques always add more computational units to an existing network structure without any invasive alteration of the original one, and (2) the altered network is retrained considering the limitations of the underlying ReRAM technology while using the old weights as the starting point. The combination of (1) and (2), i.e., a less invasive alteration of the CNN together with an effective initialization of the weights, is the reason for obtaining good performance. The main contributions of our work are as follows:

• To reduce the power consumption of analog peripherals while maintaining accuracy, we propose an algorithmic procedure to alter the high-level architecture of a CNN by means of four different architectural techniques. The proposed algorithm searches the design space to find a point where the minimum power consumption is achieved with no significant degradation in accuracy.
• Once the ReRAM-aware CNN model is generated, we propose two more techniques, based on ADC resolution analysis and on skipping insignificant neurons, to further decrease the power consumption. The proposed techniques are direct consequences of eliminating parallel crossbars and reducing the bit precision of weights when generating the CNN model, which forces a significant portion of the weights to be zero.
• Reducing the bit precision of edge weights also reduces the bit precision of the output feature maps generated at each layer. This opens up an opportunity to replace the conventional analog Sigmoid activation function with a simpler design in digital mode. Considering the high efficiency and small footprint of ReRAM-based look-up tables, we study the impact of replacing the traditional Sigmoid function with ReRAM-based look-up tables in ReRAM-aware CNN models.

We evaluate the efficiency of our proposed architectures and power reduction techniques on well-known classification benchmarks including MNIST, CIFAR10, and ImageNet, running on four different networks. Our experimental results show that our proposed algorithmic procedure to generate a ReRAM-aware CNN model can save at least 50% of the power consumption with almost no loss in accuracy compared to the base model. In addition, applying the two power reduction techniques can reduce the power consumption of these popular CNNs to at least 30% of the power consumption of the base model.
The rest of the paper is organized as follows. In Section 2, we review the basics of CNNs and their ReRAM-based implementation. Section 3 discusses our architectural techniques, followed by the power reduction techniques in Section 4. In Section 5, we give an overview of related works and compare our work with them. The results are provided in Section 6 and the paper is concluded in Section 7.

2. Preliminaries

2.1. ReRAM device and crossbar structure

ReRAM is a passive two-port device with multiple resistive states which can be configured as a function of its voltage level and the previous state of the device. ReRAMs can be arranged to build a crossbar structure. For example, consider the top crossbar shown in Fig. 2(a). The resistance of the device at row i and column j is first configured to a specific level. The device can thus be viewed as storing a weight Gij corresponding to its conductance, and the array can be viewed as storing a weight matrix. First, analog input voltages are applied to the rows. A cell located at row i and column j passes a current equal to Gij × Vi. Using Kirchhoff's law, the current at the end of each column is the summation of the currents passed by each cell in that column. The crossbar structure thus implements analog multiplication of the weight matrix by the vector of input voltages at the rows, and generates the vector of output currents at the columns [2].
Fig. 2(a) shows the use of two crossbars to handle positive and negative weights. One crossbar is configured for only the positive weights (with 0s at the locations of the negative ones). Similarly, the second one is configured for the absolute values of the negative weights. The rows are connected to each other (sharing the DACs) to receive the same inputs, and the multiplications are done separately on each structure. Analog-to-digital converters (ADCs) are used per column before subtracting the results.

2.2. Convolutional Neural Networks: overview and ReRAM implementation

A CNN is composed of convolutional, subsampling, and fully-connected layers. First, a convolutional layer is one with 'feature maps' as its inputs and outputs. Consider the example in Fig. 1 for LeNet5 [10]. The first convolutional layer is C1 and receives a 28 × 28 image. It maps this image to 20 features of size 24 × 24. This is done by defining a window (known as a kernel) of size 5 × 5 which goes through the image with a stride of 1. Each kernel is mapped to a single point per feature map. The mapping for a single kernel k is expressed as Y^k = W X^k, where X^k (of size 25 × 1) is the vector corresponding to the 5 × 5 kernel and Y^k (of size 20 × 1) is the vector corresponding to the mapping of the kernel to the 20 feature maps. This mapping is done via multiplication by the weight matrix W of size 20 × 25. Note that W remains the same as k varies, implying that the multiplications may be done one kernel at a time, using the same weight matrix.
In a convolutional layer, the main operation is thus a matrix-vector multiplication per kernel, which is suitable for ReRAM crossbar implementation: first the ReRAM devices are configured for the weight matrix, then the crossbar rows are fed with analog voltages. The number of rows is equal to the kernel size, and the number of columns is equal to the number of kernels, which is the same as the number of feature maps. After each matrix-vector multiplication, an element-wise nonlinear operation such as a rectifier activation function is additionally applied, which is implemented separately as the last stage of a convolutional layer.
Each convolutional layer is typically followed by a pooling layer, which essentially applies subsampling and reduces the size of each feature map. For example, in Fig. 1, layer S1 reduces the 24 × 24 feature maps to 12 × 12. Implementation of subsampling is not computationally intensive and does not require ReRAM devices. In the LeNet5 example, we denote by CS1 the combination of C1 followed by S1.
Layer CS1 in this example is followed by layer CS2, which is another convolutional layer extracting 50 features followed by subsampling. Note that the inputs to CS2 are 20 feature maps, each of size 12 × 12. We denote this as 20 different channels. So for a ReRAM-based implementation of CS2, we require 20 different crossbar structures, each implementing one of the channels. In contrast, CS1 only has one channel. Finally, the third type of layer in a CNN is the fully-connected layer (FC1 and FC2 in this example).
Fig. 1. (a) Various layers in the LeNet5 CNN [10]; (b) Impact of two of our proposed techniques on LeNet5 network.
Fig. 2. Overview of the base architecture [6]: (a) ReRAM crossbar implementing analog matrix-vector multiplication; (b) Splitting in parallel structures to overcome the limited precision implemented by individual ReRAMs to reach a desired precision per layer.
Each fully-connected layer can also be viewed as a matrix-vector multiplication and is implemented by a ReRAM crossbar structure. In FC1, 50 × 4 × 4 inputs are mapped to 500 outputs, so W has 800 rows and 500 columns.
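The crossbar computation described in Sections 2.1 and 2.2 can be summarized in a few lines of code. The sketch below is a purely numerical model, not the authors' simulator: it splits a signed weight matrix into a positive and a negative conductance array, performs the analog-style multiply-accumulate per column, and subtracts the results, using the 20 × 25 matrix of layer C1 in LeNet5 as an example. All function and variable names are illustrative.

```python
import numpy as np

def crossbar_mvm(weights, inputs):
    """Model of the two-crossbar scheme of Fig. 2(a).

    weights : signed weight matrix (rows = kernel entries, columns = feature maps)
    inputs  : analog input voltages applied to the rows (one value per row)
    """
    g_pos = np.where(weights > 0, weights, 0.0)   # crossbar for positive weights
    g_neg = np.where(weights < 0, -weights, 0.0)  # crossbar for |negative| weights

    # Kirchhoff's law: each column current is the sum of G_ij * V_i over the rows.
    i_pos = g_pos.T @ inputs
    i_neg = g_neg.T @ inputs

    # The two column currents are digitized (one ADC per column) and subtracted.
    return i_pos - i_neg

# Layer C1 of LeNet5: a 5x5 kernel (25 rows) mapped to 20 feature maps (20 columns),
# i.e., the crossbar stores the transpose of the 20x25 matrix W.
W_t = np.random.randn(25, 20)
x_kernel = np.random.rand(25)        # one 5x5 input window, flattened
y = crossbar_mvm(W_t, x_kernel)      # 20 outputs, one per feature map
assert y.shape == (20,)
```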
2.3. Limitations and practical considerations

The ReRAM crossbar in a fully-connected layer is typically much larger than in a convolutional layer, which creates significant IR drop if implemented as one large structure. In practice, a crossbar is recommended to be at most 512 × 512 for reliable operation [11]. Therefore, weight matrices larger than this size are decomposed into ones with at most 512 per dimension, and the results are later combined to create the complete output [6].
Furthermore, a CNN may require some of its weights to have a precision higher than what can be implemented by a single ReRAM device. For example, a recommended precision for one ReRAM device is 3 bits in Multi-Level Cells (MLCs) for reliable operation in the presence of factors such as process variations [12], while in practice some weights in a CNN require up to 16 bits or higher precision [13]. To address this limitation, the structure is typically split into parallel ones, each implementing a specific sub-range within the desired range [6]. For instance, as shown in Fig. 2(b), assuming all weights require 15-bit precision, 5 split levels are created. The outputs of the split levels are then combined by adders/shifters to create a 15-bit quantity. Note that, in each split, there should be two crossbars for one single weight matrix to store both the negative and the positive weights. As an example, to store a weight matrix of size 1024 × 300 on 3-bit MLCs with 6-bit precision, we need 8 crossbars of size 512 × 300 once the practical size limitation is taken into consideration.
In the remainder of the paper, we refer to the ReRAM structure which incorporates all the above considerations for a base CNN model as the 'base architecture'. According to our experiments, 8-bit ADCs/DACs are used in the base architecture to generate the intermediate data resulting from the matrix multiplications in each split; this resolution minimizes the accuracy degradation.
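The crossbar count in the 1024 × 300 example follows directly from the constraints above: 2 row blocks of at most 512 rows, 2 split levels for 6-bit weights on 3-bit cells, and 2 crossbars per split for the positive/negative weights, giving 2 × 2 × 2 = 8 crossbars. A small sketch of this bookkeeping, together with the shift-and-add recombination of the split outputs, is shown below; the names and helper functions are illustrative, not part of the paper's tool flow.

```python
import math

BITS_PER_CELL = 3      # reliable MLC precision [12]
MAX_DIM = 512          # recommended maximum crossbar dimension [11]

def num_crossbars(rows, cols, weight_bits):
    """Count crossbars for one weight matrix under the practical limits of Section 2.3."""
    row_blocks = math.ceil(rows / MAX_DIM)
    col_blocks = math.ceil(cols / MAX_DIM)
    splits = math.ceil(weight_bits / BITS_PER_CELL)   # parallel split levels
    return row_blocks * col_blocks * splits * 2       # x2 for positive/negative crossbars

print(num_crossbars(1024, 300, 6))   # -> 8, matching the example in the text

def combine_splits(split_outputs):
    """Shift-and-add recombination of per-split digital outputs.

    split_outputs[0] holds the least-significant 3-bit slice,
    split_outputs[1] the next slice, and so on."""
    total = 0
    for level, out in enumerate(split_outputs):
        total += out << (level * BITS_PER_CELL)
    return total
```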
3. Proposed CNN model generation approach

We propose to alter the CNN model such that the resulting ReRAM structure is power efficient. Note that the alteration of the CNN model
and the evaluation of the power efficiency of the corresponding ReRAM implementation happen concurrently. This is in contrast to prior works, which use a fixed CNN model. Our work mainly addresses the issues and challenges of ReRAM-based implementation of CNNs and specifically targets ReRAM devices. Therefore, it is different from other works which apply quantization to reduce power consumption and memory storage; those generic techniques can be used to create a base model, which then becomes the input to our framework for achieving a power-efficient ReRAM-aware implementation. We note that once our techniques alter the base CNN model, we use the same procedure for ReRAM implementation as in the base model.
A significant source of power consumption in the base architecture is the ADC and DAC components [5]. Not only do they individually consume relatively high power, but there are also many of them, because a ReRAM crossbar is split into parallel structures to implement different bit ranges and reach a desired precision (see Fig. 2(a), which shows a separate ADC per column and DAC per row). This is despite the fact that some sharing may be done for the DACs which supply the rows. With this observation, we study four techniques to change the CNN model so as to eliminate the parallel structures in the resulting ReRAM architecture as much as possible. Our techniques achieve this by reducing the precision used to store the edge weights, which directly translates into decreasing the number of split levels of the crossbar arrays. This is done while guaranteeing almost no loss in the classification accuracy of the altered CNN model compared to the base.
The high-level algorithmic procedure is the same for all the architectures. We always seek a point in the design space where two goals are achieved: (1) a classification accuracy similar to the base architecture, and (2) a minimum number of splits (parallel crossbars) per layer in order to minimize power consumption. To achieve these goals, we use a two-phase algorithm. The first phase changes the network architecture using one of the proposed techniques, while the second one increases the number of splits per layer if necessary. A high-level overview of our algorithm is shown in Fig. 3. We first implement the base architecture using the minimum quantization (i.e., only one split per considered layer). This is the minimum-power state because it has the minimum number of splits; however, it has low accuracy. Next, we keep applying one of the four architectural techniques to improve the accuracy. Once we reach an accuracy similar to the base model, we check whether the power consumption is still lower than the base. If it is, the altered model is returned as output. If not, we start again from the base model, this time with higher precision (i.e., with an additional split per layer to implement the weights). While our four architectural techniques are not specialized for ReRAM implementation, the manner in which the design space is explored to reach the final CNN model is completely specialized: for each candidate structure explored during the search, the power of the ReRAM implementation is evaluated and the one with minimum power is selected. Moreover, the quantizations of the edge weights are changed to match the technology constraints of the ReRAM Multi-Level Cell.
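A compact way to express the two-phase procedure of Fig. 3 is the following Python-style sketch. It is a simplification of the flow described above, not the authors' implementation; `apply_technique`, `retrain`, `accuracy`, and `power` are placeholders for the architectural alteration, retraining, and evaluation steps.

```python
def generate_reram_aware_model(base_model, base_accuracy, base_power,
                               apply_technique, retrain, accuracy, power,
                               bits_per_cell=3, max_splits=4,
                               max_alterations=20, tolerance=0.01):
    """Two-phase search of Fig. 3: phase 1 alters the architecture at a fixed
    (low) weight precision; phase 2 adds a split level only when phase 1 fails."""
    for splits in range(1, max_splits + 1):               # phase 2: precision level
        model = quantize(base_model, splits * bits_per_cell)
        for _ in range(max_alterations):                   # phase 1: grow the network
            if power(model, splits) >= base_power:
                break                                       # no power saving left here
            model = retrain(model)
            if accuracy(model) >= base_accuracy - tolerance:
                return model, splits                        # similar accuracy, lower power
            model = apply_technique(model)                  # e.g., insert a layer (Arch. 1)
    return base_model, None                                 # fall back to the base model

def quantize(model, weight_bits):
    # Placeholder: round the model's weights to the given bit precision.
    return model
```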
Note that each of the four proposed techniques is applicable and effective for a specific type of layer in the network. Architectures 1 and 2 are applicable to fully-connected layers, while Architectures 3 and 4 mainly change the structure of the convolutional layers. With this observation in mind, the appropriate architecture can be selected with respect to the structure of the underlying network. Next, we discuss our four architectural techniques and show how altering the CNN model to each of them impacts the ReRAM implementation.
3.1. Architecture 1

Our first technique is based on adding new layers to the base CNN model to compensate for the loss in accuracy caused by decreasing the precision and hence the number of split levels in the crossbar structure. In the first phase of the algorithm for Architecture 1, we start by storing all the edge weights of the base model in a single-split ReRAM crossbar per layer. This means the precision used to store an edge weight is determined by that of an individual ReRAM device (recommended to be 3 bits in a reliable Multi-Level Cell ReRAM [12]). If this results in a loss in accuracy (which will likely be the case if the precision of the weights in the base model was higher than 3 bits), then the algorithm for Architecture 1 inserts new layers into the base CNN model until the degradation in accuracy is compensated, or until the overall power of the network after a modification exceeds that of the base model. We note that, while each newly-inserted layer adds new weights to the model and results in a new single-split crossbar which adds to the power consumption, there are also power savings due to eliminating the additional parallel splits and their corresponding power-hungry analog peripherals, thanks to the reduction in the bit precision of the edge weights compared to the original network. As our experiments show, the power saving achieved from eliminating parallel splits is much higher than the power overhead incurred by the newly-inserted layers. If there is no power saving compared to the base case, the altered design is discarded and we start the second phase of the algorithm, where a new split level is introduced to increase the precision (e.g., to 6 bits with two splits given the initial 3 bits per split). A new iteration is then started by designing and inserting (this time) fewer layers into the base model using the higher precision. The process repeats until power saving is achieved or until the number of splits reaches that of the base case (in which case there is theoretically no power saving).
Fig. 4 shows the insertion of a new fully-connected layer in the LeNet300-100 network [10]. This neural network is composed of three fully-connected layers (FC1 to FC3). Assuming one split level (3-bit precision of the weights for all layers), a new fully-connected layer (FC3') of dimension 100 × 100 is inserted between FC2 and FC3. The altered model results in negligible loss in classification accuracy compared to the base but achieves power saving, as we show in our experiments. The reason the new layer is inserted between FC2 and FC3 is that FC1 is larger in size (784 × 300) than the others, so inserting the layer after FC1 would create more power overhead. Also, FC3 is much smaller (100 × 10), so inserting the layer after FC3 may not be sufficient to compensate for the accuracy degradation.
It should be noted that although we need to train the network every time we explore a new point in the design space, we are making incremental changes to the network. Therefore, we can utilize techniques such as transfer learning [14,15] or incremental training [16] to accelerate the training process and minimize the overhead of finding a new design point. The incremental retraining time may be reduced further given that the changes to the structure of the CNN are less invasive in our framework, and we can start with the weights of the original model as the starting point (as opposed to random initialization).
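For concreteness, the sketch below shows the kind of structural change Architecture 1 makes, expressed in PyTorch (the paper's experiments use Neupy; this is only an illustration). The optional 100 × 100 layer FC3' is inserted between FC2 and FC3, and the pre-existing layers can be initialized from the base model's weights before retraining.

```python
import torch.nn as nn

class LeNet300_100(nn.Module):
    """LeNet300-100 with an optional FC3' layer, as in Architecture 1."""
    def __init__(self, insert_fc3_prime: bool = False):
        super().__init__()
        self.fc1 = nn.Linear(784, 300)
        self.fc2 = nn.Linear(300, 100)
        # Architecture 1: extra 100x100 layer between FC2 and FC3.
        self.fc3_prime = nn.Linear(100, 100) if insert_fc3_prime else nn.Identity()
        self.fc3 = nn.Linear(100, 10)
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.fc1(x.flatten(1)))
        x = self.act(self.fc2(x))
        x = self.act(self.fc3_prime(x))   # identity in the base model
        return self.fc3(x)

base = LeNet300_100(insert_fc3_prime=False)
altered = LeNet300_100(insert_fc3_prime=True)
# Reuse the base model's weights as the starting point for retraining (Section 3.1).
altered.load_state_dict(base.state_dict(), strict=False)
```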
3.2. Architecture 2

In our second technique, we consider adding neurons to the output of an existing layer in the base model, using the same two-phase algorithm described earlier. More specifically, we start with the accuracy of the base model assuming it is implemented on a single-split crossbar. This means the precision of the base model is reduced to that provided by a single ReRAM device (e.g., 3 bits). Then we gradually add neurons to a target layer, which we recommend to be a fully-connected layer, until the resulting accuracy loss becomes negligible, or the power overhead
Fig. 3. Overview of our proposed procedure.
Fig. 5. Overview of Architecture 3.
Fig. 4. Overview of Architecture 1.
associated with the added neurons becomes more significant than the power saving from eliminating the additional split levels in the corresponding ReRAM crossbars. If there are no power savings, we start the second phase of the algorithm, increase the split level by one, and repeat the process of adding neurons to a fully-connected layer from scratch [17]. Note that adding neurons to a specific layer changes the dimensions of the weight matrices and subsequently the number of ADCs/DACs used. For example, in the LeNet300-100 network, adding 75 neurons to the output of layer FC2 changes the dimension of its weight matrix to 300 × 175, which increases the number of columns in the corresponding ReRAM crossbar. It also increases the number of inputs to FC3, and thus the number of rows of its ReRAM crossbar. However, despite the increase in the dimensions of the matrices, power saving can be achieved due to the elimination of parallel crossbars. This technique is most beneficial when applied to fully-connected layers because the new neurons have connections to all neurons feeding the layer and are more effective in increasing the accuracy when the altered CNN is retrained. Selection of the appropriate fully-connected layer is done by weighing the expected increase in accuracy against the power overhead; e.g., in the example of LeNet300-100, FC2 is the suitable layer for adding neurons because its size is in between the sizes of FC1 and FC3.
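The dimension bookkeeping for Architecture 2 is straightforward and can be checked with a few lines of Python; the numbers below reproduce the LeNet300-100 example (75 neurons added to FC2), and the helper names are illustrative.

```python
def widen_layer(layer_dims, layer_name, extra_neurons):
    """Add output neurons to one fully-connected layer and propagate the change
    to the input dimension of the following layer.

    layer_dims: ordered list of (name, (in_dim, out_dim)) pairs."""
    dims = dict(layer_dims)
    names = [name for name, _ in layer_dims]
    in_dim, out_dim = dims[layer_name]
    dims[layer_name] = (in_dim, out_dim + extra_neurons)   # more crossbar columns
    nxt = names[names.index(layer_name) + 1]
    nxt_in, nxt_out = dims[nxt]
    dims[nxt] = (nxt_in + extra_neurons, nxt_out)          # more crossbar rows
    return [(name, dims[name]) for name in names]

lenet300_100 = [("FC1", (784, 300)), ("FC2", (300, 100)), ("FC3", (100, 10))]
print(widen_layer(lenet300_100, "FC2", 75))
# [('FC1', (784, 300)), ('FC2', (300, 175)), ('FC3', (175, 10))]
```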
Fig. 6. Overview of Architecture 4.
3.3. Architecture 3

This technique specifically targets a convolutional layer and is based on increasing the number of feature maps, which corresponds to increasing the number of kernels and, in effect, extracting more information in the convolution. As shown in Fig. 5, this results in an increase in the number of columns in the ReRAM crossbar. Similar to the previous techniques, we first start with one split level (which results in accuracy degradation but power saving) and then alter the CNN model by gradually increasing the number of kernels (which gradually compensates for the accuracy but increases the power). The iterations and stopping criteria are identical to those described before [17]. Fig. 1 shows the impact of applying this technique to layer C1 in LeNet5: the number of feature maps is increased from 20 to 30. Note that the change made to C1 impacts the other layers in the CNN as well.

3.4. Architecture 4

Our final technique is based on decreasing the size of the kernel in a convolutional layer. As can be seen in Fig. 6, a decrease in the kernel size results in a decrease in the number of rows in the ReRAM crossbar. Again, we start with a single-split crossbar and gradually decrease the kernel size until power saving is achieved with negligible loss in accuracy; otherwise we add another split level and repeat the process. Fig. 1 shows decreasing the kernel size in two convolutional layers
of LeNet5 from 5 × 5 to 3 × 3. Note that changing the kernel size impacts the dimensions of other layers as well [17].

4. Additional power reduction techniques

The proposed architectures and the algorithmic procedure discussed above minimize the power consumption of the network by minimizing the number of parallel ReRAM splits per layer. To achieve this goal, the proposed algorithm changes the high-level architecture of the network so that it works with reduced-precision edge weights, reaching a point where the power is reduced without any significant accuracy degradation. As a result of this quantization, the proposed approach maps the edge weights from a huge space of continuous weights to a space with a limited number of levels, determined by the number of splits per layer and the number of bits per cell for MLC ReRAMs. In this section, we take advantage of these quantized edge weights and propose two techniques for further power saving in the generated model compared to the base model.

4.1. ADC resolution analysis

As we show in our experiments, it is possible to decrease the precision of the edge weights to 3 bits for fully-connected layers and to 6 bits for convolutional layers in the case of 3-bit MLC ReRAMs without any significant accuracy degradation. On the other hand, due to the nature of CNNs, many of the weights of neural networks have near-zero values [18] which get quantized to zero. Therefore, in the generated ReRAM-aware CNN model, a significant portion of the weights are zero. Having so many zero weights provides an opportunity to decrease the resolution of the ADCs/DACs for further power saving without significant accuracy degradation, since full-resolution ADCs/DACs (i.e., 8-bit resolution in the base architecture) are no longer necessary. Reducing the resolution of the ADCs/DACs leads to a significant reduction in power consumption, as the power overhead of ADCs grows exponentially with the resolution [19]. Fig. 7 shows the normalized power consumption of a SAR ADC when the resolution is changed from 5 bits to 10 bits. As the figure shows, increasing the resolution from 5 bits to 7 bits increases the power consumption by more than 50%. Similarly, the power consumption of a 10-bit ADC is about three times higher than that of the 5-bit one. Therefore, in this technique and for each layer, we recalculate the required resolution for the ADCs/DACs based on the non-zero weights, and use simpler ADCs/DACs accordingly. The required resolution per layer may vary because each layer has a different bit precision. Also note that decreasing the resolution of the intermediate data generated in the CNN by reducing the precision of the ADCs/DACs decreases the required memory and reduces the computational power of other components in the network, including the pooling layers and shifters/adders.

Fig. 7. Normalized power consumption of SAR ADC at different resolutions; power is normalized to the 5-bit ADC [19].

4.2. Skipping insignificant neurons in computation

As discussed in Section 2.2, in fully-connected layers each output neuron is connected to all of the input neurons through synaptic weights. Hence, each input neuron plays a role in the generation of all of the output neurons of one layer. On the other hand, once the ReRAM-aware model is generated, a significant portion of the weights are zero due to the reduced edge-weight precision. Having so many zero weights reduces the impact of many input neurons on the generated output neurons. Thus, we propose a scheme to recognize and skip the input neurons which do not contribute significantly to the generation of the output neurons, because the majority of their synaptic weights are quantized to zero.
Recall that in the ReRAM-based implementation of fully-connected layers, each column corresponds to one output neuron and the columns are independent of each other. In addition, the number of rows in the crossbar matches the number of inputs of the layer. Using fewer input neurons, and skipping some of them in the computation of the output neurons, translates to having no input at the beginning of some of the rows in the crossbar, thus reducing the number of DACs. Next, we explain how we recognize unimportant input neurons. Basically, if a particular row in the weight matrix has too many zero weights, we conclude that the corresponding input neuron does not contribute much to generating the output, and thus it can be skipped. Note that the maximum number of zeros in a row equals the number of output neurons of the layer. We find a threshold for each layer such that the rows whose zero-weight counts exceed the threshold are skipped. Obviously, for a smaller threshold more input neurons are skipped, but this may lead to significant accuracy degradation. To determine the threshold, we run an experiment where the threshold is changed from 10% to 40% of the number of output neurons with a step size of 5%, and we measure the corresponding accuracy. We then choose the threshold which gives less than 1% accuracy degradation. By skipping some inputs, we can deactivate the corresponding DACs and thus reduce the power consumption. In addition, considering the cascaded architecture of CNNs, the input neurons of a particular layer are the output neurons of the previous layer. Therefore, besides reducing the number of DACs, the proposed technique allows us to reduce the number of ADCs corresponding to the skipped rows, since we no longer need the analog values of those neurons for the next layer. The key point of the proposed scheme is that skipping insignificant neurons is done by deactivating/eliminating their corresponding analog peripherals, thus reducing the power consumption.
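The row-skipping rule of Section 4.2 reduces, for one fully-connected layer, to counting zeros per row of the quantized weight matrix and comparing against a per-layer threshold expressed as a fraction of the output neurons. The sketch below illustrates this selection; it is a simplified model, not the authors' implementation, and the example weights are artificially sparsified to mimic the effect of quantization.

```python
import numpy as np

def rows_to_skip(quantized_weights, threshold_fraction):
    """Return indices of input neurons (crossbar rows) whose DACs can be
    deactivated because too many of their weights are quantized to zero.

    quantized_weights : (num_inputs, num_outputs) weight matrix of one FC layer
    threshold_fraction: fraction of the output neurons (e.g., 0.10 to 0.40)"""
    num_outputs = quantized_weights.shape[1]
    zeros_per_row = np.sum(quantized_weights == 0, axis=1)
    threshold = threshold_fraction * num_outputs
    return np.where(zeros_per_row > threshold)[0]

# Example: a 300x100 layer with roughly 60% of the weights quantized to zero.
rng = np.random.default_rng(0)
w = np.where(rng.random((300, 100)) < 0.6, 0, rng.integers(1, 8, size=(300, 100)))
skipped = rows_to_skip(w, threshold_fraction=0.25)
print(f"{len(skipped)} of 300 input neurons can be skipped")
```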
4.3. ReRAM-based look-up table Sigmoid activation function

As described in Section 2.2, each output neuron generated by the matrix-vector multiplication goes through an element-wise nonlinear activation function such as a Rectifier or Sigmoid function. The Sigmoid function is a mathematical function with an 'S'-shaped curve, as shown in Fig. 8. In ReRAM-based CNNs, due to the analog nature of ReRAMs in performing matrix-vector multiplications, the Sigmoid function is usually implemented in analog mode, as in [3,5,21,22]. However, an analog implementation of the Sigmoid may increase power and area consumption and reduce the reliability of the network. Therefore, in [2], the authors used a digital Sigmoid function. Nevertheless, due to the relatively high bit precision of the output neurons, the analog Sigmoid function is mostly used. The reduced bit precision of the weights in the generated ReRAM-aware CNN model opens up an opportunity to move the Sigmoid activation function from analog to digital mode. As the results in Section 6 show, in our work the output neurons have relatively small bit precision (i.e., 4 bits and 6 bits for the output neurons of fully-connected layers and convolutional layers, respectively).
Fig. 8. Sigmoid function curve [20].

Motivated by this limited bit precision of the output neurons, we propose to use ReRAM-based look-up tables to implement the Sigmoid function. The small area and power consumption of ReRAM crossbars, as well as their low read latency, make them a promising way to implement the Sigmoid function. Here, we implement the Sigmoid function as a look-up table and investigate its power and area consumption compared to the analog Sigmoid function. To do so, the ReRAM crossbar is organized as follows: the number of columns in the crossbar equals the bit precision of the layer, and the number of rows equals the number of distinct values a neuron can take in that particular layer (i.e., two to the power of the number of bits for that layer). For instance, in the case of fully-connected layers, for which the output neurons have 4-bit precision, the size of the ReRAM crossbar is 16 × 4. For each of those 16 distinct inputs, its corresponding value after applying the Sigmoid function is pre-stored in the ReRAM crossbar with 4-bit precision. Note that the look-up table based Sigmoid activation function works in digital mode, and each ReRAM cell used in the crossbar stores only one bit of information. Then, similar to the analog mode, each output neuron passes through the Sigmoid activation function. To perform the Sigmoid transformation, a decoder is used before the ReRAM crossbar. Depending on the binary value of the input neuron, one of the rows is activated. The value read at the end of each column (i.e., 1 or 0, due to using SLC ReRAMs) is the corresponding bit of the Sigmoid transformation for the input. Note that since we are using SLC ReRAMs and only one row is activated at a time, we do not need ADCs at the end of each column. Instead, sense amplifiers are used, which have less overhead than ADCs.
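The look-up table construction above is easy to prototype in software. The sketch below builds the 16 × 4 table for 4-bit neurons (one row per input code, one SLC column per output bit) and performs a lookup by activating a row, mirroring the decoder plus crossbar read described in the text; it is an illustrative model only, and the input scaling is an assumption.

```python
import numpy as np

def build_sigmoid_lut(bits):
    """Pre-store the quantized Sigmoid of every possible bits-wide input code.

    Returns a (2**bits, bits) binary array: one row per input code,
    one column (SLC cell) per output bit (MSB first)."""
    levels = 2 ** bits
    codes = np.arange(levels)
    # Map input codes to a symmetric range before applying the Sigmoid
    # (illustrative scaling), then re-quantize the output to the same precision.
    x = (codes - levels / 2) / (levels / 4)
    y = 1.0 / (1.0 + np.exp(-x))
    y_codes = np.clip(np.round(y * (levels - 1)), 0, levels - 1).astype(int)
    return (y_codes[:, None] >> np.arange(bits)[::-1]) & 1

def lut_sigmoid(lut, input_code):
    """Decoder activates one row; the column read-out is the output bit pattern."""
    return lut[input_code]

lut4 = build_sigmoid_lut(4)          # 16 x 4 table for FC-layer neurons
print(lut_sigmoid(lut4, 0b1010))     # 4 output bits for input code 10
```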
5. Related works

There have been several works on network quantization of CNNs and on their efficient ReRAM-based implementation. First, from a quantization perspective, numerous works exist which can be divided into two main groups: (1) reduced bit precision CNNs with fixed-point [1,23–25] or ternary [26] edge weights; (2) binary CNNs with binary weights [27–30]. In these works, the main goal is maintaining the post-quantization accuracy by modifying the training process. Evaluation of hardware metrics such as power consumption is typically missing. Moreover, the quantization does not target a power-efficient ReRAM-based implementation.
Second, there exist several works on designing ReRAM-based accelerators for training or inference, including [2,3,31]. In these works, different accelerators are designed in which ReRAM crossbars are used for both storage and computation. As an example, in [2], the authors designed a pipelined in-situ architecture for CNN inference in which each layer has its own set of dedicated ReRAM crossbars for weight storage and the matrix multiplication operation, as well as eDRAM buffers for aggregating data between pipeline stages. However, the state of the art in this area is based on implementation of an already fixed CNN model with fixed precision. Our work bridges the gap between model generation and ReRAM implementation by incorporating a co-design approach in order to achieve a more power-efficient implementation. The final quantization is implicitly determined as we explore the design space of changing the CNN model and its ReRAM implementation to achieve power efficiency.
We note that our framework cannot optimize binary CNNs (BNNs) [27–30] any further, because our proposed architectural techniques always add to the original model in exchange for reducing the bit precision and thus the number of required ReRAM crossbars, while a BNN already has its weights at the minimum precision. In other words, a BNN is already set to be implemented with the minimum number of ReRAM crossbar splits. We note, though, that the applicability of BNNs is limited and many optimized CNNs/DNNs today use non-binary weights.

6. Simulation results

In our experiments we used well-known CNN models including LeNet300-100, LeNet5, CIFAR10, and CaffeNet. These were implemented and trained with Neupy [32]. The floating-point weights after training were then quantized at different precision levels to create different variations of the base model in our experiments. Each CNN with its quantized weights was then imported into Matlab to verify the classification accuracy post-quantization. In each experiment and for each CNN, we started from a 32-bit quantization of the weights and verified that it resulted in almost no degradation in accuracy compared to floating-point weights, as also verified in the literature such as [1]. We then gradually decreased the precision to fewer bits to get a power-efficient ReRAM implementation. This was done while ensuring the degradation in accuracy remains negligible with the decrease in precision, compared to the accuracy when using floating-point weights.
Also note that the quantization error from an ADC/DAC is less than Vref/(2^N × 2), where Vref is the full-scale input voltage range and N is the resolution of the ADC/DAC [33]. For a 32 nm ADC/DAC with Vref = 1 V, the quantization error is less than 0.002 and 0.03 for 8-bit and 4-bit ADC/DAC, respectively, which is relatively small compared to the (at most) 2% accuracy degradation reported in our experiments. Due to this relatively small quantization error for the ADC/DAC, as well as the inherent error resiliency of CNNs, state-of-the-art ReRAM-based implementations do not consider the loss due to the ADC and DAC.
We experimented with different ReRAM architectures; depending on the architecture of the network in use, we compared several of the four proposed architectural techniques with different variations of the base case as introduced in the previous sections. Besides classification accuracy, we also compared power consumption in our experiments. For the power comparison we considered components including the ADCs, DACs, and the ReRAM arrays. Assuming 3-bit storage for each ReRAM device using Multi-Level Cells, each crossbar was implemented and simulated with the NVSIM tool [34], which allowed measurement of the power consumption of ReRAM crossbars. Table 2, for example, reports the power and area of a 512 × 512 ReRAM array implemented with 3-bit devices. The table also reports the power and area of an 8-bit ADC and DAC, which we used from [2]. 8-bit ADCs/DACs are used in the base architecture to maintain the accuracy of the network after conversion. We note that the ADCs and DACs were major sources of power consumption given that many of them need to be used per crossbar [5,6]. The power consumption of 6-bit and 4-bit ADCs/DACs is also reported, since they are used in the generated CNN models for which the conversion resolution is decreased due to the weight quantization. The reported power consumptions are adopted from the analysis in [19].
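The quantization-error bound quoted above can be checked directly from the formula with Vref = 1 V; the two numbers in the text follow immediately:

```python
Vref = 1.0                          # full-scale input voltage range (V)
for N in (8, 4):
    print(N, Vref / (2 ** N * 2))   # -> 0.001953125 (~0.002) and 0.03125 (~0.03)
```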
Table 1
Information on experimented neural networks.

          LeNet5               LeNet300-100    CIFAR10
CS1       5 × 5 × 1 × 20       –               5 × 5 × 3 × 32
CS2       5 × 5 × 20 × 50      –               5 × 5 × 32 × 32
CS3       –                    –               5 × 5 × 32 × 64
FC1       (4 × 4 × 50) × 500   784 × 300       (4 × 4 × 64) × 10
FC2       500 × 10             300 × 100       –
FC3       –                    100 × 10        –
#Edges    2293K                266K            12.3M
#Params   431K                 266K            89.4K
Table 2
Reference area and power values of some major components used in various ReRAM architectures.

Component                  Area (mm2)    Power (mW)
A single 8-bit ADC         0.0012        2.0
A single 6-bit ADC         –             1.58
A single 4-bit ADC         –             1.11
A single 8-bit DAC         0.0012        1.9
512 × 512 ReRAM array      0.0060        0.4

Table 4
Comparison of power and accuracy of various architectures and quantizations in LeNet300-100; conversion resolution = 8 bit.

                  Quantization    #Splits    #DACs    #ADCs    Accuracy    Pnorm
Base              12              4          4736     5680     97.41%      1.00
Base              9               3          3552     4260     93.65%      0.79
Base              6               2          2368     2840     92.80%      0.58
Architecture 1    3               1          1284     1620     97.00%      0.42
Architecture 2    3               1          1259     1750     96.80%      0.41

6.1. Assessment of architectures 1 and 2

Here, we used the LeNet300-100 network, which was explained in Section 2.2. It was trained with the 60K images of the MNIST dataset [10] designated for training. The 10K images of the dataset designated for testing were then used to measure the classification accuracy. Table 3 reports the different CNN models and architectures compared in this experiment. For each layer, we report the size of its weight matrix and the size of the two ReRAM arrays for storing the negative and positive weights. Recall that these two arrays are equal in size, as explained in Section 2.1. For FC1, the arrays had an input dimension higher than 512, so each had to be broken into two arrays, as discussed in Section 2.3. For each architecture, we also report the total number of ADCs and DACs for each layer in a single split level. We assumed the DACs are shared across the rows of the positive and negative weight matrices, as explained in Section 2.1. We also report information about the ReRAM architectures generated by techniques 1 and 2. For Architecture 1, our technique inserted a new fully-connected layer FC3' of dimension 100 × 100. For Architecture 2, it inserted 75 additional neurons into layer FC2. The grey rows corresponding to Architectures 1 and 2 in the table indicate the layers that ended up being different from the base case due to applying techniques 1 and 2. Note that in Architecture 2, adding neurons to FC2 also impacts FC3.
Table 4 shows the comparison of power and accuracy of these architectures. As mentioned before, for the base case we experiment with different variations by altering the precision quantization to get implementations which trade off accuracy and power. For example, 12-bit quantization requires 4 split levels, which gives the worst power but the best accuracy. The total number of ADCs/DACs used in the base architecture and in the proposed more power-efficient architectures can be compared using the data reported in Tables 3, 5 and 7 together with the bit precision used for the new CNN model. As an example, for the base architecture of LeNet300-100, 1184 DACs and 1420 ADCs are used per split. To implement the network with 12-bit precision, four splits are used and the total numbers of DACs and ADCs are 4736 and 5680, respectively. On the other hand, for Architecture 1, we need 1284 DACs and 1620 ADCs per split. Since the network achieves acceptable accuracy using this architecture with only a single crossbar split, the total numbers of DACs and ADCs remain 1284 and 1620, respectively, which is significantly less than the base architecture.
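The ADC/DAC totals in Table 4 are simply the per-split counts of Table 3 multiplied by the number of splits; a sketch of this bookkeeping, with an illustrative power model based on the per-component values of Table 2, is shown below. The exact power model behind Pnorm is not spelled out in the text, so the estimate here is only indicative.

```python
ADC_MW, DAC_MW, ARRAY_MW = 2.0, 1.9, 0.4   # per-component power in mW (Table 2)

def totals(dacs_per_split, adcs_per_split, splits):
    """Total converter counts for one configuration, as listed in Table 4."""
    return dacs_per_split * splits, adcs_per_split * splits

def peripheral_power_mw(dacs, adcs, crossbars):
    """Illustrative estimate dominated by the analog peripherals
    (not the paper's exact power model)."""
    return dacs * DAC_MW + adcs * ADC_MW + crossbars * ARRAY_MW

print(totals(1184, 1420, splits=4))   # base, 12-bit: (4736, 5680)
print(totals(1284, 1620, splits=1))   # Architecture 1, 3-bit: (1284, 1620)
```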
Table 5 Information on the CNN models and ReRAM architectures of LeNet5 for techniques 3 & 4.
Table 3 Information on the CNN models and ReRAM architectures of LeNet300-100 for techniques 1 & 2.
Table 6
Comparison of power and accuracy of various architectures with different quantizations in LeNet5; conversion resolution = 8 bit.

                                 Quantization (CS1, other)    #Splits (CS1, other)    #DACs    #ADCs    Accuracy    Pnorm
Base                             (12,12)                      (4,4)                   7300     16240    98.10%      1.00
Base                             (9,9)                        (3,3)                   5475     12180    97.30%      0.77
Architecture 3                   (6,3)                        (2,1)                   2500     7640     97.10%      0.56
Architecture 4                   (6,3)                        (2,1)                   2498     6100     96.80%      0.48
Architectures 3 and 4 combined   (6,3)                        (2,1)                   1972     5740     96.50%      0.42
Meanwhile, the accuracy degradation of Architectures 1 and 2 is negligible compared to the base model with 12-bit precision (i.e., less than 1%). Note that we verified the 12-bit base architecture has almost the same accuracy as the 32-bit one. The 3-bit implementation of the base architecture is not reported because it resulted in unacceptable degradation in accuracy (79% accuracy). To report the power saving of the proposed architectures, we have normalized the power numbers of each network to its corresponding 12-bit base architecture. More specifically, the power consumption of the networks is calculated based on the number of ADCs and DACs per split (given in Table 3) and the number of split levels (given in Table 4). Note that here we use 8-bit ADCs/DACs to perform the conversion in both the base architecture and the proposed CNN models. As can be seen, Architectures 1 and 2, which are based on 3-bit precision of the weights (but a ReRAM-aware design of the CNN model), achieve similar accuracy (within 1%) to the 12-bit base architecture. However, their power consumption is only 0.42 and 0.41 of the 12-bit base case, respectively. The 6-bit and 9-bit base architectures have worse accuracy and worse power than Architectures 1 and 2. These results show the effectiveness of techniques 1 and 2 in achieving a significantly better ReRAM-aware CNN model.

6.2. Assessment of architectures 3 and 4

Table 7
Information on the CNN models and ReRAM architectures of CIFAR10 for techniques 3 & 4.

For this experiment we used the LeNet5 and CIFAR10 networks, which include convolutional layers. For LeNet5, we used the same MNIST dataset as explained for LeNet300-100 and the same setup for testing and training. Table 5 shows the different CNN models and architectures used in this experiment. The first two layers are convolutional layers (CS1 and CS2) followed by two fully-connected layers (FC1 and FC2). For each layer, we report metrics similar to those of the previous experiment. The only additional information is column 4, which reports the number of channels per layer. For CS1 and CS2, the size of the input in the weight matrix is the same as the kernel size (i.e., 5 × 5 in the base architecture and in Architecture 3, but 3 × 3 in Architecture 4). Architecture 3 is based on changing the number of kernels in CS1 (from 20 to 30) and CS2 (from 50 to 75). These configurations were generated by techniques 3 and 4. Again, the grey rows show the layers in Architectures 3 and 4 which differ from the base architecture.
Table 6 compares the power and accuracy of these architectures. For the base case, we experiment with different precision quantizations as in the previous experiments. Here, we often had to use a higher precision for the convolutional layer CS1 than for the rest of the layers to ensure accuracy levels similar to the 32-bit quantization case (in all layers). For the base case, a further decrease in quantization (e.g., to (6,6)) resulted in a significant drop in accuracy (77%), so such cases are not reported because they were not acceptable. Again, both Architectures 3 and 4 result in a negligible drop in accuracy compared to the (12,12) base architecture but yield significant savings in normalized power (0.56 and 0.48, respectively). Compared to the (9,9) base case, the power savings remain significant, i.e., 27% and 38% for Architectures 3 and 4, respectively.
Tables 7 and 8 show the results for CIFAR10. The experimental setup and presentation of results are identical to before. Power savings are even higher for Architecture 4. As can be seen, we have not used Architectures 1 and 2 for CIFAR10. This is because the structure of CIFAR10 consists of three relatively large convolutional layers followed by a single fully-connected layer. The number of output neurons in the last fully-connected layer cannot be changed, because each neuron in the last layer corresponds to one output class (i.e., 10 output classes in CIFAR10). Hence, the number of neurons is fixed and determined by the dataset, and as a result Architecture 2 cannot be used. Also, Architecture 1 is not effective because inserting a layer between CS3 and FC1 changes the input dimension of the layer to 1024, which is relatively large, so inserting it would create significant power overhead (similar to the discussion in Section 3.1 for LeNet300-100). Hence, we have used Architectures 3 and 4 for this network to alter its convolutional layers. As the results show, although the four proposed architectures increase the number of weights in the network, the power consumption is significantly reduced due to reducing the precision of the edge weights and consequently the number of splits and their corresponding power-hungry analog peripherals.
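As a quick check of the percentages quoted above, the savings relative to the (9,9) base case follow directly from the normalized powers in Table 6:

```python
p_base_99 = 0.77                       # Pnorm of the (9,9) base case (Table 6)
for name, p in [("Architecture 3", 0.56), ("Architecture 4", 0.48)]:
    print(name, f"{1 - p / p_base_99:.0%} saving vs. the (9,9) base")
# -> roughly 27% and 38%, as stated above
```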
Table 8
Comparison of power and accuracy of various architectures with different quantizations in CIFAR10; conversion resolution = 8 bit.

                  Quantization (CS1, other)    #Splits (CS1, other)    #DACs    #ADCs    Accuracy    Pnorm
Base              (12,12)                      (4,4)                   10796    25504    81.10%      1.00
Base              (9,9)                        (3,3)                   8097     19128    79.20%      0.77
Architecture 3    (6,3)                        (2,1)                   3574     11368    80.30%      0.53
Architecture 4    (6,3)                        (2,1)                   3446     6628     79.90%      0.35
6.3. Assessing the combination of the proposed architectures
While the proposed architectures can be independently applied to different layers of the network to compensate for the accuracy loss, it is also possible to combine them when altering the CNN model. As an example, we can apply Architectures 3 and 4 simultaneously to a convolutional layer so that both the kernel size and the number of kernels are changed. In this case, our proposed algorithmic procedure needs to explore a much larger design space, where each model can be altered by more than one of the architectures. Hence, more design choices are available, which can further increase the power efficiency. Note that the general flow of the proposed procedure shown in Fig. 3 remains the same.
For this experiment, we used LeNet5 and applied both Architectures 3 and 4 to its convolutional layers, CS1 and CS2. The network changes are summarized in Table 5, and the power consumption and accuracy are reported in Table 6. As the results show, combining two architectures provides more power saving than each individual architecture, and the power consumption is further reduced to 0.42 of the base case, compared to 0.56 and 0.48 of the base case for Architectures 3 and 4, respectively. This is while the accuracy degradation still remains negligible (i.e., less than 2% compared to the base model).

Table 9
Number of CNNs involved in design space exploration for each architecture.

Network         Architecture used    Number of CNNs involved
LeNet300-100    Architecture 1       3
LeNet300-100    Architecture 2       2
LeNet5          Architecture 3       6
LeNet5          Architecture 4       4
CIFAR10         Architecture 3       7
CIFAR10         Architecture 4       4
Table 10 Information on CaffeNet.
6.4. Runtime overhead of the proposed framework

As stated earlier, exploration of the design space for CNN model generation requires starting from the minimum quantization precision determined by the Multi-Level Cell technology (i.e., 3 bits in this work) and gradually expanding the network architecture until the desired accuracy is reached while retaining a power saving. To generate the appropriate model, several CNN models must be created, trained, and tested for power consumption and accuracy, which incurs overhead, particularly in the training phase. Table 9 reports the number of CNNs which were trained for the different architectures in each network. As an example, for Architecture 3 of LeNet5, the training time can increase by up to 6x, given that 6 networks need to be created and tested while exploring the design space. While it is possible to significantly reduce the training time using incremental training, this overhead remains the main area of future improvement of our work. This includes taking advantage of knowledge about the structure of a given CNN (which may be well known to the designer) to discard ineffective architectures a priori and reduce the search overhead.
6.5. Assessment on a larger DNN

In this section, we evaluate the efficiency of our proposed approach on a larger DNN. We used CaffeNet, which is Caffe's replication of AlexNet. It has five convolutional layers, of which CS1, CS2, and CS5 are followed by subsampling layers. The convolutional layers are followed by three FC layers. More details on the network structure are summarized in Table 10. CaffeNet was trained on the ImageNet 2012 dataset. In this experiment we applied Architecture 2 to the first fully-connected layer of the network. To alter the base model, we increased the number of output neurons in FC1 from 3154 to 4096. This also resulted in a change to the number of input neurons of the next layer. Table 11 summarizes the accuracy and power consumption of the base architecture and Architecture 2. The power consumption of Architecture 2 is 0.2 of the base case, while the accuracy of the network is slightly increased in comparison to the base model due to having more neurons in that layer.

Table 11
Power and accuracy of various architectures of CaffeNet.

                  Quantization    Accuracy    Pnorm
Base              32              79.1%       1.00
Architecture 2    6               79.7%       0.20
Table 12
Resolution of ADC/DAC of various architectures in LeNet300-100.

                                ADC Resolution    Accuracy    Pnorm (Before Skipping Neurons)    Pnorm (After Skipping Neurons)
LeNet300-100 Architecture 1     4                 96.7%       0.23                               0.16
LeNet300-100 Architecture 2     4                 96.5%       0.24                               0.17
6.6. ADC resolution analysis

Here, we investigate the impact of ADC resolution on the accuracy of the network once the ReRAM-aware model is generated. Starting from 8-bit ADCs (and, correspondingly, DACs), we gradually decrease the resolution and measure the accuracy. Recall that the ADC/DAC resolution may differ per layer, since the weight precision differs across layers. Because the first convolutional layer of the LeNet5 and CIFAR10 networks has higher-precision weights, we expect it to require higher-resolution ADCs as well. Tables 12 and 13 summarize the ADC/DAC resolution, normalized power, and accuracy of the networks used in this work. As the results show, the ADC resolution can be decreased to 6 bits for the first convolutional layer in LeNet5 and CIFAR10, and further reduced to 4 bits for the other convolutional and fully-connected layers, given their 3-bit weights. Because the power consumption of ADCs/DACs drops exponentially as the resolution is decreased, the power consumption of Architectures 1 and 2 for LeNet300-100 is reduced to 0.23 and 0.24 of the base case, while the accuracy degradation due to the resolution reduction is almost negligible (within 1%). Similarly, Architectures 3 and 4 save additional power with reduced ADC resolution, as summarized in Table 13.

Table 13
Resolution of ADC/DAC of various architectures in LeNet5 and CIFAR10.

                            ADC Resolution (CS1, other)   Accuracy   Pnorm (Before Skipping Neurons)   Pnorm (After Skipping Neurons)
LeNet5 - Architecture 3     (6,4)                         97.01%     0.32                              0.29
LeNet5 - Architecture 4     (6,4)                         96.68%     0.26                              0.21
CIFAR10 - Architecture 3    (6,4)                         79.91%     0.30                              0.29
CIFAR10 - Architecture 4    (6,4)                         79.8%      0.20                              0.184
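To make the exponential dependence concrete, the short snippet below evaluates a simple first-order model in which ADC power grows roughly with 2 raised to the resolution; this is a textbook-style approximation used here for illustration, not the exact ADC power model of our evaluation.

```python
# First-order approximation: P_ADC is roughly proportional to (2 ** bits) at a fixed sampling rate.
# Illustrative only; the evaluation in this work relies on published ADC/DAC power figures.

def relative_adc_power(bits, reference_bits=8):
    """Power of a `bits`-resolution ADC normalized to a reference-resolution ADC."""
    return 2 ** bits / 2 ** reference_bits

for b in (8, 6, 4):
    print(f"{b}-bit ADC ~ {relative_adc_power(b):.3f} x the 8-bit ADC power")
# Prints roughly 1.000x, 0.250x, and 0.062x, which is why dropping from 8 bits to 4-6 bits
# translates into the large savings reported in Tables 12 and 13.
```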
6.7. Skipping insignificant neurons

In this part, we apply the technique of Section 4.2 to fully-connected layers to skip insignificant input neurons, which in turn decreases the number of DACs and ADCs used. Recall that in a fully-connected layer, each input neuron (represented as a row in the ReRAM crossbar) is connected to all of the outputs through one synaptic weight (i.e., one ReRAM cell in the crossbar), and each output current is formed by summing the currents passing through the ReRAM cells of its column. Therefore, a row with too many zero weights does not play a significant role in the calculation of the output neurons and, consequently, the accuracy. To recognize unimportant rows, a threshold on the number of zeros per row is calculated for each layer of the network, and rows whose number of zero weights exceeds the threshold are skipped in the computation (i.e., their corresponding analog peripherals are deactivated and a zero voltage is applied). Table 14 summarizes the threshold of zero weights and the number of skipped rows per layer for the LeNet300-100, LeNet5, and CIFAR10 networks. Note that this technique is only applicable to fully-connected layers. The normalized power consumption before and after applying this scheme is reported in Table 12. The results show that skipping unimportant neurons reduces the power consumption from 0.23 to 0.16 and from 0.24 to 0.17 for LeNet300-100 Architecture 1 and Architecture 2, respectively. Furthermore, according to Table 13, the power reduction for LeNet5 is about 5% for Architecture 3 and 3% for Architecture 4. In the case of CIFAR10, however, the power reduction is only around 1%, because CIFAR10 has only one fully-connected layer, so the proposed technique contributes little to power reduction.
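For illustration, the sketch below shows one way the row-selection step could be expressed for a single fully-connected weight matrix. The function name and the toy matrix are our own assumptions; the per-layer thresholds actually used are those listed in Table 14.

```python
import numpy as np

def rows_to_skip(weights, zero_fraction_threshold):
    """Indices of input rows whose fraction of zero weights exceeds the per-layer threshold.

    weights: (num_inputs, num_outputs) quantized FC weight matrix, one ReRAM cell per entry.
    zero_fraction_threshold: e.g., 0.35 for FC1 of LeNet300-100 Architecture 1 (Table 14).
    """
    zero_fraction = np.mean(weights == 0, axis=1)           # per-row share of zero cells
    return np.flatnonzero(zero_fraction > zero_fraction_threshold)

# Toy example shaped like FC1 of LeNet300-100 (784 x 300), with row-dependent sparsity.
rng = np.random.default_rng(0)
keep_prob = rng.random(784)[:, None]                        # each input row has its own density
w = rng.integers(-3, 4, size=(784, 300)) * (rng.random((784, 300)) < keep_prob)
skipped = rows_to_skip(w, zero_fraction_threshold=0.35)
print(f"{len(skipped)} of {w.shape[0]} rows would have their DAC and row driver deactivated")
```

The skipped rows correspond to the deactivated analog peripherals whose power savings are reported in Tables 12 and 13.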
Table 14
Information on the threshold value for zero weights and the number of insignificant input neurons of various architectures in different CNNs.

Layer                            Weight Matrix        Threshold of #Zero Weights   #Eliminated Rows
LeNet300-100 - Architecture 1
  FC1                            784 × 300            35%                          346/784
  FC2                            300 × 100            20%                          46/300
  FC3′                           100 × 100            40%                          38/100
  FC3                            100 × 10             30%                          23/100
LeNet300-100 - Architecture 2
  FC1                            784 × 300            35%                          303/784
  FC2                            300 × 175            35%                          66/300
  FC3                            175 × 10             30%                          81/175
LeNet5 - Architecture 3
  FC1                            (4 × 4 × 75) × 500   30%                          417/1200
  FC2                            500 × 10             30%                          121/500
LeNet5 - Architecture 4
  FC1                            (6 × 6 × 50) × 500   25%                          570/1800
  FC2                            500 × 10             30%                          130/500
CIFAR10 - Architecture 3
  FC1                            1024 × 10            30%                          398/1024
CIFAR10 - Architecture 4
  FC1                            2304 × 10            30%                          598/2304
6.8. ReRAM-based Sigmoid activation function

Finally, in this part we implement the Sigmoid activation function using ReRAM crossbars. The details of the proposed look-up table are discussed in Section 4.3. We use a 6-bit Sigmoid function (i.e., a 64 × 6 ReRAM array) for the first convolutional layer and a 4-bit Sigmoid function (i.e., a 16 × 4 ReRAM array) for the remaining convolutional and fully-connected layers. To measure the area and power consumption of the ReRAM-based look-up tables, we use the NVSIM simulator and also account for the power consumption of peripherals, including decoders and sense amplifiers. The results are summarized in Table 15 for a 32 nm technology and for different crossbar sizes. For a fair comparison with a digital implementation of the Sigmoid function, we also report the power consumption of an 8-bit digital Sigmoid function at 32 nm technology from [2]. As the results show, the ReRAM-based look-up table implementation of the Sigmoid function saves about 30% in power and 33% in area relative to the digital design.

Table 15
Power and area consumption of the ReRAM-based Sigmoid function.

Component                                        Power (mW)   Area (mm²)
4-bit Sigmoid function (16 × 4 ReRAM array)      0.1          0.00009
6-bit Sigmoid function (64 × 6 ReRAM array)      0.14         0.00016
8-bit Sigmoid function (256 × 8 ReRAM array)     0.1812       0.0002
Digital Sigmoid function in [2]                  0.26         0.0003
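To clarify what such a look-up table stores, the snippet below generates the contents of a hypothetical 6-bit Sigmoid table (64 entries with 6-bit output codes) of the kind that would be programmed into a 64 × 6 ReRAM array. The input range and output encoding are illustrative assumptions, not the exact design of Section 4.3.

```python
import numpy as np

def sigmoid_lut(bits=6, x_min=-8.0, x_max=8.0):
    """Build a (2**bits)-entry Sigmoid look-up table with bits-wide quantized outputs."""
    n = 2 ** bits                                  # 64 rows for the 6-bit table
    x = np.linspace(x_min, x_max, n)               # quantized input levels (assumed range)
    y = 1.0 / (1.0 + np.exp(-x))                   # ideal Sigmoid values in (0, 1)
    return np.round(y * (n - 1)).astype(int)       # one bits-wide output code per row

table6 = sigmoid_lut(bits=6)     # 64 x 6-bit table for the first convolutional layer
table4 = sigmoid_lut(bits=4)     # 16 x 4-bit table for the remaining layers
print(table6[:4], table6[-4:])   # codes saturate toward 0 and 63 near the range extremes
```

Each output code would be written once into a row of the crossbar, after which evaluating the activation reduces to decoding the input and reading that row, which is why the decoder and sense-amplifier power is included in Table 15.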
7. Conclusions
In this work, we proposed a co-design approach for power-efficient implementation of CNNs on ReRAM crossbars. We first introduced four architectural techniques that alter the high-level network model of a base CNN to minimize the power consumption of the analog peripherals when the model is implemented using ReRAM crossbars. Given that DACs and ADCs are dominant sources of power consumption, we aimed to minimize their usage by eliminating parallel ReRAM crossbars as much as possible, which requires reducing the precision quantization of the CNN edge weights. To compensate for the resulting degradation in accuracy, we proposed various techniques to alter the network model; these techniques are applicable to both convolutional and fully-connected layers. An algorithmic procedure was proposed to change the network architecture and trade off power against accuracy. We then proposed two additional techniques to further reduce the power consumption of the generated ReRAM-aware network. The proposed techniques mainly originate from eliminating parallel crossbars and reducing the edge weight precision. Evaluating our techniques on standard CNN models showed a significant reduction in power consumption compared to the base models.
References

[1] S. Hashemi, N. Anthony, H. Tann, R.I. Bahar, S. Reda, Understanding the impact of precision quantization on the accuracy and energy of neural networks, in: Design, Automation, and Test in Europe Conference, 2017, pp. 1474–1479.
[2] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. Strachan, M. Hu, R. Williams, V. Srikumar, ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars, in: International Symposium on Computer Architecture, 2016, pp. 14–26.
[3] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, Y. Xie, PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory, in: International Symposium on Computer Architecture, 2016, pp. 27–39.
[4] T. Tang, L. Xia, B. Li, Y. Wang, H. Yang, Binary convolutional neural network on RRAM, in: Asia and South Pacific Design Automation Conference, 2017, pp. 782–787.
[5] B. Li, L. Xia, P. Gu, Y. Wang, H. Yang, Merging the interface: power, area and accuracy co-optimization for RRAM crossbar-based mixed-signal computing system, in: Design Automation Conference, 2015, pp. 13:1–13:6.
[6] L. Xia, T. Tang, W. Huangfu, M. Cheng, X. Yin, B. Li, Y. Wang, H. Yang, Switched by input: power efficient structure for RRAM-based convolutional neural network, in: Design Automation Conference, 2016, pp. 125:1–125:6.
[7] Y. Zha, J. Li, Reconfigurable in-memory computing with resistive memory crossbar, in: International Conference on Computer Aided Design, 2016, pp. 120–127.
[8] A. Zhou, A. Yao, Y. Guo, L. Xu, Y. Chen, Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights, 2017, arXiv:1702.03044.
[9] J. Wu, C. Leng, Y. Wang, Q. Hu, J. Cheng, Quantized convolutional neural networks for mobile devices, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4820–4828.
[10] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[11] C. Xu, X. Dong, N. Jouppi, Y. Xie, Design implications of memristor-based RRAM cross-point structures, in: Design, Automation, and Test in Europe Conference, 2011, pp. 734–739.
[12] M. Zangeneh, A. Joshi, Design and optimization of nonvolatile multibit 1T1R resistive RAM, IEEE Trans. Very Large Scale Integr. Syst. 22 (8) (2014) 1815–1828.
[13] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, arXiv:1409.1556.
[14] H. Shin, H.R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D.J. Mollura, R.M. Summers, Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning, IEEE Trans. Med. Imaging 35 (5) (2016) 1285–1298.
[15] B. Harris, M. Moghaddam, D. Kang, I. Bae, E. Kim, H. Min, H. Chao, S. Kim, B. Egger, S. Ha, K. Chio, Architectures and algorithms for user customization of CNNs, in: Asia and South Pacific Design Automation Conference, 2018, pp. 540–547.
[16] H. Tann, S. Hashemi, R.I. Bahar, S. Reda, Runtime configurable deep neural networks for energy-accuracy trade-off, in: International Conference on Hardware/Software Codesign and System Synthesis, 2016, pp. 34:1–34:10.
[17] M. Hemmat, A. Davoodi, Power-efficient ReRAM-aware CNN model generation, in: International Conference on Computer Design, 2018, pp. 156–162.
[18] X. Chen, J. Jiang, J. Zhu, C. Tsui, A high-throughput and energy-efficient RRAM-based convolutional neural network using data encoding and dynamic quantization, in: Asia and South Pacific Design Automation Conference, 2018, pp. 123–128.
[19] M. Yip, A. Chandrakasan, A resolution-reconfigurable 5-to-10-bit 0.4-to-1V power scalable SAR ADC for sensor applications, IEEE J. Solid-State Circuits 48 (6) (2013) 1453–1464.
[20] B. Heruseto, E. Prasetyo, H. Afandi, M. Paindavoine, Embedded analog CMOS neural network inside high speed camera, in: Asia Symposium on Quality Electronic Design, 2009, pp. 144–147.
[21] M. Cheng, L. Xia, Z. Zhu, Y. Cai, Y. Xie, Y. Wang, H. Yang, TIME: a training-in-memory architecture for memristor-based deep neural networks, in: Asia and South Pacific Design Automation Conference, 2017, pp. 26:1–26:6.
[22] L. Xia, B. Li, T. Tang, P. Gu, P. Chen, S. Yu, Y. Cao, Y. Wang, Y. Xie, H. Yang, MNSIM: simulation platform for memristor-based neuromorphic computing system, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 37 (5) (2018) 1009–1022.
[23] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan, Deep learning with limited numerical precision, in: International Conference on Machine Learning, 2015, pp. 1737–1746.
[24] D. Lin, S. Talathi, S. Annapureddy, Fixed point quantization of deep convolutional networks, in: International Conference on Machine Learning, 2016, pp. 2849–2858.
[25] M. Samragh, M. Imani, F. Koushanfar, T. Rosing, LookNN: neural network with no multiplication, in: Design, Automation, and Test in Europe Conference, 2017, pp. 1779–1784.
[26] K. Hwang, W. Sun, Fixed-point feedforward deep neural network design using weights +1, 0, and -1, in: IEEE Workshop on Signal Processing Systems, 2014, pp. 1–6.
[27] M. Rastegari, V. Ordonez, J. Redmon, A. Farhadi, XNOR-Net: ImageNet classification using binary convolutional neural networks, in: European Conference on Computer Vision, 2016, pp. 525–542.
[28] M. Courbariaux, Y. Bengio, J. David, BinaryConnect: training deep neural networks with binary weights during propagations, in: Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
[29] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, Y. Bengio, Binarized neural networks, in: Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.
[30] M. Kim, P. Smaragdis, Bitwise Neural Networks, 2016, arXiv:1601.06071.
[31] L. Song, X. Qian, H. Li, Y. Chen, PipeLayer: a pipelined ReRAM-based accelerator for deep learning, in: International Symposium on High Performance Computer Architecture, 2017, pp. 541–552.
[32] NeuPy, http://neupy.com/pages/home.html.
[33] J. Rosa, Sigma-delta modulators: tutorial overview, design guide, and state-of-the-art survey, IEEE Trans. Circuits Syst. 58 (1) (2016) 1–21.
[34] X. Dong, C. Xu, Y. Xie, N.P. Jouppi, NVSIM: a circuit-level performance, energy, and area model for emerging nonvolatile memory, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 31 (7) (2012) 994–1007.