

Sign Backpropagation: An On-chip Learning Algorithm for Analog RRAM Neuromorphic Computing Systems

Qingtian Zhang^a, Huaqiang Wu^{a,b,*}, Peng Yao^a, Wenqiang Zhang^a, Bin Gao^{a,b}, Ning Deng^{a,b,*}, He Qian^{a,b}

^a Institute of Microelectronics, Tsinghua University, Beijing, 100084, China
^b Center for Brain-Inspired Computing Research, Tsinghua University, Beijing, 100084, China

* Corresponding authors. Email addresses: [email protected] (Huaqiang Wu), [email protected] (Ning Deng)

Abstract

Currently, powerful deep learning models usually require significant processor and memory resources, which leads to very high energy consumption. The emerging resistive random access memory (RRAM) has shown great potential for constructing scalable and energy-efficient neural networks. However, it is hard to port a high-precision neural network from conventional digital CMOS hardware to analog RRAM systems owing to the variability of RRAM devices. A suitable on-chip learning algorithm is needed to retrain the network or improve its performance. In addition, how to integrate the peripheral digital computations with the analog RRAM crossbar is still a challenge. Here, we propose an on-chip learning algorithm, named sign backpropagation (SBP), for RRAM-based multilayer perceptrons (MLPs) with binary (0, 1) interfaces in the forward process and 2-bit (±1, 0) interfaces in the backward process. Simulation results show that the proposed method and architecture achieve classification accuracy on the MNIST dataset comparable to a standard MLP, while saving the area and energy cost of computing and storing high-precision intermediate results and exploiting the potential of the RRAM crossbar for neuromorphic computing.

Keywords: neural network, resistive random-access memory (RRAM), on-chip learning, multilayer perceptron (MLP), neuromorphic computing

1. Introduction

Deep learning has dramatically improved state-of-the-art performance in many domains, such as speech and visual object recognition (LeCun et al. (2015); Krizhevsky et al. (2012); Hinton et al. (2012)). However, training such deep networks consumes megajoules of energy using existing CPU and GPU chips, which are based on CMOS technologies (Le (2013); Eryilmaz et al. (2015)). An important part of this energy consumption comes from the memory wall between data processing and data storage (Chi et al. (2016); Keckler et al. (2011)). Recent studies demonstrated that emerging resistive random access memory (RRAM) devices are able to perform arithmetic operations in the memory (Wong et al. (2012); Li et al. (2013)). Their simple structure and excellent computation efficiency and power consumption have attracted many researchers (Gao et al. (2014); Jo et al. (2010); Yu et al. (2012)). Various neuromorphic systems have been built using the RRAM crossbar structure (Fig. 1a), which can naturally realize weighted combination and matrix-vector multiplication (Xu et al. (2011)). Simulation results show that neural networks implemented on RRAM-based neuromorphic systems could achieve huge energy savings (Wang et al. (2015); Li et al. (2015a); Xia et al. (2016)). However, there are significant challenges involved in porting a high-precision neural network developed for digital CPUs or FPGAs to analog RRAM systems.

There are two different approaches to using RRAM in neuromorphic computing systems. One is to divide the RRAM's dynamic conductance range into multiple levels and use it as a digital device with low precision (Wang et al. (2015); Xia et al. (2016); Hu et al. (2012)). The neural network weights can then be programmed into the RRAM arrays. In this case, a single RRAM device is treated as a multi-bit digital device. However, RRAM exhibits inevitable cycle-to-cycle and device-to-device variations (Wong et al. (2012)), which make it hard to write the RRAM devices to the target conductance state even with the write-verify method. If the RRAM device is designed for a limited precision, such as 4 bits, it will need 16 levels of conductance. This may be achievable for a single device, but it is very challenging for an RRAM array considering device-to-device variations.

The other method is to use RRAM as an analog device and only adopt the crossbar computing architecture (Yakopcic et al. (2014); Soudry et al. (2015)). The RRAM conductance is determined iteratively by an on-chip learning algorithm that optimizes the system output. It is assumed that the analog RRAM can take any conductance value within its dynamic range. Although this approach encounters the same problem of how to modulate the RRAM conductance accurately, the on-chip learning algorithm tolerates the RRAM variation more easily (Zamanidoost et al. (2015b)) and empowers the RRAM system with online learning ability. Updating the RRAM conductance to a target value normally requires the write-verify method, and sometimes multiple write-verify steps are needed. In the backpropagation algorithm, it is necessary to read the exact resistance of the RRAM device at time t and tune the device to another resistance value based on BP at time t + 1. Implementing this updating process in circuits takes a long time and consumes a significant amount of energy due to non-ideal device characteristics, such as the nonlinearity of the conductance change, device variations, the limited ON/OFF range, and asymmetric SET and RESET operations (Yu et al. (2013); Burr et al. (2015); Gokmen and Vlasov (2016)). Updating the device conductance on chip with the write-verify method is therefore inefficient.

In view of this, we propose a new on-chip learning algorithm for RRAM-based neuromorphic computing systems, named sign backpropagation (SBP). The SBP algorithm does not prescribe an exact conductance state for any RRAM device, which mitigates the effects of some non-ideal device characteristics, such as variation, nonlinearity, and asymmetry. The algorithm is based on the sign of the backpropagated error, not the sign of the derivative as in the Manhattan update rule (Zamanidoost et al. (2015a)) or the resilient backpropagation (RPROP) algorithm (Riedmiller and Braun (1993); Igel and Hüsken (2014)). The algorithm is designed for a multilayer perceptron (MLP) with binary (0/1) interfaces in the forward process and 2-bit (±1/0) interfaces in the backward process, which saves a significant amount of area and power consumption for analog and digital interconversion (Li et al. (2015b)). A momentum term is used to improve the convergence and performance of the network. On MNIST (LeCun et al. (1998)), the proposed structure and algorithm obtain 94.5% classification accuracy with a 784-300-10 MLP and a much smaller number of bits for the intermediate results.


Figure 1: Illustration of the RRAM crossbar and the device analog behavior. The details of the device have been reported in our other work (Yao et al. (2017)). (a) The structure of the RRAM crossbar array. (b) Conductance variations of RRAM in different initial conductance states after a SET or RESET pulse. (c) The analog behavior model of RRAM, which gives the distribution of the final conductance of RRAM after a SET or RESET pulse for a given initial conductance state. The standard deviation of the variations is shown as the ribbon band along the curve.


2. Preliminaries

2.1. RRAM device and the RRAM crossbar

The typical RRAM device is a passive two-terminal element with a variable conductance within a certain range. It can be used to build the cross-point structure (Xu et al. (2011)), which is well known as the RRAM crossbar array (Fig. 1a). The RRAM crossbar array can naturally realize weighted combination and matrix-vector multiplication with O(1) computation complexity. Based on Kirchhoff's law, the relationship between the input voltages and output currents can be expressed as in (1) (Hu et al. (2012)):

$$i_n = \sum_{m=1}^{M} v_m \cdot g_{m,n}, \qquad (1)$$

where $v_m$ is the input voltage on the $m$th row, $i_n$ is the output current in the $n$th column, and $g_{m,n}$ is the conductance of the RRAM device at that cross-point, with $m = 1, 2, \ldots, M$, $n = 1, 2, \ldots, N$, where $M$ is the number of matrix rows and $N$ is the number of columns. Based on the RRAM crossbar, the matrix-vector multiplication $y = Ax$ can be directly calculated with $v_m = x_m$ and $g_{m,n} = A(m, n)$, neglecting the wire resistance. The analog current output can be gathered in only a few nanoseconds.

The most important feature of RRAM is its analog behavior, which means the conductance of RRAM can be modulated gradually by applying a series of pulses (Hu et al. (2012)). Ideally, the conductance of RRAM could be increased or decreased by any desired amount. In practice, however, manipulating the RRAM's conductance accurately is still very challenging. Even with the same operation and initial state, the final conductance after identical operations varies due to cycle-to-cycle and device-to-device variations (Wong et al. (2012)). Fortunately, the variation is not completely random. Although the writing accuracy can be improved by controlling the pulse width and amplitude, it is very expensive to implement a series of elaborate modulation schemes on-chip, and the smaller the pulse amplitude, the narrower the achievable RRAM conductance range (Yao et al. (2017)). A cheaper way to modulate the RRAM conductance on chip is to use only the basic writing operations. There are two basic writing operations for modulating the analog RRAM conductance. One is called SET, which applies a positive pulse to the device to increase the conductance. The other is the inverse operation that decreases the conductance, called RESET.

Fig. 1b shows the device behavior on our 1024-cell 1T1R array with these two operations (Yao et al. (2017)). The conductance change of the RRAM cell after a SET or RESET operation depends on its initial conductance and exhibits variability. Based on these experimental results, we built a behavioral model based on the lognormal distribution that captures this randomness and used it in the following simulations:

$$G_{\mathrm{next}} = \begin{cases} G + X_{\mathrm{SET}}(G), \\ G - X_{\mathrm{RESET}}(G), \end{cases} \qquad (2)$$

where $\ln(X_{\mathrm{SET}}(G)) \sim N(\mu_1(G), \sigma_1^2(G))$ and $\ln(X_{\mathrm{RESET}}(G)) \sim N(\mu_2(G), \sigma_2^2(G))$ (Fig. 1c).
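For concreteness, the crossbar read-out of (1) and the stochastic SET/RESET behavior of (2) can be mimicked in a few lines. The sketch below is a minimal illustration in Python/NumPy rather than the MATLAB model used in this work; the mu_set, sigma_set, mu_reset, and sigma_reset arguments are hypothetical placeholders for the functions of the initial conductance that were fitted to the measured data (Fig. 1b, c), and the conductances are in arbitrary normalized units.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossbar_read(v, G):
    """Eq. (1): i_n = sum_m v_m * g_{m,n} for an M x N crossbar (wire resistance ignored)."""
    return v @ G

def apply_pulse(G, set_mask, reset_mask,
                mu_set=lambda g: np.log(0.02), sigma_set=lambda g: 0.4,
                mu_reset=lambda g: np.log(0.02), sigma_reset=lambda g: 0.4,
                g_min=0.0, g_max=1.0):
    """Eq. (2): one SET/RESET pulse with a lognormally distributed conductance change.

    mu_*/sigma_* are placeholder functions of the present conductance; in the paper
    they are fitted to the measured device behavior. Conductances use normalized units.
    """
    dg_set = rng.lognormal(mu_set(G), sigma_set(G), size=G.shape)
    dg_reset = rng.lognormal(mu_reset(G), sigma_reset(G), size=G.shape)
    G = G + set_mask * dg_set - reset_mask * dg_reset
    return np.clip(G, g_min, g_max)   # keep the conductance inside the dynamic range

# Toy usage: read a 4 x 3 crossbar, then apply one SET pulse to every cell.
G = rng.uniform(0.1, 0.9, size=(4, 3))
v = np.array([0.1, 0.0, 0.2, 0.1])    # input voltages on the rows
i = crossbar_read(v, G)               # output currents on the columns
G = apply_pulse(G, set_mask=np.ones_like(G), reset_mask=np.zeros_like(G))
```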

2.2. MLP

A typical MLP consists of a number of fully-connected layers that are executed sequentially (Fig. 2a). The output of each layer can be written as

$$y_l = f\!\left(\sum_j w_{l,j} \times y_{l-1} + b_l\right), \qquad (3)$$

where $y_l$ is the output of the $l$th layer, $y_0$ is the input data of the network, $W_l = (w_{l,j})$ is the weight matrix, $b_l$ is the bias vector, $l = 1, 2, \ldots, L$, $L$ is the number of layers of the network not including the input layer, and $f(\cdot)$ is a non-linear activation function, such as the rectified linear unit (ReLU) or the sigmoid function. The backward pass of the MLP with the standard backpropagation (BP) algorithm is shown in Fig. 2b. Although the performance of an MLP on the MNIST dataset is not as good as that of a convolutional neural network (CNN), the MLP still works well in some other domains (Zhang et al. (2016)). Besides, in many neural network models, such as CNNs, the fully-connected layer, which is the same type of layer as in an MLP, is still essential. Thus, the MLP is an indispensable bridge between RRAM-based neuromorphic computing systems and current powerful deep learning models.
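As a full-precision reference for the binarized architecture introduced in Section 2.3, the forward pass of (3) can be sketched as follows; the layer sizes and random parameters below are arbitrary illustrations, not values from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(y0, weights, biases, f=relu):
    """Eq. (3): y_l = f(y_{l-1} W_l + b_l) for l = 1..L (full-precision reference).

    The output layer of the network in the paper uses a softmax read-out instead of f;
    that detail is omitted here.
    """
    y = y0
    for W, b in zip(weights, biases):
        y = f(y @ W + b)
    return y

# Toy 784-300-10 network with random parameters, just to exercise the function.
rng = np.random.default_rng(0)
sizes = [784, 300, 10]
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
out = mlp_forward(rng.random(784), weights, biases)
```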

Figure 2: The common MLP structure and our settings (red parts). (a) The feedforward pass of the MLP. (b) The BP algorithm and the SBP algorithm (red parts).


Figure 3: The feedforward part of the RRAM-based MLP architecture.

2.3. RRAM-based MLP

The most common computation in an MLP is matrix-vector multiplication, which is exactly the crossbar's forte. Thus, the remaining problem is how to feed the multiplier to the crossbar and realize the peripheral digital functions with the analog crossbar output efficiently. To prevent the interfaces between the analog crossbar and the peripheral digital functions from consuming too much of the chip's energy and area, we use current comparators to realize a binary activation function. The inputs of each comparator are the analog output current of the RRAM crossbar and a contrast current. The output of each layer can be written as

$$y_l = (z_l > c_l) = \left(\sum_j g_{l,j} \times y_{l-1} > c_l\right), \qquad (4)$$

where $G_l = (g_{l,j})$ is the conductance matrix of the RRAM crossbar array and corresponds to the weight matrix $W_l = (w_{l,j})$ of the MLP, $z_l$ is the output current of the RRAM crossbar in the $l$th layer, $c_l$ is the contrast current, and $l = 1, 2, \ldots, L - 1$. The output of the network, $y_L$, is computed with the softmax function. Compared with the intermediate results in the hidden layers, there are far fewer network outputs, but each one is much more important. Thus, the digital, high-precision softmax function is useful and does not cost too much in terms of area and energy (Zunino and Gastaldo (2002)). The input of the network, $y_0$, that is, the raw data, is not binarized since it does not need to be stored on the chip. The feedforward part of the RRAM-based MLP architecture is shown in Fig. 3.
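Assuming ideal comparators and ignoring analog read noise, the binarized forward pass of (4) could be simulated as in the sketch below; rram_mlp_forward, layer_conductances, and contrasts are illustrative names, and the random conductances are placeholders rather than trained values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rram_mlp_forward(y0, layer_conductances, contrasts):
    """Eq. (4): hidden layers output (z_l > c_l) in {0, 1}; the last layer uses softmax.

    layer_conductances[l] is the conductance matrix G_l of the l-th crossbar and
    contrasts[l] is the contrast current c_l of its comparators.
    """
    y = y0                                    # raw input, not binarized
    for G, c in zip(layer_conductances[:-1], contrasts):
        z = y @ G                             # analog crossbar output currents
        y = (z > c).astype(float)             # current comparators: binary activation
    return softmax(y @ layer_conductances[-1])

# Toy usage with a 784-300-10 network and arbitrary conductances/contrast current.
rng = np.random.default_rng(0)
G1 = rng.uniform(0.0, 1.0, (784, 300))
G2 = rng.uniform(0.0, 1.0, (300, 10))
probs = rram_mlp_forward(rng.random(784), [G1, G2], contrasts=[190.0])
```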

There are two advantages to implementing the forward process of the RRAM-based MLP in this way. 1) The binary staircase function can tolerate the variations and faults of RRAM devices. 2) The output of a layer can be easily transferred to the input of the next layer without complicated analog-to-digital converters (ADCs) or digital-to-analog converters (DACs).

With this architecture, a serious problem for training emerges. The derivative of the activation function is "0" almost everywhere, which prevents the BP algorithm from working with this architecture. Thus, we modified the BP algorithm and redefined the derivative calculations. The details of the algorithm are discussed in the following section.

3. On-chip learning and SBP

3.1. On-chip Learning of RRAM-based MLP

As the name suggests, on-chip learning means that the conductance of the RRAM is determined on the chip. Starting from any initial conductance values, the conductances are updated and determined by the data and the on-chip system itself. One of the most important purposes of on-chip learning for RRAM-based neuromorphic computing systems is to replace the training of neural networks on CPUs or GPUs, which consumes a huge amount of energy and cannot easily be transferred to the RRAM system. The other is to empower the neuromorphic system with online learning ability, which can be used to improve the performance of the neural network or in other online learning applications, such as reinforcement learning. For performance and efficiency, the on-chip learning algorithm must be able to tolerate the RRAM variations and must not require too much computation, which is power- or time-consuming on chip.

3.2. SBP

To determine the proper conductance of the RRAM devices in the neuromorphic system, an updating and termination logic, that is, a learning algorithm, is essential. The BP algorithm has been the most successful and widely used learning algorithm for training deep neural networks in recent years. Unfortunately, the derivative of the activation function in our architecture is "0" almost everywhere, whereas it is a very important factor in the BP algorithm, as shown in Fig. 2b. Besides, as mentioned above, it is inconvenient to write or change an exact conductance value on the RRAM device as demanded by BP. Thus, we propose a new learning algorithm based on the BP algorithm and the sign of the backward pass, named the sign backpropagation (SBP) algorithm. The SBP algorithm determines how to update the conductance of the RRAM according to the difference between the chip output of the input

data and the data label, and finally leads to a stable state in which the system can complete the recognition task. The objective of the SBP algorithm is to minimize an error function of the system output and the ground truth. Without loss of generality, the quadratic error function was used in this work:

$$E = \frac{1}{2}\|y_L - t\|_2^2, \qquad (5)$$

where $y_L$ is defined as before and $t$ is the true label of the data. The equations used for computing the backward pass are as follows:

$$\frac{\partial E}{\partial y_L} = y_L - t, \qquad (6)$$

$$\frac{\partial E}{\partial y_l} = \sum_k g_{l,k} \times \frac{\partial E}{\partial z_{l+1}}, \qquad (7)$$

where $g_{l,k}$ denotes the conductance as before, $l = 1, 2, \ldots, L - 1$, and $\partial E/\partial z_l$ is defined as

$$\frac{\partial E}{\partial z_l} := \operatorname{sign}\!\left(\frac{\partial E}{\partial y_l}\right) \times \left(\left|\frac{\partial E}{\partial y_l}\right| \geq \sigma_l\right), \qquad (8)$$

where $:=$ means "is defined as" and $\sigma_l$, $l = 1, 2, \ldots, L$, is a threshold that filters out the small values in the error. The threshold $\sigma_l$ plays an important role in improving the convergence of the algorithm and will be discussed later. The derivative with respect to the weights is calculated as

$$\frac{\partial E}{\partial w_l} = \frac{\partial E}{\partial z_l} \cdot \frac{\partial z_l}{\partial w_l}, \qquad (9)$$

$$\frac{\partial z_l}{\partial w_l} := \operatorname{sign}(y_{l-1}), \qquad (10)$$

where $l = 1, 2, \ldots, L$ and $y_0$ is the input data of the network. Since the activation function is non-differentiable, we define the derivative as in (10). It follows directly from these definitions that both $\partial E/\partial z_l$ and $\partial E/\partial w_l$ can only take the values $-1$, $0$, or $1$. Thus, the ADC and DAC in the backward pass can be implemented with simple logic circuits (Fig. 5). The update logic of the RRAM is shown in Table 1. The backward part of the RRAM-based MLP is shown in Fig. 4.

Table 1: Update logic of the RRAM device

∂E/∂w_l    Update operation
 1         One SET pulse
 0         Nothing
-1         One RESET pulse
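To make the backward pass concrete, the following sketch strings together (6)-(10) and the update logic of Table 1 for a single layer. It is a schematic reconstruction, not the authors' MATLAB implementation; apply_pulse stands for a stochastic pulse model such as the one sketched in Section 2.1.

```python
import numpy as np

def sbp_layer_backward(err_y, G, y_prev, sigma):
    """One layer of the SBP backward pass.

    err_y   : dE/dy_l of this layer (from eqs. (6)-(7))
    G       : conductance matrix of this layer's crossbar (inputs x outputs)
    y_prev  : binary output of the previous layer, y_{l-1} in {0, 1}
    sigma   : threshold sigma_l of eq. (8)
    Returns dE/dz_l in {-1, 0, +1}, dE/dW in {-1, 0, +1}, and dE/dy_{l-1}.
    """
    err_z = np.sign(err_y) * (np.abs(err_y) >= sigma)   # eq. (8), 2-bit error
    dW = np.outer(np.sign(y_prev), err_z)               # eqs. (9)-(10)
    err_y_prev = G @ err_z                               # eq. (7), propagated by the crossbar
    return err_z, dW, err_y_prev

def update_crossbar(G, dW, apply_pulse):
    """Table 1: +1 -> one SET pulse, -1 -> one RESET pulse, 0 -> no pulse."""
    return apply_pulse(G, set_mask=(dW > 0).astype(float),
                          reset_mask=(dW < 0).astype(float))
```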

Figure 4: The backpropagation part of the RRAM-based MLP architecture.

The threshold $\sigma_l$ in (8) is used to filter out the small values in the error and improve the convergence of the algorithm. As shown in (9), the pseudo gradient in the SBP algorithm differs from the original gradient in the BP algorithm. In the worst case, the distortion angle between the pseudo gradient $d_p$ and the genuine gradient $d_g$, whose cosine is $\langle d_p, d_g \rangle / (|d_p| \cdot |d_g|)$, will be very large with $\sigma_l = 0$ and $\partial E/\partial w \to 0$, which usually leads the network to an oscillation state, where $\langle p, q \rangle$ denotes the inner product of vectors $p$ and $q$ and $|p|$ denotes the length of vector $p$. With a proper $0 < \sigma_l < \max(|\partial E/\partial y_l|)$, the angle difference between the pseudo gradient and the genuine gradient becomes smaller. This can be easily proved and is illustrated in Fig. 6a. The average angle difference between the $\sigma$-quantized results and a series of genuine gradients is shown in Fig. 6b.
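The dependence of the angle distortion on σ (Fig. 6b) can be reproduced qualitatively with a small numerical experiment; the Gaussian toy gradients below are an arbitrary stand-in for real backpropagated errors, so the numbers are only indicative.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(err, sigma):
    """Eq. (8): keep only the sign of error components whose magnitude exceeds sigma."""
    return np.sign(err) * (np.abs(err) >= sigma)

def mean_angle(sigmas, n_trials=1000, dim=300):
    """Average angle (degrees) between toy gradients and their sign-quantized versions."""
    angles = []
    for sigma in sigmas:
        a = []
        for _ in range(n_trials):
            d_g = rng.normal(0.0, 1.0, dim)           # stand-in for a genuine gradient
            d_p = quantize(d_g, sigma)
            if not d_p.any():
                continue                               # nothing would be updated
            cos = d_p @ d_g / (np.linalg.norm(d_p) * np.linalg.norm(d_g))
            a.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
        angles.append(np.mean(a))
    return angles

print(mean_angle(sigmas=[0.0, 0.5, 1.0, 1.5]))
```

In this toy setting the average angle shrinks for a moderate threshold and grows again when the threshold becomes too large, mirroring the qualitative trend described above.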


Figure 5: The circuits for error backpropagation and RRAM crossbar update. (a) The circuits for error backpropagation. The backpropagated error is transformed by the RRAM crossbar while using the BL as input. (b) The circuits for weight update. The update direction is calculated by simple logic circuits.


Figure 6: Illustration of the benefits of introducing the threshold $\sigma_l$. (a) The difference between the pseudo gradient with $\sigma_l = 0$, the pseudo gradient with $0 < \sigma_l < |\partial E/\partial y_l|$, and the genuine gradient of ReLU. The pseudo gradient with $0 < \sigma_l < |\partial E/\partial y_l|$ is closer to the gradient of ReLU than the pseudo gradient with $\sigma_l = 0$, as the red area is smaller than the blue area. (b) Simulation results of the update ratio, which is the percentage of weights that need to be updated after one training batch, and of the angle distortion of the SBP updating direction from the BP updating direction, for different σ.


Figure 7: Example learning behaviors on MNIST with the RRAM-based MLP and different learning schemes. The momentum scheme is denoted as Mome, and the number indicates the size of the hidden layer.

4. Experimental results

4.1. Experimental setup

We demonstrated a two-layer RRAM-based MLP on the MNIST dataset with an initial size of 784 × 100 × 10. MNIST is a widely used dataset for handwritten digit recognition. It contains 60,000 training images and 10,000 test images covering the 10 classes from '0' to '9'. We trained the network on 50,000 random training images and benchmarked it on the test set. The contrast current $c_l$ and the threshold $\sigma_l$ have a strong influence on the classification accuracy. We find that the system works well in most cases when $c_l$ is set equal to the output of the training images on an RRAM column with all mid-level conductances and $\sigma_l$ is set equal to the output of the backward pass on an RRAM row with all low-resistance states. The RRAM devices were initialized randomly. The model and the analyses were implemented in MATLAB.

We first evaluated our RRAM-based MLP architecture with the SBP algorithm while $\sigma = 0$, and then increased $\sigma$ to a proper value. The preliminary result obtained with these settings was not satisfactory, as shown in Fig. 7. Thus, we increased the number of hidden layer units to 300 and introduced a momentum term with a rate of 0.5 to accelerate the convergence.
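Putting the pieces together, one per-sample training step of the two-layer network used in this experiment could be simulated roughly as follows. This is a schematic composition of the earlier sketches (apply_pulse is the hypothetical pulse model from Section 2.1, labels are assumed to be one-hot), and it omits the momentum gating and the $c_l$/$\sigma_l$ initialization heuristic described in the text.

```python
import numpy as np

def train_epoch(images, labels, G, contrasts, sigmas, apply_pulse):
    """Schematic one-epoch SBP training loop for a two-layer RRAM-based MLP.

    G[0], G[1] : conductance matrices of the two crossbars
    contrasts  : comparator threshold c_1 for the hidden layer
    sigmas     : error thresholds sigma_l for the backward pass
    apply_pulse: pulse model, e.g. the one sketched in Section 2.1
    """
    for x, t in zip(images, labels):
        # Forward pass, eq. (4): binary hidden activation, softmax output.
        z1 = x @ G[0]
        h = (z1 > contrasts[0]).astype(float)
        z2 = h @ G[1]
        y = np.exp(z2 - z2.max()); y /= y.sum()

        # Backward pass, eqs. (6)-(10).
        err_y2 = y - t                                             # eq. (6)
        err_z2 = np.sign(err_y2) * (np.abs(err_y2) >= sigmas[1])   # eq. (8)
        err_y1 = G[1] @ err_z2                                     # eq. (7)
        err_z1 = np.sign(err_y1) * (np.abs(err_y1) >= sigmas[0])
        dW1 = np.outer(np.sign(x), err_z1)                         # eqs. (9)-(10)
        dW2 = np.outer(np.sign(h), err_z2)

        # Device update, Table 1: one SET/RESET pulse per selected cell.
        G[0] = apply_pulse(G[0], (dW1 > 0).astype(float), (dW1 < 0).astype(float))
        G[1] = apply_pulse(G[1], (dW2 > 0).astype(float), (dW2 < 0).astype(float))
    return G
```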

Figure 8: The role of parameters in the system. (a) The classification error rate and device operation time with different σ. (b) The classification accuracy with different numbers of hidden layer units.


For consistency, we only stored the momentum of the backpropagated error, not the gradient of each weight. This means that an RRAM device is updated only if the direction of $\partial E/\partial z$ is the same as in the last iteration. With the help of the momentum term and the larger number of hidden layer units, the average accuracy over 20 trials on MNIST with the RRAM-based MLP is 94.5%, which is only about a 1% decrease compared with the benchmark LeNet-300 (LeCun et al. (1998)).

4.2. The significant parameters in the system

The quantization threshold σ plays a key role in the system performance. With a proper σ, the system attains a higher classification accuracy with less RRAM operation time. The operation time is the total number of programming events in the whole training process; fewer programming events mean less energy consumption during training (Fig. 8a). The best σ (about 0.25 in this experiment) lets the system achieve a recognition accuracy comparable to LeNet while saving 10× the operation time. When σ is close to 1, both the classification error rate and the device operation time would be small, but the convergence speed would be very slow, because only a few weights in the network would be updated at each iteration.

The number of hidden layer units is also an important parameter that strongly affects the classification accuracy. Since the interfaces in the forward process have been binarized, the system needs more hidden layer units to achieve a better accuracy, whereas neural networks with full-precision interfaces may already saturate with a smaller number of hidden layer units (Fig. 8b). As the energy consumption of the array is usually much less than that of the interfaces (Li et al. (2015b)), using a larger array rather than a higher-precision interface is an energy-efficient way to achieve similar accuracy.

Besides the hyper-parameters of the algorithm, we also evaluated the influence of the RRAM device characteristics. Following the definition of nonlinearity in Chen et al. (2016), the test error rates on the MNIST dataset with various degrees of weight-update nonlinearity and SET/RESET symmetry are shown in Fig. 9. The SBP algorithm worked well when the nonlinearity of either the SET or the RESET process is low, even if the other is very high, because a nonlinear SET/RESET operation can be compensated by a series of linear RESET/SET operations.

Figure 9: The test error rate on the MNIST dataset with various weight-update nonlinearities and SET/RESET symmetry.

The nonlinearity is more important to the SBP algorithm than the symmetry because it has a direct effect on the possible values of the RRAM conductance. The core assumption of the SBP algorithm is that the analog RRAM can take any conductance value within its range. Low nonlinearity supports this assumption, whereas symmetry does not. Similarly, if the conductance change per pulse is smaller (larger) while the other characteristics and hyper-parameters are fixed, SBP will work better (worse).

5. Conclusion

In this paper, we proposed an RRAM crossbar-based neuromorphic computing architecture for MLP and a matching on-chip learning algorithm named SBP. The simulation results show that the proposed architecture and algorithm achieve 94.5% classification accuracy with a 784-300-10 MLP structure on the MNIST dataset, while the intermediate results are reduced to 1 or 2 bits. The SBP algorithm can take advantage of the analog RRAM's potential in neuromorphic computing and tolerate non-ideal RRAM device characteristics, such as the asymmetry of the SET/RESET process.


References

Burr, G. W., Narayanan, P., Shelby, R. M., Sidler, S., 2015. Large-scale neural networks implemented with non-volatile memory as the synaptic weight element: Comparative performance analysis (accuracy, speed, and power). In: IEEE International Electron Devices Meeting. pp. 4.4.1–4.4.4.

Chen, P. Y., Lin, B., Wang, I. T., Hou, T. H., Ye, J., Vrudhula, S., Seo, J. S., Cao, Y., Yu, S., 2016. Mitigating effects of non-ideal synaptic device characteristics for on-chip learning. In: IEEE/ACM International Conference on Computer-Aided Design. pp. 194–199.

Chi, P., Li, S., Xu, C., Zhang, T., Zhao, J., Liu, Y., Wang, Y., Xie, Y., 2016. PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. In: ACM/IEEE International Symposium on Computer Architecture. pp. 27–39.

Eryilmaz, S. B., Kuzum, D., Yu, S., Wong, H. S. P., 2015. Device and system level design considerations for analog-non-volatile-memory based neuromorphic architectures. In: 2015 IEEE International Electron Devices Meeting (IEDM). Vol. 30. pp. 4.1.1–4.1.4.

Gao, B., Bi, Y., Chen, H. Y., Liu, R., Huang, P., Chen, B., Liu, L., Liu, X., Yu, S., Wong, H. S., 2014. Ultra-low-energy three-dimensional oxide-based electronic synapses for implementation of robust high-accuracy neuromorphic computation systems. ACS Nano 8 (7), 6998–7004.

Gokmen, T., Vlasov, Y., 2016. Acceleration of deep neural network training with resistive cross-point devices: Design considerations. Frontiers in Neuroscience 10 (51), 333.

Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29 (6), 82–97.

Hu, M., Li, H., Wu, Q., Rose, G. S., 2012. Hardware realization of BSB recall function using memristor crossbar arrays. In: Design Automation Conference. pp. 498–503.


Igel, C., Hüsken, M., 2014. Improving the Rprop learning algorithm. In: Proceedings of the Second International Symposium on Neural Computation.

Jo, S. H., Chang, T., Ebong, I., Bhadviya, B. B., Mazumder, P., Lu, W., 2010. Nanoscale memristor device as synapse in neuromorphic systems. Nano Letters 10 (4), 1297.

Keckler, S. W., Dally, W. J., Khailany, B., Garland, M., Glasco, D., 2011. GPUs and the future of parallel computing. IEEE Micro 31 (5), 7–17.

Krizhevsky, A., Sutskever, I., Hinton, G. E., 2012. ImageNet classification with deep convolutional neural networks. In: International Conference on Neural Information Processing Systems. pp. 1097–1105.

Le, Q. V., 2013. Building high-level features using large scale unsupervised learning. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 8595–8598.

LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), 2278–2324.

Li, B., Gu, P., Shan, Y., Wang, Y., Chen, Y., Yang, H., 2015a. RRAM-based analog approximate computing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34 (12), 1905–1917.

Li, B., Shan, Y., Hu, M., Wang, Y., Chen, Y., Yang, H., 2013. Memristor-based approximated computation. In: IEEE International Symposium on Low Power Electronics and Design. pp. 242–247.

Li, B., Xia, L., Gu, P., Wang, Y., Yang, H., 2015b. Merging the interface: power, area and accuracy co-optimization for RRAM crossbar-based mixed-signal computing system. In: Design Automation Conference. p. 13.

Riedmiller, M., Braun, H., 1993. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In: IEEE International Conference on Neural Networks. pp. 586–591, vol. 1.


Soudry, D., Castro, D. D., Gal, A., Kolodny, A., Kvatinsky, S., 2015. Memristor-based multilayer neural networks with online gradient descent training. IEEE Transactions on Neural Networks and Learning Systems 26 (10), 2408.

Wang, Y., Tang, T., Xia, L., Li, B., Gu, P., Yang, H., Li, H., Xie, Y., 2015. Energy efficient RRAM spiking neural network for real time classification. In: Edition on Great Lakes Symposium on VLSI. pp. 189–194.

Wong, H. S. P., Lee, H. Y., Yu, S., Chen, Y. S., Wu, Y., Chen, P. S., Lee, B., Chen, F. T., Tsai, M. J., 2012. Metal-oxide RRAM. Proceedings of the IEEE 100 (6), 1951–1970.

Xia, L., Tang, T., Huangfu, W., Cheng, M., Yin, X., Li, B., Wang, Y., Yang, H., 2016. Switched by input: power efficient structure for RRAM-based convolutional neural network. In: 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC). pp. 1–6.

Xu, C., Dong, X., Jouppi, N. P., Xie, Y., 2011. Design implications of memristor-based RRAM cross-point structures. In: 2011 Design, Automation & Test in Europe. pp. 1–6.

Yakopcic, C., Hasan, R., Taha, T. M., Mclean, M. R., 2014. Efficacy of memristive crossbars for neuromorphic processors. In: International Joint Conference on Neural Networks. pp. 15–20.

Yao, P., Wu, H., Gao, B., Eryilmaz, S. B., Huang, X., Zhang, W., Zhang, Q., Deng, N., Shi, L., Wong, H. P., 2017. Face classification using electronic synapses. Nature Communications 8, 15199.

Yu, S., Gao, B., Fang, Z., Yu, H., 2012. A neuromorphic visual system using RRAM synaptic devices with sub-pJ energy and tolerance to variability: Experimental characterization and large-scale modeling. In: IEEE International Electron Devices Meeting (IEDM) Technical Digest. pp. 10.4.1–10.4.4.

Yu, S., Gao, B., Fang, Z., Yu, H., Kang, J., Wong, H. S. P., 2013. A low energy oxide-based electronic synaptic device for neuromorphic visual systems with tolerance to device variation. Advanced Materials 25 (12), 1774–1779.


Zamanidoost, E., Bayat, F. M., Strukov, D., Kataeva, I., 2015a. Manhattan rule training for memristive crossbar circuit pattern classifiers. In: IEEE International Symposium on Intelligent Signal Processing. pp. 1–6.

Zamanidoost, E., Klachko, M., Strukov, D., Kataeva, I., 2015b. Low area overhead in-situ training approach for memristor-based classifier. In: Proceedings of the 2015 IEEE/ACM International Symposium on Nanoscale Architectures. pp. 139–142.

Zhang, Y., Sun, Y., Phillips, P., Liu, G., Zhou, X., Wang, S., 2016. A multilayer perceptron based smart pathological brain detection system by fractional Fourier entropy. Journal of Medical Systems 40 (7), 173.

Zunino, R., Gastaldo, P., 2002. Analog implementation of the softmax function. In: IEEE International Symposium on Circuits and Systems. pp. II-117–II-120, vol. 2.
