Solid State Electronics 155 (2019) 82–92
Analog neuromorphic computing using programmable resistor arrays
Paul M. Solomon
IBM, T.J. Watson Research Center, Yorktown Heights, NY, USA
ARTICLE INFO
ABSTRACT
The review of this paper was arranged by Profs. S. Luryi, J. M. Xu, and A. Zaslavsky
Digital logic technology has been extraordinarily successful and has been fueled by incredible gains in integration density and performance achieved over the years following Moore’s law. This has led to societal changes where more and more everyday functions are aided by smart devices following the path to artificial intelligence. However, in the field of deep machine learning, even this technology falls short, partly because device scaling gains are no longer easy to come by, but also due to intractable energy costs of computation. Deep learning, using labeled data, can be mapped onto artificial neural networks, arrays where the inputs and outputs are connected by programmable weights, and which can perform pattern recognition functions. The learning process consists of finding the optimum weights; however, this learning process is very slow for large problems. Exploiting the fact that weights do not need to be determined with high precision, as long as they can be updated precisely, the device community has recognized that analog computation approaches, using physical arrays of memristor (programmable resistor) type devices, could offer significant speedup and power advantages compared to purely digital or purely software approaches. On the other hand, the history of analog computation is not reassuring, since the rule has been that more capable digital devices invariably supplant analog function. In this paper I will discuss the opportunities and limitations of using analog techniques to accelerate the learning process in resistive neural networks.
1. Introduction
Analog computing has a long and illustrious history [1–3]; indeed, it is as old as civilization. From the Greek Antikythera mechanism [4], the medieval astrolabe, the slide-rule, gunsight positioners, NASA trajectory plotters, automotive design and so on, analog computing has always been there to serve a need where a fit to a special application suggests a special device to implement the solution efficiently, and usually in a transparent manner. Thus, for example, an analog simulator of a large electrical network is essentially a scale model of that network where analog amplifiers placed at the nodes implement the local differential equations. With the advent of general-purpose digital computers, analog applications were supplanted. All the traditional applications, those listed above and more, can today be done easily by digital computers. One only has to marvel at the wonders of image and audio processing exhibited in a smart phone, all by digital means, to appreciate the power of digital processing, where the analog front end is pushed as far to the periphery as possible. The unprecedented scaling of digital technology, following Moore’s law [5], has facilitated this takeover; however, one is now near the end of the scaling path and the cost of computation in energy per operation is stagnating [6]. On the other hand, the types of
problems being attacked are becoming larger and more challenging: simulation of quantum chemical systems of more than a few hundred atoms [7], molecular biological systems [8] and artificial intelligence (AI). For these types of applications even the computational power of modern digital technology may not be sufficient, and the speed and power-efficiency of analog computation is called for. The lure of analog computation, still valid in the digital age, is that computation can be done using the natural characteristics of analog devices, both static and dynamic, which would require thousands of digital devices and multiple time steps to perform the same function. There is a trade-off in accuracy, but specific applications where low accuracy is tolerated and huge throughput is needed constitute a natural target. Confining ourselves (for convenience) to analog computing using electrical elements, the analog elements can perform the elementary operations of addition and subtraction (Kirchhoff’s laws), multiplication and division by a scalar (resistors), integration and differentiation (capacitors), and taking logarithms and exponentials (diodes and transistors). Thus, one can solve differential equations and perform array operations purely in the analog, voltage and time, domains without the need for analog-to-digital (ADC) conversion or for the execution of multiple program steps to perform elementary operations.
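As a toy illustration of the last point, the short Python sketch below numerically emulates one such analog primitive, an RC integrator node relaxing toward its input, which in hardware would solve the corresponding differential equation continuously; the component values and time step are arbitrary assumptions for illustration only.

```python
import numpy as np

def analog_integrator(x, dt, tau):
    """Emulate a single RC node obeying tau * dy/dt = x(t) - y(t)."""
    y = np.zeros_like(x)
    for n in range(1, len(x)):
        dydt = (x[n - 1] - y[n - 1]) / tau   # current charging the capacitor (Kirchhoff node)
        y[n] = y[n - 1] + dydt * dt          # voltage accumulated on the capacitor
    return y

dt, tau = 1e-6, 1e-3                         # 1 us step, 1 ms time constant (assumed values)
t = np.arange(0.0, 5e-3, dt)
x = np.ones_like(t)                          # unit step input
y = analog_integrator(x, dt, tau)
print(y[-1])                                 # approaches 1 - exp(-5) ~= 0.993
```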
These capabilities translate into compactness, power-efficiency and speed. For the neuromorphic application an important additional benefit is the smoothness of analog functions, which leads to the ability to define gradients, which are so important for optimization and function minimization in deep learning. Analog computation also has serious weaknesses, the first and foremost being limited precision. Whereas digital calculations can be extremely precise (1:10⁻¹⁶ for double precision), analog calculations rarely achieve a precision better than 0.1%. While some applications may not require high precision, other limitations can be equally serious. These are the lack of large-scale analog memory, especially non-volatile memory, the limited flexibility and re-configurability of analog circuits, limited dynamic range, and progressive corruption of the data by noise over multiple stages of analog computation. For the neuromorphic application the noise issue is mitigated by the fact that deep learning occurs by multiple small steps, so that the noise can be averaged out,1 as long as it is unbiased, i.e. does not cause drift of the solution. The limited memory and reconfigurability are the most serious obstacles to practical applications, although we shall see how clever design involving mixed analog-digital systems can mitigate these problems. In this paper we will explore the application of analog computation to neuromorphic systems, concentrating on an array processing concept called the Resistive Processing Unit [9], or RPU.
Fig. 1. Diagram of communication between two neurons via a synapse. In biology there are thousands of individual synaptic connections to a given neuron, so that the action potential in the post-synaptic neuron is triggered when the multitude of asynchronous action potentials from the pre-synaptic neurons reaches a given threshold. Reproduced from Burr et al. [32] with minor changes.
2. Neuromorphic computing using spiking neural networks
The field of neuromorphic computing has branched out in different directions, all inspired by the brain but abstracted to varying degrees. All involve artificial neural networks (ANNs) modelled after the brain, following McCulloch and Pitts [10]. One direction has been to use ANNs as a model to understand the brain itself, while another has focused on the application of ANNs to AI. Biological neurological circuits (see Fig. 1) consist of a large number of neurons (∼10¹¹ in the human brain), each connected to other neurons via a large number (∼10⁴) of synapses [11,12]. The axon acts as an electrical conduit for the action potential and the synapses transmit the signal to a neighboring neuron. In many neuromorphic models the synapse is represented as an electrical conductance or ‘weight’ that can be altered (updated) via the learning process. Hebb introduced the idea of synaptic plasticity with his famous saying “Neurons that fire together wire together.” A major, and relatively recent, breakthrough was the discovery of spike timing dependent plasticity (STDP) by Markram [13] and Bi and Poo [14] where, as shown in Fig. 2, the strength of a synaptic connection is increased or decreased depending on whether the synapse is activated before (causal) or after (anti-causal) the post-synaptic neuron fires. STDP has been introduced, as a learning rule, into ANNs with a variety of mathematical models; see the reviews in Refs. [15,16]. An important branch of neuromorphic computing has evolved based on spiking neural networks (SNNs), many using STDP-based learning rules. Most SNN machines use asynchronous logic2 to account for the explicit time dependence of STDP. While raw STDP describes unsupervised learning, supervised learning has been accomplished with SNNs using the restricted Boltzmann machine [17], or back-propagation using event-driven stochastic connections [18]. While some general problems such as image recognition [17,19,20] have been attacked, the emphasis has been on SNNs as a tool to study the brain. Most of the work used software on conventional hardware, but specialized hardware machines, for example True North [21], Spinnaker [22], Braindrop [23], Brainscales [24]
Fig. 2. An illustration of spike timing dependent plasticity, where the synapse strength is enhanced for causal and depressed for anti-causal timing between pre- and post-synaptic action potentials. Reproduced from Burr et al. [32] with minor changes.
and Loihi [25], have been, or are being, built using custom digital or mixed analog-digital CMOS integrated circuits. These are impressive machines. IBM’s True North [21] is an all-digital SNN with a million neurons and 5.4 billion transistors and is capable of complex image recognition tasks, including multi-object detection and classification, but it is an inference machine only and so does not need a learning rule. Intel’s Loihi [25], with over 2 billion transistors and using the latest 14 nm technology, is similarly all-digital and can implement a variety of STDP learning rules. Spinnaker [22] is a giant brain simulation machine, currently under construction, involving a million computer cores capable of simulating a billion neurons.3 It uses a highly simplified neural spiking model. Brainscales [24], by contrast, is a mixed analog-digital machine, which is designed to faithfully emulate the biological function of neuro-synaptic circuits. Each chip has 512 neuron compartments capable of receiving inputs from 16 K synapses.
3. Back-propagation algorithm
The application of ANNs to solving real-world problems took a major leap forward with the invention of the back-propagation (BP) algorithm by Rumelhart, Hinton and Williams [26]. Since then it has been at the forefront of advances in AI applications such as image recognition [27,28], machine translation [29] and speech recognition [30]. In the above examples the algorithm is executed in software run on fast graphics processing units (GPUs); Google has developed even higher-throughput tensor processing units (TPUs) for AI applications. The BP algorithm can be regarded as a giant optimization procedure where up to billions of synaptic weights are optimized to
1 Indeed, a small amount of noise can be beneficial in aiding convergence in large optimization problems. 2 Since the time scale of neuron spiking is much slower than standard logic circuits, a hybrid synchronous-asynchronous approach is often used.
3 Built by the Advanced Processor Technologies Research Group (APT) at the School of Computer Science, University of Manchester, the system is part of the European Union's Human Brain Project.
Fig. 4. Mapping a multi-layer perceptron onto a hardware array. © [2015] IEEE. Reprinted, with permission, from Burr et al., [50].
Fig. 3. Multi-layer perceptron, illustrating the MNIST example. © [2015] IEEE. Reprinted, with permission, from Burr et al., [50].
minimize the statistical error when comparing the output of the neural calculation with the expected output (the ground truth) for a huge number of input examples. In contrast to STDP, time does not play an explicit role in the learning rule. The BP algorithm uses simple, static, linear weights as synapses in the ANN arrays. Such models are amenable to mathematical analysis and a large literature has arisen around them. The fact that these linear weights are highly oversimplified, compared to physical brain synapses, does not seem to detract from the success of this method. These learning machines have been extremely successful but consume enormous computing resources, both in time and power. To alleviate this, analog computational techniques are being explored, for instance by using memristor devices [31,32] to replace the programmable synapses. Replacing digital with analog processing has the potential for reducing training time and energy consumption by orders of magnitude [9,32], and will form the basis of our treatment. The basic element of the BP model is the perceptron [33], which implements the linear operation H = WX + B, where W is the weight matrix, X the input vector, B a bias term, and H the (linear) output vector, followed by a nonlinear threshold-type operation Y = f(H), where f is the activation function. Each element y_j of the activated vector can be regarded as a decision made by the activation function based on the w_ij-weighted input vector. Each input/output plane of a perceptron is called a layer, and perceptrons can be cascaded to form a multi-layer system (see Fig. 3), where the output layer of the first becomes the input layer of the next. For our hardware model the perceptron can be mapped simply onto a device array, with a device at each crosspoint. For instance, for the array of Fig. 4 the (horizontal) address rows can represent the input neurons, while the (vertical) columns form the output neurons and the crosspoint devices the synapses. If all crosspoints are occupied the ANN is fully connected. In Fig. 5, from Ref. [9], we show how ANN arrays can be organized as part of a larger system. Even though superficially similar to a conventional computer chip, with sub-arrays linked by data busses, the effect is radically different from the standard von Neumann architecture since computation of the vector-matrix product of the weights, and storage of the weights themselves, is done locally in the same array. This removes the von Neumann bottleneck [21,34] (see Fig. 5a) where the operands of computations must be shuttled back and forth between the memory and the CPU. The algorithm is explained in Fig. 6, after LeCun, Bengio and Hinton [35]. The input signal is propagated forward from layer to layer, each time undergoing a vector-matrix operation, where the values of the weights are coefficients, followed by application of the activation function. The final output consists of a classification where each
component of the output vector gives the probability of a particular class. Thus, for example, the MNIST [36] task consists of identifying one of 10 handwritten digits represented by the 10 outputs. An error function is generated by comparing the actual with the expected output.4 Now, by backpropagating the error through the transpose of the weight arrays and applying the derivative of the activation function, it can be shown (Rumelhart [26]) that the derivative of the error with respect to each individual weight can be calculated:
∂E/∂w_kj = (∂E/∂y_l)(∂y_l/∂z_l)(∂z_l/∂y_k)(∂y_k/∂z_k)(∂z_k/∂w_kj),
where the example and notation are taken from Fig. 6. The chain of partial derivatives is obtained by back-propagation, as shown in Fig. 6b, where the derivatives of the form ∂y_k/∂z_k are the derivatives of the appropriate activation function and those of the form ∂z_k/∂y_j represent propagation backwards through the perceptron using the computed derivative δ_k = ∂E/∂z_k as the input. For a linear array, the derivative with respect to the weight is just ∂z_k/∂w_kj = y_j. Thus, using the method of gradient descent for minimization of the error, the ‘delta update rule’ [37] may be formulated as Δw_kj = −η δ_k y_j, with η the learning rate. Thus, for the multi-layer system one needs to retain the values δ_k of the partial derivatives computed during the backward pass, as well as the inputs y_j computed during the forward pass, in order to do the update. Using previously learned weights one could execute the classification algorithm using a single forward pass. This is known as inference and is a cheap and easy way to perform simple AI tasks; however, learning the weights requires a massive effort involving tens of thousands (or more) of forward, backward and update cycles to go through a single set of input patterns (an epoch), and to repeat this for hundreds of epochs. Even using a fast GPU, this takes many hours of computation. Neural networks using the BP algorithm can be used to implement different types of learning and pattern recognition. In addition to the generalized perceptron that we have discussed, more efficient architectures can serve different purposes. For instance, the Convolutional Neural Network (CNN), as described by LeCun et al. [36], optimizes weights in a large set of small kernels, which are convolved with an image to recognize specific features. On the other hand, the Recurrent Neural Network (RNN) [38,39] analyzes time-dependent or sequenced data such as sound patterns or correlated word patterns or phrases. All of these use the BP algorithm in various forms.
4 In Hinton’s implementation the error is represented by the cross-entropy, which is calculated by applying the SoftMax activation function to the output and subtracting this from the target function.
Fig. 5. (a) Illustration of von Neumann bottleneck. (b) Avoiding the bottleneck with neural network logic in memory array. After Merolla et al., [21].
Fig. 6. Back-propagation algorithm showing forward (a) and backward (b) passes. After LeCun, Bengio and Hinton [35].
Hochreiter and Schmidhuber [39] introduced the ‘long short-term memory’ (LSTM) concept for RNNs, to preserve the integrity of BP gradients for backpropagation over long sequences of data.
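To make the forward, backward and update passes described in this section concrete, the following minimal Python (NumPy) sketch steps a two-layer perceptron through one such cycle. It uses a simple squared-error cost and ReLU activations rather than the SoftMax/cross-entropy of footnote 4, and the layer sizes and learning rate are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, eta = 784, 256, 10, 0.01   # MNIST-like sizes, assumed learning rate

W1 = rng.normal(0, 0.05, (n_hid, n_in))        # first weight array w_ji
W2 = rng.normal(0, 0.05, (n_out, n_hid))       # second weight array w_kj

relu  = lambda h: np.maximum(h, 0.0)           # 'diode' activation
drelu = lambda h: (h > 0).astype(float)        # 'pass gate' derivative

x = rng.random(n_in)                           # one input example
t = np.zeros(n_out); t[3] = 1.0                # one-hot target (ground truth)

# forward pass: vector-matrix products followed by the activation function
h1 = W1 @ x;  y1 = relu(h1)
h2 = W2 @ y1; y2 = relu(h2)

# backward pass: propagate the error derivative through the transpose arrays
err    = y2 - t                                # dE/dy2 for a squared-error cost
delta2 = err * drelu(h2)                       # delta_k = dE/dh2
delta1 = (W2.T @ delta2) * drelu(h1)           # delta_j = dE/dh1

# delta-rule (outer product) updates: dW = -eta * delta * input
W2 -= eta * np.outer(delta2, y1)
W1 -= eta * np.outer(delta1, x)
```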
4. Analog array implementation of the BP algorithm
Drawing on the example of the brain, it was recognized early on that high accuracy was not required for the computations5; on the other hand, massive parallelism was required to achieve computational throughput. Thus, analog techniques seem the ideal medium for the massive computations required. In this section we will present an illustrative analog system to implement the BP algorithm and use it to illustrate the difficulties of implementing it in practice. In the following section we will present a more practical approach called the Resistive Processing Unit (RPU), after Gokmen and Vlasov [9]. The system is shown in Fig. 7.6 It is a three-layer system consisting of two resistive arrays, current amplifiers, activation function circuits, and an error calculator block.
5 In the digital domain the relaxed precision requirements are exploited by reduced-precision digital computation [30,40].
6 This system is significantly oversimplified, for the sake of illustration, leaving out offset and bias terms.
Here we are using the notation of Rumelhart [37]. The forward and backward passes, as described in Fig. 7, are translated into hardware form where the weight arrays are updatable resistor networks, or memristors. Vector-matrix operations are implemented by applying a voltage vector x_i to the rows and summing the currents x_i w_ij collected by the columns. The outputs h_j are passed through a hardware activation function f(h_j) to generate the input y_j to the next array. This is repeated for the next array, fed through another activation function, generating an output y_k. This is compared to the target vector t_k, resulting in an error δ_k. This error is now fed back through the derivative of the activation functions f′(h_k) and the transpose of the arrays. Note that the original outputs, h_k, h_j, must be retained in order to calculate the derivatives. Completion of the backward pass leaves the quantities y_j, δ_k and x_i, δ_j at the second and first arrays respectively, ready for the updates Δw_kj = −η δ_k y_j and Δw_ji = −η δ_j x_i. In addition to the natural vector-matrix operations one also must evaluate the activation functions and their derivatives. A particularly simple one is the ReLU function [41,17], which is simply a diode function, R(x) = x × (x ≥ 0), and its derivative is just a pass gate, R′(x) = (x ≥ 0). These can be implemented with simple analog circuits. The SoftMax activation function, needed for the cross-entropy evaluation [35], is much more difficult to compute and needs to be computed with high precision [40] (but may not be needed [42]).
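The sketch below illustrates, again only schematically, how such an array evaluates the vector-matrix products: weights are held as conductances g_ij, the forward pass applies the input as row voltages and sums the column currents (Kirchhoff), and the backward pass drives the columns and reads the rows, i.e. the transpose of the same physical array. The conductance range and voltages are assumed values, not device parameters from Ref. [9].

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_cols = 4, 3
g = rng.uniform(10e-9, 40e-9, (n_rows, n_cols))  # conductance matrix g_ij (siemens, assumed range)

v_in = np.array([0.2, -0.1, 0.05, 0.3])          # forward input applied as row voltages (V)
i_cols = g.T @ v_in                              # column current sums = forward vector-matrix product

delta = np.array([1e-3, -2e-3, 0.5e-3])          # backward errors applied to the columns (V)
i_rows = g @ delta                               # row current sums = transpose (backward) product

print(i_cols, i_rows)
```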
Fig. 7. Hypothetical three-layer system using hardware arrays and all-analog peripheral circuits. Note that the variables x_i, h_j, h_k, δ_j, δ_k must be stored in analog form.
Fig. 8. RPU architecture showing (a) RPU column with current integrator and analog to digital converter (ADC), (b) RPU array with peripheral circuitry and (c) RPU system. After Gokmen and Vlasov [9].
Another difficult requirement is to perform the point-by-point multiplication necessary for the update. Perhaps the most difficult requirement is to implement extensive analog memory, since the vectors h and δ must be retained in order to do the update. A subtler, yet important, requirement is the need to scale the data so that it fits within the dynamic range of the amplifiers, for both forward and backward passes. This would have to be done dynamically as the learning rate is adjusted. There are several scale factors that must be commensurate with each other: the dynamic range of the amplifiers, the activation temperature of activation functions such as the sigmoid and SoftMax, the variance of the data, etc. Some of this is adjusted automatically via the BP algorithm while the rest would have to be adjusted explicitly with the aid of gain control circuits. All these factors make an all-analog multilevel system rather impractical, but such a system is also not necessary, as we will discuss next.
5. Resistive processing unit (RPU)
The RPU system, as described by Gokmen and Vlasov [9], is a mixed analog-digital system that attempts to mitigate the difficulties of the all-analog system while still retaining its key advantage of parallelism. For an array of linear size N, the N² array operations of outer product and vector-matrix multiplication can be executed in O(1) time, saving O(N²) von Neumann-type memory-register operations. The functions which are difficult to implement in the analog domain (memory, scaling, function evaluation) occur mostly at the periphery of the arrays rather than in the arrays themselves. The peripheral calculations generally have O(N) rather than O(N²) complexity, so by transferring the O(N)
problematic functions to the digital domain, where they can easily be done, one can realize a more practical system combining the strengths of analog and digital systems while circumventing their limitations. Such a system is shown in Fig. 8, after Ref. [9], consisting of an array of RPU tiles, communication busses,7 and local arithmetic units for computing the activation functions etc. Current summing is done using analog integrators, which have the advantages of accepting data in pulse-width-modulated form and easy data scaling by varying the integration time. A single ADC is sufficient per array, since the ADC is fast enough to multiplex the O(N) integrator outputs. The arrays need not be square, but for more general use one could include square arrays and select a subset of the rows/columns for a specific purpose. The update scheme is shown in Fig. 9. An analog multiplication is called for at every cross-point. An elegant approach taken by Gokmen and Vlasov [9] is to use stochastic multiplication where (see Fig. 9a) each x_i and δ_j element is replaced by a short, binary, pulse train of duty factor equal to the element value. The pulse train pattern is generated using random numbers to ensure an unbiased randomness (over a large number of updates) so that the duty factor of the intersection (AND operation) of the two pulse trains approaches the product of the two input signals. Discrimination is needed between the fully selected case (both rows and columns activated) and the half-selected case (rows or columns only), where the device sees the full or half update voltages respectively (see Fig. 9b). This scheme works remarkably well, as shown in Fig. 9c, which is a simulation of the learning process, taken from Ref. [9]. Increasing the bit length per update from 1 to 10 enables one to approach floating point accuracy (all other factors being ideal).
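A schematic emulation of this stochastic update is given below: each x_i and δ_j is encoded as a random binary pulse train whose duty factor equals its value, coincidences (the AND of row and column pulses) increment the weight by Δg_min, and the average weight change approaches Δg_min·BL·x_i·δ_j. The bit length BL and Δg_min below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
BL, dg_min = 10, 1e-3        # bits per update, conductance change per coincidence (assumed)
x, delta = 0.6, 0.4          # element values to be multiplied (positive quadrant, 0..1)

trials = 20000
coincidences = np.zeros(trials)
for n in range(trials):
    row_pulses = rng.random(BL) < x        # stochastic pulse train encoding x_i
    col_pulses = rng.random(BL) < delta    # stochastic pulse train encoding delta_j
    coincidences[n] = np.sum(row_pulses & col_pulses)   # AND -> full-select events

# average weight change vs. the ideal product dg_min * BL * x * delta
print(dg_min * coincidences.mean(), dg_min * BL * x * delta)
```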
7 These busses only need to handle the O(N) type data.
Fig. 9. Stochastic update method showing the array topology (a), voltage waveforms on the array lines in relation to the switching threshold of the weight device (b), and results of RPU training with different bit lengths for the update (c). After Gokmen and Vlasov [9].
Table 1
Device requirements for the RPU application. After Gokmen et al. [9].

Parameter                          Value      Tolerance
Pulse duration                     1 ns
Operating voltage                  1 V
Max. device area                   0.04 μm²
Ave. device conductance            40 nS
Conductance on/off ratio           8
Conductance change, full select    170 pS     30%
Max. cond. change, half select     17 pS
Number of ‘states’                 1000       30%
Up/down symmetry                   1.05       2%
The multiplication only works for the positive quadrant and would need to be done twice to accommodate both positive and negative δ_j values (or 4× if the x_i are signed as well). With this approach Gokmen and Vlasov studied how various device imperfections affected the training accuracy (see Fig. 10 for a subset). Here we will discuss the four variables shown in the figure. The first is the weight increment, or inverse ‘number of states’. The stability of the BP algorithm depends on the ability to reduce the learning rate so that the average gradient points in the correct direction for a decrease of the loss function. This requires very small steps in the weight change, < 1% of the average weight value, as shown in Fig. 10a. A weight change of 0.1% was used for the other plots. The BP algorithm is rather tolerant of local device variations (Fig. 10b), which is good news for the prospects of using hardware devices. Likewise, the algorithm could withstand relatively high levels of stochastic noise (Fig. 10d), as long as the noise was unbiased in its update direction. This gives a picture of the descent process akin to a diffusive process with a directed drift down the path of steepest descent. Symmetry (Fig. 10c) is the most stringent requirement, since any asymmetry introduces a bias to the update, driving it off course. The other important requirement, related to symmetry, is the minimum distinguishable size of the weight update.
Fig. 10. Effect of device imperfections on RPU training. (a) Size of weight increment, (b) local device temporal and spatial variations (c) local device asymmetry, and (d) stochastic noise in peripheral circuits. Reproduced with minor changes from Gokmen and Vlasov [9].
This is often expressed as the ‘number of states’, or the total range of weight change divided by the minimum weight increment. Since the learning process is corrective and diffusive, a distinguishable weight increment may in fact be indistinguishable from noise on the time frame of a single update/read cycle. A further requirement is placed on the value of the weight resistance. Too high a value impacts the signal-to-noise ratio for read, and the speed (see discussion below), while too low a value increases the power dissipation and voltage drop along the array wires.8 These requirements are summarized in Table 1 and are valid at least for the MNIST case considered in Ref. [9]. An additional requirement is retention time. The BP algorithm is self-restorative, so that a small decay in the weight value may be corrected during the update process; however, the tolerance to this loss is quite small. Estimates are that the decay time constant must be > 10⁶× the BP cycle time [43,44]. The computational throughput of the RPU array increases as the square of its size, since all computations are done in parallel. The maximum size is limited by noise constraints (see below) to N ∼ 4000. In their sizing [9] Gokmen and Vlasov estimated the advantages of the analog/digital RPU compared with an all-digital GPU, obtaining acceleration factors > 1000× in some cases. Even though GPUs have improved since that publication, large factors in power efficiency and performance acceleration remain.
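The importance of the symmetry requirement can be illustrated with a short numerical sketch: a weight receiving an equal number of random up and down updates should only diffuse, but if the ‘up’ increment exceeds the ‘down’ increment by a factor (1 + α) the weight acquires a drift of roughly (n/2)·Δg·α after n updates. The numbers below are illustrative assumptions, not the tolerances of Table 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n_steps, dg, alpha = 100_000, 1e-3, 0.05               # 5% up/down asymmetry (assumed)

signs = rng.choice([+1, -1], size=n_steps)             # unbiased (zero-mean) update directions
steps = np.where(signs > 0, dg * (1 + alpha), -dg)     # asymmetric physical increments
w = np.cumsum(steps)                                   # weight trajectory

# net drift ~ (n/2) * dg * alpha, much larger than the diffusive spread dg * sqrt(n)
print(w[-1], 0.5 * n_steps * dg * alpha)
```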
6. Extension of RPU to other NN architectures
The example [9] we have discussed is a fully connected Deep Neural Network (DNN) which can be mapped efficiently onto the RPU array architecture. Gokmen et al. [45] have also mapped the RPU onto the CNN application by feeding in the image sequentially as a series of patches and convolving them with a set of kernels (filters) to be optimized.
8 These numbers may be considerably improved for non-linear resistors as discussed in the section on noise and variability.
The updates are done after each complete image, using digitally stored intermediate data. The reduced frequency of the updates leads to a more stringent non-volatility device requirement. In this application the digital rescaling of data was particularly important to reduce the effect of noise. The small size of the kernels (even the complete set) leads to smaller RPU arrays, hence lower throughput. This has been addressed by Rasch et al. [46], who replicated the kernels many times and were able to train them on random parts of the same image in parallel. The RPU concept was also applied to Recurrent Neural Networks [47]. The mapping from RPU to RNN is straightforward. This example tested the RPU concept for far larger networks than the original MNIST problem. It was found, for these larger networks, that the symmetry requirement was even more stringent than before (5% → 2%).
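A minimal sketch of the patch-based mapping used for CNNs (the im2col construction) is shown below: the image is unrolled into a matrix of flattened patches so that convolution with a set of kernels becomes an ordinary matrix product that an RPU array can evaluate one patch (row) at a time. The image and kernel sizes are illustrative assumptions, not those of Ref. [45].

```python
import numpy as np

rng = np.random.default_rng(4)
H = W = 8; k = 3; n_kernels = 5
image   = rng.random((H, W))
kernels = rng.random((n_kernels, k * k))      # each kernel flattened, one per array column

# unroll the image into flattened k x k patches (im2col)
patches = np.array([image[r:r + k, c:c + k].ravel()
                    for r in range(H - k + 1)
                    for c in range(W - k + 1)])          # shape (36, 9)

feature_maps = patches @ kernels.T            # one array read per patch row
print(feature_maps.shape)                     # (36, 5): a 6x6 output map per kernel
```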
Fig. 11. Tendency of physical devices to saturate when stressed repeatedly in the same direction.
7. Device candidates
We have discussed device requirements in the abstract, in the context of the RPU executing the BP algorithm, but which devices and device types meet these requirements? Overall, we are looking for a programmable resistor that can be programmed fast (ns), of small size and high resistance, having quasi-linear characteristics, a symmetric response to updates and a retention time of (at least) seconds. These are very different from the requirements for the fast, low-power logic devices that have driven the integrated circuit industry up to now, or memory devices that require large on/off ratios. The symmetry requirement is particularly problematic since very few physical processes can provide it. In most physical processes the state of the device tends to saturate at high/low values as the potentiation/depression stimulus time increases, as illustrated in Fig. 11, where the rate of state change depends on the value of the state, i.e. how far it is from its saturated value. It is always possible to operate in a range very small compared to the saturated value, but this throws away much-needed dynamic range. Of symmetric devices the capacitor charged via a current source, realizable in CMOS [43,44], is a prototypical example. The CMOS cell can incorporate logic functions as well as the current sources and a storage capacitor, as seen in Fig. 12, after Li et al. [44]. The small retention time, typically < 1 s, is a major problem. Burr et al. [32] give a thorough review of memristor-type devices potentially useful as AI weights. Here we will cover a few examples shown in Fig. 13, from Refs. [32,48,49]. Phase change memory (PCM, see Fig. 13a, b) is the most mature of the nonvolatile programmable resistor technologies and has been adapted for the RPU application [42,50,51]. The progressive phase change on set (crystallization) makes it possible to obtain a continuous range of stable resistance states; however, the reset (amorphization) process is abrupt, leading to nonsymmetric characteristics. Special symmetrizing tricks [50,51] must be employed, e.g. using unidirectional updates only, on a device pair. The oxide-based switch [48,52,53] (called RRAM for historical reasons) forms a resistive channel after breakdown (see Fig. 13c, d) which can be modulated with positive and negative (set/reset) pulses.
The device is small in size, has higher resistance and can be more symmetric than the PCM device; however, the ultra-fine filaments required to achieve the large resistances lead to excessive stochasticity and noise in the switching process. Non-filamentary RRAM devices have been made [32,54] but these have other drawbacks. The RRAM technology uses CMOS-compatible hafnium oxide, which has also been found to have ferroelectric properties when doped. The ferroelectric memory is also being explored as a synaptic element [55,56], but attaining symmetry and reliable read-write operation is far from assured. Intercalation devices [49,57], basically integrated batteries (see Fig. 13e, f), can store charge for long periods of time. By using this charge to modulate an adjacent electrode, a proportional weight conductance, which is symmetric in charge introduction/extraction, can be attained. Calibrated charge introduction needs extra circuitry within an array cell, similar to the CMOS cell [44], which is an area and technological penalty. Another issue is the maximum rate of charge/discharge, which determines the maximum update rate. Magnetic (MRAM) technology [58,59] is also being explored; domain wall motion [59] gives a possible symmetry mechanism. Overall the technology options are nascent, with only the PCM offering possible near-term solutions. A summary of the first five candidates is given in Table 2.
8. Noise and variability
The Achilles heel of analog systems is their susceptibility to degradation by the accumulation of noise. In contrast, for digital systems, standardization of the logic levels, well above the noise level, cuts off noise propagation. For the perceptron an analogous standardization occurs due to the non-linear activation function, but this threshold is much lower, so that, in a multilevel system, the noise can still be transmitted through the system in the form of stochastic update decisions; the backpropagation convergence then resembles a diffusive, ‘Brownian motion’ type of process rather than a deterministic trajectory. Gokmen and Vlasov [9] determined the influence of noise during forward, backward and update cycles. The algorithm was quite tolerant to local device variations, both spatial and temporal (Fig. 10b), accumulated during the update cycle, provided the symmetry of positive and negative updates was maintained. During the read cycle thermal and shot noise, generated in the devices and peripheral circuits, were considered. Comparing the forward and backward passes revealed that the noise tolerance was much reduced in the latter case, to a noise level of ∼6% of the activation temperature. Here we extend the noise investigation to considerations of array scaling and examine some noise mitigation strategies. The noise analysis is based on the circuit of Fig. 14. An integrator of integration time T, having a 2/T equivalent noise bandwidth,9 detects the signal as well as the noise generated in the weight conductances and in the integrating amplifier. The crux of the problem is that the system must detect a change in conductance approximately equivalent to a single weight conductance switching through its range. While the signal current is not attenuated at the input of the current-summing amplifier, the signal voltage is greatly attenuated. Referring to the network in Fig. 14a, the Thevenin equivalent voltage at the input of the amplifier (with the amplifier removed) is attenuated by a factor of ∼1/2N, since there are 2(N − 1) conductances connected to ground. Since N may be several thousand, the attenuation is very large, so that the signal becomes very susceptible to corruption by amplifier noise and by noise generated in the conductance network itself.
9 This relationship is valid for white noise as can be easily seen by integrating a series of stochastic pulses as in shot noise.
Fig. 12. (a) CMOS RPU cell. (b) Linear and symmetric weight programming with magnified region (c) showing individual states. After Li et al. [44].
Fig. 13. Some device options being tested for neuromorphic applications (a, c, e), with illustrations of their characteristic features (b, d, f). (a, b) Phase change switch, after Burr et al. [32]; (c, d) filamentary RRAM, after Burr et al. [32]; (e, f) solid state electrochemical RAM, after Bishop et al. [49].
The two components have different dependencies: the array signal-to-noise ratio scales as SNR_array ∝ √(g_w/N), while the amplifier-limited SNR_amp ∝ 1/N. Since the weight conductance decreases as 1/N² to satisfy array constraints on voltage drop and power, the resulting dependency is SNR_array ∝ 1/N^(3/2). So we see that the array contribution becomes dominant for large arrays. To handle the increased stochastic noise the read integration time must be increased as 1/SNR². Since throughput (when limited by read time) scales as N²/read time, the resulting throughput goes as 1/N and as a constant, respectively, for the two cases. This trend highlights the adverse scaling properties of analog computation in general.
Table 2
Suitability of selected device technologies for the RPU application.

                  Technology
Parameter         CMOS(b)   PCM(a,b)   RRAM   FERAM(a)   ECRAM
Density           x         √          √      ?          √(a,b)
Speed             √         √          √      ?          ?
Symmetry          √         x(a)       ?      ?(a)       √(a)
Conductance       H         H          √      L          √
Noise & drift     √(b)      √(b)       ?      ?          √(b)
Retention         x         √          √      ?          ?
Endurance         √         √          ?      ?          ?

L/H: available conductances are too low/high for an optimal RPU array.
(a) Symmetry can be achieved at the expense of cell size by adding components.
(b) Noise can be reduced by using larger devices.
Fortunately, the inherent difficulties in scaling analog systems are addressed by the RPU analog/digital hierarchy in Fig. 8, where an optimum tile size can be chosen to get the best signal/noise operating point. In addition to this, various tricks and stratagems (a common feature of analog designs) can mitigate the disadvantages. A simple trick is to add an isolating device10 in series with the weight conductance, as shown in Fig. 14b. These isolating devices exhibit small conductance at low applied voltages. Absent the isolating device, all devices contribute to noise during reading, whether powered up or not. Including the isolating devices, the idle rows and columns (typically > 80%), biased at 0 V, neither contribute to noise nor attenuate the signal. Fig. 15 shows the amplifier and array noise contributions as a function of array size at activities of 100% (no isolating devices) and 10%. The two cases of thermal and shot noise are considered. For the amplifier a noise spectral density of 30 nV/√Hz was used, higher than in Ref. [9] but consistent with the noise characteristics of CMOS devices of reasonable size. At a fixed clock speed, the throughput first increases as N² (limited by clock speed), then plateaus when limited by amplifier noise, and finally decreases as array noise kicks in. Note the strong benefit of using the isolating devices and the strong penalty when assuming shot noise.11 For the latter case an optimum array size of under 1000 × 1000 is obtained.
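The scaling argument of this section can be reproduced qualitatively with the short sketch below, which combines the clock-limited, amplifier-noise-limited and array-noise-limited regimes; the prefactors are chosen only to make the three regimes visible and are not the parameters behind Fig. 15.

```python
import numpy as np

N = np.logspace(1.5, 3.7, 200)            # array sizes from ~30 to ~5000
g_w = 1.0 / N**2                          # weight conductance forced down as 1/N^2 (voltage drop/power)

snr_amp   = 1.0 / N                       # amplifier-noise-limited SNR ~ 1/N
snr_array = 30.0 * np.sqrt(g_w / N)       # array-noise-limited SNR ~ 1/N^(3/2)
snr = np.minimum(snr_amp, snr_array)      # the worse of the two limits the read

t_read = np.maximum(1.0, 1e-5 / snr**2)   # read time: clock floor, then grows as 1/SNR^2
throughput = N**2 / t_read                # parallel operations per read

print(N[np.argmax(throughput)])           # array size where the throughput peaks/plateaus
```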
10
Also called a selector device [28]. Shot noise is obtained when carriers are injected over a barrier into a high field region. The ratio of shot noise to thermal noise, for the same voltage and current, is eV / kT where the symbols denote electronic charge, applied voltage, Boltzmann’s constant and absolute temperature respectively. 11
Fig. 14. (a) Circuit for estimating array and amplifier noise. (b) Addition of isolator devices (c) to eliminate noise from inactive (V = 0) lines.
Fig. 16. PCM-CMOS RPU cell showing (a) an array consisting of (b) unshared (c) and shared (d) cells. The unshared cells have two PCM/FET cells for long term storage of +/− weights and a CMOS pair for updating the linear, volatile, weight representing the less significant portion of the weight. The shared cells provide the negative component of the CMOS part. Reprinted with permission from Ambrogio, et al., [42]. Copyright (2016) Nature/Springer/Palgrave.
Fig. 15. (a) RPU throughput of a single RPU tile vs. array size as limited by clock speed and noise.
Not included in this discussion is 1/f (flicker) noise [60], i.e. noise whose power spectral density varies inversely with frequency. This noise source can be much larger than either thermal or shot noise, even more so in small devices since it scales inversely with device area. It is not fundamental but depends on the existence of traps or other metastable states in the device; however, it is ubiquitous in extant nanoscale devices. Flicker noise is present in the amplifiers as well as the array devices. In the former case it can be eliminated with a high sampling frequency and baseline subtraction techniques. These cannot be applied inside the arrays, where the weights are necessarily stored, although very slow noise transients, akin to drift, can be corrected if the learning rate is sufficiently high. A quantitative evaluation of flicker noise in RPU arrays is sorely needed but has not yet been done. It is not trivial due to the long-term correlations across switching events.
9. Living with imperfect devices – a case study
The IBM group headed by G.W. Burr has explored [32,42,50,51] the RPU concept based on the PCM device, culminating, to date, in the paper by Ambrogio et al. [42]. In this section we will highlight some stratagems used by them in overcoming the many challenges of analog RPU design constrained by device limitations. This RPU is based on a tiling concept, similar to that described in Fig. 8, with a tile size of about 512 × 512 cells. This size is dictated by the fairly high conductance (10–20 μS) of the PCM devices and is close to the optimum noise-limited performance (see Fig. 15).
As discussed above, the PCM device has inherently asymmetric characteristics. This difficulty is overcome by them and others by using set pulses only, in a pair of devices, and subtracting the two currents. There is still the need to reset the weights once they reach their limit. The resets are done rather infrequently (after ∼10,000 cycles) to reduce their cost. Ambrogio et al. use an additional strategy which greatly reduces the cost of the resets. They use PCM just for the most significant bits of the weight and a CMOS scheme for the least significant (see Fig. 16). Small weight changes over many cycles (8000) are accumulated in the CMOS weights and then transferred sequentially to the PCM devices. CMOS weights can be highly symmetric, but this depends on accurate balancing between the charge injection and extraction transistors, which can be marred by device-to-device variations. The fact that only the least significant bits are included greatly reduces the accuracy requirements of the CMOS device, allowing circuit simplification, but a further trick is needed to achieve the required symmetry. By inverting the sign convention for the weight change on alternate transfer cycles, the direction of asymmetry is inverted as well, so on average this cancels out. The success of this technique can be gauged from Fig. 17 and from the improvement in MNIST test accuracy from ∼92% to 98%, the latter being comparable with software. Great simplifications in the peripheral circuits were also explored and shown algorithmically (at least for fairly small-scale problems like MNIST) to give accuracy equivalent to the full treatment. For instance, the final SoftMax activation function was eliminated in favor of direct comparison of the conventionally activated output with the ‘ground truth’, i.e. labeled data.
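The polarity-inversion idea can be captured in a simplified numerical sketch: an asymmetric CMOS least-significant portion accumulates zero-mean updates and is transferred to the PCM portion at fixed intervals, with or without inverting the sign convention between transfers. This is only a schematic of the principle, not the circuit or algorithm of Ref. [42], and all numbers are assumptions.

```python
import numpy as np

n_updates, transfer_every, alpha = 80_000, 8_000, 0.08   # 8% up/down asymmetry (assumed)

def run(invert_polarity, seed=5):
    rng = np.random.default_rng(seed)            # same update sequence for both runs
    pcm, cmos, s = 0.0, 0.0, +1.0                # PCM (MSB) part, CMOS (LSB) part, sign convention
    for n in range(1, n_updates + 1):
        dw = rng.choice([+1.0, -1.0]) * 1e-3     # intended zero-mean weight update
        dc = s * dw                              # physical increment of the CMOS cell
        dc *= (1 + alpha) if dc > 0 else 1.0     # asymmetry: 'up' increments are too large
        cmos += dc
        if n % transfer_every == 0:              # transfer accumulated LSBs to PCM
            pcm += s * cmos
            cmos = 0.0
            if invert_polarity:
                s = -s                           # invert the sign convention for the next period
    return pcm + s * cmos                        # effective weight

print(run(False), run(True))   # large drift without inversion; near-zero drift with it
```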
Fig. 17. The effect of polarity inversion algorithm in constraining weight drift (a, d) are for purely nominal CMOS devices, (b, e) for devices with expected random variations, and (c, f) for the same devices but including the polarity inversion algorithm. Weights are transferred from the CMOS to the PCM devices after 8000 examples and this is when polarity inversion is applied. See [42] for details. Reprinted with permission from Ambrogio, et al., [42]. Copyright (2016) Nature/ Springer/Palgrave.
Furthermore, the activation function and its derivative were simplified into simple piece-wise linear functions [50,51] which could be implemented using digital voltage levels and analog voltage ramps, replacing more complex and area-intensive analog-to-digital conversion circuitry. The general strategy of using time-encoded signals and digital voltages, while still analog in spirit,12 is more compatible with digital CMOS design techniques than analog voltage approaches.
10. Discussion
Analog computation has been applied in the past where its benefits in terms of speed, power and complexity greatly exceeded the extant digital competition, for applications where limited precision was acceptable. These have included applications of important societal benefit, for example military, electrical power and space exploration. With exponential improvements in digital technology (Moore’s law) most of these applications have been superseded by digital solutions. Today digital technology is nearing the end of its scaling path and future improvements, especially in energy per function, are expected to be much less dramatic, so that a future analog solution might have a fundamental and long-lasting benefit over any digital alternative. In this arena the application of analog techniques to machine learning is especially significant, given the requirement of parallel computation on a huge scale without the need for high precision. As an example of this we have explored the application of analog techniques to the implementation of neuromorphic arrays, specifically the RPU using the back-propagation algorithm. A crucial requirement for such applications is a non-volatile analog memory element (the synaptic weight) to circumvent the von Neumann logic ↔ memory bottleneck. The requirements for this element were found to be quite stringent, however, especially symmetry, number of states, and temporal stability of the weights, needed for accurate convergence of the BP algorithm. These are in addition to the competitive requirements of small physical size, sufficiently high speed and low power consumption. As of now there are no candidates that fulfil all these requirements, but there is much ongoing research, in industry and academia, into a broad spectrum of promising candidates. A good case can be made that these requirements can be fulfilled by a combination of devices, as in Ref. [42].
Alternatively, a different approach may be tried, as in Ref. [61], where the weights were encoded digitally and updated using a digital counter at every cross-point, yet read out in the analog domain. One might argue that these approaches are very area intensive and therefore not competitive, but against this the RPU unit cell is very powerful and can be likened to a small processor at every cross-point. Beyond the RPU, we look again to the brain as an inspiration and call for the brain simulation machines to help elucidate the brain’s function. The brain is an existence proof of the processing power of an analog machine consisting of an enormous number of imperfect analog devices and is a constant spur to future research and innovation.
Acknowledgements I would like to acknowledge fruitful discussions with Wilfried Haensch, Malte Rasch, Mattia Rigotti, and Tayfun Gokmen and would like to thank the reviewer for a most helpful and illuminating review. References [1] Ulman Berndt. Analog computing ISBN 978-3-486-75518-3 Oldenbourg Wissenschaftsverlag; 2013. [2] Tsividis Yannis. Not Your Father’s Analog Computer. IEEE Spectrum 2017. [3] Lundberg KH. The history of analog computing: introduction to the special section. IEEE Control Syst Mag 2005;25(3):22–5. https://doi.org/10.1109/MCS.2005. 1432595. [4] Freeth T, Bitsakis Y, Moussa X, Seiradakis JH, Tselikas A, Mangou H, et al. Decoding the ancient Greek astronomical calculator known as the Antikythera Mechanism. Nature 2006;444:587–91. [5] Special report: “50 years of Moore’s Law”. IEEE Spectrum 2015. [6] Theis N, Solomon PM. In quest of the next switch: prospects for greatly reduced power dissipation in a successor to the silicon field-effect transistor. Proc IEEE 2010;98:2005–14. [7] Kandala A, Mezzacapo A, Temme K, Takita M, Brink M, Chow JM, et al. Hardwareefficient variational quantum eigensolver for small molecules and quantum magnets. Nature 2017;549:242–6. [8] Daniel R, Rubens JR, Sarpeshkar R, Lu TK. Synthetic analog computation in living cells. Nature 2013;497(7451):619–23. https://doi.org/10.1038/nature12148. [9] Gokmen T, Vlasov Y. Acceleration of deep neural network training with resistive cross-point devices: design considerations. Front Neurosci 2016;10:333. [10] McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 1943;5:115–33. [11] Shepherd Gordon M. Neurobiology. N.Y.: Oxford University Press; 1994. [12] Hertz John, Krogh Anders, Palmer Richard G. Introduction to the theory of neural computation. Addison Wesley Publishing Corp.; 1993. ISBN 0-201-51560-1. [13] Markram H, Lubke J, Frotscher M, Sakmann B. Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 1997;10:213–5. [14] Bi G-Q, Poo M-M. Synaptic modifications in cultured hippocampal neurons:
12 In practice the pulse width is generated by ∼6 bit decoding of a digital value, so this is actually a low resolution digital to analog conversion.
dependence on spike timing, synaptic strength, and postsynaptic cell type. J Neurosci 1998;18:10464–72.
[15] Sjöström J, Gerstner W. Spike-timing dependent plasticity. Scholarpedia 2010;5:1362.
[16] Markram H, Gerstner W, Sjöström PJ. Spike-timing-dependent plasticity: a comprehensive overview. Front Synaptic Neurosci 2012;4.
[17] Stromatias E, Neil D, Pfeiffer M, Galluppi F, Furber SB, Liu S-C. Robustness of spiking Deep Belief Networks to noise and reduced bit precision of neuro-inspired hardware platforms. Front Neurosci 2015;9, art. 222.
[18] Neftci EO, Pedroni BU, Joshi S, Al-Shedivat M, Cauwenberghs G. Stochastic synapses enable efficient brain-inspired learning machines. Front Neurosci 2017;11, art. 324.
[19] Diehl PU, Cook M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Front Neurosci 2015;9, art. 99.
[20] Kheradpisheh SR, Ganjtabesh M, Thorpe SJ, Masquelier T. STDP-based spiking deep convolutional neural networks for object recognition. Neural Networks 2018;99:56–67.
[21] Merolla PA, Arthur JV, Alvarez-Icaza R, Cassidy AS, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science 2014;345:668–72.
[22] Knight J, Voelker AR, Mundy A, Eliasmith C, Furber S. Efficient SpiNNaker simulation of a heteroassociative memory using the neural engineering framework. IJCNN 2016:5210–7.
[23] Neckar A, Fok S, Benjamin BV, Stewart TC, et al. Braindrop: a mixed-signal neuromorphic architecture with a dynamical systems-based programming model. Proc IEEE 2019;107:144–54.
[24] Schemmel J, Kriener L, Muller P, Meier K. An accelerated analog neuromorphic hardware system emulating NMDA- and calcium-based non-linear dendrites. arXiv:1703.07286v1 [cs.NE]; 2017.
[25] Davies M, Srinivasa N, Lin T-H, Chinya G, et al. Loihi: a neuromorphic manycore processor with on-chip learning. IEEE Micro 2018:84–96.
[26] Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature 1986;323:533–6.
[27] Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM 2017;60:84–90.
[28] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. IEEE CVPR 2016. https://doi.org/10.1109/CVPR.2016.90.
[29] Xiong W, Droppo J, Huang X, Seide F, Seltzer M, Stolcke A, et al. Towards human parity in conversational speech recognition. IEEE/ACM Trans Audio, Speech, Lang Processing. https://doi.org/10.1109/TASLP.2017.2756440.
[30] Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, et al. Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144v2 [cs.CL]; 2016.
[31] Cruz-Albrecht JM, Derosier T, Srinivasa N. A scalable neural chip with synaptic electronics using CMOS integrated memristors. Nanotechnology 2013;24:384011.
[32] Burr GW, Shelby RM, Sebastian A, Kim S, Kim S, Sidler S, et al. Neuromorphic computing using non-volatile memory. Adv Phys X 2017;2:89–124.
[33] Rosenblatt F. The perceptron – a perceiving and recognizing automaton. Tech. Rep. 85-460-1, Cornell Aeronautical Laboratory; 1957.
[34] Backus J. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Commun ACM 1978;21:613–41.
[35] LeCun Y, Bengio Y, Hinton G. Deep learning (review). Nature 2015;521:426–44.
[36] LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE 1998;86:2278–324. https://doi.org/10.1109/5.726791.
[37] Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature 1986;323:533–6.
[38] Karpathy A, Johnson J, Fei-Fei L. Visualizing and understanding recurrent networks. arXiv:1506.02078v2 [cs.LG]; 2015.
[39] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput 1997;9:1735–80.
[40] Wang N, Choi J, Brand D, Chen C, Gopalakrishnan K. Training deep neural networks with 8-bit floating point numbers. NIPS Proceedings; 2018.
[41] Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. Proc. 27th International Conference on Machine Learning; 2010.
[42] Ambrogio S, Narayanan P, Tsai H, Shelby RM, et al. Equivalent-accuracy accelerated neural network training using analog memory. Nature 2018;558:60–7.
[43] Kim S, Gokmen T, Lee H-M, Haensch WE. Analog CMOS-based resistive processing unit for deep neural network training. arXiv:1706.06620 [cs.ET].
[44] Li Y, Kim S, Solomon PM, Gokmen T, Sun X. Capacitor-based cross-point array for analog neural network with record symmetry and linearity. Symposium on VLSI Technology, Honolulu, Hawaii; 2018.
[45] Gokmen T, Onen M, Haensch WE. Training deep convolutional neural networks with resistive cross-point devices. Front Neurosci 2017;11:538.
[46] Rasch MJ, Gokmen T, Rigotti M, Haensch W. Efficient ConvNets for analog arrays. arXiv:1807.01356v1 [cs.ET]; 2018.
[47] Gokmen T, Rasch M, Haensch W. Training LSTM networks with resistive cross-point devices. Front Neurosci 2018. https://doi.org/10.3389/fnins.2018.00745.
[48] Larentis S, Nardi F, Balatti S, Gilmer DC, Ielmini D. Resistive switching by voltage-driven ion migration in bipolar RRAM—Part II: modeling. IEEE Trans Electron Dev 2012;59:2468–75.
[49] Bishop DM, Solomon P, Kim S, Tang J, Tersoff J, Todorov T, et al. Time-resolved conductance in electrochemical systems for neuromorphic computing. SSDM conference, late news; 2018.
[50] Burr GW, Shelby RM, di Nolfo C, Jang JW, et al. Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element. IEEE Trans Electron Dev 2015;62:3498–507.
[51] Narayanan P, Fumarola A, Sanches L, Lewis S, Hosokawa K, Shelby RM, et al. Towards on-chip acceleration of the backpropagation algorithm using non-volatile memory. IBM J Res Dev 2017;61:1–11.
[52] Wong HSP, Lee HY, Yu SM, Chen YS, Wu Y, Chen PS, et al. Metal–oxide RRAM. Proc IEEE 2012;100:1951–70. https://doi.org/10.1109/JPROC.2012.2190369.
[53] Gong N, Idé T, Kim S, Boybat I, Sebastian A, Narayanan V, et al. Signal and noise extraction from analog memory elements for neuromorphic computing. Nat Commun 2018;9. https://doi.org/10.1038/s41467-018-04485-1.
[54] Govoreanu B, Crotti D, Subhechha S, Zhang L, Chen YY, Clima S, et al. a-VMCO: a novel forming-free, self-rectifying, analog memory cell with low-current operation, nonfilamentary switching and excellent variability. Symposium on VLSI Technology Digest of Technical Papers; 2015.
[55] Kaneko Y, Nishitani Y, Ueda M. Ferroelectric artificial synapses for recognition of a multishaded image. IEEE Trans Electron Dev 2014;61:2827–33.
[56] Jerry M, Dutta S, Kazemi A, Ni K, Zhang J, Chen P-Y, et al. A ferroelectric field effect transistor based synaptic weight cell. J Phys D 51, no. 43.
[57] Fuller EJ, Gabaly FE, Léonard F, Agarwal S, Plimpton SJ, Jacobs-Gedrim RB, et al. Li-ion synaptic transistor for low power analog computing. Adv Mater 2017;29:1604310.
[58] Vincent AF, Larroque J, Locatelli N, Ben Romdhane N, Bichler O, Gamrat C, et al. Spin-transfer torque magnetic memory as a stochastic memristive synapse for neuromorphic systems. IEEE Trans Biomed Circ Syst 2015;7:166–74.
[59] Dutta S, Siddiqui SA, Büttner F, Liu L, Ross CA, Baldo MA. A logic-in-memory design with 3-terminal magnetic tunnel junction function evaluators for convolutional neural networks. 2017 IEEE/ACM Int. Symp. Nanoscale Arch., Newport, RI; 2017. p. 83–8. https://doi.org/10.1109/NANOARCH.2017.8053724.
[60] Nemirovsky Y, Brouk I, Jakobson CG. 1/f noise in CMOS transistors for analog applications. IEEE Trans Electron Dev 2001;48:921–7. https://doi.org/10.1109/16.918240.
[61] Chen Q, Zheng Q, Ling W. Implementation of artificial neural network using counter for weight storage. Proc. 20th NAFIPS Int. Conf.; 2001. https://doi.org/10.1109/NAFIPS.2001.944747.
Paul Michael Solomon has been a research staff member at the IBM T.J. Watson Research Center from 1975 to the present. He obtained a BSc in electrical engineering from the University of Cape Town in 1968, and a PhD degree from the Technion in Haifa, Israel, in 1974. At IBM his interests have been in the field of high-speed semiconductor devices. He has contributed to the theory for scaling bipolar transistors to very small dimensions and has developed methodologies to compare the performance of high-speed semiconductor devices. The design and evaluation of high-speed semiconductor logic devices has been a continuing topic, ranging from self-aligned bipolar transistors through novel heterostructure field-effect and CMOS transistors and, recently, to various nanodevice concepts. He has contributed to the physics of transport in semiconductors, being the first to measure Coulomb drag in field-effect transistors, and has contributed to our understanding of band-to-band tunneling. He has taught at Stanford University and Columbia University on the physics of high-speed devices. With the shift of emphasis toward neuromorphic applications he has contributed to the IBM neuromorphic effort in searching for viable devices to serve as synaptic weights in hardware realizations of artificial neural networks. He is a fellow of the IEEE and the American Physical Society.