Probabilistic rounding in neural network learning with limited precision


Neurocomputing 4 (1992) 291-299 Elsevier


Markus Höhfeld (a) and Scott E. Fahlman (b)
(a) Siemens AG, Corporate R&D, Munich, Germany
(b) Carnegie Mellon University, Pittsburgh, PA, USA
Received 29 November 1991; revised 21 August 1992

Abstract

Höhfeld, M. and S.E. Fahlman, Probabilistic rounding in neural network learning with limited precision, Neurocomputing 4 (1992) 291-299.

A key question in the design of specialized hardware for the simulation of neural networks is whether fixed-point arithmetic of limited precision can be used with existing learning algorithms. Several studies of the backpropagation algorithm report a collapse of learning ability at around 12 to 16 bits of precision, depending on the details of the problem. In this paper, we investigate the effects of limited precision in the Cascade Correlation learning algorithm. As a general result, we introduce techniques for dynamic rescaling and probabilistic rounding that facilitate learning by gradient descent down to 6 bits of precision.

Keywords. Backpropagation; cascade correlation learning; limited precision learning; probabilistic rounding.

1. Review of Cascade Correlation

Cascade Correlation (CC) is an incremental supervised learning algorithm for feedforward neural networks. It starts with a minimal network consisting only of input and output layers. New hidden units (candidates) are added one by one. They receive weighted connections from the network's inputs and from all previously added hidden units. In a first training phase, the output of the candidate is not yet connected to the net, and we train the incoming weights to maximize a correlation measure. Then we freeze the incoming weights and connect the candidate to all output units. In the second training phase, we re-train all weights going to the output units, including those of the new unit. We will now consider the arithmetic operations that occur during CC learning and that are affected by precision constraints. Output units are trained to minimize the familiar sum-squared error measure


E = \frac{1}{2} \sum_{o,p} (y_{op} - t_{op})^2 ,    (1)

where y_{op} is the output of unit o for pattern p, and t_{op} is the desired or target output. E is minimized by gradient descent using

e_{op} = (y_{op} - t_{op}) f'_{op} ,    (2)

and

\frac{\partial E}{\partial w_{oi}} = \sum_p e_{op} I_{ip} ,    (3)

where f' is the derivative of the sigmoid activation function, I_{ip} is the value of input (or hidden unit) i for pattern p, and w_{oi} is the weight connecting input i to output unit o. Candidate units are trained to maximize C, the correlation between their own output value y and the residual errors e_o still observed at the outputs. C is defined as

C = \sum_o \left| \sum_p (y_p - \langle y \rangle)(e_{op} - \langle e_o \rangle) \right| ,    (4)

where ⟨y⟩ and ⟨e_o⟩ are the averages of y and e_o over all patterns. The maximization of C proceeds by gradient ascent using

\delta_p = \sum_o \sigma_o (e_{op} - \langle e_o \rangle) f'_p    (5)

and

\frac{\partial C}{\partial w_i} = \sum_p \delta_p I_{ip} ,    (6)

where σ_o is the sign of the correlation between the candidate unit's value and the residual error at output o. Abbreviating the slopes ∂E/∂w and ∂C/∂w by S, the weight change Δw is computed by the quickprop rule [2]:

\Delta w_t = \begin{cases} \dfrac{S(t)}{S(t-1) - S(t)} \, \Delta w_{t-1} , & \text{if } \Delta w_{t-1} \neq 0 \text{ and } \dfrac{S(t)}{S(t-1) - S(t)} < \mu \\ \varepsilon\, S(t) , & \text{if } \Delta w_{t-1} = 0 \\ \mu\, \Delta w_{t-1} , & \text{otherwise.} \end{cases}    (7)

Here, ε is the learning rate known from backpropagation and μ is a constant which limits the maximum step size. Weights are then updated using

w(t+1) = w(t) + \Delta w(t) .    (8)
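For concreteness, the quickprop step in (7)-(8) can be written as a small vectorized routine. The sketch below is our own illustration, not code from the paper; the default values of eps and mu and the array-based handling are assumptions, and the sign convention of S follows the formulas above.

```python
import numpy as np

def quickprop_step(w, slope, prev_slope, prev_dw, eps=0.35, mu=1.75):
    """One quickprop update following Eqs. (7)-(8), applied element-wise.

    `slope` is S(t), `prev_slope` is S(t-1), `prev_dw` is the previous
    weight change; eps and mu are illustrative values only."""
    denom = prev_slope - slope
    # secant ("quadratic") factor S(t) / (S(t-1) - S(t))
    ratio = np.divide(slope, denom,
                      out=np.full_like(slope, np.inf, dtype=float),
                      where=denom != 0.0)
    dw = np.where(prev_dw == 0.0,
                  eps * slope,                    # plain step if dw(t-1) = 0
                  np.where(ratio < mu,
                           ratio * prev_dw,       # secant step
                           mu * prev_dw))         # growth limited to mu
    return w + dw, dw                             # Eq. (8)
```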

2. Overall experimental design

Fixed point representation requires the definition of a range of numbers. All representable numbers are then distributed uniformly over this range. Instead of making a different choice for errors, slopes and weights, we uniformly use n bits plus sign to represent numbers in [-1, 1 - 2^-n]. We account for larger ranges by the use of common scaling factors. For example, instead of using weights w in [-8, 8), we use weights w' in the range [-1, 1), with scaling factor A = 8. The 'true' value of a weight is then obtained by w = Aw'. Internally, our simulations use floating point arithmetic, and the effects of limited precision are modelled by the precision function P(x). P clips out-of-range values to the limit values -1 and 1 - 2^-n. Numbers within these limits are rounded to the nearest multiple of 2^-n, thereby losing the information of the low-order bits (quantization). We refer to these two error sources as precision errors. We investigated the effects of limited weight precision, limited sigmoid precision and limited precision in the Δw-computations (weight-update precision), both individually and in combination.

Simulations were performed on three common benchmark problems. The simplest problem is the 6-bit parity problem. The second benchmark is the 'two-spirals' problem [7], which has already been studied intensively with Cascade-Correlation [3]. The third one is a sonar signal classification task, mines vs. rocks [4]. The net is trained on 104 patterns, each with 60 analog inputs. All simulation results given in this paper are averages over five runs with different random initial weights. A superscript notes the number of successful runs if this is different from five. For comparison, we include results for floating point learning with both infinite (float) and restricted (float-1) range. In addition to the number of hidden units built, we report the average number of training epochs required. Because the same set of parameters was used across all experiments to ensure comparability, the learning times reported here could still be optimized further.
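As an illustration of this precision model, P can be written in a few lines of Python. This is a sketch of our reading of the text, not the authors' simulator; n counts the bits excluding the sign.

```python
import numpy as np

def precision(x, n):
    """Precision function P(x): clip to the representable range
    [-1, 1 - 2**-n] and round to the nearest multiple of 2**-n."""
    step = 2.0 ** -n
    x = np.clip(x, -1.0, 1.0 - step)
    return np.round(x / step) * step
```

For example, with n = 8 the quantization step is 1/256, so precision(0.3141, 8) yields 0.3125 and precision(1.7, 8) saturates at 1 - 1/256.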

3. Effects of limited weight range

Using floating point arithmetic, we found that the range requirements of CC differ widely between applications: CC could only derive minimal nets with a weight range of at least ±128 for the sonar task, ±64 for the spirals task and ±8 for the parity task. For tighter weight ranges, performance decreased rapidly. Since it is difficult to decide in advance on an appropriate weight range for a given problem, we use a block floating point format for each unit's weight vector: weights are restricted to [-1, 1) and are automatically re-scaled during learning to minimize precision errors. The gain A of the sigmoid function serves as the scaling factor, which is adjusted by

A_{new} = \begin{cases} 2\, A_{old} , & \text{if } \langle w \rangle > 0.5 \\ A_{old} / 2 , & \text{if } \langle w \rangle < 0.25 \end{cases} \qquad w_{new} = \frac{A_{old}}{A_{new}}\, w_{old} ,

whenever the mean ⟨w⟩ over the absolute weight values grows larger than 0.5 or shrinks below 0.25. The average is taken only over the significant weights with |w| > 0.1. A different gain A is used for each unit. On all tasks, learning with a variable gain and floating point weights in [-1, 1] performed as well as learning with unbounded weights. However, ε was scaled inversely to the gain A to prevent excessively large steps at high gains. The fixed-point simulations described in the remainder of this paper all employ variable gain terms. Similarly, the error terms e_op and the slopes ∂E/∂w and ∂C/∂w were scaled by variable global factors to minimize precision errors.
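The following sketch illustrates this rescaling rule under our reading of the text; the function and argument names are our own, and the handling of the learning rate follows the remark about scaling ε inversely to A.

```python
import numpy as np

def rescale_unit(weights, gain, eps, lo=0.25, hi=0.5, significant=0.1):
    """Adjust the per-unit gain A so that the mean absolute value of the
    significant weights (|w| > 0.1) stays within [0.25, 0.5].  Weights are
    rescaled so that A * w is unchanged, and eps is scaled inversely to A."""
    sig = np.abs(weights) > significant
    if not np.any(sig):
        return weights, gain, eps
    mean_w = np.mean(np.abs(weights[sig]))
    if mean_w > hi:
        factor = 2.0          # A_new = 2 * A_old, w_new = w_old / 2
    elif mean_w < lo:
        factor = 0.5          # A_new = A_old / 2, w_new = 2 * w_old
    else:
        return weights, gain, eps
    return weights / factor, gain * factor, eps / factor
```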


4. Effects of limited precision

To investigate limited weight precision, we changed the weight update formula (8) to

w(t+1) = P(w(t) + \Delta w(t)) ,    (9)

i.e. weights are always quantized and saturate at ±1. Floating point precision was used for the computation of Δw and the sigmoid function. Table 1 shows the results for our three benchmarks for various choices of the word length n for weights.

Table 1. Number of hidden units and learning epochs (units / epochs) for three selected tasks with limited weight precision.

            Float        Float-1      15           14           12           10
6-Parity    6.0 / 1039   4.4 / 785    5.8 / 1035   4.8 / 899    4.8 / 842    fail
Sonar       1.6 / 1044   1.0 / 837    1.2 / 879    1.2 / 901    1.8 / 1058   fail
Spirals     12.4 / 3702  13.0 / 3826  12.6 / 3520  13.0 / 3752  fail         fail

We see that learning is affected very little by a reduction in weight precision until, at a certain value of n, we get a sudden failure. This value of n is problem dependent: it is 12 for two-spirals and 10 for sonar and parity. Similar results were reported for backpropagation on other problems [1, 5, 6]. The cause is that below the critical value of n, most of the weight updates are below the minimum step-size and are thus rounded to 0. All areas of the error surface with a gradient smaller than the quantization level divided by the learning rate effectively appear as plateaus.

To investigate the effects of a limited precision sigmoid function, the activation function is implemented by

f(x) = P\left( \frac{1}{1 + e^{-Ax}} - 0.5 \right) .    (10)
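A self-contained sketch of this quantized activation follows; the gain argument corresponds to the per-unit gain A of Section 3, and the inlined quantization repeats the precision function of Section 2.

```python
import numpy as np

def quantized_sigmoid(x, n, gain=1.0):
    """Eq. (10): sigmoid with gain A, shifted to (-0.5, 0.5) and quantized
    to n bits plus sign by clipping to [-1, 1 - 2**-n] and rounding to
    multiples of 2**-n."""
    step = 2.0 ** -n
    y = 1.0 / (1.0 + np.exp(-gain * x)) - 0.5
    return np.round(np.clip(y, -1.0, 1.0 - step) / step) * step
```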

On all three benchmarks, we found that training is not affected by limited sigmoid precision down to n = 6. CC also converges for n = 4, but one or two additional units are installed.¹

¹ Similar results could be obtained for the hyperbolic tangent with range [-1, 1], but learning parameters had to be tuned for this case.

Next, we studied in isolation the effects of limited precision in the computation of weight update terms. As these computations involve a sequence of operations, many of the intermediate values have to be stored temporarily, leading to a variety of precision errors. We changed formulae (1)-(7) such that all multiplications only involve limited precision operands. In multiply-accumulate operations, all 2n bits of the multiplication results were used, and accumulators are assumed to be large enough not to saturate. Only the final accumulation results are rounded. As expected, results were qualitatively similar to the case of limited weight precision: below the critical precision of 10-12 bits, most of the Δw's were rounded to 0 and learning failed. This shows that the accuracy of internal computations is of minor importance; what is important is that the final result Δw is different from 0 and that this update is not lost by truncation errors when added to w.
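The multiply-accumulate scheme described here can be pictured with scaled integers. This sketch reflects our reading of the text (the names and the integer modelling are assumptions): operands carry n bits plus sign, full products are accumulated without rounding, and only the final sum is rounded and clipped back to n bits.

```python
def mac_limited_precision(xs, ys, n):
    """Dot product with n-bit operands: quantize the operands, keep the
    full 2n-bit products in a wide accumulator, round only the final sum."""
    step = 2 ** -n
    xq = [round(x / step) for x in xs]   # operands as integer multiples of 2**-n
    yq = [round(y / step) for y in ys]
    acc = 0                              # wide accumulator, never rounded
    for xi, yi in zip(xq, yq):
        acc += xi * yi                   # full-width product
    result = acc * step * step           # back to the real-valued scale
    # round and clip only the final accumulation result
    return max(-1.0, min(1.0 - step, round(result / step) * step))
```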

5. Probabilistic weight updates

The straightforward implementation of limited precision fails abruptly for word widths of less than 12 bits, because most of the Δw are too small to push the weight w to the next legal value. One way around this problem is to always update a weight by at least the minimum step, but - with our benchmarks - this led to oscillations and training failed to converge. Another possibility is to use probabilistic weight updates: whenever a proposed update Δw falls below the quantization level, we take the minimum step with some probability p that is proportional to the size of Δw. This exploits the gradient information of very small Δw's at least in a statistical sense. From an implementation point of view, this is achieved by probabilistic rounding of multiplication results, defined by

P_p(x) = \begin{cases} P(x) , & \text{if } |x| \ge 2^{-n} \\ \operatorname{sgn}(x)\, 2^{-n} \ \text{with probability } p, \quad 0 \ \text{with probability } 1 - p , & \text{if } |x| < 2^{-n} \end{cases}    (11)

where the probability p equals 2^n |x|. Probabilistic rounding does not help in a longer sequence of multiplications, because once an intermediate result is truncated to 0, the final result will be 0 - no matter how large the other multiplicands are. Therefore, probabilistic rounding is only used in the last multiplication. Intermediate results are rounded up rigorously to the quantization level by sign-preserving rounding:

P_s(x) = \begin{cases} P(x) , & \text{if } |x| \ge 2^{-n} \\ \operatorname{sgn}(x)\, 2^{-n} , & \text{if } |x| < 2^{-n} . \end{cases}    (12)
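Taken together, (11) and (12) can be transcribed almost directly; the sketch below is illustrative only (the random number source is our own choice).

```python
import numpy as np

rng = np.random.default_rng()

def quantize(x, n):
    """Ordinary precision function P(x)."""
    step = 2.0 ** -n
    return float(np.clip(np.round(x / step) * step, -1.0, 1.0 - step))

def round_probabilistic(x, n):
    """Eq. (11): below the quantization level, round to sgn(x)*2**-n with
    probability p = 2**n * |x| and to 0 otherwise."""
    step = 2.0 ** -n
    if abs(x) >= step:
        return quantize(x, n)
    return float(np.sign(x)) * step if rng.random() < abs(x) / step else 0.0

def round_sign_preserving(x, n):
    """Eq. (12): below the quantization level, always round away from zero
    to sgn(x)*2**-n so that intermediate results never vanish."""
    step = 2.0 ** -n
    return quantize(x, n) if abs(x) >= step else float(np.sign(x)) * step
```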

6. Results using probabilistic updates

These probabilistic weight updates gave significantly better results for low precision. However, we still observed many wasted epochs in which no weight was updated because all of the Δw's were small. We therefore added the constant 1/fan-in to the probability, to make it likely that at least one weight of each unit is changed in each epoch. This resulted in much faster learning and did not cause severe oscillations. For comparison, Table 2 presents the results for limited weight precision using probabilistic rounding in (9). The performance for floating point (float) and restricted range floating point (float-1) corresponds to Table 1, as there is no rounding at 'infinite' precision. The results for 12 bits show a considerable improvement over the simulations with ordinary rounding: for the sonar task, the number of hidden units decreases from 1.8 to 1.0, and for the spirals task, training with 12 bits now converges as well as with unlimited precision, whereas it failed in Table 1.
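One plausible reading of this modification, combining (9) and (11) with the 1/fan-in term, is sketched below. The exact way the constant enters the probability is not spelled out in the text, so the `min(1, ...)` expression is an assumption of ours.

```python
import numpy as np

rng = np.random.default_rng()

def update_weight(w, dw, n, fan_in):
    """Probabilistic weight update per Eq. (9): when |dw| falls below the
    quantization level, take the minimum step with probability roughly
    2**n * |dw| + 1/fan_in (the additive 1/fan_in term is our reading of
    the description above, not a formula given in the paper)."""
    step = 2.0 ** -n
    if abs(dw) < step:
        p = min(1.0, abs(dw) / step + 1.0 / fan_in)
        dw = float(np.sign(dw)) * step if rng.random() < p else 0.0
    return float(np.clip(np.round((w + dw) / step) * step, -1.0, 1.0 - step))
```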


Table 2. Number of hidden units and learning epochs (units / epochs) for three selected tasks with limited weight precision and probabilistic rounding.

            Float        Float-1      12           10           8            6            4
6-Parity    6.0 / 1039   4.4 / 785    4.8 / 828    5.0 / 909    4.6 / 836    6.2 / 1299   6.8 / 1327
Sonar       1.6 / 1044   1.0 / 837    1.0 / 778    1.6 / 1060   2.0 / 1239   3.0 / 1721   5.4 / 2027
Spirals     12.4 / 3702  13.0 / 3826  13.0 / 3827  12.2 / 3411  13.8 / 3888  15.0 / 4464  20.7³ / 5122

For precision restricted from 10 bits down to 6, the performance of CC degrades gracefully. More hidden units are installed and learning time increases, but learning converges consistently even for 6 bits of weight precision. For only 4 bits of precision, the number of hidden units rises sharply, because - with a large gain A - a probabilistic decision can easily throw a unit from one extreme output to the other.

In a final experiment, we restricted weight and sigmoid precision simultaneously. Furthermore, we used limited precision in the computation of the weight-update terms Δw. Remember that multiplications only involve limited precision operands, but that multiplication results are used with double precision in subsequent accumulations. We used sign-preserving rounding P_s for the slopes ∂E/∂w_oi and ∂C/∂w_i, and probabilistic rounding P_p for Δw, while e, δ, E and C were rounded as usual by P. The results are shown in Table 3. The performance for 12 bits is comparable to the floating point implementation. In the range from 10 to 6 bits, graceful degradation can be observed and CC adds only a few additional hidden units.

Table 3. Number of hidden units and learning epochs (units / epochs) with limited weight precision, limited sigmoid precision and limited precision in arithmetic operations. Probabilistic rounding was used for Δw and sign-preserving rounding in slope computations.

            Float        Float-1      12           10           8            6            4
6-Parity    4.4 / 858    4.4 / 837    4.6 / 856    5.8 / 1056   5.4 / 1038   4.8 / 1072   10.8⁴ / 1561
Sonar       1.6 / 987    1.2 / 906    1.0 / 859    1.8 / 1147   2.0 / 1263   2.8 / 1511   9.4 / 1980
Spirals     13.8 / 4248  12.2 / 3662  11.8 / 3308  12.4 / 3509  14.2 / 4067  19.4 / 4476  fail

To summarize, the use of probabilistic rounding significantly reduces the precision requirements. At 12 bits of precision (plus sign), performance is equivalent to a 32-bit floating point implementation. Precision can be reduced to approximately 6 bits plus sign with, at worst, a moderate performance decrease.

Probabilistic updates become increasingly important as the precision is lowered. Table 4 shows that 90% of the weight updates are probabilistic for 8-bit precision and that this ratio increases to around 95% for 6 bits. In the latter case nearly all weight updates are by the same step 2^-n, and the weight-update computation merely provides the sign. This kind of learning was termed 'coarse weight updates' in [8], which reports satisfactory results using coarse updates with backpropagation.

Table 4. Percentage of weight updates smaller than the quantization level 2^-n for simultaneous precision constraints. These updates were rounded probabilistically to 0 or ±2^-n.

            12     10     8      6      4
6-Parity    59%    72%    84%    92%    98%
Sonar       71%    88%    95%    97%    99%
Spirals     67%    79%    89%    94%    96%

7. Discussion

7.1 Probabilistic rounding with error backpropagation

An important question is whether the techniques of variable gain terms and probabilistic weight updates can be used outside the CC framework. Recall that in CC, we train only a single layer of weights at a time and never have to propagate error information backward through connections. When approximate, rounded error values are backpropagated and combined, we might expect some new problems to appear. To demonstrate that the ideas carry over to multilayer training procedures and fixed-topology networks, we used quickprop [2] to train a 6-4-1 net with shortcut connections on the parity task. We augmented the quickprop update rule with the same scaling and rounding techniques as in the CC simulations. Weight, sigmoid and weight-update precision were all restricted at once.

Once again, no learning occurred with less than 12 bits of precision unless probabilistic rounding was employed. Table 5 shows the results of ten trials with probabilistic rounding. About 70% to 80% of the trials converged, regardless of precision, even as it was reduced to 4 bits. Learning is slower for fixed-point arithmetic than for floating point, but we see only a slight increase in learning time as precision is reduced from 12 to 4 bits. These results suggest that variable gains and probabilistic rounding are useful in general, and that they do not depend on peculiarities of the CC architecture. The interaction of limited-precision errors as they are backpropagated through the network does not appear to cause problems.

Table 5. Training a fixed-topology 6-4-1 net with shortcut connections using quickprop with probabilistic rounding and variable gain terms. We show the number of successful trials out of 10 and the average number of learning epochs for the successful trials.

6-Parity       Float    Float-1   12       10       8        6        4
Cvgt. trials   9        8         7        8        7        7        8
Avg. epochs    402.2    330.0     787.1    676.2    594.3    834.3    897.5

7.2 Quickprop vs. backprop

CC internally uses the quickprop update rule in both output and candidate training. From the point of view of hardware implementation, the quickprop rule is problematic: the computation (7) of Δw is expensive, both in terms of arithmetic operations and storage requirements. We therefore tried to use simple gradient descent instead of quickprop in the CC training phases. For 6 bits of precision, we noticed a less than twofold increase in learning time for the two simpler tasks, the sonar data and 6-parity, but no convergence could be obtained for the difficult two-spirals task. This result suggests that second order methods in gradient descent are not only useful to speed up convergence: their improved gradient information becomes increasingly important for coarse weight updates in limited precision implementations.

8. Conclusions

These studies suggest the following conclusions for limited precision implementations of gradient descent learning:

• Limited precision of the sigmoid function does not affect learning performance down to 6 bits of precision.
• Limited weight range is handled well, and at relatively low cost, by the use of variable gain terms.
• Limited precision in weights has little effect until some critical limit is reached, at which point there is an abrupt failure of learning. This effect is observed in CC, in quickprop and in standard backpropagation. The failure occurs because most weight-update steps are effectively zero.
• In the three problems we studied, this catastrophic failure occurs at 12-14 bits of precision, plus sign.
• Probabilistic rounding of weight update terms provides an effective solution for this problem. We observed only a gradual performance degradation as precision is reduced to 6 bits.
• Without probabilistic rounding, a machine with 16-bit integer arithmetic and storage would suffice for the problems studied, but it would be uncomfortably close to the minimum required precision. With probabilistic rounding, such a machine would achieve the same performance as floating-point arithmetic.


References

[1] T. Baker and D. Hammerstrom, Modifications to artificial neural network models for digital hardware implementation, Tech. Rep. CS/E 88-035, Oregon Graduate Center, 1988.
[2] S.E. Fahlman, Faster-learning variations on backpropagation: An empirical study, in: D.S. Touretzky et al., eds., Proc. 1988 Connectionist Models Summer School (Morgan Kaufmann, Los Altos, CA, 1988).
[3] S.E. Fahlman and C. Lebiere, The Cascade-Correlation learning architecture, in: D.S. Touretzky, ed., Advances in Neural Information Processing Systems (Morgan Kaufmann, Los Altos, CA, 1990).
[4] R.P. Gorman and T.J. Sejnowski, Analysis of hidden units in a layered network trained to classify sonar targets, Neural Networks 1 (1988) 75-89.
[5] P.W. Hollis, J.S. Harper and J.J. Paulos, The effects of precision constraints in a backpropagation learning network, Neural Computation 2 (1990) 363-373.
[6] J.L. Holt and J.N. Hwang, Finite precision error analysis of neural network hardware implementations, Tech. Report FF-10, University of Washington, Seattle, 1991.
[7] K.J. Lang and M.J. Witbrock, Learning to tell two spirals apart, in: D.S. Touretzky et al., eds., Proc. 1988 Connectionist Models Summer School (Morgan Kaufmann, Los Altos, CA, 1988).
[8] P.A. Shoemaker, M.J. Carlin and R.L. Shimabukuro, Back-propagation learning with coarse quantization of weight updates, in: Proc. IJCNN-90, Washington, DC (1990).

Markus Höhfeld received his degree in computer science from the University of Kaiserslautern, Germany, in 1988. Since 1989, he has been working at Siemens AG, Corporate Research and Development, Munich. In the context of a two-year R&D trainee program, he worked on software as well as hardware issues of neural network implementations. At present, his research interests include recurrent neural network learning algorithms and neural network applications in control.


Scott E. Fahlman received the Ph.D. from the Massachusetts Institute of Technology in 1977 for his work on the NETL system, a knowledge representation system built upon massively parallel hardware. His B.S. and M.S. degrees are also from MIT. He is a Senior Research Scientist in the School of Computer Science at Carnegie Mellon University, Pittsburgh, PA, where he has been on the faculty since 1978. His research interests include artificial intelligence and artificial neural networks. He has been active in the development of new learning algorithms and architectures for neural nets, the use of high-performance parallel machines for neural-net simulation, and the application of this new technology to real-world problems. He also leads the CMU Common Lisp project, and has played a major role in the definition of the Common Lisp language.