Physics Letters A 374 (2010) 4859–4863
Noise-robust realization of Turing-complete cellular automata by using neural networks with pattern representation

Makito Oku a,∗, Kazuyuki Aihara b,a

a Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
b Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan
Article history: Received 31 May 2010; Received in revised form 12 October 2010; Accepted 12 October 2010; Available online 15 October 2010. Communicated by C.R. Doering
Abstract
A modularly structured neural network model is considered. Each module, which we call a ‘cell’, consists of two parts: a Hopfield neural network model and a multilayered perceptron. An array of such cells is used to simulate the Rule 110 cellular automaton with high accuracy even when all the units of the neural networks are replaced by stochastic binary ones. We also find that noise not only degrades but also facilitates computation if the outputs of the multilayered perceptrons are below the threshold required to update the states of the cells, which is a stochastic resonance in computation. © 2010 Elsevier B.V. All rights reserved.
1. Introduction
What is the extent of the potential computational ability of our brain? Is there any difference between the brain and computers with respect to computational complexity? Some partial answers to these fundamental questions have been obtained. The computational power of a binary-state neural network has been shown to correspond to that of finite state automata [1,2]. This implies that any real machine, which has finitely many configurations, can be simulated by a corresponding neural network with a sufficient number of units. Some neural network models are even equivalent to Turing machines (TMs) [3] if they can realize an infinite number of configurations [4]. Such models have either an infinite number of units or a special type of units with an infinite number of states (analog-state units with infinite precision [5,6], for example). A variant of the former type is the infinite family of neural networks [7], each of which has a different size and corresponds to a different input length. If we use any of these neural network models, every recursive function, that is, every function computable by a TM, can be computed in principle.

All the abovementioned constructions assume deterministic systems without any noise. However, as von Neumann pointed out [8], the brain is composed of unreliable biological elements, and how the brain as a whole performs accurate computation in a noisy environment is a matter of great interest. Accordingly, we address the question of whether it is possible to construct noisy neural network models that are equivalent to TMs.
* Corresponding author. E-mail address: [email protected] (M. Oku).
Although the mechanism and origin of noise in the real brain are likely to be very complicated, thermal, physical, chemical, and electrical fluctuations would all make neuronal activity noisy. We model this stochastic nature of real neurons with stochastic binary neurons. In this Letter, we show that the computational ability of a stochastic binary-state neural network with a certain amount of redundancy, as well as stability in its dynamics, is comparable to that of a universal TM, by giving a concrete construction method.

The basic idea is to use the Hopfield neural network model [9] as a means of noise-robust representation of symbols. The Hopfield network is a fully connected recurrent neural network with symmetric connections. Multiple bit patterns can be stored in the network, and the storage capacity increases in proportion to the network size. Let us store the two patterns s_A, s_B ∈ {1, −1}^n (n = 100), shown in Fig. 1, in the network [10]. Each pattern has exactly equal numbers of 1 and −1, and the two patterns are orthogonal (s_A · s_B = 0). These two patterns correspond to local minima of the computational energy, and starting from a neighborhood of either pattern, the system converges to the corresponding one in a finite number of steps.

This type of information representation is called pattern representation: a spatial pattern of activity across multiple units, rather than the activity of a single unit, represents information. The concept originates from the so-called “cell assembly hypothesis” [11]. The advantages of pattern representation are that the number of representable patterns (up to 2^n) is large compared to the number n of units, and that the representation is highly redundant, which increases its reliability. In our case, however, the former advantage is not directly exploited, since the number of patterns that a Hopfield network of size n can store is considerably less than n.
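As a minimal illustration of this storage and pattern-completion property, the following sketch stores two stand-in patterns (orthogonal and balanced in +1 and −1, like those in Fig. 1, although the concrete pixel layouts are not reproduced) in a Hopfield network and recovers one of them from a perturbed state. The construction and function names are ours, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Stand-in stored patterns: s_A is a random balanced {+1, -1} vector; s_B equals
# s_A on half of the units and is flipped on the other half, chosen so that both
# patterns stay balanced and s_A . s_B = 0.
half = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
s_A = rng.permutation(half)
flip = np.concatenate([rng.choice(np.flatnonzero(s_A == 1), n // 4, replace=False),
                       rng.choice(np.flatnonzero(s_A == -1), n // 4, replace=False)])
s_B = s_A.copy()
s_B[flip] *= -1.0

# Hebbian (unnormalized autocorrelation) weights storing both patterns.
W = np.outer(s_A, s_A) + np.outer(s_B, s_B)

def recall(x, steps=10):
    """Deterministic synchronous updates with threshold 0 (sign taken as +1 at 0)."""
    for _ in range(steps):
        x = np.where(W @ x >= 0, 1.0, -1.0)
    return x

# Flip 10 units of s_A and check pattern completion.
x0 = s_A.copy()
x0[rng.choice(n, 10, replace=False)] *= -1.0
print(np.array_equal(recall(x0), s_A))   # expected: True for small perturbations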
Fig. 1. Two stored patterns s_A and s_B (a) and schematic of encoding of a bit string by pattern representation (b). Each pattern is displayed in the form of a 10 by 10 matrix. The black and white small boxes indicate 1 and −1, respectively.
If the two patterns are used to code 1 and 0, a bit string can be represented by an array of Hopfield networks (see Fig. 1). We call each network a ‘cell’, a definition that is extended later in this Letter. Each cell can recover from a certain amount of perturbation of its state because of the pattern completion property inherent in the Hopfield network. If all the units are replaced by stochastic elements, each cell then being a Boltzmann machine [12] without hidden units, the state of each cell fluctuates around one of the four ground states: the two stored patterns and their two reversed patterns. Notice that pattern representation appears here again in a different manner: the array of Hopfield networks or Boltzmann machines itself performs pattern representation on a larger scale, and the first advantage mentioned above applies to this case, since the number of configurations that can be represented is very large compared to the number of cells.

Is it then possible to regard the array directly as the tape of a TM? This simple idea has a critical problem: the mobile head cannot be constructed in the framework of neural networks. Instead, we transform the computation of a TM into that of a Turing-complete cellular automaton. The term “Turing-complete system” means that the system can compute every recursive function; in other words, it can simulate a universal TM.

2. Model

How shall we realize the interaction among cells in the framework of neural networks? We add a multilayered perceptron [13] to each cell, because it is suitable for representing highly nonlinear interaction rules among cells, and we redefine the combined module as a ‘cell’, as shown in Fig. 2. Following the convention for artificial neural networks, we refer to the two parts as the feedforward neural network (FNN) and the recurrent neural network (RNN) henceforth. The FNN nonlinearly transforms the inputs from neighboring cells and transmits its output to the RNN in the same cell, where it serves as a bias. The output of the RNN is determined by three factors: the feedback from itself, the bias from the FNN, and the temperature T. For simplicity, let us first consider the interaction between the bias and the feedback in the limit T → 0. Let n denote the number of units in the RNN; x ∈ {1, −1}^n, the output of the RNN; the n × n matrix W, the feedback connections; z, the output of the FNN's output layer; and V, the weight matrix of connections from the FNN to the RNN. If the threshold value of every RNN unit is 0, the internal states y ∈ R^n of the RNN units are given by
y = W x + V z.   (1)
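The following sketch illustrates the T → 0 interaction described by Eq. (1): the bias V z = n(λ_A s_A + λ_B s_B) from the FNN competes with the Hopfield feedback, and a sufficiently strong bias toward s_B (λ_B > 1, as discussed below) switches the cell from s_A to s_B. The compact stand-in patterns and the parameter values are illustrative, not taken from the authors' implementation.

```python
import numpy as np

n = 100
# Compact stand-ins for the stored patterns: orthogonal, balanced in +1 and -1.
s_A = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
s_B = np.concatenate([np.ones(n // 4), -np.ones(n // 4)] * 2)
W = np.outer(s_A, s_A) + np.outer(s_B, s_B)

def cell_step(x, bias):
    """One deterministic (T -> 0) RNN update based on Eq. (1): y = W x + V z."""
    y = W @ x + bias
    return np.where(y >= 0, 1.0, -1.0)

# Bias toward s_B; lam_B just above 1 overcomes the feedback that holds s_A.
lam_A, lam_B = 0.0, 1.1
bias = n * (lam_A * s_A + lam_B * s_B)       # V z = n(lam_A s_A + lam_B s_B)
x = s_A.copy()
for _ in range(5):
    x = cell_step(x, bias)
print(np.array_equal(x, s_B))                # expected: True (jump occurs once lam_B exceeds 1)
```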
Fig. 2. Schematic of a cell consisting of a feedforward neural network and a recurrent neural network. See text for definition of variables.
Here, W is the unnormalized autocorrelation matrix of the two patterns, W = s_A s_A^T + s_B s_B^T, and the bias is assumed to be a linear combination of the two stored patterns, V z = n(λ_A s_A + λ_B s_B). For simplicity, a dimension-reduction technique is adopted. Owing to the orthogonality of the two patterns, the units can be sorted so that s_A and s_B are identical in the first half and opposite in the second half; each half contains exactly the same numbers of 1 and −1. For an arbitrary bit string x of length n, the numbers of units inverted with respect to s_A in the first and second halves are denoted by nv and nu, respectively. If we consider only the proportions u and v, neglecting the order of the units, any x can be identified with the pair (u, v). Note that this technique does not work for the binary coding of 0 and 1 except when λ_A = λ_B = 0. By using x^T s_A = n(1 − 2u − 2v) and x^T s_B = 2n(u − v), the computational energy of the RNN is calculated as follows:
E = −(1/2) x^T W x − (V z)^T x   (2)
  = −n^2 [4(u − 1/4)^2 + 4(v − 1/4)^2 + λ_A (1 − 2u − 2v) + 2λ_B (u − v)].   (3)
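As a check of this reduction, the sketch below compares the energy computed directly from Eq. (2) with the closed form in (u, v) of Eq. (3), for a random state near s_A and using the same kind of stand-in patterns as in the sketches above; the bias strengths are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
s_A = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
s_B = np.concatenate([np.ones(n // 4), -np.ones(n // 4)] * 2)
W = np.outer(s_A, s_A) + np.outer(s_B, s_B)

lam_A, lam_B = 0.4, 0.2                       # illustrative bias strengths
Vz = n * (lam_A * s_A + lam_B * s_B)

x = s_A * rng.choice([1.0, -1.0], n, p=[0.8, 0.2])   # random state near s_A
agree = (s_A == s_B)                          # the 'first half' after sorting the units
v = np.sum(x[agree] != s_A[agree]) / n        # fraction of flips where s_A and s_B agree
u = np.sum(x[~agree] != s_A[~agree]) / n      # fraction of flips where they differ

E_direct = -0.5 * x @ W @ x - Vz @ x          # Eq. (2)
E_uv = -n**2 * (4 * (u - 0.25)**2 + 4 * (v - 0.25)**2
                + lam_A * (1 - 2*u - 2*v) + 2 * lam_B * (u - v))   # Eq. (3)
print(np.isclose(E_direct, E_uv))             # expected: True
```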
Fig. 3 shows some examples of the energy landscape of the RNN. If there is no bias, i.e., λ_A = λ_B = 0, the four corners (u, v) = (0, 0), (1/2, 0), (0, 1/2), and (1/2, 1/2) are local minima with the value E = −n^2/2, corresponding to the two stored patterns and their reverse patterns. The energy takes its maximum value 0 at (1/4, 1/4).
Fig. 3. Energy landscape examples. The bias strength is (λ_A, λ_B) = (0, 0), (0.4, 0), and (0, 0.4) for (a), (b), and (c), respectively. s_A* and s_B* denote the reverse patterns of s_A and s_B.

Table 1. Update rule of the Rule 110.
Condition:   111  110  101  100  011  010  001  000
Next state:   0    1    1    0    1    1    1    0
If a bias toward one of the stored patterns is applied, the energy of the biased pattern decreases, and its basin of attraction enlarges. The peak moves to ((1 + λ_A − λ_B)/4, (1 + λ_A + λ_B)/4), where the peak value of the energy is n^2(λ_A^2 + λ_B^2)/2. A sufficiently strong bias even destabilizes the other states. Fig. 4 shows the parameter regions in which each pattern is stable. Note that the system shows hysteresis. For example, if the initial state is s_A and λ_B is continuously increased with λ_A fixed to 0, a sudden jump to s_B occurs at λ_B = 1. Along the reverse pathway, however, no change of state occurs at λ_B = 1. If T > 0, state transitions over the energy gap become possible. In this case, a bias weaker than the feedback can change the state of the RNN; however, spontaneous transitions to unbiased states also become possible.

3. Simulation results

We implement the Rule 110 cellular automaton, which is one of the elementary cellular automata [14]. An elementary cellular automaton is a one-dimensional cellular automaton in which the update of each cell depends on its own state and the states of its two neighboring cells, which together have 2^3 = 8 possible configurations. Assigning either 0 or 1 to each configuration yields 2^8 = 256 rules in total, and the 110th rule (see Table 1) has been proven to be Turing-complete [15,16].

The FNN is trained to reproduce the nonlinear map F : {1, −1}^{3n} → {1, −1}^n from the concatenation of three patterns, each of which is s_A or s_B, to either s_A or s_B according to the update rule of the Rule 110, by using the standard back-propagation algorithm [17]. During the training phase, analog-state units with the activation function f(y) = 2/(1 + exp(−y)) − 1 are used. For generalization, the inputs are distorted by random flips with probability 0.1 in each learning step, while the complete pattern is presented as the teaching signal. We use 10^5 training steps with a constant learning rate of 0.01 and 10 hidden units. The number of units in the output layer of the FNN is 100, and V = (nλ)I, where I is the identity matrix; λ = 1 corresponds to the threshold at which the bias from the FNN is balanced with the feedback in the RNN.
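A sketch of how such training pairs might be generated is given below: the Rule 110 lookup of Table 1 maps a triple of bits to the next bit, the input is the concatenation of the three corresponding patterns with random flips applied, and the clean target pattern is the teaching signal. The patterns are the same stand-ins as above, the helper name is illustrative, and the back-propagation training of the 10-hidden-unit FNN itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
s_A = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])          # codes bit 1
s_B = np.concatenate([np.ones(n // 4), -np.ones(n // 4)] * 2)      # codes bit 0

# Rule 110 lookup: (left, center, right) -> next state (cf. Table 1).
RULE_110 = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
            (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

def training_pair(bits, flip_prob=0.1):
    """Distorted concatenated input patterns and the clean target pattern."""
    x = np.concatenate([s_A if b else s_B for b in bits])
    x = x * rng.choice([1.0, -1.0], 3 * n, p=[1.0 - flip_prob, flip_prob])
    target = s_A if RULE_110[bits] else s_B
    return x, target

x, t = training_pair((1, 0, 1))
print(x.shape, t.shape)   # (300,) (100,): FNN input and teaching signal
```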
Fig. 4. Stable regions for the four patterns. The set of patterns corresponding to the local minima of the energy is shown in each region.
After the learning, all units are replaced with stochastic binary-state (1 or −1) ones. The probability of a unit i being active at the next iteration is p_i = 1/(1 + exp(−β y_i)), where y_i is the internal state of the unit and β = 1/T is the inverse temperature, which controls the noise level. The update scheme is synchronous, and all FNNs share the same weight coefficients. We also introduce a parameter τ that controls the relative time constant of the FNN with respect to the RNN: if τ = 1, the FNN and RNN units are updated at the same rate, whereas if τ > 1, the RNN units update their outputs τ times more frequently than the FNN units. We refer to the time interval in which the FNN units update their outputs once as a “step” henceforth.

Fig. 5 shows an example run of the array of neural network modules. The initial state, a spaceship (a particular sequence 1001111 of seven cells) surrounded by the background pattern 00010011011111 of 14 cells, is encoded by the pattern representation; namely, 0 is translated to s_B and 1 to s_A. The boundary condition is periodic. Although the output of each unit is sometimes erroneous, the output of each cell is quite accurate in the sense that the activity pattern of the RNN output is closest to the correct pattern among the four ground states. To show this more concretely, we plot in Fig. 5(c) a binarized image based on the value of u of the RNN output in each cell with the threshold 1/4; it exactly matches the correct output of the Rule 110. Furthermore, the value of v is less than 1/4 in every cell. On the other hand, a larger noise level, which corresponds to a smaller β, can introduce errors in the computation.
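The stochastic update described above can be sketched as follows. Here the FNN output is abstracted as a fixed subthreshold bias toward one stored pattern, and the RNN units are updated synchronously τ times with activation probability p_i = 1/(1 + exp(−β y_i)); this is a simplified stand-alone illustration of one “step”, not the full array simulation, and the particular parameter values are only indicative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
s_A = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])
s_B = np.concatenate([np.ones(n // 4), -np.ones(n // 4)] * 2)
W = np.outer(s_A, s_A) + np.outer(s_B, s_B)

def stochastic_rnn_step(x, bias, beta):
    """Synchronous stochastic update: unit i is +1 with probability 1/(1+exp(-beta*y_i))."""
    y = W @ x + bias
    p = 1.0 / (1.0 + np.exp(-beta * y))
    return np.where(rng.random(n) < p, 1.0, -1.0)

# One 'step': the FNN output (abstracted here as a fixed subthreshold bias toward s_B)
# is held constant while the RNN units are updated tau times.
beta, lam, tau = 0.6, 0.95, 10
bias = n * lam * s_B
x = s_A.copy()
for _ in range(tau):
    x = stochastic_rnn_step(x, bias, beta)
# With a suitable beta the noise can carry the state over the energy barrier to s_B,
# whereas in the deterministic limit (beta -> infinity) no transition would occur.
print(np.mean(x == s_B))   # fraction of units matching s_B after one step
```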
Fig. 5. (a) An example run of the implementation of the Rule 110 cellular automaton. Each row corresponds to the system's state at each time step. Time evolution is from top to bottom. β = 0.6, λ = 1.1, and τ = 1. (b) An enlarged view of the same run. (c) A binarized image based on the value of u, i.e., black corresponds to u ≤ 1/4 and white to u > 1/4.
Fig. 6. Example runs with different noise levels. λ = 0.95 and τ = 10.
If λ < 1 and τ = 1, the computation hardly progresses in general. However, if λ is just below the threshold value (λ = 0.99, for example) and an appropriate level of noise is added, some of the cells are updated correctly. This type of noise-induced update is observed much more robustly if τ is increased. Fig. 6 shows example runs in such situations. We observe quite accurate computation of the Rule 110 with an appropriate level of noise, as shown in Fig. 6(b), whereas too strong noise disturbs the order of the cellular pattern, as shown in Fig. 6(a). On the other hand, the pattern of the Rule 110 does not evolve correctly with much weaker noise, as shown in Fig. 6(c). These results suggest that there exists a certain optimal noise level.

To show this more concretely, we estimate the error rate per step in the following way. We define a step to be correct if the thresholded values of u and v match the correct state for all cells: u < 1/4 and v < 1/4 corresponds to 1, and u > 1/4 and v < 1/4 to 0. Otherwise, we regard the step as erroneous. Once an error occurs, subsequent steps are also erroneous with high probability; therefore, the number of steps before the first error is a meaningful quantity. Fig. 7(a) shows its distribution, which is well approximated by the geometric distribution p(k) = (1 − p)^k p. This implies that the error rate per step is almost constant and independent of the number of steps. Accordingly, we estimate the error rate by the maximum likelihood method as p = (1 + k̄)^{−1}, where k̄ is the sample mean.
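A minimal sketch of this estimate, applied to synthetic first-error times for illustration only (the values reported in Fig. 7 come from 500 simulation runs per noise level), is:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic 'steps before the first error' drawn from a geometric law for illustration.
true_p = 0.02
k = rng.geometric(true_p, size=500) - 1        # numpy counts trials; we count error-free steps

# Maximum-likelihood estimate of the per-step error rate under p(k) = (1 - p)^k * p.
p_hat = 1.0 / (1.0 + k.mean())
print(round(p_hat, 4))                         # should be close to true_p
```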
Fig. 7(b) shows the estimated error rate against the inverse temperature. As the figure clearly shows, there exists an optimal noise level; in this case the best β is around 0.7, although the optimum differs in other settings. In the vicinity of the optimal noise level, the error rate increases exponentially with β.

4. Discussion

In this work, we have presented a concrete construction method for a Turing-complete neural network model composed only of stochastic binary-state units. By numerical simulations, we have confirmed that the proposed model can simulate the Rule 110 with high accuracy under a certain amount of noise, as shown in Fig. 5. The noise robustness of the model would come from three factors: redundancy, stability, and modularity.
Fig. 7. (a) Distribution of the number of steps before the first error over 500 trials. The dashed curve shows the geometric distribution fitted by maximum likelihood estimation. β = 0.6. (b) The error rate per step against the inverse temperature. Each point shows the value estimated from 500 trials. In both figures, λ = 0.95 and τ = 10.
First, n units collectively represent a binary symbol by pattern representation in each cell. This redundancy would compensate for the uncertainty in the output of each unit. Second, the two patterns stored in each Hopfield network are dynamically stable. This property would help each cell recover from perturbations. Third, the modularity of the proposed model would prevent an error in a cell from immediately propagating to others; such errors can be corrected by the RNNs before the next update of the FNNs. In fact, not only the real brain but also many other biological systems have such mechanisms for robustness against noise [18].

On the other hand, although noise is generally supposed to be harmful, in particular situations a probabilistic model can outperform its deterministic counterpart. In our study, the subthreshold region of λ corresponds to such a situation: a certain level of noise helps the network's state overcome the energy barrier, as shown in Fig. 6, whereas no state transition of cells occurs in the deterministic system corresponding to β → ∞. This phenomenon is an example of stochastic resonance (SR) [19]. SR is a phenomenon in which a subthreshold signal and additive noise interact to increase the signal-to-noise ratio. In our case, the signal is the bias from the FNN, and the threshold corresponds to the point at which the bias exceeds the feedback of the RNN. The interaction between the two projections leads to multistability and hysteresis, as shown in Fig. 4, both of which are tightly related to SR. The characteristic U-shaped curve of SR is also observed, as shown in Fig. 7(b).

5. Conclusion

This study has indirectly shown that the computational ability of a stochastic binary-state neural network is comparable to that of a universal TM. In particular, the originality of this work lies in the use of the pattern representation scheme. Strictly speaking, the proposed model should be compared not with a deterministic TM but with a probabilistic TM, which is a type of nondeterministic TM without branching.
It would be valuable if we could analytically estimate the sufficient number of units per RNN and the optimal noise level when the required numbers of cells and steps for a computation, as well as the acceptable error rate, are given. Unfortunately, this is not an easy task, because the way the FNNs respond to different noise levels is highly nonlinear.

Acknowledgements

The authors would like to thank the anonymous reviewers for their fruitful comments and suggestions. This research was partially supported by the Japan Society for the Promotion of Science through a Grant-in-Aid for JSPS Fellows (21 937) and by the FIRST Aihara Innovative Mathematical Modelling Project.

References

[1] W.S. McCulloch, W. Pitts, Bulletin of Mathematical Biology 5 (1943) 115.
[2] M.L. Minsky, Computation: Finite and Infinite Machines, Prentice Hall, Englewood Cliffs, NJ, 1967.
[3] A. Turing, Proceedings of the London Mathematical Society 42 (1936) 230.
[4] R. Hartley, H. Szu, in: Proceedings of the IEEE First International Conference on Neural Networks, 1987, pp. 15–22.
[5] H.T. Siegelmann, E.D. Sontag, Journal of Computer and System Sciences 50 (1995) 132.
[6] K. Tanaka, D. Hasegawa, Systems and Computers in Japan 32 (2001) 42.
[7] P. Orponen, Neural Comput. 8 (1996) 403.
[8] J. von Neumann, in: C.E. Shannon, J. McCarthy (Eds.), Automata Studies, Annals of Mathematics Studies, vol. 34, Princeton University Press, 1956, pp. 43–98.
[9] J.J. Hopfield, Proc. Natl. Acad. Sci. USA 79 (1982) 2554.
[10] M. Adachi, K. Aihara, Neural Netw. 10 (1997) 83.
[11] D.O. Hebb, The Organization of Behavior, Wiley, New York, 1949.
[12] D.H. Ackley, G.E. Hinton, T.J. Sejnowski, Cognitive Science 9 (1985) 147.
[13] F. Rosenblatt, Psychol. Rev. 65 (1958) 386.
[14] S. Wolfram, Rev. Mod. Phys. 55 (1983) 601.
[15] S. Wolfram, A New Kind of Science, Wolfram Media Inc., Champaign, IL, 2002.
[16] M. Cook, Complex Systems 15 (2004) 1.
[17] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Nature 323 (1986) 533.
[18] H. Kitano, Science 295 (2002) 1662.
[19] L. Gammaitoni, P. Hänggi, P. Jung, F. Marchesoni, Rev. Mod. Phys. 70 (1998) 223.