Local minima in hierarchical structures of complex-valued neural networks


Neural Networks 43 (2013) 1–7


Tohru Nitta
Mathematical Neuroinformatics Group, Human Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan

Article history: Received 4 December 2010; received in revised form 18 October 2012; accepted 8 February 2013.

Keywords: Complex number; Singular point; Redundancy; Local minimum; Saddle point

Abstract. Most of the local minima caused by the hierarchical structure can be resolved by extending the real-valued neural network to complex numbers. It was proved in 2000 that a critical point of the real-valued neural network with H − 1 hidden neurons always gives many critical points of the real-valued neural network with H hidden neurons. These critical points consist of many lines in the parameter space, which can be local minima or saddle points. Local minima cause plateaus, which have a strongly negative influence on learning. However, most of the critical points of the complex-valued neural network are saddle points, unlike those of the real-valued neural network. This is a prominent property of the complex-valued neural network.

1. Introduction

In recent years, complex-valued neural networks, whose parameters (weights and threshold values) are all complex numbers, have been applied in various fields dealing with complex numbers, such as signal processing, image processing, optoelectronics, associative memories, adaptive filters, and telecommunications (Hirose, 2003; Nitta, 2009). The application field of the complex-valued neural network is wider than that of the real-valued neural network because the complex-valued neural network can represent more information (phase and amplitude) than the real-valued neural network, and because it has some inherent properties such as the ability to transform geometric figures (Nitta, 1993, 1997; Nitta & Furuya, 1991) and the orthogonal decision boundary (Nitta, 2004, 2008b). In applications of multi-layered real-valued neural networks, the error back-propagation learning algorithm (called here Real-BP; Rumelhart, Hinton, & Williams, 1986) has often been used. Naturally, the complex-valued version of the Real-BP (called here Complex-BP) can be considered, and it was actually proposed by several researchers independently in the early 1990s (Benvenuto & Piazza, 1992; Georgiou & Koutsougeras, 1992; Kim & Guest, 1990; Nitta, 1993, 1997; Nitta & Furuya, 1991). This algorithm enables the network to learn complex-valued patterns naturally. On the other hand, Sussmann (1992) reported that a redundancy exists in the parameters of real-valued neural networks, which results from the hierarchical structure of the network.


Fukumizu and Amari (2000) used Sussmann's results to prove the existence of local minima caused by the hierarchical structure of real-valued neural networks. That is, they proved that a critical point of the model with H − 1 hidden neurons always gives many critical points of the model with H hidden neurons. These critical points consist of many lines in the parameter space, which can be local minima or saddle points. No existence proof for local minima in real-valued neural networks is known other than the result described in Fukumizu and Amari (2000). The local minima of neural networks cause a standstill of learning. In fact, Fukumizu et al. confirmed using computer simulations that the local minima they discovered caused 50,000 plateaus, which had a strong negative influence on learning. In Section 3, it is proved that most of the local minima that Fukumizu et al. discovered are resolved by extending the real-valued neural network to complex numbers; most of the critical points attributable to the hierarchical structure of the complex-valued neural network are saddle points, which is a prominent property of the complex-valued neural network. This paper serves as the initial step toward a total comprehension of the local minima of the complex-valued neural network.

2. The complex-valued neural network

This section describes the complex-valued neural network used in the analysis. First, we consider the following complex-valued neuron. The input signals, weights, thresholds and output signals are all complex numbers. The net input U_n to a complex-valued neuron n is defined as

U_n = Σ_m W_nm X_m + V_n,

where W_nm is the (complex-valued) weight connecting the complex-valued


neurons n and m, X_m is the (complex-valued) input signal from the complex-valued neuron m, and V_n is the (complex-valued) threshold value of the complex-valued neuron n. To obtain the (complex-valued) output signal, convert the net input U_n into its real and imaginary parts as follows: U_n = x + iy = z, where i denotes √−1. The (complex-valued) output signal is defined to be

ϕ_C(z) = ϕ(x) + iϕ(y),   (1)

where ϕ(u) := tanh(u) = (exp(u) − exp(−u))/(exp(u) + exp(−u)), u ∈ R (R denotes the set of real numbers), is the hyperbolic tangent. Note that −1 < Re[ϕ_C], Im[ϕ_C] < 1. Note also that ϕ_C(z) is not holomorphic as a complex function because the Cauchy–Riemann equations do not hold: writing z = x + iy, we have ∂Re[ϕ_C(z)]/∂x = 1 − ϕ(x)^2 and ∂Im[ϕ_C(z)]/∂y = 1 − ϕ(y)^2, which are not identically equal.

A complex-valued neural network consists of the complex-valued neurons described above. The network used in the analysis has three layers: an L−H−1 network. The activation function ψ_C of the output neuron is linear, that is, ψ_C(z) = z for any z ∈ C, where C denotes the set of complex numbers. For any input pattern x = (x_1, ..., x_L)^T ∈ C^L to the complex-valued neural network, where x_k ∈ C is the input signal to the input neuron k (1 ≤ k ≤ L) and T denotes transposition, the output value of the output neuron is defined to be

f^(H)(x; θ^(H)) = Σ_{j=1}^{H} ν_j ϕ_C(w̃_j^T x̃) + ν_0 ∈ C,   (2)

where w̃_j = (w_j0, w_j^T)^T ∈ C^{L+1}, w_j0 ∈ C is the threshold of the hidden neuron j, w_j = (w_j1, ..., w_jL)^T ∈ C^L is the weight vector of the hidden neuron j (w_jk ∈ C is the weight between the input neuron k and the hidden neuron j, 1 ≤ j ≤ H), x̃ = (1, x^T)^T ∈ C^{L+1}, ν_j ∈ C is the weight between the hidden neuron j and the output neuron (1 ≤ j ≤ H), ν_0 ∈ C is the threshold of the output neuron, and θ^(H) = (ν_0, ν_1, ..., ν_H, w̃_1^T, ..., w̃_H^T)^T summarizes all the parameters in one large vector.

It is assumed that the net inputs to any two hidden neurons are not nearly rotation-equivalent (Definition 1). This assumption is needed in order to use, in the analysis, the Uniqueness Theorem for the complex-valued neural network proved in Nitta (2008a). The uniqueness theorem states that the redundancy of the learning parameters of an irreducible complex-valued neural network approximating a given complex-valued function is determined up to a finite group.

Definition 1. If any of the following eight conditions is valid, the two complex-valued linear affine functions ϕ_s : C^m → C^1 and ϕ_t : C^m → C^1 are called nearly rotation-equivalent:

∀z ∈ C^m; Re[ϕ_s(z)] = Re[ϕ_t(z)]  and  ∃z ∈ C^m; Im[ϕ_s(z)] ≠ Im[ϕ_t(z)],   (3)
∀z ∈ C^m; Im[ϕ_s(z)] = Im[ϕ_t(z)]  and  ∃z ∈ C^m; Re[ϕ_s(z)] ≠ Re[ϕ_t(z)],   (4)
∀z ∈ C^m; Re[ϕ_s(z)] = −Re[ϕ_t(z)]  and  ∃z ∈ C^m; Im[ϕ_s(z)] ≠ −Im[ϕ_t(z)],   (5)
∀z ∈ C^m; Im[ϕ_s(z)] = −Im[ϕ_t(z)]  and  ∃z ∈ C^m; Re[ϕ_s(z)] ≠ −Re[ϕ_t(z)],   (6)
∀z ∈ C^m; Re[ϕ_s(z)] = −Im[ϕ_t(z)]  and  ∃z ∈ C^m; Im[ϕ_s(z)] ≠ Re[ϕ_t(z)],   (7)
∀z ∈ C^m; Im[ϕ_s(z)] = Re[ϕ_t(z)]  and  ∃z ∈ C^m; Re[ϕ_s(z)] ≠ −Im[ϕ_t(z)],   (8)
∀z ∈ C^m; Re[ϕ_s(z)] = Im[ϕ_t(z)]  and  ∃z ∈ C^m; Im[ϕ_s(z)] ≠ −Re[ϕ_t(z)],   (9)
∀z ∈ C^m; Im[ϕ_s(z)] = −Re[ϕ_t(z)]  and  ∃z ∈ C^m; Re[ϕ_s(z)] ≠ Im[ϕ_t(z)].   (10)
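Remark 1 below gives the geometric reading of Definition 1. For intuition only, here is a minimal NumPy sketch (the function and variable names are illustrative assumptions, not from the paper) that checks the exact, non-"nearly" version of rotation-equivalence of two affine maps ϕ_s(z) = a_s·z + b_s and ϕ_t(z) = a_t·z + b_t, i.e., whether ϕ_s coincides with ϕ_t rotated by 0, π/2, π or 3π/2 about the origin. Conditions (3)-(10) relax this by requiring only one of the real or imaginary parts to agree everywhere while the other differs somewhere.

    import numpy as np

    def rotation_equivalent(a_s, b_s, a_t, b_t, tol=1e-12):
        # Exact rotation-equivalence of the affine maps
        # phi_s(z) = a_s . z + b_s and phi_t(z) = a_t . z + b_t on C^m:
        # phi_s == (i**k) * phi_t identically for some k in {0, 1, 2, 3},
        # i.e. phi_t rotated counter-clockwise by 0, pi/2, pi or 3*pi/2.
        for k in range(4):
            q = 1j ** k
            if np.allclose(a_s, q * np.asarray(a_t), atol=tol) and np.isclose(b_s, q * b_t, atol=tol):
                return True
        return False

    # Example: phi_s is phi_t rotated by pi/2 (multiplied by i).
    a_t = np.array([1.0 + 2.0j, -0.5j])
    b_t = 0.3 - 0.7j
    print(rotation_equivalent(1j * a_t, 1j * b_t, a_t, b_t))    # True
    print(rotation_equivalent(2.0 * a_t, 2.0 * b_t, a_t, b_t))  # False (a scaling, not a rotation)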

Remark 1. The net input w̃_j^T x̃ to a hidden neuron j is a complex-valued linear affine function on C^L. Nearly rotation-equivalence expresses the idea that a complex-valued linear affine function on C^L comes close to being identically equal to another one after a counterclockwise rotation of the latter by 0, π/2, π or 3π/2 radians about the origin.

Given N complex-valued training data {(x^(p), y^(p)) ∈ C^L × C | p = 1, ..., N}, we use a complex-valued neural network to realize the relation expressed by the data. The objective of the training is to find the parameters that minimize the error function defined by

E_H(θ^(H)) = Σ_{p=1}^{N} l(y^(p), f^(H)(x^(p); θ^(H))) ∈ R,   (11)

where l(y, z) : C × C → R is a partially differentiable function (called the loss function) such that l(y, z) ≥ 0, with equality if and only if y = z. Note that l is not holomorphic as a complex function because it takes a real value. To the author's knowledge, all of the multi-layered complex-valued neural networks proposed so far (for example Benvenuto & Piazza, 1992; Georgiou & Koutsougeras, 1992; Kim & Guest, 1990; Nitta, 1993, 1997; Nitta & Furuya, 1991) employ the mean square error l(y, z) = (1/2)|y − z|^2, which takes a real value.

Note that the component-wise activation function (Eq. (1)) is bounded but non-holomorphic as a complex-valued function because the Cauchy–Riemann equations do not hold. On the other hand, some complex-valued neural network models with holomorphic activation functions (e.g., ϕ_C(z) = 1/(1 + exp(−z)) in Kim & Guest, 1990, and ϕ_C(z) = exp(z) in Savitha, Suresh, Sundararajan, & Saratchandran, 2008) suffer from problems with learning convergence because these functions have poles. It has been proved that the complex-valued neural network with the component-wise activation function (Eq. (1)) can approximate any continuous complex-valued function, whereas networks with holomorphic activation functions cannot approximate non-holomorphic complex-valued functions (Arena, Fortuna, Muscato, & Xibilia, 1998; Arena, Fortuna, Re, & Xibilia, 1993). That is, the complex-valued neural network with the non-holomorphic activation function (Eq. (1)) is a universal approximator, but networks with holomorphic activation functions are not. Consequently, the properties of complex-valued neural networks depend heavily on the holomorphy of the activation functions used. Therefore, the properties of complex-valued neural networks with non-holomorphic activation functions must be investigated separately from those with holomorphic activation functions. For example, in the case of real-valued neural networks, the results on local minima obtained in Fukumizu and Amari (2000) are applicable to a wide class of three-layer models. The situation is, however, not the same in the case of complex-valued neural networks, for the reason explained above. A uniqueness theorem would first have to be proved in order to investigate the properties of singular points of complex-valued neural networks with holomorphic activation functions, whereas the uniqueness theorem for the complex-valued neural network with the non-holomorphic activation function has already been proved (Nitta, 2008a). For that reason, we specifically examine the complex-valued neural network with the component-wise activation function (Eq. (1)).
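To make the model of this section concrete, here is a minimal NumPy sketch of the L−H−1 complex-valued network of Eqs. (1) and (2) and of the error function of Eq. (11) with the mean square loss. It is an illustration under the stated assumptions, not code from the paper; the function and variable names (phi_c, forward, error, etc.) and the toy sizes are my own choices.

    import numpy as np

    def phi_c(z):
        # Component-wise activation of Eq. (1): tanh applied separately to
        # the real and imaginary parts (bounded, non-holomorphic).
        return np.tanh(z.real) + 1j * np.tanh(z.imag)

    def forward(x, nu0, nu, W):
        # Eq. (2): f(x) = sum_j nu_j * phi_C(w~_j^T x~) + nu_0, where
        # x~ = (1, x^T)^T and row j of W is w~_j = (w_j0, w_j1, ..., w_jL).
        x_tilde = np.concatenate(([1.0 + 0.0j], x))
        return nu @ phi_c(W @ x_tilde) + nu0

    def error(data, nu0, nu, W):
        # Eq. (11) with the mean square loss l(y, z) = |y - z|^2 / 2.
        return sum(0.5 * abs(y - forward(x, nu0, nu, W)) ** 2 for x, y in data)

    # Toy usage with hypothetical sizes L = 2, H = 3, N = 4.
    rng = np.random.default_rng(0)
    L, H, N = 2, 3, 4
    crandn = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
    nu0, nu, W = crandn(), crandn(H), crandn(H, L + 1)
    data = [(crandn(L), crandn()) for _ in range(N)]
    print(error(data, nu0, nu, W))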


3. Local minima based on the hierarchical structure in the complex-valued neural network

This section investigates local minima based on the hierarchical structure in the complex-valued neural network described in Section 2.

3.1. Local minima in the real-valued neural network

It is difficult to prove the existence of local minima, which exert a negative influence on the learning of neural networks. For example, it was proved that there are no local minima in the finite weight region for the XOR problem (Hamney, 1998; Sprinkhuizen-Kuyper & Boers, 1998). Before that, it had been believed that some of the critical points of the XOR problem are local minima (Lisboa & Perantonis, 1991). No rigorous existence proof for local minima in real-valued neural networks exists other than the result presented in Fukumizu and Amari (2000). They proved that a subset of the critical points corresponding to the global minimum of a smaller network can be local minima or saddle points of the larger network, and confirmed by computer simulation that the local minima they discovered caused 50,000 plateaus, which negatively influenced learning. Incidentally, as a generalization of the above result, they derived similar results for mixture-type models (Fukumizu, Akaho, & Amari, 2003). That is, given a critical point of the likelihood for a smaller model, duplication of any of the components gives critical points as lines (critical lines) for the larger model, and the critical lines can be local maxima or saddle points.

3.2. Local minima in the complex-valued neural network

In this section, it is shown that most of the local minima caused by the hierarchical structure shown in Fukumizu and Amari (2000) can be resolved by extending the real-valued neural network to complex numbers.

3.2.1. Redundancy based on the hierarchical structure

This section makes clear the redundancy based on the hierarchical structure of the complex-valued neural network, that is, the structure of the redundancy of the complex-valued neural network with H hidden neurons for a given set of parameters of the complex-valued neural network with H − 1 hidden neurons.

Definition 2. Define

F_H = {f^(H)(x; θ^(H)) : C^L → C^1 | θ^(H) ∈ Θ_H},   (12)
π_H : Θ_H → F_H,  θ^(H) ↦ f^(H)(x; θ^(H)),   (13)

where Θ_H is the set of all the parameters (weights and thresholds) of the L−H−1 complex-valued neural network, i.e., Θ_H = C^{LH+2H+1}. F_H is a functional space, the family of all the functions realized by the L−H−1 complex-valued neural network. The functional spaces {F_H}_{H=0}^{∞} have a trivial hierarchical structure:

F_0 ⊂ F_1 ⊂ ··· ⊂ F_{H−1} ⊂ F_H ⊂ ···.   (14)

π_H gives a complex-valued function for a given set of parameters of the complex-valued neural network with H hidden neurons. Obviously, π_H is not one-to-one: different θ^(H) may give the same input–output function.

Definition 3. Define

Ω_H = {θ^(H) ∈ Θ_H | π_H(θ^(H)) ∈ i_{H−1}(F_{H−1}(θ^(H−1))), θ^(H−1) ∈ Θ_{H−1}, such that the net input to any hidden neuron of the complex-valued neural network realized with θ^(H−1) and the one to any hidden neuron of the complex-valued neural network realized with θ^(H) are not nearly rotation-equivalent},   (15)

where i_{H−1} : F_{H−1} → F_H, f ↦ i_{H−1}(f) = f, is the inclusion.

Ω_H is the set of all the parameters θ^(H) that realize the input–output functions of the complex-valued neural network with H − 1 hidden neurons. The following proposition can be easily shown using the results in Nitta (2008a).

Proposition 1. Ω_H consists of the union of the following complex submanifolds of Θ_H:

A_j = {θ^(H) ∈ Θ_H | ν_j = 0}  (1 ≤ j ≤ H),   (16)
B_j = {θ^(H) ∈ Θ_H | w_j = 0}  (1 ≤ j ≤ H),   (17)
C_{j1 j2} = {θ^(H) ∈ Θ_H | w̃_{j1} = w̃_{j2}, w̃_{j1} = −w̃_{j2}, w̃_{j1} = i w̃_{j2} or w̃_{j1} = −i w̃_{j2}}  (1 ≤ j1 < j2 ≤ H).   (18)

[Fig. 1. Neural networks implemented with the parameters in the complex submanifolds A_j, B_j and C_{j1 j2}.]

The network structures implemented by Eqs. (16)-(18) are shown in Fig. 1. The hidden neuron j of the neural network implemented by A_j never influences the output neuron, because the weight ν_j between the hidden neuron j and the output neuron is equal to zero. The output of the hidden neuron j of the neural network implemented by B_j is equal to a constant ϕ_C(w_j0) because w_j = 0; we can therefore remove the hidden neuron j and replace the threshold of the output neuron ν_0 with ν_0 + ν_j ϕ_C(w_j0). In the case of C_{j1 j2}, we can remove the hidden neuron j2 and replace the weight ν_{j1} between the hidden neuron j1 and the output neuron with ν_{j1} + q ν_{j2}, where q = −1, 1, −i or i. Proposition 1 thus shows the structure of the set Ω_H of all the parameters θ^(H) that realize the input–output functions of the complex-valued neural network with H − 1 hidden neurons.
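As a sanity check on the C_{j1 j2} case of Proposition 1, the following hedged NumPy sketch (my own illustration, not the paper's code) uses the rotation identity ϕ_C(iz) = iϕ_C(z) of the component-wise tanh activation: when w̃_1 = i w̃_2, hidden neuron 1 can be removed after folding iν_1 into ν_2, and the input–output map is unchanged.

    import numpy as np

    def phi_c(z):
        return np.tanh(z.real) + 1j * np.tanh(z.imag)

    def forward(x, nu0, nu, W):
        # f(x) = sum_j nu_j * phi_C(W[j] @ (1, x)) + nu_0, as in Eq. (2).
        return nu @ phi_c(W @ np.concatenate(([1.0 + 0.0j], x))) + nu0

    rng = np.random.default_rng(1)
    L, H = 2, 3
    crandn = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
    nu0, nu, W = crandn(), crandn(H), crandn(H, L + 1)

    # Case C_{12} with w~_1 = i * w~_2: because phi_C(i*z) = i*phi_C(z) for the
    # component-wise tanh, neuron 1 contributes nu_1 * i * phi_C(w~_2^T x~),
    # so it can be removed after folding i*nu_1 into nu_2.
    W[0] = 1j * W[1]
    nu_reduced = nu[1:].copy()
    nu_reduced[0] = nu[1] + 1j * nu[0]
    W_reduced = W[1:]

    x = crandn(L)
    print(np.isclose(forward(x, nu0, nu, W),
                     forward(x, nu0, nu_reduced, W_reduced)))   # True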


Next, we examine the structure of the set of all the parameters θ^(H) that realize the input–output function realized by a given set of parameters θ^(H−1).

Definition 4. Define

Ω_H(θ^(H−1)) = {θ^(H) ∈ Θ_H | π_H(θ^(H)) ∈ i_{H−1}(f^(H−1)(x, θ^(H−1))), such that the net input to any hidden neuron of the complex-valued neural network realized with θ^(H−1) and the one to any hidden neuron of the complex-valued neural network realized with θ^(H) are not nearly rotation-equivalent},   (19)

where f^(H−1)(x, θ^(H−1)) ∈ F_{H−1} − F_{H−2} is a complex function realized by a given set of parameters θ^(H−1), and we use the following notation for its parameters and indexing:

f^(H−1)(x; θ^(H−1)) = Σ_{j=2}^{H} ξ_j ϕ_C(ũ_j^T x̃) + ξ_0.   (20)

Ω_H(θ^(H−1)) is the set of all the parameters θ^(H) ∈ Θ_H that realize the complex function f^(H−1)(x, θ^(H−1)) realized by a given set of parameters θ^(H−1). The following proposition can also be easily shown using the results in Nitta (2008a).

Proposition 2. Ω_H(θ^(H−1)) consists of the complex submanifolds obtained by transforming Λ, Ξ and Γ using the transformations in the finite group W_{L,H} defined in Nitta (2008a), where

Λ = {θ^(H) ∈ Θ_H | ν_1 = 0, ν_0 = ξ_0, ν_j = ξ_j, w̃_j = ũ_j (2 ≤ j ≤ H)},   (21)
Ξ = {θ^(H) ∈ Θ_H | w_1 = 0, ν_1 ϕ_C(w_10) + ν_0 = ξ_0, ν_j = ξ_j, w̃_j = ũ_j (2 ≤ j ≤ H)},   (22)
Γ = {θ^(H) ∈ Θ_H | w̃_1 = w̃_2 = ũ_2, ν_0 = ξ_0, ν_1 + ν_2 = ξ_2, ν_j = ξ_j, w̃_j = ũ_j (3 ≤ j ≤ H)}.   (23)

The finite group W_{L,H} is generated by two transformations θ_j^i and τ_{j1 j2}, where (i) θ_j^i multiplies the weights between the input layer and the hidden neuron j by −i ∈ C and the weight between the hidden neuron j and the output neuron by i ∈ C, and (ii) τ_{j1 j2} interchanges the two hidden neurons j1 and j2.

Λ is a complex submanifold of A_1 (Eq. (16)) and a complex (L+1)-dimensional affine space parallel to the w̃_1-complex plane, because only w̃_1 ∈ C^{L+1} is free (all the other components of θ^(H) are fixed using θ^(H−1)). Ξ is a complex submanifold of B_1 (Eq. (17)) and a complex 2-dimensional submanifold defined by a nonlinear equation:

ν_1 ϕ_C(w_10) + ν_0 = ξ_0,   (24)

because ν_1, w_10 and ν_0 are free (all the other components of θ^(H) are fixed using θ^(H−1)). Γ is a complex submanifold of C_12 (Eq. (18)) and a complex 1-dimensional affine space defined by ν_1 + ν_2 = ξ_2, where ν_1, ν_2 ∈ C are free and ξ_2 ∈ C is fixed.

Here, we define the following canonical embeddings from Θ_{H−1} to Θ_H, which will be used for the analysis of the critical points in the following sections.

Definition 5. (i) For any w̃ ∈ C^{L+1}, define

α_w̃ : Θ_{H−1} → Θ_H,  θ^(H−1) ↦ (ξ_0, 0, ξ_2, ..., ξ_H, w̃^T, ũ_2^T, ..., ũ_H^T)^T.   (25)

(ii) For any (ν, w) ∈ C^2, define

β_(ν,w) : Θ_{H−1} → Θ_H,  θ^(H−1) ↦ (ξ_0 − ν ϕ_C(w), ν, ξ_2, ..., ξ_H, (w, 0^T), ũ_2^T, ..., ũ_H^T)^T.   (26)

(iii) For any λ ∈ C, define

γ_λ : Θ_{H−1} → Θ_H,  θ^(H−1) ↦ (ξ_0, λξ_2, (1 − λ)ξ_2, ξ_3, ..., ξ_H, ũ_2^T, ũ_2^T, ũ_3^T, ..., ũ_H^T)^T.   (27)

It is trivial to show that the following proposition holds.

Proposition 3.

Λ = {α_w̃(θ^(H−1)) | w̃ ∈ C^{L+1}},   (28)
Ξ = {β_(ν,w)(θ^(H−1)) | (ν, w) ∈ C^2},   (29)
Γ = {γ_λ(θ^(H−1)) | λ ∈ C}.   (30)

The canonical embeddings α_w̃, β_(ν,w), γ_λ in Definition 5 are the complex-valued version of the corresponding embeddings α_w̃, β_(ν,w), γ_λ defined in Fukumizu and Amari (2000).

3.2.2. Critical points

This section investigates the critical points of the complex-valued neural network. Generally, the objective of the learning of neural networks is to obtain a global minimum of the error function. If ω_* is a global minimum of the error function E(ω), the equation ∂E(ω_*)/∂ω = 0 holds. However, ∂E(ω_*)/∂ω = 0 does not always ensure that ω_* is a global minimum. A point ω_* satisfying ∂E(ω_*)/∂ω = 0 is called a critical point of E. There are three types of critical points: a local minimum, a local maximum and a saddle point, which, as is well known, can be identified using the Hessian. In the case of neural networks, including complex-valued neural networks, it is very difficult to analyze critical points simply using the Hessian. That is the reason why this paper focuses on the critical points caused by the hierarchical structure of the complex-valued neural network. Specifically, we define the critical point of the complex-valued neural network defined in Section 2.

Definition 6. (i) A parameter θ^(H) = (θ_1^(H), ..., θ_K^(H)) ∈ Θ_H is called a critical point of the error function E_H(θ^(H)) if the following equations hold:

∂E_H(θ^(H))/∂Re[θ^(H)] = (∂E_H(θ^(H))/∂Re[θ_1^(H)], ..., ∂E_H(θ^(H))/∂Re[θ_K^(H)])^T = 0,   (31)
∂E_H(θ^(H))/∂Im[θ^(H)] = (∂E_H(θ^(H))/∂Im[θ_1^(H)], ..., ∂E_H(θ^(H))/∂Im[θ_K^(H)])^T = 0,   (32)

where K = LH + 2H + 1 is the number of parameters of the complex-valued neural network. (ii) A critical point θ̂^(H) ∈ Θ_H is called a local minimum (maximum) if there exists a neighborhood around θ̂^(H) such that for any point θ^(H) in the neighborhood, E_H(θ^(H)) ≥ E_H(θ̂^(H)) (E_H(θ^(H)) ≤ E_H(θ̂^(H))) holds, and it is called a saddle if it is neither a local minimum nor a local maximum.
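As an aside, the criticality conditions (31)-(32) can be checked numerically. Below is a minimal, hedged NumPy sketch (the names and the packing order of the parameter vector are my own choices, not the paper's) that approximates ∂E_H/∂Re[θ_k] and ∂E_H/∂Im[θ_k] by central finite differences for the network and mean square loss of Section 2.

    import numpy as np

    def phi_c(z):
        return np.tanh(z.real) + 1j * np.tanh(z.imag)

    def error(theta, data, L, H):
        # theta packs (nu_0, nu_1, ..., nu_H, w~_1, ..., w~_H) into one complex
        # vector of length K = LH + 2H + 1, mirroring theta^(H) in the text;
        # mean square loss as in Eq. (11).
        nu0, nu = theta[0], theta[1:H + 1]
        W = theta[H + 1:].reshape(H, L + 1)
        e = 0.0
        for x, y in data:
            f = nu @ phi_c(W @ np.concatenate(([1.0 + 0.0j], x))) + nu0
            e += 0.5 * abs(y - f) ** 2
        return e

    def real_imag_gradient(theta, data, L, H, eps=1e-6):
        # Central-difference estimates of Eqs. (31)-(32): dE_H/dRe[theta_k]
        # and dE_H/dIm[theta_k] for every parameter theta_k.
        g_re = np.zeros(theta.size)
        g_im = np.zeros(theta.size)
        for k in range(theta.size):
            for g, step in ((g_re, eps), (g_im, 1j * eps)):
                tp, tm = theta.copy(), theta.copy()
                tp[k] += step
                tm[k] -= step
                g[k] = (error(tp, data, L, H) - error(tm, data, L, H)) / (2 * eps)
        return g_re, g_im

    # At a generic random theta both gradients are nonzero; at a critical
    # point in the sense of Definition 6 both (approximately) vanish.
    rng = np.random.default_rng(2)
    L, H, N = 2, 2, 3
    crandn = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
    theta = crandn(1 + H + H * (L + 1))
    data = [(crandn(L), crandn()) for _ in range(N)]
    g_re, g_im = real_imag_gradient(theta, data, L, H)
    print(np.max(np.abs(g_re)), np.max(np.abs(g_im)))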


The next lemma can be easily shown by simple calculations.

Lemma 1. Let θ_*^(H−1) = (ξ_0*, ξ_2*, ..., ξ_H*, ũ_2*^T, ..., ũ_H*^T)^T ∈ Θ_{H−1} be a critical point of E_{H−1}. Then, for any 2 ≤ j ≤ H, the following equations hold:

∂E_{H−1}(θ_*^(H−1))/∂Re[ξ_0] = ∂E_{H−1}(θ_*^(H−1))/∂Im[ξ_0] = 0,   (33)
∂E_{H−1}(θ_*^(H−1))/∂Re[ξ_j] = ∂E_{H−1}(θ_*^(H−1))/∂Im[ξ_j] = 0,   (34)
∂E_{H−1}(θ_*^(H−1))/∂Re[ũ_j] = Σ_{p=1}^{N} { (∂l/∂z)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) ξ_j* [ (∂ϕ_C(ũ_j*^T x̃^(p))/∂z) x̃^(p)T + (∂ϕ_C(ũ_j*^T x̃^(p))/∂z̄) x̃^(p)T ] + (∂l/∂z̄)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) ξ_j* [ (∂ϕ_C(ũ_j*^T x̃^(p))/∂z) x̃^(p)T + (∂ϕ_C(ũ_j*^T x̃^(p))/∂z̄) x̃^(p)T ] } = 0,   (35)
∂E_{H−1}(θ_*^(H−1))/∂Im[ũ_j] = i Σ_{p=1}^{N} { (∂l/∂z)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) ξ_j* [ (∂ϕ_C(ũ_j*^T x̃^(p))/∂z) x̃^(p)T − (∂ϕ_C(ũ_j*^T x̃^(p))/∂z̄) x̃^(p)T ] + (∂l/∂z̄)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) ξ_j* [ (∂ϕ_C(ũ_j*^T x̃^(p))/∂z) x̃^(p)T − (∂ϕ_C(ũ_j*^T x̃^(p))/∂z̄) x̃^(p)T ] } = 0.   (36)

The next theorem shows the existence of the critical points of the complex-valued neural network.

Theorem 1. Let θ_*^(H−1) = (ξ_0*, ξ_2*, ..., ξ_H*, ũ_2*^T, ..., ũ_H*^T)^T ∈ Θ_{H−1} be a critical point of E_{H−1}. Let γ_λ be as in Eq. (27). Then the point γ_λ(θ_*^(H−1)) is a critical point of E_H for any λ ∈ R.

Proof. Let θ^(H) = γ_λ(θ_*^(H−1)) for any λ ∈ R. First, we can easily find from Lemma 1 and the equation f^(H)(x, θ^(H)) = f^(H−1)(x, θ_*^(H−1)) that the following equations hold: for any 0 ≤ j ≤ H,

∂E_H(θ^(H))/∂Re[ν_j] = ∂E_H(θ^(H))/∂Im[ν_j] = 0,   (37)

and for any 3 ≤ j ≤ H,

∂E_H(θ^(H))/∂Re[w̃_j] = ∂E_H(θ^(H))/∂Im[w̃_j] = 0.   (38)

However, in the case of j = 1, since

∂E_H(θ^(H))/∂Re[w̃_1] = Σ_{p=1}^{N} { (∂l/∂z)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) λξ_2* [ (∂ϕ_C(ũ_2*^T x̃^(p))/∂z) x̃^(p)T + (∂ϕ_C(ũ_2*^T x̃^(p))/∂z̄) x̃^(p)T ] + (∂l/∂z̄)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) λξ_2* [ (∂ϕ_C(ũ_2*^T x̃^(p))/∂z) x̃^(p)T + (∂ϕ_C(ũ_2*^T x̃^(p))/∂z̄) x̃^(p)T ] },   (39)

we obtain

∂E_H(θ^(H))/∂Re[w̃_1] = 0 (from Eq. (35)).   (40)

Similarly,

∂E_H(θ^(H))/∂Re[w̃_2] = 0.   (41)

Similar equations hold for the parameters Im[w̃_j] (j = 1, 2) because of Eq. (36). This completes the proof. □

Theorem 2. The critical points of Theorem 1 form a straight line in the 2-dimensional affine space defined by ν_1 + ν_2 = ξ_2* as λ moves over R.

Proof. Since ν_1 = λξ_2* and ν_2 = (1 − λ)ξ_2*, the following equation holds:

(Re[ν_1], Im[ν_1], Re[ν_2], Im[ν_2])^T = (0, 0, Re[ξ_2*], Im[ξ_2*])^T + λ (Re[ξ_2*], Im[ξ_2*], −Re[ξ_2*], −Im[ξ_2*])^T. □   (42)

Theorem 3. Let θ_*^(H−1) = (ξ_0*, ξ_2*, ..., ξ_H*, ũ_2*^T, ..., ũ_H*^T)^T ∈ Θ_{H−1} be a critical point of E_{H−1}. Let β_(ν,w) be as in Eq. (26). Then the point β_(0,w)(θ_*^(H−1)) is a critical point of E_H for any w ∈ C.

Proof. Let θ^(H) = β_(0,w)(θ_*^(H−1)) for any w ∈ C. Then we can easily find from Lemma 1, the equation f^(H)(x, θ^(H)) = f^(H−1)(x, θ_*^(H−1)) and the assumption ν_1 = ν = 0 that the following equations hold: for any 0 ≤ j ≤ H,

∂E_H(θ^(H))/∂Re[ν_j] = ∂E_H(θ^(H))/∂Im[ν_j] = 0,   (43)

and for any 1 ≤ j ≤ H,

∂E_H(θ^(H))/∂Re[w̃_j] = ∂E_H(θ^(H))/∂Im[w̃_j] = 0. □   (44)

Theorem 4. Let θ_*^(H−1) = (ξ_0*, ξ_2*, ..., ξ_H*, ũ_2*^T, ..., ũ_H*^T)^T ∈ Θ_{H−1} be a critical point of E_{H−1}. Let α_w̃ be as in Eq. (25), with w̃ = (w, w^T)^T (here w ∈ C is the threshold component and w ∈ C^L the weight vector). If the weight vector w = 0, then the point α_w̃(θ_*^(H−1)) is a critical point of E_H for any w ∈ C.

Proof. Since w̃ = (w, w^T)^T = (w, 0^T)^T, the equation α_w̃ = β_(0,w) holds. Thus we find from Theorem 3 that α_w̃(θ_*^(H−1)) is a critical point of E_H for any w ∈ C. □

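For a concrete reading of Definition 5 and Theorems 1, 3 and 4, here is a minimal NumPy sketch (my own illustration; the names alpha, beta, gamma, forward and the array packing are assumptions, not the paper's code). It builds γ_λ(θ^(H−1)), β_(ν,w)(θ^(H−1)) and α_w̃(θ^(H−1)) and checks the identity f^(H)(x; ·) = f^(H−1)(x; θ^(H−1)) used in the proofs above; combined with the finite-difference gradient sketch after Definition 6, one can also verify numerically that these embedded points are critical points of E_H once θ^(H−1) is a critical point of E_{H−1}.

    import numpy as np

    def phi_c(z):
        return np.tanh(z.real) + 1j * np.tanh(z.imag)

    def forward(x, nu0, nu, W):
        return nu @ phi_c(W @ np.concatenate(([1.0 + 0.0j], x))) + nu0

    def alpha(w_tilde, xi0, xi, U):
        # alpha_w~ of Definition 5(i): add hidden neuron 1 with nu_1 = 0
        # and input weights w~.
        return xi0, np.concatenate(([0.0 + 0.0j], xi)), np.vstack([w_tilde, U])

    def beta(nu, w, xi0, xi, U, L):
        # beta_(nu, w) of Definition 5(ii): add hidden neuron 1 with zero
        # input weights and threshold w; the output threshold absorbs
        # -nu * phi_C(w).
        row = np.concatenate(([w], np.zeros(L, dtype=complex)))
        return xi0 - nu * phi_c(w), np.concatenate(([nu], xi)), np.vstack([row, U])

    def gamma(lam, xi0, xi, U):
        # gamma_lambda of Definition 5(iii): duplicate hidden neuron "2"
        # (index 0 here) and split its output weight xi_2 into
        # lam*xi_2 and (1 - lam)*xi_2.
        nu = np.concatenate(([lam * xi[0], (1 - lam) * xi[0]], xi[1:]))
        return xi0, nu, np.vstack([U[0], U])

    rng = np.random.default_rng(3)
    L, Hm1 = 2, 2      # the smaller network has H - 1 = 2 hidden neurons
    crandn = lambda *s: rng.standard_normal(s) + 1j * rng.standard_normal(s)
    xi0, xi, U = crandn(), crandn(Hm1), crandn(Hm1, L + 1)
    x = crandn(L)

    f_small = forward(x, xi0, xi, U)   # f^(H-1)(x; theta^(H-1))
    for nu0, nu, W in (alpha(crandn(L + 1), xi0, xi, U),
                       beta(0.7 - 0.2j, 1.1 + 0.4j, xi0, xi, U, L),
                       gamma(0.3, xi0, xi, U)):
        print(np.isclose(forward(x, nu0, nu, W), f_small))   # True for all three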


It is trivial to show that the following theorem holds.

Theorem 5. The critical points of Theorems 3 and 4 form a straight line in C^{LH+2H+1} as w moves over C.

Theorem 6. There exist many critical points, each of which forms a straight line, in the complex-valued neural network.

Proof. If θ is a critical point of Theorems 1, 3 and 4, so is G(θ) for any transformation G ∈ W_{L,H}, where W_{L,H} is the finite group defined in Nitta (2008a). □

3.2.3. Saddle points

This section investigates the properties of the critical points of Theorems 1, 3 and 4. We need the following lemma from Fukumizu and Amari (2000).

Lemma 2. Let E(θ) be a function of class C^1, and let θ_* be a critical point of E(θ). If in every neighborhood of θ_* there exists a point θ such that E(θ) = E(θ_*) and ∂E(θ)/∂θ ≠ 0, then θ_* is a saddle point.

Theorem 7. Let γ_λ(θ_*^(H−1)) be any critical point of E_H in Theorem 1. Assume that ξ_2* ≠ 0 and

Σ_{p=1}^{N} (∂l/∂z)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) · (∂ϕ_C(ũ_2*^T x̃^(p))/∂z) x̃^(p)T ≠ 0.   (45)

Then the critical point γ_λ(θ_*^(H−1)) is a saddle point.

Proof. From Theorem 2, the set of critical points {γ_λ(θ_*^(H−1)) | λ ∈ R} is a straight line in the 2-dimensional affine space defined by ν_1 + ν_2 = ξ_2*, and it is obvious that the error function E_H takes the same value on that 2-dimensional affine space. Here, for any z ∈ C such that z ∉ R, let θ̂^(H) = (ξ_0*, zξ_2*, (1 − z)ξ_2*, ξ_3*, ..., ξ_H*, ũ_2*^T, ũ_2*^T, ũ_3*^T, ..., ũ_H*^T)^T. Obviously, θ̂^(H) ∈ Θ_H belongs to the 2-dimensional affine space defined by ν_1 + ν_2 = ξ_2*, but it is not on the straight line formed by the set of critical points {γ_λ(θ_*^(H−1)) | λ ∈ R}. From Eq. (45), at least one of the following two conditions holds:

Σ_{p=1}^{N} (∂l/∂z)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) · Re[(∂ϕ_C(ũ_2*^T x̃^(p))/∂z) x̃^(p)T] ≠ 0,   (46)
Σ_{p=1}^{N} (∂l/∂z)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) · Im[(∂ϕ_C(ũ_2*^T x̃^(p))/∂z) x̃^(p)T] ≠ 0.   (47)

Then, we can find from Eqs. (35) and (39) that

∂E_H(θ̂^(H))/∂Re[w̃_1] = −∂E_H(θ̂^(H))/∂Re[w̃_2] = 2(z − z̄)ξ_2* Σ_{p=1}^{N} (∂l/∂z)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) · Re[(∂ϕ_C(ũ_2*^T x̃^(p))/∂z) x̃^(p)T].   (48)

Similarly, we obtain

∂E_H(θ̂^(H))/∂Im[w̃_1] = −∂E_H(θ̂^(H))/∂Im[w̃_2] = −2(z − z̄)ξ_2* Σ_{p=1}^{N} (∂l/∂z)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) · Im[(∂ϕ_C(ũ_2*^T x̃^(p))/∂z) x̃^(p)T].   (49)

Thus, at least one of Eqs. (48) and (49) does not vanish, because Im[z] ≠ 0 (where z corresponds to λ in Eq. (39)), because of the assumption ξ_2* ≠ 0, and because of Eq. (45), that is, because at least one of Eqs. (46) and (47) holds. And θ̂^(H) can belong to any neighborhood of the straight line {γ_λ(θ_*^(H−1)) | λ ∈ R}. Therefore, from Lemma 2, the critical point γ_λ(θ_*^(H−1)) is a saddle point. □

Remark 2. Eq. (45) depends only on the training patterns and the parameter θ_*^(H−1), and its left-hand side seldom vanishes.

Theorem 8. Assume that for any w ≠ 0,

Σ_{p=1}^{N} (∂l/∂z)(y^(p), f^(H−1)(x^(p), θ_*^(H−1))) · ϕ_C(w̃^T x̃^(p)) ≠ 0.   (50)

(i) Let β_(0,w)(θ_*^(H−1)) be any critical point of E_H in Theorem 3. Then the critical point β_(0,w)(θ_*^(H−1)) is a saddle point. (ii) Let α_w̃(θ_*^(H−1)) be any critical point of E_H in Theorem 4. Then the critical point α_w̃(θ_*^(H−1)) is a saddle point.

Proof. (i) From the definition of α_w̃ (Eq. (25)), we find that

β_(0,w)(θ_*^(H−1)) ∈ {α_w̃(θ_*^(H−1)) | w̃ ∈ C^{L+1}}.   (51)

That is, β_(0,w)(θ_*^(H−1)) is embedded in a complex (L+1)-dimensional plane. And it is obvious that E_H(β_(0,w)(θ_*^(H−1))) = E_H(α_w̃(θ_*^(H−1))) for any w̃ ∈ C^{L+1}. Here, take w̃ = (w, w^T)^T ∈ C^{L+1} such that w ≠ 0 arbitrarily, and denote it by w̃_X. Also, let θ̂_X^(H) = α_{w̃_X}(θ_*^(H−1)). Then we can find from Eq. (50) that ∂E_H(θ̂_X^(H))/∂Re[ν_1] ≠ 0 or ∂E_H(θ̂_X^(H))/∂Im[ν_1] ≠ 0. And θ̂_X^(H) can belong to any neighborhood of the point β_(0,w)(θ_*^(H−1)). Therefore, from Lemma 2, the critical point β_(0,w)(θ_*^(H−1)) is a saddle point. (ii) The critical point α_w̃(θ_*^(H−1)) is a saddle point because, if w = 0, the equation α_w̃ = β_(0,w) holds. □

Remark 3. Eq. (50) depends only on the training patterns and the parameter θ_*^(H−1), and its left-hand side seldom vanishes.

Theorem 9. There exist many saddle points, each of which forms a straight line, in the complex-valued neural network.

Proof. If θ is a saddle point of Theorems 7 and 8, so is G(θ) for any transformation G ∈ W_{L,H}, where W_{L,H} is the finite group defined in Nitta (2008a). □

In the case of the complex-valued neural network, most of the critical points with respect to the embeddings α_w̃, β_(0,w) and γ_λ are thus saddle points, unlike the real-valued case, in which the critical points with respect to the corresponding embeddings can be local minima or saddle points (Fukumizu & Amari, 2000).

4. Discussion

As described in Section 3.1, no existence proof for local minima in real-valued neural networks is known other than the results in Fukumizu and Amari (2000). Fukumizu et al. confirmed using computer simulations that the local minima they discovered caused 50,000 plateaus, which had a strong negative influence on learning. In Section 3.2, it was proved that most of the local minima which Fukumizu et al. discovered can be resolved by extending the real-valued neural network to complex numbers; most of the critical points caused by the hierarchical structure of the complex-valued neural network are saddle points, which is a prominent property of the complex-valued neural network. However, the problem of whether the complex-valued neural network has fewer local minima than the real-valued neural network remains unsolved, mainly because it is difficult to prove the existence of local minima at all, as is clear from the fact that no existence proof for local minima of the real-valued neural network exists other than the results obtained by Fukumizu et al.

It was assumed in this paper that the complex-valued neural network used in the analysis satisfies a restriction: the net inputs to any two hidden neurons are not nearly rotation-equivalent (Definition 1). The number of hidden neurons of such a complex-valued neural network can be reduced by removing reducibility of the three types (Proposition 1). Recently, the general complex-valued neural network without this restriction has been analyzed and shown to have reducibility of another type (called exceptional reducibility): any irreducible three-layered complex-valued neural network is decomposed into Weight–Rotation–Equivalent (WRE) neural networks, a type of restricted network, and can be reduced by minimizing its WRE neural network (Kobayashi, 2010). It is important to clarify how the exceptional reducibility is related to the local minima of complex-valued neural networks, which is an interesting topic for future study.

5. Conclusions

Results show that most of the local minima caused by the hierarchical structure, which could cause 50,000 plateaus, are resolved by extending the real-valued neural network to complex numbers. That is, most of the critical points resulting from the hierarchical structure of the complex-valued neural network are saddle points, whereas those of the real-valued neural network can be local minima. This is a prominent property of the complex-valued neural network. The local minima investigated in this paper were only those caused by the hierarchical structure of the neural network; local minima of other types might exist in the complex-valued neural network. We believe that the results presented in this paper will be a clue to analyzing the various types of local minima of complex-valued neural networks. In future studies, we will explore such local minima in complex-valued neural networks, compare the number of local minima of complex-valued neural networks with that of real-valued neural networks, and address the problem of the avoidance of local minima.

Acknowledgments

The author would like to give special thanks to the members of the Mathematical Neuroinformatics Group and the anonymous reviewers for valuable comments.

References Arena, P., Fortuna, L., Muscato, G., & Xibilia, M. G. (1998). Lecture notes in control and information sciences: vol. 234. Neural networks in multidimensional domains. London: Springer. Arena, P., Fortuna, L., Re, R., & Xibilia, M.G. (1993). On the capability of neural networks with complex neurons in complex valued functions approximation. In Proc. IEEE int. conf. on circuits and systems (pp. 2168–2171). Benvenuto, N., & Piazza, F. (1992). On the complex backpropagation algorithm. IEEE Transactions on Signal Processing, 40(4), 967–969. Fukumizu, K., Akaho, S., & Amari, S. (2003). Critical lines in symmetry of mixture models and its application to component splitting. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, vol. 15 (pp. 889–896). The MIT Press. Fukumizu, K., & Amari, S. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3), 317–327. Georgiou, G. M., & Koutsougeras, C. (1992). Complex domain backpropagation. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 39(5), 330–334. Hamney, L. G. C. (1998). XOR has no local minima: a case study in neural network error surface analysis. Neural Networks, 11(4), 669–682. Hirose, A. (Ed.) (2003). Complex-valued neural networks. Singapore: World Scientific Publishing. Kim, M.S., & Guest, C.C. (1990). Modification of backpropagation networks for complex-valued signal processing in frequency domain. In Proc. int. joint conf. on neural networks. vol. 3 (pp. 27–31). Kobayashi, M. (2010). Exceptional reducibility of complex-valued neural networks. IEEE Transactions on Neural Networks, 21(7), 1060–1072. Lisboa, P. J. G., & Perantonis, S. J. (1991). Complete solution of the local minima in the XOR problem. Network, 2, 119–124. Nitta, T. (1993). A complex numbered version of the back-propagation algorithm. In Proc. world congress on neural networks. vol. 3 (pp. 576–579). Nitta, T. (1997). An extension of the back-propagation algorithm to complex numbers. Neural Networks, 10(8), 1392–1415. Nitta, T. (2004). Orthogonality of decision boundaries in complex-valued neural networks. Neural Computation, 16(1), 73–97. Nitta, T. (2008a). The uniqueness theorem for complex-valued neural networks with threshold parameters and the redundancy of the parameters. International Journal of Neural Systems, 18(2), 123–134. Nitta, T. (2008b). Complex-valued neural network and complex-valued backpropagation learning algorithm. In P. W. Hawkes (Ed.), Advances in imaging and electron physics, vol. 152 (pp. 153–221). Amsterdam, The Netherlands: Elsevier. Nitta, T. (Ed.) (2009). Complex-valued neural networks: utilizing high-dimensional parameters (p. 504). Pennsylvania, USA: Information Science Reference, ISBN: 978-1-60566-214-5. Nitta, T., & Furuya, T. (1991). A complex back-propagation learning. Transactions of Information Processing Society of Japan, 32(10), 1319–1329 (in Japanese). Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Parallel distributed processing, vol. 1. The MIT Press. Savitha, R., Suresh, S., Sundararajan, N., & Saratchandran, P. (2008). Complex-valued function approximation using an improved BP learning algorithm for feedforward networks. In Proc. international joint conference on neural networks (pp. 2252–2259). Sprinkhuizen-Kuyper, I. G., & Boers, E. J. W. (1998). The error surface of the 2-2-1 XOR network: the finite stationary points. Neural Networks, 11(4), 683–690. Sussmann, H. J. (1992). 
Uniqueness of the weights of minimal feedforward nets with a given input–output map. Neural Networks, 5(4), 589–593.