Adaptive complex-valued stepsize based fast learning of complex-valued neural networks
Yongliang Zhang, He Huang

PII: S0893-6080(20)30013-7
DOI: https://doi.org/10.1016/j.neunet.2020.01.011
Reference: NN 4374
To appear in: Neural Networks
Received date: 5 October 2019; Revised date: 5 December 2019; Accepted date: 14 January 2020

Please cite this article as: Y. Zhang and H. Huang, Adaptive complex-valued stepsize based fast learning of complex-valued neural networks. Neural Networks (2020), doi: https://doi.org/10.1016/j.neunet.2020.01.011.

© 2020 Elsevier Ltd. All rights reserved.
Adaptive complex-valued stepsize based fast learning of complex-valued neural networks

Yongliang Zhang^a, He Huang^a,∗

^a School of Electronics and Information Engineering, Soochow University, Suzhou 215006, P. R. China

∗ Corresponding author. Tel.: +86 512 67871211; fax: +86 512 67871211. Email address: [email protected] (He Huang)
Abstract
The complex-valued gradient descent algorithm is a popular tool to optimize functions of complex variables, especially for the training of complex-valued neural networks. However, the choice of a suitable learning stepsize is a challenging task during the training process. In this paper, an adaptive complex-valued stepsize design method is proposed for complex-valued neural networks by generalizing the adaptable learning rate tree technique to the complex domain. Scaling and rotation factors are introduced to simultaneously adjust the amplitude and phase of the complex-valued stepsize. The search range is thus expanded from a half line to a half plane such that a better search direction is obtained at each iteration. We analyze the dynamics of the algorithm near a saddle point and find that it readily escapes from the saddle point, which guarantees fast convergence and high accuracy. Some experimental results on function approximation and pattern classification tasks are presented to illustrate the advantages of the proposed algorithm over some previous ones.
Keywords: Adaptive complex-valued stepsize, complex-valued neural networks, fast learning, rotation factors, scaling factors, saddle points

1. Introduction
Over recent years, different kinds of neural networks have been widely and successfully applied in various fields with desired performance [1, 2]. Among them, complex-valued neural networks (CVNNs) play an increasingly important role because of their advantages such as high functionality and fast learning [3]. For many practical applications (e.g., image classification, natural language processing and complex-valued signal processing), it has been verified that the performance achieved by CVNNs is superior to that of their real-valued counterparts [4, 5, 6, 7]. Depending on the activation functions, CVNNs are generally classified into three types: real-imaginary type CVNNs [8], amplitude-phase type CVNNs [9] and fully CVNNs [10]. Amplitude-phase type CVNNs perform nonlinear transformations on the amplitude and phase of signals by employing two real-valued activation functions. They are generally applied to complex-valued signal processing because they have a good ability to directly deal with the relationship between the amplitude and phase of complex-valued signals [9]. Real-imaginary type CVNNs employ a pair of real-valued functions as a nonlinear complex-valued activation function, which resolves the issue caused by the unboundedness of holomorphic complex-valued functions. However, the correlation between the real and imaginary components is ignored due to the inherent property of real-imaginary type CVNNs. To overcome this, fully CVNNs have been introduced for practical applications. Fully CVNNs adopt some elementary transcendental functions to guarantee the validity of
the well-defined first-order derivative information. In addition, several efficient training methods for fully CVNNs were proposed in [11, 12] by the adoption of Wirtinger calculus [13, 14]. It is known that, up to now, real-valued learning rates have been utilized in almost all learning algorithms for CVNNs [15, 16]. Different from real-valued neural networks, most of the critical points of CVNNs, where the gradient of the objective function vanishes, are saddle points [17]. As discussed in [18], complex-valued gradient based training algorithms with real positive stepsizes may converge very slowly when they fall into the area near a critical point, which would eventually degrade the performance of learning algorithms. In [19, 20], adaptive real-valued stepsize based methods were established to accelerate the convergence process. However, this issue was not completely resolved by the methods in [19, 20] since the search direction was still obtained from the negative gradient. On the other hand, the authors in [10, 21] initially investigated the effect of complex-valued stepsizes on complex multilayer perceptrons. It was found that the usage of a complex-valued stepsize can enhance the performance for complex-valued tensor decomposition problems [22]. Recently, a theoretical and experimental analysis of the advantages and disadvantages of complex-valued stepsizes for complex gradient descent learning algorithms was presented in [23]. It was shown that the search space can be enlarged from a half line to a half plane when a complex-valued stepsize is adopted for a pre-calculated search direction, and it is thus possible to escape from saddle points. Moreover, a complex Barzilai-Borwein method (CBBM) was proposed in [23] for the choice of complex-valued stepsize by extending the real Barzilai-Borwein method [24, 25] to the complex domain. However, according to our experiments, it is recognized that, similar to its real-valued counterpart, the CBBM suffers from
probabilistic divergence and may be unstable during the whole training procedure. Therefore, the design of complex-valued stepsizes for the learning of CVNNs has not yet been fully addressed and remains challenging. It is thus of great significance to further investigate this issue.

Recently, an adaptable learning rate tree (ALRT) algorithm for real-valued feedforward neural networks was proposed in [26], where the stepsize can be efficiently determined according to the training loss. Different from most adaptive methods such as Adam [27], AdamW [28] and Amsgrad [29], where the stepsize gradually decreases during the training procedure, the ALRT can increase or decrease the stepsize adaptively such that the objective function decreases as much as possible at each iteration. It was further shown by experimental results that the ALRT outperformed other adaptive stepsize methods under certain conditions. However, the ALRT algorithm in the real domain [26] only changes the length of the stepsize along the negative gradient direction. There is much evidence that the negative gradient direction is not necessarily the best search direction for optimization. Fortunately, a complex-valued stepsize with a proper phase provides the possibility of changing the search direction away from the negative gradient one. That is to say, a better search direction can be obtained by multiplying the negative gradient by a reasonable complex-valued stepsize with nonzero phase.

Motivated by these observations, in this paper, an adaptive complex-valued stepsize design method is presented for the learning of fully CVNNs based on an improved ALRT technique. By introducing scaling and rotation factors, both the amplitude of the learning rate and the search direction can be simultaneously adjusted to guarantee that the reduction of the objective function at each iteration is as sufficient as possible. For simplicity, the method proposed in this study is named the complex-valued ALRT (CALRT). The contributions and advantages of the proposed CALRT are as follows: (1) The complex-valued stepsize can be efficiently and easily determined according to the training loss at each iteration. (2) Due to the introduction of a varying phase in the complex-valued stepsize, the CALRT is more capable of escaping from saddle points than the one with a constant complex-valued stepsize and thus converges more quickly. (3) Compared with other adaptive stepsize design methods, the CALRT can find more accurate solutions and the training process is more stable. (4) When the training reaches the vicinity of the optimal solution, the CALRT easily converges since the steepest descent direction is always considered. (5) The initial stepsize is easy to prescribe since it has limited effect on the performance of the CALRT. Finally, some experimental results on function approximation and pattern classification are given to illustrate the effectiveness and advantages of the proposed algorithm over some existing ones.

The rest of this paper is organized as follows. Some preliminaries on fully CVNNs, Wirtinger calculus and the ALRT technique are introduced in Section 2. In Section 3, the basic principle of the CALRT is described. The dynamics of the CALRT is analyzed with simulation results in Section 4 to verify that it is efficient in escaping from saddle points. In Section 5, the effect of some parameters on the performance of the CALRT is discussed and experimental results are obtained to demonstrate the advantages of the CALRT over some existing algorithms. The conclusion is finally drawn in Section 6.

2. Preliminaries

2.1. Fully CVNNs
A fully CVNN is composed of an alternating stack of L layers with linear and nonlinear operations. For $l = 2, 3, \ldots, L$, the output of the $l$th layer is calculated in a recursive way and has the form

$$h_l = g_l(w_l h_{l-1} + b_l), \qquad (1)$$
where $h_{l-1}$ is the output vector of the $(l-1)$th layer, $h_1$ is the input of the CVNN, $w_l$ is the weight matrix connecting the neurons between the $l$th and $(l-1)$th layers, and $b_l$ and $g_l(\cdot)$ are respectively the bias vector and complex-valued activation function of the neurons in the $l$th layer.

CVNNs are commonly trained in a supervised way. Given the training samples $\{(x_n, \hat{y}_n)\}_{n=1}^{I} \subset \mathbb{C}^p \times \mathbb{C}^k$ with $x_n$ being the input and $\hat{y}_n$ being the desired output, the actual final output of the CVNN corresponding to the input $x_n$ is assumed to be $y_n = g_L(w_L h_{L-1} + b_L)$. The purpose of the training of fully CVNNs is to minimize the objective function defined by

$$E(w) = \sum_{n=1}^{I} \ell(y_n, \hat{y}_n), \qquad (2)$$
where $w \in \mathbb{C}^N$ represents all the parameters to be adjusted, including $w_l$ and $b_l$ for $l = 2, 3, \ldots, L$, and $\ell(\cdot, \cdot)$ is defined by

$$\ell(y, \hat{y}) = |y - \hat{y}|^2 = (y - \hat{y})^H (y - \hat{y}), \qquad (3)$$

with $H$ being the conjugate transpose. To minimize (2), the parameters can be updated along the steepest descent direction [30], which is described by

$$w^{(t+1)} = w^{(t)} - \eta_t \nabla_{w^*} E(w^{(t)}), \qquad (4)$$
where $\eta_t \in \mathbb{C}$ is the stepsize at the $t$th iteration and $\nabla_{w^*} E(w^{(t)})$ is defined below.
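To make the preliminaries concrete, the following is a minimal NumPy sketch of the forward pass (1) and the squared-error objective (2)-(3) for a small fully CVNN. The layer sizes, random initialization and helper names (`layer`, `loss`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def layer(h_prev, W, b, g=np.tanh):
    # One fully complex-valued layer, Eq. (1): h_l = g_l(W_l h_{l-1} + b_l).
    # np.tanh acts elementwise on complex inputs, as in fully CVNNs [10].
    return g(W @ h_prev + b)

def loss(y, y_hat):
    # Squared error of Eqs. (2)-(3): |y - y_hat|^2 summed over outputs.
    d = y - y_hat
    return np.real(np.vdot(d, d))   # vdot conjugates its first argument

rng = np.random.default_rng(0)
p, hidden, k = 4, 8, 1              # illustrative layer sizes
x = rng.normal(size=p) + 1j * rng.normal(size=p)
W2, b2 = rng.normal(size=(hidden, p)) * 0.1 + 0j, np.zeros(hidden, complex)
W3, b3 = rng.normal(size=(k, hidden)) * 0.1 + 0j, np.zeros(k, complex)

h2 = layer(x, W2, b2)               # hidden-layer output h_2
y = layer(h2, W3, b3)               # network output y_n
print(loss(y, np.array([0.3 + 0.2j])))
```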
2.2. Wirtinger calculus

As an important tool, the so-called Wirtinger calculus can be utilized to optimize objective functions of complex variables. For a complex-valued function $f(z): \mathbb{C} \to \mathbb{C}$ defined by

$$f(z) = u(x, y) + i v(x, y), \qquad (5)$$

where $i = \sqrt{-1}$ and $z = x + iy$, if the partial derivatives of $u(x, y)$ and $v(x, y)$ with respect to $x$ and $y$ exist, then $f(z)$ is called real-differentiable. Additionally, if the function $f(z)$ satisfies the Cauchy-Riemann equations

$$\frac{\partial u(x, y)}{\partial x} = \frac{\partial v(x, y)}{\partial y}, \qquad \frac{\partial v(x, y)}{\partial x} = -\frac{\partial u(x, y)}{\partial y}, \qquad (6)$$
then it is called analytic. It is known that most of the objective functions (e.g., the mean square error function) used for the training of fully CVNNs cannot satisfy (6). In this situation, Wirtinger calculus provides a promising way to deal with it. In fact, by introducing the conjugate operation, the variables $x$ and $y$ can be expressed as

$$x = \frac{z + z^*}{2}, \qquad y = \frac{z - z^*}{2i}. \qquad (7)$$

From this point of view, $f(z)$ can be regarded as a function of $z$ and $z^*$. According to Wirtinger calculus, $\partial f/\partial z$ and $\partial f/\partial z^*$ can be well defined by treating $z$ and $z^*$ as independent variables. Furthermore, by using the chain rule, one finally obtains

$$\frac{\partial f}{\partial z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i \frac{\partial f}{\partial y}\right), \qquad \frac{\partial f}{\partial z^*} = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i \frac{\partial f}{\partial y}\right), \qquad (8)$$

where $-\frac{\partial f}{\partial z^*}$ is the steepest descent direction. In this paper, it is denoted by $\nabla_{z^*} f(z, z^*)$ or $\nabla_{z^*} f(z)$.
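As a small numerical check of (8), the sketch below approximates $\partial f/\partial z^*$ with finite differences for the non-holomorphic function $f(z) = |z|^2 = z z^*$, whose Wirtinger derivative with respect to $z^*$ is exactly $z$. The helper name and step size are illustrative assumptions.

```python
import numpy as np

def wirtinger_conj_grad(f, z, eps=1e-6):
    # Numerical Wirtinger derivative df/dz* via Eq. (8):
    #   df/dz* = 0.5 * (df/dx + i * df/dy),  with z = x + i*y.
    dfdx = (f(z + eps) - f(z - eps)) / (2 * eps)            # vary the real part
    dfdy = (f(z + 1j * eps) - f(z - 1j * eps)) / (2 * eps)  # vary the imaginary part
    return 0.5 * (dfdx + 1j * dfdy)

# f(z) = |z|^2 is real-valued and non-holomorphic; df/dz* = z, so the
# steepest descent direction -df/dz* = -z points toward the minimizer at 0.
f = lambda z: (z * np.conj(z)).real
z0 = 1.0 - 2.0j
print(wirtinger_conj_grad(f, z0))   # approximately (1 - 2j)
```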
2.3. ALRT

The ALRT technique [26] is an effective method for the adaptive determination of the learning rate in the real domain. By using this method, the stepsize can be increased or decreased during the learning process. Therefore, the stepsize determined by the ALRT is flexible.

Fig. 1. The adjustment process of the learning rate from iteration t to iteration t + 1 [26].
As in Fig. 1, a tree structure is employed to describe the adjustment process for the real-valued stepsize. It is seen from Fig. 1 that there are $N \times K$ nodes generated from iteration $t$ to iteration $t+1$. Here, $K$ represents the number of nodes to be saved at each iteration according to the values of the objective function in ascending order. The other nodes (i.e., the red nodes in Fig. 1) are removed. The saved $K$ nodes are utilized as the candidates to determine the stepsize at the next iteration. Specifically, let $\eta_t$ be the stepsize at the $t$th iteration and $\{r_1, r_2, \ldots, r_N\}$ be the set of scaling factors that controls the change of the stepsize. As suggested in [26], the updating formula for the stepsize $\eta_{t+1}$ at the $(t+1)$th iteration is

$$\eta_{t+1} = \eta_t \cdot r_{n_t}, \qquad (9)$$

where $n_t \in \{1, 2, \ldots, N\}$ and $r_{n_t}$ is the scaling factor chosen at the $t$th iteration. Meanwhile, the beam search algorithm is applied to choose the $K$ possible stepsizes which achieve the first $K$ minimum values of the objective function. Then, these possible stepsizes are used to carry out the next iteration. Please refer to [26] for more details.

2.4. Complex-valued stepsize

For complex-valued gradient descent based learning algorithms, the advantages of applying a complex-valued stepsize are discussed in [23]. It is found that, totally different from the case of a real-valued stepsize, the search space is ultimately expanded from a half line to a half plane by a complex-valued stepsize. This would make the training of fully CVNNs more efficient since a better search direction can indeed be found. When a complex-valued stepsize is adopted, the updating equation (4) can be rewritten as

$$w^{(t+1)} = w^{(t)} - \big(\Re(\eta_t) + i\,\Im(\eta_t)\big) \nabla_{w^*} E(w^{(t)}) = w^{(t)} - |\eta_t| e^{i\theta_t} \nabla_{w^*} E(w^{(t)}), \qquad (10)$$

with

$$\theta_t = \arctan\!\left(\frac{\Im(\eta_t)}{\Re(\eta_t)}\right), \qquad (11)$$

$$|\eta_t| = \sqrt{\Re(\eta_t)^2 + \Im(\eta_t)^2}, \qquad (12)$$

where $\Re(\eta_t)$ and $\Im(\eta_t)$ represent the real and imaginary parts of $\eta_t$, respectively. To guarantee the decrease of the objective function at each iteration, $\theta_t$ is required to satisfy $\theta_t \in (-\pi/2, \pi/2)$.

Fig. 2. Illustration of the search space with a complex-valued stepsize.

Fig. 2 is an illustration of the search space achieved by a complex-valued stepsize. Here, $\theta_t$ represents the rotation angle acting on the steepest descent direction and $|\eta_t|$ is the amplitude of the complex-valued stepsize. That is to say, $\theta_t$ changes the search direction on the basis of the negative gradient and $|\eta_t|$ is used to make significant progress in the reduction of the objective function along the new search direction. In [23], the CBBM was proposed to choose a suitable complex-valued stepsize. The basic idea is to approximate the Hessian matrix in an iterative way to reduce the computational complexity. Then, the updating formula for the complex-valued stepsize is given by

$$\eta_t = \frac{(y^{(t)})^H s^{(t)}}{(y^{(t)})^H y^{(t)}}, \qquad (13)$$

where $s^{(t)} = w^{(t)} - w^{(t-1)}$ and $y^{(t)} = \nabla_{w^*} E(w^{(t)}) - \nabla_{w^*} E(w^{(t-1)})$. In practice, some techniques in [31, 32] can be utilized to facilitate the convergence of the CBBM.
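The CBBM stepsize (13) and the polar form of the update (10) are easy to prototype. The sketch below is an illustrative NumPy version on a toy quadratic objective whose Wirtinger gradient is known in closed form; it is not the authors' code, and the toy objective and helper names are assumptions.

```python
import numpy as np

def cbbm_step(w, w_prev, grad, grad_prev):
    # Complex Barzilai-Borwein stepsize, Eq. (13):
    #   eta_t = (y^H s) / (y^H y),  s = w - w_prev,  y = grad - grad_prev.
    s = w - w_prev
    y = grad - grad_prev
    eta = np.vdot(y, s) / np.vdot(y, y)     # vdot conjugates its first argument
    # Polar form used in Eqs. (10)-(12): eta = |eta| * exp(i * theta)
    return eta, np.abs(eta), np.angle(eta)

def complex_gd_update(w, grad, eta):
    # Update (4)/(10): w <- w - eta * grad, with a possibly complex eta.
    return w - eta * grad

# Toy quadratic E(w) = w^H w, whose Wirtinger gradient is simply w.
E, dE = (lambda w: np.vdot(w, w).real), (lambda w: w)
w_prev = np.array([1.0 + 1.0j, -2.0 + 0.5j])
w = complex_gd_update(w_prev, dE(w_prev), 0.1 - 0.05j)    # one fixed-stepsize step
eta, amp, theta = cbbm_step(w, w_prev, dE(w), dE(w_prev))
w = complex_gd_update(w, dE(w), eta)                      # one CBBM step
print(E(w), amp, theta)
```

For this quadratic the CBBM stepsize equals 1 and the second step lands exactly on the minimizer, which is the expected Barzilai-Borwein behavior when the Hessian is the identity.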
3. CALRT

Fig. 3. Illustration of the determination process of complex-valued stepsize in the CALRT.

In the framework of the ALRT, the stepsize must be real-valued since it is designed for real-valued neural networks. As a result, the search direction in [26] is actually not changed by the ALRT. On the other hand, it is well recognized that the negative gradient direction is generally not the best search direction for unconstrained optimization problems. In order to better train CVNNs and take full advantage of complex-valued stepsizes, the ALRT technique is improved and generalized to the complex domain in this study. It can be efficiently employed to adaptively determine an appropriate complex-valued stepsize for CVNNs during the training procedure. As discussed above, it is known that the phase of the complex-valued stepsize controls the search direction and its amplitude has a remarkable effect on the learning progress along that direction. Therefore, it is reasonable to establish a strategy to determine the amplitude and phase of the complex-valued stepsize. Inspired by but different from the ALRT, two kinds of factors (i.e., the scaling and rotation factors) are introduced to adaptively choose a suitable complex-valued stepsize for the training of CVNNs. More specifically, the rotation factor is applied to change the search direction on the basis of the negative gradient of the objective function, while the scaling factor aims to adjust the amplitude of the complex-valued stepsize. This is the CALRT algorithm proposed in this paper.

The CALRT based adjustment process for the complex-valued stepsize from iteration $t$ to iteration $t+1$ is described in Fig. 3, where $\eta_t$ represents the stepsize at the $t$th iteration, $N$ is the number of the scaling factors $r_k$ ($k = 1, 2, \ldots, N$) for the amplitude, and $M$ is the number of the rotation factors $\theta_k$ ($k = 1, 2, \ldots, M$) acting on the search direction. The integers in the circles represent the ascending order with respect to the values of the objective function for different complex-valued stepsizes. In practice, we take $\theta_k \in (-\pi/2, \pi/2)$ and $r_k \in (0, +\infty)$. Let $p_k = e^{i\theta_k}$. By making full use of the two kinds of factors, the complex-valued stepsize is updated by

$$\eta_{t+1} = |\eta_t| \cdot r_{n_t} \cdot p_{m_t} = |\eta_t| \cdot r_{n_t} \cdot e^{i\theta_{m_t}}, \qquad (14)$$

where $n_t \in \{1, 2, \ldots, N\}$ is the index of the scaling factor at iteration $t$ and $m_t \in \{1, 2, \ldots, M\}$ is the index of the rotation factor at iteration $t$. To efficiently obtain a desired complex-valued stepsize, there are $K$ nodes to be saved at each iteration according to the values of the objective function in ascending order with respect to the stepsizes in (14).

It should be pointed out that, as in (14), the stepsize at the next iteration depends only on the amplitude of the one at the current iteration, rather than on the whole complex-valued stepsize. That is, $\eta_{t+1}$ is designed by employing the information of $|\eta_t|$. As mentioned in Section 2.4, to guarantee the convergence of the learning algorithm, the rotation angle $\theta_t$ is required to satisfy $-\pi/2 < \theta_t < \pi/2$. If the phase of the current stepsize were taken into account in (14) to obtain the next stepsize, the actual rotation angle would become $\theta_{m_t}$ plus that phase. As a result, it is possible that the above requirement could not be satisfied and thus convergence could not be achieved in this situation. Therefore, it is reasonable to only make use of the amplitude of the current stepsize in the updating formula (14).

As seen from (14) and Fig. 3, the CALRT algorithm involves four kinds of parameters to be prescribed in advance: the set of scaling factors $\{r_k: k = 1, 2, \ldots, N\}$, the set of rotation factors $\{\theta_k: k = 1, 2, \ldots, M\}$, the number of branches and the beam size $K$. The number of branches represents the number of all possible stepsizes obtained for each node at the current iteration. Therefore, the number of branches is $N \times M$. As in the ALRT, the CALRT uses the breadth-first beam search strategy to find the best $K$ stepsizes among all possible choices based on the training loss. That is, the beam size $K$ represents the number of stepsizes to be kept, which will be used for the next iteration. The best one is adopted as the complex-valued stepsize. The procedure of the CALRT based fast learning algorithm for fully CVNNs is given in Algorithm 1.

Algorithm 1 The CALRT based fast learning algorithm

Initialization:
    The beam size: $K$;
    The scaling factors: $\{r_n\}_{n=1}^{N}$;
    The rotation factors: $\{\theta_m\}_{m=1}^{M}$;
    The initial stepsize: $\eta^{(0)}$;
    The initial weights and biases: $w^{(0)}$.
Input:
    The training data $\{x_n, \hat{y}_n\}_{n=1}^{I} \subset \mathbb{C}^p \times \mathbb{C}^k$. Let $X = (x_1, x_2, \ldots, x_I)$ and $\hat{Y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_I)$.
$t \leftarrow 1$
repeat
    for $k = 1, 2, \ldots, K$ do
        for $n = 1, 2, \ldots, N$ do
            for $m = 1, 2, \ldots, M$ do
                $\eta^{(t)}_{k,n,m} \leftarrow |\eta^{(t-1)}_k| \cdot r_n \cdot e^{i\theta_m}$
                $w^{(t)}_{k,n,m} \leftarrow w^{(t-1)}_k - \eta^{(t)}_{k,n,m} \cdot \nabla_{w^*} E(w^{(t-1)}_k)$
                $E_{k,n,m} \leftarrow E(w^{(t)}_{k,n,m}; X; \hat{Y})$
            end for
        end for
    end for
    $k_t, n_t, m_t \leftarrow \arg\min_{k,n,m} \{E_{k,n,m}\}$
    $w^{(t)} \leftarrow w^{(t)}_{k_t,n_t,m_t}$
    Sort $\{(\eta^{(t)}_{k,n,m}, w^{(t)}_{k,n,m}) \mid 1 \le k \le K, 1 \le n \le N, 1 \le m \le M\}$ in ascending order of $E_{k,n,m}$ and save the first $K$ pairs as $(\eta^{(t)}_1, w^{(t)}_1), \ldots, (\eta^{(t)}_K, w^{(t)}_K)$.
    $t \leftarrow t + 1$
until convergence
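The inner search of Algorithm 1 amounts to a small beam search over candidate stepsizes. The following is an illustrative NumPy sketch of one CALRT iteration for a single beam entry (K = 1), demonstrated on a toy quadratic objective; the function names, factor sets and toy objective are assumptions rather than the authors' code.

```python
import numpy as np

def calrt_step(w, eta_amp, E, dE,
               scales=(0.5, 1.0, 2.0), angles=(-np.pi / 4, 0.0, np.pi / 4)):
    """One CALRT iteration of Algorithm 1 with beam size K = 1.

    Every candidate stepsize |eta_t| * r_n * exp(i*theta_m) from Eq. (14) is
    tried, and the weights/amplitude giving the smallest objective are kept.
    """
    g = dE(w)
    candidates = (eta_amp * r * np.exp(1j * th) for r in scales for th in angles)
    trials = [(E(w - eta * g), w - eta * g, abs(eta)) for eta in candidates]
    loss, w_new, amp_new = min(trials, key=lambda t: t[0])
    return w_new, amp_new, loss

# Toy check on E(w) = w^H w, whose Wirtinger gradient is w.
E, dE = (lambda w: np.vdot(w, w).real), (lambda w: w)
w, amp = np.array([1.0 + 1.0j, -0.5 + 2.0j]), 0.1
for _ in range(20):
    w, amp, loss = calrt_step(w, amp, E, dE)
print(loss)   # decreases toward 0 as the amplitude adapts
```

Extending this sketch to beam size K > 1 only requires keeping the K best (stepsize, weight) pairs at each iteration instead of one, exactly as in the sorting step of Algorithm 1.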
4. Analysis of CALRT

4.1. Ability to escape from saddle points

There are generally many saddle points in the objective function of fully CVNNs [17], which would seriously affect the convergence speed of training algorithms. However, compared with a real-valued stepsize and a constant complex-valued stepsize, an adaptively varying complex-valued stepsize can make the training algorithm escape easily from saddle points and achieve rapid convergence because a better search direction is chosen at each iteration based on (14). It should be noted that the change of the search direction is fixed for a constant complex-valued stepsize. This means that the flexibility in the choice of a new search direction is limited and thus the ability to escape from saddle points is heavily restricted. Different from a constant complex-valued stepsize, the CALRT endows the step length and direction with much freedom, and both of them are adjusted simultaneously. Consequently, the ability of the learning algorithm to escape from saddle points is inevitably enhanced by the CALRT. To clearly show this advantage of the CALRT, a comparative experiment is conducted for the function in [23]:

$$\big(\cos(\cos(\cos(w))) - 0.1\big)\big(\cos(\cos(\cos(w))) - 0.1\big)^*, \quad w \in \mathbb{C}. \qquad (15)$$

We compare the CALRT with conventional gradient descent algorithms with a real-valued stepsize and a constant complex-valued stepsize, and with the CBBM algorithm [23]. In this experiment, the real-valued stepsize is 0.1, and the constant complex-valued stepsize and the initial stepsize for the CALRT and CBBM are the same and set as 0.1 − 0.05i. The first two algorithms are simply denoted by CONV(RS) and CONV(CS), respectively. The experimental results are given in Fig. 4 and Table 1. It is observed from this figure that there are many local minima (in fact, they are equal) and saddle points for this function. The results show that, starting from the same point near a saddle point, the CALRT algorithm with the scaling factors {0.2, 1, 5} and rotation factors {−π/3, 0, π/3} can efficiently escape from it and ensures the fastest convergence speed among the four algorithms. Furthermore, the most accurate local minimum is found by the CALRT. As seen from Table 1, the CALRT only requires 10 iterations and the obtained local minimum is equal to 9.5860 × 10^-11. This clearly exhibits the efficiency of the CALRT.

Fig. 4. Dynamics with different types of stepsizes starting from a saddle point.

Table 1: The convergence results of different algorithms.

Algorithm   Iteration number   Final solution
CONV(RS)    354                1.9050 × 10^-6
CONV(CS)    105                2.2986 × 10^-6
CBBM        35                 1.5366 × 10^-8
CALRT       10                 9.5860 × 10^-11
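For readers who wish to experiment with this setup, the sketch below defines the objective (15), a numerical Wirtinger gradient based on (8), and minimal CONV and CALRT loops with the stepsizes and factor sets stated above. The starting point is an arbitrary assumption (the paper does not specify it), so the printed values are not expected to reproduce Table 1; the sketch only illustrates the mechanics of the comparison.

```python
import numpy as np

# Objective (15) and a numerical Wirtinger gradient based on Eq. (8).
E = lambda w: abs(np.cos(np.cos(np.cos(w))) - 0.1) ** 2

def dE(w, h=1e-6):
    return 0.5 * ((E(w + h) - E(w - h)) / (2 * h)
                  + 1j * (E(w + 1j * h) - E(w - 1j * h)) / (2 * h))

def run_conv(w, eta, iters):
    # Conventional gradient descent with a fixed real or complex stepsize.
    for _ in range(iters):
        w = w - eta * dE(w)
    return E(w)

def run_calrt(w, amp, iters,
              scales=(0.2, 1.0, 5.0), angles=(-np.pi / 3, 0.0, np.pi / 3)):
    # CALRT with beam size 1 and the factor sets used in Section 4.1.
    for _ in range(iters):
        g = dE(w)
        trials = [(E(w - eta * g), w - eta * g, abs(eta))
                  for eta in (amp * r * np.exp(1j * th)
                              for r in scales for th in angles)]
        _, w, amp = min(trials, key=lambda t: t[0])
    return E(w)

w0 = 1.0 + 0.2j   # arbitrary starting point, not the one used in the paper
print(run_conv(w0, 0.1, 354))               # CONV(RS)
print(run_conv(w0, 0.1 - 0.05j, 105))       # CONV(CS)
print(run_calrt(w0, abs(0.1 - 0.05j), 10))  # CALRT
```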
4.2. Performance around local minimum

If a constant complex-valued stepsize is always adopted, the performance of the learning algorithm would degrade heavily when the starting point is near a local minimum of the objective function [23]. It is even worse than that achieved by a real-valued stepsize. On the other hand, since $\theta_{m_t}$ in (14) can take the value 0, the CALRT completely keeps the real-valued stepsize as a choice. Therefore, the performance of the CALRT near a local minimum is totally different from and better than that achieved by a constant complex-valued stepsize. This is verified by the experimental results in Fig. 5. Actually, in some cases, the function to be optimized would decrease rapidly along the negative gradient direction. It is known that a constant complex-valued stepsize with nonzero phase always makes the search direction deviate from the negative gradient direction. As a result, the convergence speed of the algorithm with a constant complex-valued stepsize becomes much slower than that using a real-valued stepsize. This phenomenon is more obvious when the starting point is close to a local minimum (e.g., the situation in Fig. 5) and was already verified in [23].

Fig. 5. Performance of different algorithms near a local minimum.

As illustrated by Fig. 5, the algorithm with a constant complex-valued stepsize selects other search directions instead of the negative gradient direction and too many iterations are thus required to obtain a satisfactory solution. However, the CALRT algorithm can choose a better search direction at each iteration to guarantee the best performance among the four algorithms.

5. Experimental results

In this section, more experiments are conducted to further show the advantages of the CALRT algorithm. First of all, the Ionosphere dataset in [33] is used to analyze the effect of some parameters (such as the initial stepsize, the scaling and rotation factors, the number of branches and the beam size) on the performance of the CALRT. The Ionosphere dataset has two classes of samples and each sample has 34 features. A fully CVNN with 45 hidden neurons is constructed for this classification task. Then, some real-valued classification problems listed in Table 2 and a function approximation example are considered to verify that the performance achieved by the CALRT is much better than that of some previous algorithms.

Table 2: Summary of the datasets.

Dataset          Data size   Features   Classes
Heart            270         13         2
Iris             150         4          3
Glass            214         10         6
Vehicle          846         18         4
Biodeg           1055        41         2
Wine             178         13         2
Ionosphere       351         34         2
Liver disorder   345         6          2
Opticaldigit     1797        64         10
Banknote         1372        4          2

5.1. Effect of some parameters on CALRT

5.1.1. The initial stepsize

Different from some traditional algorithms, the complex-valued stepsize in our learning algorithm is always adjusted by the CALRT during the training procedure. As a result, the initial stepsize has a particularly small effect on the performance of the CALRT. This is similar to the ALRT [26]. In contrast, the initial stepsize may have a relatively large impact on the performance of some conventional algorithms.

To demonstrate this, we experimentally compare the CALRT with a constant complex-valued stepsize based gradient descent algorithm subject to different initial stepsizes. The other parameters of the CALRT are the same as described in Table 3. The average testing accuracies achieved by the two algorithms on the Ionosphere dataset are summarized in Table 4.

Table 3: Main parameters involved in the CALRT.

Type                   Value
The scaling factors    {0.5, 1.0, 2.0}
The rotation factors   {π/4, 0, −π/4}
Number of branches     9
Beam size              20

Table 4: Effect of the initial stepsize on the performance of different algorithms.

Initial stepsize   CONV (%)   CALRT (%)
0.1                91.71      92.48
0.2e^{iπ/6}        92.57      92.76
0.2e^{iπ/3}        86.29      92.76
0.5e^{iπ/6}        90.00      92.86
0.5e^{iπ/3}        68.95      92.86
1.0e^{iπ/6}        70.19      95.77
1.0e^{iπ/3}        67.24      95.77

It is seen from Table 4 that the average testing accuracies achieved by the CALRT are higher than those by the conventional algorithm for different initial stepsizes. One can also find that our results are more stable. It implies that the effect of the initial stepsize on the CALRT is smaller than on the conventional method. According to Algorithm 1, it is known that the CALRT always tries to choose the best stepsize at each iteration from the candidate stepsizes based on the current values of the objective function. Here, the best stepsize is defined as the one that achieves the minimum value of the objective function at the current iteration. It is believed that the CALRT is not very sensitive to the change of the initial stepsize.
5.1.2. The scaling and rotation factors

In the CALRT framework, the scaling and rotation factors are introduced to adaptively determine a suitable complex-valued stepsize. As discussed before, the search direction can be efficiently changed by the rotation factor, and the scaling factor characterizes the learning progress along the new search direction. Owing to these two kinds of factors, the CALRT exhibits good properties for the learning of fully CVNNs. In some cases (e.g., when the training reaches the neighborhood of a local minimum, as in Fig. 5), it is worth taking the steepest descent direction into consideration. It is thus reasonable to include zero in the set of rotation factors. That is, when the rotation factor is chosen to be 0 by the CALRT, the training proceeds along the steepest descent direction at the current iteration. On the other hand, the CALRT can symmetrically search the half plane starting from a basic search direction, as shown in Fig. 2. Therefore, it is reasonable to choose the rotation factors symmetrically with respect to 0 to ensure that the left and right sides of the basic direction are searched with equal opportunity. For example, the rotation factors can be prescribed as {−π/4, 0, π/4} or {−π/4, −π/6, 0, π/6, π/4}, etc. The setting of the scaling factors follows a similar line to that of the ALRT [26]. We always include 1.0 in the set of the scaling factors.

The effect of the two kinds of factors is experimentally analyzed. The results are given in Fig. 6.

Fig. 6. The training error on the Ionosphere dataset for different factors: (a) the error curves with respect to different scaling factors; (b) the error curves with respect to different rotation factors.

From Fig. 6(a), it is seen that the performance achieved by the scaling factors {0.25, 1.0, 4.0} is better than that by {0.5, 1.0, 2.0}. This may be because, for this dataset, a larger stepsize length makes more learning progress and thus obtains a relatively better result. However, we find that when the elements in the set of the scaling factors are far away from 1.0, fluctuations in the learning may occur. The effect of the rotation factors is illustrated by Fig. 6(b). It is observed that the performance achieved by the rotation factors {π/4, 0, −π/4} is the best among the three groups. We judge the reasons for this situation as follows: (1) The rotation factors {π/8, 0, −π/8} are close to 0, so they have relatively little influence on the change of the negative gradient direction. (2) The two nonzero rotation factors in {3π/8, 0, −3π/8} are respectively near to π/2 and −π/2, which makes the CALRT choose a relatively better direction than the rotation factors {π/8, 0, −π/8} and find a smaller minimum. Therefore, it can perform better than {π/8, 0, −π/8}. However, when the elements are close to π/2 and −π/2, some better search directions may be ignored. As a result, the best result is obtained by the rotation factors {π/4, 0, −π/4} for this dataset. Generally, it is difficult to choose scaling and rotation factors that guarantee the best performance for the CALRT. It is suggested to designate moderate values for the two kinds of factors in practice.

5.1.3. Number of branches and beam size

For the CALRT, the number of branches is equal to the product of the numbers of the scaling and rotation factors. The beam size represents the number of the preserved nodes at each iteration. It is obvious that the number of branches and the beam size play a more important role in the performance of the CALRT than the scaling and rotation factors. Naturally, the larger the beam size and the number of branches are, the better the performance of the CALRT would be. This is because more choices are provided as the two numbers increase and thus a better search direction and learning rate can be determined. When the beam size reaches a certain value, it is found that the performance cannot be improved much further. However, it should be pointed out that the computational complexity is greatly increased when the beam size is too large and too many branches are adopted. A tradeoff between the performance of the CALRT and the computational complexity is required for practical applications. In the following experiments, the effect of the numbers of the scaling and rotation factors and the beam size is discussed. Fig. 7 depicts the curves of the
training error under different numbers of the scaling and rotation factors. It is clearly seen from Figs. 7(a) and 7(b) that when either the number of the scaling factors or the number of the rotation factors increases, the CALRT based learning algorithm can find a smaller minimum with a faster convergence speed. The performance is also much improved when the beam size is large. The experimental results confirm that better performance is guaranteed by larger numbers of the introduced factors. The influence of the beam size on the performance of the CALRT is presented in Fig. 8. As the beam size increases, the performance becomes better and better. One can also see that when the beam size is very large, the improvement of the performance is not very significant. It is obvious that the improvement when the beam size is changed from 40 to 50 is not as large as that when the beam size is changed from 1 to 10. Therefore, from this point of view, it is believed that when the beam size reaches a certain value, the performance of the CALRT would not have a remarkable improvement. This is similar to the ALRT discussed in [26].

Fig. 7. The training error on the Ionosphere dataset for different factors and beam sizes: (a) the error curves with respect to different scaling factors and beam sizes; (b) the error curves with respect to different rotation factors and beam sizes.

Fig. 8. The effect of different beam sizes.

5.2. Pattern classification

In this subsection, some real-valued classification problems are taken to evaluate the performance of the CALRT. The details of the datasets are listed in Table 2. The method in [34] is applied to transform real-valued input features to complex-valued ones. We first take the Ionosphere dataset as an example to give a detailed comparison of the performance of different algorithms, including the conventional gradient descent algorithms with real-valued and complex-valued stepsizes, and the CBBM and CALRT based learning algorithms. As in Section 4, the first two algorithms are respectively abbreviated as CONV(RS) and CONV(CS). The convergence curves on this dataset obtained by the four algorithms are shown in Fig. 9 and the training and testing error rates are given in Fig. 10.

Fig. 9. The training processes on the Ionosphere dataset for different algorithms.

It is observed from Fig. 9 that the performance achieved by the CALRT is the best for the Ionosphere dataset among the four algorithms. Specifically, compared to the other three algorithms, the training error curve obtained by the CALRT has the most rapid convergence speed and the smallest minimum is found. It is noted that, although the performance of the CBBM is also very prominent, it fluctuates heavily during the training process and suffers from probabilistic nonconvergence. In order to facilitate observation, the training and testing error rates are further adopted in the experiments to evaluate the performance of different algorithms. The experimental results are given in Fig. 10, which still confirm the advantages of the CALRT based learning algorithm. It is noted that, from Fig. 10(b), the testing error rate becomes worse when the number of iterations reaches 50. That is, overfitting occurs for the CALRT when the number of iterations is large. To solve this problem, two methods can actually be employed: (1) the early stopping strategy by cross validation [35] and (2) regularization techniques such as the L1 or L2 regularization [36].

Fig. 10. The trend of the testing and training error rates on the Ionosphere dataset for different algorithms: (a) the curves for the training error rate; (b) the curves for the testing error rate.

The classification results on the datasets in Table 2 obtained by different algorithms are summarized in Table 5. These results illustrate that the performance of the CALRT based learning algorithm on the classification problems is particularly better than that achieved by the other three algorithms.

Table 5: The classification results achieved by different algorithms.

Dataset          CONV(RS)   CONV(CS)   CBBM    CALRT
Heart            84.44      82.36      89.14   89.14
Iris             98.22      98.67      97.78   98.67
Glass            81.56      81.56      78.44   83.13
Vehicle          74.17      74.72      75.12   77.80
Biodeg           85.23      85.13      84.70   86.08
Wine             98.11      97.74      97.74   98.49
Ionosphere       87.62      87.86      91.90   93.33
Liver disorder   66.69      62.72      70.87   74.56
Opticaldigit     93.26      93.43      93.16   94.10
Banknote         92.65      92.51      98.20   99.22

5.3. Function approximation

The function to be approximated [37] is

$$f(z) = \frac{1}{1.5}\left(\frac{z_2^2}{z_1} + z_3 + 10 z_1 z_4\right), \qquad (16)$$

where $z = (z_1, z_2, z_3, z_4)^T \in \mathbb{C}^4$. We generate 50 training samples by randomly choosing the real and imaginary parts of $z_l$ ($l = 1, 2, 3, 4$) from the range $[-0.5, 0.5]$. A fully CVNN with 50 hidden neurons is constructed for this complex-valued function approximation problem. The activation function is taken as

$$f(z) = \tanh(z), \quad z \in \mathbb{C}. \qquad (17)$$

The CVNN is trained by the four algorithms CONV(RS), CONV(CS), CBBM and CALRT as before. For the CALRT, the initial stepsize is $\eta = 0.1 - 0.05i$, the beam size is 4 and the other parameters are the same as in Table 3. The average results over 10 independent trials are depicted in Fig. 11. The convergence curves clearly show that the CALRT outperforms the other three algorithms in terms of the convergence speed and approximation accuracy.

Fig. 11. The trend of average errors on the function approximation for different algorithms.
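For reference, a short NumPy sketch of the data generation for (16) under the stated sampling scheme is given below; the random seed and array layout are arbitrary assumptions, and the CVNN training itself is omitted.

```python
import numpy as np

def target(z):
    # Eq. (16): f(z) = (1/1.5) * (z2^2 / z1 + z3 + 10 * z1 * z4), z in C^4.
    z1, z2, z3, z4 = z
    return (z2 ** 2 / z1 + z3 + 10.0 * z1 * z4) / 1.5

rng = np.random.default_rng(42)   # arbitrary seed, not specified in the paper
# 50 samples; the real and imaginary parts of each z_l are drawn from [-0.5, 0.5].
Z = (rng.uniform(-0.5, 0.5, size=(50, 4))
     + 1j * rng.uniform(-0.5, 0.5, size=(50, 4)))
y = np.array([target(z) for z in Z])   # desired complex-valued outputs

print(Z.shape, y.shape)   # (50, 4) (50,)
```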
6. Conclusion

In this paper, an adaptive CALRT method has been established for the determination of the complex-valued stepsize for the learning of fully CVNNs. Scaling and rotation factors have been introduced such that a more appropriate search direction is efficiently found and sufficient learning progress is guaranteed at each iteration. As one of its advantages, the proposed algorithm has the capability of escaping from saddle points during the training procedure. Compared with some existing methods, faster convergence speed and more accurate solutions have been obtained by our algorithm. This has been verified by some experimental results on pattern classification and function approximation problems. It is recognized that some search directions (e.g., Newton and quasi-Newton directions) usually guarantee faster convergence speed than the negative gradient one for unconstrained optimization problems. It is thus of interest to investigate the proposed CALRT technique for other search directions. On the other hand, deep learning has developed rapidly in recent years, and the adaptive complex-valued stepsize based learning of deep CVNNs remains challenging. These will be our future research works.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments that have greatly improved the quality of this paper. This work was jointly supported by the Natural Science Foundation of Jiangsu Province of China under Grant no. BK20181431 and the Qing Lan Project of Jiangsu Province.

References

[1] D. Wang, D. Liu, Learning and guaranteed cost control with event-based adaptive critic implementation, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 6004–6014.
[2] D. Wang, M. Ha, J. Qiao, Self-learning optimal regulation for discrete-time nonlinear systems under event-driven formulation, IEEE Transactions on Automatic Control (2019), in press.
[3] I. Aizenberg, Complex-Valued Neural Networks with Multi-valued Neurons, volume 353, Springer, 2011.
[4] Z. Zhang, H. Wang, F. Xu, Y.-Q. Jin, Complex-valued convolutional neural network and its application in polarimetric SAR image classification, IEEE Transactions on Geoscience and Remote Sensing 55 (2017) 7177–7188.
[5] Q. Sun, X. Li, L. Li, X. Liu, F. Liu, L. Jiao, Semi-supervised complex-valued GAN for polarimetric SAR image classification, arXiv preprint arXiv:1906.03605 (2019).
[6] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, C. J. Pal, Deep complex networks, arXiv preprint arXiv:1705.09792 (2017).
[7] M. Nakamura, Y. Fukumoto, S. Owaki, T. Sakamoto, N. Yamamoto, Experimental demonstration of SPM compensation using a complex-valued neural network for 40-Gbit/s optical 16QAM signals, IEICE Communications Express 8 (2019) 281–286.
[8] T. Nitta, An extension of the back-propagation algorithm to complex numbers, Neural Networks 10 (1997) 1391–1415.
[9] A. Hirose, S. Yoshida, Generalization characteristics of complex-valued feedforward neural networks in relation to signal coherence, IEEE Transactions on Neural Networks and Learning Systems 23 (2012) 541–551.
[10] T. Kim, T. Adalı, Approximation by fully complex multilayer perceptrons, Neural Computation 15 (2003) 1641–1666.
[11] B. Zhang, Y. Liu, J. Cao, S. Wu, J. Wang, Fully complex conjugate gradient-based neural networks using Wirtinger calculus framework: Deterministic convergence and its application, Neural Networks 115 (2019) 50–64.
[12] R. Savitha, S. Suresh, N. Sundararajan, Projection-based fast learning fully complex-valued relaxation neural network, IEEE Transactions on Neural Networks and Learning Systems 24 (2013) 529–541.
[13] K. Kreutz-Delgado, The complex gradient operator and the CR-calculus, arXiv preprint arXiv:0906.4835 (2009).
[14] D. Brandwood, A complex gradient operator and its application in adaptive array theory, IEE Proceedings H - Microwaves, Optics and Antennas 130 (1983) 11–16.
[15] D. Xu, H. Zhang, D. P. Mandic, Convergence analysis of an augmented algorithm for fully complex-valued neural networks, Neural Networks 69 (2015) 44–50.
[16] H. Zhang, X. Liu, D. Xu, Y. Zhang, Convergence analysis of fully complex backpropagation algorithm based on Wirtinger calculus, Cognitive Neurodynamics 8 (2014) 261–266.
[17] T. Nitta, Local minima in hierarchical structures of complex-valued neural networks, Neural Networks 43 (2013) 1–7.
[18] K. Fukumizu, S. Amari, Local minima and plateaus in hierarchical structures of multilayer perceptrons, Neural Networks 13 (2000) 317–327.
[19] D. P. Mandic, A generalized normalized gradient descent algorithm, IEEE Signal Processing Letters 11 (2004) 115–118.
[20] S. L. Goh, D. P. Mandic, Stochastic gradient-adaptive complex-valued nonlinear neural adaptive filters with a gradient-adaptive step size, IEEE Transactions on Neural Networks 18 (2007) 1511–1516.
[21] T. Kim, T. Adali, Fully complex multi-layer perceptron network for nonlinear signal processing, Journal of VLSI Signal Processing Systems for Signal, Image and Video Technology 32 (2002) 29–43.
[22] Y. Chen, D. Han, L. Qi, New ALS methods with extrapolating search directions and optimal step size for complex-valued tensor decompositions, IEEE Transactions on Signal Processing 59 (2011) 5888–5898.
[23] H. Zhang, D. P. Mandic, Is a complex-valued stepsize advantageous in complex-valued gradient learning algorithms?, IEEE Transactions on Neural Networks and Learning Systems 27 (2015) 2730–2735.
[24] J. Barzilai, J. M. Borwein, Two-point step size gradient methods, IMA Journal of Numerical Analysis 8 (1988) 141–148.
[25] M. Raydan, On the Barzilai and Borwein choice of steplength for the gradient method, IMA Journal of Numerical Analysis 13 (1993) 321–326.
[26] T. Takase, S. Oyama, M. Kurihara, Effective neural network training with adaptive learning rate based on training loss, Neural Networks 101 (2018) 68–78.
[27] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[28] I. Loshchilov, F. Hutter, Fixing weight decay regularization in Adam, arXiv preprint arXiv:1711.05101 (2017).
[29] S. J. Reddi, S. Kale, S. Kumar, On the convergence of Adam and beyond, arXiv preprint arXiv:1904.09237 (2019).
[30] D. Xu, J. Dong, H. Zhang, Deterministic convergence of Wirtinger-gradient methods for complex-valued neural networks, Neural Processing Letters 45 (2017) 445–456.
[31] L. Sorber, M. V. Barel, L. D. Lathauwer, Unconstrained optimization of real functions in complex variables, SIAM Journal on Optimization 22 (2012) 879–898.
[32] O. Burdakov, Y.-H. Dai, N. Huang, Stabilized Barzilai-Borwein method, arXiv preprint arXiv:1907.06409 (2019).
[33] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[34] M. F. Amin, K. Murase, Single-layered complex-valued neural network for real-valued classification problems, Neurocomputing 72 (2009) 945–955.
[35] N. Morgan, H. Bourlard, Generalization and parameter estimation in feedforward nets: Some experiments, in: Advances in Neural Information Processing Systems, 1990, pp. 630–637.
[36] S. Scardapane, S. V. Vaerenbergh, A. Hussain, A. Uncini, Complex-valued neural networks with non-parametric activation functions, IEEE Transactions on Emerging Topics in Computational Intelligence (2019), in press.
[37] K. Subramanian, R. Savitha, S. Suresh, A complex-valued neuro-fuzzy inference system and its learning mechanism, Neurocomputing 123 (2014) 110–120.
Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.