Neural Networks 11 (1998) 215–234
Contributed article
Hybrid interior point training of modular neural networks

Peter T. Szymanski, Michael Lemmon*, Christopher J. Bett

Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556, USA

Received 20 September 1995; accepted 23 September 1997

* Requests for reprints should be sent to M. Lemmon.
Abstract

Modular neural networks use a single gating neuron to select the outputs of a collection of agent neurons. Expectation–maximization (EM) algorithms provide one way of training modular neural networks to approximate non-linear functionals. This paper introduces a hybrid interior-point (HIP) algorithm for training modular networks. The HIP algorithm combines an interior-point linear programming (LP) algorithm with a Newton–Raphson iteration in such a way that the computational efficiency of the interior point LP methods is preserved. The algorithm is formally proven to converge asymptotically to locally optimal networks with a total computational cost that scales in a polynomial manner with problem size. Simulation experiments show that the HIP algorithm produces networks whose average approximation error is better than that of EM-trained networks. These results also demonstrate that the computational cost of the HIP algorithm scales at a slower rate than the EM procedure and that, for small-size networks, the total computational costs of both methods are comparable. © 1998 Elsevier Science Ltd. All rights reserved.

Keywords: Modular neural networks; Training; Algorithms; Interior-point methods; Expectation–maximization methods
1. Introduction

An artificial neural network is a distributed computing paradigm consisting of a set of self-similar processing units called neurons. In many neural network architectures, the network's output $\hat{y}$ has the form

$$\hat{y}(\bar{z}) = \sum_m f(\bar{\phi}_m^T \bar{z}) \qquad (1)$$

where $\bar{\phi}_m$ and $\bar{z}$ are vectors in $\mathbb{R}^R$ and $f$ is a mapping from $\mathbb{R}$ into $\mathbb{R}$. Given a target mapping $y: \mathbb{R}^R \to \mathbb{R}$, we are interested in designing a network whose outputs form an approximation $\hat{y}$ of the target mapping. Neural networks are well known to form good approximations of complex input–output mappings. This universal approximation ability (Cybenko, 1989; Hornik et al., 1989; Park and Sandberg, 1991) of neural networks is probably one of the most important reasons for their use. In the networks represented by Eq. (1), each neuron contributes to the network's output. In certain cases, however, it is reasonable to expect that the structure of the target mapping will be different in disjoint regions of the input space. In such cases, there is little benefit to be gained by additively mixing the outputs of the neurons. It might be better to use the output of a single
neuron to compute the network approximation exclusively. Such networks can be viewed as a collection of experts, where each neuron is an expert at approximating the target function over a region in $\mathbb{R}^R$. These mixture-of-experts or modular networks were first introduced by Jacobs et al. (1991). The training of modular networks can be carried out using expectation–maximization (EM) procedures (Dempster et al., 1977; Redner and Walker, 1984; Jordan and Xu, 1993). The results of Redner and Walker (1984) raised concerns about the applicability of EM algorithms because of their linear convergence rate. That work called for the use of more sophisticated optimization methods whose computational costs are well understood. This paper presents just such an algorithm for training modular networks that use Gaussian activation functions to form piecewise linear approximations. The algorithm is based on a modification of a primal interior point (IP) linear programming (LP) algorithm (Karmarkar, 1984; Gonzaga, 1992). The resulting algorithm combines an IP LP algorithm with a Newton–Raphson (NR) iteration in a way that preserves the computational efficiency of the original IP LP algorithm, and is referred to as the hybrid interior point (HIP) algorithm. A small-step version of the HIP procedure was first presented by Lemmon and Szymanski (1994). One principal contribution of this paper is a formal analysis
of the HIP procedure’s asymptotic behaviour and computational complexity. The second principal contribution of this paper is a detailed simulation which compares the performance of the HIP algorithm and the EM procedure. The remainder of this paper is organized as follows. Modular neural networks are introduced in Section 2. The HIP algorithm is presented in Section 3. Theoretical results on the HIP algorithm’s asymptotic behaviour and computational costs are summarized in Section 4. Formal proofs for the results in Section 4 will be found in Appendix A. Detailed simulation experiments comparing the HIP and EM algorithms will be found in Section 5. Section 5 also examines the use of the HIP algorithm in training a modular network to model the input/output behaviour of a fossil fuel electric power generating plant. A final discussion will be found in Section 6.
2. Modular neural networks

This paper adopts a viewpoint in which neural network training is treated as a function approximation problem (Poggio and Girosi, 1990). Let $y: \mathbb{R}^R \to \mathbb{R}$ be a continuous function from $\mathbb{R}^R$ into $\mathbb{R}$ which we call the target mapping. It is assumed that $y$ is unknown, except at a finite set of input/output pairs. The set of $N$ input/output pairs is denoted as

$$T = \{(\bar{z}_1, y(\bar{z}_1)), (\bar{z}_2, y(\bar{z}_2)), \ldots, (\bar{z}_N, y(\bar{z}_N))\} \qquad (2)$$

where $\bar{z}_i \in \mathbb{R}^R$ $(i = 1, \ldots, N)$ is the input and $y(\bar{z}_i) \in \mathbb{R}$ is the associated output. The set $T$ will usually be referred to as the training set. The objective of the training problem is to find a functional $\hat{y}: \mathbb{R}^R \to \mathbb{R}$ from a known parameterized set of functionals $Y$ which minimizes the approximation error's size over the training set.

In this paper, the approximation will be realized by a special class of modular neural network. The structure of a modular network is shown in Fig. 1. The network uses a gating neuron or gate to select between the approximations generated by a collection of agent neurons or agents. Let $\bar{z} \in \mathbb{R}^R$ be the input to the network. In response to this input, each agent produces an ordered pair of outputs, $(a_m(\bar{z}), \hat{y}_m(\bar{z}))$. $a_m: \mathbb{R}^R \to \mathbb{R}$ is a functional from the input space $\mathbb{R}^R$ into the non-negative reals; it measures the activity level of the mth agent in response to an input $\bar{z} \in \mathbb{R}^R$. $\hat{y}_m: \mathbb{R}^R \to \mathbb{R}$ is a functional from the input space into the reals representing the mth agent's approximation of the target mapping at $\bar{z}$. The gate chooses the agent with the largest activity level, and the selected agent's approximation is then used as the network's output. Mathematically, the network's output in response to $\bar{z}$ is written as

$$\hat{y}(\bar{z}) = \hat{y}_m(\bar{z}) \qquad (3)$$

$$m = \arg\max_{1 \le m \le M} a_m(\bar{z}) \qquad (4)$$

Linear Gaussian (LG) modular networks arise when the approximation function is linear and the activation functions are Gaussian. In particular, we assume that the mth agent's activation function has the form

$$a_m(\bar{z}) = q(m)\, e^{\sigma \|\bar{z} - \bar{q}_m\|^2} \qquad (5)$$

where $q(m) \in \mathbb{R}$, $\bar{q}_m \in \mathbb{R}^R$, $\sigma$ is a negative real number, and $\sum_m q(m) = 1$. The activation function is parameterized by the scalar $q(m)$ and the $R$-vector $\bar{q}_m$. The mth agent's approximation function has the form

$$\hat{y}_m(\bar{z}) = \bar{\phi}_m^T \bar{z} \qquad (6)$$

where $\bar{\phi}_m \in \mathbb{R}^R$. This approximation is parameterized by the $R$-vector $\bar{\phi}_m$. We therefore see that the mth agent is parameterized by the scalar $q(m)$ and the two $R$-vectors $\bar{q}_m$ and $\bar{\phi}_m$. In our later discussion, it will be convenient to aggregate the parameter vectors $\bar{q}_m$ and $\bar{\phi}_m$ into a single parameter vector $\bar{v}_m \in \mathbb{R}^{2R}$. In particular, we will let $\bar{v}_m = [\bar{\phi}_m^T\ \bar{q}_m^T]^T$. The vectors

$$\bar{\Theta} = [\bar{v}_1^T\ \bar{v}_2^T\ \cdots\ \bar{v}_M^T]^T \qquad (7)$$

$$\bar{q} = [q(1)\ q(2)\ \cdots\ q(M)]^T \qquad (8)$$
provide a parameterization $(\bar{q}, \bar{\Theta})$ of the entire network.

The squared output approximation error over the training set $T = \{(\bar{z}_j, y(\bar{z}_j))\}_{j=1,\ldots,N}$ is denoted as

$$d_{out}^2(\bar{q}, \bar{\Theta}) = \sum_{j=1}^{N} (y(\bar{z}_j) - \hat{y}(\bar{z}_j))^2 \qquad (9)$$

$$= \sum_{j=1}^{N} \sum_{m=1}^{M} Q(m|j)\,(y(\bar{z}_j) - \bar{\phi}_m^T \bar{z}_j)^2 \qquad (10)$$

The second equation arises by expanding $\hat{y}(\bar{z}_j)$ and letting

$$Q(m|j) = \begin{cases} 1 & \text{if } a_m(\bar{z}_j) \ge a_n(\bar{z}_j) \text{ for all } 1 \le n \le M \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

The vector

$$\bar{Q} = [Q(1|1)\ Q(1|2)\ \cdots\ Q(m|j)\ \cdots\ Q(M|N)]^T \qquad (12)$$
Fig. 1. Modular neural network.
of $Q(m|j)$ introduced in Eq. (11) will be called the network's selection strategy, since it denotes which agent is associated with which input vector. Note that for a given network parameterization $(\bar{q}, \bar{\Theta})$ there will be a unique selection strategy $\bar{Q}$. This means that the network can also be described by the ordered pair $(\bar{Q}, \bar{\Theta})$.
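To make the gating concrete, the following is a minimal numpy sketch (not the authors' implementation) of the LG modular network forward pass of Eqs. (3)–(6); all names and array shapes are illustrative assumptions.

```python
import numpy as np

def lg_forward(z, q, centres, Phi, sigma=-1.0):
    """z: (R,) input; q: (M,) priors summing to 1; centres: (M, R) rows q_m;
    Phi: (M, R) rows phi_m; sigma is the negative real number of Eq. (5)."""
    # Gaussian activation level of each agent, Eq. (5)
    a = q * np.exp(sigma * np.sum((z - centres) ** 2, axis=1))
    m = int(np.argmax(a))      # gate selects the most active agent, Eq. (4)
    return Phi[m] @ z, m       # selected agent's linear approximation, Eqs. (3), (6)
```

Evaluating the argmax over every training input also yields the hard selection strategy $Q(m|j)$ of Eq. (11).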
A naive approach to the training problem would involve minimizing $d_{out}^2(\bar{q}, \bar{\Theta})$ with respect to $\bar{q}$ and $\bar{\Theta}$. This problem, however, is ill-posed (Hadamard, 1964) in the sense that solutions may not be continuous with respect to the data given in the training set. To obtain a well-posed training problem, it is essential that the original output-error criterion be augmented with a regularization term. One common regularization term presents the error of the mth agent on the jth input as

$$d_m^2(j) = (y(\bar{z}_j) - \bar{\phi}_m^T \bar{z}_j)^2 + \|\bar{q}_m - \bar{z}_j\|^2 \qquad (13)$$

The total network error then has the form

$$d^2(\bar{q}, \bar{\Theta}) = \frac{1}{N} \sum_{j=1}^{N} \sum_{m=1}^{M} Q(m|j)\, d_m^2(j) \qquad (14)$$

where $Q(m|j)$ is as defined in Eq. (11).

The training problem involves minimizing $d^2(\bar{q}, \bar{\Theta})$ with respect to $\bar{q}$ and $\bar{\Theta}$. This is a non-linear optimization problem which can be very difficult to solve. An alternative (heuristic) approach involves optimizing with respect to the selection strategy. Since a network parameterization $(\bar{q}, \bar{\Theta})$ has an associated selection strategy $\bar{Q}$, we can attempt to optimize a network with regard to $\bar{\Theta}$ and the selection strategy $\bar{Q}$. This optimization problem is easier to solve since it is linear in $\bar{Q}$ and quadratic in $\bar{\Theta}$. Note, however, that for a given selection strategy $\bar{Q}$ there may not be a set of $\bar{q}$ that realizes this strategy. We therefore need to introduce a heuristic way of relating $\bar{q}$ to $\bar{Q}$. One useful heuristic is to view $q(m)$ as the marginal probability of the mth neuron being activated. One estimate of this probability is given by

$$q(m) = \frac{1}{N} \sum_j Q(m|j) \qquad (15)$$

With the preceding discussion, we can now state the training problem as follows. It is first assumed that there exists a design set consisting of $N$ training points $T = \{(y_j, \bar{z}_j) \in \mathbb{R} \times \mathbb{R}^R,\ j = 1, \ldots, N\}$ where $y_j$ represents a desired output in response to an input $\bar{z}_j$. The training problem is

Problem 1.

minimize: $\displaystyle\sum_{j=1}^{N} \sum_{m=1}^{M} \frac{1}{N} Q(m|j)\left[\|\bar{z}_j - \bar{q}_m\|^2 + (y_j - \bar{\phi}_m^T \bar{z}_j)^2\right]$

with respect to: $Q(m|j),\ \bar{q}_m,\ \bar{\phi}_m$ $(m = 1, \ldots, M;\ j = 1, \ldots, N)$

subject to: $Q(m|j) \ge 0$, $\displaystyle\sum_{m=1}^{M} Q(m|j) = 1$ $(j = 1, \ldots, N)$

where $Q(m|j) \in \mathbb{R}$, $\bar{q}_m \in \mathbb{R}^R$ and $\bar{\phi}_m \in \mathbb{R}^R$ $(m = 1, \ldots, M;\ j = 1, \ldots, N)$. The problem is to determine the set of parameters $[\bar{Q}^T, \bar{\Theta}^T]^T \in \mathbb{R}^{MN} \times \mathbb{R}^{2MR}$ which minimize the objective, where

$$\bar{Q}^T = [Q(1|1), Q(1|2), \ldots, Q(m|j), \ldots] \qquad (16)$$

and

$$\bar{\Theta}^T = [v_1^T, v_2^T, \ldots, v_M^T] \qquad (17)$$

with $v_m^T = [\bar{\phi}_m^T, \bar{q}_m^T] \in \mathbb{R}^R \times \mathbb{R}^R$. This procedure directly estimates the network parameters $\bar{\Theta}$. The network parameters $\bar{q}$ are then obtained from $\bar{Q}$ using Eq. (15). Note that this problem is convex with respect to $\bar{Q}$ and $\bar{\Theta}$, separately. The special structure of this problem means that it can be solved by using EM algorithms or by using the HIP algorithm presented in Section 3.

This section has introduced a set of approximations that can be realized as modular LG neural networks. The modular LG networks were characterized by the parameters $(\bar{q}, \bar{\Theta})$. The regularized training problem was stated with respect to this parameterization. It was noted that solving this problem is more easily accomplished if we optimize the network with respect to the selection strategies $\bar{Q}$. This observation led to a training problem statement whose performance measure is linear with respect to the selection strategies $\bar{Q}$ and quadratic with respect to the other network parameters $\bar{\Theta}$. Problems of this form can be solved using EM or HIP algorithms.
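As a concrete reading of the objective in Problem 1, the hedged sketch below evaluates the regularized errors of Eqs. (13) and (14) and the marginal probabilities of Eq. (15) for a given hard selection strategy; array shapes and names are assumptions, not the authors' code.

```python
import numpy as np

def regularized_cost(Z, y, Phi, centres, Q):
    """Z: (N, R) inputs; y: (N,) targets; Phi, centres: (M, R); Q: (M, N)."""
    N = Z.shape[0]
    # d_m^2(j) = (y_j - phi_m^T z_j)^2 + ||q_m - z_j||^2, Eq. (13)
    resid = (y[None, :] - Phi @ Z.T) ** 2
    dist = ((centres[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    d2 = resid + dist
    total = (Q * d2).sum() / N        # network error d^2(q, Theta), Eq. (14)
    q = Q.sum(axis=1) / N             # marginal activation probabilities, Eq. (15)
    return total, q
```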
3. HIP training algorithm

The training problems under consideration consist of those which may be decomposed into linear and quadratic subproblems. These problems can be written in an alternate form which emphasizes their dependence upon disjoint subsets of their parameters. Letting $\bar{x} = \bar{Q}$ and letting

$$\bar{c}(\bar{\Theta}) = \left[\frac{1}{N} d_1(1)\ \ \frac{1}{N} d_1(2)\ \ \cdots\ \ \frac{1}{N} d_m(j)\ \ \cdots\right]^T \qquad (18)$$

with $d_m(j) = \|\bar{z}_j - \bar{q}_m\|^2 + (y_j - \bar{\phi}_m^T \bar{z}_j)^2$, Problem 1 can be rewritten as Problem 2.

Problem 2.

minimize: $\bar{c}(\bar{\Theta})^T \bar{x}$

with respect to: $\bar{x},\ \bar{\Theta}$

subject to: $A\bar{x} = \bar{b}$, $\bar{x} \ge 0$

where $A = [I_{N \times N} \cdots I_{N \times N}] \in \mathbb{R}^{N \times MN}$ and $\bar{b} = [1, \ldots, 1]^T \in \mathbb{R}^N$. The rewritten objective in Problem 2 is linear in the parameters $\bar{x}$ and quadratic in the parameters $\bar{\Theta}$.

The dependence upon the two distinct parameter sets suggests that an alternating minimization (AM) strategy can be employed to solve Problem 1. In an AM procedure, the original optimization problem is decomposed into
several subproblems which are iteratively solved by component minimizers for each subproblem. Section 3.1 briefly describes the notion of an AM algorithm and introduces the HIP algorithm as an example of an AM procedure. Section 3.2 discusses the component minimizers that are used in the HIP algorithm. Section 3.3 shows how the two component minimizers can be efficiently combined to produce a computationally efficient algorithm.

3.1. AM

Let $h: \mathbb{R}^R \to \mathbb{R}$ be a positive semi-definite functional over $\mathbb{R}^R$. Consider the problem of minimizing this functional $h(\bar{\Phi})$ with respect to the parameter vector $\bar{\Phi}$. It is assumed that the parameter vector can be decomposed as

$$\bar{\Phi}^T = [\bar{\Phi}_1^T\ \bar{\Phi}_2^T\ \cdots\ \bar{\Phi}_K^T] \qquad (19)$$

where $\bar{\Phi}_i \in \mathbb{R}^{R_i}$ $(i = 1, \ldots, K)$ and $\bar{\Phi} \in \mathbb{R}^R$ with $R = \sum_i R_i$. An AM procedure attempts to solve this problem by iteratively minimizing the functional with respect to $\bar{\Phi}_i$ for $i = 1, \ldots, K$ while holding the other subvectors $\bar{\Phi}_j$ $(j \ne i)$ constant. AM procedures can be implemented by two nested loops. The inner loop performs the $K$ component minimizations of $h(\bar{\Phi})$ with respect to $\bar{\Phi}_i$ $(i = 1, \ldots, K)$ while holding $\bar{\Phi}_j$ constant for $j \ne i$. The outer loop encloses the inner loop and controls the algorithm's iterations until some stopping criterion is satisfied. Algorithm 1 gives a pseudo-code description of a general AM algorithm.

Algorithm 1 (general AM algorithm)
  Initialize $j = 0$.
  Choose initial feasible $\bar{\Phi}^{(0)} = \langle \bar{\Phi}_1^{(0)}, \bar{\Phi}_2^{(0)}, \ldots, \bar{\Phi}_K^{(0)} \rangle$.
  repeat
    for $i = 1$ to $K$
      minimize $h(\bar{\Phi}^{(j)})$ with respect to $\bar{\Phi}_i^{(j)}$
      subject to $\bar{\Phi}_k^{(j)}$ held constant for $k \ne i$
    end
    $j = j + 1$
  until (stopping criterion is satisfied)
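A minimal Python rendering of Algorithm 1 may help fix ideas; the component minimizers and the stopping test are supplied by the caller, and all names here are illustrative assumptions.

```python
def alternating_minimization(blocks, minimize_block, converged):
    """blocks: list of parameter sub-vectors Phi_i; minimize_block(i, blocks)
    returns the i-th block minimizing h with the other blocks held fixed."""
    while not converged(blocks):
        for i in range(len(blocks)):   # inner loop: K component minimizations
            blocks[i] = minimize_block(i, blocks)
    return blocks
```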
EM algorithms (Redner and Walker, 1984) provide well-known examples of an AM procedure. The HIP procedure described below, and first introduced in Lemmon and Szymanski (1994), provides another example of an AM procedure. The HIP algorithm consists of a parameterized outer loop whose iteration produces a sequence that converges to a local optimum of the training problem. Each element within that sequence is generated by the inner loop. This inner loop consists of a component linear and a component quadratic minimizer. A pseudo-code listing for the HIP algorithm is shown in Algorithm 2. The behaviour of this algorithm is controlled by parameters $\beta$ and $\gamma$.

The outer loop in the HIP algorithm generates a sequence of solutions $\{\bar{x}^{(k)}, \bar{\Theta}^{(k)}\}$. Associated with the kth iteration in the loop is a parameter $\alpha^{(k)} \ge 0$ which is a term in a monotone increasing sequence of the form

$$\alpha^{(k+1)} = \beta\alpha^{(k)} \qquad (20)$$

where $\beta > 1$; $\alpha^{(k)}$ is a barrier coefficient used by the component linear minimizer. The parameter $\beta$ is a free parameter of the algorithm controlling the number of iterations in the outer loop of the HIP algorithm.

The inner loops of the HIP algorithm consist of the component linear and quadratic minimizers. The linear minimizer solves the LP problem associated with minimizing the network error with respect to the selection strategies $\bar{Q}$, subject to $\bar{\Theta}$ being constant. The quadratic minimizer uses an NR descent method to minimize the network error with respect to $\bar{\Theta}$, subject to the selection strategies $\bar{Q}$ being constant. This minimizer is controlled by the parameter $\gamma$. The combination of linear IP solvers with NR descent methods forms the hybrid IP or HIP algorithm. Detailed discussion of each component minimizer will be found in Section 3.2. Section 3.3 shows how to efficiently combine both minimizers.

Algorithm 2 (HIP AM algorithm)
  Initialize $k = 0$.
  Choose $\alpha^{(0)} > 0$, $\bar{\Theta}^{(0)}$, and initial $\bar{x}^{(0)}$.
  repeat
    $\alpha^{(k+1)} = \beta\alpha^{(k)}$, $\beta > 1$
    Linear update:
      $\bar{x}_0 = \bar{x}^{(k)}$; $i = 0$
      while $\|P_{AX_i} X_i (\alpha^{(k+1)} \bar{c}(\bar{\Theta}^{(k)}) - \bar{x}_i^{-1})\| > 0.2401$
        $\bar{x}_{i+1} = \bar{x}_i - X_i P_{AX_i} X_i (\alpha^{(k+1)} \bar{c}(\bar{\Theta}^{(k)}) - \bar{x}_i^{-1})$
        $i = i + 1$
      end
      $\bar{x}^{(k+1)} = \bar{x}_i$
    Quadratic update:
      for $m = 1, \ldots, M$
        $\bar{v}_m^{(k+1)} = (1 - \gamma^{(k)})\bar{v}_m^{(k)} + \gamma^{(k)}\bar{v}_m^{(k+1),*}$
      end
    $k = k + 1$
  until $\|(\bar{x}^{(k)}, \bar{\Theta}^{(k)}) - (\bar{x}^*, \bar{\Theta}^*)\| < \epsilon$
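The outer/inner loop structure of Algorithm 2 can be sketched as follows; the helper callables stand in for the component minimizers developed in Section 3.2, and the whole listing is a hedged illustration rather than the authors' code.

```python
def hip_train(x, Theta, alpha, beta, n_outer,
              proximity, ssd_step, theta_star, gamma_bound):
    """Hedged skeleton of Algorithm 2; all callables are assumptions."""
    for _ in range(n_outer):
        alpha *= beta                          # barrier growth, Eq. (20)
        # Linear update: SSD steps until the proximity measure is <= 0.2401
        while proximity(x, alpha, Theta) > 0.2401:
            x = ssd_step(x, alpha, Theta)
        # Quadratic update: damped step toward the closed-form optimum
        Theta_new = theta_star(x)              # Eqs. (25) and (26)
        gamma = gamma_bound(Theta, Theta_new)  # Eq. (27)
        Theta = (1 - gamma) * Theta + gamma * Theta_new
    return x, Theta
```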
3.2. Component minimizers

The HIP algorithm combines an IP LP algorithm with a quadratic minimization procedure. Both of these component minimizers are discussed in detail below.

3.2.1. Linear minimizer

The IP method for solving the LP subproblems is a barrier method that uses logarithmic barriers to keep solutions inside the feasible set. The method generates a sequence of points that follows a path in the interior of the polyhedral set of feasible solutions. The resulting central path guides the approximate solutions to the optimal solution on the feasible set's boundary. The path-following algorithm is fast, taking $O(\sqrt{n} L)$ iterations to converge with a computational cost of $O(n^{3.5} L)$ floating point operations, where $n$ is the dimension of the LP problem and $L$ denotes the ''size'' of the LP problem, i.e. the number of bits required to represent the coefficients of $A$, $\bar{b}$ and $\bar{c}$. (Note: in our applications the LP problems will be of size $n = NM$.)

The IP method is a path-following algorithm (Gonzaga, 1992) which solves the following sequence of optimization problems. Each problem is based upon a primal parameterization of Problem 2 when minimized solely with respect to $\bar{x}$. Let the kth solution to this problem be denoted as $\bar{x}^{(k)}$ with ith component $x_i^{(k)}$. The kth problem has the form

minimize: $\displaystyle\alpha^{(k)} \bar{c}^T \bar{x}^{(k)} - \sum_i \log x_i^{(k)}$  (21)

with respect to: $\bar{x}^{(k)}$

subject to: $A\bar{x}^{(k)} = \bar{b}$, $\bar{x}^{(k)} \ge 0$
where $\alpha^{(k)} \ge 0$ $(k = 1, \ldots, K)$ is a monotone increasing sequence of real numbers generated by the iteration $\alpha^{(k+1)} = \beta\alpha^{(k)}$ where $\beta > 1$. $\bar{x}^*(\alpha^{(k)})$ denotes the optimal solution for the kth optimization problem in the sequence and is referred to as a central point. The locus of all points $\bar{x}^*(\alpha^{(k)})$, where $\alpha^{(k)} \ge 0$, is called the central path. The augmented problem takes the original LP cost function and adds a logarithmic barrier which keeps the central points away from the boundaries of the feasible set. As $\alpha$ increases, the effect of the barrier is decreased, thereby allowing the kth central point to approach the LP problem's optimal solution in a controlled manner. Fig. 2 depicts the feasible set for a sample LP problem along with the central path, central points, and a set of level curves corresponding to contours of the barrier function.

Fig. 2. IP concepts.

IP algorithms solve the LP problem by approximately solving the sequence of augmented problems shown in Eq. (21). The approximate central points are computed using one or more scaling steepest descent (SSD) updates of the form

$$\bar{x}^{(k+1)} = \bar{x}^{(k)} - X P_{AX} X (\alpha^{(k+1)} \bar{c} - (\bar{x}^{(k)})^{-1}) \qquad (22)$$

where $X = \mathrm{diag}(x_1, x_2, \ldots, x_n) \in \mathbb{R}^{n \times n}$ is a scaling transformation, $P_A = I - A^T (A A^T)^{-1} A \in \mathbb{R}^{n \times n}$ is an orthogonal projection, and $\bar{x}^{-1} = [x_1^{-1}, x_2^{-1}, \ldots, x_n^{-1}]^T$. Fig. 3 illustrates this sequence of approximate central points.

Fig. 3. Path-following example.

The IP algorithm can be proven to converge to an $\epsilon$-neighbourhood of the optimal solution to the LP problem provided these approximate solutions are close to the central path. The following proximity measure is essential for characterizing this notion of ''nearness'' to the central path:

$$d(\bar{x}^{(k)}, \alpha^{(k)}) = \|P_{AX^{(k)}} X^{(k)} (\alpha^{(k)} \bar{c} - (\bar{x}^{(k)})^{-1})\| \qquad (23)$$

Satisfying the proximity condition below places $\bar{x}^{(k)}$ in the region of quadratic convergence for the SSD algorithm. Once a point is near the central path and satisfies the proximity condition

$$d(\bar{x}^{(k)}, \alpha^{(k)}) \le 0.5 \qquad (24)$$

the IP algorithm must maintain the proximity of successive points to the central path. This is generally done by selecting a $\beta$ and then using a sufficient number of SSD steps to enforce the proximity condition. If $\beta$ is chosen so that only a single SSD step is required, then the IP algorithm is said to be a small-step procedure. Small-step algorithms have been shown to converge in $O(\sqrt{n} L)$ iterations with an associated computational cost of $O(n^{3.5} L)$ (Gonzaga, 1992). If a larger value of $\beta$ is used, then several SSD steps may be required to enforce the proximity condition. Such algorithms are often referred to as large-step IP algorithms. In general, large-step algorithms can exhibit considerably lower total computational cost than their small-step counterparts.
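The SSD update of Eq. (22) and the proximity measure of Eq. (23) translate directly into numpy; the sketch below assumes the standard-form constraints $A\bar{x} = \bar{b}$, $\bar{x} > 0$ and is illustrative only.

```python
import numpy as np

def ssd_step(x, A, c, alpha):
    """One scaling steepest descent update, Eq. (22)."""
    X = np.diag(x)                        # scaling transformation
    AX = A @ X
    # orthogonal projection onto the null space of AX
    P = np.eye(len(x)) - AX.T @ np.linalg.solve(AX @ AX.T, AX)
    g = alpha * c - 1.0 / x               # gradient of the barrier objective
    return x - X @ (P @ (X @ g))

def proximity(x, A, c, alpha):
    """Proximity measure d(x, alpha), Eq. (23)."""
    X = np.diag(x)
    AX = A @ X
    P = np.eye(len(x)) - AX.T @ np.linalg.solve(AX @ AX.T, AX)
    return np.linalg.norm(P @ (X @ (alpha * c - 1.0 / x)))
```

In the HIP inner loop, $\bar{c}$ is held fixed at $\bar{c}(\bar{\Theta}^{(k)})$ while these steps are applied.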
3.2.2. Quadratic minimizer

The quadratic minimizer minimizes $\bar{c}(\bar{\Theta})^T \bar{x}$ with respect to $\bar{\Theta}$. The determination of successive quadratic parameter estimates, termed the $\Theta$-update, is accomplished using standard descent techniques to optimize $\bar{c}(\bar{\Theta})^T \bar{x}$ with respect to $\bar{\Theta}$.

The $\Theta$-update uses an NR approach to solve the quadratic optimization with respect to $\bar{\Theta}$. The basic form of an NR update is

$$\bar{\Theta}^{(k+1)} = \bar{\Theta}^{(k)} - [H(\bar{\Theta}^{(k)})]^{-1} \nabla_{\bar{\Theta}}(\bar{\Theta}^{(k)})$$

where $H(\bar{\Theta})$ is the Hessian of the objective functional with respect to $\bar{\Theta}$ and $\nabla_{\bar{\Theta}}(\bar{\Theta})$ is the gradient with respect to $\bar{\Theta}$, evaluated at the point $\bar{\Theta}^{(k)}$. Typically, the update is performed iteratively for general problems. The optimization here is quadratic, however, so the following closed form solution results: $\bar{\Theta}^* = [\bar{v}_1^*, \bar{v}_2^*, \ldots, \bar{v}_M^*]$ where $\bar{v}_m^* = [\bar{\phi}_m^*, \bar{q}_m^*]$,

$$\bar{\phi}_m^* = \left[\sum_{j=1}^{N} \frac{1}{N} Q(m|j)\, \bar{z}_j \bar{z}_j^T\right]^{-1} \sum_{j=1}^{N} \frac{1}{N} Q(m|j)\, \bar{z}_j y_j \qquad (25)$$

and

$$\bar{q}_m^* = \frac{\displaystyle\sum_{j=1}^{N} \frac{1}{N} Q(m|j)\, \bar{z}_j}{\displaystyle\sum_{j=1}^{N} \frac{1}{N} Q(m|j)} \qquad (26)$$

These minimizing solutions assume that $\sum_{j=1}^{N} (1/N) Q(m|j) \ne 0$ and that $[\sum_{j=1}^{N} (1/N) Q(m|j) \bar{z}_j \bar{z}_j^T]$ is non-singular. The first condition will always be satisfied, as at least one sample point will always be assigned to a model ($\forall m, \exists j$ such that $Q(m|j) > 0$). The second condition can be violated for insufficient amounts of data or when the data are linearly dependent. The problem can be corrected in such a case by reducing the number of models or increasing the number of samples to ensure the linear independence of the data.
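Because the quadratic subproblem has the closed form of Eqs. (25) and (26), the $\Theta$-update reduces to per-agent weighted least squares; the following is a hedged numpy sketch under the stated non-singularity assumptions, with all names illustrative.

```python
import numpy as np

def theta_star(Z, y, Q):
    """Z: (N, R) inputs; y: (N,) targets; Q: (M, N) selection weights."""
    N = Z.shape[0]
    Phi, centres = [], []
    for Qm in Q:                        # one agent (model) at a time
        w = Qm / N
        A = (Z * w[:, None]).T @ Z      # (1/N) sum_j Q(m|j) z_j z_j^T
        b = (Z * w[:, None]).T @ y      # (1/N) sum_j Q(m|j) z_j y_j
        Phi.append(np.linalg.solve(A, b))     # Eq. (25); A assumed non-singular
        centres.append(w @ Z / w.sum())       # Eq. (26); assumes w.sum() > 0
    return np.array(Phi), np.array(centres)
```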
3.3. Combining component minimizers

The behaviour of the HIP algorithm is controlled by the parameters $\beta$ and $\gamma$. Choices for $\beta$ are dictated by the desired number of iterations in the outer loop of the HIP algorithm. If a small enough $\beta$ is used, then we have a small-step HIP algorithm. For larger $\beta$, a large-step HIP procedure results. The primary difference between these methods lies in the number of SSD iterations required in the linear component minimizer. $\beta$ is therefore chosen to control the total number of iterations in the outer loop of the HIP algorithm. The effect of specific $\beta$ choices on the algorithm's total computational cost is discussed in Section 4. These results suggest that choices for $\beta$ resulting in large-step HIP algorithms should be fastest. The simulation results in Section 5 support this assertion.

Choices for $\gamma$ are important for ensuring that the component minimizers work well together. The desirable choices for $\gamma$ are summarized below. Formal proofs justifying the desirability of these choices will be found in Section 4 and Appendix A.

A key concern in the HIP algorithm is the proximity of the intermediate solutions in the sequence $\{\bar{x}^{(k)}, \bar{\Theta}^{(k)}\}$ to the central path. The proximity of successive solutions to the central path (see Eq. (24)) results in the attractive convergence properties of the basic IP algorithm. This condition must be maintained by the HIP algorithm if the attractive computational properties of the IP method are to be preserved. Note, however, that performing the quadratic minimization changes the parameter vector $\bar{\Theta}$, and hence changes the cost vector $\bar{c}(\bar{\Theta})$ (see Fig. 4). It is possible that changes to the cost vector will rotate the cost vector about the current solution $\bar{x}^{(k)}$ in such a way that $\bar{x}^{(k)}$ no longer satisfies the proximity condition in Eq. (24). To make sure that the proximity condition is not violated, it is necessary to constrain the step size of the $\bar{\Theta}$-update by choosing the $\gamma$ parameter appropriately. In Section 4 and Appendix A it is shown that the appropriate choice is

$$\gamma^{(k)} < \frac{0.12995}{n \|\bar{c}(\bar{\Theta}^{(k+1),*}) - \bar{c}(\bar{\Theta}^{(k)})\|} \qquad (27)$$

For this choice of $\gamma$ the HIP algorithm preserves the computational efficiency of the underlying IP LP algorithm.

Fig. 4. Effect of cost plane rotation.

4. Theoretical results

The utility of HIP algorithms rests in their convergence and computational properties. This section summarizes the key technical results supporting the algorithm's asserted computational efficiency. The results pertain to small-step versions of the HIP algorithm, but related results for the large-step version can be obtained using similar arguments. Proofs for the following results appear in Szymanski (1995), Gonzaga (1992) and in Appendix A. The results deal with proximity, convergence, and computational properties.

4.1. Proximity

As discussed earlier, the HIP algorithm must maintain the proximity condition given in Eq. (24). This proximity condition is enforced by using the appropriate number of SSD iterations for a given $\beta$. The following classical results (Gonzaga, 1992) reparameterize $\beta$ with respect to $\nu$ using the equation

$$\beta = 1 + \nu/\sqrt{n} \qquad (28)$$
When $\nu \in (0, 0.1]$, we say the algorithm is a small-step procedure. Theorem 1 and Corollary 2 specify how SSD steps affect proximity for a small-step HIP algorithm. Theorem 1 indicates that only a single SSD step is required to maintain proximity in moving from $\bar{x}^{(k)}$ to $\bar{x}^{(k+1)}$, whereas Corollary 2 details the ''extra'' proximity which is gained by taking multiple SSD steps. Theorem 3 specifies the effect of SSD steps on proximity for large-step HIP algorithms. It states that a linear number of steps ($O[n\mu^2/(1 + \mu^2)]$ steps with $\mu = \nu/\sqrt{n}$) is required to find $\bar{x}^{(k+1)}$ satisfying Eq. (24), given an initial $\bar{x}^{(k)}$ which satisfies Eq. (24).

Theorem 1 (Gonzaga, 1992). Let $d(\bar{x}, \alpha, \bar{\Theta}) = \|P_{AX} X (\alpha \bar{c}(\bar{\Theta}) - \bar{x}^{-1})\|$ be the proximity measure, where $\bar{x}^{-1} = \langle x_1^{-1}, x_2^{-1}, \ldots, x_n^{-1} \rangle^T$ and $\alpha > 0$. If $d(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)}) \le 0.5$ and $\alpha^{(k+1)} = \alpha^{(k)}(1 + \nu/\sqrt{n})$ where $\nu \in (0, 0.1]$, then one SSD step finds $(\bar{x}^{(k+1)}, \bar{\Theta}^{(k)})$ such that $d(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le 0.5$.

Corollary 2 (Appendix A). Assuming the same conditions as in Theorem 1, $J$ SSD steps produce a point $(\bar{x}_J^{(k+1)}, \bar{\Theta}^{(k)})$ such that $d(\bar{x}_J^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le d_1 = (0.7)^{2^J}$.

Theorem 3 (Gonzaga, 1992). Let $d(\bar{x}, \alpha, \bar{\Theta})$ be the proximity measure, where $\alpha > 0$. If $d(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)}) \le 0.5$ and $\alpha^{(k+1)} = \alpha^{(k)}(1 + \mu)$ with $\mu \ge 0.1/\sqrt{n}$, then $O(n\mu^2/(1 + \mu^2))$ SSD updates find a new point $(\bar{x}^{(k+1)}, \bar{\Theta}^{(k)})$ such that $d(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le 0.5$.

The preceding results are used in determining the choice of $\gamma$ introduced in Eq. (27). Theorem 4 describes a bound on $\gamma^{(k)}$ that maintains the proximity condition. Note that this bound applies to either the small-step or large-step HIP algorithms.

Theorem 4 (Appendix A). Let $\bar{\Theta}^{(k)}$ and $\bar{\Theta}^{(k),*}$ be the current and minimizing parameter vectors at time $k$. Let $\bar{c}^{(k)} = \bar{c}(\bar{\Theta}^{(k)})$ and $\bar{c}^{(k),*} = \bar{c}(\bar{\Theta}^{(k),*})$ and assume $\bar{c}(\bar{\Theta})$ is a non-negative, convex function of $\bar{\Theta}$. Let $d(\bar{x}, \alpha, \bar{\Theta})$ be the proximity measure. Assume $d(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) = d_1 < 0.5$ and let $\bar{\Theta}^{(k+1)} = (1 - \gamma^{(k)})\bar{\Theta}^{(k)} + \gamma^{(k)}\bar{\Theta}^{(k+1),*}$. If $\gamma^{(k)}$ is chosen as

$$\gamma^{(k)} < \frac{d_2 - d_1}{2n\|\bar{c}^{(k+1),*} - \bar{c}^{(k)}\|} \qquad (29)$$

where $d_1 < d_2 = 0.5$, then $d(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k+1)}) < d_2 = 0.5$.

The component minimizers together form an iteration that computes $(\bar{x}^{(k+1)}, \bar{\Theta}^{(k+1)})$ in proximity to $\bar{x}^*(\alpha^{(k+1)}, \bar{\Theta}^{(k+1)})$ given $(\bar{x}^{(k)}, \bar{\Theta}^{(k)})$ which is already in proximity to $\bar{x}^*(\alpha^{(k)}, \bar{\Theta}^{(k)})$. Theorem 5 quantifies the result, for both the small-step and large-step cases, that is found in Eq. (27).

Theorem 5 (Appendix A). Let $d(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)}) \le 0.5$ and let $\alpha^{(k+1)} = \alpha^{(k)}(1 + \mu)$ where $\mu < 0.1/\sqrt{n}$ for the small-step algorithm and $\mu \ge 0.1/\sqrt{n}$ for the large-step case. Let $\bar{\Theta}^{(k+1)} = (1 - \gamma^{(k)})\bar{\Theta}^{(k)} + \gamma^{(k)}\bar{\Theta}^{(k+1),*}$ where

$$\gamma^{(k)} < \frac{0.12995}{n\|\bar{c}^{(k+1),*} - \bar{c}^{(k)}\|}$$

Then one iteration of the small-step variant of Algorithm 2 with two SSD steps, or one iteration of the large-step variant with $O(n\mu^2/(1 + \mu^2))$ SSD steps, produces $(\bar{x}^{(k+1)}, \bar{\Theta}^{(k+1)})$ from $(\bar{x}^{(k)}, \bar{\Theta}^{(k)})$ such that $d(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k+1)}) < 0.5$.

4.2. Asymptotic convergence

The component minimizers' combination as an HIP algorithm is only useful if it converges to an optimum for Problem 1. The preceding theorems only imply that the HIP algorithm will converge to a fixed point $(\bar{x}^*, \bar{\Theta}^*)$ as $\alpha^{(k)}$ increases. This point corresponds to an optimum of the LP subproblem and an optimum for the quadratic subproblem. Proximity and central paths do not, however, guarantee that $(\bar{x}^*, \bar{\Theta}^*)$ is a local optimum for Problem 1. This section summarizes results establishing the asymptotic convergence of the small-step HIP algorithm to a local optimum. These results first quantify the changes in the linear and quadratic solutions as functions of the LP subproblem's duality gap. Those results then imply that the solution sequence is Cauchy and converges to a fixed point. It is then shown that this fixed point is a local optimum for the regularized training problem.

Convergence results for the HIP algorithm depend heavily on the LP subproblem's duality gap, as do similar results for the LP solvers. The duality gap $D^{(k)}$ is a non-negative quantity that denotes an upper bound on the distance to an optimal solution to the LP subproblem in terms of problem cost. An expression for the bound on the duality gap $d(\bar{x}, \alpha)$ will be found in Gonzaga (1992). This expression extends to the HIP algorithm, where $d(\bar{x}, \alpha)$ becomes $d(\bar{x}, \alpha, \bar{\Theta})$. With this extension, a bound on the duality gap at the kth iteration for the HIP procedure is

$$D^{(k)} \le \frac{n + \sqrt{n}\, d(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)})}{\alpha^{(k)}} \qquad (30)$$

$D^{(k)}$ is a function of $\alpha^{(k)}$ since $d(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)})$ can be bounded by 0.5 from the previous proximity conditions. Thus, as $\alpha^{(k)}$ increases, the distance to an optimal LP solution decreases. This property is used in the succeeding results, where $D^{(k)}$ takes on a role similar to that of a descent function.

The component linear and quadratic minimizers make incremental changes to $\bar{x}^{(k)}$ (or $\bar{Q}$) and $\bar{\Theta}^{(k)}$, where the changes can be bounded by a function of the duality gap. These changes rely on satisfaction of proximity conditions and on the forms of the updates themselves. The changes in $\bar{x}^{(k)}$ can be related directly to the duality gap. Theorem 6 describes the change in $\bar{x}$ and provides a bound on
$\|\bar{x}^{(k+1)} - \bar{x}^{(k)}\|$ in terms of the duality gap. The changes in $\bar{\Theta}$ depend directly on the changes in $\bar{x}$ due to the updates in Eqs. (25) and (26). Theorem 7 details the changes in $\bar{\Theta}^*$ and provides a bound on $\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k),*}\|$ in terms of $\|\bar{x}^{(k+1)} - \bar{x}^{(k)}\|$. Both norms have upper bounds which are functions of the duality gap. Both are therefore strictly decreasing as the duality gap is strictly decreasing, implying that $\|\bar{x}^{(k+1)} - \bar{x}^{(k)}\| \to 0$ and $\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k),*}\| \to 0$ as $D^{(k)} \to 0$.

Theorem 6 (Appendix A). Let $\bar{Q}^{(k)}$ and $D^{(k)}$ be the set of linear parameters (probabilities) and the duality gap at iteration $k$, respectively. Let $\bar{c}^{(k)} = \bar{c}(\bar{\Theta}^{(k)})$ and let $d(\bar{x}, \alpha, \bar{\Theta})$ be the proximity measurement, where $\bar{x} = \bar{Q}$. Let $\alpha^{(k+1)} = \alpha^{(k)}(1 + \nu/\sqrt{n})$ where $\nu \in (0, 0.1]$. Assume that $\bar{x}^{(k)}$ satisfies the proximity condition $d(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)}) \le 0.5$. For two SSD updates, the change in $\bar{Q}$, $\|\Delta Q\| = \|Q^{(k+1)} - Q^{(k)}\|$, is

$$\|\Delta Q\| < 2(d(1 + \nu) + \nu) D^{(k)} \qquad (31)$$

where $d = d(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)})$.

Theorem 7 (Appendix A). Let $Q(m|j) = Q^{(k)}(m|j)$ and $\Delta Q(m|j) = Q^{(k+1)}(m|j) - Q^{(k)}(m|j)$. Define $\bar{v}_m^* = [(\bar{\phi}_m^*)^T, (\bar{q}_m^*)^T]^T \in \mathbb{R}^{2R}$ where $\bar{\phi}_m^*$ and $\bar{q}_m^*$ are defined as in Eqs. (25) and (26). Let $Q_{CP}$ be the set of all $Q$ values on or near the central path. Define

$$A_m = \sum_{j=1}^{N} \frac{1}{N} Q(m|j)\, \bar{z}_j \bar{z}_j^T, \qquad E_m = \sum_{j=1}^{N} \frac{1}{N} \Delta Q(m|j)\, \bar{z}_j \bar{z}_j^T$$

where $A_m, E_m \in \mathbb{R}^{R \times R}$. Assume bounded inputs and outputs for the design set $T$: $\|\bar{z}\| \le z$, $|y| \le Y_{max}$. Assume that $A_m$ is of full rank for $Q \in Q_{CP}$. Let $\mu_{max} = \sup_{Q \in Q_{CP}} \|A_m^{-1}\|$, $\|E_m\| < 1/\mu_{max}$, and $b_{max} = \sup_{Q \in Q_{CP}} [\sum_{j=1}^{N} (1/N) Q(m|j)]^{-1}$. Then

$$\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k),*}\| \le M K \|\Delta Q\| \qquad (32)$$

where $K = \mu_{max} z Y_{max}[1 + \mu_{max} z^2/(1 - r_1)] + b_{max} z[1 + b_{max}/(1 - r_2)]$, $r_1 = \|A_m^{-1}\| \|E_m\|$ and $r_2 = [\sum_{j=1}^{N} (1/N) \Delta Q(m|j)]/[\sum_{j=1}^{N} (1/N) Q(m|j)]$.
The significance of Theorems 6 and 7 is that the norms of the changes in the parameters have bounds which are functions of the duality gap. The results indicate that the changes in the parameters are guaranteed to decrease as the duality gap decreases. It is initially surprising that both parameter changes can be bounded by the duality gap. However, considering that the problem is expressed as a linear subproblem, it is not surprising that the linear parameters' changes are functions of $D^{(k)}$, which is intrinsic to LP problems. It is also not surprising that the quadratic parameters' changes depend upon $D^{(k)}$, as the closed form solutions for the minimizing parameter vectors depend directly upon the linear parameters, which themselves depend upon $D^{(k)}$.

Asymptotic convergence of the small-step variant requires that $\{(\bar{x}^{(k)}, \bar{\Theta}^{(k)})\} \to (\bar{x}^*, \bar{\Theta}^*)$ as $k$ increases. The preceding two results do not imply this condition. Rather, they imply that the linear parameters converge and the optimal estimates of the quadratic parameters converge. In order to prove convergence, both the linear and quadratic parameters must converge simultaneously. For this to occur, the convex combination coefficient $\gamma^{(k)}$ must tend to and remain fixed at unity as $k$ increases. This implies that the quadratic parameters are always updated exactly as their optimal estimates and that $\|\bar{\Theta}^{(k+1)} - \bar{\Theta}^{(k)}\| = \|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k),*}\|$. When this condition holds, then both $\|\bar{x}^{(k+1)} - \bar{x}^{(k)}\| \to 0$ and $\|\bar{\Theta}^{(k+1)} - \bar{\Theta}^{(k)}\| \to 0$. Theorem 8 states that there is an iteration number $K$ for which all iterations greater than $K$ satisfy the condition on $\gamma^{(k)}$. Theorem 9 provides the result that $\{(\bar{x}^{(k)}, \bar{\Theta}^{(k)})\} \to (\bar{x}^*, \bar{\Theta}^*)$; its proof shows that the sequence is Cauchy and converges to a fixed point.

Theorem 8 (Appendix A). For the HIP algorithm described above, there exists a $K \ge 0$ such that $\gamma^{(k)} = 1$ for all $k \ge K$.

Theorem 9 (Appendix A). For the HIP algorithm described above, if $\gamma^{(k)} = 1$ for all $k \ge K \ge 0$, then the sequence of solutions $\{(\bar{x}^{(k)}, \bar{\Theta}^{(k)})\}$ converges to a fixed point $(\bar{x}^*, \bar{\Theta}^*)$.

The optimality of the fixed point determined by the HIP algorithm is established in Theorem 10.

Theorem 10 (Appendix A). Let $\bar{w}^* = [(\bar{x}^*)^T, (\bar{\Theta}^*)^T]^T$ be the limit of the sequence $\{\bar{w}^{(k)}\} = \{[(\bar{x}^{(k)})^T, (\bar{\Theta}^{(k)})^T]^T\}$ produced by the HIP algorithm. Let the minimization problem be

minimize: $\bar{c}(\bar{\Theta})^T \bar{x}$

with respect to: $\bar{x},\ \bar{\Theta}$

subject to: $A\bar{x} = \bar{b}$, $\bar{x} \ge 0$  (33)

Then $\bar{w}^* = [(\bar{x}^*)^T, (\bar{\Theta}^*)^T]^T$ is a locally minimum solution to Eq. (33).

4.3. Computational properties
iterations to converge with p an associated pcomputational cost of O[(n3:5 þ n1:5 R2 þ nMR3 ) log2 ( n=e)] flops to converge to an e-neighbourhood of locally optimal solutions. In this case n ¼ NM. The two results are summarized below. Together, they indicate that the small-step variant is computationally efficient and that it will scale well as problem size increases. ¯ (k) )T ]T be Theorem 11 (Appendix A). Let w¯ (k) ¼ [(¯x(k) )T , (⌰ the kth approximate solution generated by the HIP algorithm. Assume that D(0) ⬍ 1=e. Assume that the cost vector is scaled by S such that g(k) ¼ 1 for all k ⱖ K and that 2[d(1 þ n) þ n](1 þ MK)D(K) ⬎ e. Given these assumptions, the HIP algorithm will converge to an e-neighbourhood of a ¯ ⴱ k ⬍ e) ¯ (k) ¹ w in locally optimal solution (kw p p O[ n log2 ( n=e)] iterations. The HIP algorithm’s overall computational cost is computed by determining the computational complexity of the individual steps and then multiplying by the algorithm’s convergence rate. The linear update executes two SSD steps, each employing a matrix inversion with a worst case complexity of O(n 3) where n ¼ MN. The quadratic update requires NR 2 multiplications resulting from the vector outer product and employs a matrix inversion with a computational complexity of O(R 3). Also, the quadratic update is done on M models so the quadratic update has a computational complexity of O(nR 2) or O(MR 3). These complexity estimates multiplied by the estimate of the number of iterations establish Theorem 12. Theorem 12 (Appendix A). The computational cost of the algorithm is O[(n3:5 þ n1:5 R2 þ pHIP p small-step 3 nMR ) log2 ( n=e)]. 5. Examples Two collections of experiments examine the HIP algorithms’ computational properties and use. The first set performs comparisons between the two HIP algorithms and the EM algorithm. Comparisons in Section 5.1 address the relative computational costs and error characteristics of the three algorithms. The second set of experiments demonstrates the two IP algorithms’ use in training networks to determine multiple, linear, set point models of a non-linear plant. Those experiments use data from a fossil fuel burning, electric power generating plant. 5.1. Mixture density parameter estimation The first collection of experiments compares the IP algorithms with the EM algorithm when used on mixture density parameter estimation problems. The three algorithms identify the means of a set of normal densities using sample vectors generated from the densities. The densities have means which correspond to the vertices of
a simplex in an $R$-dimensional space. Similar to the experiments in Redner and Walker (1984), the means are equispaced. These experiments assume $M = 2$ to 15 candidate densities. Sample sets ranging in size from $N = 300$ to $N = 750$ vectors are used as the training sets. All the experiments assume known and equal covariances, $\Sigma = I$.
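For reproducibility, a hedged sketch of this experimental setup follows; placing the means at scaled coordinate vectors gives equispaced, identity-covariance densities, though the exact construction used by the authors is not specified beyond the description above, so every detail here is an illustrative assumption.

```python
import numpy as np

def make_mixture_samples(M, R, N, spacing=4.0, seed=0):
    """Draw N samples from M unit-covariance Gaussians with equispaced means."""
    rng = np.random.default_rng(seed)
    means = spacing * np.eye(R)[:M]         # mutually equidistant means (needs M <= R)
    labels = rng.integers(0, M, size=N)     # equal mixing proportions assumed
    return means[labels] + rng.standard_normal((N, R)), means
```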
"
N (k) ¯ (k) ) 1 X q(m) P(¯zj lm, ⌰ ¯ (k) ) N j¼1 P(¯zj l⌰
# ¯ (k) ) q(m)(k þ 1) P(¯zj lm, ⌰ ¼ z¯ j ¯ (k) ) P(¯zj l⌰ j¼1 " # N ¯ (k) ) ¹ 1 X q(m)(k þ 1) P(¯zj lm, ⌰ ⫻ ¯ (k) ) P(¯zj l⌰ j¼1
(34)
N X
ð35Þ
for m ¼ 1,…,M, where ¯ (k) ) ¼ P(¯zj l⌰
M X
q(m)k1 exp( ¹ k1 k¯zj ¹ q¯ m k2 )
m¼1
with $k_1$ a normalizing constant and $q(m)$ being the marginal probability of observing a sample from the mth class. The following experiments assume a value of unity for $k_1$ in the exponential in the preceding equation. Different distances between the mixtures' means in the experiments help demonstrate the algorithms' convergence rates and costs for different relative variances.

The experiments use the EM algorithm above for comparison with the HIP algorithms. The EM algorithm solves a soft clustering problem, whereas the HIP algorithms solve a hard clustering problem. k-means clustering, a hard version of EM, may be used to solve the problem. The comparisons use the EM algorithm, however, since its implementation resembles the HIP algorithms.
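A compact numpy rendering of the EM updates in Eqs. (34) and (35) is given below; since the $q(m)^{(k+1)}$ factors in Eq. (35) cancel between numerator and denominator, the mean update reduces to posterior-weighted averaging. The listing is an illustrative sketch, not the code used in the experiments.

```python
import numpy as np

def em_step(Z, q, centres, k1=1.0):
    """Z: (N, R) samples; q: (M,) mixing weights; centres: (M, R) means."""
    # class likelihoods q(m) k1 exp(-k1 ||z_j - q_m||^2)
    d2 = ((Z[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)   # (N, M)
    lik = q[None, :] * k1 * np.exp(-k1 * d2)
    post = lik / lik.sum(axis=1, keepdims=True)    # responsibilities P(m | z_j)
    q_new = post.mean(axis=0)                      # Eq. (34)
    centres_new = (post.T @ Z) / post.sum(axis=0)[:, None]   # Eq. (35)
    return q_new, centres_new
```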
The first set of experiments makes a qualitative analysis of the three methods. This set assumed problem parameter values of $R = 9$, $N = 300$, $M = 2$ to 9, and a spacing of four units between the means. This spacing between the means corresponds to a well-separated set of densities whose parameters should be easy to identify. The algorithms were initialized with random initial conditions. In the runs, the algorithms executed until the solutions were within $10^{-5}$ of an optimal solution. Results from this set examine the convergence rate in iterations and the computational cost in flops.

Results for the first set of experiments appear in Figs. 5 and 6. These results serve as a means to rank the algorithms' performance relative to one another. Mean empirical iteration results, along with a ''best-fit'' characteristic, in Fig. 5 exhibit $O(n^{1.53})$, $O(n^{0.45})$, and constant iteration rates for the EM, small-step, and large-step algorithms respectively, where $n$ is the dimension of the linear subproblem. Mean computational cost results in Fig. 6 exhibit $O(n^{2.42})$, $O(n^{1.78})$, and $O(n^{1.53})$ flops cost rates for the EM, small-step, and large-step algorithms respectively. Whereas the IP variants exhibit lower convergence rates and cost rates than does EM, the plots demonstrate that the EM algorithm incurs lower cost for these well-separated problems. As the number of models increases, however, EM performance begins to approach the large-step algorithm's performance in terms of cost. The same cannot be said with respect to the small-step IP variant, which still has a much greater cost. These results allow us to eliminate the small-step variant from further consideration in our comparisons, as it always has much greater cost.

The next sets of experiments perform a more quantitative analysis of the large-step IP variant and the EM algorithm. These experiments assume $R = 19$; $N = 300$, 500, 750 samples; $M = 2$ to 15 mixtures; and a spacing of two units between the densities' means. This spacing corresponds to more closely spaced, harder to identify means.

Fig. 5. Algorithm iteration rates.

Fig. 7. Computational cost (N = 300).
Fig. 6. Algorithm computational costs.
The algorithms terminate in these runs when the solutions are within $5 \times 10^{-5}$ of an optimal solution. Computational cost results for the three cases appear in Figs. 7–9. These graphs display the mean computational costs for the two algorithms along with bounds indicating the maximum and minimum costs incurred for the various-sized mixtures. The cost rates for the EM and large-step algorithms are $O(n^{1.7})$ and $O(n^{1.45})$ flops respectively. The two algorithms are comparable, exhibiting very similar mean costs for all three cases. The EM algorithm generally exhibits slightly better cost, but the IP algorithm outperforms the EM algorithm at some points.

Analysis of the costs' standard deviations reveals two observations. First, Table 1 tabulates, for the two algorithms, the percentage of runs resulting in costs greater than the mean cost and the percentage of runs with costs within a standard deviation of the mean. The EM algorithm again performs slightly better on both counts, but the results are comparable. Thus, the algorithms' statistical performances are comparable. The second observation deals with the standard deviations themselves. Figs. 10–12 plot the standard deviations of the costs normalized by the mean cost for each number of mixtures for the three experiments. The results show that the
Fig. 8. Computational cost (N = 500).
IP algorithm consistently exhibits a smaller normalized standard deviation than does EM. This implies that the number of computations the IP algorithm requires varies less from run to run than it does for EM. In terms of algorithm operation, it appears that the IP algorithm is less susceptible than EM to the effects of initial conditions on the number of required computations.

Final cost results examine the worst case costs incurred by the two algorithms. Figs. 13–15 plot the maximum costs incurred by the two algorithms for the three test cases. In all three cases, the IP algorithm's worst case cost is generally better than the EM algorithm's worst case performance. Considering that EM has a cost with greater standard deviation than does the large-step method, this suggests that for these closely spaced problems the IP algorithm may be better in terms of cost incurred.

Fig. 9. Computational cost (N = 750).
Fig. 11. Normalized cost standard deviation (N = 500).
Table 1
Cost standard deviation results

Algorithm      N    Percentage of costs > mean   Percentage of costs within one standard deviation
EM             300  0.367                        0.767
Large-step IP  300  0.417                        0.717
EM             500  0.433                        0.742
Large-step IP  500  0.467                        0.700
EM             750  0.425                        0.750
Large-step IP  750  0.400                        0.700

Fig. 12. Normalized cost standard deviation (N = 750).
Fig. 10. Normalized cost standard deviation (N = 300).
Fig. 13. Worst case computational cost (N = 300).
Error analysis, similar to the preceding cost analysis, follows here. Mean-square parameter error results appear in Figs. 16–18. The plots depict the mean-square parameter error between the computed estimates and the actual values for each number of models. The errors are averaged over the number of runs at each point. The plots demonstrate that the EM algorithm always results in worse error than does the IP algorithm.
Examination of the errors’ standard deviation yields additional observations. Table 2 contains information similar to that contained in Table 1. It lists the percentages for the two algorithms of the number of runs resulting in errors greater than the mean error and the number of runs with errors that fall within a standard deviation of the mean error. This table suggests that the error distributions are comparable between both methods. Additional results in Figs. 19–21 plot the standard deviations of the errors normalized by the mean error for each case. The plots demonstrate that the EM algorithm has a
Fig. 14. Worst case computational cost (N = 500).

Fig. 17. Parameter error (N = 500).

Fig. 15. Worst case computational cost (N = 750).

Fig. 18. Parameter error (N = 750).
Fig. 16. Parameter error (N = 300).

Table 2
Error standard deviation results

Algorithm      N    Percentage of errors > mean   Percentage of errors within one standard deviation
EM             300  0.392                         0.658
Large-step IP  300  0.517                         0.667
EM             500  0.383                         0.725
Large-step IP  500  0.483                         0.667
EM             750  0.467                         0.692
Large-step IP  750  0.467                         0.683
greater variance in the errors it incurs in producing its estimates for all three trials. The larger errors occur because the EM algorithm has more difficulty in separating the samples from run to run. The IP method, however, separates the samples with a much smaller variance in its estimates.
Final error results examine the worst case errors incurred by the two algorithms. Figs. 22–24 plot the maximum errors incurred by the two algorithms. In all the cases, the IP algorithm’s worst case error is significantly better than the EM algorithm’s worst case error. This again supports the observation that the IP algorithm generalizes better than the EM algorithm on the test cases.
Fig. 19. Normalized error standard deviation (N = 300).

Fig. 22. Worst case error (N = 300).

Fig. 20. Normalized error standard deviation (N = 500).

Fig. 23. Worst case error (N = 500).
Fig. 21. Normalized error standard deviation (N = 750).
Fig. 24. Worst case error (N = 750).
5.2. Model parameter identification A final experiment examines the small-step HIP algorithm’s use on a multiple model system identification problem. Here, a modular network is trained to identify multiple linear, set point models of a non-linear system. The problem uses input/output data for the fuel flow process from a fossil fuel burning electric power generating plant (see Fig. 25). Two sets of data consisting of input/output pairs of the fuel flow process are available to train the
network: an $N = 300$ sample design set and an $N = 1438$ sample validation set. The experiment used the small-step HIP algorithm to train a modular network and recorded the computational properties of the HIP algorithm for $M = 2$ to 9 model units. Also recorded was the network's performance in approximating the fuel flow process. The experiment's results follow here. Fig. 26(a) shows that the empirical iteration rate is $O(n^{0.68})$ iterations, slightly worse than the $O(\sqrt{n})$ predicted rate. Fig. 26(b) shows that the $O(n^{1.9})$ flop empirical cost (open circles/
Fig. 25. Fuel flow input/output data.
Fig. 26. Fuel flow computational properties.
Fig. 27. Fuel flow results.
solid line) falls below the $O(n^{3.5})$ flop theoretical cost rate (dashed line). The difference between the theoretical and empirical rates in this and the previous experiments is attributed to the fact that sparse matrix techniques were used to implement the IP algorithms. Finally, Fig. 27 gives an indication of the resulting network's performance in approximating the fuel flow process. Fig. 27(a) plots the mean-square error of the approximation for hard and soft switching between models. The results show a decreasing and then slightly increasing trend with an optimum at $M = 4$ models. Fig. 27(b) displays the network's performance for an $M = 4$ model approximation. The figure demonstrates that the resulting network approximates the fuel flow process well.

6. Discussion

This paper used the HIP algorithm to train modular neural networks. The HIP procedure combines an IP LP algorithm with an NR iteration to produce an algorithm whose computational cost scales in a polynomial manner with problem size. In particular, this paper formally proves that the HIP algorithm converges to a fixed point that is a local optimum of the regularized training problem. This paper also shows how to configure the algorithm so that the number of iterations and total computational cost scale at the same rate as fast IP LP algorithms. The formal results obtained in this paper were validated through simulation experiments which compared large-step and small-step versions of the HIP algorithm against EM procedures. These results showed that, for the test suite of problems, large-step HIP algorithms exhibited a total computational cost that was comparable with that of the EM procedures, but that the resulting solutions obtained by the HIP algorithm had lower errors than the EM procedure's solutions. The results also showed that the HIP algorithm's total computational costs were relatively insensitive to a problem's initial conditions; something which was definitely not the case with the EM methods. Further experiments demonstrated the use of the HIP algorithm in identifying multiple predictors for non-linear dynamical plants.

Acknowledgements

The authors would like to acknowledge the partial financial support of the National Science Foundation (MSS9216559), the Electric Power Research Institute (RP8030-06), and the Army Research Office (DAAH04-95-1-0600, DAAH04-96-0134).
Appendix A

1. Proximity proofs

Proof of Corollary 2. From Gonzaga (1992), if $d(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)}) \le 0.5$ and $\alpha^{(k+1)} = \alpha^{(k)}(1 + \nu/\sqrt{n})$ with $\nu \in (0, 0.1]$, then $d(\bar{x}^{(k)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le 0.7$. If $d(\bar{x}^{(k)}, \alpha, \bar{\Theta}^{(k)}) < 1$, then one SSD step produces $\bar{x}_1^{(k+1)}$ from $\bar{x}^{(k)}$ where (Lemma 5.4 in Gonzaga (1992))

$$d(\bar{x}_1^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le d(\bar{x}^{(k)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)})^2 \le (0.7)^2$$

Application of another SSD step produces $\bar{x}_2^{(k+1)}$ from $\bar{x}_1^{(k+1)}$ with

$$d(\bar{x}_2^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le d(\bar{x}_1^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)})^2 \le (0.7)^{2^2}$$

Proceeding inductively, assuming that $d(\bar{x}_J^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le (0.7)^{2^J}$, then

$$d(\bar{x}_{J+1}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le d(\bar{x}_J^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)})^2 \le [(0.7)^{2^J}]^2 = (0.7)^{2^{J+1}} \qquad \square$$
Proof of Theorem 4. The proof must demonstrate that using $\gamma^{(k)}$ in Eq. (29) maintains the nearness of $\bar{x}^{(k+1)}$ to the central path after updating $\bar{\Theta}^{(k)}$. Let $\bar{h}(\bar{x}, \alpha, \bar{\Theta}) = P_{AX} X (\alpha \bar{c}(\bar{\Theta}) - \bar{x}^{-1})$ and define $\bar{h}_1$ and $\bar{h}_2$ as $\bar{h}_1 = \bar{h}(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)})$ and $\bar{h}_2 = \bar{h}(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k+1)})$. Using the triangle inequality, $\bar{h}_1$ and $\bar{h}_2$ are related as

$$\|\bar{h}_2\| \le \|\bar{h}_2 - \bar{h}_1\| + \|\bar{h}_1\|$$

Substituting $\bar{h} = P_{AX} X (\alpha \bar{c}(\bar{\Theta}) - \bar{x}^{-1})$ and $d_1$ for $\|\bar{h}_1\|$ produces

$$\|\bar{h}_2\| \le \|\alpha^{(k+1)} P_{AX^{(k+1)}} X^{(k+1)} (\bar{c}^{(k+1)} - \bar{c}^{(k)})\| + d_1$$

Applying the quadratic update and using the non-negativity and convexity of $\bar{c}(\bar{\Theta})$, $\bar{c}^{(k+1)} \le (1 - \gamma^{(k)}) \bar{c}^{(k)} + \gamma^{(k)} \bar{c}^{(k+1),*}$, results in

$$\|\bar{h}_2\| \le \|\alpha^{(k+1)} \gamma^{(k)} P_{AX^{(k+1)}} X^{(k+1)} (\bar{c}^{(k+1),*} - \bar{c}^{(k)})\| + d_1$$

Using $\|P_{AX^{(k)}} X^{(k)}\| \le D^{(k)}$ from Szymanski (1995) and $D^{(k)} \le (n + 0.5\sqrt{n})/\alpha^{(k)} < 2n/\alpha^{(k)}$, then

$$\|\bar{h}_2\| \le \gamma^{(k)} \alpha^{(k+1)} \|P_{AX^{(k+1)}} X^{(k+1)}\| \|\bar{c}^{(k+1),*} - \bar{c}^{(k)}\| + d_1 < \gamma^{(k)}\, 2n \|\bar{c}^{(k+1),*} - \bar{c}^{(k)}\| + d_1$$

Plugging in the value of $\gamma^{(k)}$ from Eq. (29) yields

$$\|\bar{h}_2\| < \frac{d_2 - d_1}{2n\|\bar{c}^{(k+1),*} - \bar{c}^{(k)}\|}\, 2n\|\bar{c}^{(k+1),*} - \bar{c}^{(k)}\| + d_1 = d_2 \qquad \square$$

Proof of Theorem 5. By assumption, $d(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)}) \le 0.5$. For the small-step algorithm, Corollary 2 states that two SSD steps produce $\bar{x}^{(k+1)}$ where $d(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le (0.7)^{2^2} = 0.2401$. For the large-step algorithm, Theorem 3 states that $O[n\mu^2/(1 + \mu^2)]$ steps produce $\bar{x}^{(k+1)}$ such that $d(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le 0.5$. Taking a constant two additional steps reduces $d(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)})$ below 0.2401. Thus, $O[n\mu^2/(1 + \mu^2)]$ SSD steps produce $\bar{x}^{(k+1)}$ such that $d(\bar{x}^{(k+1)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le 0.2401$. A valid $\gamma^{(k)}$ for
the quadratic update can be computed using Eq. (27) where ¯ ¹ update results in d 1 ¼ 0.2401 and d 2 ¼ 0.5, and the ⌰ (k þ 1) ¯ ⌰ such that d(¯x
(k þ 1)
, a
(k þ 1)
¯ , ⌰
(k þ 1)
) ⱕ 0:5
2. Asymptotic convergence proofs

The convergence analysis begins with a set of definitions. Let $Q = (\bar{Q}, \bar{\Theta})$ be the set of solutions produced by the algorithm. Let $Q(\bar{Q}) \subset Q$ be the set of all $\bar{Q}$ values which are solutions to the LP problem. Let $Q(\bar{\Theta}) \subset Q$ be the set of all solutions to the quadratic problem. Define $\mathrm{co}(Q)$ to be the convex hull of the set of points in $Q$. Let $g$ be defined as

$$g = \min\left\{ \frac{0.12995}{n\|\bar{c}(\bar{\Theta}_1) - \bar{c}(\bar{\Theta}_2)\|},\ 1 \right\}$$

Lemma 13. Let $\bar{\Theta}^{(k)}, \bar{\Theta}^{(k),*} \in \mathrm{co}(Q(\bar{\Theta}))$ be the current and minimizing parameter vectors at iteration $k$. Let $\gamma^{(k)} \in (0, 1]$ be defined as in Eq. (27) and let the parameter vectors be updated as

$$\bar{\Theta}^{(k+1)} = (1 - \gamma^{(k)})\bar{\Theta}^{(k)} + \gamma^{(k)}\bar{\Theta}^{(k+1),*} \tag{36}$$

If

$$\|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1),*}\| < \gamma^{(k)}\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\| \tag{37}$$

then

$$\|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1)}\| < \|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\| \tag{38}$$

Proof. Begin by adding and subtracting $\bar{\Theta}^{(k+1),*}$ to $\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1)}$ and then using the triangle inequality to break apart the norm:

$$\|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1)}\| \le \|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1),*}\| + \|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k+1)}\|$$

Using the expression for $\bar{\Theta}^{(k+1)}$ in Eq. (36) and then substituting Eq. (37) into the result produces the desired result:

$$\|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1)}\| \le \|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1),*}\| + (1 - \gamma^{(k)})\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\| < \gamma^{(k)}\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\| + (1 - \gamma^{(k)})\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\| = \|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\| \qquad \square$$
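A minimal numerical check of the lemma follows; it assumes only the update (36) and hypothesis (37), and the construction of $\bar{\Theta}^{(k+2),*}$ is synthetic, chosen purely so that (37) holds.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.6
theta_k      = rng.normal(size=4)          # Theta^(k)
theta_k1star = rng.normal(size=4)          # Theta^(k+1),*

# Eq. (36): convex combination of current and minimizing parameters.
theta_k1 = (1 - gamma) * theta_k + gamma * theta_k1star

# Build a Theta^(k+2),* obeying hypothesis (37) with a 0.9 safety margin.
d = rng.normal(size=4)
d *= 0.9 * gamma * np.linalg.norm(theta_k1star - theta_k) / np.linalg.norm(d)
theta_k2star = theta_k1star + d

lhs = np.linalg.norm(theta_k2star - theta_k1)    # left side of Eq. (38)
rhs = np.linalg.norm(theta_k1star - theta_k)     # right side of Eq. (38)
assert lhs < rhs, "Lemma 13 contraction should hold"
print(lhs, "<", rhs)
```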
Proof of Theorem 6. Starting with $(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)})$, updating $\alpha^{(k)}$, and performing two SSD steps as

$$\bar{x}_1 = \bar{x}^{(k)} - X^{(k)} P_{AX^{(k)}} X^{(k)} (\alpha^{(k+1)}\bar{c}^{(k)} - (\bar{x}^{(k)})^{-1})$$

$$\bar{x}^{(k+1)} = \bar{x}_1 - X_1 P_{AX_1} X_1 (\alpha^{(k+1)}\bar{c}^{(k)} - \bar{x}_1^{-1})$$

where $X^{(k)} = \mathrm{diag}(\bar{x}^{(k)}) \in \mathbb{R}^{n \times n}$ and $X_1 = \mathrm{diag}(\bar{x}_1) \in \mathbb{R}^{n \times n}$, results in $\bar{x}^{(k+1)}$. Let

$$\|\Delta Q\| = \|\bar{x}^{(k+1)} - \bar{x}^{(k)}\| \le \|\bar{x}^{(k+1)} - \bar{x}_1\| + \|\bar{x}_1 - \bar{x}^{(k)}\|$$

Examine the distance moved in the first update, $\|\bar{x}_1 - \bar{x}^{(k)}\|$. Use the facts that $\alpha^{(k+1)} = \alpha^{(k)}(1 + \nu/\sqrt{n})$ and that $P_{AX}$ is an orthogonal projection ($P_{AX} = P_{AX}^2$):

$$\|\bar{x}_1 - \bar{x}^{(k)}\| = \|X^{(k)} P_{AX^{(k)}} X^{(k)} (\alpha^{(k+1)}\bar{c}^{(k)} - (\bar{x}^{(k)})^{-1})\| = \|X^{(k)} P_{AX^{(k)}}^2 X^{(k)} [(\alpha^{(k)}\bar{c}^{(k)} - (\bar{x}^{(k)})^{-1}) + \alpha^{(k)}\nu\bar{c}^{(k)}/\sqrt{n}]\| \le \|X^{(k)} P_{AX^{(k)}}\| \, \|P_{AX^{(k)}} X^{(k)} [(\alpha^{(k)}\bar{c}^{(k)} - (\bar{x}^{(k)})^{-1}) + \alpha^{(k)}\nu\bar{c}^{(k)}/\sqrt{n}]\|$$

But $\|X^{(k)} P_{AX^{(k)}}\| = \|P_{AX^{(k)}} X^{(k)}\|$ and $\|P_{AX^{(k)}} X^{(k)}\| \le \Delta^{(k)}$ (Szymanski, 1995), so

$$\|\bar{x}_1 - \bar{x}^{(k)}\| \le \|P_{AX^{(k)}} X^{(k)} [(\alpha^{(k)}\bar{c}^{(k)} - (\bar{x}^{(k)})^{-1}) + \alpha^{(k)}\nu\bar{c}^{(k)}/\sqrt{n}]\| \, \Delta^{(k)}$$

Using Eq. (23), $\alpha^{(k)}(\nu/\sqrt{n})\|P_{AX^{(k)}} X^{(k)} \bar{c}^{(k)}\| \le \nu(\delta + 1)$ (Gonzaga, 1992), and letting $\delta = \delta(\bar{x}^{(k)}, \alpha^{(k)}, \bar{\Theta}^{(k)})$, the result becomes

$$\|\bar{x}_1 - \bar{x}^{(k)}\| \le [\delta + \nu(\delta + 1)]\Delta^{(k)} \le (\delta(1+\nu)+\nu)\Delta^{(k)}$$

Now examine the distance moved in the second update, $\|\bar{x}^{(k+1)} - \bar{x}_1\|$:

$$\|\bar{x}^{(k+1)} - \bar{x}_1\| = \|X_1 P_{AX_1} X_1 (\alpha^{(k+1)}\bar{c}^{(k)} - \bar{x}_1^{-1})\| = \|X_1 P_{AX_1}^2 X_1 (\alpha^{(k+1)}\bar{c}^{(k)} - \bar{x}_1^{-1})\| \le \|P_{AX_1} X_1 (\alpha^{(k+1)}\bar{c}^{(k)} - \bar{x}_1^{-1})\| \, \Delta^{(k)} = \delta(\bar{x}_1, \alpha^{(k+1)}, \bar{\Theta}^{(k)})\Delta^{(k)}$$

However, using Lemma 5.4 in Gonzaga (1992), it can be shown that $\delta(\bar{x}_1, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le \delta(\bar{x}^{(k)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)})^2$. Since $\delta(\bar{x}^{(k)}, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) = \|P_{AX^{(k)}} X^{(k)} (\alpha^{(k+1)}\bar{c}^{(k)} - (\bar{x}^{(k)})^{-1})\| \le \delta(1+\nu)+\nu$ from above, then $\delta(\bar{x}_1, \alpha^{(k+1)}, \bar{\Theta}^{(k)}) \le (\delta(1+\nu)+\nu)^2$. Thus

$$\|\bar{x}^{(k+1)} - \bar{x}_1\| \le (\delta(1+\nu)+\nu)^2\Delta^{(k)}$$

But $\delta(1+\nu)+\nu < 1$ and $(\delta(1+\nu)+\nu)^2 < \delta(1+\nu)+\nu$; therefore

$$\|\Delta Q\| \le \|\bar{x}^{(k+1)} - \bar{x}_1\| + \|\bar{x}_1 - \bar{x}^{(k)}\| \le (\delta(1+\nu)+\nu)^2\Delta^{(k)} + (\delta(1+\nu)+\nu)\Delta^{(k)} < 2(\delta(1+\nu)+\nu)\Delta^{(k)} \qquad \square$$
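The two SSD steps above can be sketched numerically. This is a toy instance, not the paper's implementation: we assume the standard null-space projector $P_{AX} = I - (AX)^T[(AX)(AX)^T]^{-1}(AX)$, and the problem data below are arbitrary.

```python
import numpy as np

def ssd_step(x, A, c, alpha):
    """One scaled steepest-descent (SSD) step as written in the proof:
    x_next = x - X P_AX X (alpha*c - x^{-1}), with X = diag(x)."""
    X = np.diag(x)
    AX = A @ X
    # Orthogonal projector onto the null space of AX (P_AX^2 = P_AX,
    # and A x is preserved because AX P_AX = 0).
    P = np.eye(len(x)) - AX.T @ np.linalg.solve(AX @ AX.T, AX)
    return x - X @ P @ X @ (alpha * c - 1.0 / x)

# Toy LP data: min c^T x  s.t.  A x = 1, x > 0 (values are arbitrary).
A = np.array([[1.0, 1.0, 1.0]])
c = np.array([1.0, 2.0, 3.0])
x = np.array([0.4, 0.3, 0.3])           # strictly feasible start
alpha, nu = 0.9, 0.1
alpha *= 1.0 + nu / np.sqrt(len(x))     # alpha^(k+1) = alpha^(k)(1 + nu/sqrt(n))
x1 = ssd_step(x, A, c, alpha)           # first SSD step  -> x_1
x2 = ssd_step(x1, A, c, alpha)          # second SSD step -> x^(k+1)
print(x2, A @ x2)                       # equality constraint is preserved
```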
Proof of Theorem 7. Let $\bar{\Theta} = \langle \bar{v}_1^T, \bar{v}_2^T, \ldots, \bar{v}_M^T\rangle^T$ and let $\bar{v}_m = \langle \bar{\phi}_m^T, \bar{q}_m^T\rangle^T$; then

$$\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k),*}\| \le \sum_{m=1}^{M} \|\bar{v}_m^{(k+1),*} - \bar{v}_m^{(k),*}\| \le \sum_{m=1}^{M} \left( \|\bar{\phi}_m^{(k+1),*} - \bar{\phi}_m^{(k),*}\| + \|\bar{q}_m^{(k+1),*} - \bar{q}_m^{(k),*}\| \right)$$

Let

$$\bar{w}_m = \sum_{j=1}^{N}\frac{1}{N} Q(m|j)\, y_j \bar{z}_j, \qquad \Delta\bar{w}_m = \sum_{j=1}^{N}\frac{1}{N}\Delta Q(m|j)\, y_j \bar{z}_j$$

where $\bar{w}_m, \Delta\bar{w}_m \in \mathbb{R}^R$. Using the definitions for $Q^{(k)}(m|j)$, $Q^{(k+1)}(m|j)$, $\bar{\phi}_m^*$ and $\bar{q}_m^*$, one finds that

$$\bar{\phi}_m^{(k+1),*} - \bar{\phi}_m^{(k),*} = (A_m + E_m)^{-1}(\bar{w}_m + \Delta\bar{w}_m) - A_m^{-1}\bar{w}_m$$

$$\bar{q}_m^{(k+1),*} - \bar{q}_m^{(k),*} = \frac{\sum_{j=1}^{N}\frac{1}{N}(Q(m|j) + \Delta Q(m|j))\bar{z}_j}{\sum_{j=1}^{N}\frac{1}{N}(Q(m|j) + \Delta Q(m|j))} - \frac{\sum_{j=1}^{N}\frac{1}{N}Q(m|j)\bar{z}_j}{\sum_{j=1}^{N}\frac{1}{N}Q(m|j)}$$

Grouping like terms, taking the norms of the differences, and using the triangle inequality,

$$\|\bar{\phi}_m^{(k+1),*} - \bar{\phi}_m^{(k),*}\| \le \|(A_m + E_m)^{-1} - A_m^{-1}\| \, \|\bar{w}_m\| + \|(A_m + E_m)^{-1}\| \, \|\Delta\bar{w}_m\|$$

$$\|\bar{q}_m^{(k+1),*} - \bar{q}_m^{(k),*}\| \le \zeta \left| \frac{1}{\sum_{j}\frac{1}{N}(Q(m|j) + \Delta Q(m|j))} - \frac{1}{\sum_{j}\frac{1}{N}Q(m|j)} \right| + \frac{\zeta\|\Delta Q\|}{\left|\sum_{j}\frac{1}{N}(Q(m|j) + \Delta Q(m|j))\right|}$$

The triangle inequality and the properties of norms can be used to show that $\|\Delta\bar{w}_m\| \le \zeta Y_{\max}\|\Delta Q\|$, $\|\bar{w}_m\| \le \zeta Y_{\max}$, and $\|\sum_{j=1}^{N}\frac{1}{N}Q(m|j)\bar{z}_j\| \le \zeta$. Thus

$$\|\bar{\phi}_m^{(k+1),*} - \bar{\phi}_m^{(k),*}\| \le \zeta Y_{\max}\|(A_m + E_m)^{-1} - A_m^{-1}\| + \zeta Y_{\max}\|\Delta Q\| \, \|(A_m + E_m)^{-1}\|$$

$A_m + E_m = \sum_{j=1}^{N}\frac{1}{N}Q^{(k+1)}(m|j)\bar{z}_j\bar{z}_j^T$ is full rank by assumption, $\|(A_m + E_m)^{-1}\| \le \mu_{\max}$, and $[\sum_{j=1}^{N}\frac{1}{N}Q^{(k+1)}(m|j)]^{-1} \le b_{\max}$, so

$$\|\bar{\phi}_m^{(k+1),*} - \bar{\phi}_m^{(k),*}\| \le \zeta Y_{\max}\|(A_m + E_m)^{-1} - A_m^{-1}\| + \mu_{\max}\zeta Y_{\max}\|\Delta Q\|$$

$$\|\bar{q}_m^{(k+1),*} - \bar{q}_m^{(k),*}\| \le \zeta \left| \frac{1}{\sum_{j}\frac{1}{N}(Q(m|j) + \Delta Q(m|j))} - \frac{1}{\sum_{j}\frac{1}{N}Q(m|j)} \right| + b_{\max}\zeta\|\Delta Q\|$$

Using Theorem 2.3.4 from Golub and Van Loan (1989),

$$\|(A_m + E_m)^{-1} - A_m^{-1}\| \le \|E_m\|\mu_{\max}^2/(1 - r_1)$$

and

$$\left| \frac{1}{\sum_{j}\frac{1}{N}(Q(m|j) + \Delta Q(m|j))} - \frac{1}{\sum_{j}\frac{1}{N}Q(m|j)} \right| \le b_{\max}^2\|\Delta Q\|/(1 - r_2)$$

where $\|E_m\| \le \zeta^2\|\Delta Q\|$; the bounds become

$$\|\bar{\phi}_m^{(k+1),*} - \bar{\phi}_m^{(k),*}\| \le \mu_{\max}^2\zeta^3 Y_{\max}\|\Delta Q\|/(1 - r_1) + \mu_{\max}\zeta Y_{\max}\|\Delta Q\|$$

$$\|\bar{q}_m^{(k+1),*} - \bar{q}_m^{(k),*}\| \le b_{\max}^2\zeta\|\Delta Q\|/(1 - r_2) + b_{\max}\zeta\|\Delta Q\|$$

Substituting these results into the expression for $\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k),*}\|$ produces $\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k),*}\| \le MK\|\Delta Q\|$, where $K = \mu_{\max}^2\zeta^3 Y_{\max}/(1 - r_1) + \mu_{\max}\zeta Y_{\max} + b_{\max}^2\zeta/(1 - r_2) + b_{\max}\zeta$ collects the constant factors above. $\square$
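The inverse-perturbation inequality borrowed from Golub and Van Loan (1989) can be spot-checked numerically. The sketch below uses the spectral norm and takes $r = \|A^{-1}E\|$, which may differ slightly from the book's exact parametrization; the matrices are arbitrary well-conditioned test data.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.eye(4) + 0.1 * rng.normal(size=(4, 4))   # well-conditioned matrix
E = 0.01 * rng.normal(size=(4, 4))              # small perturbation

Ainv = np.linalg.inv(A)
lhs = np.linalg.norm(np.linalg.inv(A + E) - Ainv, 2)
r = np.linalg.norm(Ainv @ E, 2)                 # must satisfy r < 1
rhs = np.linalg.norm(E, 2) * np.linalg.norm(Ainv, 2) ** 2 / (1 - r)
print(lhs <= rhs, lhs, rhs)                     # the bound holds
```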
Proof of Theorem 8. Let $A$ be the set

$$A = \left\{ (\bar{\Theta}_1, \bar{\Theta}_2) \in \mathrm{co}(Q(\bar{\Theta})) \times \mathrm{co}(Q(\bar{\Theta})) : \|\bar{c}(\bar{\Theta}_1) - \bar{c}(\bar{\Theta}_2)\| \ge \frac{0.12995}{n} \right\}$$

$A$ corresponds to all pairs of $\bar{\Theta}$ values in $\mathrm{co}(Q(\bar{\Theta})) \times \mathrm{co}(Q(\bar{\Theta}))$ that result in $g$ being less than or exactly equal to unity. Further define $e_{\bar{\Theta}}$ as

$$e_{\bar{\Theta}} = \min_{(\bar{\Theta}_1, \bar{\Theta}_2) \in A} g\|\bar{\Theta}_1 - \bar{\Theta}_2\| \tag{39}$$

$e_{\bar{\Theta}}$ corresponds to the minimum value of $g\|\bar{\Theta}_1 - \bar{\Theta}_2\|$ over all pairs of $\bar{\Theta}$ values which result in $g$ being less than or exactly equal to unity. If the set $A$ is empty, then $e_{\bar{\Theta}}$ is defined to be $\infty$, as all values of $(\bar{\Theta}_1, \bar{\Theta}_2) \in \mathrm{co}(\bar{\Theta}) \times \mathrm{co}(\bar{\Theta})$ for the problem are then such that $g^{(k)} = 1$. $e_{\bar{\Theta}}$ is bounded away from zero because $\mathrm{co}(\bar{\Theta})$ is a bounded, convex set, which means that there is a maximum $\|\bar{\Theta}_1 - \bar{\Theta}_2\|$ and $\|\bar{c}(\bar{\Theta}_1) - \bar{c}(\bar{\Theta}_2)\|$ for the set. Let $K_1$ be the smallest $k$ for which

$$\|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1),*}\| < 2(\delta(1+\nu)+\nu)MK\Delta^{(k)} < e_{\bar{\Theta}} \tag{40}$$

$K_1$ exists as $e_{\bar{\Theta}} > 0$ and $\Delta^{(k)} \to 0$ with increasing $k$. The inequality with $k = K_1$ signifies the iteration, $K_1$, at which $\|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1),*}\|$ is smaller than the minimum possible term, $g\|\bar{\Theta}_1 - \bar{\Theta}_2\|$. It is at $K_1$ that the sequence $\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\|$ becomes strictly decreasing. Examine $\gamma^{(k)}$ for the following two cases where $k \ge K_1$.

$\gamma^{(k)} < 1$: Eq. (40) implies that

$$2(\delta(1+\nu)+\nu)MK\Delta^{(k)} < \gamma^{(k)}\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\|$$

since $2(\delta(1+\nu)+\nu)MK\Delta^{(k)} < e_{\bar{\Theta}}$ and $e_{\bar{\Theta}} \le \gamma^{(k)}\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\|$ by its definition. Since $\|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1),*}\| < \gamma^{(k)}\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\|$ and $\gamma^{(k)} < 1$, then by Lemma 13, $\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\|$ is a strictly decreasing sequence. Let

$$\bar{A} = \left\{ (\bar{\Theta}_1, \bar{\Theta}_2) \in \mathrm{co}(Q(\bar{\Theta})) \times \mathrm{co}(Q(\bar{\Theta})) : \|\bar{c}(\bar{\Theta}_1) - \bar{c}(\bar{\Theta}_2)\| = \frac{0.12995}{n} \right\}$$

be the set of all pairs of $\bar{\Theta}$ values in $\mathrm{co}(Q(\bar{\Theta})) \times \mathrm{co}(Q(\bar{\Theta}))$ that result in $g^{(k)}$ being exactly equal to unity. Define the term $a$ as

$$a = \min_{(\bar{\Theta}_1, \bar{\Theta}_2) \in \bar{A}} \|\bar{\Theta}_1 - \bar{\Theta}_2\|$$

$a$ is the minimum separation between parameter vectors in $\mathrm{co}(Q(\bar{\Theta}))$ that results in $g^{(k)}$ being exactly equal to one, and $a$ is greater than or equal to $e_{\bar{\Theta}}$. Since $\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\|$ is strictly decreasing to zero (it is bounded above by $\Delta^{(k)}$, which decreases to zero), it eventually decreases past $a$, at which point $\gamma^{(k)} = 1$. Let the iteration at which $\|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\| < a$ be $k = K_2$.

$\gamma^{(k)} = 1$: the first case showed that $\gamma^{(k)} \to 1$ for $k \ge K_1$. This second case demonstrates that $\gamma^{(k)} = 1$ for all $k \ge K_2 \ge K_1$. Let $k \ge K_2$ and $\gamma^{(k)} = 1$. Assume that $\gamma^{(k+1)} < 1$. Now $\gamma^{(k)} = 1$ implies that $\bar{\Theta}^{(k+1)} = (1 - \gamma^{(k)})\bar{\Theta}^{(k)} + \gamma^{(k)}\bar{\Theta}^{(k+1),*} = \bar{\Theta}^{(k+1),*}$, and $\gamma^{(k+1)} < 1$ implies that $\|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1)}\| = \|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1),*}\| > a$. But by assumption, $\|\bar{\Theta}^{(k+2),*} - \bar{\Theta}^{(k+1),*}\| < e_{\bar{\Theta}}$, and $e_{\bar{\Theta}} \le a$ by definition. This is a contradiction, so the assumption is incorrect and $\gamma^{(k+1)} = 1$.

Thus, for $k \ge K = K_2 \ge K_1 \ge 0$, $\gamma^{(k)} = 1$. $\square$
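For intuition, the switching behaviour of $g$ (and hence of $\gamma^{(k)}$ once the minimizers draw close) can be reproduced directly from its definition. The function below is our paraphrase of the definition of $g$, with arbitrary test vectors.

```python
import numpy as np

def g_of(c1, c2, n):
    """g = min{0.12995 / (n ||c(Theta1) - c(Theta2)||), 1}.
    g == 1 exactly when ||c1 - c2|| <= 0.12995 / n."""
    gap = np.linalg.norm(c1 - c2)
    return 1.0 if gap == 0.0 else min(0.12995 / (n * gap), 1.0)

n = 4
c = np.array([1.0, 0.5, 0.2, 0.1])
print(g_of(c, c + 0.5, n))     # cost vectors far apart -> g < 1
print(g_of(c, c + 1e-3, n))    # cost vectors close     -> g == 1
```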
Proof of Theorem 9. Let the $k$th solution be $\bar{w}^{(k)} = \langle (\bar{x}^{(k)})^T, (\bar{\Theta}^{(k)})^T\rangle^T$ and define the term $\bar{w}^* = \langle (\bar{x}^*)^T, (\bar{\Theta}^*)^T\rangle^T$. Begin by examining $\|\bar{w}^{(k+1)} - \bar{w}^{(k)}\|$ for $k \ge K$:

$$\|\bar{w}^{(k+1)} - \bar{w}^{(k)}\| \le \|\bar{x}^{(k+1)} - \bar{x}^{(k)}\| + \|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k)}\| \le \|\bar{x}^{(k+1)} - \bar{x}^{(k)}\| + \|\bar{\Theta}^{(k+1),*} - \bar{\Theta}^{(k),*}\| < 2(\delta(1+\nu)+\nu)\Delta^{(k)} + MK \cdot 2(\delta(1+\nu)+\nu)\Delta^{(k)} = 2(\delta(1+\nu)+\nu)(1 + MK)\Delta^{(k)}$$

Examine the following sum for all $j \ge k \ge K$:

$$\sum_{j=k}^{\infty} \|\bar{w}^{(j+1)} - \bar{w}^{(j)}\| < \sum_{j=k}^{\infty} 2(\delta(1+\nu)+\nu)(1 + MK)\Delta^{(j)} \tag{41}$$

The duality gap $\Delta^{(k)}$ can be expressed as $\Delta^{(k)} = \beta^k\Delta^{(0)}$, where $\beta = 1/(1 + \nu/\sqrt{n})$ (Gonzaga, 1992). Using this fact, Eq. (41) can be simplified as

$$\sum_{j=k}^{\infty} \|\bar{w}^{(j+1)} - \bar{w}^{(j)}\| < \sum_{j=k}^{\infty} 2(\delta(1+\nu)+\nu)(1 + MK)\beta^j\Delta^{(0)} = 2(\delta(1+\nu)+\nu)(1 + MK)\frac{\Delta^{(k)}}{1 - \beta}$$

This result corresponds to a constant, $2(\delta(1+\nu)+\nu)(1 + MK)/(1 - \beta)$, multiplied by the strictly decreasing duality gap $\Delta^{(k)}$. Let $k$ be chosen so that $\sum_{j=k}^{\infty}\|\bar{w}^{(j+1)} - \bar{w}^{(j)}\| < \epsilon$ for $\epsilon > 0$ (i.e. let the duality gap be small enough so that the condition holds). If $k$ is so chosen, then $\forall i, j \ge k$, $\|\bar{w}^{(i)} - \bar{w}^{(j)}\| < \epsilon$ for arbitrary $\epsilon$. Thus, $\{\bar{w}^{(k)}\}$ is a Cauchy sequence, and it converges to a fixed point $\bar{w}^* = \langle (\bar{x}^*)^T, (\bar{\Theta}^*)^T\rangle^T$. $\square$
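The geometric-sum step in the proof is easy to verify numerically. The sketch below assumes only $\Delta^{(j)} = \beta^j\Delta^{(0)}$ with $\beta = 1/(1 + \nu/\sqrt{n})$; the values of $n$, $\nu$, $\Delta^{(0)}$, and $k$ are arbitrary.

```python
import numpy as np

n, nu = 25, 0.1
beta = 1.0 / (1.0 + nu / np.sqrt(n))       # duality-gap contraction factor
D0, k = 1.0, 40
D = D0 * beta ** np.arange(k, k + 5000)    # Delta^(j) = beta^j Delta^(0), j >= k

tail = D.sum()                             # truncated tail sum of step lengths
bound = D0 * beta ** k / (1.0 - beta)      # closed form: Delta^(k) / (1 - beta)
print(tail, "<=", bound)
```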
Proof of Theorem 10. Begin by assuming that $\bar{w}^* = \langle (\bar{x}^*)^T, (\bar{\Theta}^*)^T\rangle^T$ is not a local minimum. Then there exists a feasible direction $\Delta\bar{w} = \langle \Delta\bar{x}^T, \Delta\bar{\Theta}^T\rangle^T$ from $\bar{w}^*$ in the solution space such that

$$f(\bar{w}^* + \lambda\Delta\bar{w}) \le f(\bar{w}^*) \tag{42}$$

for some $\lambda \in (0, \infty)$ and

$$f(\bar{w}^* + \xi\Delta\bar{w}) < f(\bar{w}^*) \tag{43}$$

for some $\xi \in [0, \lambda)$, where $f(\bar{w})$ is the cost function. Let $W$ be the set of all feasible directions at $\bar{w}^*$. Let $\Delta\bar{w} \in W$ where $\Delta\bar{w} = \langle \Delta\bar{x}^T, \Delta\bar{\Theta}^T\rangle^T$, and let $\bar{w}_1 = \bar{w}^* + \lambda\Delta\bar{w} = \langle \bar{x}_1^T, \bar{\Theta}_1^T\rangle^T$. Two mutually exclusive cases can occur: $\bar{x}^*$ can be a unique optimal solution to Eq. (33) when $\bar{\Theta}$ is held constant, or $\bar{x}^*$ can be a non-unique optimal solution. The first case occurs when $\bar{c}(\bar{\Theta})$ supports the polyhedral set of feasible solutions only at the vertex $\bar{x}^*$. The second occurs when $\bar{c}(\bar{\Theta})$ supports the set at vertex $\bar{x}^*$ and is also parallel to one of the faces of the polyhedral set which also contains $\bar{x}^*$.

1. $\bar{x}^*$ is a unique optimal solution. $\bar{x}^*$ being optimal implies that for all feasible directions $\Delta\bar{x}$ from $\bar{x}^*$ (Bazaraa et al., 1993)

$$\bar{c}(\bar{\Theta}^*)^T\Delta\bar{x} > 0$$

Moving in direction $\Delta\bar{w} = \langle \Delta\bar{x}^T, \Delta\bar{\Theta}^T\rangle^T$ changes $\bar{\Theta}^*$ as $\bar{\Theta}_1 = \bar{\Theta}^* + \lambda\Delta\bar{\Theta}$. This rotates the cost vector as $\bar{c}(\bar{\Theta}_1) = \bar{c}(\bar{\Theta}^*) + \lambda\hat{c}(\Delta\bar{\Theta})$, where $\hat{c}(\Delta\bar{\Theta})$ is a term that denotes the change in the cost vector due to $\Delta\bar{\Theta}$. If $\lambda$ is small enough, then $\lambda\hat{c}(\Delta\bar{\Theta})^T\Delta\bar{x} < \bar{c}(\bar{\Theta}^*)^T\Delta\bar{x}$ and $\bar{c}(\bar{\Theta}_1)^T\Delta\bar{x} > 0$ for all feasible $\Delta\bar{x}$ from $\bar{x}^*$. Thus, $\bar{x}^*$ is still the optimal solution to the LP problem associated with Eq. (33). If $\bar{x}^*$ is the optimal solution, then

$$\bar{c}(\bar{\Theta}_1)^T\bar{x}^* < \bar{c}(\bar{\Theta}_1)^T\bar{x}_1$$

The minimizing solution to the quadratic optimization of $\bar{c}(\bar{\Theta})^T\bar{x}^*$ with respect to $\bar{\Theta}$ is $\bar{\Theta}^*$. Thus

$$\bar{c}(\bar{\Theta}^*)^T\bar{x}^* < \bar{c}(\bar{\Theta}_1)^T\bar{x}^* < \bar{c}(\bar{\Theta}_1)^T\bar{x}_1$$

For the feasible direction $\Delta\bar{w} = \langle \Delta\bar{x}^T, \Delta\bar{\Theta}^T\rangle^T$, the cost of the solution $\bar{w}^* = \langle (\bar{x}^*)^T, (\bar{\Theta}^*)^T\rangle^T$ is therefore less than the cost of the solution $\bar{w}_1 = \langle \bar{x}_1^T, \bar{\Theta}_1^T\rangle^T$. Since $\Delta\bar{w}$ was an arbitrary feasible direction from $\bar{w}^*$, there is no feasible direction that satisfies Eq. (42) and Eq. (43), which is a contradiction. Thus, the assumption is incorrect and $\bar{w}^* = \langle (\bar{x}^*)^T, (\bar{\Theta}^*)^T\rangle^T$ is a local minimum.

2. $\bar{x}^*$ is a non-unique optimal solution to the LP problem. $\bar{x}^*$ being non-unique implies that for all feasible directions $\Delta\bar{x}$ from $\bar{x}^*$

$$\bar{c}(\bar{\Theta}^*)^T\Delta\bar{x} \ge 0$$

This can be broken down into two cases: those directions for which $\bar{c}(\bar{\Theta}^*)^T\Delta\bar{x} > 0$ and those for which $\bar{c}(\bar{\Theta}^*)^T\Delta\bar{x} = 0$. The first considers those directions in which movement causes a strict increase in the cost function, while the second considers those directions in which movement causes no change in the cost function.

(a) Directions for which $\bar{c}(\bar{\Theta}^*)^T\Delta\bar{x} > 0$. For movement in these directions, $\lambda$ can be chosen small enough that $\lambda\hat{c}(\Delta\bar{\Theta})^T\Delta\bar{x} < \bar{c}(\bar{\Theta}^*)^T\Delta\bar{x}$, and the reasoning of case 1 applies, which generates the contradiction.

(b) Directions for which $\bar{c}(\bar{\Theta}^*)^T\Delta\bar{x} = 0$. Here, the cost hyperplane $\bar{c}(\bar{\Theta}^*)$ is parallel to one of the constraint planes $\bar{a}_i$, where $\bar{a}_i$ is a row of $A$. This implies a set of optimal solutions $\{\bar{x} : \bar{x} = b\bar{x}^* + (1 - b)\bar{x}_1^*,\ b \in [0, 1]\}$, where $\bar{x}^*$ and $\bar{x}_1^*$ are extreme points of the polyhedral set of feasible solutions and $\Delta\bar{x}_1 = (\bar{x}_1^* - \bar{x}^*)/\|\bar{x}_1^* - \bar{x}^*\|$. Thus, $\Delta\bar{x}_1$ is a feasible direction from $\bar{x}^*$ to $\bar{x}_1^*$ and $-\Delta\bar{x}_1$ is a feasible direction from $\bar{x}_1^*$ to $\bar{x}^*$. Since $\bar{x}_1^*$ is also an optimal solution, $\bar{c}(\bar{\Theta}^*)^T\Delta\bar{x} \ge 0$ for all feasible directions from $\bar{x}_1^*$, and $\bar{c}(\bar{\Theta}^*)^T(-\Delta\bar{x}_1) = 0$ for the feasible direction $-\Delta\bar{x}_1$ leading from $\bar{x}_1^*$ to $\bar{x}^*$. Without loss of generality, assume $\Delta\bar{x}_1$ is such that

$$\lambda\hat{c}(\Delta\bar{\Theta})^T\Delta\bar{x}_1 > 0$$

which implies that

$$\bar{c}(\bar{\Theta}_1)^T\Delta\bar{x}_1 = \bar{c}(\bar{\Theta}^*)^T\Delta\bar{x}_1 + \lambda\hat{c}(\Delta\bar{\Theta})^T\Delta\bar{x}_1 > 0$$

With this small rotation of the cost plane, $\bar{x}^*$ is the unique optimal solution to the LP problem. Thus, $\bar{c}(\bar{\Theta}_1)^T\bar{x}^* < \bar{c}(\bar{\Theta}_1)^T\bar{x}_1$. Since $\bar{\Theta}^*$ is the optimal quadratic minimizer for Eq. (33) subject to constant $\bar{x} = \bar{x}^*$, then $\bar{c}(\bar{\Theta}^*)^T\bar{x}^* < \bar{c}(\bar{\Theta}_1)^T\bar{x}^* < \bar{c}(\bar{\Theta}_1)^T\bar{x}_1$. This again leads to a contradiction, which implies that the solution $\bar{w}^*$ is a local minimum. $\square$

3. Computational property proofs
Proof of Theorem 11. First, the assumption that $S$ is chosen as indicated guarantees that $\{\bar{w}^{(k)}\}$ is a Cauchy sequence for $k \ge K$. Since $\{\bar{w}^{(k)}\}$ is Cauchy, the norm of the difference between the current and optimal solutions can be expressed in closed form as

$$\|\bar{w}^{(k)} - \bar{w}^*\| < 2(\delta(1+\nu)+\nu)(1 + MK)\frac{\Delta^{(k)}}{1 - \beta} \tag{44}$$

Using Eq. (44), examine how long it will take for $\|\bar{w}^{(k)} - \bar{w}^*\| < \epsilon$. The goal is to find the smallest $k$ that violates

$$\epsilon < 2(\delta(1+\nu)+\nu)(1 + MK)\frac{\Delta^{(k)}}{1 - \beta}$$

By the definition of the duality gap, $\Delta^{(k)} = \beta^k\Delta^{(0)} < \beta^k(1/\epsilon)$. So

$$\epsilon < 2(\delta(1+\nu)+\nu)(1 + MK)\frac{\beta^k(1/\epsilon)}{1 - \beta}$$

Taking the base-2 logarithm,

$$\log_2\epsilon < \log_2 2(\delta(1+\nu)+\nu)(1 + MK) + \log_2(1/\epsilon) + \log_2\frac{1}{1 - \beta} + k\log_2\beta$$

After some arithmetic, one finds that

$$k < \left[ 2\log_2(1/\epsilon) + \log_2 2(\delta(1+\nu)+\nu)(1 + MK) + \log_2\frac{1}{1 - \beta} \right] \Big/ \log_2\frac{1}{\beta}$$

But $\log_2(1 + m) \ge m/(1 + m)$ (Gonzaga, 1992), and using this expression for $\log_2(1/\beta)$ results in

$$k < \left[ 2\log_2(1/\epsilon) + \log_2 2(\delta(1+\nu)+\nu)(1 + MK) + \log_2\frac{1}{1 - \beta} \right](\sqrt{n}/\nu + 1)$$

Also, $\log_2[1/(1 - \beta)] = \log_2(\sqrt{n}/\nu + 1)$, so

$$k < \left[ 2\log_2(1/\epsilon) + \log_2 2(\delta(1+\nu)+\nu)(1 + MK) + \log_2(\sqrt{n}/\nu + 1) \right](\sqrt{n}/\nu + 1)$$

Since $2(\delta(1+\nu)+\nu)(1 + MK)$ and $\nu$ are constants, $k = O[\sqrt{n}\log_2(\sqrt{n}/\epsilon)]$ iterations. $\square$
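The closing inequality can be evaluated directly to see the $O[\sqrt{n}\log_2(\sqrt{n}/\epsilon)]$ growth. In the sketch below, $\delta$, $M$, and $K$ are placeholder constants, since their problem-specific values are set elsewhere in the paper.

```python
import numpy as np

def iteration_bound(n, nu, eps, delta=0.5, M=4, K=1.0):
    """Upper bound on the iterations needed for ||w - w*|| < eps,
    following the closing inequality of the Theorem 11 proof."""
    const = 2 * (delta * (1 + nu) + nu) * (1 + M * K)
    s = np.sqrt(n) / nu + 1.0               # the (sqrt(n)/nu + 1) factor
    return (2 * np.log2(1 / eps) + np.log2(const) + np.log2(s)) * s

# The bound grows like sqrt(n) * log2(sqrt(n)/eps):
for n in (10, 100, 1000):
    print(n, iteration_bound(n, nu=0.1, eps=1e-3))
```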
References

Bazaraa, M.S., Sherali, H.D., & Shetty, C.M. (1993). Nonlinear programming: theory and algorithms. New York: John Wiley and Sons.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 304–314.

Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Golub, G.H., & Van Loan, C.F. (1989). Matrix computations. Baltimore, MD: The Johns Hopkins University Press.

Gonzaga, C.C. (1992). Path-following methods for linear programming. SIAM Review, 34, 167–224.

Hadamard, J. (1964). La théorie des équations aux dérivées partielles. Beijing: Éditions Scientifiques.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.

Jacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.

Jordan, M.I., & Xu, L. (1993). Convergence results for the EM approach to mixtures of experts architectures. Technical Report 9303, MIT Computational Cognitive Science, September 1993.

Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. Combinatorica, 4, 373–395.

Lemmon, M.D., & Szymanski, P.T. (1994). Interior point implementations of alternating minimization training. In G. Tesauro, D.S. Touretzky, & T.K. Leen (Eds.), Advances in Neural Information Processing Systems (vol. 7, pp. 574–582). Cambridge, MA: MIT Press.

Park, J., & Sandberg, I.W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3, 246–257.

Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1497.

Redner, R.A., & Walker, H.F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195–239.

Szymanski, P.T. (1995). Alternating minimization of non-convex constrained optimizations via interior point techniques. Ph.D. thesis, Department of Electrical Engineering, University of Notre Dame, November 1995.